This afternoon we are going to get acquainted with R. For your own computer, the program R can be obtained from http://r-project.org. On the UU computers it is already available under ‘standard applications’ –> ‘R for windows’.
If you have any questions or suggestions, feel free to ask/suggest.
We start with the very basics and will move towards more advanced operations in R. First we will get acquainted with the language and the environment we work in. After that we will work on making vectors and matrices, dataframes and lists, and we will spend some time on exploring functions in R. And finally, we combine all the techniques we learned to make our own dataset and do a basic t-test and regression analysis on it in R.
Open R.
The following window will appear.
You now see the R GUI, which contains the R console in which we will execute all code. First, explore the GUI a bit: look around to see what kind of options you can find (e.g., under File and so on).
Second, when you look at the R console after opening R, the first thing you will always see is information about the R version you are working in. Also note that the startup message mentions a handy function for citing R in your papers. Try typing in the following and running it by pressing Enter:
citation()
The startup message also mentions some functions that provide R demos and help. Let's leave those for now, and get started with our own R script.
Open a new R script.
You can find the option to open a new R script under the File menu. A new pane opens, and we can start writing our code.
It is preferable to work from an R script instead of working directly in the console, because it is much easier to see and look back on exactly what you have done. Furthermore, if you save your script, you and others are able to reproduce your work exactly, which is of course key for anyone working in science.
Type the following in your R script:
# Exercise 1
a <- 100
The # tells R that everything that follows on that specific line is not to be considered as code. In other words, you can use # to comment in your own R scripts.
The line a <- 100 assigns the value 100 to object a. The value 100 and the letter a are chosen to illustrate assigning in R. You might as well assign 123 to banana if you like. Really, anything goes.
Once you run these lines, your code is executed and appears in the console. The object a and its content are now stored in the global environment. As such, when you type a in the console, R will return the assigned value. Try it.
When you run code, all the objects you named will be stored in the “global environment”, basically the virtual space that you work in during your R session. If you quit your current R session (by closing R) without saving and reloading your global environment, the next time you start R the global environment will be empty again. You then start your R session with a ‘clean slate’, as it were. (Unless you automatically load in previous R Workspaces, that is!)
Note: The shortcut Ctrl-R, Ctrl-Enter or Cmd-Enter is your friend: it runs the current selection, or, if nothing is selected, the current line. If Ctrl-Enter or Cmd-Enter yields no result, you have probably selected the console pane instead of your R script. You can switch to the code pane by moving the mouse cursor and clicking on the desired line in the code pane, or through Ctrl-1 (Windows/Linux/Mac). Alternatively, you can move to the console through Ctrl-2 (Windows/Linux/Mac).
Your working directory is the base directory for your current R session. When you want to save something in R, R first automatically directs you to your working directory. Find out what your current working directory is by typing:
getwd()
## [1] "/home/noemi/Werk/Onderwijs/Noemi R cursus/practical1"
Now, set your working directory to a folder you created, named “Practicals”. You can set your working directory using the GUI, by clicking File…Set…working directory. Alternatively, you can specify it in the following way in the R console, using the function setwd(). Setting your working directory will be convenient later, when you want to save your workspace, R scripts, or other data from R.
setwd("c:/documents/Rcodes/Practical1") # Be sure to specify the correct file path inside setwd for your PC
Save your script as Practical1.R in the folder you named Practicals.
You can use the standard Ctrl-s (Windows/Linux) or Cmd-s (Mac) or use the File menu in the GUI. It is a good idea to regularly save your R script throughout this practical.
Clear the global environment using rm(list = ls()).
At some point, you may want to clear everything out of your global environment. This can be done by running the following command that removes all objects from your global environment:
rm(list = ls())
If you now inspect the content of the global environment by running ls() you will see it is empty.
Note: it is also possible to remove specific objects by running rm(objectname). For example, run the following in the console:
a<-100
b<-2
ls() #Inspect global environment
## [1] "a" "b"
rm(b) #Here we remove object b.
ls() #Inspect global environment again. Only object a is left.
## [1] "a"
Clear the console by pressing Ctrl-L. This removes all the code you ran from the console, providing you with a clean view.
Pressing Ctrl-L only clears the display of your console: all the objects you have named will still be in the global environment, and you can still call them. For example, try calling object a once more.
"up arrow key"
By pressing repeatedly on the arrow key that points up, you will browse through all the code you ran previously. By pressing enter, you will run that piece of code again.
This is very convenient, especially if you made an error in the line of code you ran: you can simply press up without typing the line of code again, correct the mistake, and press enter to run the corrected code.
#Exercise 2
vec1 <- c(1,2,3,4,5,6)
vec2 <- 7:10
vec3 <- c("A", "B", "C", "D", "E", "F")
vec4 <- c("10","20","30","40","50","60")
To create a vector we use the function c(), which stands for “concatenation”. A vector contains a series of either numbers (numerics) or characters. Characters are always surrounded by quotation marks. When you use words or letters without the quotation marks, R will think that you mean the name of an object (whether you have already specified that object or not). To see this, type in and run word, and "word", and compare the results.
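The difference between a quoted string and an unquoted name can be sketched as follows. This is a minimal illustration that assumes no object called word exists yet in your global environment; the object msg and the use of tryCatch() are only there so the script keeps running after the error.

```r
# Hypothetical example: a quoted string versus an unquoted object name.
"word"                       # a character value; R simply prints it

# Unquoted, R looks for an object called word. tryCatch() captures the
# error message so that the script keeps running.
msg <- tryCatch(word, error = function(e) conditionMessage(e))
msg

word <- "now I exist"        # after assignment the name refers to an object
word
```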
We can use the function mode() to check what the mode is of the data in any object. The mode of the data matters, because for certain types of data you can’t use certain functions. For instance, you can calculate the mean for vec1 and vec2 because they have mode numeric, but not for vec3 and vec4 because they have mode character. Try it yourself:
vec1
vec2
vec3
vec4
mode(vec1)
mode(vec2)
mode(vec3)
mode(vec4)
mean(vec1)
mean(vec2)
mean(vec3)
mean(vec4)
We can turn vectors with numbers in quotation marks into numeric vectors, or numbers into characters as follows:
as.numeric(vec4)
as.character(vec1)
However, we cannot use the same trick to change letters or words to numerics:
as.numeric(vec3)
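What happens when the conversion fails can be sketched with two small vectors. The names letters_vec, digits_vec and converted are hypothetical and not part of the exercise; the warning is suppressed here only to keep the output tidy.

```r
letters_vec <- c("A", "B", "C")      # hypothetical character vector
digits_vec  <- c("10", "20", "30")   # digits stored as characters

as.numeric(digits_vec)               # converts cleanly to 10 20 30

# Letters have no numeric interpretation: the result is NA for each
# element, accompanied by a warning (suppressed here).
converted <- suppressWarnings(as.numeric(letters_vec))
converted
is.na(converted)
```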
Combine vec1 and vec2.
Besides creating new vectors from data you type in, you can also use c() to concatenate multiple vectors. Try it by running the following code from your R script:
vec5 <- c(vec1,vec2)
vec5
## [1] 1 2 3 4 5 6 7 8 9 10
Select specific elements from vec5 using square brackets []:
vec5[4]
vec5[6:10]
vec5[c(2,5)]
vec5[-3]
vec5[vec5>5]
What elements are selected with each command?
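If the different kinds of index are unclear, the same ideas can be tried on a separate vector. The sketch below uses a hypothetical vector x so that the exercise on vec5 is left for you to work out.

```r
x <- c(10, 20, 30, 40, 50)   # hypothetical vector, not part of the exercise

x[2]          # a single position
x[2:4]        # a range of positions
x[c(1, 5)]    # several positions at once
x[-1]         # a negative index drops that position
x[x > 25]     # a logical condition keeps the elements where it is TRUE
x > 25        # the logical vector that does the selecting
```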
Make a matrix out of vec5, and a matrix out of vec3 by running the following code:
mat1 <- matrix(vec5,5,2)
mat2 <- matrix(vec3,2,3)
mat2 <- matrix(vec3,2,3, byrow=TRUE)
Inspect each matrix by calling it in the console. What is the difference between mat2 we made first, and mat2 we made second (note that the second mat2 has now overwritten the first)?
To create a matrix we used matrix(). For a matrix we need to specify the dimensions (in this case 5 rows and 2 columns for mat1, and 2 rows and 3 columns for mat2) and the input (in this case vec5 or vec3) needs to match these dimensions.
Note: An array would be a stack of matrices (example code: array(mat, dim = c(3, 2, 2))), but arrays go beyond the scope of this course.
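The effect of the byrow argument is easiest to see with numbers. The following is a small sketch using hypothetical objects vals, by_col and by_row, with the arguments named explicitly for readability.

```r
vals <- 1:6                                  # hypothetical input values

by_col <- matrix(vals, nrow = 2, ncol = 3)   # default: filled column by column
by_row <- matrix(vals, nrow = 2, ncol = 3, byrow = TRUE)

by_col
by_row
dim(by_col)                                  # number of rows, number of columns
```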
Select elements from mat1.
Use the square brackets to select certain rows, columns, or elements from mat1:
mat1[,1]
mat1[2,]
mat1[3,2]
mat1[c(1,3),]
mat1[-3,2]
What elements are selected?
Make a matrix from both vec1 and vec3 with 6 rows and 2 columns. Inspect this matrix, and the mode of this matrix. What has happened?
mat3 <- matrix(c(vec1, vec3), 6, 2)
mat3
## [,1] [,2]
## [1,] "1" "A"
## [2,] "2" "B"
## [3,] "3" "C"
## [4,] "4" "D"
## [5,] "5" "E"
## [6,] "6" "F"
mode(mat3)
## [1] "character"
If one or more elements in the matrix represent characters, all other elements are also converted to characters. As a result, we can’t do numeric operations on the matrix, such as calculating its mean.
To be able to store numeric data and character data in one object, we can use a dataframe. A dataframe is secretly a kind of list, but it basically works and looks like a matrix that may contain both numeric and character vectors. As a result, the dataframe is the preferred object to use for storing empirical data, because most empirical datasets will contain both character data and numeric data. Matrices on the other hand are often used when purely numerical calculations are done, for instance using matrix algebra.
Make a dataframe called dat3 where vec1 and vec3 are both columns. Call the dataframe to inspect it, and compare it to mat3.
dat3 <- as.data.frame(mat3)
dat3
mat3
or, alternatively
dat3 <- data.frame(V1 = vec1, V2 = vec3)
dat3
## V1 V2
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
## 6 6 F
mat3
## [,1] [,2]
## [1,] "1" "A"
## [2,] "2" "B"
## [3,] "3" "C"
## [4,] "4" "D"
## [5,] "5" "E"
## [6,] "6" "F"
Note that in the dataframe, our columns automatically received names, namely V1 and V2. In a dataframe, each column represents a variable in a dataset. We can change the names of these columns if we want. We will postpone this however until exercise 5.
Check the mode of parts of dat3:
mode(dat3[, 1]) #1st column
## [1] "numeric"
mode(dat3[, 2]) #2nd column
## [1] "numeric"
mode(dat3[3,]) #3d row
## [1] "list"
mode(dat3[3,2]) #intersection
## [1] "numeric"
Note that, interestingly, our column with characters from vec3 is now reported to be numeric in our dataframe: the quotation marks for the values of vec3 are gone. In fact, R decided to make it a special kind of variable, that is somewhat in between a regular numeric variable and a character variable, namely a ‘factor’. (This automatic conversion is the default behaviour of data.frame() in R versions before 4.0.0; from R 4.0.0 onwards character columns stay characters unless you specify stringsAsFactors = TRUE.) A factor is a categorical variable, with specific labels for each category. This is exactly what we usually use when we include categorical variables in our analyses. This is nice, because our statistical analyses cannot work with words (characters), but they can work with factors (categorical variables, sort of numeric variables). Note however that you cannot use functions that are designed for truly, purely, numeric variables on factor variables.
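How a factor stores labels and codes can be sketched as below. The objects colours_chr, colours_fac and df are hypothetical; stringsAsFactors = TRUE is passed explicitly so that the result does not depend on the default of your R version.

```r
# Hypothetical character vector with repeated categories.
colours_chr <- c("red", "blue", "red", "green")
colours_fac <- factor(colours_chr)

levels(colours_fac)       # the category labels, sorted alphabetically
as.integer(colours_fac)   # the integer codes R stores behind the labels

# Request the conversion explicitly when building a dataframe.
df <- data.frame(id = 1:4, colour = colours_chr, stringsAsFactors = TRUE)
class(df$colour)          # "factor"
mode(df$colour)           # "numeric", because of the integer codes
```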
Inspect the structure of dat3, using function str().
A very useful function to inspect the structure of a dataframe is str(). Try running it.
str(dat3)
## 'data.frame': 6 obs. of 2 variables:
## $ V1: num 1 2 3 4 5 6
## $ V2: Factor w/ 6 levels "A","B","C","D",..: 1 2 3 4 5 6
Inspecting the structure of your data is vital when you use R in practice, as you will probably have imported your data from some other source. If we, at a later stage, start analyzing our data without the correct measurement levels (e.g., regular numeric, numeric factor, or character), we may run into problems.
For example, it is possible of course that we actually did not want the second column of dat3 to become a factor, but we wanted it to be a character variable. Fortunately, we can change it to a character variable like this:
dat3[,2] <- as.character(dat3[,2])
dat3
mode(dat3[,2])
and we can change it back into a factor like this:
dat3[,2] <- as.factor(dat3[,2])
dat3
mode(dat3[,2])
Similarly, it can occur that categorical numeric variables are not coded as factors in your dataframe, but as regular numeric variables. This can be problematic when we use the variable in a statistical analysis, because R does not realise it should make dummy variables for it (it sees it as a regular continuous numeric variable). Imagine for instance that for the first variable V1 in our dataframe dat3 the values 1 to 6 actually represent grouping information about cities. We can include this information in our dataframe by making this variable a factor, using the following code.
dat3[,1] <- factor(dat3[,1], labels = c("Utrecht", "New York", "London", "Singapore", "Rome", "Capetown"))
dat3
## V1 V2
## 1 Utrecht A
## 2 New York B
## 3 London C
## 4 Singapore D
## 5 Rome E
## 6 Capetown F
str(dat3)
## 'data.frame': 6 obs. of 2 variables:
## $ V1: Factor w/ 6 levels "Utrecht","New York",..: 1 2 3 4 5 6
## $ V2: Factor w/ 6 levels "A","B","C","D",..: 1 2 3 4 5 6
and we can change it to a numeric variable that is NOT a factor like this:
dat3[,1] <- as.numeric(dat3[,1])
dat3
str(dat3)
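One thing to keep in mind is that as.numeric() on a factor returns the underlying level codes, not the labels. For the cities this gives back 1 to 6, but when the labels are themselves numbers the result can be surprising. A minimal sketch with a hypothetical factor scores:

```r
# Hypothetical factor whose labels happen to be numbers.
scores <- factor(c("10", "50", "10", "30"))

levels(scores)                     # "10" "30" "50"
as.numeric(scores)                 # level codes: 1 3 1 2
as.numeric(as.character(scores))   # the original numbers: 10 50 10 30
```

Converting to character first is a common way to recover the original values.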
Make a list called list1 that contains vec1, mat2, and dat3.
list1<-list(vec1,mat2,dat3)
list1
## [[1]]
## [1] 1 2 3 4 5 6
##
## [[2]]
## [,1] [,2] [,3]
## [1,] "A" "B" "C"
## [2,] "D" "E" "F"
##
## [[3]]
## V1 V2
## 1 Utrecht A
## 2 New York B
## 3 London C
## 4 Singapore D
## 5 Rome E
## 6 Capetown F
A list is an object that can be used to store different objects together, while keeping their characteristics intact. For example, a numeric vector, a character matrix, and a dataframe. Lists are often used by data analysis packages to present many different results from the data analysis at once, as we will see in Exercise 5, and next week.
Select elements from list1:
list1[[1]]
list1[[2]]
list1[[3]]
list1[[1]][4]
list1[[3]][5,2]
What elements are selected? Do you get how the list is structured?
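The difference between double and single brackets on a list can be sketched with a small named list. The list mylist and its element names are hypothetical and only serve as illustration.

```r
# Hypothetical list with named elements.
mylist <- list(numbers = c(1, 2, 3), letters = c("a", "b"))

mylist[[1]]              # double brackets return the element itself (a vector)
mylist[1]                # single brackets return a list of length one
mylist$numbers           # named elements can also be reached with $
mylist[["numbers"]][2]   # second value inside the element called numbers

class(mylist[[1]])
class(mylist[1])
```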
Any object you use to do operations on objects with data is a function. When a function is used to do some operation, it always looks like a word or letter with round brackets at the end, that is “word()”. Anytime you call something in R that ends with round brackets, R will consider it to be a function, even if the function does not exist yet. Try it by typing in and running word().
Can you find all the different functions you have seen before in the previous exercises? I counted 16 of them (23 if you include those visible in the figures).
You can request information about a function using ?nameoffunction(), for example, try ?str().
You can search on a specific searchword through the R help pages, including those for all packages, using ??searchword. In this way, you can look up help files for specific functions that you obtained in a package you installed. It is also possible to look for functions that may be relevant to you with this search function. For instance, choose an analysis that has your interest and search for it in the help pages (e.g., ??growth or ??"Growth Curve Model").
The help tells you all about a function's arguments (and input requirements), as well as the element the function returns. There are strict rules for publishing packages in R. For your packages to appear on the Comprehensive R Archive Network (CRAN), a rigorous series of checks has to be passed. As a result, all user-level components (functions, datasets, elements) that are published have an accompanying help file that elaborates how the function should be used, what can be expected, or what a dataset contains. Help files often contain example code at the end that can be run as a demonstration.
For example, use the function mean(), var(), and min() to obtain the mean, variance, and minimum of the matrix. Can you also find functions for the maximum, standard deviation, and median? Can you find functions for any other descriptive statistics you would like to be able to calculate?
Use apply() to carry out another function on all rows or all columns of an object.
On some occasions you may want to use a function on many columns or rows of some object. For example, you may want to calculate the mean for each column (variable) in your dataframe. Or, you may want to calculate a sumscore across all columns for each row (participant) in a dataframe. You can use the function apply() for this purpose.
As you could see from this function’s help file, this function needs at least three arguments. The first argument in apply is the object you want to use your function on. In our case, we will use it on the matrix you made in the previous exercise. The second argument is the ‘margin’ for the function, that is, you specify whether you want to apply your function to, for instance, all rows or all columns. From the helpfile: “a vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns.” The third argument is the name of the function you want apply to use, for example, mean.
Give it a go, and use apply to use your function on each column. Also try to use it for each row. Here is an example for columns:
apply(mymatrix,2,mean) ##calculate the mean for each column.
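The two margins can be compared side by side on a small matrix. The sketch below assumes a hypothetical 3 x 2 matrix called scores, with rows as participants and columns as variables.

```r
# Hypothetical 3 x 2 matrix: 3 participants (rows), 2 variables (columns).
scores <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3, ncol = 2)

col_means <- apply(scores, 2, mean)   # MARGIN = 2: one mean per column
row_sums  <- apply(scores, 1, sum)    # MARGIN = 1: one sum score per row

col_means   # 2 20
row_sums    # 11 22 33
```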
plot() is the core plotting function in R. Find out more about plot() (e.g. through ?plot) and look at the examples by running example(plot).
?plot
example(plot)
There are many more functions that can plot specific types of plots. For example, function hist() plots histograms, but falls back on the basic plot() function. Packages lattice and ggplot2 are excellent packages to use for complex plots. Pretty much any type of plot can be made in R. A good reference for the package lattice that provides all R code can be found at http://lmdvr.r-forge.r-project.org/figures/figures.html.
As you start using R in your own research, you will find yourself in need of packages that are not part of the default R installation. The beauty of R is that its functionality is community-driven. People can add packages to CRAN, which other people can use. Chances are that a function and/or package has already been developed for the analysis or operation you plan to carry out. If not, you are of course welcome to fill the gap by submitting your own package. Next week, we will install packages that are not part of the default R installation.
There are many functions available within R, and many more are available in packages. However, at some point, you may feel the need to make your own functions. Here, we will make a very simple function to calculate a mean. Of course, the function mean() already exists to do just that, but this will give you a taste of how making functions works. Making a function is the first step to making your own package!
You need to do a couple of things to make a working function:
1. Assign your new function a name.
2. Specify which arguments the user of the function will need to specify for it to work.
3. Program the operations your function will perform behind the scenes.
4. Test if it works and fix any problems.
Functions that will be useful for you to calculate a mean (without using mean) are length() and sum(). Here is a piece of code you can use to start with to make your function (the red words you should not alter):
myfunction <- function(arg1, arg2, ... ){
  # operations go here
  return(object)
}
If you now call your function by running its name, you will see exactly what the function looks like. Now, use the function to calculate the mean of some variable.
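To see the four steps in action without giving away the mean exercise, here is a sketch of a different, hypothetical function called spread that returns the distance between the largest and smallest value of its argument x.

```r
# Hypothetical function: the spread between the largest and smallest value.
spread <- function(x) {
  result <- max(x) - min(x)   # the operation performed behind the scenes
  return(result)              # the value handed back to the user
}

spread(c(4, 9, 1, 7))   # 9 - 1 = 8
spread                  # calling the name without brackets prints the code
```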
For this exercise we will perform a t-test and multiple regression analysis on a dataset. Before we can do that however, we need a dataset. Rather than loading one from R, or importing an existing dataset (which we will do next week), we will make our own data using the techniques you just learned. To start, think of one continuous dependent variable, one dichotomous predictor variable, and one continuous predictor variable, and what your population will be. For example, math ability, gender, and reading ability, for a population of teenagers.
Decide how large your sample size will be. Make object n, and assign the sample size to it.
n<-1000
Generate data for the continuous predictor variable using function rnorm(), and for the dichotomous predictor variable using function sample(). Use ?rnorm() and ?sample() to figure out what arguments to use. Then use the functions to generate n samples for each predictor. Give each predictor an appropriate name. For example:
readability <- rnorm(n,10,3)
gender<-sample(c(0,1),n,replace=TRUE)
Call both predictor variables to see if you successfully generated the two predictor variables.
Make four objects:
- one containing the intercept of your data
- one containing the regression coefficient for the continuous predictor variable
- one containing the regression coefficient for the dichotomous predictor variable
- one containing the variance for the residuals of your data.
For example:
int <- 0
b1 <- 0 ##regression coefficient for gender
b2 <- .45 ##regression coefficient for readability
resvar <- 4
Use rnorm() to make an object that contains normally distributed residuals with a mean equal to zero, and a variance equal to the residual variance you chose in the previous exercise.
residuals = rnorm(n,0,resvar^.5)
Create an object in which you calculate the dependent variable using the regression coefficients and predictors you created in a linear multiple regression model. You can do this as follows:
mathability = int + b1*gender + b2*readability + residuals
Call your dependent variable to see what it looks like.
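Because rnorm() and sample() draw random numbers, your data will differ every time you run the script. If you want the same dataset on every run, one option is to fix the random seed with set.seed() before generating data. A minimal sketch, where the seed value 123 and the objects first and second are arbitrary choices:

```r
set.seed(123)          # hypothetical seed value; any integer works
first  <- rnorm(5, mean = 10, sd = 3)

set.seed(123)          # resetting the same seed...
second <- rnorm(5, mean = 10, sd = 3)

identical(first, second)   # ...reproduces exactly the same draws
```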
Before we analyse the data, we will put it in a dataframe. Make a dataframe with three columns: a column for the dependent variable and a column for each of the predictor variables.
mydataframe= data.frame(mathability,gender,readability)
Call the dataframe. Use function str() to inspect the structure of your dataframe. Let's make our dichotomous variable a factor variable. Give it appropriate labels.
str(mydataframe)
## 'data.frame': 1000 obs. of 3 variables:
## $ mathability: num 7.34 1.29 3.82 7.88 4.19 ...
## $ gender : num 0 0 0 0 0 1 1 0 0 1 ...
## $ readability: num 12.2 6.33 8.36 12.06 10.19 ...
mydataframe[,2]<-factor(mydataframe[,2], labels = c("male", "female"))
str(mydataframe)
## 'data.frame': 1000 obs. of 3 variables:
## $ mathability: num 7.34 1.29 3.82 7.88 4.19 ...
## $ gender : Factor w/ 2 levels "male","female": 1 1 1 1 1 2 2 1 1 2 ...
## $ readability: num 12.2 6.33 8.36 12.06 10.19 ...
Congratulations, you just made your own practice dataset!
Use the functions sd(), var(), and mean() to calculate some descriptive statistics for each variable in your dataset. Also try accomplishing this using the apply function. Finally, use the function summary() to get many summary statistics for your dataframe at once.
You can call each variable in your dataframe using the square brackets [], for example, mydataframe[,1] for mathability. However, with dataframes you can also call a relevant variable by using its name. For example, mydataframe$mathability.
mean(mydataframe[,1])
mean(as.numeric(mydataframe$gender)) # If you want to calculate a mean for a categorical factor variable (not very useful, but if you want to) you first need to change it back to a regular numeric variable.
sd(mydataframe$mathability)
apply(mydataframe, MARGIN=2, FUN=var) ##apply the var function for each column in the dataframe. It will return an NA for variables that are characters or factor variables (e.g. variables that are not truly numeric)
summary(mydataframe)
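Note that apply() first converts a dataframe to a matrix, and a dataframe that contains a factor becomes a character matrix, so the results for the numeric columns may not be what you expect. One alternative is to select the numeric columns first and then use sapply(). The sketch below uses a hypothetical dataframe df rather than your own data.

```r
# Hypothetical small dataframe with two numeric columns and one factor.
df <- data.frame(
  score = c(2, 4, 6, 8),
  group = factor(c("a", "b", "a", "b")),
  age   = c(20, 30, 40, 50)
)

num_cols <- sapply(df, is.numeric)   # TRUE for numeric columns only
num_cols

sapply(df[num_cols], var)            # variance per numeric column
sapply(df[num_cols], mean)           # mean per numeric column
```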
Make a histogram for your dependent variable using hist(), and a scatterplot for the dependent variable and the continuous predictor variable using plot().
hist(mydataframe$mathability)
plot(mydataframe$mathability,mydataframe$readability)
Inspect function t.test() to see some of the possible arguments you can add to change some options. Look at the examples provided in the help file.
## There are two ways to specify a ttest using function t.test()
## The first way is to provide the data for group 1 as x, and for group 2 as y. Like this:
myttest <- t.test(x=mydataframe$mathability[mydataframe$gender=="male"], y=mydataframe$mathability[mydataframe$gender=="female"], alternative="two.sided")
## For the second way you specify which dataframe contains the data for the ttest, and the equation for the t.test as follows: name of your dependent variable in the dataframe ~ name of independent variable in your dataframe. Like this:
myttest <- t.test(mathability~gender, alternative="two.sided", data=mydataframe)
Call myttest to see the results. Interpret the results.
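The object returned by t.test() is a list, so individual results can be extracted by name. The sketch below uses a hypothetical toy dataset (toy, with variables y and group) and an arbitrary seed instead of your own data.

```r
set.seed(1)   # hypothetical seed, only to make this sketch reproducible
toy <- data.frame(
  y     = rnorm(40),
  group = factor(rep(c("a", "b"), each = 20))
)

res <- t.test(y ~ group, data = toy)

names(res)      # the components stored in the result
res$statistic   # the t value
res$p.value     # the p value
res$conf.int    # confidence interval for the difference in means
```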
Inspect function lm() - lm stands for linear model - to see how the function works. Look at the examples provided in the help file. Perform the regression analysis.
myregression<-lm(mathability ~ 1 + gender + readability, data=mydataframe) ## the 1 you add to the regression equation is used to add an intercept to the model.
Call myregression to see the results. Use summary() to get more detailed results. Interpret the results. Note that the lm() function automatically made a dummy variable for gender, with male (the first level of the factor) as the reference category: the coefficient labelled ‘genderfemale’ in the results is the difference for female compared to male.
myregression
summary(myregression)
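Individual pieces of a fitted model can be pulled out with helper functions such as coef() and confint(). The following sketch uses a hypothetical toy dataset and arbitrary coefficient values; it also shows that the first factor level acts as the reference category.

```r
set.seed(2)   # hypothetical seed for reproducibility
toy <- data.frame(
  x     = rnorm(50),
  group = factor(rep(c("male", "female"), times = 25),
                 levels = c("male", "female"))
)
toy$y <- 1 + 0.5 * toy$x + rnorm(50)   # simulated outcome, no group effect

fit <- lm(y ~ group + x, data = toy)

levels(toy$group)   # the first level is the reference category
coef(fit)           # named vector of estimated coefficients
confint(fit)        # 95% confidence intervals by default
```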
That concludes all the exercises for our first practical. Be sure to save your R script if you do not want to lose your work! If you want to test your obtained skills some more, try the bonus exercises below.
Next week we will continue with more advanced analyses in R, by making use of statistical packages others contributed to R, and data that we will load from outside of R. Many of the things you learned today will prove relevant for what we will do next week, so be sure to practice in the meantime if you feel you need it :). See you next week!
You can find the equations you need on wikipedia :)
Load a dataset that is stored in R, and perform a basic analysis on it.
You can find an overview of all the datasets stored in R using data(). Use it to load a datafile that interests you. Find out details about the data. Find out in what kind of object the data is stored. Calculate some descriptive statistics, make a plot, and perform a basic analysis on the data (for example, calculate correlations, do a regression analysis, t-test, or an anova).