Practical 2

Last week you got acquainted with R. As a warm-up this afternoon, we will start with an exercise that functions as a recap of what you learned last time. After that, we will practice with reading in external data files and saving data and results, we will practice with exploring and plotting our data. Further, we will practice finding and installing packages in R, and we will use these packages to perform some more advanced analyses in R.

Note however that the aim of this afternoon is not necessarily to teach you to use these packages for these analyses, but to get you acquainted with using packages, and R, in general. Usually there are several available packages for certain analyses, and you may find that you prefer other packages than the ones we used today. This is what is nice about R: you can easily pick and choose, and mix and match what you like, as there are many resources available. On the other hand, this can be overwhelming!

If you have any questions or suggestions, feel free to ask/suggest.

Exercise 1: Lets see what you remember from last week!

1.1 Open `R`. Set your working directory to the folder of your choice, in which you will save all your work from the practical.

Your working directory is the base directory for your current R session. When you want to save something in R, R first automatically directs you to your working directory. Find our what your current working directory is by typing:

getwd()

## [1] "/home/noemi/Werk/Onderwijs/Noemi R cursus/practical2"

You can set your working directory using the GUI, by clicking File…Set…working directory. Alternatively, you can specify it in the following way in the R console, using the function setwd().

setwd("c:/documents/Rcodes/Practical2") # Be sure to specify the correct file path inside setwd for your PC

1.2 Open a new `R`-script. In your script, write code using the instructions below:

First, add a comment in your R-script to indicate you are working on Exercise 1. You can indicate that something is a comment rather than code by using # before you comment.
Make a character vector called vecchar with in it six elements (words or letters).
Make a numeric matrix called matnum with 2 columns and six rows of elements (numbers).
Make a dataframe called dat_chanum out of vecchar and matnum with 3 columns and six rows of elements. Remember: Dataframes and lists can contain data that have mode characters, as well as data that have mode numeric, while matrices and vectors can only contain 1 type of data.
Make a list called list_alles that contains vecchar, matnum, and dat_chanum.

Now, Run your code by selecting the code in you R-script, and pressing Ctrl-R (Windows), Ctrl-Enter (Linux), or Cmd-Enter (Mac). Call each of your objects to see what they look like. Inspect the mode of each of the four objects you made.

# Exercise 1 
vecchar <- c("apple", "pear", "orange", "cherry", "peach", "kiwi")
matnum <- matrix(c(1,2,0,4,1,0,3,5,1,10,2,4), ncol = 2, nrow = 6, byrow=FALSE)
dat_chanum <- data.frame(vecchar,matnum)
list_alles <- list(vecchar,matnum,dat_chanum)

tvecchar
matnum
dat_chanum
list_alles

mode(tvecchar)
mode(matnum)
mode(dat_chanum)
mode(list_alles)

1.3 Save your R-script file as `Practical2.R` in your working directory.

You can use the standard Ctrl-s (Windows/Linux) or Cmd-s (Mac) or use the file menu in the GUI. It is a good idea to regularly save your R script throughout this practical.

1.4 Inspect your global environment and empty it. Check your global environment again to make sure it is empty. Clean your console by pressing `ctrl l`. Browse through your history by pressing the `up arrow key` in the R-console.

ls()
rm(list = ls())
ls()

1.5 `R` comes with a bunch of datasets. Use the function `data()` to see what data is available. Call the dataset `ToothGrowth`, and the dataset `women`. Obtain information on the datasets using ?ToothGrowth and ?women. Inspect the structure of the datasets.

data()
ToothGrowth
women
?ToothGrowth
?women
str(ToothGrowth)
str(women)

1.6 Calculate the means and standard deviations for each variable in the two datasets.

mean(ToothGrowth[,1]) ##Or alternatively mean(ToothGrowth$len)
mean(ToothGrowth[,2]) ##Or alternatively mean(ToothGrowth$supp)
mean(ToothGrowth[,3]) ##Or alternatively mean(ToothGrowth$dose)

sd(ToothGrowth[,1]) ##Or alternatively sd(ToothGrowth$len)
sd(ToothGrowth[,2]) ##Or alternatively sd(ToothGrowth$supp)
sd(ToothGrowth[,3]) ##Or alternatively sd(ToothGrowth$dose)
#Interestingly, the function mean does not work for factor variables, but the function sd does because it treats factors as a truely numeric variable (and it uses the factor levels to calculate the sd).

mean(women$height) ##Or alternatively mean(women[,1])
mean(women$weight) ##Or alternatively mean(women[,2])



sd(women$height) ##Or alternatively sd(women[,1])
sd(women$weight) ##Or alternatively sd(women[,2])

#Bonuspoints for using the apply function:
apply(ToothGrowth,2,mean)  ## the apply function does not work here because the function mean does not work for our factor variable. Instead, you can use lapply or sapply, which are the same as the apply function, but then specifically for a list object. With these functions, you apply a function to each object in a list (and as you may remember, although dataframes look like matrices, they are secretly lists). lapply returns the calculated values in a list format, and sapply returns the values in a numeric format. Another demonstration why it is handy to know about different types of objects, and the modes of data, when you are working in R.
lapply(ToothGrowth, mean)
sapply(ToothGrowth,mean)

apply(women,2,mean) # for dataset women, all three apply functions work because although it is a dataframe, it contains only truely numeric variables, and in that sense behaves like a matrix.
lapply(women,mean)
sapply(women,mean)
apply(ToothGrowth,sd)
sapply(ToothGrowth,sd)
lapply(ToothGrowth,sd)
lapply(women,sd)

1.7 Make histogram for the guinea pigs’ tooth length in dataset `ToothGrowth`. Make a scatterplot for the height and weight of American women in dataset `women`.

hist(ToothGrowth$len)
plot(women$height, women$weight)

1.8 Calculate the correlation between the height and average weights in dataset women using function cor.test(). Perform a regression analysis for the dataset ToothGrowth with length as a dependent variable, and dose and supplement type as predictors.

correlatie<-cor.test(women[,1],women[,2])
correlatie

regressie<-lm(ToothGrowth$len ~ 1 + ToothGrowth$supp + ToothGrowth$dose)
regressie
summary(regressie)

It is also possible to add an interaction to the model.

interactie<-lm(ToothGrowth$len ~ 1 + ToothGrowth$supp + ToothGrowth$dose + ToothGrowth$supp*ToothGrowth$dose)
interactie
summary(interactie)

Exercise 2: Data Preparation in R.

In this exercise we will work with a dataset that we have to load into R. It is an adjusted version of the dataset that can be found here:http://spss.allenandunwin.com.s3-website-ap-southeast-2.amazonaws.com/data_files.html. The data contains 12 variables: participant id, sex, the participants main source of stress (Work, Family/Friends, Money/Finances, or Health/ilness), six items from an optimism questionnaire, a total score on a life satisfaction questionnaire (higher = more satisfied), a total score on a stress questionnaire (higher score = more stress), and a total score on a self esteem questionnaire (higher score = more self esteem).

Unfortunately, our data is in two separate files. One part of the data is stored in the .Rdata file RegressionANOVAdata1.Rdata. The second part is stored in the RegressionANOVAdata2.sav file. We will have to load both files, and get them into one dataframe. Further, we need to calculate sum or mean scores for the optimism questionnaire, of which some items are reverse coded. When we are done, we will save the data in a separate file. In exercise 3, we use the data for a more elaborate regression analysis (and we also obtain anova statistics) in R.

2.A First, empty your global environment. Then get part 1 of the data into R with function load().

###empty and check global environment.
rm(list = ls())
ls()

## character(0)

### great, now we can start with a clean slate.

Load the .Rdata file with the function “load()”. You specify the path to the file in between the round brackets. If the datafiles are saved in your working directory, you only have to specify the filename of the file.

load("c:\practical2\data\RegressionANOVAdata1.Rdata") ##use the correct path to the datafile for your computer :)

###or if the data is in the working directory
load("RegressionANOVAdata1.Rdata")

Your data is now loaded.

2.B Check your global environment. Call the object you just imported into R by loading the datafile. Inspect the structure of the dataset with `str()`. Use the function `head()` on the dataset.

Rdata files contains objects from R. Lets see what objects we loaded into our global environment when we loaded our datafile.

ls()

## [1] "datapart1"

Call datapart1 to see part 1 of our data. Use str() to inspect the structure of the data.

ls()
datapart1

You can use the function head() to only inspect the first few cases of your dataframe.

head(datapart1)

##    id     sex        mainsource tlifesat tpstress tslfest
## 2   9   MALES              Work       30       22      34
## 3 425 FEMALES Family or Friends       33       19      31
## 4 307   MALES              Work       33       31      40
## 5 440   MALES              Work       16       27      21
## 7 341 FEMALES    Money/Finances        5       39      18
## 8 300   MALES              Work       25       39      34

2.C Start by looking at the data file `RegressionANOVAdata2.sav` in spss. To read in spss data in R, we need a package. Install and load package `Hmisc`. Then, load the second part of the data in file `RegressionANOVAdata2.sav` into R with the function `spss.get()`.

You can intall a package using the buttons in the gui: .. … Then you need to choose a mirror where to download the package from, and pick the package you would like to install from a large alphabetical list.
You can also install a package using code in the R console, like this:

install.packages("Hmisc")

If the install was succesful, you will see this message at the end: “The downloaded source packages are in…”

Now we need to load the installed package in order to be able to use its functions.

library("Hmisc")

## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units

You may see it also loads some other packages the Hmisc package needs.

When it is loaded, we can use the function spss.get() to load our spss file, and transform it into an R dataframe. Note: This function is based on the function read.spss() in package foreign, which is a more well known function and package for importing spss data to R (when you google get spss file in r, you will most likely find package foreign before you find package Hmisc). However, spss.get() adds some helpful functionality, such as being able to read spss date variables and translating them to R data variables, and dealing better with variable labels.

Assign the data you read in with spss.get() to an object called datapart2.

#datapart2 <- spss.get(file="RegressionANOVAdata2.sav") #if your data is not saved in your working directory, specify the complete filepath for the file= argument.

Call the second part of the data to look at it, inspect its structure with str().

#datapart2
#head(datapart2)
#str(datapart2)

2.C Start by looking at the data file `RegressionANOVAdata2.csv` in a basic text editor or excel. Then, load the second part of the data in file `RegressionANOVAdata2.csv` into R with the function `read.table()`.

?read.table

The help file describes what the function read.table does: “Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.” That is, it is function that can be used to read for instance .csv, .txt, and .dat files, and puts it in a dataframe. In the arguments for the function you need to specify how the data was stored, for instance: - how are variables seperated from each other in the file (with a tab: sep=“”, or a space: sep=" “, or a comma: sep=”," or…) - if comma’s or dots are used to indicate decimal points - if there are variable names in the first line of the data file or not.

Assign the data you read in to an object called datapart2.

datapart2 <- read.table(file="RegressionANOVAdata2.csv", header=TRUE, sep="\t") #if your data is not saved in your working directory, specify the complete filepath for the file= argument. We specify the argument header = TRUE because the variable names are specified at the top of the datafile. sep="\t" because the data is tab delimited.

Call the second part of the data to look at it, inspect its structure with str().

datapart2
head(datapart2)
str(datapart2)

2.D Store datapart1 in an object called `fulldata`. Now add the data in datapart2 to `fulldata`, such that the variables in datapart2 are in columns 7 to 12 of `fulldata`.

fulldata<-datapart1
fulldata[,7:12]<-datapart2

head(fulldata)

2.E Item 2, 4, and 6 from the optimism scale (`op2`,`op4`, and `op6`) are reversely scored: The items are scored from 1 to 5, and the higher the score, the lower the optimism. Make new items that are not reversely scored (such that a high score indicates high optimism), and put them in column 13, 14, and 15 of our dataframa `fulldata`.

fulldata[,13]<- 6 - fulldata$op2
fulldata[,14]<- 6 - fulldata$op4
fulldata[,15]<- 6 - fulldata$op6

The names of our newly made variables are V13, V14 and V15. We can use function colnames (stands for column names) to change the names.

colnames(fulldata)[13:15]<- c("op2R", "op4R", "op6R")

2.F Use apply and the function mean or sum to calculate mean scores across the optimism items for each participant. Put the mean scores in the 16th column of `fulldata`.

Rirst, select the variables we want to sum from the dataset. We can store it in variable “items”. Then use the apply function to sum over each row of the object items.

items <- fulldata[,c(7,9,11,13:15)]
fulldata[,16]<- apply(items,MARGIN=1,FUN=sum)  ##or for mean scores rather than sumscores: apply(items,MARGIN=1,FUN=mean)

Alternatively, you can do it in one go like this:

fulldata[,16]<- apply(fulldata[,c(7,9,11,13:15)],MARGIN=1,FUN=sum) ##or for mean scores rather than sumscores: apply(fulldata[,c(7,9,11,13:15)],MARGIN=1,FUN=mean)

We can give our new variable a suitable name using colnames()

colnames(fulldata)[16]<- c("sumOpt")

2.G Save your full dataset in an .Rdata file using `save()`, and in an .csv or .txt file using `write.table()`.

###In an Rdata file you can easily save many objects you made in R. 
## to save just fulldata you would do:

save(fulldata, file="DataforExercise3.Rdata") ### If you want to save your file somewhere other than your working directory, specify the complete path, including the filename. Like this: file = "c:\practical2\data\DataExercise2.Rdata".

## to save fulldata, datapart1, and datapart2
save(fulldata, datapart1,datapart2, file="DataforExercise3.Rdata")

You can load this file into R like we did in exercise 2.A.

Now, to save it as a .txt or .csv:

write.table(fulldata, file="DataforExercise3.csv", sep="\t", row.names=FALSE, col.names=TRUE) #I chose to make a tab-delimited, but you can choose anything to separate the variables, for example, sep=";".

You now have successfully organized your data in R. We will analyse this dataset in Exercise 3 and 4.

Exercise 3: A Regression Analysis in R.

id
sex
the participants main source of stress (Work, Family/Friends, Money/Finances, or Health/ilness)
six items from an optimism questionnaire and three reversed items from the optimism questionnaire
a total score on a life satisfaction questionnaire (higher = more satisfied)
a total score on a stress questionnaire (higher score = more stress)
a total score on a self esteem questionnaire (higher score = more self esteem)
a total score on an optimism questionnaire (higher score = more optimism)

We will perform two regression analyses on the data to find out if gender moderates the relationship between life satisfaction and optimism. We will also be checking some assumptions.

3.A Load the data.

If you did Exercise 2, load one of the datafiles you made in exercise 2G: the file “DataforExercise3.csv” using read.table(), or the file “DataforExercise3.Rdata” using load().
If you did not do Exercise 2, open and inspect the file ’DataRegression.csv in a text editor or in excel. Then load it using function read.table().

?read.table

The help file describes what the function read.table does:“Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.” That is, it is function that can be used to read for instance .csv, .txt, and .dat files, and puts it in a dataframe. In the arguments for the function you need to specify how the data was stored, for instance: - how are variables seperated from each other in the file (with a tab: sep=“”, or a space: sep=" “, or a comma: sep=”," or…) - if comma’s or dots are used to indicate decimal points - if there are variable names in the first line of the data file or not.

Assign the data you read in to an object called fulldataset.

fulldataset <- read.table(file="DataRegression.csv", header=TRUE, sep="\t") #if your data is not saved in your working directory, specify the complete filepath for the file= argument. We specify the argument header = TRUE because the variable names are specified at the top of the datafile. sep="\t" because the data is tab delimited.

You can also use the function head() to only inspect the first few cases of your dataframe.

head(fulldataset)

##    id     sex        mainsource tlifesat tpstress tslfest op1 op2 op3 op4
## 1   9   MALES              Work       30       22      34   2   3   4   3
## 2 425 FEMALES Family or Friends       33       19      31   3   1   3   3
## 3 307   MALES              Work       33       31      40   3   1   5   3
## 4 440   MALES              Work       16       27      21   3   2   3   2
## 5 341 FEMALES    Money/Finances        5       39      18   3   5   1   4
## 6 300   MALES              Work       25       39      34   4   1   3   1
##   op5 op6 op2R op4R op6R sumOpt
## 1   5   4    3    3    2     19
## 2   3   4    5    3    2     19
## 3   5   1    5    3    5     26
## 4   1   3    4    4    3     18
## 5   1   4    1    2    2     10
## 6   4   2    5    5    4     25

3.B We do not need all the separate items of the optimism scale for our analyses. Make a new object called `datareg` with all variables except the individual optimism items (op1 to op6, and op2R, op4R, and op6R). Inspect the structure of the data with str(). Use summary() on the data to obtain some descriptives.

datareg<-fulldataset[,-c(7:15)]
str(datareg)
summary(datareg) ##Note that for the sumscore variables there are some missing data indicated with "NA". These will be listwise deleted automatically when we use these variables for the regression analysis in R.

3.C We want to find out if sex and optimisim predict life satisfaction, but first, try to explore the data a bit.

Some things you can do : - Make a histogram for the dependent variable using hist(). Give the plot the main title “Histogram of Life Satisfaction”.

Get the correlation matrix for all the continuous variables using cor().
Obtain the average life satisfaction for men, and for women, using mean().
Make a scatterplot of life satisfaction and optimism, for only the women using plot(). Use the argument col="red" to plot in red. Use the argument xlab to label the x axis “Optimism”.
Add the points for men to the plot using points(). Use the argument col="blue" to plot in blue, and pch=3 to change the shape of the points.

hist(datareg[,"tlifesat"], main="Histogram of Life Satisfaction")
cor(datareg[,4:7], use="pairwise.complete.obs") # What happens when you do not specify the second argument, na.rm=TRUE?

#make two filter variables, one to select men and one to select women:
filtf = datareg[,2]=="FEMALES"
filtm = filtf==FALSE

#use the filter variables to get the life satisfaction means for women and men
mean(datareg[filtf,"tlifesat"], na.rm=TRUE) #na.rm = TRUE to not use the missings
mean(datareg[filtm,"tlifesat"], na.rm=TRUE) #na.rm = TRUE to not use the missings

#use the filter variables to make a scatterplot for women
plot(datareg[filtf,"sumOpt"],datareg[filtf,"tlifesat"],col="red", xlab= "Optimism", ylab="Life Satisfaction")

# Then add the points for men
points(datareg[filtm,"sumOpt"],datareg[filtm,"tlifesat"], col="blue", pch=2)

3.D Fit a regression model with life satisfaction as the dependent variable, and sex, optimism and an interaction term for sex and optimism as predictors. Call your regression analysis `reg1`. Use summary() to get some more results, such as p-values for the regression coefficients and R-squared. Interpret the results.

reg1 <- lm(tlifesat ~ 1+ sex + sumOpt + sex*sumOpt, data=datareg) 
summary(reg1)

3.E We probably should have checked some assumptions before we interpreted the results… We will do this right now.

First, we will make some plots to see if our residuals are normally distributed, and to evaluate if there is heteroskedasticity.

Use plot() on your regression analysis to get some diagnostic plots.
Obtain the predicted values and residuals using predict() and residuals() on the regression analysis.
Make a histogram for the residuals.
use the function shapiro.test() on the residuals to test if the assumption of normally distributed residuals is violated.

plot(reg1) 
# The first plot we see shows a plot of our predicted values and the residuals. The plot should look like a pretty much evenly distributed blob of points. It also labels some potential outliers.
# Press enter to go to the next plot. This is a QQplot for the residuals, and if they are normally distributed the should fall on a straight line. It again labels some potential outliers. 
# Disregard the third plot
# The fourth plot can be used for outlier diagnostics, and gives on overview of the standardized residuals and leverage values.

resids = residuals(reg1)
pred = predict(reg1)

hist(resids) 
shapiro.test(resids)

## we can also remake the basic heteroskedasticityplot ourselves, quite simply:
plot(x=pred,y=resids)

Exercise 4: A Factorial ANOVA in R.

id
sex
the participants main source of stress (Work, Family/Friends, Money/Finances, or Health/ilness)
six items from an optimism questionnaire and three reversed items from the optimism questionnaire
a total score on a life satisfaction questionnaire (higher = more satisfied)
a total score on a stress questionnaire (higher score = more stress)
a total score on a self esteem questionnaire (higher score = more self esteem)
a total score on an optimism questionnaire (higher score = more optimism)

We will perform a factorual ANOVA to find out if and how experienced stress is related to the type of main source of stress, and to gender.

4.A Load the data.

If you did Exercise 2, load one of the datafiles you made in exercise 2G: the file “DataforExercise3.csv” using read.table(), or the file “DataforExercise3.Rdata” using load().
If you did not do Exercise 2, open and inspect the file ’DataRegression.csv in a text editor or in excel. Then load it using function read.table().

?read.table

The help file describes what the function read.table does:“Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.” That is, it is function that can be used to read for instance .csv, .txt, and .dat files, and puts it in a dataframe. In the arguments for the function you need to specify how the data was stored, for instance: - how are variables seperated from each other in the file (with a tab: sep=“”, or a space: sep=" “, or a comma: sep=”," or…) - if comma’s or dots are used to indicate decimal points - if there are variable names in the first line of the data file or not.

Assign the data you read in to an object called fulldataset.

fulldataset <- read.table(file="DataRegression.csv", header=TRUE, sep="\t") #if your data is not saved in your working directory, specify the complete filepath for the file= argument. We specify the argument header = TRUE because the variable names are specified at the top of the datafile. sep="\t" because the data is tab delimited.

You can also use the function head() to only inspect the first few cases of your dataframe.

head(fulldataset)

##    id     sex        mainsource tlifesat tpstress tslfest op1 op2 op3 op4
## 1   9   MALES              Work       30       22      34   2   3   4   3
## 2 425 FEMALES Family or Friends       33       19      31   3   1   3   3
## 3 307   MALES              Work       33       31      40   3   1   5   3
## 4 440   MALES              Work       16       27      21   3   2   3   2
## 5 341 FEMALES    Money/Finances        5       39      18   3   5   1   4
## 6 300   MALES              Work       25       39      34   4   1   3   1
##   op5 op6 op2R op4R op6R sumOpt
## 1   5   4    3    3    2     19
## 2   3   4    5    3    2     19
## 3   5   1    5    3    5     26
## 4   1   3    4    4    3     18
## 5   1   4    1    2    2     10
## 6   4   2    5    5    4     25

4.B We do not need all the separate items of the optimism scale for our analyses. Make a new object called `datareg` with all variables except the individual optimism items (op1 to op6, and op2R, op4R, and op6R). Inspect the structure of the data with str(). Use summary() on the data to obtain some descriptives.

datareg<-fulldataset[,-c(7:15)]
str(datareg)
summary(datareg) ##Note that for the sumscore variables there are some missing data indicated with "NA". These will be listwise deleted automatically when we use these variables for the regression analysis in R.

4.C We want to find out if sex and optimisim predict life satisfaction, but first, try to explore the data a bit.

Some things you can do : - Make a histogram for the dependent variable using hist(). Give the plot the main title “Histogram of Stress”.

Obtain the stress level satisfaction for men, and for women, using mean().
Make boxplots of stress per main source of stress, for only the women using plot(). Use the argument col="red" to plot in red. Use the argument xlab to label the x axis “main source of stress”.
Make boxplots of stress per main source of stress, for only the men using plot(). Use the argument col="blue" to plot in red. Use the argument xlab to label the x axis “main source of stress”.

hist(datareg[,"tpstress"], main="Histogram of Stress")

#make two filter variables, one to select men and one to select women:
filtf = datareg[,2]=="FEMALES"
filtm = filtf==FALSE

#use the filter variables to get the life satisfaction means for women and men
mean(datareg[filtf,"tlifesat"], na.rm=TRUE) #na.rm = TRUE to not use the missings
mean(datareg[filtm,"tlifesat"], na.rm=TRUE) #na.rm = TRUE to not use the missings

par(mfrow=c(1,2)) ##par() can be used to set all kind of options (graphical parameters) in advance for making plots. Here we use it to specify mfrow, which is used to determine how many rows and columns a plot should have. Here we say 1 row, 2 columns. In this way, we can make two plots next to each other. Want to know more? Search for R: plotting with par() in google. 

#use the filter variables to make a scatterplot for women
plot(datareg[filtf,"mainsource"],datareg[filtf,"tpstress"],col="red", xlab= "Main source of stress", ylab="Stress")

#use the filter variables to make a scatterplot for men
plot(datareg[filtm,"mainsource"],datareg[filtm,"tpstress"],col="blue", xlab= "Main source of stress", ylab="Stress")

4.D I know I said we are going to do an ANOVA, but actually we will first do a regression. We can use the results for the regression to get ANOVA statistics. To do this, do the following:

First check if sex and main source of stress are really coded as factors with str(). It would be strange to treat our categorical variable main source of stress as a numerical variable.
Fit a regression model with stress as the dependent variable, and sex, and main source of stress as predictors. Call your regression analysis reg_ano.
Use summary() to get some more results, such as p-values for the regression coefficients and R-squared. Note that R automatically made dummy variables for the factor predictor variables, the reference category is ‘Family/friends’ for main source of stress, and `females’ for sex. Interpret the (regression) results.

reg_ano <- lm(tpstress ~ 1+ sex + mainsource, data=datareg) 
summary(reg_ano)

We probably should have checked some assumptions before we interpreted the results… In Exercise 3.E we show how to check some model assumptions for the regression analysis: If you would like to check them you can use the same techniques presented there.

4.E Get the ANOVA results by using fucntion `anova()` on the regression analysis. Also get the ANOVA results with aov

Did you know that SPSS actually performs a regression analysis in the background when you perform an ANOVA model in the general linear model menu? We are basically doing the same thing, but openly. We can get the overall ANOVA results using the function anova() on the regression model.

anova(reg_ano)

Alternatively, it is possible to immediately fit an ANOVA. It works the same way as the regression analysis, but instead you use function aov().

anova1<-aov(tpstress ~ 1+ sex + mainsource, data=datareg )
summary(anova1)

Interpret the results.

4.F Do Tukey corrected post hoc tests for main sources of stress by using function `TukeyHSD()` on our oav analysis.

We didn’t specify any specific hypotheses about the differences in the means of stress, so we want to do post hoc tests to see if there are any paiwise differences in stress for different main sources of stress. We can do this as follows with the function TukeyHSD().

TukeyHSD(anova1)

4.G Specify contrasts and change the dummy coding for the main source of stress in the regression.

Healthy is often considered most important to peoples’ the happiness. If this is the case, we may expect that stress is higher when the main source of stress is health/ilness related than for the other categories. Use the function C() to specify contrasts (not to be confused with the vector function c()) to compare each other main source of stress to health/ilness (basically change the reference group in the dummy coding). Use another planned contrast to compare health/ilness to the other three categories together.

In R, you specify contrasts by altering the factor variable in your dataframe. The standard ‘treatment’ contrast can be used for a standard dummy coding, and we can change the reference category wit the argument base=... Check the order of the levels of your factor variable with the function levels(), and the current specified contrasts for the factor variable with contrasts(). Then use the number of the element that contains your reference of choice to specify the base argument. Use the contrast function in your regression analysis instead of the earlier predictor mainsource to perform the analysis with different dummy coding.

levels(datareg$mainsource)
contrasts(datareg$mainsource)
reg_ano2<-lm(tpstress ~ 1+ sex + C(datareg$mainsource, contr=contr.treatment, base=2), data=datareg)
summary(reg_ano2)

# Note if you want to change the factor in your dataset to have a certain contrast (not just only for one analysis), you should change the factor in the dataset as follows:
datareg$mainsource<-C(datareg$mainsource, contr=contr.treatment, base=2)
##check the contrast set for your factor variable
contrasts(datareg$mainsource)

Interpret the results.

Make a planned contrast to compare health/ilness to the other three categories together using C()

To do a contrast that is not default in R, we need to specify our own ’contrast matrix`. This is a matrix with contrast coding, with in each column a specific contrast.

# We only want to specify one contrast, so our matrix will have 1 column (so that it can actually be considered a vector), and a row for each level of our categorical predictor variable. Remember, we want to compare the second level to all the other levels combined. To specify an orthogonal contrast we want our contrast codings to sum up to zero, and the categories that should be pooled together should be assigned the same number. 
contrast_mat <- matrix(c(1,-3,1,1),4,1)
##now use this matrix in the C() function. We should also specify in the C() function that we want only 1 contrast with the arguments how.many - the default is one less than the number of levels.
reg_ano3<-lm(tpstress ~ 1+ sex + C(datareg$mainsource, contr=contrast_mat, how.many=1), data=datareg)
summary(reg_ano3)

Interpret the results.

Exercise 5: SEM in R with package Lavaan.

To do sem analyses in R we will use package lavaan. Lavaan is not the only sem package available in R. There is also package ‘sem’, and package ‘openmx’ (and there may be even more packages available.) Lavaan is in my opinion most userfriendly, so we will use this today. It aims to provide an alternative to mplus, can mimic mplus, and is continuously being developed. You may like to try the other packages some time as well - openmx is known as having a quite steep learning curve, but when you know it apparently you can fit almost any model with it.

To start, read a bit about Lavaan on its website: http://lavaan.ugent.be/. The website alsno has a nice tutorial you can follow. Note also that there are some things you can’t do with lavaan yet (from http://lavaan.ugent.be/tutorial/before.html):

“The lavaan package is not finished yet. But it is already very useful for most users, or so we hope. However, some important features that are currently NOT available in lavaan are:

support for hierarchical/multilevel datasets (multilevel cfa, multilevel sem)
support for discrete latent variables (mixture models, latent classes)
Bayesian estimation

We hope to add these features in the next (two?) year(s) or so."

We will start by installing the package, loading, data, and fitting a very basic confirmatory factor analysis on the continuous scores of 72 boys on 6 measures of certain cognitive skills, just to get you started. If you want to practice with more difficult models, you can practice by following the lavaan tutorial.

5.A Install and load package lavaan.

You can intall a package using the buttons in the gui: .. … Then you need to choose a mirror where to download the package from, and pick the package you would like to install from a large alphabetical list.
You can also install a package using code in the R console, like this:

install.packages("lavaan")

If the install was succesful, you will see this message at the end: “The downloaded source packages are in…”

Now we need to load the installed package in order to be able to use its functions.

library("lavaan")

## This is lavaan 0.5-20
## lavaan is BETA software! Please report any bugs.

5.B Load the data file `Boys.dat` into R by following the instructions below.

First, open and inspect the file Boys.dat in a text editor or in excel. This file Boys.dat contains the scores of 72 boys on 6 measures of cognitive skills. Note that the variable names are not included in the top of the file. In order, the six variables are: - Visual perception scores

Scores on a test of spatial visualization
Scores on a test of spatial orientation
Paragraph comprehension scores
Sentence completion scores
Word meaning test scores

Load the file Boys.dat into R using function read.table().

?read.table

The help file describes what the function read.table does:“Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.” That is, it is function that can be used to read for instance .csv, .txt, and .dat files, and puts it in a dataframe. In the arguments for the function you need to specify how the data was stored, for instance: - how are variables seperated from each other in the file (with a tab: sep=“”, or a space: sep=" “, or a comma: sep=”," or…) - if comma’s or dots are used to indicate decimal points - if there are variable names in the first line of the data file or not.

Assign the data you read in to an object called boysdata.

boysdata<- read.table(file="Boys.dat", header=FALSE, sep="\t") #if your data is not saved in your working directory, specify the complete filepath for the file= argument. We specify the argument header = FALSE because the variable names are not specified at the top of the datafile. sep="\t" because the data is tab delimited.

The names for the variables in our dataframe are V1 to V6. Give them more appropriate names using function colnames().

colnames(boysdata)<- c("visper","spatvis", "spator", "parcom", "sencompl", "wordmean")

5.C Specify a CFA lavaan model.

It is assumed that the six variables in boysdata measure two factors, that is Spatial Ability, which is measured by the first three variables, and Verbal Ability, which is measured by the last three variables. Specify the two-factor model in lavaan.

Use the lavaan tutorial: http://lavaan.ugent.be/tutorial/syntax1.html to see how the syntax works.

We will later use function cfa() or sem() to fit the model. For these functions, lavaan will autoamtically scale the model for you. Further, by default, lavaan will include variances for all observed and latent variables, and covariances between the latent variables to the model, so you do not have to specify these in the model. That means that if you do not want those parameters included in the model, you will have to restrict these parameters to zero (we will try restricting parameters later). Alternatively, you can use function lavaan(), then you will have to specify the full model, exactly like you want it, yourself.

Call your lavaan model object boysmodel1.

boysmodel1 <- ' 

             ### Two factors, spatial and verbal
               spatial =~ visper + spatvis + spator 
               verbal =~ parcom + sencompl + wordmean
              
             '

5.D Fit the model using function cfa() or sem().

Look at the many arguments you can specify for function cfa() and sem() by looking up the functions help files. Currently these functions do basically the same thing, so it does not matter which one you will use.

Fit your model with lavaan using function sem or cfa. Call your sem analysis cfa1.

cfa1 <- sem(model=boysmodel1, data=boysdata)

cfa1 is now on object of lavaan class. Look at all the things you can obtain lavaan objects in the helpfiles by calling ?lavaan-class, under the heading ‘methods’. Here you will find all kinds of functions you can use on the lavaan object. Look at the summary() function for the lavaan class object at the bottom.

Inspect the results with summary(). Look at what results lavaan returns by default and interpret the results. How does lavaan automatically scale the model - in the factor loadings, or in the factor variances? Does the two-factor model fit the data? Are the two factors correlated?

summary(cfa1)

Now, use summary() and specify an argumentto obtain fit measures with your results.

summary(cfa1,fit.measures=TRUE)

Does the two-factor model fit the data?

5.E Mimic Mplus.

Lavaan does many things in the same way as Mplus, but not always. You can specify that you want lavaan to run exactly like Mplus - if they figured out how to do that for your type of model - by specifying mimic=“Mplus” in the sem() or cfa() function. Try it.

cfa1 <- sem(model=boysmodel1, data=boysdata, mimic="Mplus")
summary(cfa1)

5.F Changing the way we scale the model by freeing and fixing parameters in lavaan.

Lavaan automatically scaled our model by fixing one factor loading to 1 for each factor. However, we may want to scale in the factor variances instead. We can do this by fixing the factor variances to one, and freeing the factor loadings.

Use the lavaan tutorial: http://lavaan.ugent.be/tutorial/syntax1.html to see how to do more with the model syntax.

Change the model so scaling is done in the factor variances. Call you model boysmodel2.

boysmodel2<- ' 

             ### Two factors, spatial and verbal
               spatial =~ NA*visper + spatvis + spator   ##Free loadings with NA
               verbal =~ NA*parcom + sencompl + wordmean
              
            verbal ~~ 1*verbal  # variance spatial
            spatial ~~ 1*spatial  # variance verbal   '

cfa2 <- sem(model=boysmodel2, data=boysdata)
summary(cfa2)

Exercise 6: Multilevel modeling in R with package lme4.

To do multilevel modeling in R we will use package lme4. Lme4 is not the only mutlilevel modeling package available in R (there are quite many), but it is one of the most popular packages. One popular alternative is nlme.

We will start by installing the package and loading data. After that, we will fit a basic multilevel model on the data.

6.A Install and load package lme4.

You can intall a package using the buttons in the gui: .. … Then you need to choose a mirror where to download the package from, and pick the package you would like to install from a large alphabetical list.
You can also install a package using code in the R console, like this:

install.packages("lme4")

If the install was succesful, you will see this message at the end: “The downloaded source packages are in…”

Now we need to load the installed package in order to be able to use its functions.

library("lme4")

## Loading required package: Matrix

6.B Load the data file `popular.dat` into R by following the instructions below.

First, open and inspect the file popular.dat in a text editor or in excel. This file popular.dat is data from chapter 2 of Joop Hox’s book on multilevel modeling “Multilevel Analysis: Techniques and Applications”. It contains simulated data for 2000 pupils in 100 classes. It contains the following variables:

pupil number
class
extraversion score for each pupil
sex for each pupil
teacher experience in each class
popularity ratings for the pupil by their peers
popularity rating for the pupil bu their teacher.
standardized scores for the latter five variables presented above
centered scores for extraversion, teacher experience, and sex.

Load the file popular.dat into R using function read.table().

?read.table

The help file describes what the function read.table does:“Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.” That is, it is function that can be used to read for instance .csv, .txt, and .dat files, and puts it in a dataframe. In the arguments for the function you need to specify how the data was stored, for instance: - how are variables seperated from each other in the file (with a tab: sep=“”, or a space: sep=" “, or a comma: sep=”," or…) - if comma’s or dots are used to indicate decimal points - if there are variable names in the first line of the data file or not.

Assign the data you read in to an object called popudat.

popudat<- read.table(file="popular.dat", header=TRUE, sep=";") #if your data is not saved in your working directory, specify the complete filepath for the file= argument. We specify the argument header = TRUE because the variable names are specified at the top of the datafile. sep=";" because the data is delimited with the symbol ;.

You can use the function head() to only inspect the first few cases of your dataframe.

head(popudat)

##   pupil class extrav  sex texp popular popteach    Zextrav       Zsex
## 1     1     1      5 girl   24     6.3        6 -0.1703149  0.9888125
## 2     2     1      7  boy   24     4.9        5  1.4140098 -1.0108084
## 3     3     1      4 girl   24     5.3        6 -0.9624772  0.9888125
## 4     4     1      3 girl   24     4.7        5 -1.7546396  0.9888125
## 5     5     1      5 girl   24     6.0        6 -0.1703149  0.9888125
## 6     6     1      4  boy   24     4.7        5 -0.9624772 -1.0108084
##      Ztexp   Zpopular   Zpopteach Cextrav Ctexp Csex
## 1 1.486153  0.8850133  0.66905609  -0.215 9.737  0.5
## 2 1.486153 -0.1276291 -0.04308451   1.785 9.737 -0.5
## 3 1.486153  0.1616973  0.66905609  -1.215 9.737  0.5
## 4 1.486153 -0.2722923 -0.04308451  -2.215 9.737  0.5
## 5 1.486153  0.6680185  0.66905609  -0.215 9.737  0.5
## 6 1.486153 -0.2722923 -0.04308451  -1.215 9.737 -0.5

5.C Specify a random intercept model for the peer popularity ratings of the pupils.

Call your multilevel model object randomint.

#randomint <-

5.D Fit the model using function cfa() or sem().

Fit your model with lavaan using function sem or cfa. Call your sem analysis cfa1.

cfa1 <- sem(model=boysmodel1, data=boysdata)

summary(cfa1)

Now, use summary() and specify an argumentto obtain fit measures with your results.

summary(cfa1,fit.measures=TRUE)

Does the two-factor model fit the data?

5.E Mimic Mplus.

cfa1 <- sem(model=boysmodel1, data=boysdata, mimic="Mplus")
summary(cfa1)

5.F Changing the way we scale the model by freeing and fixing parameters in lavaan.

Use the lavaan tutorial: http://lavaan.ugent.be/tutorial/syntax1.html to see how to do more with the model syntax.

Change the model so scaling is done in the factor variances. Call you model boysmodel2.

boysmodel2<- ' 

             ### Two factors, spatial and verbal
               spatial =~ NA*visper + spatvis + spator   ##Free loadings with NA
               verbal =~ NA*parcom + sencompl + wordmean
              
            verbal ~~ 1*verbal  # variance spatial
            spatial ~~ 1*spatial  # variance verbal   '

cfa2 <- sem(model=boysmodel2, data=boysdata)
summary(cfa2)

If you want to practice with more difficult models, you can practice by following the lavaan tutorial!

END of Practical 2

That concludes all the exercises for our second practical. Be sure to save you R script if you do not want to lose your work! If you want to test your obtained skills some more, try any of the analyses with your own data.

Practical 2

Noémi Schuurman

december 2015

Exercise 1: Lets see what you remember from last week!

1.1 Open R. Set your working directory to the folder of your choice, in which you will save all your work from the practical.

1.2 Open a new R-script. In your script, write code using the instructions below:

1.3 Save your R-script file as Practical2.R in your working directory.

1.4 Inspect your global environment and empty it. Check your global environment again to make sure it is empty. Clean your console by pressing ctrl l. Browse through your history by pressing the up arrow key in the R-console.

1.5 R comes with a bunch of datasets. Use the function data() to see what data is available. Call the dataset ToothGrowth, and the dataset women. Obtain information on the datasets using ?ToothGrowth and ?women. Inspect the structure of the datasets.

1.6 Calculate the means and standard deviations for each variable in the two datasets.

1.7 Make histogram for the guinea pigs’ tooth length in dataset ToothGrowth. Make a scatterplot for the height and weight of American women in dataset women.

1.8 Calculate the correlation between the height and average weights in dataset women using function cor.test(). Perform a regression analysis for the dataset ToothGrowth with length as a dependent variable, and dose and supplement type as predictors.

Exercise 2: Data Preparation in R.

2.A First, empty your global environment. Then get part 1 of the data into R with function load().

2.B Check your global environment. Call the object you just imported into R by loading the datafile. Inspect the structure of the dataset with str(). Use the function head() on the dataset.

2.C Start by looking at the data file RegressionANOVAdata2.sav in spss. To read in spss data in R, we need a package. Install and load package Hmisc. Then, load the second part of the data in file RegressionANOVAdata2.sav into R with the function spss.get().

2.C Start by looking at the data file RegressionANOVAdata2.csv in a basic text editor or excel. Then, load the second part of the data in file RegressionANOVAdata2.csv into R with the function read.table().

2.D Store datapart1 in an object called fulldata. Now add the data in datapart2 to fulldata, such that the variables in datapart2 are in columns 7 to 12 of fulldata.

2.F Use apply and the function mean or sum to calculate mean scores across the optimism items for each participant. Put the mean scores in the 16th column of fulldata.

2.G Save your full dataset in an .Rdata file using save(), and in an .csv or .txt file using write.table().

Exercise 3: A Regression Analysis in R.

3.A Load the data.

3.C We want to find out if sex and optimisim predict life satisfaction, but first, try to explore the data a bit.

3.E We probably should have checked some assumptions before we interpreted the results… We will do this right now.

Exercise 4: A Factorial ANOVA in R.

4.A Load the data.

4.C We want to find out if sex and optimisim predict life satisfaction, but first, try to explore the data a bit.

4.D I know I said we are going to do an ANOVA, but actually we will first do a regression. We can use the results for the regression to get ANOVA statistics. To do this, do the following:

4.E Get the ANOVA results by using fucntion anova() on the regression analysis. Also get the ANOVA results with aov

4.F Do Tukey corrected post hoc tests for main sources of stress by using function TukeyHSD() on our oav analysis.

4.G Specify contrasts and change the dummy coding for the main source of stress in the regression.

Exercise 5: SEM in R with package Lavaan.

5.A Install and load package lavaan.

5.B Load the data file Boys.dat into R by following the instructions below.

5.C Specify a CFA lavaan model.

5.D Fit the model using function cfa() or sem().

5.E Mimic Mplus.

5.F Changing the way we scale the model by freeing and fixing parameters in lavaan.

Exercise 6: Multilevel modeling in R with package lme4.

6.A Install and load package lme4.

6.B Load the data file popular.dat into R by following the instructions below.

5.C Specify a random intercept model for the peer popularity ratings of the pupils.

5.D Fit the model using function cfa() or sem().

5.E Mimic Mplus.

5.F Changing the way we scale the model by freeing and fixing parameters in lavaan.

If you want to practice with more difficult models, you can practice by following the lavaan tutorial!

END of Practical 2

1.1 Open `R`. Set your working directory to the folder of your choice, in which you will save all your work from the practical.

1.2 Open a new `R`-script. In your script, write code using the instructions below:

1.3 Save your R-script file as `Practical2.R` in your working directory.

1.4 Inspect your global environment and empty it. Check your global environment again to make sure it is empty. Clean your console by pressing `ctrl l`. Browse through your history by pressing the `up arrow key` in the R-console.

1.5 `R` comes with a bunch of datasets. Use the function `data()` to see what data is available. Call the dataset `ToothGrowth`, and the dataset `women`. Obtain information on the datasets using ?ToothGrowth and ?women. Inspect the structure of the datasets.

1.7 Make histogram for the guinea pigs’ tooth length in dataset `ToothGrowth`. Make a scatterplot for the height and weight of American women in dataset `women`.

2.B Check your global environment. Call the object you just imported into R by loading the datafile. Inspect the structure of the dataset with `str()`. Use the function `head()` on the dataset.

2.C Start by looking at the data file `RegressionANOVAdata2.sav` in spss. To read in spss data in R, we need a package. Install and load package `Hmisc`. Then, load the second part of the data in file `RegressionANOVAdata2.sav` into R with the function `spss.get()`.

2.C Start by looking at the data file `RegressionANOVAdata2.csv` in a basic text editor or excel. Then, load the second part of the data in file `RegressionANOVAdata2.csv` into R with the function `read.table()`.

2.D Store datapart1 in an object called `fulldata`. Now add the data in datapart2 to `fulldata`, such that the variables in datapart2 are in columns 7 to 12 of `fulldata`.

2.F Use apply and the function mean or sum to calculate mean scores across the optimism items for each participant. Put the mean scores in the 16th column of `fulldata`.

2.G Save your full dataset in an .Rdata file using `save()`, and in an .csv or .txt file using `write.table()`.

4.E Get the ANOVA results by using fucntion `anova()` on the regression analysis. Also get the ANOVA results with aov

4.F Do Tukey corrected post hoc tests for main sources of stress by using function `TukeyHSD()` on our oav analysis.

5.B Load the data file `Boys.dat` into R by following the instructions below.

6.B Load the data file `popular.dat` into R by following the instructions below.