Last week you got acquainted with R
. As a warm-up this afternoon, we will start with an exercise that functions as a recap of what you learned last time. After that, we will practice with reading in external data files and saving data and results, we will practice with exploring and plotting our data. Further, we will practice finding and installing packages in R, and we will use these packages to perform some more advanced analyses in R.
Note however that the aim of this afternoon is not necessarily to teach you to use these packages for these analyses, but to get you acquainted with using packages, and R, in general. Usually there are several available packages for certain analyses, and you may find that you prefer other packages than the ones we used today. This is what is nice about R: you can easily pick and choose, and mix and match what you like, as there are many resources available. On the other hand, this can be overwhelming!
If you have any questions or suggestions, feel free to ask/suggest.
R
. Set your working directory to the folder of your choice, in which you will save all your work from the practical.Your working directory is the base directory for your current R
session. When you want to save something in R
, R
first automatically directs you to your working directory. Find our what your current working directory is by typing:
getwd()
## [1] "/home/noemi/Werk/Onderwijs/Noemi R cursus/practical2"
You can set your working directory using the GUI, by clicking File…Set…working directory. Alternatively, you can specify it in the following way in the R console, using the function setwd()
.
setwd("c:/documents/Rcodes/Practical2") # Be sure to specify the correct file path inside setwd for your PC
R
-script. In your script, write code using the instructions below:#
before you comment.vecchar
with in it six elements (words or letters).matnum
with 2 columns and six rows of elements (numbers).dat_chanum
out of vecchar
and matnum
with 3 columns and six rows of elements. Remember: Dataframes and lists can contain data that have mode characters, as well as data that have mode numeric, while matrices and vectors can only contain 1 type of data.list_alles
that contains vecchar
, matnum
, and dat_chanum
.Now, Run your code by selecting the code in you R-script, and pressing Ctrl-R (Windows), Ctrl-Enter (Linux), or Cmd-Enter (Mac). Call each of your objects to see what they look like. Inspect the mode of each of the four objects you made.
# Exercise 1
vecchar <- c("apple", "pear", "orange", "cherry", "peach", "kiwi")
matnum <- matrix(c(1,2,0,4,1,0,3,5,1,10,2,4), ncol = 2, nrow = 6, byrow=FALSE)
dat_chanum <- data.frame(vecchar,matnum)
list_alles <- list(vecchar,matnum,dat_chanum)
tvecchar
matnum
dat_chanum
list_alles
mode(tvecchar)
mode(matnum)
mode(dat_chanum)
mode(list_alles)
Practical2.R
in your working directory.You can use the standard Ctrl-s (Windows/Linux) or Cmd-s (Mac) or use the file menu in the GUI. It is a good idea to regularly save your R
script throughout this practical.
ctrl l
. Browse through your history by pressing the up arrow key
in the R-console.ls()
rm(list = ls())
ls()
R
comes with a bunch of datasets. Use the function data()
to see what data is available. Call the dataset ToothGrowth
, and the dataset women
. Obtain information on the datasets using ?ToothGrowth and ?women. Inspect the structure of the datasets.data()
ToothGrowth
women
?ToothGrowth
?women
str(ToothGrowth)
str(women)
mean(ToothGrowth[,1]) ##Or alternatively mean(ToothGrowth$len)
mean(ToothGrowth[,2]) ##Or alternatively mean(ToothGrowth$supp)
mean(ToothGrowth[,3]) ##Or alternatively mean(ToothGrowth$dose)
sd(ToothGrowth[,1]) ##Or alternatively sd(ToothGrowth$len)
sd(ToothGrowth[,2]) ##Or alternatively sd(ToothGrowth$supp)
sd(ToothGrowth[,3]) ##Or alternatively sd(ToothGrowth$dose)
#Interestingly, the function mean does not work for factor variables, but the function sd does because it treats factors as a truely numeric variable (and it uses the factor levels to calculate the sd).
mean(women$height) ##Or alternatively mean(women[,1])
mean(women$weight) ##Or alternatively mean(women[,2])
sd(women$height) ##Or alternatively sd(women[,1])
sd(women$weight) ##Or alternatively sd(women[,2])
#Bonuspoints for using the apply function:
apply(ToothGrowth,2,mean) ## the apply function does not work here because the function mean does not work for our factor variable. Instead, you can use lapply or sapply, which are the same as the apply function, but then specifically for a list object. With these functions, you apply a function to each object in a list (and as you may remember, although dataframes look like matrices, they are secretly lists). lapply returns the calculated values in a list format, and sapply returns the values in a numeric format. Another demonstration why it is handy to know about different types of objects, and the modes of data, when you are working in R.
lapply(ToothGrowth, mean)
sapply(ToothGrowth,mean)
apply(women,2,mean) # for dataset women, all three apply functions work because although it is a dataframe, it contains only truely numeric variables, and in that sense behaves like a matrix.
lapply(women,mean)
sapply(women,mean)
apply(ToothGrowth,sd)
sapply(ToothGrowth,sd)
lapply(ToothGrowth,sd)
lapply(women,sd)
ToothGrowth
. Make a scatterplot for the height and weight of American women in dataset women
.hist(ToothGrowth$len)
plot(women$height, women$weight)
correlatie<-cor.test(women[,1],women[,2])
correlatie
regressie<-lm(ToothGrowth$len ~ 1 + ToothGrowth$supp + ToothGrowth$dose)
regressie
summary(regressie)
It is also possible to add an interaction to the model.
interactie<-lm(ToothGrowth$len ~ 1 + ToothGrowth$supp + ToothGrowth$dose + ToothGrowth$supp*ToothGrowth$dose)
interactie
summary(interactie)
In this exercise we will work with a dataset that we have to load into R. It is an adjusted version of the dataset that can be found here:http://spss.allenandunwin.com.s3-website-ap-southeast-2.amazonaws.com/data_files.html. The data contains 12 variables: participant id, sex, the participants main source of stress (Work, Family/Friends, Money/Finances, or Health/ilness), six items from an optimism questionnaire, a total score on a life satisfaction questionnaire (higher = more satisfied), a total score on a stress questionnaire (higher score = more stress), and a total score on a self esteem questionnaire (higher score = more self esteem).
Unfortunately, our data is in two separate files. One part of the data is stored in the .Rdata file RegressionANOVAdata1.Rdata
. The second part is stored in the RegressionANOVAdata2.sav
file. We will have to load both files, and get them into one dataframe. Further, we need to calculate sum or mean scores for the optimism questionnaire, of which some items are reverse coded. When we are done, we will save the data in a separate file. In exercise 3, we use the data for a more elaborate regression analysis (and we also obtain anova statistics) in R.
###empty and check global environment.
rm(list = ls())
ls()
## character(0)
### great, now we can start with a clean slate.
Load the .Rdata file with the function “load()”. You specify the path to the file in between the round brackets. If the datafiles are saved in your working directory, you only have to specify the filename of the file.
load("c:\practical2\data\RegressionANOVAdata1.Rdata") ##use the correct path to the datafile for your computer :)
###or if the data is in the working directory
load("RegressionANOVAdata1.Rdata")
Your data is now loaded.
str()
. Use the function head()
on the dataset.Rdata files contains objects from R. Lets see what objects we loaded into our global environment when we loaded our datafile.
ls()
## [1] "datapart1"
Call datapart1
to see part 1 of our data. Use str() to inspect the structure of the data.
ls()
datapart1
You can use the function head() to only inspect the first few cases of your dataframe.
head(datapart1)
## id sex mainsource tlifesat tpstress tslfest
## 2 9 MALES Work 30 22 34
## 3 425 FEMALES Family or Friends 33 19 31
## 4 307 MALES Work 33 31 40
## 5 440 MALES Work 16 27 21
## 7 341 FEMALES Money/Finances 5 39 18
## 8 300 MALES Work 25 39 34
RegressionANOVAdata2.sav
in spss. To read in spss data in R, we need a package. Install and load package Hmisc
. Then, load the second part of the data in file RegressionANOVAdata2.sav
into R with the function spss.get()
.You can intall a package using the buttons in the gui: .. … Then you need to choose a mirror where to download the package from, and pick the package you would like to install from a large alphabetical list.
You can also install a package using code in the R console, like this:
install.packages("Hmisc")
If the install was succesful, you will see this message at the end: “The downloaded source packages are in…”
Now we need to load the installed package in order to be able to use its functions.
library("Hmisc")
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
You may see it also loads some other packages the Hmisc
package needs.
When it is loaded, we can use the function spss.get()
to load our spss file, and transform it into an R dataframe. Note: This function is based on the function read.spss()
in package foreign
, which is a more well known function and package for importing spss data to R (when you google get spss file in r, you will most likely find package foreign before you find package Hmisc). However, spss.get() adds some helpful functionality, such as being able to read spss date variables and translating them to R data variables, and dealing better with variable labels.
Assign the data you read in with spss.get() to an object called datapart2
.
#datapart2 <- spss.get(file="RegressionANOVAdata2.sav") #if your data is not saved in your working directory, specify the complete filepath for the file= argument.
Call the second part of the data to look at it, inspect its structure with str().
#datapart2
#head(datapart2)
#str(datapart2)
RegressionANOVAdata2.csv
in a basic text editor or excel. Then, load the second part of the data in file RegressionANOVAdata2.csv
into R with the function read.table()
.?read.table
The help file describes what the function read.table does: “Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.” That is, it is function that can be used to read for instance .csv, .txt, and .dat files, and puts it in a dataframe. In the arguments for the function you need to specify how the data was stored, for instance: - how are variables seperated from each other in the file (with a tab: sep=“”, or a space: sep=" “, or a comma: sep=”," or…) - if comma’s or dots are used to indicate decimal points - if there are variable names in the first line of the data file or not.
Assign the data you read in to an object called datapart2
.
datapart2 <- read.table(file="RegressionANOVAdata2.csv", header=TRUE, sep="\t") #if your data is not saved in your working directory, specify the complete filepath for the file= argument. We specify the argument header = TRUE because the variable names are specified at the top of the datafile. sep="\t" because the data is tab delimited.
Call the second part of the data to look at it, inspect its structure with str().
datapart2
head(datapart2)
str(datapart2)
fulldata
. Now add the data in datapart2 to fulldata
, such that the variables in datapart2 are in columns 7 to 12 of fulldata
.fulldata<-datapart1
fulldata[,7:12]<-datapart2
head(fulldata)
op2
,op4
, and op6
) are reversely scored: The items are scored from 1 to 5, and the higher the score, the lower the optimism. Make new items that are not reversely scored (such that a high score indicates high optimism), and put them in column 13, 14, and 15 of our dataframa fulldata
.fulldata[,13]<- 6 - fulldata$op2
fulldata[,14]<- 6 - fulldata$op4
fulldata[,15]<- 6 - fulldata$op6
The names of our newly made variables are V13, V14 and V15. We can use function colnames
(stands for column names) to change the names.
colnames(fulldata)[13:15]<- c("op2R", "op4R", "op6R")
fulldata
.Rirst, select the variables we want to sum from the dataset. We can store it in variable “items”. Then use the apply function to sum over each row of the object items.
items <- fulldata[,c(7,9,11,13:15)]
fulldata[,16]<- apply(items,MARGIN=1,FUN=sum) ##or for mean scores rather than sumscores: apply(items,MARGIN=1,FUN=mean)
Alternatively, you can do it in one go like this:
fulldata[,16]<- apply(fulldata[,c(7,9,11,13:15)],MARGIN=1,FUN=sum) ##or for mean scores rather than sumscores: apply(fulldata[,c(7,9,11,13:15)],MARGIN=1,FUN=mean)
We can give our new variable a suitable name using colnames()
colnames(fulldata)[16]<- c("sumOpt")
save()
, and in an .csv or .txt file using write.table()
.###In an Rdata file you can easily save many objects you made in R.
## to save just fulldata you would do:
save(fulldata, file="DataforExercise3.Rdata") ### If you want to save your file somewhere other than your working directory, specify the complete path, including the filename. Like this: file = "c:\practical2\data\DataExercise2.Rdata".
## to save fulldata, datapart1, and datapart2
save(fulldata, datapart1,datapart2, file="DataforExercise3.Rdata")
You can load this file into R like we did in exercise 2.A.
Now, to save it as a .txt or .csv:
write.table(fulldata, file="DataforExercise3.csv", sep="\t", row.names=FALSE, col.names=TRUE) #I chose to make a tab-delimited, but you can choose anything to separate the variables, for example, sep=";".
You now have successfully organized your data in R. We will analyse this dataset in Exercise 3 and 4.
In this exercise we will work with a dataset that we have to load into R. It is an adjusted version of the dataset that can be found here:http://spss.allenandunwin.com.s3-website-ap-southeast-2.amazonaws.com/data_files.html. The data contains 16 variables: - participant
id
sex
the participants main source of stress (Work, Family/Friends, Money/Finances, or Health/ilness)
six items from an optimism questionnaire and three reversed items from the optimism questionnaire
a total score on a life satisfaction questionnaire (higher = more satisfied)
a total score on a stress questionnaire (higher score = more stress)
a total score on a self esteem questionnaire (higher score = more self esteem)
a total score on an optimism questionnaire (higher score = more optimism)
We will perform two regression analyses on the data to find out if gender moderates the relationship between life satisfaction and optimism. We will also be checking some assumptions.
If you did Exercise 2, load one of the datafiles you made in exercise 2G: the file “DataforExercise3.csv” using read.table()
, or the file “DataforExercise3.Rdata” using load()
.
If you did not do Exercise 2, open and inspect the file ’DataRegression.csv in a text editor or in excel. Then load it using function read.table().
?read.table
The help file describes what the function read.table does:“Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.” That is, it is function that can be used to read for instance .csv, .txt, and .dat files, and puts it in a dataframe. In the arguments for the function you need to specify how the data was stored, for instance: - how are variables seperated from each other in the file (with a tab: sep=“”, or a space: sep=" “, or a comma: sep=”," or…) - if comma’s or dots are used to indicate decimal points - if there are variable names in the first line of the data file or not.
Assign the data you read in to an object called fulldataset
.
fulldataset <- read.table(file="DataRegression.csv", header=TRUE, sep="\t") #if your data is not saved in your working directory, specify the complete filepath for the file= argument. We specify the argument header = TRUE because the variable names are specified at the top of the datafile. sep="\t" because the data is tab delimited.
You can also use the function head()
to only inspect the first few cases of your dataframe.
head(fulldataset)
## id sex mainsource tlifesat tpstress tslfest op1 op2 op3 op4
## 1 9 MALES Work 30 22 34 2 3 4 3
## 2 425 FEMALES Family or Friends 33 19 31 3 1 3 3
## 3 307 MALES Work 33 31 40 3 1 5 3
## 4 440 MALES Work 16 27 21 3 2 3 2
## 5 341 FEMALES Money/Finances 5 39 18 3 5 1 4
## 6 300 MALES Work 25 39 34 4 1 3 1
## op5 op6 op2R op4R op6R sumOpt
## 1 5 4 3 3 2 19
## 2 3 4 5 3 2 19
## 3 5 1 5 3 5 26
## 4 1 3 4 4 3 18
## 5 1 4 1 2 2 10
## 6 4 2 5 5 4 25
datareg
with all variables except the individual optimism items (op1 to op6, and op2R, op4R, and op6R). Inspect the structure of the data with str(). Use summary() on the data to obtain some descriptives.datareg<-fulldataset[,-c(7:15)]
str(datareg)
summary(datareg) ##Note that for the sumscore variables there are some missing data indicated with "NA". These will be listwise deleted automatically when we use these variables for the regression analysis in R.
Some things you can do : - Make a histogram for the dependent variable using hist()
. Give the plot the main title “Histogram of Life Satisfaction”.
Get the correlation matrix for all the continuous variables using cor()
.
Obtain the average life satisfaction for men, and for women, using mean()
.
Make a scatterplot of life satisfaction and optimism, for only the women using plot()
. Use the argument col="red"
to plot in red. Use the argument xlab to label the x axis “Optimism”.
Add the points for men to the plot using points()
. Use the argument col="blue"
to plot in blue, and pch=3 to change the shape of the points.
hist(datareg[,"tlifesat"], main="Histogram of Life Satisfaction")
cor(datareg[,4:7], use="pairwise.complete.obs") # What happens when you do not specify the second argument, na.rm=TRUE?
#make two filter variables, one to select men and one to select women:
filtf = datareg[,2]=="FEMALES"
filtm = filtf==FALSE
#use the filter variables to get the life satisfaction means for women and men
mean(datareg[filtf,"tlifesat"], na.rm=TRUE) #na.rm = TRUE to not use the missings
mean(datareg[filtm,"tlifesat"], na.rm=TRUE) #na.rm = TRUE to not use the missings
#use the filter variables to make a scatterplot for women
plot(datareg[filtf,"sumOpt"],datareg[filtf,"tlifesat"],col="red", xlab= "Optimism", ylab="Life Satisfaction")
# Then add the points for men
points(datareg[filtm,"sumOpt"],datareg[filtm,"tlifesat"], col="blue", pch=2)
reg1
. Use summary() to get some more results, such as p-values for the regression coefficients and R-squared. Interpret the results.reg1 <- lm(tlifesat ~ 1+ sex + sumOpt + sex*sumOpt, data=datareg)
summary(reg1)
First, we will make some plots to see if our residuals are normally distributed, and to evaluate if there is heteroskedasticity.
Use plot() on your regression analysis to get some diagnostic plots.
Obtain the predicted values and residuals using predict()
and residuals()
on the regression analysis.
Make a histogram for the residuals.
use the function shapiro.test()
on the residuals to test if the assumption of normally distributed residuals is violated.
plot(reg1)
# The first plot we see shows a plot of our predicted values and the residuals. The plot should look like a pretty much evenly distributed blob of points. It also labels some potential outliers.
# Press enter to go to the next plot. This is a QQplot for the residuals, and if they are normally distributed the should fall on a straight line. It again labels some potential outliers.
# Disregard the third plot
# The fourth plot can be used for outlier diagnostics, and gives on overview of the standardized residuals and leverage values.
resids = residuals(reg1)
pred = predict(reg1)
hist(resids)
shapiro.test(resids)
## we can also remake the basic heteroskedasticityplot ourselves, quite simply:
plot(x=pred,y=resids)
In this exercise we will work with a dataset that we have to load into R. It is an adjusted version of the dataset that can be found here:http://spss.allenandunwin.com.s3-website-ap-southeast-2.amazonaws.com/data_files.html. The data contains 16 variables: - participant
id
sex
the participants main source of stress (Work, Family/Friends, Money/Finances, or Health/ilness)
six items from an optimism questionnaire and three reversed items from the optimism questionnaire
a total score on a life satisfaction questionnaire (higher = more satisfied)
a total score on a stress questionnaire (higher score = more stress)
a total score on a self esteem questionnaire (higher score = more self esteem)
a total score on an optimism questionnaire (higher score = more optimism)
We will perform a factorual ANOVA to find out if and how experienced stress is related to the type of main source of stress, and to gender.
If you did Exercise 2, load one of the datafiles you made in exercise 2G: the file “DataforExercise3.csv” using read.table()
, or the file “DataforExercise3.Rdata” using load()
.
If you did not do Exercise 2, open and inspect the file ’DataRegression.csv in a text editor or in excel. Then load it using function read.table().
?read.table
The help file describes what the function read.table does:“Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.” That is, it is function that can be used to read for instance .csv, .txt, and .dat files, and puts it in a dataframe. In the arguments for the function you need to specify how the data was stored, for instance: - how are variables seperated from each other in the file (with a tab: sep=“”, or a space: sep=" “, or a comma: sep=”," or…) - if comma’s or dots are used to indicate decimal points - if there are variable names in the first line of the data file or not.
Assign the data you read in to an object called fulldataset
.
fulldataset <- read.table(file="DataRegression.csv", header=TRUE, sep="\t") #if your data is not saved in your working directory, specify the complete filepath for the file= argument. We specify the argument header = TRUE because the variable names are specified at the top of the datafile. sep="\t" because the data is tab delimited.
You can also use the function head()
to only inspect the first few cases of your dataframe.
head(fulldataset)
## id sex mainsource tlifesat tpstress tslfest op1 op2 op3 op4
## 1 9 MALES Work 30 22 34 2 3 4 3
## 2 425 FEMALES Family or Friends 33 19 31 3 1 3 3
## 3 307 MALES Work 33 31 40 3 1 5 3
## 4 440 MALES Work 16 27 21 3 2 3 2
## 5 341 FEMALES Money/Finances 5 39 18 3 5 1 4
## 6 300 MALES Work 25 39 34 4 1 3 1
## op5 op6 op2R op4R op6R sumOpt
## 1 5 4 3 3 2 19
## 2 3 4 5 3 2 19
## 3 5 1 5 3 5 26
## 4 1 3 4 4 3 18
## 5 1 4 1 2 2 10
## 6 4 2 5 5 4 25
datareg
with all variables except the individual optimism items (op1 to op6, and op2R, op4R, and op6R). Inspect the structure of the data with str(). Use summary() on the data to obtain some descriptives.datareg<-fulldataset[,-c(7:15)]
str(datareg)
summary(datareg) ##Note that for the sumscore variables there are some missing data indicated with "NA". These will be listwise deleted automatically when we use these variables for the regression analysis in R.
Some things you can do : - Make a histogram for the dependent variable using hist()
. Give the plot the main title “Histogram of Stress”.
Obtain the stress level satisfaction for men, and for women, using mean()
.
Make boxplots of stress per main source of stress, for only the women using plot()
. Use the argument col="red"
to plot in red. Use the argument xlab to label the x axis “main source of stress”.
Make boxplots of stress per main source of stress, for only the men using plot()
. Use the argument col="blue"
to plot in red. Use the argument xlab to label the x axis “main source of stress”.
hist(datareg[,"tpstress"], main="Histogram of Stress")
#make two filter variables, one to select men and one to select women:
filtf = datareg[,2]=="FEMALES"
filtm = filtf==FALSE
#use the filter variables to get the life satisfaction means for women and men
mean(datareg[filtf,"tlifesat"], na.rm=TRUE) #na.rm = TRUE to not use the missings
mean(datareg[filtm,"tlifesat"], na.rm=TRUE) #na.rm = TRUE to not use the missings
par(mfrow=c(1,2)) ##par() can be used to set all kind of options (graphical parameters) in advance for making plots. Here we use it to specify mfrow, which is used to determine how many rows and columns a plot should have. Here we say 1 row, 2 columns. In this way, we can make two plots next to each other. Want to know more? Search for R: plotting with par() in google.
#use the filter variables to make a scatterplot for women
plot(datareg[filtf,"mainsource"],datareg[filtf,"tpstress"],col="red", xlab= "Main source of stress", ylab="Stress")
#use the filter variables to make a scatterplot for men
plot(datareg[filtm,"mainsource"],datareg[filtm,"tpstress"],col="blue", xlab= "Main source of stress", ylab="Stress")
reg_ano
.reg_ano <- lm(tpstress ~ 1+ sex + mainsource, data=datareg)
summary(reg_ano)
We probably should have checked some assumptions before we interpreted the results… In Exercise 3.E we show how to check some model assumptions for the regression analysis: If you would like to check them you can use the same techniques presented there.
anova()
on the regression analysis. Also get the ANOVA results with aovDid you know that SPSS actually performs a regression analysis in the background when you perform an ANOVA model in the general linear model menu? We are basically doing the same thing, but openly. We can get the overall ANOVA results using the function anova()
on the regression model.
anova(reg_ano)
Alternatively, it is possible to immediately fit an ANOVA. It works the same way as the regression analysis, but instead you use function aov()
.
anova1<-aov(tpstress ~ 1+ sex + mainsource, data=datareg )
summary(anova1)
Interpret the results.
TukeyHSD()
on our oav analysis.We didn’t specify any specific hypotheses about the differences in the means of stress, so we want to do post hoc tests to see if there are any paiwise differences in stress for different main sources of stress. We can do this as follows with the function TukeyHSD()
.
TukeyHSD(anova1)
Healthy is often considered most important to peoples’ the happiness. If this is the case, we may expect that stress is higher when the main source of stress is health/ilness related than for the other categories. Use the function C()
to specify contrasts (not to be confused with the vector function c()
) to compare each other main source of stress to health/ilness (basically change the reference group in the dummy coding). Use another planned contrast to compare health/ilness to the other three categories together.
In R, you specify contrasts by altering the factor variable in your dataframe. The standard ‘treatment’ contrast can be used for a standard dummy coding, and we can change the reference category wit the argument base=..
. Check the order of the levels of your factor variable with the function levels(), and the current specified contrasts for the factor variable with contrasts()
. Then use the number of the element that contains your reference of choice to specify the base argument. Use the contrast function in your regression analysis instead of the earlier predictor mainsource to perform the analysis with different dummy coding.
levels(datareg$mainsource)
contrasts(datareg$mainsource)
reg_ano2<-lm(tpstress ~ 1+ sex + C(datareg$mainsource, contr=contr.treatment, base=2), data=datareg)
summary(reg_ano2)
# Note if you want to change the factor in your dataset to have a certain contrast (not just only for one analysis), you should change the factor in the dataset as follows:
datareg$mainsource<-C(datareg$mainsource, contr=contr.treatment, base=2)
##check the contrast set for your factor variable
contrasts(datareg$mainsource)
Interpret the results.
C()
To do a contrast that is not default in R, we need to specify our own ’contrast matrix`. This is a matrix with contrast coding, with in each column a specific contrast.
# We only want to specify one contrast, so our matrix will have 1 column (so that it can actually be considered a vector), and a row for each level of our categorical predictor variable. Remember, we want to compare the second level to all the other levels combined. To specify an orthogonal contrast we want our contrast codings to sum up to zero, and the categories that should be pooled together should be assigned the same number.
contrast_mat <- matrix(c(1,-3,1,1),4,1)
##now use this matrix in the C() function. We should also specify in the C() function that we want only 1 contrast with the arguments how.many - the default is one less than the number of levels.
reg_ano3<-lm(tpstress ~ 1+ sex + C(datareg$mainsource, contr=contrast_mat, how.many=1), data=datareg)
summary(reg_ano3)
Interpret the results.
To do sem analyses in R we will use package lavaan
. Lavaan is not the only sem package available in R. There is also package ‘sem’, and package ‘openmx’ (and there may be even more packages available.) Lavaan is in my opinion most userfriendly, so we will use this today. It aims to provide an alternative to mplus, can mimic mplus, and is continuously being developed. You may like to try the other packages some time as well - openmx is known as having a quite steep learning curve, but when you know it apparently you can fit almost any model with it.
To start, read a bit about Lavaan on its website: http://lavaan.ugent.be/. The website alsno has a nice tutorial you can follow. Note also that there are some things you can’t do with lavaan yet (from http://lavaan.ugent.be/tutorial/before.html):
“The lavaan package is not finished yet. But it is already very useful for most users, or so we hope. However, some important features that are currently NOT available in lavaan are:
support for hierarchical/multilevel datasets (multilevel cfa, multilevel sem)
support for discrete latent variables (mixture models, latent classes)
Bayesian estimation
We hope to add these features in the next (two?) year(s) or so."
We will start by installing the package, loading, data, and fitting a very basic confirmatory factor analysis on the continuous scores of 72 boys on 6 measures of certain cognitive skills, just to get you started. If you want to practice with more difficult models, you can practice by following the lavaan tutorial.
You can intall a package using the buttons in the gui: .. … Then you need to choose a mirror where to download the package from, and pick the package you would like to install from a large alphabetical list.
You can also install a package using code in the R console, like this:
install.packages("lavaan")
If the install was succesful, you will see this message at the end: “The downloaded source packages are in…”
Now we need to load the installed package in order to be able to use its functions.
library("lavaan")
## This is lavaan 0.5-20
## lavaan is BETA software! Please report any bugs.
Boys.dat
into R by following the instructions below.First, open and inspect the file Boys.dat
in a text editor or in excel. This file Boys.dat
contains the scores of 72 boys on 6 measures of cognitive skills. Note that the variable names are not included in the top of the file. In order, the six variables are: - Visual perception scores
Scores on a test of spatial visualization
Scores on a test of spatial orientation
Paragraph comprehension scores
Sentence completion scores
Word meaning test scores
Load the file Boys.dat
into R
using function read.table()
.
?read.table
The help file describes what the function read.table does:“Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.” That is, it is function that can be used to read for instance .csv, .txt, and .dat files, and puts it in a dataframe. In the arguments for the function you need to specify how the data was stored, for instance: - how are variables seperated from each other in the file (with a tab: sep=“”, or a space: sep=" “, or a comma: sep=”," or…) - if comma’s or dots are used to indicate decimal points - if there are variable names in the first line of the data file or not.
Assign the data you read in to an object called boysdata
.
boysdata<- read.table(file="Boys.dat", header=FALSE, sep="\t") #if your data is not saved in your working directory, specify the complete filepath for the file= argument. We specify the argument header = FALSE because the variable names are not specified at the top of the datafile. sep="\t" because the data is tab delimited.
The names for the variables in our dataframe are V1 to V6. Give them more appropriate names using function colnames().
colnames(boysdata)<- c("visper","spatvis", "spator", "parcom", "sencompl", "wordmean")
It is assumed that the six variables in boysdata measure two factors, that is Spatial Ability, which is measured by the first three variables, and Verbal Ability, which is measured by the last three variables. Specify the two-factor model in lavaan.
Use the lavaan tutorial: http://lavaan.ugent.be/tutorial/syntax1.html to see how the syntax works.
We will later use function cfa() or sem() to fit the model. For these functions, lavaan will autoamtically scale the model for you. Further, by default, lavaan will include variances for all observed and latent variables, and covariances between the latent variables to the model, so you do not have to specify these in the model. That means that if you do not want those parameters included in the model, you will have to restrict these parameters to zero (we will try restricting parameters later). Alternatively, you can use function lavaan(), then you will have to specify the full model, exactly like you want it, yourself.
Call your lavaan model object boysmodel1
.
boysmodel1 <- '
### Two factors, spatial and verbal
spatial =~ visper + spatvis + spator
verbal =~ parcom + sencompl + wordmean
'
Look at the many arguments you can specify for function cfa() and sem() by looking up the functions help files. Currently these functions do basically the same thing, so it does not matter which one you will use.
Fit your model with lavaan using function sem or cfa. Call your sem analysis cfa1.
cfa1 <- sem(model=boysmodel1, data=boysdata)
cfa1 is now on object of lavaan class. Look at all the things you can obtain lavaan objects in the helpfiles by calling ?lavaan-class
, under the heading ‘methods’. Here you will find all kinds of functions you can use on the lavaan object. Look at the summary() function for the lavaan class object at the bottom.
Inspect the results with summary(). Look at what results lavaan returns by default and interpret the results. How does lavaan automatically scale the model - in the factor loadings, or in the factor variances? Does the two-factor model fit the data? Are the two factors correlated?
summary(cfa1)
Now, use summary() and specify an argumentto obtain fit measures with your results.
summary(cfa1,fit.measures=TRUE)
Does the two-factor model fit the data?
Lavaan does many things in the same way as Mplus, but not always. You can specify that you want lavaan to run exactly like Mplus - if they figured out how to do that for your type of model - by specifying mimic=“Mplus” in the sem() or cfa() function. Try it.
cfa1 <- sem(model=boysmodel1, data=boysdata, mimic="Mplus")
summary(cfa1)
Lavaan automatically scaled our model by fixing one factor loading to 1 for each factor. However, we may want to scale in the factor variances instead. We can do this by fixing the factor variances to one, and freeing the factor loadings.
Use the lavaan tutorial: http://lavaan.ugent.be/tutorial/syntax1.html to see how to do more with the model syntax.
Change the model so scaling is done in the factor variances. Call you model boysmodel2
.
boysmodel2<- '
### Two factors, spatial and verbal
spatial =~ NA*visper + spatvis + spator ##Free loadings with NA
verbal =~ NA*parcom + sencompl + wordmean
verbal ~~ 1*verbal # variance spatial
spatial ~~ 1*spatial # variance verbal '
cfa2 <- sem(model=boysmodel2, data=boysdata)
summary(cfa2)
To do multilevel modeling in R we will use package lme4
. Lme4 is not the only mutlilevel modeling package available in R (there are quite many), but it is one of the most popular packages. One popular alternative is nlme
.
We will start by installing the package and loading data. After that, we will fit a basic multilevel model on the data.
You can intall a package using the buttons in the gui: .. … Then you need to choose a mirror where to download the package from, and pick the package you would like to install from a large alphabetical list.
You can also install a package using code in the R console, like this:
install.packages("lme4")
If the install was succesful, you will see this message at the end: “The downloaded source packages are in…”
Now we need to load the installed package in order to be able to use its functions.
library("lme4")
## Loading required package: Matrix
popular.dat
into R by following the instructions below.First, open and inspect the file popular.dat
in a text editor or in excel. This file popular.dat
is data from chapter 2 of Joop Hox’s book on multilevel modeling “Multilevel Analysis: Techniques and Applications”. It contains simulated data for 2000 pupils in 100 classes. It contains the following variables:
pupil number
class
extraversion score for each pupil
sex for each pupil
teacher experience in each class
popularity ratings for the pupil by their peers
popularity rating for the pupil bu their teacher.
standardized scores for the latter five variables presented above
centered scores for extraversion, teacher experience, and sex.
Load the file popular.dat
into R
using function read.table()
.
?read.table
The help file describes what the function read.table does:“Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.” That is, it is function that can be used to read for instance .csv, .txt, and .dat files, and puts it in a dataframe. In the arguments for the function you need to specify how the data was stored, for instance: - how are variables seperated from each other in the file (with a tab: sep=“”, or a space: sep=" “, or a comma: sep=”," or…) - if comma’s or dots are used to indicate decimal points - if there are variable names in the first line of the data file or not.
Assign the data you read in to an object called popudat
.
popudat<- read.table(file="popular.dat", header=TRUE, sep=";") #if your data is not saved in your working directory, specify the complete filepath for the file= argument. We specify the argument header = TRUE because the variable names are specified at the top of the datafile. sep=";" because the data is delimited with the symbol ;.
You can use the function head() to only inspect the first few cases of your dataframe.
head(popudat)
## pupil class extrav sex texp popular popteach Zextrav Zsex
## 1 1 1 5 girl 24 6.3 6 -0.1703149 0.9888125
## 2 2 1 7 boy 24 4.9 5 1.4140098 -1.0108084
## 3 3 1 4 girl 24 5.3 6 -0.9624772 0.9888125
## 4 4 1 3 girl 24 4.7 5 -1.7546396 0.9888125
## 5 5 1 5 girl 24 6.0 6 -0.1703149 0.9888125
## 6 6 1 4 boy 24 4.7 5 -0.9624772 -1.0108084
## Ztexp Zpopular Zpopteach Cextrav Ctexp Csex
## 1 1.486153 0.8850133 0.66905609 -0.215 9.737 0.5
## 2 1.486153 -0.1276291 -0.04308451 1.785 9.737 -0.5
## 3 1.486153 0.1616973 0.66905609 -1.215 9.737 0.5
## 4 1.486153 -0.2722923 -0.04308451 -2.215 9.737 0.5
## 5 1.486153 0.6680185 0.66905609 -0.215 9.737 0.5
## 6 1.486153 -0.2722923 -0.04308451 -1.215 9.737 -0.5
Call your multilevel model object randomint
.
#randomint <-
Look at the many arguments you can specify for function cfa() and sem() by looking up the functions help files. Currently these functions do basically the same thing, so it does not matter which one you will use.
Fit your model with lavaan using function sem or cfa. Call your sem analysis cfa1.
cfa1 <- sem(model=boysmodel1, data=boysdata)
cfa1 is now on object of lavaan class. Look at all the things you can obtain lavaan objects in the helpfiles by calling ?lavaan-class
, under the heading ‘methods’. Here you will find all kinds of functions you can use on the lavaan object. Look at the summary() function for the lavaan class object at the bottom.
Inspect the results with summary(). Look at what results lavaan returns by default and interpret the results. How does lavaan automatically scale the model - in the factor loadings, or in the factor variances? Does the two-factor model fit the data? Are the two factors correlated?
summary(cfa1)
Now, use summary() and specify an argumentto obtain fit measures with your results.
summary(cfa1,fit.measures=TRUE)
Does the two-factor model fit the data?
Lavaan does many things in the same way as Mplus, but not always. You can specify that you want lavaan to run exactly like Mplus - if they figured out how to do that for your type of model - by specifying mimic=“Mplus” in the sem() or cfa() function. Try it.
cfa1 <- sem(model=boysmodel1, data=boysdata, mimic="Mplus")
summary(cfa1)
Lavaan automatically scaled our model by fixing one factor loading to 1 for each factor. However, we may want to scale in the factor variances instead. We can do this by fixing the factor variances to one, and freeing the factor loadings.
Use the lavaan tutorial: http://lavaan.ugent.be/tutorial/syntax1.html to see how to do more with the model syntax.
Change the model so scaling is done in the factor variances. Call you model boysmodel2
.
boysmodel2<- '
### Two factors, spatial and verbal
spatial =~ NA*visper + spatvis + spator ##Free loadings with NA
verbal =~ NA*parcom + sencompl + wordmean
verbal ~~ 1*verbal # variance spatial
spatial ~~ 1*spatial # variance verbal '
cfa2 <- sem(model=boysmodel2, data=boysdata)
summary(cfa2)
That concludes all the exercises for our second practical. Be sure to save you R script if you do not want to lose your work! If you want to test your obtained skills some more, try any of the analyses with your own data.