R is an open source software that offers tremendous flexibility and capability. Unfortunately some of this flexibility comes at the cost of a steeper learning curve, but have no fear!
October 17, 2019
R is an open source software that offers tremendous flexibility and capability. Unfortunately some of this flexibility comes at the cost of a steeper learning curve, but have no fear!
Recommended software:
install.packages("rmarkdown")
First, let’s make R a little bit easier for everyone. When we use R, we have the option of adding comments to our code.
#This is a comment
Everything on a line to the left of a #
is a comment. R will ignore comments, so you can use comments to give yourself and your collaborators an explanation; leave yourself a reminder; or skip a line of code.
Comments are worth your time!!
Let’s start with some simple mathematical operations.
2+2
## [1] 4
2^3
## [1] 8
R uses standard order of operations
3+4*(4^2)
## [1] 67
We can also test for (in)equalities
2 == 3
## [1] FALSE
2 < 3
## [1] TRUE
2^2 == 4
## [1] TRUE
(Notice that we use “==” instead of “=” to test equality. The “=” operator is for setting things equal to one another.)
R can also store and use objects that we create, like the single value
x <- 2*3
which we can output using
print(x)
## [1] 6
or
x
## [1] 6
We can also create a vector of values using the c
command, which combines all of the elements inside the parantheses into a vector or list.
y <- c(2,4,7) y
## [1] 2 4 7
We can also access all or part of our vector:
y[1]
## [1] 2
y[3]
## [1] 7
R numbers its vector and matrices starting at 1. If you’re familiar with programming languages like C, this is something to be mindful of.
We can now work with an entire vector instead of individual values.
y+2
## [1] 4 6 9
x*y
## [1] 12 24 42
We can see a complete list of the objects we have saved
ls()
## [1] "x" "y"
We can overwrite the objects we have saved
x <- 10 x <- 20 x
## [1] 20
We can also create sequences of variables
x <- 1:10 x
## [1] 1 2 3 4 5 6 7 8 9 10
or
x <- seq(from = 1, to = 10, by = 1) x
## [1] 1 2 3 4 5 6 7 8 9 10
If we want to change the boundaries and increments,
x <- seq(from = 1, to = 12, by = .5) x
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 ## [15] 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0
repeat values a certain number of times,
x <- rep(1, 10) x
## [1] 1 1 1 1 1 1 1 1 1 1
or repeat an entire vector a particular number of times
x <- rep(c(1,4), 10) x
## [1] 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4
we can do those too!
Combining all of this…
rep(1:10, 10) + seq(1, 50.5, .5)
## [1] 2.0 3.5 5.0 6.5 8.0 9.5 11.0 12.5 14.0 15.5 7.0 8.5 10.0 11.5 ## [15] 13.0 14.5 16.0 17.5 19.0 20.5 12.0 13.5 15.0 16.5 18.0 19.5 21.0 22.5 ## [29] 24.0 25.5 17.0 18.5 20.0 21.5 23.0 24.5 26.0 27.5 29.0 30.5 22.0 23.5 ## [43] 25.0 26.5 28.0 29.5 31.0 32.5 34.0 35.5 27.0 28.5 30.0 31.5 33.0 34.5 ## [57] 36.0 37.5 39.0 40.5 32.0 33.5 35.0 36.5 38.0 39.5 41.0 42.5 44.0 45.5 ## [71] 37.0 38.5 40.0 41.5 43.0 44.5 46.0 47.5 49.0 50.5 42.0 43.5 45.0 46.5 ## [85] 48.0 49.5 51.0 52.5 54.0 55.5 47.0 48.5 50.0 51.5 53.0 54.5 56.0 57.5 ## [99] 59.0 60.5
If need be, we can remove certain objects from the working environment
rm(x)
or clear the working environment entirely
rm(list=ls())
We can do a lot with the few things we’ve already learned! But now we want to think about applying those to data.
To set up a working directory where all of your packages and other R files will be downloaded to / uploaded from
getwd()
## [1] "C:/Users/GradQuant/Dropbox (UCR GradSuccess)/UCR GradSuccess Team Folder/2019-20 UCR GradSuccess/GradQuant/Workshops/02_Fall/Statistics in R"
# setwd("C:\\Users\\GradQuant\\Desktop") getwd()
## [1] "C:/Users/GradQuant/Dropbox (UCR GradSuccess)/UCR GradSuccess Team Folder/2019-20 UCR GradSuccess/GradQuant/Workshops/02_Fall/Statistics in R"
We can import different types of data
data = read.table("hmnrghts.txt", header=TRUE) data = read.csv ("hmnrghts.csv", header=TRUE)
In RStudio, we can do this using “Import Dataset” in the “Environment” tab.
R also has a huge number of built in datasets (most of which are in specific packages). We can use these to test functions or to practice our R skills.
data(mtcars)
If we want to find out about the mtcars dataset
?mtcars
We’ll rename this “data” so that it’s easier to work with
data <- mtcars
To view the first few rows
head(data)
## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
or the last few rows
tail(data)
## mpg cyl disp hp drat wt qsec vs am gear carb ## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2 ## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 ## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4 ## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6 ## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8 ## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
The head
and tail
functions are a nice way to make sure that your data was imported correctly without trying to print out too much. This is especially important for larger datasets.
Say we wanted to view just one column. We might be temped to try
mpg
but that won’t work.
Instead, we need to be specific about where our column comes from
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 ## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 ## [29] 15.8 19.7 15.0 21.4
That’s more like it! The $
command tells R that we want to extract the named element mpg
out of mtcars
.
We can also select columns by number. Say we want the first three. We can use either format:
newdata <- data[,c(1,2,3)] newdata <- data[,c(1:3)]
head(newdata)
## mpg cyl disp ## Mazda RX4 21.0 6 160 ## Mazda RX4 Wag 21.0 6 160 ## Datsun 710 22.8 4 108 ## Hornet 4 Drive 21.4 6 258 ## Hornet Sportabout 18.7 8 360 ## Valiant 18.1 6 225
If we want columns that are not adjacent to each other, say the first and third through sixth,
newdata<-data[,c(1,3:6)] head(newdata)
## mpg disp hp drat wt ## Mazda RX4 21.0 160 110 3.90 2.620 ## Mazda RX4 Wag 21.0 160 110 3.90 2.875 ## Datsun 710 22.8 108 93 3.85 2.320 ## Hornet 4 Drive 21.4 258 110 3.08 3.215 ## Hornet Sportabout 18.7 360 175 3.15 3.440 ## Valiant 18.1 225 105 2.76 3.460
We can also use this concept to remove variables by using a minus sign:
droppeddata <- data[,c(-3,-5)] head(droppeddata)
## mpg cyl hp wt qsec vs am gear carb ## Mazda RX4 21.0 6 110 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 110 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 93 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 110 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 175 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 105 3.460 20.22 1 0 3 1
Or a more concise way for multiple columns (now we remove three through five, instead of three and five)
newdata <- data[,c(-3:-5)]
We can select certain rows the same way
newdata <- data[1:5,]
Base R has a lot of functionality, but the real power comes from it’s open-sourced nature. User-written packages greatly expand R’s capabilities, but they need to be installed and loaded.
install.packages("matrixStats") library(matrixStats)
With a package loaded, we can use our help function to examine its documentation
?matrixStats
All packages should have documentation, but some have much better documentation than others!
Now let’s do some basic statistics on our mtcars dataset…this is what R is built for!
Let’s start by finding the mean and standard deviation
mean(data$mpg)
## [1] 20.09062
sd(data$mpg)
## [1] 6.026948
If we want more complete information, we can use
summary(data$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 10.40 15.43 19.20 20.09 22.80 33.90
or something like
psych::describe(data$mpg)
## vars n mean sd median trimmed mad min max range skew kurtosis ## X1 1 32 20.09 6.03 19.2 19.7 5.41 10.4 33.9 23.5 0.61 -0.37 ## se ## X1 1.07
Notice how we were able to use a function from a package without loading it - we called the psych
package as part of the describe
command.
The V/S variable tells us whether the car is a V-engine or a straight engine. We will compare mpg based on the V/S variable.
First, we need to make sure that vs
is a factor (categorical variable)
is.factor(data$vs)
## [1] FALSE
It’s not, so we need to convert it
data$vs <- as.factor(data$vs)
Let’s try it again
is.factor(data$vs)
## [1] TRUE
That’s better.
Now, using that describe
function again, we can examine the data broken down by group
psych::describeBy(data$mpg, group=data$vs)
## ## Descriptive statistics by group ## group: 0 ## vars n mean sd median trimmed mad min max range skew kurtosis ## X1 1 18 16.62 3.86 15.65 16.42 2.97 10.4 26 15.6 0.48 -0.05 ## se ## X1 0.91 ## -------------------------------------------------------- ## group: 1 ## vars n mean sd median trimmed mad min max range skew kurtosis ## X1 1 14 24.56 5.38 22.8 24.34 6 17.8 33.9 16.1 0.41 -1.4 ## se ## X1 1.44
Finally, we can do our t-test
t.test(mpg~vs, data=data)
## ## Welch Two Sample t-test ## ## data: mpg by vs ## t = -4.6671, df = 22.716, p-value = 0.0001098 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -11.462508 -4.418445 ## sample estimates: ## mean in group 0 mean in group 1 ## 16.61667 24.55714
Way significant!
We can so the same analysis in the form of a linear model (here we’re showing a regression model). This gives the same results as our t-test. (Phew!)
model <- lm(mpg~vs, data) summary(model)
## ## Call: ## lm(formula = mpg ~ vs, data = data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -6.757 -3.082 -1.267 2.828 9.383 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 16.617 1.080 15.390 8.85e-16 *** ## vs1 7.940 1.632 4.864 3.42e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 4.581 on 30 degrees of freedom ## Multiple R-squared: 0.4409, Adjusted R-squared: 0.4223 ## F-statistic: 23.66 on 1 and 30 DF, p-value: 3.416e-05
We can run an ANOVA for more than two groups. We have three cylinder groups. They aren’t saved as factors, but we have a shortcut:
anova <- aov(mpg~as.factor(cyl), data) summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F) ## as.factor(cyl) 2 824.8 412.4 39.7 4.98e-09 *** ## Residuals 29 301.3 10.4 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can also extract an ANOVA from a model:
model <- lm(mpg~as.factor(cyl), data) anova <- aov(model) summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F) ## as.factor(cyl) 2 824.8 412.4 39.7 4.98e-09 *** ## Residuals 29 301.3 10.4 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These two ANOVA tables are exactly the same! So why might we want the model?
Let’s take a look at all the elements in our model
names(model)
## [1] "coefficients" "residuals" "effects" "rank" ## [5] "fitted.values" "assign" "qr" "df.residual" ## [9] "contrasts" "xlevels" "call" "terms" ## [13] "model"
This tells us the different elements that we can get from R’s storage of our model.
For instance, we can check the residuals…
model$residuals
## Mazda RX4 Mazda RX4 Wag Datsun 710 ## 1.25714286 1.25714286 -3.86363636 ## Hornet 4 Drive Hornet Sportabout Valiant ## 1.65714286 3.60000000 -1.64285714 ## Duster 360 Merc 240D Merc 230 ## -0.80000000 -2.26363636 -3.86363636 ## Merc 280 Merc 280C Merc 450SE ## -0.54285714 -1.94285714 1.30000000 ## Merc 450SL Merc 450SLC Cadillac Fleetwood ## 2.20000000 0.10000000 -4.70000000 ## Lincoln Continental Chrysler Imperial Fiat 128 ## -4.70000000 -0.40000000 5.73636364 ## Honda Civic Toyota Corolla Toyota Corona ## 3.73636364 7.23636364 -5.16363636 ## Dodge Challenger AMC Javelin Camaro Z28 ## 0.40000000 0.10000000 -1.80000000 ## Pontiac Firebird Fiat X1-9 Porsche 914-2 ## 4.10000000 0.63636364 -0.66363636 ## Lotus Europa Ford Pantera L Ferrari Dino ## 3.73636364 0.70000000 -0.04285714 ## Maserati Bora Volvo 142E ## -0.10000000 -5.26363636
But these aren’t very interesting on their own. Let’s check the residuals for normality using a qqplot.
qqnorm(model$residuals) qqline(model$residuals)
Say we wanted to correlate mpg with engine displacement.
cor(data$mpg, data$disp)
## [1] -0.8475514
but that doesn’t give much output. Let’s try this instead
cor.test(data$mpg, data$disp)
## ## Pearson's product-moment correlation ## ## data: data$mpg and data$disp ## t = -8.7472, df = 30, p-value = 9.38e-10 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## -0.9233594 -0.7081376 ## sample estimates: ## cor ## -0.8475514
summary(glm(am~qsec, data=data, family=binomial()))
## ## Call: ## glm(formula = am ~ qsec, family = binomial(), data = data) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.3062 -1.0448 -0.7826 1.2151 1.6189 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 4.7389 4.0452 1.171 0.241 ## qsec -0.2882 0.2279 -1.265 0.206 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 43.230 on 31 degrees of freedom ## Residual deviance: 41.465 on 30 degrees of freedom ## AIC: 45.465 ## ## Number of Fisher Scoring iterations: 4
If we get the order wrong, it returns an error (or does the wrong thing!)
lm(data, mpg~vs)
But we can use any order we like if we specify the argument
lm(data=data, formula = mpg~vs)
## ## Call: ## lm(formula = mpg ~ vs, data = data) ## ## Coefficients: ## (Intercept) vs1 ## 16.62 7.94
There’s more to learn about R, but that’s all we have time for today.
We will be having more intermediate R workshops this school year! Previous years’ workshops have included “Graphing in R” and “Data Manipulation in R”. We plan to do similar workshops soon.
If you have questions in the meantime, please feel free to schedule a consultation or come to drop-in hours!