Let’s do some polling! Go to pollev.com/kumarr436
learnr
videos
RMarkdown (and RProjects)
Why GitHub?
For you:
For students:
Motivation
Help students feel comfortable
I usually outline the core content by:
I strongly suggest an examples and exercises approach to teaching skills in R. We’ll practice this in a moment.
I always like to end with:
Open the RProject file and look in the working directory: you will see an exercises subdirectory and an answers subdirectory.
The following lesson snippets all use .R code files for the exercises. You can also ask students to use .Rmd, especially if this is part of a course where you will need to collect assignment submissions.
As we go through, ask any questions you have about how to design and use examples/exercises.
The ggplot() function
geom functions, such as geom_point() or geom_histogram()
The aes() function, nested within ggplot() or a geom function
The + operator
# Load the tidyverse (provides ggplot() and glimpse())
library(tidyverse)
# Load data
gapminder <- gapminder::gapminder
# Look at the structure of the data. You can use glimpse(), summary(), or head().
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
These will produce the same output:
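As a minimal sketch of what "the same output" can mean here, assuming the comparison is between placing aes() inside ggplot() and placing it inside the geom (the object names p1 and p2 are illustrative only):

```r
library(ggplot2)

gapminder <- gapminder::gapminder

# Aesthetics set once in ggplot() are inherited by every layer
p1 <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# Aesthetics set inside the geom apply to that layer only
p2 <- ggplot(gapminder) +
  geom_point(aes(x = gdpPercap, y = lifeExp))
```

Printing p1 and p2 should draw the same scatterplot; the difference only matters once more layers are added.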
Plot life expectancy as a function of GDP per capita for the year 2007, and add labels.
Supply gapminder07 to ggplot()
Add geom_point()
Supply x=gdpPercap and y=lifeExp to aes()
Set title, x, and y in labs()
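One possible answer, as a sketch. It assumes gapminder07 is simply the 2007 subset of gapminder (its definition is not shown here), and the label text is borrowed from the later examples:

```r
library(dplyr)
library(ggplot2)

# Assumption: gapminder07 is the 2007 rows of the gapminder data
gapminder07 <- gapminder::gapminder %>%
  filter(year == 2007)

# Data go to ggplot(), the mapping to aes(), and the labels to labs()
p <- ggplot(gapminder07, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  labs(
    title = "Do people in richer countries live longer?",
    x = "GDP per capita",
    y = "Life expectancy"
  )
p
```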
There are many geom functions we can choose from to generate geometric objects:
Let’s try to add geom_smooth() to the previous plot we created.
ggplot(gapminder07, aes(x=gdpPercap, y=lifeExp)) +
geom_point() +
geom_smooth() +
labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy", subtitle="Gapminder 2007 data")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Plot the life expectancy of each continent in 2007.
Look at the ggplot cheatsheet and decide which kind of geom to use.
Supply gapminder07 to ggplot()
Choose a geom (e.g. geom_boxplot()) and supply appropriate aesthetics in nested aes() (e.g. x=continent, y=lifeExp)
Add labels with labs()
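A sketch of one answer using geom_boxplot(). The definition of gapminder07 and the label text are assumptions, and other geoms would work equally well:

```r
library(dplyr)
library(ggplot2)

# Assumption: gapminder07 is the 2007 rows of the gapminder data
gapminder07 <- gapminder::gapminder %>%
  filter(year == 2007)

# One box per continent, summarising the spread of life expectancy
p_box <- ggplot(gapminder07, aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  labs(
    title = "Life expectancy by continent, 2007",  # hypothetical title
    x = "Continent",
    y = "Life expectancy"
  )
p_box
```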
You can think of the continent or year variables as grouping variables: they place each observation in one of several groups.
We can represent the groups through aesthetic mapping or facets rather than along one of the axes.
ggplot(gapminder, aes(x=gdpPercap, y=lifeExp)) +
geom_point() +
geom_smooth() +
facet_wrap(~year) +
labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Try adding the argument scales="free" to the facet_wrap() layer.
ggplot(gapminder, aes(x=gdpPercap, y=lifeExp)) +
geom_point() +
geom_smooth() +
facet_wrap(~year, scales="free") +
labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Visualize life expectancy by continent in 2007 again. This time, group continents by color or facet.
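Two possible approaches, sketched under the assumption that gapminder07 is the 2007 subset. The choice of a histogram and the bin count are illustrative, not the only reasonable answer:

```r
library(dplyr)
library(ggplot2)

# Assumption: gapminder07 is the 2007 rows of the gapminder data
gapminder07 <- gapminder::gapminder %>%
  filter(year == 2007)

# Option 1: map continent to a color aesthetic (fill, for histogram bars)
p_color <- ggplot(gapminder07, aes(x = lifeExp, fill = continent)) +
  geom_histogram(bins = 15) +
  labs(x = "Life expectancy", y = "Number of countries")

# Option 2: give each continent its own panel
p_facet <- ggplot(gapminder07, aes(x = lifeExp)) +
  geom_histogram(bins = 15) +
  facet_wrap(~continent) +
  labs(x = "Life expectancy", y = "Number of countries")
```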
Choose your own adventure!
Create a plot that includes two geoms and facets
# Load tidyverse and lubridate
library(tidyverse)
library(lubridate)
# Import vaccine data
vaccines <- read_csv("data/chicago_vaccines_daily.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## date = col_character(),
## doses = col_double(),
## first_dose = col_double(),
## final_dose = col_double()
## )
## Rows: 154
## Columns: 4
## $ date <chr> "12/15/2020", "12/16/2020", "12/17/2020", "05/16/2021", "05…
## $ doses <dbl> 16, 157, 1990, 4783, 4697, 5729, 3438, 936, 4385, 3, 3655, …
## $ first_dose <dbl> 16, 157, 1990, 1719, 1672, 5729, 3438, 936, 4385, 3, 3655, …
## $ final_dose <dbl> 0, 0, 0, 3140, 3120, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3074, 24…
In your exercise file, check the class of the date variable.
Check the class of the date variable:
## [1] "character"
This is just a character string. And it’s not even ordered correctly!
## [1] "12/15/2020" "12/16/2020" "12/17/2020" "05/16/2021" "05/17/2021"
## [6] "12/18/2020"
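The two checks above can be reproduced on a small stand-in vector. This sketch copies the first six values from the output rather than reading the CSV, and dates_chr is an illustrative name:

```r
# First six values copied from the output above, standing in for vaccines$date
dates_chr <- c("12/15/2020", "12/16/2020", "12/17/2020",
               "05/16/2021", "05/17/2021", "12/18/2020")

# The values are stored as text, not as dates
class(dates_chr)

# Sorting text is alphabetical, so May 2021 lands before December 2020
sort(dates_chr)
```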
It looks like as_date() might be a helpful function from lubridate. But what happens when we use it?
## Warning: All formats failed to parse. No formats found.
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [151] NA NA NA NA
Whoops! Let’s turn to the lubridate cheatsheet for help. What function should we use?
We can use the tailored mdy() function:
## [1] "2020-12-15" "2020-12-16" "2020-12-17" "2021-05-16" "2021-05-17"
## [6] "2020-12-18"
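A minimal sketch of the contrast between as_date() and mdy(), using a stand-in vector copied from the first six values shown earlier (dates_chr, failed, and converted are illustrative names):

```r
library(lubridate)

# Stand-in for vaccines$date: month/day/year text
dates_chr <- c("12/15/2020", "12/16/2020", "12/17/2020",
               "05/16/2021", "05/17/2021", "12/18/2020")

# Without a format, as_date() cannot parse month-first text and returns NA
failed <- suppressWarnings(as_date(dates_chr))

# mdy() states the order of the components: month, day, year
converted <- mdy(dates_chr)
converted
```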
Or we can specify the format of the values in the character string using the format= argument in as_date(). See the help file for strptime() for how to define formats.
## [1] "2020-12-15" "2020-12-16" "2020-12-17" "2021-05-16" "2021-05-17"
## [6] "2020-12-18"
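A sketch of the format= approach on two stand-in values. The format string is an assumption based on the month/day/year layout of the data:

```r
library(lubridate)

dates_chr <- c("12/15/2020", "05/16/2021")

# %m = month, %d = day, %Y = four-digit year (see ?strptime)
with_format <- as_date(dates_chr, format = "%m/%d/%Y")
with_format
```

The capital %Y matters here: a lowercase %y expects a two-digit year and would likely read 2021 as 2020.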
# Replace the date variable in the dataset with the converted version
vaccines$date <- mdy(vaccines$date)
# Check the class of our converted variable
class(vaccines$date)
## [1] "Date"
Now, let’s try to convert the date variable, which is in Date class, into a numeric value for month.
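One way to do this is lubridate's month(), sketched here on two stand-in dates rather than the full vaccines data:

```r
library(lubridate)

# Two stand-in values already converted to Date class
d <- mdy(c("12/15/2020", "05/16/2021"))

# month() returns the month as a number by default
months_num <- month(d)
months_num
```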
How about if we want to get the day of the week? Identify the right function using the lubridate cheatsheet. Check the help file to see if there are any useful arguments.
We can use wday(). Note the label=TRUE argument.
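A short sketch of wday() with and without label=TRUE, again on two stand-in dates:

```r
library(lubridate)

d <- mdy(c("12/15/2020", "05/16/2021"))

# By default wday() returns a number, with Sunday counted as day 1
day_num <- wday(d)

# label = TRUE returns an ordered factor of day names instead
day_lab <- wday(d, label = TRUE)
day_lab
```

The numeric codes depend on the week start (Sunday unless the lubridate.week.start option is changed), which is one reason the labelled version is easier for students to read.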
Bonus exercise if time permits: Can you calculate the number of days it took for Chicago to fully vaccinate 1 million people?
Hint: You may need to use the function cumsum()
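A sketch of the cumsum() idea on hypothetical toy numbers (not the Chicago figures), with a small threshold standing in for 1 million. The column names follow the lesson data; toy, threshold, and days_to_threshold are illustrative:

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data, deliberately out of date order like the real file
toy <- tibble(
  date = mdy(c("12/17/2020", "12/15/2020", "12/16/2020", "12/18/2020")),
  final_dose = c(40, 10, 20, 50)
)
threshold <- 60  # stand-in for 1,000,000

days_to_threshold <- toy %>%
  arrange(date) %>%                          # cumsum() depends on row order
  mutate(total_final = cumsum(final_dose)) %>%
  filter(total_final >= threshold) %>%       # keep days at or past the threshold
  summarise(days = as.numeric(min(date) - min(toy$date))) %>%
  pull(days)
days_to_threshold
```

Sorting by date first is the step students most often miss, since the running total is only meaningful in chronological order.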
Let’s use the gapminder data that we are already familiar with to practice implementing some linear regressions and examining the results.
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
Are life expectancy and GDP per capita related? A scatterplot suggests … maybe! We can investigate the relationship in a different way with a regression.
The lm() function in base R takes a formula and an argument specifying the data frame:
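The general shape of the call, sketched on a hypothetical toy data frame (the names toy, outcome, and predictor are illustrative only):

```r
# Hypothetical toy data
toy <- data.frame(
  outcome   = c(2.1, 3.9, 6.2, 7.8, 10.1),
  predictor = c(1, 2, 3, 4, 5)
)

# formula: outcome on the left of ~, predictors on the right
# data: the data frame that holds those columns
fit <- lm(outcome ~ predictor, data = toy)
coef(fit)
```

Additional predictors are added to the right-hand side with +, for example outcome ~ predictor1 + predictor2.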
Let’s implement a regression with the DV lifeExp and one IV, gdpPercap. We’ll save the results as mod1. Look at what the object class is in your environment.
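A sketch of this step. The model output refers to a variable named gdpPercap_perthou; this sketch assumes it is GDP per capita divided by 1,000, since its construction is not shown:

```r
library(dplyr)

# Assumption: gdpPercap_perthou is gdpPercap rescaled to thousands
gapminder <- gapminder::gapminder %>%
  mutate(gdpPercap_perthou = gdpPercap / 1000)

# One DV (lifeExp) and one IV, saved as mod1
mod1 <- lm(lifeExp ~ gdpPercap_perthou, data = gapminder)

# The fitted model is stored as an object of class "lm"
class(mod1)
```

Rescaling the predictor does not change the fit; it only makes the coefficient easier to read (years of life expectancy per additional thousand in GDP per capita).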
We can print out the results using summary() on the saved list object.
##
## Call:
## lm(formula = lifeExp ~ gdpPercap_perthou, data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.754 -7.758 2.176 8.225 18.426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.95556 0.31499 171.29 <2e-16 ***
## gdpPercap_perthou 0.76488 0.02579 29.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.49 on 1702 degrees of freedom
## Multiple R-squared: 0.3407, Adjusted R-squared: 0.3403
## F-statistic: 879.6 on 1 and 1702 DF, p-value: < 2.2e-16
Implement a regression with the DV lifeExp and the IVs gdpPercap and year. Then examine the results using summary().
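A sketch of one answer, under the same assumption that gdpPercap_perthou is gdpPercap divided by 1,000; mod2 is an illustrative name:

```r
library(dplyr)

# Assumption: gdpPercap_perthou is gdpPercap rescaled to thousands
gapminder <- gapminder::gapminder %>%
  mutate(gdpPercap_perthou = gdpPercap / 1000)

# Add a second IV to the right-hand side of the formula with +
mod2 <- lm(lifeExp ~ gdpPercap_perthou + year, data = gapminder)
summary(mod2)
```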
##
## Call:
## lm(formula = lifeExp ~ gdpPercap_perthou + year, data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.262 -6.954 1.219 7.759 19.553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -418.42426 27.61714 -15.15 <2e-16 ***
## gdpPercap_perthou 0.66973 0.02447 27.37 <2e-16 ***
## year 0.23898 0.01397 17.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.694 on 1701 degrees of freedom
## Multiple R-squared: 0.4375, Adjusted R-squared: 0.4368
## F-statistic: 661.4 on 2 and 1701 DF, p-value: < 2.2e-16
Alternatively, we can visualize the coefficients and uncertainty using coefplot(). But the plot below isn’t very easy to read. What can we do to improve it? Check out the help file.
We can remove the intercept using the argument intercept=FALSE.
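A sketch of the call, assuming coefplot() comes from the coefplot package and that the model being plotted is the two-predictor regression (called mod2 here for illustration):

```r
library(dplyr)
library(coefplot)  # assumption: the coefplot package provides coefplot()

# Assumption: gdpPercap_perthou is gdpPercap rescaled to thousands
gapminder <- gapminder::gapminder %>%
  mutate(gdpPercap_perthou = gdpPercap / 1000)
mod2 <- lm(lifeExp ~ gdpPercap_perthou + year, data = gapminder)

# Dropping the intercept keeps its large value from squashing the other estimates
p_coef <- coefplot(mod2, intercept = FALSE)
p_coef
```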
Predict life expectancy as a function of GDP per capita, year, and continent.
##
## Call:
## lm(formula = lifeExp ~ gdpPercap_perthou + year + continent,
## data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.4264 -4.0725 0.2154 4.4853 19.9977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -520.67458 19.79081 -26.31 <2e-16 ***
## gdpPercap_perthou 0.29675 0.01996 14.87 <2e-16 ***
## year 0.28739 0.01000 28.73 <2e-16 ***
## continentAmericas 14.32676 0.49358 29.03 <2e-16 ***
## continentAsia 9.50561 0.45670 20.81 <2e-16 ***
## continentEurope 19.39554 0.51730 37.49 <2e-16 ***
## continentOceania 20.58592 1.46895 14.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.884 on 1697 degrees of freedom
## Multiple R-squared: 0.717, Adjusted R-squared: 0.716
## F-statistic: 716.6 on 6 and 1697 DF, p-value: < 2.2e-16
Note that the factor variable continent is treated as fixed effects. The excluded (reference) category is Africa, the only continent without its own coefficient in the output above.
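One way to check which category lm() leaves out is to look at the factor's first level, which serves as the reference. This sketch also shows relevel() in case a different baseline is wanted; continent_asia is an illustrative name:

```r
continent <- gapminder::gapminder$continent

# continent is stored as a factor
class(continent)

# The first level is the reference category that lm() omits
levels(continent)[1]

# relevel() picks a different reference, e.g. Asia as the baseline
continent_asia <- relevel(continent, ref = "Asia")
levels(continent_asia)[1]
```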
RMarkdown, RProjects, and RStudio itself.
Examples from NU:
Other resources:
learnr interactive tutorials, from RStudio
Take it to the next level (suggestions from Christina Maimone):