Teaching R

A brief introduction

Kumar Ramanathan

@kumarhk

May 20, 2021

Warm up

Let’s do some polling! Go to pollev.com/kumarr436

Learning objectives for today

  1. Construct a lesson plan for an R workshop/session
  2. Understand the examples-and-exercises approach to teaching R
  3. Articulate questions about how to teach R workshops/sessions
  4. Identify resources to address those questions

Outline

  • Preparing to teach
  • Determining the scope of your lesson
  • Building your lesson (nuts-and-bolts)
  • Structuring your lesson (what goes after what?)
  • Examples and Exercises

Preparing to teach

Why should you learn to teach R?

  • Teaching is the best way to learn 🤓
  • Lots of opportunities to teach R workshops/etc at Northwestern 👨‍🏫
  • Build your teaching portfolio 📂
  • Training for certain non-academic career paths 💸

Questions to ask yourself

  • What are your learning objectives for the course/session/workshop?
  • What are the opportunities and constraints you’ll have in your teaching environment?
  • How much background do your students have? How much do they need?

Learning objectives

  • Much of this will be determined by context: the students may need specific skills (e.g. regression, data viz) for a class, you may be teaching general-purpose skills for data analysis, etc.
  • I suggest that every R workshop should share two objectives: Articulate questions about the [technique/method taught] and Identify resources to address those questions.
  • In simpler language: you want students to walk away knowing how to ask for help when they run into problems.

Teaching environment

  • One-off workshop vs. series
  • Virtual vs. in-person
  • Solo teaching vs. group teaching

Determining the scope of your lesson

Background knowledge

  • What do the students already know? You may be able to answer this from program structure, or through a survey.
  • Sometimes students will be coming in with different levels of background knowledge. You can try to address this before the session and/or adjust your teaching accordingly.

Do you need to cover the basics? What are the basics?

  • Installing R and RStudio
  • Navigating RStudio
  • “code”, “comments”, “objects”
  • Syntax and data types
  • Data structures
  • Reading and writing files

Before the lesson

  • Registration, where you can ask students to list skills & how comfortable they feel, their goals, etc.
  • Assigning materials for background knowledge before the session: pre-session videos, DataQuest, learnr videos
  • Communicate info with students: versions of R and RStudio needed, what packages are needed, location of shared materials, etc.

Building your lesson

Building your lesson

  • Befriend RMarkdown (and RProjects)
  • Consider how to store and share materials: Github, Box folder, something else, none of the above
  • Other types of tools: RStudio Cloud
  • Connect to other course materials, pre-workshop assignments, etc.

The GitHub option

Why GitHub?

For you:

  • Version control!
  • Good for collaboration
  • Good integration with RStudio, easy-to-use desktop application
  • Useful skill for data science & programming outside social science
  • Training via NUIT RCS; training within department coming soon

For students:

  • Easy download options
  • They can see how your code evolved
  • Helpful exposure to common tool in data science & programming

Structuring your lesson

Where to start

Motivation

  • Use the end point as motivation: show them what they will learn!
  • Pick some real data that the students are likely to be interested in

Help students feel comfortable

  • Remind them that they are learning a skill, which only comes with practice
  • Encourage your students to learn from each other

Core content

I usually outline the core content by:

  1. Drafting the end products and breaking down each step.
  2. Looking up existing teaching materials on the same/related topic.

I strongly suggest an examples and exercises approach to teaching skills in R. We’ll practice this in a moment.

Building flexibility into your lesson

  • Plan for a little bit more material than you can teach
  • Provide data files that students can play around with
  • For complex skills, give yourself wiggle room to skip over exercises based on timing/interest

Ending with encouragement

I always like to end with:

  1. Main takeaways
  2. Resources
  3. Reminder that they’ve just learned to create things!

Examples and Exercises

Learning by doing

Learning by doing

  • I usually have three components: slides, lecture notes, and exercises
  • Students follow along from the exercises file, with the lecture notes as reference
  • Commenting the exercises file is key: it helps students locate themselves, and makes it easier for them to come back to the materials and learn

Lesson snippets

Open the RProject file and look in the working directory: you will see an exercises subdirectory and an answers subdirectory.

The following lesson snippets all use .R code files for the exercises. You can also ask students to use .Rmd, especially if this is part of a course where you will need to collect assignment submissions.

As we go through, ask any questions you have about how to design and use examples/exercises.

Lesson snippet 1: ggplot and the grammar of graphics

Components of a basic plot

  • data: a data frame, provided to the ggplot() function
  • geometric objects: the objects/shapes that you want to plot, indicated through one of the many available geom functions, such as geom_point() or geom_hist()
  • aesthetic mapping: the mapping from the data to the geometric objects, provided in an aes() function nested within ggplot() or a geom function
  • connected with the + operator
ggplot(data = <DATA FRAME>) + 
  <GEOM_FUNCTION>(mapping = aes(<VARIABLES>))
ggplot(<DATA FRAME>) + 
  <GEOM_FUNCTION>(aes(<VARIABLES>))

Prepare data

# Load data
gapminder <- gapminder::gapminder

# Look at the structure of the data. You can use glimpse(), summary(), or head().
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
# Create a new data frame with only the data for 2007
gapminder07 <- filter(gapminder, year==2007)

A basic scatterplot

ggplot(gapminder) + 
    geom_point(aes(x=year, y=pop))

A basic scatterplot

These will produce the same output:

ggplot(gapminder) + 
    geom_point(aes(x=year, y=pop))
ggplot(gapminder, aes(x=year, y=pop)) + 
    geom_point()

Add labels to the plot

ggplot(gapminder) + 
    geom_point(aes(x=year, y=pop)) + 
    labs(title="Population over time", x="Year", y="Population")

Your turn!

Plot life expectancy as a function of GDP per capita for the year 2007, and add labels.

  • Step 1: Supply the data gapminder07 to ggplot()
  • Step 2: Choose geom_point() + Supply x=gdpPercap and y=lifeExp to aes()
  • Step 3: Add title, x, and y in labs()

Your turn!

ggplot(gapminder07) + 
    geom_point(aes(x=gdpPercap, y=lifeExp)) + 
    labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy")

Choosing geoms

There are may geom functions we can choose to generate geometric objects:

Let’s try to add geom_smooth() to the previous plot we created.

Smoothed conditional means

ggplot(gapminder07) + 
    geom_point(aes(x=gdpPercap, y=lifeExp)) + 
    geom_smooth(aes(x=gdpPercap, y=lifeExp)) + 
    labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy", subtitle="Gapminder 2007 data")
ggplot(gapminder07, aes(x=gdpPercap, y=lifeExp)) + 
    geom_point() + 
    geom_smooth() + 
    labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy", subtitle="Gapminder 2007 data")

Smoothed conditional means

ggplot(gapminder07, aes(x=gdpPercap, y=lifeExp)) + 
    geom_point() + 
    geom_smooth() + 
    labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy", subtitle="Gapminder 2007 data")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Horizontal line

ggplot(gapminder07) + 
    geom_point(aes(x=gdpPercap, y=lifeExp)) + 
    geom_hline(aes(yintercept=mean(lifeExp))) + 
    labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy", subtitle="Gapminder 2007 data")

Your turn!

Plot the life expectancy of each continent in 2007.

Look at the ggplot cheatsheet and decide which kind of geom to use.

  • Step 1: Supply gapminder07 to ggplot()
  • Step 2: Choose a geom (e.g. geom_boxplot()) and supply appropriate aesthetics in nested aes() (e.g. x=continent, y=lifeExp)
  • Step 3: Add some labels in labs()

Life expectancy in each continent

ggplot(gapminder07) + 
    geom_boxplot(aes(x=continent, y=lifeExp)) + 
    labs(title="Distribution of life expectancy", x="Continent", y="Distribution")

Life expectancy in each continent

ggplot(gapminder07) + 
    geom_col(aes(x=continent, y=lifeExp)) + 
    labs(title="Distribution of life expectancy", x="Continent", y="Distribution")

Grouping variables

You can think of the continent or year variables as grouping variables: they place each observation in one of several groups.

We can represent the groups through aesthetic mapping or facets rather than along one of the axes.

Grouping by color

ggplot(gapminder) + 
    geom_point(aes(x=gdpPercap, y=lifeExp, col=year)) + 
    labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy")

Grouping by facet

ggplot(gapminder, aes(x=gdpPercap, y=lifeExp)) + 
    geom_point() + 
    geom_smooth() + 
    facet_wrap(~year) + 
    labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Grouping by facet … scales=“free”

Try adding the argument scales="free" to the facet_wrap() layer.

Grouping by facet … scales=“free”

ggplot(gapminder, aes(x=gdpPercap, y=lifeExp)) + 
    geom_point() + 
    geom_smooth() + 
    facet_wrap(~year, scales="free") + 
    labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Your turn!

Visualize life expectancy by continent in 2007 again. This time, group continents by color or facet.

Grouping by colors

ggplot(gapminder07, aes(x=gdpPercap, y=lifeExp, 
                        col=continent, size=pop)) + 
    geom_point() + 
    labs(title="Life expectancy as a function of GDP per capita, by continent", x="GDP per capita", y="Life expectancy")

Grouping by facet

ggplot(gapminder07, aes(x=lifeExp)) + 
    geom_density() + 
    facet_wrap(~continent, scales="free") + 
    labs(title="Life expectancy as a function of GDP per capita, by continent", x="GDP per capita", y="Life expectancy")

Cumulative exercise

Choose your own adventure!

Create a plot that includes two geoms and facets

Lesson snippet 2: working with dates

Prepare libraries and data

# Load tidyverse and lubridate
library(tidyverse)
library(lubridate)

# Import vaccine data
vaccines <- read_csv("data/chicago_vaccines_daily.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   date = col_character(),
##   doses = col_double(),
##   first_dose = col_double(),
##   final_dose = col_double()
## )
# Let's glimpse the data
glimpse(vaccines)
## Rows: 154
## Columns: 4
## $ date       <chr> "12/15/2020", "12/16/2020", "12/17/2020", "05/16/2021", "05…
## $ doses      <dbl> 16, 157, 1990, 4783, 4697, 5729, 3438, 936, 4385, 3, 3655, …
## $ first_dose <dbl> 16, 157, 1990, 1719, 1672, 5729, 3438, 936, 4385, 3, 3655, …
## $ final_dose <dbl> 0, 0, 0, 3140, 3120, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3074, 24…

Working with dates

In your exercise file, check the class of the date variable.

Working with dates

Check the class of the date variable:

class(vaccines$date)
## [1] "character"

This is just a character string. And it’s not even ordered correctly!

head(vaccines$date)
## [1] "12/15/2020" "12/16/2020" "12/17/2020" "05/16/2021" "05/17/2021"
## [6] "12/18/2020"

Working with dates

It looks like as_date() might be a helpful function from lubridate. But what happens when we use it?

as_date(vaccines$date)
## Warning: All formats failed to parse. No formats found.
##   [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [151] NA NA NA NA

Whoops! Let’s turn to the lubridate cheatsheet for help. What function should we use?

Converting dates

From lubridate cheatsheet

Converting dates

We can use the tailored mdy() function:

mdy(vaccines$date) %>% head()
## [1] "2020-12-15" "2020-12-16" "2020-12-17" "2021-05-16" "2021-05-17"
## [6] "2020-12-18"

Or we can speficy the format of the values in the character string using the format= argument in as_date(). See the help file for strptime() for how to define formats.

as_date(vaccines$date, format="%m/%d/%y") %>% head()
## [1] "2020-12-15" "2020-12-16" "2020-12-17" "2020-05-16" "2020-05-17"
## [6] "2020-12-18"

Converting dates

# Replace the date variable in the dataset with the converted version
vaccines$date <- mdy(vaccines$date)

# Check the class of our converted variable
class(vaccines$date)
## [1] "Date"

Utility of the Date format

ggplot(vaccines, aes(x=date, y=doses)) + geom_col() 

Converting dates

Now, let’s try to convert the date variable, which is in Date class, into a numeric value for month.

Converting dates

# Create a new variable for month. Refer to the cheatsheet for guidance.
vaccines$month <- month(vaccines$date)

# Repeat the plot, using month as the x-axis
ggplot(vaccines, aes(x=month, y=doses)) + geom_col()

Converting dates

How about if we want to get the day of the week? Identify the right function using the lubridate cheatsheet. Check the help file to see if there are any useful arguments.

Converting dates

We can use wday(). Note the label=TRUE argument.

# Create a new variable for day of the week, with label=TRUE
vaccines$wday <- wday(vaccines$date, label=TRUE)

# Repeat the plot, using day of the week as the x-axis
ggplot(vaccines, aes(x=wday, y=doses)) + geom_col()

Working with dates

Bonus exercise if time permits: Can you calculate the number of days it took for Chicago to fully vaccinate 1 million people?

Hint: You may need to use the function cumsum()

Lesson snippet 3: regression and coefplot

Regression models in R

Let’s use the gapminder data that we are already familiar with to practice implementing some linear regressions and examining the results.

Preparing data

# Import data
gapminder <- gapminder::gapminder

# Examine data
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
# Create GDP per capita (per thousand dollars) variable
gapminder$gdpPercap_perthou <- gapminder$gdpPercap/1000

Life expectancy and GDP per capita

Are life expectancy and GDP per capita related? A scatterplot suggests … maybe! We can investigate the relationship in a different way with a regression.

plot(lifeExp ~ gdpPercap_perthou, data=gapminder)

Implementing a regression

The lm() function in base R takes a formula and an argument specifying the data frame:

lm(formula = DEP_VAR ~ IV1 + IV2, data = DATAFRAME)

Let’s implement a regression with the DV lifeExp and one IV, gdpPercap. We’ll save the results as mod1. Look at what the object class is in your environment.

mod1 <- lm(lifeExp ~ gdpPercap_perthou, data=gapminder)

Examining regression results

We can print out the results using summary() on the saved list object.

summary(mod1)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap_perthou, data = gapminder)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.754  -7.758   2.176   8.225  18.426 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       53.95556    0.31499  171.29   <2e-16 ***
## gdpPercap_perthou  0.76488    0.02579   29.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.49 on 1702 degrees of freedom
## Multiple R-squared:  0.3407, Adjusted R-squared:  0.3403 
## F-statistic: 879.6 on 1 and 1702 DF,  p-value: < 2.2e-16

Your turn! A multivariate model

Implement a regression with the DV lifeExp and the IVs gdpPercap and year.

Then examine the results using summary().

A multivariate model

mod2 <- lm(lifeExp ~ gdpPercap_perthou + year, data=gapminder)

summary(mod2)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap_perthou + year, data = gapminder)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.262  -6.954   1.219   7.759  19.553 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -418.42426   27.61714  -15.15   <2e-16 ***
## gdpPercap_perthou    0.66973    0.02447   27.37   <2e-16 ***
## year                 0.23898    0.01397   17.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.694 on 1701 degrees of freedom
## Multiple R-squared:  0.4375, Adjusted R-squared:  0.4368 
## F-statistic: 661.4 on 2 and 1701 DF,  p-value: < 2.2e-16

Examining regression results

Alternatively, we can visualize the coefficients and uncertainty using coefplot(). But the below isn’t very easy to read. What can we do to improve it? Check out the help file.

coefplot(mod2)

Examining regression results

We can remove the intercept using the argument intercept=FALSE.

coefplot(mod2, intercept=FALSE)

Add a title to the plot

coefplot(mod2, intercept=FALSE, title="Coefficients from Model 2")
coefplot(mod2, intercept=FALSE) + labs(title="Coefficients from Model 2")

One more multivariate model

Predict life expectancy as a function of GDP per capita, year, and continent.

One more multivariate model

mod3 <- lm(lifeExp ~ gdpPercap_perthou + year + continent, data=gapminder)

summary(mod3)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap_perthou + year + continent, 
##     data = gapminder)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.4264  -4.0725   0.2154   4.4853  19.9977 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -520.67458   19.79081  -26.31   <2e-16 ***
## gdpPercap_perthou    0.29675    0.01996   14.87   <2e-16 ***
## year                 0.28739    0.01000   28.73   <2e-16 ***
## continentAmericas   14.32676    0.49358   29.03   <2e-16 ***
## continentAsia        9.50561    0.45670   20.81   <2e-16 ***
## continentEurope     19.39554    0.51730   37.49   <2e-16 ***
## continentOceania    20.58592    1.46895   14.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.884 on 1697 degrees of freedom
## Multiple R-squared:  0.717,  Adjusted R-squared:  0.716 
## F-statistic: 716.6 on 6 and 1697 DF,  p-value: < 2.2e-16

Note that the character variable continent is treated as fixed effects. The excluded category is Asia

One more multivariate model

coefplot(mod3, intercept=F)

Summing up the lesson snippets

Key points on examples and exercises

  • How do you choose the right examples?
    • I like to come up with the end product, and break down the skills you need to get there
    • You can also look at examples in existing learning resources
    • Keep it engaging: choose relevant data and build on each output
    • Pick something you actually do in real life! It will be more interesting to you and to students
  • Comment your code!
  • Making sure your code works
    • I write in this sequence: lecture notes, answers, exercises, slides
    • Restart your R session and run the code in answers & exercises to make sure each works independently

Wrapping up

Takeaways

  1. Lesson planning is really important: Develop learning objectives. Assess what your students already know + what they need.
  2. Use the tools that R provides: RMarkdown, RProjects, and RStudio itself.
  3. Construct your lessons around examples and exercises, so that students practice writing code during the session.
  4. Lean on the broad community of R users and social scientists. You almost never have to start from scratch.

Resources

Examples from NU:

Other resources:

Take it to the next level (suggestions from Christina Maimone):

You can do it!

Is “Full House” still popular?