2019-11-04

Slides online

Learning objectives

You will learn to:

  • use dplyr / purrr for efficient data manipulation
  • tidying multiple linear models using broom
  • managing related things together in one tibble
  • summarise findings in one ggplot using relevant aesthetics

guided practical

interactive session

Managing multiple models

list-column cheatsheet

reminder

Keep all analyse steps (cols) per group (rows) together

reminder

workflow from last time

mtcars %>%
  group_nest(cyl) %>%
  mutate(model = map(data, ~lm(mpg ~ wt, data = .x)),
         summary = map(model, summary),
         r_squared = map_dbl(summary, "r.squared"))
# A tibble: 3 x 5
    cyl data               model  summary    r_squared
  <dbl> <list>             <list> <list>         <dbl>
1     4 <tibble [11 × 10]> <lm>   <smmry.lm>     0.509
2     6 <tibble [7 × 10]>  <lm>   <smmry.lm>     0.465
3     8 <tibble [14 × 10]> <lm>   <smmry.lm>     0.423

Gapminder

gapminder is a fact tank

dataset

Hans Rosling

  • died 2 years ago
  • fundamentaly optimistic
  • great talk

Guided practical

explore gapminder

  • install the gapminder package
  • load gapminder and tidyverse packages
  • use the pipe %>% to pass gapminder to ggplot()
  • plot the life expectency (lifeExp in y) ~ year (x)
  • use geom_line()

warning!

  • mind the grouping!

Gapminder

global vs individual trend

library(gapminder)
gapminder %>%
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line()

Keep related things together using list-column

Keep related things together

group_nest()

One country example

Germany

from original tibble

gapminder %>%
  filter(country == "Germany") %>%
  select(-country, -continent)
# A tibble: 12 x 4
    year lifeExp      pop gdpPercap
   <int>   <dbl>    <int>     <dbl>
 1  1952    67.5 69145952     7144.
 2  1957    69.1 71019069    10188.
 3  1962    70.3 73739117    12902.
 4  1967    70.8 76368453    14746.
 5  1972    71   78717088    18016.
 6  1977    72.5 78160773    20513.
 7  1982    73.8 78335266    22032.
 8  1987    74.8 77718298    24639.
 9  1992    76.1 80597764    26505.
10  1997    77.3 82011073    27789.
11  2002    78.7 82350671    30036.
12  2007    79.4 82400996    32170.

nested tibble

by_country %>%
  filter(country == "Germany")
# A tibble: 1 x 3
  continent country data             
  <fct>     <fct>   <list>           
1 Europe    Germany <tibble [12 × 5]>
by_country %>%
  filter(country == "Germany") %>%
  unnest(data)
# A tibble: 12 x 7
   continent country  year lifeExp      pop gdpPercap year1950
   <fct>     <fct>   <int>   <dbl>    <int>     <dbl>    <dbl>
 1 Europe    Germany  1952    67.5 69145952     7144.        2
 2 Europe    Germany  1957    69.1 71019069    10188.        7
 3 Europe    Germany  1962    70.3 73739117    12902.       12
 4 Europe    Germany  1967    70.8 76368453    14746.       17
 5 Europe    Germany  1972    71   78717088    18016.       22
 6 Europe    Germany  1977    72.5 78160773    20513.       27
 7 Europe    Germany  1982    73.8 78335266    22032.       32
 8 Europe    Germany  1987    74.8 77718298    24639.       37
 9 Europe    Germany  1992    76.1 80597764    26505.       42
10 Europe    Germany  1997    77.3 82011073    27789.       47
11 Europe    Germany  2002    78.7 82350671    30036.       52
12 Europe    Germany  2007    79.4 82400996    32170.       57

What happens in the DATA FRAME, stays in the data frame

Las Vegas principle, add linear models

  • using by_country
  • add a new column model with linear regressions of lifeExp on year1950
  • save as by_country_lm

ask yourself

  • if you see add column, do you use mutate or summarise?
  • dealing with a list-column (here the column data), do you need to use map?

Keep related things together

linear models

Explore a list column

  • count # rows per country using the data column
  • does any country have less data than others?
  • plot lifeExp ~ year1950 for Bulgaria by unnesting data

reminder

  • a list column is a list, you need to iterate through elements
  • for the plotting the steps are:
    • filter() for the desired country
    • unnest() raw data
    • pipe to ggplot()

Explore a list column

count how many rows per country

by_country_lm %>%
  mutate(n = map_int(data, nrow)) %>%
  select(country, n)
# A tibble: 142 x 2
   country                      n
   <fct>                    <int>
 1 Algeria                     12
 2 Angola                      12
 3 Benin                       12
 4 Botswana                    12
 5 Burkina Faso                12
 6 Burundi                     12
 7 Cameroon                    12
 8 Central African Republic    12
 9 Chad                        12
10 Comoros                     12
# … with 132 more rows
by_country_lm %>%
  mutate(n = map_int(data, nrow)) %>%
  distinct(n)
# A tibble: 1 x 1
      n
  <int>
1    12

Bulgaria

by_country_lm %>%
  filter(country == "Bulgaria") %>%
  unnest(data) %>%
  ggplot(aes(x = year1950, y = lifeExp)) +
  geom_line()

Explore nested tibble

  • display the summary for the linear model of Rwanda
  • how do you interpret the \(r^2\) for this particular model?

reminder

  • filter() for the desired country
  • list column is a list
  • to extract the xth element, use the pluck("model", x) purrr syntax
  • pipe this unique model to summary()

Explore nested tibble

linear model for Rwanda

by_country_lm %>%
  filter(country == "Rwanda") %>%
  pluck("model", 1) %>%
  summary()
Call:
lm(formula = lifeExp ~ year1950, data = .x)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.310  -1.445   2.410   3.073   6.021 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.83361    3.74890  11.426 4.63e-07 ***
year1950    -0.04583    0.10969  -0.418    0.685    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.558 on 10 degrees of freedom
Multiple R-squared:  0.01716,   Adjusted R-squared:  -0.08112 
F-statistic: 0.1746 on 1 and 10 DF,  p-value: 0.6849

\(r^2\) is close to 0, linearity sounds broken

broom will cleanup lm elements into tibbles

broom cleanup

Tidying models

extract from nested lists

  • using by_country_lm, add 4 new columns:
    • glance, using the broom function on the model column
    • tidy, using the broom function on the model column
    • augment, using the broom function on the model column
    • rsq from the glance column
  • save as models
  • why extracting the \(r^2\) in the main tibble is useful?

reminder

  • use map when dealing with a list column
  • in map, shortcut with quotes (like "r.squared") extract the specified variable
  • map takes and returns a list. Use map_dbl() to coerce output to doubles

Tidying models

extract from nested lists

useful info

  • coefficients estimates:
    • slope
    • intercept
  • \(r^2\)
  • residuals
library(broom)
models <- by_country_lm %>%
  mutate(glance  = map(model, glance),
         tidy    = map(model, tidy),
         augment = map(model, augment),
         rsq     = map_dbl(glance, "r.squared"))
models
# A tibble: 142 x 8
   continent country      data     model  glance   tidy    augment      rsq
   <fct>     <fct>        <list>   <list> <list>   <list>  <list>     <dbl>
 1 Africa    Algeria      <tibble… <lm>   <tibble… <tibbl… <tibble … 0.985 
 2 Africa    Angola       <tibble… <lm>   <tibble… <tibbl… <tibble … 0.888 
 3 Africa    Benin        <tibble… <lm>   <tibble… <tibbl… <tibble … 0.967 
 4 Africa    Botswana     <tibble… <lm>   <tibble… <tibbl… <tibble … 0.0340
 5 Africa    Burkina Faso <tibble… <lm>   <tibble… <tibbl… <tibble … 0.919 
 6 Africa    Burundi      <tibble… <lm>   <tibble… <tibbl… <tibble … 0.766 
 7 Africa    Cameroon     <tibble… <lm>   <tibble… <tibbl… <tibble … 0.680 
 8 Africa    Central Afr… <tibble… <lm>   <tibble… <tibbl… <tibble … 0.493 
 9 Africa    Chad         <tibble… <lm>   <tibble… <tibbl… <tibble … 0.872 
10 Africa    Comoros      <tibble… <lm>   <tibble… <tibbl… <tibble … 0.997 
# … with 132 more rows

extracting \(r^2\) in main tibble

  • no need to unnest for sort / filtering.

Exploratory plots

plotting \(r^2\) for countries

  • plot country ~ rsq
  • reorder country levels by \(r^2\) (rsq): snake plot
  • color points per continent
  • which continent shows most of the low \(r^2\) values?

reminder

to reorder discrete values:

  • must be of data type factor
  • use the forcats package
  • fct_reorder() to reorder according to a continuous variable

Do linear models fit all countries?

snake plot

library(forcats)
models %>%
  ggplot(aes(x = rsq, 
             y = fct_reorder(country,
                             rsq))) +
  geom_point(aes(colour = continent), 
             alpha = 0.5) +
  theme_classic(18) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = c(0.25, 0.75)) +
   guides(color = guide_legend(
     override.aes = list(alpha = 1))) +
  labs(x = "r square",
       y = "Country") 

display the real data for countries with a low \(r^2\)

  • focus on non-linear trends
  • filter the 20 countries with the lowest \(r^2\)
  • unnest column data
  • plot lifeExp ~ year with lines
  • colour per continent
  • facet per country
  • same questions for the top 20 \(r^2\)

reminder

  • arrange(col) will sort according to col
  • top_n(x, col) not only sort col but return only x top entries
  • top_n(x, desc(col)) same but sort from lowest values

Exploratory plots

focus on non-linear trends

Exploratory plots

focus on linear trends

interpreting the linear model

regression

  • what represents the intercept?
    • using year1950?
    • using year?
    • justify Hadley choice
  • what represents the slope?

Germany, lifeExp ~ year1950

filter(models, country == "Germany") %>%
  unnest(tidy) %>%
  select(rsq:estimate)
# A tibble: 2 x 6
    rsq augment            p.value statistic std.error estimate
  <dbl> <list>               <dbl>     <dbl>     <dbl>    <dbl>
1 0.990 <tibble [12 × 9]> 7.65e-21     282.    0.238     67.1  
2 0.990 <tibble [12 × 9]> 3.15e-11      30.7   0.00696    0.214

Germany, lifeExp ~ year

# A tibble: 2 x 3
    rsq term        estimate
  <dbl> <chr>          <dbl>
1 0.990 (Intercept) -350.   
2 0.990 year           0.214

Summarise on one plot

by Hadley Wickham

  • unnest coefficients (tidy column)
    • mind to keep the continent, country and rsq columns
  • put intercept and slope in their own columns
    • in wide format, only one value can be used.
    • discard unused columns.
  • plot slope ~ intercept (watch out the (Intercept) name which needs to be called between backsticks ‘`’)
  • colour per continent
  • size per \(r^2\) (use for scale_size_area() for lisibility)
  • add tendency with geom_smooth(method = "loess")

All in all

by Hadley Wickham

models %>%
  unnest(tidy) %>%
  select(continent, country, rsq, term, estimate) %>%
  spread(term, estimate) %>%
  ggplot(aes(x = `(Intercept)`, y = year1950)) +
  geom_point(aes(colour = continent, size = rsq)) +
  geom_smooth(se = FALSE, method = "loess") +
  scale_size_area() + labs(x = "Life expectancy (1950)", y = "Yearly improvement")

animation made easy

takes ~ 5 minutes due to easing

gganimate by Thomas Pedersen

library(gganimate)
gapminder %>%
  ggplot(aes(x = gdpPercap,
             y = lifeExp,
             size = pop, 
             color = continent)) +
  transition_time(year) +
  ease_aes("linear") +
  scale_size(range = c(2, 12)) +
  geom_point() +
  theme_bw(16) +
  labs(title = "Year: {frame_time}", 
       x = "GDP per capita", 
       y = "life expectancy") +
  scale_x_log10() -> p
animate(p)
anim_save("gapminder2.gif")

Before we stop

You learned to:

  • keep related things together:
    • input data
    • meaningful grouping ids
    • perform modelling
    • extract relevant model components
    • explore visually your findings

Acknowledgments

  • Hadley Wickham
  • Jennifer Bryan
  • David Robinson
  • Thomas Pedersen
  • Eric Koncina (iosp R package for slides)

R workshop @Beval

  • Data processing with R tidyverse 4 days in May 2020:
  • no fees, registration via Exilir-LU
  • 2 ECTS for the PhD students who attend the workshop and
    complete a short project.