Case study - gapminder

2019-11-04

Slides online

Rmd on gitlab

https://tinyurl.com/gap-canbio

Learning objectives

You will learn to:

use dplyr / purrr for efficient data manipulation
tidying multiple linear models using broom
managing related things together in one tibble
summarise findings in one ggplot using relevant aesthetics

guided practical

interactive session

Managing multiple models

Tutorial based on the great conference by Hadley Wickham

list-column cheatsheet

reminder

`nest()`

see pdf on rstudio’ github repository

Keep all analyse steps (cols) per group (rows) together

reminder

workflow from last time

mtcars %>%
  group_nest(cyl) %>%
  mutate(model = map(data, ~lm(mpg ~ wt, data = .x)),
         summary = map(model, summary),
         r_squared = map_dbl(summary, "r.squared"))

# A tibble: 3 x 5
    cyl data               model  summary    r_squared
  <dbl> <list>             <list> <list>         <dbl>
1     4 <tibble [11 × 10]> <lm>   <smmry.lm>     0.509
2     6 <tibble [7 × 10]>  <lm>   <smmry.lm>     0.465
3     8 <tibble [14 × 10]> <lm>   <smmry.lm>     0.423

Gapminder

gapminder is a fact tank

dataset

from an independent Swedish foundation
R package by Jenny Bryan!
install from CRAN: gapminder

Hans Rosling

died 2 years ago
fundamentaly optimistic
great talk

Guided practical

explore gapminder

05:00

install the gapminder package
load gapminder and tidyverse packages
use the pipe %>% to pass gapminder to ggplot()
plot the life expectency (lifeExp in y) ~ year (x)
use geom_line()

warning!

mind the grouping!

Gapminder

global vs individual trend

library(gapminder)
gapminder %>%
  ggplot(aes(x = year, y = lifeExp, group = country)) +
  geom_line()

Keep related things together using list-column

Keep related things together

`group_nest()`

One country example

Germany

from original tibble

gapminder %>%
  filter(country == "Germany") %>%
  select(-country, -continent)

# A tibble: 12 x 4
    year lifeExp      pop gdpPercap
   <int>   <dbl>    <int>     <dbl>
 1  1952    67.5 69145952     7144.
 2  1957    69.1 71019069    10188.
 3  1962    70.3 73739117    12902.
 4  1967    70.8 76368453    14746.
 5  1972    71   78717088    18016.
 6  1977    72.5 78160773    20513.
 7  1982    73.8 78335266    22032.
 8  1987    74.8 77718298    24639.
 9  1992    76.1 80597764    26505.
10  1997    77.3 82011073    27789.
11  2002    78.7 82350671    30036.
12  2007    79.4 82400996    32170.

nested tibble

by_country %>%
  filter(country == "Germany")

# A tibble: 1 x 3
  continent country data             
  <fct>     <fct>   <list>           
1 Europe    Germany <tibble [12 × 5]>

by_country %>%
  filter(country == "Germany") %>%
  unnest(data)

# A tibble: 12 x 7
   continent country  year lifeExp      pop gdpPercap year1950
   <fct>     <fct>   <int>   <dbl>    <int>     <dbl>    <dbl>
 1 Europe    Germany  1952    67.5 69145952     7144.        2
 2 Europe    Germany  1957    69.1 71019069    10188.        7
 3 Europe    Germany  1962    70.3 73739117    12902.       12
 4 Europe    Germany  1967    70.8 76368453    14746.       17
 5 Europe    Germany  1972    71   78717088    18016.       22
 6 Europe    Germany  1977    72.5 78160773    20513.       27
 7 Europe    Germany  1982    73.8 78335266    22032.       32
 8 Europe    Germany  1987    74.8 77718298    24639.       37
 9 Europe    Germany  1992    76.1 80597764    26505.       42
10 Europe    Germany  1997    77.3 82011073    27789.       47
11 Europe    Germany  2002    78.7 82350671    30036.       52
12 Europe    Germany  2007    79.4 82400996    32170.       57

What happens in the DATA FRAME, stays in the data frame

Las Vegas principle, add linear models

06:00

using by_country
add a new column model with linear regressions of lifeExp on year1950
save as by_country_lm

ask yourself

if you see add column, do you use mutate or summarise?
dealing with a list-column (here the column data), do you need to use map?

Keep related things together

linear models

Explore a list column

03:00

count # rows per country using the data column
does any country have less data than others?
plot lifeExp ~ year1950 for Bulgaria by unnesting data

reminder

a list column is a list, you need to iterate through elements
for the plotting the steps are:
- filter() for the desired country
- unnest() raw data
- pipe to ggplot()

Explore a list column

count how many rows per country

by_country_lm %>%
  mutate(n = map_int(data, nrow)) %>%
  select(country, n)

# A tibble: 142 x 2
   country                      n
   <fct>                    <int>
 1 Algeria                     12
 2 Angola                      12
 3 Benin                       12
 4 Botswana                    12
 5 Burkina Faso                12
 6 Burundi                     12
 7 Cameroon                    12
 8 Central African Republic    12
 9 Chad                        12
10 Comoros                     12
# … with 132 more rows

by_country_lm %>%
  mutate(n = map_int(data, nrow)) %>%
  distinct(n)

# A tibble: 1 x 1
      n
  <int>
1    12

Bulgaria

by_country_lm %>%
  filter(country == "Bulgaria") %>%
  unnest(data) %>%
  ggplot(aes(x = year1950, y = lifeExp)) +
  geom_line()

Explore nested tibble

06:00

display the summary for the linear model of Rwanda
how do you interpret the \(r^2\) for this particular model?

reminder

filter() for the desired country
list column is a list
to extract the xth element, use the pluck("model", x) purrr syntax
pipe this unique model to summary()

Explore nested tibble

linear model for Rwanda

by_country_lm %>%
  filter(country == "Rwanda") %>%
  pluck("model", 1) %>%
  summary()

Call:
lm(formula = lifeExp ~ year1950, data = .x)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.310  -1.445   2.410   3.073   6.021 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.83361    3.74890  11.426 4.63e-07 ***
year1950    -0.04583    0.10969  -0.418    0.685    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.558 on 10 degrees of freedom
Multiple R-squared:  0.01716,   Adjusted R-squared:  -0.08112 
F-statistic: 0.1746 on 1 and 10 DF,  p-value: 0.6849

\(r^2\) is close to 0, linearity sounds broken

broom will cleanup lm elements into tibbles

broom cleanup

Tidying models

extract from nested lists

06:00

using by_country_lm, add 4 new columns:
- glance, using the broom function on the model column
- tidy, using the broom function on the model column
- augment, using the broom function on the model column
- rsq from the glance column
save as models
why extracting the \(r^2\) in the main tibble is useful?

reminder

use map when dealing with a list column
in map, shortcut with quotes (like "r.squared") extract the specified variable
map takes and returns a list. Use map_dbl() to coerce output to doubles

Tidying models

extract from nested lists

useful info

coefficients estimates:
- slope
- intercept
\(r^2\)
residuals

library(broom)
models <- by_country_lm %>%
  mutate(glance  = map(model, glance),
         tidy    = map(model, tidy),
         augment = map(model, augment),
         rsq     = map_dbl(glance, "r.squared"))
models

# A tibble: 142 x 8
   continent country      data     model  glance   tidy    augment      rsq
   <fct>     <fct>        <list>   <list> <list>   <list>  <list>     <dbl>
 1 Africa    Algeria      <tibble… <lm>   <tibble… <tibbl… <tibble … 0.985 
 2 Africa    Angola       <tibble… <lm>   <tibble… <tibbl… <tibble … 0.888 
 3 Africa    Benin        <tibble… <lm>   <tibble… <tibbl… <tibble … 0.967 
 4 Africa    Botswana     <tibble… <lm>   <tibble… <tibbl… <tibble … 0.0340
 5 Africa    Burkina Faso <tibble… <lm>   <tibble… <tibbl… <tibble … 0.919 
 6 Africa    Burundi      <tibble… <lm>   <tibble… <tibbl… <tibble … 0.766 
 7 Africa    Cameroon     <tibble… <lm>   <tibble… <tibbl… <tibble … 0.680 
 8 Africa    Central Afr… <tibble… <lm>   <tibble… <tibbl… <tibble … 0.493 
 9 Africa    Chad         <tibble… <lm>   <tibble… <tibbl… <tibble … 0.872 
10 Africa    Comoros      <tibble… <lm>   <tibble… <tibbl… <tibble … 0.997 
# … with 132 more rows

extracting \(r^2\) in main tibble

no need to unnest for sort / filtering.

Exploratory plots

plotting \(r^2\) for countries

05:00

plot country ~ rsq
reorder country levels by \(r^2\) (rsq): snake plot
color points per continent
which continent shows most of the low \(r^2\) values?

reminder

to reorder discrete values:

must be of data type factor
use the forcats package
fct_reorder() to reorder according to a continuous variable

Do linear models fit all countries?

snake plot

library(forcats)
models %>%
  ggplot(aes(x = rsq, 
             y = fct_reorder(country,
                             rsq))) +
  geom_point(aes(colour = continent), 
             alpha = 0.5) +
  theme_classic(18) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = c(0.25, 0.75)) +
   guides(color = guide_legend(
     override.aes = list(alpha = 1))) +
  labs(x = "r square",
       y = "Country")

display the real data for countries with a low \(r^2\)

04:00

focus on non-linear trends
filter the 20 countries with the lowest \(r^2\)
unnest column data
plot lifeExp ~ year with lines
colour per continent
facet per country
same questions for the top 20 \(r^2\)

reminder

arrange(col) will sort according to col
top_n(x, col) not only sort col but return only x top entries
top_n(x, desc(col)) same but sort from lowest values

Exploratory plots

focus on non-linear trends

models %>%
  top_n(20, desc(rsq)) %>%
  unnest(data) %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(colour = continent)) +
  facet_wrap(~ country) +
  theme(axis.text.x = element_text(angle = 45,
                                   hjust = 1),
        legend.position = "bottom")

Exploratory plots

focus on linear trends

models %>%
  top_n(20, rsq) %>%
  unnest(data) %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(colour = continent)) +
  facet_wrap(~ country) +
  theme(axis.text.x = element_text(angle = 45,
                                   hjust = 1),
        legend.position = "bottom")

interpreting the linear model

regression

what represents the intercept?
- using year1950?
- using year?
- justify Hadley choice
what represents the slope?

Germany, lifeExp ~ year1950

filter(models, country == "Germany") %>%
  unnest(tidy) %>%
  select(rsq:estimate)

# A tibble: 2 x 6
    rsq augment            p.value statistic std.error estimate
  <dbl> <list>               <dbl>     <dbl>     <dbl>    <dbl>
1 0.990 <tibble [12 × 9]> 7.65e-21     282.    0.238     67.1  
2 0.990 <tibble [12 × 9]> 3.15e-11      30.7   0.00696    0.214

Germany, lifeExp ~ year

# A tibble: 2 x 3
    rsq term        estimate
  <dbl> <chr>          <dbl>
1 0.990 (Intercept) -350.   
2 0.990 year           0.214

Summarise on one plot

by Hadley Wickham

04:00

unnest coefficients (tidy column)
- mind to keep the continent, country and rsq columns
put intercept and slope in their own columns
- in wide format, only one value can be used.
- discard unused columns.
plot slope ~ intercept (watch out the (Intercept) name which needs to be called between backsticks ‘`’)
colour per continent
size per \(r^2\) (use for scale_size_area() for lisibility)
add tendency with geom_smooth(method = "loess")

All in all

by Hadley Wickham

models %>%
  unnest(tidy) %>%
  select(continent, country, rsq, term, estimate) %>%
  spread(term, estimate) %>%
  ggplot(aes(x = `(Intercept)`, y = year1950)) +
  geom_point(aes(colour = continent, size = rsq)) +
  geom_smooth(se = FALSE, method = "loess") +
  scale_size_area() + labs(x = "Life expectancy (1950)", y = "Yearly improvement")

animation made easy

takes ~ 5 minutes due to easing

gganimate by Thomas Pedersen

library(gganimate)
gapminder %>%
  ggplot(aes(x = gdpPercap,
             y = lifeExp,
             size = pop, 
             color = continent)) +
  transition_time(year) +
  ease_aes("linear") +
  scale_size(range = c(2, 12)) +
  geom_point() +
  theme_bw(16) +
  labs(title = "Year: {frame_time}", 
       x = "GDP per capita", 
       y = "life expectancy") +
  scale_x_log10() -> p
animate(p)
anim_save("gapminder2.gif")

Before we stop

You learned to:

keep related things together:
- input data
- meaningful grouping ids
- perform modelling
- extract relevant model components
- explore visually your findings

Acknowledgments

Hadley Wickham
Jennifer Bryan
David Robinson
Thomas Pedersen
Eric Koncina (iosp R package for slides)

R workshop @Beval

Data processing with R tidyverse 4 days in May 2020:
no fees, registration via Exilir-LU
2 ECTS for the PhD students who attend the workshop and
complete a short project.

Slides online

Rmd on gitlab

Learning objectives

You will learn to:

guided practical

Managing multiple models

list-column cheatsheet

reminder

nest()

Keep all analyse steps (cols) per group (rows) together

reminder

workflow from last time

Gapminder

gapminder is a fact tank

dataset

Hans Rosling

Guided practical

explore gapminder

warning!

Gapminder

global vs individual trend

Keep related things together using list-column

Keep related things together

group_nest()

nest per country

helpers

One country example

Germany

from original tibble

nested tibble

What happens in the DATA FRAME, stays in the data frame

Las Vegas principle, add linear models

ask yourself

Keep related things together

linear models

Linear model per country

Explore a list column

reminder

Explore a list column

count how many rows per country

Bulgaria

Explore nested tibble

reminder

Explore nested tibble

linear model for Rwanda

broom cleanup

Tidying models

extract from nested lists

reminder

Tidying models

extract from nested lists

useful info

extracting \(r^2\) in main tibble

Exploratory plots

plotting \(r^2\) for countries

reminder

Do linear models fit all countries?

snake plot

display the real data for countries with a low \(r^2\)

reminder

Exploratory plots

focus on non-linear trends

Exploratory plots

focus on linear trends

interpreting the linear model

regression

Germany, lifeExp ~ year1950

Germany, lifeExp ~ year

Summarise on one plot

by Hadley Wickham

All in all

by Hadley Wickham

animation made easy

takes ~ 5 minutes due to easing

gganimate by Thomas Pedersen

Before we stop

You learned to:

Acknowledgments

R workshop @Beval

`nest()`

`group_nest()`