datasaurus

install.packages("datasauRus")  # only needed once
library(datasauRus)
library(tidyverse)  # provides %>%, dplyr and ggplot2, all used below

Explore the dataset

Since we are dealing with a tibble, we can just type

datasaurus_dozen

Only the first 10 rows are displayed:

dataset x y
dino 55.3846 97.1795
dino 51.5385 96.0256
dino 46.1538 94.4872
dino 42.8205 91.4103
dino 40.7692 88.3333
dino 38.7179 84.8718
dino 35.6410 79.8718
dino 33.0769 77.5641
dino 28.9744 74.4872
dino 26.1538 71.4103
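If ten rows are not enough, the tibble print method accepts the number of rows to show. The lines below are a minimal sketch, assuming the dplyr package is installed (it re-exports glimpse() and loads the tibble print method); the value 20 is an arbitrary choice for illustration.

```r
library(datasauRus)  # provides datasaurus_dozen
library(dplyr)       # provides glimpse() and loads tibble printing

# Print more than the default 10 rows of the tibble
print(datasaurus_dozen, n = 20)

# Compact overview: one line per column with its type and first values
glimpse(datasaurus_dozen)
```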
# dim() returns the dimensions of the data frame, i.e. the number of rows and columns
dim(datasaurus_dozen)
## [1] 1846    3
# ncol() returns only the number of columns
ncol(datasaurus_dozen)
## [1] 3
# nrow() returns only the number of rows
nrow(datasaurus_dozen)
## [1] 1846
datasaurus_dozen <- datasaurus_dozen
After this assignment, datasaurus_dozen appears in the Environment panel -> Global Environment.

How many datasets are present?

Tip

You want to count the number of unique elements in the column dataset. The function unique() returns those elements as a vector, and length() returns the length of that vector.
unique(datasaurus_dozen$dataset) %>% length()
## [1] 13
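The same count can be obtained with dplyr helpers, which also make it easy to see how the rows are split between datasets. This is a sketch of an alternative, not the approach the exercise asks for; the names n_datasets and rows_per_dataset are illustrative.

```r
library(datasauRus)  # provides datasaurus_dozen
library(dplyr)       # provides n_distinct() and count()

# Number of distinct values in the dataset column
n_datasets <- n_distinct(datasaurus_dozen$dataset)
n_datasets

# Number of rows (points) in each dataset, one row per dataset
rows_per_dataset <- count(datasaurus_dozen, dataset)
rows_per_dataset
```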

Check summary statistics per dataset

Tip

In summarise() you can define as many new columns as you wish. No need to call it for every single variable.
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x),
            mean_y = mean(y))
dataset mean_x mean_y
away 54.26610 47.83472
bullseye 54.26873 47.83082
circle 54.26732 47.83772
dino 54.26327 47.83225
dots 54.26030 47.83983
h_lines 54.26144 47.83025
high_lines 54.26881 47.83545
slant_down 54.26785 47.83590
slant_up 54.26588 47.83150
star 54.26734 47.83955
v_lines 54.26993 47.83699
wide_lines 54.26692 47.83160
x_shape 54.26015 47.83972
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(sd_x = sd(x),
            sd_y = sd(y))
dataset sd_x sd_y
away 16.76983 26.93974
bullseye 16.76924 26.93573
circle 16.76001 26.93004
dino 16.76514 26.93540
dots 16.76774 26.93019
h_lines 16.76590 26.93988
high_lines 16.76670 26.94000
slant_down 16.76676 26.93610
slant_up 16.76885 26.93861
star 16.76896 26.93027
v_lines 16.76996 26.93768
wide_lines 16.77000 26.93790
x_shape 16.76996 26.93000
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise_if(is.double, list(mean = mean, sd = sd))
dataset x_mean y_mean x_sd y_sd
away 54.26610 47.83472 16.76983 26.93974
bullseye 54.26873 47.83082 16.76924 26.93573
circle 54.26732 47.83772 16.76001 26.93004
dino 54.26327 47.83225 16.76514 26.93540
dots 54.26030 47.83983 16.76774 26.93019
h_lines 54.26144 47.83025 16.76590 26.93988
high_lines 54.26881 47.83545 16.76670 26.94000
slant_down 54.26785 47.83590 16.76676 26.93610
slant_up 54.26588 47.83150 16.76885 26.93861
star 54.26734 47.83955 16.76896 26.93027
v_lines 54.26993 47.83699 16.76996 26.93768
wide_lines 54.26692 47.83160 16.77000 26.93790
x_shape 54.26015 47.83972 16.76996 26.93000
The means and standard deviations are nearly identical across the 13 datasets: they differ only from the second or third decimal place onwards.
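To check how close these numbers really are, one option is to compute all statistics in a single summarise() call and then look at their spread across datasets. The sketch below assumes dplyr 1.0 or later (for across()); the correlation column cor_xy and the names stats and spread are additions for illustration and are not part of the original exercise.

```r
library(datasauRus)  # provides datasaurus_dozen
library(dplyr)       # provides group_by(), summarise(), across()

# Summary statistics per dataset, including the x-y correlation
stats <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x),
            mean_y = mean(y),
            sd_x   = sd(x),
            sd_y   = sd(y),
            cor_xy = cor(x, y))

# Spread (max - min) of each statistic across the datasets;
# small values mean these numbers cannot tell the datasets apart
spread <- stats %>%
  summarise(across(-dataset, ~ max(.x) - min(.x)))
spread
```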

Plot the datasauRus

Tip

The ggplot() and geom_point() functions must be linked with a + sign.
ggplot(datasaurus_dozen, aes(x = x, y = y)) +
  geom_point()
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point()
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(~ dataset, ncol = 3)
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  theme_void() +
  theme(legend.position = "none") +
  facet_wrap(~ dataset, ncol = 3)
No ;) The datasets look nothing alike once plotted. We were fooled by the summary stats.

Animation

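# Note: the frame aesthetic and gganimate() come from the gganimate API
# before version 1.0; later releases use a different interface
install.packages("gganimate")
library(gganimate)

p <- ggplot(datasaurus_dozen, aes(x = x, y = y, frame = dataset)) +
  geom_point() +
  theme_gray(20) +
  theme(legend.position = "none")

gganimate(p, title_frame = TRUE, "./img/dino.gif")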
## Executing: 
## convert -loop 0 -delay 100 Rplot1.png Rplot2.png Rplot3.png
##     Rplot4.png Rplot5.png Rplot6.png Rplot7.png Rplot8.png
##     Rplot9.png Rplot10.png Rplot11.png Rplot12.png Rplot13.png
##     'dino.gif'
## Output at: dino.gif
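With gganimate 1.0 or later, the frame aesthetic and the gganimate() call used above are replaced by transition functions plus animate() and anim_save(). The following is a sketch of a roughly equivalent animation under that assumption; the frame count, frame rate, state lengths and output file name are illustrative choices, and rendering a GIF assumes the gifski package is installed.

```r
library(datasauRus)  # provides datasaurus_dozen
library(ggplot2)
library(gganimate)   # version 1.0 or later is assumed

p <- ggplot(datasaurus_dozen, aes(x = x, y = y)) +
  geom_point() +
  theme_gray(20) +
  # show the name of the dataset currently displayed as the title
  labs(title = "{closest_state}") +
  # one animation state per dataset
  transition_states(dataset, transition_length = 1, state_length = 2)

# Render the animation (uses the gifski renderer when available)
anim <- animate(p, nframes = 100, fps = 10)

# "dino.gif" is an illustrative file name
anim_save("dino.gif", animation = anim)
```

Unlike the older interface, the points here move smoothly from one dataset to the next instead of switching frame by frame.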

Conclusion

Never trust summary statistics alone; always visualize your data | Alberto Cairo

Authors

from this post