Data Visualization with ggplot2

Author

Lindsay N. Hayes

Published

July 10, 2025

About the activity

Access the Quarto document here.
Download the raw file.
Open it in RStudio.

We will work our way through this Quarto document together during class. The activity covers plotting data with ggplot2!

Load the Packages

The tidyverse library includes ggplot2, and I added cowplot to make the plots prettier. We will use data from the palmerpenguins and taylor packages.

library(tidyverse)
library(cowplot)
library(palmerpenguins)
library(taylor)

Histogram & Density plots

Use the palmerpenguins data to plot some histograms. Histograms are good for evaluating the distribution of continuous data.

1. How would you describe the distribution? Is it uni-modal or bi-modal?

penguins |> head()

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

# plot the distribution of the penguins' body mass using geom_histogram()

ggplot(penguins, aes(x = body_mass_g)) + geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

# let's make the plot and the code a little prettier (note: tidy syntax puts each element on a new line, and the "+" at the end of a line continues the plot onto the next line)

ggplot(penguins, aes(x = body_mass_g)) + 
  geom_histogram(fill = "deeppink") + 
  theme_cowplot()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

# histograms are affected by the bins, i.e. how many values are counted together in each bar (how wide the bars are). We can change that with `bins` or `binwidth`

ggplot(penguins, aes(x = body_mass_g)) + 
  geom_histogram(binwidth = 25, fill = "deeppink") + 
  theme_cowplot()

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

# add some more labels

ggplot(penguins, aes(x = body_mass_g)) + 
  geom_histogram(binwidth = 50, fill = "deeppink") + 
  theme_cowplot() +
  labs(title = "Distribution of Penguin Body Mass",
       x = "Body Mass (g)",
       y = "Count")

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

2. Plot the other penguin measurements (bill length, bill depth, and flipper length). What are their distributions? How does the binwidth change the interpretation of the distributions? Are they bi-modal or uni-modal?

penguins |> head()

# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
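As a starting point for this question, here is a minimal sketch that draws the same measurement with two different bin widths. The choice of `flipper_length_mm`, the bin widths of 1 and 5, and the object names `flippers`, `p_narrow`, and `p_wide` are illustrative assumptions, not part of the original activity.

```r
library(ggplot2)
library(cowplot)
library(palmerpenguins)

# Drop rows with a missing flipper length so stat_bin() does not warn
flippers <- subset(penguins, !is.na(flipper_length_mm))

# Narrow bins (1 mm wide): lots of detail, but a noisier shape
p_narrow <- ggplot(flippers, aes(x = flipper_length_mm)) +
  geom_histogram(binwidth = 1, fill = "deeppink") +
  theme_cowplot() +
  labs(title = "binwidth = 1", x = "Flipper Length (mm)", y = "Count")

# Wide bins (5 mm wide): a smoother shape, but less detail
p_wide <- ggplot(flippers, aes(x = flipper_length_mm)) +
  geom_histogram(binwidth = 5, fill = "deeppink") +
  theme_cowplot() +
  labs(title = "binwidth = 5", x = "Flipper Length (mm)", y = "Count")

# Show the two histograms side by side with cowplot
plot_grid(p_narrow, p_wide, ncol = 2)
```

Swap in `bill_length_mm` or `bill_depth_mm` and adjust the bin widths to suit the scale of each measurement.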

3. A benefit of geom_density() is that the plots do not depend on the bin width. Compare the histograms and the density plots for each data feature.

# Hint. Change geom_histogram to geom_density
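Following the hint, a minimal sketch of the density version of the body mass plot might look like this. One caveat: a density curve has no bins, but its smoothness still depends on a bandwidth, which `geom_density()` exposes through its `adjust` argument. The value `adjust = 0.25` and the object names `p_density` and `p_wiggly` are illustrative assumptions.

```r
library(ggplot2)
library(cowplot)
library(palmerpenguins)

# Same aesthetics as the histogram, with geom_density() swapped in.
# na.rm = TRUE silently drops the 2 penguins with a missing body mass.
p_density <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_density(fill = "deeppink", na.rm = TRUE) +
  theme_cowplot() +
  labs(title = "Distribution of Penguin Body Mass",
       x = "Body Mass (g)",
       y = "Density")

# adjust < 1 shrinks the smoothing bandwidth, giving a more wiggly curve
p_wiggly <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_density(adjust = 0.25, fill = "deeppink", na.rm = TRUE) +
  theme_cowplot() +
  labs(title = "adjust = 0.25", x = "Body Mass (g)", y = "Density")

plot_grid(p_density, p_wiggly, ncol = 2)
```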

4. Where do you think the bimodal distributions may come from? Let's check the distribution across different species. Add `fill = species` to the `aes()` so we can see how the distributions change across species of penguins.

# Hint: add an `alpha` to geom_density() to set the opacity of the density plots.
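Putting the two hints together, one possible sketch is shown below. The `alpha = 0.5` value and the object name `p_species` are assumptions chosen for illustration; any value between 0 (transparent) and 1 (opaque) works.

```r
library(ggplot2)
library(cowplot)
library(palmerpenguins)

# fill = species inside aes() draws one density curve per species;
# alpha outside aes() applies the same opacity to every curve
p_species <- ggplot(penguins, aes(x = body_mass_g, fill = species)) +
  geom_density(alpha = 0.5, na.rm = TRUE) +
  theme_cowplot() +
  labs(title = "Penguin Body Mass by Species",
       x = "Body Mass (g)",
       y = "Density")

p_species
```

Because the curves overlap, the partial transparency is what lets you see all of the species at once.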

Barplots of Categorical data

Categorical data also have a distribution: the number of observations in each category. These counts are best displayed in barplots.

# some tidying
taylor_album_songs$key <- str_sub(taylor_album_songs$key_name, 1, 1)
taylor_album_songs$album_name <- fct_inorder(taylor_album_songs$album_name)

# You can see the difference both in the number of songs per album and in the number of songs in each key.

ggplot(taylor_album_songs, aes(y = album_name, fill = key)) + 
  geom_bar() +
  theme_cowplot() +
  labs(title = "Keys of TS albums",
       x = "Count", 
       y = "") +
  scale_fill_brewer(palette = "Set2")

# Now let's look at the proportions
ggplot(taylor_album_songs, aes(y = album_name, fill = key)) + 
  geom_bar(position = "fill") +
  theme_cowplot() +
  labs(title = "Keys of TS albums",
       x = "Proportion", 
       y = "") +
  scale_fill_brewer(palette = "Set2")

5. What factors distinguish Ideal cut versus Good cut diamonds?

Using the diamonds dataset, determine the distribution of the cut, color, and clarity of diamonds.

# Hint: first look at the distribution of each characteristic



# Hint: Now plot two of the variables together using geom_bar(), one on an axis and the other as a fill color
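The two hints could be sketched as follows, using the `diamonds` dataset that ships with ggplot2. Pairing `cut` with `clarity` is just one of the possible combinations, and the object names `p_cut` and `p_prop` are assumptions for illustration.

```r
library(ggplot2)
library(cowplot)

# First hint: the distribution of a single characteristic
p_cut <- ggplot(diamonds, aes(x = cut)) +
  geom_bar(fill = "deeppink") +
  theme_cowplot() +
  labs(title = "Diamonds by cut", x = "Cut", y = "Count")

# Second hint: one variable on the axis, another as the fill color.
# position = "fill" rescales each bar to proportions that sum to 1.
p_prop <- ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "fill") +
  theme_cowplot() +
  labs(title = "Clarity within each cut", x = "Cut", y = "Proportion")

plot_grid(p_cut, p_prop, ncol = 1)
```

Repeating the second plot with `fill = color` gives the other comparison needed to contrast Ideal and Good cuts.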

Boxplots!

Boxplots are one of the best ways to view group differences.

6. Do the bill length and depth, the flipper length, or the body mass differ across species of penguins?

Using the penguins dataset, use geom_boxplot() to find out if there is a difference in the penguin features between different species of penguins.

# Hint: plot the species as the categorical variable and the bill, flipper, or body mass measurement as the continuous variable.
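Following the hint, a minimal sketch for one of the measurements is shown below. Using `flipper_length_mm` and the object name `p_box` are assumptions; the same pattern applies to the other features.

```r
library(ggplot2)
library(cowplot)
library(palmerpenguins)

# Species on the x axis (categorical), flipper length on the y axis (continuous)
p_box <- ggplot(penguins, aes(x = species, y = flipper_length_mm, fill = species)) +
  geom_boxplot(na.rm = TRUE) +
  theme_cowplot() +
  labs(title = "Flipper Length by Species",
       x = "Species",
       y = "Flipper Length (mm)")

p_box
```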

Isn’t this better than bar graphs??!!!

Note: geom_violin() can also be a good alternative to a boxplot if you have a very large sample size, such as in single cell RNA sequencing or flow cytometry data. Try the plots as a geom_violin() instead of geom_boxplot().
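For comparison, here is a sketch of the same kind of plot drawn as a violin plot. The choice of `body_mass_g` and the object name `p_violin` are illustrative assumptions.

```r
library(ggplot2)
library(cowplot)
library(palmerpenguins)

# geom_violin() draws a mirrored density curve for each group,
# so it shows the shape of the distribution as well as its spread
p_violin <- ggplot(penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_violin(na.rm = TRUE) +
  theme_cowplot() +
  labs(title = "Body Mass by Species",
       x = "Species",
       y = "Body Mass (g)")

p_violin
```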

Scatter and Line Plots

Looking at 2 continuous variables and their relationship to one another.

7. Using the `iris` dataset, find out the relationship between `Sepal.Length`, `Sepal.Width`, `Petal.Length`, and `Petal.Width` and how that relationship changes across different `Species` of iris flowers.

iris |> head()

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, group = Species)) + 
  geom_point() + 
  geom_smooth(method = lm) +
  theme_cowplot() +
  labs(x = "Sepal Length (cm)", y = "Sepal Width (cm)")

`geom_smooth()` using formula = 'y ~ x'

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species, group = Species)) + 
  geom_point() + 
  geom_smooth(method = lm) +
  theme_cowplot() +
  labs(x = "Petal Length (cm)", y = "Petal Width (cm)")

`geom_smooth()` using formula = 'y ~ x'

8. Now use the `ChickWeight` data to determine which `Diet` leads to the greatest `weight` gain over `Time`. Remember to take into account repeated measures across `Chick`.
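One way to approach this question is sketched below: a faint line per chick to respect the repeated measures, plus a bold line for the mean weight of each diet. Summarizing with the mean via `stat_summary()` is an assumption about how to compare the diets, and `p_chicks` is an illustrative object name.

```r
library(ggplot2)
library(cowplot)

p_chicks <- ggplot(ChickWeight, aes(x = Time, y = weight, color = Diet)) +
  # group = Chick connects the repeated measurements of each individual chick
  geom_line(aes(group = Chick), alpha = 0.3) +
  # group = Diet draws one line through the mean weight at each time point
  stat_summary(aes(group = Diet), fun = mean, geom = "line", linewidth = 1.2) +
  theme_cowplot() +
  labs(title = "Chick Weight over Time by Diet",
       x = "Time (days)",
       y = "Weight (g)")

p_chicks
```

Without `group = Chick`, geom_line() would connect all the points within a diet into a single zig-zag line, which hides the individual growth curves.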

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] taylor_3.2.0         palmerpenguins_0.1.1 cowplot_1.1.3       
 [4] lubridate_1.9.4      forcats_1.0.0        stringr_1.5.1       
 [7] dplyr_1.1.4          purrr_1.0.4          readr_2.1.5         
[10] tidyr_1.3.1          tibble_3.2.1         ggplot2_3.5.1       
[13] tidyverse_2.0.0     

loaded via a namespace (and not attached):
 [1] utf8_1.2.4         generics_0.1.3     stringi_1.8.4      lattice_0.22-6    
 [5] hms_1.1.3          digest_0.6.37      magrittr_2.0.3     evaluate_1.0.3    
 [9] grid_4.4.1         timechange_0.3.0   RColorBrewer_1.1-3 fastmap_1.2.0     
[13] Matrix_1.7-2       jsonlite_1.9.1     mgcv_1.9-1         scales_1.3.0      
[17] cli_3.6.4          rlang_1.1.5        munsell_0.5.1      splines_4.4.1     
[21] withr_3.0.2        yaml_2.3.10        tools_4.4.1        tzdb_0.4.0        
[25] colorspace_2.1-1   vctrs_0.6.5        R6_2.6.1           lifecycle_1.0.4   
[29] htmlwidgets_1.6.4  pkgconfig_2.0.3    pillar_1.10.1      gtable_0.3.6      
[33] glue_1.8.0         xfun_0.51          tidyselect_1.2.1   rstudioapi_0.17.1 
[37] knitr_1.50         farver_2.1.2       htmltools_0.5.8.1  nlme_3.1-167      
[41] rmarkdown_2.29     labeling_0.4.3     compiler_4.4.1