Chapter 2 Data Visualization

We begin the development of your data science toolbox with data visualization. By visualizing data, we gain valuable insights we couldn’t initially obtain from just looking at the raw data values. We’ll use the ggplot2 package, as it provides an easy way to customize your plots. ggplot2 is rooted in the data visualization theory known as the grammar of graphics (Wilkinson 2005), developed by Leland Wilkinson.

At their most basic, graphics/plots/charts (we use these terms interchangeably in this book) provide a nice way to explore the patterns in data, such as the presence of outliers, distributions of individual variables, and relationships between groups of variables. Graphics are designed to emphasize the findings and insights you want your audience to understand. This does, however, require a balancing act. On the one hand, you want to highlight as many interesting findings as possible. On the other hand, you don’t want to include so much information that it overwhelms your audience.

As we will see, plots also help us to identify patterns and outliers in our data. A common extension of these ideas is to compare the distribution of one numerical variable, such as the center and spread of its values, across the levels of a categorical variable.

Chapter Learning Objectives

At the end of this chapter, you should be able to…
• Explore and present data using plots (a.k.a., charts and graphics) in R.
• Use the grammar of graphics to create basic plots by specifying the data, its mapping and the geometric object(s).
• Produce more complex plots by adding layers to modify labels, add facets, etc.
• Adjust plot features such as colors and bin number by modifying function arguments (inputs).
• Select and create the most appropriate graph to visualize your data.
• Recognize and correct difficulties in visualizing data due to overplotting.
• Interpret data presented in histograms and boxplots.
• Explain the problems with using bar charts to present quantitative data and pie charts to present categorical data.

Needed packages

Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Read Section 1.3 for information on how to install and load R packages.

library(openintro)
library(ggplot2)
library(dplyr)

2.1 The grammar of graphics

We start with a discussion of a theoretical framework for data visualization known as “the grammar of graphics.” This framework serves as the foundation for the ggplot2 package which we’ll use extensively in this chapter. Think of how we construct and form sentences in English by combining different elements, like nouns, verbs, articles, subjects, objects, etc. We can’t just combine these elements in any arbitrary order; we must do so following a set of rules known as a linguistic grammar. Similarly to a linguistic grammar, “the grammar of graphics” defines a set of rules for constructing statistical graphics by combining different types of layers. This grammar was created by Leland Wilkinson (Wilkinson 2005) and has been implemented in a variety of data visualization software platforms like R, but also Plotly and Tableau.

2.1.1 Components of the grammar

In short, the grammar tells us that:

A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects.

Specifically, we can break a graphic into the following three essential components:

data: the dataset containing the variables of interest.
geom: the geometric object in question. This refers to the type of object we can observe in a plot. For example: points, lines, and bars.
aes: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Aesthetic attributes are mapped to variables in the dataset.

You might be wondering why we wrote the terms data, geom, and aes in a computer code type font. We’ll see very shortly that we’ll specify the elements of the grammar in R using these terms. However, let’s first break down the grammar with an example.

2.1.2 Gapminder data

In February 2006, a Swedish physician and data advocate named Hans Rosling gave a TED talk titled “The best stats you’ve ever seen” where he presented global economic, health, and development data from the website gapminder.org. For example, for data on 142 countries in 2007, let’s consider only a few countries in Table 2.1 as a peek into the data.

TABLE 2.1: Gapminder 2007 Data: First 3 of 142 countries
Country	Continent	Life Expectancy	Population	GDP per Capita
Afghanistan	Asia	43.8	31889923	975
Albania	Europe	76.4	3600523	5937
Algeria	Africa	72.3	33333216	6223

Each row in this table corresponds to a country in 2007. For each row, we have 5 columns:

Country: Name of country.
Continent: Which of the five continents the country is part of. Note that “Americas” includes countries in both North and South America and that Antarctica is excluded.
Life Expectancy: Life expectancy in years.
Population: Number of people living in the country.
GDP per Capita: Gross domestic product per person (in US dollars).

Now consider Figure 2.1, which plots these data for all 142 countries.

FIGURE 2.1: Life expectancy over GDP per capita in 2007.

Let’s view this plot through the grammar of graphics:

The data variable GDP per Capita gets mapped to the x-position aesthetic of the points.
The data variable Life Expectancy gets mapped to the y-position aesthetic of the points.
The data variable Population gets mapped to the size aesthetic of the points.
The data variable Continent gets mapped to the color aesthetic of the points.

We’ll see shortly that data corresponds to the particular data frame where our data is saved and that “data variables” correspond to particular columns in the data frame. Furthermore, the type of geometric object considered in this plot is the point. While this example uses points, graphics are not limited to points. We can also use lines, bars, and other geometric objects.

Let’s summarize the three essential components of the grammar in Table 2.2.

TABLE 2.2: Summary of the grammar of graphics for this plot
data variable	aes	geom
GDP per Capita	x	point
Life Expectancy	y	point
Population	size	point
Continent	color	point

2.1.3 Other components

There are other components of the grammar of graphics we can control as well. As you start to delve deeper into the grammar of graphics, you’ll start to encounter these topics more frequently. In this book, we’ll keep things simple and only work with these two additional components:

faceting breaks up a plot into several plots split by the values of another variable (Section 2.6)
position adjustments for barplots (Section 2.8)

Other more complex components like scales and coordinate systems are left for a more advanced text such as R for Data Science (Grolemund and Wickham 2017). Generally speaking, the grammar of graphics allows for a high degree of customization of plots and also a consistent framework for easily updating and modifying them.

2.1.4 ggplot2 package

In this book, we will use the ggplot2 package for data visualization, which is an implementation of the grammar of graphics for R (Wickham, Chang, et al. 2023). As we noted earlier, a lot of the previous section was written in a computer code type font. This is because the various components of the grammar of graphics are specified in the ggplot() function included in the ggplot2 package. For the purposes of this book, we’ll always provide the ggplot() function with the following arguments (i.e., inputs) at a minimum:

The data frame where the variables exist: the data argument.
The mapping of the variables to aesthetic attributes: the mapping argument which specifies the aesthetic attributes involved.

After we’ve specified these components, we then add layers to the plot using the + sign. The most essential layer to add to a plot is the layer that specifies which type of geometric object we want the plot to involve: points, lines, bars, and others. Other layers we can add to a plot include the plot title, axes labels, visual themes for the plots, and facets (which we’ll see in Section 2.6).
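As a minimal sketch of how these pieces fit together, the following code types in the three rows of Table 2.1 by hand and applies the mappings summarized in Table 2.2. The data frame name gapminder_sample and its column names are assumptions made for this illustration; they are not necessarily the names used to produce Figure 2.1.

```r
library(ggplot2)

# Hypothetical data frame typed in from the three rows of Table 2.1
gapminder_sample <- data.frame(
  country     = c("Afghanistan", "Albania", "Algeria"),
  continent   = c("Asia", "Europe", "Africa"),
  life_exp    = c(43.8, 76.4, 72.3),
  pop         = c(31889923, 3600523, 33333216),
  gdp_per_cap = c(975, 5937, 6223)
)

# data and mapping are arguments to ggplot();
# the geometric object is added as a layer with +
p <- ggplot(data = gapminder_sample,
            mapping = aes(x = gdp_per_cap, y = life_exp,
                          size = pop, color = continent)) +
  geom_point()
p
```

With only three countries the plot is sparse, but each of the four aesthetics (x, y, size, and color) is driven by one column of the data frame, exactly as the grammar describes.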

Let’s now put the theory of the grammar of graphics into practice.

2.2 Five named graphs - the 5NG

In order to keep things simple in this book, we will only focus on five different types of graphics, each with a commonly given name. We term these “five named graphs” or in abbreviated form, the 5NG:

scatterplots
linegraphs
histograms
boxplots
barplots

We’ll also present some variations of these plots, but with this basic repertoire of five graphics in your toolbox, you can visualize a wide array of different variable types. As we’ll see, certain plots are only appropriate for categorical variables, while others are only appropriate for numerical variables.

2.3 5NG#1: Scatterplots

The simplest of the 5NG are scatterplots, also called bivariate plots. They allow you to visualize the relationship between two numerical variables. While you may already be familiar with scatterplots, let’s view them through the lens of the grammar of graphics we presented in Section 2.1. Specifically, we will visualize the relationship between the following two numerical variables in the mammals data frame we created in Section 1.4:

brain_wt: brain weight (in kg) on the horizontal “x” axis and
life_span: life span (in years) on the vertical “y” axis
for each mammalian species.

2.3.1 Scatterplots via `geom_point`

Let’s now go over the code that will create the desired scatterplot, while keeping in mind the grammar of graphics framework we introduced in Section 2.1. Let’s take a look at the code and break it down piece-by-piece.

ggplot(data = mammals, mapping = aes(x = brain_wt, y = life_span)) + 
  geom_point()

Within the ggplot() function, we specify two of the components of the grammar of graphics as arguments (i.e., inputs):

The data as the mammals data frame via data = mammals.
The aesthetic mapping by setting mapping = aes(x = brain_wt, y = life_span). Specifically, the variable brain_wt maps to the x position aesthetic, while the variable life_span maps to the y position.

We then add a layer to the ggplot() function call using the + sign. The added layer in question specifies the third component of the grammar: the geometric object. In this case, the geometric object is set to be points by specifying geom_point(). After running these two lines of code in your console, you’ll notice two outputs: a warning message and the graphic shown in Figure 2.2.

Warning: Removed 4 rows containing missing values (`geom_point()`).

FIGURE 2.2: Brain weight vs. life span in mammals data frame.

Let’s first unpack the graphic in Figure 2.2. Observe that a positive relationship exists overall between brain_wt and life_span: as brain_wt increases, the life_span tends to also increase. Observe also the large number of points clustered near (0, 0), which we’ll discuss further in Subsection 2.3.2.

Now, let’s turn our attention to the warning message. R is alerting us to the fact that some rows were ignored due to missing information. For these rows, either the value for brain_wt or life_span or both were missing, and thus these rows were ignored in our plot.
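To see where a warning like this comes from, the sketch below uses a small made-up data frame (toy is an assumption for illustration and is not part of the mammals data) in which some x or y values are NA. Counting the incomplete rows with complete.cases() gives the number of rows that cannot be plotted.

```r
library(ggplot2)

# Hypothetical data with missing values in x, in y, or in both
toy <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(2, NA, 6, 8, NA)
)

# Number of rows where x, y, or both are missing
n_dropped <- sum(!complete.cases(toy))
n_dropped

# Plotting these data with geom_point() alone would warn about the
# incomplete rows; na.rm = TRUE removes them without the warning
p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)
p
```

Silencing the warning is only sensible once you have checked which rows are being dropped and why.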

Before we continue, let’s make a few more observations about this code that created the scatterplot. Note that the + sign comes at the end of lines, and not at the beginning. You’ll get an error in R if you put it at the beginning of a line. When adding layers to a plot, you are encouraged to start a new line after the + (by pressing the Return/Enter button on your keyboard) so that the code for each layer is on a new line. As we add more and more layers to plots, you’ll see this will greatly improve the legibility of your code.

To stress the importance of adding the layer specifying the geometric object, consider Figure 2.3 where no layers are added. Because the geometric object was not specified, we have a blank plot which is not very useful!

ggplot(data = mammals, mapping = aes(x = brain_wt, y = life_span))

FIGURE 2.3: A plot with no layers.

Learning check

(LC2.1) What are some practical reasons why brain_wt and life_span have a positive relationship?

(LC2.2) For which of these two variables, brain_wt and life_span, is the data more frequently missing? How did you determine this?

(LC2.3) What are some other features of the plot that stand out to you?

(LC2.4) Create a new scatterplot using different quantitative variables in the mammals data frame by modifying the example given.

2.3.2 Overplotting

The large mass of points near (0, 0) in Figure 2.2 can cause some confusion since it is hard to tell the true number of points that are plotted. This is the result of a phenomenon called overplotting: points are plotted on top of each other over and over again, which hides how many points there really are. There are two methods to address the issue of overplotting:

Adjusting the transparency of the points or
Adding a little random “jitter”, or random “nudges”, to each of the points.

Method 1: Changing the transparency

The first way of addressing overplotting is to change the transparency/opacity of the points by setting the alpha argument in geom_point(). We can change the alpha argument to be any value between 0 and 1, where 0 sets the points to be 100% transparent and 1 sets the points to be 100% opaque. By default, alpha is set to 1. In other words, if we don’t explicitly set an alpha value, R will use alpha = 1.

Note how the following code is identical to the code in Subsection 2.3.1 that created the scatterplot with overplotting, but with alpha = 0.2 added to the geom_point() function:

ggplot(data = mammals, mapping = aes(x = brain_wt, y = life_span)) + 
  geom_point(alpha = 0.2)

FIGURE 2.4: Brain weight vs. life span scatterplot with alpha = 0.2.

The key feature to note in Figure 2.4 is that the transparency of the points is cumulative: areas with a high degree of overplotting are darker, whereas areas with a lower degree are less dark. Note furthermore that there is no aes() surrounding alpha = 0.2. This is because we are not mapping a variable to an aesthetic attribute, but rather merely changing the default setting of alpha. If you instead change the second line to read geom_point(aes(alpha = 0.2)), ggplot2 treats 0.2 as a data value to be mapped rather than as a fixed setting, so the points are not guaranteed to be drawn at 20% opacity and an unneeded legend for alpha is added to the plot.
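The difference between setting an aesthetic and mapping it can be sketched with a small made-up data frame (toy is hypothetical and stands in for any data with overlapping points). The exported ggplot2 helper layer_data() returns the values a layer will actually draw, which makes the difference visible without looking at the plots.

```r
library(ggplot2)

# Hypothetical data with three points stacked at (0, 0)
toy <- data.frame(x = c(0, 0, 0, 1, 2), y = c(0, 0, 0, 1, 2))

# Setting: alpha is one fixed value, written outside aes()
p_set <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point(alpha = 0.2)

# Mapping: alpha depends on a variable, written inside aes()
p_map <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point(mapping = aes(alpha = y))

# Every point in the first plot shares the same alpha value
unique(layer_data(p_set)$alpha)

# In the second plot, alpha varies with y and a legend is drawn
unique(layer_data(p_map)$alpha)
```

For dealing with overplotting, the first form is the one we want: a single transparency level applied to all points.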

Method 2: Jittering the points

The second way of addressing overplotting is by jittering all the points. This means giving each point a small “nudge” in a random direction. You can think of “jittering” as shaking the points around a bit on the plot. Let’s illustrate using a simple example first. Say we have a data frame with 4 identical rows of x and y values: (0,0), (0,0), (0,0), and (0,0). In Figure 2.5, we present both the regular scatterplot of these 4 points (on the left) and its jittered counterpart (on the right).

FIGURE 2.5: Regular and jittered scatterplot.

In the left-hand regular scatterplot, observe that the 4 points are superimposed on top of each other. While we know there are 4 values being plotted, this fact might not be apparent to others. In the right-hand jittered scatterplot, it is now plainly evident that this plot involves four points since each point is given a random “nudge.”

Keep in mind, however, that jittering is strictly a visualization tool; even after creating a jittered scatterplot, the original values saved in the data frame remain unchanged.
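A minimal sketch of this four-point example follows. The data frame name four_points and the jitter amounts of 0.1 are assumptions chosen for illustration; they are not necessarily the values used to draw Figure 2.5.

```r
library(ggplot2)

# Four identical points at (0, 0)
four_points <- data.frame(x = c(0, 0, 0, 0), y = c(0, 0, 0, 0))

# Regular scatterplot: the four points sit exactly on top of each other
p_regular <- ggplot(data = four_points, mapping = aes(x = x, y = y)) +
  geom_point()

# Jittered scatterplot: each point is nudged by at most 0.1 units
# horizontally and at most 0.1 units vertically
p_jittered <- ggplot(data = four_points, mapping = aes(x = x, y = y)) +
  geom_jitter(width = 0.1, height = 0.1)

p_regular
p_jittered

# The data frame itself is not changed by the jittering
four_points
```

Because the nudges are random, the jittered plot looks slightly different each time the code is run.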

To create a jittered scatterplot, instead of using geom_point(), we use geom_jitter(). Observe how the following code is very similar to the code that created the scatterplot with overplotting in Subsection 2.3.1, but with geom_point() replaced with geom_jitter().

ggplot(data = mammals, mapping = aes(x = brain_wt, y = life_span)) + 
  geom_jitter(width = 30, height = 1)

FIGURE 2.6: Brain weight vs. life span jittered scatterplot.

In order to specify how much jitter to add, we adjusted the width and height arguments to geom_jitter(). This corresponds to how hard you’d like to shake the plot in horizontal x-axis units and vertical y-axis units, respectively. How much jitter should we add using the width and height arguments? On the one hand, it is important to add just enough jitter to break any overlap in points, but on the other hand, not so much that we completely alter the original pattern in points.

As can be seen in the resulting Figure 2.6, in this case jittering doesn’t really provide much new insight. In this particular case, it can be argued that changing the transparency of the points by setting alpha proved more effective. When would it be better to use a jittered scatterplot? When would it be better to alter the points’ transparency? There is no single right answer that applies to all situations. You need to make a subjective choice and own that choice. At the very least when confronted with overplotting, however, we suggest you make both types of plots and see which one better emphasizes the point you are trying to make.

Learning check

(LC2.5) Why is setting the alpha argument value useful with scatterplots? What further information does it give you that a regular scatterplot cannot?

(LC2.6) After viewing Figure 2.4, give an approximate range of brain weights and life spans that occur most frequently in mammalian species. How has that region changed compared to when you observed the same plot without alpha = 0.2 set in Figure 2.2?

2.3.3 Summary

Scatterplots display the relationship between two numerical variables. They are among the most commonly used plots because they can provide an immediate way to see the trend in one numerical variable versus another. However, if you try to create a scatterplot where either one of the two variables is not numerical, you might get strange results. Be careful!

With medium to large datasets, you may need to play around with the different modifications to scatterplots, such as changing the transparency/opacity of the points or by jittering the points. This tweaking is often a fun part of data visualization, since you’ll have the chance to see different relationships emerge as you tinker with your plots.

2.4 5NG#2: Linegraphs

The next of the five named graphs are linegraphs. Linegraphs show the relationship between two numerical variables when the variable on the x-axis, also called the explanatory variable, is of a sequential nature. In other words, there is an inherent ordering to the variable.

The most common examples of linegraphs have some notion of time on the x-axis: hours, days, weeks, years, etc. Since time is sequential, we connect consecutive observations of the variable on the y-axis with a line. Linegraphs that have some notion of time on the x-axis are also called time series plots. Let’s illustrate linegraphs using a dataset in the datasets package included in base R: the ChickWeight data frame.

Let’s explore the ChickWeight data frame by running View(ChickWeight) and glimpse(ChickWeight). Furthermore, let’s read the associated help file by running ?ChickWeight.

Observe that there is a variable called weight which records the weight (in grams) of 50 newborn chicks over their first 21 days, with each chick fed one of four different diets (1-4). However, for simplicity let’s focus on the weights of Chick 1.

To focus on Chick 1, we will use dplyr's filter() function to choose the subset of rows in ChickWeight corresponding to Chick 1. Recall from Section 1.2 that testing for equality is specified with == and not =. If you don’t fully understand the code, don’t worry for now; we’ll cover the filter function in more depth in Chapter 3 on data wrangling.

chick1_weight <- filter(ChickWeight, Chick == 1)

Learning check

(LC2.7) Take a look at both the ChickWeight and chick1_weight data frames by running View(ChickWeight) and View(chick1_weight). In what respect do these data frames differ?

(LC2.8) Run glimpse() on the ChickWeight data frame again. Notice that two of the variables use a data type that you have seen previously. What are the data types?

(LC2.9) One of the ChickWeight variables uses a data type that you have not seen previously. What abbreviation is used for the new data type?

These ord and fct data types are used in R for categorical variables, when only a limited set of options or “levels” is available. The <ord> or “ordered factor” data type is used when the different levels of the categorical variable are in a specific order, such as Small, Medium, and Large. The <fct> or (unordered) “factor” is used when the levels are not in a specific order, such as Red, Blue, and Green.
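As a small sketch of the two data types, the following base R code builds one unordered and one ordered factor from the example levels mentioned above. The object names colors_fct and sizes_ord are made up for this illustration.

```r
# Unordered factor: the levels have no inherent order
colors_fct <- factor(c("Red", "Blue", "Green", "Blue"))

# Ordered factor: the levels are listed from smallest to largest
sizes_ord <- factor(c("Medium", "Small", "Large"),
                    levels = c("Small", "Medium", "Large"),
                    ordered = TRUE)

is.ordered(colors_fct)   # FALSE
is.ordered(sizes_ord)    # TRUE

# Comparisons such as < are only meaningful for ordered factors
sizes_ord[2] < sizes_ord[1]   # TRUE, since Small comes before Medium

# The same checks can be applied to the ChickWeight variables
is.ordered(ChickWeight$Chick)
is.ordered(ChickWeight$Diet)
```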

2.4.1 Linegraphs via `geom_line`

Let’s create a time series plot of the body weights saved in the chick1_weight data frame by using geom_line() to create a linegraph, instead of using geom_point() like we used previously to create scatterplots:

ggplot(data = chick1_weight, 
       mapping = aes(x = Time, y = weight)) +
  geom_line()

FIGURE 2.7: Body weights of Chick 1 on Diet 1.

Much as with the ggplot() code that created the scatterplot of brain weight and life span for mammals in Figure 2.2, let’s break down this code piece-by-piece in terms of the grammar of graphics:

Within the ggplot() function call, we specify two of the components of the grammar of graphics as arguments:

The data to be the chick1_weight data frame by setting data = chick1_weight.
The aesthetic mapping by setting mapping = aes(x = Time, y = weight). Specifically, the variable Time maps to the x position aesthetic, while the variable weight maps to the y position aesthetic.

We add a layer to the ggplot() function call using the + sign. The layer in question specifies the third component of the grammar: the geometric object. In this case, the geometric object is a line set by specifying geom_line().
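Further layers are added in the same way. As a sketch only, the code below adds a geom_point() layer to mark the individual measurements and a labs() layer for axis labels and a title. The label text is our own wording, based on the description of the variables in the ChickWeight help file, and is not taken from Figure 2.7.

```r
library(ggplot2)
library(dplyr)

# Rows of ChickWeight belonging to Chick 1
chick1_weight <- filter(ChickWeight, Chick == 1)

p <- ggplot(data = chick1_weight,
            mapping = aes(x = Time, y = weight)) +
  geom_line() +     # connect consecutive measurements
  geom_point() +    # mark each individual measurement
  labs(x = "Time (days since birth)",
       y = "Body weight (g)",
       title = "Body weight of Chick 1")
p
```

Both geometric objects reuse the data and the aesthetic mapping given inside ggplot(), so the points land exactly on the line.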

Learning check

(LC2.10) Why should linegraphs be avoided when there is not a clear ordering of the x-axis?

(LC2.11) Why are linegraphs frequently used when time is the explanatory variable on the x-axis?

2.4.2 Summary

Linegraphs, just like scatterplots, display the relationship between two numerical variables. However, it is preferred to use linegraphs over scatterplots when the variable on the x-axis (i.e., the explanatory variable) has an inherent ordering, such as time or altitude.

2.5 5NG#3: Histograms

Let’s consider the weight variable in the ChickWeight data frame once again, but unlike with the linegraphs in Section 2.4, let’s say we don’t care about its relationship with time, but rather we only care about how the values of weight are distributed in 3-week-old chicks. In other words:

What are the smallest and largest values?
What is the “center” or “most typical” value?
How do the values spread out?
What are frequent and infrequent values?

First, we filter the ChickWeight data frame for the desired rows, each chick’s weight at Day 21:

chick_weight_d21 <- filter(ChickWeight, Time == 21)

One way to visualize the distribution of this single variable weight is to plot the weights on a horizontal line as we do in Figure 2.8:

FIGURE 2.8: Plot of Day 21 body weights for all chicks.

This gives us a general idea of how the values of weight distribute: observe that weights vary from around 70 grams up to 370 grams. Furthermore, there appear to be more recorded weights between 150 g and 250 g than outside this range. However, because of the high degree of overplotting in the points, it’s hard to get a sense of exactly how many values are between say 150 g and 200 g.

What is commonly produced instead of Figure 2.8 is known as a histogram. A histogram is a plot that visualizes the distribution of a numerical variable as follows:

We first cut up the x-axis into a series of bins, where each bin represents a range of values.
For each bin, we count the number of observations that fall in the range corresponding to that bin.
Then for each bin, we draw a bar whose height marks the corresponding count.

Let’s drill-down on an example of a histogram, shown in Figure 2.9.

FIGURE 2.9: Example histogram.

Let’s focus only on weights between 100 g and 200 g for now. There are five bins of equal width in this range, so each bin is 20 g wide: one bin for the 100-120 g range, another bin for the 120-140 g range, and so on. For example:

The bins for the 100-120 g and 120-140 g ranges each have a height of 1. In other words, one of the Day 21 weight recordings is between 100-120 g and another one is between 120-140 g.
The bin for the 140-160 g range has a height of 6. In other words, six of the Day 21 weight recordings are between 140-160 g.

All 16 bins spanning 60 g to 380 g on the x-axis have this interpretation.
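The three steps of building a histogram can be sketched by hand in base R. The code below assumes bin edges every 20 g from 60 g to 380 g, as described for Figure 2.9, and assumes that a value exactly on an edge belongs to the bin on its left, which is the default behavior of cut(); the figure itself may treat edge values differently.

```r
# Day 21 weights from the built-in ChickWeight data (base R only)
weights_d21 <- ChickWeight$weight[ChickWeight$Time == 21]

# Step 1: cut the axis into bins of width 20 g from 60 g to 380 g
breaks <- seq(from = 60, to = 380, by = 20)
bins <- cut(weights_d21, breaks = breaks)

# Step 2: count the observations that fall in each bin
bin_counts <- table(bins)
bin_counts

# Step 3: draw one bar per bin, with height equal to the count
barplot(bin_counts, las = 2, ylab = "count")
```

The printed table has one entry per bin, and the entries for the 100-120 g, 120-140 g, and 140-160 g bins can be checked against the bar heights read off the histogram.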

2.5.1 Histograms via `geom_histogram`

Let’s now present the ggplot() code to plot your first histogram! Unlike with scatterplots and linegraphs, there is now only one variable being mapped in aes(): the single numerical variable weight. The y-aesthetic of a histogram, the count of the observations in each bin, gets computed for you automatically. Furthermore, the geometric object layer is now a geom_histogram(). After running the following code, you’ll see the histogram in Figure 2.10 as well as a warning message. We’ll discuss the warning message first.

ggplot(data = chick_weight_d21, mapping = aes(x = weight)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

FIGURE 2.10: Histogram of chick weights at day 21.

The warning message is telling us that the histogram was constructed using bins = 30 for 30 equally spaced bins. This is known in computer programming as a default value; unless you override this default number of bins with a number you specify, R will choose 30 by default. We’ll see in the next section how to change the number of bins to another value than the default.

Now let’s unpack the resulting histogram in Figure 2.10. Observe that values less than 150 g as well as values above 325 g are rather rare. However, because of the large number of bins, it’s hard to get a sense for which range of weights is spanned by each bin; everything is one giant amorphous blob. So let’s add white vertical borders demarcating the bins by adding a color = "white" argument to geom_histogram() and ignore the warning about setting the number of bins to a better value:

ggplot(data = chick_weight_d21, mapping = aes(x = weight)) +
  geom_histogram(color = "white")

FIGURE 2.11: Histogram of chick weights at day 21 with white borders.

We now have an easier time associating ranges of weights to each of the bins in Figure 2.11. We can also vary the color of the bars by setting the fill argument. For example, you can set the bin colors to be “blue steel” by setting fill = "steelblue":

ggplot(data = chick_weight_d21, mapping = aes(x = weight)) +
  geom_histogram(color = "white", fill = "steelblue")

If you’re curious, run colors() to see all 657 possible choices of colors in R!

2.5.2 Adjusting the bins

Observe in Figure 2.11 that in the 100-200 g range there appear to be about 10 bins. Thus each bin has width 100 divided by about 10, or about 10 g. Let’s improve this histogram by adjusting the number of bins in our histogram in one of two ways:

By adjusting the number of bins via the bins argument to geom_histogram().
By adjusting the width of the bins via the binwidth argument to geom_histogram().

Using the first method, we have the power to specify how many bins we would like to cut the x-axis up into. As mentioned in the previous section, the default number of bins is 30. We can override this default, to say 20 bins, as follows:

ggplot(data = chick_weight_d21, mapping = aes(x = weight)) +
  geom_histogram(bins = 20, color = "white")

Using the second method, instead of specifying the number of bins, we specify the width of the bins by using the binwidth argument in the geom_histogram() layer. For example, let’s set the width of each bin to be 20 g.

ggplot(data = chick_weight_d21, mapping = aes(x = weight)) +
  geom_histogram(binwidth = 20, color = "white")

We compare both resulting histograms side-by-side in Figure 2.12.

FIGURE 2.12: Setting histogram bins in two ways.
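Setting binwidth fixes how wide the bins are, but not where their edges fall, so the edges need not land on round values such as 60 g, 80 g, and 100 g. As a sketch, the boundary argument of geom_histogram() requests a bin edge at a given value. We assume here that edges at multiples of 20 g are wanted, as in the bins described for Figure 2.9; this is not necessarily how that figure was drawn.

```r
library(ggplot2)
library(dplyr)

# Each chick's weight at Day 21
chick_weight_d21 <- filter(ChickWeight, Time == 21)

# boundary = 60 asks for a bin edge at 60 g, so with binwidth = 20
# the bins cover 60-80 g, 80-100 g, 100-120 g, and so on
p <- ggplot(data = chick_weight_d21, mapping = aes(x = weight)) +
  geom_histogram(binwidth = 20, boundary = 60, color = "white")
p

# Left edges of the bins that ggplot2 computed
layer_data(p, 1)$xmin
```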

Learning check

(LC2.12) What does changing the number of bins from 30 to 20 tell us about the distribution of weights?

(LC2.13) Would you classify the distribution of weights as symmetric or skewed in one direction or another?

(LC2.14) What would you guess is the “center” value in this distribution? Why did you make that choice?

(LC2.15) Is this data spread out greatly from the center or is it close? Explain your answer.

2.5.3 Summary

Histograms, unlike scatterplots and linegraphs, present information on only a single numerical variable. Specifically, they are visualizations of the distribution of the numerical variable in question.

2.6 Facets

Before continuing with the next of the 5NG, let’s briefly introduce a new concept called faceting. Faceting is used when we’d like to split a particular visualization by the values of another variable. This will create multiple copies of the same type of plot with matching x and y axes, but whose content will differ.

For example, suppose we were interested in comparing the Day 21 weights for chicks on the four different diets. We could “duplicate” the histogram above but plot histograms of weight on Day 21 for each Diet separately. We make a histogram for chick weights on each diet by adding a facet_wrap(~Diet) layer. Note the ~ is a “tilde” and can generally be found on the key next to the “1” key on US keyboards. The tilde is required and you’ll receive an error message if you don’t include it here.

ggplot(data = chick_weight_d21, mapping = aes(x = weight)) +
  geom_histogram(bins = 20, color = "white") +
  facet_wrap(~Diet)

FIGURE 2.13: Faceted histogram of Day 21 chick weights by diet.

We can also specify the number of rows and columns in the grid by using the nrow and ncol arguments inside of facet_wrap(). For example, say we would like our faceted histogram to have one column for easier comparisons. We simply add an ncol = 1 argument to facet_wrap(~Diet).

ggplot(data = chick_weight_d21, mapping = aes(x = weight)) +
  geom_histogram(bins = 20, color = "white") +
  facet_wrap(~Diet, ncol = 1)

FIGURE 2.14: Faceted histogram with 1 instead of 2 columns.

While we can compare the distribution of chick weights on the four different diets in both Figure 2.13 and Figure 2.14, the comparison is easier in the second faceted graph.

Learning check

(LC2.16) What do the numbers above each plot correspond to? What about the numbers below the bottom plots?

(LC2.17) What other things do you notice about this faceted plot?

(LC2.18) Does the weight gain on each diet in the chick_weight_d21 dataset have a lot of variability? Why do you say that?

(LC2.19) How does a faceted plot help us see relationships between two variables?

(LC2.20) For which types of datasets would faceted plots not work well in comparing relationships between variables? Give an example describing the nature of these variables and other important characteristics.

2.7 5NG#4: Boxplots

While faceted histograms are one type of visualization used to compare the distribution of a numerical variable split by the values of another variable, another type of visualization that achieves this same goal is a side-by-side boxplot. A boxplot is constructed based on the five-number summary of the distribution of a numerical variable: that is, the minimum, first quartile (25th percentile), median (50th percentile), third quartile (75th percentile), and maximum values (described in Appendix A.1).

To keep things simple for now, let’s only consider the Day 21 weight recordings for the 16 chicks on Diet 1, each represented as a jittered point in Figure 2.15.

FIGURE 2.15: Day 21 weights of Diet 1 chicks represented as jittered points.

These 16 observations have the following five-number summary:

Minimum: 96 g
First quartile: 138 g
Median: 166 g
Third quartile: 208 g
Maximum: 305 g

In the leftmost plot of Figure 2.16, let’s mark these 5 values with dashed horizontal lines on top of the 16 points. In the middle plot of Figure 2.16 let’s add the boxplot. In the rightmost plot of Figure 2.16, let’s remove the points and the dashed horizontal lines for clarity’s sake.

FIGURE 2.16: Building up a boxplot of Diet 1 chick weights.

What the boxplot does is visually summarize the 16 points by cutting the 16 weight recordings into quartiles at the dashed lines, where each quartile contains roughly 16 \(\div\) 4 = 4 observations. Thus:

25% of points fall below the bottom edge of the box, which is the first quartile of 138 g. In other words, 25% of observations were below 138 g.
25% of points fall between the bottom edge of the box and the solid middle line, which is the median of 166 g. Thus, 25% of observations were between 138 g and 166 g and 50% of observations were below 166 g.
25% of points fall between the solid middle line and the top edge of the box, which is the third quartile of 208 g. It follows that 25% of observations were between 166 g and 208 g and 75% of observations were below 208 g.
25% of points fall above the top edge of the box. In other words, 25% of observations were above 208 g.
The middle 50% of points lie within the interquartile range (IQR) between the first and third quartile. Thus, the IQR for this example is 208 - 138 = 70 g. The interquartile range is a measure of a numerical variable’s spread.

Furthermore, in the rightmost plot of Figure 2.16, we see the whiskers of the boxplot. The whiskers stick out from either end of the box all the way to the minimum and maximum observed weights of 96 g and 305 g, respectively. However, the whiskers don’t always extend to the smallest and largest observed values as they do here. They in fact extend no more than 1.5 \(\times\) the interquartile range from either end of the box, which for the Diet 1 weights means no more than 1.5 \(\times\) 70 g = 105 g from either end of the box. Any observed values outside this range get marked with points called outliers, which we’ll see in the next section.
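The five-number summary, the IQR, and the whisker limits described above can be checked directly in R. The following is a minimal sketch, assuming the dplyr package is loaded and using R’s built-in ChickWeight data; the object names (diet1_d21, five_num, and so on) are our own illustrative choices. Because quantile() interpolates between observations, the quartiles it reports may differ slightly from the rounded values quoted in the text.

```r
library(dplyr)

# Day 21 weights for the chicks on Diet 1
diet1_d21 <- ChickWeight %>%
  filter(Time == 21, Diet == 1)

# By default quantile() returns the minimum, the three quartiles, and the maximum
five_num <- quantile(diet1_d21$weight)
five_num

# Interquartile range: third quartile minus first quartile
iqr <- IQR(diet1_d21$weight)

# Whiskers may reach at most 1.5 x IQR beyond either edge of the box
lower_limit <- five_num[["25%"]] - 1.5 * iqr
upper_limit <- five_num[["75%"]] + 1.5 * iqr

# Whisker ends are the most extreme observed weights inside those limits
whisker_low <- min(diet1_d21$weight[diet1_d21$weight >= lower_limit])
whisker_high <- max(diet1_d21$weight[diet1_d21$weight <= upper_limit])
c(whisker_low, whisker_high)
```

Any weight below lower_limit or above upper_limit would be drawn as an outlier point rather than covered by a whisker.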

2.7.1 Boxplots via `geom_boxplot`

Let’s now create a side-by-side boxplot of Day 21 weights split by the four diets. We do this by first filtering for the rows where Time equals 21 days. We then plot these values, mapping the Diet variable to the x-position aesthetic, the weight variable to the y-position aesthetic, and adding a geom_boxplot() layer:

Time21_ChickWeight <- ChickWeight %>%
  filter(Time == 21)
ggplot(data = Time21_ChickWeight, mapping = aes(x = Diet, y = weight)) +
  geom_boxplot()

FIGURE 2.17: Side-by-side boxplot of weight split by Diet.

Note that boxplots require a categorical – not a numerical – variable to be mapped to the x-position aesthetic. Although the Diets are numbered, recall that this variable is stored as a factor (<fct>) data type, which you can confirm using glimpse(ChickWeight). So R knows that this is a categorical variable on the x-position aesthetic. If the variable Diet had been stored as a numerical data type (<dbl> or <int>) instead, we would have needed to convert it into a categorical variable by using the factor() function.
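To illustrate the factor() conversion, here is a small sketch under a made-up assumption: we create a hypothetical column diet_num that stores the diet as a plain number, then wrap it in factor() inside aes() so that ggplot2 treats it as categorical. The names chick_d21 and diet_num are ours and do not appear elsewhere in this book.

```r
library(dplyr)
library(ggplot2)

chick_d21 <- ChickWeight %>%
  filter(Time == 21)

# Hypothetical: pretend the diet had been recorded as a number, not a factor
chick_d21$diet_num <- as.numeric(as.character(chick_d21$Diet))

# factor() turns the numerical column back into a categorical variable,
# so we get one box per diet
diet_boxplot <- ggplot(chick_d21, aes(x = factor(diet_num), y = weight)) +
  geom_boxplot()
diet_boxplot
```

Try removing factor() from the aes() call to see how ggplot2 handles a numerical variable on the x-position aesthetic of a boxplot.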

The resulting Figure 2.17 shows four separate “box and whiskers” plots similar to the rightmost plot of Figure 2.16, which showed only the Diet 1 weights. Thus the different boxplots are shown “side-by-side.”

The “box” portions of the visualization represent the 1st quartile, the median (the 2nd quartile), and the 3rd quartile.
The height of each box (the value of the 3rd quartile minus the value of the 1st quartile) is the interquartile range (IQR). It is a measure of the spread of the middle 50% of values, with longer boxes indicating more variability.
The “whisker” portions of these plots extend out from the bottoms and tops of the boxes and represent points less than the 25th percentile and greater than the 75th percentile, respectively. They’re set to extend out no more than \(1.5 \times IQR\) units away from either end of the boxes. We say “no more than” because the ends of the whiskers have to correspond to observed weights. The length of these whiskers shows how the data outside the middle 50% of values vary, with longer whiskers indicating more variability.

If any values had been more than \(1.5 \times IQR\) away from either box end and therefore beyond the whiskers, then they would have been indicated by dots. These so-called outliers can be thought of as anomalous (“out-of-the-ordinary”) values. It is important to keep in mind that the definition of an outlier is somewhat arbitrary and not absolute. In these boxplots, they are defined by the length of the whiskers, which are no more than \(1.5 \times IQR\) units long for each boxplot.

Looking at this side-by-side plot we can see that Diet 3 produced chicks with the highest median weight at Day 21 as evidenced by the higher solid line in the middle of its box. We can easily compare weights across diets by drawing imaginary horizontal lines across the plot. Furthermore, the heights of the four boxes as quantified by the interquartile ranges are informative too; they tell us about variability, or spread, of weights recorded for a given diet.
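The quantities that a side-by-side boxplot displays can also be computed as numbers. The sketch below is a preview of the group_by() and summarize() functions from the dplyr package (covered in Chapter 3) and assumes the built-in ChickWeight data; the name diet_summary and its column names are our own illustrative choices. It reports, for each diet, the number of chicks, the median weight, the IQR, and how many weights fall more than \(1.5 \times IQR\) beyond the box.

```r
library(dplyr)

diet_summary <- ChickWeight %>%
  as_tibble() %>%
  filter(Time == 21) %>%
  group_by(Diet) %>%
  summarize(
    n = n(),
    median_weight = median(weight),
    iqr_weight = IQR(weight),
    # Count weights lying beyond 1.5 x IQR from either end of the box
    n_outliers = sum(
      weight < quantile(weight, 0.25) - 1.5 * IQR(weight) |
        weight > quantile(weight, 0.75) + 1.5 * IQR(weight)
    )
  )
diet_summary
```

Each row of diet_summary corresponds to one box in Figure 2.17: median_weight is the solid middle line and iqr_weight is the height of the box.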

Learning check

(LC2.21) Which diet produced the greatest variability in weights? Explain your reasoning.

(LC2.22) Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?

2.7.2 Summary

Side-by-side boxplots provide us with a way to compare the distribution of a numerical variable across multiple values of another variable. One can see where the median falls across the different groups by comparing the solid lines in the center of the boxes.

To study the spread of a numerical variable within one of the boxes, look at both the length of the box and also how far the whiskers extend from either end of the box. Outliers are even more easily identified when looking at a boxplot than when looking at a histogram as they are marked with distinct points.

2.8 5NG#5: Barplots

Both histograms and boxplots are tools to visualize the distribution of numerical variables. Another commonly desired task is to visualize the distribution of a categorical variable. This is a simpler task, as we are simply counting different categories within a categorical variable, also known as the levels of the categorical variable. Often the best way to visualize these different counts, also known as frequencies, is with barplots (also called barcharts).

One complication, however, is how your data is represented. Is the categorical variable of interest “pre-counted” or not? For example, run the following code that manually creates two data frames representing a collection of fruit: 3 apples and 2 oranges.

fruits <- tibble(
  fruit = c("apple", "apple", "orange", "apple", "orange")
)
fruits_counted <- tibble(
  fruit = c("apple", "orange"),
  number = c(3, 2)
)

We see both the fruits and fruits_counted data frames represent the same collection of fruit. Whereas fruits just lists the fruit individually…

# A tibble: 5 × 1
  fruit 
  <chr> 
1 apple 
2 apple 
3 orange
4 apple 
5 orange

… fruits_counted has a variable number which represents the “pre-counted” values of each fruit.

# A tibble: 2 × 2
  fruit  number
  <chr>   <dbl>
1 apple       3
2 orange      2

Depending on how your categorical data is represented, you’ll need to add a different geometric layer type to your ggplot() to create a barplot, as we now explore.
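The two representations are related: a data frame that is not pre-counted can be turned into a pre-counted one. A minimal sketch, assuming the dplyr and tibble packages are loaded, uses dplyr’s count() function; the name fruits_recounted is our own. Note that count() stores the counts as integers (<int>) rather than as <dbl>.

```r
library(dplyr)
library(tibble)

fruits <- tibble(
  fruit = c("apple", "apple", "orange", "apple", "orange")
)

# Count the rows for each fruit and store the counts in a column called number
fruits_recounted <- fruits %>%
  count(fruit, name = "number")
fruits_recounted
```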

2.8.1 Barplots via `geom_bar` or `geom_col`

Let’s generate barplots using these two different representations of the same basket of fruit: 3 apples and 2 oranges. Using the fruits data frame where all 5 fruits are listed individually in 5 rows, we map the fruit variable to the x-position aesthetic and add a geom_bar() layer:

ggplot(data = fruits, mapping = aes(x = fruit)) +
  geom_bar()

FIGURE 2.18: Barplot when counts are not pre-counted.

However, using the fruits_counted data frame where the fruits have been “pre-counted”, we once again map the fruit variable to the x-position aesthetic, but here we also map the number variable to the y-position aesthetic, and add a geom_col() layer instead.

ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) +
  geom_col()

FIGURE 2.19: Barplot when counts are pre-counted.

Compare the barplots in Figures 2.18 and 2.19. They are identical because they reflect counts of the same five fruits. However, depending on how our categorical data is represented, either “pre-counted” or not, we must add a different geom layer. When the categorical variable whose distribution you want to visualize

Is not pre-counted in your data frame, we use geom_bar().
Is pre-counted in your data frame, we use geom_col() with the y-position aesthetic mapped to the variable that has the counts.

Let’s now go back to the mammals data frame from earlier (Subsection 1.4) and visualize the distribution of the categorical variable predation, which indicates the likelihood of the mammalian species to be preyed upon (1, lowest; 5, highest). In other words, let’s visualize the number of mammalian species in each of the 5 predation categories. Recall from Subsection 1.4.3 when you first explored the mammals data frame, you saw that each row corresponds to a species. In other words, the mammals data frame is more like the fruits data frame than the fruits_counted data frame because the species have not been pre-counted by predation. Thus we should use geom_bar() instead of geom_col() to create a barplot. Much like a geom_histogram(), there is only one variable in the aes() aesthetic mapping: the variable predation gets mapped to the x-position.

ggplot(data = mammals, mapping = aes(x = predation)) +
  geom_bar()

FIGURE 2.20: Number of species in each predation category using geom_bar().

Observe in Figure 2.20 that predation category 4 has the fewest species. Also notice that bar graphs have white space between the bars whereas histograms have bars that touch.

Alternatively, you may have a data frame where the number of mammalian species in each predation category was pre-counted as in Table 2.3.

TABLE 2.3: Number of mammalian species pre-counted by predation category.
predation	number
1	14
2	15
3	12
4	7
5	14

In order to create a barplot visualizing the distribution of the categorical variable predation in this case, we would now use geom_col() instead of geom_bar(), with an additional y = number in the aesthetic mapping on top of the x = predation. The resulting barplot would be identical to Figure 2.20.
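As a sketch of this pre-counted case, the code below enters the counts from Table 2.3 by hand and plots them with geom_col(). The data frame name mammals_counted is our own illustrative choice, and the tibble package is assumed to be loaded.

```r
library(ggplot2)
library(tibble)

# Counts typed in from Table 2.3
mammals_counted <- tibble(
  predation = 1:5,
  number = c(14, 15, 12, 7, 14)
)

# The counts already exist, so we map them to y and use geom_col()
predation_barplot <- ggplot(mammals_counted, aes(x = predation, y = number)) +
  geom_col()
predation_barplot
```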

Learning check

(LC2.23) How are histograms and barplots similar? How do they differ?

(LC2.24) What was the 2nd most common predation category for the mammalian species? How could we rearrange the table to get this answer more quickly?

2.8.2 Two categorical variables

Barplots are a very common way to visualize the frequency of different categories, or levels, of a single categorical variable. Another use of barplots is to visualize the joint distribution of two categorical variables at the same time. Let’s examine the joint distribution of mammalian species by predation as well as exposure (an index of how exposed the mammal is during sleep). In other words, let’s look at the number of species for each combination of predation and exposure, such as the number of species with predation category 2 and exposure category 1. Recall the ggplot() code that created the barplot of predation frequency in Figure 2.20:

ggplot(data = mammals, mapping = aes(x = predation)) + 
  geom_bar()

We can now map the additional variable exposure by adding fill = factor(exposure) inside the aes() aesthetic mapping. Note that we use the factor() function here because exposure is stored as an integer (int), and a stacked barplot needs a categorical variable in order to split each bar into distinctly colored segments. The factor() function converts exposure to the fct data type.

ggplot(data = mammals, mapping = aes(x = predation, fill = factor(exposure))) +
  geom_bar()

FIGURE 2.21: Stacked barplot of mammalian species by predation category and exposure index.

Figure 2.21 is an example of a stacked barplot. While simple to make, in certain aspects it is not ideal. For example, it is difficult to compare the heights of the different colors between the bars, corresponding to comparing the number of species with each exposure type between the predation categories.

Before we continue, let’s address some common points of confusion among new R users. First, the fill aesthetic corresponds to the color used to fill the bars, while the color aesthetic corresponds to the color of the outline of the bars. This is identical to how we added color to our histogram in Subsection 2.5.1: we set the outline of the bars to white by setting color = "white" and the colors of the bars to steel blue by setting fill = "steelblue". Observe in Figure 2.22 that mapping exposure to color and not fill yields grey bars with different colored outlines.

ggplot(data = mammals, mapping = aes(x = predation, color = factor(exposure))) +
  geom_bar()

FIGURE 2.22: Stacked barplot with color aesthetic used instead of fill.

Second, note that fill is another aesthetic mapping much like x-position; thus we were careful to include it within the parentheses of the aes() mapping. The following code, where the fill aesthetic is specified outside the aes() mapping, will not produce the desired plot. This is a fairly common error that new ggplot users make:

ggplot(data = mammals, mapping = aes(x = predation), fill = factor(exposure)) +
  geom_bar()
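The distinction is between mapping a variable to fill, which belongs inside aes(), and setting fill to one fixed color, which belongs outside aes() as an argument to the geom. The sketch below contrasts the two using a small made-up data frame, toy_mammals, which is only a hypothetical stand-in for the mammals data frame; its values are invented for illustration.

```r
library(ggplot2)
library(tibble)

# Hypothetical stand-in for the mammals data frame (values are made up)
toy_mammals <- tibble(
  predation = c(1, 1, 1, 2, 2, 3, 3, 3),
  exposure  = c(1, 2, 2, 1, 1, 1, 3, 3)
)

# Mapping: the fill color varies with a variable, so it goes inside aes()
p_mapped <- ggplot(toy_mammals, aes(x = predation, fill = factor(exposure))) +
  geom_bar()

# Setting: every bar gets the same fixed color, so fill goes outside aes(),
# as an argument to geom_bar()
p_fixed <- ggplot(toy_mammals, aes(x = predation)) +
  geom_bar(fill = "steelblue")

p_mapped
p_fixed
```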

An alternative to the stacked barplot is the side-by-side barplot, also known as a dodged barplot, as seen in Figure 2.23. The code to create a side-by-side barplot is identical to the code to create a stacked barplot, but with a position = "dodge" argument added to geom_bar(). In other words, we are overriding the default barplot type, which is a stacked barplot, and specifying it to be a side-by-side barplot instead.

ggplot(data = mammals, mapping = aes(x = predation, fill = factor(exposure))) +
  geom_bar(position = "dodge")

FIGURE 2.23: Side-by-side barplot comparing number of species by predation and exposure variables.
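In a dodged barplot, if some combinations of the two variables have no observations, the remaining bars in that group are widened to fill the available space. If you would rather keep all bars the same width, one option is ggplot2’s position_dodge() function with preserve = "single". The sketch below assumes a small made-up data frame, toy_mammals, as a hypothetical stand-in for the mammals data frame; its values are invented for illustration.

```r
library(ggplot2)
library(tibble)

# Hypothetical stand-in for the mammals data frame (values are made up);
# note that not every predation/exposure combination occurs
toy_mammals <- tibble(
  predation = c(1, 1, 1, 2, 2, 3, 3, 3),
  exposure  = c(1, 2, 2, 1, 1, 1, 3, 3)
)

# preserve = "single" keeps each bar the same width even when some
# combinations are absent
p_dodge_single <- ggplot(
  toy_mammals,
  aes(x = factor(predation), fill = factor(exposure))
) +
  geom_bar(position = position_dodge(preserve = "single"))
p_dodge_single
```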

Another type of graph to visualize two categorical variables at the same time is a faceted barplot. Recall in Section 2.6 we visualized the distribution of weight gain for the chicks on Diet 1 split by the Chick identifier (1-50) using facets. We apply the same principle to our barplot visualizing the frequency of predation split by exposure: instead of mapping exposure to fill, we include it as the variable to create small multiples of the plot across the levels of exposure.

ggplot(data = mammals, mapping = aes(x = predation)) +
  geom_bar() +
  facet_wrap(~ exposure, ncol = 1)

FIGURE 2.24: Faceted barplot comparing the number of species by the predation and exposure variables.

Finally, in Section 7.1, we will visualize the joint frequency of two categorical variables at the same time using a mosaic plot.

Learning check

(LC2.25) What kinds of questions are not easily answered by looking at Figure 2.21?

(LC2.26) What can you say, if anything, about the relationship between predation category and exposure index with regard to the number of mammalian species?

(LC2.27) Why might the side-by-side barplot be preferable to a stacked barplot in this case?

(LC2.28) What are the disadvantages of using a dodged barplot, in general?

(LC2.29) Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case?

(LC2.30) What information about the different exposure indexes of different predation categories is more easily seen in the faceted barplot?

2.8.3 Summary

Barplots are a common way of displaying the distribution of a categorical variable, or in other words the frequency with which the different categories (also called levels) occur. They are easy to understand and make it easy to make comparisons across levels. Furthermore, when trying to visualize the relationship of two categorical variables, you have many options: stacked barplots, side-by-side barplots, and faceted barplots. Depending on what aspect of the relationship you are trying to emphasize, you will need to make a choice between these three types of barplots and own that choice.

2.9 Plots to avoid!

When we visualize data, the goal is to display the key features as clearly and succinctly as possible. These key features may include the central location of our data (e.g., median), its spread (e.g., IQR), and the sample size of each group. Depending on the data to display (numerical/categorical, one/multiple variables), some plots are more effective at presenting the key features of the data. In this section, we describe two plots that are frequently seen in the scientific literature but are less effective than common alternatives and should therefore be avoided.

2.9.1 Barplots with a continuous variable

The first plot to avoid is a bar plot to display the distribution of a continuous (numerical) variable. But wait, then why was Section 2.8 all about how to create bar plots? Recall from the opening paragraph of Section 2.8 that barplots should be used to visualize the distribution of categorical variables, while histograms and boxplots should be used to visualize the distribution of numerical variables. Despite this, a recent study (Riedel et al. 2022) found that bar plots are more often used to inappropriately display continuous data than they are used appropriately to display count data.

So, what is the problem with using barplots to display continuous data? As shown in Figure 2.25, barplots hide important information that can be easily indicated using a scatterplot (dotplot) or boxplot. While the bar graph displays the mean of the data and some indication of its spread, it hides much of the data (zone of invisibility). In contrast, the dotplot succinctly presents the complete distribution of the numerical variable and indicates the sample size of each group. Furthermore, the bar graph includes a lot of “dead space” (zone of irrelevance) distracting our attention from the difference between means compared to the spread. For these reasons, more and more scientific journals are banning the use of barplots to display continuous data – and you should too.

FIGURE 2.25: Bar graphs hide important information about continuous data. Source: (Weissgerber et al. 2017), CC BY 4.0.
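As one possible alternative to a bar graph of means, the sketch below overlays the individual observations on a boxplot, using the Day 21 chick weights from the built-in ChickWeight data. This is an illustration of the general idea rather than the code behind Figure 2.25; the name chick_d21 and the jitter and transparency settings are our own choices.

```r
library(dplyr)
library(ggplot2)

chick_d21 <- ChickWeight %>%
  filter(Time == 21)

weights_plot <- ggplot(chick_d21, aes(x = Diet, y = weight)) +
  # outlier.shape = NA hides the boxplot's own outlier points so that
  # no observation is drawn twice
  geom_boxplot(outlier.shape = NA) +
  # Jitter horizontally only (height = 0) so the plotted weights stay exact
  geom_jitter(width = 0.15, height = 0, alpha = 0.6)
weights_plot
```

Every chick appears as a point, so the sample size and the full spread of weights on each diet are visible at a glance.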

Learning check

(LC2.31) Why should barplots be avoided when displaying continuous data?

(LC2.32) Why do you think people continue to use barplots when displaying continuous data?

2.9.2 Pie charts

One of the most common plots used to visualize the distribution of categorical data is the pie chart. While they may seem harmless enough, pie charts actually present a problem in that humans are unable to judge angles well. As Naomi Robbins describes in her book, Creating More Effective Graphs (Robbins 2013), we overestimate angles greater than 90 degrees and we underestimate angles less than 90 degrees. In other words, it is difficult for us to determine the relative size of one piece of the pie compared to another.

Let’s examine the same data used in our previous barplot of the number of mammalian species in each predation category in Figure 2.20, but this time we will use a pie chart in Figure 2.26. Try to answer the following questions:

How much larger is the portion of the pie for predation category 2 compared to category 4?
What is the second most common predation category for mammalian species?

FIGURE 2.26: The dreaded pie chart.

While it is quite difficult to answer these questions when looking at the pie chart in Figure 2.26, we can much more easily answer these questions using the barchart in Figure 2.20. This is true since barplots present the information in a way such that comparisons between categories can be made with single horizontal lines, whereas pie charts present the information in a way such that comparisons must be made by comparing angles.

Learning check

(LC2.33) Why should pie charts be avoided and replaced by barplots?

(LC2.34) Why do you think people continue to use pie charts?

2.10 Conclusion

2.10.1 Summary table

Let’s recap all five of the five named graphs (5NG) in Table 2.4 summarizing their differences. Using these 5NG, you’ll be able to visualize the distributions and relationships of variables contained in a wide array of datasets. This will be even more the case as we start to map more variables to more of each geometric object’s aesthetic attribute options, further unlocking the awesome power of the ggplot2 package.

TABLE 2.4: Summary of Five Named Graphs
	Named graph	Shows	Geometric object	Notes
1	Scatterplot	Relationship between 2 numerical variables	`geom_point()`
2	Linegraph	Relationship between 2 numerical variables	`geom_line()`	Used when there is a sequential order to x-variable, e.g., time
3	Histogram	Distribution of 1 numerical variable	`geom_histogram()`	Like boxplots, faceted histograms show distribution of 1 numerical variable split by the values of a categorical variable
4	Boxplot	Distribution of 1 numerical variable split by the values of a categorical variable	`geom_boxplot()`
5	Barplot	Distribution of 1 categorical variable	`geom_bar()` when counts are not pre-counted, `geom_col()` when counts are pre-counted	Stacked, side-by-side, and faceted barplots show the joint distribution of 2 categorical variables

2.10.2 Function argument specification

Let’s go over some important points about specifying the arguments (i.e., inputs) to functions. Run the following two segments of code:

# Segment 1:
ggplot(data = mammals, mapping = aes(x = predation)) +
  geom_bar()

# Segment 2:
ggplot(mammals, aes(x = predation)) +
  geom_bar()

You’ll notice that both code segments create the same barplot, even though in the second segment we omitted the data = and mapping = code argument names. This is because the ggplot() function by default assumes that the data argument comes first and the mapping argument comes second. As long as you specify the data frame in question first and the aes() mapping second, you can omit the explicit statement of the argument names data = and mapping =.

Going forward for the rest of this book, all ggplot() code will be like the second segment: we omit the explicit data = and mapping = argument names and respect the default ordering of the arguments. We’ll do this for brevity’s sake; it’s common to see this style when reviewing other R users’ code.
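The following sketch illustrates why the order matters once the names are dropped. It uses the built-in ChickWeight data so that it runs on its own; the object names are our own. Named arguments can be given in any order, whereas unnamed arguments are matched by position, so swapping them without names should raise an error (caught here with try()).

```r
library(ggplot2)

# Unnamed arguments in the default order: data first, mapping second
p_positional <- ggplot(ChickWeight, aes(x = weight)) +
  geom_histogram(bins = 20, color = "white")

# Named arguments may appear in any order
p_named_swapped <- ggplot(mapping = aes(x = weight), data = ChickWeight) +
  geom_histogram(bins = 20, color = "white")

# Unnamed arguments in the wrong order: ggplot() receives aes() as the data,
# which should fail; try() keeps the error from stopping the script
p_wrong_order <- try(ggplot(aes(x = weight), ChickWeight), silent = TRUE)
class(p_wrong_order)
```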

2.10.3 Additional resources

If you want to further unlock the power of the ggplot2 package for data visualization, we suggest that you check out RStudio’s “Data Visualization with ggplot2” cheatsheet. This cheatsheet summarizes much more than what we’ve discussed in this chapter. In particular, the ggplot2 cheatsheet presents many more than the 5 geometric objects we covered in this chapter while providing quick and easy to read visual descriptions. For all the geometric objects, it also lists all the possible aesthetic attributes one can tweak. In the current version of RStudio, you can access this cheatsheet by going to the RStudio Menu Bar -> Help -> Cheat Sheets -> “Data Visualization with ggplot2.” You can see a preview in the figure below.

FIGURE 2.27: Data Visualization with ggplot2 cheatsheet.

2.10.4 What’s to come

Recall in Figure 2.7 in Section 2.4 we visualized weight gain over time for Chick 1. This necessitated paring down the ChickWeight data frame to a new data frame chick1_weight consisting of body weight recordings only for Chick == 1:

chick1_weight <- filter(ChickWeight, Chick == 1)

This code segment is a preview of Chapter 3 on data wrangling using the dplyr package. Data wrangling is the process of transforming and modifying existing data with the intent of making it more appropriate for analysis purposes. For example, this code segment used the filter() function to create a new data frame (chick1_weight) by choosing only a subset of rows of an existing data frame (ChickWeight). In the next chapter, we’ll formally introduce the filter() and other data wrangling functions as well as the pipe operator %>% which allows you to combine multiple data wrangling actions into a single sequential chain of actions. On to Chapter 3 on data wrangling!
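As a small preview of the pipe, the sketch below writes the same filtering step in both styles and checks that they give the same result. It assumes the dplyr package is loaded; the name chick1_weight_piped is our own.

```r
library(dplyr)

# Function-call style, as used earlier in this chapter
chick1_weight <- filter(ChickWeight, Chick == 1)

# Pipe style: ChickWeight is passed in as the first argument of filter()
chick1_weight_piped <- ChickWeight %>%
  filter(Chick == 1)

# Both approaches keep exactly the same rows
identical(chick1_weight, chick1_weight_piped)
```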

References

Grolemund, Garrett, and Hadley Wickham. 2017. R for Data Science. First. Sebastopol, CA: O’Reilly Media. https://r4ds.had.co.nz/.

Riedel, Nico, Robert Schulz, Vartan Kazezian, and Tracey L Weissgerber. 2022. “Replacing bar graphs of continuous data with more informative graphics: Are we making progress?” Clinical Science, July. https://doi.org/10.1042/CS20220287.

Robbins, Naomi. 2013. Creating More Effective Graphs. First. New York, NY: Chart House.

Weissgerber, Tracey L., Marko Savic, Stacey J. Winham, Dejana Stanisavljevic, Vesna D. Garovic, and Natasa M. Milic. 2017. “Data Visualization, Bar Naked: A Free Tool for Creating Interactive Graphics.” Journal of Biological Chemistry 292 (50): 20592–98. https://doi.org/10.1074/jbc.RA117.000147.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2023. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

Wilkinson, Leland. 2005. The Grammar of Graphics (Statistics and Computing). First. Secaucus, NJ: Springer-Verlag.
