These long form articles are part of a series in which I go through the book ‘R for Data Science’ and document my learnings and understanding of concepts in R in my own way.

The basics of data visualisation

We start with the fun part, exploring and visualizing data. This is considered by many to be the biggest pay-off when it comes to learning R and provided me with loads of motivation to keep learning more and more, and to be able to produce better looking graphs as a result.

To start off, we only need to load the tidyverse.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.8     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Note that the tidyverse loads eight packages and lists their current version numbers. If you get an error instead of the message above, you have to install the package first with install.packages("tidyverse").

You also see conflicts: some functions with the same name are provided by two packages. Here the dplyr versions of filter() and lag() mask the ones from the stats package. You can specify exactly which function you want to use with the package::function() notation, like so: ggplot2::ggplot().
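To make the conflict message more concrete, here is a small sketch of two functions that share the name filter() but do very different things. The mtcars data set and the 3-point moving average are my own picks for illustration, not something from the book.

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that match a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)

# stats::filter() applies a linear filter to a numeric series;
# here a 3-point moving average (an arbitrary choice for illustration)
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
moving_avg
```

Writing the package name in front removes any doubt about which of the two you are calling.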

Now let’s load up a data set about fuel usage for different types of cars (mpg). It comes with ggplot2, so it is available as soon as the tidyverse is loaded.

mpg
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
##  2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
##  3 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
##  4 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
##  5 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
##  6 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
##  7 audi         a4           3.1  2008     6 auto~ f        18    27 p     comp~
##  8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
##  9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
## 10 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
## # ... with 224 more rows
## # i Use `print(n = ...)` to see more rows

The mpg data set has 11 columns containing variables and 234 rows containing observations. Now we create our first plot.
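If you want to check these numbers yourself, something like the following should work. It is only a quick sketch; glimpse() is one of several ways to see all columns without the output being cut off.

```r
library(ggplot2)  # mpg ships with ggplot2
library(dplyr)    # glimpse() is available through dplyr

dim(mpg)      # number of rows, then number of columns
glimpse(mpg)  # one line per column, so no column gets truncated
```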

Mapping variables to the X and Y axis

We will plot engine displacement (displ) against highway miles per gallon (hwy). With the ggplot2 package you map variables to aesthetics, such as the position on the X and Y axis. You can do this explicitly like so:

# explicitly telling ggplot what to use
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

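A few cars in this plot have large engines but a better highway mileage than the overall trend suggests. One hypothesis is that these cars are hybrids. The class variable of the mpg data set classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular).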

You can also map the color aesthetic to the class variable. Here we will use the tidyverse way and make use of the pipe operator (%>%). The pipe takes everything on its left side and passes it as the first argument to the function on its right side. Your code becomes shorter and more intuitive to read.

# using the pipe operator the tidyverse way
mpg %>%  # take mpg and use it for the first argument on the right side
  ggplot(., aes(displ, hwy, color = class)) +  
  geom_point()

The dot character in ggplot(., aes(displ, hwy, color = class)) marks the place where mpg gets piped into. You can omit it here, because %>% passes the left side as the first argument by default.
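To see what the pipe and the dot do outside of a plot, here is a minimal sketch with a made-up numeric vector. The variable names are only for illustration.

```r
library(magrittr)  # provides %>% (the tidyverse packages re-export it)

x <- c(1, 4, 9)

nested <- sum(sqrt(x))            # nested calls are read inside-out
piped  <- x %>% sqrt() %>% sum()  # piped calls are read left to right

# The dot sends the left side somewhere other than the first argument
steps <- 2 %>% seq(1, 10, by = .)  # same as seq(1, 10, by = 2)
```

Both nested and piped hold the same value; the pipe only changes how the code reads.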

Mapping variables to aesthetics

Besides the X and Y axis (which are also aesthetics) there are several other aesthetics you can map variables to.

  • color
  • shape
  • alpha; transparency

Shape

We can also map the class to the shape aesthetic:

mpg %>% 
  ggplot(aes(displ, hwy, shape = class)) +
  geom_point()
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.

## Warning: Removed 62 rows containing missing values (geom_point).

Note the warnings: shapes are more difficult to tell apart than colors, so ggplot2 only uses 6 shapes at a time unless you specify them yourself. In this case we have 7 unique values for the class variable, so the last one (suv) has no shape assigned and its 62 rows are dropped from the plot.
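The warning suggests specifying the shapes manually. A minimal sketch of that, assuming you are happy with the shape codes 0 to 6 (an arbitrary pick on my side), could look like this:

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, shape = class)) +
  geom_point() +
  scale_shape_manual(values = 0:6)  # one shape code per class, chosen arbitrarily

p
```

With seven shapes supplied, every class gets its own symbol and no rows are dropped, although the plot is still harder to read than the colored version.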

Alpha

If we map the class variable to the alpha aesthetic, which controls the transparency of the points, the results are not as clear.

mpg %>% 
  ggplot(aes(displ, hwy, alpha = class)) +
  geom_point()
## Warning: Using alpha for a discrete variable is not advised.

The class variable is discrete and the alpha aesthetic is not best suited to highlight this.

Color and shape are better suited to display categorical variables while size and alpha are better used for continuous variables.

You can also manually set an aesthetic to a fixed value for a geom. You do this inside the geom function, here geom_point(), but outside of aes().

mpg %>% 
  ggplot(aes(displ, hwy, size = cyl)) +
  geom_point(color = "red")
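The difference between setting and mapping an aesthetic is easy to get wrong, so here is a small sketch that puts the two side by side. It uses layer_data() from ggplot2 to peek at the colors that are actually drawn; the object names are my own.

```r
library(ggplot2)

# Setting: color is a fixed property of the layer, so the points are red
p_set <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(color = "red")

# Mapping: "red" is treated as a variable with a single value, so ggplot2
# picks a color from its default palette and adds a legend
p_map <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = "red"))

unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```

In short: values inside aes() are treated as data, values outside aes() are taken literally.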

Facets

When dealing with categorical variables, facets are quite useful: they split the plot into panels that each display a subset of the data. You can use facet_wrap() to facet your plot by a single variable. The first argument that goes into facet_wrap() is a discrete variable, prefixed with the ~ character.

mpg %>% 
  ggplot(aes(displ, hwy)) +  # plot displacement versus highway miles per gallon
  geom_point() +  # add point geometry
  facet_wrap(~ class, nrow = 2)  # facet by class and use only 2 rows for the data

If you want to plot against two discrete variables, you can use facet_grid(). You use two variable names, separated by a ~.

mpg %>% 
  ggplot(aes(displ, hwy)) +  # plot displacement versus highway miles per gallon
  geom_point() +  # add point geometry
  facet_grid(drv ~ cyl)  # facet with drv in the rows and cyl in the columns
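The formula in facet_grid() reads as rows ~ columns. As a small sketch of that, you can put a dot on one side of the ~ to facet in only one direction (the object names are just for illustration):

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

by_row <- base + facet_grid(drv ~ .)  # one row of panels per drv value
by_col <- base + facet_grid(. ~ cyl)  # one column of panels per cyl value

by_row
by_col
```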

Geometric objects

Up until now we have only used the geom_point() geom to plot and show data. There are other options that you can use. Instead of geom_point() you might use geom_smooth(), which creates a smoothed line chart.

mpg %>% 
  ggplot(aes(displ, hwy)) +
  geom_smooth()

Both visualizations represent the same variables on the x and y axis and are based on the same dataset. With ggplot(), you can use different geoms to visualize your data. Every geom object takes a mapping argument but not every aesthetic will work with every geom. You may decide the shape of a point by using geom_point(shape = 5) but you can’t set the shape of a line like that.

However, instead of using the shape aesthetic, you can use the linetype aesthetic to draw a different type of line for each unique value of the variable that you map to linetype.

mpg %>% 
  ggplot(aes(displ, hwy)) + 
  geom_smooth(aes(linetype = drv))

In this plot we see one line per drv value. drv stands for a car’s drivetrain (the group of components that deliver mechanical power from the prime mover to the driven components). We see a line for 4-wheel drive, front-wheel drive and rear-wheel drive.

But where are the original points? To make it more clear, you can simply add the points by adding another geom. Everything you define in the first aes() function inside ggplot() is applied on all geoms. In the case below we color both the line and the points by their drv value and we define specific linetypes for the geom_smooth() component.

mpg %>% 
  ggplot(aes(displ, hwy, color = drv)) + 
  geom_smooth(aes(linetype = drv)) +
  geom_point()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

You can add multiple geoms to a plot to make absolutely insane graphs. There are over 40 geoms in ggplot2 and if you use ggplot2 extension packages you get even more. Take a look at the ggplot2 cheatsheet and if you want to learn more about a specific geom, open its help page, for example with ?geom_smooth.

Unlike geom_point(), which maps an observation (a single row of data) to a single point, geoms such as geom_smooth() map a whole set of observations (multiple rows of data) to a single object (a line chart).

# mapping a variable to a group doesn't add any aesthetic properties like color or size to the plot
mpg %>% 
  ggplot() +
  geom_smooth(aes(displ, hwy, group = drv))

# mapping a variable to an aesthetic like color obviously does
mpg %>% 
  ggplot() +
  geom_smooth(aes(displ, hwy, color = drv),
              show.legend = FALSE
  )

Multiple geoms

Simply add more geoms to the plot when you want to add more geometric elements. You may choose to map your variables inside the different geoms or inside ggplot(). You will see some duplicate code if you decide to map inside the geom functions.

# duplicate mapping
mpg %>% 
  ggplot() +
  geom_point(aes(displ, hwy)) +
  geom_smooth(aes(displ, hwy))

To avoid the duplication, put the mappings inside the ggplot() function. These are seen as global mappings and apply to all geoms. A global mapping is extended or overruled for a single layer if you explicitly specify another mapping inside that geom function.

mpg %>% 
  ggplot(aes(displ, hwy)) +  # global mappings
  geom_point() +  # empty mapping
  geom_smooth()   # empty mapping

You can add a specific mapping to a single geom only. In this case we map color to the class variable.

# adding color mapping to geom_point only
mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

You can also show a subset of the data for different layers (geoms). In this next plot, the smooth line represents a subset of the class variable. We only show the geom_smooth() for the suv class. You can remove the shaded area representing the standard error (se) by adding se = FALSE to the geom.

mpg %>% ggplot(aes(displ,hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(data = filter(mpg, class == "suv"), se = FALSE, show.legend = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Note that inside the geom_smooth() function you have to pass the subset explicitly with data = filter(mpg, class == "suv"). Layers are added with +, so the pipe no longer supplies the data for you at that point.
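If you would rather not repeat the name of the data set inside the layer, the data argument of a geom also accepts a function, which is called with the plot’s own data. The following is only a sketch of that idea; the base R subsetting inside the function is my own choice and could just as well be a filter() call.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(
    data = function(d) d[d$class == "suv", ],  # d is the data given to ggplot()
    se = FALSE
  )

p
```

This keeps working when the plot is built from a piped or otherwise unnamed data frame.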

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Statistical transformations

So far, we have mostly looked at plots that show the data as-is, such as scatter plots, without thinking about the calculations some geoms perform before plotting, such as creating new variables. In this section we’re going to look at these statistical transformations, starting with bar charts made with geom_bar().

We are using the diamonds data set which comes with ggplot2. It contains 10 variables for 53940 observations of diamonds.

diamonds
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows
## # i Use `print(n = ...)` to see more rows

A bar chart takes a specific variable and counts the number of observations for each unique value of that variable. In this case, those values are the cut ratings ranging from Fair to Ideal.

diamonds %>% 
  ggplot(aes(cut)) +
  geom_bar()
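To make the counting step visible, here is a rough sketch that does the count by hand with dplyr and then plots the precomputed numbers with geom_col(), which draws the values as they are. The name cut_counts is just something I made up.

```r
library(ggplot2)
library(dplyr)

# Do the counting ourselves ...
cut_counts <- diamonds %>%
  count(cut)
cut_counts

# ... and plot the precomputed bar heights as-is
ggplot(cut_counts, aes(cut, n)) +
  geom_col()
```

The resulting chart should look the same as the geom_bar() version; the only difference is who does the counting.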

Here are some examples of how different geoms transform the data prior to creating the plot:

  • bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
  • smoothers fit a model to your data and then plot predictions from the model.
  • boxplots compute a robust summary of the distribution and then display a specially formatted box.

The transformations are performed under the hood by the so-called ‘stat’ functions. You can inspect the default stat of a specific geom by typing ?geom_bar. You see that stat = 'count'. Every geom has its own default stat:

  • ?geom_point : stat = 'identity'
  • ?geom_boxplot : stat = 'boxplot'
  • ?geom_smooth : stat = 'smooth'
  • ?geom_bar : stat = 'count'

Usually you can use geoms such as geom_bar() and their respective transformation stat_count() interchangeably. In some cases you might want to use a specific geom but use another stat function or use a stat function instead of a geom.

Here we use geom_bar() and after_stat(prop) to map proportions instead of counts to the y-axis. Proportions are computed within each group, and by default every cut forms its own group, so if you leave the group setting out every bar shows a proportion of 1 (100%). You override this behaviour by putting all rows in one dummy group, with group = 1 or any other constant such as group = "whatever". The proportions are then computed over all observations.

diamonds %>% 
  ggplot() +
  geom_bar(aes(cut, after_stat(prop), group = "whatever"))

Here we use stat_count() to get the same results:

diamonds %>% 
  ggplot() +
  stat_count(aes(cut, after_stat(prop), group = 1))
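The effect of the dummy group is easiest to see in the numbers behind the bars. This sketch uses layer_data() to pull out the computed prop values with and without a shared group; the object names are my own.

```r
library(ggplot2)

# Default grouping: every cut is its own group, so each proportion is 1
per_cut <- ggplot(diamonds) +
  geom_bar(aes(cut, after_stat(prop)))

# One shared group: proportions are taken over all diamonds
overall <- ggplot(diamonds) +
  geom_bar(aes(cut, after_stat(prop), group = 1))

layer_data(per_cut)$prop
layer_data(overall)$prop
```

The second set of proportions adds up to 1 across the five cuts, which is what you usually want to show.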

You can use other stats for a succinct summary of the data too, such as stat_summary():

diamonds %>% 
  ggplot(aes(cut, depth)) + 
  stat_summary(mapping = aes(cut, depth),
               fun.max = max,
               fun.min = min,
               fun = median) 

Note that this comes close to what geom_boxplot() (a statistical summary) actually looks like. You can work around the default functionality provided by all the geoms when you start using stats for additional customization. I must say, geoms very often prove sufficient for your everyday needs (so far, they have for me).

diamonds %>% 
  ggplot(aes(cut, depth)) +
  geom_boxplot()

The boxplot reveals important information regarding the distribution of your data, such as:

  • the median value
  • the lower and upper quartiles (Q1 and Q3)
  • the interquartile range or IQR (the difference between Q3 and Q1)
  • the whiskers, which reach to the most extreme values within 1.5 times the IQR from the box
  • outliers
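To connect the box to actual numbers, here is a small sketch that computes the main ingredients per cut with dplyr. It is meant as an illustration of the summary a boxplot is built on, not as a replica of what geom_boxplot() does internally.

```r
library(ggplot2)
library(dplyr)

depth_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    q1     = quantile(depth, 0.25, names = FALSE),  # lower edge of the box
    median = median(depth),                         # line inside the box
    q3     = quantile(depth, 0.75, names = FALSE),  # upper edge of the box
    iqr    = IQR(depth)                             # height of the box
  )

depth_summary
```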

Position adjustments

When it comes to bar charts there is another tweak you can perform in order to change how the bar chart is displayed. You can fill with a color and map the color to a variable.

diamonds %>% 
  ggplot() +
  geom_bar(aes(cut, fill = cut))  # fill the bars with color based on the different cuts

There are a few noteworthy position adjustments for bar charts:

  • "stack"
  • "identity"
  • "fill"
  • "dodge"

Stack

You can add another variable by mapping fill to something other than cut, let’s say clarity. Now every bar is split into colored segments, one for each clarity within that cut. The default position for geom_bar() is "stack".

diamonds %>% 
  ggplot() +
  geom_bar(aes(cut, fill = clarity),  # add mapping (fill) for every subset of clarity in available cuts
           position = "stack")  # this is the default position for geom_bar() but it needs another variable in order to fill!

The stacking is performed automatically by the position adjustment, which you specify with the position argument inside geom_bar(). With "stack", the counts per clarity are stacked on top of each other within every cut.

Identity

The "identity" position places every object exactly where its values fall in the graph. It is the default position for 2d geoms like geom_point() but it is not very useful for bar charts: all segments start at zero, so the colored bars are drawn over each other. To see the overlap you need to make the bars slightly transparent.

diamonds %>% 
  ggplot(aes(cut, fill = clarity)) + 
  geom_bar(alpha = 1/5, position = "identity")

Fill

The "fill" position works just like the "stack" position but makes all the different bars equal height. This makes it easier to compare the proportions between groups.

diamonds %>% 
  ggplot() +
  geom_bar(aes(cut, fill = clarity), position = "fill")

Dodge

The "dodge" position places all objects side by side which makes it easier to compare the values per group.

diamonds %>% 
  ggplot() +
  geom_bar(aes(cut, fill = clarity), position = "dodge")

Jitter

There is another type of position adjustment that is quite handy, not for bar charts but for scatter plots. In the first plot that we made, many of the individual points were overlapping. The data set has 234 observations, but if you counted the individual points in the plot you wouldn’t arrive at this number. This problem is called overplotting. It happens because the values of displ and hwy are rounded, so the points fall on a grid and many of them land on exactly the same spot.

nrow(mpg)  #count rows in mpg dataset
## [1] 234
mpg %>% 
  ggplot(aes(displ, hwy)) +
  geom_point()

You can avoid this grid by using position = "jitter", which adds a small amount of random noise to all the points. There is even a shorthand for this position specifically, geom_jitter(). Take care with this position adjustment at small scales as it makes your graph less accurate. It can be quite revealing on larger scales and with lots of data to work with.

mpg %>% 
  ggplot(aes(displ, hwy)) +
  geom_point(position = "jitter")  # or use geom_jitter() instead
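Because the noise is random, the plot looks slightly different every time you run it. If that bothers you, position_jitter() lets you control the amount of noise and fix the random seed. The width, height and seed values below are arbitrary picks for this sketch.

```r
library(ggplot2)

jittered <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(
    position = position_jitter(
      width  = 0.1,  # horizontal noise, in units of displ
      height = 0.5,  # vertical noise, in units of hwy
      seed   = 42    # fixed seed, so the jitter is the same on every run
    )
  )

jittered
```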

Coordinate systems

Another aspect of building graphs is the coordinate system. The standard coordinate system used is the Cartesian system. In the Cartesian system, the X and Y axis act independently to determine the location of the data (go along the X axis, then up along the Y axis).

There are a few other coordinate systems that might be useful on the rare occasion:

  • coord_flip(): swaps the X and Y coordinates, very useful for data with a lot of groups and labels (specifically longer labels)!
mpg %>% 
  ggplot(aes(manufacturer, hwy)) + 
  geom_boxplot()

mpg %>% 
  ggplot(aes(manufacturer, hwy)) +
  geom_boxplot() +
  coord_flip()

Note that you can also get a horizontal boxplot without coord_flip() by swapping the X and Y aesthetics! I didn’t realize this until quite recently.

mpg %>% 
  ggplot(aes(y = class, x = hwy)) +  # instead of aes(x = class, y = hwy)
  geom_boxplot()
  • coord_polar(): uses polar coordinates in which each point on a plane is determined by a distance from a reference point and an angle from a reference direction. The polar coordinate system is probably most often used to make pie charts.
chart <- diamonds %>% 
  ggplot(aes(cut, fill = cut)) + 
  geom_bar()

chart + coord_flip()

chart + coord_polar()

  • coord_quickmap() is used when plotting spatial data such as maps to set the correct aspect ratios. You can clearly see the difference in the plots of France below.
# map_data() needs the maps package to be installed: install.packages("maps")
map <- map_data("france")

ggplot(map, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black")

ggplot(map, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()
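Coming back to coord_polar() and the pie charts mentioned above: the coord_polar() example with one bar per cut is technically not a pie chart yet. A common way to get one, sketched here under the assumption that a single stacked bar is what you want to bend into a circle, is to map the angle to the bar height with theta = "y".

```r
library(ggplot2)

pie <- ggplot(diamonds, aes(x = factor(1), fill = cut)) +
  geom_bar(width = 1) +     # one stacked bar that holds all cuts
  coord_polar(theta = "y")  # map the bar height to the angle

pie
```

The factor(1) on the X axis is only a placeholder so that all cuts end up in the same bar.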

The layered grammar of graphics

We’ve demonstrated that you build graphics using different layers. This way of visualizing data has been dubbed the grammar of graphics. Ultimately, the template for this is (the tidyverse way):

diamonds %>%  # specify the data you wish to use and pipe it in
  ggplot(aes()) +  # call the ggplot function and set global mapping if required
  geom_bar(  # use a specific geom for your plot
     mapping = aes(cut, fill = cut),  # set specific aesthetic mapping for the geom
     stat = "count",  # specify the stat used by the geom
     position = "stack"  # specify the position adjustments for the geom
  ) +
  coord_flip() +  # specify another coordinate system
  facet_wrap(~clarity, nrow = 4, ncol = 2)  # faceting by clarity in four rows and 2 columns

Remember, most often the default stat or position will work for you. There are cases where you want to override the defaults or customize your graph, and this knowledge will prove essential to manipulate your graph to your liking.