A boxplot provides an effective summary of one or more numeric variables, showcasing key statistical features through its distinct elements:
Median Line: The line that divides the box represents the median of the data. For example, if the median is 10, this indicates that half of the data points lie below 10 and half above.
Quartiles: The ends of the box indicate the upper (Q3) and lower (Q1) quartiles. If Q3 is 15, this means that 75% of the observations fall below this value.
Interquartile Range (IQR): The difference between Q3 and Q1 is known as the interquartile range (IQR), which measures the spread of the middle 50% of the data.
Whiskers: The lines extending from the box reach the most extreme observations that still lie between Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, so they mark the lowest and highest values once outliers are excluded.
Outliers: Dots (or other markers) beyond the whiskers indicate potential outliers in the dataset.
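The quantities in the list above can be reproduced by hand. The following is a minimal sketch in base R, assuming a small made-up vector `x` (not data from this article) and R's default quantile definition (type 7); other quartile definitions give slightly different numbers.

```r
# Hypothetical sample, chosen only to illustrate the arithmetic
x <- c(2, 4, 5, 7, 8, 9, 10, 11, 12, 13, 15, 30)

# Quartiles and median (default quantile definition, type 7)
q1  <- quantile(x, 0.25, names = FALSE)
med <- median(x)
q3  <- quantile(x, 0.75, names = FALSE)

# Interquartile range: spread of the middle 50% of the data
iqr <- q3 - q1

# Fences: limits beyond which a point is flagged as a potential outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Anything outside the fences is drawn as an individual point
outliers <- x[x < lower_fence | x > upper_fence]

print(c(q1 = q1, median = med, q3 = q3, iqr = iqr))
print(c(whisker_low = whisker_low, whisker_high = whisker_high))
print(outliers)
```

For this vector the value 30 lies above the upper fence (20.875), so it would be drawn as an outlier, and the upper whisker stops at 15.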
[Figure: anatomy of a boxplot]
A boxplot summarizes the distribution of a numeric variable across one or more groups, making it a convenient tool for quickly grasping differences between those groups.
However, this summarization also discards information, which can make the chart misleading.
Consider the boxplot below. It may seem evident that group C has higher values than the others. Yet, we can’t discern the underlying distribution of individual data points within each group or the total number of observations.
# Libraries
library(tidyverse)   # ggplot2, dplyr and the %>% pipe
library(hrbrthemes)  # theme_ipsum()
library(viridis)     # scale_fill_viridis()
# create a dataset
# note: group B combines two samples (means 13 and 18), which makes it bimodal
data <- data.frame(
  name = c(rep("A", 500), rep("B", 500), rep("B", 500), rep("C", 20), rep("D", 100)),
  value = c(rnorm(500, 10, 5), rnorm(500, 13, 1), rnorm(500, 18, 1), rnorm(20, 25, 4), rnorm(100, 12, 1))
)
# Plot
data %>%
  ggplot(aes(x = name, y = value, fill = name)) +
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE) +
  theme_ipsum() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 11)
  ) +
  ggtitle("A somewhat misleading boxplot") +
  xlab("")
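What the boxplot hides can also be checked numerically. The following is a minimal sketch, assuming the same simulated groups as in the article and a fixed seed; the seed `42` and the name `group_summary` are illustrative choices, not part of the original code.

```r
library(dplyr)

set.seed(42)  # assumption: the original code does not fix a seed

# Same simulated groups as in the article (group B mixes two samples)
data <- data.frame(
  name = c(rep("A", 500), rep("B", 500), rep("B", 500), rep("C", 20), rep("D", 100)),
  value = c(rnorm(500, 10, 5), rnorm(500, 13, 1), rnorm(500, 18, 1), rnorm(20, 25, 4), rnorm(100, 12, 1))
)

# Per-group sample size next to the statistics a boxplot draws
group_summary <- data %>%
  group_by(name) %>%
  summarize(
    n = n(),
    q1 = quantile(value, 0.25, names = FALSE),
    median = median(value),
    q3 = quantile(value, 0.75, names = FALSE),
    .groups = "drop"
  )

print(group_summary)
```

The `n` column shows that group C rests on only 20 observations, something the boxplot alone does not reveal.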
Let’s see what happens when the boxplot is improved using additional elements.
If the amount of data you are working with is not too large, adding jitter on top of your boxplot can make the graphic more insightful.
# Plot
data %>%
  ggplot(aes(x = name, y = value, fill = name)) +
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE) +
  geom_jitter(color = "grey", size = 0.7, alpha = 0.5) +
  theme_ipsum() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 11)
  ) +
  ggtitle("A boxplot with jitter") +
  xlab("")
Here, some new patterns become clear. Group C has a much smaller sample size than the other groups. This is definitely something to consider before concluding that group C has higher values than the others. Additionally, group B appears to have a bimodal distribution: its data points form two distinct clusters around y = 13 and y = 18.
If you have a large sample size, jitter may no longer be effective, as the dots overlap and render the figure uninterpretable.
An excellent alternative is the violin plot, which shows the shape of the distribution for each group, something a boxplot cannot do.
# sample size
sample_size <- data %>%
  group_by(name) %>%
  summarize(num = n())
# Plot
data %>%
  left_join(sample_size, by = "name") %>%
  mutate(myaxis = paste0(name, "\n", "n=", num)) %>%
  ggplot(aes(x = myaxis, y = value, fill = name)) +
  geom_violin(width = 1.4) +
  geom_boxplot(width = 0.1, color = "grey", alpha = 0.2) +
  scale_fill_viridis(discrete = TRUE) +
  theme_ipsum() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 11)
  ) +
  ggtitle("A violin plot") +
  xlab("")
Here it is very clear that the groups have different distributions. The bimodal distribution of group B becomes obvious. Violin plots are a powerful way to display information, and they are probably under-utilized compared to boxplots.
In the previous chart, the sample size for each group is indicated on the x-axis, below the group names. This is a good practice, as it highlights the under-representation of group C.
However, displaying the actual data points can often provide more insight. Therefore, a half-violin plot that includes the raw data can serve as an effective alternative.
It is called a raincloud plot!
library(ggplot2)
library(ggdist)
library(hrbrthemes)
library(dplyr)
library(viridis)
# Plot
data %>%
  ggplot(aes(x = factor(name), y = value, fill = factor(name))) +
  # Add half-violin from {ggdist} package
  stat_halfeye(
    adjust = 0.5,
    justification = -0.2,
    .width = 0,
    point_colour = NA
  ) +
  geom_boxplot(
    width = 0.12,
    outlier.color = NA,
    alpha = 0.5
  ) +
  stat_dots(
    side = "left",
    justification = 1.1,
    binwidth = 0.25
  ) +
  scale_fill_viridis(discrete = TRUE) +
  theme_ipsum() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 11)
  ) +
  ggtitle("A raincloud plot example") +
  xlab("")
A work by Yan Holtz for data-to-viz.com