Box plots, sometimes called boxplots or box-and-whisker plots, are a standard way of displaying the distribution of a dataset variable based on five summary values: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. They provide a visual representation of the spread and skewness of the data, and are particularly useful for identifying outliers.
The key components of the box plot are:
box - The box represents the interquartile range (IQR), which contains the middle 50% of the data. Three elements of the box matter: the lower edge, the upper edge, and the middle line. The lower edge is the first quartile (Q1), which is the 25th percentile. The upper edge is the third quartile (Q3), which is the 75th percentile. The line inside the box is the median (Q2), which is the 50th percentile of the data.
whiskers - The whiskers extend from the edges of the box to the smallest and largest values within a specified range, often defined as 1.5 times the IQR beyond the quartiles. Any data sample beyond the whisker ends is considered an outlier.
outliers - Data samples that fall outside the whiskers. They are typically plotted as individual points, usually dots or circles, beyond the whiskers.
All these elements (box, whiskers, and outliers) are shown in Figure 1.
Figure 1 - Basic elements of the boxplot
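The quartiles, whisker ends, and outliers described above can be reproduced by hand. The following is a minimal sketch using NumPy on a small made-up sample; the values array and all variable names are illustrative assumptions, and 1.5 is the commonly used multiplier mentioned above.

```python
import numpy as np

# Small hypothetical sample; 30 is deliberately far from the rest
values = np.array([2, 4, 5, 6, 7, 8, 9, 10, 12, 30], dtype=float)

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Limits at 1.5 * IQR beyond the quartiles
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the limits
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the limits is drawn as an outlier
outliers = values[(values < lower_limit) | (values > upper_limit)]

print("Q1, median, Q3:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Note that the whiskers stop at actual data points (2 and 12 here), not at the computed limits themselves.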
The general form of the matplotlib.pyplot.boxplot() function is given below.
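Because the exact signature can differ between Matplotlib releases, one way to view it for the installed version is to inspect it at runtime. This is a minimal sketch using the standard inspect module; the variable names are illustrative.

```python
import inspect

import matplotlib.pyplot as plt

# Read the signature of boxplot() from the installed Matplotlib version
signature = inspect.signature(plt.boxplot)
print(signature)

# Parameter names in declaration order
parameter_names = list(signature.parameters)
print(parameter_names)
```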
The parameters of the matplotlib.pyplot.boxplot() function can be divided into:
Main parameters
Display options
Customization of Plot Appearance
Other Options
Main Parameters
The main parameters are x, notch, vert, patch_artist, widths, whis, bootstrap, usermedians, and conf_intervals.
x: The data that will be used to create the boxplot. It can be a list, numpy array, or any array-like object containing numerical data.
notch (default: None): If True, the boxplot will be drawn with a notch to indicate the confidence interval around the median. Notches give an idea of the uncertainty of the median.
vert (default: True): If True, the boxplot is drawn vertically. If False, it is drawn horizontally.
patch_artist (default: False): If True, the boxplot will be drawn with filled boxes (patches). Otherwise, the boxes will just be outlines.
widths (default: None): Specifies the width of the boxes. If a single value, all boxes have the same width. If an array, the widths will vary for each box.
whis (default: 1.5): Defines the position of the whiskers. Whiskers extend to the farthest data point within whis * IQR (interquartile range) from the quartiles. Outliers are data points beyond this range.
bootstrap (default: None): If specified, bootstrap resampling is used to calculate confidence intervals for the medians. The value specifies the number of bootstrap iterations.
usermedians (default: None): An array-like object providing custom median values for the boxes. If not provided, the medians are calculated from the data.
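conf_intervals (default: None): An array-like object specifying custom confidence intervals (the notch positions) for each box. If not provided, they are calculated from the data, using bootstrap resampling when bootstrap is set and an asymptotic approximation otherwise.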
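A minimal sketch of several main parameters used together is given below. The seeded random generator, the variable names, and the chosen values (widths=0.5, whis=(5, 95)) are assumptions made for illustration. The Agg backend line only keeps the sketch runnable without a display and can be dropped in an interactive session.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when working interactively
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: three normal samples with increasing spread
rng = np.random.default_rng(0)
samples = [rng.normal(0, std, 100) for std in range(1, 4)]

plt.figure(figsize=(8, 5))
result = plt.boxplot(
    samples,
    notch=True,         # draw notches around the medians
    patch_artist=True,  # filled boxes instead of outlines
    widths=0.5,         # same width for every box
    whis=(5, 95),       # whiskers at the 5th and 95th percentiles
)

# boxplot() returns a dictionary of the artists it created
print(sorted(result.keys()))
```

Passing whis as a pair of numbers places the whiskers at percentiles rather than at a multiple of the IQR, so points outside the 5th to 95th percentile range are drawn as outliers.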
Display options
The display options parameters are meanline, showmeans, showcaps, showbox, and showfliers. These parameters are described below.
meanline (default: False): If True (and showmeans is also True), the mean will be displayed as a line across the box instead of as a point.
showmeans (default: False): If True, the mean of the data will be shown in the boxplot as a point or line (depending on the meanline parameter).
showcaps (default: True): If True, the caps on the ends of the whiskers are displayed.
showbox (default: True): If True, the box (representing the interquartile range) is displayed. If False, the box is hidden.
showfliers (default: True): If True, outliers beyond the whiskers are displayed as points.
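The display options can be combined freely. The sketch below is an illustration under assumed data (a seeded random sample) and shows the mean as a line while hiding the caps and the outliers; the Agg backend line is only there so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when working interactively
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: three normal samples with increasing spread
rng = np.random.default_rng(1)
samples = [rng.normal(0, std, 100) for std in range(1, 4)]

plt.figure(figsize=(8, 5))
result = plt.boxplot(
    samples,
    showmeans=True,    # draw the mean of each sample
    meanline=True,     # ... as a line instead of a marker
    showcaps=False,    # no caps at the whisker ends
    showfliers=False,  # do not draw outliers
)

print("mean artists:", len(result["means"]))
print("cap artists:", len(result["caps"]))
print("flier artists:", len(result["fliers"]))
```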
Customization of Plot Appearance
The appearance of the boxplot can be customized through the boxprops, whiskerprops, capprops, flierprops, medianprops, and meanprops parameters. These parameters are described below.
boxprops (default: None): A dictionary of properties (e.g., color, linestyle) to customize the appearance of the box.
whiskerprops (default: None): A dictionary of properties to customize the appearance of the whiskers.
capprops (default: None): A dictionary of properties to customize the appearance of the caps.
flierprops (default: None): A dictionary of properties to customize the appearance of the outliers (fliers).
medianprops (default: None): A dictionary of properties to customize the appearance of the median line.
meanprops (default: None): A dictionary of properties to customize the appearance of the mean.
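As an illustration, each of these dictionaries can be passed in a single call. The colors, markers, and line styles below are arbitrary choices made for this sketch, not recommended values, and the data is an assumed seeded random sample.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when working interactively
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: three normal samples with increasing spread
rng = np.random.default_rng(2)
samples = [rng.normal(0, std, 100) for std in range(1, 4)]

plt.figure(figsize=(8, 5))
result = plt.boxplot(
    samples,
    patch_artist=True,  # needed so the boxes can have a fill color
    showmeans=True,     # needed so meanprops has something to style
    boxprops=dict(facecolor="lightyellow", edgecolor="black"),
    whiskerprops=dict(color="gray", linestyle="--"),
    capprops=dict(color="gray", linewidth=2),
    flierprops=dict(marker="x", markeredgecolor="red"),
    medianprops=dict(color="darkgreen", linewidth=2),
    meanprops=dict(marker="D", markerfacecolor="white", markeredgecolor="black"),
)
```

The keys accepted by each dictionary are ordinary line or patch properties, so the same names used elsewhere in Matplotlib (color, linewidth, linestyle, marker) apply here.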
Other Options
The other options are manage_ticks, autorange, zorder, and data. These parameters are described below.
manage_ticks (default: True): If True, the tick locations and labels on the axis along which the boxes are placed are adjusted to match the box positions. If False, the existing ticks are left unchanged.
autorange (default: False): If True and the data is distributed such that the first and third quartiles are equal (an IQR of zero), the whis parameter is overridden so that the whiskers reach the minimum and maximum values in the dataset. Otherwise it has no effect.
zorder (default: None): Defines the drawing order of the boxplot components. A higher value means the component will be drawn on top.
data (default: None): If data is provided, x can be a string or list of strings referencing columns of the data parameter (which should be a dictionary or pandas DataFrame).
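The effect of autorange is easiest to see on data whose quartiles coincide. The sketch below uses a made-up array for that purpose and also shows the data parameter with a plain dictionary; the array contents, the "height" key, and the variable names are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when working interactively
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data whose first and third quartiles are both 5 (IQR = 0)
flat = np.array([1] + [5] * 10 + [9], dtype=float)

plt.figure()
default = plt.boxplot(flat)               # whiskers collapse onto the box
plt.figure()
ranged = plt.boxplot(flat, autorange=True)  # whiskers reach min and max

print("outliers without autorange:", default["fliers"][0].get_ydata())
print("outliers with autorange:", ranged["fliers"][0].get_ydata())

# data parameter: x is given as a key that is looked up in the mapping
rng = np.random.default_rng(3)
table = {"height": rng.normal(170, 10, 50)}
plt.figure()
named = plt.boxplot("height", data=table)
```

Without autorange, the values 1 and 9 are drawn as outliers because the IQR is zero; with autorange=True they become the whisker ends.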
Creating box plots
In this example we will generate a dataset using the numpy library and use this data to create a boxplot.
import numpy as np
import matplotlib.pyplot as plt
The data will be a list containing three arrays, each generated using np.random.normal(). Every array is drawn from a normal distribution with mean 0; the standard deviation is 1 for the first array, 2 for the second, and 3 for the third. Because the values are normally distributed they are not confined to a fixed range, but their spread grows with the standard deviation. All three arrays will contain 100 elements.
data = [np.random.normal(0, std, 100) for std in range(1,4)]
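To check what this line produces, the summary statistics of each array can be printed. The sketch below is a variation that uses a seeded generator (np.random.default_rng, an assumption made here for reproducibility; the original line uses the unseeded np.random.normal).

```python
import numpy as np

# Seeded generator so repeated runs give the same arrays
rng = np.random.default_rng(42)
data = [rng.normal(0, std, 100) for std in range(1, 4)]

for std, sample in zip(range(1, 4), data):
    print(
        f"requested std={std}: n={sample.size}, "
        f"mean={sample.mean():.2f}, sample std={sample.std(ddof=1):.2f}, "
        f"min={sample.min():.2f}, max={sample.max():.2f}"
    )
```

The sample means should lie near 0 and the sample standard deviations near 1, 2, and 3, while the minimum and maximum values extend in both directions around zero.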
After the data is generated, we create the boxplot with plt.boxplot(). Before creating the boxplot we specify the figure size using plt.figure(figsize=(12,8)), so the figure will be 12 by 8 inches. plt.title() sets the title of the plot, "Box Plot Example". plt.xlabel() and plt.ylabel() set the axis labels to "Dataset" and "Value". The grid is turned on using plt.grid(True), and finally the plot is shown using plt.show().
The entire code used in this example is shown below.
import numpy as np
import matplotlib.pyplot as plt
data = [np.random.normal(0, std, 100) for std in range(1, 4)]
plt.figure(figsize=(12,8))
plt.boxplot(data)
plt.title("Box Plot Example")
plt.xlabel("Dataset")
plt.ylabel("Value")
plt.grid(True)
plt.show()
The result of this example is shown in Figure 2.
Figure 2 - Boxplot example
We will modify this example slightly by passing some additional arguments to the plt.boxplot() function. The following modifications will be added:
notch=True - The notch parameter produces a notched box plot. If True, each box is drawn with a notch at the median line. The notch visually represents the confidence interval around the median. If the notches of two boxes do not overlap, this suggests that the medians are significantly different. If notch is False, which is the default, the box is a standard rectangle without a notch.
patch_artist=True - The patch_artist parameter specifies whether to fill the boxes with a color. If the value is True, the boxes will be filled with color, making the plot more visually appealing. If the value is False, the boxes will be unfilled, showing only the outline.
boxprops=dict(facecolor='lightblue', color='blue') - The boxprops parameter is a dictionary that defines the properties of the box's appearance, such as its color, line style, and line width. Here facecolor='lightblue' sets the fill color of the boxes to light blue (this requires patch_artist=True), and color='blue' sets the color of the edges of the boxes to blue.
medianprops=dict(color='red') - The medianprops parameter is a dictionary that defines the properties of the median line inside the box. Here color='red' sets the color of the median line to red.
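Putting these four arguments together with the earlier example gives the following sketch. It reuses the data generation, title, and labels from the first example. The Agg backend line and the result variable are additions made here so the code runs without a display and the created artists can be inspected; in an interactive session, drop the Agg line and finish with plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove when working interactively
import matplotlib.pyplot as plt
import numpy as np

data = [np.random.normal(0, std, 100) for std in range(1, 4)]

plt.figure(figsize=(12, 8))
result = plt.boxplot(
    data,
    notch=True,         # notches show the confidence interval of the median
    patch_artist=True,  # fill the boxes with color
    boxprops=dict(facecolor="lightblue", color="blue"),  # fill and edge color
    medianprops=dict(color="red"),                       # red median line
)
plt.title("Box Plot Example")
plt.xlabel("Dataset")
plt.ylabel("Value")
plt.grid(True)
```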