Saturday, October 5, 2024

How to create a histogram?

Matplotlib pyplot histograms are used to visualize the distribution of a dataset. They show how the data is distributed across different intervals, or "bins". With a histogram you can understand the underlying frequency distribution of a dataset, which can be, for example, normal, skewed, or uniform. The key concepts of a histogram are:
  • bins - intervals into which the data is divided. Each bin represents a range of values, and the height of the bar over a bin represents the frequency (count) of data points that fall in that range.
  • frequency - the count of data points in each bin, shown on the y-axis of a histogram.
  • density - the relative frequency (frequency divided by the total number of data points) divided by the bin width. It can be shown on the y-axis instead of the frequency. When density is used, the area under the histogram sums to 1.
  • continuous data - histograms are used for continuous data, where the data can take any value within a range.
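The relationship between frequency, relative frequency, and density can be sketched on a tiny made-up sample. The values and bin edges below are assumptions chosen for illustration, and np.histogram is used only to count the points per bin.

```python
import numpy as np

# Made-up sample, used only for illustration
values = np.array([1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 4.0])
edges = np.array([0.0, 2.0, 4.0])  # two bins, each 2 units wide

counts, _ = np.histogram(values, bins=edges)   # frequency per bin
widths = np.diff(edges)                        # bin widths
relative_frequency = counts / counts.sum()     # share of points per bin
density = relative_frequency / widths          # share per unit of bin width
area = np.sum(density * widths)                # total area under the bars

print(counts)              # [2 6]
print(relative_frequency)  # [0.25 0.75]
print(density)             # [0.125 0.375]
print(area)                # 1.0
```

Because the bins here are 2 units wide, the density values are half the relative frequencies, yet the bar areas still add up to 1.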

How to create a histogram?

In Python, histograms are plotted with the Matplotlib library using the matplotlib.pyplot.hist() function. If you have imported the matplotlib.pyplot module as plt, the function can be written as plt.hist(). The full form of the hist function, with the default values of its arguments, is given below.
              matplotlib.pyplot.hist(x, bins=None, range=None, density=False, weights=None, cumulative=False, bottom=None, histtype='bar', align='mid', orientation='vertical', rwidth=None, log=False, color=None, label=None, stacked=False, **kwargs)	
              
Detailed explanation of parameters:
  • x (required) - array like. The data that is plotted in the histogram. This is a sequence of numbers.
  • bins (optional) - can be an int or a sequence. The default value is None, which means the value is taken from rcParams["hist.bins"], which is 10 by default. The bins parameter defines either the number of bins or the specific bin edges. If bins is an integer, it specifies the number of equal-width bins. If a sequence is provided, it specifies the exact bin edges. For example, if bins = 10 the histogram will contain 10 bins. On the other hand, if bins is equal to [0,1,2,3,4] the histogram will use these values as bin edges, which gives 4 bins.
  • range (optional) - the value type of this variable is tuple and the default value is None. The lower and upper range of the bins. If not provided, range defaults to minimum and maximum values of the data.
  • density (optional) - the value type of this parameter is bool and the default value is False. If the value is True, the histogram shows the probability density instead of the absolute count. This means the area under the histogram will sum to 1.
  • weights (optional) - is array-like, default value is None. An array of weights, of the same shape as x. Each value in x contributes its corresponding weights to the bin count (instead of 1).
  • cumulative (optional) - the value type is bool and the default value is False. If the value is True, then the cumulative histogram is computed where each bin gives the cumulative count or density of values up to that bin.
  • histtype (optional) - is the type of histogram that will be plotted. There are several options available i.e. bar, barstacked, step, and stepfilled. The default value is the bar.
    • bar - traditional bar-type histogram
    • barstacked - stacked bar histogram
    • step - generates a lineplot that follows the edge of the bins.
    • stepfilled - similar to step, but the area under the step line is filled.
  • align (optional) - horizontal alignment of the bars relative to the bin edges. Three options are available: left, mid, and right. The default value is mid. When set to mid, the bars are centered between the bin edges. When set to left, the bars are centered on the left bin edges. When set to right, the bars are centered on the right bin edges.
  • orientation (optional) - orientation of the histogram bars. There are two options available i.e. vertical (bars are vertical) and horizontal (bars are horizontal).
  • rwidth (optional) - is the relative width of the bars as a fraction of the bin width. If not specified the bars will take the full bin width. The variable value can be float or None. The default value is None.
  • log (optional) - the parameter value is bool and the default value is set to False. If the value is set to True the y-axis is logarithmic.
  • color (optional) - the color or sequence of colors, default value is None.
  • label (optional) - label for the histogram, useful when legend is used. The parameter value is str or None, and the default value is None.
  • stacked (optional) - the bool type parameter, the default value is set to False. If the value is set to True and multiple datasets are provided, the histograms are stacked on top of each other.
  • **kwargs (optional) - additional keyword arguments. These are Patch properties (for example edgecolor or alpha) that are applied to the drawn bars or step outlines.
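The effect of the bins, range, and weights parameters can be sketched with np.histogram, which the Matplotlib documentation names as the function hist uses to bin the data. The sample values and weights below are made up for illustration.

```python
import numpy as np

# Made-up sample and weights, used only for illustration
x = np.array([0.5, 1.5, 1.5, 2.5, 3.5])
w = np.array([2.0, 1.0, 1.0, 1.0, 4.0])

# bins as a sequence: the values are used directly as bin edges (4 bins)
counts_edges, edges = np.histogram(x, bins=[0, 1, 2, 3, 4])

# bins as an int plus range: 2 equal-width bins between 0 and 2,
# values outside the range (2.5 and 3.5) are ignored
counts_range, edges_range = np.histogram(x, bins=2, range=(0, 2))

# weights: each value adds its weight to the bin instead of 1
counts_weighted, _ = np.histogram(x, bins=[0, 1, 2, 3, 4], weights=w)

print(counts_edges)     # [1 2 1 1]
print(counts_range)     # [1 2]
print(counts_weighted)  # [2. 2. 1. 4.]
```

The same bins, range, and weights arguments can be passed to plt.hist(), which then draws the bars for the counts computed this way.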

Example 1 - Basic Histogram

In this example we will create a basic histogram where every parameter of the histogram function is left at its default value. The data for the histogram will be 1000 samples generated with the numpy random.randn() function. The first step is to load the required libraries, i.e. numpy and matplotlib.
import numpy as np
import matplotlib.pyplot as plt
The next step is to randomly generate the data using the random.randn function.
data = np.random.randn(1000)
Using the previous command we have created 1000 random numbers drawn from a standard normal distribution (mean = 0, standard deviation = 1). The rest of the code consists of matplotlib.pyplot functions used for creating the histogram, and these are:
  • plt.figure(figsize=(12,8)) - set the size of the histogram plot. Here the figure size is set to 12 by 8 inches.
  • plt.hist(data) - the hist function with the x value defined (data), all other parameters are default which means that there will be 10 bins.
  • plt.title("Histogram Example") - the title of the histogram plot.
  • plt.xlabel("Value") - the name of the x-axis.
  • plt.ylabel("Frequency") - the name of the y-axis.
  • plt.show() - show the histogram plot.
The code that creates the histogram plot is given below.
plt.figure(figsize=(12,8))
plt.hist(data)
plt.title("Histogram Example")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
The entire code used in this example is given below.
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randn(1000)
plt.figure(figsize=(12,8))
plt.hist(data)
plt.title("Histogram Example")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
When the previous code block is executed, the histogram shown in Figure 1 is obtained.
Figure 1 - Histogram Example
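As a quick check of what np.random.randn() produces, the following sketch prints summary statistics of the generated sample and bins it with the default settings. The seed value is an arbitrary assumption added only so that repeated runs give the same numbers.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, assumed here only for repeatability
data = np.random.randn(1000)

# Samples come from a standard normal distribution,
# so they are not limited to the 0 to 1 interval
print(data.min(), data.max())
print(data.mean(), data.std())

# np.histogram also uses 10 bins by default, like plt.hist(data)
counts, edges = np.histogram(data)
print(len(counts), counts.sum())
```

The minimum is negative and the maximum is above 1, while the mean is close to 0 and the standard deviation is close to 1.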

Example 2 - How to modify the number of bins in a histogram?

In this example we will create a simple dataset and generate a histogram for it containing only 5 bins. We will set the color of the bars to "blue" and the edge color of the bars to "black". The initial step is to import the required libraries. In this example the matplotlib pyplot module is the only one needed.
import matplotlib.pyplot as plt
Next we will define the dataset, in this case a list and store it under variable name data.
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6]
In this dataset multiple integers occur more than once. The final step is to call the matplotlib functions required for creating the histogram plot. These functions are:
  • plt.figure(figsize=(12,8)) - size of the histogram figure, in this case the size is 12 by 8 inches.
  • plt.hist(data, bins = 5, color='blue', edgecolor='black') - the hist function for creating the histogram. The histogram will contain 5 bins, the color of the bars will be blue and the edgecolor of the bars will be black.
  • plt.xlabel('Value') - the function that sets the label of the x-axis.
  • plt.ylabel('Frequency') - the function that sets the label of the y-axis.
  • plt.title('Basic Histogram') - the title of the histogram plot.
  • plt.show() - display the histogram plot.
The entire code is given below.
import matplotlib.pyplot as plt
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6]
plt.figure(figsize=(12,8))
plt.hist(data, bins=5, color='blue', edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Basic Histogram')
plt.show()
The plot generated by executing the previous code block is shown in Figure 2.
Figure 2 - Histogram with modified number of bins
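To see which values end up in which of the 5 bins, the bin edges and counts can be computed with np.histogram. This is a sketch that assumes numpy is available in addition to matplotlib, which the example itself does not require.

```python
import numpy as np

data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6]

# 5 equal-width bins between the minimum (1) and the maximum (6)
counts, edges = np.histogram(data, bins=5)

print(edges)   # [1. 2. 3. 4. 5. 6.]
print(counts)  # [1 2 3 4 3]
```

Every bin includes its left edge and excludes its right edge, except the last bin, which includes both. That is why the values 5, 5, and 6 all fall into the last bin and its count is 3.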

Example 3 - Density plot

In this example we will create a dataset of 1000 random numbers drawn from a standard normal distribution and then use that data to create a histogram. However, in the histogram we will set density to True to obtain the density plot. The first step is to import the required libraries, i.e. numpy and the matplotlib.pyplot module.
import numpy as np
import matplotlib.pyplot as plt
The dataset creation - 1000 random values drawn from a standard normal distribution.
data = np.random.randn(1000)
The next step is to create the histogram for the defined dataset.
plt.hist(data, bins=30, density=True, color='blue', alpha=0.6)
This line generates the histogram, but with some specific options. Let's break it down:
  • data - This is the input data that you want to visualize. It could be any 1D array-like data structure, such as a list or NumPy array.
  • bins=30 - tells Matplotlib to divide the data into 30 equal-sized intervals (bins). The number of bins influences the smoothness of the histogram. More bins make the histogram more detailed, but fewer bins make the distribution clearer by reducing "noise."
  • density=True - normalizes the histogram so that the area under the histogram equals 1. This turns the histogram into a density plot, where the y-axis shows the probability density instead of simple counts (frequencies). In a regular histogram, the y-axis shows the count (frequency) of values that fall into each bin. In a density plot, the height of each bar is the count divided by the total number of data points and by the bin width. The areas of the bars add up to 1, so the histogram can be interpreted as an estimate of the probability distribution. The probability that a data point falls in a given bin is the area of its bar (height times bin width), not the height alone.
  • color='blue' - sets the color of the bars in the histogram to blue.
  • alpha=0.6 - sets the transparency level of the bars, with 1.0 being fully opaque and 0.0 being fully transparent. By setting alpha=0.6, the bars will be semi-transparent, which can be useful for overlaying multiple plots or simply improving visual aesthetics.
The next step is to add the title and axis labels. This is done using plt.title() for the title, plt.xlabel() for the x-axis label, and plt.ylabel() for the y-axis label.
plt.title("Density Plot Example")
plt.xlabel("Value")
plt.ylabel("Density")
  • plt.title("Density Plot Example"): Adds the title "Density Plot Example" to the plot.
  • plt.xlabel("Value"): Adds a label to the x-axis, which in this case represents the range of values in the dataset.
  • plt.ylabel("Density"): Adds a label to the y-axis, indicating that it represents density (probability density). Unlike a regular histogram where the y-axis represents counts, here it represents the probability per unit of the x-axis within each bin.
To show the plot we will use the plt.show() function.
plt.show()
The entire code created in this example is shown below.
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randn(1000)
plt.hist(data, bins=30, density=True, color='blue', alpha=0.6)
plt.title("Density Plot Example")
plt.xlabel("Value")
plt.ylabel("Density")
plt.show()
When the previous code is executed, the density plot shown in Figure 3 is created.
Figure 3 - Density Plot Example
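The values returned by plt.hist() can be used to check that the bar areas of a density plot add up to 1. The sketch below assumes a non-interactive backend (Agg) so that it runs without a display, and an arbitrary seed for repeatability. Neither is part of the example above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)  # arbitrary seed, assumed here only for repeatability
data = np.random.randn(1000)

# plt.hist returns the bar heights, the bin edges, and the drawn patches
heights, edges, patches = plt.hist(data, bins=30, density=True, color='blue', alpha=0.6)

widths = np.diff(edges)               # width of each bin
bin_probabilities = heights * widths  # area of each bar
area = bin_probabilities.sum()        # total area under the histogram

print(area)
print(bin_probabilities.max())
```

The total area is 1 up to floating-point rounding, and each bar area is the share of the 1000 samples that fell into that bin.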

Example 4 - Cumulative histogram

In a cumulative histogram, each bar shows the number of data points up to and including the bin. So the first bin will contain all points that fall into its range, the second bin will contain all points in its range plus the number of points in the first bin, and so on. This results in the bars increasing cumulatively, which is useful for understanding how values accumulate across the distribution. In this example we will create a cumulative histogram from a randomly generated dataset containing 1000 samples drawn from a standard normal distribution. The example consists of the following steps:
  • Importing Libraries
  • Generating Random Data
  • Plotting the Cumulative Histogram
  • Adding Labels and Title
  • Displaying the Plot

Importing libraries

To create the cumulative histogram the NumPy and Matplotlib libraries are required.
import numpy as np
import matplotlib.pyplot as plt
NumPy is used for numerical operations, i.e. generating the dataset of random numbers, and Matplotlib is used for creating visualizations, in this case the cumulative histogram plot. Note that from matplotlib we have imported the pyplot module as plt.

Generating dataset with random numbers

To generate the dataset of random numbers we will use the randn() function from the numpy random module.
data = np.random.randn(1000)
The np.random.randn(1000) call will generate an array of 1000 random numbers drawn from a standard normal distribution (mean = 0, standard deviation = 1). These random numbers will be used to plot the histogram.

Plot the Cumulative Histogram

For cumulative histogram we will use the plt.hist() function with cumulative parameter set to True.
plt.hist(data, bins=5, cumulative = True, color = 'purple', edgecolor = 'black')
So plt.hist() creates a histogram. A histogram is a way to represent the distribution of data by dividing it into 'bins' or intervals and counting how many data points fall into each bin. The parameters of plt.hist() used in this example are:
  • data: The array of random numbers generated earlier.
  • bins=5: Specifies that the data will be divided into 5 intervals (or bins). The histogram will display the cumulative frequency for each of these 5 bins.
  • cumulative=True: This option tells Matplotlib to make the histogram cumulative, meaning the frequencies in each bin will be added to the previous bins, producing a cumulative effect. So, as you move to higher bins, the total count of data points continues to increase.
  • color='purple': Specifies the color of the bars in the histogram (in this case, purple).
  • edgecolor='black': This adds a black outline to each bar in the histogram, making the bins more distinct.

Adding Labels and Title

The label on x-axis will be "Value", the label on y-axis will be "Cumulative Frequency", and the title will be "Cumulative Histogram".
plt.xlabel('Value')
plt.ylabel('Cumulative Frequency')
plt.title('Cumulative Histogram')
  • plt.xlabel(): Adds a label to the x-axis, in this case, "Value," which represents the value range of the data.
  • plt.ylabel(): Adds a label to the y-axis, in this case, "Cumulative Frequency," which represents the number of data points up to a certain value.
  • plt.title(): Sets the title of the plot as "Cumulative Histogram."

Displaying the plot

The cumulative histogram will be displayed using the plt.show() function.
plt.show()
The entire code created in this example is shown below.
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randn(1000)
# Plotting the cumulative histogram
plt.hist(data, bins=5, cumulative=True, color='purple', edgecolor='black')
# Adding labels and title
plt.xlabel('Value')
plt.ylabel('Cumulative Frequency')
plt.title('Cumulative Histogram')
# Display the plot
plt.show()
When the previous code is executed, the cumulative histogram shown in Figure 4 is obtained.
Figure 4 - Cumulative Histogram
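The cumulative bar heights are the running sum of the ordinary per-bin counts. The sketch below compares the values returned by plt.hist() with np.cumsum() applied to the counts from np.histogram. The Agg backend and the seed are assumptions added so that the sketch runs without a display and gives repeatable numbers.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)  # arbitrary seed, assumed here only for repeatability
data = np.random.randn(1000)

# Heights of the cumulative bars as returned by plt.hist
cumulative_counts, edges, patches = plt.hist(
    data, bins=5, cumulative=True, color='purple', edgecolor='black')

# Ordinary per-bin counts and their running sum
counts, _ = np.histogram(data, bins=5)
running_sum = np.cumsum(counts)

print(counts)
print(cumulative_counts)
print(np.array_equal(cumulative_counts, running_sum))  # True
```

The last cumulative bar always equals the total number of samples, 1000 in this case.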

Example 5 - Stacked Histogram

A stacked histogram allows you to compare two datasets by showing how the frequencies of data1 and data2 are distributed across the same set of bins. In this example, for each bin, the red section represents the count of values from data1, and the blue section on top represents the count of values from data2 that fall within that bin. The total height of each bar shows the combined frequency of both datasets for that bin.

Importing Libraries

For this example we will only need the matplotlib library, pyplot module.
import matplotlib.pyplot as plt
Matplotlib is a Python library used for creating static, interactive, and animated visualizations. Here, pyplot (imported as plt) is used to handle plotting commands and create the histogram.

Sample data

data1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
data2 = [2, 3, 4, 5, 6, 7, 8, 9, 10]
data1 and data2 are lists of integers representing the data points we want to plot. Both lists contain 9 numbers:
  • data1: Contains values from 1 to 9.
  • data2: Contains values from 2 to 10.

Plotting Stacked Histograms

plt.hist([data1, data2], bins=5, stacked=True, color=['red', 'blue'], edgecolor='black')
This command generates two stacked histograms based on the data from data1 and data2. Let’s go through the parameters:
  • [data1, data2]: The first argument to plt.hist() is a list of datasets. Here, we are passing both data1 and data2 as a list. This means Matplotlib will plot both datasets in the same histogram.
  • bins = 5: This divides the data into 5 bins (intervals). The function will compute how many values from both data1 and data2 fall into each bin. A bin is an interval along the x-axis that groups together values from the data sets.
  • stacked=True: This makes the histograms stacked. Instead of plotting data1 and data2 side by side, the histogram bars for data2 are stacked on top of the bars for data1. This is useful for comparing datasets and seeing the cumulative total for each bin.
  • color=['red', 'blue']: This specifies the color of the bars for the two datasets. Here, the bars for data1 will be red, and the bars for data2 will be blue. Since they are stacked, the red part of each bar represents the frequency of data1, and the blue part stacked on top represents the frequency of data2 for each bin.
  • edgecolor='black': This adds a black outline to the edges of the bars, which makes the boundaries between the bars and the bins clearer.

Adding Labels and Title

plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Stacked Histogram')
  • plt.xlabel(): This adds the label ‘Value’ to the x-axis, which represents the values from the datasets.
  • plt.ylabel(): This adds the label ‘Frequency’ to the y-axis. The y-axis shows how many values (or frequencies) from data1 and data2 fall within each bin.
  • plt.title(): This adds the title ‘Stacked Histogram’ to the plot, describing what the plot represents.

Display the Plot

plt.show()
The plt.show() function displays the stacked histogram. The entire code created in this example is shown below.
import matplotlib.pyplot as plt
# Sample data
data1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
data2 = [2, 3, 4, 5, 6, 7, 8, 9, 10]
# Plotting stacked histograms
plt.hist([data1, data2], bins=5, stacked=True, color=['red', 'blue'], edgecolor='black')
# Adding labels and title
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Stacked Histogram')
# Display the plot
plt.show()
After executing the previous code block, the stacked histogram shown in Figure 5 is generated.
Figure 5 - Stacked Histogram Example
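The heights of the red and blue sections can be reproduced by counting each dataset over the same bin edges. The sketch below assumes numpy is available and passes the combined minimum and maximum of both datasets as the range to np.histogram, so that both datasets share the same 5 bins.

```python
import numpy as np

data1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
data2 = [2, 3, 4, 5, 6, 7, 8, 9, 10]

# Shared range for both datasets: from the smallest to the largest value
low = min(min(data1), min(data2))
high = max(max(data1), max(data2))

counts1, edges = np.histogram(data1, bins=5, range=(low, high))
counts2, _ = np.histogram(data2, bins=5, range=(low, high))
totals = counts1 + counts2  # total bar heights in the stacked histogram

print(edges)    # [ 1.   2.8  4.6  6.4  8.2 10. ]
print(counts1)  # [2 2 2 2 1]
print(counts2)  # [1 2 2 2 2]
print(totals)   # [3 4 4 4 3]
```

In the stacked plot, counts1 gives the heights of the red sections, counts2 the heights of the blue sections, and totals the full bar heights.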
