4.5.9 Visualize Data in a Box Plot

A box plot summarizes the distribution of numeric data. It gives general information about the symmetry, skewness, variance, and outliers in a dataset, and uses boxes and lines to depict the distribution.

The box plot has the following components:
  • Central Box—Inter-quartile range and quartiles:
    • Q1 (First Quartile)—This is the value below which 25% of the data falls. It represents the boundary between the lowest 25% and highest 75% of values.
    • Q3 (Third Quartile)—This represents the value below which 75% of the data falls, serving as a border between the lowest 75% and highest 25% of values.
    • Interquartile Range (IQR)—The IQR is the range in which the central 50% of the values fall. IQR = Q3 - Q1
  • Whiskers—The whiskers of the box plot extend from the central box to the minimum and maximum data values that are not considered outliers. They provide a graphical representation of the majority of the data's distribution.
  • Outliers—Outliers are data points that deviate significantly from other data points, typically due to data variability or errors. An outlier is plotted as a dot beyond the ends of the whiskers of a box plot.
  • Median—The median is the value that divides the dataset into two halves, with 50% of the values falling below it and 50% falling above it. In the box plot, a line or a mark inside the central frame represents the median.
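The components above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's implementation: the function name `box_plot_stats` is hypothetical, the quartiles use linear interpolation between data points, and outliers are assumed to be values more than 1.5 × IQR beyond the box (the common Tukey rule). The chart in the notebook may use a different quartile method or outlier rule.

```python
import statistics


def box_plot_stats(values):
    """Return the box plot components for a list of numbers (hypothetical helper)."""
    data = sorted(values)
    # Quartiles with linear interpolation between data points ("inclusive" method).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed outlier rule: Tukey fences at 1.5 * IQR beyond the central box.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme values that are not outliers.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Small made-up sample with one obviously extreme value.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

For this sample, Q1 is 3.25, the median is 5.5, and Q3 is 7.75, so the IQR is 4.5. The whiskers run from 1 to 9, and 30 is reported as the only outlier.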
When to use this chart: Use this chart to show distributions of numeric data, especially if you want to compare them between multiple groups.
Dataset: IRIS dataset. The IRIS dataset contains three classes, one for each Iris species (Setosa, Versicolor, and Virginica), with 50 samples each, and four numeric properties for each sample: Sepal Length, Sepal Width, Petal Length, and Petal Width.
To visualize data in a box plot:
  1. We have already created the IRIS dataset in the topic Visualize your data in a pie chart. We will use the same table, IRIS_R, to visualize the data in a box plot. Open the notebook and go to the paragraph where the IRIS_R table is populated.

    Figure 4-39 Iris Dataset in a table


    Iris Dataset in a table with the box plot icon highlighted

  2. Click the box plot icon. The dataset is now displayed in a box plot.

    Figure 4-40 Box Plot 1 - Depicts the data grouped by the 3 species (classes) - Setosa, Versicolor, and Virginica

    As you can see, by default the data is grouped by the 3 species (classes) - Setosa, Versicolor, and Virginica - along the X-axis, with the sepal length along the Y-axis. Hover your cursor over each box plot to view the count.
  3. Click on Settings to view how the data is plotted. Under Setup, go to Series to Show, and click to add the other three numeric properties—Sepal Width, Petal Length, and Petal Width.

    Figure 4-41 Box Plot 2 - Depicts the data for the 3 species (classes) along with the properties Sepal Width, Sepal Length, Petal Width, and Petal Length

  4. Under Settings, click Customizations and edit the following settings:
    • Visualization: Click Show Outliers.
    • X-Axis: In the Text field, enter Iris Species. Color: Enter rgb(7, 17, 215, 0.88)
    • Y-Axis: In the Text field, enter Petal & Sepal Properties. Color: Enter rgb(7, 17, 215, 0.88)
    • Description: Enter the following - Box Plot of the Iris flower dimension.
    • Color: Enter rgb(241, 8, 24)
    • Once done, close the dialog.

    Figure 4-42 Box Plot 3 - Shows the Outlier, Box Plot description, and descriptions for X-Axis and Y-Axis



  5. The box plot now displays the dataset as follows:
    • Hover your cursor over each box plot to view the values. In the screenshot here, the cursor is over the Sepal Length series for the species Virginica. The length ranges from 5.6 to 7.9. There is also an outlier for this series, indicated by the dot below the lower whisker of the box plot.

      Figure 4-43 Box Plot 4 - Shows the value for the class Virginica and property Sepal Length



    • Hover your cursor over the dot that indicates the outlier for the group Virginica. It shows an outlier value of 4.9 for Virginica sepal length. This means that the species Virginica includes a sample whose sepal length is significantly below the lower whisker value (5.6).

      Figure 4-44 Box Plot 5 - Shows the Outlier value for Virginica (Class) Sepal Length (Property)
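The outlier shown for Virginica sepal length can be cross-checked outside the notebook. The sketch below is an illustration under stated assumptions: the 50 values are the Virginica sepal lengths as found in the commonly distributed Iris dataset (the IRIS_R table is assumed to hold the same values), the quartiles use linear interpolation, and outliers are assumed to lie more than 1.5 × IQR beyond the box. The chart may compute these differently.

```python
import statistics

# Sepal lengths (cm) of the 50 Virginica samples from the commonly
# distributed Iris dataset. Assumption: IRIS_R contains the same values.
virginica_sepal_length = [
    6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 4.9, 7.3, 6.7, 7.2,
    6.5, 6.4, 6.8, 5.7, 5.8, 6.4, 6.5, 7.7, 7.7, 6.0,
    6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.2, 6.1, 6.4, 7.2,
    7.4, 7.9, 6.4, 6.3, 6.1, 7.7, 6.3, 6.4, 6.0, 6.9,
    6.7, 6.9, 5.8, 6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9,
]

# Quartiles with linear interpolation between data points.
q1, median, q3 = statistics.quantiles(
    virginica_sepal_length, n=4, method="inclusive"
)
iqr = q3 - q1

# Assumed outlier rule: values more than 1.5 * IQR beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [
    v for v in virginica_sepal_length if v < lower_fence or v > upper_fence
]
inside = [
    v for v in virginica_sepal_length if lower_fence <= v <= upper_fence
]

print("whiskers:", min(inside), "to", max(inside))
print("outliers:", outliers)
```

Under these assumptions the whiskers run from 5.6 to 7.9 and the only outlier is 4.9, which agrees with the values displayed when hovering over the Virginica Sepal Length box plot.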



This completes the task of visualizing your data in a box plot.