Thanks to cool inventions like drones, satellites and other devices that allow you to see the world through the eyes of a bird, we are able to enjoy breathtaking views such as the one below.
And thanks to statistics and a neat method known as The Five-Number Summary, we are able to enjoy even more thrilling images such as the Box and Whisker Plot below, which is a graphical representation of the five-number summary of a dataset.
What is the purpose of the five-number summary?
You can think of the five-number summary as a satellite image of your dataset: it gives you a broad idea of the distribution of your data before you decide to zoom in on specific areas of interest. It lets you spot outliers, get a sense of the spread and range of your data, and compare different datasets in a straightforward, easy-to-understand way. This can be done with the help of the box and whisker plot, which is constructed using the five numbers and serves as a visual blueprint of your data.
So, what are the five numbers?
- Min (smallest value in your dataset)
- Q1 = First Quartile (25th percentile of your dataset)
- Q2 = Median (50th percentile of your dataset)
- Q3 = Third Quartile (75th percentile of your dataset)
- Max (largest value in your dataset)
What is the easiest way to get these in Python?
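One possible way, sketched here with only the standard library, is `statistics.quantiles`: with `n=4` it returns the three quartile cut points (Q1, Q2, Q3), and the min and max come from the sorted data. The function name `five_number_summary` and the sample scores are made up for illustration. Keep in mind that tools use slightly different quartile conventions, so Q1 and Q3 can differ a little from one library to another; `method="inclusive"` is simply the choice made in this sketch.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence of at least two numbers."""
    data = sorted(values)
    # n=4 splits the data into quartiles and returns the three cut points.
    # method="inclusive" interpolates linearly between the observed values.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


# Hypothetical test scores, used only to show the call.
scores = [60, 72, 75, 78, 82, 85, 88, 90, 93, 98]
print(five_number_summary(scores))  # (60, 75.75, 83.5, 89.5, 98)
```

If you already work with NumPy or pandas, `np.percentile(data, [25, 50, 75])` or `Series.describe()` are common alternatives that report the same quantities.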
What insights about our dataset can we get from the five-number summary?
Before we go through a few examples of how to read a boxplot, we should mention another metric, the Interquartile Range (IQR), which is often used together with the five numbers to help us understand our data distribution. The two main functions of the IQR are to give us an idea of the spread and variability of our data and to help us determine if there are any outliers.
The IQR is calculated by subtracting Q1 from Q3 (IQR = Q3 − Q1), which gives us the width of the range containing the middle 50% of the observations in a given dataset.
Large IQR (wider box) = there is a wider spread and more variability in our data because the middle 50% of our observations are spread over a wider range of values. This indicates large differences between the individual values of the dataset.
Small IQR (tighter box) = the middle 50% of our observations are very close to the median and their values are not too far from each other, so there is more consistency in our data.
Outliers = observations that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. These two values set the lower and upper limits of the range that contains all datapoints that are not outliers. Boxplots are often drawn so that their whiskers stop at or within those limits, and any outliers in the data are represented as individual points beyond the whiskers.
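The outlier rule above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the helper names `iqr_limits` and `find_outliers`, the parameter `k`, and the sample data are all assumptions made for the example, and the quartiles again use the standard library's `statistics.quantiles` with the "inclusive" method.

```python
import statistics


def iqr_limits(values, k=1.5):
    """Return (lower_limit, upper_limit) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(sorted(values), n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the observations that fall outside the IQR-based limits."""
    lower, upper = iqr_limits(values, k)
    return [v for v in values if v < lower or v > upper]


# Made-up data with one suspiciously large value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 45]
print(iqr_limits(data))     # (10.5, 22.5)
print(find_outliers(data))  # [45]
```

Here Q1 = 15 and Q3 = 18, so IQR = 3 and the limits are 15 − 4.5 = 10.5 and 18 + 4.5 = 22.5; only 45 falls outside them.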
Let’s make that drone fly!!!
Let’s end this by looking at the boxplot comparison below and using our knowledge of the five-number summary and IQR to figure out what assumptions and conclusions we can make about the data. These boxplots compare the test scores of 7th graders who took the exact same test during different class periods.
- IQR box is narrower for 2nd Period data: there is more consistency in student scores during 2nd Period because the middle 50% of the values fall between 80 and 90 (IQR = 90 − 80 = 10).
- IQR box is wider for 1st Period data: there is more variability in student scores during 1st Period (IQR = 90 − 75 = 15).
- Median 2nd Period > Median 1st Period: as a group, students scored higher on the test during 2nd Period.
- Range 1st Period (98 − 60 = 38) > Range 2nd Period (95 − 70 = 25): this points to a wider spread and less consistency in the data for 1st Period and hints that there might be outliers skewing the data distribution.
- The lowest 25% of the scores in 1st Period fall between 60 and 75, while in 2nd Period they fall between 70 and 80. This shows up as the long whisker on the left, indicating that the lower 25% of scores really affected the data distribution for 1st Period.
- Is 60 an outlier? Let’s do some math! We know the IQR for 1st Period is 15. Lower limit = Q1 − 1.5 × IQR = 75 − 1.5 × 15 = 52.5. Since 60 is not below the lower limit of 52.5, it is not technically an outlier, but it’s close.
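The arithmetic in the list above can be double-checked with a short sketch. It uses only the values read off the boxplots in the text (min, Q1, Q3 and max for each period); the helper name `box_stats` and its dictionary keys are made up for illustration.

```python
def box_stats(minimum, q1, q3, maximum, k=1.5):
    """Derive range, IQR and outlier limits from four of the five numbers."""
    iqr = q3 - q1
    return {
        "range": maximum - minimum,
        "iqr": iqr,
        "lower_limit": q1 - k * iqr,
        "upper_limit": q3 + k * iqr,
    }


# Values taken from the boxplot comparison described in the text.
first_period = box_stats(minimum=60, q1=75, q3=90, maximum=98)
second_period = box_stats(minimum=70, q1=80, q3=90, maximum=95)

print(first_period)   # {'range': 38, 'iqr': 15, 'lower_limit': 52.5, 'upper_limit': 112.5}
print(second_period)  # {'range': 25, 'iqr': 10, 'lower_limit': 65.0, 'upper_limit': 105.0}

# 60 is the lowest 1st Period score; it is an outlier only if it is below the lower limit.
print(60 < first_period["lower_limit"])  # False
```

Neither period's minimum or maximum falls outside its limits, so by the 1.5 × IQR rule these boxplots contain no outliers.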
With the help of this bird’s-eye view of the data, we can form hypotheses about the cause of the inconsistency in student scores. For example, we may hypothesize that the much longer left whisker and lower median of the boxplot for 1st Period are due to students generally being less alert and less focused earlier in the morning, closer to their wake-up time. Also, a couple of the really low scores could be caused by students oversleeping or being late for class due to traffic or other early-morning reasons, which would leave them less time per question than the majority of students who took the test during 2nd Period.