By Konstantinos Patronas — 07 Jan 2023

Pandas: Understanding basic statistics

Photo by Pascal Müller on Unsplash

Pandas is a powerful library for performing data analysis in Python. Here are a few essential functions that are useful for simple statistical analysis:

df.describe(): Compute basic statistical measures for each column in a DataFrame.
df.mean(): Compute the mean of each column.
df.median(): Compute the median of each column.
df.mode(): Compute the mode of each column.
df.std(): Compute the standard deviation of each column.
df.min(): Find the minimum value of each column.
df.max(): Find the maximum value of each column.

Here, df is a DataFrame. These functions allow you to quickly compute common statistical measures for your data, and can be useful for getting a sense of your data’s distribution and central tendency.

What is the mean in statistics?

In statistics, the mean is a measure of central tendency that represents the average value in a dataset. It is calculated by summing all of the values in the dataset and dividing by the number of values.

For example, consider the following dataset:

3, 5, 7, 9, 10

To calculate the mean of this dataset, you would sum all of the values (3 + 5 + 7 + 9 + 10) and divide by the number of values (5), like this:

mean = (3 + 5 + 7 + 9 + 10) / 5 
     = 34 / 5 
     = 6.8

The mean is a useful measure of central tendency because it takes into account all of the values in the dataset, and because it is in the same units as the original data. However, it can be affected by outliers or extreme values in the dataset, which can make it less representative of the data as a whole.
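The calculation above can be checked with a few lines of pandas. This is a minimal sketch: the Series names are made up for illustration, and the outlier value 100 is an assumption added only to show how a single extreme value pulls the mean.

```python
import pandas as pd

values = pd.Series([3, 5, 7, 9, 10])

# Mean by hand: sum of the values divided by the number of values
manual_mean = values.sum() / len(values)

# The built-in method gives the same result
pandas_mean = values.mean()

# One extreme value (100, illustrative only) shifts the mean noticeably
with_outlier = pd.Series([3, 5, 7, 9, 10, 100])
outlier_mean = with_outlier.mean()

print(manual_mean, pandas_mean, outlier_mean)
```

With the outlier included, the mean rises from 6.8 to about 22.3, even though five of the six values are 10 or less.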

What is the median in statistics?

In statistics, the median is a measure of central tendency that represents the middle value in a dataset. It is the value that separates the higher half from the lower half of the dataset.

To calculate the median, you first need to order the values in the dataset from smallest to largest. Then, if the dataset has an odd number of values, the median is simply the middle value. If the dataset has an even number of values, the median is the average of the two middle values.

For example, consider the following dataset:

3, 5, 7, 9, 10

This dataset has an odd number of values, so the median is simply the middle value, which is 7.

On the other hand, consider the following dataset:

3, 5, 7, 9, 10, 12

This dataset has an even number of values, so the median is the average of the two middle values, which is (7 + 9) / 2 = 8.

The median is a useful measure of central tendency because it is much less affected by outliers or extreme values than the mean: changing the largest value in a dataset does not move the middle value. It is often used in combination with other measures, such as the mean and mode, to get a more complete understanding of the data distribution.
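As a quick check of the two cases above, here is a small sketch using pandas. The variable names are illustrative, and the extreme value 1000 is an assumption introduced only to compare how the median and the mean react to an outlier.

```python
import pandas as pd

odd = pd.Series([3, 5, 7, 9, 10])
even = pd.Series([3, 5, 7, 9, 10, 12])

median_odd = odd.median()    # middle value of an odd-length dataset
median_even = even.median()  # average of the two middle values

# Replace the largest value with an extreme one (1000, illustrative only)
skewed = pd.Series([3, 5, 7, 9, 1000])
median_skewed = skewed.median()  # still the middle value
mean_skewed = skewed.mean()      # pulled far upward by the outlier

print(median_odd, median_even, median_skewed, mean_skewed)
```

The median stays at 7 for the skewed dataset, while the mean jumps to 204.8.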

What is the mode in statistics?

In statistics, the mode is a measure of central tendency that represents the most common value in a dataset. It is the value that appears most frequently in the dataset.

To calculate the mode, you first need to count the number of times each value appears in the dataset. The value that appears the most is the mode. If there is more than one value that appears the most, the dataset is said to have multiple modes. If there are no values that repeat in the dataset, the dataset is said to have no mode.

For example, consider the following dataset:

3, 5, 7, 5, 9, 10, 7

In this dataset, the value 5 appears twice, and the value 7 appears twice. These are the most common values, so the mode of this dataset is 5 and 7.

On the other hand, consider the following dataset:

3, 5, 7, 9, 10, 12

In this dataset, no value appears more than once, so the dataset has no mode.

The mode is a useful measure of central tendency when the dataset contains discrete values, such as integers or categories. It is often used in combination with other measures, such as the mean and median, to get a more complete understanding of the distribution of the data.
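The two examples above can be reproduced with pandas, as sketched below (variable names are illustrative). One thing to be aware of: when no value repeats, pandas does not report "no mode". Every value is tied for most frequent, so mode() returns all of them.

```python
import pandas as pd

two_modes = pd.Series([3, 5, 7, 5, 9, 10, 7])
no_repeats = pd.Series([3, 5, 7, 9, 10, 12])

# Count how often each value appears
counts = two_modes.value_counts()

# mode() returns every value that shares the highest count, sorted
modes = two_modes.mode().tolist()

# With no repeated values, all values tie, so all are returned
all_tied = no_repeats.mode().tolist()

print(modes)
print(all_tied)
```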

What is the standard deviation in statistics?

In statistics, the standard deviation is a measure of the spread or dispersion of a dataset. It is a measure of how far the values in the dataset are from the mean.

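To calculate the standard deviation, you first need to calculate the mean of the dataset. Then, for each value in the dataset, you calculate the difference between the value and the mean, and square this difference. Finally, you take the average of the squared differences, and take the square root of this average to get the standard deviation. This procedure gives the population standard deviation. The sample standard deviation, which df.std() computes by default, divides the sum of squared differences by n - 1 instead of n.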

The standard deviation is a useful measure of spread because it is in the same units as the original data, and because it is based on all of the values in the dataset. It is often used to compare the spread of different datasets or to determine whether a particular value is an outlier in the dataset.

For example, consider the following dataset:

3, 5, 7, 9, 10

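The mean of this dataset is (3 + 5 + 7 + 9 + 10) / 5 = 6.8. The squared differences between the values and the mean are (3 - 6.8)² = 14.44, (5 - 6.8)² = 3.24, (7 - 6.8)² = 0.04, (9 - 6.8)² = 4.84, and (10 - 6.8)² = 10.24. The average of these squared differences is (14.44 + 3.24 + 0.04 + 4.84 + 10.24) / 5 = 32.8 / 5 = 6.56. The standard deviation is the square root of this average, which is sqrt(6.56) ≈ 2.56. Dividing by 4 (that is, n - 1) instead of 5 gives the sample standard deviation, sqrt(8.2) ≈ 2.86, which is what df.std() returns by default.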
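The steps can be written out in pandas to compare the two conventions for the standard deviation. This is a sketch with illustrative variable names. The ddof parameter of std() sets the value subtracted from n in the divisor, and its default is 1.

```python
import pandas as pd

values = pd.Series([3, 5, 7, 9, 10])

# Step 1: the mean
mean = values.mean()

# Step 2: squared difference of each value from the mean
squared_diffs = (values - mean) ** 2

# Step 3a: population standard deviation (divide by n)
population_std = (squared_diffs.sum() / len(values)) ** 0.5

# Step 3b: sample standard deviation (divide by n - 1)
sample_std = (squared_diffs.sum() / (len(values) - 1)) ** 0.5

# The same two results from pandas
pandas_population = values.std(ddof=0)
pandas_sample = values.std()  # ddof=1 is the default

print(round(population_std, 2), round(sample_std, 2))
```

The default of n - 1 is the usual choice when the data is a sample drawn from a larger population; pass ddof=0 when the data is the whole population.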

Practical examples

Let’s work through some practical examples of the methods mentioned above:

import pandas as pd 
 
# Load data into a DataFrame 
df = pd.read_csv('data.csv')

Assume that data.csv has the following contents:

x,y 
1,2 
2,3 
3,4 
4,5 
5,6
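If you want to follow along without creating a file, the same DataFrame can be built from an in-memory string. This is a sketch that assumes the CSV contents shown above; csv_text is just an illustrative name.

```python
import io

import pandas as pd

# The same contents as data.csv, held in a string instead of a file
csv_text = "x,y\n1,2\n2,3\n3,4\n4,5\n5,6\n"

# read_csv accepts any file-like object, so StringIO stands in for the file
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```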

We can calculate the mean like this:

df.mean()

Then the output of df.mean() would be:

x    3.0 
y    4.0 
dtype: float64

The median can be calculated like this:

df.median()

Then the output of df.median() would be:

x    3.0 
y    4.0 
dtype: float64

The mode can be calculated with the mode() function:

df.mode()

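Then the output of df.mode() would be the following. Because no value repeats in either column, every value is tied for most frequent, so pandas returns all of them:

   x  y
0  1  2
1  2  3
2  3  4
3  4  5
4  5  6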

The standard deviation can be calculated with the std() function:

df.std()

Then the output of df.std() would be:

x    1.581139 
y    1.581139 
dtype: float64

The minimum value of each column can be found using the min() function:

df.min()

Then the output of df.min() would be:

x    1 
y    2 
dtype: int64

The maximum value of each column can be found using the max() function:

df.max()

Then the output of df.max() would be:

x    5 
y    6 
dtype: int64

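Finally, there is the describe() method, which gives a quick overview of most of the measures discussed above (count, mean, standard deviation, minimum, and maximum), plus some additional information such as percentiles. It does not report the mode for numeric columns: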

df.describe()

Then the output of df.describe() would be:

              x         y
count  5.000000  5.000000
mean   3.000000  4.000000
std    1.581139  1.581139
min    1.000000  2.000000
25%    2.000000  3.000000
50%    3.000000  4.000000
75%    4.000000  5.000000
max    5.000000  6.000000
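Since describe() returns an ordinary DataFrame, individual statistics can be read from it by row and column label. The sketch below rebuilds the example data inline. The percentiles 0.1 and 0.9 are arbitrary choices for illustration.

```python
import pandas as pd

# Same data as data.csv, built inline so the example is self-contained
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 3, 4, 5, 6]})

summary = df.describe()

# Rows are statistics, columns are the DataFrame's columns
x_mean = summary.loc["mean", "x"]
y_q25 = summary.loc["25%", "y"]

# The percentiles argument requests other cut points (values between 0 and 1)
custom = df.describe(percentiles=[0.1, 0.9])

print(x_mean, y_q25)
print(custom.index.tolist())
```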

What are percentiles?

In statistics, percentiles are used to divide a dataset into 100 equal parts, or “percentiles.” For example, the 25th percentile is the value below which 25% of the values in the dataset fall. The 50th percentile is the value below which 50% of the values fall, and is also known as the median. The 75th percentile is the value below which 75% of the values fall.

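To calculate percentiles, you first need to order the values in the dataset from smallest to largest. Then, one simple approach (the nearest-rank method) uses the following formula to find the rank, that is, the position in the ordered dataset, of the nth percentile (where n is greater than 0 and at most 100). The result is rounded up to the next whole number:

rank of nth percentile = ceil((n / 100) * (number of values))

Assume you have this dataset:

3, 5, 7, 9, 10

To calculate the 25th percentile of this dataset, you would use the following formula:

rank of 25th percentile = ceil((25 / 100) * 5)
                        = ceil(1.25)
                        = 2

This means that the 25th percentile is the second value in the ordered dataset, which is 5. Note that pandas interpolates linearly between neighboring values by default, so df.describe() and df.quantile() can return values that fall between data points. For this dataset they also give 5.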

Percentiles are a useful way to summarize the distribution of a dataset and to identify outliers or extreme values. They are often used in combination with other measures of central tendency and spread, such as the mean, median, and standard deviation.
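To check a percentile on the dataset 3, 5, 7, 9, 10, the sketch below compares a nearest-rank calculation (rank rounded up with ceil) against the quantile() method in pandas, which interpolates linearly by default. The helper name nearest_rank is hypothetical, and the 40th percentile is included only because the two methods disagree there.

```python
import math

import pandas as pd

values = pd.Series([3, 5, 7, 9, 10])
ordered = values.sort_values().tolist()


def nearest_rank(sorted_values, n):
    """Return the nth percentile (0 < n <= 100) using the nearest-rank method."""
    rank = math.ceil(n / 100 * len(sorted_values))  # 1-based position
    return sorted_values[rank - 1]


# 25th percentile: rank = ceil(1.25) = 2, so the second value
nearest_p25 = nearest_rank(ordered, 25)
linear_p25 = values.quantile(0.25)

# 40th percentile: nearest rank picks a data point, pandas interpolates
nearest_p40 = nearest_rank(ordered, 40)
linear_p40 = values.quantile(0.40)

print(nearest_p25, linear_p25)
print(nearest_p40, linear_p40)
```

Both methods give 5 for the 25th percentile. For the 40th percentile, nearest rank returns 5, while linear interpolation returns a value between 5 and 7 (about 6.2).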

I hope you found this article useful :)