Determine if Monday’s sales vary significantly from other Mondays using Pandas

In this tutorial, we will walk through a Python script that generates a dataset of daily values, filters the data based on a specific date range, and computes the average value of a particular day of the week.

Prerequisites

Make sure you have the following libraries installed in your Python environment:

pip install pandas numpy

Step 1: Import Required Libraries

We will start by importing the necessary libraries. Here’s the code snippet:

import pandas as pd 
import numpy as np

Step 2: Generate Sample Data

We will create a dataset of daily values covering one year. The values follow a sine wave with a 7-day period to simulate a weekly trend, combined with some random noise.

# Generate dates for the year 2024 
dates = pd.date_range(start='2024-01-01', periods=366, freq='D') 
 
# Generate values with a sine wave pattern and random noise 
values = [50 + np.sin(i / 7 * 2 * np.pi) * 10 + np.random.normal(0, 3) for i in range(366)] 
 
# Create a DataFrame 
df = pd.DataFrame({'date': dates, 'value': values}) 
 
# date_range already yields datetimes, so this conversion is only a safeguard
# (it matters when dates are loaded as strings); then extract the day name
df['date'] = pd.to_datetime(df['date']) 
df['day_name'] = df['date'].dt.day_name()
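Because the noise is random, every run produces different numbers. A minimal sketch of a reproducible, vectorized variant follows; the seed value 42 and the use of np.random.default_rng are assumptions of this sketch, not part of the original script.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed (42 is arbitrary) so repeated runs give the same data
rng = np.random.default_rng(42)

dates = pd.date_range(start='2024-01-01', periods=366, freq='D')
day_index = np.arange(len(dates))

# Same recipe as above: base level 50, 7-day sine wave, Gaussian noise (sd = 3)
values = 50 + 10 * np.sin(day_index / 7 * 2 * np.pi) + rng.normal(0, 3, size=len(dates))

df = pd.DataFrame({'date': dates, 'value': values})
df['day_name'] = df['date'].dt.day_name()

# The mean per weekday makes the simulated weekly pattern visible
print(df.groupby('day_name')['value'].mean().round(2))
```

Grouping by day name is only a sanity check here: each weekday should settle around its own level because the sine wave repeats every seven days.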

Step 3: Define the Date Range and Specific Date

Next, we’ll define the date range for our analysis and specify a particular date to examine further.

# Define start and end dates for filtering 
start_date = '2024-04-01' 
end_date = '2024-09-30' 
specific_date = '2024-06-17'

Step 4: Convert Dates to Datetime

We convert the date strings to pandas Timestamp objects so that we can compare them with the 'date' column and call day_name() on the specific date.

# Convert the date strings to datetime objects 
specific_date = pd.to_datetime(specific_date) 
start_date = pd.to_datetime(start_date) 
end_date = pd.to_datetime(end_date) 
 
# Extract the day name for the specific date 
specific_day_name = specific_date.day_name()
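As a quick check of what this step produces, the short sketch below confirms that the converted value is a pandas Timestamp and that 17 June 2024 falls on a Monday.

```python
import pandas as pd

specific_date = pd.to_datetime('2024-06-17')

print(type(specific_date).__name__)  # Timestamp
print(specific_date.day_name())      # Monday
```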

Step 5: Create Masks for Filtering

We will create masks to filter the DataFrame based on the specified date range and the specific day name.

# Create masks for filtering 
mask_date_range = (df['date'] >= start_date) & (df['date'] <= end_date) 
mask_day_name = df['day_name'] == specific_day_name 
 
# Combine masks; .copy() makes dt_range an independent DataFrame,
# so new columns can be added to it safely later on
combined_mask = mask_date_range & mask_day_name
dt_range = df[combined_mask].copy()
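An equivalent way to build the range mask is Series.between, which is inclusive on both ends by default. The sketch below rebuilds a dates-only frame so that it runs on its own, then counts the matching rows; the names demo and mondays are introduced only for this illustration.

```python
import pandas as pd

demo = pd.DataFrame({'date': pd.date_range('2024-01-01', periods=366, freq='D')})
demo['day_name'] = demo['date'].dt.day_name()

# between() includes both endpoints unless told otherwise
in_range = demo['date'].between(pd.Timestamp('2024-04-01'), pd.Timestamp('2024-09-30'))
is_monday = demo['day_name'] == 'Monday'

# Copy the filtered rows so the result is independent of the original frame
mondays = demo[in_range & is_monday].copy()

print(len(mondays))  # 27 Mondays between 1 April and 30 September 2024
```

Both endpoints of the range happen to be Mondays, so inclusive bounds matter for the count.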

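Step 6: Calculate the Average Value

Now, we will calculate the average value of the filtered data and add a new column to the DataFrame to reflect this average.

# Calculate the average of the filtered values and store it in every row
dt_range['avg'] = dt_range['value'].mean()

Step 7: Calculate the Percentage Difference

Next, we compute how far the average lies from each value, expressed as a percentage of that value. A positive result means the value is below the average; a negative result means it is above.

# Calculate the percentage difference from the average
dt_range['avg_value_pct_diff'] = ((dt_range['avg'] - dt_range['value']) / dt_range['value']) * 100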
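A small worked example on three made-up values shows what the two columns contain. The frame toy and its numbers are purely illustrative. Note that the formula divides by the value, not by the average.

```python
import pandas as pd

toy = pd.DataFrame({'value': [40.0, 50.0, 60.0]})

# One scalar mean, broadcast to every row
toy['avg'] = toy['value'].mean()

# Same formula as in the tutorial: (avg - value) / value * 100
toy['avg_value_pct_diff'] = (toy['avg'] - toy['value']) / toy['value'] * 100

print(toy.round(2))
```

Here the average is 50, so the three rows come out as 25, 0, and roughly -16.67 percent.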

Step 8: Display the Results

Finally, we will print the filtered DataFrame, which now includes the average value and the percentage difference for each entry that matches our criteria.

# Print the resulting DataFrame 
print(dt_range)

The output will look similar to this (the exact numbers differ between runs because the noise is random):

          date      value day_name        avg  avg_value_pct_diff
91  2024-04-01  55.456441   Monday  50.830919           -8.340820
98  2024-04-08  48.933026   Monday  50.830919            3.878553
105 2024-04-15  50.040330   Monday  50.830919            1.579904
112 2024-04-22  56.948229   Monday  50.830919          -10.741879
119 2024-04-29  48.240673   Monday  50.830919            5.369425
126 2024-05-06  48.845700   Monday  50.830919            4.064266
133 2024-05-13  50.022438   Monday  50.830919            1.616237
140 2024-05-20  55.327005   Monday  50.830919           -8.126385
147 2024-05-27  41.923583   Monday  50.830919           21.246602
154 2024-06-03  50.616145   Monday  50.830919            0.424320
161 2024-06-10  51.507112   Monday  50.830919           -1.312814
168 2024-06-17  47.580814   Monday  50.830919            6.830705

Summary

In this tutorial, we:

  1. Generated a dataset containing daily values for the year 2024.
  2. Defined a specific date and a date range for filtering the data.
  3. Created masks to filter the DataFrame based on the specified criteria.
  4. Calculated the average value of the filtered data and computed the percentage difference from the average.
  5. Printed the resulting DataFrame.

This analysis can help you identify how individual values for a specific day compare to the average, providing insights into trends and anomalies in the data.
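To judge whether the chosen Monday stands out, one option is to express its deviation in units of the spread of the other Mondays. The sketch below is an illustration under stated assumptions rather than part of the original script: it regenerates the data with an arbitrary seed, and the names target, others, z_score and is_unusual, as well as the threshold of 2, are hypothetical choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumption: arbitrary seed for repeatability
dates = pd.date_range('2024-01-01', periods=366, freq='D')
values = 50 + 10 * np.sin(np.arange(366) / 7 * 2 * np.pi) + rng.normal(0, 3, 366)
df = pd.DataFrame({'date': dates, 'value': values})

specific_date = pd.Timestamp('2024-06-17')
in_range = df['date'].between(pd.Timestamp('2024-04-01'), pd.Timestamp('2024-09-30'))
same_weekday = df['date'].dt.day_name() == specific_date.day_name()

# Value on the date of interest
target = df.loc[df['date'] == specific_date, 'value'].iloc[0]

# Compare against the other Mondays only, excluding the date itself
others = df.loc[in_range & same_weekday & (df['date'] != specific_date), 'value']

# Distance from the mean of the other Mondays, in sample standard deviations
z_score = (target - others.mean()) / others.std()

# Assumption: |z| > 2 is a common rule of thumb, not a formal significance test
is_unusual = abs(z_score) > 2
print(f"z-score: {z_score:.2f}, unusual: {is_unusual}")
```

Excluding the target date from the reference group keeps an extreme value from pulling the mean and the standard deviation toward itself.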

Optional: Calculate Rolling Averages

Rolling averages can provide insights into trends over a specified window of time. In this case, we will calculate a rolling average over the last four Mondays.

Step 1: Calculate the Rolling Average

We can calculate the rolling average of the values within the filtered DataFrame using a specified window size.

# Calculate the rolling average over a window of 4 rows
# (dt_range holds only Mondays, so this spans 4 consecutive Mondays)
dt_range['rolling_avg'] = dt_range['value'].rolling(window=4, min_periods=1).mean()

  • rolling(window=4): This specifies that we want to calculate the average over the last 4 values, that is, the current row and the three rows before it.
  • min_periods=1: This allows the calculation to return a result even if fewer than 4 data points are available (e.g., for the first few entries).
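The effect of the two parameters is easiest to see on a tiny series. The following sketch uses five made-up numbers (the names s and rolled are illustrative only).

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# The first three results average fewer than 4 values because min_periods=1
rolled = s.rolling(window=4, min_periods=1).mean()

print(rolled.tolist())  # [10.0, 15.0, 20.0, 25.0, 35.0]
```

Only the last entry uses a full window: (20 + 30 + 40 + 50) / 4 = 35.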

Step 2: Calculate the Percentage Difference from the Rolling Average

Next, we will calculate the percentage difference between the rolling average and the actual value to understand how much the value deviates from the recent trend.

# Calculate the percentage difference from the rolling average 
dt_range['roll_avg_value_pct_diff'] = ((dt_range['rolling_avg'] - dt_range['value']) / dt_range['value']) * 100
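The rolling window used here includes the current row, so each value is partly compared with itself. If the goal is to compare a Monday only with the Mondays before it, one option is to shift the series by one row before rolling. This is an assumption of the sketch below, not something the original script does, and the column names prev_rolling_avg and prev_roll_pct_diff are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'value': [10.0, 20.0, 30.0, 40.0, 50.0]})

# shift(1) moves every value down one row, so each window ends at the previous row
toy['prev_rolling_avg'] = toy['value'].shift(1).rolling(window=4, min_periods=1).mean()

# Same percentage formula as before; the first row is NaN because it has no history
toy['prev_roll_pct_diff'] = (toy['prev_rolling_avg'] - toy['value']) / toy['value'] * 100

print(toy)
```

With this variant, the last row is compared with the mean of 10, 20, 30 and 40 (which is 25), giving -50 percent.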

The output should be similar to this (again, the numbers come from a different random run):

          date      value day_name        avg  rolling_avg  avg_value_pct_diff  roll_avg_value_pct_diff
91  2024-04-01  45.272746   Monday  50.109394    45.272746           10.683354                 0.000000
98  2024-04-08  51.746284   Monday  50.109394    48.509515           -3.163300                -6.255075
105 2024-04-15  53.102456   Monday  50.109394    50.040495           -5.636391                -5.766137
112 2024-04-22  50.523830   Monday  50.109394    50.161329           -0.820277                -0.717484
119 2024-04-29  49.920351   Monday  50.109394    51.323230            0.378690                 2.810235
126 2024-05-06  46.814591   Monday  50.109394    50.090307            7.037984                 6.997212
133 2024-05-13  53.407352   Monday  50.109394    50.166531           -6.175102                -6.068118
140 2024-05-20  49.797989   Monday  50.109394    49.985071            0.625337                 0.375682
147 2024-05-27  51.335961   Monday  50.109394    50.338973           -2.389293                -1.942084

Summary of Optional Sections

  1. Rolling Averages: We calculated a rolling average over the last four Mondays and added it to the DataFrame.
  2. Percentage Difference from Rolling Average: We computed how much each daily value deviates from its recent trend using the rolling average.

These optional steps can enhance your analysis, providing deeper insights into the behavior of your data over time. Feel free to adapt the window size and other parameters based on your specific analysis needs!