Determine if Monday’s sales vary significantly from other Mondays using Pandas
In this tutorial, we will walk through a Python script that generates a dataset of daily values, filters the data based on a specific date range, and computes the average value of a particular day of the week.
Prerequisites
Make sure you have the following libraries installed in your Python environment:
pip install pandas numpy
Step 1: Import Required Libraries
We will start by importing the necessary libraries. Here’s the code snippet:
import pandas as pd
import numpy as np
Step 2: Generate Sample Data
We will create a dataset of daily values covering all of 2024 (366 days, since 2024 is a leap year). The values follow a sine wave with a 7-day period to simulate a weekly pattern, combined with some random noise.
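To see why a 7-day sine wave acts as a weekly pattern, the short sketch below rebuilds the signal without the noise and checks that it repeats exactly every week. The names `days`, `weekly`, and `by_week` are illustrative and not part of the tutorial's script. Because 2024-01-01 is a Monday, index 0 and every seventh index after it fall on Mondays, where the sine term is zero, so Mondays sit at the baseline of 50 before noise is added.

```python
import numpy as np

# Noise-free version of the weekly component: a sine wave with a 7-day period
days = np.arange(28)  # four full weeks, starting on a Monday (index 0)
weekly = 50 + 10 * np.sin(days / 7 * 2 * np.pi)

# Reshape to (weeks, weekdays): if the pattern is weekly, every row is identical
by_week = weekly.reshape(4, 7)
same_each_week = np.allclose(by_week, by_week[0])

print(same_each_week)
print(np.round(by_week[0], 2))  # one value per weekday, Monday first
```

With the noise added back, each weekday scatters around its own fixed level, which is why it makes sense to compare a Monday only with other Mondays.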
# Generate dates for the year 2024
dates = pd.date_range(start='2024-01-01', periods=366, freq='D')
# Generate values with a sine wave pattern and random noise
values = [50 + np.sin(i / 7 * 2 * np.pi) * 10 + np.random.normal(0, 3) for i in range(366)]
# Create a DataFrame
df = pd.DataFrame({'date': dates, 'value': values})
# Convert 'date' to datetime format and extract the day name
df['date'] = pd.to_datetime(df['date'])
df['day_name'] = df['date'].dt.day_name()
Step 3: Define the Date Range and Specific Date
Next, we’ll define the date range for our analysis and specify a particular date to examine further.
# Define start and end dates for filtering
start_date = '2024-04-01'
end_date = '2024-09-30'
specific_date = '2024-06-17'
Step 4: Convert Dates to Datetime
We need to convert our defined dates to datetime objects for filtering the DataFrame effectively.
# Convert the date strings to datetime objects
specific_date = pd.to_datetime(specific_date)
start_date = pd.to_datetime(start_date)
end_date = pd.to_datetime(end_date)
# Extract the day name for the specific date
specific_day_name = specific_date.day_name()
Step 5: Create Masks for Filtering
We will create masks to filter the DataFrame based on the specified date range and the specific day name.
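Before building the real masks, a toy example may help show how boolean masks combine. The sketch below uses a small, made-up three-week frame (`toy`, `lo`, and `hi` are illustrative names) and also shows `Series.between`, which should be an equivalent way to write an inclusive date-range check.

```python
import pandas as pd

# Small hypothetical frame: three weeks of daily dates
toy = pd.DataFrame({'date': pd.date_range('2024-04-01', periods=21, freq='D')})
toy['day_name'] = toy['date'].dt.day_name()

lo, hi = pd.Timestamp('2024-04-05'), pd.Timestamp('2024-04-18')

# Explicit comparisons; the parentheses are required because & binds
# more tightly than >= and <=
mask_explicit = (toy['date'] >= lo) & (toy['date'] <= hi)

# Series.between is inclusive on both ends by default
mask_between = toy['date'].between(lo, hi)

# Combine the range mask with a weekday mask
mondays_in_range = toy[mask_between & (toy['day_name'] == 'Monday')]

print(mask_explicit.equals(mask_between))  # True
print(mondays_in_range['date'].dt.strftime('%Y-%m-%d').tolist())  # ['2024-04-08', '2024-04-15']
```

Either form works; the explicit comparisons used in this tutorial make the inclusive bounds visible at a glance.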
# Create masks for filtering
mask_date_range = (df['date'] >= start_date) & (df['date'] <= end_date)
mask_day_name = df['day_name'] == specific_day_name
# Combine masks to get the filtered DataFrame
combined_mask = mask_date_range & mask_day_name
# Use .copy() so that columns added later do not trigger a SettingWithCopyWarning
dt_range = df[combined_mask].copy()
Step 6: Calculate the Average Value
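Now, we will calculate the average value of the filtered data and add it to the DataFrame as a new column.
# Calculate the average of the filtered values and store it in every row
dt_range['avg'] = dt_range['value'].mean()
Step 7: Calculate the Percentage Difference
Next, we compute how far the average lies from each value, expressed as a percentage of that value. A positive result means the value is below the average; a negative result means it is above.
# Calculate the percentage difference from the average
dt_range['avg_value_pct_diff'] = ((dt_range['avg'] - dt_range['value']) / dt_range['value']) * 100
Step 8: Display the Results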
Finally, we will print the filtered DataFrame, which now includes the average value and the percentage difference for each entry that matches our criteria.
# Print the resulting DataFrame
print(dt_range)
The output will look similar to this (only the rows through 2024-06-17 are shown here, and exact values differ from run to run because the noise is random):
          date      value  day_name        avg  avg_value_pct_diff
91  2024-04-01  55.456441    Monday  50.830919           -8.340820
98  2024-04-08  48.933026    Monday  50.830919            3.878553
105 2024-04-15  50.040330    Monday  50.830919            1.579904
112 2024-04-22  56.948229    Monday  50.830919          -10.741879
119 2024-04-29  48.240673    Monday  50.830919            5.369425
126 2024-05-06  48.845700    Monday  50.830919            4.064266
133 2024-05-13  50.022438    Monday  50.830919            1.616237
140 2024-05-20  55.327005    Monday  50.830919           -8.126385
147 2024-05-27  41.923583    Monday  50.830919           21.246602
154 2024-06-03  50.616145    Monday  50.830919            0.424320
161 2024-06-10  51.507112    Monday  50.830919           -1.312814
168 2024-06-17  47.580814    Monday  50.830919            6.830705
Summary
In this tutorial, we:
- Generated a dataset containing daily values for the year 2024.
- Defined a specific date and a date range for filtering the data.
- Created masks to filter the DataFrame based on the specified criteria.
- Calculated the average value of the filtered data and computed the percentage difference from the average.
- Printed the resulting DataFrame.
This analysis can help you identify how individual values for a specific day compare to the average, providing insights into trends and anomalies in the data.
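The percentage difference shows how far a value sits from the average, but not whether that gap is unusual. One common rule of thumb, sketched below, is a z-score: the distance of the chosen Monday from the mean of the other Mondays, measured in standard deviations. This is an illustrative extension rather than part of the original script; the seed, the `THRESHOLD` of 2.0, and the variable names are assumptions, and a z-score is a heuristic, not a formal significance test.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's dataset, seeded so the sketch is repeatable
rng = np.random.default_rng(42)
dates = pd.date_range(start='2024-01-01', periods=366, freq='D')
values = 50 + 10 * np.sin(np.arange(366) / 7 * 2 * np.pi) + rng.normal(0, 3, 366)
df = pd.DataFrame({'date': dates, 'value': values})

specific_date = pd.Timestamp('2024-06-17')
start_date = pd.Timestamp('2024-04-01')
end_date = pd.Timestamp('2024-09-30')

in_range = (df['date'] >= start_date) & (df['date'] <= end_date)
same_weekday = df['date'].dt.day_name() == specific_date.day_name()
mondays = df[in_range & same_weekday]

# Compare the chosen Monday with the *other* Mondays only
is_target = mondays['date'] == specific_date
target_value = mondays.loc[is_target, 'value'].iloc[0]
others = mondays.loc[~is_target, 'value']

# Series.std() uses the sample standard deviation (ddof=1) by default
z_score = (target_value - others.mean()) / others.std()

THRESHOLD = 2.0  # assumed cutoff; adjust to how strict you want to be
is_unusual = abs(z_score) > THRESHOLD
print(f"z = {z_score:.2f}, unusual: {is_unusual}")
```

Excluding the chosen Monday from the mean and standard deviation keeps an extreme value from masking itself.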
Optional: Calculate Rolling Averages
Rolling averages can provide insights into trends over a specified window of time. In this case, we will calculate a rolling average over the last four Mondays.
Step 1: Calculate the Rolling Average
We can calculate the rolling average of the values within the filtered DataFrame using a specified window size.
# Calculate the rolling average over a window of 4 rows (the last four Mondays)
dt_range['rolling_avg'] = dt_range['value'].rolling(window=4, min_periods=1).mean()
- rolling(window=4): This specifies that we want to calculate the average over the last 4 values, including the current one. Because dt_range contains only Mondays, that means the last four Mondays, not four calendar days.
- min_periods=1: This allows the calculation to return a result even if fewer than 4 data points are available (e.g., for the first few entries).
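The effect of min_periods is easiest to see on a tiny series with round numbers. The sketch below uses made-up values (`s`, `partial`, and `strict` are illustrative names) and compares `min_periods=1` with the default, where `min_periods` equals the window size.

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# min_periods=1: early rows average whatever values are available so far
partial = s.rolling(window=4, min_periods=1).mean()

# Default min_periods (equal to the window): early rows are NaN
strict = s.rolling(window=4).mean()

print(partial.tolist())  # [10.0, 15.0, 20.0, 25.0, 35.0]
print(strict.tolist())   # [nan, nan, nan, 25.0, 35.0]
```

Note that with `min_periods=1` the first row's rolling average equals the value itself, so its percentage difference from the rolling average is always zero.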
Step 2: Calculate the Percentage Difference from the Rolling Average
Next, we will calculate the percentage difference between the rolling average and the actual value to understand how much the value deviates from the recent trend.
# Calculate the percentage difference from the rolling average
dt_range['roll_avg_value_pct_diff'] = ((dt_range['rolling_avg'] - dt_range['value']) / dt_range['value']) * 100
The output should be similar to this (the values differ from the earlier table because the random data comes from a different run):
          date      value  day_name        avg  rolling_avg  avg_value_pct_diff  roll_avg_value_pct_diff
91  2024-04-01  45.272746    Monday  50.109394    45.272746           10.683354                 0.000000
98  2024-04-08  51.746284    Monday  50.109394    48.509515           -3.163300                -6.255075
105 2024-04-15  53.102456    Monday  50.109394    50.040495           -5.636391                -5.766137
112 2024-04-22  50.523830    Monday  50.109394    50.161329           -0.820277                -0.717484
119 2024-04-29  49.920351    Monday  50.109394    51.323230            0.378690                 2.810235
126 2024-05-06  46.814591    Monday  50.109394    50.090307            7.037984                 6.997212
133 2024-05-13  53.407352    Monday  50.109394    50.166531           -6.175102                -6.068118
140 2024-05-20  49.797989    Monday  50.109394    49.985071            0.625337                 0.375682
147 2024-05-27  51.335961    Monday  50.109394    50.338973           -2.389293                -1.942084
Summary of Optional Sections
- Rolling Averages: We calculated a rolling average over the last four Mondays and added it to the DataFrame.
- Percentage Difference from Rolling Average: We computed how much each Monday's value deviates from its recent trend using the rolling average.
These optional steps can enhance your analysis, providing deeper insights into the behavior of your data over time. Feel free to adapt the window size and other parameters based on your specific analysis needs!
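To make the window size easy to change, the two rolling steps can be wrapped in a small function. The sketch below is one possible way to do that; `add_rolling_stats` is a hypothetical helper name, and the Monday values are made up for illustration.

```python
import pandas as pd

def add_rolling_stats(frame, window=4, value_col='value'):
    """Return a copy of `frame` with rolling-average columns (hypothetical helper)."""
    out = frame.copy()  # leave the caller's DataFrame untouched
    out['rolling_avg'] = out[value_col].rolling(window=window, min_periods=1).mean()
    # Same sign convention as the tutorial: positive means the value is below the rolling average
    out['roll_avg_value_pct_diff'] = (out['rolling_avg'] - out[value_col]) / out[value_col] * 100
    return out

# Hypothetical Monday values, for illustration only
mondays = pd.DataFrame({'value': [40.0, 50.0, 60.0, 50.0]})

for window in (2, 4):
    result = add_rolling_stats(mondays, window=window)
    print(window, result['rolling_avg'].round(2).tolist())
# 2 [40.0, 45.0, 55.0, 55.0]
# 4 [40.0, 45.0, 50.0, 50.0]
```

A smaller window reacts faster to recent changes, while a larger one gives a smoother baseline.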