How to Perform Data Analysis using Python – A Case Study With Sales Data [Codes Included]

Data analysis refers to discovering insights from the data that help draw meaningful conclusions and aid decision-making. As organizations increasingly rely on data-backed decision-making, the need for data analysts is on the rise.

In today’s data-driven world, Python has emerged as a versatile tool for conducting data analysis efficiently and precisely. This article will focus on understanding data analytics and how data analytics with Python can be performed through a case study.

Let’s start by understanding the need for Python for Data Analytics.

Why choose Python for Data Analytics?

As data analysis requires users to perform numerous operations on data, it is crucial to use a tool that is capable and sophisticated yet simple to use. This is where Python comes into the picture. Python is an excellent choice for data analytics for several reasons:

  • Easy to learn and use, making data analysis accessible to novice as well as experienced programmers
  • Several great Python libraries for data analytics exist, such as Pandas, NumPy, SciPy, statsmodels, Matplotlib, Seaborn, and Plotly.
  • Extensive community support that helps resolve issues quickly
  • Great integration capabilities that let users analyze data from various DBMSs, making Python a great choice for data science and analytics (see the sketch after this list)
  • Good scalability allows users to analyze big data efficiently
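
For instance, here is a minimal sketch of pulling a table from a database straight into a Pandas DataFrame; the database file and table name (sales.db, transactions) are hypothetical.

# a minimal sketch of Python's DBMS integration (database file and table name are hypothetical)
import sqlite3
import pandas as pd

# connecting to a local SQLite database file
conn = sqlite3.connect('sales.db')

# reading a table directly into a pandas DataFrame with a SQL query
df = pd.read_sql('SELECT * FROM transactions', conn)

# closing the connection once the data is loaded
conn.close()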

Now that the reasons for choosing Python for data science and analytics are clear, let's focus on what data analysis is all about and how it will be approached.

How to Analyze Data – 7 Key Steps

To analyze data, you need to follow a process. Each step in the process is crucial, as together they enable an effective analysis of the data. The key steps for performing data analysis are as follows:

  1. Setting up Objective: The first step is to define what you intend to achieve with the analysis and what aspects of the business you want to cover.
  2. Collecting Data: The next step is to collect and collate the relevant data for analysis.
  3. Exploratory Data Analysis: This step involves examining the data at a high level to understand its structure and contents.
  4. Cleaning Data: The most crucial step is data cleaning, where one needs to ensure that the data is complete, free of anomalies and inconsistencies, and fit for data analysis.
  5. Mining Data: The next stage involves mining the data, extracting key statistics and initial findings from it.
  6. Visualizing Data: Another crucial step in data analysis involves visualizing data where data is manipulated, including filtering and aggregating data and presenting the results through charts.
  7. Statistical Analysis: Lastly, appropriate statistical tests and models are used to understand the data and the relationships among its variables (a minimal sketch of such a test follows this list).
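
As a preview of this last step, here is a minimal sketch of such a test, assuming the sales data used later in this article (with its `Customer Gender` and `Revenue` columns, after cleaning) and using an independent two-sample t-test from scipy.stats.

# a minimal sketch of a statistical test (assumes the cleaned sales data from the case study below)
import pandas as pd
import scipy.stats as stats

df = pd.read_csv('sales_info.csv')

# comparing mean revenue between male and female customers
male_rev = df[df['Customer Gender'] == 'M']['Revenue'].dropna()
female_rev = df[df['Customer Gender'] == 'F']['Revenue'].dropna()
t_stat, p_value = stats.ttest_ind(male_rev, female_rev, equal_var=False)

# a p-value below 0.05 would suggest the difference in mean revenue is statistically significant
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.3f}")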

After reviewing the fundamentals of data analytics, it is time to examine an actual data analysis case study performed using Python.

How to perform Data Analysis using Python?

To perform data analysis using Python, you must follow the data analysis process discussed above using Python libraries. That is precisely what you will do in this section.

1) Objective Definition

You will use sales data in this case study. The objective of the analysis is to examine sales patterns and identify how they can be improved.

2) Data Collection and Exploratory Data Analysis (EDA)

Data collection and EDA can be easily performed with Python thanks to its data analytics libraries, such as NumPy and Pandas. Hence, you start by importing these libraries along with the visualization and statistical libraries that will help during the later steps.

# importing pandas for reading data and performing other dataframe-related operations
import pandas as pd

# importing numpy for performing various numerical operations
import numpy as np

# importing matplotlib, seaborn and plotly for visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.subplots as sp
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# importing statistical libraries
import math
import scipy.stats as stats
import statsmodels.api as sm

Once done, you imported the data as a Pandas DataFrame and created a copy for data cleaning and downstream analytical steps.

# importing the sales data as a pandas DataFrame
sales_data = pd.read_csv('sales_info.csv')

# creating a copy of the data for analysis
df = sales_data.copy()

The first obvious step was to examine the data, so you performed exploratory data analysis using Python. You dealt with sales data containing information about a company’s various products across multiple categories and countries. In addition to the product and transaction details, the data also provides customer demographic information.

# viewing the first few rows of the data
df.head()

output

After reviewing the data, you created a table describing each column and how it could be used during data analysis.

Next, you calculated the data dimensions and found that you dealt with around 3.5K transactions and 19 columns. 

# finding the number of rows and columns there in the data
print("The number of rows are {} and the number of columns are {}".format(df.shape[0],df.shape[1]))

output

As you had a basic understanding of the data, it was time to start with data cleaning. Data cleaning needs to focus on the following key aspects-

  • Basic understanding of data
  • Duplicates
  • Incorrect Data Types
  • Anomalies/Outliers
  • Missing Values
  • Structural Errors

As you have already understood the data, you can start by checking for duplicates.

3) Data Cleaning

Data cleaning is a crucial process in data analytics. It improves data quality, ensuring the insights gained from the analytical process are accurate, consistent, complete, reliable, and reproducible.

Data can have errors, biases, and incomplete information, so numerous data-cleaning steps are performed before data analysis. This section will show you the various data-cleaning steps.

a. Duplicate Removal

Checking for duplicates revealed that there were eight duplicate rows in the data.

# checking if there are any duplicates
df.duplicated().sum()

You then use the drop_duplicates() method to remove them and re-check to ensure the duplicates are gone.

# removing duplicate rows
df = df.drop_duplicates()

# re-checking if there are any duplicates
df.duplicated().sum()

b. Type Casting

Next, you check the data types of the columns. Several columns, such as `Date` and `Customer Estimated Income`, seemed to have the wrong data types.

# finding the data types and number of non-null values in the data
df.info()

output

To understand why that is the case, you view and explore the unique values of all the columns. While the $ symbol caused the wrong data type for `Customer Estimated Income`, several other issues were present, such as invalid values masking missing values and inconsistent categories (e.g., in `Customer Education Level`).

# need to find what all unwanted values are present in the different columns that are causing it to have incorrect datatypes

# first creating a copy of the data that does not have missing values
df_without_na = df.dropna().copy()

# finding the unique values for each column in the data to find such unwanted values
for i in df_without_na:
    print('\n', i, df_without_na[i].unique())

output

To resolve the invalid values, note them down and replace them with np.nan so they can be dealt with later through imputation.

# listing invalid values to be replaced with np.nan
invalid_values = ["%48224*#(", "(#23#(@!", "!!@#$%^&*", "??><<{}[]", "ABCDE", "NANANANA", "-", "???"]

# replacing invalid values with np.nan
df = df.replace(invalid_values, np.nan)

You then remove the $ symbol from `Customer Estimated Income` and change its data type to float.

# removing the '$' sign from values in Estimated Income column and converting it to numeric.
df['Customer Estimated Income'] = df['Customer Estimated Income'].replace(r'[\$,]', '', regex=True).astype(float)

You also changed the data types of the remaining columns to their correct types, as the invalid values no longer caused any hindrance. Upon re-checking, all data types seemed to be correct.

# typecasting columns to their correct data types
df['Loyalty Balance'] = pd.to_numeric(df['Loyalty Balance'])
df['Date'] = pd.to_datetime(df['Date'])

# rechecking column data types
df.dtypes

output

c. Anomaly/Outlier Treatment

To detect outliers, you create a function that creates boxplots for specified columns.

# creating function for boxplots
def plot_boxplots(df, columns, num_cols=3):

    # calculating number of rows and columns for subplots
    num_rows = math.ceil(len(columns) / num_cols)

    # creating subplots
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, num_rows * 6))

    # flattening the axes array for easy iteration
    axes = axes.flatten()

    # plotting boxplots for each specified column
    for i, col in enumerate(columns):
        sns.boxplot(data=df, y=col, ax=axes[i])
        axes[i].set_title(f'Boxplot of {col}')

    # removing empty subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    # showing plot
    plt.tight_layout()
    plt.show()

You then create boxplots for the numeric columns. Upon inspection, you find that while a few columns have extreme/abnormally large values (as expected in sales data), some columns have anomalous values (e.g., `Customer Estimated Income` and `Customer Duration` with negative values).

# selecting numerical columns
numerical_cols = df.drop('Year', axis=1).select_dtypes(include='number').columns

# creating boxplots
plot_boxplots(df, numerical_cols)

output

To handle such anomalies, you create a function that caps outliers using the Interquartile Range (IQR) method.

# creating function to perform outlier capping
def iqr_outlier_capping(df, columns):

    # creating a copy of the DataFrame to avoid modifying the original data
    df_capped = df.copy()
    for col in columns:

        # calculating Q1 (25th percentile) and Q3 (75th percentile)
        Q1 = df_capped[col].quantile(0.25)
        Q3 = df_capped[col].quantile(0.75)

        # calculating IQR (Interquartile Range)
        IQR = Q3 - Q1

        # calculating lower and upper bounds
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # capping the outliers
        df_capped[col] = df_capped[col].clip(lower=lower_bound, upper=upper_bound)

    return df_capped

Next, you use the function on the columns with anomalies and recreate their boxplots to ensure that the anomalies are indeed gone.

# performing outlier capping on columns with anomalous values
columns_to_cap = ['Customer Duration', 'Loyalty Balance', 'Customer Estimated Income']
df = iqr_outlier_capping(df, columns_to_cap)

# re-creating boxplots for the columns where outlier capping is performed
plot_boxplots(df, columns_to_cap)

output

d. Missing Value Imputation

One of the most crucial aspects of data cleaning is dealing with missing values. As you have already dealt with invalid values that were masking missing values, you now have a much better idea of the status of missing values in the data. Upon detection, you found several columns with missing values.

# printing column names with missing value
print(df.isna().sum()[df.isna().sum()>0].index)

output

You create a function that performs group-level missing value imputations using mean and mode for imputing numerical and categorical columns, respectively.

# creating a function for imputing missing values in a target column using group-level mean/mode values
def impute_missing_values(df, target_column, group_by_column, column_type):

    # making a copy of the dataframe to avoid modifying the original dataframe
    df_imputed = df.copy()

    # ensuring target_column and group_by_column exist in the dataframe
    if target_column not in df_imputed.columns or group_by_column not in df_imputed.columns:
        raise ValueError("Specified columns are not present in the dataframe.")

    # imputing based on column type
    if column_type == 'numeric':

        # checking if the target column is numeric
        if not pd.api.types.is_numeric_dtype(df_imputed[target_column]):
            raise TypeError(f"Column '{target_column}' is not numeric. Please specify 'categorical' for non-numeric columns.")

        # computing the mean value for each group and filling missing values
        group_means = df_imputed.groupby(group_by_column)[target_column].transform('mean')
        df_imputed[target_column] = df_imputed[target_column].fillna(group_means)
    elif column_type == 'categorical':

        # checking if the target column is categorical
        if pd.api.types.is_numeric_dtype(df_imputed[target_column]):
            raise TypeError(f"Column '{target_column}' is numeric. Please specify 'numeric' for numeric columns.")

        # computing the mode value for each group and filling missing values
        mode_values = df_imputed.groupby(group_by_column)[target_column].transform(lambda x: x.mode()[0] if not x.mode().empty else np.nan)
        df_imputed[target_column] = df_imputed[target_column].fillna(mode_values)
    else:
        raise ValueError("Invalid column type specified. Use 'numeric' or 'categorical'.")
    return df_imputed

You then used the function with appropriate grouping columns to impute the missing values. Upon re-checking, it became evident that the data was free of missing values.

# using the function above to perform missing value imputation for numerical columns
df = impute_missing_values(df, target_column = 'Customer Estimated Income', group_by_column = 'State', column_type = 'numeric')
df = impute_missing_values(df, target_column = 'Loyalty Balance', group_by_column = 'Country', column_type = 'numeric')
df = impute_missing_values(df, target_column = 'Customer Duration', group_by_column = 'Customer Gender', column_type = 'numeric')

# using the function above to perform missing value imputation for categorical columns
df = impute_missing_values(df, target_column = 'Customer Education Level', group_by_column = 'Customer Gender', column_type = 'categorical')
df = impute_missing_values(df, target_column = 'Customer Marital Status', group_by_column = 'Country', column_type = 'categorical')

# rechecking if there are any missing values left
df.isna().sum()

output

e. Fixing Inconsistent Data

During typecasting, you found some structural issues in the data: the `Customer Education Level` column had inconsistent categories. Upon carefully examining the categories and their counts, you found that some categories were recorded in multiple ways (e.g., PhD also appearing as Research Degree and Doctorate).

# checking the different categories in the Education Level column
df['Customer Education Level'].value_counts().sort_index()

output

To resolve this, you standardized the inconsistent categories, eventually leaving only four education levels.

# mapping for standardizing categorical values
education_mapping = {
'Bachelor':'Bachelor',
'Doctorate':'PhD',
'Graduate Degree':'Master',
'High School':'High School',
'Master':'Master',
'Masters Diploma':'Master',
'PhD':'PhD',
'Research Degree':'PhD',
'Senior Secondary Education':'High School',
'Undergrad':'Bachelor',
'Undergraduate Degree':'Bachelor'
}

# standardizing the 'Customer Education Level' column
df['Customer Education Level'] = df['Customer Education Level'].map(education_mapping)

# rechecking if the categories have been fixed
df['Customer Education Level'].unique()

output

As the data is now clean, it’s time to analyze it. To perform data analysis using Python, it is best to have your data as a Pandas DataFrame, as it is much easier to work with. Since the data is already in this form, there are multiple ways to analyze it.

In Python, you have data manipulation libraries like Pandas and Numpy, visualization libraries like Matplotlib, Seaborn, and Plotly, and statistical libraries like Scipy and Statsmodels. All such libraries allow you to look at the data from various perspectives and uncover hidden patterns.

The following sections will show the libraries, functions, and techniques for slicing and dicing the data. Note that the focus will not be on the insights gained from the analysis, as that belongs to reporting, discussed in the ‘How to communicate your insights’ section. Instead, the aim is to understand how Python can be leveraged and the approach to take when performing data analysis.

4) Data Mining

The simplest way to analyze data is to mine it. This involves extracting key information about the columns. Data analytics with Python is easy because of libraries like Pandas. If you have your data as a Pandas DataFrame, you can simply use the describe function, which provides statistical information about the columns.

# finding the key statistical values of the numerical columns
df.describe().T

output

The same function can also be applied to finding key insights about the categorical columns.

# finding the key statistical values of the categorical columns
df.drop('Date', axis=1).describe(exclude=np.number).T

output

It was necessary to derive a few important variables to mine the data further. These included calculating the year-month combination of the transaction, profit, profit margin, etc.

# Creating a year-month variable for easily understanding the date related columns

# dropping the year and month column as this information is already there in the Date column
df = df.drop(['Year','Month'],axis=1)

# extracting year and creating a Year column
df['Year'] = df['Date'].dt.year

# extracting year and month and creating a Year-Month column
df['Year_Month'] = df['Date'].dt.strftime('%Y-%m')

# calculating profit
df['Profit'] = df['Revenue'] - df['Cost']

# calculating unit profit
df['Unit Profit']=df['Unit Price']-df['Unit Cost']

# calculating profit margin
df['Profit Margins'] = df['Profit'] / df['Revenue']

# calculating unit profit margin
df['Unit Profit Margin']= df['Unit Profit']/df['Unit Price']

Once such key variables become available, key performance indicators (KPIs) can be easily calculated using Python, as performing arithmetic operations between columns is extremely easy with Pandas.

# Sales KPIs
ttl_revenue = df['Revenue'].sum()
ttl_profit = df['Profit'].sum()
ttl_revenue_previous = df[df['Year'] == 2015]['Revenue'].sum()
ttl_revenue_current = df[df['Year'] == 2016]['Revenue'].sum()
rga = (ttl_revenue_current-ttl_revenue_previous)/ttl_revenue_previous
pm = (ttl_profit/ttl_revenue)*100
aov = ttl_revenue/len(df)

# Customer KPIs
avg_age = df['Customer Age'].mean()
median_income = df['Customer Estimated Income'].median()
mode_martl_status = df['Customer Marital Status'].mode()[0]
mode_edu_level = df['Customer Education Level'].mode()[0]
male_percentage = (df[df['Customer Gender'] == 'M'].shape[0] / df.shape[0]) * 100
female_percentage = (df[df['Customer Gender'] == 'F'].shape[0] / df.shape[0]) * 100

# Product KPIs
top_selling_prod = df.groupby('Sub Category')['Revenue'].sum().sort_values(ascending=False).index[0]
most_purchased_prod = df.groupby('Sub Category')['Quantity'].sum().sort_values(ascending=False).index[0]
most_costly_prod = df.groupby('Sub Category')['Cost'].sum().sort_values(ascending=False).index[0]
most_profitable_prod = df.groupby('Sub Category')['Profit'].sum().sort_values(ascending=False).index[0]
most_margin_prod = df.groupby('Sub Category')['Profit Margins'].sum().sort_values(ascending=False).index[0]

# Operational KPIs
ttl_cost = df['Cost'].sum()
avg_basket_size = df['Quantity'].mean()
avg_profit_margin = df['Profit Margins'].mean()
top_costly_country = df.groupby('Country')['Cost'].sum().sort_values(ascending=False).index[0]

# Print statements for KPIs
print(f"Total Revenue: ${ttl_revenue:,.2f}")
print(f"Total Profit: ${ttl_profit:,.2f}")
print(f"Revenue Growth Rate (2015 to 2016): {rga:.2%}")
print(f"Profit Margin: {pm:.2f}%")
print(f"Average Order Value: ${aov:,.2f}")

print("\nCustomer KPIs:")
print(f"Average Customer Age: {avg_age:.1f} years")
print(f"Median Customer Estimated Income: ${median_income:,.2f}")
print(f"Most Common Marital Status: {mode_martl_status}")
print(f"Most Common Education Level: {mode_edu_level}")
print(f"Percentage of Male Customers: {male_percentage:.2f}%")
print(f"Percentage of Female Customers: {female_percentage:.2f}%")

print("\nProduct KPIs:")
print(f"Top Selling Product Sub-Category: {top_selling_prod}")
print(f"Most Purchased Product Sub-Category: {most_purchased_prod}")
print(f"Most Costly Product Sub-Category: {most_costly_prod}")
print(f"Most Profitable Product Sub-Category: {most_profitable_prod}")
print(f"Product Sub-Category with Highest Profit Margin: {most_margin_prod}")

print("\nOperational KPIs:")
print(f"Total Cost: ${ttl_cost:,.2f}")
print(f"Average Basket Size: {avg_basket_size:.2f} items")
print(f"Average Profit Margin: {avg_profit_margin:.2f}%")
print(f"Country with Highest Total Cost: {top_costly_country}")

output

You’ll now do some complex analysis and calculate crucial financial information for different products. You will do this by grouping the dataframe by product categories and sub-categories and aggregating it by calculating the sum of the key financial columns like revenue, cost, profit, and margin.

# grouping data by product category and sub-category and aggregating by revenue and cost by summing them
prod_analysis_df = df.groupby(['Product Category','Sub Category'])[['Revenue','Cost']].sum().reset_index()

# calculating profit and profit margin
prod_analysis_df['Profits'] = prod_analysis_df['Revenue'] - prod_analysis_df['Cost']
prod_analysis_df['Profit Margins'] = prod_analysis_df['Profits'] / prod_analysis_df['Revenue']

# viewing data
prod_analysis_df

output

Similarly, you can also calculate this information and more for different countries.

# grouping data by country with specific aggregation functions for each column
aggregation_functions = {
'Revenue': 'sum',
'Cost': 'sum',
'Quantity': 'sum',
'Profit': 'sum',
'Profit Margins': 'mean'
}
country_analysis_df = df.groupby('Country').agg(aggregation_functions).reset_index()

# calculating the total sums for the columns to get proportions
total_sums = country_analysis_df[['Revenue', 'Cost', 'Quantity', 'Profit', 'Profit Margins']].sum()

# calculating proportions and adding them as new columns
country_analysis_df['Revenue Proportion'] = round(country_analysis_df['Revenue'] / total_sums['Revenue'],3)*100
country_analysis_df['Cost Proportion'] = round(country_analysis_df['Cost'] / total_sums['Cost'],3)*100
country_analysis_df['Quantity Proportion'] = round(country_analysis_df['Quantity'] / total_sums['Quantity'],3)*100
country_analysis_df['Profit Proportion'] = round(country_analysis_df['Profit'] / total_sums['Profit'],3)*100
country_analysis_df['Profit Margins Proportion'] = round(country_analysis_df['Profit Margins'] / total_sums['Profit Margins'],3)*100

# rearranging columns to place similar columns next to each other
country_analysis_df = country_analysis_df[['Country',
'Revenue', 'Revenue Proportion',
'Cost', 'Cost Proportion',
'Quantity', 'Quantity Proportion',
'Profit', 'Profit Proportion',
'Profit Margins', 'Profit Margins Proportion']]

# viewing the data
country_analysis_df

output

As you can see, data can be easily analyzed using Pandas, revealing many insights. However, when you attempt more complex analyses, you may end up with a large set of numbers that are hard to interpret, which makes visualization crucial.

5) Data Visualization

Visualization can be an excellent tool for performing complex analyses. It allows you to aggregate data involving two or more variables and understand it using graphs.

Of the many kinds of graphs, bar charts and pie charts are key ones, as they allow you to understand data composition. Line charts and scatterplots help you comprehend the relationship of numerical variables with other numerical or time-based variables.

In this section, we will highlight some key information about the data by manipulating and visualizing data in multiple ways. 

i. Creating User-Defined Functions

Data analytics with Python is feasible due to various aspects of the language, including the ability to create your own functions. While some visualizations can be pretty straightforward, others can be complex and involve creating the same graph for different subsets of the data.

In such cases, it is advisable to create user-defined functions. Below, you will see what each of the plotting functions does.

  • Function 1: Bar Chart with Margin Indicator

This function plots a bar chart for a given metric (e.g., `Revenue`, `Profit`) over the years, with annotations highlighting the maximum value and an arrow indicating the range of the metric’s values. 

def plot_metric_by_year(df, metric, title, color):

    # grouping by year and calculating the sum of the metric
    yearly_data = df.groupby('Year')[metric].sum()

    # creating a bar chart
    plt.bar(yearly_data.index, yearly_data, color=color)

    # adding a horizontal line at the maximum value
    max_value = yearly_data.max()
    plt.axhline(y=max_value, color='red', lw=3, ls='--', alpha=0.6)

    # annotating the plot with arrows
    plt.annotate(
        text='',
        xy=(yearly_data.index[0], yearly_data.min()),
        xytext=(yearly_data.index[0], max_value),
        arrowprops=dict(
            arrowstyle='<|-|>',
            facecolor='red',
            edgecolor='red',
            shrinkA=5,
            shrinkB=5
        )
    )

    # adding labels and title
    plt.xlabel('Year')
    plt.ylabel(metric)
    plt.title(title)

  • Function 2: Dodged Bar Chart for Year-country

This function plots a dodged bar chart showing the yearly trends of a given metric (e.g., `Revenue`, `Cost`, etc.) for different countries, with each country represented by a different color in the bar chart.

def plot_metric_by_year_country(df, metric, title):

    # setting figure size
    plt.figure(figsize=(16, 10), facecolor='lightgrey')

    # grouping by year and country and then calculating the sum of the metric
    yearly_country_data = df.groupby(['Year', 'Country'])[metric].sum().unstack('Country')

    # plotting the data
    yearly_country_data.plot(kind='bar', figsize=(14, 8), colormap='tab20', edgecolor='black')

    # setting labels, title, legend, etc
    plt.xlabel('Year')
    plt.ylabel(metric)
    plt.title(title)
    plt.legend(title='Country')
    plt.xticks(rotation=45)
    plt.tight_layout()

    # showing plot
    plt.show()

  • Function 3: Frequency Donut/Pie Chart

This function generates a frequency pie chart (donut plot) to visualize the proportion of different categories within a specified column of a DataFrame. The chart displays both the count and percentage for each category.

def plot_frequency_pie_chart(df, column_name):

    # calculating the frequency of each category in the specified column
    value_counts = df[column_name].value_counts()

    # defining color palette
    colors = plt.cm.Set1(range(len(value_counts)))

    # creating function to format the pie chart labels
    def func(pct, allvalues):
        absolute = int(round(pct / 100. * sum(allvalues)))
        return f'{absolute:,}\n({pct:.1f}%)'

    # plotting the pie chart
    plt.figure(figsize=(8, 8))
    wedges, texts, autotexts = plt.pie(
        value_counts,
        labels=value_counts.index,
        autopct=lambda pct: func(pct, value_counts),
        startangle=140,
        colors=colors,
        wedgeprops=dict(width=0.3)
    )

    # customizing the label text and colors
    for text in texts:
        text.set_color('black')
    for autotext in autotexts:
        autotext.set_color('black')

    # adding title
    plt.title(f'Proportion of Each {column_name}', color='black')

    # showing the plot
    plt.show()

  • Function 4: Horizontal Bar Chart

This function generates a horizontal bar chart to visualize aggregated data. It groups data by a specified column, aggregates another column based on the chosen aggregation function, and creates a bar chart to display the results. Optionally, the chart can include text annotations for the bars.

def plot_bar_chart(df, group_by_col, aggregate_col, agg_func='sum', text_auto=None):

    # checking if the specified aggregation function is valid
    valid_agg_funcs = ['sum', 'mean', 'median', 'count', 'min', 'max']
    if agg_func not in valid_agg_funcs:
        raise ValueError(f"Invalid aggregation function. Choose from: {valid_agg_funcs}")

    # grouping and aggregating the data
    aggregated_df = df.groupby(group_by_col)[aggregate_col].agg(agg_func).reset_index()

    # creating the bar chart
    if text_auto:
        fig = px.bar(aggregated_df, y=group_by_col, x=aggregate_col, text_auto=text_auto,
                     title=f"{agg_func.capitalize()} of {aggregate_col} by {group_by_col}")
    else:
        fig = px.bar(aggregated_df, y=group_by_col, x=aggregate_col,
                     title=f"{agg_func.capitalize()} of {aggregate_col} by {group_by_col}")

    # updating the axis titles
    fig.update_xaxes(title=f'{aggregate_col.capitalize()} ({agg_func.capitalize()})')
    fig.update_yaxes(title=group_by_col.capitalize())

    # showing the plot
    fig.show()

  • Function 5: Multiple Bar Charts for Product Subcategories

This function generates a set of bar charts to visualize the financial performance of a specified product sub-category across different countries. It plots the following metrics for the given sub-category:

  • Revenue by Country
  • Cost by Country
  • Profit by Country
  • Profit Margin by Country

def plot_sub_category_performance(df, sub_category):

    # setting figsize and facecolor
    plt.figure(figsize=(16, 5), facecolor='lightgrey')

    # Plot #1: Revenue by Country
    plt.subplot(1, 4, 1, title=f"{sub_category} Revenue by Country")
    df[df['Sub Category'] == sub_category].groupby('Country')['Revenue'].sum().plot.bar(color='blue')
    plt.ylabel('Revenue')

    # Plot #2: Cost by Country
    plt.subplot(1, 4, 2, title=f"{sub_category} Cost by Country")
    df[df['Sub Category'] == sub_category].groupby('Country')['Cost'].sum().plot.bar(color='green')
    plt.ylabel('Cost')

    # Plot #3: Profit by Country
    plt.subplot(1, 4, 3, title=f"{sub_category} Profit by Country")
    df[df['Sub Category'] == sub_category].groupby('Country')['Profit'].sum().plot.bar(color='red')
    plt.ylabel('Profit')

    # Plot #4: Profit Margin by Country
    plt.subplot(1, 4, 4, title=f"{sub_category} Profit Margin by Country")
    revenue = df[df['Sub Category'] == sub_category].groupby('Country')['Revenue'].sum()
    margin = df[df['Sub Category'] == sub_category].groupby('Country')['Profit'].sum()
    (margin / revenue * 100).plot.bar(color='orange')
    plt.ylabel('Margin %')

    # showing plot
    plt.tight_layout()
    plt.show()

  • Function 6: Scatterplot for Product Subcategories

This function creates scatter plots to visualize the relationship between unit cost and unit price for different sub-categories across various countries. It generates a grid of subplots, one for each specified sub-category, and includes color-coded data points to distinguish between countries.

def plot_unit_cost_vs_price(df, sub_categories):

    # determining the number of rows and columns for the grid
    n_sub_categories = len(sub_categories)
    n_cols = 2  # setting the number of columns in the subplot grid
    n_rows = (n_sub_categories + n_cols - 1) // n_cols  # computing the number of rows needed
    plt.figure(figsize=(15, 5 * n_rows))
    plt.suptitle('Unit Cost and Unit Price by Bike Sub-Category and Country', fontsize=16)

    # looping through each sub-category and creating a subplot
    for i, sub_category in enumerate(sub_categories):
        plt.subplot(n_rows, n_cols, i + 1)
        plt.xlim(0, 3500)
        plt.ylim(0, 3500)
        plt.title(f'{sub_category} by Country')
        sns.scatterplot(data=df[df['Sub Category'] == sub_category],
                        x='Unit Cost',
                        y='Unit Price',
                        hue='Country',
                        palette='tab10')  # adjust palette if needed
        plt.xlabel('Unit Cost')
        plt.ylabel('Unit Price')
        plt.legend(title='Country')

    # hiding any unused subplots
    if n_sub_categories % n_cols != 0:
        for j in range(n_sub_categories, n_rows * n_cols):
            plt.subplot(n_rows, n_cols, j + 1).axis('off')

    plt.tight_layout(rect=[0, 0, 1, 0.96])  # adjusting to make room for the suptitle
    plt.show()

  • Function 7: Pie Chart for Product Categories

This function generates pie charts to visualize the proportion of different financial columns (e.g., `Revenue`, `Cost`, `Profit`) for each category within a specified categorical column (e.g., `Country`). It creates a grid of subplots, each showing the distribution of one monetary column across the categories.

# creating a function to plot a pie chart for different categories and financial performance columns
def plot_pie_charts_1(df, category_col, financial_cols_metrics):

    # getting unique categories and counting them
    unique_categories = df[category_col].unique()
    num_categories = len(unique_categories)

    # determining the layout for subplots based on the number of financial columns
    num_financial_cols = len(financial_cols_metrics)
    rows = int(np.ceil(num_financial_cols / 2))  # number of rows needed
    cols = 2 if num_financial_cols > 1 else 1  # number of columns needed

    # creating the subplots
    fig, axs = plt.subplots(rows, cols, figsize=(10, rows * 5))

    # flattening the axes array if there are multiple subplots
    axs = axs.flatten() if num_financial_cols > 1 else [axs]

    for i, (financial_col, metric) in enumerate(financial_cols_metrics.items()):

        # calculating the specified metric for the financial column grouped by the categorical column
        if metric == 'sum':
            proportions = df.groupby(category_col)[financial_col].sum()
        elif metric == 'mean':
            proportions = df.groupby(category_col)[financial_col].mean()
        elif metric == 'median':
            proportions = df.groupby(category_col)[financial_col].median()
        elif metric == 'count':
            proportions = df.groupby(category_col)[financial_col].count()
        elif metric == 'min':
            proportions = df.groupby(category_col)[financial_col].min()
        elif metric == 'max':
            proportions = df.groupby(category_col)[financial_col].max()
        else:
            raise ValueError(f"Invalid metric '{metric}' for column '{financial_col}'")

        labels = proportions.index  # getting the category labels
        sizes = proportions.values  # getting the corresponding values
        ax = axs[i]  # selecting the current axis for plotting

        # defining the custom autopct function to display percentage and absolute values
        def autopct(pct):
            absolute = int(pct / 100. * np.sum(sizes))  # calculating the absolute value
            return f'{pct:.1f}%\n({absolute})'  # formatting the string

        # creating the pie chart
        wedges, texts, autotexts = ax.pie(sizes, labels=labels, autopct=autopct, startangle=90)

        # adjusting the color of the text based on wedge color
        for autotext, wedge in zip(autotexts, wedges):

            # calculating brightness of the wedge color
            r, g, b, _ = wedge.get_facecolor()
            brightness = (r * 299 + g * 587 + b * 114) / 1000

            # setting text color to white if the wedge is dark, otherwise black
            autotext.set_color('white' if brightness < 0.5 else 'black')

        # ensuring the pie is drawn as a circle
        ax.axis('equal')

        # setting title for the subplot
        ax.set_title(f'Proportion of {financial_col} by {category_col} ({metric})')

    # hiding any unused subplots
    for j in range(i + 1, len(axs)):
        axs[j].axis('off')

    # adjusting layout to prevent overlap
    plt.tight_layout()

    # displaying the plot
    plt.show()

# setting key column metrics
financial_cols_metrics = {
    'Revenue': 'sum',
    'Quantity': 'sum',
    'Cost': 'sum',
    'Profit': 'sum',
    'Profit Margins': 'mean'
}

  • Function 8: Pie Chart for Country & Product Categories

This function generates pie charts to visualize the proportion of profit contributed by different product categories for each country. It creates a grid of subplots, with each subplot representing the distribution of profit among product categories for a specific country.

def plot_pie_charts_2(df):

    # grouping by Country and Product Category, then summing the Profit
    profit_by_category_country = df.groupby(['Country', 'Product Category'])['Profit'].sum().reset_index()

    # ensuring there are no negative or zero profits (filtering them out)
    profit_by_category_country = profit_by_category_country[profit_by_category_country['Profit'] > 0]

    # getting the unique countries
    countries = profit_by_category_country['Country'].unique()

    # defining the number of plots (one for each country)
    n_countries = len(countries)

    # creating a figure with subplots
    n_cols = 3
    n_rows = (n_countries + n_cols - 1) // n_cols  # computing the number of rows needed
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 4 * n_rows))

    # flattening the axes array for easy iteration
    axes = axes.flatten()
    for i, country in enumerate(countries):
        ax = axes[i]

        # filtering data for the current country
        country_data = profit_by_category_country[profit_by_category_country['Country'] == country]

        # plotting pie chart
        wedges, texts, autotexts = ax.pie(
            country_data['Profit'],
            labels=country_data['Product Category'],
            autopct='%1.1f%%',
            startangle=140,
            colors=plt.cm.tab20.colors  # using a color map for diverse colors
        )
        ax.set_title(f'{country}')

    # hiding any unused subplots
    for j in range(i + 1, len(axes)):
        axes[j].axis('off')

    # adjusting layout to prevent overlap
    plt.tight_layout()

    # showing the plot
    plt.show()
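
Since the function derives everything it needs from the full DataFrame, a single call, sketched below, would render the grid of country-level profit pie charts.

# rendering one profit-composition pie chart per country (usage sketch)
plot_pie_charts_2(df)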

  • Function 9: Pie Chart for Product Sub-categories

This function generates pie charts to visualize the percentage distribution of the top 5 sub-categories within a specified product category for multiple countries. It creates a grid of subplots, with each subplot representing the distribution for a specific country.

# creating a function to calculate the percentage distribution of the top 5 sub-categories
def percentage_calc(country, product_category):

    # filtering the dataframe for the specified country and product category
    sub_counts = df[(df["Country"] == country) & (df["Product Category"] == product_category)]["Sub Category"].value_counts()

    # sorting and selecting top 5 sub-categories based on count
    top_sub_counts = sub_counts.nlargest(5)

    # calculating the percentage for each top sub-category
    total_count = sum(top_sub_counts)
    return {index: np.round(count / total_count * 100) for index, count in top_sub_counts.items()}

# creating a function to plot pie charts showing the percentage distribution of top 5 sub-categories for each specified country and product category
def plot_pie_charts_3(countries, product_category):

    # getting unique product categories and determining the number of unique categories
    unique_categories = df["Product Category"].unique()
    num_categories = len(unique_categories)

    # generating a color palette based on the number of unique product categories
    colors = plt.cm.get_cmap('Pastel1', num_categories)

    # creating a figure with a grid of subplots (2 rows, 2 columns)
    plt.figure(figsize=(12, 8))

    # iterating over the list of countries and creating a subplot for each
    for i, country in enumerate(countries):
        plt.subplot(2, 2, i + 1)

        # getting the percentage distribution for the current country and product category
        data = percentage_calc(country, product_category)

        # creating a pie chart with the calculated percentages and using the color palette
        wedges, texts, autotexts = plt.pie(data.values(), labels=data.keys(), autopct='%1.1f%%', colors=colors(range(num_categories)))

        # setting the title of the subplot to the country name
        plt.title(country)

    # adding the main title for the figure
    plt.suptitle(f"Top 5 {product_category} Sub Category Percentage for Each Country", fontsize=16)

    # displaying the pie charts
    plt.show()
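
As an assumed usage (the country names below are placeholders for whichever countries appear in the data), a call could look like this:

# plotting top-5 sub-category distributions for the Bikes category across four countries (placeholder country names)
plot_pie_charts_3(['Country A', 'Country B', 'Country C', 'Country D'], 'Bikes')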

  • Function 10: Bar Chart for Finding Top-performing Product Subcategories

This function generates bar charts to indicate the top-performing sub-categories for different product categories based on various performance metrics (e.g., `Quantity`, `Profit Margins`, `Profit`, `Cost`) across different countries.

def plot_top_products_by_metric(df, group_by_cols, aggregate_col, agg_func='sum', title=""):

    # grouping by the specified columns and aggregating the specified column
    if agg_func == 'sum':
        df_grouped = df.groupby(group_by_cols)[aggregate_col].sum().reset_index()
    elif agg_func == 'mean':
        df_grouped = df.groupby(group_by_cols)[aggregate_col].mean().reset_index()
    else:
        raise ValueError("Invalid aggregation function. Use 'sum' or 'mean'.")

    # finding the top sub-category product for each country based on the aggregated column
    top_products = df_grouped.groupby(group_by_cols[0]).apply(lambda x: x.loc[x[aggregate_col].idxmax()]).reset_index(drop=True)

    # creating the bar chart
    fig = px.bar(top_products, x=group_by_cols[0], y=aggregate_col, color=group_by_cols[1], title=title)

    # showing the plot
    fig.show()
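
As an assumed usage, the following call would highlight the most profitable sub-category in each country.

# finding the top sub-category by total profit for each country (usage sketch)
plot_top_products_by_metric(df, ['Country', 'Sub Category'], 'Profit', agg_func='sum', title='Most Profitable Sub Category by Country')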

Once the key functions have been created, you can start analyzing the data.

ii. Overview Analysis

You started by analyzing the overall sales situation. You counted the number of observations where revenue was less than cost (indicating loss) and created a pie chart to understand the proportion of loss-making transactions.

# creating a loss dataframe that only has those observations where cost is greater than revenue
df_loss = df[df['Revenue'] < df['Cost']]

# saving number of observations where loss is there
loss_count = len(df_loss)

# saving number of observations with no loss
non_loss_count = len(df)-loss_count

# creating labels and size values for pie chart
labels = ['Loss', 'Non-Loss']
sizes = [loss_count, non_loss_count]

# creating the pie chart
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, colors=['red', '#90EE90'])
ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
# adding title
plt.title('Proportion of Loss vs. Non-Loss Observations')

# showing plot
plt.show()

output

iii. Time-Based Analysis

Next, you will analyze the financial performance between 2015 and 2016.

  • Analysis: Financial Performance by Time

To see how financial performance has evolved, you aggregated the financial performance columns (`Revenue`, `Cost`, `Profit`, `Profit Margins`) by month and then created a line chart to visualize the trends.

The x-axis represents the Year-Month, and the y-axis shows the amount in dollars, allowing you to observe how the financial performance metrics have evolved over the given period.

# saving all the financial performance columns
financial_columns = ['Revenue', 'Cost', 'Profit', 'Profit Margins']

# aggregating data by Year-Month
financial_by_year_df = df.groupby(["Year_Month"])[financial_columns].sum().reset_index()

# creating a line chart
fig = px.line(financial_by_year_df, x = 'Year_Month', y = financial_columns, title='Financial Performance by Time')

# adding axes
fig.update_xaxes(title='Year-Month')
fig.update_yaxes(title='Amount in $')

# creating plot
fig.show()

output

  • Analysis: Financial Performance between Years

You created a bar plot with multiple subplots to compare financial columns between years. Using separate subplots, you plotted `Revenue`, `Cost`, `Profit`, and Profit Margin % by year and added an arrow highlighting the range between the minimum and maximum values.

The x-axis represents the year, and the y-axis shows the respective values or percentages, helping you compare and interpret the trends in these financial metrics across different years.

# creating a figure and subplots
plt.figure(figsize=(20, 12), facecolor='lightgrey')

# plotting Revenue
plt.subplot(2, 2, 1)
plot_metric_by_year(df, 'Revenue', 'Revenue by Year', 'skyblue')

# plotting Cost
plt.subplot(2, 2, 2)
plot_metric_by_year(df, 'Cost', 'Cost by Year', 'lightcoral')

# plotting Profit
plt.subplot(2, 2, 3)
plot_metric_by_year(df, 'Profit', 'Profit by Year', 'lightgreen')

# plotting Profit Margin
plt.subplot(2, 2, 4)
profit_margin_percentage = (df.groupby('Year')['Profit'].sum() / df.groupby('Year')['Revenue'].sum() * 100)
plt.bar(profit_margin_percentage.index, profit_margin_percentage, color='orchid')

# adding arrow to indicate the difference
plt.axhline(y=profit_margin_percentage.max(), color='red', lw=3, ls='--', alpha=0.6)
plt.annotate(
text='',
xy=(profit_margin_percentage.index[0], profit_margin_percentage.min()),
xytext=(profit_margin_percentage.index[0], profit_margin_percentage.max()),
arrowprops=dict(
arrowstyle='<|-|>',
facecolor='red',
edgecolor='red',
shrinkA=5,
shrinkB=5
)
)

# adding labels and title
plt.xlabel('Year')
plt.ylabel('Profit Margin %')
plt.title('Profit Margin % by Year')

# adjusting layout and showing plot
plt.tight_layout()
plt.show()

output

  • Analysis: Financial Performance between Years for Different Countries

To examine financial performance across different countries, you plotted `Revenue`, `Cost`, `Profit`, and `Profit Margin` by year for each country using separate charts. Each plot showed how these metrics vary over time, enabling you to compare financial performance and identify trends country-by-country.

# plotting Revenue by Year for different countries
plot_metric_by_year_country(df, 'Revenue', 'Revenue by Year and Country')

# plotting Cost by Year for different countries
plot_metric_by_year_country(df, 'Cost', 'Cost by Year and Country')

# plotting Profit by Year for different countries
plot_metric_by_year_country(df, 'Profit', 'Profit by Year and Country')

output

output

output

iv. Customer Demography Analysis

Next, you analyzed the customer demographics to better understand the individuals involved in the transactions.

  • Analysis: Customer Age

You created a histogram with 30 bins to understand the distribution of customer ages. The plot displays the frequency of customers within different age ranges, with the x-axis representing age and the y-axis showing the count of customers in each age range. This visualization helps you analyze the age distribution of the customer base.

# creating a histogram
plt.hist(df['Customer Age'], bins=30, edgecolor='black')

# adding titles and labels
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Count')

# showing the plot
plt.show()

output

  • Analysis: Products Purchased by Different Age Groups

You first categorized customer ages into bins to analyze how the quantity of products purchased varies across different age groups and product categories. You then aggregated the total quantity purchased by `Product Category` and `Age Bin`.

Using a stacked bar plot, you visualized the total quantity purchased for each product category, with different colors representing age bins. This plot helped you understand which age groups are most active in purchasing different product categories.

# defining age bins and labels
bins = range(20, 71, 10)  # bins ranging from 20 to 70 with a step of 10
labels = [f'{i}-{i+9}' for i in bins[:-1]]

# creating a new column for age bins
df['Age Bin'] = pd.cut(df['Customer Age'], bins=bins, labels=labels, right=False)

# aggregating data by Product Category and Age Bin
agg_df = df.groupby(['Product Category', 'Age Bin'])['Quantity'].sum().unstack().fillna(0)

# creating stacked bar plot
agg_df.plot(kind='bar', stacked=True, figsize=(12, 8), edgecolor='black')

# adding titles and labels
plt.title('Quantity Purchased by Product Category and Age Bin')
plt.xlabel('Product Category')
plt.ylabel('Total Quantity')
plt.legend(title='Age Bin', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)

# adjusting layout to make room for the legend
plt.tight_layout()

# showing the plot
plt.show()

output

  • Analysis: Age Distribution of Different Product Sub-categories

You created a grid of histograms with density plots to explore the distribution of customer ages across different subcategories. Each subplot represents a specific subcategory and displays the age distribution of customers within that subcategory.

This visualization helped you compare age distributions across subcategories and understand how customer age profiles vary with product types.

# getting unique sub-categories
sub_categories = df['Sub Category'].unique()
n_sub_categories = len(sub_categories)

# determining the number of rows and columns for the grid
n_cols = 3
n_rows = (n_sub_categories + n_cols - 1) // n_cols  # Compute the number of rows needed

# creating subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 4 * n_rows), sharex=True, sharey=False)
fig.suptitle('Distribution of Customer Age by Sub Category', fontsize=16)

# flattening the axes array for easier iteration
axes = axes.flatten()

# plotting histograms with density plots for each sub-category
for i, sub_category in enumerate(sub_categories):
    ax = axes[i]
    sub_df = df[df['Sub Category'] == sub_category]

    # plotting histogram with density plot to understand distribution shape
    sns.histplot(sub_df['Customer Age'], bins=10, kde=True, color='skyblue', edgecolor='black', ax=ax)
    ax.set_title(sub_category)
    ax.set_ylabel('Count')
    ax.set_xlabel('Customer Age')  # ensuring x-axis label is set for each subplot

    # adding x-axis ticks to show actual age values
    ax.tick_params(axis='x', labelrotation=45)  # rotating x-axis labels

# hiding any unused subplots
for j in range(i + 1, len(axes)):
    axes[j].axis('off')

# adjusting layout
plt.tight_layout(rect=[0, 0, 1, 0.96])  # adjusting layout to make room for the suptitle

# showing the plot
plt.show()

output

v. Product Analysis

The most crucial analysis would be of products as they directly impact operational and strategic decisions.

  • Analysis: Frequency of Different Products

You started by creating a frequency pie (donut) chart to visualize the distribution of different product categories.

# creating a frequency donut plot
plot_frequency_pie_chart(df, 'Product Category')

output

  • Analysis: Financial Contribution of Different Products

You created pie charts for each financial performance column to visualize the financial performance across different product categories. This allowed you to see how each product category contributed to different financial metrics, providing insights into financial performance distribution across categories.

# plotting pie charts for different product categories and financial performance columns
plot_pie_charts_1(df, 'Product Category', financial_cols_metrics)

output

  • Analysis: Financial Contribution of Different Product Subcategories

To understand product sub-category performance, you created a dodged bar plot based on the product analysis dataframe created at the data mining stage. This chart displayed the financial metrics across different sub-categories, allowing for easy visual comparison.

# creating dodged bar plot
prod_analysis_df.plot( x = 'Sub Category',
kind = 'bar',
stacked = False,
title = 'Analysis of Products',
figsize=(20, 10))

# showing plot
plt.show()

output

  • Analysis: Financial Performance by Product Sub-categories

To analyze financial performance and sales volume across different product sub-categories, you created bar plots for the following metrics:

  • Quantity: Aggregated the total quantity sold by each product sub-category.
  • Revenue: Aggregated the total revenue by each product sub-category.
  • Profit: Aggregated the total profit by each product sub-category.
  • Profit Margins: Calculated the average profit margins by each product sub-category.

These plots helped you understand the distribution and performance of products within each sub-category, providing a clear view of their contributions to overall financial performance and sales.

# creating bar plots for product subcategories and financial performance columns
plot_bar_chart(df, group_by_col = 'Sub Category', aggregate_col = 'Quantity', agg_func = 'sum', text_auto='.2s')
plot_bar_chart(df, group_by_col = 'Sub Category', aggregate_col = 'Revenue', agg_func = 'sum', text_auto='.2s')
plot_bar_chart(df, group_by_col = 'Sub Category', aggregate_col = 'Profit', agg_func = 'sum', text_auto='.2s')
plot_bar_chart(df, group_by_col = 'Sub Category', aggregate_col = 'Profit Margins', agg_func = 'mean')

  • Analysis: Revenue

To analyze revenue data across different product categories and countries, you created a 2×2 grid of bar plots with various configurations:

  • Mean Revenue by Product Category and Country: Shows average revenue per product category, differentiated by country.
  • Mean Revenue by Country and Product Category: Displays average revenue per country, differentiated by product category.
  • Sum Revenue by Product Category and Country: Illustrates total revenue per product category, segmented by country.
  • Sum Revenue by Country and Product Category: Represents total revenue per country, broken down by product category.

These plots helped you compare revenue distribution and performance across different categories and countries, providing insights into revenue patterns and trends.

# defining the configurations for the plots
plot_configs = [
{"x": "Product Category", "y": "Revenue", "hue": "Country", "estimator": "mean", "title": "Mean Revenue"},
{"x": "Country", "y": "Revenue", "hue": "Product Category", "estimator": "mean", "title": "Mean Revenue"},
{"x": "Product Category", "y": "Revenue", "hue": "Country", "estimator": "sum", "title": "Sum Revenue"},
{"x": "Country", "y": "Revenue", "hue": "Product Category", "estimator": "sum", "title": "Sum Revenue"}
]

# creating a figure and a 2x2 grid of subplots
fig, axs = plt.subplots(2, 2, figsize=(21, 15))

# looping through the configurations and plotting each chart
for i, config in enumerate(plot_configs):
    ax = axs[i // 2, i % 2]  # determining the subplot position
    sns.barplot(
        data=df,
        x=config["x"],
        y=config["y"],
        hue=config["hue"],
        estimator=config["estimator"],
        errorbar=None,
        palette='pastel',  # using a pastel color palette
        ax=ax
    )

    # adjusting the opacity of bars
    for patch in ax.patches:
        patch.set_alpha(1.0)
    ax.set_title(config["title"])

# adjusting layout for better spacing
plt.tight_layout()

# displaying the plots
plt.show()

output

  • Analysis: Financial Contribution of Different Products & their Subcategories

Lastly, you created a grid of pie charts to analyze and compare the distribution of various metrics across different product categories. Each row represented a different product category, and each column within that row corresponded to a specific metric:

  • Frequency: Displays the distribution of sub-categories within each product category.
  • Sum of Revenue: Shows the total revenue for each sub-category.
  • Sum of Profit: Illustrates the total profit for each sub-category.
  • Sum of Cost: Represents the total cost for each sub-category.
  • Average Profit Margins: Visualizes the average profit margins for each sub-category.

This layout helped you compare how different metrics are distributed across sub-categories within each product category.

# getting unique product categories
product_categories = df['Product Category'].unique()

# creating subplots
fig = sp.make_subplots(
rows=len(product_categories), cols=5,
subplot_titles=[f'{pc} - {metric}' for pc in product_categories for metric in ['Frequency', 'Sum of Revenue', 'Sum of Profit', 'Sum of Cost', 'Avg Profit Margins']],
specs=[[{'type': 'pie'}]*5 for _ in range(len(product_categories))],
vertical_spacing=0.15,
horizontal_spacing=0.1
)

# adding pie charts for each product category and metric
for i, category in enumerate(product_categories):
    category_df = df[df['Product Category'] == category]

    # Pie Chart #1: Frequency Pie Chart
    freq_data = category_df['Sub Category'].value_counts()
    fig.add_trace(
        go.Pie(labels=freq_data.index, values=freq_data.values,
               textinfo='label+percent+value',
               name=f'{category} - Frequency'),
        row=i + 1, col=1
    )

    # Pie Chart #2: Sum of Revenue Pie Chart
    sum_revenue = category_df.groupby('Sub Category')['Revenue'].sum()
    fig.add_trace(
        go.Pie(labels=sum_revenue.index, values=sum_revenue.values,
               textinfo='label+percent+value',
               name=f'{category} - Sum of Revenue'),
        row=i + 1, col=2
    )

    # Pie Chart #3: Sum of Profit Pie Chart
    sum_profit = category_df.groupby('Sub Category')['Profit'].sum()
    fig.add_trace(
        go.Pie(labels=sum_profit.index, values=sum_profit.values,
               textinfo='label+percent+value',
               name=f'{category} - Sum of Profit'),
        row=i + 1, col=3
    )

    # Pie Chart #4: Sum of Cost Pie Chart
    sum_cost = category_df.groupby('Sub Category')['Cost'].sum()
    fig.add_trace(
        go.Pie(labels=sum_cost.index, values=sum_cost.values,
               textinfo='label+percent+value',
               name=f'{category} - Sum of Cost'),
        row=i + 1, col=4
    )

    # Pie Chart #5: Average Profit Margins Pie Chart
    avg_profit_margin = category_df.groupby('Sub Category')['Profit Margins'].mean()
    fig.add_trace(
        go.Pie(labels=avg_profit_margin.index, values=avg_profit_margin.values,
               textinfo='label+percent+value',
               name=f'{category} - Avg Profit Margins'),
        row=i + 1, col=5
    )

# updating layout for better presentation
fig.update_layout(
title_text='Distribution by Sub Categories within Each Product Category',
showlegend=True,
height=1500,
width=2000,
title_x=0.5
)

# showing the plot
fig.show()

output

  • Analysis: Product Category – Bike

While the insights are reserved for the next section, the analysis so far made it clear that Bikes, as the most revenue-generating product category, needed special attention and held considerable potential. You started by examining the financial performance of the ‘Bikes’ product category across different countries, for which you created a series of bar plots:

  • Revenue by Country for Bikes: Displays total revenue generated from bikes in each country.
  • Cost by Country for Bikes: Shows the total cost of bikes in each country.
  • Profit by Country for Bikes: Illustrates each country’s total profit earned from bikes.
  • Profit Margin by Country for Bikes: This represents each country’s profit margin percentage for bikes.

These plots comprehensively show how bike revenue, cost, profit, and profit margins vary across countries.

# filtering the 'Bikes' category once for reuse
bikes = df[df['Product Category'] == 'Bikes']

# initializing the figure and subplots
plt.figure(figsize=(20, 6), facecolor='lightgrey')

# Plot #1: Revenue by Country for Bikes
plt.subplot(1, 4, 1, title="Bikes Revenue by Country")
bikes.groupby('Country')['Revenue'].sum().plot.bar(color='blue')

# Plot #2: Cost by Country for Bikes
plt.subplot(1, 4, 2, title="Bikes Cost by Country")
bikes.groupby('Country')['Cost'].sum().plot.bar(color='green')

# Plot #3: Profit by Country for Bikes
plt.subplot(1, 4, 3, title="Bikes Profit by Country")
bikes.groupby('Country')['Profit'].sum().plot.bar(color='red')

# Plot #4: Profit Margin by Country for Bikes
plt.subplot(1, 4, 4, title="Bikes Profit Margin by Country")
margin_percentage = (bikes.groupby('Country')['Profit'].sum() /
                     bikes.groupby('Country')['Revenue'].sum() * 100)
margin_percentage.plot.bar(color='purple')

# adjusting layout
plt.tight_layout()

# showing the plot
plt.show()


  • Analysis: Subcategories of Product – Bike

You continued exploring the Bike segment and evaluated the performance of different bike sub-categories (Mountain Bikes, Touring Bikes, and Road Bikes) across various countries. Each plot displayed the performance metrics (such as quantity sold, revenue, etc.) for these specific sub-categories in different countries. This helped you understand how each bike sub-category performed internationally.

# plotting bar charts for different subcategories of Bike for different countries
plot_sub_category_performance(df, 'Mountain Bikes')
plot_sub_category_performance(df, 'Touring Bikes')
plot_sub_category_performance(df, 'Road Bikes')

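The helper `plot_sub_category_performance` was defined earlier in the analysis and is not shown here. If you are reconstructing it, a minimal sketch along the following lines would work; the plotted metrics ('Quantity', 'Revenue', 'Profit') are assumptions inferred from the description above, so adjust them to whatever your helper tracks.

# a minimal sketch of plot_sub_category_performance (metrics are assumed)
import matplotlib.pyplot as plt

def plot_sub_category_performance(df, sub_category):
    """Plot country-wise totals of a few performance metrics for one sub-category."""
    sub_df = df[df['Sub Category'] == sub_category]
    metrics = ['Quantity', 'Revenue', 'Profit']  # assumed; swap in your own metrics

    plt.figure(figsize=(18, 5))
    for i, metric in enumerate(metrics, start=1):
        plt.subplot(1, len(metrics), i, title=f"{sub_category} {metric} by Country")
        sub_df.groupby('Country')[metric].sum().plot.bar()
    plt.tight_layout()
    plt.show()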

  • Analysis: Unit Cost & Unit Price of Different Bikes by Country

Lastly, you concluded the bike-segment analysis by exploring the relationship between unit cost and unit price for the different bike sub-categories. You created scatterplots to illustrate how unit costs compared to unit prices for each sub-category, helping you analyze pricing strategies and cost structures.

# plotting scatterplots
plot_unit_cost_vs_price(df, ['Mountain Bikes', 'Road Bikes', 'Touring Bikes'])

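Again, `plot_unit_cost_vs_price` is a helper from earlier in the analysis. A minimal sketch might look like the following; the column names 'Unit Cost' and 'Unit Price' are assumed from the dataset description.

# a minimal sketch of plot_unit_cost_vs_price (column names are assumed)
import matplotlib.pyplot as plt
import seaborn as sns

def plot_unit_cost_vs_price(df, sub_categories):
    """Scatter unit cost against unit price for each given sub-category."""
    plt.figure(figsize=(18, 5))
    for i, sub_category in enumerate(sub_categories, start=1):
        sub_df = df[df['Sub Category'] == sub_category]
        plt.subplot(1, len(sub_categories), i,
                    title=f"{sub_category}: Unit Cost vs Unit Price")
        sns.scatterplot(data=sub_df, x='Unit Cost', y='Unit Price')
    plt.tight_layout()
    plt.show()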

vi. Country Analysis

While `Country` featured in the earlier analyses, it was now time to examine this column more closely.

  • Analysis: Proportion of Each Category

You started by visualizing the distribution of observations (transactions) across different countries. You created a frequency donut plot for the `Country` column to do so. This plot helped you understand the relative proportion of each country within the dataset and highlighted how sales are distributed geographically.

# plotting frequency donut plot for 'Country' column
plot_frequency_pie_chart(df, 'Country')

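`plot_frequency_pie_chart` is another helper defined earlier. A minimal sketch of a donut-style frequency plot with Plotly could look like this; the `hole` argument is what turns a pie into a donut.

# a minimal sketch of plot_frequency_pie_chart
import plotly.express as px

def plot_frequency_pie_chart(df, column):
    """Draw a donut chart of the value counts of a column."""
    counts = df[column].value_counts().reset_index()
    counts.columns = [column, 'Count']  # normalizing column names across pandas versions
    fig = px.pie(counts, names=column, values='Count',
                 hole=0.4,  # 0.4 hole renders the pie as a donut
                 title=f"Frequency Distribution of {column}")
    fig.show()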

  • Analysis: Profit of Different Countries

To analyze the contribution of profit by country, you calculated the total profit for each country and then created a pie chart. This chart visualizes the proportion of total profit contributed by each country, helping you understand which countries are driving the profit.

# calculating the profit contribution by country
country_sales = df.groupby('Country')['Profit'].sum().reset_index()

# plotting a pie chart from the aggregated data
fig = px.pie(country_sales, values='Profit', names='Country',
             color_discrete_sequence=px.colors.sequential.Sunsetdark_r,
             title="Profit by Country")

# showing the plot
fig.show()


  • Analysis: Financial Contribution of Countries

You created pie charts for each financial performance column by country to explore the financial performance across different countries. This visualization helped you understand how each country contributes to these financial metrics, revealing each country’s distribution and relative importance in the overall financial performance.

# plotting pie charts for countries and financial performance related columns
plot_pie_charts_1(df, 'Country', financial_cols_metrics)

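Both `plot_pie_charts_1` and the list `financial_cols_metrics` come from earlier in the analysis. As a rough sketch, assuming `financial_cols_metrics` is a list of column names such as `['Revenue', 'Cost', 'Profit']`, the helper might look like this:

# a minimal sketch of plot_pie_charts_1 (the metric list is assumed)
import matplotlib.pyplot as plt

def plot_pie_charts_1(df, group_col, metric_cols):
    """One pie chart per metric, showing each group's share of that metric."""
    plt.figure(figsize=(6 * len(metric_cols), 6))
    for i, metric in enumerate(metric_cols, start=1):
        totals = df.groupby(group_col)[metric].sum()
        plt.subplot(1, len(metric_cols), i, title=f"{metric} by {group_col}")
        plt.pie(totals.values, labels=totals.index, autopct='%1.1f%%')
    plt.tight_layout()
    plt.show()

The helper `plot_pie_charts_2` used below presumably follows the same pattern, only grouping sub-category profit within each country.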

  • Analysis: Most Profitable Product Category by Country

To identify the top-performing product category in terms of profit for each country, you grouped the data by `Country` and `Product Category`, calculated the total profit for each combination, and then sorted the results.

You selected the top product category for each country based on profit and created a bar plot to visualize these top categories and their respective profits. Each bar is annotated with the product category and profit value, providing a clear view of which product category generates the highest profit in each country.

# grouping by 'Country' and 'Product Category', and summing 'Profit'
category_profit = df.groupby(['Country', 'Product Category'])['Profit'].sum().reset_index()

# sorting by 'Profit' in descending order and keeping the top product category for each country
category_profit = category_profit.sort_values('Profit', ascending=False).groupby('Country').head(1).reset_index(drop=True)

# creating a bar plot of 'Profit' by 'Country' and 'Product Category'
g = sns.barplot(data=category_profit, x='Country', y='Profit')

# annotating each bar with the 'Product Category' and 'Profit' value
# (itertuples renames 'Product Category' to '_2' because the column name contains a space)
for v in category_profit.itertuples():
    g.text(x=v.Index, y=v.Profit + 1e4, s=f"{v._2}--{round(v.Profit, 2)}",
           size=10, rotation=45, ha="center")

# keeping the x-axis labels horizontal for readability
plt.xticks(rotation=0)

# displaying the plot
plt.show()


  • Analysis: Contribution of Different Product Categories to Different Countries

To analyze the distribution of sub-categories within each country, you created pie charts that visualize the proportion of each sub-category for different countries. This allows you to see how various sub-categories contribute to each country’s overall profit, providing insights into regional preferences and market dynamics.

# plotting pie charts for countries and sub-categories
plot_pie_charts_2(df)


  • Analysis: Product Subcategory Composition by Countries

You created pie charts for each product category (Bikes, Clothing, and Accessories) to examine the distribution of product sub-categories across different countries. Each chart visualizes how the product sub-categories are distributed within the various countries, helping you understand regional variations and preferences.

# setting country names
countries = df['Country'].unique().tolist()

# plotting pie charts for the specified countries and product category
plot_pie_charts_3(countries, product_category="Bikes")
plot_pie_charts_3(countries, product_category="Clothing")
plot_pie_charts_3(countries, product_category="Accessories")

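For reference, here is a minimal sketch of what `plot_pie_charts_3` might look like. Note that the original call passes no DataFrame, so the helper presumably reads the global `df`:

# a minimal sketch of plot_pie_charts_3; reads the global df, as the call above implies
import matplotlib.pyplot as plt

def plot_pie_charts_3(countries, product_category):
    """For one product category, one pie per country showing its sub-category mix."""
    cat_df = df[df['Product Category'] == product_category]
    plt.figure(figsize=(5 * len(countries), 5))
    for i, country in enumerate(countries, start=1):
        counts = cat_df[cat_df['Country'] == country]['Sub Category'].value_counts()
        plt.subplot(1, len(countries), i, title=f"{country} - {product_category}")
        plt.pie(counts.values, labels=counts.index, autopct='%1.1f%%')
    plt.tight_layout()
    plt.show()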

  • Analysis: Top Product Sub-categories for Different Countries

To analyze the top products by various metrics across different countries, you created bar charts for:

  • Quantity: Displays the top products by the total quantity sold in each country.
  • Profit Margins: Shows the top products by average profit margins in each country.
  • Profit: Illustrates the top products by average profit in each country.
  • Cost: Represents the top products by average cost in each country.

These charts helped you identify which products perform best in terms of quantity, profit margins, profit, and cost across different countries.

# creating bar chart for analyzing countries by quantity, cost, profit and profit margin
plot_top_products_by_metric(df, ["Country", "Sub Category"], "Quantity", agg_func='sum', title="Top Product by Country and Quantity")
plot_top_products_by_metric(df, ["Country", "Sub Category"], "Cost", agg_func='mean', title="Top Product by Country and Cost")
plot_top_products_by_metric(df, ["Country", "Sub Category"], "Profit", agg_func='mean', title="Top Product by Country and Profit")
plot_top_products_by_metric(df, ["Country", "Sub Category"], "Profit Margins", agg_func='mean', title="Top Product by Country and Profit Margin")
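
As with the other helpers, `plot_top_products_by_metric` was defined earlier. Below is a minimal sketch under the assumption that it keeps the single best sub-category per country and annotates each bar with its name:

# a minimal sketch of plot_top_products_by_metric (behavior is assumed)
import matplotlib.pyplot as plt

def plot_top_products_by_metric(df, group_cols, metric, agg_func='sum', title=''):
    """Bar chart of the top sub-category per country for a given metric."""
    country_col, product_col = group_cols
    agg = df.groupby(group_cols)[metric].agg(agg_func).reset_index()

    # keeping the best-performing product per country
    top = (agg.sort_values(metric, ascending=False)
              .groupby(country_col).head(1)
              .reset_index(drop=True))

    ax = top.plot.bar(x=country_col, y=metric, legend=False,
                      figsize=(12, 5), title=title)
    # annotating each bar with the winning sub-category
    for idx, row in top.iterrows():
        ax.text(idx, row[metric], row[product_col],
                ha='center', va='bottom', rotation=45, size=9)
    plt.tight_layout()
    plt.show()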

6) Statistical Analysis

While visualizations are great for exploration, statistical analysis is needed to quantify the relationships between variables precisely. Python's `scipy.stats` module and the `statsmodels` library power the tests below, reaffirming why data analytics with Python is the right choice.

Next, you used different statistical tests and techniques to analyze the data.
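
The snippets in this section assume the following imports, which would have been set up at the start of the analysis:

# imports assumed by the statistical tests below
import pandas as pd
from scipy import stats
import statsmodels.api as sm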

  • Analysis: Comparison of Revenue between Gender

To determine if there is a significant difference in revenue between male and female customers, you performed an independent t-test. You separated the revenue data by gender and then conducted the test.

The resulting t-statistic and p-value help you assess whether the observed difference in revenue is statistically significant. Based on the p-value compared to a significance level (alpha), you interpreted whether to reject or fail to reject the null hypothesis, indicating whether gender significantly impacts revenue.

# separating the revenue data by gender
male_revenue = df[df['Customer Gender'] == 'M']['Revenue']
female_revenue = df[df['Customer Gender'] == 'F']['Revenue']

# performing an independent t-test (Welch's, since equal variances are not assumed)
t_stat, p_value = stats.ttest_ind(male_revenue, female_revenue, equal_var=False)

# printing results
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# interpreting results
alpha = 0.05  # significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in revenue between males and females.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in revenue between males and females.")


  • Analysis: Correlation between Revenue and Profit

You calculated the Pearson correlation coefficient to assess the relationship between revenue and profit. This statistic measures the strength and direction of the linear relationship between these two variables. You also computed the p-value to determine if the observed correlation is statistically significant.

# calculating the Pearson correlation coefficient
corr_coefficient, p_value = stats.pearsonr(df['Revenue'], df['Profit'])

# printing results
print(f"Correlation coefficient: {corr_coefficient:.4f}")
print(f"P-value: {p_value:.4f}")

# interpreting results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant correlation between revenue and profit.")
else:
    print("Fail to reject the null hypothesis: There is no significant correlation between revenue and profit.")


  • Analysis: Comparing the Average Revenue across Different Product Categories

To investigate if there is a significant difference in average revenue among different product categories, you performed a One-Way ANOVA. You compared the revenue across three product categories: Accessories, Bikes, and Clothing. The F-statistic and p-value provided insights into whether the observed differences in average revenue are statistically significant.

# performing the One-Way ANOVA
f_stat, p_value = stats.f_oneway(
    df[df['Product Category'] == 'Accessories']['Revenue'],
    df[df['Product Category'] == 'Bikes']['Revenue'],
    df[df['Product Category'] == 'Clothing']['Revenue']
)

# printing results
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# interpreting results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in average revenue across product categories.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in average revenue across product categories.")


  • Analysis: Checking the Association between Product Category and Customer Marital Status

You created a contingency table and performed a Chi-Square test to examine the association between product category and customer marital status. The Chi-Square statistic and p-value helped you assess whether there is a significant relationship between these two categorical variables.

# creating contingency table
contingency_table = pd.crosstab(df['Product Category'], df['Customer Marital Status'])

# performing the Chi-Square test
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# printing results
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# interpreting results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant association between Product Category and Customer Marital Status.")
else:
    print("Fail to reject the null hypothesis: There is no significant association between Product Category and Customer Marital Status.")


  • Analysis: Finding Predictors for the Profit Column

You performed a multiple linear regression to analyze the relationship between several independent variables and profit. You defined the independent variables, added a constant term, and then specified the dependent variable as profit.

After fitting the regression model, you reviewed the summary, which includes coefficients, standard errors, p-values, and other statistics. This summary helped you understand how each independent variable influences profit and assessed the model’s overall fit.

# defining the independent variables and adding a constant term
X = df[['Customer Age', 'Customer Estimated Income', 'Loyalty Balance', 'Customer Duration',
        'Quantity', 'Unit Price', 'Profit Margins', 'Unit Profit Margin']]
X = sm.add_constant(X)

# defining the dependent variable
y = df['Profit']

# creating and fitting the regression model
model = sm.OLS(y, X).fit()

# printing the summary
print(model.summary())


Now that the data analysis has been performed using Python, the next thing to learn is how to report the findings.

How to Communicate your Insights and Resolve Anomalies?

Once you complete your analysis, the next crucial step is communicating your findings. This step is vital because, if done poorly, it can undermine all the work you have done so far. To report your insights effectively, follow these guidelines-

  • Define Audience: Understand who your target audience is (business stakeholders, technical experts, etc.) and tailor your report accordingly.
  • Storytelling: Present your insights as a clear and coherent narrative so that your findings come across as compelling and logical, and highlight the key findings.
  • Visualization: Visualization is critical, as it allows the audience to follow your findings easily.
  • Set Context: Maintain the business context throughout the communication, and close with concrete recommendations and action items.
  • Address Anomalies: If unusual patterns surfaced during the analysis, highlight them and discuss their potential meaning and impact. Identify their root cause and communicate how they can be resolved.

FAQs:

  • What are the quick tips to start with Python for data analytics?

To learn Python for data science and analytics, you must become proficient with critical libraries like Pandas, NumPy, SciPy, and Matplotlib. You should also regularly work on small to medium datasets to apply your analytical skills.

  • Can Python replace spreadsheets?

Yes, Python can replace spreadsheets. It has superior data manipulation, visualization, and statistical analysis capabilities and works better with big data. However, spreadsheets are still handy for smaller data and more straightforward use cases, as they are highly accessible and easy to use.

  • How do you learn Python for data analytics?

You can self-learn or enroll in online courses to learn Python for data analytics. You can also consider structured, specialized programs like those offered by AnalytixLabs.
