Hello everyone. I've shared many articles about Pandas before; today I'm sharing 5 small but practical real-world Pandas examples.
The content is mainly divided into:
- How to simulate data on your own
- Multiple data processing methods
- Data statistics and visualization
- User RFM Model
- User Repurchase Cycle
Build data
The data used in this case was simulated by the author. It consists of two datasets, order data and fruit information data, which are merged into one later on.
import pandas as pd
import numpy as np
import random
from datetime import *
import time

import plotly.express as px
import plotly.graph_objects as go
import plotly as py
# Drawing subplots
from plotly.subplots import make_subplots
1. Time fields
2. Fruits and users
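Steps 1 and 2 might look like the sketch below. The date range, the fruit names, and the customers Tom and Jimmy are placeholders chosen for illustration, since the source does not list them; what matters downstream is that there are eight fruits (to match the eight prices used later) and that the three lists have the same length.

```python
import random
from datetime import datetime, timedelta

random.seed(0)  # make the simulation reproducible

# 1. Time field: 1,000 random order dates between 2019 and 2021 (assumed range)
start = datetime(2019, 1, 1)
span_days = (datetime(2021, 12, 31) - start).days
time_range = [start + timedelta(days=random.randint(0, span_days))
              for _ in range(1000)]

# 2. Fruits and users (fruit names and some customer names are placeholders)
fruits = ["banana", "apple", "grape", "orange",
          "cherry", "watermelon", "pear", "peach"]
names = ["xiaoming", "Michk", "Mike", "Tom", "Jimmy"]

# Draw one fruit and one customer per order, with replacement
fruit_list = random.choices(fruits, k=len(time_range))
name_list = random.choices(names, k=len(time_range))
```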
3. Generate order data
order = pd.DataFrame({
    "time": time_range,    # Time of order
    "fruit": fruit_list,   # Name of fruit
    "name": name_list,     # Customer name
    # Kilograms purchased
    "kilogram": np.random.choice(list(range(50,100)), size=len(time_range), replace=True)
})
order
4. Generate information data on fruits
infortmation = pd.DataFrame({
    "fruit": fruits,
    "price": [3.8, 8.9, 12.8, 6.8, 15.8, 4.9, 5.8, 7],
    "region": ["South China","North China","Northwest","Central China","Northwest","South China","North China","Central China"]
})
infortmation
5. Data consolidation
The order data and the fruit information are merged directly into one complete DataFrame; this df is the data processed in the rest of the article.
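A minimal sketch of the merge, using three made-up orders and two made-up fruits. Joining on the `fruit` column with `how="left"` is an assumption about how the two tables were combined; it keeps every order and attaches the matching price and region.

```python
import pandas as pd

# Tiny stand-ins for the two simulated tables (values are illustrative)
order = pd.DataFrame({
    "time": pd.to_datetime(["2019-01-05", "2020-03-10", "2021-07-21"]),
    "fruit": ["apple", "banana", "apple"],
    "name": ["xiaoming", "Mike", "Michk"],
    "kilogram": [60, 75, 90],
})
infortmation = pd.DataFrame({
    "fruit": ["apple", "banana"],
    "price": [8.9, 3.8],
    "region": ["North China", "South China"],
})

# Left join: every order row is kept, price and region come from the fruit table
df = pd.merge(order, infortmation, on="fruit", how="left")
print(df)
```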
6. Generate a new field: order amount
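The new field can be sketched as below, assuming the order amount is simply kilograms purchased times unit price (the source does not show the formula). The two rows are made-up values.

```python
import pandas as pd

df = pd.DataFrame({"kilogram": [60, 75], "price": [8.9, 3.8]})

# Order amount = quantity * unit price, rounded to two decimals
df["amount"] = (df["kilogram"] * df["price"]).round(2)
print(df)
```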
From this section you can learn:
- How to generate time-related data
- How to generate random data from lists (iterable objects)
- How to build a Pandas DataFrame yourself, including generating new fields
- How to merge data in Pandas
Analysis Dimension 1: Time
Monthly Sales Volume Trend, 2019-2021
1. First, extract the year and month:
df["year"] = df["time"].dt.year
df["month"] = df["time"].dt.month
# Extract year and month together
df["year_month"] = df["time"].dt.strftime('%Y%m')
df
2. View the field types:
3. Statistics by year and month and display:
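# Sales volume by year and month
df1 = df.groupby(["year_month"])["kilogram"].sum().reset_index()
# Bar chart of monthly volume; px.scatter accepts the same arguments
fig = px.bar(df1,x="year_month",y="kilogram",color="kilogram")
fig.update_layout(xaxis_tickangle=45)   # Tilt angle of the x-axis labels
fig.show()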
Monthly Sales Amount Trend, 2019-2021
df2 = df.groupby(["year_month"])["amount"].sum().reset_index()
df2["amount"] = df2["amount"].apply(lambda x:round(x,2))
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=df2["year_month"],
    y=df2["amount"],
    mode='lines+markers',   # Draw both lines and markers
    name='lines'))          # Trace name
fig.update_layout(xaxis_tickangle=45)   # Tilt angle of the x-axis labels
fig.show()
Annual sales volume, sales amount and average sales amount
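One way to get all three yearly figures in a single pass is named aggregation. This is a sketch on four made-up rows; the result name `df3` and the output column names are assumptions, not taken from the source.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2019, 2020, 2021],
    "kilogram": [60, 80, 70, 90],
    "amount": [228.0, 712.0, 896.0, 612.0],
})

# Per year: total kilograms, total amount, and average amount per order
df3 = df.groupby("year").agg(
    kilogram_sum=("kilogram", "sum"),
    amount_sum=("amount", "sum"),
    amount_mean=("amount", "mean"),
).reset_index()
print(df3)
```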
Analysis Dimension 2: Commodities
Share of annual sales volume by fruit
df4 = df.groupby(["year","fruit"]).agg({"kilogram":"sum","amount":"sum"}).reset_index()
df4["year"] = df4["year"].astype(str)
df4["amount"] = df4["amount"].apply(lambda x: round(x,2))

from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(
    rows=1,
    cols=3,
    subplot_titles=["2019","2020","2021"],
    specs=[[{"type": "domain"},   # Specify the subplot type with "type"
            {"type": "domain"},
            {"type": "domain"}]]
)

years = df4["year"].unique().tolist()

for i, year in enumerate(years):
    name = df4[df4["year"] == year].fruit
    value = df4[df4["year"] == year].kilogram

    # One pie per year, placed in column i+1
    fig.add_trace(go.Pie(labels=name, values=value), row=1, col=i+1)

fig.update_traces(
    textposition='inside',   # 'inside','outside','auto','none'
    textinfo='percent+label',
    insidetextorientation='radial',   # horizontal, radial, tangential
    hole=.3,
    hoverinfo="label+percent+name"
)

fig.show()
Comparison of annual sales amount by fruit
years = df4["year"].unique().tolist()

# One treemap per year, with the year as the root node
for year in years:
    df5 = df4[df4["year"]==year]
    fig = go.Figure(go.Treemap(
        labels = df5["fruit"].tolist(),
        parents = df5["year"].tolist(),
        values = df5["amount"].tolist(),
        textinfo = "label+value+percent root"
    ))
    fig.show()
Monthly change in merchandise sales
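# Monthly amount per fruit (df5 is rebuilt here so that it has a year_month column)
df5 = df.groupby(["year_month","fruit"])["amount"].sum().reset_index()
fig = px.bar(df5,x="year_month",y="amount",color="fruit")
fig.update_layout(xaxis_tickangle=45)   # Tilt angle of the x-axis labels
fig.show()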
The same change shown as a line chart:
Analysis Dimension 3: Region
Sales in different regions
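A sketch of the regional totals on made-up rows, sorted so the strongest region comes first. The result name `df6` is an assumption; the source does not show this step.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["South China", "North China", "South China", "Northwest"],
    "amount": [228.0, 712.0, 896.0, 612.0],
})

# Total sales amount per region, largest first
df6 = (df.groupby("region")["amount"].sum()
         .reset_index()
         .sort_values("amount", ascending=False))
print(df6)
```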
Average annual sales by region
df7 = df.groupby(["year","region"])["amount"].mean().reset_index()
Analysis Dimension 4: Users
Comparison of user order volume and amount
df8 = df.groupby(["name"]).agg({"time":"count","amount":"sum"}).reset_index().rename(columns={"time":"order_number"})
df8.style.background_gradient(cmap="Spectral_r")
User Fruit Preferences
The analysis is based on the number of orders and the amount of orders per user for each type of fruit:
df9 = df.groupby(["name","fruit"]).agg({"time":"count","amount":"sum"}).reset_index().rename(columns={"time":"number"})
df10 = df9.sort_values(["name","number","amount"],ascending=[True,False,False])
df10.style.bar(subset=["number","amount"],color="#a97fcf")
px.bar(df10,
       x="fruit",
       y="amount",
       # color="number",
       facet_col="name"
      )
User Layering - RFM Model
The RFM model is an important tool for measuring customer value and a customer's ability to generate profit.
The model describes each user with three indicators: recent transaction behavior, overall transaction frequency, and total transaction amount. Together they characterize the customer's value, and splitting each indicator into high and low yields eight categories of customer value. The three indicators are:
- Recency (R) is the number of days since the customer's most recent purchase. The value changes with the point in time at which the analysis is run. In theory, the more recent a customer's last purchase, the more likely they are to buy again.
- Frequency (F) is the number of purchases a customer has made. Customers who buy most often tend to be more loyal, and increasing purchase frequency means capturing a larger market share.
- Monetary value (M) is the total amount spent by the customer on the purchase.
These three metrics are computed below using several Pandas methods, starting with F and M: the number of orders and the total amount per customer.
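F and M can both come from one `groupby`, sketched here on five made-up orders. The name `df_fm` and the dates and amounts are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["xiaoming", "xiaoming", "Mike", "Michk", "Mike"],
    "time": pd.to_datetime(["2021-11-01", "2021-12-15", "2021-10-03",
                            "2021-09-09", "2021-12-01"]),
    "amount": [228.0, 712.0, 896.0, 612.0, 304.0],
})

# F = number of orders per customer, M = total amount per customer
df_fm = (df.groupby("name")
           .agg(F=("time", "count"), M=("amount", "sum"))
           .reset_index())
print(df_fm)
```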
How do you solve for the R indicator?
1. First, compute the difference between each order's time and the current time
2. Sort each user's differences R in ascending order; the first row is that user's most recent purchase. For user xiaoming, for example, the most recent purchase was on December 15, which is 25 days before the current time
3. Deduplicate by user, keeping the first row, which gives each user's R value:
4. Merge the data to obtain all three indicators:
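The four steps above can be sketched as follows. The orders are made up, and the reference time of 2022-01-09 is an assumption chosen so that xiaoming's December 15 purchase is 25 days old, as in the example in step 2.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["xiaoming", "xiaoming", "Mike", "Michk", "Mike"],
    "time": pd.to_datetime(["2021-11-01", "2021-12-15", "2021-10-03",
                            "2021-09-09", "2021-12-01"]),
    "amount": [228.0, 712.0, 896.0, 612.0, 304.0],
})
now = pd.Timestamp("2022-01-09")  # assumed "current time"

# 1. Days between each order and the reference time
df["R"] = (now - df["time"]).dt.days

# 2. Within each user, put the smallest difference (most recent order) first
df_sorted = df.sort_values(["name", "R"], ascending=[True, True])

# 3. Keep only the first row per user: that row's R is the user's recency
df_r = df_sorted.drop_duplicates(subset="name", keep="first")[["name", "R"]]

# 4. Merge with F (order count) and M (total amount)
df_fm = (df.groupby("name")
           .agg(F=("time", "count"), M=("amount", "sum"))
           .reset_index())
rfm = df_r.merge(df_fm, on="name")
print(rfm)
```

An equivalent shortcut for steps 2 and 3 would be `df.groupby("name")["R"].min()`; the sort-and-deduplicate version is kept here because it mirrors the steps as described.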
When the amount of data is large enough and there are enough users, it is possible to use only the RFM model to categorize users into 8 types
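As a rough illustration of the eight-way split, each indicator can be compared with its mean and the three high/low flags combined into a code. This is only a sketch: the RFM values are made up, Tom is a placeholder customer, and using the mean as the cut-off is an assumption (medians or business-defined thresholds are equally common).

```python
import pandas as pd

rfm = pd.DataFrame({
    "name": ["xiaoming", "Mike", "Michk", "Tom"],
    "R": [25, 39, 122, 10],
    "F": [12, 30, 8, 25],
    "M": [5200.0, 15800.0, 3100.0, 9900.0],
})

# 1 marks the favorable side of the mean:
# recent (low R), frequent (high F), high spend (high M)
r_flag = (rfm["R"] < rfm["R"].mean()).astype(int)
f_flag = (rfm["F"] > rfm["F"].mean()).astype(int)
m_flag = (rfm["M"] > rfm["M"].mean()).astype(int)

# Three binary flags give 2 * 2 * 2 = 8 possible segments, e.g. "111" or "100"
rfm["segment"] = r_flag.astype(str) + f_flag.astype(str) + m_flag.astype(str)
print(rfm)
```

Here "111" would be the most valuable group (recent, frequent, high-spending) and "000" the least engaged.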
User Repurchase Cycle Analysis
The repurchase cycle is the time interval between two consecutive purchases made by a user. For user xiaoming, for example, the first two repurchase cycles are 4 days and 22 days.
Here's the process of solving for each user's repurchase cycle:
1. Sort each user's purchase times in ascending order
2. Shift the time column by one row:
3. Combine the two columns and take the difference:
The null values come from each user's first record, which has no earlier purchase to compare against; these rows are dropped afterward.
4. Extract the number of days as a plain number:
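The steps above can be sketched as follows. The dates are made up so that xiaoming's first two cycles come out as 4 and 22 days; the column name `prev_time` is an assumption, and `groupby(...).shift(1)` is used here so the shift never crosses from one user into the next.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["xiaoming", "Mike", "xiaoming", "Mike", "xiaoming"],
    "time": pd.to_datetime(["2019-01-05", "2019-01-02", "2019-01-01",
                            "2019-01-12", "2019-01-27"]),
})

# 1. Sort each user's purchases chronologically
df16 = df.sort_values(["name", "time"]).reset_index(drop=True)

# 2. Shift the time column down by one row within each user
df16["prev_time"] = df16.groupby("name")["time"].shift(1)

# 3. Interval between consecutive purchases; each user's first row is NaT
df16["day"] = df16["time"] - df16["prev_time"]
df16 = df16.dropna(subset=["day"]).copy()

# 4. Keep only the number of days as an integer
df16["day"] = df16["day"].dt.days
print(df16)
```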
5. Comparison of repurchase cycles
px.bar(df16,
       x="day",
       y="name",
       orientation="h",
       color="day",
       color_continuous_scale="spectral"   # purples
      )
In the chart above, narrower rectangles indicate shorter intervals, and the total length of a user's bar is that user's overall repurchase cycle. Next, look at the sum and the average of each user's repurchase cycles:
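Both figures can be computed with a single aggregation. This is a sketch on made-up intervals; the result name `df17` is an assumption.

```python
import pandas as pd

# One row per repurchase interval, in days (illustrative values)
df16 = pd.DataFrame({
    "name": ["xiaoming", "xiaoming", "Mike", "Mike", "Mike"],
    "day": [4, 22, 10, 6, 8],
})

# Total and average repurchase cycle per user
df17 = df16.groupby("name")["day"].agg(["sum", "mean"]).reset_index()
print(df17)
```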
Conclusion: Michk and Mike have relatively long overall repurchase cycles, which marks them as loyal users over the long run. Their average repurchase cycles are relatively short, which indicates that they repurchase actively within short periods.
The violin plot below also shows that Michk and Mike have the most concentrated distributions of repurchase cycles.