Five real-world Pandas examples for analyzing operational data

Hello everyone. I have shared many articles about Pandas before; today I'm sharing five small but elegant real-world Pandas examples.

The article covers five topics:

How to simulate data on your own
Multiple data processing methods
Data statistics and visualization
User RFM Model
User Repurchase Cycle

Build data

The data used here was simulated by the author. It consists of two tables, order data and fruit information data, which are then merged.

import pandas as pd
import numpy as np
import random
from datetime import *
import time

import plotly.express as px
import plotly.graph_objects as go
import plotly as py

# Drawing subplots
from plotly.subplots import make_subplots

1. Time fields
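The code for this step is not shown here. As a minimal sketch, assuming one order per day across 2019 to 2021 (the range used in the analysis headings later on), the `time_range` variable that the order table relies on could be built like this:

```python
import pandas as pd

# Assumption: one simulated order per day from 2019-01-01 to 2021-12-31
time_range = pd.date_range(start="2019-01-01", end="2021-12-31", freq="D")

print(len(time_range))   # number of simulated orders
print(time_range[:3])    # first three timestamps
```

The daily frequency is only one possible choice; any datetime sequence of the desired length would work with the later steps.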

2. Fruits and users
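The code for this step is also missing. The sketch below assumes that `fruit_list` and `name_list` are drawn at random, with replacement, from two small lists. The fruit names are hypothetical (eight of them, to match the eight prices used later), and only the customer names xiaoming, Michk and Mike come from the article itself.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the simulation is repeatable

# Assumed order times: one per day for 2019-2021
time_range = pd.date_range(start="2019-01-01", end="2021-12-31", freq="D")

# Hypothetical fruit names; eight entries to match the eight prices used later
fruits = ["banana", "apple", "grape", "orange",
          "cherry", "watermelon", "pear", "peach"]
# "xiaoming", "Michk" and "Mike" appear in the article; the others are made up
names = ["xiaoming", "Michk", "Mike", "Tom", "Jimmy"]

# Draw one fruit and one customer at random for every order
fruit_list = np.random.choice(fruits, size=len(time_range), replace=True)
name_list = np.random.choice(names, size=len(time_range), replace=True)

print(fruit_list[:5], name_list[:5])
```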

3. Generate order data

order = pd.DataFrame({
    "time": time_range,  # Order time
    "fruit": fruit_list,  # Fruit name
    "name": name_list,  # Customer name
    # Quantity purchased, in kilograms
    "kilogram": np.random.choice(list(range(50, 100)), size=len(time_range), replace=True)
})

order

4. Generate information data on fruits

infortmation = pd.DataFrame({
    "fruit": fruits,
    "price": [3.8, 8.9, 12.8, 6.8, 15.8, 4.9, 5.8, 7],
    "region": ["South China", "North China", "Northwest", "Central China", "Northwest", "South China", "North China", "Central China"]
})

infortmation

5. Data consolidation

The order table and the fruit information table are merged directly into one complete DataFrame. This df is the data processed in all of the following steps.
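The merge call itself is not shown. A minimal sketch, assuming the two tables share the `fruit` column and that every order row should be kept (a left join), might look like this. The tiny tables and the name `fruit_info` are illustrative only.

```python
import pandas as pd

# Small stand-ins for the simulated tables (values are illustrative)
order = pd.DataFrame({
    "time": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
    "fruit": ["apple", "pear", "apple"],
    "name": ["xiaoming", "Mike", "Michk"],
    "kilogram": [60, 75, 90],
})
fruit_info = pd.DataFrame({
    "fruit": ["apple", "pear"],
    "price": [8.9, 5.8],
    "region": ["North China", "South China"],
})

# Assumption: join on the shared "fruit" column, keeping every order row
df = pd.merge(order, fruit_info, on="fruit", how="left")
print(df)
```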

6. Generate a new field: order amount
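No code is shown for this step. Assuming the order amount is simply the quantity in kilograms multiplied by the unit price, the new field can be derived in one vectorized expression:

```python
import pandas as pd

# Illustrative merged rows: quantity in kilograms and unit price
df = pd.DataFrame({
    "fruit": ["apple", "pear"],
    "kilogram": [60, 75],
    "price": [8.9, 5.8],
})

# Assumption: amount = quantity * unit price, computed column-wise
df["amount"] = df["kilogram"] * df["price"]
print(df)
```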

By this point you have seen:

How to generate time-related data
How to generate random data from lists (iterable objects)
How to build a Pandas DataFrame by hand, including deriving new fields
How to merge data in Pandas

Analysis Dimension 1: Time

Monthly sales volume trend, 2019-2021

1. First, extract the year and month:

df["year"] = df["time"].dt.year
df["month"] = df["time"].dt.month
# Extract year and month together as a string such as "201901"
df["year_month"] = df["time"].dt.strftime('%Y%m')

df

2. View the field types:
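The output of this step is not shown. A small sketch of the check, assuming `df.dtypes` is what is meant: the `time` column is a datetime, `year` and `month` are integers, and `year_month` is a string, which matters later because Plotly treats it as a categorical axis.

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(["2019-01-05", "2020-11-20"]),
    "kilogram": [60, 75],
})
df["year"] = df["time"].dt.year
df["month"] = df["time"].dt.month
df["year_month"] = df["time"].dt.strftime("%Y%m")

# One dtype per column
print(df.dtypes)
```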

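3. Aggregate by year and month, then plot:

# Sales volume by year and month
df1 = df.groupby(["year_month"])["kilogram"].sum().reset_index()

# px.bar is an assumption: the chart function name is missing from the source
fig = px.bar(df1, x="year_month", y="kilogram", color="kilogram")
fig.update_layout(xaxis_tickangle=45)   # Tilt the x-axis tick labels by 45 degrees

fig.show()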

Monthly sales amount trend, 2019-2021

df2 = df.groupby(["year_month"])["amount"].sum().reset_index()

df2["amount"] = df2["amount"].round(2)

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=df2["year_month"],
    y=df2["amount"],
    mode='lines+markers',  # Draw both the line and the data points
    name='lines'))  # Trace name shown in the legend

fig.update_layout(xaxis_tickangle=45)   # Tilt the x-axis tick labels by 45 degrees

fig.show()

Annual sales volume, sales amount, and average sales amount
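No code survives for this section. One way to get all three yearly figures in a single pass is named aggregation with `groupby().agg()`. The sketch below uses illustrative numbers, and the output column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020, 2021],
    "kilogram": [60, 80, 50, 90, 70],
    "amount": [228.0, 712.0, 640.0, 612.0, 343.0],
})

# Total volume, total amount and mean order amount per year
df3 = (
    df.groupby("year")
      .agg(kilogram_sum=("kilogram", "sum"),
           amount_sum=("amount", "sum"),
           amount_mean=("amount", "mean"))
      .reset_index()
)
print(df3)
```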

Analysis Dimension 2: Commodities

Percentage of annual fruit sales

df4 = df.groupby(["year","fruit"]).agg({"kilogram":"sum","amount":"sum"}).reset_index()
df4["year"] = df4["year"].astype(str)
df4["amount"] = df4["amount"].round(2)

from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(
    rows=1, 
    cols=3,
    subplot_titles=["2019","2020","2021"],
    specs=[[{"type": "domain"},   # Specify the type by type
           {"type": "domain"},
           {"type": "domain"}]]
)  

years = df4["year"].unique().tolist()

for i, year in enumerate(years):
    name = df4[df4["year"] == year].fruit
    value = df4[df4["year"] == year].kilogram

    # One pie per year, placed in column i+1 of the single row
    fig.add_trace(go.Pie(labels=name, values=value),
                  row=1, col=i+1)

fig.update_traces(
    textposition='inside',   # 'inside', 'outside', 'auto', 'none'
    textinfo='percent+label',
    insidetextorientation='radial',   # 'horizontal', 'radial', 'tangential'
    hole=.3,   # Turn each pie into a donut
    hoverinfo="label+percent+name"
)

fig.show()

Comparison of annual sales amount by fruit

years = df4["year"].unique().tolist()

for year in years:

    df5 = df4[df4["year"]==year]
    # go.Treemap is an assumption: the trace name is missing from the source
    fig = go.Figure(go.Treemap(
        labels = df5["fruit"].tolist(),
        parents = df5["year"].tolist(),
        values = df5["amount"].tolist(),
        textinfo = "label+value+percent root"
    ))

    fig.show()

Monthly change in merchandise sales

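# Assumed aggregation: the plot needs monthly amounts per fruit, which the source does not show
df5 = df.groupby(["year_month","fruit"])["amount"].sum().reset_index()
fig = px.bar(df5,x="year_month",y="amount",color="fruit")   # px.bar is an assumption
fig.update_layout(xaxis_tickangle=45)   # Tilt the x-axis tick labels by 45 degrees
fig.show()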

The same monthly changes shown as a line chart:

Analysis Dimension 3: Region

Sales in different regions
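The code for this comparison is not shown. A minimal sketch, assuming "sales" here means the total order amount per region, groups by `region` and sorts the totals. The figures are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["South China", "North China", "Northwest", "South China"],
    "amount": [100.0, 150.0, 400.0, 200.0],
})

# Total sales amount per region, largest first
df6 = (
    df.groupby("region")["amount"]
      .sum()
      .reset_index()
      .sort_values("amount", ascending=False)
)
print(df6)
```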

Average annual sales by region

df7 = df.groupby(["year","region"])["amount"].mean().reset_index()

Analysis Dimension 4: Users

Comparison of user order volume and amount

df8 = df.groupby(["name"]).agg({"time":"count","amount":"sum"}).reset_index().rename(columns={"time":"order_number"})
df8.style.background_gradient(cmap="Spectral_r")

User Fruit Preferences

The analysis is based on the number of orders and the amount of orders per user for each type of fruit:

df9 = df.groupby(["name","fruit"]).agg({"time":"count","amount":"sum"}).reset_index().rename(columns={"time":"number"})

df10 = df9.sort_values(["name","number","amount"],ascending=[True,False,False])

df10.style.bar(subset=["number","amount"],color="#a97fcf")

# px.bar is an assumption: the chart function name is missing from the source
px.bar(df10,
       x="fruit",
       y="amount",
       # color="number",
       facet_col="name"
      )

User Layering - RFM Model

The RFM model is an important tool for measuring customer value and a customer's ability to generate profit.

The model describes a customer's value through three indicators: how recently the customer made a purchase, how often the customer purchases overall, and how much the customer has spent in total. Based on these three indicators, customers can be divided into eight value categories. The three indicators are:

Recency (R) is the number of days since the customer's most recent purchase. Its value depends on the point in time at which the analysis is run. In theory, the more recent a customer's last purchase, the more likely the customer is to buy again.
Frequency (F) is the number of purchases the customer has made. Customers who buy most often tend to be more loyal, and increasing purchase frequency means capturing a larger share of the market.
Monetary value (M) is the total amount the customer has spent.

The three metrics are computed below with several Pandas methods, starting with F and M: the number of orders and the total amount per customer.
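The F and M code is not shown. A minimal sketch, assuming F is the count of order rows and M is the sum of `amount` per customer, can use one grouped aggregation. The data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["xiaoming", "xiaoming", "Mike", "Mike", "Mike"],
    "time": pd.to_datetime(["2021-12-15", "2021-11-01",
                            "2021-12-30", "2021-10-10", "2021-09-01"]),
    "amount": [100.0, 50.0, 80.0, 20.0, 40.0],
})

# F = number of orders, M = total amount, one row per customer
fm = (
    df.groupby("name")
      .agg(F=("time", "count"), M=("amount", "sum"))
      .reset_index()
)
print(fm)
```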

How is the R indicator computed?

1. First, compute the difference between each order's time and the current time.

2. Sort each user's records by this difference R in ascending order. The first row for each user is then the most recent purchase. For the user xiaoming, for example, the most recent purchase was on December 15, 25 days before the current time.

3. Deduplicate by user, keeping the first row, to get each user's R value:

4. Merge the data to obtain all three indicators:
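The four steps above can be sketched as follows. The data is illustrative, and the analysis date `now` is a hypothetical fixed value, chosen so that xiaoming's R comes out as the 25 days mentioned above; in practice it would be the date the analysis is run.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["xiaoming", "xiaoming", "Mike", "Mike", "Mike"],
    "time": pd.to_datetime(["2021-12-15", "2021-11-01",
                            "2021-12-30", "2021-10-10", "2021-09-01"]),
    "amount": [100.0, 50.0, 80.0, 20.0, 40.0],
})
now = pd.Timestamp("2022-01-09")  # hypothetical analysis date

# 1. Days between each order and the analysis date
df["R"] = (now - df["time"]).dt.days

# 2. Sort each user's orders so the most recent (smallest R) comes first
df_sorted = df.sort_values(["name", "R"], ascending=[True, True])

# 3. Keep only the first row per user: that user's R value
r = df_sorted.drop_duplicates(subset="name", keep="first")[["name", "R"]]

# F and M per user, computed on the same orders
fm = df.groupby("name").agg(F=("time", "count"), M=("amount", "sum")).reset_index()

# 4. Merge R with F and M
rfm = pd.merge(r, fm, on="name")
print(rfm)
```

Sorting and then dropping duplicates follows the steps as described; `df.groupby("name")["R"].min()` would give the same R values in one call.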

When there is enough data and there are enough users, the RFM model alone is enough to sort users into eight types.
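The article does not show how the eight types are assigned. One common convention, used here purely as an assumption, is to compare each indicator with its mean and combine the three yes/no flags, which gives 2 x 2 x 2 = 8 possible segments. The data and the column names `R_flag`, `F_flag`, `M_flag` and `segment` are hypothetical.

```python
import pandas as pd

rfm = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "R": [5, 40, 10, 60],        # days since last purchase
    "F": [10, 2, 3, 8],          # number of orders
    "M": [900.0, 100.0, 700.0, 300.0],  # total amount
})

# Flag = 1 when the user is better than average on that indicator
rfm["R_flag"] = (rfm["R"] < rfm["R"].mean()).astype(int)  # smaller R is better
rfm["F_flag"] = (rfm["F"] > rfm["F"].mean()).astype(int)
rfm["M_flag"] = (rfm["M"] > rfm["M"].mean()).astype(int)

# Three-character code such as "111" or "010"; eight codes are possible
rfm["segment"] = (
    rfm["R_flag"].astype(str) + rfm["F_flag"].astype(str) + rfm["M_flag"].astype(str)
)
print(rfm[["name", "segment"]])
```

Medians or fixed business thresholds could replace the means; the choice changes which users land in each segment.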

User Repurchase Cycle Analysis

The repurchase cycle is the time interval between two consecutive purchases by the same user. For the user xiaoming, for example, the first two repurchase cycles are 4 days and 22 days.

Here's the process of solving for each user's repurchase cycle:

1. Sort each user's purchase times in ascending order:

2. Shift the time column by one row:

3. Merge and take the difference:

The null values belong to each user's first record, which has no earlier purchase to compare against. These rows are dropped afterward.

4. Extract the number of days as an integer:
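The steps above can be sketched as follows. This version shifts within each user via `groupby().shift()` rather than shifting and merging separate tables, which is an assumption about the implementation. The intermediate names `df14`, `prev_time` and `gap` are hypothetical, and the dates are chosen so that xiaoming's first two cycles are the 4 and 22 days quoted earlier.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["xiaoming", "Mike", "xiaoming", "xiaoming", "Mike"],
    "time": pd.to_datetime(["2019-01-05", "2019-01-02", "2019-01-01",
                            "2019-01-27", "2019-01-12"]),
})

# 1. Sort each user's purchases by time
df14 = df.sort_values(["name", "time"]).reset_index(drop=True)

# 2. Previous purchase time of the same user (NaT for the first purchase)
df14["prev_time"] = df14.groupby("name")["time"].shift(1)

# 3. Interval between consecutive purchases
df14["gap"] = df14["time"] - df14["prev_time"]

# Drop each user's first record, which has no previous purchase
df16 = df14.dropna(subset=["gap"]).copy()

# 4. Keep only the integer number of days
df16["day"] = df16["gap"].dt.days
print(df16[["name", "time", "day"]])
```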

5. Comparison of repurchase cycles

px.bar(df16,
       x="day",
       y="name",
       orientation="h",
       color="day",
       color_continuous_scale="spectral"   # alternative: "purples"
      )

In the chart above, a narrower rectangle means a shorter interval, and the full length of a user's bar is the sum of all of that user's repurchase cycles. Next, look at each user's total and average repurchase cycle:
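The totals and averages are not shown in code. A minimal sketch, assuming a table with one row per repurchase interval (`name`, `day`), computes both in one grouped aggregation. The values and the output column names are illustrative.

```python
import pandas as pd

# One row per repurchase interval, in days
df16 = pd.DataFrame({
    "name": ["xiaoming", "xiaoming", "Mike", "Mike", "Mike"],
    "day": [4, 22, 10, 30, 20],
})

# Total and mean repurchase cycle per user
cycle = (
    df16.groupby("name")["day"]
        .agg(day_sum="sum", day_mean="mean")
        .reset_index()
)
print(cycle)
```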

Conclusion: Michk and Mike have relatively long total repurchase cycles, which suggests they are loyal users over the long run. Their average repurchase cycles are relatively short, which suggests they also repurchase actively within short periods.

The violin plots below show the same thing: Michk and Mike have the most concentrated distributions of repurchase cycles.
