1. Preface
In the fields of data science and data analysis, Pandas is undoubtedly one of the most powerful data processing libraries in the Python ecosystem. This article will dive into the advanced usage of Pandas, focusing on tips for data cleaning and efficient analysis to help you grow from a beginner Pandas user to a senior data analyst.
2. Review of Pandas core data structures
Series and DataFrame
import pandas as pd
import numpy as np

# Create a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Create a DataFrame
df = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20230101'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'
})
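After construction, a quick structural check confirms that each column got the dtype you intended; a minimal inspection of the df built above:

# Inspect shape, dtypes, and true memory footprint
print(df.shape)                 # (4, 6)
print(df.dtypes)                # float64, datetime64[ns], float32, int32, category, object
df.info(memory_usage='deep')    # 'deep' accounts for the contents of object columns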
3. Advanced data cleaning skills
3.1 Missing value processing
3.1.1 Detecting missing values
# Detect missing values
df.isnull().sum()

# Visualize missing values
import missingno as msno
msno.matrix(df)
3.1.2 Handling missing values
# Drop missing values
df.dropna(how='any')                    # Drop a row if any of its values is missing
df.dropna(subset=['col1', 'col2'])      # Drop rows where these specific columns are missing

# Fill missing values
df.fillna(value={'col1': 0, 'col2': 'unknown'})   # Different fill values per column
df.fillna(method='ffill')               # Forward fill
df.fillna(method='bfill', limit=2)      # Backward fill, at most 2 consecutive values

# Fill by interpolation
df.interpolate(method='linear')         # Linear interpolation
df.interpolate(method='time')           # Time-based interpolation (requires a DatetimeIndex)
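Note that the method= keyword of fillna is deprecated in pandas 2.x; the dedicated methods express the same fills directly. Equivalents for the two fill calls above:

# pandas 2.x style, replacing fillna(method=...)
df.ffill()          # forward fill
df.bfill(limit=2)   # backward fill, at most 2 consecutive values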
3.2 Outlier processing
3.2.1 Detecting outliers
# Use descriptive statistics
df.describe()

# Z-score method
from scipy import stats
import numpy as np

z_scores = stats.zscore(df['numeric_col'])
abs_z_scores = np.abs(z_scores)
filtered_entries = abs_z_scores < 3
df_clean = df[filtered_entries]

# IQR method
Q1 = df['numeric_col'].quantile(0.25)
Q3 = df['numeric_col'].quantile(0.75)
IQR = Q3 - Q1
df_clean = df[~((df['numeric_col'] < (Q1 - 1.5 * IQR)) | (df['numeric_col'] > (Q3 + 1.5 * IQR)))]
3.2.2 Handling outliers
import numpy as np

# Cap at boundary values (lower_bound / upper_bound: e.g. the IQR fences computed above)
df['numeric_col'] = np.where(df['numeric_col'] > upper_bound, upper_bound,
                             np.where(df['numeric_col'] < lower_bound, lower_bound,
                                      df['numeric_col']))

# Use binning
df['binned'] = pd.cut(df['numeric_col'], bins=5, labels=False)
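The nested np.where calls work, but pandas has a dedicated method for exactly this capping pattern. A more idiomatic equivalent, assuming the same lower_bound and upper_bound:

# clip() caps values below/above the given bounds in one call
df['numeric_col'] = df['numeric_col'].clip(lower=lower_bound, upper=upper_bound)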
3.3 Data transformation
3.3.1 Standardization and normalization
# Min-Max normalization
df['normalized'] = (df['col'] - df['col'].min()) / (df['col'].max() - df['col'].min())

# Z-score standardization
df['standardized'] = (df['col'] - df['col'].mean()) / df['col'].std()
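If the result feeds a scikit-learn pipeline, the same scalings are available as reusable estimator objects, which also remember the fitted parameters for transforming new data later; a sketch assuming scikit-learn is installed:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Scalers expect 2D input, hence df[['col']]; ravel() flattens back to 1D
df['normalized'] = MinMaxScaler().fit_transform(df[['col']]).ravel()
df['standardized'] = StandardScaler().fit_transform(df[['col']]).ravel()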
3.3.2 Categorical data encoding
# One-hot encoding
pd.get_dummies(df, columns=['categorical_col'])

# Label encoding
from sklearn.preprocessing import LabelEncoder
df['encoded'] = LabelEncoder().fit_transform(df['categorical_col'])
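When pulling in scikit-learn just for label encoding feels heavy, the category dtype offers an equivalent pandas-native route:

# cat.codes assigns each distinct category a stable integer label
df['encoded'] = df['categorical_col'].astype('category').cat.codes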
4. Efficient data analysis skills
4.1 High-performance data processing
4.1.1 Using eval() and query()
# eval() speeds up column arithmetic
df.eval('new_col = col1 + col2', inplace=True)

# query() filters efficiently
df.query('col1 > col2 & col3 == "value"')
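query() can also reference local Python variables with the @ prefix, which keeps thresholds out of hard-coded strings; a small example with a hypothetical threshold variable:

threshold = 100                 # hypothetical cutoff
df.query('col1 > @threshold')   # @name binds to the local Python variable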
4.1.2 Saving memory with the category type
# Convert to the category type
df['category_col'] = df['category_col'].astype('category')

# Check memory usage
df.memory_usage(deep=True)
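To see the saving concretely, compare the same column's footprint under both dtypes; a quick check along these lines:

# Bytes used as plain object strings vs. category
before = df['category_col'].astype('object').memory_usage(deep=True)
after = df['category_col'].astype('category').memory_usage(deep=True)
print(f'object: {before:,} bytes  ->  category: {after:,} bytes')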
4.2 Advanced grouping operations
4.2.1 agg aggregation function
import numpy as np

# Multi-function aggregation
df.groupby('group_col').agg({
    'col1': ['mean', 'max', 'min'],
    'col2': lambda x: np.percentile(x, 95)
})

# Named aggregation (pandas 0.25+)
df.groupby('group_col').agg(
    mean_col1=('col1', 'mean'),
    max_col2=('col2', 'max'),
    custom=('col3', lambda x: x.max() / x.min())   # e.g. the per-group max-to-min ratio
)
4.2.2 transform and apply
# transform keeps the original DataFrame shape
df['group_mean'] = df.groupby('group_col')['value_col'].transform('mean')

# apply allows arbitrary per-group functions
def custom_func(group):
    return (group - group.mean()) / group.std()

df.groupby('group_col').apply(custom_func)
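Because transform preserves row alignment, the group-wise standardization that custom_func performs can also be written without apply; a sketch of the transform-only version:

# Group-wise z-score with the original row order intact
grouped = df.groupby('group_col')['value_col']
df['value_z'] = (df['value_col'] - grouped.transform('mean')) / grouped.transform('std')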
4.3 Time series analysis
4.3.1 Resampling
import numpy as np

# Downsampling: monthly average
df.resample('M').mean()

# Upsampling: daily frequency, forward-filled
df.resample('D').ffill()

# Custom resampling
def custom_resampler(array_like):
    return np.sum(array_like) * 1.5

df.resample('W').apply(custom_resampler)
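resample() only works against a DatetimeIndex (or a datetime column passed via on=), so time-stamped data stored in an ordinary column needs one setup step first; a sketch assuming a hypothetical date_col:

# Prerequisite for resampling: a sorted DatetimeIndex
df['date_col'] = pd.to_datetime(df['date_col'])
df = df.set_index('date_col').sort_index()
monthly = df.resample('M').mean()   # pandas >= 2.2 prefers the alias 'ME'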
4.3.2 Rolling window calculations
# Simple rolling mean
df['col'].rolling(window=7).mean()

# Expanding window
df['col'].expanding().sum()

# Custom rolling function
def custom_roll(x):
    return x[-1] * 2 + x[0]

df['col'].rolling(window=3).apply(custom_roll, raw=True)  # raw=True passes NumPy arrays, so x[-1] works
5. Data visualization integration
5.1 Direct plotting
# Line plot
df.plot(x='date_col', y=['col1', 'col2'], figsize=(12, 6))

# Box plot
df.boxplot(column=['col1', 'col2', 'col3'])

# Hexbin plot
df.plot.hexbin(x='col1', y='col2', gridsize=20)
5.2 Advanced visualization skills
# Seaborn integration
import seaborn as sns
sns.pairplot(df, hue='category_col')

# Plotly interactive visualization
import plotly.express as px
fig = px.scatter_matrix(df, dimensions=['col1', 'col2', 'col3'], color='category_col')
fig.show()
6. Performance optimization skills
6.1 Using efficient data types
# Downcast numeric types
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')

# Use the boolean type
df['bool_col'] = df['bool_col'].astype('bool')
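The same downcasting can be applied across every numeric column in one pass; one way to wrap it as a small helper:

import numpy as np

def downcast_numeric(frame):
    # Downcast each numeric column to the smallest safe subtype
    out = frame.copy()
    for col, dtype in out.dtypes.items():
        if not isinstance(dtype, np.dtype):
            continue  # skip extension dtypes such as category
        if np.issubdtype(dtype, np.integer):
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif np.issubdtype(dtype, np.floating):
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

df = downcast_numeric(df)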
6.2 Avoid chained assignment
# Bad practice: chained assignment
df[df['col'] > 100]['new_col'] = 1   # May silently take no effect (SettingWithCopyWarning)

# Good practice: use loc
df.loc[df['col'] > 100, 'new_col'] = 1
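pandas 2.x goes further with Copy-on-Write, under which chained assignment consistently takes no effect instead of sometimes working, so such bugs surface immediately; it is opt-in in 2.x and becomes the default in pandas 3.0:

# Opt in to Copy-on-Write semantics (pandas >= 2.0)
pd.set_option('mode.copy_on_write', True)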
6.3 Using parallel processing
# Use swifter to speed up apply
import swifter
df['new_col'] = df['col'].swifter.apply(lambda x: x * 2)

# Use dask for data that exceeds memory
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=4)
result = ddf.groupby('group_col').mean().compute()
7. Practical case: e-commerce data analysis
7.1 Data preparation
# Read data (file names are illustrative)
orders = pd.read_csv('orders.csv', parse_dates=['order_date'])
products = pd.read_csv('products.csv')
customers = pd.read_csv('customers.csv')

# Merge data
merged = pd.merge(orders, products, on='product_id')
merged = pd.merge(merged, customers, on='customer_id')
7.2 RFM Analysis
# Compute RFM metrics
now = pd.to_datetime('today')
rfm = merged.groupby('customer_id').agg({
    'order_date': lambda x: (now - x.max()).days,  # Recency
    'order_id': 'count',                           # Frequency
    'total_price': 'sum'                           # Monetary
}).rename(columns={
    'order_date': 'recency',
    'order_id': 'frequency',
    'total_price': 'monetary'
})

# RFM scoring
rfm['r_score'] = pd.qcut(rfm['recency'], q=5, labels=[5, 4, 3, 2, 1])
rfm['f_score'] = pd.qcut(rfm['frequency'], q=5, labels=[1, 2, 3, 4, 5])
rfm['m_score'] = pd.qcut(rfm['monetary'], q=5, labels=[1, 2, 3, 4, 5])
rfm['rfm_score'] = rfm['r_score'].astype(str) + rfm['f_score'].astype(str) + rfm['m_score'].astype(str)

# Customer segmentation
segment_map = {
    r'[4-5][4-5][4-5]': 'High-value customer',
    r'[3-5][3-5][3-5]': 'Potential customer',
    r'[1-2][1-2][1-2]': 'At-risk customer',
    r'.*': 'General customer'
}
rfm['segment'] = rfm['rfm_score'].replace(segment_map, regex=True)
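With segments assigned, a quick look at the distribution validates the scoring before acting on it; for example:

# Customer count per segment
print(rfm['segment'].value_counts())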
8. Summary
This article provides a comprehensive introduction to Pandas' advanced usage in data cleaning and efficient analysis, including:
Advanced data cleaning skills: missing value processing, outlier detection and handling, and data transformation
Efficient data analysis methods: high-performance data processing, advanced grouping operations, time series analysis
Data visualization integration: Direct drawing methods and advanced visualization techniques
Performance optimization tips: data type optimization, avoid chain assignment, parallel processing
Practical case: RFM model application in e-commerce data analysis
With these advanced techniques, you will be able to process and analyze complex data sets far more efficiently, providing strong support for data-driven decision-making. Remember, the key to proficiency with Pandas is constant practice and exploration of its rich feature set.