
Datawhale Exercise: Used Car Price Prediction

Exploratory Data Analysis (EDA)

1. Data overview

Loading the libraries

#coding:utf-8
# Import the warnings package and use a filter to ignore warning messages.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

Loading the data

## 1) Load the training and test sets;
path = './'
Train_data = pd.read_csv(path+'car_train_0110.csv', sep=' ')
Test_data = pd.read_csv(path+'car_testA_0110.csv', sep=' ')

To determine the path: in a notebook environment I usually use !dir to see the current directory.
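An equivalent check from Python itself (a small sketch using only the standard library):

import os
print(os.getcwd())       # current working directory
print(os.listdir(path))  # confirm car_train_0110.csv / car_testA_0110.csv are present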

Feature descriptions:

[Figure: field descriptions of the dataset]

New skill: using .append() to look at the first 5 rows and the last 5 rows at the same time

## 2) Abbreviated look at data (head()+shape)
Train_data.head().append(Train_data.tail())
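Note that DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0; the same head-plus-tail view with pd.concat:

pd.concat([Train_data.head(), Train_data.tail()])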


Observe the data dimensions

Train_data.shape,Test_data.shape


Overview profile: .describe() to see statistics, .info() to see data types
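The calls behind those overviews are plain pandas:

Train_data.describe()  # count, mean, std, min, quartiles, max for every column
Train_data.info()      # column dtypes and non-null counts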


1.1 Determination of missing data and anomalies

1.1.1 Viewing nan


NaN can be viewed directly in the following two ways ↓ :

Train_data.isnull().sum()


Visualizing the NAs is more intuitive

# find columns containing NA
tmp = Train_data.isnull().any()
tmp[tmp == True]


New skill: the msno library (missing-value visualization). First, a plain pandas bar chart of the NaN counts:

Train_data.isnull().sum().plot(kind='bar')


Visualize the missing values

msno.matrix(Train_data.sample(250))

where Train_data.sample(250) takes a random sample of 250 rows, and white bars indicate missing values


Directly display the number of non-missing samples per feature

msno.bar(Train_data.sample(250), labels=True)


Use msno.heatmap() to see the correlation between missing values

msno.heatmap(Train_data.sample(250))


1.1.2 *Outlier detection (important! easily overlooked)

Understanding data types via Train_data.info()

Train_data.info()
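info() is where object-typed columns stand out. In this dataset notRepairedDamage is the usual suspect: in the Datawhale tutorial it stores '-' as a placeholder for missing values (an assumption worth verifying on your own copy of the data). A sketch of checking and converting it:

# Expose any placeholder values hiding in the object-typed column
print(Train_data['notRepairedDamage'].value_counts())
# If '-' really marks missing values, convert it to NaN so it is counted as such
Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)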

1.2 Understanding the distribution of predicted values

Features divide into categorical features and numeric features.

The significance of viewing the distributions is:

a. transforming non-normally distributed data toward a normal distribution in good time (a small sketch follows this list);

b. anomaly detection
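A minimal sketch of point a., assuming the usual log1p / Box-Cox choices (Box-Cox requires strictly positive input):

import numpy as np
from scipy import stats

y = Train_data['price']
y_log = np.log1p(y)              # log(1 + x), defined even when x == 0
y_bc, lam = stats.boxcox(y + 1)  # Box-Cox; returns the transformed values and the fitted lambda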

1.2.1 Numeric features

Train_data['price']

It turns out they are all ints


Statistical distribution ↓

Train_data['price'].value_counts()


## 1) General distribution profile (unbounded Johnson distribution, etc.)
import scipy.stats as st
y = Train_data['price']
# distplot is deprecated in newer seaborn (use displot/histplot there)
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)


Conclusion: price does not obey a normal distribution, so it must be transformed before regression can be performed. The unbounded Johnson distribution fits better.

1.2.1.1 Correlation analysis
1.2.1.2 *Skewness and kurtosis

Skewness describes the direction and degree of asymmetry in the distribution of data; it is a numerical measure of how asymmetric the distribution is. By definition, skewness is the third standardized moment of the sample:

Skew(X) = E[((X - μ) / σ)^3]

Kurtosis, also known as the kurtosis coefficient, characterizes how high the peak of a probability density curve is at the mean; intuitively it reflects the sharpness of the peak. It is the fourth standardized moment:

Kurt(X) = E[((X - μ) / σ)^4]

(pandas' .kurt() reports excess kurtosis, i.e. the value above minus 3, so a normal distribution scores 0.)

## 2) View skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())


Batch calculation of skew

Train_data.skew()


See the distribution of the skew values:
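A sketch of how that plot can be produced (with distplot, as elsewhere in this note):

sns.distplot(Train_data.skew(), color='blue', axlabel='Skewness')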

Batch calculation of kurt

Train_data.kurt()


See the distribution of the kurt values:
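And similarly for kurtosis:

sns.distplot(Train_data.kurt(), color='orange', axlabel='Kurtosis')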

View the distribution of the target variable

## 3) View the specific frequencies of the predicted values
plt.hist(Train_data['price'], orientation='vertical', histtype='bar', color='red')
plt.show()


Conclusion: there are very few values greater than 20,000; these could in fact be treated as special values (outliers) and either filled or deleted, as sketched below.
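A hedged sketch of that treatment (the 20,000 cutoff is taken from the observation above):

# Count, and optionally drop, prices above the cutoff
outliers = Train_data['price'] > 20000
print(outliers.sum(), "rows with price > 20000")
# Train_data = Train_data[~outliers]  # uncomment to drop them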

Since log(0) == -inf, the plot fails, so instead I plot the distribution using log(1+x). This differs from the tutorial, which plots with a plain log (which errors out here because of the -inf values):


# The distribution is much more uniform after the log transformation; log-transforming
# the target is a commonly used trick in prediction problems
plt.hist(np.log(1 + Train_data['price']), orientation='vertical', histtype='bar', color='red')
plt.show()


Separate out the label, i.e. the predicted value

Y_train = Train_data['price']

# This way of splitting applies to data without direct label encoding;
# it doesn't apply here, so the split must be made manually based on actual meaning

# Numeric features
numeric_features = Train_data.select_dtypes(include=[np.number])
numeric_features.columns

# Categorical features
categorical_features = Train_data.select_dtypes(include=['object'])
categorical_features.columns

numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode']
# nunique distribution per feature
for cat_fea in categorical_features:
    print("Distribution of " + cat_fea + ":")
    print("{} has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())

Running this prints the distribution of each categorical feature in turn.


The same check on the test data gives similar output.

numeric_features.append('price')
numeric_features


price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()
correlation


View the correlations (strong → weak)

print(correlation['price'].sort_values(ascending = False),'\n')
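As a small follow-on, if only the k features most correlated with price are wanted (nlargest is standard pandas; k = 10 is an arbitrary choice):

k = 10
top_cols = correlation.nlargest(k, 'price')['price'].index
print(list(top_cols))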


Visualize the correlation

f, ax = plt.subplots(figsize=(7, 7))
plt.title('Correlation of Numeric Features with Price', y=1, size=16)
sns.heatmap(correlation, square=True, vmax=0.8)


price has fulfilled its historical mission here. Delete it.

del price_numeric['price']
## 2) Look at the skewness and kurtosis of several features
for col in numeric_features:
    print('{:15}'.format(col), 
          'Skewness: {:05.2f}'.format(Train_data[col].skew()) , 
          '   ' ,
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt())  
         )


1.2.1.3 *Visualizing the distribution of each numeric feature (easily overlooked)
## 3) Visualize the distribution of each numeric feature
f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")


Conclusion: anonymous features (v_*) are relatively evenly distributed

1.2.1.4 *Visualizing how the numeric features relate to each other (easily overlooked)
## 4) Visualize how the numeric features relate to each other
sns.set()
columns = ['price', 'v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
sns.pairplot(Train_data[columns], size=2, kind='scatter', diag_kind='kde')  # seaborn >= 0.9 renamed size to height
plt.show()
1.2.1.5 *Visualizing multivariate regression relationships (easily overlooked)
## 5) Visualize multivariate regression relationships
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
# ['v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
v_12_scatter_plot = pd.concat([Y_train, Train_data['v_12']], axis=1)
sns.regplot(x='v_12', y='price', data=v_12_scatter_plot, scatter=True, fit_reg=True, ax=ax1)
v_8_scatter_plot = pd.concat([Y_train, Train_data['v_8']], axis=1)
sns.regplot(x='v_8', y='price', data=v_8_scatter_plot, scatter=True, fit_reg=True, ax=ax2)
v_0_scatter_plot = pd.concat([Y_train, Train_data['v_0']], axis=1)
sns.regplot(x='v_0', y='price', data=v_0_scatter_plot, scatter=True, fit_reg=True, ax=ax3)
power_scatter_plot = pd.concat([Y_train, Train_data['power']], axis=1)
sns.regplot(x='power', y='price', data=power_scatter_plot, scatter=True, fit_reg=True, ax=ax4)
v_5_scatter_plot = pd.concat([Y_train, Train_data['v_5']], axis=1)
sns.regplot(x='v_5', y='price', data=v_5_scatter_plot, scatter=True, fit_reg=True, ax=ax5)
v_2_scatter_plot = pd.concat([Y_train, Train_data['v_2']], axis=1)
sns.regplot(x='v_2', y='price', data=v_2_scatter_plot, scatter=True, fit_reg=True, ax=ax6)
v_6_scatter_plot = pd.concat([Y_train, Train_data['v_6']], axis=1)
sns.regplot(x='v_6', y='price', data=v_6_scatter_plot, scatter=True, fit_reg=True, ax=ax7)
v_1_scatter_plot = pd.concat([Y_train, Train_data['v_1']], axis=1)
sns.regplot(x='v_1', y='price', data=v_1_scatter_plot, scatter=True, fit_reg=True, ax=ax8)
v_14_scatter_plot = pd.concat([Y_train, Train_data['v_14']], axis=1)
sns.regplot(x='v_14', y='price', data=v_14_scatter_plot, scatter=True, fit_reg=True, ax=ax9)
v_13_scatter_plot = pd.concat([Y_train, Train_data['v_13']], axis=1)
sns.regplot(x='v_13', y='price', data=v_13_scatter_plot, scatter=True, fit_reg=True, ax=ax10)

1.2.2 Categorical features (plotted here, though the results are not used later)

Viewing the unique distribution of the categorical features with .nunique() / .value_counts()

## 1) unique distribution
for fea in categorical_features:
    print(Train_data[fea].nunique())
categorical_features
1.2.2.1 Box plot visualization
## 2) Box plot visualization of categorical features
# Because the categories of name and regionCode are too sparse, only the non-sparse ones are drawn
categorical_features = ['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, size=5)  # seaborn >= 0.9: height instead of size
g = g.map(boxplot, "value", "price")
Train_data.columns
1.2.2.2 Violin plot visualization
## 3) Violin plot visualization of categorical features
catg_list = categorical_features
target = 'price'
for catg in catg_list:
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()
categorical_features = ['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']

1.2.2.3 Bar chart visualization

## 4) Bar chart visualization of categorical features
def bar_plot(x, y, **kwargs):
    sns.barplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(bar_plot, "value", "price")

1.2.2.4 Frequency per category visualization of features (count_plot)

## 5) Per-category frequency visualization of categorical features (count_plot)
def count_plot(x, **kwargs):
    sns.countplot(x=x)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(count_plot, "value")

2. Generate data reports with pandas_profiling (new skill)

import pandas_profiling  # in newer versions the package is named ydata-profiling
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")  # to_file() needs an output file name; "example.html" is an assumed placeholder

3. Summary

The dataset here is not large, but there are still some valuable takeaways:

a. Identify features that need further processing by examining the NaN situation:

Filling (and which fill method: mean fill, zero fill, mode fill, etc.);

Dropping;

Classifying the samples first, then predicting with different feature sets/models.

b. Anomaly detection through the distributions:

Analyze whether anomalous feature values are truly anomalous (deviating far from the mean, or containing special symbols);

Decide whether outliers should be removed or filled with normal values, etc.

c. Analyze the distribution of the labels by plotting the label.

d. Plot the features alone and jointly with the label (statistical plots, scatter plots) to understand their distributions intuitively; this step can also surface outliers in the data. Box plots help analyze features with deviating values, and joint feature-feature and feature-label plots help analyze their correlations.
