Exploratory Data Analysis (EDA)
1. Loading and overview of the data
Loading the libraries
# coding: utf-8
# Import the warnings package and use its filter to ignore warning messages.
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
Loading the data
## 1) Load the training and test sets
path = './'
Train_data = pd.read_csv(path + 'car_train_0110.csv', sep=' ')
Test_data = pd.read_csv(path + 'car_testA_0110.csv', sep=' ')
To confirm the path: in a notebook environment I usually run !dir to list the current directory.
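A minimal equivalent without shell magics (my own addition, not from the original note) is to check the working directory from Python:

import os
print(os.getcwd())      # the current working directory
print(os.listdir('.'))  # the files in it, e.g. the two CSVs loaded above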
A first look at the data
New skill: using .append() to look at the first 5 rows and the last 5 rows at the same time
## 2) Brief look at the data (head() + shape)
Train_data.head().append(Train_data.tail())
Observe the data dimensions
Train_data.shape,Test_data.shape
Overview: use .describe() to see summary statistics and .info() to see data types.
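The corresponding notebook cells aren't shown in the original, so here is a small sketch simply applying the two methods named above:

Train_data.describe()   # count / mean / std / min / quartiles / max for each column
Train_data.info()       # column dtypes and non-null counts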
1.1 Checking for missing data and outliers
1.1.1 Viewing nan
NaN values can also be viewed directly in the following two ways:
Train_data.isnull().sum()
Visualizing the NaN counts is more intuitive
# Find the columns that contain NaN
tmp = Train_data.isnull().any()
tmp[tmp == True]
New skill: use of msno library (missing value visualization)
Train_data.isnull().sum().plot(kind='bar')
Visualize the missing values
msno.matrix(Train_data.sample(250))
Here Train_data.sample(250) takes a random sample of 250 rows; the white bars indicate missing values.
Directly display the number of non-missing samples per feature
msno.bar(Train_data.sample(250), labels=True)
Use msno.heatmap() to see the correlation between missing values:
msno.heatmap(Train_data.sample(250))
1.1.2 *Outlier detection (important and easy to overlook)
Understanding data types via Train_data.info()
Train_data.info()
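One thing info() typically exposes here is a feature stored as object rather than a number; inspecting its values then reveals placeholder symbols. A sketch of that follow-up (my own addition, assuming, as in this competition's data, that notRepairedDamage is such a column containing '-'):

# Hypothetical follow-up: object-typed columns often hide anomalous entries.
obj_cols = Train_data.select_dtypes(include=['object']).columns
for col in obj_cols:
    print(col, Train_data[col].unique()[:10])   # e.g. notRepairedDamage may contain '-'
# A common fix for such placeholders is to map them to NaN:
# Train_data['notRepairedDamage'].replace('-', np.nan, inplace=True)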
1.2 Understanding the distribution of predicted values
Features fall into categorical features and numeric features.
The point of examining the distributions is:
a. Transforming non-normally distributed data toward a normal distribution in time (a quick quantitative check is sketched right after this list);
b. Anomaly detection.
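As a quantitative complement to the visual checks below (my own addition; the note itself only fits distributions visually), scipy's normality test can be applied to the target:

from scipy import stats

# D'Agostino-Pearson normality test: a tiny p-value means the data are not normal.
stat, p = stats.normaltest(Train_data['price'])
print('statistic = %.2f, p-value = %.3g' % (stat, p))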
1.2.1 Numeric features
Train_data['price']
It turns out the values are all integers.
Statistical distribution ↓
Train_data['price'].value_counts()
## 1) Overall distribution (unbounded Johnson distribution, etc.)
import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
Conclusion: price does not follow a normal distribution, so it must be transformed before regression; the unbounded Johnson distribution fits it best.
1.2.1.1 Correlation analysis
1.2.1.2 *Skewness and kurtosis
Skewness describes the direction and degree of asymmetry in the distribution of data; it is a numerical characterization of how asymmetric the distribution is. By definition, skewness is the third standardized moment of the sample.
Kurtosis, also called the kurtosis coefficient, characterizes the height of the peak of a probability density curve at the mean. Intuitively, kurtosis reflects how sharp the peak is.
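To make the definitions concrete, a small sketch (my own addition) computes the standardized moments directly and compares them with pandas' methods; note that pandas applies a sample bias correction and reports excess kurtosis (a normal distribution gives 0):

x = Train_data['price']
manual_skew = (((x - x.mean()) / x.std(ddof=0)) ** 3).mean()      # third standardized moment
manual_kurt = (((x - x.mean()) / x.std(ddof=0)) ** 4).mean() - 3  # fourth moment minus 3 = excess kurtosis
print(manual_skew, x.skew())   # close, up to pandas' bias correction
print(manual_kurt, x.kurt())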
## 2) View skewness and kurtosis
sns.distplot(Train_data['price']);
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())
Batch calculation of skew
Train_data.skew()
See the distribution of skew
Batch calculation of kurt
Train_data.kurt()
See the distribution of kurt
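The two distribution plots referred to above are not reproduced in this note; a sketch following the same distplot pattern used elsewhere here:

sns.distplot(Train_data.skew(), color='blue', axlabel='Skewness')
plt.show()
sns.distplot(Train_data.kurt(), color='orange', axlabel='Kurtosis')
plt.show()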
View the distribution of the target variable
## 3) View the specific frequencies of the predicted value
plt.hist(Train_data['price'], orientation='vertical', histtype='bar', color='red')
plt.show()
Conclusion: very few values exceed 20,000; in practice these can be treated as special values (outliers) and either filled or removed.
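A sketch of what that handling could look like (my own illustration; the note itself does not settle on a method):

n_high = (Train_data['price'] > 20000).sum()
print('rows with price > 20000:', n_high)

# Option 1: drop them
# Train_data = Train_data[Train_data['price'] <= 20000]
# Option 2: cap them at a high quantile instead of deleting
# cap = Train_data['price'].quantile(0.999)
# Train_data['price'] = Train_data['price'].clip(upper=cap)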
Since log(0) == -inf, plotting the plain log fails with an error, so instead I plot the distribution using log(1+x). This differs from the tutorial, which plots it with a plain log transform.
# The distribution is much more even after the log transform; log-transforming the target
# before prediction is a common trick for regression problems.
plt.hist(np.log(1 + Train_data['price']), orientation='vertical', histtype='bar', color='red')
plt.show()
Separate out the label, i.e. the predicted value
Y_train = Train_data['price']
# Automatic separation by dtype only applies to data whose categorical labels are not already numerically encoded
# It doesn't apply here, so the features must be split manually according to their actual meaning
# Numeric features
numeric_features = Train_data.select_dtypes(include=[np.number])
numeric_features.columns
# Categorical features
categorical_features = Train_data.select_dtypes(include=['object'])
categorical_features.columns
numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7',
                    'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']

categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode']
# nunique distribution for each categorical feature
for cat_fea in categorical_features:
    print(cat_fea + " feature distribution:")
    print("{} has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())
The distribution of each feature is shown one by one below.
The same check on the test data gives similar output; a mirrored sketch follows.
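For reference (not shown in the original note), the loop for the test set simply repeats the pattern above on Test_data:

for cat_fea in categorical_features:
    print(cat_fea + " feature distribution (test set):")
    print("{} has {} distinct values".format(cat_fea, Test_data[cat_fea].nunique()))
    print(Test_data[cat_fea].value_counts())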
numeric_features.append('price')
numeric_features
price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()
correlation
Only part of the correlation matrix is shown here.
View the correlations with price (from strongest to weakest)
print(correlation['price'].sort_values(ascending = False),'\n')
Visualize the correlation matrix
f, ax = plt.subplots(figsize=(7, 7))
plt.title('Correlation of Numeric Features with Price', y=1, size=16)
sns.heatmap(correlation, square=True, vmax=0.8)
price has now fulfilled its historical mission here, so delete it:
del price_numeric['price']
## 2) Look at the skewness and kurtosis of each numeric feature
for col in numeric_features:
    print('{:15}'.format(col),
          'Skewness: {:05.2f}'.format(Train_data[col].skew()),
          '   ',
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt())
         )
1.2.1.3 *Visualizing the distribution of each numeric feature (easy to overlook)
## 3) Visualize the distribution of each numeric feature
f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
Only part of the output is shown:
Conclusion: anonymous features (v_*) are relatively evenly distributed
1.2.1.4 *Visualizing the pairwise relationships between numeric features (easy to overlook)
## 4) Visualize the pairwise relationships between numeric features
sns.set()
columns = ['price', 'v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
sns.pairplot(Train_data[columns], size=2, kind='scatter', diag_kind='kde')
plt.show()
1.2.1.5 *Visualizing the regression relationship between price and several variables (easy to overlook)
## 5) Visualize the regression relationship between price and several variables
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8), (ax9, ax10)) = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
# ['v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
v_12_scatter_plot = pd.concat([Y_train, Train_data['v_12']], axis=1)
sns.regplot(x='v_12', y='price', data=v_12_scatter_plot, scatter=True, fit_reg=True, ax=ax1)

v_8_scatter_plot = pd.concat([Y_train, Train_data['v_8']], axis=1)
sns.regplot(x='v_8', y='price', data=v_8_scatter_plot, scatter=True, fit_reg=True, ax=ax2)

v_0_scatter_plot = pd.concat([Y_train, Train_data['v_0']], axis=1)
sns.regplot(x='v_0', y='price', data=v_0_scatter_plot, scatter=True, fit_reg=True, ax=ax3)

power_scatter_plot = pd.concat([Y_train, Train_data['power']], axis=1)
sns.regplot(x='power', y='price', data=power_scatter_plot, scatter=True, fit_reg=True, ax=ax4)

v_5_scatter_plot = pd.concat([Y_train, Train_data['v_5']], axis=1)
sns.regplot(x='v_5', y='price', data=v_5_scatter_plot, scatter=True, fit_reg=True, ax=ax5)

v_2_scatter_plot = pd.concat([Y_train, Train_data['v_2']], axis=1)
sns.regplot(x='v_2', y='price', data=v_2_scatter_plot, scatter=True, fit_reg=True, ax=ax6)

v_6_scatter_plot = pd.concat([Y_train, Train_data['v_6']], axis=1)
sns.regplot(x='v_6', y='price', data=v_6_scatter_plot, scatter=True, fit_reg=True, ax=ax7)

v_1_scatter_plot = pd.concat([Y_train, Train_data['v_1']], axis=1)
sns.regplot(x='v_1', y='price', data=v_1_scatter_plot, scatter=True, fit_reg=True, ax=ax8)

v_14_scatter_plot = pd.concat([Y_train, Train_data['v_14']], axis=1)
sns.regplot(x='v_14', y='price', data=v_14_scatter_plot, scatter=True, fit_reg=True, ax=ax9)

v_13_scatter_plot = pd.concat([Y_train, Train_data['v_13']], axis=1)
sns.regplot(x='v_13', y='price', data=v_13_scatter_plot, scatter=True, fit_reg=True, ax=ax10)
1.2.2 Categorical features (plotted here, but the results are not used further)
View the unique-value distribution of the categorical features with .nunique() and .value_counts()
## 1) unique distribution
for fea in categorical_features:
    print(Train_data[fea].nunique())

categorical_features
1.2.2.1 Box plot visualization
## 2) Box plot visualization of the categorical features
# Because the categories of name and regionCode are too sparse, only the non-sparse features are plotted here.
categorical_features = ['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(boxplot, "value", "price")
Train_data.columns
1.2.2.2 Violin plot visualization
## 3) Violin plot visualization of the categorical features
catg_list = categorical_features
target = 'price'
for catg in catg_list:
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()
categorical_features = ['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
1.2.2.3 Bar plot visualization of the categorical features
## 4) Bar plot visualization of the categorical features
def bar_plot(x, y, **kwargs):
    sns.barplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(bar_plot, "value", "price")
1.2.2.4 Per-category frequency visualization of the categorical features (count_plot)
## 5) Per-category frequency visualization of the categorical features (count_plot)
def count_plot(x, **kwargs):
    sns.countplot(x=x)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(count_plot, "value")
2. Generate a data report with pandas_profiling (new skill)
import pandas_profiling
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./")
3. Summary
This note only covers a small sample of what EDA can do, but there are still some thoughts worth keeping:
a. Identify the features that need further processing by examining the missing values (NaN), then decide how to handle them (a small sketch of the fill options follows this list):
Filling (choose the fill method: mean fill, zero fill, mode fill, etc.);
Dropping the feature;
Or first splitting the samples into groups and predicting each group with a model that uses different features.
b. Detect anomalies through the distributions:
Analyze whether the label of an anomalous feature value is itself anomalous (the value may deviate far from the mean or contain special symbols);
Decide whether outliers should be removed or replaced with normal values, etc.
c. Analyze the distribution of the target by plotting the label.
d. Plot the features on their own and jointly with the label (statistical plots, scatter plots) to get an intuitive feel for the feature distributions; this step can also reveal outliers in the data. Use box plots to analyze deviating values of individual features, and use joint plots of features with each other and with the label to analyze their correlations.
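As a closing illustration of point a (a hedged sketch of my own, not code from the exercise), the common fill strategies look like this in pandas:

import pandas as pd

toy = pd.DataFrame({'power': [75, None, 150, None, 60]})   # toy data, not the competition set
print(toy['power'].fillna(toy['power'].mean()))            # mean fill
print(toy['power'].fillna(0))                               # zero fill
print(toy['power'].fillna(toy['power'].mode()[0]))          # mode fill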
That concludes this article on the Datawhale exercise. For more related Python prediction content, please search my previous articles or continue browsing the related articles below. I hope you will keep supporting me!