I. Introduction
LightGBM is a scalable machine learning system: a distributed gradient boosting framework based on the GBDT (Gradient Boosting Decision Tree) algorithm. Its design focuses on reducing memory usage and computation over the data, as well as reducing communication costs when training in parallel across multiple machines.
1 Benefits of LightGBM
- Easy to use. It provides mainstream Python, C++, and R interfaces, so users can easily build LightGBM models and obtain quite good results.
- Efficient and scalable. It is fast and highly accurate on large-scale datasets, with low demands on hardware resources such as memory.
- Robust. Compared with deep learning models, it achieves comparable results without fine-grained parameter tuning.
- LightGBM directly supports missing values and categorical features, without additional special processing of the data (see the sketch after this list).
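For example, a pandas column with `category` dtype and unimputed NaNs can be passed straight to the model. A minimal sketch on made-up toy data (not this article's dataset), assuming the scikit-learn wrapper's default `categorical_feature='auto'` behavior:

```python
import pandas as pd
from lightgbm import LGBMClassifier

# Hypothetical toy frame: 'lane' is categorical, 'gold' contains a missing value
toy = pd.DataFrame({
    'lane': pd.Categorical(['top', 'mid', 'bot', 'mid']),
    'gold': [1200.0, None, 950.0, 1400.0],
})
label = [0, 1, 0, 1]

# The categorical dtype is handled natively and the NaN needs no imputation
# (min_child_samples is lowered only so this tiny toy example can split)
clf = LGBMClassifier(n_estimators=10, min_child_samples=1).fit(toy, label)
```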
2 Disadvantages of LightGBM
- Unlike deep learning models, it cannot model spatio-temporal positions, so it does not capture high-dimensional data such as images, speech, and text well.
- When there is a huge amount of training data and a suitable deep learning model can be found, the accuracy of deep learning can be far ahead of LightGBM.
II. Implementation process
1 Introduction to the data set
League of Legends dataset (extract code: 1234)
This dataset is used for the hands-on LightGBM classification. It contains a total of 9,881 ranked League of Legends matches from the Korean server at Diamond tier and above. The data captures the game state at the ten-minute mark, including kill counts, gold, experience, levels, and so on.
2 Coding
```python
# Import basic libraries
import numpy as np
import pandas as pd
## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

#%% Data reading: use Pandas' read_csv function to read the file into a DataFrame
# (a raw string avoids backslash escapes in the Windows path)
df = pd.read_csv(r'D:\Python\ML\data\high_diamond_ranked_10min.csv')
y = df.blueWins  # label column
#%% View sample data
#print(y.value_counts())
# Drop the id and label columns from the feature matrix
drop_cols = ['gameId', 'blueWins']
x = df.drop(drop_cols, axis=1)
# Statistical summary of the numeric features
x_des = x.describe()
```
```python
#%% Remove redundant data: red and blue are adversaries, so only one side's information
# is needed (the other side's is its mirror image); the red-side duplicates are dropped
drop_cols = ['redFirstBlood', 'redKills', 'redDeaths', 'redGoldDiff', 'redExperienceDiff',
             'blueCSPerMin', 'blueGoldPerMin', 'redCSPerMin', 'redGoldPerMin']
x.drop(drop_cols, axis=1, inplace=True)

#%% Visual description. For a readable presentation, the first nine features and the
# next nine are shown in two violin plots; later plots follow the same pattern
data = x
data_std = (data - data.mean()) / data.std()
data = pd.concat([y, data_std.iloc[:, 0:9]], axis=1)  # splice the labels with the first nine columns; shape is (9879, 10)
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')  # melt to shape (88911, 3)
fig, ax = plt.subplots(1, 2, figsize=(15, 8))
# Draw the violin plot
sns.violinplot(x='Features', y='Values', hue='blueWins', data=data,
               split=True, inner='quart', ax=ax[0], palette='Blues')
fig.autofmt_xdate(rotation=45)  # tilt the x-axis labels 45 degrees so they do not pile up
data = x
data_std = (data - data.mean()) / data.std()
data = pd.concat([y, data_std.iloc[:, 9:18]], axis=1)
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')
sns.violinplot(x='Features', y='Values', hue='blueWins', data=data,
               split=True, inner='quart', ax=ax[1], palette='Blues')
fig.autofmt_xdate(rotation=45)
plt.show()
```
```python
#%% Draw a heat map of the correlation between features
fig, ax = plt.subplots(figsize=(15, 18))
sns.heatmap(round(x.corr(), 2), cmap='Blues', annot=True)
fig.autofmt_xdate(rotation=45)
plt.show()
```
```python
#%% Based on the heat map above, highly correlated redundant features are removed
# (redAvgLevel, blueAvgLevel)
drop_cols = ['redAvgLevel', 'blueAvgLevel']
x.drop(drop_cols, axis=1, inplace=True)
sns.set(style='whitegrid', palette='muted')
# Construct two new features
x['wardsPlacedDiff'] = x['blueWardsPlaced'] - x['redWardsPlaced']
x['wardsDestroyedDiff'] = x['blueWardsDestroyed'] - x['redWardsDestroyed']
data = x[['blueWardsPlaced', 'blueWardsDestroyed', 'wardsPlacedDiff', 'wardsDestroyedDiff']].sample(1000)
data_std = (data - data.mean()) / data.std()
data = pd.concat([y, data_std], axis=1)
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')
plt.figure(figsize=(15, 8))
sns.swarmplot(x='Features', y='Values', hue='blueWins', data=data)
plt.show()
```
```python
#%% The swarm plot above shows no significant pattern between ward counts and the game
# result: whether wards are placed in the first ten minutes has little impact on the
# final win or loss, so the ward-related features are removed
drop_cols = ['blueWardsPlaced', 'blueWardsDestroyed', 'wardsPlacedDiff',
             'wardsDestroyedDiff', 'redWardsPlaced', 'redWardsDestroyed']
x.drop(drop_cols, axis=1, inplace=True)

#%% The distributions of kills, deaths and assists differ little between wins and losses,
# but the differenced versions (kills minus deaths, blue assists minus red assists)
# separate the classes more clearly, so two new features are constructed
x['killsDiff'] = x['blueKills'] - x['blueDeaths']
x['assistsDiff'] = x['blueAssists'] - x['redAssists']
x[['blueKills', 'blueDeaths', 'blueAssists', 'killsDiff', 'assistsDiff', 'redAssists']].hist(figsize=(15, 8), bins=20)
plt.show()
```
```python
#%% Swarm plot of the kill-related features on a 1000-row sample
data = x[['blueKills', 'blueDeaths', 'blueAssists', 'killsDiff', 'assistsDiff', 'redAssists']].sample(1000)
data_std = (data - data.mean()) / data.std()
data = pd.concat([y, data_std], axis=1)
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')
plt.figure(figsize=(10, 6))
sns.swarmplot(x='Features', y='Values', hue='blueWins', data=data)
plt.xticks(rotation=45)
plt.show()
```
```python
#%% Pair plot of the same features on a 500-row sample
data = pd.concat([y, x], axis=1).sample(500)
sns.pairplot(data, vars=['blueKills', 'blueDeaths', 'blueAssists', 'killsDiff', 'assistsDiff', 'redAssists'],
             hue='blueWins')
plt.show()
```
```python
#%% Some features separate the data better when combined pairwise
x['dragonsDiff'] = x['blueDragons'] - x['redDragons']            # dragons taken
x['heraldsDiff'] = x['blueHeralds'] - x['redHeralds']            # Rift Heralds taken
x['eliteDiff'] = x['blueEliteMonsters'] - x['redEliteMonsters']  # elite monsters killed
data = pd.concat([y, x], axis=1)
eliteGroup = data.groupby(['eliteDiff'])['blueWins'].mean()
dragonGroup = data.groupby(['dragonsDiff'])['blueWins'].mean()
heraldGroup = data.groupby(['heraldsDiff'])['blueWins'].mean()
fig, ax = plt.subplots(1, 3, figsize=(15, 4))
eliteGroup.plot(kind='bar', ax=ax[0])
dragonGroup.plot(kind='bar', ax=ax[1])
heraldGroup.plot(kind='bar', ax=ax[2])
print(eliteGroup)
print(dragonGroup)
print(heraldGroup)
plt.show()
```
```python
#%% Number of towers destroyed versus winning or losing
x['towerDiff'] = x['blueTowersDestroyed'] - x['redTowersDestroyed']
data = pd.concat([y, x], axis=1)
towerGroup = data.groupby(['towerDiff'])['blueWins']
print(towerGroup.count())
print(towerGroup.mean())
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
towerGroup.mean().plot(kind='line', ax=ax[0])
ax[0].set_title('Proportion of Blue Wins')
ax[0].set_ylabel('Proportion')
towerGroup.count().plot(kind='line', ax=ax[1])
ax[1].set_title('Count of Towers Destroyed')
ax[1].set_ylabel('Count')
plt.show()
```
```python
#%% Training and prediction with LightGBM
## To evaluate model performance properly, split the data into a training set and a
## test set: train on the former, validate on the latter
from sklearn.model_selection import train_test_split

data_target_part = y
data_features_part = x
## Test set size of 20%, i.e. an 80%/20% split
x_train, x_test, y_train, y_test = train_test_split(data_features_part, data_target_part,
                                                    test_size=0.2, random_state=2020)

#%% Import the LightGBM model
from lightgbm.sklearn import LGBMClassifier
## Define the LightGBM model
clf = LGBMClassifier()
# Train the LightGBM model on the training set
clf.fit(x_train, y_train)

#%% Predict on the training and test sets with the trained model
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)
from sklearn import metrics

## Evaluate the model with accuracy (correctly predicted samples as a share of all predictions)
print('The accuracy of the LightGBM is:', metrics.accuracy_score(y_train, train_predict))
print('The accuracy of the LightGBM is:', metrics.accuracy_score(y_test, test_predict))
## View the confusion matrix (counts of each combination of predicted and true labels)
confusion_matrix_result = metrics.confusion_matrix(test_predict, y_test)
print('The confusion matrix result:\n', confusion_matrix_result)
# Visualize the result with a heat map
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
```
```python
#%% Use LightGBM for feature selection: the feature_importances_ attribute shows feature importance
sns.barplot(y=data_features_part.columns, x=clf.feature_importances_)
```
```python
#%% Besides feature_importances_, LightGBM offers other importance measures (gain, split)
from sklearn.metrics import accuracy_score
from lightgbm import plot_importance

def estimate(model, data):
    ax1 = plot_importance(model, importance_type="gain")
    ax1.set_title('gain')
    ax2 = plot_importance(model, importance_type="split")
    ax2.set_title('split')
    plt.show()

def classes(data, label, test):
    model = LGBMClassifier()
    model.fit(data, label)
    ans = model.predict(test)
    estimate(model, data)
    return ans

ans = classes(x_train, y_train, x_test)
print('acc=', accuracy_score(y_test, ans))
```
Adjusting Parameters for Better Results: Important Parameters in LightGBM
- learning_rate: sometimes called eta in other libraries; LightGBM's default is 0.1. The step size of each iteration is very important: too large and accuracy suffers, too small and training is slow.
- num_leaves: the default is 31. This parameter controls the maximum number of leaf nodes in each tree.
- feature_fraction: the default is 1. We usually set it to about 0.8. It controls the proportion of columns (features) sampled randomly for each tree.
- max_depth: the default is -1, i.e. no depth limit; values between 3 and 10 are common. This is the maximum depth of the tree and is used to control overfitting: the larger max_depth is, the more specifically the model learns. (A quick check of these defaults follows below.)
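Since these defaults differ between libraries and versions (0.3 and 6 are XGBoost's eta and max_depth defaults, for instance), a quick way to confirm them on your installed LightGBM is to read them off an untrained model; the commented values below are what current releases report:

```python
from lightgbm import LGBMClassifier

params = LGBMClassifier().get_params()
print(params['learning_rate'])     # 0.1
print(params['num_leaves'])        # 31
print(params['max_depth'])         # -1, i.e. no depth limit
# feature_fraction is the native-API name; the sklearn wrapper calls it colsample_bytree
print(params['colsample_bytree'])  # 1.0
```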
```python
#%% Adjust the parameters to get better results
## Import the grid search utility from the sklearn library
from sklearn.model_selection import GridSearchCV

## Define the parameter ranges
learning_rate = [0.1, 0.3, 0.6]
feature_fraction = [0.5, 0.8, 1]
num_leaves = [16, 32, 64]
max_depth = [-1, 3, 5, 8]
parameters = {'learning_rate': learning_rate,
              'feature_fraction': feature_fraction,
              'num_leaves': num_leaves,
              'max_depth': max_depth}
model = LGBMClassifier(n_estimators=50)
## Perform the grid search
clf = GridSearchCV(model, parameters, cv=3, scoring='accuracy', verbose=3, n_jobs=-1)
clf = clf.fit(x_train, y_train)
#%% See which parameter values are best
print(clf.best_params_)
```
```python
#%% Predict on the training and test sets using the best model parameters
## Define the LightGBM model with the tuned parameters
clf = LGBMClassifier(feature_fraction=1, learning_rate=0.1, max_depth=3, num_leaves=16)
# Train the LightGBM model on the training set
clf.fit(x_train, y_train)
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)
## Evaluate the model with accuracy (correctly predicted samples as a share of all predictions)
print('The accuracy of the LightGBM is:', metrics.accuracy_score(y_train, train_predict))
print('The accuracy of the LightGBM is:', metrics.accuracy_score(y_test, test_predict))
## View the confusion matrix (counts of each combination of predicted and true labels)
confusion_matrix_result = metrics.confusion_matrix(test_predict, y_test)
print('The confusion matrix result:\n', confusion_matrix_result)
# Visualize the result with a heat map
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
```
III. Key points
Important parameters of LightGBM
Basic parameter adjustment
- num_leaves: this is the main parameter controlling the complexity of the tree model. Generally, we keep num_leaves smaller than 2^max_depth to prevent overfitting (see the sketch after this list). Since LightGBM grows trees leaf-wise, unlike XGBoost's depth-wise (level-wise) growth, num_leaves plays a bigger role than depth.
- min_data_in_leaf: a very important parameter for dealing with overfitting. Its value depends on the number of training samples and on num_leaves. Setting it larger keeps trees from growing too deep, but may lead to underfitting. In practice, a value of a few hundred or a few thousand is sufficient for large datasets.
- max_depth: the depth of the tree. The concept of depth is not very meaningful in leaf-wise trees, because there is no reasonable mapping from leaves to depth, but it can still be used to cap tree growth.
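A minimal sketch of the num_leaves/max_depth relationship mentioned above, for the case where depth is constrained explicitly (the values are illustrative):

```python
from lightgbm import LGBMClassifier

# A depth-d binary tree has at most 2**d leaves, so keep num_leaves strictly below that
max_depth = 7
num_leaves = 2 ** max_depth - 1  # 127 < 128 = 2**7

clf = LGBMClassifier(max_depth=max_depth, num_leaves=num_leaves)
```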
Parameter tuning for training speed
- Use the bagging method by setting the bagging_fraction and bagging_freq parameters.
- Use subsampling of features by setting the feature_fraction parameter.
- Choose a smaller max_bin parameter. Use save_binary to speed up data loading in later runs (see the sketch after this list).
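A hedged sketch of these speed-oriented settings with LightGBM's native API, reusing the x_train/y_train split from above; the parameter values are illustrative, not tuned:

```python
import lightgbm as lgb

# max_bin is a Dataset-level parameter; fewer bins build coarser, faster histograms
train_data = lgb.Dataset(x_train, label=y_train, params={'max_bin': 63})
train_data.save_binary('train.bin')  # later runs can reload with lgb.Dataset('train.bin')

params = {
    'objective': 'binary',
    'bagging_fraction': 0.8,  # row subsampling...
    'bagging_freq': 5,        # ...re-drawn every 5 iterations
    'feature_fraction': 0.8,  # column subsampling per tree
}
booster = lgb.train(params, train_data, num_boost_round=100)
```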
Parameter tuning for accuracy
- Use larger max_bin (learning may be slower)
- Use smaller learning_rate and larger num_iterations
- Use larger num_leaves (may lead to overfitting)
- Use more training data
- Try dart mode (a combined sketch follows this list)
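A sketch combining these accuracy-oriented settings under the same assumed x_train/y_train (illustrative values; dart in particular usually needs its own tuning):

```python
import lightgbm as lgb

# A larger max_bin goes on the Dataset: finer-grained histograms, slower training
train_data = lgb.Dataset(x_train, label=y_train, params={'max_bin': 511})

params = {
    'objective': 'binary',
    'boosting_type': 'dart',  # dart mode: dropout applied to trees
    'learning_rate': 0.05,    # smaller step size...
    'num_leaves': 63,         # larger trees (watch for overfitting)
}
booster = lgb.train(params, train_data, num_boost_round=500)  # ...with more iterations
```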
Parameter tuning for overfitting
- Use a smaller max_bin
- Use smaller num_leaves
- Use min_data_in_leaf and min_sum_hessian_in_leaf
- Use bagging by setting bagging_fraction and bagging_freq.
- Use feature subsampling by setting feature_fraction
- Use more training data
- Use lambda_l1, lambda_l2 and min_gain_to_split to apply regularization
- Try max_depth to avoid growing trees that are too deep (a combined sketch follows below)
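And a sketch of the overfitting-oriented knobs in one place, again with illustrative values and the same assumed x_train/y_train:

```python
import lightgbm as lgb

train_data = lgb.Dataset(x_train, label=y_train, params={'max_bin': 63})

params = {
    'objective': 'binary',
    'num_leaves': 15,                # smaller trees
    'min_data_in_leaf': 50,          # require enough samples per leaf
    'min_sum_hessian_in_leaf': 10.0,
    'bagging_fraction': 0.8,         # bagging...
    'bagging_freq': 5,
    'feature_fraction': 0.8,         # ...and feature subsampling
    'lambda_l1': 0.1,                # L1 regularization
    'lambda_l2': 0.1,                # L2 regularization
    'min_gain_to_split': 0.01,       # minimum gain required to split
    'max_depth': 6,                  # explicit depth cap
}
booster = lgb.train(params, train_data, num_boost_round=200)
```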
Lately I have come to appreciate good coding habits more and more! Debugging is invaluable: my teacher drilled it into me back when I was first learning C, I tasted its benefits, and to this day I still step through most of the code I write even when there is no obvious bug. I hope you develop the debugging habit too, and of course, write comments! Comments record what you were thinking at the time; without them, when you come back to your code after a long while you will have no idea what each step was for. 886~~~
This concludes this article on classification prediction with LightGBM for Python machine learning in practice. For more on Python LightGBM classification prediction, please search my previous articles, and I hope you will support me in the future!