I. Requirements
II. Principles
A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label (the decision made after evaluating the attributes along the path). The paths from the root to the leaves represent the classification rules.
The purpose of decision tree learning is to produce a decision tree that generalizes well, i.e., handles unseen examples well, so how the tree is constructed is the key to the quality of subsequent predictions. Constructing a decision tree requires determining the order in which attributes are tested, and this order determines the performance of the resulting tree. The samples falling into each branch node should belong to the same class as far as possible, i.e., the "purity" of the nodes should become higher and higher; only then can good decisions be made.
The classic approaches to attribute partitioning are:
- information gain
- gain ratio
- Gini index
Information gain is used in this experiment, so only information gain is described below.
III. Calculation of information gain
Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)

where D is the set of samples, a is an attribute of the sample set, and D^v denotes the subset of D in which attribute a takes the value v.
The Ent(·) function computes the information entropy, which measures the purity of a sample set. The information entropy is calculated as follows:

Ent(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k

where p_k denotes the proportion of samples belonging to the k-th outcome class. For example, if there are 10 samples, of which 5 are good melons and 5 are bad melons, then p_1 = 5/10 and p_2 = 5/10.
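For instance, plugging the 10-sample example above into the formula gives

Ent(D) = -\left(\frac{5}{10}\log_2\frac{5}{10} + \frac{5}{10}\log_2\frac{5}{10}\right) = 1,

the least pure case, whereas a set containing only good melons would give Ent(D) = -(1 \cdot \log_2 1) = 0, the purest case.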
Generally speaking, the higher the information gain, the greater the "purity improvement" obtained by splitting on attribute a; therefore, attributes with high information gain are preferred when choosing the attribute for a node.
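As a minimal sketch of the two formulas above using pandas (separate from the full program in section V; the column name 'label' is an assumption made only for this illustration), entropy and information gain can be computed like this:

import math
import pandas as pd

def entropy(labels: pd.Series) -> float:
    # Ent(D) = -sum(p_k * log2(p_k)) over the class proportions
    probs = labels.value_counts(normalize=True)
    return -sum(p * math.log(p, 2) for p in probs)

def info_gain(df: pd.DataFrame, attr: str, label: str = 'label') -> float:
    # Gain(D, a) = Ent(D) - sum(|D^v|/|D| * Ent(D^v)) over the values v of attribute a
    # 'label' is a hypothetical column name for the outcome column
    total = entropy(df[label])
    weighted = sum(len(sub) / len(df) * entropy(sub[label])
                   for _, sub in df.groupby(attr))
    return total - weighted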
IV. Realization process
This design uses the pandas and numpy libraries, which are mainly used for fast processing and manipulation of the data.
First read the data in:
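A minimal read-in looks like the following; the actual file path is not shown in the original post, so 'watermelon.csv' is assumed here:

import pandas as pd

name = ['Color', 'Roots', 'Knocking', 'Texture', 'Umbilical', 'Tactile', 'Good melon']
df = pd.read_csv('watermelon.csv', names=name)   # assumed filename; the data file is listed in section V
print(df.head())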
You can see that the columns of the dataset are the different attributes of the melon, the cells of the table are the values taken under each attribute, and the last column holds the label.
def TreeGen(D, A: list):
    if len(set(D['Good melon'])) == 1:          # all samples share one label: return it as a leaf
        return D['Good melon'].iloc[0]
    elif len(A) == 0 or Check(D, A[:-1]):       # no attribute left, or identical on every attribute:
        cnt = D.groupby('Good melon').size()    # return the most frequent label in D as a leaf
        maxValue = cnt[cnt == cnt.max()].index[0]
        return maxValue
    else:
        A1 = copy.deepcopy(A)
        attr = Choose(D, A1[:-1])               # attribute with the largest information gain
        tree = {attr: {}}
        for value in set(D[attr]):
            tree[attr][value] = TreeGen(D[D[attr] == value], A1)
        return tree
The TreeGen function is the main function for generating the tree; through recursive calls it returns the next level of the tree structure (a dictionary) until the decision tree is fully generated.
In the tree-generation process there are two conditions that terminate the recursion: the first is when all samples of the input data source D have the same label, in which case that label is returned as a leaf node; the second is when there are no attributes left to split on, or the samples in D take the same value under every attribute in A, in which case the most frequent label among the samples of D is returned as a leaf node.
Among them, the Choose(D, A: list) function selects the splitting attribute: it computes the information gain of each attribute in the remaining attribute list for the given data source and returns the attribute that maximizes the information gain.
def Choose(D, A: list):
    result = 0.0
    resultAttr = ''
    for attr in A:
        tmpVal = CalcZengYi(D, attr)    # information gain of this attribute
        if tmpVal > result:
            resultAttr = attr
            result = tmpVal
    print(resultAttr)                   # trace the chosen attribute
    return resultAttr
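For example, called on the full dataset with the six attribute columns, Choose should return 'Texture', which indeed appears as the root of the tree printed below:

root_attr = Choose(df, name[:-1])   # expected: 'Texture' (largest information gain on the full dataset)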
And finally, the results:
{'Texture': {'slightly fuzzy': {'Tactile': {'hard and smooth': 'no', 'soft and sticky': 'yes'}}, 'clear': {'Roots': {'stiff': 'no', 'curled': 'yes', 'slightly curled': {'Color': {'green': 'yes', 'black': {'Tactile': {'hard and smooth': 'yes', 'soft and sticky': 'no'}}}}}}, 'fuzzy': 'no'}}
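To read a prediction off this nested dictionary, follow one key per level; a small sketch for a single sample (the attribute values here are chosen only for illustration):

sample = {'Texture': 'clear', 'Roots': 'slightly curled', 'Color': 'black', 'Tactile': 'hard and smooth'}
node = resultTree
while isinstance(node, dict):
    attr = list(node)[0]                 # attribute tested at this node
    node = node[attr][sample[attr]]      # follow the branch matching the sample's value
print(node)                              # 'yes' for this sample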
The drawing is as follows:
V. Program code
main program
#!/usr/bin/python3
# -*- encoding: utf-8 -*-
'''
@Description :Decision Tree.
@Date        :2021/04/25 15:57:14
@Author      :willpower
@version     :1.0
'''
import pandas as pd
import numpy as np
import treeplot
import copy
import math

"""
@description :Calculate the information entropy.
@param       :A pandas DataFrame whose last column holds the actual result.
@Returns     :The entropy value, as a float.
"""
def CalcShang(D):
    setCnt = D.shape[0]                        # total number of samples
    result = 0.0
    # for i in D.groupby(D.columns[-1]).size().index:  # iterate over each value
    for i in set(D[D.columns[-1]]):            # iterate over each value of the result column
        cnt = D.iloc[:, -1].value_counts()[i]  # number of samples taking this value
        result = result + (cnt/setCnt)*math.log(cnt/setCnt, 2)
    return -1*result

"""
@description :Calculate the information gain.
@param       :The DataFrame data source, followed by the attribute for which the gain is computed.
@Returns     :The gain value, as a float.
"""
def CalcZengYi(D, attr: str):
    sumShang = CalcShang(D)
    setCnt = D.shape[0]
    result = 0.0
    valus = D.groupby(attr).size()             # number of samples for each value of the attribute
    for subVal in valus.index:
        result = result + (valus[subVal]/setCnt)*CalcShang(D[D[attr] == subVal])
    return sumShang - result

"""
@description :Select the best attribute.
@param       :The data source and the list of attributes that are still remaining.
@Returns     :The attribute with the largest information gain.
"""
def Choose(D, A: list):
    result = 0.0
    resultAttr = ''
    for attr in A:
        tmpVal = CalcZengYi(D, attr)
        if tmpVal > result:
            resultAttr = attr
            result = tmpVal
    print(resultAttr)
    return resultAttr

"""
@description :Check whether the data has the same value under each attribute.
@param       :The DataFrame and the list of remaining attributes.
@Returns     :A bool value, 1 for same, 0 for different.
"""
def Check(D, A: list):
    for i in A:
        if len(set(D[i])) != 1:
            return 0
    return 1

"""
@description :Main tree-generation function.
@param       :The data source DataFrame and the list of all attribute names.
@Returns     :The generated dictionary tree.
"""
def TreeGen(D, A: list):
    if len(set(D['Good melon'])) == 1:          # all samples share one label: return it as a leaf
        return D['Good melon'].iloc[0]
    elif len(A) == 0 or Check(D, A[:-1]):       # no attribute left, or identical on every attribute:
        cnt = D.groupby('Good melon').size()    # return the most frequent label in D as a leaf
        maxValue = cnt[cnt == cnt.max()].index[0]
        return maxValue
    else:
        A1 = copy.deepcopy(A)
        attr = Choose(D, A1[:-1])               # attribute with the largest information gain
        tree = {attr: {}}
        for value in set(D[attr]):
            tree[attr][value] = TreeGen(D[D[attr] == value], A1)
        return tree

"""
@description :Validation.
@param       :The data to validate (the last column is the true result) and the decision-tree model.
@Returns     :None.
"""
def Test(D, model: dict):
    for i in range(D.shape[0]):
        data = D.iloc[i]
        subModel = model
        while 1:
            attr = list(subModel)[0]
            subModel = subModel[attr][data[attr]]
            if type(subModel).__name__ != 'dict':
                print(subModel, end='')
                break
    print('')

name = ['Color', 'Roots', 'Knocking', 'Texture', 'Umbilical', 'Tactile', 'Good melon']
df = pd.read_csv('./', names=name)   # the data file path is truncated in the original post
# CalcZengYi(df, 'Color')
resultTree = TreeGen(df, name)
print(resultTree)
# print(df[name[:-1]])
Test(df[name[:-1]], resultTree)
treeplot.plot_model(resultTree, "")
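The data-file path in the listing above is truncated ('./'); assuming the 17 rows listed below are saved next to the script as, say, watermelon.csv, the tree-building part at the bottom would become:

df = pd.read_csv('watermelon.csv', names=name)   # hypothetical filename for the data listed below
resultTree = TreeGen(df, name)
print(resultTree)
Test(df[name[:-1]], resultTree)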
plotting program
from graphviz import Digraph

def plot_model(tree, name):
    g = Digraph("G", filename=name, format='png', strict=False)
    first_label = list(tree.keys())[0]
    g.node("0", first_label)              # root node labelled with the first attribute
    _sub_plot(g, tree, "0")
    g.view()

root = "0"                                # global counter used to number the nodes

def _sub_plot(g, tree, inc):
    global root
    first_label = list(tree.keys())[0]
    ts = tree[first_label]
    for i in ts.keys():
        if isinstance(tree[first_label][i], dict):
            root = str(int(root) + 1)
            g.node(root, list(tree[first_label][i].keys())[0])   # internal node: next attribute
            g.edge(inc, root, str(i))                            # edge labelled with the attribute value
            _sub_plot(g, tree[first_label][i], root)
        else:
            root = str(int(root) + 1)
            g.node(root, tree[first_label][i])                   # leaf node: the class label
            g.edge(inc, root, str(i))
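Assuming the Graphviz binaries are installed (see section VI), the module is called as in the main program; with an assumed output name the call would be:

import treeplot
treeplot.plot_model(resultTree, 'decision_tree')   # renders decision_tree.png and opens it via view()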
data file ./
green,curled,muffled,clear,sunken,hard and smooth,yes
black,curled,dull,clear,sunken,hard and smooth,yes
black,curled,muffled,clear,sunken,hard and smooth,yes
green,curled,dull,clear,sunken,hard and smooth,yes
pale white,curled,muffled,clear,sunken,hard and smooth,yes
green,slightly curled,muffled,clear,slightly sunken,soft and sticky,yes
black,slightly curled,muffled,slightly fuzzy,slightly sunken,soft and sticky,yes
black,slightly curled,muffled,clear,slightly sunken,hard and smooth,yes
black,slightly curled,dull,slightly fuzzy,slightly sunken,hard and smooth,no
green,stiff,crisp,clear,flat,soft and sticky,no
pale white,stiff,crisp,fuzzy,flat,hard and smooth,no
pale white,curled,muffled,fuzzy,flat,soft and sticky,no
green,slightly curled,muffled,slightly fuzzy,sunken,hard and smooth,no
pale white,slightly curled,dull,slightly fuzzy,sunken,hard and smooth,no
black,slightly curled,muffled,clear,slightly sunken,soft and sticky,no
pale white,curled,muffled,fuzzy,flat,hard and smooth,no
green,curled,dull,slightly fuzzy,slightly sunken,hard and smooth,no
VI. Problems encountered
graphviz: Not a directory: 'dot'
Solution: the graphviz Python package only builds the DOT source; rendering the PNG requires the Graphviz binaries (the dot executable) to be installed and available on the PATH.
This concludes this article on the Python machine-learning decision tree. For more related Python decision-tree content, please search my previous posts or continue to browse the related articles below, and I hope you will support me more in the future!