
Custom evaluation metrics for binary classification tasks in keras: method and code

For binary classification tasks, the only built-in evaluation metric in keras is binary_accuracy, i.e. binary accuracy. Other metrics are often needed to evaluate model performance, such as precision, recall, and F1-score, so we use the custom evaluation function mechanism provided by keras to construct these metrics for the binary classification task.

A custom evaluation function in keras takes the following two tensors as input and returns a tensor as output:

y_true: a 1-D tensor consisting of the true labels of the dataset.

y_pred: a 1-D tensor consisting of the model's predicted outputs.

tf.round() rounds a tensor, so tf.round(y_pred) is the tensor of predicted labels.

1 - tf.round(y_pred) is the inverted predicted-label tensor.

1 - y_true is the inverted true-label tensor.

tf.reduce_sum() sums the tensor.

From these pieces, the four basic counts TP, TN, FP, and FN can be constructed by definition; from them the derived metrics precision, recall, and F1-score follow, and the resulting custom metric functions can then be referenced at the compile stage.

The code for the common custom evaluation metrics for binary classification tasks in keras, and how to reference them, is as follows:

import tensorflow as tf

# Precision metric
def metric_precision(y_true, y_pred):
    TP = tf.reduce_sum(y_true * tf.round(y_pred))
    TN = tf.reduce_sum((1 - y_true) * (1 - tf.round(y_pred)))
    FP = tf.reduce_sum((1 - y_true) * tf.round(y_pred))
    FN = tf.reduce_sum(y_true * (1 - tf.round(y_pred)))
    precision = TP / (TP + FP)
    return precision

# Recall metric
def metric_recall(y_true, y_pred):
    TP = tf.reduce_sum(y_true * tf.round(y_pred))
    TN = tf.reduce_sum((1 - y_true) * (1 - tf.round(y_pred)))
    FP = tf.reduce_sum((1 - y_true) * tf.round(y_pred))
    FN = tf.reduce_sum(y_true * (1 - tf.round(y_pred)))
    recall = TP / (TP + FN)
    return recall

# F1-score metric
def metric_F1score(y_true, y_pred):
    TP = tf.reduce_sum(y_true * tf.round(y_pred))
    TN = tf.reduce_sum((1 - y_true) * (1 - tf.round(y_pred)))
    FP = tf.reduce_sum((1 - y_true) * tf.round(y_pred))
    FN = tf.reduce_sum(y_true * (1 - tf.round(y_pred)))
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    F1score = 2 * precision * recall / (precision + recall)
    return F1score

# Example of referencing the custom evaluation metrics at the compile stage
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy',
                       metric_precision,
                       metric_recall,
                       metric_F1score])

Additional knowledge: miscellaneous notes on binary/multi-class classification with keras and sklearn (cross-validation and evaluation metrics)

I. Preamble

This post documents the problems encountered in the supplementary experiments for my thesis, together with their solutions, presented mainly as code.

II. Scope

Deep learning framework: keras

Task types: binary and multi-class classification

III. Technical miscellany

1. K-fold cross-validation

1. Concepts

The dataset is randomly split into K folds and the model is trained K times; each run uses K-1 folds as the training set and the remaining fold as the validation set. The performance metrics on the validation set are saved at the end of each run, and finally the K results are averaged to obtain the final estimate of model performance.

2. Advantages and disadvantages

Advantage: more robust model evaluation

Disadvantage: increased training time

3. Code

① sklearn and keras are used independently of each other

from sklearn.model_selection import StratifiedKFold
import numpy

seed = 7 # random seed
numpy.random.seed(seed) # fix the random number generation
num_k = 5 # number of folds

# Entire dataset (define it yourself)
X = ...
Y = ...

kfold = StratifiedKFold(n_splits=num_k, shuffle=True, random_state=seed) # stratified K-fold keeps the class ratios consistent across folds

cvscores = []
for train, test in kfold.split(X, Y):

    # Build the model with the Sequential or functional API (define it yourself)
    model = ...
    model.compile(...) # customize the loss, optimizer and metrics (e.g. metrics=['accuracy'])

    # Model training
    model.fit(X[train], Y[train], epochs=150, batch_size=10, verbose=0)

    # Model testing
    scores = model.evaluate(X[test], Y[test], verbose=0)

    print("%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100)) # print the validation-set accuracy

    cvscores.append(scores[1] * 100)

print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores))) # mean and standard deviation of the k-fold results

② sklearn in combination with keras

from keras.wrappers.scikit_learn import KerasClassifier # the sklearn wrapper API under keras
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np

seed = 7 # random seed
np.random.seed(seed) # fix the random number generation
num_k = 5 # number of folds

# Entire dataset (define it yourself)
X = ...
Y = ...

# Create the model
def create_model():
    # Build the model with the Sequential or functional API (define it yourself)
    model = ...
    return model

model = KerasClassifier(build_fn=create_model, epochs=150, batch_size=10)
kfold = StratifiedKFold(n_splits=num_k, shuffle=True, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean()) # mean result over the k folds

Addendum: Introducing callbacks for keras

Just add one more argument to the fit/cross-validation calls in ① and ②, for example callbacks=[ModelCheckpoint(...)], which saves the model's weights; callbacks can of course also be used to record the training process. A sketch follows.
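As a rough sketch (the file path, monitored quantity, and fit arguments below are illustrative assumptions, not from the original post), a ModelCheckpoint callback could be wired into the fit call of ① like this:

from keras.callbacks import ModelCheckpoint

# Illustrative: save the best weights observed on the held-out data
checkpoint = ModelCheckpoint('best_weights.h5',
                             monitor='val_loss',
                             save_best_only=True,
                             save_weights_only=True)

model.fit(X[train], Y[train],
          epochs=150, batch_size=10, verbose=0,
          validation_data=(X[test], Y[test]),
          callbacks=[checkpoint])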

2. Binary/multi-class classification evaluation metrics

1. Concepts

Binary classification means that each sample has one of two labels (e.g. 0 or 1, corresponding to the one-hot labels [1,0] or [0,1]). Such problems can generally be solved with a softmax output or with logistic regression (a single sigmoid output), trained with a cross-entropy or MSE loss respectively; the former outputs a probability distribution over the two classes, the latter a single prediction in (0,1).

Multi-class classification means that each sample takes one label out of several possibilities (e.g. 0, 1, 2, ...).
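As a hedged illustration of these two setups (the layer sizes, input dimension, and class count below are arbitrary choices, not from the original post):

from keras.models import Sequential
from keras.layers import Dense

# Binary classification: a single sigmoid output in (0,1), trained with binary cross-entropy
binary_model = Sequential([Dense(16, activation='relu', input_dim=20),
                           Dense(1, activation='sigmoid')])
binary_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Multi-class classification: a softmax output over n_classes, trained with categorical cross-entropy
n_classes = 3
multi_model = Sequential([Dense(16, activation='relu', input_dim=20),
                          Dense(n_classes, activation='softmax')])
multi_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])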

2. Evaluation indicators

The main metrics are: accuracy, error rate, precision, recall (= true positive rate, TPR, = sensitivity), F1-measure (micro and macro variants), false positive rate (FPR), specificity, the ROC (receiver operating characteristic) curve (micro and macro variants), AUC (area under the curve), the P-R (precision-recall) curve, and the confusion matrix.

① Accuracy and error rate

accuracy = (TP+TN)/ (P+N) or accuracy = (TP+TN)/ (T+F)

error rate = (FP+FN) / (P+N) or (FP+FN) / (T+F)

accuracy = 1 - error rate

It can be seen that accuracy and error rate evaluate the classifier over the whole dataset.

② Precision

precision=TP /(TP+FP)

It can be seen that precision evaluates the classifier on the samples that are predicted to be positive.

③ Recall Rate/True Positive Rate/Sensitivity

recall = TPR = sensitivity = TP/(TP+FN)

It can be seen that recall / true positive rate / sensitivity evaluates the classifier on all the truly positive samples.

④ F1-measure

F1-measure = 2 * (recall * precision / (recall + precision))

Two types are included: micro and macro (for multi-category classification problems, note the distinction from multi-label classification problems)

1)micro

Pool the counts over all classes to get the overall precision and recall, then compute the F1-measure from them.

2)macro

Compute precision and recall, and hence an F1-measure, for each class separately, and finally average the per-class F1-measures.

It can be seen that the F1-measure is the harmonic mean of the two conflicting metrics, precision and recall.
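A small sklearn illustration of the difference between the two averaging schemes (the labels below are made up for demonstration):

from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 0, 2]

# micro: pool TP/FP/FN over all classes, then compute a single F1
print(f1_score(y_true, y_pred, average='micro'))
# macro: compute F1 per class, then take the unweighted mean of the per-class values
print(f1_score(y_true, y_pred, average='macro'))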

⑤ False positive rate

FPR=FP / (FP+TN)

It can be seen that the false positive rate evaluates the classifier on all the truly negative samples, focusing on false positives.

⑥ Specificity

specificity = 1- FPR

As can be seen, specificity evaluates the classifier on all the truly negative samples, focusing on true negatives.

⑦ ROC curve and AUC

Role: combined indicator of sensitivity and specificity

Horizontal axis: FPR (= 1 - specificity)

Vertical axis: TPR (= sensitivity = recall)

AUC is the area under the ROC curve; the larger it is, the better the classifier performs.

Two types are included: micro and macro (for multi-category classification problems, note the distinction from multi-label classification problems)

Suppose there are M samples and N classes in total, a predicted probability matrix P of shape (M, N), and a label matrix L of shape (M, N).

1) micro

Flatten P and L (treating every (sample, class) entry as one binary decision), compute the TPR and FPR at each threshold over all entries at once, and draw a single ROC curve.

2) macro

For each of the N classes (each column of P and L), compute the TPR and FPR at each threshold and draw a per-class ROC curve, giving N curves in total, and finally average them. A sketch of both schemes follows.
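A minimal sketch of the two schemes with sklearn, using a made-up label matrix L and probability matrix P of shape (M, N):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy example: M = 4 samples, N = 3 classes (values are invented for illustration)
L = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])  # one-hot label matrix
P = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2],
              [0.1, 0.3, 0.6], [0.4, 0.4, 0.2]])            # predicted probability matrix

# micro: flatten L and P and compute one ROC/AUC over all (sample, class) entries
fpr, tpr, _ = roc_curve(L.ravel(), P.ravel())
print(roc_auc_score(L, P, average='micro'))

# macro: compute an AUC per class (per column), then take the unweighted mean
print(roc_auc_score(L, P, average='macro'))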

⑧ P-R curve

Horizontal axis: recall

Vertical axis: precision

Judging the curve: 1) visually, the larger the area enclosed by the P-R curve the better, and the further out the break-even point (where P = R) the better; 2) alternatively, judge by the F1-measure.

Comparing ROC and P-R curves: when the ratio of positive to negative samples is imbalanced, the ROC curve stays essentially unchanged while the P-R curve changes considerably, for the following reason:

When the proportion of negative samples increases, then at any fixed recall a poorly performing model will inevitably pick up many more negative samples: TP stays roughly the same while FP grows rapidly, so precision drops and the area enclosed by the P-R curve shrinks.
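As a hedged sketch of how one might visualize this (the dataset, class proportions, and model below are arbitrary illustrative choices): rerunning with a different weights setting shows the P-R curve shifting much more than the ROC curve.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, precision_recall_curve

# Synthetic binary data with a 95/5 class imbalance (illustrative)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
proba = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, _ = roc_curve(y, proba)                          # ROC curve: FPR vs TPR
precision, recall, _ = precision_recall_curve(y, proba)    # P-R curve: recall vs precision

plt.figure()
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')

plt.figure()
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('P-R curve')
plt.show()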

⑨ Confusion matrix

Each row of the confusion matrix corresponds to a true class and each column to a predicted class: entry (i, j) counts the samples whose true class is i and whose predicted label is j (the sklearn convention used in the code below).

3. Code

Note: the explanations are given as comments inside the code below.

from sklearn import datasets
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, accuracy_score, recall_score, f1_score, roc_auc_score, precision_recall_fscore_support, roc_curve, classification_report
import matplotlib.pyplot as plt

iris = datasets.load_iris()
x, y = iris.data, iris.target
print("label:", y)
n_class = len(set(y))
y_one_hot = label_binarize(y, classes=np.arange(n_class))

# alpha = np.logspace(-2, 2, 20) # set the hyperparameter range
# model = LogisticRegressionCV(Cs=alpha, cv=3, penalty='l2') # use L2 regularization
model = LogisticRegression() # the maximum number of iterations has a built-in default and can be changed
model.fit(x, y)
y_score = model.predict(x) # the output is an integer label
mean_accuracy = model.score(x, y)
print("mean_accuracy: ", mean_accuracy)
print("predict label:", y_score)
print(y_score == y)
print(y_score.shape)
y_score_pro = model.predict_proba(x) # output probabilities
print(y_score_pro)
print(y_score_pro.shape)
y_score_one_hot = label_binarize(y_score, classes=np.arange(n_class)) # the input to this function must be integer labels
print(y_score_one_hot.shape)

obj1 = confusion_matrix(y, y_score) # Note that the input must be of integer type, shape=(n_samples, )
print('confusion_matrix\n', obj1)

print(y)
print('accuracy:{}'.format(accuracy_score(y, y_score))) # No average
print('precision:{}'.format(precision_score(y, y_score,average='micro')))
print('recall:{}'.format(recall_score(y, y_score,average='micro')))
print('f1-score:{}'.format(f1_score(y, y_score,average='micro')))
print('f1-score-for-each-class:{}'.format(precision_recall_fscore_support(y, y_score))) # per-class precision/recall/F1 (the basis for macro averaging)
# print('AUC y_pred = one-hot:{}\n'.format(roc_auc_score(y_one_hot, y_score_one_hot,average='micro'))) # For multi-class inputs it must be a proba, so this is wrong

# AUC value
auc = roc_auc_score(y_one_hot, y_score_pro, average='micro') # micro: flatten all (sample, class) entries and compute a single AUC globally
print("AUC y_pred = proba:", auc)
# Draw the ROC curve
print("one-hot label ravelled shape:", y_one_hot.ravel().shape)
fpr, tpr, thresholds = roc_curve(y_one_hot.ravel(), y_score_pro.ravel()) # ravel() flattens the arrays, because the input shape must be (n_samples,)
print("threshold: ", thresholds)
plt.plot(fpr, tpr, linewidth=2, label='AUC=%.3f' % auc)
plt.plot([0, 1], [0, 1], 'k--') # draw the y = x reference line; 'k--' sets its color and style
plt.axis([0, 1.0, 0, 1.0]) # limit the coordinate range
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

# p-r curves are for binary classification and are not described here
ans = classification_report(y, y_score, digits=5) # print the metrics to 5 decimal places
print(ans)

That is everything I have to share on the method and code for custom binary-classification evaluation metrics in keras. I hope it serves as a useful reference, and thank you for your support.