
TensorFlow 2.1: Completing MPG Regression Prediction in Detail

Preamble

This article walks through a regression prediction task on the Auto MPG dataset using the CPU version of TensorFlow 2.1.

Acquisition of Auto MPG data and normalization of data

(1) The Auto MPG dataset describes the feature values and label values for the fuel efficiency of cars. By training a model we can learn patterns from the features and ultimately predict the target MPG with minimal error.

(2) We use a built-in Keras utility to download the data directly from the web and save it locally.

(3) Each row contains eight columns of data: MPG, Cylinder, Displacement, Horsepower, Weight, Acceleration, Model Year, and Origin, where MPG is our label value and the others are features.

import pandas as pd
from tensorflow import keras

dataset_path = keras.utils.get_file("auto-mpg.data", "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
column_names = ['MPG', 'Cylinder', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model Year', 'Origin']
raw_dataset = pd.read_csv(dataset_path, names=column_names, na_values="?", comment='\t', sep=" ", skipinitialspace=True)
dataset = raw_dataset.copy()
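
As a quick sanity check (not part of the original listing), we can peek at the freshly loaded frame:

# Inspect the last few rows and count missing values per column
print(dataset.tail())
print(dataset.isna().sum())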

Processing of data

(1) Because there are some null values in the data, which would interfere with computing the features and predicting the target, rows containing null data are removed from the dataset.

dataset = dataset.dropna()

(2) Since the "Origin" column takes only three values, 1, 2, and 3, representing three origins (USA, Europe, and Japan), we split it into a separate column per origin, which is equivalent to one-hot encoding the origin category.

origin = dataset.pop('Origin')
dataset['USA'] = (origin == 1)*1.0
dataset['Europe'] = (origin == 2)*1.0
dataset['Japan'] = (origin == 3)*1.0

(3) Split the data in a fixed ratio: 90% as training data and 10% as test data.

train_datas = dataset.sample(frac=0.9, random_state=0)
test_datas = dataset.drop(train_datas.index)

(4) Here we mainly use built-in functions to view a variety of common statistics for each column of the training set: count, mean, std, min, 25%, 50%, 75%, and max. This saves us from having to compute them ourselves; they can be used directly.

train_stats = train_datas.describe()
train_stats.pop("MPG")
train_stats = train_stats.transpose()
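
After the transpose, each feature becomes a row and the statistics become columns, so the 'mean' and 'std' columns can be read off directly for the normalization step below (a quick inspection, not in the original listing):

# Each row is now one feature; the scaling step below uses these two columns
print(train_stats[['mean', 'std']])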

(5) The MPG column in the data is the regression target we need to predict, so we pop this column out of the training and test sets and keep it as a separate label. MPG stands for Miles per Gallon, a common measure of fuel efficiency: the number of miles a car can travel on a single gallon of gasoline or diesel in the tank.

train_labels = train_datas.pop('MPG')
test_labels = test_datas.pop('MPG')

(6) Here we normalize the training and test data. Each feature should be scaled independently to the same range, because when input feature values span different ranges, model training does not converge quickly. I have placed the training metrics of a model trained without data normalization in the final section of this article, and you can see that they are very erratic.

def norm(stats, x):
    # Scale each feature using the training-set mean and standard deviation
    return (x - stats['mean']) / stats['std']

train_datas = norm(train_stats, train_datas)
test_datas = norm(train_stats, test_datas)

Building Deep Learning Models

Here we build the deep learning model, then configure and compile it.

(1) The model has three main layers:

  • The first layer is a fully connected layer that takes all the feature values of each sample as input, transforms them nonlinearly with the ReLU activation function, and outputs a 64-dimensional vector.
  • The second layer is a fully connected layer that takes the 64-dimensional vector from the previous layer, transforms it nonlinearly with the ReLU activation function, and outputs a 32-dimensional vector.
  • The third layer is a fully connected layer that takes the 32-dimensional vector from the previous layer and outputs a 1-dimensional result, which is the predicted MPG.

(2) The optimizer chosen for the model is RMSprop, with a learning rate of 0.001.

(3) The loss used in the model is MSE, the mean squared error: a statistic equal to the mean of the squared differences between the model's predictions and the MPG values of the original samples.

(4) The evaluation metrics of the model are MAE and MSE. MSE is as above, and MAE is the Mean Absolute Error: a statistic equal to the average of the absolute differences between the model's predicted values and the MPG of the original samples. A quick worked sketch of both follows.
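
To make the two metrics concrete, here is a minimal NumPy sketch with made-up numbers (not taken from the dataset) that computes them by hand:

import numpy as np

# Hypothetical true MPG values and model predictions
y_true = np.array([15.0, 22.0, 30.0])
y_pred = np.array([14.2, 23.5, 28.9])

mse = np.mean((y_true - y_pred) ** 2)   # mean squared error
mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error
print(mse, mae)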

def build_model():
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=[len(train_datas.keys())]),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(1)
    ])
    optimizer = keras.optimizers.RMSprop(0.001)
    model.compile(loss='mse', optimizer=optimizer, metrics=['mae', 'mse'])
    return model

model = build_model()
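
To verify the three layers and their parameter counts, the architecture can be printed (an optional check, not in the original listing):

model.summary()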

Complete model training with EarlyStopping

(1) Here we train the model on the training-set data and labels for up to 1,000 epochs, with 20% of the training data set aside as a validation set to evaluate the model during training. To avoid overfitting, we use the EarlyStopping technique: if there is no improvement after a given number of epochs (we define 20 here), training stops automatically.

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
history = model.fit(train_datas, train_labels, epochs=1000, validation_split=0.2, verbose=2, callbacks=[early_stop])

The training-process metrics are printed as follows, and we can see that training stops after the 106th epoch:

Train on 282 samples, validate on 71 samples
Epoch 1/1000
282/282 - 0s - loss: 567.8865 - mae: 22.6320 - mse: 567.8865 - val_loss: 566.0270 - val_mae: 22.4126 - val_mse: 566.0270
Epoch 2/1000
282/282 - 0s - loss: 528.5458 - mae: 21.7937 - mse: 528.5459 - val_loss: 526.6008 - val_mae: 21.5748 - val_mse: 526.6008
...
Epoch 105/1000
282/282 - 0s - loss: 6.1971 - mae: 1.7478 - mse: 6.1971 - val_loss: 5.8991 - val_mae: 1.8962 - val_mse: 5.8991
Epoch 106/1000
282/282 - 0s - loss: 6.0749 - mae: 1.7433 - mse: 6.0749 - val_loss: 5.7558 - val_mae: 1.8938 - val_mse: 5.7558

(2) We also plot the MAE and MSE of the model on the training and validation sets during training. The two plots show that after 100-odd epochs the training process terminates, avoiding overfitting; a plotting sketch follows.
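
The plotting code itself is not reproduced above; a minimal matplotlib sketch along the following lines (using the history object returned by fit and the metric keys that TensorFlow 2.1 logs) draws the MAE curves:

import matplotlib.pyplot as plt

hist = history.history
epochs = range(1, len(hist['mae']) + 1)

# MAE on the training and validation sets over all completed epochs
plt.plot(epochs, hist['mae'], label='train MAE')
plt.plot(epochs, hist['val_mae'], label='val MAE')
plt.xlabel('Epoch')
plt.ylabel('MAE [MPG]')
plt.legend()
plt.show()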

Evaluation of models using test data

loss, mae, mse = model.evaluate(test_datas, test_labels, verbose=2)
print("The MAE of the test set is: {:5.2f} MPG and the MSE is: {:5.2f} MPG.".format(mae, mse))

The output result is:

The MAE of the test set is: 2.31 MPG and the MSE is: 9.12 MPG.

Prediction using models

We select one sample from the test data and use the model to predict its MPG.

predictions = model.predict(test_datas[:1]).flatten()
predictions

The result is:

array([15.573855], dtype=float32)

The actual MPG of this test sample is 15.0, so the prediction is off by 0.573855. We could of course build a more complex model and select more features for training; in theory that could achieve a smaller prediction error.
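
For reference, the true label of the same sample can be read back from the popped label series (a small check, assuming the variables defined above):

actual = test_labels.iloc[0]           # 15.0 for this sample
error = abs(predictions[0] - actual)   # roughly 0.57 MPG
print(actual, error)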

Demonstrating the training process without normalization

Here we show the training-process metrics for data that was not normalized, and we can see that they are erratic. So in general we recommend normalizing the data, which helps model training converge quickly.
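
The comparison can be reproduced by training the same architecture on the raw, unscaled features; here is a minimal sketch, assuming an unnormalized copy of the training data was kept aside before the norm() step (the name raw_train_datas is hypothetical):

# Hypothetical: assumes an unscaled copy was saved before normalization, e.g.
#   raw_train_datas = train_datas.copy()  # taken just before calling norm()
raw_model = build_model()
raw_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
raw_history = raw_model.fit(raw_train_datas, train_labels, epochs=1000,
                            validation_split=0.2, verbose=2, callbacks=[raw_stop])
# The per-epoch loss/mae/mse jump around far more than in the normalized run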

This concludes the detailed walkthrough of MPG regression prediction with TensorFlow 2.1. For more on TensorFlow MPG regression prediction, please see my other related articles!