When machine learning comes up, the first things that come to mind may be Python and algorithms. In fact, Python and algorithms alone are not enough: data is the prerequisite for machine learning.
Most data is stored in files, so before an algorithm can learn from the data through Python, the first step is to read the data into the program. This article introduces two ways to load data; both will be used frequently when the algorithms themselves are introduced later.
In the following, we take loading data for a Logistic Regression model as an example and introduce the two loading methods in turn.
I. Loading with the open() function
import numpy as np

def load_file(file_name):
    '''Load a file using the open() function
    :param file_name: file name
    :return: feature matrix, label matrix
    '''
    f = open(file_name)  # Open the file that holds the training dataset
    feature = []  # List of features
    label = []  # List of labels
    for row in f.readlines():
        f_tmp = []  # Intermediate list of features
        l_tmp = []  # Intermediate list of labels
        # Split each row on \t to get that row's features and label
        number = row.strip().split("\t")
        f_tmp.append(1)  # Set the bias term
        for i in range(len(number) - 1):
            f_tmp.append(float(number[i]))
        l_tmp.append(float(number[-1]))
        feature.append(f_tmp)
        label.append(l_tmp)
    f.close()  # Close the file -- an important step
    return np.mat(feature), np.mat(label)
II. Loading with the Pandas read_csv() method
import pandas as pd

def load_file_pd(path, file_name):
    '''Load a file with the pandas library
    :param path: file path
    :param file_name: file name
    :return: feature matrix, label matrix
    '''
    feature = pd.read_csv(path + file_name, delimiter="\t", header=None, usecols=[0, 1])
    feature.columns = ["a", "b"]
    # Reindex to add a constant column "c" (the bias term) in front of the features
    feature = feature.reindex(columns=list('cab'), fill_value=1)
    label = pd.read_csv(path + file_name, delimiter="\t", header=None, usecols=[2])
    return feature.values, label.values
III. Examples
We can test the two methods above on a small sample dataset. The data has three columns: the first two columns are features and the last column is the label.
The code for loading data is as follows:
'''
Two ways to load a file
'''
import pandas as pd
import numpy as np

def load_file(file_name):
    '''Load a file using the open() function
    :param file_name: file name
    :return: feature matrix, label matrix
    '''
    f = open(file_name)  # Open the file that holds the training dataset
    feature = []  # List of features
    label = []  # List of labels
    for row in f.readlines():
        f_tmp = []  # Intermediate list of features
        l_tmp = []  # Intermediate list of labels
        # Split each row on \t to get that row's features and label
        number = row.strip().split("\t")
        f_tmp.append(1)  # Set the bias term
        for i in range(len(number) - 1):
            f_tmp.append(float(number[i]))
        l_tmp.append(float(number[-1]))
        feature.append(f_tmp)
        label.append(l_tmp)
    f.close()  # Close the file -- an important step
    return np.mat(feature), np.mat(label)

def load_file_pd(path, file_name):
    '''Load a file with the pandas library
    :param path: file path
    :param file_name: file name
    :return: feature matrix, label matrix
    '''
    feature = pd.read_csv(path + file_name, delimiter="\t", header=None, usecols=[0, 1])
    feature.columns = ["a", "b"]
    # Reindex to add a constant column "c" (the bias term) in front of the features
    feature = feature.reindex(columns=list('cab'), fill_value=1)
    label = pd.read_csv(path + file_name, delimiter="\t", header=None, usecols=[2])
    return feature.values, label.values

if __name__ == "__main__":
    path = "C://Users//Machenike//Desktop//xzw//"
    feature, label = load_file(path + "")
    feature_pd, label_pd = load_file_pd(path, "")
    print(feature)
    print(feature_pd)
    print(label)
    print(label_pd)
Test results:
[[ 1. 1.43481273 4.54377111]
[ 1. 5.80444603 7.72222239]
[ 1. 2.89737803 4.84582798]
[ 1. 3.48896827 9.42538199]
[ 1. 7.98990181 9.38748992]
[ 1. 6.07911968 7.81580716]
[ 1. 8.54988938 9.83106546]
[ 1. 1.86253147 3.64519173]
[ 1. 5.09264649 7.16456405]
[ 1. 0.64048734 2.96504627]
[ 1. 0.44568267 7.27017831]]
[[ 1. 1.43481273 4.54377111]
[ 1. 5.80444603 7.72222239]
[ 1. 2.89737803 4.84582798]
[ 1. 3.48896827 9.42538199]
[ 1. 7.98990181 9.38748992]
[ 1. 6.07911968 7.81580716]
[ 1. 8.54988938 9.83106546]
[ 1. 1.86253147 3.64519173]
[ 1. 5.09264649 7.16456405]
[ 1. 0.64048734 2.96504627]
[ 1. 0.44568267 7.27017831]]
[[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]
[ 0.]]
[[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]]
The test results show that the two loading methods produce the same data, so either one is suitable for loading data.
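Rather than comparing the printed output by eye, the agreement between the two loaders can be checked programmatically with `numpy.allclose`. A small helper, sketched here (`same_result` is a name chosen for illustration), works for both `np.mat` matrices and the plain arrays returned by `.values`:

```python
import numpy as np

def same_result(a, b, tol=1e-8):
    '''Element-wise comparison that accepts np.mat and plain ndarrays alike.'''
    return np.allclose(np.asarray(a, dtype=float),
                       np.asarray(b, dtype=float), atol=tol)
```

For example, `same_result(feature, feature_pd)` would return True for the outputs above.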
Attention:
This example loads data for a Logistic Regression model. Your data may differ from the data shown here, but the way of loading it is much the same, so apply these methods flexibly.
The above is based on personal experience; I hope it gives you a useful reference, and I hope you will continue to support me.