
Two Python implementations for loading the contents of a file

When it comes to machine learning, the first things that come to mind may be Python and algorithms. In fact, Python and algorithms alone are not enough; data is the prerequisite for machine learning.

Most data is stored in files. Before an algorithm written in Python can learn from that data, the first thing you need to do is read the data into the program. This article introduces two ways to load data; both will be used frequently in the later articles that introduce the algorithms.

In the following, we take loading data for a Logistic Regression model as an example and introduce the two different ways of loading data.

I. Load using the open() function

import numpy as np

def load_file(file_name):
    '''
    Load a file using the open() function
    :param file_name: file name
    :return: feature matrix, label matrix
    '''
    f = open(file_name)  # Open the file that holds the training dataset
    feature = []  # List of features
    label = []  # List of labels
    for row in f.readlines():
        f_tmp = []  # Intermediate list of features
        l_tmp = []  # Intermediate list of labels
        number = row.strip().split("\t")  # Split each row on \t to get its features and label
        f_tmp.append(1)  # Set the bias term
        for i in range(len(number) - 1):
            f_tmp.append(float(number[i]))
        l_tmp.append(float(number[-1]))
        feature.append(f_tmp)
        label.append(l_tmp)
    f.close()  # Close the file, a very important step
    return np.mat(feature), np.mat(label)
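
To make the parsing step concrete, here is a minimal sketch of what happens to a single tab-separated row (the sample values are made up for illustration and are not taken from the real data file): the line is split on \t, a constant 1 is prepended as the bias term, and the last field becomes the label.

row = "2.5\t7.1\t0\n"                       # made-up sample line: two features and a label, tab-separated
number = row.strip().split("\t")            # -> ['2.5', '7.1', '0']
f_tmp = [1]                                 # the bias term comes first
f_tmp += [float(x) for x in number[:-1]]    # features -> [1, 2.5, 7.1]
l_tmp = [float(number[-1])]                 # label    -> [0.0]
print(f_tmp, l_tmp)                         # [1, 2.5, 7.1] [0.0]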

II. Load using the pandas read_csv() method

import pandas as pd

def load_file_pd(path, file_name):
    '''
    Load a file with the pandas library
    :param path: file path
    :param file_name: name of the file
    :return: feature matrix, label matrix
    '''
    feature = pd.read_csv(path + file_name, delimiter="\t", header=None, usecols=[0, 1])
    feature.columns = ["a", "b"]
    feature = feature.reindex(columns=list('cab'), fill_value=1)  # Add a bias column "c" filled with 1
    label = pd.read_csv(path + file_name, delimiter="\t", header=None, usecols=[2])
    return feature.values, label.values
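
The reindex() call is what adds the bias term here: after the two feature columns are named "a" and "b", reindexing to the column list ['c', 'a', 'b'] introduces a column "c" that does not exist yet, and fill_value=1 fills it with ones. A small sketch with made-up values shows the effect:

import pandas as pd

df = pd.DataFrame({"a": [2.5, 6.1], "b": [7.1, 3.4]})  # made-up feature values
df = df.reindex(columns=list('cab'), fill_value=1)     # new column "c" is created and filled with 1
print(df)
#    c    a    b
# 0  1  2.5  7.1
# 1  1  6.1  3.4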

III. Examples

We can test both of the above methods by loading a small sample of data. The data is divided into three columns: the first two columns are features and the last column is the label.
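
The original data listing is not reproduced here, but judging from the test output further below, each line of the file holds two tab-separated feature values followed by a label, for example (first rows reconstructed from that output):

1.43481273    4.54377111    0
5.80444603    7.72222239    0
2.89737803    4.84582798    0
...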

The code for loading data is as follows:

'''
Two ways to load a file
'''
 
import pandas as pd
import numpy as np
 
def load_file(file_name):
    '''
    Load a file using the open() function
    :param file_name: file name
    :return: feature matrix, label matrix
    '''
    f = open(file_name)  # Open the file that holds the training dataset
    feature = []  # List of features
    label = []  # List of labels
    for row in f.readlines():
        f_tmp = []  # Intermediate list of features
        l_tmp = []  # Intermediate list of labels
        number = row.strip().split("\t")  # Split each row on \t to get its features and label
        f_tmp.append(1)  # Set the bias term
        for i in range(len(number) - 1):
            f_tmp.append(float(number[i]))
        l_tmp.append(float(number[-1]))
        feature.append(f_tmp)
        label.append(l_tmp)
    f.close()  # Close the file, a very important step
    return np.mat(feature), np.mat(label)
 
def load_file_pd(path, file_name):
    '''
    Load a file with the pandas library
    :param path: file path
    :param file_name: name of the file
    :return: feature matrix, label matrix
    '''
    feature = pd.read_csv(path + file_name, delimiter="\t", header=None, usecols=[0, 1])
    feature.columns = ["a", "b"]
    feature = feature.reindex(columns=list('cab'), fill_value=1)  # Add a bias column "c" filled with 1
    label = pd.read_csv(path + file_name, delimiter="\t", header=None, usecols=[2])
    return feature.values, label.values
 
if __name__ == "__main__":
    path = "C://Users//Machenike//Desktop//xzw//"
    feature, label = load_file(path + "")
    feature_pd, label_pd = load_file_pd(path, "")
    print(feature)
    print(feature_pd)
    print(label)
    print(label_pd)

Test results:

[[ 1.          1.43481273  4.54377111]
 [ 1.          5.80444603  7.72222239]
 [ 1.          2.89737803  4.84582798]
 [ 1.          3.48896827  9.42538199]
 [ 1.          7.98990181  9.38748992]
 [ 1.          6.07911968  7.81580716]
 [ 1.          8.54988938  9.83106546]
 [ 1.          1.86253147  3.64519173]
 [ 1.          5.09264649  7.16456405]
 [ 1.          0.64048734  2.96504627]
 [ 1.          0.44568267  7.27017831]]
[[ 1.          1.43481273  4.54377111]
 [ 1.          5.80444603  7.72222239]
 [ 1.          2.89737803  4.84582798]
 [ 1.          3.48896827  9.42538199]
 [ 1.          7.98990181  9.38748992]
 [ 1.          6.07911968  7.81580716]
 [ 1.          8.54988938  9.83106546]
 [ 1.          1.86253147  3.64519173]
 [ 1.          5.09264649  7.16456405]
 [ 1.          0.64048734  2.96504627]
 [ 1.          0.44568267  7.27017831]]
[[ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]]
[[0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]]

As the test results show, the two methods load exactly the same data, so either one is suitable for loading data into the program.

Attention:

This example loads data for a Logistic Regression model. Your own data may look different, but the way it is loaded is much the same, so adapt the code flexibly.

The above is based on my personal experience. I hope it gives you a useful reference, and I hope you will continue to support me.