
Getting started with data import for Python mathematical modeling (for beginners)

1. Data import is the first step in all mathematical modeling programming

When programming to solve a mathematical modeling problem, the problem always involves some data.

Some of the data is given in the textual description of the problem, some is provided as files attached to the problem or downloaded from a specified web site, and some has to be collected on your own. Regardless of how the data is obtained, and regardless of the type of problem and algorithm, the first step is always to import this data into the program in an appropriate manner and format.

  • If there is a problem with the data format

Even a minor error in reading the data means wasting time tracking it down and fixing it, which is maddening in a mathematical modeling contest.

And that is the mild case. The serious case is when the data is read incorrectly but the program keeps running and produces a wrong result, which is far worse in a contest.

You may not even realize that an error has occurred, and even if you sense that something is wrong, you may not trace it back to the data import step.

Instead, you keep turning to other modules, breaking the correct ones as well, until the whole thing is hopeless.

Therefore, making sure the first step of modeling programming, "data import", is completed correctly is more important than you might think.

  • There are many ways to import data in the Python language

What is the best method to choose for programming mathematical modeling problems? The answer is: there is no best, only the most appropriate.

For different problems, different algorithms, and different implementations of the invoked toolkit, there will be different requirements for the data.

In addition, the different organization of data in the data files given in the contest questions requires the use of different methods to import the data.

Well then, if it all depends on the problem, isn't that the same as saying nothing at all? That is exactly the question this article hopes to answer: although the best data import method differs from problem to problem, we should first learn a method that may not be the best, but is general, safe, simple and easy to learn.

2. Assigning values to variables directly in the program

Assigning values to variables directly in the program is the simplest, if clumsiest, method, and perhaps still the most reliable, provided you don't hit the wrong keys.

Indeed, it's embarrassing to present direct assignment as a data import method.

However, for the particular needs of mathematical modeling contests, the direct assignment method is still very commonly used and fully meets the requirements of simplicity, practicality and reliability.

That said, direct assignment is not as simple as we would like it to be, and it is still worth a serious discussion.

2.1 Why direct assignment?

The vast majority of mathematical modeling textbooks contain routines that import data using direct assignment.
A large percentage of blog routines, including most of the cases in this series, also assign values directly in the program.

  • The reasons for this are:

First, it keeps the program self-contained: you can copy, paste and run it to reproduce the results without also having to copy data files, which avoids a whole class of errors;

Second, it keeps the reader's attention on the main points of knowledge and avoids distractions;

Third, it makes the routines more intuitive, so their algorithms are easier to understand.

All of these reasons are also advantages of direct assignment. And aren't these advantages exactly the pain points of programming in a mathematical modeling contest?

Yes, and that is why direct assignment is so popular in mathematical modeling training and contest programming.

2.2 Problems and Considerations of Direct Assignment

However, there are several problems with direct assignment, even in contest programming.

  • First, some problems simply cannot use direct assignment. This is mainly the case for big-data problems, where the amount of data or the number of data files is too large for direct assignment to be feasible.
  • Second, some problems can use direct assignment, but it is easy to make mistakes. This mainly concerns problems with a large amount of data or a complex data structure or type. For example, multivariate analysis, time series and statistics problems may involve a large amount of data, provided as data files in an appendix.
    In that case, importing the data by direct assignment no longer means typing it in, but copying and pasting it from the file into the program. The point to pay special attention to is: what is the data separator in the file, space or comma, and is it consistent with the formatting required for the variable assignment?
    Even if the separator in the file appears to be a space, you need to check whether it is really a space or a tab, and whether it is one space or several.
    Are there anomalies in the file, such as errors or missing values?
    When a program reads the file, such issues can be checked, recognized and handled automatically; when copying and pasting, they all have to be handled by hand.
  • Third, even problems with small amounts of data, where importing by direct assignment is perfectly adequate, can go wrong through carelessness. This is not so much a matter of hitting the wrong keys as of routines that do not treat data assignment as a separate module, but scatter the assignments over the course of the algorithm.
    When students use and modify such routines, it is easy to forget to update the variable assignments buried in the algorithm.
    This happens all the time, sometimes because of an incomplete understanding of the program, so a variable in some algorithmic step is overlooked;
    more often it is a matter of being rushed, getting dizzy from repeatedly debugging and swapping data, and updating the data at the beginning of the program while neglecting the data further down.

Getting into the habit of modularizing your data import is the only way to avoid this type of oversight:

  • Make the data import module a separate function.
  • If you prefer not to use a data import function, write the data import section as one consolidated block and place it at the beginning of the program.
  • Do not mix the data import for the problem itself with the parameter assignments required by the algorithm; keep them in two separate functions or blocks.

Routine 1: Data Import as a Separate Function

# Subroutine: Define the objective function of the optimization problem
def cal_Energy(X, nVar, mk): # m(k): penalization factor
    p1 = (max(0, 6*X[0]+5*X[1]-320))**2
    p2 = (max(0, 10*X[0]+20*X[1]-7027))**2
    fx = -(10*X[0]+9*X[1])
    return fx+mk*(p1+p2)

# Subroutine: parameterization of the simulated annealing algorithm
def ParameterSetting():
    tInitial = 100.0            # Set initial temperature for annealing
    tFinal  = 1                 # Setting of stop temperature
    alfa    = 0.98              # Set cooling parameter, T(k)=alfa*T(k-1)
    nMarkov = 100            	# Markov chain length, i.e., number of inner loop runs
    youcans = 0.5               # Define the search step, which can be set to a fixed value or tapered.
    return tInitial, tFinal, alfa, nMarkov, youcans
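
With this structure, the model data and the algorithm parameters each live in one place. Below is a minimal sketch (not part of the original routine) of how a main program might call the two subroutines above; the candidate solution X and the penalty factor mk are purely illustrative values.

# Main program sketch: calls the subroutines defined above (illustrative values only)
def main():
    tInitial, tFinal, alfa, nMarkov, youcans = ParameterSetting()  # algorithm parameters, set in one place
    X = [20.0, 30.0]     # illustrative candidate solution (x1, x2)
    mk = 1.0             # illustrative penalty factor m(k)
    print(cal_Energy(X, len(X), mk))  # evaluate the penalized objective for this candidate

if __name__ == '__main__':
    main()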

Routine 2: Write the data import as a consolidated block and place it at the beginning of the program.

# Main program
def main():
    # Model data import
    p1 = [6, 5, -320]
    p2 = [10, 20, -7027]
    p3 = [10, 9]
    print(p1,p2,p3)

    # Algorithm parameterization
    tInitial = 100.0            # Set initial temperature for annealing
    tFinal  = 1                 # Setting of stop temperature
    alfa    = 0.98              # Set cooling parameter, T(k)=alfa*T(k-1)
    nMarkov = 100            	# Markov chain length, i.e., number of inner loop runs
    youcans = 0.5               # Define the search step, which can be set to a fixed value or tapered.
    print(tInitial, tFinal, alfa, nMarkov, youcans)

3. Importing data with Pandas

Although many contest problems can be solved with directly assigned data, the mainstream way to import data is still to read data files.

  • Data file formats commonly used in mathematical modeling are text files (.txt), Excel files (.xls, .xlsx) and csv files (.csv).
  • When reading a text file, you will encounter different data separators such as commas, spaces, tabs, and so on.
  • When reading an Excel file, first of all, the .xls format is different from .xlsx; secondly, you have to consider whether the data table has a header row, and sometimes there is more than one worksheet in the file.
  • Missing data and illegal characters are also encountered when reading files.

Dealing with all of this is distracting for beginners, especially during a competition.

There are many ways to read data files in Python. This article does not recommend using Python's built-in file operations such as open, close, read and readline, but recommends using Pandas to read data files instead.

The reasons:

  • Pandas provides read and write functions for a variety of commonly used file formats, and all of the cases above can be handled with a single line of code.
  • Pandas is a data analysis toolkit built on NumPy; it makes organizing and cleaning data easy, and it is simple and flexible to operate.
  • Pandas provides convenient conversion to and from various other data structures (a short sketch follows this list).
  • Many mathematical modeling routines use Pandas' Series and DataFrame data structures directly, without any conversion.
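
As a small illustration of the last two points, the sketch below uses made-up data and column names: a missing value is filled, and the DataFrame is converted to a NumPy array and back, using only standard Pandas calls (fillna, to_numpy, the DataFrame constructor).

import pandas as pd

# Made-up data for illustration: one missing value in the "price" column
df = pd.DataFrame({"price": [3.85, 3.75, None], "sales": [7.38, 8.51, 9.52]})
df["price"] = df["price"].fillna(df["price"].mean())  # fill the missing value with the column mean
arr = df.to_numpy()                                    # DataFrame -> NumPy array, for array-based algorithm code
df2 = pd.DataFrame(arr, columns=df.columns)            # NumPy array -> DataFrame again
print(arr.shape)
print(df2)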

3.1 Pandas Reading Excel Files

Pandas uses the read_excel() function to read an Excel file.

pd.read_excel(io, sheet_name=0, header=0, index_col=None, names=None)

Main parameters of pd.read_excel():

  • io : path to the file (including the file name).
  • header : specifies the row to be used as the column names. The default is 0, which means the first row is the header row. Setting header=None means that there is no header row and the first row is a data row.
  • sheet_name : specifies the worksheet. The default is sheet_name=0, which reads the first sheet. Set sheet_name=None to return all sheets (as a dict of DataFrames), or sheet_name=[0, 1] to return several sheets. (Very old Pandas versions call this parameter sheetname.)
  • index_col : specifies the column number or column name to be used as the row index.
  • names : specifies the names of the columns, of type list.

Example of using pd.read_excel().

# sheet_name selects the worksheet to read; header=0 means the first row is the header row, header=None means the first row is a data row.
df = pd.read_excel("data/", sheet_name='Sheet1', header=0)
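
If the workbook has no header row, or more than one worksheet is needed, the same function still applies. A short sketch, assuming a hypothetical file name data/demo.xlsx:

# "data/demo.xlsx" is a hypothetical file name; replace it with your own Excel file
df1 = pd.read_excel("data/demo.xlsx", header=None, names=["x1", "x2", "y"])  # no header row, assign column names
dfs = pd.read_excel("data/demo.xlsx", sheet_name=None)  # read all worksheets; returns a dict of DataFrames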

3.2 Pandas Reading CSV Files

Pandas uses the pandas.read_csv() function to read csv files.

pd.read_csv(filepath, sep=',', header='infer', names=None, index_col=None)

Main parameters of pd.read_csv().
  • filepath : File path (including file name).
  • sep: specify the separator. Default is comma ',', other separators can be set as needed.
  • header : Specifies the row to be used as the column names. The default header='infer' normally behaves like header=0, i.e. the first row is the header row; set header=None if the file has no header row, so that the first row is treated as data.
  • index_col : Specifies the column number or column name to be used as the row index.
  • names: Specifies the names of the columns, of type list.
Example of using pd.read_csv().
# sep=',' means the separator is a comma, header=0 means the first row is the header row, header=None means the first row is a data row
df = pd.read_csv("data/", header=0, sep=',') 
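
If the csv file uses a different separator or has no header row, only the arguments change. A short sketch, assuming a hypothetical file name data/demo.csv:

# "data/demo.csv" is a hypothetical file name; replace it with your own csv file
df1 = pd.read_csv("data/demo.csv", sep=";", header=0)              # semicolon-separated, first row is the header row
df2 = pd.read_csv("data/demo.csv", header=None, names=["x", "y"])  # no header row, assign column names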

3.3 Pandas Reading Text Files

Text files such as .txt and .dat can be read with the pandas.read_table() function.

pd.read_table(filepath, sep='\t', header='infer', names=None, index_col=None)

Main parameters of pd.read_table().
  • filepath : File path (including file name).
  • sep : Specifies the separator. The default is a tab ('\t'). Other separators can be set as needed.
  • header : Specifies the row to be used as the column names. The default header='infer' normally behaves like header=0, i.e. the first row is the header row; set header=None if the file has no header row, so that the first row is treated as data.
  • index_col : Specifies the column number or column name to be used as the row index.
  • names : Specifies the names of the columns, of type list.
Example of using pd.read_table().
# sep='\t' means that the separator is a tab, header=None means that there is no header line and the first line is data
df = pd.read_table("data/", sep="\t", header=None)
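
For text files whose columns are separated by a variable number of spaces (the "one space or several spaces" situation mentioned in section 2.2), a whitespace pattern can be used as the separator. A short sketch, assuming a hypothetical file name data/demo.dat:

# "data/demo.dat" is a hypothetical file name; replace it with your own text file
df = pd.read_table("data/demo.dat", sep=r"\s+", header=0)  # one or more whitespace characters as the separator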

3.4 Pandas Reading Other File Formats

Pandas also provides functions for reading several other file formats.

Their usage is similar, and each is a one-line call (a short sketch follows the list below). For example:

  • pandas.read_sql : read from a SQL database
  • pandas.read_html : grab table data from a web page
  • pandas.read_json : read a JSON data file
  • pandas.read_clipboard : read the clipboard contents
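
A short sketch of this one-line pattern, assuming a hypothetical file name data/demo.json (read_html is only shown as a comment, since it needs a page that actually contains tables):

# "data/demo.json" is a hypothetical file name; replace it with your own JSON file
dfJson = pd.read_json("data/demo.json")  # read a JSON data file into a DataFrame
# pd.read_html(url) returns a list of DataFrames, one per table found in the page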

Since these formats are rarely used in modeling contests, this article does not cover them in detail. Readers who need them can search for references by function name, or consult the official documentation:

  • Documentation of the Pandas input/output functions: https://pandas.pydata.org/pandas-docs/stable/reference/io.html

In addition, for big-data problems the amount of data to be processed may be very large, and files may need to be split or merged; this can also be done with Pandas, and will be explained in subsequent articles together with specific problems.

4. Data import routines

[Important note] Although the sections above introduce the basic methods of data import, I am afraid they are still hard to digest, absorb and put to use.
To address this, this article integrates the material into a routine that readers can study and collect, and that is also easy to use and modify.

Routine 01: Reading a data file

# mathmodel01_v1.py
# Demo01 of mathematical modeling algorithm
# Read data files into DataFrame.
# Copyright 2021 Youcans, XUPT
# Created: 2021-05-27
import pandas as pd
# Read the data file
def readDataFile(readPath):  # readPath: address and filename of the data file
    # readPath = "... /data/" # The file path can also be entered directly here
    try:
        if (readPath[-4:] == ".csv"):
            dfFile = pd.read_csv(readPath, header=0, sep=",")  # comma separator, first row is the header row
            # dfFile = pd.read_csv(filePath, header=None, sep=",")  # sep: separator, no header row
        elif (readPath[-4:] == ".xls") or (readPath[-5:] == ".xlsx"):  # sheet_name defaults to 0
            dfFile = pd.read_excel(readPath, header=0)  # the first row is the header row
            # dfFile = pd.read_excel(filePath, header=None)  # no header row
        elif (readPath[-4:] == ".dat"):  # sep: separator, header: whether the first row is a header row
            dfFile = pd.read_table(readPath, sep=" ", header=0)  # separator is a space, first row is the header row
            # dfFile = pd.read_table(filePath, sep=",", header=None)  # separator is a comma, no header row
        else:
            print("Unsupported file formats.")
            return  # unsupported file type, return None
    except Exception as e:
        print("Failed to read data file:{}".format(str(e)))
        return
    return dfFile

# Main program
def main():

    # Reading data files # Youcans, XUPT
    readPath = "../data/"  # Address and file name of data file
    dfFile = readDataFile(readPath)  # Call the read file subroutine
    print(type(dfFile))  # View the dfFile data type
    print(dfFile.shape)  # View the dfFile shape (number of rows, number of columns)
    print(dfFile.head())  # Display the first 5 rows of dfFile data
    return
if __name__ == '__main__':  # Youcans, XUPT
    main()

Routine 01 run results:

<class 'pandas.core.frame.DataFrame'>
(30, 6)
   period  price  average  advertise  difference  sales
0       1   3.85     3.80       5.50       -0.05   7.38
1       2   3.75     4.00       6.75        0.25   8.51
2       3   3.70     4.30       7.25        0.60   9.52
3       4   3.70     3.70       5.50        0.00   7.50
4       5   3.60     3.85       7.00        0.25   9.33

1. This routine reads a data file stored in the ../data/ directory. Readers need to modify the file path and file name of this data file in order to read the local file they need.

2. This routine can automatically identify the file type according to the suffix of the file name and call the corresponding function to read the file.

3. The read file module in this routine uses the try...except statement for simple exception handling. If the read fails, you can find the error according to the type of exception thrown.

This concludes this article on getting started with data import for Python mathematical modeling (for beginners). For more on data import for Python mathematical modeling, please search my earlier articles, and I hope you will support me in the future!