
Python mathematical modeling: a detailed example of linear regression with StatsModels

1. Background knowledge

1.1 Interpolation, Fitting, Regression and Prediction

Interpolation, fitting, regression, and prediction are all concepts that are frequently mentioned in mathematical modeling and are often conflated.

  • Interpolation constructs a continuous function from discrete data such that the continuous curve passes through all of the given data points. Interpolation is an important method of discrete function approximation; it can be used to estimate the approximate value of a function at other points from its values at a finite number of points.
  • Fitting finds a continuous function (curve) that comes close to the given discrete data, so that it fits the given data.

Therefore, interpolation and fitting both construct an approximate curve from known data points according to their pattern of change, but interpolation requires the approximate curve to pass exactly through the given data points, while fitting only requires the curve to come as close as possible to the data points overall and to reflect the trend of the data. Interpolation can be viewed as a special kind of fitting, one in which the error function is required to be zero. Because data points usually carry errors, zero error often means overfitting, and an overfitted model generalizes poorly to data outside the training set. In practice, therefore, interpolation is mostly used in image processing, while fitting is mostly used in processing experimental data.
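To make the distinction concrete, here is a minimal sketch (hypothetical data, not part of this article's example) that interpolates and fits the same noisy samples; the interpolating curve passes through every point, while the fitted line only follows the overall trend:

import numpy as np
from scipy.interpolate import interp1d

x = np.linspace(0, 10, 11)
y = 2.0 * x + 1.0 + np.random.normal(scale=0.5, size=len(x))  # noisy samples

fInterp = interp1d(x, y, kind='cubic')  # interpolation: passes through every data point
pFit = np.polyfit(x, y, deg=1)          # fitting: the line closest to the points overall

xNew = 5.5
print(fInterp(xNew))           # follows the noise exactly between the given points
print(np.polyval(pFit, xNew))  # smooths the noise and reflects the overall trend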

  • Regression is a method of statistical analysis for studying the relationship between one set of random variables and another. It involves building a mathematical model, estimating the model parameters, and testing the credibility of the model; it also involves using the fitted model and estimated parameters for prediction or control.
  • Prediction is a very broad concept. In numerical modeling it refers to the quantitative study of the available data and information, building a mathematical model suited to the purpose of the forecast, and then quantitatively predicting future development and change. Interpolation and fitting are generally considered prediction methods.

Regression is a method of data analysis, while fitting is a specific method of data processing. Fitting focuses on optimizing the parameters of a curve so that the curve matches the data; regression focuses on examining the relationship between two or more variables.

1.2 Linear regression

Regression analysis is a method of statistical analysis that examines the quantitative relationship between independent and dependent variables, and is often used in predictive analytics, time series modeling, and to discover causal relationships between variables. According to the type of relationship between variables, regression analysis can be divided into linear regression and non-linear regression.

Linear regression assumes that there is a linear relationship between the target (y) and the features (X) in a given data set, i.e., that they satisfy a linear (first-degree) equation. If the regression analysis includes only one independent variable and one dependent variable, and the relationship between them can be approximated by a straight line, it is called univariate linear regression; if it includes two or more independent variables, and the relationship between the dependent variable and the independent variables is linear, it is called multiple linear regression.

Based on the sample data, the least squares method can be used to estimate the parameters of the linear regression model by minimizing the sum of squared errors between the model values computed from the estimated parameters and the given sample data.
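The idea can be illustrated with a minimal sketch (hypothetical data, not from this article's example): build the design matrix and solve the least squares problem directly with numpy, which minimizes exactly the sum of squared errors described above:

import numpy as np

x = np.array([0., 1., 2., 3., 4.])
y = np.array([2.1, 4.2, 5.8, 8.1, 9.9])
X = np.column_stack((np.ones(len(x)), x))    # design matrix [1, x]
beta = np.linalg.lstsq(X, y, rcond=None)[0]  # least squares estimate of [b0, b1]
print(beta)  # approximately [2.12, 1.95], i.e. y ≈ 2.12 + 1.95*x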

Further, it is necessary to analyze whether linear regression is appropriate for the sample data at all: is the assumption of a linear correlation reasonable, and is the linear model stable? This calls for a significance test, using statistical analysis to check whether the linear relationship between the dependent and independent variables is significant and whether a linear model is appropriate for describing their relationship.
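As an illustration of the overall significance test, the F-statistic can be computed from R-squared as F = (R²/m) / ((1−R²)/(n−m−1)), where n is the number of samples and m the number of independent variables. A minimal sketch (it uses scipy, which is otherwise not needed in this article):

from scipy import stats

def f_test(r_squared, n, m):
    F = (r_squared / m) / ((1 - r_squared) / (n - m - 1))
    pValue = stats.f.sf(F, m, n - m - 1)  # probability of a larger F under the null hypothesis
    return F, pValue

print(f_test(0.961, 100, 1))  # F ≈ 2415, close to the 2431 in Section 3.2 (difference due to rounding of R-squared)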

2. Statsmodels for linear regression

This section introduces linear fitting and regression analysis using the Statsmodels statistical analysis package. The linear model can be expressed as the following equation:

Y = β0 + β1*X1 + β2*X2 + ... + βm*Xm + e

2.1 Importing the Toolkit

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

2.2 Importing sample data

Sample data is usually stored in a data file and obtained by reading that file. For ease of reading and testing the program, this article uses random numbers to generate the sample data; a sketch of importing data from a file is given at the end of this subsection.

# Generate sample data.
nSample = 100
x1 = np.linspace(0, 10, nSample)  # nSample evenly spaced points from 0 to 10
e = np.random.normal(size=len(x1))  # normally distributed random noise
yTrue = 2.36 + 1.58 * x1  # ideal model: y = b0 + b1*x1
yTest = yTrue + e  # simulated test data

This case is a univariate linear regression problem: (yTest, x1) is the imported sample data, and we need to obtain the quantitative relationship between the dependent variable y and the independent variable x through linear regression. yTrue is the value of the ideal model; yTest simulates experimental test data by adding normally distributed random errors to the ideal model.
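For reference, here is a minimal sketch of importing sample data from a file instead (the file name data.csv and the column names x and y are hypothetical):

import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical CSV file with columns 'x' and 'y'
x1 = df['x'].values  # independent variable as a 1-d array
yTest = df['y'].values  # dependent variable as a 1-d array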

2.3 Modeling and Fitting

The equation of the univariate linear regression model is:

y = β0 + β1 * x + e

After adding an intercept column to the matrix X with sm.add_constant(), an ordinary least squares model is built with sm.OLS(); the model is then fitted with model.fit(), which returns a results object whose summary() method reports the results of the fit and the statistical analysis.

X = sm.add_constant(x1)  # add intercept column x0=[1,...,1] to the left of x1
model = sm.OLS(yTest, X)  # build the least squares model (OLS)
results = model.fit()  # fit the model and return the results

sm.OLS() is provided by statsmodels.regression.linear_model and takes 4 arguments (endog, exog, missing, hasconst).

The first argument, endog, is the dependent variable y(t) in the regression model; it is a 1-d array.

The second argument, exog, holds the independent variables x0(t), x1(t), ..., xm(t); it is a 2-d array with m+1 columns.
Note that the OLS regression model has no constant term by default; its form is:
y = B*X + e = β0*x0 + β1*x1 + e, x0 = [1,...,1]
The previously imported data (yTest, x1) does not contain x0, so an intercept column x0=[1,...,1] must be added to the left of x1 to form the design matrix X = (x0, x1). The function sm.add_constant() accomplishes this, as the sketch below shows.
The argument missing is used for data checking, and hasconst for checking constants; neither is needed in general.
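To verify what sm.add_constant() does, here is a minimal sketch showing that it simply prepends the intercept column of ones to form the design matrix X = (x0, x1):

import numpy as np
import statsmodels.api as sm

x1 = np.linspace(0, 10, 5)
X = sm.add_constant(x1)  # prepends the intercept column x0=[1,...,1]
Xmanual = np.column_stack((np.ones(len(x1)), x1))  # equivalent manual construction
print(np.allclose(X, Xmanual))  # True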

2.4 Output of fitting and statistical results

The output of the linear regression analysis performed by Statsmodels is very rich; results.summary() returns a summary of the regression analysis.

print(results.summary())  # Output a summary of the regression analysis

The summary contains a great deal of information; the most important results, found in the middle section of the summary, are discussed here.

==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4669      0.186     13.230      0.000       2.097       2.837
x1             1.5883      0.032     49.304      0.000       1.524       1.652
==============================================================================

coef: Regression coefficient, which is the estimated value of the model parameters β0, β1, ....

std err: The standard error of the coefficient estimate, the arithmetic square root of its estimated variance; it reflects the uncertainty of the estimated regression coefficient. The larger the standard error, the less reliable the regression coefficient.

t: The t-statistic, equal to the regression coefficient divided by its standard error; it is used to test each regression coefficient separately, checking whether the effect of each independent variable on the dependent variable is significant. If the effect of an independent variable xi is not significant, that variable can be eliminated from the model.

P>|t|: The p-value of the t-test (Prob(t-Statistic)), which reflects the significance of the hypothesized correlation between each independent variable xi and the dependent variable y. If p < 0.05, the regression relationship between xi and y can be regarded as significant at the 0.05 significance level.

[0.025, 0.975]: The lower and upper bounds of the confidence interval of a regression coefficient, which contains the regression coefficient with 95% confidence. Note that this does not mean that 95% of the sample data falls into this interval.

In addition, there are some important indicators to keep an eye on:

R-squared: The coefficient of determination, indicating the joint influence of all independent variables on the dependent variable; it measures the goodness of fit of the regression equation. The closer it is to 1, the better the fit.

F-statistic: The F-statistic, used to test the significance of the overall regression equation, i.e., whether the effect of all the independent variables on the dependent variable, taken as a whole, is significant.

Statsmodels also exposes the regression analysis data through attributes of the results object, for example:

print("OLS model: Y = b0 + b1 * x") # b0: intercept of regression line, b1: slope of regression line
print('Parameters: ', ) # Output: coefficients of the fitted model
yFit = # y value calculated by the fitted model
(x1, yTest, 'o', label="data") # raw data
(x1, yFit, 'r-', label="OLS") # Fitting the data
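Other quantities from the summary are available through attributes as well; the sketch below lists the attributes of the statsmodels RegressionResults API that mirror the fields discussed above:

print('R-squared: ', results.rsquared)      # coefficient of determination
print('F-statistic: ', results.fvalue)      # F-statistic of the overall test
print('Prob (F): ', results.f_pvalue)       # p-value of the F-test
print('std err: ', results.bse)             # standard errors of the coefficients
print('t: ', results.tvalues)               # t-statistics of the coefficients
print('P>|t|: ', results.pvalues)           # p-values of the t-tests
print('conf. int.:\n', results.conf_int())  # 95% confidence intervals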

3. Univariate linear regression

3.1 Python program for univariate linear regression:

# LinearRegression_v1.py
# Linear Regression with statsmodels (OLS: Ordinary Least Squares)
# v1.0: Calling statsmodels to implement univariate linear regression
# Date: 2021-05-04
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

def main():  # Main program
    # Generate test data.
    nSample = 100
    x1 = np.linspace(0, 10, nSample)  # nSample evenly spaced points from 0 to 10
    e = np.random.normal(size=len(x1))  # normally distributed random noise
    yTrue = 2.36 + 1.58 * x1  # ideal model: y = b0 + b1*x1
    yTest = yTrue + e  # simulated test data

    # Univariate linear regression: least squares (OLS)
    X = sm.add_constant(x1)  # add intercept column x0=[1,...,1] to matrix X
    model = sm.OLS(yTest, X)  # build the least squares model (OLS)
    results = model.fit()  # fit the model and return the results
    yFit = results.fittedvalues  # y values of the fitted model
    prstd, ivLow, ivUp = wls_prediction_std(results)  # return standard deviation and confidence interval

    # OLS model: Y = b0 + b1*X + e
    print(results.summary())  # output a summary of the regression analysis
    print("\nOLS model: Y = b0 + b1 * x")  # b0: intercept of the regression line, b1: slope
    print('Parameters: ', results.params)  # output: coefficients of the fitted model

    # Plot: raw data points, fitted line, confidence interval
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(x1, yTest, 'o', label="data")  # raw data
    ax.plot(x1, yFit, 'r-', label="OLS")  # fitted data
    ax.plot(x1, ivUp, '--', color='orange', label="upConf")  # upper limit of 95% confidence interval
    ax.plot(x1, ivLow, '--', color='orange', label="lowConf")  # lower limit of 95% confidence interval
    ax.legend(loc='best')  # show legend
    plt.title('OLS linear regression')
    plt.show()
    return

if __name__ == '__main__':  # YouCans, XUPT
    main()

3.2 Results of running the univariate linear regression program:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.961
Model:                            OLS   Adj. R-squared:                  0.961
Method:                 Least Squares   F-statistic:                     2431.
Date:                Wed, 05 May 2021   Prob (F-statistic):           5.50e-71
Time:                        16:24:22   Log-Likelihood:                -134.62
No. Observations:                 100   AIC:                             273.2
Df Residuals:                      98   BIC:                             278.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4669      0.186     13.230      0.000       2.097       2.837
x1             1.5883      0.032     49.304      0.000       1.524       1.652
==============================================================================
Omnibus:                        0.070   Durbin-Watson:                   2.016
Prob(Omnibus):                  0.966   Jarque-Bera (JB):                0.187
Skew:                           0.056   Prob(JB):                        0.911
Kurtosis:                       2.820   Cond. No.                         11.7
==============================================================================
OLS model: Y = b0 + b1 * x
Parameters:  [2.46688389 1.58832741]

4. Multiple linear regression

4.1 Python program for multiple linear regression:

# LinearRegression_v2.py
# Linear Regression with statsmodels (OLS: Ordinary Least Squares)
# v2.0: Calling statsmodels to implement multiple linear regression
# Date: 2021-05-04
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

def main():  # Main program
    # Generate test data.
    nSample = 100
    x0 = np.ones(nSample)  # intercept column x0=[1,...,1]
    x1 = np.linspace(0, 20, nSample)  # nSample evenly spaced points from 0 to 20
    x2 = np.sin(x1)
    x3 = (x1 - 5) ** 2
    X = np.column_stack((x0, x1, x2, x3))  # shape (nSample,4): [x0,x1,x2,x3]
    beta = [5., 0.5, 0.5, -0.02]  # beta = [b0,b1,b2,b3]
    yTrue = np.dot(X, beta)  # vector dot product: y = b0 + b1*x1 + b2*x2 + b3*x3
    yTest = yTrue + 0.5 * np.random.normal(size=nSample)  # simulated test data

    # Multiple linear regression: least squares (OLS)
    model = sm.OLS(yTest, X)  # build the OLS model: Y = b0 + b1*X1 + ... + bm*Xm + e
    results = model.fit()  # fit the model and return the results
    yFit = results.fittedvalues  # y values of the fitted model
    print(results.summary())  # output a summary of the regression analysis
    print("\nOLS model: Y = b0 + b1*X + ... + bm*Xm")
    print('Parameters: ', results.params)  # output: coefficients of the fitted model

    # Plot: raw data points, fitted curve, confidence interval
    prstd, ivLow, ivUp = wls_prediction_std(results)  # return standard deviation and confidence interval
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(x1, yTest, 'o', label="data")  # test data (true data + noise)
    ax.plot(x1, yTrue, 'b-', label="True")  # true data
    ax.plot(x1, yFit, 'r-', label="OLS")  # fitted data
    ax.plot(x1, ivUp, '--', color='orange', label="ConfInt")  # upper confidence limit
    ax.plot(x1, ivLow, '--', color='orange')  # lower confidence limit
    ax.legend(loc='best')  # show legend
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    plt.show()
    return

if __name__ == '__main__':
    main()
    

4.2 Results of running the multiple linear regression program:

OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.932
Model:                            OLS   Adj. R-squared:                  0.930
Method:                 Least Squares   F-statistic:                     440.0
Date:                Thu, 06 May 2021   Prob (F-statistic):           6.04e-56
Time:                        10:38:51   Log-Likelihood:                -68.709
No. Observations:                 100   AIC:                             145.4
Df Residuals:                      96   BIC:                             155.8
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0411      0.120     41.866      0.000       4.802       5.280
x1             0.4894      0.019     26.351      0.000       0.452       0.526
x2             0.5158      0.072      7.187      0.000       0.373       0.658
x3            -0.0195      0.002    -11.957      0.000      -0.023      -0.016
==============================================================================
Omnibus:                        1.472   Durbin-Watson:                   1.824
Prob(Omnibus):                  0.479   Jarque-Bera (JB):                1.194
Skew:                           0.011   Prob(JB):                        0.551
Kurtosis:                       2.465   Cond. No.                         223.
==============================================================================
OLS model: Y = b0 + b1*X + ... + bm*Xm
Parameters:  [ 5.04111867  0.4893574   0.51579806 -0.01951219]


5. Appendix: Detailed description of regression results

Dep. Variable: y, the dependent variable
Model: OLS, the least squares model
Method: Least Squares
No. Observations: number of sample data points
Df Residuals: degrees of freedom of the residuals
Df Model: degrees of freedom of the model
Covariance Type: robustness of the covariance matrix (nonrobust by default)
R-squared: coefficient of determination
Adj. R-squared: adjusted coefficient of determination
F-statistic: the F-statistic of the statistical test
Prob (F-statistic): p-value of the F-test
Log-Likelihood: log-likelihood
coef: coefficients of the constant term and the independent variables, b0, b1, ..., bm
std err: standard error of the coefficient estimate
t: the t-statistic of the statistical test
P>|t|: p-value of the t-test
[0.025, 0.975]: lower and upper bounds of the 95% confidence interval for the estimated parameter
Omnibus: a test of data normality based on kurtosis and skewness
Prob(Omnibus): p-value of the Omnibus normality test
Durbin-Watson: a test for autocorrelation in the residuals
Skew: skewness, reflecting the degree of asymmetry of the data distribution
Kurtosis: kurtosis, reflecting the steepness or flatness of the data distribution
Jarque-Bera (JB): another test of data normality based on kurtosis and skewness
Prob(JB): p-value of the Jarque-Bera test
Cond. No.: condition number, used to test whether there is exact or high collinearity among the variables
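Most of these fields can also be retrieved or recomputed programmatically. A minimal sketch (assuming results is a fitted OLS results object, as in the programs above):

from statsmodels.stats.stattools import durbin_watson, jarque_bera

print(results.rsquared, results.rsquared_adj)  # R-squared, Adj. R-squared
print(results.fvalue, results.f_pvalue)        # F-statistic and its p-value
print(results.llf, results.aic, results.bic)   # Log-Likelihood, AIC, BIC
print(durbin_watson(results.resid))            # Durbin-Watson statistic
jb, jbPv, skew, kurtosis = jarque_bera(results.resid)  # Jarque-Bera test
print(jb, jbPv, skew, kurtosis)                # JB, Prob(JB), Skew, Kurtosis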

The above is a detailed example of linear regression with StatsModels statistical regression for mathematical modeling in Python. For more on statistical regression with StatsModels for mathematical modeling, please see my other related articles!