SoFunction
Updated on 2024-11-17

Python for statistical modeling

Preamble

Hello everyone. In previous articles we covered many Python data-processing techniques, such as reading data, handling missing values, and dimensionality reduction, as well as data visualization tools such as Matplotlib and pyecharts. With those basic skills in hand, deeper analysis requires some commonly used modeling methods, so this article explains how to do statistical analysis in Python. As in previous articles, we only show how to implement things in code, without theoretical derivations or extended interpretation of the results (thorough derivations of these common models are easy to find). Readers are therefore expected to know some basic statistical models, such as regression models and time series models.

Introduction to Statsmodels

The most commonly used Python module for statistical modeling and analysis is Statsmodels. Statsmodels is a Python library mainly used for statistical computation and statistical modeling. It has the following features:

  • Exploratory analysis: includes exploratory data analysis methods such as contingency tables and multiple imputation by chained equations (MICE), plus visualization of statistical model results, such as fit plots, box plots, correlation plots, and time series plots.
  • Regression models: linear regression models, nonlinear regression models, generalized linear models, linear mixed effects models, etc.
  • Other functions: parameter estimation and hypothesis testing for models such as analysis of variance (ANOVA) and time series models.

Install: pip install statsmodels
Documentation: https://github.com/statsmodels/statsmodels

Linear regression models: ordinary least squares estimation

Linear models include Ordinary Least Squares (OLS), Generalized Least Squares (GLS), Weighted Least Squares (WLS), etc. Statsmodels has good support for linear models; let's look at the simplest example, Ordinary Least Squares (OLS).

First import the relevant packages

%matplotlib inline
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
np.random.seed(9876789)

Then create the data, first setting the sample size to 100

nsample = 100  # sample size

Then set up x1 and x2, with x1 being 100 equally spaced points from 0 to 10 and x2 being the square of x1

x = np.linspace(0, 10, 100)
X = np.column_stack((x, x**2))

Then set beta, the error term and the response variable y

beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)
X = sm.add_constant(X)
y = np.dot(X, beta) + e
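For intuition, sm.add_constant simply prepends a column of ones to the design matrix so the model can fit an intercept. A minimal numpy-only sketch of that behavior (illustrative, not the statsmodels implementation):

```python
import numpy as np

# Toy design matrix with columns x and x**2, as in the example above
x = np.arange(3)
X = np.column_stack((x, x ** 2))

# What add_constant does under default settings: prepend a column of ones
X_const = np.column_stack((np.ones(len(X)), X))
print(X_const)
# [[1. 0. 0.]
#  [1. 1. 1.]
#  [1. 2. 4.]]
```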

Then the regression model is built

model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

View model results

Isn't this output close to what R produces? The regression coefficients, p-values, R-squared, and the other statistics for evaluating a regression model are all there, and you can call dir(results) to list all available attributes and retrieve them individually.

print('Parameters: ', results.params)
print('R2: ', results.rsquared)
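For intuition about where these coefficients come from: OLS solves the normal equations β̂ = (XᵀX)⁻¹Xᵀy. A minimal numpy-only sketch on the same kind of toy data (np.linalg.lstsq is the numerically stable way to solve them; sm.OLS additionally provides the inference shown in the summary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
X = np.column_stack((np.ones_like(x), x, x ** 2))  # design matrix with intercept
beta_true = np.array([1.0, 0.1, 10.0])
y = X @ beta_true + rng.normal(size=len(x))

# Least-squares solution of the normal equations
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to [1, 0.1, 10]
```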

The fitted regression model is therefore y = 1.3423 - 0.0402x1 + 10.0103x2. This model can of course be optimized further, which is left to the reader. Next we plot the sample points against the fitted curve

y_fitted = results.fittedvalues
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y, 'o', label='data')
ax.plot(x, y_fitted, 'r--.', label='OLS')
ax.legend(loc='best')

Time series: ARMA

There are many time series models; here we take the ARMA model as an example. First import the relevant packages and generate the data

%matplotlib inline
import numpy as np
import statsmodels.api as sm
import pandas as pd
from statsmodels.tsa.arima_process import arma_generate_sample
np.random.seed(12345)

arparams = np.array([.75, -.25])
maparams = np.array([.65, .35])

arparams = np.r_[1, -arparams]
maparams = np.r_[1, maparams]
nobs = 250
y = arma_generate_sample(arparams, maparams, nobs)
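With these parameters, arma_generate_sample simulates the ARMA(2,2) process y_t = 0.75·y_{t-1} − 0.25·y_{t-2} + e_t + 0.65·e_{t-1} + 0.35·e_{t-2}. A hand-rolled numpy sketch of that recursion (illustrative only; it ignores the burn-in handling that arma_generate_sample performs):

```python
import numpy as np

rng = np.random.default_rng(12345)
ar = [0.75, -0.25]   # AR coefficients
ma = [0.65, 0.35]    # MA coefficients
nobs = 250
e = rng.normal(size=nobs)

# ARMA(2,2) recursion; the first two values are left at zero (no burn-in)
y = np.zeros(nobs)
for t in range(2, nobs):
    y[t] = (ar[0] * y[t - 1] + ar[1] * y[t - 2]
            + e[t] + ma[0] * e[t - 1] + ma[1] * e[t - 2])
print(y[:5])
```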

Next, we can add some date information. For this example, we'll use a pandas time series and build the model

dates = sm.tsa.datetools.dates_from_range('1980m1', length=nobs)
y = pd.Series(y, index=dates)
arma_mod = sm.tsa.ARMA(y, order=(2, 2))  # removed in statsmodels >= 0.13; use statsmodels.tsa.arima.model.ARIMA there
arma_res = arma_mod.fit(trend='nc', disp=-1)

Finally, make a prediction.

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 8))
fig = arma_res.plot_predict(start='1999-06-30', end='2001-05-31', ax=ax)
legend = ax.legend(loc='upper left')

Regression diagnostics: estimating regression models

First import the relevant packages

%matplotlib inline
from statsmodels.compat import lzip
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
import matplotlib.pyplot as plt

Then load the data

url = 'https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/HistData/Guerry.csv'
dat = pd.read_csv(url)

Fit a model

results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()

View the results

print(results.summary())

Regression diagnostics: normality of residuals

Jarque-Bera test:

name = ['Jarque-Bera', 'Chi^2 two-tail prob.', 'Skew', 'Kurtosis']
test = sms.jarque_bera(results.resid)
lzip(name, test)
#### results
[('Jarque-Bera', 3.3936080248431666),
('Chi^2 two-tail prob.', 0.1832683123166337),
('Skew', -0.48658034311223375),
('Kurtosis', 3.003417757881633)]
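As a sanity check, the Jarque-Bera statistic is computed from the residual skewness S and kurtosis K as JB = n/6 · (S² + (K−3)²/4). Plugging in the skewness and kurtosis listed above with n = 86 observations (the Guerry dataset has 86 rows) reproduces the statistic:

```python
# Values taken from the test output above; n = 86 rows in the Guerry data
n = 86
skew = -0.48658034311223375
kurt = 3.003417757881633

jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
print(jb)  # ≈ 3.3936, matching the Jarque-Bera value above
```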

Omnibus test:

name = ['Chi^2', 'Two-tail probability']
test = sms.omni_normtest(results.resid)
lzip(name, test)
#### results
[('Chi^2', 3.713437811597181), ('Two-tail probability', 0.15618424580304824)]

Regression diagnosis: heteroscedasticity

Breusch-Pagan test:

name = ['Lagrange multiplier statistic', 'p-value',
    'f-value', 'f p-value']
test = sms.het_breuschpagan(results.resid, results.model.exog)
lzip(name, test)
### Results
[('Lagrange multiplier statistic', 4.893213374093957),
('p-value', 0.08658690502352209),
('f-value', 2.503715946256434),
('f p-value', 0.08794028782673029)]

Goldfeld-Quandt test:

name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(results.resid, results.model.exog)
lzip(name, test)
#### results
[('F statistic', 1.1002422436378152), ('p-value', 0.3820295068692507)]

Regression diagnostics: multicollinearity

Multicollinearity can be checked via the condition number of the design matrix:

np.linalg.cond(results.model.exog)

The result is 702.1792145490062, indicating strong multicollinearity.
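Another common multicollinearity check is the variance inflation factor, VIF_j = 1/(1 − R²_j), where R²_j comes from regressing regressor j on the remaining regressors (statsmodels provides this as statsmodels.stats.outliers_influence.variance_inflation_factor). A numpy-only sketch on toy data, with a deliberately near-collinear pair of columns (variable names are illustrative):

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress X[:, j] on the remaining columns plus an intercept."""
    y = X[:, j]
    Z = np.delete(X, j, axis=1)
    Z = np.column_stack((np.ones(len(Z)), Z))      # add intercept
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)              # nearly collinear with x1
x3 = rng.normal(size=200)                          # independent regressor
X = np.column_stack((x1, x2, x3))
print([round(vif(X, j), 1) for j in range(3)])     # x1 and x2 get large VIFs, x3 stays near 1
```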

Concluding remarks

The above covers the basic functions of Statsmodels; readers familiar with R will find many of the commands similar. One last note: this article deliberately includes little theory, since detailed derivations of these models are easy to find online. After learning how to run them on a computer, be sure to go back and understand how each parameter is obtained and what it means in practice.

That concludes this detailed introduction to statistical modeling with Python. For more on Python statistical modeling, please see my other related articles!