Preamble
Hello everyone. In previous articles we covered many Python data-processing methods, such as reading data, handling missing values, and dimensionality reduction, and also introduced data-visualization tools such as Matplotlib and pyecharts. Once these basic skills are mastered, deeper analysis requires a number of commonly used modeling methods, so this article explains how to do statistical analysis with Python. As in previous articles, we only discuss how to implement things in code, without theoretical derivations or extensive interpretation of the results (thorough derivations and analyses of the common models are easy to find elsewhere). Readers should therefore already be familiar with some basic statistical models, such as regression models and time series.
Introduction to Statsmodels
The most commonly used module for statistical modeling and analysis in Python is Statsmodels. Statsmodels is a Python library mainly used for statistical computation and statistical modeling, with the following features:
- Exploratory analysis: includes exploratory data analysis methods such as contingency tables and multiple imputation by chained equations (MICE), as well as visualization of statistical model results, such as fit plots, box plots, correlation plots, and time-series plots.
- Regression models: linear regression models, nonlinear regression models, generalized linear models, linear mixed-effects models, etc.
- Other functions: parameter estimation for models such as analysis of variance (ANOVA) and time-series analysis, along with hypothesis testing of the estimated parameters.
Install: pip install statsmodels
Documentation: /statsmodels/statsmodels
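After installing, a quick sanity check is to import the package and print its version; the exact version string will depend on your environment:

```python
import statsmodels

# A successful import confirms the installation; the version varies by machine
print(statsmodels.__version__)
```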
Linear regression models: ordinary least squares estimation
Linear models include Ordinary Least Squares (OLS), Generalized Least Squares (GLS), Weighted Least Squares (WLS), etc. Statsmodels supports linear models well; let's look at the simplest example, Ordinary Least Squares (OLS).
First import the relevant packages
%matplotlib inline
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std

np.random.seed(9876789)
Then create the data, first setting the sample size to 100
nsample = 100  # sample size
Then set up x1 and x2, with x1 an equally spaced sequence from 0 to 10 and x2 the square of x1:
x = np.linspace(0, 10, 100)
X = np.column_stack((x, x**2))
Then set beta, the error term, and the response variable y:
beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)
X = sm.add_constant(X)
y = np.dot(X, beta) + e
Then build the regression model:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
View the model results:
Isn't this close to the output format of the R language? The regression coefficients, p-values, R-squared, and other metrics for evaluating the regression model are all there, and you can use dir(results) to list all available attributes and retrieve the ones you need.
print('Parameters: ', results.params)
print('R2: ', results.rsquared)
The fitted regression model is then y = 1.3423 - 0.0402*x1 + 10.0103*x2; further optimization of this model is left to the reader. Next we plot the sample points against the regression curve:
y_fitted = results.fittedvalues
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y, 'o', label='data')
ax.plot(x, y_fitted, 'r--.', label='OLS')
ax.legend(loc='best')
Time series: ARMA
There are many time-series models; here we take the ARMA model as an example. First import the relevant packages and generate the data:
%matplotlib inline
import numpy as np
import statsmodels.api as sm
import pandas as pd
from statsmodels.tsa.arima_process import arma_generate_sample

np.random.seed(12345)
arparams = np.array([.75, -.25])
maparams = np.array([.65, .35])
arparams = np.r_[1, -arparams]
maparams = np.r_[1, maparams]
nobs = 250
y = arma_generate_sample(arparams, maparams, nobs)
Next, we can add some date information. For this example, we'll use a pandas time series and build the model
dates = sm.tsa.datetools.dates_from_range('1980m1', length=nobs)
y = pd.Series(y, index=dates)
arma_mod = sm.tsa.ARMA(y, order=(2, 2))
arma_res = arma_mod.fit(trend='nc', disp=-1)
Finally, make a prediction:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 8))
fig = arma_res.plot_predict(start='1999-06-30', end='2001-05-31', ax=ax)
legend = ax.legend(loc='upper left')
Regression diagnostics: estimating regression models
First import the relevant packages
%matplotlib inline
from statsmodels.compat import lzip
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
import matplotlib.pyplot as plt
Then load the data
url = '/vincentarelbundock/Rdatasets/master/csv/HistData/'
dat = pd.read_csv(url)
Fit a model:
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
View the results:
print(results.summary())
Regression diagnostics: normality of residuals
Jarque-Bera test:
name = ['Jarque-Bera', 'Chi^2 two-tail prob.', 'Skew', 'Kurtosis']
test = sms.jarque_bera(results.resid)
lzip(name, test)

# Results:
[('Jarque-Bera', 3.3936080248431666), ('Chi^2 two-tail prob.', 0.1832683123166337), ('Skew', -0.48658034311223375), ('Kurtosis', 3.003417757881633)]
Omnibus test:
name = ['Chi^2', 'Two-tail probability']
test = sms.omni_normtest(results.resid)
lzip(name, test)

# Results:
[('Chi^2', 3.713437811597181), ('Two-tail probability', 0.15618424580304824)]
Regression diagnostics: heteroscedasticity
Breusch-Pagan test:
name = ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']
test = sms.het_breuschpagan(results.resid, results.model.exog)
lzip(name, test)

# Results:
[('Lagrange multiplier statistic', 4.893213374093957), ('p-value', 0.08658690502352209), ('f-value', 2.503715946256434), ('f p-value', 0.08794028782673029)]

Goldfeld-Quandt test:
name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(results.resid, results.model.exog)
lzip(name, test)

# Results:
[('F statistic', 1.1002422436378152), ('p-value', 0.3820295068692507)]
Regression diagnostics: multicollinearity
Multicollinearity can be checked with the condition number of the design matrix:

np.linalg.cond(results.model.exog)
The result is 702.1792145490062, indicating strong multicollinearity.
Concluding remarks
The above covers the basic functions of Statsmodels; readers familiar with R will find that many of the commands are similar to R's. One last point: this article deliberately contains little of the models' theoretical background, because detailed, high-quality derivations are easy to find with a search. So after learning how to implement a model on the computer, be sure to go back and understand how each parameter is obtained and what it really means in practice.
That concludes this detailed look at statistical modeling with Python. For more on Python statistical modeling, please see my other related articles!