In this post, we use historical stock prices as a running example to learn NumPy's common functions: loading data from a file, reading and writing files, using NumPy's basic mathematical and statistical functions, and trying out functional programming and NumPy's linear algebra operations.
File reading
Reading and writing files is an essential skill for data analysis.
CSV (Comma-Separated Values) is a common file format. Typically, a database dump file is in CSV format, and the fields in the file correspond to the columns of the database table.
NumPy's loadtxt function makes it easy to read a CSV file: it splits the fields automatically and loads the data into NumPy arrays.
1. Save or create a new file
import numpy as np

i = np.eye(3)      # eye(n) creates an n x n identity matrix
print(i)
np.savetxt('', i)  # savetxt() creates and saves the file

savetxt() overwrites the file if it already exists; if there is no such file in the directory, it creates and saves it.
The results of the run are as follows:
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
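A minimal sketch of the save-and-reload round trip, assuming a temporary file called eye_demo.txt (a made-up name used only for this illustration):

```python
import os
import tempfile

import numpy as np

# Hypothetical file name, chosen only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "eye_demo.txt")

identity = np.eye(3)         # 3 x 3 identity matrix
np.savetxt(path, identity)   # creates the file, or overwrites an existing one

restored = np.loadtxt(path)  # whitespace is loadtxt's default delimiter
print(restored)
```

Reading the file back returns the same 3 x 3 array that was saved.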
2. Reading a CSV file with loadtxt
1) First, create the data file in the directory where you saved the program, with the contents shown below:
2) Read the file as follows:
c, v = np.loadtxt('', delimiter=',', usecols=(6, 7), unpack=True)
The usecols parameter is a tuple that selects the 7th and 8th fields (indices 6 and 7), which hold the closing price and the volume of the stock in the file above. Setting unpack to True splits the data by column, so the closing price and volume arrays are assigned to the variables c and v, respectively.
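To see what usecols and unpack do without touching the disk, here is a small sketch that reads from an in-memory StringIO object instead of a file. The two rows are borrowed from the sample data that appears later in this post; the variable name sample is just for illustration.

```python
from io import StringIO

import numpy as np

# Two rows in the same layout as the stock file:
# index 6 holds the closing price, index 7 the volume.
sample = StringIO(
    "603112,2022-4-1,,13.56,13.97,13.55,13.87,3750000\n"
    "603112,2022-4-2,,13.75,14.25,13.69,14.03,4003500\n"
)

# unpack=True returns one array per selected column.
c, v = np.loadtxt(sample, delimiter=",", usecols=(6, 7), unpack=True)
print(c)
print(v)

# Without unpack, the same call returns a single 2-D array.
sample.seek(0)
table = np.loadtxt(sample, delimiter=",", usecols=(6, 7))
print(table.shape)  # (2, 2): one row per line, one column per selected field
```

Only the selected columns are converted to numbers, so the date and the empty field in each row cause no trouble here.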
3. Common functions
This section covers the volume-weighted average price, the time-weighted average price, the arithmetic mean, the median, the variance, and so on.
import numpy as np

i = np.eye(3)      # eye(n) creates an n x n identity matrix
print(i)
np.savetxt('', i)  # savetxt() creates and saves the file

# Read the CSV file: usecols selects the 7th and 8th fields (closing price
# and volume); unpack=True assigns the two columns to c and v
c, v = np.loadtxt('', delimiter=',', usecols=(6, 7), unpack=True)

vwap = np.average(c, weights=v)  # average() with the volume v as weights
print(vwap)
print('\n')
print(np.mean(c))                # arithmetic mean
print('\n')
t = np.arange(len(c))
print(t)
print('\n')
twap = np.average(c, weights=t)  # weighted by time
print(twap)
print('\n')

# The 5th and 6th fields (indices 4 and 5) hold the high and low prices
h, l = np.loadtxt('', delimiter=',', usecols=(4, 5), unpack=True)
print(np.max(h))  # maximum value, max()
print(np.min(l))  # minimum value, min()
print('\n')
print(np.ptp(h))  # ptp() returns the range, i.e., maximum minus minimum
print(np.ptp(l))
print('\n')
print(np.median(c))  # median(): the middle value of the sorted data
print(np.sort(c))    # sort the prices to verify the median above
                     # (the original used msort(c), which NumPy 2.0 removed)

# Variance
variance = np.var(c)  # var()
print(variance)
Running the calculations in code and cross-checking them in Excel gives the following results:
For the calculations that follow, add a few more rows to the data and save the file with the contents below (the date format is also changed as shown, for the date handling later on):
603112,2022-4-1,,13.56,13.97,13.55,13.87,3750000
603112,2022-4-2,,13.75,14.25,13.69,14.03,4003500
603112,2022-4-3,,13.69,14.11,13.61,13.95,3956500
603112,2022-4-4,,14.3,14.3,13.73,13.89,4250000
603112,2022-4-5,,14.1,14.5,13.93,14,4013500
603112,2022-4-6,,14.5,15.4,14.35,15.4,9056500
603112,2022-4-7,,16,16.94,15.85,16.94,3750000
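As a rough, self-contained sketch, the statistics from this section can be run on the seven rows above by feeding them to loadtxt through StringIO, so no file name is needed. The variable name rows is an assumption made for this example.

```python
from io import StringIO

import numpy as np

rows = """603112,2022-4-1,,13.56,13.97,13.55,13.87,3750000
603112,2022-4-2,,13.75,14.25,13.69,14.03,4003500
603112,2022-4-3,,13.69,14.11,13.61,13.95,3956500
603112,2022-4-4,,14.3,14.3,13.73,13.89,4250000
603112,2022-4-5,,14.1,14.5,13.93,14,4013500
603112,2022-4-6,,14.5,15.4,14.35,15.4,9056500
603112,2022-4-7,,16,16.94,15.85,16.94,3750000
"""

# Indices 4-7: high, low, closing price, volume.
h, l, c, v = np.loadtxt(StringIO(rows), delimiter=",",
                        usecols=(4, 5, 6, 7), unpack=True)

vwap = np.average(c, weights=v)   # volume-weighted average price
t = np.arange(len(c))             # 0, 1, ..., 6: later days weigh more
twap = np.average(c, weights=t)   # time-weighted average price

print("VWAP:", vwap)
print("TWAP:", twap)
print("Mean:", np.mean(c))
print("Median:", np.median(c))
print("Range of highs:", np.ptp(h))
print("Range of lows:", np.ptp(l))
print("Variance:", np.var(c))
```

Note that the first day gets weight 0 in the time-weighted average, because arange starts counting at 0.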
4. Stock returns
The most commonly watched figure in the stock market is the rate of change, i.e., the percentage by which today's closing price rose or fell relative to yesterday's: (today's closing price - yesterday's closing price) / yesterday's closing price * 100. NumPy's diff() function returns an array made up of the differences between adjacent array elements. Because each element is subtracted from its neighbor, the array returned by diff() has one element fewer than the original.
With the modified data above, there are 7 days of closing prices, so diff() returns only 6 values.
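A tiny sketch of that length difference, typing in the seven closing prices from the sample data by hand instead of reading them from the file:

```python
import numpy as np

# Closing prices copied from the seven sample rows.
c = np.array([13.87, 14.03, 13.95, 13.89, 14.0, 15.4, 16.94])

diffs = np.diff(c)          # c[1:] - c[:-1]
pct = diffs / c[:-1] * 100  # percentage change relative to the previous day

print(len(c), len(diffs))   # 7 6
print(pct)
```

Dividing by c[:-1] (every price except the last) lines each difference up with the previous day's closing price.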
import numpy as np

# Read the CSV file
c, v = np.loadtxt('', delimiter=',', usecols=(6, 7), unpack=True)

# Simple rate of return
# diff() returns an array of the differences between adjacent elements
results = np.diff(c)
print(results)
print('\n')
results1 = np.diff(c) / c[:-1] * 100  # change relative to the previous day
print(results1)
print('\n')
Standard_deviation = np.std(results)  # standard deviation
print(Standard_deviation)
The results from the code, compared with Excel:
5. Logarithmic returns and volatility
1) Logarithmic returns: use the log function to take the logarithm of each closing price, then apply the diff function to the result.
logreturns = np.diff(np.log(c))
print(logreturns)
Run results:
[ 0.01146966 -0.00571839 -0.00431035 0.00788817 0.09531018 0.09531018]
2) The where function
The where function returns the indices of all elements that satisfy a given condition. For example, two of the logreturns values above are less than 0, so selecting the positive returns leaves four indices.
posretindices = np.where(results1 > 0)
print('Indices with positive returns1', posretindices)
Run results:
Indices with positive returns1 (array([0, 3, 4, 5], dtype=int64),)
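The output above is a tuple holding one index array. The following sketch, again using hand-typed closing prices from the sample data, shows how to pull the indices out of that tuple, and how a boolean mask gives the matching values directly:

```python
import numpy as np

c = np.array([13.87, 14.03, 13.95, 13.89, 14.0, 15.4, 16.94])
logreturns = np.diff(np.log(c))

positive = np.where(logreturns > 0)     # a tuple with one index array
print(positive[0])                      # [0 3 4 5]

negative = np.where(logreturns < 0)[0]  # take the array out of the tuple
print(negative)                         # [1 2]

# A boolean mask selects the values themselves instead of their indices.
print(logreturns[logreturns > 0])
```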
3) Volatility: volatility = the standard deviation of the logarithmic returns divided by their mean, then divided by the square root of the reciprocal of the number of trading periods. The following code computes the annual and monthly volatility.
# Use floating-point numbers to get the right result
annual_volatility = (np.std(logreturns) / np.mean(logreturns)) / np.sqrt(1. / 252.)
print(annual_volatility)
# Monthly volatility
month_volatility = (np.std(logreturns) / np.mean(logreturns)) / np.sqrt(1. / 12.)
print(month_volatility)
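Since the annual and monthly figures differ only in the number of periods, the calculation can be wrapped in a small helper. This is just a sketch of the definition given above; the function name volatility and its parameter names are made up for this example, and 252 and 12 are the period counts used in this post.

```python
import numpy as np


def volatility(logreturns, periods):
    """std / mean of the log returns, divided by sqrt(1 / periods)."""
    return np.std(logreturns) / np.mean(logreturns) / np.sqrt(1.0 / periods)


c = np.array([13.87, 14.03, 13.95, 13.89, 14.0, 15.4, 16.94])
logreturns = np.diff(np.log(c))

annual = volatility(logreturns, 252)  # 252 periods per year
monthly = volatility(logreturns, 12)  # 12 periods per year
print(annual)
print(monthly)
```

Writing the period count as 1.0 / periods keeps the division in floating point, which is the point of the 1./252. in the original code.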
6. Date analysis
Dealing with dates is always tedious. NumPy is oriented towards floating-point arithmetic, so dates need some special handling.
From the code above, we know that changing the usecols=(6,7) parameter in loadtxt('', delimiter=',', usecols=(6,7), unpack=True) reads different columns. The date is in the 2nd column, so its index is 1 (column indices start at 0), and we can select the date column together with the closing price.
The code is as follows:
dates, c = np.loadtxt('', delimiter=',', usecols=(1, 6), unpack=True)  # read columns 1 and 6 into the arrays dates and c
However, running this code reports an error, because loadtxt tries to convert the date strings to floating-point numbers.
The code needs to be modified as follows:
import numpy as np
from datetime import datetime

def datestr2num(s):
    # Depending on the NumPy version, loadtxt passes the field as bytes or as str;
    # decode('ascii') converts bytes to a string
    if isinstance(s, bytes):
        s = s.decode('ascii')
    return datetime.strptime(s, "%Y-%m-%d").date().weekday()

# Read the CSV file
dates, close = np.loadtxt('', delimiter=',', usecols=(1, 6), converters={1: datestr2num}, unpack=True)
print(dates)
The result of the run is [4. 5. 6. 0. 1. 2. 3.]: weekday() numbers the days from 0 (Monday) to 6 (Sunday). To better illustrate the analysis, we can use real data, i.e., actual trading data downloaded directly from the Commodore software, as shown below:
(Note: this file has one fewer empty column than the earlier sample.)
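The weekday conversion done by datestr2num can be checked on its own, without a file. The sketch below adds an isinstance check so that the function accepts both bytes and str; that check is an assumption of this example, on the grounds that loadtxt may pass either type to a converter depending on the NumPy version.

```python
from datetime import datetime


def datestr2num(s):
    # Accept bytes as well as str (assumption: the caller may pass either).
    if isinstance(s, bytes):
        s = s.decode("ascii")
    return datetime.strptime(s, "%Y-%m-%d").date().weekday()


print(datestr2num("2022-4-1"))   # 4 (Friday)
print(datestr2num(b"2022-4-4"))  # 0 (Monday)
```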
Modify the code as follows:
import numpy as np
from datetime import datetime

def datestr2num(s):
    # Depending on the NumPy version, loadtxt passes the field as bytes or as str;
    # decode('ascii') converts bytes to a string
    if isinstance(s, bytes):
        s = s.decode('ascii')
    return datetime.strptime(s, "%Y-%m-%d").date().weekday()

# Read the CSV file
dates, c = np.loadtxt('', delimiter=',', usecols=(1, 5), converters={1: datestr2num}, unpack=True)
print(dates)
print(len(dates))  # number of days exported
Run results:
As shown above, the export has 420 days of data.
Next, tally the data by day of the week, Monday through Friday:
averages = np.zeros(5)  # array of 5 elements for the average closing prices; 0-4 stand for Monday to Friday
for i in range(5):  # iterate over the weekday markers 0 through 4
    indices = np.where(dates == i)  # where() gets the indices of each weekday
    prices = np.take(c, indices)    # take() gets the closing prices for that weekday
    avg = np.mean(prices)           # average closing price for that weekday
    averages[i] = avg               # store it in the averages array
    print('day', i)
    # print('prices', prices)
    print("Average", avg)
print(averages)
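As a self-contained sketch of the same grouping, the loop below runs on the seven sample rows from earlier (closing price at index 6) and uses a boolean mask in place of where() and take(). The mask is an alternative chosen for this example, not the method used above, and the sample dates include a weekend, which the loop simply skips.

```python
from datetime import datetime
from io import StringIO

import numpy as np

rows = """603112,2022-4-1,,13.56,13.97,13.55,13.87,3750000
603112,2022-4-2,,13.75,14.25,13.69,14.03,4003500
603112,2022-4-3,,13.69,14.11,13.61,13.95,3956500
603112,2022-4-4,,14.3,14.3,13.73,13.89,4250000
603112,2022-4-5,,14.1,14.5,13.93,14,4013500
603112,2022-4-6,,14.5,15.4,14.35,15.4,9056500
603112,2022-4-7,,16,16.94,15.85,16.94,3750000
"""


def datestr2num(s):
    # Accept bytes as well as str (assumption: loadtxt may pass either).
    if isinstance(s, bytes):
        s = s.decode("ascii")
    return datetime.strptime(s, "%Y-%m-%d").date().weekday()


dates, c = np.loadtxt(StringIO(rows), delimiter=",", usecols=(1, 6),
                      converters={1: datestr2num}, unpack=True)

averages = np.zeros(5)
for day in range(5):                       # 0 = Monday ... 4 = Friday
    averages[day] = c[dates == day].mean() # mask picks that weekday's prices

print(averages)
```

With only one price per weekday in this tiny sample, each average is simply that day's closing price; on the real export each mask would select many rows.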
Beyond the above, we can also find the highest and lowest closing prices over the 420 days, as well as the highest and lowest of the weekday averages. The code is revised as follows:
import numpy as np
from datetime import datetime

def datestr2num(s):
    # Depending on the NumPy version, loadtxt passes the field as bytes or as str;
    # decode('ascii') converts bytes to a string
    if isinstance(s, bytes):
        s = s.decode('ascii')
    return datetime.strptime(s, "%Y-%m-%d").date().weekday()

# Read the CSV file
dates, c = np.loadtxt('', delimiter=',', usecols=(1, 5), converters={1: datestr2num}, unpack=True)

averages = np.zeros(5)  # array of 5 elements for the average closing prices; 0-4 stand for Monday to Friday
for i in range(5):  # iterate over the weekday markers 0 through 4
    indices = np.where(dates == i)  # where() gets the indices of each weekday
    prices = np.take(c, indices)    # take() gets the closing prices for that weekday
    avg = np.mean(prices)
    averages[i] = avg               # store each weekday's average; averages ends up with five values
    print('day', i)
    # print('prices', prices)
    print("Average", avg)
print(averages)
print('\n')
print('the top close price:', np.max(c))  # highest closing price
print('the low close price:', np.min(c))  # lowest closing price
print('\n')
top = np.max(averages)  # largest value in averages
print("Highest average", top)
print("Top day of the week", np.argmax(averages))  # argmax() returns the index of the largest element
print('\n')
bottom = np.min(averages)  # smallest value in averages
print("Lowest average", bottom)
print("Bottom day of the week", np.argmin(averages))  # argmin() returns the index of the smallest element
The results of the run are as follows:
Summary
This post imported real stock trading data and ran some preliminary calculations on it with common NumPy functions, which are listed below:
np.loadtxt(): reads a CSV file, splits the fields automatically, and loads the data into a NumPy array
np.savetxt(): creates and saves a file
np.loadtxt('', delimiter=',', usecols=(6,7)): the usecols parameter selects the columns to read
np.average(c, weights=v): weighted average, using v as the weights
np.mean(c): arithmetic mean
np.max(h): maximum value
np.min(l): minimum value
np.ptp(h): range, i.e., the difference between the maximum and minimum values
np.median(c): median, i.e., the middle value of the sorted data
np.msort(c): sorts the price array (removed in NumPy 2.0; use np.sort(c) instead)
np.var(c): variance
np.diff(c): returns an array of the differences between adjacent array elements
np.std(results): standard deviation
np.diff(np.log(c)): logarithmic returns
np.where(results1 > 0): returns the indices of the elements that satisfy the condition
np.sqrt(): square root; use floating-point numbers
s.decode('ascii'): converts the bytes string s to an ASCII string
np.take(c, indices): gets the elements of c at the given indices, here the closing prices for each weekday
np.argmax(averages): returns the index of the largest element in the array
np.argmin(averages): returns the index of the smallest element in the array
That concludes this walkthrough of common NumPy functions for Python data analysis. For more on NumPy's common functions, see my other related articles.