SoFunction
Updated on 2024-11-16

Python data analysis: a detailed guide to common NumPy functions

This post uses the analysis of historical stock prices as an example. It shows how to load data from a file, how to read and write files, and how to use NumPy's basic mathematical and statistical functions, and it tries out functional programming and NumPy's linear algebra operations along the way, as a tour of NumPy's common functions.

File reading

Reading and writing files is an essential skill for data analysis.

CSV (Comma-Separated Values) is a common file format. Database dump files are typically in CSV format, and the fields in the file correspond to the columns of the database table.

The loadtxt function in NumPy makes it easy to read CSV files: it splits the fields automatically and loads the data into NumPy arrays.

1. Save or create a new file

import numpy as np

i = np.eye(3)  # eye(n) creates an n x n identity matrix
print(i)
np.savetxt('', i)  # savetxt() creates and saves the file; the file name is missing from the source, so pass your own path

The savetxt() function overwrites the file if it already exists; if there is no such file in the directory, it creates the file and saves the data.
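The create-or-overwrite behavior can be checked with a short round trip. This is a minimal sketch: the temporary directory and the file name eye_demo.txt are hypothetical and used only for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical file name in a temporary directory, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "eye_demo.txt")

np.savetxt(path, np.eye(3))        # the first call creates the file
np.savetxt(path, np.eye(2) * 5.0)  # the second call overwrites it

loaded = np.loadtxt(path)          # only the second matrix is left in the file
print(loaded)
```

Reading the file back shows the 2 x 2 matrix from the second call, not the 3 x 3 identity matrix from the first.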

The results of the run are as follows:

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

2. Reading a CSV file with loadtxt

1) First, create a CSV file in the directory where you saved the program and enter the data shown below:

2) Read the file as follows:

c, v = np.loadtxt('', delimiter=',', usecols=(6, 7), unpack=True)  # the file name is missing from the source; pass the path to your own file

The usecols parameter is a tuple that selects the 7th and 8th fields (indices 6 and 7), which hold the closing price and the volume of the stock in the file above. Setting unpack to True splits the data by column, so the closing prices and the volumes are assigned to the variables c and v, respectively.
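The effect of usecols and unpack can be seen without a file on disk. In this sketch, io.StringIO stands in for the CSV file, and the three rows are taken from the sample data that appears later in this article; everything else is an assumption made for illustration.

```python
import io

import numpy as np

# Three rows from the article's sample data; io.StringIO stands in for the CSV file.
sample = io.StringIO(
    "603112,2022-4-1,,13.56,13.97,13.55,13.87,3750000\n"
    "603112,2022-4-2,,13.75,14.25,13.69,14.03,4003500\n"
    "603112,2022-4-3,,13.69,14.11,13.61,13.95,3956500\n"
)

# Without unpack, the selected columns come back as one 2-D array (rows x columns).
rows = np.loadtxt(sample, delimiter=",", usecols=(6, 7))
print(rows.shape)

# With unpack=True, the array is transposed so each column can be assigned separately.
sample.seek(0)
c, v = np.loadtxt(sample, delimiter=",", usecols=(6, 7), unpack=True)
print(c)
print(v)
```

Because only the columns named in usecols are converted, the date and the empty third field do not get in the way here.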

3. Common functions

Volume-weighted average, time-weighted, arithmetic mean, median, variance, etc.

import numpy as np

i = np.eye(3)  # eye(n) creates an n x n identity matrix
print(i)
np.savetxt('', i)  # savetxt() creates and saves the file (file name missing from the source)

# Read the CSV file (the file name is missing from the source; pass the path to your own file)
c, v = np.loadtxt('', delimiter=',', usecols=(6, 7), unpack=True)
# usecols is a tuple that selects the 7th and 8th fields, i.e. the closing price and the volume.
# unpack=True splits the columns, so the closing prices go to c and the volumes go to v.
vwap = np.average(c, weights=v)  # average() with v as the weights gives the volume-weighted average price
print(vwap)
print('\n')
print(np.mean(c))  # arithmetic mean
print('\n')
t = np.arange(len(c))
print(t)
print('\n')
twap = np.average(c, weights=t)  # time-weighted average: later prices get larger weights
print(twap)
print('\n')
h, l = np.loadtxt('', delimiter=',', usecols=(4, 5), unpack=True)
# The columns with index 4 and 5 hold the high and low prices of the stock

print(np.max(h))  # maximum value, max()
print(np.min(l))  # minimum value, min()
print('\n')
print(np.ptp(h))  # ptp() returns the range, i.e. the difference between the maximum and minimum values
print(np.ptp(l))
print('\n')
print(np.median(c))  # median(): the middle value of the sorted data
print(np.sort(c))  # sorting the prices lets you verify the median above; the original used msort(), which was removed in NumPy 2.0
# Variance
variance = np.var(c)  # var() computes the variance
print(variance)

The results from the code, compared with the same calculations in Excel, are as follows:

To support the later calculations, add a few more rows to the data, then modify and save the file as follows (the date format is changed to the form below so that the dates can be read, written, and modified later):

603112,2022-4-1,,13.56,13.97,13.55,13.87,3750000
603112,2022-4-2,,13.75,14.25,13.69,14.03,4003500
603112,2022-4-3,,13.69,14.11,13.61,13.95,3956500
603112,2022-4-4,,14.3,14.3,13.73,13.89,4250000
603112,2022-4-5,,14.1,14.5,13.93,14,4013500
603112,2022-4-6,,14.5,15.4,14.35,15.4,9056500
603112,2022-4-7,,16,16.94,15.85,16.94,3750000
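As a cross-check on what average() does with its weights argument, the weighted averages can be recomputed by hand from the sample data above. This is a sketch with the closing prices and volumes copied into arrays rather than read from the file.

```python
import numpy as np

# Closing prices and volumes copied from the sample data above.
c = np.array([13.87, 14.03, 13.95, 13.89, 14.0, 15.4, 16.94])
v = np.array([3750000, 4003500, 3956500, 4250000, 4013500, 9056500, 3750000], dtype=float)

# Volume-weighted average price: sum(price * volume) / sum(volume).
vwap = np.average(c, weights=v)
vwap_manual = np.sum(c * v) / np.sum(v)

# Time-weighted average price: the weights 0, 1, 2, ... favor the later days.
t = np.arange(len(c))
twap = np.average(c, weights=t)
twap_manual = np.sum(c * t) / np.sum(t)

print(vwap, vwap_manual)
print(twap, twap_manual)
```

Note that with weights starting at 0, the first day's price does not contribute to the time-weighted average at all.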

4. Stock returns

The most common figure in the stock market is the rate of change, i.e., the percentage by which today's closing price rose or fell relative to yesterday's: (today's closing price - yesterday's closing price) / yesterday's closing price * 100. NumPy's diff() function returns an array of the differences between neighboring array elements. Because each difference uses two neighboring values, the array returned by diff() has one element fewer than the original array.

With the modified data above there are 7 days of closing prices, so diff() returns only 6 values.
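The length change and the percentage formula can be checked directly on the seven closing prices from the sample data. This is a small sketch with the prices typed in as an array.

```python
import numpy as np

# The seven closing prices from the sample data.
c = np.array([13.87, 14.03, 13.95, 13.89, 14.0, 15.4, 16.94])

changes = np.diff(c)          # same as c[1:] - c[:-1]
pct = changes / c[:-1] * 100  # each change divided by the previous day's close

print(len(c), len(changes))   # the result is one element shorter
print(np.round(pct, 2))
```

Dividing by c[:-1] lines each difference up with the previous day's price, which is why the last closing price never appears as a divisor.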

import numpy as np

# Read the CSV file (file name missing from the source)
c, v = np.loadtxt('', delimiter=',', usecols=(6, 7), unpack=True)

# Simple stock returns
# diff() returns an array of the differences between neighboring array elements
results = np.diff(c)
print(results)
print('\n')
results1 = np.diff(c) / c[:-1] * 100  # percentage change relative to the previous day
print(results1)
print('\n')
standard_deviation = np.std(results)  # standard deviation
print(standard_deviation)

The results from the code, compared with Excel:

5. Logarithmic returns and volatility

1) Logarithmic returns: take the logarithm of each closing price with the log function, then apply the diff function to the result.

logreturns = np.diff(np.log(c))
print(logreturns)

Run results:

[ 0.01146966 -0.00571839 -0.00431035  0.00788817  0.09531018  0.09531018]
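The difference of logarithms is the logarithm of the price ratio, which is why diff(log(c)) yields returns. The sketch below checks that identity on the same seven closing prices and compares the result with the simple returns.

```python
import numpy as np

# The seven closing prices from the sample data.
c = np.array([13.87, 14.03, 13.95, 13.89, 14.0, 15.4, 16.94])

logreturns = np.diff(np.log(c))      # log(c[i+1]) - log(c[i])
ratio_form = np.log(c[1:] / c[:-1])  # log(c[i+1] / c[i]), the same quantity
simple = np.diff(c) / c[:-1]         # simple returns as fractions, not percentages

print(np.allclose(logreturns, ratio_form))
print(np.round(simple - logreturns, 5))  # the gap grows with the size of the move
```

For small daily moves the two kinds of return are nearly equal; they drift apart on the larger moves at the end of the series.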

2) The where function

The where function returns the indices of all elements that satisfy the given condition. For example, two of the returns above are negative, so selecting the positive returns yields four indices.

posretindices = np.where(results1 > 0)
print('Indices with positive returns1',posretindices)

Run results:

Indices with positive returns1 (array([0, 3, 4, 5], dtype=int64),)
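The output above is a tuple holding one index array, because where() returns one array per dimension. The following sketch, again on the seven sample closing prices, shows how to get at the indices and at the matching values.

```python
import numpy as np

# The seven closing prices from the sample data.
c = np.array([13.87, 14.03, 13.95, 13.89, 14.0, 15.4, 16.94])
returns = np.diff(c) / c[:-1] * 100

pos = np.where(returns > 0)     # a tuple with one index array for this 1-D input
neg = np.where(returns < 0)[0]  # [0] extracts the index array itself

print(pos[0])
print(neg)
print(returns[returns > 0])     # a boolean mask selects the values directly
```

When only the values are needed and not their positions, indexing with the boolean mask is usually enough.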

3) Volatility: volatility is the standard deviation of the logarithmic returns divided by their mean, then divided by the square root of the reciprocal of the number of trading periods. The following code computes the annual and monthly volatility.

annual_volatility = (np.std(logreturns) / np.mean(logreturns)) / np.sqrt(1. / 252.)  # use floating-point numbers to get the right result
print(annual_volatility)
# Monthly volatility
month_volatility = (np.std(logreturns) / np.mean(logreturns)) / np.sqrt(1. / 12.)
print(month_volatility)
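Dividing by the square root of a reciprocal is the same as multiplying by the square root of the period count. The sketch below follows this article's definition of volatility as written (it is not offered as a standard financial formula) and checks that equivalence on the sample closing prices.

```python
import numpy as np

# The seven closing prices from the sample data.
c = np.array([13.87, 14.03, 13.95, 13.89, 14.0, 15.4, 16.94])
logreturns = np.diff(np.log(c))

# Standard deviation of the log returns divided by their mean, as defined in this article.
ratio = np.std(logreturns) / np.mean(logreturns)

annual = ratio / np.sqrt(1.0 / 252.0)   # 252 trading days per year
monthly = ratio / np.sqrt(1.0 / 12.0)   # 12 months per year

print(annual, ratio * np.sqrt(252.0))
print(monthly, ratio * np.sqrt(12.0))
```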

6. Date analysis

Dealing with dates is always tedious. NumPy is oriented towards floating-point arithmetic, so dates need some special handling.

From the code above, we know that changing the usecols=(6,7) argument of np.loadtxt('', delimiter=',', usecols=(6,7), unpack=True) reads different columns. The date is in the 2nd column, so its index is 1 (column indices start at 0). We can therefore read the date column together with the closing price.

The code is as follows:

dates, c = np.loadtxt('', delimiter=',', usecols=(1, 6), unpack=True)  # read the columns with index 1 and 6 into the arrays dates and c

However, running this code raises an error, because loadtxt tries to convert the date strings to floating-point numbers.
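The failure can be reproduced without the original file. In this sketch, io.StringIO stands in for the CSV file and holds one row of the sample data; the exact wording of the error message depends on the NumPy version, so only the exception type is shown.

```python
import io

import numpy as np

# One row of the sample data; io.StringIO stands in for the CSV file.
sample = io.StringIO("603112,2022-4-1,,13.56,13.97,13.55,13.87,3750000\n")

try:
    # Column 1 holds "2022-4-1", which cannot be parsed as a float.
    np.loadtxt(sample, delimiter=",", usecols=(1, 6), unpack=True)
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__)
```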

The code needs to be modified as follows:

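import numpy as np
from datetime import datetime

def datestr2num(s):  # convert a date string to a weekday number (Monday is 0)
    # Depending on the NumPy version, converters receive bytes or str;
    # decode('ascii') turns a bytes object into a str
    if isinstance(s, bytes):
        s = s.decode('ascii')
    return datetime.strptime(s, "%Y-%m-%d").date().weekday()

# Read the CSV file (file name missing from the source)
dates, close = np.loadtxt('', delimiter=',', usecols=(1, 6), converters={1: datestr2num}, unpack=True)
print(dates)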

The result of the run is [4. 5. 6. 0. 1. 2. 3.]: the weekday numbers run from 0 (Monday) to 6 (Sunday). To illustrate the analysis better, we can use real data, i.e., real trading data exported directly from the Commodore software, as shown below:

(Note: this file has one fewer empty column than the original sample, so the closing price moves to index 5.)

Modify the code as follows:

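import numpy as np
from datetime import datetime

def datestr2num(s):  # convert a date string to a weekday number (Monday is 0)
    # Depending on the NumPy version, converters receive bytes or str;
    # decode('ascii') turns a bytes object into a str
    if isinstance(s, bytes):
        s = s.decode('ascii')
    return datetime.strptime(s, "%Y-%m-%d").date().weekday()

# Read the CSV file (file name missing from the source)
dates, c = np.loadtxt('', delimiter=',', usecols=(1, 5),
                      converters={1: datestr2num}, unpack=True)
print(dates)

print(len(dates))  # number of days in the exported data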

Run results:

As shown above, the export has 420 days of data.

Compute the average closing price for each weekday, Monday through Friday:

averages = np.zeros(5)  # array of 5 elements for the average closing price of each trading day; 0-4 stand for Monday to Friday
for i in range(5):  # iterate over the weekday numbers 0 through 4
    indices = np.where(dates == i)  # where() returns the indices of the rows that fall on weekday i
    prices = np.take(c, indices)  # take() picks the closing prices at those indices
    avg = np.mean(prices)  # average closing price for this weekday
    averages[i] = avg  # store the average in the averages array
    print('day', i)
    #print('prices', prices)
    print("Average", avg)

print(averages)
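The same grouping can be written more compactly with boolean masks. The sketch below is an alternative to the where()/take() loop, not the article's method; it uses the seven sample closing prices together with the weekday numbers printed earlier, and the weekday names are added only for readability.

```python
import numpy as np

# Weekday numbers and closing prices for the seven sample rows (0 is Monday).
dates = np.array([4.0, 5.0, 6.0, 0.0, 1.0, 2.0, 3.0])
c = np.array([13.87, 14.03, 13.95, 13.89, 14.0, 15.4, 16.94])

# A boolean mask per weekday replaces the where()/take() pair.
averages = np.array([c[dates == day].mean() for day in range(5)])
print(averages)

names = ["Mon", "Tue", "Wed", "Thu", "Fri"]
best = names[int(np.argmax(averages))]   # weekday with the highest average close
worst = names[int(np.argmin(averages))]  # weekday with the lowest average close
print(best, worst)
```

With real data, a weekday that has no rows at all would produce an empty selection, so the mean for that day would need separate handling.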

In addition to the above, you can also find the highest and lowest closing prices over the 420 days, as well as the highest and lowest of the weekday averages. The extended code is as follows:

import numpy as np
from datetime import datetime

def datestr2num(s):  # convert a date string to a weekday number (Monday is 0)
    # Depending on the NumPy version, converters receive bytes or str;
    # decode('ascii') turns a bytes object into a str
    if isinstance(s, bytes):
        s = s.decode('ascii')
    return datetime.strptime(s, "%Y-%m-%d").date().weekday()

# Read the CSV file (file name missing from the source)
dates, c = np.loadtxt('', delimiter=',', usecols=(1, 5),
                      converters={1: datestr2num}, unpack=True)

averages = np.zeros(5)  # array of 5 elements for the average closing price of each trading day; 0-4 stand for Monday to Friday
for i in range(5):  # iterate over the weekday numbers 0 through 4
    indices = np.where(dates == i)  # where() returns the indices of the rows that fall on weekday i
    prices = np.take(c, indices)  # take() picks the closing prices at those indices
    avg = np.mean(prices)
    averages[i] = avg  # store the average for this weekday; averages ends up with five values
    print('day', i)
    #print('prices', prices)
    print("Average", avg)

print(averages)
print('\n')

print('the top close price:', np.max(c))  # highest closing price
print('the low close price:', np.min(c))  # lowest closing price
print('\n')

top = np.max(averages)  # largest value in the averages array
print("Highest average", top)
print("Top day of the week", np.argmax(averages))  # argmax() returns the index of the largest element in averages
print('\n')

bottom = np.min(averages)  # smallest value in the averages array
print("Lowest average", bottom)
print("Bottom day of the week", np.argmin(averages))  # argmin() returns the index of the smallest element in averages

The results of the run are as follows:

Summary

This article imported real stock trading data and ran some preliminary calculations on it with common NumPy functions. The functions used are listed below:

np.loadtxt() reads a CSV file, splits the fields automatically, and loads the data into a NumPy array.

np.savetxt() creates and saves a file.

np.loadtxt('', delimiter=',', usecols=(6,7)): the usecols parameter selects the columns to read.

np.average(c, weights=v): weighted average, with v as the weights.

np.mean(c): arithmetic mean.

np.max(h): maximum value.

np.min(l): minimum value.

np.ptp(h): range, i.e. the difference between the maximum and minimum values.

np.median(c): median, i.e. the middle value of the sorted data.

np.msort(c): sorts the price array (msort() was removed in NumPy 2.0; np.sort(c) does the same for a one-dimensional array).

np.var(c): variance.

np.diff(c): returns an array of the differences between neighboring array elements.

np.std(results): standard deviation.

np.diff(np.log(c)): logarithmic returns.

np.where(results1 > 0): indices of the elements that satisfy the condition.

np.sqrt(): square root; pass floating-point numbers.

s.decode('ascii'): converts the bytes object s to a str.

np.take(c, indices): picks the closing prices at the given indices, here those of one weekday.

np.argmax(averages): index of the largest element in the array.

np.argmin(averages): index of the smallest element in the array.
