A student on Zhihu said that when he tried to open an Excel file of about 20 MB, whether with pandas.read_excel() or with the xlrd or openpyxl modules directly, it was unbearably slow, taking about a minute.
Can this really be the case? My first impression was that the student had not enabled read-only mode when using the openpyxl module. To test this, first generate an Excel file with one million rows of data using the following code.
>>> from openpyxl import Workbook
>>> wb = Workbook()
>>> sh = wb.active
>>> sh.append(['id', 'Chinese', 'Math', 'English', 'Physics'])
>>> for i in range(1000000):  # write one million rows of data
...     sh.append([i + 1, 90, 100, 95, 99])
...
>>> wb.save(r'd:\demo.xlsx')  # 'demo.xlsx' is a placeholder; the filename was truncated in the original
>>> import os
>>> os.path.getsize(r'd:\demo.xlsx')  # file size: about 20 MB
20230528
Next, define a function that opens the file with the openpyxl module and measures the time taken with read-only mode turned off and turned on, respectively.
>>> from openpyxl import load_workbook
>>> import time
>>> def read_xlsx(read_only):
...     t0 = time.time()
...     wb = load_workbook(r'd:\demo.xlsx', read_only=read_only)
...     t1 = time.time()
...     print(wb.sheetnames)
...     sh = wb.active
...     print(sh.cell(row=1, column=1).value)
...     print(sh.cell(row=100, column=3).value)
...     print('Took %0.3f seconds' % (t1 - t0))
...
>>> read_xlsx(True)
['Sheet']
id
100
Took 0.404 seconds
>>> read_xlsx(False)
['Sheet']
id
100
Took 67.817 seconds
Running the test confirms it: without read-only mode, opening the file really does take over a minute, while with read-only mode it takes just 0.4 seconds.
However, don't celebrate too soon. Unlike pandas.read_excel(), the openpyxl module does not read all the data into one data structure at once; it can only locate rows, columns, or individual cells and then read their values. To pull all the data into an array or DataFrame with openpyxl, you have to traverse every row and column, which is still a very time-consuming operation, as the sketch below illustrates.
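To make that cost concrete, here is a minimal sketch of such a traversal, reusing the placeholder path d:\demo.xlsx from above (read_all_rows is a hypothetical helper name): it opens the workbook in read-only mode and walks every row with iter_rows(), collecting plain values into a list. Even with the fast open, parsing a million rows this way still takes a long time.

>>> def read_all_rows():
...     t0 = time.time()
...     wb = load_workbook(r'd:\demo.xlsx', read_only=True)
...     sh = wb.active
...     # values_only=True yields plain values instead of cell objects
...     rows = list(sh.iter_rows(values_only=True))
...     wb.close()  # read-only workbooks keep the file handle open until closed
...     print('Read %d rows' % len(rows))
...     print('Took %0.3f seconds' % (time.time() - t0))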
So, does pandas.read_excel() also support a read-only mode? Unfortunately, read_excel() has no parameter like read_only. Although read_excel() accepts file paths, file objects, file-like objects, and even binary data, it still takes about 80 seconds to parse the one million rows, even when you pass in the file contents directly. The following code verifies this.
>>> import pandas as pd
>>> def read_excel_by_pandas():
...     with open(r'd:\demo.xlsx', 'rb') as fp:
...         content = fp.read()
...     t0 = time.time()
...     df = pd.read_excel(content, engine='openpyxl')
...     t1 = time.time()
...     print(df.head())
...     print(df.tail())
...     print('Took %0.3f seconds' % (t1 - t0))
...
>>> read_excel_by_pandas()
   id  Chinese  Math  English  Physics
0   1       90   100       95       99
1   2       90   100       95       99
2   3       90   100       95       99
3   4       90   100       95       99
4   5       90   100       95       99
             id  Chinese  Math  English  Physics
999995   999996       90   100       95       99
999996   999997       90   100       95       99
999997   999998       90   100       95       99
999998   999999       90   100       95       99
999999  1000000       90   100       95       99
Took 81.369 seconds
Conclusion: When dealing with very large Excel files, use the read-only mode of the openpyxl module to open the file quickly and fetch the values of specific cells, but don't try to read all the data into a data structure of your own, as that will take a very long time. And there is nothing pandas can do about this.
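As a usage example of that conclusion, here is a small sketch (again assuming the placeholder path d:\demo.xlsx) that fetches just a handful of rows with iter_rows() and its min_row/max_row bounds. Because read-only mode parses rows lazily, grabbing a small slice near the top of the sheet stays fast; given the test data generated earlier, the output would be:

>>> wb = load_workbook(r'd:\demo.xlsx', read_only=True)
>>> sh = wb.active
>>> # read rows 2-6 only (row 1 holds the headers)
>>> for row in sh.iter_rows(min_row=2, max_row=6, values_only=True):
...     print(row)
...
(1, 90, 100, 95, 99)
(2, 90, 100, 95, 99)
(3, 90, 100, 95, 99)
(4, 90, 100, 95, 99)
(5, 90, 100, 95, 99)
>>> wb.close()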
That concludes this article on how to quickly open a very large, million-row Excel file in Python. For more on reading Excel files with Python, please search my earlier posts, and I hope you will continue to support me in the future!