Normally we'd use wc -l to count file lines, but it's easy to do with Python.
To quickly count the number of lines in a text file, you actually count the number of line breaks in it. To maximize speed, read as much data as possible at a time and process each chunk in one go. To count the line breaks, you can use the built-in count method of bytes.
The code is as follows:
    from __future__ import print_function
    import sys
    import time

    if __name__ == '__main__':
        start = time.time()
        with open(sys.argv[1], 'rb') as f:
            count = 0
            last_data = b'\n'
            while True:
                data = f.read(0x400000)
                if not data:
                    break
                count += data.count(b'\n')
                last_data = data
            if last_data[-1:] != b'\n':
                count += 1  # Remove this if a wc-like count is needed
        end = time.time()
        print(count)
        print((end - start) * 1000)
In the above code, a final piece of text that does not end with a line break is counted as one line, which is slightly different from wc -l, since wc -l counts only line break characters. If you want the result to match wc -l, delete the if block marked with the "Remove this" comment.
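The difference between the two counting conventions can be sketched as a small function. This is a minimal sketch for Python 3 rather than the article's exact script: the name count_lines and the wc_compatible and chunk_size parameters are hypothetical and introduced here only for illustration.

```python
def count_lines(path, wc_compatible=False, chunk_size=0x400000):
    # Hypothetical helper: counts b"\n" bytes chunk by chunk.
    count = 0
    last_byte = b"\n"  # an empty file has no unterminated last line
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            count += data.count(b"\n")
            last_byte = data[-1:]
    # Unless a wc-like count is requested, a trailing piece of text
    # without a final line break counts as one more line.
    if not wc_compatible and last_byte != b"\n":
        count += 1
    return count
```

For a file containing the bytes a\nb\nc, this returns 3 by default and 2 with wc_compatible=True. The two modes agree whenever the file ends with a line break.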
Universal newlines, skipping blank lines, and similar logic are not handled here. If you need these features, the program becomes a little more complex.
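One way to get those features is to let Python's text mode split the lines. The following is a minimal sketch for Python 3, not a tuned implementation: the name count_lines_universal and its parameters are hypothetical, UTF-8 is an assumed encoding, and a "blank" line is assumed to mean one containing only whitespace. It decodes the text and loops in Python, so it should be expected to run slower than the byte-counting approach.

```python
def count_lines_universal(path, skip_blank=False, encoding="utf-8"):
    # Hypothetical helper: counts lines using universal newlines.
    count = 0
    # newline=None enables universal newlines, so "\n", "\r" and "\r\n"
    # each end a line. errors="replace" keeps undecodable bytes from raising.
    with open(path, "r", encoding=encoding, errors="replace", newline=None) as f:
        for line in f:
            # Assumption: a line with only whitespace counts as blank.
            if skip_blank and not line.strip():
                continue
            count += 1
    return count
```

As in the byte-counting script, a final line without a line break is still counted as a line here.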
The test used three text files with 10 million, 160 million, and 640 million lines. For each file, wc -l was run twice first, followed by the Python script.
Run results:
    [root@yz- test]# docker run -it --rm -v `pwd`:/opt/workspace python:3 bash -c "cd /opt/workspace && time wc -l && time wc -l && time python3 "
    10000000

    real 0m0.086s
    user 0m0.072s
    sys 0m0.013s
    10000000

    real 0m0.080s
    user 0m0.060s
    sys 0m0.019s
    10000000
    64.38159942626953

    real 0m0.150s
    user 0m0.100s
    sys 0m0.033s
    [root@yz- test]# docker run -it --rm -v `pwd`:/opt/workspace python:3 bash -c "cd /opt/workspace && time wc -l && time wc -l && time python3 "
    160000000

    real 0m1.322s
    user 0m0.991s
    sys 0m0.318s
    160000000

    real 0m1.313s
    user 0m0.966s
    sys 0m0.341s
    160000000
    838.7012481689453

    real 0m0.908s
    user 0m0.595s
    sys 0m0.297s
    [root@yz- test]# docker run -it --rm -v `pwd`:/opt/workspace python:3 bash -c "cd /opt/workspace && time wc -l && time wc -l && time python3 "
    640000000

    real 0m5.805s
    user 0m4.349s
    sys 0m1.455s
    640000000

    real 0m5.787s
    user 0m4.342s
    sys 0m1.445s
    640000000
    3323.5926628112793

    real 0m3.399s
    user 0m2.255s
    sys 0m1.108s
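As can be seen, Python is actually faster than wc -l on the two larger files (about 0.9 s versus 1.3 s for 160 million lines, and about 3.4 s versus 5.8 s for 640 million lines). On the 10-million-line file Python is slower overall (0.150 s versus about 0.08 s), even though the counting itself takes only about 64 ms, so the remainder is most likely interpreter startup. Python does well mainly because there are very few pure-Python steps, and most of the time is spent in read(), count(), and similar C implementations. wc is probably slower because its default buffer is smaller, so it needs more read() calls.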
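The buffer-size explanation is a guess, but the effect of read size on the Python side can be measured directly. The sketch below is only an experiment under stated assumptions: the names count_newlines and time_chunk_sizes and the chosen chunk sizes are hypothetical, and it says nothing about the buffer size wc actually uses. The file is opened with buffering=0 so that each read() call asks the operating system for at most chunk_size bytes.

```python
import time


def count_newlines(path, chunk_size):
    # Hypothetical helper: wc-like count of b"\n" bytes, unbuffered reads.
    count = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            count += data.count(b"\n")
    return count


def time_chunk_sizes(path, chunk_sizes=(0x1000, 0x10000, 0x400000)):
    # Returns {chunk_size: (line_count, elapsed_seconds)}.
    results = {}
    for size in chunk_sizes:
        start = time.perf_counter()
        count = count_newlines(path, size)
        results[size] = (count, time.perf_counter() - start)
    return results
```

Calling time_chunk_sizes on one of the large test files and comparing the elapsed times shows how much the number of read() calls matters on a given machine. Run it more than once, since the first pass may also include the cost of loading the file into the page cache.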