Preface
I had heard about zero-copy technology before, but it always felt profound and out of reach 🥸. It seemed like the kind of thing you only encounter when reading that some framework, for example Netty, uses zero copy under the hood.
Then I came across an article that finally made zero copy click for me. It turns out I can use it in my own work as well, so today I would like to share it with you.
A common but inefficient approach
When we need to split a text file into chunks of a maximum size, we might be tempted to write code like the following:
private static final long maxFileSizeBytes = 10 * 1024 * 1024; // Default 10MB

public void split(Path inputFile, Path outputDir) throws IOException {
    if (!Files.exists(inputFile)) {
        throw new IOException("The input file does not exist: " + inputFile);
    }
    if (Files.size(inputFile) == 0) {
        throw new IOException("The input file is empty: " + inputFile);
    }
    Files.createDirectories(outputDir);

    try (BufferedReader reader = Files.newBufferedReader(inputFile)) {
        int fileIndex = 0;
        long currentSize = 0;
        BufferedWriter writer = null;
        try {
            writer = newWriter(outputDir, fileIndex++);
            String line;
            while ((line = reader.readLine()) != null) {
                // One String per line, plus a temporary String and a byte[] just to measure its size
                byte[] lineBytes = (line + System.lineSeparator()).getBytes();
                if (currentSize + lineBytes.length > maxFileSizeBytes) {
                    if (writer != null) {
                        writer.close();
                    }
                    writer = newWriter(outputDir, fileIndex++);
                    currentSize = 0;
                }
                writer.write(line);
                writer.newLine();
                currentSize += lineBytes.length;
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
    }
}

private BufferedWriter newWriter(Path dir, int index) throws IOException {
    Path filePath = dir.resolve("part_" + index + ".txt");
    return Files.newBufferedWriter(filePath);
}
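For reference, invoking this splitter might look like the following. This is only a sketch: the wrapper class name BufferedFileSplitter and the paths are placeholders of mine, since the article only shows the methods themselves.

import java.io.IOException;
import java.nio.file.Path;

public class SplitExample {
    public static void main(String[] args) throws IOException {
        // Hypothetical class holding the split() and newWriter() methods shown above
        BufferedFileSplitter splitter = new BufferedFileSplitter();
        splitter.split(Path.of("/tmp/large_input.txt"), Path.of("/tmp/split_output"));
        // Produces /tmp/split_output/part_0.txt, part_1.txt, ... each at most 10 MB
    }
}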
Efficiency analysis
This code works, but it is very inefficient for splitting large files into chunks.
It performs a heap allocation for every line, creating and discarding a large number of temporary objects (Strings and byte arrays).
There is also a less obvious problem: the data is copied through multiple buffers, with context switches between user mode and kernel mode.
The details are as follows:
In BufferedReader.readLine():
- Under the hood, read() is called on a FileReader or InputStreamReader.
- The data is copied from a kernel-space buffer into a user-space buffer.
- The characters are then parsed into a Java String (a heap allocation).
In getBytes():
- The String is converted into a brand-new byte[], which means yet more heap allocation.
In BufferedWriter.write():
- The char/byte data is taken from the user-space buffer.
- The write() call copies it from user space back into kernel space.
- Finally, it is flushed to disk.
So the data bounces between kernel space and user space several times and generates extra heap allocations. Besides garbage-collection pressure, this has the following consequences:
- Memory bandwidth is wasted on copying between buffers.
- CPU utilization is high for what is essentially a disk-to-disk transfer.
- The operating system could have handled the bulk copy directly (via DMA or other optimized I/O), but the Java code gives up that efficiency by routing every byte through user-space logic.
An efficient solution
So, how can we avoid the above problems?
The answer is to use zero copy as much as possible, in other words to avoid leaving kernel space whenever we can. In Java this can be done with FileChannel's long transferTo(long position, long count, WritableByteChannel target) method. It transfers data directly from disk to disk and lets the operating system apply its own I/O optimizations along the way.
There is a problem, though: the method described above transfers raw byte blocks, which can cut a line in half. To solve it, we need a strategy that keeps lines intact even though we are moving the file around in byte-sized segments.
Without that constraint, splitting would be easy: just call transferTo once per chunk, incrementing position as position = position + maxFileSize until no more data can be transferred.
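If line integrity did not matter, a minimal sketch of that loop could look like this (the class name, chunk size, and output naming are illustrative, not from the original article):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NaiveZeroCopySplitter {

    private static final long MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024; // 10 MB chunks

    // Splits inputFile into fixed-size chunks with transferTo, ignoring line boundaries.
    public static void split(Path inputFile, Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        try (FileChannel in = FileChannel.open(inputFile, StandardOpenOption.READ)) {
            long fileSize = in.size();
            long position = 0;
            int part = 0;
            while (position < fileSize) {
                long chunkSize = Math.min(MAX_FILE_SIZE_BYTES, fileSize - position);
                Path partFile = outputDir.resolve("part_" + part++ + ".txt");
                try (FileChannel out = FileChannel.open(partFile,
                        StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                        StandardOpenOption.TRUNCATE_EXISTING)) {
                    long transferred = 0;
                    // transferTo may move fewer bytes than requested, so keep going
                    // until the whole chunk has been written.
                    while (transferred < chunkSize) {
                        transferred += in.transferTo(position + transferred,
                                chunkSize - transferred, out);
                    }
                }
                position += chunkSize; // position = position + maxFileSize, as described above
            }
        }
    }
}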
To keep lines intact, we need to determine where the last complete line ends within each byte chunk. To do this, we first seek to the intended end of the chunk and then scan backwards to find the preceding line break. That gives us the exact byte count for the chunk and guarantees it ends with a complete line. This is the only part of the code that allocates and copies a buffer, and since those operations are small, the expected performance impact is negligible.
private static final int LINE_ENDING_SEARCH_WINDOW = 8 * 1024;

private long maxSizePerFileInBytes;
private Path outputDirectory;
private Path tempDir;

private void split(Path fileToSplit) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(fileToSplit.toFile(), "r");
         FileChannel inputChannel = raf.getChannel()) {

        long fileSize = raf.length();
        long position = 0;
        int fileCounter = 1;

        while (position < fileSize) {
            // Calculate end position (try to get close to max size)
            long targetEndPosition = Math.min(position + maxSizePerFileInBytes, fileSize);

            // If we're not at the end of the file, find the last line ending before max size
            long endPosition = targetEndPosition;
            if (endPosition < fileSize) {
                endPosition = findLastLineEndBeforePosition(raf, position, targetEndPosition);
            }

            long chunkSize = endPosition - position;
            var outputFilePath = tempDir.resolve(fileToSplit.getFileName() + "_part" + fileCounter);

            try (FileOutputStream fos = new FileOutputStream(outputFilePath.toFile());
                 FileChannel outputChannel = fos.getChannel()) {
                // Disk-to-disk transfer of the whole chunk, no user-space buffer involved
                inputChannel.transferTo(position, chunkSize, outputChannel);
            }

            position = endPosition;
            fileCounter++;
        }
    }
}

private long findLastLineEndBeforePosition(RandomAccessFile raf, long startPosition,
        long maxPosition) throws IOException {
    long originalPosition = raf.getFilePointer();
    try {
        int bufferSize = LINE_ENDING_SEARCH_WINDOW;
        long chunkSize = maxPosition - startPosition;
        if (chunkSize < bufferSize) {
            bufferSize = (int) chunkSize;
        }

        byte[] buffer = new byte[bufferSize];
        long searchPos = maxPosition;

        while (searchPos > startPosition) {
            long distanceToStart = searchPos - startPosition;
            int bytesToRead = (int) Math.min(bufferSize, distanceToStart);
            long readStartPos = searchPos - bytesToRead;

            raf.seek(readStartPos);
            int bytesRead = raf.read(buffer, 0, bytesToRead);
            if (bytesRead <= 0) break;

            // Search backwards through the buffer for a newline
            for (int i = bytesRead - 1; i >= 0; i--) {
                if (buffer[i] == '\n') {
                    return readStartPos + i + 1;
                }
            }

            searchPos -= bytesRead;
        }

        throw new IllegalArgumentException(
                "File cannot be split: no newline found within the search limits.");
    } finally {
        raf.seek(originalPosition);
    }
}
The findLastLineEndBeforePosition method has some limitations. It only looks for Unix-style line endings (\n), very long lines can trigger a large number of backward read iterations, and a file containing a single line larger than maxSizePerFileInBytes cannot be split at all. However, it works well for scenarios such as splitting access log files, which typically have short lines and a huge number of entries.
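As a sketch of how the first limitation could be lifted, the inner backward scan could treat a lone \r (classic Mac line ending) as a terminator as well. This is my own variation, not part of the original code; Windows-style \r\n files already work, because each of their lines still ends with \n.

// Hypothetical replacement for the inner loop above: returns the index just after
// the last line terminator found in buffer[0..bytesRead), or -1 if there is none.
private static int lastLineEnd(byte[] buffer, int bytesRead) {
    for (int i = bytesRead - 1; i >= 0; i--) {
        if (buffer[i] == '\n') {
            return i + 1;
        }
        // Accept a lone '\r' only when the next byte is visible and is not '\n',
        // so a "\r\n" pair spanning the read window is never split in the middle.
        if (buffer[i] == '\r' && i + 1 < bytesRead && buffer[i + 1] != '\n') {
            return i + 1;
        }
    }
    return -1;
}

The enclosing loop would then return readStartPos plus the returned index whenever it is non-negative, instead of checking only for \n.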
Performance Analysis
In theory, splitting files with zero copy should be faster. Now it is time to measure how much faster, so I ran benchmarks for both implementations; these are the results.
Benchmark                              Mode  Cnt           Score        Error   Units
splitFile                              avgt   15        1179.429 ±     54.271   ms/op
splitFile:·gc.alloc.rate               avgt   15        1349.613 ±     60.903  MB/sec
splitFile:·gc.alloc.rate.norm          avgt   15  1694927403.481 ±   6060.581    B/op
splitFile:·gc.count                    avgt   15         718.000               counts
splitFile:·gc.time                     avgt   15         317.000                   ms
splitFileZeroCopy                      avgt   15          77.352 ±      1.339   ms/op
splitFileZeroCopy:·gc.alloc.rate       avgt   15          23.759 ±      0.465  MB/sec
splitFileZeroCopy:·gc.alloc.rate.norm  avgt   15     2555608.877 ±   8644.153    B/op
splitFileZeroCopy:·gc.count            avgt   15          10.000               counts
splitFileZeroCopy:·gc.time             avgt   15           5.000                   ms
Below is the benchmark code used to produce the results above; the input file was a little over 200 MB.
int maxSizePerFileInBytes = 1024 * 1024; // 1 MB chunks

@Setup
public void setup() throws Exception {
    inputFile = Path.of("/tmp/large_input.txt");
    outputDir = Path.of("/tmp/split_output");
    // Create a large file for benchmarking if it doesn't exist
    if (!Files.exists(inputFile)) {
        try (BufferedWriter writer = Files.newBufferedWriter(inputFile)) {
            for (int i = 0; i < 10_000_000; i++) {
                writer.write("This is line number " + i);
                writer.newLine();
            }
        }
    }
}

// splitter and zeroCopySplitter are fields holding the two implementations above
// (the field names here are chosen for readability).

@Benchmark
public void splitFile() throws Exception {
    splitter.split(inputFile, outputDir);
}

@Benchmark
public void splitFileZeroCopy() throws Exception {
    zeroCopySplitter.split(inputFile);
}
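For completeness, the gc.* rows in the results table come from JMH's GC profiler. A minimal way to launch the benchmark with that profiler enabled might look like this (the benchmark class name FileSplitterBenchmark is illustrative, since the article does not show it):

import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkRunner {
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include("FileSplitterBenchmark")  // illustrative class name
                .addProfiler(GCProfiler.class)     // produces the gc.alloc.rate / gc.count / gc.time rows
                .build();
        new Runner(options).run();
    }
}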
The zero-copy version shows a considerable speedup: roughly 77 ms versus roughly 1179 ms in this particular case, about 15 times faster, and it allocates around 2.5 MB per operation instead of roughly 1.7 GB. This performance advantage can be critical when dealing with large amounts of data or many files.
Conclusion
Efficiently splitting large text files requires thinking about system-level performance, not just the splitting logic. The naive approach highlights the cost of excessive allocation and copying, while the redesigned solution significantly improves performance by using zero copy while still preserving line integrity.
This shows the impact of system-aware programming and of understanding I/O mechanics when building faster, more resource-efficient tools for large text data such as logs or datasets.
That is all for this look at efficiently splitting text files in Java. If you are interested in more on the topic, keep an eye out for my other related articles!