
Java methods and techniques for efficiently splitting text files

Preface

I had heard about zero-copy technology before, and it always felt profound and out of reach 🥸

Every framework I looked into seemed to use zero-copy technology; Netty, for example, relies on it.

Then I came across an article that finally demystified zero copy for me. It turns out I can use zero-copy technology in my own day-to-day work too, and today I'd like to share it with you.

A common but inefficient approach

When faced with the task of splitting a text file into chunks of a maximum size, we might first write code like the following:

    // Uses java.io.BufferedReader, java.io.BufferedWriter, java.io.IOException,
    // java.nio.file.Files and java.nio.file.Path
    private static final long maxFileSizeBytes = 10 * 1024 * 1024; // Default 10MB

    public void split(Path inputFile, Path outputDir) throws IOException {
        if (!Files.exists(inputFile)) {
            throw new IOException("The input file does not exist: " + inputFile);
        }
        if (Files.size(inputFile) == 0) {
            throw new IOException("The input file is empty: " + inputFile);
        }

        Files.createDirectories(outputDir);

        try (BufferedReader reader = Files.newBufferedReader(inputFile)) {
            int fileIndex = 0;
            long currentSize = 0;
            BufferedWriter writer = null;
            try {
                writer = newWriter(outputDir, fileIndex++);

                String line;
                while ((line = reader.readLine()) != null) {
                    byte[] lineBytes = (line + System.lineSeparator()).getBytes();
                    // Start a new part file once the current one would exceed the limit
                    if (currentSize + lineBytes.length > maxFileSizeBytes) {
                        writer.close();
                        writer = newWriter(outputDir, fileIndex++);
                        currentSize = 0;
                    }
                    writer.write(line);
                    writer.newLine();
                    currentSize += lineBytes.length;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
    }

    private BufferedWriter newWriter(Path dir, int index) throws IOException {
        Path filePath = dir.resolve("part_" + index + ".txt");
        return Files.newBufferedWriter(filePath);
    }

Efficiency analysis

This code works, but it is very inefficient at splitting large files into chunks.

It performs a heap allocation for every line, creating and discarding a large number of temporary objects (strings, byte arrays).
There is also a less obvious problem: the data is copied through multiple buffers, with context switches between user mode and kernel mode.

The details are as follows:

In BufferedReader.readLine():

  • Under the hood, read() is called on a FileReader or InputStreamReader.
  • Data is copied from a kernel-space buffer into a user-space buffer.
  • The bytes are then parsed into a Java String (a heap allocation).

In getBytes():

  • The String is converted into a new byte[], which is yet another heap allocation.

In BufferedWriter.write():

  • The byte/char data is taken from user space.
  • Calling write() involves copying from user space into kernel space.
  • Finally, the data is flushed to disk.

As a result, the data moves back and forth between kernel space and user space many times, and extra heap allocations are generated along the way. Besides garbage collection pressure, this has the following consequences:

  • Memory bandwidth is wasted on copying between buffers.
  • CPU usage is high for what is essentially a disk-to-disk transfer.
  • The operating system could have handled the bulk copy directly (via DMA or other optimized I/O), but the Java code forfeits that efficiency by pulling every byte through user-space logic.

An efficient approach

So, how can we avoid the above problems?

The answer is to use zero copy as much as possible, that is, to avoid leaving kernel space wherever we can. In Java this can be done with FileChannel's long transferTo(long position, long count, WritableByteChannel target) method. It performs a direct disk-to-disk transfer and lets the operating system apply its own I/O optimizations.
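
As a minimal sketch of the API (the file paths are placeholders, not from the original article), a whole-file zero-copy copy with transferTo looks like this:

// Minimal sketch: copy a file purely via FileChannel.transferTo (placeholder paths).
// transferTo may move fewer bytes than requested, so we loop until everything is copied.
try (FileChannel in = FileChannel.open(Path.of("input.txt"), StandardOpenOption.READ);
     FileChannel out = FileChannel.open(Path.of("copy.txt"),
             StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
    long size = in.size();
    long position = 0;
    while (position < size) {
        position += in.transferTo(position, size - position, out);
    }
}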

There is a problem, though: the approach just described transfers raw byte blocks, which can break the integrity of a line. To solve this, we need a strategy that keeps lines intact even though we are moving the file around in byte segments.

Without that concern it would be easy: just call transferTo for each block, incrementing the position as position = position + maxFileSize until no more data can be transferred, as sketched below.
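
Ignoring line boundaries for a moment, that simple loop might look like the following sketch (the field and variable names here are illustrative, not the article's final code):

// Sketch of naive fixed-size chunking with transferTo; line boundaries are ignored.
// inputChannel, outputDir and maxFileSize are assumed to be set up elsewhere.
long fileSize = inputChannel.size();
long position = 0;
int part = 1;
while (position < fileSize) {
    long count = Math.min(maxFileSize, fileSize - position);
    try (FileChannel out = FileChannel.open(outputDir.resolve("part_" + part++),
            StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
        // transferTo may move fewer bytes than requested; advance by what was actually copied
        position += inputChannel.transferTo(position, count, out);
    }
}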

To maintain line integrity, we need to determine where the last complete line ends within each byte chunk. To do this, we first seek to the intended end of the chunk and then scan backwards to find the previous line break. This gives us an accurate byte count for the chunk and ensures that the final line is included unbroken. This is the only part of the code that allocates and copies a buffer, and since these operations are minimal, the expected performance impact is negligible.

// Uses java.io.RandomAccessFile, java.io.FileOutputStream, java.io.IOException,
// java.nio.channels.FileChannel and java.nio.file.Path
private static final int LINE_ENDING_SEARCH_WINDOW = 8 * 1024;

private long maxSizePerFileInBytes;
private Path outputDirectory;
private Path tempDir;

private void split(Path fileToSplit) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(fileToSplit.toFile(), "r");
            FileChannel inputChannel = raf.getChannel()) {

        long fileSize = raf.length();
        long position = 0;
        int fileCounter = 1;

        while (position < fileSize) {
            // Calculate end position (try to get close to max size)
            long targetEndPosition = Math.min(position + maxSizePerFileInBytes, fileSize);

            // If we're not at the end of the file, find the last line ending before max size
            long endPosition = targetEndPosition;
            if (endPosition < fileSize) {
                endPosition = findLastLineEndBeforePosition(raf, position, targetEndPosition);
            }

            long chunkSize = endPosition - position;
            var outputFilePath = outputDirectory.resolve(fileToSplit.getFileName() + "_part" + fileCounter);
            try (FileOutputStream fos = new FileOutputStream(outputFilePath.toFile());
                    FileChannel outputChannel = fos.getChannel()) {
                // Zero copy: let the OS move the bytes directly between the channels
                inputChannel.transferTo(position, chunkSize, outputChannel);
            }

            position = endPosition;
            fileCounter++;
        }

    }
}

private long findLastLineEndBeforePosition(RandomAccessFile raf, long startPosition, long maxPosition)
        throws IOException {
    long originalPosition = raf.getFilePointer();

    try {
        int bufferSize = LINE_ENDING_SEARCH_WINDOW;
        long chunkSize = maxPosition - startPosition;

        if (chunkSize < bufferSize) {
            bufferSize = (int) chunkSize;
        }

        byte[] buffer = new byte[bufferSize];
        long searchPos = maxPosition;

        while (searchPos > startPosition) {
            long distanceToStart = searchPos - startPosition;
            int bytesToRead = (int) Math.min(bufferSize, distanceToStart);

            long readStartPos = searchPos - bytesToRead;
            raf.seek(readStartPos);

            int bytesRead = raf.read(buffer, 0, bytesToRead);
            if (bytesRead <= 0)
                break;

            // Search backwards through the buffer for newline
            for (int i = bytesRead - 1; i >= 0; i--) {
                if (buffer[i] == '\n') {
                    return readStartPos + i + 1;
                }
            }

            searchPos -= bytesRead;
        }

        throw new IllegalArgumentException(
                "File cannot be split: no newline found within the limits.");
    } finally {
        raf.seek(originalPosition);
    }
}

The findLastLineEndBeforePosition method has some limitations. Specifically, it only recognizes Unix-style line endings (\n), very long lines can cause a large number of backward read iterations, and a file containing a line longer than maxSizePerFileInBytes cannot be split at all. However, it is a great fit for scenarios such as splitting access log files, which typically have short lines and a huge number of entries.
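
If that last limitation is a concern, one possible mitigation (my own suggestion, not part of the original code) is to fall back to a hard cut at the target position instead of failing, accepting that a single over-long line gets broken:

// Hypothetical fallback: if no newline is found within the chunk, cut at the
// target position anyway; only the over-long line spanning the cut is broken.
long endPosition;
try {
    endPosition = findLastLineEndBeforePosition(raf, position, targetEndPosition);
} catch (IllegalArgumentException noNewlineFound) {
    endPosition = targetEndPosition;
}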

Performance Analysis

In theory, our zero-copy file splitting should be faster; now it is time to measure how much faster. To do this, I ran benchmarks for both implementations, and these are the results.

Benchmark                                                    Mode  Cnt           Score      Error   Units
splitFile                                                    avgt   15        1179.429 ±   54.271   ms/op
splitFile:·gc.alloc.rate                                     avgt   15        1349.613 ±   60.903  MB/sec
splitFile:·gc.alloc.rate.norm                                avgt   15  1694927403.481 ± 6060.581    B/op
splitFile:·gc.count                                          avgt   15         718.000             counts
splitFile:·gc.time                                           avgt   15         317.000                 ms
splitFileZeroCopy                                            avgt   15          77.352 ±    1.339   ms/op
splitFileZeroCopy:·gc.alloc.rate                             avgt   15          23.759 ±    0.465  MB/sec
splitFileZeroCopy:·gc.alloc.rate.norm                        avgt   15     2555608.877 ± 8644.153    B/op
splitFileZeroCopy:·gc.count                                  avgt   15          10.000             counts
splitFileZeroCopy:·gc.time                                   avgt   15           5.000                 ms

Below is the benchmark code used for the above results; the input file is a little over 200 MB.

int maxSizePerFileInBytes = 1024 * 1024; // 1 MB chunks

public void setup() throws Exception {
    inputFile = Path.of("/tmp/large_input.txt");
    outputDir = Path.of("/tmp/split_output");
    // Create a large file for benchmarking if it doesn't exist
    if (!Files.exists(inputFile)) {
        try (BufferedWriter writer = Files.newBufferedWriter(inputFile)) {
            for (int i = 0; i < 10_000_000; i++) {
                writer.write("This is line number " + i);
                writer.newLine();
            }
        }
    }
}

public void splitFile() throws Exception {
    // The naive BufferedReader/BufferedWriter implementation from the first section
    split(inputFile, outputDir);
}

public void splitFileZeroCopy() throws Exception {
    // The zero-copy FileChannel.transferTo implementation
    split(inputFile);
}
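
For reference, here is a sketch of the JMH scaffolding these methods could sit in; the annotations and fork/warmup settings are my assumptions and were not shown in the original benchmark code:

// Hypothetical JMH harness around the setup and benchmark methods shown above.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(3)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
public class FileSplitBenchmark {

    Path inputFile;
    Path outputDir;

    @Setup
    public void setup() throws Exception {
        // create the 200+ MB input file as shown above
    }

    @Benchmark
    public void splitFile() throws Exception {
        // call the naive BufferedReader/BufferedWriter splitter
    }

    @Benchmark
    public void splitFileZeroCopy() throws Exception {
        // call the zero-copy FileChannel.transferTo splitter
    }
}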

The zero-copy version showed a considerable speedup, taking only 77 milliseconds, whereas the naive implementation took 1179 milliseconds for this particular case. This performance advantage can be critical when dealing with large amounts of data or many files.

Conclusion

Efficient splitting of large text files requires system-level performance considerations, not just correct logic. While the basic approach highlights the problem of excessive memory copying and allocation, the redesigned solution significantly improves performance by using zero-copy transfers while keeping lines intact.

This demonstrates the impact of system-aware programming and of understanding I/O mechanisms when building faster, more resource-efficient tools for handling large text data such as logs or datasets.
