SoFunction
Updated on 2025-05-06

Truncating strings by byte length in Java

In Java, a string may contain multi-byte characters (such as Chinese), so cutting it at a fixed byte length can split a character in half and produce garbled text or an inaccurate result. Here are several ways to truncate a string by byte length:

Method 1: Use String's getBytes method

public static String substringByBytes(String str, int byteLength) {
    if (str == null || str.isEmpty() || byteLength <= 0) {
        return "";
    }

    // Note: getBytes() with no argument uses the platform default charset
    byte[] bytes = str.getBytes();
    if (byteLength >= bytes.length) {
        return str;
    }

    // Handle the case where the cut position falls inside a multi-byte character
    int len = 0;
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        // Rough estimate: characters up to U+00FF count as 1 byte, all others as 2.
        // This suits ASCII plus Chinese under GBK; it is not valid for UTF-8.
        len += (c <= 255) ? 1 : 2;
        if (len > byteLength) {
            return str.substring(0, i);
        } else if (len == byteLength) {
            return str.substring(0, i + 1);
        }
    }
    return str;
}

Method 2: Specify the character encoding

public static String substringByBytes(String str, int byteLength, String charsetName)
        throws UnsupportedEncodingException {
    if (str == null || str.isEmpty() || byteLength <= 0) {
        return "";
    }

    byte[] bytes = str.getBytes(charsetName);
    if (byteLength >= bytes.length) {
        return str;
    }

    // Create a new string from the first byteLength bytes.
    // Caution: if the cut falls inside a multi-byte character, the partial
    // character is decoded as a replacement character (typically U+FFFD).
    return new String(bytes, 0, byteLength, charsetName);
}
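To make the risk of cutting the raw byte array concrete, here is a minimal sketch of what happens when the cut lands inside a character. The class name NaiveByteCutDemo and the helper naiveCut are hypothetical and exist only for this illustration; UTF-8 is assumed.

```java
import java.nio.charset.StandardCharsets;

class NaiveByteCutDemo {
    // Hypothetical helper: cuts the UTF-8 bytes at maxBytes and decodes
    // whatever is left, without checking character boundaries.
    static String naiveCut(String s, int maxBytes) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int end = Math.min(Math.max(maxBytes, 0), bytes.length);
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a' is 1 byte; U+4E2D (a Chinese character) is 3 bytes in UTF-8
        String s = "a\u4E2D";

        // Keeping 2 bytes leaves 'a' plus the first byte of the second character.
        String cut = naiveCut(s, 2);

        // The incomplete character is decoded as the replacement character U+FFFD.
        System.out.println(cut.contains("\uFFFD"));

        // Keeping all 4 bytes returns the original string unchanged.
        System.out.println(naiveCut(s, 4).equals(s));
    }
}
```

The result is still a valid Java string, so nothing fails at runtime; the damage only shows up as a stray replacement character at the end of the text.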

Method 3: More accurate handling of character encoding

public static String substringByBytes(String str, int maxBytes, String charsetName)
        throws UnsupportedEncodingException {
    if (str == null || charsetName == null || str.isEmpty()) {
        return str;
    }

    byte[] bytes = str.getBytes(charsetName);
    if (bytes.length <= maxBytes) {
        return str;
    }

    // Avoid the half-character problem by adding up whole characters only.
    // Note: this works one char at a time, so surrogate pairs (for example
    // many emoji) are not measured correctly.
    int nBytes = 0;
    int i = 0;
    for (; i < str.length(); i++) {
        char c = str.charAt(i);
        int charBytes = String.valueOf(c).getBytes(charsetName).length;
        if (nBytes + charBytes > maxBytes) {
            break;
        }
        nBytes += charBytes;
    }

    return str.substring(0, i);
}
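As an alternative to measuring characters one by one, the standard java.nio.charset.CharsetEncoder can encode into an output buffer of exactly maxBytes and stop when the next character no longer fits. The sketch below is one possible variant, not part of the original methods; the class name EncoderTruncate and the method truncate are hypothetical, and it takes a Charset object instead of a charset name.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncoderTruncate {
    // Returns the longest prefix of str whose encoding fits into maxBytes.
    static String truncate(String str, int maxBytes, Charset charset) {
        if (str == null || str.isEmpty() || maxBytes <= 0) {
            return "";
        }
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        CharBuffer in = CharBuffer.wrap(str);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);

        // The encoder stops with an overflow result when the next character
        // does not fit; in.position() then marks the last whole character.
        encoder.encode(in, out, true);
        return str.substring(0, in.position());
    }

    public static void main(String[] args) {
        // 'a' (1 byte) + U+4E2D (3 bytes) + 'b' (1 byte) in UTF-8
        String s = "a\u4E2Db";
        System.out.println(truncate(s, 3, StandardCharsets.UTF_8)); // a
        System.out.println(truncate(s, 4, StandardCharsets.UTF_8).length()); // 2
    }
}
```

Because the encoder consumes a surrogate pair as one unit, this variant does not split emoji, and it avoids allocating a temporary string for every character.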

Usage example

public static void main(String[] args) {
    // This sample string is pure ASCII, so every character takes 1 byte
    String testStr = "Hello, Java World! Hello World!";

    try {
        System.out.println(substringByBytes(testStr, 10)); // Output: Hello, Jav
        System.out.println(substringByBytes(testStr, 15, "UTF-8")); // Output: Hello, Java Wor
        System.out.println(substringByBytes(testStr, 20, "GBK")); // Output: Hello, Java World! H
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
}

Things to note

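The number of bytes a character occupies depends on the encoding:

In UTF-8, most Chinese characters take 3 bytes (ASCII characters take 1, and characters outside the Basic Multilingual Plane take 4).

In GBK, Chinese characters take 2 bytes.

In ISO-8859-1, every character takes 1 byte, but characters outside Latin-1, such as Chinese, cannot be represented and are encoded as ?.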

When truncating, respect the character boundaries of the encoding so that a multi-byte character is never split, which would produce garbled text.

Performance considerations: when large strings are truncated frequently, consider caching the byte array or using a more efficient algorithm.

Special characters such as emoji may need extra handling: many of them are stored in a Java String as a surrogate pair (two char values), so code that works one char at a time can miscount their bytes or split them.
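The notes above can be sketched in code by measuring whole code points instead of individual char values. This is only an illustration under stated assumptions: the class CodePointTruncate and its methods byteWidth and truncate are hypothetical names, and a Charset object is passed in place of a charset name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointTruncate {
    // Number of bytes a single code point occupies in the given charset.
    static int byteWidth(int codePoint, Charset charset) {
        return new String(Character.toChars(codePoint)).getBytes(charset).length;
    }

    // Longest prefix of str that fits into maxBytes, never splitting a
    // code point (and therefore never splitting a surrogate pair).
    static String truncate(String str, int maxBytes, Charset charset) {
        if (str == null || str.isEmpty() || maxBytes <= 0) {
            return "";
        }
        int used = 0;
        int i = 0;
        while (i < str.length()) {
            int cp = str.codePointAt(i);
            int width = byteWidth(cp, charset);
            if (used + width > maxBytes) {
                break;
            }
            used += width;
            i += Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
        }
        return str.substring(0, i);
    }

    public static void main(String[] args) {
        int han = 0x4E2D;    // a common Chinese character
        int emoji = 0x1F600; // an emoji outside the Basic Multilingual Plane
        System.out.println(byteWidth(han, StandardCharsets.UTF_8));   // 3
        System.out.println(byteWidth(emoji, StandardCharsets.UTF_8)); // 4

        String s = "a" + new String(Character.toChars(emoji)) + "b";
        System.out.println(truncate(s, 4, StandardCharsets.UTF_8));   // a
    }
}
```

Encoding each code point separately is simple but allocates a small string per character, so for long inputs an encoder-based approach may be a better fit.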

Supplementary methods

Method 1:

Solution Design

1. Byte length calculation

First, calculate the byte length of the string. In Java, the getBytes() method converts the string into a byte array, and the length of that array is the byte length.

2. Intercept logic

Given the requested byte length, take that many bytes from the start of the string. If the cut position falls inside a multi-byte character rather than on a character boundary, it must be adjusted so that the result is still a valid UTF-8 sequence.

3. Invalid sequence handling

Cutting in the middle of a character leaves an incomplete UTF-8 sequence. Java's String constructors do not throw an exception for malformed input; they substitute the replacement character U+FFFD. A catch block therefore cannot detect the problem, and the cut position has to be moved back to a character boundary before decoding.

Code implementation

import java.nio.charset.StandardCharsets;

public class ByteLengthStringCutter {
    public static String cutByByteLength(String input, int byteLength) {
        if (input == null || byteLength <= 0) {
            return "";
        }

        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= byteLength) {
            return input;
        }

        // Move the cut point back while it lands on a UTF-8 continuation
        // byte (bit pattern 10xxxxxx), so the result never ends mid-character.
        int end = byteLength;
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }
}

Method 2:

Complete code

public class SubstringDemo {

    public static void main(String[] args) {
        // The string to truncate and the target length
        String str = "This is a test string";
        int length = 5; // The byte length to keep

        try {
            // Convert the string to a byte array
            byte[] bytes = str.getBytes("UTF-8");

            // Cut the bytes; clamp the length so it never exceeds the array.
            // Note: a cut inside a multi-byte character decodes to U+FFFD.
            String result = new String(bytes, 0, Math.min(length, bytes.length), "UTF-8");

            // Print the truncated result
            System.out.println("The result after truncation is: " + result);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
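The demo above cuts the raw bytes directly, which is safe only when the cut happens to land on a character boundary. One way to keep the same byte-oriented approach while dropping an incomplete trailing character is to decode with a java.nio.charset.CharsetDecoder configured to ignore malformed input. The following is a sketch under that assumption; the class name DecoderTruncate and the method truncate are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DecoderTruncate {
    static String truncate(String str, int maxBytes, Charset charset) {
        if (str == null || maxBytes <= 0) {
            return "";
        }
        byte[] bytes = str.getBytes(charset);
        if (maxBytes >= bytes.length) {
            return str;
        }

        // IGNORE makes the decoder silently drop the incomplete bytes left
        // at the end of the cut instead of emitting U+FFFD.
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);

        ByteBuffer in = ByteBuffer.wrap(bytes, 0, maxBytes);
        CharBuffer out = CharBuffer.allocate(maxBytes); // never more chars than bytes
        decoder.decode(in, out, true);
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // 'a' (1 byte) + U+4E2D (3 bytes) + 'b' (1 byte) in UTF-8
        String s = "a\u4E2Db";
        System.out.println(truncate(s, 2, StandardCharsets.UTF_8)); // a
    }
}
```

Since decoding starts at the beginning of the byte array, only the final character can be incomplete, so ignoring malformed input removes at most that one partial character.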
