In Java, a string may contain multi-byte characters (such as Chinese), so truncating it directly at a byte offset can split a character and produce garbled text or an inaccurate result. Here are several ways to truncate a string by byte length:
Method 1: Use String's getBytes method
public static String substringByBytes(String str, int byteLength) {
    if (str == null || str.isEmpty() || byteLength <= 0) {
        return "";
    }
    // getBytes() with no argument uses the platform default charset
    byte[] bytes = str.getBytes();
    if (byteLength >= bytes.length) {
        return str;
    }
    // Walk the characters so the cut never lands inside a multi-byte character
    int len = 0;
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        // Approximation: chars up to U+00FF count as 1 byte, all others as 2
        // (close to GBK for Chinese text; it does not match UTF-8)
        len += (c <= 255) ? 1 : 2;
        if (len > byteLength) {
            return str.substring(0, i);
        } else if (len == byteLength) {
            return str.substring(0, i + 1);
        }
    }
    return str;
}
Method 2: Specify character encoding processing
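// Requires: import java.io.UnsupportedEncodingException;
public static String substringByBytes(String str, int byteLength, String charsetName)
        throws UnsupportedEncodingException {
    if (str == null || str.isEmpty() || byteLength <= 0) {
        return "";
    }
    byte[] bytes = str.getBytes(charsetName);
    if (byteLength >= bytes.length) {
        return str;
    }
    // Decode the first byteLength bytes. If the cut falls inside a multi-byte
    // character, the incomplete trailing bytes decode to a replacement character.
    return new String(bytes, 0, byteLength, charsetName);
}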
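To make the weakness of Method 2 concrete, here is a minimal sketch of what happens when the byte cut lands inside a character. The class `NaiveCutDemo` and the helper `naiveCut` are hypothetical names used only for this illustration; the sample string is written with a Unicode escape so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

class NaiveCutDemo {
    // Hypothetical helper: decode the first n bytes of the UTF-8 encoding
    // without checking character boundaries (same idea as Method 2).
    static String naiveCut(String s, int n) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, 0, Math.min(n, bytes.length), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ab" followed by the Chinese character U+4E2D: 1 + 1 + 3 bytes in UTF-8
        String s = "ab\u4E2D";
        // Cutting at 3 bytes keeps only the first byte of the 3-byte character
        String cut = naiveCut(s, 3);
        System.out.println(cut.length());               // 3
        System.out.println(cut.charAt(2) == '\uFFFD');  // true: replacement character
    }
}
```

The result still has three characters, but the last one is U+FFFD rather than real text, which is why the boundary-aware variants are usually preferable.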
Method 3: More accurate character encoding processing
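// Requires: import java.io.UnsupportedEncodingException;
// Note: this has the same signature as Method 2, so keep only one of them in a given class.
public static String substringByBytes(String str, int maxBytes, String charsetName)
        throws UnsupportedEncodingException {
    if (str == null || charsetName == null || str.isEmpty()) {
        return str;
    }
    byte[] bytes = str.getBytes(charsetName);
    if (bytes.length <= maxBytes) {
        return str;
    }
    // Add up the encoded size of each char and stop before exceeding maxBytes,
    // so a character is never cut in half
    int nBytes = 0;
    int i = 0;
    for (; i < str.length(); i++) {
        char c = str.charAt(i);
        // Surrogate pairs (e.g. emoji) are measured one char at a time here,
        // so their size is not counted correctly
        int charBytes = String.valueOf(c).getBytes(charsetName).length;
        if (nBytes + charBytes > maxBytes) {
            break;
        }
        nBytes += charBytes;
    }
    return str.substring(0, i);
}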
Example of usage
public static void main(String[] args) { String testStr = "Hello, Java World! Hello World!"; try { (substringByBytes(testStr, 10)); // Output: Hello, J (substringByBytes(testStr, 15, "UTF-8")); // Output: Hello, Java (substringByBytes(testStr, 20, "GBK")); // Output: Hello, Java world! } catch (UnsupportedEncodingException e) { (); } }
Things to note
The number of bytes occupied by characters under different encodings is different:
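In UTF-8 encoding, most Chinese characters take 3 bytes (characters outside the Basic Multilingual Plane take 4).
In GBK encoding, a Chinese character takes 2 bytes.
In ISO-8859-1 encoding, every character takes 1 byte; characters it cannot represent, such as Chinese, are replaced with '?' when encoding.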
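The byte counts listed above can be checked directly with `String.getBytes(Charset)`. The following is a small sketch; the class name `ByteCountDemo` and the helper `byteCount` are made up for the example, and GBK is only queried if the running JVM reports it as supported.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteCountDemo {
    // Hypothetical helper: number of bytes the string occupies in the given charset
    static int byteCount(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "\u4E2D"; // a single Chinese character (U+4E2D)
        System.out.println(byteCount(s, StandardCharsets.UTF_8));      // 3
        // ISO-8859-1 cannot represent the character, so '?' is substituted
        System.out.println(byteCount(s, StandardCharsets.ISO_8859_1)); // 1
        if (Charset.isSupported("GBK")) {
            System.out.println(byteCount(s, Charset.forName("GBK"))); // 2
        }
    }
}
```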
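When truncating, respect the character boundaries of the target encoding; cutting through a multi-byte character produces garbled text.
Performance considerations: when truncating large strings frequently, avoid re-encoding the whole string on every call; encode once and reuse the byte array, or encode only as many characters as fit.
Emoji and other supplementary characters are stored as two Java chars (a surrogate pair), so char-by-char loops such as Method 3 can miscount or split them; they need to be handled as whole code points.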
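One way to address both the emoji and the performance notes is to let a `CharsetEncoder` write into a `ByteBuffer` whose capacity is the byte limit: the encoder stops before any character that does not fit completely, surrogate pairs included, and it never encodes the rest of the string. This is a sketch of that alternative technique, not one of the methods above; `EncoderTruncate` and `truncate` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncoderTruncate {
    // Hypothetical helper: longest prefix of s whose encoding fits in maxBytes
    static String truncate(String s, int maxBytes, Charset cs) {
        if (s == null || maxBytes <= 0) {
            return "";
        }
        CharsetEncoder encoder = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(s);
        // Encoding stops (overflow) as soon as the next character does not fit;
        // in.position() then equals the number of chars that were fully encoded
        encoder.encode(in, out, true);
        return s.substring(0, in.position());
    }

    public static void main(String[] args) {
        // 'a', an emoji (U+1F600, a surrogate pair, 4 bytes in UTF-8), 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println(truncate(s, 4, StandardCharsets.UTF_8).length()); // 1: emoji does not fit
        System.out.println(truncate(s, 5, StandardCharsets.UTF_8).length()); // 3: 'a' plus both surrogates
    }
}
```

Because the cut is expressed as a char index into the original string, the result is always a clean prefix and no decoding step is needed.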
Supplementary methods
Method 1:
Solution Design
1. Byte length calculation
First, we need to calculate the byte length of the string. In Java, the String.getBytes() method converts the string into a byte array, and the length of that array is the byte length.
2. Truncation logic
Given the byte length, we keep that many bytes from the start of the string. If the cut does not fall on a character boundary, we move it back to the nearest boundary so that the result is still a valid UTF-8 sequence.
3. Exception handling
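During truncation, the cut may leave an invalid (incomplete) UTF-8 sequence at the end. Note that new String(bytes, ...) does not throw in this case; it silently substitutes the replacement character U+FFFD. To detect the problem, either check the byte boundaries explicitly or decode with a CharsetDecoder set to CodingErrorAction.REPORT, which throws CharacterCodingException.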
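As an illustration of the exception-based check, the sketch below decodes a byte prefix with a strict `CharsetDecoder` and reports whether it is well-formed UTF-8. The class `StrictDecodeDemo` and the helper `isValidUtf8Prefix` are hypothetical names chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {
    // Hypothetical helper: true if the first `length` bytes decode cleanly as UTF-8
    static boolean isValidUtf8Prefix(byte[] bytes, int length) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // The one-argument decode treats the buffer as the complete input,
            // so a truncated trailing character is reported as malformed
            decoder.decode(ByteBuffer.wrap(bytes, 0, length));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "ab" plus one 3-byte Chinese character (U+4E2D)
        byte[] bytes = "ab\u4E2D".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8Prefix(bytes, 2)); // true
        System.out.println(isValidUtf8Prefix(bytes, 3)); // false: cut inside the character
        System.out.println(isValidUtf8Prefix(bytes, 5)); // true
    }
}
```

A caller could shrink the length until this check passes, although inspecting the bytes at the cut position directly avoids the repeated decoding.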
Code implementation
import java.nio.charset.StandardCharsets;

public class ByteLengthStringCutter {
    public static String cutByByteLength(String input, int byteLength) {
        if (input == null || byteLength <= 0) {
            return "";
        }
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= byteLength) {
            return input;
        }
        // Casting each byte to char would garble multi-byte characters and never
        // throws, so instead move the cut back to a character boundary:
        // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        int end = byteLength;
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }
}
Method 2:
Complete code
public class SubstringDemo {
    public static void main(String[] args) {
        // The string to truncate and the number of bytes to keep
        String str = "This is a test string";
        int length = 5; // Byte length to keep
        try {
            // Convert the string to a byte array
            byte[] bytes = str.getBytes("UTF-8");
            // Cut at the byte level; Math.min guards against length > bytes.length.
            // A cut inside a multi-byte character decodes to U+FFFD.
            String result = new String(bytes, 0, Math.min(length, bytes.length), "UTF-8");
            // Print the truncated result
            System.out.println("The result after truncating is: " + result);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}