As the title says, this is very simple: use the chardet library to detect a file's encoding, decode the file, and write it back out in the target encoding. It's the kind of small tool you reach for occasionally; nothing difficult. For example, you download someone else's project files from the internet and every file you open is garbled ......
coding
I've added fairly detailed comments. The bar for reading this shouldn't be high: if you use Python regularly and know a few common libraries, you'll be fine.
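Before reading the code, it helps to see what chardet's core call looks like on its own. A minimal sketch (the sample string is made up): chardet.detect takes raw bytes and returns a dict with the guessed encoding name, a confidence between 0 and 1, and a language field.

```python
import chardet

# chardet.detect() takes raw bytes and returns a dict with the guessed
# encoding name, a confidence value between 0 and 1, and a language field.
raw = "Hello, world".encode("ascii")
result = chardet.detect(raw)

# Pure ASCII input is detected with full confidence.
print(result["encoding"])    # → ascii
print(result["confidence"])  # → 1.0
```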
from pathlib import Path
import chardet
import re


def text_file_encoding_convert(f: Path, target_encoding: str, *, dry_run=False) -> (bool, str, float):
    '''
    Convert a single file to the target encoding
    @param f File path
    @param target_encoding Target encoding, e.g. utf-8
    @param dry_run When True, the source file is not actually modified.
    @return Three values: (success, estimated source encoding, estimated confidence)
    '''
    target_encoding = target_encoding.lower()  # Python's standard encoding names are all lowercase
    raw = f.read_bytes()
    result = chardet.detect(raw)
    encoding = result["encoding"].lower()  # Encoding name estimated by chardet
    confidence = result["confidence"]      # Confidence of the estimate
    flag = True
    # The single-pass for loop below avoids repetitive return statements:
    # after a break, control jumps straight to the final return.
    for _ in (1,):
        if encoding == target_encoding or (encoding == "ascii" and target_encoding == "utf-8"):
            # If the target encoding is the same as the source encoding, do nothing.
            # utf-8 is compatible with ASCII, so converting an ASCII file to utf-8
            # changes nothing; skip that case too.
            print(f"-> [NO CONVERSION NEEDED] {f}: {encoding} ==> [ {target_encoding} ]")
            break
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            print(f"!> Encoding err: {f}, detected: {encoding}, {confidence}.")
            flag = False
            break
        if dry_run:
            print(f"-> [ NO WET ] {f}: {encoding} ==> [ {target_encoding} ]")
        else:
            # Encode to a byte array with the target encoding, then write the bytes
            # back to the source file.
            # If you write as text, you run into the delightful CR LF line-ending problem:
            # CR LF line breaks in the source file silently become CR CR LF,
            # i.e. a bunch of extra blank lines.
            out = text.encode(target_encoding)
            f.write_bytes(out)
            print(f"-> {f}: {encoding} ==> [ {target_encoding} ]")
    return (flag, encoding, confidence)


def text_file_encoding_batch_convert(
    folder: Path,
    target_encoding: str,
    *,
    dry_run=True,
    recursive=False,
    pattern=r".*\.(c|h|txt|cxx|cpp|hpp|hxx|csv|asm)$",
    skip_when_error=True,
):
    '''
    Batch convert the encoding of text files in a directory
    @param folder Target directory
    @param target_encoding Target encoding
    @param dry_run Don't actually modify the source files, to avoid writing mistakes.
    @param recursive Include all files in subfolders.
    @param pattern Regular expression to filter text files by name; defaults to a few
           text types based on their suffixes.
    @param skip_when_error Default True: when converting a single file fails, print a
           message and skip it; otherwise abort.
    '''
    if recursive:
        flist = folder.rglob("*")
    else:
        flist = folder.glob("*")
    p = re.compile(pattern)  # Compile the pattern once; repeated matching is faster
    for f in flist:
        if not (f.is_file() and p.match(f.name)):
            continue
        ok, encoding, confidence = text_file_encoding_convert(f, target_encoding, dry_run=dry_run)
        if not ok:
            if skip_when_error:
                print("!> SKIP.")
            else:
                print("!> ABORT.")
                return
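The CR LF remark in the comment above deserves a concrete illustration. A minimal, self-contained sketch (the file name and contents are made up): re-encoding the decoded text and writing raw bytes preserves line endings byte-for-byte, whereas a text-mode write on Windows would translate each "\n" and turn CR LF into CR CR LF.

```python
import tempfile
from pathlib import Path

# Source bytes with Windows CR LF line endings, in a legacy encoding.
raw = "line one\r\nline two\r\n".encode("gbk")
text = raw.decode("gbk")

with tempfile.TemporaryDirectory() as d:
    f = Path(d) / "demo.txt"
    # write_bytes stores the encoded text byte-for-byte, so the CR LF
    # endings survive unchanged.  A text-mode write on Windows would
    # translate each "\n" and produce CR CR LF instead.
    f.write_bytes(text.encode("utf-8"))
    print(f.read_bytes() == b"line one\r\nline two\r\n")  # → True
```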
usage
Since we're batch converting files, just call the second function, like this:
folder = Path(r"D:\Downloads\Some shit\\")
text_file_encoding_batch_convert(folder, "utf-8", recursive=True)
The target directory is D:\Downloads\Some shit\, wrapped in a Path object and passed as the first argument. The second argument is the target encoding. The third argument must be passed by its keyword, recursive; it specifies whether to traverse subfolders and defaults to False, to avoid accidents. If you run it as-is, the output should look like this:
-> [ NO WET ] : gb2312 ==> [ utf-8 ]
[NO WET] means DRY [doge]: the source files are not actually modified, so you can eyeball the output and check that it looks right. To actually run the conversion, add one more argument:
text_file_encoding_batch_convert(folder, "utf-8", recursive=True, dry_run=False)
By the way, the output messages are all in English, because everyone has probably run into garbled Chinese console output at some point, and there isn't much text anyway. As for the two remaining parameters: skip_when_error can be left alone, it is rarely useful; pattern is a regular expression matched against file names, and only matching files are processed, so you can set your own. The default pattern only matches a few text-file types that commonly suffer from garbling; matching too broadly might also wrongly convert files, so if you need more types, just add them inside the parentheses.
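For instance, adding a suffix to the alternation group is enough to cover another file type. A small sketch (the added suffixes md and ini are just examples; the \. anchors the match to an actual file extension):

```python
import re

# Extend the default suffix list by adding extensions inside the
# alternation group.  The "\." ensures we match a real suffix, so that
# e.g. "march" does not match just because it ends in "h".
pattern = r".*\.(c|h|txt|cxx|cpp|hpp|hxx|csv|asm|md|ini)$"
p = re.compile(pattern)

print(bool(p.match("notes.md")))     # → True   (md was added to the group)
print(bool(p.match("main.cpp")))     # → True   (one of the defaults)
print(bool(p.match("archive.zip")))  # → False  (zip is not in the group)
```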
By the way, the code above, like this article, is licensed under CC BY-SA 4.0.