SoFunction
Updated on 2024-12-16

A simple Python implementation of separating a large file into multiple smaller files by paragraphs.

This article describes a simple way to split a large file into multiple small files by paragraph in Python. It is shared here for your reference:

I was helping a classmate process a corpus today. The corpus file is fairly large and uses two consecutive line breaks as paragraph markers. He wanted to split it into several small files by paragraph, with every 3 paragraphs forming a new file. I had not done anything similar before, and the methods I found online all looked a bit complicated, so after some experimenting I wrote a short piece of code myself, which solved the problem.

The basic idea is as follows. First, read the contents of the original file and split it on \n\n with a regular expression; the result is a list in which each element holds one slice (paragraph) of the content. Then create a handle for writing to the first output file. Next, traverse the list of slices and write the current slice, then check whether 3 paragraphs have been written to the current file. If not, continue with the next slice. If so, close the current write handle and create a new one with a different file name; that ends the iteration, and the new handle waits for the next slice.
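
As a small illustration of the slicing step only, the sketch below splits an in-memory string on two consecutive line breaks and groups the slices three at a time. The sample text and the names `sample`, `paragraphs`, and `groups` are made up for illustration and do not appear in the original script.

```python
import re

# Hypothetical sample text: seven short paragraphs separated by blank lines.
sample = "p1\n\np2\n\np3\n\np4\n\np5\n\np6\n\np7"

# Split on two consecutive line breaks; the result is a list of paragraphs.
paragraphs = re.split(r"\n\n", sample)

# Group the paragraphs three at a time; the last group may be shorter.
groups = [paragraphs[i:i + 3] for i in range(0, len(paragraphs), 3)]

print(len(paragraphs))
print(groups)
```

Each inner list in `groups` corresponds to the content of one output file.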

# -*- coding: utf-8 -*-
import re

p = re.compile(r'\n\n')  # Paragraph separator: two consecutive line breaks
with open('files/office.txt', 'r', encoding='utf8') as f:
  fileContent = f.read()  # Read the contents of the file
paraList = p.split(fileContent)  # Slice the text on the paragraph separator
# Create the first write handle. The original file name was lost from this line;
# '0.txt' is an assumption chosen to match the numbering used below.
fileWriter = open('files/0.txt', 'a', encoding='utf8')
for paraIndex in range(len(paraList)):  # Iterate through the list of slices
  fileWriter.write(paraList[paraIndex])  # Write the current slice to the current file
  if (paraIndex + 1) % 3 == 0:  # Have 3 slices been written to this file?
    fileWriter.close()  # Close the current handle
    # Create a new handle for the next slices. Note the file name trick:
    # integer division (//) yields 1.txt, 2.txt, ... (plain / would give 1.0.txt in Python 3).
    fileWriter = open('files/' + str((paraIndex + 1) // 3) + '.txt', 'a', encoding='utf8')
fileWriter.close()  # Close the last write handle created
print('finished')
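
The same idea can also be wrapped in a reusable function. The following is only a sketch of one possible variation, not the author's code: the function name `split_by_paragraphs` and its parameters `src`, `out_dir`, and `per_file` are hypothetical. It assumes the whole file fits in memory, as the script above does. Unlike the script, it skips empty slices, puts the separator back between paragraphs, and overwrites output files instead of appending to them.

```python
import re
from pathlib import Path


def split_by_paragraphs(src, out_dir, per_file=3):
    """Write every `per_file` paragraphs of `src` to a numbered file in `out_dir`."""
    text = Path(src).read_text(encoding="utf8")
    # Drop empty slices, e.g. the one produced by a trailing separator.
    paragraphs = [p for p in re.split(r"\n\n", text) if p.strip()]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for n, start in enumerate(range(0, len(paragraphs), per_file)):
        target = out / f"{n}.txt"
        # Rejoin with the separator so paragraphs stay distinct in the output.
        chunk = "\n\n".join(paragraphs[start:start + per_file])
        target.write_text(chunk, encoding="utf8")
        written.append(target)
    return written
```

A call such as `split_by_paragraphs('files/office.txt', 'files/parts')` would then return the list of paths it wrote; the output directory name here is likewise only an example.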


I hope this article is helpful for your Python programming.