SoFunction
Updated on 2024-11-07

Reading a file in Python using a list or dictionary field schema

Preface

Python is a great tool for processing text data. Reading, splitting, filtering, and converting are all extremely simple, so developers do not have to deal with the complicated stream-based file handling that, say, Java requires. In my own work, complex text processing and computation jobs, including Streaming programs on Hadoop, are all written in Python.

The first step in text processing is loading the file into memory, which raises the question of how to map each column of the file to a variable in the program. The crudest approach is to reference the fields by index, like this:

# fields is the list after reading one line and splitting it by the separator character
user_id = fields[0]
user_name = fields[1]
user_type = fields[2]

With this approach, as soon as the column order changes or columns are added or removed, maintaining the code becomes a nightmare. Code like this should be eliminated.

This article recommends two more elegant ways to read the data. Both configure a field schema first and then read the data according to it; the schema can be either a dictionary or a list.
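To make the fragility concrete, here is a small sketch. The rows and the inserted signup-date column are made up for illustration; they are not from the original data.

```python
# Hypothetical rows: the second one comes from a newer file version in which
# a signup-date column was inserted at position 1.
old_row = ["1", "name1", "0"]
new_row = ["1", "2014-12-05", "name1", "0"]


def parse_by_index(fields):
    """Map fields to names using hard-coded positions."""
    return {
        "user_id": fields[0],
        "user_name": fields[1],
        "user_type": fields[2],
    }


old_user = parse_by_index(old_row)
new_user = parse_by_index(new_row)

print(old_user["user_name"])  # name1
print(new_user["user_name"])  # 2014-12-05 (wrong column, and no error is raised)
```

Nothing fails loudly here: the program keeps running with the wrong values, and every place that uses a hard-coded index has to be found and fixed by hand.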

Reading the file and splitting each line into a list of fields

First, read the file, split each line on the delimiter, and return the resulting list of fields for later processing.

The code is as follows:

def read_file_data(filepath):
    '''Read a file line by line and yield the fields of each line.
    @param filepath: the path of the file to read
    @return: a generator that yields, for each non-empty line, the list of fields obtained by splitting on tabs
    '''
    with open(filepath, 'r') as fin:
        for line in fin:
            # Strip the trailing newline and skip empty lines
            line = line.rstrip('\n')
            if not line:
                continue
            # Yield the current line's list of fields
            yield line.split('\t')

The yield keyword hands back one split row at a time, so the calling program can read each line with for fields in read_file_data(fpath).

Schema mapping, method 1: assembling the fields with a configured dictionary schema

This approach configures a dictionary of the form {"field name": field position} as the data schema, assembles each list of fields according to that schema, and lets you access the data through a dictionary.
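The following sketch shows the generator in use. It assumes Python 3, repeats the function in compact form so that it runs on its own, and writes a small temporary file whose contents are made up for illustration.

```python
import os
import tempfile


def read_file_data(filepath):
    """Yield the tab-separated fields of each non-empty line."""
    with open(filepath, "r") as fin:
        for line in fin:
            line = line.rstrip("\n")
            if not line:
                continue
            yield line.split("\t")


# Write a tiny sample file (made-up data, including one empty line).
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("1\tname1\t0\n\n2\tname2\t1\n")
    sample_path = tmp.name

rows = []
for fields in read_file_data(sample_path):
    rows.append(fields)
    print(fields)

os.remove(sample_path)
```

Because the function is a generator, only one line is held in memory at a time, which matters for large files.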

The function used:

@staticmethod
def map_fields_dict_schema(fields, dict_schema):
    """Map a list of field values to a dictionary according to a schema.
    For example, if fields is ['a', 'b', 'c'] and the schema is {'name': 0, 'age': 1}, the result is {'name': 'a', 'age': 'b'}.
    @param fields: list of field values, usually obtained by splitting a line on tabs
    @param dict_schema: a dictionary whose keys are field names and whose values are field positions
    @return: a dictionary whose keys are field names and whose values are field values
    """
    pdict = {}
    # items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
    for fstr, findex in dict_schema.items():
        pdict[fstr] = str(fields[int(findex)])
    return pdict

With this method and the previous method, the data can be read in the following way:

# coding:utf8
"""
@author.
Test loading rows of data using a dictionary schema
Advantage: for files with many columns, you only configure the fields you need and read just those columns.
Disadvantage: if you need many fields, configuring the position of each one is tedious.
"""
import file_util
import pprint

# The dictionary schema to read with; you can configure positions for only the columns you care about
dict_schema = {"userid": 0, "username": 1, "usertype": 2}
# Replace "" with the path of your data file
for fields in file_util.FileUtil.read_file_data(""):
    # Map the list of fields to a dictionary according to the schema
    dict_fields = file_util.FileUtil.map_fields_dict_schema(fields, dict_schema)
    pprint.pprint(dict_fields)

Output results:

{'userid': '1', 'username': 'name1', 'usertype': '0'}
{'userid': '2', 'username': 'name2', 'usertype': '1'}
{'userid': '3', 'username': 'name3', 'usertype': '2'}
{'userid': '4', 'username': 'name4', 'usertype': '3'}
{'userid': '5', 'username': 'name5', 'usertype': '4'}
{'userid': '6', 'username': 'name6', 'usertype': '5'}
{'userid': '7', 'username': 'name7', 'usertype': '6'}
{'userid': '8', 'username': 'name8', 'usertype': '7'}
{'userid': '9', 'username': 'name9', 'usertype': '8'}
{'userid': '10', 'username': 'name10', 'usertype': '9'}
{'userid': '11', 'username': 'name11', 'usertype': '10'}
{'userid': '12', 'username': 'name12', 'usertype': '11'}
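The advantage mentioned above, configuring only the columns you care about, can be sketched as follows. The mapping function is repeated in compact form so the example runs on its own, and the sample rows are made up.

```python
def map_fields_dict_schema(fields, dict_schema):
    """Return {field name: field value} for the positions in dict_schema."""
    return {name: str(fields[int(index)]) for name, index in dict_schema.items()}


# Only two of the three columns are configured; the username column is ignored.
partial_schema = {"userid": 0, "usertype": 2}

sample_rows = [["1", "name1", "0"], ["2", "name2", "1"]]  # made-up rows
mapped = [map_fields_dict_schema(fields, partial_schema) for fields in sample_rows]
for row in mapped:
    print(row)
```

Note that a row with fewer columns than the schema expects raises an IndexError, so malformed lines surface immediately rather than being silently accepted.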

Schema mapping, method 2: assembling the fields with a configured list schema

If you need to read all the columns of a file, or the first several columns, the dictionary schema becomes cumbersome: you have to configure the index of every field, and those indexes simply count up from 0. That is tedious manual work that should be eliminated.

This is where the list schema comes in. It works by first converting the configured list schema into a dictionary schema and then loading the data through the dictionary.

The conversion code and the code for reading with a list schema:

@staticmethod
def transform_list_to_dict(para_list):
    """Convert ['a', 'b'] into the form {'a': 0, 'b': 1}.
    @param para_list: a list containing the field name of each column, in order
    @return: a dictionary mapping each field name to its position
    """
    return {str(name).strip(): idx for idx, name in enumerate(para_list)}

@staticmethod
def map_fields_list_schema(fields, list_schema):
    """Map a list of field values to a dictionary according to a list schema.
    For example, if fields is ['a', 'b', 'c'] and the schema is ['name', 'age'], the result is {'name': 'a', 'age': 'b'}.
    @param fields: list of field values, usually obtained by splitting a line on tabs
    @param list_schema: list of column names
    @return: a dictionary whose keys are field names and whose values are field values
    """
    dict_schema = FileUtil.transform_list_to_dict(list_schema)
    return FileUtil.map_fields_dict_schema(fields, dict_schema)

When using it, you configure the schema as a plain list with no indexes to specify, which is more concise:

# coding:utf8
"""
@author.
Test loading rows of data using a list schema
Advantage: to read all the columns, you only need to write the field name of each column in order.
Disadvantage: you cannot read only the fields you care about; every column up to the last one you need must be named.
"""
import file_util
import pprint

# The list schema to read with; you can configure just the first few columns, or all of them
list_schema = ["userid", "username", "usertype"]
# Replace "" with the path of your data file
for fields in file_util.FileUtil.read_file_data(""):
    # Map the list of fields to a dictionary according to the schema
    dict_fields = file_util.FileUtil.map_fields_list_schema(fields, list_schema)
    pprint.pprint(dict_fields)

The result of the run is exactly the same as for dictionary mode.
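As a sketch of the "first few columns" case, the example below names only the first two columns of a three-column row. The two helpers are repeated in compact form so that it runs on its own; the row is made up. For comparison it also shows dict(zip(...)), a standard Python idiom that gives the same result here but is not part of the original file_util.py.

```python
def transform_list_to_dict(para_list):
    """Convert ['a', 'b'] into {'a': 0, 'b': 1}."""
    return {str(name).strip(): idx for idx, name in enumerate(para_list)}


def map_fields_list_schema(fields, list_schema):
    """Map field values to the names in list_schema, by position."""
    dict_schema = transform_list_to_dict(list_schema)
    return {name: str(fields[idx]) for name, idx in dict_schema.items()}


fields = ["1", "name1", "0"]             # made-up row with three columns
leading_schema = ["userid", "username"]  # names for the first two columns only

by_schema = map_fields_list_schema(fields, leading_schema)
by_zip = dict(zip(leading_schema, fields))

print(by_schema)
print(by_zip)
```

The two differ on malformed input: the index-based mapping raises an IndexError when a row is shorter than the schema, while zip silently stops at the shorter of the two sequences.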

Full code of file_util.py

Here is the complete code of file_util.py, which you can put into your own shared library:

# -*- encoding:utf8 -*-
'''
@author: 
@version: 2014-12-5
'''

class FileUtil(object):
    '''Common file and path operations
    '''
    @staticmethod
    def read_file_data(filepath):
        '''Read a file line by line and yield the fields of each line.
        @param filepath: the path of the file to read
        @return: a generator that yields, for each non-empty line, the list of fields obtained by splitting on tabs
        '''
        with open(filepath, 'r') as fin:
            for line in fin:
                # Strip the trailing newline and skip empty lines
                line = line.rstrip('\n')
                if not line:
                    continue
                # Yield the current line's list of fields
                yield line.split('\t')

    @staticmethod
    def transform_list_to_dict(para_list):
        """Convert ['a', 'b'] into the form {'a': 0, 'b': 1}.
        @param para_list: a list containing the field name of each column, in order
        @return: a dictionary mapping each field name to its position
        """
        return {str(name).strip(): idx for idx, name in enumerate(para_list)}

    @staticmethod
    def map_fields_list_schema(fields, list_schema):
        """Map a list of field values to a dictionary according to a list schema.
        For example, if fields is ['a', 'b', 'c'] and the schema is ['name', 'age'], the result is {'name': 'a', 'age': 'b'}.
        @param fields: list of field values, usually obtained by splitting a line on tabs
        @param list_schema: list of column names
        @return: a dictionary whose keys are field names and whose values are field values
        """
        dict_schema = FileUtil.transform_list_to_dict(list_schema)
        return FileUtil.map_fields_dict_schema(fields, dict_schema)

    @staticmethod
    def map_fields_dict_schema(fields, dict_schema):
        """Map a list of field values to a dictionary according to a schema.
        For example, if fields is ['a', 'b', 'c'] and the schema is {'name': 0, 'age': 1}, the result is {'name': 'a', 'age': 'b'}.
        @param fields: list of field values, usually obtained by splitting a line on tabs
        @param dict_schema: a dictionary whose keys are field names and whose values are field positions
        @return: a dictionary whose keys are field names and whose values are field values
        """
        pdict = {}
        # items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
        for fstr, findex in dict_schema.items():
            pdict[fstr] = str(fields[int(findex)])
        return pdict
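For comparison, the list-schema style of reading can also be done with the standard library's csv.DictReader, which accepts a list of field names and a delimiter. This is an alternative to the article's FileUtil class, not part of it. The sketch assumes Python 3 and uses an in-memory string with made-up content in place of a real file.

```python
import csv
import io

# Made-up tab-separated content standing in for a real file.
sample = io.StringIO("1\tname1\t0\n2\tname2\t1\n")

list_schema = ["userid", "username", "usertype"]
reader = csv.DictReader(sample, fieldnames=list_schema, delimiter="\t")

rows = [dict(row) for row in reader]
for row in rows:
    print(row)
```

One difference to keep in mind: the csv module applies CSV quoting rules, so double quotes in the data are treated specially, whereas a plain split on tabs leaves them untouched.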

Summary

That is all for this article. I hope it is helpful for anyone learning or using Python. If you have questions, feel free to leave a comment.