Preface
Python is an excellent tool for processing text data: reading, splitting, filtering, and converting text are all extremely simple, so developers do not need to worry about the tedious mechanics of streaming file processing (compared with Java, hee hee). In my own work, quite a few complex text-processing jobs, including Streaming programs on Hadoop, are all written in Python.
The first step in any text-processing job is loading the file into memory, which raises the question of how to map a column in the file to a specific variable in the program. The most naive approach is to reference each field by its subscript, like this:
# fields is the list obtained after reading one line and splitting it by the separator
user_id = fields[0]
user_name = fields[1]
user_type = fields[2]
If you read data this way, then any change in column order, or any added or removed column, turns maintaining the code into a nightmare. This kind of code needs to go.
This article recommends two more elegant ways to read the data. Both configure a field schema first and then read each line according to that schema; the schema comes in two forms, a dictionary schema and a list schema.
Read the file and split each line into a list of fields by separator
First, read the file, split each line by the delimiter, and return the resulting list of fields for later processing.
The code is as follows:
def read_file_data(filepath):
    '''Read the file line by line from the given path.
    @param filepath: absolute path of the file to read.
    @return: yields the list of fields in each line after splitting by \t.
    '''
    fin = open(filepath, 'r')
    for line in fin:
        try:
            # Strip the trailing newline and skip empty lines
            line = line[:-1]
            if not line:
                continue
        except:
            continue
        try:
            fields = line.split("\t")
        except:
            continue
        # Yield the current line's split field list
        yield fields
    fin.close()
The yield keyword emits the split fields one line at a time, so the caller can read each line with for fields in read_file_data(fpath).
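For instance, a minimal usage sketch, assuming the function above has been defined and that a tab-separated file exists at the hypothetical path data.txt:

import pprint

# Minimal usage sketch; "data.txt" is a hypothetical tab-separated file
# with lines such as "1\tname1\t0".
for fields in read_file_data("data.txt"):
    pprint.pprint(fields)   # e.g. ['1', 'name1', '0']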
Mapping method 1: use a configured dictionary schema to assemble the read data.
This approach configures a {"field name": field position} dictionary as the data schema, assembles the list read from each line according to that schema, and finally lets you access the data through a dictionary.
The function used:
@staticmethod
def map_fields_dict_schema(fields, dict_schema):
    """Map the fields to the schema and return the corresponding values.
    For example, if fields is ['a', 'b', 'c'] and dict_schema is {'name': 0, 'age': 1},
    then return {'name': 'a', 'age': 'b'}.
    @param fields: list of data values, usually obtained by splitting a line string by \t
    @param dict_schema: a dictionary whose key is the field name and whose value is the field position
    @return: a dictionary whose key is the field name and whose value is the field value
    """
    pdict = {}
    for fstr, findex in dict_schema.iteritems():
        pdict[fstr] = str(fields[int(findex)])
    return pdict
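To see what the mapping returns before wiring up file reading, here is a minimal sketch that calls the function directly. It assumes the FileUtil class from file_util.py (shown in full at the end of the article); the sample values mirror the article's test data.

from file_util import FileUtil

# Minimal sketch: map one already-split line with a dictionary schema.
fields = "1\tname1\t0".split("\t")                        # ['1', 'name1', '0']
dict_schema = {"userid": 0, "username": 1, "usertype": 2}
print(FileUtil.map_fields_dict_schema(fields, dict_schema))
# {'userid': '1', 'username': 'name1', 'usertype': '0'}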
Combining this function with the read function above, the data can be read like this:
# coding:utf8
"""
@author:
Test loading data using the dictionary schema.
Advantage: for files with many columns, you only need to configure the fields you care about
in order to read the data in the corresponding columns.
Disadvantage: if there are many fields, configuring the position of every field is tedious.
"""
import file_util
import pprint

# The dictionary schema to read with; you can configure only the columns you care about.
dict_schema = {"userid": 0, "username": 1, "usertype": 2}
for fields in file_util.FileUtil.read_file_data(""):  # pass the path of the test data file here
    # Take the list of fields and map it through the dictionary schema
    dict_fields = file_util.FileUtil.map_fields_dict_schema(fields, dict_schema)
    pprint.pprint(dict_fields)
Output results:
{'userid': '1', 'username': 'name1', 'usertype': '0'}
{'userid': '2', 'username': 'name2', 'usertype': '1'}
{'userid': '3', 'username': 'name3', 'usertype': '2'}
{'userid': '4', 'username': 'name4', 'usertype': '3'}
{'userid': '5', 'username': 'name5', 'usertype': '4'}
{'userid': '6', 'username': 'name6', 'usertype': '5'}
{'userid': '7', 'username': 'name7', 'usertype': '6'}
{'userid': '8', 'username': 'name8', 'usertype': '7'}
{'userid': '9', 'username': 'name9', 'usertype': '8'}
{'userid': '10', 'username': 'name10', 'usertype': '9'}
{'userid': '11', 'username': 'name11', 'usertype': '10'}
{'userid': '12', 'username': 'name12', 'usertype': '11'}
Mapping method 2: use a configured list schema to assemble the read data.
If you need to read all of the file's columns, or just the first few, the dictionary schema's advantage turns into a burden: you have to configure the index position of every field, counting up from 0. That is low-level drudgery and should be avoided.
The list schema was born for this case. It works by first converting a configured list schema into a dictionary schema, and then loading the data through the dictionary schema as before.
The conversion function, and the code that reads using the list schema:
@staticmethod
def transform_list_to_dict(para_list):
    """Convert a list such as ['a', 'b'] into the form {'a': 0, 'b': 1}.
    @param para_list: a list containing the field name of each column
    @return: a dictionary mapping field names to positions
    """
    res_dict = {}
    idx = 0
    while idx < len(para_list):
        res_dict[str(para_list[idx]).strip()] = idx
        idx += 1
    return res_dict

@staticmethod
def map_fields_list_schema(fields, list_schema):
    """Map the fields to the schema and return the corresponding values.
    For example, if fields is ['a', 'b', 'c'] and list_schema is ['name', 'age'],
    then return {'name': 'a', 'age': 'b'}.
    @param fields: list of data values, usually obtained by splitting a line string by \t
    @param list_schema: a list of column names
    @return: a dictionary whose key is the field name and whose value is the field value
    """
    dict_schema = FileUtil.transform_list_to_dict(list_schema)
    return FileUtil.map_fields_dict_schema(fields, dict_schema)
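To make the two-step conversion concrete, here is a minimal sketch that calls both functions directly. As above it assumes the FileUtil class from file_util.py, and the sample values mirror the article's test data.

from file_util import FileUtil

# Minimal sketch: the list schema is first turned into a dictionary schema ...
list_schema = ["userid", "username", "usertype"]
print(FileUtil.transform_list_to_dict(list_schema))
# -> a mapping of column names to positions, e.g. {'userid': 0, 'username': 1, 'usertype': 2}

# ... and then the mapping works exactly as in the dictionary-schema case.
fields = "1\tname1\t0".split("\t")
print(FileUtil.map_fields_list_schema(fields, list_schema))
# {'userid': '1', 'username': 'name1', 'usertype': '0'}

The while loop inside transform_list_to_dict could equally be written with enumerate; the result is the same mapping from column names to positions.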
When using it, you only need to configure the schema as a list of field names, without any index positions, which is more concise:
# coding:utf8
"""
@author:
Test loading data using the list schema.
Advantage: if you read all the columns, the list schema only requires writing the field names in column order.
Disadvantage: you cannot read only the fields you care about; you have to read them all.
"""
import file_util
import pprint

# The list schema to read with; it can cover only the first few columns, or all of them.
list_schema = ["userid", "username", "usertype"]
for fields in file_util.FileUtil.read_file_data(""):  # pass the path of the test data file here
    # Take the list of fields and map it through the list schema
    dict_fields = file_util.FileUtil.map_fields_list_schema(fields, list_schema)
    pprint.pprint(dict_fields)
The output is exactly the same as with the dictionary schema.
The full code of file_util.py
Here is the entire code of file_util.py; feel free to drop it into your own utility library.
# -*- encoding:utf8 -*-
'''
@author:
@version: 2014-12-5
'''

class FileUtil(object):
    '''Common helper methods for files and paths.'''

    @staticmethod
    def read_file_data(filepath):
        '''Read the file line by line from the given path.
        @param filepath: absolute path of the file to read.
        @return: yields the list of fields in each line after splitting by \t.
        '''
        fin = open(filepath, 'r')
        for line in fin:
            try:
                # Strip the trailing newline and skip empty lines
                line = line[:-1]
                if not line:
                    continue
            except:
                continue
            try:
                fields = line.split("\t")
            except:
                continue
            # Yield the current line's split field list
            yield fields
        fin.close()

    @staticmethod
    def transform_list_to_dict(para_list):
        """Convert a list such as ['a', 'b'] into the form {'a': 0, 'b': 1}.
        @param para_list: a list containing the field name of each column
        @return: a dictionary mapping field names to positions
        """
        res_dict = {}
        idx = 0
        while idx < len(para_list):
            res_dict[str(para_list[idx]).strip()] = idx
            idx += 1
        return res_dict

    @staticmethod
    def map_fields_list_schema(fields, list_schema):
        """Map the fields to the schema and return the corresponding values.
        For example, if fields is ['a', 'b', 'c'] and list_schema is ['name', 'age'],
        then return {'name': 'a', 'age': 'b'}.
        @param fields: list of data values, usually obtained by splitting a line string by \t
        @param list_schema: a list of column names
        @return: a dictionary whose key is the field name and whose value is the field value
        """
        dict_schema = FileUtil.transform_list_to_dict(list_schema)
        return FileUtil.map_fields_dict_schema(fields, dict_schema)

    @staticmethod
    def map_fields_dict_schema(fields, dict_schema):
        """Map the fields to the schema and return the corresponding values.
        For example, if fields is ['a', 'b', 'c'] and dict_schema is {'name': 0, 'age': 1},
        then return {'name': 'a', 'age': 'b'}.
        @param fields: list of data values, usually obtained by splitting a line string by \t
        @param dict_schema: a dictionary whose key is the field name and whose value is the field position
        @return: a dictionary whose key is the field name and whose value is the field value
        """
        pdict = {}
        for fstr, findex in dict_schema.iteritems():
            pdict[fstr] = str(fields[int(findex)])
        return pdict
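A side note: file_util.py above targets Python 2 (for example, dict.iteritems() no longer exists in Python 3). For readers on Python 3, a minimal, hypothetical adaptation of the two methods that need changing might look like the sketch below; FileUtil3 is just an illustrative name, not part of the original library.

class FileUtil3(object):
    '''Hypothetical Python 3 variant of the two methods that differ.'''

    @staticmethod
    def read_file_data(filepath):
        # "with open" ensures the file is closed when the generator is exhausted or closed.
        with open(filepath, 'r') as fin:
            for line in fin:
                line = line.rstrip('\n')
                if not line:
                    continue
                yield line.split('\t')

    @staticmethod
    def map_fields_dict_schema(fields, dict_schema):
        # items() replaces the Python 2 iteritems()
        return {fstr: str(fields[int(findex)]) for fstr, findex in dict_schema.items()}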
Summary
That is all for this article. I hope it is of some help to everyone learning or using Python; if you have questions, feel free to leave a comment and discuss.