
Python Hidden Markov Model for Chinese Pinyin Input

I recently read an article about Hidden Markov Models on the Internet and found them fascinating, and then I came across a great blog post about using a Hidden Markov Model to implement a Chinese pinyin input method. All I did was find the dictionary from the jieba word-segmentation project, train a Hidden Markov Model on it, and use the Viterbi algorithm to implement a simple pinyin input method. GitHub address: /LiuRoy/Pinyin_Demo

Principles: the Hidden Markov Model

To quote a definition from the web:

A Hidden Markov Model (HMM) is a statistical model that describes a Markov process with hidden, unknown parameters. The difficulty is to determine the hidden parameters of the process from the observable parameters and then use them for further analysis.

In a pinyin input method, the observable states are the pinyin syllables, and the hidden states are the corresponding Chinese characters.
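To make this concrete, here is a toy sketch of the three probability tables such a model needs; the characters, pinyin, and numbers are invented for illustration, not trained values:

# Toy model: hidden states are characters, observations are pinyin syllables.
# All probabilities below are invented for illustration.
start_prob = {'中': 0.6, '种': 0.4}               # P(char starts the phrase)
transition_prob = {'中': {'国': 0.9, '果': 0.1}}  # P(next char | current char)
emission_prob = {
  '中': {'zhong': 1.0},  # P(pinyin | char)
  '种': {'zhong': 1.0},
  '国': {'guo': 1.0},
  '果': {'guo': 1.0},
}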

The Viterbi Algorithm

See /wiki/Viterbi Algorithm for details. The idea is dynamic programming, and the code is fairly simple, so it is not repeated here.
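For reference, here is a compact Viterbi over the toy tables above; this is a sketch of the standard algorithm, not the project's implementation:

def toy_viterbi(observations, start_prob, transition_prob, emission_prob):
  """Return the most probable character sequence for a list of pinyin."""
  # best[char] = (probability of the best path ending in char, that path)
  best = {s: (p * emission_prob[s].get(observations[0], 0), [s])
          for s, p in start_prob.items()}
  for obs in observations[1:]:
    new_best = {}
    for prev, (prob, path) in best.items():
      for nxt, t_prob in transition_prob.get(prev, {}).items():
        p = prob * t_prob * emission_prob.get(nxt, {}).get(obs, 0)
        if p > new_best.get(nxt, (0, None))[0]:
          new_best[nxt] = (p, path + [nxt])
    best = new_best
  return max(best.values())[1]

print(toy_viterbi(['zhong', 'guo'], start_prob,
                  transition_prob, emission_prob))  # ['中', '国']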

Code Walkthrough

Model Definition

The code is in the model/ file. Three database tables store the three probability matrices of the Hidden Markov Model. The advantage of this design is obvious: the transition probability matrix over Chinese characters is a very large sparse matrix, so storing it directly in a file would take a lot of space, and it could only be loaded into memory all at once, which means both a high memory footprint and slow startup. In addition, database joins are very convenient for the probability calculations in the Viterbi algorithm.

The tables are defined as follows:

from sqlalchemy import Column, Float, Integer, String
from sqlalchemy.orm import declarative_base

BaseModel = declarative_base()


class Transition(BaseModel):
  """Transition probability: P(behind | previous) for adjacent characters."""

  __tablename__ = 'transition'

  id = Column(Integer, primary_key=True)
  previous = Column(String(1), nullable=False)
  behind = Column(String(1), nullable=False)
  probability = Column(Float, nullable=False)


class Emission(BaseModel):
  """Emission probability: P(pinyin | character)."""

  __tablename__ = 'emission'

  id = Column(Integer, primary_key=True)
  character = Column(String(1), nullable=False)
  pinyin = Column(String(7), nullable=False)
  probability = Column(Float, nullable=False)


class Starting(BaseModel):
  """Initial probability: P(character at the start of a phrase)."""

  __tablename__ = 'starting'

  id = Column(Integer, primary_key=True)
  character = Column(String(1), nullable=False)
  probability = Column(Float, nullable=False)
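The Viterbi code later relies on two query helpers, Emission.join_starting and Transition.join_emission. Their exact implementation lives in the repository; the sketch below only illustrates how such a join might look in SQLAlchemy, and the session setup, file name, and result limit are assumptions:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///hmm.sqlite')  # file name is an assumption
session = sessionmaker(bind=engine)()

def join_starting(pinyin, limit=10):
  """Characters that can start a phrase and emit `pinyin`, best first.

  Because probabilities are stored as natural logs, the product
  P(start) * P(pinyin | char) becomes a sum of two columns.
  """
  score = Starting.probability + Emission.probability
  rows = (session.query(Starting.character, score)
          .join(Emission, Starting.character == Emission.character)
          .filter(Emission.pinyin == pinyin)
          .order_by(score.desc())
          .limit(limit))
  return rows.all()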

Model Generation

The code is in the train/ file. The functions init_starting, init_emission, and init_transition generate the initial probability matrix, the emission probability matrix, and the transition probability matrix of the Hidden Markov Model, respectively, and write the results to a sqlite file. The training data is the dictionary shipped with the jieba word-segmentation project; because it contains no long sentences, the final runs show that the result is only usable for short sentence input.
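A minimal sketch of the training entry point, reusing the engine from the sketch above; only the three init_* names come from the article, the rest is an assumption:

def train():
  # Create the three tables in the sqlite file, then fill them.
  BaseModel.metadata.create_all(engine)
  init_starting()    # initial probability matrix
  init_emission()    # emission probability matrix
  init_transition()  # transition probability matrix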

Initial Probability Matrix

Building the initial probability matrix means finding every Chinese character that appears at the start of a dictionary entry, counting how often each one does, and computing from those counts the probability of each character appearing in the starting position. Characters never observed at the start are treated as having zero starting probability and are not written to the database. One point to note: to keep very small probabilities comparable without floating-point underflow, all probabilities are stored as natural logarithms. The statistical results are as follows:
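To make the procedure concrete, here is a minimal sketch of what init_starting could look like, assuming the dictionary is a text file with one `word frequency part-of-speech` entry per line (jieba's dict.txt format) and reusing the session from the sketch above; the file name and parsing details are assumptions:

import math
from collections import Counter

def init_starting(dict_path='dict.txt'):
  """Count characters at the start of dictionary entries; store log probs."""
  counter = Counter()
  with open(dict_path, encoding='utf-8') as f:
    for line in f:
      word = line.split()[0]
      counter[word[0]] += 1

  total = sum(counter.values())
  for character, count in counter.items():
    # math.log keeps tiny probabilities comparable without underflow.
    session.add(Starting(character=character,
                         probability=math.log(count / total)))
  session.commit()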

Transition Probability Matrix

The simplest first-order Hidden Markov Model is used here: within a sentence, each Chinese character is assumed to depend only on the character immediately before it. This is simple and crude, but it covers most cases. The statistics are gathered by finding, for every character, the set of characters that follow it in the dictionary and computing the probabilities. Because this matrix is very large, writing rows into the database one by one is too slow; batching the writes improves training efficiency considerably (a batched version is sketched after the chart below). The results are as follows:

The chart above lists the ten characters with the highest probability of following the first character, which fits everyday usage quite well.
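The batched write mentioned above might look like the sketch below, which counts adjacent character pairs inside each dictionary word and flushes rows to the database in chunks; the batch size and file handling are assumptions:

import math
from collections import Counter, defaultdict

def init_transition(dict_path='dict.txt', batch_size=10000):
  """Count P(next char | previous char) over adjacent pairs in words."""
  pairs = defaultdict(Counter)
  with open(dict_path, encoding='utf-8') as f:
    for line in f:
      word = line.split()[0]
      for previous, behind in zip(word, word[1:]):
        pairs[previous][behind] += 1

  rows = []
  for previous, behinds in pairs.items():
    total = sum(behinds.values())
    for behind, count in behinds.items():
      rows.append({'previous': previous, 'behind': behind,
                   'probability': math.log(count / total)})
      if len(rows) >= batch_size:
        # One bulk statement per batch instead of one INSERT per row.
        session.bulk_insert_mappings(Transition, rows)
        rows = []
  if rows:
    session.bulk_insert_mappings(Transition, rows)
  session.commit()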

Emission Probability Matrix

Put plainly, this counts each Chinese character's pinyin readings and the probability of each reading in everyday use. Take the character 暴 ("storm") as an example: it has two readings, bao and pu, and the difficulty is working out how often bao and how often pu appear in the dictionary. The pypinyin module is used to convert the dictionary phrases to pinyin before counting, but some of its readings are not entirely correct, so the finished input method occasionally fails to match a pinyin. The result is as follows:
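A sketch of the emission counting with pypinyin; lazy_pinyin is pypinyin's real conversion function, while the file handling mirrors the assumptions above:

import math
from collections import Counter, defaultdict

from pypinyin import lazy_pinyin

def init_emission(dict_path='dict.txt'):
  """Count P(pinyin | character) from dictionary phrases via pypinyin."""
  counter = defaultdict(Counter)
  with open(dict_path, encoding='utf-8') as f:
    for line in f:
      word = line.split()[0]
      # lazy_pinyin returns one tone-less syllable per character, e.g.
      # lazy_pinyin('中国') -> ['zhong', 'guo'].
      for character, py in zip(word, lazy_pinyin(word)):
        counter[character][py] += 1

  for character, pinyins in counter.items():
    total = sum(pinyins.values())
    for py, count in pinyins.items():
      session.add(Emission(character=character, pinyin=py,
                           probability=math.log(count / total)))
  session.commit()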

Viterbi Implementation

The code is in the input_method/ file. It keeps up to ten locally optimal solutions at each step; note that these are ten local optima rather than ten global optima, but the best of the ten is the global optimum. The code is as follows:

def viterbi(pinyin_list):
  """Viterbi algorithm for the input method.

  Args:
    pinyin_list (list): list of pinyin syllables
  """
  # Initialise with the characters that can start a phrase for the first pinyin.
  start_char = Emission.join_starting(pinyin_list[0])
  V = {char: prob for char, prob in start_char}

  for i in range(1, len(pinyin_list)):
    pinyin = pinyin_list[i]

    prob_map = {}
    for phrase, prob in V.items():
      character = phrase[-1]
      result = Transition.join_emission(pinyin, character)
      if not result:
        continue

      state, new_prob = result
      # Probabilities are natural logs, so they add instead of multiply.
      prob_map[phrase + state] = new_prob + prob

    if prob_map:
      V = prob_map
    else:
      # No candidate extends the phrase; return the best partial results.
      return V
  return V

Results Showcase

Run the input_method/ file for a quick demonstration of the results:
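As a hypothetical call (the exact candidates depend on the trained model), the returned mapping can be sorted by accumulated log probability:

result = viterbi(['pin', 'yin'])
for phrase, prob in sorted(result.items(), key=lambda kv: kv[1], reverse=True):
  print(phrase, prob)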

Known problems:

Writing the transition matrix statistics from the dictionary into the database is very slow; a full run takes nearly ten minutes. The emission probability data is inaccurate, so some characters never match their pinyin. And the training set is too small, so the resulting input method does not handle long sentences.