
Python jieba library usage and example analysis

1. Basic introduction to the jieba library

(1) Overview of the jieba library

jieba is an excellent third-party library for Chinese word segmentation.

- Chinese text must be segmented into individual words before it can be analyzed.
- jieba is an excellent third-party library for Chinese word segmentation; it is not part of the standard library and needs to be installed separately (for example, with pip install jieba).
- The jieba library provides three segmentation modes, and the simplest usage requires mastering only one function.

(2) The principle behind jieba word segmentation

jieba's word segmentation relies on a Chinese word dictionary.

- The dictionary is used to determine the probability that adjacent Chinese characters are associated with each other.
- Character sequences with a high probability of forming a phrase are returned as the segmentation result.
- Besides the built-in dictionary, users can also add custom phrases, as shown in the sketch below.
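
As a minimal sketch of adding a custom phrase, the snippet below uses jieba.add_word, which is part of jieba's public API; the sample sentence and phrase are only illustrative, and the exact splits depend on the dictionary shipped with your jieba version.

import jieba

sentence = "蚂蚁金服是一家金融科技公司"   # illustrative sentence containing an out-of-vocabulary phrase

# Before the phrase is registered, jieba may break it into smaller pieces.
print(jieba.lcut(sentence))

# Register the phrase so later segmentations keep it as a single token.
jieba.add_word("蚂蚁金服")
print(jieba.lcut(sentence))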

2. jieba library usage instructions

(1) The three word segmentation modes of jieba

Precise mode, full mode, and search engine mode:

- Precise mode: segments the text precisely, with no redundant words.
- Full mode: scans the text for every possible word, so the output contains redundancy.
- Search engine mode: on top of precise mode, long words are segmented again.

(2) Commonly used functions of the jieba library
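
The function table from the original article is not reproduced here. As a minimal sketch, the calls below exercise the three segmentation modes through functions that jieba documents (jieba.lcut, jieba.lcut with cut_all=True, and jieba.lcut_for_search); the sample sentence is illustrative and the exact token lists depend on the dictionary version.

import jieba

s = "中国是一个伟大的国家"   # illustrative sentence

# Precise mode: a list of tokens that exactly covers the text, with no redundancy.
print(jieba.lcut(s))

# Full mode: every word the dictionary can find, so tokens may overlap.
print(jieba.lcut(s, cut_all=True))

# Search engine mode: precise mode first, then long tokens are cut again.
print(jieba.lcut_for_search(s))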

3. Application example

Using the jieba library to count the number of appearances of characters in Romance of the Three Kingdoms:

import jieba

txt = open("D:\\Three Kingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)       # segment the text in precise mode
counts = {}                   # store each word and its number of occurrences as key-value pairs

for word in words:
    if len(word) == 1:        # single characters are not counted
        continue
    else:
        counts[word] = counts.get(word, 0) + 1   # add 1 to the count every time the word appears

items = list(counts.items())  # convert the key-value pairs to a list
items.sort(key=lambda x: x[1], reverse=True)     # sort words by occurrence count, in descending order

for i in range(15):
    word, count = items[i]
    print("{0:<5}{1:>5}".format(word, count))

Looking at the counts for the top fifteen words, Cao Cao, worthy of a generation of warlords, deservedly takes first place. However, the results still need further processing: some of the words are not character names at all, and several different words refer to the same person.
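
A common follow-up, sketched below under stated assumptions, is to drop non-name words with an exclusion set and to merge different words that refer to the same person before counting. The contents of excludes and the alias mappings (for example, 孔明 for 诸葛亮 and 玄德 for 刘备) are illustrative and would need to be tuned against the actual output of the previous script.

import jieba

txt = open("D:\\Three Kingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)

# Illustrative exclusion set: extend it as non-name words show up in the output.
excludes = {"将军", "却说", "二人", "不可", "荆州", "不能", "如此"}

counts = {}
for word in words:
    if len(word) == 1:
        continue
    # Merge a few well-known aliases into one canonical name (illustrative mapping).
    elif word in ("诸葛亮", "孔明曰"):
        rword = "孔明"
    elif word in ("关公", "云长"):
        rword = "关羽"
    elif word in ("玄德", "玄德曰"):
        rword = "刘备"
    elif word in ("孟德", "丞相"):
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1

# Remove the excluded words, then print the most frequent names.
for word in excludes:
    counts.pop(word, None)

items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))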

That is all for this article; I hope it helps you in your study.