1, Basic introduction to the jieba library
(1), Overview of the jieba library
jieba is an excellent third-party library for Chinese word segmentation.
- Chinese text must be segmented to obtain individual words, because written Chinese does not separate words with spaces
- jieba is a third-party library, so it has to be installed separately (for example, with pip install jieba)
- jieba provides three segmentation modes; for the simplest use you only need to master one function
(2), How jieba segments words
jieba's segmentation relies on a Chinese dictionary
- It uses the dictionary to determine the probability that Chinese characters are associated with one another
- Characters with a high probability of forming a word together are grouped, which produces the segmentation result
- Besides segmenting text, jieba lets users add their own custom words and phrases
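The idea of picking the most probable split from a dictionary can be sketched with a toy example. This is not jieba's actual implementation: the function name segment, the tiny dictionary, and its frequency counts are all made up for illustration. It only shows how word frequencies can decide between competing splits, and how adding a custom word changes the result.

```python
import math

def segment(text, freq):
    """Toy maximum-probability segmentation over a word-frequency dictionary.

    Illustrative only: this is not how jieba is implemented internally.
    """
    total = sum(freq.values())
    n = len(text)
    # best[i] = (log-probability of the best split of text[i:], end of its first word)
    best = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # Unknown single characters get a count of 1 so any text can be split
            count = freq.get(word, 1 if j == i + 1 else 0)
            if count > 0:
                candidates.append((math.log(count / total) + best[j][0], j))
        best[i] = max(candidates)
    # Follow the stored end positions to recover the words
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words

# Hypothetical dictionary with made-up frequencies
freq = {"我": 50, "爱": 40, "北京": 30, "天": 10, "安": 5, "门": 10}
print(segment("我爱北京天安门", freq))

# Adding a custom word changes how the same text is split
freq["天安门"] = 20
print(segment("我爱北京天安门", freq))
```

With the first dictionary the last three characters are split one by one; once the custom word is added, they are kept together as a single word.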
2, Using the jieba library
(1), The three segmentation modes of jieba
Precise mode, full mode, and search engine mode
- Precise mode: cuts the text exactly, with no redundant words
- Full mode: scans out every possible word in the text, so the result contains redundancy
- Search engine mode: starts from the precise-mode result and cuts long words again
(2), Commonly used functions of the jieba library
3, Application example
Using the jieba library to count how often characters appear in Romance of the Three Kingdoms
import jieba

# Read the full text of the novel
with open("D:\\Three Kingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)  # Segment the text using precise mode
counts = {}  # Store words and their occurrence counts as key-value pairs
for word in words:
    if len(word) == 1:  # Single-character words are not counted
        continue
    else:
        counts[word] = counts.get(word, 0) + 1  # Add 1 each time the word occurs
items = list(counts.items())  # Convert the key-value pairs to a list
items.sort(key=lambda x: x[1], reverse=True)  # Sort in descending order of count
for i in range(15):  # Print the fifteen most frequent words
    word, count = items[i]
    print("{0:<5}{1:>5}".format(word, count))
The program prints the fifteen most frequent words. Cao Cao, a great warlord of his era, deservedly takes first place. However, the results still need further processing: some of the words are not useful because they are not character names, and some different words refer to the same person.
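One possible way to do this further processing is sketched below. The function name count_words, the alias table, and the exclusion set are assumptions made for illustration, not part of the original program; in practice both tables would be built by inspecting the actual output. The sketch works on an already segmented word list, so it runs without jieba.

```python
from collections import Counter

def count_words(words, aliases=None, excludes=None):
    """Count multi-character words, merging aliases and skipping excluded words."""
    aliases = aliases or {}
    excludes = excludes or set()
    counts = Counter()
    for word in words:
        if len(word) == 1:               # single characters are not counted
            continue
        word = aliases.get(word, word)   # map different names to one person
        if word in excludes:             # drop words that are not character names
            continue
        counts[word] += 1
    return counts

# Hypothetical sample standing in for the output of jieba.lcut(txt)
words = ["曹操", "孟德", "却说", "孔明", "诸葛亮", "曰", "曹操"]
aliases = {"孟德": "曹操", "诸葛亮": "孔明"}   # illustrative alias table
excludes = {"却说"}                            # illustrative exclusion set
print(count_words(words, aliases, excludes).most_common(2))
# prints [('曹操', 3), ('孔明', 2)]
```

Counter.most_common(n) replaces the manual list conversion and sort, and the two tables can be extended step by step as more noise shows up in the ranking.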
That is all for this article. I hope it helps with your learning.