Overview
Today we begin our journey into Natural Language Processing (NLP). Natural Language Processing allows us to process, understand, and utilize human language, bridging the gap between machine language and human language.
Bag-of-words model
The bag-of-words (BoW) model converts a sentence into a vector representation. It treats text as an unordered collection of words and counts how many times each word occurs, ignoring grammar and word order.
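As a minimal sketch of the "unordered collection" idea, the snippet below counts words with Python's collections.Counter. It assumes a plain whitespace split, which is a simplification for illustration (real text usually needs a proper tokenizer), and the helper name bag_of_words is hypothetical.

```python
from collections import Counter

def bag_of_words(sentence):
    """Return word counts for a sentence, ignoring word order."""
    # Lowercase and split on whitespace: a simplifying assumption, not a real tokenizer.
    return Counter(sentence.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a["the"])  # how many times "the" occurs
print(a == b)    # same words in a different order give the same bag
```

Because only the counts are kept, two sentences with very different meanings can end up with identical representations, which is the main trade-off of this model.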
Vectorization
The bag-of-words model first segments the text into words and then counts the number of times each word appears. These counts give us word-based features of the text. Representing each text sample by its words together with their corresponding frequencies is what is usually referred to as vectorization.
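The two steps described here (build a vocabulary, then count) might be sketched in plain Python as follows. The function name vectorize and the sample tokens are made up for illustration, and the samples are assumed to be already segmented into words.

```python
from collections import Counter

def vectorize(samples):
    """Turn tokenized samples into count vectors over a shared vocabulary."""
    # Give each distinct word a column index, in order of first appearance.
    vocab = {}
    for tokens in samples:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    # One vector per sample: position i holds the count of the word with index i.
    vectors = []
    for tokens in samples:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

# Hypothetical, already-segmented samples.
samples = [
    ["rain", "tomorrow", "rain"],
    ["thunder", "tomorrow"],
]
vocab, vectors = vectorize(samples)
print(vocab)
print(vectors)
```

Every sample gets a vector of the same length (the vocabulary size), so the vectors can be compared or fed to a model directly.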
Example.
import jieba
from gensim import corpora

# Define the punctuation tokens to remove
punctuation = [",", "。", ":", ";", "?", "!"]

# Define the corpus (Chinese text, since jieba is a Chinese word segmenter)
content = [
    "今天天气真好!",  # "It's a beautiful day!"
    "明天要下雨?",  # "It's going to rain tomorrow?"
    "后天要打雷。",  # "It's going to thunder the day after tomorrow."
]

# Segment each sentence into a list of words
seg = [jieba.lcut(con) for con in content]
print("corpus:", seg)

# Remove punctuation tokens from each segmented sentence
tokenized = [[word for word in s if word not in punctuation] for s in seg]
print("Remove punctuation:", tokenized)

# Build the dictionary from the punctuation-free tokens
dictionary = corpora.Dictionary(tokenized)
print("bag-of-words model:", dictionary)

# Save the dictionary (the file path is left blank here, so the call is commented out)
# dictionary.save('')

# View the mapping from each token to its integer id
print("serial number:", dictionary.token2id)
Output (tokens are shown in English translation).
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\
Loading model cost 1.140 seconds.
Prefix dict has been built successfully.
corpus: [["Today's weather", "That's nice.", '!'], ['Tomorrow', 'To', "It's raining", '?'], ['The Day After Tomorrow', 'To', 'Thunder', '。']]
Remove punctuation: [["Today's weather", "That's nice."], ['Tomorrow', 'To', "It's raining"], ['The Day After Tomorrow', 'To', 'Thunder']]
bag-of-words model: Dictionary(7 unique tokens: ["Today's weather", "That's nice.", "It's raining", 'Tomorrow', 'To']...)
serial number: {"Today's weather": 0, "That's nice.": 1, "It's raining": 2, 'Tomorrow': 3, 'To': 4, 'The Day After Tomorrow': 5, 'Thunder': 6}
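Once a token-to-id mapping like the one in this output exists, a document can be written as sparse (id, count) pairs. The sketch below does this in plain Python so it runs without gensim. The helper to_bow is hypothetical: it is loosely modeled on the kind of result gensim's Dictionary.doc2bow returns, not a reimplementation of it, and the sample document is made up.

```python
from collections import Counter

# Token-to-id mapping copied from the output above (tokens in English translation).
token2id = {
    "Today's weather": 0, "That's nice.": 1, "It's raining": 2,
    "Tomorrow": 3, "To": 4, "The Day After Tomorrow": 5, "Thunder": 6,
}

def to_bow(tokens, mapping):
    """Return sorted (token_id, count) pairs; tokens missing from the mapping are skipped."""
    counts = Counter(mapping[t] for t in tokens if t in mapping)
    return sorted(counts.items())

# Hypothetical new document; "Snow" is not in the mapping and is ignored.
doc = ["Tomorrow", "To", "Thunder", "Thunder", "Snow"]
print(to_bow(doc, token2id))
```

Only the ids that actually occur are stored, which keeps the representation compact when the vocabulary is large and each document uses a small part of it.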