Python Machine Learning NLP: Basic Natural Language Processing Operations with the Bag-of-Words Model

Overview

Today we begin our journey into Natural Language Processing (NLP). NLP lets us process, understand, and use human language, bridging the gap between machine language and human language.

Bag-of-words model

The bag-of-words (BoW) model helps us convert a sentence into a vector representation. It treats text as an unordered collection of words and counts how many times each word occurs.
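
As a minimal sketch of the "unordered collection" idea, the snippet below counts words with Python's standard `collections.Counter`. The helper name `bag_of_words`, the whitespace tokenization, and the two sample sentences are assumptions made for illustration only; they are not part of the original example.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Lowercase the sentence, split on whitespace, and count each word."""
    return Counter(sentence.lower().split())


# Two sentences with the same words in a different order
a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a)       # word counts for the first sentence
print(a == b)  # True: word order is discarded, only the counts remain
```

Because both sentences contain the same words the same number of times, their bags are equal even though the sentences mean different things. This is the main limitation of the model.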

Vectorization

The bag-of-words model first segments the text into words, then counts the number of times each word appears. These counts give us word-based features of the text. Pairing the words of each text sample with their corresponding frequencies is commonly referred to as vectorization.
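
The two steps can be sketched in plain Python, assuming the text has already been segmented into tokens. The function names `build_vocabulary` and `vectorize` and the toy documents are hypothetical and chosen only to illustrate the idea.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in doc})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(doc, vocab):
    """Return a fixed-length list of counts, one slot per vocabulary token."""
    counts = Counter(doc)
    # dicts preserve insertion order, so slots follow the vocabulary indices
    return [counts.get(tok, 0) for tok in vocab]


# Toy documents, already split into tokens
docs = [
    ["rain", "tomorrow", "rain"],
    ["thunder", "tomorrow"],
]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(vocab)    # {'rain': 0, 'thunder': 1, 'tomorrow': 2}
print(vectors)  # [[2, 0, 1], [0, 1, 1]]
```

Every document becomes a vector of the same length, so documents can be compared or passed to a model regardless of how many words they contain.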

Example.

import jieba
from gensim import corpora

# Define the punctuation marks to strip
punctuation = ["，", "。", "：", "；", "?", "!"]

# Define the corpus
content = [
    "It's a beautiful day!",
    "It's going to rain tomorrow?",
    "It's going to thunder the day after tomorrow."
]

# Segment each sentence into a list of tokens (jieba is a Chinese word segmenter)
seg = [jieba.lcut(con) for con in content]
print("corpus:", seg)

# Remove punctuation tokens, leaving seg unchanged
tokenized = [[tok for tok in s if tok not in punctuation] for s in seg]
print("Remove punctuation:", tokenized)

# Build the dictionary from the punctuation-free tokens
dictionary = corpora.Dictionary(tokenized)
print("Bag-of-words modeling:", dictionary)

# Save the dictionary (the original file name was not preserved; "bow.dict" is a placeholder)
dictionary.save("bow.dict")

# View the mapping from tokens to integer ids
print("No.", dictionary.token2id)

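Output (tokens are shown in English translation; jieba is a Chinese word segmenter, so the English sentences above would be split differently).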

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\
Loading model cost 1.140 seconds.
Prefix dict has been built successfully.
corpus: [["Today's weather", "That's nice.", '!'], ['Tomorrow', 'To', "It's raining", '?'], ['The Day After Tomorrow', 'To', 'Thunder', '。']]
Remove punctuation: [["Today's weather", "That's nice."], ['Tomorrow', 'To', "It's raining"], ['The Day After Tomorrow', 'To', 'Thunder']]
Bag-of-words modeling: Dictionary(7 unique tokens: ["Today's weather", "That's nice.", "It's raining", 'Tomorrow', 'To']...)
No. {"Today's weather": 0, "That's nice.": 1, "It's raining": 2, 'Tomorrow': 3, 'To': 4, 'The Day After Tomorrow': 5, 'Thunder': 6}
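
Once the token-to-id mapping exists, a tokenized sentence can be turned into sparse (id, count) pairs. The sketch below does this in plain Python using the mapping printed above. The helper `to_bow` is hypothetical; its result has the same shape as what gensim's `Dictionary.doc2bow` returns, but it is not gensim's implementation.

```python
from collections import Counter

# Mapping copied from the printed output above
token2id = {
    "Today's weather": 0, "That's nice.": 1, "It's raining": 2,
    "Tomorrow": 3, "To": 4, "The Day After Tomorrow": 5, "Thunder": 6,
}


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


bow = to_bow(["Tomorrow", "To", "It's raining"], token2id)
print(bow)  # [(2, 1), (3, 1), (4, 1)]
```

Only tokens that actually occur are stored, which keeps the representation compact when the vocabulary is large.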
