SoFunction
Updated on 2024-11-16

Python sklearn CountVectorizer Usage Details

Overview

CountVectorizer Official Documentation

CountVectorizer converts a collection of text documents into a matrix of token counts.

If you do not provide an a priori dictionary and do not use an analyzer that performs some kind of feature selection, the number of features will equal the vocabulary size found by analyzing the data.
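As a small sketch of that statement, the snippet below fits one vectorizer with no dictionary and another with a fixed `vocabulary`. The two sample sentences and the word list are made up for illustration and do not come from this article.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample documents, not taken from the article
docs = ["the cat sat on the mat", "the dog sat"]

# No a priori dictionary: the features are whatever the analyzer finds
learned = CountVectorizer()
X_learned = learned.fit_transform(docs)
n_learned = X_learned.shape[1]

# A fixed dictionary: only these terms become columns, in this order
fixed = CountVectorizer(vocabulary=["cat", "dog", "bird"])
X_fixed = fixed.fit_transform(docs)
fixed_counts = X_fixed.toarray().tolist()

print(n_learned, len(learned.vocabulary_))
print(fixed_counts)
```

With a fixed vocabulary, words outside the list are ignored and listed words that never occur simply get a column of zeros.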

Data preprocessing

There are two ways to prepare Chinese text: 1. feed it to the model directly, without word segmentation; 2. segment it into words first.

The two approaches produce very different vocabularies. Both are demonstrated later in this article.
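A rough sketch of why the vocabularies differ: by default `CountVectorizer` tokenizes with the pattern `(?u)\b\w\w+\b`, so an unbroken run of Chinese characters counts as a single token, while space-separated words are split apart and single-character words are dropped. The sentence and its hand-made segmentation below are illustrative assumptions, and the pattern is applied with `re` directly rather than through scikit-learn.

```python
import re

# Default token_pattern of CountVectorizer: tokens of two or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical sentence; the segmented form was split by hand for illustration
unsegmented = "我爱北京天安门"
segmented = "我 爱 北京 天安门"

tokens_unsegmented = re.findall(TOKEN_PATTERN, unsegmented)
tokens_segmented = re.findall(TOKEN_PATTERN, segmented)

print(tokens_unsegmented)
print(tokens_segmented)
```

If single-character words matter for your data, passing a different `token_pattern` to `CountVectorizer` is one way to keep them.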

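import jieba
import re
from sklearn.feature_extraction.text import CountVectorizer

# Raw data. The steps below operate on Chinese text, so the sentences should be
# Chinese; they appear here in English, and the regex would leave these strings empty.
text = ['Rarely out in public with cell phones',
        'Most people are still serious about learning',
        'They will come with action',
        'No matter how disheveled you are right now, pull yourself together',
        'All it takes is a little bit of change',
        'You can be refreshed on the outside as well as the inside']
# Keep only the Chinese characters in each sentence
text = [' '.join(re.findall('[\u4e00-\u9fa5]+', tt)) for tt in text]
# Method 2 only: segment each sentence into space-separated words with jieba
text = [' '.join(jieba.cut(tt)) for tt in text]
text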
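To see what the Chinese-extraction step does in isolation, here is a minimal sketch on a made-up mixed string. The character class `[\u4e00-\u9fa5]` covers the common CJK ideograph range, so Latin letters, digits, and punctuation are dropped.

```python
import re

# Hypothetical mixed-language input, not from the article
raw = "Python 很强大, sklearn 2024 也很好用!"

# Keep only runs of Chinese characters and join them with spaces
chinese_only = " ".join(re.findall("[\u4e00-\u9fa5]+", raw))

print(chinese_only)
```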


Building the model

Training the model

# Build the model
vectorizer = CountVectorizer()
# Learn the vocabulary and convert the documents into a sparse count matrix
X = vectorizer.fit_transform(text)
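The following sketch, on made-up documents, shows what `fit_transform` hands back: a SciPy sparse matrix with one row per document and one column per vocabulary term. It also shows that a later `transform` call reuses the learned vocabulary and ignores unseen words.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-segmented documents (words separated by spaces)
docs = ["apple banana apple", "banana cherry"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # learns the vocabulary, then counts

# Rows are documents, columns are vocabulary terms
is_sparse = issparse(X)
shape = X.shape

# transform() reuses the learned vocabulary; the unseen word "durian" is ignored
new_counts = vectorizer.transform(["apple durian"]).toarray().tolist()

print(is_sparse, shape, new_counts)
```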

Full vocabulary: vectorizer.get_feature_names_out()

# Vocabulary collected from all documents
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = vectorizer.get_feature_names_out()
print(feature_names)

[Figure: vocabulary generated without word segmentation]

[Figure: vocabulary generated after word segmentation]

Count matrix: X.toarray()

# Dense matrix of how many times each vocabulary term occurs in each document
matrix = X.toarray()
print(matrix)
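For intuition, a count matrix of this kind can be reproduced by hand. The sketch below uses only the standard library on made-up documents, with whitespace splitting standing in for the vectorizer's tokenizer, so it illustrates the idea and is not scikit-learn's implementation.

```python
from collections import Counter

# Hypothetical pre-segmented documents
docs = ["apple banana apple", "banana cherry"]

tokenized = [doc.split() for doc in docs]

# Sorted set of all terms, playing the role of the feature names
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row per document, one column per term
counts = [[Counter(toks)[term] for term in vocab] for toks in tokenized]

print(vocab)
print(counts)
```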


import pandas as pd

# Convert the count matrix into a DataFrame, one column per vocabulary term
df = pd.DataFrame(matrix, columns=feature_names)
df


Vocabulary index: vectorizer.vocabulary_

print(vectorizer.vocabulary_)

