Synopsis
See the CountVectorizer official documentation.
Vectorizes a collection of documents into a count matrix.
If you do not provide an a-priori dictionary and do not use an analyzer that performs some kind of feature selection, the number of features will equal the vocabulary size found by analyzing the data.
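A minimal sketch of this behavior on a toy English corpus (the sentences here are illustrative, not the tutorial's data): no dictionary is supplied, so the learned vocabulary determines the number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat', 'the cat sat on the mat']
vec = CountVectorizer()          # no a-priori vocabulary supplied
X = vec.fit_transform(docs)      # learn the vocabulary and count occurrences

# The number of features equals the size of the learned vocabulary
print(sorted(vec.vocabulary_))   # ['cat', 'mat', 'on', 'sat', 'the']
print(X.shape)                   # (2, 5)
print(X.toarray())               # counts per document, one column per term
```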
Data preprocessing
There are two ways to feed Chinese text to the model: 1. put the raw text into the model directly, without word segmentation; 2. segment the Chinese text into words first. The two approaches produce very different vocabularies, as demonstrated below.
import jieba
import re
from sklearn.feature_extraction.text import CountVectorizer

# Raw data -- in the original tutorial these were Chinese sentences;
# the regex below keeps only Chinese characters
text = ['Rarely out in public with cell phones', 'Most people are still serious about learning', 'They will come with action', 'No matter how disheveled you are right now, pull yourself together', 'All it takes is a little bit of change', 'You can be refreshed on the outside as well as the inside']

# Extract the Chinese characters
text = [' '.join(re.findall('[\u4e00-\u9fa5]+', tt)) for tt in text]

# Segment each sentence into words with jieba
text = [' '.join(jieba.lcut(tt)) for tt in text]
text
Build and train the model

# Build the model
vectorizer = CountVectorizer()
# Train the model
X = vectorizer.fit_transform(text)
Full vocabulary: vectorizer.get_feature_names() (renamed get_feature_names_out in scikit-learn 1.0 and later)

# Vocabulary pooled from all documents
feature_names = vectorizer.get_feature_names()
print(feature_names)
Vocabulary generated without word segmentation
Vocabulary generated after word segmentation
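The gap between the two vocabularies can be sketched with a small hypothetical example (the sentences and their segmentation below are illustrative, not the tutorial's data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical Chinese sentences (illustrative only)
raw = ['我爱自然语言处理', '自然语言处理很有趣']
# The same sentences pre-segmented, as a tokenizer like jieba would output
segmented = ['我 爱 自然 语言 处理', '自然 语言 处理 很 有趣']

# Without segmentation, the default token pattern treats each whole
# sentence as one token, so every document becomes its own "feature"
vec_raw = CountVectorizer().fit(raw)
print(sorted(vec_raw.vocabulary_))   # the two full sentences

# With segmentation, actual words become the features; note that the
# default token_pattern drops single-character tokens such as 我 and 很
vec_seg = CountVectorizer().fit(segmented)
print(sorted(vec_seg.vocabulary_))   # ['处理', '有趣', '自然', '语言']
```

This is why segmenting Chinese text first gives a meaningful vocabulary, while unsegmented input degenerates into one feature per document.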
Count matrix: X.toarray()

# Matrix of counts of each vocabulary term in each document
matrix = X.toarray()
print(matrix)
# Put the count matrix into a DataFrame
import pandas as pd
df = pd.DataFrame(matrix, columns=feature_names)
df
Vocabulary index: vectorizer.vocabulary_
print(vectorizer.vocabulary_)
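A short sketch of what vocabulary_ holds and why it matters (toy English data, chosen here for illustration): it maps each term to its column index, and a fitted vectorizer reuses those same columns when encoding new documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['good good study', 'day day up']
vec = CountVectorizer()
vec.fit(docs)

# vocabulary_ maps each term to its column index in the count matrix
print(vec.vocabulary_)   # {'day': 0, 'good': 1, 'study': 2, 'up': 3}

# A fitted vectorizer encodes unseen documents with the same columns;
# out-of-vocabulary words ('and', 'night') are silently dropped
row = vec.transform(['study day and night']).toarray()
print(row)               # [[1 0 1 0]]
```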