Python Implementation of Spam Recognition Using Machine Learning Algorithms

Development tools

**Python version:** 3.6.4

Related modules:

scikit-learn module;

jieba module;

numpy module;

and a few modules from the Python standard library.

Environment Setup

Install Python, add it to your PATH environment variable, and use pip to install the modules listed above.
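As a quick sanity check after installation, a short script along these lines can report which of the required modules are still missing. The helper name `find_missing` is made up for this illustration; note that the scikit-learn package is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the third-party modules listed above.
# The scikit-learn package is imported as "sklearn".
REQUIRED_MODULES = ["sklearn", "jieba", "numpy"]


def find_missing(modules):
    """Return the module names that cannot be found by the current interpreter."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("Missing modules:", missing or "none")
```

Anything the script reports as missing can then be installed with pip.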

Step-by-step implementation

(1) Splitting the dataset

Most spam recognition datasets available online consist of English emails, so as a show of good faith I spent some time tracking down a dataset of Chinese emails. The dataset is split as follows:

Training dataset:

7063 normal emails (in the data/normal folder);

7775 spam emails (in the data/spam folder).

Test dataset:

392 emails in total (in the data/test folder).
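A minimal sketch of loading the training folders described above might look as follows. The function names, the label convention (0 for normal, 1 for spam), and the UTF-8 default encoding are assumptions made for illustration; the actual dataset may use a different encoding such as GBK.

```python
from pathlib import Path


def load_emails(folder, encoding="utf-8"):
    """Read every file in `folder` and return the email texts as a list."""
    texts = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # errors="ignore" drops bytes that do not decode cleanly.
            texts.append(path.read_text(encoding=encoding, errors="ignore"))
    return texts


def load_training_set(root="data", encoding="utf-8"):
    """Return (texts, labels); label 0 marks normal mail, label 1 marks spam."""
    normal = load_emails(Path(root) / "normal", encoding)
    spam = load_emails(Path(root) / "spam", encoding)
    texts = normal + spam
    labels = [0] * len(normal) + [1] * len(spam)
    return texts, labels
```

Keeping the normal emails first and the spam emails second matches the row order used for the feature matrix later on.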

(2) Building the dictionary

The content of the emails in the dataset generally looks like this:

First, we use a regular expression to filter out non-Chinese characters. Next, we use the jieba library to segment each sentence into words and remove stop words. Finally, we count the remaining words to build the dictionary, which has the following format:

{"word1": frequency of word1, "word2": frequency of word2, ...}
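The steps above can be sketched roughly as follows. This is an illustration, not the original implementation: the function names are invented, the character range `\u4e00-\u9fa5` is one common choice for matching Chinese characters, and the stop-word list is assumed to be supplied by the caller. `jieba.lcut` returns the segmented words as a list.

```python
import re
from collections import Counter

# Matches runs of characters outside the basic CJK Unified Ideographs range.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]+")


def jieba_tokenize(text):
    """Segment text with jieba (imported lazily so other tokenizers can be swapped in)."""
    import jieba
    return jieba.lcut(text)


def build_dictionary(emails, stop_words=frozenset(), tokenize=jieba_tokenize):
    """Count word frequencies over all emails: {"word": frequency, ...}."""
    counts = Counter()
    for text in emails:
        # Replace non-Chinese runs with a space so neighbouring words stay separate.
        cleaned = NON_CHINESE.sub(" ", text)
        for word in tokenize(cleaned):
            word = word.strip()
            if word and word not in stop_words:
                counts[word] += 1
    return dict(counts)
```

Passing the tokenizer as a parameter makes it easy to try the counting logic with a simpler splitter before running jieba over the whole dataset.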

All of the above is implemented in the **""** file; the main program only needs to call it:

The final results are saved in a **""** file.

So, are we done? Of course not!

The dictionary currently holds 52,113 words, which is clearly too many. Some words appear only once or twice, and reserving a feature dimension for each of them would be wasteful in the feature extraction step that follows. We therefore keep only the 4000 most frequent words as the final dictionary:
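One simple way to do this pruning, assuming the dictionary is an ordinary `{"word": frequency}` mapping, is `collections.Counter.most_common`. The function name `keep_most_frequent` is made up for this sketch.

```python
from collections import Counter


def keep_most_frequent(word_counts, size=4000):
    """Return a dict holding only the `size` most frequent words."""
    return dict(Counter(word_counts).most_common(size))
```

The returned dict is ordered from most to least frequent, so the position of each key can later serve as that word's column index in the feature matrix.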

The final results are saved in a **""** file.

(3) Feature extraction

Once the dictionary is ready, we can convert the content of each email into a feature vector. Each vector has 4000 dimensions, and each dimension holds the number of times one of the high-frequency words occurs in that email. Finally, we stack these vectors into one large feature matrix of size:

(7063+7775)×4000

That is, the first 7063 rows are the feature vectors of normal emails and the remaining 7775 rows are the feature vectors of spam emails.
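The conversion described above can be sketched with numpy as follows. It assumes each email has already been segmented into a list of words and that `vocabulary` is the list of 4000 dictionary words; the function name is hypothetical.

```python
import numpy as np


def extract_features(token_lists, vocabulary):
    """Build a (number of emails) x (vocabulary size) matrix of word counts."""
    index = {word: i for i, word in enumerate(vocabulary)}
    features = np.zeros((len(token_lists), len(index)), dtype=np.int32)
    for row, tokens in enumerate(token_lists):
        for word in tokens:
            col = index.get(word)
            if col is not None:  # words outside the dictionary are ignored
                features[row, col] += 1
    return features
```

If the normal emails are passed first and the spam emails second, the rows come out in the order described above, and a matching label vector can be built with `np.concatenate([np.zeros(7063), np.ones(7775)])`.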

This step is also implemented in the **""** file and is called from the main program as follows:

The final result is saved in the file **"fvs_%d_%"**, where the first format specifier is the number of normal emails and the second is the number of spam emails.

(4) Training the classifier

We use the scikit-learn machine learning library to train the classifiers. The two models chosen are a naive Bayes classifier and an SVM (Support Vector Machine):
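A minimal training sketch with scikit-learn is shown below. The specific estimators are assumptions: `MultinomialNB` is a common naive Bayes variant for word-count features and `LinearSVC` is one of several SVM implementations in scikit-learn, but the text does not say which variants or parameters were actually used. The toy data is invented purely to make the example runnable.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def train_classifiers(features, labels):
    """Fit a naive Bayes model and a linear SVM on the same feature matrix."""
    nb_model = MultinomialNB()
    nb_model.fit(features, labels)
    svm_model = LinearSVC()
    svm_model.fit(features, labels)
    return nb_model, svm_model


# Tiny made-up example: 4 emails, 3 dictionary words, label 1 = spam.
toy_features = np.array([[3, 0, 0], [2, 1, 0], [0, 0, 3], [0, 1, 2]])
toy_labels = np.array([0, 0, 1, 1])
nb_model, svm_model = train_classifiers(toy_features, toy_labels)
```

With the real data, `features` would be the (7063+7775)×4000 matrix and `labels` a vector with 7063 zeros followed by 7775 ones.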

(5) Performance testing

Both models are evaluated on the test dataset:
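One way to summarize a model's predictions, assuming the true labels of the test emails are available as 0 (normal) and 1 (spam), is to report accuracy together with the two kinds of error. The function and key names are hypothetical.

```python
import numpy as np


def evaluate(true_labels, predicted_labels):
    """Return accuracy plus counts of the two error types."""
    t = np.asarray(true_labels)
    p = np.asarray(predicted_labels)
    return {
        "accuracy": float(np.mean(t == p)),
        # Normal emails that the model wrongly marked as spam.
        "normal_flagged_as_spam": int(np.sum((t == 0) & (p == 1))),
        # Spam emails that the model let through as normal.
        "spam_missed": int(np.sum((t == 1) & (p == 0))),
    }
```

Comparing the two error counts across models shows whether one of them leans more towards labelling emails as spam.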

The results are as follows:

The two models perform about the same (the SVM is slightly better than naive Bayes), but the SVM is more inclined to classify an email as spam.
