1. Introduction to spaCy
spaCy is an efficient, industrial-strength natural language processing (NLP) library focused on processing and analyzing text data. Unlike NLTK, spaCy is designed for production use, providing high-performance pre-trained models and a concise API.
Core features:
- Supports tokenization, part-of-speech tagging, dependency parsing, named entity recognition (NER), and other tasks.
- Ships with pre-trained models for multiple languages (English, Chinese, German, etc.).
- High performance: implemented in Cython for fast processing.
- Provides an intuitive API and rich text-processing tools.
2. Installation and configuration
Install spaCy:
pip install spacy
Download a pre-trained model (using the English model as an example):
python -m spacy download en_core_web_sm
Model naming convention: [language]_[type]_[genre]_[size]
(e.g. en_core_web_sm is a small English model trained on web text).
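The helper below is a minimal sketch, not part of spaCy itself: it loads a model and downloads it first if it is missing, relying on the fact that spacy.load raises OSError when the model package is not installed.
import spacy
from spacy.cli import download

def load_model(name: str = "en_core_web_sm"):
    """Load a spaCy model, downloading it first if needed (helper defined here, not a spaCy API)."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package is missing
        download(name)
        return spacy.load(name)

nlp = load_model("en_core_web_sm")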
3. Basic usage process
1. Loading the model and processing text
import spacy

# Load the pre-trained model
nlp = spacy.load("en_core_web_sm")

# Process text
text = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp(text)
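Once loaded, nlp is a full pipeline. You can check which components it contains; the exact list below assumes en_core_web_sm v3.x and may differ between versions.
# Inspect the components of the loaded pipeline
print(nlp.pipe_names)
# Typical output for en_core_web_sm (v3.x):
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']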
2. Analyzing the processing results
Tokenization:
for token in doc:
    print(token.text)  # Print the text of each token
Output:
Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
.
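Besides token.text, each token carries additional attributes that are useful for text cleaning; a small sketch:
for token in doc:
    # text, character offset, punctuation flag, stop-word flag, number-like flag
    print(token.text, token.idx, token.is_punct, token.is_stop, token.like_num)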
Part-of-speech tagging (POS tagging):
for token in doc:
    print(f"{token.text} → {token.pos_} → {token.tag_}")  # Coarse-grained POS and fine-grained tag
Output example:
Apple → PROPN → NNP
is → AUX → VBZ
looking → VERB → VBG
...
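If a tag abbreviation is unclear, spacy.explain returns a short human-readable description:
import spacy

print(spacy.explain("PROPN"))  # proper noun
print(spacy.explain("NNP"))    # noun, proper singular
print(spacy.explain("VBG"))    # verb, gerund or present participle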
Named Entity Recognition (NER):
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")  # Entity text and label
Output:
Apple → ORG
U.K. → GPE
$1 billion → MONEY
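Entities also expose character offsets, which is handy when extracting or highlighting them in the original text; spacy.explain works for entity labels as well:
import spacy  # only needed here for spacy.explain

for ent in doc.ents:
    # entity text, character span in the original string, label, and label description
    print(ent.text, ent.start_char, ent.end_char, ent.label_, spacy.explain(ent.label_))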
Dependency Parsing:
for token in doc:
    print(f"{token.text} → {token.dep_} → {token.head.text}")  # Token, dependency relation, and syntactic head
Output example:
Apple → nsubj → looking
is → aux → looking
looking → ROOT → looking
...
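The dependency tree can be used directly for simple information extraction, for example pulling out subject-verb pairs or iterating over base noun phrases (doc.noun_chunks needs the parser, which en_core_web_sm includes):
# Extract (subject, verb) pairs from the dependency tree
for token in doc:
    if token.dep_ == "nsubj":
        print(token.text, "→", token.head.text)  # e.g. Apple → looking

# Base noun phrases
for chunk in doc.noun_chunks:
    print(chunk.text, "→", chunk.root.dep_)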
4. Visualization tools
spaCy provides the displacy module for visualizing text analysis results.
1. Visualize the dependency tree
from spacy import displacy

displacy.render(doc, style="dep", jupyter=True)  # Display inline in a Jupyter notebook
2. Visualize named entities
displacy.render(doc, style="ent", jupyter=True)
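Outside a notebook, displacy.render returns the markup as a string (SVG for the dependency view, HTML for the entity view), which can be written to a file; displacy.serve starts a small local web server instead. A sketch of saving the dependency graph:
from pathlib import Path
from spacy import displacy

svg = displacy.render(doc, style="dep", jupyter=False)  # returns SVG markup as a string
Path("dependency_plot.svg").write_text(svg, encoding="utf-8")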
5. Processing long texts
For long texts or many documents, batch processing with nlp.pipe is recommended to improve efficiency:
texts = ["This is a sentence.", "Another example text."]
docs = list(nlp.pipe(texts))

# Multiprocessing can speed things up (use with caution)
docs = list(nlp.pipe(texts, n_process=2))
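If only part of the pipeline is needed, disabling unused components makes nlp.pipe noticeably faster; batch_size can also be tuned. The component names below assume the standard en_core_web_sm pipeline:
# Keep tokenization and NER; skip tagging, parsing and lemmatization for speed
docs = list(nlp.pipe(texts, disable=["tagger", "parser", "lemmatizer"], batch_size=50))

for d in docs:
    print([(ent.text, ent.label_) for ent in d.ents])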
6. Model and language support
Supported models:
- English: en_core_web_sm, en_core_web_md, en_core_web_lg (small/medium/large).
- Chinese: zh_core_web_sm.
- Other languages: German (de), French (fr), Spanish (es), etc.
Custom models:
spaCy also supports training your own models; you need to prepare labeled training data.
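As a rough illustration of the spaCy v3 workflow (the example sentence and annotations below are made up), NER training data is usually converted to the binary .spacy format with DocBin and then trained from the command line with a config file:
import spacy
from spacy.tokens import DocBin

nlp_blank = spacy.blank("en")  # blank English pipeline, used only for tokenization

# Hypothetical labeled example: (text, [(start_char, end_char, label), ...])
train_examples = [
    ("Apple is based in Cupertino.", [(0, 5, "ORG"), (18, 27, "GPE")]),
]

db = DocBin()
for text, annotations in train_examples:
    doc = nlp_blank.make_doc(text)
    spans = [doc.char_span(start, end, label=label) for start, end, label in annotations]
    doc.ents = [span for span in spans if span is not None]  # drop spans that don't align to tokens
    db.add(doc)

db.to_disk("./train.spacy")
# Then train with a config file, e.g.:
#   python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy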
7. Summary
- Applicable scenarios: Information extraction, text cleaning, entity recognition, and rapid prototyping.
- Advantages: Efficient, easy to use, rich pre-trained models.
- Learning Resources:
Official documentation: https://spacy.io/
Community tutorials