1. Introduction to spaCy
spaCy is an efficient, industrial-strength natural language processing (NLP) library focused on processing and analyzing text data. Unlike NLTK, spaCy is designed for production use, providing high-performance pre-trained models and a concise API.
Core features:
- Supports tokenization, part-of-speech tagging, dependency parsing, named entity recognition (NER), and other tasks.
- Ships with pre-trained models for multiple languages (English, Chinese, German, etc.).
- High performance: implemented in Cython for fast processing.
- Provides an intuitive API and rich text-processing tools.
2. Installation and configuration
Install spaCy:
pip install spacy
Download a pre-trained model (using the English model as an example):
python -m spacy download en_core_web_sm
Model naming convention: [language]_[type]_[genre]_[size]
(e.g. en_core_web_sm is a small English model trained on web text).
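The helper below is a minimal sketch, not part of spaCy itself: it loads a model and downloads it first if it is missing, relying on the fact that spacy.load raises OSError when the model package is not installed.
import spacy
from spacy.cli import download

def load_model(name: str = "en_core_web_sm"):
    """Load a spaCy model, downloading it first if needed (helper defined here, not a spaCy API)."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package is missing
        download(name)
        return spacy.load(name)

nlp = load_model("en_core_web_sm")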
3. Basic usage process
1. Loading the model and processing text
import spacy

# Load the pre-trained model
nlp = spacy.load("en_core_web_sm")

# Process text
text = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp(text)
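Once loaded, nlp is a full pipeline. You can check which components it contains; the exact list below assumes en_core_web_sm v3.x and may differ between versions.
# Inspect the components of the loaded pipeline
print(nlp.pipe_names)
# Typical output for en_core_web_sm (v3.x):
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']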
2. Analyzing the processing results
Tokenization:
for token in doc:
    print(token.text)  # Print the text of each token
Output:
Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
.
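Besides token.text, each token carries additional attributes that are useful for text cleaning; a small sketch:
for token in doc:
    # text, character offset, punctuation flag, stop-word flag, number-like flag
    print(token.text, token.idx, token.is_punct, token.is_stop, token.like_num)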
Part-of-speech tagging (POS tagging):
for token in doc:
    print(f"{token.text} → {token.pos_} → {token.tag_}")  # Coarse-grained POS and fine-grained tag
Output example:
Apple → PROPN → NNP
is → AUX → VBZ
looking → VERB → VBG
...
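If a tag abbreviation is unclear, spacy.explain returns a short human-readable description:
import spacy

print(spacy.explain("PROPN"))  # proper noun
print(spacy.explain("NNP"))    # noun, proper singular
print(spacy.explain("VBG"))    # verb, gerund or present participle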
Named Entity Recognition (NER):
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")  # Entity text and label
Output:
Apple → ORG
U.K. → GPE
$1 billion → MONEY
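Entities also expose character offsets, which is handy when extracting or highlighting them in the original text; spacy.explain works for entity labels as well:
import spacy  # only needed here for spacy.explain

for ent in doc.ents:
    # entity text, character span in the original string, label, and label description
    print(ent.text, ent.start_char, ent.end_char, ent.label_, spacy.explain(ent.label_))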
Dependency Parsing:
for token in doc:
    print(f"{token.text} → {token.dep_} → {token.head.text}")  # Token, dependency relation, and syntactic head
Output example:
Apple → nsubj → looking
is → aux → looking
looking → ROOT → looking
...
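The dependency tree can be used directly for simple information extraction, for example pulling out subject-verb pairs or iterating over base noun phrases (doc.noun_chunks needs the parser, which en_core_web_sm includes):
# Extract (subject, verb) pairs from the dependency tree
for token in doc:
    if token.dep_ == "nsubj":
        print(token.text, "→", token.head.text)  # e.g. Apple → looking

# Base noun phrases
for chunk in doc.noun_chunks:
    print(chunk.text, "→", chunk.root.dep_)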
4. Visualization tools
spaCy provides the displacy module for visualizing text analysis results.
1. Visualize the dependency tree
from spacy import displacy

displacy.render(doc, style="dep", jupyter=True)  # Display inline in a Jupyter notebook
2. Visualize named entities
displacy.render(doc, style="ent", jupyter=True)
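Outside a notebook, displacy.render returns the markup as a string (SVG for the dependency view, HTML for the entity view), which can be written to a file; displacy.serve starts a small local web server instead. A sketch of saving the dependency graph:
from pathlib import Path
from spacy import displacy

svg = displacy.render(doc, style="dep", jupyter=False)  # returns SVG markup as a string
Path("dependency_plot.svg").write_text(svg, encoding="utf-8")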
5. Processing long texts
For long texts or many documents, batch processing with nlp.pipe is recommended to improve efficiency:
texts = ["This is a sentence.", "Another example text."]
docs = list(nlp.pipe(texts))

# Multiprocessing can speed things up (use with caution)
docs = list(nlp.pipe(texts, n_process=2))
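If only part of the pipeline is needed, disabling unused components makes nlp.pipe noticeably faster; batch_size can also be tuned. The component names below assume the standard en_core_web_sm pipeline:
# Keep tokenization and NER; skip tagging, parsing and lemmatization for speed
docs = list(nlp.pipe(texts, disable=["tagger", "parser", "lemmatizer"], batch_size=50))

for d in docs:
    print([(ent.text, ent.label_) for ent in d.ents])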
6. Model and language support
Supported models:
- English: en_core_web_sm, en_core_web_md, en_core_web_lg (small/medium/large).
- Chinese: zh_core_web_sm.
- Other languages: German (de), French (fr), Spanish (es), etc.
Custom models:
spaCy also supports training your own models; you need to prepare labeled training data.
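As a rough illustration of the spaCy v3 workflow (the example sentence and annotations below are made up), NER training data is usually converted to the binary .spacy format with DocBin and then trained from the command line with a config file:
import spacy
from spacy.tokens import DocBin

nlp_blank = spacy.blank("en")  # blank English pipeline, used only for tokenization

# Hypothetical labeled example: (text, [(start_char, end_char, label), ...])
train_examples = [
    ("Apple is based in Cupertino.", [(0, 5, "ORG"), (18, 27, "GPE")]),
]

db = DocBin()
for text, annotations in train_examples:
    doc = nlp_blank.make_doc(text)
    spans = [doc.char_span(start, end, label=label) for start, end, label in annotations]
    doc.ents = [span for span in spans if span is not None]  # drop spans that don't align to tokens
    db.add(doc)

db.to_disk("./train.spacy")
# Then train with a config file, e.g.:
#   python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy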
7. Summary
- Applicable scenarios: Information extraction, text cleaning, entity recognition, and rapid prototyping.
- Advantages: Efficient, easy to use, rich pre-trained models.
- Learning Resources:
Official documentation: https://spacy.io/
Community tutorials