A few words upfront
I had originally planned to continue with a blog post on using Portia, but while writing the code I found that Portia throws far too many errors inside DockerToolbox on Windows 7. I recommend running it in a Linux virtual machine or directly on a server instead; otherwise it's just too much effort~
Today we shift gears a bit and introduce newspaper.
newspaper
GitHub address: /codelucas/n…
As the name suggests, it has something to do with newspapers and news. The library is mainly used for crawling and curating articles, and it was built by a well-known Chinese developer; naturally, his GitHub page also shows recommendations from other developers.
For example, the author of the requests library tweeted the following testimonial:
"Newspaper is an amazing python library for extracting & curating articles."
The Changelog even dedicated a review article to it, so take a look at that as well:
Newspaper delivers Instapaper style article extraction.
For a crawler library made by a fellow Chinese developer, we really ought to give it a proper introduction!
Very easy to install
```
pip install newspaper3k -i /simple
```
Official documentation: /en/latest/u…
Use of the Newspaper Framework
The framework is very easy to use; just follow that page of the documentation and you can start applying it.
Example: fetching the content of a single news article
The first way to use it is to fetch the content of the web page directly:
```python
from newspaper import Article

url = "/p/857678806293124"

article = Article(url)   # Create the article object
article.download()       # Download the web page
article.parse()          # Parse the web page
print(article.html)      # Print the html document
```
Of course there are some other attributes as well, but the framework relies on keyword-based recognition, so there are some bugs and the recognition is not always accurate:
```python
# print(article.html)        # Print the html document
print(article.text)          # Body of the news article
print("-" * 100)
print(article.title)         # News headline
print("-" * 100)
print(article.authors)       # News authors
print("-" * 100)
print(article.summary)       # News summary
print(article.keywords)      # News keywords
# print(article.top_image)   # URL of the top image of this article
# print(article.images)      # All image urls in this article
```
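One thing worth noting (my own addition, not from the original post): in newspaper3k the summary and keywords fields stay empty until the NLP step has been run. A minimal sketch, using a placeholder URL and assuming nltk's punkt data is available locally:

```python
from newspaper import Article

article = Article("https://example.com/some-news")  # placeholder URL, not from the original post
article.download()
article.parse()
article.nlp()   # populates article.summary and article.keywords (may require nltk's punkt data)

print(article.keywords)
print(article.summary)
```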
Newspaper article caching
By default, newspaper caches all previously extracted articles and skips any article that has already been crawled. This prevents duplicate articles and speeds up extraction. The memoize_articles parameter lets you choose whether or not to cache.
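To make the caching behaviour concrete, here is a minimal sketch (my own, with a placeholder site URL): building the same source twice returns only the newly seen articles the second time, unless memoize_articles=False is passed.

```python
import newspaper

site = "https://news.sina.com.cn/"  # placeholder source URL, not from the original post

first = newspaper.build(site)                           # articles found here are cached
second = newspaper.build(site)                          # already-crawled articles are skipped
fresh = newspaper.build(site, memoize_articles=False)   # caching disabled, everything returned again

print(first.size(), second.size(), fresh.size())        # Source.size() == number of articles found
```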
But when I extracted with the following method, a magical bug appeared: no matter what, I couldn't get the article I wanted. Sigh~ it seems the framework still has a long road of improvement ahead.
```python
import newspaper

url = "/c/2020-08-29/"

# article = Article(url)   # create article object
# article.download()       # load the page
# article.parse()          # parse the page

news = newspaper.build(url, language='zh', memoize_articles=False)
article = news.articles[0]
article.download()
article.parse()
print('title=', article.title)
```
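If build() does not surface the article you expect, one way to debug (my own suggestion, not in the original post) is to print what the Source object actually collected before indexing into it:

```python
import newspaper

url = "https://news.sina.com.cn/"  # placeholder source URL, not from the original post
news = newspaper.build(url, language='zh', memoize_articles=False)

print('articles found:', news.size())   # how many candidate articles were collected
for art in news.articles[:10]:          # peek at the first few candidate URLs
    print(art.url)
```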
Other Functions
In the course of using it, I found that the parsing really does have serious problems, but the overall design ideas of the framework are still very good. Results are a bit hit-or-miss; judging from the comments on GitHub, people actually have high expectations for newspaper. Having used it, I'd still suggest pairing requests with bs4, which feels more reliable.
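For comparison, a bare-bones requests + bs4 version of the same job might look like the sketch below; the URL and the CSS selectors are placeholders and depend entirely on the target page.

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-news"   # placeholder URL
resp = requests.get(url, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
title = soup.find("h1")                                       # placeholder selector
paragraphs = soup.select("article p")                         # placeholder selector
body = "\n".join(p.get_text(strip=True) for p in paragraphs)

print(title.get_text(strip=True) if title else "no title found")
print(body)
```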
In addition to the features briefly described above, it also has a number of extended capabilities, for example:
- requests and newspaper can be combined to parse the body of a page, i.e. crawl with requests and let newspaper act only as the parser (see the sketch after this list)
- Google Trends text can be invoked
- Multi-task crawling is supported
- NLP natural language processing is supported
- The official documentation even includes an Easter egg~, which you can find at the bottom of the docs!
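For the first item in the list above, a hedged sketch of combining the two: fetch the page yourself with requests, then hand the HTML to newspaper and use it only as a parser. The URL is a placeholder; as far as I know, newspaper3k's download() accepts pre-fetched HTML via its input_html argument.

```python
import requests
from newspaper import Article

url = "https://example.com/some-news"       # placeholder URL
html = requests.get(url, timeout=10).text   # crawl with requests (custom headers, proxies, retries, ...)

article = Article(url)
article.download(input_html=html)           # newspaper acts only as the parser here
article.parse()

print(article.title)
print(article.text)
```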
Well, it's a mixed bag.
A few words at the end
I had wanted to make newspaper my go-to crawler framework in Python, but it looks like that won't work out. Still, it's good to broaden my knowledge, and of course, after downloading the source code from GitHub, there is plenty to learn by studying the coding conventions of the people who wrote it.
That's all for this walkthrough of the Python crawler framework newspaper. For more on Python crawler frameworks, please check out my other related articles!