A few words upfront
I had originally planned to continue with a blog post on using Portia, but while writing the code I found that Portia throws far too many errors inside DockerToolbox on Windows 7. I recommend running it in a Linux virtual machine or directly on a server instead; otherwise it's just too much effort~
Today we shift gears a bit and introduce newspaper.
newspaper
GitHub address: /codelucas/n…
As the name suggests, it has something to do with newspapers and news. The library is mainly used for crawling and curating articles, and it was built by a well-known Chinese developer; naturally, his GitHub page also shows recommendations from other developers.
For example, the author of the requests library tweeted the following testimonial:
"Newspaper is an amazing python library for extracting & curating articles."
The Changelog even dedicated a review article to it, so take a look at that as well:
Newspaper delivers Instapaper style article extraction.
For a crawler library made by a fellow Chinese developer, we really ought to give it a proper introduction!
Very easy to install
```
pip install newspaper3k -i /simple
```
Official documentation: /en/latest/u…
Use of the Newspaper Framework
The framework is very easy to use; just follow that page of the documentation and you can start applying it.
Example: fetching the content of a single news article
The first way to use it is to fetch the content of the web page directly:
```python
from newspaper import Article

url = "/p/857678806293124"

article = Article(url)   # Create the article object
article.download()       # Download the web page
article.parse()          # Parse the web page
print(article.html)      # Print the html document
```
Of course there are some other attributes as well, but the framework relies on keyword-based recognition, so there are some bugs and the recognition is not always accurate:
```python
# print(article.html)        # Print the html document
print(article.text)          # Body of the news article
print("-" * 100)
print(article.title)         # News headline
print("-" * 100)
print(article.authors)       # News authors
print("-" * 100)
print(article.summary)       # News summary
print(article.keywords)      # News keywords
# print(article.top_image)   # URL of the top image of this article
# print(article.images)      # All image urls in this article
```
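One thing worth noting (my own addition, not from the original post): in newspaper3k the summary and keywords fields stay empty until the NLP step has been run. A minimal sketch, using a placeholder URL and assuming nltk's punkt data is available locally:

```python
from newspaper import Article

article = Article("https://example.com/some-news")  # placeholder URL, not from the original post
article.download()
article.parse()
article.nlp()   # populates article.summary and article.keywords (may require nltk's punkt data)

print(article.keywords)
print(article.summary)
```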
Newspaper article caching
By default, newspaper caches all previously extracted articles and skips any article that has already been crawled. This prevents duplicate articles and speeds up extraction. The memoize_articles parameter lets you choose whether or not to cache.
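To make the caching behaviour concrete, here is a minimal sketch (my own, with a placeholder site URL): building the same source twice returns only the newly seen articles the second time, unless memoize_articles=False is passed.

```python
import newspaper

site = "https://news.sina.com.cn/"  # placeholder source URL, not from the original post

first = newspaper.build(site)                           # articles found here are cached
second = newspaper.build(site)                          # already-crawled articles are skipped
fresh = newspaper.build(site, memoize_articles=False)   # caching disabled, everything returned again

print(first.size(), second.size(), fresh.size())        # Source.size() == number of articles found
```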
But when I extracted with the following method, a magical bug appeared: no matter what, I couldn't get the article I wanted. Sigh~ it seems the framework still has a long road of improvement ahead.
```python
import newspaper

url = "/c/2020-08-29/"

# article = Article(url)   # create article object
# article.download()       # load the page
# article.parse()          # parse the page

news = newspaper.build(url, language='zh', memoize_articles=False)
article = news.articles[0]
article.download()
article.parse()
print('title=', article.title)
```
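If build() does not surface the article you expect, one way to debug (my own suggestion, not in the original post) is to print what the Source object actually collected before indexing into it:

```python
import newspaper

url = "https://news.sina.com.cn/"  # placeholder source URL, not from the original post
news = newspaper.build(url, language='zh', memoize_articles=False)

print('articles found:', news.size())   # how many candidate articles were collected
for art in news.articles[:10]:          # peek at the first few candidate URLs
    print(art.url)
```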
Other Functions
In the course of using it, I found that the parsing really does have serious problems, but the overall design ideas of the framework are still very good. Results are a bit hit-or-miss; judging from the comments on GitHub, people actually have high expectations for newspaper. Having used it, I'd still suggest pairing requests with bs4, which feels more reliable.
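For comparison, a bare-bones requests + bs4 version of the same job might look like the sketch below; the URL and the CSS selectors are placeholders and depend entirely on the target page.

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-news"   # placeholder URL
resp = requests.get(url, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
title = soup.find("h1")                                       # placeholder selector
paragraphs = soup.select("article p")                         # placeholder selector
body = "\n".join(p.get_text(strip=True) for p in paragraphs)

print(title.get_text(strip=True) if title else "no title found")
print(body)
```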
In addition to the features briefly described above, it also has a number of extended capabilities, for example:
- requests and newspaper can be combined to parse the body of a page, i.e. crawl with requests and let newspaper act only as the parser (see the sketch after this list)
- Google Trends text can be invoked
- Multi-task crawling is supported
- NLP natural language processing is supported
- The official documentation even includes an Easter egg~, which you can find at the bottom of the docs!
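For the first item in the list above, a hedged sketch of combining the two: fetch the page yourself with requests, then hand the HTML to newspaper and use it only as a parser. The URL is a placeholder; as far as I know, newspaper3k's download() accepts pre-fetched HTML via its input_html argument.

```python
import requests
from newspaper import Article

url = "https://example.com/some-news"       # placeholder URL
html = requests.get(url, timeout=10).text   # crawl with requests (custom headers, proxies, retries, ...)

article = Article(url)
article.download(input_html=html)           # newspaper acts only as the parser here
article.parse()

print(article.title)
print(article.text)
```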
Well, it's a mixed bag.
A few words at the end
I had wanted to make newspaper my go-to crawler framework in Python, but it looks like that won't work out. Still, it's good to broaden my knowledge, and of course, after downloading the source code from GitHub, there is plenty to learn by studying the coding conventions of the people who wrote it.
That's all for this walkthrough of the Python crawler framework newspaper. For more on Python crawler frameworks, please check out my other related articles!