
Python3 Crawler Learning Beginner's Tutorial

This article presents introductory examples of Python3 crawlers, shared for your reference as follows:

Most of the crawler tutorials you find online are for Python2, but Python3 is the trend of the future, and many beginners who learn Python3 from Python2 tutorials find it hard to adapt, since the two still differ in many ways. A systematic approach and learning route matter a great deal, so after working with Python3 for a while I would like to write about my own learning process, share my experience, and exercise my skills along the way.

I. Getting Started

Here is the official technical documentation of Python3. One thing to keep in mind: a language's technical documentation is for looking things up, not for studying end to end. There is really no need to memorize it; doing so makes learning very inefficient. It is far better to learn while building something, because you pick things up through practice. Even if you memorized the whole document, you would still struggle to complete an actual project. I went down that path at first and took plenty of detours, so I recommend the tutorials on W3Cschool instead, which combine learning with practice very well.

1. Let's cut the crap and get down to business.

The first example: crawling the source code of the home page of Zhihu.

# -*- coding: utf-8 -*-
import urllib.request

url = "https://www.zhihu.com"  # Zhihu home page
page_info = urllib.request.urlopen(url).read()
page_info = page_info.decode('utf-8')
print(page_info)

Run results:

After running, the HTML source code of the Zhihu home page is printed out in the IDLE shell as one long wall of markup.

Crawler Definition:

A web crawler, also known as a web spider, is a program or script that automatically crawls information from websites according to certain rules.

Synopsis:

Web spider is a very vivid name. If you compare the Internet to a spider's web, then the spider is a spider crawling around on that web. Web spiders find pages by following links: starting from some page of a website, a spider reads the content of that page, finds the other links it contains, follows those links to the next pages, and so on, until all the pages of the website have been crawled.

Crawler Process:

① Use urllib.request to open the URL and obtain the page's HTML document -- ② open the page source in the browser and analyze the element nodes -- ③ extract the desired data with Beautiful Soup (covered later) or regular expressions -- ④ store the data on the local disk or in a database (crawl, analyze, store).
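
As an illustration only, here is a minimal sketch of those four steps applied to the Zhihu home page used in the example above. Since Beautiful Soup is covered later, a simple regular expression stands in for step ③, and the output file name links.txt is just an arbitrary choice:

import re
import urllib.request

url = "https://www.zhihu.com"  # the same page as in the example above

# ① open the URL and get the HTML document
html = urllib.request.urlopen(url).read().decode('utf-8')

# ② / ③ extract the data identified by analyzing the page source;
# here we simply collect all absolute links as an illustration
links = re.findall(r'href="(https?://[^"]+)"', html)

# ④ store the data on the local disk (links.txt is an arbitrary name)
with open('links.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(links))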

urllib and urllib2

In Python3 the old urllib and urllib2 libraries have been merged into a single urllib package, which is divided into several submodules: urllib.request, urllib.parse, urllib.error and urllib.robotparser. Although the function names are mostly the same as before, you need to be aware of which functions have moved into which submodule when using the new urllib library.
urllib is part of Python's standard library and contains functions for requesting data from the web, handling cookies, and even changing metadata such as request headers and user agents.
urlopen is used to open and read a remote object obtained from the network. It can easily read HTML files, image files or any other file stream.

url = ""
page_info = (url).read()

urllib.request is a submodule of urllib that can open and process some complex URLs.
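
Putting the points above together, here is a hedged sketch of building a more complex URL with urllib.parse and changing the User-Agent request header via urllib.request.Request. The httpbin.org test service and the header value are stand-ins chosen purely for illustration:

import urllib.parse
import urllib.request

# build a URL with query parameters; non-ASCII text is percent-encoded for us
params = urllib.parse.urlencode({'q': 'python 爬虫', 'page': 1})
url = 'https://httpbin.org/get?' + params   # httpbin.org simply echoes the request

# a Request object lets us change metadata such as the User-Agent header
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

with urllib.request.urlopen(req) as response:
    print(response.read().decode('utf-8'))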

decode('utf-8') is used to decode the page content as utf-8; otherwise the output will be garbled.

page_info = page_info.decode('utf-8')
print(page_info)

The urllib.request.urlopen() method opens the URL and returns an http.client.HTTPResponse object; the read() method retrieves the response body, which is then decoded and finally printed out with print().
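
To make that concrete, here is a small sketch (reusing the Zhihu URL from the example) that inspects the returned object before reading and decoding the body; the printed values naturally depend on the site's actual response:

import urllib.request

response = urllib.request.urlopen("https://www.zhihu.com")
print(type(response))                      # <class 'http.client.HTTPResponse'>
print(response.status)                     # HTTP status code, e.g. 200
print(response.getheader('Content-Type'))  # e.g. text/html; charset=utf-8

body = response.read()       # bytes
text = body.decode('utf-8')  # str, ready to print
print(text[:200])            # first 200 characters of the page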


I hope that what I have said in this article will help you in Python programming.