Web crawling simply means fetching the web resource identified by a URL and saving it locally. Python has many libraries for fetching web pages; let's start with urllib2.
urllib2 is part of the Python 2 standard library (there is nothing to install; just import it and use it).
Official urllib2 documentation: https://docs.python.org/2/library/urllib2.html
The urllib2 source code lives in Lib/urllib2.py of the CPython 2.7 source tree.
(In Python 3, urllib2 was changed to urllib.request; see the note at the end of this article.)
urlopen
Let's start with a snippet of code:
```python
# -*- coding: utf-8 -*-
# 01.urllib2_urlopen.py

# Import the urllib2 library
import urllib2

# Send a request to the specified url and get back the server's response
# as a file-like object
response = urllib2.urlopen("http://www.baidu.com")

# The file-like object supports file methods, e.g. read() reads the whole body
html = response.read()

# Print the string
print(html)
```
Running the script prints the result:
python2 01.urllib2_urlopen.py
In fact, if you open the Baidu home page in a browser, right-click and choose "View Source", you will find it is exactly what we just printed. In other words, the four lines of code above have already crawled the entire source of Baidu's home page.
The Python code for a basic URL request really is that simple.
Request
The official documentation describes urlopen as follows:
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
Open the URL url, which can be either a string or a Request object.
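Most of the optional parameters are rarely needed, but timeout is worth knowing. Here is a minimal sketch (the 10-second value is just an illustration):

```python
# -*- coding: utf-8 -*-
import urllib2

# timeout (in seconds) aborts the request if the server is too slow to respond
response = urllib2.urlopen("http://www.baidu.com", timeout=10)

print(response.geturl())  # the URL actually retrieved (after any redirects)
print(response.info())    # the response headers
```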
In our first example, the argument to urlopen() was a URL string. But if you need to perform more complex operations, such as adding HTTP headers, you must create a Request instance to pass to urlopen(), and the URL to be accessed becomes a parameter of that Request instance.
```python
# -*- coding: utf-8 -*-
# 02.urllib2_request.py

import urllib2

# The url is passed to the Request() constructor, which builds and returns
# a Request object
request = urllib2.Request("http://www.baidu.com")

# The Request object is passed to urlopen(), which sends it to the server
# and receives the response
response = urllib2.urlopen(request)

html = response.read()
print(html)
```
The run results are exactly the same:
When creating a new Request instance, two other parameters can be set in addition to the required url parameter:
data (empty by default): data to be submitted along with the URL, such as form data to be POSTed; when data is supplied, the HTTP request changes from "GET" to "POST".
headers (empty by default): a dictionary containing the key-value pairs of HTTP headers to send.
Both parameters are described below; a sketch of a POST request using data follows.
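As a quick illustration of the data parameter, here is a minimal sketch of a POST request. The target URL and form fields are hypothetical placeholders, not part of the original examples:

```python
# -*- coding: utf-8 -*-
import urllib   # provides urlencode in Python 2
import urllib2

# Hypothetical form endpoint and fields, for illustration only
url = "http://www.example.com/login"
form_data = {"user": "test", "password": "123456"}

# Encoding the dict and passing it as data turns the request into a POST
data = urllib.urlencode(form_data)
request = urllib2.Request(url, data=data)

response = urllib2.urlopen(request)
print(response.read())
```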
User-Agent
But firing off requests at a website with bare urllib2 like this is a little abrupt. It is as if every house has a door, and you, a passer-by, barge straight in; that is obviously not very polite. Moreover, some sites do not like being visited by programs (non-human access) and may reject your requests.
If we request someone else's website with a legitimate identity, however, we are obviously welcome. So we should give our code an identity of its own, and that identity is the User-Agent header.
A browser is a recognized and accepted identity in the Internet world. If we want our crawler to look more like a real user, the first step is to masquerade as a recognized browser. Different browsers send different User-Agent headers with their requests. urllib2's default User-Agent header is Python-urllib/x.y (where x and y are the Python major and minor version numbers, e.g. Python-urllib/2.7).
```python
# -*- coding: utf-8 -*-
# 03.urllib2_useragent.py

import urllib2

url = "http://www.baidu.com"

# User-Agent of IE 9.0, kept in ua_header
ua_header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

# Pass url and headers together when constructing the Request;
# the request will carry the IE 9.0 User-Agent
request = urllib2.Request(url, headers=ua_header)

# Send this request to the server
response = urllib2.urlopen(request)

html = response.read()
print(html)
```
Add more Header information
A complete HTTP request is constructed by adding specific headers to the HTTP Request.
You can add or modify a specific header by calling Request.add_header(), and view an existing header by calling Request.get_header().
Add a specific header
```python
# -*- coding: utf-8 -*-
# 04.urllib2_headers.py

import urllib2

url = "http://www.baidu.com"

# User-Agent of IE 9.0
header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
request = urllib2.Request(url, headers=header)

# A specific header can also be added/modified by calling Request.add_header()
request.add_header("Connection", "keep-alive")

# An existing header can be viewed by calling Request.get_header()
request.get_header(header_name="Connection")

response = urllib2.urlopen(request)
print(response.code)  # view the response status code
html = response.read()
print(html)
```

Randomly add/modify a User-Agent

```python
# -*- coding: utf-8 -*-
# 05.urllib2_add_headers.py

import urllib2
import random

url = "http://www.baidu.com"

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; ) Apple.... ",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)... ",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X.... ",
    "Mozilla/5.0 (Macintosh; Intel Mac OS... "
]

# Pick one User-Agent at random
user_agent = random.choice(ua_list)

request = urllib2.Request(url)

# A specific header can also be added/modified by calling Request.add_header()
request.add_header("User-Agent", user_agent)

# Header names are stored with the first letter capitalized and the rest
# lowercase, so read it back as "User-agent"
request.get_header("User-agent")

response = urllib2.urlopen(request)
html = response.read()
print(html)
```
Note
In Python 3, the urllib2 module has been split across several modules named urllib.request and urllib.error.
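For reference, here is the first example rewritten for Python 3, as a minimal sketch:

```python
# -*- coding: utf-8 -*-
# Python 3 equivalent of 01.urllib2_urlopen.py
from urllib import request

response = request.urlopen("http://www.baidu.com")
html = response.read()       # read() returns bytes in Python 3
print(html.decode("utf-8"))  # decode to a str before printing
```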
That's all for this article; I hope it helps you in your studies.