Web crawling simply means fetching the web resource identified by a URL and saving it locally. Python has many libraries for fetching web pages; let's start with urllib2.
urllib2 is part of the Python 2 standard library (there is nothing to install; just import it and use it).
Official urllib2 documentation: https://docs.python.org/2/library/urllib2.html
The urllib2 source code lives in Lib/urllib2.py of the CPython 2.7 source tree.
(In Python 3, urllib2 was changed to urllib.request; see the note at the end of this article.)
urlopen
Let's start with a snippet of code:
```python
# -*- coding: utf-8 -*-
# 01.urllib2_urlopen.py

# Import the urllib2 library
import urllib2

# Send a request to the specified url and get back the server's response
# as a file-like object
response = urllib2.urlopen("http://www.baidu.com")

# The file-like object supports file methods, e.g. read() reads the whole body
html = response.read()

# Print the string
print(html)
```
Running the script prints the result:
python2 01.urllib2_urlopen.py
In fact, if you open the Baidu home page in a browser, right-click and choose "View Source", you will find it is exactly what we just printed. In other words, the four lines of code above have already crawled the entire source of Baidu's home page.
The Python code for a basic URL request really is that simple.
Request
The official documentation describes urlopen as follows:
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
Open the URL url, which can be either a string or a Request object.
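Most of the optional parameters are rarely needed, but timeout is worth knowing. Here is a minimal sketch (the 10-second value is just an illustration):

```python
# -*- coding: utf-8 -*-
import urllib2

# timeout (in seconds) aborts the request if the server is too slow to respond
response = urllib2.urlopen("http://www.baidu.com", timeout=10)

print(response.geturl())  # the URL actually retrieved (after any redirects)
print(response.info())    # the response headers
```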
In our first example, the argument to urlopen() was a URL string. But if you need to perform more complex operations, such as adding HTTP headers, you must create a Request instance to pass to urlopen(), and the URL to be accessed becomes a parameter of that Request instance.
```python
# -*- coding: utf-8 -*-
# 02.urllib2_request.py

import urllib2

# The url is passed to the Request() constructor, which builds and returns
# a Request object
request = urllib2.Request("http://www.baidu.com")

# The Request object is passed to urlopen(), which sends it to the server
# and receives the response
response = urllib2.urlopen(request)

html = response.read()
print(html)
```
The run results are exactly the same:
When creating a new Request instance, two other parameters can be set in addition to the required url parameter:
data (empty by default): data to be submitted along with the URL, such as form data to be POSTed; when data is supplied, the HTTP request changes from "GET" to "POST".
headers (empty by default): a dictionary containing the key-value pairs of HTTP headers to send.
Both parameters are described below; a sketch of a POST request using data follows.
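As a quick illustration of the data parameter, here is a minimal sketch of a POST request. The target URL and form fields are hypothetical placeholders, not part of the original examples:

```python
# -*- coding: utf-8 -*-
import urllib   # provides urlencode in Python 2
import urllib2

# Hypothetical form endpoint and fields, for illustration only
url = "http://www.example.com/login"
form_data = {"user": "test", "password": "123456"}

# Encoding the dict and passing it as data turns the request into a POST
data = urllib.urlencode(form_data)
request = urllib2.Request(url, data=data)

response = urllib2.urlopen(request)
print(response.read())
```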
User-Agent
But firing off requests at a website with bare urllib2 like this is a little abrupt. It is as if every house has a door, and you, a passer-by, barge straight in; that is obviously not very polite. Moreover, some sites do not like being visited by programs (non-human access) and may reject your requests.
If we request someone else's website with a legitimate identity, however, we are obviously welcome. So we should give our code an identity of its own, and that identity is the User-Agent header.
A browser is a recognized and accepted identity in the Internet world. If we want our crawler to look more like a real user, the first step is to masquerade as a recognized browser. Different browsers send different User-Agent headers with their requests. urllib2's default User-Agent header is Python-urllib/x.y (where x and y are the Python major and minor version numbers, e.g. Python-urllib/2.7).
```python
# -*- coding: utf-8 -*-
# 03.urllib2_useragent.py

import urllib2

url = "http://www.baidu.com"

# User-Agent of IE 9.0, kept in ua_header
ua_header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

# Pass url and headers together when constructing the Request;
# the request will carry the IE 9.0 User-Agent
request = urllib2.Request(url, headers=ua_header)

# Send this request to the server
response = urllib2.urlopen(request)

html = response.read()
print(html)
```
Add more Header information
A complete HTTP request is constructed by adding specific headers to the HTTP Request.
You can add or modify a specific header by calling Request.add_header(), and view an existing header by calling Request.get_header().
Add a specific header
```python
# -*- coding: utf-8 -*-
# 04.urllib2_headers.py

import urllib2

url = "http://www.baidu.com"

# User-Agent of IE 9.0
header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
request = urllib2.Request(url, headers=header)

# A specific header can also be added/modified by calling Request.add_header()
request.add_header("Connection", "keep-alive")

# An existing header can be viewed by calling Request.get_header()
request.get_header(header_name="Connection")

response = urllib2.urlopen(request)
print(response.code)  # view the response status code
html = response.read()
print(html)
```

Randomly add/modify a User-Agent

```python
# -*- coding: utf-8 -*-
# 05.urllib2_add_headers.py

import urllib2
import random

url = "http://www.baidu.com"

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; ) Apple.... ",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)... ",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X.... ",
    "Mozilla/5.0 (Macintosh; Intel Mac OS... "
]

# Pick one User-Agent at random
user_agent = random.choice(ua_list)

request = urllib2.Request(url)

# A specific header can also be added/modified by calling Request.add_header()
request.add_header("User-Agent", user_agent)

# Header names are stored with the first letter capitalized and the rest
# lowercase, so read it back as "User-agent"
request.get_header("User-agent")

response = urllib2.urlopen(request)
html = response.read()
print(html)
```
Note
In Python 3, the urllib2 module has been split across several modules named urllib.request and urllib.error.
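For reference, here is the first example rewritten for Python 3, as a minimal sketch:

```python
# -*- coding: utf-8 -*-
# Python 3 equivalent of 01.urllib2_urlopen.py
from urllib import request

response = request.urlopen("http://www.baidu.com")
html = response.read()       # read() returns bytes in Python 3
print(html.decode("utf-8"))  # decode to a str before printing
```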
That's all for this article; I hope it helps you in your studies.