
Writing a Python Crawler from Scratch: A Guide to Using urllib2

The previous section gave a brief introduction to urllib2; this section collects some of the finer points of using it.

1. Proxy settings

By default, urllib2 will use the environment variable http_proxy to set the HTTP Proxy.
If you want to control the Proxy explicitly in your program, without being affected by environment variables, you can do it as follows.
Create a new file, test14, to implement a simple proxy demo:

The code is as follows:

import urllib2

enable_proxy = True
# The proxy address is a placeholder; substitute your own proxy here.
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)

One detail to note here is that urllib2.install_opener() sets the global opener for urllib2.
This is convenient later on, but it gives you no fine-grained control; for example, you cannot use two different Proxy settings within the same program.
A better practice is not to change the global settings with install_opener, but simply to call the opener's open() method instead of the global urlopen() method.
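
For example, a minimal sketch (the proxy addresses and URL are placeholders) that keeps two independent Proxy configurations side by side by calling each opener's open() directly:

import urllib2

# Two independent openers with different (placeholder) proxies;
# neither one is installed globally.
opener_a = urllib2.build_opener(
    urllib2.ProxyHandler({"http": 'http://proxy-a.example.com:8080'}))
opener_b = urllib2.build_opener(
    urllib2.ProxyHandler({"http": 'http://proxy-b.example.com:8080'}))

response_a = opener_a.open('http://www.example.com')  # goes through proxy A
response_b = opener_b.open('http://www.example.com')  # goes through proxy B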

2. Timeout setting

In older versions of Python (before Python 2.6), the urllib2 API did not expose a timeout setting; the only way to set a timeout was to change the global timeout value of the socket module.

The code is as follows:

import urllib2
import socket

socket.setdefaulttimeout(10)  # time out after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # another way

As of Python 2.6, the timeout can be set directly via the timeout parameter of urllib2.urlopen().

The code is as follows:

import urllib2

# The URL is a placeholder; the original example URL was lost.
response = urllib2.urlopen('http://www.example.com', timeout=10)

3. Adding a specific Header to an HTTP Request

To add a header, you need to use the Request object:

The code is as follows:

import urllib2

# The URL is a placeholder; the original example URL was lost.
request = urllib2.Request('http://www.example.com/')
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
print response.read()

Be careful with some headers, as the server checks for them:
User-Agent: some servers or proxies use this value to decide whether the request was sent by a browser.
Content-Type: when using a REST interface, the server checks this value to decide how the content in the HTTP body should be parsed. Common values are:
application/xml: used in XML RPC calls, e.g. RESTful/SOAP
application/json: used in JSON RPC calls
application/x-www-form-urlencoded: used when a browser submits a web form
When calling a RESTful or SOAP service provided by the server, a wrong Content-Type setting can cause the server to refuse the request; a sketch of setting it explicitly follows.
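
A minimal sketch of a JSON POST that sets Content-Type explicitly; the endpoint and payload are made up purely for illustration:

import urllib2

# Hypothetical endpoint and payload, for illustration only.
body = '{"name": "test"}'
request = urllib2.Request('http://www.example.com/api/items', data=body)
request.add_header('Content-Type', 'application/json')
response = urllib2.urlopen(request)
print response.read()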


4. Redirect

By default, urllib2 automatically follows redirects for HTTP 3XX return codes, without any manual configuration. To check whether a redirect has occurred, simply check whether the URL of the Response is the same as the URL of the Request.

The code is as follows:

import urllib2

# Both URLs are placeholders; the originals were lost. The second one
# should point at a short link (or any URL) that redirects.
my_url = 'http://www.example.com'
response = urllib2.urlopen(my_url)
redirected = response.geturl() == my_url
print redirected

my_url = 'http://www.example.com/b1UZuP'
response = urllib2.urlopen(my_url)
redirected = response.geturl() == my_url
print redirected

If you don't want automatic redirects, then besides using the lower-level httplib library, you can also customize the HTTPRedirectHandler class.

The code is as follows:

import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        print "301"
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        print "302"
        pass

opener = urllib2.build_opener(RedirectHandler)
# The URL is a placeholder for a link that redirects; the original was lost.
opener.open('http://www.example.com/b1UZuP')


5. Cookie

urllib2 handles cookies automatically as well. If you need to get the value of a particular cookie item, you can do the following:

The code is as follows:

import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value

After running it, the values of the cookies set when visiting Baidu are printed.

6. Using HTTP's PUT and DELETE methods

urllib2 only supports the HTTP GET and POST methods. If you want to use HTTP PUT or DELETE, you have to use the lower-level httplib library. Even so, we can still make urllib2 issue a PUT or DELETE request in the following way:

The code is as follows:

import urllib2

# uri and data are assumed to have been defined beforehand.
request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)

7. Get the HTTP return code

For 200 OK, you can get the HTTP return code with the getcode() method of the response object returned by urlopen(). For other return codes, however, urlopen raises an exception, and you need to check the code attribute of the exception object:

The code is as follows:

import urllib2

try:
    # The host is a placeholder; only the path '/why' survives from the original.
    response = urllib2.urlopen('http://www.example.com/why')
except urllib2.HTTPError, e:
    print e.code
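
For the 200 OK case mentioned above, a minimal sketch (the URL is a placeholder):

import urllib2

response = urllib2.urlopen('http://www.example.com')
print response.getcode()  # prints 200 on success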

8. Debug Log

When using urllib2, you can turn on the debug log as follows. The contents of incoming and outgoing packets are then printed to the screen for easy debugging, which can sometimes spare you the work of capturing packets.

The code is as follows:

import urllib2

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
# The URL is a placeholder; the original example URL was lost.
response = urllib2.urlopen('http://www.example.com')

You can then see the contents of the packets being transmitted.

9. Processing of forms

How do you fill out a form when you need to log in?
First, use a tool to intercept the content of the form you have to fill in.
For example, I usually use Firefox with the HttpFox plugin to see what packets I am actually sending.
Taking verycd as an example, first find the POST request you sent and its POST form entries.
You can see that verycd requires you to fill in username, password, continueURI, fk and login_submit, where fk is generated randomly (actually not all that randomly; it looks like it is produced by simply encoding the epoch time). It has to be obtained from the web page, which means you first have to visit the page and extract the fk item from the returned data with a regular expression or a similar tool (a sketch follows the code below). continueURI, as the name suggests, can be anything you like, while login_submit is fixed, as the page source shows. And username and password are obvious:

The code is as follows:

# -*- coding: utf-8 -*-
import urllib
import urllib2

postdata = urllib.urlencode({
    'username': 'Wang Xiaoguang',
    'password': 'why888',
    'continueURI': 'http://www.verycd.com/',
    'fk': '',
    'login_submit': 'login'
})
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin',  # host assumed; only '/signin' survived
    data = postdata
)
result = urllib2.urlopen(req)
print result.read()
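
As mentioned above, fk first has to be scraped from the login page. A minimal sketch using a regular expression; the page URL and the pattern are assumptions, since the actual form HTML is not shown here:

# -*- coding: utf-8 -*-
import re
import urllib2

# Hypothetical: fetch the login page and pull fk out of a hidden form field.
html = urllib2.urlopen('http://www.verycd.com/signin').read()
match = re.search(r"name=['\"]fk['\"][^>]*value=['\"]([^'\"]+)['\"]", html)
fk = match.group(1) if match else ''
print fk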

10. Masquerading as a browser

Some sites are so averse to crawlers that they reject all such requests.
At this point we need to masquerade as a browser, which can be done by modifying the headers in the HTTP packet:

The code is as follows:

#...

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',  # hosts assumed; only '/signin/*//' survived
    data = postdata,
    headers = headers
)

#...

11. Countering "anti-leeching"

Certain sites have so-called anti-leeching settings which, to put it bluntly, are very simple:
they check whether the Referer in the headers of the request you send points to their own site.
So we just need to change the Referer in the headers to that site. Using cnbeta as an example:
#...
headers = {
    'Referer': 'http://www.cnbeta.com/articles'
}
#...
headers is a dict data structure, and you can put in any header you like for a bit of disguise.
For example, some websites like to read the X-Forwarded-For header to find the visitor's real IP, so you can change X-Forwarded-For directly, as in the sketch below.
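
A minimal sketch combining the two ideas; the URL and the IP address are placeholders, purely for illustration:

import urllib2

headers = {
    'Referer': 'http://www.cnbeta.com/articles',
    'X-Forwarded-For': '1.2.3.4'  # placeholder IP
}
req = urllib2.Request(url='http://www.cnbeta.com/articles', headers=headers)
response = urllib2.urlopen(req)
print response.getcode()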