The previous section gave a simple introduction to urllib2; this section collects some of the details of using urllib2.
1. Proxy settings
By default, urllib2 will use the environment variable http_proxy to set the HTTP Proxy.
If you want to explicitly control the Proxy in your program without being affected by environment variables, you can use ProxyHandler.
Create a new test14.py to implement a simple proxy demo:
import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})  # placeholder proxy address
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)
One detail to note here is that using urllib2.install_opener() sets the global opener for urllib2.
This is convenient later on, but it means you give up finer-grained control, for example if you want to use two different Proxy settings in the same program. A better practice is, instead of changing the global settings with install_opener, simply to call the opener's open() method in place of the global urlopen() method, as in the sketch below.
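For example, a minimal sketch of keeping two openers with different Proxy settings side by side (the proxy addresses and URL here are placeholders):

import urllib2

# Two openers with independent Proxy settings; neither touches the global opener
proxy_a = urllib2.ProxyHandler({"http": 'http://proxy-a.example.com:8080'})  # placeholder
proxy_b = urllib2.ProxyHandler({"http": 'http://proxy-b.example.com:8080'})  # placeholder

opener_a = urllib2.build_opener(proxy_a)
opener_b = urllib2.build_opener(proxy_b)

# Each request states explicitly which Proxy it goes through
response_a = opener_a.open('http://www.example.com')
response_b = opener_b.open('http://www.example.com')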
2. Timeout settings
In older versions of Python (pre-Python 2.6), the urllib2 API did not expose a timeout setting, and the only way to set the timeout value was to change the global timeout value of the socket.
import urllib2
import socket

socket.setdefaulttimeout(10)  # timeout after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # another way
As of Python 2.6, the timeout can be set directly via the timeout parameter of urllib2.urlopen():
import urllib2

response = urllib2.urlopen('http://www.google.com', timeout=10)
3. Include a specific Header in the HTTP Request.
To add the header, you need to use the Request object:
import urllib2

request = urllib2.Request('http://www.baidu.com/')
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
print response.read()
Pay special attention to some of the headers, as the server checks their values:
User-Agent : Some servers or proxies use this value to determine if the request is from a browser or not
Content-Type : When using the REST interface, the server checks this value to determine how the content in the HTTP Body should be parsed. Common values are:
application/xml : used in XML RPC, e.g. RESTful/SOAP calls
application/json : Used in JSON RPC calls.
application/x-www-form-urlencoded : Used when the browser submits a web form.
When using RESTful or SOAP services provided by the server, an incorrect Content-Type setting can cause the server to deny service.
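For example, a minimal sketch of POSTing JSON to a RESTful interface with Content-Type set explicitly (the URL and payload here are made up):

import urllib2

# Placeholder endpoint and JSON body; setting Content-Type tells the server
# how to parse the content in the HTTP Body
data = '{"id": 1}'
request = urllib2.Request('http://www.example.com/api', data=data)
request.add_header('Content-Type', 'application/json')
response = urllib2.urlopen(request)
print response.read()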
4. Redirect
By default, urllib2 automatically handles redirects for HTTP 3XX return codes, without manual configuration. To check whether a redirect has occurred, just check whether the URL of the Response is the same as the URL of the Request:
import urllib2

my_url = 'http://www.google.cn'
response = urllib2.urlopen(my_url)
redirected = response.geturl() == my_url
print redirected

my_url = 'http://rrurl.cn/b1UZuP'
response = urllib2.urlopen(my_url)
redirected = response.geturl() == my_url
print redirected
If you don't want automatic redirects, then besides using the lower-level httplib library, you can also customize the HTTPRedirectHandler class:
import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        print "301"
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        print "302"
        pass

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://rrurl.cn/b1UZuP')
5. Cookie
urllib2 also handles cookies automatically. If you need to get the value of a cookie item, you can do it like this:
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
After running it, the cookie values for the visit to Baidu are printed.
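If you only need one particular cookie item, you can filter by name. A minimal sketch, assuming the item of interest is called 'BAIDUID' (substitute whatever cookie name you actually need):

import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
opener.open('http://www.baidu.com')
for item in cookie:
    if item.name == 'BAIDUID':  # assumed item name; pick the cookie you need
        print item.value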
6. Using HTTP's PUT and DELETE methods
urllib2 only supports the HTTP GET and POST methods; if you want to use HTTP PUT or DELETE, you have to use the lower-level httplib library. Even so, we can have urllib2 issue a PUT or DELETE request in the following way:
import urllib2

request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)
7. Get the HTTP return code
For 200 OK, you can get the HTTP return code with the getcode() method of the response object returned by urlopen. But for other return codes, urlopen raises an exception; in that case, you need to check the code attribute of the exception object:
import urllib2

try:
    response = urllib2.urlopen('http://blog.csdn.net/why')
except urllib2.HTTPError, e:
    print e.code
8. Debug Log
When using urllib2, you can turn on the debug log in the following way; the contents of packets sent and received are then printed to the screen, which is convenient for debugging and can sometimes save you the work of capturing packets:
import urllib2

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')
This lets you see the contents of the packets being transmitted.
9. Processing of forms
How do you fill in a form when you need to log in?
First, use a tool to intercept the contents of the form you need to fill in.
For example, I usually use Firefox with the HttpFox plugin to see what packets I am actually sending.
Using verycd as an example, first find the POST request you send and the POST form entries.
You can see that verycd requires you to fill in username, password, continueURI, fk, and login_submit, of which fk is randomly generated (actually not that random; it looks like it is generated by simply encoding the epoch time) and has to be obtained from the web page. In other words, you must first visit the page and extract the fk item from the returned data with a regular expression or a similar tool. continueURI, as the name suggests, can be filled in arbitrarily, and login_submit is fixed, as can be seen from the page source. username and password are obvious:
# -*- coding: utf-8 -*-
import urllib
import urllib2

postdata = urllib.urlencode({
    'username': 'Wang Xiaoguang',
    'password': 'why888',
    'continueURI': 'http://www.verycd.com/',
    'fk': '',  # must be extracted from the page first
    'login_submit': 'login'
})
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin',
    data = postdata
)
result = urllib2.urlopen(req)
print result.read()
10. Masquerading as a browser visit
Some sites are so averse to crawlers that they reject every request.
At this point we need to masquerade as a browser, which can be done by modifying a header in the HTTP packet:
#...
headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata,
    headers = headers
)
#...
11. Dealing with "anti-hotlinking"
Some sites have so-called anti-hotlinking settings which, to put it bluntly, are very simple:
they check whether the Referer in the headers of the request you send points to their own site.
So we just need to change the Referer in the headers to that site. Using cnbeta as an example:
#...
headers = {
    'Referer':'http://www.cnbeta.com/articles'
}
#...
headers is a dict data structure; you can put in any header you want for disguise.
For example, some websites like to read the X-Forwarded-For header to see the client's real IP, so you can change X-Forwarded-For directly, as in the sketch below.
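A minimal sketch in the same fragment style as above (the IP address here is made up):

#...
headers = {
    'X-Forwarded-For': '8.8.8.8',  # made-up IP address, just for illustration
    'Referer': 'http://www.cnbeta.com/articles'
}
#...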