
Summary of common urllib library methods for Python crawlers

Urllib

Official documentation: https://docs.python.org/3/library/urllib.html

urllib provides a set of functions for manipulating URLs.

This article summarizes common usage of Python's urllib library. Without further ado, let's look at the details.

1. Reading cookies

import http.cookiejar as cj
import urllib.request as request

cookie = cj.CookieJar()
handler = request.HTTPCookieProcessor(cookie)

opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')

for item in cookie:
    print(item.name + "=" + item.value)

2. Saving cookies to a file

import http.cookiejar as cj
import urllib.request as request

filename = 'baidu_cookies.txt'
cookies = cj.MozillaCookieJar(filename)
handler = request.HTTPCookieProcessor(cookies)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookies.save(ignore_discard=True, ignore_expires=True)
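
The cookies written to baidu_cookies.txt can later be read back with MozillaCookieJar's load() method and attached to a new opener. A minimal sketch, assuming the file was produced by the code above:

import http.cookiejar as cj
import urllib.request as request

# load the cookies saved above back from baidu_cookies.txt
cookies = cj.MozillaCookieJar()
cookies.load('baidu_cookies.txt', ignore_discard=True, ignore_expires=True)
handler = request.HTTPCookieProcessor(cookies)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.status)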

3. Handling exceptions

The request module's exceptions are handled by the URLError and HTTPError classes. The two are in a parent-child relationship: HTTPError is a subclass of URLError and additionally carries the HTTP error code. Both classes have a reason attribute that records the cause of the exception.
Handling exceptions with URLError:

from urllib import request, error

try:
    # placeholder URL pointing at a page that does not exist, so an error is raised
    response = request.urlopen('http://www.example.com/nonexistent.html')
except error.URLError as e:
    print(e.reason)

Handling exceptions with HTTPError:

This class is specialized in handling exceptions for HTTP requests. An HTTP request returns a status code, so HTTPError has a code attribute. HTTP responses also carry headers, so HTTPError has a headers attribute as well. Since HTTPError inherits from URLError, it also has a reason attribute.

Code:

try:
    # the same placeholder URL that triggers an HTTP error
    response = request.urlopen('http://www.example.com/nonexistent.html')
except error.HTTPError as e:
    print(e.code)
    print(e.headers)
    print(e.reason)
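
Because HTTPError is a subclass of URLError, a common pattern is to catch HTTPError first and fall back to URLError for everything else. A short sketch, again using a placeholder URL:

from urllib import request, error

try:
    response = request.urlopen('http://www.example.com/nonexistent.html')  # placeholder URL
except error.HTTPError as e:
    # the server answered, but with an error status code
    print('HTTPError:', e.code, e.reason)
except error.URLError as e:
    # the request never reached a server (DNS failure, connection refused, ...)
    print('URLError:', e.reason)
else:
    print('Request succeeded')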

4. Parsing links

The parse module in the urllib library provides a number of methods for parsing links.

The urlparse() method is used specifically for parsing links. Let's look at its return value:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com')
print(result)

The above code returns the result:

ParseResult(scheme='http', netloc='www.baidu.com', path='', params='', query='', fragment='')

The urlparse() method returns a ParseResult, which has six attributes: scheme, netloc, path, params, query and fragment. scheme is the protocol (http, https, ftp, etc.); netloc is the domain name of the website; path is the path of the page being accessed; params are the URL parameters; query is the query string; fragment is the anchor.

How does the urlparse() method map a link onto these six attributes?
Let's look at another piece of code:

from urllib.parse import urlparse

# the domain below is a placeholder for the site being parsed
result = urlparse('http://www.example.com/;user=bigdata17?id=10#content')
print(result)

The results of the run are as follows:

ParseResult(scheme='http', netloc='www.example.com', path='/', params='user=bigdata17', query='id=10', fragment='content')

As you can see, everything before :// is the scheme. From :// up to the next / is netloc, the domain. From that / up to the ; semicolon is path, the path of the page being accessed. From the ; up to the ? is params. From the ? question mark up to the # sign is query, the query string. Whatever follows the # is fragment, the anchor.
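
Since ParseResult is a named tuple, its six parts can be read by attribute or by index, and urlunparse() can put them back together. A short sketch using the same example link (the domain is a placeholder):

from urllib.parse import urlparse, urlunparse

result = urlparse('http://www.example.com/;user=bigdata17?id=10#content')
print(result.scheme, result.netloc, result.path)  # access by attribute
print(result[0], result[1], result[2])            # access by index, same values
print(urlunparse(result))                         # rebuild the original link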

5. urlopen() method

This method returns an HTTPResponse object:

import urllib.request as request

response = request.urlopen('http://www.baidu.com')
print(response)

<http.client.HTTPResponse object at 0x000002A9655BBF28>

The HTTPResponse object has read(), getheaders() and other methods.

The read() method allows you to read information from a web page:

import urllib.request as request

response = request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

When using this method, pay attention to the encoding used by the site and pass it to the decode() method, otherwise the output will be garbled. For example, Baidu uses utf-8 while NetEase uses gbk.
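
If you do not know a site's encoding in advance, the charset declared in the Content-Type response header can be used instead of hard-coding it. A small sketch, falling back to utf-8 when no charset is declared:

import urllib.request as request

response = request.urlopen('http://www.baidu.com')
# get_content_charset() reads the charset from the Content-Type header
charset = response.headers.get_content_charset() or 'utf-8'
print(response.read().decode(charset))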

The getheaders() method returns the response headers of the page:

import urllib.request as request

response = request.urlopen('http://www.baidu.com')
print(response.getheaders())

Results:

[('Server', 'nginx/1.12.2'), ('Date', 'Mon, 12 Nov 2018 15:45:22 GMT'), ('Content-Type', 'text/html'), ('Content-Length', '38274'), ('Last-Modified', 'Thu, 08 Nov 2018 00:35:52 GMT'), ('Connection', 'close'), ('ETag', '"5be384e8-9582"'), ('Accept-Ranges', 'bytes')]
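
Besides getheaders(), a single header can be read with getheader(), and the status code is available as the status attribute. A brief sketch:

import urllib.request as request

response = request.urlopen('http://www.baidu.com')
print(response.status)                                 # e.g. 200
print(response.getheader('Server'))                    # one header value, e.g. nginx/1.12.2
print(response.getheader('X-Missing', 'not present'))  # default used when the header is absent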

Next, let's look at the parameters the urlopen() method accepts:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Here url is the only mandatory parameter. data is used to send data (such as a username, password or verification code) to the site being crawled, and timeout sets the request timeout.

Usage of the data parameter:

>>> import urllib.parse as parse
>>> import urllib.request as request
>>> data = bytes(parse.urlencode({'username': 'bigdata17'}), encoding='utf8')
>>> print(data)
b'username=bigdata17'
>>> response = request.urlopen('http://httpbin.org/post', data=data)
>>> print(response.read())
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "username
": "bigdata17"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n
"Connection": "close", \n "Content-Length": "18", \n "Content-Type": "appl
ication/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent":
 "Python-urllib/3.7"\n }, \n "json": null, \n "origin": "183.134.52.58", \n
"url": "http://httpbin.org/post"\n}\n'

When sending data through the data parameter, the dictionary must be converted with the urlencode() method and then encoded into the bytes type.

If the data parameter is not passed, urlopen() sends the request with the GET method; if data is passed, it sends the request with the POST method. When using POST, make sure the target URL actually accepts POST requests (in this example the URL is httpbin.org/post, which handles the data sent via the data parameter); otherwise the server returns HTTP Error 404: NOT FOUND.
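
To make the difference concrete, here is a small sketch that sends the same parameters once as a GET query string and once as POST data, assuming httpbin.org as the test host:

import urllib.parse as parse
import urllib.request as request

params = parse.urlencode({'username': 'bigdata17'})

# no data argument: the request goes out as GET, parameters belong in the URL
get_response = request.urlopen('http://httpbin.org/get?' + params)
print(get_response.read().decode('utf-8'))

# with data: the request goes out as POST, parameters travel in the request body
post_response = request.urlopen('http://httpbin.org/post', data=params.encode('utf8'))
print(post_response.read().decode('utf-8'))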

Usage of the timeout parameter:

This parameter sets the request timeout so that the crawler does not wait indefinitely when there is a network failure or the server is unresponsive:

import urllib.request as request

response = request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

If timeout is set to 0.01, the following error is reported:

socket.timeout: timed out

During handling of the above exception, another exception occurred:
...
urllib.error.URLError: <urlopen error timed out>
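
This error can be caught so that the crawler simply skips a slow page instead of crashing. A sketch, assuming the same test URL; urlopen wraps the low-level socket timeout in a URLError:

import socket
from urllib import request, error

try:
    response = request.urlopen('http://httpbin.org/get', timeout=0.01)
except error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('the request timed out')
    else:
        raise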

Setting the request headers:

Request headers usually carry browser information, and many websites use them to decide whether a request comes from a normal browser or from a crawler. Here is how to set the crawler's request headers:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'bigdata17'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
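
Headers can also be added to a Request object one at a time with add_header(), which is equivalent to the headers dictionary above. A short sketch, again assuming httpbin.org as the test host:

from urllib import request, parse

data = bytes(parse.urlencode({'name': 'bigdata17'}), encoding='utf8')
req = request.Request(url='http://httpbin.org/post', data=data, method='POST')
# add_header() sets one header at a time on the Request object
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))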

Setting up the proxy:

If one IP visits a website too frequently, it may be blocked by the site's anti-crawler measures. We can set a proxy with the ProxyHandler class provided by urllib:

import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')

opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
# This time, rather than install the OpenerDirector, we use it directly:
opener.open('https://accounts.douban.com/login?alias=&redir=https%3A%2F%2Fwww.douban.com%2F&source=index_nav&error=1001')
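
The comment above means the opener is used directly through opener.open(). If install_opener() is called instead, the opener becomes the global default and plain urlopen() calls will also go through the proxy. A minimal sketch with a hypothetical proxy address:

import urllib.request

# hypothetical proxy address; replace with a real proxy
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
opener = urllib.request.build_opener(proxy_handler)

# install the opener globally so that every urlopen() call uses the proxy
urllib.request.install_opener(opener)
response = urllib.request.urlopen('http://httpbin.org/get')
print(response.read().decode('utf-8'))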

Summary

That is all for this article. I hope it provides some reference value for your study or work. If you have any questions, feel free to leave a comment. Thank you for your support.