Urllib
Official documentation: https://docs.python.org/3/library/urllib.html
urllib provides a set of functions for manipulating URLs.
This article covers common uses of the Python urllib library. Without further ado, let's look at the details.
1. Reading cookies
import http.cookiejar as cj, urllib.request as request

cookie = cj.CookieJar()
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
2. Saving cookies to a file
import http.cookiejar as cj, urllib.request as request

filename = 'baidu_cookies.txt'
# MozillaCookieJar stores cookies in the Mozilla cookies.txt format
cookies = cj.MozillaCookieJar(filename)
handler = request.HTTPCookieProcessor(cookies)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookies.save(ignore_discard=True, ignore_expires=True)
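To read the saved cookies back later, the jar's load() method can be used. A minimal sketch, assuming the MozillaCookieJar format and the www.baidu.com address used above (the original article does not show this step):

import http.cookiejar as cj, urllib.request as request

cookies = cj.MozillaCookieJar()
# load() restores cookies written by save(); the flags mirror the save() call above
cookies.load('baidu_cookies.txt', ignore_discard=True, ignore_expires=True)
handler = request.HTTPCookieProcessor(cookies)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.status)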
3. Handling exceptions
The error module defines the URLError and HTTPError classes. The two classes are in a parent-child relationship: HTTPError is a subclass of URLError and additionally carries the HTTP error code. Both classes handle exceptions raised by the request module, and both have a reason attribute that records the cause of the exception.
Handling an exception with URLError:
from urllib import request, error

try:
    # request a page that does not exist so that an exception is raised
    response = request.urlopen('http://www.example.com/no-such-page.html')
except error.URLError as e:
    print(e.reason)
Handling an exception with HTTPError:
This class is specialized for handling exceptions from HTTP requests. An HTTP request returns a status code, so HTTPError has a code attribute. HTTP responses also contain headers, so HTTPError has a headers attribute as well. Since HTTPError inherits from URLError, it also has a reason attribute.
Code:
from urllib import request, error

try:
    response = request.urlopen('http://www.example.com/no-such-page.html')
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)
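Because HTTPError is a subclass of URLError, a common pattern is to catch HTTPError first and fall back to URLError. A minimal sketch, not from the original article, using a placeholder URL:

from urllib import request, error

try:
    response = request.urlopen('http://www.example.com/no-such-page.html')  # placeholder URL
except error.HTTPError as e:
    # HTTP-level failure: a status code and response headers are available
    print('HTTPError:', e.code, e.reason)
except error.URLError as e:
    # lower-level failure such as a DNS error or a refused connection
    print('URLError:', e.reason)
else:
    print('Request succeeded')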
4. Parsing links
The parse module in the urllib library provides a number of methods for parsing links.
The urlparse() method is specialized for parsing links, so let's look at the return value of this method:
from urllib.parse import urlparse

result = urlparse('http://www.example.com')
print(result)
The above code returns the result:
ParseResult(scheme='http', netloc='www.example.com', path='', params='', query='', fragment='')
The urlparse() method returns a ParseResult, which has six attributes: scheme, netloc, path, params, query, and fragment. scheme is the protocol (http, https, ftp, and so on), netloc is the domain name of the website, path is the path of the page being accessed, params holds the parameters, query is the query string, and fragment is the anchor.
How does the urlparse() method map a link onto these six attributes?
Moving on to the next piece of code:
from urllib.parse import urlparse

result = urlparse('http://www.example.com/;user=bigdata17?id=10#content')
print(result)
The results of the run are as follows:
ParseResult(scheme='http', netloc='www.example.com', path='/', params='user=bigdata17', query='id=10', fragment='content')
As you can see, everything before :// is the scheme. From :// up to the next / is the netloc, i.e. the domain. From that / up to the ; semicolon is the path, the path of the page being accessed. From the ; semicolon up to the ? question mark are the params. From the ? question mark up to the # hash sign is the query string. Finally comes the fragment, the anchor.
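Since ParseResult is a named tuple, the six parts can also be read back individually, by attribute or by index. A small sketch (not in the original), reusing the placeholder URL from above:

from urllib.parse import urlparse

result = urlparse('http://www.example.com/;user=bigdata17?id=10#content')

# access by attribute...
print(result.scheme, result.netloc, result.path)
# ...or by index, since ParseResult behaves like a tuple
print(result[0], result[1], result[2])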
5. urlopen() method
This method returns an HTTPResponse object:
import urllib.request as request

response = request.urlopen('http://www.example.com')
print(response)

<http.client.HTTPResponse object at 0x000002A9655BBF28>
The HTTPResponse object has read(), getheaders(), and other methods.
The read() method allows you to read information from a web page:
import urllib.request as request

response = request.urlopen('http://www.example.com')
print(response.read().decode('utf-8'))
When using this method, pay attention to the encoding used by the site and pass it to decode(), otherwise the output will be garbled. For example, Baidu uses utf-8 while NetEase uses gbk.
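Instead of hard-coding the charset, it can usually be read from the Content-Type response header. A minimal sketch (not in the original), with a fallback to utf-8 and a placeholder URL:

import urllib.request as request

response = request.urlopen('http://www.example.com')  # placeholder URL
# get_content_charset() reads the charset declared in the Content-Type header, if any
charset = response.headers.get_content_charset() or 'utf-8'
print(response.read().decode(charset))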
The getheaders() method returns the response headers of the page:
import urllib.request as request

response = request.urlopen('http://www.example.com')
print(response.getheaders())
Results:
[('Server', 'nginx/1.12.2'), ('Date', 'Mon, 12 Nov 2018 15:45:22 GMT'), ('Content-Type', 'text/html'), ('Content-Length', '38274'), ('Last-Modified', 'Thu, 08 Nov 2018 00:35:52 GMT'), ('Connection', 'close'), ('ETag', '"5be384e8-9582"'), ('Accept-Ranges', 'bytes')]
Now let's look at the parameters the urlopen() method accepts:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Here url is the only mandatory parameter; the rest are optional. data is used to send data to the site we want to crawl, such as a username, password or verification code; timeout sets the request timeout.
Usage of the data parameter:
>>> import urllib.parse as parse
>>> import urllib.request as request
>>> data = bytes(parse.urlencode({'username': 'bigdata17'}), encoding='utf8')
>>> print(data)
b'username=bigdata17'
>>> response = request.urlopen('http://httpbin.org/post', data=data)
>>> print(response.read())
b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "username": "bigdata17"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Connection": "close", \n    "Content-Length": "18", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.7"\n  }, \n  "json": null, \n  "origin": "183.134.52.58", \n  "url": "http://httpbin.org/post"\n}\n'
When sending data through the data parameter, the dictionary must be encoded with urlencode() and then converted to bytes.
If urlopen() is called without the data parameter, the request is sent with the GET method; with data, it is sent with the POST method. When using POST, make sure the target URL actually handles POST requests (the URL crawled here is http://httpbin.org/post, where /post handles the data we send through the data parameter); otherwise you will get HTTP Error 404: NOT FOUND.
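For comparison, here is a minimal sketch of the same call without data, which therefore goes out as a GET; it assumes httpbin.org's /get echo endpoint, which the original example does not use:

import urllib.request as request

# without a data argument, urlopen() issues a GET request
response = request.urlopen('http://httpbin.org/get')
print(response.read().decode('utf-8'))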
Usage of the timeout parameter:
This parameter sets the request timeout so that the crawler does not wait indefinitely when the network fails or the server misbehaves:
import urllib.request as request

response = request.urlopen('http://www.example.com', timeout=1)
print(response.read())
If timeout is set to 0.01, the following error is reported:
socket.timeout: timed out

During handling of the above exception, another exception occurred:

...
urllib.error.URLError: <urlopen error timed out>
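To keep the crawler from crashing on a timeout, the exception can be caught. A minimal sketch (not in the original), using a placeholder URL; urlopen() surfaces the low-level socket.timeout as the reason of a URLError:

import socket
from urllib import request, error

try:
    response = request.urlopen('http://www.example.com', timeout=0.01)
except error.URLError as e:
    # the timeout shows up as the reason attribute of the URLError
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')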
Setting request headers:
Request headers normally carry browser information, and many websites use them to decide whether a request comes from a normal browser or from a crawler. Here is how to set the headers for a crawler:
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'bigdata17'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
Setting up the proxy:
If one IP visits a website too often, anti-crawler measures may block it. We can set a proxy with the ProxyHandler class provided by urllib:
import urllib.request

# proxy address and credentials are placeholders
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
# This time, rather than install the OpenerDirector, we use it directly:
opener.open('https://www.douban.com/login?alias=&redir=https%3A%2F%2Fwww.douban.com%2F&source=index_nav&error=1001')
Summary
That is all for this article. I hope it is of some reference value for your study or work. If you have any questions, feel free to leave a comment. Thank you for your support.