1. urllib module
A simple example of sending a request with the urllib module and reading the page content is as follows:
# Import the module
import urllib.request

# Open the page to be crawled
response = urllib.request.urlopen('http://www.baidu.com')
# Read the page source
html = response.read()
# Print what was read
print(html)
Results:
b'<!DOCTYPE html><!--STATUS OK-->\n\n\n \n \n <html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="referrer"><meta name="theme-color" content="#2932e1"><meta name="description" content="\xe5\x85\xa8\xe7\x90\x83\xe6\x9c\x80\xe5\xa4\xa7\xe7\x9a\x84\xe4\xb8\xad\xe6\x96\x87\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e\xe3\x80\x81\xe8\x87\xb4\xe5\x8a\x9b\xe4\xba\x8e\xe8\xae\xa9\xe7\xbd\x91\xe6\xb0\x91\xe6\x9b\xb4\xe4\xbe\xbf\xe6\x8d\xb7\xe5\x9c\xb0\xe8\x8e\xb7\xe5\x8f\x96\xe4\xbf\xa1\xe6\x81\xaf\xef\xbc\x8c\xe6\x89\xbe\xe5\x88\xb0\xe6\x89\x80\xe6\xb1\x82\xe3\x80\x82\xe7\x99\xbe\xe5\xba\xa6\xe8\xb6\x85\xe8\xbf\x87\xe5\x8d\x83\xe4\xba\xbf\xe7\x9a\x84\xe4\xb8\xad\xe6\x96\x87\xe7\xbd\x91\xe9\xa1\xb5\xe6\x95\xb0\xe6\x8d\xae\xe5\xba\x93\xef\xbc\x8c\xe5\x8f\xaf\xe4\xbb\xa5\xe7\x9e\xac\xe9\x97\xb4\xe6\x89\xbe\xe5\x88\xb0\xe7\x9b\xb8\xe5\x85\xb3\xe7\x9a\x84\xe6\x90\x9c\xe7\xb4\xa2\xe7\xbb\x93\xe6\x9e\x9c\xe3\x80\x82"><link rel="shortcut icon" href="/" rel="external nofollow" type="image/x-icon" /><link rel="search" type="application/opensearchdescription+xml" href="/" rel="external nofollow" title="\xe7\x99\xbe\xe5\xba\xa6\xe6\x90\x9c\xe7\xb4\xa2" /><link rel="icon" sizes="any" mask href="///img/baidu_85beaf5496f291521eb75ba38eacbd87.svg" rel="external nofollow" ><link rel="dns-prefetch" href="//" rel="external nofollow" /><link rel="dns-prefetch" href="//" rel="external nofollow" /><link rel="dns-prefetch" href="//" rel="external nofollow" /><link rel="dns-prefetch" href="//" rel="external nofollow" /><link rel="dns-prefetch" href="//" rel="external nofollow" /><link rel="dns-prefetch" href="//" rel="external nofollow" /><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title><style index="newi" type="text/css">#form 
.bdsug{top:39px}.bdsug{display:none;position:absolute;width:535px;background:#fff;border:1px solid … (output truncated)
The example above fetches the content of Baidu's home page with a GET request.
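Note that read() returns raw bytes, which is why the Chinese text in the output appears as \x escape sequences. Decoding with UTF-8 (the charset declared in the page's meta tag) recovers the readable text. A minimal sketch using the title bytes from the output above:

```python
# The page title from the output above, as raw UTF-8 bytes
raw = b'\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93'

# Decode the bytes into a Python string
text = raw.decode('utf-8')
print(text)  # -> 百度一下，你就知道
```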
The following example retrieves page information through a POST request with the urllib module:
# Import the modules
import urllib.request
import urllib.parse

# Encode the form data with urlencode, then convert it to utf-8 bytes
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
# Open the page to be crawled
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
html = response.read()
# Print what was read
print(html)
Results:
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Content-Length": "10", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "", \n "User-Agent": "Python-urllib/3.7", \n "X-Amzn-Trace-Id": "Root=1-5ec3f607-00f717e823a5c268fe0e0be8"\n }, \n "json": null, \n "origin": "123.139.39.71", \n "url": "/post"\n}\n'
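The Content-Length of 10 echoed in the headers is simply the length of the url-encoded form body. The encoding step can be checked on its own, with no network access:

```python
from urllib.parse import urlencode

# Encode the form dictionary into a query-string style body
payload = urlencode({'word': 'hello'})
print(payload)  # -> word=hello

# Convert to utf-8 bytes, as urlopen's data argument requires
data = bytes(payload, encoding='utf-8')
print(len(data))  # -> 10, matching the Content-Length header above
```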
2. urllib3 module
Sample code for sending network requests via urllib3 module:
# Import the module
import urllib3

# Create a PoolManager object, which handles connection pooling and thread safety
http = urllib3.PoolManager()
# Send a request for the page to be crawled
response = http.request('GET', 'https://www.baidu.com/')
# Print what was read
print(response.data)
Results:
b'<!DOCTYPE html><!--STATUS OK-->\r\n<html>\r\n<head>\r\n\t<meta http-equiv="content-type" content="text/html;charset=utf-8">\r\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title>\r\n\t<link href="/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/home/css/" rel="external nofollow" rel="stylesheet" type="text/css" />\r\n\t<!--[if lte IE 8]><style index="index" >#content{height:480px\\9}#m{top:260px\\9}</style><![endif]-->\r\n\t<!--[if IE 8]><style index="index" >#u1 ,#u1 :visited{font-family:simsun}</style><![endif]-->\r\n\t<script>var hashMatch = (/#+(.*wd=[^&].+)/);if (hashMatch && hashMatch[0] && hashMatch[1]) {("http://"++"/s?"+hashMatch[1]);} …………………………(Too many omissions)
The following implements a POST request to fetch the page content:
# Import the module
import urllib3

# Create a PoolManager object, which handles connection pooling and thread safety
http = urllib3.PoolManager()
# Send a request for the page to be crawled
response = http.request('POST', 'http://httpbin.org/post', fields={'word': 'hello'})
# Print what was read
print(response.data)
Results:
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Content-Length": "128", \n "Content-Type": "multipart/form-data; boundary=06ff68d7a4a22f600244a70bf9382ab2", \n "Host": "", \n "X-Amzn-Trace-Id": "Root=1-5ec3f8c3-9f33c46c1c1b37f6774b84f2"\n }, \n "json": null, \n "origin": "123.139.39.71", \n "url": "/post"\n}\n'
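Note the Content-Type in this result: unlike the urllib example, urllib3 sends fields= as multipart/form-data rather than application/x-www-form-urlencoded, which is also why the Content-Length grows to 128. The multipart encoding urllib3 performs can be inspected offline with its encode_multipart_formdata helper:

```python
from urllib3.filepost import encode_multipart_formdata

# Encode the same form fields the way request(..., fields=...) does
body, content_type = encode_multipart_formdata({'word': 'hello'})
print(content_type)  # multipart/form-data; boundary=...
print(body)          # the multipart body carrying the 'word' field
```

The boundary is random, so the exact body differs on every call; only its structure is stable.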
3. Requests module
Taking the GET request method as an example, the following code prints several pieces of the response information:
# Import the module
import requests

# Send a request for the page to be crawled
response = requests.get('http://www.baidu.com')
# Print the status code
print('status code:', response.status_code)
# Print the request url
print('url:', response.url)
# Print the header information
print('header:', response.headers)
# Print the cookie information
print('cookie:', response.cookies)
# Print the page source as text
print('text:', response.text)
# Print the page source as bytes
print('content:', response.content)
Results:
status code: 200
url: /
header: {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 19 May 2020 15:28:30 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:32 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.; path=/', 'Transfer-Encoding': 'chunked'}
cookie: <RequestsCookieJar[<Cookie BDORZ=27315 for ./>]>
text: <!DOCTYPE html> <!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8> … (the rest is omitted)
content: b'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8> … (the rest is omitted)
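Besides inspecting a finished response, requests can also show how it assembles a request before anything is sent: a Request object plus prepare() exposes the final method and URL, including any query parameters. A small sketch, using a hypothetical word=hello parameter against httpbin:

```python
import requests

# Build a request object without sending it over the network
req = requests.Request('GET', 'http://httpbin.org/get', params={'word': 'hello'})
prepared = req.prepare()

# Inspect what would actually go on the wire
print(prepared.method)  # -> GET
print(prepared.url)     # -> http://httpbin.org/get?word=hello
```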
Example of sending an HTTP request with the POST method:
# Import the module
import requests

# Form parameters
data = {'word': 'hello'}
# Send a request for the page to be crawled
response = requests.post('http://httpbin.org/post', data=data)
# Print the page source as bytes
print(response.content)
Results:
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello"\n }, \n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate", \n "Content-Length": "10", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "", \n "User-Agent": "python-requests/2.23.0", \n "X-Amzn-Trace-Id": "Root=1-5ec3fc97-965139d919e5a08e8135e731"\n }, \n "json": null, \n "origin": "123.139.39.71", \n "url": "/post"\n}\n'
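The server echoes the request back as a JSON document, so with requests the body can be parsed directly via response.json(), which is essentially json.loads applied to the raw bytes. A self-contained sketch using a trimmed copy of the output above:

```python
import json

# A trimmed copy of the JSON body shown above
body = b'{\n "form": {\n "word": "hello"\n }, \n "json": null\n}\n'

# response.json() does essentially this on response.content
parsed = json.loads(body)
print(parsed['form']['word'])  # -> hello
```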
This concludes the article.