
Multiple ways to implement HTTP network requests for a Python crawler

1. urllib module

A simple example of sending a request and reading the page content through the urllib module:

# Import the module
import urllib.request
# Open the page to be crawled (Baidu's home page in this example)
response = urllib.request.urlopen('http://www.baidu.com')
# Read the web page code
html = response.read()
# Print what was read
print(html)

Results:

b'<!DOCTYPE html><!--STATUS OK-->\n\n\n \n \n       <html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="referrer"><meta name="theme-color" content="#2932e1"><meta name="description" content="\xe5\x85\xa8\xe7\x90\x83\xe6\x9c\x80\xe5\xa4\xa7\xe7\x9a\x84\xe4\xb8\xad\xe6\x96\x87\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e\xe3\x80\x81\xe8\x87\xb4\xe5\x8a\x9b\xe4\xba\x8e\xe8\xae\xa9\xe7\xbd\x91\xe6\xb0\x91\xe6\x9b\xb4\xe4\xbe\xbf\xe6\x8d\xb7\xe5\x9c\xb0\xe8\x8e\xb7\xe5\x8f\x96\xe4\xbf\xa1\xe6\x81\xaf\xef\xbc\x8c\xe6\x89\xbe\xe5\x88\xb0\xe6\x89\x80\xe6\xb1\x82\xe3\x80\x82\xe7\x99\xbe\xe5\xba\xa6\xe8\xb6\x85\xe8\xbf\x87\xe5\x8d\x83\xe4\xba\xbf\xe7\x9a\x84\xe4\xb8\xad\xe6\x96\x87\xe7\xbd\x91\xe9\xa1\xb5\xe6\x95\xb0\xe6\x8d\xae\xe5\xba\x93\xef\xbc\x8c\xe5\x8f\xaf\xe4\xbb\xa5\xe7\x9e\xac\xe9\x97\xb4\xe6\x89\xbe\xe5\x88\xb0\xe7\x9b\xb8\xe5\x85\xb3\xe7\x9a\x84\xe6\x90\x9c\xe7\xb4\xa2\xe7\xbb\x93\xe6\x9e\x9c\xe3\x80\x82"><link rel="shortcut icon" href="/" rel="external nofollow" type="image/x-icon" /><link rel="search" type="application/opensearchdescription+xml" href="/" rel="external nofollow" title="\xe7\x99\xbe\xe5\xba\xa6\xe6\x90\x9c\xe7\xb4\xa2" /><link rel="icon" sizes="any" mask href="///img/baidu_85beaf5496f291521eb75ba38eacbd87.svg" rel="external nofollow" ><link rel="dns-prefetch" href="//" rel="external nofollow" /><link rel="dns-prefetch" href="//" rel="external nofollow" /><link rel="dns-prefetch" href="//" rel="external nofollow" /><link rel="dns-prefetch" href="//" rel="external nofollow" /><link rel="dns-prefetch" href="//" rel="external nofollow" /><link rel="dns-prefetch" href="//" rel="external nofollow" /><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title><style index="newi" type="text/css">#form .bdsug{top:39px}.bdsug{display:none;position:absolute;width:535px;background:#fff;border:1px solid 
… (output truncated)

The example above fetches the content of Baidu's home page with a GET request.
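
In practice, many sites reject requests that lack a browser-like User-Agent. Below is a minimal sketch of attaching custom headers with a Request object and decoding the returned bytes to text; the User-Agent string is an assumed, illustrative value:

# Import the module
import urllib.request
# Build a Request object so custom headers can be attached
# (the User-Agent string is an assumed, illustrative value)
request = urllib.request.Request('http://www.baidu.com',
                                 headers={'User-Agent': 'Mozilla/5.0'})
# Send the request and read the response body as bytes
response = urllib.request.urlopen(request)
html = response.read()
# Decode the bytes to text; Baidu's page declares utf-8
print(html.decode('utf-8'))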

The following example retrieves page information by sending a POST request through the urllib module:

# Import the modules
import urllib.request
import urllib.parse
# URL-encode the form data with urlencode, then convert it to bytes as utf-8
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
# Open the page to be crawled (httpbin.org echoes the request back as JSON)
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
# Read the response body
html = response.read()
# Print what was read
print(html)

Results:

b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Content-Length": "10", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.7", \n "X-Amzn-Trace-Id": "Root=1-5ec3f607-00f717e823a5c268fe0e0be8"\n }, \n "json": null, \n "origin": "123.139.39.71", \n "url": "http://httpbin.org/post"\n}\n'
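
Note that urlopen() raises an exception on failure instead of returning an error page, so a crawler usually wraps the call. Below is a minimal sketch, assuming the same httpbin.org endpoint and an arbitrary 5-second timeout:

# Import the modules
import urllib.request
import urllib.error
try:
    # The 5-second timeout is an assumed value for illustration
    response = urllib.request.urlopen('http://httpbin.org/post',
                                      data=b'word=hello', timeout=5)
    print(response.read())
except urllib.error.HTTPError as e:
    # The server answered with an error status code (4xx/5xx)
    print('HTTP error:', e.code)
except urllib.error.URLError as e:
    # The request never completed (DNS failure, refused connection, timeout)
    print('URL error:', e.reason)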

2. urllib3 module

Sample code for sending a network request through the urllib3 module:

# Import the module
import urllib3
# Create a PoolManager instance, which handles connection pooling and the details of thread safety
http = urllib3.PoolManager()
# Send a request for the page to be crawled
response = http.request('GET', 'http://www.baidu.com/')
# Print what was read
print(response.data)

Results:

b'<!DOCTYPE html><!--STATUS OK-->\r\n<html>\r\n<head>\r\n\t<meta http-equiv="content-type" content="text/html;charset=utf-8">\r\n\t<meta http-equiv="X-UA-Compatible" content="IE=Edge">\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<link rel="dns-prefetch" href="//" rel="external nofollow" />\r\n\t<title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title>\r\n\t<link href="/5eN1bjq8AAUYm2zgoY3K/r/www/cache/static/home/css/" rel="external nofollow" rel="stylesheet" type="text/css" />\r\n\t<!--[if lte IE 8]><style index="index" >#content{height:480px\\9}#m{top:260px\\9}</style><![endif]-->\r\n\t<!--[if IE 8]><style index="index" >#u1 ,#u1 :visited{font-family:simsun}</style><![endif]-->\r\n\t<script>var hashMatch = (/#+(.*wd=[^&].+)/);if (hashMatch && hashMatch[0] && hashMatch[1]) {("http://"++"/s?"+hashMatch[1]);}
… (output truncated)
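
response.data is a bytes object; urllib3 also exposes the status code, and the body can be decoded to text. Below is a minimal sketch; the timeout and retry values are assumed, illustrative settings:

# Import the module
import urllib3
http = urllib3.PoolManager()
# The timeout and retry values are assumed for illustration
response = http.request('GET', 'http://www.baidu.com/',
                        timeout=urllib3.Timeout(connect=2.0, read=5.0),
                        retries=urllib3.Retry(total=3))
# Check the HTTP status code before using the body
print('status:', response.status)
# Decode the raw bytes to text (Baidu's page declares utf-8)
print(response.data.decode('utf-8'))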

The POST request implementation for retrieving the page content with urllib3:

# Import the module
import urllib3
# Create a PoolManager instance, which handles connection pooling and the details of thread safety
http = urllib3.PoolManager()
# Send a POST request for the page to be crawled
response = http.request('POST', 'http://httpbin.org/post', fields={'word': 'hello'})
# Print what was read
print(response.data)

Results:

b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Content-Length": "128", \n "Content-Type": "multipart/form-data; boundary=06ff68d7a4a22f600244a70bf9382ab2", \n "Host": "httpbin.org", \n "X-Amzn-Trace-Id": "Root=1-5ec3f8c3-9f33c46c1c1b37f6774b84f2"\n }, \n "json": null, \n "origin": "123.139.39.71", \n "url": "http://httpbin.org/post"\n}\n'
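
Note in the result above that urllib3 encodes POST fields as multipart/form-data by default. If the server expects an ordinary urlencoded form instead, the multipart encoding can be switched off. A minimal sketch against the same httpbin.org endpoint:

# Import the module
import urllib3
http = urllib3.PoolManager()
# encode_multipart=False sends the fields as application/x-www-form-urlencoded
response = http.request('POST', 'http://httpbin.org/post',
                        fields={'word': 'hello'},
                        encode_multipart=False)
print(response.data)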

3. Requests module

Taking the GET request method as an example, the following code prints several pieces of information about the request and response.

# Import the module
import requests
# Send a request for the page to be crawled
response = requests.get('http://www.baidu.com')
# Print the status code
print('status code:', response.status_code)
# Print the request URL
print('url:', response.url)
# Print the header information
print('header:', response.headers)
# Print the cookie information
print('cookie:', response.cookies)
# Print the page source code as text
print('text:', response.text)
# Print the page source code as a byte stream
print('content:', response.content)

Results:

status code: 200
url: http://www.baidu.com/
header: {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 19 May 2020 15:28:30 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:32 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
cookie: <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
text: <!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8>
… (output truncated)
content: b'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8>
… (output truncated)
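
requests also accepts query parameters, custom headers, and a timeout directly as keyword arguments. Below is a minimal sketch; the httpbin.org/get endpoint, the User-Agent string, and the timeout are assumed, illustrative values:

# Import the module
import requests
# params is encoded into the query string automatically;
# the User-Agent and timeout values are assumed for illustration
response = requests.get('http://httpbin.org/get',
                        params={'word': 'hello'},
                        headers={'User-Agent': 'Mozilla/5.0'},
                        timeout=5)
# The final URL includes the encoded query string
print(response.url)
print(response.text)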

An example of sending an HTTP network request with the POST method:

# Import the module
import requests
# Form parameters
data = {'word': 'hello'}
# Send a request for the page to be crawled
response = requests.post('http://httpbin.org/post', data=data)
# Print the page source code as a byte stream
print(response.content)

Results:

b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello"\n }, \n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate", \n "Content-Length": "10", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "python-requests/2.23.0", \n "X-Amzn-Trace-Id": "Root=1-5ec3fc97-965139d919e5a08e8135e731"\n }, \n "json": null, \n "origin": "123.139.39.71", \n "url": "http://httpbin.org/post"\n}\n'
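
When a server expects a JSON body rather than form fields, requests can serialize a dict itself through the json keyword, and the response can be parsed back with response.json(). A minimal sketch against the same httpbin.org endpoint:

# Import the module
import requests
# json= serializes the dict and sets Content-Type: application/json
response = requests.post('http://httpbin.org/post', json={'word': 'hello'})
# Parse the echoed JSON response into a dict
result = response.json()
# httpbin.org returns the parsed request body under the "json" key
print(result['json'])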

This concludes the article.