
Python crawler tools: using the requests library (with comprehensive web-scraping examples)

Requests Library

Install it with pip:
pip install requests
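A quick way to confirm the installation worked is to import the library and print its version (a minimal check, nothing site-specific):

import requests

# If this runs without an ImportError, requests is installed
print(requests.__version__)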

Basic requests

req = ("/")
req = ("/")
req = ("/")
req = ("/")
req = ("/")
req = (/)
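Each of these calls returns a requests.Response object. As a quick check (using example.com as a stand-in URL, since the original target is elided), you can inspect the status code:

import requests

req = requests.get("https://example.com/")  # example.com is a stand-in URL
print(req.status_code)  # 200 means the request succeeded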

1. GET requests

The parameters are passed as a dictionary, and we can also pass JSON-style parameters:

import requests
from fake_useragent import UserAgent  # request-header library

headers = {"User-Agent": UserAgent().random}  # get a random request header
url = "/s"  # search URL (domain elided in the original)
params = {
    "wd": "Douban"  # query-string parameter
}

requests.get(url, headers=headers, params=params)

[screenshot of the output]

The request returns a Response object rather than the page itself, so we read its text attribute to get the content:

# GET request

import requests
from fake_useragent import UserAgent

headers = {"User-Agent": UserAgent().random}
url = "/s"
params = {
    "wd": "Douban"
}

response = requests.get(url, headers=headers, params=params)
print(response.text)  # the page content as a string

2. POST requests

The arguments are likewise a dictionary, and JSON-style parameters can be passed as well:

import requests
from fake_useragent import UserAgent

headers = {"User-Agent": UserAgent().random}

url = "/index/login/login"  # login URL (domain elided in the original)
params = {
    "user": "1351351335",  # account number
    "password": "123456"   # password
}

response = requests.post(url, headers=headers, data=params)

[screenshot of the output]

Because this needs a login page and I used an arbitrary URL here without actually logging in, the result looks like this; if you want to test a real login, find a page with a login form and try it there.
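If you don't have a login form handy, one way to see exactly what a POST sends is to aim it at an echo service; the sketch below uses httpbin.org/post (my stand-in, not a URL from this article), which returns the form fields it received:

import requests
from fake_useragent import UserAgent

headers = {"User-Agent": UserAgent().random}

# httpbin.org/post echoes back the data it receives (stand-in test URL)
response = requests.post(
    "https://httpbin.org/post",
    headers=headers,
    data={"user": "1351351335", "password": "123456"},
)
print(response.json()["form"])  # the form fields the server received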

3. Proxies

Proxies are often used when scraping to avoid getting your IP blocked, and requests supports them through the proxies parameter.

# IP proxy

import requests
from fake_useragent import UserAgent

headers = {"User-Agent": UserAgent().random}
url = "/get"  # URL that returns the current IP (domain elided in the original)

proxies = {
    "http": "http://yonghuming:mima@IP:8088"  # format: http://username:password@IP:port
    # "http": "http://182.145.31.211:4224"    # or just IP:port for a proxy without auth
}

requests.get(url, headers=headers, proxies=proxies)

Proxy IPs can be found on free proxy sites such as Kuaidaili ("fast proxy"), or purchased.
The /get URL above returns the information of the current request, including the IP the server sees:
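To confirm traffic is really going through the proxy, you can compare the IP the server reports with and without the proxies argument; a sketch assuming httpbin.org/ip as the echo endpoint (my substitution) and the sample proxy address from above:

import requests

check_url = "https://httpbin.org/ip"  # stand-in endpoint that returns the caller's IP

direct_ip = requests.get(check_url).json()["origin"]
proxies = {
    "http": "http://182.145.31.211:4224",   # sample proxy from above; substitute a live one
    "https": "http://182.145.31.211:4224",
}
proxied_ip = requests.get(check_url, proxies=proxies).json()["origin"]
print(direct_ip, proxied_ip)  # differing IPs mean the proxy took effect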

[screenshot of the output]

4. Set access timeout

A timeout can be set via the timeout parameter; if no response arrives within that many seconds, an error is raised.

# Set an access timeout
requests.get("/", timeout=0.1)

[screenshot of the timeout error]

5. Certificate problems (SSLError)

Disabling SSL certificate verification:

import requests
from fake_useragent import UserAgent  # request-header library

url = "https:///index/"  # a site whose certificate fails verification (domain elided in the original)
headers = {"User-Agent": UserAgent().random}  # get a random request header

requests.packages.urllib3.disable_warnings()  # silence the insecure-request warning
response = requests.get(url, verify=False, headers=headers)
response.encoding = "utf-8"  # set the encoding so Chinese text displays correctly

[screenshot of the output]

6. Saving cookies automatically

import requests
from fake_useragent import UserAgent

headers = {"User-Agent": UserAgent().chrome}
login_url = "/index/login/login"  # login URL (domain elided in the original)
params = {
    "user": "yonghuming",  # username
    "password": "123456"   # password
}
session = requests.Session()  # a Session keeps cookies across requests

# Log in through the session, not through requests directly,
# so that the cookies it sets are saved
response = session.post(login_url, headers=headers, data=params)

info_url = "/index/"  # page that requires the login cookies
resp = session.get(info_url, headers=headers)

Since I'm not using a page that actually requires an account and password here, it shows this:

[screenshot of the response]

Next I tried a Zhihuishu ("Tree of Wisdom") login page:

# cookies

import requests
from fake_useragent import UserAgent

headers = {"User-Agent": UserAgent().chrome}
login_url = "/login?service=/login/gologin"  # login URL (domain elided in the original)
params = {
    "user": "12121212",   # username
    "password": "123456"  # password
}
session = requests.Session()  # a Session keeps cookies across requests

# Log in through the session so the cookies are saved
response = session.post(login_url, headers=headers, data=params)

info_url = "/#/stdetInex"  # page reachable after login (path as in the original)
resp = session.get(info_url, headers=headers)
resp.encoding = "utf-8"

[screenshot of the response]

7. Obtaining response information

Code                        Meaning
response.json()             the response content as a JSON object
response.text               the response content as a string
response.content            the response content as bytes
response.headers            the response headers
response.url                the requested URL
response.encoding           the page encoding
response.request.headers    the request headers
response.cookies            the cookies
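To see several of these in one place, here is a small sketch that makes one request and prints the corresponding attributes (httpbin.org/get is a stand-in URL):

import requests

response = requests.get("https://httpbin.org/get")  # stand-in URL

print(response.url)                            # the address that was requested
print(response.encoding)                       # the declared page encoding (may be None)
print(response.headers["Content-Type"])        # a response header
print(response.request.headers["User-Agent"])  # a request header
print(response.json()["url"])                  # the body parsed as JSON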

This concludes the article on using the requests library for Python crawlers (with comprehensive web-scraping examples). For more on using requests in Python crawlers, please search my earlier posts or continue browsing the related articles below, and I hope you will keep supporting me!