Preliminaries
In the browser's developer tools, inspect the request the official KFC website sends: the request method is POST, and the request header

X-Requested-With: XMLHttpRequest

confirms that the KFC official website loads its data via ajax.
With these two preparatory steps done, the goal of this crawl is clear: send an ajax POST request to the KFC official website and fetch the first 10 pages of Shanghai KFC locations.
Analysis
To get the first 10 pages of Shanghai KFC locations, we first need to analyze the url of each page.
first page

```
# page1
# /kfccda/ashx/?op=cname
# POST
# cname: Shanghai
# pid:
# pageIndex: 1
# pageSize: 10
```

second page

```
# page2
# /kfccda/ashx/?op=cname
# POST
# cname: Shanghai
# pid:
# pageIndex: 2
# pageSize: 10
```
Page 3 and so on.
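Since only pageIndex changes from page to page, the parameters for any page can be generated programmatically. A minimal sketch (the helper name build_params is mine, not from the site):

```python
# Build the POST parameters for a given page; only pageIndex varies.
def build_params(page):
    return {
        'cname': 'Shanghai',
        'pid': '',
        'pageIndex': page,
        'pageSize': '10',
    }

print(build_params(1))
print(build_params(2))
```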
Program entry
First, review the basic urllib crawling workflow:

```python
# Use urllib to fetch the source code of the Baidu home page
import urllib.request

# 1. Define a url, i.e. the address you want to visit
url = 'http://www.baidu.com'

# 2. Simulate the browser sending a request to the server
response = urllib.request.urlopen(url)

# 3. Get the page source code from the response.
#    The read method returns binary data in byte form;
#    decode converts the bytes (binary -> string) into a string.
content = response.read().decode('utf-8')

# 4. Print the data
print(content)
```
- Define a url, i.e. the address you want to visit
- Simulate the browser sending a request to the server, receiving a response
- Get the page source code from the response content
```python
if __name__ == '__main__':
    start_page = int(input('Please enter the starting page number: '))
    end_page = int(input('Please enter the ending page number: '))
    for page in range(start_page, end_page + 1):
        # Customize the request object
        request = create_request(page)
        # Get the web page source code
        content = get_content(request)
        # Download the data
        down_load(page, content)
```
Correspondingly, we define each of these helper functions below.
URL composition: locating the data
The key to crawling is finding the right interface. In this case, the Preview tab of the matching request in the developer tools shows json data, which tells us this is the data we want.
Constructing the url
It's not hard to spot the common prefix shared by the KFC official website urls, which we save as base_url:

```python
base_url = '/kfccda/ashx/?op=cname'
```
Parameters
As usual, we look for a pattern: only pageIndex changes with the page number.
```python
data = {
    'cname': 'Shanghai',
    'pid': '',
    'pageIndex': page,
    'pageSize': '10'
}
```
POST request
- The parameters of a POST request must be urlencoded, and the resulting string must then be converted to bytes by calling encode.
- The parameters of a POST request are not spliced into the url; they are passed to the request object through its data argument.

So we encode the data:

```python
data = urllib.parse.urlencode(data).encode('utf-8')
```
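To see what the two-step encoding produces, here is a quick check: urlencode turns the dict into a percent-encoded query string, and encode turns that string into bytes:

```python
import urllib.parse

data = {'cname': 'Shanghai', 'pid': '', 'pageIndex': 1, 'pageSize': '10'}

# Step 1: urlencode -> query string (str)
encoded = urllib.parse.urlencode(data)
print(encoded)      # cname=Shanghai&pid=&pageIndex=1&pageSize=10

# Step 2: encode -> bytes, the form urlopen expects for a POST body
body = encoded.encode('utf-8')
print(type(body))   # <class 'bytes'>
```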
Getting the request headers (a countermeasure against anti-crawling)

That is, the UA portion of the request headers.

User-Agent is a special string header that lets the server identify the client's operating system and version, CPU type, browser and version, browser kernel, rendering engine, browser language, browser plug-ins, and so on.
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38'
}
```
Request object customization
Once the parameters, base_url, and request headers are ready, we can customize the request object:

```python
request = urllib.request.Request(base_url, headers=headers, data=data)
```
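One consequence of passing data is worth noting: urllib infers the HTTP method from it, so a Request with a data payload defaults to POST. A small sketch (the absolute url here is a placeholder for illustration, since the article's base_url is a relative fragment):

```python
import urllib.request

# Placeholder absolute url, for illustration only.
url = 'http://example.com/kfccda/ashx/?op=cname'
body = b'cname=Shanghai&pid=&pageIndex=1&pageSize=10'
headers = {'User-Agent': 'Mozilla/5.0'}

request = urllib.request.Request(url, headers=headers, data=body)
print(request.get_method())           # POST -- data present implies POST

request_no_body = urllib.request.Request(url, headers=headers)
print(request_no_body.get_method())   # GET -- no data implies GET
```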
Get the web page source code
This step takes the request as a parameter and simulates the browser sending it to the server to get a response:

```python
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
```
Get the source code of the page in the response and download the data
We use the read() method, which returns binary data in byte form, and then call decode to convert it to a string:

```python
content = response.read().decode('utf-8')
```
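As a small illustration of the byte-to-string conversion (the payload below is made up, standing in for what response.read() would return):

```python
# A byte string standing in for the binary data returned by response.read()
raw = b'{"Table":[{"rowcount":100}]}'

# decode converts the bytes to a str, ready to inspect or write to a text file
content = raw.decode('utf-8')
print(type(content))   # <class 'str'>
```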
Then we write the downloaded data to a file. With the with open() as fp syntax, the system closes the file automatically:

```python
def down_load(page, content):
    with open('kfc_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)
```
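A quick round-trip check of this write pattern, using a temporary directory and made-up sample content (the folder parameter is added here only so the sketch doesn't write into the working directory):

```python
import os
import tempfile

def down_load(page, content, folder):
    # Same pattern as above, parameterized with a target folder.
    path = os.path.join(folder, 'kfc_' + str(page) + '.json')
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write(content)
    return path

with tempfile.TemporaryDirectory() as tmp:
    sample = '{"Table":[{"rowcount":100}]}'   # stand-in for real response data
    path = down_load(1, sample, tmp)
    with open(path, encoding='utf-8') as fp:
        round_trip = fp.read()
print(os.path.basename(path))   # kfc_1.json
```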
Full code
```python
# ajax POST request to the KFC official website:
# get the first 10 pages of KFC locations in Shanghai.
# page1
# /kfccda/ashx/?op=cname
# POST
# cname: Shanghai
# pid:
# pageIndex: 1
# pageSize: 10
# page2
# /kfccda/ashx/?op=cname
# POST
# cname: Shanghai
# pid:
# pageIndex: 2
# pageSize: 10

import urllib.request
import urllib.parse


def create_request(page):
    base_url = '/kfccda/ashx/?op=cname'
    data = {
        'cname': 'Shanghai',
        'pid': '',
        'pageIndex': page,
        'pageSize': '10'
    }
    data = urllib.parse.urlencode(data).encode('utf-8')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38'
    }
    request = urllib.request.Request(base_url, headers=headers, data=data)
    return request


def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content


def down_load(page, content):
    with open('kfc_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)


if __name__ == '__main__':
    start_page = int(input('Please enter the starting page number: '))
    end_page = int(input('Please enter the ending page number: '))
    for page in range(start_page, end_page + 1):
        # Customize the request object
        request = create_request(page)
        # Get the web page source code
        content = get_content(request)
        # Download the data
        down_load(page, content)
```
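Once the pages are saved, the json files can be parsed with the standard json module. A hedged sketch: the field names Table1 and storeName below are assumptions about the response shape for illustration, not confirmed by this article.

```python
import json

# Hypothetical response snippet; the real field names may differ.
sample = ('{"Table":[{"rowcount":2}],'
          '"Table1":[{"storeName":"Store A"},{"storeName":"Store B"}]}')

data = json.loads(sample)
# Pull the store names out of the (assumed) Table1 list.
names = [store['storeName'] for store in data.get('Table1', [])]
print(names)
```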
Crawling results
Take a bow!!! I also crawled photos of Lisa; leave a comment if you'd like to see that crawler's code!
That's all for this Python practice article on crawling the KFC official website. For more related Python content, please search my earlier articles or keep browsing the related articles below. I hope you'll continue to support me!