Preliminaries
In the browser's developer tools, inspect the request the official KFC website sends: the request method is POST, and the request header

X-Requested-With: XMLHttpRequest

confirms that the KFC official website loads its data via ajax.
With these two preparatory steps done, the goal of this crawl is clear: send an ajax POST request to the KFC official website and fetch the first 10 pages of Shanghai KFC locations.
Analysis
To get the first 10 pages of Shanghai KFC locations, we first need to analyze the url of each page.
first page

```
# page1
# /kfccda/ashx/?op=cname
# POST
# cname: Shanghai
# pid:
# pageIndex: 1
# pageSize: 10
```

second page

```
# page2
# /kfccda/ashx/?op=cname
# POST
# cname: Shanghai
# pid:
# pageIndex: 2
# pageSize: 10
```
Page 3 and so on.
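Since only pageIndex changes from page to page, the parameters for any page can be generated programmatically. A minimal sketch (the helper name build_params is mine, not from the site):

```python
# Build the POST parameters for a given page; only pageIndex varies.
def build_params(page):
    return {
        'cname': 'Shanghai',
        'pid': '',
        'pageIndex': page,
        'pageSize': '10',
    }

print(build_params(1))
print(build_params(2))
```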
Program entry
First, review the basic urllib crawling workflow:

```python
# Use urllib to fetch the source code of the Baidu home page
import urllib.request

# 1. Define a url, i.e. the address you want to visit
url = 'http://www.baidu.com'

# 2. Simulate the browser sending a request to the server
response = urllib.request.urlopen(url)

# 3. Get the page source code from the response.
#    The read method returns binary data in byte form;
#    decode converts the bytes (binary -> string) into a string.
content = response.read().decode('utf-8')

# 4. Print the data
print(content)
```
- Define a url, i.e. the address you want to visit
- Simulate the browser sending a request to the server, receiving a response
- Get the page source code from the response content
```python
if __name__ == '__main__':
    start_page = int(input('Please enter the starting page number: '))
    end_page = int(input('Please enter the ending page number: '))
    for page in range(start_page, end_page + 1):
        # Customize the request object
        request = create_request(page)
        # Get the web page source code
        content = get_content(request)
        # Download the data
        down_load(page, content)
```
Correspondingly, we define each of these helper functions below.
URL composition: locating the data
The key to crawling is finding the right interface. In this case, the Preview tab of the matching request in the developer tools shows json data, which tells us this is the data we want.
Constructing the url
It's not hard to spot the common prefix shared by the KFC official website urls, which we save as base_url:

```python
base_url = '/kfccda/ashx/?op=cname'
```
Parameters
As usual, we look for a pattern: only pageIndex changes with the page number.
```python
data = {
    'cname': 'Shanghai',
    'pid': '',
    'pageIndex': page,
    'pageSize': '10'
}
```
POST request
- The parameters of a POST request must be urlencoded, and the resulting string must then be converted to bytes by calling encode.
- The parameters of a POST request are not spliced into the url; they are passed to the request object through its data argument.

So we encode the data:

```python
data = urllib.parse.urlencode(data).encode('utf-8')
```
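To see what the two-step encoding produces, here is a quick check: urlencode turns the dict into a percent-encoded query string, and encode turns that string into bytes:

```python
import urllib.parse

data = {'cname': 'Shanghai', 'pid': '', 'pageIndex': 1, 'pageSize': '10'}

# Step 1: urlencode -> query string (str)
encoded = urllib.parse.urlencode(data)
print(encoded)      # cname=Shanghai&pid=&pageIndex=1&pageSize=10

# Step 2: encode -> bytes, the form urlopen expects for a POST body
body = encoded.encode('utf-8')
print(type(body))   # <class 'bytes'>
```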
Getting the request headers (a countermeasure against anti-crawling)

That is, the UA portion of the request headers.

User-Agent is a special string header that lets the server identify the client's operating system and version, CPU type, browser and version, browser kernel, rendering engine, browser language, browser plug-ins, and so on.
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38'
}
```
Request object customization
Once the parameters, base_url, and request headers are ready, we can customize the request object:

```python
request = urllib.request.Request(base_url, headers=headers, data=data)
```
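One consequence of passing data is worth noting: urllib infers the HTTP method from it, so a Request with a data payload defaults to POST. A small sketch (the absolute url here is a placeholder for illustration, since the article's base_url is a relative fragment):

```python
import urllib.request

# Placeholder absolute url, for illustration only.
url = 'http://example.com/kfccda/ashx/?op=cname'
body = b'cname=Shanghai&pid=&pageIndex=1&pageSize=10'
headers = {'User-Agent': 'Mozilla/5.0'}

request = urllib.request.Request(url, headers=headers, data=body)
print(request.get_method())           # POST -- data present implies POST

request_no_body = urllib.request.Request(url, headers=headers)
print(request_no_body.get_method())   # GET -- no data implies GET
```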
Get the web page source code
This step takes the request as a parameter and simulates the browser sending it to the server to get a response:

```python
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
```
Get the source code of the page in the response and download the data
We use the read() method, which returns binary data in byte form, and then call decode to convert it to a string:

```python
content = response.read().decode('utf-8')
```
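As a small illustration of the byte-to-string conversion (the payload below is made up, standing in for what response.read() would return):

```python
# A byte string standing in for the binary data returned by response.read()
raw = b'{"Table":[{"rowcount":100}]}'

# decode converts the bytes to a str, ready to inspect or write to a text file
content = raw.decode('utf-8')
print(type(content))   # <class 'str'>
```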
Then we write the downloaded data to a file. With the with open() as fp syntax, the system closes the file automatically:

```python
def down_load(page, content):
    with open('kfc_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)
```
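A quick round-trip check of this write pattern, using a temporary directory and made-up sample content (the folder parameter is added here only so the sketch doesn't write into the working directory):

```python
import os
import tempfile

def down_load(page, content, folder):
    # Same pattern as above, parameterized with a target folder.
    path = os.path.join(folder, 'kfc_' + str(page) + '.json')
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write(content)
    return path

with tempfile.TemporaryDirectory() as tmp:
    sample = '{"Table":[{"rowcount":100}]}'   # stand-in for real response data
    path = down_load(1, sample, tmp)
    with open(path, encoding='utf-8') as fp:
        round_trip = fp.read()
print(os.path.basename(path))   # kfc_1.json
```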
Full code
```python
# ajax POST request to the KFC official website:
# get the first 10 pages of KFC locations in Shanghai.
# page1
# /kfccda/ashx/?op=cname
# POST
# cname: Shanghai
# pid:
# pageIndex: 1
# pageSize: 10
# page2
# /kfccda/ashx/?op=cname
# POST
# cname: Shanghai
# pid:
# pageIndex: 2
# pageSize: 10

import urllib.request
import urllib.parse


def create_request(page):
    base_url = '/kfccda/ashx/?op=cname'
    data = {
        'cname': 'Shanghai',
        'pid': '',
        'pageIndex': page,
        'pageSize': '10'
    }
    data = urllib.parse.urlencode(data).encode('utf-8')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38'
    }
    request = urllib.request.Request(base_url, headers=headers, data=data)
    return request


def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content


def down_load(page, content):
    with open('kfc_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
        fp.write(content)


if __name__ == '__main__':
    start_page = int(input('Please enter the starting page number: '))
    end_page = int(input('Please enter the ending page number: '))
    for page in range(start_page, end_page + 1):
        # Customize the request object
        request = create_request(page)
        # Get the web page source code
        content = get_content(request)
        # Download the data
        down_load(page, content)
```
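Once the pages are saved, the json files can be parsed with the standard json module. A hedged sketch: the field names Table1 and storeName below are assumptions about the response shape for illustration, not confirmed by this article.

```python
import json

# Hypothetical response snippet; the real field names may differ.
sample = ('{"Table":[{"rowcount":2}],'
          '"Table1":[{"storeName":"Store A"},{"storeName":"Store B"}]}')

data = json.loads(sample)
# Pull the store names out of the (assumed) Table1 list.
names = [store['storeName'] for store in data.get('Table1', [])]
print(names)
```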
Crawling results
Take a bow!!! I also crawled photos of Lisa; leave a comment if you'd like to see that crawler's code!
That's all for this Python practice article on crawling the KFC official website. For more related Python content, please search my earlier articles or keep browsing the related articles below. I hope you'll continue to support me!