This article presents an example of crawling Jack Ma's Weibo posts with Python, shared here for your reference. The details are as follows:
Analyzing requests
Turn on the XHR filter in the browser's developer tools and keep scrolling the page to load new posts; you will see a constant stream of Ajax requests being sent.
Select one of these requests to analyze its parameters: click on the request to open its details page, as shown in the figure:
You can see that this is a GET request with six parameters: display, retcode, type, value, containerid and page. Comparing several of these requests shows that only page changes, so it is clearly the parameter that controls paging.
Analyzing the response
As shown:
The response is in JSON format, and the browser's developer tools parse it automatically for easy viewing. The two most important parts are cardlistInfo and cards. Expanding cardlistInfo, one key field it contains is total, which on inspection turns out to be the total number of Weibo posts; we can use this number to estimate how many pages there are.
Expanding cards, each element has another important field named mblog. Expanding that in turn reveals the information about the post itself: for example, attitudes_count (number of likes), comments_count (number of comments), reposts_count (number of retweets), created_at (posting time), text (the post body), and so on. All of it is structured, which makes the information easy to extract.
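To summarize these observations, the shape of the response looks roughly like the outline below. This is only a sketch reconstructed from the fields mentioned above, not the complete response; the placeholder values are made up for illustration.

# Rough outline of the Ajax response body (only the fields discussed above).
response_outline = {
    'data': {
        'cardlistInfo': {
            'total': 0,                  # total number of Weibo posts
        },
        'cards': [
            {
                'mblog': {
                    'text': '...',           # post body (HTML)
                    'attitudes_count': 0,    # number of likes
                    'comments_count': 0,     # number of comments
                    'reposts_count': 0,      # number of retweets
                    'created_at': '...',     # posting time
                },
            },
        ],
    },
}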
Each request to this interface returns 10 posts, and all we need to change between requests is the page parameter. A simple loop over page is therefore enough to fetch all the posts (see the sketch below).
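For example, since each page carries 10 posts, the number of pages can be estimated from the total field mentioned earlier. A minimal sketch, assuming total has already been read from cardlistInfo (the value here is hypothetical):

import math

# Hypothetical value; in practice this comes from data.cardlistInfo.total.
total = 487

# Each Ajax request returns 10 posts, so round up to get the page count.
max_page = math.ceil(total / 10)
print(max_page)  # 49 for the hypothetical total above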
Hands-on exercise
Here we start simulating these Ajax requests with a program that crawls all of Jack Ma's Weibo posts.
First we define a method to get the result of each request. Since page is the variable part of the request, we pass it in as an argument to the method. The code is as follows:
from urllib.parse import urlencode
import requests

# base_url holds the first half of the request URL (the host is omitted here).
base_url = '/api/container/getIndex?'

headers = {
    'Host': '',
    'Referer': '/u/2145291155',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

def get_page(page):
    # display, retcode, type, value and containerid are fixed; only page varies.
    params = {
        'display': '0',
        'retcode': '6102',
        'type': 'uid',
        'value': '2145291155',
        'containerid': '1076032145291155',
        'page': page
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)
First, we define a base_url to represent the first half of the request URL, and then construct a parameter dictionary in which display, retcode, type, value and containerid are fixed and only page is variable. Next, we call the urlencode() method to convert the parameters into GET query parameters for the URL, i.e. something like display=0&retcode=6102&type=uid&value=2145291155&containerid=1076032145291155&page=2. We then use Requests to request the link, passing in the headers parameter, and check the response status code: if it is 200, we call the json() method to parse the content as JSON and return it; otherwise nothing is returned. If an exception occurs, it is caught and its message is printed.
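As a quick sanity check, you could call the method once and read the total field discussed earlier. This is only a sketch; the exact keys depend on the actual response.

# Fetch the first page and peek at the total post count (keys as observed above).
result = get_page(1)
if result:
    total = result.get('data', {}).get('cardlistInfo', {}).get('total')
    print('total posts:', total)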
Then we need to define a parsing method to extract the information we want from the result. For example, if we want to save the body, likes, comments and reposts of each post, we can iterate through cards, take the information inside mblog, and assemble it into a new dictionary to yield.
from pyquery import PyQuery as pq

def parse_page(json):
    if json:
        items = json.get('data').get('cards')
        for item in items:
            item = item.get('mblog')
            weibo = {}
            weibo['text'] = pq(item.get('text')).text()   # post body with HTML tags stripped
            weibo['likes'] = item.get('attitudes_count')
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')
            yield weibo
Here we remove the HTML tags from the body with the help of PyQuery.
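For instance, a minimal illustration with a made-up HTML fragment resembling the text field of a post: PyQuery's text() method returns only the text content, with the tags dropped.

from pyquery import PyQuery as pq

# A made-up fragment resembling the 'text' field of a post.
html = 'Hello <a href="#">world</a><br/>!'
print(pq(html).text())  # prints the text with the <a> and <br/> tags removed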
Finally, we iterate over the pages and print out the extracted results.
if __name__ == '__main__':
    for page in range(1, 50):
        json = get_page(page)
        results = parse_page(json)
        for result in results:
            print(result)
Alternatively, we can add a method to save the results to a local TXT file.
def save_to_txt(result):
    with open("Jack Ma's Weibo.txt", 'a', encoding='utf-8') as file:
        file.write(str(result) + '\n')
Code Organization
import requests
from urllib.parse import urlencode
from pyquery import PyQuery as pq

# base_url holds the first half of the request URL (the host is omitted here).
base_url = '/api/container/getIndex?'

headers = {
    'Host': '',
    'Referer': '/u/2145291155',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}
max_page = 50

def get_page(page):
    params = {
        'display': '0',
        'retcode': '6102',
        'type': 'uid',
        'value': '2145291155',
        'containerid': '1076032145291155',
        'page': page
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json(), page
    except requests.ConnectionError as e:
        print('Error', e.args)

def parse_page(json, page: int):
    if json:
        items = json.get('data').get('cards')
        for index, item in enumerate(items):
            # The card at index 1 on the first page is skipped here.
            if page == 1 and index == 1:
                continue
            else:
                item = item.get('mblog')
                weibo = {}
                weibo['text'] = pq(item.get('text')).text()
                weibo['likes'] = item.get('attitudes_count')
                weibo['comments'] = item.get('comments_count')
                weibo['reposts'] = item.get('reposts_count')
                yield weibo

def save_to_txt(result):
    with open("Jack Ma's Weibo.txt", 'a', encoding='utf-8') as file:
        file.write(str(result) + '\n')

if __name__ == '__main__':
    for page in range(1, max_page + 1):
        json = get_page(page)
        if json:
            results = parse_page(*json)
            for result in results:
                print(result)
                save_to_txt(result)
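If you prefer structured output over plain str() lines, one possible variation (a sketch only, assuming the same result dictionaries as above) writes each record as a line of JSON instead. The standard json module is imported under another name so it does not clash with the json variable used in the main loop.

import json as jsonlib

def save_to_jsonl(result):
    # Append each post as one JSON object per line; ensure_ascii=False keeps Chinese text readable.
    with open("Jack Ma's Weibo.jsonl", 'a', encoding='utf-8') as file:
        file.write(jsonlib.dumps(result, ensure_ascii=False) + '\n')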
This article is based on Cui Qingcai's "Python 3 Web Crawler Development in Practice".
I hope this article is helpful for your Python programming.