
Python Crawler Crawling Sina Weibo Content Example [Based on Proxy IP].

This article presents an example of a Python crawler that scrapes Sina Weibo content, shared for your reference as follows.

The crawler is written in Python and scrapes the posts of a Weibo big-V account; this article uses a popular "goddess" account as the example (crawling the Sina m-site page /u/1259110474).

Generally, when crawling a website, the m-site is your first choice, the wap-site second, and the PC site last. This is not absolute: sometimes the PC site has the most complete information, and if you happen to need all of it, the PC site becomes your first choice. M-site domains usually begin with m, so this article works against the Weibo m-site, m.weibo.cn.

Preparation

1. Proxy IP

There are plenty of free proxy IPs available online, for example from the Xici (西刺) free proxy list; find one yourself and test that it works.
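Before dropping a free proxy into the crawler, it is worth verifying that it actually responds. The following is a minimal sketch of such a check, assuming the proxy address used later in this article and httpbin.org as a test endpoint (both are only illustrative choices, not part of the original code):

# -*- coding: utf-8 -*-
import urllib.request

# Try to fetch a test page through the proxy; httpbin echoes back the IP it sees.
def test_proxy(proxy_addr, test_url="http://httpbin.org/ip"):
  proxy=urllib.request.ProxyHandler({'http':proxy_addr})
  opener=urllib.request.build_opener(proxy)
  try:
    print(opener.open(test_url,timeout=5).read().decode('utf-8','ignore'))
    return True
  except Exception as e:
    print("Proxy failed:",e)
    return False

test_proxy("122.241.72.191:808")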

2. Packet capture analysis

The address of the Weibo content API is obtained through packet capture; the details are not covered here, and readers unfamiliar with packet capture can look up related material on their own.
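To get a feel for what the packet-captured endpoint returns before running the full crawler, the short sketch below requests one page of the getIndex API and prints the top-level keys of the JSON. The m.weibo.cn host, the User-Agent header, and the direct (non-proxied) request are assumptions made for this illustration:

# -*- coding: utf-8 -*-
import json
import urllib.request

# Fetch the user's getIndex page and look at the structure of the response.
url='https://m.weibo.cn/api/container/getIndex?type=uid&value=1259110474'
req=urllib.request.Request(url,headers={'User-Agent':'Mozilla/5.0'})
data=json.loads(urllib.request.urlopen(req).read().decode('utf-8','ignore'))
print(list(data.get('data',{}).keys()))  # expect keys such as userInfo and tabsInfo, used in the code below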

Full Code:

# -*- coding: utf-8 -*-
import urllib.request
import json
# Weibo user ID of the account to be crawled
id='1259110474'
#Set Proxy IP
proxy_addr="122.241.72.191:808"
# Define the page open function
def use_proxy(url,proxy_addr):
  req=urllib.request.Request(url)
  req.add_header("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE  MetaSr 1.0")
  proxy=urllib.request.ProxyHandler({'http':proxy_addr})
  opener=urllib.request.build_opener(proxy,urllib.request.HTTPHandler)
  urllib.request.install_opener(opener)
  data=urllib.request.urlopen(req).read().decode('utf-8','ignore')
  return data
# Get the containerid of the homepage of the microblog, which is needed to crawl the content of the microblog.
def get_containerid(url):
  data=use_proxy(url,proxy_addr)
  content=(data).get('data')
  for data in ('tabsInfo').get('tabs'):
    if(('tab_type')=='weibo'):
      containerid=('containerid')
  return containerid
# Obtain basic information about the Weibo user, such as: nickname, homepage address, avatar, number of accounts followed, number of followers, gender, level, etc.
def get_userInfo(id):
  url='https://m.weibo.cn/api/container/getIndex?type=uid&value='+id
  data=use_proxy(url,proxy_addr)
  content=json.loads(data).get('data')
  profile_image_url=content.get('userInfo').get('profile_image_url')
  description=content.get('userInfo').get('description')
  profile_url=content.get('userInfo').get('profile_url')
  verified=content.get('userInfo').get('verified')
  guanzhu=content.get('userInfo').get('follow_count')
  name=content.get('userInfo').get('screen_name')
  fensi=content.get('userInfo').get('followers_count')
  gender=content.get('userInfo').get('gender')
  urank=content.get('userInfo').get('urank')
  print("Weibo nickname:"+name+"\n"+"Weibo homepage:"+profile_url+"\n"+"Weibo avatar URL:"+profile_image_url+"\n"+"Verified:"+str(verified)+"\n"+"Weibo description:"+description+"\n"+"Following:"+str(guanzhu)+"\n"+"Followers:"+str(fensi)+"\n"+"Gender:"+gender+"\n"+"Weibo level:"+str(urank)+"\n")
# Get the Weibo posts and save them to a text file, including: the text of each post, the post detail page URL, the number of likes, comments, reposts, etc.
def get_weibo(id,file):
  i=1
  while True:
    url='https://m.weibo.cn/api/container/getIndex?type=uid&value='+id
    weibo_url='https://m.weibo.cn/api/container/getIndex?type=uid&value='+id+'&containerid='+get_containerid(url)+'&page='+str(i)
    try:
      data=use_proxy(weibo_url,proxy_addr)
      content=json.loads(data).get('data')
      cards=content.get('cards')
      if(len(cards)>0):
        for j in range(len(cards)):
          print("----- is crawling No."+str(i)+"p."+str(j)+"Article tweets ------")
          card_type=cards[j].get('card_type')
          if(card_type==9):
            mblog=cards[j].get('mblog')
            attitudes_count=mblog.get('attitudes_count')
            comments_count=mblog.get('comments_count')
            created_at=mblog.get('created_at')
            reposts_count=mblog.get('reposts_count')
            scheme=cards[j].get('scheme')
            text=mblog.get('text')
            with open(file,'a',encoding='utf-8') as fh:
              fh.write("---- Page "+str(i)+", post "+str(j)+" ----"+"\n")
              fh.write("Post URL:"+str(scheme)+"\n"+"Published:"+str(created_at)+"\n"+"Text:"+text+"\n"+"Likes:"+str(attitudes_count)+"\n"+"Comments:"+str(comments_count)+"\n"+"Reposts:"+str(reposts_count)+"\n")
        i+=1
      else:
        break
    except Exception as e:
      print(e)
      pass
if __name__=="__main__":
  file=id+".txt"
  get_userInfo(id)
  get_weibo(id,file)

Crawling results


I hope this article is helpful to readers working with Python.