This article shares an example of a Python crawler that scrapes Sina Weibo content, for your reference. The details are as follows:

The crawler is written in Python and scrapes the posts of a big-V Weibo account; this article takes a popular account as its example (crawling the Sina m-site page https://m.weibo.cn/u/1259110474).

When crawling a website, the m-site is generally your first choice, followed by the wap-site, with the PC site considered last. This is not absolute, of course: sometimes the PC site carries the most complete information, and if you happen to need all of it, the PC site becomes your first choice. M-site domain names usually begin with m, so the URL this article works with is m.weibo.cn.
Preparation
1. Proxy IP

There are many free proxy IPs on the internet, for example from the Xici (西刺) free proxy site; find one yourself and test that it works. A quick check is sketched below.
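Before wiring a proxy into the crawler, it is worth confirming that it actually responds. Here is a minimal sketch in the same urllib style as the full code below; the proxy address is the example used in this article and is likely dead by now, and http://httpbin.org/ip is a public echo service chosen here purely for illustration:

# -*- coding: utf-8 -*-
import urllib.request

# Example proxy from this article; substitute one that is currently alive
proxy_addr = "122.241.72.191:808"

def test_proxy(proxy_addr, test_url="http://httpbin.org/ip"):
    # Route the request through the proxy; httpbin echoes back the IP
    # it saw, so the output should show the proxy's address, not yours
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy)
    return opener.open(test_url, timeout=10).read().decode('utf-8')

print(test_proxy(proxy_addr))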
2. Packet-capture analysis

The address that serves the Weibo content is obtained through packet capture. That process is not detailed here; readers unfamiliar with it can search Baidu for related material. The complete code follows below.
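If a packet-capture tool is not handy, an alternative is to call the m-site API directly and inspect the JSON it returns; the getIndex endpoint below is the same one the full code relies on. A minimal sketch, shown here without a proxy:

# -*- coding: utf-8 -*-
import json
import urllib.request

uid = '1259110474'
url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + uid

req = urllib.request.Request(url)
# Some endpoints reject requests that lack a browser-like User-Agent
req.add_header("User-Agent", "Mozilla/5.0")
raw = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')

# Pretty-print the payload to locate userInfo and the tabs list whose
# 'weibo' entry carries the containerid needed for paging through posts
data = json.loads(raw).get('data', {})
print(json.dumps(data, ensure_ascii=False, indent=2)[:2000])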
Full Code:
# -*- coding: utf-8 -*-
import urllib.request
import json

# Weibo user ID of the big-V account to crawl
id = '1259110474'
# Proxy IP (replace with a working one)
proxy_addr = "122.241.72.191:808"

# Open a URL through the proxy and return the decoded page body
def use_proxy(url, proxy_addr):
    req = urllib.request.Request(url)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE MetaSr 1.0")
    # Route both http and https requests through the proxy
    proxy = urllib.request.ProxyHandler({'http': proxy_addr, 'https': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
    return data

# Get the containerid of the user's homepage; it is required to crawl the posts
def get_containerid(url):
    data = use_proxy(url, proxy_addr)
    content = json.loads(data).get('data')
    containerid = None
    for tab in content.get('tabsInfo').get('tabs'):
        if tab.get('tab_type') == 'weibo':
            containerid = tab.get('containerid')
    return containerid

# Get the big-V user's basic information, such as: nickname, homepage
# address, avatar, following/follower counts, gender, rank, etc.
def get_userInfo(id):
    url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + id
    data = use_proxy(url, proxy_addr)
    content = json.loads(data).get('data')
    userInfo = content.get('userInfo')
    profile_image_url = userInfo.get('profile_image_url')
    description = userInfo.get('description')
    profile_url = userInfo.get('profile_url')
    verified = userInfo.get('verified')
    guanzhu = userInfo.get('follow_count')
    name = userInfo.get('screen_name')
    fensi = userInfo.get('followers_count')
    gender = userInfo.get('gender')
    urank = userInfo.get('urank')
    print("Weibo nickname:" + name + "\n"
          + "Weibo homepage:" + profile_url + "\n"
          + "Avatar URL:" + profile_image_url + "\n"
          + "Verified:" + str(verified) + "\n"
          + "Bio:" + description + "\n"
          + "Following:" + str(guanzhu) + "\n"
          + "Followers:" + str(fensi) + "\n"
          + "Gender:" + gender + "\n"
          + "Weibo rank:" + str(urank) + "\n")

# Get the post information and save it to a text file, including: each
# post's text, detail-page URL, like/comment/repost counts, and so on
def get_weibo(id, file):
    i = 1
    while True:
        url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + id
        weibo_url = url + '&containerid=' + get_containerid(url) + '&page=' + str(i)
        try:
            data = use_proxy(weibo_url, proxy_addr)
            content = json.loads(data).get('data')
            cards = content.get('cards')
            if len(cards) > 0:
                for j in range(len(cards)):
                    print("-----Crawling page " + str(i) + ", post " + str(j) + "-----")
                    card_type = cards[j].get('card_type')
                    if card_type == 9:  # card_type 9 is an ordinary post
                        mblog = cards[j].get('mblog')
                        attitudes_count = mblog.get('attitudes_count')
                        comments_count = mblog.get('comments_count')
                        created_at = mblog.get('created_at')
                        reposts_count = mblog.get('reposts_count')
                        scheme = cards[j].get('scheme')
                        text = mblog.get('text')
                        with open(file, 'a', encoding='utf-8') as fh:
                            fh.write("----Page " + str(i) + ", post " + str(j) + "----" + "\n")
                            fh.write("Post URL:" + str(scheme) + "\n"
                                     + "Posted at:" + str(created_at) + "\n"
                                     + "Text:" + text + "\n"
                                     + "Likes:" + str(attitudes_count) + "\n"
                                     + "Comments:" + str(comments_count) + "\n"
                                     + "Reposts:" + str(reposts_count) + "\n")
                i += 1
            else:
                break
        except Exception as e:
            # Print the error and retry the same page; note that a
            # permanently failing proxy will keep this loop spinning
            print(e)
            pass

if __name__ == "__main__":
    file = id + ".txt"
    get_userInfo(id)
    get_weibo(id, file)
Crawl results
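A few notes on running the script: get_weibo pages through the posts until the API returns an empty cards list, at which point the loop breaks; the output file (here 1259110474.txt) is opened in append mode, so rerunning the script adds duplicate entries rather than overwriting; and every request goes through proxy_addr, with the error handler retrying indefinitely, so a dead proxy leaves the loop spinning. Swap in a live one first.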
More Python-related content is available in this site's topic guides: "Python Socket Programming Tips Summary", "Python Regular Expression Usage Summary", "Python Data Structures and Algorithms Tutorial", "Summary of Python Function Usage Tips", "Summary of Python String Manipulation Techniques", "Python Introductory and Advanced Classic Tutorials", and "Summary of Python File and Directory Manipulation Techniques".
I hope this article helps you with your Python programming.