Python implementation of WeChat friends data crawling and analysis

Preamble

With the popularity of WeChat, more and more people have started using it. WeChat has gradually grown from a simple social app into a way of life: people rely on it for everyday communication as well as for work. Each WeChat friend represents a different role that we play in society.

This article uses Python to analyze WeChat friend data along four dimensions: gender, avatar, signature, and location. The results are presented mainly as charts and word clouds, and the text-based information is processed with two methods, word frequency analysis and sentiment analysis. As the saying goes, to do a good job you must first sharpen your tools, so before starting, here is a brief introduction to the third-party modules used in this article:

itchat: a Python interface to the web version of WeChat, used in this article to get WeChat friend information.
jieba: the Python "Jieba" ("stutter") Chinese word segmentation library, used in this article to segment text and extract keywords.
matplotlib: Python's charting module, used in this article to draw bar charts and pie charts.
snownlp: a Python library for Chinese text processing, used in this article to judge the sentiment of text.
PIL: Python's image processing module (installed as the Pillow package), used in this article to process images.
numpy: Python's numerical computing module, used in this article together with the wordcloud module.
wordcloud: a word cloud module in Python, used in this article to draw word cloud images.
TencentYoutuyun: the Python SDK provided by Tencent Youtu, used in this article to detect faces and extract image tags.

All of the above modules can be installed via pip. For detailed instructions on how to use each module, please refer to their respective documentation.

1. Data analysis

Analyzing WeChat friend data first requires getting your friends' information, which the itchat module makes very easy. The following two lines of code are enough:

itchat.auto_login(hotReload = True) 
friends = itchat.get_friends(update = True)

As with a normal login to the web version of WeChat, we scan the QR code with a phone to log in. The friends object returned here is a list whose first element is the current user, so throughout the following analysis we always take friends[1:] as the raw input. Each element of the list is a dictionary. In my case it contains the fields Sex, City, Province, HeadImgUrl, and Signature, and the analysis below is built on these fields, which cover the four dimensions of gender, location, avatar, and signature.
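To make the data shape concrete, here is a minimal sketch that works on a hand-written list standing in for the value returned by itchat.get_friends(). The sample records and the helper name pick_fields are assumptions for illustration; only the field names come from the text above.

```python
# Hypothetical sample standing in for itchat.get_friends(update=True).
# Real records contain many more keys; only the ones analyzed here are shown.
sample_friends = [
    {"NickName": "me", "Sex": 1, "Province": "Shaanxi", "City": "Xi'an", "Signature": ""},
    {"NickName": "alice", "Sex": 2, "Province": "Ningxia", "City": "Yinchuan", "Signature": "hello"},
    {"NickName": "bob", "Sex": 0, "Province": "", "City": "", "Signature": ""},
]

FIELDS = ("Sex", "Province", "City", "Signature")

def pick_fields(friends, fields=FIELDS):
    """Return one small dict per friend, skipping the first element (the current user)."""
    return [{name: friend.get(name) for name in fields} for friend in friends[1:]]

rows = pick_fields(sample_friends)
print(len(rows))       # number of friends, excluding the current user
print(rows[0]["Sex"])  # Sex value of the first real friend
```

Working on such a reduced copy keeps the later analysis steps independent of the live WeChat session.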

2. Friend's gender

To analyze our friends' gender, we first extract the Sex field from each friend's record and then count the number of Male, Female, and Unknown values. We assemble these three counts into a list and use the matplotlib module to draw a pie chart. The code is as follows:

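from collections import Counter
import matplotlib.pyplot as plt

def analyseSex(friends):
    # Sex is 0 (Unknown), 1 (Male) or 2 (Female); friends[0] is the current user
    sexs = list(map(lambda x: x['Sex'], friends[1:]))
    # Pre-seed the three keys so that a value with no friends still yields a count of 0
    counter = Counter({0: 0, 1: 0, 2: 0})
    counter.update(sexs)
    # Sort the (key, count) tuples by key so the counts line up with the labels
    counts = list(map(lambda x: x[1], sorted(counter.items())))
    labels = ['Unknown', 'Male', 'Female']
    colors = ['red', 'yellowgreen', 'lightskyblue']
    plt.figure(figsize=(8, 5), dpi=80)
    plt.axes(aspect=1)
    plt.pie(counts,                # Gender counts
            labels=labels,         # Label of each wedge
            colors=colors,         # Wedge colors
            labeldistance=1.1,     # Distance of the labels from the center
            autopct='%3.1f%%',     # Format of the percentage text
            shadow=False,          # Draw the pie without a shadow
            startangle=90,         # Starting angle of the first wedge
            pctdistance=0.6)       # Distance of the percentage text from the center
    plt.legend(loc='upper right')
    plt.title(u"Gender composition of %s's WeChat friends" % friends[0]['NickName'])
    plt.show()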

Here is a brief explanation of this code. The Sex field in WeChat takes three values, Unknown, Male, and Female, which correspond to 0, 1, and 2 respectively. The values are counted with Counter from the collections module, whose items() method returns a collection of (key, count) tuples.

The first element of each tuple is the key, i.e. 0, 1, or 2, and the second element is the count. Note that a Counter returns its items in the order in which the keys were first seen, not in sorted order, so the tuples have to be sorted by key before they are listed as 0, 1, 2 and line up with the labels. We can then extract the three counts with map() and pass them to matplotlib, which calculates the percentage of each value itself. The resulting pie chart shows the gender distribution of friends.
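The ordering point is easy to check in isolation. The sketch below uses made-up Sex values (an assumption for illustration) to show that Counter keeps first-seen order, and one way to get counts that always line up with the three labels, even when a value never occurs.

```python
from collections import Counter

# Hypothetical Sex values for six friends: 0 = Unknown, 1 = Male, 2 = Female
sexs = [2, 1, 1, 2, 1, 1]

# Plain Counter: keys come back in first-seen order, and 0 is absent entirely
plain = Counter(sexs)
print(list(plain.items()))

# Pre-seed the three keys with zero, then sort by key before reading the counts
seeded = Counter({0: 0, 1: 0, 2: 0})
seeded.update(sexs)
counts = [count for _, count in sorted(seeded.items())]
print(counts)
```

With the plain Counter, a list of two counts would be paired with three labels, which is why the seeded and sorted variant is the safer input for a pie chart.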

3. Friend's avatar

Friends' avatars are analyzed from two angles: first, what proportion of friends use a face as their avatar; second, what meaningful keywords can be extracted from the avatars.

Here we download each avatar to a local directory according to the HeadImgUrl field, then use the face recognition API provided by Tencent Youtu to detect whether the avatar contains a face and to extract the tags of the image. The former is a categorical summary, which we present as a pie chart; the latter is text analysis, which we present as a word cloud. The key code is shown below:

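import os
import time
import itchat
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

def analyseHeadImage(friends):
    # Init Path
    basePath = os.path.abspath('.')
    baseFolder = os.path.join(basePath, 'HeadImages')
    if not os.path.exists(baseFolder):
        os.makedirs(baseFolder)
    # Analyse Images
    # FaceAPI is the author's wrapper around the Tencent Youtu SDK (not shown here)
    faceApi = FaceAPI()
    use_face = 0
    not_use_face = 0
    image_tags = ''
    for index in range(1, len(friends)):
        friend = friends[index]
        # Save HeadImages (the .jpg extension is an assumption)
        imgFile = os.path.join(baseFolder, 'Image%s.jpg' % str(index))
        imgData = itchat.get_head_img(userName=friend['UserName'])
        if not os.path.exists(imgFile):
            with open(imgFile, 'wb') as file:
                file.write(imgData)
        # Detect Faces (pause one second between requests to the remote API)
        time.sleep(1)
        # detectFace: assumed method name of the FaceAPI wrapper
        result = faceApi.detectFace(imgFile)
        if result == True:
            use_face += 1
        else:
            not_use_face += 1
        # Extract Tags
        # extractTags: assumed method name of the FaceAPI wrapper
        result = faceApi.extractTags(imgFile)
        # The trailing comma keeps tags of consecutive avatars from merging
        image_tags += ','.join(list(map(lambda x: x['tag_name'], result))) + ','
    labels = [u'Use face avatar', u'Do not use face avatar']
    counts = [use_face, not_use_face]
    colors = ['red', 'yellowgreen', 'lightskyblue']
    plt.figure(figsize=(8, 5), dpi=80)
    plt.axes(aspect=1)
    plt.pie(counts,                # Face / no-face counts
            labels=labels,         # Label of each wedge
            colors=colors,         # Wedge colors
            labeldistance=1.1,     # Distance of the labels from the center
            autopct='%3.1f%%',     # Format of the percentage text
            shadow=False,          # Draw the pie without a shadow
            startangle=90,         # Starting angle of the first wedge
            pctdistance=0.6)       # Distance of the percentage text from the center
    plt.legend(loc='upper right')
    plt.title(u"Use of face avatars among %s's WeChat friends" % friends[0]['NickName'])
    plt.show()
    # Repair tag text that was decoded as ISO-8859-1 instead of UTF-8
    image_tags = image_tags.encode('iso8859-1').decode('utf-8')
    back_coloring = np.array(Image.open(''))  # path of the mask image, to be filled in
    wordcloud = WordCloud(
        font_path='',  # path of a font that covers Chinese, to be filled in
        background_color="white",
        max_words=1200,
        mask=back_coloring,
        max_font_size=75,
        random_state=45,
        width=800,
        height=480,
        margin=15
    )
    wordcloud.generate(image_tags)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()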

Here we create a new HeadImages directory under the current directory to store the avatars of all our friends, and then use a class named FaceAPI, a wrapper around Tencent Youtu's SDK. We call two API interfaces, Face Detection and Image Tag Recognition: the former counts the friends who "use face avatars" and who "do not use face avatars", and the latter accumulates the tags extracted from each avatar. The results of the analysis are shown in the figure below.
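The counting logic can be separated from the Youtu calls, which makes it easy to try without network access. The following is a minimal sketch under that assumption: summarise_avatars, detect_face, and extract_tags are hypothetical names, and the two stub lookups merely stand in for the real face detection and tag recognition requests.

```python
from collections import Counter

def summarise_avatars(image_files, detect_face, extract_tags):
    """Count face / non-face avatars and tally the tags of all images.

    detect_face(path) -> bool and extract_tags(path) -> list of str are
    supplied by the caller; here they are stand-ins for the remote API calls.
    """
    use_face = 0
    not_use_face = 0
    tag_counts = Counter()
    for path in image_files:
        if detect_face(path):
            use_face += 1
        else:
            not_use_face += 1
        tag_counts.update(extract_tags(path))
    return use_face, not_use_face, tag_counts

# Stub data: hypothetical results for three avatar files
fake_faces = {"Image1.jpg": True, "Image2.jpg": False, "Image3.jpg": False}
fake_tags = {"Image1.jpg": ["girl"], "Image2.jpg": ["sky", "sea"], "Image3.jpg": ["sky"]}

faces, no_faces, tags = summarise_avatars(
    sorted(fake_faces),
    detect_face=lambda path: fake_faces[path],
    extract_tags=lambda path: fake_tags[path],
)
print(faces, no_faces)
print(tags.most_common(1))
```

Keeping the tags in a Counter rather than in one long string avoids gluing the last tag of one avatar to the first tag of the next, and such a mapping can be passed to the generate_from_frequencies method of WordCloud.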

It can be seen that among all my WeChat friends, about 1/4 use a face avatar while about 3/4 do not. In other words, only about 25% of my WeChat friends are confident in their "face value", or, put differently, about 75% keep a low profile and prefer not to use a face as their WeChat avatar.

Secondly, considering that Tencent Youtu cannot really judge "face value", we also extracted the tags from our friends' avatars to understand what keywords appear in them. The results of the analysis are shown in the figure:

Through the word cloud, we can see that in the word cloud of my WeChat friends' avatars, the keywords with relatively high frequency are: girl, tree, house, text, screenshot, cartoon, group photo, sky, sea. This indicates that my friends' WeChat avatars come from four main sources: daily life, travel, landscapes, and screenshots.

The dominant style of the avatars is cartoon, and the common elements are the sky, the sea, houses, and trees. Looking through all my friends' avatars by hand, I found that 15 used personal photos, 53 used pictures from the internet, 25 used anime pictures, 3 used group photos, 5 used children's photos, 13 used landscape pictures, and 18 used photos of girls, which is basically in line with the results of the image tag extraction.

4. Friend's signature

Next we analyze friends' signatures. The signature is the richest piece of text in a friend's information. Following the usual human habit of "labeling", a signature can be read as a person's state over a certain period of time, just as a happy person laughs and a sad person cries, so that laughing and crying serve as labels for the happy and sad states.

Here we process the signatures in two ways. The first is to generate a word cloud after segmenting the text with jieba, in order to see which keywords appear in friends' signatures and which appear relatively often. The second is to analyze the sentiment of the signatures with SnowNLP, i.e. whether the signatures as a whole are positive, negative, or neutral, and in what proportions. Only the Signature field needs to be extracted. The core code is as follows:

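import re
import jieba.analyse
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from snownlp import SnowNLP
from wordcloud import WordCloud

def analyseSignature(friends):
    signatures = ''
    emotions = []
    # Matches the emoji code left over from the web markup, e.g. 1f604...
    pattern = re.compile(r'1f\d.+')
    for friend in friends[1:]:
        signature = friend['Signature']
        if signature is not None:
            # Strip the fragments of the emoji <span> markup
            signature = signature.strip().replace('span', '').replace('class', '').replace('emoji', '')
            signature = pattern.sub('', signature)
            if len(signature) > 0:
                nlp = SnowNLP(signature)
                # sentiments is a score between 0 (negative) and 1 (positive)
                emotions.append(nlp.sentiments)
                # Keep the top 5 keywords; the trailing space separates friends
                signatures += ' '.join(jieba.analyse.extract_tags(signature, 5)) + ' '
    with open('', 'wt', encoding='utf-8') as file:  # output text path, to be filled in
        file.write(signatures)
    # Signature WordCloud
    back_coloring = np.array(Image.open(''))  # path of the mask image, to be filled in
    wordcloud = WordCloud(
        font_path='',  # path of a font that covers Chinese, to be filled in
        background_color="white",
        max_words=1200,
        mask=back_coloring,
        max_font_size=75,
        random_state=45,
        width=960,
        height=720,
        margin=15
    )
    wordcloud.generate(signatures)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    wordcloud.to_file('')  # output image path, to be filled in
    # Signature Emotional Judgment
    count_good = len(list(filter(lambda x: x > 0.66, emotions)))
    count_normal = len(list(filter(lambda x: x >= 0.33 and x <= 0.66, emotions)))
    count_bad = len(list(filter(lambda x: x < 0.33, emotions)))
    labels = [u'Negative', u'Neutral', u'Positive']
    values = (count_bad, count_normal, count_good)
    plt.rcParams['font.sans-serif'] = ['simHei']
    plt.rcParams['axes.unicode_minus'] = False
    plt.xlabel(u'Emotional judgment')
    plt.ylabel(u'Frequency')
    plt.xticks(range(3), labels)
    # One color per bar; a plain 'rgb' string is not a valid color
    plt.bar(range(3), values, color=['r', 'g', 'b'])
    plt.title(u"Sentiment analysis of %s's WeChat friends' signatures" % friends[0]['NickName'])
    plt.show()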

Through the word cloud, we can see that in my WeChat friends' signatures, the keywords with relatively high frequency are: hard work, grow up, beautiful, happy, life, happiness, life, far away, time, walk.
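Before segmentation, the code above strips the emoji markup that the web interface embeds in signatures. As an alternative sketch, assuming the markup has the form <span class="emoji emoji1f604"></span> (an assumption inferred from the strings the code removes, not something the article states), the whole tag can be removed with one regular expression. The helper name strip_emoji_spans is hypothetical.

```python
import re

# Assumed markup form: <span class="emoji emoji1f604"></span>
EMOJI_SPAN = re.compile(r'<span\b[^>]*>.*?</span>', re.DOTALL)

def strip_emoji_spans(signature):
    """Remove <span ...>...</span> markup and surrounding whitespace."""
    return EMOJI_SPAN.sub('', signature).strip()

sample = 'Keep going <span class="emoji emoji1f604"></span>'
print(strip_emoji_spans(sample))
```

Removing the tag as a unit leaves no stray quotes or angle brackets behind, which would otherwise end up as noise in the segmented text.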

From the bar chart, we can see that in my WeChat friends' signatures, positive sentiment accounts for about 55.56%, neutral sentiment for about 32.10%, and negative sentiment for about 12.35%. This is basically consistent with what the word cloud shows, and it indicates that about 87.66% (55.56% + 32.10%) of the signatures are neutral or positive in attitude.
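The bucketing behind these percentages can be sketched on its own. The thresholds 0.33 and 0.66 come from the code above; the scores below are made-up values standing in for SnowNLP sentiment scores between 0 and 1, and bucket_scores and percentages are hypothetical helper names.

```python
def bucket_scores(scores, low=0.33, high=0.66):
    """Split sentiment scores into (negative, neutral, positive) counts."""
    negative = sum(1 for s in scores if s < low)
    positive = sum(1 for s in scores if s > high)
    neutral = len(scores) - negative - positive
    return negative, neutral, positive

def percentages(counts):
    """Convert counts to percentages rounded to two decimals."""
    total = sum(counts)
    return [round(100.0 * c / total, 2) for c in counts]

# Hypothetical scores, not the article's data
scores = [0.05, 0.20, 0.40, 0.50, 0.66, 0.70, 0.90, 0.95]
counts = bucket_scores(scores)
print(counts)
print(percentages(counts))
```

A score exactly on a threshold, such as 0.66 here, falls into the neutral bucket, matching the comparisons used in analyseSignature.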

5. Friend Location

To analyze friends' locations, we mainly extract the Province and City fields. Map visualization in Python is mainly done with the Basemap module, which requires downloading map data from foreign websites and is very inconvenient to use.

Baidu's ECharts is widely used on the front end, and the community provides the pyecharts project, but I noticed that, because of policy changes, ECharts no longer supports exporting maps, so map customization remains a problem. The mainstream technical solution is to configure JSON data for the country's provinces and cities.

Here I use BDP Personal Edition, a zero-programming solution: we export a CSV file with Python and upload it to BDP, where a visual map can be created by simple drag and drop. It could hardly be easier. Here we only show the code that generates the CSV file:

import csv

def analyseLocation(friends):
    headers = ['NickName', 'Province', 'City']
    # Output CSV path, to be filled in
    with open('', 'w', encoding='utf-8', newline='') as csvFile:
        writer = csv.DictWriter(csvFile, headers)
        writer.writeheader()
        # Skip friends[0], the current user
        for friend in friends[1:]:
            row = {}
            row['NickName'] = friend['NickName']
            row['Province'] = friend['Province']
            row['City'] = friend['City']
            writer.writerow(row)

The following figure shows the geographic distribution of my WeChat friends generated in BDP. It shows that my WeChat friends are mainly concentrated in Ningxia and Shaanxi.
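A quick sanity check of this distribution is also possible without a map, by counting the Province field directly. The sketch below is only an illustration: the sample records are invented, and top_provinces is a hypothetical helper, not part of the article's code.

```python
from collections import Counter

def top_provinces(friends, n=3):
    """Return the n most common non-empty Province values, skipping the current user."""
    provinces = [f.get('Province', '') for f in friends[1:]]
    return Counter(p for p in provinces if p).most_common(n)

# Invented sample data; the first element stands for the current user
sample_friends = [
    {'NickName': 'me', 'Province': 'Shaanxi'},
    {'NickName': 'a', 'Province': 'Ningxia'},
    {'NickName': 'b', 'Province': 'Ningxia'},
    {'NickName': 'c', 'Province': 'Shaanxi'},
    {'NickName': 'd', 'Province': ''},
]
print(top_provinces(sample_friends))
```

Friends who left the field empty are dropped before counting, so they do not show up as a province of their own.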

That concludes the details of crawling and analyzing WeChat friend data with Python. For more on this topic, please see my other related articles.