
Crawling and Analyzing Douyin Comment Data with Python

Zhang Tongxue started posting videos on October 4th, and his videos have consistently received a very high number of likes. The video posted on November 17th peaked at 2.5 million likes, after which his follower count started to skyrocket.

So digging into the comments on the November 17th video will best serve our purpose. In addition, to make it easier to learn the crawling, data visualization, and analysis techniques, I have put the full version of the code at the end of the article.

1. Grabbing data

Douyin now has a web version, which makes grabbing data much easier.

[Image: packet capture of the comment request]

Scroll down to the comments section of the web page, filter the browser's network requests for ones containing comment, and keep loading more comments until you spot the comment API endpoint.

With the endpoint identified, you can write a Python program to simulate the request and fetch the comment data.

Leave a reasonable interval between requests to avoid sending too many requests and affecting the service for other users.

There are two things to keep in mind when grabbing comment data (see the sketch below):

  • Sometimes the interface returns empty data, so you need to retry a few times; after completing the slider verification manually, the interface is generally usable again
  • Data may be duplicated between different pages, so the requests need to jump across pages and duplicates should be dropped
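Below is a minimal sketch of this kind of paginated crawl. The endpoint URL, parameter names (cursor, count), and response fields are placeholders, not the real Douyin API; fill them in from your own packet capture.

import time
import requests

# Placeholder endpoint and headers -- replace with the ones you observe
# in your own packet capture of the web comment requests.
URL = "https://example.com/comment/list/"
HEADERS = {"User-Agent": "Mozilla/5.0", "Cookie": "<your cookie>"}

def fetch_page(cursor, count=20, retries=3):
    """Fetch one page of comments, retrying when the response is empty."""
    for _ in range(retries):
        resp = requests.get(URL, headers=HEADERS,
                            params={"cursor": cursor, "count": count})
        data = resp.json().get("comments") or []
        if data:
            return data
        time.sleep(2)          # back off before retrying an empty response
    return []

all_comments, seen_ids = [], set()
cursor = 0
while cursor < 400:            # stop condition depends on how much data you need
    for item in fetch_page(cursor):
        cid = item.get("cid")
        if cid not in seen_ids:          # drop duplicates across pages
            seen_ids.add(cid)
            all_comments.append(item)
    cursor += 20
    time.sleep(1)              # keep an interval between requests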

2. EDA

The November 17th video had 120,000 comments, and I only grabbed a little over 10,000 of them.

[Image: preview of the crawled data]

The text column contains the comments.

Start by doing some exploratory analysis of the data. A couple of EDA tools that can automatically produce basic statistics and charts have been introduced before.

This time I used ProfileReport.

# EDA report (ProfileReport is provided by pandas-profiling; newer releases ship it as ydata-profiling)
from pandas_profiling import ProfileReport

profile = ProfileReport(df, title='Zhang Tongxue Douyin comment data', explorative=True)
profile

[Image: distribution of comments by time]

Looking at the time distribution of the comments: since the video was posted on the 17th, most comments were made on the 17th and 18th. But even much later, up to December 9th, quite a lot of new comments were still being generated, which shows how hot the video really is.

[Image: length distribution of comments]

Most comments are under 20 characters, and basically none exceed 40 characters, so these are short texts.

[Image: identity of the commenters]

99.8% of the commenters are unverified accounts, indicating that they are basically ordinary users.
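As a rough sketch (not from the original article), these three views can also be reproduced directly with pandas; the column names create_time and is_verified are assumptions about what the crawler saved, so adjust them to your own DataFrame.

import pandas as pd
import matplotlib.pyplot as plt

# Assumed columns: 'create_time' (Unix timestamp), 'text' (comment body),
# 'is_verified' (whether the commenter is a verified account).
df['date'] = pd.to_datetime(df['create_time'], unit='s').dt.date

df['date'].value_counts().sort_index().plot(kind='bar', title='Comments per day')
plt.show()

df['text'].str.len().plot(kind='hist', bins=40, title='Comment length')
plt.show()

print(df['is_verified'].value_counts(normalize=True))  # share of verified vs. ordinary users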

3. LDA

The statistics above are still too coarse. But it's also not feasible to read through all 12,000 comments one by one to find out what people care about.

So the comments need to be grouped first, which is essentially lifting the data to a higher level of abstraction. Only by abstracting the data and understanding what each dimension means and what share it accounts for can we grasp the data from a global perspective.

Here I used the LDA algorithm to cluster the text; comments that end up grouped together can be regarded as belonging to the same topic.

The core idea of the LDA algorithm is twofold:

  • Texts with a certain similarity are clustered together to form a topic. Each topic contains the words needed to generate that topic, along with the probability distribution of those words, so the category of the topic can be inferred by a human.
  • Each document has a probability distribution over all topics, from which you can infer which topic the document belongs to.

For example, after clustering with the LDA algorithm, if a certain topic gives a high probability to words such as "war" and "military spending", we can label that topic as military. If a document has a high probability of belonging to the military topic, we can categorize that document as military.

After a brief introduction to the theory of LDA, let's go into practice.

3.1 Word segmentation and stopword removal

# Word segmentation
import jieba

emoji = {'Pity', 'Daze', 'Dizzy', 'a bright idea', 'High five', 'Send a heart', 'Sobbing', 'Yawn', 'Lick the screen', 'Snickering', 'Pleasant', 'See you later', '666', 'Kumoshi', 'Awkward Laughter', 'Spit your tongue out', 'Skimming', 'Look', 'cuckold', 'Cover your face', 'Dull Innocent', 'Strong', 'Shocked', 'Insidious', 'Absolute', 'Give it a go', 'In your face', 'Coffee', 'Decline', 'Cheer together', 'Cool and tuggy', 'shedding tears', 'Blackface', 'Love', 'Crying with laughter', 'Wit', 'Sleepy', 'Smiling Kangaroo', 'Strong', 'Shut up.', 'Come and see me', 'Color', 'Bean Smile', 'Smile without being rude', 'Red face', 'Nose-picking', 'Naughty', 'Violet don't go', 'Likes', 'Compare hearts', 'Leisurely', 'Rose', 'Cuddle', 'A Little Applause', 'Handshake', 'Adulterous Smile', 'Shyness', 'Close to tears', 'Shhh.', 'Surprise', 'Pig-headed', 'Spit', 'Observing in the dark', 'Don't look', 'Beer', 'Bare your teeth', 'Getting angry', 'Desperate Stare', 'Laughter', 'Spitting blood', 'Bad smile', 'Gaze', 'Lovely', 'Embrace', 'Wipe the Sweat', 'Applause', 'Victory', 'Thank you', 'Thinking', 'Smile', 'Doubt', 'I want to be quiet', 'A flash of light', 'White eyes', 'Tearjerker', 'Yeah'}
stopwords = [line.strip() for line in open('stop_words.txt', encoding='UTF-8').readlines()]

def fen_ci(x):
    res = []
    for word in jieba.cut(x):
        # skip stopwords, emoji labels, and the brackets surrounding emoji
        if word in stopwords or word in emoji or word in ['[', ']']:
            continue
        res.append(word)
    return ' '.join(res)

df['text_wd'] = df['text'].apply(fen_ci)

Since the comments contain a lot of emoji, I extracted the text labels corresponding to the emoji and built an emoji set to filter out those emoji words.

3.2 Calling LDA

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

def run_lda(corpus, k):
    # Bag-of-words counts; drop words that appear in fewer than 2 comments
    cntvec = CountVectorizer(min_df=2, token_pattern=r'\w+')
    cnttf = cntvec.fit_transform(corpus)

    # Fit LDA with k topics and get the document-topic distribution
    lda = LatentDirichletAllocation(n_components=k)
    docres = lda.fit_transform(cnttf)

    return cntvec, cnttf, docres, lda

cntvec, cnttf, docres, lda = run_lda(df['text_wd'].values, 8)

After several trials, grouping the comments into 8 topics gave the best results.
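If you want something more systematic than eyeballing the topics, one option (not from the original article) is to compare model perplexity across candidate values of k; this sketch reuses run_lda from above.

# Lower perplexity is (roughly) better; in practice also inspect the topics by eye
for k in range(4, 13, 2):
    _, cnttf_k, _, lda_k = run_lda(df['text_wd'].values, k)
    print(k, lda_k.perplexity(cnttf_k))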

The top 20 words by probability under each topic are shown below:

[Image: word distribution of each topic]
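A sketch of how such a top-words list can be pulled out of the fitted model; note that get_feature_names_out requires a reasonably recent scikit-learn (older versions use get_feature_names).

import numpy as np

words = cntvec.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    # highest-weight (most probable) words first
    top20 = [words[i] for i in np.argsort(weights)[::-1][:20]]
    print(f'Topic {topic_idx}:', ' '.join(top20))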

Based on the probability distribution of these words, I summarized the category of each topic. Topics 0-7 are: "actually watched the whole thing", "knows where the key is", "life in the countryside", "feeding the dog", "filming techniques", "did you lock the door?", "too much salt in the eggs", and "socks under the pillow".

Statistics on the percentage of each topic:

[Image: percentage of each topic]
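These percentages can be computed by assigning each comment to its most probable topic, i.e. taking the argmax over the document-topic matrix docres from above. A small sketch:

df['topic'] = docres.argmax(axis=1)                            # most probable topic per comment
print(df['topic'].value_counts(normalize=True).sort_index())   # share of each topic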

The largest share, shown in red, is Topic 3 (feeding the dog): many people commented that they thought he was cooking for himself and didn't realize it was for the dogs. I thought the same when I watched it.

The other themes are more evenly split across the board.

After labeling the topics, we can see that it is not only the rural life that draws people to Zhang Tongxue, but also the many unconventional shots in his videos.

Finally, a tree diagram shows the topics and the specific comments under each of them.

[Image: tree diagram of topics and corresponding comments]
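The article doesn't say which plotting library produced the tree diagram; as one possible (assumed) approach, pyecharts' Tree chart can render a topic-to-comments hierarchy like this:

from pyecharts import options as opts
from pyecharts.charts import Tree

# Build a two-level hierarchy: topic -> a few example comments (first 5 per topic)
data = [{
    "name": "comments",
    "children": [
        {"name": f"Topic {t}",
         "children": [{"name": c} for c in df.loc[df['topic'] == t, 'text'].head(5)]}
        for t in sorted(df['topic'].unique())
    ],
}]

tree = (
    Tree()
    .add("", data, orient="LR")
    .set_global_opts(title_opts=opts.TitleOpts(title="Topics and example comments"))
)
tree.render("topic_tree.html")   # writes an interactive HTML file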

The core code has been posted throughout the article; the full code can be obtained via the link below.

Code download

Link:/s/1FnIgkW2b_uVtQq1Z-i8PJA
Extraction code: 1234