Crawling Jitterbug video list information using Python

If you come across a Jitterbug (Douyin) vlogger whose videos particularly interest you and you want to download all of them, how do you do it? This article describes how to use Python to export all the video information of a specific user.

Packet analysis

Chrome Developer Tools

In the Jitterbug app, copy the vlogger's homepage address, for example: /kGcU4y/ Then open Chrome on a PC, switch Developer Tools to mobile device emulation (iPhone is chosen here), and paste the homepage address into the browser. The page redirects to /share/user/110677980134

Scroll down the homepage with the Network => XHR tab selected, and you will see a request like this:

:authority: 
:method: GET
:path: /web/api/v2/aweme/post/?user_id=110677980134&sec_uid=&count=21&max_cursor=1561112910000&aid=1128&_signature=3Xf-nxAQgGfUO4SKisB.&dytk=061ae6e81229e178146aa674327eba89
:scheme: https
accept: application/json
accept-encoding: gzip, deflate, br
accept-language: zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7,zh-TW;q=0.6,da;q=0.5
cookie: tt_webid=6690145457198417412; _ga=GA1.2.605400954.1557670882; _ba=BA0.2-20181226-5199e-GIJXgXk9ajNkyFhmv7Wy; _gid=GA1.2.1914501522.1562857517
referer: /share/user/110677980134
user-agent: Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1
x-requested-with: XMLHttpRequest

Screenshot of returned data

Analyzing the URL of the Ajax request, /web/api/v2/aweme/post/?user_id=110677980134&sec_uid=&count=21&max_cursor=1559299764000&aid=1128&_signature=3Xf-nxAQgGfUO4SKisB.&dytk=061ae6e81229e178146aa674327eba89 shows that the request mainly carries the following parameters:

Field	Type	Description
user_id	int	Jitterbug account ID
count	int	Number of items to return; use the default value of 21
max_cursor	int	Cursor of the request; each request passes the max_cursor returned by the previous request
aid	int	Use the default value of 1128
_signature	string	Signature of the parameters, computed for each request
dytk	string	Token sent with each request
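
As a rough illustration of how these parameters combine into the request path, the sketch below assembles the query string with the standard library. The function name build_post_list_path is hypothetical, the signature and dytk values are simply the ones from the captured request, and sec_uid is left out because it was empty in the capture.

```python
from urllib.parse import urlencode

# Path taken from the captured request; the host is not shown in the article.
POST_LIST_PATH = "/web/api/v2/aweme/post/"


def build_post_list_path(user_id, max_cursor, signature, dytk, count=21, aid=1128):
    """Assemble the path and query string of the video-list request."""
    params = {
        "user_id": user_id,
        "count": count,
        "max_cursor": max_cursor,
        "aid": aid,
        "_signature": signature,
        "dytk": dytk,
    }
    # urlencode keeps the insertion order of the dict and escapes unsafe characters.
    return POST_LIST_PATH + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_post_list_path(
        "110677980134", 0,
        "3Xf-nxAQgGfUO4SKisB.", "061ae6e81229e178146aa674327eba89",
    ))
```

Passing max_cursor=0 corresponds to the first page; later pages would pass the cursor returned by the previous response.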

How to obtain the dytk parameter:

The source of the page /share/user/110677980134 contains the following script:

(function() {
  $(function(){
    __M.require('douyin_falcon:page/reflow_user/index').init({
      uid: "110677980134",
      dytk: '061ae6e81229e178146aa674327eba89'
    });
  });
})();

The dytk value is extracted from this script with a regular expression.
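
A minimal sketch of that extraction might look like the following. The pattern and the name extract_dytk are illustrative rather than the author's own code, and the pattern assumes the token is a hexadecimal string, as in the sample above.

```python
import re

# Matches the dytk value inside the inline init script, in single or double quotes.
# Assumption: the token consists only of hexadecimal characters.
DYTK_PATTERN = re.compile(r"dytk:\s*['\"]([0-9a-fA-F]+)['\"]")


def extract_dytk(html):
    """Return the dytk token found in the page source, or None if absent."""
    match = DYTK_PATTERN.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = """
    __M.require('douyin_falcon:page/reflow_user/index').init({
      uid: "110677980134",
      dytk: '061ae6e81229e178146aa674327eba89'
    });
    """
    print(extract_dytk(sample))
```

Returning None when nothing matches lets the caller detect a changed page layout instead of sending a request with an empty token.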

Obtaining _signature is more complicated. Jitterbug has obfuscated and minified its front-end JavaScript, so the algorithm is hard to analyze directly, but you can execute the signing code itself and take the signature it returns.
The JavaScript can be executed with Node.js or with Selenium WebDriver. Selenium WebDriver is recommended here: the Node.js execution environment differs from a browser's, so the signature it computes differs and does not pass verification, whereas Selenium WebDriver drives a local browser and produces the same signature the browser computes when it visits the page directly.
In the formatted JavaScript code, calling the method _bytedAcrawler.sign("110677980134") signs the parameters.
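
The call through Selenium could be sketched roughly as follows. This is not the author's signature() helper: compute_signature and page_url are hypothetical, and page_url is assumed to be a page that loads the formatted signing script so that _bytedAcrawler is defined. Only the WebDriver methods get and execute_script are used.

```python
def compute_signature(driver, user_id, page_url):
    """Run the page's signing function in a real browser and return its result.

    `driver` is expected to be a Selenium WebDriver instance. `page_url` is a
    hypothetical page that loads the formatted signing script, so that
    `_bytedAcrawler` exists in the page's JavaScript context.
    """
    driver.get(page_url)
    # execute_script exposes extra Python arguments to the script as arguments[0], ...
    return driver.execute_script(
        "return _bytedAcrawler.sign(arguments[0]);", str(user_id)
    )


# Possible usage (requires the third-party selenium package and a local Chrome;
# the file URL is a placeholder, not a real path):
#
#   from selenium import webdriver
#   browser = webdriver.Chrome()
#   try:
#       sign = compute_signature(browser, "110677980134", "file:///path/to/signature.html")
#   finally:
#       browser.quit()
```

Passing the user ID through arguments[0] instead of formatting it into the script string avoids quoting problems.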

Code implementation to export homepage video list

import json

import requests

# signature, tk_logger, dict_merge and CHROME_HEADER are the author's helpers,
# not shown in this article.
def get_user_video_list_by_uid(user_id, cursor=0):
  url = '/web/api/v2/aweme/post/?'
  # Compute the _signature and dytk values that the API requires
  sign, dytk = signature(user_id)
  tk_logger.info("sign:%s,dytk:%s" % (sign, dytk))
  if sign is None or dytk is None:
    tk_logger.error("sign [%s] or dytk [%s] is none" % (sign, dytk))
    return None
  headers = dict_merge(CHROME_HEADER, {
    "Accept": "application/json",
    "X-Requested-With": "XMLHttpRequest",
  })
  params = {
    "user_id": user_id,
    "count": "21",
    "max_cursor": cursor,
    "aid": "1128",
    "_signature": sign,
    "dytk": dytk
  }
  res = requests.get(url, headers=headers, params=params)
  tk_logger.info("request url: %s" % res.url)
  # Decode the body and parse it as JSON
  content = res.content.decode("utf8")
  jsn = json.loads(content)
  return jsn
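
Since each request takes the max_cursor returned by the previous one, walking the whole list is a simple loop. The sketch below is one way to write it; iter_user_videos is a hypothetical name, and the response fields aweme_list and has_more are assumptions that should be checked against the actual returned data.

```python
def iter_user_videos(fetch_page, user_id, max_pages=50):
    """Yield video entries page by page, following max_cursor.

    `fetch_page(user_id, cursor)` should return the decoded JSON of one
    request, for example get_user_video_list_by_uid.
    """
    cursor = 0
    for _ in range(max_pages):
        page = fetch_page(user_id, cursor)
        if not page:
            break
        # "aweme_list" and "has_more" are assumed response field names.
        for item in page.get("aweme_list", []):
            yield item
        next_cursor = page.get("max_cursor")
        # Stop when the server reports no more data or the cursor does not advance.
        if not page.get("has_more") or next_cursor in (None, cursor):
            break
        cursor = next_cursor


if __name__ == "__main__":
    # Stand-in for the real request function, used only to demonstrate the loop.
    fake_pages = {
        0: {"aweme_list": [{"aweme_id": "a"}], "max_cursor": 10, "has_more": True},
        10: {"aweme_list": [{"aweme_id": "b"}], "max_cursor": 20, "has_more": False},
    }
    for video in iter_user_videos(lambda uid, cur: fake_pages[cur], "110677980134"):
        print(video["aweme_id"])
```

The max_pages cap is a safety limit so that an unexpected response cannot cause an endless loop.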

Information about the list of acquired videos

Get video information code snippet

import requests

# get_video_info, get_user_info, get_play_address, get_video_cover and TKException
# are the author's helpers, not shown in this article.
def get_video_detail_by_id(video_id):
  url = "/aweme/v1/aweme/detail/?version_code=6.5.0&pass-region=1&pass-route=1&js_sdk_version=1.16.2.7&app_name=aweme&vid=9D5F078E-A1A9-4F64-81C7-F89CA6A3B1DC&app_version=6.5.0&device_id=34712926793&channel=App%20Store&mcc_mnc=46011&aid=1128&screen_width=750&openudid=263bd93f02801d126ca004edccbff8f6e1b19f51&os_api=18&ac=WIFI&os_version=12.3.1&device_platform=iphone&build_number=65014&device_type=iPhone9,1&iid=74239983401&idfa=F39B285A-4B4F-4874-9D7E-C728A892BF6D"
  data = {"aweme_id": video_id}
  headers = {
    "sdk-version": "1",
    "x-Tt-Token": "00fc1e7950db67b5f43a312e9265cdfee513ea70c36d918c871f3bb553347f3db50ffca143b8722327b345816a75efca071d",
    "User-Agent": "Aweme 6.5.0 rv:65014 (iPhone; iOS 12.3.1; en_CN) Cronet",
    "Content-Type": "application/x-www-form-urlencoded",
    "Cookie": "tt_webid=6636348554880222728; __tea_sdk__user_unique_id=6636348554880222728; odin_tt=76d9b82d6e6f2ddfc99719a5b5d44a7d703cf977f0f7bddf8537f93920d57cb9ec33162ee47868b760f6b09e69209bb2f90bad220b75678af850a0dfa9f056e2; install_id=74239983401; ttreq=1$dab0516952a4157c0c11d4993533c09d6e45fc94; sid_guard=fc1e7950db67b5f43a312e9265cdfee5%7C1559955316%7C5184000%7CWed%2C+07-Aug-2019+00%3A55%3A16+GMT; uid_tt=0afcb06309f632d872799ec0ac3b2c80; sid_tt=fc1e7950db67b5f43a312e9265cdfee5; sessionid=fc1e7950db67b5f43a312e9265cdfee5",
    "X-Khronos": "1559956401",
    "X-Gorgon": "8300000000002e40eee38cad71d14037bd1385d18bc973f094f5",
  }
  ret = {}
  # The form-encoded body suggests a POST request
  res = requests.post(url, data=data, headers=headers)
  if res.status_code == 200:
    # tk_logger.info("video detail raw:%s" % res.content.decode("utf8"))
    jsn = res.json()
    detail = jsn.get("aweme_detail", {})
    video_info = get_video_info(detail)
    user_info = get_user_info(detail)
    play_addr = get_play_address(detail)
    video_cover = get_video_cover(detail)
    ret["video_info"] = video_info
    ret["user_info"] = user_info
    ret["play_addr"] = play_addr
    ret["video_cover"] = video_cover
  else:
    raise TKException("get video detail failed [%s][%d]" % (url, res.status_code))
  return ret

Download video code snippet

import os

import requests

# DOWNLOAD_DIR, IOS_HEADER, get_remote_file_size, tk_logger and TKException
# are the author's helpers, not shown in this article.
def download_video(detail):
  url = detail.get("play_addr", {}).get("url_list", [])
  if len(url) == 0:
    raise TKException("cannot get video url list [%s]" % detail)

  url = url[0]
  folder = DOWNLOAD_DIR + '/' + detail.get('user_info', {}).get("uid", "unknown")
  if not os.path.exists(folder):
    os.makedirs(folder)
  video_id = detail.get('video_info', {}).get('statistics', {}).get('aweme_id')
  # filename = "%s/%s" % (folder, detail.get("video_info", {}).get("desc", video_id) + ".mp4")
  filename = "%s/%s" % (folder, video_id + ".mp4")
  tk_logger.info("download video %s" % url)
  if os.path.exists(filename):
    # Skip the download when the local file already has the full remote size
    file_size = get_remote_file_size(url)
    if file_size == os.path.getsize(filename):
      tk_logger.info("file already downloaded, skip ...")
      return
    else:
      tk_logger.info("download file , file size:%d" % file_size)
  # stream=True avoids loading the whole video into memory before writing
  res = requests.get(url, headers=IOS_HEADER, stream=True)
  if res.status_code == 200:
    with open(filename, "wb") as fp:
      for chunk in res.iter_content(chunk_size=1024):
        fp.write(chunk)
  else:
    raise TKException("download video [%s] failed [%d]" % (url, res.status_code))

# Usage
detail = get_video_detail_by_id(video_id)
download_video(detail)
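
The helper get_remote_file_size is used above but not shown. One possible sketch, using the standard library's urllib instead of requests, sends a HEAD request and reads the Content-Length header. It assumes the video server answers HEAD requests and reports a length; the headers and opener parameters are illustrative additions, not part of the author's code.

```python
import urllib.request


def get_remote_file_size(url, headers=None, opener=urllib.request.urlopen):
    """Return the Content-Length reported for `url`, or -1 if it is missing.

    `opener` defaults to urllib.request.urlopen and exists only so that the
    network call can be replaced in tests.
    """
    # A HEAD request asks for the response headers without the body.
    request = urllib.request.Request(url, headers=headers or {}, method="HEAD")
    with opener(request) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else -1
```

Returning -1 when the header is absent means the size comparison in download_video never matches, so the file is downloaded again rather than skipped by mistake.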

Download Video

Summary

This article has given a brief introduction to crawling Jitterbug video list information with Python. I hope it helps you; if you have any questions, please leave me a message and I will reply promptly. Thank you very much for supporting my website!
If you found this article helpful, feel free to reprint it, and please credit the source. Thank you!