SoFunction
Updated on 2024-11-16

Python crawler: multi-threaded download of Kuaishou (Kwai) videos

Environment: python 2.7 + win10

Tools: Fiddler, Postman, an Android emulator

First, open Fiddler. Fiddler is an HTTP/HTTPS capture proxy; I won't introduce it in detail here.

Configure it to decrypt HTTPS traffic.

 

Configure it to allow remote connections, i.e. enable the HTTP proxy.

 

Computer IP: 192.168.1.110

Then make sure the phone and the computer are on the same LAN and can reach each other. Since I don't have an Android phone, I used an Android emulator instead; it works the same way.

Open the phone's browser and go to 192.168.1.110:8888 (the proxy address), then install the certificate so HTTPS traffic can be captured.

 

After installing the certificate, edit the network in the Wi-Fi settings to manually specify the HTTP proxy.

 

After saving, Fiddler can capture the app's traffic. Open Kuaishou and refresh, and a lot of HTTP requests come in. The API endpoints are fairly obvious, and you can see the responses are JSON.

 

It is an HTTP POST request, and the response is JSON. Expanding it shows information for 20 videos in total. To make sure it's correct, open one of the video links and check.
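To illustrate, here is a minimal sketch of pulling the video entries out of such a response. The field names (`feeds`, `caption`, `photo_id`, `main_mv_urls`) are the ones the crawler code later in this article uses; the sample JSON below is a hand-made stand-in for a real 20-entry response.

```python
import json

# Hand-made stand-in for the captured response; a real one has 20
# entries under "feeds".
sample = '''
{
  "feeds": [
    {"caption": "demo video",
     "photo_id": 123,
     "main_mv_urls": [{"url": "http://example.com/v.mp4"}]}
  ]
}
'''

result = json.loads(sample)
videos = [(f["caption"], f["photo_id"], f["main_mv_urls"][0]["url"])
          for f in result["feeds"]]
print(videos)
```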

 

OK, it plays fine: clean, and no watermark.

So now open Postman and replay this POST to see which parameters are actually validated.

 

There are quite a few parameters, and I assumed the client_key and sign would be validated server-side... but it turns out I was wrong: nothing is validated, so I just submitted the request as-is.

Submitting as form-data returns an error.

 

Then switch to raw.

 

The error message is different. Try adding headers.

 

Nice, it returns data successfully. I tried a few more times and found that the result differs on each request, always 20 videos. One of the POST parameters is page=1, which may pin it to the first page, like endlessly pull-to-refreshing on the phone without scrolling down. It doesn't really matter, as long as the returned data isn't duplicated.
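The replay that Postman performs can be sketched with the standard library. This is a Python 3 sketch (the article's code is Python 2); the endpoint, form fields, and the two headers are taken from the capture above, and the `urlopen` call is left commented out so the sketch doesn't require network access.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint from the capture (query string shortened here for readability;
# the full URL appears in the crawler code below).
url = "http://101.251.217.210/rest/n/feed/hot"
data = {
    "type": 7, "page": 1, "count": 20,
    "os": "android",
    "client_key": "3c2cd3f3",
    "sig": "22769f2f5c0045381203fc57d1b5ad9b",
}
body = urlencode(data).encode("utf-8")

# Without the right Content-Type the server rejects the request, so set
# the headers that Postman needed.
req = Request(url, data=body, headers={
    "User-Agent": "kwai-android",
    "Content-Type": "application/x-www-form-urlencoded",
})

# resp = urlopen(req).read()   # uncomment to actually send the POST
print(body.decode())
```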

Here's the code.

# -*-coding:utf-8-*-
# author : Corleone
import urllib2,urllib
import json,os,re,socket,time,sys
import Queue
import threading
import logging
# Log module
logger = logging.getLogger("AppName")
formatter = logging.Formatter('%(asctime)s %(levelname)-5s: %(message)s')
console_handler = logging.StreamHandler(sys.stdout)
console_handler.formatter = formatter
logger.addHandler(console_handler)
logger.setLevel(logging.INFO)
video_q = Queue.Queue()  # Video queue
def get_video():
  url = "http://101.251.217.210/rest/n/feed/hot?app=0&lon=121.372027&c=BOYA_BAIDU_PINZHUAN&sys=ANDROID_4.1.2&mod=HUAWEI(HUAWEI%20C8813Q)&did=ANDROID_e0e0ef947bbbc243&ver=5.4&net=WIFI&country_code=cn&iuid=&appver=5.4.7.5559&max_memory=128&oc=BOYA_BAIDU_PINZHUAN&ftt=&ud=0&language=zh-cn&lat=31.319303"
  data = {
    'type': 7,
    'page': 2,
    'coldStart': 'false',
    'count': 20,
    'pv': 'false',
    'id': 5,
    'refreshTimes': 4,
    'pcursor': 1,
    'os': 'android',
    'client_key': '3c2cd3f3',
    'sig': '22769f2f5c0045381203fc57d1b5ad9b'
  }
  req = urllib2.Request(url)
  req.add_header("User-Agent", "kwai-android")
  req.add_header("Content-Type", "application/x-www-form-urlencoded")
  params = urllib.urlencode(data)
  try:
    html = urllib2.urlopen(req, params).read()
  except urllib2.URLError:
    logger.warning(u"Network is unstable. Retrying access.")
    html = urllib2.urlopen(req, params).read()
  result = json.loads(html)
  reg = re.compile(u"[\u4e00-\u9fa5]+")  # Match Chinese characters only
  for x in result['feeds']:
    try:
      title = x['caption'].replace("\n", "")
      name = " ".join(reg.findall(title))
      video_q.put([name, x['photo_id'], x['main_mv_urls'][0]['url']])
    except KeyError:
      pass
def download(video_q):
  path = u"D:\\kuaishou"
  while True:
    data = video_q.get()
    name = data[0].replace("\n", "")
    id = data[1]
    url = data[2]
    file = os.path.join(path, name + ".mp4")
    logger.info(u"Downloading: %s" % name)
    try:
      urllib.urlretrieve(url, file)
    except IOError:
      # Fall back to the video id when the caption makes an invalid filename
      file = os.path.join(path, u"video_%s.mp4" % id)
      try:
        urllib.urlretrieve(url, file)
      except (socket.error, urllib2.URLError):
        logger.warning(u"Request disconnected. Sleeping for 2 seconds.")
        time.sleep(2)
        urllib.urlretrieve(url, file)
    logger.info(u"Download complete: %s" % name)
    video_q.task_done()
def main():
  # Usage help
  try:
    threads = int(sys.argv[1])
  except (IndexError, ValueError):
    print u"\n usage: " + sys.argv[0] + u" [number of threads:10] \n"
    print u"e.g.: " + sys.argv[0] + " 10" + u"  crawls about 2000 videos per run with 10 threads"
    return False
  # Make sure the download directory exists
  if os.path.exists(u"D:\\kuaishou") == False:
    os.makedirs(u"D:\\kuaishou")
  # Parse the feed
  logger.info(u"Crawling the feed.")
  for x in range(1, 100):
    logger.info(u"Request %s." % x)
    get_video()
  num = video_q.qsize()
  logger.info(u"Total %s videos" % num)
  # Multi-threaded downloads
  for y in range(threads):
    t = threading.Thread(target=download, args=(video_q,))
    t.setDaemon(True)
    t.start()
  video_q.join()
  logger.info(u"----------- All done ---------------")
main()
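Since Python 2 (and with it `urllib2` and `Queue`) is end-of-life, here is a minimal Python 3 sketch of the same pattern the script uses: a shared queue, daemon worker threads, and `join()` to wait until every item is processed. The actual download call is stubbed out (commented) so the sketch runs without network access.

```python
import queue
import threading

video_q = queue.Queue()
done = []
lock = threading.Lock()

def download(q):
    # Worker: pull (name, url) pairs until the queue is drained.
    while True:
        name, url = q.get()
        # urllib.request.urlretrieve(url, name + ".mp4")  # real download
        with lock:
            done.append(name)
        q.task_done()

for i in range(5):
    video_q.put(("video%d" % i, "http://example.com/%d.mp4" % i))

# Daemon threads exit automatically when the main thread finishes,
# which is why the original script can loop forever inside download().
for _ in range(3):
    threading.Thread(target=download, args=(video_q,), daemon=True).start()

video_q.join()   # blocks until task_done() was called for every item
print(sorted(done))
```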

Tested below

 

Multi-threaded downloading fetches about 2000 videos per run. By default they are saved to D:\kuaishou.

 

Okay, that's it for this one. It's actually quite simple; I can't believe Kuaishou's API isn't signed or encrypted, because when I crawled Douyin I ran into exactly that problem...

Summary

The above is a brief introduction to multi-threaded downloading of Kuaishou videos with a Python crawler. I hope it helps; if you have any questions, leave me a message and I will reply promptly. Thank you very much for supporting this site!