SoFunction
Updated on 2024-11-15

Sample code for a multi-threaded Python crawler that downloads web page images

Goal

Well, we know that searching and browsing websites turns up lots of beautiful images.

To download them, we have to hover over each one and click through page after page by hand.

So, is there a way to recognize and download the images automatically, without the manual work? Beautiful.

Let's use the Python language to build a crawler that grabs and downloads web images.

And of course, for efficiency, we will download in parallel with multiple threads.

Analysis of ideas

Python has plenty of third-party libraries that can help us implement a wide variety of features. The problem is figuring out what we need:

1) An HTTP request library that can fetch a page's source code from its URL, and can also download images and write them to disk.

2) A way to parse the page source and pick out image link addresses, e.g. regular expressions or a simple third-party parser.

3) Support for multiple threads or thread pools.

4) If possible, a way to masquerade as a browser, or to bypass the site's checks. (Well, chances are the site is crawler-proof ;-))

5) If possible, helpers for creating directories, random strings, dates and times, and so on.

So, let's start hacking away. o(∩_∩)o~

Environment Configuration

Operating system: Windows or Linux, either works

Python version: Python 3.6 (not Python 2)

Libraries and modules

threading built-in module, or concurrent.futures thread pools (Python 3.2+)

re, the built-in regular expression module

os, the built-in operating system module

urllib.request and http.cookiejar, the built-in HTTP request and cookie modules

chardet, an optional third-party library for character encoding detection

Coding process

Let's break down the process. The full source code is provided at the end of the blog post.

Masquerading as a browser

import urllib.request
import http.cookiejar

# ------ masquerading as a browser ------
def makeOpener(head={
  'Connection': 'Keep-Alive',
  'Accept': 'text/html, application/xhtml+xml, */*',
  'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'
  }):
  cj = http.cookiejar.CookieJar()  # keep cookies across requests
  opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
  header = []
  for key, value in head.items():
    elem = (key, value)
    header.append(elem)
  opener.addheaders = header  # these headers go out with every request
  return opener
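The same cookie-plus-headers pattern can be checked in isolation. In this minimal standalone sketch (the header values are illustrative), `HTTPCookieProcessor` stores cookies from responses in the `CookieJar` and sends them back on later requests, while `addheaders` is simply a list of `(name, value)` tuples attached to every request the opener makes:

```python
import urllib.request
import http.cookiejar

# Standalone sketch of the opener pattern: cookies persist in the jar,
# and every request carries the headers below (values are illustrative).
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0'),   # pretend to be a browser
    ('Accept', 'text/html'),
]
print(dict(opener.addheaders)['User-Agent'])  # -> Mozilla/5.0
```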

Getting the source code of a web page

# ------ Getting the source code of a web page ------
# url: web page link address
def getHtml(url):
  print('url=' + url)
  oper = makeOpener()
  if oper is not None:
    page = oper.open(url)
  else:
    req = urllib.request.Request(url)
    # crawler masquerading as a browser
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0')
    page = urllib.request.urlopen(req)
  html = page.read()
  if collectHtmlEnabled:  # whether to capture the html
    with open('html.txt', 'wb') as f:  # filename is illustrative, the original was lost
      f.write(html)  # capture to a local file for analysis
  # ------ decode the html bytes into a string ------
  if chardetSupport:
    cdt = chardet.detect(html)
    charset = cdt['encoding']  # encoding detected by chardet
  else:
    charset = 'utf8'
  try:
    result = html.decode(charset)
  except UnicodeDecodeError:
    result = html.decode('gbk')
  return result
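The decode-then-fall-back idea at the end of getHtml can be demonstrated on its own: GBK-encoded bytes (common on Chinese sites) are not valid UTF-8, so decoding raises and the fallback branch recovers the text.

```python
# GBK bytes are not valid UTF-8, so the fallback branch recovers the text.
raw = '你好'.encode('gbk')
try:
    text = raw.decode('utf8')
except UnicodeDecodeError:
    text = raw.decode('gbk')  # fall back to gbk
print(text)  # -> 你好
```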

Download individual images

# ------ Download an image from its url ------
# folderPath: directory where the image is stored; imgUrl: link address of one image; index: position of the image in the batch
def downloadImg(folderPath, imgUrl, index):
  # ------ Exception handling ------
  try:
    imgContent = (urllib.request.urlopen(imgUrl)).read()
  except urllib.error.URLError as e:
    if printLogEnabled: print('[Error] Current image cannot be downloaded')
    return False
  except Exception as e:
    if printLogEnabled: print('[Error] Exception while downloading the current image')
    return False
  else:
    imgeNameFromUrl = os.path.basename(imgUrl)
    if printLogEnabled: print('Downloading image No. ' + str(index+1) + ', image address: ' + str(imgUrl))
    # ------ IO handling ------
    isExists = os.path.exists(folderPath)
    if not isExists:  # create the directory if it does not exist
      os.makedirs(folderPath)
    # image naming rule: random string when the url basename is too short
    imgName = imgeNameFromUrl
    if len(imgeNameFromUrl) < 8:
      imgName = random_str(4) + random_str(1, '123456789') + random_str(2, '0123456789') + "_" + imgeNameFromUrl
    filename = folderPath + "\\" + str(imgName) + ".jpg"
    try:
      with open(filename, 'wb') as f:
        f.write(imgContent)  # write to local disk
    except IOError:
      return False
    return True
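The downloadImg function calls a random_str helper that is defined in the full source on GitHub but not shown in this post. A plausible sketch, with the signature inferred from the calls above (so treat the default alphabet as an assumption):

```python
import random

# Hypothetical reconstruction of random_str(num, allowed_chars):
# returns a random string of length num drawn from allowed_chars.
def random_str(num, allowed_chars='abcdefghijklmnopqrstuvwxyz'
                                  'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'):
    return ''.join(random.choice(allowed_chars) for _ in range(num))

# same shape as the call in downloadImg: 4 letters/digits, 1 nonzero digit, 2 digits
name = random_str(4) + random_str(1, '123456789') + random_str(2, '0123456789')
```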

Download a batch of images (multi-threaded and thread pool modes are both supported)

# ------ Batch download images ------
# folderPath: directory where images are stored; imgList: list of image links
def downloadImgList(folderPath, imgList):
  index = 0
  if not poolSupport:
    # ------ Multi-threaded programming ------
    threads = []
    for imgUrl in imgList:
      threads.append(threading.Thread(target=downloadImg, args=(folderPath, imgUrl, index,)))
      index += 1
    for t in threads:
      t.daemon = True
      t.start()
    for t in threads:
      t.join()  # parent thread waits for all threads to end
    if len(imgList) > 0: print('End of download, directory of stored images: ' + str(folderPath))
  else:
    # ------ Thread pool programming ------
    futures = []
    # create a thread pool with at most thePoolSize workers (global variable)
    with concurrent.futures.ThreadPoolExecutor(max_workers=thePoolSize) as pool:
      for imgUrl in imgList:
        futures.append(pool.submit(downloadImg, folderPath, imgUrl, index))
        index += 1
      result = concurrent.futures.wait(futures, timeout=None, return_when='ALL_COMPLETED')
      suc = 0
      for f in result.done:
        if f.result(): suc += 1
      print('End of download, total: ' + str(len(imgList)) + ', number of successes: ' + str(suc) + ', directory for storing images: ' + str(folderPath))

Invocation example

For example, a Baidu Tieba post

# ------ Download all images in a Baidu Tieba post ------
# folderPath: directory where the images are stored; url: Baidu Tieba post link
def downloadImgFromBaidutieba(folderPath='tieba', url='/p/5256331871'):
  html = getHtml(url)
  # ------ Find image addresses by matching the page content with a regular expression ------
  #reg = r'src="(.*?\.jpg)"'
  reg = r'src="(.*?/sign=.*?\.jpg)"'
  imgre = re.compile(reg)
  imgList = re.findall(imgre, html)
  print('Number of images found: ' + str(len(imgList)))
  # download the images
  if len(imgList) > 0: downloadImgList(folderPath, imgList)

# Program entry
if __name__ == '__main__':
  now = datetime.datetime.now().strftime('%Y-%m-%d %H-%M-%S')
  # Download all images in a Baidu Tieba post
  downloadImgFromBaidutieba('tieba\\' + now, '/p/5256331871')
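The /sign= regex can be tried on a made-up HTML snippet (the URLs below are illustrative, not real Tieba markup); only the .jpg link whose path contains /sign= is captured:

```python
import re

# Illustrative snippet, not real Tieba markup
sample = ('<img src="https://imgsa.baidu.com/forum/pic/sign=abc123/photo1.jpg">'
          '<img src="https://example.com/logo.png">')
reg = r'src="(.*?/sign=.*?\.jpg)"'
matches = re.findall(reg, sample)
print(matches)  # -> ['https://imgsa.baidu.com/forum/pic/sign=abc123/photo1.jpg']
```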

Effect

The full source code can be found on my GitHub: /SvenAugustus/PicDownloader-example

This is the whole content of this article.