Goal
We all know that browsing the web turns up page after page of beautiful images. To download them, we normally have to mouse over each one and flip through the pages by hand.
So, is there a way to recognize and download the images automatically, without the manual work?
There is: we can use the Python language to build a crawler that grabs and downloads web images.
And of course, for efficiency, we will download with multiple threads in parallel.
Analysis of ideas
Python has a lot of third-party libraries that can help us implement all kinds of features. The problem is figuring out which ones we need:
1) An HTTP request library that can fetch a page's source code from its address, and can also download images and write them to disk.
2) A way to parse the page source and recognize the image link addresses, e.g. regular expressions or a simple third-party parsing library.
3) Support for multiple threads or a thread pool.
4) If possible, a way to masquerade as a browser, or to bypass the site's checks. (Chances are the site is crawler-proof ;-))
5) If possible, helpers for automatically creating directories, plus random strings, dates and times, and other related bits.
So, let's start messing around. o(∩_∩)o~
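Before building the real thing, here is the whole idea in miniature, just to see the five needs above working together. This is a standalone sketch under a few assumptions (a generic page whose images are plain .jpg links; fetch_html, save_image, grab_all and start_url are names I made up for illustration), not the final code of this post:

import re
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch_html(url):
    # Needs 1 and 4: an HTTP request, faked as a browser via the User-Agent header
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return urllib.request.urlopen(req).read().decode('utf8', errors='replace')

def save_image(img_url):
    data = urllib.request.urlopen(img_url).read()
    with open(img_url.split('/')[-1], 'wb') as f:  # name the file after the url tail
        f.write(data)

def grab_all(start_url):
    # Need 2: recognize image links with a regular expression
    img_urls = re.findall(r'src="(.*?\.jpg)"', fetch_html(start_url))
    # Need 3: download in parallel with a small thread pool
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(save_image, img_urls))  # consume results so errors surface

The real code below adds the missing pieces: cookies, logging, encoding detection, directory creation, and a choice between raw threads and a pool.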
Environment Configuration
Operating system: Windows or Linux will both do
Python version: Python 3.6 (not Python 2!)
Third-party libraries
threading: built-in multithreading module; concurrent.futures for the thread pool (Python 3.2+)
re: built-in regular expression module
os: built-in operating system module
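One note before the code: the snippets below also rely on a handful of module-level switches (printLogEnabled, collectHtmlEnabled, chardetSupport, poolSupport, thePoolSize) and a small random_str helper, all defined in the full source. A plausible setup looks like this (the imports match what the snippets use; the exact flag values and the random_str body are my assumptions):

import os
import re
import datetime
import random
import string
import threading
import urllib.request
import urllib.error
import concurrent.futures

printLogEnabled = True      # print progress logs while downloading
collectHtmlEnabled = False  # save the fetched html to a local file for analysis
poolSupport = True          # True: thread-pool mode, False: one thread per image
thePoolSize = 10            # maximum number of workers in the thread pool

try:
    import chardet          # optional third-party encoding detector
    chardetSupport = True
except ImportError:
    chardetSupport = False  # fall back to utf8/gbk guessing

def random_str(n, chars=string.ascii_lowercase):
    # hypothetical helper: n random characters drawn from chars
    return ''.join(random.choice(chars) for _ in range(n))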
coding process
Let's break down the process. The full source code is provided at the end of the blog post.
Masquerading as a browser
import urllib.request
import http.cookiejar

# ------ Masquerade as a browser ------
def makeOpener(head={
        'Connection': 'Keep-Alive',
        'Accept': 'text/html, application/xhtml+xml, */*',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}):
    cj = http.cookiejar.CookieJar()  # cookie jar so the session keeps its cookies
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    header = []
    for key, value in head.items():
        elem = (key, value)
        header.append(elem)
    opener.addheaders = header  # attach the browser-like headers
    return opener
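A quick sanity check (illustrative only; example.com is a placeholder URL): the opener carries our browser-like headers and keeps cookies across requests, which is often enough to pass simple anti-crawler checks.

oper = makeOpener()
print(oper.addheaders)                   # the headers we attached above
page = oper.open('https://example.com')  # the opener works like urlopen
print(page.getcode())                    # 200 on success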
Getting the source code of a web page
# ------ Get the source code of a web page ------
# url: the web page address
def getHtml(url):
    print('url=' + url)
    oper = makeOpener()
    if oper is not None:
        page = oper.open(url)
        # print('-----oper----')
    else:
        req = urllib.request.Request(url)
        # Masquerade the crawler as a browser
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0')
        page = urllib.request.urlopen(req)
    html = page.read()
    if collectHtmlEnabled:  # whether to capture the html
        with open('html.txt', 'wb') as f:  # (filename assumed; the original name was lost)
            f.write(html)  # keep a local copy to analyze
    # ------ Decode the html bytes into a UTF-8 string ------
    if chardetSupport:
        cdt = chardet.detect(html)
        charset = cdt['encoding']  # encoding detected by chardet
    else:
        charset = 'utf8'
    try:
        result = html.decode(charset)
    except:
        result = html.decode('gbk')
    return result
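The charset handling matters because many Chinese sites serve GBK rather than UTF-8: chardet detects the encoding when it is installed, and otherwise the try/except falls back from UTF-8 to GBK. A minimal illustration of that fallback, with made-up data:

raw = '百度'.encode('gbk')      # bytes as a GBK-encoded page would send them
try:
    text = raw.decode('utf8')   # fails: these bytes are not valid UTF-8
except UnicodeDecodeError:
    text = raw.decode('gbk')    # the fallback recovers the text
print(text)                     # -> 百度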
Download individual images
# ------ Download an image by its url ------
# folderPath: directory where the image is stored
# imgUrl: the link address of a single image
# index: the index of this image in the batch
def downloadImg(folderPath, imgUrl, index):
    # ------ Exception handling ------
    try:
        imgContent = (urllib.request.urlopen(imgUrl)).read()
    except urllib.error.HTTPError as e:
        if printLogEnabled: print('[Error] Current image cannot be downloaded')
        return False
    except urllib.error.URLError as e:
        if printLogEnabled: print('[Error] Current image download exception')
        return False
    else:
        imgeNameFromUrl = os.path.basename(imgUrl)
        if printLogEnabled: print('Downloading image No. ' + str(index + 1) + ', image address: ' + str(imgUrl))
        # ------ IO handling ------
        isExists = os.path.exists(folderPath)
        if not isExists:  # create the directory if it does not exist
            os.makedirs(folderPath)
            # print('Creating directory')
        # Image naming rule: a random string when the name from the url is too short
        imgName = imgeNameFromUrl
        if len(imgeNameFromUrl) < 8:
            imgName = random_str(4) + random_str(1, '123456789') + random_str(2, '0123456789') + "_" + imgeNameFromUrl
        filename = folderPath + "\\" + str(imgName) + ".jpg"
        try:
            with open(filename, 'wb') as f:
                f.write(imgContent)  # write to local disk
            # if printLogEnabled: print('Download of image No. ' + str(index + 1) + ' completed')
        except:
            return False
    return True
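An illustrative single call (the URL is a placeholder; the real image addresses come from the regex matching shown later):

ok = downloadImg('demo', 'https://example.com/sign=abc/pic.jpg', 0)
print('saved' if ok else 'failed')

Note the "\\" separator when building filename: it is Windows-flavored. On Linux, os.path.join(folderPath, imgName + '.jpg') would be the portable choice.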
Download a batch of images (both multi-threading and thread-pool modes are supported)
# ------ Batch-download images ------
# folderPath: directory where the images are stored
# imgList: the list of image links
def downloadImgList(folderPath, imgList):
    index = 0
    # print('poolSupport=' + str(poolSupport))
    if not poolSupport:
        # print('Multi-threaded mode')
        # ------ Multithreaded programming ------
        threads = []
        for imgUrl in imgList:
            # if printLogEnabled: print('Ready to download image No. ' + str(index + 1))
            threads.append(threading.Thread(target=downloadImg, args=(folderPath, imgUrl, index,)))
            index += 1
        for t in threads:
            t.setDaemon(True)
            t.start()
        for t in threads:
            t.join()  # the parent thread waits for all threads to end
        if len(imgList) > 0:
            print('Download finished, directory of stored images: ' + str(folderPath))
    else:
        # print('Thread pool mode')
        # ------ Thread pool programming ------
        futures = []
        # Create a thread pool with at most thePoolSize (a global variable) workers
        with concurrent.futures.ThreadPoolExecutor(max_workers=thePoolSize) as pool:
            for imgUrl in imgList:
                # if printLogEnabled: print('Ready to download image No. ' + str(index + 1))
                futures.append(pool.submit(downloadImg, folderPath, imgUrl, index))
                index += 1
            result = concurrent.futures.wait(futures, timeout=None, return_when='ALL_COMPLETED')
            suc = 0
            for f in result.done:
                if f.result():
                    suc += 1
            print('Download finished, total: ' + str(len(imgList)) + ', successes: ' + str(suc) + ', directory of stored images: ' + str(folderPath))
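Two execution modes, one entry point: which branch runs is decided by the global poolSupport switch. Illustratively (the URLs are placeholders standing in for any list of image links):

imgList = ['https://example.com/a/sign=x/1.jpg',
           'https://example.com/a/sign=y/2.jpg']
poolSupport = False                        # raw threads: one thread per image
downloadImgList('demo-threads', imgList)
poolSupport = True                         # bounded pool of thePoolSize workers
downloadImgList('demo-pool', imgList)

The pool mode is usually the better default: it caps concurrency at thePoolSize instead of spawning one thread per image, and the futures make it easy to count how many downloads succeeded.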
Invocation example
For example, downloading from a Baidu Tieba post:
# ------ Download all images in a Baidu Tieba post ------
# folderPath: directory where the images are stored
# url: the link to the Baidu Tieba post
def downloadImgFromBaidutieba(folderPath='tieba', url='https://tieba.baidu.com/p/5256331871'):
    html = getHtml(url)
    # ------ Use a regular expression to match image addresses in the page content ------
    # reg = r'src="(.*?\.jpg)"'
    reg = r'src="(.*?/sign=.*?\.jpg)"'
    imgre = re.compile(reg)
    imgList = re.findall(imgre, html)
    print('Number of images found: ' + str(len(imgList)))
    # Download the images
    if len(imgList) > 0:
        downloadImgList(folderPath, imgList)

# Program entry
if __name__ == '__main__':
    now = datetime.datetime.now().strftime('%Y-%m-%d %H-%M-%S')
    # Download all images in the Baidu Tieba post
    downloadImgFromBaidutieba('tieba\\' + now, 'https://tieba.baidu.com/p/5256331871')
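The pattern r'src="(.*?/sign=.*?\.jpg)"' is tailored to Tieba, whose image URLs contain a /sign= segment; for another site you would loosen it. A hypothetical generic variant (site-dependent, so treat it as a starting point):

# Match any jpg/png/gif inside a src attribute
reg = r'src="([^"]+?\.(?:jpg|png|gif))"'
imgList = re.findall(reg, html)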
Result
The full source code can be found on my GitHub: https://github.com/SvenAugustus/PicDownloader-example
That's all for this article.