SoFunction
Updated on 2024-11-16

Crawling NUS-WIDE database images with Python

Our lab needs the original images from the NUS-WIDE database; the address of the dataset is /research/. Because the dataset provides only the URL of each image, a small crawler program is needed to download the images. Using a VPN while downloading is recommended. Since some of the URLs are no longer valid, some of the downloaded files will be invalid images.

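# Python 2.7, Ubuntu 14.04
import re
import urllib
import os

nuswide = "$NUS-WIDE-urls_ROOT"  # location of your NUS-WIDE URL list file
imagepath = "$IMAGE_ROOT"  # directory to download the dataset into
f = open(nuswide, 'r')
url = f.readlines()  # one dataset entry per line
f.close()

# Relative path of the image file (from "ImageData" up to "jpg")
location_re = re.compile(r"ImageData.+?jpg")
# Directory part of that path (everything before the first "/0")
direction_re = re.compile(r"(ImageData.+?)/0")
# Every image URL on the line
image_re = re.compile(r"http.+?jpg")

for i in url:
  filename = re.findall(location_re, i)
  direction = re.findall(direction_re, i)
  image = re.findall(image_re, i)
  if image:
    path = imagepath+filename[0]
    path_n = imagepath+direction[0]
    print path_n
    # Create the target directory if it does not exist yet
    if not os.path.isdir(path_n):
      os.makedirs(path_n)
    # image[1] is the second URL found on the line
    urllib.urlretrieve(image[1], path)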
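
Because some of the URLs are dead, a crawler like the one above can stop on the first failed request. The following is a minimal sketch of one way to skip such entries. It assumes Python 3 (`urllib.request`) rather than the Python 2.7 used above, and the helper names `parse_line` and `try_download` are hypothetical, not part of the original script. The regular expressions are the same three as above.

```python
import http.client
import os
import re
import urllib.request

# Same three patterns as the script above.
LOCATION_RE = re.compile(r"ImageData.+?jpg")
DIRECTION_RE = re.compile(r"(ImageData.+?)/0")
IMAGE_RE = re.compile(r"http.+?jpg")


def parse_line(line):
    """Return (relative_path, relative_dir, urls), or None if the line is unusable."""
    filename = LOCATION_RE.findall(line)
    direction = DIRECTION_RE.findall(line)
    urls = IMAGE_RE.findall(line)
    if not (filename and direction and urls):
        return None
    return filename[0], direction[0], urls


def try_download(url, dest, timeout=10):
    """Download url to dest. Return False instead of raising when the link fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = resp.read()
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError covers malformed URLs.
        return False
    # Create the target directory only once the download has succeeded.
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    with open(dest, "wb") as out:
        out.write(data)
    return True
```

A caller would loop over the lines of the URL list, call `parse_line` on each, and pass one of the returned URLs together with the joined destination path to `try_download`, counting the `False` results to see how many links have expired. This only detects failed requests; a URL that still responds but serves a placeholder image would need a separate check.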

Here is another small crawler to share, this one for downloading the images in a Baidu Tieba post:

#coding=utf-8

# The urllib module provides an interface for reading data from web pages.
import urllib
# The re module provides regular expression support.
import re

# Define a getHtml() function that returns the source of the page at url.
def getHtml(url):
  page = urllib.urlopen(url)  # urllib.urlopen() opens a URL
  html = page.read()  # read() reads the data from the opened URL
  return html

def getImg(html):
  reg = r'src="(.+?\.jpg)" pic_ext'  # Regular expression that captures the image address
  imgre = re.compile(reg)  # re.compile() compiles the pattern into a regular expression object
  imglist = re.findall(imgre, html)  # re.findall() returns every image address in html that matches imgre
  # Loop over the matched image addresses and save each one locally.
  # The core is urllib.urlretrieve(), which downloads remote data directly to a local file; the images are named incrementally using x.
  x = 0
  for imgurl in imglist:
    urllib.urlretrieve(imgurl, r'D:\E\%s.jpg' % x)
    x += 1
  return imglist  # return the addresses so the caller can print them


html = getHtml("/p/xxxx")
print getImg(html)
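
The extraction step can be tried without any network access. The sketch below assumes Python 3, where `urlopen(...).read()` returns bytes that must be decoded before matching. The helper names `extract_image_urls` and `plan_downloads` and the sample HTML are hypothetical and only illustrate how the regular expression behaves.

```python
import os
import re

# Same pattern as in getImg(): capture the src of images followed by pic_ext.
IMG_RE = re.compile(r'src="(.+?\.jpg)" pic_ext')


def extract_image_urls(html):
    """Return every image address in html that matches IMG_RE."""
    if isinstance(html, bytes):
        # Python 3 returns bytes from read(); decode before matching.
        html = html.decode("utf-8", errors="replace")
    return IMG_RE.findall(html)


def plan_downloads(urls, out_dir):
    """Pair each URL with an incrementally numbered local file name."""
    return [(url, os.path.join(out_dir, "%d.jpg" % i)) for i, url in enumerate(urls)]


# Made-up page fragment, not real Baidu Tieba markup.
sample_html = (
    '<img src="https://example.com/a.jpg" pic_ext="jpeg">\n'
    '<img src="https://example.com/b.jpg" pic_ext="jpeg">\n'
)
urls = extract_image_urls(sample_html)
plan = plan_downloads(urls, "images")
```

Each `(url, path)` pair in `plan` could then be passed to a download call such as `urllib.request.urlretrieve`. Because the pattern relies on the `pic_ext` attribute directly following `src`, it will stop matching if the page markup changes.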