Installation under Windows:
Download: /pypi/pyquery/#downloads
Download and install:
C:\Python27>easy_install E:\python\pyquery-1.2.
Alternatively, install it directly from PyPI:
C:\Python27>easy_install pyquery
pyquery is a Python library modeled on jQuery. It lets you use jQuery-like syntax to extract any data from a web page, which makes it a very good third-party library for HTML data extraction and mining. Let's look at what pyquery can do.
Extracting information from html strings
#!/usr/bin/python
# -*- coding: utf-8 -*-
from pyquery import PyQuery as pq

html = '''
<html>
    <head>
        <title>this is title</title>
    </head>
    <body>
        <p id="hi">Hello, World</p>
        <p id="hi2">Nihao</p>
        <div class="class1">
            <img src="" />
        </div>
        <ul>
            <li>list1</li>
            <li>list2</li>
        </ul>
    </body>
</html>
'''

d = pq(html)
print d('title')                            # Like a CSS tag selector: get elements by HTML tag
print d('title').text()                     # text() returns the text of the current selection
print d('#hi').text()                       # Like an id selector: get the element by its id
print d('p').filter('#hi2').text()          # filter() narrows the selection by id or class
print d('.class1')                          # Like a class selector
print d('.class1').html()                   # html() returns the inner HTML of the current selection
print d('.class1').find('img').attr('src')  # Find a nested element and read one of its attributes
print d('ul').find('li').eq(0).text()       # Pick one of several identical elements by index
print d('ul').children()                    # Get all child elements
print d('ul').children().eq(0)              # Get a child element by index
print d('img').parents()                    # Get all ancestor elements
print d('#hi').next()                       # Get the next sibling element
print d('#hi').nextAll()                    # Get all following sibling elements
print d('p').not_('#hi2')                   # Return the elements that do not match the selector

# Iterate over all matching elements
for i in d.items('li'):
    print i.text()
print [i.text() for i in d.items('li')]     # The same iteration as a list comprehension

print d.make_links_absolute(base_url='')    # Convert relative links in the document to absolute ones
The snippet above covers the most commonly used pyquery methods. We first define a piece of HTML and then run a series of pyquery methods against it, mainly to get specific elements and their text. pyquery can do more than get elements: it can also set element attributes, add elements, and so on. Since the methods shown above are the ones we use most often, the others are not covered here.
Extracting information from a URL or a local HTML file
pyquery is not limited to parsing HTML strings like the one above:
d = pq(url='/')
A URL can be loaded directly, and the resulting object is used exactly as before. By default the HTTP request is made with the urllib module, but if requests is installed on your system, requests is used instead, which means you can pass most requests parameters, for example:
pq('/', headers={'user-agent': 'pyquery'})
Or, if you already have the HTML file locally, you can load it like this:
d = pq(filename=path_to_html_file)
This points pyquery directly at a local HTML file; everything else works the same as above.
As you can see, pyquery makes selecting elements as convenient as it is in jQuery.
Using pyquery to scrape the Douban movie Top 250
Now that we have seen pyquery's syntax, let's look at an example: scraping the Douban movie Top 250.
Because of Douban's anti-crawler measures, the page could no longer be fetched after a few runs, so I first download the page with requests and then use pyquery to parse the saved file and extract the information:
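from pyquery import PyQuery as pq
import requests

head_req = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Referer': '/top250?start=0',
}

# The site host is missing from the URLs in this listing; prepend it before running.
r = requests.get("/top250?start=0", headers=head_req)

# "top250.html" is a placeholder name; the original filename is not preserved here.
with open("top250.html", "wb") as html:
    html.write(r.content)

d = pq(filename="top250.html")
# print d('ol').find('li').html()
for data in d('ol').items('li'):
    print data.find('.hd').find('.title').eq(0).text()      # First title of the movie
    print data.find('.star').find('.rating_num').text()     # Rating
    print data.find('.quote').find('.inq').text()           # One-line quote
    print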
Run it and see the results:
Shawshank Redemption
9.6
Hope sets people free.

This killer isn't too cold.
9.4
The story that Uncle Dan and Lori had to tell.

Forrest Gump
9.4
A Modern History of the United States.

Xiang Yu bids farewell his favorite concubine (classic subject)
9.4
magnificent style unmatched in his generation (idiom); peerless talent.

Beautiful Life
9.5
The most beautiful lie of all.

Chihuahua
9.2
The Best of Miyazaki, The Best of Hisaishi Jean.

Schindler's List
9.4
Save a man, It's about saving the world.

The Sea Pianist
9.2
Everyone has to walk a path he's determined to follow, Even if it's to the bone.

Robotron (video game series)
9.3
Owari or Wally (name), big life.

Inception
9.2
Nolan gave us a dream we couldn't steal.

RMS Titanic, British passenger liner that sank in 1912
9.1
What's lost is forever.

The Three Stooges vs. Bollywood (TV series)
9.1
Handsome Bean, Highly emotional version of Shears.

Springtime in the Oxford Class
9.2
Heavenly children's voices, It's the closest thing to God that exists.

The Story of the Loyal Dog
9.2
Never forget the ones you love.

chinchilla
9.1
Everyone has a chinchilla in their heart, Childhood never goes away.

The Great Sage Marries
9.1
love of a lifetime.

godfather
9.2
Never hold a grudge against your opponent, This will make you lose your mind.

Gone with the Wind (film)
9.2
Tomorrow is another day.

Paradise Cinema
9.1
Those kissing scenes, those youthful years, All washed away by tears in the darkness of the theater.

When happiness comes knocking
8.9
Civilian inspirational movie.

Fight Club
9.0
Evil and mediocrity lie dormant in the same matrix, To confront each other at a particular time.

Truman's World
9.0
If I never see you again, Good morning to you, Hello (daytime greeting), Good night!

out of reach
9.1
An elegant comedy full of warmth.

Lord of the Rings, comic book superhero3: The King is invincible
9.1
The Epic Finale.

Roman Holiday
8.9
Love is only for a day.
Of course, this is only the first page, with 25 entries. We already know that the URL of the Douban movie Top 250 is
/top250?start=0
The start parameter begins at 0 and increases by 25 for each page, up to
/top250?start=225
So you can write a loop to fetch them all.
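As a minimal sketch of that loop, the snippet below only builds the ten page URLs and does not download anything. The function name top250_page_urls and its base_url argument are hypothetical, and the http://example.invalid host is a placeholder, because the site host does not appear in the URLs above. It is written to run under both Python 2.7 and Python 3.

```python
def top250_page_urls(base_url, page_size=25, total=250):
    """Return the paginated Top 250 URLs: start=0, 25, ..., 225.

    base_url is a hypothetical argument: the scheme and host of the site,
    which are not given in the text above.
    """
    base = base_url.rstrip("/")
    return ["{0}/top250?start={1}".format(base, start)
            for start in range(0, total, page_size)]


# Placeholder host for illustration only; substitute the real one.
urls = top250_page_urls("http://example.invalid")
for url in urls:
    print(url)
```

Each URL could then be passed through the same download-and-parse steps used for the first page. Given the anti-crawler behaviour mentioned earlier, pausing between requests is probably a sensible precaution.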