My first contact with crawlers was in May this year, when I wrote a blog search engine. The crawler it used was also quite intelligent, at least on a much higher level than the one used by the Movie Coming site!
Back to writing crawlers in Python.
Python has always been my main scripting language, bar none. Its syntax is simple and flexible and its standard library is powerful; in everyday use it can serve as a calculator and handle text encoding conversion, image processing, batch downloads, batch text processing, and so on. In short, I like it very much, and the more I use it the more comfortable it feels. Such a good tool, I won't tell just anyone about it.
Because of its powerful string processing capabilities and the existence of the urllib2, cookielib, re, and threading modules, writing crawlers in Python is extremely easy. How easy? I told a classmate at the time that the few crawlers I wrote for Movie Coming, plus the pile of scattered scripts for organizing the data, add up to less than 1,000 lines of code in total, and the Movie Coming website itself is only 150 lines of code. The crawler code lives on another machine, a 64-bit Hackintosh, so I won't list it here; I'll just list the line counts of the site code on the VPS, written with the tornadoweb framework:
[xiaoxia@307232 movie_site]$ wc -l *.py template/*
156
92 template/
79 template/
94 template/
47 template/
77 template/
Below I'll show the process of writing the crawler directly. What follows is for exchange and learning purposes only, nothing more.
As an example, the latest video download resources of a certain bay are at the following URL:
http://sb. or sth. indefinite/browse/200
Because that page contains a huge number of advertisements, I'll only post part of the main content:
For a Python crawler, one line of code is enough to download the source of this page. The urllib2 library is used here.
>>> import urllib2
>>> html = urllib2.urlopen('http://sb. or sth. indefinite/browse/200').read()
>>> print 'size is', len(html)
size is 52977
Of course, you can also use the system function in the os module to call the wget command to download the page content; this is very convenient for students who have already mastered the wget or curl tools.
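For example, a minimal sketch of the wget route (assuming wget is installed; the output file name page.html is chosen here just for illustration):
# Sketch: shell out to wget instead of using urllib2 (wget must be installed).
import os
url = 'http://sb. or sth. indefinite/browse/200'
os.system('wget -q -O page.html "%s"' % url)   # save the page to page.html
html = open('page.html').read()
print 'size is', len(html)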
Using Firebug to inspect the structure of the page, you can see that the main body of the HTML is a table, and each resource is a tr tag.
For each resource, the following information needs to be extracted (a sample record is sketched after this list):
1. Video category
2. Resource name
3. Resource link
4. Resource size
5. Upload time
That's all there is to it, and more can be added if needed.
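For reference, one extracted record might end up looking like this (the field names match those used in the complete code further down; the values are taken from the sample tr snippet shown below):
# A sample of what one extracted record could look like (illustrative values)
{
    'category': 'TV',
    'name': 'The Walking Dead Season 3 Episodes 1-3 HDTV-x264',
    'magnet': 'magnet:?xt=urn:btih:4f63d58e51c1a4a997c6f099b2b529bdbba72741&...',
    'time': '3 minutes ago',
    'size': '2 GiB'
}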
First extract a snippet of code from the tr tag to see what's going on.
<tr>
<td class="vertTh">
<center>
<a href="/browse/200" title="More in this catalog">video</a><br />
(<a href="/browse/205" title="More in this catalog">radio</a>)
</center>
</td>
<td>
<div class="detName"> <a href="/torrent/7782194/The_Walking_Dead_Season_3_Episodes_1-3_HDTV-x264" class="detLink" title="particulars The Walking Dead Season 3 Episodes 1-3 HDTV-x264">The Walking Dead Season 3 Episodes 1-3 HDTV-x264</a>
</div>
<a href="magnet:?xt=urn:btih:4f63d58e51c1a4a997c6f099b2b529bdbba72741&dn=The+Walking+Dead+Season+3+Episodes+1-3+HDTV-x264&tr=udp%3A%2F%%3A80&tr=udp%3A%2F%%3A80&tr=udp%3A%2F%%3A6969&tr=udp%3A%2F%%3A80" title="Download this torrent using magnet"><img src="//static.some/img/" alt="Magnet link" /></a> <a href="//torrents.something/7782194/The_Walking_Dead_Season_3_Episodes_1-3_HDTV-x264." title="Download Seeds"><img src="//static.some/img/" class="dl" alt="Download" /></a><img src="//static.some/img/" /><img src="//static.some/img/" />
<font class="detDesc">Uploaded <b>3 minutes ago</b>, adults and children 2 GiB, uploader <a class="detDesc" href="/user/paridha/" title="Browse paridha">paridha</a></font>
</td>
<td align="right">0</td>
<td align="right">0</td>
</tr>
The following regular expression is used to extract the content of the html code. If you don't know anything about regular expressions, you can go to /2/library/ to learn more.
There is a reason why I use regular expressions rather than other tools for parsing HTML or the DOM tree. I tried using BeautifulSoup3 to extract the content and found it painfully slow: handling 100 items per second was already the limit of my computer. Switching to a compiled regular expression for processing the content, the speed crushed that directly!
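A rough sketch of how such a comparison can be made (this assumes the old BeautifulSoup3 package with its BeautifulSoup import and findAll API; the exact numbers will of course depend on your machine):
# Rough timing sketch: BeautifulSoup3 versus a regular expression on the same page source.
import time, re
import urllib2
from BeautifulSoup import BeautifulSoup          # the BeautifulSoup3 package

html = urllib2.urlopen('http://sb. or sth. indefinite/browse/200').read()

start = time.time()
rows = BeautifulSoup(html).findAll('tr')         # DOM-style extraction
print 'BeautifulSoup3:', time.time() - start, 'seconds for', len(rows), 'rows'

start = time.time()
rows = re.findall(r'<tr>.+?</tr>', html, re.DOTALL)   # regex extraction (compile it for reuse)
print 'regex:', time.time() - start, 'seconds for', len(rows), 'rows'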
With so much to extract, how do I write my regular expressions?
In my past experience, ".*?" or ".+?" (non-greedy matching) does the job. But there are a few small things to keep in mind, which you will discover when you actually use them; a couple of them are sketched below.
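Two of those little things: non-greedy matching stops at the earliest possible point, and "." does not match newlines unless you pass re.DOTALL. A minimal sketch:
# Sketch of the gotchas with ".*?" / ".+?"
import re

s = '<b>one</b> and <b>two</b>'
print re.findall(r'<b>(.*)</b>', s)    # greedy: ['one</b> and <b>two']
print re.findall(r'<b>(.*?)</b>', s)   # non-greedy: ['one', 'two']

# "." does not match newlines by default, so multi-line HTML needs re.DOTALL
s2 = '<tr>\nline1\nline2\n</tr>'
print re.findall(r'<tr>(.+?)</tr>', s2)             # [] -- cannot cross the newlines
print re.findall(r'<tr>(.+?)</tr>', s2, re.DOTALL)  # ['\nline1\nline2\n']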
For the tr tag code above, the first thing I need my expression to match is
<tr>
which marks the beginning of the content. Of course, it could be something else, as long as you don't miss anything you need. Then the content I want to match is the following, to get the video category:
(<a href="/browse/205" title="More in this catalog">TV</a>)
Then I'm going to match the links to the resources.
<a href="..." class="detLink" title="...">...</a>
Then on to the other resource information:
font class="detDesc">Uploaded <b>3 minutes ago</b>, size 2 GiB, uploaded by
And finally, match
</tr>
Great job!
Of course, the final </tr> does not actually have to be expressed in the regular expression; as long as the starting position is located correctly, the positions of the information obtained afterwards will be correct as well.
Friends who know more about regular expressions probably already know how to write this. Below is roughly how the expression I wrote handles it.
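A sketch of that expression, which is the same one that appears in the complete code further down, run here against a condensed copy of the tr snippet above:
# Sketch: run the expression against (a condensed copy of) the tr snippet shown above.
import re

find_re = re.compile(r'<tr>.+?\(.+?">(.+?)</a>.+?class="detLink".+?">(.+?)</a>'
                     r'.+?<a href="(magnet:.+?)" .+?Uploaded <b>(.+?)</b>, Size (.+?),',
                     re.DOTALL)

sample_html = '''<tr>
(<a href="/browse/205" title="More in this catalog">TV</a>)
<a href="/torrent/7782194/..." class="detLink" title="...">The Walking Dead Season 3 Episodes 1-3 HDTV-x264</a>
<a href="magnet:?xt=urn:btih:4f63d58e51c1a4a997c6f099b2b529bdbba72741" title="Download this torrent using magnet">magnet</a>
Uploaded <b>3 minutes ago</b>, Size 2 GiB,
</tr>'''

for category, name, magnet, uptime, size in find_re.findall(sample_html):
    print category, '|', name, '|', size, '|', uptime
    print magnet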
It was as simple as that. The results came out, and I felt quite pleased with myself.
Of course, a crawler designed this way is targeted; it only crawls the content of one particular site. Nor is there any crawler that does not filter the links it collects. Usually, BFS (breadth-first search) can be used to crawl all the page links of a website.
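A minimal sketch of what such a BFS link crawl could look like (the start URL and the same-site filter below are placeholders for illustration):
# Minimal BFS sketch for collecting all page links of one site (illustrative only).
import re
import urllib2
from collections import deque

start_url = 'http://example.com/'                              # placeholder start page
link_re = re.compile(r'href="(http://example\.com/[^"]*)"')    # keep only same-site links

seen = set([start_url])
queue = deque([start_url])
while queue:
    page = queue.popleft()
    try:
        html = urllib2.urlopen(page).read()
    except Exception:
        continue                                               # skip pages that fail to download
    for link in link_re.findall(html):
        if link not in seen:                                   # filter already-collected links
            seen.add(link)
            queue.append(link)
print 'found', len(seen), 'pages'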
Complete Python crawler code to crawl the latest 10 pages of video resources in a bay:
# coding: utf8
import urllib2
import re
import pymongo
db = pymongo.Connection().test
url = 'http://sb. or sth. indefinite/browse/200/%d/3'
find_re = re.compile(r'<tr>.+?\(.+?">(.+?)</a>.+?class="detLink".+?">(.+?)</a>.+?<a href="(magnet:.+?)" .+?Uploaded <b>(.+?)</b>, Size (.+?),', re.DOTALL)
# Directed crawl: the latest 10 pages of video resources
for i in range(0, 10):
    u = url % (i)
    # Download the page
    html = urllib2.urlopen(u).read()
    # Find the resource information
    for x in find_re.findall(html):
        values = dict(
            category = x[0],
            name = x[1],
            magnet = x[2],
            time = x[3],
            size = x[4]
        )
        # Save to the database (the collection name "res" is just the one used here)
        db.res.save(values)
print 'Done!'
The above code is only meant to show the idea. Actually running it requires a mongodb database, and you may not be able to reach a certain bay site at all, so it may not produce normal results.
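For what it's worth, once the records are in mongodb they are just as easy to read back out; a minimal sketch (using the same assumed "res" collection name as in the code above):
# Sketch: read the saved records back out of mongodb (collection name "res" as above).
import pymongo

db = pymongo.Connection().test
for doc in db.res.find().limit(10):
    print doc['name'], doc['size']
    print doc['magnet']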
So, the crawlers used by the Movie Coming website are not hard to write; what is hard is organizing the data after you obtain it so that useful information comes out. For example, how to match a piece of movie information with a resource, and how to establish associations between the movie information database and the video link database. All of this requires constantly trying different methods and finally settling on the more reliable ones.
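For instance, one very crude way to associate a resource name with a movie title is normalized substring matching; the sketch below only illustrates the idea (the movie list and clean-up rules here are made up, and real matching needs far more care than this):
# Crude sketch: match a resource name against movie titles by normalized substring search.
import re

movies = ['The Walking Dead', 'Skyfall']        # pretend this comes from the movie info database

def clean(name):
    # turn separators into spaces, strip common release tags, lowercase
    name = re.sub(r'[._\-]+', ' ', name)
    name = re.sub(r'(?i)(season\s*\d+|episodes?\s*[\d -]+|hdtv|x264|720p|1080p)', '', name)
    return name.strip().lower()

resource = 'The_Walking_Dead_Season_3_Episodes_1-3_HDTV-x264'
key = clean(resource)
for title in movies:
    if title.lower() in key:
        print 'resource "%s" matched movie "%s"' % (resource, title)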
A certain student once sent me an email saying he wanted to get the source code of my crawler, even offering to pay for it.
If I really gave it to him, then, seeing that my crawler is just a few hundred lines of code on a single A4 sheet of paper, wouldn't he go, "what a rip-off!"......
It is said that this is the age of the information explosion, so the competition now is over who has the stronger data mining ability!
Okay, so the question is: which school teaches the best excavator (data mining) skills?