What is a web crawler?
A web crawler is a program that automatically downloads web pages from the World Wide Web for a search engine, and it is an important component of a search engine. A traditional crawler starts from the URLs of one or more seed pages and, while crawling, keeps extracting new URLs from the current page and adding them to a queue, until a stopping condition set by the system is met.
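To make that loop concrete (seed URL, a queue of URLs to visit, and a stopping condition), here is a minimal sketch using only the Python standard library; the seed URL and the max_pages limit are illustrative assumptions rather than anything from the original text.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def crawl(seed, max_pages=10):
    queue = deque([seed])   # URLs waiting to be fetched
    seen = {seed}           # URLs already queued, to avoid duplicates
    while queue and len(seen) <= max_pages:   # stopping condition
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode('utf-8', errors='ignore')
        except Exception:
            continue        # skip pages that fail to download
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)     # resolve relative links
            if absolute.startswith('http') and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen


print(crawl('http://example.com'))   # example seed URL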
What's the point of a crawler?
- Web page collection for general-purpose search engines (Google, Baidu).
- Vertical (domain-specific) search engines.
- Scientific research: empirical work on online human behavior, online community evolution, human dynamics, quantitative sociology, complex networks, data mining, and related fields needs large amounts of data, and a web crawler is a powerful tool for collecting it.
- Snooping, hacking, spamming...
Crawling is the first and easiest step in a search engine pipeline (a minimal sketch of the last two steps follows this list):
- Web page collection
- Indexing
- Query ranking
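As a rough illustration of the two steps that follow collection, the sketch below builds an inverted index from a couple of already-downloaded pages and ranks them for a query by simple term counts; the pages and the scoring rule are made-up assumptions, not part of the original.

from collections import defaultdict

pages = {                       # url -> downloaded text (stand-ins for real pages)
    'http://a.example': 'python web crawler tutorial',
    'http://b.example': 'web search engine indexing',
}

index = defaultdict(set)        # word -> set of urls containing it
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)


def search(query):
    """Rank pages by how many query words they contain."""
    scores = defaultdict(int)
    for word in query.split():
        for url in index.get(word, ()):
            scores[url] += 1
    return sorted(scores, key=scores.get, reverse=True)


print(search('web crawler'))    # ['http://a.example', 'http://b.example']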
What language to write a crawler in?
C, C++: efficient and fast, suitable for a general-purpose search engine crawling the whole web. Disadvantages: development is slow and the code tends to be long and hard to maintain, e.g. the Skynet search source code.
Scripting languages: Perl, Python, Java, Ruby. Simple and easy to learn, with good text processing that makes it easy to extract web page content in detail (see the short sketch after this comparison), but efficiency is often not high; suitable for focused crawling of a small number of sites.
C#? (Seems to be the preferred language for information management people)
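To illustrate the text-processing point made above, this small sketch pulls the title and the outgoing links out of an HTML string using only the standard library's re module; the sample HTML is invented for the example.

import re

html = ('<html><head><title>Example page</title></head>'
        '<body><a href="/a">A</a> <a href="https://example.com/b">B</a></body></html>')

title = re.search(r'<title>(.*?)</title>', html, re.S)   # page title
links = re.findall(r'<a\s+href="([^"]+)"', html)         # all href values

print(title.group(1))   # Example page
print(links)            # ['/a', 'https://example.com/b']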
Why did you end up choosing Python?
- Cross-platform, with good support for Linux and Windows.
- Scientific Computing, Numerical Fitting: Numpy, Scipy
- Visualization: 2d: Matplotlib (beautiful for diagrams), 3d: Mayavi2
- Complex Networks: Networkx (see the sketch after this list)
- Statistics: Interfacing with R: Rpy
- Interactive terminals
- Rapid development of websites?
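As a small example of how these libraries fit the crawling theme, the sketch below feeds a made-up set of crawled links into Networkx as a directed graph and computes PageRank; the edge list is an assumption standing in for real crawler output.

import networkx as nx

edges = [
    ('http://a.example', 'http://b.example'),
    ('http://a.example', 'http://c.example'),
    ('http://b.example', 'http://c.example'),
]

G = nx.DiGraph(edges)           # pages as nodes, hyperlinks as edges
ranks = nx.pagerank(G)          # importance score per page
for url, score in sorted(ranks.items(), key=lambda x: -x[1]):
    print(url, round(score, 3))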
A simple Python crawler
import urllib.request
import urllib.parse


def loadPage(url, filename):
    """Send a request to the url and return the html data."""
    request = urllib.request.Request(url)
    html = urllib.request.urlopen(request).read()
    return html.decode('utf-8')


def writePage(html, filename):
    """Write the html (the content of the corresponding page on the server) to a local file."""
    with open(filename, 'w') as f:
        f.write(html)
    print('-' * 30)


def tiebaSpider(url, beginPage, endPage):
    """Scheduler of the forum crawler: builds and processes the url of each page."""
    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50                     # each listing page shows 50 posts
        fullurl = url + "&pn=" + str(pn)
        print(fullurl)
        filename = 'page_' + str(page) + '.html'
        html = loadPage(fullurl, filename)
        writePage(html, filename)


if __name__ == "__main__":
    kw = input('Please enter the name of the forum you want to crawl: ')
    beginPage = int(input('Please enter a start page: '))
    endPage = int(input('Please enter an end page: '))
    # The base URL was truncated in the original post; Baidu Tieba's listing
    # endpoint is assumed here.
    url = 'http://tieba.baidu.com/f?'
    kw1 = {'kw': kw}
    key = urllib.parse.urlencode(kw1)
    fullurl = url + key
    tiebaSpider(fullurl, beginPage, endPage)
These are the reasons and key points behind writing web crawlers in Python. Thanks for reading and for your support.