Python has a number of libraries that make it easy to write web crawlers and extract valuable information from pages. In many cases, however, what a crawler fetches is only the static page, i.e. the page's source code, just like "View Page Source" in your browser. Content that is generated dynamically, for example by JavaScript running after the page loads, is not captured. This article presents a few approaches that let Python crawl the information a page contains after its JavaScript has executed.
1. Two basic solutions
1.1 Crawling dynamic pages with the dryscrape library
JS scripts are executed by the browser, which then renders the resulting information, so the most direct way to capture a page after its JS has run is to simulate the browser's behaviour from Python. WebKit is an open-source browser engine, and Python provides several libraries that can drive this engine. One of them is dryscrape, which uses the WebKit engine to process web pages that contain JS.
import dryscrape

# Dynamically crawl a page using the dryscrape library
def get_url_dynamic(url):
    session_req = dryscrape.Session()
    session_req.visit(url)         # request the page
    response = session_req.body()  # text of the rendered page
    # print(response)
    return response

get_text_line(get_url_dynamic(url))  # will output a line of text
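One practical note that comes from dryscrape's own documentation rather than from this article: on a headless server with no X display, a virtual framebuffer needs to be started before a session can be created. A minimal sketch, assuming dryscrape is installed together with xvfb:

import sys
import dryscrape

# On a headless Linux server, start a virtual X framebuffer first;
# otherwise the WebKit session cannot be created.
if 'linux' in sys.platform:
    dryscrape.start_xvfb()

session_req = dryscrape.Session()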
The same approach works for other pages that contain JS. Although it fulfils the requirement of crawling dynamic pages, the drawback is obvious: it is slow. That makes sense when you think about it: Python asks WebKit to request the page, waits for the page to finish loading, loads the JS files, lets the JS execute, and only then returns the rendered page, so it is bound to be slower. There are many other libraries that can drive WebKit: PythonWebKit, PyWebKitGtk, PyQt (with which you could even write a browser), pyjamas, and so on, and reportedly they can do the same thing.
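As an illustration of the PyQt route mentioned above (not part of the original article), here is a minimal sketch that renders a page with PyQt5's QtWebEngine and reads back the HTML after loading has finished; the function name fetch_rendered_html is just an illustrative choice, and it assumes PyQt5 and PyQtWebEngine are installed.

import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEngineView

def fetch_rendered_html(url):
    # Spin up a Qt application with a web view, load the page,
    # and read back the DOM once loading (including JS execution) finishes.
    app = QApplication(sys.argv)
    view = QWebEngineView()
    result = {}

    def on_load_finished(ok):
        # toHtml is asynchronous; its callback receives the rendered HTML
        view.page().toHtml(lambda html: (result.update(html=html), app.quit()))

    view.loadFinished.connect(on_load_finished)
    view.load(QUrl(url))
    app.exec_()
    return result.get('html')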
1.2 selenium web testing framework
selenium is a web testing framework that lets you drive a local browser engine to send web requests, so it can equally well be used to crawl a rendered page.
# Using a selenium webdriver also works, but it opens a real browser window
from selenium import webdriver

def get_url_dynamic2(url):
    driver = webdriver.Firefox()  # drive the local Firefox; Chrome or even Ie also work
    driver.get(url)               # request the page; a browser window will open
    html_text = driver.page_source
    driver.quit()
    # print(html_text)
    return html_text

get_text_line(get_url_dynamic2(url))  # will output a line of text
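If the browser window popping up is a problem (for example on a server), Firefox can also be run headless. The following is a minimal sketch, not from the original article, and it assumes a selenium version and Firefox build that support the headless option:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

def get_url_dynamic_headless(url):
    # Run Firefox without opening a visible window
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()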
This, too, is a workable solution, not just a stopgap. A framework similar to selenium is windmill, which feels somewhat more complex, so it is not covered here.
2. Installation and use of selenium
2.1 Installation of selenium
On Ubuntu, selenium itself can be installed directly with pip install selenium. However, because of the following:
1. In recent selenium versions, the __init__ of webdriver/firefox/webdriver.py sets executable_path="geckodriver", whereas older versions used executable_path="wires"
2. Firefox 47 and above requires a third-party driver, namely geckodriver
some extra steps are also required:
1. Download geckodriver from:
mozilla/geckodriver
2. Unzip geckodriver and move it to /usr/local/bin/:
sudo mv ~/Downloads/geckodriver /usr/local/bin/
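A quick way to check that the driver is found is to start a session once. This is a minimal sketch, not from the original article; it assumes Firefox and geckodriver are installed as above:

from selenium import webdriver

# If geckodriver is on the PATH (e.g. in /usr/local/bin/), this starts Firefox
# without errors; otherwise selenium complains that the 'geckodriver'
# executable needs to be in PATH.
driver = webdriver.Firefox()
driver.get('https://www.python.org')
print(driver.title)
driver.quit()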
2.2 Using selenium
1. Runtime error:
driver = webdriver.firefox()
TypeError: 'module' object is not callable
Solution: the browser name must be capitalized: Chrome, Firefox, Ie.
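For example (an illustrative snippet, not from the original article):

from selenium import webdriver

driver = webdriver.Firefox()    # correct: capitalized class name
# driver = webdriver.firefox()  # wrong: refers to the firefox module, not the class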
2. When an element is located with
content = driver.find_element_by_class_name('content')
the method returns a FirefoxWebElement; to get the value it contains, you can use
value = content.text
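Putting the pieces together, here is a minimal end-to-end sketch; the URL and the class name 'content' are placeholders, and the find_element_by_* call follows the older selenium API used in this article:

from selenium import webdriver

url = 'http://example.com'  # placeholder URL
driver = webdriver.Firefox()
driver.get(url)
# find_element_by_class_name returns a single WebElement (a FirefoxWebElement here)
content = driver.find_element_by_class_name('content')  # 'content' is a placeholder class
value = content.text        # the text contained in the element
print(value)
driver.quit()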
This concludes this article on how to crawl dynamic websites with Python. For more on crawling dynamic websites with Python, please search my earlier articles or continue to browse the related articles below, and I hope you will continue to support me in the future!