
How python crawls dynamic websites

Python has many libraries that make it easy to write web crawlers that fetch pages and extract valuable information. In many cases, however, the page a crawler fetches is only the static page, i.e. the web page's source code, the same thing you see with "View Page Source" in your browser. Content that is generated dynamically, for example by JavaScript running after the page loads, is not captured. This article presents a few approaches you can use in Python to crawl the information produced after the JavaScript has executed.

1. Two basic solutions

1.1 Dynamic page crawling with dryscrape library

JavaScript is executed by the browser, and the resulting information is rendered into the page, so the most direct way to capture a page after its JavaScript has run is to simulate a browser's behaviour from Python. WebKit is an open source browser engine, and Python offers several libraries that can drive it. One of them is dryscrape, which calls the WebKit engine to process web pages that contain JavaScript.

import dryscrape
# Crawl a dynamic page with the dryscrape library
def get_url_dynamic(url):
    session_req = dryscrape.Session()
    session_req.visit(url)  # request the page
    response = session_req.body()  # text of the rendered page
    # print(response)
    return response
get_text_line(get_url_dynamic(url))  # get_text_line is a helper defined elsewhere; outputs a line of text

The same approach works for other pages that contain JavaScript. It meets the requirement of crawling dynamic pages, but the drawback is obvious: it is slow. That makes sense when you think about it: Python asks WebKit to request the page, waits for the page to finish loading, loads the JavaScript files, lets the JavaScript execute, and only then returns the rendered page, so it is bound to be somewhat slower. There are many other libraries that can drive WebKit, such as PythonWebKit, PyWebKitGtk, PyQt (which you can even use to write a browser) and pyjamas; reportedly they can do the same job.

1.2 selenium web testing framework

selenium is a web testing framework that can drive a local browser engine to send web requests, so it can also meet the requirement of crawling dynamic pages.

# Using selenium webdriver also works, but it opens a browser window while running

from selenium import webdriver

def get_url_dynamic2(url):
    driver = webdriver.Firefox()  # drive the local Firefox; Chrome or even Ie work too
    driver.get(url)  # request the page; a browser window will open
    html_text = driver.page_source
    driver.quit()
    # print(html_text)
    return html_text
get_text_line(get_url_dynamic2(url))  # will output a line of text

This, too, can serve as a makeshift solution. A framework similar to selenium is windmill, which feels a bit more complex, so I will not go into it here.
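If the pop-up browser window is a nuisance, newer selenium releases can run Firefox headless. The following is only a minimal sketch under that assumption (the URL is a placeholder), not code from the approaches above:

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument("-headless")  # ask Firefox to run without a visible window

driver = webdriver.Firefox(options=options)
driver.get("https://example.com")  # placeholder URL
print(driver.page_source[:200])  # first 200 characters of the rendered page
driver.quit()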

2. Installation and use of selenium

2.1 Installation of selenium

On Ubuntu, selenium itself can be installed directly with pip install selenium. However, because of the following:

1. Newer versions of selenium use executable_path="geckodriver" in the __init__ of webdriver/firefox/, while older versions used executable_path="wires".

2. Firefox 47 and above need a third-party driver, namely geckodriver.

some extra handling is also required:

1. Download geckodriver; the download address is:

mozilla/geckodriver

2. Unzip geckodriver and put it under the path /usr/local/bin/:

sudo mv ~/Downloads/geckodriver /usr/local/bin/
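Once geckodriver is on the PATH, a quick smoke test like the following (a minimal sketch; the URL is just an example) shows whether selenium can start Firefox:

from selenium import webdriver

driver = webdriver.Firefox()  # raises an error if geckodriver cannot be found
driver.get("https://www.mozilla.org")  # example URL
print(driver.title)  # print the page title to confirm the page loaded
driver.quit()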

2.2 Using selenium

1. Runtime error:

driver = webdriver.firefox()
TypeError: 'module' object is not callable

Solution: the browser name needs to be capitalized: Chrome, Firefox, Ie.
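For example, the following calls use the correctly capitalized class names (which one works depends on the browsers and drivers installed locally):

from selenium import webdriver

driver = webdriver.Firefox()  # correct: capitalized Firefox, not webdriver.firefox()
# driver = webdriver.Chrome()  # works the same way if chromedriver is installed
# driver = webdriver.Ie()  # Internet Explorer on Windows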

2. Locating elements

You can locate an element with

content = driver.find_element_by_class_name('content')

This method returns a FirefoxWebElement; when you want the value it contains, you can get it with

value = content.text
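Putting these pieces together, a small end-to-end sketch could look like the following; the URL and the class name 'content' are placeholders. (In newer selenium releases find_element_by_class_name has been replaced by find_element(By.CLASS_NAME, ...), but the call shown here matches the older API used in this article.)

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://example.com")  # placeholder URL
content = driver.find_element_by_class_name('content')  # locate the element by class name
value = content.text  # the text contained in the element
print(value)
driver.quit()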

This is the end of this article on how Python crawls dynamic websites. For more on crawling dynamic websites with Python, please search my previous articles or continue browsing the related articles below. I hope you will support me more in the future!