Brief introduction:
Selenium is a web automation testing tool, originally developed for automated website testing. It drives a real browser directly and supports all major browsers, including headless (interface-less) browsers such as PhantomJS. PhantomJS development was suspended in 2018; Chrome driven through chromedriver can provide the same functionality. Selenium receives commands that make the browser load pages automatically, extract the required data, and even take page screenshots.
1. Installation
pip install selenium -i /simple
2. Download the browser driver
Google Chrome is used here:
/mirrors/chromedriver/
Check your browser version and download the matching driver.
Put the unzipped driver in your own directory.
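The version check described above can be sketched as a small helper. This is only an illustration: it assumes that comparing the major version number is enough, which is the usual convention for ChromeDriver, and the function names and version strings are hypothetical.

```python
def major_version(version: str) -> int:
    """Return the leading (major) number of a dotted version string."""
    return int(version.strip().split(".")[0])


def driver_matches_browser(browser_version: str, driver_version: str) -> bool:
    """True when the browser and the driver share the same major version."""
    return major_version(browser_version) == major_version(driver_version)


# Hypothetical version strings, used only for illustration.
print(driver_matches_browser("114.0.5735.199", "114.0.5735.90"))
print(driver_matches_browser("114.0.5735.199", "113.0.5672.63"))
```

The first call compares two 114.x builds and the second compares 114.x with 113.x, so only the first pair counts as a match.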
3. Examples
3.1 Download the corresponding version of the browser driver
/mirrors/chromedriver/
Put the unzipped driver in your own directory.
3.2 Test code, open a web page and get the title of the page
from selenium.webdriver import Chrome

if __name__ == '__main__':
    web = Chrome()
    web.get("")  # Open the target page
    print(web.title)  # Print the page title
3.3 A small sample
from selenium.webdriver import Chrome

if __name__ == '__main__':
    web = Chrome()
    url = '/acm/home'
    web.get(url)
    # Get the a tag to click on
    el = web.find_element_by_xpath('/html/body/div/div[3]/div[1]/div[1]/div[1]/div/a')
    # Click
    el.click()
    # "/html/body/div/div[3]/div[1]/div[2]/div[2]/div[2]/div[1]/h4/a"
    # Crawl the desired content
    lists = web.find_elements_by_xpath("/html/body/div/div[3]/div[1]/div[2]/div[@class='platform-item js-item ']/div["
                                       "2]/div[1]/h4/a")
    print(len(lists))
    for i in lists:
        print(i.text)
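The `find_element_by_*` helpers used throughout this article belong to older Selenium releases; they were removed in Selenium 4.3 in favour of `find_element(by, value)`. The sketch below shows how the old names map onto the new call. The table and the `find_one` helper are hypothetical names introduced for illustration; the strategy strings are the values of the constants in `selenium.webdriver.common.by.By`, written out literally so the snippet has no import of its own.

```python
# Strategy strings as defined by selenium.webdriver.common.by.By
# (By.ID == "id", By.XPATH == "xpath", and so on).
LEGACY_TO_STRATEGY = {
    "find_element_by_id": "id",
    "find_element_by_xpath": "xpath",
    "find_element_by_link_text": "link text",
    "find_element_by_partial_link_text": "partial link text",
}


def find_one(driver, legacy_name, value):
    """Locate one element with the Selenium 4 call that replaces a legacy helper."""
    strategy = LEGACY_TO_STRATEGY[legacy_name]
    return driver.find_element(strategy, value)
```

With a real driver, `find_one(web, "find_element_by_id", "kw")` amounts to `web.find_element(By.ID, "kw")`.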
3.4 Automatic input and jump
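from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys
import time

if __name__ == '__main__':
    web = Chrome()
    url = '/acm/home'
    web.get(url)
    el = web.find_element_by_xpath('/html/body/div/div[3]/div[1]/div[1]/div[1]/div/a')
    el.click()
    time.sleep(1)  # Give the page a moment to react to the click
    input_el = web.find_element_by_xpath('/html/body/div/div[3]/div[1]/div[1]/div[1]/form/input[1]')
    # Type the text, then press Enter to submit and jump
    input_el.send_keys('Cattlemen', Keys.ENTER)
    # do something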
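The example above pauses for a fixed second after the click. A common alternative is to poll until the element is actually there; Selenium ships `WebDriverWait` for this purpose. The helper below is a hypothetical, library-free stand-in that only illustrates the polling idea; its name and default values are assumptions.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.2):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

Because an empty list is falsy, `wait_until(lambda: web.find_elements_by_xpath(path))` would return the list of matches as soon as at least one element appears.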
4. Turn on headless mode
Headless mode determines whether the browser runs with a visible interface.
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

option = Options()  # Instantiate the option object
option.add_argument("--headless")  # Add the headless argument to the option object

if __name__ == '__main__':
    # Specify the driver location; otherwise it is looked up in the Python interpreter directory.
    web = Chrome(executable_path='D:\\PyProject\\spider\\venv\\Scripts\\', options=option)
    web.get("")
    print(web.title)
5. Save a screenshot of the page
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

option = Options()  # Instantiate the option object
option.add_argument("--headless")  # Add the headless argument to the option object

if __name__ == '__main__':
    web = Chrome()
    web.maximize_window()  # Maximize the browser window
    web.get("")
    print(web.title)
    web.save_screenshot('')  # Save a screenshot of the current page to the current folder
    web.close()  # Close the current page
6. Simulating input and clicks
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

option = Options()  # Instantiate the option object
option.add_argument("--headless")  # Add the headless argument to the option object

if __name__ == '__main__':
    web = Chrome()
    web.maximize_window()  # Maximize the browser window
    web.get("")
    el = web.find_element_by_id('kw')  # Search input box
    el.send_keys('Harris-H')
    btn = web.find_element_by_id('su')  # Search button
    btn.click()
    # web.close()  # Close the current page
It seems that Baidu can now detect Selenium and also requires image verification.
6.1 Finding nodes based on text values
# Find the node whose link text is "Baidu."
driver.find_element_by_link_text("Baidu.")
# Get a list of elements by the text contained in the link (partial match)
driver.find_elements_by_partial_link_text("Degree.")
6.2 Getting the text of the current node
# Get the text of the current node
ele.text
# Get the value of the given attribute
ele.get_attribute("data-click")
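When several values are needed from the same node, the text and attribute lookups can be gathered in one place. The helper below is a hypothetical sketch: it relies only on the `text` property and `get_attribute` method shown in this section, and the function name is an assumption.

```python
def element_summary(element, attribute_names):
    """Collect an element's visible text and selected attributes into a dict."""
    summary = {"text": element.text}
    for name in attribute_names:
        # get_attribute returns None when the attribute is absent
        summary[name] = element.get_attribute(name)
    return summary
```

For instance, `element_summary(ele, ["href", "data-click"])` would return a dict with the keys `text`, `href` and `data-click`.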
6.3 Printing some information about the current page
print(driver.page_source)  # Print the source code of the page
print(driver.get_cookies())  # Print the page's cookies
print(driver.current_url)  # Print the current page's URL
6.4 Closing the browser
driver.close()  # Close the current page
driver.quit()  # Close the browser directly
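If a script raises an exception before it reaches the call that closes the browser, the browser process can be left running. One way to guard against that is a context manager that always calls `quit()`. This is a minimal sketch; `managed_driver` is a hypothetical name, and `factory` is assumed to be any callable that returns a driver, such as `Chrome`.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the with-block raises
        driver.quit()
```

Usage would look like `with managed_driver(Chrome) as web: web.get(url)`, and the browser is closed when the block ends.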
6.5 Mouse scrolling simulation
from selenium.webdriver import Chrome
import time

if __name__ == '__main__':
    driver = Chrome()
    driver.get(
        "/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=78000241_12_hao_pg&wd=selenium%20js%E6%BB%91%E5%8A%A8&fenlei=256&rsv_pq=8215ec3a00127601&rsv_t=a763fm%2F7SHtPeSVYKeWnxKwKBisdp%2FBe8pVsIapxTsrlUnas7%2F7Hoo6FnDp6WsslfyiRc3iKxP2s&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_sug3=31&rsv_sug1=17&rsv_sug7=100&rsv_sug2=0&rsv_btype=i&inputT=9266&rsv_sug4=9770")
    # 1. Scroll to the bottom of the page
    js = "document.documentElement.scrollTop=1000"
    driver.execute_script(js)  # Execute js
    time.sleep(2)
    # 2. Scroll to the top
    js = "document.documentElement.scrollTop=0"
    driver.execute_script(js)  # Execute js
    time.sleep(2)
    driver.quit()
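Pages that load content lazily often need to be scrolled in several steps rather than in one jump. The sketch below splits a scroll into increments and passes each offset to the page through `execute_script`. The function names, the step size and the use of `window.scrollTo` are assumptions made for illustration, not the code used above.

```python
import time


def scroll_positions(total_height, step):
    """Return the vertical offsets to visit, ending exactly at total_height."""
    if step <= 0:
        raise ValueError("step must be positive")
    positions = list(range(step, total_height, step))
    positions.append(total_height)
    return positions


def scroll_gradually(driver, total_height, step=300, pause=0.0):
    """Scroll down in increments by running window.scrollTo in the page."""
    for y in scroll_positions(total_height, step):
        # arguments[0] is how execute_script exposes extra Python arguments to the script
        driver.execute_script("window.scrollTo(0, arguments[0]);", y)
        time.sleep(pause)
```

A call such as `scroll_gradually(driver, 1000, pause=0.5)` would stop at 300, 600, 900 and finally 1000 pixels.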
7. Proxy and other browser options
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://110.52.235.176:9999")  # Add a proxy
options.add_argument("--headless")  # Headless mode
options.add_argument("--lang=en-US")  # Request pages in English
prefs = {"profile.managed_default_content_settings.images": 2, '': 2}  # Disable rendering
options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(executable_path="D:\\ProgramApp\\chromedriver\\", chrome_options=options)
driver.get("/ip")
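When the same switches are reused across scripts, it can help to build the argument list in one function and attach it to a `ChromeOptions` object in another. The sketch below uses the three switches from the example above. Both function names are hypothetical, and the Selenium import is done lazily so that the first helper can be used and tested without a browser.

```python
def chrome_arguments(proxy=None, headless=False, lang=None):
    """Build the list of Chrome command-line switches for the given settings."""
    arguments = []
    if proxy:
        arguments.append(f"--proxy-server={proxy}")
    if headless:
        arguments.append("--headless")
    if lang:
        arguments.append(f"--lang={lang}")
    return arguments


def build_options(arguments):
    """Turn a list of switches into a ChromeOptions object (needs selenium installed)."""
    from selenium import webdriver  # lazy import: only required when a browser is used

    options = webdriver.ChromeOptions()
    for argument in arguments:
        options.add_argument(argument)
    return options
```

The options object from `build_options(chrome_arguments(headless=True, lang="en-US"))` can then be passed to the driver as in the example above.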
8. Slider verification
Goal: pass a sliding CAPTCHA
- 1. Locate the slider button
- 2. Press and hold the slider
- 3. Drag the slider
import time
from selenium import webdriver

if __name__ == '__main__':
    chrome_obj = webdriver.Chrome()
    chrome_obj.get('/demo/2017/unlock/')
    # 1. Locate the slider button
    click_obj = chrome_obj.find_element_by_xpath('//div[@class="bar1 bar"]/div[@class="slide-to-unlock-handle"]')
    # 2. Press and hold
    # Create an action chain object; the argument is the browser object
    action_obj = webdriver.ActionChains(chrome_obj)
    # Click and hold; the argument is the located button
    action_obj.click_and_hold(click_obj)
    # Get its width and height
    size_ = click_obj.size
    # The width of the frame minus the width of the slider is the distance to move along the x-axis (to the right)
    width_ = 298 - size_['width']
    print(width_)
    # 3. Drag the slider by that distance
    action_obj.move_by_offset(width_, 0)
    # 4. Release the slider, then run the queued actions
    action_obj.release()
    action_obj.perform()
    time.sleep(6)
    chrome_obj.quit()
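The distance calculation and the three-step drag can also be wrapped in reusable helpers. This is a sketch under the same assumption as the example above, namely that the track width is known in advance (298 pixels there). The helper names are hypothetical, and `ActionChains` is imported lazily so that the arithmetic can be checked without a browser.

```python
def slide_distance(track_width, handle_width):
    """Horizontal distance the handle must travel to reach the end of the track."""
    if handle_width > track_width:
        raise ValueError("handle is wider than the track")
    return track_width - handle_width


def drag_handle(driver, handle, track_width):
    """Press the handle, drag it to the end of the track, and release it."""
    from selenium.webdriver import ActionChains  # lazy import: only needed with a real browser

    distance = slide_distance(track_width, handle.size["width"])
    # Queue all three steps, then execute them with a single perform()
    ActionChains(driver).click_and_hold(handle).move_by_offset(distance, 0).release().perform()
    return distance
```

With the objects from the example, this would be called as `drag_handle(chrome_obj, click_obj, 298)`.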
9. Open multiple windows and page switching
Sometimes a window contains many tabs. Selenium provides switch_to.window (switch_to_window in older versions) to switch between them; the handle of the page you want to switch to can be found in driver.window_handles.
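from selenium import webdriver

if __name__ == '__main__':
    driver = webdriver.Chrome()
    driver.get("/")
    driver.implicitly_wait(2)
    # Open a second page in a new window
    driver.execute_script("window.open('/')")
    # Switch to the newly opened window
    driver.switch_to.window(driver.window_handles[1])
    print(driver.page_source)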
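The example above picks the new window by index, which assumes the handles come back in the order the windows were opened. A more defensive variant compares the handles before and after the window is opened. The helper below is a hypothetical sketch of that comparison and works on plain lists of handle strings.

```python
def new_window_handle(handles_before, handles_after):
    """Return the handle that is present after opening a window but not before."""
    known = set(handles_before)
    fresh = [handle for handle in handles_after if handle not in known]
    if len(fresh) != 1:
        raise ValueError(f"expected exactly one new window, found {len(fresh)}")
    return fresh[0]
```

In practice one would store `driver.window_handles` before calling `window.open`, then pass both lists to the helper and hand the result to `driver.switch_to.window`.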
10. Cookie operations
# 1. Get all cookies:
for cookie in driver.get_cookies():
    print(cookie)
# 2. Get the value based on the key of the cookie:
value = driver.get_cookie(key)
# 3. Delete all cookies:
driver.delete_all_cookies()
# 4. Delete a cookie:
driver.delete_cookie(key)
# 5. Add a cookie:
driver.add_cookie({"name": "password", "value": "111111"})
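`get_cookies()` returns a list of dictionaries, each carrying at least a `name` and a `value` key, as the `add_cookie` call above suggests. For lookups it is often handier to flatten that list into a single name-to-value mapping. The helper name below is hypothetical.

```python
def cookies_to_dict(cookies):
    """Map cookie names to values from a get_cookies()-style list of dicts."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


# Illustrative data shaped like the add_cookie example above.
sample = [{"name": "password", "value": "111111"}]
print(cookies_to_dict(sample))
```

With a live session this would be called as `cookies_to_dict(driver.get_cookies())`.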
11. Simulate login
Here is a simulated login to our school's Registrar's Office:
from selenium.webdriver import Chrome

if __name__ == '__main__':
    web = Chrome()
    web.get('/')
    username = web.find_element_by_id('userAccount')
    username.send_keys('xxxxxxx')  # Your student number
    password = web.find_element_by_id('userPassword')
    password.send_keys('xxxxxxx')  # Your password
    btn = web.find_element_by_xpath('//*[@]/li[4]/button')
    btn.click()  # Submit the login form
    # do something
Since there is no slider or other verification step, it's pretty simple qwq. After logging in, you can carry out whatever operations you need.
12. Advantages and disadvantages
Advantage: Selenium can execute JavaScript on the page, so data rendered by JS and simulated logins are easy to handle.
Disadvantage: Selenium is slow, because the browser sends many requests while loading a page, so in many cases it should be used with discretion.