I. What is Selenium
Selenium is a complete web application testing system that covers test recording (Selenium IDE), writing and running tests (Selenium Remote Control), and parallel test execution (Selenium Grid). Its core, Selenium Core, is based on JsUnit and is written entirely in JavaScript, so it can run in any browser that supports JavaScript.
Selenium is an automated testing tool that drives a real browser and supports multiple browsers. In crawlers it is mainly used to handle pages that are rendered by JavaScript.
II. Installing Selenium
When writing a crawler in Python, we mainly use Selenium's WebDriver. We can first check which browsers it supports:
from selenium import webdriver
help(webdriver)
The output is shown below. It covers essentially all common browsers:
NAME:
PACKAGE CONTENTS:
- android (package)
- blackberry (package)
- chrome (package)
- common (package)
- edge (package)
- firefox (package)
- ie (package)
- opera (package)
- phantomjs (package)
- remote (package)
- safari (package)
- support (package)
- webkitgtk (package)
VERSION: 3.14.1
1. PhantomJS: a browser without a graphical interface (headless browser)
PhantomJS is a WebKit-based, server-side JavaScript API. It loads web pages without needing a regular browser, is fast, and natively supports a variety of web standards such as DOM handling, CSS selectors, and JSON.
PhantomJS can be used for page automation, web monitoring, web screenshots, and headless testing.
2. Download the browser driver
When Selenium was upgraded to 3.0, the handling of browser drivers was standardized. To drive a particular browser with Selenium, you must download and set up that browser's driver separately.
Drivers for each browser:
- Firefox driver: geckodriver
- Chrome driver: chromedriver (a Taobao mirror is available as an alternate address)
- Internet Explorer driver: IEDriverServer
- Edge driver: MicrosoftWebDriver
Note: the WebDriver must be compatible with both the browser version and the Selenium version.
Check the mapping between driver versions and browser versions:
For Chrome, look up the driver that corresponds to your browser version. Only the major version needs to match.
After downloading, unzip the driver to any directory (the path should not contain Chinese characters).
The browser driver file (on Windows, the .exe obtained after downloading and unzipping) must be placed in the same-level directory before it can be used. Alternatively, configure your computer's environment variables:
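The "match the major version" rule can be sketched in a few lines. This is only an illustration, not part of Selenium: the helper names `major_version` and `driver_matches_browser` and the sample version strings are assumptions, and the check presumes the driver shares the browser's version numbering, as recent chromedriver releases do.

```python
def major_version(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def driver_matches_browser(driver_version: str, browser_version: str) -> bool:
    """Rule of thumb from the text: only the major versions must agree."""
    return major_version(driver_version) == major_version(browser_version)


# Hypothetical version strings, used only for illustration.
print(driver_matches_browser("114.0.5735.90", "114.0.5735.199"))
print(driver_matches_browser("113.0.5672.63", "114.0.5735.199"))
```

Older chromedriver releases (the 2.x series mentioned later in this article) used their own numbering, so for those the published mapping table is still the thing to consult.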
You can manually create a directory for storing browser drivers, e.g. C:\driver , and drop the downloaded browser driver files (e.g. chromedriver, geckodriver) into that directory.
My Computer-->Properties-->System Settings-->Advanced-->Environment Variables-->System Variables-->Path, add the "C:\driver" directory to the value of Path.
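After editing Path, it can be useful to confirm that Python can actually see the driver. The following is a small sketch using only the standard library; the helper name `find_driver` and its `extra_dirs` parameter are made up for this example.

```python
import os
import shutil


def find_driver(name, extra_dirs=()):
    """Return the full path of a driver executable, or None if it is not found.

    Directories in extra_dirs are searched first, then the normal PATH.
    """
    search_path = os.pathsep.join(list(extra_dirs) + [os.environ.get("PATH", "")])
    return shutil.which(name, path=search_path)


for driver_name in ("chromedriver", "geckodriver"):
    location = find_driver(driver_name)
    print(driver_name, "->", location or "not found on PATH")
```

If a driver is reported as not found, open a new terminal (environment variable changes do not reach already running programs) or pass the directory explicitly, for example `find_driver("chromedriver", [r"C:\driver"])`.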
III. Basic use of Selenium
1. Declare the browser object
We saw above that Selenium supports many browsers. To declare and launch one, do the following.
Only two examples are written here, but all other supported browsers can be called in the same way.
from selenium import webdriver
browser = webdriver.Chrome()
# browser = webdriver.Firefox()
Headless startup
Headless Chrome is Chrome without a user interface. It lets you run your programs with all the features Chrome supports without opening a browser window. Compared with a regular browser, Headless Chrome makes it more convenient to test web applications, take screenshots of websites, crawl information, and so on. It is also much closer to a real browser environment than older tools such as PhantomJS and SlimerJS.
Chrome requirements for Headless Chrome: according to the official documentation, macOS and Linux require Chrome 59+, Windows requires Chrome 60+, and chromedriver must be version 2.30+.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
chrome_options = webdriver.ChromeOptions()
# Use headless browser mode
chrome_options.add_argument('--headless')  # run without an interface
chrome_options.add_argument('--disable-gpu')  # without this option, element positioning sometimes misbehaves
# Launch a browser and get the source code of the page
browser = webdriver.Chrome(options=chrome_options)
mainUrl = "/"
browser.get(mainUrl)
print(f"browser text = {browser.page_source}")
browser.quit()
2. Visit the page
Running the following code automatically opens Chrome, loads the Baidu home page, prints its source code, and then closes the browser.
from selenium import webdriver
browser = webdriver.Chrome()
browser.get("")  # the Baidu home page
print(browser.page_source)
browser.quit()
3. Find elements
1. Finding a single element
from selenium import webdriver
browser = webdriver.Chrome()
browser.get("/")
input_first = browser.find_element_by_id("q")
input_second = browser.find_element_by_css_selector("#q")
input_third = browser.find_element_by_xpath('//*[@id="q"]')
print(input_first)
print(input_second)
print(input_third)
browser.quit()
Here we get the same element in three different ways: by id, by CSS selector, and by XPath. All three print the same WebElement.
Commonly used methods for finding elements:
- find_element_by_name: locate by the element's name attribute
- find_element_by_id: locate by the element's id
- find_element_by_xpath: locate by an XPath expression
- find_element_by_link_text: locate by the complete text of a hyperlink
- find_element_by_partial_link_text: locate by part of a hyperlink's text
- find_element_by_tag_name: locate by tag name
- find_element_by_class_name: locate by class name
- find_element_by_css_selector: locate by a CSS selector
Examples:
An XPath locator can be written in many ways. Here are a few common ones:
dr.find_element_by_xpath("//*[@id='kw']")
dr.find_element_by_xpath("//*[@name='wd']")
dr.find_element_by_xpath("//input[@class='s_ipt']")
dr.find_element_by_xpath("/html/body/form/span/input")
dr.find_element_by_xpath("//span[@class='soutu-btn']/input")
dr.find_element_by_xpath("//form[@id='form']/span/input")
dr.find_element_by_xpath("//input[@id='kw' and @name='wd']")
A CSS locator can also be written in many ways. Here are a few common ones:
dr.find_element_by_css_selector("#kw")
dr.find_element_by_css_selector("[name=wd]")
dr.find_element_by_css_selector(".s_ipt")
dr.find_element_by_css_selector("html > body > form > span > input")
dr.find_element_by_css_selector("span.soutu-btn > input#kw")
dr.find_element_by_css_selector("form#form > span > input")
How to use XPath
1. First method: locate by absolute path (you will rarely want to use this one, because it breaks whenever the page layout changes)
dr.find_element_by_xpath("/html/body/div/form/input")
2. Second method: locate by relative path; two slashes mean "anywhere below this point"
dr.find_element_by_xpath("//input//div")
3. Third method: locate by element index
dr.find_element_by_xpath("//input[4]")
4. Fourth method: locate with XPath plus node attributes (can be combined with methods 2 and 3)
dr.find_element_by_xpath("//input[@id='kw1']")
dr.find_element_by_xpath("//input[@type='name' and @name='kw1']")
5. Fifth method: match on partial attribute values (the most powerful method)
dr.find_element_by_xpath("//input[starts-with(@id,'nice')]")
dr.find_element_by_xpath("//input[ends-with(@id,'beautiful')]")  # ends-with() is XPath 2.0; browsers implement XPath 1.0, so this form may be rejected
dr.find_element_by_xpath("//input[contains(@id,'So beautiful')]")
6. Sixth method: combine the preceding methods
dr.find_element_by_xpath("//input[@id='kw1']//input[starts-with(@id,'nice')]/div[1]/form[3]")
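Some of these path styles can be tried without launching a browser. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath (paths, attribute predicates, and indexes, but no functions such as `contains()`). The page fragment is made up for the example and is not taken from any real site.

```python
import xml.etree.ElementTree as ET

# A made-up page fragment, used only to try the expressions out.
PAGE = """
<html>
  <body>
    <form id="form">
      <span><input id="kw" name="wd" class="s_ipt"/></span>
      <span><input id="su" type="submit"/></span>
    </form>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Full path from the root, step by step (the "absolute path" style).
by_path = root.find("./body/form/span/input")

# Relative path: '//' searches at any depth below the current node.
by_attr = root.find(".//input[@id='kw']")

# Index: the second <span> inside the form (XPath indexes start at 1).
by_index = root.find(".//form/span[2]/input")

print(by_path.get("id"), by_attr.get("name"), by_index.get("id"))
```

The attribute-based expression keeps working if another wrapper element is added around the form, while the step-by-step path does not, which is the practical reason to prefer relative paths with attributes.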
The following approach is more general. It relies on the By class, which must be imported.
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Chrome()
browser.get("")
input_first = browser.find_element(By.ID, "q")
print(input_first)
browser.quit()
This approach is equivalent to the methods described above. In browser.find_element(By.ID, "q"), By.ID can be replaced with any of the other locator strategies.
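To make the correspondence concrete, here is a small sketch that translates a `find_element_by_*` method name into the `(strategy, value)` tuple that `find_element` accepts. The string values are the ones Selenium's `By` class defines; the `STRATEGIES` table and the `to_locator` helper are made up for this illustration.

```python
# The string values below are the ones Selenium's By class defines
# (By.ID == "id", By.CSS_SELECTOR == "css selector", and so on).
STRATEGIES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def to_locator(method_name, value):
    """Translate e.g. ('find_element_by_id', 'q') into the tuple ('id', 'q')."""
    for prefix in ("find_elements_by_", "find_element_by_"):
        if method_name.startswith(prefix):
            suffix = method_name[len(prefix):]
            return STRATEGIES[suffix], value
    raise ValueError(f"not a find_element_by_* name: {method_name}")


print(to_locator("find_element_by_id", "q"))
print(to_locator("find_elements_by_css_selector", ".service-bd li"))
```

The resulting tuple can be unpacked into a call such as `browser.find_element(*locator)`, and it has the same shape as the locator tuples passed to the wait conditions later in this article.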
2. Finding multiple elements
find_elements returns every matching element, while find_element returns only one. Otherwise they are used in the same way, as this example shows:
from selenium import webdriver
browser = webdriver.Chrome()
browser.get("/")
lis = browser.find_elements_by_css_selector('.service-bd li')
print(lis)
browser.quit()
The result is a list of WebElement objects.
The same can be done with By (from selenium.webdriver.common.by import By):
lis = browser.find_elements(By.CSS_SELECTOR,'.service-bd li')
Every single-element method has a multiple-element counterpart:
- find_elements_by_name
- find_elements_by_id
- find_elements_by_xpath
- find_elements_by_link_text
- find_elements_by_partial_link_text
- find_elements_by_tag_name
- find_elements_by_class_name
- find_elements_by_css_selector
4. Interacting with elements
The following are the most commonly used WebDriver methods for working with an element:
- click(): click the element
- send_keys(): simulate keystrokes on the element
- clear(): clear the element's content, if possible
- submit(): submit the element's content, if possible
- text: property that returns the element's text
1. Keyboard events
The Keys class is required to send special keys:
from selenium.webdriver.common.keys import Keys
Send keystrokes with send_keys():
send_keys(Keys.TAB)  # TAB
send_keys(Keys.ENTER)  # Enter
send_keys(Keys.CONTROL, 'a')  # Ctrl+A: select everything in the input box
send_keys(Keys.CONTROL, 'x')  # Ctrl+X: cut the content of the input box
import time
browser.get("")  # the Taobao home page
input_str = browser.find_element_by_id('q')
input_str.send_keys("ipad")
time.sleep(1)
input_str.clear()
input_str.send_keys("MacBook pro")
button = browser.find_element_by_class_name('btn-search')
button.click()
When this runs, the program opens Chrome, loads Taobao, types ipad, deletes it, types MacBook pro instead, and clicks search.
2. Mouse events
Mouse events include right click, double click, drag, hovering over an element, and so on. They require the ActionChains class, which is imported with:
from selenium.webdriver.common.action_chains import ActionChains
Common ActionChains methods:
- context_click(): right click
- double_click(): double click
- drag_and_drop(): drag and drop
- move_to_element(): hover the mouse over an element
- perform(): execute all actions stored in the ActionChains object
Double-click example:
qqq = driver.find_element_by_xpath("xxx")  # Locate the element to be double-clicked
ActionChains(driver).double_click(qqq).perform()  # Double-click the located element
Mouse drag-and-drop example:
from selenium.webdriver import ActionChains
url = "/try/?filename=jqueryui-api-droppable"
browser.get(url)
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
target = browser.find_element_by_css_selector('#droppable')
actions = ActionChains(browser)
actions.drag_and_drop(source, target)
actions.perform()
For more operations, see:
/#.action_chains
5. Execute JavaScript
This is a very useful feature: you can call JavaScript directly to perform operations. The following example opens Zhihu, scrolls to the bottom of the page with JavaScript, and then shows an alert box.
browser.get("/explore")
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
browser.execute_script('alert("To Bottom")')
6. Get element information
1. Get an element attribute: get_attribute('class')
url = '/explore'
browser.get(url)
logo = browser.find_element_by_id('zh-top-link-logo')
print(logo)
print(logo.get_attribute('class'))
2. Get the text value: text
url = '/explore'
browser.get(url)
input = browser.find_element_by_class_name('zu-top-add-question')
print(input.text)
3. Get the ID, location, tag name, and size
- id
- location
- tag_name
- size
url = '/explore'
browser.get(url)
input = browser.find_element_by_class_name('zu-top-add-question')
print(input.id)
print(input.location)
print(input.tag_name)
print(input.size)
7. Frame
Many web pages contain frame tags, so when crawling we need to switch into a frame and back out of it, as the following example shows.
The commonly used methods are switch_to.frame() and switch_to.parent_frame().
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
browser = webdriver.Chrome()
url = '/try/?filename=jqueryui-api-droppable'
browser.get(url)
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
print(source)
try:
    logo = browser.find_element_by_class_name('logo')
except NoSuchElementException:
    print('NO LOGO')
browser.switch_to.parent_frame()
logo = browser.find_element_by_class_name('logo')
print(logo)
print(logo.text)
8. Waiting
1. Implicit waiting
With an implicit wait, if WebDriver does not find an element in the DOM right away, it keeps looking for up to the configured time before throwing a "not found" exception.
In other words, the implicit wait applies whenever an element is looked up and does not appear immediately. The default wait time is 0.
If the element appears before the time is up, execution continues immediately. If it has still not loaded when the specified time runs out, an exception is thrown.
browser.implicitly_wait(10)
browser.get('/explore')
input = browser.find_element_by_class_name('zu-top-add-question')
print(input)
2. Explicit waiting
Specify a wait condition and a maximum wait time. Within that time, WebDriver repeatedly checks whether the condition is satisfied.
If the condition is satisfied, the call returns immediately. If not, it keeps waiting up to the maximum time you specified, and throws an exception if the condition is still not satisfied at that point.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Chrome()
browser.get('/')
wait = WebDriverWait(browser, 10)
input = wait.until(EC.presence_of_element_located((By.ID, 'q')))
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-search')))
print(input, button)
The conditions in the above example: EC.presence_of_element_located() checks whether the element is present, and EC.element_to_be_clickable() checks whether the element can be clicked.
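The behaviour described above (check, sleep, check again, give up after a deadline) can be illustrated with a plain polling loop. This is a simplified sketch of the idea, not Selenium's implementation: `wait_until`, its parameters, and the stand-in condition are all made up for the example, and it raises Python's built-in `TimeoutError` rather than Selenium's `TimeoutException`.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for a page whose element only "appears" on the third check.
attempts = {"count": 0}


def element_appears():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


print(wait_until(element_appears, timeout=2.0, poll=0.05))
```

Because the loop returns as soon as the condition holds, an explicit wait with a generous timeout usually costs far less time than a fixed `time.sleep()` of the same length.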
3. Commonly used wait conditions:
- title_is: the title is exactly the given text
- title_contains: the title contains the given text
- presence_of_element_located: the element is present in the DOM; pass in a locator tuple, e.g. (By.ID, 'p')
- visibility_of_element_located: the element is visible; pass in a locator tuple
- visibility_of: the element is visible; pass in the element object
- presence_of_all_elements_located: all matching elements are present in the DOM
- text_to_be_present_in_element: the element's text contains the given text
- text_to_be_present_in_element_value: the element's value contains the given text
- frame_to_be_available_and_switch_to_it: the frame is available, and the driver switches to it
- invisibility_of_element_located: the element is not visible
- element_to_be_clickable: the element can be clicked
- staleness_of: the element is no longer attached to the DOM, which can be used to tell whether the page has been refreshed
- element_to_be_selected: the element is selected; pass in the element object
- element_located_to_be_selected: the element is selected; pass in a locator tuple
- element_selection_state_to_be: pass in the element object and a state; returns True if they are equal, otherwise False
- element_located_selection_state_to_be: pass in a locator tuple and a state; returns True if they are equal, otherwise False
- alert_is_present: an alert is present
Example: checking a blog's title
# coding: utf-8
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("/101718qiong/")
title = EC.title_is(u"Silence&QH - Blogland")  # Check that the title is exactly equal
print(title(driver))
title1 = EC.title_contains("Silence&QH")  # Check that the title contains the text
print(title1(driver))
# Alternatively, written in one step
r1 = EC.title_is(u"Silence&QH - Blogland")(driver)
r2 = EC.title_contains("Silence&QH")(driver)
print(r1)
print(r2)
For more operations, see:
/#.expected_conditions
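The example calls each condition with the driver, as in `EC.title_is(...)(driver)`, because a condition is simply a callable that takes the driver and returns a result. The sketch below re-creates that pattern in plain Python so it can be run without a browser. It is a simplified imitation, not Selenium's source code, and `FakeDriver` is a hypothetical stand-in that only has a `title` attribute.

```python
def title_is(expected):
    """Return a condition that checks for an exact title match."""
    def condition(driver):
        return driver.title == expected
    return condition


def title_contains(fragment):
    """Return a condition that checks whether the title contains a fragment."""
    def condition(driver):
        return fragment in driver.title
    return condition


class FakeDriver:
    """Hypothetical stand-in exposing only the .title attribute."""
    def __init__(self, title):
        self.title = title


driver = FakeDriver("Silence&QH - Blogland")
print(title_is("Silence&QH - Blogland")(driver))
print(title_contains("Silence&QH")(driver))
print(title_is("Silence&QH")(driver))
```

Any function with this shape can be passed to `WebDriverWait(...).until(...)`, which is how custom wait conditions are usually written.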
9. Browser operations
Maximizing and minimizing the browser
browser.maximize_window()  # Maximize the browser window
browser.minimize_window()  # Minimize the browser window
Setting the window size
browser.set_window_size(480, 800)  # Set the browser width to 480 and the height to 800
Going back and forward
- back(): go back
- forward(): go forward
import time
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('/')
browser.get('/')
browser.get('/')
browser.back()
time.sleep(1)
browser.forward()
browser.quit()
10. Cookie operations
- get_cookies()
- delete_all_cookies()
- add_cookie()
browser.get('/explore')
print(browser.get_cookies())
browser.add_cookie({'name': 'name', 'domain': '', 'value': 'zhaofan'})
print(browser.get_cookies())
browser.delete_all_cookies()
print(browser.get_cookies())
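In crawlers, cookies collected this way are often reused in ordinary HTTP requests after the browser has done the login. As a sketch under that assumption, the helper below turns a list of cookie dicts (the shape `get_cookies()` returns, each with at least `name` and `value` keys) into a `Cookie` header string. The helper name and the sample data are made up for the example.

```python
def cookies_to_header(cookies):
    """Join a list of cookie dicts into a 'name=value; name=value' string."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# Hypothetical cookies in the shape get_cookies() returns: a list of dicts
# that each have at least 'name' and 'value' keys.
sample = [
    {"name": "name", "value": "zhaofan", "domain": "example.com"},
    {"name": "session", "value": "abc123", "domain": "example.com"},
]
print(cookies_to_header(sample))
```

The resulting string can be sent as the `Cookie` request header by whichever HTTP client the crawler uses.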
11. Multi-window management
A new tab is opened by executing the JavaScript command window.open(). All open tabs are listed in browser.window_handles.
The first tab is browser.window_handles[0]. current_window_handle returns the handle of the current window.
import time
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('')
browser.execute_script('window.open()')
print(browser.window_handles)
browser.switch_to.window(browser.window_handles[1])
browser.get('')
time.sleep(1)
browser.switch_to.window(browser.window_handles[0])
browser.get('')
12. Exception handling
There are quite a few exception types. The full list is in the official reference:
/#
Here is just a simple demonstration of looking up an element that does not exist:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException, NoSuchElementException
browser = webdriver.Chrome()
try:
    browser.get('')
except TimeoutException:
    print('Time Out')
try:
    browser.find_element_by_id('hello')
except NoSuchElementException:
    print('No Element')
finally:
    browser.quit()
13. Warning box handling
Handling JavaScript-generated alert, confirm, and prompt boxes in WebDriver is simple: use switch_to.alert to locate the box, then use text, accept(), dismiss(), or send_keys() to work with it.
Methods:
- text: returns the text in the alert/confirm/prompt
- accept(): accepts the current warning box
- dismiss(): dismisses the current warning box
- send_keys(keysToSend): sends the text keysToSend to the warning box
Demonstration
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time
driver = webdriver.Chrome(r"F:\Chrome\ChromeDriver\chromedriver")
driver.implicitly_wait(10)
driver.get('')
# Hover over the "Settings" link
link = driver.find_element_by_link_text('Settings')
ActionChains(driver).move_to_element(link).perform()
# Open search settings
driver.find_element_by_link_text("Search Settings").click()
# Wait 2 s here, otherwise you may get an error
time.sleep(2)
# Save settings
driver.find_element_by_class_name("prefpanelgo").click()
time.sleep(2)
# Accept the warning box
driver.switch_to.alert.accept()
driver.quit()
14. Drop-down box selection
Import the Select class and use it to handle drop-down boxes.
from selenium.webdriver.support.select import Select
Methods of the Select class:
- select_by_value("select_value"): select by the value attribute of an option
- select_by_index(index_value): select by the index of an option in the drop-down box
- select_by_visible_text("text_value"): select by the visible text of an option
The drop-down box in Baidu's search settings is one example:
from selenium import webdriver
from selenium.webdriver.support.select import Select
from time import sleep
driver = webdriver.Chrome(r"F:\Chrome\ChromeDriver\chromedriver")
driver.implicitly_wait(10)
driver.get('')
# 1. Click the "Settings" link
driver.find_element_by_link_text('Settings').click()
sleep(1)
# 2. Open the search settings
driver.find_element_by_link_text("Search Settings").click()
sleep(2)
# 3. Set the number of search results displayed
sel = driver.find_element_by_xpath("//select[@id='nr']")
Select(sel).select_by_value('50')  # display 50 results
sleep(3)
driver.quit()
15. File upload
An upload control implemented with an input tag can be treated as an ordinary input box: the file is uploaded by passing the path of a local file to send_keys().
from selenium import webdriver
import os
driver = webdriver.Chrome()
file_path = 'file:///' + os.path.abspath('')
driver.get(file_path)
# Locate the upload button and add a local file
driver.find_element_by_name("file").send_keys('D:\\upload_file.txt')
driver.quit()
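Building the `file:///` address by string concatenation can go wrong with backslashes or spaces in the path. One alternative, shown here as a sketch, is `pathlib`, which produces a correctly escaped URL. The helper name `local_page_url` and the file name `upfile.html` are hypothetical and used only for illustration.

```python
from pathlib import Path


def local_page_url(path):
    """Return a file:// URL for a local file, resolving relative paths first."""
    return Path(path).resolve().as_uri()


# 'upfile.html' is a hypothetical file name used only for illustration.
print(local_page_url("upfile.html"))
print(local_page_url("my page.html"))
```

The returned string can be passed directly to `driver.get()`.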
16. Window screenshot
Automation cases are executed by the program, so the error messages printed are sometimes not very clear. If a screenshot of the current window is taken when a script fails, the cause of the error can be seen directly in the picture. WebDriver provides get_screenshot_as_file() to capture the current window.
Screenshot method:
get_screenshot_as_file(self, filename): captures the current window and saves the image to a local file
from selenium import webdriver
from time import sleep
driver = webdriver.Firefox(executable_path=r"F:\GeckoDriver\geckodriver")
driver.get('')
driver.find_element_by_id('kw').send_keys('selenium')
driver.find_element_by_id('su').click()
sleep(2)
# Capture the current window and specify where to save the image.
# The image data is PNG, so the file name should end with .png.
driver.get_screenshot_as_file("D:\\baidu_img.png")
driver.quit()
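When screenshots are taken on failure, a fixed file name gets overwritten by the next failure. A common remedy is to put a timestamp in the name. The helper below is a sketch of that idea; `screenshot_path`, its parameters, and the `screenshots` directory name are assumptions made for the example.

```python
from datetime import datetime
from pathlib import Path


def screenshot_path(case_name, directory="screenshots", now=None):
    """Build a unique .png path such as screenshots/login_20240131_093000.png."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(directory) / f"{case_name}_{stamp}.png"


print(screenshot_path("baidu_search"))
```

In a test script this would typically be used inside an `except` block, for example `driver.get_screenshot_as_file(str(screenshot_path("baidu_search")))`, after making sure the target directory exists.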
17. Close the browser
The previous examples use the quit() method, which quits the associated driver and closes all windows. WebDriver also provides the close() method, which closes only the current window. For example, when a test case opens several windows and we want to close just one of them, we use close().
- close() closes a single window
- quit() closes all windows
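If a script fails halfway through, quit() may never be reached and the browser process stays open. One way to guard against this is a small context manager that always calls quit(). The sketch below is an illustration, not part of Selenium: `managed` is a made-up helper, and `FakeDriver` is a hypothetical stand-in so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed(driver):
    """Yield the driver and always call its quit() method afterwards."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in that records whether quit() was called."""
    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with managed(fake) as d:
        raise RuntimeError("simulated failure in the middle of a script")
except RuntimeError:
    pass
print(fake.closed)
```

With a real driver the usage would look like `with managed(webdriver.Chrome()) as browser: ...`.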
18. Keeping Selenium from being detected
Many large websites now detect Selenium. For example, when we visit Taobao and similar sites in a normal browser, the value of window.navigator.webdriver is undefined, whereas it is true when the site is accessed through Selenium. How can this be fixed?
Setting a Chromedriver startup parameter solves the problem. Before starting Chromedriver, set Chrome's experimental option excludeSwitches to the value ['enable-automation']. The full code is below:
from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
19. Examples
Automatic login to CSDN
import time
import numpy as np
from numpy import random
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def ease_out_expo(x):
    # Easing function: fast at the start, slowing down towards the end
    if x == 1:
        return 1
    else:
        return 1 - pow(2, -10 * x)

def get_tracks(distance, seconds, ease_func):
    # Split the total distance into per-step moves that follow the easing curve
    tracks = [0]
    offsets = [0]
    for t in np.arange(0.0, seconds + 0.1, 0.1):
        ease = globals()[ease_func]
        offset = round(ease(t / seconds) * distance)
        tracks.append(offset - offsets[-1])
        offsets.append(offset)
    return offsets, tracks

def drag_and_drop(browser, offset):
    # Locate the slider element
    WebDriverWait(browser, 20).until(
        EC.visibility_of_element_located((By.XPATH, "//*[@class='nc_iconfont btn_slide']"))
    )
    knob = browser.find_element_by_xpath("//*[@class='nc_iconfont btn_slide']")
    offsets, tracks = get_tracks(offset, 0.2, 'ease_out_expo')
    ActionChains(browser).click_and_hold(knob).perform()
    for x in tracks:
        ActionChains(browser).move_by_offset(x, 0).perform()
    # Let go after a short random pause
    ActionChains(browser).pause(random.randint(6, 14) / 10).release().perform()

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--start-maximized")
browser = webdriver.Chrome(options=chrome_options)
browser.get('')
browser.find_element_by_id('kw').send_keys('CSDN')
browser.find_element_by_id('su').click()
WebDriverWait(browser, 20).until(
    EC.visibility_of_element_located((By.PARTIAL_LINK_TEXT, "-Professional IT Technology Community"))
)
browser.find_element_by_partial_link_text('-Professional IT Technology Community').click()
browser.switch_to.window(browser.window_handles[1])  # Switch to the new window
time.sleep(1)
browser.find_element_by_partial_link_text('Login').click()
browser.find_element_by_link_text('Account Password Login').click()
browser.find_element_by_id('all').send_keys('yangbobin')
browser.find_element_by_name('pwd').send_keys('pass-word')
browser.find_element_by_css_selector("button[data-type='account']").click()
time.sleep(5)  # Wait for the slider module and other JS files to finish loading

while True:
    # Perform the mouse drag-and-drop
    drag_and_drop(browser, 261)
    # Wait for the JS verification to run; without this wait, errors are likely
    time.sleep(2)
    # Check whether verification succeeded by waiting for the "Refresh" link
    WebDriverWait(browser, 20).until(
        EC.visibility_of_element_located((By.LINK_TEXT, "Refresh"))
    )
    browser.find_element_by_link_text('Refresh').click()
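To see what the easing function actually does to the drag, the track computation can be run on its own. The version below is a variation written for illustration, not the code above: it uses a step count instead of a time range, which avoids floating-point step sizes, and it needs neither numpy nor a browser.

```python
def ease_out_expo(x):
    """Fast start, slow finish; maps 0..1 onto 0..1."""
    return 1.0 if x >= 1 else 1 - pow(2, -10 * x)


def get_tracks(distance, steps, ease=ease_out_expo):
    """Split `distance` pixels into `steps` per-step moves following `ease`."""
    tracks = []
    previous = 0
    for i in range(1, steps + 1):
        offset = round(ease(i / steps) * distance)
        tracks.append(offset - previous)
        previous = offset
    return tracks


tracks = get_tracks(261, 5)
print(tracks, sum(tracks))
```

The moves start large and shrink towards the end while still adding up to the full distance, which resembles a hand decelerating as the slider approaches its target.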
Automatic login to 163 mailbox
First, the 163 login form is inside an iframe, so switch into it:
browser.switch_to.frame('x-URS-iframe')
This concludes the article on the selenium module for Python crawlers. I hope it is helpful for your learning.