When crawling data with selenium, the usual approach is to let selenium handle the whole process, starting from opening the browser. Sometimes, however, we want the user to open the browser first, go to the target page, and complete login authentication by hand (user name, password, SMS verification code, and the various image CAPTCHAs that are hard to automate), and only then let selenium take over from the logged-in page to crawl data. How do we connect the manual part to the automated part?
General Operation
Routine usage looks like the following: set the initial options, then open the web page directly with the get method.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service


class DriverClass:
    def __init__(self):
        self.driver = self._init_driver()

    def _init_driver(self):
        option = webdriver.ChromeOptions()
        # Hide the usual automation hints
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        option.add_experimental_option('useAutomationExtension', False)
        # Turn off the password manager prompts
        prefs = dict()
        prefs['credentials_enable_service'] = False
        prefs['profile.password_manager_enabled'] = False
        prefs[''] = "Person 1"  # the preference key is missing in the source
        option.add_experimental_option('prefs', prefs)
        option.add_argument('--disable-gpu')
        option.add_argument("--disable-blink-features=AutomationControlled")
        option.add_argument('--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"')
        option.add_argument('--no-sandbox')
        option.add_argument('--ignore-certificate-errors')
        # The driver path is truncated in the source; point it at your chromedriver binary.
        # Selenium 4 takes the driver path through a Service object.
        driver = webdriver.Chrome(service=Service(r"./driver/"), options=option)
        driver.implicitly_wait(2)
        driver.maximize_window()
        return driver

    def get_driver(self) -> webdriver.Chrome:
        if isinstance(self.driver, webdriver.Chrome):
            return self.driver
        raise Exception('Initialization of the browser failed')


if __name__ == '__main__':
    dc = DriverClass()
    driver = dc.get_driver()
    print(driver)
    driver.get("")  # the target URL is missing in the source
Continuing from an open browser
The key to connecting the two parts is to use the same port on both sides: the browser is opened with a given remote debugging port, and selenium is pointed at that same port (otherwise selenium does not know which browser to attach to).
User opens the browser
When opening the browser manually, the user specifies a remote debugging port (9527 here) and a user data directory (any directory of your choice).
C:\Program Files\Google\Chrome\Application> chrome.exe --remote-debugging-port=9527 --user-data-dir="E:\lky_project\tmp_project\handle_qcc_data\\chrome_user_data"
After executing the above command, a new browser window opens.
The user can then go to the target page and complete login authentication and any other manual steps.
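The manual launch above can also be scripted. The following is a minimal sketch, assuming Chrome is installed under the directory shown in the command prompt above; `build_chrome_command` and `launch_chrome` are hypothetical helper names and are not part of selenium.

```python
import subprocess


def build_chrome_command(chrome_path, port, user_data_dir):
    """Return the argument list for a Chrome instance with remote debugging enabled."""
    return [
        chrome_path,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={user_data_dir}",
    ]


def launch_chrome(chrome_path, port, user_data_dir):
    """Start Chrome in the background and return the process handle."""
    return subprocess.Popen(build_chrome_command(chrome_path, port, user_data_dir))


if __name__ == "__main__":
    # Assumed install location; adjust it to your machine.
    chrome = r"C:\Program Files\Google\Chrome\Application\chrome.exe"
    launch_chrome(chrome, 9527, r"E:\lky_project\tmp_project\handle_qcc_data\chrome_user_data")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.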
Program connection to browser
The program attaches to the browser by adding the following option to the selenium configuration:
option.add_experimental_option("debuggerAddress", "127.0.0.1:9527")
With this option, selenium attaches to the browser that the user already opened on the given port instead of launching a new one. The program can then carry on with the remaining tasks through the driver handle.
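Before attaching, it can help to check that a browser is actually listening on the debugging port. This is a small sketch using only the standard library; `debugger_port_open` is a hypothetical helper name, and the defaults simply mirror the address used in this article.

```python
import socket


def debugger_port_open(host="127.0.0.1", port=9527, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


if __name__ == "__main__":
    if not debugger_port_open():
        print("No browser found on port 9527; open Chrome with the debugging port first.")
```

Running this check before creating the driver lets the program ask the user to start the browser, rather than failing inside selenium.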
driver_class.py
from selenium import webdriver
from selenium.webdriver.chrome.service import Service


class DriverClass:
    def __init__(self):
        self.driver = self._init_driver()

    def _init_driver(self):
        option = webdriver.ChromeOptions()
        # option.add_experimental_option('excludeSwitches', ['enable-automation'])
        # option.add_experimental_option('useAutomationExtension', False)
        # prefs = dict()
        # prefs['credentials_enable_service'] = False
        # prefs['profile.password_manager_enabled'] = False
        # prefs[''] = "Person 1"
        # option.add_experimental_option('prefs', prefs)
        option.add_argument('--disable-gpu')
        option.add_argument("--disable-blink-features=AutomationControlled")
        option.add_argument('--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"')
        option.add_argument('--no-sandbox')
        option.add_argument('--ignore-certificate-errors')
        # Attach to the browser the user opened on port 9527
        option.add_experimental_option("debuggerAddress", "127.0.0.1:9527")
        # The driver path is truncated in the source; point it at your chromedriver binary.
        # Selenium 4 takes the driver path through a Service object.
        driver = webdriver.Chrome(service=Service(r"./driver/"), options=option)
        driver.implicitly_wait(2)
        # driver.maximize_window()
        return driver

    def get_driver(self) -> webdriver.Chrome:
        if isinstance(self.driver, webdriver.Chrome):
            return self.driver
        raise Exception('Initialization of the browser failed')


if __name__ == '__main__':
    dc = DriverClass()
    driver = dc.get_driver()
    print(driver)
    # From here on, the program uses the driver handle to carry out the remaining operations
Things to note
Note that some of the option settings in the attaching version above are commented out. Because the program attaches to a browser that is already running, options that take effect at launch were fixed when the user started the browser, and setting them again through the connection is not supported.
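One way to keep the two modes in a single place is to decide the option set from whether a debugger address is given. The sketch below is only an illustration: `option_plan` and `apply_plan` are hypothetical helpers, and the split between launch-only and attach options follows the lines that are commented out in the listing above.

```python
def option_plan(debugger_address=None):
    """Describe which Chrome options to set for a fresh launch or for attaching.

    Returns a dict with 'arguments' (for add_argument) and
    'experimental' (for add_experimental_option).
    """
    plan = {
        "arguments": ["--disable-gpu", "--no-sandbox"],
        "experimental": {},
    }
    if debugger_address:
        # Attach mode: only the debugger address is passed as an experimental option
        plan["experimental"]["debuggerAddress"] = debugger_address
    else:
        # Fresh launch: launch-time options can still be set
        plan["experimental"]["excludeSwitches"] = ["enable-automation"]
        plan["experimental"]["useAutomationExtension"] = False
        plan["experimental"]["prefs"] = {"credentials_enable_service": False}
    return plan


def apply_plan(option, plan):
    """Copy a plan onto a ChromeOptions-like object and return it."""
    for argument in plan["arguments"]:
        option.add_argument(argument)
    for name, value in plan["experimental"].items():
        option.add_experimental_option(name, value)
    return option
```

With selenium installed, `apply_plan(webdriver.ChromeOptions(), option_plan("127.0.0.1:9527"))` would produce the attach configuration, and `option_plan()` the fresh-launch one.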
Practical examples
For example, after manually opening the browser on port 9527, log in to Qichacha and go to advanced search, then use the program to fetch the number of companies holding each qualification (operating too frequently may trigger verification or blocking, so be cautious!) and finally generate the result file. The run may be interrupted partway, so the script below resumes from where it stopped: results are saved to the file, and a later run queries only the qualifications that have not been queried yet.
driver_class.py is the file shown above.
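The resume idea can be shown in isolation from the page logic. This is a minimal sketch under stated assumptions: `load_progress`, `save_progress` and `run_with_resume` are hypothetical names, and `query` stands in for whatever function fetches one count from the page.

```python
import json


def load_progress(path):
    """Load earlier results, or start empty if the file is missing or unreadable."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        return {}


def save_progress(path, data):
    """Write the results collected so far as readable JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


def run_with_resume(path, names, query):
    """Query only the names without a stored result; always save on exit."""
    data = load_progress(path)
    try:
        for name in names:
            if data.get(name):
                continue  # already queried in an earlier run
            data[name] = query(name)
    finally:
        # Runs on success and on interruption, so partial results are kept
        save_progress(path, data)
    return data
```

Because the save happens in a `finally` block, an exception in the middle of a run still leaves the completed entries on disk for the next run to skip.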
import json
import re
import time

from selenium.webdriver.common.by import By

from driver_class import DriverClass

dc = DriverClass()
driver = dc.get_driver()

# The file name is missing in the source; set it to your own JSON path before running.
RESULT_FILE = ""

# The text literals in the XPath expressions must match the labels shown on the page.
xpath_prefix = '//div/div/div/div/span[text()="Qualification Certificate"]/following-sibling::div'
XPATH_TEXT = './span/span/div/span[@class="text-dk"]'
XPATH_SWITCHER = './span[contains(@class, "qccd-tree-switcher")]'
XPATH_CHECKBOX = './span[contains(@class, "checkbox")]'


def checkbox_select(element_checkbox):
    """Tick the check box if it is not ticked yet."""
    class_attribute = element_checkbox.get_attribute("class")
    if "checked" not in class_attribute:
        element_checkbox.find_element(By.XPATH, './span[@class="qccd-tree-checkbox-inner"]').click()


def checkbox_unselect(element_checkbox):
    """Untick the check box if it is ticked."""
    class_attribute = element_checkbox.get_attribute("class")
    if "checked" in class_attribute:
        element_checkbox.find_element(By.XPATH, './span[@class="qccd-tree-checkbox-inner"]').click()


def get_amount(element_checkbox):
    """Get the number of enterprises for the given check box."""
    checkbox_select(element_checkbox)
    xpath_confirm = xpath_prefix + '/div/div/div/div/div[text()="Sure"]'
    driver.find_element(By.XPATH, xpath_confirm).click()
    time.sleep(0.5)
    try:
        xpath_result = '//div/div/div[@class="search-btn limit-svip"]'
        result = str(driver.find_element(By.XPATH, xpath_result).text)
    except Exception as e:
        print(f"abnormal: {str(e)}")
        result = "0"
    result = result.replace(",", "")
    match_object = re.search(r"(\d+)", result)
    # Fall back to "0" when the button text contains no number
    amount = match_object.group(1) if match_object else "0"
    print(f"number:{amount}")
    # Clear the result to avoid accidentally closing the box when clicking on the selection
    xpath_clear = '//div/div/a[contains(text(), "clear")]'
    try:
        driver.find_element(By.XPATH, xpath_clear).click()
    except Exception:
        pass
    xpath_select = xpath_prefix + '[@class="trigger-container"]'
    driver.find_element(By.XPATH, xpath_select).click()
    time.sleep(0.2)
    checkbox_unselect(element_checkbox)
    return amount


def extend_options():
    """Expand the collapsed items and collect the data, at most three layers deep."""
    # Load earlier results so that an interrupted run can resume
    try:
        with open(RESULT_FILE, encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        data = {}
    try:
        xpath_first_class = xpath_prefix + '//div/ul/li[@role="treeitem"]'
        # xpath_first_class = xpath_prefix + '//div/ul/li/span[contains(@class, "qccd-tree-switcher")]'
        first_item_list = driver.find_elements(By.XPATH, xpath_first_class)
        for item_li in first_item_list:
            text_dk1 = item_li.find_element(By.XPATH, XPATH_TEXT).text
            data[text_dk1] = data.get(text_dk1, {})
            print(f"{text_dk1}")
            switcher = item_li.find_element(By.XPATH, XPATH_SWITCHER)
            class_attribute = switcher.get_attribute("class")
            element_checkbox = item_li.find_element(By.XPATH, XPATH_CHECKBOX)
            if "close" in class_attribute:
                switcher.click()
                time.sleep(0.1)
            elif "noop" in class_attribute:
                # The current node has no child nodes
                if not data[text_dk1]:
                    data[text_dk1] = get_amount(element_checkbox)
                continue
            # After clicking, the lower level ul/li is displayed
            second_item_list = item_li.find_elements(By.XPATH, "./ul/li")
            for second_item_li in second_item_list:
                text_dk2 = second_item_li.find_element(By.XPATH, XPATH_TEXT).text
                data[text_dk1][text_dk2] = data[text_dk1].get(text_dk2, {})
                print(f"--{text_dk2}")
                switcher = second_item_li.find_element(By.XPATH, XPATH_SWITCHER)
                class_attribute = switcher.get_attribute("class")
                element_checkbox = second_item_li.find_element(By.XPATH, XPATH_CHECKBOX)
                if "close" in class_attribute:
                    switcher.click()
                    time.sleep(0.1)
                elif "noop" in class_attribute:
                    # The current node has no child nodes
                    if not data[text_dk1][text_dk2]:
                        data[text_dk1][text_dk2] = get_amount(element_checkbox)
                    continue
                # After clicking, the lower level ul/li is displayed
                third_item_list = second_item_li.find_elements(By.XPATH, "./ul/li")
                for third_item_li in third_item_list:
                    text_dk3 = third_item_li.find_element(By.XPATH, XPATH_TEXT).text
                    data[text_dk1][text_dk2][text_dk3] = data[text_dk1][text_dk2].get(text_dk3, {})
                    print(f"----{text_dk3}")
                    # The third layer is not expanded any further; tick the check box directly
                    element_checkbox = third_item_li.find_element(By.XPATH, XPATH_CHECKBOX)
                    if not data[text_dk1][text_dk2][text_dk3]:
                        data[text_dk1][text_dk2][text_dk3] = get_amount(element_checkbox)
    finally:
        # Save whatever has been collected, even if the run was interrupted
        with open(RESULT_FILE, 'w', encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)


def spider_data():
    # Try to close the qualification certificate selection box and clear the options
    xpath_close = xpath_prefix + '/div/div/div/a[@class="nclose"]'
    xpath_clear = '//div/div/a[contains(text(), "clear")]'
    try:
        driver.find_element(By.XPATH, xpath_close).click()
    except Exception:
        pass
    try:
        driver.find_element(By.XPATH, xpath_clear).click()
    except Exception:
        pass
    # Click the qualification certificate selection box
    xpath_select = xpath_prefix + '[@class="trigger-container"]'
    driver.find_element(By.XPATH, xpath_select).click()
    time.sleep(2)
    extend_options()
    # Cancel button (not used here)
    xpath_cancel = xpath_prefix + '/div/div/div/div/div[text()="Cancel"]'
    # OK button
    xpath_confirm = xpath_prefix + '/div/div/div/div/div[text()="Sure"]'
    driver.find_element(By.XPATH, xpath_confirm).click()


if __name__ == '__main__':
    spider_data()
The generated file looks like this:
{
  "Construction Qualification": {
    "Engineering Design Qualification Certificate": {
      "Special qualification for engineering design": "26329",
      "Architectural Engineering Design Firm": "356",
      "Engineering Design Industry Qualification": "4487",
      "Professional Qualification for Engineering Design": "19902",
      "Comprehensive Qualification for Engineering Design": "98"
    },
    "Engineering Survey Qualification Certificate": {
      "Comprehensive Qualification for Engineering Survey": "377",
      "Professional Qualification for Engineering Survey": "7464",
      "Engineering Survey Labor Qualification": "3019"
    },
    ...
  },
  "Food and Agricultural Product Certification": {
    "Organic products(OGA)": "49868",
    "Good agricultural practices(GAP)": "6449",
    "Food quality certification(Wine)": "151",
    "Green food certification": "34723",
    "Green Market Certification": "318",
    "Pollution-free agricultural products": "31067",
    "Food Safety Management System Certification": "72075",
    "Hazard Analysis and Critical Control Point Certification": "51844",
    "Good production specification certification for dairy production enterprises": "445",
    "Hazard analysis and key control points of dairy production enterprises(HACCP)System certification": "570",
    "Feed products": "85"
  },
  "Other qualifications": {
    "School License": "192010",
    "Agent Accounting License": "34588",
    "Accounting Firm Practice Certificate": "12252",
    "DOC Certificate": "982",
    "SMC Certificate": "1886",
    "Famous and Special New Agricultural Products Certificate": "1818",
    "Comprehensive Qualifications for Bidding": "36317",
    "Blockchain Information Service Filing": "2765",
    "Medical Institution Practice License": "570877",
    "CCC Factory Certification": "16154",
    "Sanitation License": "3244"
  }
}
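Since the counts are stored as strings in a nested structure of varying depth, a small recursive helper can turn the file into flat rows for further processing. This is only a sketch; `flatten_counts` is a hypothetical name, and the sample data is a subset of the output shown above.

```python
def flatten_counts(tree, path=()):
    """Yield (path_tuple, count) for every leaf of the nested result dict."""
    for name, value in tree.items():
        if isinstance(value, dict):
            # Descend into the next layer, extending the path
            yield from flatten_counts(value, path + (name,))
        else:
            yield path + (name,), int(value)


if __name__ == "__main__":
    sample = {
        "Construction Qualification": {
            "Engineering Survey Qualification Certificate": {
                "Engineering Survey Labor Qualification": "3019",
            },
        },
        "Other qualifications": {"School License": "192010"},
    }
    for path, count in flatten_counts(sample):
        print(" / ".join(path), count)
```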