Flow of this post (the basic idea of crawling):
Data source analysis (only after you have found where the data comes from can you implement the crawler in code):
- Determine the requirements (what content needs to be crawled?): crawl CSDN article content and save it as a PDF
- Capture packets with the browser developer tools and analyze where the data comes from
Code implementation process:
- Send a request: request the article list page
- Get data: get the source code of the web page
- Parse data: extract the URL and the title of each article
- Send a request: request the URL of the article detail page
- Get data: get the source code of the web page
- Parse data: extract the article title / article content
- Save data: save the article content as an HTML file
- Convert the HTML file to a PDF file
- Multi-page crawling
1. Import modules
import requests  # sends requests; third-party module: pip install requests
import parsel    # data parsing module; third-party module: pip install parsel
import os        # file and directory handling module
import re        # regular expression module
import pdfkit    # HTML-to-PDF conversion; pip install pdfkit
2. Create folders
filename = 'pdf\\'      # folder for the PDF files
filename_1 = 'html\\'   # folder for the HTML files
if not os.path.exists(filename):    # if this folder does not exist
    os.mkdir(filename)              # create it automatically
if not os.path.exists(filename_1):  # if this folder does not exist
    os.mkdir(filename_1)            # create it automatically
3. Send request
for page in range(1, 11):
    print(f'================= Crawling page {page} =================')
    url = f'https://blog.csdn.net/qdPython/article/list/{page}'
    # When python code sends a request, the server (if the request is not disguised)
    # recognizes it as a crawler and will not return the data.
    # The client (browser) sends a request to the server >>> the server receives the request >>> and returns a response to the browser.
    # headers: request headers, so the python code can masquerade as a browser when making the request.
    # The header fields can be found and copied in the developer tools.
    # Not all fields are required:
    # user-agent: basic information about the browser (a wolf in sheep's clothing, so it can blend in with the sheep)
    # cookie: user information, detects whether the account is logged in (some sites require login to see certain data, e.g. some content on Bilibili)
    # referer: anti-hotlink check, tells the server which URL the request was redirected from (e.g. Bilibili videos / image downloads / Vipshop product data)
    # Which fields are needed depends on the site; analyze case by case.
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
    }
    # Request method: GET or POST; you can see a URL's request method in the developer tools.
    # Search / login / query requests are usually POST requests.
    response = requests.get(url=url, headers=headers)
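The comments above mention cookie and referer; for this crawl only the user-agent is needed, but if a target page required login or checked the referer, the same headers dict could be extended as in the following sketch (the values are placeholders, not real ones):

# optional extra header fields; only needed when the target site checks them
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    'cookie': '<copy your own cookie string from the developer tools>',  # placeholder, only if login is required
    'referer': 'https://blog.csdn.net/',  # placeholder: the page the request is supposed to come from
}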
4. Data parsing
# The fetched HTML string needs to be converted into a selector (parser) object
selector = parsel.Selector(response.text)
# getall() returns a list
href = selector.css('.article-list a::attr(href)').getall()
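A small self-contained example of the difference between get() and getall() in parsel (the HTML string is made up for illustration):

import parsel

demo = parsel.Selector('<ul><li><a href="/a/1">one</a></li><li><a href="/a/2">two</a></li></ul>')
print(demo.css('a::attr(href)').get())     # '/a/1'           -> only the first match
print(demo.css('a::attr(href)').getall())  # ['/a/1', '/a/2'] -> a list of all matches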
5. Loop over the list to extract every element and request each article detail page
for index in href:
    # Send a request: request the URL of the article detail page
    response_1 = requests.get(url=index, headers=headers)
    selector_1 = parsel.Selector(response_1.text)
    # Parse data: extract the article title / article content
    title = selector_1.css('#articleContentId::text').get()
    new_title = change_title(title)
    content_views = selector_1.css('#content_views').get()
    # html_str is an HTML page template with an {article} placeholder (see the sketch below)
    html_content = html_str.format(article=content_views)
    html_path = filename_1 + new_title + '.html'
    pdf_path = filename + new_title + '.pdf'
    # Save data: save the article content as an HTML file
    with open(html_path, mode='w', encoding='utf-8') as f:
        f.write(html_content)
    print('Saving:', title)
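The loop above formats the article body into an html_str page template that is not shown in this section. A minimal sketch of what such a template could look like (the exact markup is an assumption; any HTML skeleton with an {article} placeholder works):

# assumed HTML skeleton used by html_str.format(article=...);
# the original template is not shown in this post, so this is only an example
html_str = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <title>Document</title>
</head>
<body>
{article}
</body>
</html>
"""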
6. Replacement of special characters
def change_title(name):
    # characters that are not allowed in Windows file names
    mode = re.compile(r'[\\\/\:\*\?\"\<\>\|]')
    new_name = re.sub(mode, '_', name)
    return new_name
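A quick check of what the helper does (the example title is made up):

# made-up title containing characters that Windows does not allow in file names
print(change_title('How to use the "os" module: Part 1/2?'))
# -> How to use the _os_ module_ Part 1_2_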
Run the code to download the HTML files.
7. Convert the HTML files to PDF files
config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')  # adjust to your local wkhtmltopdf install path
pdfkit.from_file(html_path, pdf_path, configuration=config)
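The call above converts a single file; to convert every HTML file saved in the html folder, a sketch along the lines below can be used (it reuses the filename / filename_1 / config variables defined earlier; the enable-local-file-access option is only needed if your wkhtmltopdf build blocks local files):

# sketch: convert every saved HTML file in the html folder into a PDF in the pdf folder
for file in os.listdir(filename_1):
    if not file.endswith('.html'):
        continue
    html_path = filename_1 + file
    pdf_path = filename + file.replace('.html', '.pdf')
    pdfkit.from_file(
        html_path,
        pdf_path,
        configuration=config,
        options={'enable-local-file-access': None},  # only needed if local file access is blocked
    )
    print('Converted:', file)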
This is the end of this article on crawling CSDN articles with Python and converting them into PDF files. For more content about crawling CSDN articles with Python, please search my previous articles or continue to browse the related articles below. I hope you will support me more in the future!