
Crawl CSDN articles with Python and convert them to PDF files

The flow of this post (the basic idea of crawling):

Data source analysis (only once you have found where the data comes from can you implement the crawl in code):

  • Determine the requirements (what content is to be crawled?): crawl CSDN article content and save it as PDF.
  • Capture packets with the browser's developer tools to analyze where the data comes from.

Code implementation process:

  • Send a request: request the article list page.
  • Get data: get the source code of the web page.
  • Parse data: extract each article's URL and title.
  • Send a request: request the URL of each article's detail page.
  • Get data: get the source code of the web page.
  • Parse data: extract the article title and article content.
  • Save data: save the article content as an HTML file.
  • Convert the HTML file to a PDF file.
  • Crawl multiple pages.

1. Import module

import requests  # sends HTTP requests; third-party module: pip install requests
import parsel    # parses HTML; third-party module: pip install parsel
import os        # file and folder operations (standard library)
import re        # regular expressions (standard library)
import pdfkit    # converts HTML to PDF; pip install pdfkit

2. Creating Folders

filename = 'pdf\\'     # folder for the PDF files
filename_1 = 'html\\'  # folder for the HTML files
if not os.path.exists(filename):  # if the folder does not exist
    os.mkdir(filename)            # create it automatically

if not os.path.exists(filename_1):  # if the folder does not exist
    os.mkdir(filename_1)            # create it automatically
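
Equivalently, os.makedirs with exist_ok=True collapses the existence check and the creation into one call:

import os

# exist_ok=True makes the call a no-op if the folder already exists
os.makedirs('pdf', exist_ok=True)
os.makedirs('html', exist_ok=True)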

3. Send request

for page in range(1, 11):
    print(f'================= Crawling page {page} data =================')
    url = f'https://blog.csdn.net/qdPython/article/list/{page}'  # article list page of the blog user qdPython

    # If plain Python code sends the request without any disguise, the server can
    # recognize it as a crawler and refuse to return data.
    # Normal flow: the client (browser) sends a request >>> the server receives it >>> the server returns a response.
    # The headers request headers simply disguise this Python code as a browser request.
    # The header fields can be copied from the developer tools; not all of them are required:
    # user-agent: basic information about the browser (a wolf in sheep's clothing, blending in with the flock)
    # cookie: user information; detects whether an account is logged in (some sites require login to see data, e.g. some content on Bilibili)
    # referer: anti-hotlinking; tells the server which URL the request was redirected from (e.g. Bilibili videos / image downloads / Vipshop product data)
    # Which fields are required depends on the site, so analyze it case by case.
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
    }
    # Request method: GET or POST; you can check a URL's request method in the developer tools.
    # Search / login / query forms are usually POST requests.
    response = requests.get(url=url, headers=headers)

4. Data parsing

# Convert the fetched HTML string into a parsel Selector object
selector = parsel.Selector(response.text)
# getall() returns a list of all matches
href = selector.css('.article-list a::attr(href)').getall()

5. Extract every element of the list (crawl each article's detail page)

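The loop below fills an html_str template with the extracted article body, but the original code uses html_str without defining it. A minimal sketch of such a full-page template (the exact markup is an assumption):

# Minimal full-page HTML shell; {article} receives the extracted
# #content_views markup. The template itself is an assumed sketch.
html_str = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
{article}
</body>
</html>
"""
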
for index in href:
    # Send a request to the URL of the article detail page
    response_1 = requests.get(url=index, headers=headers)
    selector_1 = parsel.Selector(response_1.text)
    title = selector_1.css('#articleContentId::text').get()
    new_title = change_title(title)
    content_views = selector_1.css('#content_views').get()
    html_content = html_str.format(article=content_views)
    html_path = filename_1 + new_title + '.html'
    pdf_path = filename + new_title + '.pdf'
    with open(html_path, mode='w', encoding='utf-8') as f:
        f.write(html_content)
        print('Saving:', title)

6. Replacement of special characters

# Note: define this function before running the loop in step 5.
def change_title(name):
    # Characters that are illegal in Windows file names: \ / : * ? " < > |
    mode = re.compile(r'[\\\/\:\*\?\"\<\>\|]')
    new_name = re.sub(mode, '_', name)
    return new_name
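
A quick check of the replacement:

print(change_title('Python: A/B Testing?'))  # -> Python_ A_B Testing_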

Running the code downloads the HTML files.

7. Convert to PDF files

# Point pdfkit at the wkhtmltopdf executable (default Windows install path)
config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
pdfkit.from_file(html_path, pdf_path, configuration=config)
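
Note that after the loop in step 5 finishes, html_path and pdf_path only hold the paths from the last iteration. A sketch that converts every saved HTML file instead (using the pdf\ and html\ folders created in step 2):

import os
import pdfkit

config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')

# Convert every HTML file saved in step 5 into a matching PDF
for name in os.listdir('html'):
    if name.endswith('.html'):
        pdfkit.from_file(
            os.path.join('html', name),
            os.path.join('pdf', name.replace('.html', '.pdf')),
            configuration=config,
        )
        print('Converted:', name)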

This concludes our look at crawling CSDN articles with Python and turning them into PDF files. For more on crawling CSDN article content with Python, please search my previous articles or continue browsing the related articles below, and I hope you will support me in the future!