I recently needed to crawl the content of a web page rendered by JavaScript, so I looked into how to do it. The approach relies mainly on puppeteer, which is a Node library. To use it from PHP, you also need spatie/browsershot.
Environment requirements
Environment | Requirement |
---|---|
Node | >=7.6.0 |
PHP | >=7.1 |
PHP extension | php_sockets, php_exif |
puppeteer
Puppeteer is a Node library. I installed it with npm directly inside the PHP project and then call it through spatie/browsershot. Alternatively, you can create a separate Node project, install the library there, and expose a port so that callers pass in a URL and get the rendered HTML back through an HTTP interface.
npm i puppeteer --save
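For readers who prefer the separate Node project, here is a minimal sketch of what such a service could look like. It is not what I run myself. The `url` query parameter, the function names, and the `networkidle0` wait condition are all assumptions made for illustration.

```javascript
const http = require('http');

// Extract and validate the ?url= query parameter.
// Returns null when it is missing or is not an http(s) URL.
function parseTargetUrl(requestUrl) {
  const parsed = new URL(requestUrl, 'http://localhost');
  const target = parsed.searchParams.get('url');
  if (!target) return null;
  try {
    const protocol = new URL(target).protocol;
    return protocol === 'http:' || protocol === 'https:' ? target : null;
  } catch (err) {
    return null; // not a parseable absolute URL
  }
}

// Render a page in headless Chromium and return its HTML after JS has run.
async function renderWithPuppeteer(targetUrl) {
  const puppeteer = require('puppeteer'); // loaded lazily, only when rendering
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Assumption: waiting for the network to go idle is "rendered enough".
    await page.goto(targetUrl, { waitUntil: 'networkidle0' });
    return await page.content();
  } finally {
    await browser.close();
  }
}

// Build the HTTP server without starting it.
// `render` is injectable so it can be swapped out or stubbed.
function createRenderServer(render = renderWithPuppeteer) {
  return http.createServer(async (req, res) => {
    const target = parseTargetUrl(req.url);
    if (!target) {
      res.writeHead(400, { 'Content-Type': 'text/plain; charset=utf-8' });
      res.end('Missing or invalid "url" query parameter');
      return;
    }
    try {
      const html = await render(target);
      res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
      res.end(html);
    } catch (err) {
      res.writeHead(502, { 'Content-Type': 'text/plain; charset=utf-8' });
      res.end('Render failed: ' + err.message);
    }
  });
}
```

To try it, call `createRenderServer().listen(3000)` (the port is arbitrary) and request `/?url=` followed by the URL-encoded address of the page. Launching a new browser per request keeps the sketch short; a real service would probably reuse one browser instance.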
Install Chromium offline
Installing puppeteer also downloads Chromium. That download may fail for well-known reasons, so an offline installation method is provided below.
Skip the Chromium download
If you have already run the previous command and it is downloading Chromium, press Ctrl+C to stop it. If you have not run it yet, install with the following command instead.
npm i puppeteer --ignore-scripts
Get the Chromium version number to download
Open /node_modules/puppeteer/ and search for chromium_revision to find the corresponding version number.
"puppeteer": { "chromium_revision": "756035", "firefox_revision": "latest" }
Download the corresponding version of Chromium
Replace the placeholder in braces below with the version number obtained above. For example, my local machine is win x64, so the download address is /chromium-browser-snapshots/Win_x64/756035/
Mac download address: /chromium-browser-snapshots/Mac/{chromiumVersion}/
Windows 64-bit download address: /chromium-browser-snapshots/Win_x64/{chromiumVersion}/
Windows 32-bit download address: /chromium-browser-snapshots/Win/{chromiumVersion}/
Linux x86 download address: /chromium-browser-snapshots/Linux/{chromiumVersion}/
Linux x64 download address: /chromium-browser-snapshots/Linux_x64/{chromiumVersion}/
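The substitution can be expressed as a small helper. This is only a sketch: the platform keys (`mac`, `win64`, and so on) are names I made up, and the function returns just the path portion shown in the list, so prepend whichever snapshot host you download from.

```javascript
// Snapshot folder names, copied from the address list in this article.
// The keys on the left are hypothetical labels used only in this example.
const SNAPSHOT_FOLDERS = {
  mac: 'Mac',
  win64: 'Win_x64',
  win32: 'Win',
  linux32: 'Linux',
  linux64: 'Linux_x64',
};

// Build the download path for a platform and Chromium revision.
function chromiumSnapshotPath(platform, revision) {
  if (!Object.prototype.hasOwnProperty.call(SNAPSHOT_FOLDERS, platform)) {
    throw new Error('Unknown platform: ' + platform);
  }
  if (!/^\d+$/.test(String(revision))) {
    throw new Error('Revision must be numeric: ' + revision);
  }
  return `/chromium-browser-snapshots/${SNAPSHOT_FOLDERS[platform]}/${revision}/`;
}

console.log(chromiumSnapshotPath('win64', '756035'));
// prints "/chromium-browser-snapshots/Win_x64/756035/"
```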
Extract
Unzip the downloaded Chromium package into the .local_chromium/win64-{chromium version number}/ directory inside puppeteer. Taking mine as an example, the result is /node_modules/puppeteer/.local_chromium/win64-756035/chrome-win/. All done!
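To double-check that the archive ended up in the right place, something like the following could be used. It is a sketch based only on my Windows x64 layout: the `win64` prefix and the `chrome-win` folder come from the example path above, and other platforms use different names that I have not verified here.

```javascript
const fs = require('fs');
const path = require('path');

// Directory that the downloaded archive should be unzipped into.
// `platform` defaults to 'win64', the only prefix shown in this article.
function localChromiumDir(projectRoot, revision, platform = 'win64') {
  return path.join(
    projectRoot, 'node_modules', 'puppeteer', '.local_chromium',
    `${platform}-${revision}`
  );
}

// True once the unzipped folder (chrome-win in my case) is in place.
function isChromiumExtracted(projectRoot, revision, folder = 'chrome-win') {
  return fs.existsSync(path.join(localChromiumDir(projectRoot, revision), folder));
}
```

Running `isChromiumExtracted(process.cwd(), '756035')` from the project root should return `true` after the unzip step.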
spatie/browsershot
browsershot is a composer package. I have used spatie/laravel-permission before; both are produced by the same team.
composer require spatie/browsershot
Usage
In fact, the hard part is finding the right tools and getting them installed; using them is very simple. Here is a very simple example. For more methods, see the official documentation.
<?php

use Spatie\Browsershot\Browsershot;

class Spider
{
    /**
     * Get html content
     * @param $url
     * @return string
     */
    public static function getBodyHtml($url)
    {
        return Browsershot::url($url)->bodyHtml();
    }
}
Summary
That concludes this article on using puppeteer from PHP to crawl page content rendered by JS.