How to make an image crawler for any website with Puppeteer
Making a crawler is not difficult: analyze a target website, then implement the crawling code based on that analysis. However, it is not easy to make a crawler that works on any website rather than on a single target.
This article shares the problems I encountered, and the solutions I found, while making an image crawler for any website with Puppeteer. The crawler receives the URL of any website and returns an array of the URLs of the images shown on the target page. The video below is a demo.
TL;DR: Problems and Solutions
- How to collect the URLs of images loaded on a webpage?
> Intercept requests from the webpage and collect the URLs of the image requests.
- When should the array of image URLs be returned?
> When no other image request occurs within 3 seconds of the last image request.
- How to solve the infinite loop of image requests caused by the on-error handler function registered on image elements?
> Respond to the intercepted image request with a buffer, decoded from base64, of a 1x1px image.
- How to trigger image lazy-loading?
> Scroll down the window of the web page every time an image request occurs.
The code below is the full code for the solutions above. I will explain the details one by one.
1. How to collect the URLs of images loaded on a webpage?
> Intercept requests from the webpage and collect the URLs of the image requests.
Method A: Parse the HTML and extract the URLs. There are 3 problems with this method.
# problem 1. The URL of an image is written in various forms. The img tag has src and srcset attributes for the image URL. A picture tag can also contain image URLs.
# problem 2. The CSS background-image property can contain an image URL.
# problem 3. The images shown on the web page may differ from the images extracted by HTML parsing, because CSS affects what is rendered.
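To illustrate problem 1, here is a minimal sketch of what Method A would need for the srcset attribute alone. The function name parseSrcset is hypothetical (it is not part of the crawler), and the comma split is a simplification that assumes no URL contains a comma. The src attribute, picture sources, and CSS backgrounds would each need similar handling.

```javascript
// Hypothetical helper, for illustration only: extract the URLs from a srcset value.
// Simplification: assumes candidates are separated by commas and that no URL
// itself contains a comma (data: URIs, for example, would break this).
function parseSrcset(srcset) {
  return srcset
    .split(",")
    .map(candidate => candidate.trim().split(/\s+/)[0]) // drop the "2x" / "480w" descriptor
    .filter(url => url.length > 0);
}

console.log(parseSrcset("small.jpg 480w, large.jpg 1080w"));
```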
Method B: Intercept requests from the webpage and collect the URLs of the image requests. I chose this method.
Enable request interception on the page instance.
await page.setRequestInterception(true);
Collect the URLs of requests whose resource type is image.
const images = new Set();
page.on("request", request => {
  if (request.resourceType() === "image") {
    ...
    images.add(request.url());
    ...
  } else {
    request.continue();
  }
});
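The collecting logic can be tried without a browser. Below is a minimal sketch: collectImageUrl and the fake request objects are hypothetical stand-ins, and only resourceType(), url(), and continue() mirror methods of Puppeteer's request object. For simplicity, this sketch lets every request continue.

```javascript
// Hypothetical helper: record the URL of image requests in a Set.
// A Set keeps each URL once, even if the page requests it several times.
function collectImageUrl(images, request) {
  if (request.resourceType() === "image") {
    images.add(request.url());
  }
  // Simplification for this sketch: every request is allowed to continue.
  request.continue();
}

// Fake request objects standing in for the ones Puppeteer passes to the handler.
const fakeRequest = (type, url) => ({
  resourceType: () => type,
  url: () => url,
  continue: () => {},
});

const collected = new Set();
collectImageUrl(collected, fakeRequest("image", "https://example.com/a.png"));
collectImageUrl(collected, fakeRequest("script", "https://example.com/app.js"));
collectImageUrl(collected, fakeRequest("image", "https://example.com/a.png"));
console.log(Array.from(collected));
```

With Puppeteer, the same function could be registered as page.on("request", request => collectImageUrl(images, request)).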
2. When should the array of image URLs be returned?
> When no other image request occurs within 3 seconds of the last image request.
The browser fires 2 events during the page loading process.
# event 1. DOMContentLoaded
: This event is too early to return the array. The DCL event doesn’t guarantee that images have loaded.
# event 2. load
: The load event guarantees that the images found during HTML parsing have loaded, but it gives no guarantee for image resources added dynamically through Ajax.
Neither of these 2 events can decide when to respond, so I made a custom event using a timeout. When no other image request occurs within 3 seconds of the last image request, the response is sent.
let timeout;
const cb = async () => {
  res.send(Array.from(images));
  await page.close();
  await browser.close();
};
page.on("request", request => {
  if (request.resourceType() === "image") {
    images.add(request.url());
    // Setting 3 seconds timeout
    clearTimeout(timeout);
    timeout = setTimeout(cb, 3000);
  } else {
    request.continue();
  }
});
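The timeout logic above is a debounce: every image request pushes the deadline back. A minimal standalone sketch follows. createIdleTimer is a hypothetical name, and the delay is a parameter so the behavior can be tried with a short value instead of 3 seconds.

```javascript
// Hypothetical helper: call onIdle once no touch() has happened for idleMs.
function createIdleTimer(onIdle, idleMs) {
  let timeout;
  return {
    // Call this on every image request; it restarts the countdown.
    touch() {
      clearTimeout(timeout);
      timeout = setTimeout(onIdle, idleMs);
    },
    // Stop waiting, e.g. when the page is closed for another reason.
    cancel() {
      clearTimeout(timeout);
    },
  };
}

// Three simulated "requests" 10 ms apart: onIdle runs once, 50 ms after the last one.
const idleTimer = createIdleTimer(() => console.log("no image request for 50 ms"), 50);
idleTimer.touch();
setTimeout(() => idleTimer.touch(), 10);
setTimeout(() => idleTimer.touch(), 20);
```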
3. How to solve the infinite loop of image requests caused by the on-error handler function?
> Respond to the intercepted image request with a buffer, decoded from base64, of a 1x1px image.
If you abort the request, the error handler registered on the image element is executed because the image element can’t render a proper image. In many cases, the error handler contains logic that assigns a default image URL to the image’s src attribute. So if the image element has an error handler that assigns a default image URL, and the request for that default image is aborted as well, the image request falls into an infinite loop.
if (request.resourceType() === "image") {
  request.abort();
  images.add(request.url());
}
Therefore, we should respond to the intercepted image request with a proper image instead of aborting it. I chose a buffer, decoded from base64, of a 1x1px GIF image as the proper image.
request.respond({
  status: 200,
  contentType: "image/gif",
  body: Buffer.from(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7",
    "base64"
  )
});
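As a quick check that the base64 string above really is a 1x1px GIF, it can be decoded in Node.js without a browser. In this sketch, describeGif is a hypothetical helper. It relies on the GIF layout, where the 6-byte signature is followed by the width and height as little-endian 16-bit integers.

```javascript
const placeholder = Buffer.from(
  "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7",
  "base64"
);

// Hypothetical helper: read the signature and logical screen size of a GIF.
function describeGif(buffer) {
  return {
    signature: buffer.toString("ascii", 0, 6), // "GIF87a" or "GIF89a"
    width: buffer.readUInt16LE(6),
    height: buffer.readUInt16LE(8),
  };
}

console.log(describeGif(placeholder));
```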
4. How to trigger image lazy-loading?
> Scroll down the window of the web page every time an image request occurs.
page.evaluate(_ => {
  window.scrollBy(0, window.innerHeight);
});
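Putting the four pieces together, the request handler might look like the sketch below. This is an illustration under assumptions, not the project's actual code: createImageHandler, onIdle, and idleMs are hypothetical names, and the fake page and request objects in the usage part only imitate the Puppeteer methods used here (resourceType, url, continue, respond, evaluate).

```javascript
const PLACEHOLDER_GIF = Buffer.from(
  "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7",
  "base64"
);

// Hypothetical factory: returns a handler for page.on("request", handler).
// onIdle receives the collected URLs once no image request arrives for idleMs.
function createImageHandler(page, onIdle, idleMs = 3000) {
  const images = new Set();
  let timeout;
  return function handleRequest(request) {
    if (request.resourceType() !== "image") {
      request.continue();
      return;
    }
    images.add(request.url());
    // Answer with the 1x1 GIF so that on-error handlers are not triggered.
    request.respond({ status: 200, contentType: "image/gif", body: PLACEHOLDER_GIF });
    // Scroll one viewport down to trigger lazy-loaded images.
    // Rejections are ignored in case the page has already been closed.
    page.evaluate(() => window.scrollBy(0, window.innerHeight)).catch(() => {});
    // Restart the idle countdown.
    clearTimeout(timeout);
    timeout = setTimeout(() => onIdle(Array.from(images)), idleMs);
  };
}

// Usage with fakes (no browser needed); 50 ms stands in for the 3 seconds.
const fakePage = { evaluate: async () => {} };
const fakeImage = url => ({
  resourceType: () => "image",
  url: () => url,
  continue: () => {},
  respond: () => {},
});
const handler = createImageHandler(fakePage, urls => console.log(urls), 50);
handler(fakeImage("https://example.com/a.png"));
handler(fakeImage("https://example.com/b.png"));
```

With a real browser, the handler would be registered after await page.setRequestInterception(true), and onIdle would send the response and close the page and browser, as cb does above.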
You can find the project code at https://github.com/18choi18/image-crawler-server/blob/master/app.js
Thank you for reading