Web Scraping Made Easy: Parse Any HTML Page with Puppeteer

Front page > Programming > Web Scraping Made Easy: Parse Any HTML Page with Puppeteer

Web Scraping Made Easy: Parse Any HTML Page with Puppeteer

Published on 2024-11-05

Browse:857

Web Scraping Made Easy: Parse Any HTML Page with Puppeteer

Imagine building an e-commerce platform where we can easily fetch product data in real-time from major stores like eBay, Amazon, and Flipkart. Sure, there’s Shopify and similar services, but let's be honest—it can feel a bit cumbersome to buy a subscription just for a project. So, I thought, why not scrape these sites and store the products directly in our database? It would be an efficient and cost-effective way to get products for our e-commerce projects.

What is Web Scraping?

Web scraping involves extracting data from websites by parsing the HTML of web pages to read and collect content. It often involves automating a browser or sending HTTP requests to the site, and then analyzing the HTML structure to retrieve specific pieces of information like text, links, or images.Puppeteer is one library used to scrape the websites.

?What is Puppeteer?

Puppeteer is a Node.js library.It provides a high-level API for controlling headless Chrome or Chromium browsers.Headless Chrome is a version of chrome that runs everything without an UI(perfect for running things in the background).

We can automate various tasks using puppeteer,such as:

Web Scraping: Extracting content from websites involves interacting with the page's HTML and JavaScript. We typically retrieve the content by targeting the CSS selectors.
PDF Generation: Converting web pages into PDFs programmatically is ideal when you want to directly generate a PDF from a web page, rather than taking a screenshot and then converting the screenshot to a PDF. (P.S. Apologies if you already have workarounds for this).
Automated Testing: Running tests on web pages by simulating user actions like clicking buttons, filling out forms, and taking screenshots. This eliminates the tedious process of manually going through long forms to ensure everything is in place.

?How to get started with puppetter?

Firstly we have to install the library,go ahead and do this.
Using npm:

npm i puppeteer # Downloads compatible Chrome during installation.
npm i puppeteer-core # Alternatively, install as a library, without downloading Chrome.

Using yarn:

yarn add puppeteer // Downloads compatible Chrome during installation.
yarn add puppeteer-core // Alternatively, install as a library, without downloading Chrome.

Using pnpm:

pnpm add puppeteer # Downloads compatible Chrome during installation.
pnpm add puppeteer-core # Alternatively, install as a library, without downloading Chrome.

? Example to demonstrate the use of puppeteer

Here is an example of how to scrape a website. (P.S. I used this code to retrieve products from the Myntra website for my e-commerce project.)

const puppeteer = require("puppeteer");
const CategorySchema = require("./models/Category");

// Define the scrape function as a named async function
const scrape = async () => {
    // Launch a new browser instance
    const browser = await puppeteer.launch({ headless: false });

    // Open a new page
    const page = await browser.newPage();

    // Navigate to the target URL and wait until the DOM is fully loaded
    await page.goto('https://www.myntra.com/mens-sport-wear?rawQuery=mens sport wear', { waitUntil: 'domcontentloaded' });

    // Wait for additional time to ensure all content is loaded
    await new Promise((resolve) => setTimeout(resolve, 25000));

    // Extract product details from the page
    const items = await page.evaluate(() => {
        // Select all product elements
        const elements = document.querySelectorAll('.product-base');
        const elementsArray = Array.from(elements);

        // Map each element to an object with the desired properties
        const results = elementsArray.map((element) => {
            const image = element.querySelector(".product-imageSliderContainer img")?.getAttribute("src");
            return {
                image: image ?? null,
                brand: element.querySelector(".product-brand")?.textContent,
                title: element.querySelector(".product-product")?.textContent,
                discountPrice: element.querySelector(".product-price .product-discountedPrice")?.textContent,
                actualPrice: element.querySelector(".product-price .product-strike")?.textContent,
                discountPercentage: element.querySelector(".product-price .product-discountPercentage")?.textContent?.split(' ')[0]?.slice(1, -1),
                total: 20, // Placeholder value, adjust as needed
                available: 10, // Placeholder value, adjust as needed
                ratings: Math.round((Math.random() * 5) * 10) / 10 // Random rating for demonstration
            };
        });

        return results; // Return the list of product details
    });

    // Close the browser
    await browser.close();

    // Prepare the data for saving
    const data = {
        category: "mens-sport-wear",
        subcategory: "Mens",
        list: items
    };

    // Create a new Category document and save it to the database
    // Since we want to store product information in our e-commerce store, we use a schema and save it to the database.
    // If you don't need to save the data, you can omit this step.
    const category = new CategorySchema(data);
    console.log(category);
    await category.save();

    // Return the scraped items
    return items;
};

// Export the scrape function as the default export
module.exports = scrape;

?Explanation:

In this code, we are using Puppeteer to scrape product data from a website. After extracting the details, we create a schema (CategorySchema) to structure and save this data into our database. This step is particularly useful if we want to integrate the scraped products into our e-commerce store. If storing the data in a database is not required, you can omit the schema-related code.
Before scraping, it's important to understand the HTML structure of the page and identify which CSS selectors contain the content you want to extract.
In my case, I used the relevant CSS selectors identified on the Myntra website to extract the content I was targeting.

Release Statement This article is reproduced at: https://dev.to/niharikaa/web-scraping-made-easy-parse-any-html-page-with-puppeteer-3dk8?1 If there is any infringement, please contact [email protected] to delete it

Latest tutorial More>

How to avoid memory leaks when slicing Go language?
Memory Leak in Go SlicesUnderstanding memory leaks in Go slices can be a challenge. This article aims to provide clarification by examining two approa...

Programming Posted on 2025-07-08
$How to Fix \"mysql_config not found\" Error When Installing MySQL-python on Ubuntu/Linux?$
How to Fix \"mysql_config not found\" Error When Installing MySQL-python on Ubuntu/Linux?
MySQL-python Installation Error: "mysql_config not found"Attempting to install MySQL-python on Ubuntu/Linux Box may encounter an error messa...

Programming Posted on 2025-07-08
Causes and solutions for Face Detection Failure: Error -215
Error Handling: Resolving "error: (-215) !empty() in function detectMultiScale" in OpenCVWhen attempting to utilize the detectMultiScale() m...

Programming Posted on 2025-07-08
FastAPI Custom 404 Page Creation Guide
Custom 404 Not Found Page with FastAPITo create a custom 404 Not Found page, FastAPI offers several approaches. The appropriate method depends on your...

Programming Posted on 2025-07-08
$Why Isn\'t My CSS Background Image Appearing?$
Why Isn\'t My CSS Background Image Appearing?
Troubleshoot: CSS Background Image Not AppearingYou've encountered an issue where your background image fails to load despite following tutorial i...

Programming Posted on 2025-07-08
How to Efficiently Convert Timezones in PHP?
Efficient Timezone Conversion in PHPIn PHP, handling timezones can be a straightforward task. This guide will provide an easy-to-implement method for ...

Programming Posted on 2025-07-08
When does a Go web application close the database connection?
Managing Database Connections in Go Web ApplicationsIn simple Go web applications that utilize databases like PostgreSQL, the timing of database conne...

Programming Posted on 2025-07-08
`console.log` shows the reason for the modified object value exception
Objects and Console.log: An Oddity UnraveledWhen working with objects and console.log, you may encounter peculiar behavior. Let's unravel this mys...

Programming Posted on 2025-07-08
User local time format and time zone offset display guide
Displaying Date/Time in User's Locale Format with Time OffsetWhen presenting dates and times to end-users, it's crucial to display them in the...

Programming Posted on 2025-07-08
Tips for finding element position in Java array
Retrieving Element Position in Java ArraysWithin Java's Arrays class, there is no direct "indexOf" method to determine the position of a...

Programming Posted on 2025-07-08
How to Parse Numbers in Exponential Notation Using Decimal.Parse()?
Parsing a Number from Exponential NotationWhen attempting to parse a string expressed in exponential notation using Decimal.Parse("1.2345E-02&quo...

Programming Posted on 2025-07-08
How to Capture and Stream stdout in Real Time for Chatbot Command Execution?
Capturing stdout in Real Time from Command ExecutionIn the realm of developing chatbots capable of executing commands, a common requirement is the abi...

Programming Posted on 2025-07-08
How to upload files with additional parameters using java.net.URLConnection and multipart/form-data encoding?
Uploading Files with HTTP RequestsTo upload files to an HTTP server while also submitting additional parameters, java.net.URLConnection and multipart/...

Programming Posted on 2025-07-08
How Can I Execute Multiple SQL Statements in a Single Query Using Node-MySQL?
Multi-Statement Query Support in Node-MySQLIn Node.js, the question arises when executing multiple SQL statements in a single query using the node-mys...

Programming Posted on 2025-07-08
Guide to Solve CORS Issues in Spring Security 4.1 and above
Spring Security CORS Filter: Troubleshooting Common IssuesWhen integrating Spring Security into an existing project, you may encounter CORS-related er...

Programming Posted on 2025-07-08