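Imagine building an e-commerce platform that can pull product data in real time from major stores such as eBay, Amazon, and Flipkart. Sure, there are Shopify and similar services, but let's be honest: buying a subscription for a single project can feel like a hassle. So I thought, why not scrape these sites and store the products directly in our own database? It would be an efficient and cost-effective way to get products for our e-commerce project.
Web scraping means extracting data from websites by reading and parsing the HTML of their pages. It usually involves automating a browser or sending HTTP requests to the site, then analyzing the HTML structure to retrieve specific information such as text, links, or images. Puppeteer is one library that can be used to scrape websites.
Puppeteer is a Node.js library that provides a high-level API for controlling a headless Chrome or Chromium browser. Headless Chrome is Chrome running without a UI, which makes it well suited to work that runs in the background.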
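We can use Puppeteer to automate a variety of browser tasks, such as scraping page content, taking screenshots, generating PDFs, and filling in forms.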
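As a rough illustration of that API, here is a minimal sketch that launches headless Chrome, opens a page, and reads its title. The function name `readPageTitle` and the `launcher` parameter are assumptions made for this example (the parameter exists only so that a stub can be swapped in when testing); `launch`, `newPage`, `goto`, `title`, and `close` are standard Puppeteer calls.

```javascript
// Minimal sketch: open a page in headless Chrome and return its <title>.
// `readPageTitle` and the `launcher` parameter are hypothetical names for this example.
async function readPageTitle(url, launcher = require("puppeteer")) {
  const browser = await launcher.launch(); // headless by default
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded" });
    return await page.title();
  } finally {
    // Always close the browser, even if navigation fails
    await browser.close();
  }
}

// Usage (requires `puppeteer` to be installed):
// readPageTitle("https://example.com").then(console.log);
```

Wrapping the work in `try`/`finally` is a design choice worth copying into longer scripts, since a failed navigation would otherwise leave a Chrome process running.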
First we need to install the library, so go ahead and do that.
Using npm:
npm i puppeteer # Downloads compatible Chrome during installation.
npm i puppeteer-core # Alternatively, install as a library, without downloading Chrome.
Using yarn:
yarn add puppeteer # Downloads compatible Chrome during installation.
yarn add puppeteer-core # Alternatively, install as a library, without downloading Chrome.
Using pnpm:
pnpm add puppeteer # Downloads compatible Chrome during installation.
pnpm add puppeteer-core # Alternatively, install as a library, without downloading Chrome.
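Because `puppeteer-core` does not download a browser, it has to be pointed at one that is already installed. The sketch below assumes the path to Chrome is supplied by the caller, for example from a `CHROME_PATH` environment variable (a name chosen only for this example); `launchInstalledChrome` and the `launcher` parameter are likewise hypothetical, while `executablePath` is a regular Puppeteer launch option.

```javascript
// Sketch: launch an already-installed Chrome through puppeteer-core.
// `launchInstalledChrome` and `launcher` are hypothetical names for this example.
async function launchInstalledChrome(executablePath, launcher = require("puppeteer-core")) {
  if (!executablePath) {
    throw new Error("puppeteer-core needs the path to a Chrome or Chromium binary");
  }
  // `executablePath` is the launch option that selects the browser binary
  return launcher.launch({ executablePath });
}

// Usage (requires `puppeteer-core` and a local Chrome install):
// launchInstalledChrome(process.env.CHROME_PATH).then((browser) => browser.close());
```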
Here is an example of how to scrape a website. (P.S. I used this code to retrieve products from the Myntra website for my e-commerce project.)
const puppeteer = require("puppeteer");
const CategorySchema = require("./models/Category");

// Define the scrape function as a named async function
const scrape = async () => {
  // Launch a new browser instance
  const browser = await puppeteer.launch({ headless: false });

  // Open a new page
  const page = await browser.newPage();

  // Navigate to the target URL and wait until the DOM is fully loaded
  await page.goto('https://www.myntra.com/mens-sport-wear?rawQuery=mens sport wear', {
    waitUntil: 'domcontentloaded'
  });

  // Wait for additional time to ensure all content is loaded
  await new Promise((resolve) => setTimeout(resolve, 25000));

  // Extract product details from the page
  const items = await page.evaluate(() => {
    // Select all product elements
    const elements = document.querySelectorAll('.product-base');
    const elementsArray = Array.from(elements);

    // Map each element to an object with the desired properties
    const results = elementsArray.map((element) => {
      const image = element.querySelector(".product-imageSliderContainer img")?.getAttribute("src");
      return {
        image: image ?? null,
        brand: element.querySelector(".product-brand")?.textContent,
        title: element.querySelector(".product-product")?.textContent,
        discountPrice: element.querySelector(".product-price .product-discountedPrice")?.textContent,
        actualPrice: element.querySelector(".product-price .product-strike")?.textContent,
        discountPercentage: element.querySelector(".product-price .product-discountPercentage")?.textContent?.split(' ')[0]?.slice(1, -1),
        total: 20, // Placeholder value, adjust as needed
        available: 10, // Placeholder value, adjust as needed
        ratings: Math.round((Math.random() * 5) * 10) / 10 // Random rating for demonstration
      };
    });

    return results; // Return the list of product details
  });

  // Close the browser
  await browser.close();

  // Prepare the data for saving
  const data = {
    category: "mens-sport-wear",
    subcategory: "Mens",
    list: items
  };

  // Create a new Category document and save it to the database
  // Since we want to store product information in our e-commerce store, we use a schema and save it to the database.
  // If you don't need to save the data, you can omit this step.
  const category = new CategorySchema(data);
  console.log(category);
  await category.save();

  // Return the scraped items
  return items;
};

// Export the scrape function as the default export
module.exports = scrape;
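The scraped prices and discount come back as display strings, so they usually need converting before they are stored as numbers. Below is a minimal sketch, assuming prices look like `Rs. 1299` and the discount has already been reduced to a value like `40` by the `split`/`slice` logic in the script. The price format is an assumption and should be checked against the live page. `parseAmount` and `normalizeItem` are hypothetical helpers, not part of the original script.

```javascript
// Hypothetical helper: pull the first number out of a display string.
// Returns null when the input is missing or contains no digits.
function parseAmount(text) {
  if (typeof text !== "string") return null;
  const match = text.replace(/,/g, "").match(/\d+(\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// Hypothetical helper: convert the price fields of one scraped item to numbers.
function normalizeItem(item) {
  return {
    ...item,
    discountPrice: parseAmount(item.discountPrice),
    actualPrice: parseAmount(item.actualPrice),
    discountPercentage: parseAmount(item.discountPercentage),
  };
}

// Usage with made-up sample data:
const sample = normalizeItem({
  brand: "ExampleBrand",
  discountPrice: "Rs. 1,299",
  actualPrice: "Rs. 2599",
  discountPercentage: "40",
});
console.log(sample.discountPrice, sample.actualPrice, sample.discountPercentage);
// prints: 1299 2599 40
```

Returning `null` for missing values matters here because products without a discount have no `.product-discountedPrice` element, so the optional chaining in the script yields `undefined` for those fields.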
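Explanation: the script launches a visible (non-headless) Chrome window, opens the Myntra listing page, and pauses for 25 seconds so that all content has time to load. Inside page.evaluate, which runs in the browser context, it selects every .product-base element and reads the image, brand, title, and prices from each one. The total, available, and ratings fields are placeholders rather than scraped values. After closing the browser, the script wraps the items in a category document and saves it through the Category model.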
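The fixed 25-second pause in the script is simple but slow, and it can still be too short on a bad connection. One alternative is to wait until the first product element appears. This is a sketch, assuming `.product-base` is still the product selector; `waitForProducts` is a hypothetical helper, while `page.waitForSelector` and `page.$$eval` are standard Puppeteer methods.

```javascript
// Hypothetical helper: wait for product cards to appear, then count them.
// `page` is a Puppeteer Page; the selector and timeout defaults are assumptions.
async function waitForProducts(page, selector = ".product-base", timeout = 25000) {
  // Resolves as soon as one matching element exists; rejects after `timeout` ms
  await page.waitForSelector(selector, { timeout });
  // Runs in the browser context and returns how many elements matched
  return page.$$eval(selector, (elements) => elements.length);
}

// Usage inside scrape(), in place of the setTimeout pause:
// const count = await waitForProducts(page);
```

Note that this only guarantees the first product has rendered. If the page keeps adding products as it loads, a short extra pause or a scroll step may still be needed.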