"If a worker wants to do his job well, he must first sharpen his tools." - Confucius, The Analects, "Wei Ling Gong"

Amazon parsing on easy level and all by yourself

Published on November 6, 2024

I came across a script on the internet that parses product cards from Amazon, and a solution to exactly that problem was what I needed.

I had racked my brain looking for a way to parse product cards from Amazon. The problem is that Amazon uses different layouts for different search results. For the query "bags", for example, the cards are arranged vertically, which is what I need, but for a query like "t-shirts" the cards are arranged horizontally. In that case the script runs into an error: it opens the page, but refuses to scroll.


Moreover, after reading various articles in which users puzzle over how to bypass the captcha on Amazon, I upgraded the script: it can now bypass the captcha if one appears (it works with 2Captcha). After each page load the script checks whether a captcha is present, and if it is, it sends a request to the 2Captcha server, receives the solution, substitutes it on the page, and continues working.

However, bypassing the captcha is not the hardest part; that is a fairly trivial task nowadays. The more pressing question is how to make the script work not only with the vertical arrangement of product cards but also with the horizontal one.

Below I will describe in detail what the script includes and demonstrate how it works. If you can help solve the problem, or know what to add or change in the script so that it also works with the horizontal card layout, I will be grateful.

For now, the script may still help someone, even with its limited functionality.

So, let's take the script apart piece by piece!

Preparation

First, the script imports the modules needed for the task:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv
import os
from time import sleep
import requests
```

Let's go through them one by one:

from selenium import webdriver

That imports the webdriver module, which allows you to control the browser (Firefox in my case) from the script.

from selenium.webdriver.common.by import By

That imports the By class, which the script uses to locate the elements to parse by XPath (it can also locate elements by other attributes, but in this case XPath is used).
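
For illustration, a minimal XPath lookup looks like this. This is just a sketch: it assumes the driver that is created later in the Configuration section, and the selector is a placeholder, not one the script actually uses.

```python
# Hypothetical example: locate a single element by XPath (placeholder selector)
element = driver.find_element(By.XPATH, '//span[@class="some-class"]')
print(element.text)
```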

from selenium.webdriver.common.keys import Keys

That imports the Keys class, which is used to simulate keystrokes; in this script it scrolls the page down with Keys.PAGE_DOWN.

from selenium.webdriver.common.action_chains import ActionChains

That imports the ActionChains class, used to build sequences of actions; in our case, pressing the PAGE_DOWN key and waiting for all elements on the page to load (on Amazon, cards are loaded as the page is scrolled).
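
For example, this is the scrolling pattern used later in the script (again assuming the imports above and the driver created in the Configuration section):

```python
# Press PAGE_DOWN several times, pausing so lazily loaded cards have time to appear
for _ in range(5):
    ActionChains(driver).send_keys(Keys.PAGE_DOWN).perform()
    sleep(2)
```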

from selenium.webdriver.support.ui import WebDriverWait

That imports the WebDriverWait class, which waits until the information we are looking for has loaded, for example a product description that we will locate by XPath.

from selenium.webdriver.support import expected_conditions as EC

That imports the expected_conditions module (abbreviated EC), which works in conjunction with the previous class and tells WebDriverWait exactly which condition to wait for. This makes the script more reliable, so it does not start interacting with content that has not loaded yet.
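
Together they are used like this in the script: wait up to 10 seconds until at least one product-name span (located by XPath) is present on the page.

```python
# Block until the product-name span appears, or raise a timeout after 10 seconds
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')
    )
)
```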

import csv

That imports the csv module to work with csv files.

import os

That imports the os module to work with the operating system (creating directories, checking for the files presence, etc.).

from time import sleep

That imports the sleep function, which pauses the script for a given time (2 seconds in my case, but you can set more) so that elements have time to load while scrolling.

import requests

That imports the requests library for sending HTTP requests to the 2Captcha recognition service.

Configuration

After everything is imported, the script starts configuring the browser for work, in particular:

Setting the API key to access the 2Captcha service

```python
# API key for 2Captcha
API_KEY = "Your API Key"
```

The script then sets a user agent for the browser (it can be changed, of course), after which the browser is started with the specified options.

```python
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"

options = webdriver.FirefoxOptions()
options.add_argument(f"user-agent={user_agent}")

driver = webdriver.Firefox(options=options)
```

Next comes the captcha-solving module. This is exactly the piece people are looking for when they search for how to solve a captcha. I will not dwell on this code for long, since there were no particular problems with it.

In short, after each page load the script checks whether a captcha is present on the page; if it finds one, it sends it to the 2Captcha server for solving. If there is no captcha, it simply continues execution.

```python
def solve_captcha(driver):
    # Check for the presence of a captcha on the page
    try:
        captcha_element = driver.find_element(By.CLASS_NAME, 'g-recaptcha')
        if captcha_element:
            print("Captcha detected. Solving...")
            site_key = captcha_element.get_attribute('data-sitekey')
            current_url = driver.current_url

            # Send captcha request to 2Captcha
            captcha_id = requests.post(
                'http://2captcha.com/in.php',
                data={
                    'key': API_KEY,
                    'method': 'userrecaptcha',
                    'googlekey': site_key,
                    'pageurl': current_url
                }
            ).text.split('|')[1]

            # Wait for the captcha to be solved
            recaptcha_answer = ''
            while True:
                sleep(5)
                response = requests.get(f"http://2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}")
                if response.text == 'CAPCHA_NOT_READY':
                    continue
                if 'OK|' in response.text:
                    recaptcha_answer = response.text.split('|')[1]
                    break

            # Inject the captcha answer into the page
            driver.execute_script(f'document.getElementById("g-recaptcha-response").innerHTML = "{recaptcha_answer}";')
            driver.find_element(By.ID, 'submit').click()
            sleep(5)
            print("Captcha solved.")
    except Exception as e:
        print("No captcha found or error occurred:", e)
```


Parsing
Next comes the section of the code responsible for paging through the search results, loading each page and scrolling it.

```python
try:
    base_url = "https://www.amazon.in/s?k=bags"

    for page_number in range(1, 10):
        page_url = f"{base_url}&page={page_number}"

        driver.get(page_url)
        driver.implicitly_wait(10)

        solve_captcha(driver)

        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')))

        for _ in range(5):
            ActionChains(driver).send_keys(Keys.PAGE_DOWN).perform()
            sleep(2)
```

The next piece is the collection of product data, the most important part. Here the script examines the loaded page and extracts the specified data: in our case the product name, number of reviews, price, URL, and product rating.

```python
product_name_elements = driver.find_elements(By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')
rating_number_elements = driver.find_elements(By.XPATH, '//span[@class="a-size-base s-underline-text"]')
star_rating_elements = driver.find_elements(By.XPATH, '//span[@class="a-icon-alt"]')
price_elements = driver.find_elements(By.XPATH, '//span[@class="a-price-whole"]')
product_urls = driver.find_elements(By.XPATH, '//a[@class="a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal"]')

product_names = [element.text for element in product_name_elements]
rating_numbers = [element.text for element in rating_number_elements]
star_ratings = [element.get_attribute('innerHTML') for element in star_rating_elements]
prices = [element.text for element in price_elements]
urls = [element.get_attribute('href') for element in product_urls]
```

Next, the collected data is written to disk: a CSV file is created for each page and saved to the "output files" folder. If the folder does not exist, the script creates it.

```python
output_directory = "output files"
if not os.path.exists(output_directory):
    os.makedirs(output_directory)

with open(os.path.join(output_directory, f'product_details_page_{page_number}.csv'), 'w', newline='', encoding='utf-8') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['Product Urls', 'Product Name', 'Product Price', 'Rating', 'Number of Reviews'])
    for url, name, price, star_rating, num_ratings in zip(urls, product_names, prices, star_ratings, rating_numbers):
        csv_writer.writerow([url, name, price, star_rating, num_ratings])
```

And the final stage is wrapping up and releasing the browser resources.

```python
finally:
    driver.quit()
```

The full script

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv
import os
from time import sleep
import requests

# API key for 2Captcha
API_KEY = "Your API Key"

# Set a custom user agent to mimic a real browser
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"

options = webdriver.FirefoxOptions()
options.add_argument(f"user-agent={user_agent}")

driver = webdriver.Firefox(options=options)

def solve_captcha(driver):
    # Check for the presence of a captcha on the page
    try:
        captcha_element = driver.find_element(By.CLASS_NAME, 'g-recaptcha')
        if captcha_element:
            print("Captcha detected. Solving...")
            site_key = captcha_element.get_attribute('data-sitekey')
            current_url = driver.current_url

            # Send captcha request to 2Captcha
            captcha_id = requests.post(
                'http://2captcha.com/in.php',
                data={
                    'key': API_KEY,
                    'method': 'userrecaptcha',
                    'googlekey': site_key,
                    'pageurl': current_url
                }
            ).text.split('|')[1]

            # Wait for the captcha to be solved
            recaptcha_answer = ''
            while True:
                sleep(5)
                response = requests.get(f"http://2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}")
                if response.text == 'CAPCHA_NOT_READY':
                    continue
                if 'OK|' in response.text:
                    recaptcha_answer = response.text.split('|')[1]
                    break

            # Inject the captcha answer into the page
            driver.execute_script(f'document.getElementById("g-recaptcha-response").innerHTML = "{recaptcha_answer}";')
            driver.find_element(By.ID, 'submit').click()
            sleep(5)
            print("Captcha solved.")
    except Exception as e:
        print("No captcha found or error occurred:", e)

try:
    # Starting page URL
    base_url = "https://www.amazon.in/s?k=bags"

    for page_number in range(1, 2):
        page_url = f"{base_url}&page={page_number}"

        driver.get(page_url)
        driver.implicitly_wait(10)

        # Attempt to solve captcha if detected
        solve_captcha(driver)

        # Explicit wait for the product-name spans to appear
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')))

        # Scroll down so lazily loaded cards appear
        for _ in range(5):
            ActionChains(driver).send_keys(Keys.PAGE_DOWN).perform()
            sleep(2)

        product_name_elements = driver.find_elements(By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')
        rating_number_elements = driver.find_elements(By.XPATH, '//span[@class="a-size-base s-underline-text"]')
        star_rating_elements = driver.find_elements(By.XPATH, '//span[@class="a-icon-alt"]')
        price_elements = driver.find_elements(By.XPATH, '//span[@class="a-price-whole"]')
        product_urls = driver.find_elements(By.XPATH, '//a[@class="a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal"]')

        # Extract the text content of each product name, number of ratings, star rating, price, and URL
        product_names = [element.text for element in product_name_elements]
        rating_numbers = [element.text for element in rating_number_elements]
        star_ratings = [element.get_attribute('innerHTML') for element in star_rating_elements]
        prices = [element.text for element in price_elements]
        urls = [element.get_attribute('href') for element in product_urls]

        sleep(5)

        # Create the output directory if needed and write one CSV file per page
        output_directory = "output files"
        if not os.path.exists(output_directory):
            os.makedirs(output_directory)

        with open(os.path.join(output_directory, f'product_details_page_{page_number}.csv'), 'w', newline='', encoding='utf-8') as csvfile:
            csv_writer = csv.writer(csvfile)
            csv_writer.writerow(['Product Urls', 'Product Name', 'Product Price', 'Rating', 'Number of Reviews'])
            for url, name, price, star_rating, num_ratings in zip(urls, product_names, prices, star_ratings, rating_numbers):
                csv_writer.writerow([url, name, price, star_rating, num_ratings])

finally:
    driver.quit()
```

As it stands, the script runs without errors, but only with vertical product cards. Here is an example of the script at work.

I will be glad to discuss it in the comments if you have something to say about it.

This article is reproduced from: https://dev.to/markus009/amazon-parsing-on-easy-level-and-all-by-yourself-4dlj