Amazon parsing on easy level and all by yourself


Published on 2024-11-06


I came across a script on the Internet that parses product cards from Amazon, and a solution to exactly that problem was what I needed.

I had been racking my brain over how to parse product cards from Amazon. The problem is that Amazon uses different layouts for different search results. If you parse the cards for the search query "bags", they are arranged vertically, which is what I need. If you take "t-shirts" instead, the cards are arranged horizontally, and the script fails with an error: it opens the page but never scrolls.


Moreover, after reading various articles where users puzzle over how to bypass the captcha on Amazon, I upgraded the script so that it can now bypass a captcha if one appears (it works with 2captcha). After each new page loads, the script checks whether a captcha is present. If it is, the script sends a request to the 2captcha server, waits for the solution, substitutes it into the page, and continues working.

However, how to bypass the captcha is not the most difficult problem, since this is a trivial task nowadays. The more pressing question is how to make the script work not only with the vertical arrangement of product cards, but also with the horizontal one.

Below I describe in detail what the script includes and demonstrate how it works. If you know what to add or change so that it also works with the horizontal card layout, I will be grateful for your help.

For now, the script may still be useful to someone, even with its limited functionality.

So, let's take the script apart piece by piece!

Preparation

First, the script imports the modules needed to complete the task:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv
import os
from time import sleep
import requests

Let's go through them one by one:

from selenium import webdriver

That imports the webdriver module, which lets the script control the browser (in my case Firefox).

from selenium.webdriver.common.by import By

That imports the By class, which the script uses to locate the elements to parse by XPath (it can also locate elements by other attributes, but this script uses XPath).

from selenium.webdriver.common.keys import Keys

That imports the Keys class, which is used to simulate keystrokes. In this script it scrolls the page down with Keys.PAGE_DOWN.

from selenium.webdriver.common.action_chains import ActionChains

That imports the ActionChains class for building sequences of actions. In our case it presses the PAGE_DOWN key, after which the script waits for the elements on the page to load (on Amazon, cards are loaded as the page is scrolled).

from selenium.webdriver.support.ui import WebDriverWait

That imports the WebDriverWait class, which waits until the content we are looking for has loaded, for example a product description located by XPath.

from selenium.webdriver.support import expected_conditions as EC

That imports the expected_conditions module (abbreviated EC), which works together with WebDriverWait and tells it which specific condition to wait for. That makes the script more reliable, because it does not start interacting with content that has not loaded yet.

import csv

That imports the csv module to work with csv files.

import os

That imports the os module to work with the operating system (creating directories, checking whether files exist, etc.).

from time import sleep

That imports the sleep function, which pauses the script for a set time (in my case 2 seconds, but you can set more) so that elements can load while scrolling.

import requests

That imports the requests library for sending HTTP requests, to interact with the 2captcha recognition service.
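To make the WebDriverWait and expected_conditions pairing described above more concrete, here is a minimal sketch of the underlying idea: poll a condition until it returns something truthy or a timeout expires. This is not Selenium's implementation. The names wait_until and WaitTimeout are hypothetical, and the "page" is simulated by a plain function so the example runs without a browser.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for a page whose element only shows up on the third check.
checks = {"count": 0}


def element_is_present():
    checks["count"] += 1
    return "product title" if checks["count"] >= 3 else None


found = wait_until(element_is_present, timeout=2.0, poll=0.01)
print(found)
```

In the real script, WebDriverWait plays the role of wait_until and the EC condition plays the role of element_is_present.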

Configuration

After everything is imported, the script configures the browser for work. In particular:

Setting the API key for access to the 2captcha service

# API key for 2Captcha
API_KEY = "Your API Key"

The script contains a user agent (it can be changed, of course), which is set for the browser. After that, the browser starts with the specified settings.

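user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"

options = webdriver.FirefoxOptions()
# Note: "user-agent=" is a Chrome-style switch; Firefox normally takes the
# override through the "general.useragent.override" preference instead.
options.add_argument(f"user-agent={user_agent}")

driver = webdriver.Firefox(options=options)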

Next comes the captcha-solving module. This is exactly the part people look for when they search for how to solve a captcha. We will not spend long on this piece of code, since it caused no particular problems.

In short, after each page load the script checks whether a captcha is present on the page and, if it finds one, solves it by sending it to the 2captcha server. If there is no captcha, execution simply continues.

def solve_captcha(driver):
    # Check for the presence of a captcha on the page
    try:
        captcha_element = driver.find_element(By.CLASS_NAME, 'g-recaptcha')
        if captcha_element:
            print("Captcha detected. Solving...")
            site_key = captcha_element.get_attribute('data-sitekey')
            current_url = driver.current_url

            # Send captcha request to 2Captcha
            captcha_id = requests.post(
                'http://2captcha.com/in.php',
                data={
                    'key': API_KEY,
                    'method': 'userrecaptcha',
                    'googlekey': site_key,
                    'pageurl': current_url
                }
            ).text.split('|')[1]

            # Wait for the captcha to be solved
            recaptcha_answer = ''
            while True:
                sleep(5)
                response = requests.get(f"http://2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}")
                if response.text == 'CAPCHA_NOT_READY':
                    continue
                if 'OK|' in response.text:
                    recaptcha_answer = response.text.split('|')[1]
                    break

            # Inject the captcha answer into the page
            driver.execute_script(f'document.getElementById("g-recaptcha-response").innerHTML = "{recaptcha_answer}";')
            driver.find_element(By.ID, 'submit').click()
            sleep(5)
            print("Captcha solved.")
    except Exception as e:
        print("No captcha found or error occurred:", e)
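One thing worth noting about the polling loop above: it only exits on an "OK|" reply, so if the service answers with an error code (for example ERROR_CAPTCHA_UNSOLVABLE), the loop keeps polling indefinitely. A possible way to bound it is sketched below. The helper names parse_result and poll_for_answer are hypothetical, and the HTTP call is replaced by a fetch function with canned replies so the example runs offline. In the real script, fetch would wrap the requests.get call to res.php.

```python
from time import sleep


def parse_result(text):
    """Classify a res.php reply as ('ready', token), ('pending', None) or ('error', code)."""
    text = text.strip()
    if text == "CAPCHA_NOT_READY":
        return ("pending", None)
    if text.startswith("OK|"):
        return ("ready", text.split("|", 1)[1])
    return ("error", text)


def poll_for_answer(fetch, attempts=24, delay=5.0):
    """Call fetch() until the answer is ready; stop on an error or after `attempts` tries."""
    for _ in range(attempts):
        status, value = parse_result(fetch())
        if status == "ready":
            return value
        if status == "error":
            raise RuntimeError(f"2captcha returned an error: {value}")
        sleep(delay)
    raise TimeoutError("captcha was not solved in time")


# Canned replies stand in for real HTTP responses.
replies = iter(["CAPCHA_NOT_READY", "CAPCHA_NOT_READY", "OK|token-123"])
answer = poll_for_answer(lambda: next(replies), attempts=5, delay=0)
print(answer)
```

With the defaults (24 attempts, 5 seconds apart) the loop gives up after roughly two minutes. Both values are arbitrary choices for this sketch.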

Parsing

Next comes the section of the code that is responsible for iterating over the pages, loading them, and scrolling them:

try:
    base_url = "https://www.amazon.in/s?k=bags"

    for page_number in range(1, 10):
        page_url = f"{base_url}&page={page_number}"

        driver.get(page_url)
        driver.implicitly_wait(10)

        solve_captcha(driver)

        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')))

        for _ in range(5):
            ActionChains(driver).send_keys(Keys.PAGE_DOWN).perform()
            sleep(2)
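The loop above builds each page URL by appending &page=N to a base URL that already contains the search query. If you want to switch queries (for example from "bags" to "t-shirts") without editing the string by hand, the same URL can be assembled with the standard library. This is only a sketch: build_search_url is a hypothetical helper, and the k and page parameter names are simply the ones that appear in the URL used by the script.

```python
from urllib.parse import urlencode


def build_search_url(query, page, base="https://www.amazon.in/s"):
    """Return the search URL for one results page, with the query safely encoded."""
    return f"{base}?{urlencode({'k': query, 'page': page})}"


# First three result pages for one query.
urls = [build_search_url("t-shirts", page) for page in range(1, 4)]
for url in urls:
    print(url)
```

Using urlencode also takes care of queries that contain spaces or other characters that must be escaped.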

The next piece collects the product data, and it is the most important part. Here the script examines the loaded page and extracts the specified data from it: in our case the product name, number of reviews, price, URL, and product rating.

        product_name_elements = driver.find_elements(By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')
        rating_number_elements = driver.find_elements(By.XPATH, '//span[@class="a-size-base s-underline-text"]')
        star_rating_elements = driver.find_elements(By.XPATH, '//span[@class="a-icon-alt"]')
        price_elements = driver.find_elements(By.XPATH, '//span[@class="a-price-whole"]')
        product_urls = driver.find_elements(By.XPATH, '//a[@class="a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal"]')

        product_names = [element.text for element in product_name_elements]
        rating_numbers = [element.text for element in rating_number_elements]
        star_ratings = [element.get_attribute('innerHTML') for element in star_rating_elements]
        prices = [element.text for element in price_elements]
        urls = [element.get_attribute('href') for element in product_urls]
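One possible direction for the horizontal layout, offered as a sketch rather than a tested fix. Both the explicit wait and the name locator look for the class "a-size-medium a-color-base a-text-normal", so a results page whose titles use different classes would time out before the scrolling starts. In addition, the five lists above are collected independently, so a card with no price or rating shifts every later row once the lists are zipped together. The sketch below reads each card on its own and tries several candidate XPaths in order. The second title XPath is an assumption about the grid layout and must be checked against the live page. The fake classes exist only so the example runs without a browser. They mimic the find_elements(by, value) method and the text attribute of a Selenium WebElement.

```python
XPATH = "xpath"  # same string value as selenium's By.XPATH

# Candidate locators, tried in order. The second title entry is an ASSUMPTION
# about the horizontal (grid) layout; verify it in the browser's inspector.
TITLE_XPATHS = [
    './/span[@class="a-size-medium a-color-base a-text-normal"]',
    './/span[@class="a-size-base-plus a-color-base a-text-normal"]',
]
PRICE_XPATHS = ['.//span[@class="a-price-whole"]']


def first_text(card, xpaths, default=""):
    """Return the text of the first element matched by any XPath, else default."""
    for xpath in xpaths:
        matches = card.find_elements(XPATH, xpath)
        if matches:
            return matches[0].text
    return default


def card_to_row(card):
    """Build one output row from one product card; missing fields stay empty."""
    return {
        "name": first_text(card, TITLE_XPATHS),
        "price": first_text(card, PRICE_XPATHS),
    }


class FakeElement:
    """Stand-in for a WebElement: only carries text."""
    def __init__(self, text):
        self.text = text


class FakeCard:
    """Stand-in for a card WebElement: maps an XPath to the texts it would match."""
    def __init__(self, found):
        self.found = found

    def find_elements(self, by, value):
        return [FakeElement(text) for text in self.found.get(value, [])]


list_card = FakeCard({TITLE_XPATHS[0]: ["Laptop bag"], PRICE_XPATHS[0]: ["1,299"]})
grid_card = FakeCard({TITLE_XPATHS[1]: ["Cotton t-shirt"]})  # no price on this card
rows = [card_to_row(card) for card in (list_card, grid_card)]
print(rows)
```

In the real script the cards would come from a driver.find_elements call on the container element of each search result. Which element that is has to be read from the page markup, so it is left out here. The explicit wait would also need a condition that matches either layout.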

Next, the collected data is written to a folder: a csv file is created for each page and saved to the output files folder. If the folder is missing, the script creates it.

        output_directory = "output files"
        if not os.path.exists(output_directory):
            os.makedirs(output_directory)

        with open(os.path.join(output_directory, f'product_details_page_{page_number}.csv'), 'w', newline='', encoding='utf-8') as csvfile:
            csv_writer = csv.writer(csvfile)
            csv_writer.writerow(['Product Urls', 'Product Name', 'Product Price', 'Rating', 'Number of Reviews'])
            for url, name, price, star_rating, num_ratings in zip(urls, product_names, prices, star_ratings, rating_numbers):
                csv_writer.writerow([url, name, price, star_rating, num_ratings])
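As a variation on the writing step above, the rows can be kept as dictionaries and written with csv.DictWriter, which fills in an empty value for any field a product is missing. This is a sketch under assumptions: write_rows is a hypothetical helper, the sample row is made up, and a temporary directory stands in for the real output files folder so the example cleans up after itself.

```python
import csv
import os
import tempfile

FIELDS = ["Product Urls", "Product Name", "Product Price", "Rating", "Number of Reviews"]


def write_rows(rows, directory, page_number):
    """Write one CSV per page; create the folder if needed and return the file path."""
    os.makedirs(directory, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(directory, f"product_details_page_{page_number}.csv")
    with open(path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDS, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return path


# Made-up sample row with only two of the five fields present.
sample = [{"Product Urls": "https://example.com/item", "Product Name": "Laptop bag"}]

with tempfile.TemporaryDirectory() as tmp:
    path = write_rows(sample, os.path.join(tmp, "output files"), 1)
    with open(path, newline="", encoding="utf-8") as f:
        saved = list(csv.reader(f))

print(saved)
```

os.makedirs with exist_ok=True replaces the separate os.path.exists check, and restval keeps the columns aligned when a value is missing.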

And the final stage is the completion of work and the release of resources.

finally:
    driver.quit()

The full script

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv
import os
from time import sleep
import requests

# API key for 2Captcha
API_KEY = "Your API Key"

# Set a custom user agent to mimic a real browser
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"

options = webdriver.FirefoxOptions()
options.add_argument(f"user-agent={user_agent}")

driver = webdriver.Firefox(options=options)

def solve_captcha(driver):
    # Check for the presence of a captcha on the page
    try:
        captcha_element = driver.find_element(By.CLASS_NAME, 'g-recaptcha')
        if captcha_element:
            print("Captcha detected. Solving...")
            site_key = captcha_element.get_attribute('data-sitekey')
            current_url = driver.current_url

            # Send captcha request to 2Captcha
            captcha_id = requests.post(
                'http://2captcha.com/in.php',
                data={
                    'key': API_KEY,
                    'method': 'userrecaptcha',
                    'googlekey': site_key,
                    'pageurl': current_url
                }
            ).text.split('|')[1]

            # Wait for the captcha to be solved
            recaptcha_answer = ''
            while True:
                sleep(5)
                response = requests.get(f"http://2captcha.com/res.php?key={API_KEY}&action=get&id={captcha_id}")
                if response.text == 'CAPCHA_NOT_READY':
                    continue
                if 'OK|' in response.text:
                    recaptcha_answer = response.text.split('|')[1]
                    break

            # Inject the captcha answer into the page
            driver.execute_script(f'document.getElementById("g-recaptcha-response").innerHTML = "{recaptcha_answer}";')
            driver.find_element(By.ID, 'submit').click()
            sleep(5)
            print("Captcha solved.")
    except Exception as e:
        print("No captcha found or error occurred:", e)

try:
    # Starting page URL
    base_url = "https://www.amazon.in/s?k=bags"

    for page_number in range(1, 2):
        page_url = f"{base_url}&page={page_number}"

        driver.get(page_url)
        driver.implicitly_wait(10)

        # Attempt to solve captcha if detected
        solve_captcha(driver)

        # Explicit Wait
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')))

        for _ in range(5):
            ActionChains(driver).send_keys(Keys.PAGE_DOWN).perform()
            sleep(2)

        product_name_elements = driver.find_elements(By.XPATH, '//span[@class="a-size-medium a-color-base a-text-normal"]')
        rating_number_elements = driver.find_elements(By.XPATH, '//span[@class="a-size-base s-underline-text"]')
        star_rating_elements = driver.find_elements(By.XPATH, '//span[@class="a-icon-alt"]')
        price_elements = driver.find_elements(By.XPATH, '//span[@class="a-price-whole"]')
        product_urls = driver.find_elements(By.XPATH, '//a[@class="a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal"]')

        # Extract the text content of each product name, number of ratings, star rating, price, and URL
        product_names = [element.text for element in product_name_elements]
        rating_numbers = [element.text for element in rating_number_elements]
        star_ratings = [element.get_attribute('innerHTML') for element in star_rating_elements]
        prices = [element.text for element in price_elements]
        urls = [element.get_attribute('href') for element in product_urls]

        sleep(5)
        output_directory = "output files"
        if not os.path.exists(output_directory):
            os.makedirs(output_directory)

        with open(os.path.join(output_directory, f'product_details_page_{page_number}.csv'), 'w', newline='', encoding='utf-8') as csvfile:
            csv_writer = csv.writer(csvfile)
            csv_writer.writerow(['Product Urls', 'Product Name', 'Product Price', 'Rating', 'Number of Reviews'])
            for url, name, price, star_rating, num_ratings in zip(urls, product_names, prices, star_ratings, rating_numbers):
                csv_writer.writerow([url, name, price, star_rating, num_ratings])

finally:
    driver.quit()

This way the script works without errors, but only for vertical product cards. Here is an example of how the script works.

I will be glad to discuss it in the comments if you have something to say about it.

Release statement: This article is reproduced from https://dev.to/markus009/amazon-parsing-on-easy-level-and-all-by-yourself-4dlj?1. If there is any infringement, please contact [email protected] to have it removed.
