
Current Problems and Mistakes of Web Scraping in Python and Tricks to Solve Them!

Published on 2024-11-07

Introduction

Greetings! I'm Max, a Python developer from Ukraine with expertise in web scraping, data analysis, and processing.

My journey in web scraping started in 2016 when I was solving lead generation challenges for a small company. Initially, I used off-the-shelf solutions such as Import.io and Kimono Labs. However, I quickly encountered limitations such as blocking, inaccurate data extraction, and performance issues. This led me to learn Python. Those were the glory days when requests and lxml/beautifulsoup were enough to extract data from most websites. And if you knew how to work with threads, you were already a respected expert :)

One of our community members wrote this blog as a contribution to Crawlee Blog. If you want to contribute blogs like these to Crawlee Blog, please reach out to us on our Discord channel.


As a freelancer, I've built small solutions and large, complex data mining systems for products over the years.

Today, I want to discuss the realities of web scraping with Python in 2024. We'll look at the mistakes I sometimes see, the problems you'll encounter, and I'll offer solutions to some of them.

Let's get started.

Just take requests and beautifulsoup and start making a lot of money...

No, this is not that kind of article.

1. "I got a 200 response from the server, but it's an unreadable character set."

Yes, it can be surprising. But I've seen this message from customers and developers six years ago, four years ago, and in 2024. I read a post on Reddit just a few months ago about this issue.

Let's look at a simple code example. This will work for requests, httpx, and aiohttp with a clean installation and no extensions.

import httpx

url = 'https://www.wayfair.com/'

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Accept": "text/html,application/xhtml xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg xml,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Connection": "keep-alive",
}

response = httpx.get(url, headers=headers)

print(response.content[:10])

The print result will be similar to:

b'\x83\x0c\x00\x00\xc4\r\x8e4\x82\x8a'

This is not an error; it's a perfectly valid server response. The body is just compressed.

The answer lies in the Accept-Encoding header. In the example above, I just copied it from my browser, so it lists all the compression methods my browser supports: "gzip, deflate, br, zstd". The Wayfair backend supports compression with "br", which is Brotli, and uses it as the most efficient method.
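
If you want to confirm what's going on, check the Content-Encoding header of the response. A minimal check, reusing the response object from the snippet above:

encoding = response.headers.get('content-encoding')
# The server reports which compression it applied to the body
print(encoding)  # 'br' here, i.e., Brotli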

This happens because none of the libraries listed above include Brotli among their standard dependencies. However, they all support decompressing this format if you already have Brotli installed.

Therefore, it's sufficient to install the appropriate library:

pip install Brotli

This will allow you to get the result of the print:

b'<!DOCTYPE '

You can obtain the same result for aiohttp and httpx by installing the libraries with their extras:

pip install aiohttp[speedups]
pip install httpx[brotli]

By the way, adding the brotli dependency was my first contribution to crawlee-python. They use httpx as the base HTTP client.

You may have also noticed that a new supported data compression format, zstd, appeared some time ago. I haven't seen any backends that use it yet, but httpx will support its decompression in versions above 0.28.0. I already use it to compress server response dumps in my projects; it shows incredible efficiency in asynchronous solutions with aiofiles.
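
As an illustration of that last point, here is a minimal sketch of how such a dump might look. This assumes the zstandard package; the compression level and file path are arbitrary choices of mine:

import asyncio

import aiofiles
import zstandard  # pip install zstandard

async def dump_response(body: bytes, path: str) -> None:
    # Compress the raw response body with zstd before writing it to disk
    compressed = zstandard.ZstdCompressor(level=3).compress(body)
    async with aiofiles.open(path, 'wb') as f:
        await f.write(compressed)

asyncio.run(dump_response(b'<html>...</html>', 'dump.html.zst'))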

The most common solution to this situation that I've seen is for developers to simply stop sending the Accept-Encoding header, thus getting an uncompressed response from the server. Why is that bad? The main page of Wayfair is about 1 megabyte uncompressed and about 0.165 megabytes compressed (you can verify this yourself; see the sketch after the list below).

Therefore, in the absence of this header:

  • You increase the load on your internet bandwidth.
  • If you use a proxy with traffic, you increase the cost of each of your requests.
  • You increase the load on the server's internet bandwidth.
  • You're revealing yourself as a scraper, since any browser uses compression.
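
If you want to check the numbers for yourself, here is a minimal sketch; httpx's num_bytes_downloaded reflects what actually went over the wire:

import httpx

url = 'https://www.wayfair.com/'

# 'identity' asks the server not to compress the response at all
plain = httpx.get(url, headers={'Accept-Encoding': 'identity'})
compressed = httpx.get(url, headers={'Accept-Encoding': 'gzip, deflate, br'})

print(plain.num_bytes_downloaded, compressed.num_bytes_downloaded)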

But I think the problem is a bit deeper than that. Many web scraping developers simply don't understand what the headers they use do. So if this applies to you, when you're working on your next project, read up on these things; they may surprise you.

2. "I use headers as in an incognito browser, but I get a 403 response". Here's Johnn-... I mean, Cloudflare

Yes, that's right. 2023 brought us not only Large Language Models like ChatGPT but also improved Cloudflare protection.

Those who have been scraping the web for a long time might say, "Well, we've already dealt with DataDome, PerimeterX, InCapsula, and the like."

But Cloudflare has changed the rules of the game. It is one of the largest CDN providers in the world, serving a huge number of sites. Therefore, its services are available to many sites with a fairly low entry barrier. This makes it radically different from the technologies mentioned earlier, which were implemented purposefully when they wanted to protect the site from scraping.

Cloudflare is the reason why, when you start reading another course on "How to do web scraping using requests and beautifulsoup", you can close it immediately. Because there's a big chance that what you learn will simply not work on any "decent" website.

Let's look at another simple code example:

from httpx import Client

client = Client(http2=True)

url = 'https://www.g2.com/'

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Accept": "text/html,application/xhtml xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg xml,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Connection": "keep-alive",
}

response = client.get(url, headers=headers)

print(response)

Of course, the response would be 403.

What if we use curl?

curl -XGET -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' -H 'Connection: keep-alive' 'https://www.g2.com/' -s -o /dev/null -w "%{http_code}\n"

Also 403.

Why is this happening?

Because Cloudflare has the TLS fingerprints of many HTTP clients popular among developers. Site administrators can also configure how aggressively Cloudflare blocks clients based on these fingerprints.

For curl, we can solve it like this:

curl -XGET -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' -H 'Connection: keep-alive' 'https://www.g2.com/' --tlsv1.3 -s -o /dev/null -w "%{http_code}\n"

You might expect me to write here an equally elegant solution for httpx, but no. About six months ago, you could do the "dirty trick" and change the basic httpcore parameters that it passes to h2, which are responsible for the HTTP2 handshake. But now, as I'm writing this article, that doesn't work anymore.

There are different approaches to getting around this. But let's solve it by manipulating TLS.

The bad news is that all the Python clients I know of use the ssl library to handle TLS. And it doesn't give you the ability to manipulate TLS subtly.

The good news is that the Python community is great and implements solutions that exist in other programming languages.

The first way to solve this problem is to use tls-client

This Python wrapper around the Golang library provides an API similar to requests.

pip install tls-client
from tls_client import Session

client = Session(client_identifier="firefox_120")

url = 'https://www.g2.com/'

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Accept": "text/html,application/xhtml xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg xml,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Connection": "keep-alive",
}

response = client.get(url, headers=headers)

print(response)

The tls_client supports TLS presets for popular browsers, which the developers keep up to date. To use one, pass the appropriate client_identifier. The library also allows subtle manual manipulation of TLS.

The second way to solve this problem is to use curl_cffi

This is a wrapper around curl-impersonate, a patched build of the curl C library, and it provides an API similar to requests.

pip install curl_cffi
from curl_cffi import requests

url = 'https://www.g2.com/'

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Accept": "text/html,application/xhtml xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg xml,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Connection": "keep-alive",
}

response = requests.get(url, headers=headers, impersonate="chrome124")

print(response)

curl_cffi also provides TLS presets for some browsers, which are specified via the impersonate parameter. It also provides options for subtle manual manipulation of TLS.

I think someone just said, "They're literally doing the same thing." That's right, and they're both still very raw.

Let's do some simple comparisons:

Feature                  tls_client   curl_cffi
TLS preset               +            +
TLS manual               +            +
async support            -            +
big company support      -            +
number of contributors   -            +

Obviously, curl_cffi wins in this comparison. But as an active user, I have to say that sometimes there are some pretty strange errors that I'm just unsure how to deal with. And let's be honest, so far, they are both pretty raw.

I think we will soon see other libraries that solve this problem.

One might ask, what about Scrapy? I'll be honest: I don't really keep up with their updates. But I haven't heard about Zyte doing anything to bypass TLS fingerprinting. So out of the box, Scrapy will also be blocked, but nothing is stopping you from using curl_cffi in your Scrapy spider, as in the sketch below.
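
Here is a hedged sketch of one way to wire that up: a downloader middleware whose process_request returns a ready response, short-circuiting Scrapy's own downloader. The class name is mine, and note that requests handled this way bypass Scrapy's built-in download machinery:

from curl_cffi import requests as curl_requests
from scrapy.http import HtmlResponse

class CurlCffiDownloaderMiddleware:
    # Fetch the page with curl_cffi so the TLS fingerprint matches a real
    # Chrome build, then hand the body back to Scrapy as a normal response
    def process_request(self, request, spider):
        response = curl_requests.get(request.url, impersonate='chrome124')
        return HtmlResponse(
            url=request.url,
            status=response.status_code,
            body=response.content,
            encoding='utf-8',
            request=request,
        )

You would still need to enable it via DOWNLOADER_MIDDLEWARES in your project settings.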

3. What about headless browsers and Cloudflare Turnstile?

Yes, sometimes we need to use headless browsers. Although I'll be honest, from my point of view, they are used too often even when clearly not necessary.

Even in a headless situation, the folks at Cloudflare have managed to make life difficult for the average web scraper by creating a monster called Cloudflare Turnstile.

To test different tools, you can use this demo page.

To quickly test whether a library works with the browser, you should start by checking the usual non-headless mode. You don't even need to use automation; just open the site using the desired library and act manually.

What libraries are worth checking out for this?

Candidate #1 Playwright + playwright-stealth

It'll be blocked and won't let you solve the captcha.

Playwright is a great library for browser automation. However, the developers explicitly state that they don't plan to develop it as a web scraping tool.

And I haven't heard of any Python projects that effectively solve this problem.
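
For reference, the manual check described above might look like this. This is a minimal sketch: the demo page URL is a placeholder, and playwright-stealth is assumed to be installed:

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync  # pip install playwright-stealth

DEMO_URL = 'https://...'  # URL of a Cloudflare Turnstile demo page

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # visible window for a manual test
    page = browser.new_page()
    stealth_sync(page)  # apply the stealth patches before navigating
    page.goto(DEMO_URL)
    page.wait_for_timeout(60_000)  # leave time to try solving the captcha by hand
    browser.close()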

Candidate #2 undetected_chromedriver

It'll be blocked and won't let you solve the captcha.

This is a fairly common library for working with headless browsers in Python, and in some cases, it allows bypassing Cloudflare Turnstile. But on the target website, it is blocked. Also, in my projects, I've encountered at least two other cases where Cloudflare blocked undetected_chromedriver.

In general, undetected_chromedriver is a good library for your projects, especially since it uses good old Selenium under the hood.
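
Getting started with it takes just a few lines; a minimal sketch using the standard Selenium API it exposes:

import undetected_chromedriver as uc

driver = uc.Chrome()  # launches a patched Chrome under Selenium's control
driver.get('https://www.g2.com/')
print(driver.title)
driver.quit()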

Candidate #3 botasaurus-driver

It allows you to go past the captcha after clicking.

I don't know how its developers pulled this off, but it works. Its main feature is that it was developed specifically for web scraping. It also has a higher-level library to work with - botasaurus.

On the downside, so far, it's pretty raw, and botasaurus-driver has no documentation and a rather challenging API to work with.

To summarize, most likely, your main library for headless browsing will be undetected_chromedriver. But in some particularly challenging cases, you might need to use botasaurus.

4. What about frameworks?

High-level frameworks are designed to speed up and ease development by allowing us to focus on business logic, although we often pay the price in flexibility and control.

So, what are the frameworks for web scraping in 2024?

Scrapy

It's impossible to talk about Python web scraping frameworks without mentioning Scrapy. Scrapinghub (now Zyte) first released it in 2008. For 16 years, it has been developed as an open-source library upon which development companies built their business solutions.

Talking about the advantages of Scrapy, you could write a separate article. But I will emphasize two of them:

  • The huge number of tutorials released over the years
  • Middleware libraries written by the community that extend its functionality, for example, scrapy-playwright.

But what are the downsides?

  • In recent years, Zyte has been focusing more on developing its own platform, so Scrapy itself mostly gets fixes only.
  • Lack of development towards bypassing anti-scraping systems. You have to implement them yourself, but then, why do you need a framework?
  • Scrapy was originally built on the asynchronous framework Twisted. Partial support for asyncio was added only in version 2.0, and looking through the source code, you may notice some workarounds added for this purpose.

Thus, Scrapy is a good and proven solution for sites that are not protected against web scraping. You will need to develop and add the necessary solutions to the framework in order to bypass anti-scraping measures.

Botasaurus

A new framework for web scraping using browser automation, built on botasaurus-driver. The initial commit was made on May 9, 2023.

Let's start with its advantages:

  • Allows you to bypass any Cloudflare protection, as well as many others, using botasaurus-driver.
  • Good documentation for a quick start

Downsides include:

  • Browser automation only, not intended for HTTP clients.
  • Tight coupling with botasaurus-driver; you can't easily replace it with something better if it comes out in the future.
  • No asynchrony, only multithreading.
  • At the moment, it's quite raw and still requires fixes for stable operation.
  • There are very few training materials available at the moment.

This is a good framework for quickly building a web scraper based on browser automation. However, it lacks flexibility and support for HTTP clients, which is crucial for users like me.

Crawlee for Python

A new framework for web scraping in the Python ecosystem. The initial commit was made on Jan 10, 2024, with a release in the media space on July 5, 2024.

apify / crawlee-python on GitHub: Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Developed by Apify, it is a Python adaptation of their famous JS framework crawlee, first released on Jul 9, 2019.

As this is a completely new solution on the market, it is now in an active design and development stage. The community is also actively involved in its development. So, we can see that the use of curl_cffi is already being discussed. The possibility of creating their own Rust-based client was previously discussed. I hope the company doesn't abandon the idea.

From Crawlee team:
"Yeah, for sure we will keep improving Crawlee for Python for years to come."

I personally would like to see an HTTP client for Python developed and maintained by a major company. And Rust has shown itself very well as a library language for Python; just remember Ruff and Pydantic v2.

Advantages:

  • The framework was developed by an established company in the web scraping market, with well-developed expertise in this sphere.
  • Support for both browser automation and HTTP clients.
  • Fully asynchronous, based on asyncio.
  • Active development phase and media activity. The developers listen to the community, which is quite important at this stage.

On a separate note, it has a pretty good modular architecture. If the developers introduce the ability to switch between several HTTP clients, we will get a rather flexible framework that lets us easily change the technologies used, with minimal effort from the development team.
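
To give a feel for the framework, here is a minimal example along the lines of the project's README at the time of writing; the API is still evolving, so treat this as a sketch:

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler()

    # The default handler runs for every successfully crawled page
    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })

    await crawler.run(['https://crawlee.dev'])

asyncio.run(main())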

Downsides:

  • The framework is new. There are very few training materials available at the moment.
  • At the moment, it's quite raw and still requires fixes for stable operation, as well as convenient interfaces for configuration.
  • There is no implementation of any means of bypassing anti-scraping systems for now, other than changing sessions and proxies. But they are being discussed.

I believe that how successful crawlee-python turns out to be depends primarily on the community. Due to the small number of tutorials, it is not suitable for beginners. However, experienced developers may decide to try it instead of Scrapy.

In the long run, it may turn out to be a better solution than Scrapy and Botasaurus. It already provides flexible tools for working with HTTP clients, automating browsers out of the box, and quickly switching between them. However, it lacks tools to bypass scraping protections, and their implementation in the future may be the deciding factor in choosing a framework for you.

Conclusion

If you have read all the way to here, I assume you found it interesting and maybe even helpful :)

The industry is changing and offering new challenges, and if you are professionally involved in web scraping, you will have to keep a close eye on the situation. In some other field, you would remain a developer who makes products using outdated technologies. But in modern web scraping, you become a developer who makes web scrapers that simply don't work.

Also, don't forget that you are part of the larger Python community, and your knowledge can be useful in developing tools that make things happen for all of us. As you can see, many of the tools you need are being built literally right now.

I'll be glad to read your comments. Also, if you need a web scraping expert, or if you just want to discuss the article, you can find me on the following platforms: Github, Linkedin, Apify, Upwork, Contra.

Thank you for your attention :)

Originally published at https://dev.to/crawlee/current-problems-and-mistakes-of-web-scraping-in-python-and-tricks-to-solve-them-1ogf