What is faster and cheaper to convert files in AWS: Polar or Pandas?

Front page > Programming > What is faster and cheaper to convert files in AWS: Polar or Pandas?

What is faster and cheaper to convert files in AWS: Polar or Pandas?

Published on 2024-08-18

Browse:191

Both offer a wide range of tools and advantages that may make us doubt which of the two to choose at some point. It is not about changing all the company's processes so that they start using Polars or a “death” to Pandas (this is not going to happen in the immediate future). It is about knowing other tools that can help us reduce costs and time in processes, obtaining the same or better results.

When we use cloud services we prioritize certain factors, including their cost. The services I use for this process are AWS Lambda with the Python 3.10 runtime and S3 to store the raw file and the parquet converted file.

The intention is to obtain a CSV file as raw data and process it with pandas and polar with the intention of verifying which of these two libraries offers us better optimization of resources such as memory and the weight of the resulting file.

Pandas
It is a Python library specialized in data manipulation and analysis, it is written in C and its initial release was in 2008.

*Polars *
It is a Python and Rust library specialized in data manipulation and analysis that allows parallel processes and is written mostly in Rust and was released in 2022.

The architecture of the process:

¿Qué es más rápido y económico para convertir archivos en AWS: Polar o Pandas?

The project is somewhat simple as shown in the architecture: The user deposits a CSV file in work/pandas or work/porlas and automatically starts the s3 trigger to process the file to convert it into parquet and deposit it in processed.

In this small project use two lambdas with the following configuration:
Memory: 2 GB
Ephemeral memory: 2 GB
Life time: 600 seconds

Requirements
Lambda with pandas: Pandas, Numpy and Pyarrow
Lambda with polars: Polars

The dataset used for the comparison is available on kaggle under the name “Rotten Tomatoes Movie Reviews – 1.44M rows” or can be downloaded from here.

The full repository is available on GitHub and can be cloned here.

Size or Weight
The lambda that Pandas uses requires two more plugins to create a parquet file, in this case it is PyArrow and a specific version of numpy for the version of Pandas that I was using. As a result, we obtained a lambda with a weight or size of 74.4 MB, something very close to the limit that AWS allows us for the weight of the lambda.

The lambda with Polars does not require another plugin like PyArrow which makes life simpler and reduces the size of the lambda to less than half. As a result, our lambda has a weight or size of 30.6 MB compared to the first, giving us room to install other dependencies that we may need for our transformation process.

Performance

¿Qué es más rápido y económico para convertir archivos en AWS: Polar o Pandas?
The lambda with Pandas was optimized to use compression after the first version, however, its behavior was also analyzed.
Pandas
It took 18 seconds to process the dataset and used 1894 MB of memory to process the CSV file and generate a Parquet file compared to the other versions, it was the one that used the most time and resources.

Pandas Compression
Adding a line of code allowed us to improve a little compared to the previous version (Pandas), it took 17 seconds to process the dataset and used 1837 MB, which does not represent a significant improvement in processing and computational time, but in size. of the resulting file.

Polars
It took 12 seconds to process the same dataset and I used only 1462 MB, compared to the previous two it represents a time saving of 44.44% and lower memory consumption.

Output file size

¿Qué es más rápido y económico para convertir archivos en AWS: Polar o Pandas?
Pandas
The lambda in which a compression process was not established generated a parquet file of 177.4 MB.

Pandas Compression
When configuring compression in the lambda I do not generate a 121.1 MB parquet file. One small line or option helped us reduce the file size by 31.74%. Considering that it is not a significant code change, it is a very good option.

Polars
Polars generated a 105.8 MB file that, purchased with the first version of Pandas, represents a saving of 40.36% and 12.63% against the Pandas version with compression.

Conclusion
It is not necessary to change all the internal processes that use Pandas so that they now use Polars, however, it is important to consider that if we are talking about thousands or millions of lambda executions, using Polars will help us not only with the deployment time but will also help us to have a lower cost due to the time-based charging that AWS makes for Serverless services such as Lambda.
Likewise, when we translate that 40.36% into millions of files we are talking about GBs or TBs, something that would have a significant impact within a Datalake or Dataware house or even in a cold file storage.

The reduction with Polars would not only be limited to these two factors, because it would greatly affect the output of data and/or objects from AWS because it is a service that does have a cost.

Release Statement This article is reproduced at: https://dev.to/edsantoshn/que-es-mas-rapido-y-economico-para-convertir-archivos-en-aws-polar-o-pandas-594p?1 If there is any infringement, please Contact [email protected] to delete

Latest tutorial More>

Where is the best source for jQuery libraries in your web projects?
Where Should You Source jQuery Libraries From?When including jQuery and jQuery UI in your projects, there are several options available. Let's del...

Programming Published on 2024-11-06
PHP Design Pattern: Adapter
The Adapter Design Pattern is a structural pattern that allows objects with incompatible interfaces to work together. It acts as an intermediary (or a...

Programming Published on 2024-11-06
Understanding WebSockets in PHP
WebSockets provide a real-time, full-duplex communication channel over a single TCP connection. Unlike HTTP, where the client sends requests to the se...

Programming Published on 2024-11-06
What C++11 Features are Supported in Visual Studio 2012?
C 11 Features in Visual Studio 2012With the recent release of a preview version of Visual Studio 2012, many developers are curious about the support ...

Programming Published on 2024-11-06
How Can I Automatically Run Python Scripts on Windows Startup?
Running a Python Script on Windows StartupExecuting a Python script every time Windows starts is crucial for automating tasks or launching essential p...

Programming Published on 2024-11-06
Exploring Astral.CSS: The CSS Framework Revolutionizing Web Design.
In the fast-paced world of web development, frameworks play a pivotal role in helping developers create visually appealing and functional websites eff...

Programming Published on 2024-11-06
A Comprehensive Guide to ESnd Arrow Functions
Introduction to ES6 ECMAScript 2015, also known as ES6 (ECMAScript 6), is a significant update to JavaScript, introducing new syntax and feat...

Programming Published on 2024-11-06
Uncovering Algorithms and Data Structures: The Foundation of Efficient Programming
In this series of posts, I will share my learning journey about two topics that are widely discussed in both academic environments and large technolog...

Programming Published on 2024-11-06
How do you use pprof to profile the number of goroutines in your Go program?
Profiling the Number of Goroutines with pprofDetecting potential goroutine leaks in your Go program requires monitoring the number of goroutines activ...

Programming Published on 2024-11-06
How to Pass Class Methods as Callbacks: Understanding Mechanisms and Techniques
How to Pass Class Methods as CallbacksBackgroundIn some scenarios, you may need to pass class methods as callbacks to other functions for efficient ex...

Programming Published on 2024-11-06
Web scraping- Interesting!
A cool term: CRON = programming technique that schedules tasks automatically at specified intervals Web what? When researching projects etc.,...

Programming Published on 2024-11-06
Testimonials Grid Section
? Just finished building this Testimonials Grid Section while learning CSS Grid! ? Grid is perfect for creating structured layouts. ? Live demo: https...

Programming Published on 2024-11-06
Why is REGISTER_GLOBALS Considered a Major Security Risk in PHP?
Dangers of REGISTER_GLOBALSREGISTER_GLOBALS is a PHP setting that enables all GET and POST variables to be available as global variables within PHP sc...

Programming Published on 2024-11-06
Overview of Nodemailer: Easy Email Sending in Node.js
Nodemailer is a Node.js module for sending emails. Here's a quick overview: Transporter: Defines how emails will be sent (via Gmail, custom SMTP, ...

Programming Published on 2024-11-06
Effortless Error Handling in JavaScript: How the Safe Assignment Operator Simplifies Your Code
Error handling in JavaScript can be messy. Wrapping large blocks of code in try/catch statements works, but as your project grows, debugging becomes a...

Programming Published on 2024-11-06