Pandas is one of the most popular libraries. While I was looking for an easier way to speed up its performance, I discovered FireDucks and became interested in it!
A pandas program can run into serious performance issues depending on how it is written. However, as a data scientist, I want to spend my time analyzing data rather than tuning my code. So it would be great if something could automatically reorder the processing steps to speed up the program. For example, if Process A => Process B is slow, it would be replaced with Process B => Process A. (Of course, the result is guaranteed to be the same.) It is said that data scientists spend about 45% of their time preparing data, and while I was thinking about how to speed up that step, I came across a module called FireDucks.
According to the FireDucks documentation, it seems to support only Linux platforms. Since I use Windows on my main machine, I decided to try it from WSL2 (Windows Subsystem for Linux), an environment that can run Linux on Windows.
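To make the "Process A => Process B" idea concrete, here is a minimal sketch in plain pandas. The column names and data are hypothetical, and this only illustrates the kind of reordering meant above; it is not a description of how FireDucks works internally. Filtering before sorting gives the same rows as sorting before filtering, but the sort then runs on fewer rows.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "group": ["a", "b", "a", "c", "b", "a"],
    "value": [5, 3, 9, 1, 7, 2],
})

# Process A => Process B: sort every row, then keep only group "a".
sorted_all = df.sort_values("value", kind="stable")
sort_then_filter = sorted_all[sorted_all["group"] == "a"]

# Process B => Process A: keep only group "a", then sort the smaller frame.
filter_then_sort = df[df["group"] == "a"].sort_values("value", kind="stable")

# Both orders produce the same rows in the same order.
same_result = sort_then_filter.equals(filter_then_sort)
print(same_result)
```

On a large frame where the filter discards most rows, the second order does much less sorting work.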
The environment I tried is as follows.
WSL was installed with the help of the following Microsoft documentation; the Linux distribution is Ubuntu 22.04.1 LTS.
Next, install FireDucks itself. Installation is very easy.
pip install fireducks
It will take a few minutes to install FireDucks (along with pyarrow, pandas and other libraries).
When I ran the code below, loading was remarkably fast: pandas took 4 sec, while FireDucks took only 74.5 ns.
# 1. analysis based on time period and creative duration
# convert timestamp to date/time object
df['timestamp_converted'] = pd.to_datetime(df['timestamp'], unit='s')

# define time period
def get_part_of_day(hour):
    if 5 ...

All of this data preprocessing and analysis took around 8 seconds in pandas, whereas FireDucks completed it within 4 seconds, a speedup of almost 2x.
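The listing above is cut off partway through `get_part_of_day`. As a rough sketch of the same pattern (convert a Unix timestamp, then bucket the hour into a part of the day), something like the following would work. The hour boundaries, the labels, and the sample timestamps are my own assumptions for illustration, not the original code.

```python
import pandas as pd

# Hypothetical sample: Unix timestamps at hours 0, 6, 13, 19 and 23 (UTC).
df = pd.DataFrame({"timestamp": [0, 21600, 46800, 68400, 82800]})

# Convert the timestamp to a date/time object.
df["timestamp_converted"] = pd.to_datetime(df["timestamp"], unit="s")

def get_part_of_day(hour):
    # Assumed boundaries; the original ones are not recoverable.
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"

# Label each row by the hour of its converted timestamp.
df["part_of_day"] = df["timestamp_converted"].dt.hour.apply(get_part_of_day)
print(df["part_of_day"].tolist())
```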
Improved performance
One of the most stressful things about using pandas is waiting while a large data set loads, and then waiting again for complex operations like groupby. FireDucks, on the other hand, uses lazy evaluation: loading itself takes almost no time, and the processing is done only where the result is actually needed. I felt this made a significant difference, with a large reduction in total waiting time.
As for other performance figures, the organization has officially announced speedups of up to 16 times compared to pandas. (I will compare the performance with various competing libraries next time.)
Zero learning cost
Being able to use exactly the same notation as pandas, without having to think about anything, is a huge advantage. There are other data frame acceleration libraries besides FireDucks, but their syntax is costly to learn and easy to forget.
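Lazy evaluation also explains why a "load" can appear to take nanoseconds: the call only records what to do, and the real work happens when a result is requested. The toy class below is a conceptual sketch in plain Python. `LazyFrame` and `slow_load` are hypothetical names, and this is not FireDucks code or its API.

```python
import time

class LazyFrame:
    """Toy stand-in for a lazily evaluated data frame (hypothetical)."""

    def __init__(self, loader):
        self._loader = loader  # remember how to load, but do not load yet
        self._data = None

    def _materialize(self):
        # Run the deferred load only the first time data is needed.
        if self._data is None:
            self._data = self._loader()
        return self._data

    def total(self):
        return sum(self._materialize())

def slow_load():
    time.sleep(0.2)  # pretend this is an expensive file read
    return list(range(1000))

start = time.perf_counter()
frame = LazyFrame(slow_load)  # returns immediately, nothing is read
load_seconds = time.perf_counter() - start

start = time.perf_counter()
result = frame.total()  # the deferred load actually runs here
compute_seconds = time.perf_counter() - start

print(f"load: {load_seconds:.6f}s, compute: {compute_seconds:.6f}s")
```

When timing a lazily evaluated library, it is therefore fairer to compare the total time up to the point where a result is actually produced, rather than the load call alone.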
For example, if you want to add columns with polars, you have to write something like this.
# pandas
df["new_col"] = df["A"] + 1

# polars
df = df.with_columns((pl.col("A") + 1).alias("new_col"))

Almost no changes needed to existing code
I have several ETL jobs and other projects that use pandas, and it would be nice to get a performance improvement just by installing FireDucks and replacing the import statement.
If you have anything to add, feel free to leave a comment below.
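Replacing the import statement can be sketched as follows. `fireducks.pandas` is the module FireDucks provides as a pandas-compatible replacement; the fallback to plain pandas is my own addition so that the same script still runs on a machine where FireDucks is not installed (for example, outside WSL).

```python
# Use FireDucks when it is available, otherwise fall back to plain pandas.
try:
    import fireducks.pandas as pd
    backend = "fireducks"
except ImportError:
    import pandas as pd
    backend = "pandas"

# The rest of the program is ordinary pandas code and stays unchanged.
df = pd.DataFrame({"A": [1, 2, 3]})
df["new_col"] = df["A"] + 1
total = int(df["new_col"].sum())

print(backend, total)
```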