Speed up `shutil.copytree` !

Published on 2024-11-04

A discussion of speeding up shutil.copytree

Write here

This is a discussion on python.org, see: https://discuss.python.org/t/speed-up-shutil-copytree/62078. If you have any ideas, please send them to me!

Background

shutil is a very useful module in Python. You can find its source on GitHub: https://github.com/python/cpython/blob/master/Lib/shutil.py

shutil.copytree is a function that copies a folder to another folder.

Internally, it calls the _copytree function to do the actual copying.

What does _copytree do?

Ignoring specified files/directories.
Creating destination directories.
Copying files or directories while handling symbolic links.
Collecting and eventually raising errors encountered (e.g., permission issues).
Replicating metadata of the source directory to the destination directory.
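
The five steps above can be sketched as a much simplified function. This is only an illustration, not CPython's implementation: the name simple_copytree is hypothetical, and it leaves out options such as copy_function, ignore_dangling_symlinks and dirs_exist_ok.

```python
import os
import shutil


def simple_copytree(src, dst, ignore=None, symlinks=False):
    """Simplified illustration of the steps above; not CPython's code."""
    with os.scandir(src) as it:
        entries = list(it)

    # 1. Ask the ignore callback which names to skip.
    ignored = set(ignore(src, [e.name for e in entries])) if ignore else set()

    # 2. Create the destination directory.
    os.makedirs(dst)

    errors = []
    for entry in entries:
        if entry.name in ignored:
            continue
        src_path = entry.path
        dst_path = os.path.join(dst, entry.name)
        try:
            # 3. Copy files, recurse into directories, handle symlinks.
            if entry.is_symlink() and symlinks:
                os.symlink(os.readlink(src_path), dst_path)
            elif entry.is_dir():
                simple_copytree(src_path, dst_path, ignore, symlinks)
            else:
                shutil.copy2(src_path, dst_path)
        except shutil.Error as err:
            # 4. Collect errors raised by nested calls ...
            errors.extend(err.args[0])
        except OSError as why:
            errors.append((src_path, dst_path, str(why)))

    # 5. Replicate the source directory's metadata.
    try:
        shutil.copystat(src, dst)
    except OSError as why:
        errors.append((src, dst, str(why)))

    # ... and raise them all together at the end.
    if errors:
        raise shutil.Error(errors)
    return dst
```

Every file is copied one after another inside the for loop, which is the part the rest of this article tries to speed up.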

Problems

_copytree is slow when there are many files or when the files are large.

Test here:

import os
import shutil

os.mkdir('test')
os.mkdir('test/source')

def bench_mark(func, *args):
    import time
    start = time.time()
    func(*args)
    end = time.time()
    print(f'{func.__name__} takes {end - start} seconds')
    return end - start

# write 5000 files
def write_in_5000_files():
    for i in range(5000):
        with open(f'test/source/{i}.txt', 'w') as f:
            # the with block closes the file, so no explicit close() is needed
            f.write('Hello World' + os.urandom(24).hex())

bench_mark(write_in_5000_files)

def copy():
    shutil.copytree('test/source', 'test/destination')

bench_mark(copy)

The result is:

write_in_5000_files takes 4.084963083267212 seconds
copy takes 27.12768316268921 seconds

What I did

Multithreading

I use multithreading to speed up the copying process. I renamed the original function to _copytree_single_threaded and added a new function, _copytree_multithreaded. Here is _copytree_multithreaded:

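from concurrent.futures import ThreadPoolExecutor, as_completed  # os, sys and shutil are also needed

def _copytree_multithreaded(src, dst, symlinks=False, ignore=None, copy_function=shutil.copy2,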
                            ignore_dangling_symlinks=False, dirs_exist_ok=False, max_workers=4):
    """Recursively copy a directory tree using multiple threads."""
    sys.audit("shutil.copytree", src, dst)

    # get the entries to copy
    entries = list(os.scandir(src))

    # make the pool
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit the tasks
        futures = [
            executor.submit(_copytree_single_threaded, entries=[entry], src=src, dst=dst,
                            symlinks=symlinks, ignore=ignore, copy_function=copy_function,
                            ignore_dangling_symlinks=ignore_dangling_symlinks,
                            dirs_exist_ok=dirs_exist_ok)
            for entry in entries
        ]

        # wait for the tasks
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as e:
                print(f"Failed to copy: {e}")
                raise
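
The function above depends on my modified _copytree_single_threaded, so it cannot be run on its own. As a self-contained illustration of the same idea, here is a minimal sketch that uses only the public standard library. It is a variant, not the patch itself: the name copytree_threaded is hypothetical, it submits one task per file rather than one per top-level entry, and it skips symlinks to directories and the ignore callback.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def copytree_threaded(src, dst, max_workers=4):
    """Copy a tree, running the file copies in a thread pool (sketch)."""
    jobs = []
    # Create every directory up front so worker threads only copy files.
    # os.walk does not descend into symlinked directories by default.
    for root, dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target, exist_ok=True)
        for name in files:
            jobs.append((os.path.join(root, name), os.path.join(target, name)))

    errors = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(shutil.copy2, s, d): (s, d) for s, d in jobs}
        for future, (s, d) in futures.items():
            try:
                future.result()
            except OSError as why:
                # Collect failures and report them together, like shutil does.
                errors.append((s, d, str(why)))
    if errors:
        raise shutil.Error(errors)
    return dst
```

Creating the directories before starting the pool avoids several threads trying to create the same destination directory at once.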

I added a check that decides whether to use multithreading or not.

if len(entries) >= 100 or sum(os.path.getsize(entry.path) for entry in entries) >= 100*1024*1024:
    # multithreaded version
    return _copytree_multithreaded(src, dst, symlinks=symlinks, ignore=ignore,
                                   copy_function=copy_function,
                                   ignore_dangling_symlinks=ignore_dangling_symlinks,
                                   dirs_exist_ok=dirs_exist_ok)
else:
    # single threaded version
    return _copytree_single_threaded(entries=entries, src=src, dst=dst,
                                     symlinks=symlinks, ignore=ignore,
                                     copy_function=copy_function,
                                     ignore_dangling_symlinks=ignore_dangling_symlinks,
                                     dirs_exist_ok=dirs_exist_ok)
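
Note that os.path.getsize on a directory reports the size of the directory entry itself, not of its contents, so the size check only measures top-level files. A minimal sketch of the same check as a standalone helper might look like this. The name should_parallelize and its parameters are hypothetical; the thresholds are the ones used above.

```python
import os


def should_parallelize(src, min_entries=100, min_bytes=100 * 1024 * 1024):
    """Return True if src has many top-level entries or large top-level files."""
    count = 0
    total = 0
    with os.scandir(src) as it:
        for entry in it:
            count += 1
            if entry.is_file():
                # Only regular files are counted towards the size threshold.
                total += entry.stat().st_size
            # Stop scanning as soon as either threshold is reached.
            if count >= min_entries or total >= min_bytes:
                return True
    return False
```

Returning early means a directory with many entries does not have to be scanned completely before the decision is made.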

Test

I wrote 50000 files into the source folder. The benchmark helper:

def bench_mark(func, *args):
    import time
    start = time.perf_counter()
    func(*args)
    end = time.perf_counter()
    print(f"{func.__name__} costs {end - start}s")

Write in:

import os
os.mkdir("Test")
os.mkdir("Test/source")

# write 50000 files
def write_in_file():
    for i in range(50000):
        # the with block closes the file, so no explicit close() is needed
        with open(f"Test/source/{i}.txt", 'w') as f:
            f.write(f"{i}")

The two functions being compared:

def copy1():
    import shutil
    # same folder name ("Test") as used when writing the files
    shutil.copytree('Test/source', 'Test/destination1')

def copy2():
    import my_shutil
    my_shutil.copytree('Test/source', 'Test/destination2')

"my_shutil" is my modified version of shutil.

copy1 costs 173.04780609999943s
copy2 costs 155.81321870000102s

In this run copy2 (155.8 s) is about 10% faster than copy1 (173.0 s). You can run the benchmark several times to confirm the result.
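
Since a single run can be noisy, one way to run the benchmark many times is a small helper like the one below. This is only a sketch, and the name bench_repeat is hypothetical.

```python
import statistics
import time


def bench_repeat(func, repeats=5):
    """Run func several times and return (fastest, median) wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings), statistics.median(timings)
```

Each run needs a fresh destination folder, because copytree raises an error if the destination already exists (unless dirs_exist_ok=True is passed).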

Advantages & Disadvantages

Multithreading can speed up the copying process, but it increases memory usage. On the other hand, callers do not need to write any threading code themselves.

Async

Thanks to Barry Scott. I followed his suggestion:

You might get the same improvement for less overhead by using async I/O.

I wrote this code:

import os
import shutil
import asyncio
from concurrent.futures import ThreadPoolExecutor
import time


# create directory
def create_target_directory(dst):
    os.makedirs(dst, exist_ok=True)

# copy 1 file
async def copy_file_async(src, dst):
    # inside a coroutine, get_running_loop() is the preferred way to get the loop
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, shutil.copy2, src, dst)

# copy directory
async def copy_directory_async(src, dst, symlinks=False, ignore=None, dirs_exist_ok=False):
    entries = os.scandir(src)
    create_target_directory(dst)

    tasks = []
    for entry in entries:
        src_path = entry.path
        dst_path = os.path.join(dst, entry.name)

        if entry.is_dir(follow_symlinks=not symlinks):
            tasks.append(copy_directory_async(src_path, dst_path, symlinks, ignore, dirs_exist_ok))
        else:
            tasks.append(copy_file_async(src_path, dst_path))

    await asyncio.gather(*tasks)

# choose copy method
def choose_copy_method(entries, src, dst, **kwargs):
    if len(entries) >= 100 or sum(os.path.getsize(entry.path) for entry in entries) >= 100 * 1024 * 1024:
        # async version
        asyncio.run(copy_directory_async(src, dst, **kwargs))
    else:
        # single thread version
        shutil.copytree(src, dst, **kwargs)

# test function
def bench_mark(func, *args):
    start = time.perf_counter()
    func(*args)
    end = time.perf_counter()
    print(f"{func.__name__} costs {end - start:.2f}s")

# write in 50000 files
def write_in_50000_files():
    for i in range(50000):
        with open(f"Test/source/{i}.txt", 'w') as f:
            f.write(f"{i}")

def main():
    os.makedirs('Test/source', exist_ok=True)
    write_in_50000_files()

    # single-threaded copy
    def copy1():
        shutil.copytree('Test/source', 'Test/destination1')

    def copy2():
        shutil.copytree('Test/source', 'Test/destination2')

    # async
    def copy3():
        entries = list(os.scandir('Test/source'))
        choose_copy_method(entries, 'Test/source', 'Test/destination3')

    bench_mark(copy1)
    bench_mark(copy2)
    bench_mark(copy3)

    shutil.rmtree('Test')

if __name__ == "__main__":
    main()

Output:

copy1 costs 187.21s
copy2 costs 244.33s
copy3 costs 111.27s

You can see that the async version (copy3, 111.27 s) is faster than the single-threaded version (copy1, 187.21 s). However, in this run the single-threaded version was faster than the multithreaded version (copy2, 244.33 s). Maybe my test environment is not very good; you can try it yourself and send me your results as a reply.
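
One thing to keep in mind: loop.run_in_executor(None, ...) hands each copy to the event loop's default executor, which is a thread pool, so the async version is still thread-based underneath. A minimal sketch that makes this explicit and also caps the number of copies in flight is shown below. It assumes Python 3.9+ for asyncio.to_thread; the name copy_tree_async and the limit parameter are hypothetical, and symlink and ignore handling are left out.

```python
import asyncio
import os
import shutil


async def copy_tree_async(src, dst, limit=8):
    """Copy a tree with at most `limit` file copies in flight (sketch)."""
    sem = asyncio.Semaphore(limit)

    async def copy_one(s, d):
        async with sem:
            # shutil.copy2 is blocking, so run it in a worker thread.
            await asyncio.to_thread(shutil.copy2, s, d)

    async def copy_dir(s, d):
        os.makedirs(d, exist_ok=True)
        with os.scandir(s) as it:
            entries = list(it)
        tasks = []
        for entry in entries:
            target = os.path.join(d, entry.name)
            if entry.is_dir():
                tasks.append(copy_dir(entry.path, target))
            else:
                tasks.append(copy_one(entry.path, target))
        # Run all copies for this directory (and its subdirectories) concurrently.
        await asyncio.gather(*tasks)

    await copy_dir(src, dst)
```

It can be called with asyncio.run(copy_tree_async('Test/source', 'Test/destination3')).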

Thank you, Barry Scott!

Advantages & Disadvantages

Async is a good choice, but no solution is perfect. If you find a problem, please send it to me as a reply.

End

This is my first time writing a discussion on python.org. If there are any problems, please let me know. Thank you.

My Github: https://github.com/mengqinyuan
My Dev.to: https://dev.to/mengqinyuan

Release statement: this article is reproduced from https://dev.to/mengqinyuan/add-multithreading-to-shutil--2lm1?1
