In certain scenarios, it becomes necessary to compute the MD5 hash of files larger than the available RAM. The common one-shot approach, hashlib.md5(f.read()), is unsuitable here because it reads the entire file into memory at once.
To overcome this limitation, a practical approach is to read the file in manageable chunks and iteratively update the hash. This allows efficient hash computation without exceeding memory limits.
import hashlib

def md5_for_file(f, block_size=2**20):
    # Read the already-open binary file in 1 MiB chunks and feed each
    # chunk to the hash object, so memory use stays constant.
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:  # an empty bytes object signals end of file
            break
        md5.update(data)
    return md5.digest()
To compute the MD5 hash of a file on disk, pass an open file object to the function:

with open(filename, 'rb') as f:
    md5_hash = md5_for_file(f)
The md5_hash variable will then contain the computed MD5 hash as a raw bytes object (the return value of digest()).
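If you need the familiar 32-character hexadecimal string rather than raw bytes, you can convert the digest with bytes.hex(), which yields the same result as calling hexdigest() on the hash object:

with open(filename, 'rb') as f:
    md5_hash = md5_for_file(f)  # raw 16-byte digest
hex_string = md5_hash.hex()     # 32-character hex string, same as hexdigest()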
Make sure to open the file in binary mode ('rb'); in text mode, newline translation and character decoding would alter the bytes fed to the hash and produce incorrect results. For convenience, the following variant opens the file itself and returns the hash as a hexadecimal string:
import os
import hashlib

def generate_file_md5(rootdir, filename, blocksize=2**20):
    # Join the directory and file name, hash the file in chunks,
    # and return the digest as a hexadecimal string.
    m = hashlib.md5()
    with open(os.path.join(rootdir, filename), 'rb') as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update(buf)
    return m.hexdigest()
This function takes a root directory and a file name, joins them into a path, and returns the file's MD5 hash as a hexadecimal string.
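For example, with a hypothetical directory and file name:

checksum = generate_file_md5('/var/data', 'backup.tar')  # placeholder path and name
print(checksum)  # prints a 32-character hexadecimal string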
By utilizing these techniques, you can efficiently compute MD5 hashes for large files without encountering memory limitations.
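As a final note, Python 3.11 added hashlib.file_digest(), which performs the same chunked reading internally; a minimal sketch:

import hashlib

with open(filename, 'rb') as f:
    md5_hash = hashlib.file_digest(f, 'md5').hexdigest()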