Get the MD5 hash of big files in Python

I have used hashlib (which replaces md5 in Python 2.6/3.0), and it works fine if I open a file and pass its contents to hashlib.md5().
The problem is with very big files whose size can exceed the available RAM.
How can I get the MD5 hash of a file without loading the whole file into memory?

You need to read the file in chunks of a suitable size:

import hashlib

def md5_for_file(f, block_size=2**20):
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()
Note: Make sure you open the file in binary mode ('rb'), otherwise you will get the wrong result.
So, to do the whole lot in one method, use something like:

import hashlib
import os

def generate_file_md5(rootdir, filename, blocksize=2**20):
    m = hashlib.md5()
    with open(os.path.join(rootdir, filename), "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update(buf)
    return m.hexdigest()
The update above was based on the comments provided by Frerich Raabe; I tested it and found it to be correct on my Python 2.7.2 Windows installation.
I cross-checked the results using the jacksum tool.
jacksum -a md5 <filename>

Break the file into 8192-byte chunks (or some other multiple of MD5's 64-byte block size) and feed them to MD5 consecutively using update().
This takes advantage of the fact that MD5 processes its input in 64-byte blocks (8192 is 64×128). Since you're not reading the entire file into memory, this won't use much more than 8192 bytes of memory.
In Python 3.8+ you can do:

import hashlib

with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes

Python < 3.8
import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_num_blocks*h.block_size), b''):
            h.update(chunk)
    return h.digest()
Python 3.8 and above
import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename, 'rb') as f:
        while chunk := f.read(chunk_num_blocks*h.block_size):
            h.update(chunk)
    return h.digest()
Original post
If you want a more Pythonic (no while True) way of reading the file, check this code:
import hashlib

def checksum_md5(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            md5.update(chunk)
    return md5.digest()
Note that the iter() function needs an empty byte string for the returned iterator to halt at EOF, since read() returns b'' (not just '').
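As a small, self-contained illustration of the iter(callable, sentinel) idiom used here (the in-memory BytesIO buffer is just for demonstration):

import io

buf = io.BytesIO(b"abcdefghij")
for chunk in iter(lambda: buf.read(4), b''):
    print(chunk)  # prints b'abcd', b'efgh', b'ij', then stops at EOF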

Here's my version of Piotr Czapla's method:
import hashlib

def md5sum(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(128 * md5.block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

Using multiple comment/answers for this question, here is my solution:
import hashlib

def md5_for_file(path, block_size=256*128, hr=False):
    '''
    Block size directly depends on the block size of your filesystem
    to avoid performance issues.
    The default, 256*128 = 32768 bytes, is a multiple of the 4096-octet
    block size of a default NTFS filesystem.
    '''
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
            md5.update(chunk)
    if hr:
        return md5.hexdigest()
    return md5.digest()
This is Pythonic
This is a function
It avoids implicit values: always prefer explicit ones.
It allows (very important) performance optimizations
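For instance, a quick usage sketch (the file path here is only a placeholder):

digest_bytes = md5_for_file('/path/to/large.iso')         # raw 16-byte digest
digest_hex = md5_for_file('/path/to/large.iso', hr=True)  # printable hex string
print(digest_hex)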

A Python 2/3 portable solution
To calculate a checksum (md5, sha1, etc.), you must open the file in binary mode, because you'll be hashing byte values.
To be portable across Python 2.7 and Python 3, you ought to use the io package, like this:
import hashlib
import io

def md5sum(src):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        content = fd.read()
        md5.update(content)
    return md5
If your files are big, you may prefer to read the file by chunks to avoid storing the whole file content in memory:
def md5sum(src, length=io.DEFAULT_BUFFER_SIZE):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
    return md5
The trick here is to use the iter() function with a sentinel (the empty byte string b'').
The iterator created in this case will call the lambda with no arguments for each call to its next() method; if the value returned is equal to the sentinel, StopIteration will be raised, otherwise the value will be returned.
If your files are really big, you may also need to display progress information. You can do that by calling a callback function which prints or logs the number of bytes hashed so far:

def md5sum(src, callback, length=io.DEFAULT_BUFFER_SIZE):
    calculated = 0
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
            calculated += len(chunk)
            callback(calculated)
    return md5
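As a sketch of how such a callback might be wired up (the file name and the print-based callback are assumptions made for illustration):

def print_progress(num_bytes):
    # Example callback: report how many bytes have been hashed so far.
    print("hashed {} bytes".format(num_bytes))

digest = md5sum("big_file.bin", callback=print_progress)
print(digest.hexdigest())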

A remix of Bastien Semene's code that takes Hawkwing's comment about a generic hashing function into consideration:
import hashlib

def hash_for_file(path, algorithm=hashlib.algorithms[0], block_size=256*128, human_readable=True):
    """
    Block size directly depends on the block size of your filesystem
    to avoid performance issues.
    The default, 256*128 = 32768 bytes, is a multiple of the 4096-octet block
    size of default NTFS and Linux ext4 filesystems; check ext4 with:
        sudo tune2fs -l /dev/sda5 | grep -i 'block size'
        > Block size: 4096
    Input:
        path: a path
        algorithm: an algorithm in hashlib.algorithms
            ATM: ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512')
        block_size: a multiple of 128 corresponding to the block size of your filesystem
        human_readable: switch between digest() and hexdigest() output; default is hexdigest()
    Output:
        hash
    """
    # Note: hashlib.algorithms exists in Python 2.7; in Python 3 use
    # hashlib.algorithms_guaranteed / hashlib.algorithms_available instead.
    if algorithm not in hashlib.algorithms:
        raise NameError('The algorithm "{algorithm}" you specified is '
                        'not a member of "hashlib.algorithms"'.format(algorithm=algorithm))
    hash_algo = hashlib.new(algorithm)  # According to the hashlib documentation, using new()
                                        # is slower than calling the named
                                        # constructors, e.g. hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
            hash_algo.update(chunk)
    if human_readable:
        file_hash = hash_algo.hexdigest()
    else:
        file_hash = hash_algo.digest()
    return file_hash
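A possible call, assuming a Python 2.7 environment where hashlib.algorithms exists (the file name is hypothetical):

print(hash_for_file('/tmp/example.bin', algorithm='sha256'))  # hex digest, since human_readable defaults to True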

You can't compute its MD5 without reading the full content, but you don't need to read it all at once: use the update() function to feed in the file's content block by block.
m.update(a); m.update(b) is equivalent to m.update(a+b).
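A tiny demonstration of that property:

import hashlib

a, b = b"hello ", b"world"

m1 = hashlib.md5()
m1.update(a)
m1.update(b)              # fed in two pieces

m2 = hashlib.md5(a + b)   # fed all at once

assert m1.hexdigest() == m2.hexdigest()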

I think the following code is more Pythonic:
from hashlib import md5

def get_md5(fname):
    m = md5()
    with open(fname, 'rb') as fp:
        for chunk in fp:  # iterating a binary file yields chunks split on b'\n', so chunk sizes are unbounded
            m.update(chunk)
    return m.hexdigest()

I don't like loops. Based on Nathan Feger's answer:
import functools
import hashlib

md5 = hashlib.md5()
with open(filename, 'rb') as f:
    functools.reduce(lambda _, c: md5.update(c), iter(lambda: f.read(md5.block_size * 128), b''), None)
md5.hexdigest()

Implementation of Yuval Adam's answer for Django:
import hashlib
from django.db import models

class MyModel(models.Model):
    file = models.FileField()  # Any field based on django.core.files.File

    def get_hash(self):
        hash = hashlib.md5()
        for chunk in self.file.chunks(chunk_size=8192):
            hash.update(chunk)
        return hash.hexdigest()

I'm not sure that there isn't a bit too much fussing around here. I recently had problems with md5 and files stored as blobs in MySQL, so I experimented with various file sizes and the straightforward Python approach, viz:
FileHash = hashlib.md5(FileData).hexdigest()
I couldn't detect any noticeable performance difference with file sizes ranging from 2 KB to 20 MB, and therefore saw no need to 'chunk' the hashing. Anyway, if Linux has to go to disk, it will probably do it at least as well as the average programmer can prevent it from doing so. As it happened, the problem had nothing to do with md5. If you're using MySQL, don't forget the md5() and sha1() functions already available there.

import hashlib

opened = open('/home/parrot/pass.txt', 'r')
lines = opened.readlines()
for i in lines:
    strip1 = i.strip('\n')
    hash_object = hashlib.md5(strip1.encode())  # hashes each line separately, not the whole file
    hash2 = hash_object.hexdigest()
    print hash2

Related

Calling argument to function and getting the file hashes

I'm attempting to get the hashes of the file which is the argument supplied. Here is my current code:
import hashlib
import argparse

md5 = hashlib.md5()
sha1 = hashlib.sha1()
sha256 = hashlib.sha256()
BUF_SIZE = 32768

parse = argparse.ArgumentParser()
parse.add_argument("-test", help='testing')
args = parse.parse_args()

def hashing(hashThis=args.test):
    with open(hashThis, 'rb') as f:
        while True:
            data = f.read(BUF_SIZE)
            if not data:
                break
        md5.update(data)
        sha1.update(data)
        sha256.update(data)
    #print hashes
    print('MD5: {0}'.format(md5.hexdigest()))
    print('SHA1: {0}'.format(sha1.hexdigest()))
    print('SHA256: {0}'.format(sha256.hexdigest()))

hashing(hashThis=args.test)
This gives me the following output:
user@user:~/Testing$ python test.py -test test.txt
MD5: d41d8cd98f00b204e9800998ecf8427e
SHA1: da39a3ee5e6b4b0d3255bfef95601890afd80709
SHA256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
The issue is that the hashes given are for an empty file; using sha256sum on the same file I get:
user@user:~/Testing$ sha256sum test.txt
8f434346648f6b96df89dda901c5176b10a6d83961dd3c1ac88b59b2dc327aa4 test.txt
It's not pulling the data from the file, and it works if I use the same code outside of a function. I feel like I'm missing something obvious, but can't figure it out.
You need to update the hash objects within the while loop. Right now the update calls run only after the loop exits, when data is empty, so all you hash is that empty bytes object.
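A corrected sketch of the hashing() function with the update calls moved inside the loop (it keeps the question's BUF_SIZE and argparse handling, and creates the hash objects inside the function so repeated calls start fresh):

def hashing(hashThis=args.test):
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    sha256 = hashlib.sha256()
    with open(hashThis, 'rb') as f:
        while True:
            data = f.read(BUF_SIZE)
            if not data:
                break
            md5.update(data)     # update once per chunk, inside the loop
            sha1.update(data)
            sha256.update(data)
    print('MD5: {0}'.format(md5.hexdigest()))
    print('SHA1: {0}'.format(sha1.hexdigest()))
    print('SHA256: {0}'.format(sha256.hexdigest()))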

Creating an MD5 Hash of a ZipFile

I want to create an MD5 hash of a ZipFile, not of one of the files inside it. However, ZipFile objects aren't easily convertible to streams.
from hashlib import md5
from zipfile import ZipFile
zipped = ZipFile(r'/Foo/Bar/Filename.zip')
hasher = md5()
hasher.update(zipped)
return hasher.hexdigest()
The above code generates the error: TypeError: must be convertible to a buffer, not ZipFile.
Is there a straightforward way to turn a ZipFile into a stream?
There are no security issues here; I just need a quick and easy way to determine if I've seen a file before. hash(zipped) works fine, but I'd like something a little more robust if possible.
Just open the ZipFile as a regular file. The following code works on my machine:

from hashlib import md5

m = md5()
with open("/Foo/Bar/Filename.zip", "rb") as f:
    data = f.read()  # read the file in chunks and call update() on each chunk if the file is large
    m.update(data)
print m.hexdigest()
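If the archive is large, the same chunked pattern from the answers above applies; a minimal sketch (reusing the question's example path):

from hashlib import md5

m = md5()
with open("/Foo/Bar/Filename.zip", "rb") as f:
    for chunk in iter(lambda: f.read(8192), b''):
        m.update(chunk)
print(m.hexdigest())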
This function should return the MD5 hash of any file, provided its path (requires the pycrypto module):

from Crypto.Hash import MD5

def get_MD5(file_path):
    chunk_size = 8192
    h = MD5.new()
    with open(file_path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if len(chunk):
                h.update(chunk)
            else:
                break
    return h.hexdigest()

print get_MD5('pics.zip')  # example
output:
6a690fa3e5b34e30be0e7f4216544365
Info on pycrypto

Troubleshoot a simple checksum comparison script

I have created a simple checksum script that computes the checksum of a file called tecu.a2l and compares it to a few .md5 files, ensuring that they all have the exact same checksum whenever this script runs.
To make things easier to understand:
Let's say I have tecu.a2l with the checksum 1x2x3x. The md5 files (if generated correctly) should have the same checksum (1x2x3x). If one of the md5 files has a different checksum than the latest tecu.a2l, the script should give an error.
Hopefully the code fills in the blanks if my description is not quite clear.
import hashlib
import dst_creator_constants as CONST
import Tkinter

path_a2l = 'C:<path>\tecu.a2l'
md5 = hashlib.md5()
blocks = 65565

with open(path_a2l, 'rb') as a2l:
    readA2L = a2l.read(blocks)

generatedMD5 = md5.hexdigest()
print "stop1"

ihx_md5_files = CONST.PATH_DELIVERABLES_DST
for file in ihx_md5_files:
    print "stop2"
    if file.endswith('.md5'):
        print "stop3"
        readMD5 = file.read()
        if compare_checksums:
            print "Yes"
            # Add successful TkInter msg here
        else:
            print "No"
            # Add error msg here

def compare_checksums(generatedMD5, readMD5):
    if generatedMD5 == readMD5:
        return True
    else:
        return False
When I run this script, nothing happens: no messages, nothing. If I type python checksum.py into cmd, it returns no message. So I put in some print statements to see what the issue could be. stop3 is never shown in the command prompt, which means the problem has something to do with the if file.endswith('.md5'): statement.
I have no idea why it's the culprit, as I have used file.endswith() in a previous, related script where it worked, so I am turning to you.
You never update the hash object: the data you read just stays in your readA2L variable. Also, your file may be larger than the 65565-byte buffer you allow it. Try updating your hasher as in the function below and let us know what the result is.
import hashlib as h
from os.path import isfile

hasher = h.md5()
block_size = 65536

def get_hexdigest(file_path, hasher, block_size):
    if isfile(file_path):
        with open(file_path, 'rb') as f:
            buf = f.read(block_size)
            while len(buf) > 0:
                # Update the hasher until the entire file has been read
                hasher.update(buf)
                buf = f.read(block_size)
        digest = hasher.hexdigest()
    else:
        return None
    return digest
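A hedged sketch of how the comparison the question describes might then be wired up; the paths are placeholders and the .md5 file is assumed to contain the hex digest as its first whitespace-separated token:

generated = get_hexdigest(r'C:<path>\tecu.a2l', hasher, block_size)

md5_path = r'C:<path>\deliverable.md5'        # hypothetical .md5 file
with open(md5_path, 'r') as f:
    expected = f.read().split()[0]            # assumes the digest is the first token

if generated == expected:
    print("Yes")   # checksums match
else:
    print("No")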

Determine if variable is an open file pointer, or a string

I'd like to write a function to calculate the md5 hash of a file, where I could supply the function with either a string that indicates the full file path, or an opened file pointer.
Right now, my function only accepts a string:
import base64
import hashlib

def getMD5Hash(fname):
    """ Returns an md5 hash
    """
    try:
        with open(fname, 'rb') as fo:
            md5 = hashlib.md5()
            chunk_sz = md5.block_size * 128
            data = fo.read(chunk_sz)
            while data:
                md5.update(data)
                data = fo.read(chunk_sz)
        md5hash = base64.urlsafe_b64encode(md5.digest()).decode('UTF-8').rstrip('=\n')
    except IOError:
        md5hash = None
    return md5hash
How can I detect if fname is a string or an open file pointer?
Python has several different file-like types (file, StringIO, io.TextIOWrapper, etc.), which makes asking "Is this a file?" difficult. Instead, ask "Is this a string?" and assume that anything that isn't must be a file:
def getMD5Hash(fname):
    if isinstance(fname, str):
        ...  # It's a string!
    else:
        ...  # I guess it's a file, then.
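A minimal sketch of that approach, assuming the non-string branch receives an already-open binary file object that the caller will close:

import base64
import hashlib

def getMD5Hash(fname):
    md5 = hashlib.md5()
    chunk_sz = md5.block_size * 128

    def _feed(fo):
        for chunk in iter(lambda: fo.read(chunk_sz), b''):
            md5.update(chunk)

    if isinstance(fname, str):          # it's a path: open it ourselves
        with open(fname, 'rb') as fo:
            _feed(fo)
    else:                               # assume an open binary file object
        _feed(fname)
    return base64.urlsafe_b64encode(md5.digest()).decode('UTF-8').rstrip('=\n')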

How to read from a text file compressed with 7z?

I would like to read (in Python 2.7), line by line, from a csv (text) file, which is 7z compressed. I don't want to decompress the entire (large) file, but to stream the lines.
I tried pylzma.decompressobj() unsuccessfully. I get a data error. Note that this code doesn't yet read line by line:
input_filename = r"testing.csv.7z"
with open(input_filename, 'rb') as infile:
obj = pylzma.decompressobj()
o = open('decompressed.raw', 'wb')
obj = pylzma.decompressobj()
while True:
tmp = infile.read(1)
if not tmp: break
o.write(obj.decompress(tmp))
o.close()
Output:
o.write(obj.decompress(tmp))
ValueError: data error during decompression
This will allow you to iterate the lines. It's partially derived from some code I found in an answer to another question.
At this point in time (pylzma-0.5.0) the py7zlib module doesn't implement an API that would allow archive members to be read as a stream of bytes or characters — its ArchiveFile class only provides a read() function that decompresses and returns the uncompressed data in a member all at once. Given that, about the best that can be done is return bytes or lines iteratively via a Python generator using that as a buffer.
The following does the latter, but may not help if the problem is that the archive member file itself is huge.
The code below should work in Python 3.x as well as 2.7.
import io
import os
import py7zlib

class SevenZFileError(py7zlib.ArchiveError):
    pass

class SevenZFile(object):
    @classmethod
    def is_7zfile(cls, filepath):
        """ Determine if filepath points to a valid 7z archive. """
        is7z = False
        fp = None
        try:
            fp = open(filepath, 'rb')
            archive = py7zlib.Archive7z(fp)
            _ = len(archive.getnames())
            is7z = True
        finally:
            if fp: fp.close()
        return is7z

    def __init__(self, filepath):
        fp = open(filepath, 'rb')
        self.filepath = filepath
        self.archive = py7zlib.Archive7z(fp)

    def __contains__(self, name):
        return name in self.archive.getnames()

    def readlines(self, name, newline=''):
        r""" Iterator of lines from named archive member.

        `newline` controls how line endings are handled.
        It can be None, '', '\n', '\r', and '\r\n' and works the same way as it does
        in StringIO. Note however that the default value is different and is to enable
        universal newlines mode, but line endings are returned untranslated.
        """
        archivefile = self.archive.getmember(name)
        if not archivefile:
            raise SevenZFileError('archive member %r not found in %r' %
                                  (name, self.filepath))
        # Decompress entire member and return its contents iteratively.
        data = archivefile.read().decode()
        for line in io.StringIO(data, newline=newline):
            yield line

if __name__ == '__main__':
    import csv

    if SevenZFile.is_7zfile('testing.csv.7z'):
        sevenZfile = SevenZFile('testing.csv.7z')
        if 'testing.csv' not in sevenZfile:
            print('testing.csv is not a member of testing.csv.7z')
        else:
            reader = csv.reader(sevenZfile.readlines('testing.csv'))
            for row in reader:
                print(', '.join(row))
If you were using Python 3.3+, you might be able to do this using the lzma module, which was added to the standard library in that version. Note that lzma reads raw .xz / .lzma compressed streams, not the .7z container format.
See: lzma Examples
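For completeness, a sketch of that stdlib route, assuming the data were recompressed as a plain .xz stream (a hypothetical testing.csv.xz) rather than a .7z archive:

import csv
import lzma

with lzma.open('testing.csv.xz', 'rt', newline='') as f:   # decompresses as it reads, line by line
    for row in csv.reader(f):
        print(', '.join(row))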
If you can use Python 3, there is a useful library, py7zr, which supports partial 7zip decompression, as below:

import py7zr
import re

filter_pattern = re.compile(r'<your/target/file_and_directories/regex/expression>')
with py7zr.SevenZipFile('archive.7z', 'r') as archive:
    allfiles = archive.getnames()
    selective_files = [f for f in allfiles if filter_pattern.match(f)]
    archive.extract(targets=selective_files)
