Creating an MD5 Hash of a ZipFile - Python

I want to create an MD5 hash of a ZipFile, not of one of the files inside it. However, ZipFile objects aren't easily convertible to streams.
from hashlib import md5
from zipfile import ZipFile
zipped = ZipFile(r'/Foo/Bar/Filename.zip')
hasher = md5()
hasher.update(zipped)
return hasher.hexdigest()
The above code generates the error: TypeError: must be convertible to a buffer, not ZipFile.
Is there a straightforward way to turn a ZipFile into a stream?
There are no security issues here; I just need a quick and easy way to determine whether I've seen a file before. hash(zipped) works fine, but I'd like something a little more robust if possible.

Just open the zip file as a regular file. The following code works on my machine.
from hashlib import md5

m = md5()
with open("/Foo/Bar/Filename.zip", "rb") as f:
    data = f.read()  # read the file in chunks and call update() on each chunk if the file is large
    m.update(data)
print m.hexdigest()
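If the archive is large, a chunked variant of the same idea keeps memory use small; a minimal sketch, reusing the path from above:

from hashlib import md5

m = md5()
with open("/Foo/Bar/Filename.zip", "rb") as f:
    # Read 8 KB at a time so the whole archive never sits in memory.
    for chunk in iter(lambda: f.read(8192), b""):
        m.update(chunk)
print(m.hexdigest())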

This function should return the MD5 hash of any file, given its path (requires the pycrypto module):
from Crypto.Hash import MD5

def get_MD5(file_path):
    chunk_size = 8192
    h = MD5.new()
    with open(file_path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if len(chunk):
                h.update(chunk)
            else:
                break
    return h.hexdigest()

print get_MD5('pics.zip')  # example
output:
6a690fa3e5b34e30be0e7f4216544365
Info on pycrypto
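If you would rather avoid the pycrypto dependency, the standard-library hashlib produces the same digest with the same chunked loop; a minimal sketch:

import hashlib

def get_md5(file_path, chunk_size=8192):
    h = hashlib.md5()
    with open(file_path, 'rb') as f:
        # iter() keeps calling f.read(chunk_size) until it returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(get_md5('pics.zip'))  # example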

Related

Fixed file hash with sha512

I am using this code to hash files, but the hashes change each time I run the script... How can I get a fixed hash? Maybe there is a random seed somewhere; I don't want any randomness here.
I just have a list of files in a folder and I need a unique, fixed hash for each one:
import sys
import os
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # lets read stuff in 64kb chunks!

files = [f for f in os.listdir('.') if os.path.isfile(f)]
sha512 = hashlib.sha512()
hashes = []
for file in files:
    with open(file, 'rb') as f:
        while True:
            data = f.read(BUF_SIZE)
            if not data:
                break
            sha512.update(data)
    hashes.append(file + "|" + sha512.hexdigest())

with open("hash.txt", 'w+') as f:
    for h in hashes:
        f.write(h + '\n')
        print(h)
The output is a file with each filename and its hash. Every file's hash must be the same each time I run the script (which is not the case right now).
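The likely cause: the sha512 object is created once and reused for every file, so each digest accumulates the bytes of all previously hashed files, and the result varies with the directory listing order. Creating a fresh hash object per file gives a stable, per-file hash; a minimal sketch of that fix, using the same chunked loop as above:

import os
import hashlib

BUF_SIZE = 65536  # read in 64 KB chunks

files = [f for f in os.listdir('.') if os.path.isfile(f)]
hashes = []
for file in files:
    sha512 = hashlib.sha512()  # fresh hash object for each file
    with open(file, 'rb') as f:
        while True:
            data = f.read(BUF_SIZE)
            if not data:
                break
            sha512.update(data)
    hashes.append(file + "|" + sha512.hexdigest())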

Hash each line in Text file Python MD5

I'm trying to write a program which will open a text file and give me an md5 hash for each line of text. For example I have a text file with:
66090001081992
66109801042010
68340016052015
68450001062015
79450001062016
This is my code:
import hashlib
hasher = hashlib.md5()
archivo_empleados = open("empleados.txt","rb")
lista = (archivo_empleados.readlines())
archivo_empleados.close()
You could open the file with a with-statement context manager (no need to call .close()), then iterate over each line of the file with a for loop and print the MD5 hash string. You also need to encode the line as UTF-8 before hashing.
import hashlib

def compute_MD5_hash(string, encoding='utf-8'):
    md5_hasher = hashlib.md5()
    md5_hasher.update(string.encode(encoding))
    return md5_hasher.hexdigest()

with open("path/to/file") as f:
    for line in f:
        print(compute_MD5_hash(line))
Which gives hash strings like the following:
58d3ab1af1afd247a90b046d4fefa330
6dea9449f52e07bae45a9c1ed6a03bbc
9e2d8de8f8b3df8a7078b8dc12bb3e35
20819f8084927f700bd58cb8108aabcd
620596810c149a5bc86762d2e1924074
You can have a look at the various hashlib functions in the documentation.
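Note that iterating over a text file keeps the trailing newline on each line, so it becomes part of the hashed input. If you want the digest of just the value on each line, strip the newline first; a small self-contained variation (assuming that is the intent):

import hashlib

with open("path/to/file") as f:
    for line in f:
        # Strip the trailing newline so it does not affect the digest.
        value = line.rstrip('\n')
        print(hashlib.md5(value.encode('utf-8')).hexdigest())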

Download a gzipped file, md5 checksum it, and then save extracted data if matches

I'm currently attempting to download two files using Python, one a gzipped file, and the other, its checksum.
I would like to verify that the gzipped file's contents match the md5 checksum, and then I would like to save the contents to a target directory.
I found out how to download the files here, and I learned how to calculate the checksum here. I load the URLs from a JSON config file, and I learned how to parse JSON file values here.
I put it all together into the following script, but I'm stuck attempting to store the verified contents of the gzipped file.
import json
import gzip
import urllib
import hashlib

# Function for creating an md5 checksum of a file
def md5Gzip(fname):
    hash_md5 = hashlib.md5()
    with gzip.open(fname, 'rb') as f:
        # Make an iterable of the file and divide into 4096 byte chunks
        # The iteration ends when we hit an empty byte string (b"")
        for chunk in iter(lambda: f.read(4096), b""):
            # Update the MD5 hash with the chunk
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

# Open the configuration file in the current directory
with open('./config.json') as configFile:
    data = json.load(configFile)

# Open the downloaded checksum file
with open(urllib.urlretrieve(data['checksumUrl'])[0]) as checksumFile:
    md5Checksum = checksumFile.read()

# Open the downloaded db file and get its md5 checksum via gzip.open
fileMd5 = md5Gzip(urllib.urlretrieve(data['fileUrl'])[0])

if fileMd5 == md5Checksum:
    print 'Downloaded Correct File'
    # save correct file
else:
    print 'Downloaded Incorrect File'
    # do some error handling
In your md5Gzip, return a tuple instead of just the hash.
def md5Gzip(fname):
    hash_md5 = hashlib.md5()
    file_content = None
    with gzip.open(fname, 'rb') as f:
        # Make an iterable of the file and divide into 4096 byte chunks
        # The iteration ends when we hit an empty byte string (b"")
        for chunk in iter(lambda: f.read(4096), b""):
            # Update the MD5 hash with the chunk
            hash_md5.update(chunk)
        # get file content
        f.seek(0)
        file_content = f.read()
    return hash_md5.hexdigest(), file_content
Then, in your code:
fileMd5, file_content = md5Gzip(urllib.urlretrieve(data['fileUrl'])[0])
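To finish the "save correct file" branch, the verified content returned by md5Gzip can simply be written out; a minimal sketch, where 'targetPath' is a hypothetical key I am assuming exists in config.json:

if fileMd5 == md5Checksum:
    print('Downloaded Correct File')
    # Write the verified, decompressed content to the target path.
    # ('targetPath' is a hypothetical key, not from the original config.)
    with open(data['targetPath'], 'wb') as out_file:
        out_file.write(file_content)
else:
    print('Downloaded Incorrect File')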

How do I find the MD5 hash of an ISO file using Python?

I am writing a simple tool that allows me to quickly check MD5 hash values of downloaded ISO files. Here is my algorithm:
import sys
import hashlib

def main():
    filename = sys.argv[1]  # Takes the ISO 'file' as an argument on the command line
    testFile = open(filename, "r")  # Opens and reads the ISO 'file'
    # Use hashlib here to find MD5 hash of the ISO 'file'. This is where I'm having problems
    hashedMd5 = hashlib.md5(testFile).hexdigest()
    realMd5 = input("Enter the valid MD5 hash: ")  # Prompt the user for the valid MD5 hash
    if (realMd5 == hashedMd5):  # Check if valid
        print("GOOD!")
    else:
        print("BAD!!")

main()
My problem is on the 9th line, where I try to take the MD5 hash of the file. I'm getting TypeError: object supporting the buffer API required. Could anyone shed some light on how to make this function work?
The object created by hashlib.md5 doesn't take a file object. You need to feed it data a piece at a time, and then request the hash digest.
import hashlib

testFile = open(filename, "rb")
hash = hashlib.md5()
while True:
    piece = testFile.read(1024)
    if piece:
        hash.update(piece)
    else:  # we're at end of file
        hex_hash = hash.hexdigest()
        break

print hex_hash  # will produce what you're looking for
You need to read the file:
import sys
import hashlib

def main():
    filename = sys.argv[1]  # Takes the ISO 'file' as an argument on the command line
    testFile = open(filename, "rb")  # Opens and reads the ISO 'file'
    m = hashlib.md5()
    while True:
        data = testFile.read(4*1024*1024)
        if not data:
            break
        m.update(data)
    hashedMd5 = m.hexdigest()
    realMd5 = input("Enter the valid MD5 hash: ")  # Prompt the user for the valid MD5 hash
    if (realMd5 == hashedMd5):  # Check if valid
        print("GOOD!")
    else:
        print("BAD!!")

main()
And you probably need to open the file in binary ("rb") and read the blocks of data in chunks. An ISO file is likely too large to fit in memory.

Get the MD5 hash of big files in Python

I have used hashlib (which replaces md5 in Python 2.6/3.0), and it worked fine if I opened a file and put its content in the hashlib.md5() function.
The problem is with very big files whose size can exceed the available RAM.
How can I get the MD5 hash of a file without loading the whole file into memory?
You need to read the file in chunks of suitable size:
import hashlib

def md5_for_file(f, block_size=2**20):
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()
Note: Make sure you open your file in binary mode by passing 'rb' to open() - otherwise you will get the wrong result.
So to do the whole lot in one method - use something like:
import os
import hashlib

def generate_file_md5(rootdir, filename, blocksize=2**20):
    m = hashlib.md5()
    with open(os.path.join(rootdir, filename), "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update(buf)
    return m.hexdigest()
The update above was based on the comments provided by Frerich Raabe; I tested this and found it to be correct on my Python 2.7.2 Windows installation.
I cross-checked the results using the jacksum tool.
jacksum -a md5 <filename>
Break the file into 8192-byte chunks (or some other multiple of 64 bytes) and feed them to MD5 consecutively using update().
This takes advantage of the fact that MD5 processes its input in 64-byte blocks (8192 is 64×128). Since you're not reading the entire file into memory, this won't use much more than 8192 bytes of memory.
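For what it's worth, a minimal sketch of that chunked approach (the filename and chunk size are just examples):

import hashlib

def md5_in_chunks(filename, chunk_size=8192):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object means EOF
                break
            md5.update(chunk)
    return md5.hexdigest()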
In Python 3.8+ you can do
import hashlib

with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes
Python 3.7 and below
import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_num_blocks*h.block_size), b''):
            h.update(chunk)
    return h.digest()
Python 3.8 and above
import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename, 'rb') as f:
        while chunk := f.read(chunk_num_blocks*h.block_size):
            h.update(chunk)
    return h.digest()
Original post
If you want a more Pythonic (no while True) way of reading the file, check this code:
import hashlib

def checksum_md5(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            md5.update(chunk)
    return md5.digest()
Note that the iter() function needs an empty byte string for the returned iterator to halt at EOF, since read() returns b'' (not just '').
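If the sentinel behaviour seems opaque, it is easy to check in isolation; a tiny sketch using an in-memory buffer:

import io

buf = io.BytesIO(b"abcdefghij")
# iter(callable, sentinel) keeps calling the callable until it returns the sentinel.
chunks = list(iter(lambda: buf.read(4), b''))
print(chunks)  # [b'abcd', b'efgh', b'ij']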
Here's my version of Piotr Czapla's method:
import hashlib

def md5sum(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(128 * md5.block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()
Using multiple comments/answers to this question, here is my solution:
import hashlib

def md5_for_file(path, block_size=256*128, hr=False):
    '''
    Block size directly depends on the block size of your filesystem
    to avoid performance issues.
    The default, 256*128 = 32768, is a multiple of 4096 octets
    (the default NTFS block size).
    '''
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
            md5.update(chunk)
    if hr:
        return md5.hexdigest()
    return md5.digest()
This is Pythonic
This is a function
It avoids implicit values: always prefer explicit ones.
It allows (very important) performance optimizations
A Python 2/3 portable solution
To calculate a checksum (md5, sha1, etc.), you must open the file in binary mode, because you'll hash byte values.
To be portable across Python 2.7 and Python 3, you ought to use the io package, like this:
import hashlib
import io

def md5sum(src):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        content = fd.read()
        md5.update(content)
    return md5
If your files are big, you may prefer to read the file by chunks to avoid storing the whole file content in memory:
def md5sum(src, length=io.DEFAULT_BUFFER_SIZE):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
    return md5
The trick here is to use the iter() function with a sentinel (the empty byte string).
The iterator created in this case will call object [here, the lambda function] with no arguments for each call to its next() method; if the value returned is equal to the sentinel, StopIteration will be raised, otherwise the value will be returned.
If your files are really big, you may also need to display progress information. You can do that by calling a callback function which prints or logs the amount of calculated bytes:
def md5sum(src, callback, length=io.DEFAULT_BUFFER_SIZE):
    calculated = 0
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
            calculated += len(chunk)
            callback(calculated)
    return md5
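For completeness, a small usage sketch of that callback variant (the path and the callback name are placeholders of my own, not from the original answer):

def print_progress(num_bytes):
    # hypothetical progress callback: just report how many bytes have been hashed so far
    print("hashed {} bytes".format(num_bytes))

digest = md5sum("/path/to/big_file.iso", print_progress).hexdigest()
print(digest)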
A remix of Bastien Semene's code that takes the Hawkwing comment about generic hashing function into consideration...
import hashlib

def hash_for_file(path, algorithm=hashlib.algorithms[0], block_size=256*128, human_readable=True):
    """
    Block size directly depends on the block size of your filesystem
    to avoid performance issues.
    The default, 256*128 = 32768, is a multiple of 4096 octets
    (the default NTFS block size).

    Linux ext4 block size:
    sudo tune2fs -l /dev/sda5 | grep -i 'block size'
    > Block size: 4096

    Input:
        path: a path
        algorithm: an algorithm in hashlib.algorithms
                   ATM: ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512')
        block_size: a multiple of 128 corresponding to the block size of your filesystem
        human_readable: switch between digest() and hexdigest() output, default hexdigest()
    Output:
        hash
    """
    if algorithm not in hashlib.algorithms:
        raise NameError('The algorithm "{algorithm}" you specified is '
                        'not a member of "hashlib.algorithms"'.format(algorithm=algorithm))

    hash_algo = hashlib.new(algorithm)  # According to the hashlib documentation, using new()
                                        # is slower than calling the named constructors,
                                        # e.g. hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
            hash_algo.update(chunk)
    if human_readable:
        file_hash = hash_algo.hexdigest()
    else:
        file_hash = hash_algo.digest()
    return file_hash
You can't get its md5 without reading the full content. But you can use the update function to read the file's content block by block.
m.update(a); m.update(b) is equivalent to m.update(a+b).
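That equivalence is easy to verify with a small check:

import hashlib

a, b = b"hello ", b"world"

m1 = hashlib.md5()
m1.update(a)
m1.update(b)

m2 = hashlib.md5(a + b)

assert m1.hexdigest() == m2.hexdigest()  # incremental updates == hashing the concatenation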
I think the following code is more Pythonic:
from hashlib import md5

def get_md5(fname):
    m = md5()
    with open(fname, 'rb') as fp:
        for chunk in fp:  # iterating a binary file object yields newline-delimited chunks
            m.update(chunk)
    return m.hexdigest()
I don't like loops. Based on Nathan Feger's answer:
import hashlib
import functools

md5 = hashlib.md5()
with open(filename, 'rb') as f:
    functools.reduce(lambda _, c: md5.update(c), iter(lambda: f.read(md5.block_size * 128), b''), None)
md5.hexdigest()
Implementation of Yuval Adam's answer for Django:
import hashlib
from django.db import models

class MyModel(models.Model):
    file = models.FileField()  # Any field based on django.core.files.File

    def get_hash(self):
        hash = hashlib.md5()
        for chunk in self.file.chunks(chunk_size=8192):
            hash.update(chunk)
        return hash.hexdigest()
I'm not sure that there isn't a bit too much fussing around here. I recently had problems with md5 and files stored as blobs in MySQL, so I experimented with various file sizes and the straightforward Python approach, viz:
FileHash = hashlib.md5(FileData).hexdigest()
I couldn't detect any noticeable performance difference with file sizes ranging from 2 KB to 20 MB, and therefore saw no need to 'chunk' the hashing. Anyway, if Linux has to go to disk, it will probably do it at least as well as the average programmer's ability to keep it from doing so. As it happened, the problem had nothing to do with md5. If you're using MySQL, don't forget the md5() and sha1() functions already there.
import hashlib, re

opened = open('/home/parrot/pass.txt', 'r')
lines = opened.readlines()
opened.close()
for i in lines:
    strip1 = i.strip('\n')
    hash_object = hashlib.md5(strip1.encode())
    hash2 = hash_object.hexdigest()
    print hash2
