Related
I am trying to delete a duplicated image by comparing md5 file hash.
my code is
from PIL import Image
import hashlib
import os
import sys
import io
img_file = urllib.request.urlopen(img_url, timeout=30)
f = open('C:\\Users\\user\\Documents\\ + img_name, 'wb')
f.write(img_file.read())
f.close # subject image, status = ok
im = Image.open('C:\\Users\\user\\Documents\\ + img_name)
m = hashlib.md5() # get hash
with io.BytesIO() as memf:
im.save(memf, 'PNG')
data = memf.getvalue()
m.update(data)
md5hash = m.hexdigest() # hash done, status = ok
im.close()
if md5hash in hash_list[name]: # comparing hash
os.remove('C:\\Users\\user\\Documents\\ + img_name) # delete file, ERROR
else:
hash_list[name].append(m.hexdigest())
and i get this error
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process:
'C:\\Users\\user\\Documents\\myimage.jpg'
I tried admin command prompt, but still getting this error. Could you find what is accessing the file?
Just noticed you're using f.close instead of f.close()
Add () and check if problem still occurs.
Cheers ;)
Your issue has indeed been as Adrian Daniszewski said, however, there are quite few more programming problems with your code.
First of all, you should familiarize yourself with with. You use with for BytesIO() but it can also be used for opening files.
The benefit of with open(...) as f: is the fact that you don't have to search whether you closed the file or remember to close it. It will close the file at the end of its indentation.
Second, there is a bit of duplication in your code. Your code should be DRY to avoid being forced to change multiple locations with the same stuff.
Imagine having to change where you save the byte files. Right now you will be forced to change in three different locations.
Now imagine not noticing one of these locations.
My suggestion would be first of all to save the path to a variable and use that -
bytesimgfile = 'C:\\Users\\user\\Documents\\' + img_name
An example to use with in your code would be like this:
with open(bytesimgfile , 'wb') as f:
f.write(img_file.read())
A full example with your given code:
from PIL import Image
import hashlib
import os
import sys
import io
img_file = urllib.request.urlopen(img_url, timeout=30)
bytesimgfile = 'C:\\Users\\user\\Documents\\' + img_name
with open(bytesimgfile , 'wb'):
f.write(img_file.read())
with Image.open(bytesimgfile) as im:
m = hashlib.md5() # get hash
with io.BytesIO() as memf:
im.save(memf, 'PNG')
data = memf.getvalue()
m.update(data)
md5hash = m.hexdigest() # hash done, status = ok
if md5hash in hash_list[name]: # comparing hash
os.remove(bytesimgfile) # delete file, ERROR
else:
hash_list[name].append(m.hexdigest())
This was a suggestion from another stack thread that I'm finally getting back to. It was part of a discussion about how to embed tools into a maya file.
You can write the whole thing as a python package, zip it, then stuff the binary contents of the zip file into a fileInfo. When you need to code, look for it in the user's $MAYA_APP_DIR; if there's no zip, write the contents of the fileInfo to disk as a zip and then insert the zip into sys.path
Source discussions were:
python copy scripts into script
and Maya Python Create and Use Zipped Package?
So far the programming is going okay, but I think I hit a snag. When I attempt this:
with open("..directory/myZip.zip","rb") as file:
cmds.fileInfo("myZip", file.read())
..and then I..
print cmds.fileInfo("myZip",q=1)
I get
[u'PK\003\004\024']
which is a bad translation of the first line of gibberish when reading the zip file as a text document.
How can I embed my zip file into my maya file as binary?
====================
Update:
Maya doesn't like writing to the file as a direct read of the utf-8 encoded zip file. I found various methods of making it into an acceptable string that could be written, but the decoding back to the file didn't appear to work. I see now that Theodox's suggestion was to write it to binary and put that in the fileInfo node.
How can one encode, store and then decode to write to file later?
If I were to convert to binary using for instance:
' '.join(format(ord(x), 'b') for x in line)
What code would I need to turn that back into the original utf-8 zip information?
you can find related code here:
http://tech-artists.org/forum/showthread.php?4161-Maya-API-Singleton-Nodes&highlight=mayapersist
the relevant bit is
import base64
encoded = base64.b64encode(value)
decoded = base64.b64decode(encoded)
basically it's the same idea, except using the base64 module instead of binascii. Any method of converting an arbitary character stream to an ascii-safe representation will work fine, as long as you use a reversable method: the potential problem you need to watch out for is a character in the data block that looks to maya like a close quote - an open quote int he fileInfo will be messy in an MA file.
This example uses YAML to do arbitrary key-value pairs but that part is irrelevant to storing the binary stuff. I've used this technique for fairly large data (up to 640k if i recall) but I don't know if it has an upper limit in terms of what you can stash in Maya
Found the answer. Great script on stack overflow. I had to encode to 'string_escape', which is something I found while trying to figure out the whole characters situation. But anyways, you open the zip, encode to 'string_escape', write it to fileInfo and then before fetching to write it back to zip, you decode it back.
Convert binary to ASCII and vice versa
import maya.cmds as cmds
import binascii
def text_to_bits(text, encoding='utf-8', errors='surrogatepass'):
bits = bin(int(binascii.hexlify(text.encode(encoding, errors)), 16))[2:]
return bits.zfill(8 * ((len(bits) + 7) // 8))
def text_from_bits(bits, encoding='utf-8', errors='surrogatepass'):
n = int(bits, 2)
return int2bytes(n).decode(encoding, errors)
def int2bytes(i):
hex_string = '%x' % i
n = len(hex_string)
return binascii.unhexlify(hex_string.zfill(n + (n & 1)))
And then you can
with open("..\maya\scripts/test.zip","rb") as thing:
texty = text_to_bits(thing.read().encode('string_escape'))
cmds.fileInfo("binaryZip",texty)
...later
with open("..\maya\scripts/test_2.zip","wb") as thing:
texty = cmds.fileInfo("binaryZip",q=1)
thing.write( text_from_bits( texty ).decode('string_escape') )
and this appears to work.. so far..
Figured I'd post the final product for anyone wanting to undertake this approach. I tried to do some corruption checking, so that a bad zip doesn't get passed between machines. That's what all the check hashing is for.
def writeTimeFull(tl):
import TimeFull
#reload(TimeFull)
with open(TimeFull.__file__.replace(".pyc",".py"),"r") as file:
cmds.scriptNode( tl.scriptConnection[1][0], e=1, bs=file.read() )
cmds.expression("spark_timeliner_activator",
e=1,s='if (Spark_Timeliner.ShowTimeliner == 1)\n'
'{\n'
'\tsetAttr Spark_Timeliner.ShowTimeliner 0;\n'
'\tpython \"Timeliner.InitTimeliner()\";\n'
'}',
o="Spark_Timeliner",ae=1,uc=all)
def checkHash(zipPath,hash1,hash2,hash3):
check = False
hashes = [hash1,hash2,hash3]
for ii, hash in enumerate(hashes):
if hash == hashes[(ii+1)%3]:
hashes[(ii+2)%3] = hashes[ii]
check = True
if check:
if md5(zipPath) == hashes[0]:
return [zipPath,hashes[0],hashes[1],hashes[2]]
else:
cmds.warning("Hash checks and/or zip are corrupted. Attain toolbox_fix.zip, put it in scripts folder and restart.")
return []
#this writes the zip file to the local users space
def saveOutZip(filename):
if os.path.isfile(filename):
if not os.path.isfile(filename.replace('_pkg','_'+__version__)):
os.rename(filename,filename.replace('_pkg','_'+__version__))
with open(filename,"w") as zipFile:
zipInfo = cmds.fileInfo("zipInfo1",q=1)[0]
zipHash_1 = cmds.fileInfo("zipHash1",q=1)[0]
zipHash_2 = cmds.fileInfo("zipHash2",q=1)[0]
zipHash_3 = cmds.fileInfo("zipHash3",q=1)[0]
zipFile.write( base64.b64decode(zipInfo) )
if checkHash(filename,zipHash_1,zipHash_2,zipHash_3):
cmds.fileInfo("zipInfo2",zipInfo)
return filename
with open(filename,"w") as zipFile:
zipInfo = cmds.fileInfo("zipInfo2",q=1)[0]
zipHash_1 = cmds.fileInfo("zipHash1",q=1)[0]
zipHash_2 = cmds.fileInfo("zipHash2",q=1)[0]
zipHash_3 = cmds.fileInfo("zipHash3",q=1)[0]
zipFile.write( base64.b64decode(zipInfo) )
if checkHash(filename,zipHash_1,zipHash_2,zipHash_3):
cmds.fileInfo("zipInfo1",zipInfo)
return filename
return False
#this writes the local zip to this file
def loadInZip(filename):
zipResults = []
for ii in range(0,10):
with open(filename,"r") as theRead:
zipResults.append([base64.b64encode(theRead.read())]+checkHash(filename,md5(filename),md5(filename),md5(filename)))
if ii>0 and zipResults[ii]==zipResults[ii-1]:
cmds.fileInfo("zipInfo1",zipResults[ii][0])
cmds.fileInfo("zipInfo2",zipResults[ii-1][0])
cmds.fileInfo("zipHash1",zipResults[ii][2])
cmds.fileInfo("zipHash2",zipResults[ii][3])
cmds.fileInfo("zipHash3",zipResults[ii][4])
return True
#file check
#http://stackoverflow.com/questions/3431825/generating-a-md5-checksum-of-a-file
def md5(fname):
import hashlib
hash = hashlib.md5()
with open(fname, "rb") as f:
for chunk in iter(lambda: f.read(4096), b""):
hash.update(chunk)
return hash.hexdigest()
filename = path+'/toolbox_pkg.zip'
zipPaths = [path+'/toolbox_update.zip',
path+'/toolbox_fix.zip',
path+'/toolbox_'+__version__+'.zip',
filename]
zipPaths_exist = [os.path.isfile(zipPath) for zipPath in zipPaths ]
if any(zipPaths_exist[:2]):
if zipPaths_exist[0]:
cmds.warning('Timeliner update present. Forcing file to update version')
if zipPaths_exist[2]:
os.remove(zipPaths[3])
elif os.path.isfile(zipPaths[3]):
os.rename(zipPaths[3], zipPaths[2])
os.rename(zipPaths[0],zipPaths[3])
if zipPaths_exist[1]:
os.remove(zipPaths[1])
else:
cmds.warning('Timeliner fix present. Replacing file to the fix version.')
if os.path.isfile(zipPaths[3]):
os.remove(zipPaths[3])
os.rename(zipPaths[1],zipPaths[3])
loadInZip(filename)
if not cmds.fileInfo("zipInfo1",q=1) and not cmds.fileInfo("zipInfo2",q=1):
loadInZip(filename)
if not os.path.isfile(filename):
saveOutZip(filename)
sys.path.append(filename)
import Timeliner
Timeliner.InitTimeliner(theVers=__version__)
if not any(zipPaths[:2]):
if __version__ > Timeliner.__version__:
cmds.warning('Saving out newer version of timeliner to local machine. Restart Maya to access latest version.')
saveOutZip(filename)
elif __version__ < Timeliner.__version__:
cmds.warning('Timeliner on machine is newer than file version. Saving machine version over timeliner in file.')
loadInZip(filename)
__version__ = Timeliner.__version__
if __name__ != "__main__":
tl = getTimeliner()
writeTimeFull(tl)
I'd like to write a function to calculate the md5 hash of a file, where I could supply the function with either a string that indicates the full file path, or an opened file pointer.
Right now, my function only accepts a string:
def getMD5Hash(fname):
""" Returns an md5 hash
"""
try:
with open(fname,'rb') as fo:
md5 = hashlib.md5()
chunk_sz = md5.block_size * 128
data = fo.read(chunk_sz)
while data:
md5.update(data)
data = fo.read(chunk_sz)
md5hash = base64.urlsafe_b64encode(md5.digest()).decode('UTF-8').rstrip('=\n')
except IOError:
md5hash = None
How can I detect if fname is a string or an open file pointer?
Python has several different file-like types (file, StringIO, io.TextIOWrapper, etc.), which makes asking "Is this a file?" difficult. Instead, ask "Is this a string?" and assume that anything that isn't must be a file:
def getMD5Hash(fname):
if isinstance(fname, str):
# It's a string!
else:
# I guess it's a file, then.
I've had a look around for the answer to this, but I only seem to be able to find software that does it for you. Does anybody know how to go about doing this in python?
I wrote a piece of python code that verifies the hashes of downloaded files against what's in a .torrent file. Assuming you want to check a download for corruption you may find this useful.
You need the bencode package to use this. Bencode is the serialization format used in .torrent files. It can marshal lists, dictionaries, strings and numbers somewhat like JSON.
The code takes the hashes contained in the info['pieces'] string:
torrent_file = open(sys.argv[1], "rb")
metainfo = bencode.bdecode(torrent_file.read())
info = metainfo['info']
pieces = StringIO.StringIO(info['pieces'])
That string contains a succession of 20 byte hashes (one for each piece). These hashes are then compared with the hash of the pieces of on-disk file(s).
The only complicated part of this code is handling multi-file torrents because a single torrent piece can span more than one file (internally BitTorrent treats multi-file downloads as a single contiguous file). I'm using the generator function pieces_generator() to abstract that away.
You may want to read the BitTorrent spec to understand this in more details.
Full code bellow:
import sys, os, hashlib, StringIO, bencode
def pieces_generator(info):
"""Yield pieces from download file(s)."""
piece_length = info['piece length']
if 'files' in info: # yield pieces from a multi-file torrent
piece = ""
for file_info in info['files']:
path = os.sep.join([info['name']] + file_info['path'])
print path
sfile = open(path.decode('UTF-8'), "rb")
while True:
piece += sfile.read(piece_length-len(piece))
if len(piece) != piece_length:
sfile.close()
break
yield piece
piece = ""
if piece != "":
yield piece
else: # yield pieces from a single file torrent
path = info['name']
print path
sfile = open(path.decode('UTF-8'), "rb")
while True:
piece = sfile.read(piece_length)
if not piece:
sfile.close()
return
yield piece
def corruption_failure():
"""Display error message and exit"""
print("download corrupted")
exit(1)
def main():
# Open torrent file
torrent_file = open(sys.argv[1], "rb")
metainfo = bencode.bdecode(torrent_file.read())
info = metainfo['info']
pieces = StringIO.StringIO(info['pieces'])
# Iterate through pieces
for piece in pieces_generator(info):
# Compare piece hash with expected hash
piece_hash = hashlib.sha1(piece).digest()
if (piece_hash != pieces.read(20)):
corruption_failure()
# ensure we've read all pieces
if pieces.read():
corruption_failure()
if __name__ == "__main__":
main()
Here how I've extracted HASH value from torrent file:
#!/usr/bin/python
import sys, os, hashlib, StringIO
import bencode
def main():
# Open torrent file
torrent_file = open(sys.argv[1], "rb")
metainfo = bencode.bdecode(torrent_file.read())
info = metainfo['info']
print hashlib.sha1(bencode.bencode(info)).hexdigest()
if __name__ == "__main__":
main()
It is the same as running command:
transmissioncli -i test.torrent 2>/dev/null | grep "^hash:" | awk '{print $2}'
Hope, it helps :)
According to this, you should be able to find the md5sums of files by searching for the part of the data that looks like:
d[...]6:md5sum32:[hash is here][...]e
(SHA is not part of the spec)
I have used hashlib (which replaces md5 in Python 2.6/3.0), and it worked fine if I opened a file and put its content in the hashlib.md5() function.
The problem is with very big files that their sizes could exceed the RAM size.
How can I get the MD5 hash of a file without loading the whole file into memory?
You need to read the file in chunks of suitable size:
def md5_for_file(f, block_size=2**20):
md5 = hashlib.md5()
while True:
data = f.read(block_size)
if not data:
break
md5.update(data)
return md5.digest()
Note: Make sure you open your file with the 'rb' to the open - otherwise you will get the wrong result.
So to do the whole lot in one method - use something like:
def generate_file_md5(rootdir, filename, blocksize=2**20):
m = hashlib.md5()
with open( os.path.join(rootdir, filename) , "rb" ) as f:
while True:
buf = f.read(blocksize)
if not buf:
break
m.update( buf )
return m.hexdigest()
The update above was based on the comments provided by Frerich Raabe - and I tested this and found it to be correct on my Python 2.7.2 Windows installation
I cross-checked the results using the jacksum tool.
jacksum -a md5 <filename>
Break the file into 8192-byte chunks (or some other multiple of 128 bytes) and feed them to MD5 consecutively using update().
This takes advantage of the fact that MD5 has 128-byte digest blocks (8192 is 128×64). Since you're not reading the entire file into memory, this won't use much more than 8192 bytes of memory.
In Python 3.8+ you can do
import hashlib
with open("your_filename.txt", "rb") as f:
file_hash = hashlib.md5()
while chunk := f.read(8192):
file_hash.update(chunk)
print(file_hash.digest())
print(file_hash.hexdigest()) # to get a printable str instead of bytes
Python < 3.7
import hashlib
def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
h = hash_factory()
with open(filename,'rb') as f:
for chunk in iter(lambda: f.read(chunk_num_blocks*h.block_size), b''):
h.update(chunk)
return h.digest()
Python 3.8 and above
import hashlib
def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
h = hash_factory()
with open(filename,'rb') as f:
while chunk := f.read(chunk_num_blocks*h.block_size):
h.update(chunk)
return h.digest()
Original post
If you want a more Pythonic (no while True) way of reading the file, check this code:
import hashlib
def checksum_md5(filename):
md5 = hashlib.md5()
with open(filename,'rb') as f:
for chunk in iter(lambda: f.read(8192), b''):
md5.update(chunk)
return md5.digest()
Note that the iter() function needs an empty byte string for the returned iterator to halt at EOF, since read() returns b'' (not just '').
Here's my version of Piotr Czapla's method:
def md5sum(filename):
md5 = hashlib.md5()
with open(filename, 'rb') as f:
for chunk in iter(lambda: f.read(128 * md5.block_size), b''):
md5.update(chunk)
return md5.hexdigest()
Using multiple comment/answers for this question, here is my solution:
import hashlib
def md5_for_file(path, block_size=256*128, hr=False):
'''
Block size directly depends on the block size of your filesystem
to avoid performances issues
Here I have blocks of 4096 octets (Default NTFS)
'''
md5 = hashlib.md5()
with open(path,'rb') as f:
for chunk in iter(lambda: f.read(block_size), b''):
md5.update(chunk)
if hr:
return md5.hexdigest()
return md5.digest()
This is Pythonic
This is a function
It avoids implicit values: always prefer explicit ones.
It allows (very important) performance optimizations
A Python 2/3 portable solution
To calculate a checksum (md5, sha1, etc.), you must open the file in binary mode, because you'll sum bytes values:
To be Python 2.7 and Python 3 portable, you ought to use the io packages, like this:
import hashlib
import io
def md5sum(src):
md5 = hashlib.md5()
with io.open(src, mode="rb") as fd:
content = fd.read()
md5.update(content)
return md5
If your files are big, you may prefer to read the file by chunks to avoid storing the whole file content in memory:
def md5sum(src, length=io.DEFAULT_BUFFER_SIZE):
md5 = hashlib.md5()
with io.open(src, mode="rb") as fd:
for chunk in iter(lambda: fd.read(length), b''):
md5.update(chunk)
return md5
The trick here is to use the iter() function with a sentinel (the empty string).
The iterator created in this case will call o [the lambda function] with no arguments for each call to its next() method; if the value returned is equal to sentinel, StopIteration will be raised, otherwise the value will be returned.
If your files are really big, you may also need to display progress information. You can do that by calling a callback function which prints or logs the amount of calculated bytes:
def md5sum(src, callback, length=io.DEFAULT_BUFFER_SIZE):
calculated = 0
md5 = hashlib.md5()
with io.open(src, mode="rb") as fd:
for chunk in iter(lambda: fd.read(length), b''):
md5.update(chunk)
calculated += len(chunk)
callback(calculated)
return md5
A remix of Bastien Semene's code that takes the Hawkwing comment about generic hashing function into consideration...
def hash_for_file(path, algorithm=hashlib.algorithms[0], block_size=256*128, human_readable=True):
"""
Block size directly depends on the block size of your filesystem
to avoid performances issues
Here I have blocks of 4096 octets (Default NTFS)
Linux Ext4 block size
sudo tune2fs -l /dev/sda5 | grep -i 'block size'
> Block size: 4096
Input:
path: a path
algorithm: an algorithm in hashlib.algorithms
ATM: ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512')
block_size: a multiple of 128 corresponding to the block size of your filesystem
human_readable: switch between digest() or hexdigest() output, default hexdigest()
Output:
hash
"""
if algorithm not in hashlib.algorithms:
raise NameError('The algorithm "{algorithm}" you specified is '
'not a member of "hashlib.algorithms"'.format(algorithm=algorithm))
hash_algo = hashlib.new(algorithm) # According to hashlib documentation using new()
# will be slower then calling using named
# constructors, ex.: hashlib.md5()
with open(path, 'rb') as f:
for chunk in iter(lambda: f.read(block_size), b''):
hash_algo.update(chunk)
if human_readable:
file_hash = hash_algo.hexdigest()
else:
file_hash = hash_algo.digest()
return file_hash
You can't get its md5 without reading the full content. But you can use the update function to read the file's content block by block.
m.update(a); m.update(b) is equivalent to m.update(a+b).
I think the following code is more Pythonic:
from hashlib import md5
def get_md5(fname):
m = md5()
with open(fname, 'rb') as fp:
for chunk in fp:
m.update(chunk)
return m.hexdigest()
I don't like loops. Based on Nathan Feger's answer:
md5 = hashlib.md5()
with open(filename, 'rb') as f:
functools.reduce(lambda _, c: md5.update(c), iter(lambda: f.read(md5.block_size * 128), b''), None)
md5.hexdigest()
Implementation of Yuval Adam's answer for Django:
import hashlib
from django.db import models
class MyModel(models.Model):
file = models.FileField() # Any field based on django.core.files.File
def get_hash(self):
hash = hashlib.md5()
for chunk in self.file.chunks(chunk_size=8192):
hash.update(chunk)
return hash.hexdigest()
I'm not sure that there isn't a bit too much fussing around here. I recently had problems with md5 and files stored as blobs in MySQL, so I experimented with various file sizes and the straightforward Python approach, viz:
FileHash = hashlib.md5(FileData).hexdigest()
I couldn’t detect any noticeable performance difference with a range of file sizes 2 KB to 20 MB and therefore no need to 'chunk' the hashing. Anyway, if Linux has to go to disk, it will probably do it at least as well as the average programmer's ability to keep it from doing so. As it happened, the problem was nothing to do with md5. If you're using MySQL, don't forget the md5() and sha1() functions already there.
import hashlib,re
opened = open('/home/parrot/pass.txt','r')
opened = open.readlines()
for i in opened:
strip1 = i.strip('\n')
hash_object = hashlib.md5(strip1.encode())
hash2 = hash_object.hexdigest()
print hash2