Get murmur hash of a file with Python 3

The documentation for the Python library Murmur is a bit sparse.
I have been trying to adapt the code from this answer:
import hashlib
from functools import partial

def md5sum(filename):
    with open(filename, mode='rb') as f:
        d = hashlib.md5()
        for buf in iter(partial(f.read, 128), b''):
            d.update(buf)
    return d.hexdigest()

print(md5sum('utils.py'))
From what I read in the answer, md5 can't operate on the whole file at once, so it needs this looping. I'm not sure exactly what happens on the line d.update(buf), however.
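For what it's worth, hashlib's update() is incremental: feeding the data in pieces gives the same digest as hashing it all at once, which is what the loop relies on. A minimal illustration:
import hashlib

d1 = hashlib.md5()
d1.update(b'hello ')
d1.update(b'world')          # same as hashing b'hello world' in one call

d2 = hashlib.md5(b'hello world')
assert d1.hexdigest() == d2.hexdigest()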
The public methods in hashlib.md5() are:
'block_size', 'copy', 'digest', 'digest_size', 'hexdigest', 'name', 'update'
whereas mmh3 has:
'hash', 'hash64', 'hash_bytes'
No update or hexdigest methods.
Does anyone know how to achieve a similar result?
The motivation is testing for uniqueness as fast as possible; the results here suggest murmur is a good candidate.
Update -
Following the comment from @Bakuriu I had a look at mmh3, which seems to be better documented.
The public methods inside it are:
import mmh3
print([x for x in dir(mmh3) if x[0]!='_'])
>>> ['hash', 'hash128', 'hash64', 'hash_bytes', 'hash_from_buffer']
...so no "update" method. I had a look at the source code for mmh3.hash_from_buffer, but it does not look like it contains a loop, and since it is not written in Python I can't really follow it. Here is a link to the line.
So for now I will just use CRC-32, which is supposed to be almost as good for this purpose, and it is well documented how to do it. If anyone posts a solution I will test it out.
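For reference, a rough sketch of that CRC-32 fallback (the file name and chunk size are just illustrative): zlib.crc32 accepts a running value, so the file can be processed in chunks without loading it all into memory:
import zlib
from functools import partial

def crc32sum(filename, chunk_size=65536):
    checksum = 0
    with open(filename, mode='rb') as f:
        for chunk in iter(partial(f.read, chunk_size), b''):
            checksum = zlib.crc32(chunk, checksum)  # feed the running value back in
    return checksum & 0xFFFFFFFF  # force an unsigned 32-bit result

print(format(crc32sum('utils.py'), '08x'))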

To hash a file using murmur, one has to load it completely into memory and hash it in one go.
import mmh3

with open('main.py', 'rb') as file:  # open in binary mode so raw bytes are hashed
    data = file.read()
    hash = mmh3.hash_bytes(data, 0xBEFFE)

print(hash.hex())
If your file is too large to fit into memory, you could use incremental/progressive hashing: add your data in multiple chunks and hash them on the fly (like your example above).
Is there a Python library for progressive hashing with murmur?
I tried to find one, but it seems there is none.
Is progressive hashing even possible with murmur?
There is a working implementation in C:
https://github.com/rurban/smhasher/blob/master/PMurHash.h
https://github.com/rurban/smhasher/blob/master/PMurHash.c
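A possible workaround, if an approximate fingerprint is acceptable, is to fold per-chunk mmh3 digests together; note this does not reproduce the hash of the whole file, it only gives a deterministic murmur-based fingerprint. A rough sketch (file name and chunk size are illustrative):
import mmh3
from functools import partial

def chunked_mmh3(filename, chunk_size=65536, seed=0):
    digest = b''
    with open(filename, 'rb') as f:
        for chunk in iter(partial(f.read, chunk_size), b''):
            # hash the previous digest together with the next chunk
            digest = mmh3.hash_bytes(digest + chunk, seed)
    return digest

print(chunked_mmh3('main.py').hex())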

Related

Reading results of gurobi optimisation ("results.sol") in new python script

I am trying to run a rolling horizon optimisation where I have multiple optimisation scripts, each generating its own results. Instead of printing results to screen at every interval, I want to write each of the results using model.write("results.sol") - and then read them back into a results-processing script (a separate Python script).
I have tried read("results.sol") in Python, but the file format is not recognised. Is there any way to read/process the .sol file format that Gurobi outputs? It would seem bizarre if you cannot read the .sol file at some later point and generate plots etc.
Maybe I have missed something blindingly obvious.
Hard to answer without seeing your code as we have to guess what you are doing.
But well...
When you use
model.write("out.sol")
Gurobi will use its own format to write it (and what is written is inferred from the file suffix).
This can easily be read by:
model.read("out.sol")
If you used
x = read("out.sol")
you are using Python's basic IO tools, and of course Python won't interpret that file with respect to the format. Furthermore, reading like that is text mode (and maybe binary is required; not sure).
General rule: if you wrote the solution using a class-method of class model, then read using a class-method of class model too.
The usage above is normally used to reinstate some state of your model (e.g. a MIP start). If you want to plot it, you will have to do further work. In this case, using Python's IO tools might be a good idea, and you should respect the format described here. This could be read as CSV or manually (and contrary to my remark earlier: it is text mode, not binary).
So assuming the example from the link is in file gur.sol:
import csv

with open('gur.sol', newline='\n') as csvfile:
    # the format uses two spaces as a delimiter, so collapse them to one first
    reader = csv.reader((line.replace('  ', ' ') for line in csvfile), delimiter=' ')
    next(reader)  # skip header
    sol = {}
    for var, value in reader:
        sol[var] = float(value)

print(sol)
Output:
{'z': 0.2, 'x': 1.0, 'y': 0.5}
Remarks:
The code is ugly because Python's csv module has some limitations.
The delimiter is two spaces in this format, and we need to hack the code to read it (as only a single character is allowed for the delimiter).
The code might be tailored to Python 3 (what I'm using; next() will probably differ in Python 2).
pandas would be much better for this purpose (a huge tool with a very good read_csv).
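For example, a hedged pandas version of the same read, assuming the same 'gur.sol' layout (one header line followed by name/value rows):
import pandas as pd

df = pd.read_csv('gur.sol', sep=r'\s+', skiprows=1, names=['var', 'value'])
sol = dict(zip(df['var'], df['value']))
print(sol)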

Prefer BytesIO or bytes for internal interface in Python?

I'm trying to decide on the best internal interface to use in my code, specifically around how to handle file contents. Really, the file contents are just binary data, so bytes is sufficient to represent them.
I'm storing files in different remote locations, so have a couple of different classes for reading and writing. I'm trying to figure out the best interface to use for my functions. Originally I was using file paths, but that was suboptimal because it meant that disk was always used (which meant lots of clumsy tempfiles).
There are several areas of the code that have the same requirement, and would directly use whatever was returned from this interface. As a result whatever abstraction I choose will touch a fair bit of code.
What are the various tradeoffs to using BytesIO vs bytes?
def put_file(location, contents_as_bytes):
def put_file(location, contents_as_fp):
def get_file_contents(location):
def get_file_contents(location, fp):
Playing around, I've found that using the file-like interfaces (BytesIO, etc.) requires a bit of administrative overhead in terms of seek(0) etc. That raises questions like:
is it better to seek before you start, or after you've finished?
do you seek to the start or just operate from the position the file is in?
should you tell() to maintain the position?
looking at something like shutil.copyfileobj, it doesn't do any seeking
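As a small illustration of that last point, shutil.copyfileobj copies from the current position onward, so a freshly written BytesIO has to be rewound first; a minimal sketch:
import io
import shutil

src = io.BytesIO()
src.write(b'payload')
dst = io.BytesIO()

src.seek(0)                      # without this, nothing would be copied
shutil.copyfileobj(src, dst)
assert dst.getvalue() == b'payload'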
One advantage I've found with using file-like interfaces instead is that it allows for passing in the fp to write into when you're retrieving data. Which seems to give a good deal of flexibility.
def get_file_contents(location, write_into=None):
    if not write_into:
        write_into = io.BytesIO()
    # get the contents and put it into write_into
    return write_into

get_file_contents('blah', file_on_disk)
get_file_contents('blah', gzip_file)
get_file_contents('blah', temp_file)
get_file_contents('blah', bytes_io)
new_bytes_io = get_file_contents('blah')
# etc
Is there a good reason to prefer BytesIO over just using fixed bytes when designing an interface in python?
The benefit of io.BytesIO objects is that they implement a common-ish interface (commonly known as a 'file-like' object). BytesIO objects have an internal pointer (whose position is returned by tell()) and for every call to read(n) the pointer advances n bytes. Ex.
import io
buf = io.BytesIO(b'Hello world!')
buf.read(1) # Returns b'H'
buf.tell() # Returns 1
buf.read(1) # Returns b'e'
buf.tell() # Returns 2
# Set the pointer to 0.
buf.seek(0)
buf.read(1) # Returns b'H' again, like the first call.
In your use case, both the bytes object and the io.BytesIO object may not be the best solution: they hold the complete contents of your files in memory.
Instead, you could look at tempfile.TemporaryFile (https://docs.python.org/3/library/tempfile.html).
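A minimal sketch of that suggestion (the helper name and chunk size are illustrative): copy the source stream into a TemporaryFile in chunks and rewind it before handing it back:
import shutil
import tempfile

def copy_to_tempfile(src_fp, chunk_size=64 * 1024):
    tmp = tempfile.TemporaryFile()
    shutil.copyfileobj(src_fp, tmp, chunk_size)
    tmp.seek(0)  # rewind so callers can read from the start
    return tmp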

What is the most Pythonic way of handling standard input?

I've started doing programming contests/challenges, and often the questions involve reading in from standard input. I've been doing
import fileinput

inputLines = []
for line in fileinput.input():
    inputLines.append(line)
I then can do whatever calculations I need to do with inputLines. Is there a more Pythonic (i.e., better) way of doing this?
If you just want to read from stdin, not from any files named in the command line, then you should not use fileinput.
If you want a list containing the lines from stdin, then:
import sys
inputLines = list(sys.stdin)
Or, to process each line as you read it:
import sys
for line in sys.stdin:
    print "The line was", line
I think fileinput is the flexible way to do it in Python, given they made a module just to do that.
If you know more about what type of input you will be reading, there might be libraries that are better suited to your needs. For example, I do a lot of numerical work, so pandas works great for me because it has a read_csv. Take a look at the docs (and try and tell us more about specific reading needs, if you can narrow it down).
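For instance, a hedged sketch of that idea for comma-separated numeric input with a header row (the exact input format is an assumption):
import sys
import pandas as pd

df = pd.read_csv(sys.stdin)  # parse stdin directly into a DataFrame
print(df.describe())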

MD5 and SHA-2 collisions in Python

I'm writing a simple MP3 cataloguer to keep track of which MP3's are on my various devices. I was planning on using MD5 or SHA2 keys to identify matching files even if they have been renamed/moved, etc. I'm not trying to match MP3's that are logically equivalent (i.e.: same song but encoded differently). I have about 8000 MP3's. Only about 6700 of them generated unique keys.
My problem is that I'm running into collisions regardless of the hashing algorithm I choose. In one case, I have two files that happen to be tracks #1 and #2 on the same album, they are different file sizes yet produce identical hash keys whether I use MD5, SHA2-256, SHA2-512, etc...
This is the first time I'm really using hash keys on files and this is an unexpected result. I feel something fishy is going on here from the little I know about these hashing algorithms. Could this be an issue related to MP3's or Python's implementation?
Here's the snippet of code that I'm using:
data = open(path, 'r').read()
m = hashlib.md5(data)
m.update(data)
md5String = m.hexdigest()
Any answers or insights to why this is happening would be much appreciated. Thanks in advance.
--UPDATE--:
I tried executing this code on Linux (with Python 2.6) and it did not produce a collision. As demonstrated by the stat calls, the files are not the same. I also downloaded WinMD5 and it did not produce a collision (8d327ef3937437e0e5abbf6485c24bb3 and 9b2c66781cbe8c1be7d6a1447994430c). Is this a bug in Python's hashlib on Windows? I tried the same under Python 2.7.1 and 2.6.6 and both give the same result.
import hashlib
import os

def createMD5(path):
    fh = open(path, 'r')
    data = fh.read()
    m = hashlib.md5(data)
    md5String = m.hexdigest()
    fh.close()
    return md5String

print os.stat(path1)
print os.stat(path2)
print createMD5(path1)
print createMD5(path2)
>>> nt.stat_result(st_mode=33206, st_ino=0L, st_dev=0, st_nlink=0, st_uid=0, st_gid=0, st_size=6617216L, st_atime=1303808346L, st_mtime=1167098073L, st_ctime=1290222341L)
>>> nt.stat_result(st_mode=33206, st_ino=0L, st_dev=0, st_nlink=0, st_uid=0, st_gid=0, st_size=4921346L, st_atime=1303808348L, st_mtime=1167098076L, st_ctime=1290222341L)
>>> a7a10146b241cddff031eb03bd572d96
>>> a7a10146b241cddff031eb03bd572d96
I sort of have the feeling that you are reading a chunk of data which is smaller than expected, and this chunk happens to be the same for both files. I don't know why, but try opening the file in binary mode with 'rb'. read() should read up to the end of the file, but Windows behaves differently. From the docs:
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
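Putting that advice together with the chunked approach from the first question, a sketch of a fixed createMD5 opens the file in binary mode and hashes it in chunks:
import hashlib

def createMD5(path, chunk_size=8192):
    m = hashlib.md5()
    with open(path, 'rb') as fh:  # 'rb' is the important part on Windows
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            m.update(chunk)
    return m.hexdigest()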
The files you're having a problem with are almost certainly identical if several different hashing algorithms all return the same hash results on them, or there's a bug in your implementation.
As a sanity test, write your own "hash" that just returns the file's contents in full, and see if it generates the same "hashes".
As others have stated, a single hash collision is unlikely, and multiple collisions are nigh on impossible, unless the files are identical. I would recommend generating the sums with an external utility as a sanity check. For example, in Ubuntu (and most/all other Linux distributions):
blair@blair-eeepc:~$ md5sum Bandwagon.mp3
b87cbc2c17cd46789cb3a3c51a350557 Bandwagon.mp3
blair@blair-eeepc:~$ sha256sum Bandwagon.mp3
b909b027271b4c3a918ec19fc85602233a4c5f418e8456648c426403526e7bc0 Bandwagon.mp3
A quick Google search shows there are similar utilities available for Windows machines. If you see the collisions with the external utilities, then the files are identical. If there are no collisions, you are doing something wrong. I doubt the Python implementation is wrong, as I get the same results when doing the hash in Python:
>>> import hashlib
>>> hashlib.md5(open('Bandwagon.mp3', 'r').read()).hexdigest()
'b87cbc2c17cd46789cb3a3c51a350557'
>>> hashlib.sha256(open('Bandwagon.mp3', 'r').read()).hexdigest()
'b909b027271b4c3a918ec19fc85602233a4c5f418e8456648c426403526e7bc0'
Like @Delan Azabani said, there is something fishy here; collisions are bound to happen, but not that often. Check whether the songs are the same, and please update your post.
Also, if you feel that you don't have enough keys, you can use two (or even more) hashing algorithms at the same time: by using MD5 for example, you have 2**128, or 340282366920938463463374607431768211456 keys. By using SHA-1, you have 2**160 or 1461501637330902918203684832716283019655932542976 keys. By combining them, you have 2**128 * 2**160 = 2**288, or 497323236409786642155382248146820840100456150797347717440463976893159497012533375533056.
(But if you ask me, MD5 is more than enough for your needs.)
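If you did want to combine digests like that, a tiny illustrative sketch (a single MD5 is already plenty here):
import hashlib

def combined_key(data):
    # concatenating an MD5 and a SHA-1 hex digest gives a 288-bit key
    return hashlib.md5(data).hexdigest() + hashlib.sha1(data).hexdigest()

print(combined_key(b'some mp3 bytes'))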

How do I copy wsgi.input if I want to process POST data more than once?

In WSGI, POST data is consumed by reading the file-like object environ['wsgi.input']. If a second element in the stack also wants to read the POST data, it may hang the program by reading when there's nothing more to read.
How should I copy the POST data so it can be processed multiple times?
You could try putting a file-like replica of the stream back in the environment:
from cStringIO import StringIO
length = int(environ.get('CONTENT_LENGTH', '0'))
body = StringIO(environ['wsgi.input'].read(length))
environ['wsgi.input'] = body
Needing to do this is a bit of a smell, though. Ideally only one piece of code should be parsing the query string and post body, and delivering the results to other components.
Go have a look at the WebOb package. It provides functionality that allows one to designate that wsgi.input should be made seekable. This has the effect of allowing you to rewind the input stream so that content can be replayed through a different handler. Even if you don't use WebOb, the way it does this should be instructive, as I would trust Ian to have done this in an appropriate way. For search results in the documentation go here.
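If WebOb fits your stack, a rough sketch of that idea (assuming webob.Request and its make_body_seekable() helper) might look like this:
from webob import Request

def seekable_body_middleware(app):
    def wrapped(environ, start_response):
        req = Request(environ)
        req.make_body_seekable()   # buffers wsgi.input so it can be rewound
        body = req.body            # first reader consumes the POST data...
        req.body_file.seek(0)      # ...then rewind for whatever runs next
        return app(environ, start_response)
    return wrapped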
If you're gonna read it in one fell swoop, you could always read it in, create a CStringIO file-like object of the stuff you've read and then assign it back, like this:
import cStringIO
import copy

lines = []
for line in environ['wsgi.input']:
    lines.append(line)

newlines = copy.copy(lines)
environ['wsgi.input'] = cStringIO.StringIO(''.join(newlines))
There's most likely a more efficient way to do this, but in general I find WSGI's POST handling pretty brittle if you want to do anything non-trivial (like reading POST data multiple times)...
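On Python 3 the same trick would use io.BytesIO instead, since cStringIO no longer exists; a minimal sketch:
import io

length = int(environ.get('CONTENT_LENGTH') or 0)
body = environ['wsgi.input'].read(length)
environ['wsgi.input'] = io.BytesIO(body)  # rewindable copy; seek(0) to re-read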
