Python unhash value

I am a newbie to Python. Can I unhash, or rather, how can I unhash a value? I am using the standard hash() function. What I would like to do is first hash a value, send it somewhere, and then unhash it, like so:
#process X
hashedVal = hash(someVal)
#send n receive in process Y
someVal = unhash(hashedVal)
#for example print it
print someVal
Thx in advance

It can't be done.
A hash is not a compressed version of the original value; it is a number (or something similar) derived from the original value. The nature of hash implementations is that it is possible (but statistically unlikely, if the hash algorithm is a good one) for two different objects to produce the same hash value.
This is known as the Pigeonhole Principle, which basically states that if you have N different items and want to place them into M different categories, where N is larger than M (i.e. more items than categories), you're going to end up with some categories containing multiple items. Since a hash value is typically much smaller than the data it hashes, it follows the same principle.
As such, it is impossible to go back once you have the hash value. You need a different way of transporting data than this.
For instance, a simple (but not very good) hash algorithm would be to take the number modulo 3 (i.e. the remainder after dividing by 3). Then you would get the following hash values from numbers:
1 --> 1 <--+-- same hash number, but different original values
2 --> 2    |
3 --> 0    |
4 --> 1 <--+
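In Python, that toy modulo-3 hash and its collision look like this:

```python
def toy_hash(n):
    # toy hash: the remainder after dividing by 3
    return n % 3

# 1 and 4 collide: same hash number, different original values,
# so there is no way to go from the hash back to the input
print([toy_hash(n) for n in (1, 2, 3, 4)])  # [1, 2, 0, 1]
```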
Are you trying to use the hash function in this way in order to:
Save space (you have observed that the hash value is much smaller in size than the original data)
Secure transportation (you have observed that the hash value is difficult to reverse)
Transport data (you have observed that the hash number/string is easier to transport than a complex object hierarchy)
... ?
Knowing why you want to do this might give you a better answer than just "it can't be done".
For instance, for the above 3 different observations, here's a way to do each of them properly:
Compression/Decompression, for instance using gzip or zlib (the two typically available in most programming languages/runtimes)
Encryption/Decryption, for instance using RSA, AES or a similar secure encryption algorithm
Serialization/Deserialization, which is code built to take a complex object hierarchy and produce either a binary or textual representation that later on can be deserialized back into new objects
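The first of those, for instance, round-trips with the stdlib zlib; unlike a hash, the original is fully recoverable:

```python
import zlib

data = b'some value to send' * 100
packed = zlib.compress(data)             # much smaller than data
restored = zlib.decompress(packed)       # lossless: the original comes back
assert restored == data
print(len(data), '->', len(packed))
```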

Even if I'm almost 8 years late with an answer, I want to say it is possible to unhash data (though not with the standard hash() function).
The previous answers all describe cryptographic hash functions, which by design compute hashes that are impossible (or at least very hard) to reverse.
However, this is not the case with all hash functions.
Solution
You can use the basehash python lib (pip install basehash) to achieve what you want.
There is an important thing to keep in mind though: in order to be able to unhash the data, you need to hash it without loss of data. This generally means that the bigger the pool of data types and values you would like to hash, the bigger the hash length has to be, so that you won't get hash collisions.
Anyway, here's a simple example of how to hash/unhash data:
import basehash
hash_fn = basehash.base36() # you can initialize a 36, 52, 56, 58, 62 and 94 base fn
hash_value = hash_fn.hash(1) # returns 'M8YZRZ'
unhashed = hash_fn.unhash('M8YZRZ') # returns 1
You can define the hash length on hash function initialization and hash other data types as well.
I leave out the explanation of the necessity for various bases and hash lengths to the readers who would like to find out more about hashing.

You can't "unhash" data; hash functions are irreversible due to the pigeonhole principle:
http://en.wikipedia.org/wiki/Hash_function
http://en.wikipedia.org/wiki/Pigeonhole_principle
I think what you are looking for is encryption/decryption. (Or compression or serialization, as mentioned in other answers/comments.)

This is not possible in general. A hash function necessarily loses information, and python's hash is no exception.

Related

how to intern dictionary string keys in python?

After reading that interning strings can help with performance: do I just store the return value from the sys.intern call in the dictionary as the key, and that is it?
t = {}
t[sys.intern('key')] = 'val'
Thanks
Yes, that's how you would use it.
To be more specific on the performance, the doc states that:
Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare.
There are two steps in a (classic) dict lookup: 1. hash the object into a number that is the index into the array that stores the data; 2. iterate over the array cell at this index to find a (key, value) pair with the correct key.
Usually the second step is reasonably fast because we choose a hash function that ensures very few collisions (different objects with the same hash). But it still has to check the key you are looking for against every stored key having the same hash. It is this step 2 that interning speeds up: string identity is tested before the expensive char-by-char test of string equality.
Step 1 is harder to accelerate, because you could store the hash along with the interned string... but you have to compute the hash to find the interned string itself.
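To illustrate the step-2 shortcut: interning guarantees that equal strings are the very same object, so the equality check can succeed on a pointer comparison alone:

```python
import sys

# two equal strings built at runtime, so they are not automatically interned
a = sys.intern('some dictionary key ' * 4)
b = sys.intern('some dictionary key ' * 4)
assert a == b   # equal contents
assert a is b   # and, once interned, the very same object
```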
This was theory! If you really need to improve performance, first do some benchmarks.
Then think of the specificity of the domain. You are storing IPv4 addresses as keys. An IPv4 address is a number between 0 and 256^4 - 1. If you replace the human-friendly representation of an address with an integer, you'll get a faster hash (hashing small numbers in CPython is almost costless: https://github.com/python/cpython/blob/master/Python/pyhash.c) and a faster lookup. The ipaddress module might be the best choice in your case.
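A sketch of that idea with the stdlib ipaddress module (available since Python 3.3):

```python
import ipaddress

# a small int key instead of a str key: faster to hash and to compare
key = int(ipaddress.IPv4Address('172.16.0.1'))
counts = {key: 1}
print(key)  # 2886729729
```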
If you are sure that addresses are between boundaries (e.g. 172.16.0.0 – 172.31.255.255) you can try to use an array instead of a dict. It should be faster unless your array is huge (disk swap).
Finally, if this is not fast enough, be ready to use a faster language.

why Python list() does not always have the same order?

I am using Python 3.5 and the documentation for it at
https://docs.python.org/3.5/library/stdtypes.html#sequence-types-list-tuple-range
says:
list([iterable])
(...)
The constructor builds a list whose items are the same and in the same order as iterable’s items.
OK, for the following script:
#!/usr/bin/python3
import random

def rand6():
    return random.randrange(63)

random.seed(0)
check_dict = {}
check_dict[rand6()] = 1
check_dict[rand6()] = 1
check_dict[rand6()] = 1
print(list(check_dict))
I always get
[24, 48, 54]
But, if I change the function to:
def rand6():
    return bytes([random.randrange(63)])
then the order returned is not always the same:
>./foobar.py
[b'\x18', b'6', b'0']
>./foobar.py
[b'6', b'0', b'\x18']
Why?
Python dictionaries are implemented as hash tables. In most Python versions (more on this later), the order you get the keys when you iterate over a dictionary is the arbitrary order of the values in the table, which has only very little to do with the order in which they were added (when hash collisions occur, the order of insertions can matter a little bit). This order is implementation dependent. The Python language does not offer any guarantee about the order other than that it will remain the same for several iterations over a dictionary if no keys are added or removed in between.
For your dictionary with integer keys, the hash table doesn't do anything fancy. Integers hash to themselves (except -1), so with the same numbers getting put in the dict, you get a consistent order in the hash table.
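You can check this in a CPython interpreter (hash(-1) is the one exception, because -1 is reserved as an error indicator in CPython's C API):

```python
# small ints hash to themselves in CPython
assert hash(42) == 42
assert hash(0) == 0
assert hash(-1) == -2   # -1 is reserved as an error code, so it maps to -2
```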
For the dictionary with bytes keys however, you're seeing different behavior due to Hash Randomization. To prevent a kind of dictionary collision attack (where a webapp implemented in Python could be DoSed by sending it data with thousands of keys that hash to the same value leading to lots of collisions and very bad (O(N**2)) performance), Python picks a random seed every time it starts up and uses it to randomize the hash function for Unicode and byte strings as well as datetime types.
You can disable the hash randomization by setting the environment variable PYTHONHASHSEED to 0 (or you can pick your own seed by setting it to any positive integer up to 2**32-1).
It's worth noting that this behavior has changed in Python 3.6. Hash randomization still happens, but a dictionary's iteration order is no longer based on the hash values of the keys. While the official language policy is still that the order is arbitrary, the implementation of dict in CPython now preserves the order that its values were added. You shouldn't rely upon this behavior when using regular dicts yet, as it's possible (though it appears unlikely at this point) that the developers will decide it was a mistake and change the implementation again. If you want to guarantee that iteration occurs in a specific order, use the collections.OrderedDict class instead of a normal dict.
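For the "specific order" case, collections.OrderedDict iterates in insertion order regardless of the hash seed, so the output is stable across runs:

```python
from collections import OrderedDict

d = OrderedDict()
for k in (b'6', b'0', b'\x18'):
    d[k] = 1

# insertion order, no matter what PYTHONHASHSEED is
print(list(d))
```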

python data type to track duplicates

I often keep track of duplicates with something like this:
processed = set()
for big_string in strings_generator:
    if big_string not in processed:
        processed.add(big_string)
        process(big_string)
I am dealing with massive amounts of data so don't want to maintain the processed set in memory. I have a version that uses sqlite to store the data on disk, but then this process runs much slower.
To cut down on memory use what do you think of using hashes like this:
processed = set()
for big_string in strings_generator:
    key = hash(big_string)
    if key not in processed:
        processed.add(key)
        process(big_string)
The drawback is I could lose data through occasional hash collisions.
1 collision in 1 billion hashes would not be a problem for my use.
I tried the md5 hash but found generating the hashes became a bottleneck.
What would you suggest instead?
I'm going to assume you are hashing web pages. You have to hash at most 55 billion web pages (and that measure almost certainly overlooks some overlap).
You are willing to accept a less than one in a billion chance of collision, which means that if we look at a hash function whose number of collisions is close to what we would get if the hash were truly random[^1], we want a hash range of size (55*10^9)*10^9. That is log2((55*10^9)*10^9) ≈ 66 bits.
[^1]: since the hash can be considered to be chosen at random for this purpose, p(collision) = (occupied range)/(total range).
Since there is a speed issue, but no real cryptographic concern, we can use a non-cryptographic hash of more than 66 bits with the nice collision-distribution property outlined above.
It looks like we are looking for the 128-bit version of the Murmur3 hash. People have been reporting speed increases upwards of 12x comparing Murmur3_128 to MD5 on a 64-bit machine. You can use this library to do your speed tests. See also this related answer, which:
shows speed test results in the range of python's str_hash, whose speed you have already deemed acceptable elsewhere – though python's hash is a 32-bit hash, leaving you only 2^32/10^9 (that is, only 4) values stored with a less than one in a billion chance of collision.
spawned a library of python bindings that you should be able to use directly.
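If the Murmur3 bindings turn out to be inconvenient, here is one stdlib way to get a fixed-width hash above the 66-bit requirement. blake2b is a cryptographic hash, so it is slower than Murmur3, but the sizing logic is exactly the same:

```python
import hashlib

def hash72(data: bytes) -> int:
    # 9-byte (72-bit) digest: comfortably above the 66 bits derived above
    return int.from_bytes(hashlib.blake2b(data, digest_size=9).digest(), 'big')

h = hash72(b'http://example.com/some/page')
assert h < 2**72
```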
Finally, I hope to have outlined the reasoning that could allow you to compare with other functions of varied size should you feel the need for it (e.g. if you up your collision tolerance, if the size of your indexed set is smaller than the whole Internet, etc, ...).
You have to decide which is more important: space or time.
If time, then you need to create unique representations of your large_item which take as little space as possible (probably some str value) that is easy (i.e. quick) to calculate and will not have collisions, and store them in a set.
If space, find the quickest disk-backed solution you can and store the smallest possible unique value that will identify a large_item.
So either way, you want small unique identifiers -- depending on the nature of large_item this may be a big win, or not possible.
Update
they are strings of html content
Perhaps a hybrid solution then: Keep a set in memory of the normal Python hash, while also keeping the actual html content on disk, keyed by that hash; when you check to see if the current large_item is in the set and get a positive, double-check with the disk-backed solution to see if it's a real hit or not, then skip or process as appropriate. Something like this:
import dbf
on_disk = dbf.Table('/tmp/processed_items', 'hash N(17,0); value M')
index = on_disk.create_index(lambda rec: rec.hash)
fast_check = set()

def slow_check(hashed, item):
    matches = on_disk.search((hashed,))
    for record in matches:
        if item == record.value:
            return True
    return False

for large_item in many_items:
    hashed = hash(large_item)  # only calculate once
    if hashed not in fast_check or not slow_check(hashed, large_item):
        on_disk.append((hashed, large_item))
        fast_check.add(hashed)
        process(large_item)
FYI: dbf is a module I wrote which you can find on PyPI
If many_items already resides in memory, you are not creating another copy of the large_item; you are just storing a reference to it in the processed set.
If many_items is a file or some other generator, you'll have to look at other alternatives.
E.g., if many_items is a file, perhaps you can store a pointer to the item's offset in the file instead of the actual item.
As you have already seen, there are a few options, but unfortunately none of them can fully address the situation, partly because:
Memory constraints: the entire object set won't fit in memory.
There is no perfect hash function, and for a huge data set the chance of collision is real.
Better hash functions (md5) are slower.
Use of a database like sqlite would actually make things slower.
Use of database like sqlite would actually make things slower
As I read this following excerpt
I have a version that uses sqlite to store the data on disk, but then this process runs much slower.
I feel that if you work on this, it might help you marginally. Here is how it should go:
Use tmpfs to create a ramdisk. tmpfs has several advantages over other implementations because it supports swapping less-used pages out to swap space.
Store the sqlite database on the ramdisk.
Change the size of the ramdisk and profile your code to check your performance.
I suppose you already have working code to save your data in sqlite. You only need to define a tmpfs and use its path to store your database.
Caveat: this is a Linux-only solution.
A Bloom filter? http://en.wikipedia.org/wiki/Bloom_filter
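A minimal sketch of the idea follows. The parameters (m bits, k hash functions) are illustrative; a real deployment would size them from the expected item count and target false-positive rate, and salted SHA-1 here is just a convenient stand-in for k independent hash functions:

```python
import hashlib

class Bloom(object):
    def __init__(self, m_bits=2**20, k=4):
        self.m, self.k = m_bits, k
        self.bits = 0          # one big int as the bit array

    def _indexes(self, item):
        # derive k indexes by salting a single hash function
        for i in range(self.k):
            digest = hashlib.sha1(b'%d:' % i + item).digest()
            yield int.from_bytes(digest[:8], 'big') % self.m

    def add(self, item):
        for i in self._indexes(item):
            self.bits |= 1 << i

    def __contains__(self, item):
        # may report false positives, never false negatives
        return all(self.bits >> i & 1 for i in self._indexes(item))
```

This keeps only ~m bits in memory no matter how many strings pass through, at the price of occasionally re-processing a string that was never actually seen.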
well you can always decorate large_item with a processed flag. Or something similar.
You can try the str type's __hash__ function.
In [1]: hash('http://stackoverflow.com')
Out[1]: -5768830964305142685
It's definitely not a cryptographic hash function, but with a little chance you won't have too much collision. It works as described here: http://effbot.org/zone/python-hash.htm.
I suggest you profile standard Python hash functions and choose the fastest: they are all "safe" against collisions enough for your application.
Here are some benchmarks for hash, md5 and sha1:
In [37]: very_long_string = 'x' * 1000000
In [39]: %timeit hash(very_long_string)
10000000 loops, best of 3: 86 ns per loop
In [40]: from hashlib import md5, sha1
In [42]: %timeit md5(very_long_string).hexdigest()
100 loops, best of 3: 2.01 ms per loop
In [43]: %timeit sha1(very_long_string).hexdigest()
100 loops, best of 3: 2.54 ms per loop
md5 and sha1 are comparable in speed. hash is about 20,000 times faster for this string, and its timing does not seem to depend much on the size of the string itself (note that CPython caches a str's hash after the first computation, which is part of why repeated timings look size-independent).
How does your sqlite version work? If you insert all your strings into a database table and then run the query "select distinct big_string from table_name", the database should optimize it for you.
Another option for you would be to use Hadoop.
Another option could be to split the strings into partitions such that each partition is small enough to fit in memory. Then you only need to check for duplicates within each partition. The formula you use to decide the partition will choose the same partition for each duplicate. The easiest way is to just look at the first few characters of the string, e.g.:
from collections import defaultdict

d = defaultdict(int)
for big_string in strings_generator:
    d[big_string[:4]] += 1
print(d)
Now you can decide on your partitions, go through the generator again, and write each big_string to a file that has the start of the big_string in the filename. Then you can just use your original method on each file and loop through all the files.
This can be achieved much more easily by performing simpler checks first, then investigating those cases with more elaborate checks. The example below contains extracts of your code, but it performs the checks on much smaller sets of data. It does this by first matching on a simple case that is cheap to check, and if you find that (filesize, checksum) pairs are not discriminating enough, you can easily swap in a different cheap but more rigorous check.
# Need to define the following functions
def GetFileSize(filename):
    pass

def GenerateChecksum(filename):
    pass

def LoadBigString(filename):
    pass

# Returns a list of duplicates.
def GetDuplicates(filename_list):
    duplicates = list()
    # Maps a quick check to the list of filenames that share it.
    filename_lists_by_quick_checks = dict()
    for filename in filename_list:
        quickcheck = GetQuickCheck(filename)
        if not filename_lists_by_quick_checks.has_key(quickcheck):
            filename_lists_by_quick_checks[quickcheck] = list()
        filename_lists_by_quick_checks[quickcheck].append(filename)
    for quickcheck, filenames in filename_lists_by_quick_checks.iteritems():
        big_strings = GetBigStrings(filenames)
        duplicates.extend(GetBigstringDuplicates(big_strings))
    return duplicates

def GetBigstringDuplicates(strings_generator):
    processed = set()
    duplicates = list()
    for big_string in strings_generator:
        if big_string not in processed:
            processed.add(big_string)
            process(big_string)
        else:
            duplicates.append(big_string)
    return duplicates

# Returns a tuple containing (filesize, checksum).
def GetQuickCheck(filename):
    return (GetFileSize(filename), GenerateChecksum(filename))

# Returns a list of big_strings from a list of filenames.
def GetBigStrings(file_list):
    big_strings = list()
    for filename in file_list:
        big_strings.append(LoadBigString(filename))
    return big_strings

Creating a unique key based on file content in python

I got many, many files to be uploaded to the server, and I just want a way to avoid duplicates.
Thus, generating a unique and small key value from a big string seemed something that a checksum was intended to do, and hashing seemed like the evolution of that.
So I was going to use hash md5 to do this. But then I read somewhere that "MD5 are not meant to be unique keys" and I thought that's really weird.
What's the right way of doing this?
edit: by the way, I took two sources to get to the following, which is how I'm currently doing it and it's working just fine, with Python 2.5:
import hashlib

def md5_from_file(fileName, block_size=2**14):
    md5 = hashlib.md5()
    f = open(fileName, 'rb')  # binary mode, so line-ending translation can't change the digest
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    f.close()
    return md5.hexdigest()
Sticking with MD5 is a good idea. Just to make sure, I'd add the file length or number of chunks to your file-hash table.
Yes, there is the possibility that you run into two files that have the same MD5 hash, but that's quite unlikely (if your files are decently sized). Thus, adding the number of chunks to your hash may help you reduce that, since now you have to find two files of the same size with the same MD5.
# This is the algorithm you described, but it also returns the number of chunks.
new_file_hash, nchunks = hash_for_file(new_file)
store_file(new_file, nchunks, new_file_hash)

def store_file(file, nchunks, hash):
    """Tells you whether there is another file with the same contents already,
    by making a table lookup."""
    # This can be a DB lookup or some other way to obtain your hash map
    big_table = ObtainTable()
    # A two-level lookup table might help performance;
    # that will vary with the number of entries and the nature of big_table.
    if nchunks in big_table:
        if hash in big_table[nchunks]:
            raise DuplicateFileException(
                'File is dup with %s' % big_table[nchunks][hash])
    else:
        big_table[nchunks] = {}
    big_table[nchunks].update({
        hash: file.filename
    })
    file.save()  # or something
To reduce that possibility, switch to SHA1 and use the same method. Or even use both (concatenated) if performance is not an issue.
Of course, keep in mind that this will only work with duplicate files at binary level, not images, sounds, video that are "the same" but have different signatures.
The issue with hashing is that it generates a "small" identifier from a "large" dataset. It's like lossy compression: while you can't guarantee uniqueness, you can use it to substantially limit the number of other items you need to compare against.
Consider that MD5 yields a 128-bit value. If your input data set has 129 bits and you actually use them all, each MD5 value will appear on average twice. For longer datasets (e.g. "all text files of exactly 1024 printable characters") you're still going to run into collisions once you get enough inputs. Contrary to what another answer said, it is a mathematical certainty that you will get collisions.
See http://en.wikipedia.org/wiki/Birthday_Paradox
Granted, you have around a 1% chance of collisions with a 128 bit hash at 2.6*10^18 entries, but it's better to handle the case that you do get collisions than to hope that you never will.
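The 1% figure drops straight out of the birthday-bound approximation p ≈ 1 - exp(-n(n-1)/2N), with N = 2^128 possible digests:

```python
import math

def collision_probability(n, bits):
    # birthday-bound approximation for n items hashed into 2**bits slots
    return 1 - math.exp(-n * (n - 1) / (2 * 2**bits))

print(collision_probability(2.6e18, 128))   # roughly 0.01, i.e. ~1%
```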
The issue with MD5 is that it's broken. For most common uses there's little problem and people still use both MD5 and SHA1, but I think that if you need a hashing function then you need a strong hashing function. To the best of my knowledge there is still no standard substitute for either of these. There are a number of algorithms that are "supposed" to be strong, but we have most experience with SHA1 and MD5. That is, we (think) we know when these two break, whereas we don't really know as much when the newer algorithms break.
Bottom line: think about the risks. If you wish to walk the extra mile then you might add extra checks when you find a hash duplicate, for the price of the performance penalty.

What is a hashtable/dictionary implementation for Python that doesn't store the keys?

I'm storing millions, possibly billions of 4 byte values in a hashtable and I don't want to store any of the keys. I expect that only the hashes of the keys and the values will have to be stored. This has to be fast and all kept in RAM. The entries would still be looked up with the key, unlike set()'s.
What is an implementation of this for Python? Is there a name for this?
Yes, collisions are allowed and can be ignored.
(I can make an exception for collisions, the key can be stored for those. Alternatively, collisions can just overwrite the previously stored value.)
Bloomier filters - space-efficient associative array
From Wikipedia:
Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with each element that had been inserted, implementing an associative array. Like Bloom filters, these structures achieve a small space overhead by accepting a small probability of false positives. In the case of "Bloomier filters", a false positive is defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that is in the map.
How about using an ordinary dictionary and instead of doing:
d[x]=y
use:
d[hash(x)]=y
To look up:
d[hash(foo)]
Of course, if there is a hash collision, you may get the wrong value back.
It's the good old space vs. runtime tradeoff: you can have constant time with linear space usage for the keys in a hashtable, or you can store the key implicitly and use log n time with a binary tree. The (binary) hash of a value gives you the path in the tree where it will be stored.
Build your own b-tree in RAM.
Memory use:
(4 bytes) comparison hash value
(4 bytes) index of next leaf if hash <= comparison OR if negative index of value
(4 bytes) index of next leaf if hash > comparison OR if negative index of value
12 bytes per b-tree node for the b-tree. More overhead for the values (see below).
How would you structure this in Python? Aren't there "native arrays" of 32-bit integers supported with almost no extra memory overhead...? What are they called... anyway, those.
Separate ordered array of subarrays each containing one or more values. The "indexes of value" above are indexes into this big array, allowing retrieval of all values matching the hash.
This assumes a 32-bit hash. You will need more bytes per b-tree node if you have more than 2^31-1 entries or a larger hash.
BUT, a possible spanner in the works: note that if you are not storing the key values, you will not be able to verify that a looked-up hash value corresponds only to your key, unless through some algorithmic or organisational mechanism you have guaranteed that no two keys will have the same hash. Quite a serious issue here. Have you considered it? :)
Although python dictionaries are very efficient, I think that if you're going to store billions of items, you may want to create your own C extension with data structures, optimized for the way you are actually using it (sequential access? completely random? etc).
In order to create a C extension, you may want to use SWIG, or something like Pyrex (which I've never used).
A hash table has to store keys, unless you provide a hash function that gives absolutely no collisions, which is nearly impossible.
If your keys are string-like, however, there is a very space-efficient data structure: the directed acyclic word graph (DAWG). I don't know of any Python implementation, though.
It's not what you asked for, but why not consider Tokyo Cabinet or BerkeleyDB for this job? It won't be in memory, but you are trading performance for greater storage capacity. You could still keep your list in memory and use the database only to check existence.
Would you please tell us more about the keys? I'm wondering if there is any regularity in the keys that we could exploit.
If the keys are strings in a small alphabet (example: strings of digits, like phone numbers) you could use a trie data structure:
http://en.wikipedia.org/wiki/Trie
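A dict-of-dicts trie is only a few lines; the '$' sentinel marking end-of-key is an arbitrary choice here (it just must not appear in the key alphabet):

```python
def trie_insert(root, key, value):
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    node['$'] = value   # sentinel: marks that a complete key ends here

def trie_lookup(root, key):
    node = root
    for ch in key:
        node = node[ch]   # raises KeyError if the key is absent
    return node['$']

phone_index = {}
trie_insert(phone_index, '5551234', 'Alice')
print(trie_lookup(phone_index, '5551234'))  # Alice
```

Shared prefixes are stored once, which is where the space savings for small alphabets come from.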
If you're actually storing millions of unique values, why not use a dictionary?
Store: d[hash(key)//32] |= 2**(hash(key)%32)
Check: d[hash(key)//32] & 2**(hash(key)%32)
If you have billions of entries, use a numpy array of size (2**32)/32 instead. (Because, after all, there are only 4 billion possible hash values to cover, anyway.)
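A runnable version of that bit-trick. It uses zlib.crc32 as a stand-in for a stable 32-bit hash (the built-in hash() is both randomized for str/bytes and wider than 32 bits), and note it only records membership, not the 4-byte values the question asks about:

```python
import zlib

bitmap = {}   # word index -> 32-bit word of membership bits

def _h(key):
    # stable, deterministic 32-bit hash
    return zlib.crc32(key) & 0xFFFFFFFF

def store(key):
    h = _h(key)
    bitmap[h // 32] = bitmap.get(h // 32, 0) | (1 << (h % 32))

def check(key):
    h = _h(key)
    return bool(bitmap.get(h // 32, 0) & (1 << (h % 32)))

store(b'spam')
assert check(b'spam') and not check(b'eggs')
```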
Why not a dictionary + hashlib?
>>> import hashlib
>>> hashtable = {}
>>> def myHash(obj):
...     return hashlib.sha224(obj).hexdigest()
>>> hashtable[myHash("foo")] = 'bar'
>>> hashtable
{'0808f64e60d58979fcb676c96ec938270dea42445aeefcd3a4e6f8db': 'bar'}
