This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
When is a python object's hash computed and why is the hash of -1 different?
Why do -1 and -2 both hash to the same number if Python?
Since they do, how does Python tell these two numbers apart?
>>> -1 is -2
False
>>> hash(-1) is hash(-2)
True
>>> hash(-1)
-2
>>> hash(-2)
-2
-1 is a reserved value at the C level of CPython which prevents hash functions from being able to produce a hash value of -1. As noted by DSM, the same is not true in IronPython and PyPy where hash(-1) != hash(-2).
See this Quora answer:
If you write a type in a C extension module and provide a tp_hash
method, you have to avoid -1 — if you return -1, Python will assume
you meant to throw an error.
If you write a class in pure Python and provide a __hash__ method,
there's no such requirement, thankfully. But that's because the C code
that invokes your __hash__ method does that for you — if your
__hash__ returns -1, then hash() applied to your object will actually return -2.
Which really just repackages the information from effbot:
The hash value -1 is reserved (it’s used to flag errors in the C
implementation). If the hash algorithm generates this value, we simply
use -2 instead.
You can also see this in the source. For example for Python 3’s int object, this is at the end of the hash implementation:
if (x == (Py_uhash_t)-1)
x = (Py_uhash_t)-2;
return (Py_hash_t)x;
Since they do, how does Python tell these two numbers apart?
Since all hash functions map a large input space to a smaller input space, collisions are always expected, no matter how good the hash function is. Think of hashing strings, for example. If hash codes are 32-bit integers, you have 2^32 (a little more than 4 billion) hash codes. If you consider all ASCII strings of length 6, you have (2^7)^6 (just under 4.4 trillion) different items in your input space. With only this set, you are guaranteed to have many, many collisions no matter how good you are. Add Unicode characters and strings of unlimited length to that!
Therefore, the hash code only hints at the location of an object, an equality test follows to test candidate keys. To implement a membership test in a hash-table set, the hash code gives you "bucket" number in which to search for the value. However, all set items with the same hash code are in the bucket. For this, you also need an equality test to distinguish between all candidates in the bucket.
This hash code and equality duality is hinted at in the CPython documentation on hashable objects. In other languages/frameworks, there is a guideline/rule that if you provide a custom hash code function, you must also provide a custom equality test (performed on the same fields as the hash code function).
Indeed, the Python release today address exactly this, with a security patch that addresses the efficiency issue when this (identical hash values, but on a massive scale) is used as a denial of service attack - http://mail.python.org/pipermail/python-list/2012-April/1290792.html
Related
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
When is a python object's hash computed and why is the hash of -1 different?
Why do -1 and -2 both hash to the same number if Python?
Since they do, how does Python tell these two numbers apart?
>>> -1 is -2
False
>>> hash(-1) is hash(-2)
True
>>> hash(-1)
-2
>>> hash(-2)
-2
-1 is a reserved value at the C level of CPython which prevents hash functions from being able to produce a hash value of -1. As noted by DSM, the same is not true in IronPython and PyPy where hash(-1) != hash(-2).
See this Quora answer:
If you write a type in a C extension module and provide a tp_hash
method, you have to avoid -1 — if you return -1, Python will assume
you meant to throw an error.
If you write a class in pure Python and provide a __hash__ method,
there's no such requirement, thankfully. But that's because the C code
that invokes your __hash__ method does that for you — if your
__hash__ returns -1, then hash() applied to your object will actually return -2.
Which really just repackages the information from effbot:
The hash value -1 is reserved (it’s used to flag errors in the C
implementation). If the hash algorithm generates this value, we simply
use -2 instead.
You can also see this in the source. For example for Python 3’s int object, this is at the end of the hash implementation:
if (x == (Py_uhash_t)-1)
x = (Py_uhash_t)-2;
return (Py_hash_t)x;
Since they do, how does Python tell these two numbers apart?
Since all hash functions map a large input space to a smaller input space, collisions are always expected, no matter how good the hash function is. Think of hashing strings, for example. If hash codes are 32-bit integers, you have 2^32 (a little more than 4 billion) hash codes. If you consider all ASCII strings of length 6, you have (2^7)^6 (just under 4.4 trillion) different items in your input space. With only this set, you are guaranteed to have many, many collisions no matter how good you are. Add Unicode characters and strings of unlimited length to that!
Therefore, the hash code only hints at the location of an object, an equality test follows to test candidate keys. To implement a membership test in a hash-table set, the hash code gives you "bucket" number in which to search for the value. However, all set items with the same hash code are in the bucket. For this, you also need an equality test to distinguish between all candidates in the bucket.
This hash code and equality duality is hinted at in the CPython documentation on hashable objects. In other languages/frameworks, there is a guideline/rule that if you provide a custom hash code function, you must also provide a custom equality test (performed on the same fields as the hash code function).
Indeed, the Python release today address exactly this, with a security patch that addresses the efficiency issue when this (identical hash values, but on a massive scale) is used as a denial of service attack - http://mail.python.org/pipermail/python-list/2012-April/1290792.html
I'm learning python (3.6) and I have discovered the following:
a = "hi"
b = "hi"
a == b #True
a is b #True
a = list(a)
b = list(b)
a = "".join(a)
b = "".join(b)
a == b #True
a is b #False
Why is the result different after conversion to list and joining back to string? I do understand that Python VM maintains a pool of strings and hence the reference is the same for a and b. But why does this not work after joining the list to the very same string?
Thanks!
The key lies here:
a = "".join(a)
b = "".join(b)
The string.join() method returns a new string, built by joining the element of a list.
Each call to string.join() instanciates a new string: in the first call a string is created and its reference is assigned to a, then, in the second call, a new string gets built and its reference is assigned to b. Because of this, the two names a and b are references to two new and distinct strings, which themselves are two separate objects.
The is operator behaves as designed, returning false as a and b are not references to the same object.
If you're trying to see if the two string are equal in content, then the operator == is likely a better choice.
You shouldn't really compare anything except singletons (like None, True or False) with is. Because is doesn't really compare the content, it just checks if it's the same object. So is will fail if you compare different objects with the same content.
The fact that your first a is b worked is because literals are interned (*). So a and b are the same object because both are literals with the same content. But that's an implementation and it could yield different results in future (or older) Python versions, so don't start comparing string literals with is on the basis that it works right now.
(*) It really should return False because the way you've written the cases they shouldn't be the same object. They just happen to be the same one because CPython optimizes some cases.
There are lots of ways to answer this, but here you can think about memory. The physical bits in your RAM that make up the data. In python, the keyword "is" checks to see if the address of two objects matches exactly. The operator "==" checks to see if the value of the objects are the same by running the comparison defined in the magic method - the python code responsible for turning operators into functions - this has to be defined for every class. The interesting part arises from the fact that they are originally identical, this question should help you with that.
when does Python allocate new memory for identical strings?.
Essentially python can optimise the "hi" strings because you've typed them before running your code it makes a table of all typed strings to save on memory. When the string object is built from a list, python doesn't know what the contents will be. By default, in your particular version of the interpreter this means a new string object is created - this saves time checking for duplicates, but requires more memory. If you want to force the interpreter to check for duplicates then use the "sys.intern" function: https://docs.python.org/3/library/sys.html
sys.intern(string):
Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.
Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.
I have a set of ASCII strings, let's say they are file paths. They could be both short and quite long.
I'm looking for an algorithm that could calculate hash of such a strings and this hash will be also a string, but will have a fixed length, like youtube video ids:
https://www.youtube.com/watch?v=-F-3E8pyjFo
^^^^^^^^^^^
MD5 seems to be what I need, but it is critical for me to have a short hash strings.
Is there a shell command or python library which can do that?
As of Python 3 this method does not work:
Python has a built-in hash() function that's very fast and perfect for most uses:
>>> hash("dfds")
3591916071403198536
You can then make it unsigned:
>>> hashu=lambda word: ctypes.c_uint64(hash(word)).value
You can then turn it into a 16 byte hex string:
>>> hashu("dfds").to_bytes(8,"big").hex()
Or an N*2 byte string, where N is <= 8:
>>> hashn=lambda word, N : (hashu(word)%(2**(N*8))).to_bytes(N,"big").hex()
..etc. And if you want N to be larger than 8 bytes, you can just hash twice. Python's built-in is so vastly faster, it's never worth using hashlib for anything unless you need security... not just collision resistance.
>>> hashnbig=lambda word, N : ((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex()
And finally, use the urlsafe base64 encoding to make a much better string than "hex" gives you
>>> hashnbigu=lambda word, N : urlsafe_b64encode(((hashu(word)+2**64*hash(word+"2"))%(2**(N*8))).to_bytes(N,"big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo",16)
'ZblnvrRqHwAy2lnvrR4HrA'
Caveats:
Be warned that in Python 3.3 and up, this function is
randomized and won't work for some use cases. You can disable this with PYTHONHASHSEED=0
See https://github.com/flier/pyfasthash for fast, stable hashes that
that similarly won't overload your CPU for non-cryptographic applications.
Don't use this lambda style in real code... write it out! And
stuffing things like 2**32 in your code, instead of making them
constants is bad form.
In the end 8 bytes of collision resistance is OK for a smaller
applications.... with less than a million entries, you've got
collision odds of < 0.0000001%. That's a 12 byte b64 encoded
string. But it might not be enough for larger apps.
16 bytes is enough for a UUID/OID in a cache, etc.
Speed comparison for producing 300k 16 byte hashes from a bytes-input.
builtin: 0.188
md5: 0.359
fnvhash_c: 0.113
For a complex input (tuple of 3 integers, for example), you have to convert to bytes to use the non-builtin hashes, this adds a lot of conversion overhead, making the builtin shine.
builtin: 0.197
md5: 0.603
fnvhash_c: 0.284
I guess this question is off-topic, because opinion based, but at least one hint for you, I know the FNV hash because it is used by The Sims 3 to find resources based on their names between the different content packages. They use the 64 bits version, so I guess it is enough to avoid collisions in a relatively large set of reference strings. The hash is easy to implement, if no module satisfies you (pyfasthash has an implementation of it for example).
To get a short string out of it, I would suggest you use base64 encoding. For example, this is the size of a base64-encoded 64 bits hash: nsTYVQUag88= (and you can get rid or the padding =).
Edit: I had finally the same problem as you, so I implemented the above idea: https://gist.github.com/Cilyan/9424144
Another option: hashids is designed to solve exactly this problem and has been ported to many languages, including Python. It's not really a hash in the sense of MD5 or SHA1, which are one-way; hashids "hashes" are reversable.
You are responsible for seeding the library with a secret value and selecting a minimum hash length.
Once that is done, the library can do two-way mapping between integers (single integers, like a simple primary key, or lists of integers, to support things like composite keys and sharding) and strings of the configured length (or slightly more). The alphabet used for generating "hashes" is fully configurable.
I have provided more details in this other answer.
You could use the sum program (assuming you're on linux) but keep in mind that the shorter the hash the more collisions you might have. You can always truncate MD5/SHA hashes as well.
EDIT: Here's a list of hash functions: List of hash functions
Something to keep in mind is that hash codes are one way functions - you cannot use them for "video ids" as you cannot go back from the hash to the original path. Quite apart from anything else hash collisions are quite likely and you end up with two hashes both pointing to the same video instead of different ones.
To create an Id like the youtube one the easiest way is to create a unique id however you normally do that (for example an auto key column in a database) and then map that to a unique string in a reversible way.
For example you could take an integer id and map it to 0-9a-z in base 36...or even 0-9a-zA-Z in base 62, padding the generated string out to the desired length if the id on its own does not give enough characters.
I'm writing a mapping class which persists to the disk. I am currently allowing only str keys but it would be nice if I could use a couple more types: hopefully up to anything that is hashable (ie. same requirements as the builtin dict), but more reasonable I would accept string, unicode, int, and tuples of these types.
To that end I would like to derive a deterministic serialization scheme.
Option 1 - Pickling the key
The first thought I had was to use the pickle (or cPickle) module to serialize the key, but I noticed that the output from pickle and cPickle do not match each other:
>>> import pickle
>>> import cPickle
>>> def dumps(x):
... print repr(pickle.dumps(x))
... print repr(cPickle.dumps(x))
...
>>> dumps(1)
'I1\n.'
'I1\n.'
>>> dumps('hello')
"S'hello'\np0\n."
"S'hello'\np1\n."
>>> dumps((1, 2, 'hello'))
"(I1\nI2\nS'hello'\np0\ntp1\n."
"(I1\nI2\nS'hello'\np1\ntp2\n."
Is there any implementation/protocol combination of pickle which is deterministic for some set of types (e.g. can only use cPickle with protocol 0)?
Option 2 - Repr and ast.literal_eval
Another option is to use repr to dump and ast.literal_eval to load. I have written a function to determine if a given key would survive this process (it is rather conservative on the types it allows):
def is_reprable_key(key):
return type(key) in (int, str, unicode) or (type(key) == tuple and all(
is_reprable_key(x) for x in key))
The question for this method is if repr itself is deterministic for the types that I have allowed here. I believe this would not survive the 2/3 version barrier due to the change in str/unicode literals. This also would not work for integers where 2**32 - 1 < x < 2**64 jumping between 32 and 64 bit platforms. Are there any other conditions (ie. do strings serialize differently under different conditions in the same interpreter)? Edit: I'm just trying to understand the conditions that this breaks down, not necessarily overcome them.
Option 3: Custom repr
Another option which is likely overkill is to write my own repr which flattens out the things of repr which I know (or suspect may be) a problem. I just wrote an example here: http://gist.github.com/423945
(If this all fails miserably then I can store the hash of the key along with the pickle of both the key and value, then iterate across rows that have a matching hash looking for one that unpickles to the expected key, but that really does complicate a few other things and I would rather not do it. Edit: it turns out that the builtin hash is not deterministic across platforms either. Scratch that.)
Any insights?
Important note: repr() is not deterministic if a dictionary or set type is embedded in the object you are trying to serialize. The keys could be printed in any order.
For example print repr({'a':1, 'b':2}) might print out as {'a':1, 'b':2} or {'b':2, 'a':1}, depending on how Python decides to manage the keys in the dictionary.
After reading through much of the source (of CPython 2.6.5) for the implementation of repr for the basic types I have concluded (with reasonable confidence) that repr of these types is, in fact, deterministic. But, frankly, this was expected.
I believe that the repr method is susceptible to nearly all of the same cases under which the marshal method would break down (longs > 2**32 can never be an int on a 32bit machine, not guaranteed to not change between versions or interpreters, etc.).
My solution for the time being has been to use the repr method and write a comprehensive test suite to make sure that repr returns the same values on the various platforms I am using.
In the long run the custom repr function would flatten out all platform/implementation differences, but is certainly overkill for the project at hand. I may do this in the future, however.
"Any value which is an acceptable key for a builtin dict" is not feasible: such values include arbitrary instances of classes that don't define __hash__ or comparisons, implicitly using their id for hashing and comparison purposes, and the ids won't be the same even across runs of the very same program (unless those runs are strictly identical in all respects, which is very tricky to arrange -- identical inputs, identical starting times, absolutely identical environment, etc, etc).
For strings, unicodes, ints, and tuples whose items are all of these kinds (including nested tuples), the marshal module could help (within a single version of Python: marshaling code can and does change across versions). E.g.:
>>> marshal.dumps(23)
'i\x17\x00\x00\x00'
>>> marshal.dumps('23')
't\x02\x00\x00\x0023'
>>> marshal.dumps(u'23')
'u\x02\x00\x00\x0023'
>>> marshal.dumps((23,))
'(\x01\x00\x00\x00i\x17\x00\x00\x00'
This is Python 2; Python 3 would be similar (except that all the representation of these byte strings would have a leading b, but that's a cosmetic issue, and of course u'23' becomes invalid syntax and '23' becomes a Unicode string). You can see the general idea: a leading byte represents the type, such as u for Unicode strings, i for integers, ( for tuples; then for containers comes (as a little-endian integer) the number of items followed by the items themselves, and integers are serialized into a little-endian form. marshal is designed to be portable across platforms (for a given version; not across versions) because it's used as the underlying serializations in compiled bytecode files (.pyc or .pyo).
You mention a few requirements in the paragraph, and I think you might want to be a little more clear on these. So far I gather:
You're building an SQLite backend to basically a dictionary.
You want to allow the keys to be more than basestring type (which types).
You want it to survive the Python 2 -> Python 3 barrier.
You want to support large integers above 2**32 as the key.
Ability to store infinite values (because you don't want hash collisions).
So, are you trying to build a general 'this can do it all' solution, or just trying to solve an immediate problem to continue on within a current project? You should spend a little more time to come up with a clear set of requirements.
Using a hash seemed like the best solution to me, but then you complain that you're going to have multiple rows with the same hash implying you're going to be storing enough values to even worry about the hash.
What is the shortest hash (in filename-usable form, like a hexdigest) available in python? My application wants to save cache files for some objects. The objects must have unique repr() so they are used to 'seed' the filename. I want to produce a possibly unique filename for each object (not that many). They should not collide, but if they do my app will simply lack cache for that object (and will have to reindex that object's data, a minor cost for the application).
So, if there is one collision we lose one cache file, but it is the collected savings of caching all objects makes the application startup much faster, so it does not matter much.
Right now I'm actually using abs(hash(repr(obj))); that's right, the string hash! Haven't found any collisions yet, but I would like to have a better hash function. hashlib.md5 is available in the python library, but the hexdigest is really long if put in a filename. Alternatives, with reasonable collision resistance?
Edit:
Use case is like this:
The data loader gets a new instance of a data-carrying object. Unique types have unique repr. so if a cache file for hash(repr(obj)) exists, I unpickle that cache file and replace obj with the unpickled object. If there was a collision and the cache was a false match I notice. So if we don't have cache or have a false match, I instead init obj (reloading its data).
Conclusions (?)
The str hash in python may be good enough, I was only worried about its collision resistance. But if I can hash 2**16 objects with it, it's going to be more than good enough.
I found out how to take a hex hash (from any hash source) and store it compactly with base64:
# 'h' is a string of hex digits
bytes = "".join(chr(int(h[i:i+2], 16)) for i in xrange(0, len(h), 2))
hashstr = base64.urlsafe_b64encode(bytes).rstrip("=")
The birthday paradox applies: given a good hash function, the expected number of hashes before a collision occurs is about sqrt(N), where N is the number of different values that the hash function can take. (The wikipedia entry I've pointed to gives the exact formula). So, for example, if you want to use no more than 32 bits, your collision worries are serious for around 64K objects (i.e., 2**16 objects -- the square root of the 2**32 different values your hash function can take). How many objects do you expect to have, as an order of magnitude?
Since you mention that a collision is a minor annoyance, I recommend you aim for a hash length that's roughly the square of the number of objects you'll have, or a bit less but not MUCH less than that.
You want to make a filename - is that on a case-sensitive filesystem, as typical on Unix, or do you have to cater for case-insensitive systems too? This matters because you aim for short filenames, but the number of bits per character you can use to represent your hash as a filename changes dramatically on case-sensive vs insensitive systems.
On a case-sensitive system, you can use the standard library's base64 module (I recommend the "urlsafe" version of the encoding, i.e. this function, as avoiding '/' characters that could be present in plain base64 is important in Unix filenames). This gives you 6 usable bits per character, much better than the 4 bits/char in hex.
Even on a case-insensitive system, you can still do better than hex -- use base64.b32encode and get 5 bits per character.
These functions take and return strings; use the struct module to turn numbers into strings if your chosen hash function generates numbers.
If you do have a few tens of thousands of objects I think you'll be fine with builtin hash (32 bits, so 6-7 characters depending on your chosen encoding). For a million objects you'd want 40 bits or so (7 or 8 characters) -- you can fold (xor, don't truncate;-) a sha256 down to a long with a reasonable number of bits, say 128 or so, and use the % operator to cut it further to your desired length before encoding.
The builtin hash function of strings is fairly collision free, and also fairly short. It has 2**32 values, so it is fairly unlikely that you encounter collisions (if you use its abs value, it will have only 2**31 values).
You have been asking for the shortest hash function. That would certainly be
def hash(s):
return 0
but I guess you didn't really mean it that way...
You can make any hash you like shorter by simply truncating it. md5 is always 32 hex digits, but an arbitrary substring of it (or any other hash) has the proper qualities of a hash: equal values produce equal hashes, and the values are spread around a bunch.
I'm sure that there's a CRC32 implementation in Python, but that may be too short (8 hex digits). On the upside, it's very quick.
Found it, binascii.crc32
If you do have a collision, how are you going to tell that it actually happened?
If I were you, I would use hashlib to sha1() the repr(), and then just get a limited substring of it (first 16 characters, for example).
Unless you are talking about huge numbers of these objects, I would suggest that you just use the full hash. Then the opportunity for collision is so, so, so, so small, that you will never live to see it happen (likely).
Also, if you are dealing with that many files, I'm guessing that your caching technique should be adjusted to accommodate it.
We use hashlib.sha1.hexdigest(), which produces even longer strings, for cache objects with good success. Nobody is actually looking at cache files anyway.
Condsidering your use case, if you don't have your heart set on using separate cache files and you are not too far down that development path, you might consider using the shelve module.
This will give you a persistent dictionary (stored in a single dbm file) in which you store your objects. Pickling/unpickling is performed transparently, and you don't have to concern yourself with hashing, collisions, file I/O, etc.
For the shelve dictionary keys, you would just use repr(obj) and let shelve deal with stashing your objects for you. A simple example:
import shelve
cache = shelve.open('cache')
t = (1,2,3)
i = 10
cache[repr(t)] = t
cache[repr(i)] = i
print cache
# {'(1, 2, 3)': (1, 2, 3), '10': 10}
cache.close()
cache = shelve.open('cache')
print cache
#>>> {'(1, 2, 3)': (1, 2, 3), '10': 10}
print cache[repr(10)]
#>>> 10
Short hashes mean you may have same hash for two different files. Same may happen for big hashes too, but its way more rare.
Maybe these file names should vary based on other references, like microtime (unless these files may be created too quickly).