I'm using Python 2.6. I have very little knowledge of hash functions.
I want to use a CRC hash function to hash an IP address like '128.0.0.5' into the range [0, H). Currently I'm thinking of doing
zlib.crc32('128.0.0.5') % H.
Is this okay? There are a few questions you could try to answer:
Does it make any difference if I hash '128.0.0.5', its binary form '0001110101010...' (whatever that is), or the digits without the '.'s?
zlib.crc32 returns a signed integer. Does taking a negative number modulo (%) a positive H always give a positive number?
Does taking % H affect how good the hash function is? (I mean, is that the best I could do for the available space with the available zlib.crc32?)
Thanks!
Why do you want to hash an IP address into a number? They already have a native integer representation. For example, using netaddr:
>>> import netaddr
>>> ip = netaddr.IPAddress('192.168.1.1')
>>> ip.value
3232235777
>>> netaddr.IPAddress(3232235777)
IPAddress('192.168.1.1')
Does it make any difference if I hash '128.0.0.5', its binary form '0001110101010...' (whatever that is), or the digits without the '.'s?
Not really.
zlib.crc32 returns a signed integer. Does taking a negative number modulo (%) a positive H always give a positive number?
Yes.
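For example (a quick interactive check; H here is just an arbitrary example value):
>>> (-7) % 5          # Python's % follows the sign of the divisor
3
>>> import zlib
>>> H = 1024
>>> 0 <= zlib.crc32('128.0.0.5') % H < H
True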
Does taking % H affect how good the hash function is? (I mean, is that the best I could do for the available space with the available zlib.crc32?)
You'd better use all the bits of the checksum to make up for CRC's lack of an "avalanche effect". Single-digit variations such as 192.168.1.1, 192.168.1.2, etc. might produce differences only in the high bits of the checksum, and since % keeps only the low bits, those hashes would collide.
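For example, one way to mix all 32 bits before reducing modulo H is to fold the high half of the checksum into the low half first (a minimal sketch for Python 2, where zlib.crc32 accepts a str; ip_bucket is just an illustrative name):
import zlib

def ip_bucket(ip, H):
    # Force an unsigned 32-bit checksum, then XOR the high 16 bits
    # into the low 16 bits so both halves influence the final bucket.
    c = zlib.crc32(ip) & 0xffffffff
    c ^= c >> 16
    return c % H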
ad 1) It will yield different results, but does not affect the quality of the hash.
ad 2) It will always yield a positive number or zero.
ad 3) As you limit the number of possible buckets, it does affect the quality of the hash.
In general: about how large is your H? Remember that an IPv4 address is nothing more than a 32-bit value; 192.168.0.1 is just a more human-readable, byte-wise representation. So if your H is larger than 4294967295, there is no need for hashing at all.
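For reference, converting the dotted-quad string to that 32-bit value needs only the standard library (a sketch using socket and struct):
>>> import socket, struct
>>> struct.unpack("!I", socket.inet_aton('192.168.0.1'))[0]
3232235521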
Related
I have a set of ASCII strings; let's say they are file paths. They could be both short and quite long.
I'm looking for an algorithm that computes a hash of such a string, where the hash is also a string but has a fixed length, like YouTube video ids:
https://www.youtube.com/watch?v=-F-3E8pyjFo
^^^^^^^^^^^
MD5 seems to be what I need, but it is critical for me to have short hash strings.
Is there a shell command or python library which can do that?
Note that as of Python 3.3 this method is randomized between runs unless you set PYTHONHASHSEED (see the caveats below).
Python has a built-in hash() function that's very fast and perfect for most uses:
>>> hash("dfds")
3591916071403198536
You can then make it unsigned:
>>> import ctypes
>>> hashu=lambda word: ctypes.c_uint64(hash(word)).value
You can then turn it into a 16-character hex string (8 bytes):
>>> hashu("dfds").to_bytes(8,"big").hex()
Or an N*2-character hex string, where N <= 8:
>>> hashn=lambda word, N : (hashu(word)%(2**(N*8))).to_bytes(N,"big").hex()
..etc. And if you want N to be larger than 8 bytes, you can just hash twice. Python's built-in is so vastly faster, it's never worth using hashlib for anything unless you need security... not just collision resistance.
>>> hashnbig=lambda word, N : ((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex()
And finally, use the urlsafe base64 encoding to make a much better string than "hex" gives you:
>>> from base64 import urlsafe_b64encode
>>> hashnbigu=lambda word, N : urlsafe_b64encode(((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo",16)
'ZblnvrRqHwAy2lnvrR4HrA'
Caveats:
Be warned that in Python 3.3 and up, this function is
randomized and won't work for some use cases. You can disable this with PYTHONHASHSEED=0
See https://github.com/flier/pyfasthash for fast, stable hashes that
similarly won't overload your CPU for non-cryptographic applications.
Don't use this lambda style in real code... write it out! And
stuffing literals like 2**32 into your code, instead of naming them as
constants, is bad form.
In the end, 8 bytes of collision resistance is OK for smaller
applications... with less than a million entries, you've got
collision odds of < 0.0000001%. That's a 12-byte b64-encoded
string. But it might not be enough for larger apps.
16 bytes is enough for a UUID/OID in a cache, etc.
Speed comparison for producing 300k 16-byte hashes from a bytes input:
builtin: 0.188
md5: 0.359
fnvhash_c: 0.113
For a complex input (tuple of 3 integers, for example), you have to convert to bytes to use the non-builtin hashes, this adds a lot of conversion overhead, making the builtin shine.
builtin: 0.197
md5: 0.603
fnvhash_c: 0.284
I guess this question is off-topic because it's opinion-based, but here's at least one hint for you: I know the FNV hash because it is used by The Sims 3 to find resources based on their names across the different content packages. They use the 64-bit version, so I guess it is enough to avoid collisions in a relatively large set of reference strings. The hash is easy to implement if no module satisfies you (pyfasthash has an implementation of it, for example).
To get a short string out of it, I would suggest you use base64 encoding. For example, this is the size of a base64-encoded 64-bit hash: nsTYVQUag88= (and you can get rid of the padding =).
Edit: I eventually had the same problem as you, so I implemented the above idea: https://gist.github.com/Cilyan/9424144
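If you want to sketch the idea without extra dependencies, a plain-Python FNV-1a (64-bit) plus URL-safe base64 could look roughly like this (the constants are the standard FNV-1a offset basis and prime; short_id is a hypothetical helper name):
import base64
import struct

FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3

def fnv1a_64(data):
    # FNV-1a over a byte string, masked to 64 bits.
    h = FNV64_OFFSET
    for byte in bytearray(data):
        h ^= byte
        h = (h * FNV64_PRIME) & 0xffffffffffffffff
    return h

def short_id(path):
    # 8-byte digest -> 11-character URL-safe id (padding stripped).
    digest = struct.pack(">Q", fnv1a_64(path.encode("ascii")))
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")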
Another option: hashids is designed to solve exactly this problem and has been ported to many languages, including Python. It's not really a hash in the sense of MD5 or SHA1, which are one-way; hashids "hashes" are reversible.
You are responsible for seeding the library with a secret value and selecting a minimum hash length.
Once that is done, the library can do two-way mapping between integers (single integers, like a simple primary key, or lists of integers, to support things like composite keys and sharding) and strings of the configured length (or slightly more). The alphabet used for generating "hashes" is fully configurable.
I have provided more details in this other answer.
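A hedged usage sketch, assuming a recent version of the hashids package (older releases named the methods encrypt/decrypt instead of encode/decode):
from hashids import Hashids

hashids = Hashids(salt="this is my secret", min_length=8)

hashid = hashids.encode(42)        # a short alphanumeric string, at least 8 chars
original = hashids.decode(hashid)  # -> (42,)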
You could use the sum program (assuming you're on Linux), but keep in mind that the shorter the hash, the more collisions you might have. You can always truncate MD5/SHA hashes as well.
EDIT: Here's a list of hash functions: List of hash functions
Something to keep in mind is that hash codes are one-way functions - you cannot use them for "video ids" because you cannot go back from the hash to the original path. Quite apart from anything else, hash collisions are quite likely, and you could end up with two different videos sharing the same id instead of getting different ones.
To create an Id like the youtube one the easiest way is to create a unique id however you normally do that (for example an auto key column in a database) and then map that to a unique string in a reversible way.
For example you could take an integer id and map it to 0-9a-z in base 36...or even 0-9a-zA-Z in base 62, padding the generated string out to the desired length if the id on its own does not give enough characters.
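A minimal sketch of that base-62 mapping (the alphabet order and the to_base62 name are arbitrary choices here):
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n, min_length=6):
    # Map a non-negative integer id to a base-62 string, left-padded with '0'.
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    encoded = "".join(reversed(digits)) or "0"
    return encoded.rjust(min_length, "0")

# to_base62(123456789) == '08m0Kx'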
In Python I'm using the Crypto package to generate a 256-bit random number. The code for doing so is:
import Crypto.Random.random as rand
key = rand.getrandbits(256)
This gives something like:
112699108505435943726051051450940377552177626778909564691673845134467691053980
Now my question is: how do I transform this number into a string of ASCII characters? Is there a built-in function for doing so, or do I need to convert it to binary, split it up into blocks of eight bits, and do it myself?
Thanks in advance.
I don't know if it's built in, but I doubt it.
But doing it yourself is not as easy as reinterpreting your data as bytes. This is because ASCII is a 7-bit encoding scheme: the most significant bit of each byte is always zero.
The easiest way to do this is to convert your int to a packed array of bytes (to_bytes)[1] and then discard a bit from each byte, e.g. by right shifting. This wastes 1/8 of your entropy but makes for a cleaner program.
This only works because you're using a cryptographically secure source of random number generation. This means that each bit has an equal probability of being a one or zero - that is, each bit is uniformly distributed and independent of all others. [2]
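Putting that together, a Python 3 sketch (int.to_bytes needs Python 3; the right shift discards the top bit of each byte so every character lands in the 7-bit ASCII range):
import Crypto.Random.random as rand

key = rand.getrandbits(256)

# 256 bits -> 32 bytes; shifting each byte right by one keeps it in 0..127,
# at the cost of one bit of entropy per byte. Some characters will be
# non-printable ASCII (control characters, DEL).
raw = key.to_bytes(32, "big")
ascii_key = "".join(chr(b >> 1) for b in raw)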
[1]http://docs.python.org/3/library/stdtypes.html#additional-methods-on-integer-types
[2]https://crypto.stackexchange.com/questions/10300/recasting-randomly-generated-numbers-to-other-widths
How do I represent minimum and maximum values for integers in Python? In Java, we have Integer.MIN_VALUE and Integer.MAX_VALUE.
See also: What is the maximum float in Python?.
Python 3
In Python 3, this question doesn't apply. The plain int type is unbounded.
However, you might actually be looking for information about the current interpreter's word size, which will be the same as the machine's word size in most cases. That information is still available in Python 3 as sys.maxsize, which is the maximum value representable by a signed word. Equivalently, it's the size of the largest possible list or in-memory sequence.
Generally, the maximum value representable by an unsigned word will be sys.maxsize * 2 + 1, and the number of bits in a word will be math.log2(sys.maxsize * 2 + 2). See this answer for more information.
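On a typical 64-bit build, that works out as follows:
>>> import sys, math
>>> sys.maxsize                      # largest value of a signed word
9223372036854775807
>>> sys.maxsize * 2 + 1              # largest value of an unsigned word
18446744073709551615
>>> math.log2(sys.maxsize * 2 + 2)   # bits per word
64.0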
Python 2
In Python 2, the maximum value for plain int values is available as sys.maxint:
>>> sys.maxint # on my system, 2**63-1
9223372036854775807
You can calculate the minimum value with -sys.maxint - 1 as shown in the docs.
Python seamlessly switches from plain to long integers once you exceed this value. So most of the time, you won't need to know it.
If you just need a number that's bigger than all others, you can use
float('inf')
in similar fashion, a number smaller than all others:
float('-inf')
This works in both python 2 and 3.
The sys.maxint constant has been removed from Python 3.0 onward, instead use sys.maxsize.
Integers
PEP 237: Essentially, long renamed to int. That is, there is only one built-in integral type, named int; but it behaves mostly like the old long type.
...
The sys.maxint constant was removed, since there is no longer a limit to the value of integers. However, sys.maxsize can be used as an integer larger than any practical list or string index. It conforms to the implementation’s “natural” integer size and is typically the same as sys.maxint in previous releases on the same platform (assuming the same build options).
For Python 3, it is
import sys
max_size = sys.maxsize
min_size = -sys.maxsize - 1
In Python, integers will automatically switch from a fixed-size int representation into a variable-width long representation once you pass the value sys.maxint, which is either 2**31 - 1 or 2**63 - 1 depending on your platform. Notice the L that gets appended here:
>>> 9223372036854775807
9223372036854775807
>>> 9223372036854775808
9223372036854775808L
From the Python manual:
Numbers are created by numeric literals or as the result of built-in functions and operators. Unadorned integer literals (including binary, hex, and octal numbers) yield plain integers unless the value they denote is too large to be represented as a plain integer, in which case they yield a long integer. Integer literals with an 'L' or 'l' suffix yield long integers ('L' is preferred because 1l looks too much like eleven!).
Python tries very hard to pretend its integers are mathematical integers and are unbounded. It can, for instance, calculate a googol with ease:
>>> 10**100
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000L
You may use 'inf' like this:
import math
bool_true = 0 < math.inf
bool_false = 0 < -math.inf
Refer: math — Mathematical functions
If you want the max for array or list indices (equivalent to size_t in C/C++), you can use numpy:
np.iinfo(np.intp).max
This is the same as sys.maxsize; the advantage is that you don't need to import sys just for this.
If you want max for native int on the machine:
np.iinfo(np.intc).max
You can look at other available types in doc.
For floats you can also use sys.float_info.max.
sys.maxsize is not actually the maximum supported integer value. You can double maxsize and multiply it by itself, and it stays a valid and correct value.
However, if you try sys.maxsize ** sys.maxsize, it will hang your machine for a significant amount of time. As many have pointed out, the byte and bit size does not seem to be relevant because it practically doesn't exist. I guess Python just happily expands its integers when it needs more memory space. So in general there is no limit.
Now, if you're talking about packing or storing integers in a safe way where they can later be retrieved with integrity, then of course that is relevant. I'm really not sure about packing, but I know Python's pickle module handles those things well. String representations obviously have no practical limit.
So really, the bottom line is: what is your application's limit? What does it require for numeric data? Use that limit instead of Python's fairly nonexistent integer limit.
I rely heavily on commands like this.
python -c 'import sys; print(sys.maxsize)'
Max int returned: 9223372036854775807
For more references on sys, see:
https://docs.python.org/3/library/sys.html
https://docs.python.org/3/library/sys.html#sys.maxsize
The code given below will help you.
For the maximum value you can use sys.maxsize, and for the minimum you can negate that same value and use it.
import sys
ni=sys.maxsize
print(ni)
Suppose a string is a number in a number system where each character (it can be a printable char, DEL, or any other ASCII character) has a corresponding value according to the ASCII table. How can you convert an arbitrary string of this kind to a number in Python?
An example
#car = 35*128**3+99*128**2+97*128**1+114*128**0=75034866
Try this:
total = 0
for c in "#car":
    total <<= 7
    total += ord(c)
print total
Result:
75034866
To get back the original string:
result = []
while total:
    result.append(chr(total % 128))
    total >>= 7
print ''.join(reversed(result))
Result:
#car
For arbitrarily long numbers there's nothing special to do: Python 2.x promotes int to long automatically, and Python 3.x's int is unbounded.
No idea about 'encryption'.
You're looking to find a mapping to obfuscate the data.
Real encryption requires the use of a trap door function - a function that is computationally easy to compute one way, but the inverse of which is difficult to compute without specific information.
Most encryption is based on prime factorization. Alice multiplies two large prime numbers together and gives the result to Bob. Bob uses this large composite number to encrypt his data using an encryption function. Finding the inverse of Bob's encryption function requires knowing the two original prime numbers (encryption does not). Finding these numbers is a very computationally expensive task, so the encrypted data is 'safe'.
Implementing this correctly is VERY difficult. If you want to keep data safe, find a library that does it for you.
EDIT: I should specify that what I described was public key encryption. Private key encryption works a bit differently. The important thing is that there's a mathematical basis for thinking that encrypted data will be hard to decrypt without a key of some sort.
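For example, a high-level recipe such as Fernet from the third-party cryptography package takes care of the key handling and authenticated encryption for you (a sketch, assuming that package is installed):
from cryptography.fernet import Fernet

key = Fernet.generate_key()     # keep this secret; whoever has it can decrypt
f = Fernet(key)

token = f.encrypt(b"75034866")  # opaque, URL-safe ciphertext
plain = f.decrypt(token)        # -> b"75034866"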
What is the shortest hash (in filename-usable form, like a hexdigest) available in python? My application wants to save cache files for some objects. The objects must have unique repr() so they are used to 'seed' the filename. I want to produce a possibly unique filename for each object (not that many). They should not collide, but if they do my app will simply lack cache for that object (and will have to reindex that object's data, a minor cost for the application).
So, if there is one collision we lose one cache file, but the collected savings of caching all the other objects still makes application startup much faster, so it does not matter much.
Right now I'm actually using abs(hash(repr(obj))); that's right, the string hash! Haven't found any collisions yet, but I would like to have a better hash function. hashlib.md5 is available in the python library, but the hexdigest is really long if put in a filename. Alternatives, with reasonable collision resistance?
Edit:
Use case is like this:
The data loader gets a new instance of a data-carrying object. Unique objects have unique reprs, so if a cache file for hash(repr(obj)) exists, I unpickle that cache file and replace obj with the unpickled object. If there was a collision and the cache was a false match, I notice. So if we don't have a cache or have a false match, I instead init obj (reloading its data).
Conclusions (?)
The str hash in python may be good enough, I was only worried about its collision resistance. But if I can hash 2**16 objects with it, it's going to be more than good enough.
I found out how to take a hex hash (from any hash source) and store it compactly with base64:
# Python 2; 'h' is a string of hex digits
import base64
raw = "".join(chr(int(h[i:i+2], 16)) for i in xrange(0, len(h), 2))
hashstr = base64.urlsafe_b64encode(raw).rstrip("=")
The birthday paradox applies: given a good hash function, the expected number of hashes before a collision occurs is about sqrt(N), where N is the number of different values that the hash function can take. (The wikipedia entry I've pointed to gives the exact formula). So, for example, if you want to use no more than 32 bits, your collision worries are serious for around 64K objects (i.e., 2**16 objects -- the square root of the 2**32 different values your hash function can take). How many objects do you expect to have, as an order of magnitude?
Since you mention that a collision is a minor annoyance, I recommend you aim for a number of possible hash values that's roughly the square of the number of objects you'll have, or a bit less but not MUCH less than that.
You want to make a filename - is that on a case-sensitive filesystem, as typical on Unix, or do you have to cater for case-insensitive systems too? This matters because you aim for short filenames, but the number of bits per character you can use to represent your hash as a filename changes dramatically on case-sensitive vs. insensitive systems.
On a case-sensitive system, you can use the standard library's base64 module (I recommend the "urlsafe" version of the encoding, i.e. base64.urlsafe_b64encode, as avoiding '/' characters that could be present in plain base64 is important in Unix filenames). This gives you 6 usable bits per character, much better than the 4 bits/char in hex.
Even on a case-insensitive system, you can still do better than hex -- use base64.b32encode and get 5 bits per character.
These functions take and return strings; use the struct module to turn numbers into strings if your chosen hash function generates numbers.
If you do have a few tens of thousands of objects I think you'll be fine with builtin hash (32 bits, so 6-7 characters depending on your chosen encoding). For a million objects you'd want 40 bits or so (7 or 8 characters) -- you can fold (xor, don't truncate;-) a sha256 down to a long with a reasonable number of bits, say 128 or so, and use the % operator to cut it further to your desired length before encoding.
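A Python 3 sketch of that fold-then-encode idea (short_cache_name is a hypothetical helper; the answer itself is from the Python 2 era, where you'd use the struct module instead of int.to_bytes):
import base64
import hashlib

def short_cache_name(obj, bits=40):
    # Fold the 256-bit sha256 of repr(obj) down to `bits` bits by XORing
    # successive chunks, then base64-encode the result for a short filename.
    digest = int(hashlib.sha256(repr(obj).encode("utf8")).hexdigest(), 16)
    folded = 0
    while digest:
        folded ^= digest % (2 ** bits)
        digest //= 2 ** bits
    packed = folded.to_bytes((bits + 7) // 8, "big")
    return base64.urlsafe_b64encode(packed).decode("ascii").rstrip("=")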
The builtin hash function of strings is fairly collision free, and also fairly short. It has 2**32 values, so it is fairly unlikely that you encounter collisions (if you use its abs value, it will have only 2**31 values).
You have been asking for the shortest hash function. That would certainly be
def hash(s):
    return 0
but I guess you didn't really mean it that way...
You can make any hash you like shorter by simply truncating it. md5 is always 32 hex digits, but an arbitrary substring of it (or any other hash) has the proper qualities of a hash: equal values produce equal hashes, and the values are spread around a bunch.
I'm sure that there's a CRC32 implementation in Python, but that may be too short (8 hex digits). On the upside, it's very quick.
Found it, binascii.crc32
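A minimal sketch of using it for a filename (the & 0xffffffff mask forces an unsigned value; crc_name is just an illustrative name, and on Python 3 the argument would need to be bytes):
import binascii

def crc_name(s):
    # 8 hex digits from the CRC32 of the string.
    return '%08x' % (binascii.crc32(s) & 0xffffffff)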
If you do have a collision, how are you going to tell that it actually happened?
If I were you, I would use hashlib to sha1() the repr(), and then just get a limited substring of it (first 16 characters, for example).
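That could look like this (a Python 2 sketch; on Python 3 you would need repr(obj).encode() first):
import hashlib

def cache_name(obj):
    # First 16 hex characters (64 bits) of the sha1 of repr(obj).
    return hashlib.sha1(repr(obj)).hexdigest()[:16]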
Unless you are talking about huge numbers of these objects, I would suggest that you just use the full hash. Then the opportunity for collision is so, so, so, so small, that you will never live to see it happen (likely).
Also, if you are dealing with that many files, I'm guessing that your caching technique should be adjusted to accommodate it.
We use hashlib.sha1(...).hexdigest(), which produces even longer strings, for cache objects with good success. Nobody is actually looking at cache files anyway.
Considering your use case, if you don't have your heart set on using separate cache files and you are not too far down that development path, you might consider using the shelve module.
This will give you a persistent dictionary (stored in a single dbm file) in which you store your objects. Pickling/unpickling is performed transparently, and you don't have to concern yourself with hashing, collisions, file I/O, etc.
For the shelve dictionary keys, you would just use repr(obj) and let shelve deal with stashing your objects for you. A simple example:
import shelve
cache = shelve.open('cache')
t = (1,2,3)
i = 10
cache[repr(t)] = t
cache[repr(i)] = i
print cache
# {'(1, 2, 3)': (1, 2, 3), '10': 10}
cache.close()
cache = shelve.open('cache')
print cache
# {'(1, 2, 3)': (1, 2, 3), '10': 10}
print cache[repr(10)]
# 10
Short hashes mean you may get the same hash for two different files. The same can happen with big hashes too, but it's far rarer.
Maybe these file names should also vary based on some other reference, like microtime (unless the files may be created too quickly for that to be distinct).