I've written an encrypting program in Python; one of its options is an MD5 encryption. When I run a known string through my MD5 encryption, I receive a different hash value than if I run the EXACT same string through an MD5 encryption website or CryptoFox for Firefox.
eg. my programs hash output - fe9c25d61e56054ea87703e30c672d91 - plaintext: g4m3
eg. online hash / cryptofox - 26e4477a0fa9cb24675379331dba9c84 - plaintext: g4m3
EXACT same word, 2 different hash values.
Now here's my code snippet:
import md5  # legacy Python 2 module

word = "g4m3"
string = md5.new(word).hexdigest()
print string
You included a newline in your MD5 input string:
>>> import md5
>>> word="g4m3"
>>> md5.new(word).hexdigest() # no newline
'26e4477a0fa9cb24675379331dba9c84'
>>> md5.new(word + '\n').hexdigest() # with a newline
'fe9c25d61e56054ea87703e30c672d91'
When reading data from a file, make sure you remove the newline character at the end of each line. You can use .rstrip('\n') to remove only line-separation characters from the end of the line, or use .strip() to remove all whitespace from the start and end of the line:
>>> word = 'g4m3\n'
>>> md5.new(word).hexdigest()
'fe9c25d61e56054ea87703e30c672d91'
>>> word = word.strip()
>>> md5.new(word).hexdigest()
'26e4477a0fa9cb24675379331dba9c84'
As to your question: Hashing is very sensitive. Even a single character of difference can result in a radically different output string. It may be the case that the online implementation is appending a whitespace char, or more likely, a newline. This extra character will change the output of the algorithm. (It's also possible the opposite is happening: you are appending a newline and the online one is not)
As to MD5 "encryption":
MD5 is NOT encryption. It is hashing. Encryption takes in data and a key and spits out data that looks random but is recoverable with the key. Hashing, on the other hand, takes in data and spits out a fixed amount of data that REPRESENTS the original data. The original data, however, unless stored elsewhere, is lost.
More information for reference:
Another interesting difference is the data the various types of algorithms spit out. Encryption can take in any amount of data (within the scope of the OS/software, of course) and will output data approximately equal in size to the input. Hashing, however, will not. Since it is a mere representation of the data, it has a limited output size. This can pose problems. For instance, if you had an infinite amount of data, eventually two entirely different pieces of data would have the same hash. For this reason, when using hashing to compare two values, it is usually a good idea to compare two separate hashes as well. The statistical probability of two separate pieces of data having TWO EQUAL HASHES is astronomically low.
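As a rough illustration of that last point (a minimal sketch of my own, not from the original answer), you can compare data with two independent digests and require both to match:

import hashlib

def digests_match(data_a: bytes, data_b: bytes) -> bool:
    """Compare two byte strings using two independent hashes (MD5 and SHA-256)."""
    md5_equal = hashlib.md5(data_a).hexdigest() == hashlib.md5(data_b).hexdigest()
    sha_equal = hashlib.sha256(data_a).hexdigest() == hashlib.sha256(data_b).hexdigest()
    # The odds of both digests colliding for different inputs are astronomically low.
    return md5_equal and sha_equal

print(digests_match(b"g4m3", b"g4m3"))    # True
print(digests_match(b"g4m3", b"g4m3\n"))  # False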
Of course, then you get into hashing algorithms that utilize encryption methods at their core, but I won't go into that here.
Related
I would like to compute the hash of the contents (sequence of bits) of a file (whose length could be any number of bits, and so not necessarily a multiple of the trendy eight) and send that file to a friend along with the hash-value. My friend should be able to compute the same hash from the file contents. I want to use Python 3 to compute the hash, but my friend can't use Python 3 (because I'll wait till next year to send the file and by then Python 3 will be out of style, and he'll want to be using Python++ or whatever). All I can guarantee is that my friend will know how to compute the hash, in a mathematical sense---he might have to write his own code to run on his implementation of the MIX machine (which he will know how to do).
What hash do I use, and, more importantly, what do I take the hash of? For example, do I hash the str returned from a read on the file opened for reading as text? Do I hash some bytes-like object returned from a binary read? What if the file has weird end-of-line markers? Do I pad the tail end first so that the thing I am hashing is an appropriate size?
import hashlib
FILENAME = "filename"
# Now, what?
I say "sequence of bits" because not all computers are based on the 8-bit byte, and saying "sequence of bytes" is therefore too ambiguous. For example, GreenArrays, Inc. has designed a supercomputer on a chip, where each computer has 18-bit (eighteen-bit) words (when these words are used for encoding native instructions, they are composed of three 5-bit "bytes" and one 3-bit byte each). I also understand that before the 1970's, a variety of byte-sizes were used. Although the 8-bit byte may be the most common choice, and may be optimal in some sense, the choice of 8 bits per byte is arbitrary.
See Also
Is python's hash() portable?
First of all, the hash() function in Python is not the same as cryptographic hash functions in general. Here are the differences:
hash()
A hash is a fixed-size integer that identifies a particular value. Each value needs to have its own hash, so for the same value you will get the same hash even if it's not the same object.
Note that the hash of a value only needs to be the same for one run of Python. In Python 3.3 and later they will in fact change for every new run of Python.
What does hash do in python?
Cryptographic hash functions
A cryptographic hash function (CHF) is a mathematical algorithm that maps data of an arbitrary size (often called the "message") to a bit array of a fixed size.
It is deterministic, meaning that the same message always results in the same hash.
https://en.wikipedia.org/wiki/Cryptographic_hash_function
Now let's come back to your question:
I would like to compute the hash of the contents (sequence of bits) of a file (whose length could be any number of bits, and so not necessarily a multiple of the trendy eight) and send that file to a friend along with the hash-value. My friend should be able to compute the same hash from the file contents.
What you're looking for is one of the cryptographic hash functions. Typically, MD5, SHA-1, or SHA-256 is used to calculate a file hash. You want to open the file in binary mode, feed the bytes into the hash, and finally take the digest encoded in hexadecimal form.
import hashlib

def calculateSHA256Hash(filePath):
    # Hash the file in binary mode, 2048 bytes at a time.
    h = hashlib.sha256()
    with open(filePath, "rb") as f:
        data = f.read(2048)
        while data != b"":
            h.update(data)
            data = f.read(2048)
    return h.hexdigest()

print(calculateSHA256Hash(filePath = 'stackoverflow_hash.py'))
The above code takes itself as input, hence it produces an SHA-256 hash of itself: 610e15155439c75f6b63cd084c6a235b42bb6a54950dcb8f2edab45d0280335e. This remains consistent as long as the code is not changed.
Another example would be to hash a txt file, test.txt with content Helloworld.
This is done by simply changing the last line of the code to use "test.txt":
print(calculateSHA256Hash(filePath = 'test.txt'))
This gives a SHA-256 hash of 5ab92ff2e9e8e609398a36733c057e4903ac6643c646fbd9ab12d0f6234c8daf.
I arrived at sha256hexdigestFromFile, an alternative to @Lincoln Yan's calculateSHA256Hash, after reviewing the standard for SHA-256.
This is also a response to my comment about 2048.
def sha256hexdigestFromFile(filePath, blocks = 1):
    '''Return as a str the SHA-256 message digest of the contents of
    the file at filePath.

    Reference: Introduction of NIST (2015) Secure Hash
    Standard (SHS), FIPS PUB 180-4. DOI:10.6028/NIST.FIPS.180-4
    '''
    assert isinstance(blocks, int) and 0 < blocks, \
        'The blocks argument must be an int greater than zero.'
    with open(filePath, 'rb') as MessageStream:
        from hashlib import sha256
        from functools import reduce

        def hashUpdated(Hash, MESSAGE_BLOCK):
            Hash.update(MESSAGE_BLOCK)
            return Hash

        def messageBlocks():
            'Return a generator over the blocks of the MessageStream.'
            WORD_SIZE, BLOCK_SIZE = 4, 512  # PER THE SHA-256 STANDARD
            BYTE_COUNT = WORD_SIZE * BLOCK_SIZE * blocks
            block = MessageStream.read(BYTE_COUNT)
            while block:  # keep reading until the stream is exhausted
                yield block
                block = MessageStream.read(BYTE_COUNT)

        return reduce(hashUpdated, messageBlocks(), sha256()).hexdigest()
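As a quick check (assuming a local file such as test.txt exists; this is not part of the original answer), both helpers should agree on the same file:

print(sha256hexdigestFromFile('test.txt'))
print(calculateSHA256Hash(filePath='test.txt'))  # should print the same hex digest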
This is my first test code:
import hashlib
md5Hash = hashlib.md5()
md5Hash.update('Coconuts')
print md5Hash.hexdigest()
md5Hash.update('Apples')
print md5Hash.hexdigest()
md5Hash.update('Oranges')
print md5Hash.hexdigest()
And this is my second chunk of code:
import hashlib
md5Hash = hashlib.md5()
md5Hash.update('Coconuts')
print md5Hash.hexdigest()
md5Hash.update('Bananas')
print md5Hash.hexdigest()
md5Hash.update('Oranges')
print md5Hash.hexdigest()
But the output for 1st code is:
0e8f7761bb8cd94c83e15ea7e720852a
217f2e2059306ab14286d8808f687abb
4ce7cfed2e8cb204baeba9c471d48f07
And for the second code is:
0e8f7761bb8cd94c83e15ea7e720852a
a82bf69bf25207f2846c015654ae68d1
47dba619e1f3eaa8e8a01ab93c79781e
I changed the second string from 'Apples' to 'Bananas' while the third string remained the same, but I still get a different result for the third string. Hashing is supposed to give the same result every time.
Am I missing something?
hashlib.md5.update() adds data to the hash. It doesn't replace the existing values; if you want to hash a new value, you need to initialize a new hashlib.md5 object.
The values you're hashing are:
"Coconuts" -> 0e8f7761bb8cd94c83e15ea7e720852a
"CoconutsApples" -> 217f2e2059306ab14286d8808f687abb
"CoconutsApplesOranges" -> 4ce7cfed2e8cb204baeba9c471d48f07
"Coconuts" -> 0e8f7761bb8cd94c83e15ea7e720852a
"CoconutsBananas" -> a82bf69bf25207f2846c015654ae68d1
"CoconutsBananasOranges" -> 47dba619e1f3eaa8e8a01ab93c79781e
Because you're using the update method, the md5Hash object is reused for the three strings. So it's basically the hash of the three strings concatenated together, which is why changing the second string changes the outcome of the third print as well.
You need to declare a separate md5 object for each string. Use a loop (Python 3-compliant code needs the bytes prefix, by the way, and it also works in Python 2):
import hashlib

for s in (b'Coconuts', b'Bananas', b'Oranges'):
    md5Hash = hashlib.md5(s)  # no need for update, pass data at construction
    print(md5Hash.hexdigest())
result:
0e8f7761bb8cd94c83e15ea7e720852a
1ee31b77d0697c36914b99d1428f7f32
62f2b77089fea4c595e895901b63c10b
Note that the values are now different, but at least each one is the MD5 of its own string, computed independently.
Expected result
What you are expecting is generally what you should expect from common cryptographic libraries: in most of them the hash object is reset after calling a method that finalizes the calculation, such as hexdigest. hashlib.md5 behaves differently.
Result by hashlib.md5
MD5 requires the input to be padded with a 1 bit, zero or more 0 bits and the length of the input in bits. Then the final hash value is calculated. hashlib.md5 internally seems to perform the final calculation using separate variables, keeping the state after hashing each string without this final padding.
So the result of your hashes is the concatenation of the earlier strings with the given string, followed by the correct padding, as duskwulf pointed out in his answer.
This is correctly documented by hashlib:
hash.digest()
Return the digest of the strings passed to the update() method so far. This is a string of digest_size bytes which may contain non-ASCII characters, including null bytes.
and
hash.hexdigest()
Like digest() except the digest is returned as a string of double length, containing only hexadecimal digits. This may be used to exchange the value safely in email or other non-binary environments.
Solution for hashlib.md5
As there doesn't seem to be a reset() method, you should create a new md5 object for each separate hash value you want to compute. Fortunately the hash objects themselves are relatively lightweight (even if the hashing itself isn't), so this won't consume much CPU or memory.
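hashlib objects do provide a copy() method, so one workaround (a minimal sketch of my own, not from the original answer) is to keep a pristine object and copy it for each message:

import hashlib

pristine = hashlib.md5()  # never updated; serves as a reusable starting state

for message in (b'Coconuts', b'Bananas', b'Oranges'):
    h = pristine.copy()   # independent state for each message
    h.update(message)
    print(h.hexdigest())  # same digests as hashlib.md5(message).hexdigest()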
Discussion of the differences
For hashing itself resetting the hash in the finalizer may not make all that much sense. But it does matter for signature generation: you might want to initialize the same signature instance and then generate multiple signatures with it. The hash function should reset so it can calculate the signature over multiple messages.
Sometimes an application requires an aggregated hash over multiple inputs, including intermediate hash results. In that case, however, a Merkle tree of hashes is used, where the intermediate hashes themselves are hashed again.
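As a toy illustration of that idea (two levels only; a real Merkle tree pairs nodes recursively, so treat this as a sketch):

import hashlib

def leaf_hash(chunk: bytes) -> bytes:
    return hashlib.sha256(chunk).digest()

def aggregated_hash(chunks) -> str:
    # Hash every chunk, then hash the concatenation of those intermediate hashes.
    intermediate = b"".join(leaf_hash(chunk) for chunk in chunks)
    return hashlib.sha256(intermediate).hexdigest()

print(aggregated_hash([b'Coconuts', b'Bananas', b'Oranges']))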
As indicated, I consider this bad API design by the authors of hashlib. For cryptographers it certainly doesn't follow the rule of least surprise.
So I have two very large files, each representing every position in the human genome. Both files are binary masks for a certain type of "score" at each position, and I am interested in getting a new mask where both scores are "1", i.e. the intersection of the two masks.
For example:
File 1: 00100010101
File 2: 11111110001
Desired output: 00100010001
In Python, it is really fast to read these big files (they contain between 50 and 250 million characters) into strings. However, I can't just & the strings together. I CAN do something like
bin(int('0001',2) & int('1111', 2))
but is there a more direct way that doesn't require that I pad in the extra 0's and convert back to a string in the end?
I think the conversion to builtin integer types for the binary-and operation is likely to make it much faster than working character by character (because Python's int is written in C rather than Python). I'd suggest working on each line of your input files, rather than the whole multi-million-character strings at once. The binary-and operation doesn't require any carrying, so there's no issue working with each line separately.
To avoid awkward string operations to pad the result out to the right length, you can use the str.format method to convert your integer to a binary string of the right length in one go. Here's an implementation that writes the output out to a new file:
import itertools

with open(filename1) as in1, open(filename2) as in2, open(filename3, "w") as out:
    for line1, line2 in itertools.izip(in1, in2):
        out.write("{0:0{1}b}\n".format(long(line1, 2) & long(line2, 2), len(line1) - 1))
I'm using one of the neat features of the string formatting mini-language to use a second argument to pass a desired length for the converted number. If you can rely upon the lines always having exactly 50 binary digits (including at the end of the files), you could hard code the length with {:050b} rather than computing it from the length of an input line.
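For reference, here is a Python 3 sketch of the same approach (assuming the same hypothetical filename1, filename2, filename3 variables; zip replaces itertools.izip and int replaces long):

with open(filename1) as in1, open(filename2) as in2, open(filename3, "w") as out:
    for line1, line2 in zip(in1, in2):
        # Parse each line as base 2, AND the integers, and reformat to the original width.
        out.write("{0:0{1}b}\n".format(int(line1, 2) & int(line2, 2), len(line1) - 1))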
I have a set of ASCII strings, let's say they are file paths. They could be both short and quite long.
I'm looking for an algorithm that can calculate a hash of such strings, where the hash is also a string but has a fixed length, like YouTube video ids:
https://www.youtube.com/watch?v=-F-3E8pyjFo
^^^^^^^^^^^
MD5 seems to be what I need, but it is critical for me to have short hash strings.
Is there a shell command or python library which can do that?
As of Python 3 this method does not work:
Python has a built-in hash() function that's very fast and perfect for most uses:
>>> hash("dfds")
3591916071403198536
You can then make it unsigned (this requires ctypes):
>>> import ctypes
>>> hashu=lambda word: ctypes.c_uint64(hash(word)).value
You can then turn it into a 16 byte hex string:
>>> hashu("dfds").to_bytes(8,"big").hex()
Or a hex string of N*2 characters (N bytes), where N <= 8:
>>> hashn=lambda word, N : (hashu(word)%(2**(N*8))).to_bytes(N,"big").hex()
..etc. And if you want N to be larger than 8 bytes, you can just hash twice. Python's built-in is so vastly faster, it's never worth using hashlib for anything unless you need security... not just collision resistance.
>>> hashnbig=lambda word, N : ((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex()
And finally, use the urlsafe base64 encoding to make a much better string than "hex" gives you (this requires base64.urlsafe_b64encode):
>>> from base64 import urlsafe_b64encode
>>> hashnbigu=lambda word, N : urlsafe_b64encode(((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo",16)
'ZblnvrRqHwAy2lnvrR4HrA'
Caveats:
Be warned that in Python 3.3 and up, this function is randomized and won't work for some use cases. You can disable this with PYTHONHASHSEED=0.
See https://github.com/flier/pyfasthash for fast, stable hashes that similarly won't overload your CPU for non-cryptographic applications.
Don't use this lambda style in real code... write it out (see the sketch after this list)! And stuffing things like 2**32 in your code, instead of making them constants, is bad form.
In the end, 8 bytes of collision resistance is OK for smaller applications... with less than a million entries, you've got collision odds of < 0.0000001%. That's a 12-byte b64-encoded string. But it might not be enough for larger apps.
16 bytes is enough for a UUID/OID in a cache, etc.
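A written-out (non-lambda) version of the same idea, as a rough sketch (the function name and parameters are mine, not from the original answer):

import ctypes
from base64 import urlsafe_b64encode

def short_hash(word: str, n_bytes: int = 8) -> str:
    """URL-safe short string from Python's built-in hash() (per-run unless PYTHONHASHSEED is fixed)."""
    unsigned = ctypes.c_uint64(hash(word)).value  # make the 64-bit hash unsigned
    if n_bytes > 8:
        # Widen by hashing a second, derived string and concatenating the two 64-bit values.
        unsigned += 2**64 * ctypes.c_uint64(hash(word + "2")).value
    value = unsigned % (2 ** (n_bytes * 8))
    return urlsafe_b64encode(value.to_bytes(n_bytes, "big")).decode("ascii").rstrip("=")

print(short_hash("foo", 16))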
Speed comparison for producing 300k 16 byte hashes from a bytes-input.
builtin: 0.188
md5: 0.359
fnvhash_c: 0.113
For a complex input (tuple of 3 integers, for example), you have to convert to bytes to use the non-builtin hashes, this adds a lot of conversion overhead, making the builtin shine.
builtin: 0.197
md5: 0.603
fnvhash_c: 0.284
I guess this question is off-topic because it is opinion-based, but at least one hint for you: I know the FNV hash because it is used by The Sims 3 to find resources based on their names across the different content packages. They use the 64-bit version, so I guess it is enough to avoid collisions in a relatively large set of reference strings. The hash is easy to implement, if no module satisfies you (pyfasthash has an implementation of it, for example).
To get a short string out of it, I would suggest you use base64 encoding. For example, this is the size of a base64-encoded 64-bit hash: nsTYVQUag88= (and you can get rid of the padding =).
Edit: I finally had the same problem as you, so I implemented the above idea: https://gist.github.com/Cilyan/9424144
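For reference, a minimal sketch of that idea (a 64-bit FNV-1a hash plus url-safe base64); this is illustrative only and not the linked gist's exact code:

from base64 import urlsafe_b64encode

FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a hash."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def short_id(path: str) -> str:
    digest = fnv1a_64(path.encode("ascii")).to_bytes(8, "big")
    return urlsafe_b64encode(digest).decode("ascii").rstrip("=")

print(short_id("/usr/local/bin/python3"))  # an 11-character string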
Another option: hashids is designed to solve exactly this problem and has been ported to many languages, including Python. It's not really a hash in the sense of MD5 or SHA-1, which are one-way; hashids "hashes" are reversible.
You are responsible for seeding the library with a secret value and selecting a minimum hash length.
Once that is done, the library can do two-way mapping between integers (single integers, like a simple primary key, or lists of integers, to support things like composite keys and sharding) and strings of the configured length (or slightly more). The alphabet used for generating "hashes" is fully configurable.
I have provided more details in this other answer.
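A rough usage sketch, assuming the hashids Python package and made-up salt/length values (check the library's documentation for the exact API of the version you install):

from hashids import Hashids  # pip install hashids

hashids = Hashids(salt="this is my secret", min_length=11)

encoded = hashids.encode(12345)    # a short, url-safe-looking string of at least 11 characters
decoded = hashids.decode(encoded)  # (12345,) -- the mapping is reversible
print(encoded, decoded)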
You could use the sum program (assuming you're on Linux), but keep in mind that the shorter the hash, the more collisions you might have. You can always truncate MD5/SHA hashes as well.
EDIT: Here's a list of hash functions: List of hash functions
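For instance, truncating an MD5 hex digest to 11 characters (a quick sketch with a made-up path; remember that truncation raises the collision risk accordingly):

import hashlib

path = "/home/user/videos/cats.mp4"
short = hashlib.md5(path.encode("utf-8")).hexdigest()[:11]
print(short)  # 11 hex characters, roughly YouTube-id length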
Something to keep in mind is that hash codes are one-way functions - you cannot use them for "video ids" as you cannot go back from the hash to the original path. Quite apart from anything else, hash collisions are quite likely, and you could end up with two different paths producing the same hash, i.e. one id pointing at two different videos instead of each video getting its own id.
To create an Id like the youtube one the easiest way is to create a unique id however you normally do that (for example an auto key column in a database) and then map that to a unique string in a reversible way.
For example you could take an integer id and map it to 0-9a-z in base 36...or even 0-9a-zA-Z in base 62, padding the generated string out to the desired length if the id on its own does not give enough characters.
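A minimal sketch of that mapping (the base-62 alphabet ordering and padding choices here are mine, purely illustrative):

import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 0-9a-zA-Z

def to_base62(n: int, min_length: int = 11) -> str:
    """Map a non-negative integer id to a base-62 string, left-padded to min_length."""
    digits = []
    while True:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
        if n == 0:
            break
    return "".join(reversed(digits)).rjust(min_length, ALPHABET[0])

print(to_base62(123456789))  # an 11-character id such as '0000008m0Kx'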
I'm learning about LZ77 compression, and I saw that when I find a repeated string of bytes, I can use a pointer of the form <distance, length>, and that the "<", ",", ">" bytes are reserved. So... how do I compress a file that contains these bytes, if I cannot compress these bytes but also cannot replace them with different bytes (because decoders wouldn't be able to read it)? Is there a way? Or do decoders only decode if there is an exact <d, l> string? (If so, imagine that by coincidence we find these bytes in a file. What would happen?)
Thanks!
LZ77 is about referencing strings back in the decompression buffer by their lengths and distances from the current position. But it is left to you how you encode these back-references; many implementations of LZ77 do it in different ways.
But you are right that there must be some way to distinguish "literals" (uncompressed pieces of data meant to be copied "as is" from the input to the output) from "back-references" (which are copied from already uncompressed portion).
One way to do it is reserving some characters as "special" (so called "escape sequences"). You can do it the way you did it, that is, by using < to mark the start of a back-reference. But then you also need a way to output < if it is a literal. You can do it, for example, by establishing that when after < there's another <, then it means a literal, and you just output one <. Or, you can establish that if after < there's immediately >, with nothing in between, then that's not a back-reference, so you just output <.
It also wouldn't be the most efficient way to encode those back-references, because it uses several bytes to encode a back-reference, so it will become efficient only for referencing strings longer than those several bytes. For shorter back-references it will inflate the data instead of compressing them, unless you establish that matches shorter than several bytes are being left as is, instead of generating back-references. But again, this means lower compression gains.
If you compress only plain old ASCII texts, you can employ a better encoding scheme, because ASCII uses just 7 out of 8 bits in a byte. So you can use the highest bit to signal a back-reference, and then use the remaining 7 bits as the length, and the very next byte (or two) as the back-reference's distance. This way you can always tell for sure whether the next byte is a literal ASCII character or a back-reference, by checking its highest bit. If it is 0, just output the character as is. If it is 1, use the following 7 bits as the length, and read the next 2 bytes to use as the distance. This way every back-reference takes 3 bytes, so you can efficiently compress text files with repeating sequences longer than 3 characters.
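A toy sketch of that byte-oriented scheme (my own illustration, not a standard format): the top bit of each control byte distinguishes literals from back-references.

def decompress(data: bytes) -> bytes:
    """Decode the toy scheme: 0xxxxxxx = a literal ASCII byte;
    1xxxxxxx = a back-reference with a 7-bit length, followed by a 2-byte distance."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:                  # high bit 0: a literal character
            out.append(byte)
            i += 1
        else:                            # high bit 1: a back-reference
            length = byte & 0x7F
            distance = int.from_bytes(data[i + 1:i + 3], "big")
            for _ in range(length):      # copy byte by byte (copies may overlap)
                out.append(out[-distance])
            i += 3
    return bytes(out)

# "abcabcabc": literals "abc", then a back-reference of length 6 at distance 3.
print(decompress(b"abc" + bytes([0x80 | 6, 0x00, 0x03])))  # b'abcabcabc'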
But there's a still better way to do this, which gives even more compression: you can replace your characters with bit codes of variable lengths, crafted in such a way that the characters appearing more often get the shortest codes and the rare ones get longer codes. To achieve that, these codes have to be so-called "prefix codes", so that no code is a prefix of some other code. When your codes have this property, you can always distinguish them by reading the bits in sequence until you decode one of them; you can then be sure that you won't get any other valid item by reading more bits, and the next bit always starts a new sequence. To produce such codes, you use Huffman trees. You can then join all your bytes and the different reference lengths into one such tree and generate distinct bit codes for them, depending on their frequency. When you decode, you just read bits until you reach the code of one of these elements, and then you know for sure whether it is the code of a literal character or the code for a back-reference's length. In the second case, you then read some additional bits for the distance of the back-reference (also encoded with a prefix code). This is what the DEFLATE compression scheme does. But that is a whole other story, and you will find the details in the RFC supplied by @MarkAdler.
If I understand your question correctly, it makes no sense: there are no "reserved bytes" for the uncompressed input of an LZ77 compressor. You simply need to encode literals and length/distance pairs unambiguously.