Python - hashing binary value

I want to use the SHA-1 algorithm to calculate the checksum of some data, but Python's hashlib takes its input as a string.
Is it possible to calculate SHA-1 in Python but give raw bytes as input?
I am asking because to hash a file in C I would use the OpenSSL library and just pass the raw bytes, whereas in Python I need to pass a string, so I am worried that hashing the same file would give different results in the two languages.

In Python 2.x, str objects can be arbitrary byte streams. So yes, you can just pass the data into the hashlib functions as strs.
>>> import hashlib
>>> "this is binary \0\1\2"
'this is binary \x00\x01\x02'
>>> hashlib.sha1("this is binary \0\1\2").hexdigest()
'17c27af39d476f662be60be7f25c8d3873041bb3'
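In Python 3 the same idea is expressed with bytes objects instead of str. The following is a minimal sketch, assuming Python 3; the helper name sha1_of_file and its path argument are hypothetical and only illustrate hashing a file's raw bytes, which are the same bytes a C program would pass to OpenSSL.

```python
import hashlib

# In Python 3, hashlib accepts bytes-like objects; a str must be encoded first.
raw = b"this is binary \0\1\2"
digest = hashlib.sha1(raw).hexdigest()

def sha1_of_file(path):
    """Hash a file's raw bytes (hypothetical helper for illustration)."""
    # "rb" disables text decoding and newline translation
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()
```

Because the file is opened in binary mode, the digest should depend only on the bytes on disk, not on the platform's text handling.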

Related

What hash (Python 3 hashlib) yields a portable hash of file contents?

I would like to compute the hash of the contents (sequence of bits) of a file (whose length could be any number of bits, and so not necessarily a multiple of the trendy eight) and send that file to a friend along with the hash-value. My friend should be able to compute the same hash from the file contents. I want to use Python 3 to compute the hash, but my friend can't use Python 3 (because I'll wait till next year to send the file and by then Python 3 will be out of style, and he'll want to be using Python++ or whatever). All I can guarantee is that my friend will know how to compute the hash, in a mathematical sense---he might have to write his own code to run on his implementation of the MIX machine (which he will know how to do).
What hash do I use, and, more importantly, what do I take the hash of? For example, do I hash the str returned from a read on the file opened for reading as text? Do I hash some bytes-like object returned from a binary read? What if the file has weird end-of-line markers? Do I pad the tail end first so that the thing I am hashing is an appropriate size?
import hashlib
FILENAME = "filename"
# Now, what?
I say "sequence of bits" because not all computers are based on the 8-bit byte, and saying "sequence of bytes" is therefore too ambiguous. For example, GreenArrays, Inc. has designed a supercomputer on a chip, where each computer has 18-bit (eighteen-bit) words (when these words are used for encoding native instructions, they are composed of three 5-bit "bytes" and one 3-bit byte each). I also understand that before the 1970's, a variety of byte-sizes were used. Although the 8-bit byte may be the most common choice, and may be optimal in some sense, the choice of 8 bits per byte is arbitrary.
See Also
Is python's hash() portable?
First of all, Python's hash() function is not the same as a cryptographic hash function. Here are the differences:
hash()
A hash is a fixed-size integer that identifies a particular value. Equal values must have the same hash, so for the same value you will get the same hash even if it's not the same object.
Note that the hash of a value only needs to be the same for one run of Python. Since Python 3.3, the hashes of str and bytes objects in fact change with every new run of Python, unless PYTHONHASHSEED is set.
What does hash do in python?
Cryptographic hash functions
A cryptographic hash function (CHF) is a mathematical algorithm that maps data of an arbitrary size (often called the "message") to a bit array of a fixed size
It is deterministic, meaning that the same message always results in the same hash.
https://en.wikipedia.org/wiki/Cryptographic_hash_function
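The contrast can be sketched in a few lines. This is only an illustration, assuming Python 3; the variable names are made up for the example.

```python
import hashlib

message = b"foo"

# A cryptographic digest depends only on the input bytes,
# so it is identical across runs, machines, and languages.
first = hashlib.sha256(message).hexdigest()
second = hashlib.sha256(message).hexdigest()

# The built-in hash() is only guaranteed to be stable within one interpreter
# run (str and bytes hashes are salted per process since Python 3.3).
in_process = hash(message) == hash(b"foo")
```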
Now let's come back to your question:
I would like to compute the hash of the contents (sequence of bits) of a file (whose length could be any number of bits, and so not necessarily a multiple of the trendy eight) and send that file to a friend along with the hash-value. My friend should be able to compute the same hash from the file contents.
What you're looking for is one of the cryptographic hash functions. Typically, MD5, SHA-1, or SHA-256 is used to calculate a file hash. You want to open the file in binary mode, hash the raw bytes, and finally encode the digest in hexadecimal form.
import hashlib
def calculateSHA256Hash(filePath):
    h = hashlib.sha256()
    with open(filePath, "rb") as f:
        data = f.read(2048)
        while data != b"":
            h.update(data)
            data = f.read(2048)
    return h.hexdigest()
print(calculateSHA256Hash(filePath = 'stackoverflow_hash.py'))
The above code takes itself as an input, hence it produced an SHA-256 hash for itself, being 610e15155439c75f6b63cd084c6a235b42bb6a54950dcb8f2edab45d0280335e. This remains consistent as long as the code is not changed.
Another example would be to hash a text file, test.txt, with content Helloworld.
This is done by simply changing the file name in the last line of the code to "test.txt":
print(calculateSHA256Hash(filePath = 'test.txt'))
This gives a SHA-256 hash of 5ab92ff2e9e8e609398a36733c057e4903ac6643c646fbd9ab12d0f6234c8daf.
I arrived at sha256hexdigestFromFile, an alternative to @Lincoln Yan's calculateSHA256Hash, after reviewing the standard for SHA-256.
This is also a response to my comment about 2048.
def sha256hexdigestFromFile(filePath, blocks = 1):
    '''Return as a str the SHA-256 message digest of contents of
    file at filePath.
    Reference: Introduction of NIST (2015) Secure Hash
    Standard (SHS), FIPS PUB 180-4. DOI:10.6028/NIST.FIPS.180-4
    '''
    assert isinstance(blocks, int) and 0 < blocks, \
        'The blocks argument must be an int greater than zero.'
    with open(filePath, 'rb') as MessageStream:
        from hashlib import sha256
        from functools import reduce
        def hashUpdated(Hash, MESSAGE_BLOCK):
            Hash.update(MESSAGE_BLOCK)
            return Hash
        def messageBlocks():
            'Return a generator over the blocks of the MessageStream.'
            WORD_SIZE, BLOCK_SIZE = 4, 512 # PER THE SHA-256 STANDARD
            # 2048 * blocks bytes per read, a multiple of the 64-byte SHA-256 block
            BYTE_COUNT = WORD_SIZE * BLOCK_SIZE * blocks
            # Keep reading until read() returns b'' (end of file); a single
            # yield would hash only the first BYTE_COUNT bytes.
            yield from iter(lambda: MessageStream.read(BYTE_COUNT), b'')
        return reduce(hashUpdated, messageBlocks(), sha256()).hexdigest()
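On the question of 2048: the read size should only affect speed, not the result, because update() calls are equivalent to hashing the concatenation of everything passed in. A small sketch that checks this, assuming Python 3; the function name, the test data, and the chunk sizes are all hypothetical choices for the illustration.

```python
import hashlib
import os
import tempfile

def sha256_hexdigest_chunked(path, chunk_size):
    """Hash a file by feeding it to SHA-256 in chunk_size-byte pieces."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical test data: 10,000 bytes, deliberately not a multiple of 64
data = bytes(i % 251 for i in range(10_000))

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

# Same file, different read sizes
digests = {size: sha256_hexdigest_chunked(path, size)
           for size in (1, 64, 2048, 65536)}
whole = hashlib.sha256(data).hexdigest()   # hash of the data in one call
os.remove(path)
```

Every entry in digests should equal whole, so a friend who reads the file with any buffer size, in any language, should arrive at the same value.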

How to understand the bytes object returned from hash.digest()?

I was wondering about the bytes object returned from:
>>> hashlib.sha256(b'foo').digest()
b',&\xb4kh\xff\xc6\x8f\xf9\x9bE<\x1d0A4\x13B-pd\x83\xbf\xa0\xf9\x8a^\x88bf\xe7\xae'
The documentation states:
This is a bytes object of size digest_size which may contain bytes in the whole range from 0 to 255.
Is this the bytes version of:
>>> hashlib.sha256(b'foo').hexdigest()
'2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae'
?
If so, why do the two representations not line up, as they do for example with:
>>> 'foo'
>>> b'foo'
?
This is probably related to why:
>>> hashlib.sha256(b'foo').hexdigest().decode('hex')
does not work?
Yes, it's the bytes version of this hex string:
>>> hashlib.sha256(b'foo').hexdigest()
'2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae'
Hash functions (MD5, SHA-1, SHA-256) return binary data. That's why the Python implementation returns a bytes object. hexdigest() is usually used if you want to send the value to an API, or print it for debugging purposes. But hash functions return bits (in the form of an array of bytes).
As an example, take a look at the MD5 definition on Wikipedia. You'll see that its output is a number of bits, as per the first paragraph:
The MD5 message-digest algorithm is a widely used hash function
producing a 128-bit hash value
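The relationship between the two forms can be checked directly. A short sketch, assuming Python 3.5 or later (for bytes.hex()); the variable names are illustrative only. It also shows why the repr does not "line up": printable ASCII bytes are displayed as characters, and only the rest as \x escapes.

```python
import hashlib

digest = hashlib.sha256(b"foo").digest()         # 32 raw bytes
hex_digest = hashlib.sha256(b"foo").hexdigest()  # 64 hex characters

# Each byte becomes two hex characters, and the conversion is reversible.
as_hex = digest.hex()
back_to_bytes = bytes.fromhex(hex_digest)

# The repr of a bytes object prints printable ASCII bytes as characters:
# 0x2c is ',' and 0x26 is '&', which is why the repr starts with b',&'.
first_two = digest[:2]
```

In Python 3, bytes.fromhex(...) takes the place of the Python 2 idiom .decode('hex'), which no longer exists on str.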

Convert string with HEX MD5 to base64 encoding

I need to convert a HEX-type md5 string to the base64 version in Python.
For example, if I had MD5: 4297f44b13955235245b2497399d7a93
I need the code to produce Qpf0SxOVUjUkWySXOZ16kw==
This is identical to another SO question asking for a C# implementation, but I need the Python code. It is also similar to this SO question asking how to convert a single binary number to base64 in Python.
Depending on the version of Python you are running, one of the following will work:
Python 2
import base64
base64.b64encode("4297f44b13955235245b2497399d7a93".decode("hex"))
Python 3
import base64
base64.b64encode(bytes.fromhex("4297f44b13955235245b2497399d7a93"))
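For Python 3, the conversion can be wrapped so that it returns a str in both directions (b64encode itself returns bytes). This is a minimal sketch; the two function names are hypothetical helpers, not part of any library.

```python
import base64
import binascii

def hex_md5_to_base64(hex_digest):
    """Convert a hex digest string to its base64 form (as a str)."""
    raw = binascii.unhexlify(hex_digest)          # 16 raw bytes for MD5
    return base64.b64encode(raw).decode("ascii")  # b64encode returns bytes

def base64_to_hex_md5(b64_digest):
    """Inverse conversion: base64 text back to a lowercase hex string."""
    return base64.b64decode(b64_digest).hex()

encoded = hex_md5_to_base64("4297f44b13955235245b2497399d7a93")
print(encoded)  # Qpf0SxOVUjUkWySXOZ16kw==
```

If the digest is being computed in the same program, it may be simpler to skip the hex form and pass hashlib's digest() bytes straight to base64.b64encode.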

Python Encoding that ignores leading 0s

I'm writing code in Python 3.5 that uses hashlib to compute the MD5 hash of each packet once it is given a pcap file and the password. I am traversing the pcap file using pyshark. Currently, the values it produces are not the same as the MD5 hashes on the packets in the pcap file.
One of the causes I suspect is that in the hex representation of the packet, the values are represented with leading 0s. E.g., the protocol number is shown as b'06', but the value I am updating the hashlib object with is b'6'. And these two values are not the same for some reason:
>>> b'06' == b'6'
False
The way I am encoding integers is:
(hex(int(value))[2:]).encode()
I am doing this encoding because otherwise it would result in this error: "TypeError: Unicode-objects must be encoded before hashing"
I was wondering if I could get some help finding a python encoding library that ignores leading 0s or if there was any way to get the inbuilt hex method to ignore the leading 0s.
Thanks!
Hashing b'06' and b'6' gives different results because, in this context, '06' and '6' are different.
The b string prefix in Python tells the Python interpreter to convert each character in the string into a byte. Thus, b'06' will be converted into the two bytes 0x30 0x36, whereas b'6' will be converted into the single byte 0x36. Just as hashing b'a' and b' a' (note the space) produces different results, hashing b'06' and b'6' will similarly produce different results.
If you don't understand why this happens, I recommend looking up how bytes work, both within Python and more generally - Python's handling of bytes has always been a bit counterintuitive, so don't worry if it seems confusing! It's also important to note that the way Python represents bytes has changed between Python 2 and Python 3, so be sure to check which version of Python any information you find is talking about. You can comment here, too,
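If the data being hashed is supposed to contain two hex digits per byte, zero-padded formatting keeps the leading 0. A minimal sketch, assuming Python 3 and a field that fits in one byte (0 to 255); protocol_number is a hypothetical stand-in for the value read from the packet.

```python
import hashlib

protocol_number = 6  # hypothetical value taken from a packet field

unpadded = hex(protocol_number)[2:].encode()      # b'6'  (one byte, 0x36)
padded = format(protocol_number, "02x").encode()  # b'06' (two bytes, 0x30 0x36)
raw_byte = bytes([protocol_number])               # b'\x06' (the byte value itself)

# All three are different byte strings, so they hash differently.
digests = {name: hashlib.md5(value).hexdigest()
           for name, value in [("unpadded", unpadded),
                               ("padded", padded),
                               ("raw", raw_byte)]}
```

Which of the three is right depends on what the other side hashed: the ASCII hex text of the packet, or the packet's raw bytes. That is an assumption to confirm against the protocol, not something this sketch can decide.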

Is Python's hashlib.sha256(x).hexdigest() equivalent to R's digest(x, algo="sha256")?

I'm not a python programmer, but I'm trying to translate some Python code to R. The piece of python code I'm having trouble with is:
hashlib.sha256(x).hexdigest()
My interpretation of this code is that the function is going to calculate the hash of x using the sha256 algorithm and return the value in hex.
Given that interpretation, I am using the following R function:
digest(x, algo="sha256", raw=FALSE)
Based upon my admittedly limited knowledge of R and what I have read online about Python's hashlib functions, the two functions should produce identical results, but they do not.
Am I missing something, or am I using the wrong R function?
Yes, both the Python and the R sample code return a hexadecimal representation of a SHA-256 hash digest for the data passed in.
You do need to switch off serialisation in R; otherwise digest() first creates a serialisation of the string rather than calculating the hash of the character data only. Set serialize to FALSE:
> digest('', algo="sha256", serialize=FALSE)
[1] "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
> digest('hello world', algo="sha256", serialize=FALSE)
[1] "b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9"
These match their Python equivalents (bytes literals are used so that the code also runs on Python 3, where hashlib rejects str):
>>> import hashlib
>>> hashlib.sha256(b'').hexdigest()
'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
>>> hashlib.sha256(b'hello world').hexdigest()
'b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9'
If your hashes still differ between R and Python, then your data is different. The difference could be as subtle as a newline at the end of the line, or a byte order mark at the start.
In Python, inspect the output of print(repr(x)) to represent the data as a Python string literal; this shows non-printable characters as escape sequences. I'm sure R has similar debugging tools. Both R and Python echo string values as representations when using their interactive modes.
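The effect of such invisible differences can be sketched as follows, assuming Python 3 and UTF-8 text; sha256_hex is a hypothetical helper, and the three sample strings are made up for the illustration.

```python
import hashlib

def sha256_hex(text, encoding="utf-8"):
    """Encode a str to bytes explicitly, then hash it (Python 3)."""
    return hashlib.sha256(text.encode(encoding)).hexdigest()

clean = "hello world"
with_newline = "hello world\n"   # e.g. a line read from a file
with_bom = "\ufeffhello world"   # e.g. text decoded from a file with a BOM

for candidate in (clean, with_newline, with_bom):
    # repr() makes the invisible differences visible
    print(repr(candidate), sha256_hex(candidate))
```

Only the first string should reproduce the digest shown for 'hello world' above; the other two produce unrelated digests even though they look the same when printed normally.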
