In Python I'm using the Crypto package to generate a random 256-bit number. The code for doing so is:
import Crypto.Random.random as rand
key = rand.getrandbits(256)
This gives something like:
112699108505435943726051051450940377552177626778909564691673845134467691053980
Now my question is: how do I transform this number into a string of ASCII characters? Is there a built-in function for doing so, or do I need to convert it to binary, split it up into blocks of eight ones and zeros, and do it myself?
Thanks in advance.
I don't know if it's built in, but I doubt it.
But doing it yourself is not as easy as reinterpreting your data as bytes, because ASCII is a 7-bit encoding scheme: the most significant bit of each byte is always zero.
The easiest way to do this is to convert your int to a packed array of bytes with to_bytes [1] and then discard one bit from each byte, e.g. by right shifting. This wastes 1/8 of your entropy but makes for a cleaner program.
This only works because you're using a cryptographically secure source of random number generation. This means that each bit has an equal probability of being a one or zero - that is, each bit is uniformly distributed and independent of all others. [2]
[1] http://docs.python.org/3/library/stdtypes.html#additional-methods-on-integer-types
[2] https://crypto.stackexchange.com/questions/10300/recasting-randomly-generated-numbers-to-other-widths
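A minimal sketch of the shift approach described in the answer, assuming Python 3 and a 256-bit integer like the one in the question. The helper name int_to_ascii is made up for illustration. Note that the result can contain any of the 128 ASCII codes, including control characters, so it is not necessarily printable.

```python
def int_to_ascii(key, nbits=256):
    """Turn an nbits-wide integer into a 7-bit ASCII string (hypothetical helper)."""
    raw = key.to_bytes(nbits // 8, "big")  # packed bytes, most significant first
    # Drop the lowest bit of every byte so each value lands in 0..127.
    return bytes(b >> 1 for b in raw).decode("ascii")


# The example value from the question.
key = 112699108505435943726051051450940377552177626778909564691673845134467691053980
text = int_to_ascii(key)
print(len(text))  # one character per input byte
```

Each output character carries 7 of the original 8 bits, which is the 1/8 entropy loss mentioned in the answer.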
Related
Suppose that you hash a string in python using a custom-made hash function named sash().
sash("hello world") returns something like 2769834847158000631.
What code (in Python) would implement a sash() function and an unsash() function such that unsash(sash("hello world")) returns "hello world"?
If you like, assume that the string contains ASCII characters only.
There are 128 ASCII characters.
Thus, each python string is like a natural number written in base 128.
A hash is fixed in size, whereas a string is not. Therefore there will be more possible strings than hash values, making it impossible to reverse.
In your example, you have an 11-character string containing 77 bits. Your corresponding integer would fit in 64 bits (actually 62 bits, but I will take 64 bits as what you might have been imagining). If we consider only 11-character strings (obviously there are far more), we have 2^77 possible strings. Assuming a 64-bit hash, there are only 2^64 hash values. Each hash value would have, on average, 2^13 = 8192 strings that map to it. So given just the hash value, you would have no idea which of those 8192 strings to decode it to.
If you don't mind a hash of unbounded size, then sure, you can simply consider the string itself to be the hash. Then no decoding required. You can get a little fancier, since you are limiting the characters to 0..127, and pack seven bits for each character into a string of bytes, reducing the size by 1/8th. This is effectively the base-128 number you are referring to. You may be able to get it smaller with compression if your 0..127 characters do not have the same probability. Then on average, the string can be compressed, with some possible strings necessarily getting larger instead of smaller.
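A sketch of the unbounded-size variant, treating an ASCII string as a base-128 number. The names sash and unsash come from the question; the leading sentinel digit 1 is an assumption added here so that strings starting with NUL characters survive the round trip. This is a reversible encoding, not a hash in the usual sense.

```python
def sash(text):
    """Encode an ASCII string as one (arbitrarily large) integer in base 128."""
    n = 1  # sentinel leading digit (assumption: keeps leading NUL characters)
    for code in text.encode("ascii"):
        n = n * 128 + code
    return n


def unsash(n):
    """Invert sash(): peel off base-128 digits until only the sentinel is left."""
    codes = []
    while n > 1:
        n, code = divmod(n, 128)
        codes.append(code)
    return bytes(reversed(codes)).decode("ascii")


print(unsash(sash("hello world")))  # hello world
```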
For this question, please assume python, but it doesn't necessarily matter.
Imagine you have an arbitrary ASCII string, for example:
jrioj4oi3m_=\.,ei9#
Sparing the extensive details, I need to pass this string as a "label" on to another program, but that program doesn't support "labels" containing "special characters" or even numbers. So I'm trying to encode an ASCII string into a string that uses an arbitrary subset of ASCII.
One very naive solution would be to convert the original string into binary, then convert 0s into "a" and 1s into "b". This works to solve my problem, but I would like to learn a better solution here, to become a better programmer.
First of all, what exactly is this problem called?
This is not exactly a hashing problem, because IIRC hashing generally involves encoding into a string that is shorter than the original, and involves collisions.
I need no collisions, and I don't really care how long the encoded string is, as long as it's shorter than the naive case. (Ideally it would be the shortest length possible given the subset)
In fact, it would be ideal to specify exactly what the allowed character set is, then use a generalized encoding algorithm to do the encoding.
Decoding would be nice to know also.
A simple solution would be to first convert to a hex encoding:
jrioj4oi3m_=.,ei9# => 6a72696f6a346f69336d5f3d2e2c65693923
and then translate any numbers into non-hex letters:
6a72696f6a346f69336d5f3d2e2c65693923 => waxswzwfwatuwfwzttwdvftdsescwvwztzst
So the output string would always be exactly twice the length of the input string and only ever contain characters in the range a-z.
This can be easily achieved in python like this:
>>> enc = str.maketrans('0123456789', 'qrstuvwxyz')
>>> dec = str.maketrans('qrstuvwxyz', '0123456789')
>>> s = 'jrioj4oi3m_=.,ei9#'
>>> x = s.encode('ascii').hex().translate(enc)
>>> x
'waxswzwfwatuwfwzttwdvftdsescwvwztzst'
>>> bytes.fromhex(x.translate(dec)).decode('ascii')
'jrioj4oi3m_=.,ei9#'
Interestingly, this actually turns out to be a really simple and common math problem: base conversion. As a programmer, you probably know, at least in theory, how to convert between base 2, 10, and 16 representations of a value. There are 95 printable ASCII characters (codes 32 to 126), so any printable ASCII string can be considered to be a base 95 representation of a (probably very large) value. If your label only accepts 64 characters (uppercase, lowercase, digits, and 2 others, for instance), then you simply need to convert your base 95 representation into a base 64 representation of the same value.
Decoding is simply converting your base 64 representation back to the base 95 representation.
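A rough sketch of the base-conversion idea with a configurable target alphabet. For simplicity it treats the input bytes as a base-256 number rather than a base-95 one, so the output is slightly longer than the theoretical minimum. The function names and the 0x01 sentinel byte (which protects leading zero bytes) are assumptions made for this illustration.

```python
import string


def encode_label(text, alphabet=string.ascii_lowercase):
    """Re-express an ASCII string using only the characters in `alphabet`."""
    # Prefix a 0x01 sentinel byte so leading zero bytes are not lost.
    n = int.from_bytes(b"\x01" + text.encode("ascii"), "big")
    base = len(alphabet)
    digits = []
    while n:
        n, remainder = divmod(n, base)
        digits.append(alphabet[remainder])
    return "".join(reversed(digits))


def decode_label(label, alphabet=string.ascii_lowercase):
    """Invert encode_label() for the same alphabet."""
    base = len(alphabet)
    n = 0
    for ch in label:
        n = n * base + alphabet.index(ch)
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:].decode("ascii")  # drop the sentinel byte


label = encode_label("jrioj4oi3m_=.,ei9#")
print(label, len(label))
print(decode_label(label))
```

With a 26-letter alphabet, the 18-character example comes out shorter than the 36 characters produced by the hex-based approach, and a larger alphabet shortens it further.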
I've been reading about base64 conversion, and what I understand is that the encoded version of the original data will be 133% of the original size.
Then, I'm reading about how YouTube is able to have unique identifiers to their videos like FJZQSHn7fc and the reason was: an 11 character base64 string can map to a huge number.
Wait, say a huge number contains 20 characters, then wouldn't a base64 encoded string be 133% of that size, not shorter?
I'm very confused. Are there different types of base64 conversion (string to base64 vs. decimal to base64), once resulting in a bigger, and the other in a smaller resulting string?
Each character in base 64 can encode 6 bits of data. Thus 11 characters can encode 6x11 = 66 bits of data.
2^66 = 73786976294838206464
73786976294838206464 (approximately 7.4 x 10^19 or 74 quintillion) possible identifiers is more than enough to distinguish unique YouTube videos for the foreseeable future.
It is unlikely that YouTube is using these strings of length 11 as encodings of smaller objects. You can use base64 (just a number in base 64 after all) without having to think of it as an encoding of something else, just like you can use bytes (binary numbers with 8 bits) without thinking of those bytes as being encodings of ascii characters. The only important question with an identifier scheme is if there are enough identifiers to go around. In this case there clearly are.
Think of it like this: you have a 64-bit number (called long in Java, for example).
Now, you can print that number in different ways:
As a binary number (base 2), printing 64 '0' or '1'
As a decimal number (base 10), printing up to 20 decimal digits
As a hexadecimal number (base 16), printing 16 hexadecimal digits
As a number in base 64, printing 11 "digits" in that base. You can use any graphical symbols as digits.
... you understand by now that there are many more possibilities ...
It seems like they use the same base-64 numbers as the ones that are used in base64 encoding, that is, uppercase and lowercase letters, ordinary digits and 2 extra chars. Each character represents a 6-bit value. So you get 66 bits, and depending on the algorithm used, either the leading or trailing 2 bits are cut off to get a nice long value back.
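To make the "number printed in base 64" view concrete, here is a small sketch. The alphabet used (uppercase, lowercase, digits, then "-" and "_") is the URL-safe Base64 alphabet; whether YouTube orders its digits this way is an assumption, and to_base64_digits is a name invented for this example.

```python
import string

# URL-safe Base64 digit set (assumed ordering: A-Z, a-z, 0-9, '-', '_').
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"


def to_base64_digits(n, width=11):
    """Print a non-negative integer as `width` digits in base 64."""
    digits = []
    for _ in range(width):
        n, remainder = divmod(n, 64)
        digits.append(ALPHABET[remainder])
    if n:
        raise ValueError("number does not fit in the requested width")
    return "".join(reversed(digits))


largest = 2**64 - 1
print(len(str(largest)))          # 20 decimal digits
print(to_base64_digits(largest))  # P__________ (11 base-64 digits)
```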
You are confusing what things are being compared.
There are 2 statements, both comparing different things:
"base64 encoding is 133% of the original size"
"An 11 character base64 string can encode a huge number"
In the case of 1, they are normally referring to a string encoded with something like ASCII, using 8 bits per character, and comparing that with the same string encoded in base64. The result is about 33% bigger (133% of the original size), because base64 uses only 64 of the 256 possible bit combinations in every byte.
In the case of 2, they are comparing using a numeric identifier, and then either encoding it as base64, or base10. In this case, base64 is a lot shorter than base10.
You can also think of the (1) case as comparing base256 against base64, and the (2) case as comparing base10 against base64.
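Both comparisons can be checked quickly with the standard library. This is only an illustrative sketch; the sample values are arbitrary.

```python
import base64

# Case 1: base 256 (raw bytes) versus base 64. The output grows by 4/3.
text = b"a" * 30
encoded = base64.b64encode(text)
print(len(text), len(encoded))  # 30 40

# Case 2: base 10 versus base 64 for the same 64-bit number.
n = 2**64 - 1
decimal_len = len(str(n))
b64_len = len(base64.urlsafe_b64encode(n.to_bytes(8, "big")).rstrip(b"="))
print(decimal_len, b64_len)  # 20 11
```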
When you say Base64, some would think of RFC 4648. If YouTube is using RFC 4648, then it's a 12-digit number where they're omitting the last digit because it is always '=', the padding character (the 65th element of the base64 alphabet). The 12 digits represent three blocks of four digits, and four digits yield 24 bits of information. YouTube video IDs would therefore be 64-bit, not 66-bit, if they're using the standard.
Those 64 bits might be representing an unsigned integer. YouTube used MySQL and then sharded MySQL through Vitess, so you could imagine them using an UNSIGNED BIGINT key internally that they encode via RFC 4648-compliant Base64 externally.
Clearly Tom Scott thinks YouTube is squeezing 66 bits out of their 11 characters; his video says so.
If he's wrong, then their frontend might allow you to specify four distinct video IDs for the same video. Those two extra bits' values do not affect the UNSIGNED BIGINT. Which two bits they are depends on endianness and other choices of encoding.
Regardless of whether YouTube is using standard or nonstandard encoding, they can represent 18446744073709551615 in 11 characters (since the padding character is always there and thus omitted for a 64-bit quantity).
Perhaps they use something like the following to compute a pseudorandom 64-bit integer when a new video is created:
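import base64
import random

def Base64RandomSlug():
    # Eight pseudorandom bytes make up the 64-bit value.
    array = bytearray(random.getrandbits(8) for x in range(64 // 8))
    b = base64.urlsafe_b64encode(bytes(array))
    # Drop the trailing '=' padding to get an 11-character slug.
    return b.decode('utf-8').rstrip('=')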
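The point about the two spare bits can be illustrated with a small decoder. This sketch assumes the URL-safe alphabet and assumes that the two trailing bits are the ones discarded, which, as noted above, depends on the encoding choices. slug_to_uint64 is a hypothetical helper, not anything YouTube is known to use.

```python
import base64
import string

# URL-safe Base64 alphabet: A-Z, a-z, 0-9, '-', '_'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"


def slug_to_uint64(slug):
    """Read an 11-character slug as 66 bits and drop the 2 trailing bits."""
    n = 0
    for ch in slug:
        n = n * 64 + ALPHABET.index(ch)
    return n >> 2


# Standard encoding of the 64-bit value 1, with the '=' padding stripped.
slug = base64.urlsafe_b64encode((1).to_bytes(8, "big")).decode("ascii").rstrip("=")
print(slug)  # AAAAAAAAAAE

# Four different slugs that differ only in the two spare bits.
aliases = ["AAAAAAAAAAE", "AAAAAAAAAAF", "AAAAAAAAAAG", "AAAAAAAAAAH"]
print([slug_to_uint64(s) for s in aliases])  # [1, 1, 1, 1]
```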
I would like to be able to represent any string as a unique integer (means every integer in the world could mean only one string, and a certain string would result constantly in the same integer).
The obvious point is, that's how the computer works, representing the string 'Hello' (for example) as a number for each character, specifically a byte (assuming ASCII encoding).
But... I would like to perform arithmetic calculations over that number (Encode it as a number using RSA).
The reason this is getting messy is because assuming I have a bit larger string 'I am an average length string' I have more characters (29 in this case), and an integer with 29 bytes could come up HUGE, maybe too much for the computer to handle (when coming up with bigger strings...?).
Basically, my question is: how could I do this? I wouldn't like to use any module for RSA; it's a task I would like to implement myself.
Here's how to turn a string into a single number. As you suspected, the number will get very large, but Python can handle integers of any arbitrary size. The usual way of working with encryption is to do individual bytes all at once, but I'm assuming this is only for a learning experience. The loop works on bytes, so a Unicode string is encoded to UTF-8 first.
num = 0
for byte in my_string.encode('utf-8'):  # iterating over bytes yields ints in Python 3
    # Parentheses matter: + binds tighter than <<.
    num = (num << 8) + byte
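A sketch of the full round trip, using int.from_bytes and int.to_bytes from the standard library. The function names are made up for illustration. One caveat: leading NUL characters are lost, because leading zero bytes do not change the value of the number.

```python
def string_to_int(text):
    """Interpret the UTF-8 bytes of `text` as one big-endian integer."""
    return int.from_bytes(text.encode("utf-8"), "big")


def int_to_string(num):
    """Invert string_to_int() by recovering the bytes from the integer."""
    length = (num.bit_length() + 7) // 8  # number of bytes needed
    return num.to_bytes(length, "big").decode("utf-8")


message = "I am an average length string"
num = string_to_int(message)
print(num.bit_length())    # well beyond 64 bits, which Python handles fine
print(int_to_string(num))  # I am an average length string
```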
I have a set of ASCII strings, let's say they are file paths. They could be both short and quite long.
I'm looking for an algorithm that could calculate a hash of such strings, where the hash is also a string but has a fixed length, like YouTube video IDs:
https://www.youtube.com/watch?v=-F-3E8pyjFo
^^^^^^^^^^^
MD5 seems to be what I need, but it is critical for me to have short hash strings.
Is there a shell command or python library which can do that?
As of Python 3 this method does not work:
Python has a built-in hash() function that's very fast and perfect for most uses:
>>> hash("dfds")
3591916071403198536
You can then make it unsigned:
>>> import ctypes
>>> hashu=lambda word: ctypes.c_uint64(hash(word)).value
You can then turn it into a 16-character hex string (8 bytes):
>>> hashu("dfds").to_bytes(8,"big").hex()
Or an N*2-character hex string, where N is <= 8:
>>> hashn=lambda word, N : (hashu(word)%(2**(N*8))).to_bytes(N,"big").hex()
And so on. If you want N to be larger than 8 bytes, you can just hash twice. Python's built-in is so much faster that it's never worth using hashlib unless you need security, not just collision resistance.
>>> hashnbig=lambda word, N : ((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex()
And finally, use the URL-safe base64 encoding to make a much better string than "hex" gives you:
>>> from base64 import urlsafe_b64encode
>>> hashnbigu=lambda word, N : urlsafe_b64encode(((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo",16)
'ZblnvrRqHwAy2lnvrR4HrA'
Caveats:
Be warned that in Python 3.3 and up, this function is randomized and won't work for some use cases. You can disable this with PYTHONHASHSEED=0.
See https://github.com/flier/pyfasthash for fast, stable hashes that similarly won't overload your CPU for non-cryptographic applications.
Don't use this lambda style in real code... write it out! And stuffing things like 2**32 in your code, instead of making them constants, is bad form.
In the end, 8 bytes of collision resistance is OK for smaller applications. With fewer than a million entries, the birthday bound puts the collision odds below 0.000003%. That's a 12-character Base64-encoded string (11 without the padding). But it might not be enough for larger apps.
16 bytes is enough for a UUID/OID in a cache, etc.
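The collision odds for a given hash width can be estimated with the birthday approximation p ≈ n(n-1)/(2·2^bits). The sketch below assumes a uniformly distributed hash; the function name is hypothetical.

```python
def collision_probability(entries, hash_bits):
    """Birthday approximation for the chance of at least one collision.

    Only meaningful while the result is much smaller than 1.
    """
    space = 2 ** hash_bits
    return entries * (entries - 1) / (2 * space)


p = collision_probability(1_000_000, 64)
print(f"{p:.2e}")  # 2.71e-08, i.e. roughly 0.0000027 %
```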
Speed comparison for producing 300k 16 byte hashes from a bytes-input.
builtin: 0.188
md5: 0.359
fnvhash_c: 0.113
For a complex input (a tuple of 3 integers, for example), you have to convert to bytes to use the non-builtin hashes. This adds a lot of conversion overhead, making the builtin shine.
builtin: 0.197
md5: 0.603
fnvhash_c: 0.284
I guess this question is off-topic because it is opinion-based, but here is at least one hint: I know the FNV hash because it is used by The Sims 3 to find resources based on their names between the different content packages. They use the 64-bit version, so I guess it is enough to avoid collisions in a relatively large set of reference strings. The hash is easy to implement if no module satisfies you (pyfasthash has an implementation of it, for example).
To get a short string out of it, I would suggest you use base64 encoding. For example, this is the size of a base64-encoded 64-bit hash: nsTYVQUag88= (and you can get rid of the padding =).
Edit: I had finally the same problem as you, so I implemented the above idea: https://gist.github.com/Cilyan/9424144
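For reference, a minimal sketch of a 64-bit FNV hash combined with URL-safe Base64. It uses the FNV-1a variant, which is an assumption here since the answer does not say which variant is meant, and it is not the code from the linked gist. The name short_id is made up.

```python
import base64

FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data):
    """64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64  # keep the value within 64 bits
    return h


def short_id(text):
    """11-character URL-safe string derived from the 64-bit hash."""
    digest = fnv1a_64(text.encode("utf-8")).to_bytes(8, "big")
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


print(short_id("/some/long/file/path.txt"))
```

Unlike the built-in hash(), this gives the same result in every process, at the cost of running in pure Python.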
Another option: hashids is designed to solve exactly this problem and has been ported to many languages, including Python. It's not really a hash in the sense of MD5 or SHA1, which are one-way; hashids "hashes" are reversible.
You are responsible for seeding the library with a secret value and selecting a minimum hash length.
Once that is done, the library can do two-way mapping between integers (single integers, like a simple primary key, or lists of integers, to support things like composite keys and sharding) and strings of the configured length (or slightly more). The alphabet used for generating "hashes" is fully configurable.
I have provided more details in this other answer.
You could use the sum program (assuming you're on Linux), but keep in mind that the shorter the hash, the more collisions you might have. You can always truncate MD5/SHA hashes as well.
EDIT: Here's a list of hash functions: List of hash functions
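As a sketch of the truncation idea in Python: take a standard digest, keep the first few bytes, and encode them with URL-safe Base64. The choice of SHA-256 and 8 bytes is an assumption for illustration, and short_hash is a hypothetical name.

```python
import base64
import hashlib


def short_hash(path, nbytes=8):
    """Fixed-length, URL-safe string from a truncated SHA-256 digest."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()[:nbytes]
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


print(short_hash("/home/user/videos/cat.mp4"))  # always 11 characters for nbytes=8
```

Fewer bytes give a shorter string but raise the collision risk, as the answer warns.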
Something to keep in mind is that hash codes are one-way functions: you cannot use them for "video ids", as you cannot go back from the hash to the original path. Quite apart from anything else, hash collisions are quite likely, and you end up with two hashes both pointing to the same video instead of different ones.
To create an Id like the youtube one the easiest way is to create a unique id however you normally do that (for example an auto key column in a database) and then map that to a unique string in a reversible way.
For example you could take an integer id and map it to 0-9a-z in base 36...or even 0-9a-zA-Z in base 62, padding the generated string out to the desired length if the id on its own does not give enough characters.
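A minimal sketch of the base 62 mapping with padding. The digit order (0-9, a-z, A-Z) and the function names are assumptions for illustration.

```python
import string

# 0-9, a-z, A-Z: 62 digits in total (assumed ordering).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_id(n, min_length=6):
    """Write a non-negative integer in base 62, left-padded with '0'."""
    digits = []
    while n:
        n, remainder = divmod(n, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits)).rjust(min_length, "0")


def decode_id(text):
    """Map the base-62 string back to the integer id."""
    n = 0
    for ch in text:
        n = n * 62 + ALPHABET.index(ch)
    return n


print(encode_id(62))        # 000010
print(decode_id("000010"))  # 62
```

Because the padding digit is the zero of the alphabet, decoding needs no special handling for it.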