Obfuscating Strings with ASCII and base 128 - python

Suppose a string is treated as a number in a positional system where each element, whether a printable character, DEL, or any other ASCII code, has a corresponding value according to the ASCII table. How can you convert an arbitrary string of this kind to a number in Python?
An example
#car = 35*128**3+99*128**2+97*128**1+114*128**0=75034866

Try this:
total = 0
for c in "#car":
    total <<= 7
    total += ord(c)
print total
Result:
75034866
To get back the original string:
result = []
while total:
    result.append(chr(total % 128))
    total >>= 7
print ''.join(reversed(result))
Result:
#car

For arbitrarily long numbers no special type is needed: in Python 2.x ints automatically promote to arbitrary-precision longs, and in Python 3.x int is arbitrary precision by itself.
No idea about 'encryption'.

You're looking to find a mapping to obfuscate the data.
Real encryption requires the use of a trapdoor function - a function that is computationally easy to compute one way, but whose inverse is difficult to compute without specific information.
Much public-key encryption (RSA, for example) is based on prime factorization. Alice multiplies two large prime numbers together and gives the result to Bob. Bob uses this large number to encrypt his data using an encryption function. Inverting Bob's encryption function requires knowing the two original prime numbers (encrypting does not). Recovering those primes from their product is a very computationally expensive task, so the encrypted data is 'safe'.
Implementing this correctly is VERY difficult. If you want to keep data safe, find a library that does it for you.
EDIT: I should specify that what I described was public key encryption. Private key encryption works a bit differently. The important thing is that there is a mathematical basis for thinking that encrypted data will be hard to decrypt without a key of some sort.
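To make the public-key idea concrete, here is a toy sketch of textbook RSA with tiny primes (purely illustrative and hopelessly insecure; the variable names are mine, and pow(e, -1, phi) needs Python 3.8+):
p, q = 61, 53                 # Alice's secret primes
n = p * q                     # public modulus (3233), the product she hands out
e = 17                        # public exponent
phi = (p - 1) * (q - 1)       # 3120, computable only if you know p and q
d = pow(e, -1, phi)           # private exponent (2753)

m = 65                        # Bob's message, already encoded as a number
c = pow(m, e, n)              # Bob encrypts with the public key (n, e)
assert pow(c, d, n) == m      # only the holder of d can recover the message
Recovering d from (n, e) alone requires factoring n, which is what makes the function a trapdoor.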

Related

Need help understanding binary conversion in Python

Or I guess binary in general. I'm obviously quite new to coding, so I'll appreciate any help here.
I just started learning about converting numbers into binary, specifically two's complement. The course presented the following code for converting:
num = 19
if num < 0:
    isNeg = True
    num = abs(num)
else:
    isNeg = False
result = ''
if num == 0:
    result = '0'
while num > 0:
    result = str(num % 2) + result
    num = num // 2
if isNeg:
    result = '-' + result
This raised a couple of questions with me and after doing some research (mostly here on Stack Overflow), I found myself more confused than I was before. Hoping somebody can break things down a bit more for me. Here are some of those questions:
I thought it was outright wrong that the code suggested just appending a - to the front of a binary number to show its negative counterpart. It looks like bin() does the same thing, but don't you have to flip the bits and add a 1 or something? Is there a reason for this other than making it easy to comprehend/read?
Was reading here and one of the answers in particular said that Python doesn't really work in two's complement, but something else that mimics it. The disconnect here for me is that Python shows me one thing but is storing the numbers a different way. Again, is this just for ease of use? Is bin() using two's complement or Python's method?
Follow-up to that one, how does the 'sign-magnitude' format mentioned in the above answer differ from two's complement?
The Professor doesn't talk at all about 8-bit, 16-bit, 64-bit, etc., which I saw a lot of while reading up on this. Where does this distinction come from, and does Python use one? Or are those designations specific to the program that I might be coding?
A lot of the posts I've read only reference how Python stores integers. Is that suggesting that it stores floats a different way, or are they just speaking broadly?
As I wrote this all up, I sort of realized that maybe I'm diving into the deep end before learning how to swim, but I'm curious like that and like to have a deeper understanding of stuff before moving on.
I thought it was outright wrong that the code suggested just appending a - to the front of a binary number to show its negative counterpart. It looks like bin() does the same thing, but don't you have to flip the bits and add a 1 or something? Is there a reason for this other than making it easy to comprehend/read?
You have to somehow designate the number as being negative. You can add another symbol (-), add a sign bit at the very beginning, use ones' complement, use two's complement, or some other completely made-up scheme that works. Both the ones'- and two's-complement representations of a number require a fixed number of bits, which doesn't exist for Python integers:
>>> 2**1000
1071508607186267320948425049060001810561404811705533607443750
3883703510511249361224931983788156958581275946729175531468251
8714528569231404359845775746985748039345677748242309854210746
0506237114187795418215304647498358194126739876755916554394607
7062914571196477686542167660429831652624386837205668069376
The natural solution is to just prepend a minus sign. You can similarly write your own version of bin() that requires you to specify the number of bits and returns the two's-complement representation of the number.
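For example, a minimal sketch of such a function (the name and signature are mine, not from the course):
def to_twos_complement(num, bits):
    """Return num as a two's-complement binary string of the given width."""
    if num < 0:
        num += 1 << bits               # e.g. -19 with 8 bits becomes 237
    return format(num, '0{}b'.format(bits))

print(to_twos_complement(19, 8))       # 00010011
print(to_twos_complement(-19, 8))      # 11101101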
Was reading here and one of the answers in particular said that Python doesn't really work in two's complement, but something else that mimics it. The disconnect here for me is that Python shows me one thing but is storing the numbers a different way. Again, is this just for ease of use? Is bin() using two's complement or Python's method?
Python is a high-level language, so you don't really know (or care) how your particular Python runtime internally stores integers. Whether you use CPython, Jython, PyPy, IronPython, or something else, the language specification only defines how they should behave, not how they should be represented in memory. bin() just takes a number and prints it out using binary digits, the same way you'd convert 123 into base 2.
Follow-up to that one, how does the 'sign-magnitude' format mentioned in the above answer differ from two's complement?
Sign-magnitude usually encodes a number n as 0bXYYYYYY..., where X is the sign bit and YY... are the binary digits of the non-negative magnitude. Arithmetic with numbers encoded as two's-complement is more elegant due to the representation, while sign-magnitude encoding requires special handling for operations on numbers of opposite signs.
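To make the difference concrete, a small 8-bit example of my own:
n, bits = -19, 8
# sign-magnitude: a sign bit followed by the binary digits of |n|
sign_magnitude = '1' + format(abs(n), '0{}b'.format(bits - 1))
# two's complement: the value of n modulo 2**bits
twos_complement = format(n % (1 << bits), '0{}b'.format(bits))
print(sign_magnitude)    # 10010011
print(twos_complement)   # 11101101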
The Professor doesn't talk at all about 8-bit, 16-bit, 64-bit, etc., which I saw a lot of while reading up on this. Where does this distinction come from, and does Python use one? Or are those designations specific to the program that I might be coding?
No, Python doesn't define a maximum size for its integers because it's not that low-level. 2**1000000 computes fine, as will 2**10000000 if you have enough memory. n-bit numbers arise when your hardware makes it more beneficial to make your numbers a certain size. For example, processors have instructions that quickly work with 32-bit numbers but not with 87-bit numbers.
A lot of these posts I've only reference how Python stores integers. Is that suggesting that it stores floats a different way, or are they just speaking broadly?
It depends on what your Python runtime uses. Usually floating point numbers are like C doubles, but that's not required.
don't you have to flip the bits and add a 1 or something?
Yes, for two's-complement notation you invert all bits and add one to get the negative counterpart.
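For a fixed width you can check that rule directly in Python (an 8-bit sketch):
bits, num = 8, 5
flipped_plus_one = (~num + 1) & ((1 << bits) - 1)   # invert all bits, add one, keep 8 bits
print(format(num, '08b'))                           # 00000101
print(format(flipped_plus_one, '08b'))              # 11111011  (two's complement of -5)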
Is bin() using two's complement or Python's method?
Two's complement is a practical way to represent negative numbers in electronics that can only handle 0 and 1. All modern microprocessors use two's complement internally for negative numbers. For more information, see a textbook on computer architecture.
how does the 'sign-magnitude' format mentioned in the above answer differ from two's complement?
You should look at what this code does and why it is there:
while num > 0:
    result = str(num % 2) + result
    num = num // 2

Represent string as an integer in python

I would like to be able to represent any string as a unique integer (meaning every integer maps to only one string, and a given string always results in the same integer).
The obvious point is, that's how the computer works, representing the string 'Hello' (for example) as a number for each character, specifically a byte (assuming ASCII encoding).
But... I would like to perform arithmetic calculations over that number (Encode it as a number using RSA).
The reason this is getting messy is that with a somewhat longer string, say 'I am an average length string', I have more characters (29 in this case), and an integer spanning 29 bytes could get HUGE, maybe too much for the computer to handle (and even worse for bigger strings...?).
Basically, my question is: how could I do this? I wouldn't like to use any module for RSA; it's a task I would like to implement myself.
Here's how to turn a string into a single number. As you suspected, the number will get very large, but Python can handle integers of arbitrary size. Real encryption usually works on the data in fixed-size blocks rather than on one giant number, but I'm assuming this is only for a learning experience. This assumes a byte string; if you have a Unicode string, you can encode it to UTF-8 first.
num = 0
for ch in my_string:
    num = (num << 8) + ord(ch)   # parentheses needed: + binds tighter than <<
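And a sketch of the reverse direction, peeling the bytes back off (note this consumes num, and any leading NUL characters in the original string would be lost):
chars = []
while num:
    chars.append(chr(num & 0xFF))   # lowest byte first
    num >>= 8
decoded = ''.join(reversed(chars))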

Fast hash for strings

I have a set of ASCII strings, let's say they are file paths. Some of them are short and some are quite long.
I'm looking for an algorithm that can calculate a hash of such strings, where the hash is also a string but has a fixed length, like YouTube video ids:
https://www.youtube.com/watch?v=-F-3E8pyjFo
                                ^^^^^^^^^^^
MD5 seems to be what I need, but it is critical for me to have short hash strings.
Is there a shell command or Python library which can do that?
Note: as of Python 3.3 the built-in hash is randomized per process, so this method does not give stable results across runs (see the caveats below):
Python has a built-in hash() function that's very fast and perfect for most uses:
>>> hash("dfds")
3591916071403198536
You can then make it unsigned:
>>> import ctypes
>>> hashu=lambda word: ctypes.c_uint64(hash(word)).value
You can then turn it into a 16-character hex string:
>>> hashu("dfds").to_bytes(8,"big").hex()
Or an N*2-character hex string, where N <= 8:
>>> hashn=lambda word, N : (hashu(word)%(2**(N*8))).to_bytes(N,"big").hex()
...etc. And if you want N to be larger than 8 bytes, you can just hash twice. Python's built-in hash is so vastly faster that it's never worth using hashlib for anything unless you need security, not just collision resistance.
>>> hashnbig=lambda word, N : ((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex()
And finally, use the urlsafe base64 encoding to make a more compact string than hex gives you:
>>> from base64 import urlsafe_b64encode
>>> hashnbigu=lambda word, N : urlsafe_b64encode(((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo",16)
'ZblnvrRqHwAy2lnvrR4HrA'
Caveats:
Be warned that in Python 3.3 and up, this function is randomized per process and won't work for some use cases. You can disable the randomization with PYTHONHASHSEED=0.
See https://github.com/flier/pyfasthash for fast, stable hashes that similarly won't overload your CPU for non-cryptographic applications.
Don't use this lambda style in real code... write it out! And stuffing things like 2**32 into your code instead of making them constants is bad form.
In the end, 8 bytes of collision resistance is OK for smaller applications: with less than a million entries you've got collision odds of < 0.0000001%. That's a 12-character b64-encoded string. But it might not be enough for larger apps.
16 bytes is enough for a UUID/OID in a cache, etc.
Speed comparison for producing 300k 16-byte hashes from a bytes input:
builtin: 0.188
md5: 0.359
fnvhash_c: 0.113
For a complex input (tuple of 3 integers, for example), you have to convert to bytes to use the non-builtin hashes, this adds a lot of conversion overhead, making the builtin shine.
builtin: 0.197
md5: 0.603
fnvhash_c: 0.284
I guess this question is off-topic because it is opinion-based, but here is at least one hint: I know the FNV hash because it is used by The Sims 3 to find resources based on their names across the different content packages. They use the 64-bit version, so I guess it is enough to avoid collisions in a relatively large set of reference strings. The hash is easy to implement if no module satisfies you (pyfasthash has an implementation of it, for example).
To get a short string out of it, I would suggest you use base64 encoding. For example, this is a base64-encoded 64-bit hash: nsTYVQUag88= (and you can get rid of the padding =).
Edit: I eventually ran into the same problem as you, so I implemented the above idea: https://gist.github.com/Cilyan/9424144
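For reference, a minimal Python 3 sketch of the 64-bit FNV-1a variant plus base64 shortening (the constants are the standard FNV-1a offset basis and prime; the helper names are mine):
import base64

FNV64_OFFSET = 0xcbf29ce484222325   # standard FNV-1a 64-bit offset basis
FNV64_PRIME = 0x100000001b3         # standard FNV-1a 64-bit prime

def fnv1a_64(data: bytes) -> int:
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF   # keep it to 64 bits
    return h

def short_hash(path: str) -> str:
    digest = fnv1a_64(path.encode('utf-8')).to_bytes(8, 'big')
    return base64.urlsafe_b64encode(digest).decode('ascii').rstrip('=')

print(short_hash('/home/user/some/file/path.txt'))   # an 11-character id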
Another option: hashids is designed to solve exactly this problem and has been ported to many languages, including Python. It's not really a hash in the sense of MD5 or SHA1, which are one-way; hashids "hashes" are reversible.
You are responsible for seeding the library with a secret value and selecting a minimum hash length.
Once that is done, the library can do two-way mapping between integers (single integers, like a simple primary key, or lists of integers, to support things like composite keys and sharding) and strings of the configured length (or slightly more). The alphabet used for generating "hashes" is fully configurable.
I have provided more details in this other answer.
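A usage sketch, assuming a recent version of the hashids Python package (the salt and length values here are made up):
from hashids import Hashids

hashids = Hashids(salt='this is my secret salt', min_length=8)

hashid = hashids.encode(12345)    # a short alphanumeric string
numbers = hashids.decode(hashid)  # (12345,) - the mapping is reversible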
You could use the sum program (assuming you're on Linux), but keep in mind that the shorter the hash, the more collisions you might have. You can always truncate MD5/SHA hashes as well.
EDIT: Here's a list of hash functions: List of hash functions
Something to keep in mind is that hash codes are one-way functions - you cannot use them for "video ids" because you cannot go back from the hash to the original path. Quite apart from anything else, hash collisions are quite likely, and you'd end up with two different paths mapping to the same id instead of different ones.
To create an id like the YouTube one, the easiest way is to generate a unique id however you normally do that (for example, an auto-increment key column in a database) and then map it to a unique string in a reversible way.
For example, you could take an integer id and map it to 0-9a-z in base 36, or even 0-9a-zA-Z in base 62, padding the generated string out to the desired length if the id on its own does not give enough characters; a sketch follows below.
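A minimal sketch of that base-62 mapping (the function names and padding choice are mine):
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols

def encode_id(n, min_length=6):
    """Map a non-negative integer id to a short base-62 string."""
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return ''.join(reversed(chars)).rjust(min_length, ALPHABET[0])

def decode_id(s):
    """Reverse the mapping (leading '0' padding contributes nothing)."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(encode_id(123456789))             # 08m0Kx
print(decode_id(encode_id(123456789)))  # 123456789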

Reversible encryption in python

Short Question: Is there a proved strong reversible encryption method (in Python)?
Requirement: no third-party Python libraries.
Intended environment: transporting data over a network.
I saw a method using str.translate() with a key-generated table. Here is the table-generating function (Python 2):
import hashlib
import string
import struct

def get_table(key):
    m = hashlib.md5()
    m.update(key)
    s = m.digest()
    (a, b) = struct.unpack('<QQ', s)
    table = [c for c in string.maketrans('', '')]
    for i in xrange(1, 1024):
        table.sort(lambda x, y: int(a % (ord(x) + i) - a % (ord(y) + i)))
    return ''.join(table)
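For context, this is a sketch of how such a table would typically be used (Python 2, to match the function above; the inverse table is built by translating back to the identity table):
import string

encrypt_table = get_table('my secret key')
decrypt_table = string.maketrans(encrypt_table, string.maketrans('', ''))

ciphertext = 'some plaintext'.translate(encrypt_table)
plaintext = ciphertext.translate(decrypt_table)   # == 'some plaintext'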
Questions about this function:
Is this a good/strong reversible encryption?
In the function, 1024 is a big number; do we need to loop that many times to get a table that is strong enough?
Thanks in advance.
If you want strong encryption without a third-party library, you're out of luck - the Python standard library only has hash functions. If you want secure encryption you'll have to either implement something like AES yourself (this is not a good idea, as it's really easy for the inexperienced to mess up when implementing an encryption algorithm) or change your requirements and use PyCrypto.
An xor cipher would work nicely (if you bitwise-XOR each character of the message with its counterpart in the key, you can get back to the message again by XORing the ciphertext with the key again).
XOR Cipher
EDIT: Exactly how you acquire the key will determine the security of this cipher, but it's a fast, easily reversible cipher.
EDIT2: Specifically, see these lines from the Wiki on how you might make this a secure cipher system...
"If the key is random and is at least as long as the message, the XOR cipher is much more secure than when there is key repetition within a message.[3] When the keystream is generated by a pseudo-random number generator, the result is a stream cipher. With a key that is truly random, the result is a one-time pad, which is unbreakable even in theory."
You could make your own encryption program by using an offset factor.
To encrypt: convert each letter into a number using ord(),
add a randomly generated offset key to the number,
then convert back into a letter using chr().
And to decrypt:
convert each character into a number using ord(),
subtract the offset key,
then convert back into a letter using chr().
You now have the original message. A sketch of this scheme is shown below.
Hope it helps.
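A sketch of that offset scheme (wrapping modulo 256 so the shifted values stay in byte range, a detail the steps above don't specify):
import random

def make_key():
    return random.randint(1, 255)        # randomly generated offset key

def encrypt(text, key):
    return ''.join(chr((ord(c) + key) % 256) for c in text)

def decrypt(text, key):
    return ''.join(chr((ord(c) - key) % 256) for c in text)

key = make_key()
scrambled = encrypt('hello world', key)
assert decrypt(scrambled, key) == 'hello world'
Note that this is a Caesar-style shift and is trivially breakable; it obfuscates rather than encrypts.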

Hashing an IP address to a number in [0, H)

I'm using Python-2.6. I have very little knowledge of hash functions.
I want to use a CRC hash function to hash an IP address like '128.0.0.5' into the range [0, H). Currently I'm thinking of doing
zlib.crc32('128.0.0.5')%H.
Is this okay? Here are a few questions you could try to answer...
Does it make any difference if I hash '128.0.0.5', its binary representation ('0001110101010...' or whatever that is), or the digits without the '.'s?
zlib.crc32 returns a signed integer. Does modding (%) a negative number with a positive H always give a positive number?
Does %-ing by H affect how good the hash function is? (I mean, is that the best I could do for the available space with the available zlib.crc32?)
Thanks!
Why do you want to hash an IP address into a number? They already have a native integer representation. For example, using netaddr:
>>> import netaddr
>>> ip = netaddr.IPAddress('192.168.1.1')
>>> ip.value
3232235777
>>> netaddr.IPAddress(3232235777)
IPAddress('192.168.1.1')
Does it make any difference if I hash '128.0.0.5', its binary representation, or the digits without the '.'s?
Not really.
zlib.crc32 returns a signed integer. Does modding (%) a neg. with a positive H always give a pos no?
Yes.
Does %-ing by H affect how good the hash function is? (I mean, is that the best I could do for the available space with the available zlib.crc32?)
You'd better use all the bits of the checksum to make up for its lack of an "avalanche effect". Single-digit variations such as 192.168.1.1, 192.168.1.2, etc. might produce differences only in the first bits of the checksum, and since % H (particularly when H is a power of two) is mostly determined by the last bits, hashes will collide.
Regarding 1) It will yield different results, but does not affect the quality of the hash.
Regarding 2) It will always yield a positive number or zero.
Regarding 3) As you limit the number of possible buckets, it does affect the quality of the hash.
In general: about how large is your H? Remember that an IPv4 address is nothing more than a 32-bit value; 192.168.0.1 is just a more human-readable byte-wise representation. So if your H is larger than 4294967295, there is no need for hashing at all.
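If you'd rather not depend on netaddr, the standard library gives the same 32-bit value (a small sketch):
import socket
import struct

ip_int = struct.unpack('!I', socket.inet_aton('192.168.0.1'))[0]
print(ip_int)         # 3232235521
print(ip_int % 1000)  # reduce into [0, H) only if H really is smaller than 2**32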
