Attempted regex of string that contains multiple numbers as different base counts - python

I'm currently building an encryption module in Python 3.8 and have run into a snag with a desired feature/upgrade. I'm looking for assistance in finding a solution that would be more helpful than writing a 'string crawler' to parse out an encrypted string of data.
In my first 'official' release everything works fine, but that is largely because it is much easier to split a string based off of easily identifiable prefixes, for example '0x' for a hexadecimal number or '0o' for an octal one.
The current definitions for what a number can look like use the aforementioned prefixes, along with support for base counts of 2-10 written as '{n}bXX'.
What I currently have implemented works just fine for the present design, but I'm having trouble coming up with something that can handle higher bases (at least up to 64) without being bulky or slow. The redesign is also struggling to parse out a string that contains multiple base counts once they have been assigned to their corresponding characters.
TL;DR - If I have an encoded string like so: "0x9a0o25179b83629b86740xc01d64b9HM-70o5521"
I would like it to be split as: [0x9a, 0o2517, 9b8362, 9b8674, 0xc01d, 64b9HM-7, 0o5521]
and need help finding a better solution than: r'(?:0x)|(?:9b)|...'
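One direction that avoids hand-writing every alternative (a hedged sketch, not a drop-in answer): build the pattern programmatically from the list of supported bases, give each base the digit class that is actually legal for it, match the digits lazily, and require that every token be followed by the start of another token or the end of the string. The assumptions here are that the only prefixes are 0x, 0o, 2b-10b and 64b, that hex digits are lowercase, and that the base-64 alphabet is 0-9, A-Z, a-z, '-' and '_'; adjust these to the module's real definitions. Note that once arbitrary counts like 29b are allowed, substrings such as '...836' followed by '29b...' become genuinely ambiguous and no regex can resolve them on its own.

import re

B64_DIGITS = r"0-9A-Za-z\-_"   # assumed alphabet for the custom base 64

def digit_class(base):
    # Legal digit range for bases 2-10; only base 64 is handled beyond that here.
    return f"0-{base - 1}" if base <= 10 else B64_DIGITS

BASES = list(range(2, 11)) + [64]

def token_alternatives():
    # Lazy quantifiers (+?) so the trailing lookahead decides where each number ends.
    alts = ["0x[0-9a-f]+?", "0o[0-7]+?"]
    alts += [f"{n}b[{digit_class(n)}]+?" for n in BASES]
    return "|".join(alts)

# A token is any alternative, provided another token (or the end of the string) starts right after it.
TOKEN_RE = re.compile(f"(?:{token_alternatives()})(?=(?:{token_alternatives()})|$)")

encoded = "0x9a0o25179b83629b86740xc01d64b9HM-70o5521"
print(TOKEN_RE.findall(encoded))
# ['0x9a', '0o2517', '9b8362', '9b8674', '0xc01d', '64b9HM-7', '0o5521']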

Related

Why does the output of hashes look like \x88H\x98\xda(\x04qQ vs �*�r?

I'm new to encryption, and programming in general. I'm just trying to get my head wrapped around some basic concepts.
I'm using Python with Crypto.Hash.SHA256:
from Crypto.Hash import SHA256
In the REPL, if I type
print SHA256.new('password').digest()
I get
j���*�rBo��)'s`=
vs
SHA256.new('password').digest()
which gives
"^\x88H\x98\xda(\x04qQ\xd0\xe5o\x8d\xc6)'s`=\rj\xab\xbd\xd6*\x11\xefr\x1d\x15B\xd8"
What are these two outputs?
How are they supposed to be interpreted?
In the first case, you are using print, so Python is trying to convert the bytes to printable characters. Unfortunately, not every byte is printable, so you get some strange output.
In the second case, since you are not calling print, the Python interpreter does something different. It takes the return value, which is a string in this case, and shows the internal representation of the string. Which is why for some characters, you get something that is printable, but in other cases, you get an escaped sequence, like \x88.
The two outputs happen to just be two representations of the same digest.
FYI, when working with pycrypto and looking at hash function outputs, I highly recommend using hexdigest instead of digest.
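For example (Python 2 with pycrypto, as in the question; the hex string is just the same 32 digest bytes written out as hexadecimal, matching the repr above):

>>> from Crypto.Hash import SHA256
>>> SHA256.new('password').hexdigest()
'5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8'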

Testing equality for visually identical characters with different UTF-8 encodings. (Japanese)

In my project I need to deal with many different languages, one of them is Japanese. (I don't speak it myself).
I need to compare two strings to see if they are equal. One string comes from a filename on my computer, the other string is from the download link of that exact file. These 2 strings should be the same.
Turns out the same characters can be encoded in different ways or something like that.
Look at the character バ, it can be encoded in two ways.
\xe3\x83\x90
\xe3\x83\x8f\xe3\x82\x99
Number 2 is actually a ハ and a ゙ together, which results in the same character. Because of this, some strings will be considered different even though they should be equal. Python is telling me that
ネバーランド
is not the same as
ネバーランド
Things I have tried:
Instead of checking for exact equality I tried using similarity:
difflib.SequenceMatcher(None, string1, string2).ratio()
This unfortunately is not precise enough. I've also tried configuring the junk argument, but I can't get it to be precise enough.
Messing around with the decode and encode functions, hoping the problem would just magically disappear.
I've seen somewhat similar questions asked, but no good solutions, and I'm afraid there is none unless I filter out these special cases manually.
Normalize both strings into the combined (composed) form. But operate on unicode strings, of course, not UTF-8 bytes.
>>> u'ネバーランド' == u'ネバーランド'
False
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'ネバーランド') == u'ネバーランド'
True
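If the comparison happens in more than one place, one small sketch (the helper name is mine) is to normalize both sides before comparing, so it doesn't matter which form either string arrived in:

import unicodedata

def same_text(a, b):
    # NFC composes decomposed sequences (e.g. ハ + combining ゙ -> バ),
    # so composed and decomposed spellings of the same name compare equal.
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)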

Fast hash for strings

I have a set of ASCII strings, let's say they are file paths. They could be both short and quite long.
I'm looking for an algorithm that can calculate a hash of such strings, where the hash is also a string but has a fixed length, like YouTube video ids:
https://www.youtube.com/watch?v=-F-3E8pyjFo
                                ^^^^^^^^^^^
MD5 seems to be what I need, but it is critical for me to have short hash strings.
Is there a shell command or python library which can do that?
Note: as of Python 3.3+, hash() is randomized per process, so this method does not produce stable results across runs (see the caveats below):
Python has a built-in hash() function that's very fast and perfect for most uses:
>>> hash("dfds")
3591916071403198536
You can then make it unsigned:
>>> import ctypes
>>> hashu=lambda word: ctypes.c_uint64(hash(word)).value
You can then turn it into a 16-character hex string:
>>> hashu("dfds").to_bytes(8,"big").hex()
Or an N*2-character hex string, where N is <= 8:
>>> hashn=lambda word, N : (hashu(word)%(2**(N*8))).to_bytes(N,"big").hex()
..etc. And if you want N to be larger than 8 bytes, you can just hash twice. Python's built-in is so vastly faster, it's never worth using hashlib for anything unless you need security... not just collision resistance.
>>> hashnbig=lambda word, N : ((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex()
And finally, use the urlsafe base64 encoding to make a much better string than "hex" gives you:
>>> from base64 import urlsafe_b64encode
>>> hashnbigu=lambda word, N : urlsafe_b64encode(((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo",16)
'ZblnvrRqHwAy2lnvrR4HrA'
Caveats:
Be warned that in Python 3.3 and up, this function is
randomized and won't work for some use cases. You can disable this with PYTHONHASHSEED=0
See https://github.com/flier/pyfasthash for fast, stable hashes that similarly won't overload your CPU for non-cryptographic applications.
Don't use this lambda style in real code... write it out! And
stuffing things like 2**32 in your code, instead of making them
constants is bad form.
In the end, 8 bytes of collision resistance is OK for smaller applications... with less than a million entries, you've got collision odds of < 0.0000001%. That's a 12-character b64-encoded string. But it might not be enough for larger apps.
16 bytes is enough for a UUID/OID in a cache, etc.
Speed comparison for producing 300k 16-byte hashes from a bytes input.
builtin: 0.188
md5: 0.359
fnvhash_c: 0.113
For a complex input (a tuple of 3 integers, for example), you have to convert to bytes to use the non-builtin hashes; this adds a lot of conversion overhead, making the builtin shine.
builtin: 0.197
md5: 0.603
fnvhash_c: 0.284
I guess this question is off-topic because it is opinion-based, but at least one hint for you: I know the FNV hash because it is used by The Sims 3 to find resources based on their names between the different content packages. They use the 64-bit version, so I guess it is enough to avoid collisions in a relatively large set of reference strings. The hash is easy to implement, if no module satisfies you (pyfasthash has an implementation of it, for example).
To get a short string out of it, I would suggest you use base64 encoding. For example, this is the size of a base64-encoded 64-bit hash: nsTYVQUag88= (and you can get rid of the padding =).
Edit: I finally had the same problem as you, so I implemented the above idea: https://gist.github.com/Cilyan/9424144
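If you do roll it yourself, here is a minimal sketch of that idea: 64-bit FNV-1a over the UTF-8 bytes, then urlsafe base64 of the 8 hash bytes. The two constants are the standard FNV-1a parameters; the helper names and the example path are illustrative.

import base64
import struct

FNV64_OFFSET = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3

def fnv1a_64(data):
    # 64-bit FNV-1a: xor each byte in, then multiply by the prime (mod 2**64).
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def short_hash(path):
    # 8 hash bytes -> 12 base64 characters (11 once the '=' padding is stripped).
    digest = struct.pack(">Q", fnv1a_64(path.encode("utf-8")))
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

print(short_hash("/usr/local/bin/python3"))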
Another option: hashids is designed to solve exactly this problem and has been ported to many languages, including Python. It's not really a hash in the sense of MD5 or SHA1, which are one-way; hashids "hashes" are reversible.
You are responsible for seeding the library with a secret value and selecting a minimum hash length.
Once that is done, the library can do two-way mapping between integers (single integers, like a simple primary key, or lists of integers, to support things like composite keys and sharding) and strings of the configured length (or slightly more). The alphabet used for generating "hashes" is fully configurable.
I have provided more details in this other answer.
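Roughly, the usage looks like this (a sketch; the salt and length are placeholders, so check the hashids documentation for the exact current API):

from hashids import Hashids

hashids = Hashids(salt="my secret value", min_length=11)
token = hashids.encode(12345)      # a short, YouTube-id-like string
original = hashids.decode(token)   # back to (12345,)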
You could use the sum program (assuming you're on Linux), but keep in mind that the shorter the hash, the more collisions you might have. You can always truncate MD5/SHA hashes as well.
EDIT: Here's a list of hash functions: List of hash functions
Something to keep in mind is that hash codes are one-way functions - you cannot use them for "video ids" as you cannot go back from the hash to the original path. Quite apart from anything else, hash collisions are quite likely, and you would end up with one hash pointing to two different videos instead of each video having its own id.
To create an Id like the youtube one the easiest way is to create a unique id however you normally do that (for example an auto key column in a database) and then map that to a unique string in a reversible way.
For example you could take an integer id and map it to 0-9a-z in base 36...or even 0-9a-zA-Z in base 62, padding the generated string out to the desired length if the id on its own does not give enough characters.
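A quick sketch of that base-62 mapping (the alphabet order, the padding character, and the function name are arbitrary choices here):

import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 characters

def encode_id(n, min_length=8):
    # Repeatedly divide by 62, mapping remainders to characters (least significant first),
    # then left-pad with the zero digit so short ids still reach the desired length.
    chars = []
    while True:
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
        if n == 0:
            break
    return "".join(reversed(chars)).rjust(min_length, ALPHABET[0])

print(encode_id(123456789))  # reversible, unlike a real hash: sum the positional values to get the id back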

Automatic conversion of the advanced string formatter from the old style

Is there any automatic way to convert a piece of code from python's old style string formatting (using %) to the new style (using .format)? For example, consider the formatting of a PDB atom specification:
spec = "%-6s%5d %4s%1s%3s %1s%4d%1s %8.3f%8.3f%8.3f%6.2f%6.2f %2s%2s"
I've been converting some of these specifications by hand as needed, but this is both error prone, and time-consuming as I have many such specifications.
Use pyupgrade
pyupgrade --py3-plus <filename>
You can convert to f-strings (formatted string literals) instead of .format() with
pyupgrade --py36-plus <filename>
You can install it with
pip install pyupgrade
The functionality of the two forms does not match up exactly, so there is no way you could automatically translate every % string into an equivalent {} string or (especially) vice-versa.
Of course there is a lot of overlap, and many of the sub-parts of the two formatting languages are the same or very similar, so someone could write a partial converter (which could, e.g., raise an exception for non-convertible code).
For a small subset of the language like what you seem to be using, you could do it pretty trivially with a simple regex—every pattern starts with % and ends with one of [sdf], and something like {:\1\2} as a replacement pattern ought to be all you need.
But why bother? Except as an exercise in writing parsers, what would be the benefit? The % operator is not deprecated, and using % with an existing % format string will obviously do at least as well as using format with a % format string converted to {}.
If you are looking at this as an exercise in writing parsers, I believe there's an incomplete example buried inside pyparsing.
Some differences that are hard to translate, off the top of my head:
* for dynamic field width or precision; format has a similar feature, but does it differently.
%(10)s, because format tries to interpret the key name as a number first, then falls back to a dict key.
%(a[b])s, because format doesn't quote or otherwise separate the key from the rest of the field, so a variety of characters simply can't be used.
%c takes integers or single-char strings; :c only integers.
%r/%s/%a analogues are not part of the format string, but a separate part of the field (which also comes on the opposite side).
%g and :g have slightly different cutoff rules.
%a and !a don't do the exact same thing.
The actual differences aren't listed anywhere; you will have to dig them out by a thorough reading of the Format Specification Mini-Language vs. the printf-style String Formatting language.
The docs explain some of the differences. As far as I can tell -- although I'm not very familiar with old-style format strings -- the functionality of the new style is a superset of the functionality of the old style.
You'd have to do more tweaking to handle edge cases, but I think something simple like
re.sub(r'%(\w+)([sbcdoXnf...])', r'{:\1\2}', your_string)
would get you 90% of the way there. The remaining translation -- going from things like %x to {0:x} -- will be too complex for a regular expression to handle (without writing some ridiculously complex conditionals inside of your regex).
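For the narrow subset used in the PDB spec above (only the s/d/f conversions, an optional '-' flag, a width and a precision), a sketch of that regex-based approach might look like the following; it is not a general %-to-{} translator, for the reasons listed earlier:

import re

PERCENT_RE = re.compile(r'%(-?)(\d*)(?:\.(\d+))?([sdf])')

def convert(spec):
    def repl(m):
        flag, width, prec, conv = m.groups()
        # printf right-aligns by default, while str.format left-aligns strings,
        # so spell out the alignment explicitly for 's' fields.
        align = '<' if flag == '-' else ('>' if conv == 's' else '')
        field = align + width + ('.' + prec if prec is not None else '') + conv
        return '{:' + field + '}'
    return PERCENT_RE.sub(repl, spec)

spec = "%-6s%5d %4s%1s%3s %1s%4d%1s %8.3f%8.3f%8.3f%6.2f%6.2f %2s%2s"
print(convert(spec))
# {:<6s}{:5d} {:>4s}{:>1s}{:>3s} {:>1s}{:4d}{:>1s} {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f} {:>2s}{:>2s}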

Mapping Unicode to ASCII in Python

I receive strings after querying via urlopen in JSON format:
def get_clean_text(text):
    return text.translate(maketrans("!?,.;():", " ")).lower().strip()

for track in json["tracks"]:
    print track["name"].lower()
    get_clean_text(track["name"].lower())
For the string "türlich, türlich (sicher, dicker)" I then get
File "main.py", line 23, in get_clean_text
return text.translate(maketrans("!?,.;():", " ")).lower().strip()
TypeError: character mapping must return integer, None or unicode
I want to format the string to be "türlich türlich sicher dicker".
The question is not a complete self-contained example; I can't be sure whether it's Python 2 or 3, where maketrans came from, etc. There's a good chance I will guess wrong, which is why you should be sure to tag your questions appropriately and provide a short, self contained, correct example. (That, and the fact that various other people—some of them probably smarter than me—likely ignored your question because it was ambiguous.)
Assuming you're using 2.x, and you've done a from string import * to get maketrans, and json["name"] is unicode rather than str/bytes, here's your problem:
There are two kinds of translation tables: old-style 8-bit tables (which are just an array of 256 characters) and new-style sparse tables (which are just a dict mapping one character's ordinal to another). The str.translate function can use either, but unicode.translate can only use the second (for reasons that should be obvious if you think about it for a bit).
The string.maketrans function makes old-style 8-bit translation tables. So you can't use it with unicode.translate.
You can always write your own "makeunitrans" function as a drop-in replacement, something like this:
def makeunitrans(frm, to):
    return {ord(f):ord(t) for (f,t) in zip(frm, to)}
But if you just want to map out certain characters, you could do something a bit more special purpose:
def makeunitrans(frm):
    return {ord(f):ord(' ') for f in frm}
However, from your final comment, I'm not sure translate is even what you want:
I want to format the string to be "türlich türlich sicher dicker"
If you get this right, you're going to format the string to be "türlich türlich sicher dicker ", because you're mapping all those punctuation characters to spaces, not nothing.
With new-style translation tables you can map anything you want to None, which solves that problem. But you might want to step back and ask why you're using the translate method in the first place instead of, e.g., calling replace multiple times (people usually say "for performance", but you wouldn't be building the translation table in-line every time through if that were an issue) or using a trivial regular expression.
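For example, a sketch of that new-style table (Python 3 str shown; a Python 2 unicode string behaves the same way), mapping each punctuation character to None so it is dropped rather than replaced with a space:

# None as the target means "delete this character"
table = {ord(c): None for c in "!?,.;():"}
print("türlich, türlich (sicher, dicker)".translate(table).lower().strip())
# türlich türlich sicher dicker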
