Time complexity of Python string index access?

If I'm not mistaken, a Python string is stored as a sequence of Unicode scalars. However, Unicode scalars can combine to form larger grapheme clusters, so computing the memory offset start + scalarSize * n for string[n] wouldn't give you the nth visible character.
Does this mean that Python iterates linearly through each scalar to get to the scalar you are looking for? If you have
word = 'caf' + chr(0x65) + chr(0x301)  # café ('e' followed by a combining acute accent)
Does Python store this as five scalars and iteratively check if any should be combined before moving on or does it run a check upon insertion and store 'pure' scalars?
Edit: I was confusing Python with another language. Python's print() renders grapheme clusters, but Python's str stores scalars exactly as you input them. So two combining scalars will print as one grapheme cluster, which could look identical to a single precomposed scalar. When you call string[0] you get the scalar as it was inserted into the string.

Python string indexing does not consider grapheme clusters. It works by Unicode code points. I don't think Python actually has anything built-in for working with grapheme clusters.
String indexing takes constant time, but if you want to retrieve the nth grapheme cluster, string indexing won't do that for you.
(People sometimes suggest applying canonical composition to the string, but there are plenty of possible grapheme clusters that still take multiple code points after canonical composition.)
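As a quick sketch of what that means in practice, here is the decomposed café from the question in an interactive session (standard CPython; whether the accent renders combined depends on your terminal):
>>> word = 'caf' + chr(0x65) + chr(0x301)   # 'e' followed by a combining acute accent
>>> len(word)                               # five code points, even though it displays as four characters
5
>>> word[3]                                 # constant-time indexing by code point: a plain 'e', no accent
'e'
>>> word[4] == '\u0301'                     # the accent is its own code point
True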

Related

Attempted regex of string that contains multiple numbers as different base counts

I'm currently building an encryption module in Python 3.8 and have run into a snag with a desired feature/upgrade. I'm looking for a solution more practical than writing a 'string crawler' to parse out an encrypted string of data.
In my first 'official' release everything works fine, but that is largely because it is easy to split a string based on easily identifiable prefixes, for example '0x' for hexadecimal or '0o' for octal.
The current definitions of what a number can look like use the aforementioned base prefixes, along with support for bases 2-10 written as '{n}bXX'.
What I currently have implemented works just fine for the present design, but I'm having trouble coming up with something that can handle higher bases (at least up to 64) without being bulky or slow. The redesign is also having trouble parsing a string that contains numbers in multiple bases once they have been assigned to their corresponding characters.
TL;DR - If I have an encoded string like so: "0x9a0o25179b83629b86740xc01d64b9HM-70o5521"
I would like it to be split as: [0x9a, 0o2517, 9b8362, 9b8674, 0xc0ld, 64b9HM-7, 0o5521]
and need help finding a better solution than: r'(?:0x)|(?:9b)|...'

Is there some way to AND two strings in python?

I have several very large files which represent each position in the human genome. These files are binary masks for a certain type of "score" at each position, and I am interested in producing a new mask where both scores are "1", i.e. the intersection of the two masks.
For example:
File 1: 00100010101
File 2: 11111110001
Desired output: 00100010001
In Python, it is really fast to read these big files (they contain between 50 and 250 million characters) into strings. However, I can't just & the strings together. I can do something like
bin(int('0001',2) & int('1111', 2))
but is there a more direct way that doesn't require that I pad in the extra 0's and convert back to a string in the end?
I think the conversion to builtin integer types for the binary-and operation is likely to make it much faster than working character by character (because Python's int is written in C rather than Python). I'd suggest working on each line of your input files, rather than the whole multi-million-character strings at once. The binary-and operation doesn't require any carrying, so there's no issue working with each line separately.
To avoid awkward string operations to pad the result out to the right length, you can use the str.format method to convert your integer to a binary string of the right length in one go. Here's an implementation that writes the output to a new file:
# Python 3 version: zip pairs up lines lazily, and int() accepts arbitrarily long binary strings
with open(filename1) as in1, open(filename2) as in2, open(filename3, "w") as out:
    for line1, line2 in zip(in1, in2):
        out.write("{0:0{1}b}\n".format(int(line1, 2) & int(line2, 2), len(line1) - 1))
I'm using one of the neat features of the string formatting mini-language: a second argument passes the desired length for the converted number. If you can rely upon the lines always having exactly 50 binary digits (including the last line of each file), you could hard-code the length with {:050b} rather than computing it from the length of an input line.
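As a quick sanity check, here is the same format trick applied to the two example lines from the question (toy values; runs as-is in a script or REPL):
line1 = "00100010101\n"
line2 = "11111110001\n"
width = len(line1) - 1                    # 11 digits, excluding the trailing newline
result = int(line1, 2) & int(line2, 2)    # int() tolerates the trailing newline
print("{0:0{1}b}".format(result, width))  # -> 00100010001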

Alternative ways for binary conversion in python

I often need to convert status codes to their bit representation in order to determine which errors/statuses are active on analyzers that use a plain-text or binary communication protocol.
I use Python to poll and parse the data. Sometimes I get confused because there are so many ways to solve a problem. Today I had to convert a string, where each character is a hexadecimal digit, into its binary representation. That is, each hexadecimal character must be converted into 4 bits, MSB first. Note: I need a char-by-char conversion, with leading zeros.
I managed to build the following function, which does the trick in a quasi one-liner fashion.
import math

def convertStatus(s, base=16):
    n = int(math.log2(base))  # bits per digit: 4 for base 16
    b = "".join(["{{:0>{}b}}".format(n).format(int(x, base)) for x in s])
    return b
E.g., this converts the following input:
0123456789abcdef
into:
0000000100100011010001010110011110001001101010111100110111101111
Which was my goal.
Now, I am wondering what other elegant solutions I could have used to reach my goal. I would also like to better understand the advantages and drawbacks of each solution. The function signature can be changed, but usually it takes a string as input and returns a string. Let's get imaginative...
This is simple in two steps:
Converting a string to an int is almost trivial: use int(aString, base=...)
the first parameter can be a string!
and with base, almost every base is possible
Converting a number to a string is easy with format() and the format mini-language
So converting hex-strings to binary can be done as
def h2b(x):
    val = int(x, base=16)
    return format(val, 'b')
Here the two steps are explicit. Possibly it's better to do it in one line, or even inline.
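Note that h2b converts the whole string at once and drops leading zeros. If you need the char-by-char conversion with leading zeros asked for in the question, a one-line variant could look like this (the function name is just illustrative):
def h2b_per_char(s):
    # format each hex digit as a 4-bit, zero-padded binary field
    return "".join(format(int(c, 16), '04b') for c in s)

print(h2b_per_char("0123456789abcdef"))
# -> 0000000100100011010001010110011110001001101010111100110111101111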

Fast hash for strings

I have a set of ASCII strings, let's say they are file paths. They could be short or quite long.
I'm looking for an algorithm that can calculate a hash of such strings, where the hash is also a string but has a fixed length, like YouTube video ids:
https://www.youtube.com/watch?v=-F-3E8pyjFo
^^^^^^^^^^^
MD5 seems to be what I need, but it is critical for me to have short hash strings.
Is there a shell command or python library which can do that?
As of Python 3, this method does not work reliably across runs (see the hash-randomization caveat below):
Python has a built-in hash() function that's very fast and perfect for most uses:
>>> hash("dfds")
3591916071403198536
You can then make it unsigned:
>>> import ctypes
>>> hashu = lambda word: ctypes.c_uint64(hash(word)).value
You can then turn it into a 16-character hex string (8 bytes):
>>> hashu("dfds").to_bytes(8,"big").hex()
Or a 2*N-character hex string for N bytes, where N <= 8:
>>> hashn=lambda word, N : (hashu(word)%(2**(N*8))).to_bytes(N,"big").hex()
...etc. And if you want N to be larger than 8 bytes, you can just hash twice. Python's built-in hash is so much faster that it's never worth using hashlib unless you need security, not just collision resistance.
>>> hashnbig=lambda word, N : ((hashu(word)+2**64*hashu(word+"2"))%(2**(N*8))).to_bytes(N,"big").hex()
And finally, use the URL-safe base64 encoding to make a much more compact string than hex gives you:
>>> from base64 import urlsafe_b64encode
>>> hashnbigu = lambda word, N: urlsafe_b64encode(((hashu(word) + 2**64 * hashu(word + "2")) % (2**(N*8))).to_bytes(N, "big")).decode("utf8").rstrip("=")
>>> hashnbigu("foo",16)
'ZblnvrRqHwAy2lnvrR4HrA'
Caveats:
Be warned that in Python 3.3 and up, this function is
randomized and won't work for some use cases. You can disable this with PYTHONHASHSEED=0
See https://github.com/flier/pyfasthash for fast, stable hashes that similarly won't overload your CPU for non-cryptographic applications.
Don't use this lambda style in real code... write it out! And stuffing things like 2**32 into your code instead of making them constants is bad form.
In the end, 8 bytes of collision resistance is OK for smaller applications... with less than a million entries, you've got collision odds of < 0.0000001%. That's a 12-character base64-encoded string. But it might not be enough for larger apps.
16 bytes is enough for a UUID/OID in a cache, etc.
Speed comparison for producing 300k 16-byte hashes from bytes input:
builtin: 0.188
md5: 0.359
fnvhash_c: 0.113
For a complex input (a tuple of 3 integers, for example), you have to convert to bytes to use the non-builtin hashes, which adds a lot of conversion overhead and makes the built-in shine.
builtin: 0.197
md5: 0.603
fnvhash_c: 0.284
I guess this question is off-topic because it is opinion-based, but at least one hint for you: I know the FNV hash because it is used by The Sims 3 to find resources based on their names across the different content packages. They use the 64-bit version, so I guess it is enough to avoid collisions in a relatively large set of reference strings. The hash is easy to implement if no module satisfies you (pyfasthash has an implementation of it, for example).
To get a short string out of it, I would suggest you use base64 encoding. For example, this is the size of a base64-encoded 64-bit hash: nsTYVQUag88= (and you can get rid of the padding =).
Edit: I eventually ran into the same problem as you, so I implemented the above idea: https://gist.github.com/Cilyan/9424144
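A minimal sketch of the idea, using hashlib.blake2b as the stable 64-bit hash (the answer itself points at FNV via pyfasthash; the function name and path here are illustrative):
import hashlib
from base64 import urlsafe_b64encode

def short_hash(path, nbytes=8):
    # an 8-byte digest becomes an 11-character URL-safe string once the '=' padding is stripped
    digest = hashlib.blake2b(path.encode("utf-8"), digest_size=nbytes).digest()
    return urlsafe_b64encode(digest).decode("ascii").rstrip("=")

print(short_hash("/usr/local/bin/python3"))   # always 11 characters for nbytes=8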
Another option: hashids is designed to solve exactly this problem and has been ported to many languages, including Python. It's not really a hash in the sense of MD5 or SHA1, which are one-way; hashids "hashes" are reversible.
You are responsible for seeding the library with a secret value and selecting a minimum hash length.
Once that is done, the library can do two-way mapping between integers (single integers, like a simple primary key, or lists of integers, to support things like composite keys and sharding) and strings of the configured length (or slightly more). The alphabet used for generating "hashes" is fully configurable.
I have provided more details in this other answer.
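A hedged usage sketch, assuming the hashids package (pip install hashids); the salt, minimum length, and id value are illustrative:
from hashids import Hashids

hashids = Hashids(salt="my secret salt", min_length=11)
hid = hashids.encode(12345)        # a short, URL-safe string at least 11 characters long
(original,) = hashids.decode(hid)  # decodes back to (12345,)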
You could use the sum program (assuming you're on Linux), but keep in mind that the shorter the hash, the more collisions you might have. You can always truncate MD5/SHA hashes as well.
EDIT: Here's a list of hash functions: List of hash functions
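For instance, truncating a hex digest (the path and the length of 12 are arbitrary choices):
import hashlib
print(hashlib.md5("/some/file/path".encode()).hexdigest()[:12])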
Something to keep in mind is that hash codes are one-way functions - you cannot use them for "video ids" because you cannot go back from the hash to the original path. Quite apart from anything else, hash collisions are quite likely, and you could end up with two different videos mapping to the same id.
To create an id like the YouTube one, the easiest way is to create a unique id however you normally do that (for example, an auto-increment key column in a database) and then map it to a unique string in a reversible way.
For example you could take an integer id and map it to 0-9a-z in base 36...or even 0-9a-zA-Z in base 62, padding the generated string out to the desired length if the id on its own does not give enough characters.
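A minimal sketch of that mapping (the alphabet order, function name, and padding length are illustrative choices):
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n, min_length=6):
    # repeatedly divide by 62, collecting digits least-significant first
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    s = "".join(reversed(digits)) or "0"
    return s.rjust(min_length, "0")

print(to_base62(125))   # '000021'
Decoding is just the reverse multiplication, so the id-to-string scheme stays fully two-way.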

Limiting Numeric Digits in Python

I want to put numerics and strings into the same NumPy array. However, I very rarely (difficult to replicate, but sometimes) run into an error where the numeric-to-string conversion results in a value that cannot be translated back into a decimal (i.e., I get "9.8267567e" instead of "9.8267567e-5" in the array). This is causing problems after writing files. Here is an example of what I am doing (though on a much smaller scale):
import numpy as np
x = np.array(.94749128494582)
y = np.array(x, dtype='|S100')
My understanding is that this should allow 100 string characters, but sometimes I am seeing a cut-off after ~10. Is there another type that I should be assigning, or a way to limit the number of characters in my array (x)?
First of all, x = np.array(.94749128494582) may not be doing what you think because the argument passed into np.array should be some kind of sequence or something with the array interface. Perhaps you meant x = np.array([.94749128494582])?
Now, as for preserving the strings properly, you could solve this by using
y = np.array(x, dtype=object)
However, as Joe has mentioned in his comment, it's not very numpythonic and you may as well be using plain old python lists.
I would recommend examining carefully why you need to hold strings and numbers in the same array; it smells to me like you might have inappropriate data structures set up and could benefit from redesigning/refactoring. NumPy arrays are for fast numerical operations; they are not really suited to string manipulation or to being used as some kind of storage/database.
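For reference, a quick comparison of the two dtypes discussed above (illustrative values; the exact output formatting may vary between NumPy versions):
import numpy as np

x = np.array([0.94749128494582, 9.8267567e-05])
as_bytes = np.array(x, dtype='|S100')   # each float is converted to a fixed-width byte string
as_objs = np.array(x, dtype=object)     # keeps the Python float objects themselves

print(as_bytes)   # byte strings such as b'9.8267567e-05'
print(as_objs)    # [0.94749128494582 9.8267567e-05]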
