Python: Time complexity of .isalpha()

So I can't find the official documentation on how the isalpha method (of the string type) is implemented, but I imagine the algorithm used would be:
1) Convert the character in question to an int.
2) Compare it to the larger alpha ASCII value (i.e. 90 or 122) to see if it is less than or equal to that value.
3) Compare it to the smaller alpha ASCII value (i.e. 65 or 97, depending on the upper bound used: if less than 90, use 65) to see if it is greater than or equal to that value.
Am I correct in this assessment of the isalpha method, or is it something different altogether? If so, does it have a complexity of O(3)?

Python handles text as Unicode. As such, which characters are alphabetic and which are not depends on the character's Unicode category, encompassing all characters defined in the Unicode version compiled along with Python. That is tens of thousands of characters and hundreds of scripts, each with its own alphabetic ranges. Although it all boils down to numeric ranges of codepoints that could be compared with other algorithms, it is almost certain that all characters are iterated over and each character's Unicode category is checked. If you want the complexity, it is then O(n).
(Actually, it would have been O(n) in your example as well, since all characters have to be checked. For a single character, Python uses a dict-like table to get from the character to its category information, and that is O(1).)
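For illustration, here is a minimal sketch (not CPython's actual implementation, which uses an internal C-level property table) of what such a check amounts to, using unicodedata.category; the helper name is made up:

import unicodedata

def isalpha_like(s):
    # Roughly what str.isalpha() has to do: inspect every character,
    # so the cost grows linearly with the length of the string.
    # unicodedata.category() is an O(1) lookup per character; letters
    # have a category starting with 'L' (Lu, Ll, Lt, Lm, Lo).
    if not s:
        return False
    return all(unicodedata.category(ch).startswith('L') for ch in s)

print(isalpha_like('héllo'))   # True  (accented letters count too)
print(isalpha_like('hello1'))  # False (digits are not alphabetic)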

Related

How would you hash a string in such a way that it can be un-hashed?

Suppose that you hash a string in python using a custom-made hash function named sash().
sash("hello world") returns something like 2769834847158000631.
What code (in python) would implement the sash() function and an unsash() function such that unsash(sash("hello world")) returns "hello world"?
If you like, assume that the string contains ASCII characters only.
There are 128 ASCII characters.
Thus, each python string is like a natural number written in base 128.
A hash is fixed in size, whereas a string is not. Therefore there will be more possible strings than hash values, making it impossible to reverse.
In your example, you have an 11-character string containing 77 bits. Your corresponding integer would fit in 64 bits (actually 62 bits, but I will take 64 bits as what you might have been imagining). If we consider only 11-character strings (obviously there are far more), we have 2^77 possible strings. Assuming a 64-bit hash, there are only 2^64 hash values. Each hash value would have, on average, 8192 strings that map to it. So given just the hash value, you would have no idea which of those 8192 strings to decode it to.
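Spelling out the averaging step as a quick check:

print(2**77 // 2**64)  # 8192 strings per 64-bit hash value, on average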
If you don't mind a hash of unbounded size, then sure, you can simply consider the string itself to be the hash. Then no decoding required. You can get a little fancier, since you are limiting the characters to 0..127, and pack seven bits for each character into a string of bytes, reducing the size by 1/8th. This is effectively the base-128 number you are referring to. You may be able to get it smaller with compression if your 0..127 characters do not have the same probability. Then on average, the string can be compressed, with some possible strings necessarily getting larger instead of smaller.
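As a hedged sketch of that last idea (not a hash at all, just the reversible base-128 integer described above; the function names are taken from the question):

def sash(s: str) -> int:
    # Treat the ASCII string as the digits of a base-128 number.
    n = 0
    for ch in s:
        n = n * 128 + ord(ch)
    return n

def unsash(n: int) -> str:
    # Peel the base-128 digits back off, least significant first.
    chars = []
    while n:
        n, digit = divmod(n, 128)
        chars.append(chr(digit))
    return ''.join(reversed(chars))

print(unsash(sash("hello world")))  # hello world

This relies on the ASCII-only assumption from the question; it would silently drop leading NUL characters, which printable text never contains.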

What does normalize in unicodedata mean?

I am very new to the encoding/decoding part and would like to know why... I have a dictionary, and I wonder why normalization needs to be applied in this case when the key is added. Does it have anything to do with the previous key and the new key? What happens if I don't normalize?
import csv
import unicodedata

my_dict = {'index': {}}
pre_keys = ['One sample', 'Two samples', 'Three samples']
new_keys = ['one_sample', 'two_Samples', 'three_samples']
with open('file.csv') as input_file:
    reader = csv.DictReader(input_file)
    for row in reader:
        for pre_key, new_key in zip(pre_keys, new_keys):
            # normalize the CSV value before storing it (default '' avoids None)
            my_dict['index'][new_key] = unicodedata.normalize(
                "NFKD", row.get(pre_key, ''))
Normalization is not about encoding and decoding, but about a "normal" (expected) form in which to represent a character.
The classic example is a character with an accent. Such characters often have two representations: one with the base character's codepoint followed by a combining codepoint describing the accent, and one with just a single codepoint (describing the character and accent together).
Additionally, sometimes you have two or more accents (or other marks: dots, marks below, etc.). In this case, you may want them in a specific order.
Unicode keeps adding new characters and codepoints. You may have some old typographic way to describe a letter (or a kanji). In some contexts (display) it is important to keep the distinction (English, too, once had two written forms of the letter s), but for reading or analysis one wants the semantic letter, so the normalized one.
And there are a few cases where you may end up with unnecessary characters (e.g. if you type on a "Unicode keyboard").
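A small sketch of those two representations, using the accented "é" as the classic example:

import unicodedata

composed = '\u00e9'      # 'é' as one codepoint: LATIN SMALL LETTER E WITH ACUTE
decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False: different codepoints
print(unicodedata.normalize('NFC', decomposed) == composed)  # True after composing
print(unicodedata.normalize('NFD', composed) == decomposed)  # True after decomposing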
So why do we need normalization?
- The simple case: we need to compare strings. Visually and semantically the same string can be represented in different forms, so we choose a normalization form so that strings can be compared reliably.
- Collation (sorting) algorithms work a lot better (fewer special cases) if there is just one form to handle; the same goes for changing case (lower case, upper case): it is better to have a single form to work with.
- Handling strings can be easier: if you need to remove accents, the easy way is to use a decomposed form and then remove the combining characters (see the sketch after this list).
- To encode into another character set, it is better to have a composed form (or both): if the target charset has the composed character, transcode it; otherwise there are many ways to handle it.
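Here is a minimal sketch of the accent-removal case from the list above (strip_accents is a made-up helper name):

import unicodedata

def strip_accents(s: str) -> str:
    # Decompose first, then drop the combining marks that carry the accents.
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('café naïve'))  # cafe naive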
So "normalize" means to transform the same string into an unique Unicode representation. The canonical transformation uses a strict definition of same; instead the compatibility normalization interpret the previous same into something like *it should have been the same, if we follow the Unicode philosophy, but practice we had to make some codepoints different to the preferred one*. So in compatibility normalization we could lose some semantics, and a pure/ideal Unicode string should never have a "compatibility" character.
In your case: the csv file could have been edited with different editors, and so with different conventions on how to represent accented characters. With normalization, you make sure that the same key ends up as the same entry in the dictionary.
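To make that concrete, a sketch (with made-up keys) of why normalizing before inserting matters:

import unicodedata

my_dict = {}
key_from_editor_a = 'caf\u00e9'    # composed 'é'
key_from_editor_b = 'cafe\u0301'   # decomposed 'é', visually identical

my_dict[unicodedata.normalize('NFKD', key_from_editor_a)] = 1
my_dict[unicodedata.normalize('NFKD', key_from_editor_b)] = 2

print(len(my_dict))  # 1: both spellings collapse to the same key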

Time complexity of python dictionary get() update() always O(1)?

If I use only strings with maximum length of 15 as keys for a dictionary in python, is it impossible to have any collisions?
Worst case seems to be O(N) for accessing or updating a value, with N being the number of keys in the dictionary. With the built-in string hash of Python, it's impossible to have the same hash for two different strings of at most 15 characters, so the worst case would be O(1), right?
Or do I understand something wrong?
Thanks in advance.
If I use only strings with maximum length of 15 as keys for a dictionary in python, is it impossible to have any collisions?
No. Collisions can happen. The result of the hash function is truncated according to the host system: the hash(..) of a string on a 32-bit system is a 32-bit integer, and on a 64-bit system it is usually a 64-bit number.
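You can check the hash width of your own interpreter directly; on a typical 64-bit build this prints 64:

import sys
print(sys.hash_info.width)  # number of bits in the result of hash()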
Now if we count the number of strings of at most 15 characters (and we will only assume printable ASCII characters here; if we consider all Unicode characters, it only gets worse), then that means we can generate:
Σ_{i=0}^{15} (128 - 32)^i
different strings, which is equal to 547'792'552'280'497'574'758'284'371'041, or approximately 5.47×10^29. The number of 64-bit numbers is 2^64 = 18'446'744'073'709'551'616 ≈ 1.84×10^19. So even if we only consider printable ASCII strings, we cannot map every string to a separate hash.
As a result, that means that hash collisions will happen if we keep filling the dictionary with new strings (eventually). Even if the dictionary creates one bucket per hash code, multiple strings will get in the same bucket, because the "hash space" is smaller than the "string space".
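A quick back-of-the-envelope check of that counting argument:

printable_ascii = 128 - 32                                 # 96, as used above
n_strings = sum(printable_ascii ** i for i in range(16))   # strings of length 0..15
n_hashes = 2 ** 64

print(n_strings)             # the ~5.47×10^29 figure quoted above
print(n_strings > n_hashes)  # True: by the pigeonhole principle, collisions must occur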
Worst case seems to be O(N) for accessing or updating a value, with N being the number of keys in the dictionary. With the built in string hash of python it's impossible to have same hashes on two different strings with maximum length of 15 and the worst case would be O(1), right?
It is O(1), but for another reason. Since the strings have at most 15 characters, the number of possible strings (and hence keys) is fixed. For example, the number of printable-ASCII keys is fixed to the number we derived above (5.47×10^29). Yes, we can use Unicode, and this scales up the number of keys dramatically, but it is still finite (approximately 5.06×10^90).
That means there is an upper bound on N, and therefore there is not really such a thing as asymptotic complexity here. Even if we managed to generate all these strings, and in the worst case they all mapped to the same hash code and thus ended up in the same bucket, lookups would still take constant time: the processor would have a very hard time iterating over that bucket, but the work is bounded by at most 5.06×10^90 iterations.

What is the difference between a85encode and b85encode?

Python 3.4 added the a85encode and b85encode functions (and their corresponding decoding functions).
What is the difference between the two? The documentation mentions "They differ by details such as the character map used for encoding.", but this seems unnecessarily vague.
a85encode uses the character mapping:
!"#$%&'()*+,-./0123456789:;<=>?#
ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstu
with z used as a special case to represent four zero bytes (instead of !!!!!).
b85encode uses the character mapping:
0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
!#$%&()*+-;<=>?@^_`{|}~
with no special abbreviations.
If you have a choice, I'd recommend you use a85encode. It's a bit easier (and more efficient) to implement in C, as its character mapping uses all characters in ASCII order, and it's slightly more efficient at storing data containing lots of zeroes, which isn't uncommon for uncompressed binary data.
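A quick comparison using the standard library (Python 3.4+):

import base64

zeros = b'\x00\x00\x00\x00'
print(base64.a85encode(zeros))  # b'z'      -- the four-zero-bytes special case
print(base64.b85encode(zeros))  # b'00000'  -- different alphabet, no shortcut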
Ascii85 is the predecessor of Base85; the primary difference between the two is in fact the character sets that are used.
Ascii85 uses the character set:
ASCII 33 ("!") to ASCII 117 ("u")
Base85 uses the character set:
0–9, A–Z, a–z, !#$%&()*+-;<=>?@^_`{|}~
These characters are specifically not included in Base85:
"',./:[]\\
a85encode and b85encode encode/decode Ascii85 and Base85 respectively.

LZ77 compression reserved bytes "< , >"

I'm learning about LZ77 compression, and I saw that when I find a repeated string of bytes, I can use a pointer of the form <distance, length>, and that the "<", ",", ">" bytes are reserved. So... how do I compress a file that contains these bytes, if I cannot compress these bytes but also cannot replace them with a different byte (because decoders wouldn't be able to read it)? Is there a way? Or do decoders only decode if there is an exact <d, l> string? (If so, imagine that, by coincidence, we find these bytes in a file. What would happen?)
Thanks!
LZ77 is about referencing strings earlier in the decompression buffer by their lengths and distances from the current position. But it is left to you how you encode these back-references; many implementations of LZ77 do it in different ways.
But you are right that there must be some way to distinguish "literals" (uncompressed pieces of data meant to be copied "as is" from the input to the output) from "back-references" (which are copied from already uncompressed portion).
One way to do it is to reserve some characters as "special" (so-called "escape sequences"). You can do it the way you described, that is, by using < to mark the start of a back-reference. But then you also need a way to output < when it is a literal. You can do that, for example, by establishing that when < is followed by another <, it means a literal, and you just output one <. Or you can establish that if < is immediately followed by >, with nothing in between, then it is not a back-reference, and you just output a single <.
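Purely as an illustration of that escaping convention (this is not any real LZ77 file format), a tiny decoder could look like this:

def decode(tokens: str) -> str:
    # '<>' stands for a literal '<'; '<distance,length>' is a back-reference.
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] != '<':
            out.append(tokens[i])            # ordinary literal
            i += 1
        elif tokens[i + 1] == '>':
            out.append('<')                  # escaped literal '<'
            i += 2
        else:
            j = tokens.index('>', i)
            distance, length = map(int, tokens[i + 1:j].split(','))
            for _ in range(length):          # copy from already-decoded output
                out.append(out[-distance])
            i = j + 1
    return ''.join(out)

print(decode('abcab<3,3>d'))  # abcabcabd
print(decode('a<>b'))         # a<b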
It also wouldn't be the most efficient way to encode those back-references, because it uses several bytes per back-reference, so it only pays off for referencing strings longer than those several bytes. For shorter matches it will inflate the data instead of compressing it, unless you establish that matches shorter than several bytes are left as literals instead of generating back-references. But again, that means lower compression gains.
If you compress only plain old ASCII text, you can employ a better encoding scheme, because ASCII uses just 7 out of the 8 bits in a byte. So you can use the highest bit to signal a back-reference, the remaining 7 bits as the length, and the very next byte (or two) as the back-reference's distance. This way you can always tell whether the next byte is a literal ASCII character or a back-reference by checking its highest bit. If it is 0, just output the character as is. If it is 1, use the remaining 7 bits as the length, and read the next 2 bytes to use as the distance. Every back-reference then takes 3 bytes, so you can efficiently compress text files with repeated sequences longer than 3 characters.
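A hedged sketch of that high-bit scheme (one flag/length byte followed by a two-byte big-endian distance; the exact layout is an assumption made for illustration):

def decode(blob: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(blob):
        b = blob[i]
        if b < 0x80:
            out.append(b)                    # literal 7-bit ASCII character
            i += 1
        else:
            length = b & 0x7F                # low 7 bits: match length
            distance = int.from_bytes(blob[i + 1:i + 3], 'big')
            for _ in range(length):          # copy from already-decoded output
                out.append(out[-distance])
            i += 3
    return bytes(out)

token = bytes([0x80 | 3]) + (3).to_bytes(2, 'big')  # back-reference: length 3, distance 3
print(decode(b'abcab' + token + b'd'))              # b'abcabcabd'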
But there's a still better way to do this, which gives even more compression: you can replace your characters with bit codes of variable lengths, crafted in such a way that the characters appearing more often get the shortest codes, and those which are rare get longer codes. To achieve that, these codes have to be so-called "prefix codes", so that no code is a prefix of another code. When your codes have this property, you can always distinguish them by reading the bits in sequence until you decode one of them; you can then be sure that you won't get any other valid item by reading more bits, and the next bit always starts a new sequence. To produce such codes, you need Huffman trees.
You can then join all your literal bytes and the different back-reference lengths into one such tree and generate distinct bit codes for them, depending on their frequency. When you decode, you just read bits until you reach the code of one of these elements, and then you know for sure whether it is the code of a literal character or the code of a back-reference's length. In the second case, you then read some additional bits for the distance of the back-reference (also encoded with a prefix code). This is what the DEFLATE compression scheme does. But that is a whole other story, and you will find the details in the RFC supplied by @MarkAdler.
If I understand your question correctly, the premise makes no sense: there are no "reserved bytes" for the uncompressed input of an LZ77 compressor. You simply need to encode literals and length/distance pairs unambiguously.
