What is the difference between a85encode and b85encode?

Python 3.4 added the a85encode and b85encode functions (and their corresponding decoding functions).
What is the difference between the two? The documentation mentions "They differ by details such as the character map used for encoding.", but this seems unnecessarily vague.

a85encode uses the character mapping:
!"#$%&'()*+,-./0123456789:;<=>?#
ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstu
with z used as a special case to represent four zero bytes (instead of !!!!!).
b85encode uses the character mapping:
0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
!#$%&()*+-;<=>?@^_`{|}~
with no special abbreviations.
If you have a choice, I'd recommend you use a85encode. It's a bit easier (and more efficient) to implement in C, as its character mapping uses all characters in ASCII order, and it's slightly more efficient at storing data containing lots of zeroes, which isn't uncommon for uncompressed binary data.
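A quick way to see both differences, using the standard base64 module (the comments describe the behavior you should observe):

import base64

data = b"\x00\x00\x00\x00hello"
print(base64.a85encode(data))  # four zero bytes collapse to a single b'z'
print(base64.b85encode(data))  # no special case: the zeroes become b'00000'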

Ascii85 is the predecessor of Base85; the primary difference between the two is in fact the character sets that are used.
Ascii85 uses the character set:
ASCII 33 ("!") to ASCII 117 ("u")
Base85 uses the character set:
0–9, A–Z, a–z, !#$%&()*+-;<=>?@^_`{|}~
These characters are specifically not included in Base85:
"',./:[]\\
a85encode and b85encode encode/decode Ascii85 and Base85 respectively.


Remove the styling from this particular string?

I extracted the following string from a webpage. It seems to somehow contain font styling, which makes it hard to work with. I would like to convert it to ordinary unstyled characters, using Python.
Here is the string:
๐—ธ๐—ฒ๐—ฒ๐—ฝ ๐˜๐—ฎ๐—ธ๐—ถ๐—ป๐—ด ๐—ฝ๐—ฟ๐—ฒ๐—ฐ๐—ฎ๐˜‚๐˜๐—ถ๐—ผ๐—ป๐˜€
The characters in that string are special Unicode codepoints used for mathematical typography. Although they shouldn't be used in other contexts, many webpages abuse Unicode for the purpose of creating styled texts; it is most common in places where HTML styling is not allowed (like StackOverflow comments :-)
As indicated in the comments, you can convert these Unicode characters into ordinary unstyled alphabetic characters using the standard unicodedata module's normalize function to do "compatibility (K) composition (C)" normalization.
import unicodedata
print(unicodedata.normalize("NFKC", "𝗸𝗲𝗲𝗽 𝘁𝗮𝗸𝗶𝗻𝗴 𝗽𝗿𝗲𝗰𝗮𝘂𝘁𝗶𝗼𝗻𝘀"))  # -> keep taking precautions
There are four normalization forms, which combine two axes:
composition or decomposition:
Certain characters (like ñ or Ö) have their own Unicode codepoints, although Unicode also includes a mechanism (zero-width "combining characters") to apply decorations ("accents" or "tildes") to any character. The precomposed characters with their own codes are basically there to support older encodings (like ISO-8859-x) which included these as single characters. Ñ, for example, was hexadecimal D1 in ISO-8859-1 ("latin-1"), and it was given the Unicode codepoint U+00D1 to make it easier to convert programs which expected it to be a single character.

Latin-1 also includes Õ (as D5), but it does not include T̃; in Unicode, we write T̃ as two characters: a capital T followed by a "combining tilde" (U+0054 U+0303). That means we could write Ñ in two ways: as the single composed codepoint U+00D1, or as the two-code sequence U+004E U+0303. If your display software is well-tuned, those two possibilities should look identical, and according to the Unicode standard they are semantically identical, but since the codes differ, they won't compare the same in a byte-by-byte comparison.
Composition (C) normalization converts multi-code sequences into their composed single-code versions, where those exist; it would turn U+004E U+0303 into U+00D1.
Decomposition (D) normalization converts the composed single-code characters into the semantically equivalent sequence using combining characters; it would turn U+00D1 into U+004E U+0303.
compatibility (K):
Some Unicode codepoints exist only to force particular rendering styles. That includes the styled math characters you encountered, but it also includes ligatures (such as ﬃ), superscript digits (²) or letters (ª) and some characters which have conventional meanings (µ, meaning "one-millionth", which is different from the Greek letter μ, or the Angstrom sign Å (U+212B), which is not the same as the Scandinavian letter Å (U+00C5)). In compatibility normalization, these characters are changed to the base unstyled character; in some cases, this loses important semantic information, but it can be useful.
All normalizations put codes into "canonical" ordering. Characters with more than one combining mark, such as ḉ (a c carrying both a cedilla and an acute accent), can be written with the combining marks in either order. To make it easier to compare strings which contain such characters, Unicode has a designated combining order, and normalization will reorder combining characters so that they can be easily compared. (Note that this needs to be done after composition or decomposition, since those can change the base character. For example, if the base character is "ç", decomposition normalization will change the base character to "c", and the cedilla will then need to be inserted in the correct place in the sequence of combining marks.)
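A minimal demonstration of these axes, using only the standard unicodedata module:

import unicodedata

s1 = "\u00d1"   # Ñ as the single precomposed codepoint
s2 = "N\u0303"  # Ñ as N followed by a combining tilde
print(s1 == s2)                                 # False: different code sequences
print(unicodedata.normalize("NFC", s2) == s1)   # True after composition
print(unicodedata.normalize("NFD", s1) == s2)   # True after decomposition
print(unicodedata.normalize("NFKC", "\u212b"))  # Angstrom sign normalizes to Å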

Python: Time complexity of .isAlpha

So I can't find the official documentation on how the isalpha method (of the str type) was written, but I imagine the algorithm used would be:

1. Convert the character in question to an int.
2. Compare it to the upper alpha ASCII bounds (i.e. 90 or 122) to see if it is less than or equal to these values.
3. Compare it to the lower alpha ASCII bounds (i.e. 65 or 97, depending on the upper bound used; if only less than 90, use 65) to see if it is greater than or equal to these values.

Am I correct in this assessment of the isalpha method, or is it something different altogether? If so, does it have a complexity of O(3)?
Python handles text as Unicode. As such, which characters are alphabetic and which are not depends on the character's Unicode category, encompassing all characters defined in the Unicode version compiled along with Python. That is more than one hundred thousand characters and hundreds of scripts, each with its own alphabetic ranges. Although it all boils down to numeric ranges of codepoints that could be compared using other algorithms, it is almost certain that all characters are iterated over and each character's Unicode category is checked. If you want the complexity, it is then O(n).
(Actually, it would have been O(n) in your example as well, since all characters have to be checked. For a single character, Python uses a dict, or dict-like table, to get from the character to its category information, and that is O(1).)
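You can verify that isalpha goes well beyond the ASCII ranges, and that the per-character lookup is a cheap category check:

import unicodedata

print("abc".isalpha())            # True: plain ASCII letters
print("café".isalpha())           # True: isalpha() is Unicode-aware
print("Ω½".isalpha())             # False: ½ is a number form, not a letter
print(unicodedata.category("Ω"))  # 'Lu' (uppercase letter): an O(1) table lookup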

What does normalize in unicodedata means?

I am very new to the encoding/decoding part and would like to know why... I have a dictionary, and I wonder why normalization needs to be applied in this case when the key is added. Does it have anything to do with the previous key and the new key? What happens if I don't normalize?
import csv
import unicodedata

pre_keys = ['One sample', 'Two samples', 'Three samples']
new_keys = ['one_sample', 'two_samples', 'three_samples']
my_dict = {'index': {}}
with open('file.csv') as input_file:
    reader = csv.DictReader(input_file)
    for row in reader:
        for pre_key, new_key in zip(pre_keys, new_keys):
            my_dict['index'][new_key] = unicodedata.normalize(
                "NFKD", row.get(pre_key, ''))
Normalization is not about encoding and decoding; it is about choosing a "normal" (expected) form to represent a character.
The classic example is a character with an accent. Such characters often have two representations: one as the base character's codepoint followed by a combining codepoint describing the accent, and another as a single precomposed codepoint (describing both character and accent).
Additionally, a character may sometimes carry two or more marks (accents, descenders, dots, etc.). In that case, you may want them in a specific order.
Unicode keeps adding new characters and codepoints. There may be an old typographic way to write a letter (or a kanji). In some contexts (display) it is important to preserve the distinction (English, too, once had two written forms of the letter s), but for reading or analysis you want the semantic letter, i.e. the normalized form.
And there are a few cases where a string may contain unnecessary characters (e.g. if you type on a "Unicode keyboard").
So why do we need normalization?
The simple case: comparing strings. Visually and semantically identical strings can be represented by different code sequences, so we choose one normalization form so that strings can be compared.
Collation (sorting) algorithms work much better (fewer special cases) if they only have to handle one form, and case conversion (lower case, upper case) is also easier with a single form to handle.
String handling can be easier: if you need to remove accents, the easy way is to use a decomposed form and then strip the combining characters (a minimal sketch appears at the end of this answer).
When transcoding into another character set, it is better to have the composed form (or both): if the target charset has the composed character, transcode it directly; otherwise, there are several ways to handle it.
So "normalize" means to transform the same string into an unique Unicode representation. The canonical transformation uses a strict definition of same; instead the compatibility normalization interpret the previous same into something like *it should have been the same, if we follow the Unicode philosophy, but practice we had to make some codepoints different to the preferred one*. So in compatibility normalization we could lose some semantics, and a pure/ideal Unicode string should never have a "compatibility" character.
In your case: the csv file could be edited by different editors, so with different convention on how to represent accented characters. So with normalization, you are sure that the same key will be encoded as same entry in the dictionary.
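As an illustration of the accent-stripping point above, a minimal sketch using only the standard unicodedata module:

import unicodedata

def strip_accents(s: str) -> str:
    # Decompose first, then drop the combining marks that decomposition exposes
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Ñandú café"))  # -> Nandu cafe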

LZ77 compression reserved bytes "< , >"

I'm learning about LZ77 compression, and I saw that when I find a repeated string of bytes, I can use a pointer of the form <distance, length>, and that the "<", ",", ">" bytes are reserved. So how do I compress a file that contains these bytes, if I cannot emit them as literals but also cannot replace them with different bytes (because decoders wouldn't be able to read it)? Is there a way? Or do decoders only decode when there is an exact <d, l> string? (If so, imagine that, by coincidence, we find these bytes in a file. What would happen?)
Thanks!
LZ77 is about referencing strings back in the decompression buffer by their lengths and distances from the current position. But it is left to you how you encode these back-references. Many implementations of LZ77 do it in different ways.
But you are right that there must be some way to distinguish "literals" (uncompressed pieces of data meant to be copied "as is" from the input to the output) from "back-references" (which are copied from already uncompressed portion).
One way to do it is reserving some characters as "special" (so called "escape sequences"). You can do it the way you did it, that is, by using < to mark the start of a back-reference. But then you also need a way to output < if it is a literal. You can do it, for example, by establishing that when after < there's another <, then it means a literal, and you just output one <. Or, you can establish that if after < there's immediately >, with nothing in between, then that's not a back-reference, so you just output <.
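Here is a toy decoder for such an escape-based textual format. The <distance,length> syntax with a doubled < marking a literal < is purely the hypothetical scheme discussed above, not any standard format:

def decode_escaped(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i:i + 2] == b"<<":    # doubled '<' escapes a literal '<'
            out.append(ord("<"))
            i += 2
        elif data[i:i + 1] == b"<":   # single '<' opens a back-reference
            end = data.index(b">", i)
            distance, length = map(int, data[i + 1:end].split(b","))
            for _ in range(length):   # copy byte by byte so overlaps work
                out.append(out[-distance])
            i = end + 1
        else:                         # any other byte is a plain literal
            out.append(data[i])
            i += 1
    return bytes(out)

print(decode_escaped(b"abcab<5,3>"))  # -> b'abcababc'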
It also wouldn't be the most efficient way to encode those back-references, because it uses several bytes to encode a back-reference, so it will become efficient only for referencing strings longer than those several bytes. For shorter back-references it will inflate the data instead of compressing them, unless you establish that matches shorter than several bytes are being left as is, instead of generating back-references. But again, this means lower compression gains.
If you compress only plain old ASCII texts, you can employ a better encoding scheme, because ASCII uses just 7 out of 8 bits in a byte. So you can use the highest bit to signal a back-reference, and then use the remaining 7 bits as the length and the very next byte (or two) as the back-reference's distance. This way you can always tell for sure whether the next byte is a literal ASCII character or a back-reference, by checking its highest bit. If it is 0, just output the character as is. If it is 1, use the remaining 7 bits as the length, and read the next 2 bytes to use as the distance. This way every back-reference takes 3 bytes, so you can efficiently compress text files with repeating sequences longer than 3 characters.
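A sketch of a decoder for that ASCII-oriented scheme (again an illustrative format of my own, here with a 7-bit length and a 2-byte big-endian distance):

def decode_ascii_lz(stream: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(stream):
        b = stream[i]
        if b < 0x80:                  # high bit 0: a literal ASCII character
            out.append(b)
            i += 1
        else:                         # high bit 1: low 7 bits are the length
            length = b & 0x7F
            distance = int.from_bytes(stream[i + 1:i + 3], "big")
            for _ in range(length):
                out.append(out[-distance])
            i += 3
    return bytes(out)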
But there's a still better way to do this, which gives even more compression: you can replace your characters with bit codes of variable lengths, crafted in such a way that the characters appearing more often get the shortest codes, and those which are rare get longer codes. To achieve that, these codes have to be so-called "prefix codes", so that no code is a prefix of some other code. When your codes have this property, you can always distinguish them by reading bits in sequence until you decode one of them; you can then be sure that reading more bits won't produce any other valid item, and the next bit always starts a new sequence.

To produce such codes, you need to use Huffman trees. You can then join all your bytes and the different lengths of references into one such tree and generate distinct bit codes for them, depending on their frequency. When you decode, you just read bits until you reach the code of one of these elements, and then you know for sure whether it is the code of a literal character or the code for a back-reference's length. In the second case, you then read some additional bits for the distance of the back-reference (also encoded with a prefix code). This is what the DEFLATE compression scheme does. But this is a whole other story, and you will find the details in the RFC mentioned by @MarkAdler.
If I understand your question correctly, it makes no sense: there are no "reserved bytes" for the uncompressed input of an LZ77 compressor. You simply need to encode literals and length/distance pairs unambiguously.

Python uses three Unicode characters to represent an Asian full stop? This is weird?

The python file:
# -*- coding: utf-8 -*-
print u"ใ€‚"
print [u"ใ€‚".encode('utf8')]
Produces:
。
['\xe3\x80\x82']
Why does Python use 3 characters to store my 1 full stop? This is really strange; if you print each one out individually, they are all different as well. Any ideas?
In UTF-8, three bytes (not really characters) are used to represent code points from U+0800 through U+FFFF, such as this character, IDEOGRAPHIC FULL STOP (U+3002).
Try dumping the script file with od -x. You should find the same three bytes used to represent the character there.
UTF-8 is a multibyte character representation so characters that are not ASCII will take up more than one byte.
Looks correctly UTF-8 encoded to me. See here for an explanation about UTF-8 encoding.
The latest version of Unicode supports more than 109,000 characters in 93 different scripts. Mathematically, the minimum number of bytes you'd need to encode that number of code points is 3, since this is 17 bits' worth of information. (Unicode actually reserves a 21-bit range, but this still fits in 3 bytes.) You might therefore reasonably expect every character to need 3 bytes in the most straightforward imaginable encoding, in which each character is represented as an integer using the smallest possible whole number of bytes. (In fact, as pointed out by dan04, you need 4 bytes to get all of Unicode's functionality.)
A common data compression technique is to use short tokens to represent frequently-occurring elements, even though this means that infrequently-occurring elements will need longer tokens than they otherwise might. UTF-8 is a Unicode encoding that uses this approach to store text written in English and other European languages in fewer bytes, at the cost of needing more bytes for text written in other languages. In UTF-8, the most common Latin characters need only 1 byte (UTF-8 overlaps with ASCII for the convenience of English users), and other common characters need only 2 bytes. But some characters need 3 or even 4 bytes, which is more than they'd need in a "naive" encoding. The particular character you're asking about needs 3 bytes in UTF-8 by definition.
In UTF-16, as it happens, this code point would need only 2 bytes, though other characters need 4 (there are no 3-byte characters in UTF-16). If you are truly concerned with space efficiency, do as John Machin suggests in his comment and use an encoding that is designed to be maximally space-efficient for your language.
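For instance, in Python 3:

s = "\u3002"                       # IDEOGRAPHIC FULL STOP
print(s.encode("utf-8"))           # b'\xe3\x80\x82': 3 bytes in UTF-8
print(len(s.encode("utf-16-le")))  # 2: only 2 bytes in UTF-16
print(len(s))                      # 1: still a single character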
