What does normalize in unicodedata mean? - python

I am very new to the encoding/decoding part and would like to know why... I have a dictionary, and I wonder why normalization needs to be applied in this case when the key is added. Does it have anything to do with the previous key and the new key? What happens if I don't normalize?
import csv
import unicodedata

pre_keys = ['One sample', 'Two samples', 'Three samples']
new_keys = ['one_sample', 'two_Samples', 'three_samples']
my_dict = {'index': {}}

with open('file.csv') as input_file:
    reader = csv.DictReader(input_file)
    for row in reader:
        for pre_key, new_key in zip(pre_keys, new_keys):
            value = row.get(pre_key)
            if value is not None:
                my_dict['index'][new_key] = unicodedata.normalize("NFKD", value)

Normalization is not about encoding and decoding, but about a "normal" (expected) form in which to represent a character.
The classic example is a character with an accent. Such characters often have two representations: one with the base character's codepoint followed by a combining codepoint describing the accent, and one with a single precomposed codepoint (describing both character and accent).
Additionally, you sometimes have two or more accents (or descenders, dots, etc.). In that case, you may want them in a specific order.
Unicode keeps adding new characters and codepoints. You may have an old typographic way to write a letter (or a kanji). In some contexts (display) it is important to keep the distinction (English too once had two written forms of the letter s), but for reading or analysis you want the semantic letter, i.e. the normalized one.
And there are a few cases where you may end up with unnecessary characters (e.g. if you type on a "Unicode keyboard").
So why do we need normalization?
The simple case: comparing strings. A string that is visually and semantically the same could be represented in different forms, so we choose a normalization form so that we can compare strings.
Collation (sorting) algorithms work much better (fewer special cases) if they have to handle just one form; likewise, for case changes (lower case, upper case) it is better to have a single form to handle.
Handling strings can be easier: if you have to remove accents, the easy way is to use a decomposition form and then remove the combining characters.
To encode into another character set, it is better to have the composed form (or both): if the target charset has the composite, transcode it; otherwise there are many ways to handle it.
So "normalize" means transforming equivalent strings into a unique Unicode representation. Canonical normalization uses a strict definition of "same"; compatibility normalization instead interprets "same" as *it should have been the same if we followed the Unicode philosophy, but in practice some codepoints had to be made different from the preferred one*. So in compatibility normalization we can lose some semantics, and a pure/ideal Unicode string should never contain a "compatibility" character.
In your case: the csv file could have been edited by different editors, with different conventions for representing accented characters. With normalization, you make sure that the same key will always map to the same entry in the dictionary.
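For example, a key typed with a precomposed accent and the same key saved by another editor in decomposed form compare unequal until both are normalized:

```python
import unicodedata

s1 = "caf\u00e9"      # 'café' with a precomposed é (U+00E9)
s2 = "cafe\u0301"     # 'cafe' + COMBINING ACUTE ACCENT (U+0301)
print(s1 == s2)       # False: same text, different codepoints
print(unicodedata.normalize("NFKD", s1)
      == unicodedata.normalize("NFKD", s2))   # True after normalization
```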

Remove style of this concrete string?

I extracted the following string from a webpage. It seems to somehow contain font styling, which makes it hard to work with. I would like to convert it to ordinary unstyled characters, using Python.
Here is the string:
๐—ธ๐—ฒ๐—ฒ๐—ฝ ๐˜๐—ฎ๐—ธ๐—ถ๐—ป๐—ด ๐—ฝ๐—ฟ๐—ฒ๐—ฐ๐—ฎ๐˜‚๐˜๐—ถ๐—ผ๐—ป๐˜€
The characters in that string are special Unicode codepoints used for mathematical typography. Although they shouldn't be used in other contexts, many webpages abuse Unicode for the purpose of creating styled texts; it is most common in places where HTML styling is not allowed (like StackOverflow comments :-)
As indicated in the comments, you can convert these Unicode characters into ordinary unstyled alphabetic characters using the standard unicodedata module's normalize function to do "compatibility (K) composition (C)" normalization.
unicodedata.normalize("NFKC", "𝗸𝗲𝗲𝗽 𝘁𝗮𝗸𝗶𝗻𝗴 𝗽𝗿𝗲𝗰𝗮𝘂𝘁𝗶𝗼𝗻𝘀")  # → 'keep taking precautions'
There are four normalization forms, which combine two axes:
composition or decomposition:
Certain characters (like ñ or Ö) have their own Unicode codepoints, although Unicode also includes a mechanism --zero-width "combining characters"-- to apply decorations ("accents" or "tildes") to any character. The precomposed characters with their own codes are basically there to support older encodings (like ISO-8859-x) which included these as single characters. Ñ, for example, was hexadecimal D1 in ISO-8859-1 ("latin-1"), and it was given the Unicode codepoint U+00D1 to make it easier to convert programs which expected it to be a single character. Latin-1 also includes Õ (as D5), but it does not include T̃; in Unicode, we write T̃ as two characters: a capital T followed by a "combining tilde" (U+0054 U+0303). That means we could write Ñ in two ways: as Ñ, the single composed codepoint U+00D1, or as Ñ, the two-code sequence U+004E U+0303. If your display software is well-tuned, those two possibilities should look identical, and according to the Unicode standard they are semantically identical, but since the codes differ, they won't compare the same in a byte-by-byte comparison.
Composition (C) normalization converts multi-code sequences into their composed single-code versions, where those exist; it would turn U+004E U+0303 into U+00D1.
Decomposition (D) normalization converts the composed single-code characters into the semantically equivalent sequence using combining characters; it would turn U+00D1 into U+004E U+0303.
compatibility (K):
Some Unicode codepoints exist only to force particular rendering styles. That includes the styled math characters you encountered, but it also includes ligatures (such as ﬃ), superscript digits (²) or letters (ª), and some characters which have conventional meanings (µ, the micro sign meaning "one-millionth", which is different from the Greek character μ, or the Angstrom sign Å, U+212B, which is not the same as the Scandinavian character Å, U+00C5). In compatibility normalization, these characters are changed to the base unstyled character; in some cases this loses important semantic information, but it can be useful.
All normalizations put codes into "canonical" ordering. Characters with more than one combining mark, such as ḉ (c with a combining cedilla and a combining acute), can be written with the combining marks in either order. To make it easier to compare strings which contain such characters, Unicode has a designated combining order, and normalization will reorder combining characters so that they can be easily compared. (Note that this needs to be done after composition or decomposition, since those can change the base character. For example, if the base character is "ç", decomposition normalization will change the base character to "c", and the cedilla will then need to be inserted in the correct place in the sequence of combining marks.)
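To see both directions with Python's unicodedata module (using the Ñ example above):

```python
import unicodedata

composed = "\u00d1"      # 'Ñ' as a single precomposed codepoint
decomposed = "N\u0303"   # 'N' followed by COMBINING TILDE

# NFC composes the two-code sequence; NFD decomposes the single codepoint.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```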

Python UTF8 encoding

I have looked at other questions around Python and encoding but not quite found the solution to my problem. Here it is:
I have a small script which attempts to compare 2 lists of files:
A list given in a text file, which is supposed to be encoded in UTF8 (at least Notepad++ detects it as such).
A list from a directory which I build like this:
from os import listdir
local = [f.encode('utf-8') for f in listdir(dir)]
However, for some characters, I do not get the same representation: when looking in a HEX editor, I find that in 1, the character é is given by 65 cc whereas in 2 it is given by c3 a9 ...
What I would like is to have them to the same encoding, whatever it is.
Your first sequence is incomplete - cc is the prefix for a two-byte UTF-8 sequence. Most probably, the full sequence is 65 cc 81, which indeed is the character e (0x65) followed by a COMBINING ACUTE ACCENT (0x301, which in UTF-8 gets expressed as cc 81).
The other sequence instead is the precomposed LATIN SMALL LETTER E WITH ACUTE character (0xe9, expressed as c3 a9 in UTF-8). You'll notice in the linked page that its decomposition is exactly the first sequence.
Unicode normalization
Now, in Unicode there are many instances of different sequences that are graphically and/or semantically the same, and while it's generally a good idea to treat a UTF-8 stream as an opaque binary sequence, this poses a problem if you want to do searching or indexing: looking for one sequence won't match the other, even if they are graphically and semantically the same thing. For this reason, Unicode defines four types of normalization, which can be used to "flatten" this kind of difference and obtain the same codepoints from both the composed and decomposed forms. For example, the NFC and NFKC normalization forms in this case will give the 0xe9 codepoint for both your sequences, while NFD and NFKD will give the 0x65 0x301 decomposed form.
To do this in Python 2 you'll first have to decode your UTF-8 str objects to unicode objects, and then use the unicodedata.normalize function (in Python 3, str is already Unicode).
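In Python 3, where str is already Unicode, the same operation on your two byte sequences looks like this:

```python
import unicodedata

decomposed = "e\u0301"      # the 65 cc 81 sequence from the question, decoded
composed = unicodedata.normalize("NFC", decomposed)
print(composed)                     # é as a single codepoint (U+00E9)
print(composed.encode("utf-8"))     # b'\xc3\xa9', the other byte sequence
```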
Important note: don't normalize unless you are implementing "intelligent" indexing/searching, and use the normalized data only for this purpose - i.e. index and search normalized, but store/show the user the original form. Normalization is a lossy operation (some forms particularly so); applying it blindly over user data is like taking a sledgehammer into a pottery shop.
File paths
Ok, this was about Unicode in general. Talking about filesystem paths is both simpler and more complicated.
In principle, virtually all common filesystems on Windows and Linux treat paths as opaque character[1] sequences (modulo the directory separator and possibly the NUL character), with no particular normalization form applied[2]. So, in a given directory you can have two file names that look the same but are indeed different.
So, when dealing with file paths, in principle you should never normalize - again, file paths are an opaque sequence of codepoints (actually, an opaque sequence of bytes on Linux) which should not be messed with.
However, if the list you receive and have to deal with is normalized differently (which probably means either that it has been passed through broken software that "helpfully" normalizes composed/decomposed sequences, or that the names have been typed in by hand), you'll have to perform some normalized matching.
If I were to deal with a similar (broken by definition) scenario, I'd do something like this:
first try to match exactly;
if this fails, try to match the normalized file name against a set containing the normalized content of the directory; notice that, if multiple original names map to the same normalized name and you don't match exactly, you have no way to know which one is the "right" one.
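A sketch of that two-step matching; find_entry is a hypothetical helper operating on a directory listing, not a standard API:

```python
import unicodedata

def find_entry(wanted, entries):
    """Exact match first; fall back to NFC-normalized matching."""
    if wanted in entries:
        return wanted                      # exact match, the happy path
    norm = unicodedata.normalize("NFC", wanted)
    candidates = [e for e in entries
                  if unicodedata.normalize("NFC", e) == norm]
    # If several originals map to the same normalized name, the match
    # is ambiguous and we cannot know which entry is the "right" one.
    return candidates[0] if len(candidates) == 1 else None

entries = ["re\u0301sume\u0301.txt", "notes.txt"]   # decomposed é on disk
print(find_entry("r\u00e9sum\u00e9.txt", entries))  # matches via normalization
```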
Footnotes
[1] Linux-native filesystems all use 8-bit byte-based paths - they may be in whatever encoding, the kernel doesn't care, although recent systems generally happen to use UTF-8; Windows-native filesystems will instead use 16-bit word-based paths, which nominally contain UTF-16 (originally UCS-2) values.
[2] On Windows it's a bit more complicated at the API level, since there's the whole ANSI API mess that performs codepage conversion, and case-insensitive matching for Win32 paths adds one more level of complication, but down at kernel and filesystem level it's all opaque 2-byte WCHAR strings.
At the top of your file add these
#!/usr/bin/env python
# -*- coding: utf-8 -*-
Hope this helps..!

What is the difference between a85encode and b85encode?

Python 3.4 added the a85encode and b85encode functions (and their corresponding decoding functions).
What is the difference between the two? The documentation mentions "They differ by details such as the character map used for encoding.", but this seems unnecessarily vague.
a85encode uses the character mapping:
!"#$%&'()*+,-./0123456789:;<=>?@
ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstu
with z used as a special case to represent four zero bytes (instead of !!!!!).
b85encode uses the character mapping:
0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
!#$%&()*+-;<=>?@^_`{|}~
with no special abbreviations.
If you have a choice, I'd recommend you use a85encode. It's a bit easier (and more efficient) to implement in C, as its character mapping uses all characters in ASCII order, and it's slightly more efficient at storing data containing lots of zeroes, which isn't uncommon for uncompressed binary data.
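A quick check of both behaviors (the z abbreviation for zero runs, and the differing alphabets) using the standard base64 module:

```python
import base64

zeros = b"\x00\x00\x00\x00"
print(base64.a85encode(zeros))   # b'z'     -- four zero bytes abbreviated
print(base64.b85encode(zeros))   # b'00000' -- no abbreviation in Base85

# Both round-trip arbitrary data.
print(base64.a85decode(base64.a85encode(b"hello")))  # b'hello'
```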
Ascii85 is the predecessor of Base85; the primary difference between the two is in fact the character sets that are used.
Ascii85 uses the character set:
ASCII 33 ("!") to ASCII 117 ("u")
Base85 uses the character set:
0–9, A–Z, a–z, !#$%&()*+-;<=>?@^_`{|}~
These characters are specifically not included in Base85:
"',./:[]\\
a85encode and b85encode encode/decode Ascii85 and Base85 respectively.

LZ77 compression reserved bytes "< , >"

I'm learning about LZ77 compression, and I saw that when I find a repeated string of bytes I can use a pointer of the form <distance, length>, and that the "<", ",", ">" bytes are reserved. So how do I compress a file that contains these bytes, if I cannot compress them but also cannot replace them with different bytes (because decoders wouldn't be able to read them)? Is there a way? Or do decoders only decode if there is an exact <d, l> string? (If so, imagine that by coincidence we find these bytes in a file. What would happen?)
Thanks!
LZ77 is about referencing strings back in the decompression buffer by their lengths and distances from the current position. But it is left to you how you encode these back-references. Many implementations of LZ77 do it in different ways.
But you are right that there must be some way to distinguish "literals" (uncompressed pieces of data meant to be copied "as is" from the input to the output) from "back-references" (which are copied from already uncompressed portion).
One way to do it is to reserve some characters as "special" (so-called "escape sequences"). You can do it the way you described, that is, by using < to mark the start of a back-reference. But then you also need a way to output < when it is a literal. You can do that, for example, by establishing that a < followed by another < means a literal, and you just output one <. Or you can establish that a < immediately followed by >, with nothing in between, is not a back-reference, so you just output <.
It also wouldn't be the most efficient way to encode those back-references, because it uses several bytes to encode a back-reference, so it only pays off for referencing strings longer than those several bytes. For shorter back-references it will inflate the data instead of compressing it, unless you establish that matches shorter than several bytes are left as is instead of generating back-references. But again, this means lower compression gains.
If you compress only plain old ASCII text, you can employ a better encoding scheme, because ASCII uses just 7 out of the 8 bits in a byte. So you can use the highest bit to signal a back-reference, then use the remaining 7 bits as the length, and the very next byte (or two) as the back-reference's distance. This way you can always tell whether the next byte is a literal ASCII character or a back-reference by checking its highest bit. If it is 0, just output the character as is. If it is 1, use the following 7 bits as the length, and read the next 2 bytes to use as the distance. This way every back-reference takes 3 bytes, so you can efficiently compress text files with repeating sequences longer than 3 characters.
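A minimal decoder for the high-bit scheme sketched above (an illustrative format, not any standard one): a byte below 0x80 is a literal, a byte at or above 0x80 carries a 7-bit length followed by a 2-byte big-endian distance.

```python
def decode(stream: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(stream):
        b = stream[i]
        if b < 0x80:                       # literal ASCII character
            out.append(b)
            i += 1
        else:                              # back-reference: 7-bit length
            length = b & 0x7F
            distance = int.from_bytes(stream[i + 1:i + 3], "big")
            for _ in range(length):        # byte-by-byte copy allows
                out.append(out[-distance]) # overlapping matches
            i += 3
    return bytes(out)

# 'abc' as literals, then a back-reference <distance=3, length=3>:
print(decode(b"abc" + bytes([0x83, 0x00, 0x03])))  # b'abcabc'
```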
But there's a still better way to do this, which gives even more compression: you can replace your characters with bit codes of variable lengths, crafted in such a way that the characters appearing more often get the shortest codes, and the rare ones get longer codes. To achieve that, these codes have to be so-called "prefix codes", so that no code is a prefix of some other code. When your codes have this property, you can always distinguish them by reading the bits in sequence until you decode one of them; you can then be sure that you won't get any other valid item by reading more bits, and the next bit always starts a new sequence. To produce such codes, you use Huffman trees. You can join all your bytes and the different back-reference lengths into one such tree and generate distinct bit codes for them, depending on their frequency. When decoding, you just read bits until you reach the code of one of these elements, and then you know for sure whether it is the code of a literal character or of a back-reference's length. In the second case, you then read some additional bits for the distance of the back-reference (also encoded with a prefix code). This is what the DEFLATE compression scheme does. But this is a whole other story, and you will find the details in the RFC mentioned by @MarkAdler.
If I understand your question correctly, the premise makes no sense: there are no "reserved bytes" in the uncompressed input of an LZ77 compressor. You simply need to encode literals and length/distance pairs unambiguously.

Search and replace characters in a file with Python

I am trying to do transliteration, where I need to replace every source character in English from a file with its equivalent, in Unicode format, from a dictionary corresponding to another language which I have defined in the source code. I am now able to read character by character from a file in English; how do I look up its equivalent in the dictionary defined in the source code and make sure it is printed to a new transliterated output file? Thank you :).
The translate method of Unicode objects is the simplest and fastest way to perform the transliteration you require. (I assume you're using Unicode, not plain byte strings which would make it impossible to have characters such as 'पत्र'!).
All you have to do is layout your transliteration dictionary in a precise way, as specified in the docs to which I pointed you:
each key must be an integer, the codepoint of a Unicode character; for example, 0x0904 is the codepoint for ऄ, AKA "DEVANAGARI LETTER SHORT A", so for transliterating it you would use as the key in the dict the integer 0x0904 (equivalently, decimal 2308). (For a table with the codepoints for many South-Asian scripts, see this pdf).
the corresponding value can be a Unicode ordinal, a Unicode string (which is presumably what you'll use for your transliteration task, e.g. u'a' if you want to transliterate the Devanagari letter short A into the English letter 'a'), or None (if during the "transliteration" you want to simply remove instances of that Unicode character).
Characters that aren't found as keys in the dict are passed on untouched from the input to the output.
Once your dict is laid out like that, output_text = input_text.translate(thedict) does all the transliteration for you -- and pretty darn fast, too. You can apply this to blocks of Unicode text of any size that will fit comfortably in memory -- basically, doing one text file at a time will be just fine on most machines (e.g., the wonderful -- and huge -- Mahabharata takes at most a few tens of megabytes in any of the freely downloadable forms -- Sanskrit [[cross-linked with both Devanagari and roman-transliterated forms]], English translation -- available from this site).
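As a toy illustration of that dict layout (the two-letter mapping below is invented for the example, not a real transliteration scheme):

```python
# Keys are integer codepoints; values are replacement strings.
table = {
    0x0915: "ka",   # क DEVANAGARI LETTER KA
    0x092A: "pa",   # प DEVANAGARI LETTER PA
}
print("\u0915\u092a".translate(table))  # → 'kapa'
```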
Note: Updated after clarifications from questioner. Please read the comments from the OP attached to this answer.
Something like this:
for syllable in input_text.split_into_syllables():
    output_file.write(d[syllable])
Here output_file is a file object, open for writing, and d is a dictionary where the keys are your source characters and the values are the output characters. You can also try to read your file line by line instead of reading it all in at once.
