I extracted the following string from a webpage. It seems to somehow contain font styling, which makes it hard to work with. I would like to convert it to ordinary unstyled characters, using Python.
Here is the string:
𝗸𝗲𝗲𝗽 𝘁𝗮𝗸𝗶𝗻𝗴 𝗽𝗿𝗲𝗰𝗮𝘂𝘁𝗶𝗼𝗻𝘀
The characters in that string are special Unicode codepoints used for mathematical typography. Although they shouldn't be used in other contexts, many webpages abuse Unicode to create styled text; it is most common in places where HTML styling is not allowed (like StackOverflow comments :-)
As indicated in the comments, you can convert these Unicode characters into ordinary unstyled alphabetic characters using the standard unicodedata module's normalize function to do "compatibility (K) composition (C)" normalization:
unicodedata.normalize("NFKC", "𝗸𝗲𝗲𝗽 𝘁𝗮𝗸𝗶𝗻𝗴 𝗽𝗿𝗲𝗰𝗮𝘂𝘁𝗶𝗼𝗻𝘀")
There are four normalization forms, which combine two axes:
composition or decomposition:
Certain characters (like ñ or Ö) have their own Unicode codepoints, although Unicode also includes a mechanism --zero-width "combining characters"-- to apply decorations ("accents" or "tildes") to any character. The precomposed characters with their own codes are basically there to support older encodings (like ISO-8859-x) which included these as single characters.

Ñ, for example, was hexadecimal D1 in ISO-8859-1 ("latin-1"), and it was given the Unicode codepoint U+00D1 to make it easier to convert programs which expected it to be a single character. Latin-1 also includes Õ (as D5), but it does not include T̃; in Unicode, we write T̃ as two characters: a capital T followed by a "combining tilde" (U+0054 U+0303).

That means we could write Ñ in two ways: as Ñ, the single composed codepoint U+00D1, or as Ñ, the two-code sequence U+004E U+0303. If your display software is well-tuned, those two possibilities should look identical, and according to the Unicode standard they are semantically identical, but since the codes differ, they won't compare the same in a byte-by-byte comparison.
Composition (C) normalization converts multi-code sequences into their composed single-code versions, where those exist; it would turn U+004E U+0303 into U+00D1.
Decomposition (D) normalization converts the composed single-code characters into the semantically equivalent sequence using combining characters; it would turn U+00D1 into U+004E U+0303.
compatibility (K):
Some Unicode codepoints exist only to force particular rendering styles. That includes the styled math characters you encountered, but it also includes ligatures (such as ffi), superscript digits (²) or letters (ª) and some characters which have conventional meanings (µ, meaning "one-millionth", which is different from the Greek character μ, or the Angstrom sign Å, which is not the same as the Scandinavian character Å). In compatibility normalization, these characters are changed to the base unstyled character; in some cases, this loses important semantic information, but it can be useful.
All normalizations put codes into "canonical" ordering. Characters with more than one combining mark, such as ḉ, can be written with the combining marks in either order. To make it easier to compare strings which contain such characters, Unicode has a designated combining order, and normalization will reorder combining characters so that they can be easily compared. (Note that this needs to be done after decomposition, since that can change the base character. For example, if the base character is "ç", decomposition normalization will change the base character to "c" and the cedilla will then need to be inserted in the correct place in the sequence of combining marks.)
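As a quick illustration of the composed/decomposed distinction described above, here is a small sketch using the Ñ example from the text; the boolean comments are what NFC/NFD normalization should give:

    import unicodedata

    composed   = "\u00d1"      # Ñ as a single precomposed code point (U+00D1)
    decomposed = "N\u0303"     # U+004E followed by U+0303 COMBINING TILDE

    print(composed == decomposed)                                # False: different code sequences
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True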
Related
Is it possible to use a superscript for a Unicode character?
For example, I have the expression aaa=u'Intel\u00AE' that creates the registered symbol as expected, but how could it be made a superscript?
Unicode does not support making arbitrary characters into superscripts. There are some pre-encoded superscript characters; they are distinct characters that you can convert your text to.
If you want to make an arbitrary character a superscript you'll need to use markup or a typesetting system.
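To make that concrete, here is a small sketch (Python 3) that converts digits to their pre-encoded superscript code points (U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079); the translation table is only an illustration, and there is no such pre-encoded variant for an arbitrary character like U+00AE:

    # Map the decimal digits to their superscript counterparts
    superscripts = str.maketrans(
        "0123456789",
        "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079")

    print("x2 + y10".translate(superscripts))   # x² + y¹⁰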
I am very new to the encoding/decoding part and would like to know why... I have a dictionary, and I wonder why normalization needs to be applied in this case when the key is added. Does it have anything to do with the previous key and the new key? What happens if I don't normalize?
    import csv
    import unicodedata

    my_dict = {'index': {}}

    with open('file.csv') as input_file:
        reader = csv.DictReader(input_file)
        for row in reader:
            pre_keys = ['One sample', 'Two samples', 'Three samples']
            new_keys = ['one_sample', 'two_Samples', 'three_samples']
            for pre_key, new_key in zip(pre_keys, new_keys):
                my_dict['index'][new_key] = unicodedata.normalize(
                    "NFKD", row.get(pre_key, ''))
Normalization is not about encoding and decoding, but about a "normal" (expected) form for representing a character.
The classic example is a character with an accent. Often such characters have two representations: one with the base character codepoint followed by a combining codepoint describing the accent, and one with just a single codepoint (describing both the character and the accent).
Additionally, sometimes a character has two or more accents (or other marks, dots, etc.). In this case, you may want them in a specific order.
Unicode keeps adding new characters and codepoints. You may have some old typographic way of writing a letter (or a kanji). In some contexts (display) it is important to keep the distinction (English, too, used to have two written forms of the letter s), but for reading or analysis one wants the semantic letter, i.e. the normalized one.
And there are a few cases where you may end up with unnecessary characters (e.g. if you type on a "Unicode keyboard").
So why do we need normalization?
The simple case: comparing strings. A string that is visually and semantically the same can be represented in different forms, so we choose a normalization form so that strings can be compared.
Collation (sorting) algorithms work a lot better (fewer special cases) if we only have to handle one form; the same applies to changing case (lower case, upper case), where it is better to have a single form to handle.
Handling strings can be easier: if you need to remove accents, the easy way is to use a decomposed form and then remove the combining characters (see the sketch after this list).
To encode into another character set, it is better to have the composed form (or both): if the target charset has the composed character, transcode it; otherwise there are various ways to handle it.
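Here is a minimal sketch of the accent-removal idea from the list above, using a decomposed form and dropping the combining marks (strip_accents is just an illustrative helper name):

    import unicodedata

    def strip_accents(text):
        # Decompose, then drop the combining marks that follow each base character
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(strip_accents("Ñandú crème brûlée"))   # Nandu creme brulee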
So "normalize" means to transform the same string into an unique Unicode representation. The canonical transformation uses a strict definition of same; instead the compatibility normalization interpret the previous same into something like *it should have been the same, if we follow the Unicode philosophy, but practice we had to make some codepoints different to the preferred one*. So in compatibility normalization we could lose some semantics, and a pure/ideal Unicode string should never have a "compatibility" character.
In your case: the csv file could be edited by different editors, so with different convention on how to represent accented characters. So with normalization, you are sure that the same key will be encoded as same entry in the dictionary.
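A small sketch of that key problem (the café keys are just illustrative): two byte-for-byte different spellings of the same accented key end up as two dictionary entries unless you normalize them first:

    import unicodedata

    key_nfc = "caf\u00e9"     # 'café' with a precomposed é
    key_nfd = "cafe\u0301"    # 'café' as e + combining acute

    d = {}
    d[key_nfc] = 1
    d[key_nfd] = 2
    print(len(d))             # 2 -- the "same" key is stored twice

    d = {}
    d[unicodedata.normalize("NFC", key_nfc)] = 1
    d[unicodedata.normalize("NFC", key_nfd)] = 2
    print(len(d))             # 1 -- normalized keys collide as intended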
Python 3.4 added the a85encode and b85encode functions (and their corresponding decoding functions).
What is the difference between the two? The documentation mentions "They differ by details such as the character map used for encoding.", but this seems unnecessarily vague.
a85encode uses the character mapping:
!"#$%&'()*+,-./0123456789:;<=>?#
ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstu
with z used as a special case to represent four zero bytes (instead of !!!!!).
b85encode uses the character mapping:
0123456789
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
!#$%&()*+-;<=>?#^_`{|}~
with no special abbreviations.
If you have a choice, I'd recommend you use a85encode. It's a bit easier (and more efficient) to implement in C, as its character mapping uses all characters in ASCII order, and it's slightly more efficient at storing data containing lots of zeroes, which isn't uncommon for uncompressed binary data.
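A short sketch of the practical difference; the commented outputs are what the z shortcut and the two alphabets should produce for a group of four zero bytes:

    import base64

    data = b"\x00\x00\x00\x00"          # four zero bytes

    print(base64.a85encode(data))       # b'z'      -- Ascii85's special abbreviation
    print(base64.b85encode(data))       # b'00000'  -- Base85 has no such shortcut

    # Both round-trip, of course
    assert base64.a85decode(base64.a85encode(data)) == data
    assert base64.b85decode(base64.b85encode(data)) == data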
Ascii85 is the predecessor of Base85; the primary difference between the two is in fact the character sets that are used.
Ascii85 uses the character set:
ASCII 33 ("!") to ASCII 117 ("u")
Base85 uses the character set:
0–9, A–Z, a–z, !#$%&()*+-;<=>?@^_`{|}~
These characters are specifically not included in Base85:
"',./:[]\\
a85encode and b85encode encode/decode Ascii85 and Base85 respectively.
I'm creating a sub-class based on 'HTMLParser' to pull out html content. Whenever I have character refs such as
' ' '&' '–' '…'
I'd like to replace them with their English counterparts of
' ' (space), '&', '-', '...', and so on.
What's the best way to convert some of the simple character refs into their correct representation?
My text is similar to:
Some text goes here&after that, 6:30 pm–8:45pm and maybe
something like …
I would like to convert this to:
Some text goes here & after that, 6:30 pm-8:45pm and maybe
something like ...
Your question has two parts. The easy part is decoding the HTML entities. The easiest way to do that is to grab this undocumented but long-stable method from the HTMLParser module:
>>> HTMLParser.HTMLParser().unescape('a &lt; &eacute; &ndash; &hellip;')
u'a < é – …'
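(If you are on Python 3, the documented equivalent of that undocumented method is html.unescape, which handles the same named entities; roughly:)

    >>> import html
    >>> html.unescape('a &lt; &eacute; &ndash; &hellip;')
    'a < é – …'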
The second part, converting Unicode characters to ASCII lookalikes, is trickier and also quite questionable. I would try to retain the Unicode en-dash ‘–’ and similar typographical niceties, rather than convert them down to characters like the plain hyphen and straight-quotes. Unless your application can't handle non-ASCII characters at all you should aim to keep them as they are, along with all other Unicode characters.
The specific case of the U+2026 ellipsis character is potentially different because it's a ‘compatibility character’, included in Unicode only for lossless round-tripping to other encodings that feature it. Preferably you'd just type three dots, and let the font's glyph combination logic work out exactly how to draw it.
If you want to just replace compatibility characters (like this one, explicit ligatures, the Japanese fullwidth numbers, and a handful of other oddities), you could try normalising your string to Normal Form KC:
>>> unicodedata.normalize('NFKC', u'a < é – …')
u'a < é – ...'
(Care, though: some other characters that you might have wanted to keep are also compatibility characters, including ‘²’.)
The next step would be to turn letters with diacriticals into plain letters, which you could do by normalising to NFKD instead and then removing all characters that have the ‘combining’ character class from the string. That would give you plain ASCII for the previously-accented Latin letters, albeit in a way that is not linguistically correct for many languages. If that's all you care about you could encode straight to ASCII:
>>> unicodedata.normalize('NFKD', u'a < é – …').encode('us-ascii', 'ignore')
'a < e  ...'
Anything further you might do would have to be ad-hoc as there is no accepted standard for folding strings down to ASCII. Windows has one implementation, as does Lucene (ASCIIFoldingFilter). The results are pretty variable.
I have a string in unicode and I need to return the first N characters.
I am doing this:
result = unistring[:5]
but of course the length of unicode strings != length of characters.
Any ideas? The only solution is using re?
Edit: More info
unistring = "Μεταλλικα" #Metallica written in Greek letters
result = unistring[:1]
returns-> ?
I think that unicode strings are two bytes per character; that's why this happens. If I do:
result = unistring[:2]
I get
Μ
which is correct,
So, should I always slice*2 or should I convert to something?
Unfortunately, for historical reasons, prior to Python 3.0 there are two string types: byte strings (str) and Unicode strings (unicode).
Prior to the unification in Python 3.0 there are two ways to declare a string literal: unistring = "Μεταλλικα" which is a byte string and unistring = u"Μεταλλικα" which is a unicode string.
The reason you see ? when you do result = unistring[:1] is that the slice cuts the multi-byte UTF-8 encoding of the first character in half, and the leftover byte cannot be displayed as a valid character. You have probably seen this kind of problem if you ever used a really old email client and received emails from friends in countries like Greece, for example.
So in Python 2.x if you need to handle Unicode you have to do it explicitly. Take a look at this introduction to dealing with Unicode in Python: Unicode HOWTO
When you say:
unistring = "Μεταλλικα" #Metallica written in Greek letters
You do not have a unicode string. You have a bytestring in (presumably) UTF-8. That is not the same thing. A unicode string is a separate datatype in Python. You get unicode by decoding bytestrings using the right encoding:
unistring = "Μεταλλικα".decode('utf-8')
or by using the unicode literal in a source file with the right encoding declaration
# coding: UTF-8
unistring = u"Μεταλλικα"
The unicode string will do what you want when you do unistring[:5].
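A minimal Python 2 sketch putting the pieces together (assuming the source file carries the UTF-8 coding declaration; the commented values are what UTF-8-encoded Greek text should give):

    # -*- coding: utf-8 -*-
    bytestring = "Μεταλλικα"              # a str (bytes) literal, UTF-8 encoded
    unistring = bytestring.decode('utf-8')

    print len(bytestring)                 # 18 -- two bytes per Greek letter
    print len(unistring)                  # 9  -- code points
    print unistring[:5].encode('utf-8')   # Μεταλ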
There is no correct straightforward approach with any type of "Unicode string".
Even a Python "unicode" string (UTF-16 on narrow builds) has variable-length characters, so you can't just cut with ustring[:5], because some Unicode code points may take more than one "character", i.e. surrogate pairs.
So if you want to cut the first 5 code points (note: these are not characters) in order to analyze the text, see the definitions at http://en.wikipedia.org/wiki/UTF-8 and http://en.wikipedia.org/wiki/UTF-16; you need to use some bit masks to figure out the boundaries.
Even then you still do not get characters. For example, the word "שָלוֹם" -- "shalom", peace in Hebrew -- consists of 4 characters but 6 code points: the letter "shin", the vowel "a", the letter "lamed", the letter "vav", the vowel "o", and the final letter "mem".
So a character is not a code point.
The same goes for most Western languages, where a letter with diacritics may be represented as two code points. Search, for example, for "unicode normalization".
So... if you really need the first 5 characters, you have to use tools like the ICU library. For example, there is an ICU binding for Python that provides a character-boundary iterator.
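If pulling in ICU is too heavy, a rough approximation of "first N characters" can be sketched in Python 3 with just unicodedata, by gluing combining marks to the preceding base code point (this ignores surrogate pairs, ZWJ emoji sequences and other cases a real grapheme iterator handles; first_n_graphemes is a hypothetical helper):

    import unicodedata

    def first_n_graphemes(text, n):
        clusters = []
        for ch in text:
            if clusters and unicodedata.combining(ch):
                clusters[-1] += ch        # attach the mark to its base character
            else:
                clusters.append(ch)
        return "".join(clusters[:n])

    print(first_n_graphemes("שָלוֹם", 2))   # 2 "characters", 3 code points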