Decoding html content and HTMLParser

I'm creating a sub-class based on 'HTMLParser' to pull out html content. Whenever I have character refs such as
'&nbsp;' '&amp;' '&ndash;' '&hellip;'
I'd like to replace them with their English counterparts of
' ' (space), '&', '-', '...', and so on.
What's the best way to convert some of the simple character refs into their correct representation?
My text is similar to:
Some text goes here&amp;after that, 6:30 pm&ndash;8:45pm and maybe
something like &hellip;
I would like to convert this to:
Some text goes here & after that, 6:30 pm-8:45pm and maybe
something like ...

Your question has two parts. The easy part is decoding the HTML entities. The easiest way to do that is to grab this undocumented but long-stable method from the HTMLParser module:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('a &lt; &eacute; &ndash; &hellip;')
u'a < é – …'
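For readers on Python 3, the same decoding is available as the documented html.unescape function (Python 3.4 and later). A minimal sketch, assuming Python 3; the sample string is only illustrative:

```python
import html

# Decode named character references into the characters they stand for.
encoded = 'a &lt; &eacute; &ndash; &hellip;'
decoded = html.unescape(encoded)

print(decoded)
```

The result is an ordinary Unicode string, so the en dash and the ellipsis are still non-ASCII characters at this point.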
The second part, converting Unicode characters to ASCII lookalikes, is trickier and also quite questionable. I would try to retain the Unicode en-dash ‘–’ and similar typographical niceties, rather than convert them down to characters like the plain hyphen and straight-quotes. Unless your application can't handle non-ASCII characters at all you should aim to keep them as they are, along with all other Unicode characters.
The specific case of the U+2026 ellipsis character is potentially different because it's a ‘compatibility character’, included in Unicode only for lossless round-tripping to other encodings that feature it. Preferably you'd just type three dots, and let the font's glyph combination logic work out exactly how to draw it.
If you want to just replace compatibility characters (like this one, explicit ligatures, the Japanese fullwidth numbers, and a handful of other oddities), you could try normalising your string to Normal Form KC:
>>> import unicodedata
>>> unicodedata.normalize('NFKC', u'a < é – …')
u'a < é – ...'
(Care, though: some other characters that you might have wanted to keep are also compatibility characters, including ‘²’.)
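To make that caveat concrete, here is a small sketch (Python 3 syntax assumed, sample text made up for illustration) showing NFKC expanding the ellipsis while also flattening a superscript two that you may have wanted to keep:

```python
import unicodedata

# e-acute, en dash, ellipsis and superscript two, written as escapes
sample = 'a < \u00e9 \u2013 \u2026 x\u00b2'
folded = unicodedata.normalize('NFKC', sample)

# The ellipsis becomes three dots and the superscript two becomes a plain 2;
# the e-acute and the en dash are not compatibility characters and survive.
print(folded)
```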
The next step would be to turn letters with diacriticals into plain letters, which you could do by normalising to NFKD instead and then removing all characters that have the ‘combining’ character class from the string. That would give you plain ASCII for the previously-accented Latin letters, albeit in a way that is not linguistically correct for many languages. If that's all you care about you could encode straight to ASCII:
>>> unicodedata.normalize('NFKD', u'a < é – …').encode('us-ascii', 'ignore')
'a < e  ...'
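The "remove combining characters" step can also be done explicitly, which keeps non-ASCII characters such as the en dash instead of silently dropping them. A sketch in Python 3 syntax; strip_combining is a hypothetical helper name, not a library function:

```python
import unicodedata

def strip_combining(text):
    """Decompose with NFKD, then drop characters with a non-zero combining class."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

sample = 'caf\u00e9 \u2013 \u2026'
stripped = strip_combining(sample)

# Encoding with 'ignore' afterwards discards whatever is still non-ASCII,
# which here is only the en dash.
ascii_only = stripped.encode('ascii', 'ignore').decode('ascii')
print(stripped)
print(ascii_only)
```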
Anything further you might do would have to be ad-hoc as there is no accepted standard for folding strings down to ASCII. Windows has one implementation, as does Lucene (ASCIIFoldingFilter). The results are pretty variable.
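If you do decide to fold a few punctuation characters by hand, one ad-hoc option is a small translation table. This is only a sketch in Python 3 syntax: the table contents and the name fold_punctuation are assumptions chosen for illustration, not an accepted standard.

```python
# Hypothetical, hand-picked mapping from code points to ASCII lookalikes.
PUNCTUATION_FOLDS = {
    0x00A0: ' ',    # no-break space
    0x2013: '-',    # en dash
    0x2014: '-',    # em dash
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2026: '...',  # horizontal ellipsis
}

def fold_punctuation(text):
    """Replace the characters listed in PUNCTUATION_FOLDS, leave the rest alone."""
    return text.translate(PUNCTUATION_FOLDS)

folded_text = fold_punctuation('6:30 pm\u20138:45pm and maybe something like \u2026')
print(folded_text)
```

Anything not listed in the table, such as accented letters, passes through unchanged.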

Related

In 'Automating Boring Stuff Using Python' Page 208, I cannot understand this line of code [duplicate]

While asking this question, I realized I didn't know much about raw strings. For somebody claiming to be a Django trainer, this sucks.
I know what an encoding is, and I know what u'' alone does since I get what is Unicode.
But what does r'' do exactly? What kind of string does it result in?
And above all, what the heck does ur'' do?
Finally, is there any reliable way to go back from a Unicode string to a simple raw string?
Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?

There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.
A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).
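A short sketch of the regular expression case (Python 3 syntax; the pattern and sample text are made up): the raw and the doubled-backslash literals produce the same string object contents, the raw one is just easier to read.

```python
import re

escaped = '\\d+\\.\\d+'   # normal literal: every backslash doubled
raw = r'\d+\.\d+'         # raw literal: backslashes written once

same_pattern = (escaped == raw)

match = re.search(raw, 'version 3.14 released')
found = match.group(0)
print(same_pattern, found)
```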
r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).
Not sure what you mean by "going back" - there are intrinsically no back and forward directions, because there's no raw string type; it's just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.
And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.
E.g., consider (Python 2.6):
>>> import sys
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34
The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).

There are two types of string in Python 2: the traditional str type and the newer unicode type. If you type a string literal without the u in front you get the old str type which stores 8-bit characters, and with the u in front you get the newer unicode type that can store any Unicode character.
The r doesn't change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.
ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.
You can try to convert a Unicode string to an old string using the str() function, but if there are any unicode characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course this would cause those characters to be unreadable. It is not recommended to use the str type if you want to correctly handle unicode characters.
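The same trade-off can be sketched in Python 3, where str is the Unicode type and the conversion to a byte string is spelled encode. This is an illustration under that assumption, not the Python 2 str() call described above:

```python
text = 'caf\u00e9 6:30\u20138:45'

# A strict conversion raises as soon as a character has no ASCII equivalent.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With errors='replace', each unrepresentable character becomes a question mark.
replaced = text.encode('ascii', 'replace')
print(strict_failed, replaced)
```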

'raw string' means the literal is stored as it appears. For example, in a raw string literal a '\' is just a backslash rather than the start of an escape sequence.

Let me explain it simply:
In Python 2, you can store a string in two different types.
The first one is the byte string, which is the str type in Python; each character takes 1 byte of memory (256 possible values, enough for the English alphabet and simple symbols).
The second type is Unicode, which is the unicode type in Python. Unicode can store text in any language.
By default, Python 2 uses the str type, but if you want a unicode string you can put u in front of the literal, like u'text', or you can call unicode('text').
So u is a short way to get a unicode object instead of a str. That's it!
Now the r part: you put it in front of the literal to tell Python that the text is raw, so a backslash should not be treated as an escape character. r'\n' will not create a newline character. It's just plain text containing 2 characters.
If you want a unicode string that also contains raw text, use ur, because ru will raise an error.
NOW, the important part:
You cannot store a single backslash by using r; it's the only exception.
So this code will produce an error: r'\'
To store a backslash (only one) you need to use '\\'
If you want to store more than one backslash you can still use r: r'\\' will produce 2 backslashes, as you would expect.
The reason r doesn't work for a single backslash is that, even in a raw literal, a backslash immediately before a quote stops that quote from ending the literal, so r'\' is never terminated. This is by design, not a bug.

A "u" prefix denotes the value has type unicode rather than str.
Raw string literals, with an "r" prefix, escape any escape sequences within them, so len(r"\n") is 2. Because they escape escape sequences, you cannot end a string literal with a single backslash: that's not a valid escape sequence (e.g. r"\").
"Raw" is not part of the type, it's merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.
You can have unicode raw string literals:
>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2
The source file encoding just determines how to interpret the source file, it doesn't affect expressions or types otherwise. However, it's recommended to avoid code where an encoding other than ASCII would change the meaning:
Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.

Unicode string literals
Unicode string literals (string literals prefixed by u) are no longer used in Python 3. They are still valid but just for compatibility purposes with Python 2.
Raw string literals
If you want to create a string literal consisting of only easily typable characters like english letters or numbers, you can simply type them: 'hello world'. But if you want to include also some more exotic characters, you'll have to use some workaround.
One of the workarounds is escape sequences. This way you can, for example, represent a new line in your string simply by adding two easily typable characters \n to your string literal. So when you print the 'hello\nworld' string, the words will be printed on separate lines. That's very handy!
On the other hand, sometimes you might want to include the actual characters \ and n into your string – you might not want them to be interpreted as a new line. Look at these examples:
'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'
In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.
Raw string literals are not completely "raw"?
Many people expect the raw string literals to be raw in a sense that "anything placed between the quotes is ignored by Python". That is not true. Python still recognizes all the escape sequences, it just does not interpret them - it leaves them unchanged instead. It means that raw string literals still have to be valid string literals.
From the lexical definition of a string literal:
string ::= "'" stringitem* "'"
stringitem ::= stringchar | escapeseq
stringchar ::= <any source character except "\" or newline or the quote>
escapeseq ::= "\" <any source character>
It is clear that string literals (raw or not) containing a bare quote character: 'hello'world' or ending with a backslash: 'hello world\' are not valid.
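You can check this claim without breaking your own source file by handing the literals to compile() as text. A sketch in Python 3 syntax; is_valid_literal is a hypothetical helper written for this illustration:

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a single Python expression."""
    try:
        compile(source, '<example>', 'eval')
        return True
    except SyntaxError:
        return False

trailing_backslash = "r'\\'"     # the four characters  r'\'
escaped_quote = "r'\\''"         # the five characters  r'\''
bare_quote = "'hello'world'"     # a quote in the middle of the literal

print(is_valid_literal(trailing_backslash))  # raw literal ending in a backslash
print(is_valid_literal(escaped_quote))       # backslash keeps the quote inside
print(is_valid_literal(bare_quote))

# In the valid raw literal the backslash stays in the value next to the quote.
value = eval(escaped_quote)
```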

Maybe this is obvious, maybe not, but you can make the string '\' by calling x=chr(92)
x=chr(92)
print type(x), len(x) # <type 'str'> 1
y='\\'
print type(y), len(y) # <type 'str'> 1
x==y # True
x is y # False

Remove style of this concrete string?

I extracted the following string from a webpage. It seems to somehow contain font styling, which makes it hard to work with. I would like to convert it to ordinary unstyled characters, using Python.
Here is the string:
𝗸𝗲𝗲𝗽 𝘁𝗮𝗸𝗶𝗻𝗴 𝗽𝗿𝗲𝗰𝗮𝘂𝘁𝗶𝗼𝗻𝘀

The characters in that string are special Unicode codepoints used for mathematical typography. Although they shouldn't be used in other contexts, many webpages abuse Unicode for the purpose of creating styled texts; it is most common in places where HTML styling is not allowed (like StackOverflow comments :-)
As indicated in the comments, you can convert these Unicode characters into ordinary unstyled alphabetic characters using the standard unicodedata module's normalize method to do "compatibility (K) composition (C)" normalization.
unicodedata.normalize("NFKC", "𝗸𝗲𝗲𝗽 𝘁𝗮𝗸𝗶𝗻𝗴 𝗽𝗿𝗲𝗰𝗮𝘂𝘁𝗶𝗼𝗻𝘀")
There are four normalization forms, which combine two axes:
composition or decomposition:
Certain characters (like ñ or Ö) have their own Unicode codepoints, although Unicode also includes a mechanism --zero-width "combining characters"-- to apply decorations ("accents" or "tildes") to any character. The precomposed characters with their own codes are basically there to support older encodings (like ISO-8859-x) which included these as single characters. Ñ, for example, was hexadecimal D1 in ISO-8859-1 ("latin-1"), and it was given the Unicode codepoint U+00D1 to make it easier to convert programs which expected it to be a single character. Latin-1 also includes Õ (as D5), but it does not include T̃; in Unicode, we write T̃ as two characters: a capital T followed by a "combining tilde" (U+0054 U+0303). That means we could write Ñ in two ways: as Ñ, the single composed codepoint U+00D1, or as Ñ, the two-code sequence U+004E U+0303. If your display software is well-tuned, those two possibilities should look identical, and according to the Unicode standard they are semantically identical, but since the codes differ, they won't compare the same in a byte-by-byte comparison.
Composition (C) normalization converts multi-code sequences into their composed single-code versions, where those exist; it would turn U+004E U+0303 into U+00D1.
Decomposition (D) normalization converts the composed single-code characters into the semantically equivalent sequence using combining characters; it would turn U+00D1 into U+004E U+0303
compatibility (K):
Some Unicode codepoints exist only to force particular rendering styles. That includes the styled math characters you encountered, but it also includes ligatures (such as ﬃ), superscript digits (²) or letters (ª) and some characters which have conventional meanings (µ, meaning "one-millionth", which is different from the Greek character μ, or the Angstrom sign Å, which is not the same as the Scandinavian character Å). In compatibility normalization, these characters are changed to the base unstyled character; in some cases, this loses important semantic information, but it can be useful.
All normalizations put codes into "canonical" ordering. Characters with more than one combining mark, such as ḉ, can be written with the combining marks in either order. To make it easier to compare strings which contain such characters, Unicode has a designated combining order, and normalization will reorder combining characters so that they can be easily compared. (Note that this needs to be done after decomposition, since that can change the base character. For example, if the base character is "ç", decomposition normalization will change the base character to "c", and the cedilla will then need to be inserted in the correct place in the sequence of combining marks.)
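The behaviour described above can be checked with a few lines. This is a sketch in Python 3 syntax using escapes so the code points are unambiguous; the variable names are only illustrative:

```python
import unicodedata

# Composition versus decomposition
composed = '\u00d1'        # capital N with tilde as a single code point
decomposed = 'N\u0303'     # capital N followed by COMBINING TILDE
nfc = unicodedata.normalize('NFC', decomposed)
nfd = unicodedata.normalize('NFD', composed)

# Compatibility: only the K forms expand the ffi ligature
ligature = '\ufb03'
nfc_ligature = unicodedata.normalize('NFC', ligature)
nfkc_ligature = unicodedata.normalize('NFKC', ligature)

# Canonical ordering: both mark orders normalize to the same sequence
acute_first = 'c\u0301\u0327'     # c, combining acute, combining cedilla
cedilla_first = 'c\u0327\u0301'   # c, combining cedilla, combining acute
ordered_a = unicodedata.normalize('NFD', acute_first)
ordered_b = unicodedata.normalize('NFD', cedilla_first)

print(composed == decomposed, nfc == composed, nfkc_ligature, ordered_a == ordered_b)
```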

fluphenazine read as \xef\xac\x82uphenazine

When I write
>>> st = "Piperazine (perphenazine, ﬂuphenazine)"
>>> st
'Piperazine (perphenazine, \xef\xac\x82uphenazine)'
What is happening? why doesn't it do this for any fl? How do I avoid this?
It looks like \xef\xac\x82 is not, in fact, fl. Is there any way to 'translate' this character into fl (as the author intended it), without just excluding it via something like
unicode(st, errors='ignore').encode('ascii')

This is what is called a "ligature".
In printing, the f and l characters were typeset with a different amount of space between them from what normal pairs of sequential letters used - in fact, the f and l would merge into one character. Other ligatures include "th", "oe", and "st".
That's what you're getting in your input - the "fl" ligature character, UTF-8 encoded. It's a three-byte sequence. I would take minor issue with your assertion that it's "not, in fact fl" - it really is, but your input is UTF-8 and not ASCII :-). I'm guessing you pasted from a Word document or an ebook or something that's designed for presentation instead of data fidelity (or perhaps, from the content, it was a LaTeX-generated PDF?).
If you want to handle this particular case, you could replace that byte sequence with the ASCII letters "fl". If you want to handle all such cases, you will have to use the Unicode Consortium's "UNIDATA" file at: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt . In that file, there is a column for the "decomposition" of a character. The f-l ligature has the identifier "LATIN SMALL LIGATURE FL". There is, by the way, a Python module for this data file at https://docs.python.org/2/library/unicodedata.html . You want the "decomposition" function:
>>> import unicodedata
>>> foo = u"ﬂuphenazine"
>>> unicodedata.decomposition(foo[0])
'<compat> 0066 006C'
0066 006C is, of course, ASCII 'f' and 'l'.
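The decomposition string can be turned back into characters with a little parsing. A sketch in Python 3 syntax; compat_expansion is a hypothetical helper, and it only expands one level of a tagged (compatibility) mapping rather than performing full normalization:

```python
import unicodedata

def compat_expansion(ch):
    """Expand one character via its tagged compatibility mapping, if it has one."""
    fields = unicodedata.decomposition(ch).split()
    # Compatibility mappings start with a tag such as <compat> or <font>;
    # canonical mappings (for example e-acute) have no tag and are left alone.
    if not fields or not fields[0].startswith('<'):
        return ch
    return ''.join(chr(int(code, 16)) for code in fields[1:])

word = '\ufb02uphenazine'   # starts with LATIN SMALL LIGATURE FL
expanded = ''.join(compat_expansion(ch) for ch in word)
print(expanded)
```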
Be aware that if you're trying to downcast UTF-8 data to ASCII, you're eventually going to have a bad day. There are only 128 ASCII characters, and Unicode has room for over a million code points. There are many code points in Unicode that cannot be readily represented as ASCII in a nonconvoluted way - who wants to have some text end up saying "<TREBLE CLEF> <SNOWMAN> <AIRPLANE> <YELLOW SMILEY FACE>"?

Convert hexadecimal character (ligature) to utf-8 character

I have text content which was converted from a PDF file. There are some unwanted characters in the text and I want to convert them to ordinary UTF-8 characters.
For instance, 'Artificial Immune System' is converted to 'Artiﬁcial Immune System'. The ﬁ is converted as a single character. I used gdex to learn the ASCII value of the character, but I don't know how to replace it with the real value throughout the content.

I guess what you're seeing are ligatures: professional fonts have glyphs that combine several individual characters into a single (better looking) glyph. So instead of writing "f" and "i" as two glyphs, the font has a single "fi" glyph. Compare "fi" (two letters) with "ﬁ" (single glyph).
In Python, you can use the unicodedata module to manipulate Unicode text. You can also exploit the conversion to NFKD normal form to split ligatures:
>>> import unicodedata
>>> unicodedata.name(u'\uFB01')
'LATIN SMALL LIGATURE FI'
>>> unicodedata.normalize("NFKD", u'Arti\uFB01cial Immune System')
u'Artificial Immune System'
So normalizing your strings with NFKD should help you along. If you find that this splits too much, then my best suggestion is to make a small mapping table of the ligatures you want to split and replace the ligatures manually:
>>> ligatures = {0xFB00: u'ff', 0xFB01: u'fi'}
>>> u'Arti\uFB01cial Immune System'.translate(ligatures)
u'Artificial Immune System'
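Rather than typing the table by hand, you could build it from unicodedata for just the ligature code points. A sketch in Python 3 syntax, assuming the Latin ligatures of interest are the ones in the range U+FB00 to U+FB06; split_ligatures is a hypothetical helper name:

```python
import unicodedata

# Map each Latin ligature code point to its NFKD expansion.
LIGATURES = {
    cp: unicodedata.normalize('NFKD', chr(cp))
    for cp in range(0xFB00, 0xFB07)
}

def split_ligatures(text):
    """Expand only the listed ligatures; every other character is untouched."""
    return text.translate(LIGATURES)

result = split_ligatures('Arti\ufb01cial e\ufb03cient x\u00b2')
print(result)
```

Unlike normalizing the whole string with NFKD, this leaves other compatibility characters, such as the superscript two in the sample, as they were.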
Refer to the Wikipedia article to get a list of ligatures in Unicode.

Returning the first N characters of a unicode string

I have a string in unicode and I need to return the first N characters.
I am doing this:
result = unistring[:5]
but of course the length of unicode strings != length of characters.
Any ideas? The only solution is using re?
Edit: More info
unistring = "Μεταλλικα" #Metallica written in Greek letters
result = unistring[:1]
returns-> ?
I think that unicode strings are two bytes (char), that's why this thing happens. If I do:
result = unistring[:2]
I get
M
which is correct,
So, should I always slice*2 or should I convert to something?

Unfortunately, for historical reasons, prior to Python 3.0 there are two string types: byte strings (str) and Unicode strings (unicode).
Prior to the unification in Python 3.0 there are two ways to declare a string literal: unistring = "Μεταλλικα", which is a byte string, and unistring = u"Μεταλλικα", which is a unicode string.
The reason you see ? when you do result = unistring[:1] is that unistring is a byte string, so the slice returns only the first byte of a multi-byte character, and that byte cannot be displayed on its own. You have probably seen this kind of problem if you ever used a really old email client and received emails from friends in countries like Greece, for example.
So in Python 2.x if you need to handle Unicode you have to do it explicitly. Take a look at this introduction to dealing with Unicode in Python: Unicode HOWTO

When you say:
unistring = "Μεταλλικα" #Metallica written in Greek letters
You do not have a unicode string. You have a bytestring in (presumably) UTF-8. That is not the same thing. A unicode string is a separate datatype in Python. You get unicode by decoding bytestrings using the right encoding:
unistring = "Μεταλλικα".decode('utf-8')
or by using the unicode literal in a source file with the right encoding declaration
# coding: UTF-8
unistring = u"Μεταλλικα"
The unicode string will do what you want when you do unistring[:5].
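The difference between slicing bytes and slicing text can be sketched as follows. Python 3 syntax is assumed, where bytes plays the role of the Python 2 byte string and str is the Unicode type; the word is spelled with escapes so the code points are unambiguous:

```python
# "Metallica" in Greek letters, nine code points
word = '\u039c\u03b5\u03c4\u03b1\u03bb\u03bb\u03b9\u03ba\u03b1'

raw = word.encode('utf-8')       # what the byte string holds
first_byte = raw[:1]             # half of the first letter
first_two_bytes = raw[:2]        # exactly the first letter, by luck

text = raw.decode('utf-8')       # back to a real Unicode string
first_five = text[:5]            # slices by code point

print(len(raw), len(text), first_byte, first_five)
```

Each of these Greek letters takes two bytes in UTF-8, which is why slicing the byte string with [:2] happened to work in the question.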

There is no correct, straightforward approach with any type of "Unicode string".
Even a Python "Unicode" string stored as UTF-16 has variable-length characters, so you can't just cut it with ustring[:5], because some Unicode code points take more than one "character" (code unit), i.e. surrogate pairs.
So if you want to cut the first 5 code points (note that these are not characters), you have to analyze the text; see the definitions at http://en.wikipedia.org/wiki/UTF-8 and http://en.wikipedia.org/wiki/UTF-16. You need to use some bit masks to figure out the boundaries.
Even then you still do not get characters. For example, the word "שָלוֹם" (peace in Hebrew, "Shalom") consists of 4 characters and 6 code points: the letter "shin", the vowel "a", the letter "lamed", the letter "vav", the vowel "o" and the final letter "mem".
So a character is not a code point.
The same applies to most Western languages, where a letter with diacritics may be represented as two code points. Search for "unicode normalization", for example.
So, if you really need the first 5 characters, you have to use a tool like the ICU library. For example, there is an ICU binding for Python that provides a character boundary iterator.
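If pulling in ICU is not an option, a rough approximation is to keep combining marks attached to the preceding base character. This sketch (Python 3 syntax, first_n_clusters is a hypothetical helper) is not a full implementation of Unicode text segmentation and will get some scripts and emoji sequences wrong:

```python
import unicodedata

def first_n_clusters(text, n):
    """Return the first n base characters together with their combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch     # attach the mark to the previous base
        else:
            clusters.append(ch)    # start a new cluster
    return ''.join(clusters[:n])

# shin, qamats, lamed, vav, holam, final mem: 6 code points, 4 user-visible characters
shalom = '\u05e9\u05b8\u05dc\u05d5\u05b9\u05dd'
first_two = first_n_clusters(shalom, 2)
print(len(shalom), len(first_two))
```

Here the first two "characters" come back as three code points, because the vowel point stays with its letter.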
