fluphenazine read as \xef\xac\x82uphenazine - python

When I write
>>> st = "Piperazine (perphenazine, fluphenazine)"
>>> st
'Piperazine (perphenazine, \xef\xac\x82uphenazine)'
What is happening? Why doesn't it do this for every "fl"? How do I avoid this?
It looks like \xef\xac\x82 is not, in fact, fl. Is there any way to 'translate' this character into fl (as the author intended it), without just excluding it via something like
unicode(st, errors='ignore').encode('ascii')

This is what is called a "ligature".
In printing, the f and l characters were typeset with a different amount of space between them from what normal pairs of sequential letters used - in fact, the f and l would merge into one character. Other ligatures include "th", "oe", and "st".
That's what you're getting in your input - the "fl" ligature character, UTF-8 encoded. It's a three-byte sequence. I would take minor issue with your assertion that it's "not, in fact fl" - it really is, but your input is UTF-8 and not ASCII :-). I'm guessing you pasted from a Word document or an ebook or something that's designed for presentation instead of data fidelity (or perhaps, from the content, it was a LaTeX-generated PDF?).
If you want to handle this particular case, you could replace that byte sequence with the ASCII letters "fl". If you want to handle all such cases, you will have to use the Unicode Consortium's "UNIDATA" file at: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt . In that file, there is a column for the "decomposition" of a character. The f-l ligature has the identifier "LATIN SMALL LIGATURE FL". There is, by the way, a Python module for this data file at https://docs.python.org/2/library/unicodedata.html . You want the "decomposition" function:
>>> import unicodedata
>>> foo = u'\ufb02uphenazine'   # the first character is U+FB02, LATIN SMALL LIGATURE FL
>>> unicodedata.decomposition(foo[0])
'<compat> 0066 006C'
0066 006C is, of course, ASCII 'f' and 'l'.
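If you just want the whole string folded so the ligature becomes a plain "fl", compatibility normalization applies that decomposition for you. A minimal sketch, assuming Python 2 and the byte string from the question:
>>> import unicodedata
>>> st = "Piperazine (perphenazine, \xef\xac\x82uphenazine)"
>>> unicodedata.normalize('NFKC', st.decode('utf-8'))
u'Piperazine (perphenazine, fluphenazine)'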
Be aware that if you're trying to downcast UTF-8 data to ASCII, you're eventually going to have a bad day. There are only 128 ASCII characters, and Unicode has over a million code points. There are many code points in Unicode that cannot be readily represented as ASCII in a nonconvoluted way - who wants to have some text end up saying "<TREBLE CLEF> <SNOWMAN> <AIRPLANE> <YELLOW SMILEY FACE>"?

Identify garbage unicode string using python

My script reads data from a CSV file; the file can contain strings of English or non-English words.
Sometimes the file has garbage strings; I want to identify those strings, skip them, and process the others.
import codecs
import csv

doc = codecs.open(input_text_file, "rb", 'utf_8_sig')
fob = csv.DictReader(doc)
for entry in fob:
    if is_valid_unicode_str(entry['Name']):
        process_further(entry)      # placeholder for the real processing

def is_valid_unicode_str(value):
    try:
        # ... some check goes here ...
        return True
    except UnicodeEncodeError:
        return False
csv input:
"Name"
"袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥Â­Âå•â€"
"元大寶來證券"
"John Dove"
I want to define a function is_valid_unicode_str() which will identify the garbage strings so that only the valid ones are processed.
I tried to use decode, but it doesn't fail while decoding the garbage strings:
value.decode('utf8')
The expected output is that the Chinese and English strings are processed.
Could you please guide me on how I can implement a function to filter out the invalid strings?
(ftfy developer here)
I've figured out that the text is likely to be '袋袋与朋友们电子商'. I had to guess at the characters 友, 子, and 商, because some unprintable characters are missing from the string in your question. When guessing, I picked the most common character from the small number of possibilities. And I don't know where the "dcx" goes or why it's there.
Google Translate is not very helpful here but it seems to mean something about e-commerce.
So here's everything that happened to your text:
It was encoded as UTF-8 and decoded incorrectly as sloppy-windows-1252, twice
It had the letters "dcx" inserted into the middle of a UTF-8 sequence
Characters that don't exist in windows-1252 -- with byte values 81, 8d, 8f, 90, and 9d -- were removed
A non-breaking space (byte value a0) was removed from the end
If just the first problem had happened, ftfy.fix_text_encoding would be able to fix it. It's possible that the remaining problems just happened while you were trying to get the string onto Stack Overflow.
So here's my recommendation:
Find out who keeps decoding the data incorrectly as sloppy-windows-1252, and get them to decode it as UTF-8 instead.
If you end up with a string like this again, try ftfy.fix_text_encoding on it.
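A minimal usage sketch of that second suggestion, assuming an ftfy release from that era where fix_text_encoding is still the exported name (later releases renamed it to ftfy.fix_encoding); repair_cell is just an illustrative wrapper:
import ftfy

def repair_cell(cell):
    # cell is a unicode string read from the CSV; if its only problem is the
    # UTF-8-decoded-as-(sloppy-)windows-1252 mistake described above,
    # fix_text_encoding should undo it.
    return ftfy.fix_text_encoding(cell)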
You have Mojibake strings; text encoded to one (correct) codec, then decoded as another.
In this case, your text was decoded with the Windows 1252 codepage; the U+20AC EURO SIGN in the text is typical of CP1252 Mojibake. The original encoding could be one of the GB* family of Chinese encodings, or a multiple-roundtrip UTF-8 - CP1252 Mojibake. I cannot determine which, since I cannot read Chinese and I do not have your full data; CP1252 Mojibake includes unprintable characters like the 0x81 and 0x8D bytes that might have gotten lost when you posted your question here.
I'd install the ftfy project; it won't fix GB* encodings (I requested the project add support), but it includes a new codec called sloppy-windows-1252 that'll let you reverse an erroneous decode with that codec:
>>> import ftfy # registers extra codecs on import
>>> text = u'袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥Â­Âå•â€'
>>> print text.encode('sloppy-windows-1252').decode('gb2312', 'replace')
猫垄�姑�⑩dcx�盲赂沤忙��姑ヂ�姑ぢ宦�р�得ヂ�氓�⑩�
>>> print text.encode('sloppy-windows-1252').decode('gbk', 'replace')
猫垄鈥姑�⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦�р�得ヂ�氓鈥⑩�
>>> print text.encode('sloppy-windows-1252').decode('gb18030', 'replace')
猫垄鈥姑⑩dcx�盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦р�得ヂ氓鈥⑩�
>>> print text.encode('sloppy-windows-1252').decode('utf8', 'ignore').encode('sloppy-windows-1252').decode('utf8', 'replace')
袋�dcx与朋�们���
The � U+FFFD REPLACEMENT CHARACTER shows the decoding wasn't entirely successful, but that could be because the string you copied here is missing anything unprintable, or anything that used the 0x81 or 0x8D bytes.
You can try to fix your data this way; from the file data, try to decode to one of the GB* codecs after encoding to sloppy-windows-1252, or roundtrip from UTF-8 twice and see what fits best.
If that's not good enough (you cannot fix the data) you can use the ftfy.badness.sequence_weirdness() function to try and detect the issue:
>>> from ftfy.badness import sequence_weirdness
>>> sequence_weirdness(text)
9
>>> sequence_weirdness(u'元大寶來證券')
0
>>> sequence_weirdness(u'John Dove')
0
Mojibakes score high on the sequence-weirdness scale. You could try to find an appropriate threshold for your data, above which you'd consider the data most likely to be corrupted.
However, I think we can use a non-zero return value as a starting point for another test. English text should score 0 on that scale, and so should Chinese text. Chinese mixed with English can still score over 0, but you cannot then encode that Chinese text with the CP-1252 codec, while you can with the broken text:
import ftfy  # registers the sloppy-windows-1252 codec on import
from ftfy.badness import sequence_weirdness

def is_valid_unicode_str(text):
    if not sequence_weirdness(text):
        # nothing weird, should be okay
        return True
    try:
        text.encode('sloppy-windows-1252')
    except UnicodeEncodeError:
        # Not CP-1252 encodable, probably fine
        return True
    else:
        # Encodable as CP-1252, Mojibake alert level high
        return False
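Here is a sketch of wiring that check into a CSV loop like the one in the question; the file name, the UTF-8 assumption and the print stand-in are placeholders:
import csv

with open('input.csv', 'rb') as doc:               # hypothetical file name
    for entry in csv.DictReader(doc):
        # The Python 2 csv module yields byte strings, so decode before checking.
        # (If the file starts with a UTF-8 BOM, as utf_8_sig suggests, strip it
        # from the first header name or decode that header with 'utf-8-sig'.)
        name = entry['Name'].decode('utf-8')
        if is_valid_unicode_str(name):
            print name                             # stand-in for the real processing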

Python uses three unicode characters to represent an asian fullstop? This is weird?

The python file:
# -*- coding: utf-8 -*-
print u"。"
print [u"。".encode('utf8')]
Produces:
。
['\xe3\x80\x82']
Why does Python use 3 characters to store my 1 full stop? This is really strange; if you print each one out individually, they are all different as well. Any ideas?
In UTF-8, three bytes (not really characters) are used to represent code points from U+0800 through U+FFFF, such as this character, IDEOGRAPHIC FULL STOP (U+3002).
Try dumping the script file with od -x. You should find the same three bytes used to represent the character there.
UTF-8 is a multibyte character representation so characters that are not ASCII will take up more than one byte.
Looks correctly UTF-8 encoded to me. See here for an explanation about UTF-8 encoding.
The latest version of Unicode supports more than 109,000 characters in 93 different scripts. Mathematically, the minimum number of bytes you'd need to encode that number of code points is 3, since this is 17 bits' worth of information. (Unicode actually reserves a 21-bit range, but this still fits in 3 bytes.) You might therefore reasonably expect every character to need 3 bytes in the most straightforward imaginable encoding, in which each character is represented as an integer using the smallest possible whole number of bytes. (In fact, as pointed out by dan04, you need 4 bytes to get all of Unicode's functionality.)
A common data compression technique is to use short tokens to represent frequently-occurring elements, even though this means that infrequently-occurring elements will need longer tokens than they otherwise might. UTF-8 is a Unicode encoding that uses this approach to store text written in English and other European languages in fewer bytes, at the cost of needing more bytes for text written in other languages. In UTF-8, the most common Latin characters need only 1 byte (UTF-8 overlaps with ASCII for the convenience of English users), and other common characters need only 2 bytes. But some characters need 3 or even 4 bytes, which is more than they'd need in a "naive" encoding. The particular character you're asking about needs 3 bytes in UTF-8 by definition.
In UTF-16, as it happens, this code point would need only 2 bytes, though other characters will need 4 (there are no 3-byte characters in UTF-16). If you are truly concerned with space efficiency, do as John Machin suggests in his comment and use an encoding that is designed to be maximally space-efficient for your language.
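As a quick way to see this for yourself, you can ask Python how many bytes this code point takes in a few encodings (expected output in the comments):
ch = u'\u3002'                        # IDEOGRAPHIC FULL STOP
for codec in ('utf-8', 'utf-16-be', 'utf-32-be'):
    print codec, len(ch.encode(codec))
# utf-8 3
# utf-16-be 2
# utf-32-be 4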

Decoding html content and HTMLParser

I'm creating a sub-class based on 'HTMLParser' to pull out html content. Whenever I have character refs such as
'&nbsp;' '&amp;' '&ndash;' '&hellip;'
I'd like to replace them with their English counterparts of
' ' (space), '&', '-', '...', and so on.
What's the best way to convert some of the simple character refs into their correct representation?
My text is similar to:
Some text goes here&amp;after that, 6:30 pm&ndash;8:45pm and maybe
something like &hellip;
I would like to convert this to:
Some text goes here & after that, 6:30 pm-8:45pm and maybe
something like ...
Your question has two parts. The easy part is decoding the HTML entities. The easiest way to do that is to grab this undocumented but long-stable method from the HTMLParser module:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('a &lt; &eacute; &ndash; &hellip;')
u'a < é – …'
The second part, converting Unicode characters to ASCII lookalikes, is trickier and also quite questionable. I would try to retain the Unicode en-dash ‘–’ and similar typographical niceties, rather than convert them down to characters like the plain hyphen and straight-quotes. Unless your application can't handle non-ASCII characters at all you should aim to keep them as they are, along with all other Unicode characters.
The specific case of the U+2026 ellipsis character is potentially different because it's a ‘compatibility character’, included in Unicode only for lossless round-tripping to other encodings that feature it. Preferably you'd just type three dots, and let the font's glyph combination logic work out exactly how to draw it.
If you want to just replace compatibility characters (like this one, explicit ligatures, the Japanese fullwidth numbers, and a handful of other oddities), you could try normalising your string to Normal Form KC:
>>> import unicodedata
>>> unicodedata.normalize('NFKC', u'a < é – …')
u'a < é – ...'
(Care, though: some other characters that you might have wanted to keep are also compatibility characters, including ‘²’.)
The next step would be to turn letters with diacriticals into plain letters, which you could do by normalising to NFKD instead and then removing all characters that have the ‘combining’ character class from the string. That would give you plain ASCII for the previously-accented Latin letters, albeit in a way that is not linguistically correct for many languages. If that's all you care about, you could encode straight to ASCII:
>>> unicodedata.normalize('NFKD', u'a < é – …').encode('us-ascii', 'ignore')
'a < e  ...'
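If you only want to drop the combining marks while keeping other non-ASCII characters intact (rather than discarding everything outside ASCII), here is a minimal sketch of the NFKD-plus-filter approach described above:
>>> import unicodedata
>>> decomposed = unicodedata.normalize('NFKD', u'a café – …')
>>> u''.join(ch for ch in decomposed if not unicodedata.combining(ch))
u'a cafe – ...'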
Anything further you might do would have to be ad-hoc as there is no accepted standard for folding strings down to ASCII. Windows has one implementation, as does Lucene (ASCIIFoldingFilter). The results are pretty variable.

Search and replace characters in a file with Python

I am trying to do transliteration, where I need to replace every source character in English from a file with its equivalent in another language (in Unicode format) from a dictionary I have defined in the source code. I am now able to read the file character by character; how do I look up each character's equivalent in the dictionary and make sure it is written to a new transliterated output file? Thank you :).
The translate method of Unicode objects is the simplest and fastest way to perform the transliteration you require. (I assume you're using Unicode, not plain byte strings which would make it impossible to have characters such as 'पत्र'!).
All you have to do is lay out your transliteration dictionary in a precise way, as specified in the docs to which I pointed you:
each key must be an integer, the codepoint of a Unicode character; for example, 0x0904 is the codepoint for ऄ, AKA "DEVANAGARI LETTER SHORT A", so for transliterating it you would use as the key in the dict the integer 0x0904 (equivalently, decimal 2308). (For a table with the codepoints for many South-Asian scripts, see this pdf).
the corresponding value can be a Unicode ordinal, a Unicode string (which is presumably what you'll use for your transliteration task, e.g. u'a' if you want to transliterate the Devanagari letter short A into the English letter 'a'), or None (if during the "transliteration" you want to simply remove instances of that Unicode character).
Characters that aren't found as keys in the dict are passed on untouched from the input to the output.
Once your dict is laid out like that, output_text = input_text.translate(thedict) does all the transliteration for you -- and pretty darn fast, too. You can apply this to blocks of Unicode text of any size that will fit comfortably in memory -- basically, doing one text file at a time will be just fine on most machines (e.g., the wonderful -- and huge -- Mahabharata takes at most a few tens of megabytes in any of the freely downloadable forms -- Sanskrit [cross-linked with both Devanagari and roman-transliterated forms], English translation -- available from this site).
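A minimal sketch of that dict layout; the two example mappings and the file names are purely illustrative, not a real transliteration table:
import codecs

trans = {
    0x0904: u'a',    # DEVANAGARI LETTER SHORT A -> 'a'
    0x092A: u'p',    # DEVANAGARI LETTER PA -> 'p'
}

with codecs.open('input.txt', 'r', 'utf-8') as f:      # hypothetical file names
    input_text = f.read()                              # a unicode object
output_text = input_text.translate(trans)              # keys are integer code points
with codecs.open('output.txt', 'w', 'utf-8') as out:
    out.write(output_text)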
Note: Updated after clarifications from questioner. Please read the comments from the OP attached to this answer.
Something like this:
for syllable in input_text.split_into_syllables():
    output_file.write(d[syllable])
Here output_file is a file object, open for writing. d is a dictionary where the indexes are your source characters and the values are the output characters. You can also try to read your file line-by-line instead of reading it all in at once.

Returning the first N characters of a unicode string

I have a string in unicode and I need to return the first N characters.
I am doing this:
result = unistring[:5]
but of course the length of unicode strings != length of characters.
Any ideas? The only solution is using re?
Edit: More info
unistring = "Μεταλλικα" #Metallica written in Greek letters
result = unistring[:1]
returns-> ?
I think that unicode strings are two bytes (char), that's why this thing happens. If I do:
result = unistring[:2]
I get
M
which is correct,
So, should I always slice*2 or should I convert to something?
Unfortunately, for historical reasons, prior to Python 3.0 there are two string types: byte strings (str) and Unicode strings (unicode).
Prior to the unification in Python 3.0 there are two ways to declare a string literal: unistring = "Μεταλλικα" which is a byte string and unistring = u"Μεταλλικα" which is a unicode string.
The reason you see ? when you do result = unistring[:1] is because some of the characters in your Unicode text cannot be correctly represented in the non-unicode string. You have probably seen this kind of problem if you ever used a really old email client and received emails from friends in countries like Greece for example.
So in Python 2.x if you need to handle Unicode you have to do it explicitly. Take a look at this introduction to dealing with Unicode in Python: Unicode HOWTO
When you say:
unistring = "Μεταλλικα" #Metallica written in Greek letters
You do not have a unicode string. You have a bytestring in (presumably) UTF-8. That is not the same thing. A unicode string is a separate datatype in Python. You get unicode by decoding bytestrings using the right encoding:
unistring = "Μεταλλικα".decode('utf-8')
or by using the unicode literal in a source file with the right encoding declaration
# coding: UTF-8
unistring = u"Μεταλλικα"
The unicode string will do what you want when you do unistring[:5].
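A tiny demonstration of the difference, assuming Python 2 and a source file saved as UTF-8:
# -*- coding: utf-8 -*-
bytestring = "Μεταλλικα"                 # 18 UTF-8 bytes: each Greek letter is 2 bytes
unistring = bytestring.decode('utf-8')   # 9 characters

print len(bytestring)     # 18
print len(unistring)      # 9
print unistring[:5]       # Μεταλ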
There is no correct straight-forward approach with any type of "Unicode string".
Even Python "Unicode" UTF-16 string has variable length characters so, you can't just cut with ustring[:5]. Because some Unicode Code points may use more then one "character" i.e. Surrogate pairs.
So if you want to cut 5 code points (note these are not characters) so you may analyze the text, see http://en.wikipedia.org/wiki/UTF-8 and http://en.wikipedia.org/wiki/UTF-16 definitions. So you need to use some bit masks to figure out boundaries.
Also, you still do not get characters. For example, the word "שָלוֹם" -- "Shalom", peace in Hebrew -- consists of 4 characters but 6 code points: the letter "shin", the vowel "a", the letter "lamed", the letter "vav", the vowel "o", and the final letter "mem".
So a character is not the same as a code point.
The same is true for most Western languages, where a letter with diacritics may be represented as two code points. Search, for example, for "unicode normalization".
So... if you really need the first 5 characters, you have to use tools like the ICU library. For example, there is an ICU binding for Python that provides a character boundary iterator.
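If pulling in ICU is too heavy for your case, you can roughly approximate character boundaries for the common "base letter plus combining marks" pattern with just the standard library; this is only an approximation of what a real ICU boundary iterator does (it ignores surrogate pairs and other grapheme-cluster rules):
import unicodedata

def first_n_graphemes(text, n):
    # Return roughly the first n user-perceived characters of a unicode string.
    graphemes = []
    for ch in text:
        if unicodedata.combining(ch) and graphemes:
            graphemes[-1] += ch       # keep combining marks attached to their base
        else:
            graphemes.append(ch)
    return u''.join(graphemes[:n])

print first_n_graphemes(u'\u05e9\u05b8\u05dc\u05d5\u05b9\u05dd', 4)   # the Hebrew "Shalom" example above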
