How to capture all letters from different languages in Python?

I have a corpus of different texts from different languages.
I want to capture all characters. I use Python 2.7 and the default encoding setting is utf-8.
I do not know why, when I use this code for a German umlaut, it prints the umlaut correctly:
'Erd\xC3\xA4pfel'.decode('unicode-escape').encode('latin1').decode('utf-8')
Result is:
Erdäpfel
but when I use this code :
'Erd\xC3\xA4pfel'.decode('unicode-escape').encode('utf-8').decode('utf-8')
result is:
ErdÃ¤pfel, which is different.
I am not familiar with text mining. I know that, for example, the latin1 encoding does not contain some French letters, which is not desired in my project.
How can I convert all unicode escape strings in my corpus regardless of their language to respective character?
UTF-8, according to the documentation, covers all languages, so why does it not print the German umlaut correctly, while the latin1 encoding prints it correctly?
PS: Whether the escape sequences use lowercase or uppercase hex digits is not the issue. I have tried both and the results were the same.

You already have UTF-8 encoded data. There are no string-literal escape sequences in your bytestring. You are looking at the repr() output of a string, where non-printable ASCII characters are shown as escape sequences because that makes the value easily copy-pastable in an ASCII-safe way. The \xc3 you see is one byte, not four separate characters:
>>> 'Erd\xC3\xA4pfel'
'Erd\xc3\xa4pfel'
>>> 'Erd\xC3\xA4pfel'[3]
'\xc3'
>>> 'Erd\xC3\xA4pfel'[4]
'\xa4'
>>> print 'Erd\xC3\xA4pfel'
Erdäpfel
You'd have to use a raw string literal or doubled backslashes to actually get escape sequences that unicode-escape would handle:
>>> '\\xc3\\xa4'
'\\xc3\\xa4'
>>> '\\xc3\\xa4'[0]
'\\'
>>> '\\xc3\\xa4'[1]
'x'
>>> '\\xc3\\xa4'[2]
'c'
>>> '\\xc3\\xa4'[3]
'3'
>>> print '\\xc3\\xa4'
\xc3\xa4
Note how there is a separate \ backslash character in that string (echoed as \\, escaped again).
Next to interpreting actual escape sequences, the unicode-escape codec decodes your data as Latin-1, so you end up with a Unicode string with the character U+00C3 LATIN CAPITAL LETTER A WITH TILDE in it. Encoding that back to Latin-1 bytes gives you the \xC3 byte again, and you are back to UTF-8 bytes. Decoding then as UTF-8 works correctly.
But your second attempt encoded the U+00C3 LATIN CAPITAL LETTER A WITH TILDE codepoint to UTF-8, and that encoding gives you the byte sequence \xc3\x83. Printing those bytes to your UTF-8 terminal shows the Ã character. The other byte, \xA4, became U+00A4 CURRENCY SIGN, and the UTF-8 byte sequence for that is \xc2\xa4, which prints as ¤.
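To see both code paths side by side, here is a minimal Python 2 session tracing each round trip (a sketch using the same bytes as above):
>>> s = 'Erd\xc3\xa4pfel'            # UTF-8 bytes for 'Erdäpfel'
>>> u = s.decode('unicode-escape')   # no escapes present, so effectively Latin-1: \xc3 -> U+00C3, \xa4 -> U+00A4
>>> u
u'Erd\xc3\xa4pfel'
>>> u.encode('latin1')               # U+00C3 -> byte \xc3 again: the original UTF-8 restored
'Erd\xc3\xa4pfel'
>>> u.encode('utf-8')                # U+00C3 -> \xc3\x83 and U+00A4 -> \xc2\xa4: mojibake
'Erd\xc3\x83\xc2\xa4pfel'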
There is absolutely no need to decode as unicode-escape here. Just leave the data as is. Or, perhaps, decode as UTF-8 to get a unicode object:
>>> 'Erd\xC3\xA4pfel'.decode('utf8')
u'Erd\xe4pfel'
>>> print 'Erd\xC3\xA4pfel'.decode('utf8')
Erdäpfel
If your actual data (and not the test you did) contains \xhh escape sequences that encode UTF-8, then don't use unicode-escape to decode those sequences either. Use string-escape so you get a byte string containing UTF-8 data (which you can then decode to Unicode as needed):
>>> 'Erd\\xc3\\xa4pfel'
'Erd\\xc3\\xa4pfel'
>>> 'Erd\\xc3\\xa4pfel'.decode('string-escape')
'Erd\xc3\xa4pfel'
>>> 'Erd\\xc3\\xa4pfel'.decode('string-escape').decode('utf8')
u'Erd\xe4pfel'
>>> print 'Erd\\xc3\\xa4pfel'.decode('string-escape').decode('utf8')
Erdäpfel

Related

How to decode hex encoded Cyrillic string?

I have a hex encoded Cyrillic string "041E043F043B04300442".
How can I convert into a text string?
I tried this way:
import codecs
codecs.decode('041E043F043B04300442', 'hex').decode('utf-16')
'Ḅ㼄㬄〄䈄'
But I'm getting wrong symbols.
As I see from the Unicode symbols list, the first symbol should be a Cyrillic symbol:
U+041E Cyrillic Capital Letter O
What am I doing wrong?
I had to use another codec:
codecs.decode('041E043F043B04300442', 'hex').decode('utf-16be')
Now it is being decoded fine.
utf-16 defaults to the machine's endian-ness unless a byte order mark (BOM, U+FEFF) is present. Your machine appears to be little-endian, but the data is big-endian:
>>> bytes.fromhex("041E043F043B04300442").decode('utf-16')
'Ḅ㼄㬄〄䈄'
>>> bytes.fromhex("041E043F043B04300442").decode('utf-16le')
'Ḅ㼄㬄〄䈄'
>>> bytes.fromhex("041E043F043B04300442").decode('utf-16be')
'Оплат'
(English: payment)
With a correct BOM added, utf-16 can work:
>>> bytes.fromhex("FEFF041E043F043B04300442").decode('utf-16')
'Оплат'
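If you want to check what your machine defaults to, a quick Python 3 sketch (assuming a little-endian machine, as in the answer):
>>> import sys
>>> sys.byteorder
'little'
>>> 'О'.encode('utf-16le').hex()   # little-endian: low byte first
'1e04'
>>> 'О'.encode('utf-16be').hex()   # big-endian: matches the hex string in the question
'041e'
>>> 'О'.encode('utf-16').hex()     # bare utf-16 writes a BOM, then uses native order
'fffe1e04'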

How come I can decode a UTF-8 byte string to ISO8859-1 and back again without any UnicodeEncodeError/UnicodeDecodeError?

How come the following works without any errors in Python?
>>> '你好'.encode('UTF-8').decode('ISO8859-1')
'ä½\xa0好'
>>> _.encode('ISO8859-1').decode('UTF-8')
'你好'
I would have expected it to fail with a UnicodeEncodeError or UnicodeDecodeError.
Is there some property of ISO8859-1 and UTF-8 such that I can take any UTF-8 encoded string and decode it to a ISO8859-1 string, which can later be reversed to get the original UTF-8 string?
I'm working with an older database that only supports the ISO8859-1 character set. It seems like the developers were able to store Chinese and other languages in this database by decoding UTF-8 encoded strings into ISO8859-1, and storing the resulting garbage string in the database. Downstream systems which query this database then have to encode the garbage string in ISO8859-1 and then decode the result with UTF-8 to get the correct string.
I would have assumed that such a process would not work at all.
What am I missing?
The special property of ISO-8859-1 is that the 256 characters it represents correspond 1:1 with the first 256 Unicode code points, so byte 00h decodes to U+0000, and byte FFh decodes to U+00FF.
So if you encode as UTF-8 and decode as ISO-8859-1 you get a Unicode string made up of code points whose values match the UTF-8 bytes encoded:
>>> s = '你好'
>>> s.encode('utf8').hex()
'e4bda0e5a5bd'
>>> u = s.encode('utf8').decode('iso-8859-1')
>>> u
'ä½\xa0好'
>>> for c in u:
...     print(f'{c} U+{ord(c):04X}')
...
ä U+00E4 # Unicode code points are the same as the bytes of UTF-8.
½ U+00BD
  U+00A0
å U+00E5
¥ U+00A5
½ U+00BD
>>> u.encode('iso-8859-1').hex() # transform back to bytes.
'e4bda0e5a5bd'
>>> u.encode('iso-8859-1').decode('utf8') # and decode to UTF-8 again.
'你好'
Any 8-bit encoding that has a representation for all 256 bytes would also work; it just wouldn't be a 1:1 mapping to the first 256 Unicode code points. Code Page 1256 is one such encoding:
>>> for c in s.encode('utf8').decode('cp1256'):
...     print(f'{c} U+{ord(c):04X}')
...
ن U+0646 # This would still .encode('cp1256') back to byte E4, for example
½ U+00BD
  U+00A0
ه U+0647
¥ U+00A5
½ U+00BD
No, there is no special property of ISO8859-1 here, just a property common to many 8-bit encodings: they accept all byte values from 0 to 255.
So your decode('ISO8859-1') simply transforms bytes into 256 characters (and control codes) in a unique way. Then you perform the inverse operation, so you lose nothing.
This works with most old 8-bit encodings: every byte just needs a corresponding Unicode codepoint (because Python expects strings to be Unicode strings).
Note: ISO8859-1 really is special with respect to Unicode: the first 256 codepoints of Unicode correspond to the Latin-1 characters (with the same numbers). But this doesn't matter much for your experiment.
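You can verify that property directly with a small Python 3 check:
>>> all_bytes = bytes(range(256))
>>> u = all_bytes.decode('iso-8859-1')    # never raises: every byte maps to a codepoint
>>> u.encode('iso-8859-1') == all_bytes   # and the round trip is lossless
True
>>> ord(u[0xE4])                          # byte E4 maps to U+00E4, the 1:1 Latin-1 property
228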

Encode Decode using python

I have this function in python
Str = "ü";
print Str
def correctText( str ):
    str = str.upper()
    correctedText = str.decode('UTF8').encode('Windows-1252')
    return correctedText;
corText = correctText(Str);
print corText
It works and converts characters like ü and é, however it fails when I try Ã? and Â¶.
Is there a way I can fix it?
According to UTF8, Ã? and Â¶ are not valid characters, meaning their byte sequences do not form valid UTF-8. What you need to do is either use some other kind of encoding or strip out the errors in your str by using the unicode() function. I recommend the latter.
What you are trying to do is compose valid UTF-8 byte sequences from several consecutive Windows-1252 characters.
For example, for ü, the Windows-1252 code of Ã is C3 and for ¼ it's BC. Together, the code C3BC happens to be the UTF-8 code of ü.
Now, for Ã?, the Windows-1252 code is C33F, which is not a valid UTF-8 code (because the second byte does not start with 10).
Are you sure this sequence occurs in your text? For example, for à, the Windows-1252 decoding of its UTF-8 code (C3A0) is Ã followed by a non-printable character (a non-breaking space). So, if this second character is not printed, the ? might be a regular character of the text.
For Â¶ the Windows-1252 encoding is C2B6. Shouldn't it be Ã¶, for which the Windows-1252 encoding is C3B6, which equals the UTF-8 code of ö?
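To illustrate the successful case in code, a Python 2 sketch composing the UTF-8 bytes of ü from the two Windows-1252 characters Ã (C3) and ¼ (BC):
>>> u'\xc3\xbc'                          # the characters Ã and ¼
u'\xc3\xbc'
>>> u'\xc3\xbc'.encode('windows-1252')   # the bytes C3 BC
'\xc3\xbc'
>>> print u'\xc3\xbc'.encode('windows-1252').decode('utf-8')
ü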

Umlaut in raw_input()

I am currently learning Python and I came across the following code:
text = raw_input()
for letter in text:
    x = [alpha_dic[letter]]
    print x
When I enter an umlaut (which is in the dictionary, by the way) it gives me an error like KeyError: '\xfc' (for ü in this case), because the umlauts are stored internally in this way! I saw some solutions with unicode encoding or utf, but either I am not skilled enough to apply them correctly or maybe they simply do not work that way.
You get into trouble because of multiple shortcomings in Python (2.x):
- raw_input() gives you raw bytes from the system with no encoding info
- the native encoding for Python strings is 'ascii', which cannot represent 'ü'
- the encoding of the literal in your script is either ascii or needs to be declared in a header at the top of the file
So if you have a simple file like this:
x = {'ü': 20, 'ä': 10}
And run it with python you get an error, because the encoding is unknown:
SyntaxError: Non-ASCII character '\xfc' in file foo.py on line 1, but no encoding declared;
see http://python.org/dev/peps/pep-0263/ for details
This can be fixed, of course, by adding an encoding header to the file and turning the literals into unicode literals.
For example, if the encoding is CP1252 (like a German Windows GUI):
# -*- coding: cp1252 -*-
x = {u'ü': 20, u'ä':30}
print repr(x)
This prints:
{u'\xfc': 20, u'\xe4': 30}
But if you get the header wrong (e.g. write CP850 instead of CP1252, but keep the same content), it prints:
{u'\xb3': 20, u'\xf5': 30}
Totally different.
So first check that your editor settings match the encoding header in your file, otherwise all non-ascii literals will simply be wrong.
Next step is fixing raw_input(). It does what it says it does, providing you raw input from the console. Just bytes. But an 'ü' can be represented by a lot of different bytes: 0xfc for ISO-8859-1, CP1252, CP850 etc., 0xc3 + 0xbc in UTF-8, 0x00 + 0xfc or 0xfc + 0x00 in UTF-16, and so on.
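For example:
>>> u'\xfc'.encode('latin-1')    # ü in ISO-8859-1 / CP1252: one byte
'\xfc'
>>> u'\xfc'.encode('utf-8')      # ü in UTF-8: two bytes
'\xc3\xbc'
>>> u'\xfc'.encode('utf-16le')   # ü in little-endian UTF-16
'\xfc\x00'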
So your code has two issues with that:
for letter in text:
If text happens to be a simple byte string in a multibyte encoding (e.g. UTF-8, UTF-16, some others), one byte is not equal to one letter, so iterating like that over the string will not do what you expect; a quick sketch of the difference follows below. For a very simplified view of a letter, you might be able to do that kind of iteration with Python unicode strings (if properly normalized). So you need to make sure text is a unicode string first.
How do you convert from a byte string to unicode? A byte string offers the decode() method, which takes an encoding. A good first guess for that encoding is sys.stdin.encoding or locale.getpreferredencoding(True).
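A quick Python 2 sketch of the difference:
>>> b = u'\xfc'.encode('utf-8')   # the letter ü as UTF-8 bytes
>>> list(b)                       # iterating the byte string: two items, neither is a letter
['\xc3', '\xbc']
>>> list(b.decode('utf-8'))       # iterating the unicode string: one letter
[u'\xfc']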
Putting things together:
import sys, locale

alpha_dict = {u'\xfc': u'small umlaut u'}
text = raw_input()
# turn text into unicode
utext = text.decode(sys.stdin.encoding or locale.getpreferredencoding(True))
# iterate over the unicode string, not really letters...
for letter in utext:
    x = [alpha_dict[letter]]
    print x
I got this to work borrowing from this answer:
# -*- coding: utf-8 -*-
import sys, locale

alpha_dict = {u"ü": "umlaut"}
text = raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))
for letter in text:
    x = [alpha_dict[unicode(letter)]]
    print x
>>> ü
>>> ['umlaut']
Python 2 and unicode are not for the faint of heart...

Python: decoding a string that consists of both unicode code points and unicode text

Parsing some HTML content I got the following string:
АБВ\u003d\"res
The common advice on handling it appears to be to decode using unicode_escape. However, this results in the following:
ÐÐÐ="res
The escaped characters get correctly decoded, but the cyrillic letters get mangled for some reason. Other than using regexes to extract everything that looks like a unicode escape, decoding only those using unicode_escape and then putting everything back into a new string, what other methods exist to decode strings with unicode code points in Python?
unicode_escape treats the input as Latin-1 encoded; any bytes that do not form a Python string-literal escape sequence are decoded by mapping bytes directly to Unicode codepoints. You gave it UTF-8 bytes, so the cyrillic characters are represented with 2 bytes each, and each pair was decoded to two Latin-1 characters, one of which is U+00D0 Ð, the other unprintable:
>>> print repr('АБВ\\u003d\\"res')
'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print repr('АБВ\\u003d\\"res'.decode('latin1'))
u'\xd0\x90\xd0\x91\xd0\x92\\u003d\\"res'
>>> print 'АБВ\\u003d\\"res'.decode('latin1')
ÐÐÐ\u003d\"res
This kind of mis-decoding is called a Mojibake, and can be repaired by re-encoding to Latin-1, then decoding from the correct codec (UTF-8 in your case):
>>> print 'АБВ\u003d\\"res'.decode('unicode_escape')
ÐÐÐ="res
>>> print 'АБВ\u003d\\"res'.decode('unicode_escape').encode('latin1').decode('utf8')
АБВ="res
Note that this will fail if the \uhhhh escape sequences encode codepoints outside of the Latin-1 range (U+0000-U+00FF).
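For example (Python 2, with a made-up string containing a Cyrillic escape):
>>> u = 'x\\u0446'.decode('unicode_escape')   # \u0446 is ц, outside the Latin-1 range
>>> u.encode('latin1')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0446' in position 1: ordinal not in range(256)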
The Python 3 equivalent of the above uses codecs.decode():
>>> import codecs
>>> codecs.decode('АБВ\\u003d\\"res', 'unicode_escape').encode('latin1').decode('utf8')
'АБВ="res'
The regex really is the easiest solution (Python 3):
import re

text = 'АБВ\\u003d\\"re'
re.sub(r'(?i)(?<!\\)(?:\\\\)*\\u([0-9a-f]{4})', lambda m: chr(int(m.group(1), 16)), text)
This works fine with any 4-nibble Unicode escape, and can be pretty easily extended to other escapes.
For Python 2, make all strings u'' strings, and use unichr.
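For example, applied to the sample string in Python 3:
>>> import re
>>> text = 'АБВ\\u003d\\"re'
>>> re.sub(r'(?i)(?<!\\)(?:\\\\)*\\u([0-9a-f]{4})',
...        lambda m: chr(int(m.group(1), 16)), text)
'АБВ=\\"re'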
