Python - Reading Emoji Unicode Characters - python

I have a Python 2.7 program which reads iOS text messages from a SQLite database. The text messages are unicode strings. In the following text message:
u'that\u2019s \U0001f63b'
The apostrophe is represented by \u2019, but the emoji is represented by \U0001f63b. I looked up the code point for the emoji in question, and it's \uf63b. I'm not sure where the 0001 is coming from. I know comically little about character encodings.
When I print the text, character by character, using:
s = u'that\u2019s \U0001f63b'
for c in s:
print c.encode('unicode_escape')
The program produces the following output:
t
h
a
t
\u2019
s
\ud83d
\ude3b
How can I correctly read these last characters in Python? Am I using encode correctly here? Should I just attempt to trash those 0001s before reading it, or is there an easier, less silly way?

I don't think you're using encode correctly, nor do you need to. What you have is a valid unicode string with one 4 digit and one 8 digit escape sequence. Try this in the REPL on, say, OS X
>>> s = u'that\u2019s \U0001f63b'
>>> print s
that’s 😻
In python3, though -
Python 3.4.3 (default, Jul 7 2015, 15:40:07)
>>> s = u'that\u2019s \U0001f63b'
>>> s[-1]
'😻'

Your last part of confusion is likely due to the fact that you are running what is called a "narrow Python build". Python can't hold a single character with enough information to hold a single emoji. The best solution would be to move to Python 3. Otherwise, try to process the UTF-16 surrogate pair.

Related

Why under specific enviroments UNICODE characters trigger an EncodeError and at others not?

I am making a Python project that needs to work with Greek characters print, edit and return strings.
On my main PC that has the Greek language installed everything runs fine but when I am running on my English laptop the same program with the same version of python an encode error is triggered. Especially this one:
EncodeError: 'ascii' codec can't encode characters in position 0-2:
ordinal not in range(128)
The error happens due to this code
my_string = "Δίας"
print(my_string)
Why is this happening and what I need to do to fix it?
Why is this happening? You are using Python 2 and although it supports Unicode, it makes you jump through a few more hoops explicitly than Python 3 does. The string you provide contains characters that fall outside the normal first 128 ASCII characters, which is what is causing the problem.
The print statement tries to encode the string as standard ascii, but it runs into characters it doesn't understand and by that point, it does not know what encoding the characters are supposed to be in. You might think this is obvious: "the same encoding the file is in!" or "always UTF-8!", but Python 2 wants you to make it explicit.
What do you need to do to fix it? One solution would be to use Python 3 and not worry about it, if all you need is a quick solution. Python 3 really is the way forward at this point and using Python 2 makes you solve problems that many Python programmers today don't have to solve (although they should be able to, in the end).
If you want to keep using Python 2, you should change your code to this:
# coding=utf-8
my_string = u"Δίας"
print(my_string.encode('utf-8'))
The first line tells the interpreter explicitly what encoding the source file was written in. This helps your IDE as well, to make sure it is showing you the code correctly. The second line has the u in front of the string, telling Python my_string is in fact a unicode string. And the third line explicitly tells Python that you want the output to be utf-8 encoded as well.
A more complete explanation of all this is here https://docs.python.org/2/howto/unicode.html
If you're wondering why it works on your Greek computer, but not on your English computer - the default encoding on the Greek computer actually has the code points for the characters you're using, while the English encoding does not. This indicates that Python is clever enough to figure out that things are utf (and the string is a series of unicode code points), but by the time it needs to encode them, it doesn't know what encoding to use, as the standard (English) encoding doesn't have the characters in the string.

Reading unicode characters from file/sqlite database and using it in Python

I have a list of variables with unicode characters, some of them for chemicals like Ozone gas: like 'O\u2083'. All of them are stored in a sqlite database which is read in a Python code to produce O3. However, when I read I get 'O\\u2083'. The sqlite database is created using an csv file that contains the string 'O\u2083' among others. I understand that \u2083 is not being stored in sqlite database as unicode character but as 6 unicode characters (which would be \,u,2,0,8,3). Is there any way to recognize unicode characters in this context? Now my first option to solve it is to create a function to recognize set of characters and replace for unicode characters. Is there anything like this already implemented?
SQLite allows you to read/write Unicode text directly. u'O\u2083' is two characters u'O' and u'\u2083' (your question has a typo: 'u\2083' != '\u2083').
I understand that u\2083 is not being stored in sqlite database as unicode character but as 6 unicode characters (which would be u,\,2,0,8,3)
Don't confuse u'u\2083' and u'\u2083': the latter is a single character while the former is 4-character sequence: u'u', u'\x10' ('\20' is interpreted as octal in Python), u'8', u'3'.
If you save a single Unicode character u'\u2083' into a SQLite database; it is stored as a single Unicode character (the internal representation of Unicode inside the database is irrelevant as long as the abstraction holds).
On Python 2, if there is no from __future__ import unicode_literals at the top of the module then 'abc' string literal creates a bytestring instead of a Unicode string -- in that case both 'u\2083' and '\u2083' are sequences of bytes, not text characters (\uxxxx is not recognized as a unicode escape sequence inside bytestrings).
If you have a byte string (length 7), decode the Unicode escape.
>>> s = 'O\u2083'
>>> len(s)
7
>>> s
'O\\u2083'
>>> print(s)
O\u2083
>>> u = s.decode('unicode-escape')
>>> len(u)
2
>>> u
u'O\u2083'
>>> print(u)
O₃
Caveat: Your console/IDE used to print the character needs to use an encoding that supports the character or you'll get a UnicodeEncodeError when printing. The font must support the symbol as well.
It's important to remember everything is bytes. To pull bytes into something useful to you, you kind of have to know what encoding is used when you pull in data. There are too many ambiguous cases to determine encoding by analyzing the data. When you send data out of your program, it's all back out to bytes again. Depending on whether you're using Python 2.x or 3.x you'll have a very different experience with Unicode and Python.
You can, however attempt encoding and simply do a "replace" on errors. For example the_string.encode("utf-8","replace") will try to encode as utf-8 and will replace problems with a ? -- You could also anticipate problem characters and replace them beforehand, but that gets unmanageable quickly. Take a look at codecs classes for more replacement options.

how to convert u'\uf04a' to unicode in python [duplicate]

This question already has answers here:
Python unicode codepoint to unicode character
(4 answers)
Closed 1 year ago.
I am trying to decode u'\uf04a' in python thus I can print it without error warnings. In other words, I need to convert stupid microsoft Windows 1252 characters to actual unicode
The source of html containing the unusual errors comes from here http://members.lovingfromadistance.com/showthread.php?12338-HAVING-SECOND-THOUGHTS
Read about u'\uf04a' and u'\uf04c' by clicking here http://www.fileformat.info/info/unicode/char/f04a/index.htm
one example looks like this:
"Oh god please some advice ":
Out[408]: u'Oh god please some advice \uf04c'
Given a thread like this as one example for test:
thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
thread.decode('utf8')
print u'\uf04a'
print u'\uf04a'.decode('utf8') # error!!!
'charmap' codec can't encode character u'\uf04a' in position 1526: character maps to undefined
With the help of two Python scripts, I successfully convert the u'\x92', but I am still stuck with u'\uf04a'. Any suggestions?
References
https://github.com/AnthonyBRoberts/NNS/blob/master/tools/killgremlins.py
Handling non-standard American English Characters and Symbols in a CSV, using Python
Solution:
According to the comments below: I replace these character set with the question mark('?')
thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
thread = thread.replace(u'\uf04a', '?')
thread = thread.replace(u'\uf04c', '?')
Hope this helpful to the other beginners.
The notation u'\uf04a' denotes the Unicode codepoint U+F04A, which is by definition a private use codepoint. This means that the Unicode standard does not assign any character to it, and never will; instead, it can be used by private agreements.
It is thus meaningless to talk about printing it. If there is a private agreement on using it in some context, then you print it using a font that has a glyph allocated to that codepoint. Different agreements and different fonts may allocate completely different characters and glyphs to the same codepoint.
It is possible that U+F04A is a result of erroneous processing (e.g., wrong conversions) of character data at some earlier phase.
u'\uf04a'
already is a Unicode object, which means there's nothing to decode. The only thing you can do with it is encode it, if you're targeting a specific file encoding like UTF-8 (which is not the same as Unicode, but is confused with it all the time).
u'\uf04a'.encode("utf-8")
gives you a string (Python 2) or bytes object (Python 3) which you can then write to a file or a UTF-8 terminal etc.
You won't be able to encode it as a plain Windows string because cp1252 doesn't have that character.
What you can do is convert it to an encoding that doesn't have those offending characters by telling the encoder to replace missing characters by ?:
>>> u'who\uf04a why\uf04c'.encode("ascii", errors="replace")
'who? why?'

Python Unicode Bug

I'm making a virtual machine in RPython using PyPy. The problem is, when I tried to add unicode support I found an unusual problem. I'll use the letter "á" in my examples.
# The char in the example is á
print len(char)
OUTPUT:
2
I understand how the letter "á" takes two bytes, hence the length of 2. But the problem is when I use this example below I am faced with the problem.
# In this example instr = "á" (including the quotes)
for char in instr:
print hex(int(ord(char)))
OUTPUT:
0x22
0xc3
0xa1
0x22
As you can there are 4 numbers. For 0x22 are for the quotes, but there is only 1 letter in between the quotes but there are two numbers. My question is, some machines I tested this script on produced this output:
OUTPUT:
0x22
0xe1
0x22
Is there anyway to make the output the same on both machines? The script is exactly the same on each.
The program is not being given the same input on the two machines:
In [154]: '\xe1'.decode('cp1252').encode('utf_8') == '\xc3\xa1'
Out[154]: True
When you type á in a console, you may see the glyph á, but the console is translating that into bytes. The particular bytes it translates that into depends on the encoding used by the console. On a Windows machine, that may be cp1252, while on a Unix machine it is likely to be utf-8.
So you may see the input as the same, but the console (and thus the program) receives different input.
If your program were to decode the bytes with the appropriate encoding, and then work with unicode, then both programs will operate the same after that point. If you are receiving the bytes from sys.stdin, then sys.stdin.encoding will be the encoding Python detects the console is using.
You have this question tagged "Python-3.x" -- is it possible that some machines are running Python 2.x, and others are running Python 3.x?
The character á is in fact U+00E1, so on a Python 3.x system, I would expect to see your second output. Since strings are Unicode in Python3 by default, len(char) will be 3 (including the quotes).
In Python 2.x, that same character in a string will be two bytes long, and (depending on your input method) will be represented in UTF-8 as \xc3\xa1. On that system, len(char) will be 4, and you would see your first output.
The issue is that you use bytestrings to work with a text data. You should use Unicode instead.
It implies that you need to know the character encoding of your input data -- There Ain't No Such Thing As Plain Text.
If you know the character encoding then it is easy to convert a bytestring to Unicode e.g.:
unicode_text = bytestring.decode(encoding)
It should resolve your initial issue.
There are also Unicode normalization forms e.g.:
import unicodedata
norm_text = unicodedata.normalize('NFC', unicode_text)
If I don't change the encoding in the program how can I output unicode characters for example?
You might mean that you have a sequence of bytes e.g., '\xc3\xa1' (two bytes) that can be interpreted as text using some character encoding e.g., it is U+00E1 Unicode codepoint in utf-8. It may be something different in a different character encoding. Please, read the link I've provided above The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Unless by accident your terminal uses the same character encoding as data in your input file; you need to be able to convert from one character encoding to another. Otherwise the output will be corrupted e.g., instead of á you might get ├б on the screen.
In ordinary Python, you could use bytes.decode, unicode.encode methods (or codecs module directly). I don't know whether it is possible in RPython.

Greek encoding in PYTHON

i'm trying to store a string and after tokenize it with nltk in python.But i cant understand why after tokenizing it ( it creates a list ) i cant see the strings in list..
Can anyone help me plz?
Here is the code:
#a="Γεια σου"
#b=nltk.word_tokenize(a)
#b
['\xc3\xe5\xe9\xe1', '\xf3\xef\xf5']
I just want to be able to see the content of the list regularly..
Thx in advance
You are using Python 2, where unprefixed quotes denote a byte as opposed to a character string (if you're not sure about the difference, read this). Either switch to Python 3, where this has been fixed, or prefix all character strings with u and print the strings (as opposed to showing their repr, which differs in Python 2.x):
>>> import nltk
>>> a = u'Γεια σου'
>>> b = nltk.word_tokenize(a)
>>> print(u'\n'.join(b))
Γεια
σου
You can see the strings. The characters are represented by escape sequences because of your terminal encoding settings. Configure your terminal to accept input, and present output, in UTF-8.

Categories