My program is required to take in inputs, but I am having issues with subscripts such as CO₂...
When I pass CO₂ as an argument to the function, it seems to be represented as the string 'CO\xe2\x82\x82', which is apparently the string literal?
Further on, I read from a spreadsheet (an xlsx file) using read_excel() from pandas to find entries pertaining to CO₂. I then convert this into a dictionary, but in this case it is represented as 'CO\u2082'.
I use the args from earlier, represented as 'CO\xe2\x82\x82', so the lookup doesn't recognize the entry for 'CO\u2082'... which then results in a KeyError.
My question is: what would be a way to convert both of these representations of CO₂ so that I can do look-ups in the dictionary? Thank you for any advice.
Looks like your input to the function is encoded as UTF-8, while the data from the XLSX file is already decoded Unicode.
b'\xe2\x82\x82' is the UTF-8 encoding of the Unicode codepoint '\u2082', which is identical to '₂' on Unicode-enabled systems.
Most modern systems are Unicode-enabled, so the most common reason to see the UTF-8 encoded form is that you read bytes data, which is always encoded. You can fix that by decoding it like so:
>>> data = b'CO\xe2\x82\x82'
>>> data.decode()
'CO₂'
If the encoded data are somehow in a normal (non-bytes) string, then you can do it by converting the existing string to bytes and then decoding it:
>>> data = 'CO\xe2\x82\x82'
>>> bytes(map(ord, data)).decode()
'CO₂'
As noted by Mark Tolonen below, using the latin-1 encoding is functionally identical to bytes(map(ord, data)), but much, much faster:
>>> data = 'CO\xe2\x82\x82'
>>> data.encode('latin1').decode()
'CO₂'
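Tying this back to the original dictionary lookup: a minimal sketch (Python 3; the emissions table and its contents are hypothetical) that normalizes a possibly mis-decoded key before the lookup:

# Hypothetical table, as it would come out of the spreadsheet: keys are proper Unicode.
emissions = {'CO\u2082': 'some entry'}

def normalize(key):
    # If the str actually carries UTF-8 bytes smuggled in as code points,
    # round-trip it through latin-1 to recover the real text.
    try:
        return key.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return key  # already proper Unicode (or not mojibake), leave it alone

print(emissions[normalize('CO\xe2\x82\x82')])  # some entry
print(emissions[normalize('CO\u2082')])        # some entry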
Related
I am working on migrating project code from Python 2 to Python 3.
One piece of code uses struct.pack, which gives me a string in Python 2 and a byte string in Python 3.
I want to convert the Python 3 byte string to a normal string. The converted string should have the same contents, to stay consistent with the existing values.
For example:
in_val = b'\x01\x36\xff\x27'  # Input value
out_val = '\x01\x36\xff\x27'  # Output should be this
One solution I have is to convert in_val to its string representation and then explicitly remove the 'b' and '\' characters that appear after the conversion.
Is there a cleaner way to do the conversion?
Any help is appreciated.
str values are always sequences of Unicode code points. The first 256 code points coincide with the Latin-1 range, so you can use that codec to decode bytes directly to those code points:
out_val = in_val.decode('latin1')
However, you should re-assess why you are doing this. Don't store binary data in strings; there are almost always better ways to deal with binary data. If you want to store binary data in JSON, for example, you'd want to use Base64 or some other binary-to-text encoding scheme that better handles edge cases, such as binary data containing escape codes when interpreted as text.
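For example, a minimal sketch (standard library only) of round-tripping binary data through JSON with Base64:

import base64
import json

raw = b'\x01\x36\xff\x27'  # arbitrary binary data
payload = json.dumps({'data': base64.b64encode(raw).decode('ascii')})
restored = base64.b64decode(json.loads(payload)['data'])
assert restored == raw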
My objective is to read a text (or any) file byte by byte in Python. I came across a few Stack Overflow questions, such as Reading binary file and looping over each byte,
and I am using the following method:
with open("./test", "rb") as in_file:
    msg_char = in_file.read(1)
    print(type(msg_char))
And I am getting the output:
<type 'str'>
I checked this on one other question, Read string from binary file, which says that read returns a string, in the sense of a "string of bytes". I am confused. My question is the following:
Is a "string of bytes" different from a conventional string (as used in C/C++ etc.)?
In Python 2 the differentiation between text and bytes isn't as well-developed as it is in Python 3, which has separate types: str for text, in which the individual items are Unicode characters, and bytes for binary data, where the individual items are 8-bit bytes.
Since Python 2 didn't have the bytes type, it used strings for both kinds of data. Although the unicode type was introduced into Python 2, no attempt was made to change the way files handled data, and decoding was left entirely to the programmer.
Similarly in C, "string" originally meant a string of bytes; wide character types were introduced later, once developers realized that text is rather different from byte data.
As a programmer you should always try to maintain the separation between text data and the bytes that represent it in a particular encoding. The simplest rule is "decode on input, encode on output": that way you know your text is using the appropriate encodings.
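A minimal sketch of that rule (Python 3, assuming a UTF-8 encoded file named notes.txt):

# Decode on input: bytes become str as early as possible.
with open('notes.txt', 'rb') as f:
    text = f.read().decode('utf-8')

text = text.upper()  # all processing happens on str, never on bytes

# Encode on output: str becomes bytes again only at the boundary.
with open('upper.txt', 'wb') as f:
    f.write(text.encode('utf-8'))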
Not sure if this is exactly the problem, but I'm trying to insert a tag on the first letter of a Unicode string and it seems that this is not working. Could this be because Unicode indices work differently than those of regular strings?
Right now my code is this:
for index, paragraph in enumerate(intro[2:-2]):
    intro[index] = bold_letters(paragraph, 1)

def bold_letters(string, index):
    return "<b>" + string[0] + "</b>" + string[index:]
And I'm getting output like this:
<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים.
It seems the Unicode gets messed up when I try to insert the HTML tag. I tried changing the insert position but didn't make any progress.
Example desired output (Hebrew goes right to left):
>>> first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"
BTW, this is for Python 2
You are right: in Python 2.x, indices work over individual bytes when you are dealing with raw bytes, i.e. a str.
To work seamlessly with Unicode data, you need to first let Python 2.x know that you are dealing with Unicode, then do the string manipulation. You can finally convert it back to raw bytes to keep the behaviour abstracted, i.e. you take a str and you return a str.
Ideally you should convert all the data from raw UTF-8 bytes to unicode objects at the very beginning of your code (I am assuming your source encoding is UTF-8, because that is the standard used by most applications these days), and convert back to raw bytes at the very end, e.g. when saving to a DB or responding to a client. Some frameworks might handle that for you so that you don't have to worry.
def bold_letters(string, index):
    string = string.decode('utf8')
    string = "<b>" + string[0] + "</b>" + string[index:]
    return string.encode('utf8')
This will also work for ASCII input, because UTF-8 is a superset of ASCII. You can get a better understanding of how Unicode works, and in Python specifically, by reading http://nedbatchelder.com/text/unipain.html
In Python 3.x, str is already a Unicode type, so you don't have to do anything explicitly.
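For reference, a minimal sketch of the same function on Python 3, where no encode/decode dance is needed:

def bold_letters(string, index):
    # str is a sequence of characters, so string[0] is the full first letter
    return "<b>" + string[0] + "</b>" + string[index:]

print(bold_letters(u"הקדמה", 1))  # <b>ה</b>קדמה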
You should use Unicode strings. Byte strings in UTF-8 use a variable number of bytes per character, whereas Unicode strings use one index per character (at least for characters in the BMP on Python 2, i.e. the first 65,536 code points):
# coding: utf8
import io

s = u"הקדמה"
t = u'<b>' + s[0] + u'</b>' + s[1:]
print(t)
with io.open('out.htm', 'w', encoding='utf-8-sig') as f:
    f.write(t)
Output:
<b>ה</b>קדמה
But my Chrome browser displays out.htm as follows (screenshot: the first letter ה rendered in bold, on the right-hand side, since Hebrew displays right to left).
This topic has already been covered on Stack Overflow, but I didn't find any satisfying solution:
I have some strings in Unicode coming from a server, and some hardcoded strings in the code which I'd like to match against them. And I do understand why I can't just do a ==, but I have not succeeded in converting them properly (I don't care whether I have to go str -> unicode or unicode -> str).
I tried encode and decode, but they didn't give any result.
Here is what I receive...
fromServer = {unicode} u'Führerschein nötig'
fromCode = {str} 'Führerschein nötig'
(as you can see, it is German!)
How can I make them compare equal in Python 2?
First, make sure you declare the encoding of your Python source file at the top of the file, e.g. if your file is encoded as latin-1:
# -*- coding: latin-1 -*-
And second, always store text as Unicode strings:
fromCode = u'Führerschein nötig'
If you get bytes from somewhere, convert them to Unicode with str.decode before working with the text. For text files, specify the encoding when opening the file, e.g.:
import codecs

# use codecs.open to open a text file
f = codecs.open('unicode.rst', encoding='utf-8')
Code which compares byte strings with Unicode strings will fail seemingly at random, depending on system settings or on whatever encoding happens to be used for a text file. Don't rely on it; always make sure you compare either two Unicode strings or two byte strings.
Python 3 changed this behaviour: it will not try to convert between the types. 'a' and b'a' are considered objects of different types, and comparing them will always return False.
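A quick check of the Python 3 behaviour:

>>> 'a' == b'a'  # str vs bytes: different types, never equal
False
>>> b'a'.decode('ascii') == 'a'  # decode first, then compare like with like
True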
Tested on 2.7.
For German umlauts, latin-1 can be used (assuming the source file containing the byte string literal is saved as latin-1):
>>> if 'Führerschein nötig'.decode('latin-1') == u'Führerschein nötig':
...     print('yes....')
...
yes....
Given an arbitrary "string" from a library I do not have control over, I want to make sure the "string" is a unicode type and encoded in UTF-8. I would like to know if this is the best way to do it:
import types

input = <some value from a lib I don't have control over>
if isinstance(input, types.StringType):
    input = input.decode("utf-8")
elif isinstance(input, types.UnicodeType):
    input = input.encode("utf-8").decode("utf-8")
In my actual code I wrap this in a try/except and handle the errors but I left that part out.
A Unicode object is not encoded (it is internally, but that should be transparent to you as a Python user). The line input.encode("utf-8").decode("utf-8") does not make much sense: you end up with exactly the same sequence of Unicode characters that you started with.
if isinstance(input, str):
    input = input.decode('utf-8')
is all you need to ensure that str objects (byte strings) are converted into Unicode strings.
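A quick sanity check of that pattern on Python 2 (assuming the library's byte strings are UTF-8 encoded; the sample values are illustrative):

def ensure_unicode(value):
    if isinstance(value, str):
        return value.decode('utf-8')  # byte string -> unicode
    return value                      # already unicode

print(repr(ensure_unicode('caf\xc3\xa9')))  # u'caf\xe9'
print(repr(ensure_unicode(u'caf\xe9')))     # u'caf\xe9'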
Simply:
try:
    input = unicode(input.encode('utf-8'))
except ValueError:
    pass
It's always better to seek forgiveness than to ask permission.
I think you have a misunderstanding of Unicode and encodings. Unicode characters are just numbers. Encodings are representations of those numbers. Think of a Unicode character as a concept like fifteen, and encodings as 15, 1111, F, and XV. You have to know the encoding (decimal, binary, hexadecimal, Roman numerals) before you can decode a representation and "know" the Unicode value.
If you have no control over the input string, it is difficult to convert it to anything. For example, if the input was read from a file you'd have to know the encoding of the text file to decode it meaningfully to Unicode, and then encode it into 'UTF-8' for your C++ library.
Are you sure you want a UTF-8 encoded sequence stored in a unicode type? Normally, Python stores characters in a types.UnicodeType using UCS-2 or UCS-4, sometimes referred to as "wide" characters, which should be capable of holding characters from all reasonably common scripts.
One wonders what sort of lib this is that sometimes outputs types.StringType and sometimes types.UnicodeType. If I had to take a wild guess, the lib always produces types.StringType but doesn't tell you which encoding it is in. If that is the case, you are actually looking for code that can guess which charset a types.StringType is encoded in.
In most cases this is easy, as you can assume that it is in e.g. latin-1 or UTF-8. If the text can actually be in any odd encoding (e.g. incoming mail without a proper header), you need a lib that guesses the encoding. See http://chardet.feedparser.org/.
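A minimal sketch using the chardet library (pip install chardet); the sample bytes are illustrative, and the detected encoding is a heuristic guess:

import chardet

raw = b'F\xfchrerschein n\xf6tig'  # bytes of unknown provenance (here: latin-1)
guess = chardet.detect(raw)        # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
if guess['encoding'] is not None:
    text = raw.decode(guess['encoding'])
    print(text)                    # Führerschein nötig, if the guess is right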