Convert a UTF-8 String to a string in Python

If I have a unicode string such as:
s = u'c\r\x8f\x02\x00\x00\x02\u201d'
how can I convert this to just a regular string that isn't in unicode format; i.e. I want to extract:
f = '\x00\x00\x02\u201d'
and I do not want it in unicode format. The reason why I need to do this is because I need to convert the unicode in s to an integer value, but if I try it with just s:
int((s[-4]+s[-3]+s[-2]+s[-1]).encode('hex'), 16)
Traceback (most recent call last):
File "<pyshell#48>", line 1, in <module>
int((s[-4]+s[-3]+s[-2]+s[-1]).encode('hex'), 16)
File "C:\Python27\lib\encodings\hex_codec.py", line 24, in hex_encode
output = binascii.b2a_hex(input)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 3: ordinal not in range(128)
yet if I do it with f:
int(f.encode('hex'), 16)
664608376369508L
And this is the correct integer value I want to extract from s. Is there a method where I can do this?

Normally, the device sends back something like: \x00\x00\x03\xcc which I can easily convert to 972
OK, so I think what's happening here is you're trying to read four bytes from a byte-oriented device, and decode that to an integer, interpreting the bytes as a 32-bit word in big-endian order.
To do this, use the struct module and byte strings:
>>> import struct
>>> struct.unpack('>i', '\x00\x00\x03\xCC')[0]
972
(I'm not sure why you were trying to reverse the string then hex-encode; that would put the bytes in the wrong order and give much too large output.)
I don't know how you're reading from the device, but at some point you've decoded the bytes into a text (Unicode) string. Judging from the U+201D character in there I would guess that the device originally gave you a byte 0x94 and you decoded it using code page 1252 or another similar Windows default (‘ANSI’) code page.
>>> struct.unpack('>i', '\x00\x00\x02\x94')[0]
660
It may be possible to reverse the incorrect decoding step by encoding back to bytes using the same mapping, but this is dicey and depends on which encodings are involved (not all bytes are mapped to anything usable in all encodings). Better would be to look at where the input is coming from, find where that decode step is happening, and get rid of it so you keep hold of the raw bytes the device sent you.
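If you really must undo the bad decode, a minimal sketch (assuming the text was in fact decoded with cp1252, as guessed above) is to encode the relevant characters back to bytes with that same codec and unpack them:
>>> import struct
>>> s = u'c\r\x8f\x02\x00\x00\x02\u201d'
>>> raw = s[-4:].encode('cp1252')   # re-encode with the suspected codec to recover the bytes
>>> raw
'\x00\x00\x02\x94'
>>> struct.unpack('>i', raw)[0]
660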

Related

encoding string in python [duplicate]

I have this issue and I can't figure out how to solve it. I have this string:
data = '\xc4\xb7\x86\x17\xcd'
When I tried to encode it:
data.encode()
I get this result:
b'\xc3\x84\xc2\xb7\xc2\x86\x17\xc3\x8d'
I only want:
b'\xc4\xb7\x86\x17\xcd'
Does anyone know the reason and how to fix this? The string is already stored in a variable, so I can't just add a b prefix to the literal.
You cannot convert a string into bytes or bytes into string without taking an encoding into account. The whole point about the bytes type is an encoding-independent sequence of bytes, while str is a sequence of Unicode code points which by design have no unique byte representation.
So when you want to convert one into the other, you must tell explicitly what encoding you want to use to perform this conversion. When converting into bytes, you have to say how to represent each character as a byte sequence; and when you convert from bytes, you have to say what method to use to map those bytes into characters.
If you don’t specify the encoding, then UTF-8 is the default, which is a sane default since UTF-8 is ubiquitous, but it's also just one of many valid encodings.
If you take your original string, '\xc4\xb7\x86\x17\xcd', take a look at what Unicode code points these characters represent. \xc4 for example is the LATIN CAPITAL LETTER A WITH DIAERESIS, i.e. Ä. That character happens to be encoded in UTF-8 as 0xC3 0x84 which explains why that’s what you get when you encode it into bytes. But it also has an encoding of 0x00C4 in UTF-16 for example.
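For illustration, this is exactly what happens to that first character in a Python 3 shell:
>>> '\xc4'
'Ä'
>>> '\xc4'.encode('utf-8')
b'\xc3\x84'
>>> '\xc4'.encode('utf-16-be')
b'\x00\xc4'
>>> '\xc4'.encode('latin1')
b'\xc4'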
As for how to solve this properly so you get the desired output, there is no clear correct answer. The solution that Kasramvd mentioned is also somewhat imperfect. If you read about the raw_unicode_escape codec in the documentation:
raw_unicode_escape
Latin-1 encoding with \uXXXX and \UXXXXXXXX for other code points. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol.
So this is just a Latin-1 encoding which has a built-in fallback for characters outside of it. I would consider this fallback somewhat harmful for your purpose. For Unicode characters that cannot be represented as a \xXX sequence, this might be problematic:
>>> chr(256).encode('raw_unicode_escape')
b'\\u0100'
So the code point 256 is explicitly outside of Latin-1 which causes the raw_unicode_escape encoding to instead return the encoded bytes for the string '\\u0100', turning that one character into 6 bytes which have little to do with the original character (since it’s an escape sequence).
So if you wanted to use Latin-1 here, I would suggest you use it explicitly, without the escape-sequence fallback from raw_unicode_escape. This will simply cause an exception when trying to convert code points outside of the Latin-1 range:
>>> '\xc4\xb7\x86\x17\xcd'.encode('latin1')
b'\xc4\xb7\x86\x17\xcd'
>>> chr(256).encode('latin1')
Traceback (most recent call last):
File "<pyshell#28>", line 1, in <module>
chr(256).encode('latin1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)
Of course, whether or not code points outside of the Latin-1 range can cause problems for you depends on where that string actually comes from. But if you can guarantee that the input will only contain valid Latin-1 characters, then chances are that you don't really need to be working with a string there in the first place. Since you are actually dealing with some kind of bytes, you should check whether you can simply retrieve those values as bytes in the first place. That way you won't introduce two levels of encoding, where misinterpreting the input can corrupt data.
You can use 'raw_unicode_escape' as your encoding:
In [14]: bytes(data, 'raw_unicode_escape')
Out[14]: b'\xc4\xb7\x86\x17\xcd'
As mentioned in comments you can also pass the encoding directly to the encode method of your string.
In [15]: data.encode("raw_unicode_escape")
Out[15]: b'\xc4\xb7\x86\x17\xcd'

Python json Ignoring non-ascii chars UnicodeDecodeError: 'ascii' codec can't decode byte

Python 2.7.3
I have read all related threads around json/dumps UnicodeDecodeError and most of them want me to understand what encoding I need. In my case I am creating a JSON with various key values coming from various services (some p4 command lines), possibly in different encodings. I have a map something like this:
map = {"system1": some_data_from_system1, "system2": some_data_from_system2}
json.dumps(map)
This throws an "UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 737: ordinal not in range(128)"
I would like to have ASCII characters dumped into a file; occasionally a p4 checkin/Jira entry might have non-ASCII chars, and it is perfectly okay to ignore them. I have tried "ensure_ascii = False" and it does not solve the problem. What I really want is for the encoder to simply ignore any non-ASCII chars along the way. I think this is reasonable but cannot find any way out.
Suggestions?
The json.dumps() and json.dump() functions will try to decode byte strings to Unicode values when passed in, using UTF-8 by default:
>>> map = {"system1": '\x92'}
>>> json.dumps(map)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/__init__.py", line 243, in dumps
return _default_encoder.encode(obj)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte
>>> map = {"system1": u'\x92'.encode('utf8')}
>>> json.dumps(map)
'{"system1": "\\u0092"}'
You can set the encoding keyword argument to use a different encoding for byte string (str) characters.
These functions do this because JSON is a standard that uses Unicode for all strings. If you feed it data that is not encoded as UTF-8, this fails, as shown above.
On Python 2 the output is a byte string too, encoded to UTF-8. It can be safely written to a file. Setting the ensure_ascii argument to False would change that and you'd get Unicode instead, which you clearly don't want.
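For instance, with input that is already decoded to unicode, the difference looks like this on Python 2 (a quick illustration):
>>> import json
>>> json.dumps({"system1": u'\u2019'})
'{"system1": "\\u2019"}'
>>> json.dumps({"system1": u'\u2019'}, ensure_ascii=False)
u'{"system1": "\u2019"}'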
So you need to ensure that what you put into the json.dumps() function is consistently all the same encoding, or is already decoded to unicode objects. If you don't care about the occasional missed codepoint, you can force a decode with the error handler set to replace or ignore:
map = {"system1": some_data_from_system1.decode('ascii', errors='replace')}
This decodes the string forcibly, replacing any bytes that are not recognized as ASCII codepoints with a replacement character:
>>> '\x92'.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 0: ordinal not in range(128)
>>> '\x92'.decode('ascii', errors='replace')
u'\ufffd'
Here a U+FFFD REPLACEMENT CHARACTER codepoint is inserted instead to represent the unknown codepoint. You could also completely ignore such bytes by using errors='ignore'.
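For completeness, the ignore handler simply drops the byte, and the result then dumps cleanly (Python 2):
>>> import json
>>> '\x92hello'.decode('ascii', errors='ignore')
u'hello'
>>> json.dumps({"system1": '\x92hello'.decode('ascii', errors='ignore')})
'{"system1": "hello"}'
Only use this when losing those characters really is acceptable.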
I have used a combination of "How to get string objects instead of Unicode ones from JSON in Python?" and the answer above to do this piece of logging.
As stated above, some_data_from_system{1|2} are not plain strings. The question is about a general error logging system: when things go wrong you want to dump as much information from several subsystems as possible for human inspection. The subsystems change between environments, and it is not always known what encoding they use when they return "jsons" representing what went wrong. To this end I use the following method, adapted from the other thread; the essence is basically the 'ignore' error handler.
Please note: this is not a very performant method (blind recursions usually are not), so it is not suitable for a typical production application; depending on the data it is possible to run into an infinite loop. However, assuming you understand the disclaimers, it is okay for error logging systems.
def convert_encoding(data, encoding='ascii'):
    # Recursively walk dicts/lists and encode any unicode values,
    # dropping characters the target encoding cannot represent.
    if isinstance(data, dict):
        return dict((convert_encoding(key, encoding), convert_encoding(value, encoding))
                    for key, value in data.iteritems())
    elif isinstance(data, list):
        return [convert_encoding(element, encoding) for element in data]
    elif isinstance(data, unicode):
        return data.encode(encoding, 'ignore')
    else:
        return data
map = {"system1": some_data_from_system1, "system2", some_data_from_system2}
json.dumps(convert_encoding(map), ensure_ascii = False)
Once defined, this generic method can be used to dump the data.
If your string in JSON format has non-ASCII characters and you need to use Python's dumps method accordingly:
myString = "{key: 'Brazilian Portuguese has many different characters like maçã (apple) or Bíblia (Bible)' }" # or a map
myJSON = json.dumps(myString, encoding="latin-1") #use utf8 if appropriate
myJSON = json.loads(myJSON)

Converting Unicode to UTF-8 in python [duplicate]

Possible Duplicate:
Convert Unicode to UTF-8 Python
I'm a very new Python programmer, working on my first script. The script pulls in text from a plist string, then does some things to it, then packages it up as an HTML email.
From a few of the entries, I'm getting the dreaded Unicode "outside ordinal 128" error.
Having read as much as I can find about encoding and decoding, I know that it is important for me to get the text encoded properly, but I'm having a difficult time understanding when or how exactly to do this.
The offending variable is first pulled in using plistlib, and converted to HTML from markdown, like this:
entry = result['Entry Text']
donotecontent = markdown2.markdown(entry)
Later, it is put in the email like this:
html = donotecontent + '<br /><br />' + var3
part1 = MIMEText(html, 'html')
msg.attach(part1)
My question is, what is the best way for me to make sure that Unicode characters in this content don't cause this to throw an error? I'd prefer not to ignore the characters.
Sorry for my broken English. I speak Chinese/Japanese and use CJK characters every day.
Ceron's answer covers most of this problem, so I won't repeat how to use encode()/decode().
When we use str() to cast a unicode object, it encodes the unicode string to byte data; when we use unicode() to cast a str object, it decodes the byte data to unicode characters.
The encoding used is whatever sys.getdefaultencoding() returns. By default that is 'ascii', so an encoding/decoding exception may be thrown when doing the str()/unicode() cast.
If you want to do str <-> unicode conversion via str() or unicode(), with implicit 'utf-8' encoding/decoding, you can execute the following statements:
import sys # sys.setdefaultencoding is cancelled by site.py
reload(sys) # to re-enable sys.setdefaultencoding()
sys.setdefaultencoding('utf-8')
After this, later calls to str() and unicode() will convert any basestring object using the utf-8 encoding.
However, I would prefer to use encode()/decode() explicitly, because it makes code maintenance easier for me.
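A minimal sketch of that explicit style (Python 2; the byte values here are just an example, UTF-8 for 苹果):
raw = '\xe8\x8b\xb9\xe6\x9e\x9c'   # bytes read from a file, socket, plist, etc.
text = raw.decode('utf-8')         # decode to unicode as early as possible
# ... work with the unicode object internally ...
out = text.encode('utf-8')         # encode only at the output boundary (file, email, ...)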
Assuming you're using Python 2.x, remember: there are two types of strings: str and unicode. str values are byte strings, whereas unicode values are Unicode strings. Unicode strings can represent text in any language, but to store text in a computer or send it via email, you need to represent that text using bytes. To represent text using bytes, you need an encoding. There are many encodings; Python uses ascii by default, but ascii can only represent a few characters, mostly English letters. If you try to encode text containing other characters using ascii, you will get the famous "outside ordinal 128" error. For example:
>>> u'Cerón'.encode('ascii')
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3:
ordinal not in range(128)
The same happens if you use str(u'Cerón'), because Python uses ascii by default to convert unicode to str.
To make this work, you have to use a different coding format. UTF-8 is a coding format that can express any unicode text as bytes. To convert the u'Cerón' unicode string to bytes you have to use:
>>> u'Cerón'.encode('utf-8')
'Cer\xc3\xb3n'
No errors this time.
Now, back to your email problem. I can see that you're using MIMEText, which accepts an already encoded str argument, in your case the html variable. MIMEText also accepts an argument specifying what kind of encoding is being used. So, in your case, if html is a unicode string, you have to encode it as utf-8 and pass the charset parameter too (because MIMEText uses ascii by default):
part1 = MIMEText(html.encode('utf-8'), 'html', 'utf-8')
But be careful: if html is already a str instead of unicode, the encoding will fail. This is one of the quirks of Python 2.x; it lets you call encode() on an already encoded str, but it does so by first decoding it implicitly as ASCII, which throws an error as soon as a non-ASCII byte is present.
Another problem to add to the list is that utf-8 is compatible with ascii characters, and Python will always try to automatically encode/decode strings using ascii. If you're not properly encoding your strings but you only use ascii characters, things will work fine. However, if for some reason a non-ascii character slips into your message, you will get the error; this makes such bugs harder to detect.
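A small illustration of that silent coercion (Python 2):
>>> u'Hello ' + 'world'            # implicit ascii decode of the str succeeds
u'Hello world'
>>> u'Hello ' + 'Cer\xc3\xb3n'     # a non-ASCII byte makes the implicit decode blow up
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)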
Remember: You can't decode a unicode, and you can't encode a str
>>> u"\xa0".decode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
u"\xa0".decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
>>> "\xc2".encode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
"\xc2".encode("ascii", "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Check out this excellent tutorial.

python byte string encode and decode

I am trying to convert an incoming byte string that contains non-ascii characters into a valid utf-8 string such that I can dump it as JSON.
b = '\x80'
u8 = b.encode('utf-8')
j = json.dumps(u8)
I expected j to be '\xc2\x80' but instead I get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
In my situation, 'b' is coming from mysql via google protocol buffers and is filled out with some blob data.
Any ideas?
EDIT:
I have ethernet frames that are stored in a mysql table as a blob (please, everyone, stay on topic and keep from discussing why there are packets in a table). The table collation is utf-8 and the db layer (sqlalchemy, non-orm) is grabbing the data and creating structs (google protocol buffers) which store the blob as a python 'str'. In some cases I use the protocol buffers directly without any issue. In other cases, I need to expose the same data via json. What I noticed is that when json.dumps() does its thing, '\x80' can be replaced with the invalid unicode char (\ufffd, IIRC).
You need to examine the documentation for the software API that you are using. BLOB is an acronym: Binary Large Object.
If your data is in fact binary, the idea of decoding it to Unicode is of course nonsense.
If it is in fact text, you need to know what encoding to use to decode it to Unicode.
Then you use json.dumps(a_Python_object) ... if you encode it to UTF-8 yourself, json will decode it back again:
>>> import json
>>> json.dumps(u"\u0100\u0404")
'"\\u0100\\u0404"'
>>> json.dumps(u"\u0100\u0404".encode('utf8'))
'"\\u0100\\u0404"'
>>>
UPDATE about latin1:
u'\x80' is a useless meaningless C1 control character -- the encoding is extremely unlikely to be Latin-1. Latin-1 is "a snare and a delusion" -- all 8-bit bytes are decoded to Unicode without raising an exception. Don't confuse "works" and "doesn't raise an exception".
Use b.decode('name of source encoding') to get a unicode version. This was surprising to me when I learned it, e.g.:
In [123]: 'foo'.decode('latin-1')
Out[123]: u'foo'
I think what you are trying to do is decode a string object that is in some encoding. Do you know what that encoding is? Decoding it gets you the unicode object:
unicode_b = b.decode('some_encoding')
and then re-encode the unicode object using the utf_8 encoding to get a string object back:
b = unicode_b.encode('utf_8')
This uses the unicode object as a translator. Without knowing what the original encoding of the string is, I can't know for certain whether the conversion will go as expected; the unicode object is not meant for converting strings from one arbitrary encoding to another. I would work with the unicode object, assuming you know what the encoding is (if you don't, there really isn't a way to find out without trial and error), and convert back to an encoded string when you want a string object back.
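A minimal round trip under that assumption (latin-1 is used here purely as a placeholder; as noted above, it may well not be the real source encoding):
>>> import json
>>> b = '\x80'
>>> u = b.decode('latin-1')   # only meaningful if latin-1 really is the source encoding
>>> u
u'\x80'
>>> json.dumps(u)
'"\\u0080"'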

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3'

I have an Excel spreadsheet that I'm reading in that contains some £ signs.
When I try to read it in using the xlrd module, I get the following error:
x = table.cell_value(row, col)
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.
How can I fix this, and read the £ signs in correctly?
--- UPDATE ---
Some kind readers have suggested that I don't need to decode it at all, or that I can just encode it to Latin-1 when I need to. The problem with this is that I need to write the data to a CSV file eventually, and it seems to object to the raw strings.
If I don't encode or decode the data at all, then this happens (after I've added the string to an array called items):
for item in items:
    #item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)
File "clean_up_barnet.py", line 104, in <module>
cleancsv.writerow(item)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 43: ordinal not in range(128)
I get the same error even if I uncomment the Latin-1 line.
A very easy way around all the "'ascii' codec can't encode character…" issues with csvwriter is to use unicodecsv instead, a drop-in replacement for Python 2's csv module.
Install unicodecsv with pip and then you can use it in exactly the same way, e.g.:
import unicodecsv
file = open('users.csv', 'w')
w = unicodecsv.writer(file)
for user in User.objects.all().values_list('first_name', 'last_name', 'email', 'last_login'):
    w.writerow(user)
For what it's worth: I'm the author of xlrd.
Does xlrd produce unicode?
Option 1: Read the Unicode section at the bottom of the first screenful of xlrd doc: This module presents all text strings as Python unicode objects.
Option 2: print type(text), repr(text)
You say """If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.""" Of course if you write UTF-8-encoded text to a device that's expecting latin1, it will be garbled. What do did you expect?
You say in your edit: """I get the same error even if I uncomment the Latin-1 line""". This is very unlikely -- much more likely is that you got a slightly different error (mentioning the latin1 codec instead of the ascii codec) in a different source line (the uncommented latin1 line instead of the writerow line). Reading error messages carefully aids understanding.
Your problem here is that in general your data is NOT encodable in latin1; very little real-world data is. Your POUND SIGN is encodable in latin1, but that's not all your non-ASCII data. The problematic character is U+2022 BULLET which is not encodable in latin1.
It would have helped you get a better answer sooner if you had mentioned up front that you were working on Mac OS X ... the usual suspect for a CSV-suitable encoding is cp1252 (Windows), not mac-roman.
Your code snippet says x.decode, but you're getting an encode error -- meaning x is Unicode already, so, to "decode" it, it must first be turned into a string of bytes (and that's where the default codec ascii comes up and fails). In your text you then say "if I rewrite it to x.encode"... which seems to imply that you do know x is Unicode.
So what IS it you're doing -- and what do you mean to be doing -- encoding a unicode x to get a coded string of bytes, or decoding a string of bytes into a unicode object?
I find it unfortunate that you can call encode on a byte string, and decode on a unicode object, because I find it seems to lead users to nothing but confusion... but at least in this case you seem to manage to propagate the confusion (at least to me;-).
If, as it seems, x is unicode, then you never want to "decode" it -- you may want to encode it to get a byte string with a certain codec, e.g. latin-1, if that's what you need for some kind of I/O purposes (for your own internal program use I recommend sticking with unicode all the time -- only encode/decode if and when you absolutely need, or receive, coded byte strings for input / output purposes).
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
Look closely: you got a UnicodeEncodeError (note: Encode) calling the decode method.
The reason for this is that decode is intended to convert from a byte sequence (str) to a unicode object. But, as John said, xlrd already uses Unicode strings, so x is already a unicode object.
In this situation, Python 2.x assumes that you meant to decode a str object, so it "helpfully" creates one for you. But in order to convert a unicode to a str, it needs an encoding, and chooses ASCII because it's the lowest common denominator of character encodings. Your code effectively gets interpreted as
x = x.encode('ascii').decode("ISO-8859-1")
which fails because x contains a non-ASCII character.
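You can check this interpretation directly; the explicit equivalent fails in exactly the same way:
>>> x = u'\xa3'
>>> x.encode('ascii').decode("ISO-8859-1")   # what Python 2 effectively runs
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)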
Since x is already a unicode object, the decode is unnecessary. However, now you run into the problem that the Python 2.x csv module doesn't support Unicode. You have to convert your data to str objects.
for item in items:
    item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)
This would be correct, except that you have the • character (U+2022 BULLET) in your data, and Latin-1 can't represent it. There are several ways around this problem:
Write x.encode('latin-1', 'ignore') to remove the bullet (or other non-Latin-1 characters).
Write x.encode('latin-1', 'replace') to replace the bullet with a question mark.
Replace the bullets with a Latin-1 character like * or ·.
Use a character encoding that does contain all the characters you need.
These days, UTF-8 is widely supported, so there is little reason to use any other encoding for text files.
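If you go that route, the loop from the question only needs the encoding changed (a sketch; Python 2's csv module is happy with UTF-8 byte strings):
for item in items:
    item = [x.encode('utf-8') for x in item]
    cleancsv.writerow(item)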
xlrd works with Unicode, so the string you get back is a Unicode string. The £-sign has code point U+00A3, so the representation of said string should be u'\xa3'. This has been read in correctly; it is the string that you should be working with throughout your program.
When you write this (abstract, Unicode) string somewhere, you need to choose an encoding. At that point, you should .encode it into that encoding, say latin-1.
>>> import xlrd
>>> book = xlrd.open_workbook( "test.xls" )
>>> sh = book.sheet_by_index( 0 )
>>> x = sh.cell_value( 0, 0 )
>>> x
u'\xa3'
>>> print x
£
# sample outputs (for e.g. writing to a file)
>>> x.encode( "latin-1" )
'\xa3'
>>> x.encode( "utf-8" )
'\xc2\xa3'
# garbage, because x is already Unicode
>>> x.decode( "ascii" )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0:
ordinal not in range(128)
>>>
Working with xlrd, I have a line ...xl_data.find(str(cell_value))... which gives the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All suggestions in the forums have been useless for my German words. But changing it to ...xl_data.find(cell.value)... gives no error. So I suppose using strings as arguments in certain commands with xlrd has specific encoding problems.
