Read in special characters for pandas dataframe [duplicate]

Read in special characters for pandas dataframe [duplicate] - python

I have JSON file which contains followingly encoded strings:
"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",
I am trying to parse this file using the json module. However I am not able to decode this string correctly.
What I get after decoding the JSON using .load() method is 'HornÃ\xadkovÃ¡'. The string should be correctly decoded as 'Horníková' instead.
I read the JSON specification and I understasnd that after \u there should be 4 hexadecimal numbers specifing Unicode number of character. But it seems that in this JSON file UTF-8 encoded bytes are stored as \u-sequences.
What type of encoding is this and how to correctly parse it in Python 3?
Is this type JSON file even valid JSON file according to the specification?

Your text is already encoded and you need to tell this to Python by using a b prefix in your string but since you're using json and the input needs to be string you have to decode your encoded text manually. Since your input is not byte you can use 'raw_unicode_escape' encoding to convert the string to byte without encoding and prevent the open method to use its own default encoding. Then you can simply use aforementioned approach to get the desired result.
Note that since you need to do the encoding and decoding your have to read file content and perform the encoding on loaded string, then you should use json.loads() instead of json.load().
In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
...: d = json.loads(f.read().encode('raw_unicode_escape').decode())
...:
In [169]: d
Out[169]: {'sender_name': 'Horníková'}

The JSON you are reading was written incorrectly and the Unicode strings decoded from it will have to be re-encoded with the wrong encoding used, then decoded with the correct encoding.
Here's an example:
#!python3
import json
# The bad JSON you have
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
print('bad_json =',bad_json)
# The wanted result from json.loads()
wanted = {'sender_name':'Horníková'}
# What correctly written JSON should look like
good_json = json.dumps(wanted)
print('good_json =',good_json)
# What you get when loading the bad JSON.
got = json.loads(bad_json)
print('wanted =',wanted)
print('got =',got)
# How to correct the mojibake string
corrected_sender = got['sender_name'].encode('latin1').decode('utf8')
print('corrected_sender =',corrected_sender)
Output:
bad_json = {"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}
good_json = {"sender_name": "Horn\u00edkov\u00e1"}
wanted = {'sender_name': 'Horníková'}
got = {'sender_name': 'HornÃ\xadkovÃ¡'}
corrected_sender = Horníková

I don't know enough about JSON to be able to say whether this is valid or not, but you can parse these strings using the raw_unicode_escape codec:
>>> "Horn\u00c3\u00adkov\u00c3\u00a1".encode('raw_unicode_escape').decode('utf8')
'Horníková'

Reencode to bytes, and then redecode to text.
>>> 'HornÃ\xadkovÃ¡'.encode('latin-1').decode('utf-8')
'Horníková'
Is this type JSON file even valid JSON file according to the specification?
No.
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes [emphasis added].
source
A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal escape sequence [...] represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point [emphasis added].
source
UTF-8 byte sequences are neither Unicode characters nor Unicode code points.

Related

Convert bytearray containing unicode data to str

I need to convert a bytearray which contains non-encoded raw unicode data to an unicode string, e.g. the unicode \u2167 represents the roman number 8:
print(u'\u2167')
Ⅷ
having this information stored in a bytearray I need to find a way to convert it back to unicode. Decoding from e.g. 'utf8' obviously does not work:
b = bytearray([0x21,0x67])
print(b.decode('utf8'))
!g
Any ideas?
EDIT
#Luke's comment got me on the right track. Apparently the original data (not the simplified one I am showing here) is encoded as UTF-16le. The data is obtained from a wxpython TextDataObject. wxpython internally usually uses unicode. That is what made me think that I am dealing with unicode data.

... a bytearray which contains non-encoded raw unicode data
If it is in a bytearray, it is by definition encoded. The Python bytes or bytearray types can contain encoded Unicode data. The str type contains Unicode code points. You .decode a byte string to a Unicode string, and .encode a Unicode string into byte strings. The encoding used for your example is UTF-16BE:
>>> b = bytearray([0x21,0x67])
>>> b.decode('utf-16be')
'Ⅷ'

The line print(b.decode('utf8')) is not correct, correct usage is :
print(b.decode("utf-8"))

Pythonic way of searching for text in binary files [duplicate]

I have a file which mixes binary data and text data. I want to parse it through a regular expression, but I get this error:
TypeError: can't use a string pattern on a bytes-like object
I'm guessing that message means that Python doesn't want to parse binary files.
I'm opening the file with the "rb" flags.
How can I parse binary files with regular expressions in Python?
EDIT: I'm using Python 3.2.0

I think you use Python 3 .
1.Opening a file in binary mode is simple but subtle. The only difference
from opening it in text mode is that
the mode parameter contains a 'b'
character.
........
4.Here’s one difference, though: a binary stream object has no encoding
attribute. That makes sense, right?
You’re reading (or writing) bytes, not
strings, so there’s no conversion for
Python to do.
http://www.diveintopython3.net/files.html#read
Then, in Python 3, since a binary stream from a file is a stream of bytes, a regex to analyse a stream from a file must be defined with a sequence of bytes, not a sequence of characters.
In Python 2, a string was an array of
bytes whose character encoding was
tracked separately. If you wanted
Python 2 to keep track of the
character encoding, you had to use a
Unicode string (u'') instead. But in
Python 3, a string is always what
Python 2 called a Unicode string —
that is, an array of Unicode
characters (of possibly varying byte
lengths).
http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html
and
In Python 3, all strings are sequences
of Unicode characters. There is no
such thing as a Python string encoded
in UTF-8, or a Python string encoded
as CP-1252. “Is this string UTF-8?” is
an invalid question. UTF-8 is a way of
encoding characters as a sequence of
bytes. If you want to take a string
and turn it into a sequence of bytes
in a particular character encoding,
Python 3 can help you with that.
http://www.diveintopython3.net/strings.html#boring-stuff
and
4.6. Strings vs. Bytes# Bytes are bytes; characters are an abstraction.
An immutable sequence of Unicode
characters is called a string. An
immutable sequence of
numbers-between-0-and-255 is called a
bytes object.
....
1.To define a bytes object, use the b' ' “byte literal” syntax. Each byte
within the byte literal can be an
ASCII character or an encoded
hexadecimal number from \x00 to \xff
(0–255).
http://www.diveintopython3.net/strings.html#boring-stuff
So you will define your regex as follows
pat = re.compile(b'[a-f]+\d+')
and not as
pat = re.compile('[a-f]+\d+')
More explanations here:
15.6.4. Can’t use a string pattern on a bytes-like object

In your re.compile you need to use a bytes object, signified by an initial b:
r = re.compile(b"(This)")
This is Python 3 being picky about the difference between strings and bytes.

This is working for me for python 2.6
>>> import re
>>> r = re.compile(".*(ELF).*")
>>> f = open("/bin/ls")
>>> x = f.readline()
>>> r.match(x).groups()
('ELF',)

six.u() un-escaping HTML string

Using Python 2.7. As part of a JSON response, an API returns the string:
TweetDeck
I'm using a library that internally does:
six.u(json.dumps(s))
json.dumps() output is:
'"TweetDeck"'
This output can be decoded correctly with json.loads
But the call to six.u gives:
u'"TweetDeck"'
And attempting to decode this string with json.loads throws an error.
ValueError: Extra data: line 1 column 11 - line 1 column 86 (char 10 - 85)
Looks like the call to six.u un-escaped the href value, but i'm not entirely sure how to fix this.

six.u() is meant for unicode string literals, not JSON output. You should not use it to decode the JSON to a Unicode string.
From the six.u() documenation:
A “fake” unicode literal. text should always be a normal string literal. In Python 2, u() returns unicode, and in Python 3, a string. Also, in Python 2, the string is decoded with the unicode-escape codec, which allows unicode escapes to be used in it.
Emphasis mine.
Instead, decode the string if using Python 2:
json_string = json.dumps(s)
if hasattr(json_string, 'decode'):
# Python 2; decode to a Unicode value
json_string = json_string.decode('ascii')
or use the unicode() function and catch the NameError in Python 3:
json_string = json.dumps(s)
try:
# Python 2; decode to a Unicode value from ASCII
json_string = unicode(json_string)
except NameError:
# Python 3, already Unicode
pass
or set ensure_ascii to False when calling json.dumps():
json_string = json.dumps(s, ensure_ascii=False)
This can still return a str type in Python 2 though, but only if the input contains nothing but ASCII-only data, so the output can safely be mixed with unicode values.
Either way you get consistent values between Python 2 and Python 3; The six.u() decode also decodes \uhhhh JSON Unicode escape sequences to Unicode codepoints, while the Python 3 JSON result would leave those intact. With decoding you'd keep the \uhhhh sequences in both Python 2 and 3, with ensure_ascii you'd get Unicode codepoints in both.
Since this is a 3rd-party library, I filed a bug report; you cannot really recover from this mistake; you cannot insert extra backslashes up front then remove them afterward as you cannot distinguish those from normal backslashes.

Python: Cyrillic handling

I got this data returned b'\\u041a\\u0435\\u0439\\u0442\\u043b\\u0438\\u043d\\u043f\\u0440\\u043e from an API. This data is in Russian which I know for sure. I am guessing these values are unicode representation of the cyrillic letters?
The data returned was a byte array.
How can I convert that into readable cyrillic string? Pretty much I need a way to convert that kind into readable human text.
EDIT: Yes this is JSON data. Forgot to mention, sorry.

Chances are you have JSON data; JSON uses \uhhhh escape sequences to represent Unicode codepoints. Use the json.loads() function on unicode (decoded) data to produce a Python string:
import json
string = json.loads(data.decode('utf8'))
UTF-8 is the default JSON encoding; check your response headers (if you are using a HTTP-based API) to see if a different encoding was used.
Demo:
>>> import json
>>> json.loads(b'"\\u041a\\u0435\\u0439\\u0442\\u043b\\u0438\\u043d\\u043f\\u0440\\u043e"'.decode('utf8'))
'Кейтлинпро'

How to convert a '\u5f71\u89c6\...' in to its real meaning?(python)

I want to convert
'[["[FK\u5f71\u89c6\u51fa\u54c1]\u7576\u65fa\u7238\u7238-17.\u7ca4\u8bed\u5b57\u5e55.TV-RMVB.rmvb", "205.53 MB"]]'
to
'[["[[FK影视出品]當旺爸爸-17.粤语字幕.TV-RMVB.rmvb", "205.53 MB"]]'
Because I make a mistake that I use json.dumps(file_list) convert a list object to str, and save the result to db.I find this mistabke until using sphinx to index these data...
I have tried use data.decode('utf-8'), but it seems not work.

Just decode from JSON again:
>>> import json
>>> json.loads('[["[FK\u5f71\u89c6\u51fa\u54c1]\u7576\u65fa\u7238\u7238-17.\u7ca4\u8bed\u5b57\u5e55.TV-RMVB.rmvb", "205.53 MB"]]')
[['[FK影视出品]當旺爸爸-17.粤语字幕.TV-RMVB.rmvb', '205.53 MB']]
You don't have UTF-8 encoded data, you have JSON-encoded data, which uses \uhhhh escape sequences to represent Unicode codepoints.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.