Parse JSON with \uXXXX characters using Python

I have JSON data which contains a text field with escape sequences such as \n, \u4e0d, etc.
Using Python 2.7, my goal is to write it to CSV "as-is", i.e. \n stays \n and \u4e0d stays \u4e0d (raw strings).
str(data["text"]).encode('string_escape') works as expected for \n but not for \u, giving the error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e0d' in position 32
If I try data["text"].encode('utf-8').encode('string_escape') it works, but it mangles the \u escapes in the input into UTF-8 byte escapes like \xe4\xb8\x8d:
data = json.loads(line)
writer.writerow(data["text"].encode('utf-8').encode('string_escape'))
Is there a way to achieve what I need?
Many thanks

One of the challenges of programming is how to write non-display characters, such as newline, that perform an action instead of displaying a glyph. Python uses a backslash plus additional characters to represent them. For strings, the Python repr function gives you the backslash-escaped representation of a string, as if you were typing it in.
If I type in your example string and print it, well, I get the newline and the unicode glyph, but writing it to an ASCII CSV would raise a Unicode encode error.
>>> test = u'\n hello \u4e0d'
>>> print test
hello 不
>>>
But if I print the string representation, it's what I originally typed in
>>> print repr(test)
u'\n hello \u4e0d'
>>>
If I don't want the python string part, I can just strip it out
>>> print repr(test)[2:-1]
\n hello \u4e0d
>>>
Which is better depends on what happens to that string next. If you want to get back to the real string later, stick with the Python representation and use ast.literal_eval to recover it.
>>> import ast
>>> test2 = repr(test)
>>> original = ast.literal_eval(test2)
>>> original == test
True
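Applied to the original question, a minimal sketch might look like this (assuming a file of JSON lines and a csv.writer; the file names are illustrative):
import csv
import json

with open('input.jsonl') as src, open('output.csv', 'wb') as dst:
    writer = csv.writer(dst)
    for line in src:
        data = json.loads(line)
        # repr() gives an ASCII-only escaped form such as u'\n hello \u4e0d';
        # strip the leading u' and the trailing ' to keep just the escapes
        writer.writerow([repr(data["text"])[2:-1]])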

You have a unicode string. You want to write it to a CSV file as it is. Since you can't write a unicode string to a file directly, you tried to encode it, and it picked up unwanted characters like '\x'. Try this solution, which converts the unicode string to a byte string without adding any unwanted characters -
import ast
import json

data = u' \n \u4e0d'
str_data = ast.literal_eval(json.dumps(data))
writer.writerow(str_data.encode('string_escape'))
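For reference, this is roughly what the intermediate values look like at the interpreter (a sketch using the sample string above):
>>> import json, ast
>>> data = u' \n \u4e0d'
>>> json.dumps(data)                    # ASCII-only JSON string literal
'" \\n \\u4e0d"'
>>> ast.literal_eval(json.dumps(data))  # back to a plain byte string
' \n \\u4e0d'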

Try this technique for writing the data to your file: encode the data with base64 before writing it, and when you want the original data back, just decode it.
>>> import base64
>>> raw_data = '\n \u4e0d'
>>> encoded = base64.b64encode(raw_data)
>>> encoded
'CiBcdTRlMGQ='
>>> base64.b64decode(encoded)
'\n \\u4e0d'
>>>
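If you later need the original unicode text rather than the escaped form, you could decode the base64 payload and then interpret the \u escape (a sketch):
>>> base64.b64decode(encoded).decode('unicode_escape')
u'\n \u4e0d'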

Related

convert Chinese ascii string to Chinese language string

I tried to use the sys module to set the default encoding to convert the string, but it does not work.
The string is:
`\xd2\xe6\xc3\xf1\xba\xcb\xd0\xc4\xd4\xf6\xb3\xa4\xbb\xec\xba\xcf`
It means 益民核心增长混合 in Chinese. But how can I convert this to a Chinese-language string?
I tried this:
>>> string = '\xd2\xe6\xc3\xf1\xba\xcb\xd0\xc4\xd4\xf6\xb3\xa4\xbb\xec\xba\xcf'
>>> print string.decode("gbk")
益民核心增长混合 # As you can see here, got the right answer
>>> new_str = string.decode("gbk")
>>> new_str
u'\u76ca\u6c11\u6838\u5fc3\u589e\u957f\u6df7\u5408' # It seems to return another encoding
>>> another = u"益民核心增长混合"
>>> another
u'\u76ca\u6c11\u6838\u5fc3\u589e\u957f\u6df7\u5408' # same as new_str
So I'm just confused by this situation: why can I print string.decode("gbk"), but new_str in my Python console just shows another encoding?
My OS is Windows 10, and my Python version is Python 2.7. Thank you very much!
You are doing it correctly.
In this case, new_str is actually a unicode string as denoted by the u prefix.
>>> new_str
u'\u76ca\u6c11\u6838\u5fc3\u589e\u957f\u6df7\u5408' # It seems to return another encoding
When you decode the GBK encoded string, you get a unicode string. Each character of this string is a unicode code point, e.g.
>>> u'\u76ca'
u'\u76ca'
>>> print u'\u76ca'
益
>>> import unicodedata
>>> unicodedata.name(u'\u76ca')
'CJK UNIFIED IDEOGRAPH-76CA'
>>> print new_str
益民核心增长混合
>>> print repr(new_str)
u'\u76ca\u6c11\u6838\u5fc3\u589e\u957f\u6df7\u5408'
This is how Python displays unicode strings in the interpreter - it is using repr to display it. But when you print the string, Python converts to the encoding for your terminal (sys.stdout.encoding), and that's why the string is displayed as you expect.
So, it's not a different encoding of the string, it's just the way that Python displays the string in the interpreter.
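To make the terminal-encoding point concrete, here is a small sketch (the value of sys.stdout.encoding depends on your console; 'cp936' is just one possibility on a Chinese-locale Windows machine):
>>> import sys
>>> sys.stdout.encoding
'cp936'
>>> new_str.encode('gbk')   # encoding back to GBK gives the original bytes
'\xd2\xe6\xc3\xf1\xba\xcb\xd0\xc4\xd4\xf6\xb3\xa4\xbb\xec\xba\xcf'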

How to use Python to convert a unicode string to the real string [duplicate]

I have used Python to get some info through urllib2, but the info is a unicode string.
I've tried something like below:
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print unicode(a).encode("gb2312")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.encode("utf-8").decode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print u""+a
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).decode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).encode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.decode("utf-8").encode("gb2312")
but all results are the same:
\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728
And I want to get the following Chinese text:
方法,删除存储在
You need to convert the string to a unicode string.
First of all, a is a plain byte string, so \u is not an escape sequence there and the backslashes are kept literally:
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a # Prints \u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728
a # Prints '\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
So playing with the encoding / decoding of this escaped string makes no difference.
You can either use a unicode literal or convert the string into a unicode string.
To use a unicode literal, just add a u in front of the string:
a = u"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
To convert existing string into a unicode string, you can call unicode, with unicode_escape as the encoding parameter:
print unicode(a, encoding='unicode_escape') # Prints 方法,删除存储在
I bet you are getting the string from a JSON response, so the second method is likely to be what you need.
BTW, the unicode_escape encoding is a Python-specific encoding which is used to "Produce a string that is suitable as Unicode literal in Python source code".
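The same conversion can also be spelled with str.decode, if that reads better (a quick sketch):
>>> a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
>>> print a.decode('unicode_escape')   # equivalent to unicode(a, encoding='unicode_escape')
方法,删除存储在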
Where are you getting this data from? Perhaps you could share the method by which you are downloading and extracting it.
Anyway, it kind of looks like a remnant of some JSON encoded string? Based on that assumption, here is a very hacky (and not entirely serious) way to do it:
>>> a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
>>> a
'\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
>>> s = '"{}"'.format(a)
>>> s
'"\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728"'
>>> import json
>>> json.loads(s)
u'\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728'
>>> print json.loads(s)
方法,删除存储在
This involves recreating a valid JSON-encoded string by wrapping the given string in double quotes, then decoding the JSON string into a Python unicode string.

encrypting a string via binary manipulation

I'm running into a problem involving encrypting strings. What I'm doing is converting each letter into a number using the ord() function and then converting that into binary. I then invert (XOR) the bits, so that the letter 'A', which has the binary code '0100 0001', becomes '1011 1110'; converted back to decimal that is 190, which I chr() back into a character.
I've noticed that certain letters don't convert into any symbol that can be seen at all. When I tried to convert the decimal value 157 to an ASCII character, I got '\x9d' instead of a visible character. According to the Extended ASCII codes, it should have given me a symbol that I can read with the print function and also write to a file. Is there any way to make Python print it as a readable symbol? Right now I'm unable to make this work, because the program can't print symbols that I can read and reverse the process on.
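For reference, the bit inversion described above can be written with XOR against 0xFF; a minimal sketch:
>>> ord('A')              # 0100 0001
65
>>> ord('A') ^ 0xFF       # flip every bit -> 1011 1110
190
>>> chr(ord('A') ^ 0xFF)
'\xbe'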
Python defaults to showing the representation of strings unless you explicitly print them. \x9d is the repr (representation) of the character; if you print it, you will see something else depending on which encoding and font your terminal uses:
>>> chr(157)
'\x9d'
>>> print repr(chr(157)) # equivalent to the above
'\x9d'
>>> print chr(157)
� # this appears as a question mark in a diamond shaped box on my system
This doesn't stop you from writing the data to a file though.
EDIT
If by "Extended ASCII" you are referring to this character set http://en.wikipedia.org/wiki/Code_page_437, you should be able to use
>>> print chr(157).decode('CP437')
¥
This returns a unicode string suitable for printing (if your terminal supports that).
EDIT 2
It's a little different in Python 3.x, as chr returns a unicode str. Instead you want a bytes str (which is equivalent to a Python 2.x str):
>>> bytes([157]) # this is equivalent to chr(157) in Python 2.x
b'\x9d'
>>> bytes([157]).decode('cp437') # decode this to a unicode str with the desired encoding
'¥'
>>> print(bytes([157]).decode('cp437')) # now it's suitable for printing
¥
Make sure when you write the data to a file that you write the raw bytes str, not the unicode (printable) str:
>>> data = bytes([154, 155, 156, 157])
>>> print (data.decode('cp437')) # use decode for printing
Ü¢£¥
>>> with open('output.dat', 'wb') as f:
...     f.write(data) # but not for writing to a file
...
4
>>> with open('output.dat', 'rb') as f:
...     data = f.read()
...     print(data)
...     print(data.decode('cp437'))
...
b'\x9a\x9b\x9c\x9d'
Ü¢£¥

Converting html source content into readable format with Python 2.x

Python 2.7
I have a program that gets video titles from the source code of a webpage but the titles are encoded in some HTML format.
This is what I've tried so far:
>>> import urllib2
>>> urllib2.unquote('£')
'£'
So that didn't work...
Then I tried:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('£')
u'\xa3'
As you can see, that doesn't work either, nor does any combination of the two.
I managed to find out that '£' is an HTML character entity name. The '\xa3' I wasn't able to figure out.
Does anyone know how to do this, how to convert HTML content into a readable format in python?
£ is the HTML character entity for the POUND SIGN, which is the unicode character U+00A3. You can see this if you print it:
>>> print u'\xa3'
£
When you use unescape(), you converted the character entity to its native unicode character, which is what u'\xa3' means--a single U+00A3 unicode character.
If you want to encode this into another format (e.g. utf-8), you would do so with the encode method of strings:
>>> u'\xa3'.encode('utf-8')
'\xc2\xa3'
You get a two-byte string representing the single "POUND SIGN" character.
I suspect that you are a bit unclear about how string encodings work in general. You need to convert your string from bytes to unicode (see this answer for one way to do that with urllib2), then unescape the html, then (possibly) convert the unicode into whatever output encoding you need.
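Putting those steps together, here is a minimal sketch (assuming the page bytes happen to be UTF-8):
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> raw = '&pound;100'                      # bytes as fetched from the page
>>> text = h.unescape(raw.decode('utf-8'))  # bytes -> unicode, then resolve entities
>>> text
u'\xa3100'
>>> text.encode('utf-8')                    # encode for whatever output you need
'\xc2\xa3100'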
Why doesn't that work?
In [1]: s = u'\xa3'
In [2]: s
Out[2]: u'\xa3'
In [3]: print s
£
When it comes to unescaping html entities I always used: http://effbot.org/zone/re-sub.htm#unescape-html.
The video title strings use HTML entities to encode special characters, such as ampersands and pound signs.
The \xa3 is the Python Unicode character literal for the pound sign (£). In your example, Python is displaying the __repr__() of a Unicode string, which is why you see the escapes. If you print this string, you can see it represents the pound sign:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('£')
u'\xa3'
>>> print h.unescape('£')
£
lxml, BeautifulSoup or PyQuery do the job pretty well. Or a combination of these ;)

How do I treat an ASCII string as unicode and unescape the escaped characters in it in python?

For example, if I have a unicode string, I can encode it as an ASCII string like so:
>>> u'\u003cfoo/\u003e'.encode('ascii')
'<foo/>'
However, I have e.g. this ASCII string:
'\u003cfoo/\u003e'
... that I want to turn into the same ASCII string as in my first example above:
'<foo/>'
It took me a while to figure this one out, but this page had the best answer:
>>> s = '\u003cfoo/\u003e'
>>> s.decode( 'unicode-escape' )
u'<foo/>'
>>> s.decode( 'unicode-escape' ).encode( 'ascii' )
'<foo/>'
There's also a 'raw-unicode-escape' codec to handle the other way to specify Unicode strings -- check the "Unicode Constructors" section of the linked page for more details (since I'm not that Unicode-savvy).
EDIT: See also Python Standard Encodings.
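For completeness, a quick sketch of the 'raw-unicode-escape' codec mentioned above; it interprets \uXXXX sequences but leaves other backslash escapes alone:
>>> '\u003cfoo/\u003e'.decode('raw-unicode-escape')
u'<foo/>'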
On Python 2.5 the correct encoding is "unicode_escape", not "unicode-escape" (note the underscore).
I'm not sure whether newer versions of Python changed the codec name, but here it only worked with the underscore.
Anyway, this is it.
At some point you will run into issues when you encounter special characters, like Chinese characters or emoticons, in a string you want to decode, i.e. errors that look like this:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128)
For my case (Twitter data processing), I decoded as follows to allow me to see all characters with no errors:
>>> s = '\u003cfoo\u003e'
>>> s.decode( 'unicode-escape' ).encode( 'utf-8' )
'<foo>'
Ned Batchelder said:
It's a little dangerous depending on where the string is coming from,
but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
Actually this method can be made safe like so:
>>> s = '\u003cfoo\u003e'
>>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]
Mind the triple-quote string and the dash right before the closing 3-quotes.
Using a 3-quoted string ensures that if the user enters ' \\" ' (spaces added for visual clarity) in the string, it will not disrupt the evaluator.
The dash at the end is a failsafe in case the user's string ends with a ' \" '. Before we assign the result, we slice off the inserted dash with [:-1].
So there would be no need to worry about what the users enter, as long as it is captured in raw format.
It's a little dangerous depending on where the string is coming from, but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
