Basic Unicode encoding/decoding - python

Python 2.7.9 / Windows environment
when I
print myString
I'm seeing:
u'\u5df1\u6b66\u8d2a\u5929\u66f2'
Now I know the console I'm using (git-bash) is capable of displaying unicode. How can I encode (or decode, which ever is the right process to do) myString so that it displays:
己武贪天曲
I understand that the question is very basic. If anyone has good introductory material or reference, links would be most welcomed.

What you see is the result of print repr(u'\u5df1\u6b66\u8d2a\u5929\u66f2'). If isinstancetype(myString, (str, unicode)) is true then find the source where the string is defined and fix it. If myString is some other type then look at how its __str__, __repr__, __unicode__ methods are defined. To fix it; remove the code that calls unnecessary repr() (it can hide as a formatting operation e.g., "%r" % o).
To check whether your environment supports Unicode, run: print u'\u5929'. It should produce 天.
If your input is a Python literal and you can't change it (you should try at the very least to switch it to json format) then you could use ast.literal_eval(r"u'\u5929'") to get unicode string object:
import ast
print ast.literal_eval(myString)

You should try this:
message=u'\\u5df1\\u6b66\\u8d2a\\u5929\\u66f2'
print message.decode('unicode-escape')
I guess you are mising a "\" on every desired character

You should use the encode method . Consider this example :
str='hello'
print(str.encode(encoding='base64'))
For the list of available encoding , check this :
https://docs.python.org/2/library/codecs.html#standard-encodings

Related

Odd placeholder or pre-statement on string declaration [duplicate]

Like in:
u'Hello'
My guess is that it indicates "Unicode", is that correct?
If so, since when has it been available?
You're right, see 3.1.3. Unicode Strings.
It's been the syntax since Python 2.0.
Python 3 made them redundant, as the default string type is Unicode. Versions 3.0 through 3.2 removed them, but they were re-added in 3.3+ for compatibility with Python 2 to aide the 2 to 3 transition.
The u in u'Some String' means that your string is a Unicode string.
Q: I'm in a terrible, awful hurry and I landed here from Google Search. I'm trying to write this data to a file, I'm getting an error, and I need the dead simplest, probably flawed, solution this second.
A: You should really read Joel's Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) essay on character sets.
Q: sry no time code pls
A: Fine. try str('Some String') or 'Some String'.encode('ascii', 'ignore'). But you should really read some of the answers and discussion on Converting a Unicode string and this excellent, excellent, primer on character encoding.
My guess is that it indicates "Unicode", is it correct?
Yes.
If so, since when is it available?
Python 2.x.
In Python 3.x the strings use Unicode by default and there's no need for the u prefix. Note: in Python 3.0-3.2, the u is a syntax error. In Python 3.3+ it's legal again to make it easier to write 2/3 compatible apps.
I came here because I had funny-char-syndrome on my requests output. I thought response.text would give me a properly decoded string, but in the output I found funny double-chars where German umlauts should have been.
Turns out response.encoding was empty somehow and so response did not know how to properly decode the content and just treated it as ASCII (I guess).
My solution was to get the raw bytes with 'response.content' and manually apply decode('utf_8') to it. The result was schöne Umlaute.
The correctly decoded
für
vs. the improperly decoded
fĂźr
All strings meant for humans should use u"".
I found that the following mindset helps a lot when dealing with Python strings: All Python manifest strings should use the u"" syntax. The "" syntax is for byte arrays, only.
Before the bashing begins, let me explain. Most Python programs start out with using "" for strings. But then they need to support documentation off the Internet, so they start using "".decode and all of a sudden they are getting exceptions everywhere about decoding this and that - all because of the use of "" for strings. In this case, Unicode does act like a virus and will wreak havoc.
But, if you follow my rule, you won't have this infection (because you will already be infected).

Python - How can I convert a special character to the unicode representation?

In a dictionary, I have the following value with equals signal:
{"appVersion":"o0u5jeWA6TwlJacNFnjiTA=="}
To be explicit, I need to replace the = for the unicode representation '\u003d' (basically the reverse process of [json.loads()][1]). How can I set the unicode value to a variable without store the value with two scapes (\\u003d)?.
I've tryed of different ways, including the enconde/decode, repr(), unichr(61), etc, and even searching a lot, cound't find anything that does this, all the ways give me the following final result (or the original result):
'o0u5jeWA6TwlJacNFnjiTA\\u003d\\u003d'
Since now, thanks for your attention.
EDIT
When I debug the code, it gives me the value of the variable with 2 escapes. The program will get this value and use it to do the following actions, including the extra escape. I'm using this code to construct a json by the json.dumps() and the result returned is a unicode with 2 escapes.
Follow a print of the final result after the JSON construction. I need to find a way to store the value in the var with just one escape.
I don't know if make difference, but I'm doing this to a custom BURP Plugin, manipulating some selected requests.
Here is an image of my POC, getting the value of the var.
The extra backslash is not actually added, The Python interpreter uses the repr() to indicate that it's a backslash not something like \t or \n when the string containing \ gets printed:
I hope this helps:
>>> t['appVersion'] = t["appVersion"].replace('=', '\u003d')
>>> t['appVersion']
'o0u5jeWA6TwlJacNFnjiTA\\u003d\\u003d'
>>> print(t['appVersion'])
o0u5jeWA6TwlJacNFnjiTA\u003d\u003d
>>> t['appVersion'] == 'o0u5jeWA6TwlJacNFnjiTA\u003d\u003d'
True

how to convert Python 2 unicode() function into correct Python 3.x syntax

I enabled the compatibility check in my Python IDE and now I realize that the inherited Python 2.7 code has a lot of calls to unicode() which are not allowed in Python 3.x.
I looked at the docs of Python2 and found no hint how to upgrade:
I don't want to switch to Python3 now, but maybe in the future.
The code contains about 500 calls to unicode()
How to proceed?
Update
The comment of user vaultah to read the pyporting guide has received several upvotes.
My current solution is this (thanks to Peter Brittain):
from builtins import str
... I could not find this hint in the pyporting docs.....
As has already been pointed out in the comments, there is already advice on porting from 2 to 3.
Having recently had to port some of my own code from 2 to 3 and maintain compatibility for each for now, I wholeheartedly recommend using python-future, which provides a great tool to help update your code (futurize) as well as clear guidance for how to write cross-compatible code.
In your specific case, I would simply convert all calls to unicode to use str and then import str from builtins. Any IDE worth its salt these days will do that global search and replace in one operation.
Of course, that's the sort of thing futurize should catch too, if you just want to use automatic conversion (and to look for other potential issues in your code).
You can test whether there is such a function as unicode() in the version of Python that you're running. If not, you can create a unicode() alias for the str() function, which does in Python 3 what unicode() did in Python 2, as all strings are unicode in Python 3.
# Python 3 compatibility hack
try:
unicode('')
except NameError:
unicode = str
Note that a more complete port is probably a better idea; see the porting guide for details.
Short answer: Replace all unicode calls with str calls.
Long answer: In Python 3, Unicode was replaced with strings because of its abundance. The following solution should work if you are only using Python 3:
unicode = str
# the rest of your goes goes here
If you are using it with both Python 2 or Python 3, use this instead:
import sys
if sys.version_info.major == 3:
unicode = str
# the rest of your code goes here
The other way: run this in the command line
$ 2to3 package -w
First, as a strategy, I would take a small part of your program and try to port it. The number of unicode calls you are describing suggest to me that your application cares about string representations more than most and each use-case is often different.
The important consideration is that all strings are unicode in Python 3. If you are using the str type to store "bytes" (for example, if they are read from a file), then you should be aware that those will not be bytes in Python3 but will be unicode characters to begin with.
Let's look at a few cases.
First, if you do not have any non-ASCII characters at all and really are not using the Unicode character set, it is easy. Chances are you can simply change the unicode() function to str(). That will assure that any object passed as an argument is properly converted. However, it is wishful thinking to assume it's that easy.
Most likely, you'll need to look at the argument to unicode() to see what it is, and determine how to treat it.
For example, if you are reading UTF-8 characters from a file in Python 2 and converting them to Unicode your code would look like this:
data = open('somefile', 'r').read()
udata = unicode(data)
However, in Python3, read() returns Unicode data to begin with, and the unicode decoding must be specified when opening the file:
udata = open('somefile', 'r', encoding='UTF-8').read()
As you can see, transforming unicode() simply when porting may depend heavily on how and why the application is doing Unicode conversions, where the data has come from, and where it is going to.
Python3 brings greater clarity to string representations, which is welcome, but can make porting daunting. For example, Python3 has a proper bytes type, and you convert byte-data to unicode like this:
udata = bytedata.decode('UTF-8')
or convert Unicode data to character form using the opposite transform.
bytedata = udata.encode('UTF-8')
I hope this at least helps determine a strategy.
You can use six library which have text_type function (unicode in py2, str in py3):
from six import text_type

Bytes string in Python

Would you know by any chance how to get rid on the bytes identifier in front of a string in a Python's list, perhaps there is some global setting that can be amended?
I retrieve a query from the Postgres 9.3, and create a list form that query. It looks like Python 3.3 interprets records in columns that are of type char(4) as if the they are bytes strings, for example:
Funds[1][1]
b'FND3'
Funds[1][1].__class__
<class 'bytes'>
So the implication is:
Funds[1][1]=='FND3'
False
I have some control over that database so I could change the column type to varchar(4), and it works well:
Funds[1][1]=='FND3'
True
But this is only a temporary solution.
The little b makes my life a nightmare for the last two days ;), and I would appreciate your help with that problem.
Thanks and Regards
Peter
You have to either manually implement __str__/__repr__ or, if you're willing to take the risk, do some sort of Regex-replace over the string.
Example __repr__:
def stringify(lst):
return "[{}]".format(", ".join(repr(x)[1:] if isinstance(x, bytes) else repr(x) for x in lst))
The b isn't part of the string, any more than the quotes around it are; they're just part of the representation when you print the string out. So, you're chasing the wrong problem, one that doesn't exist.
The problem is that the byte string b'FND3' is not the same thing as the string 'FND3'. In this particular example, that may seem silly, but if you might ever have any non-ASCII characters anywhere, it stops being silly.
For example, the string 'é' is the same as the byte string b'\xe9' in Latin-1, and it's also the same as the byte string b'\xce\xa9' in UTF-8. And of course b'\xce\a9' is the same as the string 'é' in Latin-1.
So, you have to be explicit about what encoding you're using:
Funds[1][1].decode('utf-8')=='FND3'
But why is PostgreSQL returning you byte strings? Well, that's what a char column is. It's up to the Python bindings to decide what to do with them. And without knowing which of the multiple PostgreSQL bindings you're using, and which version, it's impossible to tell you what to do. But, for example, in recent-ish psycopg, you just have to set an encoding in the connection (e.g., conn.set_client_encoding('UTF-8'); in older versions you had to register a standard typecaster and do some more stuff; etc.; in py-postgresql you have to register lambda s: s.decode('utf-8'); etc.

Python doesn't interpret UTF8 correctly

I know similar questions have been asked a million times, but despite reading through many of them I can't find a solution that applies to my situation.
I have a django application, in which I've created a management script. This script reads some text files, and outputs them to the terminal (it will do more useful stuff with the contents later, but I'm still testing it out) and the characters come out with escape sequences like \xc3\xa5 instead of the intended å. Since that escape sequence means Ã¥, which is a common misinterpretation of å because of encoding problems, I suspect there are at least two places where this is going wrong. However, I can't figure out where - I've checked all the possible culprits I can think of:
The terminal encoding is UTF-8; echo $LANG gives en_US.UTF-8
The text files are encoded in UTF-8; file * in the directory where they reside results in all entries being listed as "UTF-8 Unicode text" except one, which does not contain any non-ASCII characters and is listed as "ASCII text". Running iconv -f ascii -t utf8 thefile.txt > utf8.txt on that file yields another file with ASCII text encoding.
The Python scripts are all UTF-8 (or, in several cases, ASCII with no non-ASCII characters). I tried inserting a comment in my management script with some special characters to force it to save as UTF-8, but it did not change the behavior. The above observations on the text files apply on all Python script files as well.
The Python script that handles the text files has # -*- encoding: utf-8 -*- at the top; the only line preceding that is #!/usr/bin/python3, but I've tried both changing to .../python for Python 2.7 or removing it entirely to leave it up to Django, without results.
According to the documentation, "Django natively supports Unicode data", so I "can safely pass around Unicode strings" anywhere in the application.
I really can't think of anywhere else to look for a non-UTF-8 link in the chain. Where could I possibly have missed a setting to change to UTF-8?
For completeness: I'm reading from the files with lines = file.readlines() and printing with the standard print() function. No manual encoding or decoding happens at either end.
UPDATE:
In response to quiestions in comments:
print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding) yields ('ascii', 'UTF-8', None) for all files.
I started compiling an SSCCE, and quickly found that the problem is only there if I try to print the value in a tuple. In other words, print(lines[0].strip()) works fine, but print(lines[0].strip(), lines[1].strip()) does not. Adding .decode('utf-8') yields a tuple where both strings are marked with a prepending u and \xe5 (the correct escape sequence for å) instead of the odd characters before - but I can't figure out how to print them as regular strings, with no escape characters. I've tested another call to .decode('utf-8') as well as wrapping in str() but both fail with UnicodeEncodeError complaining that \xe5 can't be encoded in ascii. Since a single string works correctly, I don't know what else to test.
SSCCE:
# -*- coding: utf-8 -*-
import os, sys
for root,dirs,files in os.walk('txt-songs'):
for filename in files:
with open(os.path.join(root,filename)) as f:
print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding)
lines = f.readlines()
print(lines[0].strip()) # works
print(lines[0].strip(), lines[1].strip()) # does not work
The big problem here is that you're mixing up Python 2 and Python 3. In particular, you've written Python 3 code, and you're trying to run it in Python 2.7. But there are a few other problems along the way. So, let me try to explain everything that's going wrong.
I started compiling an SSCCE, and quickly found that the problem is only there if I try to print the value in a tuple. In other words, print(lines[0].strip()) works fine, but print(lines[0].strip(), lines[1].strip()) does not.
The first problem here is that the str of a tuple (or any other collection) includes the repr, not the str, of its elements. The simple way to solve this problem is to not print collections. In this case, there is really no reason to print a tuple at all; the only reason you have one is that you've built it for printing. Just do something like this:
print '({}, {})'.format(lines[0].strip(), lines[1].strip())
In cases where you already have a collection in a variable, and you want to print out the str of each element, you have to do that explicitly. You can print the repr of the str of each with this:
print tuple(map(str, my_tuple))
… or print the str of each directly with this:
print '({})'.format(', '.join(map(str, my_tuple)))
Notice that I'm using Python 2 syntax above. That's because if you actually used Python 3, there would be no tuple in the first place, and there would also be no need to call str.
You've got a Unicode string. In Python 3, unicode and str are the same type. But in Python 2, it's bytes and str that are the same type, and unicode is a different one. So, in 2.x, you don't have a str yet, which is why you need to call str.
And Python 2 is also why print(lines[0].strip(), lines[1].strip()) prints a tuple. In Python 3, that's a call to the print function with two strings as arguments, so it will print out two strings separated by a space. In Python 2, it's a print statement with one argument, which is a tuple.
If you want to write code that works the same in both 2.x and 3.x, you either need to avoid ever printing more than one argument, or use a wrapper like six.print_, or do a from __future__ import print_function, or be very careful to do ugly things like adding in extra parentheses to make sure your tuples are tuples in both versions.
So, in 3.x, you've got str objects and you just print them out. In 2.x, you've got unicode objects, and you're printing out their repr. You can change that to print out their str, or to avoid printing a tuple in the first place… but that still won't help anything.
Why? Well, printing anything, in either version, just calls str on it and then passes it to sys.stdio.write. But in 3.x, str means unicode, and sys.stdio is a TextIOWrapper; in 2.x, str means bytes, and sys.stdio is a binary file.
So, the pseudocode for what ultimately happens is:
sys.stdio.wrapped_binary_file.write(s.encode(sys.stdio.encoding, sys.stdio.errors))
sys.stdio.write(s.encode(sys.getdefaultencoding()))
And, as you saw, those will do different things, because:
print(sys.getdefaultencoding(), sys.stdout.encoding, f.encoding) yields ('ascii', 'UTF-8', None)
You can simulate Python 3 here by using a io.TextIOWrapper or codecs.StreamWriter and then using print >>f, … or f.write(…) instead of print, or you can explicitly encode all your unicode objects like this:
print '({})'.format(', '.join(element.encode('utf-8') for element in my_tuple)))
But really, the best way to deal with all of these problems is to run your existing Python 3 code in a Python 3 interpreter instead of a Python 2 interpreter.
If you want or need to use Python 2.7, that's fine, but you have to write Python 2 code. If you want to write Python 3 code, that's great, but you have to run Python 3.3. If you really want to write code that works properly in both, you can, but it's extra work, and takes a lot more knowledge.
For further details, see What's New In Python 3.0 (the "Print Is A Function" and "Text Vs. Data Instead Of Unicode Vs. 8-bit" sections), although that's written from the point of view of explaining 3.x to 2.x users, which is backward from what you need. The 3.x and 2.x versions of the Unicode HOWTO may also help.
For completeness: I'm reading from the files with lines = file.readlines() and printing with the standard print() function. No manual encoding or decoding happens at either end.
In Python 3.x, the standard print function just writes Unicode to sys.stdout. Since that's a io.TextIOWrapper, its write method is equivalent to this:
self.wrapped_binary_file.write(s.encode(self.encoding, self.errors))
So one likely problem is that sys.stdout.encoding does not match your terminal's actual encoding.
And of course another is that your shell's encoding does not match your terminal window's encoding.
For example, on OS X, I create a myscript.py like this:
print('\u00e5')
Then I fire up Terminal.app, create a session profile with encoding "Western (ISO Latin 1)", create a tab with that session profile, and do this:
$ export LANG=en_US.UTF-8
$ python3 myscript.py
… and I get exactly the behavior you're seeing.
It seems from your comment that you are using python-2 and not python-3.
If you are using python-3, it's worth reading the unicode howto guide on reading/writing to understand what python is doing.
The basic flow if encoding is:
DECODE from encoding to unicode -> Processing -> Encode from unicode to encoding
In python3 the bytes are decoded to strings and strings are encoded to bytes.
The bytes to string decoding is handled for you with open().
[..] the built-in open() function can return a file-like object that
assumes the file’s contents are in a specified encoding and accepts
Unicode parameters for methods such as read() and write(). This works
through open()‘s encoding and errors parameters [..]
So to read in unicode from a utf-8 encoded file you should be doing this:
# python-3
with open('utf8.txt', mode='r', encoding='utf-8') as f:
lines = f.readlines() # returns unicode
If you want similar functionality using python-2, you can use codecs.open():
# python-2
import codecs
with codecs.open('utf8.txt', mode='r', encoding='utf-8') as f:
lines = f.readlines() # returns unicode

Categories