I was doing some work in Python with graphs and wanted to a save some structures in files so I could load them fast when I resumed work. One of those was a dictionary which I saved in JSON format using json.dump.
When I load it back with json.load the keys have changed from "1" to u'1'. Why is that? What does it mean? How can I change it? I use the keys later to make some lists which I will then use with the original graph which nodes are the keys (in integer form) and it causes problem in comparisons...
The u prefix signifies a Unicode string. In Python 2.x, you can convert it to a regular string with str(). That shouldn't really be necessary, though; u'1' == '1' because Python will do any conversion for you before comparing.
The u'' or u"" just means that this is a unicode string. Which in general should not be any problem unless you need a byte string. Though I would expect that your original data already was unicode, so it should not be a problem.
It is a unicode string. You can treat it as a normal python string in most cases. If you really want to convert it to a normal string use str(). If you need to convert it to a bytes type, use object.encode(encoding) where encoding is the encoding of the Unicode character, usually 'utf-8'.
Related
In Python, a string can have arbitrary bytes, via "\x??" escaping. These bytes don't necessarily have to map to a char in an encoding. For example, we can have "\xa0", even though 0xa0 isn't a good utf-8 char.
However, if I have a byte array, such as b'\xa0', I can't append it to a string without decoding it. What if I want to just append literally, just like "\xa0"?
How can I append a series of bytes to a string without decoding them at all, just like "\x" escape chars? Is there a "literal decoding" or "no decoding" option to decode()? If not, is there another way to do this?
First, consider whether storing these in a string is truly the best for your usecase. Storing as bytes/bytesarray is usually the more idiomatic option.
However, if you have considered this and still decided to proceed, then you should pass "latin1" as the encoding option to bytes.decode. This converts the bytes directly to the characters with the corresponding value.
Not sure if this is exactly the problem, but I'm trying to insert a tag on the first letter of a unicode string and it seems that this is not working. Could these be because unicode indices work differently than those of regular strings?
Right now my code is this:
for index, paragraph in enumerate(intro[2:-2]):
intro[index] = bold_letters(paragraph, 1)
def bold_letters(string, index):
return "<b>"+string[0]+"</b>"+string[index:]
And I'm getting output like this:
<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים.
It seems the unicode gets messed up when I try to insert the HTML tag. I tried messing with the insert position but didn't make any progress.
Example desired output (hebrew goes right to left):
>>>first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"
BTW, this is for Python 2
You are right, indices work over each byte when you are dealing with raw bytes i.e String in Python(2.x).
To work seamlessly with Unicode data, you need to first let Python(2.x) know that you are dealing with Unicode, then do the string manipulation. You can finally convert it back to raw bytes to keep the behavior abstracted i.e you get String and you return String.
Ideally you should convert all the data from UTF8 raw encoding to Unicode object (I am assuming your source encoding is Unicode UTF8 because that is the standard used by most applications these days) at the very beginning of your code and convert back to raw bytes at the fag end of code like saving to DB, responding to client etc. Some frameworks might handle that for you so that you don't have to worry.
def bold_letters(string, index):
string = string.decode('utf8')
string "<b>"+string[0]+"</b>"+string[index:]
return string.encode('utf8')
This will also work for ASCII because UTF8 is a super-set of ASCII. You can understand how Unicode works and in Python specifically better by reading http://nedbatchelder.com/text/unipain.html
Python 3.x String is a Unicode object so you don't have to explicitly do anything.
You should use Unicode strings. Byte strings in UTF-8 use a variable number of bytes per character. Unicode use one (at least those in the BMP on Python 2...the first 65536 characters):
#coding:utf8
s = u"הקדמה"
t = u'<b>'+s[0]+u'</b>'+s[1:]
print(t)
with open('out.htm','w',encoding='utf-8-sig') as f:
f.write(t)
Output:
<b>ה</b>קדמה
But my Chrome browser displays out.htm as:
In Python, if I have a string like:
a =" Hello - to - everybody"
And I do
a.split('-')
then I get
[u'Hello', u'to', u'everybody']
This is just an example.
How can I get a simple list without that annoying u'??
The u means that it's a unicode string - your original string must also have been a unicode string. Generally it's a good idea to keep strings Unicode as trying to convert to normal strings could potentially fail due to characters with no equivalent.
The u is purely used to let you know it's a unicode string in the representation - it will not affect the string itself.
In general, unicode strings work exactly as normal strings, so there should be no issue with leaving them as unicode strings.
In Python 3.x, unicode strings are the default, and don't have the u prepended (instead, bytes (the equivalent to old strings) are prepended with b).
If you really, really need to convert to a normal string (rarely the case, but potentially an issue if you are using an extension library that doesn't support unicode strings, for example), take a look at unicode.encode() and unicode.decode(). You can either do this before the split, or after the split using a list comprehension.
I have a opposite problem. The str '第一回\u3000甄士隐梦幻识通灵 贾雨村风尘怀闺秀' needs to be splitted by the unicode character. But I made wrong and code split('\u') that leaded to the unicode syntax error.
I should code split('\u3000')
Many problems I've ran into in Python have been related to not having something in Unicode. Is there any good reason to not use Unicode by default? I understand needing to translate something in ASCII, but it seems to be the exception and not the rule.
I know Python 3 uses Unicode for all strings. Should this encourage me as a developer to unicode() all my strings?
Generally, I'm going to say "no" there's not a good reason to use string over unicode. Remember, as well, that you don't have to call unicode() to create a unicode string, you can do so by prefixing the string with a lowercase u like u"this is a unicode string".
In Python 2.x:
A str object is basically just a sequence of bytes.
A unicode object is a sequence of characters.
Knowing this, it should be easy to choose the correct type:
If you want a string of characters use unicode.
If you want an string encoded as bytes use str (in many other languages you'd use byte[] here).
In Python 3.x the type str is a string of characters, just as you would expect. You can use bytes if you want a sequence of bytes.
I have a dictionary of dictionaries in Python:
d = {"a11y_firesafety.html":{"lang:hi": {"div1": "http://a11y.in/a11y/idea/a11y_firesafety.html:hi"}, "lang:kn": {"div1": "http://a11y.in/a11ypi/idea/a11y_firesafety.html:kn}}}
I have this in a JSON file and I encoded it using json.dumps(). Now when I decode it using json.loads() in Python I get a result like this:
temp = {u'a11y_firesafety.html': {u'lang:hi': {u'div1': u'http://a11y.in/a11ypi/idea/a11y_firesafety.html:hi'}, u'lang:kn': {u'div1': u'http://a11y.in/a11ypi/idea/a11y_firesafety.html:kn'}}}
My problem is with the "u" which signifies the Unicode encoding in front of every item in my temp (dictionary of dictionaries). How to get rid of that "u"?
Why do you care about the 'u' characters? They're just a visual indicator; unless you're actually using the result of str(temp) in your code, they have no effect on your code. For example:
>>> test = u"abcd"
>>> test == "abcd"
True
If they do matter for some reason, and you don't care about consequences like not being able to use this code in an international setting, then you could pass in a custom object_hook (see the json docs here) to produce dictionaries with string contents rather than unicode.
You could also use this:
import fileinput
fout = open("out.txt", 'a')
for i in fileinput.input("in.txt"):
str = i.replace("u\"","\"").replace("u\'","\'")
print >> fout,str
The typical json responses from standard websites have these two encoding representations - u' and u"
This snippet gets rid of both of them. It may not be required as this encoding doesn't hinder any logical processing, as mentioned by previous commenter
There is no "unicode" encoding, since unicode is a different data type and I don't really see any reason unicode would be a problem, since you may always convert it to string doing e.g. foo.encode('utf-8').
However, if you really want to have string objects upfront you should probably create your own decoder class and use it while decoding JSON.