I would really appreciate any help.
I'm getting data from a website, and generally everything works fine. But look at this code:
r = requests.get(s_url)
print r.text
>>>[{"nameID":"D1","text":"I’ll understand what actually happened here."}]
print r.json()
>>>[{u'nameID':u'D1',u'text':u'I\u2019ll understand what actually happened here.'}]
As you can see, after r.json() the ' was changed to \u2019.
Why did it do that? How can I prevent such changes?
Thanks
It is not an apostrophe (') in your string, but the Unicode character "right single quotation mark" (’). The difference may not be obvious to the naked eye (depending on the font, it is easier or harder to spot), but to the computer it is a completely different character.
>>> u"’" == u"'"
False
The difference you see between displaying the text attribute and what the json() method returns is just a matter of how that character is represented.
For instance:
>>> s = u'’'
>>> print s
’
But when you do the same with a dictionary returned by json(), its individual keys and values are formatted using repr(), resulting in:
>>> d = {'k': s}
>>> print d
{'k': u'\u2019'}
Just as in:
>>> print repr(s)
u'\u2019'
Nonetheless, it is still the same thing:
>>> d['k'] == s
True
For that matter:
>>> u'’' == u'\u2019'
True
You may want to have a look at the __str__() and __repr__() methods, as well as the description of print, for more details:
https://docs.python.org/2.7/reference/datamodel.html#object.__str__
https://docs.python.org/2.7/reference/datamodel.html#object.__repr__
https://docs.python.org/2.7/reference/simple_stmts.html#print
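The same point can be checked on the JSON text from the question. The following is a minimal sketch in Python 3 syntax (an assumption; the session above is Python 2), with the response body hard-coded in a hypothetical variable raw instead of being fetched with requests. ascii() is used here because it escapes non-ASCII characters much like Python 2's repr() does.

```python
import json

# Hard-coded copy of the response body from the question (no network call).
raw = '[{"nameID":"D1","text":"I\u2019ll understand what actually happened here."}]'

data = json.loads(raw)
text = data[0]["text"]

print(text)         # readable form, with the curly quote
print(ascii(text))  # escaped form, with \u2019 in place of the curly quote

# Both forms describe the same single character, and it is not an apostrophe.
is_same = text[1] == "\u2019"
is_apostrophe = text[1] == "'"
```

Nothing in the data changes between the two print calls; only the way it is displayed does.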
Related
I have a string that is stored in DB as:
FB (\u30a8\u30a2\u30eb\u30fc)
when I load this row from python code, I am unable to format it correctly.
# x = load that string
print x # returns u'FB (\\u30a8\\u30a2\\u30eb\\u30fc)'
Notice the two backslashes ("\\"). This messes up the Unicode characters on the frontend.
Instead of showing the foreign characters, the HTML shows \u30a8\u30a2\u30eb\u30fc.
However, if I wrap the string in some extra characters to turn it into JSON and then load that JSON, I get the expected result.
s = '{"a": "%s"}'%x
json.loads(s)['a']
#prints u'FB (\u30a8\u30a2\u30eb\u30fc)'
Notice the difference between this result (which shows up correctly on the frontend) and directly printing x (which has the extra \).
So although this hacky solution works, I want a cleaner one.
I played around a lot with x.encode('utf-8') etc., but nothing has worked yet.
Thank you!
Since you already have a Unicode string, encode it back to ASCII and decode it with the unicode_escape codec:
>>> s = u'FB (\\u30a8\\u30a2\\u30eb\\u30fc)'
>>> s
u'FB (\\u30a8\\u30a2\\u30eb\\u30fc)'
>>> print s
FB (\u30a8\u30a2\u30eb\u30fc)
>>> s.encode('ascii').decode('unicode_escape')
u'FB (\u30a8\u30a2\u30eb\u30fc)'
>>> print s.encode('ascii').decode('unicode_escape')
FB (エアルー)
raw_string = '\u30a8\u30a2\u30eb\u30fc'
string = ''.join([unichr(int(r, 16)) for r in raw_string.split('\\u') if r])
print(string)
This is one way to solve it, though I expect there is a better answer.
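The split-based idea above stops working as soon as the string has ordinary text around the escapes, as in the question's "FB (...)" value. A possible alternative is sketched below in Python 3 syntax (an assumption; the thread uses Python 2, where unichr would replace chr). The helper name decode_u_escapes is hypothetical, and the sketch only handles four-digit \uXXXX sequences, not surrogate pairs.

```python
import re


def decode_u_escapes(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


stored = "FB (\\u30a8\\u30a2\\u30eb\\u30fc)"  # the backslashes are literal here
decoded = decode_u_escapes(stored)
print(decoded)
```

Unlike encode('ascii'), this leaves any characters that are already non-ASCII untouched.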
I have a UTF-8 text corpus that I can read easily in Python 2.7:
import codecs
sentence = codecs.open("D:\\Documents\\files\\sentence.txt", "r", encoding="utf8")
sentence = sentence.read()
> This is my sentence in the right format
However, when I pass this text corpus to a list (for example, for tokenizing):
tokens = sentence.tokenize()
and print it in the notebook, I get escape-code-like characters such as:
(u'\ufeff\ufeffFaux,', u'Tunisie')
(u'Tunisie', u"l'\xc9gypte,")
whereas I would like normal characters, just like in my original import.
So my question is: how can I put unicode objects in a list without getting these strange escaped characters?
It's all in how you print. Python 2 displays the items of lists and tuples using ASCII-only characters and substitutes backslash escape codes for non-ASCII characters. This makes it easy to see hidden characters that normal printing would make invisible, like the double byte-order mark (BOM) \ufeff you see in your strings. Printing individual string items will display them correctly.
Many examples
Original strings:
>>> s = (u'\ufeff\ufeffFaux,', u'Tunisie')
>>> t = (u'Tunisie', u"l'\xc9gypte,")
Displaying at the interactive prompt:
>>> s
(u'\ufeff\ufeffFaux,', u'Tunisie')
>>> t
(u'Tunisie', u"l'\xc9gypte,")
>>> print s
(u'\ufeff\ufeffFaux,', u'Tunisie')
>>> print t
(u'Tunisie', u"l'\xc9gypte,")
Printing individual strings from the tuples:
>>> print s[0]
Faux,
>>> print s[1]
Tunisie
>>> print t[0]
Tunisie
>>> print t[1]
l'Égypte,
>>> print ' '.join(s)
Faux, Tunisie
>>> print ' '.join(t)
Tunisie l'Égypte,
A way to print tuples without escape codes:
>>> print "('"+"', '".join(s)+"')"
('Faux,', 'Tunisie')
>>> print "('"+"', '".join(t)+"')"
('Tunisie', 'l'Égypte,')
Hm, codecs.open(...) returns a "wrapped version of the underlying file object", and then you overwrite this variable with the result of calling the read method on that object. Brave, irritating, but OK ;-)
When you type, say, äöüß into your "notebook", does it show up like this, or do you see some \uxxxx instead?
The default for codecs.open(...) is errors='strict', so if this is the same environment for all samples, this should work.
I understand that when you write "print it" you print the list, which is different from printing the contents of the list.
Sample (taking a tab typed as \t into a normal "byte" string - this is python 2.7.11):
>>> a="\t"
>>> print a # below is an expanded tab
>>> a
'\t'
>>> [a]
['\t']
>>> print [a]
['\t']
>>> for element in [a]:
... print element
...
>>> # above is an expanded tab
I have this url = 'http://www.bhaskar.com/uttar_pradesh/lucknow/='. After the "=" sign comes one Hindi word, which is the word being searched for. I want to be able to add it as a parameter to this URL, so that I only need to change the word each time and not the whole URL. I tried this:
>>> url = 'http://www.bhaskar.com/uttar_pradesh/lucknow/='
>>> word = 'word1'
>>> conj = url + word
but this gives me the Hindi word in unicode, like this:
>>> conj
'http://www.bhaskar.com/uttar_pradesh/lucknow/=\xe0\xa6\xb8\xe0\xa6\xb0'
Can anyone help?
"but this gives me the Bengali word in unicode"
No, it does not :)
When you type temp in the terminal, it displays an unambiguous representation of the string. When you type print(temp), however, you get a more user-friendly representation of the same string. The string that temp refers to is the same all the time; it is only presented in different ways. See, for example, what happens if you take the escaped form, put it in a variable, and print it:
>>> temp2 = 'http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword=\xe0\xa6\xb8\xe0\xa6\xb0'
>>> print(temp2)
http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword=সর
Actually, you can create the string using escapes for all characters, not only the Bengali ones:
>>> temp3 = '\x68\x74\x74\x70\x3a\x2f\x2f\x77\x77\x77\x2e\x63\x66\x69\x6c\x74\x2e\x69\x69\x74\x62\x2e\x61\x63\x2e\x69\x6e\x2f\x69\x6e\x64\x6f\x77\x6f\x72\x64\x6e\x65\x74\x2f\x66\x69\x72\x73\x74\x3f\x6c\x61\x6e\x67\x6e\x6f\x3d\x33\x26\x71\x75\x65\x72\x79\x77\x6f\x72\x64\x3d\xe0\xa6\xb8\xe0\xa6\xb0'
>>> print(temp3)
http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword=সর
In the end, all these strings are the same:
>>> temp == temp2
True
>>> temp == temp3
True
So don't worry, you have the correct string in the variable. You would only have a problem if the escaped form were displayed elsewhere. Finish your program, run it to the end, and you'll see there are no errors.
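For the original goal of changing only the search word, the URL can be built from a fixed base plus the word. The sketch below uses Python 3 syntax (an assumption; the thread uses Python 2), and the helper name build_url is hypothetical. It also assumes the site accepts a percent-encoded UTF-8 query word, which the thread does not confirm.

```python
from urllib.parse import quote, unquote

base = "http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword="
# The Bengali word from the example, written as its UTF-8 bytes.
word = b"\xe0\xa6\xb8\xe0\xa6\xb0".decode("utf-8")


def build_url(query_word):
    # quote() percent-encodes the UTF-8 bytes of the word.
    return base + quote(query_word)


url = build_url(word)
print(url)
```

The percent-encoded form is yet another representation of the same bytes, and unquote() recovers the original word from it.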
This is a sample program I made:
>>> print u'\u1212'
ሒ
>>> print '\u1212'
\u1212
>>> print unicode('\u1212')
\u1212
Why do I get \u1212 instead of ሒ when I print unicode('\u1212')?
I'm making a program to store data, not print it, so how do I store ሒ instead of \u1212? Obviously I can't do something like:
x = u''+unicode('\u1212')
Interestingly, even if I do that, here's what I get:
\u1212
Another fact that I think is worth mentioning:
>>> u'\u1212' == unicode('\u1212')
False
What do I do to store ሒ or some other character like that instead of \uxxxx?
'\u1212' is an ASCII string with 6 characters: \, u, 1, 2, 1, and 2.
unicode('\u1212') is a Unicode string with 6 characters: \, u, 1, 2, 1, and 2
u'\u1212' is a Unicode string with one character: ሒ.
You should use Unicode strings all around, if that's what you want.
u'\u1212'
If for some reason you need to convert '\u1212' to u'\u1212', use
'\u1212'.decode('unicode-escape')
(Note that in Python 3, strings are always Unicode.)
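The character counts above can be checked directly. This is a minimal sketch in Python 3 syntax (an assumption; the answer itself is about Python 2), where a raw string literal plays the role of the Python 2 byte string '\u1212'. The variable names are illustrative only.

```python
escaped = r"\u1212"   # six characters: \, u, 1, 2, 1, 2
single = "\u1212"     # one character

# Turn the six-character escaped form into the one-character string.
converted = escaped.encode("ascii").decode("unicode_escape")

print(len(escaped), len(single), len(converted))
```

After the conversion, the stored value is the single character, not the escape sequence.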
This is just a misunderstanding.
This is a unicode string: x = u'\u1212'
When you call print x it will print the character (ሒ) as shown. If you just evaluate x it will show its representation:
u'\u1212'
All is well with the world.
This is an ascii string: y = "\u1212"
When you call print y it will print its value (\u1212) as shown. If you just evaluate y it will show its representation:
'\\u1212'
Notice the double backslash (\\), which indicates that the backslash is being escaped.
So, let's look at the following function call: print unicode('\u1212')
This is a function call, and we can replace the string with a variable, so we'll use the equivalent:
y = "\u1212"
print unicode(y)
But as in the second example above, y is an ASCII string that is stored internally as '\\u1212'; it's not a Unicode string with a single character. So the Unicode version of '\\u1212' contains exactly the same six characters. That is why it does not behave as expected.
[u'Iphones', u'dont', u'receieve', u'messages']
Is there a way to print it without the "u" in front of it?
What you are seeing is the __repr__() representation of the unicode string, which includes the u to make the type clear. If you don't want the u you could convert each item with str() (which uses __str__). This works for me:
print [str(x) for x in l]
Probably better is to read up on Python Unicode and encode using the particular codec you want:
print [x.encode() for x in l]
[edit]: to clarify repr and why the u is there - the goal of repr is to provide a convenient string representation, "to return a string that would yield an object with the same value when passed to eval()". I.e., you can copy and paste the printed output and get the same object (a list of unicode strings).
Python contains string classes for both unicode strings and regular strings. The u before a string indicates that it is a unicode string.
>>> mystrings = [u'Iphones', u'dont', u'receieve', u'messages']
>>> [str(s) for s in mystrings]
['Iphones', 'dont', 'receieve', 'messages']
>>> type(u'Iphones')
<type 'unicode'>
>>> type('Iphones')
<type 'str'>
See http://docs.python.org/library/stdtypes.html#sequence-types-str-unicode-list-tuple-buffer-xrange for more information about the string types available in Python.
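Another option, if only the displayed text matters, is to build the output string by hand rather than printing the list object. This is a small sketch that should run on both Python 2 and Python 3; the variable names are illustrative only.

```python
words = [u'Iphones', u'dont', u'receieve', u'messages']

# Build the display string by hand instead of relying on the list's repr().
formatted = '[' + ', '.join(words) + ']'
print(formatted)
```

The list itself is left unchanged; only the text that gets printed is different.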