Difference between u'string' and unicode(string) - python

This is a sample program i made:
>>> print u'\u1212'
ሒ
>>> print '\u1212'
\u1212
>>> print unicode('\u1212')
\u1212
why do i get \u1212 instead of ሒ when i print unicode('\u1212')?
I'm making a program to store data and not print it, so how do i store ሒ instead of \u1212? Now obviously i can't do something like:
x = u''+unicode('\u1212')
interestingly even if i do that, here's what i get:
\u1212
another fact that i think is worth mentioning :
>>> u'\u1212' == unicode('\u1212')
False
What do i do to store ሒ or some other character like that instead of \uxxxx?

'\u1212' is an ASCII string with 6 characters: \, u, 1, 2, 1, and 2.
unicode('\u1212') is a Unicode string with 6 characters: \, u, 1, 2, 1, and 2
u'\u1212' is a Unicode string with one character: ሒ.
You should use Unicode strings all around, if that's what you want.
u'\u1212'
If for some reason you need to convert '\u1212' to u'\u1212', use
'\u1212'.decode('unicode-escape')
(Note that in Python 3, strings are always Unicode.)

This is just a misunderstanding.
This is a unicode string: x = u'\u1212'
When you call print x it is will print its character (ሒ) as shown. If you just call x it will show the represntation of it:
u'\u1212'
All is well with the world.
This is an ascii string: y = "\u1212"
When you call print y it is will print its value (\u1212) as shown. If you just call x it will show the represntation of it:
'\\udfgdfg'
Notice the double slashes (\\) that indicate the slash is being escaped.
So, lets look at the following function call: print unicode('\u1212')
This is a function call, and we can replace the string with a variable, so we'll use the equivilent:
y = "\u1212"
print unicode(x)
But as in the second exacmple above, y is an ascii string that is being managed internally as '\udfgdfg', its not a unicode string at all. So the unicode representation of '\\udfgdfg' is exactly the same. Thus why its not behaving correctly.

Related

Python prevent decoding HEX to ASCII while removing backslashes from my Var

I want to strip some unwanted symbols from my variable. In this case the symbols are backslashes. I am using a HEX number, and as an example I will show some short simple code down bellow. But I don't want python to convert my HEX to ASCII, how would I prevent this from happening.? I have some long shell codes for asm to work with later which are really long and removing \ by hand is a long process. I know there are different ways like using echo -e "x\x\x\x" > output etc, but my whole script will be written in python.
Thanks
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> b = a.strip("\\")
>>> print b
1�Phtv
>>> a = "\x31\x32\x33\x34\x35\x36"
>>> b = a.strip("\\")
>>> print b
123456
At the end I would like it to print my var:
>>> print b
x31x32x33x34x35x36
There are no backslashes in your variable:
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> print(a)
1ÀPhtv
Take newline for example: writing "\n" in Python will give you string with one character -- newline -- and no backslashes. See string literals docs for full syntax of these.
Now, if you really want to write string with such backslashes, you can do it with r modifier:
>>> a = r"\x31\xC0\x50\x68\x74\x76"
>>> print(a)
\x31\xC0\x50\x68\x74\x76
>>> print(a.replace('\\', ''))
x31xC0x50x68x74x76
But if you want to convert a regular string to hex-coded symbols, you can do it character by character, converting it to number ("\x31" == "1" --> 49), then to hex ("0x31"), and finally stripping the first character:
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> print(''.join([hex(ord(x))[1:] for x in a]))
'x31xc0x50x68x74x76'
There are two problems in your Code.
First the simple one:
strip() just removes one occurrence. So you should use replace("\\", ""). This will replace every backslash with "", which is the same as removing it.
The second problem is pythons behavior with backslashes:
To get your example working you need to append an 'r' in front of your string to indicate, that it is a raw string. a = r"\x31\xC0\x50\x68\x74\x76". In raw strings, a backlash doesn't escape a character but just stay a backslash.
>>> r"\x31\xC0\x50\x68\x74\x76"
'\\x31\\xC0\\x50\\x68\\x74\\x76'

How can I slice a substring from a unicode string with Python?

I have a unicode string as a result : u'splunk>\xae\uf001'
How can I get the substring 'uf001'
as a simple string in python?
The characters uf001 are not actually present in the string, so you can't just slice them off. You can do
repr(s)[-6:-1]
or
'u' + hex(ord(s[-1]))[2:]
Since you want the actual string (as seen from comments) , just get the last character [-1] index , Example -
>>> a = u'splunk>\xae\uf001'
>>> print(a)
splunk>®ï€
>>> a[-1]
'\uf001'
>>> print(a[-1])
ï€
If you want the unicode representation (\uf001) , then take repr(a[-1]) , Example -
>>> repr(a[-1])
"'\\uf001'"
\uf001 is a single unicode character (not multiple strings) , so you can directly get that character as above.
You see \uf001 because you are checking the results of repr() on the string, if you print it, or use it somewhere else (like for files, etc) it will be the correct \uf001 character.
u'' it is how a Unicode string is represented in Python source code. REPL uses this representation by default to display unicode objects:
>>> u'splunk>\xae\uf001'
u'splunk>\xae\uf001'
>>> print(u'splunk>\xae\uf001')
splunk>®
>>> print(u'splunk>\xae\uf001'[-1])

If your terminal is not configured to display Unicode or if you are on a narrow build (e.g., it is likely for Python 2 on Windows) then the result may be different.
Unicode string is an immutable sequence of Unicode codepoints in Python. len(u'\uf001') == 1: it does not contain uf001 (5 characters) in it. You could write it as u'' (it is necessary to declare the character encoding of your source file on Python 2 if you use non-ascii characters):
>>> u'\uf001' == u''
True
It is just a different way to represent exactly the same Unicode character (a single codepoint in this case).
Note: some user-perceived characters may span several Unicode codepoints e.g.:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'ё')
u'\u0435\u0308'
>>> print(unicodedata.normalize('NFKD', u'ё'))
ё

Decode unicode string in python

I'd like to decode the following string:
t\u028c\u02c8m\u0251\u0279o\u028a\u032f
It should be the IPA of 'tomorrow' as given in a JSON string from http://rhymebrain.com/talk?function=getWordInfo&word=tomorrow
My understanding is that it should be something like:
x = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f'
print x.decode()
I have tried the solutions from here , here , here, and here (and several other that more or less apply), and several permutations of its parts, but I can't get it to work.
Thank you
You need a u before your string (in Python 2.x, which you appear to be using) to indicate that this is a unicode string:
>>> x = u't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # note the u
>>> print x
tʌˈmɑɹoʊ̯
If you have already stored the string in a variable, you can use the following constructor to convert the string into unicode:
>>> s = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # your string has a unicode-escape encoding but is not unicode
>>> x = unicode(s, encoding='unicode-escape')
>>> print x
tʌˈmɑɹoʊ̯
>>> x
u't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # a unicode string

how to compare backslash in python

I have a set of strings that are read from a file say ['\x1\p1', '\x2\p2', '\x3\p3', ... etc.].
When I read them into variables and print them the strings displayed as ['\\x1\\p1', '\\x2\\p2', '\\x3\\p3', ... etc.]. I understand that the variable is represented as '\x1\p1', ... etc. internally, but when it is displayed it is displayed with double slash.
but now I want to search and replace the elements of this list in the sentence, i.e say if \x1\p1 is in the sentence "How are you doing \x1\p1" then replace '\x1\p1' with 'Y'. But the replace method does not work in this case! wonder why?
Let me explain further:
my text file (codes.txt) has entries \xs1\x32, \xs2\x54 delimited by new line. so when I read it using
with open('codes') as codes:
code_list = codes.readlines()
next, I do lets say code_list_element_1 = code_list[1].rstrip()
when I print code_list_element_1, it displays as '\\xs1\\x32'
Next, let me target string be target_string = 'Hi! my name is \xs1\x32'
now I want to replace code_list_element_1 which is supposed to be \xs1\x32 in the target_string with say 'Y'
So, I tried code_list_element_1 in target_string. I get False
Next, instead of reading the codes from a text file I initialized a variable find_me = '\xs1\x32'
now, I try find_me in target_string. I get True
and hence target_string.replace(find_me,"Y") displays what I want: "Hi! my name is Y"
You are looking at a string representation that can be pasted back into Python; the backslashes are doubled to make sure the values are not interpreted as escape sequences (such as \n, meaning a newline, or \xfe, meaning the byte with value 254, hex FE).
If you are building new string values, you also need to use those doubled backslashes to prevent Python from seeing escape sequences where there are none, or use raw string literals:
>>> '\\x1\\p1'
'\\x1\\p1'
>>> r'\x1\p1'
'\\x1\\p1'
For this specific example, not handling the backslashes properly actually results in an exception:
>>> '\x1\p1'
ValueError: invalid \x escape
because Python expects to find two hex digits after a \x escape.
raw strings (those prefixed by r are very useful for backslash-itis.
In [9]: a=r"How are you doing \x1\p1"
In [10]: a
Out[10]: 'How are you doing \\x1\\p1'
In [11]: a.replace(r'\x1\p1', 'Y')
Out[11]: 'How are you doing Y'
In [12]:

Print confusion

I am new to python when i try to print "\20%" that is
>>>"\20%"
why is the shell printing '\x10%' that is, it is showing
'\x10%'
the same is happening with join also when is do
>>>l = ['test','case']
>>>"\20%".join(l)
it shows
'test\x10%case'
I am using python 2.7.3
'\20' is an octal literal, and the same as chr(2 * 8 + 0) == chr(16).
What the Python shell displays by default is not the output of print, but the representation of the given value, which is the hexadecimal '\x10'.
If you want the string \20%, you have to either escape the backaslash ('\\20%') or use a raw string literal (r'\20%'). Both will be displayed as
>>> r'\20%'
'\\20%'
\20 is an escape sequence that refers to the DLE ASCII character whose decimal value is 16 (20 in octal, 10 in hexadecimal). Such a character is printed as the \x10 hex escape by the repr function of strings.
To specify a literal \20, either double the backslash ("\\20") or use a raw string (r"\20").
Two print "\20%"
what if you print directly:
>>> print '\20%'
% # some symbol not correctly display on this page
and do using r
>>> print r'\20%'
\20%
>>> r'\20%' # what r do.
'\\20%'
>>> print '\\20%'
\20%
>>>
Some time back I had same doubt about string and I asked a question, you may find helpful

Categories