How can I slice a substring from a unicode string with Python?

I have a unicode string as a result: u'splunk>\xae\uf001'
How can I get the substring 'uf001' as a simple string in Python?

The characters uf001 are not actually present in the string, so you can't just slice them off. You can do
repr(s)[-6:-1]
or
'u' + hex(ord(s[-1]))[2:]
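For readers on Python 3, the second approach could be sketched roughly as below. The helper name codepoint_label is made up for this illustration; only ord() and format() are standard built-ins.

```python
def codepoint_label(ch):
    """Return 'u' plus the hex code point of one character.

    Hypothetical helper for illustration; pads to at least four hex digits.
    """
    return 'u' + format(ord(ch), '04x')

s = 'splunk>\xae\uf001'
label = codepoint_label(s[-1])
print(label)  # uf001
```

Padding with '04x' keeps short code points such as U+00AE in the familiar four-digit form, which plain hex() would not do.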

Since you want the actual character (as the comments indicate), just take the last character with index [-1]. Example:
>>> a = u'splunk>\xae\uf001'
>>> print(a)
splunk>®
>>> a[-1]
'\uf001'
>>> print(a[-1])
If you want the escaped representation (\uf001), take repr(a[-1]). Example:
>>> repr(a[-1])
"'\\uf001'"
\uf001 is a single Unicode character (not several characters), so you can get it directly as shown above.
You see \uf001 because you are looking at the result of repr() on the string. If you print it or use it elsewhere (for example, write it to a file), it will be the actual U+F001 character.

u'' is how a Unicode string is written in Python 2 source code. The REPL uses this representation by default to display unicode objects:
>>> u'splunk>\xae\uf001'
u'splunk>\xae\uf001'
>>> print(u'splunk>\xae\uf001')
splunk>®
>>> print(u'splunk>\xae\uf001'[-1])

If your terminal is not configured to display Unicode, or if you are on a narrow build (likely for Python 2 on Windows), the result may differ.
A Unicode string is an immutable sequence of Unicode code points in Python. len(u'\uf001') == 1: the string does not contain the five characters uf001. You could also write it as u'' with the literal character between the quotes (on Python 2 you must declare the character encoding of your source file if you use non-ASCII characters; many fonts have no glyph for this private-use character, so it may look empty):
>>> u'\uf001' == u''
True
It is just a different way to represent exactly the same Unicode character (a single codepoint in this case).
Note: some user-perceived characters may span several Unicode code points, e.g.:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'ё')
u'\u0435\u0308'
>>> print(unicodedata.normalize('NFKD', u'ё'))
ё
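As a rough Python 3 illustration of the same point, the sketch below compares the length of the precomposed and decomposed forms and shows that NFC normalization recombines them. It uses NFD rather than NFKD, which gives the same result for this particular character.

```python
import unicodedata

composed = '\u0451'  # ё as a single precomposed code point
decomposed = unicodedata.normalize('NFD', composed)

print(len(composed), len(decomposed))      # 1 2
print([hex(ord(c)) for c in decomposed])   # ['0x435', '0x308']

# NFC recombines the base letter and combining mark into one code point.
recomposed = unicodedata.normalize('NFC', decomposed)
print(recomposed == composed)              # True
```

Normalizing both sides to the same form before slicing or comparing is one way to avoid surprises with such characters.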

Related

Char and bytes in python

In reading this tutorial I came across the following difference between __unicode__ and __str__ method:
Due to this difference, there’s yet another dunder method in the mix for controlling string conversion in Python 2: __unicode__. In Python 2, __str__ returns bytes, whereas __unicode__ returns characters.
How exactly are "character" and "byte" defined here? For example, in C a char is one byte, so wouldn't a char equal a byte? Or is this referring to (potentially) Unicode characters, which could be multiple bytes? For example, if we took the following:
Ω (omega symbol)
03 A9 or u'\u03a9'
In Python, would this be considered one character (Ω) and two bytes, or two characters (03 A9) and two bytes? Or maybe I am confusing char and character?

In Python, u'\u03a9' is a string consisting of the single Unicode character Ω (U+03A9). The internal representation of that string is an implementation detail, so it doesn't make sense to ask about the bytes involved.
One source of ambiguity is a string like 'é', which could either be the single character U+00E9 or the two-character string U+0065 U+0301.
>>> len(u'\u00e9'); print(u'\u00e9')
1
é
>>> len(u'\u0065\u0301'); print(u'\u0065\u0301')
2
é
The two-byte sequence '\xce\xa9', however, can be interpreted as the UTF-8 encoding of U+03A9.
>>> u'\u03a9'.encode('utf-8')
'\xce\xa9'
>>> '\xce\xa9'.decode('utf-8')
u'\u03a9'
In Python 3, that would be (with UTF-8 being the default encoding scheme)
>>> '\u03a9'.encode()
b'\xce\xa9'
>>> b'\xce\xa9'.decode()
'Ω'
Other byte sequences can be decoded to U+03A9 as well:
>>> b'\xff\xfe\xa9\x03'.decode('utf16')
'Ω'
>>> b'\xff\xfe\x00\x00\xa9\x03\x00\x00'.decode('utf32')
'Ω'
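To make the character/byte distinction concrete, here is a small Python 3 sketch that encodes the same one-character string with three codecs and records the byte counts. The little-endian codec names are used so that no byte order mark is added to the output.

```python
omega = '\u03a9'  # one character: Ω

sizes = {}
for codec in ('utf-8', 'utf-16-le', 'utf-32-le'):
    data = omega.encode(codec)
    sizes[codec] = len(data)
    print(codec, data, len(data))

# The string itself is always one character long,
# however many bytes a given encoding needs.
print(len(omega))  # 1
```

The number of bytes is a property of the chosen encoding, not of the string.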

encoding unicode using UTF-8

In Python, if I type
euro = u'\u20AC'
euroUTF8 = euro.encode('utf-8')
print(euroUTF8, type(euroUTF8), len(euroUTF8))
the output is
('\xe2\x82\xac', <type 'str'>, 3)
I have two questions:
1. It looks like euroUTF8 is encoded in 3 bytes, but how do I get its binary representation to see how many bits it contains?
2. What does 'x' in '\xe2\x82\xac' mean? I don't think 'x' is a hex digit. And why are there three '\'?

In Python 2, print is a statement, not a function. You are printing a tuple here. Print the individual elements by removing the (..):
>>> euro = u'\u20AC'
>>> euroUTF8 = euro.encode('utf-8')
>>> print euroUTF8, type(euroUTF8), len(euroUTF8)
€ <type 'str'> 3
Now you get the 3 individual objects written as strings to stdout; my terminal just happens to be configured to interpret anything written to it as UTF-8, so the bytes correctly result in the € Euro symbol being displayed.
The \x<hh> sequences are Python string literal escape sequences (see the reference documentation); they are the default output for the repr() applied to a string with non-ASCII, non-printable bytes in them. You'll see the same thing when echoing the value in an interactive interpreter:
>>> euroUTF8
'\xe2\x82\xac'
>>> euroUTF8[0]
'\xe2'
>>> euroUTF8[1]
'\x82'
>>> euroUTF8[2]
'\xac'
They provide you with ASCII-safe debugging output. The contents of all Python standard library containers, including lists, tuples and dictionaries, are displayed in this format.
If you want to see the bits that make up these values, convert each byte to an integer with the ord() function, then format the integer as binary:
>>> ' '.join([format(ord(b), '08b') for b in euroUTF8])
'11100010 10000010 10101100'
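On Python 3 the same idea needs no ord() call, because iterating over a bytes object already yields integers. A minimal sketch, with variable names chosen only for this example:

```python
euro_utf8 = '\u20ac'.encode('utf-8')

# Each item of a bytes object is an int in the range 0-255.
bits = ' '.join(format(b, '08b') for b in euro_utf8)
print(bits)  # 11100010 10000010 10101100
```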

Different encodings use different numbers of bits per character. UTF-8 is built from 8-bit units (one to four bytes per character), so a 3-byte result is 24 bits and you do not need a binary representation just to count them. (If you still want to display the bits, refer to Martijn's answer.)
\x means that the two hex digits that follow give the value of one byte. So x is not itself a hex digit that you should convert or read; it marks the value that follows, which is the part you are interested in. Each \ starts one escape sequence and is not part of the value, which is why there are three of them for three bytes.

How can I print unicode without using u'\uXXXX'

I'm trying to make a program to iterate through Japanese characters (Python 2.7) and return/yield them in a printable format, but I cannot convert the hexadecimal numbers (3040-309f) into a format that can print the characters. I have found that using u'\u' works, but when I attempt to convert the numbers into that format using unicode('\u3040'), the result is different from u'\u3040'. The code explains it better.
>>> s1 = u'\u309d'
>>> s2 = unicode("\u209d")
>>> print type(s1) == type(s2)
True
>>> print s1 == s2
False
>>> print s1, s2
ゝ \u209d
I have tried passing UTF-8 and latin-1 as the second argument for s2, but it does nothing. Also, I found that you can do u'\u{0}'.format(u'3040'), but I cannot make u'3040' in my iterator, and u'\u{0}'.format(unicode('3040')) raises an error.

In byte string literals, the \uhhhh escape sequence is not interpreted, so you get six literal characters instead.
Converting that to Unicode only decodes the string as ASCII data, not as a Python escape sequence.
You could decode from the unicode_escape encoding instead:
>>> "\u209d".decode('unicode_escape')
u'\u209d'
>>> print "\u209d".decode('unicode_escape')
₝
There are several downsides to this, however. Any other \ escape sequences also get decoded:
>>> '\\n'
'\\n'
>>> '\\n'.decode('unicode_escape')
u'\n'
so you may have to double the backslashes first if you want those literal backslashes to survive the decoding:
>>> '\\n'.replace('\\', '\\\\').decode('unicode_escape')
u'\\n'
But be very careful that you are not in fact trying to treat JSON data as Python string literals. JSON also uses the same escape sequence format but should instead be treated as JSON; decode with json.loads() instead:
>>> import json
>>> json.loads('"\u209d"')
u'\u209d'
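If the starting point is a numeric code point range rather than escaped text, no escape decoding is needed at all. The following is a minimal Python 3 sketch under that assumption; iter_chars is a hypothetical helper name, and on Python 2 unichr() would take the place of chr().

```python
def iter_chars(start, stop):
    """Yield one character per code point in range(start, stop)."""
    for codepoint in range(start, stop):
        yield chr(codepoint)  # unichr(codepoint) on Python 2

# Hiragana block from the question: U+3040 through U+309F.
hiragana = list(iter_chars(0x3040, 0x30A0))
print(len(hiragana))               # 96
print(hiragana[0x309d - 0x3040])   # ゝ

# A hex string such as '309d' can be converted with int(..., 16) first.
same_char = chr(int('309d', 16))
print(same_char == '\u309d')       # True
```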

how to compare backslash in python

I have a set of strings that are read from a file, say ['\x1\p1', '\x2\p2', '\x3\p3', ... etc.].
When I read them into variables and print them, the strings are displayed as ['\\x1\\p1', '\\x2\\p2', '\\x3\\p3', ... etc.]. I understand that the variable is represented as '\x1\p1', ... etc. internally, but it is displayed with doubled backslashes.
Now I want to search for and replace the elements of this list in a sentence, i.e. if \x1\p1 is in the sentence "How are you doing \x1\p1", then replace '\x1\p1' with 'Y'. But the replace method does not work in this case, and I wonder why.
Let me explain further:
My text file (codes.txt) has the entries \xs1\x32 and \xs2\x54, separated by newlines. I read it using
with open('codes') as codes:
    code_list = codes.readlines()
Next, let's say I do code_list_element_1 = code_list[1].rstrip()
When I print code_list_element_1, it displays as '\\xs1\\x32'.
Next, let my target string be target_string = 'Hi! my name is \xs1\x32'
Now I want to replace code_list_element_1, which is supposed to be \xs1\x32, in target_string with, say, 'Y'.
So I tried code_list_element_1 in target_string. I get False.
Next, instead of reading the codes from a text file, I initialized a variable find_me = '\xs1\x32'
Now I try find_me in target_string. I get True,
and hence target_string.replace(find_me, "Y") displays what I want: "Hi! my name is Y"

You are looking at a string representation that can be pasted back into Python; the backslashes are doubled to make sure the values are not interpreted as escape sequences (such as \n, meaning a newline, or \xfe, meaning the byte with value 254, hex FE).
If you are building new string values, you also need to use those doubled backslashes to prevent Python from seeing escape sequences where there are none, or use raw string literals:
>>> '\\x1\\p1'
'\\x1\\p1'
>>> r'\x1\p1'
'\\x1\\p1'
For this specific example, not handling the backslashes properly actually results in an exception:
>>> '\x1\p1'
ValueError: invalid \x escape
because Python expects to find two hex digits after a \x escape.
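A small Python 3 sketch of the situation in the question, under the assumption that the file holds literal backslashes: io.StringIO stands in for the real file here, and the target string is written as a raw literal so that it contains the same literal backslashes.

```python
import io

# Stand-in for the codes file: two lines, each with literal backslashes.
fake_file = io.StringIO("\\xs1\\x32\n\\xs2\\x54\n")
codes = [line.rstrip("\n") for line in fake_file]

# Raw literal: the backslashes are kept as ordinary characters.
target = r"Hi! my name is \xs1\x32"

print(codes[0] in target)           # True
result = target.replace(codes[0], "Y")
print(result)                       # Hi! my name is Y
```

Text read from a file is never run through escape processing, so the match succeeds once the target literal holds real backslashes as well.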

Raw strings (those prefixed by r) are very useful for backslash-itis.
In [9]: a=r"How are you doing \x1\p1"
In [10]: a
Out[10]: 'How are you doing \\x1\\p1'
In [11]: a.replace(r'\x1\p1', 'Y')
Out[11]: 'How are you doing Y'

Print confusion

I am new to Python. When I try to print "\20%", that is
>>> "\20%"
why is the shell printing '\x10%'? That is, it shows
'\x10%'
The same happens with join. When I do
>>> l = ['test','case']
>>> "\20%".join(l)
it shows
'test\x10%case'
I am using Python 2.7.3.

'\20' is an octal escape sequence, and the same as chr(2 * 8 + 0) == chr(16).
What the Python shell displays by default is not the output of print, but the representation of the given value, which is the hexadecimal '\x10'.
If you want the string \20%, you have to either escape the backslash ('\\20%') or use a raw string literal (r'\20%'). Both will be displayed as
>>> r'\20%'
'\\20%'

\20 is an escape sequence that refers to the DLE ASCII character whose decimal value is 16 (20 in octal, 10 in hexadecimal). Such a character is printed as the \x10 hex escape by the repr function of strings.
To specify a literal \20, either double the backslash ("\\20") or use a raw string (r"\20").
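The difference between the escape and the literal text can be checked directly. A short sketch that behaves the same on Python 2.7 and Python 3 (variable names are illustrative only):

```python
dle = '\20'        # octal escape: one character, code point 16
literal = r'\20'   # raw string: backslash, '2', '0'

print(dle == '\x10' == chr(16))   # True
print(len(dle), len(literal))     # 1 3
print(literal == '\\20')          # True
```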

To print "\20%":
What if you print it directly?
>>> print '\20%'
% # the control character before % does not display correctly on this page
And with the r prefix:
>>> print r'\20%'
\20%
>>> r'\20%' # what the r prefix does
'\\20%'
>>> print '\\20%'
\20%
Some time back I had the same doubt about strings and asked a question that you may find helpful.
