How to strip unicode in a list - python

I want to strip the unicode strings in a list, for example:
airports
[u'KATL',u'KCID']
expected output
[KATL,KCID]
I followed the link below
Strip all the elements of a string list
and tried one of the solutions:
my_list = ['this\n', 'is\n', 'a\n', 'list\n', 'of\n', 'words\n']
map(str.strip, my_list)
['this', 'is', 'a', 'list', 'of', 'words']
but got the following error:
TypeError: descriptor 'strip' requires a 'str' object but received a 'unicode'

First, I strongly suggest you switch to Python 3, which treats Unicode strings as first-class citizens (all strings are Unicode strings, but they are called str).
But if you have to make it work in Python 2, you can strip unicode strings with unicode.strip (if your strings are true Unicode strings):
>>> lst = [u'KATL\n', u'KCID\n']
>>> map(unicode.strip, lst)
[u'KATL', u'KCID']
If your unicode strings are limited to the ASCII subset, you can convert them to str with:
>>> lst = [u'KATL', u'KCID']
>>> map(str, lst)
['KATL', 'KCID']
Note that this conversion will fail for non-ASCII strings. To encode Unicode codepoints as a str (a string of bytes), you have to choose an encoding (usually UTF-8) and use the .encode() method on your strings:
>>> lst = [u'KATL', u'KCID']
>>> map(lambda x: x.encode('utf-8'), lst)
['KATL', 'KCID']
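For comparison, a minimal Python 3 sketch of the original task (no special handling is needed there, because every str is already a Unicode string):
>>> airports = ['KATL\n', 'KCID\n']
>>> [s.strip() for s in airports]
['KATL', 'KCID']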

The only reliable way to convert a unicode string to a byte string is to encode it with an appropriate encoding (ASCII, Latin-1 and UTF-8 are the most common ones). By definition, UTF-8 can encode any Unicode character, but the result may contain non-ASCII bytes, and its size in bytes will no longer equal the number of (Unicode) characters. Latin-1 can represent the characters of most Western European languages with one byte per character, and ASCII is the set of characters that every one of these encodings represents the same way.
If you need to process strings containing characters that are not representable in the chosen charset, you can use the errors='ignore' handler to simply drop them, or errors='replace' to substitute a replacement character, usually ?.
So if I have correctly understood your requirement, you could translate the list of unicode strings into a list of byte strings with:
[ x.encode('ascii', 'replace') for x in my_list ]
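As a quick sketch of the difference between the two error handlers (the second airport code is invented here purely to include a non-ASCII character; the errors argument is passed positionally, which is the safest form on Python 2):
>>> lst = [u'KATL', u'K\xc5SE']
>>> [x.encode('ascii', 'replace') for x in lst]
['KATL', 'K?SE']
>>> [x.encode('ascii', 'ignore') for x in lst]
['KATL', 'KSE']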

A listcomp seems the simplest solution:
[s.strip() for s in my_list]
If you're keen to use map, I'd use a lambda so that each object's own strip method gets called, rather than demanding the strip that's delivered by one particular type.
map(lambda s: s.strip(), my_list)
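Either spelling works whether the elements are str or unicode, which is exactly what the unbound str.strip in the question could not handle; a small sketch with a deliberately mixed list:
>>> mixed = [u'KATL\n', 'KCID\n']
>>> [s.strip() for s in mixed]
[u'KATL', 'KCID']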

Related

Problem with .decode('utf-8').upper() and special characters (but only inside the string)

I would like to capitalise letters at given positions in a string. I have a problem with special letters, Polish letters to be specific, for example "ą". Ideally the solution would also work for French, Spanish, etc. (ç, è and so on).
dobry="costąm"
print(dobry[4].decode('utf-8').upper())
I obtain:
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 0: unexpected end of data
while for this:
print("ą".decode('utf-8').upper())
I obtain Ą as desired.
What is more curious, for letters at positions 0-3 it works fine, while for:
print(dobry[5].decode('utf-8').upper())
I obtain the same problem
The string actually looks like this:
>>> list(dobry)
['c', 'o', 's', 't', '\xc4', '\x85', 'm']
So, dobry[5] == '\x85' because the letter ą is represented by two bytes. To solve this, simply use Python 3 instead of Python 2.
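For illustration, a minimal Python 3 sketch of the same operation (indexing works on characters there, so no decoding is needed):
>>> dobry = "costąm"
>>> dobry[:4] + dobry[4].upper() + dobry[5:]
'costĄm'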
UTF-8 may use more than one byte to encode a character, so iterating over a bytestring and manipulating individual bytes won't always work. It's better to decode to Python 2's unicode type. Perform your manipulations, then re-encode to UTF-8.
>>> dobry="costąm"
>>> udobry = unicode(dobry, 'utf-8')
>>> changed = udobry[:4] + udobry[4].upper() + udobry[5]
>>> new_dobry = changed.encode('utf-8')
>>> print new_dobry
costĄm
As @tripleee commented, non-ASCII characters may not map to a single Unicode codepoint: "ą" could be the single codepoint U+0105 LATIN SMALL LETTER A WITH OGONEK, or it could be composed of "a" followed by U+0328 COMBINING OGONEK.
In the decomposed string the "a" character can be capitalised on its own, and "A" followed by COMBINING OGONEK will still render as "Ą" (though it may look like two separate characters in the Python REPL or the terminal, depending on the terminal settings).
Note that you need to take the extra character into account when indexing.
It's also possible to normalise the composed string to the single codepoint (canonical) version using the tools in the unicodedata module:
>>> unicodedata.normalize('NFC', u'costa\u0328m') == u"costąm"
True
but this may cause problems if, for example, you are returning the changed string to a system that expects the combining character to be preserved.
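To make the combining-character point concrete, here is a short sketch: upper() leaves the combining ogonek in place, and NFC normalization then collapses the pair into the precomposed capital (U+0104):
>>> import unicodedata
>>> decomposed = u'a\u0328'
>>> decomposed.upper()
u'A\u0328'
>>> unicodedata.normalize('NFC', decomposed.upper()) == u'\u0104'
True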
What about this instead:
print(dobry.decode('utf-8')[5].upper())

Python: convert space (and other special chars like . , ) to their corresponding unicode representations

I am trying to convert an Arabic phrase into its corresponding unicode representation string and it works fine for the Arabic text.
>>> a = ' مساء الخير'
>>> a.strip().decode('utf-8').encode('unicode-escape')
'\\u0645\\u0633\\u0627\\u0621 \\u0627\\u0644\\u062e\\u064a\\u0631'
However, I also want the space character to be converted to its unicode representation ('\u0020'). I observe similar behaviour with other characters like '.', ',', etc. In the end I want to obtain the unicode values of each of the characters in my string as a list (simply splitting the current string on the delimiter '\u' gives me an incorrect split, because the space character gets combined with the previous unicode representation).
>>> a.strip().decode('utf-8').encode('unicode-escape').split('\\u')
['', '0645', '0633', '0627', '0621 ', '0627', '0644', '062e', '064a', '0631']
eg. I want [ ... '0621', '0020' ...] instead of the current [ ... '0621 ' ...]
It is fine to strip the first space if you do not need it, but if you want to keep the others, it is simpler to build a list of unicode characters from the string and process the characters individually:
[ '%04x' % (ord(i),) for i in a.strip().decode('utf8') ]
or, if you prefer to use format (which is now the recommended style):
[ '{0:04x}'.format(ord(i)) for i in a.strip().decode('utf8') ]
Both yield as expected:
['0645', '0633', '0627', '0621', '0020', '0627', '0644', '062e', '064a', '0631']
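Going back the other way is just as direct; a minimal sketch that rebuilds the unicode string from such a list of hex codepoints (Python 2, hence unichr):
>>> codes = ['0645', '0633', '0627', '0621', '0020', '0627', '0644', '062e', '064a', '0631']
>>> u''.join(unichr(int(c, 16)) for c in codes)
u'\u0645\u0633\u0627\u0621 \u0627\u0644\u062e\u064a\u0631'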
Any particular reason you don't create (scan/read) the string as a unicode string in the first place?
In [14]: a = u' مساء الخير'
In [15]: [hex(ord(i)) for i in a]
Out[15]:
['0x20',
'0x645',
'0x633',
'0x627',
'0x621',
'0x20',
'0x627',
'0x644',
'0x62e',
'0x64a',
'0x631']
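If the phrase is read from a file rather than typed as a literal, you can get a unicode object from the start by opening the file with an explicit encoding; a minimal Python 2 sketch using io.open ('arabic.txt' is a hypothetical file name):
import io
with io.open('arabic.txt', encoding='utf-8') as f:   # hypothetical input file
    a = f.read()                                     # already unicode, no .decode() needed
print [hex(ord(ch)) for ch in a.strip()]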

How can I slice a substring from a unicode string with Python?

I have a unicode string as a result: u'splunk>\xae\uf001'
How can I get the substring 'uf001' as a simple string in Python?
The characters uf001 are not actually present in the string, so you can't just slice them off. You can do
repr(s)[-6:-1]
or
'u' + hex(ord(s[-1]))[2:]
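On the example string, both expressions produce the plain text you asked for (Python 2 shown, where repr() escapes the non-ASCII characters):
>>> s = u'splunk>\xae\uf001'
>>> repr(s)[-6:-1]
'uf001'
>>> 'u' + hex(ord(s[-1]))[2:]
'uf001'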
Since you want the actual string (as seen from the comments), just get the last character with the [-1] index. Example:
>>> a = u'splunk>\xae\uf001'
>>> print(a)
splunk>®ï€
>>> a[-1]
'\uf001'
>>> print(a[-1])
ï€
If you want the unicode representation (\uf001), then take repr(a[-1]). Example:
>>> repr(a[-1])
"'\\uf001'"
\uf001 is a single unicode character (not several characters), so you can directly get that character as above.
You see \uf001 because you are looking at the result of repr() on the string; if you print it, or use it somewhere else (for files, etc.), it will be the correct \uf001 character.
u'' is how a Unicode string is represented in Python source code. The REPL uses this representation by default to display unicode objects:
>>> u'splunk>\xae\uf001'
u'splunk>\xae\uf001'
>>> print(u'splunk>\xae\uf001')
splunk>®
>>> print(u'splunk>\xae\uf001'[-1])

If your terminal is not configured to display Unicode or if you are on a narrow build (e.g., it is likely for Python 2 on Windows) then the result may be different.
Unicode string is an immutable sequence of Unicode codepoints in Python. len(u'\uf001') == 1: it does not contain uf001 (5 characters) in it. You could write it as u'' (it is necessary to declare the character encoding of your source file on Python 2 if you use non-ascii characters):
>>> u'\uf001' == u''
True
It is just a different way to represent exactly the same Unicode character (a single codepoint in this case).
Note: some user-perceived characters may span several Unicode codepoints e.g.:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'ё')
u'\u0435\u0308'
>>> print(unicodedata.normalize('NFKD', u'ё'))
ё
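If the real goal is simply to keep the \uf001 character intact outside the REPL, encode explicitly when writing it out; a minimal sketch ('out.txt' is a hypothetical file name):
import io
a = u'splunk>\xae\uf001'
with io.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(a)    # the file now holds the UTF-8 bytes for U+00AE and U+F001; nothing is lost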

How can I print unicode without using u'\uXXXX'

I'm trying to make a program that iterates through Japanese characters (Python 2.7) and returns/yields them in a printable format, but I cannot convert the hexadecimal numbers (3040-309f) into a format that can print the characters. I have found that using u'\u' works, but when I attempt to convert the numbers into that format using unicode('\u3040'), it is different from u'\u3040'. The code explains it better.
>>> s1 = u'\u309d'
>>> s2 = unicode("\u209d")
>>> print type(s1) == type(s2)
True
>>> print s1 == s2
False
>>> print s1, s2
ゝ \u209d
I have tried using UTF-8 and latin-1 for s2 as the second argument, but it does nothing. Also, I found that you can do u'\u{0}'.format(u'3040'), but I cannot make u'3040' in my iterator, and u'\u{0}'.format(unicode('3040')) raises an error.
In byte string literals, the \uhhhh escape sequence is not interpreted, so you get a literal 6 characters instead.
Converting that to Unicode only decodes the string as ASCII data, not as a Python escape sequence.
You could decode from the unicode_escape encoding instead:
>>> "\u209d".decode('unicode_escape')
u'\u209d'
>>> print "\u209d".decode('unicode_escape')
₝
There are several downsides to this, however. Any other \ escape sequences also get decoded:
>>> '\\n'
'\\n'
>>> '\\n'.decode('unicode_escape')
u'\n'
so you may have to replace backslashes with doubled backslashes first so that those literal backslashes are retained:
>>> '\\n'.replace('\\', '\\\\').decode('unicode_escape')
u'\\n'
But be very careful that you are not in fact trying to treat JSON data as Python string literals. JSON also uses the same escape sequence format but should instead be treated as JSON; decode with json.loads() instead:
>>> import json
>>> json.loads('"\u209d"')
u'\u209d'
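For the stated goal of iterating over the hiragana block, you do not need string escapes at all: build each character directly from its integer code point with unichr (Python 2 sketch; the printed output assumes a terminal that can display the characters):
>>> hiragana = [unichr(cp) for cp in range(0x3040, 0x30a0)]
>>> hiragana[1:4]
[u'\u3041', u'\u3042', u'\u3043']
>>> print u''.join(hiragana[1:4])
ぁあぃ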

Bytes in a unicode Python string

In Python 2, Unicode strings may contain both unicode and bytes:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I understand that this is absolutely not something one should write in his own code, but this is a string that I have to deal with.
The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).
My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).
Encoding it to UTF-8 yields
>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'
Which then decoded from UTF-8 gives the initial string with bytes in them, which is not good:
>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
I found a hacky way to solve the problem, however:
>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!
This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?
In Python 2, Unicode strings may contain both unicode and bytes:
No, they may not. They contain Unicode characters.
Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
There is no way to look at the string and tell that the \xd0 byte is supposed to be part of some UTF-8 encoded character, or if it actually stands for that Unicode character by itself.
However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 as they were.
(In response to the comments above): this code converts everything that looks like UTF-8 and leaves the other codepoints as they are:
import re

a = u'\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'

def convert(m):
    # m matches a run of codepoints < U+0100; reinterpret them as bytes
    # and try to decode those bytes as UTF-8
    try:
        return m.group(0).encode('latin1').decode('utf8')
    except UnicodeError:
        # not a valid UTF-8 sequence: keep the original characters
        return m.group(0)

a = re.sub(r'[\x80-\xFF]+', convert, a)
print a.encode('utf8')
Result:
Рус utf:ек bytes:blää
The problem is that your string is not actually encoded in a specific encoding. Your example string:
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
is mixing Python's internal representation of unicode strings with UTF-8 encoded text. If we just consider the 'special' characters:
>>> orig = u'\u0435\u043a'
>>> bytes = u'\xd0\xb5\xd0\xba'
>>> print orig
ек
>>> print bytes
ек
But you say, bytes is utf-8 encoded:
>>> print bytes.encode('utf-8')
ек
>>> print bytes.encode('utf-8').decode('utf-8')
ек
Wrong! But what about:
>>> bytes = '\xd0\xb5\xd0\xba'
>>> print bytes
ек
>>> print bytes.decode('utf-8')
ек
Hurrah.
So. What does this mean for me? It means you're (probably) solving the wrong problem. What you should be asking us/trying to figure out is why your strings are in this form to begin with and how to avoid it/fix it before you have them all mixed up.
You should convert unichrs to chrs, then decode them.
u'\xd0' == u'\u00d0' is True
$ python
>>> import re
>>> a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
>>> re.sub(r'[\000-\377]*', lambda m:''.join([chr(ord(i)) for i in m.group(0)]).decode('utf8'), a)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'
r'[\000-\377]*' will match the Unicode characters u'[\u0000-\u00ff]*'.
u'\xd0\xb5\xd0\xba' == u'\u00d0\u00b5\u00d0\u00ba'
You are using UTF-8 encoded bytes as Unicode code points (this is the problem).
I solve it by treating those mistaken Unicode characters as the corresponding bytes:
I find all of these mistaken characters, convert them to chars (bytes), then decode them.
If I'm wrong, please tell me.
You've already got an answer, but here's a way to unscramble UTF-8-like Unicode sequences that is less likely to decode latin-1 Unicode sequences in error. The re.sub function:
Matches Unicode characters < U+0100 that resemble valid UTF-8 sequences (ref: RFC 3629).
Encodes the Unicode sequence into its equivalent latin-1 byte sequence.
Decodes the sequence using UTF-8 back into Unicode.
Replaces the original UTF-8-like sequence with the matching Unicode character.
Note this could still match a Unicode sequence if just the right characters appear next to each other, but it is much less likely.
import re
# your example
a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
# printable Unicode characters < 256.
a += ''.join(chr(n) for n in range(32,256)).decode('latin1')
# a few UTF-8 characters decoded as latin1.
a += ''.join(unichr(n) for n in [2**7-1,2**7,2**11-1,2**11]).encode('utf8').decode('latin1')
# Some non-BMP characters
a += u'\U00010000\U0010FFFF'.encode('utf8').decode('latin1')
print repr(a)
# Unicode codepoint sequences that resemble UTF-8 sequences.
p = re.compile(ur'''(?x)
\xF0[\x90-\xBF][\x80-\xBF]{2} | # Valid 4-byte sequences
[\xF1-\xF3][\x80-\xBF]{3} |
\xF4[\x80-\x8F][\x80-\xBF]{2} |
\xE0[\xA0-\xBF][\x80-\xBF] | # Valid 3-byte sequences
[\xE1-\xEC][\x80-\xBF]{2} |
\xED[\x80-\x9F][\x80-\xBF] |
[\xEE-\xEF][\x80-\xBF]{2} |
[\xC2-\xDF][\x80-\xBF] # Valid 2-byte sequences
''')
def replace(m):
    return m.group(0).encode('latin1').decode('utf8')
print
print repr(p.sub(replace,a))
Output:
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\xc2\x80\xdf\xbf\xe0\xa0\x80\xf0\x90\x80\x80\xf4\x8f\xbf\xbf'
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff\x7f\x80\u07ff\u0800\U00010000\U0010ffff'
I solved it by
unicodeText.encode("utf-8").decode("unicode-escape").encode("latin1")
