python 2.7 split unicode text using unicode comma

python 2.7 split unicode text using unicode comma - python

I have a unicode string (种类：猫; means type:cat）which contains a unicode colon ': ' in Chinese. I would like to separate the string into 2 parts using:
a.split(u': ')
The length of the result list is always 1, so not spliced.
Can someone give me a hint on how to do this type of operation? Thanks!

That's a FULLWIDTH COLON, not an ASCII colon:
>>> s = u'种类：猫'
>>> import unicodedata as ud
>>> for c in s:
... print hex(ord(c)),ud.name(c)
...
0x79cd CJK UNIFIED IDEOGRAPH-79CD
0x7c7b CJK UNIFIED IDEOGRAPH-7C7B
0xff1a FULLWIDTH COLON
0x732b CJK UNIFIED IDEOGRAPH-732B
So you can split it a number of ways:
>>> s.split(u'\uff1a') # by Unicode codepoint
[u'\u79cd\u7c7b', u'\u732b']
>>> s.split(u'\N{FULLWIDTH COLON}') # by name
[u'\u79cd\u7c7b', u'\u732b']
>>> s.split(u'：') # Using the correct (single) character
[u'\u79cd\u7c7b', u'\u732b']
Since you are using Python 2.7, to see the output correctly you'll need to print the list items:
>>> for item in s.split(u'\uff1a'):
... print item
...
种类
猫

Related

Python prevent decoding HEX to ASCII while removing backslashes from my Var

I want to strip some unwanted symbols from my variable. In this case the symbols are backslashes. I am using a HEX number, and as an example I will show some short simple code down bellow. But I don't want python to convert my HEX to ASCII, how would I prevent this from happening.? I have some long shell codes for asm to work with later which are really long and removing \ by hand is a long process. I know there are different ways like using echo -e "x\x\x\x" > output etc, but my whole script will be written in python.
Thanks
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> b = a.strip("\\")
>>> print b
1�Phtv
>>> a = "\x31\x32\x33\x34\x35\x36"
>>> b = a.strip("\\")
>>> print b
123456
At the end I would like it to print my var:
>>> print b
x31x32x33x34x35x36

There are no backslashes in your variable:
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> print(a)
1ÀPhtv
Take newline for example: writing "\n" in Python will give you string with one character -- newline -- and no backslashes. See string literals docs for full syntax of these.
Now, if you really want to write string with such backslashes, you can do it with r modifier:
>>> a = r"\x31\xC0\x50\x68\x74\x76"
>>> print(a)
\x31\xC0\x50\x68\x74\x76
>>> print(a.replace('\\', ''))
x31xC0x50x68x74x76
But if you want to convert a regular string to hex-coded symbols, you can do it character by character, converting it to number ("\x31" == "1" --> 49), then to hex ("0x31"), and finally stripping the first character:
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> print(''.join([hex(ord(x))[1:] for x in a]))
'x31xc0x50x68x74x76'

There are two problems in your Code.
First the simple one:
strip() just removes one occurrence. So you should use replace("\\", ""). This will replace every backslash with "", which is the same as removing it.
The second problem is pythons behavior with backslashes:
To get your example working you need to append an 'r' in front of your string to indicate, that it is a raw string. a = r"\x31\xC0\x50\x68\x74\x76". In raw strings, a backlash doesn't escape a character but just stay a backslash.
>>> r"\x31\xC0\x50\x68\x74\x76"
'\\x31\\xC0\\x50\\x68\\x74\\x76'

Russian character decoding in python

This question only for python:
I have a city name in a string in Russian language and which is in Unicode form like,
\u041C\u043E\u0441\u043A\u0432\u0430
means
Москва
How to get original text instead of unicode characters?
Note: Do not use any import module

>>> a=u"\u041C\u043E\u0441\u043A\u0432\u0430"
>>> print a
Москва
Your string is a unicode string because each character/code point with \u is only usable from a unicode string, you should prefix the string with u. Otherwise is a regular string and each \u counts as a regular ascii character:
>>> len(a)
6
>>> b="\u041C\u043E\u0441\u043A\u0432\u0430"
>>> len(b)
36

In addition to vz0 answer : Pay attention to script's encoding.
This file will works great :
# coding: utf-8
s = u"\u041C\u043E\u0441\u043A\u0432\u0430"
print(s)
But this one will lead to an UnicodeEncodeError :
# coding: ASCII
s = u"\u041C\u043E\u0441\u043A\u0432\u0430"
print(s)

How to remove special characters from strings in python?

I have millions of strings scraped from web like:
s = 'WHAT\xe2\x80\x99S UP DOC?'
type(s) == str # returns True
Special characters like in the string above are inevitable when scraping from the web. How should one remove all such special characters to retain just clean text? I am thinking of regular expression like this based on my very limited experience with unicode characters:
\\x.*[0-9]

The special characters are not actually multiple characters long, that is just how they are represented so your regex isn't going to work. If you print you will see the actual unicode (utf-8) characters
>>> s = 'WHAT\xe2\x80\x99S UP DOC?'
>>> print(s)
WHATâS UP DOC?
>>> repr(s)
"'WHATâ\\x80\\x99S UP DOC?'"
If you want to print only the ascii characters you can check if the character is in string.printable
>>> import string
>>> ''.join(i for i in s if i in string.printable)
'WHATS UP DOC?'

This thing worked for me as mentioned by Padriac in comments:
s.decode('ascii', errors='ignore')

How can I slice a substring from a unicode string with Python?

I have a unicode string as a result : u'splunk>\xae\uf001'
How can I get the substring 'uf001'
as a simple string in python?

The characters uf001 are not actually present in the string, so you can't just slice them off. You can do
repr(s)[-6:-1]
or
'u' + hex(ord(s[-1]))[2:]

Since you want the actual string (as seen from comments) , just get the last character [-1] index , Example -
>>> a = u'splunk>\xae\uf001'
>>> print(a)
splunk>Â®ï€
>>> a[-1]
'\uf001'
>>> print(a[-1])
ï€
If you want the unicode representation (\uf001) , then take repr(a[-1]) , Example -
>>> repr(a[-1])
"'\\uf001'"
\uf001 is a single unicode character (not multiple strings) , so you can directly get that character as above.
You see \uf001 because you are checking the results of repr() on the string, if you print it, or use it somewhere else (like for files, etc) it will be the correct \uf001 character.

u'' it is how a Unicode string is represented in Python source code. REPL uses this representation by default to display unicode objects:
>>> u'splunk>\xae\uf001'
u'splunk>\xae\uf001'
>>> print(u'splunk>\xae\uf001')
splunk>®
>>> print(u'splunk>\xae\uf001'[-1])

If your terminal is not configured to display Unicode or if you are on a narrow build (e.g., it is likely for Python 2 on Windows) then the result may be different.
Unicode string is an immutable sequence of Unicode codepoints in Python. len(u'\uf001') == 1: it does not contain uf001 (5 characters) in it. You could write it as u'' (it is necessary to declare the character encoding of your source file on Python 2 if you use non-ascii characters):
>>> u'\uf001' == u''
True
It is just a different way to represent exactly the same Unicode character (a single codepoint in this case).
Note: some user-perceived characters may span several Unicode codepoints e.g.:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'ё')
u'\u0435\u0308'
>>> print(unicodedata.normalize('NFKD', u'ё'))
ё

Why python string cut returns 11 symbols when 12 is requested?

I use python 2.7 on OSX 10.9 and would like to cut unicode string (05. Чайка.mp3) by 12 symbols, so I use mp3file[:12] to cut it by 12 symbols. But in result I get the string like 05. Чайка.m, which is 11 symbols only. But len(mp3file[:12]) returns 12. Looks like the problem is with Russian symbol й.
What could be wrong here?
The main problem with this - I can not normally display strings with {:<12}'.format(mp3file[:12]).

You have unicode text with a combining character:
u'05. \u0427\u0430\u0438\u0306\u043a\u0430.m'
The U+0306 is a COMBINING BREVE codepoint, ̆, it combines with the preceding и CYRILLIC SMALL LETTER I to form:
>>> print u'\u0438'
и
>>> print u'\u0438\u0306'
й
You can normalize that to the combined form, U+0439 CYRILLIC SMALL LETTER SHORT I instead:
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u0438\u0306')
u'\u0439'
This uses the unicodedata.normalize() function to produce a composed normal form.

A user-perceived character (grapheme cluster) such as й may be constructed using several Unicode codepoints, each Unicode codepoints in turn may be encoded using several bytes depending on a character encoding.
Therefore number of characters that you see may be less the corresponding sizes of Unicode or byte strings that encode them and you can also truncate inside a Unicode character if you slice a bytestring or inside a user-perceived character if you slice a Unicode string even if it is in NFC Unicode normalization form. Obviously, it is not desirable.
To properly count characters, you could use \X regex that matches eXtended grapheme cluster (a language independent "visual character"):
import regex as re # $ pip install regex
characters = re.findall(u'\\X', u'05. \u0427\u0430\u0438\u0306\u043a\u0430.m')
print(characters)
# -> [u'0', u'5', u'.', u' ', u'\u0427', u'\u0430',
# u'\u0438\u0306', u'\u043a', u'\u0430', u'.', u'm']
Notice, that even without normalization: u'\u0438\u0306' is a separate character 'й'.
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u0646\u200D ') # 3 Unicode codepoints
u'\u0646\u200d ' # still 3 codepoints, NFC hasn't combined them
>>> import regex as re
>>> re.findall(u'\\X', u'\u0646\u200D ') # same 3 codepoints
[u'\u0646\u200d', u' '] # 2 grapheme clusters
See also, In Python, how do I most efficiently chunk a UTF-8 string for REST delivery?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

python 2.7 split unicode text using unicode comma - python

Related

Python prevent decoding HEX to ASCII while removing backslashes from my Var

Russian character decoding in python

How to remove special characters from strings in python?

How can I slice a substring from a unicode string with Python?

Why python string cut returns 11 symbols when 12 is requested?

Categories

Resources