I am using tf.strings.unicode_split to split text into characters.
When I use English characters it works correctly:
example_texts = ['hello world']
chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')
print(chars)
<tf.RaggedTensor [[b'h', b'e', b'l', b'l', b'o', b' ', b'w', b'o', b'r', b'l', b'd']]>
but if I change to non-ASCII UTF-8 characters it doesn't behave like the English example:
example_texts = ['سلام دنیا']
chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')
print(chars)
<tf.RaggedTensor [[b'\xd8\xb3', b'\xd9\x84', b'\xd8\xa7', b'\xd9\x85', b' ', b'\xd8\xaf',
b'\xd9\x86', b'\xdb\x8c', b'\xd8\xa7']]>
Thank you.
From Comments
Apparently, the characters are encoded to UTF-8. The same happens in the English example (the characters are byte strings – cf. the b prefix), and you don't seem to mind there. To view the Persian characters you can decode them: b'\xd8\xb3'.decode('utf8') == 'س', just like b'h'.decode('utf8') == 'h'. (paraphrased from lenz)
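A minimal sketch (assuming TF2 eager execution) showing that the split byte strings decode to the expected Persian letters, and that tf.strings.unicode_decode yields the integer codepoints directly:
import tensorflow as tf

example_texts = ['سلام دنیا']

# unicode_split returns the UTF-8 bytes of each character; decoding them
# shows the Persian characters are all there.
chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')
print([c.decode('utf-8') for c in chars[0].numpy()])
# ['س', 'ل', 'ا', 'م', ' ', 'د', 'ن', 'ی', 'ا']

# unicode_decode gives the integer codepoints instead of byte strings.
codepoints = tf.strings.unicode_decode(example_texts, input_encoding='UTF-8')
print([chr(cp) for cp in codepoints[0].numpy()])
# ['س', 'ل', 'ا', 'م', ' ', 'د', 'ن', 'ی', 'ا']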
Related
text = "hello world what is happening"
encodedText = text.encode('utf-16') #Encoding the input text
textReplaced = encodedText.replace('h'.encode('utf-16'), 'Q'.encode('utf-16')) #Doing the replacement of an encoded character by another encoded character
print('Input : ', text)
print('Expected Output : Qello world wQat is Qappening')
print('Actual Output : ', textReplaced.decode('utf-16'))
print('Encoded h : ', 'h'.encode('utf-16'))
print('Encoded Q : ', 'Q'.encode('utf-16'))
print('Encoded Actual Output : ', textReplaced)
Output:
Input : hello world what is happening
Expected Output : Qello world wQat is Qappening
Actual Output : Qello world what is happening
Encoded h : b'\xff\xfeh\x00'
Encoded Q : b'\xff\xfeQ\x00'
Encoded Actual Output : b'\xff\xfeQ\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00 \x00w\x00h\x00a\x00t\x00 \x00i\x00s\x00 \x00h\x00a\x00p\x00p\x00e\x00n\x00i\x00n\x00g\x00'
The problem with the code is that, since every encoded string or character carries a prefix, the replacement is only done on the first occurrence in the encoded input.
The problem is that the replacement bytes include the byte order mark (b'\xff\xfe'), which is only present at the beginning of the bytestring. If you are obliged to do the replacing in bytes rather than in str, you need to encode the replacement bytes without a BOM by using the UTF-16 encoding that matches the endianness of your system (or of the bytes, which might not be the same).
Assuming the endianness of the bytes is that of your system, this will work:
>>> import sys
>>> enc = 'utf-16-le' if sys.byteorder == 'little' else 'utf-16-be'
>>> textReplaced = encodedText.replace('h'.encode(enc), 'Q'.encode(enc))
>>> textReplaced.decode('utf-16')
'Qello world wQat is Qappening'
An even simpler, and more flexible, approach would be to use the bytes.translate method.
>>> trans_table = bytes.maketrans('h'.encode('utf-16'), 'Q'.encode('utf-16'))
>>> print(encodedText.translate(trans_table).decode('utf-16'))
Qello world wQat is Qappening
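And if the replacement does not have to happen at the byte level at all, decoding to str first sidesteps the BOM issue entirely; a minimal sketch reusing encodedText from above:
>>> text = encodedText.decode('utf-16')
>>> text.replace('h', 'Q')
'Qello world wQat is Qappening'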
I get some data from a webpage and read it like this in Python:
origional_doc = urllib2.urlopen(url).read()
Sometimes this URL has characters such as é and ä etc. How could I remove these characters from the string? Right now this is what I am trying:
import unicodedata
origional_doc = ''.join((c for c in unicodedata.normalize('NFD', origional_doc) if unicodedata.category(c) != 'Mn'))
But I get an error
TypeError: must be unicode, not str
This should work. It will eliminate all characters that are not ASCII.
original_doc = (original_doc.decode('unicode_escape').encode('ascii','ignore'))
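A quick check of that line in Python 2, using a made-up UTF-8 byte string since the real document isn't shown:
>>> original_doc = 'caf\xc3\xa9 and \xc3\xa4'
>>> original_doc.decode('unicode_escape').encode('ascii', 'ignore')
'caf and '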
Using re you can sub out all characters that are in a certain hexadecimal ASCII range:
>>> import re
>>> re.sub('[\x80-\xFF]','','é and ä and ect')
' and and ect'
You can also do the inverse and sub anything that's NOT in the basic 128 ASCII characters:
>>> re.sub('[^\x00-\x7F]','','é and ä and ect')
' and and ect'
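Alternatively, the unicodedata approach from the question works once the byte string is decoded to unicode first, which is what the TypeError was complaining about. A minimal sketch (Python 2), assuming the page is UTF-8 encoded; the sample raw string is just illustrative:
import unicodedata

raw = 'caf\xc3\xa9 and \xc3\xa4'        # byte string, as returned by urlopen().read()
text = raw.decode('utf-8')              # decode to unicode first (fixes the TypeError)
stripped = ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')
print(stripped)                         # cafe and a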
I have millions of strings scraped from the web, like:
s = 'WHAT\xe2\x80\x99S UP DOC?'
type(s) == str # returns True
Special characters like the ones in the string above are inevitable when scraping from the web. How should one remove all such special characters and retain just clean text? I am thinking of a regular expression like this, based on my very limited experience with unicode characters:
\\x.*[0-9]
The special characters are not actually multiple characters long; that is just how they are represented, so your regex isn't going to work. If you print the string you will see the actual Unicode (UTF-8) characters:
>>> s = 'WHAT\xe2\x80\x99S UP DOC?'
>>> print(s)
WHATâS UP DOC?
>>> repr(s)
"'WHATâ\\x80\\x99S UP DOC?'"
If you want to keep only the printable ASCII characters, you can check whether each character is in string.printable:
>>> import string
>>> ''.join(i for i in s if i in string.printable)
'WHATS UP DOC?'
This worked for me, as mentioned by Padriac in the comments:
s.decode('ascii', errors='ignore')
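A quick check of that approach against the example string (Python 2, passing the error handler positionally):
>>> s = 'WHAT\xe2\x80\x99S UP DOC?'
>>> s.decode('ascii', 'ignore')
u'WHATS UP DOC?'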
I use Python 2.7 on OS X 10.9 and would like to cut a unicode string (05. Чайка.mp3) to 12 symbols, so I use mp3file[:12]. But the result is a string like 05. Чайка.m, which is only 11 symbols, while len(mp3file[:12]) returns 12. It looks like the problem is with the Russian symbol й.
What could be wrong here?
The main problem for me is that I cannot properly display such strings with '{:<12}'.format(mp3file[:12]).
You have unicode text with a combining character:
u'05. \u0427\u0430\u0438\u0306\u043a\u0430.m'
U+0306 is a COMBINING BREVE codepoint, ̆; it combines with the preceding и (CYRILLIC SMALL LETTER I) to form:
>>> print u'\u0438'
и
>>> print u'\u0438\u0306'
й
You can normalize that to the combined form, U+0439 CYRILLIC SMALL LETTER SHORT I instead:
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u0438\u0306')
u'\u0439'
This uses the unicodedata.normalize() function to produce a composed normal form.
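Applied to the original problem, normalizing before slicing gives the expected 12 characters; a small sketch, where the decomposed literal below mirrors the filename from the question:
>>> import unicodedata
>>> mp3file = u'05. \u0427\u0430\u0438\u0306\u043a\u0430.mp3'
>>> nfc = unicodedata.normalize('NFC', mp3file)
>>> len(mp3file), len(nfc)
(14, 13)
>>> print u'{:<12}|'.format(nfc[:12])
05. Чайка.mp|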
A user-perceived character (grapheme cluster) such as й may be constructed from several Unicode codepoints, and each Unicode codepoint may in turn be encoded as several bytes, depending on the character encoding.
Therefore the number of characters that you see may be less than the sizes of the Unicode or byte strings that encode them, and you can truncate inside a Unicode character if you slice a bytestring, or inside a user-perceived character if you slice a Unicode string, even if it is in NFC Unicode normalization form. Obviously, that is not desirable.
To properly count characters, you could use the \X regex pattern, which matches an eXtended grapheme cluster (a language-independent "visual character"):
import regex as re # $ pip install regex
characters = re.findall(u'\\X', u'05. \u0427\u0430\u0438\u0306\u043a\u0430.m')
print(characters)
# -> [u'0', u'5', u'.', u' ', u'\u0427', u'\u0430',
# u'\u0438\u0306', u'\u043a', u'\u0430', u'.', u'm']
Notice that even without normalization, u'\u0438\u0306' is matched as a single character 'й'. There are also sequences that NFC cannot combine but \X still groups into one cluster:
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u0646\u200D ') # 3 Unicode codepoints
u'\u0646\u200d ' # still 3 codepoints, NFC hasn't combined them
>>> import regex as re
>>> re.findall(u'\\X', u'\u0646\u200D ') # same 3 codepoints
[u'\u0646\u200d', u' '] # 2 grapheme clusters
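To truncate by user-perceived characters, one could join the first 12 grapheme clusters back together; a minimal sketch:
>>> import regex as re
>>> clusters = re.findall(u'\\X', u'05. \u0427\u0430\u0438\u0306\u043a\u0430.mp3')
>>> print u''.join(clusters[:12])
05. Чайка.mp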
See also: In Python, how do I most efficiently chunk a UTF-8 string for REST delivery?
Here is my problem... I have a "normal" string like:
Hello World
And unlike all the other questions I have found, I WANT to print it as its Unicode codepoint escape value!
The output I am looking for is something like this:
\u0015\u0123
If anyone has an idea :)
You are encoding ASCII codepoints only. UTF-8 is a superset of ASCII; any ASCII codepoints are encoded to the same bytes that ASCII itself would use. What you are printing is correct: that is UTF-8.
Use some non-ASCII codepoints to see the difference:
>>> 'Hello world with an em-dash: \u2014\n'.encode('utf8')
b'Hello world with an em-dash: \xe2\x80\x94\n'
Python will just use the characters themselves when it shows you a bytes value with printable ASCII bytes in it. Any byte value that is not printable is shown as a \x.. escape code, or a single-character escape sequence if there is one (\n for newline).
From your example output, on the other hand, you seem to be expecting to output Python unicode literal escape codes:
>>> '\u0015\u0123'
'\x15ģ'
Since U+0123 is printable, Python 3 just shows it; the non-printable U+0015 (NEGATIVE ACKNOWLEDGE) is a codepoint in the 0x00-0xFF range and is shown using the shorter \x.. escape notation.
To show only unicode escape sequences for your text, you need to process it character by character:
>>> input_text = 'Hello World!'
>>> print(''.join('\\u{:04x}'.format(ord(c)) for c in input_text))
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064\u0021
>>> input_text = 'Hello world with an em-dash: \u2014\n'
>>> print(''.join('\\u{:04x}'.format(ord(c)) for c in input_text))
\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u006f\u0072\u006c\u0064\u0020\u0077\u0069\u0074\u0068\u0020\u0061\u006e\u0020\u0065\u006d\u002d\u0064\u0061\u0073\u0068\u003a\u0020\u2014\u000a
It is important to stress that this is not UTF-8, however.
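As a quick sanity check (Python 3), those escape sequences round-trip back through the unicode_escape codec, since the escaped string itself is pure ASCII:
>>> escaped = ''.join('\\u{:04x}'.format(ord(c)) for c in 'Hello \u2014')
>>> escaped.encode('ascii').decode('unicode_escape')
'Hello —'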
You can use ord() to turn the encoded bytes into numbers and use string formatting to display their hex values:
>>> s = u'Hello World \u0664\u0662'
>>> print s
Hello World ٤٢
>>> print ''.join('\\x%02X' % ord(c) for c in s.encode('utf-8'))
\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x20\xD9\xA4\xD9\xA2