Consider the following string of special characters:
x = "óیďÚÚ懇償燥績凡壇壁曇ÏエÀэүウーー」ÆØøæგბთლõшүжҮÿதணடஇஉுூொெௌДВБйЫСچخرسسبŞÛşکلںغখঙঝডইঊওোéñÑÜßẞÖÄäöÜĦĦ"
When printed in IPython:
In [11]: x = "óیďÚÚ懇償燥績凡壇壁曇ÏエÀэүウーー」ÆØøæგბთლõшүжҮÿதணடஇஉுூொெௌДВБйЫСچخرسسبŞÛşکلںغখঙঝডইঊওোéñÑÜßẞÖÄäöÜĦĦ"
In [12]: x
Out[12]: '\xc3\xb3\xdb\x8c\xc4\x8f\xc3\x9a\xc3\x9a\xe6\x87\x87\xe5\x84\x9f\xe7\x87\xa5\xe7\xb8\xbe\xe5\x87\xa1\xe5\xa3\x87\xe5\xa3\x81\xe6\x9b\x87\xc3\x8f\xe3\x82\xa8\xc3\x80\xd1\x8d\xd2\xaf\xe3\x82\xa6\xe3\x83\xbc\xe3\x83\xbc\xe3\x80\x8d\xc3\x86\xc3\x98\xc3\xb8\xc3\xa6\xe1\x83\x92\xe1\x83\x91\xe1\x83\x97\xe1\x83\x9a\xc3\xb5\xd1\x88\xd2\xaf\xd0\xb6\xd2\xae\xc3\xbf\xe0\xae\xa4\xe0\xae\xa3\xe0\xae\x9f\xe0\xae\x87\xe0\xae\x89\xe0\xaf\x81\xe0\xaf\x82\xe0\xaf\x8a\xe0\xaf\x86\xe0\xaf\x8c\xd0\x94\xd0\x92\xd0\x91\xd0\xb9\xd0\xab\xd0\xa1\xda\x86\xd8\xae\xd8\xb1\xd8\xb3\xd8\xb3\xd8\xa8\xc5\x9e\xc3\x9b\xc5\x9f\xda\xa9\xd9\x84\xda\xba\xd8\xba\xe0\xa6\x96\xe0\xa6\x99\xe0\xa6\x9d\xe0\xa6\xa1\xe0\xa6\x87\xe0\xa6\x8a\xe0\xa6\x93\xe0\xa7\x8b\xc3\xa9\xc3\xb1\xc3\x91\xc3\x9c\xc3\x9f\xe1\xba\x9e\xc3\x96\xc3\x84\xc3\xa4\xc3\xb6\xc3\x9c\xc4\xa6\xc4\xa6'
This string is passed to the code below from another service, as a list:
value_list = []
value_list.append(x)
The goal of the code below is to find all the special characters in the given string and return them as a list. This list is to be parsed as UTF-8 text.
In [33]: value_list
Out[33]: ['\xc3\xb3\xdb\x8c\xc4\x8f\xc3\x9a\xc3\x9a\xe6\x87\x87\xe5\x84\x9f\xe7\x87\xa5\xe7\xb8\xbe\xe5\x87\xa1\xe5\xa3\x87\xe5\xa3\x81\xe6\x9b\x87\xc3\x8f\xe3\x82\xa8\xc3\x80\xd1\x8d\xd2\xaf\xe3\x82\xa6\xe3\x83\xbc\xe3\x83\xbc\xe3\x80\x8d\xc3\x86\xc3\x98\xc3\xb8\xc3\xa6\xe1\x83\x92\xe1\x83\x91\xe1\x83\x97\xe1\x83\x9a\xc3\xb5\xd1\x88\xd2\xaf\xd0\xb6\xd2\xae\xc3\xbf\xe0\xae\xa4\xe0\xae\xa3\xe0\xae\x9f\xe0\xae\x87\xe0\xae\x89\xe0\xaf\x81\xe0\xaf\x82\xe0\xaf\x8a\xe0\xaf\x86\xe0\xaf\x8c\xd0\x94\xd0\x92\xd0\x91\xd0\xb9\xd0\xab\xd0\xa1\xda\x86\xd8\xae\xd8\xb1\xd8\xb3\xd8\xb3\xd8\xa8\xc5\x9e\xc3\x9b\xc5\x9f\xda\xa9\xd9\x84\xda\xba\xd8\xba\xe0\xa6\x96\xe0\xa6\x99\xe0\xa6\x9d\xe0\xa6\xa1\xe0\xa6\x87\xe0\xa6\x8a\xe0\xa6\x93\xe0\xa7\x8b\xc3\xa9\xc3\xb1\xc3\x91\xc3\x9c\xc3\x9f\xe1\xba\x9e\xc3\x96\xc3\x84\xc3\xa4\xc3\xb6\xc3\x9c\xc4\xa6\xc4\xa6']
In [34]: separator = re.compile('[.,;:!?&()]+', re.MULTILINE | re.UNICODE)
In [35]: value_list = [" ".join([word for word in separator.sub(' ', value).split()]).strip() for value in value_list]
In [36]: word_found = []
In [37]: for value in value_list:
   ....:     word_found.extend([i for i in value if 31 > ord(i) or ord(i) > 127])
   ....:
In [39]: word_found.pop().encode('utf-8')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-39-61e9eca29caa> in <module>()
----> 1 word_found.pop().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa6 in position 0: ordinal not in range(128)
It is clear that Python is reading x as a byte string (each \xNN in the output is one byte of the UTF-8 encoding). While iterating over the string we are actually iterating over the bytes rather than over the characters of the original text. Because of this, ord treats individual bytes as special characters and puts them in the list. When such a byte is then encoded to UTF-8, the out-of-range error appears because Python first tries to decode it with the ASCII codec, and the byte is only a fragment of a multi-byte UTF-8 character.
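A quick way to see the difference (Python 2, with x as above):
>>> x[0]                       # a single UTF-8 byte, not a character
'\xc3'
>>> x.decode('utf-8')[0]       # the first real character
u'\xf3'
>>> print x.decode('utf-8')[0]
ó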
I need to understand how I can iterate over this string character by character, without changing the way the value is passed into value_list or read from word_found.
Please help.
You need to decode the resulting string before iterating:
s = "".join(word_found) # Convert the list of characters into a string
print type(s) # <type 'string'>
u = s.decode('utf-8') # Decode it into utf-8
print type(u) # <type 'unicode'>
for c in u:
print c # Prints each unicode character
If you must have it in list format you can repack it into a list of unicode chars:
s = "".join(word_found)
u = s.decode('utf-8')
unichar_list = [c for c in u]
print unichar_list.pop() # Ħ
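Alternatively, you could decode each value up front and filter on code points, which avoids ever splitting a multi-byte character. A minimal sketch (Python 2, assuming every entry in value_list is UTF-8-encoded bytes):
word_found = []
for value in value_list:
    uvalue = value.decode('utf-8')    # bytes -> unicode, one element per character
    word_found.extend([c for c in uvalue if 31 > ord(c) or ord(c) > 127])

print word_found.pop().encode('utf-8')    # re-encodes a whole character, no error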
Related
I have this byte string:
b'google-site-verification=pFgmIQ6qK3YjcRAAhsKiPzmEiOVcynQslFMEba5lXvs'
I want to convert it to a raw string looking like this:
r'"google-site-verification=pFgmIQ6qK3YjcRAAhsKiPzmEiOVcynQslFMEba5lXvs"
(without the r' displaying on printing)
I tried it with replace but can't figure out how to get it to work.
Maybe someone can help me here.
Greetings
Edit: full code:
import struct

def decode_txt_rdata(rdata, rdlen):
    """decode TXT RR rdata into a string of quoted text strings,
    escaping any embedded double quotes"""
    txtstrings = []
    position = 0
    while position < rdlen:
        slen, = struct.unpack('B', rdata[position:position+1])
        s = rdata[position+1:position+1+slen]
        s = '"{}"'.format(s.replace(b'"', b'\\"').decode())
        txtstrings.append(s)
        position += 1 + slen
    return ' '.join(txtstrings)
Use the approach from the first answer of Convert Bytes to String.
Assumption
Desired is:
r"google-site-verification=pFgmIQ6qK3YjcRAAhsKiPzmEiOVcynQslFMEba5lXvs"
NOT (which is not valid):
r'"google-site-verification=pFgmIQ6qK3YjcRAAhsKiPzmEiOVcynQslFMEba5lXvs"
# Byte String
s = b'google-site-verification=pFgmIQ6qK3YjcRAAhsKiPzmEiOVcynQslFMEba5lXvs'
# String
decode_s = s.decode('utf-8')
print(f"Byte string\n\t{s}\n\n")
print(f"Decoded Byte string\n\t{decode_s}\n")
print(f'Is decoded == Desired \n{decode_s==r"google-site-verification=pFgmIQ6qK3YjcRAAhsKiPzmEiOVcynQslFMEba5lXvs"}' )
Output
Byte string
b'google-site-verification=pFgmIQ6qK3YjcRAAhsKiPzmEiOVcynQslFMEba5lXvs'
Decoded Byte string
google-site-verification=pFgmIQ6qK3YjcRAAhsKiPzmEiOVcynQslFMEba5lXvs
Is decoded == Desired
True
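If, as assumed above, the bare decoded text is what you want (no b prefix, no added quotes), the same decode() call can go inside the loop of your function. A minimal sketch of that variant:
import struct

def decode_txt_rdata(rdata, rdlen):
    """decode TXT RR rdata into plain decoded strings"""
    txtstrings = []
    position = 0
    while position < rdlen:
        slen, = struct.unpack('B', rdata[position:position+1])
        chunk = rdata[position+1:position+1+slen]
        txtstrings.append(chunk.decode('utf-8'))   # bytes -> str, no quoting added
        position += 1 + slen
    return ' '.join(txtstrings)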
Given a list of hexadecimal values that correspond to Unicode code points, how can I programmatically retrieve the Unicode characters?
E.g. Given the list:
>>> l = ['9359', '935A', '935B']
how to achieve this list:
>>> u = [u'\u9359', u'\u935A', u'\u935B']
>>> u
['鍙', '鍚', '鍛']
I've tried this but it throws a SyntaxError:
>>> u'\u' + l[0]
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
\uhhhh escapes are only valid in string literals; you can't use them to turn arbitrary hex values into characters. In other words, they are part of a larger syntax and can't be used stand-alone.
Decode the hex value to an integer and pass it to the chr() function (or, on Python 2, the unichr() function):
[chr(int(v, 16)) for v in l]
You could ask Python to interpret a string containing literal \uhhhh text as a Unicode string literal with the unicode_escape codec, but that feels like overkill for individual code points:
[(b'\\u' + v.encode('ascii')).decode('unicode_escape') for v in l]
Note the double backslash in the prefix added, and that we have to create byte strings for this to work at all.
Demo:
>>> l = ['9359', '935A', '935B']
>>> [chr(int(v, 16)) for v in l]
['鍙', '鍚', '鍛']
>>> [(b'\\u' + v.encode('ascii')).decode('unicode_escape') for v in l]
['鍙', '鍚', '鍛']
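Note that chr(int(v, 16)) also works for code points that need more than four hex digits, where a \uXXXX escape would no longer be enough (you would need \UXXXXXXXX). For example, on Python 3:
>>> chr(int('1F600', 16))
'😀'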
How do I check that a string only contains ASCII characters in Python? Something like Ruby's ascii_only?
I want to be able to tell whether specific string data read from a file is ASCII.
Python 3.7 added methods that do exactly what you want:
str, bytes, and bytearray gained support for the new isascii() method, which can be used to test if a string or bytes contain only the ASCII characters.
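For example (Python 3.7 or later):
>>> 'string'.isascii()
True
>>> 'строка'.isascii()
False
>>> b'bytes'.isascii()
True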
Otherwise:
>>> all(ord(char) < 128 for char in 'string')
True
>>> all(ord(char) < 128 for char in 'строка')
False
Another version:
>>> def is_ascii(text):
...     if isinstance(text, unicode):
...         try:
...             text.encode('ascii')
...         except UnicodeEncodeError:
...             return False
...     else:
...         try:
...             text.decode('ascii')
...         except UnicodeDecodeError:
...             return False
...     return True
...
>>> is_ascii('text')
True
>>> is_ascii(u'text')
True
>>> is_ascii(u'text-строка')
False
>>> is_ascii('text-строка')
False
>>> is_ascii(u'text-строка'.encode('utf-8'))
False
You can also use a regex to check for ASCII-only content; [\x00-\x7F] matches a single ASCII character:
>>> import re
>>> OnlyAscii = lambda s: re.match('^[\x00-\x7F]+$', s) is not None
>>> OnlyAscii('string')
True
>>> OnlyAscii('Tannhäuser')
False
If you have unicode strings you can use the "encode" function and then catch the exception:
try:
    mynewstring = mystring.encode('ascii')
except UnicodeEncodeError:
    print("there are non-ascii characters in there")
If you have bytes, you can import the chardet module and check the encoding:
import chardet
# Get the encoding
enc = chardet.detect(mystring)['encoding']
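For example, a small sketch (chardet is a third-party package and its detection is heuristic, so treat the result as a guess rather than a guarantee):
import chardet

data = u'Tannhäuser'.encode('utf-8')   # bytes whose encoding we want to guess
result = chardet.detect(data)
print(result['encoding'])              # heuristic guess, e.g. 'utf-8'
print(result['encoding'] == 'ascii')   # False here, since non-ASCII bytes are present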
A workaround to your problem would be to try and encode the string in a particular encoding.
For example:
'H€llø'.encode('utf-8')
On Python 2, where 'H€llø' is a byte string, calling encode() first implicitly decodes it with the ASCII codec, which throws the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
Now you can catch the UnicodeDecodeError to determine that the string did not contain only ASCII characters.
try:
    'H€llø'.encode('utf-8')
except UnicodeDecodeError:
    print 'This string contains more than just the ASCII characters.'
I am using a Scrapy spider and my own item pipeline:
value['Title'] = item['Title'][0] if ('Title' in item) else ''
value['Name'] = item['Name'][0] if ('CompanyName' in item) else ''
value['Description'] = item['Description'][0] if ('Description' in item) else ''
When I do this I am getting the value prefixed with u.
Example: when I pass the value to the output and print it:
value['Title'] = u'hospital'
What went wrong in my code? Why am I getting the u and how do I remove it?
Can anyone help me?
Thanks,
The u means that the string is a unicode string. You can remove the u by passing the string to str, e.g. str(u'test'), but you can treat it as a normal string for most purposes. For example:
>>> u'test' == 'test'
True
If you have characters that cannot be represented in plain ASCII you should keep the string as unicode; calling str on non-ASCII characters raises an exception.
>>> test=u'বাংলা'
>>> test
u'\u09ac\u09be\u0982\u09b2\u09be'
>>> str(test)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
The u is not part of the string, it is just a way to indicate the type of the string.
>>> type('test')
<type 'str'>
>>> type(u'test')
<type 'unicode'>
See the following question for more details:
What does the 'u' symbol mean in front of string values?
To remove the u sign you may encode the string as ASCII like this: value['Title'].encode("ascii").
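Note that encoding to ASCII only works when the value really is plain ASCII. If a title may contain non-ASCII characters, a safer sketch (Python 2, assuming whatever consumes the value accepts UTF-8 bytes; title and out are just illustrative names) is:
title = value['Title']              # e.g. u'hospital' or u'h\xf4pital'
try:
    out = title.encode('ascii')     # plain str, no u prefix, ASCII only
except UnicodeEncodeError:
    out = title.encode('utf-8')     # fall back to UTF-8-encoded bytes
print out                           # prints without the u prefix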
I'd like to decode the following string:
t\u028c\u02c8m\u0251\u0279o\u028a\u032f
It should be the IPA of 'tomorrow' as given in a JSON string from http://rhymebrain.com/talk?function=getWordInfo&word=tomorrow
My understanding is that it should be something like:
x = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f'
print x.decode()
I have tried the solutions from several related questions (and several others that more or less apply), and several permutations of their parts, but I can't get it to work.
Thank you
You need a u before your string (in Python 2.x, which you appear to be using) to indicate that this is a unicode string:
>>> x = u't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # note the u
>>> print x
tʌˈmɑɹoʊ̯
If you have already stored the string in a variable, you can use the following constructor to convert the string into unicode:
>>> s = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # your string has a unicode-escape encoding but is not unicode
>>> x = unicode(s, encoding='unicode-escape')
>>> print x
tʌˈmɑɹoʊ̯
>>> x
u't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # a unicode string
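Equivalently, on Python 2 you can call decode on the byte string directly; this is just another spelling of the same unicode-escape conversion:
>>> x = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f'.decode('unicode-escape')
>>> print x
tʌˈmɑɹoʊ̯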