How do I check that a string only contains ASCII characters in Python? Something like Ruby's ascii_only?
I want to be able to tell whether specific string data read from a file is ASCII.
Python 3.7 added methods which do what you want:
str, bytes, and bytearray gained support for the new isascii() method, which can be used to test whether a string or bytes object contains only ASCII characters.
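For example:
>>> 'string'.isascii()
True
>>> 'строка'.isascii()
False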
Otherwise:
>>> all(ord(char) < 128 for char in 'string')
True
>>> all(ord(char) < 128 for char in 'строка')
False
Another version:
>>> def is_ascii(text):
...     if isinstance(text, unicode):
...         try:
...             text.encode('ascii')
...         except UnicodeEncodeError:
...             return False
...     else:
...         try:
...             text.decode('ascii')
...         except UnicodeDecodeError:
...             return False
...     return True
...
>>> is_ascii('text')
True
>>> is_ascii(u'text')
True
>>> is_ascii(u'text-строка')
False
>>> is_ascii('text-строка')
False
>>> is_ascii(u'text-строка'.encode('utf-8'))
False
You can also use a regex to check for ASCII-only characters; [\x00-\x7F] matches a single ASCII character:
>>> import re
>>> OnlyAscii = lambda s: re.match('^[\x00-\x7F]+$', s) is not None
>>> OnlyAscii('string')
True
>>> OnlyAscii('Tannhäuser')
False
If you have unicode strings you can use the "encode" function and then catch the exception:
try:
    mynewstring = mystring.encode('ascii')
except UnicodeEncodeError:
    print("there are non-ascii characters in there")
If you have bytes, you can import the chardet module and check the encoding:
import chardet
# Get the encoding
enc = chardet.detect(mystring)['encoding']
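If you only need to know whether the data is plain ASCII, a minimal sketch along the same lines (assuming mystring is a bytes object) is to compare the detected encoding with 'ascii'; note that detection is heuristic, so this is a hint rather than a guarantee:
import chardet

def looks_ascii(data):
    # chardet usually reports pure-ASCII byte strings as 'ascii' (heuristic, not guaranteed)
    return chardet.detect(data)['encoding'] == 'ascii'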
A workaround to your problem would be to try to encode the string in a particular encoding.
For example:
'H€llø'.encode('utf-8')
Under Python 2 this throws the following error, because the byte string is implicitly decoded with the ASCII codec before it can be re-encoded:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
Now you can catch the UnicodeDecodeError to determine that the string did not contain only ASCII characters.
try:
    'H€llø'.encode('utf-8')
except UnicodeDecodeError:
    print 'This string contains more than just the ASCII characters.'
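Under Python 3, where every str is already a Unicode string, the equivalent check is to encode to 'ascii' and catch UnicodeEncodeError instead:
try:
    'H€llø'.encode('ascii')
except UnicodeEncodeError:
    print('This string contains more than just the ASCII characters.')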
Related
I have this string:
string = "{'id':'other_aud1_aud2','kW':15}"
And, simply put, I would like my string to turn into a hex string like this: '7b276964273a276f746865725f617564315f61756432272c276b57273a31357d'
I have been trying binascii.hexlify(string), but it keeps raising:
TypeError: a bytes-like object is required, not 'str'
It only needs to work with the following call: bytearray.fromhex(data['string_hex']).decode()
Here is the entire code:
string_data = "{'id':'"+self.id+"','kW':"+str(value)+"}"
print(string_data)
string_data_hex = hexlify(string_data)
get_json = bytearray.fromhex(data['string_hex']).decode()
Also, this is Python 3.6.
You can encode() the string:
from binascii import hexlify, unhexlify

string = "{'id':'other_aud1_aud2','kW':15}"
h = hexlify(string.encode())
print(h.decode())
# 7b276964273a276f746865725f617564315f61756432272c276b57273a31357d

s = unhexlify(h).decode()
print(s)
# {'id':'other_aud1_aud2','kW':15}
The tricky bit here is that a Python 3 string is a sequence of Unicode characters, which is not the same as a sequence of ASCII characters.
In Python2, the str type and the bytes type are synonyms, and there is a separate type, unicode, that represents a sequence of Unicode characters. This makes it something of a mystery, if you have a string: is it a sequence of bytes, or is it a sequence of characters in some character-set?
In Python3, str now means unicode and we use bytes for what used to be str. Given a string—a sequence of Unicode characters—we use encode to convert it to some byte-sequence that can represent it, if there is such a sequence:
>>> 'hello'.encode('ascii')
b'hello'
>>> 'sch\N{latin small letter o with diaeresis}n'
'schön'
>>> 'sch\N{latin small letter o with diaeresis}n'.encode('utf-8')
b'sch\xc3\xb6n'
but:
>>> 'sch\N{latin small letter o with diaeresis}n'.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 3: ordinal not in range(128)
Once you have the bytes object, you already know what to do. In Python2, if you have a str, you have a bytes object; in Python3, use .encode with your chosen encoding.
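Putting that together for your snippet, a minimal Python 3 sketch of the full round trip (using the literal string from the question in place of self.id and value) could look like this:
from binascii import hexlify

string_data = "{'id':'other_aud1_aud2','kW':15}"

# str -> bytes -> hex text
string_data_hex = hexlify(string_data.encode('utf-8')).decode('ascii')
print(string_data_hex)
# 7b276964273a276f746865725f617564315f61756432272c276b57273a31357d

# hex text -> bytes -> str, as in the question's bytearray.fromhex(...).decode()
get_json = bytearray.fromhex(string_data_hex).decode('utf-8')
print(get_json)
# {'id':'other_aud1_aud2','kW':15}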
Consider the following string of special characters:
x = "óیďÚÚ懇償燥績凡壇壁曇ÏエÀэүウーー」ÆØøæგბთლõшүжҮÿதணடஇஉுூொெௌДВБйЫСچخرسسبŞÛşکلںغখঙঝডইঊওোéñÑÜßẞÖÄäöÜĦĦ"
When displayed in IPython:
In [11]: x = "óیďÚÚ懇償燥績凡壇壁曇ÏエÀэүウーー」ÆØøæგბთლõшүжҮÿதணடஇஉுூொெௌДВБйЫСچخرسسبŞÛşکلںغখঙঝডইঊওোéñÑÜßẞÖÄäöÜĦĦ"
In [12]: x
Out[12]: '\xc3\xb3\xdb\x8c\xc4\x8f\xc3\x9a\xc3\x9a\xe6\x87\x87\xe5\x84\x9f\xe7\x87\xa5\xe7\xb8\xbe\xe5\x87\xa1\xe5\xa3\x87\xe5\xa3\x81\xe6\x9b\x87\xc3\x8f\xe3\x82\xa8\xc3\x80\xd1\x8d\xd2\xaf\xe3\x82\xa6\xe3\x83\xbc\xe3\x83\xbc\xe3\x80\x8d\xc3\x86\xc3\x98\xc3\xb8\xc3\xa6\xe1\x83\x92\xe1\x83\x91\xe1\x83\x97\xe1\x83\x9a\xc3\xb5\xd1\x88\xd2\xaf\xd0\xb6\xd2\xae\xc3\xbf\xe0\xae\xa4\xe0\xae\xa3\xe0\xae\x9f\xe0\xae\x87\xe0\xae\x89\xe0\xaf\x81\xe0\xaf\x82\xe0\xaf\x8a\xe0\xaf\x86\xe0\xaf\x8c\xd0\x94\xd0\x92\xd0\x91\xd0\xb9\xd0\xab\xd0\xa1\xda\x86\xd8\xae\xd8\xb1\xd8\xb3\xd8\xb3\xd8\xa8\xc5\x9e\xc3\x9b\xc5\x9f\xda\xa9\xd9\x84\xda\xba\xd8\xba\xe0\xa6\x96\xe0\xa6\x99\xe0\xa6\x9d\xe0\xa6\xa1\xe0\xa6\x87\xe0\xa6\x8a\xe0\xa6\x93\xe0\xa7\x8b\xc3\xa9\xc3\xb1\xc3\x91\xc3\x9c\xc3\x9f\xe1\xba\x9e\xc3\x96\xc3\x84\xc3\xa4\xc3\xb6\xc3\x9c\xc4\xa6\xc4\xa6'
This string is passed to the below code from another service as a list:
value_list = []
value_list.append(x)
The goal of the below code is to find all the special characters in the given string and return them as a list. This list is to be parsed as UTF-8 text.
In [33]: value_list
Out[33]: ['\xc3\xb3\xdb\x8c\xc4\x8f\xc3\x9a\xc3\x9a\xe6\x87\x87\xe5\x84\x9f\xe7\x87\xa5\xe7\xb8\xbe\xe5\x87\xa1\xe5\xa3\x87\xe5\xa3\x81\xe6\x9b\x87\xc3\x8f\xe3\x82\xa8\xc3\x80\xd1\x8d\xd2\xaf\xe3\x82\xa6\xe3\x83\xbc\xe3\x83\xbc\xe3\x80\x8d\xc3\x86\xc3\x98\xc3\xb8\xc3\xa6\xe1\x83\x92\xe1\x83\x91\xe1\x83\x97\xe1\x83\x9a\xc3\xb5\xd1\x88\xd2\xaf\xd0\xb6\xd2\xae\xc3\xbf\xe0\xae\xa4\xe0\xae\xa3\xe0\xae\x9f\xe0\xae\x87\xe0\xae\x89\xe0\xaf\x81\xe0\xaf\x82\xe0\xaf\x8a\xe0\xaf\x86\xe0\xaf\x8c\xd0\x94\xd0\x92\xd0\x91\xd0\xb9\xd0\xab\xd0\xa1\xda\x86\xd8\xae\xd8\xb1\xd8\xb3\xd8\xb3\xd8\xa8\xc5\x9e\xc3\x9b\xc5\x9f\xda\xa9\xd9\x84\xda\xba\xd8\xba\xe0\xa6\x96\xe0\xa6\x99\xe0\xa6\x9d\xe0\xa6\xa1\xe0\xa6\x87\xe0\xa6\x8a\xe0\xa6\x93\xe0\xa7\x8b\xc3\xa9\xc3\xb1\xc3\x91\xc3\x9c\xc3\x9f\xe1\xba\x9e\xc3\x96\xc3\x84\xc3\xa4\xc3\xb6\xc3\x9c\xc4\xa6\xc4\xa6']
In [34]: separator = re.compile('[.,;:!?&()]+', re.MULTILINE | re.UNICODE)
In [35]: value_list = [" ".join([word for word in separator.sub(' ', value).split()]).strip() for value in value_list]
In [36]: word_found = []
In [37]: for value in value_list:
word_found.extend([i for i in value if 31 > ord(i) or ord(i) > 127])
....:
In [39]: word_found.pop().encode('utf-8')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-39-61e9eca29caa> in <module>()
----> 1 word_found.pop().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa6 in position 0: ordinal not in range(128)
It is clear that Python is reading x as a byte string (each \x escape is one byte of the UTF-8 encoding). While iterating over the string, we are actually iterating over the bytes rather than the characters of the original text. Because of this, ord flags individual bytes as special characters and puts them in the list. The out-of-range error then appears because calling .encode('utf-8') on such a byte first decodes it implicitly with the ASCII codec, and that byte is only a fragment of an original multi-byte character.
I need to understand how I can iterate over this Python string without changing the way the value is passed into value_list or read from word_found.
Please help.
You need to decode the resulting string before iterating:
s = "".join(word_found) # Convert the list of characters into a string
print type(s) # <type 'string'>
u = s.decode('utf-8') # Decode it into utf-8
print type(u) # <type 'unicode'>
for c in u:
print c # Prints each unicode character
If you must have it in list format you can repack it into a list of unicode chars:
s = "".join(word_found)
u = s.decode('utf-8')
unichar_list = [c for c in u]
print unichar_list.pop() # Ħ
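For reference, the problem disappears under Python 3, where iterating a str yields characters rather than bytes, so the original filter works without any decode step (a small sketch with a short stand-in string rather than the full test string above):
# Python 3: x is already a sequence of Unicode characters
x = "abcÄĦ"  # short stand-in for the long test string
word_found = [c for c in x if ord(c) < 31 or ord(c) > 127]
print(word_found.pop())  # Ħ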
I am using a Scrapy spider and my own item pipeline:
value['Title'] = item['Title'][0] if ('Title' in item) else ''
value['Name'] = item['Name'][0] if ('CompanyName' in item) else ''
value['Description'] = item['Description'][0] if ('Description' in item) else ''
When I do this, I am getting the value prefixed with u.
Example: when I pass the value to the output and print it:
value['Title'] = u'hospital'
What went wrong in my code, why am I getting the u, and how do I remove it?
Can anyone help me?
Thanks.
The u means that the string is a unicode string. You can remove the u by passing the string to str: str(u'test'). But you can treat it as a normal string for most purposes. For example:
>>> u'test' == 'test'
True
If you have characters that cannot be represented in plain ASCII, you should keep the string as unicode. If you call str on non-ASCII characters, you will get an exception:
>>> test=u'বাংলা'
>>> test
u'\u09ac\u09be\u0982\u09b2\u09be'
>>> str(test)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
The u is not part of the string, it is just a way to indicate the type of the string.
>>> type('test')
<type 'str'>
>>> type(u'test')
<type 'unicode'>
See the following question for more details:
What does the 'u' symbol mean in front of string values?
To remove the u sign you may encode the string as ASCII like this: value['Title'].encode("ascii").
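For example:
>>> u'hospital'.encode('ascii')
'hospital'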
I could not match a unicode string in Python 2.7.
Expected result: 749130
>>> print match("\d+", u'\ufeff749130'.encode('utf-8'))
None
>>> print match("\d+", u'\ufeff749130')
None
>>> print match("\d+", u'\ufeff749130'.decode('utf-8'))
Traceback (most recent call last):
....
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
There is no need to use decode on a unicode string. As stated in the comments, you may want to use search, because match only matches at the beginning of the target string, and this string starts with the BOM character \ufeff rather than a digit.
>>> print search("\d+", u'\ufeff749130').group()
749130
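Alternatively, you can strip the BOM character first, after which match works as well (a small sketch, assuming match was imported with from re import match as in your examples):
>>> print match("\d+", u'\ufeff749130'.lstrip(u'\ufeff')).group()
749130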