Python regex with unicode strings

Python regex with unicode strings - python

Could not match unicode string in python 2.7.
expected result 749130
>>> print match("\d+", u'\ufeff749130'.encode('utf-8'))
None
>>> print match("\d+", u'\ufeff749130')
None
>>> print match("\d+", u'\ufeff749130'.decode('utf-8'))
Traceback (most recent call last):
....
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)

No need to use str.decode on a unicode string. As stated in the comments, you may want to use search because match only matches from the beginning of the target string.
>>> print search("\d+", u'\ufeff749130').group()
749130

Related

convert string into hex in python [duplicate]

I have this string:
string = '{'id':'other_aud1_aud2','kW':15}'
And, simply put, I would like my string to turn into an hex string like this:'7b276964273a276f746865725f617564315f61756432272c276b57273a31357d'
Have been trying binascii.hexlify(string), but it keeps returning:
TypeError: a bytes-like object is required, not 'str'
Also it's only to make it work with the following method:bytearray.fromhex(data['string_hex']).decode()
For the entire code here it is:
string_data = "{'id':'"+self.id+"','kW':"+str(value)+"}"
print(string_data)
string_data_hex = hexlify(string_data)
get_json = bytearray.fromhex(data['string_hex']).decode()
Also this is python 3.6

You can encode()the string:
string = "{'id':'other_aud1_aud2','kW':15}"
h = hexlify(string.encode())
print(h.decode())
# 7b276964273a276f746865725f617564315f61756432272c276b57273a31357d
s = unhexlify(hex).decode()
print(s)
# {'id':'other_aud1_aud2','kW':15}

The tricky bit here is that a Python 3 string is a sequence of Unicode characters, which is not the same as a sequence of ASCII characters.
In Python2, the str type and the bytes type are synonyms, and there is a separate type, unicode, that represents a sequence of Unicode characters. This makes it something of a mystery, if you have a string: is it a sequence of bytes, or is it a sequence of characters in some character-set?
In Python3, str now means unicode and we use bytes for what used to be str. Given a string—a sequence of Unicode characters—we use encode to convert it to some byte-sequence that can represent it, if there is such a sequence:
>>> 'hello'.encode('ascii')
b'hello'
>>> 'sch\N{latin small letter o with diaeresis}n'
'schön'
>>> 'sch\N{latin small letter o with diaeresis}n'.encode('utf-8')
b'sch\xc3\xb6n'
but:
>>> 'sch\N{latin small letter o with diaeresis}n'.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 3: ordinal not in range(128)
Once you have the bytes object, you already know what to do. In Python2, if you have a str, you have a bytes object; in Python3, use .encode with your chosen encoding.

How to convert a full ascii string to hex in python?

I have this string:
string = '{'id':'other_aud1_aud2','kW':15}'
And, simply put, I would like my string to turn into an hex string like this:'7b276964273a276f746865725f617564315f61756432272c276b57273a31357d'
Have been trying binascii.hexlify(string), but it keeps returning:
TypeError: a bytes-like object is required, not 'str'
Also it's only to make it work with the following method:bytearray.fromhex(data['string_hex']).decode()
For the entire code here it is:
string_data = "{'id':'"+self.id+"','kW':"+str(value)+"}"
print(string_data)
string_data_hex = hexlify(string_data)
get_json = bytearray.fromhex(data['string_hex']).decode()
Also this is python 3.6

You can encode()the string:
string = "{'id':'other_aud1_aud2','kW':15}"
h = hexlify(string.encode())
print(h.decode())
# 7b276964273a276f746865725f617564315f61756432272c276b57273a31357d
s = unhexlify(hex).decode()
print(s)
# {'id':'other_aud1_aud2','kW':15}

The tricky bit here is that a Python 3 string is a sequence of Unicode characters, which is not the same as a sequence of ASCII characters.
In Python2, the str type and the bytes type are synonyms, and there is a separate type, unicode, that represents a sequence of Unicode characters. This makes it something of a mystery, if you have a string: is it a sequence of bytes, or is it a sequence of characters in some character-set?
In Python3, str now means unicode and we use bytes for what used to be str. Given a string—a sequence of Unicode characters—we use encode to convert it to some byte-sequence that can represent it, if there is such a sequence:
>>> 'hello'.encode('ascii')
b'hello'
>>> 'sch\N{latin small letter o with diaeresis}n'
'schön'
>>> 'sch\N{latin small letter o with diaeresis}n'.encode('utf-8')
b'sch\xc3\xb6n'
but:
>>> 'sch\N{latin small letter o with diaeresis}n'.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 3: ordinal not in range(128)
Once you have the bytes object, you already know what to do. In Python2, if you have a str, you have a bytes object; in Python3, use .encode with your chosen encoding.

Check that a string contains only ASCII characters?

How do I check that a string only contains ASCII characters in Python? Something like Ruby's ascii_only?
I want to be able to tell whether string specific data read from file is in ascii

In Python 3.7 were added methods which do what you want:
str, bytes, and bytearray gained support for the new isascii() method, which can be used to test if a string or bytes contain only the ASCII characters.
Otherwise:
>>> all(ord(char) < 128 for char in 'string')
True
>>> all(ord(char) < 128 for char in 'строка')
False
Another version:
>>> def is_ascii(text):
if isinstance(text, unicode):
try:
text.encode('ascii')
except UnicodeEncodeError:
return False
else:
try:
text.decode('ascii')
except UnicodeDecodeError:
return False
return True
...
>>> is_ascii('text')
True
>>> is_ascii(u'text')
True
>>> is_ascii(u'text-строка')
False
>>> is_ascii('text-строка')
False
>>> is_ascii(u'text-строка'.encode('utf-8'))
False

You can also opt for regex to check for only ascii characters. [\x00-\x7F] can match a single ascii character:
>>> OnlyAscii = lambda s: re.match('^[\x00-\x7F]+$', s) != None
>>> OnlyAscii('string')
True
>>> OnlyAscii('Tannh‰user')
False

If you have unicode strings you can use the "encode" function and then catch the exception:
try:
mynewstring = mystring.encode('ascii')
except UnicodeEncodeError:
print("there are non-ascii characters in there")
If you have bytes, you can import the chardet module and check the encoding:
import chardet
# Get the encoding
enc = chardet.detect(mystring)['encoding']

A workaround to your problem would be to try and encode the string in a particular encoding.
For example:
'H€llø'.encode('utf-8')
This will throw the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
Now you can catch the "UnicodeDecodeError" to determine that the string did not contain just the ASCII characters.
try:
'H€llø'.encode('utf-8')
except UnicodeDecodeError:
print 'This string contains more than just the ASCII characters.'

Scrapy item pipeline

I am using scrappy spider and my own item pipeline
value['Title'] = item['Title'][0] if ('Title' in item) else ''
value['Name'] = item['Name'][0] if ('CompanyName' in item) else ''
value['Description'] = item['Description'][0] if ('Description' in item) else ''
When i do this i am getting the value prefixed with u
Example : When i pass the value to o/p and print it
value['Title'] = u'hospital'
What went wrong in my code and why i am getting u and how to remove it
Can anyone help me ?
Thanks,

The u means that the string is represented as unicode. You can remove the u by passing the string to str. str(u'test'). But you can treat is as normal string for most purposes. For example
>>> u'test' == 'test'
True
If you have characters that cannot be represented with plain ascii you should keep the unicode way. If you call str on non ascii characters you will get an exception.
>>> test=u'বাংলা'
>>> test
u'\u09ac\u09be\u0982\u09b2\u09be'
>>> str(test)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
The u is not part of the string, it is just a way to indicate the type of the string.
>>> type('test')
<type 'str'>
>>> type(u'test')
<type 'unicode'>
Se the following question for more details:
What does the 'u' symbol mean in front of string values?

To remove the u sign you may encode the string as ASCII like this: value['Title'].encode("ascii").

How to do string formatting with unicode emdash?

I am trying do string formatting with a unicode variable. For example:
>>> x = u"Some text—with an emdash."
>>> x
u'Some text\u2014with an emdash.'
>>> print(x)
Some text—with an emdash.
>>> s = "{}".format(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 9: ordinal not in range(128)
>>> t = "%s" %x
>>> t
u'Some text\u2014with an emdash.'
>>> print(t)
Some text—with an emdash.
You can see that I have a unicode string and that it prints just fine. The trouble is when I use Python's new (and improved?) format() function. If I use the old style (using %s) everything works out fine, but when I use {} and the format() function, it fails.
Any ideas of why this is happening? I am using Python 2.7.2.

The new format() is not as forgiving when you mix ASCII and unicode strings ... so try this:
s = u"{}".format(x)

The same way.
>>> s = u"{0}".format(x)
>>> s
u'Some text\u2014with an emdash.'

Using the following worked well for me. It is a variant on the other answers.
>>> emDash = u'\u2014'
>>> "a{0}b".format(emDash)
'a—b'

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python regex with unicode strings - python

No need to use str.decode on a unicode string. As stated in the comments, you may want to use search because match only matches from the beginning of the target string. >>> print search("\d+", u'\ufeff749130').group() 749130

Related

convert string into hex in python [duplicate]

How to convert a full ascii string to hex in python?

Check that a string contains only ASCII characters?

Scrapy item pipeline

How to do string formatting with unicode emdash?

Categories

Resources