Scrapy item pipeline - python

I am using a Scrapy spider with my own item pipeline:
value['Title'] = item['Title'][0] if ('Title' in item) else ''
value['Name'] = item['Name'][0] if ('CompanyName' in item) else ''
value['Description'] = item['Description'][0] if ('Description' in item) else ''
When I do this, the values come out prefixed with u.
For example, when I pass the value to the output and print it:
value['Title'] = u'hospital'
What is wrong in my code, why am I getting the u, and how do I remove it?
Can anyone help me?
Thanks,

The u means that the string is a unicode string. You can remove the u by passing the string to str, for example str(u'test'), but you can treat it as a normal string for most purposes. For example:
>>> u'test' == 'test'
True
If you have characters that cannot be represented in plain ASCII, you should keep the unicode string. Calling str on non-ASCII characters raises an exception:
>>> test=u'বাংলা'
>>> test
u'\u09ac\u09be\u0982\u09b2\u09be'
>>> str(test)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
The u is not part of the string; it only indicates the type of the string.
>>> type('test')
<type 'str'>
>>> type(u'test')
<type 'unicode'>
See the following question for more details:
What does the 'u' symbol mean in front of string values?

To remove the u you can encode the string as ASCII, like this: value['Title'].encode("ascii"). Note that this raises UnicodeEncodeError if the value contains non-ASCII characters.
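For the pipeline code in the question, a minimal sketch of the field extraction follows. It assumes Python 3, where every str is already Unicode and no u prefix is displayed. The helper name first_or_empty is hypothetical, and a plain dict stands in for the scraped item. Unlike the 'Name'/'CompanyName' line in the question, it checks the same key that it reads.

```python
def first_or_empty(item, key):
    """Return the first extracted value for key, or '' if the key is missing or empty."""
    values = item.get(key)
    return values[0] if values else ''


# A plain dict stands in for a scraped item here (assumption for illustration).
item = {'Title': ['hospital'], 'Description': []}

value = {key: first_or_empty(item, key) for key in ('Title', 'Name', 'Description')}
print(value)
```

Checking for an empty list as well as a missing key avoids an IndexError when a selector matched nothing.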

Related

convert string into hex in python [duplicate]

I have this string:
string = "{'id':'other_aud1_aud2','kW':15}"
Simply put, I would like to turn my string into a hex string like this: '7b276964273a276f746865725f617564315f61756432272c276b57273a31357d'
I have been trying binascii.hexlify(string), but it keeps returning:
TypeError: a bytes-like object is required, not 'str'
I only need this so that it works with the following call: bytearray.fromhex(data['string_hex']).decode()
Here is the entire code:
string_data = "{'id':'"+self.id+"','kW':"+str(value)+"}"
print(string_data)
string_data_hex = hexlify(string_data)
get_json = bytearray.fromhex(data['string_hex']).decode()
Also, this is Python 3.6.
You can encode() the string:
from binascii import hexlify, unhexlify
string = "{'id':'other_aud1_aud2','kW':15}"
h = hexlify(string.encode())  # str -> bytes -> hex bytes
print(h.decode())
# 7b276964273a276f746865725f617564315f61756432272c276b57273a31357d
s = unhexlify(h).decode()  # hex bytes -> bytes -> str
print(s)
# {'id':'other_aud1_aud2','kW':15}
The tricky bit here is that a Python 3 string is a sequence of Unicode characters, which is not the same as a sequence of ASCII characters.
In Python2, the str type and the bytes type are synonyms, and there is a separate type, unicode, that represents a sequence of Unicode characters. This makes it something of a mystery, if you have a string: is it a sequence of bytes, or is it a sequence of characters in some character-set?
In Python3, str now means unicode and we use bytes for what used to be str. Given a string—a sequence of Unicode characters—we use encode to convert it to some byte-sequence that can represent it, if there is such a sequence:
>>> 'hello'.encode('ascii')
b'hello'
>>> 'sch\N{latin small letter o with diaeresis}n'
'schön'
>>> 'sch\N{latin small letter o with diaeresis}n'.encode('utf-8')
b'sch\xc3\xb6n'
but:
>>> 'sch\N{latin small letter o with diaeresis}n'.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 3: ordinal not in range(128)
Once you have the bytes object, you already know what to do. In Python2, if you have a str, you have a bytes object; in Python3, use .encode with your chosen encoding.
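On Python 3.5 and later, the same round trip can also be written without binascii, using bytes.hex() and bytes.fromhex(). A minimal sketch, assuming UTF-8 as the encoding (the variable names are illustrative only):

```python
text = "{'id':'other_aud1_aud2','kW':15}"

# str -> bytes -> hex string
hex_text = text.encode('utf-8').hex()
print(hex_text)

# hex string -> bytes -> str
restored = bytes.fromhex(hex_text).decode('utf-8')
print(restored)
```

bytes.hex() returns a str directly, so no extra decode step is needed before storing or sending the hex value.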

How to iterate over special characters in a Python string

Consider the following string of special characters:
x = "óیďÚÚ懇償燥績凡壇壁曇ÏエÀэүウーー」ÆØøæგბთლõшүжҮÿதணடஇஉுூொெௌДВБйЫСچخرسسبŞÛşکلںغখঙঝডইঊওোéñÑÜßẞÖÄäöÜĦĦ"
when printed in ipython:
In [11]: x = "óیďÚÚ懇償燥績凡壇壁曇ÏエÀэүウーー」ÆØøæგბთლõшүжҮÿதணடஇஉுூொெௌДВБйЫСچخرسسبŞÛşکلںغখঙঝডইঊওোéñÑÜßẞÖÄäöÜĦĦ"
In [12]: x
Out[12]: '\xc3\xb3\xdb\x8c\xc4\x8f\xc3\x9a\xc3\x9a\xe6\x87\x87\xe5\x84\x9f\xe7\x87\xa5\xe7\xb8\xbe\xe5\x87\xa1\xe5\xa3\x87\xe5\xa3\x81\xe6\x9b\x87\xc3\x8f\xe3\x82\xa8\xc3\x80\xd1\x8d\xd2\xaf\xe3\x82\xa6\xe3\x83\xbc\xe3\x83\xbc\xe3\x80\x8d\xc3\x86\xc3\x98\xc3\xb8\xc3\xa6\xe1\x83\x92\xe1\x83\x91\xe1\x83\x97\xe1\x83\x9a\xc3\xb5\xd1\x88\xd2\xaf\xd0\xb6\xd2\xae\xc3\xbf\xe0\xae\xa4\xe0\xae\xa3\xe0\xae\x9f\xe0\xae\x87\xe0\xae\x89\xe0\xaf\x81\xe0\xaf\x82\xe0\xaf\x8a\xe0\xaf\x86\xe0\xaf\x8c\xd0\x94\xd0\x92\xd0\x91\xd0\xb9\xd0\xab\xd0\xa1\xda\x86\xd8\xae\xd8\xb1\xd8\xb3\xd8\xb3\xd8\xa8\xc5\x9e\xc3\x9b\xc5\x9f\xda\xa9\xd9\x84\xda\xba\xd8\xba\xe0\xa6\x96\xe0\xa6\x99\xe0\xa6\x9d\xe0\xa6\xa1\xe0\xa6\x87\xe0\xa6\x8a\xe0\xa6\x93\xe0\xa7\x8b\xc3\xa9\xc3\xb1\xc3\x91\xc3\x9c\xc3\x9f\xe1\xba\x9e\xc3\x96\xc3\x84\xc3\xa4\xc3\xb6\xc3\x9c\xc4\xa6\xc4\xa6'
This string is passed to the code below from another service as a list:
value_list = []
value_list.append(x)
The goal of the code below is to find all the special characters in the given string and return them as a list. This list is then to be parsed as UTF-8 text.
In [33]: value_list
Out[33]: ['\xc3\xb3\xdb\x8c\xc4\x8f\xc3\x9a\xc3\x9a\xe6\x87\x87\xe5\x84\x9f\xe7\x87\xa5\xe7\xb8\xbe\xe5\x87\xa1\xe5\xa3\x87\xe5\xa3\x81\xe6\x9b\x87\xc3\x8f\xe3\x82\xa8\xc3\x80\xd1\x8d\xd2\xaf\xe3\x82\xa6\xe3\x83\xbc\xe3\x83\xbc\xe3\x80\x8d\xc3\x86\xc3\x98\xc3\xb8\xc3\xa6\xe1\x83\x92\xe1\x83\x91\xe1\x83\x97\xe1\x83\x9a\xc3\xb5\xd1\x88\xd2\xaf\xd0\xb6\xd2\xae\xc3\xbf\xe0\xae\xa4\xe0\xae\xa3\xe0\xae\x9f\xe0\xae\x87\xe0\xae\x89\xe0\xaf\x81\xe0\xaf\x82\xe0\xaf\x8a\xe0\xaf\x86\xe0\xaf\x8c\xd0\x94\xd0\x92\xd0\x91\xd0\xb9\xd0\xab\xd0\xa1\xda\x86\xd8\xae\xd8\xb1\xd8\xb3\xd8\xb3\xd8\xa8\xc5\x9e\xc3\x9b\xc5\x9f\xda\xa9\xd9\x84\xda\xba\xd8\xba\xe0\xa6\x96\xe0\xa6\x99\xe0\xa6\x9d\xe0\xa6\xa1\xe0\xa6\x87\xe0\xa6\x8a\xe0\xa6\x93\xe0\xa7\x8b\xc3\xa9\xc3\xb1\xc3\x91\xc3\x9c\xc3\x9f\xe1\xba\x9e\xc3\x96\xc3\x84\xc3\xa4\xc3\xb6\xc3\x9c\xc4\xa6\xc4\xa6']
In [34]: separator = re.compile('[.,;:!?&()]+', re.MULTILINE | re.UNICODE)
In [35]: value_list = [" ".join([word for word in separator.sub(' ', value).split()]).strip() for value in value_list]
In [36]: word_found = []
In [37]: for value in value_list:
   ....:     word_found.extend([i for i in value if 31 > ord(i) or ord(i) > 127])
   ....:
In [39]: word_found.pop().encode('utf-8')
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-39-61e9eca29caa> in <module>()
----> 1 word_found.pop().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa6 in position 0: ordinal not in range(128)
It is clear that Python is reading x as a byte string (each \x escape shows one byte of a UTF-8 encoded character). While iterating over the string, we are actually iterating over bytes rather than over the characters of the original text. Because of this, ord reports the individual bytes as special characters and they are put in the list. Calling encode('utf-8') on one of them then fails with the out-of-range error, because Python first tries to decode what is only part of an original character as ASCII.
I need to understand how I can iterate over this Python string without changing the way the value is passed into value_list or read from word_found.
Please help.
You need to decode the resulting string before iterating:
s = "".join(word_found) # Convert the list of characters into a string
print type(s) # <type 'str'>
u = s.decode('utf-8') # Decode the UTF-8 bytes into a unicode string
print type(u) # <type 'unicode'>
for c in u:
    print c # Prints each unicode character
If you must have it in list format you can repack it into a list of unicode chars:
s = "".join(word_found)
u = s.decode('utf-8')
unichar_list = [c for c in u]
print unichar_list.pop() # Ħ
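For comparison, here is a sketch of the same idea in Python 3, where bytes and text are distinct types: decode the incoming UTF-8 bytes once, then filter by code point. The sample data is a shortened stand-in for the string in the question, and special_chars is an illustrative name.

```python
import re

# Shortened stand-in for the UTF-8 encoded input (assumption for illustration).
value_list = ["aó, bж: cĦ".encode('utf-8')]

separator = re.compile(r'[.,;:!?&()]+')

special_chars = []
for raw in value_list:
    # Decode first, so the loop below sees characters rather than bytes.
    text = separator.sub(' ', raw.decode('utf-8'))
    special_chars.extend(ch for ch in text if 31 > ord(ch) or ord(ch) > 127)

print(special_chars)
```

Because the filter runs on decoded text, each list entry is a whole character and can be encoded back to UTF-8 without error.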

Python regex with unicode strings

I could not match a unicode string in Python 2.7.
Expected result: 749130
>>> print match("\d+", u'\ufeff749130'.encode('utf-8'))
None
>>> print match("\d+", u'\ufeff749130')
None
>>> print match("\d+", u'\ufeff749130'.decode('utf-8'))
Traceback (most recent call last):
....
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
No need to use str.decode on a unicode string. As stated in the comments, you may want to use search because match only matches from the beginning of the target string.
>>> print search("\d+", u'\ufeff749130').group()
749130
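The leading u'\ufeff' is a byte order mark (BOM), which is why match fails at position 0. Here is a sketch of two ways around it, written for Python 3 (where string literals are already Unicode): use search, or strip the BOM before matching. The variable names are illustrative.

```python
import re

text = '\ufeff749130'  # BOM followed by digits

# match anchors at position 0, where the BOM (not a digit) sits.
anchored = re.match(r'\d+', text)

# search scans forward until it finds the digits.
found = re.search(r'\d+', text).group()

# Alternatively, remove the BOM first and keep using match.
stripped = re.match(r'\d+', text.lstrip('\ufeff')).group()

print(anchored, found, stripped)
```

Stripping the BOM once, at the point where the data is read, is usually simpler than working around it in every pattern.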

How to do string formatting with unicode emdash?

I am trying to do string formatting with a unicode variable. For example:
>>> x = u"Some text—with an emdash."
>>> x
u'Some text\u2014with an emdash.'
>>> print(x)
Some text—with an emdash.
>>> s = "{}".format(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 9: ordinal not in range(128)
>>> t = "%s" %x
>>> t
u'Some text\u2014with an emdash.'
>>> print(t)
Some text—with an emdash.
You can see that I have a unicode string and that it prints just fine. The trouble is when I use Python's new (and improved?) format() function. If I use the old style (using %s) everything works out fine, but when I use {} and the format() function, it fails.
Any ideas of why this is happening? I am using Python 2.7.2.
The new format() is not as forgiving when you mix ASCII and unicode strings ... so try this:
s = u"{}".format(x)
The same approach, with an explicit positional index:
>>> s = u"{0}".format(x)
>>> s
u'Some text\u2014with an emdash.'
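Using the following worked well for me. It is a variant on the other answers. Note that the output shown is what Python 3 produces, where all strings are Unicode; on Python 2.7 the format string itself also needs the u prefix, as in the answers above.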
>>> emDash = u'\u2014'
>>> "a{0}b".format(emDash)
'a—b'
