I am getting data using XPath, and the output contains u'\xa0', the Unicode non-breaking space. I wanted to remove it, but I get:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
Here is my code:
page_active = requests.get('http://www.marketinout.com/stock-screener/stocks.php?list=volume_leaders&exch=asx')
active = html.fromstring(page_active.content)
data = active.xpath('//tbody/tr/td/text()')
data >>> [u'\xa0', u'\xa0', u'\xa0Bard1 Life Sciences Limited', u'\xa0Gold', u'\xa0Basic Materials', u'\xa0ASX', u'\xa07', u'\xa00.025', u'\xa00.015', u'\xa0150.0', u'\xa0278,097,367', u'\xa0', u'\xa0', u'\xa0Patrys Ltd ...]
In order to eliminate '\xa0', I tried [a.replace('\xa0',' ') for a in data] but it returns:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
I also used [a.decode('utf-8').replace("\xa0","") for a in data] but I'm still getting the same error.
You are mixing byte strings and Unicode strings; don't do that. Use Unicode string literals instead:
[a.replace(u'\xa0', u' ') for a in data]
Otherwise, Python will try to decode the byte string '\xa0' as ASCII, and 0xA0 is not a valid ASCII codepoint.
Alternatively, use unicode.strip() to remove trailing and leading whitespace; the U+00A0 codepoint counts as whitespace:
[a.strip() for a in data]
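As a minimal sketch of the two options above, here they are applied to a few of the values from the question. The sample list is shortened, and the `non_empty` filter is an extra step assumed here for illustration; it is not part of the original answer. The `u` prefixes are required on Python 2 and accepted on Python 3.

```python
# A few of the values returned by the XPath query in the question.
data = [u'\xa0', u'\xa0Gold', u'\xa0Basic Materials', u'\xa00.025']

# Option 1: replace the non-breaking space with a regular space.
replaced = [a.replace(u'\xa0', u' ') for a in data]

# Option 2: strip leading/trailing whitespace; U+00A0 counts as whitespace
# for Unicode strings.
stripped = [a.strip() for a in data]

# Assumed extra step: drop the cells that contained only the non-breaking space.
non_empty = [a for a in stripped if a]

print(non_empty)
```

Stripping is usually the more convenient choice here, since the non-breaking space only appears at the start of each cell.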
You need to tell Python to interpret your strings as Unicode.
To do this, add a u before your strings:
[a.replace(u'\xa0', u' ') for a in data]
I am trying to print a non-ASCII character along with a string, but I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u25cf' in position 24: ordinal not in range(128)
This is my constants.ICON_BLACK_CIRCLE
ICON_BLACK_CIRCLE = u'\u25CF'
And here I am trying to print it with some other string
print "{: ^71s}".format(constants.ICON_BLACK_CIRCLE + " - " + errormsg),
s = "| |"
print(s)
How can I fix this error?
This is how I got rid of the exception.
Just make the format string a unicode string as well:
print u"{: ^71s}".format(constants.ICON_BLACK_CIRCLE + " - " +errormsg)
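A small sketch of the same idea, written so that it runs on both Python 2 and Python 3. The value of `errormsg` is a made-up placeholder, and every literal is marked as Unicode so that no implicit ASCII conversion takes place.

```python
ICON_BLACK_CIRCLE = u'\u25CF'
errormsg = u'connection refused'  # hypothetical message, not from the question

# The format string and every operand are Unicode, so nothing has to be
# encoded or decoded implicitly while the line is being built.
line = u"{: ^71s}".format(ICON_BLACK_CIRCLE + u" - " + errormsg)

# The text is centred in a field of 71 characters.
print(len(line))  # 71
```

Whether the resulting line can then be printed still depends on the encoding of the terminal or output stream.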
I have a Unicode character like 🏆 and I want to get back the \Uxxxxxxxx format, but so far I couldn't find an easy way. I already tried:
text = '🏆'
text.encode('utf-32').decode('utf-8')
returns error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
text.encode('utf-32').decode('unicode-escape')
returns ÿþ
How to make it return \U000XXXXX ? I know I can get the character from \U000XXXXX making:
string = "foo bar foo \U000XXXXX"
string.encode('utf-8').decode('unicode-escape')
returns "foo bar foo 🏆"
To get the escape sequence as a byte string:
>>> text = '🏆'
>>> text.encode('unicode-escape')
b'\\U0001f3c6'
To get it as a Unicode string:
>>> text.encode('unicode-escape').decode('ascii')
'\\U0001f3c6'
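For reference, here is the same conversion as a runnable Python 3 sketch. The helper `to_long_escape` is a hypothetical addition, not part of the answer: `unicode-escape` leaves ASCII characters untouched and uses shorter escapes for smaller code points, so the helper shows one way to force the 8-digit form for every character.

```python
text = "\U0001F3C6"  # the trophy emoji from the question

# The codec produces bytes; decoding them as ASCII gives a str.
escaped_bytes = text.encode("unicode-escape")
escaped_str = escaped_bytes.decode("ascii")
print(escaped_str)  # \U0001f3c6


def to_long_escape(s):
    # Hypothetical helper: emit the 8-digit \U form for every code point.
    return "".join("\\U{:08x}".format(ord(ch)) for ch in s)


print(to_long_escape("a" + text))  # \U00000061\U0001f3c6
```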
I'm trying to encode this:
"LIAISONS Ã NEW YORK"
to this:
"LIAISONS à NEW YORK"
The output of print(ascii(value)) is
'LIAISONS \xc3 NEW YORK'
I tried encoding in cp1252 first and decoding after to utf8 but I get this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 9: invalid continuation byte
I also tried to encode in Latin-1/ISO-8859-2, but that did not work either.
How can I do this?
You can't go from your input value to your desired output, because the data is no longer complete.
If your input value was an actual Mojibake re-coding from UTF-8 to a Latin encoding, then you'd have two bytes for the à codepoint:
>>> target = "LIAISONS à NEW YORK"
>>> target.encode('UTF-8').decode('latin1')
'LIAISONS Ã\xa0 NEW YORK'
That's because the UTF-8 encoding for à is C3 A0:
>>> 'à'.encode('utf8').hex()
'c3a0'
In your input, the A0 byte (which doesn't map to a printable character in most Latin-based codecs) has been filtered out somewhere. You can't re-create it from thin air, because the C3 byte of the UTF-8 pair can precede any number of other bytes, all resulting in valid output:
>>> b'\xc3\xa1'.decode('utf8')
'á'
>>> b'\xc3\xa2'.decode('utf8')
'â'
>>> b'\xc3\xa3'.decode('utf8')
'ã'
>>> b'\xc3\xa4'.decode('utf8')
'ä'
and you can't easily pick one of those, not without additional natural language processing. The bytes 80-A0 and AD are all valid continuation bytes in UTF-8 for this case, but none of those bytes result in a printable Latin-1 character, so there are at least 18 different possibilities here.
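To make the point concrete, here is a short Python 3 sketch. It shows that intact Mojibake can be reversed, while the version with the A0 byte removed cannot. The variable names are illustrative, and à is written as the escape `\u00e0` only to keep the source file ASCII.

```python
target = "LIAISONS \u00e0 NEW YORK"  # \u00e0 is the letter a with grave accent

# Intact Mojibake: UTF-8 bytes misread as Latin-1. This is fully reversible.
mojibake = target.encode("utf-8").decode("latin-1")
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == target)  # True

# The input from the question: the A0 byte has been dropped along the way.
damaged = mojibake.replace("\xa0", "")
try:
    damaged.encode("latin-1").decode("utf-8")
    recoverable = True
except UnicodeDecodeError:
    # C3 is now followed by a space, which is not a continuation byte.
    recoverable = False
print(recoverable)  # False
```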
new_text = text.decode('utf-8').replace('\u00a0', ' ').replace('\u00ad', ' ').replace('Â', ' ').replace(' ',' ').replace(' ', ' ').replace(' ', ' ').replace('\u20b9',' ').replace('\ufffd',' ').replace('\u037e',' ').replace('\u2022',' ').replace('\u200b',' ').replace('0xc3',' ')
This is the error produced by the code:
new_text = text.decode('utf-8').replace('\u00a0', ' ').replace('\u00ad', ' ').replace('Â', ' ').replace(' ',
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
127.0.0.1 - - [29/Aug/2017 15:22:00] "GET / HTTP/1.1" 500 -
I have tried decoding ascii from unicode.
You are calling .replace on a unicode object but giving str arguments to it. The arguments are converted to unicode using the default ASCII encoding, which will fail for bytes not in range(128).
To avoid this problem, do not mix str and unicode. Either pass unicode arguments to unicode methods:
new_text = text.decode('utf-8').replace(u'\u00a0', u' ').replace(u'\u00ad', u' ')...
or do the replacements on the str object before decoding, assuming text is a UTF-8 encoded str. A plain str literal in Python 2 does not interpret \u escapes, so use the UTF-8 byte sequences:
new_text = text.replace('\xc2\xa0', ' ').replace('\xc2\xad', ' ')...
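For comparison, here is a sketch of both approaches in Python 3, where mixing the two types raises a TypeError instead of triggering an implicit ASCII conversion. The sample input is invented for illustration.

```python
# Hypothetical input: UTF-8 bytes with a non-breaking space and a soft hyphen.
raw = "price:\u00a0100\u00adkg".encode("utf-8")

# Approach 1: decode first, then replace str patterns in the str.
text = raw.decode("utf-8")
cleaned = text.replace("\u00a0", " ").replace("\u00ad", " ")

# Approach 2: replace the encoded byte sequences, then decode.
cleaned_from_bytes = (
    raw.replace(b"\xc2\xa0", b" ").replace(b"\xc2\xad", b" ").decode("utf-8")
)

print(cleaned)                        # price: 100 kg
print(cleaned == cleaned_from_bytes)  # True
```

Decoding first is generally easier to read, because the patterns can be written as code points rather than as encoded byte sequences.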
The last piece of your chained replaces seems to be the problem.
text.replace('0xc3', ' ')
This tries to replace 0xc3 with a space. Note that '0xc3' is the four-character string 0xc3; the single byte would be written '\xc3'. In your code snippet it effectively reads
text.decode('utf-8').replace('0xc3', ' ')
which means that you first decode the bytes to a unicode string in Python and only then try to replace a byte-string pattern in it. It should work if you do the replacement on the bytes before decoding:
text.replace('0xc3', ' ').decode('utf-8')
I could not match a unicode string in Python 2.7.
The expected result is 749130.
>>> print match("\d+", u'\ufeff749130'.encode('utf-8'))
None
>>> print match("\d+", u'\ufeff749130')
None
>>> print match("\d+", u'\ufeff749130'.decode('utf-8'))
Traceback (most recent call last):
....
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
No need to use str.decode on a unicode string. As stated in the comments, you may want to use search because match only matches from the beginning of the target string.
>>> print search("\d+", u'\ufeff749130').group()
749130
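The difference is easy to reproduce in Python 3 as well. The sketch below also shows one optional way to avoid the leading U+FEFF (a byte order mark) altogether, by decoding with the `utf-8-sig` codec. It assumes the text originally arrived as UTF-8 bytes, which the question does not state.

```python
import re

text = "\ufeff749130"

# match() only tries at position 0, where the BOM character sits.
print(re.match(r"\d+", text))           # None

# search() scans forward and finds the digits after the BOM.
print(re.search(r"\d+", text).group())  # 749130

# Assumption: the data came in as UTF-8 bytes with a BOM.
# The utf-8-sig codec drops the BOM while decoding, so match() works again.
raw = b"\xef\xbb\xbf749130"
decoded = raw.decode("utf-8-sig")
print(re.match(r"\d+", decoded).group())  # 749130
```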