I've got a problem trying to encode non-ASCII characters.
I have this function:
# function to treat special characters
tagsA = ["À","Á","Â","à","á","â","Æ","æ"]
tagsC = ["Ç","ç"]
tagsE = ["È","É","Ê","Ë","è","é","ê","ë"]
tagsI = ["Ì","Í","Î","Ï","ì","í","î","ï"]
tagsN = ["Ñ","ñ"]
tagsO = ["Ò","Ó","Ô","Œ","ò","ó","ô","œ"]
tagsU = ["Ù","Ú","Û","Ü","ù","ú","û","ü"]
tagsY = ["Ý","Ÿ","ý","ÿ"]

def toASCII(word):
    for i in range(0, len(word), 1):
        if any(word[i] in s for s in tagsA):
            word[i] = "a"
        if any(word[i] in s for s in tagsC):
            word[i] = "c"
        if any(word[i] in s for s in tagsE):
            word[i] = "e"
        if any(word[i] in s for s in tagsI):
            word[i] = "i"
        if any(word[i] in s for s in tagsN):
            word[i] = "n"
        if any(word[i] in s for s in tagsO):
            word[i] = "o"
        if any(word[i] in s for s in tagsU):
            word[i] = "u"
        if any(word[i] in s for s in tagsY):
            word[i] = "y"
    print word
    return word
I usually get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
I tried changing the encoding to UTF-8, but it doesn't fix the issue:
# -*- coding: utf-8 -*-
You can use the unicodedata module to remove all the accents from a string.
Ex:
import unicodedata
print unicodedata.normalize('NFKD', u"ÀÁ").encode('ASCII', 'ignore')
Output:
AA
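For convenience, the one-liner above can be wrapped in a small function as a drop-in replacement for toASCII (a sketch in Python 3 syntax; on Python 2, pass a u'' literal and drop the final decode):

```python
import unicodedata

def to_ascii(word):
    # word must be a Unicode string; NFKD splits each accented letter into
    # its base letter plus combining marks, and 'ignore' drops the marks
    return unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('ascii')
```

Note that this drops characters with no decomposition (such as Æ or œ) rather than substituting a letter for them.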
I'm trying to convert this:
"LIAISONS Ã NEW YORK"
to this:
"LIAISONS à NEW YORK"
The output of print(ascii(value)) is
'LIAISONS \xc3 NEW YORK'
I tried encoding in cp1252 first and decoding after to utf8 but I get this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 9: invalid continuation byte
I also tried encoding in Latin-1/ISO-8859-2, but that did not work either.
How can I do this?
You can't go from your input value to your desired output, because the data is no longer complete.
If your input value was an actual Mojibake re-coding from UTF-8 to a Latin encoding, then you'd have two bytes for the à codepoint:
>>> target = "LIAISONS à NEW YORK"
>>> target.encode('UTF-8').decode('latin1')
'LIAISONS Ã\xa0 NEW YORK'
That's because the UTF-8 encoding for à is C3 A0:
>>> 'à'.encode('utf8').hex()
'c3a0'
In your input, the A0 byte (which doesn't map to a printable character in most Latin-based codecs) has been filtered out somewhere. You can't re-create it from thin air, because the C3 byte of the UTF-8 pair can precede any number of other bytes, all resulting in valid output:
>>> b'\xc3\xa1'.decode('utf8')
'á'
>>> b'\xc3\xa2'.decode('utf8')
'â'
>>> b'\xc3\xa3'.decode('utf8')
'ã'
>>> b'\xc3\xa4'.decode('utf8')
'ä'
and you can't easily pick one of those, not without additional natural language processing. The bytes 80-A0 and AD are all valid continuation bytes in UTF-8 for this case, but none of those bytes result in a printable Latin-1 character, so there are at least 18 different possibilities here.
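Those candidates can be enumerated for illustration (a Python 3 sketch; the variable names are mine):

```python
# Continuation bytes after a 0xC3 lead byte are 0x80-0xBF; keep only those
# that are not printable as Latin-1 characters, i.e. the ones that a
# control/whitespace filter could plausibly have stripped from the input.
candidates = [b for b in range(0x80, 0xC0)
              if not bytes([b]).decode('latin1').isprintable()]

# Every candidate still decodes to a different, perfectly valid character.
chars = [bytes([0xC3, b]).decode('utf-8') for b in candidates]
```

The original à (0xC3 0xA0) is just one entry in that list, with no way to tell it apart from the others.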
I am getting data using XPath, and the output contains u'\xa0' (a non-breaking space). I want to eliminate it, but my attempt raises:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
Here is my code:
import requests
from lxml import html

page_active = requests.get('http://www.marketinout.com/stock-screener/stocks.php?list=volume_leaders&exch=asx')
active = html.fromstring(page_active.content)
data = active.xpath('//tbody/tr/td/text()')
data >>> [u'\xa0', u'\xa0', u'\xa0Bard1 Life Sciences Limited\n', u'\xa0Gold', u'\xa0Basic Materials', u'\xa0ASX', u'\xa07', u'\xa00.025', u'\xa00.015', u'\xa0150.0', u'\xa0278,097,367', u'\xa0', u'\xa0', u'\xa0Patrys Ltd ...]
In order to eliminate '\xa0', I tried [a.replace('\xa0',' ') for a in data] but it returns:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
I also used [a.decode('utf-8').replace("\xa0","") for a in data] but I'm still getting the same error.
You are mixing bytes and Unicode, don't do that. Use Unicode string literals instead:
[a.replace(u'\xa0', u' ') for a in data]
Otherwise, Python will try to decode the byte string '\xa0' as ASCII, and 0xA0 is not a valid ASCII codepoint.
Alternatively, use unicode.strip() to remove trailing and leading whitespace; the U+00A0 codepoint counts as whitespace:
[a.strip() for a in data]
You need to tell Python to interpret your strings as Unicode.
To do this, add a u before your strings:
[a.replace(u'\xa0', u' ') for a in data]
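As a quick check, here are both fixes applied to made-up sample values like those in the question:

```python
data = [u'\xa0', u'\xa0Gold', u'\xa0Basic Materials']

# Unicode literal on both sides, so nothing gets decoded as ASCII
replaced = [a.replace(u'\xa0', u' ') for a in data]

# U+00A0 counts as whitespace, so strip() drops it entirely
stripped = [a.strip() for a in data]
```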
I have the following code snippet in Python (2.7.8) on Windows:
text1 = 'áéíóú'
text2 = text1.encode("utf-8")
and I get this exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
Any ideas?
You forgot to specify that you are dealing with a unicode string:
text1 = u'áéíóú' #prefix string with "u"
text2 = text1.encode("utf-8")
In Python 3 this behavior changed: every string literal is Unicode, so you don't need the prefix.
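A quick sketch of the Python 3 behaviour, where every str is Unicode and encode() produces a bytes object:

```python
text1 = 'áéíóú'                 # already Unicode in Python 3, no u prefix needed
text2 = text1.encode('utf-8')   # a bytes object

# round-trips cleanly
assert text2.decode('utf-8') == text1
```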
I have tried the following code in Linux with Python 2.7:
>>> text1 = 'áéíóú'
>>> text1
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
>>> type(text1)
<type 'str'>
>>> text1.decode("utf-8")
u'\xe1\xe9\xed\xf3\xfa'
>>> print '\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
áéíóú
>>> print u'\xe1\xe9\xed\xf3\xfa'
áéíóú
>>> u'\xe1\xe9\xed\xf3\xfa'.encode('utf-8')
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba'
\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba is the UTF-8 encoding of áéíóú, and \xe1\xe9\xed\xf3\xfa are its Unicode code points.
text1 is a UTF-8 encoded byte string; it can only be decoded to Unicode with:
text1.decode("utf-8")
A Unicode string can be encoded back to a UTF-8 byte string:
u'\xe1\xe9\xed\xf3\xfa'.encode('utf-8')
I am trying to generate random Unicode characters whose code points start with one of two fixed number+letter combinations.
I tried the following, but I am getting an error.
def rand_unicode():
    b = ['03','20']
    l = ''.join([random.choice('ABCDEF0123456789') for x in xrange(2)])
    return unicode(u'\u'+random.choice(b)+l,'utf8')
The error I am getting:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: end of string in escape sequence
I use Python 2.6.
That's not how \u escapes work: they are resolved when the source is parsed, so you can't assemble one at runtime. Build the code point as a number and pass it to unichr() instead:
return unichr(random.choice((0x300, 0x2000)) + random.randint(0, 0xff))
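On Python 3 the same idea reads as follows (a sketch; chr() replaced Python 2's unichr(), and the function name is mine):

```python
import random

def rand_unicode():
    # pick one of the question's two leading pairs: 0x03__ or 0x20__
    base = random.choice((0x0300, 0x2000))
    # append a random low byte, 0x00-0xFF
    return chr(base + random.randint(0, 0xFF))
```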
I read the artist of a song from its MP3 tag, then create a folder based on that name. The problem is when the name contains a special character, like 'AC\DC', so I wrote this code to deal with that:
def replace_all(text):
    print "replace_all"
    dictionary = {'\\':"", '?':"", '/':"", '...':"", ':':"", chr(148):"o"}
    for i, j in dictionary.iteritems():
        text = text.replace(i,j)
    return text
What I am running into now is how to deal with non-English characters, like the umlaut o in Motörhead or Blue Öyster Cult.
As you can see, I tried adding a single-byte stand-in for the umlaut o at the end of the dictionary, but that failed with:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
I found this code, though I don't understand it:
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
It enabled me to remove the accent marks from the path of proposed dir/filenames.
I suggest using Unicode for both the input text and the characters you replace. In your example, chr(148) is clearly not a Unicode character.
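Putting both pieces together, here is a sketch of a sanitiser in the spirit of the question (the function name and the exact character set are my own choices):

```python
import unicodedata

def safe_name(text):
    # 1) decompose accented letters and drop the combining marks ('Mn')
    text = ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')
    # 2) remove characters that are problematic in folder names
    for bad in u'\\?/:':
        text = text.replace(bad, u'')
    return text
```

With this, u'Motörhead' becomes 'Motorhead' and u'AC/DC' becomes 'ACDC'.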