I'm trying to read in a response from a REST API, parse it as JSON and write the properties to a CSV file.
It appears some of the characters are in an unknown encoding and can't be converted to strings when they're written out to the CSV file:
'ascii' codec can't encode character u'\xf6' in position 15: ordinal not in range(128)
So, what I've tried to do is follow the answer by "agf" on this question:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)
I added a call to unicode(content).encode("utf-8") when my script reads the contents of the response:
obj = json.loads(unicode(content).encode("utf-8"))
Now I see a exceptions.UnicodeDecodeError on this line.
Is Python attempting to decode "content" before encoding it as utf-8? I don't quite understand what's going on. There is no way to determine the encoding of the response since the API I'm calling doesn't set a Content-Type header.
Not sure how to handle this. Please advise.
Related
I am trying to read a csv file using the following lines of Python code:
crimes = pd.read_csv('C:/Users/usuario1/Desktop/python/csv/001 Boston crimes/crime.csv', encoding = 'utf8')
crimes.head(5)
But I am getting decode error as follws:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 24: invalid start byte
What is going wrong?
May be your file does not support utf-8 codec or has a character that does not support utf-8. You can try other encodings like ISO-8859-1. But it is best to check your file encoding first. To do so, something like the following should work:
1.
with open('Your/file/path') as f:
print(f)
This should print file details with encoding.
Or you can just open the csv and when you go to File -> Save As this should show your encoding.
If those don't help, you can ignore the rows that are causing problems by using `error_bad_lines=False'
crimes = pd.read_csv('Your/file/path', encoding='utf8', error_bad_lines=False)
Hope these will help
I want write some strings to file which is not in English, they are in Azeri language. Even if I do utf-8 encoding I get following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128)
my code piece that wants to write to file is following:
t_w = text_list[y].encode('utf-8')
new_file.write(t_w.decode('utf-8'))
new_file.write('\n')
EDIT
Even if I make code as:
t_w = text_list[y].encode('ascii',errors='ignore')
new_file.write(t_w)
new_file.write('\n')
I get following error which is :
TypeError: write() argument must be str, not bytes
From what I can tell t_w.decode(...) attempts to convert your characters to ASCII, which doesn't encode some Azeri characters. There is no need to decode the string because you want to write it to the file as UTF-8, so omit the .decode(...) part:new_file.write(t_w)
I have a URL like this:
http://idebate.org/debatabase/debates/constitutional-governance/house-supports-dalai-lama%E2%80%99s-%E2%80%98third-way%E2%80%99-tibet
Then I used following script in python to decode this url:
full_href = urllib.unquote(full_href.encode('ascii')).decode('utf-8')
However, i got error like this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 89: ordinal not in range(128)
when trying to write in file
Just as #KevinJ.Chase pointed out, you were most likely trying to write to a file with string in incompatible ascii format.
You can either change your write file encoding, or encode your full_href to ascii, something like this:
# don't decode again to utf-8
full_href = urllib.unquote(url.encode('ascii'))
... then write to your file stream
or,
...
# encode your your to compatible encoding on write, ie. utf-8
with open('yourfilenamehere', 'w') as f:
f.write(full_href.encode('utf-8'))
I am trying to convert a csv file to a json file. The whole code runs fine but when I encounter the statement:
json.dump(DictName, out_file)
I get the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 15: invalid start byte
Would someone please be able to help?
TIA.
I found the solution. While parsing the string, I converted the string to unicode using the unicode() function:
unicode(stringname, errors='replace')
and it replaced all the erroneous symbols.
I am retrieving a list of names from a webservice using a client I've written in Python. Upon retrieving the list, I encode each name to unicode and then print each of them to stdout. When I get to the name "Ólafur Jóhann Ólafsson", I get the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)
Since I cannot know what the encoding will be, how do I convert all of these strings to unicode? Or can you suggest a better way to handle this problem?
First of all, you decode data to Unicode (the absence of encoding) when reading from a file, pipe, socket, terminal, etc.; and encode Unicode to an appropriate byte encoding when sending/persisting data. I suspect this is the root of your problem.
The web service should declare the encoding in the headers or data received. print normally automatically encodes Unicode to the terminal's encoding (discovered through sys.stdout.encoding) or in absence of that just ascii. If the characters in your data are not supported by the target encoding, you'll get a UnicodeEncodeError.
Since that is not the error you received, you should post some code so we can see what your are doing. Most likely, you are encoding a byte string instead of decoding. Here's an example of this:
>>> data = '\xc2\xbd' # UTF-8 encoded 1/2 symbol.
>>> data.encode('cp437')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\dev\python\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
What I did here is call encode on a byte string. Since encode requires a Unicode string, Python used the default ascii encoding to decode the byte string to Unicode first, before encoding to cp437.
Fix this by decoding instead of encoding the data, then print will encode to stdout automatically. As long as your terminal supports the characters in the data, it will display properly:
>>> import sys
>>> sys.stdout.encoding
'cp437'
>>> print data.decode('utf8') # implicit encode to sys.stdout.encoding
½
>>> print data.decode('utf8').encode('cp437') # explicit encode.
½
The UnicodeDammit module from BeautifulSoup can automagically detect the encoding.
from BeautifulSoup import UnicodeDammit
u = UnicodeDammit("Ólafur Jóhann Ólafsson")
print u.unicode
print u.originalEncoding
This page may help you http://wiki.python.org/moin/PrintFails
The problem, I guess, is that you need to print those names to console. Do you really need it? or it's just a test environment? if you use console just for testing, you may switch to other tools like unit testing to check what values you exactly get.