Recode bytes which cannot be decoded in utf-8 in python - python

I am reading in from txt files, and there is one byte that is causing decoding problems:
with open(input_filename_and_director, 'rb') as f:
    r = unicodecsv.reader(f, delimiter="|")
Results in an error message:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 26: invalid continuation byte
Is there any way to specify how I want these bytes handled (i.e. to read this byte in as another character)?

Depending on what you want, try unicodecsv.reader(f, delimiter="|", errors='replace') or unicodecsv.reader(f, delimiter="|", errors='ignore'). unicodecsv passes the errors parameter through to the Unicode decoder: 'replace' substitutes a replacement character for each undecodable byte, and 'ignore' silently drops it. See the help for unicode for more information.
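To see what the two error handlers do, here is a minimal sketch that uses plain bytes.decode in Python 3 instead of unicodecsv. The sample bytes are made up for illustration; they are not taken from the asker's file.

```python
# Hypothetical sample: 0xc3 opens a two-byte UTF-8 sequence, but the next
# byte (a space) is not a continuation byte, so strict decoding fails.
raw = b"caf\xc3 au lait"

try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print(exc.reason)

# 'replace' puts U+FFFD (the replacement character) where the bad byte was.
replaced = raw.decode("utf-8", errors="replace")

# 'ignore' drops the bad byte and keeps everything else.
ignored = raw.decode("utf-8", errors="ignore")

print(repr(replaced))
print(repr(ignored))
```

Both handlers lose information, so they are best kept for cases where the damaged character does not matter.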

Related

invalid continuation byte when reading file

Here is my line of code:
m_data = pd.read_table(m_path, sep='::', header=None, names=mnames)
results in the error:
'utf-8' codec can't decode byte 0xe9 in position 3114: invalid continuation byte
I have specified an encoding in my code:
m_data = pd.read_table(m_path, sep='::', header=None, names=mnames,encoding='utf-8')
But the problem still exists. What should I do then?
'utf-8' codec can't decode byte 0xe9 in position 3114: invalid continuation byte
This error message means the file is not encoded in UTF-8, so you should not read it with the utf8 codec.
It might be UTF-16, GBK, or another encoding.
If you still get a similar message after trying a few likely encodings, I suggest the chardet package, which is easy to use:
import chardet
with open("your_file", mode="rb") as f:
    print(chardet.detect(f.read(2000)))
rb means the file is read as raw bytes, without decoding.
2000 is the number of bytes to pass to the detector. In general, the more bytes you give it, the more accurate the result.
chardet - pypi
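If installing chardet is not an option, the "try a few encodings" step can be sketched with the standard library alone. The helper name first_working_encoding and the candidate list are assumptions made for this example. A successful decode does not prove the guess is right: latin-1 accepts any byte sequence, so it only makes sense as the last candidate.

```python
def first_working_encoding(data, candidates=("utf-8", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes data.

    Hypothetical helper: the candidates are tried in order, so put the
    strictest encodings first and catch-all ones such as latin-1 last.
    """
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    return None, None


# 0xe9 is 'é' in latin-1, but on its own it is not valid UTF-8.
name, text = first_working_encoding(b"caf\xe9")
print(name, text)
```

The encoding name that comes back can then be passed to the encoding argument of pd.read_table.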

'utf-8' codec can't decode byte 0xa0 in position 24: invalid start byte

I am trying to read a csv file using the following lines of Python code:
crimes = pd.read_csv('C:/Users/usuario1/Desktop/python/csv/001 Boston crimes/crime.csv', encoding = 'utf8')
crimes.head(5)
But I am getting a decode error as follows:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 24: invalid start byte
What is going wrong?
Maybe your file is not encoded in UTF-8, or it contains bytes that are not valid UTF-8. You can try other encodings such as ISO-8859-1, but it is best to check the file's encoding first. To do so, something like the following should work:
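with open('Your/file/path') as f:
    print(f)
This prints the file object's details, including an encoding. Note that this is the encoding Python chose by default when opening the file, not one detected from the file's contents.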
Or you can just open the csv and when you go to File -> Save As this should show your encoding.
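If those don't help, you can skip the rows that are causing problems by using error_bad_lines=False (replaced by on_bad_lines='skip' in pandas 1.3 and later). Note that this option skips rows with the wrong number of fields; it does not suppress decoding errors by itself.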
crimes = pd.read_csv('Your/file/path', encoding='utf8', error_bad_lines=False)
Hope these will help
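As a rough illustration of why the encoding has to be checked rather than read off the file object, the sketch below writes a small temporary file containing byte 0xa0 and opens it twice. The file content is invented for the example; only the byte value comes from the error message above.

```python
import os
import tempfile

# Hypothetical content: 0xa0 is a non-breaking space in ISO-8859-1,
# but on its own it is not a valid UTF-8 start byte.
payload = b"Boston\xa0crimes\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    with open(path, encoding="utf-8") as f:
        # f.encoding echoes what open() was told; the file is not inspected.
        reported = f.encoding
        try:
            f.read()
            utf8_ok = True
        except UnicodeDecodeError:
            utf8_ok = False

    # Opening the same bytes as ISO-8859-1 succeeds.
    with open(path, encoding="iso-8859-1") as f:
        text = f.read()
finally:
    os.remove(path)

print(reported, utf8_ok, repr(text))
```

The same encoding name can be passed to pd.read_csv through its encoding parameter.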

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1010494: character maps to <undefined>

please I need help with this:
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1437750/0001477932-13-004416.txt'
with open('file', 'wb') as f:
    f.write(requests.get('%s' % url).content)
with open('file', 'r') as t:
    words = t.read()
The above gives me the following error:
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1010494: character maps to <undefined>
Thank you!
I just experienced the same problem. When I was trying to read the file, one of my strings had a double space: ' '. Removing that double space fixed the 0x9d problem.
Why are you writing the file as binary and then reading it back as text without naming an encoding? Python cannot decode the bytes correctly until you tell it which codec to use. The 'charmap' codec in the error means open() fell back to the platform default (typically cp1252 on Windows), in which byte 0x9d has no mapping. If the file is not UTF-8 encoded, try decoding it as latin-1 when reading it:
with open('file', 'r', encoding='latin-1') as t:
    words = t.read()
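One plausible source of a stray 0x9d byte is a UTF-8 encoded curly quote being read with cp1252. The sketch below shows that case only; whether the SEC file is actually UTF-8 is an assumption, not something the question confirms.

```python
# U+201D (right double quotation mark) is b'\xe2\x80\x9d' in UTF-8.
data = "\u201d".encode("utf-8")

try:
    data.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    # 0x9d has no mapping in cp1252, hence the 'charmap' error.
    cp1252_ok = False

# latin-1 maps every byte to a character, so it never fails, but here it
# yields three wrong characters instead of one quotation mark.
as_latin1 = data.decode("latin-1")

# Decoding with the encoding the bytes were written in recovers the text.
as_utf8 = data.decode("utf-8")

print(cp1252_ok, repr(as_latin1), repr(as_utf8))
```

So latin-1 makes the exception go away, but if the text then shows odd character runs, utf-8 is worth trying instead.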

'utf-8' codec can't decode byte 0x89

I want to read a csv file and process some columns but I keep getting issues.
Stuck with the following error:
Traceback (most recent call last):
  File "C:\Users\Sven\Desktop\Python\read csv.py", line 5, in <module>
    for row in reader:
  File "C:\Python34\lib\codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 446: invalid start byte
My Code
import csv
with open("c:\\Users\\Sven\\Desktop\\relaties 24112014.csv", newline='', encoding="utf8") as f:
    reader = csv.reader(f, delimiter=';', quotechar='|')
    #print(sum(1 for row in reader))
    for row in reader:
        print(row)
        if row:
            value = row[6]
            value = value.replace('(', '')
            value = value.replace(')', '')
            value = value.replace(' ', '')
            value = value.replace('.', '')
            value = value.replace('0032', '0')
            if len(value) > 0:
                print(value + ' Length: ' + str(len(value)))
I'm a beginner with Python, tried googling, but hard to find the right solution.
Can anyone help me out?
The first byte of a .PNG file is 0x89. Not saying that is your problem, but the .PNG header is specifically designed so that it is NOT accidentally interpreted as text.
Why you would have a .csv file that is actually a .png, I don't know. But it could definitely happen if someone renamed the file by mistake. On Windows 10, every once in a while I accidentally mass-rename files because of their stupid checkbox feature. Why Microsoft decided that desktop machines having identical UI controls to tablets was a good idea... I don't know.
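A quick way to rule this out is to compare the start of the file with the 8-byte PNG signature. The helper names below are made up for this sketch.

```python
# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(first_bytes):
    """Return True if the given leading bytes start with the PNG signature."""
    return first_bytes.startswith(PNG_SIGNATURE)


def file_looks_like_png(path):
    """Hypothetical helper: read just enough of the file to check its signature."""
    with open(path, "rb") as f:
        return looks_like_png(f.read(len(PNG_SIGNATURE)))


print(looks_like_png(PNG_SIGNATURE + b"\x00\x00\x00\rIHDR"))
print(looks_like_png(b"name;phone;city\r\n"))
```

Here the failing byte is at position 446 rather than position 0, so a renamed PNG looks unlikely, but the check is cheap.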
This is the most important clue:
invalid start byte
\x89 is not, as suggested in the comments, an invalid UTF-8 byte. It is a completely valid continuation byte, meaning that when it follows a suitable lead byte, the pair decodes as valid UTF-8:
http://hexutf8.com/?q=0xc90x89
So either (1) you do not have UTF-8 data as you expect, or (2) you have some malformed UTF-8 data. The Python codec is simply telling you that it encountered \x89 where a start byte was expected.
(More on continuation bytes here: http://en.wikipedia.org/wiki/UTF-8#Codepage_layout)
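The same point can be checked directly in Python 3. This is a small sketch using the byte pair from the hexutf8 link above.

```python
# Continuation bytes have the bit pattern 10xxxxxx.
is_continuation = (0x89 & 0xC0) == 0x80

# After the lead byte 0xc9, 0x89 completes a valid two-byte sequence.
pair = b"\xc9\x89".decode("utf-8")

# On its own, 0x89 cannot start a character, so decoding fails.
try:
    b"\x89".decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

print(is_continuation, hex(ord(pair)), reason)
```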
I was also getting a similar error when trying to read or upload the following kinds of files:
CSV File
JPEG File
PNG File
Zip File
The best way to avoid errors like:
'utf-8' codec can't decode byte 0x89
'utf-8' codec can't decode byte 0xff
is to read these files as bytes. When you treat them as bytes, you do not need to provide an encoding. So when you open them, specify binary mode:
with open(file_path, 'rb') as file:
    data = file.read()  # raw bytes; no decoding is attempted
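For a CSV file in Python 3, however, csv.reader expects text rather than bytes, and binary mode does not accept a newline argument. So in your case, open the file in text mode with the encoding it was actually saved in (cp1252 below is only a guess):
import csv
with open("c:\\Users\\Sven\\Desktop\\relaties 24112014.csv", newline='', encoding='cp1252') as f:
    reader = csv.reader(f, delimiter=';', quotechar='|')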

decode following urls in python

I have a URL like this:
http://idebate.org/debatabase/debates/constitutional-governance/house-supports-dalai-lama%E2%80%99s-%E2%80%98third-way%E2%80%99-tibet
Then I used following script in python to decode this url:
full_href = urllib.unquote(full_href.encode('ascii')).decode('utf-8')
However, I got an error like this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 89: ordinal not in range(128)
when trying to write to a file.
Just as @KevinJ.Chase pointed out, you were most likely trying to write a unicode string to a file through the ascii codec, which cannot represent characters such as u'\u2019'.
You can either keep full_href as an undecoded byte string, or encode it explicitly when you write it, something like this:
# don't decode again to utf-8
full_href = urllib.unquote(url.encode('ascii'))
... then write to your file stream
or,
...
# encode your string to a compatible encoding on write, i.e. utf-8
with open('yourfilenamehere', 'w') as f:
    f.write(full_href.encode('utf-8'))
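The snippets above are Python 2. For comparison, here is a sketch of the same task in Python 3, where urllib.parse.unquote returns a str and the file is opened with an explicit encoding. The helper save_text is hypothetical and is not called here.

```python
from urllib.parse import unquote

url = ("http://idebate.org/debatabase/debates/constitutional-governance/"
       "house-supports-dalai-lama%E2%80%99s-%E2%80%98third-way%E2%80%99-tibet")

# unquote() decodes percent-escapes as UTF-8 by default and returns a str.
full_href = unquote(url)


def save_text(path, text):
    """Hypothetical helper: write text with an explicit encoding, so the
    platform default codec is never involved."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


print(full_href)
```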
