'invalid continuation byte' - csv with multiple encodings? - python

I'm trying to download and parse a U.S. Census csv file in python. I'm getting a recurring error that suggests that there are multiple encodings in the file.
I got the file encoding using
import urllib.request
import io
url = 'https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/metro/totals/csa-est2019-alldata.csv'
urllib.request.urlretrieve(url, 'data/source_files/census/city/2010.csv')
This gives me the file encoding
io.open('data/source_files/census/city/2010.csv')
<_io.TextIOWrapper name='data/source_files/census/city/2010.csv' mode='r' encoding='UTF-8'>
But the encoding doesn't seem to be correct? I tried using chardet.
import chardet
with open('data/source_files/census/city/2010.csv', encoding='UTF-8') as f:
    print(chardet.detect(f.read()))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 11902: invalid continuation byte
I get a similar error no matter what I try:
df = pd.read_csv('data/source_files/census/city/' + '2010.csv')
import csv
with open("data/source_files/census/city/2010.csv","r") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['CBSA'])
All these approaches are giving me this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 11902: invalid continuation byte
Any advice on how to get around this?

Latin-1 is a single-byte encoding in which every byte value is defined, so decoding with it never raises an error, even when the bytes are not valid UTF-8. Sometimes Latin-1 is the right guess.
If UTF-8 gives you this error, try Latin-1 instead:
import pandas as pd
url = "https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/metro/totals/csa-est2019-alldata.csv"
data = pd.read_csv(url, encoding='latin-1')
data.head()
data.head() should then display the first rows of the file without errors.

The first snippet doesn't get the encoding; it just downloads the file.
The second snippet opens the file with an OS-specific default encoding, specifically the value of locale.getpreferredencoding(False). UTF-8 was the default on the OS used, and it isn't correct for this file.
The third snippet opens the file as UTF-8 again, and that is the cause of the failure, not chardet (which in any case expects bytes, not decoded text).
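If you still want a chardet guess, feed it raw bytes by opening the file in binary mode; a minimal sketch, assuming chardet is installed and reusing the path from the question:
import chardet

# chardet.detect() expects bytes, so read the file in binary mode
with open('data/source_files/census/city/2010.csv', 'rb') as f:
    result = chardet.detect(f.read())

print(result)  # a dict with the guessed encoding and a confidence score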
For the download itself, use the requests library instead of urllib:
>>> import requests
>>> r=requests.get('https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/metro/totals/csa-est2019-alldata.csv')
>>> r
<Response [200]>
>>> r.encoding
'ISO-8859-1'
The correct encoding is ISO-8859-1, also known as latin1. r.text will be the correctly decoded text.
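Since r.text is already decoded, you can parse it directly in memory; a small sketch using the standard csv module (the CBSA column name is taken from the question):
import csv
import io
import requests

url = 'https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/metro/totals/csa-est2019-alldata.csv'
r = requests.get(url)

# r.text was decoded using r.encoding (ISO-8859-1 here), so no manual decode step is needed
reader = csv.DictReader(io.StringIO(r.text))
for row in reader:
    print(row['CBSA'])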

It looks like that CSV file is not UTF-8 encoded, so you have to pass the correct encoding for the file wrapper to decode it. In this case the CSV is in the Windows ANSI code page rather than UTF-8:
open("...", encoding="ANSI")
Note that the "ANSI" alias only works on Windows; on other platforms, name the code page explicitly, e.g. encoding="cp1252".

Related

utf-8 error when opening csv file in pandas on mac

I am trying to open a csv file with Japanese characters using utf8 on my mac.
The code that I am using is as follows:
foo = pd.read_csv("filename.csv", encoding = 'utf8')
However, I have been getting the following error message.
'utf-8' codec can't decode byte 0x96 in position 0
I've tried looking around, but a lot of the solutions seem to be for Windows, and I haven't had any success with the other solutions yet.
Appreciate the help!
It seems that your file really does contain bytes that are not valid UTF-8. The correct encoding strongly depends on the file's content, but in the most common case 0x96 can be decoded with cp1252. So just try:
foo = pd.read_csv("filename.csv", encoding = 'cp1252')
If you don't know the original encoding of the file, you can try to detect it with a third-party library such as chardet.
I might be able to help more if you upload a chunk of the file that reproduces the problem.

Ignore UnicodeEncodeError when saving utf8 file

I have a large string of a novel that I downloaded from Project Gutenberg. I am trying to save it to my computer, but I'm getting a UnicodeEncodeError and I don't know how to fix or ignore it.
from urllib import request
# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf8
raw = response.read().decode('utf8')
# Save the file
file = open('corpora/canon_texts/' + 'test', 'w')
file.write(raw)
file.close()
This gives me the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
First, I tried to remove the BOM at the beginning of the file:
# We have to get rid of the pesky Byte Order Mark before we save it
raw = raw.replace(u'\ufeff', '')
but I get the same error, just with a different position number:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>
If I look in that area I can't find the offending characters, so I don't know what to remove:
raw[7850:7900]
just prints out:
' BALLENA, Spanish.\r\n PEKEE-NUEE-'
which doesn't look like it would be a problem.
So then I tried to skip the bad lines with a try statement:
file = open('corpora/canon_texts/' + 'test', 'w')
try:
    file.write(raw)
except UnicodeEncodeError:
    pass
file.close()
but this skips the entire text, giving me a file of 0 size.
How can I fix this?
EDIT:
A couple people have noted that '\ufeff' is utf16. I tried switching to utf16:
# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf16
raw = response.read().decode('utf-16')
But I can't even download the data before I get this error:
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 1276798: truncated data
SECOND EDIT:
I also tried decoding with utf-8-sig as suggested in u'\ufeff' in Python string because that includes BOM, but then I'm back to this error when I try to save it:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>
Decoding and re-encoding a file just to save it to disk is pointless. Just write out the bytes you have downloaded, and you will have the file on disk:
raw = response.read()
with open('corpora/canon_texts/' + 'test', 'wb') as outfile:
    outfile.write(raw)
This is the only reliable way to write to disk exactly what you downloaded.
Sooner or later you'll want to read in the file and work with it, so let's consider your error. You didn't provide a full stack trace (always a bad idea), but your error is during encoding, not decoding. The decoding step succeeded. The error must be arising on the line file.write(raw), which is where the text gets encoded for saving. But to what encoding is it being converted? Nobody knows, because you opened file without specifying an encoding! The encoding you're getting depends on your location, OS, and probably the tides and weather forecast. In short: Specify the encoding.
text = response.read().decode('utf8')
with open('corpora/canon_texts/' + 'test', 'w', encoding="utf-8") as outfile:
    outfile.write(text)
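If you're curious which encoding you were getting implicitly, it is the locale's preferred encoding, which you can print; a quick check:
import locale

# This is the encoding open() falls back to when you don't pass encoding=
print(locale.getpreferredencoding(False))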
U+FEFF is the byte order mark, most often associated with UTF-16. Try decoding as UTF-16 instead.
.decode(encoding="utf-8", errors="strict") offers error handling as a built-in feature:
The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers.
Probably the safest option is
decode("utf8", errors='backslashreplace')
which will escape encoding errors with a backslash, so you have a record of what failed to decode.
Conveniently, your Moby Dick text contains no backslashes, so it will be quite easy to check what characters are failing to decode.
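For example, you could decode with backslashreplace and then search for the escape sequences it produces to see exactly where decoding failed; a rough sketch using the URL from the question:
import re
from urllib import request

response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
raw = response.read().decode('utf8', errors='backslashreplace')

# Undecodable bytes are now literal \xNN escapes; print each one with some context.
# If nothing prints, the download decoded cleanly as UTF-8.
for match in re.finditer(r'\\x[0-9a-f]{2}', raw):
    print(match.group(), repr(raw[max(match.start() - 20, 0):match.end() + 20]))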
What is strange about this text is that the website says it is UTF-8, but \ufeff is the BOM for UTF-16. Decoding as UTF-16, it looks like you're just having trouble with the very last byte, 0x0a (a stray line feed), which can probably safely be dropped with
decode("utf-16", errors='ignore')

UnicodeDecodeError: 'utf8' codec can't decode byte 0xea [duplicate]

This question already has answers here: How to determine the encoding of text
I have a CSV file that I'm uploading via an HTML form to a Python API
The API looks like this:
@app.route('/add_candidates_to_db', methods=['GET', 'POST'])
def add_candidates():
    file = request.files['csv_file']
    x = io.StringIO(file.read().decode('UTF8'), newline=None)
    csv_input = csv.reader(x)
    for row in csv_input:
        print(row)
I found the part of the file that causes the issue. In my file it has an Í character.
I get this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 1317: invalid continuation byte
I thought I was decoding it with .decode('UTF8') or is the error happening before that with file.read()?
How do I fix this?
Edit: I have control of the file. I am creating the CSV file myself by pulling data (sometimes this data has strange characters).
On the server side, I'm reading each row in the file and inserting it into a database.
Your data is not UTF-8; it contains errors. You say that you are generating the data, so the ideal solution is to generate better data.
Unfortunately, sometimes we are unable to get high-quality data, or we have servers that give us garbage and we have to sort it out. For these situations, we can use less strict error handling when decoding text.
Instead of:
file.read().decode('UTF8')
You can use:
file.read().decode('UTF8', 'replace')
This will make it so that any “garbage” characters (anything which is not correctly encoded as UTF-8) will get replaced with U+FFFD, which looks like this:
�
You say that your file has the Í character, but you are probably viewing the file using an encoding other than UTF-8. Is your file supposed to contain Í, or is it just mojibake? Maybe you can figure out what the character is supposed to be, and from that, you can figure out what encoding your data uses if it's not UTF-8.
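To see the effect of the 'replace' handler in isolation, you can decode a small byte string containing a stray 0xea byte; a tiny sketch with made-up data:
# Hypothetical row containing the raw byte 0xea, which is not valid UTF-8 in this position
sample = b'name,city\nJos\xea,Miami\n'

try:
    sample.decode('UTF8')                    # strict decoding raises UnicodeDecodeError
except UnicodeDecodeError as exc:
    print(exc)

print(sample.decode('UTF8', 'replace'))      # the bad byte becomes U+FFFD: Jos�,Miami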
It seems that your file is not encoded in utf8. You can try reading the file with all the encodings that Python understands and check which ones let you read the entire content of the file. Try this script:
from codecs import open
encodings = [
"ascii",
"big5",
"big5hkscs",
"cp037",
"cp424",
"cp437",
"cp500",
"cp720",
"cp737",
"cp775",
"cp850",
"cp852",
"cp855",
"cp856",
"cp857",
"cp858",
"cp860",
"cp861",
"cp862",
"cp863",
"cp864",
"cp865",
"cp866",
"cp869",
"cp874",
"cp875",
"cp932",
"cp949",
"cp950",
"cp1006",
"cp1026",
"cp1140",
"cp1250",
"cp1251",
"cp1252",
"cp1253",
"cp1254",
"cp1255",
"cp1256",
"cp1257",
"cp1258",
"euc_jp",
"euc_jis_2004",
"euc_jisx0213",
"euc_kr",
"gb2312",
"gbk",
"gb18030",
"hz",
"iso2022_jp",
"iso2022_jp_1",
"iso2022_jp_2",
"iso2022_jp_2004",
"iso2022_jp_3",
"iso2022_jp_ext",
"iso2022_kr",
"latin_1",
"iso8859_2",
"iso8859_3",
"iso8859_4",
"iso8859_5",
"iso8859_6",
"iso8859_7",
"iso8859_8",
"iso8859_9",
"iso8859_10",
"iso8859_13",
"iso8859_14",
"iso8859_15",
"iso8859_16",
"johab",
"koi8_r",
"koi8_u",
"mac_cyrillic",
"mac_greek",
"mac_iceland",
"mac_latin2",
"mac_roman",
"mac_turkish",
"ptcp154",
"shift_jis",
"shift_jis_2004",
"shift_jisx0213",
"utf_32",
"utf_32_be",
"utf_32_le",
"utf_16",
"utf_16_be",
"utf_16_le",
"utf_7",
"utf_8",
"utf_8_sig",
]
for encoding in encodings:
    try:
        with open(file, encoding=encoding) as f:
            f.read()
        print('Seemingly working encoding: {}'.format(encoding))
    except:
        pass
where file is the path of the file you want to test.

How to open an ASCII text file gracefully

I'm confused about how to open this file in Python (I'm using Python 3.4).
It's a log file (a huge file that may be appended to at any time), so converting it in place with iconv is not an option.
Info 1: the file command says it is ASCII text:
demo git:master ❯ file 1.log
1.log: ASCII text, with very long lines
Info 2: IPython opens it with a default encoding of 'UTF-8':
In [1]: f = open('1.log')
In [2]: f.encoding
Out[2]: 'UTF-8'
THEN
First, when I open('1.log', encoding='utf-8', mode='r'):
ERROR: 'utf-8' codec can't decode byte 0xb1 in position 6435: invalid start byte
Second, when I open('1.log', encoding='ascii', mode='r'):
ERROR: 'ascii' codec can't decode byte 0xe9 in position 6633: ordinal not in range(128)
How can I gracefully handle this file with every line read?
Here is my demo on GitHub.
I tried a few different combinations of encodings and I was able to get all the way through the log file by simply changing the encoding in your script to latin1, so the line open('1.log', encoding='utf-8', mode='r') becomes open('1.log', encoding='latin1', mode='r').
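Since the question asks about handling every line read, here is a minimal sketch of that loop, assuming latin1 is acceptable for this log (handle_line is a placeholder for whatever you do with each line):
# latin1 defines all 256 byte values, so this loop will not raise UnicodeDecodeError
with open('1.log', encoding='latin1', mode='r') as log:
    for line in log:
        handle_line(line)  # hypothetical per-line processing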
It's probably Windows CP 1252 or Latin 1. Try opening it with:
open('1.log', 'rU', encoding='latin-1')
Looks like it's not an ASCII file; the file command's encoding guess is often inaccurate. Try chardet, which will detect the encoding for you.
Then
import chardet
with open('1.log', 'rb') as filepointer:  # chardet needs raw bytes, so open in binary mode
    charset_detected = chardet.detect(filepointer.read())
Keep in mind that this can take a very very long time. Before you try that I recommend you manually cycle through the obvious encodings first.
Try UTF-16 and UTF-32, then the Windows encodings. The codecs module documentation has a list of the standard encodings.

Some readable content, but impossible to JSON dump to file

This text file (30 bytes only; the content is '(Ne pas r\xe9pondre a ce message)') can be opened and inserted into a dict successfully:
import json
d = {}
with open('temp.txt', 'r') as f:
    d['blah'] = f.read()
with open('test.txt', 'w') as f:
    data = json.dumps(d)
    f.write(data)
But it is impossible to dump the dict into a JSON file (see traceback below). Why?
I tried lots of solutions provided by various SO questions. The closest solution I could get was this answer. When using this, I can dump to file, but then the JSON file looks like this:
# test.txt
{"blah": "(Ne pas r\u00e9pondre a ce message)"}
instead of
# test.txt
{"blah": "(Ne pas répondre a ce message)"}
Traceback:
File "C:\Python27\lib\json\encoder.py", line 270, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 9: invalid continuation byte
[Finished in 0.1s with exit code 1]
Your file is not UTF-8 encoded. It uses a Latin codec, like ISO-8859-1 or Windows Codepage 1252. Reading the file gives you the encoded text.
JSON however requires Unicode text. Python 2 has a separate Unicode type, and byte strings (type str) need to be decoded using a suitable codec. The json.dumps() function uses UTF-8 by default; UTF-8 is a widely used codec for encoding Unicode text data that can handle all codepoints in the standard, and it is also the default codec for JSON strings (JSON requires documents to be encoded in one of the three UTF codecs).
You need to either decode the string manually or tell json.dumps() what codec to use for the byte string:
data = json.dumps(d, encoding='latin1') # applies to all bytestrings in d
or
d['blah'] = d['blah'].decode('latin1')
data = json.dumps(d)
or using io.open() to decode as you read:
import io
with io.open('temp.txt', 'r', encoding='latin1') as f:
    d['blah'] = f.read()
By default, the json library produces ASCII-safe JSON output by using the \uhhhh escape syntax the JSON standard allows for. This is entirely normal, the output is valid JSON and readable by any compliant JSON decoder.
If you must produce UTF-8 encoded output without the \uhhhh escape sequences, see Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
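In short, that amounts to passing ensure_ascii=False and writing the result out as UTF-8 yourself; a sketch for this Python 2 example, reusing the file names from the question:
import io
import json

with io.open('temp.txt', 'r', encoding='latin1') as f:
    d = {'blah': f.read()}               # d['blah'] is now a unicode object

with io.open('test.txt', 'w', encoding='utf8') as f:
    # ensure_ascii=False keeps the accented characters instead of \u00e9 escapes
    f.write(json.dumps(d, ensure_ascii=False))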
