Ignore UTF-8 decoding errors in CSV - python

I have a CSV file (which I have no control over). It's the result of concatenating multiple CSV files. Most of the file is UTF-8 but one of the files that went into it had fields that are encoded in what looks like Windows-1251.
I actually only care about one of the fields which contains a URL (so it's valid ASCII/UTF-8).
How do I ignore decoding errors in the other CSV fields if I only care about one field which I know is ASCII? Alternatively, for a more useful solution, how do I change the encoding of each line of a CSV file when there's an encoding error?

csv.reader and csv.DictReader accept any iterable of strings (such as a list of lines) as input, not just file objects.
So, open the file in binary mode (mode="rb"), figure out the encoding of each line, decode the line using that encoding, append it to a list, and then call csv.reader on that list.
One simple heuristic is to try to read each line as UTF-8 and if you get a UnicodeDecodeError, try decoding it as the other encoding. We can make this more general by using the chardet library (install it with pip install chardet) to guess the encoding of each line if you can't decode it as UTF-8, instead of hardcoding which encoding to fall back on:
import csv
import chardet

my_csv = "some/path/to/your_file.csv"

lines = []
with open(my_csv, "rb") as f:
    for line in f:
        try:
            line = line.decode("utf-8")
        except UnicodeDecodeError:
            # Fall back to whatever chardet guesses for this particular line
            detected_encoding = chardet.detect(line)["encoding"]
            line = line.decode(detected_encoding)
        lines.append(line)

reader = csv.DictReader(lines)
for row in reader:
    do_stuff(row)
If you do want to hardcode the fallback encoding and don't want to use chardet (there's a good reason not to: it isn't always accurate), you can simply replace the variable detected_encoding with "Windows-1251" or whatever encoding you want in the code above.
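For example, a minimal sketch of that hardcoded variant (using my_csv from the code above; Windows-1251 as the fallback is an assumption you'd adjust to your data):
import csv

lines = []
with open(my_csv, "rb") as f:
    for line in f:
        try:
            lines.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            # Assumption: the lines that aren't UTF-8 are Windows-1251.
            lines.append(line.decode("windows-1251"))

reader = csv.DictReader(lines)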
This is of course not perfect: the fact that a line successfully decodes with some encoding doesn't mean it was actually written in that encoding. If you don't have to do this more than a few times, it's better to print out each line and its detected encoding (see the sketch below) and figure out by hand where one encoding starts and the other ends. Ultimately the right strategy might be to reverse the step that led to the broken input (the concatenation of the files) and then redo it correctly, normalizing the files to the same encoding before concatenating them.
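A quick way to do that inspection (a rough sketch, assuming chardet is installed and my_csv is the path from above):
import chardet

with open(my_csv, "rb") as f:
    for lineno, raw_line in enumerate(f, start=1):
        # Print each line number together with chardet's guess so you can
        # see where the encoding switches from one to the other.
        print(lineno, chardet.detect(raw_line)["encoding"])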
In my case, I counted how many lines were detected as which encoding
import chardet
from collections import Counter

my_csv_file = "some_file.csv"
with open(my_csv_file, "rb") as f:
    encodings = Counter(chardet.detect(line)["encoding"] for line in f)
print(encodings)
and realized that my whole file was actually encoded in some other, third encoding. Running chardet on the whole file detected the wrong encoding, but running it on each line detected a bunch of encodings and the second most common one (after ascii) was the correct encoding I needed to use to read the whole file. So ultimately all I needed was
with open(my_csv, encoding="latin_1") as f:
    reader = csv.DictReader(f)
    for row in reader:
        do_stuff(row)
You could try using the Compact Encoding Detection library instead of chardet. It's what Google Chrome uses so maybe it'll work better, but it's written in C++ instead of Python.

Related

Converting a Massive Unicode Text File to ASCII

I've received several text files, where each file contains thousands of lines of text. Because the files use Unicode encoding, each file ends up being around 1GB. I know this might sound borderline ridiculous, but it unfortunately is the reality:
I'm using Python 2.7 on a Windows 7 machine. I've only started using Python but figured this would be a good chance to really start using the language. You've gotta use it to learn it, right?
What I'm hoping to do is make a copy of all of these massive files. The new copies would use ASCII character encoding and would ideally be significantly smaller in size. I know that changing the character encoding is a solution because I've had success opening a file in MS WordPad and saving it as a regular text file.
Using WordPad is a manual and slow process: I need to open the file, which takes forever because it's so big, and then save it as a new file, which also takes forever since it's so big. I'd really like to automate this by having a script run in the background while I work on other things. I've written a bit of Python to do this, but it's not working correctly. What I've done so far is the following:
import io
import os

def convertToAscii():
    # Getting a list of the current files in the directory
    cwd = os.getcwd()
    current_files = os.listdir(cwd)

    # I don't want to mess with all of the files, so I'll just pick the second one since the first file is the script itself
    test_file = current_files[1]

    # Determining a new name for the ASCII-encoded file
    file_name_length = len(test_file)
    ascii_file_name = test_file[:file_name_length - 3 - 1] + "_ASCII" + test_file[file_name_length - 3 - 1:]

    # Then we open the new blank file
    the_file = open(ascii_file_name, 'w')

    # Finally, we open our original file for testing...
    with io.open(test_file, encoding='utf8') as f:
        # ...read it line by line
        for line in f:
            # ...encode each line into ASCII
            line.encode("ascii")
            # ...and then write the ASCII line to the new file
            the_file.write(line)

    # Finally, we close the new file
    the_file.close()

convertToAscii()
And I end up with the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
But that doesn't make any sense.... The first line within all of the text files is either a blank line or a series of equal signs, such as ===========.
I was wondering if someone would be able to put me onto the right path for this. I understand that doing this operation can take a very long time since I'm essentially reading each file line by line and then encoding the string into ASCII. What must I do in order to get around my current issue? And is there a more efficient way to do this?
For characters that exist in ASCII, UTF-8 already uses a single byte. Opening a UTF-8 file that contains only single-byte characters and then saving it as ASCII should be a no-op.
For there to be any size difference, your files would have to be in some wider encoding of Unicode, like UTF-16 / UCS-2. That would also explain the utf8 codec complaining about unexpected bytes in the source file.
Find out what encoding your files actually use, then save them with the utf8 codec. That way your files will be just as small as ASCII for single-byte characters, but if your source files happen to contain any multibyte characters, the result file will still be able to encode them and you won't be doing a lossy conversion.
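One hedged way to check is to peek at the first bytes of the file for a byte order mark before converting; the 0xff at position 0 in your error message is exactly how a UTF-16 little-endian BOM starts. A minimal Python 3 sketch (the file names here are made up):
import codecs

def guess_unicode_flavour(path):
    # Peek at the first few raw bytes and look for a byte order mark.
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'        # the utf-16 codec consumes the BOM for us
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    return 'utf-8'

src = 'massive_input.txt'
dst = 'massive_input_ASCII.txt'
with open(src, encoding=guess_unicode_flavour(src)) as infile, \
        open(dst, 'w', encoding='utf8') as outfile:
    for line in infile:
        outfile.write(line)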
There's a potential speedup if you avoid splitting the file into lines, since the only thing that you're doing is joining the lines back together. This allows you to process the input in larger blocks.
Using the shutil.copyfileobj function (which is just read and write in a loop):
import shutil

with open('input.txt', encoding='u16') as infile, \
        open('output.txt', 'w', encoding='u8') as outfile:
    shutil.copyfileobj(infile, outfile)
(This is Python 3, passing the encoding argument directly to open; in Python 2 the same thing works with io.open.)

UnicodeEncodeError no matter what encoding I try [duplicate]

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
  File "SCRIPT LOCATION", line NUMBER, in <module>
    text = file.read()
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore") if you just want to strip the unneeded characters. (docs)
Alternatively, if you don't need to decode the file at all, for example if you're just uploading it to a website, use:
open(filename, 'rb')
where r = reading, b = binary
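A rough sketch of that "never decode" case, using the third-party requests library purely as an illustration (the URL and file name are made up, not part of the original answer):
import requests

filename = "report.txt"
with open(filename, 'rb') as f:
    # The raw bytes go over the wire untouched; nothing is ever decoded.
    requests.post("https://example.com/upload", files={"upload": f})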
As an extension to @LennartRegebro's answer:
If you can't tell what encoding your file uses, the solution above does not work (it's not utf8), and you find yourself merely guessing, there are online tools you can use to identify the encoding. They aren't perfect, but they usually work just fine. Once you figure out the encoding, you should be able to use the solution above.
EDIT: (Copied from comment)
The quite popular text editor Sublime Text has a command to display the encoding if it has been set...
Go to View -> Show Console (or Ctrl+`)
Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...)
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it into its internal Unicode representation. If the file contains byte values not defined in this codepage (like 0x90), we get a UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding is not handled by Python (like e.g. cp790), and sometimes the file contains mixed encodings.
If such characters are unneeded, one may decide to replace them with the U+FFFD replacement character, using:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The offending bytes are then simply dropped, but other errors will be masked too.
A very good solution is to specify an encoding, yet not just any encoding (like cp1252), but one which has ALL byte values defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All 256 byte values are defined, so there are no errors while reading the file, no errors are masked out, and the characters are preserved (not quite left intact, but still distinguishable).
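A tiny demonstration of why cp437 never raises (an illustration added here, not part of the original answer):
# Every one of the 256 possible byte values maps to some character in cp437,
# so decoding can never fail, and encoding back returns the original bytes.
raw = bytes(range(256))
text = raw.decode('cp437')           # never raises
assert text.encode('cp437') == raw   # lossless round trip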
Stop wasting your time: just add encoding="cp437" and errors='ignore' to your open() calls for both reading and writing:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
For me, opening with the utf16 encoding worked:
file = open('filename.csv', encoding="utf16")
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text

OR (AND)

def write_files(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))
For those working in Anaconda on Windows, I had the same problem. Notepad++ helped me solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". Under "Encoding", go to "Character sets" and patiently look for the encoding that you need. In my case the encoding "Windows-1252" was found under "Western European".
Before you apply the suggested solution, you can check which Unicode character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090), and then consider removing it from the file.
In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations (in the Configuration tab, change the value of the Interpreter options field to -Xutf8).
Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
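To verify that UTF-8 mode actually took effect, a small check you could run (Python 3.7+):
import locale
import sys

# sys.flags.utf8_mode is 1 when -Xutf8 or PYTHONUTF8=1 is in effect,
# and the preferred encoding used by open() then defaults to UTF-8.
print(sys.flags.utf8_mode)
print(locale.getpreferredencoding())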
For me, changing the MySQL character encoding to match my code helped sort things out, e.g. photo = open('pic3.png', encoding='latin1').

Python 3.5.1 mixed line code file UTF-8 and UTF-16

I have successfully been parsing data files that I receive with a simple Python script I wrote. The files I get are like this:
file.txt, ~50 columns of data, x 1000s of rows
abcd1,1234a,efgh1,5678a,ijkl1 ...etc
abcd2,1234b,efgh2,5678b,ijkl2 ...etc
...
Unfortunately, sometimes some of the lines contain UTF-16 symbols and look like this:
abcd1,12341,efgh1,UTF-16 symbols here,ijkl1 ...etc
abcd2,1234b,efgh2,5678b,ijkl2 ...etc
...
I have been able to implement the "latin-1" coding for commands in my script like:
open('file fixed.txt', 'w', encoding="latin-1").writelines([line for line in open('file.txt', 'r', encoding="latin-1")])
My problem lies in code such as:
for line in fileinput.FileInput('file fixed.txt', inplace=1):
    line = line.replace(":", ",")
    print(line, ",")
I am unable to get past the encoding errors for the last command. I have tried enforcing the coding of:
# -*- coding: latin-1 -*-
At the top of the document as well as before the last-mentioned command (find and replace). How can I get these mixed-encoding files to work with the above command? I would like to preserve the UTF-16 (Unicode) symbols as they appear in the new file. Thanks in advance.
EDIT: Thanks to Alexis I was able to determine that fileinput would not work for setting another encoding method. I used the below to resolve my issue:
f = open(filein,'r', encoding="latin-1")
filedata = f.read()
f.close()
newdata = filedata.replace("old data","new data")
f = open(fileout,'w', encoding="latin-1")
f.write(newdata)
f.close()
You can tell fileinput how to open your files. As the documentation says:
You can control how files are opened by providing an opening hook via the openhook parameter to fileinput.input() or FileInput(). The hook must be a function that takes two arguments, filename and mode, and returns an accordingly opened file-like object. Two useful hooks are already provided by this module.
So you'd do it like this:
def open_utf16(name, m):
    return open(name, m, encoding="utf-16")

for line in fileinput.FileInput("file fixed.txt", openhook=open_utf16):
    ...
I use "utf-16" as the encoding since this is your file's encoding, not "latin-1". 8-bit encodings don't have error checking so Latin1 will read the bytes without noticing there's anything wrong, but you're likely to have problems down the line. If this gives you errors, your file is not in utf-16.
If your file has mixed encoding, you need to read it as binary and then decode different parts as necessary, or just process the whole thing as binary instead. The latin-1 solution in the question works by accident really.
In your example that would be something like:
with open('the/path', 'rb') as fi:
    data = fi.read().replace(b'old data', b'new data')

with open('other/path', 'wb') as fo:
    fo.write(data)
This is the closest to what you ask for: as far as I understand, you don't even care about the field with the potentially different encoding; you just want to change some content and copy the rest of the file as is. Binary mode allows you to do that.

Handle erroneously ASCII-encoded non-breaking space in UTF-8 file?

I have a Python 2.7 script which imports data from CSV files exported from various other sources.
As part of the import process I have a small function that establishes the correct character encoding for the file. I then open the file and loop the lines using:
with io.open(filename, "r", encoding=file_encoding) as input_file:
    for raw_line in input_file:
        cleaned_line = raw_line.replace('\x00', '').replace(u"\ufeff", "").encode('utf-8')
        # do stuff
The files from this source usually come as UTF-8 (with BOM) and I detect the encoding 'utf-8-sig' and use that to open the file.
The problem I am having is that one of my data sources returns a file that seems to have an encoding error. The rest of the file (about 27k lines of CSV data) are all correct, as usual, but one line fails.
The line in question fails with this error (at the for raw_line in input_file line):
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1709: invalid start byte
The line has several non-breaking space characters that are encoded as a single byte with the value 'A0' rather than the two bytes 'C2 A0'.
I am already doing some light cleaning on a line by line basis for other problems as you can see on my "cleaned_line" line at the top of the loop (I dislike doing this per line but with the files I get I haven't found a way to do it better). However, the code fails before I ever even get there.
Is there a correct/nice way to handle this particular issue? I thought I'd nailed the whole encoding issue until this.
You can tell Python to ignore decoding errors, or to replace the faulty bytes with a placeholder character.
Set errors to 'ignore' to ignore the A0 bytes:
with io.open(filename, "r", encoding=file_encoding, errors='ignore') as input_file:
or to 'replace' to replace them with the U+FFFD REPLACEMENT CHARACTER:
with io.open(filename, "r", encoding=file_encoding, errors='replace') as input_file:
UTF-8 is a self-synchronizing encoding: lead bytes and continuation bytes occupy distinct ranges, so the decoder can pick up again at the next valid character, and ignoring un-decodable bytes is relatively safe.
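If you would rather keep the non-breaking spaces than drop them, an alternative sketch (not part of the answer above) is to repair the raw bytes before decoding, since the question says the only corruption is a bare 0xA0 where UTF-8 needs 0xC2 0xA0:
import io
import re

def repaired_lines(filename, file_encoding='utf-8-sig'):
    with io.open(filename, "rb") as raw_file:
        for raw_line in raw_file:
            # Insert the missing 0xC2 lead byte in front of any bare 0xA0.
            # Assumption: 0xA0 never occurs as the continuation byte of some
            # other multi-byte sequence in this data.
            fixed = re.sub(b'(?<!\xc2)\xa0', b'\xc2\xa0', raw_line)
            yield fixed.decode(file_encoding)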
You can also use the translitcodec package (import translitcodec first) and encode with its 'translit/long' codec to transliterate characters to plain ASCII approximations.

Encoding issue when writing to text file, with Python

I'm writing a program to 'manually' rearrange a CSV file into proper JSON syntax, using a short Python script. From the input file I use readlines() to format the file as a list of rows, which I manipulate and concatenate into a single string, which is then output to a separate .txt file. The output, however, contains gibberish instead of the Hebrew characters that were present in the input file, and the output is double-spaced horizontally (a whitespace character is added between each character). As far as I can understand, the problem has to do with the encoding, but I haven't been able to figure out what it is. When I detect the encoding of the input and output files (using the .encoding attribute), they both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.
While there are a number of questions out there on this topic, I didn't find a direct answer to my problem.
Detecting the system defaults won't help me in this case, because I need the program to be portable.
Here's the code:
def txt_to_JSON(csv_list):
    ...some manipulation of the list...
    return JSON_string

file_name = "input_file.txt"
my_file = open(file_name)

# make each line of input file a value in a list
lines = my_file.readlines()

# break up each line into a list such that each 'column' is a value in that list
for i in range(0, len(lines)):
    lines[i] = lines[i].split("\t")

J_string = txt_to_JSON(lines)

json_file = open("output_file.txt", "w+")
json_file.write(J_string)
json_file.close()
All data needs to be encoded to be stored on disk. If you don't know the encoding, the best you can do is guess. There's a library for that: https://pypi.python.org/pypi/chardet
I highly recommend Ned Batchelder's presentation
http://nedbatchelder.com/text/unipain.html
for details.
There's an explanation about the use of "unicode" as an encoding on windows: What's the difference between Unicode and UTF-8?
TLDR:
Microsoft uses UTF-16 as the encoding for Unicode strings, but decided to call it "unicode" because that's also what they use internally.
Even though Python 2 is somewhat lenient about str/unicode conversions, you should get used to always decoding on input and encoding on output.
In your case
filename = 'where your data lives'
with open(filename, 'rb') as f:
    encoded_data = f.read()

decoded_data = encoded_data.decode("UTF16")

# do stuff, resulting in result (all on unicode strings)
result = text_to_json(decoded_data)

encoded_result = result.encode("UTF-16")  # really, just using UTF8 for everything makes things a lot easier

outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)
You need to tell Python to use the Unicode character encoding to decode the Hebrew characters.
Here's a link to how you can read Unicode characters in Python: Character reading from file in Python
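A minimal Python 2.7 sketch along those lines, assuming the input really is UTF-16 (which the gibberish plus the extra space between characters suggests); the file names are the ones from the question's code:
import io

with io.open("input_file.txt", encoding="utf-16") as infile, \
        io.open("output_file.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        # line is a unicode string here, so the Hebrew characters survive
        # the round trip intact.
        outfile.write(line)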
