Looping over .csv file and removing any non-ascii strings - python

I have a .csv file that contains a large number of emails, each on a separate line. I am trying to remove any emails that contain non-ASCII characters. This is what I'm trying:
def is_ascii(s):
    return all(ord(c) < 128 for c in s)

if __name__ == "__main__":
    with open('emails.csv') as csv_file:
        for line in csv_file:
            if(is_ascii(line)):
                with open('result.csv', 'a') as output_file:
                    output_file.write(line)
It keeps giving me an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 5012: invalid start byte

The problem is that you have no idea what the encodings are for non-ASCII emails, so you just want to skip them.
But your code is trying to decode them with your default encoding, and then deciding whether to skip them. That's what it means to open a file in text mode, like this:
with open('emails.csv') as csv_file:
    for line in csv_file:
Since that default encoding is UTF-8, as soon as you run into something encoded in some other charset that isn't UTF-8 compatible, you get an error.
The easiest way to change this is to instead open the file in binary mode. Then you can decode only the lines you've decided to keep:
with open('emails.csv', 'rb') as csv_file:
    for line in csv_file:
        if(is_ascii(line)):
            line = line.decode('ascii')
            with open('result.csv', 'a') as output_file:
                output_file.write(line)
… or just stay with bytes the whole way by also opening the output file in binary mode:
with open('emails.csv', 'rb') as csv_file:
    for line in csv_file:
        if(is_ascii(line)):
            with open('result.csv', 'ab') as output_file:
                output_file.write(line)
Either way, you will have to change your is_ascii function, because a bytes object is a sequence of integers from 0 to 255 rather than a sequence of characters, so you can't (and don't need to) call ord:
def is_ascii(s):
    return all(c < 128 for c in s)
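As a small aside (an addition, assuming Python 3.7 or newer), you don't even need to write the check yourself, since both bytes and str already provide an isascii() method:

# Side note: on Python 3.7+ this is equivalent to the helper above.
def is_ascii(s):
    return s.isascii()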
There is a potential problem. I think you'll be fine, but you should think it through (and test whatever you need to test). While text-mode file objects automatically handle non-Unix newlines, binary-mode files do not.
If you somehow have classic Mac (pre-OS X) files from the previous century, with \r endings, your code will not work. The \r won't be treated as a newline at all, so the whole file will look like one huge line. If you don't expect to have any such files, I wouldn't worry about it.
But if the only non-Unix files you have are Windows (or DOS), with \r\n, you'll be fine. The \r will get treated as part of the line rather than part of the newline, but that won't matter for your code (ord('\r') < 128, and beyond that, all you're doing is writing the whole line worth of bytes at once), so things will just work.
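If you did need to cope with \r-only files, a minimal sketch (not part of the original answer) is to read the raw bytes once and split on any of the three newline conventions yourself:

import re

# Reads the whole file into memory, then splits on \r\n, \r, or \n.
# Reuses the bytes version of is_ascii from above; output lines are
# normalized to \n endings.
with open('emails.csv', 'rb') as csv_file, open('result.csv', 'ab') as output_file:
    data = csv_file.read()
    for line in re.split(rb'\r\n|\r|\n', data):
        if line and is_ascii(line):
            output_file.write(line + b'\n')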

Related

Ignore UTF-8 decoding errors in CSV

I have a CSV file (which I have no control over). It's the result of concatenating multiple CSV files. Most of the file is UTF-8 but one of the files that went into it had fields that are encoded in what looks like Windows-1251.
I actually only care about one of the fields which contains a URL (so it's valid ASCII/UTF-8).
How do I ignore decoding errors in the other CSV fields if I only care about one field which I know is ASCII? Alternatively, for a more useful solution, how do I change the encoding for each line of a CSV file when there's an encoding error?
csv.reader and csv.DictReader accept any iterable of strings (such as a list of lines) as input, not just file objects.
So, open the file in binary mode (mode="rb"), figure out the encoding of each line, decode the line using that encoding and append it to a list and then call csv.reader on that list.
One simple heuristic is to try to decode each line as UTF-8 and, if you get a UnicodeDecodeError, fall back to decoding it as the other encoding. We can make this more general by using the chardet library (install it with pip install chardet) to guess the encoding of any line that can't be decoded as UTF-8, instead of hardcoding which encoding to fall back on:
import csv
import chardet

my_csv = "some/path/to/your_file.csv"

lines = []
with open(my_csv, "rb") as f:
    for line in f:
        detected_encoding = chardet.detect(line)["encoding"]
        try:
            line = line.decode("utf-8")
        except UnicodeDecodeError:
            line = line.decode(detected_encoding)
        lines.append(line)

reader = csv.DictReader(lines)
for row in reader:
    do_stuff(row)
If you do want to just hardcode the fallback encoding and don't want to use chardet (there's a good reason not to, it's not always accurate), you can just replace the variable detected_encoding with "Windows-1251" or whatever encoding you want in the code above.
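For example, the hardcoded-fallback variant would look roughly like this (a sketch; "windows-1251" is just the encoding suspected in the question):

# Same loop as above, but without chardet: fall back to a fixed encoding.
my_csv = "some/path/to/your_file.csv"

lines = []
with open(my_csv, "rb") as f:
    for line in f:
        try:
            lines.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            lines.append(line.decode("windows-1251"))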
This is of course not perfect, because a line successfully decoding with some encoding doesn't mean it's actually in that encoding. If you don't have to do this more than a few times, it's better to do something like print out each line and its detected encoding and figure out by hand where one encoding starts and the other ends. Ultimately the right strategy here might be to reverse the step that led to the broken input (the concatenation of the files) and then redo it correctly (by normalizing them to the same encoding before concatenating).
In my case, I counted how many lines were detected as which encoding:
import chardet
from collections import Counter

my_csv_file = "some_file.csv"
with open(my_csv_file, "rb") as f:
    encodings = Counter(chardet.detect(line)["encoding"] for line in f)
print(encodings)
and realized that my whole file was actually encoded in some other, third encoding. Running chardet on the whole file detected the wrong encoding, but running it on each line detected a bunch of encodings and the second most common one (after ascii) was the correct encoding I needed to use to read the whole file. So ultimately all I needed was
with open(my_csv, encoding="latin_1") as f:
    reader = csv.DictReader(f)
    for row in reader:
        do_stuff(row)
You could try using the Compact Encoding Detection library instead of chardet. It's what Google Chrome uses so maybe it'll work better, but it's written in C++ instead of Python.

How to open a file with utf-8 non encoded characters?

I want to open a text file (.dat) in python and I get the following error:
'utf-8' codec can't decode byte 0x92 in position 4484: invalid start byte
but the file is supposed to be encoded in UTF-8, so maybe there is some character that cannot be read. Is there a way to handle the problem without hunting down every single weird character? I have a rather huge text file and it would take me hours to find the offending non-UTF-8 character by hand.
Here is my code
import codecs

f = codecs.open('compounds.dat', encoding='utf-8')
for line in f:
    if "InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
        print(line)
f.close()
It shouldn't "take you hours" to find the bad byte. The error tells you exactly where it is; it's at index 4484 in your input with a value of 0x92; if you did:
with open('compounds.dat', 'rb') as f:
    data = f.read()
the invalid byte would be at data[4484], and you can slice as you like to figure out what's around it.
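For instance, something like this (a sketch; the offsets come straight from the error message):

# Continuing from the snippet above: peek at the offending byte and its
# surroundings.
print(hex(data[4484]))                                    # e.g. 0x92
print(data[4470:4500])                                    # raw bytes around it
print(data[4470:4500].decode('utf-8', errors='replace'))  # readable context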
In any event, if you just want to ignore or replace invalid bytes, that's what the errors parameter is for. Using io.open (because codecs.open is subtly broken in many ways, and io.open is both faster and more correct):
# If this is Py3, you don't even need the import, just use plain open, which is
# an alias for io.open
import io

with io.open('compounds.dat', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if u"InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
            print(line)
will just ignore the invalid bytes (dropping them as if they never existed). You can also pass errors='replace' to insert a replacement character for each garbage byte, so you're not silently dropping data.
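The 'replace' variant of the same loop looks like this (a sketch):

import io

# Undecodable bytes show up as U+FFFD instead of being silently dropped.
with io.open('compounds.dat', encoding='utf-8', errors='replace') as f:
    for line in f:
        if u"InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
            print(line)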
If you are working with huge data, it's better to open the file with the default encoding first, and if the error persists, add errors="ignore" as well:
with open("filename", 'r', encoding="utf-8", errors="ignore") as f:
    f.read()

Python - Unicode file IO

I have a one-line txt file with a bunch of unicode characters and no spaces,
for example:
🅾🆖🆕Ⓜ🆙🆚🈁🈂
And I want to output a txt file with one character on each line
When I try to do this I think I end up splitting the unicode characters. How can I go about this?
There is no such thing as a text file with a bunch of unicode characters; it only makes sense to speak about a "unicode object" once the file has been read and decoded into Python objects. The data in the text file is encoded, one way or another.
So, the problem is about reading the file in the correct way in order to decode the characters to unicode objects correctly.
import io

enc_source = enc_target = 'utf-8'

with io.open('my_file.txt', encoding=enc_source) as f:
    the_line = f.read().strip()

with io.open('output.txt', mode='w', encoding=enc_target) as f:
    f.writelines([c + '\n' for c in the_line])
Above I am assuming the target and source file encodings are both utf-8. This is not necessarily the case, and you should know what the source file is encoded with. You get to choose enc_target, but somebody has to tell you enc_source (the file itself can't tell you).
This works in Python 3.5
line = "πŸ˜€πŸ‘"
with open("file.txt", "w", encoding="utf8") as f:
f.write("\n".join(line))

Handle erroneously ASCII-encoded non-breaking space in UTF-8 file?

I have a Python 2.7 script which imports data from CSV files exported from various other sources.
As part of the import process I have a small function that establishes the correct character encoding for the file. I then open the file and loop the lines using:
with io.open(filename, "r", encoding=file_encoding) as input_file:
    for raw_line in input_file:
        cleaned_line = raw_line.replace('\x00', '').replace(u"\ufeff", "").encode('utf-8')
        # do stuff
The files from this source usually come as UTF-8 (with BOM) and I detect the encoding 'utf-8-sig' and use that to open the file.
The problem I am having is that one of my data sources returns a file that seems to have an encoding error. The rest of the file (about 27k lines of CSV data) is all correct, as usual, but one line fails.
The line in question fails with this error (at the for raw_line in input_file line):
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1709: invalid start byte
The line has several non-breaking space characters that are encoded with a single byte of value 0xA0 rather than the two bytes 0xC2 0xA0.
I am already doing some light cleaning on a line by line basis for other problems as you can see on my "cleaned_line" line at the top of the loop (I dislike doing this per line but with the files I get I haven't found a way to do it better). However, the code fails before I ever even get there.
Is there a correct/nice way to handle this particular issue? I thought I'd nailed the whole encoding issue until this.
You can tell Python to ignore decoding errors, or to replace the faulty bytes with a placeholder character.
Set errors to 'ignore' to ignore the A0 bytes:
with io.open(filename, "r", encoding=file_encoding, errors='ignore') as input_file:
or to 'replace' to replace them with the U+FFFD REPLACEMENT CHARACTER:
with io.open(filename, "r", encoding=file_encoding, errors='replace') as input_file:
UTF-8 is a self-correcting encoding; a byte is either part of a multi-byte code point or decodable directly as ASCII, so ignoring un-decodable bytes is relatively safe.
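If you'd rather keep the non-breaking spaces than drop or replace them, one alternative (not from the answer above, just a sketch using a custom codecs error handler, reusing filename and file_encoding from the question) is to map each stray 0xA0 byte back to U+00A0 while decoding:

import codecs
import io

def fix_bare_a0(err):
    # Map a lone 0xA0 byte to a real non-breaking space; anything else
    # becomes the usual replacement character.
    bad = err.object[err.start:err.end]
    return (u"\u00a0" if bad == b"\xa0" else u"\ufffd"), err.end

codecs.register_error("fix_bare_a0", fix_bare_a0)

with io.open(filename, "r", encoding=file_encoding, errors="fix_bare_a0") as input_file:
    for raw_line in input_file:
        pass  # same per-line cleaning as in the question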
You can also use the 'translit/long' codec to transliterate non-ASCII characters into ASCII approximations; you need to import translitcodec first so that the codec gets registered.
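Roughly, usage looks like this (a sketch, assuming translitcodec is installed with pip install translitcodec and behaves as its documentation describes):

import codecs
import translitcodec  # imported for its side effect of registering the 'translit/...' codecs

# Transliterate non-ASCII characters to ASCII approximations.
print(codecs.encode(u'déjà vu naïve café', 'translit/long'))
# expected to print something like: deja vu naive cafe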

UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
  File "SCRIPT LOCATION", line NUMBER, in <module>
    text = file.read()
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
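If you're not sure which encoding to try, a quick check (a sketch, not part of the original answer) is to attempt a few common candidates on the raw bytes; a clean decode doesn't prove the encoding is right, it only narrows the field:

# Try a handful of common encodings on the raw bytes and report which
# ones decode without error (latin-1 and cp437 never fail, so treat
# their "success" with suspicion).
candidates = ["utf-8", "cp1252", "latin-1", "cp437"]

with open(filename, "rb") as f:
    raw = f.read()

for enc in candidates:
    try:
        raw.decode(enc)
        print(enc, "decodes cleanly")
    except UnicodeDecodeError as err:
        print(enc, "fails:", err)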
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore") if you just want to strip the offending characters. (docs)
Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary
As an extension to #LennartRegebro's answer:
If you can't tell what encoding your file uses, the solution above does not work (it's not utf8), and you find yourself merely guessing, there are online tools you can use to identify the encoding. They aren't perfect, but they usually work just fine. After you figure out the encoding, you should be able to use the solution above.
EDIT: (Copied from comment)
A quite popular text editor Sublime Text has a command to display encoding if it has been set...
Go to View -> Show Console (or Ctrl+`)
Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...)
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it into its own internal Unicode text type. If the file contains byte values not defined in this codepage (like 0x90), we get a UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may be unhandled by Python (like e.g. cp790), and sometimes the file contains mixed encodings.
If such characters are unneeded, one may decide to replace them by question marks, with:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The offending characters are then dropped entirely, but other errors will be masked too.
A very good solution is to specify the encoding, yet not just any encoding (like cp1252), but one which has ALL 256 byte values defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).
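To illustrate that point (a sketch, not part of the original answer): because every byte value is defined in cp437, decoding never raises and the original bytes survive a round trip:

# cp437 (like latin-1) maps every possible byte value to some character.
with open(filename, "rb") as f:
    raw = f.read()

text = raw.decode("cp437")            # never raises
assert text.encode("cp437") == raw    # lossless round trip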
Stop wasting your time, just add the following encoding="cp437" and errors='ignore' to your code in both read and write:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
For me, encoding with utf16 worked:
file = open('filename.csv', encoding="utf16")
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text
or (and), for writing:
def write_files(text, file_path):
    with open(file_path, 'wb') as f:    # 'wb', since we are writing encoded bytes
        f.write(text.encode('utf8', 'ignore'))
For those working in Anaconda on Windows, I had the same problem. Notepad++ helped me to solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". In "Encoding" go to "Character sets" and there, with patience, look for the encoding that you need. In my case the encoding "Windows-1252" was found under "Western European".
Before you apply the suggested solution, you can check which Unicode character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching 0x0090)
and then consider removing it from the file.
In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations and, in the Configuration tab, set the Interpreter options field to -Xutf8.
Or, equivalently, you can just set the environmental variable PYTHONUTF8 to 1.
For me, changing the MySQL character encoding to match my code helped to sort out the solution: photo = open('pic3.png', encoding='latin1')
