Handle erroneously ASCII-encoded non-breaking space in UTF-8 file?

Handle erroneously ASCII-encoded non-breaking space in UTF-8 file? - python

I have a Python 2.7 script which imports data from CSV files exported from various others sources.
As part of the import process I have a small function that establishes the correct character encoding for the file. I then open the file and loop the lines using:
with io.open(filename, "r", encoding=file_encoding) as input_file:
for raw_line in input_file:
cleaned_line = raw_line.replace('\x00', '').replace(u"\ufeff", "").encode('utf-8')
# do stuff
The files from this source usually come as UTF-8 (with BOM) and I detect the encoding 'utf-8-sig' and use that to open the file.
The problem I am having is that one of my data sources returns a file that seems to have an encoding error. The rest of the file (about 27k lines of CSV data) are all correct, as usual, but one line fails.
The line in question fails with this error (at the for raw_line in input_file line):
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1709: invalid start byte
The line has several non-breaking spaces characters that are encoded as with a single byte with value 'A0' rather than 2 bytes with 'C2 A0'.
I am already doing some light cleaning on a line by line basis for other problems as you can see on my "cleaned_line" line at the top of the loop (I dislike doing this per line but with the files I get I haven't found a way to do it better). However, the code fails before I ever even get there.
Is there a correct/nice way to handle this particular issue? I thought I'd nailed the whole encoding issue until this.

You can tell Python to ignore decoding errors, or to replace the faulty bytes with a placeholder character.
Set errors to 'ignore' to ignore the A0 bytes:
with io.open(filename, "r", encoding=file_encoding, errors='ignore') as input_file:
or to 'replace' to replace them with the U+FFFD REPLACEMENT CHARACTER:
with io.open(filename, "r", encoding=file_encoding, errors='replace') as input_file:
UTF-8 is a self-correcting encoding; bytes are either always part of a multi-byte code point, or can be decoded as ASCII directly, so ignoring un-decodable bytes is relatively safe.

You can do encoding 'translit/long' to normalize utf-8 to table of string charts, need to import translitcodec first.

Related

Looping over .csv file and removing any non-ascii strings

I have a .csv file that contain a large number of emails, each on a separate line. I am trying to remove any emails that contain non-ascii characters. This is what Im trying:
def is_ascii(s):
return all(ord(c) < 128 for c in s)
if __name__ == "__main__":
with open('emails.csv') as csv_file:
for line in csv_file:
if(is_ascii(line)):
with open('result.csv', 'a') as output_file:
output_file.write(line)
It keeps giving me an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 5012: invalid start byte

The problem is that you have no idea what the encodings are for non-ASCII emails, so you just want to skip them.
But your code is trying to decode them with your default encoding, and then deciding whether to skip them. That's what it means to open a file in text mode, like this:
with open('emails.csv') as csv_file:
for line in csv_file:
Since that default encoding is UTF-8, as soon as you run into something encoded in some other charset that isn't UTF-8 compatible, you get an error.
The easiest way to change this it to instead is to open the file in binary mode. Then you can decode only the lines you've decided to keep:
with open('emails.csv', 'rb') as csv_file:
for line in csv_file:
if(is_ascii(line)):
line = line.decode('ascii')
with open('result.csv', 'a') as output_file:
output_file.write(line)
… or just stay with bytes the whole way by also opening the output file in binary mode:
with open('emails.csv', 'rb') as csv_file:
for line in csv_file:
if(is_ascii(line)):
with open('result.csv', 'ab') as output_file:
output_file.write(line)
Either way, you will have to change your isascii function, because a bytes is a sequence of integers from 0-255, rather than a sequence of characters, so you can't (and don't need to) call ord:
def is_ascii(s):
return all(c < 128 for c in s)
There is a potential problem. I think you'll be fine, but you should think it through (and test whatever you need to test). While text-mode file objects automatically handle non-Unix newlines, binary-mode files do not.
If you somehow have classic Mac (pre-OS X) files from the previous century, with \r endings, your code will not work. The \r won't be treated as a newline at all, so the whole file will look like one huge line. If you don't expect to have any such files, I wouldn't worry about it.
But if the only non-Unix files you have are Windows (or DOS), with \r\n, you'll be fine. The \r will get treated as part of the line rather than part of the newline, but that won't matter for your code (ord('\r') < 128, and beyond that, all you're doing is writing the whole line worth of bytes at once), so things will just work.

How to open a file with utf-8 non encoded characters?

I want to open a text file (.dat) in python and I get the following error:
'utf-8' codec can't decode byte 0x92 in position 4484: invalid start byte
but the file is encoded using utf-8, so maybe there some character that cannot be read. I am wondering, is there a way to handle the problem without calling each single weird characters? Cause I have a rather huge text file and it would take me hours to run find the non encoded Utf-8 encoded character.
Here is my code
import codecs
f = codecs.open('compounds.dat', encoding='utf-8')
for line in f:
if "InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
print(line)
searchfile.close()

It shouldn't "take you hours" to find the bad byte. The error tells you exactly where it is; it's at index 4484 in your input with a value of 0x92; if you did:
with open('compounds.dat', 'rb') as f:
data = f.read()
the invalid byte would be at data[4484], and you can slice as you like to figure out what's around it.
In any event, if you just want to ignore or replace invalid bytes, that's what the errors parameter is for. Using io.open (because codecs.open is subtly broken in many ways, and io.open is both faster and more correct):
# If this is Py3, you don't even need the import, just use plain open which is
# an alias for io.open
import io
with io.open('compounds.dat', encoding='utf-8', errors='ignore') as f:
for line in f:
if u"InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
print(line)
will just ignore the invalid bytes (dropping them as if they never existed). You can also pass errors='replace' to insert a replacement character for each garbage byte, so you're not silently dropping data.

if working with huge data , better to use encoding as default and if the error persists then use errors="ignore" as well
with open("filename" , 'r' , encoding="utf-8",errors="ignore") as f:
f.read()

Write bytes literal with undefined character to CSV file (Python 3)

Using Python 3.4.2, I want to get a part of a website. According to the meta tags, that website is encoded with iso-8859-1. And I want to write one part (along with other parts) to a CSV file.
However, this part contains an undefined character with the hex value 0x8b. In order to preserve the part as good as possible, I want to write it as is into the CSV file. However, Python doesn't let me do it.
Here's a minimal example:
import urllib.request
import urllib.parse
import csv
if __name__ == "__main__":
with open("bytewrite.csv", "w", newline="") as csvfile:
a = b'\x8b' # byte literal by urllib.request
b = a.decode("iso-8859-1")
w = csv.writer(csvfile)
w.writerow([b])
And this is the output:
Traceback (most recent call last):
File "D:\Eigene\Dateien\Code\Python\writebyte.py", line 12, in <module>
w.writerow([b])
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x8b' in position 0: character maps to <undefined>
Eventually, I did it manually. It was just copy and paste with Notepad++, and according to a hex editor the value was inserted correctly. But how can I do it with Python 3? Why does Python even care what 0x8b stands for, instead of just writing it to the file?
It further irritates me that according to iso8859_1.py (and also cp1252.py) in C:\Python34\lib\encodings\ the lookup table seems to not interfere:
# iso8859_1.py
'\x8b' # 0x8B -> <control>
# cp1252.py
'\u2039' # 0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK

Quoted from csv docs:
Since open() is used to open a CSV file for reading, the file will by
default be decoded into unicode using the system default encoding (see
locale.getpreferredencoding()). To decode a file using a different
encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
reader = csv.reader(f)
for row in reader:
print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
What is happening is you've decoded to Unicode from iso-8859-1, but getpreferredencoding() returns cp1252 and the Unicode character \x8b is not supported in that encoding.
Corrected minimal example:
import csv
with open('bytewrite.csv', 'w', encoding='iso-8859-1', newline='') as csvfile:
a = b'\x8b'
b = a.decode("iso-8859-1")
w = csv.writer(csvfile)
w.writerow([b])

Your interpretation of the lookup tables in encodings is not correct. The code you've listed:
# iso8859_1.py
'\x8b' # 0x8B -> <control>
# cp1252.py
'\u2039' # 0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK
Tells you two things:
How to map the unicode character '\x8b' to bytes in iso8859-1, it's just a control character.
How to map the unicode character '\u2039' to bytes in cp1252, it's a piece of punctuation: ‹
This does not tell you how to map the unicode character '\x8b' to bytes in cp1252, which is what you're trying to do.
The root of the problem is that "\x8b" is not a valid iso8859-1 character. Look at the table here:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout
8b is undefined, so it just decodes as a control character. After it's decoded and we're in unicode land, what is 0x8b? This is a little tricky to find out, but it's defined in the unicode database here:
008B;<control>;Cc;0;BN;;;;;N;PARTIAL LINE FORWARD;;;;
Now, does CP1252 have this control character, "PARTIAL LINE FORWARD"?
http://en.wikipedia.org/wiki/Windows-1252#Code_page_layout
No, it does not. So you get an error when trying to encode it in CP1252.
Unfortunately there's no good solution for this. Some ideas:
Guess what encoding the page actually is. It's probably CP1252, not ISO-8859-1, but who knows. It could even contain a mix of encodings, or incorrectly encoded data (mojibake). You can use chardet to guess the encoding, or force this URL to use CP1252 in your program (overriding what the meta tag says), or you could try a series of codecs and take the first one that decodes & encodes successfully.
Fix up the input text or the decoded unicode string using some kind of mapping of problematic characters like this. This will work most of the time, but will fail silently or do something weird if you're trying to "fix up" data where it doesn't make sense.
Do not try to convert from ISO-8859-1 to CP1252, as they aren't compatible with each other. If you use UTF-8 that might work better.
Use an encoding error handler. See this table for a list of handlers. Using xmlcharrefreplace and backslashreplace will preserve the information (but then require you to do extra steps when decoding), while replace and ignore will silently skip over the bad character.
These types of issues caused by older encodings are really hard to solve, and there is no perfect solution. This is the reason why unicode was invented.

Encoding error even using the right codec

I want to open files depending on the encoding format, therefore I do the following:
import magic
import csv
i_file = open(filename).read()
mag = magic.Magic(mime_encoding=True)
encoding = mag.from_buffer(i_file)
print "The encoding is ",encoding
Once I know the encoding format, I try to open the file using the right one:
with codecs.open(filename, "rb", encoding) as f_obj:
reader = csv.reader(f_obj)
for row in reader:
csvlist.append(row)
However, I get the next error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
trying to open a csv file which encoding is:
The encoding is utf-16le
The funny part comes here. If utf-16le is replaced by utf-16, the CSV utf-16le file is properly read. However, it is not well read when used in ascii csv files.
What am I doing wrong?

Python 2's csv module doesn't support Unicode. Is switching to Python 3 an option? If not, can you convert the input file to UTF-8 first?
From the docs linked above:
The csv module doesn’t directly support reading and writing Unicode,
but it is 8-bit-clean save (sic!) for some problems with ASCII NUL
characters. So you can write functions or classes that handle the
encoding and decoding for you as long as you avoid encodings like
UTF-16 that use NULs. UTF-8 is recommended.
Quick and dirty example:
with codecs.open(filename, "rb", encoding) as f_obj:
with codecs.open(filename+"u8", "wb", "utf-8") as utf8:
utf8.write(f_obj.read())
with codecs.open(filename+"u8", "rb", "utf-8") as f_obj:
reader = csv.reader(f_obj)
# etc.

This may be a bit useful to you.
Checkout python 2 documentation
https://docs.python.org/2/library/csv.html
Especially this section:
For all other encodings the following UnicodeReader and UnicodeWriter
classes can be used. They take an additional encoding parameter in
their constructor and make sure that the data passes the real reader
or writer encoded as UTF-8:
Look at the bottom of the page!!!!

UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
File "SCRIPT LOCATION", line NUMBER, in <module>
text = file.read()
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to `<undefined>`

The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")

If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), if you want to remove unneeded characters. (docs)

Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary

As an extension to #LennartRegebro's answer:
If you can't tell what encoding your file uses and the solution above does not work (it's not utf8) and you found yourself merely guessing - there are online tools that you could use to identify what encoding that is. They aren't perfect but usually work just fine. After you figure out the encoding you should be able to use solution above.
EDIT: (Copied from comment)
A quite popular text editor Sublime Text has a command to display encoding if it has been set...
Go to View -> Show Console (or Ctrl+`)
Type into field at the bottom view.encoding() and hope for the best (I was unable to get anything but Undefined but maybe you will have better luck...)

TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same codepage as current environment (cp1252 in case of the opening post) and tries to decode it to its own default UTF-8. If the file contains characters of values not defined in this codepage (like 0x90) we get UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may be unhandled by Python (like e.g. cp790), sometimes the file can contain mixed encodings.
If such characters are unneeded, one may decide to replace them by question marks, with:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The characters are then left intact, but other errors will be masked too.
A very good solution is to specify the encoding, yet not any encoding (like cp1252), but the one which has ALL characters defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).

Stop wasting your time, just add the following encoding="cp437" and errors='ignore' to your code in both read and write:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed

for me encoding with utf16 worked
file = open('filename.csv', encoding="utf16")

def read_files(file_path):
with open(file_path, encoding='utf8') as f:
text = f.read()
return text
OR (AND)
def read_files(text, file_path):
with open(file_path, 'rb') as f:
f.write(text.encode('utf8', 'ignore'))

For those working in Anaconda in Windows, I had the same problem. Notepad++ help me to solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View" locate "Encoding". In "Encoding" go to "character sets" and there with patiente look for the enconding that you need. In my case the encoding "Windows-1252" was found under "Western European"

Before you apply the suggested solution, you can check what is the Unicode character that appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at Unicode Consortium site http://www.unicode.org/charts/ by searching 0x0090)
and then consider removing it from the file.

In the newer version of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use Pycharm, just got to Run > Edit configurations (in tab Configuration change value in field Interpreter options to -Xutf8).
Or, equivalently, you can just set the environmental variable PYTHONUTF8 to 1.

for me changing the Mysql character encoding the same as my code helped to sort out the solution. photo=open('pic3.png',encoding=latin1)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.