python file open() throws exception for non utf-8 character

python file open() throws exception for non utf-8 character - python

I wrote the simplest python program that exhibits the error I need help with.
lines_read = 0
urllist_file = open('../fall11_urls.txt', 'r')
for line in urllist_file:
lines_read += 1
print('line count:', lines_read)
I run this on most files and of course it works as expected but "fall11_urls.txt" is a 14 million line text file that contains URLs, one per line. Some of these lines contain text with appeaently non utf-8 characters and I get the error quoted below. I need access every one of these URLs What is the best way to handle this. These URLs can be "anything" some are 400 characters of random characters as in "https://bbswigr.fty.com/_Kcsnuk4J71A/RjzGhXZGmfI/AAAARg/xP3FO-Xbt68/s320/Axolo.jpg Some of these string contain characters such as 0x96 I need my python program to be robust against whatever might be in the file. (If it matters this runs on Ubuntu 16.04)
Here is the error
Traceback (most recent call last):
File "./count_lines.py", line 2, in <module>
for line in urllist_file:
File "/home/chris/.virtualenvs/cvml3/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 5529: invalid start byte
One more bit of information iconv finds the same problem with the same file. See below
$ iconv ../fall11_urls.txt >> /dev/null
iconv: illegal input sequence at position 1042953625
My current work around is UGLY. I use iconv to find the problem then I hand edit the file in vi, then process it. and keep doing this until it is clean but I have MILLIONS of lines in many files to process. And the URLs do mostly work after I hand correct them so these are not noise or "flipped bits".

Answering my own question to let you all know what worked. Yes opening in binary worked I tried it but then I don't have a "text" file. I read up on encoding and found the following works because every binary character value is valid. It is the Safest thing to do.
urllist_file = open('../fall11_urls.txt', 'r', encoding="latin-1")
It seems that anyone opening text files they get from other people and have no way to control or know in advance what is inside might be advised to use "latin-1" because there are no invalid byte values in Latin-1.
Thanks. The suggestion to open in binary got me to investigate what other parameters open() accepts. I'm new to Python and was astounded to find that strings are just a list of bytes. (That is what 20+ years of working in C will condition you to expect.)

Did you try crook method? This should work.
urllist_file = open('../fall11_urls.txt', 'rb') then convert to whatever format you want

Related

I keep getting the "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte" error when I write simple code [duplicate]

https://github.com/affinelayer/pix2pix-tensorflow/tree/master/tools
An error occurred when compiling "process.py" on the above site.
python tools/process.py --input_dir data -- operation resize --outp
ut_dir data2/resize
data/0.jpg -> data2/resize/0.png
Traceback (most recent call last):
File "tools/process.py", line 235, in <module>
main()
File "tools/process.py", line 167, in main
src = load(src_path)
File "tools/process.py", line 113, in load
contents = open(path).read()
File"/home/user/anaconda3/envs/tensorflow_2/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
What is the cause of the error?
Python's version is 3.5.2.

Python tries to convert a byte-array (a bytes which it assumes to be a utf-8-encoded string) to a unicode string (str). This process of course is a decoding according to utf-8 rules. When it tries this, it encounters a byte sequence which is not allowed in utf-8-encoded strings (namely this 0xff at position 0).
Since you did not provide any code we could look at, we only could guess on the rest.
From the stack trace we can assume that the triggering action was the reading from a file (contents = open(path).read()). I propose to recode this in a fashion like this:
with open(path, 'rb') as f:
contents = f.read()
That b in the mode specifier in the open() states that the file shall be treated as binary, so contents will remain a bytes. No decoding attempt will happen this way.

Use this solution it will strip out (ignore) the characters and return the string without them. Only use this if your need is to strip them not convert them.
with open(path, encoding="utf8", errors='ignore') as f:
Using errors='ignore'
You'll just lose some characters. but if your don't care about them as they seem to be extra characters originating from a the bad formatting and programming of the clients connecting to my socket server.
Then its a easy direct solution.
reference

Use encoding format ISO-8859-1 to solve the issue.

Had an issue similar to this, Ended up using UTF-16 to decode. my code is below.
with open(path_to_file,'rb') as f:
contents = f.read()
contents = contents.rstrip("\n").decode("utf-16")
contents = contents.split("\r\n")
this would take the file contents as an import, but it would return the code in UTF format. from there it would be decoded and seperated by lines.

I've come across this thread when suffering the same error, after doing some research I can confirm, this is an error that happens when you try to decode a UTF-16 file with UTF-8.
With UTF-16 the first characther (2 bytes in UTF-16) is a Byte Order Mark (BOM), which is used as a decoding hint and doesn't appear as a character in the decoded string. This means the first byte will be either FE or FF and the second, the other.
Heavily edited after I found out the real answer

It simply means that one chose the wrong encoding to read the file.
On Mac, use file -I file.txt to find the correct encoding. On Linux, use file -i file.txt.

I had a similar issue with PNG files. and I tried the solutions above without success.
this one worked for me in python 3.8
with open(path, "rb") as f:

use only
base64.b64decode(a)
instead of
base64.b64decode(a).decode('utf-8')

This is due to the different encoding method when read the file. In python, it defaultly
encode the data with unicode. However, it may not works in various platforms.
I propose an encoding method which can help you solve this if 'utf-8' not works.
with open(path, newline='', encoding='cp1252') as csvfile:
reader = csv.reader(csvfile)
It should works if you change the encoding method here. Also, you can find other encoding method here standard-encodings , if above doesn't work for you.

Those getting similar errors while handling Pandas for data frames use the following solution.
example solution.
df = pd.read_csv("File path", encoding='cp1252')

I had this UnicodeDecodeError while trying to read a '.csv' file using pandas.read_csv(). In my case, I could not manage to overcome this issue using other encoder types. But instead of using
pd.read_csv(filename, delimiter=';')
I used:
pd.read_csv(open(filename, 'r'), delimiter=';')
which just seems working fine for me.
Note that: In open() function, use 'r' instead of 'rb'. Because 'rb' returns bytes object that causes to happen this decoder error in the first place, that is the same problem in the read_csv(). But 'r' returns str which is needed since our data is in .csv, and using the default encoding='utf-8' parameter, we can easily parse the data using read_csv() function.

if you are receiving data from a serial port, make sure you are using the right baudrate (and the other configs ) : decoding using (utf-8) but the wrong config will generate the same error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
to check your serial port config on linux use : stty -F /dev/ttyUSBX -a

I had a similar issue and searched all the internet for this problem
if you have this problem just copy your HTML code in a new HTML file and use the normal <meta charset="UTF-8">
and it will work....
just create a new HTML file in the same location and use a different name

Check the path of the file to be read. My code kept on giving me errors until I changed the path name to present working directory. The error was:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

If you are on a mac check if you for a hidden file, .DS_Store. After removing the file my program worked.

I had a similar problem.
Solved it by:
import io
with io.open(filename, 'r', encoding='utf-8') as fn:
lines = fn.readlines()
However, I had another problem. Some html files (in my case) were not utf-8, so I received a similar error. When I excluded those html files, everything worked smoothly.
So, except from fixing the code, check also the files you are reading from, maybe there is an incompatibility there indeed.

You have to use the encoding as latin1 to read this file as there are some special character in this file, use the below code snippet to read the file.
The problem here is the encoding type. When Python can't convert the data to be read, it gives an error.
You can you latin1 or other encoding values.
I say try and test to find the right one for your dataset.

I have the same issue when processing a file generated from Linux. It turns out it was related with files containing question marks..

Following code worked in my case:
df = pd.read_csv(filename,sep = '\t', encoding='cp1252')

If possible, open the file in a text editor and try to change the encoding to UTF-8. Otherwise do it programatically at the OS level.

UnicodeDecodeError when reading a text file

I am a beginner to Python (I am using 3.4). This is the relevant part of my code.
fileObject = open("countable nouns raw.txt", "rt")
bigString = fileObject.read()
fileObject.close()
Whenever I try to read this file I get:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 82273: character maps to <undefined>
I have been reading around and it seems to be something to do with my default encoding not matching the text file encoding. I've read in another post that you can use this method to read a file with a specific encoding:
import codecs
f = codecs.open("file.txt", "r", "utf-8")
But you have to know it in advance. The thing is I don't know how the text file is encoded. A few posts suggested using Chardet. I've installed it but I have no idea how to get it to read a text file.
Any ideas on how to get around this??

There is no need to use codecs.open(); that's advice for Python 2.
In Python 3 open() takes an encoding argument:
fileObject = open("countable nouns raw.txt", "rt", encoding='utf8')
This does require that you know what codec was used for the file, of course. Generally speaking is no easy way for Python to figure that out; individual file formats may include codec information or have standardised on a given codec, but if all you have a generic text file you'll have to figure out what created it and what codec that used to write the data.

In addition to using the correct Python method to specifiy the encoding when using open, you could try to get the encoding using the file tool.
A file foo.txt containing
ÙÚÛÜ
can be checked using
$ file foo.txt
foo.txt: UTF-8 Unicode text
$ wc foo.txt
1 1 9 foo.txt
As you can see by using wc, it contains nine bytes, two for each character, one newline.

To add to Martijn Pieters answer,you may want to check out this link:
http://osxdaily.com/2015/08/11/determine-file-type-encoding-command-line-mac-os-x/
if you are a Mac user and have trouble figuring out what encoding a particular file you have is in.

One way you can detect the encoding on any operating system is by using the library chardet.
If you don't have it, make sure you run pip install chardet . After that, it is fairly simple:
import chardet
import requests
content = requests.get("http://yahoo.co.jp/").content
detect = chardet.detect(content)
print(detect)
This library tries to detect what the encoding is. This doesn't mean that it is 100% right, just that it will likely be correct. Then you can just read the file:
open('file.txt', encoding=detect['encoding'])

Parse file in robust way with python 3

I have a log file that I need to go through line by line, and apparently it contains some "bad bytes". I get an error message along the following lines:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 9: invalid start byte
I have been able to strip down the problem to a file "log.test" containing the following line:
Message: \260
(At least this is how it shows up in my Emacs.)
I have a file "demo_error.py" which looks like this:
import sys
with open(sys.argv[1], 'r') as lf:
for i, l in enumerate(lf):
print(i, l.strip())
I then run, from the command line:
$ python3 demo_error.py log.test
The full traceback is:
Traceback (most recent call last):
File "demo_error.py", line 5, in <module>
for i, l in enumerate(lf):
File "/usr/local/Cellar/python3/3.4.0/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 13: invalid start byte
My hunch is that I have to somehow specify a more general codec ("raw ascii" for instance) - but I'm not quite sure how to do this.
Note that this is not really a problem in Python 2.7.
And just to make my point clear: I don't mind getting an exception for the line in question - then I can simply discard the line. The problem is that the exception seems to happen on the "for" loop itself, which makes special handling of that particular line impossible.

You can also use the codecs module. When you use the codecs.open() function, you can specify how it handles errors using the errors argument:
codecs.open(filename, mode[, encoding[, errors[, buffering]]])
The errors argument can be one of several different keywords that specify how you want Python to behave when it attempts to decode a character that is invalid for the current encoding. You'll probably be most interested in codecs.ignore_errors or codecs.replace_errors, which cause invalid characters to be either ignored or replaced with a default character, respectively.
This method can be a good alternative when you know you have corrupt data that will cause the UnicodeDecodeError to be raised even when you specify the correct encoding.
Example:
with codecs.open('file.txt', mode='r', errors='ignore'):
# ...stuff...
# Even if there is corrupt data and invalid characters for the default
# encoding, this open() will still succeed

So apparently your file does not contain valid UTF-8 (which is the default encoding).
If you know, what encoding is used (e.g. iso-8859-1 which was afaik the python2 default), you can specify it when opening by using
open(sys.argv[1], mode='r', encoding='iso-8859-1')
If the encoding is unknown or not valid as all, you can open the file as binary.
open(sys.argv[1], mode='rb')
This will make the content accessible as bytes rather than trying to interpret them as characters.

In python <=2.7, strings (str) are arrays of 8 bits characters. So when reading a file composed of 8 bits characters or bytes, you get the bytes without problem, no matter what the actual encoding is. Simply, you may read them with a wrong representation, but it will never throw any exception.
In python >=3,strings are unicode strings (16 bits per character). So when reading a file python has to decode the file, and by default it uses system encoding - not necessarily UTF-8. In your case, it seems to assume UTF-8 encoding, when your log file is not UTF-8 encoding so the exception.
If not sure of the encoding you may reasonably try to use ISO-8859-1 with
open(sys.argv[1], 'r', encoding='iso-8859-1')

How to save the output of airport -s -x to file with Python

i am learning python, and i am having troubles with saving the output of a small function to file. My python function is the following:
#!/usr/local/bin/python
import subprocess
import codecs
airport = '/System/Library/PrivateFrameworks/Apple80211.framework/Versions/Current/Resources/airport'
def getAirportInfo():
arguments = [airport, "--scan" , "--xml"]
execute = subprocess.Popen(arguments, stdout=subprocess.PIPE)
out, err = execute.communicate()
print out
return out
airportInfo = getAirportInfo()
outFile = codecs.open('wifi-data.txt', 'w')
outFile.write(airportInfo)
outFile.close()
I guess that this would only work on a Mac, as it references some PrivateFrameworks.
Anyways, the code almost works as it should. The print statement prints a huge xml file, that i'd like to store in a file, for future processing. And here start the problems.
In the version above, the script executes without any errors, however, when i try to open the file, i get an error message, along the lines of error with utf-8 encoding. Ignoring this, opens the file, and most of the things look fine, except for a couple of things:
some SSID have non-ascii characters, like ä, ö and ü. When printing those on the screen, they are correctly displayed as \xc3\xa4 and so on. When I open the file it is displayed incorrectly, the usual random garbage.
some of the xml values look like these when printed on screen: Data("\x00\x11WLAN-0024FE056185\x01\x08\x82\x84\x8b\x96\x0c\ … x10D\x00\x01\x02") but like this when read from file: //8AAAAAAAAAAAAAAAAAAA==
I thought it could be an encoding error (seen as the Umlauts have problems, the error message says something about the utf-8 encoding being messed up, and the text containing \x type of characters), and i tried looking here for possible solutions. However, no matter what i do, there are still errors:
adding an additional argument 'utf-8' to the codecs.open yields a
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9a in position 24227: ordinal not in range(128) and the generated file is empty.
explicitly encoding to utf-8 with outFile.write(airportInfo.encode('utf-8')) before saving results in the same error
not being an expert, i tried decoding it, maybe i was just doing the exact opposite of what needed to be done, but i got an UnicodeDecodeError: 'utf8' codec can't decode byte 0x8a in position 8980: invalid start byte
The only the thing that worked (unsurprisingly), was to write the repr() of the string to file, but that is just not what i need, and also i can't make a nice .plist of a file full with escape symbols.
So please, please, can somebody help me? What am i missing?
If it helps, the type that gets saved in airportInfo is str (as in type(airportInfo) == str) and not u

You don't need re-encoding when your text is already unicode. Just write the text to a file. It should just work.
In [1]: t = 'äïöú'
In [2]: with open('test.txt', 'w') as f:
f.write(t)
...:
Additionally, you can make getAirportInfo simpler by using subprocess.check_output(). Also, mixed case names should only be used for classes, not functions. See PEP8.
import subprocess
def get_airport_info():
args = ['/System/Library/PrivateFrameworks/Apple80211.framework/Versions/Current/Resources/airport',
'--scan', '--xml']
return subprocess.check_output(args)
airportInfo = get_airport_info()
with open('wifi-data.txt', 'w') as outf:
outf.write(airportinfo)

I should have read this before my original answer:
What is the difference between encode/decode?
I always get confused between string and unicode conversion. On my mac, import sys; sys.getfilesystemencoding() suggests that subprocess returns a 'utf-8' string - so I don't know why airportInfo.encode('utf-8') fails. Is it possible to do airportInfo.encode('utf-8', 'ignore') and throw out the invalid characters?
Also - have you tried writing your file as wb: outFile = codecs.open('wifi-data.txt', 'wb') - doesn't 'w' open an ascii file?
Regarding your text editor - that may handle unicode characters differently. If it's reading a unicode text file as ascii, then the unicode characters may appear a garbled mess. You might try naming the file .xml, in which depending on your text editor may read it better as unicode.

How do I convert LF to CRLF?

I found a list of the majority of English words online, but the line breaks are of unix-style (encoded in Unicode: UTF-8). I found it on this website: http://dreamsteep.com/projects/the-english-open-word-list.html
How do I convert the line breaks to CRLF so I can iterate over them? The program I will be using them in goes through each line in the file, so the words have to be one per line.
This is a portion of the file: bitbackbitebackbiterbackbitersbackbitesbackbitingbackbittenbackboard
It should be:
bit
backbite
backbiter
backbiters
backbites
backbiting
backbitten
backboard
How can I convert my files to this type? Note: it's 26 files (one per letter) with 80,000 words or so in total (so the program should be very fast).
I don't know where to start because I've never worked with unicode. Thanks in advance!
Using rU as the parameter (as suggested), with this in my code:
with open(my_file_name, 'rU') as my_file:
for line in my_file:
new_words.append(str(line))
my_file.close()
I get this error:
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
addWords('B Words')
File "D:\my_stuff\Google Drive\documents\SCHOOL\Programming\Python\Programming Class\hangman.py", line 138, in addWords
for line in my_file:
File "C:\Python3.3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7488: character maps to <undefined>
Can anyone help me with this?

Instead of converting, you should be able to just open the file using Python's universal newline support:
f = open('words.txt', 'rU')
(Note the U.)

You can use the replace method of strings. Like
txt.replace('\n', '\r\n')
EDIT :
in your case :
with open('input.txt') as inp, open('output.txt', 'w') as out:
txt = inp.read()
txt = txt.replace('\n', '\r\n')
out.write(txt)

You don't need to convert the line endings in the files in order to be able to iterate over them. As suggested by NPE, simply use python's universal newlines mode.
The UnicodeDecodeError happens because the files you are processing are encoded as UTF-8 and when you attempt to decode the contents from bytes to a string, via str(line), Python is using the cp1252 encoding to convert the bytes read from the file into a Python 3 string (i.e. a sequence of unicode code points). However, there are bytes in those files that cannot be decoded with the cp1252 encoding and that causes a UnicodeDecodeError.
If you change str(line) to line.decode('utf-8') you should no longer get the UnicodeDecodeError. Check out the Text Vs. Data Instead of Unicode Vs. 8-bit writeup for some more details.
Finally, you might also find The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky useful.

You can use cereja package
pip install cereja==1.2.0
import cereja cereja.lf_to_crlf(dir_or_file_path)
or
cereja.lf_to_crlf(dir_or_file_path, ext_in=[“.py”,”.csv”])
You can substitute for any standard. See the filetools module

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.