read a file and try to remove all non UTF-8 chars - python

I am trying to read a file and convert the string to a UTF-8 string, in order to remove some non-UTF-8 chars in the file string:
file_str = open(file_path, 'r').read()
file_str = file_str.decode('utf-8')
but I got the following error:
AttributeError: 'str' object has no attribute 'decode'
Update: I tried the code as suggested by the answer,
file_str = open(file_path, 'r', encoding='utf-8').read()
but it didn't eliminate the non-UTF-8 chars, so how do I remove them?

Remove the .decode('utf8') call. Your file data has already been decoded, because in Python 3 the open() call in text mode (the default) returns a file object that decodes the data to Unicode strings for you.
You probably do want to add the encoding to the open() call to make this explicit. Otherwise Python uses a system default, and that may not be UTF-8:
file_str = open(file_path, 'r', encoding='utf8').read()
For example, on Windows, the default codec is almost certainly going to be wrong for UTF-8 data, but you won't see the problem until you try to read the text; you'd find you have Mojibake as the UTF-8 data is decoded using CP1252 or a similar 8-bit codec.
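A tiny demonstration of that Mojibake (my sketch, not part of the original answer): valid UTF-8 bytes decoded with cp1252 come back mangled instead of raising an error.
data = 'café'.encode('utf-8')   # the bytes b'caf\xc3\xa9'
print(data.decode('cp1252'))    # prints 'cafÃ©' - the classic Mojibake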
See the open() function documentation for further details.

If you use
file_str = open(file_path, 'r', encoding='utf8', errors='ignore').read()
, then non-UTF-8 characters will essentially be ignored. Read the open() function documentation for details. The documentation has a section on the possible values for the errors parameter.
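If you want to strip the offending bytes out of the file itself rather than just skip them while reading, a minimal sketch (reusing file_path from the question) is to read the raw bytes, decode with errors='ignore', and write the cleaned text back:
# Read raw bytes, drop anything that is not valid UTF-8, rewrite the file.
with open(file_path, 'rb') as f:
    raw = f.read()
clean = raw.decode('utf-8', errors='ignore')
with open(file_path, 'w', encoding='utf-8') as f:
    f.write(clean)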

Related

toprettyxml() : write() argument must be str, not bytes

My program saves a bit of XML data to a file in a prettyfied format from an XML string. This does the trick:
from xml.dom.minidom import parseString
dom = parseString(strXML)
with open(file_name + ".xml", "w", encoding="utf8") as outfile:
    outfile.write(dom.toprettyxml())
However, I noticed that my XML header is missing an encoding parameter.
<?xml version="1.0" ?>
Since my data is likely to contain many Unicode characters, I must make sure UTF-8 is also specified in the XML encoding field.
Now, looking at the minidom documentation, I read that "an additional keyword argument encoding can be used to specify the encoding field of the XML header". So I try this:
from xml.dom.minidom import parseString
dom = parseString(strXML)
with open(file_name + ".xml", "w", encoding="utf8") as outfile:
    outfile.write(dom.toprettyxml(encoding="UTF-8"))
But then I get:
TypeError: write() argument must be str, not bytes
Why doesn't the first piece of code yield that error? And what am I doing wrong?
Thanks!
R.
From the documentation (emphasis mine):
With no argument, the XML header does not specify an encoding, and the result is Unicode string if the default encoding cannot represent all characters in the document. Encoding this string in an encoding other than UTF-8 is likely incorrect, since UTF-8 is the default encoding of XML.
With an explicit encoding argument, the result is a byte string in the specified encoding. It is recommended that this argument is always specified. To avoid UnicodeError exceptions in case of unrepresentable text data, the encoding argument should be specified as “utf-8”.
So toprettyxml() returns a different object type depending on whether encoding is set or not (which is rather confusing, if you ask me).
You can therefore fix it by removing the encoding argument:
with open(file_name + ".xml", "w", encoding="utf8") as outfile:
    outfile.write(dom.toprettyxml())
or open your file in binary mode, which then accepts byte strings:
with open(file_name + ".xml", "wb") as outfile:
    outfile.write(dom.toprettyxml(encoding="utf8"))
You can solve the problem as follows:
with open(targetName, 'wb') as f:
    f.write(dom.toprettyxml(indent='\t', encoding='utf-8'))
I don't recommend using wb mode for output, because it does not take line-ending conversion into account (text mode, for example, converts \n to \r\n on Windows). I instead use the following method to do this:
dom = minidom.parseString(utf_8_xml_text)
out_byte = dom.toprettyxml(encoding="utf-8")
out_text = out_byte.decode("utf-8")
with open(filename, "w", encoding="utf-8") as f:
    f.write(out_text)
For Python 3.9 and higher, you can use the built-in indent function instead.
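For reference, a minimal sketch of that route (Python 3.9+, reusing strXML and file_name from the question): xml.etree.ElementTree.indent() pretty-prints the tree in place, and write() takes care of the encoding and the XML declaration.
import xml.etree.ElementTree as ET

tree = ET.ElementTree(ET.fromstring(strXML))
ET.indent(tree, space="\t")                 # pretty-print in place (3.9+)
tree.write(file_name + ".xml", encoding="utf-8", xml_declaration=True)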

How to open a file with non-UTF-8 encoded characters?

I want to open a text file (.dat) in python and I get the following error:
'utf-8' codec can't decode byte 0x92 in position 4484: invalid start byte
but the file is encoded using UTF-8, so maybe there is some character that cannot be read. I am wondering, is there a way to handle the problem without tracking down each single weird character? I have a rather huge text file and it would take me hours to find the non-UTF-8 character by hand.
Here is my code
import codecs

f = codecs.open('compounds.dat', encoding='utf-8')
for line in f:
    if "InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
        print(line)
f.close()
It shouldn't "take you hours" to find the bad byte. The error tells you exactly where it is; it's at index 4484 in your input with a value of 0x92; if you did:
with open('compounds.dat', 'rb') as f:
    data = f.read()
the invalid byte would be at data[4484], and you can slice as you like to figure out what's around it.
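For example, a quick sketch of that inspection (4484 is just the offset from the error message):
with open('compounds.dat', 'rb') as f:
    data = f.read()
print(hex(data[4484]))      # the offending byte, 0x92
print(data[4470:4500])      # a few surrounding bytes, to see the context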
In any event, if you just want to ignore or replace invalid bytes, that's what the errors parameter is for. Using io.open (because codecs.open is subtly broken in many ways, and io.open is both faster and more correct):
# If this is Py3, you don't even need the import, just use plain open which is
# an alias for io.open
import io
with io.open('compounds.dat', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if u"InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
            print(line)
will just ignore the invalid bytes (dropping them as if they never existed). You can also pass errors='replace' to insert a replacement character for each garbage byte, so you're not silently dropping data.
If you are working with huge data, it's better to first try the default encoding, and if the error persists, add errors="ignore" as well:
with open("filename" , 'r' , encoding="utf-8",errors="ignore") as f:
    f.read()

Why does ï»¿ appear in my data?

I downloaded the file 'pi_million_digits.txt' from here:
https://github.com/ehmatthes/pcc/blob/master/chapter_10/pi_million_digits.txt
I then used this code to open and read it:
filename = 'pi_million_digits.txt'
with open(filename) as file_object:
    lines = file_object.readlines()

pi_string = ''
for line in lines:
    pi_string += line.strip()

print(pi_string[:52] + "...")
print(len(pi_string))
However, the output produced is correct apart from the fact that it is preceded by some strange symbols: "ï»¿3.141...."
What causes these strange symbols? I am stripping the lines so I'd expect such symbols to be removed.
It looks like you're opening a file with a UTF-8 encoded Byte Order Mark using the ISO-8859-1 encoding (presumably because this is the default encoding on your OS).
If you open it as bytes and read the first line, you should see something like this:
>>> next(open('pi_million_digits.txt', 'rb'))
b'\xef\xbb\xbf3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
… where \xef\xbb\xbf is the UTF-8 encoding of the BOM. Opened as ISO-8859-1, it looks like what you're getting:
>>> next(open('pi_million_digits.txt', encoding='iso-8859-1'))
'ï»¿3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
… and opening it as UTF-8 shows the actual BOM character U+FEFF:
>>> next(open('pi_million_digits.txt', encoding='utf-8'))
'\ufeff3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
To strip the mark out, use the special encoding utf-8-sig:
>>> next(open('pi_million_digits.txt', encoding='utf-8-sig'))
'3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679\n'
The use of next() in the examples above is just for demonstration purposes. In your code, you just need to add the encoding argument to your open() line, e.g.
with open(filename, encoding='utf-8-sig') as file_object:
    # ... etc.

UnicodeEncodeError no matter what encoding I try [duplicate]

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
File "SCRIPT LOCATION", line NUMBER, in <module>
text = file.read()
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), if you want to remove unneeded characters. (docs)
Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary
As an extension to #LennartRegebro's answer:
If you can't tell what encoding your file uses and the solution above does not work (it's not utf8) and you found yourself merely guessing - there are online tools that you could use to identify what encoding that is. They aren't perfect but usually work just fine. After you figure out the encoding you should be able to use solution above.
EDIT: (Copied from comment)
The quite popular text editor Sublime Text has a command to display the encoding if it has been set...
Go to View -> Show Console (or Ctrl+`)
Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...)
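If you would rather guess programmatically instead of using an online tool, a rough sketch with the third-party chardet package (my suggestion, not part of the original answer) looks like this; the result is only a guess with a confidence score:
import chardet

with open(filename, 'rb') as f:
    raw = f.read()
guess = chardet.detect(raw)            # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'])   # then decode with the detected encoding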
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it accordingly. If the file contains characters with values not defined in this codepage (like 0x90), we get UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may be unhandled by Python (like e.g. cp790), sometimes the file can contain mixed encodings.
If such characters are unneeded, one may decide to replace them with the Unicode replacement character (�), with:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The offending characters are then silently dropped, but other errors will be masked too.
A very good solution is to specify the encoding, yet not any encoding (like cp1252), but the one which has ALL characters defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).
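A quick check of that claim, as a standalone sketch: every one of the 256 byte values decodes under cp437 without raising.
all_bytes = bytes(range(256))
text = all_bytes.decode('cp437')   # no UnicodeDecodeError
print(len(text))                   # 256 - every byte maps to some character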
Stop wasting your time, just add the following encoding="cp437" and errors='ignore' to your code in both read and write:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
For me, opening with the utf16 encoding worked:
file = open('filename.csv', encoding="utf16")
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text
OR (AND)
def write_files(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))
For those working in Anaconda on Windows, I had the same problem. Notepad++ helped me to solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". In "Encoding", go to "Character sets" and there, with patience, look for the encoding that you need. In my case the encoding "Windows-1252" was found under "Western European".
Before you apply the suggested solution, you can check what Unicode character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090)
and then consider removing it from the file.
In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations (in the Configuration tab, change the value in the field Interpreter options to -Xutf8).
Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
For me, changing the MySQL character encoding to match my code helped to sort out the solution: photo = open('pic3.png', encoding='latin1')

Converting DOS text files to Unicode using Python

I am trying to write a Python application for converting old DOS code page text files to their Unicode equivalent. Now, I have done this before using Turbo Pascal by creating a look-up table and I'm sure the same can be done using a Python dictionary. My question is: How do I index into the dictionary to find the character I want to convert and send the equivalent Unicode to a Unicode output file?
I realize that this may be a repeat of a similar question but nothing I searched for here quite matches my question.
Python has codecs to do the conversion:
#!python3
# Test file with bytes 0-255.
with open('dos.txt', 'wb') as f:
    f.write(bytes(range(256)))
# Read the file and decode using code page 437 (DOS OEM-US).
# Write the file as UTF-8 encoding ("Unicode" is not an encoding)
# UTF-8, UTF-16, UTF-32 are encodings that support all Unicode codepoints.
with open('dos.txt', encoding='cp437') as infile:
    with open('unicode.txt', 'w', encoding='utf8') as outfile:
        outfile.write(infile.read())
You can rely on the built-in decoding of bytes objects by simply opening both files with the right encodings and copying the text line by line:
with open('dos.txt', 'r', encoding='cp437') as infile, \
        open('unicode.txt', 'w', encoding='utf8') as outfile:
    for line in infile:
        outfile.write(line)
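If you still want the explicit look-up table the question asks about, here is a minimal sketch (assuming the DOS files use code page 437): build a dict mapping each byte value to its Unicode character and translate byte by byte. In practice, the codecs shown above do exactly this for you.
# Build the byte -> Unicode character table once from the cp437 codec.
cp437_table = {b: bytes([b]).decode('cp437') for b in range(256)}

with open('dos.txt', 'rb') as infile, \
        open('unicode.txt', 'w', encoding='utf8') as outfile:
    for byte in infile.read():           # iterating over bytes yields ints
        outfile.write(cp437_table[byte])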
