My program saves a bit of XML data from an XML string to a file in a prettified format. This does the trick:
from xml.dom.minidom import parseString
dom = parseString(strXML)
with open(file_name + ".xml", "w", encoding="utf8") as outfile:
    outfile.write(dom.toprettyxml())
However, I noticed that my XML header is missing an encoding parameter.
<?xml version="1.0" ?>
Since my data is likely to contain many Unicode characters, I must make sure UTF-8 is also specified in the XML header's encoding field.
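In other words, I want the header to come out something like this:
<?xml version="1.0" encoding="UTF-8"?>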
Now, looking at the minidom documentation, I read that "an additional keyword argument encoding can be used to specify the encoding field of the XML header". So I try this:
from xml.dom.minidom import parseString
dom = parseString(strXML)
with open(file_name + ".xml", "w", encoding="utf8") as outfile:
    outfile.write(dom.toprettyxml(encoding="UTF-8"))
But then I get:
TypeError: write() argument must be str, not bytes
Why doesn't the first piece of code yield that error? And what am I doing wrong?
Thanks!
R.
From the documentation:
With no argument, the XML header does not specify an encoding, and the result is a Unicode string if the default encoding cannot represent all characters in the document. Encoding this string in an encoding other than UTF-8 is likely incorrect, since UTF-8 is the default encoding of XML.
With an explicit encoding argument, the result is a byte string in the specified encoding. It is recommended that this argument is always specified. To avoid UnicodeError exceptions in case of unrepresentable text data, the encoding argument should be specified as “utf-8”.
So toprettyxml() returns a different type depending on whether encoding is set or not (which is rather confusing if you ask me).
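A quick way to see the difference (a minimal sketch using a throwaway XML string):
from xml.dom.minidom import parseString

dom = parseString("<root><item>text</item></root>")
print(type(dom.toprettyxml()))                  # <class 'str'>
print(type(dom.toprettyxml(encoding="UTF-8")))  # <class 'bytes'>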
So you can fix it by removing the encoding argument:
with open(file_name + ".xml", "w", encoding="utf8") as outfile:
    outfile.write(dom.toprettyxml())
Or open your file in binary mode, which accepts byte strings:
with open(file_name + ".xml", "wb") as outfile:
    outfile.write(dom.toprettyxml(encoding="utf8"))
You can solve the problem as follows:
with open(targetName, 'wb') as f:
    f.write(dom.toprettyxml(indent='\t', encoding='utf-8'))
I don't recommend using wb mode for output, because it does not perform line-ending conversion (which, for example, converts \n to \r\n on Windows when using text mode). I instead use the following method:
from xml.dom import minidom

dom = minidom.parseString(utf_8_xml_text)
out_byte = dom.toprettyxml(encoding="utf-8")
out_text = out_byte.decode("utf-8")
with open(filename, "w", encoding="utf-8") as f:
    f.write(out_text)
For Python 3.9 and higher, you can use the built-in xml.etree.ElementTree.indent() function instead.
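For example (a sketch using xml.etree.ElementTree rather than minidom; indent() was added in Python 3.9, and strXML/file_name are the names from the question):
import xml.etree.ElementTree as ET

root = ET.fromstring(strXML)           # strXML as in the question
ET.indent(root, space="\t")            # pretty-print in place (Python 3.9+)
ET.ElementTree(root).write(file_name + ".xml", encoding="utf-8", xml_declaration=True)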
I am using the Google Vision API to extract the text from an image, and I also want to store this text in a .txt file.
Whenever I use f.write(text.description) I get:
UnicodeEncodeError
With f.write(text) it gives me:
TypeError: write() argument must be str, not EntityAnnotation
f.write(text.description.encode("utf-8")) gives me:
TypeError: write() argument must be str, not bytes.
You are trying to write a variable of type EntityAnnotation, which is a JSON-like object and not a str. Check out EntityAnnotation in the Google Cloud Vision documentation; under the position tab you can see how the structure is laid out. You are probably trying to write some piece of information contained in it.
Remember you can write the whole object by making it a string with str(json_obj), or by using json.dumps(json_obj) to serialize json_obj to a JSON-formatted str.
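For example, with a plain dict standing in for the object (just a sketch; for an actual EntityAnnotation you would still pick out the fields you need, such as text.description):
import json

annotation = {"description": "detected text", "locale": "en"}  # hypothetical stand-in

with open("annotation.txt", "w", encoding="utf-8") as f:
    f.write(json.dumps(annotation))  # serialize to a JSON-formatted str
    # or: f.write(str(annotation))   # plain string representation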
It looks like text.description contains characters which cannot be encoded in your filesystem's default encoding. This can be the case on Windows machines where the default filesystem encoding is cp1252, but is also possible on other platforms, depending on how they have been configured.
You can get around this by specifying a different encoding when you open the file - utf-8 is usually a good choice.
with open('myfile.txt', 'w', encoding='utf-8') as f:
    f.write(text.description)
Note that you will need to specify the encoding if you try to read from the file:
with open('myfile.txt', 'r', encoding='utf-8') as f:
    description = f.read()
I want to open a text file (.dat) in Python and I get the following error:
'utf-8' codec can't decode byte 0x92 in position 4484: invalid start byte
But the file is supposed to be encoded as UTF-8, so maybe there is some character that cannot be read. Is there a way to handle the problem without hunting down each individual weird character? I have a rather huge text file and it would take me hours to find the byte that isn't valid UTF-8.
Here is my code:
import codecs
f = codecs.open('compounds.dat', encoding='utf-8')
for line in f:
    if "InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
        print(line)
f.close()
It shouldn't "take you hours" to find the bad byte. The error tells you exactly where it is; it's at index 4484 in your input with a value of 0x92; if you did:
with open('compounds.dat', 'rb') as f:
    data = f.read()
the invalid byte would be at data[4484], and you can slice as you like to figure out what's around it.
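For example, to peek at what surrounds the offending byte (a small sketch reusing the data read above):
bad_index = 4484                            # position reported in the error message
print(data[bad_index])                      # 146, i.e. 0x92
print(data[bad_index - 20:bad_index + 20])  # the bytes around it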
In any event, if you just want to ignore or replace invalid bytes, that's what the errors parameter is for. Using io.open (because codecs.open is subtly broken in many ways, and io.open is both faster and more correct):
# If this is Py3, you don't even need the import, just use plain open which is
# an alias for io.open
import io
with io.open('compounds.dat', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if u"InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
            print(line)
will just ignore the invalid bytes (dropping them as if they never existed). You can also pass errors='replace' to insert a replacement character for each garbage byte, so you're not silently dropping data.
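For instance, the same sketch with errors='replace'; each undecodable byte shows up as the U+FFFD replacement character instead of silently disappearing:
import io

with io.open('compounds.dat', encoding='utf-8', errors='replace') as f:
    for line in f:
        if '\ufffd' in line:  # lines that contained invalid bytes
            print(line)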
If you are working with huge data, it is better to pass the encoding explicitly, and if the error persists, add errors="ignore" as well:
with open("filename" , 'r' , encoding="utf-8",errors="ignore") as f:
f.read()
Code:
from urllib import request
response = request.urlopen('http://www.amazon.com/')
body = response.read()
with open('test.html', 'wb') as f:
    f.write(body)

with open('test2.html', 'w') as f:
    f.write(body.decode('utf-8'))
Are there any differences, or anything I need to pay attention to?
The first way
with open('test.html', 'wb') as f:
    f.write(body)
simply saves the binary data you downloaded.
The second way
with open('test2.html', 'w') as f:
    f.write(body.decode('utf-8'))
assumes the data is UTF-8, attempts to decode those UTF-8 bytes to Unicode text, and then re-encodes it to your default file encoding, as specified by locale.getpreferredencoding(False). So if the data is already UTF-8 it wastefully decodes and re-encodes it. And if it's not UTF-8, then it specifies the wrong encoding to decode it with. That will work ok if the file only contains plain 7-bit ASCII data, but otherwise it will give wrong results, or raise UnicodeDecodeError.
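If you really do want to write it back out as text, it is safer to pass the encoding explicitly rather than rely on the locale default (a sketch; for a real page you would ideally take the charset from the Content-Type header instead of assuming UTF-8):
with open('test2.html', 'w', encoding='utf-8') as f:
    f.write(body.decode('utf-8'))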
I am trying to read a file and convert the string to a UTF-8 string in order to remove some non-UTF-8 characters from it:
file_str = open(file_path, 'r').read()
file_str = file_str.decode('utf-8')
but I got the following error,
AttributeError: 'str' object has no attribute 'decode'
Update: I tried the code as suggested by the answer,
file_str = open(file_path, 'r', encoding='utf-8').read()
but it didn't eliminate the non-UTF-8 characters, so how do I remove them?
Remove the .decode('utf8') call. Your file data has already been decoded, because in Python 3 the open() call with text mode (the default) returned a file object that decodes the data to Unicode strings for you.
You probably do want to add the encoding to the open() call to make this explicit. Otherwise Python uses a system default, and that may not be UTF-8:
file_str = open(file_path, 'r', encoding='utf8').read()
For example, on Windows, the default codec is almost certainly going to be wrong for UTF-8 data, but you won't see the problem until you try to read the text; you'd find you have mojibake, as the UTF-8 data is decoded using CP1252 or a similar 8-bit codec.
See the open() function documentation for further details.
If you use
file_str = open(file_path, 'r', encoding='utf8', errors='ignore').read()
then bytes that cannot be decoded as UTF-8 will essentially be ignored. Read the open() function documentation for details; it has a section on the possible values for the errors parameter.
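A small demonstration of the difference between two of the errors values, using a made-up byte string containing an invalid byte:
raw = b'caf\x92 latte'                        # 0x92 is not valid UTF-8
print(raw.decode('utf-8', errors='ignore'))   # 'caf latte'  (bad byte dropped)
print(raw.decode('utf-8', errors='replace'))  # 'caf\ufffd latte' (replacement character)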
I have to import some data into my Yahoo Marketing account, and the CSV file has to be encoded according to Yahoo's requirement: "CSV/TSV files: Unicode (technically UTF-16LE encoding)".
writer = csv.writer(open('new_yahoo.csv','w', encoding='utf-16-le'), delimiter="\t")
writer.writerows(reader)
If you scroll down to the examples provided on the Python csv page, you'll find this note:
The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
But if you do need to handle Unicode, it looks like this could help:
unicode_csv_reader() below is a generator that wraps csv.reader to handle Unicode CSV data (a list of Unicode strings).
...
For all other encodings the following UnicodeReader and UnicodeWriter classes can be used. They take an additional encoding parameter in their constructor and make sure that the data passes the real reader or writer encoded as UTF-8:
So it looks like the example they provide at the bottom should do the encoding you want.
It looks like you are using Python 3.x, judging by the open call used. What you have should work, although you may need to define the newline parameter as well. newline='' disables newline translation and lets the csv writer emit its own line terminator (\r\n by default, which gives CRLF endings), but Yahoo may have other requirements. The code below generated the file correctly on Windows with CRLF line endings.
import csv

data = [
    ['One', 'Two', 'Three'],
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

f = open('new_yahoo.csv', 'w', newline='', encoding='utf-16-le')
writer = csv.writer(f, delimiter='\t')
writer.writerows(data)
f.close()
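To double-check the result, you can read the raw bytes back (a quick sanity check; the exact bytes depend on the data written above):
with open('new_yahoo.csv', 'rb') as f:
    raw = f.read()

print(raw[:12])                # b'O\x00n\x00e\x00\t\x00T\x00w\x00' -> UTF-16LE, no BOM
print(b'\r\x00\n\x00' in raw)  # True: CRLF row endings, encoded as UTF-16LE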