How to save an xml file to disk? - python

I did something similar to this, but couldn't find a way to write the result to an xml file.

The code on the web page you linked to uses doc.toprettyxml to create a string from the XML DOM, so you can just write that string to a file:
f = open("output.xml", "w")
try:
    f.write(doc.toprettyxml(indent=" "))
finally:
    f.close()
In Python 2.6 (or 2.7 I suppose, whenever it comes out), you can use the "with" statement:
with open("output.xml", "w") as f:
    f.write(doc.toprettyxml(indent=" "))
This also works in Python 2.5 if you put
from __future__ import with_statement
at the beginning of the file.

coonj is kind of right, but xml.dom.ext.PrettyPrint is part of the increasingly neglected PyXML extension package. If you want to stay within the supplied-as-standard minidom, you'd say:
f= open('yourfile.xml', 'wb')
doc.writexml(f, encoding= 'utf-8')
f.close()
(Or using the ‘with’ statement as mentioned by David to make it slightly shorter. Use mode 'wb' to avoid unwanted CRLF newlines on Windows interfering with encodings like UTF-16. Because XML has its own mechanisms for handling newline interpretation, it should be treated as a binary file rather than text.)
If you don't include the ‘encoding’ argument (to either writexml or toprettyxml), it'll try to write a Unicode string direct to the file, so if there are any non-ASCII characters in it, you'll get a UnicodeEncodeError. Don't try to .encode() the results of toprettyxml yourself; for non-UTF-8 encodings this can generate non-well-formed XML.
There's no ‘writeprettyxml()’ function, but it's trivially simple to do it yourself:
with open('output.xml', 'wb') as f:
    doc.writexml(f, encoding= 'utf-8', indent= ' ', newl= '\n')
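The snippets above date from Python 2. On Python 3, minidom's writexml() writes str rather than bytes, so the file has to be opened in text mode. A minimal sketch, assuming Python 3; the helper name save_xml and the sample document are made up for illustration:

```python
import os
import tempfile
from xml.dom import minidom


def save_xml(doc, path):
    """Write a minidom Document to path as indented UTF-8 XML (Python 3)."""
    # writexml() emits str on Python 3, so open the file in text mode.
    # newline="" keeps Windows from turning "\n" into "\r\n".
    with open(path, "w", encoding="utf-8", newline="") as f:
        # The encoding argument here only fills in the XML declaration;
        # the actual encoding is done by the file object.
        doc.writexml(f, encoding="utf-8", addindent="  ", newl="\n")


# Illustrative document containing a non-ASCII character.
doc = minidom.parseString("<root><item>caf\u00e9</item></root>")
path = os.path.join(tempfile.mkdtemp(), "output.xml")
save_xml(doc, path)
```

Keeping the declared encoding and the file's encoding identical is what avoids the mismatch problem described above.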

f = open('yourfile.xml', 'w')
xml.dom.ext.PrettyPrint(doc, f)
f.close()

Related

Why does my code add newlines into my csv file? How can I get rid of them? [duplicate]

import csv
with open('thefile.csv', 'rb') as f:
    data = list(csv.reader(f))
import collections
counter = collections.defaultdict(int)
for row in data:
    counter[row[10]] += 1
with open('/pythonwork/thefile_subset11.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    for row in data:
        if counter[row[10]] >= 504:
            writer.writerow(row)
This code reads thefile.csv, makes changes, and writes results to thefile_subset11.csv.
However, when I open the resulting csv in Microsoft Excel, there is an extra blank line after each record!
Is there a way to make it not put an extra blank line?
The csv.writer object controls line endings itself and writes \r\n into the file directly. In Python 3 the file must be opened in untranslated text mode with the parameters 'w', newline='' (empty string), or it will write \r\r\n on Windows, where the default text mode translates each \n into \r\n.
#!python3
with open('/pythonwork/thefile_subset11.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
In Python 2, use binary mode to open outfile with mode 'wb' instead of 'w' to prevent Windows newline translation. Python 2 also has problems with Unicode and requires other workarounds to write non-ASCII text. See the Python 2 link below and the UnicodeReader and UnicodeWriter examples at the end of the page if you have to deal with writing Unicode strings to CSVs on Python 2, or look into the 3rd party unicodecsv module:
#!python2
with open('/pythonwork/thefile_subset11.csv', 'wb') as outfile:
    writer = csv.writer(outfile)
Documentation Links
https://docs.python.org/3/library/csv.html#csv.writer
https://docs.python.org/2/library/csv.html#csv.writer
Opening the file in binary mode "wb" will not work in Python 3+. Or rather, you'd have to convert your data to binary before writing it. That's just a hassle.
Instead, you should keep it in text mode, but override the newline as empty. Like so:
with open('/pythonwork/thefile_subset11.csv', 'w', newline='') as outfile:
Note: It seems this is not the preferred solution because of how the extra line was being added on a Windows system. As stated in the Python documentation:
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
Windows is one such platform where that makes a difference. While changing the line terminator as I described below may have fixed the problem, the problem could be avoided altogether by opening the file in binary mode. One might say this solution is more "elegant". "Fiddling" with the line terminator would likely have resulted in unportable code in this case, whereas opening a file in binary mode on a Unix system has no effect, i.e. it results in cross-system compatible code.
From Python Docs:
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
Original:
If you are getting extra blank lines, you may have to change the lineterminator, one of the optional parameters of csv.writer (info here). The example below is adapted from the csv page of the Python docs. Change it from '\n' to whatever it should be. As this is just a stab in the dark at the problem this may or may not work, but it's my best guess.
>>> import csv
>>> spamWriter = csv.writer(open('eggs.csv', 'w'), lineterminator='\n')
>>> spamWriter.writerow(['Spam'] * 5 + ['Baked Beans'])
>>> spamWriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])
The simple answer is that csv files should always be opened in binary mode whether for input or output, as otherwise on Windows there are problems with the line ending. Specifically on output the csv module will write \r\n (the standard CSV row terminator) and then (in text mode) the runtime will replace the \n by \r\n (the Windows standard line terminator) giving a result of \r\r\n.
Fiddling with the lineterminator is NOT the solution.
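The doubling described above can be reproduced on any platform. This is only a sketch, assuming Python 3: an in-memory io.TextIOWrapper with newline='\r\n' stands in for the default text-mode translation on Windows, and the helper name written_bytes is made up for illustration.

```python
import csv
import io


def written_bytes(newline):
    """Return the bytes csv.writer produces through a text layer using `newline`."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    csv.writer(text).writerow(["a", "b"])  # csv.writer itself ends the row with \r\n
    text.flush()
    data = raw.getvalue()
    text.detach()  # release the buffer so the wrapper does not close it
    return data


# newline="\r\n" mimics what Windows text mode does: every \n becomes \r\n.
windows_default = written_bytes("\r\n")
# newline="" turns the translation off.
untranslated = written_bytes("")
print(repr(windows_default), repr(untranslated))
```

The first value ends in \r\r\n, which is what Excel shows as a blank line after each record; the second ends in a single \r\n.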
A lot of the other answers have become out of date in the ten years since the original question. For Python3, the answer is right in the documentation:
If csvfile is a file object, it should be opened with newline=''
The footnote explains in more detail:
If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n linendings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
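A small round trip illustrates the footnote. This is a sketch assuming Python 3; the file name and sample rows are made up. With newline='' on both the write and the read, a line break embedded in a quoted field comes back unchanged:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.csv")
rows = [["id", "note"], ["1", "first line\nsecond line"]]

# Write without newline translation; csv.writer adds its own \r\n row endings.
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read back the same way so the csv module sees the raw line endings.
with open(path, newline="") as f:
    read_back = list(csv.reader(f))

with open(path, "rb") as f:
    raw = f.read()
```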
Use the method defined below to write data to the CSV file.
open('outputFile.csv', 'a',newline='')
Just add an additional newline='' parameter inside the open method :
def writePhoneSpecsToCSV():
    rowData=["field1", "field2"]
    with open('outputFile.csv', 'a',newline='') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(rowData)
This will write CSV rows without creating additional rows!
I'm writing this answer with respect to Python 3, as I initially had the same problem.
I was supposed to get data from an Arduino using PySerial and write it to a .csv file. Each reading in my case ended with '\r\n', so a newline was always separating each line.
In my case, the newline='' option didn't work, because it raised an error like:
with open('op.csv', 'a',newline=' ') as csv_file:
ValueError: illegal newline value: ''
So it seemed that an empty newline value is not accepted here.
Following one of the answers here, I set the line terminator in the writer object instead, like this:
writer = csv.writer(csv_file, delimiter=' ',lineterminator='\r')
and that worked for me for skipping the extra newlines.
with open(destPath+'\\'+csvXML, 'a+') as csvFile:
    writer = csv.writer(csvFile, delimiter=';', lineterminator='\r')
    writer.writerows(xmlList)
Setting lineterminator='\r' moves to the next row without leaving an empty row in between.
Borrowing from this answer, it seems like the cleanest solution is to use io.TextIOWrapper. I managed to solve this problem for myself as follows:
from io import TextIOWrapper
...
with open(filename, 'wb') as csvfile, TextIOWrapper(csvfile, encoding='utf-8', newline='') as wrapper:
    csvwriter = csv.writer(wrapper)
    for data_row in data:
        csvwriter.writerow(data_row)
The above answer is not compatible with Python 2. To have compatibility, I suppose one would simply need to wrap all the writing logic in an if block:
if sys.version_info < (3,):
    # Python 2 way of handling CSVs
    pass
else:
    # The above logic
    pass
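One way to fill in that version check is sketched below. It swaps the TextIOWrapper approach for the plain open() modes recommended in the other answers, and the helper name open_csv_for_writing is hypothetical:

```python
import csv
import os
import sys
import tempfile


def open_csv_for_writing(path):
    """Open path so that csv.writer output is not newline-translated."""
    if sys.version_info < (3,):
        # Python 2: binary mode avoids the Windows \n -> \r\n translation.
        return open(path, "wb")
    # Python 3: text mode with translation switched off.
    return open(path, "w", newline="")


demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open_csv_for_writing(demo_path) as outfile:
    csv.writer(outfile).writerow(["x", "y"])
```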
I used writerow
import csv
from itertools import permutations
def write_csv(writer, var1, var2, var3, var4):
    """
    write four variables into a csv file
    """
    writer.writerow([var1, var2, var3, var4])
numbers=set([1,2,3,4,5,6,7,2,4,6,8,10,12,14,16])
rules = list(permutations(numbers, 4))
#print(rules)
selection=[]
with open("count.csv", 'w',newline='') as csvfile:
    writer = csv.writer(csvfile)
    for rule in rules:
        number1,number2,number3,number4=rule
        if ((number1+number2+number3+number4)%5==0):
            #print(rule)
            selection.append(rule)
            write_csv(writer,number1,number2,number3,number4)
When using Python 3, the empty lines can be avoided by using the codecs module. As stated in the documentation, files are opened in binary mode, so no change of the newline kwarg is necessary. I was running into the same issue recently and that worked for me:
with codecs.open( csv_file, mode='w', encoding='utf-8') as out_csv:
    csv_out_file = csv.DictWriter(out_csv)

Python problem reading CSV files that contain the word NUL [duplicate]

I'm working with some CSV files, with the following code:
reader = csv.reader(open(filepath, "rU"))
try:
    for row in reader:
        print 'Row read successfully!', row
except csv.Error, e:
    sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
And one file is throwing this error:
file my.csv, line 1: line contains NULL byte
What can I do? Google seems to suggest that it may be an Excel file that's been saved as a .csv improperly. Is there any way I can get round this problem in Python?
== UPDATE ==
Following @JohnMachin's comment below, I tried adding these lines to my script:
print repr(open(filepath, 'rb').read(200)) # dump 1st 200 bytes of file
data = open(filepath, 'rb').read()
print data.find('\x00')
print data.count('\x00')
And this is the output I got:
'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\ .... <snip>
8
13834
So the file does indeed contain NUL bytes.
As @S.Lott says, you should be opening your files in 'rb' mode, not 'rU' mode. However that may NOT be causing your current problem. As far as I know, using 'rU' mode would mess you up if there are embedded \r in the data, but not cause any other dramas. I also note that you have several files (all opened with 'rU' ??) but only one causing a problem.
If the csv module says that you have a "NULL" (silly message, should be "NUL") byte in your file, then you need to check out what is in your file. I would suggest that you do this even if using 'rb' makes the problem go away.
repr() is (or wants to be) your debugging friend. It will show unambiguously what you've got, in a platform-independent fashion (which is helpful to helpers who are unaware what od is or does). Do this:
print repr(open('my.csv', 'rb').read(200)) # dump 1st 200 bytes of file
and carefully copy/paste (don't retype) the result into an edit of your question (not into a comment).
Also note that if the file is really dodgy e.g. no \r or \n within reasonable distance from the start of the file, the line number reported by reader.line_num will be (unhelpfully) 1. Find where the first \x00 is (if any) by doing
data = open('my.csv', 'rb').read()
print data.find('\x00')
and make sure that you dump at least that many bytes with repr or od.
What does data.count('\x00') tell you? If there are many, you may want to do something like
for i, c in enumerate(data):
    if c == '\x00':
        print i, repr(data[i-30:i]) + ' *NUL* ' + repr(data[i+1:i+31])
so that you can see the NUL bytes in context.
If you can see \x00 in the output (or \0 in your od -c output), then you definitely have NUL byte(s) in the file, and you will need to do something like this:
fi = open('my.csv', 'rb')
data = fi.read()
fi.close()
fo = open('mynew.csv', 'wb')
fo.write(data.replace('\x00', ''))
fo.close()
By the way, have you looked at the file (including the last few lines) with a text editor? Does it actually look like a reasonable CSV file like the other (no "NULL byte" exception) files?
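The code in this answer is Python 2. A rough Python 3 equivalent of the clean-up step is sketched below; the function name strip_nul_bytes and the sample bytes are made up for illustration. Note that this is only appropriate when the NUL bytes are stray debris: if the file is really UTF-16 or a binary spreadsheet, stripping them will not produce a valid CSV.

```python
import os
import tempfile


def strip_nul_bytes(src_path, dst_path):
    """Copy src_path to dst_path without NUL bytes; return how many were removed."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        # On Python 3 the data is bytes, so the pattern must be bytes too.
        dst.write(data.replace(b"\x00", b""))
    return data.count(b"\x00")


workdir = tempfile.mkdtemp()
dirty = os.path.join(workdir, "my.csv")
clean = os.path.join(workdir, "mynew.csv")
with open(dirty, "wb") as f:
    f.write(b"a,b\x00,c\r\n1,\x002,3\r\n")  # sample data with two stray NULs
removed = strip_nul_bytes(dirty, clean)
```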
data_initial = open("staff.csv", "rb")
data = csv.reader((line.replace('\0','') for line in data_initial), delimiter=",")
This works for me.
Reading it as UTF-16 was also my problem.
Here's my code that ended up working:
f=codecs.open(location,"rb","utf-16")
csvread=csv.reader(f,delimiter='\t')
csvread.next()
for row in csvread:
    print row
Where location is the path of your csv file.
You could just inline a generator to filter out the null values if you want to pretend they don't exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.
with open(filepath, "rb") as f:
    reader = csv.reader( (line.replace('\0','') for line in f) )
    try:
        for row in reader:
            print 'Row read successfully!', row
    except csv.Error, e:
        sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
I bumped into this problem as well. Using the Python csv module, I was trying to read an XLS file created in MS Excel and running into the NULL byte error you were getting. I looked around and found the xlrd Python module for reading and formatting data from MS Excel spreadsheet files. With the xlrd module, I am not only able to read the file properly, but I can also access many different parts of the file in a way I couldn't before.
I thought it might help you.
Converting the encoding of the source file from UTF-16 to UTF-8 solved my problem.
How to convert a file to utf-8 in Python?
import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "utf-16") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)
Why are you doing this?
reader = csv.reader(open(filepath, "rU"))
The docs are pretty clear that you must do this:
with open(filepath, "rb") as src:
    reader= csv.reader( src )
The mode must be "rb" to read.
http://docs.python.org/library/csv.html#csv.reader
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
Apparently it's an XLS file and not a CSV file, as http://www.garykessler.net/library/file_sigs.html confirms.
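The first eight bytes in the question's dump (\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1) are the OLE2 compound-file signature used by legacy .xls files. A quick check of the leading bytes can tell a mislabelled file from a real CSV before the csv module ever sees it. This is a sketch only: the names SIGNATURES, sniff and sniff_file are hypothetical, and the table covers just a few common cases.

```python
# Leading bytes of a few formats that sometimes carry a ".csv" extension.
SIGNATURES = [
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g. legacy .xls)"),
    (b"PK\x03\x04", "ZIP container (e.g. .xlsx)"),
    (b"\x1f\x8b", "gzip-compressed data"),
    (b"\xff\xfe", "UTF-16 little-endian text (BOM)"),
    (b"\xfe\xff", "UTF-16 big-endian text (BOM)"),
]


def sniff(head):
    """Return a description of the format suggested by the leading bytes, or None."""
    for magic, description in SIGNATURES:
        if head.startswith(magic):
            return description
    return None


def sniff_file(path):
    """Read the first few bytes of a file and run sniff() on them."""
    with open(path, "rb") as f:
        return sniff(f.read(8))


print(sniff(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00"))
```

A return value of None does not prove the file is a valid CSV; it only means none of the listed signatures matched.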
Instead of the csv reader, I read the file directly and use the string split function:
lines = open(input_file,'rb')
for line_all in lines:
    line=line_all.replace('\x00', '').split(";")
I got the same error. Saved the file in UTF-8 and it worked.
This happened to me when I created a CSV file with OpenOffice Calc. It didn't happen when I created the CSV file in my text editor, even if I later edited it with Calc.
I solved my problem by copy-pasting in my text editor the data from my Calc-created file to a new editor-created file.
I had the same problem opening a CSV produced from a webservice which inserted NULL bytes in empty headers. I did the following to clean the file:
with codecs.open ('my.csv', 'rb', 'utf-8') as myfile:
    data = myfile.read()
    # clean file first if dirty
    if data.count( '\x00' ):
        print 'Cleaning...'
        with codecs.open('my.csv.tmp', 'w', 'utf-8') as of:
            for line in data:
                of.write(line.replace('\x00', ''))
        shutil.move( 'my.csv.tmp', 'my.csv' )
with codecs.open ('my.csv', 'rb', 'utf-8') as myfile:
    myreader = csv.reader(myfile, delimiter=',')
    # Continue with your business logic here...
Disclaimer:
Be aware that this overwrites your original data. Make sure you have a backup copy of it. You have been warned!
I opened and saved the original csv file as a .csv file through Excel's "Save As" and the NULL byte disappeared.
I think the original encoding for the file I received was double byte unicode (it had a null character every other character) so saving it through excel fixed the encoding.
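A NUL after every other character is what UTF-16 text looks like when read as bytes. On Python 3 such a file can be read directly by naming the encoding, with no need to re-save it. A minimal sketch, with a made-up file name and sample rows, and a tab delimiter chosen only for the example:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "export.csv")
expected = [["name", "city"], ["Ana", "Z\u00fcrich"]]

# Create a UTF-16 CSV similar to what some exports produce.
with open(path, "w", encoding="utf-16", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(expected)

# Read as raw bytes, the file is full of NULs...
with open(path, "rb") as f:
    nul_count = f.read().count(b"\x00")

# ...but decoded as UTF-16 it parses cleanly.
with open(path, encoding="utf-16", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))
```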
For all those 'rU' filemode haters: I just tried opening a CSV file from a Windows machine on a Mac with the 'rb' filemode and I got this error from the csv module:
Error: new-line character seen in unquoted field - do you need to
open the file in universal-newline mode?
Opening the file in 'rU' mode works fine. I love universal-newline mode -- it saves me so much hassle.
I encountered this when using scrapy and fetching a zipped csvfile without having a correct middleware to unzip the response body before handing it to the csvreader. Hence the file was not really a csv file and threw the line contains NULL byte error accordingly.
Have you tried using gzip.open?
with gzip.open('my.csv', 'rb') as data_file:
I was trying to open a file that had been compressed but had the extension '.csv' instead of 'csv.gz'. This error kept showing up until I used gzip.open
Another case: if the CSV file contains empty rows, this error may show up. Check that the row is non-empty before reading or writing it.
for row in csvreader:
    if (row):
        pass  # do something
I solved my issue by adding this check in the code.

Unicode(UTF-8) can't display correctly? (Python)

I have the following code in Python:
# myFile.csv tends to look like:
# 'a1', 'ふじさん', 'c1'
# 'a2', 'ふじさん', 'c2'
# 'a3', 'ふじさん', 'c3'
s = u"unicodeText" # unicodeText like, ふじさん بعدة أش 일본富士山Ölkələr
with codecs.open('myFile.csv', 'w+', 'utf-8') as f: # codecs open
    f.write(s.encode('utf-8', 'ignore'))
I was using Vim to edit the code and to open "myFile.csv".
The Unicode text displays correctly in the terminal,
but not in Excel or in a browser.
My platform is OS X.
I don't know whether this is a configuration problem or whether my code is wrong; if you have any idea, please advise. Much appreciated!
Edit: changed open to codecs.open.
Thanks for pointing out f.close(); deleted.
Excel (at least on Windows) likes a Unicode BOM at the start of a .csv file even with UTF-8. There is a codec for that, utf-8-sig.
Also, Python 3's normal open is all that is required and no need for f.close() in a with:
#coding:utf8
data = '''\
a1,ふじさん,c1
a2,ふじさん,c2
a3,ふじさん,c3
'''
with open('myFile.csv', 'w', encoding='utf-8-sig') as f:
    f.write(data)
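The same codec combines with the csv module. The sketch below assumes Python 3 and uses a temporary file and sample rows invented for the example; it writes rows through csv.writer and then checks that the file starts with the UTF-8 byte order mark:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myFile.csv")
rows = [["a1", "ふじさん", "c1"], ["a2", "ふじさん", "c2"]]

# utf-8-sig prepends the BOM on write; newline="" leaves row endings to csv.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)

with open(path, "rb") as f:
    raw = f.read()

# utf-8-sig also strips the BOM again when reading.
with open(path, encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.reader(f))
```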
It seems you're trying to open the file in text mode (because you specify an encoding), but then you try to write binary data (because you encode the text before writing it to the file). You need to either open the file as binary and write encoded text, or open it as text and write text.
Furthermore, your attempt to open it as text isn't even working because you're passing utf-8 as the buffering parameter instead of the encoding parameter. See the open() documentation.
But even if you did all that correctly, this still wouldn't really help you with an Excel file, because those have a complicated binary structure. I recommend you use something like xlrd to read xls files and XlsxWriter to write Excel files.
Here is a simple example that should work for .csv:
with open('file.csv', 'w', encoding='utf-8') as fh:
    fh.write('This >µ< is a unicode GREEK LETTER MU\n')
or alternatively
with open('file.csv', 'wb') as fh:
    fh.write('This >µ< is a unicode GREEK LETTER MU\n'.encode('utf-8'))
codecs.open opens a wrapped reader/writer which will do encoding/decoding for you, so you do not need to encode your string before writing. Instead, pass 'ignore' as the errors parameter in your codecs.open call.
with codecs.open('myFile.csv', 'w+', 'utf-8', 'ignore') as f:
    f.write(s)
Note that you do not need to call close as you use a with statement.
Original answer, scratch that:
Third parameter of open is buffering requiring an integer.
You should pass the encoding like this:
with open('myFile.xls', 'w+', encoding='utf-8') as f:
Note that you open the file in text mode. No need to encode the string for writing.
Also your file mode 'w+' is a bit odd. I'm not sure, but I think it will truncate your file. If you want to append to the file you should use 'a' as mode.

Raw string for variables in python?

I have seen several similar posts on this but nothing has solved my problem.
I am reading a list of numbers with backslashes and writing them to a .csv. Obviously the backslashes are causing problems.
addr = "6253\342\200\2236387"
with open("output.csv", 'a') as w:
    write = writer(w)
    write.writerow([addr])
I found that using r"6253\342\200\2236387" gave me exactly what I want for the output, but since I am reading my input from a file I can't use a raw string. I tried .encode('string-escape') but that gave me 6253\xe2\x80\x936387 as output, which is definitely not what I want. unicode-escape gave me an error. Any thoughts?
The r in front of a string is only for defining a string. If you're reading data from a file, it's already 'raw'. You shouldn't have to do anything special when reading in your data.
Note that if your data is not plain ascii, you may need to decode it or read it in binary. For example, if the data is utf-8, you can open the file like this before reading:
import codecs
f = codecs.open("test", "r", "utf-8")
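A short experiment shows the difference between a literal in source code and text read from a file. This is a sketch assuming Python 3, with a temporary file standing in for the real input: escape sequences are only interpreted when Python compiles a string literal, never when data is read.

```python
import os
import tempfile

# In source code, \342, \200 and \223 are octal escapes: one character each.
literal = "6253\342\200\2236387"
# The r prefix keeps the backslashes as ordinary characters.
raw_literal = r"6253\342\200\2236387"

path = os.path.join(tempfile.mkdtemp(), "input.txt")
with open(path, "w", encoding="ascii") as f:
    f.write(raw_literal + "\n")

# Text read from a file is never escape-processed, so it matches the raw literal.
with open(path, encoding="ascii") as f:
    from_file = f.readline().rstrip("\n")

print(len(literal), len(raw_literal), from_file == raw_literal)
```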
Text file contains...
1234\4567\7890
41\5432\345\6789
Code:
with open('c:/tmp/numbers.csv', 'ab') as w:
    f = open(textfilepath)
    wr = csv.writer(w)
    for line in f:
        line = line.strip()
        wr.writerow([line])
    f.close()
This produced a csv with whole lines in a column. Maybe use 'ab' rather than 'a' as your file open type. I was getting extra blank records in my csv when using just 'a'.
I created this awhile back. This helps you write to a csv file.
def write2csv(fileName,theData):
    theFile = open(fileName+'.csv', 'a')
    wr = csv.writer(theFile, delimiter = ',', quoting=csv.QUOTE_MINIMAL)
    wr.writerow(theData)

How to write Unix end of line characters in Windows?

How can I write to files using Python (on Windows) and use the Unix end of line character?
e.g. When doing:
f = open('file.txt', 'w')
f.write('hello\n')
f.close()
Python automatically replaces \n with \r\n.
The modern way: use newline=''
Use the newline= keyword parameter to io.open() to use Unix-style LF end-of-line terminators:
import io
f = io.open('file.txt', 'w', newline='\n')
This works in Python 2.6+. In Python 3 you could also use the builtin open() function's newline= parameter instead of io.open().
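To confirm what actually lands on disk, read the file back in binary mode. A small sketch with a temporary path made up for the example; it behaves the same on Windows and Unix:

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "file.txt")

# newline="\n" tells the text layer to write "\n" as-is on every platform.
with io.open(path, "w", newline="\n") as f:
    f.write(u"hello\n")

# Binary mode shows the bytes exactly as stored.
with io.open(path, "rb") as f:
    raw = f.read()

print(repr(raw))
```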
The old way: binary mode
The old way to prevent newline conversion, which does not work in Python 3, is to open the file in binary mode to prevent the translation of end-of-line characters:
f = open('file.txt', 'wb') # note the 'b' meaning binary
but in Python 3, binary mode will read bytes and not characters so it won't do what you want. You'll probably get exceptions when you try to do string I/O on the stream. (e.g. "TypeError: 'str' does not support the buffer interface").
For Python 2 & 3
See: The modern way: use newline='' answer on this very page.
For Python 2 only (original answer)
Open the file as binary to prevent the translation of end-of-line characters:
f = open('file.txt', 'wb')
Quoting the Python manual:
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
You'll need to use the binary pseudo-mode when opening the file.
f = open('file.txt', 'wb')
import os
import shutil
def dos2unix(inp_file, out_file=None):
    if out_file:
        out_file_tmp = out_file
    else:
        out_file_tmp = inp_file + '_tmp'
    if os.path.isfile(out_file_tmp):
        os.remove(out_file_tmp)
    with open(out_file_tmp, "w", newline='\n') as fout:
        with open(inp_file, "r") as fin:
            lines = fin.readlines()
            # Remove only the line ending; strip() would also eat indentation.
            lines = map(lambda line: line.rstrip('\r\n') + '\n', lines)
            fout.writelines(lines)
    if not out_file:
        shutil.move(out_file_tmp, inp_file)
        print(f'dos2unix() {inp_file} is overwritten with converted data !')
    else:
        print(f'dos2unix() {out_file} is created with converted data !')
