Character encoding compatibility between python2 and python3

I am writing plist files using ElementTree, and I need to prepend two lines of text before the tree starts, to match Apple's plist syntax. The following code works in python 2.7, but it fails in python 3.6 with TypeError: write() argument must be str, not bytes.
import xml.etree.ElementTree as ET
tree = ET.parse('com.input.plist')
with open('com.new.plist', 'w') as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">\n')
    tree.write(f, encoding='utf-8')
To get this working on python3, I can change it like so:
tree = ET.parse('com.input.plist')
with open('com.new.plist', 'w') as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">\n')
    tree.write(f, encoding='unicode')
But this fails in python2 with LookupError: unknown encoding: unicode. How can I make this compatible with both versions?

I found the solution: open the file in binary mode, then call encode() on the header string before writing it. tree.write() then writes bytes to the same file in both versions.
with open('com.new.plist', 'wb') as f:
    xml_header = '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">\n'
    f.write(xml_header.encode())
    tree.write(f)
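
A minimal sketch of the same idea, assuming an in-memory io.BytesIO in place of the real file so that it runs anywhere. The helper name write_plist and the sample element tree are hypothetical. Passing encoding='utf-8' to tree.write() is an addition to the answer above, made so that the body matches the encoding the header declares.

```python
import io
import xml.etree.ElementTree as ET

PLIST_HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" '
    '"http://www.apple.com/DTDs/PropertyList-1.0.dtd">\n'
)


def write_plist(tree, stream):
    """Write the plist header, then the tree, to a binary stream (hypothetical helper)."""
    # Encode the header ourselves so the stream only ever receives bytes.
    stream.write(PLIST_HEADER.encode('utf-8'))
    # Suppress ElementTree's own declaration; the header already has one.
    tree.write(stream, encoding='utf-8', xml_declaration=False)


# Hypothetical sample document standing in for com.input.plist.
root = ET.fromstring(
    '<plist version="1.0"><dict><key>Label</key>'
    '<string>com.example</string></dict></plist>'
)
buf = io.BytesIO()
write_plist(ET.ElementTree(root), buf)
data = buf.getvalue()
```

With a real file, the only change would be opening it with mode 'wb' and passing the file object instead of buf.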

Related

Removing multiple XML declaration from document

I have a file that has multiple XML declarations.
<?xml version="1.0" encoding="UTF-8"?>
I am currently reading the file as a .txt file and rewriting each line that is not an XML declaration into a new .txt file. As I have many such document files, this method is taking time (around 20 minutes per file). I wanted to know if there was an easier way to do this.
I am using Python to do this. The files are sitting on my laptop and each file is around 11 million lines (450 MB in size).
My code for iterating through the file and removing the declarations is below.
month_file = "2015-01.nml.txt"
delete_lines = [
    '<?xml version="1.0" encoding="ISO-8859-1" ?>',
    '<?xml version="1.0" encoding="iso-8859-1" ?>',
    '<!DOCTYPE doc SYSTEM "djnml-1.0b.dtd">',
]
with open(month_file, encoding="ISO-8859-1") as in_fh:
    while True:
        line = in_fh.readline()
        if not line: break
        if any(x in line for x in delete_lines):
            continue
        else:
            out_fh = open('myfile_faster.xml', "a")
            out_fh.write(line)
            out_fh.close()

This is essentially the same as your version, but it opens the input and output files just once, has a single if condition, and writes to the output as it iterates through the input (sort of like sed).
with open(in_file, mode="rt") as f_in, open(out_file, mode="wt") as f_out:
    for line in f_in:
        if (
            not line
            or line.startswith("<?xml")
            or line.startswith("<!DOCTYPE")
        ):
            continue
        f_out.write(line)
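
The single-pass filter can be tried without touching the disk. This is a sketch under assumptions: the function name copy_without_declarations is hypothetical, and io.StringIO objects stand in for the two open files, with a tiny sample in place of the 450 MB input.

```python
import io


def copy_without_declarations(f_in, f_out, prefixes=("<?xml", "<!DOCTYPE")):
    """Copy lines from f_in to f_out, skipping lines that start with a prefix.

    Hypothetical helper; returns the number of lines kept.
    """
    kept = 0
    for line in f_in:
        # str.startswith accepts a tuple, so one call covers every prefix.
        if line.startswith(prefixes):
            continue
        f_out.write(line)
        kept += 1
    return kept


# Made-up sample: two concatenated documents, each with its own declaration.
sample = (
    '<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    '<!DOCTYPE doc SYSTEM "djnml-1.0b.dtd">\n'
    '<doc>\n<p>first</p>\n</doc>\n'
    '<?xml version="1.0" encoding="iso-8859-1" ?>\n'
    '<!DOCTYPE doc SYSTEM "djnml-1.0b.dtd">\n'
    '<doc>\n<p>second</p>\n</doc>\n'
)
source = io.StringIO(sample)
target = io.StringIO()
kept = copy_without_declarations(source, target)
result = target.getvalue()
```

For the real files, the two streams would be opened once with open(), passing encoding="ISO-8859-1" as in the question.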

Write KML file from another

I'm trying to:
- read a KML file
- remove the Placemark element if name = 'ZONE'
- write a new KML file without the element
This is my code:
from pykml import parser
kml_file_path = '../Source/Lombardia.kml'
removeList = list()
with open(kml_file_path) as f:
    folder = parser.parse(f).getroot().Document.Folder
for pm in folder.Placemark:
    if pm.name == 'ZONE':
        removeList.append(pm)
        print pm.name
for tag in removeList:
    parent = tag.getparent()
    parent.remove(tag)
#Write the new file
#I cannot reach the solution help me
and this is the KML:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
  <Document>
    <name>Lombardia</name>
    <Style>
      ...
    </Style>
    <Folder>
      <Placemark>
        <name>ZOGNO</name>
        <styleUrl>#FEATURES_LABELS</styleUrl>
        <Point>
          <coordinates>9.680530595139061,45.7941656233647,0</coordinates>
        </Point>
      </Placemark>
      <Placemark>
        <name>ZONE</name>
        <styleUrl>#FEATURES_LABELS</styleUrl>
        <Point>
          <coordinates>10.1315885854064,45.7592449779275,0</coordinates>
        </Point>
      </Placemark>
    </Folder>
  </Document>
</kml>
The problem is that when I write the new KML file, it still has the element I want to delete.
Specifically, I want to delete the Placemark element whose name is ZONE.
What am I doing wrong?
Thank you.
--- Final Code
This is the working code, thanks to @Dawid Ferenczy:
from lxml import etree
import pykml
from pykml import parser
kml_file_path = '../Source/Lombardia.kml'
# parse the input file into an object tree
with open(kml_file_path) as f:
    tree = parser.parse(f)
# get a reference to the "Document.Folder" node
folder = tree.getroot().Document.Folder
# iterate through all "Document.Folder.Placemark" nodes and find and remove all nodes
# which contain child node "name" with content "ZONE"
for pm in folder.Placemark:
    if pm.name == 'ZONE':
        parent = pm.getparent()
        parent.remove(pm)
# convert the whole object tree into a string and write it into an output file
# (binary mode, because etree.tostring() returns a byte string)
with open('output.kml', 'wb') as output:
    output.write(etree.tostring(tree, pretty_print=True))

Consider XSLT, the special-purpose language designed to transform XML files. Because KML files are XML files, this solution is viable. Python's third-party module lxml can run XSLT 1.0 scripts, and it does so without a single explicit loop.
Specifically, the XSLT script runs the identity transform to copy the entire document as is. Then an empty template matches the element to remove (conditional on specific logic), which drops it from the output. To accommodate the default namespace, a prefix, doc, is bound to it and used in the XPath expressions.
XSLT (save as .xsl file, a special .xml file to be loaded in Python below)
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:doc="http://earth.google.com/kml/2.2">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="doc:Placemark[doc:name='ZONE']"/>
</xsl:stylesheet>
Python
import lxml.etree as et
# LOAD XML AND XSL
doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/XSLT_Script.xsl')
# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)
# RUN TRANSFORMATION
result = transform(doc)
# PRINT RESULT
print(result)
# SAVE TO FILE
with open('output.xml', 'wb') as f:
    f.write(result)
Output
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
  <Document>
    <name>Lombardia</name>
    <Style>
      ...
    </Style>
    <Folder>
      <Placemark>
        <name>ZOGNO</name>
        <styleUrl>#FEATURES_LABELS</styleUrl>
        <Point>
          <coordinates>9.680530595139061,45.7941656233647,0</coordinates>
        </Point>
      </Placemark>
    </Folder>
  </Document>
</kml>

You have the following issues in your code:
- you're not storing the whole parsed object tree anywhere (you have just a reference to the node "Document.Folder": folder = parser.parse(f).getroot().Document.Folder), but you want to write it back into a file, so you need to store it
- I don't understand why you need two loops and the list removeList when you can delete elements directly in the first loop
- you're not reading the documentation: how to write the object tree into a file is well described under the examples in the pykml library's documentation
Try the following code:
from lxml import etree
from pykml import parser
kml_file_path = './input.kml'
# parse the input file into an object tree
with open(kml_file_path) as f:
    tree = parser.parse(f)
# get a reference to the "Document.Folder" node
folder = tree.getroot().Document.Folder
# iterate through all "Document.Folder.Placemark" nodes and find and remove all nodes
# which contain child node "name" with content "ZONE"
for pm in folder.Placemark:
    if pm.name == 'ZONE':
        parent = pm.getparent()
        parent.remove(pm)
# convert the object tree into a string and write it into an output file
# (binary mode, because etree.tostring() returns a byte string)
with open('output.kml', 'wb') as output:
    output.write(etree.tostring(tree, pretty_print=True))
It's very simple:
- the KML file is parsed into an object tree and stored in the variable tree
- the same object tree is directly manipulated (the element is removed)
- the same object tree is written back into a file
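
The same three steps (parse, remove, serialize the whole tree) can also be sketched with only the standard library's xml.etree.ElementTree. This swaps out pykml and lxml, so it is not the answer's code. The helper name remove_placemarks and the shortened sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://earth.google.com/kml/2.2"
NS = {"k": KML_NS}
# Keep the default namespace unprefixed when the tree is serialized again.
ET.register_namespace("", KML_NS)


def remove_placemarks(root, name):
    """Remove every Placemark whose <name> equals `name` (hypothetical helper)."""
    removed = 0
    # Snapshot the elements first so the tree is not modified while iterating.
    for parent in list(root.iter()):
        for pm in parent.findall("k:Placemark", NS):
            if pm.findtext("k:name", namespaces=NS) == name:
                parent.remove(pm)
                removed += 1
    return removed


# Shortened sample based on the KML in the question.
sample = (
    '<kml xmlns="http://earth.google.com/kml/2.2"><Document><Folder>'
    '<Placemark><name>ZOGNO</name></Placemark>'
    '<Placemark><name>ZONE</name></Placemark>'
    '</Folder></Document></kml>'
)
root = ET.fromstring(sample)
removed = remove_placemarks(root, "ZONE")
names = [
    pm.findtext("k:name", namespaces=NS)
    for pm in root.iter("{%s}Placemark" % KML_NS)
]
# Serialize the whole tree, not just the Folder node.
output = ET.tostring(root, encoding="unicode")
```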

remove <?xml version="1.0" ?> using xml.dom.minidom

I am generating XML files using xml.dom.minidom. Every time I generate a file, <?xml version="1.0" ?> appears on the very first row, and the generated file looks like this:
<?xml version="1.0" ?>
<Root>
data
</Root>
Is there any way to get the output without it? My output should look like this:
<Root>
data
</Root>

The best solution I found was to write out .childNodes[0], i.e. write out:
doc.childNodes[0].toprettyxml()
to the file, which will omit the xml version tag.
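
A short sketch of the difference, assuming a small made-up document parsed with parseString. It uses toxml() rather than toprettyxml() only so that the resulting strings are easy to compare.

```python
from xml.dom.minidom import parseString

# Made-up sample document.
doc = parseString("<Root><item>data</item></Root>")

# Serializing the Document node includes the XML declaration.
with_declaration = doc.toxml()

# Serializing the first child (here the root element) leaves it out.
without_declaration = doc.childNodes[0].toxml()
```

When the root element is the document's first child, as here, doc.documentElement refers to the same node as doc.childNodes[0].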

If you are happy just to trim the first line from the file, use this code:
with open('file.txt', 'r') as f:
    lines = f.readlines()
# readlines() keeps each line's trailing newline, so write the rest back unchanged
with open('file.txt', 'w') as f:
    f.writelines(lines[1:])

This does the job where old_data is the xml to strip
new_data = old_data[old_data.find("?>")+2:]
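
One caveat with the one-liner: when the string contains no "?>", find() returns -1 and the slice drops the first character. A slightly more defensive sketch follows. The function name strip_xml_declaration is hypothetical, and stripping the line break after the declaration is an added assumption about what is wanted.

```python
def strip_xml_declaration(text):
    """Return text without a leading XML declaration (hypothetical helper)."""
    stripped = text.lstrip()
    # Only act when the text really begins with a declaration.
    if not stripped.startswith("<?xml"):
        return text
    end = stripped.find("?>")
    if end == -1:
        # Unterminated declaration: leave the input untouched.
        return text
    # Drop the declaration and the line break that follows it.
    return stripped[end + 2:].lstrip("\r\n")


old_data = '<?xml version="1.0" ?>\n<Root>\ndata\n</Root>'
new_data = strip_xml_declaration(old_data)
```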

Generating xml in python and lxml

I have this XML generated by SQL, and I want to produce the same with Python 2.7 and lxml:
<?xml version="1.0" encoding="utf-16"?>
<results>
<Country name="Germany" Code="DE" Storage="Basic" Status="Fresh" Type="Photo" />
</results>
Now I have:
from lxml import etree
# create XML
results = etree.Element('results')
country = etree.Element('country')
country.text = 'Germany'
results.append(country)
filename = "xmltestthing.xml"
FILE = open(filename, "w")
FILE.writelines(etree.tostring(results, pretty_print=True))
FILE.close()
Do you know how to add the rest of the attributes?

Note this also prints the BOM
>>> from lxml.etree import tostring
>>> from lxml.builder import E
>>> print tostring(
...     E.results(
...         E.Country(name='Germany',
...                   Code='DE',
...                   Storage='Basic',
...                   Status='Fresh',
...                   Type='Photo')
...     ), pretty_print=True, xml_declaration=True, encoding='UTF-16')
��<?xml version='1.0' encoding='UTF-16'?>
<results>
<Country Status="Fresh" Type="Photo" Code="DE" Storage="Basic" name="Germany"/>
</results>

from lxml import etree
# Create the root element
page = etree.Element('results')
# Make a new document tree
doc = etree.ElementTree(page)
# Add the subelements
pageElement = etree.SubElement(page, 'Country',
                               name='Germany',
                               Code='DE',
                               Storage='Basic')
# For multiple attributes, pass them as keyword arguments as shown above
# Save to XML file
outFile = open('output.xml', 'w')
doc.write(outFile, xml_declaration=True, encoding='utf-16')

Save to the XML file by passing the filename directly:
doc.write('output.xml', xml_declaration=True, encoding='utf-16')
instead of:
outFile = open('output.xml', 'w')
doc.write(outFile, xml_declaration=True, encoding='utf-16')

Promoting my comment to an answer:
@sukbir is probably not using Windows. What happens is that lxml writes a newline (0A 00 in UTF-16LE) between the XML header and the body. Windows text mode then mangles this into 0D 0A 00, which makes everything after it look like UTF-16BE, hence the Chinese etc. characters when you display it. You can get around this in this instance by using "wb" instead of "w" when you open the file. However, I'd strongly suggest that you use 'UTF-8' (spelled EXACTLY like that) as your encoding. Why are you using UTF-16? Do you like large files and/or weird problems?
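
The byte-level effect can be reproduced without Windows. The sketch below uses the standard library's xml.etree.ElementTree in place of lxml (a substitution, so details of the serialized output may differ from lxml's) and writes to an io.BytesIO, which behaves like a file opened with "wb". The replace() call only imitates what text-mode newline translation would do to the bytes.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("results")
ET.SubElement(root, "Country", name="Germany", Code="DE",
              Storage="Basic", Status="Fresh", Type="Photo")

# A binary stream receives the UTF-16 bytes untouched.
buf = io.BytesIO()
ET.ElementTree(root).write(buf, encoding="utf-16", xml_declaration=True)
data = buf.getvalue()
text = data.decode("utf-16")

# A newline in UTF-16LE is the two bytes 0A 00.
newline_le = "\n".encode("utf-16-le")
# Translating 0A to 0D 0A, as Windows text mode does, yields three bytes
# and shifts every following two-byte code unit by one byte.
mangled = newline_le.replace(b"\n", b"\r\n")
```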

Encoding in XML declaration python

I have created an XML file using python. But the XML declaration has only version info. How can I include encoding with XML declaration like:
<?xml version="1.0" encoding="UTF-8"?>

>>> from xml.dom.minidom import Document
>>> a=Document()
>>> a.toprettyxml(encoding="utf-8")
'<?xml version="1.0" encoding="utf-8"?>\n'
or
>>> a.toxml(encoding="utf-8")
'<?xml version="1.0" encoding="utf-8"?>'
You can set the encoding for the document.writexml() function in the same way.
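
A small sketch of both calls on a non-empty document, assuming Python 3 and a made-up Root element. Note that toxml(encoding=...) returns bytes, while writexml() writes text to whatever writer it is given, so its encoding argument only controls what the declaration says.

```python
import io
from xml.dom.minidom import Document

# Build a tiny made-up document: <Root>data</Root>
doc = Document()
root = doc.createElement("Root")
root.appendChild(doc.createTextNode("data"))
doc.appendChild(root)

# toxml() with an encoding returns bytes, declaration included.
as_bytes = doc.toxml(encoding="utf-8")

# writexml() takes the same keyword and writes to a text stream.
buf = io.StringIO()
doc.writexml(buf, encoding="utf-8")
as_text = buf.getvalue()
```

When writing to a real file with writexml(), the file would need to be opened with a matching encoding, for example open(path, "w", encoding="utf-8").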
