remove <?xml version="1.0" ?> using xml.dom.minidom - python

I am generating XML files using xml.dom.minidom. Every time I generate a file, <?xml version="1.0" ?> appears on the very first row, and the generated file looks like this:
<?xml version="1.0" ?>
<Root>
data
</Root>
Is there any way to have the output without it, so that my output looks like this:
<Root>
data
</Root>

The best solution I found was to write out .childNodes[0], i.e. write out:
doc.childNodes[0].toprettyxml()
to the file, which will omit the xml version tag.
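A minimal sketch of the same idea, assuming the document is built with minidom as in the question. It uses documentElement, the standard DOM attribute for the root element, which is a little more robust than childNodes[0] because the first child could also be a comment or a doctype node. The element and text content simply mirror the sample output above.

```python
from xml.dom import minidom

# Build a tiny document equivalent to the one in the question.
doc = minidom.Document()
root = doc.createElement("Root")
root.appendChild(doc.createTextNode("data"))
doc.appendChild(root)

# Serializing the Document node emits the XML declaration ...
with_declaration = doc.toprettyxml(indent="  ")

# ... while serializing only the root element does not.
without_declaration = doc.documentElement.toprettyxml(indent="  ")
```

Writing without_declaration to the file instead of with_declaration gives output that starts directly with the root element.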

If you are happy just to trim the first line from the file, use this code:
f = open( 'file.txt', 'r' )
lines = f.readlines()
f.close()
f = open( 'file.txt', 'w' )
# readlines() keeps the line endings, so join without adding extra newlines
f.write( ''.join( lines[1:] ) )
f.close()

This does the job, where old_data is the XML to strip (it assumes old_data actually contains a declaration):
new_data = old_data[old_data.find("?>")+2:]
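If the input may or may not start with a declaration, a slightly more defensive variant could look like the sketch below. The function name strip_xml_declaration is made up for illustration; the regular expression only matches a declaration at the very start of the text and leaves other processing instructions such as <?xml-stylesheet ...?> alone.

```python
import re

# Matches optional leading whitespace, an XML declaration, and the
# whitespace that follows it. Anchored at the start of the text only.
_DECLARATION = re.compile(r"^\s*<\?xml\s.*?\?>\s*", re.DOTALL)


def strip_xml_declaration(text):
    """Return text without a leading XML declaration, if there is one."""
    return _DECLARATION.sub("", text, count=1)


sample = '<?xml version="1.0" ?>\n<Root>\ndata\n</Root>'
stripped = strip_xml_declaration(sample)
```

Text without a declaration is returned unchanged, so the function is safe to apply more than once.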

Related

Is there a way to easily escape or modify certain characters in a string in Python?

Currently, my rss script generates a rss feed for a website using an API. It worked until I forgot that some special characters aren't allowed in xml format. What is the best way to get rid of or escape the & symbol?
Here's the code:
import requests
import os
def generate_rss(data):
    rss = """\
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
<title>ComicK - RSS Feed</title>
<link>https://github.com/ld3z/manga-rss</link>
<description>A simple RSS feed for ComicK!</description>
"""
    for i in data:
        c = i["md_comics"]
        rss += """\
<item>
<title>{}</title>
<link>{}</link>
<description>{}</description>
</item>
""".format(
            f"{c['title']} - Chapter {i['chap']}",
            f"https://comick.app/comic/{c['slug']}",
            f"Chapter {i['chap']} of {c['title']} is now available on ComicK!",
        )
    rss += "\n</channel>\n</rss>"
    return rss
url = "https://api.comick.app/chapter/?lang=en&page=1&order=new&accept_mature_content=true"
data = requests.get(url).json()
filename = f"./comick/comick-rss-nsfw.xml"
os.makedirs(os.path.dirname(filename), exist_ok=True)
with open(filename, "w", encoding="utf-8") as f_out:
    print(generate_rss(data), file=f_out)
I think it would have to be put into a list, but then I am not entirely sure if it would still be the same.
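One way to handle this without restructuring the script is to escape each value before it goes into the template. The sketch below uses xml.sax.saxutils.escape from the standard library, which replaces &, < and > with their XML entities. The names ITEM_TEMPLATE and build_item, and the sample values, are only illustrative and are not part of the original script.

```python
from xml.sax.saxutils import escape

# Same item layout as in the question, with named placeholders.
ITEM_TEMPLATE = """\
<item>
<title>{title}</title>
<link>{link}</link>
<description>{description}</description>
</item>
"""


def build_item(title, link, description):
    """Return one <item> block with every text value XML-escaped."""
    return ITEM_TEMPLATE.format(
        title=escape(title),
        link=escape(link),
        description=escape(description),
    )


item = build_item(
    "Tom & Jerry - Chapter 1",
    "https://example.com/comic?a=1&b=2",
    "Chapter 1 of Tom & Jerry is now available!",
)
```

Escaping at the point where values enter the XML keeps the rest of the string building unchanged; no list is needed.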

Removing multiple XML declaration from document

I have a file that has multiple XML declarations.
<?xml version="1.0" encoding="UTF-8"?>
I am currently reading the file as a .txt file and rewriting each line that is not an XML declaration into a new .txt file. As I have many such files, this method takes too long (around 20 minutes per file). I wanted to know if there was an easier way to do this.
I am using Python to do this. The files are sitting on my laptop, and each file is around 11 million lines (450 MB).
My code for iterating through the file and removing the declarations is below.
month_file = "2015-01.nml.txt"
delete_lines = [
    '<?xml version="1.0" encoding="ISO-8859-1" ?>',
    '<?xml version="1.0" encoding="iso-8859-1" ?>',
    '<!DOCTYPE doc SYSTEM "djnml-1.0b.dtd">',
]
with open(month_file, encoding="ISO-8859-1") as in_fh:
    while True:
        line = in_fh.readline()
        if not line: break
        if any(x in line for x in delete_lines):
            continue
        else:
            out_fh = open('myfile_faster.xml', "a")
            out_fh.write(line)
            out_fh.close()
This is essentially the same as your version, but it opens the input and output files just once, has a single if condition, and writes to the output as it iterates through the input (sort of like sed).
with open(in_file, mode="rt") as f_in, open(out_file, mode="wt") as f_out:
    for line in f_in:
        if (
            not line
            or line.startswith("<?xml")
            or line.startswith("<!DOCTYPE")
        ):
            continue
        f_out.write(line)

Read XML file in Python, search for key in a line and replace value

I have an XML file that looks like this!
<?xml version="1.0"?>
<root>
<child>
<add key="setid" value=".\print\data1" />
<add key="getid" value=".\print\data2" />
<add key="holdingid" value=".\print\data3" />
</child>
</root>
I want to read a line in the XML, search for a key match and replace a value in that line with .\donotpritnnt\data1
I can do this in NANT using XMLPOKE and xpath.
Tried this with dict, list (replacing with positionID), split (not good enough code to show here)
How can I solve this?
Assuming your XML code is in test.xml in the working directory.
new_content = []
with open("test.xml") as f:
    content = f.readlines()
for line in content:
    newline = line
    if 'key="setid"' in line:
        # Backslashes are doubled so Python does not treat \d as an escape sequence
        newline = line[:line.find("value=")] + 'value=".\\donotpritnnt\\data1" />\n'
    new_content += [newline]
with open('test.xml', 'w') as f:
    for line in new_content:
        f.write(line)
It's not the nicest way, but it works for the example you provided, and it may also be extendable since it iterates over the file and looks at each line. It could probably be improved further by using a regex, etc.
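For comparison, here is a sketch of the same replacement done with an XML parser instead of line matching, which is closer to what XMLPOKE with an XPath does. It uses xml.etree.ElementTree from the standard library; the helper name set_value is made up for illustration, and the key is assumed not to contain quote characters.

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample document so the sketch is self-contained.
SAMPLE = r"""<root>
  <child>
    <add key="setid" value=".\print\data1" />
    <add key="getid" value=".\print\data2" />
    <add key="holdingid" value=".\print\data3" />
  </child>
</root>"""


def set_value(root, key, new_value):
    """Set the value attribute on every <add> whose key matches; return the count."""
    matches = root.findall(".//add[@key='{}']".format(key))
    for element in matches:
        element.set("value", new_value)
    return len(matches)


root = ET.fromstring(SAMPLE)
changed = set_value(root, "setid", r".\donotpritnnt\data1")
result = ET.tostring(root, encoding="unicode")
```

For a file on disk, ET.parse("test.xml") returns a tree whose getroot() can be passed to the helper, and tree.write("test.xml") saves the change.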

Generate output files from template file and csv data in python

I need to generate XML files populated with data from a CSV file in Python.
I have two input files:
one CSV file named data.csv containing data like this:
ID   YEAR  PASS          LOGIN     HEX_LOGIN
14Z  2013  (3e?k<.P#H}l  hex0914Z  F303935303031345A
14Z  2014  EAeW+ZM..--r  hex0914Z  F303935303031345A
.......
One Template file named template.xml
<?xml version="1.0"?>
<SecurityProfile xmlns="security_profile_v1">
<year></year>
<security>
<ID></ID>
<login></login>
<hex_login></hex_login>
<pass></pass>
</security>
</SecurityProfile>
I want to get as many output files as there are lines in the CSV data file, each output file named YEAR_ID, with the data from the CSV file in the XML fields:
Output file contents:
Content of output file #1 named 2013_0950014z:
<?xml version="1.0"?>
<SecurityProfile xmlns="security_profile_v1">
<year>2013</year>
<security>
<ID>14Z</ID>
<login>hex0914</login>
<hex_login>F303935303031345A</hex_login>
<pass>(3e?k<.P#H}l</pass>
</security>
</SecurityProfile>
Content of output file #2 named 2014_0950014z:
<?xml version="1.0"?>
<SecurityProfile xmlns="security_profile_v1">
<year>2014</year>
<security>
<ID>14Z</ID>
<login>hex0914</login>
<hex_login>F303935303031345A</hex_login>
<pass>EAeW+ZM..--r</pass>
</security>
</SecurityProfile>
Thank you for your suggestions.
Can you make changes to the template? If so, I would do the following to make this a bit simpler:
<?xml version="1.0"?>
<SecurityProfile xmlns="security_profile_v1">
<year>{year}</year>
<security>
<ID>{id}</ID>
<login>{login}</login>
<hex_login>{hex_login}</hex_login>
<pass>{pass}</pass>
</security>
</SecurityProfile>
Then, something like this would work:
import csv
input_file_name = "some_file.csv" #name/path of your csv file
template_file_name = "some_file.xml" #name/path of your xml template
output_file_name = "{}_09500{}.xml"
with open(template_file_name,"r") as template_file:
    template = template_file.read()
with open(input_file_name,"r") as csv_file:
    # add delimiter='\t' here if the file is tab-separated
    my_reader = csv.DictReader(csv_file)
    for row in my_reader:
        with open(output_file_name.format(row["YEAR"],row["ID"]),"w") as current_out:
            # "pass" is a Python keyword, so pass=... is a syntax error;
            # unpacking a dict fills the {pass} field instead
            current_out.write(template.format(**{"year": row["YEAR"],
                                                 "id": row["ID"],
                                                 "login": row["LOGIN"],
                                                 "hex_login": row["HEX_LOGIN"],
                                                 "pass": row["PASS"]}))
If you can't modify the template, or want to process it as XML instead of basic string manipulation, then it's a bit more involved.
EDIT:
Modified answer to use csv.DictReader rather than csv.reader.
Fixed variable names opening input CSV file and writing the output. Removed 'binary' mode file operations.
import csv
from collections import defaultdict
header = '<?xml version="1.0"?><SecurityProfile xmlns="security_profile_v1">\n'
footer = '\n</SecurityProfile>'
entry = '''<security>
<ID>{0[ID]}</ID>
<login>{0[LOGIN]}</login>
<hex_login>{0[HEX_LOGIN]}</hex_login>
<pass>{0[PASS]}</pass>
</security>'''
rows = defaultdict(list)
with open('infile.csv') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for item in reader:
        # group rows by year; the key comes from the row, not from the reader
        rows[item['YEAR']].append(item)
for year,data in rows.items():
    with open('{}.xml'.format(year), 'w') as f:
        f.write(header)
        f.write('<year>{}</year>\n'.format(year))
        for record in data:
            f.write(entry.format(record))
            f.write('\n')
        f.write(footer)
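If the template has to stay untouched, one option is to fill it in as XML rather than as a string. The sketch below uses xml.etree.ElementTree from the standard library and inlines the template so it is self-contained; the names FIELDS and fill_template are made up for illustration. A side benefit is that special characters such as the < in the first password are escaped automatically, which plain string formatting does not do.

```python
import xml.etree.ElementTree as ET

NS = "security_profile_v1"
# Keep the namespace as the default one (no prefix) when serializing.
ET.register_namespace("", NS)

TEMPLATE = """<SecurityProfile xmlns="security_profile_v1">
<year></year>
<security>
<ID></ID>
<login></login>
<hex_login></hex_login>
<pass></pass>
</security>
</SecurityProfile>"""

# Template tag name -> CSV column name.
FIELDS = {
    "year": "YEAR",
    "ID": "ID",
    "login": "LOGIN",
    "hex_login": "HEX_LOGIN",
    "pass": "PASS",
}


def fill_template(row):
    """Return the template as a string with each field set from one CSV row."""
    root = ET.fromstring(TEMPLATE)
    for tag, column in FIELDS.items():
        element = root.find(".//{{{}}}{}".format(NS, tag))
        element.text = row[column]
    return ET.tostring(root, encoding="unicode")


row = {"ID": "14Z", "YEAR": "2013", "PASS": "(3e?k<.P#H}l",
       "LOGIN": "hex0914Z", "HEX_LOGIN": "F303935303031345A"}
filled = fill_template(row)
```

Each row from csv.DictReader can be passed to fill_template and the result written to its own output file.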

Generating xml in python and lxml

I have this XML from SQL, and I want to generate the same with Python 2.7 and lxml:
<?xml version="1.0" encoding="utf-16"?>
<results>
<Country name="Germany" Code="DE" Storage="Basic" Status="Fresh" Type="Photo" />
</results>
Now I have:
from lxml import etree
# create XML
results= etree.Element('results')
country= etree.Element('country')
country.text = 'Germany'
results.append(country)
filename = "xmltestthing.xml"
FILE = open(filename,"w")
FILE.writelines(etree.tostring(results, pretty_print=True))
FILE.close()
Do you know how to add the rest of the attributes?
Note this also prints the BOM
>>> from lxml.etree import tostring
>>> from lxml.builder import E
>>> print tostring(
...     E.results(
...         E.Country(name='Germany',
...                   Code='DE',
...                   Storage='Basic',
...                   Status='Fresh',
...                   Type='Photo')
...     ), pretty_print=True, xml_declaration=True, encoding='UTF-16')
��<?xml version='1.0' encoding='UTF-16'?>
<results>
<Country Status="Fresh" Type="Photo" Code="DE" Storage="Basic" name="Germany"/>
</results>
from lxml import etree
# Create the root element
page = etree.Element('results')
# Make a new document tree
doc = etree.ElementTree(page)
# Add the subelements
pageElement = etree.SubElement(page, 'Country',
                               name='Germany',
                               Code='DE',
                               Storage='Basic')
# For multiple attributes, use as shown above
# Save to XML file
outFile = open('output.xml', 'w')
doc.write(outFile, xml_declaration=True, encoding='utf-16')
Save to the XML file by passing the filename directly:
doc.write('output.xml', xml_declaration=True, encoding='utf-16')
instead of:
outFile = open('output.xml', 'w')
doc.write(outFile, xml_declaration=True, encoding='utf-16')
Promoting my comment to an answer:
@sukbir is probably not using Windows. What happens is that lxml writes a newline (0A 00 in UTF-16LE) between the XML header and the body. Windows text mode then mangles this into 0D 0A 00, which makes everything after it look like UTF-16BE, hence the Chinese etc. characters when you display it. You can get around this in this instance by using "wb" instead of "w" when you open the file. However, I'd strongly suggest that you use 'UTF-8' (spelled EXACTLY like that) as your encoding. Why are you using UTF-16? Do you like large files and/or weird problems?
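A small sketch of that advice: write the serialized bytes to a file opened in binary mode and use UTF-8. To stay self-contained it uses the standard library's xml.etree.ElementTree as a stand-in for lxml (lxml.etree offers a similar Element / SubElement / ElementTree.write interface), and the temporary output path is only for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Build the document from the question, with all five attributes.
results = ET.Element("results")
ET.SubElement(results, "Country", name="Germany", Code="DE",
              Storage="Basic", Status="Fresh", Type="Photo")
tree = ET.ElementTree(results)

# Illustrative output location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "output.xml")

# Binary mode ("wb") means no newline translation touches the encoded bytes.
with open(path, "wb") as out_file:
    tree.write(out_file, encoding="UTF-8", xml_declaration=True)

# Read the bytes back to inspect what was written.
with open(path, "rb") as in_file:
    raw = in_file.read()
```

Passing the filename itself to write(), as in the previous answer, avoids the text-mode problem in the same way because the library then opens the file on its own.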
