Split one large .xml file in more .xml files (python) - python

I've been trying to split one large .xml file in more .xml files in python for a few days now. The thing is I haven't really succeeded yet. So here I am asking for your help.
My large .xml file looks like this:
<Root>
<Testcase>
<Info1>[]<Info1>
<Info2>[]<Info2>
</Testcase>
<Testcase>
<Info1>[]<Info1>
<Info2>[]<Info2>
<Testcase>
...
...
...
<Testcase>
<Info1>[]<Info1>
<Info2>[]<Info2>
<Testcase>
</Root>
It has over 2000 children and what I would like to do is to parse this .xml file and split in smaller .xml files with 100 children each. That would result in 20 new .xml files.
How can I do that?
Thank you!
L.E.:
I've tried to parse the .xml file using xml.etree.ElementTree
import xml.etree.ElementTree as ET
file = open('Testcase.xml', 'r')
tree = ET.parse(file)
total_testcases = 0
for Testcase in root.findall('Testcase'):
total_testcases+=1
nr_of_files = (total_testcases/100)+1
for i in range(nr_of_files+1):
tree.write('Testcase%d.xml' % (i), encoding="UTF-8")
The thing is I don't know how to specifically get only the Testcases and copy them to another file...

Actually, root.findall('Testcase') will return a list of "Testcase" sub elements.
So what need to do is:
create root
add sub elements to root.
Here is example:
>>> tcs = root.findall('Testcase')
>>> tcs
[<Element 'Testcase' at 0x23e14e0>, <Element 'Testcase' at 0x23e1828>]
>>> len(tcs)
2
>>> r = ET.Element('Root')
>>> r.append(tcs[0])
>>> ET.tostring(r, 'utf-8')
'<Root><Testcase>\n <Info1>[]</Info1>\n <Info2>[]</Info2>\n </Testcase>\n </Root>'

Related

xml.etree.ElementTree how can i add attribute inside of a node?

I want the build an xml file and i made some research. I decided the use xml tree but i couldn't manage the use it like i want.
I want the generate this xml.
<Invoice test="how can i generate this ?">
</Invoice>
i am doing in python
import xml.etree.ElementTree as gfg
def GenerateXML(fileName):
root = gfg.Element("Invoice")
root.tail = 'test="how can i generate this ?"'
tree = gfg.ElementTree(root)
with open(fileName, "wb") as files:
tree.write(files)
It's generate xml file look like:
<Invoice />test="how can i generate this ?"
I know i shouln't use tail for i want. But i can't find a way for the make a xml look like what i want. Thank you for help.
This piece of XML structure is called "attribute".
You can get it using the set(attr, value) method.
import xml.etree.ElementTree as gfg
def GenerateXML(fileName):
root = gfg.Element("Invoice")
root.set('test', 'how can i generate this ?')
tree = gfg.ElementTree(root)
with open(fileName, "wb") as files:
tree.write(files)
GenerateXML('test.xml')
test.xml:
<Invoice test="how can i generate this ?" />

How to write Element Tree dump into file

I am try to write the xml dump into the another file. Here is my python code
import xml.etree.ElementTree as ET
tree = ET.parse('extract_orginal.xml')
root = tree.getroot()
with open('extract.xml', 'w') as extract:
for item in root.findall(f"doc[#id='289e1292134534']"):
extract.write(ET.dump(item))
Getting the output as "NONE" in the extract.xml file. Can you please help me.
From the docs of .dump():
"Write element tree or element structure to sys.stdout. This function should be used for debugging only."
The function .dump() returns None!
I think you want to use .tostring():
import xml.etree.ElementTree as ET
tree = ET.parse('extract_orginal.xml')
root = tree.getroot()
with open('extract.xml', 'w') as extract:
for item in root.findall(f"doc[#id='289e1292134534']"):
extract.write(ET.tostring(item, encoding="utf-8"))

Append XML responses in Python

I am trying to parse multiple XML responses in one file. However, when I write a responses to file, it shows only last one. I assume I need to add append somewhere in order to keep all responses.
Here is my code:
import json
import xml.etree.ElementTree as ET
#loop test
feins = ['800228936', '451957238']
for i in feins:
rr = requests.get('https://pdb-services.nipr.com/pdb-xml-reports/hitlist_xml.cgi?report_type=0&id_fein={}'.format(i),auth=('test', 'test'))
root = ET.fromstring(rr.text)
tree = ET.ElementTree(root)
tree.write("file.xml")
Try changing
for i in feins:
...
tree = ET.ElementTree(root)
tree.write("file.xml")
to (note the indentation):
for i in feins:
...
tree = ET.ElementTree(root)
with open("file.xml", "wb") as f:
tree.write(f)
and see if it works.

How to read a huge xml file using xml.etree.ElementTree

How can I read huge xml files (more than 1GB) using this code:
import xml.etree.ElementTree as ET
tree = ET.parse(file)
doc = tree.getroot()
abstracts = doc.findall('PubmedArticle/MedlineCitation/Article/Abstract')
for abstract in abstracts:
abs_text = abstract.findall('AbstractText')
ab = ''
for txt in abs_text:
ab += txt.text
collections.col_pubmed_xmls.insert({'text': ab, 'tag': tag})
after executing this code an error says that file can not be openned in this line:
ET.parse(file)
I can read small files using this code.
What to do?

Writing XML (ET.Element) to a file

I have created an XML file using xml.etree.ElementTree.
The created XML basically looks like this:
<testsuite name="Exploatering Tests">
<testsuite name="device"></testsuite>
<testsuite name="management"></testsuite>
<testsuite name="auto"></testsuite>
</testsuite>
It all looks good, but I want to export only the child elements into a file and for that I am using the following code:
# First Testsuite Level
testsuite_exploatering = ET.Element('testsuite')
testsuite_exploatering.set("name", "Exploatering Tests")
... HERE I RECURSIVELY ADD LOTS OF ELEMENT TO testsuite_exploatering
# Write XML to file
output_xml = ET.ElementTree(testsuite_exploatering)
ET.ElementTree.write(testsuite_exploatering, output_file, xml_declaration=True, encoding="UTF-8")
It writes the XML element into the file correctly but how should I modify it to print only the inner elements into a file (I don't want to have the Exploatering Tests element written to the file). want the file to look like this:
<testsuite name="device"></testsuite>
<testsuite name="management"></testsuite>
<testsuite name="auto"></testsuite>
I'm not sure that's a legal XML structure. If you really want to force the output into that format, maybe you can iterate over the children and write each of them like so:
out_handle = open('output.xml','w')
for child in testsuite_exploatering.getchildren():
ET.ElementTree.write(ET.ElementTree(child), out_handle)
out_handle.close()

Categories