How to read a huge xml file using xml.etree.ElementTree

How can I read huge XML files (more than 1 GB) using this code?
import xml.etree.ElementTree as ET
tree = ET.parse(file)
doc = tree.getroot()
abstracts = doc.findall('PubmedArticle/MedlineCitation/Article/Abstract')
for abstract in abstracts:
    abs_text = abstract.findall('AbstractText')
    ab = ''
    for txt in abs_text:
        ab += txt.text
    collections.col_pubmed_xmls.insert({'text': ab, 'tag': tag})
After executing this code, I get an error saying that the file cannot be opened at this line:
ET.parse(file)
I can read small files using this code.
What should I do?
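If the failure comes from the file being too large to hold in memory (an assumption; the exact error is not shown), one common approach is ET.iterparse, which reads the file incrementally instead of building the whole tree. The sketch below is not a definitive fix: the helper name iter_abstract_texts and the in-memory sample are made up for illustration, and the element names are taken from the question's path.

```python
import io
import xml.etree.ElementTree as ET

def iter_abstract_texts(source):
    """Yield the joined AbstractText of each PubmedArticle without building the full tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "PubmedArticle":
            parts = elem.findall("MedlineCitation/Article/Abstract/AbstractText")
            if parts:
                yield "".join("".join(p.itertext()) for p in parts)
            root.clear()  # discard already-processed articles to bound memory use

# Tiny in-memory stand-in for the real file; pass a filename for actual data.
sample = io.BytesIO(
    b"<PubmedArticleSet>"
    b"<PubmedArticle><MedlineCitation><Article><Abstract>"
    b"<AbstractText>First </AbstractText><AbstractText>part</AbstractText>"
    b"</Abstract></Article></MedlineCitation></PubmedArticle>"
    b"<PubmedArticle><MedlineCitation><Article/></MedlineCitation></PubmedArticle>"
    b"</PubmedArticleSet>"
)
texts = list(iter_abstract_texts(sample))
```

Each yielded string could then be inserted into the collection one at a time, so only a single article is held in memory at any point.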

Related

How to write Element Tree dump into file

I am trying to write the XML dump into another file. Here is my Python code:
import xml.etree.ElementTree as ET
tree = ET.parse('extract_orginal.xml')
root = tree.getroot()
with open('extract.xml', 'w') as extract:
    for item in root.findall("doc[@id='289e1292134534']"):
        extract.write(ET.dump(item))
I am getting "None" as the output in the extract.xml file. Can you please help me?

From the docs of .dump():
"Write element tree or element structure to sys.stdout. This function should be used for debugging only."
The function .dump() returns None!
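I think you want to use .tostring(). With encoding="unicode" it returns a str, which matches a file opened in text mode ('w'); with encoding="utf-8" it returns bytes, and the file would have to be opened with 'wb' instead:
import xml.etree.ElementTree as ET
tree = ET.parse('extract_orginal.xml')
root = tree.getroot()
with open('extract.xml', 'w') as extract:
    for item in root.findall("doc[@id='289e1292134534']"):
        extract.write(ET.tostring(item, encoding="unicode"))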
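To make the return types concrete, here is a small sketch on a made-up two-element document (the ids 'a' and 'b' are placeholders, not values from the question). It shows that the encoding argument decides whether .tostring() gives back str or bytes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document standing in for extract_orginal.xml.
root = ET.fromstring("<docs><doc id='a'>one</doc><doc id='b'>two</doc></docs>")
item = root.find("doc[@id='b']")

as_text = ET.tostring(item, encoding="unicode")  # str, suitable for a file opened with 'w'
as_bytes = ET.tostring(item, encoding="utf-8")   # bytes, suitable for a file opened with 'wb'
```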

Append XML responses in Python

I am trying to parse multiple XML responses into one file. However, when I write the responses to the file, it contains only the last one. I assume I need to append somewhere in order to keep all responses.
Here is my code:
import json
import requests
import xml.etree.ElementTree as ET
#loop test
feins = ['800228936', '451957238']
for i in feins:
    rr = requests.get('https://pdb-services.nipr.com/pdb-xml-reports/hitlist_xml.cgi?report_type=0&id_fein={}'.format(i),auth=('test', 'test'))
    root = ET.fromstring(rr.text)
tree = ET.ElementTree(root)
tree.write("file.xml")

Try changing
for i in feins:
    ...
tree = ET.ElementTree(root)
tree.write("file.xml")
to (note the indentation and the append mode):
for i in feins:
    ...
    tree = ET.ElementTree(root)
    with open("file.xml", "ab") as f:  # "ab" appends; "wb" would truncate the file on every pass
        tree.write(f)
and see if it works.
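Keep in mind that writing each response as its own document into one file produces several root elements, which is not a single well-formed XML document. An alternative sketch is to collect every response under one wrapper root and write once. The function name combine_responses and the wrapper tag "responses" are assumptions for illustration, and the sample strings stand in for rr.text.

```python
import xml.etree.ElementTree as ET

def combine_responses(xml_texts, wrapper_tag="responses"):
    """Parse each XML string and attach its root under one wrapper element."""
    wrapper = ET.Element(wrapper_tag)
    for text in xml_texts:
        wrapper.append(ET.fromstring(text))
    return ET.ElementTree(wrapper)

# Stand-in strings; in the question these would be rr.text from each request.
samples = ["<hitlist><id>1</id></hitlist>", "<hitlist><id>2</id></hitlist>"]
tree = combine_responses(samples)
combined = ET.tostring(tree.getroot(), encoding="unicode")
```

The resulting tree can be saved with a single tree.write("file.xml") after the loop.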

python xml won't parse

I am trying to read bulk data from the US Patent and Trademark Office. I have tried several XML files from here and get the same result each time:
import xml.etree.ElementTree as ET
import re
file = 'ipgb20210105.xml'
tree = ET.parse(file)
This yields: "ParseError: junk after document element: line 862, column 0"
I have tried the recommendation to wrap the content in a fake root node, but this doesn't work either:
with open(file) as f:
    xml = f.read()
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")
This yields: "ParseError: not well-formed (invalid token): line 2, column 2"
Any help much appreciated!
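The "junk after document element" error is consistent with a file that holds several complete XML documents concatenated one after another. Assuming that is the case here, and that each document begins with its own <?xml declaration at the start of a line (both assumptions, not confirmed by the question), a rough sketch is to split on the declaration and parse each piece separately. The name iter_documents is made up, and real files with DOCTYPE declarations or DTD-defined entities may need extra handling.

```python
import io
import xml.etree.ElementTree as ET

def iter_documents(lines):
    """Yield one parsed root per concatenated XML document found in `lines` (bytes)."""
    buffer = []
    for line in lines:
        # A new declaration marks the start of the next document.
        if line.lstrip().startswith(b"<?xml") and buffer:
            yield ET.fromstring(b"".join(buffer))
            buffer = []
        buffer.append(line)
    if buffer:
        yield ET.fromstring(b"".join(buffer))

# Small in-memory stand-in; for a real file use open(path, "rb") instead.
sample = io.BytesIO(
    b'<?xml version="1.0" encoding="UTF-8"?>\n<a><x>1</x></a>\n'
    b'<?xml version="1.0" encoding="UTF-8"?>\n<b/>\n'
)
tags = [root.tag for root in iter_documents(sample)]
```

Because only one document is buffered at a time, this also keeps memory use modest for large bulk files.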

Write edited xml that replaced hyphen with underscore

So I am trying to write a new xml file that I edited from the original by replacing the hyphen with an underscore and then start working on that xml file for the rest of the code.
This is my code:
import xml.etree.ElementTree as ET
from lxml import etree
#attaching xml file
xmlfile = "hook_zap.xml"
tree = ET.parse(xmlfile)
root = tree.getroot()
#replace hyphen with underscore within the xml
doc = etree.parse(xmlfile)
for e in doc.xpath('//*[contains(local-name(),"-")]'):
    e.tag = e.tag.replace('-','_')
refracted = etree.tostring(doc, method='xml')
#create a new xml file with refracted file
refracted.write('base.xml')
#print (refracted)
And I keep getting this error:
AttributeError: 'bytes' object has no attribute 'write'

Write refracted to a file like any other data:
with open('base.xml', 'w') as f:
    f.write(refracted.decode('utf-8'))

Read an xml file in Python

I am reading a file with a jml extension. The code is very simple and it reads
import xml.etree.ElementTree as ET
tree = ET.parse('VOAPoints_2010_M25.jml')
root = tree.getroot()
but I get a parsing error:
ParseError: not well-formed (invalid token): line 75, column 16
The file I am trying to read is a dataset that has been used before, so I am confident that there are no problems with it.
The file is
Can anyone help?

Sorry for using an answer as a question, but formatting this inside a comment is painful.
Does the code below solve your problem?
import xml.etree.ElementTree as ET
myParser = ET.XMLParser(encoding="utf-8")
tree = ET.parse('VOAPoints_2010_M25.jml',parser=myParser)
root = tree.getroot()

Since the pound sign was the issue, you can escape it with the character entity &#163;. Python can even automate the replacement by reading the XML file line by line and replacing the pound symbol wherever it occurs:
import xml.etree.ElementTree as ET
oldfile = "VOAPoints_2010_M25.jml"
newfile = "VOAPoints_2010_M25_new.jml"
# open both files once; 'w' avoids duplicating content if the script is rerun
with open(oldfile, 'r') as otxt, open(newfile, 'w') as ntxt:
    for rline in otxt:
        if "£" in rline:
            rline = rline.replace("£", "&#163;")
        ntxt.write(rline)
tree = ET.parse(newfile)
root = tree.getroot()
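An "invalid token" error at a pound sign often points to an encoding mismatch rather than to the character itself. As a hedged sketch, assuming the file stores "£" as the single Latin-1 byte 0xA3 and carries no encoding declaration (this is a guess, since the file is not shown), telling the parser the actual encoding avoids rewriting the file. The sample bytes and the helper parses_with_default are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: "£" stored as the Latin-1 byte 0xA3, no XML declaration.
data = b"<price>\xa3 100</price>"

def parses_with_default(raw):
    """Return True if the bytes parse with the parser's default (UTF-8) decoding."""
    try:
        ET.fromstring(raw)
        return True
    except ET.ParseError:
        return False

default_ok = parses_with_default(data)  # the lone 0xA3 byte is not valid UTF-8

# Override the encoding explicitly; "iso-8859-1" is the assumed real encoding.
latin1_parser = ET.XMLParser(encoding="iso-8859-1")
root = ET.fromstring(data, parser=latin1_parser)
```

The same parser object can be passed to ET.parse('VOAPoints_2010_M25.jml', parser=...) if the assumption about the encoding holds.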
