python xml won't parse - python

Trying to read bulk data from the US Patent and Trademark Office. I have tried several XML files from here and get the same result each time:
import xml.etree.ElementTree as ET
import re
file = 'ipgb20210105.xml'
tree = ET.parse(file)
yields: "ParseError: junk after document element: line 862, column 0"
I have tried the recommendation to wrap the content in a fake root node, but this doesn't work either:
with open(file) as f:
    xml = f.read()
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")
yields: "ParseError: not well-formed (invalid token): line 2, column 2"
Any help much appreciated!
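One possible explanation, stated as an assumption because the file itself is not shown here: the bulk file may be many complete XML documents concatenated together, each starting with its own `<?xml ...?>` declaration. That would fit "junk after document element" appearing where a second document begins, and it would explain why wrapping in a fake root fails, since the later declarations and any DOCTYPE lines end up inside `<root>`. A minimal sketch under that assumption splits the input at each declaration and parses the pieces separately. The helper name `split_concatenated_xml` and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET


def split_concatenated_xml(lines):
    """Yield the text of each document from an iterable of lines.

    Assumes every document starts on a line beginning with '<?xml'.
    """
    buffer = []
    for line in lines:
        # A new declaration marks the start of the next document.
        if line.lstrip().startswith("<?xml") and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)


# Tiny stand-in for a concatenated bulk file (hypothetical content).
sample = (
    '<?xml version="1.0"?>\n'
    "<doc><id>1</id></doc>\n"
    '<?xml version="1.0"?>\n'
    "<doc><id>2</id></doc>\n"
)

roots = [
    ET.fromstring(chunk)
    for chunk in split_concatenated_xml(sample.splitlines(keepends=True))
]
ids = [root.findtext("id") for root in roots]
```

For a real file, passing the open file object (`with open(file) as f: ... split_concatenated_xml(f)`) avoids loading everything at once. If the documents reference entities defined only in an external DTD, parsing may still fail and would need separate handling.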

Related

Write edited xml that replaced hyphen with underscore

So I am trying to write a new xml file that I edited from the original by replacing the hyphen with an underscore and then start working on that xml file for the rest of the code.
This is my code:
import xml.etree.ElementTree as ET
from lxml import etree
#attaching xml file
xmlfile = "hook_zap.xml"
tree = ET.parse(xmlfile)
root = tree.getroot()
#replace hyphen with underscore within the xml
doc = etree.parse(xmlfile)
for e in doc.xpath('//*[contains(local-name(),"-")]'):
    e.tag = e.tag.replace('-','_')
refracted = etree.tostring(doc, method='xml')
#create a new xml file with refracted file
refracted.write('base.xml')
#print (refracted)
And I keep getting this error:
AttributeError: 'bytes' object has no attribute 'write'
etree.tostring() returns a bytes object, which has no write() method. Write refracted to a file like any other data:
with open('base.xml', 'w') as f:
    f.write(refracted.decode('utf-8'))
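As an alternative, here is a sketch of the same rename-and-save idea using only the standard library, where the tree object writes itself to disk so there is no bytes object to handle. It assumes the tags carry no namespaces, and the sample XML and temporary output path are invented for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of hook_zap.xml.
source = "<hook-zap><first-item>a</first-item><plain>b</plain></hook-zap>"
root = ET.fromstring(source)

# root.iter() visits the root and every descendant element.
for element in root.iter():
    element.tag = element.tag.replace("-", "_")

# ElementTree.write() serializes straight to a file.
out_path = os.path.join(tempfile.mkdtemp(), "base.xml")
ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

# Read the file back to confirm the renamed tags were saved.
renamed_tags = [el.tag for el in ET.parse(out_path).getroot().iter()]
```

The lxml tree in the question has a comparable `write()` method on the object returned by `etree.parse()`, so calling it on `doc` rather than on the bytes from `tostring()` is another option.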

Parsing an xml file and creating another from the parsed object

I am trying to parse an xml file(containing bad characters) using lxml module in recover = True mode.
Below is the code snippet
from lxml import etree
f=open('test.xml')
data=f.read()
f.close()
parser = etree.XMLParser(recover=True)
x = etree.fromstring(data, parser=parser)
Now I want to create another xml file (test1.xml) from the above object (x)
Could anyone please help in this matter.
Thanks
I think this is what you are searching for
from lxml import etree
# opening the source file
with open('test.xml','r') as f:
    # reading the file content
    data=f.read()
parser = etree.XMLParser(recover=True)
# fromstring() parses XML from a string directly into an Element
x = etree.fromstring(data, parser=parser)
# serializing the recovered tree back to text
y = etree.tostring(x, pretty_print=True).decode("utf-8")
# writing the content to the output file
with open('test1.xml','w') as f:
    f.write(y)

How to read a huge xml file using xml.etree.ElementTree

How can I read huge xml files (more than 1GB) using this code:
import xml.etree.ElementTree as ET
tree = ET.parse(file)
doc = tree.getroot()
abstracts = doc.findall('PubmedArticle/MedlineCitation/Article/Abstract')
for abstract in abstracts:
    abs_text = abstract.findall('AbstractText')
    ab = ''
    for txt in abs_text:
        ab += txt.text
    collections.col_pubmed_xmls.insert({'text': ab, 'tag': tag})
After executing this code, I get an error saying that the file cannot be opened, at this line:
ET.parse(file)
I can read small files using this code.
What to do?
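If the failure comes from loading the whole document into memory (an assumption, since the exact error is not shown), one common approach is `ET.iterparse()`, which hands over elements as they are completed instead of building the full tree first. The sketch below processes one `PubmedArticle` at a time and clears it afterwards. The function name `iter_abstracts` and the sample data are made up, and unlike the original loop it yields an empty string for articles that have no abstract.

```python
import io
import xml.etree.ElementTree as ET


def iter_abstracts(source):
    """Yield the joined AbstractText of each PubmedArticle, one at a time."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "PubmedArticle":
            parts = elem.findall("MedlineCitation/Article/Abstract/AbstractText")
            # 'p.text or ""' guards against empty AbstractText elements.
            yield "".join(p.text or "" for p in parts)
            # Drop the article's children so memory use stays bounded.
            elem.clear()


# Small in-memory stand-in for a large PubMed file (hypothetical content).
sample = io.BytesIO(
    b"<PubmedArticleSet>"
    b"<PubmedArticle><MedlineCitation><Article><Abstract>"
    b"<AbstractText>First </AbstractText><AbstractText>part.</AbstractText>"
    b"</Abstract></Article></MedlineCitation></PubmedArticle>"
    b"<PubmedArticle><MedlineCitation><Article><Abstract>"
    b"<AbstractText>Second.</AbstractText>"
    b"</Abstract></Article></MedlineCitation></PubmedArticle>"
    b"</PubmedArticleSet>"
)

abstracts = list(iter_abstracts(sample))
```

With a real file, pass the filename instead of the `BytesIO` object and insert each abstract into the database inside the loop rather than collecting a list. The cleared article elements remain attached to the root as empty shells, so for extremely large files they may also need to be removed from the root.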

Can anyone tell me what error msg "line 1182 in parse" means when I'm trying to parse an xml in python

This is the code that results in an error message:
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.parse(data)
The error:
I'm new to python. I did read documentation and a couple of tutorials, but clearly I still have done something wrong. I don't believe it is the xml file itself because it does this to two different xml files.
Consider using ElementTree's fromstring():
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
# http://feeds.bbci.co.uk/news/rss.xml?edition=int
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.fromstring(data)
print ET.tostring(tree, encoding='utf8', method='xml')
data is a reference to the XML content as a string, but the parse() function expects a filename or file object as its argument. That's why there is an error.
urlhandle is a file object, so tree = ET.parse(urlhandle) should work for you.
The error message indicates that your code is trying to open a file whose name is stored in the variable source.
It's failing to open that file (IOError) because the variable source contains a bunch of XML, not a file name.
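To make the difference concrete, here is a small Python 3 sketch (the code above is Python 2). It uses an in-memory `io.BytesIO` object as a stand-in for the handle returned by `urlopen()` so that it runs without network access, and the sample XML is invented.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the bytes returned by urlhandle.read().
data = b"<rss><channel><title>Example feed</title></channel></rss>"

# fromstring() takes the XML content itself and returns the root Element.
root_from_string = ET.fromstring(data)

# parse() takes a filename or file-like object and returns an ElementTree.
handle = io.BytesIO(data)  # plays the role of the urlopen() handle
root_from_file = ET.parse(handle).getroot()

titles = (
    root_from_string.findtext("channel/title"),
    root_from_file.findtext("channel/title"),
)
```

In Python 3 the equivalent of `urllib.urlopen` is `urllib.request.urlopen`, and its return value can be passed to `ET.parse()` in the same way as `handle` here.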

Read an xml file in Python

I am reading a file with a jml extension. The code is very simple and it reads
import xml.etree.ElementTree as ET
tree = ET.parse('VOAPoints_2010_M25.jml')
root = tree.getroot()
but I get a parsing error:
ParseError: not well-formed (invalid token): line 75, column 16
the file I am trying to read is a dataset that has been used before so I am confident that there are no problems with it.
The file is
Can anyone help ?
Sorry for using an answer as a question, but formatting this inside a comment is painful.
Does the code below solve your problem?
import xml.etree.ElementTree as ET
myParser = ET.XMLParser(encoding="utf-8")
tree = ET.parse('VOAPoints_2010_M25.jml',parser=myParser)
root = tree.getroot()
Since the pound sign was the issue, you can escape it with the character entity &#163;. Python can even automate the replacement by reading the XML file line by line and replacing the pound symbol wherever it occurs:
import xml.etree.ElementTree as ET
oldfile = "VOAPoints_2010_M25.jml"
newfile = "VOAPoints_2010_M25_new.jml"
with open(oldfile, 'r') as otxt:
    for rline in otxt:
        if "£" in rline:
            rline = rline.replace("£", "&#163;")
        with open(newfile, 'a') as ntxt:
            ntxt.write(rline)
tree = ET.parse(newfile)
root = tree.getroot()
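Another option that avoids rewriting the file, sketched under the assumption that the file is actually encoded as ISO-8859-1 (Latin-1) and has no encoding declaration: tell the parser which encoding to use. Without a declaration the parser assumes UTF-8, and a Latin-1 pound sign is not valid UTF-8, which would produce the "invalid token" error. The sample bytes below are invented to reproduce that situation.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical Latin-1 bytes with a raw pound sign and no declaration.
raw = "<points><price>£100</price></points>".encode("iso-8859-1")

# With the default (UTF-8) assumption the pound byte is rejected.
try:
    ET.parse(io.BytesIO(raw))
    default_parse_failed = False
except ET.ParseError:
    default_parse_failed = True

# Overriding the encoding lets the same bytes parse cleanly.
parser = ET.XMLParser(encoding="iso-8859-1")
root = ET.parse(io.BytesIO(raw), parser=parser).getroot()
price = root.findtext("price")
```

If the file turns out to use a different encoding, the same override applies with that encoding's name. This also suggests why `encoding="utf-8"` would not help in that case, since UTF-8 is already the default.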
