How can one parse whole XML documents using the LXML Sax module?

How can one parse whole XML documents using the LXML Sax module? - python

I have a script that goes through a directory with many XML files and extracts or adds information to these files. I use XPath to identify the elements of interest.
The relevant piece of code is this:
import lxml.etree as et
import lxml.sax
# deleted non relevant code
for root, dirs, files in os.walk(ROOT):
# iterate all files
for file in files:
if file.endswith('.xml'):
# join root dir and file name
file_path = os.path.join(ROOT, file)
# load root element from file
file_root = et.parse(file_path).getroot()
# This is a function that I define elsewhere in which I use XPath to identify relevant
# elements and extract, change or add some information
xml_dosomething(file_root)
# init tree object from file_root
tree = et.ElementTree(file_root)
# save modified xml tree object to file with an added text so that I can keep a copy of original.
tree.write(file_path.replace('.xml', '-clean.xml'), encoding='utf-8', doctype='<!DOCTYPE document SYSTEM "estcorpus.dtd">', xml_declaration=True)
I have seen in various places that people recommend using Sax(on) to speed up the processing of large files. After checking the documentation of the LXML Sax module in (https://lxml.de/sax.html) I'm at a loss as to how to modify my code so that I can leverage the Sax module. I can see the following in the documentation:
handler = lxml.sax.ElementTreeContentHandler()
then there is a list of statements like (handler.startElementNS((None, 'a'), 'a', {})) that would populate the 'handler' "document" (?) with what would be the elements of a the XML document. After that I see:
tree = handler.etree
lxml.etree.tostring(tree.getroot())
I think I understand what handler.etree does but my problem is that I want 'handler' to be the files in the directory that I'm working with rather than a string that I create by using 'handler.startElementNS' and the like. What do I need to change in my code to get the Sax module to do the work that needs to be done with the files as input?

Related

Get XML file name from loaded XML files using Python

My Python code reads XML files stored at location and loads it into Python list after parsing using lxml library as shown below:
XMLFILEList = []
FilePath = 'C:\\plugin\\TestPlugin\\'
XMLFilePath = os.listdir(FilePath)
for XMLFILE in XMLFilePath:
if XMLFILE.endswith('.xml'):
XMLFILEList.append(etree.parse(XMLFILE))
print(XMLFILEList)
Output:
[<lxml.etree._ElementTree object at 0x000001CCEEE0C748>, <lxml.etree._ElementTree object at 0x000001CCEEE0C7C8>]
Currently, I see objects of XML files.
Please can anyone help me pull original filenames of XML files. For example: if my HelloWorld.xml file is loaded into XMLFILEList. I should be able to retrieve "HelloWorld.xml"

you have a one to one correspondence between XBRLFilePath and XMLFILEList, first one is the file you loaded, second is the file contents, just use that applying your if statement.
mydict = {}
for XMLFILE in XBRLFilePath:
if XMLFILE.endswith('.xml'):
mydict[XMLFILE] = etree.parse(XMLFILE)
your dict will now have as keys the files loaded, and as values the loaded files

Using variable in file path in Python

I've found similar questions to this but can't find an exact answer and I'm having real difficulty getting this to work, so any help would be hugely appreciated.
I need to find a XML file in a folder structure that changes every time I run some automated tests.
This piece of code finds the file absolutely fine:
import xml.etree.ElementTree as ET
import glob
report = glob.glob('./Reports/Firefox/**/*.xml', recursive=True)
print(report)
I get a path returned. I then want to use that path, in the variable "report" and look for text within the XML file.
The following code finds the text fine IF the python file is in the same directory as the XML file. However, I need the python file to reside in the parent file and pass the "report" variable into the first line of code below.
tree = ET.parse("JUnit_Report.xml")
root = tree.getroot()
for testcase in root.iter('testcase'):
testname = testcase.get('name')
teststatus = testcase.get('status')
print(testname, teststatus)
I'm a real beginner at Python, is this even possible?

Build the absolute path to your report file:
report = glob.glob('./Reports/Firefox/**/*.xml', recursive=True)
abs_path_to_report = os.path.abspath(report)
Pass that variable to whatever you want:
tree = ET.parse(abs_path_to_report )

xml files from folder into list

I'm pretty new in programming and it is the first time I use xml, but for class I'm doing a gender classification project with a dataset of Blogs.
I have a folder which consists of xml files. Now I need to make a list of names of the files there.
Then I should be able to run through the list with a loop and open each file containing XML and get out of it what I want (ex. Text and class) and then store that in another variable, like adding it to a list or dictionary.
I tried something, but it isn't right and I'm kind of stuck. Can someone help me? This is wat I have so far:
path ='\\Users\\name\\directory\\folder'
dir = os.listdir( path )
def select_files_in_folder(dir, ext):
for filename in os.listdir(path):
fullname= os.path.join(path, filename)
tree = ET.parse(fullname)
for elem in doc.findall('gender'):
print(elem.get('gender'), elem.text)

If you want to build a list of all the xml files in a given directory you can do the following
def get_xml_files(path):
xml_list = []
for filename in os.listdir(path):
if filename.endswith(".xml"):
xml_list.append(os.path.join(path, filename))
return xml_list
just keep in mind that this is not recursive through the folders and it's just assuming that the xml files finish with .xml.
EDIT :
Parsing xml is highlly dependent of the library you'll be using. From your code I guess you're using xml.etree.ElementTree (keep in mind this lib is not safe against maliciously constructed data).
def get_xml_data(list):
data = []
for filename in list :
root = ET.parse(filename)
data = [ text for text in root.findall("whatever you want to get") ]
return data

Search all docx files with python-docx in a directory (batch)

I have a bunch of Word docx files that have the same embedded Excel table. I am trying to extract the same cells from several files.
I figured out how to hard code to one file:
from docx import Document
document = Document(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx\006-087-003.docx")
table = document.tables[0]
Project_cell = table.rows[2].cells[2]
paragraph = Project_cell.paragraphs[0]
Project = paragraph.text
print Project
But how do I batch this? I tried some variations on listdir, but they are not working for me and I am too green to get there on my own.

How you loop over all of the files will really depend on your project deliverables. Are all of the files in a single folder? Are there more than just .docx files?
To address all of the issues, we'll assume that there are subdirectories, and other files mingled with your .docx files. For this, we'll use os.walk() and os.path.splitext()
import os
from docx import Document
# First, we'll create an empty list to hold the path to all of your docx files
document_list = []
# Now, we loop through every file in the folder "G:\GIS\DESIGN\ROW\ROW_Files\Docx"
# (and all it's subfolders) using os.walk(). You could alternatively use os.listdir()
# to get a list of files. It would be recommended, and simpler, if all files are
# in the same folder. Consider that change a small challenge for developing your skills!
for path, subdirs, files in os.walk(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx"):
for name in files:
# For each file we find, we need to ensure it is a .docx file before adding
# it to our list
if os.path.splitext(os.path.join(path, name))[1] == ".docx":
document_list.append(os.path.join(path, name))
# Now create a loop that goes over each file path in document_list, replacing your
# hard-coded path with the variable.
for document_path in document_list:
document = Document(document_path) # Change the document being loaded each loop
table = document.tables[0]
project_cell = table.rows[2].cells[2]
paragraph = project_cell.paragraphs[0]
project = paragraph.text
print project
For additional reading, here is the documentation on os.listdir().
Also, it would be best to put your code into a function which is re-usable, but that's also a challenge for yourself!

Assuming that the code above get you the data you need, all you need to do is read the files from the disk and process them.
First let's define a function that does what you were already doing, then we'll loop over all the documents in the directory and process them with that function.
Edit the following untested code to suit your needs.
# we'll use os.walk to iterate over all the files in the directory
# we're going to make some simplifying assumption:
# 1) all the docs files are in the same directory
# 2) that you want to store in the paragraph in a list.
import document
import os
DOCS = r'G:\GIS\DESIGN\ROW\ROW_Files\Docx'
def get_para(document):
table = document.tables[0]
Project_cell = table.rows[2].cells[2]
paragraph = Project_cell.paragraphs[0]
Project = paragraph.text
return Project
if __name__ == "__main__":
paragraphs = []
f = os.walk(DOCS).next()
for filename in f:
file_name = os.path.join(DOCS, filename)
document = Document(file_name)
para = get_para(document)
paragraphs.append(para)
print(paragraphs)

Find names of all generated HTML files in Sphinx extension

I want to create a simple Sphinx extension which post-processes generated HTML files after they're created by HTML builder. I have written a post-processing routine using BeautifulSoup, but then I faced a trouble converting my routine into a separate Sphinx extension.
I've registered my handler for "build-finished" event using app.connect, but I still cannot figure out how to get the list of filenames to preprocess.
How to get the list of all HTML files which were generated? (or, at least)
How to get the output directory? I've found that I can use env.found_docs and builder.get_target_uri to obtain the relative path of generated page, but I need the directory name.

It's not mentioned in Sphinx documentation for some reason, but the path to output directory can be easily found as app.outdir. After discovering this fact, it was easy to gather all the filenames I needed:
def process_build_finished(app, exception):
if exception is not None:
return
target_files = []
for doc in app.env.found_docs:
target_filename = app.builder.get_target_uri(doc)
target_filename = os.path.join(app.outdir, target_filename)
target_filename = os.path.abspath(target_filename)
target_files.append(target_filename)
...

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can one parse whole XML documents using the LXML Sax module? - python

Related

Get XML file name from loaded XML files using Python

Using variable in file path in Python

xml files from folder into list

Search all docx files with python-docx in a directory (batch)

Find names of all generated HTML files in Sphinx extension

Categories

Resources