storing output to a list or dictionary - python

I'm using the following code as part of a larger program that does some error checking on a Digital Cinema Package (DCP) and tries to check the validity of the XML file that lists the assets on the DCP. Anyway, this is all still very much in its infancy, and I'm hoping to learn more Python as a result of it.
import xml.etree.ElementTree as etree
import sys
class Parser(object):
def __init__(self, file_name):
self.file_name = file_name
def display(self, rename_this_list):
tree = etree.parse(self.file_name)
for node in tree.getiterator():
for element in rename_this_list:
if element in node.tag:
#uuid = [s.strip('urn:') for s in uuid]
fname = sys.argv[1]
key_search_words = ['KeyId']
instance = Parser(fname)
When I try to store the output so that each line is a list, it doesn't format the way I would expect. Minus the urn:, I'd like to store each line, with uuid: and the information that follows, as an element of a list.

If you need a list, then you can try this.
def display(self, rename_this_list):
    listOfNodes = []
    tree = etree.parse(self.file_name)
    for node in tree.iter():
        for element in rename_this_list:
            if element in node.tag:
                # append text of element to the list
                # without first four characters which are "urn:"
                listOfNodes.append(node.text[4:])
    print(str(listOfNodes))
    return listOfNodes
Remember that the keys of a dictionary have to be unique: a dictionary can't have two items with the key "uuid". If you want a dictionary, you can only have one key "uuid" whose value is a list of all those numbers.
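The list-versus-dictionary point above can be sketched as follows. This is only an illustration under assumptions: the sample XML, the `collect_values` helper, and the uuid values are made up, and the real DCP file will look different. It also avoids `str.strip('urn:')` from the commented-out line in the question, because `strip` removes any of the characters `u`, `r`, `n`, `:` from both ends rather than the literal prefix, which may be part of why the output looked wrong.

```python
import collections
import xml.etree.ElementTree as etree

# Hypothetical sample data; the real asset list will differ.
SAMPLE = """<root>
  <KeyId>urn:uuid:1111-aaaa</KeyId>
  <KeyId>urn:uuid:2222-bbbb</KeyId>
</root>"""


def collect_values(xml_text, search_words):
    """Return matching node texts as a list and grouped in a dict of lists."""
    root = etree.fromstring(xml_text)
    as_list = []
    as_dict = collections.defaultdict(list)
    for node in root.iter():
        if node.text and any(word in node.tag for word in search_words):
            text = node.text.strip()
            # Remove the literal prefix, not a set of characters.
            if text.startswith("urn:"):
                text = text[len("urn:"):]
            as_list.append(text)
            # "uuid:1111-aaaa" -> key "uuid", value "1111-aaaa"
            key, _, value = text.partition(":")
            as_dict[key].append(value)
    return as_list, dict(as_dict)


values, grouped = collect_values(SAMPLE, ["KeyId"])
print(values)
print(grouped)
```

The list keeps one entry per matching element, while the dictionary has a single "uuid" key holding all the values.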

import collections
import xml.etree.ElementTree as etree

class Parser(object):
    def __init__(self, file_name):
        self.file_name = file_name
        self.res = collections.defaultdict(list)

    def display(self, rename_this_list):
        tree = etree.parse(self.file_name)
        for node in tree.iter():
            for element in rename_this_list:
                if element in node.tag:
                    uuid = node.text
                    # drop the leading "urn:" and split "uuid:<value>" into key and value
                    key, value = uuid[4:].split(':')
                    self.res[key].append(value)
Does this meet your needs? I don't know the details of your data, so if anything is wrong, please tell me. I think the result should look like this:


is this a lambda in python? [duplicate]

This question already has answers here:
What are type hints in Python 3.5?
What does -> mean in Python function definitions?
I am using Python 3.7 and I have just started my own open-source project. Some time ago a very skilled software developer decided to help, but then he didn't have enough time to continue, so I am picking up his work to develop new features for the project. He designed a script to manage the reading of text from PDF and DOC files. He developed it very well, but there is something I don't understand:
@classmethod
def extract_document_data(cls, file_path : str) -> DocumentData:
    '''
    Entry point of the module, it extracts the data from the document
    whose path is passed as input.
    The extraction strategy is automatically chosen based on the MIME type
    of the file.

    @type file_path: str
    @param file_path: The path of the document to be parsed.
    @rtype: DocumentData
    @returns: An object containing the data of the parsed document.
    '''
    mime = magic.Magic(mime=True)
    mime_type = mime.from_file(file_path)
    document_type = DocumentType.get_instance(mime_type)
    strategy = cls.strategies[document_type]
    return strategy.extract_document_data(file_path)
This part, -> DocumentData, is very obscure to me. If it were a lambda, it should be included in the method's arguments as a callback, shouldn't it? What does it mean in this position?
I can paste the whole class if you need more context:
from enum import Enum
import json
import magic
import docx
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTContainer, LTTextContainer
from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
class DocumentType(Enum):
Defines the handled document types.
Each value is associated to a MIME type.
def __init__(self, mime_type):
self.mime_type = mime_type
def get_instance(cls, mime_type : str):
values = [e for e in cls]
for value in values:
if value.mime_type == mime_type:
return value
raise MimeNotValidError(mime_type)
PDF = 'application/pdf'
DOCX = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
class MimeNotValidError(Exception):
Exception to be raised when a not valid MIME type is processed.
class DocumentData:
Wrapper for the extracted document data (TOC and contents).
def __init__(self, toc : list = [], pages : list = [], document_text : str = None):
self.toc = toc
self.pages = pages
if document_text is not None:
self.document_text = document_text
else:
self.document_text = ' '.join([page.replace('\n', ' ') for page in pages])
def toc_as_json(self) -> str:
return json.dumps(self.toc)
class ExtractionStrategy:
Base class for the extraction strategies.
def extract_document_data(file_path : str) -> DocumentData:
class DOCXExtractionStrategy(ExtractionStrategy):
It implements the TOC and contents extraction from a DOCX document.
def extract_document_data(file_path : str) -> DocumentData:
document = docx.Document(file_path)
body_elements = document._body._body
# Selecting only the <w:t> elements from DOCX XML,
# as they're the only to contain some text.
text_elems = body_elements.xpath('.//w:t')
return DocumentData(document_text = ' '.join([elem.text for elem in text_elems]))
class PDFExtractionStrategy(ExtractionStrategy):
It implements the TOC and contents extraction from a PDF document.
def parse_toc(doc : PDFDocument) -> list:
raw_toc = []
outlines = doc.get_outlines()
for (level, title, dest, a, se) in outlines:
raw_toc.append((level, title))
except PDFNoOutlines:
return PDFExtractionStrategy.build_toc_tree(raw_toc)
def build_toc_tree(items : list) -> list:
Builds the TOC tree from a list of TOC items.
#type items: list
#param items: The TOC items.
Each item must have the following format: (<item depth>, <item description>).
E.g: [(1, 'Contents'), (2, 'Chapter 1'), (2, 'Chapter 2')]
#rtype: list
#returns: The TOC tree. The tree hasn't a root element, therefore it
actually is a list.
toc = []
if items is None or len(items) == 0:
return toc
current_toc_level = toc
# Using an explicit stack containing the lists corresponding to
# the various levels of the TOC, to simulate the recursive building
# of the TOC tree in a more efficient way
toc_levels_stack = []
# Each TOC item can be inserted into the current TOC level as
# string (just the item description) or as dict, where the key is
# the item description and the value is a list containing the
# children TOC items.
# To correctly determine how to insert the current item into
# the current level, a kind of look-ahead is needed, that is
# the depth of the next item has to be considered.
# Initializing the variables related to the previous item.
prev_item_depth, prev_item_desc = items[0]
# Adding a fake final item in order to handle all the TOC items
# inside the cycle.
items.append((-1, ''))
for i in range(1, len(items)):
# In fact each iteration handles the item of the previous
# one, using the current item to determine how to insert
# the previous item into the current TOC level,
# as explained before.
curr_item = items[i]
curr_item_depth = curr_item[0]
if curr_item_depth == prev_item_depth:
# The depth of the current item is the same
# as the previous one.
# Inserting the previous item into the current TOC level
# as string.
elif curr_item_depth == prev_item_depth + 1:
# The depth of the current item is increased by 1 compared to
# the previous one.
# Inserting the previous item into the current TOC level
# as dict.
prev_item_dict = { prev_item_desc : [] }
# Updating the current TOC level with the newly created one
# which contains the children of the previous item.
current_toc_level = prev_item_dict[prev_item_desc]
elif curr_item_depth < prev_item_depth:
# The depth of the current item is lesser than
# the previous one.
# Inserting the previous item into the current TOC level
# as string.
if i < len(items)-1:
# Executing these steps for all the items except the last one
depth_diff = prev_item_depth - curr_item_depth
# Removing from the stack as many TOC levels as the difference
# between the depth of the previous item and the depth of the
# current one.
for i in range(0, depth_diff):
# Updating the current TOC level with the one contained in
# the head of the stack.
current_toc_level = toc_levels_stack[-1]
# Updating the previous item with the current one
prev_item_depth, prev_item_desc = curr_item
return toc
def from_bytestring(s) -> str:
If the input string is a byte-string, converts it to a string using
UTF-8 as encoding.
#param s: A string or a byte-string.
#rtype: str
#returns: The potentially converted string.
if s:
if isinstance(s, str):
return s
return s.encode('utf-8')
def parse_layout_nodes(container : LTContainer) -> str:
Recursively extracts the text from all the nodes contained in the
input PDF layout tree/sub-tree.
#type container: LTContainer
#param container: The PDF layout tree/sub-tree from which to extract the text.
#rtype: str
#returns: A string containing the extracted text.
text_content = []
# The iterator returns the children nodes.
for node in container:
if isinstance(node, LTTextContainer):
# Only nodes of type LTTextContainer contain text.
elif isinstance(node, LTContainer):
# Recursively calling the method on the current node, which is a container itself.
# Ignoring all the other node types.
# Joining all the extracted text chunks with a new line character.
return "\n".join(text_content)
def parse_pages(doc : PDFDocument) -> list:
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
text_content = []
for i, page in enumerate(PDFPage.create_pages(doc)):
layout = device.get_result()
# Extracts the text from all the nodes of the PDF layout tree of each page
return text_content
def parse_pdf(file_path : str) -> (list, list):
toc = []
pages = []
fp = open(file_path, 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
if doc.is_extractable:
toc = PDFExtractionStrategy.parse_toc(doc)
pages = PDFExtractionStrategy.parse_pages(doc)
except IOError:
return (toc, pages)
def extract_document_data(file_path : str) -> DocumentData:
toc, pages = PDFExtractionStrategy.parse_pdf(file_path)
return DocumentData(toc, pages = pages)
class DocumentDataExtractor:
Main class of the module.
It's responsible for actually executing the text extraction.
The output is constituted by the following items:
-table of contents (TOC);
-pages contents.
# Dictionary containing the extraction strategies for the different
# document types, indexed by the corresponding DocumentType enum values.
strategies = {
DocumentType.DOCX : DOCXExtractionStrategy(),
DocumentType.PDF : PDFExtractionStrategy()
def extract_document_data(cls, file_path : str) -> DocumentData:
Entry point of the module, it extracts the data from the document
whose path is passed as input.
The extraction strategy is automatically chosen based on the MIME type
of the file.
#type file_path: str
#param file_path: The path of the document to be parsed.
#rtype: DocumentData
#returns: An object containing the data of the parsed document.
mime = magic.Magic(mime=True)
mime_type = mime.from_file(file_path)
document_type = DocumentType.get_instance(mime_type)
strategy = cls.strategies[document_type]
return strategy.extract_document_data(file_path)
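For reference, the `-> DocumentData` after the parameter list is a return annotation (a type hint), not a lambda. A minimal sketch follows, using a stand-in `DocumentData` class and hypothetical `extract` and `shout` names rather than the project's real code. It shows that the annotation is just metadata stored on the function, while a lambda is a separate kind of expression.

```python
from typing import get_type_hints


class DocumentData:
    """Hypothetical stand-in for the class used in the question."""

    def __init__(self, text: str = ""):
        self.text = text


def extract(file_path: str) -> DocumentData:
    # The annotation after "->" documents the intended return type.
    # Python does not enforce it at runtime.
    return DocumentData(file_path)


# A lambda, by contrast, is an anonymous function expression.
shout = lambda s: s.upper()

hints = get_type_hints(extract)
print(hints)             # parameter and return annotations
print(shout("hello"))
```

Tools such as type checkers and IDEs read these annotations; removing them would not change how the function runs.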

Is there a way to get the parent of a dataset or group while using Group.visititems?

I am trying to put an h5py File object into a tree structure so that I can use the tree's ability to print a representation of itself, displaying the contents of a file in the same way the Linux "tree" command recursively displays the contents of a directory. The best way to recursively visit all of the items in the file is with the Group.visititems method, passing in the function I will use to add nodes to the tree. Here is what I have so far:
import h5py
import argparse
import sys
from anytree import Node, RenderTree
class HDFTree:
def __init__(self,filename):
self._file = h5py.File(filename,'r')
self._root = Node(filename)
self._node_map = {filename:self._root}
def _add_node(self,name,item):
#TODO: Figure out way to get parent of fnode
parent_node = self._node_map[item.parent] # I don't think item.parent is a thing so this wont work
self._node_map[name] = Node(name,parent=parent_node)
def _create_tree(self):
def print_tree(self):
def __del__(self):
After realizing that the Dataset and Group class both indeed have a parent attribute (also pointed out by hpaulj in a comment on the question) and some cleaning up of the data, I was able to get the output that I want:
import h5py
import os
from anytree import Node, RenderTree
class HDFTree:
def __init__(self,filepath):
self._file = h5py.File(filepath,'r')
_,filename = os.path.split(filepath)
root_name,_ = os.path.splitext(filename)
self._root = Node(root_name)
self._node_map = {'':self._root}
def _add_node(self,name,item):
_,parent_name = os.path.split(item.parent.name)
parent_node = self._node_map[parent_name]
_,child_name = os.path.split(name)
self._node_map[child_name] = Node(child_name,parent=parent_node)
def _create_tree(self):
def print_tree(self):
def __del__(self):
The name attribute of Dataset and Group classes apparently gives the full hdf5 path so I cleaned it up with some os.path functions.
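The path-splitting idea above can be sketched without h5py or anytree. This is an illustration only: `build_tree`, `render`, and the `names` list are hypothetical, and the list stands in for the relative names a visitor callback would receive, assuming parents are seen before their children. Unlike the code above, the node map here is keyed by the full path, so two groups that contain items with the same leaf name do not overwrite each other.

```python
import posixpath


def build_tree(names):
    """Build a nested dict from relative HDF5-style paths such as 'grp/data'."""
    root = {}
    node_map = {"": root}  # full path -> node; "" is the root
    for name in names:
        parent_path, child = posixpath.split(name)
        node = {}
        node_map[parent_path][child] = node
        node_map[name] = node
    return root


def render(tree, indent=0):
    """Return the tree as indented lines, one per node."""
    lines = []
    for key in sorted(tree):
        lines.append("    " * indent + key)
        lines.extend(render(tree[key], indent + 1))
    return lines


# Hypothetical names, listed parent-first.
names = ["a", "a/x", "b", "b/x"]
tree = build_tree(names)
print("\n".join(render(tree)))
```

`posixpath` is used instead of `os.path` because HDF5 paths always use forward slashes, regardless of the operating system.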

How to Parse YAML Using PyYAML if there are '!' within the YAML

I have a YAML file that I'd like to parse the description variable only; however, I know that the exclamation points in my CloudFormation template (YAML file) are giving PyYAML trouble.
I am receiving the following error:
yaml.constructor.ConstructorError: could not determine a constructor for the tag '!Equals'
The file has many !Ref and !Equals. How can I ignore these constructors and get a specific variable I'm looking for -- in this case, the description variable.
If you have to deal with a YAML document with multiple different tags and
are only interested in a subset of them, you should still
handle them all. If the elements you are interested in are nested
within other tagged constructs, you at least need to handle all of the "enclosing" tags.
There is, however, no need to handle all of the tags individually: you
can write a constructor routine that handles mappings, sequences
and scalars, and register it with PyYAML's SafeLoader using:
import yaml

inp = """\
MyEIP:
  Type: !Join [ "::", [AWS, EC2, EIP] ]
  Properties:
    InstanceId: !Ref MyEC2Instance
"""

description = []

def any_constructor(loader, tag_suffix, node):
    if isinstance(node, yaml.MappingNode):
        return loader.construct_mapping(node)
    if isinstance(node, yaml.SequenceNode):
        return loader.construct_sequence(node)
    return loader.construct_scalar(node)

yaml.add_multi_constructor('', any_constructor, Loader=yaml.SafeLoader)

data = yaml.safe_load(inp)
print(data)
which gives:
{'MyEIP': {'Type': ['::', ['AWS', 'EC2', 'EIP']], 'Properties': {'InstanceId': 'MyEC2Instance'}}}
(inp can also be a file opened for reading).
As you can see, the above will also continue to work if an unexpected !Join tag shows up in your input,
as well as any other tag like !Equals. The tags are just dropped.
Since there are no variables in YAML, it is a bit of guesswork what
you mean by "like to parse the description variable only". If that has
an explicit tag (e.g. !Description), you can filter out the values by adding 2-3 lines
to the any_constructor, by matching the tag_suffix parameter.
if tag_suffix == u'!Description':
It is however more likely that there is some key in a mapping that is a scalar description,
and that you are interested in the value associated with that key.
if isinstance(node, yaml.MappingNode):
d = loader.construct_mapping(node)
for k in d:
if k == 'description':
return d
If you know the exact position in the data hierarchy, you can of
course also walk the data structure and extract anything you need
based on keys or list positions. Especially in that case you'd be better off
using my ruamel.yaml, as this can load tagged YAML in round-trip mode without
extra effort (assuming the above inp):
from ruamel.yaml import YAML
with YAML() as yaml:
data = yaml.load(inp)
You can define custom constructors using a custom yaml.SafeLoader subclass:
import yaml

doc = '''
Conditions:
  CreateNewSecurityGroup: !Equals [!Ref ExistingSecurityGroup, NONE]
'''

class Equals(object):
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return "Equals(%s)" % self.data

class Ref(object):
    def __init__(self, data):
        self.data = data
    def __repr__(self):
        return "Ref(%s)" % self.data

def create_equals(loader, node):
    value = loader.construct_sequence(node)
    return Equals(value)

def create_ref(loader, node):
    value = loader.construct_scalar(node)
    return Ref(value)

class Loader(yaml.SafeLoader):
    pass

yaml.add_constructor(u'!Equals', create_equals, Loader)
yaml.add_constructor(u'!Ref', create_ref, Loader)
a = yaml.load(doc, Loader)
print(a)
{'Conditions': {'CreateNewSecurityGroup': Equals([Ref(ExistingSecurityGroup), 'NONE'])}}

Merge two xml files Python and also keep comments

I'm trying to merge two XML files in Python with the following code, which I found in another thread: Merge xml files with nested elements without external libraries
import sys
from xml.etree import ElementTree as et


class hashabledict(dict):
    def __hash__(self):
        return hash(tuple(sorted(self.items())))


class XMLCombiner(object):
    def __init__(self, filenames):
        assert len(filenames) > 0, 'No filenames!'
        # save all the roots, in order, to be processed later
        self.roots = [et.parse(f).getroot() for f in filenames]

    def combine(self):
        for r in self.roots[1:]:
            # combine each element with the first one, and update that
            self.combine_element(self.roots[0], r)
        # return the combined tree
        return et.ElementTree(self.roots[0])

    def combine_element(self, one, other):
        """
        This function recursively updates either the text or the children
        of an element if another element is found in `one`, or adds it
        from `other` if not found.
        """
        # Create a mapping from tag name to element, as that's what we are filtering with
        mapping = {(el.tag, hashabledict(el.attrib)): el for el in one}
        for el in other:
            if len(el) == 0:
                # Not nested
                try:
                    # Update the text
                    mapping[(el.tag, hashabledict(el.attrib))].text = el.text
                except KeyError:
                    # An element with this name is not in the mapping
                    mapping[(el.tag, hashabledict(el.attrib))] = el
                    # Add it
                    one.append(el)
            else:
                try:
                    # Recursively process the element, and update it in the same way
                    self.combine_element(mapping[(el.tag, hashabledict(el.attrib))], el)
                except KeyError:
                    # Not in the mapping
                    mapping[(el.tag, hashabledict(el.attrib))] = el
                    # Just add it
                    one.append(el)


if __name__ == '__main__':
    r = XMLCombiner(sys.argv[1:-1]).combine()
    print('-' * 20)
    print(et.tostring(r.getroot()))
    r.write(sys.argv[-1], encoding="iso-8859-1", xml_declaration=True)
The code works perfectly for merging two XML files; however, I would also like to merge the comments I have in the files. I'm new at this and don't know how to merge not just the XML but also the comments.
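One possible direction, sketched under assumptions: by default ElementTree's parser discards comments, but on Python 3.8 or later a `TreeBuilder` created with `insert_comments=True` keeps them as child nodes whose `tag` is the `Comment` factory. The helper names `parse_with_comments` and `merge_keeping_comments` and the sample documents are hypothetical, and the merge rule used here (always append comments from the second file, never match them) is only one choice among several.

```python
import xml.etree.ElementTree as ET


def parse_with_comments(xml_text):
    """Parse XML text, keeping comments as nodes (needs Python 3.8+)."""
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    return ET.fromstring(xml_text, parser=parser)


def element_key(el):
    # Tag plus sorted attributes identifies "the same" element.
    return (el.tag, tuple(sorted(el.attrib.items())))


def merge_keeping_comments(one, other):
    """Merge `other` into `one`; comments from `other` are appended as-is."""
    mapping = {element_key(el): el for el in one if el.tag is not ET.Comment}
    for el in other:
        if el.tag is ET.Comment:
            one.append(el)          # assumption: comments are never matched
            continue
        match = mapping.get(element_key(el))
        if match is None:
            mapping[element_key(el)] = el
            one.append(el)
        elif len(el) == 0:
            match.text = el.text    # leaf: take the text from `other`
        else:
            merge_keeping_comments(match, el)
    return one


first = parse_with_comments("<root><!-- first --><item id='1'>a</item></root>")
second = parse_with_comments(
    "<root><!-- second --><item id='1'>b</item><item id='2'>c</item></root>"
)
merged = merge_keeping_comments(first, second)
comments = [el.text for el in merged if el.tag is ET.Comment]
items = [el.text for el in merged if el.tag == "item"]
print(ET.tostring(merged, encoding="unicode"))
```

Comments that sit outside the root element are not part of the tree and would still be lost with this approach.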

Search and remove element with elementTree in Python

I have an XML document in which I want to search for some elements and, if they match some criteria, delete them.
However, I cannot seem to access the parent of the element so that I can delete it.
file = open('test.xml', "r")
elem = ElementTree.parse(file)
namespace = "{http://somens}"
props = elem.findall('.//{0}prop'.format(namespace))
for prop in props:
type = prop.attrib.get('type', None)
if type == 'json':
value = json.loads(prop.attrib['value'])
if value['name'] == 'Page1.Button1':
#here I need to access the parent of prop
# in order to delete the prop
Is there a way I can do this?
You can remove child elements with the corresponding remove method. To remove an element you have to call its parent's remove method. Unfortunately, Element does not provide a reference to its parent, so it is up to you to keep track of parent/child relations (which speaks against your use of elem.findall()).
A proposed solution could look like this:
root = elem.getroot()
for child in root:
    if child.tag != "prop":
        continue
    if True:  # TODO: do your check here!
        root.remove(child)
PS: don't use prop.attrib.get(), use prop.get(), as explained here.
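One way to keep track of parent/child relations, as described above, is to build a child-to-parent map before removing anything. The sketch below is an illustration under assumptions: the sample XML and the `remove_matching_props` helper are made up, and only the namespace and the JSON check are taken from the question.

```python
import json
import xml.etree.ElementTree as ET

NS = "{http://somens}"

# Hypothetical document shaped like the one in the question.
XML = """<root xmlns="http://somens">
  <group>
    <prop type="json" value='{"name": "Page1.Button1"}'/>
    <prop type="json" value='{"name": "Page1.Button2"}'/>
    <prop type="text" value="plain"/>
  </group>
</root>"""


def remove_matching_props(root, target_name):
    """Remove every JSON prop whose "name" equals target_name."""
    # Element has no parent pointer, so record each child's parent first.
    parent_map = {child: parent for parent in root.iter() for child in parent}
    removed = 0
    # list(...) takes a snapshot so the tree is not modified while iterating.
    for prop in list(root.iter(NS + "prop")):
        if prop.get("type") != "json":
            continue
        if json.loads(prop.get("value"))["name"] == target_name:
            parent_map[prop].remove(prop)
            removed += 1
    return removed


root = ET.fromstring(XML)
count = remove_matching_props(root, "Page1.Button1")
remaining = root.findall(".//" + NS + "prop")
print(count, len(remaining))
```

Because the map covers the whole tree, this works for props at any nesting depth, not only direct children of the root.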
You could use xpath to select an Element's parent.
file = open('test.xml', "r")
elem = ElementTree.parse(file)
namespace = "{http://somens}"
props = elem.findall('.//{0}prop'.format(namespace))
for prop in props:
type = prop.get('type', None)
if type == 'json':
value = json.loads(prop.attrib['value'])
if value['name'] == 'Page1.Button1':
# Get parent and remove this prop
parent = prop.find("..")
Except if you try that it doesn't work:
So instead you have to:
file = open('test.xml', "r")
elem = ElementTree.parse(file)
namespace = "{http://somens}"
search = './/{0}prop'.format(namespace)
# Use xpath to get all parents of props
prop_parents = elem.findall(search + '/..')
for parent in prop_parents:
# Still have to find and iterate through child props
for prop in parent.findall(search):
type = prop.get('type', None)
if type == 'json':
value = json.loads(prop.attrib['value'])
if value['name'] == 'Page1.Button1':
It is two searches and a nested loop. The inner search is only on Elements known to contain props as first children, but that may not mean much depending on your schema.
I know this is an old thread but this kept popping up while I was trying to figure out a similar task. I did not like the accepted answer for two reasons:
1) It doesn't handle multiple nested levels of tags.
2) It will break if multiple xml tags are deleted in the same level one-after-another. Since each element is an index of Element._children you shouldn't delete while forward iterating.
I think a better more versatile solution is this:
import xml.etree.ElementTree as et
file = 'test.xml'
tree = et.parse(file)
root = tree.getroot()
def iterator(parents, nested=False):
for child in reversed(parents):
if nested:
if len(child) >= 1:
if True: # Add your entire condition here
iterator(root, nested=True)
For the OP, this should work - but I don't have the data you're working with to test if it's perfect.
import xml.etree.ElementTree as et
file = 'test.xml'
tree = et.parse(file)
namespace = "{http://somens}"
props = tree.findall('.//{0}prop'.format(namespace))
def iterator(parents, nested=False):
for child in reversed(parents):
if nested:
if len(child) >= 1:
if prop.attrib.get('type') == 'json':
value = json.loads(prop.attrib['value'])
if value['name'] == 'Page1.Button1':
iterator(props, nested=True)
A solution using lxml module
from lxml import etree as ET
root = ET.fromstring(xml_str)
for e in root.findall('.//{}node'):
parent = e.getparent()
for child in parent.find('./{}node'):
except ValueError:
Using the fact that every child must have a parent, I'm going to simplify @kitsu.eb's example. If you use the findall command to get the children and parents, their indices will be equivalent.
file = open('test.xml', "r")
elem = ElementTree.parse(file)
namespace = "{http://somens}"
search = './/{0}prop'.format(namespace)
# Use xpath to get all parents of props
prop_parents = elem.findall(search + '/..')
props = elem.findall('.//{0}prop'.format(namespace))
for prop in props:
type = prop.attrib.get('type', None)
if type == 'json':
value = json.loads(prop.attrib['value'])
if value['name'] == 'Page1.Button1':
#use the index of the current child to find
#its parent and remove the child
I also used XPath for this issue, but in a different way:
root = elem.getroot()
elementName = "YourElement"
#this will find all the parents of the elements with elementName
for elementParent in root.findall(".//{}/..".format(elementName)):
#this will find all the elements under the parent, and remove them
for element in elementParent.findall("{}".format(elementName)):
I like to use an XPath expression for this kind of filtering. Unless I know otherwise, such an expression must be applied at the root level, which means I can't just get a parent and apply the same expression on that parent. However, it seems to me that there is a nice and flexible solution that should work with any supported XPath, as long as none of the sought nodes is the root. It goes something like this:
root = elem.getroot()
# Find all nodes matching the filter string (flt)
nodes = root.findall(flt)
while len(nodes):
# As long as there are nodes, there should be parents
# Get the first of all parents to the found nodes
parent = root.findall(flt+'/..')[0]
# Use this parent to remove the first node
# Find all remaining nodes
nodes = root.findall(flt)
I would have liked simply to add a comment on the accepted answer, but my lack of reputation doesn't allow me to. I wanted to add that it is important to add .findall("*") to the iterator to avoid issues, as stated in the documentation:
Note that concurrent modification while iterating can lead to problems, just like when iterating and modifying Python lists or dicts. Therefore, the example first collects all matching elements with root.findall(), and only then iterates over the list of matches.
Therefore, in the accepted answer the iteration should be for child in root.findall("*"): instead of for child in root:. Not doing so made my code skip some elements from the list.
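The skipping effect described above can be reproduced with a small sketch. The function names and the four-child sample document are made up for illustration. Both functions try to remove every child; only the one that iterates over the list returned by `findall("*")` reliably removes them all.

```python
import xml.etree.ElementTree as ET


def remove_all_unsafe(root):
    # Removing from the element while iterating over it shifts the
    # remaining children, so some of them are never visited.
    for child in root:
        root.remove(child)
    return len(root)


def remove_all_safe(root):
    # findall("*") returns a separate list, so removal does not
    # disturb the iteration.
    for child in root.findall("*"):
        root.remove(child)
    return len(root)


xml_text = "<r><a/><b/><c/><d/></r>"
left_unsafe = remove_all_unsafe(ET.fromstring(xml_text))
left_safe = remove_all_safe(ET.fromstring(xml_text))
print(left_unsafe, left_safe)
```

`list(root)` would give the same kind of snapshot as `root.findall("*")` if comment and processing-instruction nodes should be included too.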
