Extending pyyaml to find and replace like xml ElementTree - python

I'd like to extend this SO question to treat a non-trivial use-case.
Background: pyyaml is pretty sweet insofar as it eats YAML and poops Python-native data structures. But what if you want to find a specific node in the YAML? The referenced question would suggest that, hey, you just know where in the data structure the node lives and index right into it. In fact pretty much every answer to every pyyaml question on SO seems to give this same advice.
But what if you don't know where the node lives in the YAML in advance?
If I were working with XML I'd solve this problem with xml.etree.ElementTree. It provides nice facilities for loading an XML document into memory and finding elements based on certain search criteria. See find() and findall().
Questions:
Does pyyaml provide search capabilities analogous to ElementTree? (If yes, feel free to yell at me for being bad at Google.)
If no, does anyone have a nice recipe for extending pyyaml to achieve similar things? (Bonus points for not traversing the deserialized YAML all over again.)
Note that one important thing that ElementTree provides in addition to just being able to find things is the ability to modify the XML document given an element reference. I'd like to be able to do this on YAML as well.

The answer to question 1 is: no. PyYAML implements the YAML 1.1 language standard and there is nothing about finding scalars by any path in the standard nor in the library.
However, if you safe-load a YAML document, everything is either a mapping, a sequence or a scalar. Even such a simplistic representation (simple compared to full-fledged object instantiation with !type markers) can already contain recursive, self-referencing structures:
&a x: *a
This is not possible in XML without external semantic interpretation. It makes writing a generic tree walker much harder for YAML than for XML.
YAML's type-loading mechanism makes a generic tree walker more difficult still, even if you exclude the problem of self-references.
If you don't know where a node lives in advance, you still need to know how to identify it. And since you don't, you would have to walk its parent (which might be represented in multiple layers of combined mappings and sequences), so a generic mechanism that depends on context is almost useless.
Without being able to rely on context (in general) the thing that is left is a uniquely identifiable value (like the HTML id attribute). If all your objects in YAML have such a unique id, then it is possible to search the (safeloaded) tree for such an id value and extract any structure underneath it (mappings, sequences) until you hit a leaf (scalar), or some structure that has an id of its own (another object).
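One possible recipe along those lines: a recursive walk over the safe-loaded tree that looks for a mapping carrying a unique id. This is a sketch under that assumption — the find_by_id helper and the "id" key name are mine, not PyYAML API, and it will recurse forever on self-referencing structures:

```python
def find_by_id(node, wanted):
    """Walk the mappings/sequences that yaml.safe_load() produces and
    return the first mapping whose 'id' entry equals `wanted`."""
    if isinstance(node, dict):
        if node.get("id") == wanted:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None  # scalar leaf, nothing to descend into
    for child in children:
        found = find_by_id(child, wanted)
        if found is not None:
            return found
    return None

# What yaml.safe_load() would hand you for a small document:
doc = {
    "server": {"id": "web01", "port": 8080},
    "clients": [{"id": "c1", "host": "alpha"}],
}
hit = find_by_id(doc, "c1")
hit["host"] = "beta"  # mutating the hit modifies the loaded tree in place
```

Because the hit is a reference into the loaded structure, editing it and re-dumping with yaml.dump() gives you the modify-then-serialize workflow that ElementTree offers for XML.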
I have been following YAML development for quite some time now (the earliest emails from the YAML mailing list in my YAML folder are from 2004) and I have not seen anything generic evolve since then. I do have some tools to walk the trees and find things that I use for extracting parts of the simplified structure for testing my ruamel.yaml library, but no code that is in a releasable shape (it would already have been on PyPI if it were), and nothing near a generic solution like you can make for XML (which is, IMO, on its own syntactically less complex than YAML).

Do you know how to search through Python objects? Then you know how to search through the results of yaml.load()...
YAML is different from XML in two important ways: one is that while every element in XML has a tag and a value, in YAML there can be some things that are only values. But secondly... again, YAML creates Python objects. There is no intermediate in-memory format to use.
E.G. if you load a YAML file like this:
- First
- Second
- Third
you'll get a list like ['First', 'Second', 'Third']. Want to find 'Third' and don't know where it is? You can use [x for x in my_list if 'Third' in x] to find it. Need to lookup an item in a dictionary? Just do it.
If you want to modify an object, you don't modify the YAML, you modify the object. E.g. say I now want the second entry to be in German. I just do my_list[1] = 'zweite', modifying it in place. Now the Python list looks like ['First', 'zweite', 'Third'], and dumping it to YAML looks like
- First
- zweite
- Third
Note that PyYAML is pretty smart... you can even create objects with loops:
>>> import yaml
>>> a = [1, 2, 3]
>>> b = {}
>>> b[1] = a
>>> b[2] = a
>>> print(yaml.dump(b))
1: &id001 [1, 2, 3]
2: *id001
>>> b[2] = [3, 4, 5]
>>> print(yaml.dump(b))
1: [1, 2, 3]
2: [3, 4, 5]
In the first case, it even figured out that b[1] and b[2] point to the same object, so it created an anchor (&id001) and an alias (*id001) linking one to the other... in the original object, if you did something like a.pop(), both b[1] and b[2] would show that one entry was gone. If you dump that object to YAML and then load it back in, that will still be true.
(And note that in the second case, where they aren't the same object, PyYAML doesn't create the extra notation, as it doesn't need to.)
In short: Most likely, you're just overthinking it.

Related

Right pattern to deal with different dictionary configuration - Python

My question is not really about a problem, because the program works the way it is right now; however, I'm looking for a way to improve its maintainability, since the system is growing quite fast.
In essence, I have a function (let's call it 'a') that is going to process an XML document (in the form of a Python dict) and is responsible for getting a specific array of elements (let's call it 'obj') from this XML. The problem is that we process a lot of XMLs from different sources; therefore, each XML has its own structure and the obj element is located in different places.
The code is currently in the following structure:
def a(self, code, ...):
    xml_doc = ...  # a dict built from an XML document that can have one of many different structures
    obj = None  # array of objects I want to get from the XML; it might be processed later but is eventually returned

    # Because the XML can have different structures, the object I want
    # can be placed in different (well-known) places depending on the code value.
    if code == 'a':
        obj = xml_doc["key1"]["key2"]
    elif code == 'b':
        obj = xml_doc["key3"]
        ...
        # code that processes the obj object
        ...
    elif code == 'c':
        obj = xml_doc["key4"]["key5"]["key6"]
    ...  # elifs for different codes go on indefinitely
    return obj
As you can see (or not - but believe me), it's not very friendly to add new entries to this function and add code to the cases that have to be processed. So I was looking for a way to do it using dictionaries to map the code to the correct XML structure. Something in the direction of the following example:
...
xml_doc = ...
# That would be extremely neat.
code_to_pattern = {
    'a': xml_doc["key1"]["key2"],
    'b': xml_doc["key3"],
    'c': xml_doc["key4"]["key5"]["key6"],
    ...
}
obj = code_to_pattern[code]
obj = self.process_xml(code, obj) # It will process the array if it has to in another function.
return obj
...
However, the above code doesn't work for obvious reasons: each entry of the code_to_pattern dictionary tries to access an element of xml_doc that might not exist, so an exception is raised. I thought of adding the entries as strings and then using the exec() function, so Python only interprets the string at the right moment, but I'm not very fond of exec and I am sure someone can come up with a better idea.
The conditional processing part of the XML is easy to do, however I can't think in a better way to have an easy method to add new entries to the system.
I'd be very pleased if someone can help me with some ideas.
EDIT1: Thank you for your replies, guys. You both (@jarondl and @holdenweb) gave me workable and working ideas. For the accepted answer I am going to choose the one that required the least change to the format I gave you, even though I am going to solve it through XPath.
You should first consider alternatives such as XPath to read the XML, depending on how you parsed it.
If you want to proceed with your dictionary, you can have non evaluated code with lambda - no need for exec:
code_to_pattern = {
    'a': lambda doc: doc["key1"]["key2"],
    'b': lambda doc: doc["key3"],
    'c': lambda doc: doc["key4"]["key5"]["key6"],
    ...
}
obj = code_to_pattern[code](xml_doc)
Essentially you are looking for a data-driven solution. You have the essence of such a solution, but rather than mapping the codes to elements of the xml_doc table, it might be easier to map the codes to the required keys. In other words, look at doing:
xml_doc = ...
code_to_pattern = {
    'a': ("key1", "key2"),
    'b': ("key3",),
    'c': ("key4", "key5", "key6"),
    ...
}
The problem there is that you would then need to adapt to the variable number of keys that the different objects mapped to, so a simple
obj = code_to_pattern[code]
wouldn't cut it. Be aware, though, that dicts can take tuples as keys, so it's possible (though you'd know better than I) that rather than using successive indices like xml_doc["key4"]["key5"]["key6"] you might be able to use a tuple key like xml_doc["key4", "key5", "key6"]. This may or may not help you with your problem.
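If tuple indices don't pan out, the variable number of keys can be handled by folding the key tuple over the document with functools.reduce. A sketch, with an illustrative sample xml_doc:

```python
from functools import reduce

# codes mapped to the chain of keys where each code's objects live
code_to_keys = {
    'a': ("key1", "key2"),
    'b': ("key3",),
    'c': ("key4", "key5", "key6"),
}

def lookup(xml_doc, code):
    # xml_doc["key1"]["key2"] expressed as a fold over the key tuple
    return reduce(lambda node, key: node[key], code_to_keys[code], xml_doc)

xml_doc = {"key1": {"key2": ["obj1", "obj2"]}, "key3": "flat"}
```

Nothing is evaluated until lookup() is called, so unused codes never touch keys that are absent from a given document.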
Finally, you might find it helpful to learn about collections.defaultdict, since it automates the creation of new entries rather than forcing you to test for their presence and create them if absent. This could be helpful even if tuple keys won't cut it for you.

use string content as variable - python

I am using a package that has operations inside the class (? not sure what either is really), and normally the data is accessed as data[package.operation]. Since I have to do multiple operations I thought of shortening it and doing the following:
ops = ["o1", "o2", "o3", "o4", "o5", "o6"]
for i in ops:
    print(data[package.i])
but since it's considering i as a string it doesn't do the operation, and if I take away the quotes then it is an undefined variable. Is there a way to go around this? Or will I just have to write it the long way?
In particular I am using pymatgen, its package Orbital and with the .operation I want to call specific suborbitals. A real example of how it would be used is data[0][Orbital.s], the first [0] denotes the element in question for which to get the orbitals s (that's why I omitted it in the code above).
You can use getattr in order to dynamically select attributes from objects (the Orbital package in your case; for example getattr(Orbital, 's')).
So your loop would be rewritten to:
for op in ['o1', 'o2', 'o3', 'o4', 'o5', 'o6']:
    print(data[getattr(package, op)])
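A self-contained sketch of the idea — here FakeOrbital stands in for pymatgen's Orbital, and all the names are illustrative:

```python
class FakeOrbital:
    # stand-in for pymatgen's Orbital: class attributes used as dict keys
    s = "s"
    px = "px"

# one element's orbital data, as in data[0][Orbital.s]
data = [{FakeOrbital.s: 1.0, FakeOrbital.px: 0.5}]

for op in ["s", "px"]:
    # getattr(FakeOrbital, "s") returns the very same object as FakeOrbital.s
    print(op, data[0][getattr(FakeOrbital, op)])
```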

How do I get a formatted list of methods from a given object?

I'm a beginner, and the answers I've found online so far for this have been too complicated to be useful, so I'm looking for an answer in vocabulary and complexity similar to this writing.
I'm using Python 2.7 in the IPython notebook environment, along with related modules as distributed by Anaconda, and I need to learn about library-specific objects in the course of my daily work. The case I'm using here is a pandas DataFrame object, but the answer must work for any object from Python or an imported module.
I want to be able to print a list of methods for a given object, directly from my program, in a concise and readable format. Even if it's just the method names in alphabetical order, that would be great. A bit more detail would be even better; an ordering based on what each method does is fine, but I'd like the output to look like a table, one row per method, not big blocks of text. What I've tried is below, and it fails for me because it's unreadable: it puts copies of my data between each line, and it has no formatting.
(I love stackoverflow. I aspire to have enough points someday to upvote all your wonderful answers.)
import pandas
import inspect

data_json = """{"0":{"comment":"I won't go to school"}, "1":{"note":"Then you must stay in bed"}}"""
data_df = pandas.io.json.read_json(data_json, typ='frame',
                                   dtype=True, convert_axes=True,
                                   convert_dates=True, keep_default_dates=True,
                                   numpy=False, precise_float=False,
                                   date_unit=None)
inspect.getmembers(data_df, inspect.ismethod)
Thanks,
- Sharon
Create an object of type str:
name = "Fido"
List all its attributes (methods are just attributes that happen to be callable) in alphabetical order:
for attr in sorted(dir(name)):
    print(attr)
Get more information about the lower (function) attribute:
print(name.lower.__doc__)
In an interactive session, you can also use the more convenient
help(name.lower)
function.
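To get closer to the one-row-per-method table the question asks for, the same dir() idea can be filtered and formatted. A sketch — it works for any object, a str shown here:

```python
name = "Fido"

# keep callable, non-underscore attributes -- i.e. the public methods
methods = [attr for attr in sorted(dir(name))
           if not attr.startswith("_") and callable(getattr(name, attr))]

# one row per method: name padded to a column, then the docstring's first line
for attr in methods:
    doc = (getattr(name, attr).__doc__ or "").strip()
    summary = doc.splitlines()[0] if doc else ""
    print("%-20s %s" % (attr, summary))
```

Because it is built on dir() and getattr(), the same loop works on a pandas DataFrame or anything else, with no copies of your data in the output.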

How to sort a list of inter-linked tuples?

lst = [(u'course', u'session'), (u'instructor', u'session'), (u'session', u'trainee'), (u'person', u'trainee'), (u'person', u'instructor'), (u'course', u'instructor')]
I have the above list of tuples and I need to sort it with the following logic:
Each tuple's 2nd element depends on its 1st element, e.g. (course, session) -> session depends on course, and so on.
I want a list sorted by dependency priority: less-dependent or independent objects come first, so the output should be as below,
lst = [course, person, instructor, session, trainee]
You're looking for what's called a topological sort. The wikipedia page shows the classic Kahn and depth-first-search algorithms for it; Python examples are here (a bit dated, but should still run fine), on pypi (stable and reusable -- you can also read the code online here) and here (Tarjan's algorithm, that kind-of also deals with cycles in the dependencies specified), just to name a few.
Conceptually, what you need to do is create a directed acyclic graph with edges determined by the contents of your list, and then do a topological sort on the graph. The algorithm to do this doesn't exist in Python's standard library (at least, not that I can think of off the top of my head), but you can find plenty of third-party implementations online, such as http://www.bitformation.com/art/python_toposort.html
The function at that website takes a list of all the strings (its items argument) and another list of the pairs between strings (partial_order). Your lst should be passed as the second argument. To generate the first argument, you can use itertools.chain.from_iterable(lst), so the overall function call would be
import itertools
lst = ...
ordering = topological_sort(itertools.chain.from_iterable(lst), lst)
Or you could modify the function from the website to only take one argument, and to create the nodes in the graph directly from the values in your lst.
EDIT: Using the topsort module Alex Martelli linked to, you could just pass lst directly.
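If pulling in a module is overkill, a minimal sketch of Kahn's algorithm handles exactly this list-of-pairs input (it relies on insertion-ordered dicts, i.e. Python 3.7+, for a stable tie-break):

```python
from collections import defaultdict, deque

def topological_sort(pairs):
    """Kahn's algorithm: repeatedly emit nodes with no remaining prerequisites."""
    successors = defaultdict(list)
    in_degree = {}
    for parent, child in pairs:
        successors[parent].append(child)
        in_degree.setdefault(parent, 0)
        in_degree[child] = in_degree.get(child, 0) + 1
    # start with every node that depends on nothing
    queue = deque(node for node, deg in in_degree.items() if deg == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(in_degree):
        raise ValueError("cycle in dependencies")
    return order

lst = [(u'course', u'session'), (u'instructor', u'session'),
       (u'session', u'trainee'), (u'person', u'trainee'),
       (u'person', u'instructor'), (u'course', u'instructor')]
print(topological_sort(lst))  # ['course', 'person', 'instructor', 'session', 'trainee']
```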

Python Extension Returned Object Etiquette

I am writing a Python extension to provide access to Solaris kstat data (in the same spirit as the shipping Perl library Sun::Solaris::Kstat) and I have a question about conditionally returning a list or a single object. The Python use case would look something like:
cpu_stats = cKstats.lookup(module='cpu_stat')
cpu_stat0 = cKstats.lookup('cpu_stat',0,'cpu_stat0')
As it's currently implemented, lookup() returns a list of all kstat objects that match. The first case would result in a list of objects (as many as there are CPUs), and the second call specifies a single kstat completely and would return a list containing one kstat.
My question: is it poor form to return a single object when there is only one match, and a list when there are many?
Thank you for the thoughtful answer! My python-fu is weak but growing stronger due to folks like you.
"My question is it poor form to return a single object when there is only one match, and a list when there are many?"
It's poor form to return inconsistent types.
Return a consistent type: List of kstat.
Most Pythonistas don't like using type(result) to determine if it's a kstat or a list of kstats.
We'd rather check the length of the list in a simple, consistent way.
Also, if the length depends on a piece of system information, perhaps an API method could provide this metadata.
Look at DB-API PEP for advice and ideas on how to handle query-like things.
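A sketch of the consistent-list shape — the record format and lookup internals here are hypothetical stand-ins, not the real kstat API:

```python
def lookup(table, module=None, instance=None, name=None):
    """Always return a list of matching records -- empty, one, or many."""
    return [r for r in table
            if (module is None or r["module"] == module)
            and (instance is None or r["instance"] == instance)
            and (name is None or r["name"] == name)]

# toy stand-in for the kstat table
table = [
    {"module": "cpu_stat", "instance": 0, "name": "cpu_stat0"},
    {"module": "cpu_stat", "instance": 1, "name": "cpu_stat1"},
]

cpu_stats = lookup(table, module="cpu_stat")             # list of two
cpu_stat0 = lookup(table, "cpu_stat", 0, "cpu_stat0")    # still a list, length one
```

Callers then always write len(result) or iterate, never type-check the return value.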
