Right pattern to deal with different dictionary configurations - Python

My question is not really about a bug, because the program works as it is right now; rather, I'm looking for a way to improve its maintainability, since the system is growing quite fast.
In essence, I have a function (let's call it 'a') that processes an XML document (in the form of a Python dict) and is responsible for getting a specific array of elements (let's call it 'obj') from that XML. The problem is that we process a lot of XMLs from different sources, so each XML has its own structure and the obj element is located in a different place.
The code is currently in the following structure:
def a(self, code, ...):
    xml_doc = ...  # a dict built from an XML document that can have one of many different structures
    obj = None  # the array of objects I want to get from the XML; it might be processed later but is eventually returned
    # Because the XML can have different structures, the object I want
    # can be found in different (well-known) places depending on the code value.
    if code == 'a':
        obj = xml_doc["key1"]["key2"]
    elif code == 'b':
        obj = xml_doc["key3"]
        ...
        # code that processes the obj object
        ...
    elif code == 'c':
        obj = xml_doc["key4"]["key5"]["key6"]
    ...  # elifs for the different codes go on indefinitely
    return obj
As you can see (or not, but believe me), it's not very friendly to add new entries to this function, or to extend the cases that need extra processing. So I was looking for a way to do it with a dictionary that maps each code to the correct XML structure. Something in the direction of the following example:
...
xml_doc = ...
# That would be extremely neat.
code_to_pattern = {
    'a': xml_doc["key1"]["key2"],
    'b': xml_doc["key3"],
    'c': xml_doc["key4"]["key5"]["key6"],
    ...
}
obj = code_to_pattern[code]
obj = self.process_xml(code, obj)  # processes the array, if needed, in another function
return obj
...
However, the above code doesn't work, for an obvious reason: every entry of the code_to_pattern dictionary is evaluated immediately, and each one tries to access an element of xml_doc that might not exist, so an exception is raised. I thought of storing the entries as strings and then using the exec() function, so that Python only interprets each string at the right moment, but I'm not very fond of exec and I'm sure someone can come up with a better idea.
The conditional processing of the XML is the easy part; what I can't come up with is a better way to make adding new entries to the system easy.
I'd be very pleased if someone could help me with some ideas.
EDIT 1: Thank you for your replies. You both (@jarondl and @holdenweb) gave me workable ideas. For the accepted answer I'm going to choose the one that required the least change to the format I showed you, even though I'm going to solve it with XPath.

You should first consider alternatives such as XPath for reading the XML, depending on how you parsed it.
If you want to proceed with your dictionary, you can defer evaluation with lambdas; there's no need for exec:
code_to_pattern = {
    'a': lambda doc: doc["key1"]["key2"],
    'b': lambda doc: doc["key3"],
    'c': lambda doc: doc["key4"]["key5"]["key6"],
    ...
}
obj = code_to_pattern[code](xml_doc)
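One small note (my own sketch, not part of the answer): an unregistered code and a missing key both raise KeyError here, so a single guard covers both cases if you prefer a fallback to a crash:

try:
    obj = code_to_pattern[code](xml_doc)
except KeyError:
    obj = None  # unknown code, or this document lacks the expected keys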

Essentially you are looking for a data-driven solution. You have the essence of such a solution, but rather than mapping the codes to elements of the xml_doc dict, it might be easier to map the codes to the required keys. In other words, look at doing:
xml_doc = ...
code_to_pattern = {
    'a': ("key1", "key2"),
    'b': ("key3",),
    'c': ("key4", "key5", "key6"),
    ...
}
The problem there is that you would then need to adapt to the variable number of keys that the different codes map to, so a simple
obj = code_to_pattern[code]
wouldn't cut it. Be aware, though, that dict keys can be tuples, so it's possible (though you'd know better than I) that rather than using successive subscripts like xml_doc["key4"]["key5"]["key6"] you might be able to use a tuple key like xml_doc["key4", "key5", "key6"]. This may or may not help with your problem.
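If you go the key-tuple route, the variable number of keys is easy to handle by folding over them. A minimal sketch (my own, assuming each value in code_to_pattern is a tuple of successive subscripts):

from functools import reduce
from operator import getitem

def extract(doc, keys):
    # Apply each subscript in turn: doc[k1][k2]...[kn]
    return reduce(getitem, keys, doc)

obj = extract(xml_doc, code_to_pattern[code])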
Finally, you might find it helpful to learn about collections.defaultdict, since it automates the creation of new entries rather than forcing you to test for their presence and create them if absent. This could be helpful even if tuple keys won't cut it for you.

Related

When converting XML to SEVERAL dataframes, how to name these dfs in a dynamic way?

My code is at the bottom.
The parse_XML function can turn an XML file into a df; for example, df = parse_XML("example.xml", lst_level2_tags) works.
But since I want to save several dfs, I want names like df_first_level_tag, etc.
When I run the code at the bottom, I get this error:
f'df_{first_level_tag}'=parse_XML("example.xml", lst_level2_tags)
^
SyntaxError: can't assign to literal
I also tried the .format method instead of an f-string, but that didn't work either.
There are at least 30 dfs to save and I don't want to do it one by one; I've always succeeded with f-strings in Python outside pandas, though.
Is the problem here the f-string/format method, or does my code have some other logic problem?
If necessary, the parse_XML function (the function definition) comes directly from this link.
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    f'df_{first_level_tag}' = parse_XML("example.xml", lst_level2_tags)
This seems like a situation where you'd be best served by putting them into a dictionary:
dfs = {}
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    dfs[first_level_tag] = parse_XML("example.xml", lst_level2_tags)
There's nothing structurally wrong with your f-string, but you generally can't create dynamic variable names in Python without doing ugly things. Storing the values in a dictionary ends up being a much cleaner solution when you want something like that.
One advantage of working with them this way is that you can then just iterate over the dictionary later on if you want to do something to each of them. For example, if you wanted to write each of them to disk as a CSV with a name matching the tag, you could do something like:
for key, df in dfs.items():
    df.to_csv(f'{key}.csv')
You can also just refer to them individually (so if there was a tag named a, you could refer to dfs['a'] to access it in your code later).

Python: How to create a common element between a list and a dict

I am new to data structures in Python and was wondering how you simulate something like pointers in Python, so that multiple structures can refer to and manage the same piece of data.
I have the following two structures:
my_list = [1]
my_dictionary = {}
my_dictionary["hello"] = my_list[0]
and when I do the following I get True:
id(my_dictionary["hello"]) == id(my_list[0])
However, how can I force removal from both the dict and the list in one go?
If I do the following, my_dictionary still holds a reference to the old my_list[0], i.e. 1:
del my_list[0]
Is there a way to get rid of both of these elements in one go? What is the Python way of doing linked structures like this?
It really depends on the problem you're trying to solve by cross-referencing.
Suppose your intent is to efficiently locate an item by key as well as iterate sequentially in order. In this case, irrespective of the language, you would probably want to avoid cross-referencing a hash table and an array, as the updates are inherently linear. Conversely, cross-referencing a hash table and a linked list might make more sense.
For this, you can use something like llist:
import llist

d = {}
l = llist.dllist()

# insert 'foo' and obtain the link
lnk = l.append('foo')
# store the link in the dictionary
d['foo'] = lnk
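With the link in hand, removal from both structures in one go is then two cheap operations (a sketch assuming llist's dllist.remove(node), which unlinks a node in O(1); check the library's documentation):

# remove 'foo' from both containers at once
lnk = d.pop('foo')
l.remove(lnk)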
Conversely, suppose your intent is to efficiently locate an item both by key and by index. Then you can use a dict and a list, and rebuild the list on each modification of the dict; there is no real reason for fancy cross-referencing.
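A minimal sketch of that rebuild approach (names are illustrative):

d = {'a': 1, 'b': 2, 'c': 3}
index = list(d)     # key list; rebuild this after each change to d
print(d[index[2]])  # locate by position -> 3
print(d['b'])       # locate by key as usual -> 2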
Simply put, there is no way to easily link your two structures.
You could manipulate the object you point to so that it has some "deleted" state and would act as if it's deleted (while being in both containers).
However, if all you wanted was a list from a dict, use list(the_dict.values()).
You could make a class to achieve this, if all else fails. See https://docs.python.org/2/reference/datamodel.html#emulating-container-types for the details on what your class would have to have. Within the class, you would have your "duplicated effort," but if it's correctly implemented it wouldn't be error prone.
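A minimal sketch of that class-based idea (names are mine, not from the question):

class SyncedStore:
    """Keep a dict and a list in step so one delete updates both."""

    def __init__(self):
        self._by_key = {}
        self._items = []

    def add(self, key, value):
        self._by_key[key] = value
        self._items.append(value)

    def __getitem__(self, key):
        return self._by_key[key]

    def __delitem__(self, key):
        value = self._by_key.pop(key)
        self._items.remove(value)  # linear scan; fine for modest sizes

    def items_in_order(self):
        return list(self._items)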
You can always do things like this:
Pointers in Python?
(a quick Stack Overflow search shows some results)
But that is messing with more than just data structures.
Remember that Python manages memory for you (in most cases, pretty well), so you don't have to worry about cleaning up after yourself.
I have tried the following piece of code and it works (changing one data structure changes the other).
Does this help?
list1 = [1, 2, 3]
list2 = [4, 5, 6]
my_dictionary = {}
my_dictionary["a"] = list1
my_dictionary["b"] = list2
del list1[0]
print(list1)
print(list2)
print(my_dictionary)

Extending pyyaml to find and replace like xml ElementTree

I'd like to extend this SO question to treat a non-trivial use-case.
Background: pyyaml is pretty sweet insofar as it eats YAML and poops Python-native data structures. But what if you want to find a specific node in the YAML? The referenced question would suggest that, hey, you just know where in the data structure the node lives and index right into it. In fact pretty much every answer to every pyyaml question on SO seems to give this same advice.
But what if you don't know where the node lives in the YAML in advance?
If I were working with XML I'd solve this problem with an xml.etree.ElementTree. These provide nice facilities for loading an XML document into memory and finding elements based on certain search criteria. See find() and findall().
Questions:
Does pyyaml provide search capabilities analogous to ElementTree? (If yes, feel free to yell at me for being bad at Google.)
If no, does anyone have a nice recipe for extending pyyaml to achieve similar things? (Bonus points for not traversing the deserialized YAML all over again.)
Note that one important thing that ElementTree provides in addition to just being able to find things is the ability to modify the XML document given an element reference. I'd like to be able to do this on YAML as well.
The answer to question 1 is: no. PyYAML implements the YAML 1.1 language standard, and there is nothing about finding scalars by some path either in the standard or in the library.
However, if you safe-load a YAML structure, everything is either a mapping, a sequence, or a scalar. Even such a simplistic representation (simple compared to full-fledged object instantiation with !type markers) can already contain recursive, self-referencing structures:
&a x: *a
This is not possible in XML without external semantic interpretation, which makes a generic tree walker much harder to write for YAML than for XML.
YAML's type-loading mechanism also makes a generic tree walker much more difficult, even if you exclude the problem of self-references.
If you don't know in advance where a node lives, you still need to know how to identify it, and since you would have to walk its parent (which might be represented by multiple layers of nested mappings and sequences), it is almost useless to have a generic mechanism that depends on context.
Without being able to rely on context (in general), what is left is a uniquely identifiable value (like the HTML id attribute). If all your objects in the YAML have such a unique id, then it is possible to search the (safe-loaded) tree for that id value and extract any structure underneath it (mappings, sequences) until you hit a leaf (scalar) or a structure that has an id of its own (another object).
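To make that concrete, here is a minimal walker along those lines (my own sketch, assuming every object of interest carries a unique 'id' key; a self-referencing document like the one above would additionally need a visited set to avoid infinite recursion):

import yaml

def find_by_id(node, target):
    # Depth-first search of a safe-loaded tree (dicts, lists, scalars)
    # for a mapping whose 'id' equals target.
    if isinstance(node, dict):
        if node.get('id') == target:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None  # scalar leaf
    for child in children:
        hit = find_by_id(child, target)
        if hit is not None:
            return hit
    return None

with open('doc.yaml') as f:
    tree = yaml.safe_load(f)
obj = find_by_id(tree, 'node-42')  # a found mapping can be edited in place and re-dumped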
I have been following YAML's development for quite some time now (the earliest emails from the YAML mailing list in my YAML folder are from 2004), and I have not seen anything generic evolve since then. I do have some tools to walk the trees and find things, which I use to extract parts of the simplified structure for testing my ruamel.yaml library, but no code that is in releasable shape (it would already be on PyPI if it were), and nothing near a generic solution like you can make for XML (which is, IMO, on its own syntactically less complex than YAML).
Do you know how to search through Python objects? Then you know how to search through the result of yaml.load().
YAML is different from XML in two important ways: one is that while every element in XML has a tag and a value, in YAML there can be things that are only values. But secondly... again, YAML creates Python objects. There is no intermediate in-memory format to use.
For example, if you load a YAML file like this:
- First
- Second
- Third
you'll get a list like ['First', 'Second', 'Third']. Want to find 'Third' and don't know where it is? You can use [x for x in my_list if 'Third' in x] to find it. Need to look up an item in a dictionary? Just do it.
If you want to modify an object, you don't modify the YAML, you modify the object. For example, say I now want the second entry to be in German. I just do my_list[1] = 'zweite', modifying it in place. Now the Python list looks like ['First', 'zweite', 'Third'], and dumping it to YAML looks like:
- First
- zweite
- Third
Note that PyYAML is pretty smart... you can even create objects with loops:
>>> a = [1, 2, 3]
>>> b = {}
>>> b[1] = a
>>> b[2] = a
>>> print(yaml.dump(b))
1: &id001 [1, 2, 3]
2: *id001
>>> b[2] = [3, 4, 5]
>>> print(yaml.dump(b))
1: [1, 2, 3]
2: [3, 4, 5]
In the first case, it even figured out that b[1] and b[2] point to the same object, so it created an anchor and an alias linking one to the other... in the original object, if you did something like a.pop(), both b[1] and b[2] would show that one entry was gone. If you dump that object to YAML and then load it back in, that will still be true.
(And note that in the second case, where they aren't the same object, PyYAML doesn't create the extra notation, as it doesn't need to.)
In short: Most likely, you're just overthinking it.

Python: Linking to a dictionary through a text string

I'm trying to create a program module that contains data structures (dictionaries) and text strings that describe those data structures. I want to import these (dictionaries and descriptions) into a module that feeds a GUI interface. One of the displayed lines is the contents of the first dictionary, with one field that contains all possible values contained in another dictionary. I'm trying to avoid hard-coding this relationship, and would like the string describing the first dictionary to carry a link to the second dictionary (containing all possible values). An abstracted example:
dict1 = {
    "1": ["dog", "cat", "fish"],
    "2": ["alpha", "beta", "gamma", "epsilon"]
}
string = "parameter1,parameter2,dict1"
# Silly example starts here
string = string.split(",")
print(string[2]["2"])
(I'd like to get ["alpha", "beta", "gamma", "epsilon"].)
But of course this doesn't work.
Does anyone have a clever solution to this problem?
Generally, this kind of dynamic name lookup is a bad idea. It leads to code that is very difficult to read and maintain. However, if you must, you can use globals() for this:
globals()[string[2]]["2"]
A better solution would be to put dict1 into a dictionary in the first place:
dict1 = ...
namespace = {'dict1': dict1}
string = ...
namespace[string[2]]["2"]
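Putting the pieces together, a runnable version of the namespace approach with the data from the question:

dict1 = {
    "1": ["dog", "cat", "fish"],
    "2": ["alpha", "beta", "gamma", "epsilon"],
}
namespace = {"dict1": dict1}

parts = "parameter1,parameter2,dict1".split(",")
print(namespace[parts[2]]["2"])  # ['alpha', 'beta', 'gamma', 'epsilon']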

Look up python dict value by expression

I have a dict that has Unix epoch timestamps for keys, like so:
lookup_dict = {
    1357899: {},  # some dict of data
    1357910: {},  # some other dict of data
}
Except, you know, with millions and millions and millions of entries. I'd like to subset this dict, over and over again. Ideally, I'd love to be able to write something like I can in R:
lookup_value = 1357900
dict_subset = lookup_dict[key >= lookup_value]
# dict_subset now contains {1357910: {}}
But I confess I can't find any actual proof that this is something Python can do without, one way or another, iterating over every entry. If I understand Python correctly (and I might not), key lookup of the form key in dict uses hashing and is thus very fast; is there any way to do a binary search on dict keys?
To do this without iterating, you're going to need the keys in sorted order. Then you just need to do a binary search for the first key >= lookup_value, instead of checking each one.
If you're willing to use a third-party library, there are plenty out there. The first two that spring to mind are bintrees (which uses a red-black tree, like C++, Java, etc.) and blist (which uses a B+Tree). For example, with bintrees, it's as simple as this:
dict_subset = lookup_dict[lookup_value:]
And this will be as efficient as you'd hope: basically, it adds a single O(log N) search on top of whatever the cost of using that subset is. (Of course, usually what you want to do with that subset is iterate over the whole thing, which ends up being O(N) anyway... but maybe you're doing something different, or maybe the subset is only 10 keys out of 1,000,000.)
Of course there is a tradeoff. Random access to a tree-based mapping is O(log N) instead of "usually O(1)". Also, your keys obviously need to be fully ordered, instead of hashable (and that's a lot harder to detect automatically and raise nice error messages on).
If you want to build this yourself, you can. You don't even necessarily need a tree; just keep a sorted list of keys alongside the dict. You can maintain the list with the bisect module in the stdlib, as JonClements suggested. You may want to wrap bisect up into a sorted-list object, or better, use one of the recipes on ActiveState or PyPI that does it for you. You can then wrap the sorted list and the dict together into a single object, so you don't accidentally update one without updating the other. And then you can extend the interface to be as nice as bintrees, if you want.
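For illustration, a minimal sketch of that do-it-yourself route using bisect (names are mine; the sorted key list must be kept in step with the dict on every insert and delete):

from bisect import bisect_left

sorted_keys = sorted(lookup_dict)  # maintained alongside the dict

def subset_from(lookup_value):
    # O(log N) search for the first key >= lookup_value, then slice.
    i = bisect_left(sorted_keys, lookup_value)
    return {k: lookup_dict[k] for k in sorted_keys[i:]}

dict_subset = subset_from(1357900)  # -> {1357910: {...}}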
The following code will also work:
some_time_to_filter_for = ...  # some Unix time

# Create a new sub-dictionary
sub_dict = {key: val for key, val in lookup_dict.items()
            if key >= some_time_to_filter_for}
Basically, we just iterate through all the keys in the dictionary and, given a time to filter for, take every key that is greater than or equal to that value and place it into the new dictionary. (Unlike the tree or bisect approaches above, this does visit every entry.)
