Dynamically generate array elements from yaml file in Python

Dynamically generate array elements from yaml file in Python - python

Given the following yaml file stored in my_yaml that contains varying sets of dictionary keys and/or class variables (denoted by self._*):
config1.json:
- [[foo, bar], [hello, world]]
config2.json:
- [[foo], [self._hi]]
From the json file, I want to populate a new list of tuples. The items in each tuple are determined by looking up dict keys in this yaml file.
So if I iterate through a dictionary called config1.json, and I have an empty list called config_list, I want to do something like:
config_list.append(tuple[i['foo']['bar],i['hello']['world']])
But if it were config2.json, I want to do something like:
config_list.append(tuple[i['foo'],self._hi])
I can do this in a less dynamic way:
for i in my_yaml['config1.json'][0]:
config_list.append(tuple([ i[my_yaml[asset][0][0]][my_yaml[asset][0][1]],i[my_yaml[asset][1][0]][my_yaml[asset][1][1]]]))
or:
for i in my_yaml['config2.json'][0]:
config_list.append(tuple([ i[my_yaml[asset][0][0]],i[my_yaml[asset][1][0]]]))
Instead I would like to dynamically generate the contents of config_list
Any ideas or alternatives would be greatly appreciated.

I think you are bit confusing things, first of all because you are referring
to a file in "From the json [sic] file" and there is no JSON file mentioned
anywhere in the question. There are mapping keys that look like
filenames for JSON files, so I hope we can assume you mean "From the value
associated with the mapping key that ends in the string .json".
The other confusing thing is that you obfuscate the fact that you want tuples
but load list nested in list nested in lists from you YAML document.
If you want tuples, it is much more clear to specify them in your YAML document:
config1.json:
- !!python/tuple [[foo, bar], [hello, world]]
config2.json:
- !!python/tuple [[foo], [self._hi]]
So you can do:
import sys
import ruamel.yaml
yaml = ruamel.yaml.YAML(typ='unsafe')
with open('my.yaml') as fp:
my_yaml = yaml.load(fp)
for key in my_yaml:
for idx, elem in enumerate(my_yaml[key]):
print('{}[{}] -> {}'.format(key, idx, my_yaml[key][idx]))
which directly gives you the tuples you seem to want instead of lists you need to process:
config1.json[0] -> (['foo', 'bar'], ['hello', 'world'])
config2.json[0] -> (['foo'], ['self._hi'])
In your question you hard code access to the first and only
element of the sequence that are the values for the root level
mapping. This forces you to use the final [0] in your for loop. I
assume you are going to have multiple elements in those sequences, but
for a good question you should leave that out, as it is irrelevant for the
question on how to get the tuples, and thereby only obfuscating things.
Please note that you need to keep control over your input, as using
typ='unsafe' is, you guessed, unsafe. If you cannot guarantee that
use typ='safe' and register and use the tag !tuple.

Related

h5py: how to use keys() loop over HDF5 Groups and Datasets

print(list(file.keys()))
When I run this code I get:
T00000000,T00000001,T00000002,T00000003, ... ,T00000474
Now, I analized T00000000 but I want to scan them all with for loop. I couldn't do because this is a string. Is there any way to do this?

#python_student, there is more to this than explained in the initial answer. Based on the syntax of you question, it appears you are using h5py to read the HDF5 file. To effectively access the file contents, you need a basic understanding of HDF5 and h5py. I suggest starting here: h5py Quick Start Guide. In addition, there are many good questions and answer here on StackOverflow with details and examples.
An HDF5 file has 2 basic objects:
Datasets: array-like collections of data
Groups: folder-like containers that hold datasets and other groups
h5py, uses dictionary syntax to access Group objects, and reads Datasets using NumPy syntax. (Note group objects are not Python dictionaries - just just "look" like them!)
As you noted, the keys() are the NAMES of the objects (groups or datasets) at the root level of your file. Your code created a list from the group keys: list(file.keys()). In general there is no reason to do this. Typically, you will iterate over the keys() or items() instead of creating a list.
Here is a short code segment to show how you might do this. I can add more details once I know more about your data schema. (HDF5 is a general data container and have most any schema.)
# loop on names:
for name in file.keys():
print(name)
# loop on names and H5 objects:
for name, h5obj in file.items():
if isinstance(h5obj,h5py.Group):
print(name,'is a Group')
elif isinstance(h5obj,h5py.Dataset):
print(name,'is a Dataset')
# return a np.array using dataset object:
arr1 = h5obj[:]
# return a np.array using dataset name:
arr2 = file[name][:]
# compare arr1 to arr2 (should always return True):
print(np.array_equal(arr1, arr2))

Yes, you can use the split() method.
If the string is "T00000000,T00000001,T00000002,T00000003, ... ,T00000474", you can use split to turn it on a list like this:
string = "T00000000,T00000001,T00000002,T00000003, ... ,T00000474"
values = string.split(",")
So, the list values becomes ["T00000000", "T00000001","T00000003", ... ,"T000000474"].
Then you can use this in a for loop.
If you don't wanto to create a list, you can simply:
for value in string.split(","):
#Your code here...
The for loop will be execute with the values T00000000, T00000001, T00000003 ...

Dictionary update replaces original values instead

Whenever I receive a new url, I try to add that in my dictionary, along with the current time.
However, when I use the update() method, it replaces original values with the new values I added, so that the only thing in the dictionary now are the new values (and not the old ones).
Here is a shorter version of my code:
if domain not in lst:
lst.append(domain)
domaindict = {}
listofdomains.append(domaindict)
domaindict.update({domain:datetime.now().strftime('%m/%d/%Y %H:%M:%S')})
if domain in lst:
domindex = lst.index(domain)
listofdomains[domindex].update({domain:datetime.now().strftime('%m/%d/%Y %H:%M:%S')})
lst is the list of domain names so far, while listofdomains is the list that contains all the dictionaries of the separate domains (each dictionary has the domain name plus the time).
When I try to print listofdomains:
print(listofdomains)
It only prints out the newly added domain and urls in the dictionaries. I also tried to use other methods to update a dictionary, as detailed in the answers to this question, but my dictionaries are still not functioning properly.
Why did the original key/value pairs dissapear?

The simplest structure would probably be a dict of lists:
data = {domain1:[time1, time2, ...], domain2:[...] ...}
You can build it simply using a defaultdict that creates empty lists on the fly when needed. Your code would be:
from collections import defaultdict
data = defaultdict(list)
and your whole code becomes simply:
data[domain].append(datetime.now().strftime('%m/%d/%Y %H:%M:%S'))

Python: Linking to a dictionary through a text string

I'm trying to create a program module that contains data structures (dictionaries) and text strings that describe those data structures. I want to import these (dictionaries and descriptions) into a module that is feeding a GUI interface. One of the displayed lines is the contents contained in the first dictionary with one field that contains all possible values contained in another dictionary. I'm trying to avoid 'hard-coding' this relationship and would like to pass a link to the second dictionary (containing all possible values) to the string describing the first dictionary. An abstracted example would be:
dict1 = {
"1":["dog","cat","fish"],
"2":["alpha","beta","gamma","epsilon"]
}
string="parameter1,parameter2,dict1"
# Silly example starts here
#
string=string.split(",")
print string[2]["2"]
(I'd like to get: ["alpha","beta","gamma","epsilon"]
But of course this doesn't work
Does anyone have a clever solution to this problem?

Generally, this kind of dynamic code execution is a bad idea. it leads to very difficult to read and maintain code. However, if you must, you can use globals for this:
globals()[string[2]]["2"]
A better solution would be to put dict1 into a dictionary in the first place:
dict1 = ...
namespace = {'dict1': dict1}
string = ...
namespace[string[2]]["2"]

Is it possible to use multiple keys for a single element in a dict?

I am writing my own function for parsing XML text into objects which is can manipulate and render back into XML text. To handle the nesting, I am allowing XML objects to contain other XML objects as elements.
Since I am automatically generating these XML objects, my plan is to just enter them as elements of a dict as they are created. I was planning on generating an attribute called name which I could use as the key, and having the XML object itself be a value assigned to that key.
All this makes sense to me at this point. But now I realize that I would really like to also save an attribute called line_number, which would be the line from the original XML file where I first encountered the object, and there may be some cases where I would want to locate an XML object by line_number, rather than by name.
So these are my questions:
Is it possible to use a dict in such a way that I could find my XML object either by name or by line number? That is, is it possible to have multiple keys assigned to a single value in a dict?
How do I do that?
If this is a bad idea, what is a better way?

Yes, it is possible. No special magic is required:
In [1]: val = object()
In [2]: d = {}
In [3]: d[123] = val
In [4]: d['name'] = val
In [5]: d
Out[5]: {123: <object at 0x23c6d0>, 'name': <object at 0x23c6d0>}
I would, however, use two separate dictionaries, one for indexing by name, and one for indexing by line number. Even if the sets of names and line numbers are completely disjoint, I think this is a cleaner design.

my_dict['key1'] = my_dict['key2'] = SomeObject
should work fine i would think

Since dictionaries can have keys of multiple types, and you are using names (strings only) as one key and numbers (integers only) as another, you can simply make two separate entries point to the same object - one for the number, and one for the string.
dict[0] = dict['key'] = object1

Parse a CSV file using python (to make a decision tree later) [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
First off, full disclosure: This is going towards a uni assignment, so I don't want to receive code. :). I'm more looking for approaches; I'm very new to python, having read a book but not yet written any code.
The entire task is to import the contents of a CSV file, create a decision tree from the contents of the CSV file (using the ID3 algorithm), and then parse a second CSV file to run against the tree. There's a big (understandable) preference to have it capable of dealing with different CSV files (I asked if we were allowed to hard code the column names, mostly to eliminate it as a possibility, and the answer was no).
The CSV files are in a fairly standard format; the header row is marked with a # then the column names are displayed, and every row after that is a simple series of values. Example:
# Column1, Column2, Column3, Column4
Value01, Value02, Value03, Value04
Value11, Value12, Value13, Value14
At the moment, I'm trying to work out the first part: parsing the CSV. To make the decisions for the decision tree, a dictionary structure seems like it's going to be the most logical; so I was thinking of doing something along these lines:
Read in each line, character by character
If the character is not a comma or a space
Append character to temporary string
If the character is a comma
Append the temporary string to a list
Empty string
Once a line has been read
Create a dictionary using the header row as the key (somehow!)
Append that dictionary to a list
However, if I do things that way, I'm not sure how to make a mapping between the keys and the values. I'm also wondering whether there is some way to perform an action on every dictionary in a list, since I'll need to be doing things to the effect of "Everyone return their values for columns Column1 and Column4, so I can count up who has what!" - I assume that there is some mechanism, but I don't think I know how to do it.
Is a dictionary the best way to do it? Would I be better off doing things using some other data structure? If so, what?

Python has some pretty powerful language constructs builtin. You can read lines from a file like:
with open(name_of_file,"r") as file:
for line in file:
# process the line
You can use the string.split function to separate the line along commas, and you can use string.strip to eliminate intervening whitespace. Python has very powerful lists and dictionaries.
To create a list, you simply use empty brackets like [], while to create an empty dictionary you use {}:
mylist = []; # Creates an empty list
mydict = {}; # Creates an empty dictionary
You can insert into the list using the .append() function, while you can use indexing subscripts to insert into the dictionary. For example, you can use mylist.append(5) to add 5 to the list, while you can use mydict[key]=value to associate the key key with the value value. To test whether a key is present in the dictionary, you can use the in keyword. For example:
if key in mydict:
print "Present"
else:
print "Absent"
To iterate over the contents of a list or dictionary, you can simply use a for-loop as in:
for val in mylist:
# do something with val
for key in mydict:
# do something with key or with mydict[key]
Since, in many cases, it is necessary to have both the value and index when iterating over a list, there is also a builtin function called enumerate that saves you the trouble of counting indices yourself:
for idx, val in enumerate(mylist):
# do something with val or with idx. Note that val=mylist[idx]
The code above is identical in function to:
idx=0
for val in mylist:
# process val, idx
idx += 1
You could also iterate over the indices if you so chose:
for idx in xrange(len(mylist)):
# Do something with idx and possibly mylist[idx]
Also, you can get the number of elements in a list or the number of keys in a dictionary using len.
It is possible to perform an operation on each element of a dictionary or list via the use of list comprehension; however, I would recommend that you simply use for-loops to accomplish that task. But, as an example:
>>> list1 = range(10)
>>> list1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list2 = [2*x for x in list1]
>>> list2
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
When you have the time, I suggest you read the Python tutorial to get some more in-depth knowledge.

Example using the csv module from docs.python.org:
import csv
reader = csv.reader(open("some.csv", "rb"))
for row in reader:
print row
Instead of printing the rows, you could just save each row into a list, and then process it in the ID3 later.
database.append(row)

Short answer: don't waste time and mental energy (1) reimplementing the built-in csv module (2) reading the csv module's source (it's written in C) -- just USE it!

Look at csv.DictReader.
Example:
import csv
reader = csvDictReader(open('my_file.csv','rb') # 'rb' = read binary
for d in reader:
print d # this will print out a dictionary with keys equal to the first row of the file.

Take a look at the built-in CSV module. Though you probably can't just use it, you can take a sneak peek at the code...
If that's a no-no, your (pseudo)code looks perfectly fine, though you should make use of the str.split() function and use that, reading the file line-by-line.

Parse the CSV correctly
I'd avoid using str.split() to parse the fields because str.split() will not recognize quoted values. And many real-world CSV files use quotes.
http://en.wikipedia.org/wiki/Comma-separated_values
Example record using quoted values:
1997,Ford,E350,"Super, luxurious truck"
If you use str.split(), you'll get a record like this with 5 fields:
('1997', 'Ford', 'E350', '"Super', ' luxurious truck"')
But what you really want are records like this with 4 fields:
('1997', 'Ford', 'E350', 'Super, luxurious truck')
Also, besides commas being in the data, you may have to deal with newlines "\r\n" or just "\n" in the data. For example:
1997,Ford,E350,"Super
luxurious truck"
1997,Ford,E250,"Ok? Truck"
So be careful using:
file = open('filename.csv', 'r')
for line in file:
# problem here, "line" may contain partial data
Also, like John mentioned, the CSV standard is, that in quotes, if you get a double-double quote, then it turns into one quote.
1997,Ford,E350,"Super ""luxurious"" truck"
('1997', 'Ford', 'E350', 'Super "luxurious" truck')
So I'd suggest to modify your finite state machine like this:
Parse each character at a time.
Check to see if it's a quote, then set the state to "in quote"
If "in quote", store all the characters in the current field until there's another quote.
If "in quote", and there's another quote, store the quote character in the field data. (not the end, because a blank field shouldn't be `data,"",data` but instead `data,,data`)
If not "in quote", store the characters until you find a comma or newline.
If comma, save field and start a new field.
If newline, save field, save record, start a new record and a new field.
On a side note, interestingly, I've never seen a header commented out using # in a CSV. So to me, that would imply that you may have to look for commented lines in the data too. Using # to comment out a line in a CSV file is not standard.
Adding found fields into a record dictionary using header keys
Depending on memory requirements, if the CSV is small enough (maybe 10k to 100k records), using a dictionary is fine. Just store a list of all the column names so you can access the column name by index (or number). Then in the finite state machine, increment the column index when you find a comma, and reset to 0 when you find a newline.
So if your header is header = ['Column1', 'Column2'] Then when you find a data character, add it like this:
record[header[column_index]] += character

I don't know too much about the builtin csv module that #Kaloyan Todorov talks about, but, if you're reading comma separated lines, then you can easily do this:
for line in file:
columns = line.split(',')
for column in columns:
print column.strip()
This will print all the entries of each line without the leading a tailing whitespaces.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.