So I am new to Python and I'm trying to do a small project in a pure Pythonic way, without using any additional libraries. My data set looks like this:
LOC,DATE,ATTRIBUTE,COUNT
A,03/01/19,alpha,6483
A,03/01/19,beta,19
B,03/01/19,gamma,346158
B,02/01/19,gamma,156891
A,02/01/19,delta,1319
A,02/01/19,gamma,15272
A,02/01/19,gamma,56810
I have to transform this data set to this output:
B,02/01/19,gamma, 346158
A,02/01/19,alpha,6483
A,02/01/19,beta,19
B,02/01/19,gamma, 172163
A,02/01/19,delta,1319
B,01/01/19,gamma,56810
The data needs to be sorted by Date, Value, Measure, Loc
I thought that nested dictionaries should work, because I only have to update the value of each attribute, and LOC can become the outer key:
data = {"A": {}, "B": {}}
Then date can be used as the key for the nested dictionary:
data = {"A": {"03/01/19": {}, "02/01/19": {}}, "B": {"03/01/19": {}, "02/01/19": {}}}
And so on until I reach COUNT, updating the count every time. But the code is getting more complex at every step. My questions:
Is there any other alternative data structure that I could use?
If I stick with a dictionary, is there a way to check the nested keys and only add new values for each key?
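To show the direction I'm heading in, here is a rough, simplified sketch (not my exact code; the file name is just a placeholder):

import csv

data = {}  # data[loc][date][attribute] -> summed COUNT
with open("data.csv", newline="") as f:  # placeholder file name
    reader = csv.DictReader(f)
    for row in reader:
        loc, date, attr = row["LOC"], row["DATE"], row["ATTRIBUTE"]
        count = int(row["COUNT"])
        if loc not in data:          # one existence check per nesting level...
            data[loc] = {}
        if date not in data[loc]:    # ...and another for every deeper level
            data[loc][date] = {}
        data[loc][date][attr] = data[loc][date].get(attr, 0) + count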
Any help would be greatly appreciated!
I have a nested dictionary, where I have tickers to identify certain assets, and for each of these assets I would like to store characteristics in a subdictionary, creating them in a simple loop like this:
ticker = ["a","bb","ccc"]
ticker_dict = dict.fromkeys(ticker, {"Var":[]})
for key in ticker_dict:
    ticker_dict[key]["Var"] = len(key)
From the above code I would expect that, for each ticker/asset, it saves the "Var" variable as the length of its name, meaning the following:
{"a":{"Var":1},
"bb":{"Var":2},
"ccc":{"Var":3}}
But, rather weirdly in my view, the result is this:
{"a":{"Var":3},
"bb":{"Var":3},
"ccc":{"Var":3}}
To provide further context, the real process is that I have four assets, for which I would like to store dataframes in their subdictionaries, as this makes it easy for me to access them later in loops etc. Somehow, though, the data from the last asset is simply copied over all assets, even though I explicitly loop through different keys.
What's going on?
PS: I'm not sure how to explain the problem without the sample code, so I might have missed a similar entry on this site. If so, any hints to it would be appreciated as well of course.
In your code, {"Var":[]} is only evaluated once, causing there to be only 1 inner dictionary shared by all keys. Instead, you can use a dictionary comprehension:
ticker_dict = {key:{"Var":[]} for key in ticker}
and it will work as expected.
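To see the difference, you can check object identity (a small demonstration reusing the ticker list from the question):

ticker = ["a", "bb", "ccc"]

# dict.fromkeys reuses the same {"Var": []} object for every key
shared = dict.fromkeys(ticker, {"Var": []})
print(shared["a"] is shared["ccc"])  # True -> one shared inner dict

# the comprehension builds a fresh inner dict per key
ticker_dict = {key: {"Var": []} for key in ticker}
for key in ticker_dict:
    ticker_dict[key]["Var"] = len(key)
print(ticker_dict)  # {'a': {'Var': 1}, 'bb': {'Var': 2}, 'ccc': {'Var': 3}}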
I have been working on a problem which involves sorting a large data set of shop orders and extracting shop and user information based on some parameters. Mostly this has involved creating dictionaries by iterating through the data set with a for loop and appending to a list, like this:
sshop = defaultdict(list)
for i in range(df_subset.shape[0]):
    orderid, sid, userid, time = df.iloc[i]
    sshop[sid].append(userid)
sData = dict(sshop)
#CREATES DICTIONARY OF UNIQUE SHOPS WITH USER DATA AS THE VALUE
shops = df_subset['shopid'].unique()
shops_dict = defaultdict(list)
for shop in shops:
    shops_dict[shop].append(sData[shop])
shops_dict = dict(shops_dict)
shops_dict looks like this at this point:
{10009: [[196962305]], 10051: [[2854032, 48600461]], 10061: [[168750452, 194819216, 130633421,
62464559]]}
To get to the final stages I have had to repeat lines of code similar to these a couple of times. What seems to happen every time I do this is that the VALUES in the dictionaries gain an extra set of square brackets.
This is one of my final dictionaries:
{10159: [[[1577562540.0, 1577736960.0, 1577737080.0]], [[1577651880.0, 1577652000.0, 1577652960.0]]],
10208: [[[1577651040.0, 1577651580.0, 1577797080.0]]]}
I don't entirely understand why this is happening, aside from believing it has something to do with using defaultdict(list) and then converting that into a dictionary with dict().
These extra brackets, aside from being a little confusing, appear to be causing some problems when accessing the data with certain functions. I understand that there need to be two sets of square brackets in total: one set that encloses all the values for a dictionary key, and another inside that for each specific set of values within that key.
My first question would be, is it possible to remove a specific set of square brackets from a dictionary like that?
My second question would be: if not, is there a better way of creating new dictionaries out of the data from an older one without using defaultdict(list) and ending up with all those extra square brackets?
Any help much appreciated!
Thanks :)!
In the second loop, use extend instead of append.
for shop in shops:
    shops_dict[shop].extend(sData[shop])
shops_dict = dict(shops_dict)
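The difference is that append adds its argument as a single element (one extra level of brackets each time), while extend adds the argument's items individually. A small demonstration using the values from your sData/shops_dict example:

from collections import defaultdict

sData = {10051: [2854032, 48600461]}  # values are already lists

appended = defaultdict(list)
appended[10051].append(sData[10051])  # the whole list goes in as one element
print(dict(appended))                 # {10051: [[2854032, 48600461]]}

extended = defaultdict(list)
extended[10051].extend(sData[10051])  # the list's items go in individually
print(dict(extended))                 # {10051: [2854032, 48600461]}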
This looks like CS 101-style homework, but it actually isn't. I am trying to learn more Python, so I took up this personal project to write a small app that keeps my grade book for me.
I have a class semester which holds a dictionary of section objects.
A section is a class that I am teaching in whichever semester object I am manipulating (I didn't want to call them classes for obvious reasons). I originally had sections as a list, not a dictionary, and when I wanted to add a roster of students to that semester I could do this:
for sec in working_semester.sections:
    sec.addRosterFromFile(filename)
Now I have changed the sections member of semester to a dictionary so I can look up a specific one to work with, but I am having trouble when I want to loop over all of them. For example, when I first set up a new semester I want to add all the sections, then loop over them and add students to each. If I try the same code to loop over the dictionary, it gives me the key, but I was hoping to get the value.
I have also tried to iterate over the dictionary like this, which I got from an older Stack Overflow question:
for sec in iter(sorted(working_semester.sections.iteritems())):
    sec.addRosterFromFile(filename)
But iter(sorted(...)) returns (key, value) tuples, not the items, so the line inside the loop gives me an error that a tuple does not have a function called addStudent.
Currently I have this fix in place where I loop through the keys and then use the key to access the value like this:
for key in working_semester.sections:
    working_semester.sections[key].addRosterFromFile(filename)
There has to be a way to loop over dictionary values, or is this not desirable? My understanding of dictionaries is that they are like lists, but rather than grabbing an element by its position, you grab it by a specific key, which makes it easier to get the one you want no matter what order they are in. Am I missing how dictionaries should be used?
Using iteritems is a good approach; you just need to unpack the key and value:
for key, value in iter(sorted(working_semester.sections.iteritems())):
    value.addRosterFromFile(filename)
If you really only need the value, you could use the aptly named itervalues:
for sec in sorted(working_semester.sections.itervalues()):
    sec.addRosterFromFile(filename)
(It's not clear from your example whether you really need sorted there. If you don't need to iterate over the sections in sorted order just leave sorted out.)
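For what it's worth, iteritems and itervalues are Python 2 methods; in Python 3 the same loops are written with .items() and .values(). A tiny self-contained sketch with placeholder data:

# Python 3 equivalents of the loops above; sections is just stand-in data
sections = {"CS101-02": "section object 2", "CS101-01": "section object 1"}

for key, sec in sorted(sections.items()):  # (key, value) pairs, sorted by key
    print(key, sec)

for sec in sections.values():              # values only, in insertion order
    print(sec)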
If I want to assign only one value to each element, I always use a dictionary. For example:
{'Monday': 1, 'Tuesday': 2, ..., 'Friday': 5, ...}
But I want to assign many values to one element, for example:
Monday: Jogging, Swimming, Skating
Tuesday: School, Work, Dinner, Cinema
...
Friday: Doctor
Is there any built-in structure or a simple way to do something like this in Python?
My idea: I was thinking about something like a dictionary which holds a day as the key and a list as the value, but maybe there is a better solution.
A dictionary whose values are lists is perfectly fine, and in fact very common.
In fact, you might want to consider an extension to that: a collections.defaultdict(list). This will create a new empty list the first time you access any key, so you can write code like this:
d[day].append(activity)
… instead of this:
if day not in d:
    d[day] = []
d[day].append(activity)
The downside of a defaultdict is that you no longer have a way to detect that a key is missing in your lookup code, because it will automatically create a new entry. If that matters, use a regular dict together with the setdefault method:
d.setdefault(day, []).append(activity)
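Putting both variants together with the example days from the question (a small runnable sketch; the activity data is just made up from your example):

from collections import defaultdict

activities = [("Monday", "Jogging"), ("Monday", "Swimming"),
              ("Monday", "Skating"), ("Friday", "Doctor")]

# defaultdict variant: a missing key gets a fresh empty list automatically
d = defaultdict(list)
for day, activity in activities:
    d[day].append(activity)

# plain-dict variant: setdefault creates the list only when needed
d2 = {}
for day, activity in activities:
    d2.setdefault(day, []).append(activity)

print(dict(d))  # {'Monday': ['Jogging', 'Swimming', 'Skating'], 'Friday': ['Doctor']}
print(d2)       # same result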
You could wrap either of these solutions up in a "MultiDict" class that encapsulates the fact that it's a dictionary of lists, but the dictionary-of-lists idea is such a common idiom that it really isn't necessary to hide it.
I have a dict that has unix epoch timestamps for keys, like so:
lookup_dict = {
    1357899: {},  # some dict of data
    1357910: {},  # some other dict of data
}
Except, you know, millions and millions and millions of entries. I'd like to subset this dict, over and over again. Ideally, I'd love to be able to write something like I can in R, like:
lookup_value = 1357900
dict_subset = lookup_dict[key >= lookup_value]
# dict_subset now contains {1357910: {}}
But I confess, I can't find any actual proof that this is something Python can do without having, one way or the other, to iterate over every row. If I understand Python correctly (and I might not), key lookup of the form key in dict uses binary search, and is thus very fast; is there any way to do a binary search on dict keys?
To do this without iterating, you're going to need the keys in sorted order. Then you just need to do a binary search for the first one >= lookup_value, instead of checking each one for >= lookup_value.
If you're willing to use a third-party library, there are plenty out there. The first two that spring to mind are bintrees (which uses a red-black tree, like C++, Java, etc.) and blist (which uses a B+Tree). For example, with bintrees, it's as simple as this:
dict_subset = lookup_dict[lookup_value:]
And this will be as efficient as you'd hope—basically, it adds a single O(log N) search on top of whatever the cost of using that subset. (Of course usually what you want to do with that subset is iterate the whole thing, which ends up being O(N) anyway… but maybe you're doing something different, or maybe the subset is only 10 keys out of 1000000.)
Of course there is a tradeoff. Random access to a tree-based mapping is O(log N) instead of "usually O(1)". Also, your keys obviously need to be fully ordered, instead of hashable (and that's a lot harder to detect automatically and raise nice error messages on).
If you want to build this yourself, you can. You don't even necessarily need a tree; just a sorted list of keys alongside a dict. You can maintain the list with the bisect module in the stdlib, as JonClements suggested. You may want to wrap up bisect to make a sorted list object—or, better, get one of the recipes on ActiveState or PyPI to do it for you. You can then wrap the sorted list and the dict together into a single object, so you don't accidentally update one without updating the other. And then you can extend the interface to be as nice as bintrees, if you want.
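Here is a minimal sketch of that do-it-yourself version, using bisect and some made-up data shaped like the lookup_dict from the question (the sorted key list has to be kept in sync with the dict whenever it changes):

from bisect import bisect_left

lookup_dict = {1357899: {"x": 1}, 1357910: {"y": 2}, 1357950: {"z": 3}}  # made-up values
sorted_keys = sorted(lookup_dict)  # maintain this alongside the dict

lookup_value = 1357900
start = bisect_left(sorted_keys, lookup_value)  # O(log N) binary search
dict_subset = {k: lookup_dict[k] for k in sorted_keys[start:]}
print(dict_subset)  # {1357910: {'y': 2}, 1357950: {'z': 3}}

Building the subset dict is of course still proportional to the number of matching keys; the binary search only avoids scanning the keys that fall below lookup_value.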
The following code will work:
some_time_to_filter_for = ...  # some Unix timestamp
# Create a new sub-dictionary
sub_dict = {key: val for key, val in lookup_dict.items()
            if key >= some_time_to_filter_for}
Basically, we just iterate through all the keys in your dictionary and, given a time to filter for, take every key that is greater than or equal to that value and place it (with its value) into the new dictionary. Note that this does iterate over every entry, so it is O(N) per subset.