I am doing a business case on retrieving stock information. The teacher uses the code below to create DataFrames with stock information.
# Imports needed by the snippet (DataReader now lives in pandas_datareader)
from datetime import datetime
from pandas_datareader.data import DataReader

# The tech stocks we'll use for this analysis
tech_list = ['AAPL', 'GOOG', 'MSFT', 'AMZN']

# Set up End and Start times for the data grab
end = datetime.now()
start = datetime(end.year - 1, end.month, end.day)

# For loop for grabbing Yahoo Finance data and setting it as a DataFrame
for stock in tech_list:
    # Set the DataFrame as the stock ticker
    globals()[stock] = DataReader(stock, 'yahoo', start, end)
He uses globals() to create the four DataFrames, one per tech stock. I read in the question below that you can also use a dictionary to achieve the same goal.
pandas set names of dataframes in loop
MY QUESTION is that I do not understand this line of code in the answer:
frames = {i:dat for i, dat in data.groupby('Sport')}
Can someone explain?
In this case, frames is a dictionary being built with a dictionary comprehension. Iterating over data.groupby('Sport') yields pairs of values, which are named i and dat in the comprehension, and the notation {i: dat for i, dat in ...} builds a new dictionary out of all such pairs, using i as the key and dat as the value. The result is stored in frames.
The general syntax is (for the case where the iterator returns 2 elements):
{key: value for key, value in iterator}
The answers to this question do a good job of explaining what an iterator is in Python. Usually (but not always), when used in a dictionary comprehension, the iterator's __next__() method returns two elements. The first element must be hashable so that it can be used as the dictionary key.
The iterator doesn't necessarily need to return two elements (although that is a common pattern); you can also construct the key-value pairs yourself. This works:
print(dict([(i, chr(65+i)) for i in range(4)]))
{0: 'A', 1: 'B', 2: 'C', 3: 'D'}
and it also shows that dictionary comprehensions are really just special syntax using the same mechanics as list comprehensions and the dict() constructor, which is what the comment by @Barmar is doing:
frames = dict(data.groupby('Sport'))
In this case, data.groupby() does need to yield pairs, and the order within each pair matters, as it is (roughly) shorthand for this:
dict([(key, value) for key, value in data.groupby('Sport')])
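Applied back to the original stock question, here is a minimal sketch of the dictionary-based alternative to globals() (assuming pandas_datareader is installed and the 'yahoo' source is reachable):

from datetime import datetime
from pandas_datareader.data import DataReader

tech_list = ['AAPL', 'GOOG', 'MSFT', 'AMZN']
end = datetime.now()
start = datetime(end.year - 1, end.month, end.day)

# Map each ticker to its DataFrame instead of injecting names into globals()
frames = {stock: DataReader(stock, 'yahoo', start, end) for stock in tech_list}

print(frames['AAPL'].head())  # access one DataFrame by its ticker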
I'm trying to write something that answers "what are the possible values in every column?"
I created a dictionary called all_col_vals and iterate from 1 to however many columns my dataframe has. However, when reading about this online, someone stated this looked too much like Java and the more pythonic way would be to use zip. I can't see how I could use zip here.
all_col_vals = {}
for index in range(RCSRdf.shape[1]):
    all_col_vals[RCSRdf.iloc[:, index].name] = set(RCSRdf.iloc[:, index])
The output looks like 'CFN Network': {nan, 'N521', 'N536', 'N401', 'N612', 'N204'}, 'Exam': {'EXRC', 'MXRN', 'HXRT', 'MXRC'} and shows all the possible values for that specific column. The key is the column name.
I think @piRSquared's comment is the best option, so I'm going to steal it as an answer and add some explanation.
Answer
Assuming you don't have duplicate columns, use the following:
{k: {*df[k]} for k in df}
Explanation
k represents a column name in df. You don't have to use the .columns attribute to access them, because iterating over a pandas.DataFrame works like iterating over a Python dict: it yields the column names.
df[k] is the column k, as a pandas Series.
{*df[k]} unpacks the values from the Series and places them in a set ({}), which by definition keeps only distinct elements.
Lastly, using a dictionary comprehension to create the dict is faster than defining an empty dict and adding new keys to it in a for-loop.
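As a quick illustration on a toy DataFrame (hypothetical column names, just to show the shape of the result):

import pandas as pd

df = pd.DataFrame({'Exam': ['EXRC', 'MXRN', 'EXRC'],
                   'Grade': [1, 2, 1]})

# One set of distinct values per column, keyed by column name
col_vals = {k: {*df[k]} for k in df}
print(col_vals)  # {'Exam': {'EXRC', 'MXRN'}, 'Grade': {1, 2}} (set order may vary)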
I have a very large nested dictionary of the following form:
keyDict = {f: {t: {c_1: None, c_2: None, c_3: None, ..., c_n: None}}}
And another dictionary with keys and values:
valDict = {c_1: 13.37, c_2: -42.00, c_3: 0.00, ... c_n: -0.69}
I want to use the valDict to assign the values to the lowest level of the keyDict as fast as possible.
My current implementation is very slow, I think because I iterate through the two upper levels [f][t] of keyDict. There must be a way to set the values at the lowest level without touching the upper levels, because the value of [c] does not depend on [f][t].
My current SLOW implementation:
for f in keyDict:
    for t in keyDict[f]:
        for c in keyDict[f][t]:
            keyDict[f][t][c] = valDict[c]
Still looking for a solution. [c] only has a few thousand keys, but [f][t] can have millions of combinations, so the way I do it, the same value assignments happen millions of times, even though the value does NOT depend on f or t but ONLY on c.
To clarify the example, per Alexis's request: the c dictionaries don't necessarily all have the same keys, but they DO have the same value for a given key. To keep things simple, let's say there are only three possible keys for a c dict (c_1, c_2, c_3). One parent dictionary (f=1, t=1) may have just {c_2}, another (f=1, t=2) may have {c_2, c_3}, and yet another (f=999, t=999) might have all three {c_1, c_2, c_3}. Some parent dicts may have the same set of c's. What I am trying to do is assign the values to the c dicts, where each value is determined purely by the c key, not by f or t.
If the most nested dicts and valDict share exactly the same keys, it would be faster to use dict.update instead of looping over all the keys of the dict:
for dct in keyDict.values():
    for d in dct.values():
        d.update(valDict)
Also, it is more elegant, and probably faster, to loop over the values of the outer dicts directly instead of iterating over the keys and then accessing each value by its key.
So you have millions of "c" dictionaries that you need to keep synchronized. The dictionaries have different sets of keys (presumably for good reason, but I trust you realize that your update code puts the new values in all the dictionaries), but the non-None values must change in lockstep.
You haven't explained what this data structure is for, but judging from your description, you should have a single c dictionary, not millions of them.
After all, you only have one set of valid "c" values; maintaining multiple copies is not only a performance problem, it puts an incredible burden of consistency on your code. But obviously, updating a single dictionary will be hugely faster than updating millions of them.
Of course you also want to know which keys were contained in each dictionary: To do this, your tree of dictionaries should terminate with sets of keys, which you can use to look up values as necessary.
In case my description is not clear, here is how your structure would be transformed:
all_c = dict()
for f in keyDict:
    for t in keyDict[f]:
        # Collect the non-None values into the combined dictionary
        all_c.update((k, v) for k, v in keyDict[f][t].items() if v is not None)
        # Replace the innermost dict with just the set of its keys
        keyDict[f][t] = set(keyDict[f][t].keys())
This code builds a combined dictionary all_c with the non-None values from each of your bottom-level "c" dictionaries, then replaces each of those dictionaries with the set of its keys. If you later need the complete dictionary at keyDict[f][t] (rather than access to particular values), you can reconstruct it like this:
f_t_cdict = dict((k, all_c[k]) for k in keyDict[f][t])
But I'm pretty sure you can do whatever it is you are doing by working with the sets keyDict[f][t], and simply looking up values directly in the combined dictionary all_c.
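Here is a tiny end-to-end sketch of that restructuring, applied to the question's data (hypothetical values; since the leaves all start as None here, all_c is filled from valDict directly):

# Original structure: leaves are None, and different (f, t) have different c keys
keyDict = {1: {1: {'c_2': None},
               2: {'c_2': None, 'c_3': None}}}
valDict = {'c_1': 13.37, 'c_2': -42.00, 'c_3': 0.00}

# Replace each innermost dict with the set of its keys
for f in keyDict:
    for t in keyDict[f]:
        keyDict[f][t] = set(keyDict[f][t])

# One shared lookup table; each c key is assigned exactly once
all_c = {c: valDict[c] for f in keyDict for t in keyDict[f] for c in keyDict[f][t]}

# Reconstruct the full dict for a particular (f, t) on demand
print(dict((k, all_c[k]) for k in keyDict[1][2]))  # {'c_2': -42.0, 'c_3': 0.0}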
I am trying to figure out the max and min values for an inner value of a dict of dicts.
The dict looks like this:
{'ALLEN PHILLIP K': {'bonus': 4175000,
'exercised_stock_options': 1729541,
'expenses': 13868},
'BADUM JAMES P': {'bonus': 'NaN',
'exercised_stock_options': 257817,
'expenses': 3486},
...
}
I want to figure out the minimum and maximum exercised_stock_options across all dictionaries.
I tried using pandas to do this, but couldn't find a way to shape the data appropriately. Then, I tried a simple for-loop in Python. My code for the for-loop doesn't work, and I can't figure out why (the dict of dicts is called data_dict):
stock_options = []
for person in range(len(data_dict)):
    stock_options.append(data_dict[person]['exercised_stock_options'])
print stock_options
Then I was going to take the max and min values of the list.
Any idea why this code doesn't work? Any alternative methods for figuring out the max and min of an inner value of a dict of dicts?
Here's a method that uses a list comprehension to get the exercised_stock_options from each inner dictionary and then prints out the minimum and maximum values. The sample data is just for illustration; you can modify the code to suit your needs.
d = {'John Smith':{'exercised_stock_options':99},
'Roger Park':{'exercised_stock_options':50},
'Tim Rogers':{'exercised_stock_options':10}}
data = [d[person]['exercised_stock_options'] for person in d]
print min(data), max(data)
You are using range to get an index number into your main dictionary, but dictionaries are not indexed by integer position. What you really should do is iterate over the dictionary's keys: person should be each name, so that when person == 'ALLEN PHILLIP K', data_dict[person] gets the inner dictionary for that key.
Note that the best-practice section "Use items() to iterate across dictionary" below says it is better to loop with for k, v in data_dict.items() than to loop over the dictionary itself. Also note the difference between Python 2 and Python 3.
people = []
stock_options = []
for person, stock_data in data_dict.items():
    # This lets you keep track of the people as well, for future use
    people.append(person)
    stock_options.append(stock_data['exercised_stock_options'])

print stock_options

mymin = min(stock_options)
mymax = max(stock_options)
# process min and max values
Best-practice
Use items() to iterate across dictionary
The updated code below demonstrates the Pythonic style for iterating through a dictionary. When you define two variables in a for loop in conjunction with a call to items() on a dictionary, Python automatically assigns the first variable to a key in the dictionary and the second variable to the corresponding value for that key.
d = {"first_name": "Alfred", "last_name":"Hitchcock"}
for key,val in d.items():
print("{} = {}".format(key, val))
Difference between Python 2 and Python 3
In Python 2.x, the above examples using items would return a list of tuples containing copies of the dictionary's key-value pairs. To avoid copying the whole dictionary's keys and values into a list in memory, you should prefer the iteritems method, which simply returns an iterator instead of a list. In Python 3.x, iteritems is removed and the items method returns a view object. The benefit of these view objects compared to the tuples containing copies is that every change made to the dictionary is reflected in the view.
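A small sketch of that view behavior in Python 3:

d = {'a': 1}
items_view = d.items()   # a dynamic view, not a copied list
d['b'] = 2               # mutate the dict after taking the view
print(list(items_view))  # [('a', 1), ('b', 2)] -- the view reflects the change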
You need to iterate over your dictionary's .values() and pull out the value of "exercised_stock_options" from each. You can use a simple list comprehension to retrieve those values:
>>> values = [value['exercised_stock_options'] for value in d.values()]
>>> values
[257817, 1729541]
>>> min(values)
257817
>>> max(values)
1729541
I released lifter a few weeks ago exactly for this kind of task; I think you may find it useful.
The only problem here is that you have a mapping (a dict of dicts) instead of a regular iterable.
Here is an answer using lifter:
from lifter.models import Model

# We create a model representing our data
Person = Model('Person')

# We convert your data to a regular iterable
iterable = []
for name, data in your_data.items():
    data['name'] = name
    iterable.append(data)

# We load this into lifter
manager = Person.load(iterable)

# We query the data
results = manager.aggregate(
    (Person.exercised_stock_options, min),
    (Person.exercised_stock_options, max),
)
You can of course achieve the same result using list comprehensions; however, it's sometimes handy to use a dedicated library, especially if you want to filter the data with complex queries before fetching your results. For example, you could get the min and max values only for people with less than 10000 in expenses:
# We filter the data
queryset = manager.filter(Person.expenses < 10000)

# We apply our aggregate on the filtered queryset
results = queryset.aggregate(
    (Person.exercised_stock_options, min),
    (Person.exercised_stock_options, max),
)
I'm pretty new to Python (a couple of weeks into it) and I'm having some trouble wrapping my head around data structures. What I've done so far is extract text line by line from a .txt file and store the lines in a dictionary keyed by animal, for example:
database = {
'dog': ['apple', 'dog', '2012-06-12-08-12-59'],
'cat': [
['orange', 'cat', '2012-06-11-18-33-12'],
['blue', 'cat', '2012-06-13-03-23-48']
],
'frog': ['kiwi', 'frog', '2012-06-12-17-12-44'],
'cow': [
['pear', 'ant', '2012-06-12-14-02-30'],
['plum', 'cow', '2012-06-12-23-27-14']
]
}
# year-month-day-hour-min-sec
That way, when I print my dictionary out, it prints by animal type, with the newest dates first.
What's the best way to go about sorting this data by time? I'm on Python 2.7. What I'm thinking is:

for each key:
    grab the list (or list of lists) --> get the 3rd entry --> '-'.split it --> then maybe try sorted(parameters)
I'm just not really sure how to go about this...
Walk through the elements of your dictionary. For each value, run sorted on your list of lists, and tell the sorting algorithm to use the third field of the list as the "key" element. This key element is what is used to compare values to other elements in the list in order to ascertain sort order. To tell sorted which element of your lists to sort with, use operator.itemgetter to specify the third element.
Since your timestamps are rigidly structured and each character in the timestamp is more temporally significant than the next one, you can sort them naturally, like strings - you don't need to convert them to times.
from operator import itemgetter

# Dictionary stored in d. Iterate over its entries; by calling
# items(), k gets the key of an entry and v gets its value.
for k, v in d.items():
    if v and isinstance(v[0], list):
        v.sort(key=itemgetter(2))  # Indexing starts at 0, so the third element is 2
If your dates are all in the format year-month-day-hour-min-sec, like 2012-06-12-23-27-14, I think your split step is not necessary; just compare them as strings.
>>> '2012-06-12-23-27-14' > '2012-06-12-14-02-30'
True
Firstly, you'll probably want each value in the dict to be of a similar type. At the moment some of them (e.g. database['dog']) are a single line (a list of strings) and some (e.g. database['cat']) are a list of lines. If you get them all into list-of-lines format (even if there's only one line in the list), it will be much easier.
Then, one (old) way would be to make a comparison function for those lines. This will be easy since your dates are already in a format that's directly (string) comparable. To compare two lines, you want to compare the 3rd (2nd index) item in them:
def compare_line_by_date(x, y):
    return cmp(x[2], y[2])
Finally you can get the lines for a particular key sorted by telling the sorted builtin to use your compare_line_by_date function:
sorted(database['cat'], compare_line_by_date)
The above is suitable (but slow, and the cmp parameter is gone in Python 3) for arbitrarily complex comparison/sorting functions. There are other ways to do your particular sort, for example by using the key parameter of sorted:
def key_for_line(line):
    return line[2]

sorted(database['cat'], key=key_for_line)
Using keys for sorting is much faster than cmp because the key function only needs to be run once per item in the list, instead of every time a pair of items is compared (which usually happens far more often than the number of items in the list). The idea of a key is to boil each list item down into something that can be compared naturally, like a string or a number. In the example above we boiled each line down to just its date, which is then compared.
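Putting that together for the whole database, a short sketch (assuming every value has first been normalized to the list-of-lines format suggested above):

# Sort each animal's lines newest-first by the timestamp field
for animal, lines in database.items():
    lines.sort(key=lambda line: line[2], reverse=True)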
Disclaimer: I haven't tested any of the code in this answer... but it should work!
Does anyone have experience working with pycassa? I have a question: how do I get all the keys that are stored in the database?
In the small snippet below, we need to supply the keys in order to get the associated columns (here the keys are 'foo' and 'bar'). That is fine, but my requirement is to get all the keys (only the keys) at once, as a Python list or similar data structure.
cf.multiget(['foo', 'bar'])
{'foo': {'column1': 'val2'}, 'bar': {'column1': 'val3', 'column2': 'val4'}}
Thanks.
Try:

list(cf.get_range().get_keys())
more good stuff here: http://github.com/vomjom/pycassa
You can try: cf.get_range(column_count=0, filter_empty=False).
# Since get_range() returns a generator, print only the keys.
for value in cf.get_range(column_count=0, filter_empty=False):
    print value[0]
get_range([start][, finish][, columns][, column_start][, column_finish][, column_reversed][, column_count][, row_count][, include_timestamp][, super_column][, read_consistency_level][, buffer_size])
Get an iterator over rows in a specified key range.
http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.get_range
Minor improvement on Santhosh's solution
dict(cf.get_range(column_count=0, filter_empty=False)).keys()
If you care about order:
OrderedDict(cf.get_range(column_count=0, filter_empty=False)).keys()
get_range returns a generator. We can create a dict from the generator and get the keys from that.
column_count=0 limits the results to just the row key. However, because these results have no columns, we also need filter_empty.
filter_empty=False allows us to get those results back, but empty rows and range ghosts may now be included in our result.
If we don't mind more overhead, getting just the first column will resolve the empty rows and range ghosts.
dict(cf.get_range(column_count=1)).keys()
There's a problem with Santhosh's and kzarns' answers: they bring into memory a potentially huge dict that is immediately discarded. A better approach is to use a list comprehension:
keys = [c[0] for c in cf.get_range(column_count=0, filter_empty=False)]
This iterates over the generator returned by get_range, keeping only the keys and storing them in a list.
If the list of keys were also potentially too large to keep in memory all at once, and you only need to iterate over it once, you should use a generator expression instead of a list comprehension:
kgen = (c[0] for c in cf.get_range(column_count=0, filter_empty=False))
# you can iterate over kgen, but do not treat it as a list, it isn't!