Pythonic way to create a dictionary by iterating

I'm trying to write something that answers "what are the possible values in every column?"
I created a dictionary called all_col_vals and iterate over however many columns my dataframe has. However, when I read about this online, someone said this looked too much like Java and that the more Pythonic way would be to use zip. I can't see how I could use zip here.
all_col_vals = {}
for index in range(RCSRdf.shape[1]):
    all_col_vals[RCSRdf.iloc[:, index].name] = set(RCSRdf.iloc[:, index])
The output looks like 'CFN Network': {nan, 'N521', 'N536', 'N401', 'N612', 'N204'}, 'Exam': {'EXRC', 'MXRN', 'HXRT', 'MXRC'} and shows all the possible values for that specific column. The key is the column name.

I think @piRSquared's comment is the best option, so I'm going to steal it as an answer and add some explanation.
Answer
Assuming you don't have duplicate columns, use the following:
{k : {*df[k]} for k in df}
Explanation
k represents a column name in df. You don't have to go through the .columns attribute to access the names, because iterating over a pandas.DataFrame yields its column labels, much like iterating over a python dict yields its keys.
df[k] is the column k as a pandas.Series.
{*df[k]} unpacks the values from the series and places them in a set ({}), which only keeps distinct elements by definition (see the definition of a set).
Lastly, using a dict comprehension to create the dict is faster than defining an empty dict and adding new keys to it via a for-loop.
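A minimal runnable sketch, with made-up data standing in for the question's RCSRdf:
import numpy as np
import pandas as pd

df = pd.DataFrame({'CFN Network': ['N521', 'N536', np.nan, 'N521'],
                   'Exam': ['EXRC', 'MXRN', 'EXRC', 'HXRT']})

# One set of distinct values per column, keyed by column name.
all_col_vals = {k: {*df[k]} for k in df}
print(all_col_vals)
# e.g. {'CFN Network': {nan, 'N521', 'N536'}, 'Exam': {'EXRC', 'HXRT', 'MXRN'}}
# (set ordering varies between runs)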

Related

How to split a dataframe and select all possible pairs?

I have a dataframe that I want to separate in order to apply a certain function.
I have the fields df['beam'], df['track'], df['cycle'] and want to separate it by the unique values of each of these three. Then, I want to apply this function (it works between two individual dataframes) to each pair where df['track'] differs between the two. Also, the result doesn't change if you switch the order of the pair, so I'd like to avoid unnecessary calls to the function where possible.
I currently work through it with four nested for loops feeding an if conditional, but I'm absolutely sure there's a better, cleaner way.
I'd appreciate all help!
Edit: I ended up solving it like this:
I split the original dataframe into multiple by using df.groupby()
dfsplit = df.groupby(['beam','track','cycle'])
This generates a dict-like GroupBy object whose keys are all the unique ['beam','track','cycle'] combinations as tuples
I combined all possible ['beam','track','cycle'] pairs with the use of itertools.combinations()
keys = list(itertools.combinations(dfsplit.keys(), 2))
This generates a list of 2-element tuples where each element is one ['beam','track','cycle'] tuple itself, and it doesn't include the tuple with the order swapped, so I avoid calling the function twice for what would be the same case.
I removed the combinations where 'track' was the same through a for loop
for k in keys.copy():
    if k[0][1] == k[1][1]:
        keys.remove(k)
Now I can call my function by looping through the list of combinations
for k in keys:
    function(dfsplit[k[0]], dfsplit[k[1]])
Step 3 is taking a long time, probably because I have a very large number of unique ['beam','track','cycle'] combinations so the list is very long, but also probably because I'm doing it sub-optimally. I'll keep the question open in case someone realizes a better way to do this last step.
EDIT 2:
Solved the problem with step 3, once again with itertools, just by doing
keys = list(itertools.filterfalse(lambda k: k[0][1] == k[1][1], keys))
itertools.filterfalse yields the elements of the iterable for which the given function returns false, so it does the same as the previous for loop, but by selecting the false cases instead of removing the true ones. It's very fast and I believe this solves my problem for good.
I don't know how to mark the question as solved so I'll just repeat the solution here:
dfsplit = df.groupby(['beam','track','cycle'])
keys = list(itertools.combinations(dfsplit.keys(), 2))
keys = list(itertools.filterfalse(lambda k: k[0][1] == k[1][1], keys))
for k in keys:
    function(dfsplit[k[0]], dfsplit[k[1]])
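For reference, here is the whole pipeline as a self-contained sketch, with made-up example data and a placeholder process_pair standing in for my function; get_group is the documented way to pull a single group out of a GroupBy object:
import itertools
import pandas as pd

# Made-up data with the 'beam', 'track', 'cycle' fields from the question.
df = pd.DataFrame({
    'beam':  [1, 1, 2, 2],
    'track': ['a', 'b', 'a', 'b'],
    'cycle': [1, 1, 1, 1],
    'value': [0.1, 0.2, 0.3, 0.4],
})

def process_pair(df1, df2):
    # Placeholder for the actual pairwise function.
    print(len(df1), len(df2))

grouped = df.groupby(['beam', 'track', 'cycle'])
# grouped.groups maps each (beam, track, cycle) tuple to its row labels,
# so iterating over it yields the group keys.
pairs = (
    (k1, k2)
    for k1, k2 in itertools.combinations(grouped.groups, 2)
    if k1[1] != k2[1]  # drop pairs that share the same 'track'
)
for k1, k2 in pairs:
    process_pair(grouped.get_group(k1), grouped.get_group(k2))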

Using Dictionaries instead of globals()

I am doing a business case on retrieving stock information. The teacher uses the code below to create DataFrames with stock information.
# Imports assumed by this snippet (in current versions DataReader
# lives in the pandas-datareader package)
from datetime import datetime
from pandas_datareader.data import DataReader

# The tech stocks we'll use for this analysis
tech_list = ['AAPL', 'GOOG', 'MSFT', 'AMZN']

# Set up End and Start times for the data grab
end = datetime.now()
start = datetime(end.year - 1, end.month, end.day)

# For loop for grabbing yahoo finance data and setting it as a dataframe
for stock in tech_list:
    # Set DataFrame as the Stock Ticker
    globals()[stock] = DataReader(stock, 'yahoo', start, end)
He uses globals() to create the 4 dataframes for the tech stocks. I read in the question below that you can also use a dictionary to achieve the same goal.
pandas set names of dataframes in loop
My question is that I do not understand this line of code in the answer:
frames = {i:dat for i, dat in data.groupby('Sport')}
Can someone explain?
In this case, frames is a dictionary that is being built using a dictionary comprehension. Iterating over data.groupby('Sport') yields pairs of values, which are called i and dat in the comprehension, and the notation {i: dat for i, dat in ...} builds a new dictionary out of all such pairs, using i as the key and dat as the value. The result is stored in frames.
The general syntax is (for the case where the iterator returns 2 elements):
{key: value for key, value in iterator}
The answers to this question do a good job explaining what an iterator is in python. Usually (but not always), when used in a dictionary comprehension, the iterator yields two elements at a time. The element used as the dictionary key must be hashable.
The iterator doesn't necessarily need to yield pairs itself (although that is a common use pattern); you can also construct the (key, value) pairs in the comprehension. This works:
print(dict([(i, chr(65+i)) for i in range(4)]))
{0: 'A', 1: 'B', 2: 'C', 3: 'D'}
and it also shows that dictionary comprehensions are really just special syntax using the same mechanics as list comprehensions and the dict() constructor, which is what the comment by @Barmar is doing:
frames = dict(data.groupby('Sport'))
In this case, each item yielded by data.groupby() does need to be a pair, and the order does matter, as it is shorthand for (roughly) this:
dict([(key, value) for key, value in data.groupby('Sport')])
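To make the comprehension concrete, here is a minimal sketch with made-up data; each group becomes a value in frames, reachable by its key rather than through a dynamically created variable name as with globals():
import pandas as pd

# Made-up data standing in for the linked question's DataFrame.
data = pd.DataFrame({
    'Sport': ['Tennis', 'Golf', 'Tennis', 'Golf'],
    'Score': [3, 5, 4, 2],
})

# One DataFrame per sport, keyed by the sport name.
frames = {i: dat for i, dat in data.groupby('Sport')}
print(frames['Tennis'])  # the rows where Sport == 'Tennis'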

Python: How can I use an enumerate element as a string?

I have a list of dict1.keys() I'm enumerating over and I'd like to use the element as a string.
for i,j in enumerate(dict1.keys()): str(j) = somethingElse
>>> SyntaxError: can't assign to function call
https://dbader.org/blog/python-enumerate describes the enumerate entities as a tuple of: (index, element). The type(j) is <class 'str'>, which I can print, but not use as a variable.
EDIT:
for i,j in enumerate(dict1.keys()): j = somethingElse
EDIT2:
I think the problem may be with pandas. The first line works, but the second doesn't.
for i,j in enumerate(dict1.keys()): list1.append(j)
for i,k in enumerate(list1): k = pd.DataFrame(dict1[k]['Values'])
EDIT3:
That second line does work, but it only ends up with one df, named 'k' instead of the key. Here's what I'm trying to do: convert each dict to a df named after its key:
for i,j in enumerate(dict1.keys()): j = pd.DataFrame(dict1[j]['Values'])
EDIT4:
According to the comments below, I switched to a for loop over the keys (which don't need to be explicitly called), but it still won't use the element 'i' as a variable name. However, from the question linked below, elements can be used as keys in a dict. After reducing the question to "use list item as name for dataframe" and searching for that, it works. I'll post it as an answer also:
dict2 = {}
for i in dict1:
    dict2[i] = pd.DataFrame(dict1[i]['Values'])
...thus the names are preserved. Actually, this is similar to Sheri's answer with lists, but here the names retain their association with the dfs. There may not be a way to set a variable name using something other than a plain string, but I'll start a different question for that.
use elements in a list for dataframe names
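For concreteness, a minimal sketch of the loop above with a made-up dict1 (each key maps to a dict holding a 'Values' entry, as in my data):
import pandas as pd

# Made-up input data.
dict1 = {
    'alpha': {'Values': [1, 2, 3]},
    'beta':  {'Values': [4, 5, 6]},
}

dict2 = {}
for i in dict1:
    dict2[i] = pd.DataFrame(dict1[i]['Values'])

print(dict2['alpha'])  # the DataFrame built from dict1['alpha']['Values']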
Because you are generating your pandas DataFrames dynamically inside a for loop, printing j at the end will only show you the last generated DataFrame. You should store your DataFrames in a list. Try using this:
listOfFrame = []
for j in dict.keys():
    j = pd.DataFrame(dict[j]['Values'])
    listOfFrame.append(j)
Indeed j will be a str (or whatever other type of key you are using in the dict).
The actual problem is with the loop body, as the error message states:
str(j) = somethingElse
is not valid Python. The left hand side is a call to the str function, so you cannot assign a value to it.
Based on the comments you want neither enumerate nor to iterate over the dict keys. Instead, you want to iterate over its values:
dfs = []
for val in dict1.values():
    dfs.append(pd.DataFrame(val['Values']))
However, this would normally be written without an explicit loop in Python, for instance by using a list comprehension:
dfs = [pd.DataFrame(val['Values']) for val in dict1.values()]

Most efficient way of List/Dict Lookups in Python

I have a list of dictionaries, which looks something like:
abc = [{"name":"bob",
"age": 33},
{"name":"fred",
"age": 18},
{"name":"mary",
"age": 64}]
Let's say I want to look up Bob's age. I know I can run a for loop through the list, etc. However, my question is: are there any quicker ways of doing this?
One thought is to use a loop but break out of it once the lookup (in this case the age for Bob) has been completed.
The reason for this question is that my datasets are thousands of lines long, so I'm looking for any performance gains I can get.
Edit: I can see you can use the following via a generator, however I'm not too sure whether this would still iterate over all items of the list or just iterate until the first dict containing the name bob is found?
next(item for item in abc if item["name"] == "bob")
Thanks,
Depending on how many times you want to perform this operation, it might be worth defining a dictionary mapping names to the corresponding age (or to a list of corresponding ages, if several people can share the same name).
A dictionary comprehension can help you:
abc_dict = {x["name"]:x["age"] for x in abc}
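For example, with the list from the question, the lookup becomes a constant-time dict access instead of a linear scan:
abc = [{"name": "bob", "age": 33},
       {"name": "fred", "age": 18},
       {"name": "mary", "age": 64}]

abc_dict = {x["name"]: x["age"] for x in abc}
print(abc_dict["bob"])  # 33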
I'd consider making another dictionary and then using that for multiple age lookups:
age_by_name = {}
for person in abc:
    age_by_name[person['name']] = person['age']

age_by_name['bob']  # this is a quick lookup!
Edit: This is equivalent to the dict comprehension listed in Josay's answer
Try indexing it first (once), and then use the index (many times).
You can index it, e.g., by using a dict (the keys would be what you are searching by, while the values would be what you are searching for), or by putting the data in a database. That should cover the case where you really have a lot more lookups and rarely need to modify the data.
Define a dictionary of dictionaries, like this:
peoples = {"bob":  {"name": "bob",  "age": 33},
           "fred": {"name": "fred", "age": 18},
           "mary": {"name": "mary", "age": 64}}
person = peoples["bob"]
persons_age = person["age"]
Look up "bob", then look up "age".
This is correct, no?
You might write a helper function. Here's a take.
import itertools

# first returns the first element encountered in an iterable which
# matches the predicate.
#
# If the element is never found, StopIteration is raised.
#
# Args:
#     pred: the predicate which determines a matching element.
first = lambda pred, seq: next(itertools.dropwhile(lambda x: not pred(x), seq))
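A quick usage sketch of the helper above, with the list from the question. Since dropwhile and next are both lazy, the scan stops at the first match, which also answers the edit about whether the generator runs through the whole list:
abc = [{"name": "bob", "age": 33},
       {"name": "fred", "age": 18},
       {"name": "mary", "age": 64}]

bob = first(lambda item: item["name"] == "bob", abc)
print(bob["age"])  # 33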

How do I get all the keys that are stored in the Cassandra column family with pycassa?

Does anyone have experience working with pycassa? I have a question about it: how do I get all the keys that are stored in the database?
In the small snippet below we need to supply the keys in order to get the associated columns (here the keys are 'foo' and 'bar'). That is fine, but my requirement is to get all the keys (and only the keys) at once, as a Python list or similar data structure.
cf.multiget(['foo', 'bar'])
{'foo': {'column1': 'val2'}, 'bar': {'column1': 'val3', 'column2': 'val4'}}
Thanks.
try:
list(cf.get_range().get_keys())
more good stuff here: http://github.com/vomjom/pycassa
You can try: cf.get_range(column_count=0, filter_empty=False).
# Since get_range() returns a generator, print only the keys.
for value in cf.get_range(column_count=0, filter_empty=False):
    print value[0]
get_range([start][, finish][, columns][, column_start][, column_finish][, column_reversed][, column_count][, row_count][, include_timestamp][, super_column][, read_consistency_level][, buffer_size])
Get an iterator over rows in a specified key range.
http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.get_range
Minor improvement on Santhosh's solution
dict(cf.get_range(column_count=0, filter_empty=False)).keys()
If you care about order:
from collections import OrderedDict
OrderedDict(cf.get_range(column_count=0, filter_empty=False)).keys()
get_range returns a generator. We can create a dict from the generator and get the keys from that.
column_count=0 limits results to the row_key. However, because these results have no columns we also need filter_empty.
filter_empty=False will allow us to get the results. However empty rows and range ghosts may be included in our result now.
If we don't mind more overhead, getting just the first column will resolve the empty rows and range ghosts.
dict(cf.get_range(column_count=1)).keys()
There's a problem with Santhosh's and kzarns' answers: you're bringing into memory a potentially huge dict that you immediately discard. A better approach is to use a list comprehension for this:
keys = [c[0] for c in cf.get_range(column_count=0, filter_empty=False)]
This iterates over the generator returned by get_range, keeps only the keys in memory, and stores them in a list.
If the list of keys were also potentially too large to keep in memory all at once, and you only need to iterate over it once, you should use a generator expression instead of a list comprehension:
kgen = (c[0] for c in cf.get_range(column_count=0, filter_empty=False))
# you can iterate over kgen, but do not treat it as a list, it isn't!
