Creating a Pandas column from a nested dictionary - python

I have a nested dictionary called datastore whose keys are m, n, o, and finally 'target_a', 'target_b', or 'target_c' (these hold the values). Additionally, I have a pandas DataFrame df, which contains a number of columns. Three of these columns, 'r', 's', and 't', contain values that can be used as keys to look up the values in the dictionary.
With the code below I have attempted to do this using a lambda function; however, it calls the function three times, which seems pretty inefficient. Is there a better way of doing this? Any help would be much appreciated.
def find_targets(m, n, o):
    if m == 0:
        return [1.5, 1.5, 1.5]
    else:
        a = datastore[m][n][o]['target_a']
        b = datastore[m][n][o]['target_b']
        c = datastore[m][n][o]['target_c']
        return [a, b, c]

df['a'] = df.apply(lambda x: find_targets(x['r'], x['s'], x['t'])[0], axis=1)
df['b'] = df.apply(lambda x: find_targets(x['r'], x['s'], x['t'])[1], axis=1)
df['c'] = df.apply(lambda x: find_targets(x['r'], x['s'], x['t'])[2], axis=1)

You can have your apply return a pd.Series, and then do the assignment in one pass using df.merge.
Here's an example that modifies your function to return a pd.Series, but other solutions work as well, e.g. keeping your lookup function exactly as you defined it and converting its result to a Series inside the lambda expression.
def find_targets(m, n, o):
    if m == 0:
        return pd.Series({'a': 1.5, 'b': 1.5, 'c': 1.5})
    else:
        a = datastore[m][n][o]['target_a']
        b = datastore[m][n][o]['target_b']
        c = datastore[m][n][o]['target_c']
        return pd.Series({'a': a, 'b': b, 'c': c})

df.merge(df.apply(lambda x: find_targets(x['r'], x['s'], x['t']), axis=1), left_index=True, right_index=True)
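For reference, the same one-pass idea can also assign the three columns directly, without the merge. A runnable sketch, using a made-up two-row datastore (the keys and values here are hypothetical, just to make the example self-contained):

```python
import pandas as pd

# Hypothetical nested datastore matching the question's key layout.
datastore = {1: {2: {3: {'target_a': 10, 'target_b': 20, 'target_c': 30}}}}

def find_targets(m, n, o):
    if m == 0:
        return pd.Series({'a': 1.5, 'b': 1.5, 'c': 1.5})
    entry = datastore[m][n][o]
    return pd.Series({'a': entry['target_a'],
                      'b': entry['target_b'],
                      'c': entry['target_c']})

df = pd.DataFrame({'r': [1, 0], 's': [2, 0], 't': [3, 0]})
# One apply pass; the returned Series become three new columns at once.
df[['a', 'b', 'c']] = df.apply(lambda x: find_targets(x['r'], x['s'], x['t']), axis=1)
```

This avoids the duplicate column suffixes that merge can produce if 'a', 'b', or 'c' already exist in df.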

If you make your find_targets return a dictionary and convert it to a pandas.Series in your lambda, apply will create the columns for you and return a DataFrame with the columns you want.
def find_targets(m, n, o):
    if m == 0:
        return {'a': 1.5, 'b': 1.5, 'c': 1.5}
    else:
        targets = {}
        targets['a'] = datastore[m][n][o]['target_a']
        targets['b'] = datastore[m][n][o]['target_b']
        targets['c'] = datastore[m][n][o]['target_c']
        return targets

abc_df = df.apply(lambda x: pd.Series(find_targets(x['r'], x['s'], x['t'])), axis=1)
df = pd.concat((df, abc_df), axis=1)
If you can't change the find_targets function you could still zip it with the keys you need:
abc_dict = dict(zip('abc', old_find_targets(...)))
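The zip trick slots into the same apply pattern. A small sketch, where the body of old_find_targets is a stand-in for the real list-returning function:

```python
import pandas as pd

def old_find_targets(m, n, o):
    # Stand-in body; the real function returns [a, b, c] from the datastore.
    return [m, n, o]

df = pd.DataFrame({'r': [1], 's': [2], 't': [3]})
# zip pairs the column names 'a', 'b', 'c' with the list the function returns.
abc_df = df.apply(
    lambda x: pd.Series(dict(zip('abc', old_find_targets(x['r'], x['s'], x['t'])))),
    axis=1)
```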


Pandas / Python: Groupby.apply() with function dictionary

I'm trying to implement something like this:
def RR(x):
    x['A'] = x['A'] + 1
    return x

def Locked(x):
    x['A'] = x['A'] + 2
    return x

func_mapper = {"RR": RR, "Locked": Locked}
df = pd.DataFrame({'A': [1, 1], 'LookupVal': ['RR', 'Locked'], 'ID': [1, 2]})
df = df.groupby("ID").apply(lambda x: func_mapper[x.LookupVal.first()](x))
Output for column A would be 2, 6
where x.LookupVal is a column of strings (it will have the same value within each groupby("ID")) that I want to pass as the key to the dictionary lookup.
Any suggestions how to implement this??
Thanks!
Series.first is not what you think it is: it is meant for time-series data and requires an offset parameter. I think you are confusing it with GroupBy.first.
You can use iloc[0] to get the first value:
df.groupby("ID").apply(lambda x: func_mapper[x.LookupVal.iloc[0]](x))
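Put together, the fix runs like this; a sketch with small illustrative numbers (A starts at 1 and 4 here, so RR yields 2 and Locked yields 6):

```python
import pandas as pd

def RR(x):
    x['A'] = x['A'] + 1
    return x

def Locked(x):
    x['A'] = x['A'] + 2
    return x

func_mapper = {"RR": RR, "Locked": Locked}
df = pd.DataFrame({'A': [1, 4], 'LookupVal': ['RR', 'Locked'], 'ID': [1, 2]})

# iloc[0] reads the group's (constant) LookupVal to pick the function.
out = df.groupby("ID", group_keys=False).apply(
    lambda x: func_mapper[x.LookupVal.iloc[0]](x))
```

group_keys=False keeps the original index instead of prepending the group label.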

How to check if any of elements in a dictionary value is in string?

I have a dataframe with strings and a dictionary which values are lists of strings.
I need to check if each string of the dataframe contains any element of every value in the dictionary. And if it does, I need to label it with the appropriate key from the dictionary. All I need to do is to categorize all the strings in the dataframe with keys from the dictionary.
For example.
df = pd.DataFrame({'a':['x1','x2','x3','x4']})
d = {'one':['1','aa'],'two':['2','bb']}
I would like to get something like this:
df = pd.DataFrame({
'a':['x1','x2','x3','x4'],
'Category':['one','two','x3','x4']})
I tried this, but it has not worked:
df['Category'] = np.nan
for k, v in d.items():
    for l in v:
        df['Category'] = [k if l in str(x).lower() else x for x in df['a']]
Any ideas appreciated!
First, create a function that does the lookup for you:
def func(val):
    for x in range(len(d.values())):
        if val in list(d.values())[x]:
            return list(d.keys())[x]
Now make use of the split() and apply() methods (splitting on '' expands each string into its characters, so column 2 holds each string's second character):
df['Category'] = df['a'].str.split('', expand=True)[2].apply(func)
Finally, use fillna() to keep the original string where no key matched:
df['Category'] = df['Category'].fillna(df['a'])
Printing df now gives the expected output:
    a Category
0  x1      one
1  x2      two
2  x3       x3
3  x4       x4
Edit:
You can also do this without the character-splitting step, by checking every substring in each value list:
def func(val):
    for x in range(len(d.values())):
        if any(l in val for l in list(d.values())[x]):
            return list(d.keys())[x]
then:
df['Category'] = df['a'].apply(func)
Finally:
df['Category'] = df['Category'].fillna(df['a'])
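The same labelling can also be done without apply: build a regex alternation per dictionary key and use str.contains. A sketch, assuming plain substring matching is the goal:

```python
import re
import pandas as pd

df = pd.DataFrame({'a': ['x1', 'x2', 'x3', 'x4']})
d = {'one': ['1', 'aa'], 'two': ['2', 'bb']}

df['Category'] = df['a']  # default: keep the original string
for key, substrings in d.items():
    # One vectorized pass per key; re.escape guards special characters.
    pattern = '|'.join(map(re.escape, substrings))
    df.loc[df['a'].str.contains(pattern), 'Category'] = key
```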
I've come up with the following heuristic, which looks really dirty.
It outputs what you desire, albeit with some warnings, since I've used chained indexing to assign values into the dataframe.
import pandas as pd
import numpy as np

def main():
    df = pd.DataFrame({'a': ['x1', 'x2', 'x3', 'x4']})
    d = {'one': ['1', 'aa'], 'two': ['2', 'bb']}
    found = False
    i = 0
    df['Category'] = np.nan
    for x in df['a']:
        for k, v in d.items():
            for item in v:
                if item in x:
                    df['Category'][i] = k
                    found = True
                    break
                else:
                    df['Category'][i] = x
            if found:
                found = False
                break
        i += 1
    print(df)

main()

Find Anagram in Pandas DataFrame columns

Given the Dataframe
df = pd.DataFrame({'word1': ['elvis', 'lease', 'admirer'], 'word2': ['lives', 'sale', 'married']})
how can I add a third column that returns True or False depending on whether the two words in the same row are an anagram or not?
I have written this function, which returns an error when I apply it to the df.
def anagram(word1, word2):
    word1_lst = [l for l in word1]
    word2_lst = [i for i in word2]
    return sorted(word1_lst) == sorted(word2_lst)

df['Anagram'] = df.apply(anagram(df['word1'], df['word2']), axis=1)
TypeError: 'bool' object is not callable
df = pd.DataFrame({'word1': ['elvis', 'lease', 'admirer'], 'word2': ['lives', 'sale', 'married']})
df['Anagram'] = df.word1.apply(sorted) == df.word2.apply(sorted)
The issue here is that you are calling df.apply() with the argument anagram(df['word1'], df['word2']), which evaluates to a bool rather than a function, together with axis=1.
To fix, alter your function so it takes a row:
def anagram(row):
    word1_lst = [l for l in row['word1']]
    word2_lst = [i for i in row['word2']]
    return sorted(word1_lst) == sorted(word2_lst)
then call apply with the function itself, not the result of calling it:
df['Anagram'] = df.apply(anagram, axis=1)
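An alternative that avoids sorting altogether: collections.Counter compares letter frequencies directly. A sketch, equivalent for plain lowercase words like the ones in the question:

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({'word1': ['elvis', 'lease', 'admirer'],
                   'word2': ['lives', 'sale', 'married']})

# Two words are anagrams exactly when their letter counts match.
df['Anagram'] = [Counter(w1) == Counter(w2)
                 for w1, w2 in zip(df['word1'], df['word2'])]
```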

missing in function applied to pandas dataframe column

I'm trying to apply a function to my 'age' and 'area' columns in order to get the results shown in the 'wanted' column.
Unfortunately this function gives me errors. I know that there are other methods in pandas, like iloc, but I would like to understand this particular situation.
raw_data = {'age': [-1, np.nan, 10, 300, 20],
            'area': ['N', 'S', 'W', np.nan, np.nan],
            'wanted': ['A', np.nan, 'A', np.nan, np.nan]}
df = pd.DataFrame(raw_data, columns=['age', 'area', 'wanted'])
df
def my_funct(df):
    if df["age"].isnull():
        return np.nan
    elif df["area"].notnull():
        return 'A'
    else:
        return np.nan

df["target"] = df.apply(lambda df: my_funct(df), axis=1)
In your example, the problem is that when apply passes a row to your function, df["age"] gives you a float, which has no isnull() method. To check whether a float is null, use the pd.isna function; similarly, use pd.notna instead of notnull().
def my_funct(row):
    if pd.isna(row["age"]):
        return np.nan
    elif pd.notna(row["area"]):
        return 'A'
    else:
        return np.nan

df["target"] = df.apply(my_funct, axis=1)

Setting the values of a pandas df column based on ranges of values of another df column

I have a df that looks like this:
df = pd.DataFrame({'a':[-3,-2,-1,0,1,2,3], 'b': [1,2,3,4,5,6,7]})
I would like to create a column 'c' that looks at the values of 'a' to determine which operation to apply to 'b', and puts the result in the new column 'c'.
I have a solution that uses iterrows; however, my real df is large and iterrows is inefficient.
What I would like is to do this operation in vectorized form.
My 'slow' solution is:
df['c'] = 0
for index, row in df.iterrows():
    if row['a'] <= -2:
        df.loc[index, 'c'] = row['b'] * np.sqrt(row['b'] * row['a'])
    if row['a'] > -2 and row['a'] < 2:
        df.loc[index, 'c'] = np.log(row['b'])
    if row['a'] >= 2:
        df.loc[index, 'c'] = row['b']**3
Use np.select. It's a vectorized operation.
conditions = [
    df['a'] <= -2,
    (df['a'] > -2) & (df['a'] < 2),
    df['a'] >= 2
]
values = [
    df['b'] * np.sqrt(df['b'] * df['a']),
    np.log(df['b']),
    df['b']**3
]
df['c'] = np.select(conditions, values, default=0)
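One thing to be aware of: every entry in values is evaluated for all rows before np.select picks among them, so the question's sqrt formula produces NaN wherever b * a is negative (i.e. on the very rows its condition selects). A runnable sketch; np.errstate just silences the resulting invalid-value warning:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [-3, -2, -1, 0, 1, 2, 3], 'b': [1, 2, 3, 4, 5, 6, 7]})

conditions = [df['a'] <= -2,
              (df['a'] > -2) & (df['a'] < 2),
              df['a'] >= 2]
# All three arrays are computed eagerly, including sqrt of negative products.
with np.errstate(invalid='ignore'):
    values = [df['b'] * np.sqrt(df['b'] * df['a']),
              np.log(df['b']),
              df['b'] ** 3]
df['c'] = np.select(conditions, values, default=0)
```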
You can use .apply across multiple columns of a pandas DataFrame (specifying axis=1) with a lambda function to get the job done. Not sure whether the speed is acceptable. See this example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [-3, -2, -1, 0, 1, 2, 3], 'b': [1, 2, 3, 4, 5, 6, 7]})

def func(a_, b_):
    if a_ <= -2:
        return b_ * (b_ * a_)**0.5
    elif a_ < 2:
        return np.log(b_)
    else:
        return b_**3.

df['c'] = df[['a', 'b']].apply(lambda x: func(x[0], x[1]), axis=1)
One method is to index by conditions and then operate on just those rows. Something like this:
df['c'] = np.nan
indices = [
    df['a'] <= -2,
    (df['a'] > -2) & (df['a'] < 2),
    df['a'] >= 2
]
ops = [
    lambda x: x['b'] * np.sqrt(x['b'] * x['a']),
    lambda x: np.log(x['b']),
    lambda x: x['b']**3
]
for ix, op in zip(indices, ops):
    df.loc[ix, 'c'] = op(df)
def my_func(x):
    if x['a'] <= -2:
        return x['b'] * np.sqrt(x['b'] * x['a'])
    # write other conditions as needed

df['c'] = df.apply(my_func, axis=1)
df.apply iterates over each row of the DataFrame and applies the function passed to it. The second argument is axis, set to 1 here, which means it iterates over rows and passes each row into the function; by default it is 0, in which case it iterates over columns.
Lastly, the function's return value becomes the row's value in column 'c'.
