How to map values to a DataFrame with multiple columns as keys? - python

I have two dataframes like so:
data = {'A': [3, 2, 1, 0], 'B': [1, 2, 3, 4]}
data2 = {'A': [3, 2, 1, 0, 3, 2], 'B': [1, 2, 3, 4, 20, 2], 'C':[5,3,2,1, 5, 1]}
df1 = pd.DataFrame.from_dict(data)
df2 = pd.DataFrame.from_dict(data2)
I then grouped df2 by A and B and took the mean of C:
values_to_map = df2.groupby(['A', 'B']).mean().to_dict()
Now I would like to add a column new_c to df1, mapped from those means wherever columns A and B match:
A B new_c
0 3 1 5.0
1 2 2 2.0
2 1 3 2.0
3 0 4 1.0
where new_c is the average of C for every pair A, B from df2.
Note that A and B don't have to be keys of the dataframe (i.e. they aren't unique identifiers), which is why I originally wanted to map with a dictionary, but that failed with multiple keys.
How would I go about that?
Thank you for looking into it with me!

I found a solution to this:
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()
df1['new_c'] = df1.apply(lambda x: values_to_map[x['A'], x['B']], axis=1)
Thanks for looking into it!
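For anyone hitting a KeyError with the dictionary approach, it may help to look at what to_dict() actually returns. The sketch below uses the df2 from the question; the names nested and flat are only illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({'A': [3, 2, 1, 0, 3, 2],
                    'B': [1, 2, 3, 4, 20, 2],
                    'C': [5, 3, 2, 1, 5, 1]})

# to_dict() on the grouped DataFrame nests the result under the column name,
# so the (A, B) tuples are one level down.
nested = df2.groupby(['A', 'B']).mean().to_dict()

# Selecting C first yields a Series; its to_dict() is keyed directly by (A, B).
flat = df2.groupby(['A', 'B'])['C'].mean().to_dict()

print(list(nested))   # top-level keys of the nested dictionary
print(flat[(2, 2)])   # mean of C over the rows where A == 2 and B == 2
```

With the flat dictionary, a lookup such as values_to_map[x['A'], x['B']] addresses the tuple keys directly.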

You can also use np.vectorize, passing the two columns as separate arguments:
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()
df1['new_c'] = np.vectorize(lambda a, b: values_to_map.get((a, b), np.nan))(df1['A'], df1['B'])

You can first form a MultiIndex from the [["A", "B"]] subset of the frame df1 and use its map function to map the A-B pairs to the desired grouped mean values:
cols = ["A", "B"]
mapper = df2.groupby(cols).C.mean()
df1["new_c"] = pd.MultiIndex.from_frame(df1[cols]).map(mapper)
to get
>>> df1
A B new_c
0 3 1 5.0
1 2 2 2.0
2 1 3 2.0
3 0 4 1.0
(if an A-B pair in df1 isn't found in df2's groups, new_c corresponding to that pair will be NaN with this method.)
Note that neither pandas' apply nor np.vectorize is a truly "vectorized" routine; both loop in Python under the hood. However, they might be fast enough for one's purposes and might prove more readable in places.
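Another option, not mentioned above, is a plain left merge on both key columns. This is only a sketch using the frames from the question; the names means and merged are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [3, 2, 1, 0], 'B': [1, 2, 3, 4]})
df2 = pd.DataFrame({'A': [3, 2, 1, 0, 3, 2],
                    'B': [1, 2, 3, 4, 20, 2],
                    'C': [5, 3, 2, 1, 5, 1]})

# One row per (A, B) pair with the mean of C, kept as ordinary columns.
means = (df2.groupby(['A', 'B'], as_index=False)['C'].mean()
            .rename(columns={'C': 'new_c'}))

# A left merge keeps every row of df1; pairs missing from df2 would get NaN.
merged = df1.merge(means, on=['A', 'B'], how='left')
print(merged)
```

Unlike the map-based approaches, this returns a new frame rather than adding a column in place.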

Related

Match column to another column containing array

I have a very junior question in Python. I have a dataframe with a column containing some IDs, and a separate dataframe that contains 2 columns, one of which holds arrays:
df1 = pd.DataFrame({"some_id": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame([["A", [1, 2]], ["B", [3, 4]], ["C", [5]]], columns=['letter', 'some_ids'])
I want to add to df1 a new column "letter" that, for a given "some_id", looks up df2, checks whether this id is in df2['some_ids'] and returns df2['letter'].
I tried this:
df1['letter'] = df2[df1['some_id'].isin(df2['some_ids'])].letter
and get NaNs. Any suggestions as to where I am making a mistake?
Create a dictionary by flattening the nested lists in a dict comprehension and then use Series.map:
d = {x: a for a,b in zip(df2['letter'], df2['some_ids']) for x in b}
df1['letter'] = df1['some_id'].map(d)
Or map with a Series built using DataFrame.explode and DataFrame.set_index:
df1['letter'] = df1['some_id'].map(df2.explode('some_ids').set_index('some_ids')['letter'])
Or use a left join after renaming the column:
df1 = df1.merge(df2.explode('some_ids').rename(columns={'some_ids':'some_id'}), how='left')
print (df1)
some_id letter
0 1 A
1 2 A
2 3 B
3 4 B
4 5 C
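As a rough, self-contained sketch of the dictionary approach, here is what happens when an id has no match. The id 6 is a hypothetical addition, not part of the original question, and the names lookup and missing are illustrative.

```python
import pandas as pd

# 6 is a hypothetical id that does not appear in any list in df2.
df1 = pd.DataFrame({"some_id": [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame([["A", [1, 2]], ["B", [3, 4]], ["C", [5]]],
                   columns=['letter', 'some_ids'])

# Flatten the lists into a single {id: letter} lookup.
lookup = {some_id: letter
          for letter, ids in zip(df2['letter'], df2['some_ids'])
          for some_id in ids}

# Series.map leaves NaN wherever the id is not a key of the lookup.
df1['letter'] = df1['some_id'].map(lookup)
missing = df1[df1['letter'].isna()]
print(df1)
```

Checking missing afterwards is a cheap way to spot ids that were not covered by df2.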

Groupby and agg produce NaNs when used with diff

I have an indexed dataset like this
np.random.seed(1)
df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 3, 4],
                   'C': np.random.randn(4)},
                  index=[5, 242, 12, 634])
Now I'm trying to get the difference of C by group like so
df.groupby('A').agg('diff')
which gives me the output
B C
5 NaN NaN
242 1.0 -2.492028
12 NaN NaN
634 1.0 -0.455332
I'm trying to get a resulting dataframe with only 2 rows, which contain the differences like so
B C
1.0 -2.492028
1.0 -0.455332
How can I achieve this?
First, diff is not an aggregation function: it returns output of the same length as the original dataframe. If you would like the differences without the NaN rows, drop them with dropna:
out = df.groupby('A').diff().dropna()
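A small sketch to make the shapes visible, reusing the frame from the question (diffed is just an illustrative name). The intermediate result has one row per original row, and dropna then removes the first row of each group.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 3, 4],
                   'C': np.random.randn(4)},
                  index=[5, 242, 12, 634])

# diff works row by row inside each group, so the result keeps all four rows;
# the first row of every group has nothing to subtract from and becomes NaN.
diffed = df.groupby('A').diff()

# Dropping the NaN rows leaves one row per group, still on the original index.
out = diffed.dropna()
print(diffed.shape, out.shape)
```

If the original index labels are not wanted, out.reset_index(drop=True) would discard them.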

How to use groupby max in own groupby function?

I have the following df
d = {'CAT':['C1','C2','C1','C2'],'A': [10, 20,30,40], 'B': [3, 4,10,3]}
df1 = pd.DataFrame(data=d)
I am trying to add a new column obtained by dividing 'A' by the highest 'B' in its category ('CAT'). That is, I want to divide 10 by 10, 20 by 4, 30 by 10 and 40 by 4 to obtain the following df
d = {'CAT':['C1','C2','C1','C2'],'A': [10, 20,30,40], 'B': [3, 4,10,3], 'C':[1,5,3,10]}
Any suggestions?
I find it easy to do without having to condition/groupby on CAT
d = {'A': [10, 20,30,40], 'B': [3, 4,10,3]}
df1 = pd.DataFrame(data=d)
df1 = df1.apply(lambda x:x.A/max(df1['B']),axis=1)
but with 'CAT' I am having a hard time.
You could do this in one line; I only broke it into separate lines for clarity. transform broadcasts the per-group result back across the entire dataframe, so every row gets the maximum of B for its own group; with that we can compute column C:
grouping = df1.groupby("CAT").B.transform("max")
df1['C'] = df1.A.div(grouping)
df1
CAT A B C
0 C1 10 3 1.0
1 C2 20 4 5.0
2 C1 30 10 3.0
3 C2 40 3 10.0
You're most of the way there with apply. Depending on how big your actual dataset is, apply could turn out to be inefficient, but ignoring that, you can solve your problem by calling max on a filtered view of the dataframe rather than on the whole df.
Or, just to get to the code:
df1['calculation'] = df1.apply(lambda row: row['A'] / max(df1[df1['CAT'] == row['CAT']]['B']), axis=1)
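To compare the two answers side by side, here is a minimal sketch on the data from the question. The names group_max, via_transform and via_apply are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'CAT': ['C1', 'C2', 'C1', 'C2'],
                    'A': [10, 20, 30, 40],
                    'B': [3, 4, 10, 3]})

# transform('max') returns one value per row: the maximum of B in that row's group.
group_max = df1.groupby('CAT')['B'].transform('max')
via_transform = df1['A'] / group_max

# The apply version filters the frame once per row, which is quadratic in the
# number of rows but gives the same numbers.
via_apply = df1.apply(
    lambda row: row['A'] / df1.loc[df1['CAT'] == row['CAT'], 'B'].max(),
    axis=1)

print(via_transform.tolist())
print(via_apply.tolist())
```

On a frame this small the difference is irrelevant; on large data the transform version avoids re-filtering the frame for every row.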

Multiply all rows in a Pandas DataFrame by dictionary

Say I have a DataFrame, call it one, like this:
non_multiply_col col_1 col_2
A Name 1 3
and a dict, call it two, like this:
{'col_1': 4, 'col_2': 5}
Is there a way to multiply all rows of one by the values in two, for the columns named by two's keys, so the result would be:
non_multiply_col col_1 col_2
A Name 4 15
I tried using multiply, but I'm not really looking to join on anything specific. Maybe I'm not understanding how to use multiply correctly.
Thanks
mul/multiply works fine if the dictionary is converted to a Series:
d = {'col_1': 4, 'col_2': 5}
df.mul(pd.Series(d), axis=1)
# col_1 col_2
#0 4 15
In case you have more columns in the data frame than the dictionary:
df = pd.DataFrame([{'col_1': 1, 'col_2': 3, 'col_3': 4}])
d = {'col_1': 4, 'col_2': 5}
cols_to_update = list(d)  # the dictionary's keys as a list of column names
# multiply the selected columns and update
df[cols_to_update] = df[cols_to_update].mul(pd.Series(d), axis=1)[cols_to_update]
df
# col_1 col_2 col_3
#0 4 15 4
I happened to find that this works as well, though I'm not sure whether there are any caveats to this usage:
df[d.keys()] *= pd.Series(d)
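Here is a sketch applied to the frame from the question, including its text column. The names one and two follow the question; cols is illustrative. Restricting the multiplication to the dictionary's columns keeps the string column out of the arithmetic.

```python
import pandas as pd

one = pd.DataFrame({'non_multiply_col': ['A Name'],
                    'col_1': [1],
                    'col_2': [3]})
two = {'col_1': 4, 'col_2': 5}

# Only the columns named in the dictionary take part in the multiplication.
cols = list(two)

# The Series index is aligned with the column labels, so each column is
# multiplied by its own factor.
one[cols] = one[cols].mul(pd.Series(two), axis=1)
print(one)
```

This assumes every key of two is a column of one; a key that is not a column would raise a KeyError at the selection step.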

DataFrame from dictionary

Sorry if this is a duplicate, but I couldn't find a solution on the internet...
I have some dictionary
{'a':1, 'b':2, 'c':3}
Now I want to construct a pandas DataFrame whose column names correspond to the keys and whose values correspond to the values. It should be a DataFrame with only one row.
a b c
1 2 3
In another topic I found only solutions where both keys and values become columns in the new DataFrame.
There are some caveats here. If you just pass the dict to the DataFrame constructor, it will raise an error:
ValueError: If using all scalar values, you must pass an index
To get around that you can pass an index which will work:
In [139]:
temp = {'a':1,'b':2,'c':3}
pd.DataFrame(temp, index=[0])
Out[139]:
a b c
0 1 2 3
Ideally your values should be iterable, so a list or array like:
In [141]:
temp = {'a':[1],'b':[2],'c':[3]}
pd.DataFrame(temp)
Out[141]:
a b c
0 1 2 3
Thanks to @joris for pointing out that if you wrap the dict in a list then you don't have to pass an index to the constructor:
In [142]:
temp = {'a':1,'b':2,'c':3}
pd.DataFrame([temp])
Out[142]:
a b c
0 1 2 3
For flexibility, you can also use pd.DataFrame.from_dict with orient='index'. This works whether your dictionary values are scalars or lists.
Note the final transpose step, which can be performed via df.T or df.transpose().
temp1 = {'a': 1, 'b': 2, 'c': 3}
temp2 = {'a': [1, 2], 'b':[2, 3], 'c':[3, 4]}
print(pd.DataFrame.from_dict(temp1, orient='index').T)
a b c
0 1 2 3
print(pd.DataFrame.from_dict(temp2, orient='index').T)
a b c
0 1 2 3
1 2 3 4
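As a quick sanity check, the sketch below builds the one-row frame in three of the ways described above and prints each result. The variable names are illustrative.

```python
import pandas as pd

temp = {'a': 1, 'b': 2, 'c': 3}

# 1. Scalars plus an explicit index.
with_index = pd.DataFrame(temp, index=[0])

# 2. The dict wrapped in a list, so each dict is treated as one row.
from_list = pd.DataFrame([temp])

# 3. Keys as the index first, then transposed so they become the columns.
from_index = pd.DataFrame.from_dict(temp, orient='index').T

for frame in (with_index, from_list, from_index):
    print(frame)
```

All three give a single row with columns a, b and c; which one to use is mostly a matter of taste.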
