How to transform the result of a Pandas `GROUPBY` function to the original dataframe - python

Suppose I have a Pandas DataFrame with 6 columns and a custom function that takes counts of the elements in 2 or 3 columns and produces a boolean output. When a groupby object is created from the original dataframe and the custom function is applied df.groupby('col1').apply(myfunc), the result is a series whose length is equal to the number of categories of col1. How do I expand this output to match the length of the original dataframe? I tried transform, but was not able to use the custom function myfunc with it.
EDIT:
Here is some example code:
A = pd.DataFrame({'X':['a','b','c','a','c'], 'Y':['at','bt','ct','at','ct'], 'Z':['q','q','r','r','s']})
print (A)
def myfunc(df):
    return ((df['Z'].nunique()>=2) and (df['Y'].nunique()<2))
A.groupby('X').apply(myfunc)
I would like to expand this output as a new column Result such that where there is a in column X, the Result will be True.

You can map the groupby result back onto the original dataframe:
A['Result'] = A['X'].map(A.groupby('X').apply(myfunc))
Result would look like:
X Y Z Result
0 a at q True
1 b bt q False
2 c ct r True
3 a at r True
4 c ct s True
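When the custom function only combines per-group aggregates, as myfunc does here, one possible alternative is to let transform broadcast each aggregate back to the original rows and then combine the resulting Series. The following is a minimal sketch under that assumption, not a general replacement for apply; the names g, z_ok and y_ok are illustrative and do not appear in the question.

```python
import pandas as pd

A = pd.DataFrame({'X': ['a', 'b', 'c', 'a', 'c'],
                  'Y': ['at', 'bt', 'ct', 'at', 'ct'],
                  'Z': ['q', 'q', 'r', 'r', 's']})

g = A.groupby('X')
# transform('nunique') returns one value per original row, aligned on A's index
z_ok = g['Z'].transform('nunique') >= 2
y_ok = g['Y'].transform('nunique') < 2
# Combine the two row-aligned boolean Series element-wise
A['Result'] = z_ok & y_ok
print(A)
```

This works because transform is applied to one column at a time, so each condition has to be computed separately before the two are combined.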

My solution uses a loop, so it may not be the best one, but I think it works well enough.
The core idea is that you can traverse every sub-dataframe (gdf) with for i, gdf in gp, add the result column (c in my example) to each sub-dataframe, and finally concat all the sub-dataframes into one.
Here is an example:
import pandas as pd
df = pd.DataFrame({'a':[1,2,1,2],'b':['a','b','c','d']})
gp = df.groupby('a') # group
s = gp['a'].sum() # sum column 'a' within each group
adf = []
# then create a new dataframe
for i, gdf in gp:
    tdf = gdf.copy()
    tdf.loc[:,'c'] = s.loc[i]
    adf.append(tdf)
pd.concat(adf)
from:
a b
0 1 a
1 2 b
2 1 c
3 2 d
to:
a b c
0 1 a 2
2 1 c 2
1 2 b 4
3 2 d 4
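For a built-in aggregation such as the sum used in this example, the loop and concat steps can probably be avoided with transform, which broadcasts the per-group value back to every row. This is a sketch for that specific case only; unlike the loop above, it keeps the rows in their original order.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': ['a', 'b', 'c', 'd']})

# transform('sum') computes the group sum and repeats it for each row of the group
df['c'] = df.groupby('a')['a'].transform('sum')
print(df)
```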

Related

Idiomatic way to create pandas dataframe as concatenation of function of another's rows

Say I have one dataframe
import pandas as pd
input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))
Also I have a function f that maps each row to another dataframe. Here's an example of such a function. Note that in general the function could take any form so I'm not looking for answers that use agg to reimplement the f below.
def f(row):
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))
I want to create one dataframe that is the concatenation of the function applied to each of the first dataframe's rows. What is the idiomatic way to do this?
output_df = pd.concat([f(row) for _, row in input_df.iterrows()])
I thought I should be able to use apply or similar for this purpose but nothing seemed to work.
x y
0 2 1
1 3 4
0 6 4
1 5 9
You can use DataFrame.agg to calculate the sum and product, numpy.ndarray.reshape to flatten the result, and df.pow(2) or np.square to calculate the squares.
out = pd.DataFrame({'x': df.agg(['prod', 'sum'],axis=1).to_numpy().reshape(-1),
                    'y': np.square(df).to_numpy().reshape(-1)})
out
x y
0 2 1
1 3 4
2 6 4
3 5 9
You should avoid iterating over rows (see How to iterate over rows in a DataFrame in Pandas).
Instead try:
df = df.assign(product=df.a*df.b, sum=df.sum(axis=1),
               asq=df.a**2, bsq=df.b**2)
Then:
df = [[[p, s], [asq, bsq]] for p, s, asq, bsq in df.to_numpy()]
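If f really is an arbitrary function that cannot be vectorized, the concat approach from the question can be kept and tidied up slightly. The sketch below assumes that f only reads its argument by key (row['a']), so plain dictionaries from to_dict('records') can stand in for the row Series; the keys argument is optional and is used here to record which input row each output block came from.

```python
import pandas as pd

input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))

def f(row):
    # Example function from the question: maps one row to a small DataFrame
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))

# One dict per row; cheaper than building a Series per row with iterrows
records = input_df.to_dict('records')

# keys= labels each block with the index of the input row it came from,
# producing a two-level index (input row, position within f's output)
output_df = pd.concat([f(row) for row in records], keys=list(input_df.index))
print(output_df)
```

Passing ignore_index=True instead of keys would give a flat 0..n-1 index if the origin of each block is not needed.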

Use transform to calculate a value from two different columns

I'd like to apply a small function that uses two parameters on one data frame using the transform function.
Consider this rather useless example function:
import pandas as pd
def example_function(x, y):
    if y=="hi":
        res = x*3
    else:
        res = x
    return res
Depending on the value of y ("hi" or something else), the value x will be multiplied by 3 or returned unaltered.
Given this example Dataframe
df = pd.DataFrame(dict([("A",[1,2,3,4]), ("B",["hi", "ho", "ho", "hi"])]))
I'd like to get this result:
A B C
0 1 hi 3
1 2 ho 2
2 3 ho 3
3 4 hi 12
I assumed that passing two columns should work:
df["combined"] = df[["A", "B"]].transform(example_function)
but I'm getting an error (Missing 1 required positional argument). Any suggestion how to solve this?
This is not possible, because transform processes each column separately, so the function cannot compare values across columns (each call receives a single Series).
A solution with DataFrame.apply works the way you need:
df["combined"] = df.apply(lambda x: example_function(x.A, x.B), axis=1)
print (df)
A B combined
0 1 hi 3
1 2 ho 2
2 3 ho 3
3 4 hi 12
You can check it with this function:
def function(x):
    print (x)
    return x
df[["A", "B"]].transform(function)
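For a condition as simple as the one in example_function, a row-wise apply may not be needed at all. The following sketch uses Series.where as a vectorized stand-in for the if/else; it assumes the logic can be expressed as a single boolean mask, which will not hold for every custom function.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["hi", "ho", "ho", "hi"]})

# Series.where keeps A where the condition is True and takes A * 3 elsewhere,
# i.e. rows where B == "hi" are multiplied by 3
df["combined"] = df["A"].where(df["B"] != "hi", df["A"] * 3)
print(df)
```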

Add column to DataFrame in a loop

Let's say I have a very simple pandas dataframe, containing a single indexed column with "initial values". I want to read in a loop N other dataframes to fill a single "comparison" column, with matching indices.
For instance, with my initial dataframe as
Initial
0 a
1 b
2 c
3 d
and the following two dataframes to read in a loop
Comparison
0 e
1 f
Comparison
2 g
3 h
4 i <= note that this index doesn't exist in Initial so won't be matched
I would like to produce the following result
Initial Comparison
0 a e
1 b f
2 c g
3 d h
Using merge, concat or join, I only ever seem to be able to create a new column for each iteration of the loop, filling the blanks with NaN.
What's the most pandas-pythonic way of achieving this?
Below is an example based on the proposed duplicate's solution:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([['a'],['b'],['c'],['d']]), columns=['Initial'])
print(df1)
df2 = pd.DataFrame(np.array([['e'],['f']]), columns=['Compare'])
print(df2)
df3 = pd.DataFrame(np.array([[2,'g'],[3,'h'],[4,'i']]), columns=['','Compare'])
df3 = df3.set_index('')
print(df3)
print(df1.merge(df2,left_index=True,right_index=True).merge(df3,left_index=True,right_index=True))
>>
Initial
0 a
1 b
2 c
3 d
Compare
0 e
1 f
Compare
2 g
3 h
4 i
Empty DataFrame
Columns: [Initial, Compare_x, Compare_y]
Index: []
Second edit: #W-B, the following seems to work, but it can't be the case that there isn't a simpler option using proper pandas methods. It also requires turning off warnings, which might be dangerous...
pd.options.mode.chained_assignment = None
df1["Compare"]=pd.Series()
for ind in df1.index.values:
    if ind in df2.index.values:
        df1["Compare"][ind]=df2.T[ind]["Compare"]
    if ind in df3.index.values:
        df1["Compare"][ind]=df3.T[ind]["Compare"]
print(df1)
>>
Initial Compare
0 a e
1 b f
2 c g
3 d h
OK, since the OP needs more info:
Data input
import functools
df1 = pd.DataFrame(np.array([['a'],['b'],['c'],['d']]), columns=['Initial'])
df1['Compare']=np.nan
df2 = pd.DataFrame(np.array([['e'],['f']]), columns=['Compare'])
df3 = pd.DataFrame(np.array(['g','h','i']), columns=['Compare'],index=[2,3,4])
Solution
newdf=functools.reduce(lambda x,y: x.fillna(y),[df1,df2,df3])
newdf
Out[639]:
Initial Compare
0 a e
1 b f
2 c g
3 d h
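Another way to get the same result, sketched here as an alternative to the fillna reduction, is to stack all the comparison frames first and then left-join them onto the initial frame by index. The names initial, pieces and comparison are illustrative, and the sketch assumes that no index value appears in more than one comparison frame (otherwise the join would produce duplicate rows).

```python
import pandas as pd

initial = pd.DataFrame({'Initial': ['a', 'b', 'c', 'd']})

# Stand-ins for the frames read inside the loop
pieces = [
    pd.DataFrame({'Comparison': ['e', 'f']}, index=[0, 1]),
    pd.DataFrame({'Comparison': ['g', 'h', 'i']}, index=[2, 3, 4]),
]

# Stack the pieces into one frame, then left-join on the index;
# index 4 has no match in `initial`, so it is dropped
comparison = pd.concat(pieces)
result = initial.join(comparison)
print(result)
```

Collecting the frames in a list and concatenating once also avoids growing a dataframe inside the loop.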

python split pd dataframe by column

Is there a function that splits a pandas.dataframe object into multiple sub-dataframes, by a specific column value? For example, if I have
A 1
B 2
A 3
B 4
I want the result as follow:
A 1
A 3
and
B 2
B 4
In R, this is the split function. How is it done in Python? I know I can subset within a for loop, but is there a function that does this? Thanks.
You can use groupby() with a list comprehension to extract a list of sub-dataframes, each of which contains only a single ind value:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO("""A 1
B 2
A 3
B 4"""), sep = r"\s+", names=['ind', 'value'])
lst = [g for _, g in df.groupby('ind')]
lst[0]
# ind value
#0 A 1
#2 A 3
lst[1]
# ind value
#1 B 2
#3 B 4
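If the sub-dataframes need to be looked up by their group value rather than by position, a dict comprehension over the same groupby is a small variation on the list above. This is a sketch; the name parts is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ind': ['A', 'B', 'A', 'B'], 'value': [1, 2, 3, 4]})

# Iterating over a groupby yields (group key, sub-dataframe) pairs
parts = {key: group for key, group in df.groupby('ind')}
print(parts['A'])
print(parts['B'])
```

Each sub-dataframe keeps the row labels it had in the original frame, which is roughly what R's split returns as a named list.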

Importing Data and Columns from Another Python Pandas Data Frame

I have been trying to select a subset of a correlation matrix using the Pandas Python library.
For instance, if I had a matrix like
0 A B C
A 1 2 3
B 2 1 4
C 3 4 1
I might want to select a matrix where some of the variables in the original matrix are correlated with some of the other variables, like :
0 A C
A 1 3
C 3 1
To do this, I tried using the following code to slice the original correlation matrix using the names of the desired variables in a list, transpose the correlation matrix, reassign the original column names, and then slice again.
data = pd.read_csv("correlationmatrix.csv")
initial_vertical_axis = pd.DataFrame()
for x in var_list:
    a = data[x]
    initial_vertical_axis = initial_vertical_axis.append(a)
print(initial_vertical_axis)
initial_vertical_axis = pd.DataFrame(data=initial_vertical_axis, columns= var_list)
initial_matrix = pd.DataFrame()
for x in var_list:
    a = initial_vertical_axis[x]
    initial_matrix = initial_matrix.append(a)
print(initial_matrix)
However, this returns an empty correlation matrix with the right row and column labels but no data like
0 A C
A
C
I cannot find the error in my code that would lead to this. If there is a simpler way to go about this, I am open to suggestions.
Suppose data contains your matrix,
In [122]: data
Out[122]:
A B C
0
A 1 2 3
B 2 1 4
C 3 4 1
In [123]: var_list = ['A','C']
In [124]: data.loc[var_list,var_list]
Out[124]:
A C
0
A 1 3
C 3 1
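The .loc selection above relies on the variable names being the row index as well as the column labels. The sketch below shows one way to get there from a CSV; the file layout (first column holding the row labels under a header of 0) is an assumption based on the matrix printed in the question, and an in-memory string stands in for correlationmatrix.csv.

```python
from io import StringIO

import pandas as pd

# Assumed CSV layout: first column contains the row labels
csv_text = "0,A,B,C\nA,1,2,3\nB,2,1,4\nC,3,4,1\n"

# index_col=0 turns the first column into the row index
data = pd.read_csv(StringIO(csv_text), index_col=0)

var_list = ['A', 'C']
# Select rows and columns by label in a single step
subset = data.loc[var_list, var_list]
print(subset)
```

Without index_col=0 the rows would be labelled 0, 1, 2, and selecting them by variable name would not find any matching labels.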
