Pandas GroupBy using another DataFrame of one-hot encodings/overlapping masks - python

I have two dataframes with observations on rows and features (or group membership) on columns, e.g.:
> data_df
a b c
A 1 2 1
B 0 1 3
C 0 0 1
D 2 1 1
E 1 1 1
> mask_df
g1 g2
A 0 1
B 1 0
C 1 0
D 1 0
E 0 1
I want to group and aggregate (by sum) the values in the first dataframe (data_df) conditional on the binary values (masks) in the second dataframe (mask_df). The result should be the following (groups x features):
> aggr_df
a b c
g1 2 2 5
g2 2 3 2
Is there a way in pandas to group the first dataframe (data_df) using the masks contained in a second dataframe (mask_df) in a single command?

You can do this cheaply with dot and groupby:
data_df.groupby(mask_df.dot(mask_df.columns)).sum()
a b c
g1 2 2 5
g2 2 3 2
Where,
mask_df.dot(mask_df.columns)
A g2
B g1
C g1
D g1
E g2
dtype: object
This works well as long as each row always has exactly one column set to 1.
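For reference, a minimal self-contained sketch of this answer, reconstructing the example frames from the question:

```python
import pandas as pd

# Reconstruction of the example frames from the question.
data_df = pd.DataFrame(
    {"a": [1, 0, 0, 2, 1], "b": [2, 1, 0, 1, 1], "c": [1, 3, 1, 1, 1]},
    index=list("ABCDE"),
)
mask_df = pd.DataFrame(
    {"g1": [0, 1, 1, 1, 0], "g2": [1, 0, 0, 0, 1]},
    index=list("ABCDE"),
)

# mask_df.dot(mask_df.columns) maps each row to its group label
# (0 * "g1" is "" and 1 * "g1" is "g1", so the row sum is the label),
# which then serves as the grouping key.
labels = mask_df.dot(mask_df.columns)
aggr_df = data_df.groupby(labels).sum()
print(aggr_df)
```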

Notice that this will work even when observations in the first dataframe (data_df) belong to multiple masks in the second dataframe (mask_df).
> pd.concat({x:data_df.mul(mask_df[x],0).sum() for x in mask_df}).unstack()
a b c
g1 2 2 5
g2 2 3 2
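A small sketch with hypothetical overlapping masks (row A belongs to both groups, so a plain groupby no longer applies) shows why this answer is more general:

```python
import pandas as pd

# Hypothetical frames where row A is in both groups: the masks overlap.
data_df = pd.DataFrame(
    {"a": [1, 0, 2], "b": [2, 1, 1]}, index=list("ABC")
)
mask_df = pd.DataFrame(
    {"g1": [1, 1, 0], "g2": [1, 0, 1]}, index=list("ABC")
)

# Multiply each group's mask into the data, sum per group, then stack
# the per-group Series back into a (groups x features) frame.
aggr_df = pd.concat(
    {g: data_df.mul(mask_df[g], axis=0).sum() for g in mask_df}
).unstack()
print(aggr_df)
```

Row A is counted in both g1 and g2, which the one-hot `groupby` approach cannot express.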

Another option is to combine the dataframes first. Since both share the same index, you can join them with df_merge = data_df.merge(mask_df, left_index=True, right_index=True) (note that left_on/right_on expect column names, so left_on=True is not valid; use left_index/right_index for an index join). Then you can use df_merge for your grouping operations.

I decided to write another answer since:
coldspeed's answer works only with one-hot encodings
W-B's answer cannot be easily parallelized since it relies on a dict comprehension
In my case I noticed that I could achieve the same result just by using a dot product of mask_df with data_df:
> mask_df.T.dot(data_df)
In the special case of getting the average instead of the sum, this is achievable by scaling mask_df by the number of ones in each group:
> mask_df.T.dot(data_df).div(mask_df.sum(), axis=0)
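Putting both lines together as a runnable sketch, again assuming the example frames from the question:

```python
import pandas as pd

# Example frames from the question.
data_df = pd.DataFrame(
    {"a": [1, 0, 0, 2, 1], "b": [2, 1, 0, 1, 1], "c": [1, 3, 1, 1, 1]},
    index=list("ABCDE"),
)
mask_df = pd.DataFrame(
    {"g1": [0, 1, 1, 1, 0], "g2": [1, 0, 0, 0, 1]},
    index=list("ABCDE"),
)

# Sum per group: (groups x observations) @ (observations x features).
sums = mask_df.T.dot(data_df)

# Mean per group: divide each group's row by that group's size.
means = mask_df.T.dot(data_df).div(mask_df.sum(), axis=0)
print(sums)
print(means)
```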

Here's a way using a list comprehension:
pd.DataFrame([(data_df.T * mask_df[i]).sum(axis=1) for i in mask_df.columns],
             index=mask_df.columns)
a b c
g1 2 2 5
g2 2 3 2

Related

split pandas data frame into multiple of 4 rows

I have a dataset of 100 rows, I want to split them into multiple of 4 and then perform operations on it, i.e., first perform operation on first four rows, then on the next four rows and so on.
Note: Rows are independent of each other.
I don't know how to do it. Can somebody please help me? I would be extremely thankful.
I'll divide the df into chunks of 2 rows (as a simple example) and collect them into a list, dfs.
Example
df = pd.DataFrame(list('ABCDE'), columns=['value'])
df
value
0 A
1 B
2 C
3 D
4 E
Code
Build a grouper by integer-dividing the row position by the chunk size:
grouper = pd.Series(range(0, len(df))) // 2
grouper
0 0
1 0
2 1
3 1
4 2
dtype: int64
divide to list
g = df.groupby(grouper)
dfs = [g.get_group(x) for x in g.groups]
result(dfs):
[ value
0 A
1 B,
value
2 C
3 D,
value
4 E]
Check
dfs[0]
output:
value
0 A
1 B
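The same grouper idea scales directly to the original question's chunk size of 4. A minimal sketch with a hypothetical 10-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame of 10 rows, split into chunks of 4 rows each
# (the last chunk may be shorter).
df = pd.DataFrame({"value": range(10)})

grouper = np.arange(len(df)) // 4
dfs = [g for _, g in df.groupby(grouper)]
print([len(d) for d in dfs])
```

Each chunk can then be processed independently, since the rows are independent of each other.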

How to drop conflicted rows in Dataframe?

I have a classification task, and conflicting rows (same feature but different label) harm the performance.
idx feature label
0 a 0
1 a 1
2 b 0
3 c 1
4 a 0
5 b 0
How could I get a formatted dataframe as below?
idx feature label
2 b 0
3 c 1
5 b 0
DataFrame.duplicated() only outputs the duplicated rows, and logical operations between df["feature"].duplicated() and df.duplicated() do not return the results I want.
I think you need rows with only one unique value per group, so use GroupBy.transform with DataFrameGroupBy.nunique, compare with 1, and filter with boolean indexing:
df = df[df.groupby('feature')['label'].transform('nunique').eq(1)]
print (df)
idx feature label
2 2 b 0
3 3 c 1
5 5 b 0
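A runnable sketch of this filter on the question's data (using a default integer index in place of the `idx` column):

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame(
    {"feature": ["a", "a", "b", "c", "a", "b"],
     "label":   [0, 1, 0, 1, 0, 0]}
)

# Keep only the features whose label is unambiguous,
# i.e. exactly one unique label per feature.
out = df[df.groupby("feature")["label"].transform("nunique").eq(1)]
print(out)
```

Feature "a" carries both labels 0 and 1, so all of its rows are dropped; "b" and "c" survive.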

Use groupby in dataframe to perform data filtering and element-wise subtraction

I have a dataframe composed by the following table:
A B C D
A1 5 3 4
A1 8 1 0
A2 1 1 0
A2 1 9 1
A2 1 3 1
A3 0 4 7
...
I need to group the data by the 'A' label, then check whether the sum of the 'B' column for each label is larger than 10. If it is, perform an operation that subtracts 'C' and 'D'. Finally, I need to drop all rows belonging to 'A' labels for which the sum condition is not met. I am trying to use the groupby method, but I am not sure this is the right way to go. So far I have grouped everything with df.groupby('A')['B'].sum() to get a sum per label and check the condition against the threshold of 10. But how do I then apply the subtraction between columns C and D and also drop the irrelevant rows?
Use GroupBy.transform with sum to create a new Series filled with the aggregated values, filter the rows whose sum is greater than 10 using boolean indexing with Series.gt, and then subtract the columns:
df = df[df.groupby('A')['B'].transform('sum').gt(10)].copy()
df['E'] = df['C'].sub(df['D'])
print (df)
A B C D E
0 A1 5 3 4 -1
1 A1 8 1 0 1
Similar idea if need sum column:
df['sum'] = df.groupby('A')['B'].transform('sum')
df['E'] = df['C'].sub(df['D'])
df = df[df['sum'].gt(10)].copy()
print (df)
A B C D sum E
0 A1 5 3 4 13 -1
1 A1 8 1 0 13 1
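An alternative sketch uses GroupBy.filter, which keeps whole groups that satisfy the condition; this avoids the transform but can be slower on many groups:

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame(
    {"A": ["A1", "A1", "A2", "A2", "A2", "A3"],
     "B": [5, 8, 1, 1, 1, 0],
     "C": [3, 1, 1, 9, 3, 4],
     "D": [4, 0, 0, 1, 1, 7]}
)

# filter() drops every group whose B-sum is not greater than 10,
# then the subtraction runs only on the surviving rows.
out = df.groupby("A").filter(lambda g: g["B"].sum() > 10).copy()
out["E"] = out["C"] - out["D"]
print(out)
```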

Shuffle rows of a DataFrame until all consecutive values in a column are different?

I have a dataframe with rows that I'd like to shuffle continuously until the value in column B is not identical across any two consecutive rows:
initial dataframe:
A | B
_______
a 1
b 1
c 2
d 3
e 3
Possible outcome:
A | B
_______
b 1
c 2
e 3
a 1
d 3
I made a function scramble meant to do this but I am having trouble passing the newly scrambled dataframe back into the function to test for matching B values:
def scramble(x):
    curr_B = 'nothing'
    for index, row in x.iterrows():
        next_B = row['B']
        if str(next_B) == str(curr_B):
            x = x.sample(frac=1)
            curr_B = next_B
        curr_B = next_B
    return x

df = scramble(df)
I suspect the function is finding the matching values in the next row, but I can't shuffle it continuously until there are no two sequential rows with the same B value.
Printing the output yields a dataframe that still shows consecutive rows with the same value in B.
If your goal is to eliminate consecutive duplicates, you can just use groupby and cumcount, then reindex your DataFrame:
df.loc[df.groupby('B').cumcount().sort_values().index]
A B
0 a 1
2 c 2
3 d 3
1 b 1
4 e 3
If you actually want randomness, then you can group on cumcount and call shuffle. This should eliminate consecutive dupes to some degree (NOT GUARANTEED) while preserving randomness and still avoiding slow iteration. Here's an example:
np.random.seed(0)
(df.groupby(df.groupby('B').cumcount(), group_keys=False)
   .apply(lambda x: x.sample(frac=1))
   .reset_index(drop=True))
A B
0 d 3
1 a 1
2 c 2
3 b 1
4 e 3
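The deterministic cumcount reordering can be checked end to end; a small sketch on the question's data (using an explicit stable sort, so the output order is reproducible):

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({"A": list("abcde"), "B": [1, 1, 2, 3, 3]})

# Interleave duplicates: the k-th occurrence of every B value
# is placed in the k-th "round", separating equal values.
out = df.loc[df.groupby("B").cumcount().sort_values(kind="stable").index]
print(out)
```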

Count the intersections between duplicates using Pandas

I have a Dataframe that looks like this:
Symbols Count
A 3
A 1
A 2
A 4
B 1
B 3
B 9
C 2
C 1
C 3
What I want to do using Pandas is to identify duplicate rows on the "Count" column but I want to count the number of times the Symbols intersect with each other on the duplicate.
By this I mean: if a Count value appears with two different Symbols, those Symbols are listed as having one intersection between them, since they share the same Count value.
Something like this:
Symbol Symbol Number of Intersections
A B 2
B A 2
C A 3
.....
I'm sure there is a Pythonic Pandas way of doing this. But its not coming to mind.
Let's use merge to do a self-merge, then query and groupby:
df_selfmerge = df.merge(df, on='Count', how="inner").query('Symbols_x != Symbols_y')
(df_selfmerge.groupby(['Symbols_x','Symbols_y'])['Count']
.count()
.reset_index()
.rename(columns={'Symbols_x':'Symbol',
'Symbols_y':'Symbol',
'Count':'Number of Intersections'}))
EDIT: Using size() is safer in case of NaN values:
(df_selfmerge.groupby(['Symbols_x','Symbols_y'])['Count']
.size()
.reset_index()
.rename(columns={'Symbols_x':'Symbol',
'Symbols_y':'Symbol',
0:'Number of Intersections'}))
Output:
Symbol Symbol Number of Intersections
0 A B 2
1 A C 3
2 B A 2
3 B C 2
4 C A 3
5 C B 2
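A compact runnable sketch of the size() variant, with `reset_index(name=...)` doing the rename in one step:

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame(
    {"Symbols": list("AAAABBBCCC"),
     "Count":   [3, 1, 2, 4, 1, 3, 9, 2, 1, 3]}
)

# Self-merge on Count, drop same-symbol pairs, then count
# how many Count values each ordered pair of symbols shares.
pairs = (
    df.merge(df, on="Count")
      .query("Symbols_x != Symbols_y")
      .groupby(["Symbols_x", "Symbols_y"])
      .size()
      .reset_index(name="Number of Intersections")
)
print(pairs)
```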
