Combination aggregations in Pandas - python

I have data in this format:
ID Val
1 A
1 B
1 C
2 A
2 C
2 D
I want to group the data by ID, look at which pairs of values occur together, and count how many IDs each pair appears in. The resulting output should look like:
v1 v2 count
A B 1
A C 2
A D 1
B C 1
C D 1
Is there a smart way to get this instead of looping through each possible combination?
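For reference, a minimal construction of this frame (the answers below assume it is named df):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                   'Val': ['A', 'B', 'C', 'A', 'C', 'D']})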

This should work:
>>> ts = df.groupby('Val')['ID'].aggregate(set)
>>> ts
Val
A    {1, 2}
B    {1}
C    {1, 2}
D    {2}
Name: ID, dtype: object
>>> from itertools import product
>>> pd.DataFrame([[i, j, len(ts[i] & ts[j])] for i, j in product(ts.index, ts.index) if i < j],
... columns=['v1', 'v2', 'count'])
v1 v2 count
0 A B 1
1 A C 2
2 A D 1
3 B C 1
4 B D 0
5 C D 1
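Note that the desired output in the question omits zero-count pairs such as (B, D); one variation that drops them, using itertools.combinations so each unordered pair is generated once:
>>> from itertools import combinations
>>> res = pd.DataFrame([[i, j, len(ts[i] & ts[j])] for i, j in combinations(ts.index, 2)],
...                    columns=['v1', 'v2', 'count'])
>>> res[res['count'] > 0].reset_index(drop=True)
v1 v2 count
0 A B 1
1 A C 2
2 A D 1
3 B C 1
4 C D 1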

What I came up with:
1. Use pd.merge to create the cartesian product.
2. Filter the cartesian product down to one orientation of each pair (Val_x < Val_y).
3. Count the pairs.
4. Convert to the desired dataframe format.
Unsure if it is faster than looping through all possible combinations.
import pandas as pd
# Create the dataframe
df = pd.DataFrame([
    [1, 'A'],
    [1, 'B'],
    [1, 'C'],
    [2, 'A'],
    [2, 'C'],
    [2, 'D'],
], columns=['ID', 'Val'])
'''
ID Val
0 1 A
1 1 B
2 1 C
3 2 A
4 2 C
5 2 D
[6 rows x 2 columns]
'''
# Create the cartesian product
df2 = pd.merge(df, df, on='ID')
'''
ID Val_x Val_y
0 1 A A
1 1 A B
2 1 A C
3 1 B A
4 1 B B
5 1 B C
6 1 C A
7 1 C B
8 1 C C
9 2 A A
10 2 A C
11 2 A D
12 2 C A
13 2 C C
14 2 C D
15 2 D A
16 2 D C
17 2 D D
[18 rows x 3 columns]
'''
# Count the values, keeping one orientation of each pair (Val_x < Val_y),
# which also drops self-pairs like (A, A)
counts = pd.Series([
    v for v in zip(df2.Val_x, df2.Val_y)
    if v[0] < v[1]
]).value_counts(sort=False).sort_index()
'''
(A, B) 1
(A, C) 2
(A, D) 1
(B, C) 1
(C, D) 1
dtype: int64
'''
# Combine the counts
df3 = pd.DataFrame(dict(
    v1=[v1 for v1, _ in counts.index],
    v2=[v2 for _, v2 in counts.index],
    count=counts.values
))
'''
v1 v2 count
0 A B 1
1 A C 2
2 A D 1
3 B C 1
4 C D 1
'''
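For comparison, the same merge-and-filter idea can be written more compactly on current pandas (a sketch, same result given the df above):
df2 = pd.merge(df, df, on='ID').query('Val_x < Val_y')
df3 = (df2.groupby(['Val_x', 'Val_y']).size()
          .reset_index(name='count')
          .rename(columns={'Val_x': 'v1', 'Val_y': 'v2'}))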

Related

How to get event wise frequency and the frequency of each event in a dataframe?

I have a dataset like:
Data
a
a
a
a
a
b
b
b
a
a
b
I want to add a column that looks like the one below. Each entry has the form a1,1: the part before the comma (a1) labels which run of the event this is (here the first run of a), and the part after the comma is the running count within that run. Is there a way we can do this using python?
Data Frequency
a a1,1
a a1,2
a a1,3
a a1,4
a a1,5
b b1,1
b b1,2
b b1,3
a a2,1
a a2,2
b b2,1
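For reference, a minimal construction of this frame (assuming the name df used by the answers):
import pandas as pd

df = pd.DataFrame({'Data': list('aaaaabbbaab')})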
You can use:
# identify changes in Data
m = df['Data'].ne(df['Data'].shift()).cumsum()
# cumulated increments within groups
g1 = df.groupby(m).cumcount().add(1).astype(str)
# increments of different subgroups per Data
g2 = (df.loc[~m.duplicated(), 'Data']
        .groupby(df['Data']).cumcount().add(1)
        .reindex(df.index, method='ffill')
        .astype(str)
      )
df['Frequency'] = df['Data'].add(g2+','+g1)
Output:
Data Frequency
0 a a1,1
1 a a1,2
2 a a1,3
3 a a1,4
4 a a1,5
5 b b1,1
6 b b1,2
7 b b1,3
8 a a2,1
9 a a2,2
10 b b2,1
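For intuition, a sketch of what the intermediate series hold for the sample data (values written inline as comments):
# m:  1 1 1 1 1 2 2 2 3 3 4   -> ids of consecutive runs
# g1: 1 2 3 4 5 1 2 3 1 2 1   -> position within each run
# g2: 1 1 1 1 1 1 1 1 2 2 2   -> occurrence number of that value's run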
Code:
from itertools import groupby
k = [key for key, _group in groupby(df['Data'].tolist())] #OUTPUT ['a', 'b', 'a', 'b']
Key = [v+f'{k[:i].count(v)+1}' for i,v in enumerate(k)] #OUTPUT ['a1', 'b1', 'a2', 'b2']
Sum = [sum(1 for _ in _group) for key, _group in groupby(df['Data'].tolist())] #OUTPUT [5, 3, 2, 1]
df['Frequency'] = [f'{K},{S}' for I, K in enumerate(Key) for S in range(1, Sum[I]+1)]
Output:
Data Frequency
0 a a1,1
1 a a1,2
2 a a1,3
3 a a1,4
4 a a1,5
5 b b1,1
6 b b1,2
7 b b1,3
8 a a2,1
9 a a2,2
10 b b2,1
def function1(dd: pd.DataFrame):
    dd2 = dd.assign(col2=dd.col1.ne(dd.col1.shift()).cumsum())\
            .assign(col2=lambda dd: dd.Data+dd.col2.astype(str))\
            .assign(rk=dd.groupby('col1').col1.transform('cumcount').astype(int)+1)\
            .assign(col3=lambda dd: dd.col2+','+dd.rk.astype(str))
    return dd2.loc[:, ['Data', 'col3']]

df1.assign(col1=df1.Data.ne(df1.Data.shift()).cumsum())\
   .groupby('Data', group_keys=False).apply(function1).sort_index()
Data col3
0 a a1,1
1 a a1,2
2 a a1,3
3 a a1,4
4 a a1,5
5 b b1,1
6 b b1,2
7 b b1,3
8 a a2,1
9 a a2,2
10 b b2,1

Compute the sum of values in one column, if row in another column contains item in list

In python I have a dataframe that looks like:
Column1 Column2
[a,b,c,d] 4
[a,f,g] 3
[b,c] 6
[a,c,d] 5
I would like to compute a third column that adds up, for each item in Column1, the Column2 values of every row whose Column1 contains that item (for example, in the first row it would be a=4+3+5, b=4+6, c=5+6+5, d=4+5, so in total 4+3+5+4+6+5+6+5+4+5=47):
Column1 Column2 Column3
[a,b,c,d] 4 47
[a,f,g] 3 21
[b,c] 6 26
[a,c,d] 5 37
I've tried my best with query and indexing, but with no success. Thank you in advance!
Try explode, then build the per-item sums as a mapping and group back:
s = df.explode('Column1')
d = s.groupby('Column1')['Column2'].sum()
s['new'] = s['Column1'].map(d)
out = s.groupby(level=0).agg({'Column1':list,'Column2':'first','new':'sum'})
out
Column1 Column2 new
0 [a, b, c, d] 4 46
1 [a, f, g] 3 18
2 [b, c] 6 25
3 [a, c, d] 5 36
Notice:
c = 4 + 6 + 5 = 15, not 16 as counted in the question, which is why row 0 totals 46 rather than 47.
df = pd.DataFrame({'Column1': [['a', 'b', 'c', 'd'], ['a', 'f', 'g'], ['b', 'c'], ['a', 'c', 'd']],
                   'Column2': [4, 3, 6, 5]})
df1 = df.explode('Column1')
df['Column3'] = df1.groupby(level=0).apply(
    lambda d: d.Column1.apply(lambda x: df1.loc[df1.Column1 == x, 'Column2'].sum()).sum())
print(df)
Column1 Column2 Column3
0 [a, b, c, d] 4 46
1 [a, f, g] 3 18
2 [b, c] 6 25
3 [a, c, d] 5 36
Let's start with the easier-to-comprehend version, step by step.
Explode Column1:
wrk = df.explode(column='Column1')
The result is:
Column1 Column2
0 a 4
0 b 4
0 c 4
0 d 4
1 a 3
1 f 3
1 g 3
2 b 6
2 c 6
3 a 5
3 c 5
3 d 5
Compute weights for each element from lists in Column1:
weight = wrk.groupby('Column1').sum().rename(columns={'Column2': 'Weight'})
The result is:
Weight
Column1
a 12
b 10
c 15
d 9
f 3
g 3
Note some differences from your counting, e.g. the weight for c is 4 + 6 + 5 = 15.
Join Column1 from wrk with weight:
wrk2 = wrk[['Column1']].join(weight, on='Column1')
The result is:
Column1 Weight
0 a 12
0 b 10
0 c 15
0 d 9
1 a 12
1 f 3
1 g 3
2 b 10
2 c 15
3 a 12
3 c 15
3 d 9
And the final step is to compute the new column:
df['Column3'] = wrk2.groupby(level=0).Weight.sum()
The result is:
Column1 Column2 Column3
0 [a, b, c, d] 4 46
1 [a, f, g] 3 18
2 [b, c] 6 25
3 [a, c, d] 5 36
But if you want more concise code, you can "compress" the above
solution to:
wrk = df.explode(column='Column1')
df['Column3'] = wrk[['Column1']].join(wrk.groupby('Column1').sum().rename(
    columns={'Column2': 'Weight'}), on='Column1').groupby(level=0).Weight.sum()
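An equivalent sketch using Series.map instead of join (same weights, same result):
wrk = df.explode(column='Column1')
weight = wrk.groupby('Column1')['Column2'].sum()
df['Column3'] = wrk['Column1'].map(weight).groupby(level=0).sum()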

Generate pairs from columns in Pandas

What's the easiest way in Pandas to turn this
df = pd.DataFrame({'Class': [1, 2], 'Students': ['A,B,C,D', 'E,A,C']})
df
Class Students
0 1 A,B,C,D
1 2 E,A,C
Into a frame with one row per pair of students within each class, like the outputs shown in the answers below?
Let's try combinations:
from functools import partial
from itertools import combinations
(df.set_index('Class')['Students']
.str.split(',')
.map(partial(combinations, r=2))
.map(list)
.explode()
.reset_index())
Class Students
0 1 (A, B)
1 1 (A, C)
2 1 (A, D)
3 1 (B, C)
4 1 (B, D)
5 1 (C, D)
6 2 (E, A)
7 2 (E, C)
8 2 (A, C)
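If comma-joined strings are wanted rather than tuples (an assumption about the desired format), the pairs can be joined at the end with Series.str.join:
(df.set_index('Class')['Students']
   .str.split(',')
   .map(partial(combinations, r=2))
   .map(list)
   .explode()
   .str.join(',')
   .reset_index())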
This needs multiple steps with pandas only: split + explode, then merge within each Class, sort the pair in each row, and drop_duplicates per Class.
import numpy as np

df['Students'] = df['Students'].str.split(',')
df = df.explode('Students')
df = df.merge(df, on='Class')
df[['Students_x', 'Students_y']] = np.sort(df[['Students_x', 'Students_y']].values, axis=1)
df = df.query('Students_x != Students_y').drop_duplicates(['Class', 'Students_x', 'Students_y'])
df['Students'] = df[['Students_x', 'Students_y']].agg(','.join, axis=1)
df
Out[100]:
Class Students_x Students_y Students
1 1 A B A,B
2 1 A C A,C
3 1 A D A,D
6 1 B C B,C
7 1 B D B,D
11 1 C D C,D
17 2 A E A,E
18 2 C E C,E
21 2 A C A,C

python pandas - how to create for each row a list of column names with a condition?

I need to apply a function to all rows of a dataframe.
I have used this function, which returns a list of column names where the value is 1:
def find_column(x):
    a = []
    for column in df.columns:
        if df.loc[x, column] == 1:
            a = a + [column]
    return a
It works if I just pass an index, for example:
print(find_column(1))
but:
df['new_col'] = df.apply(find_column, axis=1)
does not work.
Any idea?
Thanks!
When you apply with axis=1, each x is a row Series whose index is the column names (which is why df.loc[x, column] fails), so you can filter the index by the matching values and convert to a list:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 1, 4, 5, 5, 1],
    'C': [7, 1, 9, 4, 2, 3],
    'D': [1, 1, 5, 7, 1, 1],
    'E': [5, 1, 6, 9, 1, 4],
    'F': list('aaabbb')
})

def find_column(x):
    return x.index[x == 1].tolist()

df['new'] = df.apply(find_column, axis=1)
print(df)
A B C D E F new
0 a 4 7 1 5 a [D]
1 b 1 1 1 1 a [B, C, D, E]
2 c 4 9 5 6 a []
3 d 5 4 7 9 b []
4 e 5 2 1 1 b [D, E]
5 f 1 3 1 4 b [B, D]
Another idea is to use DataFrame.dot on the boolean mask from DataFrame.eq, then strip the trailing separator and use Series.str.split. Note that rows with no 1s yield an empty string, which splits to [''] rather than []:
df['new'] = df.eq(1).dot(df.columns + ',').str.rstrip(',').str.split(',')
print(df)
A B C D E F new
0 a 4 7 1 5 a [D]
1 b 1 1 1 1 a [B, C, D, E]
2 c 4 9 5 6 a ['']
3 d 5 4 7 9 b ['']
4 e 5 2 1 1 b [D, E]
5 f 1 3 1 4 b [B, D]
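For the first idea, the helper can equally be written inline (a minor stylistic variant of the same apply):
df['new'] = df.apply(lambda x: x.index[x == 1].tolist(), axis=1)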

Pandas - Remove duplicates across multiple columns

I am trying to efficiently remove duplicates in Pandas, where duplicates are rows whose values are swapped across two columns. For example, in this data frame:
import pandas as pd
key = pd.DataFrame({'p1':['a','b','a','a','b','d','c'],'p2':['b','a','c','d','c','a','b'],'value':[1,1,2,3,5,3,5]})
df = pd.DataFrame(key,columns=['p1','p2','value'])
print(df)
p1 p2 value
0 a b 1
1 b a 1
2 a c 2
3 a d 3
4 b c 5
5 d a 3
6 c b 5
I would want to remove rows 1, 5 and 6, leaving me with just:
p1 p2 value
0 a b 1
2 a c 2
3 a d 3
4 b c 5
Thanks in advance for ideas on how to do this.
Reorder the p1 and p2 values so they appear in a canonical order:
mask = df['p1'] < df['p2']
df['first'] = df['p1'].where(mask, df['p2'])
df['second'] = df['p2'].where(mask, df['p1'])
yields
In [149]: df
Out[149]:
p1 p2 value first second
0 a b 1 a b
1 b a 1 a b
2 a c 2 a c
3 a d 3 a d
4 b c 5 b c
5 d a 3 a d
6 c b 5 b c
Then you can drop_duplicates:
df = df.drop_duplicates(subset=['value', 'first', 'second'])
import pandas as pd
key = pd.DataFrame({'p1':['a','b','a','a','b','d','c'],'p2':['b','a','c','d','c','a','b'],'value':[1,1,2,3,5,3,5]})
df = pd.DataFrame(key,columns=['p1','p2','value'])
mask = df['p1'] < df['p2']
df['first'] = df['p1'].where(mask, df['p2'])
df['second'] = df['p2'].where(mask, df['p1'])
df = df.drop_duplicates(subset=['value', 'first', 'second'])
df = df[['p1', 'p2', 'value']]
yields
In [151]: df
Out[151]:
p1 p2 value
0 a b 1
2 a c 2
3 a d 3
4 b c 5
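An equivalent sketch that builds the canonical pair with numpy in one step (a variation on the same idea, not a different method):
import numpy as np

canon = pd.DataFrame(np.sort(df[['p1', 'p2']].to_numpy(), axis=1),
                     index=df.index, columns=['first', 'second'])
canon['value'] = df['value']
df = df.loc[~canon.duplicated()]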
