What's the easiest way in Pandas to turn this
df = pd.DataFrame({'Class': [1, 2], 'Students': ['A,B,C,D', 'E,A,C']})
df
Class Students
0 1 A,B,C,D
1 2 E,A,C
Into this?
Let's try combinations:
from functools import partial
from itertools import combinations
(df.set_index('Class')['Students']
   .str.split(',')
   .map(partial(combinations, r=2))
   .map(list)
   .explode()
   .reset_index())
Class Students
0 1 (A, B)
1 1 (A, C)
2 1 (A, D)
3 1 (B, C)
4 1 (B, D)
5 1 (C, D)
6 2 (E, A)
7 2 (E, C)
8 2 (A, C)
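If strings like 'A,B' are more convenient than tuples downstream, the same chain can finish with a join (a small variation on the answer above, not in the original):

```python
import pandas as pd
from functools import partial
from itertools import combinations

df = pd.DataFrame({'Class': [1, 2], 'Students': ['A,B,C,D', 'E,A,C']})

out = (df.set_index('Class')['Students']
         .str.split(',')                   # 'A,B,C,D' -> ['A', 'B', 'C', 'D']
         .map(partial(combinations, r=2))  # all unordered pairs per class
         .map(list)
         .explode()                        # one pair (tuple) per row
         .map(','.join)                    # ('A', 'B') -> 'A,B'
         .reset_index())
```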
This needs multiple steps with pandas only: split + explode, then merge, sort and drop_duplicates. Note the question's column is named 'Students', and 'Class' must be part of the deduplication so that the (A, C) pair of class 2 survives:
import numpy as np

df['Students'] = df['Students'].str.split(',')
df = df.explode('Students')
df = df.merge(df, on='Class')
df[['Students_x', 'Students_y']] = np.sort(df[['Students_x', 'Students_y']].values, axis=1)
df = df.query('Students_x != Students_y').drop_duplicates(['Class', 'Students_x', 'Students_y'])
df['Students'] = df[['Students_x', 'Students_y']].agg(','.join, axis=1)
df
Out[100]:
Class Students_x Students_y Students
1 1 A B A,B
2 1 A C A,C
3 1 A D A,D
6 1 B C B,C
7 1 B D B,D
11 1 C D C,D
17 2 A E A,E
18 2 C E C,E
21 2 A C A,C
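For reference, a self-contained version of this merge-based approach, using the question's 'Students' column name and deduplicating within each Class so the (A, C) pair of class 2 is kept:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Class': [1, 2], 'Students': ['A,B,C,D', 'E,A,C']})

df['Students'] = df['Students'].str.split(',')
df = df.explode('Students')
df = df.merge(df, on='Class')                  # all ordered pairs per class
df[['Students_x', 'Students_y']] = np.sort(
    df[['Students_x', 'Students_y']].to_numpy(), axis=1)
df = (df.query('Students_x != Students_y')     # drop self-pairs like (A, A)
        .drop_duplicates(['Class', 'Students_x', 'Students_y']))
df['Students'] = df[['Students_x', 'Students_y']].agg(','.join, axis=1)
```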
Related
The dataset
ID Product
1 A
1 B
2 A
3 A
3 C
3 D
4 A
4 B
5 A
5 C
5 D
.....
The goal is to find the most frequent combinations of products per ID, regardless of the number of values in the combination.
The expected result here is :
[A, C, D] 2
[A, B] 2
[A, C] 2
......
Something like this, but with a working value in place of **?**:
import itertools
(df.groupby('ID').Product.agg(lambda x: list(set(itertools.combinations(x,**?**))))
.explode().str.join('-').value_counts())
IIUC, groupby ID, aggregate to frozenset and count the occurrences with value_counts:
df.groupby('ID')['Product'].agg(frozenset).value_counts()
output:
(B, A) 2
(D, C, A) 2
(A) 1
Name: Product, dtype: int64
Alternative using sorted tuples:
df.groupby('ID')['Product'].agg(lambda x: tuple(sorted(x))).value_counts()
output:
(A, B) 2
(A, C, D) 2
(A,) 1
Name: Product, dtype: int64
Or strings:
df.groupby('ID')['Product'].agg(lambda x: ','.join(sorted(x))).value_counts()
output:
A,B 2
A,C,D 2
A 1
Name: Product, dtype: int64
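Run end-to-end on the sample data (the DataFrame literal below is my reconstruction of the table in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':      [1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5],
    'Product': ['A', 'B', 'A', 'A', 'C', 'D', 'A', 'B', 'A', 'C', 'D'],
})

# One basket string per ID, sorted so identical baskets compare equal
counts = (df.groupby('ID')['Product']
            .agg(lambda x: ','.join(sorted(x)))
            .value_counts())
```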
In Python I have a dataframe that looks like:
Column1 Column2
[a,b,c,d] 4
[a,f,g] 3
[b,c] 6
[a,c,d] 5
I would like to compute a third column that adds up, for each item in Column1, the Column2 values of every row containing that item (for example in the first row it would be a=4+3+5, b=4+6, c=5+6+5, d=4+5, so in total 4+3+5+4+6+5+6+5+4+5=47):
Column1 Column2 Column3
[a,b,c,d] 4 47
[a,f,g] 3 21
[b,c] 6 26
[a,c,d] 5 37
I've tried my best with query and indexing but with no success, thank you in advance!
Try explode, then build the mapping Series of per-element sums and groupby back:
s = df.explode('Column1')
d = s.groupby('Column1')['Column2'].sum()
s['new'] = s['Column1'].map(d)
out = s.groupby(level=0).agg({'Column1':list,'Column2':'first','new':'sum'})
out
Column1 Column2 new
0 [a, b, c, d] 4 46
1 [a, f, g] 3 18
2 [b, c] 6 25
3 [a, c, d] 5 36
Notice: c = 4+6+5 = 15 here, not 5+6+5 as in your example, which is why the totals differ from yours.
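The whole thing is short enough to run in one go (the DataFrame literal is mine, mirroring the example):

```python
import pandas as pd

df = pd.DataFrame({'Column1': [['a', 'b', 'c', 'd'], ['a', 'f', 'g'],
                               ['b', 'c'], ['a', 'c', 'd']],
                   'Column2': [4, 3, 6, 5]})

s = df.explode('Column1')                   # one element per row, index kept
d = s.groupby('Column1')['Column2'].sum()   # weight of each element
s['new'] = s['Column1'].map(d)
out = s.groupby(level=0).agg({'Column1': list, 'Column2': 'first', 'new': 'sum'})
```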
df = pd.DataFrame({'Column1': [['a', 'b', 'c', 'd'], ['a', 'f', 'g'], ['b', 'c'], ['a', 'c', 'd']],
'Column2': [4, 3, 6, 5]})
df1 = df.explode('Column1')
df['Column3'] = df1.groupby(level=0).apply(
lambda d: d.Column1.apply(lambda x: df1.loc[df1.Column1 == x, 'Column2'].sum()).sum())
print(df)
Column1 Column2 Column3
0 [a, b, c, d] 4 46
1 [a, f, g] 3 18
2 [b, c] 6 25
3 [a, c, d] 5 36
Let's start with the easier-to-comprehend version, step by step.
Explode Column1:
wrk = df.explode(column='Column1')
The result is:
Column1 Column2
0 a 4
0 b 4
0 c 4
0 d 4
1 a 3
1 f 3
1 g 3
2 b 6
2 c 6
3 a 5
3 c 5
3 d 5
Compute weights for each element from lists in Column1:
weight = wrk.groupby('Column1').sum().rename(columns={'Column2': 'Weight'})
The result is:
Weight
Column1
a 12
b 10
c 15
d 9
f 3
g 3
Note some differences from your counting, e.g. the weight for c
is 4 + 6 + 5 = 15.
Join Column1 from wrk with weight:
wrk2 = wrk[['Column1']].join(weight, on='Column1')
The result is:
Column1 Weight
0 a 12
0 b 10
0 c 15
0 d 9
1 a 12
1 f 3
1 g 3
2 b 10
2 c 15
3 a 12
3 c 15
3 d 9
And the final step is to compute the new column:
df['Column3'] = wrk2.groupby(level=0).Weight.sum()
The result is:
Column1 Column2 Column3
0 [a, b, c, d] 4 46
1 [a, f, g] 3 18
2 [b, c] 6 25
3 [a, c, d] 5 36
But if you want more concise code, you can "compress" the above
solution to:
wrk = df.explode(column='Column1')
df['Column3'] = wrk[['Column1']].join(wrk.groupby('Column1').sum().rename(
columns={'Column2': 'Weight'}), on='Column1').groupby(level=0).Weight.sum()
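A quick check that the compressed version matches the step-by-step result (the DataFrame literal is mine):

```python
import pandas as pd

df = pd.DataFrame({'Column1': [['a', 'b', 'c', 'd'], ['a', 'f', 'g'],
                               ['b', 'c'], ['a', 'c', 'd']],
                   'Column2': [4, 3, 6, 5]})

wrk = df.explode(column='Column1')
# join each exploded element with its total weight, then sum per original row
df['Column3'] = wrk[['Column1']].join(
    wrk.groupby('Column1').sum().rename(columns={'Column2': 'Weight'}),
    on='Column1').groupby(level=0).Weight.sum()
```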
I need to apply a function to all rows of a dataframe.
I have used this function, which returns a list of column names where the value is 1:
def find_column(x):
    a = []
    for column in df.columns:
        if df.loc[x, column] == 1:
            a = a + [column]
    return a
It works if I just pass an index, for example:
print(find_column(1))
but:
df['new_col'] = df.apply(find_column, axis=1)
does not work.
Any idea?
Thanks!
apply with axis=1 iterates over rows, so x is a Series whose index is the column names; you can therefore filter the index by the matching values and convert it to a list:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,1,4,5,5,1],
'C':[7,1,9,4,2,3],
'D':[1,1,5,7,1,1],
'E':[5,1,6,9,1,4],
'F':list('aaabbb')
})
def find_column(x):
    return x.index[x == 1].tolist()
df['new'] = df.apply(find_column,axis=1)
print (df)
A B C D E F new
0 a 4 7 1 5 a [D]
1 b 1 1 1 1 a [B, C, D, E]
2 c 4 9 5 6 a []
3 d 5 4 7 9 b []
4 e 5 2 1 1 b [D, E]
5 f 1 3 1 4 b [B, D]
Another idea is to use DataFrame.dot with a mask from DataFrame.eq, then remove the trailing separator and apply Series.str.split:
df['new'] = df.eq(1).dot(df.columns + ',').str.rstrip(',').str.split(',')
print (df)
A B C D E F new
0 a 4 7 1 5 a [D]
1 b 1 1 1 1 a [B, C, D, E]
2 c 4 9 5 6 a []
3 d 5 4 7 9 b []
4 e 5 2 1 1 b [D, E]
5 f 1 3 1 4 b [B, D]
Sorry if this seems simple, but I have been struggling to find an answer to this.
I have a large dataframe of the format in the picture:
Each row can be uniquely identified by the multi-index built from the columns "trip_id", "direction_id" and "stop_sequence".
I would like a loop-based method to build a Python dictionary of dataframes, where each dataframe is the subset of the large dataframe containing all rows for one "trip_id" + "direction_id" combination.
At the end of the loop I would like to access each dataframe with a simple key, either an integer (0 to ~10,000) or the combination of trip_id and direction_id.
E.g. for the image above, I would like all the rows where trip_id is "17067064.T0.2-EPP-F-mjp-1.8.R" and direction_id is "1" to be in one dataframe of this dictionary collection.
Thank you for your help.
Kind regards,
Ben
Use groupby with dictionary comprehension:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,5,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
}).set_index(['F','B','C'])
print (df)
A D E
F B C
a 4 7 a 1 5
5 8 b 3 3
9 c 5 6
b 5 4 d 7 9
2 e 1 2
4 3 f 0 4
# python 3.6+
dfs = {f'{a}_{b}': v for (a, b), v in df.groupby(level=['F','B'])}
# python < 3.6
# dfs = {'{}_{}'.format(a, b): v for (a, b), v in df.groupby(level=['F','B'])}
print (dfs)
{'a_4': A D E
F B C
a 4 7 a 1 5, 'a_5': A D E
F B C
a 5 8 b 3 3
9 c 5 6, 'b_4': A D E
F B C
b 4 3 f 0 4, 'b_5': A D E
F B C
b 5 4 d 7 9
2 e 1 2}
print (dfs['a_4'])
A D E
F B C
a 4 7 a 1 5
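With the question's own column names, the same pattern would look like this (the small frame here is an invented stand-in, since the real data is only shown as a picture):

```python
import pandas as pd

df = pd.DataFrame({
    'trip_id':       ['t1', 't1', 't1', 't2'],
    'direction_id':  [0, 0, 1, 0],
    'stop_sequence': [1, 2, 1, 1],
})

# one sub-DataFrame per (trip_id, direction_id) combination
dfs = {key: grp for key, grp in df.groupby(['trip_id', 'direction_id'])}
```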
I have data in this format
ID Val
1 A
1 B
1 C
2 A
2 C
2 D
I want to group the data by ID, see which combinations exist, and sum the multiple combinations up. The resulting output should look like
v1 v2 count
A B 1
A C 2
A D 1
B C 1
C D 1
Is there a smart way to get this instead of looping through each possible combination?
this should work:
>>> ts = df.groupby('Val')['ID'].aggregate(lambda ts: set(ts))
>>> ts
Val
A {1, 2}
B {1}
C {1, 2}
D {2}
Name: ID, dtype: object
>>> from itertools import product
>>> pd.DataFrame([[i, j, len(ts[i] & ts[j])] for i, j in product(ts.index, ts.index) if i < j],
... columns=['v1', 'v2', 'count'])
v1 v2 count
0 A B 1
1 A C 2
2 A D 1
3 B C 1
4 B D 0
5 C D 1
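End-to-end on the sample data (the DataFrame literal is mine):

```python
import pandas as pd
from itertools import product

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2], 'Val': list('ABCACD')})

ts = df.groupby('Val')['ID'].agg(set)   # which IDs contain each value
out = pd.DataFrame(
    [[i, j, len(ts[i] & ts[j])] for i, j in product(ts.index, ts.index) if i < j],
    columns=['v1', 'v2', 'count'])
```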
What I came up with:
Use pd.merge to create the cartesian product
Filter the cartesian product to include only combinations of the form that you desire
Count the number of combinations
Convert to the desired dataframe format
Unsure if it is faster than looping through all possible combinations.
import pandas as pd
# Create the dataframe
df = pd.DataFrame([
[1, 'A'],
[1, 'B'],
[1, 'C'],
[2, 'A'],
[2, 'C'],
[2, 'D'],
], columns=['ID', 'Val'])
'''
ID Val
0 1 A
1 1 B
2 1 C
3 2 A
4 2 C
5 2 D
[6 rows x 2 columns]
'''
# Create the cartesian product
df2 = pd.merge(df, df, on='ID')
'''
ID Val_x Val_y
0 1 A A
1 1 A B
2 1 A C
3 1 B A
4 1 B B
5 1 B C
6 1 C A
7 1 C B
8 1 C C
9 2 A A
10 2 A C
11 2 A D
12 2 C A
13 2 C C
14 2 C D
15 2 D A
16 2 D C
17 2 D D
[18 rows x 3 columns]
'''
# Count the pairs; v[0] < v[1] filters out both the (A, A) self-pairs
# and the (B, A) mirror duplicates.
counts = pd.Series([
    v for v in zip(df2.Val_x, df2.Val_y)
    if v[0] < v[1]
]).value_counts(sort=False).sort_index()
'''
(A, B) 1
(A, C) 2
(A, D) 1
(B, C) 1
(C, D) 1
dtype: int64
'''
# Combine the counts
df3 = pd.DataFrame(dict(
v1=[v1 for v1, _ in counts.index],
v2=[v2 for _, v2 in counts.index],
count=counts.values
))
'''
count v1 v2
0 1 A B
1 2 A C
2 1 A D
3 1 B C
4 1 C D
'''
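For comparison, a shorter modern take (my sketch, not from either answer): count sorted pairs per ID directly with itertools.combinations and collections.Counter:

```python
from collections import Counter
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2], 'Val': list('ABCACD')})

# For each ID, enumerate its unordered value pairs and tally them globally
counts = Counter(
    pair
    for _, vals in df.groupby('ID')['Val']
    for pair in combinations(sorted(vals), 2))
```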