What's the easiest way in Pandas to turn this
df = pd.DataFrame({'Class': [1, 2], 'Students': ['A,B,C,D', 'E,A,C']})
df
Class Students
0 1 A,B,C,D
1 2 E,A,C
Into this?
Let's try combinations:
from functools import partial
from itertools import combinations
(df.set_index('Class')['Students']
   .str.split(',')
   .map(partial(combinations, r=2))
   .map(list)
   .explode()
   .reset_index())
Class Students
0 1 (A, B)
1 1 (A, C)
2 1 (A, D)
3 1 (B, C)
4 1 (B, D)
5 1 (C, D)
6 2 (E, A)
7 2 (E, C)
8 2 (A, C)
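If strings like 'A,B' are more convenient than tuples downstream, the same chain can finish with a join (a small variation on the answer above, not in the original):

```python
import pandas as pd
from functools import partial
from itertools import combinations

df = pd.DataFrame({'Class': [1, 2], 'Students': ['A,B,C,D', 'E,A,C']})

out = (df.set_index('Class')['Students']
         .str.split(',')                   # 'A,B,C,D' -> ['A', 'B', 'C', 'D']
         .map(partial(combinations, r=2))  # all unordered pairs per class
         .map(list)
         .explode()                        # one pair (tuple) per row
         .map(','.join)                    # ('A', 'B') -> 'A,B'
         .reset_index())
```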
This needs multiple steps with pandas only: split + explode, then merge, sort and drop_duplicates. Note the question's column is named 'Students', and 'Class' must be part of the deduplication so that the (A, C) pair of class 2 survives:
import numpy as np

df['Students'] = df['Students'].str.split(',')
df = df.explode('Students')
df = df.merge(df, on='Class')
df[['Students_x', 'Students_y']] = np.sort(df[['Students_x', 'Students_y']].values, axis=1)
df = df.query('Students_x != Students_y').drop_duplicates(['Class', 'Students_x', 'Students_y'])
df['Students'] = df[['Students_x', 'Students_y']].agg(','.join, axis=1)
df
Out[100]:
Class Students_x Students_y Students
1 1 A B A,B
2 1 A C A,C
3 1 A D A,D
6 1 B C B,C
7 1 B D B,D
11 1 C D C,D
17 2 A E A,E
18 2 C E C,E
21 2 A C A,C
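For reference, a self-contained version of this merge-based approach, using the question's 'Students' column name and deduplicating within each Class so the (A, C) pair of class 2 is kept:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Class': [1, 2], 'Students': ['A,B,C,D', 'E,A,C']})

df['Students'] = df['Students'].str.split(',')
df = df.explode('Students')
df = df.merge(df, on='Class')                  # all ordered pairs per class
df[['Students_x', 'Students_y']] = np.sort(
    df[['Students_x', 'Students_y']].to_numpy(), axis=1)
df = (df.query('Students_x != Students_y')     # drop self-pairs like (A, A)
        .drop_duplicates(['Class', 'Students_x', 'Students_y']))
df['Students'] = df[['Students_x', 'Students_y']].agg(','.join, axis=1)
```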
Related
The dataset
ID Product
1 A
1 B
2 A
3 A
3 C
3 D
4 A
4 B
5 A
5 C
5 D
.....
The goal is to find the most frequent combinations of products per ID, regardless of the number of values in the combination.
The expected result here is :
[A, C, D] 2
[A, B] 2
[A, C] 2
......
Something like this, but with a working value in place of **?**:
import itertools
(df.groupby('ID').Product.agg(lambda x: list(set(itertools.combinations(x,**?**))))
.explode().str.join('-').value_counts())
IIUC, groupby ID, aggregate to frozenset and count the occurrences with value_counts:
df.groupby('ID')['Product'].agg(frozenset).value_counts()
output:
(B, A) 2
(D, C, A) 2
(A) 1
Name: Product, dtype: int64
Alternative using sorted tuples:
df.groupby('ID')['Product'].agg(lambda x: tuple(sorted(x))).value_counts()
output:
(A, B) 2
(A, C, D) 2
(A,) 1
Name: Product, dtype: int64
Or strings:
df.groupby('ID')['Product'].agg(lambda x: ','.join(sorted(x))).value_counts()
output:
A,B 2
A,C,D 2
A 1
Name: Product, dtype: int64
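Run end-to-end on the sample data (the DataFrame literal below is my reconstruction of the table in the question):

```python
import pandas as pd

df = pd.DataFrame({
    'ID':      [1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5],
    'Product': ['A', 'B', 'A', 'A', 'C', 'D', 'A', 'B', 'A', 'C', 'D'],
})

# One basket string per ID, sorted so identical baskets compare equal
counts = (df.groupby('ID')['Product']
            .agg(lambda x: ','.join(sorted(x)))
            .value_counts())
```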
In Python I have a dataframe that looks like:
Column1 Column2
[a,b,c,d] 4
[a,f,g] 3
[b,c] 6
[a,c,d] 5
I would like to compute a third column that adds up, for each item in Column1, the Column2 values of every row containing that item (for example in the first row it would be a=4+3+5, b=4+6, c=5+6+5, d=4+5, so in total 4+3+5+4+6+5+6+5+4+5=47):
Column1 Column2 Column3
[a,b,c,d] 4 47
[a,f,g] 3 21
[b,c] 6 26
[a,c,d] 5 37
I've tried my best with query and indexing but with no success, thank you in advance!
Try explode, then build the mapping Series of per-element sums and groupby back:
s = df.explode('Column1')
d = s.groupby('Column1')['Column2'].sum()
s['new'] = s['Column1'].map(d)
out = s.groupby(level=0).agg({'Column1':list,'Column2':'first','new':'sum'})
out
Column1 Column2 new
0 [a, b, c, d] 4 46
1 [a, f, g] 3 18
2 [b, c] 6 25
3 [a, c, d] 5 36
Notice: c = 4+6+5 = 15 here, not 5+6+5 as in your example, which is why the totals differ from yours.
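The whole thing is short enough to run in one go (the DataFrame literal is mine, mirroring the example):

```python
import pandas as pd

df = pd.DataFrame({'Column1': [['a', 'b', 'c', 'd'], ['a', 'f', 'g'],
                               ['b', 'c'], ['a', 'c', 'd']],
                   'Column2': [4, 3, 6, 5]})

s = df.explode('Column1')                   # one element per row, index kept
d = s.groupby('Column1')['Column2'].sum()   # weight of each element
s['new'] = s['Column1'].map(d)
out = s.groupby(level=0).agg({'Column1': list, 'Column2': 'first', 'new': 'sum'})
```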
df = pd.DataFrame({'Column1': [['a', 'b', 'c', 'd'], ['a', 'f', 'g'], ['b', 'c'], ['a', 'c', 'd']],
'Column2': [4, 3, 6, 5]})
df1 = df.explode('Column1')
df['Column3'] = df1.groupby(level=0).apply(
lambda d: d.Column1.apply(lambda x: df1.loc[df1.Column1 == x, 'Column2'].sum()).sum())
print(df)
Column1 Column2 Column3
0 [a, b, c, d] 4 46
1 [a, f, g] 3 18
2 [b, c] 6 25
3 [a, c, d] 5 36
Let's start with the easier-to-comprehend version, step by step.
Explode Column1:
wrk = df.explode(column='Column1')
The result is:
Column1 Column2
0 a 4
0 b 4
0 c 4
0 d 4
1 a 3
1 f 3
1 g 3
2 b 6
2 c 6
3 a 5
3 c 5
3 d 5
Compute weights for each element from lists in Column1:
weight = wrk.groupby('Column1').sum().rename(columns={'Column2': 'Weight'})
The result is:
Weight
Column1
a 12
b 10
c 15
d 9
f 3
g 3
Note some differences from your counting, e.g. the weight for c
is 4 + 6 + 5 = 15.
Join Column1 from wrk with weight:
wrk2 = wrk[['Column1']].join(weight, on='Column1')
The result is:
Column1 Weight
0 a 12
0 b 10
0 c 15
0 d 9
1 a 12
1 f 3
1 g 3
2 b 10
2 c 15
3 a 12
3 c 15
3 d 9
And the final step is to compute the new column:
df['Column3'] = wrk2.groupby(level=0).Weight.sum()
The result is:
Column1 Column2 Column3
0 [a, b, c, d] 4 46
1 [a, f, g] 3 18
2 [b, c] 6 25
3 [a, c, d] 5 36
But if you want more concise code, you can "compress" the above
solution to:
wrk = df.explode(column='Column1')
df['Column3'] = wrk[['Column1']].join(wrk.groupby('Column1').sum().rename(
columns={'Column2': 'Weight'}), on='Column1').groupby(level=0).Weight.sum()
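A quick check that the compressed version matches the step-by-step result (the DataFrame literal is mine):

```python
import pandas as pd

df = pd.DataFrame({'Column1': [['a', 'b', 'c', 'd'], ['a', 'f', 'g'],
                               ['b', 'c'], ['a', 'c', 'd']],
                   'Column2': [4, 3, 6, 5]})

wrk = df.explode(column='Column1')
# join each exploded element with its total weight, then sum per original row
df['Column3'] = wrk[['Column1']].join(
    wrk.groupby('Column1').sum().rename(columns={'Column2': 'Weight'}),
    on='Column1').groupby(level=0).Weight.sum()
```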
I need to apply a function to all rows of a dataframe.
I have used this function, which returns a list of column names where the value is 1:
def find_column(x):
    a = []
    for column in df.columns:
        if df.loc[x, column] == 1:
            a = a + [column]
    return a
It works if I just pass an index, for example:
print(find_column(1))
but:
df['new_col'] = df.apply(find_column, axis=1)
does not work.
Any idea?
Thanks!
apply with axis=1 iterates over rows, so x is a Series whose index is the column names; you can therefore filter the index by the matching values and convert it to a list:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,1,4,5,5,1],
'C':[7,1,9,4,2,3],
'D':[1,1,5,7,1,1],
'E':[5,1,6,9,1,4],
'F':list('aaabbb')
})
def find_column(x):
    return x.index[x == 1].tolist()
df['new'] = df.apply(find_column,axis=1)
print (df)
A B C D E F new
0 a 4 7 1 5 a [D]
1 b 1 1 1 1 a [B, C, D, E]
2 c 4 9 5 6 a []
3 d 5 4 7 9 b []
4 e 5 2 1 1 b [D, E]
5 f 1 3 1 4 b [B, D]
Another idea is to use DataFrame.dot with a mask from DataFrame.eq, then remove the trailing separator and apply Series.str.split:
df['new'] = df.eq(1).dot(df.columns + ',').str.rstrip(',').str.split(',')
print (df)
A B C D E F new
0 a 4 7 1 5 a [D]
1 b 1 1 1 1 a [B, C, D, E]
2 c 4 9 5 6 a []
3 d 5 4 7 9 b []
4 e 5 2 1 1 b [D, E]
5 f 1 3 1 4 b [B, D]
Sorry if this seems simple, but I have been struggling to find an answer to this.
I have a large dataframe of the format in the picture:
Each row can be uniquely identified by the multi-index built from the columns "trip_id", "direction_id" and "stop_sequence".
I would like a loop-based method to build a Python dictionary of dataframes, where each dataframe is the subset of the large dataframe containing all rows for one "trip_id" + "direction_id" combination.
At the end of the loop I would like to access each dataframe with a simple key, either an integer (0 to ~10,000) or the combination of trip_id and direction_id.
E.g. for the image above, I would like all the rows where trip_id is "17067064.T0.2-EPP-F-mjp-1.8.R" and direction_id is "1" to be in one dataframe of this dictionary collection.
Thank you for your help.
Kind regards,
Ben
Use groupby with dictionary comprehension:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,5,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
}).set_index(['F','B','C'])
print (df)
A D E
F B C
a 4 7 a 1 5
5 8 b 3 3
9 c 5 6
b 5 4 d 7 9
2 e 1 2
4 3 f 0 4
# python 3.6+
dfs = {f'{a}_{b}': v for (a, b), v in df.groupby(level=['F','B'])}
# python < 3.6
# dfs = {'{}_{}'.format(a, b): v for (a, b), v in df.groupby(level=['F','B'])}
print (dfs)
{'a_4': A D E
F B C
a 4 7 a 1 5, 'a_5': A D E
F B C
a 5 8 b 3 3
9 c 5 6, 'b_4': A D E
F B C
b 4 3 f 0 4, 'b_5': A D E
F B C
b 5 4 d 7 9
2 e 1 2}
print (dfs['a_4'])
A D E
F B C
a 4 7 a 1 5
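With the question's own column names, the same pattern would look like this (the small frame here is an invented stand-in, since the real data is only shown as a picture):

```python
import pandas as pd

df = pd.DataFrame({
    'trip_id':       ['t1', 't1', 't1', 't2'],
    'direction_id':  [0, 0, 1, 0],
    'stop_sequence': [1, 2, 1, 1],
})

# one sub-DataFrame per (trip_id, direction_id) combination
dfs = {key: grp for key, grp in df.groupby(['trip_id', 'direction_id'])}
```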
I have data in this format
ID Val
1 A
1 B
1 C
2 A
2 C
2 D
I want to group the data by ID, see which combinations exist, and sum the multiple combinations up. The resulting output should look like
v1 v2 count
A B 1
A C 2
A D 1
B C 1
C D 1
Is there a smart way to get this instead of looping through each possible combination?
this should work:
>>> ts = df.groupby('Val')['ID'].aggregate(lambda ts: set(ts))
>>> ts
Val
A {1, 2}
B {1}
C {1, 2}
D {2}
Name: ID, dtype: object
>>> from itertools import product
>>> pd.DataFrame([[i, j, len(ts[i] & ts[j])] for i, j in product(ts.index, ts.index) if i < j],
... columns=['v1', 'v2', 'count'])
v1 v2 count
0 A B 1
1 A C 2
2 A D 1
3 B C 1
4 B D 0
5 C D 1
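End-to-end on the sample data (the DataFrame literal is mine):

```python
import pandas as pd
from itertools import product

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2], 'Val': list('ABCACD')})

ts = df.groupby('Val')['ID'].agg(set)   # which IDs contain each value
out = pd.DataFrame(
    [[i, j, len(ts[i] & ts[j])] for i, j in product(ts.index, ts.index) if i < j],
    columns=['v1', 'v2', 'count'])
```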
What I came up with:
Use pd.merge to create the cartesian product
Filter the cartesian product to include only combinations of the form that you desire
Count the number of combinations
Convert to the desired dataframe format
Unsure if it is faster than looping through all possible combinations.
import pandas as pd
# Create the dataframe
df = pd.DataFrame([
[1, 'A'],
[1, 'B'],
[1, 'C'],
[2, 'A'],
[2, 'C'],
[2, 'D'],
], columns=['ID', 'Val'])
'''
ID Val
0 1 A
1 1 B
2 1 C
3 2 A
4 2 C
5 2 D
[6 rows x 2 columns]
'''
# Create the cartesian product
df2 = pd.merge(df, df, on='ID')
'''
ID Val_x Val_y
0 1 A A
1 1 A B
2 1 A C
3 1 B A
4 1 B B
5 1 B C
6 1 C A
7 1 C B
8 1 C C
9 2 A A
10 2 A C
11 2 A D
12 2 C A
13 2 C C
14 2 C D
15 2 D A
16 2 D C
17 2 D D
[18 rows x 3 columns]
'''
# Count the pairs; v[0] < v[1] filters out both the (A, A) self-pairs
# and the (B, A) mirror duplicates.
counts = pd.Series([
    v for v in zip(df2.Val_x, df2.Val_y)
    if v[0] < v[1]
]).value_counts(sort=False).sort_index()
'''
(A, B) 1
(A, C) 2
(A, D) 1
(B, C) 1
(C, D) 1
dtype: int64
'''
# Combine the counts
df3 = pd.DataFrame(dict(
v1=[v1 for v1, _ in counts.index],
v2=[v2 for _, v2 in counts.index],
count=counts.values
))
'''
count v1 v2
0 1 A B
1 2 A C
2 1 A D
3 1 B C
4 1 C D
'''
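For comparison, a shorter modern take (my sketch, not from either answer): count sorted pairs per ID directly with itertools.combinations and collections.Counter:

```python
from collections import Counter
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2], 'Val': list('ABCACD')})

# For each ID, enumerate its unordered value pairs and tally them globally
counts = Counter(
    pair
    for _, vals in df.groupby('ID')['Val']
    for pair in combinations(sorted(vals), 2))
```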