Table with most frequent combinations with pandas

The dataset
ID Product
1 A
1 B
2 A
3 A
3 C
3 D
4 A
4 B
5 A
5 C
5 D
.....
The goal is to get the most frequent combinations of products per ID, regardless of how many products make up each combination.
The expected result here is :
[A, C, D] 2
[A, B] 2
[A, C] 2
......
Something like this, but with a working value in place of the ? placeholder:
import itertools
(df.groupby('ID').Product.agg(lambda x: list(set(itertools.combinations(x, ?))))
   .explode().str.join('-').value_counts())

IIUC, groupby ID, aggregate to frozenset and count the occurrences with value_counts:
df.groupby('ID')['Product'].agg(frozenset).value_counts()
output:
(B, A) 2
(D, C, A) 2
(A) 1
Name: Product, dtype: int64
Alternative using sorted tuples:
df.groupby('ID')['Product'].agg(lambda x: tuple(sorted(x))).value_counts()
output:
(A, B) 2
(A, C, D) 2
(A,) 1
Name: Product, dtype: int64
Or strings:
df.groupby('ID')['Product'].agg(lambda x: ','.join(sorted(x))).value_counts()
output:
A,B 2
A,C,D 2
A 1
Name: Product, dtype: int64

Related

Generate pairs from columns in Pandas

What's the easiest way in Pandas to turn this
df = pd.DataFrame({'Class': [1, 2], 'Students': ['A,B,C,D', 'E,A,C']})
df
Class Students
0 1 A,B,C,D
1 2 E,A,C
Into a frame with one row per unordered pair of students within each class?
Let's try combinations:
from functools import partial
from itertools import combinations

(df.set_index('Class')['Students']
   .str.split(',')                     # 'A,B,C,D' -> ['A', 'B', 'C', 'D']
   .map(partial(combinations, r=2))    # all unordered pairs per class
   .map(list)                          # materialize the iterators for explode
   .explode()                          # one row per pair
   .reset_index())
Class Students
0 1 (A, B)
1 1 (A, C)
2 1 (A, D)
3 1 (B, C)
4 1 (B, D)
5 1 (C, D)
6 2 (E, A)
7 2 (E, C)
8 2 (A, C)
This needs multiple steps with pandas plus numpy: split + explode, then a self-merge, sort each pair, and drop_duplicates:
import numpy as np

df['Students'] = df['Students'].str.split(',')
df = df.explode('Students')
df = df.merge(df, on='Class')
# sort each pair alphabetically so (B, A) and (A, B) collapse together
df[['Students_x', 'Students_y']] = np.sort(df[['Students_x', 'Students_y']].values, axis=1)
df = df.query('Students_x != Students_y').drop_duplicates(['Students_x', 'Students_y'])
df['Students'] = df[['Students_x', 'Students_y']].agg(','.join, axis=1)
df
Out[100]:
Class Students_x Students_y Students
1 1 A B A,B
2 1 A C A,C
3 1 A D A,D
6 1 B C B,C
7 1 B D B,D
11 1 C D C,D
17 2 A E A,E
18 2 C E C,E

Sort index of different length tuples in pandas

>>> df = pd.DataFrame(index=[('B',),('A',),('B','C',),('A','B',),('A','B','C')],data=[1,2,3,4,5],columns=['count'])
>>> df
count
(B,) 1
(A,) 2
(B, C) 3
(A, B) 4
(A, B, C) 5
I would like to sort by the tuple index such that the 1 length tuples are sorted, then the 2 length, etc. The expected output is this:
count
(A,) 2
(B,) 1
(A, B) 4
(B, C) 3
(A, B, C) 5
I have tried sort_index, but that sorts lexicographically and disregards the length:
>>> df.sort_index()
count
(A,) 2
(A, B) 4
(A, B, C) 5
(B,) 1
(B, C) 3
You can also reindex, sorting the index by (length, value):
print(df.reindex(sorted(df.index, key=lambda d: (len(d), d))))
count
(A,) 2
(B,) 1
(A, B) 4
(B, C) 3
(A, B, C) 5
IIUC, you could do this by creating a sortkey:
(df.assign(sortkey=df.index.str.len())
   .rename_axis('index')
   .sort_values(['sortkey', 'index']))
Output:
count sortkey
index
(A,) 2 1
(B,) 1 1
(A, B) 4 2
(B, C) 3 2
(A, B, C) 5 3
First, use the .str accessor to get the len of each tuple in the index and assign it to a temporary column, sortkey. Then rename_axis the index so we can use sort_values with a combination of column headers and the index name.
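On pandas 1.1+, sort_index also accepts a key callable, which avoids the temporary column. A minimal sketch: two stable sort passes, lexicographic first, then by tuple length, so ties keep their lexicographic order:
out = df.sort_index().sort_index(key=lambda idx: idx.map(len), kind='stable')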

Counting each unique array of an array in each row of a column in a data frame

I am practicing pandas and python and I am not so good at for loops. I have a data frame as below: let's say this is df:
Name Value
A [[A,B],[C,D]]
B [[A,B],[D,E]]
C [[D,E],[K,L],[M,L]]
D [[K,L]]
I want to go through each row, find the unique arrays, and count them.
I have tried np.unique(a, return_index=True), but it returns two different lists and my problem is that I don't know how to go through each array.
Expected result would be:
Value Counts
[A,B] 2
[D,E] 2
[K,L] 2
[C,D] 1
[M,L] 1
Thank you very much.
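For reference, a construction of the sample frame, assuming the cells are lists of lists of strings:
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Value': [[['A', 'B'], ['C', 'D']],
              [['A', 'B'], ['D', 'E']],
              [['D', 'E'], ['K', 'L'], ['M', 'L']],
              [['K', 'L']]],
})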
Use DataFrame.explode in pandas 0.25+:
df.explode('Value')['Value'].value_counts()
Output:
[K, L] 2
[A, B] 2
[D, E] 2
[C, D] 1
[M, L] 1
Name: Value, dtype: int64
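Note: in recent pandas, value_counts on a column of Python lists can raise TypeError because lists are unhashable; if that happens, map to tuples first (a sketch):
df.explode('Value')['Value'].map(tuple).value_counts()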
Use Series.explode with Series.value_counts:
df = df['Value'].explode().value_counts().rename_axis('Value').reset_index(name='Counts')
print (df)
Value Counts
0 [D, E] 2
1 [A, B] 2
2 [K, L] 2
3 [C, D] 1
4 [M, L] 1
Numpy solution:
a, v = np.unique(np.concatenate(df['Value']),axis=0, return_counts=True)
df = pd.DataFrame({'Value':a.tolist(), 'Counts':v})
print (df)
Value Counts
0 [A, B] 2
1 [C, D] 1
2 [D, E] 2
3 [K, L] 2
4 [M, L] 1
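The numpy route works here because every inner list is a pair, so np.concatenate stacks them into a regular 2-D array; with ragged inner lists, np.unique(axis=0) would fail and the explode-based approaches above are the safer choice. A quick illustration:
import numpy as np

arr = np.concatenate([[['A', 'B'], ['C', 'D']], [['A', 'B']]])  # shape (3, 2)
vals, counts = np.unique(arr, axis=0, return_counts=True)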

python 3 get the column name depending on a condition [duplicate]

So I have a pandas df (Python 3.6) like this:
index A B C ...
A 1 5 0
B 0 0 1
C 1 2 4
...
As you can see, the index values are the same as the column names.
What I'm trying to do is to get a new column in the dataframe that holds the names of the columns where the value is greater than 0:
index A B C ... NewColumn
A 1 5 0 [A,B]
B 0 0 1 [C]
C 1 2 4 [A,B,C]
...
I've been trying with iterrows with no success.
I also know I can melt and pivot, but I think there should be a way with apply and a lambda maybe?
Thanks in advance
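For reference, a minimal construction of the sample frame, assuming integer values and single-letter labels as shown:
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1], 'B': [5, 0, 2], 'C': [0, 1, 4]},
                  index=['A', 'B', 'C'])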
If the new column should be a string, compare with DataFrame.gt, take the dot product with the column names, and finally strip the trailing separator:
df['NewColumn'] = df.gt(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
A B C NewColumn
A 1 5 0 A, B
B 0 0 1 C
C 1 2 4 A, B, C
And for lists, use apply with a lambda function:
df['NewColumn'] = df.gt(0).apply(lambda x: x.index[x].tolist(), axis=1)
print (df)
A B C NewColumn
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
Use:
df['NewColumn'] = df.apply(lambda x: list(x[x.gt(0)].index),axis=1)
A B C NewColumn
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
You could use .gt to check which values are greater than 0 and .dot to obtain the corresponding columns. Finally, .apply(list) turns the results into lists:
df.loc[:, 'NewColumn'] = df.gt(0).dot(df.columns).apply(list)
A B C NewColumn
index
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]
Note: this works with single-letter columns; otherwise you could do the following, where @ is the matrix-multiplication operator (equivalent to .dot):
df.loc[:, 'NewColumn'] = ((df.gt(0) @ df.columns.map('{},'.format))
                          .str.rstrip(',').str.split(','))
A B C NewColumn
index
A 1 5 0 [A, B]
B 0 0 1 [C]
C 1 2 4 [A, B, C]

Combination aggregations in Pandas

I have data in this format
ID Val
1 A
1 B
1 C
2 A
2 C
2 D
I want to group the data by ID, look at the combinations of values that exist within each group, and sum the counts of each combination across groups. The resulting output should look like:
v1 v2 count
A B 1
A C 2
A D 1
B C 1
C D 1
Is there a smart way to get this instead of looping through each possible combinations?
this should work:
>>> ts = df.groupby('Val')['ID'].aggregate(lambda ts: set(ts))
>>> ts
Val
A    {1, 2}
B    {1}
C    {1, 2}
D    {2}
Name: ID, dtype: object
>>> from itertools import product
>>> pd.DataFrame([[i, j, len(ts[i] & ts[j])] for i, j in product(ts.index, ts.index) if i < j],
... columns=['v1', 'v2', 'count'])
v1 v2 count
0 A B 1
1 A C 2
2 A D 1
3 B C 1
4 B D 0
5 C D 1
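On pandas 0.25+, a per-group itertools.combinations pass gives the same counts without building the full product of value pairs (a sketch; unlike the output above, pairs that never co-occur, such as B and D, simply do not appear):
from itertools import combinations

out = (df.groupby('ID')['Val']
         .apply(lambda s: list(combinations(sorted(s), 2)))  # pairs per ID
         .explode()                                          # one row per pair
         .value_counts())                                    # count across IDs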
What I came up with:
Use pd.merge to create the cartesian product
Filter the cartesian product to include only combinations of the form that you desire
Count the number of combinations
Convert to the desired dataframe format
Unsure if it is faster than looping through all possible combinations.
import pandas as pd
# Create the dataframe
df = pd.DataFrame([
[1, 'A'],
[1, 'B'],
[1, 'C'],
[2, 'A'],
[2, 'C'],
[2, 'D'],
], columns=['ID', 'Val'])
'''
ID Val
0 1 A
1 1 B
2 1 C
3 2 A
4 2 C
5 2 D
[6 rows x 2 columns]
'''
# Create the cartesian product
df2 = pd.merge(df, df, on='ID')
'''
ID Val_x Val_y
0 1 A A
1 1 A B
2 1 A C
3 1 B A
4 1 B B
5 1 B C
6 1 C A
7 1 C B
8 1 C C
9 2 A A
10 2 A C
11 2 A D
12 2 C A
13 2 C C
14 2 C D
15 2 D A
16 2 D C
17 2 D D
[18 rows x 3 columns]
'''
# Count the pairs, keeping each unordered pair once
# (drops (A, A)-style pairs and the (B, A) duplicate of (A, B)).
counts = pd.Series([
    v for v in zip(df2.Val_x, df2.Val_y)
    if v[0] < v[1]
]).value_counts(sort=False).sort_index()
'''
(A, B) 1
(A, C) 2
(A, D) 1
(B, C) 1
(C, D) 1
dtype: int64
'''
# Combine the counts (keys ordered to match the printed columns below)
df3 = pd.DataFrame(dict(
    count=counts.values,
    v1=[v1 for v1, _ in counts.index],
    v2=[v2 for _, v2 in counts.index],
))
'''
count v1 v2
0 1 A B
1 2 A C
2 1 A D
3 1 B C
4 1 C D
'''
