>>> df = pd.DataFrame(index=[('B',),('A',),('B','C',),('A','B',),('A','B','C')],data=[1,2,3,4,5],columns=['count'])
>>> df
count
(B,) 1
(A,) 2
(B, C) 3
(A, B) 4
(A, B, C) 5
I would like to sort by the tuple index such that the length-1 tuples are sorted first, then the length-2 tuples, and so on. The expected output is this:
count
(A,) 2
(B,) 1
(A, B) 4
(B, C) 3
(A, B, C) 5
I have tried sort_index, but that sorts according to the first member only and disregards the length:
>>> df.sort_index()
count
(A,) 2
(A, B) 4
(A, B, C) 5
(B,) 1
(B, C) 3
You can also reindex:
print(df.reindex(sorted(df.index, key=lambda d: (len(d), d))))
count
(A,) 2
(B,) 1
(A, B) 4
(B, C) 3
(A, B, C) 5
IIUC, you could do this by creating a sortkey:
(df.assign(sortkey=df.index.str.len())
.rename_axis('index')
.sort_values(['sortkey', 'index']))
Output:
count sortkey
index
(A,) 2 1
(B,) 1 1
(A, B) 4 2
(B, C) 3 2
(A, B, C) 5 3
First, use the .str accessor to get the length of each tuple in the index and assign it to a temporary column, sortkey. Then rename_axis the index so we can use sort_values with a combination of column headers and the index name.
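A positional variant that avoids the temporary column entirely: compute the row order with plain sorted and select rows with iloc. A minimal sketch, reusing the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame(index=[('B',), ('A',), ('B', 'C'), ('A', 'B'), ('A', 'B', 'C')],
                  data=[1, 2, 3, 4, 5], columns=['count'])

# Sort row positions by (tuple length, tuple value), then take rows positionally.
order = sorted(range(len(df)), key=lambda i: (len(df.index[i]), df.index[i]))
result = df.iloc[order]
print(result)
```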
I want to get the following result without using a for loop or df.apply().
The result for each row should be the row's values up to and including the column whose label equals the group value.
group 0 1 2 3 4 5 6 7
0 2 a b c d e f g h
1 5 s t u v w x y z
2 7 a b c d e f g h
group result
0 2 [a, b, c]
1 5 [s, t, u, v, w, x]
2 7 [a, b, c, d, e, f, g, h]
Use DataFrame.melt, filter on the group and variable columns with DataFrame.query, and lastly aggregate lists:
s = (df.melt('group', ignore_index=False)
.astype({'variable':int})
.query("group >= variable")
.groupby(level=0)['value']
.agg(list))
df = df[['group']].join(s.rename('result'))
print(df)
group result
0 2 [a, b, c]
1 5 [s, t, u, v, w, x]
2 7 [a, b, c, d, e, f, g, h]
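For reference, the melt pipeline is runnable end to end once the sample frame is reconstructed (the column labels '0'-'7' are assumed to be strings here, hence the astype):

```python
import pandas as pd

# Reconstruct the sample frame from the question.
rows = ['abcdefgh', 'stuvwxyz', 'abcdefgh']
df = pd.DataFrame([list(r) for r in rows], columns=[str(i) for i in range(8)])
df.insert(0, 'group', [2, 5, 7])

# melt keeps the original row index (ignore_index=False), so we can group on it.
s = (df.melt('group', ignore_index=False)
       .astype({'variable': int})
       .query('group >= variable')
       .groupby(level=0)['value']
       .agg(list))
out = df[['group']].join(s.rename('result'))
print(out)
```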
Or use apply:
df = (df.set_index('group')
.rename(columns=int)
.apply(lambda x: list(x[x.index <= x.name]), axis=1)
.reset_index(name='result'))
print(df)
group result
0 2 [a, b, c]
1 5 [s, t, u, v, w, x]
2 7 [a, b, c, d, e, f, g, h]
The dataset
ID Product
1 A
1 B
2 A
3 A
3 C
3 D
4 A
4 B
5 A
5 C
5 D
.....
The goal is to find the most frequent combinations of products per ID, regardless of the number of values.
The expected result here is:
[A, C, D] 2
[A, B] 2
[A, C] 2
......
Like this, but with a working value:
import itertools
(df.groupby('ID').Product.agg(lambda x: list(set(itertools.combinations(x,**?**))))
.explode().str.join('-').value_counts())
IIUC, groupby ID, aggregate to frozenset and count the occurrences with value_counts:
df.groupby('ID')['Product'].agg(frozenset).value_counts()
output:
(B, A) 2
(D, C, A) 2
(A) 1
Name: Product, dtype: int64
Alternative using sorted tuples:
df.groupby('ID')['Product'].agg(lambda x: tuple(sorted(x))).value_counts()
output:
(A, B) 2
(A, C, D) 2
(A,) 1
Name: Product, dtype: int64
Or strings:
df.groupby('ID')['Product'].agg(lambda x: ','.join(sorted(x))).value_counts()
output:
A,B 2
A,C,D 2
A 1
Name: Product, dtype: int64
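A runnable sketch of the sorted-tuple variant, with the sample data from the question reconstructed:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 5],
    'Product': list('ABAACDABACD'),
})

# One sorted tuple per ID, then count how often each combination occurs.
counts = (df.groupby('ID')['Product']
            .agg(lambda x: tuple(sorted(x)))
            .value_counts())
print(counts)
```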
What's the easiest way in Pandas to turn this
df = pd.DataFrame({'Class': [1, 2], 'Students': ['A,B,C,D', 'E,A,C']})
df
Class Students
0 1 A,B,C,D
1 2 E,A,C
Into this?
Let's try combinations:
from functools import partial
from itertools import combinations
(df.set_index('Class')['Students']
.str.split(',')
.map(partial(combinations, r=2))
.map(list)
.explode()
.reset_index())
Class Students
0 1 (A, B)
1 1 (A, C)
2 1 (A, D)
3 1 (B, C)
4 1 (B, D)
5 1 (C, D)
6 2 (E, A)
7 2 (E, C)
8 2 (A, C)
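The same result can be sketched as a plain list comprehension over the rows, which sidesteps the map/explode chain (sample frame from the question):

```python
import pandas as pd
from itertools import combinations

df = pd.DataFrame({'Class': [1, 2], 'Students': ['A,B,C,D', 'E,A,C']})

# One (Class, pair) row per 2-combination of each class's students.
out = pd.DataFrame(
    [(c, pair)
     for c, s in zip(df['Class'], df['Students'])
     for pair in combinations(s.split(','), 2)],
    columns=['Class', 'Students'],
)
print(out)
```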
This needs multiple steps with pandas and numpy: split + explode, then a self-merge, np.sort to order each pair, and drop_duplicates. Note the question's column is named Students, and Class must be part of the duplicate key so the same pair can appear in more than one class:
import numpy as np
df['Students'] = df['Students'].str.split(',')
df = df.explode('Students')
df = df.merge(df, on='Class')
df[['Students_x','Students_y']] = np.sort(df[['Students_x','Students_y']].values, axis=1)
df = df.query('Students_x != Students_y').drop_duplicates(['Class','Students_x','Students_y'])
df['Students'] = df[['Students_x','Students_y']].agg(','.join, axis=1)
df
Out[100]:
Class Students_x Students_y Students
1 1 A B A,B
2 1 A C A,C
3 1 A D A,D
6 1 B C B,C
7 1 B D B,D
11 1 C D C,D
17 2 A E A,E
18 2 C E C,E
21 2 A C A,C
I am practicing pandas and Python and I am not so good with for loops. I have a DataFrame like this, let's say df:
Name Value
A [[A,B],[C,D]]
B [[A,B],[D,E]]
C [[D,E],[K,L],[M,L]]
D [[K,L]]
I want to go through each row, find the unique arrays, and count them.
I have tried np.unique(a, return_index=True), but that returns two separate lists and I don't know how to iterate over each array.
Expected result would be:
Value Counts
[A,B] 2
[D,E] 2
[K,L] 2
[C,D] 1
[M,L] 1
Thank you very much.
Use DataFrame.explode, available in pandas 0.25+:
df.explode('Value')['Value'].value_counts()
Output:
[K, L] 2
[A, B] 2
[D, E] 2
[C, D] 1
[M, L] 1
Name: Value, dtype: int64
Use Series.explode with Series.value_counts:
df = df['Value'].explode().value_counts().rename_axis('Value').reset_index(name='Counts')
print(df)
Value Counts
0 [D, E] 2
1 [A, B] 2
2 [K, L] 2
3 [C, D] 1
4 [M, L] 1
Numpy solution:
a, v = np.unique(np.concatenate(df['Value']),axis=0, return_counts=True)
df = pd.DataFrame({'Value':a.tolist(), 'Counts':v})
print(df)
Value Counts
0 [A, B] 2
1 [C, D] 1
2 [D, E] 2
3 [K, L] 2
4 [M, L] 1
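One caveat when reproducing this: value_counts hashes its values, and lists are not hashable, so recent pandas versions raise a TypeError on a column of lists. Mapping each inner list to a tuple first avoids that. A sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Value': [[['A', 'B'], ['C', 'D']],
              [['A', 'B'], ['D', 'E']],
              [['D', 'E'], ['K', 'L'], ['M', 'L']],
              [['K', 'L']]],
})

# explode flattens the lists-of-lists; tuple makes each inner list hashable.
counts = df['Value'].explode().map(tuple).value_counts()
print(counts)
```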
I have the following table:
A B C
1 A A
2 A A.B
3 B B.C
4 A,B A.A,A.B,B.C
Column A is an index (1 through 4). Column B lists the letters that appear in column C before the dot, if there is one. If there is no dot, the prefix is implicit: the entry in (C, 1) = A is the letter after the dot, so this entry stands for A.A.
Column C lists either both letters (before and after the dot) or only the letter after it.
The idea is to split these entries up. Column C should first be split on the comma into separate rows (that works). The tricky part is whenever several letters are possible in B: after the split, B should contain only the one letter that matches the entry in C.
So the result should look like this:
A B C
1 A A
2 A B
3 B C
4 A A
4 B B
4 B C
Can someone help me ensure that column B contains the correct (i.e., matching) letter for each entry in column C?
Thanks and kind regards.
First, stack your dataframe to get your combinations:
out = (
df.set_index(['A', 'B']).C
.str.split(',').apply(pd.Series)
.stack().reset_index([0, 1]).drop(columns='B')
)
A 0
0 1 A
1 2 A.B
2 3 B.C
3 4 A.A
4 4 A.B
5 4 B.C
Then replace single entries with their explicit counterpart (A becomes A.A), split on the dot, and apply pd.Series:
(out.set_index('A')[0].str
.replace(r'^([A-Z])$', r'\1.\1', regex=True)
.str.split('.').apply(pd.Series)
.reset_index()
).rename(columns={0: 'B', 1: 'C'})
Output:
A B C
0 1 A A
1 2 A B
2 3 B C
3 4 A A
4 4 A B
5 4 B C
With a comprehension:
def m0(x):
"""Take a string, return a dictionary split on '.' or a self mapping"""
if '.' in x:
return dict([x.split('.')])
else:
return {x: x}
def m1(s):
"""split string on ',' then do the dictionary thing in m0"""
return [*map(m0, s.split(','))]
pd.DataFrame([
(a, b, m[b])
for a, B, C in df.itertuples(index=False)
for b in B.split(',')
for m in m1(C) if b in m
], df.index.repeat(df.C.str.count(',') + 1), df.columns)
A B C
0 1 A A
1 2 A B
2 3 B C
3 4 A A
3 4 A B
3 4 B C
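Both answers can also be sketched with explode plus vectorized string methods: make the implicit prefix explicit (A becomes A.A), then split on the dot (sample frame from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': ['A', 'A', 'B', 'A,B'],
    'C': ['A', 'A.B', 'B.C', 'A.A,A.B,B.C'],
})

# Split C on commas into one row per entry.
out = df[['A']].assign(C=df['C'].str.split(',')).explode('C').reset_index(drop=True)

# Entries without a dot stand for the doubled form (A -> A.A).
out['C'] = out['C'].where(out['C'].str.contains('.', regex=False),
                          out['C'] + '.' + out['C'])

# The part before the dot is the new B, the part after is the new C.
out[['B', 'C']] = out['C'].str.split('.', expand=True)
out = out[['A', 'B', 'C']]
print(out)
```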