Sort index of different length tuples in pandas - python

>>> df = pd.DataFrame(index=[('B',),('A',),('B','C',),('A','B',),('A','B','C')],data=[1,2,3,4,5],columns=['count'])
>>> df
count
(B,) 1
(A,) 2
(B, C) 3
(A, B) 4
(A, B, C) 5
I would like to sort by the tuple index so that the length-1 tuples come first (sorted among themselves), then the length-2 tuples, and so on. The expected output is this:
count
(A,) 2
(B,) 1
(A, B) 4
(B, C) 3
(A, B, C) 5
I have tried sort_index, but that compares the tuples element by element, starting with the first member, and disregards the length:
>>> df.sort_index()
count
(A,) 2
(A, B) 4
(A, B, C) 5
(B,) 1
(B, C) 3

You can reindex with the index labels sorted by a (length, tuple) key:
print (df.reindex(sorted(df.index,key=lambda d: (len(d), d))))
count
(A,) 2
(B,) 1
(A, B) 4
(B, C) 3
(A, B, C) 5
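The key in the reindex answer sorts by tuple length first and by the tuple itself second. A minimal sketch of just that key on plain Python tuples, with no pandas involved (the variable names are only for illustration), next to the default ordering for comparison:

```python
# Index labels from the question, as plain tuples.
labels = [('B',), ('A',), ('B', 'C'), ('A', 'B'), ('A', 'B', 'C')]

# Default tuple ordering is lexicographic: a shorter tuple sorts
# just before any longer tuple that starts with it.
default_order = sorted(labels)

# Sort by length first, then lexicographically within each length.
length_first = sorted(labels, key=lambda t: (len(t), t))

print(default_order)
print(length_first)
```

The second list is the order that is passed to df.reindex above.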

IIUC, you could do this by creating a sortkey:
(df.assign(sortkey=df.index.str.len())
.rename_axis('index')
.sort_values(['sortkey', 'index']))
Output:
count sortkey
index
(A,) 2 1
(B,) 1 1
(A, B) 4 2
(B, C) 3 2
(A, B, C) 5 3
First, use the .str accessor to get the length of each tuple in the index and assign it to a temporary column, sortkey. Then name the index with rename_axis, so that sort_values can sort by a combination of column labels and the index name. If the helper column is not needed afterwards, remove it with .drop(columns='sortkey').

Related

Creating a new column in a data frame based on row values

I want to be able to get the following result without using a for loop or df.apply()
The result for each row should be the row values up until the group index.
group 0 1 2 3 4 5 6 7
0 2 a b c d e f g h
1 5 s t u v w x y z
2 7 a b c d e f g h
group result
0 2 [a, b, c]
1 5 [s, t, u, v, w, x]
2 7 [a, b, c, d, e, f, g, h]
Use DataFrame.melt, keep the rows where the variable column does not exceed the group column with DataFrame.query, and finally aggregate the values into lists:
s = (df.melt('group', ignore_index=False)
.astype({'variable':int})
.query("group >= variable")
.groupby(level=0)['value']
.agg(list))
df = df[['group']].join(s.rename('result'))
print (df)
group result
0 2 [a, b, c]
1 5 [s, t, u, v, w, x]
2 7 [a, b, c, d, e, f, g, h]
Or use apply:
df = (df.set_index('group')
.rename(columns=int)
.apply(lambda x: list(x[x.index <= x.name]), axis=1)
.reset_index(name='result'))
print (df)
group result
0 2 [a, b, c]
1 5 [s, t, u, v, w, x]
2 7 [a, b, c, d, e, f, g, h]
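To pin down the intended semantics (keep columns 0 through group, inclusive), here is a plain-Python sketch on the sample rows. It is only meant to illustrate what both answers compute, not as a replacement for the vectorised melt approach; the name rows is made up for this example.

```python
# (group, row values) pairs taken from the sample data in the question.
rows = [
    (2, list("abcdefgh")),
    (5, list("stuvwxyz")),
    (7, list("abcdefgh")),
]

# Keep the values in columns 0..group, inclusive, hence the "+ 1".
result = [(group, values[: group + 1]) for group, values in rows]

for group, kept in result:
    print(group, kept)
```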

Table with most frequent combinations with pandas python

The dataset
ID Product
1 A
1 B
2 A
3 A
3 C
3 D
4 A
4 B
5 A
5 C
5 D
.....
The goal is to find the most frequent combinations of products per ID, regardless of how many products a combination contains.
The expected result here is:
[A, C, D] 2
[A, B] 2
[A, C] 2
......
Something like this, but with a working value in place of the **?**:
import itertools
(df.groupby('ID').Product.agg(lambda x: list(set(itertools.combinations(x,**?**))))
.explode().str.join('-').value_counts())
IIUC, groupby ID, aggregate to frozenset and count the occurrences with value_counts:
df.groupby('ID')['Product'].agg(frozenset).value_counts()
output:
(B, A) 2
(D, C, A) 2
(A) 1
Name: Product, dtype: int64
Alternative using sorted tuples:
df.groupby('ID')['Product'].agg(lambda x: tuple(sorted(x))).value_counts()
output:
(A, B) 2
(A, C, D) 2
(A,) 1
Name: Product, dtype: int64
Or strings:
df.groupby('ID')['Product'].agg(lambda x: ','.join(sorted(x))).value_counts()
output:
A,B 2
A,C,D 2
A 1
Name: Product, dtype: int64
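The answers above count each ID's complete set of products. The expected result in the question also lists [A, C], which suggests that sub-combinations should be counted as well. A hedged sketch of both readings in plain Python, using only the rows visible in the question (the data there is truncated) and assuming combinations of size 2 or more are wanted:

```python
from collections import Counter, defaultdict
from itertools import combinations

# Only the (ID, Product) rows shown in the question.
records = [
    (1, 'A'), (1, 'B'), (2, 'A'),
    (3, 'A'), (3, 'C'), (3, 'D'),
    (4, 'A'), (4, 'B'),
    (5, 'A'), (5, 'C'), (5, 'D'),
]

# Group products by ID.
baskets = defaultdict(set)
for id_, product in records:
    baskets[id_].add(product)

# Reading 1: count whole baskets (what the groupby answers do).
basket_counts = Counter(tuple(sorted(p)) for p in baskets.values())

# Reading 2 (assumption): count every sub-combination of size >= 2.
combo_counts = Counter(
    combo
    for products in baskets.values()
    for r in range(2, len(products) + 1)
    for combo in combinations(sorted(products), r)
)

print(basket_counts.most_common())
print(combo_counts.most_common())
```

Sorting each basket before calling combinations keeps (A, C) and (C, A) from being counted as different keys.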

Generate pairs from columns in Pandas

What's the easiest way in Pandas to turn this
df = pd.DataFrame({'Class': [1, 2], 'Students': ['A,B,C,D', 'E,A,C']})
df
Class Students
0 1 A,B,C,D
1 2 E,A,C
Into this?
Let's try combinations:
from functools import partial
from itertools import combinations
(df.set_index('Class')['Students']
.str.split(',')
.map(partial(combinations, r=2))
.map(list)
.explode()
.reset_index())
Class Students
0 1 (A, B)
1 1 (A, C)
2 1 (A, D)
3 1 (B, C)
4 1 (B, D)
5 1 (C, D)
6 2 (E, A)
7 2 (E, C)
8 2 (A, C)
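The same split-then-combine logic can be sketched without pandas, which may make it easier to see what each step of the chain produces. The dictionary classes below is an illustrative stand-in for the DataFrame.

```python
from itertools import combinations

# Class -> comma-separated students, mirroring the DataFrame in the question.
classes = {1: 'A,B,C,D', 2: 'E,A,C'}

# Split each string and emit every unordered pair within the same class.
pairs = [
    (cls, pair)
    for cls, students in classes.items()
    for pair in combinations(students.split(','), 2)
]

for cls, pair in pairs:
    print(cls, pair)
```

combinations yields pairs in input order, so class 2 gives (E, A) rather than (A, E); sort the split list first if a canonical order is wanted.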
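With pandas only this takes several steps: split and explode, a self-merge on Class, then drop_duplicates. Note that drop_duplicates on the two name columns alone also drops a pair that has already appeared in another class (A,C for class 2 in the output below); add 'Class' to the subset to keep the pairs per class.
import numpy as np
# the column in the question is named 'Students'; rename it while splitting
df['Student']=df.pop('Students').str.split(',')
df=df.explode('Student')
df=df.merge(df,on='Class')
# order each pair alphabetically so that (B, A) and (A, B) compare equal
df[['Student_x','Student_y']]=np.sort(df[['Student_x','Student_y']].values, axis=1)
df=df.query('Student_x!=Student_y').drop_duplicates(['Student_x','Student_y'])
df['Student']=df[['Student_x','Student_y']].agg(','.join,axis=1)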
df
Out[100]:
Class Student_x Student_y Student
1 1 A B A,B
2 1 A C A,C
3 1 A D A,D
6 1 B C B,C
7 1 B D B,D
11 1 C D C,D
17 2 A E A,E
18 2 C E C,E

Counting each unique array of an array in each row of a column in a data frame

I am practicing pandas and python and I am not so good at for loops. I have a data frame as below: let's say this is df:
Name Value
A [[A,B],[C,D]]
B [[A,B],[D,E]]
C [[D,E],[K,L],[M,L]]
D [[K,L]]
I want to go through each row and find unique arrays and count them.
I have tried np.unique(a, return_index=True), but it returns two separate lists, and I don't know how to go through each array.
Expected result would be:
Value Counts
[A,B] 2
[D,E] 2
[K,L] 2
[C,D] 1
[M,L] 1
Thank you very much.
Use DataFrame.explode (pandas 0.25+):
df.explode('Value')['Value'].value_counts()
Output:
[K, L] 2
[A, B] 2
[D, E] 2
[C, D] 1
[M, L] 1
Name: Value, dtype: int64
Use Series.explode with Series.value_counts:
df = df['Value'].explode().value_counts().rename_axis('Value').reset_index(name='Counts')
print (df)
Value Counts
0 [D, E] 2
1 [A, B] 2
2 [K, L] 2
3 [C, D] 1
4 [M, L] 1
Numpy solution:
a, v = np.unique(np.concatenate(df['Value']),axis=0, return_counts=True)
df = pd.DataFrame({'Value':a.tolist(), 'Counts':v})
print (df)
Value Counts
0 [A, B] 2
1 [C, D] 1
2 [D, E] 2
3 [K, L] 2
4 [M, L] 1
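Since the question mentions trouble with loops, here is a plain-Python sketch of the same count. Lists are not hashable, so each inner list is converted to a tuple before counting. The name values stands in for the Value column.

```python
from collections import Counter

# One entry per row of the 'Value' column in the question.
values = [
    [['A', 'B'], ['C', 'D']],
    [['A', 'B'], ['D', 'E']],
    [['D', 'E'], ['K', 'L'], ['M', 'L']],
    [['K', 'L']],
]

# Flatten one level and count each inner list as a hashable tuple.
counts = Counter(tuple(inner) for row in values for inner in row)

for pair, n in counts.most_common():
    print(list(pair), n)
```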

Pandas: Conditional Split on Column

I have the following table:
A B C
1 A A
2 A A.B
3 B B.C
4 A,B A.A,A.B,B.C
Column A is an index (1 through 4). Column B lists the letters that appear before the dot in column C. If an entry in C has no dot, the prefix is implicit: the entry A in row 1 of column C stands for A.A.
So column C lists either both letters (before and after the dot) or only the letter after the dot.
The idea is to split these dotted entries and lists up. Column C should first be split on the comma into separate rows (that part works). The problem arises when B contains several possible letters: after splitting, B should contain only one letter, namely the one that matches the entry in column C.
So the result should look like this:
A B C
1 A A
2 A B
3 B C
4 A A
4 A B
4 B C
Can someone help me with ensuring, that column B contains the correct (i.e., fitting) information, which is denoted in column C?
Thanks and kind regards.
First, stack your dataframe to get your combinations:
out = (
df.set_index(['A', 'B']).C
.str.split(',').apply(pd.Series)
.stack().reset_index([0,1]).drop(columns='B')
)
A 0
0 1 A
1 2 A.B
2 3 B.C
3 4 A.A
4 4 A.B
5 4 B.C
Then replace single entries with their counterpart and apply pd.Series:
(out.set_index('A')[0].str
.replace(r'^([A-Z])$', r'\1.\1', regex=True)
.str.split('.').apply(pd.Series)
.reset_index()
).rename(columns={0: 'B', 1: 'C'})
Output:
A B C
0 1 A A
1 2 A B
2 3 B C
3 4 A A
4 4 A B
5 4 B C
With a comprehension
def m0(x):
    """Take a string, return a dictionary split on '.' or a self mapping"""
    if '.' in x:
        return dict([x.split('.')])
    else:
        return {x: x}

def m1(s):
    """split string on ',' then do the dictionary thing in m0"""
    return [*map(m0, s.split(','))]

pd.DataFrame([
    (a, b, m[b])
    for a, B, C in df.itertuples(index=False)
    for b in B.split(',')
    for m in m1(C) if b in m
], df.index.repeat(df.C.str.count(',') + 1), df.columns)
A B C
0 1 A A
1 2 A B
2 3 B C
3 4 A A
3 4 A B
3 4 B C
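The matching rule can also be written as an explicit loop, which may be easier to follow than the nested comprehension. This is a sketch on plain tuples, assuming (as the question states) that an entry without a dot stands for the same letter on both sides; the names rows and out are illustrative.

```python
# (A, B, C) rows from the table in the question.
rows = [
    (1, 'A', 'A'),
    (2, 'A', 'A.B'),
    (3, 'B', 'B.C'),
    (4, 'A,B', 'A.A,A.B,B.C'),
]

out = []
for a, b_list, c_list in rows:
    allowed = set(b_list.split(','))           # letters permitted in column B
    for item in c_list.split(','):
        prefix, _, suffix = item.partition('.')
        if not suffix:                         # no dot: 'A' stands for 'A.A'
            suffix = prefix
        if prefix in allowed:                  # keep only prefixes listed in B
            out.append((a, prefix, suffix))

for row in out:
    print(*row)
```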
