pandas: Aggregate on one column and count based on two columns

Suppose I have the following dataframe:
fid prefix target_text
0 f1 p1 t1
1 f1 p1 t2
2 f1 p2 t1
3 f1 p2 t2
4 f1 p3 t1
5 f1 p3 t3
6 f1 p3 t4
7 f2 p1 t1
8 f2 p1 t2
9 f2 p2 t2
10 f2 p2 t1
If I group them by fid and prefix and count unique target_text I have:
>>> num_targets = df.groupby(['fid','prefix'])['target_text'].transform('nunique')
0 2
1 2
2 2
3 2
4 3
5 3
6 3
7 2
8 2
9 2
10 2
Now I want to group by 'fid' only and, next to each fid, print the number of distinct [prefix, target_text] pairs.
I expect:
num_targets
f1 7
f2 4
However, if I group the dataframe by fid, how can I count the distinct [prefix, target_text] pairs?

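To count unique combinations of both columns per fid, either concatenate the two columns and call nunique per group, or drop duplicate rows and count the rows left for each fid: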
s = (df['target_text'] + '_' + df['prefix']).groupby(df['fid']).nunique()
print (s)
fid
f1 7
f2 4
dtype: int64
s = df.drop_duplicates(['fid','prefix','target_text'])['fid'].value_counts()
print (s)
f1 7
f2 4
Name: fid, dtype: int64
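Concatenating with '_' can merge two different pairs into one string if the values themselves contain the separator. A minimal sketch of a variant that avoids building strings is below; it rebuilds the question's dataframe, and the variable name num_targets is borrowed from the question purely for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'fid': ['f1'] * 7 + ['f2'] * 4,
    'prefix': ['p1', 'p1', 'p2', 'p2', 'p3', 'p3', 'p3', 'p1', 'p1', 'p2', 'p2'],
    'target_text': ['t1', 't2', 't1', 't2', 't1', 't3', 't4', 't1', 't2', 't2', 't1'],
})

# First groupby: one row per distinct (fid, prefix, target_text) combination.
# Second groupby: count how many of those combinations each fid has.
num_targets = (
    df.groupby(['fid', 'prefix', 'target_text'])
      .size()
      .groupby(level='fid')
      .size()
      .rename('num_targets')
)
print(num_targets)
```

For this data the result should agree with the two approaches above (7 for f1, 4 for f2).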

Related

table pivot with pandas

I would like to convert the table
Time  Params  Values  Lot ID
t1    A       3       a1
t1    B       4       a1
t1    C       7       a1
t1    D       2       a1
t2    A       2       a1
t2    B       5       a1
t2    C       9       a1
t2    D       3       a1
t3    A       2       a2
t3    B       5       a2
t3    C       9       a2
t3    D       3       a2
to
Time  A  B  C  D  Lot ID
t1    3  4  7  2  a1
t2    2  5  9  3  a1
t3    3  4  7  2  a2
I tried
df.pivot(index = 'Time', columns = 'Params', values = 'Values')
but the result does not include Lot ID.
How can I add it to the table as an additional column?
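The index argument of pivot can take a list of column names (pandas 1.1.0 or later):
df.pivot(index = ['Time', 'Lot ID'], columns = 'Params', values = 'Values')
That keeps Lot ID as a second level of the row index; call reset_index() on the result if you want it as a regular column.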
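A minimal runnable sketch of this, using the input table from the question. The name wide and the final reset_index / column clean-up are illustrative additions, not part of the original answer.

```python
import pandas as pd

# Long-format input copied from the question
df = pd.DataFrame({
    'Time': ['t1'] * 4 + ['t2'] * 4 + ['t3'] * 4,
    'Params': list('ABCD') * 3,
    'Values': [3, 4, 7, 2, 2, 5, 9, 3, 2, 5, 9, 3],
    'Lot ID': ['a1'] * 8 + ['a2'] * 4,
})

# Pivot on both Time and Lot ID so Lot ID survives the reshape,
# then turn the two index levels back into ordinary columns.
wide = df.pivot(index=['Time', 'Lot ID'], columns='Params', values='Values').reset_index()
wide.columns.name = None  # drop the leftover 'Params' label on the column axis
print(wide)
```

Lot ID ends up right after Time rather than last; reordering with something like wide[['Time', 'A', 'B', 'C', 'D', 'Lot ID']] would match the layout asked for.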

How to create layered pandas data frame using Python?

I want to create a data frame as below.
C1 C2 C3 C4
1 1 1 1
1 2 2 2
1 2 2 3
1 2 3 4
1 2 3 5
2 3 4 6
3 4 5 7
The values in column C4 should be unique. Each C4 value should belong to exactly one C3 value, each C3 value to exactly one C2 value, and each C2 value to exactly one C1 value. The column values should satisfy C1 < C2 < C3 < C4. The values may be random.
I used below sample Python code.
import pandas as pd
import numpy as np
C1 = [1, 2]
C2 = [1, 2, 3,4]
C3 = [1, 2, 3, 4,5]
C4 = [1,2,3,4,5,6,7]
Z = [C1,C2,C3,C4]
n = max(len(x) for x in Z)
a = [np.hstack((np.random.choice(x, n - len(x)), x)) for x in Z]
df = pd.DataFrame(a, index=['C1', 'C2', 'C3','C4']).T.sample(frac=1)
print (df)
Below is my output.
C1 C2 C3 C4
1 4 2 2
1 3 4 6
2 1 2 4
1 2 5 1
2 4 5 7
1 2 3 5
2 2 1 3
But this output does not follow my logic. The value 2 in column C3 belongs to both 1 and 4 in column C2, and the value 2 in column C2 belongs to both 1 and 2 in column C1. Please guide me to get output that follows my logic. Thanks.
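One possible approach, offered only as a sketch: instead of sampling each column independently, build the table bottom-up by giving every value exactly one randomly chosen parent in the column to its left. This assumes "belongs to" means a strict one-parent hierarchy, and it reads the ordering requirement as non-decreasing across a row (as in the desired table, which contains the row 1 1 1 1). The helper random_parents and the fixed seed are hypothetical choices made for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable

# Distinct values per column, copied from the question
levels = {
    'C1': [1, 2],
    'C2': [1, 2, 3, 4],
    'C3': [1, 2, 3, 4, 5],
    'C4': [1, 2, 3, 4, 5, 6, 7],
}

def random_parents(children, parents):
    """Map every child to exactly one parent; every parent gets at least one child."""
    extra = rng.choice(parents, size=len(children) - len(parents))
    # Sorting keeps parent <= child when both lists are 1..n, as they are here
    assigned = np.sort(np.concatenate([parents, extra])).tolist()
    return dict(zip(children, assigned))

# Start from the unique C4 values and derive each column to the left
df = pd.DataFrame({'C4': levels['C4']})
for child, parent in [('C4', 'C3'), ('C3', 'C2'), ('C2', 'C1')]:
    df[parent] = df[child].map(random_parents(levels[child], levels[parent]))

df = df[['C1', 'C2', 'C3', 'C4']]
print(df)
```

Because each column is computed from the one to its right through a dictionary, a value can never end up under two different parents, which is the property the independent sampling in the attempt above does not give.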

Pandas sort by subtotal of each group

I am still new to pandas. Is there a way to sort a df by the subtotal of each group?
Area Unit Count
A A1 5
A A2 2
B B1 10
B B2 1
B B3 3
C C1 10
So I want to sort the rows by the subtotal of each Area, in descending order. The subtotals are A = 7, B = 14, C = 10.
The sorted result should look like this:
Area Unit Count
B B1 10
B B2 1
B B3 3
C C1 10
A A1 5
A A2 2
*Note that the order of rows within each group must not be affected by the sort (for example, B2 stays before B3 even though B3 > B2).
Create a helper column 'sorter' that holds the sum of Count per Area, sort the dataframe by it, then drop the helper. A stable sort (kind='mergesort') keeps the original row order inside each group:
df['sorter'] = df.groupby("Area").Count.transform('sum')
df.sort_values('sorter',ascending=False,kind='mergesort').reset_index(drop=True).drop('sorter',axis=1)
Area Unit Count
0 B B1 10
1 B B2 1
2 B B3 3
3 C C1 10
4 A A1 5
5 A A2 2
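A variant that avoids adding a column to df, shown as a sketch on the question's data. It sorts positions with NumPy's stable argsort on the negated subtotal, so rows keep their original order within each Area. The names subtotal, order and result are illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Area': ['A', 'A', 'B', 'B', 'B', 'C'],
    'Unit': ['A1', 'A2', 'B1', 'B2', 'B3', 'C1'],
    'Count': [5, 2, 10, 1, 3, 10],
})

# Per-row subtotal of the row's Area (7 for A, 14 for B, 10 for C)
subtotal = df.groupby('Area')['Count'].transform('sum')

# Negate so the largest subtotal comes first; a stable sort leaves
# rows with equal subtotals in their original relative order.
order = (-subtotal).to_numpy().argsort(kind='stable')
result = df.iloc[order].reset_index(drop=True)
print(result)
```

If two different areas happened to share the same subtotal and their rows were not adjacent in the input, their rows would stay interleaved; sorting by ['sorter', 'Area'] style keys would be needed in that case.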

Selecting rows based on multiple column values in pandas dataframe MultiIndex

I have a pandas DataFrame with a MultiIndex:
f1 f2 value
2 2 4
2 3 5
3 3 4
4 1 3
4 4 3
I would like to have an output where f1 == f2:
f1 f2 value
2 2 4
3 3 4
4 4 3
Can you suggest an elegant way to select those rows?
Use boolean indexing if f1, f2 are columns:
df = df[df.f1 == df.f2]
print (df)
f1 f2 value
0 2 2 4
2 3 3 4
4 4 4 3
If f1 and f2 are levels of a MultiIndex, use Index.get_level_values:
df = df[df.index.get_level_values('f1') == df.index.get_level_values('f2')]
Or use query, which works whether f1 and f2 are column names or MultiIndex level names:
df = df.query('f1 == f2')
print (df)
f1 f2 value
0 2 2 4
2 3 3 4
4 4 4 3
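For reference, a self-contained sketch that runs the three variants on the question's data, once with f1 and f2 as columns and once as MultiIndex levels. The variable names (mi, by_columns, by_levels, by_query) are made up for this example.

```python
import pandas as pd

# Data from the question, with f1 and f2 as ordinary columns
df = pd.DataFrame({
    'f1': [2, 2, 3, 4, 4],
    'f2': [2, 3, 3, 1, 4],
    'value': [4, 5, 4, 3, 3],
})

# Case 1: f1 and f2 are columns -> plain boolean indexing
by_columns = df[df['f1'] == df['f2']]

# Case 2: f1 and f2 are MultiIndex levels -> compare the level values
mi = df.set_index(['f1', 'f2'])
mask = mi.index.get_level_values('f1') == mi.index.get_level_values('f2')
by_levels = mi[mask]

# Case 3: query resolves names against columns and index levels alike
by_query = mi.query('f1 == f2')

print(by_columns)
print(by_levels)
```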

How to count the following number of rows in pandas (new)

This is related to the question below. I would like to count the number of rows that follow each row starting with "a".
Thanks to the answer there, I could handle my data, but I ran into some trouble and exceptions.
Related question: How to count the number of following rows in pandas
My data:
A B
1 a0
2 a1
3 b1
4 a0
5 b2
6 a2
7 a2
First, I would like to split df at every row where B starts with "a" (startswith("a")):
df1
A B
1 a0
df2
A B
2 a1
3 b1
df3
A B
4 a0
5 b2
df4
A B
6 a2
df5
A B
7 a2
Then I would like to count the rows of each piece:
"a" number
a0 1
a1 2
a0 2
a2 1
a2 1
How can this be done?
I would be grateful if someone could tell me how to handle this kind of problem.
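You can aggregate by a custom grouping Series created with cumsum. Every row that starts with "a" increments the counter, so that row and the rows after it share one group id: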
print (df.B.str.startswith("a").cumsum())
0 1
1 2
2 2
3 3
4 3
5 4
6 5
Name: B, dtype: int32
df1 = df.B.groupby(df.B.str.startswith("a").cumsum()).agg(['first', 'size'])
df1.columns =['"A"','number']
df1.index.name = None
print (df1)
"A" number
1 a0 1
2 a1 2
3 a0 2
4 a2 1
5 a2 1
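The same idea as a self-contained sketch that rebuilds the question's data first. The names group_id and counts are illustrative, and the renaming of columns is left out to keep the core steps visible.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    'A': range(1, 8),
    'B': ['a0', 'a1', 'b1', 'a0', 'b2', 'a2', 'a2'],
})

# True where B starts with "a"; the running sum gives a new id at each such row,
# so every "a" row opens a group that also contains the non-"a" rows after it.
group_id = df['B'].str.startswith('a').cumsum()

# For each group: the label of its first row and the number of rows in it
counts = df.groupby(group_id)['B'].agg(['first', 'size'])
print(counts)
```

Note that rows appearing before the first "a" row, if any, would all land in a group with id 0 whose first label is not an "a" value.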
