How do I get the maximum count grouped by element in a pandas data frame - python

I have data grouped by two columns [CustomerID, cluster] like this:
CustomerIDClustered.groupby(['CustomerID','cluster']).count()
                    Count
CustomerID cluster
1893       0            1
           1            2
           2            5
           3            1
2304       2            3
           3            1
2655       0            1
           2            1
2850       1            1
           2            1
           3            1
3648       0            1
I need to assign the most frequent cluster to each customer ID.
For Example:
1893 -> 2 (cluster 2 appears more often than the other clusters for this customer)
2304->2
2655->1

Use sort_values, reset_index and finally drop_duplicates:
df = df.sort_values('Count', ascending=False).reset_index().drop_duplicates('CustomerID')
A similar solution, filtering instead by the first level of the MultiIndex:
df = df.sort_values('Count', ascending=False)
df = df[~df.index.get_level_values(0).duplicated()].reset_index()
print (df)
CustomerID cluster Count
0 1893 2 5
1 2304 2 3
2 2655 0 1
3 2850 1 1
4 3648 0 1
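If only the CustomerID -> most frequent cluster mapping is needed, an idxmax-based variant also works. A minimal sketch, assuming the counts are rebuilt from the raw CustomerIDClustered frame used in the question:
counts = CustomerIDClustered.groupby(['CustomerID', 'cluster']).size()
# idxmax returns the (CustomerID, cluster) label of the largest count per customer
best = counts.groupby(level='CustomerID').idxmax()
most_frequent_cluster = best.map(lambda idx: idx[1])
For ties, idxmax keeps the first cluster in index order.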


pandas number of items in one column per value in another column

I have two dataframes. Say, for example, frame 1 is the student info:
student_id course
1 a
2 b
3 c
4 a
5 f
6 f
Frame 2 records each interaction the student has with a program:
student_id day number_of_clicks
1 4 60
1 5 34
1 7 87
2 3 33
2 4 29
2 8 213
2 9 46
3 2 103
I am trying to add the information from frame 2 to frame 1, i.e. for each student I would like to know the number of different days they accessed the database and the sum of all their clicks on those days, e.g.:
student_id course no_days total_clicks
1 a 3 181
2 b 4 321
3 c 1 103
4 a 0 0
5 f 0 0
6 f 0 0
I've tried to do this with groupby, but I couldn't add the information back into frame 1, or figure out how to sum the number of clicks. Any ideas?
First we aggregate df2 into the desired information using GroupBy.agg, then we merge that information into df1:
agg = df2.groupby('student_id').agg(
    no_days=('day', 'size'),
    total_clicks=('number_of_clicks', 'sum')
)
df1 = df1.merge(agg, on='student_id', how='left').fillna(0)
student_id course no_days total_clicks
0 1 a 3.0 181.0
1 2 b 4.0 321.0
2 3 c 1.0 103.0
3 4 a 0.0 0.0
4 5 f 0.0 0.0
5 6 f 0.0 0.0
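Note that no_days and total_clicks come back as floats because the left join introduces NaN for students with no interactions. If integer columns are preferred, a small follow-up cast works; a sketch reusing the agg frame from above:
df1 = df1.merge(agg, on='student_id', how='left').fillna(0)
# cast the filled aggregate columns back to integers
df1[['no_days', 'total_clicks']] = df1[['no_days', 'total_clicks']].astype(int)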
Or, if you like one-liners, here is the same method as above in a single expression, in a more SQL-like style:
df1.merge(
    df2.groupby('student_id').agg(
        no_days=('day', 'size'),
        total_clicks=('number_of_clicks', 'sum')
    ),
    on='student_id',
    how='left'
).fillna(0)
Alternatively, use merge, fill the null values with fillna, then aggregate using groupby.agg:
import numpy as np

df = df1.merge(df2, how='left').fillna(0, downcast='infer')\
        .groupby(['student_id', 'course'], as_index=False)\
        .agg({'day': np.count_nonzero, 'number_of_clicks': np.sum})
print(df)
student_id course day number_of_clicks
0 1 a 3 181
1 2 b 4 321
2 3 c 1 103
3 4 a 0 0
4 5 f 0 0
5 6 f 0 0
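If the no_days / total_clicks names from the expected output are wanted here too, a final rename is enough; a small sketch on top of the result above:
df = df.rename(columns={'day': 'no_days', 'number_of_clicks': 'total_clicks'})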

how to utilize Pandas aggregate functions on this DataFrame?

This is the table:
order_id product_id reordered department_id
2 33120 1 16
2 28985 1 4
2 9327 0 13
2 45918 1 13
3 17668 1 16
3 46667 1 4
3 17461 1 12
3 32665 1 3
4 46842 0 3
I want to group by department_id, counting the number of orders that come from that department, as well as the number of orders from that department where reordered == 0. The resulting table would look like this:
department_id number_of_orders number_of_reordered_0
3 2 1
4 2 0
12 1 0
13 2 1
16 2 0
I know this can be done in SQL (I forget what the query for that would look like as well, if anyone can refresh my memory on that, that'd be great too). But what are the Pandas functions to make that work?
I know that it starts with df.groupby('department_id').sum(). Not sure how to flesh out the rest of the line.
Use GroupBy.agg with DataFrameGroupBy.size and a lambda function that compares values with Series.eq and counts them via the sum of the True values (True values are treated as 1):
df1 = (df.groupby('department_id')['reordered']
         .agg([('number_of_orders', 'size'),
               ('number_of_reordered_0', lambda x: x.eq(0).sum())])
         .reset_index())
print (df1)
department_id number_of_orders number_of_reordered_0
0 3 2 1
1 4 2 0
2 12 1 0
3 13 2 1
4 16 2 0
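On pandas 0.25 or newer, the same result can be written with named aggregation, which some find easier to read than the list-of-tuples syntax; a minimal sketch:
df1 = (df.groupby('department_id')['reordered']
         .agg(number_of_orders='size',
              number_of_reordered_0=lambda x: x.eq(0).sum())
         .reset_index())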
If the values are only 1 and 0, it is possible to use sum and then subtract:
df1 = (df.groupby('department_id')['reordered']
         .agg([('number_of_orders', 'size'), ('number_of_reordered_0', 'sum')])
         .reset_index())
df1['number_of_reordered_0'] = df1['number_of_orders'] - df1['number_of_reordered_0']
print (df1)
department_id number_of_orders number_of_reordered_0
0 3 2 1
1 4 2 0
2 12 1 0
3 13 2 1
4 16 2 0
In SQL it would be a simple aggregation:
select department_id, count(*) as number_of_orders,
       sum(case when reordered = 0 then 1 else 0 end) as number_of_reordered_0
from table_name
group by department_id

How do I efficiently filter a data frame obtained by a two-column groupby operation to include just the max and min values of the second index?

I have a data frame df that was obtained by performing a two-column groupby operation:
df = data.groupby(['letters', 'syllables']).size()
Here is the output of the first 11 rows of df:
                      0
letters syllables
1       1            25
        3             1
2       1           188
        2            44
        3             1
        4             1
3       1          1304
        2           189
        3            89
        4             2
        5             3
I would like to filter df so that for each index in letters, only the max and min indices of syllables are shown, giving the following output:
                      0
letters syllables
1       1            25
        3             1
2       1           188
        4             1
3       1          1304
        5             3
Even better would be to create a data frame like this:
                                0
letters statistic syllables
1       min       1            25
        max       3             1
2       min       1           188
        max       4             1
3       min       1          1304
        max       5             3
The full data frame has 120 rows. I know I could do this with a loop, but I am trying to understand pandas operations better and would like to know how to do this more efficiently.
The sample data above can be imported from a csv file into a multi-level index data frame using the following:
df = pd.read_csv('data.csv', index_col=[0,1])
Edit: Here is the output of the code suggested by Erfan:
df = data.groupby(['letters', 'syllables']).agg({'letters' : 'size', 'syllables' : ['min', 'max']})
Output:
                  letters syllables
                     size       min max
letters syllables
1       1              25         1   1
        3               1         3   3
2       1             188         1   1
        2              44         2   2
        3               1         3   3
        4               1         4   4
3       1            1304         1   1
        2             189         2   2
        3              89         3   3
        4               2         4   4
        5               3         5   5
You can get the min and max rows separately, then concat them back together:
s = data.groupby(['letters', 'syllables']).size().sort_index()
yourdf = pd.concat([s.groupby(level=0).head(1), s.groupby(level=0).tail(1)],
                   keys=['min', 'max']).swaplevel(i=0, j=1).sort_index()
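An alternative that selects the min and max syllables rows directly, without sorting and concatenating, is to build a boolean mask per letters group. A sketch, reusing the grouped Series from the question:
s = data.groupby(['letters', 'syllables']).size()
syl = pd.Series(s.index.get_level_values('syllables'), index=s.index)
# keep rows whose syllables value is the min or max within its letters group
keep = syl.groupby(level='letters').transform(lambda x: (x == x.min()) | (x == x.max()))
result = s[keep]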

Pandas join DataFrame and Series over a column

I have a Pandas DataFrame df that stores a mapping between a label and an integer, and a Pandas Series s that contains a sequence of labels:
print(df)
label id
0 AAAAAAAAA 0
1 BBBBBBBBB 1
2 CCCCCCCCC 2
3 DDDDDDDDD 3
4 EEEEEEEEE 4
print(s)
0 AAAAAAAAA
1 BBBBBBBBB
2 CCCCCCCCC
3 CCCCCCCCC
4 EEEEEEEEE
5 EEEEEEEEE
6 DDDDDDDDD
I want to join this DataFrame and this Series to get the sequence of integers corresponding to my sequence s.
Here is the expected result for my example:
print(df.join(s)["id"])
0 0
1 1
2 2
3 2
4 4
5 4
6 3
Use Series.map with a Series built from df:
print (s.map(df.set_index('label')['id']))
0 0
1 1
2 2
3 2
4 4
5 4
6 3
Name: a, dtype: int64
An alternative. Be careful: if there are duplicate labels, no error is raised, but the last duplicate's row is used:
print (s.map(dict(zip(df['label'], df['id']))))
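A merge-based variant gives the same result and makes the duplicate-label situation explicit (duplicates would produce extra rows rather than silently keeping the last one); a minimal sketch:
ids = s.to_frame('label').merge(df, on='label', how='left')['id']
print(ids)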

creating dataframe efficiently without for loop

I am working with some advertising data, such as email data. I have two data sets:
one at the mail level that, for each person, states which days they were emailed:
import pandas as pd
df_emailed=pd.DataFrame()
df_emailed['person']=['A','A','A','A','B','B','B']
df_emailed['day']=[2,4,8,9,1,2,5]
df_emailed
print(df_emailed)
person day
0 A 2
1 A 4
2 A 8
3 A 9
4 B 1
5 B 2
6 B 5
I have a summary dataframe that says whether someone converted, and which day they converted.
df_summary=pd.DataFrame()
df_summary['person']=['A','B']
df_summary['days_max']=[10,5]
df_summary['convert']=[1,0]
print(df_summary)
person days_max convert
0 A 10 1
1 B 5 0
I would like to combine these into a final dataframe that says, for each person:
- each day from 1 to their max date,
- whether they were emailed that day (0/1), and
- whether they converted or not (0/1) on the last day in the dataframe.
We are assuming they convert on the last day in the dataframe.
I know how to do this using a nested for loop, but I think that is incredibly inefficient. Does anyone know an efficient way of getting this done?
Desired result
df_final=pd.DataFrame()
df_final['person']=['A','A','A','A','A','A','A','A','A','A','B','B','B','B','B']
df_final['day']=[1,2,3,4,5,6,7,8,9,10,1,2,3,4,5]
df_final['emailed']=[0,1,0,1,0,0,0,1,1,0,1,1,0,0,1]
df_final['convert']=[0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]
print(df_final)
person day emailed convert
0 A 1 0 0
1 A 2 1 0
2 A 3 0 0
3 A 4 1 0
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 1 0
8 A 9 1 0
9 A 10 0 1
10 B 1 1 0
11 B 2 1 0
12 B 3 0 0
13 B 4 0 0
14 B 5 1 0
Thank you and happy holidays!
A high-level approach involves modifying df_summary (alias df2) to get our output. We'll need to:
- set_index on the days_max column of df2, also renaming the axis to day (which will help later on)
- groupby to group on person
- apply a reindex on the index (day, so we get a row for each day leading up to the last day)
- fillna to fill the NaNs in the convert column generated as a result of the reindex
- assign to create a dummy column for emailed that we'll set later
Next, index into the result of the previous operation using df_emailed. We'll use those values to set the corresponding emailed cells to 1. This is done by MultiIndexing with loc.
Finally, use reset_index to bring the index out as columns.
import numpy as np

def f(x):
    # extend each person's series to cover every day from 1 to days_max
    return x.reindex(np.arange(1, x.index.max() + 1))

df = df2.set_index('days_max')\
        .rename_axis('day')\
        .groupby('person')['convert']\
        .apply(f)\
        .fillna(0)\
        .astype(int)\
        .to_frame()\
        .assign(emailed=0)
df.loc[df1[['person', 'day']].apply(tuple, 1).values, 'emailed'] = 1
df.reset_index()
person day convert emailed
0 A 1 0 0
1 A 2 0 1
2 A 3 0 0
3 A 4 0 1
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 0 1
8 A 9 0 1
9 A 10 1 0
10 B 1 0 1
11 B 2 0 1
12 B 3 0 0
13 B 4 0 0
14 B 5 0 1
Where
df1 = df_emailed
and,
df2 = df_summary
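For reference, a more compact variant of the same idea is to expand df_summary by repeating each row days_max times and then mark emailed days with an isin lookup on (person, day) pairs. A sketch assuming pandas 0.24+ (for MultiIndex.from_frame), using the df_emailed / df_summary frames defined in the question:
grid = df_summary.loc[df_summary.index.repeat(df_summary['days_max'])].copy()
grid['day'] = grid.groupby('person').cumcount() + 1
# mark the (person, day) pairs that appear in df_emailed
pairs = pd.MultiIndex.from_frame(grid[['person', 'day']])
emailed_pairs = pd.MultiIndex.from_frame(df_emailed[['person', 'day']])
grid['emailed'] = pairs.isin(emailed_pairs).astype(int)
# convert only on the person's last day, and only if the summary says they converted
grid['convert'] = ((grid['day'] == grid['days_max']) & grid['convert'].eq(1)).astype(int)
df_final = grid[['person', 'day', 'emailed', 'convert']].reset_index(drop=True)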
