Pandas Group By and Count - python

A pandas dataframe df has 3 columns:
user_id,
session,
revenue
What I want to do now is group df by unique user_id and derive 2 new columns - one called number_sessions (counts the number of sessions associated with a particular user_id) and another called number_transactions (counts the number of rows under the revenue column that have a value > 0 for each user_id). How do I go about doing this?
I tried doing something like this:
df.groupby('user_id')['session', 'revenue'].agg(
    {'number sessions': lambda x: len(x.session),
     'number_transactions': lambda x: len(x[x.revenue > 0])})

I think you can use:
df = pd.DataFrame({'user_id':['a','a','s','s','s'],
                   'session':[4,5,4,5,5],
                   'revenue':[-1,0,1,2,1]})
print (df)
   revenue  session user_id
0       -1        4       a
1        0        5       a
2        1        4       s
3        2        5       s
4        1        5       s
a = df.groupby('user_id') \
      .agg({'session': len, 'revenue': lambda x: len(x[x>0])}) \
      .rename(columns={'session':'number sessions','revenue':'number_transactions'})
print (a)
         number sessions  number_transactions
user_id
a                      2                    0
s                      3                    3
a = df.groupby('user_id') \
      .agg({'session': {'number sessions': len},
            'revenue': {'number_transactions': lambda x: len(x[x>0])}})
a.columns = a.columns.droplevel()
print (a)
         number sessions  number_transactions
user_id
a                      2                    0
s                      3                    3

I'd use nunique for session to not double count the same session for a particular user
funcs = dict(session={'number sessions': 'nunique'},
             revenue={'number transactions': lambda x: x.gt(0).sum()})

df.groupby('user_id').agg(funcs)
setup
df = pd.DataFrame({'user_id':['a','a','s','s','s'],
                   'session':[4,5,4,5,5],
                   'revenue':[-1,0,1,2,1]})
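Note that the nested-dict renaming form of agg used above was deprecated in pandas 0.20 and later removed. On newer versions, named aggregation expresses the same thing; a minimal sketch, assuming pandas 0.25 or later:
import pandas as pd

df = pd.DataFrame({'user_id': ['a', 'a', 's', 's', 's'],
                   'session': [4, 5, 4, 5, 5],
                   'revenue': [-1, 0, 1, 2, 1]})

# named aggregation: output_column=(input_column, aggregation_function)
out = df.groupby('user_id').agg(
    number_sessions=('session', 'nunique'),
    number_transactions=('revenue', lambda x: x.gt(0).sum()))
print(out)
#          number_sessions  number_transactions
# user_id
# a                      2                    0
# s                      2                    3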

Related

How to create a column by applying a `for` loop?

I used a for loop to count how many times each ID repeats in the dataframe; now I need to create a column holding that total for the respective ID.
In short, I need a column with the repeat count of df['ID']. How do I get the total from the groupby back into a column?
test = df['ID'].sort_values(ascending=True)
rep = 0
for k in range(0, len(test)-1):
    if(test[k] == test[k+1]):
        rep += 1
        if(k == len(test)-2):
            print(test[k], ',', rep+1)
    else:
        print(test[k], ',', rep+1)
        rep = 0
out:
> 7614381 , 1
> 349444 , 5
> 4577800 ,7
"For example, a column with the total number of times that given df['ID'] appeared in the dataframe"
Is this what you mean?
import pandas as pd
df = pd.DataFrame(
    {'id': [7614381, 349444, 349444, 4577800, 4577800, 349444, 4577800]}
)
df["id_count"] = df.groupby('id')['id'].transform('count')
df
Output
        id  id_count
0  7614381         1
1   349444         3
2   349444         3
3  4577800         3
4  4577800         3
5   349444         3
6  4577800         3
based on: https://stackoverflow.com/a/22391554/11323137
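The same per-row total can also be obtained without groupby by mapping value_counts back onto the column; a small sketch on the same data:
import pandas as pd

df = pd.DataFrame(
    {'id': [7614381, 349444, 349444, 4577800, 4577800, 349444, 4577800]}
)
# value_counts yields one count per unique id; map aligns it back to every row
df['id_count'] = df['id'].map(df['id'].value_counts())
print(df)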

pandas groupby with length of lists

I need to display in dataframe columns both the user_id and the length of content_id, which is a list object, but I am struggling to do it using groupby.
Please help with the groupby as well as with my question asked at the bottom of this post (how do I get the results along with user_id in the dataframe?).
Dataframe types:
df.dtypes
output:
user_id object
content_id object
dtype: object
Sample Data:
user_id content_id
0 user_18085 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
1 user_16044 [cont_2738_2_49, cont_4482_2_19, cont_4994_18_...
2 user_13110 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
3 user_18909 [cont_3170_2_28]
4 user_15509 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
Pandas query:
df.groupby('user_id')['content_id'].count().reset_index()
df.groupby(['user_id'])['content_id'].apply(lambda x: get_count(x))
output:
user_id content_id
0 user_10013 1
1 user_10034 1
2 user_10042 1
When I tried without grouping, it works fine, as below -
df['content_id'].apply(lambda x: len(x))
0 11
1 9
2 11
3 1
But, how do I get the results along with user_id in dataframe? Like I want in below format -
user_id content_id
some xxx 11
some yyy 6
pandas.GroupBy returns a grouper element, not the contents of each cell. As such it is not possible (without a lot of working around) to do what you want directly. Instead you need to simply rewrite the columns (as suggested by @ifly6).
Using
df_agg = df.copy()
df_agg.content_id = df_agg.content_id.apply(len)
df_agg = df_agg.groupby('user_id').sum()
will result in the same dataframe as the Groupby you described.
For completeness' sake, the instruction for a single groupby would be
df.groupby('user_id').agg(lambda x: x.apply(len).sum())
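If content_id really does hold Python lists, a shorter variant (a sketch along the same lines, not the code above) is to compute the lengths first and keep user_id beside them; Series.str.len also works on list-like elements:
out = (df.assign(content_id=df['content_id'].str.len())
         .groupby('user_id', as_index=False)['content_id']
         .sum())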
Try converting content_id to a string, splitting it by commas, and reassembling it as a list of lists, then count the list items.
data="""index user_id content_id
0 user_18085 [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
1 user_16044 [cont_2738_2_49,cont_4482_2_19,cont_4994_18_]
2 user_13110 [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
3 user_18909 [cont_3170_2_28]
4 user_15509 [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
"""
df = pd.read_csv(StringIO(data), sep='\s+')
def convert_to_list(x):
x=re.sub(r'[\[\]]', '', x)
lst=list(x.split(','))
return lst
df['content_id2']= [list() for x in range(len(df.index))]
for key,item in df.iterrows():
lst=convert_to_list(str(item['content_id']))
for item in lst:
df.loc[key,'content_id2'].append(item)
def count_items(x):
return len(x)
df['count'] = df['content_id2'].apply(count_items)
df.drop(['content_id'],axis=1,inplace=True)
df.rename(columns={'content_id2':'content_id'},inplace=True)
print(df)
output:
index user_id content_id count
0 0 user_18085 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19] 3
1 1 user_16044 [cont_2738_2_49, cont_4482_2_19, cont_4994_18_] 3
2 2 user_13110 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19] 3
3 3 user_18909 [cont_3170_2_28] 1
4 4 user_15509 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19] 3

extract rows with conditions and with new created column in python

I have data like this:
id name sub marks
1 a m 52
1 a s 69
1 a p 63
2 b m 36
2 b s 52
2 b p 56
3 c m 85
3 c s 62
3 c p 56
And I want an output table which contains the columns id, name and a new column result (using the criterion: if the marks in all subjects are greater than 40, then the student passes).
id name result
1 a pass
2 b fail
3 c pass
I would like to do this in python.
Create a boolean mask from marks, and then use groupby (on id and name) + all:
import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')

v = df.assign(result=df.marks.gt(40))\
      .groupby(['id', 'name'])\
      .result\
      .all()\
      .reset_index()

v['result'] = np.where(v['result'], 'pass', 'fail')
v
v
id name result
0 1 a pass
1 2 b fail
2 3 c pass
Here's one way
In [127]: df.groupby(['id', 'name']).marks.agg(
              lambda x: 'pass' if x.ge(40).all() else 'fail'
          ).reset_index(name='result')
Out[127]:
id name result
0 1 a pass
1 2 b fail
2 3 c pass
Another way, inspired by jpp's solution, is to use replace or map:
In [132]: df.groupby(['id', 'name']).marks.min().ge(40).replace(
              {True: 'pass', False: 'fail'}
          ).reset_index(name='result')
Out[132]:
id name result
0 1 a pass
1 2 b fail
2 3 c pass
Here is one way via pandas. Note that your criterion is equivalent to the minimum mark being above 40, which makes the computation more efficient.
import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')

df = df.groupby(['id', 'name'])['marks'].apply(min).reset_index()
df['result'] = np.where(df['marks'] > 40, 'pass', 'fail')
df = df[['id', 'name', 'result']]
Result
id name result
0 1 a pass
1 2 b fail
2 3 c pass
Explanation
First perform a groupby.min() by id and name.
Then assign the result column a string depending on whether that minimum exceeds 40.
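The same idea can be written without numpy, mapping the boolean directly to labels; a minimal sketch:
result = (df.groupby(['id', 'name'], as_index=False)['marks'].min()
            .assign(result=lambda d: d['marks'].gt(40).map({True: 'pass', False: 'fail'}))
            [['id', 'name', 'result']])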

How do I specify a column header for pandas groupby result?

I need to group by and then return the values of a column in a concatenated form. While I have managed to do this, the returned dataframe has a column named 0. Just 0. Is there a way to specify what the resulting column will be called?
all_columns_grouped = all_columns.groupby(['INDEX','URL'], as_index = False)['VALUE'].apply(lambda x: ' '.join(x)).reset_index()
The resulting groupby object has the headers
INDEX | URL | 0
The results are in the 0 column.
While I have managed to rename the column using
.rename(index=str, columns={0: "variant"}), this seems very inelegant.
Any way to provide a header for the column? Thanks
The simplest approach is to remove as_index=False so that a Series is returned, and then pass the name parameter to reset_index:
Sample:
all_columns = pd.DataFrame({'VALUE':['a','s','d','ss','t','y'],
                            'URL':[5,5,4,4,4,4],
                            'INDEX':list('aaabbb')})
print (all_columns)
  INDEX  URL VALUE
0     a    5     a
1     a    5     s
2     a    4     d
3     b    4    ss
4     b    4     t
5     b    4     y
all_columns_grouped = all_columns.groupby(['INDEX','URL'])['VALUE'] \
                                 .apply(' '.join) \
                                 .reset_index(name='variant')
print (all_columns_grouped)
  INDEX  URL variant
0     a    4       d
1     a    5     a s
2     b    4  ss t y
You can use agg when applied to a column (VALUE in this case) to assign column names to the result of a function.
# Sample data (thanks #jezrael)
all_columns = pd.DataFrame({'VALUE':['a','s','d','ss','t','y'],
                            'URL':[5,5,4,4,4,4],
                            'INDEX':list('aaabbb')})

# Solution
>>> all_columns.groupby(['INDEX','URL'], as_index=False)['VALUE'].agg(
...     {'variant': lambda x: ' '.join(x)})
  INDEX  URL variant
0     a    4       d
1     a    5     a s
2     b    4  ss t y
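The dict-renaming form of agg shown above was deprecated and later removed from pandas; on pandas 0.25 and newer the column can be named directly with named aggregation. A minimal sketch:
all_columns_grouped = (all_columns.groupby(['INDEX', 'URL'])['VALUE']
                       .agg(variant=' '.join)
                       .reset_index())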

Getting a new series conditional on some rows being present in Python and Pandas

I could not think of a better name for what I am trying to do. Edits welcome. Here is what I want to do.
I have store, date, and product indices and a column called price.
I have two unique products 1 and 2.
But for each store, I don't have an observation for every date, and for every date, I don't have both products necessarily.
I want to create a series for each store that is indexed by dates only when both products are present. The reason is that I want the value of the series to be product 1 price / product 2 price.
This is a highly unbalanced panel, and my workaround was a horrible 75 or so lines of code, so I appreciate any tips. This will be very useful in the future.
Data looks like below.
    weeknum  Location_Id  Item_Id  averageprice
70   201138         8501        1      0.129642
71   201138         8501        2      0.188274
72   201138         8502        1      0.129642
73   201139         8504        1      0.129642
Expected output in this simple case would be:
weeknum Location_Id averageprice
? 201138 8501 0.129642/0.188274
Since that is the only one with every requirement met.
I think this could be join on the two subFrames (but perhaps there is a cleaner pivoty way):
In [11]: res = pd.merge(df[df['Item_Id'] == 1], df[df['Item_Id'] == 2],
                        on=['weeknum', 'Location_Id'])
In [12]: res
Out[12]:
weeknum Location_Id Item_Id_x averageprice_x Item_Id_y averageprice_y
0 201138 8501 1 0.129642 2 0.188274
Now you can divide those two columns in the result:
In [13]: res['price'] = res['averageprice_x'] / res['averageprice_y']
In [14]: res
Out[14]:
weeknum Location_Id Item_Id_x averageprice_x Item_Id_y averageprice_y price
0 201138 8501 1 0.129642 2 0.188274 0.688582
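A pivot-based variant of the same join, in case a "pivoty" way is preferred (a sketch using the same column names; rows missing either item are simply dropped):
# one column per Item_Id; pairs missing either item get NaN
wide = df.pivot_table(index=['weeknum', 'Location_Id'],
                      columns='Item_Id', values='averageprice')
wide = wide.dropna(subset=[1, 2])   # keep only rows where both items are present
wide['price'] = wide[1] / wide[2]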
Example data similar to yours:
weeknum loc_id item_id avg_price
0 1 8 1 8
1 1 8 2 9
2 1 9 1 10
3 2 10 1 11
First create a date mask that gets you the correct dates:
df_group = df.groupby(['loc_id', 'weeknum'])
df = df.join(df_group.item_id.apply(lambda x: len(x.unique()) == 2),
             on=['loc_id', 'weeknum'], rsuffix='_r')
weeknum loc_id item_id avg_price item_id_r
0 1 8 1 8 True
1 1 8 2 9 True
2 1 9 1 10 False
3 2 10 1 11 False
This gives you a boolean mask for the groupby of each store for each date where exactly two unique Item_Id values are present. From this you can now apply the function that concatenates your prices:
df[df.item_id_r].groupby(['loc_id','weeknum']).avg_price.apply(lambda x: '/'.join([str(y) for y in x]))

loc_id  weeknum
8       1          8/9
It's a bit verbose and lots of lambdas but it will get you started and you can refactor to make faster and/or more concise if you want.
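One way to make it a bit more concise (a sketch, using the same hypothetical column names) is to filter out the incomplete groups first and then aggregate:
# keep only (loc_id, weeknum) groups that contain both items
complete = df.groupby(['loc_id', 'weeknum']).filter(
    lambda g: g['item_id'].nunique() == 2)
out = (complete.groupby(['loc_id', 'weeknum'])['avg_price']
       .apply(lambda x: '/'.join(str(y) for y in x)))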
Let's say your full dataset is called TILPS. Then you might try this:
from __future__ import division
import pandas as pd

# Get list of unique dates present in TILPS
datelist = list(TILPS.ix[:, 'datetime'].unique())

# Get list of unique stores present in TILPS
storelist = list(TILPS.ix[:, 'store'].unique())

# For a given date, extract relative price
def dateLevel(daterow):
    price1 = int(daterow.loc[(daterow['Item_id']==1), 'averageprice'].unique())
    price2 = int(daterow.loc[(daterow['Item_id']==2), 'averageprice'].unique())
    return pd.DataFrame(pd.Series({'relprice': price1/price2}))

# For each store, extract relative price for each date
def storeLevel(group, datelist):
    exist = group.loc[group['datetime'].isin(datelist), ['weeknum', 'locid']]
    exist_gr = exist.groupby('datetime')
    relprices = exist_gr.apply(dateLevel)
    # Merge relprices with exist on INDEX.
    exist = exist.merge(relprices, left_index=True, right_index=True)
    return exist

# Group TILPS by store
gr_store = TILPS.groupby('store')
fn = lambda x: storeLevel(x, datelist)
output = gr_store.apply(fn)

# Peek at output
print(output.head(30))
