pandas groupby with length of lists - python

I need to display, as dataframe columns, both the user_id and the length of content_id, which is a list object, but I am struggling to do this with groupby.
Please help with both the groupby and the question asked at the bottom of this post (how do I get the results along with user_id in a dataframe?).
Dataframe types:
df.dtypes
output:
user_id object
content_id object
dtype: object
Sample Data:
user_id content_id
0 user_18085 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
1 user_16044 [cont_2738_2_49, cont_4482_2_19, cont_4994_18_...
2 user_13110 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
3 user_18909 [cont_3170_2_28]
4 user_15509 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
Pandas query:
df.groupby('user_id')['content_id'].count().reset_index()
df.groupby(['user_id'])['content_id'].apply(lambda x: get_count(x))
output:
user_id content_id
0 user_10013 1
1 user_10034 1
2 user_10042 1
When I tried it without grouping, it works fine, as below -
df['content_id'].apply(lambda x: len(x))
0 11
1 9
2 11
3 1
But how do I get the results along with user_id in a dataframe? I want it in the format below -
user_id content_id
some xxx 11
some yyy 6

pandas groupby returns a grouper object, not the contents of each cell. As such it is not possible (without a lot of workarounds) to do what you want directly. Instead you simply need to rewrite the column first (as suggested by @ifly6).
Using
df_agg = df.copy()
df_agg.content_id = df_agg.content_id.apply(len)
df_agg = df_agg.groupby('user_id').sum()
will result in the same dataframe as the Groupby you described.
For completeness' sake, the single-statement groupby would be
df.groupby('user_id').agg(lambda x: x.apply(len).sum())
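(The groupby ... count() in the question shows 1 for every user because count() counts rows per group, not elements inside each list.) If each user_id occurs only once, a minimal sketch that keeps user_id next to the list lengths, assuming the cells are real Python lists, is:
out = df.copy()
out['content_id'] = out['content_id'].str.len()  # .str.len() returns the length of each list cell
print(out)
This is just the rewrite-the-column idea above without the final groupby/sum step.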

Try converting content_id to a string, splitting it by commas, reassembling each row as a list, and then counting the list items.
data="""index user_id content_id
0 user_18085 [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
1 user_16044 [cont_2738_2_49,cont_4482_2_19,cont_4994_18_]
2 user_13110 [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
3 user_18909 [cont_3170_2_28]
4 user_15509 [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
"""
import re
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(data), sep='\s+')

def convert_to_list(x):
    # strip the surrounding brackets, then split on commas
    x = re.sub(r'[\[\]]', '', x)
    return x.split(',')

df['content_id2'] = [list() for _ in range(len(df.index))]
for key, row in df.iterrows():
    lst = convert_to_list(str(row['content_id']))
    for item in lst:
        df.loc[key, 'content_id2'].append(item)

def count_items(x):
    return len(x)

df['count'] = df['content_id2'].apply(count_items)
df.drop(['content_id'], axis=1, inplace=True)
df.rename(columns={'content_id2': 'content_id'}, inplace=True)
print(df)
output:
index user_id content_id count
0 0 user_18085 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19] 3
1 1 user_16044 [cont_2738_2_49, cont_4482_2_19, cont_4994_18_] 3
2 2 user_13110 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19] 3
3 3 user_18909 [cont_3170_2_28] 1
4 4 user_15509 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19] 3
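If the cells are plain strings in the bracketed, comma-separated format shown above (assumption: no quotes or nested brackets inside the strings), a more compact sketch of the same idea uses the vectorised string methods instead of iterrows:
df['content_id'] = df['content_id'].str.strip('[]').str.split(',')
df['count'] = df['content_id'].str.len()
print(df)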

Related

How to parse from one column to create another with Pandas and Regex?

I have a pandas dataframe with one column, user_id, where each row ends with "\tgroup...".
I want to create a new column, group_id, where each row holds the corresponding "group..." part extracted from user_id.
This is my implementation so far:
user_id
0 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-0
1 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-1
2 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-2
3 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-3
4 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-4
df['group_id'] = df['user_id'].apply(lambda x: re.findall('(^\t)',x))
print(df.head())
user_id group_id
0 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-0 []
1 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-1 []
2 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-2 []
3 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-3 []
4 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-4 []
Clearly the lambda/regex method is not grabbing the string selection that I want.
Any ideas?
Is \t the tab character or a literal backslash followed by t? If the latter, you can try:
df['group_id'] = df.user_id.str.extract(r'\\t(.*)')
Output:
user_id group_id
0 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-0 group-0
1 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-1 group-1
2 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-2 group-2
3 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-3 group-3
4 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-4 group-4
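If it turns out \t is an actual tab character instead, a hedged variant of the same extract would match the tab directly:
df['group_id'] = df['user_id'].str.extract(r'\t(.*)')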

Recover a standard, single-index data frame after using pandas groupby+apply

I want to apply a custom reduction function to each group in a Python dataframe. The function reduces the group to a single row by performing operations that combine several of the columns of the group.
I've implemented this like so:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={
    "afac": np.random.random(size=1000),
    "bfac": np.random.random(size=1000),
    "class": np.random.randint(low=0, high=5, size=1000)
})
def f(group):
    total_area = group['afac'].sum()
    per_area = (group['afac'] / total_area).values
    per_pop = group['bfac'].values
    return pd.DataFrame(data={'per_apop': [np.sum(per_area * per_pop)]})
aggdf = df.groupby('class').apply(f)
My input data frame df looks like:
>>> df
afac bfac class
0 0.689969 0.992403 0
1 0.688756 0.728763 1
2 0.086045 0.499061 1
3 0.078453 0.198435 2
4 0.621589 0.812233 4
But my code gives this multi-indexed data frame:
>>> aggdf
per_apop
class
0 0 0.553292
1 0 0.503112
2 0 0.444281
3 0 0.517646
4 0 0.503290
I've tried various ways of getting back to a "normal" data frame, but none seem to work.
>>> aggdf.reset_index()
class level_1 per_apop
0 0 0 0.553292
1 1 0 0.503112
2 2 0 0.444281
3 3 0 0.517646
4 4 0 0.503290
>>> aggdf.unstack().reset_index()
class per_apop
0
0 0 0.553292
1 1 0.503112
2 2 0.444281
3 3 0.517646
4 4 0.503290
How can I perform this operation and get a normal data frame afterwards?
Update: The output data frame should have columns for class and per_apop. Ideally, the function f can return multiple columns and possibly multiple rows. Perhaps using
return pd.DataFrame(data={'per_apop': [np.sum(per_area*per_pop),2], 'sue':[1,3]})
You can select which level to reset as well as if you want to retain the index using reset_index. In your case, you ended up with a multi-index that has 2 levels: class and one that is unnamed. reset_index allows you to reset the entire index (default) or just the levels you want. In the following example, the last level (-1) is being pulled out of the index. By also using drop=True it is dropped rather than appended as a column in the data frame.
aggdf.reset_index(level=-1, drop=True)
per_apop
class
0 0.476184
1 0.476254
2 0.509735
3 0.502444
4 0.525287
EDIT:
To push the class level of the index back into the data frame, you can simply call .reset_index() again. Ugly, but it works.
aggdf.reset_index(level=-1, drop=True).reset_index()
class per_apop
0 0 0.515733
1 1 0.497349
2 2 0.527063
3 3 0.515476
4 4 0.494530
Alternatively, you could also reset the index and then just drop the extra column.
aggdf.reset_index().drop('level_1', axis=1)
class per_apop
0 0 0.515733
1 1 0.497349
2 2 0.527063
3 3 0.515476
4 4 0.494530
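On newer pandas (assuming >= 0.24, where DataFrame.droplevel exists), the reset_index chain above can be written as a hedged one-liner:
aggdf.droplevel(-1).reset_index()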
Make your self-defined function return a Series:
def f(group):
    total_area = group['afac'].sum()
    per_area = (group['afac'] / total_area).values
    per_pop = group['bfac'].values
    return pd.Series(data={'per_apop': np.sum(per_area * per_pop)})
df.groupby('class').apply(f).reset_index()
class per_apop
0 0 0.508332
1 1 0.505593
2 2 0.488117
3 3 0.481572
4 4 0.500401
Although you have a good answer, a suggestion:
test func for df.groupby(...).apply(func) on the first group, like this:
agroupby = df.groupby(...)
for key, groupdf in agroupby:  # an iterator -> (key, groupdf) ... pairs
    break  # get the first pair
print("\n-- first groupdf: len %d type %s \n%s" % (
    len(groupdf), type(groupdf), groupdf))  # DataFrame
test = myfunc(groupdf)
# groupdf .col [col] [[col ...]] .set_index .resample ... as usual

Extracting specific rows from a data frame

I have a data frame df1 with two columns 'ids' and 'names' -
ids names
fhj56 abc
ty67s pqr
yu34o xyz
I have another data frame df2 which has some of the columns being -
user values
1 ['fhj56','fg7uy8']
2 ['glao0','rt56yu','re23u']
3 ['fhj56','ty67s','hgjl09']
My result should give me those users from df2 whose values contain at least one of the ids from df1, and also tell me which ids are responsible for putting them into the resultant table. The result should look like -
user values_responsible names
1 ['fhj56'] ['abc']
3 ['fhj56','ty67s'] ['abc','pqr']
User 2 doesn't appear in the resultant table because none of its values exist in df1.
I was trying to do it as follows -
df2.query('values in #df1.ids')
But this doesn't seem to work well.
You can iterate through the rows of df2 and then use .loc together with isin to find the matching rows in df1. The filtered results are collected into lists and assembled into a new dataframe at the end.
ids = []
names = []
users = []
for _, row in df2.iterrows():
    result = df1.loc[df1['ids'].isin(row['values'])]
    if not result.empty:
        ids.append(result['ids'].tolist())
        names.append(result['names'].tolist())
        users.append(row['user'])
>>> pd.DataFrame({'user': users, 'values_responsible': ids, 'names': names})[['user', 'values_responsible', 'names']]
user values_responsible names
0 1 [fhj56] [abc]
1 3 [fhj56, ty67s] [abc, pqr]
Or, for tidy data:
ids = []
names = []
users = []
for _, row in df2.iterrows():
    result = df1.loc[df1['ids'].isin(row['values'])]
    if not result.empty:
        ids.extend(result['ids'].tolist())
        names.extend(result['names'].tolist())
        users.extend([row['user']] * len(result['ids']))
>>> pd.DataFrame({'user': users, 'values_responsible': ids, 'names': names})[['user', 'values_responsible', 'names']]
user values_responsible names
0 1 fhj56 abc
1 3 fhj56 abc
2 3 ty67s pqr
Try this, using the idea of unnesting a list cell (here df refers to the second dataframe, df2 in the question):
Temp_unnest = pd.DataFrame([[i, x]
                            for i, y in df['values'].apply(list).iteritems()
                            for x in y], columns=list('IV'))
Temp_unnest['user'] = Temp_unnest.I.map(df.user)
df1.index = df1.ids
Temp_unnest.assign(names=Temp_unnest.V.map(df1.names)).dropna().groupby('user')['V','names'].agg({(lambda x: list(x))})
Out[942]:
V names
<lambda> <lambda>
user
1 [fhj56] [abc]
3 [fhj56, ty67s] [abc, pqr]
I would refactor your second dataframe (essentially, normalizing your database). Something like
user gid id
1 1 'fhj56'
1 1 'fg7uy8'
2 1 'glao0'
2 1 'rt56yu'
2 1 're23u'
3 1 'fhj56'
3 1 'ty67s'
3 1 'hgjl09'
Then, all you have to do is merge the first and second dataframe on the id column.
r = df2.merge(df1, left_on='id', right_on='ids', how='left')
You can exclude any gids for which some of the ids don't have a matching name.
r[~r['gid'].isin(r[r['names'].isna()]['gid'].unique())]
where r[r['names'].isna()]['gid'].unique() finds all the gids that contain an id with no matching name, and r[~r['gid'].isin(...)] keeps only the entries whose gid is not in that list.
If you had more id groups, the second table might look like
user gid id
1 1 'fhj56'
1 1 'fg7uy8'
1 2 '1asdf3'
1 2 '7ada2a'
1 2 'asd341'
2 1 'glao0'
2 1 'rt56yu'
2 1 're23u'
3 1 'fhj56'
3 1 'ty67s'
3 1 'hgjl09'
which would be equivalent to
user values
1 ['fhj56','fg7uy8']
1 ['1asdf3', '7ada2a', 'asd341']
2 ['glao0','rt56yu','re23u']
3 ['fhj56','ty67s','hgjl09']
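On newer pandas (assuming >= 0.25, where DataFrame.explode is available), a hedged sketch that avoids the explicit refactoring is to unnest values, merge against df1, and re-aggregate:
exploded = df2.explode('values')  # one row per (user, id)
matched = exploded.merge(df1, left_on='values', right_on='ids')
result = (matched.groupby('user')
                 .agg(values_responsible=('ids', list), names=('names', list))
                 .reset_index())
print(result)
Users with no matching id drop out automatically because the merge is inner by default.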

Pandas Group By and Count

A pandas dataframe df has 3 columns:
user_id,
session,
revenue
What I want to do now is group df by unique user_id and derive 2 new columns - one called number_sessions (counts the number of sessions associated with a particular user_id) and another called number_transactions (counts the number of rows under the revenue column that has a value > 0 for each user_id). How do I go about doing this?
I tried doing something like this:
df.groupby('user_id')['session', 'revenue'].agg({'number sessions': lambda x: len(x.session),
'number_transactions': lambda x: len(x[x.revenue>0])})
I think you can use:
df = pd.DataFrame({'user_id':['a','a','s','s','s'],
'session':[4,5,4,5,5],
'revenue':[-1,0,1,2,1]})
print (df)
revenue session user_id
0 -1 4 a
1 0 5 a
2 1 4 s
3 2 5 s
4 1 5 s
a = df.groupby('user_id') \
.agg({'session': len, 'revenue': lambda x: len(x[x>0])}) \
.rename(columns={'session':'number sessions','revenue':'number_transactions'})
print (a)
number sessions number_transactions
user_id
a 2 0
s 3 3
a = df.groupby('user_id') \
.agg({'session':{'number sessions': len},
'revenue':{'number_transactions': lambda x: len(x[x>0])}})
a.columns = a.columns.droplevel()
print (a)
number sessions number_transactions
user_id
a 2 0
s 3 3
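Note that the nested-dict renaming form of agg used above was deprecated in pandas 0.20 and later removed; a hedged equivalent using named aggregation (assuming pandas >= 0.25) would be:
a = df.groupby('user_id').agg(
    number_sessions=('session', 'size'),
    number_transactions=('revenue', lambda x: (x > 0).sum()),
)
print(a)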
I'd use nunique for session to not double count the same session for a particular user
funcs = dict(session={'number sesssions': 'nunique'},
revenue={'number transactions': lambda x: x.gt(0).sum()})
df.groupby('user_id').agg(funcs)
setup
df = pd.DataFrame({'user_id':['a','a','s','s','s'],
'session':[4,5,4,5,5],
'revenue':[-1,0,1,2,1]})

Getting a new series conditional on some rows being present in Python and Pandas

I did not know of a better name for what I am trying to do; edits welcome. Here is what I want to do.
I have store, date, and product indices and a column called price.
I have two unique products 1 and 2.
But for each store, I don't have an observation for every date, and for every date, I don't have both products necessarily.
I want to create a series for each store that is indexed by dates only when both products are present. The reason is that I want the value of the series to be product 1 price / product 2 price.
This is a highly unbalanced panel, and my horrible workaround is about 75 lines of code, so I appreciate any tips. This will be very useful in the future.
Data looks like below.
weeknum Location_Id Item_Id averageprice
70 201138 8501 1 0.129642
71 201138 8501 2 0.188274
72 201138 8502 1 0.129642
73 201139 8504 1 0.129642
Expected output in this simple case would be:
weeknum Location_Id averageprice
? 201138 8501 0.129642/0.188274
Since that is the only one with every requirement met.
I think this could be a join on the two sub-frames (but perhaps there is a cleaner, pivot-y way):
In [11]: res = pd.merge(df[df['Item_Id'] == 1], df[df['Item_Id'] == 2],
on=['weeknum', 'Location_Id'])
In [12]: res
Out[12]:
weeknum Location_Id Item_Id_x averageprice_x Item_Id_y averageprice_y
0 201138 8501 1 0.129642 2 0.188274
Now you can divide those two columns in the result:
In [13]: res['price'] = res['averageprice_x'] / res['averageprice_y']
In [14]: res
Out[14]:
weeknum Location_Id Item_Id_x averageprice_x Item_Id_y averageprice_y price
0 201138 8501 1 0.129642 2 0.188274 0.688582
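Along the lines of the "cleaner pivoty way" mentioned above, a hedged sketch with pivot_table (assuming each (weeknum, Location_Id, Item_Id) combination has a single price) would be:
wide = df.pivot_table(index=['weeknum', 'Location_Id'],
                      columns='Item_Id',
                      values='averageprice')
wide = wide.dropna()               # keep only rows where both products are present
wide['price'] = wide[1] / wide[2]  # product 1 price / product 2 price
print(wide)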
Example data similar to yours:
weeknum loc_id item_id avg_price
0 1 8 1 8
1 1 8 2 9
2 1 9 1 10
3 2 10 1 11
First create a date mask that gets you the correct dates:
df_group = df.groupby(['loc_id', 'weeknum'])
df = df.join(df_group.item_id.apply(lambda x: len(x.unique()) == 2), on=['loc_id', 'weeknum'], rsuffix='_r')
weeknum loc_id item_id avg_price item_id_r
0 1 8 1 8 True
1 1 8 2 9 True
2 1 9 1 10 False
3 2 10 1 11 False
This gives you a boolean mask, per store and date group, marking where exactly two unique Item_Ids are present. From this you can now apply the function that concatenates your prices:
df[df.item_id_r].groupby(['loc_id','weeknum']).avg_price.apply(lambda x: '/'.join([str(y) for y in x]))
loc_id weeknum
8 1 8/9
It's a bit verbose with lots of lambdas, but it will get you started, and you can refactor it to be faster and/or more concise if you want.
Let's say your full dataset is called TILPS. Then you might try this:
from __future__ import division  # __future__ imports must come first
import pandas as pd
# Get list of unique dates present in TILPS
datelist = list(TILPS['datetime'].unique())
# Get list of unique stores present in TILPS
storelist = list(TILPS['store'].unique())
# For a given date, extract relative price
def dateLevel(daterow):
    price1 = int(daterow.loc[(daterow['Item_id']==1), 'averageprice'].unique())
    price2 = int(daterow.loc[(daterow['Item_id']==2), 'averageprice'].unique())
    return pd.DataFrame(pd.Series({'relprice': price1/price2}))
# For each store, extract relative price for each date
def storeLevel(group, datelist):
    # the dict value was elided in the original snippet; None is a placeholder (info is unused below)
    info = {d: None for d in datelist}
    # keep 'datetime' in the selection so it can be grouped on below
    exist = group.loc[group['datetime'].isin(datelist), ['weeknum', 'locid', 'datetime']]
    exist_gr = exist.groupby('datetime')
    relprices = exist_gr.apply(dateLevel)
    # Merge relprices with exist on INDEX.
    exist = exist.merge(relprices, left_index=True, right_index=True)
    return exist
# Group TILPS by store
gr_store = TILPS.groupby('store')
fn = lambda x: storeLevel(x, datelist)
output = gr_store.apply(fn)
# Peek at output
print(output.head(30))
