Pandas DataFrame joining 2 tables on <,> conditions - python

I would like to remove all sessions after a user's conversion (and also remove the sessions that happened on the day of conversion).
full_sessions = pd.DataFrame(data={'user_id':[1,1,2,3,3], 'visit_no':[1,2,1,1,2], 'date':['20180307','20180308','20180307','20180308','20180308'], 'result':[0,1,1,0,0]})
print(full_sessions)
date result user_id visit_no
0 20180307 0 1 1
1 20180308 1 1 2
2 20180307 1 2 1
3 20180308 0 3 1
4 20180308 0 3 2
When did people convert?
conversion = full_sessions[full_sessions['result'] == 1][['user_id','date']]
print(conversion)
user_id date
0 1 20180308
2 2 20180307
Ideal output:
date result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2
What do I want in SQL?
The SQL would be roughly:
SELECT * FROM (
    SELECT full_sessions.*
    FROM full_sessions
    JOIN conversion
      ON full_sessions.user_id = conversion.user_id
     AND full_sessions.date < conversion.date
    UNION ALL
    SELECT *
    FROM full_sessions
    WHERE user_id NOT IN (SELECT user_id FROM conversion)
) t

IIUC using merge in pandas
full_sessions.merge(conversion, on='user_id', how='left').loc[lambda x: (x.date_y > x.date_x) | (x.date_y.isnull())].dropna(axis=1)
Out[397]:
date_x result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2
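For readability, the same merge-and-filter idea can be split into steps. A minimal sketch, assuming the full_sessions and conversion frames defined in the question (merged, keep and result are just illustrative names):
merged = full_sessions.merge(conversion, on='user_id', how='left',
                             suffixes=('', '_conv'))
# keep sessions before conversion, or sessions of users who never converted
keep = merged['date_conv'].isnull() | (merged['date'] < merged['date_conv'])
result = merged.loc[keep, full_sessions.columns]
print(result)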

You can join the dataframes and then filter the rows matching your criteria this way:
df_join = full_sessions.join(conversion, lsuffix='', rsuffix='_right',
                             how='left', on='user_id')
print(df_join)
date result user_id visit_no user_id_right date_right
0 20180307 0 1 1 1.0 20180308
1 20180308 1 1 2 1.0 20180308
2 20180307 1 2 1 2.0 20180307
3 20180308 0 3 1 NaN NaN
4 20180308 0 3 2 NaN NaN
And then just keep the rows with NaN in date_right, or where date is earlier than date_right:
>>> df_join[df_join.apply(lambda x: x.date < x.date_right
...                       if not pd.isna(x.date_right)
...                       else True, axis=1)][['date','visit_no','user_id']]
date visit_no user_id
0 20180307 1 1
3 20180308 1 3
4 20180308 2 3

Here is a method which maps a series, instead of the join / merge alternatives (fs below is the full_sessions frame).
fs['date'] = pd.to_numeric(fs['date'])
s = fs[fs['result'] == 1].set_index('user_id')['date']
result = fs.loc[fs['date'] < fs['user_id'].map(s).fillna(fs['date'].max()+1)]
Result
date result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2
Explanation
Create a mapping from user_id to conversion date, store it in a series s.
Then just filter on dates prior to conversion dates mapped via user_id.
If there is no conversion date, the data will be included, since we fillna with a maximal date.
Consider using datetime objects. I have converted to numeric above for simplicity.
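For reference, a sketch of the same mapping approach with real datetimes instead of the numeric conversion, assuming fs is full_sessions as above (pd.Timestamp.max plays the role of the maximal fill date):
fs = full_sessions.copy()
fs['date'] = pd.to_datetime(fs['date'], format='%Y%m%d')
# map each user_id to its conversion date; users without one get Timestamp.max
s = fs[fs['result'] == 1].set_index('user_id')['date']
result = fs.loc[fs['date'] < fs['user_id'].map(s).fillna(pd.Timestamp.max)]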

Using groupby & apply & some final cleanup with reset_index, you can express it in one very long statement:
full_sessions.groupby('user_id', as_index=False).apply(
    lambda x: x[:(x.result == 1).values.argmax()] if any(x.result == 1) else x
).reset_index(level=0, drop=True)
outputs:
date result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2

Related

Realise accumulated DataFrame from a column of Boolean values

Consider the following Python pandas DataFrame:
ID  Holidays  visit_1  visit_2  visit_3  other
 0      True        1        2        0    red
 0     False        3        2        0    red
 0      True        4        4        1   blue
 1     False        2        0        0    red
 1      True        1        2        1  green
 2     False        1        0        0    red
Currently I calculate a new DataFrame with the accumulated visit values as follows.
# Calculate the columns of the total visit count
visit_df = df.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
I would like to create a new one taking into account only the rows whose Holidays value is True. How could I do this?
Simply subset the rows first:
df[df['Holidays']].groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
output:
visit_1 visit_2 visit_3
ID
0 5 6 1
1 1 2 1
An alternative if you also want to get the groups without any match:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
)
output:
visit_1 visit_2 visit_3
ID
0 5.0 6.0 1.0
1 1.0 2.0 1.0
2 0.0 0.0 0.0
A variant that keeps integer dtypes and suffixes the column names:
df2 = df.set_index('ID')
(df2.where(df2['Holidays'])
.groupby('ID')[['visit_1', 'visit_2', 'visit_3']].sum()
.convert_dtypes()
.add_suffix('_Holidays')
)
output:
visit_1_Holidays visit_2_Holidays visit_3_Holidays
ID
0 5 6 1
1 1 2 1
2 0 0 0
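Another option along the same lines is to zero out the non-holiday rows by multiplying with the boolean column, which also keeps every ID and stays integer. A sketch, assuming df is the frame from the question:
(df[['visit_1', 'visit_2', 'visit_3']]
 .mul(df['Holidays'], axis=0)      # rows with Holidays == False contribute 0
 .groupby(df['ID']).sum()
)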

How to utilize Pandas aggregate functions on this DataFrame?

This is the table:
order_id product_id reordered department_id
2 33120 1 16
2 28985 1 4
2 9327 0 13
2 45918 1 13
3 17668 1 16
3 46667 1 4
3 17461 1 12
3 32665 1 3
4 46842 0 3
I want to group by department_id, summing the number of orders that come from that department, as well as the number of orders from that department where reordered == 0. The resulting table would look like this:
department_id number_of_orders number_of_reordered_0
3 2 1
4 2 0
12 1 0
13 2 1
16 2 0
I know this can be done in SQL (I forget what the query for that would look like as well; if anyone can refresh my memory on that, that'd be great too). But what are the Pandas functions to make that work?
I know that it starts with df.groupby('department_id').sum(). Not sure how to flesh out the rest of the line.
Use GroupBy.agg with size and a lambda function that compares values with Series.eq and counts the True values with sum (True values are processed like 1):
df1 = (df.groupby('department_id')['reordered']
         .agg([('number_of_orders', 'size'),
               ('number_of_reordered_0', lambda x: x.eq(0).sum())])
         .reset_index())
print(df1)
department_id number_of_orders number_of_reordered_0
0 3 2 1
1 4 2 0
2 12 1 0
3 13 2 1
4 16 2 0
If the values are only 1 and 0, it is possible to use sum and then subtract:
df1 = (df.groupby('department_id')['reordered']
         .agg([('number_of_orders', 'size'), ('number_of_reordered_0', 'sum')])
         .reset_index())
df1['number_of_reordered_0'] = df1['number_of_orders'] - df1['number_of_reordered_0']
print(df1)
department_id number_of_orders number_of_reordered_0
0 3 2 1
1 4 2 0
2 12 1 0
3 13 2 1
4 16 2 0
In SQL it would be a simple aggregation:
SELECT department_id,
       COUNT(*) AS number_of_orders,
       SUM(CASE WHEN reordered = 0 THEN 1 ELSE 0 END) AS number_of_reordered_0
FROM table_name
GROUP BY department_id
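The same aggregation can also be written with pandas named aggregation (available since pandas 0.25). A sketch, assuming the frame is called df as above:
df1 = (df.groupby('department_id')
         .agg(number_of_orders=('reordered', 'size'),
              number_of_reordered_0=('reordered', lambda x: x.eq(0).sum()))
         .reset_index())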

How do I get the maximum count grouped by element in a pandas data frame

I have data grouped by two columns [CustomerID, cluster] like this:
CustomerIDClustered.groupby(['CustomerID','cluster']).count()
Count
CustomerID cluster
1893 0 1
1 2
2 5
3 1
2304 2 3
3 1
2655 0 1
2 1
2850 1 1
2 1
3 1
3648 0 1
I need to assign the most frequent cluster to each customer id.
For example:
1893->2 (cluster 2 appears more often than the other clusters)
2304->2
2655->1
Use sort_values, reset_index and finally drop_duplicates:
df = df.sort_values('Count', ascending=False).reset_index().drop_duplicates('CustomerID')
A similar solution, only filtering by the first level of the MultiIndex:
df = df.sort_values('Count', ascending=False)
df = df[~df.index.get_level_values(0).duplicated()].reset_index()
print (df)
CustomerID cluster Count
0 1893 2 5
1 2304 2 3
2 2655 0 1
3 2850 1 1
4 3648 0 1
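If you would rather avoid the sort, an idxmax-based sketch works too, assuming df is the counted frame with the (CustomerID, cluster) MultiIndex and the Count column shown above (idx and most_frequent are illustrative names):
idx = df.groupby(level='CustomerID')['Count'].idxmax()   # label of the largest count per customer
most_frequent = pd.DataFrame(idx.tolist(), columns=['CustomerID', 'cluster'])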

Creating a dataframe efficiently without a for loop

I am working with some advertising data, such as email data. I have two data sets:
one at the mail level, which, for each person, states what days they were mailed, and then what day they converted.
import pandas as pd
df_emailed=pd.DataFrame()
df_emailed['person']=['A','A','A','A','B','B','B']
df_emailed['day']=[2,4,8,9,1,2,5]
print(df_emailed)
person day
0 A 2
1 A 4
2 A 8
3 A 9
4 B 1
5 B 2
6 B 5
I have a summary dataframe that says whether someone converted, and which day they converted.
df_summary=pd.DataFrame()
df_summary['person']=['A','B']
df_summary['days_max']=[10,5]
df_summary['convert']=[1,0]
print(df_summary)
person days_max convert
0 A 10 1
1 B 5 0
I would like to combine these into a final dataframe that has, for each person:
one row for every day from 1 up to their max day,
whether they were emailed on that day (0/1), and
whether they converted or not (0/1) on the last day in the dataframe.
We are assuming they convert on the last day in the dataframe.
I know how to do this using a nested for loop, but I think that is just incredibly inefficient and sort of dumb. Does anyone know an efficient way of getting this done?
Desired result
df_final=pd.DataFrame()
df_final['person']=['A','A','A','A','A','A','A','A','A','A','B','B','B','B','B']
df_final['day']=[1,2,3,4,5,6,7,8,9,10,1,2,3,4,5]
df_final['emailed']=[0,1,0,1,0,0,0,1,1,0,1,1,0,0,1]
df_final['convert']=[0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]
print(df_final)
person day emailed convert
0 A 1 0 0
1 A 2 1 0
2 A 3 0 0
3 A 4 1 0
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 1 0
8 A 9 1 0
9 A 10 0 1
10 B 1 1 0
11 B 2 1 0
12 B 3 0 0
13 B 4 0 0
14 B 5 1 0
Thank you and happy holidays!
A high-level approach involves modifying df_summary (alias df2) to get our output. We'll need to:
apply a set_index operation on the days_max column of df2, also renaming the index to day (which will help later on);
groupby to group on person;
apply a reindex operation on the index (day), so we get rows for each day leading up to the last day;
fillna to fill the NaNs in the convert column generated as a result of the reindex;
assign to create a dummy column for emailed that we'll set later.
Next, index into the result of the previous operation using df_emailed. We'll use those values to set the corresponding emailed cells to 1. This is done by MultiIndexing with loc.
Finally, use reset_index to bring the index out as columns.
import numpy as np

# df1 = df_emailed and df2 = df_summary (see below)

def f(x):
    # make sure every day from 1 up to the person's last day is present
    return x.reindex(np.arange(1, x.index.max() + 1))

df = df2.set_index('days_max')\
        .rename_axis('day')\
        .groupby('person')['convert']\
        .apply(f)\
        .fillna(0)\
        .astype(int)\
        .to_frame()\
        .assign(emailed=0)

# flag the (person, day) pairs that actually received an email
df.loc[df1[['person', 'day']].apply(tuple, axis=1).values, 'emailed'] = 1
df.reset_index()
person day convert emailed
0 A 1 0 0
1 A 2 0 1
2 A 3 0 0
3 A 4 0 1
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 0 1
8 A 9 0 1
9 A 10 1 0
10 B 1 0 1
11 B 2 0 1
12 B 3 0 0
13 B 4 0 0
14 B 5 0 1
Where
df1 = df_emailed
and,
df2 = df_summary
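An alternative that avoids the groupby/apply is to build the full person/day grid up front and then mark the emailed and conversion days against it. A rough sketch, assuming df_emailed and df_summary as defined in the question (grid is an illustrative name):
import numpy as np
import pandas as pd

# one row per person per day, from 1 to that person's days_max
grid = pd.DataFrame([(p, d)
                     for p, m in zip(df_summary['person'], df_summary['days_max'])
                     for d in np.arange(1, m + 1)],
                    columns=['person', 'day'])

# mark the days on which each person was emailed
grid['emailed'] = (grid.merge(df_emailed.assign(emailed=1),
                              on=['person', 'day'], how='left')['emailed']
                       .fillna(0).astype(int))

# conversion happens (if at all) on the person's last day
grid = grid.merge(df_summary, on='person', how='left')
grid['convert'] = ((grid['day'] == grid['days_max']) & (grid['convert'] == 1)).astype(int)
grid = grid.drop(columns='days_max')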

Filtering and Then Summing Groupby Data in Pandas

I want to find the sum of all the Survived values in the pandas Series below where the Fare is > 100. The Series was created using .groupby('Fare').
Fare
0.0000 1
4.0125 0
5.0000 0
6.2375 0
6.4375 0
6.4500 0
6.4958 0
6.7500 0
6.8583 0
6.9500 0
6.9750 1
7.0458 0
7.0500 0
7.0542 0
7.1250 0
7.1417 1
7.2250 3
7.2292 4
7.2500 1
7.3125 0
7.4958 1
7.5208 0
7.5500 1
7.6292 0
7.6500 1
7.7250 0
7.7292 0
7.7333 2
7.7375 1
7.7417 0
..
80.0000 2
81.8583 1
82.1708 1
83.1583 3
83.4750 1
86.5000 3
89.1042 2
90.0000 3
91.0792 2
93.5000 2
106.4250 1
108.9000 1
110.8833 3
113.2750 2
120.0000 4
133.6500 2
134.5000 2
135.6333 2
146.5208 2
151.5500 2
153.4625 2
164.8667 2
211.3375 3
211.5000 0
221.7792 0
227.5250 3
247.5208 1
262.3750 2
263.0000 2
512.3292 3
Name: Survived, dtype: int64
I tried using fare_df.loc[fare_df.index > 100, fare_df[:]].sum() but I get the error:
pandas.core.indexing.IndexingError: Too many indexers
Please help!
This will get you the sum() you're looking for:
fare_df[fare_df.Fare > 100]['Survived'].sum()
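Note that this assumes fare_df is the original DataFrame with Fare and Survived columns. If fare_df is actually the grouped Series shown above (indexed by Fare), a sketch of the equivalent would be:
fare_df[fare_df.index > 100].sum()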
