Efficiently creating frequency and recency columns - python

This is a very specific problem: my code is very slow, and I wonder if I'm doing something obviously wrong or if there's a better way.
The situation: I have two dataframes, frame and contacts. frame is a database of people, and contacts is points of contact with these people. They look something like:
frame:
name
id
166 Bob
253 Serge
1623 Anna
766 Benna
981 Paul
contacts:
id type date
0 253 email 2016-01-05
1 1623 sale 2012-05-12
2 1623 email 2017-12-22
3 253 sale 2018-02-15
I want to add two columns to frame, 'most_recent' and '3 year contact count', which give the most recent contact (if there is one) and the number of contacts in the past 3 years.
(frame is ~100,000 rows, and contacts is ~95,000)
So far, I'm reducing the number of ids to iterate over, then creating a dict entry for each id with the right values:
from datetime import datetime as dt
# keep only ids that actually appear in contacts (checking the values, not the Series index)
# assumes contacts['date'] has already been parsed to datetime
id_list = [i for i in frame.index if i in contacts['id'].values]
freq_rec_dict = {i: [contacts.loc[contacts['id'] == i, 'date'].max(),
                     len(contacts.loc[(contacts['id'] == i) & (contacts['date'] > dt(2016, 1, 1))])]
                 for i in id_list}
Then, I turn the dict into a dataframe and perform a join:
freq_rec_df = pd.DataFrame.from_dict(freq_rec_dict, orient='index',columns=['most_recent','3 year contact count'])
result = frame.join(freq_rec_df)
This does give me what I need, but the dictionary comprehension took 30 minutes - I feel like there must be a more efficient way to do this (I will need this in the future). Any ideas would be much appreciated - thanks!

You don't specify your desired output, but here goes. You should leverage the built-in groupby method instead of taking your data out of a frame, back into a frame, and then merging:
contacts.groupby('id')[['date','type']].max()
date type
id
253 2018-02-15 sale
1623 2017-12-22 sale
You can do this in one line if you need to save memory. Again, you don't give a preferred output, so I used a left join; you could also use 'inner' to keep only the rows where contact records exist.
df = pd.merge(frame, contacts.groupby('id')[['date','type']].max(), left_index=True, right_index=True, how='left')
name date type
id
166 Bob NaN NaN
253 Serge 2018-02-15 sale
1623 Anna 2017-12-22 sale
766 Benna NaN NaN
981 Paul NaN NaN
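One thing to note: taking .max() over both columns independently can pair a date with a type from a different row. Also, the 3 year contact count from the question isn't covered above. A minimal vectorized sketch for both requested columns, assuming contacts['date'] parses as datetime and reusing the question's 2016-01-01 cutoff:
cutoff = pd.Timestamp('2016-01-01')  # "past 3 years" cutoff used in the question
dates = pd.to_datetime(contacts['date'])
summary = pd.DataFrame({
    'most_recent': dates.groupby(contacts['id']).max(),
    '3 year contact count': dates.gt(cutoff).groupby(contacts['id']).sum(),
})
result = frame.join(summary)  # ids with no contacts get NaN; fill the count with 0 if preferred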

Related

Sum one column based on matching values in second column and put total into correct record in another column

I would like to ask for some guidance with a project. The dataframes below show the before and after for your review.
Listed below is a hotel list of guests, rates, and status.
The ID column tracks each person staying with the hotel. The Rate column lists the rate per person.
The Status column is "P" for primary and "S" for shared.
The overview: guests with matching ID numbers are staying in a room together, so both rates should be summed, but the total should be listed under the "P" (primary) guest's record in the Total column. That is, "P" rows should show the total for the two guests staying together, and "S" (shared) rows should be zero.
I have tried pandas groupby & sum, but the snippet below removes some of the matching IDs' records. The sum works for my totals, but I still need to figure out how to put the total under the primary guest's record. I am still reviewing Stack Overflow for similar solutions that could help.
df = df.groupby(["ID"]).Rate.sum().reset_index()
import pandas as pd
print('Before')
df = pd.DataFrame({'ID': [1182, 2554, 1182, 2658, 5489, 2658],
                   'Fname': ['Tom', 'Harry', 'Trisha', 'Ben', 'Susan', 'Brenda'],
                   'Rate': [125.00, 89.00, 135.00, 25.00, 145.00, 19.00],
                   'Status': ['P', 'S', 'P', 'P', 'P', 'S'],
                   'Total': [0, 0, 0, 0, 0, 0]})
df1 = pd.DataFrame({'ID': [1182, 1182, 2658, 2658, 2554, 5489],
                    'Fname': ['Tom', 'Trisha', 'Ben', 'Brenda', 'Harry', 'Susan'],
                    'Rate': [125.00, 135.00, 245.00, 19.00, 89.00, 25.00],
                    'Status': ['P', 'S', 'P', 'S', 'P', 'P'],
                    'Total': [260.00, 0, 264.00, 0, 89.00, 25.00]})
print(df)
print()
print('After')
print(df1)
Assuming you have a unique P per group, you can use a GroupBy.transform with a mask:
df['Total'] = df.groupby('ID')['Rate'].transform('sum').where(df['Status'].eq('P'), 0)
output (using the data from df1):
ID Fname Rate Status Total
0 1182 Tom 125.0 P 260.0
1 1182 Trisha 135.0 S 0.0
2 2658 Ben 245.0 P 264.0
3 2658 Brenda 19.0 S 0.0
4 2554 Harry 89.0 P 89.0
5 5489 Susan 25.0 P 25.0
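An equivalent way to write this, if you prefer arithmetic over where, is to multiply the group sums by the boolean mask (False counts as 0); this is just a stylistic alternative:
# same result as the .where() version above: non-primary rows are zeroed out
df['Total'] = df.groupby('ID')['Rate'].transform('sum') * df['Status'].eq('P')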

Filter for most recent event by group with pandas

I'm trying to filter a pandas dataframe so that I'm able to get the most recent data point for each account number in the dataframe.
Here is an example of what the data looks like.
I'm looking for an output with one row per account, showing the product and the most recent date.
account_number product sale_date
0 123 rental 2021-12-01
1 423 rental 2021-10-01
2 513 sale 2021-11-02
3 123 sale 2022-01-01
4 513 sale 2021-11-30
I was trying to use groupby and idxmax(), but it doesn't work with dates.
And I did want to change the dtype away from datetime.
data_grouped = data.groupby('account_number')['sale_date'].max().idxmax()
Any ideas would be awesome.
To retain a subsetted data frame, consider sorting by account number and descending sale date, then calling DataFrame.groupby().head (which, unlike DataFrame.groupby().first, keeps NaNs if they appear in the first row of a group):
data_grouped = (
    data.sort_values(
        ["account_number", "sale_date"], ascending=[True, False]
    ).reset_index(drop=True)
    .groupby("account_number")
    .head(1)
)
It seems the sale_date column has strings. If you convert it to datetime dtype, then you can use groupby + idxmax:
df['sale_date'] = pd.to_datetime(df['sale_date'])
out = df.loc[df.groupby('account_number')['sale_date'].idxmax()]
Output:
account_number product sale_date
3 123 sale 2022-01-01
1 423 rental 2021-10-01
4 513 sale 2021-11-30
Would the keyword 'first' work? That would be:
data.groupby('account_number')['sale_date'].first()
You want the last keyword in order to get the most recent date after grouping, like this:
df.groupby(by=["account_number"])["sale_date"].last()
which will provide this output:
account_number
123 2022-01-01
423 2021-10-01
513 2021-11-30
Name: sale_date, dtype: datetime64[ns]
It is unclear why you want to transition away from using the datetime dtype, but you need it in order to correctly sort for the value you are looking for. Consider doing this as an intermediate step, then reformatting the column after processing.
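For instance, a small sketch of that round trip: convert for the computation, add a sort so last() really picks the latest date regardless of input order, then format back to strings afterwards (the exact format string is just an assumption):
df['sale_date'] = pd.to_datetime(df['sale_date'])
most_recent = df.sort_values('sale_date').groupby('account_number')['sale_date'].last()
most_recent = most_recent.dt.strftime('%Y-%m-%d')  # back to strings once the date work is done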
I'll change my answer to use @Daniel Weigelbut's answer... and also note that you can apply .nth(n) to find the nth value in the general case (-1 for the most recent date).
new_data = data.groupby('account_number')['sale_date'].nth(-1)
My previous suggestion of creating a sorted multi-index with
data.set_index(['account_number', 'sale_date'], inplace=True)
data_sorted = data.sort_index(level=[0, 1])
still works and might be more useful for any more complex sorting. As others have said, make sure your date strings are datetime objects if you sort like this.
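A short sketch of pulling the most recent row per account out of that sorted MultiIndex, converting sale_date first so the dates sort chronologically:
data['sale_date'] = pd.to_datetime(data['sale_date'])
data_sorted = data.set_index(['account_number', 'sale_date']).sort_index()
most_recent = data_sorted.groupby(level='account_number').tail(1)  # last row per account = latest date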

Appending two dataframes with multiindex

I have two dataframes, each with a multiindex. The multiindex levels share names, but are in a different order. When I append or concat, I would expect pandas to line up the indices just like it aligns index-less columns before appending. Is there a function or an argument I can pass to append or concat to get this to work in the way I desire (and that I think ought to be standard)?
import pandas as pd
df1 = pd.DataFrame(data = {'Name':['Bob','Ann','Sally'], 'Acct':['Savings','Savings','Checking'], 'Value':[101,102,103]})
df1 = df1.set_index(['Name','Acct'])
print(df1)
df2 = pd.DataFrame(data = {'Acct':['Savings','Savings','Checking'], 'Name':['Bob','Ann','Sally'], 'Value':[201,202,203]})
df2 = df2.set_index(['Acct','Name'])
print(df2)
print(df1.append(df2))
print(pd.concat([df1,df2]))
Value
Name Acct
Bob Savings 101
Ann Savings 102
Sally Checking 103
Value
Acct Name
Savings Bob 201
Ann 202
Checking Sally 203
Value
Name Acct
Bob Savings 101
Ann Savings 102
Sally Checking 103
Savings Bob 201
Ann 202
Checking Sally 203
Value
Name Acct
Bob Savings 101
Ann Savings 102
Sally Checking 103
Savings Bob 201
Ann 202
Checking Sally 203
As you can see, after appending or concatenating, my combined index appears to show that, for example, "Sally" is an account, not a name. I'm aware that if I put the index levels in the same order when setting index, I'll get what I want, and that I could reset the index on the frames to align them, but I'm hoping there's a more intuitive way to get the indices to align on name, not on position.
Somewhat of a workaround: you can reset_index on both data sets, concat them, then set_index:
print(pd.concat(
    [df1.reset_index(), df2.reset_index()],
    sort=False
).set_index(['Name', 'Acct']))
Value
Name Acct
Bob Savings 101
Ann Savings 102
Sally Checking 103
Bob Savings 201
Ann Savings 202
Sally Checking 203
Though I'm not sure why you would want to have multiple rows with the same index...
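Another option that keeps the MultiIndexes in place is to reorder the levels of one frame by name before concatenating (DataFrame.reorder_levels accepts level names); a minimal sketch:
aligned = df2.reorder_levels(df1.index.names)  # put df2's levels in Name, Acct order
print(pd.concat([df1, aligned]))
Note also that DataFrame.append was deprecated and removed in pandas 2.0, so pd.concat is the way to go either way.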

Pandas groupby on many columns with agg()

I've been asked to analyze the DB from a medical record app. A bunch of records look like this (see the screenshot in the original post):
I have to resume more than 3 million records from 2011 to 2014 by PX. I know they repeat, since that's the ID for each patient, so a patient would have had many visits to the doctor. How can I group them or resume them by patient?
I don't know what you mean by "resume", but it looks like all you want to do is sort and display the data in a nicer way. You can visually group (= order) the records px- and fecha-wise like this:
df.set_index(['px', 'fecha'], inplace=True)
EDIT:
When you perform a grouping of the data based on some common property, you have to decide what kind of aggregation you are going to use on the data in the other columns. Simply speaking, once you perform a groupby you only have one field left in each remaining column for each pacient_id, so you must use some aggregation function (e.g. sum, mean, min, max, count, ...) that returns a representative value of the grouped data.
It is hard to work with your data since it is locked in an image, and it is impossible to tell what you mean by "Age" since that column is not visible, but I hope you can achieve what you want by looking at the following example with dummy data:
import pandas as pd
import numpy as np
from datetime import datetime
import random
from datetime import timedelta

def random_datetime_list_generator(start_date, end_date, n):
    return (start_date + timedelta(seconds=random.randint(0, int((end_date - start_date).total_seconds())))
            for i in range(n))

# create a random dataframe with 4 sample columns and 50000 rows
rows = 50000
pacient_id = np.random.randint(100, 200, rows)
dates = random_datetime_list_generator(pd.to_datetime("2011-01-01"), pd.to_datetime("2014-12-31"), rows)
age = np.random.randint(10, 80, rows)
bill = np.random.randint(1, 1000, rows)
df = pd.DataFrame(columns=["pacient_id", "visited", "age", "bill"],
                  data=list(zip(pacient_id, dates, age, bill)))
print(df.head())
# 1. Only perform statistics on the last visit of each pacient
stats = df.groupby("pacient_id", as_index=False)["visited"].max()
stats.columns = ["pacient_id", "last_visited"]
print(stats)
# 2. Perform a bit more complex statistics per pacient by specifying the desired aggregate function for each column
# (note: this nested-dict renaming syntax was removed in newer pandas; see the named-aggregation sketch after the output below)
custom_aggregation = {'visited': {"first visit": 'min', "last visit": "max"},
                      'bill': {"average bill": "mean"},
                      'age': 'mean'}
# perform a group by with custom aggregation and renaming of functions
stats = df.groupby("pacient_id").agg(custom_aggregation)
# round floats
stats = stats.round(1)
print(stats)
The original dummy dataframe looks like this:
pacient_id visited age bill
0 150 2012-12-24 21:34:17 20 188
1 155 2012-10-26 00:34:45 17 672
2 116 2011-11-28 13:15:18 33 360
3 126 2011-06-03 17:36:10 58 167
4 165 2013-07-15 15:39:31 68 815
First aggregate would look like this:
pacient_id last_visited
0 100 2014-12-29 00:01:11
1 101 2014-12-22 06:00:48
2 102 2014-12-26 11:51:41
3 103 2014-12-29 15:01:32
4 104 2014-12-18 15:29:28
5 105 2014-12-30 11:08:29
Second, complex aggregation would look like this:
visited age bill
first visit last visit mean average bill
pacient_id
100 2011-01-06 06:11:33 2014-12-29 00:01:11 45.2 507.9
101 2011-01-01 20:44:55 2014-12-22 06:00:48 44.0 503.8
102 2011-01-02 17:42:59 2014-12-26 11:51:41 43.2 498.0
103 2011-01-01 03:07:41 2014-12-29 15:01:32 43.5 495.1
104 2011-01-07 18:58:11 2014-12-18 15:29:28 45.9 501.7
105 2011-01-01 03:43:12 2014-12-30 11:08:29 44.3 513.0
This example should get you going. Additionally, there is a nice SO question about pandas groupby aggregation which may teach you a lot on this topic.
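One caveat on the second aggregation: the nested-dict renaming syntax ({'visited': {'first visit': 'min', ...}}) was deprecated and then removed in pandas 0.25. On current versions the same result (with flat column names) can be obtained with named aggregation; a rough equivalent sketch:
# named aggregation (pandas >= 0.25) replacement for the nested-dict renaming above
stats = df.groupby("pacient_id").agg(
    first_visit=("visited", "min"),
    last_visit=("visited", "max"),
    average_bill=("bill", "mean"),
    mean_age=("age", "mean"),
).round(1)
print(stats)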

How to retrieve user ids (without duplicates) from a DataFrame and store them in another DataFrame for later use

I have the table below in a pandas dataframe:
date user_id val1 val2
01/01/2014 00:00:00 1 1790 12
01/02/2014 00:00:00 3 364 15
01/03/2014 00:00:00 2 280 10
02/04/2000 00:00:00 5 259 24
05/05/2003 00:00:00 4 201 39
02/05/2001 00:00:00 5 559 54
05/03/2003 00:00:00 4 231 69
..
The table was extracted from a .csv file using the following code:
import pandas as pd
newnames = ['date','user_id', 'val1', 'val2']
df = pd.read_csv('expenses.csv', names = newnames, index_col = 'date')
I have to analyse the profile of each user and/or of the whole set.
For this purpose, I would like to know how I can store, at this stage, all user_id values (without duplicates) in another dataframe df_user_id (which I could use at the end in a loop in order to display the results for each user id).
I'm confused about your big-picture goal, but if you want to store all the unique user IDs, that probably should not be a DataFrame. (What would the index mean? And why would there need to be multiple columns?) A simple numpy array would suffice -- or a Series if you have some reason to need pandas' methods.
To get a numpy array of the unique user ids:
user_ids = df['user_id'].unique()
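If the plan is to loop over those ids later to display per-user results, here is a small sketch of how that array might be used, alongside the usually more idiomatic single groupby (the mean of val1/val2 is just a stand-in for whatever per-user analysis you need):
for uid in user_ids:
    user_rows = df[df['user_id'] == uid]            # all rows for this user
    print(uid, user_rows[['val1', 'val2']].mean())

# often the same per-user summary can be computed in one shot
print(df.groupby('user_id')[['val1', 'val2']].mean())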
