Pandas groupby on many columns with agg() - python

I've been asked to analyze the DB from a medical record app, so a bunch of records would look like the attached image.
I have to resume more than 3 million records from 2011 to 2014 by PX. I know the PX values repeat, since that is the ID for each patient, so a patient may have had many visits to the doctor. How could I group or resume them by patient?

I don't know what you mean by "resume", but it looks like all you want to do is sort and display the data in a nicer way. You can visually group (= order) the records "px- and fecha-wise" like this:
df.set_index(['px', 'fecha'], inplace=True)
EDIT:
When you group the data based on some common property, you have to decide what kind of aggregation you are going to use on the data in the other columns. Simply speaking, once you perform a groupby, only one field is left in each remaining column for each "pacient_id", so you must use some aggregation function (e.g. sum, mean, min, max, count, ...) that returns a representative value of the grouped data.
It is hard to work with your data since they are locked in an image, and it is impossible to tell what you mean by "Age" since this column is not visible, but I hope you can achieve what you want by looking at the following example with dummy data:
import pandas as pd
import numpy as np
import random
from datetime import timedelta

def random_datetime_list_generator(start_date, end_date, n):
    # generate n random datetimes between start_date and end_date
    return (start_date + timedelta(seconds=random.randint(0, int((end_date - start_date).total_seconds())))
            for i in range(n))

# create a random dataframe with 4 sample columns and 50000 rows
rows = 50000
pacient_id = np.random.randint(100, 200, rows)
dates = random_datetime_list_generator(pd.to_datetime("2011-01-01"), pd.to_datetime("2014-12-31"), rows)
age = np.random.randint(10, 80, rows)
bill = np.random.randint(1, 1000, rows)
df = pd.DataFrame(columns=["pacient_id", "visited", "age", "bill"],
                  data=list(zip(pacient_id, dates, age, bill)))
print(df.head())
# 1. Only perform statistics on the last visit of each pacient
stats = df.groupby("pacient_id", as_index=False)["visited"].max()
stats.columns = ["pacient_id", "last_visited"]
print(stats)

# 2. Perform a bit more complex statistics per pacient by specifying the desired aggregate function for each column
# (note: this nested-dict renaming syntax is deprecated/removed in newer pandas; see the named-aggregation sketch below the output)
custom_aggregation = {'visited': {"first visit": 'min', "last visit": "max"},
                      'bill': {"average bill": "mean"},
                      'age': 'mean'}
# perform a groupby with custom aggregation and renaming of the resulting columns
stats = df.groupby("pacient_id").agg(custom_aggregation)
# round floats
stats = stats.round(1)
print(stats)
The original dummy dataframe looks like this:
pacient_id visited age bill
0 150 2012-12-24 21:34:17 20 188
1 155 2012-10-26 00:34:45 17 672
2 116 2011-11-28 13:15:18 33 360
3 126 2011-06-03 17:36:10 58 167
4 165 2013-07-15 15:39:31 68 815
The first aggregation would look like this:
pacient_id last_visited
0 100 2014-12-29 00:01:11
1 101 2014-12-22 06:00:48
2 102 2014-12-26 11:51:41
3 103 2014-12-29 15:01:32
4 104 2014-12-18 15:29:28
5 105 2014-12-30 11:08:29
The second, more complex aggregation would look like this:
visited age bill
first visit last visit mean average bill
pacient_id
100 2011-01-06 06:11:33 2014-12-29 00:01:11 45.2 507.9
101 2011-01-01 20:44:55 2014-12-22 06:00:48 44.0 503.8
102 2011-01-02 17:42:59 2014-12-26 11:51:41 43.2 498.0
103 2011-01-01 03:07:41 2014-12-29 15:01:32 43.5 495.1
104 2011-01-07 18:58:11 2014-12-18 15:29:28 45.9 501.7
105 2011-01-01 03:43:12 2014-12-30 11:08:29 44.3 513.0
This example should get you going. Additionally, there is a nice SO question about pandas groupby aggregation which may teach you a lot on this topic.
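If you are on a newer pandas release where the nested-dict renaming above raises a SpecificationError, a roughly equivalent sketch using named aggregation (available since pandas 0.25; the flat output column names here are my own choice) would be:
stats = df.groupby("pacient_id").agg(
    first_visit=("visited", "min"),
    last_visit=("visited", "max"),
    average_bill=("bill", "mean"),
    age=("age", "mean"),
).round(1)
print(stats)
Unlike the nested-dict version, this produces a flat column index rather than a MultiIndex.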

Related

Calculate difference between dates for sequential pandas rows based on conditional column value

I need to find the number of days between a request date and its most recent offer date for each apartment number. My example dataframe looks like the first 3 columns below and I'm trying to figure out how to calculate the 'days_since_offer' column. The apartment and or_date columns are already sorted.
apartment offer_req_type or_date days_since_offer
A request 12/4/2019 n/a
A request 12/30/2019 n/a
A offer 3/4/2020 0
A request 4/2/2020 29
A request 6/4/2020 92
A request 8/4/2020 153
A offer 12/4/2020 0
A request 1/1/2021 28
B offer 1/1/2019 0
B request 8/1/2019 212
B offer 10/1/2019 0
B request 1/1/2020 92
B request 9/1/2020 244
B offer 1/1/2021 0
B request 1/25/2021 24
I tried to create a new function which sort of gives me what I want if I pass it dates for a single apartment. When I use the apply function it gives me an error though: "SpecificationError: Function names must be unique if there is no new column names assigned".
def func(attr, date_ser):
    offer_dt = date(1900, 1, 1)
    lapse_days = []
    for row in range(len(attr)):
        if attr[row] == 'offer':
            offer_dt = date_ser[row]
            lapse_days.append(-1)
        else:
            lapse_days.append(date_ser[row] - offer_dt)
    print(lapse_days)
    return lapse_days

df['days_since_offer'] = df.apply(func(df['offer_req_type'], df['or_date']))
I also tried to use groupby + diff functions like this and this but it's not the answer that I need:
df.groupby('offer_req_type').or_date.diff().dt.days
I also looked into using the shift method, but I'm not necessarily looking at sequential rows every time.
Any pointers on why my function is failing or if there is a better way to get the date differences that I need using a groupby method would be helpful!
I have played around and I am certainly not claiming this to be the best way. I used df.apply() (edit: see below for an alternative without df.apply()).
import numpy as np
import pandas as pd

# SNIP: removed the df creation part for brevity.
df["or_date"] = pd.to_datetime(df["or_date"])
df.drop("days_since_offer", inplace=True, axis="columns")

def get_last_offer(row: pd.Series, df: pd.DataFrame):
    if row["offer_req_type"] == "offer":
        return
    # all offers for the same apartment that happened before this row's date
    temp_df = df[(df.apartment == row['apartment']) & (df.offer_req_type == "offer") & (df.or_date < row["or_date"])]
    if temp_df.empty:
        return
    else:
        x = row["or_date"]
        y = temp_df.iloc[-1:, -1:]["or_date"].values[0]  # or_date of the most recent earlier offer
        return x - y

df["days_since_offer"] = df.apply(lambda row: get_last_offer(row, df), axis=1)
print(df)
This returns the following df:
apartment offer_req_type or_date days_since_offer
0 A request 2019-12-04 NaT
1 A request 2019-12-30 NaT
2 A offer 2020-03-04 NaT
3 A request 2020-04-02 29 days
4 A request 2020-06-04 92 days
5 A request 2020-08-04 153 days
6 A offer 2020-12-04 NaT
7 A request 2021-01-01 28 days
8 B offer 2019-01-01 NaT
9 B request 2019-08-01 212 days
10 B offer 2019-10-01 NaT
11 B request 2020-01-01 92 days
12 B request 2020-09-01 336 days
13 B offer 2021-01-01 NaT
14 B request 2021-01-25 24 days
EDIT
I was wondering whether I couldn't find a way without using df.apply(). I ended up with the following lines (replace everything from the line def get_last_offer() onward in the previous code bit):
df["offer_dates"] = np.where(df['offer_req_type'] == 'offer', df['or_date'], pd.NaT)
# OLD: df["offer_dates"].ffill(inplace=True)
df["offer_dates"] = df.groupby("apartment")["offer_dates"].ffill()
df["diff"] = pd.to_datetime(df["or_date"]) - pd.to_datetime(df["offer_dates"])
df.drop("offer_dates", inplace=True, axis="columns")
This creates a helper column (df['offer_dates']) which is filled for every row whose offer_req_type is 'offer'. Then it is forward-filled per apartment, meaning that every NaT value is replaced with the previous valid value. Then we calculate the df['diff'] column, with the exact same result. I like this bit better because it is cleaner and it has 4 lines rather than 12 lines of code :)
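If you want plain integer day counts like in the question's desired output rather than Timedelta values, one small addition (a sketch building on the diff column above) is to take .dt.days:
# turn the Timedelta column into whole days; rows without a prior offer stay as NaN
df["days_since_offer"] = df["diff"].dt.days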

Filter a dataframe by first call date

I am trying to filter a dataframe by the first call date, so as to remove the other rows that come after that first call.
client_id date duration_in_sec incoming_number avg
4 13/01/2016 94 0632108564 55.5
4 15/01/2016 17 0632108564 55.5
5 13/01/2016 339 0699309366 339.0
I am trying to keep only the row showing when each client made their first call, so in the dataframe above I would keep, for client 4, only the row where the date is 13/01/2016, since that is his first call date.
I've tried doing a groupby with the minimum date but couldn't get good results.
Use -
df.loc[df.groupby('client_id')['date'].idxmin(), :]
Output
client_id date duration_in_sec incoming_number avg
0 4 2016-01-13 94 632108564 55.5
2 5 2016-01-13 339 699309366 339.0
This is given that your date column is cast as datetime:
df['date'] = pd.to_datetime(df['date'])
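As an alternative sketch (assuming the same column names), you could also sort by date and keep the first row per client with drop_duplicates:
# keep each client's earliest call by sorting on date and dropping later duplicates
first_calls = df.sort_values('date').drop_duplicates('client_id', keep='first')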

Return multiple rows for every row in a DataFrame in Pandas

Here is the task I would like to perform: I have a list of about 7000 sites and 50 categories each, with a sales plan per combination every month. I want to convert this monthly plan into a daily one and compare it with actuals to create a Power BI visual; for this I need to convert the plan data to daily.
Here is the sample:
df = pd.DataFrame({'ID': [1, 2],
                   'Month': [1, 1],
                   'Plan': [310, 620],
                   'Month_start_date': ['2020-01-01', '2020-01-01']})
print(df)
df['Month_start_date'] = (pd.to_datetime(df['Month_start_date'], format='%Y-%m-%d')
                          .dt.to_period('m').dt.to_timestamp())
df = df.set_index('Month_start_date')
Now, the function I want to apply to each row returns more rows than it receives; here is a sample:
start = '2020-01-01'
end = '2020-01-05'
dates = pd.date_range(start, end, freq='D')
dates
df= df.reindex(dates,method = 'ffill')
This raises an error because the index has duplicate values:
ValueError: cannot reindex a non-unique index with a method or limit
Here is my desired output
ID Month Plan
2020-01-01 1 1 310
2020-01-02 1 1 310
2020-01-03 1 1 310
2020-01-04 1 1 310
2020-01-05 1 1 310
2020-01-01 2 1 620
2020-01-02 2 1 620
2020-01-03 2 1 620
2020-01-04 2 1 620
2020-01-05 2 1 620
Since the number of combinations I have to run this for is about 800K in reality, running it in loops (using .iterrows()) takes forever to complete and seems very inefficient.
I also tried using the .groupby().apply() function, but it doesn't allow me to return a dataframe for every row (of table df).
Suggestions needed to improve this process.
The sort_index() function should hopefully achieve what you're looking for, since your dates are in the index:
df.sort_index(inplace=True)
Or, if your dates had a column name, you'd pass that column name to sort_values() instead, and you could even pair it with groupby to create dataframes just for certain sorted groups since your dataset is so large. I hope this helps a little!
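For the monthly-to-daily expansion itself, a hedged sketch (assuming the df built above; the date range is illustrative) is to group by ID so each group has a unique index, then reindex each group to a daily range:
def to_daily(group):
    # each ID has a single Month_start_date index value, so reindex + ffill fills the whole range
    dates = pd.date_range('2020-01-01', '2020-01-05', freq='D')
    return group.reindex(dates, method='ffill')

daily = df.groupby('ID', group_keys=False).apply(to_daily)
print(daily)
Because each group's index is unique, the "cannot reindex a non-unique index" error from the question no longer occurs.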

Get the average mean of entries per month with datetime in Pandas

I have a large df with many entries per month. I would like to see the average number of entries per month, to see, for example, if there are months that normally have more entries. (Ideally I'd like to plot this with a line for the overall mean to compare against, but that is maybe a later question.)
My df is something like this:
ufo=pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo['Time']=pd.to_datetime(ufo.Time)
Where the head looks like this:
So if I'd like to see whether there are more ufo sightings in the summer, as an example, how would I go about it?
I have tried:
ufo.groupby(ufo.Time.month).mean()
But it only works if I am calculating a numerical value. If I use count() instead, I get the total number of entries for each month rather than the average.
EDIT: To clarify, I would like to have the mean of entries - ufo-sightings - per month.
You could do something like this:
# count the number of years covered by the records (per month group)
def total_month(x):
    return x.max().year - x.min().year + 1

new_df = ufo.groupby(ufo.Time.dt.month).Time.agg(['size', total_month])
new_df['mean_count'] = new_df['size'] / new_df['total_month']
Output:
size total_month mean_count
Time
1 862 57 15.122807
2 817 70 11.671429
3 1096 55 19.927273
4 1045 68 15.367647
5 1168 53 22.037736
6 3059 71 43.084507
7 2345 65 36.076923
8 1948 64 30.437500
9 1635 67 24.402985
10 1723 65 26.507692
11 1509 50 30.180000
12 1034 56 18.464286
I think this is what you are looking for; please ask for clarification if it isn't what you need.
# add a new column 'instance'; each ufo sighting gets the value 1 so it can be counted
ufo['instance'] = 1
# set the index to Time; this makes the df a time series df so pandas time series functions can be applied
ufo.set_index(ufo['Time'], drop=True, inplace=True)
# create another df by resampling the original df by month ('M') and counting the instance column
ufo2 = pd.DataFrame(ufo['instance'].resample('M').count())
# find the month of each resampled observation
ufo2['Time'] = pd.to_datetime(ufo2.index.values)
ufo2['month'] = ufo2['Time'].apply(lambda x: x.month)
and finally you can groupby month :)
ufo2.groupby(by='month').mean()
and this is what the output looks like:
month mean_instance
1 12.314286
2 11.671429
3 15.657143
4 14.928571
5 16.685714
6 43.084507
7 33.028169
8 27.436620
9 23.028169
10 24.267606
11 21.253521
12 14.563380
Do you mean you want to group your data by month? I think we can do this
ufo['month'] = ufo['Time'].apply(lambda t: t.month)
ufo['year'] = ufo['Time'].apply(lambda t: t.year)
In this way, you will have 'year' and 'month' to group your data.
ufo_2 = ufo.groupby(['year', 'month'])['place_holder'].mean()
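If the goal is the average number of sightings per calendar month (as in the EDIT above), a compact sketch building on this year/month grouping would be to count rows per (year, month) and then average those counts for each month:
# count sightings per (year, month), then average the yearly counts for each calendar month
monthly_counts = ufo.groupby(['year', 'month']).size()
avg_per_month = monthly_counts.groupby(level='month').mean()
print(avg_per_month)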

How to retrieve user ids (without duplicates) from a DataFrame and store them in another DataFrame for later use

I have the table below in a pandas dataframe:
date user_id val1 val2
01/01/2014 00:00:00 1 1790 12
01/02/2014 00:00:00 3 364 15
01/03/2014 00:00:00 2 280 10
02/04/2000 00:00:00 5 259 24
05/05/2003 00:00:00 4 201 39
02/05/2001 00:00:00 5 559 54
05/03/2003 00:00:00 4 231 69
..
The table was extracted from a .csv file using the following code:
import pandas as pd
newnames = ['date','user_id', 'val1', 'val2']
df = pd.read_csv('expenses.csv', names = newnames, index_col = 'date')
I have to analyse the profile of each user and/or of the whole dataset.
For this purpose, I would like to know how I can store, at this stage, all user_id values (without duplicates) in another dataframe df_user_id (which I could use at the end in a loop in order to display the results for each user id).
I'm confused about your big-picture goal, but if you want to store all the unique user IDs, that probably should not be a DataFrame. (What would the index mean? And why would there need to be multiple columns?) A simple numpy array would suffice -- or a Series if you have some reason to need pandas' methods.
To get a numpy array of the unique user ids:
user_ids = df['user_id'].unique()
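As a hypothetical usage sketch (the per-user statistics shown here are only illustrative, not part of the question), you could then loop over those ids to display results for each user:
# iterate over each unique user id and work with that user's rows only
for uid in user_ids:
    user_rows = df[df['user_id'] == uid]
    print(uid, user_rows[['val1', 'val2']].mean().to_dict())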
