I have a Pandas script that counts the number of readmissions to hospital within 30 days based on a few conditions. I wonder if it could be vectorized to improve performance. I've experimented with df.rolling().apply, but so far without luck.
Here's a table with contrived data to illustrate:
ID VISIT_NO ARRIVED LEFT HAD_A_MASSAGE BROUGHT_A_FRIEND
1 1 29/02/1996 01/03/1996 0 1
1 2 01/12/1996 04/12/1996 1 0
2 1 20/09/1996 21/09/1996 1 0
3 1 27/06/1996 28/06/1996 1 0
3 2 04/07/1996 06/07/1996 0 1
3 3 16/07/1996 18/07/1996 0 1
4 1 21/02/1996 23/02/1996 0 1
4 2 29/04/1996 30/04/1996 1 0
4 3 02/05/1996 02/05/1996 0 1
4 4 02/05/1996 03/05/1996 0 1
5 1 03/10/1996 05/10/1996 1 0
5 2 07/10/1996 08/10/1996 0 1
5 3 10/10/1996 11/10/1996 0 1
First, I create a dictionary with IDs:
ids = massage_df[massage_df['HAD_A_MASSAGE'] == 1]['ID']
id_dict = {id:0 for id in ids}
Everybody in this table has had a massage, but in my real dataset, not all people are so lucky.
Next, I run this bit of code:
from pandas.tseries.offsets import DateOffset

for grp, df in massage_df.groupby(['ID']):
    date_from = df.loc[df[df['HAD_A_MASSAGE'] == 1].index, 'LEFT']
    date_to = date_from + DateOffset(days=30)
    mask = ((date_from.values[0] < df['ARRIVED']) &
            (df['ARRIVED'] <= date_to.values[0]) &
            (df['BROUGHT_A_FRIEND'] == 1))
    if len(df[mask]) > 0:
        id_dict[df['ID'].iloc[0]] = len(df[mask])
Basically, I want to count the number of times when someone originally came in for a massage (single or with a friend) and then came back within 30 days with a friend. The expected results for this table would be a total of 6 readmissions for IDs 3, 4 and 5.
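One possible vectorized sketch (an assumption, not the only way): pair every massage visit with every friend-accompanied visit of the same ID via a self-merge, then keep the pairs that fall inside the 30-day window. This assumes ARRIVED/LEFT are already parsed as datetimes (e.g. via pd.to_datetime(..., dayfirst=True)); with several massage visits per ID it counts each massage/return pair once.
import pandas as pd

# Massage visits contribute the window start; friend visits are candidate returns.
massages = massage_df.loc[massage_df['HAD_A_MASSAGE'] == 1, ['ID', 'LEFT']]
returns = massage_df.loc[massage_df['BROUGHT_A_FRIEND'] == 1, ['ID', 'ARRIVED']]

# Cartesian product within each ID, then filter to the 30-day window.
pairs = massages.merge(returns, on='ID')
in_window = ((pairs['ARRIVED'] > pairs['LEFT']) &
             (pairs['ARRIVED'] <= pairs['LEFT'] + pd.Timedelta(days=30)))

counts = pairs[in_window].groupby('ID').size()
id_dict.update(counts.to_dict())
On the sample table this yields 2 readmissions each for IDs 3, 4 and 5, i.e. the expected total of 6.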
I am a beginner, and I really need help with the following:
I need to do something similar to Identifying consecutive occurrences of a value, but on a two-dimensional dataframe.
I need to use that answer, but for a two-dimensional dataframe: I need to count at least 2 consecutive ones along the columns dimension. Here is a sample dataframe:
my_df=
0 1 2
0 1 0 1
1 0 1 0
2 1 1 1
3 0 0 1
4 0 1 0
5 1 1 0
6 1 1 1
7 1 0 1
The output I am looking for is:
0 1 2
0 3 5 4
Instead of the column 'consecutive' from that answer, I need a new output called "out_1_df" for the line
df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
So that later I can do
threshold = 2
out_2_df = (out_1_df > threshold).astype(int)
I tried the following:
out_1_df = my_df.groupby((my_df != my_df.shift(axis=0)).cumsum(axis=0))
out_2_df = (out_1_df > threshold).astype(int)
How can I modify this?
Try:
import pandas as pd
df=pd.DataFrame({0:[1,0,1,0,0,1,1,1], 1:[0,1,1,0,1,1,1,0], 2: [1,0,1,1,0,0,1,1]})
out_2_df=((df.diff(axis=0).eq(0)|df.diff(periods=-1,axis=0).eq(0))&df.eq(1)).sum(axis=0)
>>> out_2_df
0    3
1    5
2    4
dtype: int64
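If the intermediate out_1_df is actually needed, a column-wise version of the run-length line from the linked answer could look like the sketch below (my reading; run_lengths is just an illustrative helper name, and "at least 2 consecutive ones" means comparing with >= threshold rather than >):
# Per column: length of the run each value sits in, times the value itself,
# so zeros stay zero and ones carry their run length.
def run_lengths(col):
    return col.groupby((col != col.shift()).cumsum()).transform('size') * col

out_1_df = df.apply(run_lengths)                 # run lengths per column
threshold = 2
out_2_df = (out_1_df >= threshold).astype(int)   # ones in runs of length >= 2
out_2_df.sum(axis=0)                             # same totals: 3, 5, 4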
I have data about product sales (1 column per product) at the customer level (1 row per customer).
I'm assessing which customers are more likely to be interested in a specific product. I have a list of the 10 most correlated products. (and I have this for multiple products, so I'm trying to build a scalable approach).
I'm trying to score all customers based on how many of those 10 products they buy.
Let's say my list is:
prod_x_corr_prod
How can I create a scoring column (say prox_x_score) which goes through the 10 relevant columns and, for every row, adds 1 for each column with a value > 0?
For instance, if customer Y bought 3 of the products correlated with product X, he would have a score of 3 in the "prox_x_score" column.
EDIT: thanks to all of you for the feedback.
For customer 5 I would get a 2, while for customers 1, 2 and 3 I would get 1. For 4, 0.
You can do:
df['prox_x_score'] = (df[prod_x_corr_prod] > 0).sum(axis=1)
Example with dummy data:
import numpy as np
import pandas as pd
prod_x_corr_prod = ["prod{}".format(i) for i in range(1, 11)]
df = pd.DataFrame({col:np.random.choice([0,1], size=5) for col in prod_x_corr_prod})
df['prox_x_score'] = (df[prod_x_corr_prod] > 0).sum(axis=1)
print(df)
Output:
prod1 prod10 prod2 prod3 prod4 prod5 prod6 prod7 prod8 prod9 \
0 1 1 1 0 0 1 1 1 1 0
1 1 1 1 0 1 0 0 1 1 0
2 1 1 1 1 0 1 0 0 1 0
3 0 0 0 0 0 0 1 0 1 0
4 0 0 0 0 0 0 0 1 1 0
prox_x_score
0 7
1 6
2 6
3 2
4 2
This is what my data looks like:
Day Price A Price B Price C
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 64503 43692 79982
6 86664 69990 53468
7 77924 62998 68911
8 66600 68830 94396
9 82664 89972 49614
10 59741 48904 49528
11 34030 98074 72993
12 74400 85547 37715
13 51031 50031 85345
14 74700 59932 73935
15 62290 98130 88818
I have a small Python script that outputs a sum for each column. I want to input a value n (a number of days) and have the summing run and output the values.
However, given for example n=5 (days), I want to include the Price A/B/C rows only from the next day onwards (which is day 6). Hence, the row for Day 5 should be '0'.
How can I produce this logic in Pandas?
The idea I have is to use the input value n to truncate the values on the rows up to that particular day n. But how can I do this in code?
if dataframe['Day'] == n:
    dataframe['Price A'] == 0 & dataframe['Price B'] == 0 & dataframe['Price C'] == 0
You can filter rows by a condition and set all columns except the first via iloc[mask, 1:]; Series.shift moves the condition down one row so the cutoff lands on the next day:
n = 5
df.iloc[(df['Day'].shift() <= n).values, 1:] = 0
print (df)
Day Price A Price B Price C
0 1 0 0 0
1 2 0 0 0
2 3 0 0 0
3 4 0 0 0
4 5 0 0 0
5 6 0 0 0
6 7 77924 62998 68911
7 8 66600 68830 94396
8 9 82664 89972 49614
9 10 59741 48904 49528
10 11 34030 98074 72993
11 12 74400 85547 37715
12 13 51031 50031 85345
13 14 74700 59932 73935
14 15 62290 98130 88818
Pseudo Code
Make sure to sort by day
shift columns 'A', 'B' and 'C' by n and fill in with 0 (see the sketch at the end of this answer)
Sum accordingly
All of that can be done on one line as well
It is simply
dataframe.iloc[:n+1] = 0
This sets the values of all columns for the first n+1 rows (days 1 through n+1) to 0
# Sample output
dataframe
a b
0 1 2
1 2 3
2 3 4
3 4 2
4 5 3
n = 1
dataframe.iloc[:n+1] = 0
dataframe
a b
0 0 0
1 0 0
2 3 4
3 4 2
4 5 3
This truncates all of the previous days. If you want to truncate only the nth day:
dataframe.iloc[n] = 0
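And a minimal sketch of the shift-based pseudo code above (my reading of the steps, assuming dataframe holds the table from the question, already sorted by 'Day'):
# Shift the price columns up by n so days 1..n drop off the top;
# fill_value=0 pads the n rows that fall off the bottom.
price_cols = ['Price A', 'Price B', 'Price C']
n = 5
totals = dataframe[price_cols].shift(-n, fill_value=0).sum()  # sums days n+1 onwards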
I have a df with badminton scores. Each set of a game for a team is on a row, with the score at each point in the columns, like so:
0 0 1 1 2 3 4
0 1 2 3 3 4 4
I want to obtain only 0 and 1 (a 1 whenever a point is scored), like so, to analyse whether there is any pattern in the points:
0 0 1 0 1 1 1
0 1 1 1 0 1 0
I was thinking of using df.itertuples() and iloc with conditions to assign 1 in a new dataframe if next score = score + 1, or 0 if next score = score.
But I don't know how to iterate through the generated tuples, nor how to generate my new df with the 0s and 1s in the right locations.
Hope that is clear; thanks for your help.
Oh also, any suggestions to analyse the patterns after that ?
You just need diff (if you need to convert it back, try cumsum):
df.diff(axis=1).fillna(0).astype(int)
Out[1382]:
1 2 3 4 5 6 7
0 0 0 1 0 1 1 1
1 0 1 1 1 0 1 0
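For the record, the cumsum round trip mentioned above, assuming each running score starts at 0 (as in the sample rows):
points = df.diff(axis=1).fillna(0).astype(int)  # 0/1 per point
scores = points.cumsum(axis=1)                  # recovers the running scores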
Say part of my dataframe df[(df['person_num'] == 1) | (df['person_num'] == 2) ] looks like this:
person_num Days IS_TRUE
1 1 1
1 4 1
1 5 0
1 9 1
2 1 1
2 4 1
2 5 0
2 9 1
And for each person_num, I want to count something like "how many IS_TRUE=1 rows occur within the seven days before a given day". So for Day 9, I count the number of IS_TRUE=1 rows from Day 2 to Day 8, and add the count to a new column IS_TRUE_7day_WINDOW. The result would be:
person_num Days IS_TRUE IS_TRUE_7day_WINDOW
1 1 1 0
1 4 1 1
1 5 0 2
1 9 1 1
2 1 1 0
2 4 1 1
2 5 0 2
2 9 1 1
I'm thinking about using something like this:
df.groupby('person_num').transform(pd.rolling_sum, window=7,min_periods=1)
But I think rolling_sum only works for datetime, and the code doesn't work for my dataframe. Is there an easy way to convert rolling_sum to work for integers (Days in my case)? Or are there other ways to quickly compute the column I want?
I used for loops to calculate IS_TRUE_7day_WINDOW, but it took me an hour to get the results since my dataframe is pretty large. I guess something like rolling_sum would speed up my old code.
You could implicitly do the for loop through vectorization, which will in general be faster than explicitly writing a for loop. Here's a working example on the data you provided:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Days': [1, 4, 5, 9, 1, 4, 5, 9],
                   'IS_TRUE': [1, 1, 0, 1, 1, 1, 0, 1],
                   'person_num': [1, 1, 1, 1, 2, 2, 2, 2]})

def window(group):
    # diff[i, j] = Days[i] - Days[j], for every pair of rows in the group
    diff = np.subtract.outer(group.Days, group.Days)
    # for each row, sum IS_TRUE over the rows 1-7 days earlier
    group['IS_TRUE_7day_WINDOW'] = np.dot((diff > 0) & (diff <= 7),
                                          group['IS_TRUE'])
    return group

df.groupby('person_num').apply(window)
Output is this:
Days IS_TRUE person_num IS_TRUE_7day_WINDOW
0 1 1 1 0
1 4 1 1 1
2 5 0 1 2
3 9 1 1 1
4 1 1 2 0
5 4 1 2 1
6 5 0 2 2
7 9 1 2 1
Since you mentioned the data frame derives from a database, consider an SQL solution using a subquery, which runs the calculation in the database engine rather than in Python.
The code below assumes a MySQL database, but adjust the library and connection string according to your actual backend (SQLite, PostgreSQL, SQL Server, etc.). The query itself should be ANSI-compliant SQL that runs in most RDBMS.
SQL Solution
import pandas as pd
import pymysql

conn = pymysql.connect(host="localhost", port=3306,
                       user="username", passwd="***", db="databasename")

sql = "SELECT t1.Days, t1.person_num, t1.IS_TRUE, \
           (SELECT COALESCE(SUM(t2.IS_TRUE), 0) \
            FROM TableName t2 \
            WHERE t2.person_num = t1.person_num \
              AND t2.Days >= t1.Days - 7 \
              AND t2.Days < t1.Days) AS IS_TRUE_7DAY_WINDOW \
       FROM TableName t1"
df = pd.read_sql(sql, conn)
OUTPUT
Days person_num IS_TRUE IS_TRUE_7DAY_WINDOW
1 1 1 0
4 1 1 1
5 1 0 2
9 1 1 1
1 2 1 0
4 2 1 1
5 2 0 2
9 2 1 1
The rolling functions like rolling_sum use the index of the DataFrame or Series when deciding how far to go back; it doesn't have to be a datetime index. Below is some code to find the calculation for each user.
First use crosstab to make a DataFrame with a column for each person_num and a row for each day.
>>> days_person = pd.crosstab(df['Days'],
                              df['person_num'],
                              values=df['IS_TRUE'],
                              aggfunc='sum')
>>> days_person
person_num  1  2
Days
1           1  1
4           1  1
5           0  0
9           1  1
Next I'm going to fill in the missing days with 0's, because you only have a few days of data.
>>> days_person = days_person.reindex(range(1, 10), fill_value=0)
>>> days_person
person_num 1 2
Days
1 1 1
2 0 0
3 0 0
4 1 1
5 0 0
6 0 0
7 0 0
8 0 0
9 1 1
Now use rolling_sum to get the table you're looking for. Note that days 1-6 will have NaN values, because there weren't enough previous days to do the calculation.
>>> pd.rolling_sum(days_person, 7)
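pd.rolling_sum has since been removed from pandas; a sketch of the same idea with the current .rolling accessor, shifted one row so each day's window covers the seven days strictly before it (which also avoids the leading NaNs):
# rolling(7) sums the current day and the 6 before it; shift(1) turns that
# into "the 7 days strictly before", and min_periods handles the short start
result = days_person.rolling(7, min_periods=1).sum().shift(1).fillna(0)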