Say part of my dataframe df[(df['person_num'] == 1) | (df['person_num'] == 2) ] looks like this:
person_num Days IS_TRUE
1 1 1
1 4 1
1 5 0
1 9 1
2 1 1
2 4 1
2 5 0
2 9 1
And for each person_num, I want to count something like "how many IS_TRUE=1 rows fall within the seven days before a given day". So for Day 9, I count the number of IS_TRUE=1s from Day 2 to Day 8, and add the count to a new column IS_TRUE_7day_WINDOW. The result would be:
person_num Days IS_TRUE IS_TRUE_7day_WINDOW
1 1 1 0
1 4 1 1
1 5 0 2
1 9 1 1
2 1 1 0
2 4 1 1
2 5 0 2
2 9 1 1
I'm thinking about using something like this:
df.groupby('person_num').transform(pd.rolling_sum, window=7,min_periods=1)
But I think rolling_sum only works for datetime, and the code doesn't work for my dataframe. Is there an easy way to convert rolling_sum to work for integers (Days in my case)? Or are there other ways to quickly compute the column I want?
I used for loops to calculate IS_TRUE_7day_WINDOW, but it took me an hour to get the results since my dataframe is pretty large. I guess something like rolling_sum would speed up my old code.
You could implicitly do the for loop through vectorization, which will in general be faster than explicitly writing a for loop. Here's a working example on the data you provided:
import pandas as pd
import numpy as np
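df = pd.DataFrame({'Days': [1, 4, 5, 9, 1, 4, 5, 9],
                   'IS_TRUE': [1, 1, 0, 1, 1, 1, 0, 1],
                   'person_num': [1, 1, 1, 1, 2, 2, 2, 2]})
def window(group):
    # Plain NumPy array, so that ufunc.outer gets an ndarray rather than a Series
    days = group['Days'].to_numpy()
    # diff[i, j] = Days[i] - Days[j] for every pair of rows in the group
    diff = np.subtract.outer(days, days)
    # For each row i, sum IS_TRUE over the rows j that lie 1 to 7 days earlier
    group['IS_TRUE_7day_WINDOW'] = np.dot((diff > 0) & (diff <= 7),
                                          group['IS_TRUE'])
    return group
df.groupby('person_num').apply(window)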
Output is this:
Days IS_TRUE person_num IS_TRUE_7day_WINDOW
0 1 1 1 0
1 4 1 1 1
2 5 0 1 2
3 9 1 1 1
4 1 1 2 0
5 4 1 2 1
6 5 0 2 2
7 9 1 2 1
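The pairwise difference matrix above holds n x n entries per group, which can get heavy when a single person_num has many rows. As a rough sketch of a lighter alternative (not part of the answer above), the same count can be read off a prefix sum with np.searchsorted. The helper name prior_window_sum and its width parameter are made up for this illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Days": [1, 4, 5, 9, 1, 4, 5, 9],
                   "IS_TRUE": [1, 1, 0, 1, 1, 1, 0, 1],
                   "person_num": [1, 1, 1, 1, 2, 2, 2, 2]})

def prior_window_sum(group, width=7):
    """Hypothetical helper: sum IS_TRUE over the `width` days before each row."""
    g = group.sort_values("Days")
    days = g["Days"].to_numpy()
    # prefix[k] = sum of the first k IS_TRUE values in day order
    prefix = np.concatenate(([0], g["IS_TRUE"].to_numpy().cumsum()))
    # Rows with Days in [day - width, day) occupy positions lo .. hi - 1
    lo = np.searchsorted(days, days - width, side="left")
    hi = np.searchsorted(days, days, side="left")
    return pd.Series(prefix[hi] - prefix[lo], index=g.index)

# Compute each group separately, then align back on the original index
parts = [prior_window_sum(g) for _, g in df.groupby("person_num")]
df["IS_TRUE_7day_WINDOW"] = pd.concat(parts)
print(df)
```

Each group costs a sort plus two binary searches per row, so memory stays linear in the group size.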
Since you mentioned that the data frame derives from a database, consider an SQL solution using a correlated subquery, which runs the calculation in the database engine rather than in Python.
The code below assumes a MySQL database, but adjust the library and connection string to your actual backend (SQLite, PostgreSQL, SQL Server, etc.). The query should be ANSI-syntax SQL, compliant in most RDBMSs.
SQL Solution
import pandas as pd
import pymysql
conn = pymysql.connect(host="localhost", port=3306,
                       user="username", passwd="***", db="databasename")
# Correlated subquery: for each row t1, sum IS_TRUE over the same person's rows
# from the 7 days before t1.Days. COALESCE is the ANSI spelling of MySQL's IFNULL.
sql = """SELECT t1.Days, t1.person_num, t1.IS_TRUE,
                (SELECT COALESCE(SUM(t2.IS_TRUE), 0)
                   FROM TableName t2
                  WHERE t2.person_num = t1.person_num
                    AND t2.Days >= t1.Days - 7
                    AND t2.Days < t1.Days) AS IS_TRUE_7DAY_WINDOW
           FROM TableName t1"""
df = pd.read_sql(sql, conn)
OUTPUT
Days person_num IS_TRUE IS_TRUE_7DAY_WINDOW
1 1 1 0
4 1 1 1
5 1 0 2
9 1 1 1
1 2 1 0
4 2 1 1
5 2 0 2
9 2 1 1
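To try the query without a MySQL server, here is a small sketch that runs it against an in-memory SQLite database from the standard library. The table name TableName comes from the answer. The inserted rows are the sample data from the question, and the ORDER BY clause is an addition to make the row order predictable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TableName (person_num INTEGER, Days INTEGER, IS_TRUE INTEGER)"
)
rows = [(1, 1, 1), (1, 4, 1), (1, 5, 0), (1, 9, 1),
        (2, 1, 1), (2, 4, 1), (2, 5, 0), (2, 9, 1)]
conn.executemany("INSERT INTO TableName VALUES (?, ?, ?)", rows)

# Same correlated subquery; COALESCE turns the NULL sum of an empty window into 0
sql = """
SELECT t1.person_num, t1.Days, t1.IS_TRUE,
       (SELECT COALESCE(SUM(t2.IS_TRUE), 0)
          FROM TableName t2
         WHERE t2.person_num = t1.person_num
           AND t2.Days >= t1.Days - 7
           AND t2.Days < t1.Days) AS IS_TRUE_7DAY_WINDOW
  FROM TableName t1
 ORDER BY t1.person_num, t1.Days
"""
result = conn.execute(sql).fetchall()
for row in result:
    print(row)
conn.close()
```

On a large table, an index on (person_num, Days) would probably help the subquery, though that depends on the engine.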
The rolling functions like rolling_sum use the index of the DataFrame or Series when seeing how far to go back; it doesn't have to be a datetime index. Below is some code to do the calculation for each person_num.
First use crosstab to make a DataFrame with a column for each person_num and a row for each day.
>>> days_person = pd.crosstab(data['days'],
...                           data['person_num'],
...                           values=data['is_true'],
...                           aggfunc='sum')
>>> days_person
person_num 1 2
days
1 1 1
4 1 1
5 0 0
9 1 1
Next I'm going to fill in missing days with 0's, because you only have a few days of data.
>>> empty_data = {n: [0]*9 for n in days_person.columns}
>>> days_person = (days_person + pd.DataFrame(empty_data, index=range(1, 10))).fillna(0)
>>> days_person
person_num 1 2
days
1 1 1
2 0 0
3 0 0
4 1 1
5 0 0
6 0 0
7 0 0
8 0 0
9 1 1
Now use a rolling sum to get the table you're looking for. Note that days 1-6 will have NaN values, because there weren't enough previous days to do the calculation.
>>> days_person.rolling(7).sum()  # spelled pd.rolling_sum(days_person, 7) in old pandas releases
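For reference, here is a self-contained sketch of the same steps using the .rolling() method that current pandas provides. The frame data and its lowercase column names follow the snippets above. The last step is an assumption about what the question asks for (only the days strictly before the current one, with partial windows allowed) and is not part of the answer itself:

```python
import pandas as pd

data = pd.DataFrame({
    "days": [1, 4, 5, 9, 1, 4, 5, 9],
    "is_true": [1, 1, 0, 1, 1, 1, 0, 1],
    "person_num": [1, 1, 1, 1, 2, 2, 2, 2],
})

# One row per day, one column per person_num
days_person = pd.crosstab(data["days"], data["person_num"],
                          values=data["is_true"], aggfunc="sum")

# Insert the missing days as all-zero rows so that one row equals one day
all_days = range(int(data["days"].min()), int(data["days"].max()) + 1)
days_person = days_person.reindex(all_days, fill_value=0)

# Sum over the current day and the six days before it (NaN for days 1-6)
rolled = days_person.rolling(7).sum()

# Assumed variant: allow partial windows, then shift down one row so that
# each day only sees the seven days strictly before it
prior = days_person.rolling(7, min_periods=1).sum().shift(1).fillna(0)
print(prior.loc[[1, 4, 5, 9]])
```

Reindexing to a full day range is what lets a row-count window stand in for a day-count window.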
I have a dataframe which is called "df". It looks like this:
a
0 2
1 3
2 0
3 5
4 1
5 3
6 1
7 2
8 2
9 1
I would like to produce a cumulative sum column which:
Sums the contents of column "a" cumulatively;
Until it gets a sum of "5";
Resets the cumsum total to 0 when it reaches a sum of "5", and continues with the summing process.
I would like the dataframe to look like this:
a a_cumm_sum
0 2 2
1 3 5
2 0 0
3 5 5
4 1 1
5 3 4
6 1 5
7 2 2
8 2 4
9 1 5
In the dataframe, the column "a_cumm_sum" contains the results of the cumulative sum.
Does anyone know how I can achieve this? I have hunted through the forums and saw similar questions, for example, this one, but they did not meet my exact requirements.
You can get the cumsum, and floor divide by 5. Then subtract the result of the floor division, multiplied by 5, from the below row's cumulative sum:
c = df['a'].cumsum()
g = 5 * (c // 5)
df['a_cumm_sum'] = (c.shift(-1) - g).shift().fillna(df['a']).astype(int)
df
Out[1]:
a a_cumm_sum
0 2 2
1 3 5
2 0 0
3 5 5
4 1 1
5 3 4
6 1 5
7 2 2
8 2 4
9 1 5
Solution #2 (more robust):
Per Trenton's comment, a good, diverse sample dataset goes a long way toward working out unbreakable logic for these types of problems; with one, I probably would have come up with a better solution the first time around. Here is a solution that handles the sample dataset Trenton mentioned in the comments. As shown, there are more conditions to handle because you have to deal with carry-over. On a large dataset, this would still be much more performant than a for-loop, but the logic is much more difficult to vectorize:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': {0: 2, 1: 4, 2: 1, 3: 5, 4: 1, 5: 3, 6: 1, 7: 2, 8: 2, 9: 1}})
# First pass: the same floor-division reset as in the first solution
c = df['a'].cumsum()
g = 5 * (c // 5)
df['a_cumm_sum'] = (c.shift(-1) - g).shift().fillna(df['a']).astype(int)
# Remaining passes: adjust for carry-over when a running total overshoots 5
over = (df['a_cumm_sum'].shift(1) - 5)
df['a_cumm_sum'] = df['a_cumm_sum'] - np.where(over > 0, df['a_cumm_sum'] - over, 0).cumsum()
s = np.where(df['a_cumm_sum'] < 0, df['a_cumm_sum']*-1, 0).cumsum()
df['a_cumm_sum'] = np.where((df['a_cumm_sum'] > 0) & (s > 0), s + df['a_cumm_sum'],
                            df['a_cumm_sum'])
df['a_cumm_sum'] = np.where(df['a_cumm_sum'] < 0, df['a_cumm_sum'].shift() + df['a'], df['a_cumm_sum'])
df
Out[2]:
a a_cumm_sum
0 2 2.0
1 4 6.0
2 1 1.0
3 5 6.0
4 1 1.0
5 3 4.0
6 1 5.0
7 2 2.0
8 2 4.0
9 1 5.0
The assignment can be combined with a condition. The code is as follows:
import numpy as np
import pandas as pd
a = [2, 3, 0, 5, 1, 3, 1, 2, 2, 1]
df = pd.DataFrame(a, columns=["a"])
df["cumsum"] = df["a"].cumsum()
df["new"] = df["cumsum"]%5
# Use .loc with a single mask instead of chained indexing, which may assign to a copy
df.loc[((df["cumsum"]/5)==(df["cumsum"]/5).astype(int)) & (df["a"]!=0), "new"] = 5
df
The output is as follows:
a cumsum new
0 2 2 2
1 3 5 5
2 0 5 0
3 5 10 5
4 1 11 1
5 3 14 4
6 1 15 5
7 2 17 2
8 2 19 4
9 1 20 5
Working:
Basically, take the remainder of the cumulative sum divided by 5. Rows where the cumulative sum is an exact multiple of 5 also come out as zero, so for these rows check whether value/5 == int(value/5) and set the result to 5. Finally, exclude the rows where the value of a itself is zero.
EDIT:
As Trenton McKinney pointed out in the comments, OP likely wanted to reset it to 0 whenever the cumsum exceeded 5. This makes the definition a recurrence, which is usually difficult to express with pandas/numpy (see David's solution). I'd recommend using numba to speed up the for loop in this case.
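To make the recurrence concrete, here is a minimal sketch in plain Python using itertools.accumulate. The function name reset_cumsum and the limit parameter are invented for this illustration, and it assumes the intended rule is "start over from the current value once the running total has reached or passed the limit":

```python
from itertools import accumulate

def reset_cumsum(values, limit=5):
    """Hypothetical helper: running sum that restarts after reaching `limit`."""
    # total is the previous output; restart from x once it has hit the limit
    return list(accumulate(values,
                           lambda total, x: x if total >= limit else total + x))

print(reset_cumsum([2, 3, 0, 5, 1, 3, 1, 2, 2, 1]))
print(reset_cumsum([2, 4, 1, 5, 1, 3, 1, 2, 2, 1]))
```

Because each output depends on the previous output rather than on the previous input, the rule does not reduce to a single cumsum, which is why a compiled loop is the usual suggestion for large inputs.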
Another alternative: using groupby
In [78]: df.groupby((df['a'].cumsum()% 5 == 0).shift().fillna(False).cumsum()).cumsum()
Out[78]:
a
0 2
1 5
2 0
3 5
4 1
5 4
6 5
7 2
8 4
9 5
You could try using this for loop:
lastvalue = 0
newcum = []
for i in df['a']:
    if lastvalue >= 5:
        # The previous total reached 5, so start over from the current value
        lastvalue = i
    else:
        lastvalue += i
    newcum.append(lastvalue)
df['a_cum_sum'] = newcum
print(df)
Output:
a a_cum_sum
0 2 2
1 3 5
2 0 0
3 5 5
4 1 1
5 3 4
6 1 5
7 2 2
8 2 4
9 1 5
The above for loop iterates through the a column, and when the cumulative sum is 5 or more, it resets it to 0 then adds the a column's value i, but if the cumulative sum is lower than 5, it just adds the a column's value i (the iterator).
I am a beginner, and I really need help with the following:
I need to do something similar to the question "Identifying consecutive occurrences of a value", but on a two-dimensional dataframe.
I need to use that answer, but for a two-dimensional dataframe. I need to count runs of at least 2 consecutive ones along the columns dimension. Here is a sample dataframe:
my_df=
0 1 2
0 1 0 1
1 0 1 0
2 1 1 1
3 0 0 1
4 0 1 0
5 1 1 0
6 1 1 1
7 1 0 1
The output I am looking for is:
0 1 2
0 3 5 4
Instead of the column 'consecutive', I need a new output called "out_1_df" for the line
df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
So that later I can do
threshold = 2;
out_2_df= (out_1_df > threshold).astype(int)
I tried the following:
out_1_df= my_df.groupby(( my_df != my_df.shift(axis=0)).cumsum(axis=0))
out_2_df = (out_1_df > threshold).astype(int)
How can I modify this?
Try:
import pandas as pd
df=pd.DataFrame({0:[1,0,1,0,0,1,1,1], 1:[0,1,1,0,1,1,1,0], 2: [1,0,1,1,0,0,1,1]})
out_2_df=((df.diff(axis=0).eq(0)|df.diff(periods=-1,axis=0).eq(0))&df.eq(1)).sum(axis=0)
>>> out_2_df
[3 5 4]
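If the intermediate out_1_df from the question is also wanted, one possible sketch applies the single-column recipe to each column in turn. The helper name run_lengths is made up here. Note that reproducing the expected 3 5 4 requires >= 2 rather than the question's > threshold:

```python
import pandas as pd

my_df = pd.DataFrame({0: [1, 0, 1, 0, 0, 1, 1, 1],
                      1: [0, 1, 1, 0, 1, 1, 1, 0],
                      2: [1, 0, 1, 1, 0, 0, 1, 1]})

def run_lengths(col):
    """Hypothetical helper: length of the run of ones each row belongs to."""
    # A new run starts wherever the value differs from the row above
    run_id = (col != col.shift()).cumsum()
    # Size of each run, zeroed out for runs of zeros
    return col.groupby(run_id).transform("size") * col

out_1_df = my_df.apply(run_lengths)
# Count the rows that sit inside a run of at least two ones
counts = (out_1_df >= 2).sum()
print(out_1_df)
print(counts.tolist())
```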
In continuation to my previous Question I need some more help.
The dataframe is like
time eve_id sub_id flag
0 5 2 0
1 5 2 0
2 5 2 1
3 5 2 1
4 5 2 0
5 4 25 0
6 4 30 0
7 5 2 1
I need to count the eve_id entries over each stretch of time before flag goes from 0 to 1,
and count the eve_id entries over each stretch where flag stays at 1.
The output will look like this:
time flag count
0 0 2
2 1 2
4 0 3
Can someone help me here?
First we make a grouper indicator which checks whether the difference between a row and the previous row is not equal to 0, which marks the start of a new run of flag values.
Then we groupby on this indicator and use agg. Since pandas 0.25.0 we have named aggregations:
s = df['flag'].diff().ne(0).cumsum()
grpd = df.groupby(s).agg(time=('time', 'first'),
                         flag=('flag', 'first'),
                         count=('flag', 'size')).reset_index(drop=True)
Output
time flag count
0 0 0 2
1 2 1 2
2 4 0 3
3 7 1 1
If time is your index, use:
grpd = df.assign(time=df.index).groupby(s).agg(time=('time', 'first'),
                                               flag=('flag', 'first'),
                                               count=('flag', 'size')).reset_index(drop=True)
Note: the extra row appears because the last row also differs from the row before it.
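To see what the grouper does, here is a short self-contained sketch that rebuilds the sample frame from the question and prints the run labels before aggregating. The variable name run_id is only for illustration; it plays the role of s above:

```python
import pandas as pd

df = pd.DataFrame({"time": range(8),
                   "eve_id": [5, 5, 5, 5, 5, 4, 4, 5],
                   "sub_id": [2, 2, 2, 2, 2, 25, 30, 2],
                   "flag": [0, 0, 1, 1, 0, 0, 0, 1]})

# diff() is NaN on the first row and non-zero wherever flag changes, so
# ne(0) marks the first row of every run and cumsum() numbers the runs
run_id = df["flag"].diff().ne(0).cumsum()
print(run_id.tolist())

grpd = df.groupby(run_id).agg(time=("time", "first"),
                              flag=("flag", "first"),
                              count=("flag", "size")).reset_index(drop=True)
print(grpd)
```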
Change aggregate function sum to GroupBy.size:
df1 = (df.groupby([df['flag'].ne(df['flag'].shift()).cumsum(), 'flag'])
         .size()
         .reset_index(level=0, drop=True)
         .reset_index(name='count'))
print (df1)
flag count
0 0 2
1 1 2
2 0 3
3 1 1
I want to "forward fill" the values of a new column in a DataFrame according to the first instance of a condition being satisfied. Here is a basic example:
import pandas as pd
import numpy as np
x1 = [1,2,4,-3,4,1]
df1 = pd.DataFrame({'x':x1})
I'd like to add a new column to df1, 'condition', where the value will be 1 upon the occurrence of a negative number, else 0, but I'd like the remaining values of the column to be set to 1 once the negative number is found.
So, I would look for the desired output as follows:
condition x
0 0 1
1 0 2
2 0 4
3 1 -3
4 1 4
5 1 1
No one's used cummax so far:
In [165]: df1["condition"] = (df1["x"] < 0).cummax().astype(int)
In [166]: df1
Out[166]:
x condition
0 1 0
1 2 0
2 4 0
3 -3 1
4 4 1
5 1 1
Using np.cumsum:
df1['condition'] = np.where(np.cumsum(np.where(df1['x'] < 0, 1, 0)) == 0, 0, 1)
Output:
x condition
0 1 0
1 2 0
2 4 0
3 -3 1
4 4 1
5 1 1
You can use Boolean series here:
df1['condition'] = (df1.index >= (df1['x'] < 0).idxmax()).astype(int)
print(df1)
x condition
0 1 0
1 2 0
2 4 0
3 -3 1
4 4 1
5 1 1
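One difference between these approaches shows up when the column contains no negative number at all: idxmax on an all-False Series returns the first index label, so the index comparison marks every row, while cummax leaves every row at 0. A small sketch of that edge case, where the frame no_neg is made up for the illustration:

```python
import pandas as pd

no_neg = pd.DataFrame({"x": [1, 2, 3]})  # hypothetical input with no negatives
is_neg = no_neg["x"] < 0

# cummax stays False until the first True, so nothing is flagged here
via_cummax = is_neg.cummax().astype(int)

# idxmax falls back to the first label when every value is False
via_idxmax = (no_neg.index >= is_neg.idxmax()).astype(int)

print(via_cummax.tolist())
print(list(via_idxmax))
```

If the data may lack a negative value, the idxmax version would need an extra guard such as checking is_neg.any() first.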
I have a Pandas script that counts the number of readmissions to hospital within 30 days based on a few conditions. I wonder if it could be vectorized to improve performance. I've experimented with df.rolling().apply, but so far without luck.
Here's a table with contrived data to illustrate:
ID VISIT_NO ARRIVED LEFT HAD_A_MASSAGE BROUGHT_A_FRIEND
1 1 29/02/1996 01/03/1996 0 1
1 2 01/12/1996 04/12/1996 1 0
2 1 20/09/1996 21/09/1996 1 0
3 1 27/06/1996 28/06/1996 1 0
3 2 04/07/1996 06/07/1996 0 1
3 3 16/07/1996 18/07/1996 0 1
4 1 21/02/1996 23/02/1996 0 1
4 2 29/04/1996 30/04/1996 1 0
4 3 02/05/1996 02/05/1996 0 1
4 4 02/05/1996 03/05/1996 0 1
5 1 03/10/1996 05/10/1996 1 0
5 2 07/10/1996 08/10/1996 0 1
5 3 10/10/1996 11/10/1996 0 1
First, I create a dictionary with IDs:
ids = massage_df[massage_df['HAD_A_MASSAGE'] == 1]['ID']
id_dict = {id:0 for id in ids}
Everybody in this table has had a massage, but in my real dataset, not all people are so lucky.
Next, I run this bit of code:
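from pandas.tseries.offsets import DateOffset
for grp, df in massage_df.groupby(['ID']):
    # LEFT dates of the visits where this ID had a massage
    date_from = df.loc[df[df['HAD_A_MASSAGE']==1].index, 'LEFT']
    date_to = date_from + DateOffset(days=30)
    # Visits with a friend that arrive within 30 days after the first massage visit
    mask = ((date_from.values[0] < df['ARRIVED']) &
            (df['ARRIVED'] <= date_to.values[0]) &
            (df['BROUGHT_A_FRIEND'] == 1))
    if len(df[mask]) > 0:
        id_dict[df['ID'].iloc[0]] = len(df[mask])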
Basically, I want to count the number of times when someone originally came in for a massage (single or with a friend) and then came back within 30 days with a friend. The expected results for this table would be a total of 6 readmissions for IDs 3, 4 and 5.
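One way the loop might be vectorized is sketched below. It mirrors the loop's use of values[0], so it assumes that only the first massage visit per ID opens the 30-day window. The frame is rebuilt from the table above, with day-first dates parsed explicitly:

```python
import pandas as pd

massage_df = pd.DataFrame({
    "ID": [1, 1, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5],
    "VISIT_NO": [1, 2, 1, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3],
    "ARRIVED": ["29/02/1996", "01/12/1996", "20/09/1996", "27/06/1996",
                "04/07/1996", "16/07/1996", "21/02/1996", "29/04/1996",
                "02/05/1996", "02/05/1996", "03/10/1996", "07/10/1996",
                "10/10/1996"],
    "LEFT": ["01/03/1996", "04/12/1996", "21/09/1996", "28/06/1996",
             "06/07/1996", "18/07/1996", "23/02/1996", "30/04/1996",
             "02/05/1996", "03/05/1996", "05/10/1996", "08/10/1996",
             "11/10/1996"],
    "HAD_A_MASSAGE": [0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0],
    "BROUGHT_A_FRIEND": [1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1],
})
for col in ("ARRIVED", "LEFT"):
    massage_df[col] = pd.to_datetime(massage_df[col], format="%d/%m/%Y")

# LEFT date of each ID's first massage visit (assumption: first massage only)
had_massage = massage_df["HAD_A_MASSAGE"].eq(1)
first_left = massage_df.loc[had_massage].groupby("ID")["LEFT"].first()

# Broadcast that date to every row; IDs without a massage get NaT,
# and comparisons against NaT are False, so those rows drop out
date_from = massage_df["ID"].map(first_left)

mask = ((massage_df["ARRIVED"] > date_from)
        & (massage_df["ARRIVED"] <= date_from + pd.Timedelta(days=30))
        & massage_df["BROUGHT_A_FRIEND"].eq(1))

readmissions = mask.groupby(massage_df["ID"]).sum()
print(readmissions)
print(int(readmissions.sum()))
```

On the sample table this gives 2 readmissions each for IDs 3, 4 and 5, matching the expected total of 6.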