Check if values follow a criterion for subsequent time series values - python

I have a dataframe that looks like
ID DATE PROFIT
2342 2017-03-01 457
2342 2017-06-01 658
2342 2017-09-01 3456
2342 2017-12-01 345
2342 2018-03-01 235
2342 2018-06-01 23
808 2017-03-01 9346
808 2017-06-01 54
808 2017-09-01 314
808 2017-12-01 57
....
....
For each ID:
Let's say I want to find out if the Profit has stayed between 200 and 1000.
I want to do it in such a way that a counter (a new column) indicates how many quarters (latest and previous) in succession have satisfied this condition.
If for some reason, one of the intermediate quarters does not match the condition, the counter should reset.
I am thinking of using the shift functionality to access/condition on the previous rows; however, if there is a better way to check the condition across datetime values, it would be good to know.
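A rough sketch of the shift idea I have in mind (it only compares each quarter with the one before it within the same ID, so it does not yet build a running counter):
in_range = df['PROFIT'].between(200, 1000)
# previous quarter's PROFIT per ID; NaN for the first row of each ID -> treated as not in range
prev_in_range = df.groupby('ID')['PROFIT'].shift().between(200, 1000)
df['both_ok'] = (in_range & prev_in_range).astype(int)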

Solution if all datetimes are consecutive:
Use GroupBy.tail with 5 for the last and previous 4 quarters, compare with Series.lt, add back the missing values with Series.reindex and, if necessary, cast to integer for True/False to 1/0 mapping:
df['flag'] = (df.groupby('ID')['PROFIT']
.tail(5)
.lt(1000)
.reindex(df.index, fill_value=False)
.astype(int))
print (df)
ID DATE PROFIT flag
0 2342 2017-03-01 457 0 #<- 6th value, no match
1 2342 2017-06-01 658 1
2 2342 2017-09-01 3456 0
3 2342 2017-12-01 345 1
4 2342 2018-03-01 235 1
5 2342 2018-06-01 23 1
6 808 2017-03-01 9346 0
7 808 2017-06-01 54 1
8 808 2017-09-01 314 1
9 808 2017-12-01 57 1
EDIT: For a counter column, create the flag with Series.between, build consecutive-group identifiers by comparing with DataFrame.ne (!=) against DataFrame.shift and taking DataFrame.cumsum, then use GroupBy.cumcount (ascending for counter1, descending for counter2) and multiply by the flag with Series.mul so that the groups of consecutive 0s are set to 0:
df['flag'] = df['PROFIT'].between(200, 1000).astype(int)
df1 = df[['ID','flag']].ne(df[['ID','flag']].shift()).cumsum()
g = df.groupby([df1['ID'], df1['flag']])
df['counter1'] = g.cumcount().add(1).mul(df['flag'])
df['counter2'] = g.cumcount(ascending=False).add(1).mul(df['flag'])
print (df)
ID DATE PROFIT flag counter1 counter2
0 2342 2017-03-01 457 1 1 2
1 2342 2017-06-01 658 1 2 1
2 2342 2017-09-01 3456 0 0 0
3 2342 2017-12-01 345 1 1 3
4 2342 2018-03-01 235 1 2 2
5 2342 2018-06-01 230 1 3 1
6 808 2017-03-01 934 1 1 2
7 808 2017-06-01 540 1 2 1
8 808 2017-09-01 34 0 0 0
9 808 2017-12-01 57 0 0 0

Related

How to aggregate data within a time window to a specific date in a dataframe

I have a dataset like:
New_ID application_start_date is_approved
1234 2022-03-29 1
2345 2022-01-29 1
1234 2021-02-28 0
567 2019-07-03 1
567 2018-09-01 0
And I want to create a new attribute N_App_3M, which would be the sum of is_approved up to that application_start_date within a 3-month time frame.
Expected output would be:
New_ID application_start_date is_approved N_App_3M
1234 2022-03-29 1 2
2345 2022-01-29 0 0
1234 2022-02-28 1 1
567 2019-07-03 1 1
567 2018-09-01 0 0
Compute the rolling 3-month and 7-day sums and then use pd.merge_asof to generate your columns:
df["application_start_date"] = pd.to_datetime(df["application_start_date"])
df = df.set_index("application_start_date").sort_index()
app_3M = df.resample("M")["is_approved"].sum().rolling(3).sum().rename("N_App_3M").fillna(0)
app_7D = df.rolling("7D")["is_approved"].sum().rename("N_App_7D").fillna(0)
output = pd.merge_asof(df,app_3M,direction="nearest",left_index=True,right_index=True)
output = pd.merge_asof(output,app_7D,direction="nearest",left_index=True,right_index=True)
>>> output
New_ID is_approved N_App_3M N_App_7D
application_start_date
2018-09-01 567 0 0.0 0.0
2019-07-03 567 1 0.0 1.0
2021-02-28 1234 0 0.0 0.0
2022-01-29 2345 1 1.0 1.0
2022-03-29 1234 1 2.0 1.0

Incremental Counter flag for a matching condition on subsequent time series data

I have a dataframe that looks like below
ID DATE PROFIT
2342 2017-03-01 457
2342 2017-06-01 658
2342 2017-09-01 3456
2342 2017-12-01 345
2342 2018-03-01 235
2342 2018-06-01 23
808 2016-12-01 200
808 2017-03-01 9346
808 2017-06-01 54
808 2017-09-01 314
808 2017-12-01 57
....
....
For each ID:
I want to find out if the Profit has stayed between 200 and 1000.
I want to do it in such a way that a counter (a new column) indicates how many quarters (latest and previous) in succession have satisfied this condition. If for some reason, one of the intermediate quarters does not match the condition, the counter should reset.
So the output should look something like :
ID DATE PROFIT COUNTER
2342 2017-03-01 457 1
2342 2017-06-01 658 2
2342 2017-09-01 3456 0
2342 2017-12-01 345 1
2342 2018-03-01 235 2
2342 2018-06-01 23 0
808 2016-12-01 200 1
808 2017-03-01 9346 0
808 2017-06-01 54 0
808 2017-09-01 314 1
808 2017-12-01 57 0
....
....
I am thinking of using the shift functionality to access/condition on the previous rows; however, if there is a better way to check the condition across datetime values, it would be good to know.
IIUC, create a helper key using cumsum, then filter before assigning back, and fillna with 0 for the rows where PROFIT is not between 200 and 1000:
s=(~df.PROFIT.between(200,1000)).groupby(df['ID']).cumsum()
df['COUNTER']=df[df.PROFIT.between(200,1000)].groupby([df.ID,s]).cumcount()+1
df.COUNTER.fillna(0,inplace=True)
df
Out[226]:
ID DATE PROFIT COUNTER
0 2342 2017-03-01 457 1.0
1 2342 2017-06-01 658 2.0
2 2342 2017-09-01 3456 0.0
3 2342 2017-12-01 345 1.0
4 2342 2018-03-01 235 2.0
5 2342 2018-06-01 23 0.0
6 808 2016-12-01 200 1.0
7 808 2017-03-01 9346 0.0
8 808 2017-06-01 54 0.0
9 808 2017-09-01 314 1.0
10 808 2017-12-01 57 0.0
Set up a criteria column with value 1 where the row meets the criteria, then group and sum.
df['criteria'] = 0
df.loc[(df['PROFIT'] >= 200) & (df['PROFIT'] <= 1000), 'criteria'] = 1
df['result'] = df.groupby(['ID', df.criteria.eq(0).cumsum()])['criteria'].cumsum()
ID DATE PROFIT criteria result
0 2342 2017-03-01 457 1 1
1 2342 2017-06-01 658 1 2
2 2342 2017-09-01 3456 0 0
3 2342 2017-12-01 345 1 1
4 2342 2018-03-01 235 1 2
5 2342 2018-06-01 23 0 0
6 808 2016-12-01 200 1 1
7 808 2017-03-01 9346 0 0
8 808 2017-06-01 54 0 0
9 808 2017-09-01 314 1 1
10 808 2017-12-01 57 0 0
def magic(y):
    return y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
data["condition"] = data['PROFIT'].between(200, 1000)
data["COUNTER"] = data.groupby("ID").condition.apply(magic)
ID DATE PROFIT condition COUNTER
0 2342 2017-03-01 457 True 1
1 2342 2017-06-01 658 True 2
2 2342 2017-09-01 3456 False 0
3 2342 2017-12-01 345 True 1
4 2342 2018-03-01 235 True 2
5 2342 2018-06-01 23 False 0
6 808 2016-12-01 200 True 1
7 808 2017-03-01 9346 False 0
8 808 2017-06-01 54 False 0
9 808 2017-09-01 314 True 1
10 808 2017-12-01 57 False 0
Use groupby with a cumsum and a cumcount, then simply use loc to get the first rows and make them as you want them:
df['BOOL'] = (~df['PROFIT'].between(200, 1000)).cumsum()
df['COUNTER'] = df.groupby(['BOOL', 'ID']).cumcount()
df.loc[df.groupby('ID', as_index=False)['BOOL'].apply(lambda x: x.loc[:x.idxmin()-1]).index.levels[1], 'COUNTER'] += 1
And now:
print(df)
Is:
ID DATE PROFIT COUNTER
0 2342 2017-03-01 457 1
1 2342 2017-06-01 658 2
2 2342 2017-09-01 3456 0
3 2342 2017-12-01 345 1
4 2342 2018-03-01 235 2
5 2342 2018-06-01 23 0
6 808 2016-12-01 200 1
7 808 2017-03-01 9346 0
8 808 2017-06-01 54 0
9 808 2017-09-01 314 1
10 808 2017-12-01 57 0
As shown in your desired output.
Wouldn't something as simple as the following work?
if 200 <= profit_value <= 1000:
    cntr += 1
else:
    cntr = 0
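Applied across the frame, a plain-Python sketch of that idea (assuming the rows are already sorted by ID and DATE) might look like:
counters = []
prev_id, cntr = None, 0
for _, row in df.iterrows():
    if row['ID'] != prev_id:            # a new ID starts a fresh streak
        cntr = 0
    if 200 <= row['PROFIT'] <= 1000:    # condition holds: extend the streak
        cntr += 1
    else:                               # condition broken: reset
        cntr = 0
    counters.append(cntr)
    prev_id = row['ID']
df['COUNTER'] = counters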

Parsing week of year to datetime objects with pandas

A B C D yearweek
0 245 95 60 30 2014-48
1 245 15 70 25 2014-49
2 150 275 385 175 2014-50
3 100 260 170 335 2014-51
4 580 925 535 2590 2015-02
5 630 126 485 2115 2015-03
6 425 90 905 1085 2015-04
7 210 670 655 945 2015-05
The last column contains the year along with the week number. Is it possible to convert this to a datetime column with pd.to_datetime?
I've tried:
pd.to_datetime(df.yearweek, format='%Y-%U')
0 2014-01-01
1 2014-01-01
2 2014-01-01
3 2014-01-01
4 2015-01-01
5 2015-01-01
6 2015-01-01
7 2015-01-01
Name: yearweek, dtype: datetime64[ns]
But that output is incorrect, even though I believe %U should be the format string for the week number. What am I missing here?
You need another parameter to specify the day - check this:
df = pd.to_datetime(df.yearweek.add('-0'), format='%Y-%W-%w')
print (df)
0 2014-12-07
1 2014-12-14
2 2014-12-21
3 2014-12-28
4 2015-01-18
5 2015-01-25
6 2015-02-01
7 2015-02-08
Name: yearweek, dtype: datetime64[ns]
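If the week numbers happened to be ISO week numbers rather than %W-style weeks, the ISO directives would be the variant to reach for (a sketch, assuming the %G/%V/%u directives, which pandas accepts when they are used together):
import pandas as pd

s = pd.Series(['2014-48', '2014-49', '2015-02'])
# %G = ISO year, %V = ISO week, %u = ISO weekday (1 = Monday)
print(pd.to_datetime(s.add('-1'), format='%G-%V-%u'))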

Pandas DataFrame Pivot Using Dates and Counts

I've taken a large data file and managed to use groupby and value_counts to get the dataframe below. However, I want to format it so the company is on the left, with the months on top, and each number is the number of calls that month (the third column).
Here is my code to sort:
data = pd.DataFrame.from_csv('MYDATA.csv')
data[['recvd_dttm','CompanyName']]
data['recvd_dttm'].value_counts()
count = data.groupby(["recvd_dttm","CompanyName"]).size()
df = pd.DataFrame(count)
df.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')
Here is my output df=
recvd_dttm CompanyName
1/1/2015 11:42 Company 1 1
1/1/2015 14:29 Company 2 1
1/1/2015 8:12 Company 4 1
1/1/2015 9:53 Company 1 1
1/10/2015 11:38 Company 3 1
1/10/2015 11:31 Company 5 1
1/10/2015 12:04 Company 2 1
I want
Company Jan Feb Mar Apr May
Company 1 10 4 45 40 34
Company 2 2 5 56 5 57
Company 3 3 7 71 6 53
Company 4 4 4 38 32 2
Company 5 20 3 3 3 29
I know that there is a nifty pivot function for dataframes from this documentation http://pandas.pydata.org/pandas-docs/stable/reshaping.html for pandas, so I've been trying to use df.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')
One problem is that the third column doesn't have a name, so I can't use it for values='NumberCalls'. The second problem is figuring out how to take the datetime values in my dataframe and display them by month only.
Edit:
CompanyName is the first column, recvd_dttm is the 15th column. This is my code after some more attempts:
data = pd.DataFrame.from_csv('MYDATA.csv')
data[['recvd_dttm','CompanyName']]
data['recvd_dttm'].value_counts()
RatedCustomerCallers = data['CompanyName'].value_counts()
count = data.groupby(["recvd_dttm","CompanyName"]).size()
df = pd.DataFrame(count).set_index('recvd_dttm').sort_index()
df.index = pd.to_datetime(df.index, format='%m/%d/%Y %H:%M')
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg({df.columns[1]: sum}).reset_index()
result.columns = ['Month', 'CompanyName', 'NumberCalls']
result.pivot(index='recvd_dttm', columns='CompanyName', values='NumberCalls')
It is throwing this error: KeyError: 'recvd_dttm' and won't get to the result line.
You need to aggregate the data before creating the pivot table. If there is no column name, you can either refer to it as df.iloc[:, 1] (the 2nd column) or simply rename the column.
import pandas as pd
import numpy as np
# just simulate your data
np.random.seed(0)
dates = np.random.choice(pd.date_range('2015-01-01 00:00:00', '2015-06-30 00:00:00', freq='1h'), 10000)
company = np.random.choice(['company' + x for x in '1 2 3 4 5'.split()], 10000)
df = pd.DataFrame(dict(recvd_dttm=dates, CompanyName=company)).set_index('recvd_dttm').sort_index()
df['C'] = 1
df.columns = ['CompanyName', '']
Out[34]:
CompanyName
recvd_dttm
2015-01-01 00:00:00 company2 1
2015-01-01 00:00:00 company2 1
2015-01-01 00:00:00 company1 1
2015-01-01 00:00:00 company2 1
2015-01-01 01:00:00 company4 1
2015-01-01 01:00:00 company2 1
2015-01-01 01:00:00 company5 1
2015-01-01 03:00:00 company3 1
2015-01-01 03:00:00 company2 1
2015-01-01 03:00:00 company3 1
2015-01-01 04:00:00 company4 1
2015-01-01 04:00:00 company1 1
2015-01-01 04:00:00 company3 1
2015-01-01 05:00:00 company2 1
2015-01-01 06:00:00 company5 1
... ... ..
2015-06-29 19:00:00 company2 1
2015-06-29 19:00:00 company2 1
2015-06-29 19:00:00 company3 1
2015-06-29 19:00:00 company3 1
2015-06-29 19:00:00 company5 1
2015-06-29 19:00:00 company5 1
2015-06-29 20:00:00 company1 1
2015-06-29 20:00:00 company4 1
2015-06-29 22:00:00 company1 1
2015-06-29 22:00:00 company2 1
2015-06-29 22:00:00 company4 1
2015-06-30 00:00:00 company1 1
2015-06-30 00:00:00 company2 1
2015-06-30 00:00:00 company1 1
2015-06-30 00:00:00 company4 1
[10000 rows x 2 columns]
# first group by month and company name, calculate the sum of calls, and reset the index
# since that column has no name, simply tell pandas it is the 2nd column we want to sum
result = df.groupby([lambda idx: idx.month, 'CompanyName']).agg({df.columns[1]: sum}).reset_index()
# rename the columns
result.columns = ['Month', 'CompanyName', 'counts']
Out[41]:
Month CompanyName counts
0 1 company1 328
1 1 company2 337
2 1 company3 342
3 1 company4 345
4 1 company5 331
5 2 company1 295
6 2 company2 300
7 2 company3 328
8 2 company4 304
9 2 company5 329
10 3 company1 366
11 3 company2 398
12 3 company3 339
13 3 company4 336
14 3 company5 345
15 4 company1 322
16 4 company2 348
17 4 company3 351
18 4 company4 340
19 4 company5 312
20 5 company1 347
21 5 company2 354
22 5 company3 347
23 5 company4 363
24 5 company5 312
25 6 company1 316
26 6 company2 311
27 6 company3 331
28 6 company4 307
29 6 company5 316
# create pivot table
result.pivot(index='CompanyName', columns='Month', values='counts')
Out[44]:
Month 1 2 3 4 5 6
CompanyName
company1 326 297 339 337 344 308
company2 310 318 342 328 355 296
company3 347 315 350 343 347 329
company4 339 314 367 353 343 311
company5 370 331 370 320 357 294
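For what it's worth, on the same simulated frame the aggregate-then-pivot steps can also be collapsed into a single call with pd.crosstab (an alternative sketch, not part of the original answer):
# rows = company, columns = calendar month of the index, cells = call counts
print(pd.crosstab(df['CompanyName'], df.index.month))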

Grouping daily data by month in python/pandas while firstly grouping by user id

I have the table below in a Pandas dataframe:
date user_id whole_cost cost1
02/10/2012 00:00:00 1 1790 12
07/10/2012 00:00:00 1 364 15
30/01/2013 00:00:00 1 280 10
02/02/2013 00:00:00 1 259 24
05/03/2013 00:00:00 1 201 39
02/10/2012 00:00:00 3 623 1
07/12/2012 00:00:00 3 90 0
30/01/2013 00:00:00 3 312 90
02/02/2013 00:00:00 5 359 45
05/03/2013 00:00:00 5 301 34
02/02/2013 00:00:00 5 359 1
05/03/2013 00:00:00 5 801 12
..
The table was extracted from a csv file using the following code:
import pandas as pd
newnames = ['date','user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names = newnames, index_col = 'date')
I have to analyse the profile of my users and for this purpose:
I would like to group the queries by month for each user (there are thousands of them), summing whole_cost over the entire month. E.g. if user_id=1 has a whole_cost of 1790 on 02/10/2012 (with cost1 12) and 364 on 07/10/2012, then the new table should have an entry of 2154 (the summed whole cost) on 31/10/2012 - all dates in the transformed table will be month ends representing the whole month to which they relate.
In 0.14 you'll be able to groupby monthly and another column at the same time:
In [11]: df
Out[11]:
user_id whole_cost cost1
2012-10-02 1 1790 12
2012-10-07 1 364 15
2013-01-30 1 280 10
2013-02-02 1 259 24
2013-03-05 1 201 39
2012-10-02 3 623 1
2012-12-07 3 90 0
2013-01-30 3 312 90
2013-02-02 5 359 45
2013-03-05 5 301 34
2013-02-02 5 359 1
2013-03-05 5 801 12
In [12]: df1 = df.sort_index() # requires sorted DatetimeIndex
In [13]: df1.groupby([pd.TimeGrouper(freq='M'), 'user_id'])['whole_cost'].sum()
Out[13]:
user_id
2012-10-31 1 2154
3 623
2012-12-31 3 90
2013-01-31 1 280
3 312
2013-02-28 1 259
5 718
2013-03-31 1 201
5 1102
Name: whole_cost, dtype: int64
until 0.14 I think you're stuck with doing two groupbys:
In [14]: g = df.groupby('user_id')['whole_cost']
In [15]: g.resample('M', how='sum').dropna()
Out[15]:
user_id
1 2012-10-31 2154
2013-01-31 280
2013-02-28 259
2013-03-31 201
3 2012-10-31 623
2012-12-31 90
2013-01-31 312
5 2013-02-28 718
2013-03-31 1102
dtype: float64
With TimeGrouper deprecated, you can replace it with Grouper to get the same results:
df.groupby(['user_id', pd.Grouper(key='date', freq='M')]).agg({'whole_cost':sum})
df.groupby(['user_id', df['date'].dt.dayofweek]).agg({'whole_cost':sum})
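Note that pd.Grouper(key='date', freq='M') expects date to be a real datetime column rather than the index. Starting from the read_csv call in the question, a sketch of the preparation might look like this (dayfirst=True is an assumption based on the dd/mm/yyyy-looking dates):
import pandas as pd

newnames = ['date', 'user_id', 'whole_cost', 'cost1']
df = pd.read_csv('expenses.csv', names=newnames, parse_dates=['date'], dayfirst=True)

# month-end totals of whole_cost per user, matching the TimeGrouper output
monthly = df.groupby(['user_id', pd.Grouper(key='date', freq='M')])['whole_cost'].sum()
print(monthly)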
