Get data into monthly datetime index - python

I have a pd.DataFrame that looks like the one below:
Start Date End Date
1/1/1990 7/1/2014
7/1/2005 5/1/2013
8/1/1997 8/1/2004
9/1/2001
I'd like to capture, in a DatetimeIndex, how many items had started but not yet ended in each given month. What I want it to look like is illustrated below.
Date Count
4/1/2013 3
5/1/2013 2
6/1/2013 2
7/1/2013 2
So far I have created a Series indexed by a string that combines the start and end dates, which counts all items sharing the same start and end dates.
1/1/19907/1/2014 1
7/1/20055/1/2013 1
8/1/19978/1/2004 1
9/1/2001 1
And I have a DataFrame with a DatetimeIndex that looks as follows:
4/1/2013
5/1/2013
6/1/2013
7/1/2013
Now I'm struggling to combine the two to get what I'm looking for. I'm probably thinking about this all wrong and am looking for better ideas.

You can try:
print(df1)
Start Date End Date
0 1/1/1990 7/1/2014
1 7/1/2005 5/1/2013
2 8/1/1997 8/1/2004
3 9/1/2001 NaN
print(df2)
Index: [4/1/2013, 5/1/2013, 6/1/2013, 7/1/2013]
#drop rows with missing values in columns Start Date, End Date
df1 = df1.dropna(subset=['Start Date','End Date'])
#convert columns to datetime and then to month period
df1['Start Date'] = pd.to_datetime(df1['Start Date']).dt.to_period('M')
df1['End Date'] = pd.to_datetime(df1['End Date']).dt.to_period('M')
#create new column from datetimeindex and convert it to month period
df2['Date'] = pd.DatetimeIndex(df2.index).to_period('M')
print(df1)
Start Date End Date
0 1990-01 2014-07
1 2005-07 2013-05
2 1997-08 2004-08
print(df2)
Date
Date
4/1/2013 2013-04
5/1/2013 2013-05
6/1/2013 2013-06
7/1/2013 2013-07
#stack data for resampling
df1 = df1.stack().reset_index(drop=True, level=1).reset_index(name='Date')
print(df1)
index Date
0 0 1990-01
1 0 2014-07
2 1 2005-07
3 1 2013-05
4 2 1997-08
5 2 2004-08
#resample per group to generate every month between each start and end date
df = df1.groupby(df1['index']).apply(lambda x: x.set_index('Date').resample('M').first()).reset_index(level=1)
#remove unnecessary column index
df = df.drop('index', axis=1)
print(df.head())
Date
index
0 1990-01
0 1990-02
0 1990-03
0 1990-04
0 1990-05
#merge df and df2 on column Date, group by Date and count
print(pd.merge(df, df2, on='Date').groupby('Date')['Date'].count())
Date
2013-04 2
2013-05 2
2013-06 1
2013-07 1
Freq: M, Name: Date, dtype: int64
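
A possibly simpler sketch of my own (an alternative, working from the original df1 and df2 before the conversions above, and assuming the end month itself no longer counts as active while rows with no End Date stay open, which reproduces the 3, 2, 2, 2 counts from the question):
start = pd.to_datetime(df1['Start Date']).dt.to_period('M')
end = pd.to_datetime(df1['End Date']).dt.to_period('M')
months = pd.DatetimeIndex(df2.index).to_period('M')
# count rows active in each month: started on or before it and either not yet ended or open-ended
counts = pd.Series([((start <= m) & ((end > m) | end.isna())).sum() for m in months],
                   index=months, name='Count')
print(counts)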

Related

Groupby over periods of time

I have a table which contains ids, dates, a target (potentially multi-class, but for now binary where 1 is a fail) and a yearmonth column based on the date column. Below are the first 8 rows of this table:
row  id  date        target  yearmonth
0    A   2015-03-16  0       2015-03
1    A   2015-05-29  1       2015-05
2    A   2015-08-02  1       2015-08
3    A   2015-09-05  1       2015-09
4    A   2015-09-22  0       2015-09
5    A   2015-10-15  1       2015-10
6    A   2015-11-09  1       2015-11
7    B   2015-04-17  0       2015-04
I want to create lookback features for, let's say, the last 3 months, so that for each single row we look into the past and see how that id performed over the last 3 months. So, for example, for row 6, where the date is 9 Nov 2015, the percentage of fails for id A in the last 3 calendar months (so across the whole of August, September and October) would be 75% (using rows 2-5).
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B'],'date' :['2015-03-16','2015-05-29','2015-08-02','2015-09-05','2015-09-22','2015-10-15','2015-11-09','2015-04-17'],'target':[0,1,1,1,0,1,1,0]} )
df['date'] = pd.to_datetime(df['date'], dayfirst = True)
df['yearmonth'] = df['date'].dt.to_period('M')
agg_dict = {
    "Total_Transactions": pd.NamedAgg(column='target', aggfunc='count'),
    "Fail_Count": pd.NamedAgg(column='target', aggfunc=(lambda x: len(x[x == 1]))),
    "Perc_Monthly_Fails": pd.NamedAgg(column='target', aggfunc=(lambda x: len(x[x == 1])/len(x)*100))
}
df.groupby(['id','yearmonth']).agg(**agg_dict).reset_index(level = 1)
I've done an aggregation by id and month (see below) and I've tried things like rolling windows, but I couldn't find a way to actually aggregate looking back over a specific period for each single row. Any help is appreciated.
id  yearmonth  Total_Transactions  Fail_Count  Perc_Monthly_Fails
A   2015-03    1                   0           0
A   2015-05    1                   1           100
A   2015-08    1                   1           100
A   2015-09    2                   1           50
A   2015-10    1                   1           100
A   2015-11    1                   1           100
B   2015-04    1                   0           0
You can do this by merging the DataFrame with itself on 'id'.
First we'll create a first of month 'fom' column since your date logic wants to look back based on prior months, not the date specifically. Then we merge the DataFrame with itself, bringing along the index so we can assign the result back in the end.
With month offsets we can then filter that down to only the observations within 3 months of each row's observation, then group by the original index and take the mean of 'target' to get the percent fail, which we can just assign back (alignment on index).
If there are NaN in the output it's because that row had no observations in the prior 3 months, so the percentage can't be calculated.
#df['date'] = pd.to_datetime(df['date'], dayfirst = True)
df['fom'] = df['date'].astype('datetime64[M]') # Credit #anky
df1 = df.reset_index()
df1 = (df1.drop(columns='target').merge(df1, on='id', suffixes=['', '_past']))
df1 = df1[df1.fom_past.between(df1.fom - pd.offsets.DateOffset(months=3),
                               df1.fom - pd.offsets.DateOffset(months=1))]
df['Pct_fail'] = df1.groupby('index').target.mean()*100
id date target fom Pct_fail
0 A 2015-03-16 0 2015-03-01 NaN # No Rows to Avg
1 A 2015-05-29 1 2015-05-01 0.000000 # Avg Rows 0
2 A 2015-08-02 1 2015-08-01 100.000000 # Avg Rows 1
3 A 2015-09-05 1 2015-09-01 100.000000 # Avg Rows 2
4 A 2015-09-22 0 2015-09-01 100.000000 # Avg Rows 2
5 A 2015-10-15 1 2015-10-01 66.666667 # Avg Rows 2,3,4
6 A 2015-11-09 1 2015-11-01 75.000000 # Avg Rows 2,3,4,5
7 B 2015-04-17 0 2015-04-01 NaN # No Rows to Avg
If you're having an issue with memory we can take a very slow loop approach, which subsets for each row and then calculates the average from that subset.
def get_prev_avg(row, df):
    df = df[df['id'].eq(row['id'])
            & df['fom'].between(row['fom'] - pd.offsets.DateOffset(months=3),
                                row['fom'] - pd.offsets.DateOffset(months=1))]
    if not df.empty:
        return df['target'].mean()*100
    else:
        return np.NaN
#df['date'] = pd.to_datetime(df['date'], dayfirst = True)
df['fom'] = df['date'].astype('datetime64[M]')
df['Pct_fail'] = df.apply(lambda row: get_prev_avg(row, df), axis=1)
I have modified @ALollz's code so that it applies better to my original dataset, where I have a multiclass target and I would like to obtain PctFails for classes 1 and 2, plus the number of transactions, and where I need to group by different columns over different periods of time. Also, I decided it's simpler and better to use the last x months prior to the date rather than calendar months. So my solution was this:
df = pd.DataFrame({'Id':['A','A','A','A','A','A','A','B'],'Type':['T1','T3','T1','T2','T2','T1','T1','T3'],'date' :['2015-03-16','2015-05-29','2015-08-10','2015-09-05','2015-09-22','2015-11-08','2015-11-09','2015-04-17'],'target':[2,1,2,1,0,1,2,0]} )
df['date'] = pd.to_datetime(df['date'], dayfirst = True)
def get_prev_avg(row, df, columnname, lastxmonths):
    df = df[df[columnname].eq(row[columnname])
            & df['date'].between(row['date'] - pd.offsets.DateOffset(months=lastxmonths),
                                 row['date'] - pd.offsets.DateOffset(days=1))]
    if not df.empty:
        NrTransactions = len(df['target'])
        PctMinorFails = (df['target'].where(df['target'] == 1).count())/len(df['target'])*100
        PctMajorFails = (df['target'].where(df['target'] == 2).count())/len(df['target'])*100
        return pd.Series([NrTransactions, PctMinorFails, PctMajorFails])
    else:
        return pd.Series([np.NaN, np.NaN, np.NaN])

for lastxmonths in [3, 4]:
    for columnname in ['Id','Type']:
        df[['NrTransactionsBy' + str(columnname) + 'Last' + str(lastxmonths) + 'Months',
            'PctMinorFailsBy' + str(columnname) + 'Last' + str(lastxmonths) + 'Months',
            'PctMajorFailsBy' + str(columnname) + 'Last' + str(lastxmonths) + 'Months'
            ]] = df.apply(lambda row: get_prev_avg(row, df, columnname, lastxmonths), axis=1)
Each iteration takes a couple of hours for my original dataset, which is not great, but I'm unsure how to optimise it further.
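A hedged optimisation sketch (an assumption on my part, not benchmarked on the real data): sort each group by date once, then answer every row's "last x months" window with numpy searchsorted and prefix sums instead of re-filtering the whole frame inside apply. It is shown for a single grouping column and window length; the outer loops over columns and window lengths stay the same.
import numpy as np
import pandas as pd

def lookback_stats(df, columnname, lastxmonths):
    out = pd.DataFrame(index=df.index,
                       columns=['NrTransactions', 'PctMinorFails', 'PctMajorFails'],
                       dtype=float)
    for _, g in df.groupby(columnname):
        g = g.sort_values('date')
        dates = g['date'].to_numpy()
        # prefix sums with a leading zero, so a window count is prefix[hi] - prefix[lo]
        minor = np.concatenate(([0], np.cumsum(g['target'].to_numpy() == 1)))
        major = np.concatenate(([0], np.cumsum(g['target'].to_numpy() == 2)))
        window_start = (g['date'] - pd.offsets.DateOffset(months=lastxmonths)).to_numpy()
        lo = np.searchsorted(dates, window_start, side='left')
        hi = np.searchsorted(dates, dates, side='left')  # rows strictly before this row's date
        counts = (hi - lo).astype(float)
        with np.errstate(invalid='ignore', divide='ignore'):
            out.loc[g.index, 'NrTransactions'] = np.where(counts > 0, counts, np.nan)
            out.loc[g.index, 'PctMinorFails'] = (minor[hi] - minor[lo]) / counts * 100
            out.loc[g.index, 'PctMajorFails'] = (major[hi] - major[lo]) / counts * 100
    return out
Inside the loops above, df[[...the three generated column names...]] = lookback_stats(df, columnname, lastxmonths).to_numpy() would then replace the df.apply call; the result keeps the same row order as df, so the positional assignment lines up.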

How can I loop through column of df1 to see how often condition is met in df2?

I have two dataframes, df1 and df2.
df1 = pd.DataFrame({'Date': ['1/1/2017', '4/1/2017', '7/1/2017', '10/1/2017', '1/1/2018']})
df2 = pd.DataFrame({'Open Date': ['2/1/2017', '6/12/2017', '8/23/2017', '11/14/2017', '11/15/2017'],
                    'Close Date': ['12/2/2017', '9/6/2017', '10/23/2017', '12/14/2017', '1/15/2018']})
My goal is to create a new column in df1 which designates how many accounts were open on the exact date listed in df1. So in theory the output would look something like this:
Date | Count
1/1/2017 | 0 Accounts open
4/1/2017 | 1
7/1/2017 | 2
10/1/2017 | 2
1/1/2018 | 1
This means individual accounts can be counted more than once because they could be active/open on more than one exact date.
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Open Date'] = pd.to_datetime(df2['Open Date'])
df2['Close Date'] = pd.to_datetime(df2['Close Date'])
>>> df1.assign(
...     Accounts_open=df1['Date'].apply(
...         lambda ts: (df2['Open Date'].le(ts) & df2['Close Date'].ge(ts)).sum()))
Date Accounts_open
0 2017-01-01 0
1 2017-04-01 1
2 2017-07-01 2
3 2017-10-01 2
4 2018-01-01 1
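A hedged vectorised alternative sketch: broadcast every df1 date against every open/close interval instead of applying row by row, which can be faster on larger frames.
opens = df2['Open Date'].to_numpy()
closes = df2['Close Date'].to_numpy()
dates = df1['Date'].to_numpy()[:, None]
# each row of the boolean matrix marks which accounts were open on that date
df1['Accounts_open'] = ((opens <= dates) & (dates <= closes)).sum(axis=1)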

How do I compare a date column with a date and add a text column?

I have the following dataframe:
id date
1 2019-09-28
2 NaT
3 2017-09-28
I want to compare the date column to the date 2018-09-28, then add a status column containing the string 'Greater' or 'Less' based on the comparison. So the output would be something like this:
id date status
1 2019-09-28 Greater
2 NaT
3 2017-09-28 Less
With Pandas datetime series, you can compare values with a string:
df['date'] = pd.to_datetime(df['date'])
df['status'] = np.where(df['date'] >= '2018-09-28', 'Greater', 'Less')
df.loc[df['date'].isnull(), 'status'] = ''
print(df)
id date status
0 1 2019-09-28 Greater
1 2 NaT
2 3 2017-09-28 Less
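A hedged equivalent sketch using np.select, which handles the NaT rows in the same step instead of patching them afterwards:
conditions = [df['date'].isna(), df['date'] >= '2018-09-28']
df['status'] = np.select(conditions, ['', 'Greater'], default='Less')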

Combining similar dataframe rows

I currently have a DataFrame which looks like this:
User Date FeatureA FeatureB
John DateA 1 2
John DateB 3 5
Is there any way I can combine the 2 rows such that it becomes:
User Date1 Date2 FeatureA1 FeatureB1 FeatureA2 FeatureB2
John DateA DateB 1 2 3 5
I think you need:
g = df.groupby(['User']).cumcount()
df = df.set_index(['User', g]).unstack()
df.columns = ['{}{}'.format(i, j+1) for i, j in df.columns]
df = df.reset_index()
print (df)
User Date1 Date2 FeatureA1 FeatureA2 FeatureB1 FeatureB2
0 John DateA DateB 1 3 2 5
Explanation:
Get a counter per User group with cumcount
Create MultiIndex by set_index
Reshape by unstack
Flattening the MultiIndex in columns
Convert index to columns by reset_index
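If the exact column order from the question (Date1, Date2, FeatureA1, FeatureB1, FeatureA2, FeatureB2) matters, a hedged tweak is to reorder the MultiIndex columns before flattening, starting again from the original df and the g computed above:
df = df.set_index(['User', g]).unstack()
# put the Date columns first, then group the remaining features by occurrence number
df = df[sorted(df.columns, key=lambda c: (c[0] != 'Date', c[1], c[0]))]
df.columns = ['{}{}'.format(i, j + 1) for i, j in df.columns]
df = df.reset_index()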

pandas: retrieve values post group by sum

Pandas DataFrame ("df") looks like:
name id time
1095 One 1 12:03:37.230812
1096 Two 2 10:56:29.314745
1097 Three 3 10:58:18.897624
1098 Three 3 09:45:38.755116
1099 Two 2 09:02:59.472508
1100 One 1 12:28:38.341024
On this, I did an operation which is:
df = df.groupby(by=['id'])[['time']].transform(sum).sort_values('time', ascending=False)
On the resulting df I want to iterate and get the response as name and total time. How can I achieve that from the last df (from the groupby/transform response)? So the result should look something like this:
name time
One 24:03:37.230812
Two 19:56:29.314745
Three 19:58:18.897624
I think you need to convert column time with to_timedelta first.
Then groupby column name or id and aggregate sum:
df.time = pd.to_timedelta(df.time)
df = df.groupby('name', as_index=False)['time'].sum().sort_values('time', ascending=False)
print (df)
name time
0 One 1 days 00:32:15.571836
1 Three 0 days 20:43:57.652740
2 Two 0 days 19:59:28.787253
df = df.groupby('id', as_index=False)['time'].sum().sort_values('time', ascending=False)
print (df)
id time
0 1 1 days 00:32:15.571836
2 3 0 days 20:43:57.652740
1 2 0 days 19:59:28.787253
Last, it is possible to convert the timedeltas to seconds with total_seconds:
df.time = df.time.dt.total_seconds()
print (df)
id time
0 1 88335.571836
2 3 74637.652740
1 2 71968.787253
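For reference, a hedged reconstruction of the input frame above (index values taken from the question); running the answer's steps on it reproduces the sums shown:
df = pd.DataFrame({'name': ['One', 'Two', 'Three', 'Three', 'Two', 'One'],
                   'id': [1, 2, 3, 3, 2, 1],
                   'time': ['12:03:37.230812', '10:56:29.314745', '10:58:18.897624',
                            '09:45:38.755116', '09:02:59.472508', '12:28:38.341024']},
                  index=range(1095, 1101))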
