The logic of what I am trying to do I think is best explained with code:
import pandas as pd
import numpy as np
from datetime import timedelta
np.random.seed(365)
#some data
start_date = pd.date_range(start = "2015-01-09", end = "2022-09-11", freq = "6D")
end_date = [d + timedelta(days = np.random.exponential(scale = 100)) for d in start_date]
df = pd.DataFrame(
{"start_date":start_date,
"end_date":end_date}
)
#randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac = 0.7).reset_index(drop = True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")
I first create a pd.Series with the 1st day of every month in the entire history of the data:
dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time
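(An equivalent and arguably simpler way to build the same series is a month-start date_range over the data's bounds; a sketch:
# freq="MS" emits the 1st of each month between the two bounds
dates = pd.Series(pd.date_range(df["start_date"].min().normalize().replace(day=1),
                                df["start_date"].max(), freq="MS"))
)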
What I then want to do is count the number of df["start_date"] values which are less than the 1st day of each month in the series and where the df["end_date"] values are null (recorded as NaT).
I would think I would use a for loop to do this, somehow grouping by the dates series, so that the resulting output looks something like this:
month_start  count
2015-01-01       5
2015-02-01      10
2015-03-01      35
The count column in the resulting output is, for every value in the series, the number of df rows where df["start_date"] is less than the 1st of that month and df["end_date"] is null.
Here is the logic of what I am trying to do:
df.groupby(by = dates)[["start_date", "end_date"]].apply(
    lambda x: [x["start_date"] < date for date in dates] & x["end_date"].isnull()
)
Is this what you want?
# keep only the rows with a missing end date, then count, per month start,
# how many of their start dates fall strictly before it
df2 = df[df['end_date'].isnull()]
dates_count = dates.apply(lambda x: df2[df2['start_date'] < x]['start_date'].count())
print(pd.concat([dates, dates_count], axis=1))
IIUC, group by period (shifted by 1 month) and count the NaT, then cumsum to accumulate the counts:
(df['end_date'].isna()
.groupby(df['start_date'].dt.to_period('M').add(1).dt.start_time)
.sum()
.cumsum()
)
Output:
start_date
2015-02-01 0
2015-03-01 0
2015-04-01 0
2015-05-01 0
2015-06-01 0
...
2022-06-01 122
2022-07-01 127
2022-08-01 133
2022-09-01 138
2022-10-01 140
Name: end_date, Length: 93, dtype: int64
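For reference, the same logic can be cross-checked without groupby by using numpy broadcasting to compare every NaT row's start_date against every month start at once; a sketch (the intermediate boolean matrix is len(dates) x number-of-NaT-rows, so this assumes both fit in memory):
# rows with a missing end_date
nat_starts = df.loc[df["end_date"].isna(), "start_date"].to_numpy()
month_starts = dates.to_numpy()
# count, per month start, how many NaT rows started strictly before it
counts = pd.Series((nat_starts[None, :] < month_starts[:, None]).sum(axis=1),
                   index=dates, name="count")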
This post follows on from another one I posted which can be found here:
use groupby() and for loop to count column values with conditions
I am working with the same data again:
import pandas as pd
import numpy as np
from datetime import timedelta
np.random.seed(365)
#some data
start_date = pd.date_range(start = "2015-01-09", end = "2022-09-11", freq = "6D")
end_date = [d + timedelta(days = np.random.exponential(scale = 100)) for d in start_date]
df = pd.DataFrame(
{"start_date":start_date,
"end_date":end_date}
)
#randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac = 0.7).reset_index(drop = True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")
Like in the previous post, I first created a pd.Series with the 1st day of every month in the entire history of the data:
dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time
What I now want to do is count the number of rows in the dataframe where the df["start_date"] values are less than the 1st day of each month in the series and where the df["end_date"] values are greater than the 1st day of each month in the series.
I would think that I would apply a lambda function or use np.logical_and on the dates series to obtain the output I am after - the logic of which would look something like this:
#only obtain those rows with end dates
inactives = df[df["end_date"].notnull()]
dates.apply(
    lambda x: (inactives[inactives["start_date"] < x] & inactives[inactives["end_date"] > x]).count()
)
or like this:
dates.apply(
    lambda x: np.logical_and(
        inactives["start_date"] < x,
        inactives["end_date"] > x
    ).sum())
The resulting output would look like this:
month_first  count
2015-01-01      10
2015-02-01      25
2015-03-01      45
Correct, we can use apply with a lambda for this. First, we create our list of first days of each month; freq="MS" (month start) generates the 1st of every month inside the defined interval.
new_df = pd.DataFrame({"month_first": pd.date_range(start="2015-01-01", end="2022-10-01", freq = "MS")})
This will result in this table:
month_first
0 2015-01-01
1 2015-02-01
2 2015-03-01
3 2015-04-01
4 2015-05-01
.. ...
89 2022-06-01
90 2022-07-01
91 2022-08-01
92 2022-09-01
93 2022-10-01
[94 rows x 1 columns]
Then we apply the lambda function below. For each date in our range, we compare it against inactives: start_date must be less than the date, and end_date greater. The & operator combines the two boolean Series element-wise, and sum() counts the True values.
new_df["count"] = new_df["month_first"].apply(
lambda x: ((inactives["start_date"] < x) & (inactives["end_date"] > x)).sum())
This will result in this table:
month_first count
0 2015-01-01 0
1 2015-02-01 4
2 2015-03-01 9
3 2015-04-01 14
4 2015-05-01 19
.. ... ...
89 2022-06-01 25
90 2022-07-01 22
91 2022-08-01 19
92 2022-09-01 13
93 2022-10-01 13
[94 rows x 2 columns]
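For larger frames, a loop-free alternative to apply is to broadcast the comparisons over all month starts at once with numpy; a sketch, reusing inactives from the question:
starts = inactives["start_date"].to_numpy()
ends = inactives["end_date"].to_numpy()
firsts = new_df["month_first"].to_numpy()
# one boolean matrix per condition, combined element-wise, then summed per month
new_df["count"] = ((starts[None, :] < firsts[:, None]) &
                   (ends[None, :] > firsts[:, None])).sum(axis=1)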
I would like to filter for customer_ids that first appear after a certain date, in this case 2019-01-10, and then create a new df with a list of new customers.
df
date customer_id
2019-01-01 429492
2019-01-01 344343
2019-01-01 949222
2019-01-10 429492
2019-01-10 344343
2019-01-10 129292
Output df
customer_id
129292
This is what I have tried so far, but it also gives me customer_ids that were active before 10 January 2019:
s = df.loc[df["date"]>="2019-01-10", "customer_id"]
df_new = df[df["customer_id"].isin(s)]
df_new
You can use boolean indexing, filtering with Series.isin:
df["date"] = pd.to_datetime(df["date"])
mask1 = df["date"]>="2019-01-10"
mask2 = df["customer_id"].isin(df.loc[~mask1,"customer_id"])
df = df.loc[mask1 & ~mask2, ['customer_id']]
print (df)
customer_id
5 129292
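An equivalent way to express "first appears on or after the cutoff" is to compare each customer's earliest date against the cutoff; a sketch, starting again from the original df:
first_seen = df.groupby("customer_id")["date"].transform("min")
df_new = df.loc[first_seen >= "2019-01-10", ["customer_id"]].drop_duplicates()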
df['date'] = pd.to_datetime(df['date'])
cutoff = pd.to_datetime('2019-01-10')
mask = df['date'] >= cutoff
# ids seen strictly before the cutoff vs. on/after it
customers_before = df.loc[~mask, 'customer_id'].unique().tolist()
customers_after = df.loc[mask, 'customer_id'].unique().tolist()
# new customers appear only on/after the cutoff
result = set(customers_after) - set(customers_before)
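Since the question asks for a new df, the set can be wrapped in a one-column DataFrame; a sketch:
df_new = pd.DataFrame({'customer_id': sorted(result)})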
"then create a new df with a list of new customers" so in this case your output is null, because 2019-01-10 is last date, there is no new customers after this date
But if you want to get the list of customers on or after a certain date:
df=pd.DataFrame({
'date':['2019-01-01','2019-01-01','2019-01-01',
'2019-01-10','2019-01-10','2019-01-10'],
'customer_id':[429492,344343,949222,429492,344343,129292]
})
certain_date=pd.to_datetime('2019-01-10')
df.date=pd.to_datetime(df.date)
df=df[
df.date>=certain_date
]
print(df)
date customer_id
3 2019-01-10 429492
4 2019-01-10 344343
5 2019-01-10 129292
If your 'date' column holds datetime objects, you just have to do:
df_new = df[df['date'] >= datetime(2019, 1, 10)]['customer_id']
If your 'date' column doesn't contain datetime objects, you should first convert it using the to_datetime method:
df['date'] = pd.to_datetime(df['date'])
And then apply the methodology described above.
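Putting both steps together and de-duplicating the ids, a minimal sketch (this filters by date only, assuming from datetime import datetime; it does not exclude customers also seen before the cutoff):
df['date'] = pd.to_datetime(df['date'])
df_new = df.loc[df['date'] >= datetime(2019, 1, 10), ['customer_id']].drop_duplicates()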
I am looking for a way to create the column 'min_value' in the dataframe df below. For each row i, we subset the entire dataframe to the records that match row i's ['Date_A', 'Date_B'] grouping and have 'Advance' less than row i's 'Advance', and we then take the minimum of the column 'Amount' from this subset as the 'min_value' of row i.
Initial dataframe:
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2018-1-25','2018-1-25','2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-2-1','2018-2-1','2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [10,103,200,5,8,150],
'Amount' : [180,220,200,230,220,240]})
df = df [['Date_A', 'Date_B', 'Advance', 'Amount']]
df
Desired output:
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2018-1-25','2018-1-25','2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-2-1','2018-2-1','2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [10,103,200,5,8,150],
'Amount' : [180,220,200,230,220,240],
'min_value': [180,180,180,230,230,220] })
df_out = df_out [['Date_A', 'Date_B', 'Advance', 'Amount','min_value']]
df_out
I wrote the following loop that I think would do the job, but it takes much too long to run; I guess there must be much more efficient ways to accomplish this.
for i in range(len(df)):
    date1 = df['Date_A'][i]     # the Date_A of row i
    date2 = df['Date_B'][i]     # the Date_B of row i
    advance = df['Advance'][i]  # the Advance of row i
    # subset the entire dataframe to meet the dates and advance conditions
    df.loc[i, 'min_value'] = df[(df['Date_A'] == date1) & (df['Date_B'] == date2)
                                & (df['Advance'] < advance)]['Amount'].min()
# for the smallest advance value, set min_value to the row's own amount
df.loc[df['min_value'].isnull(), 'min_value'] = df['Amount']
df
I hope it is clear enough, thanks for your help.
Improvement question
Thanks a lot for the answer. For the last part, the NA rows, I'd like to replace the amount of the row by the overall amount of the Date_A,Date_B,advance grouping so that I have the overall minimum of the last day before date_A
Improvement desired output (two records for the smallest advance value):
dates_A = ['2017-12-25','2017-12-25','2017-12-25','2017-12-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1','2018-1-1','2018-1-1','2018-1-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A':Date_A,
'Date_B':Date_B,
'Advance' : [5,8,150,5],
'Amount' : [230,220,240,225],
'min_value': [225,230,220,225] })
df_out = df_out [['Date_A', 'Date_B', 'Advance', 'Amount','min_value']]
df_out
Thanks
You can use groupby on 'Date_A' and 'Date_B' after sorting the values by 'Advance', and apply cummin followed by shift to the column 'Amount'. Then use fillna with the values from the column 'Amount', such as:
df['min_value'] = (df.sort_values('Advance').groupby(['Date_A','Date_B'])['Amount']
.apply(lambda ser_g: ser_g.cummin().shift()).fillna(df['Amount']))
and you get:
Date_A Date_B Advance Amount min_value
0 2017-12-25 2018-01-01 10 180 180.0
1 2017-12-25 2018-01-01 103 220 180.0
2 2017-12-25 2018-01-01 200 200 180.0
3 2018-01-25 2018-02-01 5 230 230.0
4 2018-01-25 2018-02-01 8 220 230.0
5 2018-01-25 2018-02-01 150 240 220.0
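Note that on recent pandas versions groupby.apply prepends the group keys to the result index, which breaks the assignment above; transform preserves the original index and sidesteps this. A sketch of the same logic:
df['min_value'] = (df.sort_values('Advance')
                     .groupby(['Date_A', 'Date_B'])['Amount']
                     .transform(lambda s: s.cummin().shift())
                     .fillna(df['Amount']))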
I have a DataFrame with two columns, ["StartDate", "Duration"].
The elements in the StartDate column are datetimes, and the durations are ints.
Something like:
StartDate Duration
08:16:05 20
07:16:01 20
I expect to get:
EndDate
08:16:25
07:16:21
Simply add the seconds to the hour.
I've been checking some ideas, like timedelta types, and datetimes do support adding timedeltas, but so far I can't find how to do it with DataFrames in a vectorized fashion (it would also be possible to iterate over all the rows performing the operation).
consider this df
StartDate duration
0 01/01/2017 135
1 01/02/2017 235
You can get the datetime column like this
df['EndDate'] = pd.to_datetime(df['StartDate']) + pd.to_timedelta(df['duration'], unit='s')
df.drop(['StartDate', 'duration'], axis = 1, inplace = True)
You get
EndDate
0 2017-01-01 00:02:15
1 2017-01-02 00:03:55
EDIT: with the sample dataframe that you posted
df['EndDate'] = pd.to_timedelta(df['StartDate']) + pd.to_timedelta(df['Duration'], unit='s')
df['StartDate'] = df.apply(lambda x: pd.to_datetime(x.StartDate) + pd.Timedelta(seconds=x.Duration), axis = 1)
I have a pandas DataFrame that is indexed by date. I would like to select all consecutive gaps by period and all consecutive days by period. How can I do this?
Example of a DataFrame with no columns but a date index:
In [29]: import pandas as pd
In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])
In [31]: ts = pd.DataFrame(index=dates)
As you can see, there is a gap between 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, i.e. 40 gaps, with a median gap duration of "x", etc.? Also, I can see that 2016-09-05 and 2016-09-06 form a two-day range. How can I detect these and also print descriptive stats?
Ideally the result would be returned as another DataFrame in each case, since I want to use other columns in the DataFrame to group by.
Pandas has a built-in method DataFrame.diff() which you can use to accomplish this. One benefit is that you can then use pandas Series functions like mean() to quickly compute summary statistics on the resulting gaps Series.
from datetime import datetime, timedelta
import pandas as pd
# Construct dummy dataframe
dates = pd.to_datetime([
'2016-08-03',
'2016-08-04',
'2016-08-05',
'2016-08-17',
'2016-09-05',
'2016-09-06',
'2016-09-07',
'2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])
# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]
# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]
# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.items():
gap_start = df['date'][i - 1]
print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
f'Duration: {str(g.to_pytimedelta())}')
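The same diff trick also covers the consecutive-days part of the question: a new run starts wherever the gap exceeds one day, so the cumulative sum of that boolean labels each run; a sketch:
# label runs of consecutive days, then summarize each run
run_id = (df['date'].diff() > timedelta(days=1)).cumsum()
runs = df.groupby(run_id)['date'].agg(['min', 'max', 'count'])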
here's something to get started:
df = pd.DataFrame(np.ones(5),columns = ['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df['ones'].cumsum()
The cumsum() creates a grouping variable on 'ones', partitioning your data at the points you provided. If you print df out (to a spreadsheet, say) it will make sense:
print(df.head())
ones
2016-08-03 00:00:00 0
2016-08-03 10:53:39 1
2016-08-04 00:00:00 1
2016-08-05 00:00:00 1
2016-08-06 00:00:00 1
print(df.tail())
ones
2016-09-16 00:00:00 4
2016-09-17 00:00:00 4
2016-09-18 00:00:00 4
2016-09-19 00:00:00 4
2016-09-19 10:23:03 5
now to complete:
df = df.reset_index()
df = df.groupby('ones').agg(first_spotted=('index', 'min'), gaps=('ones', 'count'))
which gives:
first_spotted  gaps
ones
0 2016-08-03 00:00:00 1
1 2016-08-03 10:53:39 34
2 2016-09-05 11:10:46 1
3 2016-09-05 11:11:30 2
4 2016-09-06 10:53:39 14
5 2016-09-19 10:23:03 1