Count number of days in each continuous period pandas - python

Suppose I have the following df N03_zero (date_code is already datetime):
item_code date_code
8028558104973 2022-01-01
8028558104973 2022-01-02
8028558104973 2022-01-03
8028558104973 2022-01-06
8028558104973 2022-01-07
7622300443269 2022-01-01
7622300443269 2022-01-10
7622300443269 2022-01-11
513082 2022-01-01
513082 2022-01-02
513082 2022-01-03
There are millions of rows, with date_code assigned to some item_code.
I am trying to get the number of days in each continuous period for each item_code; other similar questions haven't helped me.
The expected df should be:
item_code continuous_days
8028558104973 3
8028558104973 2
7622300443269 1
7622300443269 2
513082 3
Once the sequence of days breaks, the days in that sequence should be counted and the count should start again.
The aim is to then be able to get a dataframe with the count, min, max, and mean for each item_code.
Like this:
item_code no. periods min max mean
8028558104973 2 2 3 2.5
7622300443269 2 1 2 1.5
513082 1 3 3 3
Any suggestions?

For consecutive days, compute the difference between dates with Series.diff, convert it to days with Series.dt.days, flag breaks in the sequence (differences not equal to 1) with Series.ne, and form group identifiers with a cumulative sum via Series.cumsum. Then count each group with GroupBy.size, remove the second index level with DataFrame.droplevel, and create the output DataFrame:
df['date_code'] = pd.to_datetime(df['date_code'])
df1 = (df.groupby(['item_code', df['date_code'].diff().dt.days.ne(1).cumsum()], sort=False)
         .size()
         .droplevel(1)
         .reset_index(name='continuous_days'))
print(df1)
item_code continuous_days
0 8028558104973 3
1 8028558104973 2
2 7622300443269 1
3 7622300443269 2
4 513082 3
Then aggregate the values with named aggregations via GroupBy.agg:
df2 = (df1.groupby('item_code', sort=False, as_index=False)
          .agg(**{'no. periods': ('continuous_days', 'size'),
                  'min': ('continuous_days', 'min'),
                  'max': ('continuous_days', 'max'),
                  'mean': ('continuous_days', 'mean')}))
print(df2)
item_code no. periods min max mean
0 8028558104973 2 2 3 2.5
1 7622300443269 2 1 2 1.5
2 513082 1 3 3 3.0
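For reference, a minimal reproducible setup for the snippets above (building the frame this way is my assumption; the values are copied from the question):
import pandas as pd

df = pd.DataFrame({
    'item_code': [8028558104973] * 5 + [7622300443269] * 3 + [513082] * 3,
    'date_code': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-06', '2022-01-07',
                  '2022-01-01', '2022-01-10', '2022-01-11',
                  '2022-01-01', '2022-01-02', '2022-01-03'],
})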

Related

pandas shifting missing months

Let's assume the following dataframe and shift operation:
d = {'col1': ['2022-01-01','2022-02-01','2022-03-01','2022-05-01'], 'col2': [1,2,3,4]}
df = pd.DataFrame(d)
df['shifted'] = df['col2'].shift(1, fill_value=0)
I want to create a column containing the previous month's value, filling with 0 for months that do not exist, so the desired result would look like:
col1        col2  shifted
2022-01-01     1        0
2022-02-01     2        1
2022-03-01     3        2
2022-05-01     4        0
So in the last line the value is 0 because there is no data for April.
But at the moment it looks like this:
col1        col2  shifted
2022-01-01     1        0
2022-02-01     2        1
2022-03-01     3        2
2022-05-01     4        3
Does anyone know how to achieve this?
One idea is to create a monthly PeriodIndex, which makes it possible to shift by months; then replace the missing values:
df = df.set_index(pd.to_datetime(df['col1']).dt.to_period('m'))
df['shifted'] = df['col2'].shift(1, freq='m').reindex(df.index, fill_value=0)
print(df)
col1 col2 shifted
col1
2022-01 2022-01-01 1 0
2022-02 2022-02-01 2 1
2022-03 2022-03-01 3 2
2022-05 2022-05-01 4 0
Finally, the PeriodIndex can be removed:
df = df.reset_index(drop=True)
print(df)
col1 col2 shifted
0 2022-01-01 1 0
1 2022-02-01 2 1
2 2022-03-01 3 2
3 2022-05-01 4 0
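An alternative route to the same result, as a sketch (the monthly reindex is my own approach, not part of the answer above): make the missing month explicit by reindexing to the full monthly range, shift by one row, then subset back to the original months.
import pandas as pd

d = {'col1': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-05-01'],
     'col2': [1, 2, 3, 4]}
df = pd.DataFrame(d)

months = pd.to_datetime(df['col1']).dt.to_period('M')
full = pd.period_range(months.min(), months.max(), freq='M')

# reindexing inserts 2022-04 as 0, so the row shift lines up with calendar months
s = df.set_index(months)['col2'].reindex(full, fill_value=0)
df['shifted'] = s.shift(1, fill_value=0).reindex(months).to_numpy()
print(df)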

Conduct the calculation only when the date value is valid

I have a data frame dft:
Date Total Value
02/01/2022 2
03/01/2022 6
N/A 4
03/11/2022 4
03/15/2022 4
05/01/2022 4
For each date in the data frame, I want to calculate how many days it is from today, and I want to add these calculated values in a new column called Days.
I have tried the following code:
newdft = []
for item in dft:
    temp = item.copy()
    timediff = datetime.now() - datetime.strptime(temp["Date"], "%m/%d/%Y")
    temp["Days"] = timediff.days
    newdft.append(temp)
But the third date value is N/A, which caused an error. What should I add to my code so that I only conduct the calculation when the date value is valid?
I would convert the whole Date column to a datetime object using pd.to_datetime() with errors set to 'coerce', which replaces the 'N/A' string with NaT (Not a Timestamp):
dft['Date'] = pd.to_datetime(dft['Date'], errors='coerce')
So the column will now look like this:
0 2022-02-01
1 2022-03-01
2 NaT
3 2022-03-11
4 2022-03-15
5 2022-05-01
Name: Date, dtype: datetime64[ns]
You can then subtract that column from the current date in one go, which will automatically ignore the NaT value, and assign this as a new column:
dft['Days'] = datetime.now() - dft['Date']
This will make dft look like below:
Date Total Value Days
0 2022-02-01 2 148 days 15:49:03.406935
1 2022-03-01 6 120 days 15:49:03.406935
2 NaT 4 NaT
3 2022-03-11 4 110 days 15:49:03.406935
4 2022-03-15 4 106 days 15:49:03.406935
5 2022-05-01 4 59 days 15:49:03.406935
If you just want the number instead of 59 days 15:49:03.406935, you can do the below instead:
dft['Days'] = (datetime.now() - dft['Date']).dt.days
Which will give you:
Date Total Value Days
0 2022-02-01 2 148.0
1 2022-03-01 6 120.0
2 NaT 4 NaN
3 2022-03-11 4 110.0
4 2022-03-15 4 106.0
5 2022-05-01 4 59.0
In contrast to Emi OB's excellent answer, if you did actually need to process individual values, it's usually easier to use apply to create a new Series from an existing one. You'd just need to filter out 'N/A'.
df['Days'] = (
    df['Date']
    [lambda d: d != 'N/A']
    .apply(lambda d: (datetime.now() - datetime.strptime(d, "%m/%d/%Y")).days)
)
Result:
Date Total Value Days
0 02/01/2022 2 148.0
1 03/01/2022 6 120.0
2 N/A 4 NaN
3 03/11/2022 4 110.0
4 03/15/2022 4 106.0
5 05/01/2022 4 59.0
And for what it's worth, another option is date.today() instead of datetime.now():
.apply(lambda d: date.today() - datetime.strptime(d, "%m/%d/%Y").date())
And the result is a timedelta instead of float:
Date Total Value Days
0 02/01/2022 2 148 days
1 03/01/2022 6 120 days
2 N/A 4 NaT
3 03/11/2022 4 110 days
4 03/15/2022 4 106 days
5 05/01/2022 4 59 days
See also: How do I select rows from a DataFrame based on column values?
Following up on the excellent answer by Emi OB, I would suggest using Series.mask() to update the dataframe without type coercion.
import datetime
import pandas as pd

dft = pd.DataFrame({'Date': ['02/01/2022',
                             '03/01/2022',
                             None,
                             '03/11/2022',
                             '03/15/2022',
                             '05/01/2022'],
                    'Total Value': [2, 6, 4, 4, 4, 4]})
dft['today'] = datetime.datetime.now()
dft['Days'] = 0
dft['Days'].mask(dft['Date'].notna(),
                 (dft['today'] - pd.to_datetime(dft['Date'])).dt.days,
                 axis=0, inplace=True)
dft.drop(columns=['today'], inplace=True)
This would result in integer values in the Days column:
Date Total Value Days
0 02/01/2022 2 148
1 03/01/2022 6 120
2 None 4 None
3 03/11/2022 4 110
4 03/15/2022 4 106
5 05/01/2022 4 59
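If you want integer Days while still keeping a marker for the missing date, one more option (my sketch, building on the errors='coerce' conversion above) is the nullable Int64 dtype:
import pandas as pd
from datetime import datetime

dft['Date'] = pd.to_datetime(dft['Date'], errors='coerce')
# .dt.days yields floats because of the NaT row; Int64 keeps integers plus <NA>
dft['Days'] = (datetime.now() - dft['Date']).dt.days.astype('Int64')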

consecutive rows of unsorted dates based on one day before after or on the same day into one [duplicate]

I would like to combine rows with the same id, consecutive dates, and the same feature values.
I have the following dataframe:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-15 1 1
1 A 2020-01-16 2020-01-30 1 1
2 A 2020-01-31 2020-02-15 0 1
3 A 2020-07-01 2020-07-15 0 1
4 B 2020-01-31 2020-02-15 0 0
5 B 2020-02-16 NaT 0 0
And the expected result is:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
I have been trying other posts' answers but they don't really match my use case.
Thanks in advance!
You can approach this by:
Getting the day diff of consecutive entries within the same group, by subtracting the previous End (via GroupBy.shift()) from the current Start.
Setting a group number group_no such that a new group number is issued when the day diff from the previous entry within the group is greater than 1.
Grouping by Id and group_no, and aggregating the Start and End dates of each group using .groupby() and .agg().
As there is NaT data within the grouping, we need to specify dropna=False during grouping. Furthermore, to get the last entry of End within each group we use x.iloc[-1] instead of 'last', since 'last' skips NaT.
# convert to datetime format if not already in datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
# sort by columns `Id` and `Start` if not already in this sequence
df = df.sort_values(['Id', 'Start'])
day_diff = (df['Start'] - df['End'].groupby([df['Id'], df['Feature1'], df['Feature2']]).shift()).dt.days
group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()
df_out = (df.groupby(['Id', group_no], dropna=False, as_index=False)
            .agg({'Id': 'first',
                  'Start': 'first',
                  'End': lambda x: x.iloc[-1],
                  'Feature1': 'first',
                  'Feature2': 'first'}))
Result:
print(df_out)
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
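For completeness, a reproducible construction of the sample frame used by the snippets above (building it this way is an assumption; the values are copied from the question):
import pandas as pd

df = pd.DataFrame({
    'Id': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Start': ['2020-01-01', '2020-01-16', '2020-01-31', '2020-07-01',
              '2020-01-31', '2020-02-16'],
    'End': ['2020-01-15', '2020-01-30', '2020-02-15', '2020-07-15',
            '2020-02-15', None],
    'Feature1': [1, 1, 0, 0, 0, 0],
    'Feature2': [1, 1, 1, 1, 0, 0],
})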
Extract the months from both date columns:
df['sMonth'] = pd.to_datetime(df['Start']).dt.month
df['eMonth'] = pd.to_datetime(df['End']).dt.month
Now group the data frame by ['Id','Feature1','Feature2','sMonth','eMonth'] and we get the result:
(df.groupby(['Id', 'Feature1', 'Feature2', 'sMonth', 'eMonth'])
   .agg({'Start': 'min', 'End': 'max'})
   .reset_index()
   .drop(['sMonth', 'eMonth'], axis=1))
Result
Id Feature1 Feature2 Start End
0 A 0 1 2020-01-31 2020-02-15
1 A 0 1 2020-07-01 2020-07-15
2 A 1 1 2020-01-01 2020-01-30
3 B 0 0 2020-01-31 2020-02-15

How to drop rows for each value in a column using a condition?

I have the following dataframe:
df = pd.DataFrame({'No': [0, 0, 0, 1, 1, 2, 2],
                   'date': ['2020-01-15', '2019-12-16', '2021-03-01', '2018-05-19',
                            '2016-04-08', '2020-01-02', '2020-03-07']})
df.date = pd.to_datetime(df.date)
No date
0 0 2020-01-15
1 0 2019-12-16
2 0 2021-03-01
3 1 2018-05-19
4 1 2016-04-08
5 2 2020-01-02
6 2 2020-03-07
I want to drop the rows if all the date values are earlier than 2020-01-01 for each unique number in the No column, i.e. I want to drop the rows with indices 3 and 4.
Is it possible to do this without a for loop?
Use groupby and transform:
>>> df[df.groupby('No')['date'].transform('max') >= '2020-01-01']
No date
0 0 2020-01-15
1 0 2019-12-16
2 0 2021-03-01
5 2 2020-01-02
6 2 2020-03-07
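For comparison, a sketch of the same condition expressed with GroupBy.filter, which reads closer to "keep a group only if its latest date is on or after 2020-01-01" but is generally slower than transform when there are many groups:
import pandas as pd

cutoff = pd.Timestamp('2020-01-01')
out = df.groupby('No').filter(lambda g: g['date'].max() >= cutoff)
print(out)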

Rolling Look Forward Sum with Datetime Index in Pandas

I have multivariate time-series/panel data in the following simplified format:
id,date,event_ind
1,2014-01-01,0
1,2014-01-02,1
1,2014-01-03,1
2,2014-01-01,1
2,2014-01-02,1
2,2014-01-03,1
3,2014-01-01,0
3,2014-01-02,0
3,2014-01-03,1
For this simplified example, I would like the future 2-day sum of event_ind grouped by id.
For some reason, adapting this example still gives me the "index is not monotonic" error: how to do forward rolling sum in pandas?
Here is my approach, which worked for backward-looking rolling by group before I adapted it:
df.sort_values(['id','date'], ascending=[True,True], inplace=True)
df.reset_index(drop=True, inplace=True)
df['date'] = pd.DatetimeIndex(df['date'])
df.set_index(['date'], drop=True, inplace=True)
rolling_forward_2_day = lambda x: x.iloc[::-1].rolling('2D').sum().shift(1).iloc[::-1]
df['future_2_day_total'] = df.groupby(['id'], sort=False)['event_ind'].transform(rolling_forward_2_day)
df.reset_index(drop=False, inplace=True)
Here is the expected result:
id date event_ind future_2_day_total
0 1 2014-01-01 0 2
1 1 2014-01-02 1 1
2 1 2014-01-03 1 0
3 2 2014-01-01 1 2
4 2 2014-01-02 1 1
5 2 2014-01-03 1 0
6 3 2014-01-01 0 1
7 3 2014-01-02 0 1
8 3 2014-01-03 1 0
Any tips on what I might be doing wrong or high-performance alternatives would be great!
EDIT:
One quick clarification. This example is simplified, and valid solutions need to be able to handle unevenly spaced/irregular time series, which is why rolling with a time-based index is used.
You can still use rolling here, but use it with the flag win_type='boxcar' and shift your data around before and after you sum:
df['future_day_2_total'] = (
    df.groupby('id').event_ind.shift(-1)
      .fillna(0).groupby(df.id).rolling(2, win_type='boxcar')  # boxcar = equal weights (requires scipy)
      .sum().shift(-1).fillna(0)   # the NaN heading each rolled group makes the cross-group shift safe
      .droplevel(0)                # drop the id level so the result aligns with df's index
)
id date event_ind future_day_2_total
0 1 2014-01-01 0 2.0
1 1 2014-01-02 1 1.0
2 1 2014-01-03 1 0.0
3 2 2014-01-01 1 2.0
4 2 2014-01-02 1 1.0
5 2 2014-01-03 1 0.0
6 3 2014-01-01 0 1.0
7 3 2014-01-02 0 1.0
8 3 2014-01-03 1 0.0
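Given the EDIT about unevenly spaced dates, here is a hedged sketch of a time-based alternative. It keeps the question's reverse-then-roll idea but avoids the "index is not monotonic" error by re-expressing the reversed index as the time remaining until the group's last date, which is monotonic increasing. This trick, and the helper name forward_2d_sum, are my assumptions, not part of the answer above.
import io
import pandas as pd

# rebuild the sample data from the question
csv = """id,date,event_ind
1,2014-01-01,0
1,2014-01-02,1
1,2014-01-03,1
2,2014-01-01,1
2,2014-01-02,1
2,2014-01-03,1
3,2014-01-01,0
3,2014-01-02,0
3,2014-01-03,1"""
df = pd.read_csv(io.StringIO(csv), parse_dates=['date'])

def forward_2d_sum(s: pd.Series) -> pd.Series:
    # s is one id's event_ind, indexed by date (ascending)
    rev = s.iloc[::-1]
    # "time until the last date" increases as we walk backwards in time,
    # so the time-based rolling window is valid on the reversed series
    rev.index = s.index.max() - rev.index
    out = rev.rolling('2D').sum().shift(1)  # shift(1) drops the current row from the window
    out.index = s.index[::-1]               # restore the original dates
    return out.iloc[::-1]

df = df.sort_values(['id', 'date']).set_index('date')
df['future_2_day_total'] = (
    df.groupby('id', sort=False)['event_ind'].transform(forward_2d_sum).fillna(0)
)
df = df.reset_index()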
