How to extract the date component from multiple datetime columns - python

I have a Data Set that looks like this:
import pandas as pd
import numpy as np
data = {'ProcessStartDate': ['08/11/2019 22:59', '07/11/2019 16:18', '04/12/2019 15:00', '24/06/2019 14:20', '24/04/2019 19:16'],
        'ValidationEndTime': ['27/11/2019 11:47', np.nan, np.nan, '28/06/2019 16:23', np.nan],
        'ValidationStartTime': ['08/11/2019 22:59', '06/01/2020 13:52', '27/11/2019 11:47', '24/06/2019 16:44', '10/07/2019 17:41'],
        'AiSStartTime': ['25/03/2020 11:18', '25/03/2020 11:18', '25/03/2020 08:14', '14/08/2019 15:43', '28/06/2019 16:23'],
        'AiSEndTime': [np.nan, np.nan, np.nan, '26/08/2019 14:17', '14/08/2019 15:43']}
df = pd.DataFrame(data)
ProcessStartDate ValidationEndTime ValidationStartTime AiSStartTime AiSEndTime
0 08/11/2019 22:59 27/11/2019 11:47 08/11/2019 22:59 25/03/2020 11:18 NaN
1 07/11/2019 16:18 NaN 06/01/2020 13:52 25/03/2020 11:18 NaN
2 04/12/2019 15:00 NaN 27/11/2019 11:47 25/03/2020 08:14 NaN
3 24/06/2019 14:20 28/06/2019 16:23 24/06/2019 16:44 14/08/2019 15:43 26/08/2019 14:17
4 24/04/2019 19:16 NaN 10/07/2019 17:41 28/06/2019 16:23 14/08/2019 15:43
What I need is to extract the date part of every column and put it into a new column, named after the original column with 'new' appended. The columns are objects, so I can convert them all to datetime format with this code:
cols = ['ProcessStartDate','ValidationEndTime','ValidationStartTime','AiSStartTime','AiSEndTime']
df[cols] = df[cols].apply(pd.to_datetime)
I would have thought I could extract the dates from all the columns using the same code as above with .dt.date added, but this raises an exception.
I have also searched SO for an answer, but I have only been able to find answers that deal with a single column rather than multiple columns.

As stated in the OP, all of the columns can be converted to a datetime format:
df = df.apply(pd.to_datetime)
# extract the date component from the columns
df_new = df.apply(lambda col: col.dt.date)
# add _new to the column names
df_new.columns = [f'{v}_new' for v in df_new.columns]
# display(df_new)
ProcessStartDate_new ValidationEndTime_new ValidationStartTime_new AiSStartTime_new AiSEndTime_new
0 2019-08-11 2019-11-27 2019-08-11 2020-03-25 NaT
1 2019-07-11 NaT 2020-06-01 2020-03-25 NaT
2 2019-04-12 NaT 2019-11-27 2020-03-25 NaT
3 2019-06-24 2019-06-28 2019-06-24 2019-08-14 2019-08-26
4 2019-04-24 NaT 2019-10-07 2019-06-28 2019-08-14
Alternatively, the transformation can be done in a single .apply:
df_new = df.apply(lambda col: pd.to_datetime(col).dt.date)
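Note that the output above reflects pandas' default parsing (month-first where the value is ambiguous). If the source strings are in fact day-first, as '24/06/2019 14:20' suggests, passing dayfirst=True avoids silently swapping day and month; a minimal sketch that also joins the date-only columns back onto the original frame:
import pandas as pd

cols = ['ProcessStartDate', 'ValidationEndTime', 'ValidationStartTime', 'AiSStartTime', 'AiSEndTime']
# df as constructed in the question; dayfirst=True assumes day/month/year input strings
df[cols] = df[cols].apply(pd.to_datetime, dayfirst=True)
# date-only versions of every column, renamed with a _new suffix and joined back
df_new = df[cols].apply(lambda col: col.dt.date)
df_new.columns = [f'{c}_new' for c in df_new.columns]
df = df.join(df_new)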

Related

Pandas compute time duration among 3 columns and skip the none value at the same time

I have a DataFrame; you can reproduce it by running:
import pandas as pd
from io import StringIO
df = """
case_id first_created last_paid submitted_time
3456 2021-01-27 2021-01-29 2021-01-26 21:34:36.566023+00:00
7891 2021-08-02 2021-09-16 2022-10-26 19:49:14.135585+00:00
1245 2021-09-13 None 2022-10-31 02:03:59.620348+00:00
9073 None None 2021-09-12 10:25:30.845687+00:00
"""
df= pd.read_csv(StringIO(df.strip()), sep='\s\s+', engine='python')
df
The logic is to create 2 new columns for each row:
df['create_duration']=df['submitted_time']-df['first_created']
df['paid_duration']=df['submitted_time']-df['last_paid']
The unit needs to be days.
My challenge is that sometimes last_paid or first_created will be None. How can I skip the None value in that row but still compute the other column, if its value is not None?
For example, last_paid in the third row is None but first_created is not, so for this row:
df['create_duration']=df['submitted_time']-df['first_created']
df['paid_duration']='N/A'
You can use:
submitted = pd.to_datetime(df['submitted_time'], errors='coerce', utc=True).dt.tz_localize(None)
df['create_duration'] = submitted.sub(pd.to_datetime(df['first_created'], errors='coerce')).dt.days
df['paid_duration'] = submitted.sub(pd.to_datetime(df['last_paid'], errors='coerce')).dt.days
Output:
case_id first_created last_paid submitted_time create_duration paid_duration
0 3456 2021-01-27 2021-01-29 2021-01-26 21:34:36.566023+00:00 -1.0 -3.0
1 7891 2021-08-02 2021-09-16 2022-10-26 19:49:14.135585+00:00 450.0 405.0
2 1245 2021-09-13 None 2022-10-31 02:03:59.620348+00:00 413.0 NaN
3 9073 None None 2021-09-12 10:25:30.845687+00:00 NaN NaN
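If the literal 'N/A' from the question is wanted instead of NaN, the missing durations can be filled afterwards (at the cost of turning the columns into object dtype, so keeping NaN is usually preferable for further computation); a small sketch on the result above:
# optional: replace missing day counts with the string 'N/A'
dur_cols = ['create_duration', 'paid_duration']
df[dur_cols] = df[dur_cols].astype(object).fillna('N/A')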

How can I count the rows between a date index and a date one month in the future, vectorized in pandas, and add the counts as a column?

I have a dataframe (df) with a date index, and I want to achieve the following:
1. Take the Dates column and add one month -> e.g. nxt_dt = df.index + np.timedelta64(month=1), and let's call df.index curr_dt.
2. Find the nearest entry in Dates that is >= nxt_dt.
3. Count the rows between curr_dt and nxt_dt and put them into a column in df.
The result is supposed to look like this:
px_volume listed_sh ... iv_mid_6m '30d'
Dates ...
2005-01-03 228805 NaN ... 0.202625 21
2005-01-04 189983 NaN ... 0.203465 22
2005-01-05 224310 NaN ... 0.202455 23
2005-01-06 221988 NaN ... 0.202385 20
2005-01-07 322691 NaN ... 0.201065 21
Needless to say, the df only contains rows for dates on which there are observations.
I can think of several ways to get this done with loops, but since the data I work with is quite big, I would really like to avoid looping through the rows to fill them.
Is there a way in pandas to get this done vectorized?
If you are OK with reindexing, this should do the job:
import numpy as np
import pandas as pd
df = pd.DataFrame({'date': ['2020-01-01', '2020-01-08', '2020-01-24', '2020-01-29', '2020-02-09', '2020-03-04']})
df['date'] = pd.to_datetime(df['date'])
df['value'] = 1
df = df.set_index('date')
df = df.reindex(pd.date_range('2020-01-01','2020-03-04')).fillna(0)
df = df.sort_index(ascending=False)
df['30d'] = df['value'].rolling(30).sum() - 1
df.sort_index().query("value == 1")
gives:
value 30d
2020-01-01 1.0 3.0
2020-01-08 1.0 2.0
2020-01-24 1.0 2.0
2020-01-29 1.0 1.0
2020-02-09 1.0 NaN
2020-03-04 1.0 NaN
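If reindexing to a full daily grid is undesirable (the question mentions fairly large data), a hedged alternative is to search the sorted date index directly with numpy.searchsorted; the dates below mirror the sample above and the '30d' column name is only illustrative:
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2020-01-01', '2020-01-08', '2020-01-24',
                      '2020-01-29', '2020-02-09', '2020-03-04'])
df = pd.DataFrame({'value': 1}, index=idx)
# for each date: position of the first observation >= date + 1 month,
# minus the row's own position (and itself) = rows within the following month
nxt = df.index + pd.DateOffset(months=1)
df['30d'] = np.searchsorted(df.index.values, nxt.values, side='left') - np.arange(len(df)) - 1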

How to count total days in pandas dataframe

I have a df column with dates and hours / minutes:
0 2019-09-13 06:00:00
1 2019-09-13 06:05:00
2 2019-09-13 06:10:00
3 2019-09-13 06:15:00
4 2019-09-13 06:20:00
Name: Date, dtype: datetime64[ns]
I need to count how many days the dataframe contains.
I tried it like this:
sample_length = len(df.groupby(df['Date'].dt.date).first())
and
sample_length = len(df.groupby(df['Date'].dt.date))
But the number I get seems wrong. Do you know another method of counting the days?
Consider the sample dates:
sample = pd.date_range('2019-09-12 06:00:00', periods=50, freq='4h')
df = pd.DataFrame({'date': sample})
date
0 2019-09-12 06:00:00
1 2019-09-12 10:00:00
2 2019-09-12 14:00:00
3 2019-09-12 18:00:00
4 2019-09-12 22:00:00
5 2019-09-13 02:00:00
6 2019-09-13 06:00:00
...
47 2019-09-20 02:00:00
48 2019-09-20 06:00:00
49 2019-09-20 10:00:00
Use DataFrame.groupby to group the dataframe on df['date'].dt.date and apply the aggregate function GroupBy.size:
count = df.groupby(df['date'].dt.date).size()
# print(count)
date
2019-09-12 5
2019-09-13 6
2019-09-14 6
2019-09-15 6
2019-09-16 6
2019-09-17 6
2019-09-18 6
2019-09-19 6
2019-09-20 3
dtype: int64
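The number of days is then simply the length of that series; an equivalent shortcut on the same df is to count the unique normalized dates directly:
# number of distinct calendar days in the data
n_days = df['date'].dt.normalize().nunique()  # same as len(count); 9 for the sample above
print(n_days)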
I'm not completely sure what you want to do here. Do you want to count the number of unique days (Monday/Tuesday/...), monthly dates (1-31 ish), yearly dates (1-365), or unique dates (unique days since the dawn of time)?
From a pandas series, you can use {series}.value_counts() to get the number of entries for each unique value, or simply get all unique values with {series}.unique()
import pandas as pd
df = pd.DataFrame(pd.DatetimeIndex(['2016-10-08 07:34:13', '2015-11-15 06:12:48',
                                    '2015-01-24 10:11:04', '2015-03-26 16:23:53',
                                    '2017-04-01 00:38:21', '2015-03-19 03:47:54',
                                    '2015-12-30 07:32:32', '2015-11-10 20:39:36',
                                    '2015-06-24 05:48:09', '2015-03-19 16:05:19'],
                                   dtype='datetime64[ns]', freq=None), columns=["date"])
days (Monday/Tuesday/...):
df.date.dt.dayofweek.value_counts()
monthly dates (1-31 ish)
df.date.dt.day.value_counts()
yearly dates (1-365)
df.date.dt.dayofyear.value_counts()
unique dates (unique days since the dawn of time)
df.date.dt.date.value_counts()
To get the number of unique entries from any of the above, simply add .shape[0]
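For example, using the frame constructed just above, the number of distinct calendar dates would be:
# counts per unique date -> number of unique dates
df.date.dt.date.value_counts().shape[0]  # 9 here, since 2015-03-19 appears twice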
In order to calculate the total number of unique dates in the given time series data example we can use:
print(len(pd.to_datetime(df['Date']).dt.date.unique()))
import pandas as pd
df = pd.DataFrame({'Date': ['2019-09-13 06:00:00',
                            '2019-09-13 06:05:00',
                            '2019-09-13 06:10:00',
                            '2019-09-13 06:15:00',
                            '2019-09-13 06:20:00']},
                  dtype='datetime64[ns]')
df = df.set_index('Date')
_count_of_days = df.resample('D').first().shape[0]
print(_count_of_days)
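One caveat, assuming the same df: resample('D') creates a bin for every calendar day between the first and last timestamp, so days with no observations in the middle of the data are still counted. To count only days that actually contain rows, one option is:
# count only the daily bins that contain at least one observation
_count_of_days = (df.resample('D').size() > 0).sum()
print(_count_of_days)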

Replacing a series with another series of a different length in Pandas for a multiindex

I have a Series with a MultiIndex that I would like to change.
Say I have the following Series named ser:
gbd_wijk_naam gbd_buurt_naam cluster_id weging_datum_weging
Centrale Markt Ecowijk 119617.877|488566.830 2017-05-07 20.248457
2017-05-21 23.558438
2017-05-28 40.910273
2017-06-18 14.142136
2017-07-09 15.652476
...
Westindische Buurt Postjeskade e.o. 118620.633|486116.648 2019-11-17 17.029386
2019-12-01 21.530015
2019-12-08 15.491933
2019-12-15 22.896061
2019-12-22 13.228757
In the end, I want to do this for all indexes, but for now let's focus on just one.
I'm taking the first index, (Centrale Markt, Ecowijk, 119617.877|488566.830). This returns the following series:
weging_datum_weging
2017-05-07 20.248457
2017-05-21 23.558438
2017-05-28 40.910273
2017-06-18 14.142136
2017-07-09 15.652476
2017-07-23 44.067607
2017-07-30 17.464249
2017-08-20 20.000000
2017-08-27 30.184594
2017-09-03 19.104973
2017-09-10 17.175564
2017-09-17 15.968719
2017-09-24 38.415531
2017-10-29 18.708287
2017-11-05 18.574176
2017-11-12 21.095023
2017-12-10 21.794495
2019-01-06 42.966652
2019-01-20 13.038405
2019-01-27 29.483345
2019-02-17 16.278821
2019-02-24 15.968719
2019-03-03 31.583124
2019-03-10 19.748418
2019-04-28 18.574176
2019-05-12 17.029386
2019-05-19 20.976177
2019-06-23 20.493902
2019-07-14 15.329710
2019-09-22 34.537485
2019-09-29 17.320508
2019-10-06 16.431677
2019-10-27 10.246951
2019-11-17 16.733201
2019-11-24 29.567957
Name: weging_netto_gewicht, dtype: float64
With shape (35,)
I want to replace all values at this index with those of an interpolated series that I create through:
_ = ser.loc[('Centrale Markt', 'Ecowijk', '119617.877|488566.830')]
upsampled = _.resample('D')
interpolated = upsampled.interpolate(method='linear')
This series has shape (932,).
I'm able to change the series through:
x = ser.loc[('Centrale Markt', 'Ecowijk', '119617.877|488566.830')]
x = x.reindex(interpolated.index)
x.update(interpolated)
Giving me
weging_datum_weging
2017-05-07 20.248457
2017-05-08 20.484884
2017-05-09 20.721311
2017-05-10 20.957738
2017-05-11 21.194166
...
2019-11-20 22.233810
2019-11-21 24.067347
2019-11-22 25.900884
2019-11-23 27.734420
2019-11-24 29.567957
Freq: D, Name: weging_netto_gewicht, Length: 932, dtype: float64
What I can't seem to figure out is how to put x back into ser at index ('Centrale Markt', 'Ecowijk', '119617.877|488566.830')
When I try to do it for all the indices, for example:
for idx, df_select in ser2.groupby(level=[0, 1, 2]):
    _ = ser.loc[idx]
    upsampled = _.resample('D')
    interpolated = upsampled.interpolate(method='linear')
    ser.loc[idx] = ser.loc[idx].reindex(interpolated.index)
    ser.loc[idx].update(interpolated)
interpolated is generated as it should be, but the second part does not update ser.
I have it working now in this way:
for index, value in interpolated.items():
    new_df = new_df.append(
        {'gbd_wijk_naam': idx[0],
         'gbd_buurt_naam': idx[1],
         'cluster_id': idx[2],
         'weging_datum_weging': index,
         'weging_netto_gewicht': value}, ignore_index=True)
This appends each row to a new df, and that df gets grouped in the same way again later. This is super slow though. How can we speed this up?
Resample works when the index is a DatetimeIndex, TimedeltaIndex or PeriodIndex, but not with a MultiIndex like the one you have.
It is possible to set the timestamp column as the index, group by the other columns and resample/interpolate.
Using the following data for illustration:
gbd_wijk_naam gbd_buurt_naam cluster_id weging_datum_weging
Centrale Markt Ecowijk 119617.877|488566.830 2017-05-07 20.248457
2017-05-21 23.558438
2017-05-28 40.910273
give the series a name & reset_index
df = series.rename('val').reset_index()
ensure datetime column has the right type
df.weging_datum_weging = pd.to_datetime(df.weging_datum_weging)
set index, groupby other cols, resample & interpolate
(df.set_index('weging_datum_weging')
.groupby(['gbd_wijk_naam', 'gbd_buurt_naam', 'cluster_id'])
.val.apply(lambda s: s.resample('D').interpolate('linear')))
produces the output:
gbd_wijk_naam gbd_buurt_naam cluster_id weging_datum_weging
Centrale Markt Ecowijk 119617.877|488566.830 2017-05-07 20.248457
2017-05-08 20.484884
2017-05-09 20.721311
2017-05-10 20.957739
2017-05-11 21.194166
...
2017-07-05 15.364792
2017-07-06 15.436713
2017-07-07 15.508634
2017-07-08 15.580555
2017-07-09 15.652476
Name: val, Length: 64, dtype: float64
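If the interpolated result should drop straight back in for ser, restoring the original series name is enough, since the output already carries the full four-level index; a sketch, assuming the same df as above:
# the result can simply replace ser once the original name is restored
ser = (df.set_index('weging_datum_weging')
         .groupby(['gbd_wijk_naam', 'gbd_buurt_naam', 'cluster_id'])
         .val.apply(lambda s: s.resample('D').interpolate('linear'))
         .rename('weging_netto_gewicht'))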

Filling NaN by 'ffill' and 'interpolate' depending on time of the day of NaN occurrence in Python

I want to fill NaN in a df using 'ffill' and 'interpolate', depending on the time of day at which the NaN occurs. As you can see below, the first NaN occurs at 6 am and the second NaN at 8 am.
02/03/2016 05:00 8
02/03/2016 06:00 NaN
02/03/2016 07:00 1
02/03/2016 08:00 NaN
02/03/2016 09:00 3
My df consists of thousands of days. I want to apply 'ffill' to any NaN that occurs before 7 am and 'interpolate' to those that occur after 7 am. My data runs from 6 am to 6 pm.
My attempt is:
df_imputed = (df.between_time("00:00:00", "07:00:00", include_start=True, include_end=False)).ffill()
df_imputed = (df.between_time("07:00:00", "18:00:00", include_start=True, include_end=True)).interpolate()
But this cuts my df down to the selected time periods rather than filling the NaN as I want.
Edit: my df contains around 400 columns, so the procedure needs to apply to all columns.
Original question: single series of values
You can define a Boolean series according to your condition, then interpolate or ffill as appropriate via numpy.where:
# setup
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['02/03/2016 05:00', '02/03/2016 06:00', '02/03/2016 07:00',
                            '02/03/2016 08:00', '02/03/2016 09:00'],
                   'value': [8, np.nan, 1, np.nan, 3]})
df['date'] = pd.to_datetime(df['date'])
# construct Boolean switch series
switch = (df['date'] - df['date'].dt.normalize()) > pd.to_timedelta('07:00:00')
# use numpy.where to differentiate between two scenarios
df['value'] = np.where(switch, df['value'].interpolate(), df['value'].ffill())
print(df)
date value
0 2016-02-03 05:00:00 8.0
1 2016-02-03 06:00:00 8.0
2 2016-02-03 07:00:00 1.0
3 2016-02-03 08:00:00 2.0
4 2016-02-03 09:00:00 3.0
Updated question: multiple series of values
With multiple value columns, you can adjust the above solution using pd.DataFrame.where and iloc. Or, instead of iloc, you can use loc or other means (e.g. filter) of selecting columns:
# setup
df = pd.DataFrame({'date': ['02/03/2016 05:00', '02/03/2016 06:00', '02/03/2016 07:00',
                            '02/03/2016 08:00', '02/03/2016 09:00'],
                   'value': [8, np.nan, 1, np.nan, 3],
                   'value2': [3, np.nan, 2, np.nan, 6]})
df['date'] = pd.to_datetime(df['date'])
# construct Boolean switch series
switch = (df['date'] - df['date'].dt.normalize()) > pd.to_timedelta('07:00:00')
# use numpy.where to differentiate between two scenarios
df.iloc[:, 1:] = df.iloc[:, 1:].interpolate().where(switch, df.iloc[:, 1:].ffill())
print(df)
date value value2
0 2016-02-03 05:00:00 8.0 3.0
1 2016-02-03 06:00:00 8.0 3.0
2 2016-02-03 07:00:00 1.0 2.0
3 2016-02-03 08:00:00 2.0 4.0
4 2016-02-03 09:00:00 3.0 6.0
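For instance, if the value columns share a recognizable name pattern (an assumption here, using 'value' as the pattern), filter keeps the column selection readable:
# same logic as above, selecting the columns by name instead of position
cols = df.filter(like='value').columns
df[cols] = df[cols].interpolate().where(switch, df[cols].ffill())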
