Datetimelike error with perfectly fine timestamps - Python

I am getting the error "AttributeError: Can only use .dt accessor with datetimelike values" when trying to round the time to 30-minute steps, but I don't see why. I have another dataframe with the exact same timestamps and there's no problem with that one.
for hobo in loggers_hobo:
    df = pd.read_csv(path_input + hobo + ext, skiprows=[0])
    df.rename(columns={'Date Time, GMT+01:00': 'TIMESTAMP',
                       'Abs Pres, psi (LGR S/N: '+hobo+', SEN S/N: '+hobo+')': 'Pressure_TOT_'+hobo,
                       'Temp, °F (LGR S/N: '+hobo+', SEN S/N: '+hobo+')': 'Temperature'},
              inplace=True)
    df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP']).dt.strftime('%Y-%m-%d %H:%M:%S')
    df['TIMESTAMP'] = df['TIMESTAMP'].dt.round('30min')
    df.drop(['#', 'Coupler Detached (LGR S/N: '+hobo+')', 'Coupler Attached (LGR S/N: '+hobo+')',
             'Host Connected (LGR S/N: '+hobo+')', 'End Of File (LGR S/N: '+hobo+')'],
            inplace=True, axis=1)
    df.to_csv(path_temp + hobo + ext)
    print(df)
Dataframe after editing:
TIMESTAMP Pressure_TOT_20796630 Temperature
0 2022-03-11 14:10:00 14.8332 75.659
1 2022-03-11 14:10:22 NaN NaN
2 2022-03-11 14:18:58 NaN NaN
3 2022-03-11 14:19:54 NaN NaN
4 2022-03-11 14:20:58 NaN NaN
... ... ...
16094 2023-02-09 12:40:00 16.5452 42.363
16095 2023-02-09 13:10:00 16.5363 42.179
16096 2023-02-09 13:20:36 NaN NaN
16097 2023-02-09 13:20:43 NaN NaN
16098 2023-02-09 13:20:58 NaN NaN
Any ideas?

The problem is that you need to round datetimes, not the strings that strftime produces; after strftime the column has object dtype, so the .dt accessor no longer works. So instead of:
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP']).dt.strftime('%Y-%m-%d %H:%M:%S')
df['TIMESTAMP'] = df['TIMESTAMP'].dt.round('30min')
use:
df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP']).dt.round('30min').dt.strftime('%Y-%m-%d %H:%M:%S')
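For illustration, a minimal sketch of why the original order fails: strftime converts the column to plain strings, and the .dt accessor only works on datetime64 columns.

import pandas as pd

s = pd.Series(pd.to_datetime(['2022-03-11 14:10:22']))
print(s.dt.round('30min'))    # works: dtype is datetime64[ns]

s_str = s.dt.strftime('%Y-%m-%d %H:%M:%S')
print(s_str.dtype)            # object -- just strings now
# s_str.dt.round('30min')     # raises: Can only use .dt accessor with datetimelike values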

Related

Keep the conditional calculation result in a dataframe

My data frame dft2022 is:
Start_Time End_Time
9:55:00 10:55:00
5:41:00 14:55:00
9:01:00 12:55:00
9:02:00 7:55:00
8:55:00 N/A
11:55:00 N/A
I want to add a duration column in this dataframe, the duration = End_Time - Start_Time. I used the following:
dft2022["duration"] = pd.to_datetime(dft2022["End_Time"]) - pd.to_datetime(dft2022["Start_Time"])
However, I only want to keep the "duration" value where End_Time - Start_Time is a positive value and End_Time is not N/A.
The output that you expect is unclear, but you could use clip to set the negative deltas to 0:
dft2022['duration'] = (pd.to_datetime(dft2022["End_Time"])
                       .sub(pd.to_datetime(dft2022["Start_Time"]))
                       .clip(lower='0')
                       )
Output:
Start_Time End_Time duration
0 9:55:00 10:55:00 0 days 01:00:00
1 5:41:00 14:55:00 0 days 09:14:00
2 9:01:00 12:55:00 0 days 03:54:00
3 9:02:00 7:55:00 0 days 00:00:00
4 8:55:00 NaN NaT
5 11:55:00 NaN NaT
To filter the rows, you can use:
dft2022[pd.to_datetime(dft2022["End_Time"])
        .sub(pd.to_datetime(dft2022["Start_Time"]))
        .gt('0')]
Output:
Start_Time End_Time
0 9:55:00 10:55:00
1 5:41:00 14:55:00
2 9:01:00 12:55:00
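If you want both behaviors in one step, a minimal sketch (assuming the same dft2022 frame): compute the delta, then keep only the strictly positive values, leaving NaT everywhere else.

import pandas as pd

delta = pd.to_datetime(dft2022["End_Time"]) - pd.to_datetime(dft2022["Start_Time"])
# where() keeps values satisfying the condition and fills the rest with NaT
dft2022["duration"] = delta.where(delta > pd.Timedelta(0))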
You can try something like this; it needs NumPy.
import numpy as np
import pandas as pd

data = {'Start_Date': ['9:55:00', '5:41:00', '9:01:00', '9:02:00', '8:55:00', '11:55:00'],
        'End_Date': ['10:55:00', '14:55:00', '12:55:00', '7:55:00', 'N/A', 'N/A']}
df = pd.DataFrame(data)

# errors='coerce' sets NaT where a value cannot be parsed as a datetime.
df['_end_time'] = pd.to_datetime(df.End_Date, errors='coerce')
df['_start_time'] = pd.to_datetime(df.Start_Date, errors='coerce')

df["duration"] = np.where(
    df._end_time >= df._start_time,
    df._end_time - df._start_time,
    pd.NaT
)
df.drop(columns=['_start_time', '_end_time'], inplace=True)
print(df)
Output:
Start_Date End_Date duration
0 9:55:00 10:55:00 3600000000000
1 5:41:00 14:55:00 33240000000000
2 9:01:00 12:55:00 14040000000000
3 9:02:00 7:55:00 NaT
4 8:55:00 N/A NaT
5 11:55:00 N/A NaT
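The durations print as raw nanosecond integers because np.where returns a plain NumPy array. A sketch of an alternative, run before dropping the helper columns, that keeps the timedelta64 dtype so the values print as 0 days 01:00:00:

# Series.where preserves the timedelta64 dtype; rows failing the condition become NaT
delta = df._end_time - df._start_time
df["duration"] = delta.where(df._end_time >= df._start_time)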

How to extract the date component from multiple datetime columns

I have a Data Set that looks like this:
import pandas as pd
import numpy as np
data = {'ProcessStartDate': ['08/11/2019 22:59', '07/11/2019 16:18', '04/12/2019 15:00', '24/06/2019 14:20', '24/04/2019 19:16'],
'ValidationEndTime': ['27/11/2019 11:47', np.nan, np.nan, '28/06/2019 16:23', np.nan],
'ValidationStartTime': ['08/11/2019 22:59', '06/01/2020 13:52', '27/11/2019 11:47', '24/06/2019 16:44', '10/07/2019 17:41'],
'AiSStartTime': ['25/03/2020 11:18', '25/03/2020 11:18', '25/03/2020 08:14', '14/08/2019 15:43', '28/06/2019 16:23'],
'AiSEndTime': [np.nan, np.nan, np.nan, '26/08/2019 14:17', '14/08/2019 15:43']}
df = pd.DataFrame(data)
ProcessStartDate ValidationEndTime ValidationStartTime AiSStartTime AiSEndTime
0 08/11/2019 22:59 27/11/2019 11:47 08/11/2019 22:59 25/03/2020 11:18 NaN
1 07/11/2019 16:18 NaN 06/01/2020 13:52 25/03/2020 11:18 NaN
2 04/12/2019 15:00 NaN 27/11/2019 11:47 25/03/2020 08:14 NaN
3 24/06/2019 14:20 28/06/2019 16:23 24/06/2019 16:44 14/08/2019 15:43 26/08/2019 14:17
4 24/04/2019 19:16 NaN 10/07/2019 17:41 28/06/2019 16:23 14/08/2019 15:43
What I need is to extract the date part of every column and put it into a new column named after the original column with 'new' appended. The columns are objects, so I can transform them all to datetime format with this code:
cols = ['ProcessStartDate','ValidationEndTime','ValidationStartTime','AiSStartTime','AiSEndTime']
df[cols] = df[cols].apply(pd.to_datetime)
I would have thought that I could extract the dates from all the columns using the same code as above plus .dt.date, but this raises an exception.
I have also searched SO for an answer, but I have only been able to find answers that deal with doing this for one column, not multiple.
As stated in the OP, all of the columns can be converted to a datetime format:
df = df.apply(pd.to_datetime)
# extract the date component from the columns
df_new = df.apply(lambda col: col.dt.date)
# add _new to the column names
df_new.columns = [f'{v}_new' for v in df_new.columns]
# display(df_new)
ProcessStartDate_new ValidationEndTime_new ValidationStartTime_new AiSStartTime_new AiSEndTime_new
0 2019-08-11 2019-11-27 2019-08-11 2020-03-25 NaT
1 2019-07-11 NaT 2020-06-01 2020-03-25 NaT
2 2019-04-12 NaT 2019-11-27 2020-03-25 NaT
3 2019-06-24 2019-06-28 2019-06-24 2019-08-14 2019-08-26
4 2019-04-24 NaT 2019-10-07 2019-06-28 2019-08-14
Alternatively, the transformation can be done in a single .apply:
df_new = df.apply(lambda col: pd.to_datetime(col).dt.date)
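One caveat: the sample dates are day-first (e.g. 24/06/2019), so it is safer to pass dayfirst=True; without it, ambiguous values such as 08/11/2019 parse as August 11 (as in the output above) rather than November 8. A sketch:

df_new = df.apply(lambda col: pd.to_datetime(col, dayfirst=True).dt.date)
df_new.columns = [f'{v}_new' for v in df_new.columns]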

How to fill periods in columns?

There is a dataframe whose period column contains lists of time spans, built as follows:
#load data
df = pd.DataFrame(data, columns=['task_id', 'target_start_date', 'target_end_date'])
df['target_start_date'] = pd.to_datetime(df.target_start_date)
df['target_end_date'] = pd.to_datetime(df.target_end_date)
df['period'] = np.nan

#create period column
z = dict()
freq = 'M'
for i in range(0, len(df)):
    l = pd.period_range(df.target_start_date[i], df.target_end_date[i], freq=freq)
    l = l.to_native_types()
    z[i] = l
df.period = z.values()
Output
task_id target_start_date target_end_date period
0 35851 2019-04-01 07:00:00 2019-04-01 07:00:00 [2019-04]
1 35852 2020-02-26 11:30:00 2020-02-26 11:30:00 [2020-02]
2 35854 2019-05-17 07:00:00 2019-06-01 17:30:00 [2019-05, 2019-06]
3 35855 2019-03-20 11:30:00 2019-04-07 15:00:00 [2019-03, 2019-04]
4 35856 2019-04-06 08:00:00 2019-04-26 19:00:00 [2019-04]
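As an aside, to_native_types is deprecated in recent pandas; a sketch of the same period-list construction with astype(str) in place of the loop above:

df['period'] = [pd.period_range(s, e, freq='M').astype(str).tolist()
                for s, e in zip(df.target_start_date, df.target_end_date)]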
Then I add columns which are called time slices.
#create slices
date_min = df.target_start_date.min()
date_max = df.target_end_date.max()
period = pd.period_range(date_min, date_max, freq=freq)
#add columns
for i in period:
    df[str(i)] = np.nan
The result: the dataframe now has one all-NaN column per period.
How can I fill the NaN values with True where the column's period appears in the list in the period column?
Apply a function across the dataframe rows:

def fillit(row):
    for i in row.period:
        row[i] = True
    return row

df = df.apply(fillit, axis=1)
My approach was to iterate over rows and column names and compare values:
import numpy as np
import pandas as pd

# handle assignment error
pd.options.mode.chained_assignment = None

# setup test data
data = {'time': [['2019-04'], ['2019-01'], ['2019-03'], ['2019-06', '2019-05']]}
data = pd.DataFrame(data=data)

# create periods
date_min = data.time.min()[0]
date_max = data.time.max()[0]
period = pd.period_range(date_min, date_max, freq='M')
for i in period:
    data[str(i)] = np.nan

# compare and fill data
for index, row in data.iterrows():
    for column in data:
        if data[column].name in row['time']:
            data[column][index] = 'True'
Output:
time 2019-01 2019-02 2019-03 2019-04 2019-05 2019-06
0 [2019-04] NaN NaN NaN True NaN NaN
1 [2019-01] True NaN NaN NaN NaN NaN
2 [2019-03] NaN NaN True NaN NaN NaN
3 [2019-06, 2019-05] NaN NaN NaN NaN True True
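A sketch of a simpler fill for the same test frame, replacing the nested iterrows loop with one boolean mask per period column:

for col in map(str, period):
    mask = [col in row for row in data['time']]
    data.loc[mask, col] = True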

Shifting Date time index

I am trying to shift only the last row of my datetime index by one day, e.g. so that 2018-04-09 shows as 2018-04-08. I tried a few ways, each giving a different error, such as:
df.index[-1] = df.index[-1] + pd.offsets.Day(1)
TypeError: Index does not support mutable operations
Can you kindly advise a suitable way please?
My df looks like this:
FinalPosition
dt
2018-04-03 1.32
2018-04-04 NaN
2018-04-05 NaN
2018-04-06 NaN
2018-04-09 NaN
Use rename if the values of the DatetimeIndex are unique:
df = df.rename({df.index[-1]: df.index[-1] + pd.offsets.Day(1)})
print (df)
FinalPosition
dt
2018-04-03 1.32
2018-04-04 NaN
2018-04-05 NaN
2018-04-06 NaN
2018-04-10 NaN
If the index is possibly not unique, DatetimeIndex.insert works:
df.index = df.index[:-1].insert(len(df) - 1, df.index[-1] + pd.offsets.Day(1))
Use .iloc
Ex:
import pandas as pd

df = pd.DataFrame({"datetime": ["2018-04-09"]})
df["datetime"] = pd.to_datetime(df["datetime"])
print(df["datetime"].iloc[-1:] - pd.offsets.Day(1))
Output:
0 2018-04-08
Name: datetime, dtype: datetime64[ns]
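If the goal is to write the shifted value back instead of just printing it, a minimal sketch against the original frame with its dt index (subtracting a day here, per the question's example):

idx = df.index.tolist()
idx[-1] = idx[-1] - pd.offsets.Day(1)
df.index = pd.DatetimeIndex(idx, name=df.index.name)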

How to resample without skipping NaN values in pandas

I am trying to get the 10-day aggregate of my data, which has NaN values. The 10-day sum should return NaN if any value in the 10-day window is NaN.
When I apply the code below, pandas treats NaN as zero and returns the sum of the remaining days.
dateRange = pd.date_range(start_date, periods=len(data), freq='D')
# Creating a data frame so that the timeseries can handle numpy array.
df = pd.DataFrame(data)
base_Series = pd.DataFrame(list(df.values), index=dateRange)
# Converting to aggregated series
agg_series = base_Series.resample('10D', how='sum')
agg_data = agg_series.values
Sample Data:
2011-06-01 46.520536
2011-06-02 8.988311
2011-06-03 0.133823
2011-06-04 0.274521
2011-06-05 1.283360
2011-06-06 2.556313
2011-06-07 0.027461
2011-06-08 0.001584
2011-06-09 0.079193
2011-06-10 2.389549
2011-06-11 NaN
2011-06-12 0.195844
2011-06-13 0.058720
2011-06-14 6.570925
2011-06-15 0.015107
2011-06-16 0.031066
2011-06-17 0.073008
2011-06-18 0.072198
2011-06-19 0.044534
2011-06-20 0.240080
Output:
2011-06-01 62.254651
2011-06-11 7.301481
This uses the numpy sum, which will return NaN if a NaN is present in the window:
In [35]: s = Series(randn(100),index=date_range('20130101',periods=100))
In [36]: s.iloc[11] = np.nan
In [37]: s.resample('10D',how=lambda x: x.values.sum())
Out[37]:
2013-01-01 6.910729
2013-01-11 NaN
2013-01-21 -1.592541
2013-01-31 -2.013012
2013-02-10 1.129273
2013-02-20 -2.054807
2013-03-02 4.669622
2013-03-12 3.489225
2013-03-22 0.390786
2013-04-01 -0.005655
dtype: float64
To filter out those days which have any NaNs, I propose that you do:
noNaN_days_only = s.groupby(lambda x: x.date).filter(lambda x: ~x.isnull().any())
where s is the Series above.
Just apply an agg function:
agg_series = base_Series.resample('10D').agg(lambda x: np.nan if np.isnan(x).any() else np.sum(x))
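On current pandas, where resample(..., how=...) no longer exists, the same idea can be written as an agg over each window; a sketch:

# skipna=False makes the sum NaN as soon as the window contains any NaN
agg_series = base_Series.resample('10D').agg(lambda x: x.sum(skipna=False))
# or: base_Series.resample('10D').sum(min_count=10)  # NaN unless all 10 days are valid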
