I have rather unusual time series data: it is both irregular and has several missing values.
The data points are measured three times a day, on weekdays only, at 10:00 AM, 2:00 PM, and 6:00 PM. Most days are missing one or two measurements, and some days are missing altogether.
My df looks something like this:
date time | value
0 2020-07-30 10:00:00 5
1 2020-07-30 14:00:00 3
2 2020-07-31 10:00:00 6
3 2020-07-31 14:00:00 4.5
4 2020-07-31 18:00:00 7
5 2020-08-03 14:00:00 5.5
6 2020-08-04 14:00:00 5
I'm trying to figure out how to fill it in with timestamps for the missing measurements (adding a row with the missing timestamp and an NA value), but without adding extra times of day or any Saturdays or Sundays, so that my df looks like this at the end:
date time | value
0 2020-07-30 10:00:00 5
1 2020-07-30 14:00:00 3
2 2020-07-30 18:00:00 NA
3 2020-07-31 10:00:00 6
4 2020-07-31 14:00:00 4.5
5 2020-07-31 18:00:00 7
6 2020-08-03 10:00:00 NA
7 2020-08-03 14:00:00 5.5
8 2020-08-03 18:00:00 NA
9 2020-08-04 10:00:00 NA
10 2020-08-04 14:00:00 5
11 2020-08-04 18:00:00 NA
The only thing I could come up with was pretty convoluted: write a loop to generate a row for every date in the desired date range * 3 (one for each measurement time) formatted as datetimes, along with an additional day-of-week counter. Convert it into a df, drop all rows where day of week = 6,7, then join this new df with my original df on the datetime column (outer or left, whichever one keeps all rows).
Is there any more elegant way of doing this?
You could create a filtered date range and index by it:
all_ts = pd.date_range(start=df['date time'].min(), end=df['date time'].max(), freq='H')
weekday_ts = all_ts[~all_ts.weekday.isin([5, 6])]
filtered_ts = weekday_ts[weekday_ts.hour.isin([10, 14, 18])]
df.set_index('date time').reindex(filtered_ts).rename_axis('date time').reset_index()
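An equivalent sketch uses pd.bdate_range, which generates business days only, so the weekend filter disappears; building each day's three slots explicitly also keeps the 18:00 slot on the last day, which the hourly range above stops short of when the last measurement falls earlier in the day. The 'date time' column name is taken from the question:
import pandas as pd

# business days only: Saturdays and Sundays never appear
bdays = pd.bdate_range(df['date time'].min().normalize(),
                       df['date time'].max().normalize())
# the three measurement times, as offsets from midnight
times = pd.to_timedelta(['10h', '14h', '18h'])
full_index = pd.DatetimeIndex([d + t for d in bdays for t in times])

result = (df.set_index('date time')
            .reindex(full_index)
            .rename_axis('date time')
            .reset_index())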
import datetime
import pandas as pd

df = pd.DataFrame([
    {"date time": datetime.datetime.strptime("2020-07-30 10:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5},
    {"date time": datetime.datetime.strptime("2020-07-30 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 3},
    {"date time": datetime.datetime.strptime("2020-07-31 10:00:00", '%Y-%m-%d %H:%M:%S'), "value": 6},
    {"date time": datetime.datetime.strptime("2020-07-31 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 4.5},
    {"date time": datetime.datetime.strptime("2020-07-31 18:00:00", '%Y-%m-%d %H:%M:%S'), "value": 7},
    {"date time": datetime.datetime.strptime("2020-08-03 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5.5},
    {"date time": datetime.datetime.strptime("2020-08-04 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5},
])
# define the range of dates you're working with
range_dates = pd.date_range('2020-07-30', '2020-08-04', freq='D')
# remove weekend days
range_dates = range_dates[~range_dates.weekday.isin([5,6])]
range_dates = pd.Series(range_dates)
# create the range of the 3 daily measurement times
range_times = pd.date_range('10:00:00', '18:00:00', freq='4H')
range_times = pd.Series(range_times.time)
# we combine our two ranges
index = range_dates.apply(
lambda date: range_times.apply(
lambda time: datetime.datetime.combine(date, time)
)
).unstack()
# set the datetimes as the index, reindex on the combined range, and sort
df = df.set_index('date time').reindex(index).sort_index()
Output:
value
2020-07-30 10:00:00 5.0
2020-07-30 14:00:00 3.0
2020-07-30 18:00:00 NaN
2020-07-31 10:00:00 6.0
2020-07-31 14:00:00 4.5
2020-07-31 18:00:00 7.0
2020-08-03 10:00:00 NaN
2020-08-03 14:00:00 5.5
2020-08-03 18:00:00 NaN
2020-08-04 10:00:00 NaN
2020-08-04 14:00:00 5.0
2020-08-04 18:00:00 NaN
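Since the desired result keeps the datetimes as a column rather than as the index, one last step (a sketch, restoring the 'date time' column name) could be:
df = df.rename_axis('date time').reset_index()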
I'm trying to measure the difference between timestamps using certain conditions. Using the below, for each unique ID, I'm hoping to subtract the End Time where Item == A from the Start Time where Item == D.
So the timestamps are actually located on separate rows.
At the moment my process is returning an error. I'm also hoping to drop the .shift() for something more robust, as each unique ID will have different combinations, e.g. A,B,C,D; A,B,D; A,D; etc.
df = pd.DataFrame({'ID': [10,10,10,20,20,30],
'Start Time': ['2019-08-02 09:00:00','2019-08-03 10:50:00','2019-08-05 16:00:00','2019-08-04 08:00:00','2019-08-04 15:30:00','2019-08-06 11:00:00'],
'End Time': ['2019-08-04 15:00:00','2019-08-04 16:00:00','2019-08-05 16:00:00','2019-08-04 14:00:00','2019-08-05 20:30:00','2019-08-07 10:00:00'],
'Item': ['A','B','D','A','D','A'],
})
df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
.reset_index(drop=True))
Intended Output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-04 15:00:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-04 16:00:00 B NaT
2 10 2019-08-05 16:00:00 2019-08-05 16:00:00 D 1 days 01:00:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 A NaT
4 20 2019-08-04 15:30:00 2019-08-05 20:30:00 D 0 days 01:30:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
df2 = df.set_index('ID')
df2.query('Item == "D"')['Start Time']-df2.query('Item == "A"')['End Time']
output:
ID
10   1 days 01:00:00
20   0 days 01:30:00
30 NaT
dtype: timedelta64[ns]
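To attach these differences back onto the original frame so it matches the intended output, a small sketch (the .where keeps the value only on the D rows):
delta = df2.query('Item == "D"')['Start Time'] - df2.query('Item == "A"')['End Time']
df['diff'] = df['ID'].map(delta).where(df['Item'].eq('D'))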
older answer
The issue is your fillna; you can't have strings in a timedelta column:
df['diff'] = (df.groupby('ID')
.apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
#.fillna('-') # the issue is here
.reset_index(drop=True))
output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-02 09:30:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-03 11:00:00 B 0 days 00:30:00
2 10 2019-08-04 15:00:00 2019-08-05 16:00:00 C 0 days 00:10:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 B NaT
4 20 2019-08-05 10:30:00 2019-08-05 20:30:00 C 0 days 06:00:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
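If a '-' placeholder is wanted purely for display, one option (a sketch; diff_display is a hypothetical column name) is to format a string copy and leave the timedelta column itself intact:
# string copy for display only; 'diff' stays a timedelta column
df['diff_display'] = df['diff'].astype(str).replace('NaT', '-')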
IIUC use:
df1 = df.pivot(index='ID', columns='Item')
print (df1)
Start Time \
Item A B D
ID
10 2019-08-02 09:00:00 2019-08-03 10:50:00 2019-08-05 16:00:00
20 2019-08-04 08:00:00 NaT 2019-08-04 15:30:00
30 2019-08-06 11:00:00 NaT NaT
End Time
Item A B D
ID
10 2019-08-04 15:00:00 2019-08-04 16:00:00 2019-08-05 16:00:00
20 2019-08-04 14:00:00 NaT 2019-08-05 20:30:00
30 2019-08-07 10:00:00 NaT NaT
a = df1[('Start Time','D')].sub(df1[('End Time','A')])
print (a)
ID
10   1 days 01:00:00
20   0 days 01:30:00
30 NaT
dtype: timedelta64[ns]
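Note that pivot raises an error if any ID repeats an Item; if duplicates are possible, pivot_table with aggfunc='first' is one possible fallback (a sketch):
df1 = df.pivot_table(index='ID', columns='Item',
                     values=['Start Time', 'End Time'], aggfunc='first')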
Suppose we have two dataframes, one with a timestamp and the other with start and end timestamps. df1 and df2 as:
df1:
DateTime | Value1
2020-01-11 12:30:00 | 1
2020-01-11 13:00:00 | 2
2020-02-11 13:30:00 | 3
2020-02-11 14:00:00 | 4
2020-02-11 14:30:00 | 5
2020-02-11 15:00:00 | 6
2020-02-11 15:30:00 | 7
2020-02-11 16:00:00 | 8
df2:
StartDateTime | EnddDateTime | Value2
2020-01-11 12:23:12 | 2020-01-11 13:10:00 | a
2020-01-11 14:12:20 | 2020-01-11 14:20:34 | b
2020-01-11 15:20:00 | 2020-01-11 15:28:10 | c
2020-01-11 15:45:20 | 2020-01-11 16:26:23 | d
Each timestamp in df1 represents a half-hour window starting at the time in the DateTime column. I want to match df2's start and end times against these 30-minute windows. A value of df2 may fall in two rows of df1 if its period (the time between start and end) overlaps two DateTime windows, even by only one second. The outcome should be a dataframe as below.
DateTime | Value1 | Value2
2020-01-11 12:30:00 | 1 | a
2020-01-11 13:00:00 | 2 | a
2020-02-11 13:30:00 | 3 | NaN
2020-02-11 14:00:00 | 4 | b
2020-02-11 14:30:00 | 5 | NaN
2020-02-11 15:00:00 | 6 | c
2020-02-11 15:30:00 | 7 | d
2020-02-11 16:00:00 | 8 | d
Any suggestions for efficiently merging large data?
There may be shorter, better answers out there; I'm going longhand here.
First, melt the second data frame:
df3=pd.melt(df2, id_vars=['Value2'], value_vars=['StartDateTime', 'EnddDateTime'],value_name='DateTime').sort_values(by='DateTime')
Create temporary columns on both dfs: extract the time from each datetime and attach it to a uniform date, so the merge compares times of day only:
df1['DateTime1']=pd.Timestamp('today').strftime('%Y-%m-%d') + ' ' +pd.to_datetime(df1['DateTime']).dt.time.astype(str)
df3['DateTime1']=pd.Timestamp('today').strftime('%Y-%m-%d') + ' ' +pd.to_datetime(df3['DateTime']).dt.time.astype(str)
Convert the new columns computed above to datetime:
df3["DateTime1"]=pd.to_datetime(df3["DateTime1"])
df1["DateTime1"]=pd.to_datetime(df1["DateTime1"])
Finally, merge_asof with a time tolerance:
final = pd.merge_asof(df1, df3, on="DateTime1", tolerance=pd.Timedelta("39min"), suffixes=('_', '_df2')).drop(columns=['DateTime1', 'variable', 'DateTime_df2'])
DateTime_ Value1 Value2
0 2020-01-11 12:30:00 1 a
1 2020-01-11 13:00:00 2 a
2 2020-02-11 13:30:00 3 a
3 2020-02-11 14:00:00 4 NaN
4 2020-02-11 14:30:00 5 b
5 2020-02-11 15:00:00 6 NaN
6 2020-02-11 15:30:00 7 c
7 2020-02-11 16:00:00 8 d
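Note this asof merge compares times of day only and discards the date part, which is what makes the sample work (the df2 dates don't line up with df1; the month digit may be a typo). If the dates are meant to match, an explicit interval-overlap check via a cross join is one alternative for moderately sized frames (a sketch; requires pandas >= 1.2 for how='cross', and win_start/win_end are hypothetical helper columns):
import pandas as pd

# each df1 row covers the half-hour window [DateTime, DateTime + 30min)
left = df1.assign(win_start=df1['DateTime'],
                  win_end=df1['DateTime'] + pd.Timedelta('30min'))

# cross join, then keep the pairs whose intervals overlap
pairs = left.merge(df2, how='cross')
overlap = (pairs['StartDateTime'] < pairs['win_end']) & \
          (pairs['EnddDateTime'] > pairs['win_start'])

# left-join the matches back so unmatched windows keep NaN
out = df1.merge(pairs.loc[overlap, ['DateTime', 'Value2']],
                on='DateTime', how='left')
For truly large data the cross join explodes in memory, so something like merge_asof on the full datetimes, or an interval-based lookup, would scale better.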
I would like to identify the maximum value in a column that occurs within the X days following the current date.
This is a subselect of the data frame showing the daily values for 2020.
Date Data
6780 2020-01-02 323.540009
6781 2020-01-03 321.160004
6782 2020-01-06 320.489990
6783 2020-01-07 323.019989
6784 2020-01-08 322.940002
... ... ...
7028 2020-12-24 368.079987
7029 2020-12-28 371.739990
7030 2020-12-29 373.809998
7031 2020-12-30 372.339996
I would like to find a way to identify the max value within the following 30 days. e.g.
Date Data Max
6780 2020-01-02 323.540009 323.019989
6781 2020-01-03 321.160004 323.019989
6782 2020-01-06 320.489990 323.730011
6783 2020-01-07 323.019989 323.540009
6784 2020-01-08 322.940002 325.779999
... ... ... ...
7028 2020-12-24 368.079987 373.809998
7029 2020-12-28 371.739990 373.809998
7030 2020-12-29 373.809998 372.339996
7031 2020-12-30 372.339996 373.100006
I tried calculating the start and end dates and storing them in the columns. e.g.
df['startDate'] = df['Date'] + pd.to_timedelta(1, unit='d')
df['endDate'] = df['Date'] + pd.to_timedelta(30, unit='d')
before trying to calculate the max, e.g.:
df['Max'] = df.loc[(df['Date'] > df['startDate']) & (df['Date'] < df['endDate'])]['Data'].max()
But this results in;
Date Data startDate endDate Max
6780 2020-01-02 323.540009 2020-01-03 2020-01-29 NaN
6781 2020-01-03 321.160004 2020-01-04 2020-01-30 NaN
6782 2020-01-06 320.489990 2020-01-07 2020-02-02 NaN
6783 2020-01-07 323.019989 2020-01-08 2020-02-03 NaN
6784 2020-01-08 322.940002 2020-01-09 2020-02-04 NaN
... ... ... ... ... ...
7027 2020-12-23 368.279999 2020-12-24 2021-01-19 NaN
7028 2020-12-24 368.079987 2020-12-25 2021-01-20 NaN
7029 2020-12-28 371.739990 2020-12-29 2021-01-24 NaN
7030 2020-12-29 373.809998 2020-12-31 2021-01-26 NaN
If I statically add dates to the loc[] statement, it partially works, filling the max for that static range; however, this just gives me the same value for every row.
Any help on the correct pandas way to achieve this would be appreciated.
Kind Regards
df.rolling can do this if you make the date a datetime index:
df["Date"] = pd.to_datetime(df.Date)
df.set_index("Date").rolling("2d").max()
output:
Data
Date
2020-01-02 323.540009
2020-01-03 323.540009
2020-01-06 320.489990
2020-01-07 323.019989
2020-01-08 323.019989
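A time-based rolling window looks backward, though, while the question asks for the max over the following 30 days (excluding the current day, judging by the desired output). A simple forward-looking sketch, assuming df is already sorted by Date:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
s = df.set_index('Date')['Data'].sort_index()

# for each date t, take the max over the window (t, t + 30 days]
df['Max'] = [s.loc[t + pd.Timedelta('1D'): t + pd.Timedelta('30D')].max()
             for t in s.index]
This is a per-row slice rather than a vectorized pass, but for a few thousand daily rows it is plenty fast; dates near the end of the series with no following data come out as NaN.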
My dataset has a datetime column that has one entry for every hour of the day for many days.
For example:
123412,2020-03-26 12:00,
123412,2020-03-27 12:00,
123412,2020-03-27 09:00,
123412,2020-03-27 09:00,
123412,2020-03-27 15:00,
123412,2020-03-26 15:00,
123412,2020-03-27 11:00,
123412,2020-03-27 12:00,
The example is not ordered, but as I said, there is one entry for every hour of the day.
The way I want to filter this data is, for example: take the datetime 2020-03-26 12:00.
Then the filter will return the following rows:
2020-03-26 12:00
2020-03-25 12:00
2020-03-24 12:00
and so on.
I've tried the Grouper like this df2 = df2.groupby(pd.Grouper(key=DATETIME, freq='D')) but that didn't work.
How can I accomplish this? Thank you
You can filter datetimes by time of day with boolean indexing and Series.dt.time:
print (df)
a date b
0 123412 2020-03-26 12:00:00 NaN
1 123412 2020-03-27 12:00:00 NaN
2 123412 2020-03-27 09:00:00 NaN
3 123412 2020-03-27 09:00:00 NaN
4 123412 2020-03-27 15:00:00 NaN
5 123412 2020-03-26 15:00:00 NaN
6 123412 2020-03-27 11:00:00 NaN
7 123412 2020-03-27 12:00:00 NaN
d = '2020-03-26 12:00'
df = df[df['date'].dt.time.eq(pd.Timestamp(d).time())]
print (df)
a date b
0 123412 2020-03-26 12:00:00 NaN
1 123412 2020-03-27 12:00:00 NaN
7 123412 2020-03-27 12:00:00 NaN
If want only unique datetimes:
d = '2020-03-26 12:00'
df = df.drop_duplicates('date')
df = df[df['date'].dt.time.eq(pd.Timestamp(d).time())]
print (df)
a date b
0 123412 2020-03-26 12:00:00 NaN
1 123412 2020-03-27 12:00:00 NaN
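If the datetime column is set as the index, DataFrame.at_time expresses the same filter more directly (a sketch):
# select every row whose time of day is exactly 12:00
print(df.set_index('date').at_time('12:00'))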
I have the following dataframe
data = pd.DataFrame({
'date': ['1988/01/12', '1988/01/13', '1988/01/14', '1989/01/20','1990/01/01'],
'value': [11558522, 12323552, 13770958, 18412280, 13770958]
})
Is there a way in Python that I can average the values for a whole month and make that the new value for that month?
I.e., I want to average the 1988-01 values and make that the final value for 1988-01. I tried the groupby method, but that didn't work:
new_df = data.groupby(['date']).mean()
Use month periods created by Series.dt.to_period:
data['date'] = pd.to_datetime(data['date'])
new_df=data.groupby(data['date'].dt.to_period('m')).mean()
print (new_df)
value
date
1988-01 1.255101e+07
1989-01 1.841228e+07
1990-01 1.377096e+07
Or use DataFrame.resample and if necessary remove missing values:
new_df=data.resample('MS', on='date').mean().dropna()
print (new_df)
value
date
1988-01-01 1.255101e+07
1989-01-01 1.841228e+07
1990-01-01 1.377096e+07
Or you can use years and months separately for a MultiIndex:
new_df=data.groupby([data['date'].dt.year.rename('y'),
data['date'].dt.month.rename('m')]).mean()
print (new_df)
value
y m
1988 1 1.255101e+07
1989 1 1.841228e+07
1990 1 1.377096e+07
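Since the stated goal is to make the monthly average the value for that month, a transform-based sketch broadcasts the mean back onto every row of the original frame:
# every row in a month gets that month's mean
data['value'] = data.groupby(data['date'].dt.to_period('m'))['value'].transform('mean')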
df = pd.read_csv("data .csv", encoding='ISO-8859-1', parse_dates=["datetime"])
print(df)
print(df.dtypes)
datetime Temperature
0 1987-11-01 07:00:00 21.4
1 1987-11-01 13:00:00 27.4
2 1987-11-01 19:00:00 25.0
3 1987-11-02 07:00:00 22.0
4 1987-11-02 13:00:00 27.6
... ...
27554 2020-03-30 13:00:00 24.8
27555 2020-03-30 18:00:00 23.8
27556 2020-03-31 07:00:00 23.4
27557 2020-03-31 13:00:00 24.6
27558 2020-03-31 18:00:00 26.4
df1 = df.groupby(pd.Grouper(key='datetime', freq='D')).mean()
Temperature
datetime
1987-11-01 24.600000
1987-11-02 25.066667
1987-11-03 24.466667
1987-11-04 22.533333
1987-11-05 25.066667
...
2020-03-27 26.533333
2020-03-28 27.666667
2020-03-29 27.733333
2020-03-30 24.266667
2020-03-31 24.800000
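One caveat, as a sketch: a daily Grouper emits a row for every calendar day in the span, so days with no readings come out as NaN; they can be dropped afterwards:
# keep only days that actually had measurements
df1 = df.groupby(pd.Grouper(key='datetime', freq='D'))['Temperature'].mean().dropna()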