We're working with Python on an Ubuntu 18.04 server, and are storing real-time data from temperature sensors in a MySQL database. The database is installed on our server.
What we want to do is increment a timestamp and retrieve the latest value within each 20-minute interval, i.e. every 20 minutes we retrieve the latest temperature value from the sensor out of the MySQL database. We only want the intervals to start at minutes :00, :20 and :40.
Example of the incrementing
2019-07-26 00:00:00
2019-07-26 00:20:00
2019-07-26 00:40:00
2019-07-26 01:00:00
...
2019-07-26 23:40:00
2019-07-27 00:00:00
...
2019-07-31 23:40:00
2019-08-01 00:00:00
This is the basic idea of what we want to achieve, but we know this is a very bad way of coding it. We want more dynamic code. We're imagining that there's a function, perhaps, or some other way we haven't thought about. This is what the basic idea looks like:
for x in range(0, 24, 1):
    for y in range(0, 60, 20):
        a = pd.read_sql('SELECT temperature1m FROM Weather_Station WHERE timestamp > "2019-07-26 %d:%d:00" AND timestamp < "2019-07-26 %d:%d:00" ORDER BY timestamp DESC LIMIT 1' % (x, y, x, y+20), conn).astype(float).values
From our database we can retrieve the first and last timestamps for our sensor.
lastLpnTime = pd.read_sql('SELECT MAX(timestamp) FROM Raw_Data WHERE topic = "lpn1"', conn).astype(str).values
firstLpnTime = pd.read_sql('SELECT MIN(timestamp) FROM Raw_Data WHERE topic = "lpn1"', conn).astype(str).values
Therefore we imagine that we can say:
From firstLpnTime to lastLpnTime, in 20-minute steps aligned to :00, :20 and :40, retrieve the data from the MySQL database.
But how do we do this?
If you load the data into a pandas DataFrame you can resample it into the desired time periods using pd.resample.
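For illustration, a minimal sketch of that approach, assuming the Weather_Station table and the conn connection from the question (column names follow the query shown above):
import pandas as pd
# load the raw sensor data with a datetime index
raw = pd.read_sql('SELECT timestamp, temperature1m FROM Weather_Station ORDER BY timestamp',
                  conn, parse_dates=['timestamp'], index_col='timestamp')
# take the last reading in every 20-minute bin; bins start at :00, :20 and :40 by default
latest_per_20min = raw['temperature1m'].resample('20T').last()
print(latest_per_20min)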
If you want to increment your timestamp you can do something like this:
from datetime import datetime, timedelta

your_start_date = '2019-07-26 00:00:00'
date = datetime.strptime(your_start_date, '%Y-%m-%d %H:%M:%S')
increment = timedelta(minutes=20)  # step between samples
for i in range(10):
    print(date.strftime('%Y-%m-%d %H:%M:%S'))
    date += increment
output:
# 2019-07-26 00:00:00
# 2019-07-26 00:20:00
# 2019-07-26 00:40:00
# 2019-07-26 01:00:00
# 2019-07-26 01:20:00
# 2019-07-26 01:40:00
# 2019-07-26 02:00:00
# 2019-07-26 02:20:00
# 2019-07-26 02:40:00
# 2019-07-26 03:00:00
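To tie this back to the original question, here is a sketch of how such an incrementing timestamp could drive the query between firstLpnTime and lastLpnTime. It assumes the conn connection and Weather_Station table from the question, that the two boundary timestamps have already been parsed into datetime objects first_dt and last_dt, and that the MySQL driver accepts %s placeholders:
from datetime import timedelta
import pandas as pd

step = timedelta(minutes=20)
# round the first timestamp down to the previous :00/:20/:40 boundary
current = first_dt - timedelta(minutes=first_dt.minute % 20,
                               seconds=first_dt.second,
                               microseconds=first_dt.microsecond)
results = []
while current <= last_dt:
    row = pd.read_sql('SELECT temperature1m FROM Weather_Station '
                      'WHERE timestamp >= %s AND timestamp < %s '
                      'ORDER BY timestamp DESC LIMIT 1',
                      conn, params=[current, current + step])
    if not row.empty:
        results.append((current, float(row.iloc[0, 0])))
    current += step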
Related
I have a time series dataset in a pandas DataFrame df, and I am trying to add a new value to the bottom of the df and then increment the timestamp, which is the df index.
For example, I can add the new value to the bottom of the df like this:
testday.loc[len(testday.index)] = testday_predict[0]
print(testday)
Which seems to work, but the new index label is just an incremented integer rather than a timestamp:
kW
Date
2022-07-29 00:00:00 39.052800
2022-07-29 00:15:00 38.361600
2022-07-29 00:30:00 38.361600
2022-07-29 00:45:00 38.534400
2022-07-29 01:00:00 38.880000
... ...
2022-07-29 23:00:00 36.806400
2022-07-29 23:15:00 36.806400
2022-07-29 23:30:00 36.633600
2022-07-29 23:45:00 36.806400
96                   44.482361  <---- my predicted value added at the bottom; good, except for the index value of 96
The value of 96 is just the next integer position in df.index; hopefully this makes sense.
If I try:
from datetime import timedelta
last_index_stamp = testday.last_valid_index()
print(last_index_stamp)
This returns:
Timestamp('2022-07-29 23:45:00')
And then I can add 15 minutes to this Timestamp (my data is 15 minute data) like this:
new_timestamp = last_index_stamp + timedelta(minutes=15)
print(new_timestamp)
Which returns what I am looking for, instead of the value of 96:
Timestamp('2022-07-30 00:00:00')
But how do I replace the value of 96 with new_timestamp? If I try:
testday.index[-1:] = new_timestamp
This will error out:
TypeError: Index does not support mutable operations
Any tips greatly appreciated...
This should do the trick:
testday.loc[new_timestamp,:] = testday_predict[0]
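As a usage note (a sketch, reusing the variables from the question): .loc with a label that is not yet in the index appends a new row, so you can drop the stray integer-labelled row from the earlier attempt and append the prediction under the correct timestamp:
from datetime import timedelta
# drop the row that was labelled 96 by the earlier attempt, if it is still present
testday = testday.drop(index=96, errors='ignore')
# append the predicted value under the next 15-minute timestamp
new_timestamp = testday.last_valid_index() + timedelta(minutes=15)
testday.loc[new_timestamp, :] = testday_predict[0]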
I am currently having issues with the date-time format, particularly converting string input to the correct Python datetime format.
Date/Time Dry_Temp[C] Wet_Temp[C] Solar_Diffuse_Rate[[W/m2]] \
0 01/01 00:10:00 8.45 8.237306 0.0
1 01/01 00:20:00 7.30 6.968360 0.0
2 01/01 00:30:00 6.15 5.710239 0.0
3 01/01 00:40:00 5.00 4.462898 0.0
4 01/01 00:50:00 3.85 3.226244 0.0
These are examples of the timestamps I currently have in my data. I have tried splitting date and time, so that I now have the following columns:
WC_Humidity[%] WC_Htgsetp[C] WC_Clgsetp[C] Date Time
0 55.553640 18 26 1900-01-01 00:10:00
1 54.204342 18 26 1900-01-01 00:20:00
2 51.896272 18 26 1900-01-01 00:30:00
3 49.007770 18 26 1900-01-01 00:40:00
4 45.825810 18 26 1900-01-01 00:50:00
I have managed to get the year into datetime format, but there are still 2 problems to resolve:
the data was not recorded in 1900, so I would like to change the year in the Date,
I get the following error when trying to convert the time into Python's datetime format:
pandas/_libs/tslibs/strptime.pyx in pandas._libs.tslibs.strptime.array_strptime()
ValueError: time data '00:00:00' does not match format ' %m/%d %H:%M:%S' (match)
I tried having 24:00:00, however, python didn't like that either...
preferences:
I would prefer if they were both in the same cell without having to split this information into two columns.
I would also like to get rid of the seconds data as the data was recorded in 10 min intervals so there is no need for seconds in my case.
Any help would be greatly appreciated.
the data was not recorded in 1900, so I would like to change the year in the Date,
The datetime.datetime.replace method of a datetime.datetime instance is used for this task; consider the following example:
import pandas as pd
df = pd.DataFrame({"when":pd.to_datetime(["1900-01-01","1900-02-02","1900-03-03"])})
df["when"] = df["when"].apply(lambda x:x.replace(year=2000))
print(df)
output
when
0 2000-01-01
1 2000-02-02
2 2000-03-03
Note that it can also be used without pandas, for example:
import datetime
d = datetime.datetime.strptime("","") # use all default values which result in midnight of Jan 1 of year 1900
print(d) # 1900-01-01 00:00:00
d = d.replace(year=2000)
print(d) # 2000-01-01 00:00:00
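Regarding the other preferences (keeping date and time in a single column and dropping the seconds), a minimal sketch, assuming the original 'MM/DD HH:MM:SS' strings and that the real recording year is, say, 2021 (the year itself is an assumption here):
import pandas as pd
# hypothetical frame holding the original combined strings
df = pd.DataFrame({"Date/Time": ["01/01 00:10:00", "01/01 00:20:00", "01/01 00:30:00"]})
# parse the combined column; the missing year defaults to 1900
stamps = pd.to_datetime(df["Date/Time"], format="%m/%d %H:%M:%S")
# swap in the actual recording year and drop the seconds
df["DateTime"] = stamps.apply(lambda x: x.replace(year=2021, second=0))
print(df)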
I have to compute a daily sum on a dataframe, but only if at least 70% of the daily data is not NaN. If it is not, that day must not be taken into account. Is there a way to create such a mask? My dataframe contains more than 17 years of hourly data.
my data is something like this:
clear skies all skies Lab
2015-02-26 13:00:00 597.5259 376.1830 307.62
2015-02-26 14:00:00 461.2014 244.0453 199.94
2015-02-26 15:00:00 283.9003 166.5772 107.84
2015-02-26 16:00:00 93.5099 50.7761 23.27
2015-02-26 17:00:00 1.1559 0.2784 0.91
... ... ...
2015-12-05 07:00:00 95.0285 29.1006 45.23
2015-12-05 08:00:00 241.8822 120.1049 113.41
2015-12-05 09:00:00 363.8040 196.0568 244.78
2015-12-05 10:00:00 438.2264 274.3733 461.28
2015-12-05 11:00:00 456.3396 330.6650 447.15
If I groupby and aggregate, there is no way to know whether a given day had missing data; such days will have lower sums, which in turn lowers my monthly means.
As said in the comments, use groupby to group the data by date and then write an appropriate selection. This is an example that would sum all days (assuming regular data points, 24 per day) with less than 50% of nan entries:
import pandas as pd
import numpy as np
# create a date range
date_rng = pd.date_range(start='1/1/2018', end='1/1/2021', freq='H')
# create random data
df = pd.DataFrame({"data":np.random.randint(0,100,size=(len(date_rng)))}, index = date_rng)
# set some values to nan
df["data"][df["data"] > 50] = np.nan
# looks like this
df.head(20)
# sum everything where less than 50% are nan
df.groupby(df.index.date).sum()[df.isna().groupby(df.index.date).sum() < 12]
Example output:
data
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 487.0
2018-01-04 NaN
2018-01-05 421.0
... ...
2020-12-28 NaN
2020-12-29 NaN
2020-12-30 NaN
2020-12-31 392.0
2021-01-01 0.0
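To apply the question's 70% threshold rather than the 50% used above, one possible sketch, continuing with the df built in the example:
# count valid (non-NaN) samples per day and sum the values per day
valid_per_day = df["data"].groupby(df.index.date).count()
daily_sum = df["data"].groupby(df.index.date).sum()
# mask out days where less than 70% of the 24 expected hourly samples are present
daily_sum[valid_per_day / 24 < 0.7] = np.nan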
An alternative solution - you may find it useful & flexible:
# pip install convtools
from convtools import conversion as c
total_number = c.ReduceFuncs.Count()
total_not_none = c.ReduceFuncs.Count(where=c.item("amount").is_not(None))
total_sum = c.ReduceFuncs.Sum(c.item("amount"))
input_data = [] # e.g. iterable of dicts
converter = (
    c.group_by(
        c.item("key1"),
        c.item("key2"),
    )
    .aggregate(
        {
            "key1": c.item("key1"),
            "key2": c.item("key2"),
            "sum_if_70": c.if_(
                total_not_none / total_number < 0.7,
                None,
                total_sum,
            ),
        }
    )
    .gen_converter(
        debug=False
    )  # install black and set to True to see the generated ad-hoc code
)
result = converter(input_data)
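For instance, with a small hypothetical input, each (key1, key2) group gets its sum only when at least 70% of its "amount" values are not None (the output follows the aggregate spec above, i.e. a list of dicts):
input_data = [
    {"key1": "2015-02-26", "key2": "lab", "amount": 307.62},
    {"key1": "2015-02-26", "key2": "lab", "amount": None},
    {"key1": "2015-02-27", "key2": "lab", "amount": 199.94},
]
print(converter(input_data))
# the 2015-02-26 group has only 50% valid values, so its "sum_if_70" is None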
I have a time series dataframe with dates|weather information that looks like this:
2017-01-01 5
2017-01-02 10
.
.
2017-12-31 6
I am trying to upsample it to hourly data using the following:
weather.resample('H').pad()
I expected to see 8760 entries for 24 intervals * 365 days. However, it only returns 8737, with the last 23 intervals for the 31st of December missing. Is there something special I need to do to get 24 intervals for the last day?
Thanks in advance.
Pandas normalizes 2017-12-31 to 2017-12-31 00:00 and then creates a range that ends in that last datetime... I would include a last row before resampling with
df.loc['2018-01-01'] = 0
Edit:
You can get the result you want with numpy.repeat
Take this df
np.random.seed(1)
weather = pd.DataFrame(index=pd.date_range('2017-01-01', '2017-12-31'),
                       data={'WEATHER_MAX': np.random.random(365)*15})
WEATHER_MAX
2017-01-01 6.255330
2017-01-02 10.804867
2017-01-03 0.001716
2017-01-04 4.534989
2017-01-05 2.201338
... ...
2017-12-27 4.503725
2017-12-28 2.145087
2017-12-29 13.519627
2017-12-30 8.123391
2017-12-31 14.621106
[365 rows x 1 columns]
By repeating on axis=1 you can then transform the default range(24) column names to hourly timediffs
# repeat, then stack
hourly = pd.DataFrame(np.repeat(weather.values, 24, axis=1),
                      index=weather.index).stack()
# combine date and hour
hourly.index = (
    hourly.index.get_level_values(0) +
    pd.to_timedelta(hourly.index.get_level_values(1), unit='h')
)
hourly = hourly.rename('WEATHER_MAX').to_frame()
Output
WEATHER_MAX
2017-01-01 00:00:00 6.255330
2017-01-01 01:00:00 6.255330
2017-01-01 02:00:00 6.255330
2017-01-01 03:00:00 6.255330
2017-01-01 04:00:00 6.255330
... ...
2017-12-31 19:00:00 14.621106
2017-12-31 20:00:00 14.621106
2017-12-31 21:00:00 14.621106
2017-12-31 22:00:00 14.621106
2017-12-31 23:00:00 14.621106
[8760 rows x 1 columns]
What to do and the reason are the same as #RichieV's answer.
However, the value used should not be 0 or some other meaningless value; it is necessary to use valid data actually measured on 2018-01-01.
This is because using a meaningless value reduces the effectiveness of the resampled 2017-12-31 data and the results derived using that data.
Prepare a valid value for 2018-01-01 at the end of the data.
Call resample.
Delete the data of 2018-01-01 after resample.
You will get 8760 rows of data for 2017.
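A minimal sketch of those three steps, reusing the weather frame from above; the value placed at 2018-01-01 00:00 is a stand-in for the real measurement from that day (hypothetical variable):
# 1. append a valid value for 2018-01-01 at the end of the data
weather.loc[pd.Timestamp('2018-01-01')] = real_value_measured_on_jan_1  # hypothetical
# 2. resample to hourly
hourly = weather.resample('H').pad()
# 3. drop the 2018-01-01 row again, keeping the 8760 rows for 2017
hourly = hourly.loc[hourly.index < '2018-01-01']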
Look at #RichieV's modified answer:
I was misunderstanding the question.
My answer was about complementing resample with interpolate, etc.:
Extrapolation (data interpolation) using resample
If using the same value as 00:00 of that day for the rest of the day is acceptable, that is a different way of thinking about it.
This is a small subset of my data:
heartrate
2018-01-01 00:00:00 67.0
2018-01-01 00:01:00 55.0
2018-01-01 00:02:00 60.0
2018-01-01 00:03:00 67.0
2018-01-01 00:04:00 72.0
2018-01-01 00:05:00 53.0
2018-01-01 00:06:00 62.0
2018-01-01 00:07:00 59.0
2018-01-01 00:08:00 117.0
2018-01-01 00:09:00 62.0
2018-01-01 00:10:00 65.0
2018-01-01 00:11:00 70.0
2018-01-01 00:12:00 49.0
2018-01-01 00:13:00 59.0
This data is a collection of daily heart rates from patients. I am trying to see if, based on their heart rate, I can find the time window during which they are asleep.
I am not sure how to write code that can identify the time window when the patient is asleep, because every few minutes there is a spike in the data. For example, in the data provided, from 2018-01-01 00:07:00 to 2018-01-01 00:08:00 the heart rate jumps from 59 to 117. Can anyone suggest a way around this, and a way to find the time window when the heart rate is below the mean for a few hours?
As mentioned in your comments, you can compute a rolling mean to smooth your signal using:
patient_data_df['rollingmeanVal'] = patient_data_df.rolling('3T').heartrate.mean()
Assuming you are using a dataframe and want to identify rows that have a HR below or equal to the mean, you can use:
HR_mean = patient_data_df['rollingmeanVal'].mean()
selected_data_df = patient_data_df[patient_data_df['rollingmeanVal'] <= HR_mean]
Then, instead of dealing with the dataframe as a time series, you can reset the index and generate a column called index holding the datetimes as values. Now that you have a dataframe with all values below the mean, you can split it into groups wherever there is more than a 30-minute gap between consecutive rows. This assumes that fluctuating data for up to 30 minutes is acceptable.
Assuming that the group with the most data is when the patient is asleep, you can identify that group. Using the first and last date of this group, you can then identify the time window that the patient is asleep.
Reset the index, adding a new col called index with the time-series data:
selected_data_df.reset_index(inplace=True)
Group by:
# assign a new group id whenever there is a gap of more than 30 minutes
selected_data_df['grp'] = selected_data_df['index'].diff().dt.seconds.ge(30 * 60).cumsum()
# pick the group with the most rows, assumed to be the sleep window
sleep_grp = selected_data_df.groupby('grp').count().sort_values('index', ascending=False).head(1)
sleep_grp_index = sleep_grp.index.values[0]
sleep_df = selected_data_df[selected_data_df['grp'] == sleep_grp_index].drop('grp', axis=1)
Start of sleep time:
sleep_df['index'].iloc[0]
End of sleep time:
sleep_df['index'].iloc[-1]
You may use the Run Length Encoding (rle) function from base R to solve your problem. In step 1 you calculate the rolling mean of your patient's heart rate; you may use your solution or any other. Afterwards you add a logical flag to your data frame, e.g. patient['lowerVal'] = patient['heartrate'] < patient['rollingmeanVal']. Then apply the rle function to that variable lowerVal. As the return value you get the lengths of the runs below and above the mean. By applying cumsum to the lengths, you get the locations of your sleeping time frames.
Sorry, this is Python. Therefore, you may use a Python version of Run Length Encoding, as sketched below.
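A minimal sketch of that idea with NumPy; the rle helper is written here for illustration, and the patient DataFrame with heartrate and rollingmeanVal columns is assumed from the previous answer:
import numpy as np

def rle(values):
    # return (run_lengths, run_values) for a 1-D array-like sequence
    arr = np.asarray(values)
    change_points = np.flatnonzero(arr[1:] != arr[:-1]) + 1
    starts = np.concatenate(([0], change_points))
    lengths = np.diff(np.concatenate((starts, [len(arr)])))
    return lengths, arr[starts]

# flag samples below the rolling mean, then run-length encode the flag
patient['lowerVal'] = patient['heartrate'] < patient['rollingmeanVal']
lengths, below = rle(patient['lowerVal'].to_numpy())
# cumulative sums give the run boundaries; the longest run of True is the sleep candidate
ends = np.cumsum(lengths)
starts = ends - lengths
longest = np.argmax(np.where(below, lengths, 0))
print(patient.index[starts[longest]], patient.index[ends[longest] - 1])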