3-day and 7-day moving average - Python

I have data indexed by time (e.g. 2023-01-01 00:15:00+00:00) and I want to calculate 3-day and 7-day averages. For example, to predict what my value will be on 2023-01-10 00:15:00+00:00, I need the average of the values I have at 2023-01-09 00:15:00+00:00, 2023-01-08 00:15:00+00:00 and 2023-01-07 00:15:00+00:00 (for the three-day average). I want to do this for all the values. How can I write the Python code that does this? Part of the data is shown in the picture, which shows how the dataframe looks.
I have tried some indexing using the date, working from the notion that I need 24-hour lookbacks, but to no avail.
Here is the code I wrote so far (in it I tried to see if I could print the value I have 24 hours before 2023-01-22 23:30:00+00:00):
import datetime
import pandas as pd
data = pd.read_csv("output_saldo.csv")
print(data)
target_timestamp = datetime.datetime.strptime('2023-01-22 23:30:00+00:00', '%Y-%m-%d %H:%M:%S+00:00')
target_timestamp_24h = target_timestamp - datetime.timedelta(hours=24)
print(data.loc[pd.DatetimeIndex([target_timestamp_24h])])
print(target_timestamp_24h)
Error: No axis named 2023-01-21 23:30:00 for object type DataFrame
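One way to get these lookback averages is to parse the timestamps into the index and shift the value column by whole days, so each row lines up with the value recorded 1, 2, ... days earlier at the same time of day. This is only a sketch: the column names "timestamp" and "value", and a unique index at regular 15-minute spacing, are assumptions, since the dataframe is only shown as a picture.
import pandas as pd
data = pd.read_csv("output_saldo.csv", parse_dates=["timestamp"], index_col="timestamp")
# shift(freq="1D") moves each observation's index forward by one day, so the
# value recorded d days before timestamp T lines up with T after the shift
lags = [data["value"].shift(freq=f"{d}D") for d in range(1, 8)]
data["avg_3d"] = sum(lags[:3]) / 3  # mean of the values 1, 2 and 3 days earlier
data["avg_7d"] = sum(lags) / 7      # mean of the values 1 to 7 days earlier
Rows near the start of the series come out as NaN, since there are not yet enough earlier days to average over.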

Related

How to get difference in days between two dates from a csv in python?

So I'm trying to get the number of days between two dates ("today" and a prior date) from a csv file. I'm using the datetime library; I can get the delta (the difference between the two dates), but I can't get the number of days. If I try to use .days it returns an error stating it cannot convert a Series to datetime.
I just want to get the number of days as an int or float to use in some code I'm working on. Can anyone give me some direction? I've tried everything and done a lot of prior research here on Stack Overflow and other websites.
Here's the code below.
import pandas as pd
from datetime import datetime
df = pd.read_csv("serv.csv",parse_dates=["PasswordLastSet"])
for row in df:
    d1 = pd.to_datetime(datetime.today())
    d2 = df["PasswordLastSet"]
    delta = d1 - d2
    days = delta.datetime.days
I tried some different libraries, tried converting the delta to datetime, and tried turning it into a string and filtering that string, but no success. I just want the number of days, like "10" or "60", so I can do some math with it later on.
Convert PasswordLastSet to datetime and subtract it from today:
df['delta']=(datetime.today() - pd.to_datetime(df['PasswordLastSet'])).dt.days
df

    PasswordLastSet  delta
0  2019-09-10 11:11   1142
1  2019-10-07 11:13   1115
2  2019-11-04 11:16   1087
3  2019-11-28 11:20   1063
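A minimal self-contained check of that one-liner (the sample values below are made up for illustration):
import pandas as pd
from datetime import datetime
df = pd.DataFrame({"PasswordLastSet": ["2019-09-10 11:11", "2019-10-07 11:13"]})
# .dt.days truncates each timedelta to whole days, giving plain integers
df["delta"] = (datetime.today() - pd.to_datetime(df["PasswordLastSet"])).dt.days
print(df)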

Combine date present in two different columns to generate mean for a column

I have a dataset in the following format, where the Starting column values range from 2021-01-01 to 2022-03-13, and the same goes for the Ending column, whose values also run from 2021-01-01 to 2022-03-13.
The rainfall data is collected on a daily basis, so the entries look as follows:
I am trying to combine these into monthly average values. I cannot find a way to take the monthly averages and store them in a different pandas dataframe so that it appears as follows:
Monthly Rainfall is computed as total rainfall / total days in the month.
Any help would be appreciated!
I have tried to use groupby and mean together from the pandas library to find the output, but it doesn't appear in the format I want.
df=df.groupby(['Starting','Ending','Location_id'])['rainfall'].mean().reset_index()
To solve the problem, you can write a function like this:
import math
from datetime import datetime
def to_date(x, y):
    lists = zip(
        [datetime.strptime(dt, '%Y-%m-%d').date() for dt in x],
        [datetime.strptime(dt, '%Y-%m-%d').date() for dt in y],
    )
    return [0 if math.isinf((x - y).days) else (x - y).days for x, y in lists]
Basically, this function takes two lists (x, y), turns every item in each into a date() object, and returns a new list with the number of days between each pair. The math.isinf check is meant to handle the case where a start and end date are identical; since (x - y).days is a plain integer, that case simply yields 0 days.
Here's the code snippet I wrote; since you didn't provide a dataset, I built one from the images you provided:
import pandas as pd
d = {
    'New_Starting': ['2021-01-01', '2021-01-01', '2021-01-01'],
    'New_Ending': ['2021-01-31', '2021-01-31', '2021-01-31'],
    'Location_id': [45, 52, 30],
    'Rainfall': [4.07, 6.53, 3.71]
}
d = pd.DataFrame(d)
d['Monthly_Rainfall'] = d['Rainfall'] / to_date(d['New_Ending'], d['New_Starting'])
Output:
  New_Starting  New_Ending  Location_id  Rainfall  Monthly_Rainfall
0   2021-01-01  2021-01-31           45      4.07          0.135667
1   2021-01-01  2021-01-31           52      6.53          0.217667
2   2021-01-01  2021-01-31           30      3.71          0.123667
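If the underlying data really is daily, the groupby route the question mentions can also be made to work with a pandas Grouper. This is only a sketch with made-up values; the column names "Starting", "Location_id" and "rainfall" are taken from the question and may need adapting to the real layout:
import pandas as pd
df = pd.DataFrame({
    "Starting": pd.date_range("2021-01-01", periods=90, freq="D"),
    "Location_id": 45,
    "rainfall": 1.0,
})
# the mean of the daily values per calendar month equals
# total rainfall divided by the number of days in that month
monthly = (
    df.groupby([pd.Grouper(key="Starting", freq="MS"), "Location_id"])["rainfall"]
      .mean()
      .reset_index()
)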

How to convert to dates after removing the seasonality from the time series in python?

The question can be reframed as "How to remove daily seasonality from the dataset in python?"
Please read the following:
I have a time series and have used seasonal_decompose() from statsmodels to remove the seasonality from the series. Since I used seasonal_decompose() on monthly data, I get the seasonality only in months. How do I convert these months into days/dates? Can I use seasonal_decompose() to remove daily seasonality? I tried keeping frequency=365, but it raises the following error:
x must have 2 complete cycles requires 730 observations. x only has 24 observation(s)
Snippet of the code:
grp_month = train.append(test).groupby(data['Month']).sum()['Var1']
season_result = seasonal_decompose(grp_month, model='addition', period=12)
This gives me the output:
Month
Out
2018-01-01
-17707.340278
2018-02-01
-49501.548611
2018-03-01
-28172.590278
..
..
..
..
2019-12-01
-13296.173611
As you can see in the table, implementing seasonal_decompose() gives me the monthly seasonality. Is there any way I can get daily data from this? Or can I convert it into a date-wise series?
Edit:
I tried to remove daily seasonality as follows but I'm not really sure if this is the way to go.
period = 365
season_mean = data.groupby(data.index % period).transform('mean')
data -= season_mean
print(data.head())
If you want to subtract these values from a daily DataFrame, you should upsample the season_result DataFrame using pandas.DataFrame.resample; that way you will be able to subtract the monthly seasonality from your original series.
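A minimal sketch of that upsampling step, assuming the monthly component is the seasonal attribute returned by seasonal_decompose (the values below are made up, and daily_data stands in for your original daily series):
import pandas as pd
monthly_seasonal = pd.Series(
    [-17707.34, -49501.55, -28172.59],
    index=pd.to_datetime(['2018-01-01', '2018-02-01', '2018-03-01']),
)
# upsample to daily frequency and forward-fill, so every day in a month
# carries that month's seasonal value (the last month is only covered up to
# its first day here; extend the index if you need it filled to month end)
daily_seasonal = monthly_seasonal.resample('D').ffill()
# deseasonalised = daily_data - daily_seasonal  # aligns on the DatetimeIndex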

Date difference: different results in Excel vs. Python

I have a pandas dataframe with two date columns that carry timestamps (I want to keep the timestamps).
I want to get the difference in days between those two dates. I used the line below, and it works just fine.
mergethetwo['diff_days']=(mergethetwo['todaydate']-mergethetwo['LastLogon']).dt.days
The doubt is, when I computed the difference between those two dates in Excel, it gave me a different number.
In Python, for example, the difference between
5/15/2020 1:48:00 PM (LastLogon) and 6/21/2020 12:00:00 AM (todaydate) is 36.
However, in Excel, using
=DATEDIF(LastLogon,todaydate,"d")
5/15/2020 1:48:00 PM and 6/21/2020 12:00:00 AM give 37 days!
Why the difference? Which one should I trust? As I have 30,000+ rows, I can't go through all of them to confirm.
Appreciate your support.
Thank you
Excel DATEDIF with "d" seems to count "started" days (dates, as the name of the function says), whilst the Python timedelta gives the actual elapsed time, 36.425 days:
import pandas as pd
td = pd.to_datetime("6/21/2020 12:00:00 AM")-pd.to_datetime("5/15/2020 1:48:00 PM")
# Timedelta('36 days 10:12:00')
td.days
# 36
td.total_seconds() / 86400
# 36.425
You will get the same result if you do todaydate-LastLogon in Excel, without using any function.
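If it is the Excel-style count of calendar dates you want to reproduce in pandas, one possible sketch (not part of the original answer) is to drop the time of day before subtracting:
import pandas as pd
last_logon = pd.to_datetime("5/15/2020 1:48:00 PM")
today_date = pd.to_datetime("6/21/2020 12:00:00 AM")
# normalize() sets the time to midnight, so only the calendar dates are compared,
# matching what DATEDIF(...,"d") counts in this example
(today_date.normalize() - last_logon.normalize()).days
# 37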

Pandas read and parse Excel data that shows as a datetime, but shouldn't be a datetime

I have a system I am reading from that implemented its time-tracking function in a pretty poor way: it shows the tracked working time as [hh]:mm in the cell. This is problematic when reading the data, because when you click that cell the formula bar shows 11:00:00 PM, but what that 23:00 actually represents is 23 hours of time spent, not 11 PM. So whenever the time is 24:00 or more you end up with 1/1/1900 12:00:00 AM and on up (25:00 = 1/1/1900 01:00:00 AM).
So pandas picks up the 11:00:00 PM or the 1/1/1900 01:00:00 AM when it comes into the dataframe. I am at a loss as to how I would turn this back into an int and get the number of hours as a whole number: 24, 25, 32, etc.
Can anyone help me figure out how to turn this horribly formatted data into the number of hours in int format?
If you want 1/1/1900 01:00:00 AM to represent 25 hours of elapsed time then this tells me your reference timestamp is 12/31/1899 00:00:00. Try the following:
time_delta = pd.Timestamp('1/1/1900 01:00:00 AM') - pd.Timestamp('12/31/1899 00:00:00')
# returns Timedelta('1 days 01:00:00')
You can get the total number of seconds by using the Timedelta.total_seconds() method:
time_delta.total_seconds()
# returns 90000.0
and then you could get the number of hours with
time_delta.total_seconds() / 3600.0
# returns 25.0
So try subtracting pd.Timestamp('12/31/1899 00:00:00') from your DatetimeIndex based on the year 1900 to get a TimedeltaIndex. You can then leave your TimedeltaIndex as is or convert it to a Float64Index with TimedeltaIndex.total_seconds().
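A small sketch of that subtraction applied to a whole column (the values below are made up, and the variable names are assumptions):
import pandas as pd
s = pd.to_datetime(pd.Series(['1900-01-01 01:00:00', '1900-01-01 08:00:00']))
# subtract the 12/31/1899 reference to recover the elapsed time, then take whole hours
hours = ((s - pd.Timestamp('1899-12-31 00:00:00')).dt.total_seconds() / 3600).astype(int)
# 0    25
# 1    32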
pandas is not at fault; it's Excel that is interpreting the data wrongly.
Set that column to text so it won't be interpreted as a date,
then save the file, open it through pandas, and it should work fine.
Otherwise, export as CSV and try to open that in pandas.
Here is where I ended up, and it does work:
for i in range(len(df['Planned working time'])):
    pwt = df['Planned working time'][i]
    if len(str(df['Planned working time'][i]).split(' ')) > 1:
        if str(str(pwt).split(' ')[0]).split('-')[0] == '1900':
            workint = int(24) * int(str(str(pwt).split(' ')[0]).split('-')[2]) + int(str(pwt).split(' ')[1].split(':')[0])
    elif len(str(pwt).split(' ')) == 1:
        if str(str(pwt).split(' ')[0]).split('-')[0] != '1900':
            workint = int(str(pwt).split(' ')[0].split(':')[0])
    df.set_value(i, 'Planned working time', workint)
Any suggested improvements are welcome, but this results in the correct int values in all cases; tested on over 14K rows of data. It would likely have to be refined if there were minutes, but no cases with minutes show up in the data, and the UI on the front end doesn't appear to actually allow minutes.
