Weighted average of datetimes is off, but only for certain months - python

I am calculating the weighted average of a series of datetimes (and must be doing it wrong, since I can't explain the following):
import pandas as pd
import numpy as np
foo = pd.DataFrame({'date': ['2022-06-01', '2022-06-16'],
                    'value': [1000, 10000]})
foo['date'] = pd.to_datetime(foo['date'])
bar = np.average(foo['date'].view(dtype='float64'), weights=foo['value'])
print(np.array(bar).view(dtype='datetime64[ns]'))
returns
2022-06-14T15:16:21.818181818, which is expected.
Changing the month to July:
foo = pd.DataFrame({'date': ['2022-07-01', '2022-07-16'],
                    'value': [1000, 10000]})
foo['date'] = pd.to_datetime(foo['date'])
bar = np.average(foo['date'].view(dtype='float64'), weights=foo['value'])
print(np.array(bar).view(dtype='datetime64[ns]'))
returns 2022-07-14T23:59:53.766924660,
when the expected result is
2022-07-14T15:16:21.818181818.
The expected result above was calculated in Excel.
What am I overlooking?
EDIT: Additional Detail
My real dataset is much larger and I'd like to use numpy if possible.
foo['date'] can be assumed to be dates with no time component, but the weighted average will have a time component.

I strongly suspect this is a resolution/rounding issue.
I'm assuming that, to average the dates, they are converted to numeric timestamps and then the result is converted back to a datetime object. But pandas works in nanoseconds, so the timestamp values (and even more so their products with the weights 1000 and 10000) exceed 2**52, i.e. exceed the mantissa capacity of 64-bit floats.
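A quick way to see that ceiling: at nanosecond scale, epoch values are so large that float64 can no longer distinguish neighbouring nanoseconds.
import numpy as np

ns = np.int64(1_656_633_600_000_000_000)      # 2022-07-01 00:00 UTC in nanoseconds since the epoch
print(np.float64(ns + 1) == np.float64(ns))   # True: adjacent nanoseconds collapse to the same float64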
By contrast, Excel works in milliseconds, so there is no problem there; Python's datetime.datetime works in microseconds, so still no problem:
from datetime import datetime

dt01 = datetime(2022, 7, 1)
dt16 = datetime(2022, 7, 16)
datetime.fromtimestamp((dt01.timestamp()*1000 + dt16.timestamp()*10000) / 11000)
# datetime.datetime(2022, 7, 14, 15, 16, 21, 818182)
So if you need to use numpy/pandas I suppose your best option is to convert dates to timedeltas from a "starting" date (i.e. define a "custom epoch") and compute the weighted average of those values.
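A minimal sketch of that custom-epoch idea, reusing the July frame from the question (the choice of epoch is arbitrary; any date near the data works):
import numpy as np
import pandas as pd

foo = pd.DataFrame({'date': ['2022-07-01', '2022-07-16'],
                    'value': [1000, 10000]})
foo['date'] = pd.to_datetime(foo['date'])

epoch = foo['date'].min()                            # custom epoch close to the data
offset_ns = (foo['date'] - epoch).astype('int64')    # small integers, exactly representable as float64
avg_offset = np.average(offset_ns, weights=foo['value'])
print(epoch + pd.to_timedelta(avg_offset, unit='ns'))
# 2022-07-14 15:16:21.818181818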

First of all, I don't think the problem is in your arithmetic; I think the view call is what misbehaves.
Series.view(dtype='float64') does not convert the nanosecond integers into floats, it reinterprets their raw bits as float64 (which is why the values come out absurdly small, around 1e-198), so I believe the average is computed on essentially meaningless numbers and resolution is lost.
I found a solution (I think it works, but it gave a 3-hour difference from your answer):
from datetime import datetime
import pandas as pd
import numpy as np
foo = pd.DataFrame({'date': ['2022-07-01', '2022-07-16'],
                    'value': [1000, 10000]})
foo['date'] = pd.to_datetime(foo['date'])
# bar = np.average([x.timestamp() for x in foo['date']], weights=foo['value'])
bar = np.average(foo['date'].apply(datetime.timestamp), weights=foo['value'])  # applied element-wise, not truly vectorized
print(datetime.fromtimestamp(bar))
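The 3-hour gap is most likely a timezone artifact: converting naive datetimes to POSIX timestamps and back involves an implicit timezone (UTC for pandas Timestamps, local time for datetime.fromtimestamp), and the two directions need not agree. Making the timezone explicit on the way back avoids the ambiguity (a sketch along the same lines):
from datetime import datetime, timezone
import numpy as np
import pandas as pd

foo = pd.DataFrame({'date': ['2022-07-01', '2022-07-16'],
                    'value': [1000, 10000]})
foo['date'] = pd.to_datetime(foo['date'])

# Timestamp.timestamp() treats naive values as UTC, so convert back in UTC as well
bar = np.average([x.timestamp() for x in foo['date']], weights=foo['value'])
print(datetime.fromtimestamp(bar, tz=timezone.utc))
# 2022-07-14 15:16:21.818182+00:00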

Related

How to calculate relative volume using pandas in a faster way?

I am trying to implement the RVOL by the time of day technical indicator, which can be used as the indication of market strength.
The logic behind this is as follows:
If the current time is 2022/3/19 13:00, we look through the same moment (13:00) at the previous N days and average all the previous volumes at that moment to calculate Average_volume_previous.
Then, RVOL(t) is volume(t)/Average_volume_previous(t).
I found it hard to express this complex logic with methods like rolling and apply, so I wrote the code below with a for loop.
However, the running time of the for loop is catastrophically long.
from datetime import datetime
import pandas as pd
import numpy as np
datetime_array = pd.date_range(datetime.strptime('2015-03-19 13:00:00', '%Y-%m-%d %H:%M:%S'),
                               datetime.strptime('2022-03-19 13:00:00', '%Y-%m-%d %H:%M:%S'),
                               freq='30min')
volume_array = pd.Series(np.random.uniform(1000, 10000, len(datetime_array)))
df = pd.DataFrame({'Date': datetime_array, 'Volume': volume_array})
df.set_index(['Date'], inplace=True)

day_len = 10  # lookback window in days (value assumed)
output = []
for idx in range(len(df)):
    date = str(df.index[idx].hour) + ':' + str(df.index[idx].minute)
    temp_date = df.iloc[:idx].between_time(date, date)
    output.append(temp_date.tail(day_len).mean().iloc[0])
output = np.array(output)
In practice, there may be missing data in the datetime array, so it is hard to use a fixed-length lookback period. Is there any way to make this code run faster?
I'm not sure I fully understand the requirement, but here is a solution as far as I do understand it.
I didn't set the date as the index, so skip the df.set_index(['Date'], inplace=True) step.
# Filter the data down to the moment of interest (13:00)
rolling_day = 10
hour = df['Date'].dt.hour == 13
minute = df['Date'].dt.minute == 0
df_moment = df[hour & minute].copy()

# Calculation of the moving average
df_moment['rolling'] = df_moment['Volume'].rolling(rolling_day).mean()

# Calculation of RVOL(t) = volume(t) / Average_volume_previous(t)
for idx_s, idx_e in zip(df_moment['Volume'][::rolling_day], df_moment['rolling'][rolling_day::rolling_day]):
    print(f'{idx_s/idx_e}')
Output:
0.566379345408499
0.7229214799940626
0.6753586759429548
2.0588617812341354
0.7494803741982076
1.2132554086225438
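If the row-by-row loop is the bottleneck, a groupby-based variant along these lines may scale better; this is only a sketch, assuming df keeps its DatetimeIndex and 'Volume' column from the question, with N as the lookback in days:
N = 10  # lookback in days (assumed)

# Group the bars by time of day, then average the previous N values in each group;
# shift(1) keeps the current bar out of its own average.
avg_prev = (
    df.groupby([df.index.hour, df.index.minute])['Volume']
      .transform(lambda s: s.shift(1).rolling(N, min_periods=1).mean())
)
rvol = df['Volume'] / avg_prev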

How to attribute repeated annual datetime values to a series of numbers in a dataframe

I have a data frame consisting of hourly wind speed measurements for the year 2012 for different locations.
I was able to change the index to datetime format for just one location using the code:
import datetime

dfn = df1_s.reset_index(drop=True)
dfn['datetime'] = pd.to_datetime(dfn.index,
                                 origin=pd.Timestamp('2012-01-01 00:00:00'),
                                 unit='H')
When using the same code over the entire dataframe, I obtain the error: cannot convert input with unit 'h'. This is probably because there is far more data than a single year of hours can represent, but I am not sure. Nevertheless, it works when I use minutes, i.e. unit='m'.
What I want to do is set the datetime so that it repeats itself after every 8784 hours, i.e. the same replicated datetime sequence for each location in the same dataframe (the expected result was produced in Excel).
When trying the following, all I obtained was a column of NaNs:
import random

dates = pd.date_range('2012-01-01', '2013-01-01', freq='H')
data = [int(1000 * random.random()) for i in range(len(dates))]
dfn['cum_data'] = pd.Series(data, index=dates)
Can you please direct me on how to go about this?
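A likely cause of the NaN column in the last snippet is index alignment: dfn has a default integer index while the assigned Series is indexed by timestamps, so no labels match. A small self-contained sketch of the effect and a positional workaround:
import pandas as pd

df = pd.DataFrame({'x': range(3)})                 # default RangeIndex 0..2
s = pd.Series([10, 20, 30],
              index=pd.date_range('2012-01-01', periods=3, freq='H'))

df['aligned'] = s                # labels don't match the RangeIndex, so every value is NaN
df['positional'] = s.to_numpy()  # raw values are assigned by position instead
print(df)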

Python, improving for loop performance

I have made a class called localSun. I've taken a simplified model of the Earth-Sun system and tried to compute the altitude angle of the sun for any location on Earth at any time. When I run the code for the current time and check it against timeanddate.com, it matches well, so it works.
But then I wanted to go through one year in 1-minute intervals and store all the altitude angles into a numpy array for a specific location.
Here's my very first naive attempt, which I'm fairly certain is not good for performance; I just wanted to test it anyway.
import numpy as np
from datetime import datetime
from datetime import date
from datetime import timedelta
...
...
altitudes = np.zeros(int(year/60))
m = datetime(2018, 5, 29, 15, 21, 0)
for i in range(0, len(altitudes)):
    n = m + timedelta(minutes=i + 1)
    nn = localSun(30, 0, n)
    altitudes[i] = nn.altitude()  # .altitude() is a method of localSun
altitudes is the array in which I want to store all the altitudes; its size is 525969, which is roughly the number of minutes in a year.
The localSun() object takes 3 parameters: colatitude (30 deg), longitude (0 deg) and a datetime object which has the time from a bit over an hour ago (when this is posted)
So the question is: what would be a good, efficient way of going through a year in 1-minute intervals and computing the altitude angle at each time? This seems rather slow. Should I use map to update the altitude values instead of a for loop? I presume I'll have to create a new localSun object each time too, and it's probably bad to keep creating the variables n and nn on every iteration.
We can assume all of localSun's methods work fine; I'm just asking whether there is an efficient way of going through a year in 1-minute intervals and updating the array with the altitudes. The code above should reveal enough information.
I might even want to do this at 1-second intervals later, so it would be great to know if there's an efficient way; with this code that takes very long.
This piece of code took about a minute to run on a university computer, which is quite fast as far as I know.
I'd greatly appreciate an answer. Thanks in advance!
Numpy has native datetime and timedelta support, so you could take an approach like this:
import datetime
import numpy as np

start = datetime.datetime(2018, 5, 29, 15, 21, 0)
end = datetime.datetime(2019, 5, 29, 15, 21, 0)
n = np.arange(start, end, dtype='datetime64[m]')  # [m] specifies the step as minutes
altitudes = np.vectorize(lambda x, y, z: localSun(x, y, z).altitude())(30, 0, n)
np.vectorize is not fast at all, but gets this working until you can modify 'localSun' to work with arrays of datetimes.
Since you are already using numpy you can go one step further with pandas. It has powerful date and time manipulation routines such as pd.date_range:
import pandas as pd
start = pd.Timestamp(year=2018, month=1, day=1)
stop = pd.Timestamp(year=2018, month=12, day=31)
dates = pd.date_range(start, stop, freq='min')
altitudes = localSun(30, 0, dates)
You would then need to adapt your localSun to work with an array of pd.Timestamp rather than a single datetime.datetime.
Changing from minutes to seconds would then be as simple as changing freq='min' to freq='S'.
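As a hint of what that adaptation might involve, the quantities a solar-altitude formula typically needs can be pulled out of the DatetimeIndex as plain numpy arrays; this is only a hypothetical sketch, since localSun's internals are not shown:
import numpy as np
import pandas as pd

dates = pd.date_range(pd.Timestamp(year=2018, month=1, day=1),
                      pd.Timestamp(year=2018, month=12, day=31), freq='min')

day_of_year = np.asarray(dates.dayofyear)                 # 1..365
hour_of_day = np.asarray(dates.hour + dates.minute / 60)  # fractional hours

# A formula written with numpy ufuncs over these arrays (np.sin, np.cos, ...)
# evaluates the altitude for every timestamp at once instead of once per minute.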

Compare two Pandas Series/DataFrames that are virtually equal

For a unittest I have to compare two pandas DataFrames (with one column, so they can also be cast to Series without losing information). The problem is that the index of one is of datetime type, the other date. For our purposes the information in the two is equal, since the time component of the datetime is not used.
To check if the two objects are equal for a unittest I could:
Extract the index of one of them and cast to date/datetime
Extract just the values of the one column, compare those and start and end dates
Am I missing any elegant way to compare the two?
Code example:
from datetime import date, datetime, timedelta
import pandas as pd
days_in_training = 40
start_date = date(2016, 12, 1)
dates = [start_date + timedelta(days=i) for i in range(days_in_training)]
actual = pd.DataFrame({'col1': range(days_in_training)}, index=dates)
start_datetime = datetime(2016, 12, 1)
datetimes = [start_datetime + timedelta(days=i) for i in range(days_in_training)]
expected = pd.DataFrame({'col1': range(days_in_training)}, index=datetimes)
assert(all(actual == expected))
Gives:
ValueError: Can only compare identically-labeled DataFrame objects
For future reference, through this blogpost (https://penandpants.com/2014/10/07/testing-with-numpy-and-pandas/) I found the function pandas.util.testing.assert_frame_equal() (https://github.com/pandas-dev/pandas/blob/29de89c1d961bea7aa030422b56b061c09255b96/pandas/util/testing.py#L621)
This function has some flexibility in what it tests for. In addition, it prints a summary of why the DataFrames are not considered equal, whereas assert(all(actual == expected)) only gives you True or False, which makes debugging harder.
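A sketch of how that check could look for the example above, after normalising the date-based index to datetimes (in current pandas the function is exposed as pandas.testing.assert_frame_equal):
import pandas as pd
from pandas.testing import assert_frame_equal

# Cast the date-based index to datetimes so both frames carry identical labels
actual_normalised = actual.copy()
actual_normalised.index = pd.to_datetime(actual_normalised.index)

assert_frame_equal(actual_normalised, expected)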

Day delta for dates >292 years apart

I am trying to obtain day deltas for a wide range of pandas dates. However, for time deltas of more than 292 years I obtain negative values. For example,
import pandas as pd
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='m'))
days_delta = (dates-dates.min()).astype('timedelta64[D]')
However, using a DatetimeIndex I can do it and it works as I want it to,
import pandas as pd
import numpy as np
dates = pd.date_range('1700-01-01', periods=4500, freq='m')
days_fun = np.vectorize(lambda x: x.days)
days_delta = days_fun(dates.date - dates.date.min())
The question then is how to obtain the correct days_delta for Series objects?
Read the pandas documentation specifically about timedelta limitations:
Pandas represents Timedeltas in nanosecond resolution using 64 bit integers. As such, the 64 bit integer limits determine the Timedelta limits.
Incidentally this is the same limitation the docs mentioned that is placed on Timestamps in Pandas:
Since pandas represents timestamps in nanosecond resolution, the timespan that can be represented using a 64-bit integer is limited to approximately 584 years
This would suggest that the same recommendations the docs make for circumventing the timestamp limitations can be applied to timedeltas. The solution to the timestamp limitations are found in the docs (here):
If you have data that is outside of the Timestamp bounds, see Timestamp limitations, then you can use a PeriodIndex and/or Series of Periods to do computations.
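A minimal sketch of that Period-based route for this example: converting to day-resolution Periods gives plain integer ordinals, which are not subject to the nanosecond limit.
import pandas as pd

dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='M'))

# Day-resolution Period ordinals are plain integers (days since a reference epoch)
ordinals = dates.dt.to_period('D').apply(lambda p: p.ordinal)
days_delta = ordinals - ordinals.min()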
Workaround
If you have continuous dates with small gaps that are individually representable, as in your example, you can sort the series and then use cumsum to get around this problem, like this:
import pandas as pd

dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='m'))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum().describe()
count 4500.000000
mean 68466.072444
std 39543.094524
min 0.000000
25% 34233.250000
50% 68465.500000
75% 102699.500000
max 136935.000000
dtype: float64
See the min and max are both positive.
Failaround
If the gaps are too big, this workaround will not work. Like here:
dates = pd.Series(pd.to_datetime(['2016-06-06', '1700-01-01', '2200-01-01']))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum()
1 0
0 -97931
2 -30883
This is because we calculate the step between consecutive dates and then add them up. Sorting guarantees the smallest possible steps; however, in this case each individual step is still too big to handle.
Resetting the order
As you see in the Failaround example, the series is no longer ordered by its original index. Fix this by calling the .reset_index(drop=True, inplace=True) method on the series.
