Python, improving for loop performance

I have made a class called localSun. It uses a simplified model of the Earth-Sun system to compute the altitude angle of the Sun for any location on Earth at any time. When I run the code for the current time and check against timeanddate.com, it matches well, so it works.
But then I wanted to go through one year in 1-minute intervals and store all the altitude angles for a specific location in a numpy array.
Here's my very first naive attempt, which I'm fairly certain performs poorly; I mainly wanted to test it for performance anyway.
import numpy as np
from datetime import datetime
from datetime import date
from datetime import timedelta
...
...
altitudes = np.zeros(int(year/60))  # 'year' (defined above) holds the number of seconds in a year
m = datetime(2018, 5, 29, 15, 21, 0)
for i in range(0, len(altitudes)):
    n = m + timedelta(minutes=i + 1)
    nn = localSun(30, 0, n)
    altitudes[i] = nn.altitude()  # .altitude() is a method of localSun
altitudes is the array in which I want to store all the altitude angles; its size is 525969, which is roughly the number of minutes in a year.
The localSun() constructor takes 3 parameters: colatitude (30 deg), longitude (0 deg) and a datetime object holding a time from a bit over an hour ago (at the time of posting).
So the question is: what would be an efficient way of going through a year in 1-minute intervals and computing the altitude angle at each time? This seems rather slow. Should I use map to update the altitude values instead of a for loop? I presume I'll have to create a new localSun object each time, and it's probably also bad to keep creating the variables n and nn on every iteration.
We can assume all of localSun's methods work fine. I'm just asking whether there is an efficient way of going through a year in 1-minute intervals and filling the array with the altitudes. The code above should reveal enough information.
I would eventually like to do this at 1-second intervals as well, so it would be great to know if there's an efficient way; I tried that, but it takes very long with this code.
This piece of code took about a minute to run on a university computer, which is quite fast as far as I know.
I would greatly appreciate an answer. Thanks in advance!

NumPy has native datetime and timedelta support, so you could take an approach like this:
import datetime
import numpy as np

start = datetime.datetime(2018, 5, 29, 15, 21, 0)
end = datetime.datetime(2019, 5, 29, 15, 21, 0)
n = np.arange(start, end, dtype='datetime64[m]')  # [m] specifies the interval as minutes
altitudes = np.vectorize(lambda x, y, z: localSun(x, y, z).altitude())(30, 0, n)
np.vectorize is not fast at all, but gets this working until you can modify 'localSun' to work with arrays of datetimes.
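If you do rewrite localSun to accept arrays, the key step is turning the datetime64 array into plain numeric arrays that the trigonometric part of altitude() can broadcast over. A minimal sketch of that conversion (not code from the original class):
import numpy as np

times = np.arange('2018-05-29T15:21', '2019-05-29T15:21', dtype='datetime64[m]')
days = times.astype('datetime64[D]')
day_of_year = (days - days.astype('datetime64[Y]')).astype(int) + 1       # 1..365/366
frac_hour = (times - days).astype('timedelta64[m]').astype(float) / 60.0  # 0.0..24.0

# A vectorised altitude() could then compute the declination from day_of_year and
# the hour angle from frac_hour with numpy trig functions, returning the whole
# year's altitudes in a single call instead of one localSun object per minute.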

Since you are already using numpy you can go one step further with pandas. It has powerful date and time manipulation routines such as pd.date_range:
import pandas as pd
start = pd.Timestamp(year=2018, month=1, day=1)
stop = pd.Timestamp(year=2018, month=12, day=31)
dates = pd.date_range(start, stop, freq='min')
altitudes = localSun(30, 0, dates)
You would then need to adapt your localSun to work with an array of pd.Timestamp rather than a single datetime.datetime.
Changing from minutes to seconds would then be as simple as changing freq='min' to freq='S'.
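Inside an array-aware localSun, the DatetimeIndex makes that adaptation straightforward, because the components you need are already available as vectorised attributes (a sketch, not the actual class code):
day_of_year = dates.dayofyear.to_numpy()
frac_hour = (dates.hour + dates.minute / 60 + dates.second / 3600).to_numpy()
# The trigonometric part of altitude() can then operate on these arrays directly.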

Related

How to calculate relative volume using pandas in a faster way?

I am trying to implement the RVOL-by-time-of-day technical indicator, which can be used as an indication of market strength.
The logic behind this is as follows:
If the current time is 2022/3/19 13:00, we look at the same moment (13:00) on each of the previous N days and average those volumes to calculate Average_volume_previous.
Then, RVOL(t) is volume(t)/Average_volume_previous(t).
It is hard to express this logic with methods like rolling and apply, so in the code I wrote I used a plain for loop.
However, the for loop's running time is catastrophically long.
from datetime import datetime
import pandas as pd
import numpy as np

datetime_array = pd.date_range(
    datetime.strptime('2015-03-19 13:00:00', '%Y-%m-%d %H:%M:%S'),
    datetime.strptime('2022-03-19 13:00:00', '%Y-%m-%d %H:%M:%S'),
    freq='30min')
volume_array = pd.Series(np.random.uniform(1000, 10000, len(datetime_array)))
df = pd.DataFrame({'Date': datetime_array, 'Volume': volume_array})
df.set_index(['Date'], inplace=True)

day_len = 10  # lookback in days (not defined in the original snippet; value assumed)
output = []
for idx in range(len(df)):
    date = str(df.index[idx].hour) + ':' + str(df.index[idx].minute)
    temp_date = df.iloc[:idx].between_time(date, date)
    output.append(temp_date.tail(day_len).mean().iloc[0])
output = np.array(output)
In practice there might be missing data in the datetime array, so it would be hard to use a fixed-length lookback period. Is there any way to make this code run faster?
I'm not sure I fully understand, but here is a solution as far as I do.
I didn't use the date as the index, so I left this step out:
# df.set_index(['Date'], inplace=True)

# Filter the data to the moment of interest
rolling_day = 10
hour = df['Date'].dt.hour == 13
minute = df['Date'].dt.minute == 0
df_moment = df[hour & minute].copy()
# Calculation of the moving average
df_moment['rolling'] = df_moment['Volume'].rolling(rolling_day).mean()
# Calculation of volume(t) / Average_volume_previous(t)
for idx_s, idx_e in zip(df_moment['Volume'][::rolling_day], df_moment['rolling'][rolling_day::rolling_day]):
    print(f'{idx_s/idx_e}')
Output:
0.566379345408499
0.7229214799940626
0.6753586759429548
2.0588617812341354
0.7494803741982076
1.2132554086225438
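For reference, a vectorised sketch that applies the question's logic to every bar rather than only the 13:00 one. It assumes df is indexed by Date, as in the question's own setup, and uses the same 10-day lookback:
day_len = 10
# Group bars by their time of day, average the previous day_len bars at the same
# time (shift(1) excludes the current bar), then divide volume by that average.
avg_prev = (
    df.groupby(df.index.time)['Volume']
      .transform(lambda s: s.shift(1).rolling(day_len, min_periods=1).mean())
)
rvol = df['Volume'] / avg_prev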

Python function that mimics the distribution of my dataset

I have a food order dataset that looks like this, with a few thousand orders over the span of a few months:
Date                   Item Name      Price
2021-10-09 07:10:00    Water Bottle   1.5
2021-10-09 12:30:60    Pizza          12
2021-10-09 17:07:56    Chocolate bar  3
Those orders are time-dependent: nobody eats a pizza at midnight, usually, and there will be more 3 PM orders on a Sunday than on a Monday (because people are at work). I want to extract the daily order distribution for each weekday (Monday through Sunday) from those few thousand orders so that I can later generate new orders that fit this distribution. I do not want to fill in the gaps in my dataset.
How can I do so?
I want to create a generate_order_date() function that generates random hours:minutes:seconds timestamps depending on the day. I can already identify which weekday a date corresponds to; I just need to extract the 7 daily distributions so I can call my function like this:
generate_order_date(day=Monday, nb_orders=1)
[12:30:00]
generate_order_date(day=Friday, nb_orders=5)
[12:30:00, 07:23:32, 13:12:09, 19:15:23, 11:44:59]
The generated timestamps do not have to be in chronological order, just as if I were calling
np.random.normal(mu, sigma, 1000)
Try np.histogram(data)
https://numpy.org/doc/stable/reference/generated/numpy.histogram.html
The first element of the returned tuple gives you the bin counts (pass density=True for a normalised density), which is your distribution. You can visualise it with
import matplotlib.pyplot as plt
plt.plot(np.histogram(data)[0])
Here, data would be the time of day at which a particular item was ordered. For this approach I would suggest rounding your times to 5-minute intervals or more, depending on the frequency; for example, round 12:34 pm down to 12:30 and 12:36 pm down to 12:35. Choose a suitable frequency.
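To turn that histogram into the requested generator, one option is to draw bin centres weighted by the bin counts. A sketch, assuming data holds the seconds-since-midnight values for a single weekday:
import numpy as np

counts, edges = np.histogram(data, bins=288)     # 288 bins = 5-minute resolution
centers = (edges[:-1] + edges[1:]) / 2
probs = counts / counts.sum()

def sample_order_seconds(n, rng=np.random.default_rng()):
    # Draw n order times (seconds since midnight) from the empirical distribution
    return rng.choice(centers, size=n, p=probs)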
Another method would be scipy.stats.gaussian_kde. This uses a Gaussian kernel. Below is an implementation I have used previously:
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

def get_kde(df: pd.DataFrame) -> list:
    xs = np.round(np.linspace(-1, 1, 3000), 3)
    kde = gaussian_kde(df.values)
    kde_vals = np.round(kde(xs), 3)
    data = [[xs[i], kde_vals[i]] for i in range(len(xs)) if kde_vals[i] > 0.1]
    return data
where df.values is your data. There are plenty more kernels which you can use to get a density estimate. The most suitable one depends on the nature of your data.
Also see https://en.wikipedia.org/wiki/Kernel_density_estimation#Statistical_implementation
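Note that gaussian_kde can also draw new samples directly, which matches the goal of generating new orders. A short sketch, where seconds_since_midnight is an assumed 1-D array of order times for one weekday:
from scipy.stats import gaussian_kde

kde = gaussian_kde(seconds_since_midnight)
new_orders = kde.resample(5).ravel()   # five synthetic order times, in seconds since midnight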
Here's a rough sketch of what you could do:
Assumption: Column Date of the DataFrame df contains datetimes. If not do df.Date = pd.to_datetime(df.Date) first.
First 2 steps:
import pickle
from datetime import datetime, timedelta

import numpy as np
from sklearn.neighbors import KernelDensity

# STEP 1: Data preparation
df["Seconds"] = (
    df.Date.dt.hour * 3600 + df.Date.dt.minute * 60 + df.Date.dt.second
) / 86400
data = {
    day: sdf.Seconds.to_numpy()[:, np.newaxis]
    for day, sdf in df.groupby(df.Date.dt.weekday)
}

# STEP 2: Kernel density estimation
kdes = {
    day: KernelDensity(bandwidth=0.1).fit(X) for day, X in data.items()
}
with open("kdes.pkl", "wb") as file:
    pickle.dump(kdes, file)
STEP 1: Build a normalised column Seconds (values between 0 and 1). Then group over the weekdays (numbered 0, ..., 6) and prepare, for every day of the week, the data for kernel density estimation.
STEP 2: Estimate the kernel densities for every day of the week with KernelDensity from scikit-learn and pickle the results.
Based on these estimates, build the desired sampling function:
# STEP 3: Sampling
with open("kdes.pkl", "rb") as file:
    kdes = pickle.load(file)

def generate_order_date(day, orders):
    fmt = "%H:%M:%S"
    base = datetime(year=2022, month=1, day=1)
    kde = kdes[day]
    return [
        (base + timedelta(seconds=int(s * 86399))).time().strftime(fmt)
        for s in kde.sample(orders)
    ]
I won't pretend that this is anywhere near perfect. But maybe you could use it as a starting point.
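For example, with the weekday numbering from step 1 (0 = Monday, ..., 6 = Sunday), sampling five Friday order times would look like this:
print(generate_order_date(4, 5))   # five synthetic HH:MM:SS strings; output varies per run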

Python: Converting seconds to a datetime format in a dataframe column

Currently I am working with a big dataframe (12x47800). One of the twelve columns consists of an integer number of seconds. I want to change this column into one consisting of datetime.time values. schedule is my dataframe, in which I try to change the column named 'depTime'. Since I want it to be a datetime.time and it could cross midnight, I added the if-statement. This 'works', but is really slow, as one can imagine. Is there a faster way to do this?
My current code, the only version I could get working, is:
for i in range(len(schedule)):
    t_sec = schedule.iloc[i].depTime
    [t_min, t_sec] = divmod(t_sec, 60)
    [t_hour, t_min] = divmod(t_min, 60)
    if t_hour > 23:
        t_hour -= 23
    schedule['depTime'].iloc[i] = dt.time(int(t_hour), int(t_min), int(t_sec))
Thanks in advance, guys.
PS: I'm pretty new to Python, so if anybody could help me I would be very grateful :)
I'm adding a new solution which is much faster than the original since it relies on pandas vectorized functions instead of looping (pandas apply functions are essentially optimized loops on the data).
I tested it with a sample similar in size to yours and the difference is from 778ms to 21.3ms. So I definitely recommend the new version.
Both solutions are based on transforming your seconds integers into timedelta format and adding them to a reference datetime. Then I simply capture the time component of the resulting datetimes.
New (Faster) Option:
import datetime as dt

import numpy as np
import pandas as pd

seconds = pd.Series(np.random.rand(50) * 100).astype(int)  # Generating test data
start = dt.datetime(2019, 1, 1, 0, 0)  # You need a reference point
datetime_series = seconds.astype('timedelta64[s]') + start
time_series = datetime_series.dt.time
time_series
Original (slower) Answer:
Not the most elegant solution, but it does the trick.
import datetime as dt

import numpy as np
import pandas as pd

seconds = pd.Series(np.random.rand(50) * 100).astype(int)  # Generating test data
start = dt.datetime(2019, 1, 1, 0, 0)  # You need a reference point
time_series = seconds.apply(lambda x: start + pd.Timedelta(seconds=x)).dt.time
You should try not to do a full scan on a dataframe, but instead use vectorized access because it is normally much more efficient.
Fortunately, pandas has a function that does exactly what you are asking for, to_timedelta:
schedule['depTime'] = pd.to_timedelta(schedule['depTime'], unit='s')
It is not really a datetime format, but it is the pandas equivalent of a datetime.timedelta and is a convenient type for processing times. You could use to_datetime, but you would end up with full datetimes close to 1970-01-01...
If you really need datetime.time objects, you can get them that way:
schedule['depTime'] = pd.to_datetime(schedule['depTime'], unit='s').dt.time
but they are less convenient to use in a pandas dataframe.
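A minimal self-contained illustration of the two conversions, with made-up second values including one past midnight:
import pandas as pd

schedule = pd.DataFrame({'depTime': [60, 3661, 86500]})

pd.to_timedelta(schedule['depTime'], unit='s')
# 0   0 days 00:01:00
# 1   0 days 01:01:01
# 2   1 days 00:01:40

pd.to_datetime(schedule['depTime'], unit='s').dt.time
# 0    00:01:00
# 1    01:01:01
# 2    00:01:40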

Convert datetime format to sampletime

Working in Python, I need to convert an array of datetime values into sample times, because I want to treat the corresponding times of the time series as sample times [0..T].
[2013/11/09 14:29:54.660, 2013/11/09 14:29:54.680, ... T] where T > 1000. So I have an array of more than 1000 datetime values, which is pretty big.
I come up with the following code:
from datetime import datetime

tiempos = [datetime.strptime(x, "%Y/%m/%d %H:%M:%S.%f") for x in csvTimeColum]
sampletime = [(t - tiempos[0]).microseconds / 1000 for t in tiempos]
This piece of code seems to work well, but I have batches of 1000 samples within the signal:
[0,20,...,980,0,20,...,980,0,20,...,980,...]
So my resulting signal is not a continuous one. How do I properly do this conversion in order to keep a continuous signal? Does anybody have a good idea of how to solve this?
Use total_seconds(), which gives the full length of the timedelta instead of only its microseconds component (the component-only value is what causes the wrap-around every second):
# Convert the time difference to milliseconds
sampletime = [(t - tiempos[0]).total_seconds() * 1000 for t in tiempos]
Working example:
import datetime
csvTimeColum = ["2013/11/09 14:29:54.660", "2013/11/09 14:29:54.680"]
tiempos= [datetime.datetime.strptime(x,"%Y/%m/%d %H:%M:%S.%f") for x in csvTimeColum]
sampletime= [(t- tiempos[0]).total_seconds()*1000 for t in tiempos]
sampletime # [0.0, 20.0]
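For larger arrays, the same conversion can be done without a Python-level loop by letting pandas parse and subtract the whole column at once; a sketch using the question's csvTimeColum:
import pandas as pd

t = pd.to_datetime(pd.Series(csvTimeColum), format="%Y/%m/%d %H:%M:%S.%f")
sampletime = (t - t.iloc[0]).dt.total_seconds() * 1000   # milliseconds from the first sample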

Day delta for dates >292 years apart

I am trying to obtain day deltas for a wide range of pandas dates. However, for time deltas of more than 292 years I obtain negative values. For example:
import pandas as pd
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='m'))
days_delta = (dates-dates.min()).astype('timedelta64[D]')
However, using a DatetimeIndex I can do it, and it works as I want it to:
import pandas as pd
import numpy as np
dates = pd.date_range('1700-01-01', periods=4500, freq='m')
days_fun = np.vectorize(lambda x: x.days)
days_delta = days_fun(dates.date - dates.date.min())
The question then is how to obtain the correct days_delta for Series objects?
The pandas docs say this specifically about timedelta limitations:
Pandas represents Timedeltas in nanosecond resolution using 64 bit integers. As such, the 64 bit integer limits determine the Timedelta limits.
Incidentally, this is the same limitation that the docs mention is placed on Timestamps in pandas:
Since pandas represents timestamps in nanosecond resolution, the timespan that can be represented using a 64-bit integer is limited to approximately 584 years
This suggests that the same recommendations the docs make for circumventing the timestamp limitations can be applied to timedeltas. The docs' solution to the timestamp limitations is:
If you have data that is outside of the Timestamp bounds, see Timestamp limitations, then you can use a PeriodIndex and/or Series of Periods to do computations.
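A minimal sketch of that PeriodIndex route, mirroring the question's setup (daily periods are backed by plain integer ordinals, so the 292-year nanosecond limit does not apply):
import pandas as pd

dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='m'))
periods = dates.dt.to_period('D')
days_delta = periods.apply(lambda p: p.ordinal) - periods.iloc[0].ordinal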
Workaround
If you have continuous dates with small, calculable gaps, as in your example, you could sort the series and then use cumsum to get around this problem, like this:
import pandas as pd

dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='m'))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum().describe()
count 4500.000000
mean 68466.072444
std 39543.094524
min 0.000000
25% 34233.250000
50% 68465.500000
75% 102699.500000
max 136935.000000
dtype: float64
See the min and max are both positive.
Failaround
If the gaps are too big, this workaround will not work. Like here:
dates = pd.Series(pd.to_datetime(['2016-06-06', '1700-01-01', '2200-01-01']))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum()
1 0
0 -97931
2 -30883
This is because we calculate the step between consecutive dates and then add the steps up. Sorting guarantees the smallest possible steps; however, in this case even a single step is too big to handle.
Resetting the order
As you can see in the Failaround example, the series is no longer ordered by the index. Fix this by calling .reset_index(drop=True) on the series (reset_index on a Series needs drop=True to keep it a Series).
