Given a datetime array of the shape (n, 2):
x = np.array([['2017-10-02T00:00:00.000000000', '2017-10-12T00:00:00.000000000']], dtype='datetime64[ns]')
x has shape (1, 2), but in reality it could be (n, 2), with n >= 1. In each pair, the first date is always less than or equal to the second. I want to get a list of all date ranges between each pair of dates in x. This is basically what I'm doing:
np.concatenate([pd.date_range(*y, closed='right') for y in x])
And it works, giving
array(['2017-10-03T00:00:00.000000000', '2017-10-04T00:00:00.000000000',
'2017-10-05T00:00:00.000000000', '2017-10-06T00:00:00.000000000',
'2017-10-07T00:00:00.000000000', '2017-10-08T00:00:00.000000000',
'2017-10-09T00:00:00.000000000', '2017-10-10T00:00:00.000000000',
'2017-10-11T00:00:00.000000000', '2017-10-12T00:00:00.000000000'], dtype='datetime64[ns]')
But this is pretty slow because of the list comp - it isn't exactly vectorised as I'd like. I'm wondering if there's a better way to obtain date ranges for multiple pairs of dates?
I'll provide as much clarification as needed. Thanks.
It's a tad convoluted... but:
d = np.array(1, dtype='timedelta64[D]')
x = x.astype('datetime64[D]')
deltas = np.diff(x, axis=1) / d
np.concatenate([
    i + np.arange(j + 1) for i, j in zip(x[:, 0], deltas[:, 0].astype(int))
]).astype('datetime64[ns]')
array(['2017-10-02T00:00:00.000000000', '2017-10-03T00:00:00.000000000',
'2017-10-04T00:00:00.000000000', '2017-10-05T00:00:00.000000000',
'2017-10-06T00:00:00.000000000', '2017-10-07T00:00:00.000000000',
'2017-10-08T00:00:00.000000000', '2017-10-09T00:00:00.000000000',
'2017-10-10T00:00:00.000000000', '2017-10-11T00:00:00.000000000',
'2017-10-12T00:00:00.000000000'], dtype='datetime64[ns]')
How it works
d represents one day
x is turned into dates with no time component
diff gets me the number of days between each pair... but as a timedelta
I divide by my d, which is also a timedelta, so the units cancel... leaving me with floats, which I cast to int
When I add an array of integers to the first column of the pairs, x[:, 0], broadcasting does the addition in the unit of x's dtype, which is datetime64[D]. So adding the integer k adds k days.
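As a small illustrative check of that broadcasting step (my own sketch, reusing the single-pair example; the commented values follow from the definitions above):
import numpy as np

d = np.array(1, dtype='timedelta64[D]')
x = np.array([['2017-10-02', '2017-10-12']], dtype='datetime64[D]')

deltas = np.diff(x, axis=1) / d      # array([[10.]]) -> the day count as a plain float
print(x[0, 0] + np.arange(3))        # integers added to a datetime64[D] step in whole days:
                                     # ['2017-10-02' '2017-10-03' '2017-10-04']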
Derived from / Inspired by #hpaulj
Will remove if they post an answer
d = np.array(1, dtype='timedelta64[D]')
np.concatenate([np.arange(row[0], row[1] + 1, d) for row in x])
array(['2017-10-02T00:00:00.000000000', '2017-10-03T00:00:00.000000000',
'2017-10-04T00:00:00.000000000', '2017-10-05T00:00:00.000000000',
'2017-10-06T00:00:00.000000000', '2017-10-07T00:00:00.000000000',
'2017-10-08T00:00:00.000000000', '2017-10-09T00:00:00.000000000',
'2017-10-10T00:00:00.000000000', '2017-10-11T00:00:00.000000000',
'2017-10-12T00:00:00.000000000'], dtype='datetime64[ns]')
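For reference, the same day-count idea can be written with no per-row Python loop at all. This is my own sketch (the variable names are mine, not from the answer above), assuming day resolution and ranges inclusive on both ends, built on np.repeat and a cumulative-sum offset:
import numpy as np

x = np.array([['2017-10-02T00:00:00.000000000', '2017-10-12T00:00:00.000000000']],
             dtype='datetime64[ns]')

days = x.astype('datetime64[D]')
# number of dates in each inclusive range, one count per row
lengths = (np.diff(days, axis=1).astype(int) + 1).ravel()
# repeat each start date once per output element ...
starts = np.repeat(days[:, 0], lengths)
# ... and add 0, 1, 2, ... within each row's block
offsets = np.arange(lengths.sum()) - np.repeat(np.cumsum(lengths) - lengths, lengths)
out = (starts + offsets.astype('timedelta64[D]')).astype('datetime64[ns]')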
I have a series of lists (np.arrays, actually), of which the elements are dates.
id
0a0fe3ed-d788-4427-8820-8b7b696a6033 [2019-01-30, 2019-01-31, 2019-02-01, 2019-02-0...
0a48d1e8-ead2-404a-a5a2-6b05371200b1 [2019-01-30, 2019-01-31, 2019-02-01, 2019-02-0...
0a9edba1-14e3-466a-8d0c-f8a8170cefc8 [2019-01-29, 2019-01-30, 2019-01-31, 2019-02-0...
Name: startDate, dtype: object
For each element in the series (i.e. for each list of dates), I want to retain the longest sublist in which all dates are consecutive. I'm struggling to approach this in a pythonic (simple/efficient) way. The only approach that I can think of is to use multiple loops: loop over the series values (the lists), and loop over each element in the list. I would then store the first date and the number of consecutive days, and use temporary values to overwrite the results if a longer sequence of consecutive days is encountered. This seems highly inefficient though. Is there a better way of doing this?
Since you mention you are using numpy arrays of dates it makes sense to stick to numpy types instead of converting to the built-in type. I'm assuming here that your arrays have dtype 'datetime64[D]'. In that case you could do something like
import numpy as np
date_list = np.array(['2005-02-01', '2005-02-02', '2005-02-03',
                      '2005-02-05', '2005-02-06', '2005-02-07', '2005-02-08', '2005-02-09',
                      '2005-02-11', '2005-02-12',
                      '2005-02-14', '2005-02-15', '2005-02-16', '2005-02-17',
                      '2005-02-19', '2005-02-20',
                      '2005-02-22', '2005-02-23', '2005-02-24',
                      '2005-02-25', '2005-02-26', '2005-02-27', '2005-02-28'],
                     dtype='datetime64[D]')
i0max, i1max = 0, 0
i0 = 0
for i1, date in enumerate(date_list):
    if date - date_list[i0] != np.timedelta64(i1 - i0, 'D'):
        # the run that ended at i1 - 1 is over; keep it if it is the longest so far
        if i1 - i0 > i1max - i0max:
            i0max, i1max = i0, i1
        i0 = i1
# compare the final run too, since the loop only compares when a run breaks
if len(date_list) - i0 > i1max - i0max:
    i0max, i1max = i0, len(date_list)
print(date_list[i0max:i1max])
# output: ['2005-02-22' '2005-02-23' '2005-02-24' '2005-02-25' '2005-02-26'
#          '2005-02-27' '2005-02-28']
Here, i0 and i1 indicate the start and stop indices of the current sub-array of consecutive dates, and i0max and i1max the start and stop indices of the longest sub-array found so far. The solution uses the fact that the difference between the i-th and zeroth entry in a list of consecutive dates is exactly i days. The extra check after the loop handles the case where the longest run sits at the end of the list.
You can convert the list to ordinals, which increase by exactly 1 for consecutive dates, i.e. next_date_ordinal = previous_date_ordinal + 1 (see datetime.date.toordinal).
Then find the longest consecutive sub-array.
This takes O(n) time, a single pass over the list, which is about as efficient as this can get.
CODE
from datetime import datetime

def get_consecutive(date_list):
    # convert to ordinals
    v = [datetime.strptime(d, "%Y-%m-%d").toordinal() for d in date_list]
    consecutive = []
    run = []
    dates = []
    # get consecutive ordinal sequence
    for i in range(1, len(v) + 1):
        run.append(v[i-1])
        dates.append(date_list[i-1])
        if i == len(v) or v[i-1] + 1 != v[i]:
            if len(consecutive) < len(run):
                consecutive = dates
            dates = []
            run = []
    return consecutive
EXAMPLE:
date_list = ['2019-01-29', '2019-01-30', '2019-01-31', '2019-02-05']
get_consecutive(date_list)
# ordinals will be -> v = [737088, 737089, 737090, 737095]
OUTPUT:
['2019-01-29', '2019-01-30', '2019-01-31']
Now apply get_consecutive to the column with df.column.apply(get_consecutive); it will give you the longest consecutive date list for each row. Or call the function on each list directly if you are using some other data structure.
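As a hypothetical usage sketch (the DataFrame and column names here are assumptions, and the entries are assumed to be lists of ISO-formatted date strings as in the example):
# df and 'startDate' are placeholder names for the structure shown in the question
longest_runs = df['startDate'].apply(get_consecutive)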
I'm going to reduce this problem to finding consecutive days in a single list. There are a few tricks that make it more Pythonic as you ask. The following script should run as-is. I've documented how it works inline:
from datetime import timedelta, date

# example input
days = [
    date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4),
    date(2020, 1, 5), date(2020, 1, 6), date(2020, 1, 8),
]

# store the longest interval and the current consecutive interval
# as we iterate through the list
longest_interval_index = current_interval_index = 0
longest_interval_length = current_interval_length = 1

# using zip here to reduce the number of indexing operations
# this will turn the days list into [(2020-01-01, 2020-01-02), (2020-01-02, 2020-01-04), ...]
# use enumerate to get the index of the current day
for i, (previous_day, current_day) in enumerate(zip(days, days[1:]), start=1):
    if current_day - previous_day == timedelta(days=+1):
        # we've found a consecutive day! increase the interval length
        current_interval_length += 1
    else:
        # nope, not a consecutive day! start from this day and start
        # counting from 1
        current_interval_index = i
        current_interval_length = 1
    if current_interval_length > longest_interval_length:
        # we broke the record! record it as the longest interval
        longest_interval_index = current_interval_index
        longest_interval_length = current_interval_length

print("Longest interval index:", longest_interval_index)
print("Longest interval: ", days[longest_interval_index:longest_interval_index + longest_interval_length])
It should be easy enough to turn this into a reusable function.
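For example, one possible wrapper is sketched below (the function name and the choice to return the run itself are mine):
from datetime import date, timedelta

def longest_consecutive_run(days):
    """Return the longest run of consecutive days in an already-sorted list of dates."""
    if not days:
        return []
    best_index = current_index = 0
    best_length = current_length = 1
    for i, (previous_day, current_day) in enumerate(zip(days, days[1:]), start=1):
        if current_day - previous_day == timedelta(days=1):
            current_length += 1
        else:
            current_index, current_length = i, 1
        if current_length > best_length:
            best_index, best_length = current_index, current_length
    return days[best_index:best_index + best_length]

# longest_consecutive_run([date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4)])
# -> [datetime.date(2020, 1, 1), datetime.date(2020, 1, 2)]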
Given a list of indices that match a condition, where there will be many spans in the list that are sequentially adjacent, how can I easily select only the first of each span?
such that
magicallySelect([1,2,3,10,11,12,100,101,102]) == [1,10,100]
but -- importantly, this should also work for other indices, like dates (which is the case in my data). The actual code I'm hoping to get working is:
original.reset_index(inplace=True)
predict = {}
for app in apps:
    reg = linear_model.LinearRegression()
    reg.fit(original.index.values.reshape(-1, 1), original[app].values)
    slope = reg.coef_.tolist()[0]
    delta = original[app].apply(lambda x: abs(slope - x))
    forecast['test_delta'] = forecast[app].apply(lambda x: abs(slope - x))
    tdm = forecast['test_delta'].mean()
    tds = forecast['test_delta'].std(ddof=0)
    # identify moments that are σ>2 abnormal
    forecast['z'] = forecast['test_delta'].apply(lambda x: abs(x - tdm) / tds)
    sig = forecast.index[forecast['z'] > 2].tolist()
    predict[app] = FIRST_INDEX_IN_EACH_SPAN_OF(sig)
l = [1,2,3,10,11,12,100,101,102]
indices = [l[i] for i in range(len(l)) if l[i-1]!=l[i]-1]
Reordering this slightly to work for datetimes, this would give you all items in the list where the gap from the previous item is greater than 1 day (plus the first item by default):
indices = [l[0]] + [l[i] for i in range(len(l)) if (l[i]-l[i-1]).days>1]
For a difference in time measured in minutes, you can convert to seconds and substitute this in. E.g. for 15 minutes (900 seconds) you can do:
indices = [l[0]] + [l[i] for i in range(len(l)) if (l[i]-l[i-1]).seconds>900]
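One caveat worth adding here (my note): timedelta.seconds only holds the seconds component and ignores whole days, so when gaps can exceed a day, total_seconds() is the safer comparison:
# same idea, but robust to gaps longer than one day
indices = [l[0]] + [l[i] for i in range(len(l)) if (l[i] - l[i-1]).total_seconds() > 900]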
So I have a deltatime array dt = [(20,6), (20,7), (20,9), (20,10), (20,11), (20,13)] and the issue I have is that I can't allow any data to be more than one second apart from the next value in the list. I wrote out a little if statement that goes
for k in range(len(dt)-15):
    if dt[k+1].seconds-dt[k].seconds>1:
        gj.append(dt[k])
        gj.append(dt[k+1])
and I end up with (20,7), (20,9), (20,11), (20,13), so I know which times are more than one second apart, but I can't figure out how to delete the values from the deltatime array. I tried numpy.delete, but that didn't work because it's in an unusable format. The end goal is having a new array [(20,6), (20,10)] with only data that is one second apart.
Why not check for a difference of at most one second and append those to a list?
Code
from datetime import time

dt = [(20,6), (20,7), (20,9), (20,10), (20,11), (20,13)]
dt = [time(0, m, s) for m, s in dt]

left = []
for i in range(len(dt) - 1):
    if dt[i + 1].second - dt[i].second <= 1:
        left.append(dt[i])

print(left)
Result
>>> [datetime.time(0, 20, 6), datetime.time(0, 20, 9), datetime.time(0, 20, 10)]
Is there a way to get rid of the loop in the code below and replace it with vectorized operation?
Given a data matrix, for each row I want to find the index of the minimal value that fits within ranges defined (per row) in a separate array.
Here's an example:
import numpy as np
np.random.seed(10)
# Values of interest, for this example a random 6 x 100 matrix
data = np.random.random((6,100))
# For each row, define an inclusive min/max range
ranges = np.array([[0.3, 0.4],
                   [0.35, 0.5],
                   [0.45, 0.6],
                   [0.52, 0.65],
                   [0.6, 0.8],
                   [0.75, 0.92]])

# For each row, find the index of the minimum value that fits inside the given range
result = np.zeros(6).astype(np.int)
for i in xrange(6):
    ind = np.where((ranges[i][0] <= data[i]) & (data[i] <= ranges[i][1]))[0]
    result[i] = ind[np.argmin(data[i,ind])]
print result
# Result: [35 8 22 8 34 78]
print data[np.arange(6),result]
# Result: [ 0.30070006 0.35065639 0.45784951 0.52885388 0.61393513 0.75449247]
Approach #1 : Using broadcasting and np.minimum.reduceat -
mask = (ranges[:,None,0] <= data) & (data <= ranges[:,None,1])
r,c = np.nonzero(mask)
cut_idx = np.unique(r, return_index=1)[1]
out = np.minimum.reduceat(data[mask], cut_idx)
Improvement to avoid np.nonzero and compute cut_idx directly from mask :
cut_idx = np.concatenate(( [0], np.count_nonzero(mask[:-1],1).cumsum() ))
Approach #2 : Using broadcasting and filling invalid places with NaNs and then using np.nanargmin -
mask = (ranges[:,None,0] <= data) & (data <= ranges[:,None,1])
result = np.nanargmin(np.where(mask, data, np.nan), axis=1)
out = data[np.arange(6),result]
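One caveat with Approach #2 (my note, not part of the original): np.nanargmin raises a ValueError for any row with no in-range value, because that row becomes all-NaN. A possible guard, as a sketch:
masked = np.where(mask, data, np.nan)
valid = mask.any(axis=1)                      # rows that have at least one in-range value
result = np.full(data.shape[0], -1)           # -1 marks rows with no valid value
result[valid] = np.nanargmin(masked[valid], axis=1)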
Approach #3 : If you are only iterating over a handful of rows (like the 6-iteration loop in the sample), you might want to stick to a loop for memory efficiency, but make use of more efficient masking with a boolean array instead -
out = np.zeros(6)
for i in xrange(6):
    mask_i = (ranges[i,0] <= data[i]) & (data[i] <= ranges[i,1])
    out[i] = np.min(data[i,mask_i])
Approach #4 : There is one more loopy solution possible here. The idea would be to sort each row of data. Then, use the two range limits for each row to decide on the start and stop indices with help from np.searchsorted. Further, we would use those indices to slice and then get the minimum values. Benefit with slicing that way is, we would be working with views and as such would be very efficient, both on memory and performance.
The implementation would look something like this -
out = np.zeros(6)
sdata = np.sort(data, axis=1)
for i in xrange(6):
    start = np.searchsorted(sdata[i], ranges[i,0])
    stop = np.searchsorted(sdata[i], ranges[i,1], 'right')
    out[i] = np.min(sdata[i,start:stop])
Furthermore, we could get those start, stop indices in a vectorized manner following an implementation of vectorized searchsorted.
Based on a suggestion by #Daniel F, for the case when every range is guaranteed to contain at least one value from the given data, we could simply use the start indices -
out[i] = sdata[i, start]
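As one possible sketch of that vectorization (my own, not the vectorized-searchsorted implementation referenced above): because each row of sdata is sorted, the searchsorted results equal simple broadcasted comparison counts, at the cost of forming the full (6, 100) comparison -
sdata = np.sort(data, axis=1)
start = (sdata < ranges[:, None, 0]).sum(1)    # per-row left insertion points
stop = (sdata <= ranges[:, None, 1]).sum(1)    # per-row right insertion points
valid = stop > start                           # rows with at least one value in range
out = np.full(6, np.nan)
out[valid] = sdata[np.arange(6)[valid], start[valid]]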
Assuming at least one value in range, you don't even have to bother with the upper limit:
result = np.empty(6)
for i in xrange(6):
    lt = (ranges[i,0] >= data[i]).sum()
    result[i] = np.argpartition(data[i], lt)[lt]
Actually, you could even vectorize the whole thing using argpartition
lt = (ranges[:,None,0] >= data).sum(1)
result = np.argpartition(data, lt)[np.arange(data.shape[0]), lt]
Of course, this is only efficient if data.shape[0] << data.shape[1], as otherwise you're basically sorting
I've got a 2-row array called C like this:
from numpy import *
A = [1,2,3,4,5]
B = [50,40,30,20,10]
C = vstack((A,B))
I want to take all the columns in C where the value in the first row falls between i and i+2, and average them. I can do this with just A no problem:
i = 0
A_avg = []
while(i<6):
    selection = A[logical_and(A >= i, A < i+2)]
    A_avg.append(mean(selection))
    i += 2
then A_avg is:
[1.0,2.5,4.5]
I want to carry out the same process with my two-row array C, but I want to take the average of each row separately, while doing it in a way that's dictated by the first row. For example, for C, I want to end up with a 2 x 3 array that looks like:
[[1.0,2.5,4.5],
[50,35,15]]
Where the first row is A averaged in blocks between i and i+2 as before, and the second row is B averaged in the same blocks as A, regardless of the values it has. So the first entry is unchanged, the next two get averaged together, and the next two get averaged together, for each row separately. Anyone know of a clever way to do this? Many thanks!
I hope this is not too clever. TIL boolean indexing does not broadcast, so I had to manually do the broadcasting. Let me know if anything is unclear.
import numpy as np
A = [1,2,3,4,5]
B = [50,40,30,20,10]
C = np.vstack((A,B)) # float so that I can use np.nan
i = np.arange(0, 6, 2)[:, None]
selections = np.logical_and(A >= i, A < i+2)[None]
D, selections = np.broadcast_arrays(C[:, None], selections)
D = D.astype(float) # allows use of nan, and makes a copy to prevent repeated behavior
D[~selections] = np.nan # exclude these elements from mean
D = np.nanmean(D, axis=-1)
Then,
>>> D
array([[ 1. , 2.5, 4.5],
[ 50. , 35. , 15. ]])
Another way, using np.histogram to bin your data. This may be faster for large arrays, but is only useful for few rows, since a hist must be done with different weights for each row:
bins = np.arange(0, 7, 2) # include the end
n = np.histogram(A, bins)[0] # number of columns in each bin
a_mean = np.histogram(A, bins, weights=A)[0]/n
b_mean = np.histogram(A, bins, weights=B)[0]/n
D = np.vstack([a_mean, b_mean])
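For reference, both approaches should reproduce the 2 x 3 target from the question; this is my own check of the code above rather than output copied from the original (exact print formatting may differ):
print(n)   # [1 2 2]  -> number of columns falling in each bin
print(D)
# [[ 1.   2.5  4.5]
#  [50.  35.  15. ]]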