Difference between large numpy arrays containing time values - python

I have ten (1000,1000) numpy arrays. Each array element contains a float representing the hour of the day, e.g. 14.0 = 2pm and 15.75 = 3:45pm.
I want to find the maximum difference between these arrays. The result should be a single (1000,1000) numpy array containing, for each element, the maximum difference across the ten arrays. At the moment I have the following, which seems to work fine:
import numpy as np
max=np.maximum.reduce([data1,data2,data3,data4,data5])
min=np.minimum.reduce([data1,data2,data3,data4,data5])
diff=max-min
However, it computes the difference between 11pm and 1am as 22 hours, whereas I need it to be 2 hours. I imagine I need to use datetime.time somehow, but I don't know how to get datetime to play nicely with numpy arrays.
Edit: The times refer to the average time of day that a certain event occurs, so they are not associated with a specific date. The difference between two times could therefore be correctly interpreted as either 22 hours or 2 hours. However, I will always want to take the minimum of those two possible interpretations.

You can take the difference between two cyclic values by centring one value at the middle of the cycle (12.0). Rotate the other values by the same amount to maintain their relative differences, then take the values modulo the cycle length to keep everything within bounds. The times are now adjusted so that the maximum possible distance stays within +/- half the cycle length (+/- 12 hours).
e.g.,
adjustment = arr1 - 12.0
arr2 = (arr2 - adjustment) % 24.0
diff = 12.0 - arr2 # or abs(12.0 - arr2) if you prefer
If you're not using the absolute value, you'll need to play with the sign depending on which time you want to be considered 'first'.
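For concreteness, here is a small self-contained check of the idea on the 11pm/1am case from the question; arr1 and arr2 are just stand-ins for two of the hour-of-day arrays:
import numpy as np

# stand-ins for two of the (1000, 1000) hour-of-day arrays
arr1 = np.array([[23.0, 1.0], [14.0, 15.75]])
arr2 = np.array([[1.0, 23.0], [15.0, 14.0]])

# centre arr1 on 12.0 and rotate arr2 by the same amount, then wrap
adjustment = arr1 - 12.0
arr2_adj = (arr2 - adjustment) % 24.0

# signed difference, guaranteed to lie within +/- 12 hours
diff = 12.0 - arr2_adj
print(np.abs(diff))   # short-way distances: 2, 2, 1 and 1.75 hours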

Let's say you have the times 11pm and 1am, and you want to find the minimum distance between them.
1am -> 1
11pm -> 23
Then you have either:
23 - 1 = 22
Or,
24 - (23 - 1) % 24 = 2
Then distance can be thought of as:
import numpy as np

def dist(x, y):
    # works elementwise on arrays as well as on scalars
    return np.minimum(np.abs(x - y), 24 - np.abs(x - y) % 24)
Now we need to apply dist to every pairwise combination. If I recall correctly there is a more numpy/scipy-oriented function for this, but the concept is more or less the same:
from itertools import combinations

data = [data1, data2, data3, data4, data5]
dists = [dist(x, y) for x, y in combinations(data, 2)]
# elementwise maximum over all pairs
max_dist = np.maximum.reduce(dists)

If you have an array diff of time differences ranging between 0 and 24 hours, you can correct the wrongly calculated values as follows:
diff[diff > 12] = 24. - diff[diff > 12]
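Putting that together with the max/min reduction from the question, a sketch (assuming data1..data5 are the hour-of-day arrays) might look like this:
import numpy as np

arrays = [data1, data2, data3, data4, data5]   # the question's arrays
diff = np.maximum.reduce(arrays) - np.minimum.reduce(arrays)

# fold spreads larger than 12 hours back onto the short way round the clock
diff = np.where(diff > 12, 24.0 - diff, diff)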

Related

Taking the mean of every n elements in an array and converting MATLAB code to Python

I am attempting to convert a MATLAB program to Python and have run into a snag with a certain loop. I have a 5868x3500 matrix of 5868 daily observations of the ratio of returns to volumes for 3500 stocks; this data is used to produce a measure of market liquidity by taking monthly averages of each stock's return-to-volume ratio. I have a 5868x1 vector called Dummymonth which assigns an integer from 1 to 270 to each month, with ~22 trading days per month (1,1,1,1,1,1,1,1,1,1,1... 2,2,2,2,2,2... 270,270,270).
The loop I'm stuck on needs to convert the 5868x3500 matrix into a 270x3500 matrix by taking the monthly average according to the Dummymonth values (i.e. basically taking the average of every ~22 rows).
I've tried converting the code as cleanly as possible (substituting numpy's np.argwhere() for MATLAB's find()), but I am relatively new to Python (and MATLAB, really), so the problems with the code are not immediately obvious to me.
Here is the section of MATLAB code I am trying to emulate:
numberofmonth = Dummymonth(size(Ret, 1));
i = 1;
for di = 1:numberofmonth
    v = find(Dummymonth == di);
    for j = 1:size(Ret, 2)
        Amihud2(i, j) = nanmean(Amihud1(v, j));
    end
    i = i + 1;
end
And here is what I have in Python:
import numpy as np

Amihud2 = np.empty((270, len(Amihud1)))
for month_num in range(0, 270):
    v = np.argwhere(dummy == month_num)
    for i in range(1, len(Amihud1)):
        for j in range(1, len(Amihud1[0])):
            Amihud2[i][j] = np.mean(Amihud1[v][j])
The errors I am usually seeing are "index out of bounds errors".
I think one of the errors is related to Python's 0 indexing. If you loop over something and start at 1, you miss the first (index 0) values. Here is one solution (there are many):
import numpy as np

# Create the dummy index
dummy = np.array([np.repeat(i, 22) for i in np.arange(270) + 1]).flatten()

# Make a dataset for the example
dat = np.random.random((len(dummy), 3500))

# Calculate the average per month
dat2 = np.empty((270, 3500))
for i, m in enumerate(np.unique(dummy)):
    dat2[i, :] = dat[dummy == m].mean(axis=0)
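Applied to the question's own arrays, and using np.nanmean to mirror MATLAB's nanmean, the same idea might look like this (a sketch assuming Amihud1 is the 5868x3500 array and Dummymonth the 5868-long month vector):
import numpy as np

months = np.unique(Dummymonth)                        # array([1, 2, ..., 270])
Amihud2 = np.empty((len(months), Amihud1.shape[1]))
for i, m in enumerate(months):
    # np.nanmean ignores NaNs, like MATLAB's nanmean
    Amihud2[i, :] = np.nanmean(Amihud1[Dummymonth.ravel() == m], axis=0)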

How do I assign tiles to a pandas data frame based on equal parts of a column?

I have sorted a roughly 1 million row dataframe by a certain column. I would like to assign groups to each observation based on equal sums of another column but I'm not sure how to do this.
Example below:
import pandas as pd
value1 = [25,27,20,22,28,20]
value2 = [.34,.43,.54,.43,.5,.7]
df = pd.DataFrame({'value1':value1,'value2':value2})
df.sort_values('value1', ascending = False)
df['wanted_result'] = [1,1,1,2,2,2]
Like this example, I want to sum my column (value1 in the example) and assign groups so that each group's value1 sum is as close to equal as possible. Is there a built-in function for this?
Greedy Loop
Using Numba's JIT to speed it up.
from numba import njit

@njit
def partition(c, n):
    # c is the cumulative sum of the (sorted) values, n the number of groups
    delta = c[-1] / n
    group = 1
    indices = [group]
    total = delta
    for left, right in zip(c, c[1:]):
        left_diff = total - left
        right_diff = total - right
        # step to the next group when the right point overshoots the target
        # and is further from it than the left point
        if right > total and abs(right_diff) > abs(left_diff):
            group += 1
            total += delta
        indices.append(group)
    return indices
# make sure the frame is actually sorted; sort_values does not sort in place
df = df.sort_values('value1', ascending=False)
df.assign(result=partition(df.value1.to_numpy().cumsum(), n=2))
value1 value2 result
4 28 0.50 1
1 27 0.43 1
0 25 0.34 1
3 22 0.43 2
2 20 0.54 2
5 20 0.70 2
This is NOT optimal. This is a greedy heuristic. It goes through the list and finds where we step over to the next group. At that point it decides whether it's better to include the current point in the current group or the next group.
This should behave pretty well except in cases with huge disparity in values with the larger values coming towards the end. This is because this algorithm is greedy and only looks at what it knows at the moment and not everything at once.
But like I said, it should be good enough.
I think this is a kind of (non-linear) optimization problem, and Pandas is definitely not a good candidate for solving it.
The basic idea for solving the problem is as follows:
Definitions:
n - the number of elements,
groupNo - the number of groups to divide into.
Start by generating an initial solution, e.g. put consecutive
groups of n / groupNo elements into each bin.
Define a goal function, e.g. the sum of squared differences between
each group's sum and (the sum of all elements) / groupNo.
Perform an iteration (see the sketch after this list):
for each pair of elements a and b from different bins,
calculate the new goal function value if these elements were
swapped between bins,
then select the pair that gives the greatest improvement of the goal function
and perform the swap (move a to the bin where b is, and vice versa).
If no such pair can be found, we have the final result.
Maybe someone will propose a better solution, but at least this is
a concept to start with.
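A rough, unoptimized sketch of that swap-based local search (illustrative only; scanning all pairs like this would be far too slow for a million rows, but it shows the idea):
import numpy as np

def swap_improve(values, labels, n_groups):
    """Refine an initial grouping by greedily swapping pairs between bins."""
    target = values.sum() / n_groups

    def cost(lab):
        sums = np.array([values[lab == g].sum() for g in range(n_groups)])
        return ((sums - target) ** 2).sum()

    best = cost(labels)
    improved = True
    while improved:
        improved = False
        best_pair = None
        for i in range(len(values)):
            for j in range(i + 1, len(values)):
                if labels[i] == labels[j]:
                    continue
                # try swapping i and j, remember the swap if it is the best so far
                labels[i], labels[j] = labels[j], labels[i]
                c = cost(labels)
                labels[i], labels[j] = labels[j], labels[i]
                if c < best:
                    best, best_pair = c, (i, j)
        if best_pair is not None:
            i, j = best_pair
            labels[i], labels[j] = labels[j], labels[i]
            improved = True
    return labels

# initial solution: consecutive blocks of equal size, then refine
init = np.repeat(np.arange(2), 3)
groups = swap_improve(np.array([28., 27., 25., 22., 20., 20.]), init, 2) + 1
Note that this allows non-contiguous groups, unlike the greedy partition above.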

Agglomerate time series data

I have AWS EC2 instance CPU utilization and other metric data given to me in CSV format like this:
Date,Time,CPU_Utilization,Unit
2016-10-17,09:25:00,22.5,Percent
2016-10-17,09:30:00,6.534,Percent
2016-10-17,09:35:00,19.256,Percent
2016-10-17,09:40:00,43.032,Percent
2016-10-17,09:45:00,58.954,Percent
2016-10-17,09:50:00,56.628,Percent
2016-10-17,09:55:00,25.866,Percent
2016-10-17,10:00:00,17.742,Percent
2016-10-17,10:05:00,34.22,Percent
2016-10-17,10:10:00,26.07,Percent
2016-10-17,10:15:00,20.066,Percent
2016-10-17,10:20:00,15.466,Percent
2016-10-17,10:25:00,16.2,Percent
2016-10-17,10:30:00,14.27,Percent
2016-10-17,10:35:00,5.666,Percent
2016-10-17,10:40:00,4.534,Percent
2016-10-17,10:45:00,4.6,Percent
2016-10-17,10:50:00,4.266,Percent
2016-10-17,10:55:00,4.2,Percent
2016-10-17,11:00:00,4.334,Percent
2016-10-17,11:05:00,4.334,Percent
2016-10-17,11:10:00,4.532,Percent
2016-10-17,11:15:00,4.266,Percent
2016-10-17,11:20:00,4.266,Percent
2016-10-17,11:25:00,4.334,Percent
As is evident, this is reported every 5 minutes. I do not have access to the aws-cli. I need to process this and report the average utilization every 15 minutes for visualization. That is, for every hour, I need to find the average of the values in the first 15 minutes, the next 15 minutes, and so on, so I will be reporting 4 values per hour.
A sample output would be:
Date,Time,CPU_Utilization,Unit
2016-10-17,09:30:00,14.517,Percent
2016-10-17,09:45:00,40.414,Percent
2016-10-17,10:00:00,33.412,Percent
2016-10-17,10:15:00,26.785,Percent
...
One way to do it would be to read the entire file (which has 10000+ lines), then for each date find the values which belong to one 15-minute window, compute their average, and repeat for all the values. This does not seem to be the best or most efficient approach. Is there a better way to do it? Thank you.
As your input data is actually pretty small, I'd suggest reading it in all at once using np.genfromtxt. Then you can find the appropriate range by checking where the first full quarter of an hour starts and counting how many full quarter hours remain. Then you can use np.reshape to get an array with one row per quarter hour and average over those rows:
import numpy as np

# Read in the data:
data = np.genfromtxt("data.dat", skip_header=1,
                     dtype=[("date", "|S10"),
                            ("time", "|S8"),
                            ("cpu_usage", "f8")],
                     delimiter=',', usecols=(0, 1, 2))

# Find the first full quarter hour:
firstQuarterHour = 0
while int(data[firstQuarterHour]["time"][3:5]) % 15 != 0:
    firstQuarterHour += 1
noOfQuarterHours = data[firstQuarterHour:].shape[0] // 3

# Create a reshaped array (3 five-minute samples per quarter hour):
reshaped = data[firstQuarterHour:firstQuarterHour + 3 * noOfQuarterHours].reshape(
    (noOfQuarterHours, 3))

# Average over cpu_usage and take the corresponding dates and times:
cpu_usage = reshaped["cpu_usage"].mean(axis=1)
dates = reshaped["date"][:, 0]
times = reshaped["time"][:, 0]
Now you can use those arrays to, for example, save the result to another text file using np.savetxt.
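For instance, one possible way to write the result out (assuming the dates, times and cpu_usage arrays from above; the byte strings from genfromtxt are decoded first):
lines = ["%s,%s,%.3f,Percent" % (d.decode(), t.decode(), c)
         for d, t, c in zip(dates, times, cpu_usage)]
np.savetxt("averaged.csv", lines, fmt="%s",
           header="Date,Time,CPU_Utilization,Unit", comments="")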

efficient, fast numpy histograms

I have a 2D numpy array consisting of ca. 15'000'000 datapoints. Each datapoint has a timestamp and an integer value (between 40 and 200). I must create histograms of the datapoint distribution (16 bins: 40-49, 50-59, etc.), sorted by year, by month within the current year, by week within the current year, and by day within the current month.
Now, I wonder what might be the most efficient way to accomplish this. Given the size of the array, performance is a conspicuous consideration. I am considering nested "for" loops, breaking down the arrays by year, by month, etc. But I was reading that numpy arrays are highly memory-efficient and have all kinds of tricks up their sleeve for fast processing. So I was wondering if there is a faster way to do that. As you may have realized, I am an amateur programmer (a molecular biologist in "real life") and my questions are probably rather naïve.
First, fill in your 16 bins without considering date at all.
Then, sort the elements within each bin by date.
Now, you can use binary search to efficiently locate a given year/month/week within each bin.
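A minimal sketch of that approach (the names values and timestamps, and the datetime64 timestamps, are illustrative assumptions):
import numpy as np

# illustrative data: timestamps as datetime64[s], integer values in 40..200
bin_idx = np.clip((values - 40) // 10, 0, 15)        # which of the 16 bins

bins = []
for b in range(16):
    bins.append(np.sort(timestamps[bin_idx == b]))   # sort each bin by date once

# binary search then gives, e.g., the count for bin 3 in March 2015:
lo, hi = np.datetime64("2015-03-01"), np.datetime64("2015-04-01")
count = np.searchsorted(bins[3], hi) - np.searchsorted(bins[3], lo)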
For this kind of task numpy has a function, numpy.bincount. It is blazingly fast - so fast that you can create a bin for each integer value (161 bins) and each day (maybe 30000 different days?), resulting in only a few million bins.
The procedure:
calculate an integer index for each data point (e.g. 17 * number of days since the first day in the file + (value - 40)//10)
run np.bincount
reshape to the correct shape (number of days, 17)
Now you have the binned data which can then be clumped into whatever bins are needed in the time dimension.
Without knowing the form of your input data the integer bin calculation code could be something like this:
import numpy as np

# let us assume we have the data as:
#   timestamps: 64-bit integers (seconds since some epoch)
#   values: 8-bit unsigned integers between 40 and 200

# find the first day in the sample (86400 seconds per day)
first_day = np.min(timestamps) // 86400

# we intend to do this, but fast:
indices = (timestamps // 86400 - first_day) * 17 + (values - 40) // 10

# get the bincount vector
b = np.bincount(indices)

# calculate the number of days in the sample
no_days = (len(b) + 16) // 17

# reshape b to (number of days, 17); resize pads with zeros if the last day is incomplete
b.resize((no_days, 17))
It should be noted that the first and last days in b depend on the data. In testing this most of the time is spent in calculating the indices (around 400 ms with an i7 processor). If that needs to be reduced, it can be done in approximately 100 ms with numexpr module. However, the actual implementation depends really heavily on the form of timestamps; some are faster to calculate, some slower.
However, I doubt that any other binning method will be faster if the data is needed down to the daily level.
I did not quite understand from your question whether you wanted separate views of the data (one by year, one by week, etc.) or some other binning method. In any case that boils down to summing the relevant rows together (see the sketch below).
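For example, collapsing the per-day table b into per-month histograms could look like this (a sketch building on the snippet above, assuming the timestamps count seconds since the Unix epoch so that first_day is a day number since 1970-01-01):
day_numbers = first_day + np.arange(no_days)                   # days since the epoch
months = day_numbers.astype("datetime64[D]").astype("datetime64[M]")
unique_months, month_idx = np.unique(months, return_inverse=True)

monthly = np.zeros((len(unique_months), b.shape[1]), dtype=b.dtype)
np.add.at(monthly, month_idx, b)     # sum all day rows belonging to the same month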
Here is a solution, employing the group_by functionality found in the link below:
http://pastebin.com/c5WLWPbp
import numpy as np

dates = np.arange('2004-02', '2005-05', dtype='datetime64[D]')
np.random.shuffle(dates)
values = np.random.randint(40, 200, len(dates))

years = np.array(dates, dtype='datetime64[Y]')
months = np.array(dates, dtype='datetime64[M]')
weeks = np.array(dates, dtype='datetime64[W]')

from grouping import group_by

bins = np.linspace(40, 200, 17)
for m, g in zip(*group_by(months)(values)):
    print(m)
    print(np.histogram(g, bins=bins)[0])
Alternatively, you could take a look at the pandas package, which probably has an elegant solution to this problem as well.
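The pandas route could be a sketch along these lines (using the same dates and values as above; the bin edges and grouping by calendar month are assumptions matching the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(dates), "value": values})
df["bin"] = pd.cut(df["value"], bins=np.linspace(40, 200, 17), include_lowest=True)
# one histogram row per calendar month
monthly_hist = (df.groupby([df["date"].dt.to_period("M"), "bin"])
                  .size()
                  .unstack(fill_value=0))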

What is the fastest way to sample slices of numpy arrays?

I have a 3D (time, X, Y) numpy array containing 6-hourly time series for a few years (say 5). I would like to create a sampled time series containing one instance of each calendar day, randomly taken from the available records (5 possibilities per day), as follows:
Jan 01: 2006
Jan 02: 2011
Jan 03: 2009
...
this means I need to take 4 values from 01/01/2006, 4 values from 02/01/2011, etc.
I have a working version which works as follows:
Reshape the input array to add a "year" dimension (Time, Year, X, Y)
Create a 365 values array of randomly generated integers between 0 and 4
Use np.repeat and array of integers to extract only the relevant values:
Example:
sampledValues = Variable[np.arange(numberOfDays * ValuesPerDays), sampledYears.repeat(ValuesPerDays),:,:]
This seems to work, but I was wondering if this is the best/fastest approach to solve my problem? Speed is important as I am doing this in a loop, and I would benefit from testing as many cases as possible.
Am I doing this right?
Thanks
EDIT
I forgot to mention that I filtered the input dataset to remove the 29th of February in leap years.
Basically, the aim of this operation is to find a 365-day sample that matches the long-term time series well in terms of mean etc. If the sampled time series passes my quality test, I want to export it and start again.
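For reference, here is a self-contained toy version of the reshape-and-sample steps described above, with made-up sizes (5 years of 6-hourly data on a small 3x3 grid, 29 February already removed):
import numpy as np

n_years, n_days, per_day = 5, 365, 4
data = np.random.rand(n_years * n_days * per_day, 3, 3)       # (time, X, Y)

# add a "year" dimension: (time-of-year, year, X, Y)
stacked = data.reshape(n_years, n_days * per_day, 3, 3).swapaxes(0, 1)

# one randomly chosen year per calendar day, repeated for the 4 daily values
sampled_years = np.random.randint(0, n_years, size=n_days)
sampled = stacked[np.arange(n_days * per_day),
                  sampled_years.repeat(per_day), :, :]        # (1460, 3, 3)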
The year 2008 was 366 days long, so don't reshape.
Have a look at scikits.timeseries:
import scikits.timeseries as ts
start_date = ts.Date('H', '2006-01-01 00:00')
end_date = ts.Date('H', '2010-12-31 18:00')
arr3d = ... # your 3D array [time, X, Y]
dates = ts.date_array(start_date=start_date, end_date=end_date, freq='H')[::6]
t = ts.time_series(arr3d, dates=dates)
# just make sure arr3d.shape[0] == len(dates) !
Now you can access the t data with day/month/year objects:
t[np.logical_and(t.day == 1, t.month == 1)]
so for example:
for day_of_year in xrange(1, 366):
    year = np.random.randint(2006, 2011)
    t[np.logical_and(t.day_of_year == day_of_year, t.year == year)]
    # returns a [4, X, Y] array with data from that day
Play with the attributes of t to make it work with leap years too.
I don't see a real need to reshape the array, since you can embed the year-size information in your sampling process, and leave the array with its original shape.
For example, you can generate a random offset (from 0 to 365), and pick the slice with index, say, n*365 + offset.
Anyway, I don't think your question is complete, because I didn't quite understand what you need to do, or why.
