efficient, fast numpy histograms - python

I have a 2D numpy array consisting of ca. 15'000'000 datapoints. Each datapoint has a timestamp and an integer value (between 40 and 200). I need to create histograms of the datapoint distribution (16 bins: 40-49, 50-59, etc.), broken down by year, by month within the current year, by week within the current year, and by day within the current month.
Now, I wonder what might be the most efficient way to accomplish this. Given the size of the array, performance is a major consideration. I am considering nested "for" loops, breaking down the arrays by year, by month, etc. But I have read that numpy arrays are highly memory-efficient and have all kinds of tricks up their sleeve for fast processing, so I am wondering whether there is a faster way to do this. As you may have realized, I am an amateur programmer (a molecular biologist in "real life"), and my questions are probably rather naïve.

First, fill in your 16 bins without considering date at all.
Then, sort the elements within each bin by date.
Now, you can use binary search to efficiently locate a given year/month/week within each bin.
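A minimal sketch of that approach (assuming the two columns are already split into 1-D arrays timestamps and values; t0 and t1 are hypothetical period bounds in the same units as timestamps):
import numpy as np

bin_ids = np.clip((values - 40) // 10, 0, 15)   # 16 value bins; 200 joins the last
order = np.lexsort((timestamps, bin_ids))       # sort by bin, then by time within bin
sorted_ts = timestamps[order]
bin_starts = np.searchsorted(bin_ids[order], np.arange(17))   # start of each bin's run

# e.g. how many bin-3 datapoints fall into the period [t0, t1)
lo, hi = bin_starts[3], bin_starts[4]
count = (np.searchsorted(sorted_ts[lo:hi], t1)
         - np.searchsorted(sorted_ts[lo:hi], t0))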

There is a function in numpy for exactly this: numpy.bincount. It is blazingly fast; so fast that you can create a bin for each integer value (161 bins) and each day (maybe 30,000 different days?), resulting in a few million bins.
The procedure:
calculate an integer index for each datapoint (e.g. 17 x number of days since the first day in the file + (value - 40)//10)
run np.bincount
reshape to the correct shape (number of days, 17)
Now you have the binned data, which can then be clumped into whatever bins are needed in the time dimension.
Without knowing the form of your input data, the index calculation could look something like this:
import numpy as np

# let us assume we have the data as:
# timestamps: 64-bit integer (seconds since something)
# values: 8-bit unsigned integer with integers between 40 and 200

# find the first day in the sample (86400 seconds per day)
first_day = np.min(timestamps) // 86400
# we intend to do this but fast:
indices = (timestamps // 86400 - first_day) * 17 + (values - 40) // 10
# get the bincount vector
b = np.bincount(indices)
# calculate the number of days in the sample (rounded up to whole days)
no_days = (len(b) + 16) // 17
# reshape b so that each row is one day and each column one value bin
b.resize((no_days, 17))
It should be noted that the first and last days in b depend on the data. In testing, most of the time is spent calculating the indices (around 400 ms on an i7 processor). If that needs to be reduced, it can be brought down to approximately 100 ms with the numexpr module. However, the actual implementation depends heavily on the form of the timestamps; some are faster to compute, some slower.
That said, I doubt any other binning method will be faster if the data is needed down to the daily level.
I did not quite understand from your question whether you wanted separate views on the data (one by year, one by week, etc.) or some other binning; in any case, that boils down to summing the relevant rows together.
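For example, the daily rows can be summed into calendar months like this (a sketch, assuming the timestamps are seconds since the Unix epoch, so that first_day is a day offset from 1970-01-01):
# label each daily row of b with its calendar month, then sum matching rows
days = (np.arange(no_days) + first_day).astype('datetime64[D]')
months = days.astype('datetime64[M]')
uniq, inv = np.unique(months, return_inverse=True)
monthly = np.zeros((len(uniq), 17), dtype=b.dtype)
np.add.at(monthly, inv, b)   # monthly[i] holds the summed bins of month uniq[i]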

Here is a solution, employing the group_by functionality found in the link below:
http://pastebin.com/c5WLWPbp
import numpy as np

dates = np.arange('2004-02', '2005-05', dtype='datetime64[D]')
np.random.shuffle(dates)
values = np.random.randint(40, 200, len(dates))

years = np.array(dates, dtype='datetime64[Y]')
months = np.array(dates, dtype='datetime64[M]')
weeks = np.array(dates, dtype='datetime64[W]')

from grouping import group_by

bins = np.linspace(40, 200, 17)   # 16 value bins
for m, g in zip(*group_by(months)(values)):
    print(m)
    print(np.histogram(g, bins=bins)[0])
Alternatively, you could take a look at the pandas package, which probably has an elegant solution to this problem as well.
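For instance, reusing the dates and values arrays from above, a pandas sketch might look like this (illustrative only, not a tested solution):
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(dates), 'value': values})
df['bin'] = pd.cut(df['value'], bins=np.linspace(40, 200, 17))   # 16 value bins

# one histogram row per month; change the freq for yearly/weekly/daily views
monthly = (df.groupby([pd.Grouper(key='date', freq='MS'), 'bin'])
             .size()
             .unstack(fill_value=0))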

Related

matrix vs. list - switching from Matlab to Python

Coming from a Matlab background, where everything is a matrix/vector, it was very easy to loop through a given data set and build a matrix successively. Since the final object was a matrix, it was also very easy to extract specific elements of the matrix. I'm finding it rather problematic in Python. I've reproduced the code here to explain where I am getting stuck.
The original data is just a time series with a month and a price. The goal is to simulate select subsets of these prices. The loop starts by collecting all months into one set, and then drops one month in each successive loop. For 12 months, I will have (n^2 - n)/2 + n = 78 columns in total in this example. To be clear, n is the total number of time periods; 12 in this data set. The rows of the matrix will be the Z scores sampled from the standard normal variable; the goal is to simulate all 78 prices in one go in a matrix. The number of Z scores is determined by the variable num_terminal_values, currently set to 5 just to keep things simple and easy to visualize at this point.
Here's a link to a Google Sheet with the original matrix. The code below may not work from the Google Sheet; the sheet is only intended to show what the original data looks like. My steps (and Python code) are as follows:
import datetime as dt
import numpy as np
import numpy.random as npr
import pandas as pd

# 1: read the data
dfCrv = pd.read_excel(xl, sheet_name='upload', usecols=range(0, 2)).dropna(axis=0)

# 2: create the looper variables and then loop through the data to build a matrix.
# The rows in the matrix are Z values sampled from the standard normal (this is the
# variable num_terminal_values). The columns refer to each individual simulation month.
lst_zUCorr = []
num_terminal_values = 5
as_of = dt.datetime(2020, 12, 1)
max_months = dfCrv.shape[0]
sim_months = pd.date_range(dfCrv['term'].iloc[0], dfCrv['term'].iloc[-1], freq='MS')
end_month = dfCrv['term'].iloc[-1]
dfCrv = dfCrv.set_index('term', drop=False)

for runNum in range(max_months):
    sim_month = dfCrv['term'].iloc[runNum]
    ttm = (sim_month - as_of).days / 365
    num_months = (end_month.year - sim_month.year) * 12 + (end_month.month - sim_month.month) + 1
    zUCorr = npr.standard_normal(size=(num_terminal_values, num_months))
    lst_zUCorr.append(zUCorr)

# investigate the objects
lst_zUCorr
z = np.hstack(lst_zUCorr)
z
So far, everything works fine. However, I don't know how to transform the object lst_zUCorr into a simple matrix. I've tried hstack etc., but the result still doesn't look like a matrix to me.
The next set of operations requires the data in simple matrix form, but what I'm getting here isn't a matrix.
Key point/question - the final 5x78 matrix in Matlab can be used to do more operations. Is there a way to convert the equivalent Python object into a 5x78 matrix, or will I now need to do more coding to access specific subsets of the Python objects?
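For what it's worth, np.hstack on a list of (5, k) arrays should already return a plain 2-D ndarray, which is the Python analogue of a Matlab matrix; a quick check along these lines (a sketch reusing the names above) would confirm it:
import numpy as np

z = np.hstack(lst_zUCorr)   # concatenates the (5, k) blocks column-wise
print(type(z), z.shape)     # expected: <class 'numpy.ndarray'> (5, 78)
first_block = z[:, 0:12]    # slice columns much as you would in Matlab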

Taking the mean of every n elements in an array and converting MATLAB code to Python

I am attempting to convert a MATLAB program to Python and have run into a snag with a certain loop: I have a 5868x3500 matrix comprising 5868 daily observations of the return-to-volume ratio of 3500 stocks. This data is used to produce a measure of market liquidity by taking monthly averages of the ratio of each stock's return over its volume. I have a 5868x1 vector called Dummymonth which assigns an integer to each month from 1 to 270, with ~22 trading days per month (1,1,1,1,1,1,1,1,1,1,1... 2,2,2,2,2,2... 270,270,270).
The loop I'm stuck on needs to convert the 5868x3500 matrix into a 270x3500 matrix by taking the monthly average according to the Dummymonth values (i.e. basically taking the average of every ~22 rows).
I've tried converting the code as cleanly as possible (substituting MATLAB's find() function with Python's np.argwhere()), but I am relatively new to Python (and MATLAB, really), so the problems with the code are not immediately obvious to me.
Here is the section of MATLAB code I am trying to emulate:
numberofmonth = Dummymonth(size(Ret, 1));
i = 1;
for di = 1:numberofmonth
    v = find(Dummymonth == di);
    for j = 1:size(Ret, 2)
        Amihud2(i, j) = nanmean(Amihud1(v, j));
    end
    i = i + 1;
end
And here is what I have in Python:
import numpy as np

Amihud2 = np.empty((270, len(Amihud1)))
for month_num in range(0, 270):
    v = np.argwhere(dummy == month_num)
    for i in range(1, len(Amihud1)):
        for j in range(1, len(Amihud1[0])):
            Amihud2[i][j] = np.mean(Amihud1[v][j])
The errors I am usually seeing are "index out of bounds errors".
I think one of the errors is related to Python's 0 indexing. If you loop over something and start at 1, you miss the first (index 0) values. Here is one solution (there are many):
import numpy as np

# Create the dummy index (months 1..270, 22 days each)
dummy = np.array([np.repeat(i, 22) for i in np.arange(270) + 1]).flatten()
# Make a dataset for the example
dat = np.random.random((len(dummy), 3500))
# Calculate the average per month
dat2 = np.empty((270, 3500))
i = -1
for m in np.unique(dummy):
    i = i + 1
    dat2[i, :] = dat[dummy == m].mean(axis=0)
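If the explicit loop ever becomes a bottleneck, the same per-month means can be computed without any Python-level loop (a sketch reusing the dummy and dat names above; note it takes a plain mean, not MATLAB's nanmean):
import numpy as np

counts = np.bincount(dummy - 1)        # trading days in each month (here 22)
sums = np.zeros((270, dat.shape[1]))
np.add.at(sums, dummy - 1, dat)        # add each day's row to its month's sum
dat2 = sums / counts[:, None]          # monthly means, shape (270, 3500)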

How to identify levels of "plateaus" in a pandas series of float values?

I have a pandas dataframe df1 that contains, amongst others, a series of time measurements (duration n of experiment x on sample y; in seconds).
In theory, every duration n is an integer multiple of the shortest duration within the series. Note that the shortest possible duration varies across different samples.
In reality, the time measurements are only approximations. When sorting the durations by length in seconds and plotting the result, I get a step-like series of plateaus.
I want to add a new column and assign an integer level to every measurement. How can I determine plateaus 1-3 in such a plot?
I am interested in a scalable solution and hence can't simply divide by the smallest number in the series, since I will be facing thousands of samples in the future.
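One possible approach (a sketch, not from the original thread): sort the durations, then start a new plateau level wherever the gap between two consecutive values exceeds a relative tolerance. The column name and the tolerance are hypothetical and would need tuning per sample:
import pandas as pd

def label_plateaus(durations: pd.Series, rel_tol: float = 0.05) -> pd.Series:
    s = durations.sort_values()
    new_level = s.diff() > s * rel_tol   # a large jump starts the next plateau
    return (new_level.cumsum() + 1).reindex(durations.index)

# df1['level'] = label_plateaus(df1['duration'])   # 'duration' is a placeholder name
Since the shortest duration varies across samples, the function would be applied per sample (e.g. inside a groupby on the sample column).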

Difference between large numpy arrays containing time values

I have ten (1000,1000) numpy arrays. Each array element contains a float representing the hour of the day, e.g. 14.0 = 2pm and 15.75 = 3:45pm.
I want to find the maximum difference between these arrays. The result should be a single (1000,1000) numpy array containing, for each array element, the maximum difference between the ten arrays. At the moment I have the following, which seems to work fine:
import numpy as np

# element-wise extremes across the arrays; renamed so the built-ins max/min are not shadowed
max_vals = np.maximum.reduce([data1, data2, data3, data4, data5])
min_vals = np.minimum.reduce([data1, data2, data3, data4, data5])
diff = max_vals - min_vals
However, it results in the difference between 11pm and 1am of 22 hours. I need the difference to be 2 hours. I imagine I need to use datetime.time somehow, but I don't know how to get datetime to play nicely with numpy arrays.
Edit: The times refer to the average time of day at which a certain event occurs, so they are not associated with a specific date. The difference between two times could therefore be correctly interpreted as either 22 hours or 2 hours. However, I will always want to take the minimum of those two possible interpretations.
You can take the difference between two cyclic values by centering one value at the middle of the cycle (12.0). Rotate the other values by the same amount to maintain their relative differences, and take the modulus of the adjusted values by the cycle duration to keep everything within bounds. The times are now adjusted so that the maximum possible distance stays within +/- half the cycle duration (+/-12 hours).
e.g.,
adjustment = arr1 - 12.0             # shift that centers arr1 on the cycle midpoint
arr2 = (arr2 - adjustment) % 24.0    # rotate arr2 by the same shift, wrapped into [0, 24)
diff = 12.0 - arr2                   # or abs(12.0 - arr2) if you prefer
If you're not using the absolute value, you'll need to play with the sign depending on which time you want to be considered 'first'.
Let's say you have the times 11pm and 1am, and you want to find the minimum distance between them.
1am -> 1
11pm -> 23
Then you have either:
23 - 1 = 22
Or,
24 - (23 - 1) % 24 = 2
Then the distance can be thought of as:
import numpy as np

def dist(x, y):
    # minimum cyclic distance; np.minimum works element-wise on arrays as well
    d = np.abs(x - y)
    return np.minimum(d, 24 - d % 24)
Now we need to take dist and apply it to every combination. If I recall correctly, there is a more numpy/scipy-oriented function to do this, but the concept is more or less the same:
from itertools import combinations

data = [data1, data2, data3, data4, data5]
comb_list = list(combinations(data, 2))
dists = [dist(x, y) for x, y in comb_list]
max_dist = np.maximum.reduce(dists)   # element-wise maximum over all pairs
If you have an array diff of time differences ranging between 0 and 24 hours, you can correct the wrongly calculated values as follows:
diff[diff > 12] = 24. - diff[diff > 12]
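A quick illustration on two made-up cells (11pm vs 1am, and 6am vs 8am):
import numpy as np

a = np.array([23.0, 6.0])
b = np.array([1.0, 8.0])
diff = np.abs(a - b)                       # [22., 2.]
diff[diff > 12] = 24. - diff[diff > 12]    # [2., 2.]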

What is the fastest way to sample slices of numpy arrays?

I have a 3D (time, X, Y) numpy array containing 6-hourly time series for a few years (say 5). I would like to create a sampled time series containing one instance of each calendar day, randomly taken from the available records (5 possibilities per day), as follows.
Jan 01: 2006
Jan 02: 2011
Jan 03: 2009
...
This means I need to take 4 values from 01/01/2006, 4 values from 02/01/2011, etc.
I have a working version which works as follows:
Reshape the input array to add a "year" dimension (Time, Year, X, Y)
Create a 365-value array of randomly generated integers between 0 and 4
Use np.repeat and the array of integers to extract only the relevant values:
Example:
sampledValues = Variable[np.arange(numberOfDays * ValuesPerDays), sampledYears.repeat(ValuesPerDays),:,:]
This seems to work, but I was wondering whether this is the best/fastest approach to solving my problem. Speed is important as I am doing this in a loop, and I would benefit from testing as many cases as possible.
Am I doing this right?
Thanks
EDIT
I forgot to mention that I filtered the input dataset to remove the 29th of February in leap years.
Basically, the aim of the operation is to find a 365-day sample that matches the long-term time series well in terms of mean etc. If the sampled time series passes my quality test, I want to export it and start again.
The year 2008 was 366 days long, so don't reshape.
Have a look at scikits.timeseries:
import scikits.timeseries as ts

start_date = ts.Date('H', '2006-01-01 00:00')
end_date = ts.Date('H', '2010-12-31 18:00')
arr3d = ...  # your 3D array [time, X, Y]
dates = ts.date_array(start_date=start_date, end_date=end_date, freq='H')[::6]
t = ts.time_series(arr3d, dates=dates)
# just make sure arr3d.shape[0] == len(dates)!
Now you can access the t data with day/month/year objects:
t[np.logical_and(t.day == 1, t.month == 1)]
so for example:
for day_of_year in range(1, 366):
    year = np.random.randint(2006, 2011)
    t[np.logical_and(t.day_of_year == day_of_year, t.year == year)]
    # returns a [4, X, Y] array with data from that day
Play with the attributes of t to make it work with leap years too.
I don't see a real need to reshape the array, since you can embed the year-size information in your sampling process, and leave the array with its original shape.
For example, you can generate a random offset (from 0 to 365), and pick the slice with index, say, n*365 + offset.
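A sketch of that indexing idea (assuming the time axis is chronological with leap days removed, i.e. 5 years x 365 days x 4 values per day; arr3d stands for the original (time, X, Y) array and the other names are illustrative):
import numpy as np

n_years, n_days, per_day = 5, 365, 4
rand_year = np.random.randint(0, n_years, size=n_days)        # one year per calendar day
day_start = (rand_year * n_days + np.arange(n_days)) * per_day
time_idx = (day_start[:, None] + np.arange(per_day)).ravel()  # the 4 slots of each day
sampled = arr3d[time_idx]                                     # (365*4, X, Y) sampled series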
Anyway, I don't think your question is complete, because I didn't quite understand what you need to do, or why.
