What is the fastest way to sample slices of numpy arrays? - python

I have a 3D (time, X, Y) numpy array containing 6-hourly time series for a few years (say 5). I would like to create a sampled time series containing 1 instance of each calendar day, randomly taken from the available records (5 possibilities per day), as follows:
Jan 01: 2006
Jan 02: 2011
Jan 03: 2009
...
this means I need to take 4 values from 01/01/2006, 4 values from 02/01/2011, etc.
I have a working version which works as follows:
Reshape the input array to add a "year" dimension (Time, Year, X, Y)
Create a 365 values array of randomly generated integers between 0 and 4
Use np.repeat and array of integers to extract only the relevant values:
Example:
sampledValues = Variable[np.arange(numberOfDays * ValuesPerDays), sampledYears.repeat(ValuesPerDays),:,:]
This seems to work, but I was wondering if this is the best/fastest approach to solve my problem. Speed is important as I am doing this in a loop, and would benefit from testing as many cases as possible.
Am I doing this right?
Thanks
EDIT
I forgot to mention that I filtered the input dataset to remove the 29th of February for leap years.
Basically, the aim of that operation is to find a 365-day sample that matches the long-term time series well in terms of mean etc. If the sampled time series passes my quality test, I want to export it and start again.

The year 2008 was 366 days long, so don't reshape.
Have a look at scikits.timeseries:
import scikits.timeseries as ts
start_date = ts.Date('H', '2006-01-01 00:00')
end_date = ts.Date('H', '2010-12-31 18:00')
arr3d = ... # your 3D array [time, X, Y]
dates = ts.date_array(start_date=start_date, end_date=end_date, freq='H')[::6]
t = ts.time_series(arr3d, dates=dates)
# just make sure arr3d.shape[0] == len(dates) !
Now you can access the t data with day/month/year objects:
t[np.logical_and(t.day == 1, t.month == 1)]
so for example:
for day_of_year in xrange(1, 366):
    year = np.random.randint(2006, 2011)
    t[np.logical_and(t.day_of_year == day_of_year, t.year == year)]
    # returns a [4, X, Y] array with data from that day
Play with the attributes of t to make it work with leap years too.

I don't see a real need to reshape the array, since you can embed the year-size information in your sampling process, and leave the array with its original shape.
For example, you can generate a random year index n (from 0 to 4) for each day offset (from 0 to 364), and pick the slice with index, say, n*365 + offset (scaled by the number of records per day).
Anyway, I don't think your question is complete, because I didn't quite understand what you need to do, or why.
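The offset idea above can be sketched without any reshape. This is only a minimal sketch under assumptions not stated in the question: leap days are already removed, the time axis is ordered year by year, and there are exactly 4 records per day. The names arr, n_years and per_day are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_years, n_days, per_day = 5, 365, 4
nx, ny = 3, 2

# Hypothetical input: (time, X, Y), leap days removed, years stored back to back.
arr = rng.random((n_years * n_days * per_day, nx, ny))

# One random year index (0..4) for each calendar day (0..364).
years = rng.integers(0, n_years, size=n_days)

# Index of the first record of each sampled day on the flat time axis.
first = (years * n_days + np.arange(n_days)) * per_day

# Expand each day to its 4 consecutive records and fancy-index once.
idx = (first[:, None] + np.arange(per_day)).ravel()
sample = arr[idx]  # shape (365 * 4, X, Y)
```

A single fancy-indexing call like this copies only the sampled records, so it should be cheap to repeat inside a loop, but it is worth timing against the reshape version on the real data.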

Related

Access data by month number in 3D xarray

I have data arrays (361x361) for Jan, Feb, March, Apr, Oct, Nov and Dec for a given year.
So far I've been storing them in individual netcdfs for every month in the year (e.g. 03.nc, 10.nc)
I'd like to combine all months into one netcdf, so that I can do something like:
march_data = data.sel(month='03')
or alternatively data.sel(month=3)
So far I've only been able to stack the monthly data in a 361x361x7 array and it's unhelpfully indexed so that to get March data you need to do data[:,:,2] and to get October it's data[:,:,4]. Clearly 2 & 4 do not intuitively correspond to the months of March and October. This is in part because python is indexed from zero and in part because I'm missing the summer months. I could put nan fields in for the missing months, but that wouldn't solve the index-0 issue.
My attempt so far:
data = xarray.Dataset(
    data_vars={'ice_type': (['x', 'y', 'time'], year_array)},
    coords={'lon': (['x', 'y'], lon_target),
            'lat': (['x', 'y'], lat_target),
            'month_number': (['time'], month_int)})
Here year_array is a 361x361x7 numpy array, and month_int is a list that maps the third index of year_array to the month number: [1,2,3,4,10,11,12].
When I try to get Oct data with oct = data.sel(month_number=10) it throws an error.
On a side note, I'm aware that there's possibly a solution to be found here, but to be honest I don't understand how it works. My confusion is mostly based around how they use 'time' both as a dictionary key and list of times at the same time.
I think I've written a helper function to do something just like that:
def combine_new_ds_dim(ds_dict, new_dim_name):
    """
    Combines a dictionary of datasets along a new dimension using dictionary keys
    as the new coordinates.

    Parameters
    ----------
    ds_dict : dict
        Dictionary of xarray Datasets or dataArrays
    new_dim_name : str
        The name of the newly created dimension

    Returns
    -------
    xarray.Dataset
        Merged Dataset or DataArray

    Raises
    ------
    ValueError
        If the values of the input dictionary were of an unrecognized type
    """
    expanded_dss = []
    for k, v in ds_dict.items():
        # give each item a length-1 dimension labelled with its dictionary key
        expanded_dss.append(v.expand_dims(new_dim_name))
        expanded_dss[-1][new_dim_name] = [k]
    new_ds = xr.concat(expanded_dss, new_dim_name)
    return new_ds
If you have all of the data in individual netcdfs then you should be able to import them into individual dataArray's. Assuming you've done that, you could then do
month_das = {
    1: january_da,
    2: february_da,
    ...
    12: december_da
}
year_data = combine_new_ds_dim(month_das, 'month')
which would be the concatenation of all of the data along the new dimension month with the desired coordinates. I think the main loop of the function is easy enough to separate if you want to use that alone.
EDIT:
For anyone looking at this in the future, there's a much easier way of doing this with builtin xarray functions. You can just concatenate along a new dimension
year_data = xr.concat([january_da, february_da, ..., december_da], dim="month")
which will create a new dataArray with the constituent arrays concatenated along a new dimension, but without coordinates on that dimension. To add coordinates,
year_data["month"] = [1, 2, ..., 12]
at which point year_data will be concatenated along the new dimension "month" and will have the desired coordinates along that dimension.
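For the stacked 361x361x7 array from the question, a possible sketch is to name the third dimension "month" and attach the month numbers as that dimension's own coordinate, so that .sel can look them up. The small array size and the variable names here are placeholders, not the asker's real data.

```python
import numpy as np
import xarray as xr

# Placeholder for the stacked monthly fields (the question uses 361 x 361 x 7).
year_array = np.random.rand(4, 5, 7)
month_int = [1, 2, 3, 4, 10, 11, 12]

# The coordinate has the same name as the dimension, so it becomes an
# index that .sel() can search by value.
data = xr.DataArray(
    year_array,
    dims=["x", "y", "month"],
    coords={"month": month_int},
    name="ice_type",
)

october = data.sel(month=10)  # same values as year_array[:, :, 4]
```

In the question's attempt the coordinate was called 'month_number' while the dimension was called 'time', which is a likely reason the label-based selection failed.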

Aggregate measurements per time period

I have a 6 x n matrix with the data: year, month, day, hour, minute, use.
I have to make a new matrix containing the aggregated measurements for use, grouped by the value 'hour', so that all rows recorded within the same hour are combined.
So every time the hour changes, the code needs to know that a new period starts.
I just tried something, but I don't know how to solve this.
Thank you. This is what I tried + a test
def groupby_measurements(data):
    count = -1
    for i in range(9):
        array = np.split(data, np.where(data[i,3] != data[i+1,3])[0][:1])
    return array
print(groupby_measurements(np.array([[2006,2,11,1,1,55],
[2006,2,11,1,11,79],
[2006,2,11,1,32,2],
[2006,2,11,1,41,66],
[2006,2,11,1,51,76],
[2006,2,11,10,2,89],
[2006,2,11,10,3,33],
[2006,2,11,14,2,22],
[2006,2,11,14,5,34]])))
In this case I tried, I expect the output to be:
np.array([[2006,2,11,1,1,55],
[2006,2,11,1,11,79],
[2006,2,11,1,32,2],
[2006,2,11,1,41,66],
[2006,2,11,1,51,76]]),
np.array([[2006,2,11,10,2,89],
[2006,2,11,10,3,33]]),
np.array([[2006,2,11,14,2,22],
[2006,2,11,14,5,34]])
The final output should be:
np.array([2006,2,11,1,0,278]),
np.array([2006,2,11,10,0,122]),
np.array([2006,2,11,14,0,56])
(the sum of use in the 3 hour periods)
I would recommend using pandas DataFrames, and then using groupby combined with sum:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.array(
[[2006,2,11,1,1,55],
[2006,2,11,1,11,79],
[2006,2,11,1,32,2],
[2006,2,11,1,41,66],
[2006,2,11,1,51,76],
[2006,2,11,10,2,89],
[2006,2,11,10,3,33],
[2006,2,11,14,2,22],
[2006,2,11,14,5,34]]),
columns=['year','month','day','hour','minute','use'])
aggregated = data.groupby(['year','month','day','hour'])['use'].sum()
# you can also use .agg and pass which aggregation function you want as a string.
aggregated = data.groupby(['year','month','day','hour'])['use'].agg('sum')
year  month  day  hour
2006  2      11   1       278
                  10      122
                  14       56
aggregated is now a pandas Series. If you want it as an array, just do
aggregated.values
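If pandas is not an option, the same grouping can be sketched in plain numpy. This assumes the column order from the question (year, month, day, hour, minute, use) and produces rows in the layout of the expected "final output", with the minute set to 0.

```python
import numpy as np

data = np.array([[2006, 2, 11, 1, 1, 55],
                 [2006, 2, 11, 1, 11, 79],
                 [2006, 2, 11, 1, 32, 2],
                 [2006, 2, 11, 1, 41, 66],
                 [2006, 2, 11, 1, 51, 76],
                 [2006, 2, 11, 10, 2, 89],
                 [2006, 2, 11, 10, 3, 33],
                 [2006, 2, 11, 14, 2, 22],
                 [2006, 2, 11, 14, 5, 34]])

# Unique (year, month, day, hour) rows, plus the group index of every input row.
keys, inverse = np.unique(data[:, :4], axis=0, return_inverse=True)
inverse = inverse.ravel()  # keeps the index 1-D across NumPy versions

# Sum the 'use' column within each group.
sums = np.bincount(inverse, weights=data[:, 5]).astype(data.dtype)

# Rows of [year, month, day, hour, 0, total use].
result = np.column_stack([keys, np.zeros(len(keys), dtype=data.dtype), sums])
print(result)
```

Note that np.unique sorts the groups, so the rows come out ordered by date and hour even if the input is not.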

Convert a 4D array to 3D in Python by merging the months and years columns

I have a 4D array that has two spatial directions, a month column and a year column. It gives a scalar value at each spatial point and for each month. I want to reshape this array to be 3D so that instead of the value being defined as x, y, month, year, it is just defined as x, y, month, where now the month column runs from 1-36 say with no year column instead of 1-12 with a year column of 1-3. How would I do this in Python? Thanks!
The basic approach is to code the new column something like:
new_month = old_month + 12*(old_year-1)
This translates your 3-year scale into a continuum of months numbered 1-36. I can't show you how to code this, because (1) you haven't given us reference code, so I have little idea how your 4D array is structured; (2) As I hope you've read in the help documentation, we're not a coding service.
Add a new column with values of (year-1)*12+month, then discard or ignore your year and month columns. Details depend on exactly how your data is currently structured; if it is a numpy array, this would be 2 lines of code!
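For a numpy array, the two lines could look roughly like the sketch below. The axis order (x, y, month, year) is an assumption, since the question does not show the array; with a different order the transpose arguments change accordingly.

```python
import numpy as np

nx, ny, n_years = 2, 3, 3

# Hypothetical 4D array with axes (x, y, month, year).
arr4d = np.arange(nx * ny * 12 * n_years).reshape(nx, ny, 12, n_years)

# Move year in front of month, then flatten the two axes into one.
# The new zero-based month index is year * 12 + month.
arr3d = arr4d.transpose(0, 1, 3, 2).reshape(nx, ny, 12 * n_years)

# Month 4 of year 1 (both zero-based) ends up at position 1 * 12 + 4.
print(arr3d[1, 2, 16] == arr4d[1, 2, 4, 1])
```

The transpose matters: reshaping (x, y, month, year) directly would interleave the years within each month instead of running the months consecutively.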

How can I take a list and add elements in columns in intervals?

Here's my problem:
This is for an introductory Python course, however I just cannot wrap my head around how to do this without using loops. I have a list of lists, with each list containing 12 float values corresponding to sunshine hours in a month. Each list of 12 months corresponds to a year (1929 - 2009).
Here is an example of the list:
data = [
[43.8, 60.5, 190.2, 144.7, 240.9, 210.3, 219.7, 176.3, 199.1, 109.2, 78.7, 67.0],
[49.9, 54.3, 109.7, 102.0, 134.5, 211.2, 174.1, 207.5, 108.2, 113.5, 68.7, 23.3],...]
Now, the task is to calculate mean sunshine hours per day in the winter. This is to be done by the following algorithm: Decade 1930-1939 would equal the hours from (Dec 1929 + Jan 1930 + Dec 1930 + Jan 1931...+ Jan 1939) / (20 numbers * 30 days in a month) = Mean winter sunshine hours per day.
Now I can do this using for loops, but the task is to do this using NO loops and instead using Numpy and array manipulation.
Here's things that I have considered:
-Splitting the data into two arrays (one with the January column and one with the December column)
-Adding those (though remember, there's an offset because Jan 1929 is unused as well as Dec 2009)
-Splitting the addition array into decades and averaging them.
However I'm very lost on how to go about this. So far I've split the data list into January and December arrays, but now I'm stuck.
Update: I've made an array with all the correct "winter" monthly hours (Dec+Jan) and now I just have to figure out how to find the mean of groups of 10 of them.
dataarray = np.array(data)
December = dataarray[:,11]
January = dataarray[:,0]
JanDec = np.zeros(80)
JanDec[:] = January[1:] + December[:-1]
Any help is appreciated. Thanks!
To answer your updated question, to group the data into decades you can reshape your array and take the mean along the correct axis.
This assumes that the number of years you have is divisible by 10 (which it appears to be since you have an array of length 80).
So, as a small example, if you wanted to group [3, 2, 5, 3, 2, 1] into chunks of 2, you could write:
>>> a = np.array([3, 2, 5, 3, 2, 1])
>>> a.reshape(-1, 2)
array([[3, 2],
       [5, 3],
       [2, 1]])
This gives you a 2D array - the groups you want to calculate the mean of are the rows. To take the mean across the rows you use mean(axis=1), so you can write:
>>> a.reshape(-1, 2).mean(axis=1)
array([2.5, 4. , 1.5])
Using this idea, you can quickly take the mean across decades in your data.
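Applied to the winter data from the question, the reshape idea might look like the sketch below. The random data array is only a stand-in for the real list of 81 years (1929 - 2009) of monthly sunshine hours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the real data: 81 years x 12 months.
data = rng.uniform(20, 250, size=(81, 12))

december = data[:-1, 11]  # Dec 1929 .. Dec 2008
january = data[1:, 0]     # Jan 1930 .. Jan 2009
winter = january + december  # 80 winters, one per year 1930..2009

# One row per decade; 20 months of 30 days each go into every decade sum.
decade_means = winter.reshape(-1, 10).sum(axis=1) / (20 * 30)
print(decade_means)  # 8 values: 1930-1939, ..., 2000-2009
```

Summing and dividing by 20 * 30 follows the formula given in the task; taking mean(axis=1) instead would divide by 10 winters and would then need a further division by 2 * 30.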
Splitting the array is the right idea, but you can do it more simply by slicing the array directly. I'll let you figure out how you could generalize, but for this case:
arraydata = np.array(data)
winter = arraydata[:, ::11]
# np.mean already divides by the number of months, so only divide by 30 days
average = np.mean(winter) / 30
The slice [:, ::11] tells numpy to form a new array from every 11th column starting at column 0, that is, columns 0 (January) and 11 (December). Note that this averages over all years at once and still includes Jan 1929 and Dec 2009. Equally, you can choose which rows to pull with a similar slice on the first dimension of arraydata and sum them :-)

efficient, fast numpy histograms

I have a 2D numpy array consisting of ca. 15'000'000 datapoints. Each datapoint has a timestamp and an integer value (between 40 and 200). I must create histograms of the datapoint distribution (16 bins: 40-49, 50-59, etc.), sorted by year, by month within the current year, by week within the current year, and by day within the current month.
Now, I wonder what might be the most efficient way to accomplish this. Given the size of the array, performance is a conspicuous consideration. I am considering nested "for" loops, breaking down the arrays by year, by month, etc. But I was reading that numpy arrays are highly memory-efficient and have all kinds of tricks up their sleeve for fast processing. So I was wondering if there is a faster way to do that. As you may have realized, I am an amateur programmer (a molecular biologist in "real life") and my questions are probably rather naïve.
First, fill in your 16 bins without considering date at all.
Then, sort the elements within each bin by date.
Now, you can use binary search to efficiently locate a given year/month/week within each bin.
In order to do this, there is a function in numpy, numpy.bincount. It is blazingly fast. It is so fast that you can create a bin for each integer (161 bins) and day (maybe 30000 different days?) resulting in a few million bins.
The procedure:
calculate an integer index for each data point (e.g. 17 x (day number counted from the first day in the file) + (integer - 40)//10)
run np.bincount
reshape to the correct shape (number of days, 17)
Now you have the binned data which can then be clumped into whatever bins are needed in the time dimension.
Without knowing the form of your input data the integer bin calculation code could be something like this:
import numpy as np

# let us assume we have the data as:
#   timestamps: 64-bit integer (seconds since something)
#   values: 8-bit unsigned integer with integers between 40 and 200
# find the first day in the sample (86400 seconds per day)
first_day = np.min(timestamps) // 86400
# we intend to do this but fast (integer division throughout):
indices = (timestamps // 86400 - first_day) * 17 + ((values - 40) // 10)
# get the bincount vector
b = np.bincount(indices)
# calculate the number of days in the sample
no_days = (len(b) + 16) // 17
# reshape b; resize pads the missing trailing bins with zeros
b.resize((no_days, 17), refcheck=False)
It should be noted that the first and last days in b depend on the data. In testing this most of the time is spent in calculating the indices (around 400 ms with an i7 processor). If that needs to be reduced, it can be done in approximately 100 ms with numexpr module. However, the actual implementation depends really heavily on the form of timestamps; some are faster to calculate, some slower.
However, I doubt if any other binning method will be faster if the data is needed up to the daily level.
I did not quite understand from your question whether you wanted separate views on the data (one by year, one by week, etc.) or some other binning method. In any case, that boils down to summing the relevant rows together.
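Summing the relevant rows together could be sketched as follows for monthly totals. The daily table here is random placeholder data with one row per day and 17 value bins, and the day labels are assumed to be consecutive calendar days.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily histogram: one row per day, 17 value bins per row.
days = np.arange("2004-01-01", "2004-04-01", dtype="datetime64[D]")
daily = rng.integers(0, 50, size=(len(days), 17))

# Label every day with its month, then find the group index of each row.
months = days.astype("datetime64[M]")
labels, inverse = np.unique(months, return_inverse=True)

# Accumulate all rows that share a month label.
monthly = np.zeros((len(labels), daily.shape[1]), dtype=daily.dtype)
np.add.at(monthly, inverse, daily)
```

Swapping "datetime64[M]" for "datetime64[Y]" or "datetime64[W]" gives yearly or weekly totals in the same way; note that datetime64 weeks are counted from the epoch rather than from the start of each year.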
Here is a solution, employing the group_by functionality found in the link below:
http://pastebin.com/c5WLWPbp
import numpy as np
dates = np.arange('2004-02', '2005-05', dtype='datetime64[D]')
np.random.shuffle(dates)
values = np.random.randint(40, 200, len(dates))
years = np.array(dates, dtype='datetime64[Y]')
months = np.array(dates, dtype='datetime64[M]')
weeks = np.array(dates, dtype='datetime64[W]')
from grouping import group_by  # the module from the pastebin link above
bins = np.linspace(40, 200, 17)
# assumes group_by(months)(values) returns (unique keys, grouped values)
for m, g in zip(*group_by(months)(values)):
    print(m)
    print(np.histogram(g, bins=bins)[0])
Alternatively, you could take a look at the pandas package, which probably has an elegant solution to this problem as well.
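A pandas version might look roughly like this. It is only a sketch on synthetic data: the date range, the random values and the column names are all made up, and the 17th bin is there only to hold the value 200.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: one value per day.
dates = pd.date_range("2004-02-01", "2005-04-30", freq="D")
values = rng.integers(40, 201, size=len(dates))

# Integer bin number per data point: 0 for 40-49, 1 for 50-59, ...
df = pd.DataFrame({"month": dates.to_period("M"),
                   "bin": (values - 40) // 10})

# Count the points per (month, bin) and spread the bins into columns.
monthly = (df.groupby(["month", "bin"]).size()
             .unstack(fill_value=0)
             .reindex(columns=range(17), fill_value=0))
print(monthly)
```

Replacing to_period("M") with to_period("Y"), to_period("W") or to_period("D") gives the other groupings with the same code.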
