calculating means of many matrices in numpy - python

I have many csv files which each contain a roughly identical matrix. Each matrix is 11 columns by either 5 or 6 rows. The columns are variables and the rows are test conditions. Some of the matrices do not contain data for the last test condition, which is why some matrices have 5 rows and others have 6.
My application is in Python 2.6 using numpy and scipy.
My question is this:
How can I most efficiently create a summary matrix that contains the means of each cell across all of the identical matrices?
The summary matrix would have the same structure as all of the other matrices, except that the value in each cell in the summary matrix would be the mean of the values stored in the identical cell across all of the other matrices. If one matrix does not contain data for the last test condition, I want to make sure that its contents are not treated as zeros when the averaging is done. In other words, I want the means of all the non-zero values.
Can anyone show me a brief, flexible way of organizing this code so that it does everything I want with as little code as possible, while remaining flexible enough to re-use later with other data structures?
I know how to pull all the csv files in and how to write output. I just don't know the most efficient way to structure the flow of data in the script, including whether to use python arrays or numpy arrays, how to structure the operations, etc.
I have tried coding this in a number of different ways, but they all seem rather code-intensive and inflexible if I later want to use this code for other data structures.

You could use masked arrays. Say N is the number of csv files. You can store all your data in a masked array A, of shape (N,11,6).
import numpy as np
import numpy.ma as ma

A = ma.zeros((N, 11, 6))
A.mask = np.zeros(A.shape, dtype=bool)  # fill the mask with False: nothing is masked
A.mask = (A.data == 0)                  # another way of masking: mask all data equal to zero
A.mask[0, 0, 0] = True                  # mask a single value
A[1, 2, 3] = 12.                        # fill a value, just like a usual array
Then the mean values along the first axis, taking masked values into account, are given by:
A.mean(axis=0)  # the returned shape is (11, 6)
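For concreteness, a minimal sketch of feeding the csv files into such an array (the file names, delimiter, and row/column orientation are all assumptions):

import numpy as np
import numpy.ma as ma

files = ['run1.csv', 'run2.csv']          # your N csv files (names assumed)
A = ma.masked_all((len(files), 11, 6))    # every cell starts out masked
for i, fname in enumerate(files):
    m = np.loadtxt(fname, delimiter=',')  # shape (5 or 6, 11): rows x columns
    A[i, :, :m.shape[0]] = m.T            # assigning unmasks just these cells
summary = A.mean(axis=0)                  # masked cells are ignored per cell

With masked_all, a file that lacks the last test condition simply leaves that row masked, so it never drags the mean toward zero.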

Related

How do I append new data to the DataArray along existing dimension in-place?

I have data grouped into a 3-D DataArray named da with dimensions 'time', 'indicators', and 'coins', using Dask as a backend.
I need to select the data for a particular indicator, calculate a new indicator based on it, and append this newly calculated indicator to da along the indicators dimension under the new indicator's name (let's call it daily_return). In the somewhat simplistic terms of a 2-D analogy, I need to perform something like calculating a new pandas DataFrame column based on its other columns, but in 3-D.
So far I've tried apply_ufunc() with both drop=False (then I retrieve a scalar indicators coordinate on the resulting DataArray) and drop=True (respectively, indicators is dropped), following the corresponding tutorial:
dr_func = lambda today, yesterday: today / yesterday - 1  # assuming for simplicity that yesterday != 0

today_da = da.sel(indicators="price_daily_close_usd", drop=False)  # or drop=True
yesterday_da = da.shift(time=1).sel(indicators="price_daily_close_usd", drop=False)  # or drop=True

dr = xr.apply_ufunc(
    dr_func,
    today_da,
    yesterday_da,
    dask="allowed",
    dask_gufunc_kwargs={"allow_rechunk": True},
    vectorize=True,
)
Obviously, in the drop=True case I cannot concat the da and dr DataArrays, since indicators is not present among dr's coordinates.
In the drop=False case, in turn, I've managed to concat these DataArrays along indicators; however, the resulting indicators coord then contains two identically named coordinate values, both "price_daily_close_usd", while the second of them should be renamed to "daily_return".
I've also tried to extract the needed data from dr through .sel(), but failed due to the absence of an index along the indicators dimension (as far as I understand, it's not possible to set an index here, since this dimension is scalar):
dr.sel(indicators="price_daily_close_usd")  # results in KeyError: "no index found for coordinate 'indicators'"
Moreover, the solution above is not done in-place, i.e. it creates a new combined DataArray instance instead of modifying da, while the latter would be highly preferable.
How can I append new data to da along an existing dimension, ideally in-place?
Loading all the data directly into RAM is hardly possible due to its huge volume, which is why Dask is being used.
I'm also not wedded to the DataArray data structure, and it would be no problem to switch to a Dataset if it has more suitable methods for solving my problem.
Xarray does not support appending in-place. Any change to the shape of your array will need to produce a new array.
If you want to work with a single array and know the size of the final array, you could generate an empty array and assign values based on coordinate labels.
I need to perform something like calculating a new pandas DataFrame column based on its other columns, but in 3-D.
Xarray's Dataset is a better analog to the pandas DataFrame. The Dataset is a dict-like container storing N-D arrays (DataArrays), just like the DataFrame is a dict-like container storing 1-D arrays (Series).
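For illustration, a minimal sketch of the rename-then-concat route (not in-place; it assumes dr still carries the scalar indicators coordinate from drop=False):

import xarray as xr

# Relabel the scalar coordinate, promote it back to a length-1 dimension,
# then concatenate; this allocates a new combined array rather than
# modifying da.
dr = dr.assign_coords(indicators="daily_return").expand_dims("indicators")
da = xr.concat([da, dr], dim="indicators")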

Want to make an array with zeros when data is masked and ones when data is not masked

I have netcdf data that is masked, with dimensions (time, latitude, longitude). I would like to make an array of the same size as the original data, but with zeros where the data is masked and ones where it is not masked. So far I have tried this function:
def find_unmasked_values(data):
    empty = np.ones((len(data), len(data[0]), len(data[0, 0])))
    for k in range(0, len(data[0, 0]), 1):  # third coordinate
        for j in range(0, len(data[0]), 1):  # second coordinate
            for i in range(0, len(data), 1):  # first coordinate
                if ma.is_mask(data[i, j, k]) is True:
                    empty[i, j, k] = 0
    return empty
But this only returns an array of ones and no zeros, even though there are masked values in the data. If you have suggestions on how to make the code more efficient I would also be very happy.
Thanks,
Keep it simple! There is no need for all the manual loops, which make your approach very slow for large data sets. A small example with some other data (where thl is a masked variable):
import netCDF4 as nc4

nc = nc4.Dataset('bomex_qlcore_0000000.nc')
var = nc['default']['thl'][:]

mask_1 = var.mask   # masked=True,  not masked=False
mask_2 = ~var.mask  # masked=False, not masked=True

# What you need:
int_mask = mask_2.astype(int)  # masked=0, not masked=1
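One caveat: if nothing in var happens to be masked, var.mask can be the scalar False instead of a full boolean array. np.ma.getmaskarray always returns a full array, so a safer variant of the same idea is:

import numpy as np

int_mask = (~np.ma.getmaskarray(var)).astype(int)  # masked=0, not masked=1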
p.s.: some other notes:
Instead of len(array), len(array[0]), et cetera, you can directly get the shape of your array with array.shape, which returns a tuple of the array dimensions.
If you want to create a new array with the same dimensions as another one, just use empty = np.ones_like(data) (or np.zeros_like() if you want an array of zeros).
ma.is_mask() already returns a bool; there is no need to compare it with True.
Don't confuse is with ==: Is there a difference between "==" and "is"?

Converting 2D numpy array to 3D array without looping

I have a 2D array of shape (t*40, 6) which I want to convert into a 3D array of shape (t, 40, 5) for an LSTM's input data layer. The desired conversion is described in the figure below. Here, F1...5 are the 5 input features, T1...40 are the time steps for the LSTM, and C1...t are the various training examples. Basically, for each unique "Ct", I want a "T x F" 2D array, and to concatenate all of these along the 3rd dimension. I do not mind losing the value of "Ct" as long as each Ct ends up in a different dimension.
I have the following code to do this by looping over each unique Ct, and appending the "T X F" 2D arrays in 3rd dimension.
import pandas as pd

# load 2d data
data = pd.read_csv('LSTMTrainingData.csv')
trainX = []
# loop over each unique ct and append the 2D subset in the 3rd dimension
for index, ct in enumerate(data.ct.unique()):
    trainX.append(data[data['ct'] == ct].iloc[:, 1:])
However, there are over 1,800,000 such Ct's so this makes it quite slow to loop over each unique Ct. Looking for suggestions on doing this operation faster.
EDIT:
data_3d = data.values.reshape(t, 40, 6)
trainX = data_3d[:, :, 1:]
This is the solution for the original question as posted.
Updating the question with an additional problem: the T1...40 time steps can have at most 40 steps, but a given Ct may have fewer than 40. The remaining slots out of the 40 available can be np.nan.
Since not all Ct have the same length, you have no choice but to rebuild a new block.
But using data[data['ct'] == ct] in a loop can be O(n²), so it's a bad way to do it.
Here is a solution using Panel; cumcount renumbers the lines within each Ct:
import pandas as pd
from numpy.random import randint

t = 5
CFt = randint(0, t, (40 * t, 6)).astype(float)  # 2D data
df = pd.DataFrame(CFt)
df2 = df.set_index([df[0], df.groupby(0).cumcount()]).sort_index()
df3 = df2.to_panel()
This automatically fills missing data with NaN. But it warns:
DeprecationWarning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
So perhaps working with df2 is the recommended way to manage your data.
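Since Panel has been removed entirely in pandas 1.0, here is a hedged sketch of the MultiIndex route the warning suggests, reusing df2 from above (the fixed width of 40 time steps is taken from the question):

import numpy as np
import pandas as pd

# Reindex every Ct group to a full 40 rows, which pads the missing time
# steps with NaN, then reshape the flat table to (t, 40, n_features).
full_index = pd.MultiIndex.from_product([df2.index.levels[0], range(40)])
data_3d = df2.reindex(full_index).to_numpy().reshape(-1, 40, df2.shape[1])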

Saving/loading a table (with different column lengths) using numpy

A bit of context: I am writing code to save the data I plot to a text file. This data should be stored in such a way that it can be loaded back by a script and displayed again (this time without performing any calculation). The initial idea was to store the data in columns with the format x1, y1, x2, y2, x3, y3...
I am using code which simplifies to something like this (incidentally, I am not sure if using a list to group my arrays is the most efficient approach):
import numpy as np
MatrixResults = []
x1 = np.array([1,2,3,4,5,6])
y1 = np.array([7,8,9,10,11,12])
x2 = np.array([0,1,2,3])
y2 = np.array([0,1,4,9])
MatrixResults.append(x1)
MatrixResults.append(y1)
MatrixResults.append(x2)
MatrixResults.append(y2)
MatrixResults = np.array(MatrixResults)
TextFile = open('/Users/UserName/Desktop/Datalog.txt',"w")
np.savetxt(TextFile, np.transpose(MatrixResults))
TextFile.close()
However, this code gives an error when any of the data sets have different lengths. I have read similar questions:
Can numpy.savetxt be used on N-dimensional ndarrays with N>2?
Table, with the different length of columns
However, these approaches require breaking the format (either by flattening or by padding the shorter columns with filler strings).
My issue summarizes as:
1) Is there any method that transposes the arrays and at the same time saves each one individually as a consecutive column?
2) Or is there any way to append columns to a text file (given a certain number of rows and columns to skip)?
3) Should I try this with another library such as pandas?
Thank you very much for any advice.
Edit 1:
After looking a bit more, it seems that leaving blank spaces is less efficient than filling the lists.
In the end I wrote my own function (not sure if there is a numpy function for this) in which I pad the arrays to matching lengths with nan values.
To get the data back I use the genfromtxt method and then this line to remove those cells from the arrays:
x = x[~np.isnan(x)]
If I find a better solution I will post it :)
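A minimal sketch of that padding approach, reusing the arrays from the snippet above (the padding loop is my own illustration, not a numpy built-in):

import numpy as np

columns = [x1, y1, x2, y2]
n_max = max(len(c) for c in columns)
padded = np.full((len(columns), n_max), np.nan)  # nan-filled table
for i, c in enumerate(columns):
    padded[i, :len(c)] = c                       # shorter arrays keep their nan tail
np.savetxt('Datalog.txt', padded.T)

# Reading back and stripping the padding per column:
loaded = np.genfromtxt('Datalog.txt')
x2_back = loaded[:, 2][~np.isnan(loaded[:, 2])]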
To save your arrays you can use np.savez and read them back with np.load:
# Write to file
np.savez(filename, matrixResults)
# Read back
matrixResults = np.load(filename + '.npz')['arr_0']
As a side note, you should follow naming conventions, i.e. only class names start with upper-case letters.
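One version-dependent caveat (an assumption about your setup): with rows of different lengths, matrixResults is stored as an object array, and numpy 1.16.3 and later refuse to unpickle object arrays by default, so reading it back may need allow_pickle=True:

matrixResults = np.load(filename + '.npz', allow_pickle=True)['arr_0']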

Saving (and averaging) Large Sparse Numpy Array

I have written a code that generates a large 3d numpy array of data observations (floats). The dimensions are (33,000 x 2016 x 53), which corresponds to (#obs.locations x 5min_intervals_perweek x weeks_in_leapyear). It is very sparse (about 1.5% of entries are filled).
Currently I do this by calling:
my3Darray = np.zeros((33000, 2016, 53))
or
my3Darray = np.empty((33000, 2016, 53))
My loop then indexes into the array one entry at a time and updates 1.5% with floats (this part is actually very fast). I then need to:
Save each 2D (33000 x 2016) slice as a CSV or other 'general format' data file
Take the mean over the 3rd dimension (so I should get a 33000 x 2016 matrix)
I have tried saving with:
for slice_2d_week_i in xrange(nweeks):
    weekfile = str(slice_2d_week_i)
    np.savetxt(weekfile, my3Darray[:, :, slice_2d_week_i], delimiter=",")
However, this is extremely slow, and the empty entries in the output show up as
0.000000000000000000e+00
which makes the file sizes huge.
Is there a more efficient way to save (possibly leaving blanks for entries that were never updated)? Is there a better way to allocate the array besides np.zeros or np.empty? And how can I take the mean over the 3rd dimension while ignoring non-updated entries (np.mean(my3Darray, axis=2) does not ignore the 0 entries)?
You can save in one of numpy's binary formats; here's one I use: np.savez.
You can average with np.sum(a, axis=2) / np.sum(a != 0, axis=2). Keep in mind that this will still give you NaNs where the denominator is zero.
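A minimal sketch of that nonzero-aware mean with the division warnings suppressed (the dimensions are shrunk here purely for illustration):

import numpy as np

a = np.zeros((330, 20, 53))   # scaled-down stand-in for the real array
a[0, 0, :2] = [1.0, 3.0]      # a couple of sparse updates

with np.errstate(divide='ignore', invalid='ignore'):
    means = np.sum(a, axis=2) / np.sum(a != 0, axis=2)
# cells that were never updated come out as NaN (0/0)
print(means[0, 0])            # 2.0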
