python average of multidimensional array netcdf plot

I read a multidimensional array from a netCDF file.
The variable that I need to plot is named "em", and it has 4 dimensions: `em(years, group, lat, lon)`.
The "group" dimension has 2 values; I am interested only in the first one.
So the only dimension I need to manage is "years", which has 17 values. For the first plot I need to average the first 5 years, and for the second plot I have to average from the 6th year to the last year.
from netCDF4 import Dataset

data = Dataset(r'D:\Users\file.nc')
lat = data.variables['lat'][:]
lon = data.variables['lon'][:]
year = data.variables['label'][:]
group = data.variables['group'][:]
em = data.variables['em'][:]
How can I create a two-dimensional array by averaging over this array?
First one:
em = data.variables['em'][0:5, 0, :, :]
Second one:
em = data.variables['em'][5:17, 0, :, :]
I wrote a simple loop:
nyear = (2005 - 2000) + 1
em_sum = 0
for i in range(nyear):
    em_sum = em_sum + data.variables['em'][i, 0, :, :]
em_2000_2005 = em_sum / nyear
but I think there could be a more elegant way to do this in Python.

I would highly recommend using xarray for working with NetCDF files. Rather than keeping track of indices positionally, you can operate on dimensions by name, which greatly improves code readability. In your example, all you would need to do is:
import xarray as xr
ds = xr.open_dataset(r'D:\Users\file.nc')
em_mean1 = ds.em.isel(label=range(5), group=0).mean(dim='label')      # first 5 years
em_mean2 = ds.em.isel(label=range(5, 17), group=0).mean(dim='label')  # 6th year onward
The .isel method selects the given indices of the specified dimensions (label and group in this case), and .mean(dim='label') averages over the selected years, leaving a 2-D (lat, lon) array ready for plotting.
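As a quick follow-up, the resulting 2-D DataArrays plot directly (assuming matplotlib is installed):
em_mean1.plot()  # draws a lat/lon map of the first average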

You can use NumPy:
import numpy as np
em = data.variables['em'][:]
em_mean = np.mean(em, axis=0)  # average over the whole first dimension
If the data contains NaNs, just use NumPy's nanmean.
As you wanted to average the first 5 values, for the first case use:
em_mean1 = np.squeeze(np.mean(em[0:5, :], axis=0))
and take the first group for the plot:
em_mean1 = np.squeeze(em_mean1[0, :])
You can do the same for the second case.
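For completeness, a hedged sketch of that second case (years 6 through 17 are indices 5:17, with group 0 selected up front):
em_mean2 = np.squeeze(np.mean(em[5:17, 0, :, :], axis=0))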

How do I append new data to the DataArray along existing dimension in-place?

I have data grouped into a 3-D DataArray named 'da' with dimensions 'time', 'indicators', and 'coins', using Dask as a backend.
I need to select the data for a particular indicator, calculate a new indicator based on it, and append this newly calculated indicator to da along the indicators dimension under the new indicator's name (let's call it daily_return). In somewhat simplistic terms of a 2-D analogy, I need to perform something like calculating a new pandas DataFrame column based on its other columns, but in 3-D.
So far I've tried apply_ufunc() with both drop=False (then I retain a scalar indicators coordinate on the resulting DataArray) and drop=True (respectively, indicators is dropped), following the corresponding tutorial:
dr_func = lambda today, yesterday: today / yesterday - 1  # Assuming for simplicity that yesterday != 0
today_da = da.sel(indicators="price_daily_close_usd", drop=False)  # or drop=True
yesterday_da = da.shift(time=1).sel(indicators="price_daily_close_usd", drop=False)  # or drop=True
dr = xr.apply_ufunc(
    dr_func,
    today_da,
    yesterday_da,
    dask="allowed",
    dask_gufunc_kwargs={"allow_rechunk": True},
    vectorize=True,
)
Obviously, in the case of drop=True I cannot concat the da and dr DataArrays, since indicators is not present among dr's coordinates.
In turn, in the case of drop=False I've managed to concat these DataArrays along indicators; however, the resulting indicators coord then contains two identically named coordinate values, both "price_daily_close_usd":
...while the second of them should be renamed to "daily_return".
I've also tried to extract the needed data from dr through .sel(), but failed due to the absence of an index along the indicators dimension (as far as I understand, it's not possible to set an index in this case, since the dimension is scalar):
dr.sel(indicators="price_daily_close_usd")  # Would result in KeyError: "no index found for coordinate 'indicators'"
Moreover, the solution above is not done in-place, i.e. it creates a new combined DataArray instance instead of modifying da, while the latter would be highly preferable.
How can I append new data to da along existing dimension, desirably in-place?
Loading all the data directly into RAM would hardly be possible due to its huge volumes, that's why Dask is being used.
I'm also not sticking to the DataArray data structure and it would be no problem to switch to a Dataset if it has more suitable methods for solving my problem.
Xarray does not support appending in-place. Any change to the shape of your array will need to produce a new array.
If you want to work with a single array and know the size of the final array, you could generate an empty array and assign values based on coordinate labels.
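A minimal sketch of that pre-allocation approach, using the dimension and indicator names from the question (toy sizes, plain NumPy instead of Dask):
import numpy as np
import xarray as xr

times, coins = 4, 2
indicators = ["price_daily_close_usd", "daily_return"]

# pre-allocate the final array with the full indicators axis
da = xr.DataArray(
    np.full((times, len(indicators), coins), np.nan),
    dims=("time", "indicators", "coins"),
    coords={"indicators": indicators},
)

# fill the known indicator, then derive the new one and assign by label
da.loc[:, "price_daily_close_usd", :] = np.random.rand(times, coins)
price = da.sel(indicators="price_daily_close_usd")
da.loc[:, "daily_return", :] = (price / price.shift(time=1) - 1).values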
I need to perform something like calculating a new pandas DataFrame column based on its other columns, but in 3-D.
Xarray's Dataset is a better analog to the pandas DataFrame. The Dataset is a dict-like container storing N-D arrays (DataArrays), just like the DataFrame is a dict-like container storing 1-D arrays (Series).
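Following that analogy, a hedged sketch (assuming da from the question): split the indicators dimension into Dataset variables, after which adding daily_return really is like adding a DataFrame column:
ds = da.to_dataset(dim="indicators")  # one variable per indicator
price = ds["price_daily_close_usd"]
ds["daily_return"] = price / price.shift(time=1) - 1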

Enumerating through a list of data to find averages, but the lines aren't just numbers

I am new to Python. I am enumerating through a large list of data, as shown below, and would like to find the mean of every line.
for index, line in enumerate(data):
    # calculate the mean
However, the lines of this particular data set look like this:
[array([[2.3325655e-10, 2.4973504e-10],
        [1.3025138e-10, 1.3025231e-10]], dtype=float32)]
I would like to find the mean of both rows separately, then the average of both means, so it outputs a single number.
Thanks in advance.
You probably do not need to enumerate through the list to achieve what you want. You can do it in two steps using list comprehensions.
For example,
data = [[2.3325655e-10, 2.4973504e-10],
        [1.3025138e-10, 1.3025231e-10]]
# Calculate the average of each row
avgs_along_x = [sum(line) / len(line) for line in data]
# Calculate the average along y
avg_along_y = sum(avgs_along_x) / len(avgs_along_x)
There are other ways to calculate the mean of a list in Python.
If you are using NumPy, this can be done in one line:
import numpy as np
np.average(data, axis=1)  # mean along axis 1, i.e. each row
# To get a single number, pass a tuple of axes:
np.average(data, axis=(1, 0))
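Note that the question's data is actually a list containing a 2-D float32 array rather than a plain nested list; a hedged variant for that exact shape:
import numpy as np

data = [np.array([[2.3325655e-10, 2.4973504e-10],
                  [1.3025138e-10, 1.3025231e-10]], dtype=np.float32)]

for arr in data:
    row_means = arr.mean(axis=1)  # mean of each row
    overall = row_means.mean()    # average of the row means: a single number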

python efficiently applying function over multiple arrays

(New to Python, so I apologize if this question is basic.)
Say I create a function that will calculate some equation:
import numpy as np

def plot_ev(accuracy, tranChance, numChoices, reward):
    ev = (reward - numChoices) * (1 - np.power(1 - accuracy, numChoices) * tranChance)
    return ev
accuracy, tranChance, and numChoices are each float arrays,
e.g.
accuracy = np.array([.6, .7, .8])
tranChance = np.array([.6, .7, .8])
numChoices = np.array([2, 3, 4])
How would I run and plot plot_ev over my 3 arrays so that I end up with an output covering all combinations of elements (ideally without running 3 nested for loops)?
Ideally I would have a single plot showing the output of all combinations (the 1st element of accuracy with all elements of tranChance and numChoices, the 2nd element of accuracy with all elements of tranChance and numChoices, and so on).
Thanks in advance!
Use numpy.meshgrid to make an array of all the combinations of values of the three variables.
products = np.array(np.meshgrid(accuracy, tranChance, numChoices)).T.reshape(-1, 3)
Then transpose this again and extract three longer arrays with the values of the three variables in every combination:
accuracy_, tranChance_, numChoices_ = products.T
Your function contains only operations that can be carried out on NumPy arrays, so you can simply feed these arrays as parameters into the function:
reward = ...  # you need to set the reward value
results = plot_ev(accuracy_, tranChance_, numChoices_, reward)
Alternatively consider using a pandas dataframe which will provide clearer labeling of the columns.
import pandas as pd
df = pd.DataFrame(products, columns=["accuracy", "tranChance", "numChoices"])
df["ev"] = plot_ev(df["accuracy"], df["tranChance"], df["numChoices"], reward)

Calculating intermittent average

I have a huge DataFrame with a lot of zero values, and I want to calculate the average of the numbers between the zero values. To make it simple: the data shows, for example, 10 consecutive values, then zeros, then values again. I just want to tell Python to calculate the average of each patch of the data.
The pic shows an example
First of all, I'm a little bit confused about why you are using a DataFrame; this would more likely be stored in a pd.Series, and I would suggest storing numeric data in a NumPy array. Assuming you have a pd.Series in front of you and you are trying to calculate the moving average between two consecutive points, there are two approaches you can follow:
zero-padding for the last integer;
assuming circularity and taking the average between the first and the last value.
Here is the expected code:
import numpy as np
import pandas as pd

data_series = pd.Series([0, 0, 0.76231, 0.77669, 0, 0, 0, 0, 0, 0, 0, 0,
                         0.66772, 1.37964, 2.11833, 2.29178, 0, 0, 0, 0, 0])
np_array = np.array(data_series)

# assuming zero-padding
np_array_zero_pad = np.hstack((np_array, 0))
mvavrg_zeropad = [np.mean([np_array_zero_pad[i], np_array_zero_pad[i + 1]])
                  for i in range(len(np_array_zero_pad) - 1)]

# assuming circularity: pair the last value with the first
np_array_circ = np.hstack((np_array, np_array[0]))
mvavrg_circ = [np.mean([np_array_circ[i], np_array_circ[i + 1]])
               for i in range(len(np_array_circ) - 1)]
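If the goal is instead the mean of each nonzero run ("patch"), as the question describes, a hedged sketch using the same series and a groupby over run labels:
nonzero = data_series != 0
# a new patch id starts wherever a nonzero value follows a zero
patch_id = (nonzero & ~nonzero.shift(fill_value=False)).cumsum()
patch_means = data_series[nonzero].groupby(patch_id[nonzero]).mean()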

Converting 2D numpy array to 3D array without looping

I have a 2D array of shape (t*40, 6) which I want to convert into a 3D array of shape (t, 40, 5) for an LSTM's input data layer. The desired conversion is shown in the figure below. Here, F1...F5 are the 5 input features, T1...T40 are the time steps for the LSTM, and C1...Ct are the various training examples. Basically, for each unique "Ct", I want a "T x F" 2D array, and to concatenate them all along the 3rd dimension. I do not mind losing the value of "Ct" as long as each Ct ends up in a different dimension.
I have the following code to do this by looping over each unique Ct and appending the "T x F" 2D arrays along the 3rd dimension.
import pandas as pd

# load 2d data
data = pd.read_csv('LSTMTrainingData.csv')
trainX = []
# loop over each unique ct and append the 2D subset in the 3rd dimension
for index, ct in enumerate(data.ct.unique()):
    trainX.append(data[data['ct'] == ct].iloc[:, 1:])
However, there are over 1,800,000 such Ct's so this makes it quite slow to loop over each unique Ct. Looking for suggestions on doing this operation faster.
EDIT:
data_3d = data.values.reshape(t, 40, 6)
trainX = data_3d[:, :, 1:]
This is the solution for the original question posted.
Updating the question with an additional problem: the T1...T40 time steps can have at most 40 steps, but there can also be fewer than 40. The rest of the values can be np.nan in the 40 slots available.
Since the Ct blocks do not all have the same length, you have no choice but to build a new block.
But data[data['ct'] == ct] in a loop can be O(n²), so it's a bad way to do it.
Here is a solution using Panel; cumcount renumbers the rows within each Ct:
import pandas as pd
from numpy.random import randint

t = 5
CFt = randint(0, t, (40 * t, 6)).astype(float)  # 2D data
df = pd.DataFrame(CFt)
df2 = df.set_index([df[0], df.groupby(0).cumcount()]).sort_index()
df3 = df2.to_panel()
This automatically fills missing data with NaN. But it warns:
DeprecationWarning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
So perhaps working with df2 is the recommended way to manage your data.
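Worth noting: Panel was removed entirely in pandas 1.0, so on a modern install only the MultiIndex route works. A hedged sketch that turns df2 (as built above) into the same NaN-padded 3-D NumPy block via unstack:
import numpy as np

wide = df2.unstack()  # rows: Ct; columns: (feature, step), NaN-padded
n_ct, n_feat = wide.shape[0], df.shape[1]
data_3d = wide.to_numpy().reshape(n_ct, n_feat, -1).transpose(0, 2, 1)
# data_3d has shape (number of Ct, max steps, 6), like the Panel result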
