Creating and Storing a Multi-Dimensional Array in a netCDF File - python

This question potentially has two parts, but maybe only one if the first part can be encapsulated by the second. I am using Python with numpy and netCDF4.
First:
I have four lists of different variable values (hereafter referred to as elevation values), each of which has a length of 28. These four lists form one set for each of 5 different latitude values, and those sets in turn form one set for each of 24 different time values.
So 24 times... each time with 5 latitudes... each latitude with four lists... each list with 28 values.
I want to create an array with the following dimensions: (elevation, latitude, time, variable).
In words, I want to be able to specify which of the four lists I access, which index in the list, and a specific time and latitude. So an index into this array would look like this:
array[0, 1, 2, 3], where the 3 specifies the 4th list, the 0 specifies the first index of that list, the 1 specifies the 2nd latitude, and the 2 specifies the 3rd time; the output is the value at that point.
I won't include my code for this part since literally the only things worth mentioning are the lists:
list1=[...]
list2=[...]
list3=[...]
list4=[...]
How can I do this, is there an easier structure for the array, or is there anything else I am missing?
Second:
I have created a netCDF file with variables that have these four dimensions. I need to set those variables to the array structure made above. I have no idea how to do this, and the netCDF4 documentation only covers a 1-D array, in a fairly cryptic way. If the arrays can be written directly into the netCDF file, bypassing the need to use numpy first, by all means show me how.
Thanks!

After talking to a few people where I work we came up with this solution:
First we made an array of zeros with the following call:
array1=np.zeros((28,5,24,4))
Then we filled this array by specifying the slice of the array we wanted to change:
array1[:,0,0,0]=list1
This inserts the values of the list into the first (latitude, time, variable) slot of the array.
Next, to write the array to a netCDF file, I created a netCDF file in the same program I made the array in, made a single variable with these four dimensions, and gave it values like this:
netcdfvariable[:]=array1
Hope that helps anyone who finds this.
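For completeness, here is a minimal end-to-end sketch of that solution. The file name 'output.nc' and variable name 'data' are placeholders, and list1 is a stand-in for one of the question's four lists:
import numpy as np
from netCDF4 import Dataset

list1 = list(range(28))  # stand-in for one of the question's four lists

# Build the (elevation, latitude, time, variable) array and fill one slot.
array1 = np.zeros((28, 5, 24, 4))
array1[:, 0, 0, 0] = list1  # first latitude, first time, first list

# Create the netCDF file, declare matching dimensions, and write the array.
rootgrp = Dataset('output.nc', 'w')
rootgrp.createDimension('elevation', 28)
rootgrp.createDimension('latitude', 5)
rootgrp.createDimension('time', 24)
rootgrp.createDimension('variable', 4)
netcdfvariable = rootgrp.createVariable('data', 'f8', ('elevation', 'latitude', 'time', 'variable'))
netcdfvariable[:] = array1  # write the whole numpy array at once
rootgrp.close()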

Related

(Python 3.x) Splitting arrays and saving them into new arrays

I'm writing a Python script intended to split a big array of numbers into equal sub-arrays. For that purpose, I use NumPy's array_split method as follows:
test=numpy.array_split(raw,nslices)
where raw is the complete array containing all the values, which are float64-type by the way.
nslices is the number of sub-arrays I want to create from the raw array.
In the script, nslices may vary depending on the size of the raw array, so I would like to "automatically" save each created sub-array in its own variable, e.g. resultsarray(i), similar to what can be done in MATLAB/Octave.
I tried to use a for ... in range loop in Python, but I am only able to save the last sub-array in a variable.
What is the correct way to save the sub-array for each incrementation from 1 to nslices?
Here is the complete code as it is now (I am a Python beginner, please bear with the low level of the script).
import numpy as np

file = open("results.txt", "r")
raw = np.loadtxt(fname=file, delimiter="/n", dtype='float64')
nslices = 3
rawslice = np.array_split(raw, nslices)
for i in range(0, len(rawslice)):
    resultsarray = (rawslice[i])
    print(rawslice[i])
Thank you very much for your help solving this problem!
First - you screwed up the delimiter :)
It should be backslash+n \n instead of /n.
Second - as Serge already mentioned in a comment, you can just access the split parts by index (rawslice[0] to rawslice[2]). But if you really want to assign each part to a separate variable, you can do it in the following way:
result_1_of_3, result_2_of_3, result_3_of_3 = rawslice
print(result_1_of_3, result_2_of_3, result_3_of_3)
But probably it isn't the way you should go.
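If you just need every slice accessible by its number, here is a minimal sketch of that index-based approach, assuming results.txt holds one number per line (so loadtxt's default whitespace delimiter works):
import numpy as np

raw = np.loadtxt("results.txt", dtype='float64')
nslices = 3
resultsarray = np.array_split(raw, nslices)  # a list of sub-arrays

for i, part in enumerate(resultsarray):
    print(i, part)  # resultsarray[i] is the i-th slice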

How to store array as variables in a dictionary

I'm new to python and stuck with a folder of files that I can't structure.
I have several files (.gra) that I'm importing as numpy arrays using python. These arrays are meteorological variables (GFS) both in 3D and 2D.
The 3D variables are for different altitudes in a polygon (a location). I know that each variable is stored contiguously until the next one starts; the 3D variables come first and the 2D variables after.
I would like to create a function that iterates over a folder, reads each file and, given a step, stores each slice of the array under a key.
My final purpose is to have a dictionary with all data of each variable stored by date.
The output I want is a dataframe of 3 columns (id(date), all meteorological data selected for each date by variable name).
I have tried creating a dictionary with all the variables and setting up a JSON file to help me determine the range of elements that each variable occupies in the array.
gfs_info = {
"HGTprs": (0,3_042), "CLWMRprs": (3_042, 6_084),"RHprs": (6_084,9_126),
"Velprs": (9_126,12_168),"UGRDprs": (12_168,15_210),"VGRDprs": (15_210,18_252),
"TMPprs": (18_252,21_294),"HGTsfc": (21_294,21_411),"MSLETmsl": (21_411,21_528),
"PWATclm": (21_528,21_645),"RH2m": (21_645,21_762),"Vel100m": (21_762,21_879),
"UGRD100m": (21_879,21_996),"VGRD100m": (21_996,22_113),"Vel80m": (22_113,22_230),
"UGRD80m": (22_230,22_347),"VGRD80m": (22_347,22_464),"Vel10m":(22_464,22_581),
"UGRD10m": (22_581,22_698),"VGRD10m": (22_698,22_815),"GUSTsfc": (22_815,22_932),
"TMPsfc": (22_932,23_049),"TMP2m": (23_049,23_166),"no4LFTXsfc":(23_166,23_283),
"CAPEsfc": (23_283,23_400),"SPFH2m": (23_400,23_517),"SPFH80m": (23_517,23_634),
}
From the 7th key onwards, the jump is 117 instead of 3042.
Thanks in advance!
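A minimal sketch of the slicing step, assuming each .gra file loads into one flat float array (np.fromfile, the float32 dtype, and the file name are assumptions about the format):
import numpy as np

def split_gfs(flat, gfs_info):
    # Slice one flat array into a dict keyed by variable name,
    # using the (start, stop) element offsets from gfs_info.
    return {name: flat[start:stop] for name, (start, stop) in gfs_info.items()}

# Hypothetical usage for a single file:
flat = np.fromfile('some_file.gra', dtype='float32')  # dtype is an assumption
variables = split_gfs(flat, gfs_info)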

Numpy mean for big array dataset using For loop Python

I have a big dataset in array form, arranged like this:
[image: rainfall amounts arranged in array form]
The average for each latitude and longitude along axis=0 is computed using this declaration:
Lat=data[:,0]
Lon=data[:,1]
rain1=data[:,2]
rain2=data[:,3]
...
rain44=data[:,45]
rainT=[rain1, rain2, rain3, rain4, ..., rain44]
mean=np.mean(rainT)
The result was awesome, but the computation takes time, so I wanted to use a for loop to ease the calculation. At the moment the script I use looks like this:
mean=[]
lat=data[:,0]
lon=data[:,1]
for x in range(2,46):
    rainT=data[:,x]
    mean=np.mean(rainT,axis=0)
    print mean
But a weird result appears. Anyone?
First, you probably meant the for loop to accumulate the subarrays rather than keep replacing rainT with another slice of the array. Only the last assignment matters, so the code averages just that one subarray, rainT = data[:,45], and it doesn't divide by the correct number of original elements to compute an overall average. Both of these mistakes contribute to the weird result.
Second, numpy should be able to average elements faster than a Python for loop can do it since that's just the kind of thing that numpy is designed to do in optimized native code.
Third, your original code copies a bunch of subarrays into a Python list, then asks numpy to average that. You should get much faster results by asking numpy to average the relevant subarrays without making a copy, something like this:
rainT = data[:,2:] # this gets a view onto data[], not a copy
mean = np.mean(rainT)
That computes an average over all the rainfall values, like your original code.
If you want an average for each latitude or some such, you'll need to do it differently. You can average over an array axis, but latitude and longitude aren't axes in your data[].
Thanks friends, you are giving me such inspiration. Here is the working script based on the ideas by @Jerry101; in the end I decided NOT to use a Python loop. The new declaration looks like this:
lat1 = data[:,0]
lon1 = data[:,1]
rainT = data[:,2:46]                 # this is the step I was missing earlier
mean = np.mean(rainT, axis=1) * 24   # make average daily rainfall by each lat and lon
mean2 = np.array([lat1, lon1, mean])
mean2 = mean2.T
np.savetxt('average-daily-rainfall.dat2', mean2, fmt='%9.3f')
And the final result is exactly the same as the program written in Fortran.

Deleting rows of data for multiple variables

I have over 500 files that I cleaned up using a pandas data frame and read in later as a matrix. I now want to delete missing rows of data from multiple variables across all of my files. Each variable is pretty lengthy; for example, tc and wspd have the shape (84479, 558) and pressure has the shape (558,). I have tried the following example before, and it has worked in the past for one-dimensional arrays of the same shape, but it will no longer work with a two-dimensional array.
bad=[]
for i in range(len(p)):
    if p[i]==-9999 or tc[i]==-9999:
        bad.append(i)
p=numpy.delete(p, bad)
tc=numpy.delete(tc, bad)
I tried using the following code instead but with no success (unfortunately).
import numpy as n
import pandas as pd

wspd=pd.read_pickle('/home/wspd').as_matrix()
tc=pd.read_pickle('/home/tc').as_matrix()
press=n.load('/home/file1.npz')
p=press['press']
names=press['names']
length=n.arange(0,84479)
for i in range(len(names[0])):  # using the first one as a trial to run faster
    print i  # used later to see how far we have come in the 558 files
    bad=[]
    for j in range(len(length)):
        if (wspd[j,i]==n.nan or tc[j,i]==n.nan):
            bad.append(j)
    print bad
From there I plan on deleting the missing data as I did previously, except indexing which dimension I am deleting from within my first for loop.
new_tc=n.delete(tc[j,:], bad)
Unfortunately, this has not worked. I have also tried masking the array, with no luck.
The reason I need to delete the data is that my next library does not understand NaN values; it requires strictly integers, floats, etc.
I am open to new methods for removing rows of data if anyone has any guidance. I greatly appreciate it.
I would load your 2-dimensional arrays as pandas DataFrames and then use the dropna function to drop any rows that contain a null value:
wspd = pd.read_pickle('/home/wspd').dropna()
tc = pd.read_pickle('/home/tc').dropna()
See the documentation for pandas.DataFrame.dropna.
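As an aside, the loop in the question never finds bad rows because wspd[j,i] == n.nan is always False (NaN compares unequal to everything, including itself); a NaN test needs n.isnan. If the variables must also stay row-aligned after dropping (the same rows removed from wspd and tc), here is a minimal sketch, assuming both pickles share the same row index and missing values are already NaN:
import pandas as pd

wspd = pd.read_pickle('/home/wspd')
tc = pd.read_pickle('/home/tc')

# Keep only rows that are complete in *both* frames so the
# variables stay aligned after dropping.
good = wspd.notnull().all(axis=1) & tc.notnull().all(axis=1)
wspd = wspd[good].values
tc = tc[good].values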

Conditionally add 1 to int element of a NumPy record array

I have a large NumPy record array, 250 million rows by 9 columns (MyLargeRec), and I need to add 1 to the 7th column (dtype = "int") if the index of that row is in another list of 300,000 integers (MyList). If this was a normal Python list I would use the following simple code...
for m in MyList:
    MyLargeRec[m][6]+=1
However, I cannot seem to get similar functionality using the NumPy record array. I have tried a few options such as nditer, but this will not let me select the specific indices I want.
Now you may say that this is not what NumPy was designed for, so let me explain why I am using this format. I am using it because it only takes 30 mins to build the record array from scratch, whereas it takes over 24 hours using a conventional 2D list format. I spent all of yesterday trying to find a way to do this and could not; I eventually converted it to a list using...
MyLargeList = list(MyLargeRec)
so I could use the simple code above to achieve what I want; however, this took 8.5 hours to run.
Therefore, can anyone tell me: first, is there a method to achieve what I want within a NumPy record array? And second, if not, any ideas on the best methods within Python 2.7 to create, update and store such a large 2D matrix?
Many thanks
Tom
your_array[index_list, 6] += 1
NumPy allows you to construct some pretty neat slices. This selects column index 6 (the 7th column) of all rows in your list of indices and adds 1 to each. (Note that if an index appears multiple times in your list of indices, this will still only add 1 to the corresponding cell.)
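If repeated indices should each count, np.add.at performs an unbuffered, accumulating update instead. A small sketch with stand-in names:
import numpy as np

arr = np.zeros((10, 9), dtype=int)  # stand-in for MyLargeRec
idx = [1, 3, 3, 7]                  # stand-in for MyList (note the repeated 3)

arr[idx, 6] += 1             # row 3's column 6 gets +1 only once
np.add.at(arr, (idx, 6), 1)  # row 3's column 6 gets +1 per occurrence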
This code...
for m in MyList:
    MyLargeRec[m][6]+=1
does actually work, silly question by me.
