I'm trying to open multiple netCDF files with xarray in Python. The files have data with the same shape, and I want to join them, creating a new dimension.
I tried to use the concat_dim argument of xarray.open_mfdataset(), but it doesn't work as expected. An example is given below, which opens two files with temperature data for 124 times, 241 latitudes and 480 longitudes:
DS = xr.open_mfdataset( 'eraINTERIM_t2m_*.nc', concat_dim='cases' )
da_t2m = DS.t2m
print( da_t2m )
With this code, I expected the resulting data array to have a shape of (cases: 2, time: 124, latitude: 241, longitude: 480). However, its shape was (cases: 2, time: 248, latitude: 241, longitude: 480).
It creates a new dimension, but it also concatenates along the leftmost dimension: the 'time' lengths of the two datasets are summed.
I was wondering whether this is an error in xarray.open_mfdataset or expected behaviour, because the 'time' dimension is UNLIMITED in both datasets.
Is there a way to join the data from these files directly with xarray and get the expected result above?
Thank you.
Mateus
Extending from my comment, I would try this:
def preproc(ds):
    ds = ds.assign({'stime': (['time'], ds.time)}).drop('time').rename({'time': 'ntime'})
    # we might need to tweak this a bit further, depending on the actual data layout
    return ds
DS = xr.open_mfdataset( 'eraINTERIM_t2m_*.nc', concat_dim='cases', preprocess=preproc)
The good thing here is that you keep the original time coordinate in stime while renaming the original dimension (time -> ntime).
If everything works well, you should get resulting dimensions of (cases, ntime, latitude, longitude).
Disclaimer: I do something similar in a loop with a final concat (which works very well), but I did not test the preprocess approach.
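For completeness, an untested sketch of that loop-plus-final-concat variant, with the same renaming applied to each file before concatenating:
import glob
import xarray as xr

datasets = []
for fname in sorted(glob.glob('eraINTERIM_t2m_*.nc')):
    ds = xr.open_dataset(fname)
    ds = ds.assign({'stime': (['time'], ds.time.values)}).drop('time').rename({'time': 'ntime'})
    datasets.append(ds)

DS = xr.concat(datasets, dim='cases')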
Thank you #AdrianTompkins and #jhamman. After your comments I realized that, because of the different time periods, I really can't get what I want with xarray.
My main purpose in creating such an array is to get all data for different events, with the same time duration, into one single N-D array. Then I can easily get, for example, composite fields of all events for each time (hour, day, etc.).
I'm trying to do the same thing I do with NCL. Below is NCL code that works as expected (for me) on the same data:
f = addfiles( (/"eraINTERIM_t2m_201812.nc", "eraINTERIM_t2m_201901.nc"/), "r" )
ListSetType( f, "join" )
temp = f[:]->t2m
printVarSummary( temp )
The final result is an array with 4 dimensions, with the new one automatically named ncl_join.
However, NCL doesn't respect the time axis; it joins the arrays and gives the resulting time axis the coordinates of the first file. So the time axis becomes useless.
However, as #AdrianTompkins explained, the time periods are different and xarray can't join the data like this. So, to create such an array in Python with xarray, I think the only way is to delete the time coordinate from the arrays; the time dimension would then have only integer indexes.
The array given by xarray works as #AdrianTompkins showed in his small example. Since it keeps the time coordinates for all merged data, I think the xarray solution is the correct one in comparison with NCL. But now I think that computing composites (for the same example given above) wouldn't be as easy as it is with NCL.
In a small test, I printed two values from the array merged with xarray:
print( da_t2m[ 0, 0, 0, 0 ].values )
print( da_t2m[ 1, 0, 0, 0 ].values )
What results in
252.11412
nan
For the second case, there isn't data for the first time, as expected.
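For reference, a rough, untested sketch of that "delete the time coordinate" idea (reusing the file pattern from the question); with events of equal length the cases then align by position, and composites become simple reductions over the new dimension:
import xarray as xr

def drop_time(ds):
    return ds.drop('time')   # drop the time coordinate so cases align by position

DS = xr.open_mfdataset( 'eraINTERIM_t2m_*.nc', concat_dim='cases', preprocess=drop_time )

composite = DS.t2m.mean(dim='cases')   # composite field for each time step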
UPDATE: all the answers helped me understand this problem better, so I had to add an update here to also thank #kmuehlbauer for his answer, pointing out that his code gives the expected array.
Again, thank you all for the help!
Mateus
The result makes sense if the times are different.
To simplify, forget about the lat-lon dimensions for a moment and imagine you have two files that are simply data at two timeslices each. The first has data at timesteps 1 and 2, and the second file has timesteps 3 and 4. You can't create a combined dataset with a time dimension that only spans 2 timeslices; the time dimension variable has to have the times 1,2,3,4. So if you say you want a new dimension "cases", the data is combined as a 2D array and would look like this:
times: 1, 2, 3, 4
cases: 1, 2
data:
             time:   1    2    3    4
    cases 1:        x1   x2    -    -
    cases 2:         -    -   x3   x4
Think of the netCDF file that would be the equivalent: the time dimension has to span the range of values present in both files. The only way you could combine two files and get (cases: 2, time: 124, latitude: 241, longitude: 480) would be if both files had the same time, lat AND lon values, i.e. pointed to exactly the same region in time-lat-lon space.
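A tiny, made-up xarray example of that alignment behaviour, just to illustrate the point:
import xarray as xr

a = xr.DataArray([10., 20.], coords={'time': [1, 2]}, dims='time')
b = xr.DataArray([30., 40.], coords={'time': [3, 4]}, dims='time')

combined = xr.concat([a, b], dim='cases')
print(combined.shape)   # (2, 4): the time axis now spans 1, 2, 3, 4
print(combined.values)  # each case is NaN at the times where it has no data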
ps: Somewhat off-topic for the question, but if you are just starting a new analysis, why not switch to the new-generation, higher-resolution ERA-5 reanalysis, which is now available back to 1979 too (and will eventually be extended further back)? You can download it straight to your desktop with the Python API scripts from here:
https://cds.climate.copernicus.eu/cdsapp#!/search?type=dataset
Related
I am trying to bin lines of data according to the first element of each line.
My data has this shape:
[[Temperature, value0, value1, ... value249]
[Temperature, ...
]
That is: the first element of each line is a temperature value, and the rest of the line is a time trace of a signal.
I would like to make an array of this shape:
[[Temperature-bin, [[values],
                    [values],
                    ...]],
 [Next Temp.-bin,  [[values],
                    [values],
                    ...]],
 ...
]
where the lines from the original data-array should be sorted in the subarray of the respective temperature bin.
data= np.array([values]) # shape is [temp+250 timesteps,400K]
temp=data[0]
start=23000
end=380000
tempmin=np.min(temp[start:end])
tempmax=np.max(temp[start:end])
binsize=1
bincenters=np.arange(np.round(tempmin),np.round(tempmax)+1,binsize)
binneddata=np.empty([len(bincenters),2])
for i in np.arange(len(temp)):
    binneddata[i]=[bincenters[i],np.array([])]
I was hoping to get a result array as described above, where every line consists of the mean temperature of the bin (bincenters[i]) and an array of time traces. Python gives me an error: "setting an array element with a sequence".
I could create this kind of array, consisting of different data types, in another script before, but there I had to define it explicitly, which is not possible in this case because I'm handling files on the scale of several 100k lines of data. At the same time, I would like to use as many built-in functions and as few loops as possible, because my computer already takes quite some time to process files of that size.
Thank you for your input,
lepakk
First: thanks to kwinkunks for the hint to use a pandas DataFrame.
I found a solution using this feature.
The binning is now done like this:
tempmin=np.min(temp[start:end])
tempmax=np.max(temp[start:end])
binsize=1
bincenters=np.array(np.arange(np.round(tempmin),np.round(tempmax)+1,binsize))
lowerbinedges=np.array(bincenters-binsize/2)
higherbinedges=np.array(bincenters+binsize/2)
allbinedges=np.append(lowerbinedges,higherbinedges[-1])
temp_pd=pd.Series(temp[start:end])
traces=pd.Series(list(data[start:end,0:250]))
tempbins=pd.cut(temp_pd,allbinedges,labels=bincenters)
df=pd.concat([temp_pd,tempbins,traces], keys=['Temp','Bincenter','Traces'], axis=1)
by defining the bins (in this case even-sized). The variable "tempbins" has the same shape as temp (the "raw" temperature) and assigns every line of data to a certain bin.
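To illustrate, a tiny example of the pd.cut step with made-up numbers:
import numpy as np
import pandas as pd

temps = pd.Series([139.2, 140.7, 139.9, 141.4])
edges = np.array([138.5, 139.5, 140.5, 141.5])
centers = np.array([139.0, 140.0, 141.0])

print(pd.cut(temps, edges, labels=centers))
# -> 139.0, 141.0, 140.0, 141.0: each temperature is mapped to its bin centre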
The actual analysis is then extremely short. Starting with:
rf=pd.DataFrame({'Bincenter': bincenters})
the resultframe ("rf") starts with the bincenters (as the x-axis in a plot later), and simply adds columns for the desired results.
With
df[df.Bincenter==xyz]
I can select only those data lines from df that I want to have in the selected bin.
In my case, I am not interested in the actual time traces but in their sum or average, so I use lambda functions that run through the rows of rf and, for each row, pick out every row of df that has the same value in "Bincenter".
rf['Binsize']=rf.apply(lambda row: len(df.Traces[df.Bincenter==row.Bincenter]), axis=1)
rf['Trace_sum']=rf.apply(lambda row: sum(df.Traces[df.Bincenter==row.Bincenter]), axis=1)
With those, two more columns are added to the result frame rf: the number of lines in the bin and the sum of the traces.
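As a side note, an untested sketch of how the same two quantities could be computed with a groupby instead of the row-wise apply (reusing the df defined above):
import numpy as np

grouped = df.groupby('Bincenter', observed=True)['Traces']  # observed=True skips empty bins
binsize = grouped.size()
trace_sum = {name: np.sum(np.stack(list(traces)), axis=0) for name, traces in grouped}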
I performed some fits of the traces in rf.Trace_sum, which I did not do in pandas.
Still, the dataframe was very useful here. I used odr for the fitting, like this:
for i in binnumber:
    fitdata=odr.Data(time[fitstart:],rf.Trace_sum.values[i][fitstart:])
    # ... some more fit stuff here ...
and saved the fitresults in
lifetimefits=pd.DataFrame({'lifetime': fitresult[:,1], 'sd_lifetime':fitresult[:,4]})
and finally added them in the resultframe with
rf=pd.concat([rf,lifetimefits],axis=1)
rf[['Bincenter','Binsize','lifetime','sd_lifetime']].to_csv('results.csv', header=True, index=False)
which makes an output like
Out[78]:
Bincenter Binsize ... lifetime sd_lifetime
0 139.0 4102 ... 38.492028 2.803211
1 140.0 4252 ... 33.659729 2.534872
2 141.0 3785 ... 31.220312 2.252104
3 142.0 3823 ... 29.391562 1.783890
4 143.0 3808 ... 40.422578 2.849545
I hope this explanation helps others avoid wasting time trying this with numpy. Thanks again to kwinkunks for the very helpful advice to use a pandas DataFrame.
Best,
lepakk
So I have this giant matrix (~1.5 million rows x 7 columns) and am trying to figure out an efficient way to split it up. For simplicity, I'll work with a much smaller matrix as an example of what I'm trying to do. The 7 columns are (in this order): item number, an x and a y coordinate, 1st label (non-numeric), data #1, data #2, and 2nd label (non-numeric). Using pandas, I've imported from an excel sheet my matrix called A, which looks like this:
What I need to do is partition this based on both labels (i.e. one matrix with all the 13G + Aa rows together, another with the 14G + Aa rows, and another with the 14G + Ab rows; this would leave me with 3 separate 2x7 matrices). The reason is that I need to run a bunch of statistics on the numbers in the "Marker" column of each individual matrix (e.g. in this example, break the 6 "marker" numbers into three sets of 2 and run statistics on each set of two numbers). Since there will be hundreds of these smaller matrices in the real data set, I was trying to figure out some way to label the smaller matrices something like M1, M2, ..., M500 (or whatever the number ends up being), so that later I can use loops to apply the statistics to each individual matrix without having to write it out 500+ times.
What I've done so far is to use pandas to import my data set into python as a matrix with the command:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\path\cancerdata.csv")
A = df.as_matrix() #Convert excel sheet to matrix
A = np.delete(A, (0),axis=0) #Delete header row
Unfortunately, I haven't come across many resources for how to do what I want, which is why I wanted to ask here to see if anyone knows how to split up a matrix into smaller matrices based on multiple labels.
Your question has many implications, so instead of giving you a straight answer I'll try to give you some pointers on how to tackle this problem.
First off, don't transform your DataFrame into a Matrix. DataFrames are well-optimised for slicing and indexing operations (a Pandas Series object is in reality a fancy Numpy array anyway), so you only lose functionality by converting it to a Matrix.
You could probably convert your label columns into a MultiIndex. This way, you'll be able to access slices of your original DataFrame using df.loc, with a syntax similar to df.loc[label1].loc[label2].
A MultiIndex may sound confusing at first, but it really isn't. Try executing this code block and see for yourself what the resulting DataFrame looks like:
df = pd.read_csv(r"C:\path\cancerdata.csv")
labels01 = df["Label 1"].unique()
labels02 = df["Label 2"].unique()
df.set_index(["Label 1", "Label 2"], inplace=True)
print(df)
Here, we extracted all unique values in the columns "Label 1" and "Label 2" (we will use them again further down), and in the df.set_index line we turned those two columns into a MultiIndex - they now act as indices for your other columns. For example, in order to access the slice of your original DataFrame whose Label 1 = 13G and Label 2 = Aa, you can simply write:
sliced_df = df.loc["13G"].loc["Aa"]
And perform whatever calculations/statistics you need with it.
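Equivalently, you can pass both labels as a tuple in a single .loc call:
sliced_df = df.loc[("13G", "Aa")]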
Lastly, instead of saving each sliced DataFrame into a list or dictionary and then iterating over them to perform the calculations, consider rearranging your code so that, as soon as you create a sliced DataFrame, you perform the calculations, save the results to an output file/DataFrame, and move on to the next slicing operation. Something like:
for L1 in labels01:
    for L2 in labels02:
        sliced_df = df.loc[L1].loc[L2]
        results = perform_calculations(sliced_df)
        save_results(results)
This will both improve memory consumption and performance, which may be important considering your large dataset.
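If it helps, a rough sketch of the same loop written with groupby, which only visits the label combinations that actually occur in the data:
for (L1, L2), sliced_df in df.groupby(level=[0, 1]):
    results = perform_calculations(sliced_df)
    save_results(results)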
If I have a rainfall map with three dimensions (latitude, longitude and rainfall value) and I put it in an array, do I need a 2D or a 3D array? What would the array look like?
If I have a series of daily rainfall maps with four dimensions (lat, long, rainfall value and time) and I put it in an array, do I need a 3D or a 4D array?
I am thinking that I would need a 2D and a 3D array, respectively, because latitude and longitude can be represented by a 1D array alone (but reshaped such that it has more than one row and column). Enlighten me please.
I think both propositions, from #Genhis and #Bitzel, are right depending on what you want to do...
If you want to be effective, I would recommend putting both in a 2D data structure, and I would specifically advise a pandas DataFrame, which puts your data in a matrix-like data structure but lets you define multiple indexes if you need to "think" in 3D or 4D.
It will be especially helpful with the 2nd kind of data you mention, "(lat, long, rainfall value and time)", since that is part of what is called a "time series". Pandas has a lot of methods to help you average over some period of time (you can also group your data by longitude, latitude or location if needed).
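For instance, a small made-up sketch of that long, 2D layout and the kind of grouping pandas gives you:
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02']),
    'lat':  [10.0, 10.5, 10.0, 10.5],
    'lon':  [120.0, 120.0, 120.0, 120.0],
    'rainfall': [3.2, 0.0, 5.1, 1.4],
})

print(df.groupby(['lat', 'lon'])['rainfall'].mean())  # average per location
print(df.groupby('time')['rainfall'].mean())          # average per day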
On the contrary, if your objective is to learn how to compute those numbers in Python, then you can use 2D arrays for the first case and 2D or 3D for the 2nd one, as the previous answers recommended. You could use something like numpy arrays as the data structure instead of pure Python lists, but that's debatable...
One important point: choosing 3D arrays for the time series, as #Genhis proposes, would require you to convert times into indexes (through lookup tables or a hash function), which will take some more work...
As I said, you could also read about tidy, wide and long formats if you want to learn more about those questions...
For the rainfall map, the values you're describing are (latitude, longitude, rainfall value), so you need to use a 2D array (matrix), since all you need is 3 columns and a number of rows. It will look like:
rainfall
For the values (lat, long, rainfall value, time) it's the same case: you need a 2D array with 4 columns and a number of rows:
Rainfall matrix
I believe that the rainfall value shouldn't be a dimension. Therefore, you could use a 2D array[lat][lon] = rainfall_value or a 3D array[time][lat][lon] = rainfall_value, respectively.
If you want to reduce the number of dimensions further, you can combine latitude and longitude into one dimension, as you suggested, which would make the arrays 1D/2D.
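A minimal sketch of that indexing scheme with made-up grid sizes:
import numpy as np

rainfall = np.zeros((365, 180, 360))       # array[time][lat][lon] = rainfall_value

rainfall[0, 45, 90] = 12.5                 # value at day 0, grid cell (45, 90)
daily_total = rainfall.sum(axis=(1, 2))    # one total per day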
I code just once in a while and I am super basic at the moment. This might be a silly question, but it has kept me stuck for a bit too long now.
Background
I have a function (get_profiles) that places points every 5 m along a transect line (100 m long) and extracts the elevation (from a geotiff) at each point.
The arguments are:
dsm (digital surface model)
transect_file (geopackage, holds many LineStrings with different transect_ID)
transect_id (int, extracted from transect_file)
step (int, number of meters to extract elevation along transect lines)
The output for one transect line is a dataframe like in the picture, which is what I expected, and I like it!
However, the big issue is when I iterate the function over the transect_ids (transect_file holds 10 Shapely LineStrings), like this:
tr_list = np.arange(1,transect_file.shape[0]-1)
geodb_transects= []
for i in tr_list:
    temp=get_profiles(dsm,transect_file,i,5)
    geodb_transects.append(temp)
I get a list. The error might be here, but I don't know how to do it another way.
type(geodb_transects)
output:list
And, what's worse, I get headers (distance, z, tr_id, date) every time a new iteration starts.
How do I get a clean pandas dataframe, just like the output of one iteration (20 rows), but with all the tr_id chunks of 20 rows each stacked and without repeated headers?
If your output is a DataFrame, then you're simply looking to concatenate each incremental DataFrame onto a growing one.
It's not the most efficient, but something like:
import pandas

df = pandas.DataFrame()
for i in range(7):
    df = pandas.concat([df, df_ret_func(i)], ignore_index=True)
You may also be interested in the from_records function if you have a list of elements that are all records of the same form and can be converted into the rows of a DataFrame.
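Applied to the loop in the question, a sketch of the list-plus-single-concat pattern, which is usually preferable to growing a DataFrame inside the loop:
import pandas as pd

geodb_transects = []
for i in tr_list:
    geodb_transects.append(get_profiles(dsm, transect_file, i, 5))

result = pd.concat(geodb_transects, ignore_index=True)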
I have a csv file which contains approximately 100 columns of data. Each column holds temperature values taken every 15 minutes throughout the day, one column per day for 100 days; the header of each column is the date of that day. I want to convert this into two columns: the first being the datetime (which I will have to create somehow), and the second being the temperatures stacked on top of each other, day by day.
My attempt:
with open("original_file.csv") as ofile:
stack_vec = []
next(ofile)
for line in ofile:
columns = lineo.split(',') # get all the columns
for i in range (0,len(columns)):
stack_vec.append(columnso[i])
np.savetxt("converted.csv",stack_vec, delimiter=",", fmt='%s')
In my attempt, I am trying to create a new vector with each column appended to the end of it. However, the code is extremely slow and likely not working! Once I have this step figured out, I then need to take the date from each column header and add 15 minutes to the datetime for each row. Any help would be greatly appreciated.
If I got this correctly, you have a csv with 96 rows and 100 columns and want to stack it, day after day, into one vector with 9600 entries, right?
An easy approach would be to use numpy:
import numpy as np
x = np.genfromtxt('original_file.csv', delimiter=',', skip_header=1)  # skip the header row of dates
data = x.ravel(order='F')
Note that numpy is a third-party library, but it is the go-to library for math.
The first line reads the csv into an ndarray, which is like a matrix (even though it behaves differently for mathematical operations).
Then with ravel you flatten it into a vector. The order='F' argument stacks the columns one after another, i.e. day after day. (Leave it as the default if you instead want the values time point after time point across the days.)
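A tiny made-up example of the difference between the two orders:
import numpy as np

x = np.array([[1, 4],
              [2, 5],
              [3, 6]])   # 3 time steps (rows) x 2 days (columns)

print(x.ravel(order='F'))  # [1 2 3 4 5 6] -> day 1 first, then day 2
print(x.ravel())           # [1 4 2 5 3 6] -> time point after time point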
For your date problem see How can I make a python numpy arange of datetime; I guess I couldn't give a better example there.
If you have these two arrays you can ensure the shape with x.reshape(9600, 1) and then stack them with np.concatenate([x, dates], axis=1), with dates being your date vector.
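For example, a hypothetical sketch of building that date vector and writing both columns, reusing data from above (the start date, step and length are assumptions; adjust them to your files):
import numpy as np

dates = np.arange('2019-01-01T00:00', '2019-04-11T00:00',
                  np.timedelta64(15, 'm'), dtype='datetime64[m]')  # assumed: 100 days, 15-minute steps

out = np.concatenate([dates.reshape(-1, 1).astype(str),
                      data.reshape(-1, 1).astype(str)], axis=1)    # both columns as strings
np.savetxt('converted.csv', out, delimiter=',', fmt='%s')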