Access data by month number in 3D xarray - python

I have data arrays (361x361) for Jan, Feb, March, Apr, Oct, Nov and Dec for a given year.
So far I've been storing them in individual netcdfs for every month in the year (e.g. 03.nc, 10.nc)
I'd like to combine all months into one netcdf, so that I can do something like:
march_data = data.sel(month='03')
or alternatively data.sel(month=3)
So far I've only been able to stack the monthly data in a 361x361x7 array and it's unhelpfully indexed so that to get March data you need to do data[:,:,2] and to get October it's data[:,:,4]. Clearly 2 & 4 do not intuitively correspond to the months of March and October. This is in part because python is indexed from zero and in part because I'm missing the summer months. I could put nan fields in for the missing months, but that wouldn't solve the index-0 issue.
My attempt so far:
data = xarray.Dataset(
    data_vars={'ice_type': (['x', 'y', 'time'], year_array)},
    coords={'lon': (['x', 'y'], lon_target),
            'lat': (['x', 'y'], lat_target),
            'month_number': (['time'], month_int)})
Here year_array is a 361x361x7 numpy array, and month_int is a list that maps the third index of year_array to the month number: [1,2,3,4,10,11,12].
When I try to get Oct data with oct = data.sel(month_number=10) it throws an error.
On a side note, I'm aware that there's possibly a solution to be found here, but to be honest I don't understand how it works. My confusion is mostly about how they use 'time' both as a dictionary key and as a list of times at the same time.

I think I've written a helper function to do something just like that:
import xarray as xr

def combine_new_ds_dim(ds_dict, new_dim_name):
    """
    Combines a dictionary of datasets along a new dimension, using the
    dictionary keys as the new coordinates.

    Parameters
    ----------
    ds_dict : dict
        Dictionary of xarray Datasets or DataArrays
    new_dim_name : str
        The name of the newly created dimension

    Returns
    -------
    xarray.Dataset
        Merged Dataset or DataArray

    Raises
    ------
    ValueError
        If the values of the input dictionary were of an unrecognized type
    """
    expanded_dss = []
    for k, v in ds_dict.items():
        expanded_dss.append(v.expand_dims(new_dim_name))
        expanded_dss[-1][new_dim_name] = [k]
    new_ds = xr.concat(expanded_dss, new_dim_name)
    return new_ds
If you have all of the data in individual netCDFs, then you should be able to load each one into an individual DataArray. Assuming you've done that, you could then do
month_das = {
    1: january_da,
    2: february_da,
    ...
    12: december_da
}
year_data = combine_new_ds_dim(month_das, 'month')
which would be the concatenation of all of the data along the new dimension month with the desired coordinates. I think the main loop of the function is easy enough to separate if you want to use that alone.
EDIT:
For anyone looking at this in the future, there's a much easier way of doing this with built-in xarray functions. You can just concatenate along a new dimension
year_data = xr.concat([january_da, february_da, ..., december_da], dim="month")
which will create a new DataArray with the constituent arrays concatenated along a new dimension, but without coordinates on that dimension. To add coordinates,
year_data["month"] = [1, 2, ..., 12]
at which point year_data will be concatenated along the new dimension "month" and will have the desired coordinates along that dimension.
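Putting those pieces together, a minimal end-to-end sketch might look like this (the monthly file names and the ice_type variable name are taken from the question; adjust them to whatever the monthly files actually contain):

import xarray as xr

# Hypothetical monthly files named like '01.nc', '03.nc', '10.nc' (assumption based on the question).
month_numbers = [1, 2, 3, 4, 10, 11, 12]
monthly_das = [xr.open_dataset(f"{m:02d}.nc")["ice_type"] for m in month_numbers]

# Concatenate along a new "month" dimension and label it with the month numbers.
year_data = xr.concat(monthly_das, dim="month")
year_data["month"] = month_numbers

# Selection by month number now works directly:
march_data = year_data.sel(month=3)
october_data = year_data.sel(month=10)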

Related

Python - norming data in a 2D list

I am trying to normalize data in a 2D list, so that all numerical data points become normalized against a chosen column. Allow me to elaborate:
I have data from countries and years, showing the population of each country per year. So the first row is years, the first column is countries, and all the rest of the data is population. The data starts in 2019. I want to use 2019 as a base year and then normalize the following years so that they become values (+ or - integers * 100) showing increases and drops.
So far my code just gives me 1.0 in that first column, since I am dividing each element by itself along the array! Only the first column after the "countries" column should have 1, since it is that column's elements divided by themselves. But the other columns should give a value showing the change from the 2019 values. Instead, they seem to just give the data that is already in the 2D list. How do I use the data from the "2019" column as the divisor, so that each year's population becomes a normalized value?
I went ahead and made the data the same in each row before printing, which means that if the code were working correctly, all of the data in the array should also be 1.0. Since it is not, there is a problem.
Here is my code so far:
list_1 = [['countries', 2019, 2020, 2025], ['aruba', 2, 2, 2], ['barbados', 2, 2, 2], ['japan', 2, 2, 2]]
for row in range(1, len(list_1)):
    for column in range(1, len(list_1[row])):
        list_1[row][column] = list_1[row][column] / list_1[row][1]
        print(f'{list_1[row][column]:15}', end='')
    print()
Thank you for your insight! And I am not allowed to import any modules for this assignment.
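For what it's worth, the reason the output looks unchanged is that list_1[row][1] is overwritten with 1.0 on the first pass of the inner loop, so every later column is divided by 1.0. A minimal sketch of one possible fix (no imports needed) is to save the base value before the inner loop:

list_1 = [['countries', 2019, 2020, 2025], ['aruba', 2, 2, 2], ['barbados', 2, 2, 2], ['japan', 2, 2, 2]]
for row in range(1, len(list_1)):
    base = list_1[row][1]  # keep the 2019 value before it gets overwritten
    for column in range(1, len(list_1[row])):
        list_1[row][column] = list_1[row][column] / base
        print(f'{list_1[row][column]:15}', end='')
    print()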

How to join data from multiple netCDF files with xarray in Python?

I'm trying to open multiple netCDF files with xarray in Python. The files have data with the same shape, and I want to join them, creating a new dimension.
I tried to use the concat_dim argument of xarray.open_mfdataset(), but it doesn't work as expected. An example is given below, which opens two files with temperature data for 124 times, 241 latitudes and 480 longitudes:
DS = xr.open_mfdataset( 'eraINTERIM_t2m_*.nc', concat_dim='cases' )
da_t2m = DS.t2m
print( da_t2m )
With this code, I expect that the result data array will have a shape like (cases: 2, time: 124, latitude: 241, longitude: 480). However, its shape was (cases: 2, time: 248, latitude: 241, longitude: 480).
It creates the new dimension, but it also adds up the 'time' dimensions of the two datasets (124 + 124 = 248).
I was wondering whether this is an error in xarray.open_mfdataset or whether it's expected behavior because the 'time' dimension is UNLIMITED in both datasets.
Is there a way to join data from these files directly using xarray and get the above expected return?
Thank you.
Mateus
Extending from my comment I would try this:
def preproc(ds):
    ds = ds.assign({'stime': (['time'], ds.time)}).drop('time').rename({'time': 'ntime'})
    # we might need to tweak this a bit further, depending on the actual data layout
    return ds

DS = xr.open_mfdataset('eraINTERIM_t2m_*.nc', concat_dim='cases', preprocess=preproc)
The good thing here is that you keep the original time coordinate in stime while renaming the original dimension (time -> ntime).
If everything works well, you should get resulting dimensions of (cases, ntime, latitude, longitude).
Disclaimer: I do something similar in a loop with a final concat (which works very well), but I did not test the preprocess approach.
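The loop-plus-final-concat approach mentioned in the disclaimer could look roughly like this — a sketch, not tested against the actual files, assuming each file gets its time coordinate moved aside in the same way as in preproc above, and that all files have the same number of time steps:

import glob
import xarray as xr

datasets = []
for path in sorted(glob.glob('eraINTERIM_t2m_*.nc')):
    ds = xr.open_dataset(path)
    # keep the original times as a data variable, leave the dimension as plain integer indexes
    ds = ds.assign({'stime': (['time'], ds.time.values)})
    ds = ds.drop_vars('time').rename({'time': 'ntime'})
    datasets.append(ds)

DS = xr.concat(datasets, dim='cases')
# DS.t2m should now have dimensions (cases, ntime, latitude, longitude)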
Thank you @AdrianTompkins and @jhamman. After your comments I realize that, due to the different time periods, I really can't get what I want with xarray alone.
My main purpose in creating such an array is to get all the data for different events, with the same time duration, into one single N-D array. That way I can easily get, for example, composite fields of all events for each time (hour, day, etc.).
I'm trying to do the same thing I do with NCL. Below is NCL code that works as expected (for me) on the same data:
f = addfiles( (/"eraINTERIM_t2m_201812.nc", "eraINTERIM_t2m_201901.nc"/), "r" )
ListSetType( f, "join" )
temp = f[:]->t2m
printVarSummary( temp )
The final result is an array with 4 dimensions, with the new one automatically named as ncl_join.
However, NCL doesn't respect the time axis: it joins the arrays and gives the resulting time axis the coordinates of the first file. So the time axis becomes useless.
However, as @AdrianTompkins rightly said, the time periods are different and xarray can't join data like this. So, to create such an array in Python with xarray, I think the only way is to delete the time coordinate from the arrays, so that the time dimension has only integer indexes.
The array given by xarray works as @AdrianTompkins described in his small example. Since it keeps the time coordinates for all merged data, I think the xarray behavior is the correct one, in comparison with NCL. But now I think that computing composites (as in the example given above) wouldn't be as easy as it seems with NCL.
In a small test, I print two values from the merged xarray array with
print( da_t2m[ 0, 0, 0, 0 ].values )
print( da_t2m[ 1, 0, 0, 0 ].values )
which results in
252.11412
nan
For the second case there is no data at the first time, as expected.
UPDATE: all of the answers helped me understand this problem better, so I'm adding an update here to also thank @kmuehlbauer for his answer, pointing out that his code gives the expected array.
Again, thank you all for your help!
Mateus
The result makes sense if the times are different.
To simplify it, forget about the lat-lon dimension for a moment and imagine you have two files that are simply data at 2 timeslices each. The first has data at timesteps 1 and 2, and the second at timesteps 3 and 4. You can't create a combined dataset with a time dimension that only spans 2 timeslices; the time dimension variable has to have the times 1,2,3,4. So if you say you want a new dimension "cases", the data is combined as a 2D array and would look like this:
times: 1, 2, 3, 4
cases: 1, 2

data:
                 time
             1    2    3    4
   cases 1: x1   x2    -    -
         2:  -    -   x3   x4
Think of the netCDF file that would be the equivalent: the time dimension has to span the range of values present in both files. The only way you could combine two files and get (cases: 2, time: 124, latitude: 241, longitude: 480) would be if both files had the same time, lat AND lon values, i.e. pointed to exactly the same region in time-lat-lon space.
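A tiny self-contained sketch of that alignment behaviour (made-up values, lat/lon omitted):

import xarray as xr

# two arrays on non-overlapping time axes, like two monthly files
a = xr.DataArray([10.0, 11.0], coords={'time': [1, 2]}, dims='time')
b = xr.DataArray([20.0, 21.0], coords={'time': [3, 4]}, dims='time')

combined = xr.concat([a, b], dim='cases')
print(combined.shape)   # (2, 4): the time axis now spans 1-4
print(combined.values)  # the missing time/case combinations are filled with NaN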
ps: Somewhat off-topic for the question, but if you are just starting a new analysis, why not instead switch to the new generation, higher resolution ERA-5 reanalysis, which is now available back to 1979 too (and eventually will be extended further back), you can download it straight to your desktop with the python api scripts from here:
https://cds.climate.copernicus.eu/cdsapp#!/search?type=dataset

Append multiple columns into two columns python

I have a csv file which contains approximately 100 columns of data. Each column represents temperature values taken every 15 minutes throughout the day for each of the 100 days. The header of each column is the date for that day. I want to convert this into two columns, the first being the date time (I will have to create this somehow), and the second being the temperatures stacked on top of each other for each day.
My attempt:
with open("original_file.csv") as ofile:
stack_vec = []
next(ofile)
for line in ofile:
columns = lineo.split(',') # get all the columns
for i in range (0,len(columns)):
stack_vec.append(columnso[i])
np.savetxt("converted.csv",stack_vec, delimiter=",", fmt='%s')
In my attempt, I am trying to create a new vector with each column appended to the end of it. However, the code is extremely slow and likely not working! Once I have this step figured out, I then need to take the date from each column and add 15 minutes to the date time for each row. Any help would be greatly appreciated.
If I got this correct, you have a CSV with 96 rows and 100 columns and want to stack it into one vector, day after day, giving a vector with 9600 entries, right?
An easy approach would be to use numpy:
import numpy as np
x = np.genfromtxt('original_file.csv', delimiter=',', skip_header=1)  # skip the header row of dates
data = x.ravel(order='F')
Note that numpy is a third-party library, but it is the go-to library for math.
The first line reads the CSV into an ndarray, which is like a matrix (even though it behaves differently for some mathematical operations).
Then with ravel you flatten it into a vector. The order='F' argument makes it flatten column by column, so each day's values stay together, i.e. day after day. (Leave it as the default if you want it flattened time point after time point.)
For your date problem, see "How can I make a python numpy arange of datetime"; I guess I couldn't give a better example.
Once you have these two arrays, you can ensure the shape with x.reshape(9600, 1) and then stack them with np.concatenate([x, dates], axis=1), with dates being your date vector.
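Putting the pieces together, here is a minimal sketch of the whole conversion, under the assumptions that the 100 days are consecutive and start at midnight on a known date (both the start date and the 96 x 100 layout are assumptions, not taken from the actual file):

import numpy as np

# flatten column by column so each day's 96 readings stay together
values = np.genfromtxt('original_file.csv', delimiter=',', skip_header=1).ravel(order='F')

# one timestamp per reading: 96 readings per day at 15-minute steps, for 100 days
start = np.datetime64('2020-01-01T00:00')  # assumed first date
dates = np.arange(start, start + np.timedelta64(100, 'D'), np.timedelta64(15, 'm'))

stacked = np.column_stack([dates.astype(str), values.astype(str)])
np.savetxt('converted.csv', stacked, delimiter=',', fmt='%s')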

Convert a 4D array to 3D in Python by merging the months and years columns

I have a 4D array that has two spatial directions, a month column and a year column. It gives a scalar value at each spatial point and for each month. I want to reshape this array to be 3D so that instead of the value being defined as x, y, month, year, it is just defined as x, y, month, where now the month column runs from 1-36 say with no year column instead of 1-12 with a year column of 1-3. How would I do this in Python? Thanks!
The basic approach is to code the new column something like:
new_month = old_month + 12*(old_year-1)
This translates your 3-year scale into a continuum of months numbered 1-36. I can't show you how to code this, because (1) you haven't given us reference code, so I have little idea how your 4D array is structured; (2) As I hope you've read in the help documentation, we're not a coding service.
Add a new column with values of (year-1)*12 + month, then discard or ignore your year and month columns. The details depend on exactly how your data is currently structured; if it is a numpy array, this would be about two lines of code!
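If the data really is a numpy array laid out as (x, y, month, year), those two lines could look roughly like this (the 361 x 361 x 12 x 3 shape is an assumption for illustration):

import numpy as np

arr4d = np.random.rand(361, 361, 12, 3)  # hypothetical (x, y, month, year) array

# move 'year' before 'month', then merge the two axes so month varies fastest:
# flat index = (year - 1) * 12 + (month - 1), i.e. the new month axis runs 1-36
arr3d = arr4d.transpose(0, 1, 3, 2).reshape(361, 361, 36)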

What is the fastest way to sample slices of numpy arrays?

I have a 3D (time, X, Y) numpy array containing 6-hourly time series for a few years (say 5). I would like to create a sampled time series containing one instance of each calendar day, randomly taken from the available records (5 possibilities per day), as follows.
Jan 01: 2006
Jan 02: 2011
Jan 03: 2009
...
this means I need to take 4 values from 01/01/2006, 4 values from 02/01/2011, etc.
I have a version that works as follows:
Reshape the input array to add a "year" dimension: (Time, Year, X, Y)
Create a 365-value array of randomly generated integers between 0 and 4
Use np.repeat and the array of integers to extract only the relevant values:
Example:
sampledValues = Variable[np.arange(numberOfDays * ValuesPerDays), sampledYears.repeat(ValuesPerDays),:,:]
This seems to work, but I was wondering if this is the best/fastest approach to solve my problem. Speed is important as I am doing this in a loop, and I would benefit from testing as many cases as possible.
Am I doing this right?
Thanks
EDIT
I forgot to mention that I filtered the input dataset to remove the 29th of Feb in leap years.
Basically, the aim of this operation is to find a 365-day sample that matches the long-term time series well in terms of mean, etc. If the sampled time series passes my quality test, I want to export it and start again.
The year 2008 was 366 days long, so don't reshape.
Have a look at scikits.timeseries:
import scikits.timeseries as ts
start_date = ts.Date('H', '2006-01-01 00:00')
end_date = ts.Date('H', '2010-12-31 18:00')
arr3d = ... # your 3D array [time, X, Y]
dates = ts.date_array(start_date=start_date, end_date=end_date, freq='H')[::6]
t = ts.time_series(arr3d, dates=dates)
# just make sure arr3d.shape[0] == len(dates) !
Now you can access the t data with day/month/year objects:
t[np.logical_and(t.day == 1, t.month == 1)]
so for example:
for day_of_year in xrange(1, 366):
    year = np.random.randint(2006, 2011)
    t[np.logical_and(t.day_of_year == day_of_year, t.year == year)]
    # returns a [4, X, Y] array with data from that day
Play with the attributes of t to make it work with leap years too.
I don't see a real need to reshape the array, since you can embed the year-size information in your sampling process, and leave the array with its original shape.
For example, you can generate a random offset (from 0 to 365), and pick the slice with index, say, n*365 + offset.
Anyway, I don't think your question is complete, because I didn't quite understand what you need to do, or why.
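For reference, a minimal sketch of the no-reshape fancy-indexing idea, assuming the array is ordered chronologically, is 6-hourly (4 values per day), and already has the 29th of Feb removed so every year contributes exactly 365 days (the 50x50 grid is just a placeholder):

import numpy as np

n_years, days, steps = 5, 365, 4
arr = np.random.rand(n_years * days * steps, 50, 50)      # hypothetical (time, X, Y) array

sampled_years = np.random.randint(0, n_years, size=days)  # one random year per calendar day

day_idx = np.repeat(np.arange(days), steps)               # 0,0,0,0,1,1,1,1,...
step_idx = np.tile(np.arange(steps), days)                # 0,1,2,3,0,1,2,3,...

# flat time index = ((year * 365 + day) * 4 + step)
time_idx = (sampled_years[day_idx] * days + day_idx) * steps + step_idx
sampled = arr[time_idx]                                   # shape (365*4, X, Y)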
