I am still fairly new to programming and python in particular.
I have spatial dust optical depth data, with dimensions of lat, lon, and time.
I have managed to make daily plots of a dust plume as it moves across the Atlantic, and now I am trying to make similar plots masking out any cell below the 90th percentile value.
I have an array of 90th percentile values created with the following code:
#====== select and average over study region ======#
tropatl_mjjas = dod_2.sel(latitude=slice(25, 8), longitude=slice(270, 342)).mean(dim=('latitude', 'longitude'))
#====== resample to daily then drop NAN ======#
tropatl_mjjas_daily = tropatl_mjjas.resample(time='D').mean(dim='time').dropna(dim='time')
#====== reshape to 17 rows x 153 columns ======#
tropatl_2d = np.reshape(tropatl_mjjas_daily.values,(17, 153))
percvalues = np.zeros(92)
k = 0
for i in range(24, 152):
    if i == 116:
        break
    ravel_1 = np.ravel(tropatl_2d[:, i:i+15])
    percvalues[k] = np.percentile(ravel_1, 90)
    k += 1
print(percvalues)
What I would like to do is somehow check each cell against the 90th percentile for that day in the study region. This particular dust event is 12 days long. I started with trying it for a single day:
data = dod.sel(time='2015-06-18').resample(time='D').mean(dim='time')
valid_data = data[data>percvalues[24]]
But I get this error: IndexError: 3-dimensional boolean indexing is not supported.
Here is what the data array looks like:
[data array][1]; you can see the variable name is 'duaod550'.
I tried this instead:
valid_data = data[dod>percvalues[24]]
IndexError: Boolean array size 49672 is used to index array with shape (1,).
and
valid_data = data[duaod550>percvalues[24]]
NameError: name 'duaod550' is not defined
So how do I work around the dimension issues to make the code compare each cell to the 90th percentile value?
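In other words, I am after something like this sketch (just my guess at the shape of the solution, reusing 'data' and 'percvalues' from above; I don't know whether .where is the right tool here):
# a rough sketch of the masking I want: keep cells above the day's 90th percentile,
# everything else becomes NaN (assumes 'data' is the daily-mean DataArray from above)
data = dod.sel(time='2015-06-18').resample(time='D').mean(dim='time')
valid_data = data.where(data > percvalues[24])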
Thank you in advance for any help.
[1]: https://i.stack.imgur.com/m1FHV.png
I am trying to plot a time series of the sea surface temperature (SST) for a specific region from a .nc file. The SST is a three-dimensional variable (lat, lon, time) that has mean daily values for a specific region from 1982 to 2016. I want my plot to reflect the seasonal SST variability of the entire period. I assume that what I need to do first is to obtain a mean SST value for my lat/lon region for each of the days, which I can then work with later on. So far, I assume that I need to read the .nc file and the variables:
import netCDF4 as nc
f = nc.Dataset('cmems_SST_MED_SST_L4_REP_OBSERVATIONS_010_021_1639073212518.nc')
sst = f.variables['analysed_sst'][:]
lon = f.variables['longitude'][:]
lat = f.variables['latitude'][:]
Next, following the code suggested here, I tried to reshape and obtain the mean, but an error pops up:
global_average= np.nanmean(sst[:,:,:],axis=(1,2))
annual_temp = np.nanmean(np.reshape(global_average, (34,12)), axis = 1)
#34 years between 1982 and 2016, and 12 months per year.
ERROR cannot reshape array of size 14008 into shape (34,12)
From here I found different approaches, like using cdo or nco (which didn't work due to installation problems), among others that were not suitable for my case. I used nanmean because I know that in MATLAB this is done with the nanmean function. I am quite new to this topic and would like to ask for some hints, like where I should focus or which path is more suitable for this case. Thank you!!
Handling daily data with just pure Python is difficult because you have to account for leap years, and subsetting a region requires tedious index striding.
As steTATO mentioned, since the data you are working with has daily temporal resolution, you need to consider the following:
You would need to reshape global_average to (34, 365) or (34, 366) depending on the year (the leap years are 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012 and 2016). So your code above would look something like
annual_temp = np.nanmean(np.reshape(global_average, (34,365)), axis = 1)
But, like I said, because of the leap years, you can't do the things you want by simply reshaping the global_average.
If I had no choice but to use python only, I'd do the following
import numpy as np
def days_in_year(in_year):
    leap_years = [1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016]
    if in_year in leap_years:
        out_days = 366
    else:
        out_days = 365
    return out_days
# some of your code, importing netcdf data
year = np.arange(1982, 2017)
global_avg = np.nanmean(sst[:, :, :], axis=(1, 2))
annual_avgs = []
i = 0
for yr in range(35):
    i = i + days_in_year(year[yr])   # end index (exclusive) of this year
    f = i - days_in_year(year[yr])   # start index of this year
    annual_avg = np.nanmean(global_avg[f:i])
    annual_avgs.append(annual_avg)
The code above steps through global_avg in year-length strides, accounting for leap years, and averages each stride into annual_avgs.
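If installing xarray is an option, a rough sketch of the same annual averaging (assuming the file and variable names from your question, and that the dimensions are called latitude and longitude) would avoid the manual leap-year bookkeeping:
import xarray as xr

# a sketch only; leap years are handled automatically by the time index
ds = xr.open_dataset('cmems_SST_MED_SST_L4_REP_OBSERVATIONS_010_021_1639073212518.nc')
regional_mean = ds['analysed_sst'].mean(dim=['latitude', 'longitude'])  # daily regional mean
annual_avgs = regional_mean.groupby('time.year').mean('time')           # one value per year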
Let's say I have a 3-dimensional Numpy Array. It is daily data of a year and 1-degree pixels of the globe, resulting in a shape of (365, 180, 360).
Now, suppose you want to insert a duplicate of the 16th of January at its position, so that each time series becomes:
...val_0114, val_0115, val_0116, val_0116, val_0117, val_0118, ...
I could do it like:
arr_new = np.empty((arr_old.shape[0] + 1, 180, 360)) * np.nan
for _lat in range(arr_new.shape[1]):
    for _lon in range(arr_new.shape[2]):
        arr_new[:, _lat, _lon] = np.insert(arr_old[:, _lat, _lon], 15, arr_old[15, _lat, _lon])
But I would like to find a fancier way, without the loop.
In NumPy you can index an array along several axes at once, similar to how you work with multi-dimensional lists:
# If you want to place the same temperature in the whole world for one day
arr_old[15] = 15          # day index 15 is 16 January, if index 0 is 1 January
# If you want to place the same temperature in a certain place for the whole year
arr_old[:, lat, lon] = 15
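For the duplication itself, a loop-free sketch (assuming arr_old has shape (365, 180, 360) and index 0 is 1 January) is to let np.insert work on the whole time axis at once:
import numpy as np

# insert a copy of the 16 January slice; everything after it shifts back by one day
arr_new = np.insert(arr_old, 15, arr_old[15], axis=0)   # arr_new.shape == (366, 180, 360)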
I'm working through the pangeo tutorial gallery and am stuck on the ENSO exercise at the end of the xarray lesson.
To reproduce it, you'll need to download some files:
%%bash
git clone https://github.com/pangeo-data/tutorial-data.git
Then:
import numpy as np
import xarray as xr
import pandas as pd
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
# subset years to match hint at the bottom
sst_enso = sst_enso.sel(time=sst_enso.time.dt.year>=1982)
# groupby each timepoint and find mean for entire spatial region
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
This figure matches the one shown at the bottom of the tutorial. So far so good, but I'd like to compute and plot the ONI as well. Warm or cold phases of the Oceanic Niño Index are defined by five consecutive overlapping 3-month running means of sea surface temperature (SST) anomalies in the Niño 3.4 region that are at or above (below) the threshold of +0.5°C (-0.5°C). This is known as the Oceanic Niño Index (ONI).
I run into trouble because the month becomes an index.
Q1. I'm not sure how to make sure that subtracting sst_enso - enso_clim results in the correct math.
Assuming that is correct, I can compute the regional mean anomaly again and then use a rolling window mean.
enso_clim = sst_enso.sst.groupby('time.month').mean('time')
sst_anom = sst_enso - enso_clim
enso_anom = sst_anom.groupby('time').mean(dim=['lat','lon'])
oni = enso_anom.rolling(time = 3).mean()
Now I'd like to plot a bar chart of oni with positive values red and negative values blue, for example with:
oni.sst.plot.bar(color=(oni.sst < 0).map({True: 'b', False: 'r'}))
Instead, oni.sst.plot() gives me a plot that still carries the month dimension.
Resetting the index with enso_anom.reset_index('month', drop=True).sst still keeps month as a dimension and gives the same plot. If you drop_dims('month'), then the sst data goes away.
I also tried converting to a pandas DataFrame with oni.to_dataframe(), but you end up with 5040 rows, which is 12 months × the 420 monthly time steps I subsetted. According to the docs, "The DataFrame is indexed by the Cartesian product of index coordinates (in the form of a pandas.MultiIndex)", so I guess that makes sense, but it is not useful. Even if you reset_index on oni before converting to a DataFrame you get the same 5040 rows. Q2. Since the DataFrame must be repeating itself I can probably figure out where, but is there a way to do this more cleanly, with each date not repeated for all 12 months?
Your code results in a DataArray with both time and month dimensions, because subtracting the monthly climatology outside of a groupby broadcasts the two against each other. This is why you end up with such a plot.
There is a trick (found here) to calculate anomalies: do the subtraction inside a groupby over 'time.month'. Besides this, I would select 1986-2015 as the reference period (see the NOAA definition of the ONI index).
Combining both, I ended up with this short code (without the bar plots):
import xarray as xr
import pandas as pd
import matplotlib.pyplot as plt
# load all files
ds_all = xr.open_mfdataset('./tutorial-data/sst/*nc', combine='by_coords')
# slice for enso3.4 region
sst_enso = ds_all.sel(lat=slice(-5,5), lon=slice(-170+360,-120+360))
avg_enso = sst_enso.sst.groupby('time').mean(dim=['lat','lon'])
avg_enso.plot()
ds = sst_enso.sst.mean(dim=['lat','lon'])
enso_clim = ds.sel(time=slice('1986-01-01', '2016-01-01')).groupby("time.month").mean("time")
# ref: https://origin.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ONI_change.shtml
enso_anom = ds.groupby("time.month") - enso_clim
# ref: http://xarray.pydata.org/en/stable/examples/weather-data.html#Calculate-monthly-anomalies
enso_anom.plot()
oni = enso_anom.rolling(time = 3).mean()
oni.plot()
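For the red/blue bars themselves, a minimal sketch on top of the code above (just one way of doing it, using matplotlib directly and the oni variable computed above) could be:
# colour each bar by the sign of the ONI value; the first two values are NaN from the rolling mean
colors = ['r' if v > 0 else 'b' for v in oni.values]
plt.bar(oni['time'].values, oni.values, width=20, color=colors)   # width is in days for monthly data
plt.axhline(0.5, color='k', linestyle='--', linewidth=0.5)    # El Niño threshold
plt.axhline(-0.5, color='k', linestyle='--', linewidth=0.5)   # La Niña threshold
plt.show()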
I have a data frame with the fields 'unique_years' and 'counts'. I plotted this data frame and I am getting the following histogram: histogram - example. I need to define a start-year variable, but if there are empty gaps at the start of the histogram I need to skip them and shift the starting year. I was wondering if there is a Pythonic way to do this. In the histogram example I have a non-empty bin at the starting point, but then there is a big gap of empty bins. So I need to find the point where the non-empty bins become continuous and define that point as the starting year (for the sample below the starting year should be 1935). The n numpy.ndarray gives me information about which bins are empty, but I need an efficient way to work this out. Thank you :)
Sample of my data frame:
import pandas as pd
data = {'unique_years': [1907, 1935, 1938, 1939, 1940],
        'counts': [11, 14, 438, 85, 8]}
df = pd.DataFrame(data, columns = ['unique_years', 'counts'])
Code for the histogram plot:
import matplotlib.pyplot as plt

(n, bins, patches) = plt.hist(df.unique_years, bins=25, label='hst')
plt.show()
The issue with your question is that 'continuous' is not really well defined here. Do you mean that every year should have a non-empty count (that is fairly easy to do as you can filter your data for that prior to building your histogram), or should every consecutive bucket be non empty? If the latter, this means that you must:
Build your histogram
Filter your data on the resulting bins
Either use the filtered histogram or re-bin the remaining data, with bin sizes not guaranteed to stay the same (so it is possible that you have the same issue with the new bins!)
As it is difficult to know exactly what is relevant in your exact case, I think the best answer would be to give you a set of tools that you can use as you see fit for the exact problem that you are encountering:
I want to filter my data starting from a certain date
filtered = df.unique_years[df.unique_years > 1930]
I want to find the second non-empty bin
(n, x) = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]
From there you can:
rebin your filtered data:
(n, x) = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]
# Re-binning on the filtered data
plt.hist(df.unique_years[df.unique_years >= x[second_nonempty]], bins=25)
Plot your histogram directly on the filtered bins:
(n, x) = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]
# Forcing the bins to take the provided values
plt.hist(df.unique_years, bins=x[second_nonempty:])
Now the 'second_nonempty' above can of course be replaced by any estimator of where you want to start, e.g.:
# Last empty bin + 1
all_bins_full_after = np.where(n == 0)[0][-1] + 1
Or anything else really
This should work to eliminate all the years that are not part of a consecutive run. I am working mainly on the df; you can use the result to plot your histogram.
df = pd.DataFrame(data, columns = ['unique_years', 'counts'])
yd = df.unique_years.diff().eq(1)
df[yd|yd.shift(-1)]
This is the result you would get:
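With the sample frame above, the rows that survive the filter should be:
   unique_years  counts
2          1938     438
3          1939      85
4          1940       8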
I have two three-dimensional arrays a and b with [time,lat,lon]. I want to correlate the time series of each grid cell like correlate(a[:,0,0],b[:,0,0]), correlate(a[:,0,1],b[:,0,1]), ... . I'm aiming for two correlations. One with the entire time series and one only where array a surpasses a certain threshold.
The datasets also include some missing values in the time series and I read in both datasets with Xarray. Correlations and masking are done using numpy.
At the moment I walk through each latitude and longitude, grab the time series, mask them to account for NaNs and the threshold, and correlate them. My code looks like this:
def correlate(A, B, var1, var2, TH):
    name = "corr_" + var1 + "_" + var2 + "_TH_" + str(TH) + ".nc"
    a = xr.open_dataset(A).sel(time=slice('1950-03', '2013-12'))
    b = xr.open_dataset(B).sel(time=slice('1950-03', '2013-12'))
    corr = np.empty([a[var1].shape[1], a[var1].shape[2]], dtype=float)
    corr_TH = np.empty_like(corr)   # separate array; corr_TH = corr would just alias corr
    varname_TH = "r_TH_" + str(TH)
    for lt in range(corr.shape[0]):
        for ln in range(corr.shape[1]):
            corr[lt, ln] = np.ma.corrcoef(a[var1][:, lt, ln], b[var2][:, lt, ln], rowvar=True)[0, 1]
            corr_TH[lt, ln] = np.ma.corrcoef(np.ma.masked_greater(a[var1][:, lt, ln], TH), b[var2][:, lt, ln], rowvar=True)[0, 1]
    # save both correlation fields
    ds = xr.Dataset({'r': (['lat', 'lon'], corr), varname_TH: (['lat', 'lon'], corr_TH)},
                    coords={'lon': a['lon'], 'lat': a['lat']})
    return ds
This works in general but is super slow. I found the Xarray function array.stack() to flatten the arrays and tried something like:
A_stack = A.var1.stack(z=('lat','lon'))
B_stack = B.var2.stack(z=('lat','lon'))
cov = ((A_stack - A_stack.mean(axis=0))* (B_stack - B_stack.mean(axis=0))).mean(axis=0)
corr = cov / (A_stack.std(axis=0) * B_stack.std(axis=0))
The multi-index 'z' over which the array is stacked is retained through the process; however, the correlation array at the end is empty. I suppose that's because of the NaNs.
Does anyone have an idea how to do this?
Thanks
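A possible direction (only a sketch, assuming xarray >= 0.16 and that a_da and b_da are placeholder names for the two DataArrays with dims time, lat, lon): xr.corr computes the Pearson correlation along a dimension and handles missing values, so the nested loops and the manual covariance would not be needed.
import xarray as xr

# full time series; NaNs are skipped pairwise
corr = xr.corr(a_da, b_da, dim='time')
# mirrors np.ma.masked_greater(a, TH): values of a above TH are dropped before correlating
corr_TH = xr.corr(a_da.where(a_da <= TH), b_da, dim='time')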