Extracting mean values from masked 2-D array - python

I want to extract a 12° x 12° region from lat/long/conductivity grids and calculate the mean conductivity value in this region. I can successfully apply masks to the lat/long grids, but somehow the same process is not working for the conductivity grid.
I've tried masking with for loops and now I'm using the numpy.ma.masked_where function. I can successfully plot the masked results (i.e. I can see that the region is extracted when I plot global maps), but the calculated mean conductivity values correspond to the non-masked data.
I did a simple example of what I want to do:
import numpy as np
import numpy.ma as ma
import matplotlib.pyplot as plt

x = np.linspace(1, 10, 10)
y = np.linspace(1, 10, 10)
xm = np.median(x)
ym = np.median(y)
x = ma.masked_outside(x, xm-3, xm+3)
y = ma.masked_outside(x, ym-3, ym+3)
x = np.ma.filled(x.astype(float), np.nan)
y = np.ma.filled(y.astype(float), np.nan)
x, y = np.meshgrid(x, y)
z = 2*x + 3*y
z = np.ma.masked_where(np.ma.getmask(x), z)
plt.pcolor(x, y, z)
plt.colorbar()
print('Maximum z:', np.nanmax(z))
print('Minimum z:', np.nanmin(z))
print('Mean z:', np.nanmean(z))
My code is:
def Observatory_Cond_Plot(filename, ndcfile, obslon, obslat, obsname, date):
    files = np.array(sorted(glob.glob(filename)))  # sort txt files containing the 2-D conductivity arrays
    filenames = ['January', 'February', 'March', 'April', 'May', 'June',
                 'July', 'August', 'September', 'October', 'November', 'December']  # used for naming output plots and files
    for i, fx in zip(filenames, files):
        ndcdata = Dataset(ndcfile)  # load netcdf file
        lat = ndcdata.variables['latitude'][:]  # import latitude data
        long = ndcdata.variables['longitude'][:]  # import longitude data
        cond = np.genfromtxt(fx)
        cond, long = shiftgrid(180., cond, long, start=False)
        # Mask lat and long arrays and fill masks with nan values
        lat = ma.masked_outside(lat, obslat-12, obslat+12)
        long = ma.masked_outside(long, obslon-12, obslon+12)
        lat = np.ma.filled(lat.astype(float), np.nan)
        long = np.ma.filled(long.astype(float), np.nan)
        longrid, latgrid = np.meshgrid(long, lat)
        cond = np.ma.masked_where(np.ma.getmask(longrid), cond)
        cond = np.ma.filled(cond.astype(float), np.nan)
        condmean = np.nanmean(cond)
        print('Mean Conductivity is:', condmean)
        print('Minimum conductivity is:', np.nanmin(cond))
        print('Maximum conductivity is:', np.nanmax(cond))
After that, the rest of the code just plots the data.
My results are:
Mean Conductivity is: 3.5241649673154587
Minimum conductivity is: 0.497494528344129
Maximum conductivity is: 5.997825822915771
However, from my maps it is clear that the conductivity in this region should not be lower than 3.2 S/m. Also, printing the lat, long and cond grids gives:
long:
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
lat:
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
cond:
[[ nan nan nan ... nan nan nan]
[ nan nan nan ... nan nan nan]
[2.86749432 2.86743283 2.86746221 ... 2.87797247 2.87265508 2.87239185]
...
[ nan nan nan ... nan nan nan]
[ nan nan nan ... nan nan nan]
[ nan nan nan ... nan nan nan]]
And it seems like the mask is not working properly.

The problem is that the call to np.ma.filled de-masks the long variable. Also, np.meshgrid doesn't preserve the masks.
You could save the masks directly after creating them and build the meshgrid from the masks as well. I adapted your example accordingly. It shows that all versions of the numpy mean take the mask into account. I had to change the upper limit (to 2), because otherwise the two means came out equal and the difference would not be visible.
import numpy as np

x = np.linspace(1, 10, 10)
y = np.linspace(1, 10, 10)
xm = np.median(x)
ym = np.median(y)
# Note: changed limits
x = np.ma.masked_outside(x, xm-3, xm+2)
y = np.ma.masked_outside(x, ym-3, ym+2)
xmask = np.ma.getmask(x)
ymask = np.ma.getmask(y)
x, y = np.meshgrid(x, y)
xmask, ymask = np.meshgrid(xmask, ymask)
z = 2*x + 3*y
z1 = np.ma.masked_where(np.ma.getmask(x), z)
z2 = np.ma.masked_where(xmask | ymask, z)
print(z1)
print(z2)
print('Type z1, z2:', type(z1), type(z2))
print('Maximum z1, z2:', np.nanmax(z1), np.nanmax(z2))
print('Minimum z1, z2:', np.nanmin(z1), np.nanmin(z2))
print('Mean z1, z2:', np.mean(z1), np.mean(z2) )
print('nan Mean z1, z2:', np.nanmean(z1), np.nanmean(z2) )
print('masked Mean z1, z2:', z1.mean(), z2.mean())
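Applied to your conductivity code, a minimal sketch of the same idea (assuming your variable names, with 1-D lat/long arrays that match the cond grid) would keep the masks instead of filling with NaN and combine them before taking the mean:
latmask = np.ma.getmaskarray(np.ma.masked_outside(lat, obslat-12, obslat+12))
longmask = np.ma.getmaskarray(np.ma.masked_outside(long, obslon-12, obslon+12))
longmaskgrid, latmaskgrid = np.meshgrid(longmask, latmask)   # 2-D masks matching cond
cond = np.ma.masked_where(longmaskgrid | latmaskgrid, cond)  # mask cond outside the box
print('Mean Conductivity is:', cond.mean())   # masked-array mean ignores masked cells
print('Minimum conductivity is:', cond.min())
print('Maximum conductivity is:', cond.max())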

Beware that any kind of simple mean calculation (summing and dividing by the total), such as np.mean, will not give you the correct answer if you are averaging over a lat-lon grid, since the grid-cell area shrinks as you move towards the poles. You need to take a weighted average, weighting each cell by cos(lat), as in the sketch below.
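For example, a minimal numpy sketch of a cos(lat)-weighted mean (assuming a 2-D cond array with NaN outside the region of interest and a matching 2-D latgrid in degrees):
weights = np.cos(np.deg2rad(latgrid))   # grid-cell area weight ~ cos(latitude)
valid = ~np.isnan(cond)                 # average only over cells inside the region
weighted_mean = np.nansum(cond * weights) / weights[valid].sum()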
As you say you have the data in netCDF format, I hope you will permit me to suggest an alternative solution from the command line using the Climate Data Operators (CDO) utility (on Ubuntu you can install it with sudo apt install cdo).
To extract the region of interest:
cdo sellonlatbox,lon1,lon2,lat1,lat2 infile.nc outfile.nc
Then you can work out the correct weighted mean with:
cdo fldmean infile.nc outfile.nc
You can pipe the two together like this:
cdo fldmean -sellonlatbox,lon1,lon2,lat1,lat2 infile.nc outfile.nc

Related

Sum Data variables of Dataset

I've merged a list of DataArrays into one Dataset using the code below:
surface_dataarray = []
for (key, value) in surfaces_item_service.items():
    print(f'{key}: {value}')
    single_surface_class = __yearly_surface_type(folder_path, provider, data_source, bins, key)
    single_surface_class.name = key
    if single_surface_class.count() > 1:
        single_surface_class.rio.to_raster(output_file_path + f'/{key}.tif', driver="GTiff")
    surface_dataarray.append(single_surface_class)
surface_data = xr.merge(surface_dataarray)
And I obtained a Dataset like the one below:
<xarray.Dataset>
Dimensions: (x: 1868, y: 1373)
Coordinates:
band int64 1
* x (x) float64 4.269e+05 4.269e+05 ... 4.455e+05 4.455e+05
* y (y) float64 4.53e+06 4.53e+06 ... 4.516e+06 4.516e+06
spatial_ref int64 0
year int64 2020
variable <U15 'CLASSIFIED DATA'
Data variables:
water_surface (y, x) float32 nan nan nan nan nan ... nan nan nan nan
non_green_surface (y, x) float32 nan nan nan nan nan ... nan nan nan nan
green_surface (y, x) float32 nan nan nan nan nan ... nan nan nan nan
Is it possible to sum the Data variables?
I need to save the result as a single band.
NB: there are NaN values because my area doesn't have values at the borders.
Yes, one way to do it is to use Dataset.to_array, which will combine all the data variables into a single array along a new "variable" dimension:
sum_of_data_variables = ds.to_array().sum("variable")
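Note that the sum skips NaNs by default (skipna=True); if you want cells that are NaN in every variable to stay NaN instead of becoming 0, you can pass min_count=1. To then write the result as a single band, a minimal sketch (assuming rioxarray, which you already use in your loop; the output path is just a placeholder):
summed = ds.to_array().sum("variable", min_count=1)   # NaN only where all variables are NaN
summed.rio.to_raster("summed_surfaces.tif", driver="GTiff")   # placeholder output path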

Pandas Dataframe '[nan nan nan ... nan nan nan] not found in axis'

I am receiving this error:
'[nan nan nan ... nan nan nan] not found in axis'
when trying to drop rows from a dataframe where the value of a column is zero.
train_df.head()
external_company_id company_name email_domain mx_record ...
NaN Expresstext expresstext.net unknown expresstext.net ... 0.0 0.0 0.0 0.0
NaN Jobox jobox.ai unknown www.jobox.ai ... 17.0 -31.0 9.0 30.0
NaN Relola relola.com unknown home.relola.com ... 5.0 -25.0 5.0
train_df.drop(train_df[train_df['total_funding'] == float(0)].index, inplace = True, axis=0)
'[nan nan nan ... nan nan nan] not found in axis'
What would be causing this error?
I learned that pandas had used the first column as the index when reading the csv.
Because my first column was empty, every index value ended up being NaN, as seen in the question above.
I ran these two lines of code to create a new index and fill it.
train_df.index.name = 'id'
train_df.index = [x for x in range(1, len(train_df.values)+1)]
After that, the former error disappeared.
Instead of:
train_df.drop(...)
try:
train_df = train_df[train_df['total_funding'] != 0]
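Another option, since the root cause is the empty first column becoming the index, is to stop pandas from using it as the index when reading the file. A minimal sketch (the filename is just a placeholder):
import pandas as pd

# index_col=False tells read_csv to build a default RangeIndex
# instead of taking the first (empty) column as the index
train_df = pd.read_csv("train.csv", index_col=False)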

Python Pandas - Rolling regressions for multiple columns in a dataframe

I have a large dataframe containing daily timeseries of prices for 10,000 columns (stocks) over a period of 20 years (5000 rows x 10000 columns). Missing observations are indicated by NaNs.
0 1 2 3 4 5 6 7 8 \
31.12.2009 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
01.01.2010 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
04.01.2010 31.85 66.99 NaN NaN NaN NaN 404.93 57.04 NaN
05.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
06.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
Now I want to run a rolling regression with a 250-day window for each column over the whole sample period and save the coefficients in another dataframe.
Iterating over the columns and rows using two for-loops isn't very efficient, so I tried this, but I get the following error message:
def regress(start, end):
    y = df_returns.iloc[start:end].values
    if np.isnan(y).any() == False:
        X = np.arange(len(y))
        X = sm.add_constant(X, has_constant="add")
        model = sm.OLS(y, X).fit()
        return model.params[1]
    else:
        return np.nan

regression_window = 250
for t in (regression_window, len(df_returns.index)):
    df_coef[t] = df_returns.apply(regress(t-regression_window, t), axis=1)
TypeError: ("'float' object is not callable", 'occurred at index 31.12.2009')
Here is my version, using df.rolling() instead and iterating over the columns.
I am not completely sure it is what you were looking for; don't hesitate to comment.
import numpy as np
import pandas as pd
import statsmodels.regression.linear_model as sm
import statsmodels.tools.tools as sm2

df_returns = pd.DataFrame({'0': [30, 30, 31, 32, 32], '1': [60, 60, 60, 60, 60], '2': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})

def regress(X, Z):
    if np.isnan(X).any() == False:
        model = sm.OLS(X, Z).fit()
        return model.params[1]
    else:
        return np.NaN

regression_window = 3
Z = np.arange(regression_window)
Z = sm2.add_constant(Z, has_constant="add")

df_coef = pd.DataFrame()
for col in df_returns.columns:
    df_coef[col] = df_returns[col].rolling(window=regression_window).apply(lambda col: regress(col, Z))
df_coef

Ignoring nan values when creating confidence intervals

As the title suggests, I am trying to create confidence intervals based on a table with a ton of NaN values. Here is an example of what I am working with.
Attendence% 2016-10 2016-11 2017-01 2017-02 2017-03 2017-04 ...
Name
Karl nan 0.2 0.4 0.5 0.2 1.0
Alice 1.0 0.7 0.6 nan nan nan
Ryan nan nan 1.0 0.1 0.9 0.2
Don nan 0.5 nan 0.2 nan nan
Becca nan 0.2 0.6 0 nan nan
For reference, in my actual dataframe there are more NaNs than not, and they represent months where the person did not need to show up, so replacing the values with 0 would affect the result.
Now every time I try applying a confidence interval to each name, it returns the mean as NaN, as well as both interval bounds.
Karl (nan, nan, nan)
Alice (nan, nan, nan)
Ryan (nan, nan, nan)
Don (nan, nan, nan)
Becca (nan, nan, nan)
Is there a way to filter out the NaNs so the formula is applied without taking the NaN values into account? So far this is what I have been doing (unstacked being the table shown above):
def mean_confidence_interval(unstacked, confidence=0.9):
    a = 1.0 * np.array(unstacked)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

answer = unstacked.apply(mean_confidence_interval)
answer
Use np.nanmean instead of np.mean: https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanmean.html
And for scipy.stats.sem(a), pass nan_policy='omit': scipy.stats.sem(a, nan_policy='omit').
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.sem.html
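Putting that together, a sketch of the adapted function (I also count only the non-NaN values for n so the t-quantile uses the right degrees of freedom; it is applied with unstacked.apply exactly as before):
import numpy as np
import scipy.stats

def mean_confidence_interval(unstacked, confidence=0.9):
    a = 1.0 * np.array(unstacked)
    n = np.count_nonzero(~np.isnan(a))           # number of valid (non-NaN) observations
    m = np.nanmean(a)                            # mean, ignoring NaNs
    se = scipy.stats.sem(a, nan_policy='omit')   # standard error, ignoring NaNs
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    return m, m - h, m + h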

function returning nan in for loop when performing multivariate linear regression

I am performing multivariate linear regression in pure Python, as seen in the code below. Can someone please tell me what's wrong with this code?
I have done the same for univariate linear regression, and it worked well there!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x_df = pd.DataFrame([[2.0,70.0],[3.0,30.0],[4.0,80.0],[4.0,20.0],[3.0,50.0],[7.0,10.0],[5.0,50,0],[3.0,90.0],[2.0,20.0]])
y_df = pd.DataFrame([79.4,41.5,97.5,36.1,63.2,39.5,69.8,103.5,29.5])
x_df = x_df.drop(x_df.columns[2:], axis=1)
#print(x_df)
m = len(y_df)
#print(m)
x_df['intercept'] = 1
X = np.array(x_df)
#print(X)
#print(X.shape)
y = np.array(y_df).flatten()
#print(y.shape)
theta = np.array([0,0,0])
#print(theta)

def hypothesis(x, theta):
    return np.dot(x, theta)
#print(hypothesis(X,theta))

def cost(x, y, theta):
    m = y.shape[0]
    h = np.dot(x, theta)
    return np.sum(np.square(y-h))/(2.0*m)
#print(cost(X,y,theta))

def gradientDescent(x, y, theta, alpha=0.01, iter=1500):
    m = y.shape[0]
    for i in range(1500):
        h = hypothesis(x, theta)
        error = h - y
        update = np.dot(error, x)
        theta = np.subtract(theta, ((alpha*update)/m))
    print('theta', theta)
    print('hyp', h)
    print('y', y)
    print('error', error)
    print('cost', cost(x, y, theta))

print(gradientDescent(X, y, theta))
and the output I get is:
theta [ nan nan nan]
hyp [ nan nan nan nan nan nan nan nan nan]
y [ 79.4 41.5 97.5 36.1 63.2 39.5 69.8 103.5 29.5]
error [ nan nan nan nan nan nan nan nan nan]
cost nan
Can someone please help me solve this? I have been stuck for almost 5 hours trying!
Your learning rate is too large for gradient descent to converge; try alpha=0.00001.
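For example, reusing the function from the question (the value is only a starting point to experiment with):
gradientDescent(X, y, theta, alpha=0.00001)
Alternatively, scaling the second feature (which ranges up to 90) down to the same order of magnitude as the first would let a larger learning rate converge.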
