Aggregating points using xarray - python

I have a set of netcdf datasets that basically look like a CSV file with columns for latitude, longitude, value. These are points along tracks that I want to aggregate to a regular grid of (say) 1 degree from -90 to 90 and -180 to 180 degrees, by for example calculating the mean and/or standard deviation of all points that fall within a given cell.
This is quite easily done with a loop:
D = np.zeros((180, 360))
for ilat in np.arange(-90, 90, 1):
    for ilon in np.arange(-180, 180, 1):
        p1 = np.logical_and(ds.lat >= ilat, ds.lat <= ilat + 1)
        p2 = np.logical_and(ds.lon >= ilon, ds.lon <= ilon + 1)
        if np.sum(p1 * p2) == 0:
            D[90 + ilat, 180 + ilon] = np.nan
        else:
            # "var" stands in for the real variable name; ds.var would resolve to the Dataset.var method
            D[90 + ilat, 180 + ilon] = np.mean(ds["var"].values[p1 * p2])
            # D[90 + ilat, 180 + ilon] = np.std(ds["var"].values[p1 * p2])
Other than using numba/cython to speed this up, I was wondering whether this is something you can directly do with xarray in a more efficient way?

You should be able to solve this using pandas and xarray.
First convert your dataset to a pandas data frame.
Assuming df is that data frame and the longitude and latitude columns are named lon and lat, round the lon/lat values to the nearest integer and then calculate the mean for each lon/lat pair. The groupby leaves lat and lon as the index, so you can then call the data frame's to_xarray method to convert the result to an xarray object:
import xarray as xr
import pandas as pd
import numpy as np
df = df.assign(lon=lambda x: np.round(x.lon))
df = df.assign(lat=lambda x: np.round(x.lat))
# groupby makes (lat, lon) the index, which to_xarray turns into dimensions
df = df.groupby(["lat", "lon"]).mean()
ds = df.to_xarray()

I used @robert-wilson's answer as a starting point, and to_xarray is indeed part of my solution. Other inspiration came from here. The approach that I used is shown below. It's probably slower than numba-ing my solution above, but much simpler.
import netCDF4
import numpy as np
import xarray as xr
import pandas as pd
fname = ""
f = netCDF4.Dataset(fname)
lat = f.variables['lat'][:]
lon = f.variables['lon'][:]
vari = f.variables['super_duper_variable'][:]
df = pd.DataFrame({"lat": lat, "lon": lon, "vari": vari})
# Simple functions to calculate the grid location in rows/cols
# using lat/lon as inputs. Global 0.5 deg grid
# Remember to cast to integer
to_col = lambda x: int(np.floor((x + 90) / 0.5))
to_row = lambda x: int(np.floor((x + 180) / 0.5))
# Map the latitudes to columns
df['col'] = df.lat.map(to_col)
# Map the longitudes to rows
df['row'] = df.lon.map(to_row)
# Aggregate by row and col
gg = df.groupby(['col', 'row'])
# Now, create an xarray dataset with
# the mean of vari per grid cell
ds = gg.mean().to_xarray()
dx = gg.std().to_xarray()
ds['stdi'] = dx['vari']
dx = gg.count().to_xarray()
ds['counti'] = dx['vari']
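As a rough, self-contained sketch of the same idea without a NetCDF file: the snippet below bins synthetic points onto a 1 degree grid using only numpy and pandas. The column names (lat, lon, value) and the random data are assumptions made for illustration, not part of the original datasets.

```python
import numpy as np
import pandas as pd

# Synthetic track points (assumed column names: lat, lon, value)
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "lat": rng.uniform(-90, 90, n),
    "lon": rng.uniform(-180, 180, n),
    "value": rng.normal(size=n),
})

# Integer cell indices on a 1-degree grid; clip so that lat == 90 or
# lon == 180 falls into the last cell instead of out of range.
df["ilat"] = np.clip(np.floor(df["lat"] + 90).astype(int), 0, 179)
df["ilon"] = np.clip(np.floor(df["lon"] + 180).astype(int), 0, 359)

# One pass over the data: mean, standard deviation and count per cell
stats = df.groupby(["ilat", "ilon"])["value"].agg(["mean", "std", "count"])

# Scatter the per-cell means into a dense grid; empty cells stay NaN
grid_mean = np.full((180, 360), np.nan)
ilat = stats.index.get_level_values("ilat").to_numpy()
ilon = stats.index.get_level_values("ilon").to_numpy()
grid_mean[ilat, ilon] = stats["mean"].to_numpy()
```

Note that pandas computes the sample standard deviation (ddof=1), whereas np.std in the loop version defaults to ddof=0, so the two will differ slightly for cells with few points.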


How to collocate large datasets most efficiently, comparing time, latitude (x), and longitude (y)

I would like some help trying to efficiently collocate two datasets, one is let's say observations of rainfall, in terms of datetime, latitude and longitude. The other is meteorological data e.g. reanalysis given also in terms of datetime, latitude and longitude. Below I provide two example random df and xarrays and then collocate them.
from numpy.random import rand
from random import randint
from datetime import datetime, timedelta
import xarray as xr
import numpy as np
import pandas as pd
#create example data for the dataframe we want to collocate with the meteorological data
datetimes = pd.date_range(start='2002-01-01 10:00:00', end='2002-01-05 10:00:00', freq='H')
rainfall = rand(len(datetimes))
latitudes = [randint(0, 90) for p in range(0, len(datetimes))]
longitudes = [randint(0, 180) for p in range(0, len(datetimes))]
df_obs = pd.DataFrame({'datetime': datetimes, 'rainfall': rainfall, 'latitude': latitudes,
                       'longitude': longitudes})
#create an xarray which is the example met data
met_type = np.ones((720, 1440))
rainfall = rand(len(datetimes))
met_list = [x*met_type for x in rainfall]
def produce_xarray(met_list, datetimes, met_type='rain', datetime_var="datetime"):
    # accept either datetime objects or 'YYYYMM' strings
    if isinstance(datetimes[0], datetime):
        dates = datetimes
    else:
        dates = [datetime.strptime(x, '%Y%m') for x in datetimes]
    met_list_dstack = np.dstack(met_list)
    lats = np.arange(90, -90, -0.25)
    lons = np.arange(-180, 180, 0.25)
    ds = xr.Dataset(data_vars={met_type: (["latitude", "longitude", datetime_var], met_list_dstack)},
                    coords={"latitude": lats, "longitude": lons, datetime_var: dates})
    ds[met_type].attrs["units"] = "g " + str(met_type) + "m$^{-2}$"
    return ds
xr_met = produce_xarray(met_list, datetimes, datetime_var="datetime")
#now I wish to collocate the data as quickly as possible, as my datasets are huge -
#here I have a function which finds the closest value using the datetime, latitude and longitude
#then I apply this function to the df of my random observations
var = 'rain'
def find_value_lat_lon(lat, lon, traj_datetime):
    array = xr_met[var].sel(latitude=lat, longitude=lon, datetime=traj_datetime, method='nearest').squeeze()
    value = array.values
    return value
def append_var_columnwise(df, var_name):
    df = df.copy()
    df.loc[:, var_name] = df[['latitude', 'longitude', 'datetime']].apply(lambda x: find_value_lat_lon(*x), axis=1)
    return df
df_obs = append_var_columnwise(df_obs, var_name='rain_met')
The final output is shown in the picture - whereby the df has an additional column with 'rain met' - for 97 data points this takes 212ms.
I don't know whether it is any faster, but .sel supports vectorized indexing (see the xarray indexing documentation; the last example in the section on vectorized indexing is a 2D version of your code):
df.loc[:, var_name] = xr_met[var].sel(
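A minimal sketch of what vectorized (pointwise) selection with .sel can look like, on a small synthetic grid. The array met, the frame obs and the dimension name "points" are made up for illustration; the coordinate names follow the question. The row-by-row lookup at the end is only there to check that both approaches agree.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small synthetic gridded field (coordinate names follow the question)
times = pd.date_range("2002-01-01", periods=4, freq="D")
lats = np.arange(90.0, -90.0, -30.0)      # 90, 60, ..., -60
lons = np.arange(-180.0, 180.0, 60.0)     # -180, -120, ..., 120
data = np.arange(len(lats) * len(lons) * len(times), dtype=float)
met = xr.DataArray(
    data.reshape(len(lats), len(lons), len(times)),
    coords={"latitude": lats, "longitude": lons, "datetime": times},
    dims=["latitude", "longitude", "datetime"],
    name="rain",
)

# Observations: one row per point
obs = pd.DataFrame({
    "latitude": [58.0, -10.0, 31.0],
    "longitude": [-100.0, 5.0, 119.0],
    "datetime": pd.to_datetime(
        ["2002-01-01 05:00", "2002-01-03 20:00", "2002-01-02 01:00"]
    ),
})

# Indexers that share the dimension "points" are matched pointwise,
# so the result has one value per observation (not an outer product).
picked = met.sel(
    latitude=xr.DataArray(obs["latitude"].to_numpy(), dims="points"),
    longitude=xr.DataArray(obs["longitude"].to_numpy(), dims="points"),
    datetime=xr.DataArray(obs["datetime"].to_numpy(), dims="points"),
    method="nearest",
)
obs["rain_met"] = picked.values

# Row-by-row lookup, for comparison with the vectorized result
loop = [
    float(met.sel(latitude=r.latitude, longitude=r.longitude,
                  datetime=r.datetime, method="nearest"))
    for r in obs.itertuples()
]
```

Whether this is faster on a large dataset depends on how the data is stored (in memory or lazily loaded), so it is worth timing on a realistic subset.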

boxplot structure disappears when pandas contains nan [duplicate]

I am using matplotlib to draw a box plot, but my data contain some missing values (NaN). No box is drawn for the columns that contain NaN values.
Do you know how to solve this problem?
Here is the code.
import numpy as np
import matplotlib.pyplot as plt
# open data
filename = 'C:\\Users\\liren\\OneDrive\\Data\\DATA in the first field-final\\ks.csv'
# AllData: 2-D string array read from filename
TreatmentCode = AllData[1:,0]
RepCode = AllData[1:,1]
KsData = AllData[1:,2:].astype('float')
DepthHeader = AllData[0,2:].astype('float')
TreatmentUnique = np.unique(TreatmentCode)[[3,1,4,2,8,6,9,7,0,5,10],]
nT = TreatmentUnique.size # nT = number of treatments
# nD = number of depths; nR = number of replications; nT = number of treatments; iT = treatment index
nD = 5
nR = 6
KsData_3D = np.zeros((nT,nD,nR))
for iT in range(nT):
    Treatment = TreatmentUnique[iT]
    TreatmentFilter = TreatmentCode == Treatment
    KsData_Filtered = KsData[TreatmentFilter,:]
    KsData_3D[iT,:,:] = KsData_Filtered.transpose()
iD = 4
fig = plt.figure()
ax = fig.add_subplot(111)
Here is the final figure and some of the treatments are missing in the box.
You can remove the NaNs from the data first, then plot the filtered data.
To do that, first find the NaNs using np.isnan(data), then invert that Boolean array with the ~ (bitwise inversion) operator. Indexing the data array with the result filters out the NaNs.
filtered_data = data[~np.isnan(data)]
In a complete example (adapted from here)
Tested in python 3.10, matplotlib 3.5.1, seaborn 0.11.2, numpy 1.21.5, pandas 1.4.2
For 1D data:
import matplotlib.pyplot as plt
import numpy as np
# fake up some data
np.random.seed(2022) # so the same data is created each time
spread = np.random.rand(50) * 100
center = np.ones(25) * 50
flier_high = np.random.rand(10) * 100 + 100
flier_low = np.random.rand(10) * -100
data = np.concatenate((spread, center, flier_high, flier_low), 0)
# Add a NaN
data[40] = np.nan
# Filter data using np.isnan
filtered_data = data[~np.isnan(data)]
# basic plot
plt.boxplot(filtered_data)
plt.show()
For 2D data:
For 2D data, you can't simply use the mask above, since then each column of the data array would have a different length. Instead, we can create a list, with each item in the list being the filtered data for each column of the data array.
A list comprehension can do this in one line: [d[m] for d, m in zip(data.T, mask.T)]
import matplotlib.pyplot as plt
import numpy as np
# fake up some data
np.random.seed(2022) # so the same data is created each time
spread = np.random.rand(50) * 100
center = np.ones(25) * 50
flier_high = np.random.rand(10) * 100 + 100
flier_low = np.random.rand(10) * -100
data = np.concatenate((spread, center, flier_high, flier_low), 0)
data = np.column_stack((data, data * 2., data + 20.))
# Add a NaN
data[30, 0] = np.nan
data[20, 1] = np.nan
# Filter data using np.isnan
mask = ~np.isnan(data)
filtered_data = [d[m] for d, m in zip(data.T, mask.T)]
# basic plot
plt.boxplot(filtered_data)
plt.show()
I'll leave it as an exercise to the reader to extend this to 3 or more dimensions, but you get the idea.
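One possible way to handle a 3-D array shaped like KsData_3D in the question (treatments, depths, replicates) is to slice out a 2-D plane for one depth and filter each row separately. This is only a sketch: the array ks, its shape and the helper name boxplot_groups are hypothetical, and the values are random.

```python
import numpy as np

# Hypothetical 3-D array shaped (treatments, depths, replicates),
# mirroring KsData_3D in the question; values are synthetic.
rng = np.random.default_rng(2022)
n_treat, n_depth, n_rep = 4, 5, 6
ks = rng.random((n_treat, n_depth, n_rep))
ks[1, 4, 2] = np.nan          # a missing replicate
ks[3, 4, 0] = np.nan          # another missing replicate

def boxplot_groups(arr, depth_index):
    """Return one NaN-free 1-D array per treatment for a given depth."""
    plane = arr[:, depth_index, :]            # shape (treatments, replicates)
    return [row[~np.isnan(row)] for row in plane]

groups = boxplot_groups(ks, depth_index=4)
# e.g. plt.boxplot(groups) then draws one box per treatment
```

Because the groups may have different lengths after filtering, they are kept as a list of arrays rather than stacked back into a single 2-D array.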
Use seaborn, which is a high-level API for matplotlib.
seaborn.boxplot filters NaN under the hood:
import seaborn as sns
sns.boxplot(data=data)
NaN is also ignored if plotting from df.plot(kind='box') for pandas, which uses matplotlib as the default plotting backend.
import pandas as pd
df = pd.DataFrame(data)
df.plot(kind='box')

Pyspark: What is the Fastest way to Calculate Cosine Similarity against a Column of Vectors

Beginner Pyspark question here! I have a dataframe of ~2M rows of already vectorized text (via w2v; 300 dimensions). What is the most efficient way to calculate the cosine distance for each row against a new single vector input?
My current methodology uses a udf and takes a couple minutes, far too long for the webapp I'd like to create.
Create a sample df:
import numpy as np
import pandas as pd
from pyspark.sql.functions import *
from pyspark.sql.types import FloatType
# assumes an active SparkSession named spark
num_rows = 10000 #change to 2000000 to really slow your computer down!
column = []
for x in range(num_rows):
    sample = np.random.uniform(low=-1, high=1, size=(300,)).tolist()
    column.append(sample)
index = range(num_rows)
df_pd = pd.DataFrame([index, column]).T
#df_pd = pd.concat([df.T[x] for x in df.T], ignore_index=True)
df = spark.createDataFrame(df_pd).withColumnRenamed('0', 'Index').withColumnRenamed('1', 'Vectors')
Create a sample input (which I create as a spark df in order to transform through my existing pipeline):
new_input = np.random.uniform(low=-1, high=1, size=(300,)).tolist()
df_pd_new = pd.DataFrame([[new_input]])
df_new = spark.createDataFrame(df_pd_new, ['Input_Vector'])
Calculate cosine distance or similarity between Vectors and new_input:
value = df_new.select('Input_Vector').collect()[0][0]
def cos_sim(vec):
    # returns None (null in Spark) when either vector has zero norm
    if (np.linalg.norm(value) * np.linalg.norm(vec)) != 0:
        dot_value = np.dot(value, vec) / (np.linalg.norm(value) * np.linalg.norm(vec))
        return dot_value.tolist()
cos_sim_udf = udf(cos_sim, FloatType())
#df_all_cos = df_all.withColumn('cos_dis', dot_product_udf('w2v')).dropna(subset='cos_dis')
df_cos = df.withColumn('cos_dis', cos_sim_udf('Vectors')).dropna(subset='cos_dis')
And finally let's pull out the max 5 indices for fun:
max_values = df_cos.select('Index', 'cos_dis').orderBy('cos_dis', ascending=False).limit(5).collect()
top_indicies = []
for x in max_values:
    top_indicies.append(x[0])
print(top_indicies)
No pyspark function for cosine distance exists (which would be ideal), so I'm not sure how to speed this up. Any ideas greatly appreciated!
You could try using pandas_udf instead of udf:
# other imports
from pyspark.sql.pandas.functions import pandas_udf
# make sure arrow is actually used
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")
def cos_sim2(vec: pd.Series) -> pd.Series:
    value_norm = np.linalg.norm(value)
    cs_value = vec.apply(lambda v: np.dot(value, v) / (np.linalg.norm(v) * value_norm))
    return cs_value.replace(np.inf, np.nan)
cos_sim_udf = pandas_udf(cos_sim2, FloatType())
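Inside a pandas UDF, the per-row apply could in principle be replaced by a single matrix-vector product. The sketch below shows that idea in plain numpy/pandas, outside Spark; the function name cosine_similarity_batch is hypothetical, and it assumes every vector in the batch has the same length as the query vector.

```python
import numpy as np
import pandas as pd

def cosine_similarity_batch(vectors: pd.Series, query) -> pd.Series:
    """Cosine similarity of every vector in `vectors` against `query`.

    Assumes each element of `vectors` is a sequence of the same length
    as `query`. Rows with zero norm yield NaN.
    """
    matrix = np.stack(vectors.to_numpy()).astype(float)   # shape (n_rows, dim)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    dots = matrix @ query                                  # one matrix-vector product
    with np.errstate(divide="ignore", invalid="ignore"):
        sims = dots / norms
    sims[norms == 0] = np.nan
    return pd.Series(sims, index=vectors.index)

# Tiny usage example with 2-dimensional vectors
vecs = pd.Series([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [0.0, 0.0]])
out = cosine_similarity_batch(vecs, [1.0, 0.0])
```

Stacking the batch into one 2-D array costs extra memory per batch, but it moves the arithmetic out of the Python-level loop that Series.apply performs.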

Pandas finding local max and min

I have a pandas data frame with two columns one is temperature the other is time.
I would like to make third and fourth columns called min and max. Each of these columns would be filled with nan's except where there is a local min or max, then it would have the value of that extrema.
Here is a sample of what the data looks like, essentially I am trying to identify all the peaks and low points in the figure.
Are there any built in tools with pandas that can accomplish this?
The solution offered by fuglede is great, but if your data is very noisy (like the data in the picture) you will end up with lots of misleading local extrema. I suggest using scipy.signal.argrelextrema(). It has its own limitations, but it lets you specify the number of points to compare on each side (the order argument), which acts as a kind of noise filter. For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import argrelextrema
# Generate a noisy AR(1) sample
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1] * 0.9 + r)
df = pd.DataFrame(xs, columns=['data'])
n = 5 # number of points to be checked before and after
# Find local peaks
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal, order=n)[0]]['data']
df['max'] = df.iloc[argrelextrema(df.data.values, np.greater_equal, order=n)[0]]['data']
# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
plt.plot(df.index, df['data'])
plt.show()
Some points:
You might need to check the points afterward to ensure there are no twin points very close to each other.
You can play with n to filter the noisy points.
argrelextrema returns a tuple, and the [0] at the end extracts a numpy array.
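Assuming that the column of interest is labelled data, one solution would be
df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]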
For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Generate a noisy AR(1) sample
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1]*0.9 + r)
df = pd.DataFrame(xs, columns=['data'])
# Find local peaks
df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]
# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
plt.show()
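Using NumPy:
ser = np.random.randint(-40, 40, 100) # 100 points
# indices where the next value is lower (the start of a decrease)
peak = np.where(np.diff(ser) < 0)[0]
# or, strict local maxima: the sign of the slope flips from +1 to -1
double_difference = np.diff(np.sign(np.diff(ser)))
peak = np.where(double_difference == -2)[0] + 1 # + 1 maps back to positions in ser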
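To make the double-difference idea concrete, here is a small worked example on a hand-picked array (the values are arbitrary). It also shows why the indices need an offset of one, and that a flat stretch is not reported as a peak.

```python
import numpy as np

ser = np.array([1, 3, 2, 5, 4, 4, 6])

slope_sign = np.sign(np.diff(ser))        # [ 1, -1,  1, -1,  0,  1]
curvature = np.diff(slope_sign)           # [-2,  2, -2,  1,  1]

# curvature[i] compares the steps on either side of ser[i + 1], hence the + 1
peaks = np.where(curvature == -2)[0] + 1      # -> [1, 3]
troughs = np.where(curvature == 2)[0] + 1     # -> [2]
```

Here ser[1] = 3 and ser[3] = 5 are the strict local maxima and ser[2] = 2 is the strict local minimum; the plateau 4, 4 produces curvature values of 1 and is skipped.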
using Pandas
ser = pd.Series(np.random.randint(2, 5, 100))
peak_df = ser[(ser.shift(1) < ser) & (ser.shift(-1) < ser)]
peak = peak_df.index
You can do something similar to Foad's .argrelextrema() solution, but with the pandas .rolling() function:
# Find local peaks
n = 5 # rolling period (window of n points centred on each value)
local_min_vals = df.loc[df['data'] == df['data'].rolling(n, center=True).min(), 'data']
local_max_vals = df.loc[df['data'] == df['data'].rolling(n, center=True).max(), 'data']
plt.scatter(local_min_vals.index, local_min_vals, c='r')
plt.scatter(local_max_vals.index, local_max_vals, c='g')

Python 2D array -- How to plug in x and retrieve y value?

I have been looking for an answer since yesterday but no luck. So I have a 1D spectrum (.fits) file with flux value at each wavelength. I have converted them into a 2D array (x,y)=(wavelength, flux) and want to write a program which will return flux(y) at some assigned wavelengths(x). I have tried this:
import scipy
import numpy as np
import pyfits as pf
#Target Global Variables
hdulist_tg = pf.open('cutmask1-2.0001.fits')
hdr_tg = hdulist_tg[0].header
flux_tg = hdulist_tg[0].data
crval_tg = hdr_tg['CRVAL1'] #Starting wavelength
cdel_tg = hdr_tg['CDELT1'] #Wavelength axis width
wave_tg = crval_tg + np.arange(3183)*cdel_tg #Create an x-axis
wavelist = [6207,6315,6369,6438,6490,6565,6588]
diff = 10
wave_flux = []
for wave in wave_tg:
    for flux in flux_tg:
        wave_flux.append((wave, flux))
for item in wave_flux:
    wave = item[0]
    flux = item[1]
    #Where I got my actual wavelength that exists in wave_tg
    diffmatch = np.abs(wave - wavelist[0])
    if diffmatch < diff:
        flux_wave = flux
        diff = diffmatch
        wavematch = wave
print(wavelist[0], flux_wave, wavematch)
but the program always returns the same flux value even though the wavelength is different. Please help...
I would skip the creation of the two dimensional table altogether and just use interp:
fluxvalues = np.interp(wavelist, wave_tg, flux_tg)
For the file you posted, the code you posted doesn't work due to the hard-coded length of the wave_tg array. I would therefore recommend you rather use
wave_tg = crval_tg + np.arange(len(flux_tg))*cdel_tg
Also, for some reason it seems that the file you posted doesn't actually go up to the wavelengths you are looking up. You might need to check that you are calculating the corresponding wavelengths correctly or check that you are looking up the right wavelengths.
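As an illustration of the np.interp approach without needing the FITS file, the sketch below builds a synthetic spectrum. The values of crval and cdelt are made-up stand-ins for the CRVAL1 and CDELT1 header entries, and the flux is a simple linear ramp so the result is easy to check by hand. A nearest-pixel lookup is included as an alternative, in case the measured value is wanted rather than an interpolated one.

```python
import numpy as np

# Synthetic spectrum; the header values here are made up for illustration
crval = 6000.0                      # stand-in for CRVAL1 (starting wavelength)
cdelt = 2.0                         # stand-in for CDELT1 (step per pixel)
flux = np.linspace(100.0, 500.0, 401)
wave = crval + np.arange(len(flux)) * cdelt      # 6000, 6002, ..., 6800

wavelist = [6207, 6315, 6369, 6438, 6490, 6565, 6588]

# Linear interpolation of flux at the requested wavelengths
flux_at = np.interp(wavelist, wave, flux)

# Nearest-pixel lookup, if the measured (not interpolated) value is wanted
nearest = np.abs(wave[:, None] - np.asarray(wavelist)).argmin(axis=0)
flux_nearest = flux[nearest]
```

Keep in mind that np.interp does not raise an error for wavelengths outside the range of wave; by default it returns the first or last flux value, so a lookup beyond the end of the spectrum silently yields the edge value.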
I've made some changes in your code:
using numpy to create wave_flux as an ndarray using np.vstack(), np.repeat() and np.tile()
using boolean (fancy) indexing to get the values matching your search
The resulting code is:
import scipy
import numpy as np
import pyfits as pf
#Target Global Variables
hdulist_tg = pf.open('cutmask1-2.0001.fits')
hdr_tg = hdulist_tg[0].header
flux_tg = hdulist_tg[0].data
crval_tg = hdr_tg['CRVAL1'] #Starting wavelength
cdel_tg = hdr_tg['CDELT1'] #Wavelength axis width
wave_tg = crval_tg + np.arange(3183)*cdel_tg #Create an x-axis
wavelist = [6207,6315,6369,6438,6490,6565,6588]
wave_flux = np.vstack(( np.repeat(wave_tg, len(flux_tg)),
                        np.tile(flux_tg, len(wave_tg)) )).transpose()
wave_ref = wavelist[0]
diff = 10
print(wave_flux[ np.abs(wave_flux[:,0]-wave_ref) < diff ])
Which will return a sub-group of wave_flux with the wave values in column 0 and flux values in column 1:
[[ 6197.10300138 500.21020508]
[ 6197.10300138 523.24102783]
[ 6197.10300138 510.6390686 ]
[ 6216.68436446 674.94732666]
[ 6216.68436446 684.74255371]
[ 6216.68436446 712.20098877]]
