I have a rather large dataset, and I need to find its extreme values along with their coordinates.
The real dataset is much larger, but let's take this one for testing:
import xarray as xr
import numpy as np
import pandas as pd
values = np.array(
[[[3, 1, 1],
[1, 1, 1],
[1, 1, 1]],
[[1, 1, 1],
[1, 1, 1],
[1, 1, 4]],
[[1, 1, 1],
[1, 1, 1],
[1, 1, 5]]]
)
da = xr.DataArray(values, dims=('time', 'lat', 'lon'),
coords={'time': list(range(3)), 'lat': list(range(3)), 'lon':list(range(3))})
I want to find all values larger than 2 in this DataArray. I found this solution on here:
da.where(da>2, drop=True)
but even in this small example, this produces many more NaNs than actual values:
array([[[ 3., nan],
[nan, nan]],
[[nan, nan],
[nan, 4.]],
[[nan, nan],
[nan, 5.]]])
and it's worse in my actual dataset.
I've tried writing a helper function that converts it to a pandas DataFrame, like this:
def find_val(da):
    res = pd.DataFrame(columns=['Time', 'Latitude', 'Longitude', 'Value'])
    for time_idx, time in enumerate(da['time']):
        for lat_idx, lat in enumerate(da['lat']):
            for lon_idx, lon in enumerate(da['lon']):
                value = da.isel(time=time_idx, lat=lat_idx, lon=lon_idx).item()
                if not np.isnan(value):
                    res.loc[len(res.index)] = [time.item(), lat.item(), lon.item(), value]
    return res
find_val(da.where(da>2, drop=True))
This produces the output I want, but three nested for loops seem excessive.
   Time  Latitude  Longitude  Value
0   0.0       0.0        0.0    3.0
1   1.0       1.0        1.0    4.0
2   2.0       1.0        1.0    5.0
Any good suggestions on how to improve this?
There is already a built-in way to convert to pandas:
DataArray.to_dataframe(name=None, dim_order=None)
https://docs.xarray.dev/en/stable/generated/xarray.DataArray.to_dataframe.html
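For example, a minimal sketch that replaces the nested loops (assuming the imports and the da from the question; the column name 'Value' is just illustrative):

# keep only the values of interest, then flatten to a DataFrame
filtered = da.where(da > 2, drop=True)

# to_dataframe() turns all dimensions into a MultiIndex;
# dropna() removes the NaN padding left behind by where()
res = filtered.to_dataframe(name='Value').dropna().reset_index()
print(res)

The resulting DataFrame has one row per non-NaN value, with time, lat and lon as ordinary columns.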
On a side note, if you are looking to remove extreme values without a specific cutoff in mind, you might want to check out outlier detection:
https://scikit-learn.org/stable/modules/outlier_detection.html
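If you go down that route, here is a minimal sketch using scikit-learn's IsolationForest (the contamination value and the 'Value'/'outlier' column names are illustrative assumptions, not tuned settings):

from sklearn.ensemble import IsolationForest

# flatten the DataArray into a one-column feature matrix
flat = da.to_dataframe(name='Value').reset_index()

# fit_predict returns -1 for points flagged as outliers, 1 otherwise
model = IsolationForest(contamination=0.1, random_state=0)
flat['outlier'] = model.fit_predict(flat[['Value']])
print(flat[flat['outlier'] == -1])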
Related
I need to assign a stack of rows from a pd.DataFrame by index to a matrix of known size as efficiently as possible. Many of the indices will not exist in the dataframe, and these rows must be filled with NaNs. This operation will happen inside a loop iterating over each row in the dataframe and must be as fast as possible.
In short, I want to speed up the following sequence:
# DF to iterate over
df = pd.DataFrame({'a': [1,2,3,4], 'b': [2,4,6,8], 'c': [3,6,9,12]})
# (Relative) DF indices to return if possible
indices = [-2, 0, 2, 4]
len_df = len(df)
len_indices = len(indices)
len_cols = len(df.columns)
values = np.zeros((len_df, len_indices, len_cols))
for i in range(len_df):
    for j, n in enumerate(indices):
        idx = i + n
        try:
            assert idx >= 0  # avoid wrapping around
            values[i, j] = df.iloc[idx]
        except:
            values[i, j] = np.nan
values
Returns
[[[nan, nan, nan],
[ 1, 2, 3],
[ 3, 6, 9],
[nan, nan, nan]],
[[nan, nan, nan],
[ 2, 4, 6],
[ 4, 8, 12],
[nan, nan, nan]],
[[ 1, 2, 3],
[ 3, 6, 9],
[nan, nan, nan],
[nan, nan, nan]],
[[ 2, 4, 6],
[ 4, 8, 12],
[nan, nan, nan],
[nan, nan, nan]]]
As desired, but it is quite slow. Any suggestions?
Here's a solution that builds a lookup table:
# repeat indices into new dimension
rep_ids = np.tile(indices, (len(df), 1))
# add row index to the tiled ids
range_ids = rep_ids + np.arange(len(df))[:, np.newaxis]
# check which ids make a valid lookup in the df
# (the question's code asserts idx >= 0, so 0 is a valid lookup index)
valid_test = (range_ids >= 0) & (range_ids < len(df))
# lookup table, using 0 for invalid ids (will be overwritten later)
lookup_ids = np.where(valid_test, range_ids, 0)
# lookup all values
values = df.values[lookup_ids].astype(np.float64)
# set the invalids to nan
values[~valid_test] = np.nan
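As a quick sanity check (assuming the df, indices and values defined above, and a NumPy version recent enough to support equal_nan in np.array_equal), the vectorized result should match the loop version from the question:

# rebuild the expected result with the original loop logic
expected = np.full((len(df), len(indices), len(df.columns)), np.nan)
for i in range(len(df)):
    for j, n in enumerate(indices):
        if 0 <= i + n < len(df):
            expected[i, j] = df.iloc[i + n]

assert np.array_equal(values, expected, equal_nan=True)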
I have a DataFrame that contains floats and I want to get all the indexes of cells that match a certain filter.
So let's say I have this DataFrame:
     A    B    C
A  1.0  0.7  0.9
B  0.7  1.0  0.3
C  0.9  0.3  1.0
And my filter is >=0.9
I want to get the indexes (0,0),(1,1),(2,2),(0,2),(2,0).
Or, to be more specific: I have a Pearson correlation DataFrame and I want to get all the columns that have a correlation greater than 0.9.
You can use np.argwhere():
import numpy as np
out = np.argwhere(df.to_numpy() >= 0.9).tolist()
Output of out:
[[0, 0], [0, 2], [1, 1], [2, 0], [2, 2]]
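If you want row/column labels instead of positional indices (a sketch assuming the correlation DataFrame shown above), you can map the positions back through df.index and df.columns:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 0.7, 0.9],
                   [0.7, 1.0, 0.3],
                   [0.9, 0.3, 1.0]],
                  index=list('ABC'), columns=list('ABC'))

# positional indices of matching cells
pos = np.argwhere(df.to_numpy() >= 0.9)

# translate positions into (row label, column label) pairs
labels = [(df.index[i], df.columns[j]) for i, j in pos]
print(labels)  # [('A', 'A'), ('A', 'C'), ('B', 'B'), ('C', 'A'), ('C', 'C')]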
I have a four-dimensional NumPy ndarray (time, pressure level, latitude, longitude), and I want to check, for each time and pressure level (dimensions 0 and 1), whether there is an all-NaN slice along the latitude or longitude dimension (2 and 3).
I'd like to do it in a vectorized way, so without looping over the array, but I can't figure out how.
import numpy as np
a=np.ones([2,3,5,5])
a[0,2,:,2]=np.nan*np.ones_like(a[0,2,:,2])
a[0,1,1,:]=np.nan*np.ones_like(a[0,1,1,:])
a[0,0,1,2]=np.nan
a[1,1,:,2]=np.nan*np.ones_like(a[0,2,:,2])
a[1,1,1,:]=np.nan*np.ones_like(a[0,1,1,:])
print(a)
The array now holds ones (i.e. numbers), and in some locations slices of only NaNs. I'd like to know these locations. So in this case, I need to find that the NaN slices are at [0,2,:,2], [0,1,1,:], [1,1,:,2], and [1,1,1,:].
You should use np.isnan, which creates a boolean array of the same shape as your original array, and then apply a boolean reduction such as np.all. The following code stores in idx the indices of the rows (axis=1) whose elements are all NaN.
arr = np.array([[0, 0, 0], [np.nan, np.nan, np.nan], [1, np.nan, 1]])
arr_isnan = np.isnan(arr)
idx = np.argwhere(arr_isnan.all(axis=1))
Output:
>>>print(idx)
[[1]]
Following your example, this method gives you this output:
arr_isnan = np.isnan(a)
idx = np.argwhere(arr_isnan.all(axis=2))
>>>print(idx) #[0,2,:,2] and [1,1,:,2] because axis=2
array([[0, 2, 2],
[1, 1, 2]], dtype=int64)
>>>print(a[idx[:,0], idx[:,1], :, idx[:,2]])
[[nan nan nan nan nan]
[nan nan nan nan nan]]
So you just have to adjust the position of ":" according to the axis.
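For completeness, the same pattern with axis=3 finds the all-NaN slices along longitude (reusing the array a from the question):

# all-NaN slices along the longitude axis
idx_lon = np.argwhere(np.isnan(a).all(axis=3))
print(idx_lon)  # [[0 1 1]
                #  [1 1 1]]  i.e. a[0,1,1,:] and a[1,1,1,:]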
I want to count the number of 'nan' values per column inside a matrix full of string values, like this one:
m:
[['CB_2' 'CB_3']
['CB_1-1' 'CB_4-1']
['CB_1-2' 'CB_4-2']
['CB_2-1' 'CB_5-1']
['CB_2-2' 'CB_5-2']
[nan 'CB_6-1']
[nan 'CB_6-2']]
I tried using np.count_nonzero(~np.isnan(m)), but it seems to work only with numerical values. Perhaps I should convert the NaNs into an empty string or zero (?).
Also, I created a sample NumPy array with strings to try several options (np.array([['a','b'],['c','d'],['e','f'],['e','g'],['k','ñ'],['w','q'],['y','d']])), but when I use np.nan it doesn't seem to work correctly, since it adds the NaN value as the string 'nan'.
Thanks,
You can transform the array into something numerical (I could not reproduce an array with real NaNs, but you can make the function return 0 for non-strings):
def f(x):
    if isinstance(x, str):
        if x == 'nan':
            return 0
        else:
            return 1
    return 0
vf = np.vectorize(f)
x = np.array([['CB_2', 'CB_3'],
['CB_1-1', 'CB_4-1'],
['CB_1-2', 'CB_4-2'],
['CB_2-1', 'CB_5-1'],
['CB_2-2', 'CB_5-2'],
[np.nan, 'CB_6-1'],
[np.nan, 'CB_6-2']])
>>> x
array([['CB_2', 'CB_3'],
['CB_1-1', 'CB_4-1'],
['CB_1-2', 'CB_4-2'],
['CB_2-1', 'CB_5-1'],
['CB_2-2', 'CB_5-2'],
['nan', 'CB_6-1'],
['nan', 'CB_6-2']], dtype='<U6')
>>> vf(x)
array([[1, 1],
[1, 1],
[1, 1],
[1, 1],
[1, 1],
[0, 1],
[0, 1]])
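To get the per-column counts the question asks for (using the vf and x defined above), sum over the rows:

# number of 'nan' entries per column (0 marks a 'nan' cell)
nan_per_col = (vf(x) == 0).sum(axis=0)
print(nan_per_col)  # [2 0]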
I have a DataFrame in which some columns contain wrong information. This wrong information always comes right before a longer sequence of NaN values. Let's imagine I have the following dataset:
import pandas as pd
from numpy import nan
d = {'Obs1': [1, 2, 3, 4, 5, 6, 7, 8], 'Obs2': [0.1, 0.1, nan, nan, nan, nan, 100, 101]}
df = pd.DataFrame(data=d)
"Obs1" is without wrong information, while "Obs2" has wrong values before the 4-NaN sequence. Does anyone know how to find such a longer sequence in a timeseries (e.g. an occurence of 4 NaN values), to then fill all previous entries with NaN? To give an example, my desired Output would be:
Output = {'Obs1': [1, 2, 3, 4, 5, 6, 7, 8], 'Obs2': [nan, nan, nan, nan, nan, nan, 100, 101]}
Thanks in advance
For each column, check whether the i'th and (i+1)'th elements are both NaN, find the largest index for which this holds, and then overwrite everything up to the end of that NaN run with NaN.
See the following code.
for col in df.columns:
    # True where this value and the next one are both NaN
    cond = df[col].isnull() & df[col].shift(-1).isnull()
    if cond.sum() >= 2:  # require a reasonably long run of consecutive NaNs
        # last position of such a pair; everything up to the end of that run becomes NaN
        # (label-based slicing here assumes the default integer index)
        last = cond[cond].index[-1]
        df.loc[:last + 1, col] = nan
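A quick way to check this against the desired output from the question (assuming df was built as above):

expected = pd.DataFrame({'Obs1': [1, 2, 3, 4, 5, 6, 7, 8],
                         'Obs2': [nan, nan, nan, nan, nan, nan, 100, 101]})
# dtypes may differ depending on how the NaNs were assigned
pd.testing.assert_frame_equal(df, expected, check_dtype=False)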