How to select data from netcdf file by specific variable value? - python

I am looking for a way to select data from a NetCDF file at a specific variable value. The dataset contains time, lat, and lon coordinates and a range of variables. One of these variables is a mask with specific values for land / open ocean / sea ice / lake. Since the open ocean is represented by ds.mask = 1, I want to extract only the sea surface temperature values located at the coordinates (in time and space) where mask = 1. However, I do not want the sea surface temperature values at the other coordinates to simply be set to NaN; I want to keep only those coordinates and variable values where ds.mask = 1. I know how to select data with xarray.sel/isel, but that only works for selecting by coordinates, not by variable values as I am trying to do. Any help would be very much appreciated.
lati = stormtrack_lat.values
loni = stormtrack_lon.values
timei = stormtrack_datetime.values
tmax = timei.max() + np.timedelta64(10,'D')
tmin = timei.min() - np.timedelta64(10,'D')
SSTskin_subfile = SSTskin_file.sel(time=slice(tmin, tmax))
#HERE I NEED HELP:
#extract data where mask = ocean (1) and use only these data points and keep these only!
SSTskin_subfile_masked = SSTskin_subfile.sel(SSTskin_subfile.mask == 1) #does not work yet (Thrown error: ValueError: the first argument to .isel must be a dictionary)
This is the NetCDF file's structure (shown in the original post as a dataset printout, not reproduced here):

You can apply the ocean mask with .where:
SSTskin_subfile_masked = SSTskin_subfile.where(SSTskin_subfile.mask == 1)
It is not possible to drop every masked point, because the data are gridded: if even one value is defined at a given latitude, you have to keep all the values along that latitude. However, you can drop the coordinates where all values are NaN, one dimension at a time:
SSTskin_subfile_masked.dropna(dim='lat', how='all').dropna(dim='lon', how='all')
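For concreteness, here is a minimal, self-contained sketch of that workflow on synthetic data (all variable and coordinate names below are illustrative, not taken from the file in the question):
import numpy as np
import xarray as xr

# Toy dataset: 2 time steps, 3 latitudes, 4 longitudes, plus a 2-D mask
# where 1 marks open ocean and other codes mark land/sea ice/lake.
ds = xr.Dataset(
    {
        "sst": (("time", "lat", "lon"), np.random.rand(2, 3, 4)),
        "mask": (("lat", "lon"), np.array([[1, 0, 1, 2],
                                           [1, 1, 0, 0],
                                           [2, 2, 2, 2]])),
    },
    coords={"time": [0, 1],
            "lat": [10.0, 10.25, 10.5],
            "lon": [5.0, 5.25, 5.5, 5.75]},
)

# Keep only open-ocean cells; everything else becomes NaN.
masked = ds.where(ds.mask == 1)

# Drop latitudes/longitudes where every value is NaN (one dimension at a time).
masked = masked.dropna(dim="lat", how="all").dropna(dim="lon", how="all")
print(masked["sst"])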

Related

Extract values from XArray's DataArray to column using indices

So, I'm doing something that is maybe a bit unorthodox. I have a number of 9-billion-pixel raster maps based on the NLCD, and I want to get the values from these rasters for the pixels that have ever been built up, of which there are about 500 million:
built_up_index = pandas.DataFrame(np.column_stack(np.where(unbuilt == 0)), columns = ["row", "column"]).sort_values(["row", "column"])
That piece of code above gives me a dataframe where one column is the row index and the other is the column index of all the pixels which show construction in any of the NLCD raster maps (unbuilt is the ones and zeros raster which contains that).
I want to use this to then read values from these NLCD maps and others, so that each pixel is a row and each column is a variable, say, its value in the NLCD 2001, then its value in 2004, 2006 and so on (as well as other indices I have calculated). So the dataframe would look like this:
| row | column | value_2001 | value_2004 | var3 | ...
(values here)
I have tried the following:
test = sprawl_2001.isel({'y': np.array(built_up_frame.iloc[:,0]), 'x': np.array(built_up_frame.iloc[:,1])}, drop = True).to_dataset(name="var").to_dataframe()
which works if I take a subsample as such:
test = sprawl_2001.isel({'y': np.array(built_up_frame.iloc[0:10000,0]), 'x': np.array(built_up_frame.iloc[0:10000,1])}, drop = True).to_dataset(name="var").to_dataframe()
but it doesn't do what I want, because the result's length is squared: it seems to build a 2-D array (every y against every x) and then flatten it, when what I want is a vector containing only the values of the pixels I subsampled.
I could obviously do this in a loop, pixel by pixel, but I imagine this would be extremely slow for 500 million values and there has to be a more efficient way.
Any advice here?
EDIT: In the end I gave up on using the index, because I get the impression xarray will only make an array of the same dimensions as my original dataset (about 161000 columns and 104000 rows) with a bunch of missing values, rather than creating a column vector with the values I want. I'm using np.extract instead:
def src_to_frame(src, unbuilt, varname):
    return pd.DataFrame(np.extract(unbuilt == 0, src), columns=[varname])
where src is the raster containing the variable of interest, unbuilt is the raster of the same size where 0s are the pixels that have ever been built, and varname is the name of the variable. It does what I want and fits in the RAM I have. Maybe not the most optimal, but it works!
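As a hypothetical illustration of how the helper above could assemble the per-year table (the raster names here are made up, not from the original post):
import pandas as pd

# nlcd_2001 and nlcd_2004 are assumed to be NumPy arrays with the same shape
# as unbuilt; src_to_frame is the helper defined above.
frames = [
    src_to_frame(nlcd_2001, unbuilt, "value_2001"),
    src_to_frame(nlcd_2004, unbuilt, "value_2004"),
]
built_up = pd.concat(frames, axis=1)  # one row per built-up pixel, one column per year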
This looks like a good application for advanced indexing with DataArrays:
sprawl_2001.isel(
    y=built_up_frame.iloc[0:10000, 0].to_xarray(),
    x=built_up_frame.iloc[0:10000, 1].to_xarray(),
).to_dataset(name="var").to_dataframe()
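Here is a small self-contained sketch of that pointwise-indexing behaviour on toy data (array and column names are illustrative): because both indexers are DataArrays sharing the same dimension, xarray selects one value per (row, column) pair instead of the full cross product.
import numpy as np
import pandas as pd
import xarray as xr

arr = xr.DataArray(np.arange(16).reshape(4, 4), dims=("y", "x"))
idx = pd.DataFrame({"row": [0, 2, 3], "column": [1, 2, 0]})

# The Series are converted to DataArrays over a shared "index" dimension,
# so the result is a vector with one value per pixel, not a 3x3 grid.
vals = arr.isel(
    y=idx["row"].to_xarray(),
    x=idx["column"].to_xarray(),
)
print(vals.values)  # [ 1 10 12]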

Using pd.merge to map values for multiple columns in a dataframe from another dataframe

I have a dataframe (df3):
df3 = pd.DataFrame({
    'Origin': ['DEL', 'BOM', 'AMD'],
    'Destination': ['BOM', 'AMD', 'DEL']})
comprising travel data with Origin/Destination columns, and I'm trying to map latitude and longitude for the origin and destination airports using three-letter city codes (df_s3):
df_s3 = pd.DataFrame({
    'iata_code': ['AMD', 'BOM', 'DEL'],
    'Lat': ['72.6346969603999', '72.8678970337', '77.103104'],
    'Lon': ['23.0771999359', '19.0886993408', '28.5665']})
I've tried mapping them one at a time, i.e.
df4=pd.merge(left=df3,right=df_s3,how='left',left_on=['Origin'],right_on=['iata_code'],suffixes=['_origin','_origin'])
df5=pd.merge(left=df4,right=df_s3,how='left',left_on=['Destination'],right_on=['iata_code'],suffixes=['_destination','_destination'])
This updates the values in the dataframe but the columns corresponding to origin lat/long have '_destination' as the suffix
I've even taken an aspirational long shot by combining the two, i.e.
df4=pd.merge(left=df3,right=df_s3,how='left',left_on=['Origin','Destination'],right_on=['iata_code','iata_code'],suffixes=['_origin','_destination'])
Both of these don't seem to be working. Any suggestions on how to make this work on a larger dataset while keeping the processing time low?
Your solution was almost correct, but you need to specify both suffixes in the second merge:
df4 = pd.merge(left=df3,
               right=df_s3, how='left',
               left_on=['Origin'],
               right_on=['iata_code'])
df5 = pd.merge(left=df4,
               right=df_s3, how='left',
               left_on=['Destination'],
               right_on=['iata_code'],
               suffixes=['_origin', '_destination'])
In the first merge you don't need to specify any suffixes, as there is no column overlap. In the second merge you need to specify the suffixes for both sides: the left side carries the latitude and longitude of the origin (added by the first merge), and the right side carries those of the destination.
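After the second merge, df5 also carries the two suffixed iata_code lookup columns; if you only need the coordinates, they can be dropped (a small follow-up sketch, not part of the original answer):
df5 = df5.drop(columns=['iata_code_origin', 'iata_code_destination'])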
You can try applying a function like this one to each column:
def from_place_to_coord(place: str):
    if place in df_s3['iata_code'].to_list():
        Lat = df_s3[df_s3['iata_code'] == place]['Lat'].values[0]
        Lon = df_s3[df_s3['iata_code'] == place]['Lon'].values[0]
        return Lat, Lon
    else:
        print('Not found')
and then:
df3['origin_loc'] = df3['Origin'].apply(from_place_to_coord)
df3['destination_loc'] = df3['Destination'].apply(from_place_to_coord)
This gives you two more columns, each holding a (Lat, Lon) tuple for the corresponding location.
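If you prefer separate Lat/Lon columns rather than tuples, they can be expanded afterwards (a follow-up sketch building on the two columns created above, not part of the original answer):
import pandas as pd

# Assumes every code was found, so each cell holds a (Lat, Lon) tuple.
df3[['Origin_Lat', 'Origin_Lon']] = pd.DataFrame(df3['origin_loc'].tolist(), index=df3.index)
df3[['Destination_Lat', 'Destination_Lon']] = pd.DataFrame(df3['destination_loc'].tolist(), index=df3.index)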

Pandas - How do I look for a set of values in a column and if it is present return a value in another column

I am new to pandas. I have a csv file which has latitude and longitude columns and also a tile ID column; the file has around 1 million rows. I have a list of around a hundred tile IDs and want to get the latitude and longitude coordinates for these tile IDs. Currently I have:
good_tiles_str = [str(q) for q in good_tiles]  # setting list elements to string data type
file['tile'] = file.tile.astype(str)  # setting tile column to string data type
for i in range(len(good_tiles_str)):
    x = good_tiles_str[i]
    lat = file.loc[file['tile'].str.contains(x), 'BL_Latitude']  # finding lat coordinates
    long = file.loc[file['tile'].str.contains(x), 'BL_Longitude']  # finding long coordinates
    print(lat)
    print(long)
This method is very slow, and I know it is not the correct way, as I've heard you should not use for loops like this with pandas. Also, it does not work: it doesn't find all the latitude and longitude points for the tile IDs.
Any help would be greatly appreciated.
There is no need to iterate over the rows explicitly, as far as I understood your question.
If you want a particular assignment given a condition, you can do it directly. Here's one way using numpy.where; ~ is used to negate a condition.
rule1 = file['tile'].str.contains(x)
rule2 = file['tile'].str.contains(x)
file['flag'] = np.where(rule1, 'BL_Latitude', " ")
file['flag'] = np.where(rule2 & ~rule1, 'BL_Longitude', file['flag'])
Try this:
search_for = '|'.join(good_tiles_str)
good = file[file.tile.str.contains(search_for)]
good = good[['BL_Latitude', 'BL_Longitude']].drop_duplicates()
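If the tile IDs are exact values rather than substrings, an isin lookup avoids building a regex and is usually faster (an alternative sketch, not from the original answer):
mask = file['tile'].isin(good_tiles_str)
good = file.loc[mask, ['BL_Latitude', 'BL_Longitude']].drop_duplicates()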

Python - Interpolating problems

I am writing a program that creates galaxy spectra for ages not specified in a given catalogue, using interpolation from the nearest ages.
I am trying to find a solution to make sure I am not extrapolating by adding another if statement to my runinterpolation function below.
Limits is a list of ages in the form [[age1,age2],[age3,age4],...]
Data is a list of dataframes with the corresponding data to be interpolated for each k in limits.
For ages above/below the given ages in the original data, the previous function returns the lowest/highest age for the limit, i.e. [[age1,age1]]
I cannot seem to write an if statement that says: if age1 == age2, create a column with age1's non-interpolated data.
The functions for interpolation are below:
# linear interpolation
def interpolation(limits, data, age):
    interp = sc.interp1d(limits, data)
    interped = interp(age)
    return interped

# runs the interpolation for each age, values returned as columns
# of a new dataframe
def runinterpolation(limits, data, ages):
    int_data = pd.DataFrame(index=Lambda)
    for x, y, z in zip(limits, data, ages):
        W = interpolation(x, y, z)
        int_data[z] = W
    return int_data
Any help is much appreciated.
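A minimal sketch of such a guard, under the assumption (mine, not stated in the question) that each entry of limits is a [low_age, high_age] pair and the matching dataframe in data holds one column per limit age:
def runinterpolation(limits, data, ages):
    int_data = pd.DataFrame(index=Lambda)
    for x, y, z in zip(limits, data, ages):
        if x[0] == x[1]:
            # Requested age is outside the catalogue range: reuse the nearest
            # spectrum as-is instead of interpolating (interp1d needs two
            # distinct ages anyway).
            int_data[z] = y.iloc[:, 0]
        else:
            int_data[z] = interpolation(x, y, z)
    return int_data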

Read value over time at certain coordinate from netCDF with python

I have a netCDF file with a grid (each step 0.25°).
What I want is the value of the variable, let's say tempMax, at a certain grid point, over the last 50 years.
I am aware that you read the data into Python like this:
lon = numpy.array(file.variables['longitude'][:])
lat = numpy.array(file.variables['latitude'][:])
temp = numpy.array(file.variables['tempMax'][:])
time = numpy.array(file.variables['time'][:])
That leaves me with arrays, and I do not know how to "untangle" them.
How do I get the value at a certain coordinate (stored in temp) over the whole time range (stored in time)? That is, display the value over time at that certain coordinate.
Any ideas how I could achieve that?
Thanks!
I'm guessing that tempMax is 3D (time x lat x lon) and should then be read in as
temp = ncfile.variables['tempMax'][:,:,:]
(Note two things: (1) if you're using Python 2, it's best to avoid the name file and instead use something like ncfile as shown above; (2) temp will automatically be stored as a numpy.ndarray with the call above, so you don't need to wrap the read in numpy.array().)
Now you can extract temperatures for all times at a certain location with
temp_crd = temp[:,lat_idx,lon_idx]
where lat_idx and lon_idx are integers corresponding to the index of the latitude and longitude coordinates. If you know these indices beforehand, great, just plug them in, e.g. temp_crd = temp[:,25,30]. (You can use the tool ncdump to view the contents of a netCDF file, https://www.unidata.ucar.edu/software/netcdf/docs/netcdf/ncdump.html)
The more likely case is that you know the coordinates, but not their indices beforehand. Let's say you want temperatures at 50N and 270E. You can use the numpy.where function to extract the indices of the coordinates given the lat and lon arrays that you've already read in.
lat_idx = numpy.where(lat == 50)[0][0]
lon_idx = numpy.where(lon == 270)[0][0]
temp_crd = temp[:, lat_idx, lon_idx]
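If the exact values 50 and 270 are not grid points, a nearest-neighbour lookup on the coordinate arrays is a common alternative (a sketch, not from the original answer):
lat_idx = numpy.abs(lat - 50.0).argmin()   # index of the grid latitude closest to 50N
lon_idx = numpy.abs(lon - 270.0).argmin()  # index of the grid longitude closest to 270E
temp_crd = temp[:, lat_idx, lon_idx]       # full time series at that grid point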
