I am writing a program that creates galaxy spectra for ages not listed in a given catalogue, interpolating from the nearest ages.
I am trying to make sure I am not extrapolating, by adding another if statement to my runinterpolation function below.
Limits is a list of ages in the form [[age1,age2],[age3,age4],...]
Data is a list of dataframes with the corresponding data to be interpolated for each k in limits.
For ages above/below the ages covered by the original data, the previous function returns the nearest (highest/lowest) age for both entries of the limit, i.e. [[age1,age1]].
I cannot seem to write an if statement that says: if age1 == age2, create a column with age1's non-interpolated data.
The functions for interpolation are below:
# linear interpolation between the two bracketing ages
from scipy import interpolate as sc  # assumed import behind the sc alias
def interpolation(limits, data, age):
    interp = sc.interp1d(limits, data)
    interped = interp(age)
    return interped
#runs the interpolation for each age; values are returned as columns
#of a new dataframe (Lambda, the wavelength index, is defined elsewhere)
def runinterpolation(limits, data, ages):
    int_data = pd.DataFrame(index=Lambda)
    for x, y, z in zip(limits, data, ages):
        W = interpolation(x, y, z)
        int_data[z] = W
    return int_data
Any help is much appreciated.
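A minimal sketch of one way to add that guard, assuming each entry of limits is a pair [lo, hi] and the matching dataframe in data holds the two bracketing spectra as its columns (so when lo == hi both columns carry the same spectrum):

import numpy as np

def runinterpolation(limits, data, ages):
    int_data = pd.DataFrame(index=Lambda)
    for (lo, hi), y, age in zip(limits, data, ages):
        if lo == hi:
            # age lies outside the catalogue: reuse the nearest age's
            # spectrum directly instead of extrapolating
            int_data[age] = np.asarray(y)[:, 0]
        else:
            int_data[age] = interpolation([lo, hi], y, age)
    return int_data

The explicit branch also sidesteps the degenerate case where sc.interp1d would be handed two identical x-values.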
I'm trying to create an experimental dataframe (to be used only for a comparative visualization) from other correlated dataframes, and then sort each column's values independently of the others, in order to visualize what correlated data should look like (my current data actually shows no correlation).
experimental_df = full_df[['Token_Rarity','Price_USD']]
Becomes:
experimental_df.sort_values(by=['Token_Rarity','Price_USD'],ascending=[True,True])
I'm trying to get the lowest value in the token column paired with the lowest value in the price column (or vice versa), regardless of any other values or arguments.
Please try this:
experimental_df = pd.DataFrame()
experimental_df['Token_Rarity'] = full_df['Token_Rarity'].sort_values().reset_index(drop=True)
experimental_df['Price_USD'] = full_df['Price_USD'].sort_values().reset_index(drop=True)
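The reset_index(drop=True) calls matter here: they discard the original row labels after each sort, so the two columns line up positionally. Without them, pandas would realign the assigned values by index and undo the independent ordering.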
I am searching for an option to select data from a NetCDF file at a specific variable value. The dataset contains time, lat, and lon coordinates and a range of variables. One of these variables is a mask with specific values for land / open ocean / sea ice / lake. Since the open ocean is represented by ds.mask == 1, I want to extract only the sea surface temperature values located at the coordinates (in time and space) where mask == 1. However, I do not want the sea surface temperature values at other coordinates to be set to NaN; I want to keep only those coordinates and values where ds.mask == 1. I know how to select data with xarray .sel/.isel, but that works only when selecting by coordinates, not by variable values as I am trying here. Any help would be very much appreciated.
lati = stormtrack_lat.values
loni = stormtrack_lon.values
timei = stormtrack_datetime.values
tmax = timei.max() + np.timedelta64(10,'D')
tmin = timei.min() - np.timedelta64(10,'D')
SSTskin_subfile = SSTskin_file.sel(time=slice(tmin, tmax))
#HERE I NEED HELP:
#extract data where mask = ocean (1) and use only these data points and keep these only!
SSTskin_subfile_masked = SSTskin_subfile.sel(SSTskin_subfile.mask == 1) #does not work yet (Thrown error: ValueError: the first argument to .isel must be a dictionary)
You can apply the ocean mask with .where (comparing against 1 explicitly, since the mask has several categories):
SSTskin_subfile_masked = SSTskin_subfile.where(SSTskin_subfile.mask == 1)
It is not possible to drop every masked point individually, because the data are gridded: if even one value is defined at a given latitude, all values along that latitude have to stay. However, you can drop the coordinates where all values are NaN. Note that xarray's dropna takes a single dimension at a time, so chain the calls:
SSTskin_subfile_masked.dropna(dim='lat', how='all').dropna(dim='lon', how='all')
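A self-contained toy example of the where/dropna pattern (made-up 2x2 grid, values and names invented for illustration):

import xarray as xr

ds = xr.Dataset(
    {
        'sst':  (('lat', 'lon'), [[280.0, 281.0], [282.0, 283.0]]),
        'mask': (('lat', 'lon'), [[1, 0], [1, 2]]),   # 1 = open ocean
    },
    coords={'lat': [50.0, 50.25], 'lon': [0.0, 0.25]},
)

ocean = ds.where(ds.mask == 1)                 # non-ocean cells become NaN
trimmed = ocean.dropna(dim='lon', how='all')   # drop lon slices that are all NaN
trimmed = trimmed.dropna(dim='lat', how='all')
print(trimmed.sst)                             # only the lon=0.0 column survives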
Coming from a Matlab background, where everything is a matrix/vector, it was very easy to loop through a given data set and build a matrix successively, and since the final object was a matrix, it was also very easy to extract specific elements. I'm finding this rather problematic in Python. I've reproduced the code here to explain where I get stuck.
The original data is just a time series with a month and a price. The goal is to simulate select subsets of these prices. The loop starts by collecting all months into one set and then drops one month in each successive pass. For n time periods this gives (n^2 - n)/2 + n columns; with n = 12 months in this data set, that is (144 - 12)/2 + 12 = 78 columns in total. The rows of the matrix are Z-scores sampled from the standard normal distribution; the goal is to simulate all 78 prices in one matrix in one go. The number of Z-scores per column is set by the variable num_terminal_values, currently 5 to keep things simple and easy to visualize.
Here's a link to a Google Sheet with the original matrix: google sheet with corr mat. The code below may not work from the Google Sheet; the sheet is intended to show what the original data is. My steps (and Python code) are as follows:
#1 read the data
dfCrv = pd.read_excel(xl, sheet_name = 'upload', usecols = range(0,2)).dropna(axis=0)
#2 create the looper variables and then loop through the data to build a matrix. The rows of the matrix are Z values sampled from the standard normal (the variable num_terminal_values). The columns refer to the individual simulation months.
import datetime as dt
import numpy as np
import numpy.random as npr
import pandas as pd

lst_zUCorr = []
num_terminal_values = 5
as_of = dt.datetime(2020, 12, 1)
max_months = dfCrv.shape[0]
sim_months = pd.date_range(dfCrv['term'].iloc[0], dfCrv['term'].iloc[-1], freq='MS')
end_month = dfCrv['term'].iloc[-1]
dfCrv = dfCrv.set_index('term', drop=False)
for runNum in range(max_months):
    sim_month = dfCrv['term'].iloc[runNum]
    ttm = (sim_month - as_of).days / 365
    num_months = (end_month.year - sim_month.year) * 12 + (end_month.month - sim_month.month) + 1
    zUCorr = npr.standard_normal(size=(num_terminal_values, num_months))
    lst_zUCorr.append(zUCorr)
#3 investigate the objects
lst_zUCorr
z = np.hstack(lst_zUCorr)
z
So far, everything works fine. However, I don't know how to transform the object lst_zUCorr into a simple matrix; I've tried hstack etc., but the result still doesn't look like a matrix to me. The next set of operations requires the data in plain matrix form.
Key point/question: the final 5x78 matrix in Matlab can be used for further operations. Is there a way to convert the equivalent Python object into a 5x78 matrix, or will I now need more code to access specific subsets of the Python objects?
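For what it's worth, np.hstack(lst_zUCorr) should already return exactly that: a 2-D numpy.ndarray, which is Python's plain-matrix object. A minimal sketch with dummy data (sizes taken from the question):

import numpy as np
import numpy.random as npr

num_terminal_values = 5
# one block per month: 12 columns, then 11, then 10, ... down to 1
lst_zUCorr = [npr.standard_normal(size=(num_terminal_values, 12 - m))
              for m in range(12)]
z = np.hstack(lst_zUCorr)  # a plain 2-D ndarray
print(z.shape)             # (5, 78)
print(z[2, 10])            # single element: row 3, column 11
print(z[:, :12].shape)     # first simulation block, all rows: (5, 12)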
I'm a complete newbie to Python, and I'm currently working on a problem that asks me to take the average of each column when the number of columns is unknown.
I figured out how to do it when I know how many columns there are, doing each calculation separately. I'm supposed to do it by creating an empty list and looping the columns back into it.
import numpy as np
#average of all data, not including NaN
def average(dataset):
    return np.mean(dataset[np.isfinite(dataset)])
#this is how I did it with each column separate
dataset = np.genfromtxt("some file")
print(average(dataset[:, 0]))
print(average(dataset[:, 1]))
#what I'm trying to do with a loop
def avg(dataset):
    lst = []
    # dataset.shape[1] is the number of columns, so the loop works
    # without knowing the column count in advance
    for i in range(dataset.shape[1]):
        lst.append(average(dataset[:, i]))
    return lst
You can use the numpy.mean() function:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
with:
np.mean(my_data, axis=0)
The axis indicates whether you are taking the average along columns or rows (axis=0 means you take the average of each column, which is what you are trying to do). The output is a vector whose length equals the number of columns (or rows) you averaged over, and each element is the average of the corresponding column (or row). You do not need to know the shape of the matrix in advance. One caveat: np.mean propagates NaN, so since you want to exclude NaN values, use np.nanmean(my_data, axis=0), which works the same way but ignores NaNs.
You CAN do this using a for loop, but it's not a good idea: looping over matrices in numpy is slow, whereas vectorized operations like np.mean() are very fast. So in general, when using numpy, try to use those built-in operations instead of looping over everything whenever possible.
Also, if you want the number of columns in your matrix:
my_matrix.shape[1] is the number of columns;
my_matrix.shape[0] is the number of rows.
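A quick self-contained check of the axis behaviour, with made-up numbers:

import numpy as np

data = np.array([[1.0, 10.0],
                 [2.0, np.nan],
                 [3.0, 30.0]])

print(np.mean(data, axis=0))     # [ 2. nan] -- plain mean propagates NaN
print(np.nanmean(data, axis=0))  # [ 2. 20.] -- NaN-aware column means
print(data.shape[1])             # 2, the number of columns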
I have a netCDF file with a grid (each step 0.25°).
What I want is the value of a variable, let's say tempMax, at a certain grid point, over the last 50 years.
I am aware that you read the data into Python like this:
lon = numpy.array(file.variables['longitude'][:])
lat = numpy.array(file.variables['latitude'][:])
temp = numpy.array(file.variables['tempMax'][:])
time = numpy.array(file.variables['time'][:])
That leaves me with arrays, and I do not know how to "untangle" them.
How do I get the value at a certain coordinate (stored in temp) over the whole time span (stored in time), i.e. the series of values over time at that one coordinate?
Any ideas how I could achieve that?
Thanks!
I'm guessing that tempMax is 3D (time x lat x lon) and should then be read in as
temp = ncfile.variables['tempMax'][:,:,:]
(Note two things: (1) if you're using Python v2, it's best to avoid the word file and instead use something like ncfile as shown above, (2) temp will be automatically stored as a numpy.ndarray simply with the call above, you don't need to use the numpy.array() command during the read in of variables.)
Now you can extract temperatures for all times at a certain location with
temp_crd = temp[:,lat_idx,lon_idx]
where lat_idx and lon_idx are integers corresponding to the index of the latitude and longitude coordinates. If you know these indices beforehand, great, just plug them in, e.g. temp_crd = temp[:,25,30]. (You can use the tool ncdump to view the contents of a netCDF file, https://www.unidata.ucar.edu/software/netcdf/docs/netcdf/ncdump.html)
The more likely case is that you know the coordinates, but not their indices beforehand. Let's say you want temperatures at 50N and 270E. You can use the numpy.where function to extract the indices of the coordinates given the lat and lon arrays that you've already read in.
lat_idx = numpy.where(lat==50)[0][0]
lon_idx = numpy.where(lon==270)[0][0]
tmp_crd = temp[:,lat_idx,lon_idx]
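One caveat, not from the original answer: exact equality only finds an index if 50 and 270 sit exactly on the grid. When they don't (or floating-point noise gets in the way), a nearest-grid-point lookup is a common alternative, sketched here with the same arrays:

# nearest grid point instead of an exact match
lat_idx = numpy.abs(lat - 50).argmin()
lon_idx = numpy.abs(lon - 270).argmin()
tmp_crd = temp[:, lat_idx, lon_idx]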