I have been looking for an answer since yesterday but no luck. I have a 1D spectrum (.fits) file with a flux value at each wavelength. I have converted it into a 2D array (x, y) = (wavelength, flux) and want to write a program which will return the flux (y) at some assigned wavelengths (x). I have tried this:
#modules
import scipy
import numpy as np
import pyfits as pf

#Target Global Variables
hdulist_tg = pf.open('cutmask1-2.0001.fits')
hdr_tg = hdulist_tg[0].header
flux_tg = hdulist_tg[0].data
crval_tg = hdr_tg['CRVAL1'] #Starting wavelength
cdel_tg = hdr_tg['CDELT1'] #Wavelength axis width
wave_tg = crval_tg + np.arange(3183)*cdel_tg #Create an x-axis
wavelist = [6207,6315,6369,6438,6490,6565,6588]

wave_flux = []
diff = 10
for wave in wave_tg:
    for flux in flux_tg:
        wave_flux.append((wave, flux))

for item in wave_flux:
    wave = item[0]
    flux = item[1]
    #Where I got my actual wavelength that exists in wave_tg
    diffmatch = np.abs(wave - wavelist[0])
    if diffmatch < diff:
        flux_wave = flux
        diff = diffmatch
        wavematch = wave

print wavelist[0], flux_wave, wavematch
but the program always returns the same flux value even though the wavelength is different. Please help...
I would skip the creation of the two dimensional table altogether and just use interp:
fluxvalues = np.interp(wavelist, wave_tg, flux_tg)
For the file you posted, the code you posted doesn't work because of the hard-coded length of the wave_tg array. I would therefore recommend using
wave_tg = crval_tg + np.arange(len(flux_tg))*cdel_tg
Also, for some reason the file you posted doesn't actually seem to go up to the wavelengths you are looking up. You might need to check that you are calculating the corresponding wavelengths correctly, or that you are looking up the right wavelengths.
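Putting the two fixes together, a minimal sketch (assuming the same file and header keywords as in the question):
import numpy as np
import pyfits as pf

hdulist_tg = pf.open('cutmask1-2.0001.fits')
hdr_tg = hdulist_tg[0].header
flux_tg = hdulist_tg[0].data
# build the wavelength axis from the actual data length
wave_tg = hdr_tg['CRVAL1'] + np.arange(len(flux_tg)) * hdr_tg['CDELT1']
print(wave_tg.min(), wave_tg.max())  # sanity-check that the range covers wavelist
wavelist = [6207, 6315, 6369, 6438, 6490, 6565, 6588]
fluxvalues = np.interp(wavelist, wave_tg, flux_tg)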
I've made some changes in your code:
using numpy to create wave_flux as an ndarray, using np.vstack(), np.repeat() and np.tile()
using fancy indexing to get the values matching your search
The resulting code is:
#modules
import scipy
import numpy as np
import pyfits as pf
#Target Global Variables
hdulist_tg = pf.open('cutmask1-2.0001.fits')
hdr_tg = hdulist_tg[0].header
flux_tg = hdulist_tg[0].data
crval_tg = hdr_tg['CRVAL1'] #Starting wavelength
cdel_tg = hdr_tg['CDELT1'] #Wavelength axis width
wave_tg = crval_tg + np.arange(3183)*cdel_tg #Create an x-axis
wavelist = [6207,6315,6369,6438,6490,6565,6588]
wave_flux = np.vstack((np.repeat(wave_tg, len(flux_tg)),
                       np.tile(flux_tg, len(wave_tg)))).transpose()
wave_ref = wavelist[0]
diff = 10
print wave_flux[ np.abs(wave_flux[:,0]-wave_ref) < diff ]
Which will return a sub-group of wave_flux with the wave values in column 0 and flux values in column 1:
[[ 6197.10300138 500.21020508]
[ 6197.10300138 523.24102783]
[ 6197.10300138 510.6390686 ]
...,
[ 6216.68436446 674.94732666]
[ 6216.68436446 684.74255371]
[ 6216.68436446 712.20098877]]
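If you only need the single closest wavelength to wave_ref rather than a window, a shorter sketch (using the wave_tg, flux_tg and wave_ref defined above) skips building wave_flux entirely:
# index of the wavelength sample closest to the reference wavelength
idx = np.abs(wave_tg - wave_ref).argmin()
print(wave_tg[idx], flux_tg[idx])  # closest wavelength and its flux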
I have a set of netcdf datasets that basically look like a CSV file with columns for latitude, longitude, and value. These are points along tracks that I want to aggregate to a regular grid of (say) 1 degree from -90 to 90 and -180 to 180 degrees, by, for example, calculating the mean and/or standard deviation of all points that fall within a given cell.
This is quite easily done with a loop
D = np.zeros((180, 360))
for ilat in np.arange(-90, 90, 1, dtype=np.int):
    for ilon in np.arange(-180, 180, 1, dtype=np.int):
        p1 = np.logical_and(ds.lat >= ilat,
                            ds.lat <= ilat + 1)
        p2 = np.logical_and(ds.lon >= ilon,
                            ds.lon <= ilon + 1)
        if np.sum(p1 * p2) == 0:
            D[90 + ilat, 180 + ilon] = np.nan
        else:
            D[90 + ilat, 180 + ilon] = np.mean(ds.var.values[p1 * p2])
            # D[90 + ilat, 180 + ilon] = np.std(ds.var.values[p1*p2])
Other than using numba/cython to speed this up, I was wondering whether this is something you can directly do with xarray in a more efficient way?
You should be able to solve this using pandas and xarray.
You will first need to convert your data set to a pandas data frame.
Once this is done, with df as the dataframe and the longitude/latitude columns named lon/lat, you round the lon/lats to the nearest integer value and then calculate the mean for each lon/lat pair. The groupby puts lat/lon into the index, so you can then use pandas' to_xarray to convert the result to an xarray Dataset:
import xarray as xr
import pandas as pd
import numpy as np
df = df.assign(lon=lambda x: np.round(x.lon))
df = df.assign(lat=lambda x: np.round(x.lat))
# groupby leaves lat/lon as a MultiIndex, which to_xarray uses as dimensions
df = df.groupby(["lat", "lon"]).mean()
df.to_xarray()
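For completeness, one possible way to get the starting DataFrame from a netcdf file; this is a sketch, and the filename and variable layout are assumptions:
# assumes the file stores 1-D lat, lon and value variables
# along a common point/track dimension
ds_points = xr.open_dataset("tracks.nc")  # hypothetical filename
df = ds_points.to_dataframe().reset_index()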
I used @robert-wilson's answer as a starting point, and to_xarray is indeed part of my solution. Other inspiration came from here. The approach I used is shown below. It's probably slower than numba-ing my solution above, but much simpler.
import netCDF4
import numpy as np
import xarray as xr
import pandas as pd

fname = "super_funky_file.nc"
f = netCDF4.Dataset(fname)
lat = f.variables['lat'][:]
lon = f.variables['lon'][:]
vari = f.variables['super_duper_variable'][:]
df = pd.DataFrame({"lat": lat,
                   "lon": lon,
                   "vari": vari})

# Simple functions to calculate the grid location in rows/cols
# using lat/lon as inputs. Global 0.5 deg grid
# Remember to cast to integer
to_col = lambda x: np.floor((x + 90) / 0.5).astype(np.int)
to_row = lambda x: np.floor((x + 180.) / 0.5).astype(np.int)

# Map the latitudes to columns
# Map the longitudes to rows
df['col'] = df.lat.map(to_col)
df['row'] = df.lon.map(to_row)

# Aggregate by row and col
gg = df.groupby(['col', 'row'])

# Now, create an xarray dataset with
# the mean of vari per grid cell
ds = gg.mean().to_xarray()
dx = gg.std().to_xarray()
ds['stdi'] = dx['vari']
dx = gg.count().to_xarray()
ds['counti'] = dx['vari']
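If you want the result indexed by coordinates rather than integer row/col numbers, a small follow-up sketch (cell centres for the 0.5-degree grid assumed above):
# convert the integer grid indices back to cell-centre lat/lon
ds = ds.assign_coords(col=ds['col'] * 0.5 - 90.0 + 0.25,   # latitude centres
                      row=ds['row'] * 0.5 - 180.0 + 0.25)  # longitude centres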
Background
I am attempting to slice a NetCDF file using a bounding box of lat/lons. The relevant information of this file is listed below (variables, shape, dimensions):
Per most answers here and standard tutorials, this should be very straightforward, and my interpretation is that you just find the indices of the lat/lons and slice the variable array by those indices.
Attempt/Code
from netCDF4 import Dataset
import numpy as np

def netcdf_worker(nc_file, bbox):
    dataset = Dataset(nc_file)
    for variable in dataset.variables.keys():
        if (variable != 'lat') and (variable != 'lon'):
            var_name = variable
            break

    # Full extent of data
    lats = dataset.variables['lat'][:]
    lons = dataset.variables['lon'][:]

    if bbox:
        lat_bnds = [bbox[0], bbox[2]]  # min lat, max lat
        lon_bnds = [bbox[1], bbox[3]]  # min lon, max lon
        lat_inds = np.where((lats > lat_bnds[0]) & (lats < lat_bnds[1]))
        lon_inds = np.where((lons > lon_bnds[0]) & (lons < lon_bnds[1]))
        var_subset = dataset.variables[var_name][:, lat_inds[0], lon_inds[0]]
        # would also be great to slice the lats and lons too for visualization
Problem
When attempting to implement the solutions found in other answers listed on SO via the above code, I am met with the error:
File "/Users/XXXXXX/Desktop/Viewer/viewer.py", line 41, in netcdf_worker
var_subset = dataset.variables[var_name][:, lat_inds[0], lon_inds[0]]
File "netCDF4/_netCDF4.pyx", line 4095, in netCDF4._netCDF4.Variable.__getitem__
File "/Users/XXXXXX/Viewer/lib/python3.6/site-packages/netCDF4/utils.py", line 242, in _StartCountStride
ea = np.where(ea < 0, ea + shape[i], ea)
IndexError: tuple index out of range
I believe there is something minor I am missing/not understanding about slicing multidimensional arrays and would appreciate any help. I am not interested in any solutions that bring any other packages or operate external to python (no CDO or NCKS answers please!). Thank you for your help.
In Python, I think that the easiest solution is to use xarray. Minimal example (using some ERA5 data):
import xarray as xr
f = xr.open_dataset('model_fc.nc')
print(f['latitude'].values) # [52.771 52.471 52.171 51.871 51.571 51.271 50.971]
print(f['longitude'].values) # [3.927 4.227 4.527 4.827 5.127 5.427 5.727]
f2 = f.sel(longitude=slice(4.5, 5.4), latitude=slice(52.45, 51.5))
print(f2['latitude'].values) # [52.171 51.871 51.571]
print(f2['longitude'].values) # [4.527 4.827 5.127]
As an example I'm only showing the latitude and longitude variables, but all variables in the NetCDF file which have latitude and longitude dimensions are sliced automatically. Note that the latitude slice runs from high to low (52.45 to 51.5) because the latitude coordinate in this file is descending.
Alternatively, if you want to select the box manually (using NetCDF4):
import netCDF4 as nc4
import numpy as np
f = nc4.Dataset('model_fc.nc')
lat = f.variables['latitude'][:]
lon = f.variables['longitude'][:]
# All indices in bounding box:
where_j = np.where((lon >= 4.5) & (lon <= 5.4))[0]
where_i = np.where((lat >= 51.5) & (lat <= 52.45))[0]
# Start and end+1 indices in each dimension:
i0 = where_i[0]
i1 = where_i[-1]+1
j0 = where_j[0]
j1 = where_j[-1]+1
print(lat[i0:i1]) # [52.171 51.871 51.571]
print(lon[j0:j1]) # [4.527 4.827 5.127]
Now of course you have to slice each data array manually, using e.g. var_slice = var[j0:j1, i0:i1]
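A short sketch of that last step; the variable name here is a placeholder, and you should check the variable's dimension order before choosing which slice goes first:
var = f.variables['t2m']   # 't2m' is a placeholder variable name
print(var.dimensions)      # e.g. ('latitude', 'longitude')
# lat slice first if the dimensions are (lat, lon):
var_slice = var[i0:i1, j0:j1]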
I have a problem and could not find any answer to it.
I'm trying to create a datacube in Python, where the three axes are (RA, DEC, z), that is, two sky coordinates and redshift.
I think my code for generating the cube works, I define the cube as:
cube = np.zeros([int(size_x),int(size_y),int(Nchannel)])
where x and y are pixel coordinates and the redshift is sliced into channels. Having this cube, I fill it with the intensity of some lines. At the end I define my .fits header as follows:
hdr = fits.Header()
hdr['EQUINOX'] = 2000
hdr['CRPIX1'] = round(size_ra*3600./pix_size/2.)
hdr['CRPIX2'] = round(size_dec*3600./pix_size/2.)
hdr['CRPIX3'] = 0
hdr['CRVAL1'] = ra0
hdr['CRVAL2'] = dec0
hdr['CRVAL3'] = z_min
hdr['CD1_1'] = pix_size/3600.
hdr['CD1_2'] = 0.
hdr['CD2_1'] = 0.
hdr['CD2_2'] = pix_size/3600.
hdr['CTYPE1'] = "RA---TAN"
hdr['CTYPE2'] = "DEC--TAN"
hdr['CTYPE3'] = "Z"
hdr['BUNIT'] = "Jy/pixel"
fits.writeto('cube.fits',cube,hdr,overwrite=True)
And here is the problem: the axes of my cube.fits come out in the wrong order. When I open it using ds9, the z-axis is not the redshift z...
I suspect a bad header, but where can I specify the axes in the FITS header?
Cheers
The axes are indeed inverted: FITS uses the Fortran convention (column-major order), whereas Python/Numpy uses the C convention (row-major order).
http://docs.astropy.org/en/latest/io/fits/appendix/faq.html#what-convention-does-astropy-use-for-indexing-such-as-of-image-coordinates
So for your cube you need to define the axes as (z, y, x):
In [1]: import numpy as np
In [2]: from astropy.io import fits
In [3]: fits.ImageHDU(data=np.zeros((5,4,3))).header
Out[3]:
XTENSION= 'IMAGE ' / Image extension
BITPIX = -64 / array data type
NAXIS = 3 / number of array dimensions
NAXIS1 = 3
NAXIS2 = 4
NAXIS3 = 5
PCOUNT = 0 / number of parameters
GCOUNT = 1 / number of groups
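Applied to the cube from the question, a minimal sketch reusing the size/Nchannel variables defined there (old_cube below is a hypothetical name for a cube already built as (x, y, z)):
# allocate with the axes reversed, so axis 0 is z, axis 1 is DEC (y),
# and axis 2 is RA (x); NAXIS1/2/3 then come out as x/y/z
cube = np.zeros([int(Nchannel), int(size_y), int(size_x)])
fits.writeto('cube.fits', cube, hdr, overwrite=True)
# alternatively, transpose an existing (x, y, z) cube before writing:
# fits.writeto('cube.fits', old_cube.transpose(2, 1, 0), hdr, overwrite=True)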
I have some datasets (let's stay at 2 here) which depend on a common variable t, like X1(t) and X2(t). However, X1(t) and X2(t) don't have to share the same t values or even have the same number of data points.
For example they could look like:
t1 = [2,6,7,8,10,13,14,16,17]
X1 = [10,10,10,20,20,20,30,30,30]
t2 = [3,4,5,6,8,10,11,14,15,16]
X2 = [95,100,100,105,158,150,142,196,200,204]
I am trying to create a new dataset YNew(XNew) (=X2(X1)) such that both datasets are linked without the shared variable t.
In this case it should look like:
XNew = [10,20,30]
YNew = [100,150,200]
where each occurring X1-value is assigned a corresponding X2-value (a mean value).
Is there an easy, already-known way to achieve this (maybe with pandas)?
My first guess would be to find all t-values for a certain X1-value (in the example the X1-value 10 lies in the range 2,...,7), then look for all X2-values in that range and take their mean value. Then you should be able to assign YNew(XNew).
Thanks for any advice!
Update:
I added a graph, so maybe my intentions are a bit clearer. I want to assign the mean X2-value to the corresponding X1-value in the marked regions (where the same X1-values occur).
[graph corresponding to the example lists]
Alright, I just tried to implement what I mentioned and it works the way I wanted.
Although I think that some things are still a little clumsy...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# datasets to treat
t1 = [2,6,7,8,10,13,14,16,17]
X1 = [10,10,10,20,20,20,30,30,30]
t2 = [3,4,5,6,8,10,11,14,15,16]
X2 = [95,100,100,105,158,150,142,196,200,204]
X1Series = pd.Series(X1, index = t1)
X2Series = pd.Series(X2, index = t2)
X1Values = X1Series.drop_duplicates().values  # returns all occurring values of X1 without duplicates as an array
# lists for results
XNew = []
YNew = []
# find for every occurring value of X1 the mean value of X2 in the range of X1
for value in X1Values:
    indexpos = X1Series[X1Series == value].index.values
    max_t = indexpos[indexpos.argmax()]  # get max and min index of the range of X1
    min_t = indexpos[indexpos.argmin()]
    print("X1 = " + str(value) + " occurs in range from " + str(min_t) + " to " + str(max_t))
    slicedX2 = X2Series[(X2Series.index >= min_t) & (X2Series.index <= max_t)]  # select range of X2
    print("in this range there are the following values of X2:")
    print(slicedX2)
    mean = slicedX2.mean()  # calculate mean value of selection and append extracted values
    print("with the mean value of: " + str(mean))
    XNew.append(value)
    YNew.append(mean)
fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
ax1.plot(t1, X1,'ro-',label='X1(t)')
ax1.plot(t2, X2,'bo',label='X2(t)')
ax1.legend(loc=2)
ax1.set_xlabel('t')
ax1.set_ylabel('X1/X2')
ax2.plot(XNew,YNew,'ro-',label='YNew(XNew)')
ax2.legend(loc=2)
ax2.set_xlabel('XNew')
ax2.set_ylabel('YNew')
plt.show()
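For reference, the same grouping can be written more compactly with a pandas groupby; this is a sketch of the identical logic, not a different method, reusing the t1/X1/t2/X2 lists from above:
s1 = pd.Series(X1, index=t1)
s2 = pd.Series(X2, index=t2)

XNew, YNew = [], []
for value, group in s1.groupby(s1):
    # t-range in which this X1-value occurs
    lo, hi = group.index.min(), group.index.max()
    XNew.append(value)
    # mean of all X2-values whose t falls in that range
    YNew.append(s2[(s2.index >= lo) & (s2.index <= hi)].mean())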
I have a gridded weather dataset with dimensions 33 × 77 × 77. The first dimension is time and the other two are Lat and Lon respectively. I need to interpolate (linear or nearest-neighbour) the data to different points (lat & lon) for each time and write it into a csv file. I've used the interp2d function from scipy and it works for one time step. As I have many locations, I don't want to loop over time.
Shown below is the piece of code that I wrote. Can anyone suggest a better method to accomplish the task?
import sys
import numpy as np
import scipy as sp
from scipy.interpolate import interp2d
import datetime
import time
import pygrib as pg

grb_f = pg.open('20150331/gfs.20150331.grb2')
# tmp is assumed to be the set of GRIB messages selected from grb_f;
# the select call itself was not shown in the original post
lat = tmp[0].data(lat1=4, lat2=42, lon1=64, lon2=102)[1]
lat = lat[:, 0]
lon = tmp[0].data(lat1=4, lat2=42, lon1=64, lon2=102)[2]
lon = lon[0, :]

temp = np.empty((0, lon.shape[0]))
for i in range(0, tmp.shape[0]):
    dat = tmp[i].data(lat1=4, lat2=42, lon1=64, lon2=102)
    temp = np.concatenate([temp, dat[0] - 273.15], axis=0)
temp1 = temp.reshape(tmp.shape[0], lat.shape[0], lon.shape[0])

x = 77; y = 28  # (many points)
f = interp2d(lon, lat, temp1[0, :, :], kind='linear', copy=False, bounds_error=True)
Z = f(x, y)
EDIT:
Instead of making a 3D matrix, I appended the data vertically and made a data matrix of size 2541 × 77, with lat and lon of size 2541 × 1. The interp2d function then raises an "Invalid length" error:
f = interp2d(lon, lat, temp1[0,:,:], kind='linear', copy=False, bounds_error=True)
ValueError: Invalid length for input z for non rectangular grid
The lengths of my x, y, z arrays are the same (2541, 2541, 2541), so why does it throw this error?
Could anyone explain? Your help will be highly appreciated.
Processing of time series is very easy with RedBlackPy.
import datetime as dt
import redblackpy as rb
index = [dt.date(2018,1,1), dt.date(2018,1,3), dt.date(2018,1,5)]
lat = [10.0, 30.0, 50.0]
# create Series object
lat_series = rb.Series(index=index, values=lat, dtype='float32',
                       interpolate='linear')
# Now you can access it at any key using linear interpolation.
# Interpolation does not create new items in the Series;
# it uses the neighbours to calculate the value in place when you call getitem.
print(lat_series[dt.date(2018,1,2)])  # prints 20.0
So, if you just want to write interpolated values to a csv file, you can iterate over a list of the needed keys, call getitem on the Series object, and put the value into the file:
# generator for a range of dates
def date_range(start, stop, step=dt.timedelta(1)):
    it = start - step
    while it < stop:  # compare against the stop date, not the step
        it += step
        yield it

#------------------------------------------------
# create list for keeping output strings
out_data = []
# create output file
out_file = open('data.csv', 'w')
# add head for output table
out_data.append('Time,Lat\n')
for date in date_range(dt.date(2018,1,1), dt.date(2018,1,5)):
    out_data.append('{:},{:}\n'.format(date, lat_series[date]))
# write output Series
out_file.writelines(out_data)
out_file.close()
In the same way you can add your Lon data to the processing.
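A sketch of what that could look like, with assumed Lon values and reusing the index, lat_series and date_range defined above:
lon = [100.0, 140.0, 180.0]  # assumed values for illustration
lon_series = rb.Series(index=index, values=lon, dtype='float32',
                       interpolate='linear')

with open('data.csv', 'w') as out_file:
    out_file.write('Time,Lat,Lon\n')
    for date in date_range(dt.date(2018,1,1), dt.date(2018,1,5)):
        out_file.write('{:},{:},{:}\n'.format(
            date, lat_series[date], lon_series[date]))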
If you want to create an "interpolator" object once and use it to sequentially query just the specific points you need, you could take a look at the scipy.interpolate.Rbf module:
"A class for radial basis function approximation/interpolation of n-dimensional scattered data."
Here, n-dimensional will work for your data if you adjust the ratio between the temporal and spatial dimensions, and scattered means you can also use it for regular/uniform data.
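A minimal sketch of that idea on a small fabricated grid; the scale factor that weights time against degrees is a free choice (an assumption), not something Rbf decides for you:
import numpy as np
from scipy.interpolate import Rbf

# small toy grid: 3 times x 5 lats x 5 lons
t, lat, lon = np.meshgrid(np.arange(3.0), np.linspace(4, 42, 5),
                          np.linspace(64, 102, 5), indexing='ij')
values = np.random.rand(*t.shape)

scale = 5.0  # how many "degrees" one time step is worth -- tune this
rbf = Rbf(t.ravel() * scale, lat.ravel(), lon.ravel(), values.ravel(),
          function='linear')
# query one arbitrary (t, lat, lon) point
print(rbf(1.5 * scale, 20.0, 80.0))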
If it's the same lat and lon for each time, you could do it using slices and a manual bilinear interpolation. So if you want a 1D array of values at lat = 4.875, lon = 8.4 (obviously you would need to scale to match your actual spacing):
b = a[:, 4:6, 8:10]
c = ((b[:,0,0] * 0.125 + b[:,0,1] * 0.875) * 0.6 +
     (b[:,1,0] * 0.125 + b[:,1,1] * 0.875) * 0.4)
obviously you could do it all in one line but it would be even uglier
EDIT to allow variable lat and lon at each time period.
lat = np.linspace(55.0, 75.0, 33)
lon = np.linspace(1.0, 25.0, 33)
data = np.linspace(18.0, 25.0, 33 * 77 * 77).reshape(33, 77, 77)

# NB for simplicity I map 0-360 and 0-180 rather than -180..+180
# also need to ensure values on grid lines or edges work ok
lat_frac = lat * 77.0 / 360.0
lat_fr = np.floor(lat_frac).astype(int)
lat_to = lat_fr + 1
lat_frac -= lat_fr

lon_frac = lon * 77.0 / 180.0
lon_fr = np.floor(lon_frac).astype(int)
lon_to = lon_fr + 1
lon_frac -= lon_fr

data_interp = ((data[:, lat_fr, lon_fr] * (1.0 - lat_frac) +
                data[:, lat_fr, lon_to] * lat_frac) * (1.0 - lon_frac) +
               (data[:, lat_to, lon_fr] * (1.0 - lat_frac) +
                data[:, lat_to, lon_to] * lat_frac) * lon_frac)