I am trying to read a NetCDF file from the IRI/LDEO Climate Data Library (dust_pm25_sconc10_mon), but I am having trouble reading it. When I select the variables that compose the dataset (longitude (X), latitude (Y) and time (T)), the output for X and Y is just a sequence counting the observations (1, 2, ..., 139, for example). That is, the longitude and latitude values are not exported correctly.
Could someone help me with this problem? I have already tried reading this file with R, Python and QGIS, and in all three the output for X and Y is the same.
My code (Python) is below.
Thank you all very much.
from netCDF4 import Dataset as dt

filestr = 'dust_pm25_sconc10_mon.nc'
ncfile = dt(filestr, 'r')
print(ncfile.variables)  # list every variable with its dimensions and attributes

lat = ncfile.variables['Y'][:]   # latitude axis
lat
lon = ncfile.variables['X'][:]   # longitude axis
lon
time = ncfile.variables['T'][:]  # time axis
time
Edit:
This file has three independent variables, X, Y and T, and the values of X and Y intentionally run from 1 to len(X) and len(Y), respectively.
Look at the description of the file:
http://iridl.ldeo.columbia.edu/home/.nasa_roses_a19/.Dust_model/.dust_mon_avg/.dust_pm25_sconc10_mon/
Independent Variables (Grids)
Time
grid: /T (months since 1960-01-01) ordered (Mar 1979) to (Mar 2010) by 1.0 N= 373 pts :grid
Longitude
grid: /X (unitless) ordered (1.0) to (191.0) by 1.0 N= 191 pts :grid
Latitude
grid: /Y (unitless) ordered (1.0) to (139.0) by 1.0 N= 139 pts :grid
Of course, this might be meaningful for longitude, but for latitude it is nonsense. Unfortunately, I did not find any hint of which area of the planet this dataset is supposed to describe.
However, I also did not find any data in its only dependent variable, dust_pm25_sconc10_mon: it is empty.
PS: Just as an example:
This dataset here
http://iridl.ldeo.columbia.edu/home/.nasa_roses_a19/.Dust_model/.RegDustModelProjected/.dust_pm25_sconc10/datafiles.html
looks much more reasonable...
The description alone is much more promising:
Independent Variables (Grids)
Time (time)
grid: /T (days since 2009-01-02 00:00) ordered (0130-0430 2 Jan 2009) to (2230 1 Apr 2010 - 0130 2 Apr 2010) by 0.125 N= 3640 pts :grid
Longitude
grid: /X (degree_east) ordered (19.6875W) to (54.6875E) by 0.625 N= 120 pts :grid
Latitude
grid: /Y (degree_north) ordered (0.3125N) to (39.6875N) by 0.625 N= 64 pts :grid
And its dependent variable dust_pm25_sconc10 is also not empty.
I really tried to find this file on the website you mentioned, but without success. So, not knowing the file itself, I have to guess:
NetCDF files offer the possibility to save space by scaling and shifting the values of a variable so that they can be stored, e.g., as int instead of float.
You could simply check whether there is an add_offset attribute other than 0 and a scale_factor attribute other than 1.
For further information about this concept you can refer to https://www.unidata.ucar.edu/software/netcdf/workshops/2010/bestpractices/Packing.html.
While the information in the link above states that the Java interface to netCDF applies these attributes automatically, the netcdf4-python library does not. So if you want to stay with this package, you have to rescale and re-offset the data back to the original values as described.
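A minimal sketch of that check and manual unpacking (file and variable names are taken from the question; automatic conversion is switched off explicitly so the unpacking is applied exactly once, whatever the library's defaults are):

from netCDF4 import Dataset

nc = Dataset('dust_pm25_sconc10_mon.nc', 'r')
var = nc.variables['Y']

# Disable automatic masking/scaling so we see the raw packed values
var.set_auto_maskandscale(False)

scale = getattr(var, 'scale_factor', 1.0)   # defaults used if the attribute is absent
offset = getattr(var, 'add_offset', 0.0)
print('scale_factor:', scale, 'add_offset:', offset)

# Unpacked (physical) values = packed values * scale_factor + add_offset
unpacked = var[:] * scale + offset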
However, you could also consider trying xarray, a library that implements the n-dimensional data structure of netCDF files and which, as far as I have experienced, applies this scaling and offsetting automatically according to the rules described above.
http://xarray.pydata.org/en/stable/
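A hedged sketch of the xarray route (decoding of scale_factor/add_offset is part of xarray's default CF decoding; the file name is the one from the question):

import xarray as xr

# decode_cf=True is the default, so packed variables are unscaled on load
ds = xr.open_dataset('dust_pm25_sconc10_mon.nc')
print(ds['X'].values[:5])
print(ds['Y'].values[:5])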
The example file at http://iridl.ldeo.columbia.edu/home/.nasa_roses_a19/.Dust_model/.dust_mon_avg/.dust_pm25_sconc10_mon/datafiles.html that you linked in your comment on SpghttCd's answer is not well formed. For one thing, the X and Y arrays do not have units attributes appropriate to such dimensions; both are marked "unitless" instead. And, as already noted, the values in those arrays don't "look" valid anyway. Further, the values in the dust_pm25_sconc10_mon array in that file all appear to be NaN.
On the other hand, the example dataset at http://iridl.ldeo.columbia.edu/home/.nasa_roses_a19/.Dust_model/.RegDustModelProjected/.dust_pm25_sconc10/datafiles.html that SpghttCd references has proper units attributes ("degrees_east" and "degrees_north", respectively). Furthermore, the actual values in the X and Y arrays look good. I had no problem plotting the dust_pm25_sconc10 variable in that dataset (using Panoply) and seeing the data mapped over the appropriate region.
SpghttCd's comments regarding scaling and offsets do not apply here, because the longitude and latitude arrays in that second, good file hold actual lon and lat values.
Related
I would like to convert an image (.tiff) into Shapely points. There are 45 million pixels, and I need a way to accomplish this without a loop (it is currently taking 15+ hours).
For example, I have a .tiff file which, when opened, is a 5000x9000 array. The values are pixel values (colors) that range from 1 to 215.
I open the tif with rasterio.open(xxxx.tif).
The desired EPSG is 32615.
I need to preserve the pixel value but also attach the geospatial position, so that I can sjoin over a polygon to see whether the points are inside it. I can handle the transform after processing, but I cannot figure out a way to accomplish this without a loop. Any help would be greatly appreciated!
If you just want a boolean array indicating whether the points are within any of the geometries, I'd dissolve the shapes into a single MultiPolygon then use shapely.vectorized.contains. The shapely.vectorized module is currently not covered in the documentation, but it's really good to know about!
Something along the lines of
import shapely.ops
import shapely.vectorized

# for a gridded dataset with 2-D arrays lats, lons
# and a list of shapely polygons/multipolygons all_shapes
XX = lons.ravel()
YY = lats.ravel()

single_multipolygon = shapely.ops.unary_union(all_shapes)
in_any_shape = shapely.vectorized.contains(single_multipolygon, XX, YY)
If you're looking to identify which shape the points are in, use geopandas.points_from_xy to convert your x, y point coordinates into a GeometryArray, then use geopandas.sjoin to find the index of the shape corresponding to each (x, y) point:
import geopandas

geoarray = geopandas.points_from_xy(XX, YY)
points_gdf = geopandas.GeoDataFrame(geometry=geoarray)
shapes_gdf = geopandas.GeoDataFrame(geometry=all_shapes)

shape_index_by_point = geopandas.sjoin(
    shapes_gdf, points_gdf, how='right', predicate='contains',
)
This is still a large operation, but it's vectorized and will be significantly faster than a looped solution. The geopandas route is also a good option if you'd like to convert the projection of your data or use other geopandas functionality.
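If the coordinate arrays (lons/lats, or projected x/y) aren't available yet, here is a hedged sketch for building them from the raster's affine transform without a Python loop, assuming the file is opened with rasterio.open as in the question ('xxxx.tif' is the name used there):

import numpy as np
import rasterio

with rasterio.open('xxxx.tif') as src:
    values = src.read(1)       # pixel values, shape (rows, cols)
    t = src.transform          # affine transform of the raster
    rows, cols = np.meshgrid(
        np.arange(src.height), np.arange(src.width), indexing='ij'
    )
    # Apply the affine transform with vectorised numpy:
    # x = a*col + b*row + c, y = d*col + e*row + f
    # (add 0.5 to rows/cols first if you want pixel centres rather than corners)
    XX = t.a * cols.ravel() + t.b * rows.ravel() + t.c
    YY = t.d * cols.ravel() + t.e * rows.ravel() + t.f

XX and YY can then be fed straight into shapely.vectorized.contains or geopandas.points_from_xy as above.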
So, I have three numpy arrays which store latitude, longitude, and some property value on a grid -- that is, I have LAT(y,x), LON(y,x), and, say temperature T(y,x), for some limits of x and y. The grid isn't necessarily regular -- in fact, it's tripolar.
I then want to interpolate these property (temperature) values onto a bunch of different lat/lon points (stored as lat1(t), lon1(t), for about 10,000 t...) which do not fall on the actual grid points. I've tried matplotlib.mlab.griddata, but that takes far too long (it's not really designed for what I'm doing, after all). I've also tried scipy.interpolate.interp2d, but I get a MemoryError (my grids are about 400x400).
Is there any sort of slick, preferably fast way of doing this? I can't help but think the answer is something obvious... Thanks!!
Try the combination of inverse-distance weighting and scipy.spatial.KDTree described in the SO question inverse-distance-weighted-idw-interpolation-with-python. Kd-trees work nicely in 2d, 3d, ...; inverse-distance weighting is smooth and local; and the k (number of nearest neighbours) can be varied to trade off speed against accuracy.
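A minimal sketch of that approach (the names LAT, LON, T, lat1, lon1 follow the question; k and the tiny eps guard are illustrative choices):

import numpy as np
from scipy.spatial import cKDTree

def idw_to_points(LON, LAT, T, lon1, lat1, k=8, eps=1e-12):
    # Build a kd-tree on the source grid points (flattened to N x 2)
    src = np.column_stack([LON.ravel(), LAT.ravel()])
    tree = cKDTree(src)
    # Query the k nearest grid points for every target location
    dst = np.column_stack([np.ravel(lon1), np.ravel(lat1)])
    dist, idx = tree.query(dst, k=k)
    # Inverse-distance weights, normalised per target point
    weights = 1.0 / (dist + eps)
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted average of the property values at the k neighbours
    return (weights * T.ravel()[idx]).sum(axis=1)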
There is a nice inverse-distance example by Roger Veciana i Rovira, along with some code using GDAL to write to GeoTIFF if you're into that.
This interpolates to a regular grid, of course, but it works provided you first project the data to a pixel grid with pyproj or something similar, being careful about which projection your data uses.
A copy of his algorithm and example script:
from math import pow
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt

def pointValue(x, y, power, smoothing, xv, yv, values):
    nominator = 0
    denominator = 0
    for i in range(0, len(values)):
        dist = sqrt((x - xv[i])*(x - xv[i]) + (y - yv[i])*(y - yv[i]) + smoothing*smoothing)
        # If the point is really close to one of the data points, return the data
        # point value to avoid singularities
        if dist < 0.0000000001:
            return values[i]
        nominator = nominator + (values[i] / pow(dist, power))
        denominator = denominator + (1 / pow(dist, power))
    # Return NODATA if the denominator is zero
    if denominator > 0:
        value = nominator / denominator
    else:
        value = -9999
    return value

def invDist(xv, yv, values, xsize=100, ysize=100, power=2, smoothing=0):
    valuesGrid = np.zeros((ysize, xsize))
    for x in range(0, xsize):
        for y in range(0, ysize):
            valuesGrid[y][x] = pointValue(x, y, power, smoothing, xv, yv, values)
    return valuesGrid

if __name__ == "__main__":
    power = 1
    smoothing = 20
    # Create some data, with the coordinates and the values stored in separate lists
    xv = [10, 60, 40, 70, 10, 50, 20, 70, 30, 60]
    yv = [10, 20, 30, 30, 40, 50, 60, 70, 80, 90]
    values = [1, 2, 2, 3, 4, 6, 7, 7, 8, 10]
    # Create the output grid (100x100 in the example)
    ti = np.linspace(0, 100, 100)
    XI, YI = np.meshgrid(ti, ti)
    # Run the interpolation and populate the output matrix
    ZI = invDist(xv, yv, values, 100, 100, power, smoothing)
    # Plot the result
    plt.subplot(1, 1, 1)
    plt.pcolor(XI, YI, ZI)
    plt.scatter(xv, yv, 100, values)
    plt.title('Inv dist interpolation - power: ' + str(power) + ' smoothing: ' + str(smoothing))
    plt.xlim(0, 100)
    plt.ylim(0, 100)
    plt.colorbar()
    plt.show()
There are a bunch of options here; which one is best will depend on your data...
However, I don't know of an out-of-the-box solution for you.
You say your input data comes from tripolar data. There are three main cases for how this data could be structured:
1. Sampled from a 3d grid in tripolar space, projected back to 2d LAT, LON data.
2. Sampled from a 2d grid in tripolar space, projected into 2d LAT, LON data.
3. Unstructured data in tripolar space, projected into 2d LAT, LON data.
The easiest of these is 2. Instead of interpolating in LAT LON space, "just" transform your point back into the source space and interpolate there.
Another option that works for 1 and 2 is to search for the cells that map from tripolar space to cover your sample point. (You can use a BSP or grid-type structure to speed up this search.) Pick one of the cells and interpolate inside it.
Finally, there's a heap of unstructured interpolation options... but they tend to be slow.
A personal favourite of mine is to use a linear interpolation of the nearest N points; finding those N points can again be done with gridding or a BSP. Another good option is to Delaunay-triangulate the unstructured points and interpolate on the resulting triangular mesh (see the sketch below).
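A short sketch of that Delaunay route using scipy.interpolate.griddata, which triangulates the scattered source points and interpolates linearly on the resulting mesh (the array names LAT, LON, T, lat1, lon1 are the ones from the question):

import numpy as np
from scipy.interpolate import griddata

# Flatten the curvilinear grid into scattered (lon, lat) points with values
points = np.column_stack([LON.ravel(), LAT.ravel()])
# Linear interpolation on the Delaunay triangulation of those points;
# targets outside the convex hull come back as NaN
t_at_targets = griddata(points, T.ravel(), (lon1, lat1), method='linear')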
Personally, if my mesh were case 1, I'd use an unstructured strategy, as I'd be worried about having to handle searching through cells with overlapping projections. Choosing the "right" cell would be difficult.
I suggest you take a look at GRASS (an open-source GIS package) and its interpolation features (http://grass.ibiblio.org/gdp/html_grass62/v.surf.bspline.html). It's not in Python, but you can reimplement it or interface with the C code.
Am I right in thinking your data grids look something like this (red is the old data, blue is the new interpolated data)?
(image: http://www.geekops.co.uk/photos/0000-00-02%20%28Forum%20images%29/DataSeparation.png)
This might be a slightly brute-force-ish approach, but what about rendering your existing data as a bitmap? (OpenGL will do simple interpolation of colours for you with the right options configured, and you could render the data as triangles, which should be fairly fast.) You could then sample pixels at the locations of the new points.
Alternatively, you could sort your first set of points spatially and then find the closest old points surrounding your new point and interpolate based on the distances to those points.
There is a FORTRAN library called BIVAR which is very suitable for this problem. With a few modifications you can make it usable in Python using f2py.
From the description:
BIVAR is a FORTRAN90 library which interpolates scattered bivariate data, by Hiroshi Akima.
BIVAR accepts a set of (X,Y) data points scattered in 2D, with associated Z data values, and is able to construct a smooth interpolation function Z(X,Y), which agrees with the given data, and can be evaluated at other points in the plane.
So, I am doing some work with data from an INS unit, in order to calculate the errors in its readings by integrating its velocity data over time to get a change in position, and then comparing that to its actual recorded change in position. The problem is that it gives its position with Latitude and Longitude in degrees (to 11 decimal places), and its documentation indicates that these are using the WGS84 standard, while its velocities are given in meters/second (to 10 decimal places).
I found this other question, but its answers assumed that the Earth is a sphere, while the WGS84 standard uses an ellipsoid, and it seems possible that calculations assuming a spherical Earth might introduce errors into mine.
I intend to use Python for my data analysis, so ideally answers should use Python as well, but another language would work for the data cleaning as long as I can save the cleaned data into a text file that Python can read.
Perhaps you could use LatLon (or, for Python 3, LatLon23), which does enable treating the Earth as an ellipsoid.
See an example using LatLon23 for Python 3:
from LatLon23 import LatLon, Latitude, Longitude
palmyra = LatLon(Latitude(5.8833), Longitude(-162.0833)) # Location of Palmyra Atoll
honolulu = LatLon(Latitude(21.3), Longitude(-157.8167)) # Location of Honolulu, HI
distance = palmyra.distance(honolulu) # WGS84 distance in km
print(distance)
print(palmyra.distance(honolulu, ellipse = 'sphere')) # FAI distance in km
initial_heading = palmyra.heading_initial(honolulu) # Heading from Palmyra to Honolulu on WGS84 ellipsoid
print(initial_heading)
hnl = palmyra.offset(initial_heading, distance) # Reconstruct Honolulu based on offset from Palmyra
print(hnl.to_string('D')) # Coordinates of Honolulu
I have downloaded the velocity field of the Greenland ice sheet from the CCI website as a NetCDF file. However, the projection is given as shown below, where x ranges between [-639750, 855750] and y between [-655750, -3355750].
How can I project these data to actual lat/lon coordinates in the NetCDF file? Thanks already! For anyone interested, the file can be downloaded here: http://products.esa-icesheets-cci.org/products/downloadlist/IV/
Variables:
crs
Size: 1x1
Dimensions:
Datatype: int32
Attributes:
grid_mapping_name = 'polar_stereographic'
standard_parallel = 70
straight_vertical_longitude_from_pole = -45
false_easting = 0
false_northing = 0
unit = 'meter'
latitude_of_projection_origin = 90
spatial_ref = 'PROJCS["WGS 84 / NSIDC Sea Ice Polar Stereographic North",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Polar_Stereographic"],PARAMETER["latitude_of_origin",70],PARAMETER["central_meridian",-45],PARAMETER["scale_factor",1],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["X",EAST],AXIS["Y",NORTH],AUTHORITY["EPSG","3413"]]'
y
Size: 5401x1
Dimensions: y
Datatype: double
Attributes:
units = 'm'
axis = 'Y'
long_name = 'y coordinate of projection'
standard_name = 'projection_y_coordinate'
x
Size: 2992x1
Dimensions: x
Datatype: double
Attributes:
units = 'm'
axis = 'X'
long_name = 'x coordinate of projection'
standard_name = 'projection_x_coordinate'
If you want to transform the whole grid from its native Polar Stereographic coordinates to a geographic (longitude by latitude) grid, you'll probably want to use a tool like gdalwarp. I don't think that's the question you're asking, though.
If I'm reading your question correctly, you want to pick points out of the file and locate them as lon/lat coordinate pairs. I'm assuming that you know how to get a location as an XY pair out of your netCDF file, along with the velocity values at that location. I'm also assuming that you're doing this in Python, since you put that tag on this question.
Once you've got an XY pair, you just need a function (with a bunch of parameters) to transform it to lon/lat. You can find that function in the pyproj module.
Pyproj wraps the proj4 C library, which is very widely used for coordinate system transformations. If you have an XY pair in projected coordinates and you know the definition of the projected coordinate system, you can use pyproj's transform function like this:
import pyproj

# Output coordinates are in WGS 84 longitude and latitude
projOut = pyproj.Proj(init='epsg:4326')

# Input coordinates are in meters on the Polar Stereographic
# projection given in the netCDF file
projIn = pyproj.Proj('+proj=stere +lat_0=90 +lat_ts=70 +lon_0=-45 '
                     '+k=1 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs',
                     preserve_units=True)

# here is a coordinate pair near the middle of your data set
x, y = 0.0, -2000000

# transform x, y to lon/lat
lon, lat = pyproj.transform(projIn, projOut, x, y)

# answer: lon = -45.0; lat = 71.6886
... and there you go. Note that the output longitude is -45.0, which should give you a nice warm feeling, since the input X coordinate was 0, and -45.0 is the central meridian of the data set's projection. If you want your answer in radians instead of degrees, set the radians kwarg in the transform function to True.
Now for the hard part, which is actually the thing you do first -- defining the projIn and projOut that are used as arguments to the transform function. These are the input and output coordinate systems for the transformation. They are Proj objects, and they hold a mess of parameters for the coordinate system transformation equations. The proj4 developers have encapsulated them all in a tidy set of functions, and the pyproj developers have put a nice Python wrapper around them, so you and I don't have to keep track of all the details. I will be grateful to them for all the days that remain to me.
The output coordinate system is trivial:
projOut = pyproj.Proj(init='epsg:4326')
The pyproj library can build a Proj object from an EPSG code. 4326 is the EPSG code for WGS 84 lon/lat. Done.
Setting projIn is harder, because your netCDF file defines its coordinate system with a WKT string, which (I'm pretty sure) can't be read directly by proj4 or pyproj. However, pyproj.Proj() will take a proj4 parameter string as an argument. I've already given you the one you need for this operation, so you can just take my word for it that this
+proj=stere +lat_0=90 +lat_ts=70 +lon_0=-45 +k=1 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs
is the equivalent of this (which is copied directly from your netCDF file):
PROJCS["WGS 84 / NSIDC Sea Ice Polar Stereographic North",
GEOGCS["WGS 84",
DATUM["WGS_1984",
SPHEROID["WGS 84",6378137,298.257223563,
AUTHORITY["EPSG","7030"]],
AUTHORITY["EPSG","6326"]],
PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],
UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],
AUTHORITY["EPSG","4326"]],
PROJECTION["Polar_Stereographic"],
PARAMETER["latitude_of_origin",70],
PARAMETER["central_meridian",-45],
PARAMETER["scale_factor",1],
PARAMETER["false_easting",0],
PARAMETER["false_northing",0],
UNIT["metre",1,AUTHORITY["EPSG","9001"]],
AXIS["X",EAST],
AXIS["Y",NORTH],
AUTHORITY["EPSG","3413"]]'
If you want to be able to do this more generally, you'll need another module to convert WKT coordinate system definitions to proj4 parameter strings. One such module is osgeo.osr, and there's an example program at this blog post that shows how to do that conversion.
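A minimal sketch of that conversion with osgeo.osr, reading the WKT from the spatial_ref attribute of the 'crs' variable shown above (the file name here is a placeholder):

from netCDF4 import Dataset
from osgeo import osr

nc = Dataset('greenland_velocity.nc', 'r')   # placeholder file name
wkt = nc.variables['crs'].spatial_ref        # the PROJCS[...] string shown above

srs = osr.SpatialReference()
srs.ImportFromWkt(wkt)
print(srs.ExportToProj4())                   # proj4 string usable with pyproj.Proj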
I have a numpy array for an image and am trying to dump it into the libsvm format of LABEL I0:V0 I1:V1 I2:V2..IN:VN. I see that scikit-learn has a dump_svmlight_file and would like to use that if possible since it's optimized and stable.
It takes parameters of X, y, and file output name. The values I'm thinking about would be:
X - numpy array
y - ????
file output name - self-explanatory
Would this be a correct assumption for X? I'm very confused about what I should do for y though.
It appears it needs to be a feature set of some kind, but I don't know how I would go about obtaining that. Thanks in advance for the help!
The svmlight format is tailored to classification/regression problems. Therefore, the array X is a matrix with as many rows as there are data points in your set and as many columns as features, and y is the vector of instance labels.
For example, suppose you have 1000 objects (images of bicycles and bananas, say), featurized in 400 dimensions. X would be 1000x400, and y would be a 1000-vector with a 1 entry wherever there is a bicycle and a -1 entry wherever there is a banana.
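As a hedged sketch of the call itself (the shapes and labels here are illustrative only):

import numpy as np
from sklearn.datasets import dump_svmlight_file

# 1000 images, each flattened to a 400-dimensional feature vector
X = np.random.rand(1000, 400)
# One label per row: +1 for "bicycle", -1 for "banana"
y = np.where(np.arange(1000) < 500, 1, -1)

dump_svmlight_file(X, y, 'images.svmlight')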