I'm creating netCDF files with pre-defined data types for the variables and attributes, using the netCDF4 Python package.
My minimal example looks like this:
from netCDF4 import Dataset
import numpy as np
root_grp = Dataset("test_single_band.nc", 'w', format='NETCDF4')
data_grp = root_grp.createGroup("data")
data_grp.createDimension("num_pixels", 3264)
data_grp.createDimension("num_lines", None)
measurement_data_grp = data_grp.createGroup("measurement_data")
measurement_data_grp.createVariable("band", "u2",
                                    ("num_pixels", "num_lines"),
                                    fill_value=np.uint16(8191))
measurement_data_grp["band"].long_name = "radiances"
measurement_data_grp["band"].units = "W m-2 sr-1 um-1"
measurement_data_grp["band"].scale_factor = np.float(0.085006105)
measurement_data_grp["band"].add_offset = np.float(7.61)
measurement_data_grp["band"].valid_min = np.uint16(0)
measurement_data_grp["band"].valid_max = np.uint16(8190)
data_max = 4830.
data_min = 30.
data = data_max*np.random.random((3264,3800)) + data_min
target = root_grp["data/measurement_data/band"]
target[:] = data.astype(target.dtype)
root_grp.close()
And my issue is as follows: The file created by this script is displayed weirdly by Panoply:
[Image: erroneous plot by Panoply]
i.e. many values are displayed as if they were NaN or larger than valid_max, which by construction they are not. It should instead look like this:
[Image: correct plot by Panoply]
Panoply displays the data correctly if I leave out the definition of valid_max, or if valid_max is set to a floating point data type. Using valid_range instead doesn't change anything.
Any pointers to what is going wrong?
Your code generates random data values between 30 and 4860, but since you have specified a scale_factor and add_offset, the values will be stored as packed data. So in this case the values written to the file will be ints between 263 [= (30 - 7.61) / 0.085006105] and 57082 [= (4860 - 7.61) / 0.085006105].
The problem is that, by convention, when packed data in a netCDF file is used in conjunction with valid_min and valid_max, the min and max must be specified in terms of the packed values rather than the unpacked values. Since you specified a valid_max of 8190, any value that packs to an int between 8190 and 57082 will be treated as invalid when unpacked by software following standard netCDF conventions.
See:
http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch08.html#packed-data
http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch02s05.html#missing-data
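In practice the fix is to express valid_min and valid_max in packed units. A minimal sketch, keeping your scale_factor and add_offset and assuming you want physical values up to 4860 (data_max + data_min) to count as valid:
import numpy as np

scale_factor = 0.085006105
add_offset = 7.61
phys_max = 4860.0  # largest physical value the data can reach

# valid_min/valid_max are compared against the packed (stored) integers
packed_max = np.uint16((phys_max - add_offset) / scale_factor)  # 57082, as computed above
measurement_data_grp["band"].valid_min = np.uint16(0)
measurement_data_grp["band"].valid_max = packed_max
# Note: by the same conventions the _FillValue (8191 here) should also lie
# outside the valid range, so you may want to pick a fill value above packed_max.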
I'm trying to resample a set of GRIB2 arrays at 0.25 degree resolution to a coarser 0.5 degree resolution using the xESMF package (xarray's coarsen method does not work here because there is an odd number of coordinates in the latitude).
I have converted the GRIB data to xarray format through the pygrib package, which then subsets out the specific grid I need:
import os

import pygrib
import xarray as xr
import xesmf as xe

fhr = 96

gridDefs = {
    "0.25": {'url': "https://noaa-gefs-retrospective.s3.amazonaws.com/landsfc.pgrb2.0p25"},
    "0.5": {'url': "https://noaa-gefs-retrospective.s3.amazonaws.com/landsfc.pgrb2.0p50"},
}

fileDefs = {
    "0.25": {'url': "https://noaa-gefs-retrospective.s3.amazonaws.com/GEFSv12/reforecast/2019/2019051900/c00/Days%3A1-10/tmp_pres_2019051900_c00.grib2",
             'localfile': "tmp_pres.grib2"},
    "0.5": {'url': "https://noaa-gefs-retrospective.s3.amazonaws.com/GEFSv12/reforecast/2019/2019051900/c00/Days%3A1-10/tmp_pres_abv700mb_2019051900_c00.grib2",
            'localfile': "tmp_pres_abv_700.grib2"},
}
def grib_to_xs(grib, vName):
    # Wrap a pygrib message's values in an xarray Dataset under the given name
    arr = xr.DataArray(grib.values)
    arr = arr.rename({'dim_0': 'lat', 'dim_1': 'lon'})
    xs = arr.to_dataset(name=vName)
    return xs
gribs = {}
for key, item in gridDefs.items():
    if not os.path.exists(item['url'][item['url'].rfind('/')+1:]):
        os.system("wget " + item['url'])
    lsGrib = pygrib.open(item['url'][item['url'].rfind('/')+1:])
    landsea = lsGrib[1].values
    gLats = lsGrib[1]["distinctLatitudes"]
    gLons = lsGrib[1]["distinctLongitudes"]
    gribs["dataset" + key] = xr.Dataset({'lat': gLats, 'lon': gLons})
    lsGrib.close()
for key, item in fileDefs.items():
    if not os.path.exists(item['localfile']):
        os.system("wget " + item['url'])
        os.system("mv " + item['url'][item['url'].rfind('/')+1:] + " " + item['localfile'])

for key, item in fileDefs.items():
    hold = pygrib.open(item['localfile'])
    subsel = hold.select(forecastTime=fhr)
    # Grab the first item
    gribs[key] = grib_to_xs(subsel[1], "TT" + key)
    hold.close()
The above code downloads two constant files (landsfc) for the two grid domains (0.25 and 0.5), then downloads a GRIB file at each of the two resolutions as well. I'm trying to resample the 0.25 degree GRIB file (tmp_pres.grib2) to the 0.5 degree domain like so:
regridder = xe.Regridder(ds, gribs['dataset0.5'], 'bilinear')
print(regridder)
ds2 = regridder(ds)
My issue is that I generate two warning messages when trying to use the regridder:
/media/robert/HDD/Anaconda3/envs/wrf-work/lib/python3.8/site-packages/xarray/core/dataarray.py:682: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
return key in self.data
/media/robert/HDD/Anaconda3/envs/wrf-work/lib/python3.8/site-packages/xesmf/backend.py:53: UserWarning: Latitude is outside of [-90, 90]
warnings.warn('Latitude is outside of [-90, 90]')
The output xarray does have the correct coordinates; however, the values inside the grid are way off (outside the maxima/minima of the finer-resolution grid) and exhibit strange banding patterns that make no physical sense.
What I would like to know is: is this the correct process for upscaling an array using xESMF, and if not, how should I address the problem?
Any assistance would be appreciated, thanks!
I would recommend first trying conservative instead of bilinear (it's recommended in their documentation). Also check that you're using the parameters correctly, because it seems something is wrong; my first guess would be that something you're doing moves the latitude around for some reason. I'm leaving the docs links here and hope someone who knows more can chime in.
Regridder docs:
https://xesmf.readthedocs.io/en/latest/user_api.html?highlight=regridder#xesmf.frontend.Regridder.__init__
Upscaling recommendation (search for upscaling, there's also a guide for increasing resolution):
https://xesmf.readthedocs.io/en/latest/notebooks/Compare_algorithms.html?highlight=upscaling
Thanks to the documentation links and recommendations provided by MASACR 99, I was able to dig deeper into the xESMF package and find a working example of its resampling methods from the package author (https://github.com/geoschem/GEOSChem-python-tutorial/blob/main/Chapter03_regridding.ipynb). My issue was solved by two changes:
I changed the method from bilinear to conservative (this also required adding two fields to the input grids: the cell boundaries for latitude and longitude).
Instead of directly passing the variable being resampled to the regridder, I had to define two fixed grids to create the regridder, and then pass individual variables to it.
To solve the first change, I created a new function to give me the boundary variables:
import numpy as np

def get_bounds(arr, gridSize):
    lonMin = np.nanmin(arr["lon"].values)
    latMin = np.nanmin(arr["lat"].values)
    lonMax = np.nanmax(arr["lon"].values)
    latMax = np.nanmax(arr["lat"].values)
    sizeLon = len(arr["lon"])
    sizeLat = len(arr["lat"])

    bounds = {}
    bounds["lon"] = arr["lon"].values
    bounds["lat"] = arr["lat"].values
    bounds["lon_b"] = np.linspace(lonMin - (gridSize / 2), lonMax + (gridSize / 2), sizeLon + 1)
    bounds["lat_b"] = np.linspace(latMin - (gridSize / 2), latMax + (gridSize / 2), sizeLat + 1).clip(-90, 90)
    return bounds
For the second change, I modified the regridder definition and application to use the statically defined grids, then passed the desired variable to resample:
regridder = xe.Regridder(get_bounds(gribs['dataset0.25'], 0.25), get_bounds(gribs['dataset0.5'], 0.5), 'conservative')
print(regridder)
ds2 = regridder(ds)
As a hobby project, I'm coding a basic game in Python, and I need to store a map of the game world. It can be viewed as a 2-D array of heights. For the moment, my map dimensions are 5000x5000.
I store that in a SQLite DB (schema: CREATE TABLE map (x SMALLINT, y SMALLINT, h SMALLINT); plus a VACUUM at the end of the creation), but it takes up to 500MB on disk.
I can compress the SQLite file (with lzma, for example), and it then takes only ~35-40MB, but in order to use it in Python I need to decompress it first, so it always ends up taking that much space anyway.
How would you store that kind of data in Python?
A 2-D array of ints, or a list of 3-int tuples, of those dimensions (or bigger), that could still run on a Raspberry Pi? Speed is not important, but RAM and file size are.
You need 10 bits to store each height, so 10 bytes can store 8 heights, and thus 31.25MB can store all 25,000,000 of them. You can figure out which block of 10 bytes stores a desired height (how depends on how you arrange them), and a little bit-shifting can isolate the specific height you want (since every height will be split across 2 adjacent bytes).
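A sketch of that idea with NumPy, purely illustrative: it assumes the heights already fit in 10 bits (0-1023) and packs them MSB-first in row-major order; on a Raspberry Pi you would process this in row-sized chunks rather than all 25,000,000 values at once:
import numpy as np

MAP_W = MAP_H = 5000
heights = np.random.randint(0, 1024, size=(MAP_H, MAP_W), dtype=np.uint16)

# Pack: spread each height over 10 bits (MSB first), then pack 8 bits per byte
shifts = np.arange(9, -1, -1)
bits = ((heights.reshape(-1, 1) >> shifts) & 1).astype(np.uint8)
np.packbits(bits.ravel()).tofile("map.bin")  # 31,250,000 bytes on disk

# Unpack: read the bytes, regroup into 10-bit chunks, and recombine
raw = np.unpackbits(np.fromfile("map.bin", dtype=np.uint8))
chunks = raw[:MAP_W * MAP_H * 10].reshape(-1, 10).astype(np.uint16)
restored = (chunks << shifts).sum(axis=1, dtype=np.uint16).reshape(MAP_H, MAP_W)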
I finally used the HDF5 file format, with PyTables.
The outcome is a ~20MB file for the exact same data, directly usable by the application.
Here is how I create it:
import tables

MAP_WIDTH = MAP_HEIGHT = 5000

db_struct = {
    'x': tables.Int16Col(),
    'y': tables.Int16Col(),
    'h': tables.Int16Col(),
}

h5file = tables.open_file("my_file.h5", mode="w", title='Map')
filters = tables.Filters(complevel=9, complib='lzo')
group = h5file.create_group('/', 'group', 'Group')
table = h5file.create_table(group, 'map', db_struct, filters=filters)

heights = table.row
for y in range(0, int(MAP_HEIGHT)):
    for x in range(0, int(MAP_WIDTH)):
        heights['x'] = x
        heights['y'] = y
        heights['h'] = h  # h = the height at (x, y), from the map generator
        heights.append()
    table.flush()
table.flush()
h5file.close()
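For completeness, reading it back is just as direct; a minimal sketch, assuming the file created above:
import tables

h5file = tables.open_file("my_file.h5", mode="r")
table = h5file.root.group.map
# In-kernel query: fetch the heights along one row of the map
row_heights = [row['h'] for row in table.where('y == 42')]
h5file.close()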
I would like to read a series of coordinates, each with an accuracy value, into a triangulation function that produces the triangulated coordinates. I've been able to use Python to create a .txt document that contains the list of coordinates for each triangulation, i.e.:
[(-1.2354798, 36.8959406, -22.0), (-1.245124, 36.9027361, -31.0), (-1.2387697, 36.897921, -12.0), (-1.3019762, 36.8923956, -4.0)]
[(-1.3103075, 36.8932163, -70.0), (-1.3017684, 36.8899228, -12.0)]
[(-1.3014139, 36.8899931, -34.0), (-1.2028006, 36.9180461, -54.0), (-1.1996497, 36.9286186, -67.0), (-1.2081047, 36.9239936, -22.0), (-1.2013893, 36.9066869, -11.0)]
Each of those would be one group of coordinates and accuracies to feed into the triangulation function. The text document separates the groups by line.
This is the triangulation function I am trying to read the text file into:
def triangulate(points):
    """
    Given points in (x, y, signal) format, approximate the position (x, y).

    Reading:
    * http://stackoverflow.com/questions/10329877/how-to-properly-triangulate-gsm-cell-towers-to-get-a-location
    * http://www.neilson.co.za/?p=364
    * http://gis.stackexchange.com/questions/40660/trilateration-algorithm-for-n-amount-of-points
    * http://gis.stackexchange.com/questions/2850/what-algorithm-should-i-use-for-wifi-geolocation
    """
    # Weighted signal strength
    ws = sum(p[2] for p in points)
    points = tuple((x, y, signal / ws) for (x, y, signal) in points)
    # Approximate
    return (
        sum(p[0] * p[2] for p in points),  # x
        sum(p[1] * p[2] for p in points),  # y
    )
print(triangulate([
    (14.2565389, 48.2248439, 80),
    (14.2637736, 48.2331576, 55),
    (14.2488966, 48.232513, 55),
    (14.2488163, 48.2277972, 55),
    (14.2647612, 48.2299558, 21),
]))
When I test the function with the above print statement, it works. But when I try to load the data from the text file into the function as follows:
with open(filename, 'r') as file:
    for points in file:
        triangulate(points)
I get the error IndexError: string index out of range. I understand that this is because each line is read in as a string rather than a list, but when I try to convert it with points = list(points), it is still not recognized as a list of coordinate tuples. My question is: how should I read the file into Python so that it works with the triangulate function?
What you get from the file is a string, but Python doesn't know anything about how that string should be interpreted. It could be a printed representation of a list of tuples, as in your case, but it could just as well be a part of a book, or it could be some compressed data, or so on. It's not the language's job to guess how to treat the string that gets read from the file. That's your job; you have to write some code to take those strings and parse them - that is, convert them into the data your program needs, using the reverse of the rules that were used to convert that data into strings in the first place.
Now, this is certainly a thing you could do, but it's probably better to just use something other than print(). That is, use a different set of rules for converting your data into strings, one where people have already written the code to reverse the process. A common format you could use is JSON, for which Python includes a library to do the conversions. Other formats that can work with numerical data include CSV (here's the Python module) and HDF5 (supported by an external library, probably overkill for your case). The point is, you need to choose some set of rules for converting between data and strings and use the corresponding code in both directions. In your original example, you were only using the rule for going from data to strings and expecting Python to guess the rule for going back.
If you want to read more about this, the process of converting data to strings (or, really, to something that can be put in a file) is called formatting or serialization, depending on context, and the reverse process of converting the strings back to the original data is called parsing or deserialization.
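That said, if you need to parse the files you have already written (they contain the printed representation of Python lists), the standard library's ast.literal_eval can safely evaluate exactly that representation; a minimal sketch:
import ast

with open(filename, 'r') as file:
    for line in file:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        points = ast.literal_eval(line)  # e.g. [(-1.2354798, 36.8959406, -22.0), ...]
        print(triangulate(points))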
As suggested by @FMCorz, you should use JSON or some other machine-readable format.
Doing so is simple and just a matter of dumping your list of points to the text file in any machine-readable format and then later reading it back in.
Here is a minimal example (using JSON):
import json

def triangulate(points):
    """
    Given points in (x, y, signal) format, approximate the position (x, y).

    Reading:
    * http://stackoverflow.com/questions/10329877/how-to-properly-triangulate-gsm-cell-towers-to-get-a-location
    * http://www.neilson.co.za/?p=364
    * http://gis.stackexchange.com/questions/40660/trilateration-algorithm-for-n-amount-of-points
    * http://gis.stackexchange.com/questions/2850/what-algorithm-should-i-use-for-wifi-geolocation
    """
    # Weighted signal strength
    ws = sum(p[2] for p in points)
    points = tuple((x, y, signal / ws) for (x, y, signal) in points)
    # Approximate
    return (
        sum(p[0] * p[2] for p in points),  # x
        sum(p[1] * p[2] for p in points),  # y
    )

points = [(14.2565389, 48.2248439, 80),
          (14.2637736, 48.2331576, 55),
          (14.2488966, 48.232513, 55),
          (14.2488163, 48.2277972, 55),
          (14.2647612, 48.2299558, 21)]

with open("points.txt", 'w') as file:
    file.write(json.dumps(points))

with open("points.txt", 'r') as file:
    for line in file:
        points = json.loads(line)
        print(triangulate(points))
If you wanted to use a list of lists (a list containing lists of points), you could do something like this:
import json

def triangulate(points):
    """
    Given points in (x, y, signal) format, approximate the position (x, y).

    Reading:
    * http://stackoverflow.com/questions/10329877/how-to-properly-triangulate-gsm-cell-towers-to-get-a-location
    * http://www.neilson.co.za/?p=364
    * http://gis.stackexchange.com/questions/40660/trilateration-algorithm-for-n-amount-of-points
    * http://gis.stackexchange.com/questions/2850/what-algorithm-should-i-use-for-wifi-geolocation
    """
    # Weighted signal strength
    ws = sum(p[2] for p in points)
    points = tuple((x, y, signal / ws) for (x, y, signal) in points)
    # Approximate
    return (
        sum(p[0] * p[2] for p in points),  # x
        sum(p[1] * p[2] for p in points),  # y
    )

points_list = [[(-1.2354798, 36.8959406, -22.0), (-1.245124, 36.9027361, -31.0), (-1.2387697, 36.897921, -12.0), (-1.3019762, 36.8923956, -4.0)],
               [(-1.3103075, 36.8932163, -70.0), (-1.3017684, 36.8899228, -12.0)],
               [(-1.3014139, 36.8899931, -34.0), (-1.2028006, 36.9180461, -54.0), (-1.1996497, 36.9286186, -67.0), (-1.2081047, 36.9239936, -22.0), (-1.2013893, 36.9066869, -11.0)]]

with open("points.txt", 'w') as file:
    file.write(json.dumps(points_list))

with open("points.txt", 'r') as file:
    for line in file:
        points_list = json.loads(line)
        for points in points_list:
            print(triangulate(points))
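If you would rather keep the one-group-per-line layout of your original files, a small variation writes each group as its own JSON line. (Note that JSON has no tuple type, so the points come back as lists; triangulate handles both.)
with open("points.txt", 'w') as file:
    for points in points_list:
        file.write(json.dumps(points) + "\n")  # one JSON document per line

with open("points.txt", 'r') as file:
    for line in file:
        print(triangulate(json.loads(line)))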
I'm currently having a little issue with a FITS file. The data is in table format, which I haven't used before. I'm a Python user and rely heavily on astropy.io.fits to manipulate FITS images. A quick output of the info gives:
No. Name Type Cards Dimensions Format
0 PRIMARY PrimaryHDU 60 ()
1 BinTableHDU 29 3072R x 2C [1024E, 1024E]
The header for the BinTableHDU is as follows:
XTENSION= 'BINTABLE' /Written by IDL: Mon Jun 22 23:28:21 2015
BITPIX = 8 /
NAXIS = 2 /Binary table
NAXIS1 = 8192 /Number of bytes per row
NAXIS2 = 3072 /Number of rows
PCOUNT = 0 /Random parameter count
GCOUNT = 1 /Group count
TFIELDS = 2 /Number of columns
TFORM1 = '1024E ' /Real*4 (floating point)
TFORM2 = '1024E ' /Real*4 (floating point)
TTYPE1 = 'COUNT_RATE' /
TUNIT1 = '1e-6cts/s/arcmin^2' /
TTYPE2 = 'UNCERTAINTY' /
TUNIT2 = '1e-6cts/s/arcmin^2' /
HISTORY g000m90r1b120pm.fits created on 10/08/97. PI channel range: 8: 19
PIXTYPE = 'HEALPIX ' / HEALPIX pixelisation
ORDERING= 'NESTED ' / Pixel ordering scheme, either RING or NESTED
NSIDE = 512 / Healpix resolution parameter
NPIX = 3145728 / Total number of pixels
OBJECT = 'FULLSKY ' / Sky coverage, either FULLSKY or PARTIAL
FIRSTPIX= 0 / First pixel # (0 based)
LASTPIX = 3145727 / Last pixel # (zero based)
INDXSCHM= 'IMPLICIT' / indexing : IMPLICIT or EXPLICIT
GRAIN = 0 / GRAIN = 0: No index,
COMMENT GRAIN =1: 1 pixel index for each pixel,
COMMENT GRAIN >1: 1 pixel index for Grain consecutive pixels
BAD_DATA= -1.63750E+30 / Sentinel value given to bad pixels
COORDSYS= 'G ' / Pixelization coordinate system
COMMENT G = Galactic, E = ecliptic, C = celestial = equatorial
END
I'd like to access the image stored in the column labeled 'COUNT_RATE', and get it into a format I can then add to other count-rate arrays with the same dimensions.
I started with my usual procedure for opening a FITS file:
hdulist_RASS_SXRB_R1 = fits.open('/Users/.../RASS_SXRB_R1.fits')
hdulist_RASS_SXRB_R1.info()
image_XRAY_SKYVIEW_R1 = hdulist_RASS_SXRB_R1[1].data
image_XRAY_SKYVIEW_R1 = numpy.array(image_XRAY_SKYVIEW_R1)
image_XRAY_SKYVIEW_header_R1 = hdulist_RASS_SXRB_R1[1].header
But this comes back with IndexError: too many indices for array. I've had a look at accessing table data in the astropy documentation (under "Accessing data stored as a table in a multi-extension FITS (MEF) file").
If anyone has a tried and tested method for accessing such images from a FITS table, I'd be very grateful! Many thanks.
I can't be sure without seeing the full traceback but I think the exception you're getting is from this:
image_XRAY_SKYVIEW_R1 = numpy.array(image_XRAY_SKYVIEW_R1)
There's no reason to manually wrap numpy.array() around the array. It's already a Numpy array. But in this case it's a structured array (see http://docs.scipy.org/doc/numpy/user/basics.rec.html).
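In other words, dropping that line and indexing the structured array by column name should be enough; a minimal sketch using the column names from your header:
tbdata = hdulist_RASS_SXRB_R1[1].data
count_rate = tbdata['COUNT_RATE']     # float32 array of shape (3072, 1024)
uncertainty = tbdata['UNCERTAINTY']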
@Andromedae93's answer is the right one. But for general documentation on this, also see: http://docs.astropy.org/en/stable/io/fits/index.html#working-with-table-data
However, the way you're working (manually calling fits.open, accessing the .data attribute of the HDU, and so on) is fine for images but fairly low level, and Numpy structured arrays are good at representing tables, but not great for manipulating them.
You're better off generally using Astropy's higher-level Table interface. A FITS table can be read directly into an Astropy Table object with Table.read(): http://docs.astropy.org/en/stable/io/unified.html#fits
The only reason the same thing doesn't exist for FITS images is that there's no generic "Image" class yet.
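As a short sketch of that higher-level route (the path is the one from your question):
from astropy.table import Table

t = Table.read('/Users/.../RASS_SXRB_R1.fits', hdu=1)
count_rate = t['COUNT_RATE']  # a Column whose 3072 rows are 1024-element arrays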
I used astropy.io.fits during my internship in astrophysics, and this is my process for opening a .fits file and performing some operations:
# Open the .fits file named SMASH.fits
from astropy.io import fits

field = fits.open('SMASH.fits')

# Read the FITS table data
tbdata = field[1].data
Now tbdata is a numpy array (a structured array), and you can do a lot with it.
For example, if you have data like:
ID, Name, Object
1, HD 1527, Star
2, HD 7836, Star
3, NGC 6739, Galaxy
If you want to extract a single column, for example the names:
Data_name = tbdata['Name']
print(Data_name)
You will get:
HD 1527
HD 7836
NGC 6739
I don't know exactly what you want to do with your data, but I hope this helps ;)
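Applied to the file from the question, the same column access should give the count-rate data; a hedged sketch (each of the 3072 rows holds 1024 pixels, so flattening yields the full NSIDE=512 HEALPix map of 3145728 values; other_map below stands for any hypothetical second array of the same shape):
import numpy as np

count_rate = tbdata['COUNT_RATE']              # shape (3072, 1024)
healpix_map = np.asarray(count_rate).ravel()   # 3145728 pixels, NESTED ordering
total = healpix_map + other_map                # other_map: another count-rate map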
I would like to test the accuracy of a Highcharts graph presenting data from a JSON file (which I have already read) using Python and Selenium WebDriver.
How can I read the Highchart data from the website?
Thank you, Evgeny
The highchart data is converted to an SVG path, so you'd have to interpret the path yourself. I'm not sure why you would want to do this, actually: in general you can trust 3rd party libraries to work as advertised; the testing of that code should reside in that library.
If you still want to do it, then you'd have to dive into Javascript to retrieve the data. Taking the Highcharts Demo as an example, you can extract the data points for the first line as shown below. This will give you the SVG path definition as a string, which you can then parse to determine the origin and the data points. Comparing this to the size of the vertical axis should allow you to calculate the value implied by the graph.
import re

# Get the origin and datapoints of the first line; the path's 'd' attribute
# holds the SVG path string, e.g. "M x0 y0 L x1 y1 L x2 y2 ..."
s = selenium.get_eval(
    "window.jQuery('svg g.highcharts-tracker path:eq(0)').attr('d')")
splitted = re.split(r'\s+L\s+', s)
origin = splitted[0].split(' ')[1:]  # drop the leading 'M'
data = [p.split(' ') for p in splitted[1:]]

# Convert to floats
origin = [float(origin[0]), float(origin[1])]
data = [[float(x), float(y)] for x, y in data]

# Get the min and max y-axis value and position
min_y_val = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:first').text()"))
max_y_val = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:last').text()"))
min_y_pos = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:first').attr('y')"))
max_y_pos = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:last').attr('y')"))

# Calculate the value based on the retrieved positions
y_scale = min_y_pos - max_y_pos
y_range = max_y_val - min_y_val
y_percentage = data[0][1] * 100.0 / y_scale
value = max_y_val - (y_range * y_percentage / 100.0)
Disclaimer: I didn't have time to fully verify this, but something along these lines should give you what you want.