NDFD GRIB2 how to fix mirrored data when using xarray - python

My code for pulling in a GRIB file of wind speeds in New England:
import pandas as pd
import numpy as np
import requests
import cfgrib
import matplotlib.pyplot as plt
import xarray as xr
resp = requests.get('https://tgftp.nws.noaa.gov/SL.us008001/ST.opnl/DF.gr2/DC.ndfd/AR.neast/VP.001-003/ds.wspd.bin', stream=True)
f = open('..\\001_003wspd.grib2', 'wb')
f.write(resp.content)
f.close()
xr_set = xr.load_dataset('..\\001_003wspd.grib2', engine='cfgrib')
xr_set.si10[0].plot(cmap=plt.cm.coolwarm)
This gives the plot below (image omitted): as you can see, the data is mirrored on every other line, east to west. Maine is the most obvious example.

I believe this is not a code problem, but rather a file that wasn't written correctly. If you keep only every other row, you get a correct map:
import numpy as np
import requests
import xarray as xr
from fs.tempfs import TempFS

resp = requests.get('https://tgftp.nws.noaa.gov/SL.us008001/ST.opnl/DF.gr2/DC.ndfd/AR.neast/VP.001-003/ds.wspd.bin', stream=True)
with TempFS() as tempfs:
    path = tempfs.getsyspath("001_003wspd.grib2")
    with open(path, 'wb') as f:
        f.write(resp.content)
    ds = xr.load_dataset(path, engine='cfgrib')
    ds = ds.isel(y=np.arange(len(ds.y))[1::2])  # keep every other row
    ds.si10.isel(step=15).plot(cmap="coolwarm", x='longitude', y='latitude')

The problem is how the GRIB file's rows are ordered. Run the following command to see the ordering:
$ wgrib2 -grid in.grib2
You will likely see something like "(2345 x 1597) input WE|EW:SN output WE:SN". The WE|EW scan mode means the rows alternate between west-to-east and east-to-west, which is exactly why every other row of the plot appears mirrored. The WE|EW:SN needs to be changed to WE:SN. To change the ordering, run:
wgrib2 in.grib2 -ijsmall_grib 1:2345 1:1597 out.grib2
Now out.grib2 has the correct WE:SN ordering and works with xarray.
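If you want to keep the whole workflow in Python, you can shell out to wgrib2 before loading the result. A minimal sketch, assuming wgrib2 is on your PATH and using the grid dimensions reported by wgrib2 -grid above:
import subprocess
import xarray as xr

# Rewrite the file in plain WE:SN row order (grid size taken from `wgrib2 -grid` output)
subprocess.run(
    ['wgrib2', 'in.grib2', '-ijsmall_grib', '1:2345', '1:1597', 'out.grib2'],
    check=True,
)
ds = xr.load_dataset('out.grib2', engine='cfgrib')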

Related

How to read an H5 file containing satellite data in Python?

As part of a project I'm exploring satellite data, and the data is available in H5 format. I'm new to this format and unable to process the data. I can open the file in a piece of software called Panoply, where I found that the DHI value is available in a format called Geo2D. Is there any way to extract the data into a CSV format as shown below:
X     Y     GHI
X1    Y1
X2    Y2
Attaching screenshots of the file opened in Panoply alongside.
Link to the file: https://drive.google.com/file/d/1xQHNgrlrbyNcb6UyV36xh-7zTfg3f8OQ/view
I tried the following code to read the data. I'm able to store it as a 2D numpy array, but unable to do so along with the location.
import h5py
import numpy as np
import pandas as pd
import geopandas as gpd
#%%
f = h5py.File('mer.h5', 'r')
for key in f.keys():
    print(key)           # names of the root-level objects in the HDF5 file: groups or datasets
    print(type(f[key]))  # the object type, usually Group or Dataset
ls = list(f.keys())
key = 'X'
masterdf = pd.DataFrame()
data = f.get(key)
dataset1 = np.array(data)
masterdf = dataset1
np.savetxt("FILENAME.csv", dataset1, delimiter=",")
# masterdf.to_csv('new.csv')
Found an effective way to read the data, convert it to a dataframe and convert the projection parameters.
Code is tracked here: https://github.com/rishikeshsreehari/boring-stuff-with-python/blob/main/data-from-hdf5-file/final_converter.py
Code is as follows:
import pandas as pd
import h5py
import time
from pyproj import Transformer  # pyproj.transform is deprecated in pyproj 2+

input_epsg = 24378
output_epsg = 4326
start_time = time.time()
with h5py.File("mer.h5", "r") as file:
    df_X = pd.DataFrame(file.get("X")[:-2], columns=["X"])
    df_Y = pd.DataFrame(file.get("Y"), columns=["Y"])
    DHI = file.get("DHI")[0][:, :-2].reshape(-1)
final = df_Y.merge(df_X, how="cross").assign(DHI=DHI)[["X", "Y", "DHI"]]
transformer = Transformer.from_crs(input_epsg, output_epsg, always_xy=True)
final['X2'], final['Y2'] = transformer.transform(final["X"].to_numpy(), final["Y"].to_numpy())
# final.to_csv("final_converted1.csv", index=False)
print("--- %s seconds ---" % (time.time() - start_time))

Accelerating reading of contents from a dataframe in pandas

Let us suppose we have a table with the following dimensions:
print(metadata.shape)  # (8732, 8)
Suppose we want to read slice_file_name for each row (and then read the sound files from drive) and extract the mel frequencies:
def feature_extractor(file_name):
    audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features
If I use the following loop:
from tqdm import tqdm
extracted_features = []
for index_num, row in tqdm(metadata.iterrows()):
    file_name = os.path.join(os.path.abspath(Base_Directory), str(row["slice_file_name"]))
    final_class_labels = row["class"]
    data = feature_extractor(file_name)
    extracted_features.append([data, final_class_labels])
it takes the following amount of time in total:
3555it [21:15, 2.79it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1323
n_fft, y.shape[-1]
8326it [48:40, 3.47it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1103
n_fft, y.shape[-1]
8329it [48:41, 3.89it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1523
n_fft, y.shape[-1]
8732it [50:53, 2.86it/s]
How can I optimize this code to do the same thing in less time? Is it possible?
You could try running the feature extractor in parallel; this would give you a new column in your dataframe with the mfccs_scaled_features.
from pandarallel import pandarallel
pandarallel.initialize()

PATH = os.path.abspath(Base_Directory)

def feature_extractor(file_name):
    # If using Windows, you may need to put the imports here:
    # import librosa
    # import numpy as np
    # import os
    file_name = os.path.join(PATH, file_name)
    audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features

df['mfccs_scaled_features'] = df['slice_file_name'].parallel_apply(feature_extractor)
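If you still want tqdm-style feedback, pandarallel can display one progress bar per worker; a small tweak, assuming a reasonably recent pandarallel version:
pandarallel.initialize(progress_bar=True)  # optionally also nb_workers=<n> to cap the worker count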

How to import a .dat file in Google Colab

I am implementing the famous Iris classification problem in Python for the first time. I have a data file, namely iris.data, which I have to import into my Python project. I am trying this in Google Colab.
Sample data
Attributes are:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
I wrote:
import torch
import numpy as np
import matplotlib.pyplot as plt
FILE_PATH = "E:\iris dataset"
MAIN_FILE_NAME = "iris.dat"
data = np.loadtxt(FILE_PATH+MAIN_FILE_NAME, delimiter=",")
But it did not work and threw errors.
It worked when I ran the code on Linux, but currently I am using Windows 10, where it does not work.
Thank you for the help in advance.
When constructing the file name for np.loadtxt, there is a \ missing: FILE_PATH + MAIN_FILE_NAME evaluates to 'E:\iris datasetiris.dat'. To avoid having to add the \ manually between FILE_PATH and MAIN_FILE_NAME, you can use os.path.join, which does this for you.
import os
import numpy as np
FILE_PATH = 'E:\iris dataset'
MAIN_FILE_NAME = 'iris.dat'
data = np.loadtxt(os.path.join(FILE_PATH, MAIN_FILE_NAME), delimiter=',') # not actually working due to last column of file
On the other hand, I am not sure why it worked on Linux, because numpy cannot convert the string "Iris-setosa" into a number, which np.loadtxt tries to do. If you are only interested in the numeric values, you can use the usecols keyword of np.loadtxt:
data = np.loadtxt(os.path.join(FILE_PATH, MAIN_FILE_NAME), delimiter=',', usecols=(0, 1, 2, 3))
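If you also need the class column, one alternative (a sketch, not part of the original answer) is pandas, whose read_csv handles mixed numeric and string columns natively; the column names below are taken from the attribute list in the question:
import os
import pandas as pd

FILE_PATH = 'E:\\iris dataset'
MAIN_FILE_NAME = 'iris.dat'
# The file has no header row, so supply the column names ourselves
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv(os.path.join(FILE_PATH, MAIN_FILE_NAME), header=None, names=cols)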

How does numpy.memmap work on HDF5 with multiple datasets?

I'm trying to memory-map individual datasets in an HDF5 file:
import h5py
import numpy as np
import numpy.random as rdm
n = int(1E+8)
rdm.seed(70)
dset01 = rdm.rand(n)
dset02 = rdm.normal(0, 1, size=n).astype(np.float32)
with h5py.File('foo.h5', mode='w') as f0:
    f0.create_dataset('dset01', data=dset01)
    f0.create_dataset('dset02', data=dset02)
fp = np.memmap('foo.h5', mode='r', dtype='double')
print(dset01[:3])
print(fp[:3])
del fp
However, the output below shows that the values in fp don't match those in dset01.
[0.92748054 0.87242629 0.58463127]
[5.29239776e-260 1.11688278e-308 5.18067355e-318]
I am guessing I should have set an 'offset' value when I called np.memmap. Is that the mistake in my code? If so, how do I find the correct offset value of each dataset in an HDF5 file?
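That is indeed the problem: np.memmap knows nothing about the HDF5 container, so without an offset it reads the file's header and metadata as doubles. For contiguous (non-chunked, uncompressed) datasets, h5py can report the byte offset of each dataset within the file; a minimal sketch:
import h5py
import numpy as np

with h5py.File('foo.h5', 'r') as f:
    ds = f['dset01']
    # get_offset() returns the dataset's byte offset within the file,
    # or None if the data is chunked/compressed (i.e. not contiguous)
    offset = ds.id.get_offset()
    dtype = ds.dtype
    shape = ds.shape

fp = np.memmap('foo.h5', mode='r', dtype=dtype, offset=offset, shape=shape)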

How do I import tif using gdal?

I'm trying to get my tif file into a usable format in Python so I can analyze the data. However, every time I import it, I just get an empty list. Here's my code:
xValues = [447520.0, 432524.0, 451503.0]
yValues = [4631976.0, 4608827.0, 4648114.0]
gdal.AllRegister()
dataset = gdal.Open('final_snow.tif', GA_ReadOnly)
if dataset is None:
    print('Could not open image')
    sys.exit(1)
data = np.array([gdal.Open(name, gdalconst.GA_ReadOnly).ReadAsArray() for name, descr in dataset.GetSubDatasets()])
print('this is data ', data)
It always prints an empty list, but it doesn't throw an error. I checked out other questions, such as "Create shapefile from tif file using GDAL". What might be the problem?
For osgeo.gdal, it should look like this:
from osgeo import gdal
gdal.UseExceptions() # not required, but a good idea
dataset = gdal.Open('final_snow.tif', gdal.GA_ReadOnly)
data = dataset.ReadAsArray()
Where data is either a 2D array for single-band rasters or a 3D array for multiband rasters.
An alternative with rasterio looks like:
import rasterio
with rasterio.open('final_snow.tif', 'r') as r:
    data = r.read()
Where data is always a 3D array, with the first dimension as band index.
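As for why the original code returned an empty list: GetSubDatasets() is only populated for container formats such as NetCDF or HDF; a plain GeoTIFF has no subdatasets, so the list comprehension iterates over nothing. Since the question also defines point coordinates in xValues/yValues, here is a sketch of sampling the raster at those points via the geotransform, assuming the points are in the raster's CRS:
from osgeo import gdal

gdal.UseExceptions()
dataset = gdal.Open('final_snow.tif', gdal.GA_ReadOnly)
data = dataset.ReadAsArray()  # 2D for a single-band raster

xValues = [447520.0, 432524.0, 451503.0]
yValues = [4631976.0, 4608827.0, 4648114.0]

x0, dx, _, y0, _, dy = dataset.GetGeoTransform()  # origin and pixel size
for x, y in zip(xValues, yValues):
    col = int((x - x0) / dx)
    row = int((y - y0) / dy)  # dy is negative for north-up rasters
    print(x, y, data[row, col])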
