As part of a project I'm exploring satellite data that is available in H5 format. I'm new to this format and have been unable to process the data. I can open the file in Panoply, where the DHI value is available in a format called Geo2D. Is there any way to extract the data into CSV format, as shown below?
| X  | Y  | GHI |
|----|----|-----|
| X1 | Y1 |     |
| X2 | Y2 |     |
Attaching screenshots of the file opened in Panoply alongside.
Link to the file: https://drive.google.com/file/d/1xQHNgrlrbyNcb6UyV36xh-7zTfg3f8OQ/view
I tried the following code to read the data. I can store it as a 2D NumPy array, but I cannot attach the location (X, Y) to each value.
`
import h5py
import numpy as np
import pandas as pd
import geopandas as gpd
#%%
f = h5py.File('mer.h5', 'r')
for key in f.keys():
    print(key)  # Names of the root level objects in the HDF5 file - can be groups or datasets.
    print(type(f[key]))  # Get the object type: usually group or dataset
ls = list(f.keys())
key = 'X'
masterdf = pd.DataFrame()
data = f.get(key)
dataset1 = np.array(data)
masterdf = dataset1
np.savetxt("FILENAME.csv", dataset1, delimiter=",")
#masterdf.to_csv('new.csv')
`
Found an effective way to read the data, convert it to a dataframe and convert the projection parameters.
Code is tracked here: https://github.com/rishikeshsreehari/boring-stuff-with-python/blob/main/data-from-hdf5-file/final_converter.py
Code is as follows:
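import pandas as pd
import h5py
import time
from pyproj import Transformer
input_epsg = 24378
output_epsg = 4326
start_time = time.time()
with h5py.File("mer.h5", "r") as file:
    # Read the coordinate axes and the first DHI layer; the last two X columns are dropped.
    df_X = pd.DataFrame(file.get("X")[:-2], columns=["X"])
    df_Y = pd.DataFrame(file.get("Y"), columns=["Y"])
    DHI = file.get("DHI")[0][:, :-2].reshape(-1)
# The cross join yields every (Y, X) pair in the same row-major order as the flattened DHI grid.
final = df_Y.merge(df_X, how="cross").assign(DHI=DHI)[["X", "Y", "DHI"]]
# pyproj.transform is deprecated (since pyproj 2.6.1); Transformer is its replacement.
transformer = Transformer.from_crs(input_epsg, output_epsg, always_xy=True)
final["X2"], final["Y2"] = transformer.transform(final["X"].to_numpy(), final["Y"].to_numpy())
#final.to_csv("final_converted1.csv", index=False)
print("--- %s seconds ---" % (time.time() - start_time))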
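The flattening step in this answer only works if the order of the (Y, X) pairs matches the order of the flattened grid. The following is a minimal sketch of that idea on a made-up 2 x 3 grid, assuming the values are laid out with rows following Y and columns following X, as the reshape above implies. The helper name `grid_to_long` and the sample numbers are illustrative and do not come from mer.h5.

```python
import numpy as np
import pandas as pd


def grid_to_long(x, y, values, name="DHI"):
    """Flatten a (len(y), len(x)) grid into one row per cell.

    Hypothetical helper: assumes rows of `values` follow `y` and
    columns follow `x`.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    values = np.asarray(values)
    if values.shape != (y.size, x.size):
        raise ValueError("values must have shape (len(y), len(x))")
    # With the default 'xy' indexing, meshgrid returns arrays of shape
    # (len(y), len(x)), so ravel() walks them in the same row-major
    # order as values.ravel().
    xx, yy = np.meshgrid(x, y)
    return pd.DataFrame({"X": xx.ravel(), "Y": yy.ravel(), name: values.ravel()})


# Made-up 2 x 3 grid, not taken from mer.h5.
x = [10.0, 20.0, 30.0]
y = [1.0, 2.0]
dhi = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]])
long_df = grid_to_long(x, y, dhi)
print(long_df)
```

The resulting frame could then be written out with `long_df.to_csv("out.csv", index=False)`.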
I want to plot a graph using data from multiple CSV files, plotted against the round number (one round per CSV file). Specifically, I want to take the maximum value of a particular column from each CSV and plot those maxima with the serial number of the CSV on the x axis. I can read a CSV file and plot from a single file, but I am unable to do what is described above.
Below is what I did:
import pandas as pd
import matplotlib.pyplot as plt
import glob
import numpy as np
csv = glob.glob("path" + "*csv")
csv.sort()
N = len(csv)
r = 0
for rno in range(1, N + 1):
    r += 1
    for f in csv:
        df = pd.read_csv(f)
        col1 = pd.DataFrame(df, columns=['col. name'])
        a = col1[:].to_numpy()
        Max = col1.max()
        plt.plot(r, Max)
plt.show()
If anyone has an idea it'd be helpful. Thank you.
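One possible approach, sketched under assumptions: every file has the same column (called `value` here purely as a placeholder), and the sorted file names define the round order. The helper `max_per_file` is hypothetical, and the demo writes three throwaway files so that the block runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd


def max_per_file(pattern, column):
    """Return (round_numbers, maxima): one maximum of `column` per CSV file.

    Hypothetical helper; files are ordered by sorted file name.
    """
    paths = sorted(glob.glob(pattern))
    maxima = [pd.read_csv(p, usecols=[column])[column].max() for p in paths]
    rounds = list(range(1, len(paths) + 1))
    return rounds, maxima


# Self-contained demo on three throwaway files with a made-up column name.
demo_dir = tempfile.mkdtemp()
for i, vals in enumerate([[1, 5, 2], [7, 3], [4, 4, 6]], start=1):
    pd.DataFrame({"value": vals}).to_csv(
        os.path.join(demo_dir, f"run_{i}.csv"), index=False
    )

rounds, maxima = max_per_file(os.path.join(demo_dir, "*.csv"), "value")
print(rounds, maxima)
```

The two lists can then be passed to a single `plt.plot(rounds, maxima, marker="o")` call followed by `plt.show()`, so that one line is drawn across all files. Note that plain sorting puts `run_10.csv` before `run_2.csv`, so zero-padded file names or a custom sort key may be needed.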
I am currently working on the NASA aerosol optical depth data (MCD19A2), which is a NASA satellite level-three product. I have uploaded the data. I want to save the data as a dataframe that includes the longitude, the latitude, and the values. I have successfully converted the 0.47 um band file into a three-dimensional array. How can I convert this array into a correct dataframe that includes X, Y and the value?
Below is the code I have tried:
from osgeo import gdal
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
rds = gdal.Open("MCD19A2.A2006001.h26v04.006.2018036214627.hdf")
names=rds.GetSubDatasets()
names[0][0]
# 'HDF4_EOS:EOS_GRID:"MCD19A2.A2006001.h26v04.006.2018036214627.hdf":grid1km:Optical_Depth_047'
aod_047 = gdal.Open(names[0][0])
a47 = aod_047.ReadAsArray()
a47[1].shape
# (1200, 1200)
I would like the result to look like this:
| X (n=1200) | Y (n=1200) | AOD_047 |
|------------|------------|---------|
| 8896067    | 5559289    | 0.0123  |
I know that in R this can be done by
require('gdalUtils')
require('raster')
require('rgdal')
file.name<-"MCD19A2.A2006001.h26v04.006.2018036214627.hdf"
sds <- get_subdatasets(file.name)
gdal_translate(sds[1], dst_dataset = paste0('tmp047', basename(file.name), '.tiff'), b = nband)
r.047 <- raster(paste0('tmp047', basename(file.name), '.tiff'))
df.047 <- raster::as.data.frame(r.047, xy = T)
names(df.047)[3] <- 'AOD_047'
However, R relies heavily on memory, and writing the 'tif' and reading it back uses a lot of it. So I want to do this task in Python. Thanks a lot for your help.
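You can try pandas, although read_hdf only reads HDF5 files (it relies on PyTables), so it is unlikely to open an HDF4 product such as MCD19A2:
import pandas as pd
df = pd.read_hdf('filename.hdf')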
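For the array that the question already reads with GDAL, the X/Y/value table can be built with NumPy alone. This is a sketch under assumptions: the coordinates are derived from a 6-element GDAL geotransform (what `aod_047.GetGeoTransform()` returns), the helper name `raster_to_frame` is hypothetical, and the demo band and geotransform are made up.

```python
import numpy as np
import pandas as pd


def raster_to_frame(band, geotransform, name="AOD_047"):
    """Turn one 2D band into a long table of pixel-centre X, Y and value.

    Hypothetical helper. `geotransform` follows GDAL's 6-tuple layout:
    (x_origin, pixel_width, row_rotation, y_origin, col_rotation, pixel_height).
    """
    band = np.asarray(band)
    n_rows, n_cols = band.shape
    x0, dx, rx, y0, ry, dy = geotransform
    # +0.5 moves from the pixel's top-left corner to its centre.
    cols, rows = np.meshgrid(np.arange(n_cols) + 0.5, np.arange(n_rows) + 0.5)
    x = x0 + cols * dx + rows * rx
    y = y0 + cols * ry + rows * dy
    return pd.DataFrame({"X": x.ravel(), "Y": y.ravel(), name: band.ravel()})


# Made-up 2 x 2 band and geotransform, only to show the shapes involved.
demo_band = np.array([[0.01, 0.02],
                      [0.03, 0.04]])
demo_gt = (1000.0, 10.0, 0.0, 2000.0, 0.0, -10.0)
df_demo = raster_to_frame(demo_band, demo_gt)
print(df_demo)
```

With the question's variables this would presumably be called as `raster_to_frame(a47[0], aod_047.GetGeoTransform())` for the first layer of the 3D array. Any fill value or scale factor stored in the band metadata is not handled here.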
My code for pulling in a GRIB file of wind speeds in New England:
import pandas as pd
import numpy as np
import requests
import cfgrib
import xarray as xr
import matplotlib.pyplot as plt
resp = requests.get('https://tgftp.nws.noaa.gov/SL.us008001/ST.opnl/DF.gr2/DC.ndfd/AR.neast/VP.001-003/ds.wspd.bin', stream=True)
f = open('..\\001_003wspd.grib2', 'wb')
f.write(resp.content)
f.close()
xr_set = xr.load_dataset('..\\001_003wspd.grib2', engine='cfgrib')
xr_set.si10[0].plot(cmap=plt.cm.coolwarm)
This gives:
As you can see, it is mirrored every other line east to west. Maine is the most obvious.
I believe this is not a code problem but rather a file that wasn't written correctly. If you keep only every second row, you get a correct map:
import numpy as np
import requests
import xarray as xr
from fs.tempfs import TempFS
resp = requests.get('https://tgftp.nws.noaa.gov/SL.us008001/ST.opnl/DF.gr2/DC.ndfd/AR.neast/VP.001-003/ds.wspd.bin', stream=True)
with TempFS() as tempfs:
    path = tempfs.getsyspath("001_003wspd.grib2")
    f = open(path, 'wb')
    f.write(resp.content)
    f.close()
    ds = xr.load_dataset(path, engine='cfgrib')
ds = ds.isel(y=np.arange(len(ds.y))[1::2])
ds.si10.isel(step=15).plot(cmap="coolwarm", x='longitude', y='latitude')
The problem is related to how the rows in the GRIB file are ordered. Running the following command shows the order:
$ wgrib2 -grid in.grib2
You will likely see something like "(2345 x 1597) input WE|EW:SN output WE:SN".
The WE|EW:SN ordering needs to be changed to WE:SN. To change the ordering, run the following command:
wgrib2 in.grib2 -ijsmall_grib 1:2345 1:1597 out.grib2
Now out.grib2 has the correct WE:SN ordering and works with xarray.
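To picture what the WE|EW:SN ordering means, here is a toy NumPy sketch. It assumes the notation describes rows that are stored alternately west-to-east and east-to-west, which would match the mirrored rows seen in the plot. The helper `unflip_alternate_rows` is hypothetical and is not a description of what wgrib2 does internally.

```python
import numpy as np


def unflip_alternate_rows(grid):
    """Return a copy of `grid` with every second row (1, 3, 5, ...) reversed."""
    src = np.asarray(grid)
    fixed = src.copy()
    # Rows 0, 2, 4, ... are left alone; rows 1, 3, 5, ... are mirrored back.
    fixed[1::2] = src[1::2, ::-1]
    return fixed


# Toy grid: build the "true" layout, then store it with alternate rows mirrored.
true_grid = np.arange(12).reshape(4, 3)
stored = true_grid.copy()
stored[1::2] = true_grid[1::2, ::-1]

restored = unflip_alternate_rows(stored)
print(np.array_equal(restored, true_grid))  # True
```

For real files, rewriting the data with wgrib2 as described above avoids having to patch the array after loading.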
I am trying to load a .mat dataset into a dataframe. So far I can only load a single file at a time from the folder TrainingSet1 with
os.chdir('/Users/Ashi/Downloads/TrainingSet2')
data = loadmat('A2001.mat')
I can see the data in it, but how do I load the whole TrainingSet1 folder so that I can view everything?
Also, how could I view the .mat files as images?
Here's my code:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
from fastai.vision import *
from fastai.metrics import error_rate
from mat4py import loadmat
from pylab import*
import matplotlib
import os
os.chdir('/Users/Ashi/Downloads/TrainingSet2')
data = loadmat('A2001.mat')
data
{'ECG': {'sex': 'Male', 'age': 68,
'data': [[0.009784321006571624,
0.006006033870606647,
... (this is roughly what the data looks like)
imshow('A2001.mat',[])
---------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-52-23bbdf3a7668> in <module>
----> 1 imshow('A2001.mat',[])
... (a long error is displayed)
TypeError: unhashable type: 'list'
Thanks for any help
It is hard to tell from your post what the input format is and what your desired output format is.
Here is an example of reading all the .mat files in the folder, and an example of how to show data['data'] as an image.
I hope the example is enough for you to keep going on your own.
I created a sample data set 'A2001.mat', 'A2002.mat', 'A2003.mat' using MATLAB.
If you have MATLAB installed, I recommend executing the following code to create a sample input (so that the Python sample is reproducible):
ECG.sex = 'Male';
ECG.age = 68;
data = im2double(imread('cameraman.tif')) / 10; % Divide by 10 for simulating range [0, 0.1] instead of [0, 1]
save('A2001.mat', 'ECG', 'data');
ECG.sex = 'Male';
ECG.age = 46;
data = im2double(imread('cell.tif'));
save('A2002.mat', 'ECG', 'data');
ECG.sex = 'Female';
ECG.age = 54;
data = im2double(imread('tire.tif'));
save('A2003.mat', 'ECG', 'data');
The Python code sample does the following:
Get a list of all mat files in the folder using glob.glob('*.mat').
Iterate over the mat files, load the data from each file, and append it to a list.
The result of the loop is a list named alldata, containing the data from all mat files.
Iterate over alldata and show data['data'] as an image
(assuming data['data'] is the matrix you want to show as an image).
Here is the code:
from matplotlib import pyplot as plt
from mat4py import loadmat
import glob
import os
os.chdir('/Users/Ashi/Downloads/TrainingSet2')
# Get a list of .mat files in the current folder
mat_files = glob.glob('*.mat')
# List for storing all the data
alldata = []
# Iterate mat files
for fname in mat_files:
    # Load mat file data into data.
    data = loadmat(fname)
    # Append data to the list
    alldata.append(data)
# Iterate alldata elements, and show images
for data in alldata:
    # Assume image is stored in matrix named data in MATLAB.
    # data['data'], access data with string 'data', because data is a dictionary
    img = data['data']
    # Show data as image using matplotlib
    plt.imshow(img, cmap='gray')
    plt.show(block=True)  # Show image with "blocking"
Update:
The ECG data is not an image but a list of 12 data samples.
The internal structure of the data (after data = loadmat(fname)) is:
Parent dictionary named data.
data contains a dictionary in data['ECG'].
data['ECG']['data'] is a list of 12 lists.
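Since the original question asks about a dataframe, the structure just described can also be turned into a pandas DataFrame. This is a minimal sketch, assuming every lead has the same number of samples; the helper `ecg_to_frame`, the column names `lead_0`, `lead_1`, ... and the synthetic record are all invented for illustration.

```python
import pandas as pd


def ecg_to_frame(record):
    """Build a DataFrame with one column per lead from a loadmat-style dict.

    Hypothetical helper: assumes record['ECG']['data'] is a list of
    equally long lists, one per lead.
    """
    leads = record["ECG"]["data"]
    columns = {f"lead_{i}": samples for i, samples in enumerate(leads)}
    frame = pd.DataFrame(columns)
    # Carry the scalar metadata along as constant columns.
    frame["sex"] = record["ECG"]["sex"]
    frame["age"] = record["ECG"]["age"]
    return frame


# Synthetic record with 3 short leads instead of 12, mimicking the structure.
demo_record = {
    "ECG": {
        "sex": "Male",
        "age": 68,
        "data": [[0.1, 0.2, 0.3, 0.4], [1.0, 1.1, 1.2, 1.3], [2.0, 2.1, 2.2, 2.3]],
    }
}
ecg_frame = ecg_to_frame(demo_record)
print(ecg_frame.head())
```

Frames built from several files could then be combined with `pd.concat`.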
The following code iterates the mat files and displays the ECG data as a graph:
from matplotlib import pyplot as plt
from mat4py import loadmat
import glob
import os
import numpy as np
os.chdir('/Users/Ashi/Downloads/TrainingSet2')
# Get a list of .mat files in the current folder
mat_files = glob.glob('*.mat')
# List for storing all the data
alldata = []
# Iterate mat files
for fname in mat_files:
    # Load mat file data into data.
    data = loadmat(fname)
    # Append data to the list
    alldata.append(data)
# Iterate alldata elements, and show graphs
for data in alldata:
    # The internal structure of the data is a dictionary with a dictionary.
    ecg = data['ECG']
    data = ecg['data']  # Data is a list of lists
    # Convert data to NumPy array
    ecg_data = np.array(data)
    # Show data as graph using matplotlib
    #plt.imshow(img, cmap='gray')
    plt.plot(ecg_data.T)  # Plot the data as graph.
    plt.show(block=True)  # Show graph with "blocking"
Result:
A0001.mat:
A0002.mat:
Graph with labels:
# Iterate alldata elements, and show images
for data in alldata:
    # The internal structure of the data is a dictionary with a dictionary.
    ecg = data['ECG']
    data = ecg['data']  # Data is a list of lists
    # Convert data to NumPy array
    #ecg_data = np.array(data)
    # Show data as graph using matplotlib
    # Iterate data list:
    for i in range(len(data)):
        # Plot the data as graph.
        # Set labels d0, d1, d2...
        plt.plot(data[i], label='d'+str(i))
    plt.legend()  # Add legend
    plt.show(block=True)  # Show image with "blocking"
Result:
How do I import tif using gdal?
I'm trying to get my tif file in a usable format in Python, so I can analyze the data. However, every time I import it, I just get an empty list. Here's my code:
import sys
import numpy as np
from osgeo import gdal, gdalconst
from osgeo.gdalconst import GA_ReadOnly
xValues = [447520.0, 432524.0, 451503.0]
yValues = [4631976.0, 4608827.0, 4648114.0]
gdal.AllRegister()
dataset = gdal.Open('final_snow.tif', GA_ReadOnly)
if dataset is None:
    print('Could not open image')
    sys.exit(1)
data = np.array([gdal.Open(name, gdalconst.GA_ReadOnly).ReadAsArray() for name, descr in dataset.GetSubDatasets()])
print('this is data ', data)
It always prints an empty list, but it doesn't throw an error. I checked out other questions, such as "Create shapefile from tif file using GDAL". What might be the problem?
For osgeo.gdal, it should look like this:
from osgeo import gdal
gdal.UseExceptions() # not required, but a good idea
dataset = gdal.Open('final_snow.tif', gdal.GA_ReadOnly)
data = dataset.ReadAsArray()
Where data is either a 2D array for 1-banded rasters, or a 3D array for multiband.
An alternative with rasterio looks like:
import rasterio
with rasterio.open('final_snow.tif', 'r') as r:
    data = r.read()
Where data is always a 3D array, with the first dimension as band index.
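Because the two libraries differ in the shape they return for single-band rasters, downstream code may want one consistent layout. A small sketch, assuming a (band, row, col) layout is wanted in both cases; the helper name `as_bands_first` is hypothetical and the arrays are placeholders rather than real raster data.

```python
import numpy as np


def as_bands_first(data):
    """Return `data` as a 3D (band, row, col) array.

    Hypothetical helper: a 2D single-band array gains a leading band axis;
    a 3D array is returned unchanged.
    """
    data = np.asarray(data)
    if data.ndim == 2:
        return data[np.newaxis, ...]
    if data.ndim == 3:
        return data
    raise ValueError(f"expected a 2D or 3D array, got {data.ndim}D")


single = as_bands_first(np.zeros((4, 5)))     # like GDAL on a 1-band raster
multi = as_bands_first(np.zeros((3, 4, 5)))   # like GDAL or rasterio on 3 bands
print(single.shape, multi.shape)  # (1, 4, 5) (3, 4, 5)
```

With this, `as_bands_first(dataset.ReadAsArray())[0]` would give the first band regardless of how many bands the file has.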