How does numpy.memmap work on an HDF5 file with multiple datasets?

I'm trying to memory-map individual datasets in an HDF5 file:
import h5py
import numpy as np
import numpy.random as rdm
n = int(1E+8)
rdm.seed(70)
dset01 = rdm.rand(n)
dset02 = rdm.normal(0, 1, size=n).astype(np.float32)
with h5py.File('foo.h5', mode='w') as f0:
    f0.create_dataset('dset01', data=dset01)
    f0.create_dataset('dset02', data=dset02)
fp = np.memmap('foo.h5', mode='r', dtype='double')
print(dset01[:3])
print(fp[:3])
del fp
However, the outputs below indicate that values in fp don't match those in dset01.
[0.92748054 0.87242629 0.58463127]
[5.29239776e-260 1.11688278e-308 5.18067355e-318]
I am guessing that I should have set an 'offset' value when I called np.memmap. Is that the mistake in my code? If so, how do I find the correct offset of each dataset in an HDF5 file?
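One way to find the offsets, sketched below under the assumption that the datasets are stored contiguously (no chunking or compression): h5py's low-level API exposes each dataset's byte offset within the file via dataset.id.get_offset(), and the dtype and shape are available on the dataset object, so a separate memmap can be created per dataset. If a dataset is chunked or compressed, get_offset() returns None and a single flat memmap cannot describe it. Note also that dset02 is float32, so mapping the whole file with dtype='double' would be wrong for it even with the right offset.
import h5py
import numpy as np

# Minimal sketch: look up each dataset's byte offset, dtype and shape,
# then memory-map just that region of the HDF5 file.
with h5py.File('foo.h5', mode='r') as f0:
    layout = {name: (f0[name].id.get_offset(), f0[name].dtype, f0[name].shape)
              for name in f0}

for name, (offset, dtype, shape) in layout.items():
    fp = np.memmap('foo.h5', mode='r', dtype=dtype, offset=offset, shape=shape)
    print(name, fp[:3])
    del fp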

Related

TruncatedSVD n_oversamples seems to have no bearing

I'm looking for a way to improve the quality of the eigenvectors produced by sklearn's TruncatedSVD. The documentation at scikit-learn.org suggests that the n_oversamples parameter is a good place to start. I have a sparse 2200 x 2200 matrix as input (provided as three separate files containing the row indexes, column indexes, and data values). Here's my code:
from array import array
import sys
import numpy as np
import struct
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
path="c:\\users\\lenwh\\documents\\wikipedia\\weights\\"
file=sys.argv[1]
dims=int(sys.argv[2]) #I use 300
with open(path + file + ".rows", "rb") as f:
    rows = np.fromfile(f, dtype=np.int32)
with open(path + file + ".cols", "rb") as f:
    cols = np.fromfile(f, dtype=np.int32)
with open(path + file + ".data", "rb") as f:
    data = np.fromfile(f, dtype=np.float32)
rowCount=len(np.unique(rows))
csr=csr_matrix((data, (rows, cols)), shape=(rowCount, rowCount))
vectorsfile=path+"eigens.vec"
transfile=path+ file + ".eig"
oversamples=10
pca=TruncatedSVD(n_components=dims, n_oversamples=oversamples)
pca.fit(csr)
np.savetxt(transfile,pca.transform(csr),fmt='%16f')
The problem is that whether I set oversamples to 10, 100, or 1000, the results are not discernibly different: the explained variance is the same for all of them, as is the performance of the results in my application. At a minimum, I expected the explained variance to change. I would appreciate any explanation of where my expectations are misguided, and whether there are any other settings -- or alternatives to TruncatedSVD -- that I could look to other than the n_components setting.
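For reference, n_oversamples (and n_iter) only affect the randomized solver, so one hedged way to check whether the randomized approximation is the limiting factor is to compare it against the exact ARPACK solver; the sketch below reuses csr, dims, and oversamples from the snippet above. If both solvers report essentially the same explained variance, the approximation is not the bottleneck and tuning n_oversamples further will not change the results.
from sklearn.decomposition import TruncatedSVD

# Sketch: compare the randomized solver (controlled by n_oversamples/n_iter)
# with the exact ARPACK solver on the same matrix.
svd_rand = TruncatedSVD(n_components=dims, algorithm='randomized',
                        n_oversamples=oversamples, n_iter=10, random_state=0)
svd_arpack = TruncatedSVD(n_components=dims, algorithm='arpack', random_state=0)
print('randomized:', svd_rand.fit(csr).explained_variance_ratio_.sum())
print('arpack:    ', svd_arpack.fit(csr).explained_variance_ratio_.sum())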

NDFD GRIB2 how to fix mirrored data when using xarray

My code for pulling in a GRIB file of wind speeds in New England:
import pandas as pd
import numpy as np
import requests
import cfgrib
import xarray as xr
import matplotlib.pyplot as plt
resp = requests.get('https://tgftp.nws.noaa.gov/SL.us008001/ST.opnl/DF.gr2/DC.ndfd/AR.neast/VP.001-003/ds.wspd.bin', stream=True)
f = open('..\\001_003wspd.grib2', 'wb')
f.write(resp.content)
f.close()
xr_set = xr.load_dataset('..\\001_003wspd.grib2', engine='cfgrib')
xr_set.si10[0].plot(cmap=plt.cm.coolwarm)
This gives:
As you can see, it is mirrored every other line east to west. Maine is the most obvious.
I believe this is not a code problem, but rather that the file wasn't written correctly. If you keep only every other line, you get a correct map:
import numpy as np
import requests
import xarray as xr
from fs.tempfs import TempFS
resp = requests.get('https://tgftp.nws.noaa.gov/SL.us008001/ST.opnl/DF.gr2/DC.ndfd/AR.neast/VP.001-003/ds.wspd.bin', stream=True)
with TempFS() as tempfs:
    path = tempfs.getsyspath("001_003wspd.grib2")
    f = open(path, 'wb')
    f.write(resp.content)
    f.close()
    ds = xr.load_dataset(path, engine='cfgrib')
    ds = ds.isel(y=np.arange(len(ds.y))[1::2])
    ds.si10.isel(step=15).plot(cmap="coolwarm", x='longitude', y='latitude')
The problem is related to how the GRIB files are ordered. If you run the following command, you can see the ordering:
$ wgrib2 -grid in.grib2
You will likely see something like "(2345 x 1597) input WE|EW:SN output WE:SN".
The WE|EW:SN ordering needs to be changed to WE:SN. To change it, run the following command:
$ wgrib2 in.grib2 -ijsmall_grib 1:2345 1:1597 out.grib2
Now out.grib2 has the correct WE:SN ordering and works with xarray.
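The same reordering can be scripted from Python. A minimal sketch, assuming wgrib2 is installed and on the PATH and using the 2345 x 1597 grid size reported above; adjust the dimensions to whatever wgrib2 -grid prints for your file.
import subprocess
import xarray as xr

# Rewrite the file with plain WE:SN ordering, then open it with cfgrib.
subprocess.run(['wgrib2', 'in.grib2', '-ijsmall_grib',
                '1:2345', '1:1597', 'out.grib2'], check=True)
ds = xr.load_dataset('out.grib2', engine='cfgrib')
ds.si10.isel(step=15).plot(cmap='coolwarm', x='longitude', y='latitude')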

Create a raster stack in from xarray dataset in Python

I am trying to make a raster stack from an xarray dataset which I obtained from multiple netCDF files. There are 365 netCDF files, each containing 2D Sea Surface Temperature (SST) data with height 3600 and width 7200. To perform further operations I need to prepare a raster stack.
import netCDF4 as nc
import rasterio as rio
import numpy as np
import xarray as xr
import os
fpath = '/home/sst/2015'
pattern = '*.nc'
filelist = []
for root, dirs, files in os.walk(fpath):
    for name in files:
        filelist.append(os.path.join(fpath, name))
ds = xr.open_mfdataset(filelist, concat_dim='time', parallel=True)  # netCDF data
ds_data = ds.sel(time='2015')['SST']  # xarray DataArray with dimensions 365 x 3600 x 7200
The raster stack of this xarray will be used to extract data values at point locations. I am currently using numpy and rasterio, as described in the rasterio documentation. By iterating over the 3D xarray, the following code writes 365 files to disk, which I can later read back and stack.
from rasterio.transform import from_origin
transform = from_origin(-180,90, 0.05, 0.05)
fpath = '/home/sst/sst_tif'
fname = 'sst_array_'
extname = '.tiff'
timedf = ds.time # time dimension to loop over
for i in range(len(timedf)):
    np_array = np.array(ds_data[i])
    iname = str(i)
    fwname = os.path.join(fpath, fname + iname + extname)
    sst_tif = rio.open(fwname,
                       'w',
                       driver='GTiff',
                       height=3600,
                       width=7200,
                       dtype=np_array.dtype,
                       count=1,
                       crs='EPSG:4326',
                       transform=transform)
    sst_tif.write(np_array, 1)
    sst_tif.close()
However, this takes a very long time to process the entire dataset. I also attempted converting the entire xarray to a numpy array and writing all 365 layers to a single file, but that freezes the Python kernel.
Is there any way I can create this raster stack in memory and do further processing without having to write files to disk? I am trying to obtain functionality similar to the stack function in R's raster package.
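Since the end goal is extracting values at point locations, one option is to skip the GeoTIFFs entirely and index the xarray object directly. A minimal sketch, assuming the dataset carries coordinate variables named 'lat' and 'lon' (the actual names depend on the netCDF files) and using made-up example points:
import xarray as xr

# Vectorised nearest-neighbour lookup for every time step at once.
# The result has dimensions (time, points) and stays lazy if the data
# were opened with dask via open_mfdataset.
points_lat = xr.DataArray([42.5, 10.0, -33.9], dims='points')
points_lon = xr.DataArray([-70.1, 105.2, 18.4], dims='points')
sst_at_points = ds_data.sel(lat=points_lat, lon=points_lon, method='nearest')
print(sst_at_points.values)  # triggers the actual computation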

Fast way to create an animation from binary data

I have a C++ program which writes pixel data for a 2D grid to a binary file. Each binary file may have numerous grid states back-to-back, and there may be multiple of these binary files.
e.g. I could have 10 binary files bin0, bin1, bin2... bin9 each holding data for 10 grid states for a total of 100 grid states to animate.
I'm looking for a fast way to create an animation from the grid states in these binary files.
My best attempt used Python and PIL.Image to create a GIF:
import glob
import numpy as np
from PIL import Image
from functools import partial
def create_images():
    paths = glob.glob('./outfiles/dump*')
    imgs = []
    for path in sorted(paths, key=lambda x: int(x.split("p")[1])):
        with open(path, 'rb') as ifile:
            for dat in iter(partial(ifile.read, rows*cols), b''):
                mem = memoryview(dat).cast('B', shape=[rows, cols])
                arr = np.asarray(mem)
                img = Image.fromarray(arr*255)
                imgs.append(img)
    return imgs
rows = 128
cols = 128
imgs = create_images()
imgs[0].save('./animation.gif', save_all=True, append_images=imgs[1:], loop=0)
but the final line where I actually write the gif can take a long time if I have potentially thousands of images, each with thousands of pixels. The rendering of the gif is also poor quality when the images are large.
I'm looking for suggestions on how to make this run fast, possibly using a different library than Pillow. I'm not bothered about sticking to Python if there is a better alternative in C/C++ (or other languages, but Python or C/C++ preferred), nor does the animation have to be a GIF.
In my case I'm working with grid data which is either 0 or 1 (the context is Conway's Game of Life), so optimizations which take advantage of this would be welcome. Note, however, that each 1 or 0 currently occupies a whole byte in the binary file; the values are not packed into bits.
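One possible direction, sketched below: stream the frames into an MP4 with imageio instead of building a list of PIL images and one big GIF. This assumes imageio plus the imageio-ffmpeg backend are installed and reuses the file layout from the snippet above; each 0/1 byte is scaled to 0/255 so it renders as black and white.
import glob
from functools import partial

import imageio.v2 as imageio
import numpy as np

rows, cols = 128, 128

# Append frames to the video writer as they are read, so they never all
# sit in memory at once.
with imageio.get_writer('animation.mp4', fps=30) as writer:
    paths = sorted(glob.glob('./outfiles/dump*'), key=lambda x: int(x.split('p')[1]))
    for path in paths:
        with open(path, 'rb') as ifile:
            for dat in iter(partial(ifile.read, rows * cols), b''):
                frame = np.frombuffer(dat, dtype=np.uint8).reshape(rows, cols)
                writer.append_data(frame * 255)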
EDIT
Just adding a helper Python script that generates binary files as described above, for anyone who wants to give it a go.
import numpy as np
rows = 128
cols = 128
img_per_file = 10
num_files = 3
filesize = rows*cols*img_per_file
for i in range(num_files):
    filename = 'bin' + str(i)
    data = np.random.randint(2, size=filesize, dtype=np.uint8).tobytes()
    f = open(filename, 'wb')
    f.write(data)
    f.close()

How to import .dat file in Google Co-lab

I am implementing the famous Iris classification problem in Python for the first time. I have a data file named iris.data that I have to import into my Python project. I am trying it in Google Colab.
Sample data
The attributes are:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
I wrote:
import torch
import numpy as np
import matplotlib.pyplot as plt
FILE_PATH = "E:\iris dataset"
MAIN_FILE_NAME = "iris.dat"
data = np.loadtxt(FILE_PATH+MAIN_FILE_NAME, delimiter=",")
But it did not work and threw errors.
It worked when I ran the code on Linux, but I am currently using Windows 10 and there it did not work.
Thank you in advance for your help.
When constructing the file name for np.loadtxt, a \ is missing: FILE_PATH + MAIN_FILE_NAME evaluates to 'E:\iris datasetiris.dat'. To avoid having to add the \ manually between FILE_PATH and MAIN_FILE_NAME, you can use os.path.join, which does this for you.
import os
import numpy as np
FILE_PATH = 'E:\iris dataset'
MAIN_FILE_NAME = 'iris.dat'
data = np.loadtxt(os.path.join(FILE_PATH, MAIN_FILE_NAME), delimiter=',') # not actually working due to last column of file
On the other hand, I am not sure why it worked on Linux, because numpy cannot convert the string "Iris-setosa" into a number, which np.loadtxt tries to do. If you are only interested in the numeric values, you can use the usecols keyword of np.loadtxt:
data = np.loadtxt(os.path.join(FILE_PATH, MAIN_FILE_NAME), delimiter=',', usecols=(0, 1, 2, 3))
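If the class labels are needed as well, a hedged alternative is np.genfromtxt, which copes with the mixed numeric/string columns that np.loadtxt rejects; the sketch below reuses FILE_PATH and MAIN_FILE_NAME from above.
import os
import numpy as np

iris_path = os.path.join(FILE_PATH, MAIN_FILE_NAME)
# Read the four numeric columns and the string class column in two passes.
features = np.genfromtxt(iris_path, delimiter=',', usecols=(0, 1, 2, 3))
labels = np.genfromtxt(iris_path, delimiter=',', usecols=4, dtype=str)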
