I have written a program that reads through a folder of Excel files and loads each file into the program. It takes the data and creates an array of zeros of size (3001, 2001), which is iterated through so that the entries at the coordinate values taken from the Excel file are changed to ones. The array is then reshaped to size (1, 6005001). I am using TensorFlow to reshape the array, since the program otherwise treats it as a tuple, but the final values are stored in a NumPy array. Finally, I store the formatted array in a CSV file named "filename_Array.csv", and the program moves on to the next Excel file. I am running Python in Eclipse with TensorFlow installed.
The issue I am running into is that something is being cached in memory, but I cannot figure out what it is. I have tried explicitly deleting large variables that will be reinitialized and calling gc.collect() to clean up the inactive memory, but I am still seeing a steady increase in memory usage until around 25 files have been formatted; then the computer begins freezing up as all of the RAM on my PC (12 GB) is used. I know that Python automatically frees memory for objects that are completely unreachable by the program, so I am not sure whether this is an issue with RAM fragmentation or something else.
Sorry for the wall of text; I am just trying to give as much information about the problem as possible.
Here is a link to a screenshot of my Performance tab while running the program through about 24 files, before I had to terminate it because the computer was freezing.
Here is my code:
from __future__ import print_function
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import tensorflow as tf
import numpy as np
import csv
import gc
path = r'C:\Users\jeremy.desforges\Desktop\Eclipse\NN_MNIST\VAM SLIJ-II 4.500'
def create_array(g,h,trainingdata,filename):
    # Multiplying by factors of 10 to keep precision of data
    g = g*1000
    h = h*1
    max_g = 3000
    max_h = 2000
    # Initializes an array with zeros to represent a blank graph
    image = np.zeros((max_g+1,max_h+1),dtype=np.int)
    shape = ((max_g+1)*(max_h+1))
    # Fills the blank graph with the input data points
    for i in range(len(h)):
        image[g[i].astype('int'),h[i].astype('int')] = 1
    trainingdata.close()
    image = tf.reshape(image,[-1,shape])
    # Converts tensor objects to numpy arrays to feed into network
    sess = tf.InteractiveSession()
    image = sess.run(image)
    np.savetxt((filename + "_Array.csv"), np.flip(image,1).astype(int), fmt = '%i' ,delimiter=",")
    print(filename, "appended")
    print("size",image.shape)
    print(image,"= output array")
    del image,shape,g,h,filename,sess
    return

# Initializing variables
image = []
shape = 1
g = 1.0
h = 1.0
f = 1
specials = '.csv'
folder = os.listdir(path)

for filename in folder:
    trainingdata = open(filename, "r+")
    filename = str(filename.replace(specials, ''))
    data_read = csv.reader(trainingdata)
    for row in data_read:
        in1 = float(row[0])
        in2 = float(row[1])
        if (f==0):
            z_ = np.array([in1])
            g = np.hstack((g,z_))
            q = np.array([in2])
            h = np.hstack((h,q))
        if (f == 1):
            g = np.array([in1])
            h = np.array([in2])
            f = 0
    create_array(g,h,trainingdata,filename)
    gc.collect()
    image = []
    shape = 1
    g = 1.0
    h = 1.0
    f = 1
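For comparison, here is a minimal sketch of the per-file formatting step done entirely in NumPy, without building any TensorFlow graph ops or sessions inside the loop (one guess, not confirmed above, is that calling tf.reshape and tf.InteractiveSession() once per file keeps adding graph nodes and sessions that are never released). The function name and arguments below are hypothetical, not taken from the code above:

import numpy as np

def format_file(g, h, filename, max_g=3000, max_h=2000):
    # Blank graph: one row per scaled g value, one column per h value
    image = np.zeros((max_g + 1, max_h + 1), dtype=np.int8)
    # Mark each (g, h) coordinate pair with a one
    image[g.astype(int), h.astype(int)] = 1
    # Flatten to shape (1, 6005001) with NumPy instead of tf.reshape
    flat = image.reshape(1, -1)
    np.savetxt(filename + "_Array.csv", np.flip(flat, 1), fmt='%i', delimiter=",")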
After converting the original monthly MOD13C2 product (years 2000-2020) to raster format, I now have to calculate the mean value of the 251 rasters.
First I tried a tutorial that simply used the map algebra function, which behaved badly when it came to handling NA values. So I tried a second one that converts each raster to an array in order to skip NA pixels, and adapted it into my code:
import arcpy, sys, os, glob
from arcpy.sa import *
import numpy
arcpy.CheckOutExtension('Spatial')
# input path
inws = "G:/data0610/MODIS_VI/EVI/EVI_pro/"
# output path
outws = "G:/data0610/MODIS_VI/EVI/EVI_pro/"
rasters = glob.glob(os.path.join(inws, "*.tif"))
r = Raster(rasters[0])
array = arcpy.RasterToNumPyArray(r) # convert to numpy
rowNum, colNum = array.shape
sum = numpy.zeros(shape=array.shape) # save the accumulating value
count = numpy.zeros(shape=array.shape) # save the counting number
Average = numpy.zeros(shape=array.shape) # save the mean value
for ras in rasters:
    rmm = Raster(ras)
    array = arcpy.RasterToNumPyArray(rmm)
    # pixel by pixel
    for i in range(0, rowNum):
        for j in range(0, colNum):
            if array[i][j] >= 0:  # skip invalid values
                sum[i][j] += array[i][j]  # accumulate
                count[i][j] += 1  # counter
            continue

Average = sum / count  # calculate the mean value
# save the raster
lowerLeft = arcpy.Point(r.extent.XMin, r.extent.YMin)
cellWidth = r.meanCellWidth
cellHeight = r.meanCellHeight
nameT = "evi.tif"
outname = os.path.join(outws, nameT)
arcpy.env.overwriteOutput = True
#convert to WGS84
inf = "G:/data0610/MODIS_VI/sm_mean.tif"
arcpy.env.outputCoordinateSystem = Raster(inf) # convert the crs to wgs84
print("successfully converted the CRS!")
AvgRas = arcpy.NumPyArrayToRaster(Average, lowerLeft, cellWidth, cellHeight, r.noDataValue) # turn into raster
AvgRas.save(outname)
print("successfully output the evi_mean.tif!")
# resample
outname_res = outws + "evi_mean_res.tif"
# get the standard cellsize
cellsize025 = "{0} {1}".format(arcpy.Describe(inf).meanCellWidth, arcpy.Describe(inf).meanCellHeight)
arcpy.Resample_management(AvgRas, outname_res, cellsize025, "NEAREST")
print("successfully output the evi_mean.tif with the 0.25 degree resolution!")
Unfortunately, arcpy (Python 2.7, 32-bit) hit a memory error because there were too many large rasters (I'm dealing with the global extent). I found the reason and a suggested solution in another question.
So I installed the 64-bit Background Geoprocessing for ArcGIS and ran the above code again, but it ran into another problem.
It turned out that solution might be useless, because ArcGIS is very sensitive to the administrator license: you can't run 64-bit Python while ArcGIS itself is 32-bit.
Now, back to my initial question: how do I calculate the mean value of multiple rasters in Python/arcpy? Did I make things too complicated? Is there a simpler way of generating the mean value raster?
It's really driving me crazy.
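For what it's worth, the per-pixel double loop above can be replaced with vectorized NumPy operations; the following is only a sketch of that idea (it assumes the same >= 0 validity test and the same input folder used above, and uses -9999 as a placeholder NoData value):

import arcpy, glob, os
from arcpy.sa import Raster
import numpy

rasters = glob.glob(os.path.join("G:/data0610/MODIS_VI/EVI/EVI_pro/", "*.tif"))
total = None
count = None
for ras in rasters:
    arr = arcpy.RasterToNumPyArray(Raster(ras)).astype(numpy.float64)
    valid = arr >= 0                        # boolean mask of valid pixels
    if total is None:
        total = numpy.zeros(arr.shape)      # running sum of valid values
        count = numpy.zeros(arr.shape)      # running count of valid values
    total += numpy.where(valid, arr, 0)     # add only valid pixels
    count += valid                          # True counts as 1
# Avoid dividing by zero where a pixel was never valid
Average = numpy.where(count > 0, total / numpy.maximum(count, 1), -9999)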
I have a global dataset in GeoTIFF format at about 300 m resolution. I want to upscale it to 9 km resolution (my code is below). Because of the high resolution and the large computing time, I decided to do the upscaling piecewise, so I divided the global data into 10 pieces, upscale each one, and store each piece in a separate .tif file. Now my problem: the last piece of the global data is not saved completely to disk. Each piece should be 2 MB, but piece #10 is only 1.7 MB. The strange thing is that after running my script a second time, the earlier piece #10 becomes complete (growing from 1.7 MB to 2 MB), but the piece #10 from the current run is again incomplete.
import numpy as np
from osgeo import gdal
from osgeo import osr
from osgeo.gdalconst import *
import pandas as pd
#
#%%
#-----converting--------#
df_new = pd.read_excel("input_attribute_table.xlsx",sheet_name='Global_data')
listvar = ['var1']
number = df_new['data_number'][:]
##The size of global array is 129599 x 51704. The pieces should be square
xoff = np.array([0, 25852.00, 51704.00, 77556.00, 103408.00])
yoff = np.array([0, 25852.00])
xcount = 25852
ycount = 25852
o = 1
for q in range(len(yoff)):
    for p in range(len(xoff)):
        src = gdal.Open('Global_database.tif')
        ds_xform = src.GetGeoTransform()
        ds_driver = gdal.GetDriverByName('Gtiff')
        srs = osr.SpatialReference()
        srs.ImportFromEPSG(4326)
        data = src.GetRasterBand(1).ReadAsArray(xoff[p],yoff[q],xcount,ycount).astype(np.float32)
        Var = np.zeros(data.shape, dtype=np.float32)
        Variable_load = df_new[listvar[0]][:]
        for m in range(len(number)):
            Var[data==number[m]] = Variable_load[m]
        #-------rescaling-----------#
        Var[np.where(np.isnan(Var))] = 0
        ds_driver = gdal.GetDriverByName('Gtiff')
        srs = osr.SpatialReference()
        srs.ImportFromEPSG(4326)
        sz = Var.itemsize
        h,w = Var.shape
        bh, bw = 36, 36
        shape = (h/bh, w/bw, bh, bw)
        shape2 = (int(shape[0]),int(shape[1]),shape[2],shape[3])
        strides = sz*np.array([w*bh,bw,w,1])
        blocks = np.lib.stride_tricks.as_strided(Var,shape=shape2,strides=strides)
        resized_array = ds_driver.Create(str(listvar[0])+'_resized_to_9km_glob_piece'+str(o)+'.tif',shape2[1],shape2[0],1,gdal.GDT_Float32)
        resized_array.SetGeoTransform((ds_xform[0],ds_xform[1]*bw,ds_xform[2],ds_xform[3],ds_xform[4],ds_xform[5]*bh))
        resized_array.SetProjection(srs.ExportToWkt())
        band = resized_array.GetRasterBand(1)
        zero_array = np.zeros([shape2[0],shape2[1]], dtype=np.float32)
        for z in range(len(blocks)):
            for k in range(len(blocks)):
                zero_array[z][k] = np.mean(blocks[z][k])
        band.WriteArray(zero_array)
        band.FlushCache()
        band = None
        del zero_array
        del Var
        o = o + 1
Normally, you should either be sure to call close on a file, or use the with statement. However, it looks like neither of those is supported by gdal.
Instead, you're expected to remove all references to the file. You're already setting band = None, but you also need to set src = None.
This is a bad, non-Pythonic interface, but that's apparently what the Python gdal library does. In addition to being a weird gotcha in its own right, it also interacts poorly with exceptions; any unhandled exceptions may also result in the file not being saved (or being partly saved, or being corrupted).
For the immediate problem, though, adding src = None or del src should do the trick.
PS (from comments): Another option would be to move the body of the for loop into a function; that will automatically delete all the variables without you having to list them all and potentially miss one. It'll still have problems if there's an exception, but at least the normal case should start working...
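To illustrate the pattern, here is a minimal, self-contained sketch (the file name, array, and function name are hypothetical, not taken from the question): the write for one piece lives in its own function, and the GDAL references are dropped explicitly before it returns so the dataset is closed and flushed.

import numpy as np
from osgeo import gdal, osr

def write_piece(path, arr):
    driver = gdal.GetDriverByName('GTiff')
    ds = driver.Create(path, arr.shape[1], arr.shape[0], 1, gdal.GDT_Float32)
    srs = osr.SpatialReference()
    srs.ImportFromEPSG(4326)
    ds.SetProjection(srs.ExportToWkt())
    band = ds.GetRasterBand(1)
    band.WriteArray(arr)
    band.FlushCache()
    # Dropping every reference closes the dataset and flushes it to disk
    band = None
    ds = None

write_piece('piece_example.tif', np.random.rand(36, 36).astype(np.float32))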
I have a C++ program which writes pixel data for a 2D grid to a binary file. Each binary file may have numerous grid states back-to-back, and there may be multiple of these binary files.
e.g. I could have 10 binary files bin0, bin1, bin2... bin9 each holding data for 10 grid states for a total of 100 grid states to animate.
I'm looking for a fast way to create an animation from the grid states in these binary files.
My best attempt used python and PIL.Image to create a gif:
import glob
import numpy as np
from PIL import Image
from functools import partial
def create_images():
    paths = glob.glob('./outfiles/dump*')
    imgs = []
    for path in sorted(paths, key=lambda x: int(x.split("p")[1])):
        with open(path, 'rb') as ifile:
            for dat in iter(partial(ifile.read, rows*cols), b''):
                mem = memoryview(dat).cast('B', shape=[rows,cols])
                arr = np.asarray(mem)
                img = Image.fromarray(arr*255)
                imgs.append(img)
    return imgs
rows = 128
cols = 128
imgs = create_images()
imgs[0].save('./animation.gif', save_all=True, append_images=imgs[1:], loop=0)
but the final line, where I actually write the GIF, can take a long time when I have potentially thousands of images, each with thousands of pixels. The rendering of the GIF is also poor quality when the images are large.
I'm looking for suggestions on how to make this run faster, using a library other than Pillow if necessary. I'm not bothered about sticking to Python if there is a better alternative in C/C++ (or other languages, but Python or C/C++ preferred), nor does the animation have to be a GIF.
In my case I'm working with grid data which is either 0 or 1 (the context is Conway's Game of Life), so optimizations which take advantage of this would be welcome. Note, however, that currently each 1 or 0 occupies a whole byte in the binary file, therefore are not packed into bits.
EDIT
Just adding a helper Python script that generates binary files as described above, for anyone who wants to give it a go.
import numpy as np
rows = 128
cols = 128
img_per_file = 10
num_files = 3
filesize = rows*cols*img_per_file
for i in range(num_files):
    filename = 'bin' + str(i)
    data = np.random.randint(2, size=filesize, dtype=np.uint8).tobytes()
    f = open(filename, 'wb')
    f.write(data)
    f.close()
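One possible direction, sketched below under the assumption that ffmpeg is installed and on PATH and that the files follow the bin0, bin1, ... naming from the helper script above: stream the raw 0/1 bytes straight to ffmpeg as 8-bit grayscale raw video, so no per-frame PIL images are ever built and the output is an MP4 rather than a GIF.

import glob
import subprocess
import numpy as np

rows, cols, fps = 128, 128, 30

# ffmpeg reads raw 8-bit grayscale frames from stdin and encodes them as H.264
cmd = [
    'ffmpeg', '-y',
    '-f', 'rawvideo', '-pixel_format', 'gray',
    '-video_size', '{}x{}'.format(cols, rows), '-framerate', str(fps),
    '-i', '-',
    '-c:v', 'libx264', '-pix_fmt', 'yuv420p', 'animation.mp4',
]
proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)

for path in sorted(glob.glob('bin*'), key=lambda x: int(x[3:])):
    frames = np.fromfile(path, dtype=np.uint8).reshape(-1, rows, cols)
    # Scale 0/1 cells to 0/255 so live cells show up white
    proc.stdin.write((frames * 255).tobytes())

proc.stdin.close()
proc.wait()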
I'm trying to write an array to a geotiff using gdal. Each row of the array is identical, and I used np.broadcast_to to create the array.
When I try to write it, I get a Windows popup saying "Python has stopped working: A problem caused the program to stop working correctly. Please close the program."
This approximates the steps I'm taking:
import gdal
import numpy as np
driver = gdal.GetDriverByName('GTiff')
outRaster = driver.Create("C:/raster.tif", 1000, 1000, 1, 6)
band = outRaster.GetRasterBand(1)
# Create array
a = np.arange(0,1000, dtype='float32')
a1 = np.broadcast_to(a, (1000,1000))
# try writing
band.WriteArray(a1) # crash
The problem is that the array created by broadcast_to isn't a regular contiguous array in memory. As described in the NumPy documentation, more than one element of the broadcast array may refer to the same memory location. This causes problems in gdal.
Instead of using broadcast_to, use something that stores each element in its own place in memory.
As an illustrative example, see the following code:
import gdal
import numpy as np
import sys
driver = gdal.GetDriverByName('GTiff')
outRaster = driver.Create("C:/raster.tif", 1000, 1000, 1, 6)
band = outRaster.GetRasterBand(1)
# Create 1000 x 1000 array two different ways
a = np.arange(0,1000, dtype='float32')
a1 = a[np.newaxis, :]
a1 = a1.repeat(1000, axis=0)
a2 = np.broadcast_to(a, (1000,1000))
# examine size of objects
sys.getsizeof(a1) # 4000112
sys.getsizeof(a2) # 112
# try writing
band.WriteArray(a1) # writes fine
band.WriteArray(a2) # crash
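As a possible alternative (a suggestion, not part of the answer above), NumPy can also materialise the broadcast view into a real contiguous array, which should then be safe to pass to WriteArray:

# Copy the zero-stride broadcast view into a contiguous array first
a3 = np.ascontiguousarray(np.broadcast_to(a, (1000, 1000)))
band.WriteArray(a3)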
I am currently struggling with a problem. I have 1500 FITS files that each contain a 3800 x 3800 array. My objective is to create a single HDF5 datacube from them. Unfortunately, I cannot provide the FITS files (due to storage constraints). So far, I have been able to create an empty HDF5 array with the required shape (3800, 3800, 1500) by doing:
import numpy as np
import h5py
from astropy.io import fits
import glob
outname = "test.hdf5"
NAXIS1 = 3800
NAXIS2 = 3800
NAXIS3 = 1500
f = h5py.File(outname,'w')
dataset = f.create_dataset("DataCube",
(NAXIS1,NAXIS2,NAXIS3),
dtype=np.float32)
f.close()
but I am having trouble writing the arrays from the FITS files, because each iteration of the following for loop takes at least 30 minutes:
f = h5py.File(outname, 'r+')
# This is the actual line, but I will replace it by random noise
# in order to make the example reproducible.
# fitslist = glob.glob("*fits") # They are 1500 fits files
for i in range(NAXIS3):
    # Again, I replace the real data with noise.
    # hdul = fits.open(fitslist[i])
    # f['DataCube'][:,:,i] = hdul[0].data
    data = np.random.normal(0, 1, (NAXIS1, NAXIS2))
    f['DataCube'][:,:,i] = data
f.close()
Is there any better way to construct a 3D datacube made of N slices that are already stored in N FITS files? I was expecting that once the HDF5 file was created on disk, writing to it would be quite fast, but it wasn't.
Thank you very much for your help.
EDIT 1: I tested the modification proposed by astrofrog and it worked really well; the performance is now quite good. In addition, I store batches of FITS files (~50) in a temporary NumPy array to reduce the number of times I write into the HDF5 file. Now the code looks like this:
NAXIS1 = len(fitslist)
NAXIS2 = fits_0[ext].header['NAXIS1']
NAXIS3 = fits_0[ext].header['NAXIS2']
shape_array = (NAXIS2, NAXIS3)
print(shape_array)
f = h5py_cache.File(outname, 'w', chunk_cache_mem_size=3000*1024**2,
libver='latest')
dataset = f.create_dataset("/x", (NAXIS1, NAXIS2, NAXIS3),
dtype=np.float32)
cache_size = 50
cache_array = np.empty(shape=(cache_size, NAXIS2, NAXIS3))
j = 0
for i in tqdm(range(len(fitslist))):
    print(fitslist[i])
    hdul = fits.getdata(fitslist[i], ext)
    cache_array[j:j+1, :, :] = hdul
    if ((i % cache_size == 0) & (i != 0)):
        print("Writing to disc")
        f['/x'][i-cache_size+1:i+1, :, :] = cache_array
        j = 0
    if (i % 100 == 0):
        print("collecting garbage")
        gc.collect()
    j = j + 1
f.close()
My question is: is there any more Pythonic way of doing this? I am not sure whether this is the most efficient way of writing files with h5py, or whether there is a better way to go from FITS to NumPy and then to HDF5.
I think the issue might be that the order of the dimensions should be NAXIS3, NAXIS2, NAXIS1 (currently I think it is doing very inefficient striding over the array). I would also only add the array to the HDF5 file at the end:
import glob
import h5py
import numpy as np
from astropy.io import fits
fitslist = glob.glob("*.fits")
NAXIS1 = 3800
NAXIS2 = 3800
NAXIS3 = 1000
array = np.zeros((NAXIS3, NAXIS2, NAXIS1), dtype=np.float32)
for i in range(NAXIS3):
    hdul = fits.open(fitslist[i], memmap=False)
    array[i, :, :] = hdul[0].data
f = h5py.File('test.hdf5', 'w')
f.create_dataset("DataCube", data=array)
f.close()
If you need the array in the NAXIS1, NAXIS2, NAXIS3 order, just transpose it at the very end.
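For example, a one-line variation on the code above (assuming the same array) would be:

# Store the cube in (NAXIS1, NAXIS2, NAXIS3) order instead
f.create_dataset("DataCube", data=array.transpose(2, 1, 0))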