Python load large number of files - python

I'm trying to load a large number of files saved in the Ensight gold format into a numpy array. In order to conduct this read I've written my own class libvec which reads the geometry file and then preallocates the arrays which python will use to save the data as shown in the code below.
N = len(file_list)
# Create the class object and read geometry file
gvec = vec.libvec(os.path.join(current_dir,casefile))
x,y,z = gvec.xyz()
# Preallocate arrays
U_temp = np.zeros((len(y),len(x),N),dtype=np.dtype('f4'))
V_temp = np.zeros((len(y),len(x),N),dtype=np.dtype('f4'))
u_temp = np.zeros((len(x),len(x),N),dtype=np.dtype('f4'))
v_temp = np.zeros((len(x),len(y),N),dtype=np.dtype('f4'))
# Read the individual files into the previously allocated arrays
for idx,current_file in enumerate(file_list):
    U,V = gvec.readvec(os.path.join(current_dir,current_file))
    U_temp[:,:,idx] = U
    V_temp[:,:,idx] = V
    del U,V
However this takes seemingly forever so I was wondering if you have any idea how to speed up this process? The code reading the individual files into the array structure can be seen below:
def readvec(self,filename):
    # For the moment we assume that the naming scheme PIV__vxy.case / PIV__vxy.geo does not change;
    # should that not be the case, appropriate changes have to be made to the corresponding file
    data_temp = np.loadtxt(filename, dtype=np.dtype('f4'), delimiter=None, converters=None, skiprows=4)
    # U value
    for i in range(len(self.__y)):
        # x value counter
        for j in range(len(self.__x)):
            # y value counter
            self.__U[i,j]=data_temp[i*len(self.__x)+j]
    # V value
    for i in range(len(self.__y)):
        # x value counter
        for j in range(len(self.__x)):
            # y value counter
            self.__V[i,j]=data_temp[len(self.__x)*len(self.__y)+i*len(self.__x)+j]
    # W value
    if len(self.__z)>1:
        for i in range(len(self.__y)):
            # x value counter
            for j in range(len(self.__x)):
                # y value counter
                self.__W[i,j]=data_temp[2*len(self.__x)*len(self.__y)+i*len(self.__x)+j]
        return self.__U,self.__V,self.__W
    else:
        return self.__U,self.__V
Thanks a lot in advance and best regards,
J

It's a bit hard to say without any test input/output to compare against, but I think this would give you the same U/V arrays as your nested for loops in readvec. This method should be considerably faster than the for loops.
U = data[:size_x*size_y].reshape(size_y, size_x)
V = data[size_x*size_y:2*size_x*size_y].reshape(size_y, size_x)
Returning these directly into U_temp and V_temp should also help. Right now you're making 3(?) copies of your data to get them into U_temp and V_temp:
From file to data_temp
From data_temp to self.__U/V
From U/V into U/V_temp
Although my guess is that the two nested for loops, which access one element at a time, are what causes the slowness.
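To make the reshape idea concrete, here is a minimal sketch. It assumes the file content is a flat array laid out as all U values, then all V values (then W, if present), each in row-major order with x varying fastest, which is what the indexing in the question's loops implies. The function name split_components and the synthetic data are made up for illustration.

```python
import numpy as np

def split_components(data, nx, ny, nz=1):
    """Split a flat array [U..., V..., (W...)] into (ny, nx) arrays without loops."""
    n = nx * ny
    U = data[:n].reshape(ny, nx)
    V = data[n:2 * n].reshape(ny, nx)
    if nz > 1:
        W = data[2 * n:3 * n].reshape(ny, nx)
        return U, V, W
    return U, V

# Synthetic stand-in for what np.loadtxt would return for one file
nx, ny = 4, 3
data = np.arange(2 * nx * ny, dtype='f4')
U, V = split_components(data, nx, ny)

# Reference: the element-by-element loop from the question, for comparison
U_loop = np.empty((ny, nx), dtype='f4')
for i in range(ny):
    for j in range(nx):
        U_loop[i, j] = data[i * nx + j]

print(np.array_equal(U, U_loop))
```

The reshaped arrays are views on data, so assigning them into U_temp[:,:,idx] copies the values only once.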

Related

Saving continuously generated simulation data with Python3

My question is how I should save a large amount of simulation data to a file using Python (or append new data rows to an existing file).
Let's say I have NN=1000 particles, and I want to save the position and velocity data of each particle (x y z, vx vy vz). The data is in the format [x1,y1,z1,vx1,vy1,vz1, x2,y2,z2,vx2,vy2,vz2, ...] and so on.
The simulation is working well, but I believe the methods I use for saving and keeping this information are not really optimal.
Pseudo code similar to my code:
T_max = 1000 # for example
dt = 0.1 # time step
T = 0 # current time
iterations = int(T_max/dt) # number of iterations we are doing
NN = 1000 # Number of particles
ZZ = np.zeros( (iterations, 2+NN*6 ) ) # Here I generate whole data matrix at the beginning.
# ^ might not be the best idea as the system needs to keep everything in memory for the whole time
# So I guess saving could be done in chunks?
ZZ[0][0], ZZ[0][1] = T , dt
# ZZ[0][2:] = initialize_system(NN=NN) # so lets initialize the system.
# However, for this post I do this differently due to simplicity. See below
ZZ[0][2:] = np.random.uniform(-100,100,NN*6)
i = 0
while i < iterations - 1:
    T += dt
    ZZ[i+1][0], ZZ[i+1][1] = T, dt
    #ZZ[i+1][2:] = rk4(EOM_function, posvel=ZZ[i][2:])
    # ^ Using this I would calculate new positions based on previous ones.
    ZZ[i+1][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
    i += 1
# Now the simulation data is basically done, so one would need to save
# This one feels slow, as it takes 181s to save and is size of 1046246KB
np.savetxt('test1.txt', ZZ)
#other method with a bit less accuracy as I don't need to have all decimals saved
np.savetxt('test2.txt', ZZ, fmt='%1.6f') # Takes 125s and size is 426698KB
# Both of the above are kinda slow so I also tried to save to npy format
np.save('test.npy', ZZ) # It took 8.9s and size 164118KB
So this np.save() method seems to be fast, but I read somewhere that I cannot append data to it. So this would not work if I keep saving the data in parts while calculating new positions.
So back to my question: how should/could I save the data efficiently (fast and memory friendly)? I keep having memory issues when NN and T_max get larger, because with this method I keep the whole ZZ in memory all the time.
So I guess I should calculate ZZ in parts, e.g. in iterations/10 parts, but then I would have to append this data to an existing file, and the tests I have made felt slow. Any suggestions?
EDIT: feel free to ask more specifying questions as I feel like I forgot to explain something.
That highly depends on what you intend to use the output for. If it's stored for further calculations, .npy or some other binary format is always the way to go: compared with serializing into a human-readable format, it is faster, takes less space, and doesn't lose precision between loads and saves. If you need it to be readable, you might as well just output row by row to a csv file or something.
If you want to do it with binary, h5py allows you to extend a dataset after saving and append more stuff to it.
import numpy as np
import h5py
T_max = 10**4 # for example
dt = 0.1 # time step
T = 0 # current time
iterations = int(T_max/dt) # number of iterations we are doing
NN = 1000 # Number of particles
chunk_size = 10**3
ZZ = np.zeros( (chunk_size, 2+NN*6 ) )
ZZ[0][0], ZZ[0][1] = T , dt
# ZZ[0][2:] = initialize_system(NN=NN) # so lets initialize the system.
# However, for this post I do this differently due to simplicity. See below
ZZ[0][2:] = np.random.uniform(-100,100,NN*6)
with h5py.File("test.h5", "a") as f:
    dset = f.create_dataset('ZZ', (0,2+NN*6), maxshape=(None,2+NN*6), dtype='float64', chunks=(chunk_size,2+NN+6))
    for chunk in range(0, iterations, chunk_size):
        for i in range(0, chunk_size - 1):
            T += dt
            ZZ[i + 1][0], ZZ[i + 1][1] = T, dt
            #Z[i+1][2:] = rk4(EOM_function, posvel=Z[i][2:])
            # ^ Using this I would calculate new positions based on previous ones.
            ZZ[i + 1][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
        # Expand the file here to allow for more data.
        dset.resize(dset.shape[0] + chunk_size, axis=0)
        dset[chunk: chunk + chunk_size ] = ZZ
        # update and initialize next chunk. the next chunk's first row should be the last row of the previous chunk + iteration
        T += dt
        ZZ[0][0], ZZ[0][1] = T, dt
        #Z[0][2:] = rk4(EOM_function, posvel=Z[-1][2:])
        # ^ Using this I would calculate new positions based on previous ones.
        ZZ[0][2:] = np.random.uniform(-100,100,NN*6) #This is just for example here.
    print(dset.shape)
This takes 70 seconds on the save step on my computer, generating a 45GB file, for a dataset that is 100 times your original code.
The above code is more general in case you are streaming your data and don't know your final size. If you know it from the start, you can replace the initial create_dataset with
dset = f.create_dataset('ZZ', (iterations,2+NN*6), dtype='float64')
and remove the dset.resize(dset.shape[0] + chunk_size, axis=0)
You'll probably also want to read it back in chunks afterwards for other processing, in which case you can follow the docs here: https://docs.h5py.org/en/latest/high/dataset.html#reading-writing-data
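As an aside, if the final size is known up front and you would rather stay with plain .npy files, a similar chunked write can be sketched with NumPy alone using numpy.lib.format.open_memmap. Note that this is a different technique from the h5py code above (the file cannot be extended afterwards), and the small sizes and the file location below are illustrative assumptions.

```python
import os
import tempfile
import numpy as np
from numpy.lib.format import open_memmap

# Small illustrative sizes (the question uses far larger ones)
iterations = 50
chunk_size = 10
n_cols = 2 + 3 * 6   # T, dt, plus 6 values for each of 3 particles
dt = 0.1

path = os.path.join(tempfile.mkdtemp(), "test.npy")

# Create the .npy file on disk with its final shape; writes go through a memory map
out = open_memmap(path, mode="w+", dtype="float64", shape=(iterations, n_cols))

for start in range(0, iterations, chunk_size):
    stop = min(start + chunk_size, iterations)
    block = np.random.uniform(-100, 100, (stop - start, n_cols))
    block[:, 0] = np.arange(start, stop) * dt   # time column
    block[:, 1] = dt
    out[start:stop] = block                     # only this chunk is built in memory
out.flush()
del out

# Read back lazily; mmap_mode avoids loading the whole file at once
ZZ = np.load(path, mmap_mode="r")
print(ZZ.shape)
```

The resulting file is an ordinary .npy file, so np.load works on it with or without mmap_mode.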
Okay, so I'm continuing my question / providing a possible answer to it, based on the answer of EricChen1248. EDIT: The answer provided by EricChen1248 works now and is far better than my code below. See his code.
I still do not completely understand how f.create_dataset() really works (i.e. when it writes data to the file in the loop, etc.).
Using the code provided by Eric, the data file was created and saved quickly, but when I read the file as follows
hf = h5py.File('temp/test.h5', 'r')
ZZ = np.array(hf['ZZ'])
hf.close()
and plotted the first column (time T column, which should increase by timestep dt after each iteration) I get the following figure
plt.plot(ZZ[:,0])
[Figure: time T column plotted]
and as can be seen, it grows to a time of 100, and then goes to zero. This happens after the first 'chunk_size' has been passed. I started to read docs provided by Eric, and using his code as reference I managed to write something like this
import numpy as np
import h5py
T_max = 10**4
dt = 0.1
T = 0
NN = 1000
iterations = int(T_max/dt)
chunk_size = 10**3
with h5py.File('temp/data12.h5', 'a') as hf:
    dset = hf.create_dataset("ZZ", (chunk_size, 2+NN*6),maxshape=(None,2+NN*6) ,chunks=(chunk_size, 2+NN*6), dtype='f8' )
    # ^ first I create data set equals to one chunk_size
    # Here I initialize the system. Columns ; 0=T , 1=dt, 2=arbitrary data point, 3=sin(column2)
    # all the rest columns are random numbers just to fill some numbers in
    dset[0,0], dset[0,1] = T, dt
    #dset[0,2:] = np.random.uniform(0,1,NN*6)
    dset[0,2] = 1
    dset[0,3] = np.sin(dset[0,2])
    dset[0,4:] = np.random.uniform(0,1,NN*6 -2)
    print('starts')
    # Main difference down there is that I use dataset (dset)
    # as a data matrix to be filled instead of matrix ZZ as in my question.
    i = 0
    #for j, s in enumerate(dset.iter_chunks()):
    for j, s in enumerate(range(0, iterations, chunk_size )):
        print(j, s)
        while i < iterations and i < chunk_size*(j+1) -1:
        #for i in range(chunk_size*j, chunk_size*(j+1)-1):
            T += dt
            dset[i+1,0], dset[i+1,1] = T, dt
            #dset[i+1,2:] = np.sin(dset[i,2:]+dt)
            dset[i+1,2] = dset[i,2] + dt
            dset[i+1,3] = np.sin(dset[i,2]+dt)
            dset[i+1,4:] = dset[i,4:] + np.random.uniform(-1,1,NN*6-2)
            i+=1
        print(dset.shape)
        dset.resize(dset.shape[0] + chunk_size, axis=0)
This code runs in 1 min 50 s and saves a file of size 4.47GB, so I am happy with the speed, and what I'm really happy about is that it does not use much memory while iterating (I used to run into problems with huge RAM usage).
When I read the data file produced by my code (in the same way as above) and plot the time T column, it grows nicely to T=10**4 as it should. [Figure: time T column plotted, my code version] The code still appends one more chunk_size block, full of zeros, to the end of the dataset; that I need to get rid of. One more proof that the code works and saves data without weird problems is the sinusoidal plot plt.plot(ZZ[500:1500,0] , ZZ[500:1500,3]). [Figure: sinusoidal plot] Note that the plot is limited to T ~ [50,150] so that one can still see something (if the whole range were plotted, the lines could not be told apart).
I believe this is not the best way to write this code, but it is the way I got it working. So if someone sees improvements, please let me know. Also, I am curious to know why the code provided by Eric did not work, at least for me.
EDIT : fixed typos

How to make code like repmat MATLAB on Python?

This MATLAB code is from Main_MOHHO.m from https://www.mathworks.com/matlabcentral/fileexchange/80776-multi-objective-harris-hawks-optimization-mohho. I want to write the same code in Python, but I can't create the Rabbits variable.
clc;
clear;
close all;
%% Problem Definition
nVar=3; % Number of Decision Variables
VarSize=[1 nVar]; % Size of Decision Variables Matrix
VarMin=0; % Lower Bound of Variables
VarMax=1; % Upper Bound of Variables
nPop=5; % Population Size
%% Initialization
empty_Rabbit.Location=[];
empty_Rabbit.Cost=[];
empty_Rabbit.Sol=[];
empty_Rabbit.IsDominated=[];
empty_Rabbit.GridIndex=[];
empty_Rabbit.GridSubIndex=[];
Rabbits=repmat(empty_Rabbit,nPop,1);
for i=1:nPop
Rabbits(i).Location = rand(VarSize).*(VarMax-VarMin)+VarMin;
X(i,:) = rand(VarSize).*(VarMax-VarMin)+VarMin;
end
I tried to write it in Google Colab like this.
import numpy as np
nVar = 3 # Number of Decision Variables
VarSize = np.array((1, nVar)) # Size of Decision Variables Matrix
VarMin = 0 # Lower Bound of Variables
VarMax = 1 # Upper Bound of Variables
nPop = 5 # Population Size
class empty_Rabbit:
    Location = []
    Cost = []
    IsDominated = []
    GridIndex = []
    GridSubIndex = []
    Sol = []
Rabbits = np.tile(empty_Rabbit, (nPop, 1))
X = np.zeros((nPop, nVar))
Rabbit_Location = np.zeros((VarSize))
Rabbit_Energy = math.inf
for i in range(nPop):
    Rabbits[i, 0].Location = np.multiply(np.random.rand(VarSize[0], VarSize[1]),
                                         (VarMax-VarMin) + VarMin)
    print(Rabbits[i,0].Location)
But the Rabbits Location is the same for each row.
[Figure: Google Colab output]
What is the correct way to create the Rabbits variable in Python so that the output looks like the output labelled number 1 in the picture? Thank you.
Two issues exist in your code. First, np.tile repeats the same object (nPop, 1) times. So, when you change one of the objects, you actually change the same memory location. Second, you want to initialize a different object each time instead of referring to the same object, so you want to write empty_Rabbit() to create a new instance of that object. Both suggestions can be achieved using a comprehension like [empty_Rabbit() for i in range(nPop)] and reshape to any new dimensions if required.
import numpy as np
nVar = 3 # Number of Decision Variables
VarSize = np.array((1, nVar)) # Size of Decision Variables Matrix
VarMin = 0 # Lower Bound of Variables
VarMax = 1 # Upper Bound of Variables
nPop = 5 # Population Size
class empty_Rabbit:
    Location = []
    Cost = []
    IsDominated = []
    GridIndex = []
    GridSubIndex = []
    Sol = []
Rabbits = np.array([empty_Rabbit() for i in range(nPop)]).reshape(nPop,1)
X = np.zeros((nPop, nVar))
Rabbit_Location = np.zeros((VarSize))
Rabbit_Energy = np.inf
for i in range(nPop):
    # Same formula as the MATLAB code: rand(VarSize).*(VarMax-VarMin)+VarMin
    Rabbits[i, 0].Location = np.random.rand(VarSize[0], VarSize[1]) * (VarMax-VarMin) + VarMin
    print(Rabbits[i,0].Location)
for i in range(nPop):
    print(Rabbits[i,0].Location)
Now, the output of both print statements will be identical with distinct rows:
[[0.5392264 0.39375339 0.59483626]]
[[0.53959355 0.91049574 0.58115175]]
[[0.46152304 0.43111977 0.06882631]]
[[0.13693784 0.82075653 0.49488394]]
[[0.06901317 0.34133836 0.91453956]]
[[0.5392264 0.39375339 0.59483626]]
[[0.53959355 0.91049574 0.58115175]]
[[0.46152304 0.43111977 0.06882631]]
[[0.13693784 0.82075653 0.49488394]]
[[0.06901317 0.34133836 0.91453956]]
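The difference between repeating one object and creating separate instances can be checked directly. The following is only a small sketch: the class Rabbit (with an __init__) is a hypothetical stand-in for empty_Rabbit, and the sizes are arbitrary.

```python
import numpy as np

class Rabbit:
    """Hypothetical stand-in for empty_Rabbit, with per-instance attributes."""
    def __init__(self):
        self.Location = None
        self.Cost = None

n_pop, n_var = 5, 3

# np.tile on a single object: every cell refers to that same object
shared = np.tile(Rabbit(), (n_pop, 1))
shared[0, 0].Location = np.random.rand(1, n_var)
all_same = all(shared[i, 0] is shared[0, 0] for i in range(n_pop))

# A comprehension calls the constructor once per element
distinct = np.array([Rabbit() for _ in range(n_pop)]).reshape(n_pop, 1)
for i in range(n_pop):
    distinct[i, 0].Location = np.random.rand(1, n_var)
all_distinct = len({id(distinct[i, 0]) for i in range(n_pop)}) == n_pop

print(all_same, all_distinct)
```

Because every cell of shared is the same object, setting Location through one cell is visible through all of them, which is the behaviour seen in the question.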
scipy.io.loadmat uses structured arrays when loading struct from MATLAB .mat files. But I think that's too advanced for you.
I think you need to create a set of numpy arrays, rather than try for some sort of class or more complicated structure.
empty_Rabbit.Location=[];
empty_Rabbit.Cost=[];
empty_Rabbit.Sol=[];
empty_Rabbit.IsDominated=[];
empty_Rabbit.GridIndex=[];
empty_Rabbit.GridSubIndex=[];
becomes instead
location = np.zeros(nPop)
cost = np.zeros(nPop)
sol = np.zeros(nPop)
isDominated = np.zeros(nPop) # or bool dtype?
gridIndex = np.zeros(nPop)
gridSubIndex = np.zeros(nPop)
np.zeros makes a float array; for some of those you might want np.zeros(nPop, dtype=int) (if used as index).
rabbit= np.zeros(nPop, dtype=[('location',float), ('cost',float),('sol',float), ....])
could be used to make structured array, but you'll need to read more about those.
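For what it's worth, a filled-in version of that idea might look like the sketch below. The field names follow the MATLAB struct, but the field types and the sub-array shape for location are my assumptions, not something taken from the MOHHO code.

```python
import numpy as np

nVar, nPop = 3, 5
VarMin, VarMax = 0.0, 1.0

# Field names follow the MATLAB struct; the types are assumptions
rabbit_dtype = np.dtype([
    ('location', float, (nVar,)),   # one row of nVar values per rabbit
    ('cost', float),
    ('is_dominated', bool),
    ('grid_index', int),
])

rabbits = np.zeros(nPop, dtype=rabbit_dtype)

# Fill one field for the whole population at once, without a loop
rabbits['location'] = np.random.rand(nPop, nVar) * (VarMax - VarMin) + VarMin

print(rabbits[0]['location'])   # the nVar values of the first rabbit
print(rabbits['cost'].shape)    # one cost per rabbit
```

Indexing with an integer gives one record, indexing with a field name gives that field for all records.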
MATLAB lets you use iteration freely as in
for i=1:nPop
Rabbits(i).Location = rand(VarSize).*(VarMax-VarMin)+VarMin;
X(i,:) = rand(VarSize).*(VarMax-VarMin)+VarMin;
end
but in numpy that's slow (as it used to be in MATLAB before JIT compilation). It's better to use whole-array calculations:
location = np.random.rand(nPop, nVar) * (VarMax-VarMin)+VarMin
will make a (nPop,nVar) 2d array, not the 1d that np.zeros(nPop) created.
Looks like X could be created in the same way (without iteration).

getting one matrix from each of the files in a folder and sum the matrices

I was trying to read multiple files in a folder and get one matrix from each of them and sum all of the matrices.
Script for reading only one file, which works well (md.MCERunfile and Item2d come from an existing module for reading the data):
outfile=md.MCERunfile('/somepath/filename')
rn_matrix=outfile.Item2d('IV', 'Rn_C%i')
Shape=np.shape(rn_matrix)
rn_matrix_float = np.array([]).reshape(0,55)
for x in range(Shape[0]):
    row = map(float, rn_matrix[x])
    rn_matrix_float=np.vstack([rn_matrix_float, row])
The final output rn_matrix_float is a 32 by 64 numpy array
Now I tried:
path = '/somepath/*.xxx'
files = glob.glob(path)
final_matrix=np.zeros((32, 64))
for j in range(0,len(files)):
    outfile = md.MCERunfile(files[j])
    rn_matrix=outfile.Item2d('IV', 'cut_rec_C%i')
    Shape=np.shape(rn_matrix)
    for x in range(Shape[0]):
        rn_matrix_float = np.array([]).reshape(0,64)
        row = map(float, rn_matrix[x])
        rn_matrix_float=np.vstack([rn_matrix_float, row])
    final_matrix=final_matrix+rn_matrix_float
I think my mistake is that I have already defined outfile and rn_matrix in the loop, which makes every rn_matrix_float exactly the same instead of reading data from different files, so the final_matrix is a sum of identical arrays. But I don't know how to fix it.
An iteration like this
rn_matrix_float = np.array([]).reshape(0,55)
for x in range(Shape[0]):
    row = map(float, rn_matrix[x])
    rn_matrix_float=np.vstack([rn_matrix_float, row])
should be written as a list append
alist = []
for x in range(Shape[0]):
    row = list(map(float, rn_matrix[x]))  # list() so this also works on Python 3
    alist.append(row)
rn_matrix_float=np.vstack(alist)
Appending to a list is faster and easier than repeatedly 'concatenating' arrays. Actually it could probably be written as a list comprehension, or even a one-line array operation
rn_matrix_float = rn_matrix.astype(float)
(but that's more of a guess since I haven't tried to recreate your data.)
Similarly I'd be inclined to collect the multi-file case, and do one sum at the end
alist2 = []
for j in range(0,len(files)):
    outfile = md.MCERunfile(files[j])
    rn_matrix=outfile.Item2d('IV', 'cut_rec_C%i')
    alist2.append(rn_matrix.astype(float))
final_matrix = np.array(alist2)
print(final_matrix.shape) # check shape
final_matrix = final_matrix.sum(axis=0)
If the intermediate array gets too big we might want to add incrementally. But for a start I think you should become comfortable with accumulating multidimensional arrays, and then 'reducing' them with actions like sum.
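For comparison, here is a rough sketch of both variants side by side. Since I can't run md.MCERunfile, load_matrix below is a made-up stand-in that returns a 32 by 64 array of numeric strings (my guess at what Item2d gives you, based on the map(float, ...) in your code).

```python
import numpy as np

def load_matrix(index, shape=(32, 64)):
    """Hypothetical stand-in for md.MCERunfile(...).Item2d(...); returns numeric strings."""
    rng = np.random.default_rng(index)
    return rng.random(shape).astype(str)

n_files = 4

# Variant 1: collect everything into a 3d array, then reduce with sum
stacked = np.array([load_matrix(j).astype(float) for j in range(n_files)])
total_stacked = stacked.sum(axis=0)

# Variant 2: running sum, keeps only one file's matrix in memory at a time
total_running = np.zeros((32, 64))
for j in range(n_files):
    total_running += load_matrix(j).astype(float)

print(stacked.shape, np.allclose(total_stacked, total_running))
```

Both give the same (32, 64) result; the running sum is the one to use when the stacked array would not fit in memory.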

How to write .csv file in Python?

I am running the following: output.to_csv("hi.csv") where output is a pandas dataframe.
My variables all have values but when I run this in iPython, no file is created. What should I do?
It is better to give the complete path for your output CSV file. Maybe you are looking in the wrong folder.
You have to make sure that the 'to_csv' method of your 'output' object actually has a write-file function implemented.
There is also a library for CSV manipulation in Python, so you don't need to handle all the work yourself:
https://docs.python.org/2/library/csv.html
I'm not sure if this will be useful to you, but I write to CSV files frequently in Python. Here is an example that generates random vector (X, Y, Z) values and writes them to a CSV file using the csv module. (The paths are OS X paths, but you should get the idea even on a different OS.)
Working Writing Python to CSV example
import os, csv, random
# Generates random vectors and writes them to a CSV file
WriteFile = True # Write CSV file if true - useful for testing
CSVFileName = "DataOutput.csv"
CSVfile = open(os.path.join('/Users/Si/Desktop/', CSVFileName), 'w', newline='')
def genlist():
    # Generates a list of random vectors
    global v, ListLength
    ListLength = 25 #Amount of vectors to be produced
    Max = 100 #Maximum range value
    x = [] #Empty x vector list
    y = [] #Empty y vector list
    z = [] #Empty z vector list
    v = [] #Empty xyz vector list
    for i in range(ListLength):
        rnd = random.randrange(0,(Max)) #Generate random number
        x.append(rnd) #Add it to x list
    for i in range(ListLength):
        rnd = random.randrange(0,(Max))
        y.append(rnd) #Add it to y list
    for i in range(ListLength):
        rnd = random.randrange(0,(Max)) #Generate random number
        z.append(rnd) #Add it to z list
    for i in range(ListLength):
        merge = x[i], y[i],z[i] # Merge x[i], y[i], z[i]
        v.append(merge) #Add merged list into v list
def writeCSV():
    # Write Vectors to CSV file
    wr = csv.writer(CSVfile, quoting = csv.QUOTE_MINIMAL, dialect='excel')
    wr.writerow(('Point Number', 'X Vector', 'Y Vector', 'Z Vector'))
    for i in range(ListLength):
        wr.writerow((i+1, v[i][0], v[i][1], v[i][2]))
    CSVfile.close() # Close the file so the data is flushed to disk
    print("Data written to", CSVfile.name)
genlist()
if WriteFile is True:
    writeCSV()
Hopefully there is something useful in here for you!

Size-Incremental Numpy Array in Python

I just came across the need of an incremental Numpy array in Python, and since I haven't found anything I implemented it. I'm just wondering if my way is the best way or you can come up with other ideas.
So, the problem is that I have a 2D array (the program handles nD arrays) whose size is not known in advance, and a variable amount of data needs to be concatenated to the array in one direction (let's say that I have to call np.vstack a lot of times). Every time I concatenate data, I need to take the array, sort it along axis 0 and do other stuff, so I cannot construct a long list of arrays and then np.vstack the list at once.
Since memory allocation is expensive, I turned to incremental arrays, where I grow the array by more than the size I currently need (I use 50% increments), so that I minimize the number of allocations.
I coded this up and you can see it in the following code:
class ExpandingArray:
    __DEFAULT_ALLOC_INIT_DIM = 10 # default initial dimension for all the axes if nothing is given by the user
    __DEFAULT_MAX_INCREMENT = 10 # default value in order to limit the increment of memory allocation
    __MAX_INCREMENT = [] # Max increment
    __ALLOC_DIMS = [] # Dimensions of the allocated np.array
    __DIMS = [] # Dimensions of the view with data on the allocated np.array (__DIMS <= __ALLOC_DIMS)
    __ARRAY = [] # Allocated array
    def __init__(self,initData,allocInitDim=None,dtype=np.float64,maxIncrement=None):
        self.__DIMS = np.array(initData.shape)
        self.__MAX_INCREMENT = maxIncrement
        if self.__MAX_INCREMENT is None:
            self.__MAX_INCREMENT = self.__DEFAULT_MAX_INCREMENT
        # Compute the allocation dimensions based on user's input
        if allocInitDim is None:
            allocInitDim = self.__DIMS.copy()
        while np.any( allocInitDim < self.__DIMS ) or np.any(allocInitDim == 0):
            for i in range(len(self.__DIMS)):
                if allocInitDim[i] == 0:
                    allocInitDim[i] = self.__DEFAULT_ALLOC_INIT_DIM
                if allocInitDim[i] < self.__DIMS[i]:
                    allocInitDim[i] += min(allocInitDim[i]//2, self.__MAX_INCREMENT)
        # Allocate memory
        self.__ALLOC_DIMS = allocInitDim
        self.__ARRAY = np.zeros(self.__ALLOC_DIMS,dtype=dtype)
        # Set initData (a tuple of slices is needed for multidimensional indexing)
        sliceIdxs = tuple(slice(self.__DIMS[i]) for i in range(len(self.__DIMS)))
        self.__ARRAY[sliceIdxs] = initData
    def shape(self):
        return tuple(self.__DIMS)
    def getAllocArray(self):
        return self.__ARRAY
    def getDataArray(self):
        """
        Get the view of the array with data
        """
        sliceIdxs = tuple(slice(self.__DIMS[i]) for i in range(len(self.__DIMS)))
        return self.__ARRAY[sliceIdxs]
    def concatenate(self,X,axis=0):
        if axis >= len(self.__DIMS):
            print("Error: axis number exceed the number of dimensions")
            return
        # Check dimensions for remaining axis
        for i in range(len(self.__DIMS)):
            if i != axis:
                if X.shape[i] != self.shape()[i]:
                    print("Error: Dimensions of the input array are not consistent in the axis %d" % i)
                    return
        # Check whether allocated memory is enough
        needAlloc = False
        while self.__ALLOC_DIMS[axis] < self.__DIMS[axis] + X.shape[axis]:
            needAlloc = True
            # Increase the __ALLOC_DIMS
            self.__ALLOC_DIMS[axis] += min(self.__ALLOC_DIMS[axis]//2,self.__MAX_INCREMENT)
        # Reallocate memory and copy old data
        if needAlloc:
            # Allocate
            newArray = np.zeros(self.__ALLOC_DIMS)
            # Copy
            sliceIdxs = tuple(slice(self.__DIMS[i]) for i in range(len(self.__DIMS)))
            newArray[sliceIdxs] = self.__ARRAY[sliceIdxs]
            self.__ARRAY = newArray
        # Concatenate new data
        sliceIdxs = []
        for i in range(len(self.__DIMS)):
            if i != axis:
                sliceIdxs.append(slice(self.__DIMS[i]))
            else:
                sliceIdxs.append(slice(self.__DIMS[i],self.__DIMS[i]+X.shape[i]))
        self.__ARRAY[tuple(sliceIdxs)] = X
        self.__DIMS[axis] += X.shape[axis]
The code shows considerably better performance than vstack/hstack for several randomly sized concatenations.
What I'm wondering about is: is this the best way? Is there anything in numpy that does this already?
Furthermore, it would be nice to be able to overload the slice assignment operator of np.array, so that as soon as the user assigns anything outside the actual dimensions, an ExpandingArray.concatenate() is performed. How can such overloading be done?
Testing code: here is also the code I used to compare vstack and my method. I add random chunks of data with a maximum length of 100.
import time
N = 10000
def performEA(N):
    EA = ExpandingArray(np.zeros((0,2)),maxIncrement=1000)
    for i in range(N):
        nNew = np.random.randint(low=1,high=101)  # 1..100 inclusive
        X = np.random.rand(nNew,2)
        EA.concatenate(X,axis=0)
        # Perform operations on EA.getDataArray()
    return EA
def performVStack(N):
    A = np.zeros((0,2))
    for i in range(N):
        nNew = np.random.randint(low=1,high=101)  # 1..100 inclusive
        X = np.random.rand(nNew,2)
        A = np.vstack((A,X))
        # Perform operations on A
    return A
start_EA = time.perf_counter()
EA = performEA(N)
stop_EA = time.perf_counter()
start_VS = time.perf_counter()
VS = performVStack(N)
stop_VS = time.perf_counter()
print("Elapsed Time EA: %.2f" % (stop_EA-start_EA))
print("Elapsed Time VS: %.2f" % (stop_VS-start_VS))
I think the most common design pattern for these things is to just use a list for the small arrays. Sure you could do things like dynamic resizing (if you want to do crazy things, you can try to use the resize array method too). I think a typical method is to always double the size, when you really don't know how large things will be. Of course if you know how large the array will grow to, just allocating the full thing up front is simplest.
def performVStack_fromlist(N):
    l = []
    for i in range(N):
        nNew = np.random.randint(low=1,high=101)  # 1..100 inclusive
        X = np.random.rand(nNew,2)
        l.append(X)
    return np.vstack(l)
I am sure there are some use cases where an expanding array could be useful (for example when the appended arrays are all very small), but this loop seems better handled with the above pattern. The optimization is mostly about how often you need to copy everything around, and with a list like this (other than the list itself) that happens exactly once here. So it is normally much faster.
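If you do need the data as one array after every append (as in your sort-after-each-step case), the doubling strategy mentioned above could look roughly like this. It is only a sketch for growth along axis 0; the class name GrowableRows and its methods are made up for illustration and are not part of numpy.

```python
import numpy as np

class GrowableRows:
    """Hypothetical helper: appends rows into a buffer that doubles when it is full."""
    def __init__(self, n_cols, capacity=16, dtype=np.float64):
        self._buf = np.empty((capacity, n_cols), dtype=dtype)
        self._n = 0  # number of rows actually filled

    def append(self, rows):
        rows = np.atleast_2d(rows)
        needed = self._n + rows.shape[0]
        if needed > self._buf.shape[0]:
            # Double until it fits, so full copies happen only O(log n) times
            capacity = self._buf.shape[0]
            while capacity < needed:
                capacity *= 2
            new_buf = np.empty((capacity, self._buf.shape[1]), dtype=self._buf.dtype)
            new_buf[:self._n] = self._buf[:self._n]
            self._buf = new_buf
        self._buf[self._n:needed] = rows
        self._n = needed

    @property
    def data(self):
        """View of the filled part of the buffer (no copy)."""
        return self._buf[:self._n]

rng = np.random.default_rng(0)
chunks = [rng.random((int(rng.integers(1, 101)), 2)) for _ in range(200)]

grow = GrowableRows(n_cols=2)
for chunk in chunks:
    grow.append(chunk)
    # operations on grow.data could go here

print(grow.data.shape)
```

The result matches np.vstack(chunks), but the whole array is copied only when the capacity doubles rather than on every append.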
When I faced a similar problem, I used ndarray.resize() (http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.resize.html#numpy.ndarray.resize). Most of the time, it will avoid reallocation+copying altogether. I can't guarantee it would prove to be faster (it probably would), but it's so much simpler.
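A minimal sketch of that approach is below. It assumes the array owns its data and that nothing else references it; refcheck=False switches off numpy's reference check, so use it only when you are sure of that. Also note that the existing rows keep their positions only when growing along the first axis of a C-ordered array.

```python
import numpy as np

# Built with np.array so that the array owns its data (views cannot be resized)
a = np.array([[0.0, 1.0],
              [2.0, 3.0],
              [4.0, 5.0]])
original = a.copy()

# Grow in place along axis 0; the new rows are filled with zeros
a.resize((5, 2), refcheck=False)

# Fill the newly available rows
a[3:] = [[6.0, 7.0], [8.0, 9.0]]
print(a.shape)
```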
As for your second question, I think overriding slice assignment for extending purposes is not a good idea. That operator is meant for assigning to existing items/slices. If you want to change that, it's not immediately clear how you'd want it to behave in some cases, e.g.:
a = MyExtendableArray(np.arange(100))
a[200] = 6 # resize to 200? pad [100:200] with what?
a[90:110] = 7 # assign to existing items AND automagically-allocated items?
a[::-1][200] = 6 # ...
My suggestion is that slice-assignment and data appending should remain separate.
