Speeding up slicing of a big numpy array - python

I have a big array (1000x500000x6) stored in a PyTables file. I am doing some calculations on it that are fairly well optimized for speed, but what takes the most time is the slicing of the array.
At the beginning of the script, I need to get a subset of the rows: reduced_data = data[row_indices, :, :]. Then, for this reduced dataset, I need to access:
columns one by one: reduced_data[:, clm_indice, :]
a subset of the columns: reduced_data[:, clm_indices, :]
Getting these arrays takes forever. Is there any way to speed that up? Storing the data differently, for example?

You can try choosing the chunkshape of your array wisely; see: http://pytables.github.com/usersguide/libref.html#tables.File.createCArray
This option controls the order in which the data is physically stored in the file, so it may help to speed up access.
With some luck, for your data access pattern, something like chunkshape=(1000, 1, 6) might work.
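As an illustration only, here is a minimal sketch of creating such a chunked array with PyTables; the file and node names are made up, and I am using the current create_carray spelling of the createCArray method from the linked docs:
import numpy as np
import tables

with tables.open_file('data.h5', mode='w') as h5:
    carray = h5.create_carray(
        h5.root, 'data',
        atom=tables.Float64Atom(),
        shape=(1000, 500000, 6),
        chunkshape=(1000, 1, 6),   # one chunk holds all rows for a single column
    )
    # a write (or read) of one column touches exactly one chunk
    carray[:, 0, :] = np.random.random((1000, 6))
With this layout, a read like reduced_data[:, clm_indice, :] hits a single chunk instead of scanning the whole file.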

Related

Best data type (in terms of speed/RAM) for millions of pairs of a single int paired with a batch of (2 to 100) ints

I have about 15 million pairs that consist of a single int, paired with a batch of (2 to 100) other ints.
If it makes a difference, the ints themselves range from 0 to 15 million.
I have considered using:
Pandas, storing the batches as Python lists
Numpy, where each batch is stored as its own numpy array (since numpy doesn't allow variable-length rows in its 2D data structures)
A Python list of lists.
I also looked at TensorFlow TFRecords, but I'm not too sure about this one.
I only have about 12 GB of RAM. I will also be using this data to train a machine learning algorithm, so memory efficiency matters.
If you must store all values in memory, numpy will probably be the most efficient way. Pandas is built on top of numpy, so it includes some overhead that you can avoid if you do not need any of the functionality that comes with pandas.
Numpy should have no memory issues handling data of this size, but another thing to consider (depending on how you will use the data) is to read from a file that has each pair on a new line, using a generator. This would reduce memory usage significantly, but it would be slower than numpy for aggregate operations like sum() or max(), and it is more suitable if each value pair is processed independently.
with open(file, 'r') as f:
    data = (line for line in f)  # generator: lines are read lazily, one at a time
    for line in data:
        # process each record here
        ...
I would do the following:
# create example data
A = np.random.randint(0,15000000,100)
B = [np.random.randint(0,15000000,k) for k in np.random.randint(2,101,100)]
int32 is sufficient:
A32 = A.astype(np.int32)
We want to glue all the batches together.
First, write down the batch sizes so we can separate them again later:
from itertools import chain
sizes = np.fromiter(chain((0,),map(len,B)),np.int32,len(B)+1)
boundaries = sizes.cumsum()
# force int32
B_all = np.empty(boundaries[-1],np.int32)
np.concatenate(B,out=B_all)
After gluing, re-split:
B32 = np.split(B_all, boundaries[1:-1])
Finally, make an array of pairs for convenience:
pairs = np.rec.fromarrays([A32,B32],names=["first","second"])
What was the point of gluing and then splitting again?
First, note that the re-split arrays are all views into B_all, so we do not waste much memory by having both. Also, if we modify either B_all or B32 (or rather, some of its elements) in place, the other one is automatically updated as well.
The advantage of having B_all around is efficiency via numpy's reduceat ufunc method. If we wanted, for example, the means of all batches, we could do np.add.reduceat(B_all, boundaries[:-1]) / sizes[1:], which is faster than looping through pairs['second'].
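For illustration, a tiny self-contained sketch of that reduceat trick with made-up batches (not the questioner's data):
import numpy as np

B = [np.array([1, 2, 3]), np.array([10, 20]), np.array([4, 4, 4, 4])]

sizes = np.array([0] + [len(b) for b in B])
boundaries = sizes.cumsum()              # [0, 3, 5, 9]
B_all = np.concatenate(B)                # one flat array

batch_sums = np.add.reduceat(B_all, boundaries[:-1])   # per-batch sums in one vectorized call
batch_means = batch_sums / sizes[1:]
print(batch_means)                       # [ 2. 15.  4.]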
Use numpy. It is the most efficient, and you can use it easily with a machine learning model.

When saving a numpy array of float arrays to a .npy file using numpy.save/numpy.load, is there any reason why the order of the arrays would change?

I currently have data where each row has a text passage and a numpy float array.
As far as I know, it's not efficient to save these two datatypes into one data format (correct me if I am wrong). So I am going to save them separately, with an extra column of ints that will be used to map the two datasets together when I want to join them again.
I am having trouble figuring out how to append a column of ints next to the float arrays (if anyone has a solution to that, I would love to hear it) and then save the numpy array.
But then I realized I can just save the float arrays as is with numpy.save without the extra int column if I can get a confirmation that numpy.save and numpy.load will never change the order of the arrays.
That way I can just append the loaded numpy float arrays to the pandas dataframe as is.
Logically, I don't see any reason why the order of the rows would change, but perhaps there's some optimization or compression step that I am unaware of.
Would numpy.save or numpy.load ever change the order of a numpy array of float arrays?
The order will not be changed by numpy save/load: you are saving the numpy object as is, and an array is an ordered object.
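As a quick sanity check (the file name is just an example), a save/load round trip gives back identical values in the identical row order:
>>> import numpy as np
>>> a = np.random.rand(1000, 128)
>>> np.save('vectors.npy', a)
>>> b = np.load('vectors.npy')
>>> (a == b).all()
True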
Note: if you want to save multiple data arrays to the same file, you can use np.savez.
>>> np.savez('out.npz', f=array_of_floats, s=array_of_strings)
You can retrieve back each with the following:
>>> data = np.load('out.npz')
>>> array_of_floats = data['f']
>>> array_of_strings = data['s']

(Py)Spark combineByKey mergeCombiners output type != mergeCombinerVal type

I'm trying to optimize a piece of software written in Python using pandas DataFrames. The algorithm takes a pandas DF as input, can't be distributed, and outputs a metric for each client.
Maybe it's not the best solution, but my time-efficient approach is to load all files in parallel and then build a DF for each client.
This works fine, BUT a few clients have a really HUGE amount of data, so I need to save memory when creating their DFs.
In order to do this I'm performing a groupBy() (actually a combineByKey, but logically it's a groupBy), and then for each group (now a single row of an RDD) I build a list and, from it, a pandas DF.
However, this makes many copies of the data (RDD rows, list and pandas DF...) in a single task/node and crashes, and I would like to avoid having that many copies on a single node.
I was thinking of a "special" combineByKey with the following pseudo-code:
def createCombiner(val):
    return [val]

def mergeCombinerVal(x, val):
    x.append(val)
    return x

def mergeCombiners(x, y):
    # Not checking whether y is already a pandas DF, but we could do that too
    if isinstance(x, list):
        pandasDF = pd.DataFrame(data=x, columns=myCols)
        return pandasDF.append(y)   # DataFrame.append returns a new DF, it is not in-place
    else:
        return x.append(y)
My question: the docs say nothing about this, but does anyone know whether it's safe to assume this will work? (The return type of merging two combiners is not the same as the combiner type.) I could also control the datatype in mergeCombinerVal if the number of "bad" calls is marginal, but appending to a pandas DF row by row would be very inefficient.
Any better idea to do what I want to do?
Thanks!
PS: Right now I'm packing Spark rows; would switching from Spark rows to Python lists without column names help reduce memory usage?
Just writing my comment as an answer:
In the end I used a regular combineByKey. It's faster than groupByKey (I don't know the exact reason; I guess it helps to pack the rows, because my rows are small in size but there are very many of them), and it also lets me group them into a "real" Python list (groupByKey groups into some kind of Iterable, which pandas doesn't support; that forced me to create another copy of the structure, which doubled memory usage and crashed), which helps me with memory management when packing them into pandas/C datatypes.
Now I can use those lists to build a DataFrame directly without any extra transformation (I don't know what kind of structure Spark's groupByKey "list" is, but pandas won't accept it in the constructor). A minimal sketch of this pattern is shown below.
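For illustration only; the SparkContext, the sample rows and the column names here are hypothetical, not taken from the original job:
from pyspark import SparkContext
import pandas as pd

sc = SparkContext.getOrCreate()

# (client_id, record) pairs, where each record is a plain tuple rather than a Spark Row
rows = sc.parallelize([("a", (1, 2.0)), ("a", (3, 4.0)), ("b", (5, 6.0))])

def create_combiner(v):
    return [v]              # start a real Python list

def merge_value(acc, v):
    acc.append(v)           # grow the same list in place
    return acc

def merge_combiners(a, b):
    a.extend(b)             # concatenate the two partial lists
    return a

grouped = rows.combineByKey(create_combiner, merge_value, merge_combiners)

# build one pandas DataFrame per client, straight from the plain list
dfs = grouped.mapValues(lambda recs: pd.DataFrame(recs, columns=["x", "y"]))
print(dfs.collect())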
Still, my original idea should have used a bit less memory (at most 1x DF + 0.5x list, while now I have 1x DF + 1x list), but as user8371915 said it's not guaranteed by the API/docs..., so better not to put that into production :)
For now, my biggest clients fit into a reasonable amount of memory. I'm processing most of my clients in a very parallel, low-memory-per-executor job, and the biggest ones in a not-so-parallel, high-memory-per-executor job. I decide which to use based on a pre-count I perform.

python array initialisation (preallocation) with nans

I want to initialise an array that will hold some data. I have created a random matrix (using np.empty) and then multiplied it by np.nan. Is there anything wrong with that? Or is there a better practice that I should stick to?
To further explain my situation: I have data I need to store in an array. Say I have 8 rows of data. The number of elements in each row is not equal, so my matrix row length needs to be as long as the longest row. In the other rows, some elements will not be filled. I don't want to use zeros, since some of my data might actually be zeros.
I realise I could use some value I know my data will never take, but NaNs are definitely clearer. I'm just wondering whether that can cause any issues later during processing. I realise I need to use nanmax instead of max, and so on.
I have created a random matrix (using np.empty) and then multiplied it by np.nan. Is there anything wrong with that? Or is there a better practice that I should stick to?
You can use np.full, for example:
np.full((100, 100), np.nan)
However, depending on your needs, you could have a look at numpy.ma for masked arrays or scipy.sparse for sparse matrices. They may or may not be suitable. Either way, you may need to use different functions from the corresponding module instead of the normal numpy ufuncs.
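If the variable-length rows are the main concern, a masked array lets you hide the padding instead of relying on NaN-aware functions everywhere; a minimal sketch with made-up sizes:
import numpy as np
import numpy.ma as ma

# 8 rows padded out to the longest row; the padded positions stay masked
data = np.full((8, 10), np.nan)
data[0, :3] = [0.0, 1.5, -2.0]        # a row that legitimately contains a zero
masked = ma.masked_invalid(data)

print(masked.max())                    # ignores the masked padding, no nanmax needed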
A way I like to do it, which probably isn't the best but is easy to remember, is adding a 'nans' function to the numpy module, like this:
import numpy as np

def nans(n):
    return np.array([np.nan for i in range(n)])

setattr(np, 'nans', nans)
and now you can simply use np.nans as if it were np.zeros:
np.nans(10)

Efficiently Removing Very-Near-Duplicates From Python List

Background: My Python program handles relatively large quantities of data, which can be generated in-program or imported. The data is then processed, and during one of these processes it is deliberately copied, manipulated, cleaned of duplicates, and then returned to the program for further use. The data I'm handling is very precise (up to 16 decimal places), and maintaining this accuracy to at least 14 dp is vital. However, mathematical operations can of course introduce slight variations in my floats, such that two values are identical to 14 dp but may vary ever so slightly at 16 dp, which means the built-in set() function doesn't correctly remove such 'duplicates' (I used this method to prototype the idea, but it's not satisfactory for the finished program). I should also point out that I may well be overlooking something simple! I am just interested to see what others come up with :)
Question: What is the most efficient way to remove very-near-duplicates from a potentially very large data set?
My attempts: I have tried rounding the values themselves to 14 dp, but this is of course not satisfactory, as it leads to larger errors down the line. I have a potential solution to this problem, but I am not convinced it is as efficient or 'pythonic' as possible. My attempt involves finding the indices of list entries that match to x dp and then removing one of the matching entries. Thank you in advance for any advice! Please let me know if there's anything you wish to have clarified, or of course if I'm overlooking something very simple (I may be at a point where I'm over-thinking it).
Clarification on 'duplicates': An example of one of my 'duplicate' entries: 603.73066958946424, 603.73066958946460; the solution should remove one of these values.
Note on decimal.Decimal: This could work if it were guaranteed that all imported data did not already contain near-duplicates (which it often does).
You really want to use NumPy if you're handling large quantities of data. Here's how I would do it:
Import NumPy :
import numpy as np
Generate 8000 high-precision floats (128 bits will be enough for your purposes; note that I'm converting the 64-bit output of random to 128 bits just to fake it. Use your real data here):
a = np.float128(np.random.random((8000,)))
Find the indices of the unique elements in the rounded array:
_, unique = np.unique(a.round(decimals=14), return_index=True)
And take those indices from the original (non-rounded) array:
no_duplicates = a[unique]
Why don't you create a dict that maps the 14 dp values to the corresponding full 16 dp values:
import collections

d = collections.defaultdict(list)
for x in l:
    d[round(x, 14)].append(x)
Now if you just want "unique" (by your definition) values, you can do
unique = [v[0] for v in d.values()]
