Parsing a very large array with list comprehension is slow - python

I have a very large list of numerical values in numpy.float64 format, and I want to convert each value to 0.0 if it is inf, and parse the remaining elements to plain float.
This is my code, which works perfectly:
import numpy as np

# Values in numpy.float64 format.
original_values = [np.float64("Inf"), np.float64(0.02345), np.float64(0.2334)]
# Convert them
parsed_values = [0.0 if x == float("inf") else float(x) for x in original_values]
But this is slow. Is there any way to speed this code up, perhaps with some magic from map or numpy (I have no experience with those libraries)?

Hey, you are probably asking how you could do this faster with numpy; the quick answer is to turn the list into a numpy array and do it the numpy way:
import numpy as np
original_values = [np.float64("Inf"), ..., np.float64(0.2334)]
arr = np.array(original_values)
arr[arr == np.inf] = 0
where arr == np.inf returns a boolean array that looks like array([ True, ..., False]) and can be used to select the matching elements of arr, as shown above.
Hope it helps.
I tested a bit, and it should be fast enough:
import timeit
import numpy as np

# Create a huge array
arr = np.random.random(1000000000)
idx = np.random.randint(0, high=1000000000, size=1000000)
arr[idx] = np.inf

# Time the replacement
def replace_inf_with_0(arr=arr):
    arr[arr == np.inf] = 0

timeit.Timer(replace_inf_with_0).timeit(number=1)
The output says it takes 1.5 seconds to turn all 1,000,000 infs into 0s in a 1,000,000,000-element array.
@Avión used arr.tolist() in the end to convert it back to a list for MongoDB, which is probably the common way. I tried it with the billion-element array, and the conversion took about 30 seconds, while creating the billion-element array took less than 10 seconds. So, feel free to recommend more efficient methods.
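For the original example, the whole round trip (build the array, replace inf in place, and convert back to a plain list as mentioned above) is only a few lines. A minimal sketch:
import numpy as np

original_values = [np.float64("Inf"), np.float64(0.02345), np.float64(0.2334)]

arr = np.array(original_values)   # list of np.float64 -> 1-D float64 array
arr[arr == np.inf] = 0.0          # replace every inf with 0.0 in place
parsed_values = arr.tolist()      # back to a list of plain Python floats

print(parsed_values)              # [0.0, 0.02345, 0.2334]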

Related

Calculate partitioned sum efficiently with CuPy or NumPy

I have a very long array* of length L (let's call it values) that I want to sum over, and a sorted 1D array of the same length L that contains N integers with which to partition the original array – let's call this array labels.
What I'm currently doing is this (module being cupy or numpy):
result = module.empty(N)
for i in range(N):
    result[i] = values[labels == i].sum()
But this can't be the most efficient way of doing it (it should be possible to get rid of the for loop, but how?). Since labels is sorted, I could easily determine the break points and use those indices as start/stop points, but I don't see how this solves the for loop problem.
Note that I would like to avoid creating an array of size NxL along the way, if possible, since L is very large.
I'm working in cupy, but any numpy solution is welcome too and could probably be ported. Within cupy, it seems this would be a case for a ReductionKernel, but I don't quite see how to do it.
* in my case, values is 1D, but I assume the solution wouldn't depend on this
You are describing a groupby sum aggregation. You could write a CuPy RawKernel for this, but it would be much easier to use the existing groupby aggregations implemented in cuDF, the GPU dataframe library. They can interoperate without requiring you to copy the data. If you call .values on the resulting cuDF Series, it will give you a CuPy array.
If you went back to the CPU, you could do the same thing with pandas.
import cupy as cp
import pandas as pd

N = 100
values = cp.random.randint(0, N, 1000)
labels = cp.sort(cp.random.randint(0, N, 1000))
L = len(values)

result = cp.empty(N)
for i in range(N):
    result[i] = values[labels == i].sum()
result[:5]
array([547., 454., 402., 601., 668.])
import cudf
df = cudf.DataFrame({"values": values, "labels": labels})
df.groupby(["labels"])["values"].sum().values[:5]
array([547, 454, 402, 601, 668])
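For reference, the CPU equivalent with pandas mentioned above looks like this (a sketch; cp.asnumpy is used here to move the example data back to host memory first):
import cupy as cp
import pandas as pd

pdf = pd.DataFrame({"values": cp.asnumpy(values), "labels": cp.asnumpy(labels)})
pdf.groupby("labels")["values"].sum().to_numpy()[:5]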
Here is a solution which, instead of a N x L array, uses a N x <max partition size in labels> array (which should not be large, if the disparity between different partitions is not too high):
Reshape the data into a 2-D array with one partition per row. Since the row length equals the size of the largest partition, fill the unavailable values with zeros (this doesn't affect any sum). This uses @Divakar's solution given here.
import numpy as np

def jagged_to_regular(a, parts):
    lens = np.ediff1d(parts, to_begin=parts[0])
    mask = lens[:, None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=a.dtype)
    out[mask] = a
    return out
parts_stack = jagged_to_regular(values, labels)
Sum along axis 1:
result = np.sum(parts_stack, axis = 1)
In case you'd like a CuPy implementation, there's no direct CuPy alternative to numpy.ediff1d in jagged_to_regular. In that case, you can substitute the statement with numpy.diff like so:
lens = np.insert(np.diff(parts), 0, parts[0])
and then continue to use CuPy as a drop-in replacement for numpy.
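As a rough sketch, the CuPy version of the helper could then look like this (cp.concatenate is used here to prepend parts[0], which is equivalent to the np.insert line above):
import cupy as cp

def jagged_to_regular_cupy(a, parts):
    # group lengths: first boundary, then consecutive differences
    lens = cp.concatenate((parts[:1], cp.diff(parts)))
    mask = lens[:, None] > cp.arange(int(lens.max()))
    out = cp.zeros(mask.shape, dtype=a.dtype)
    out[mask] = a
    return out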

Numpy: sorting matrix like list of lists, or: sort matrix by rows

I have a numpy matrix that contains row vectors. I want to sort the matrix by its rows, like Python would sort a list of lists:
import numpy as np
def sortx(a):
    return np.array(sorted([list(i) for i in a]))

a = np.array([[1,4,0,2],[0,2,3,1]])
print(sortx(a))
Output:
[[0 2 3 1]
[1 4 0 2]]
Is there a numpy equivalent of my sortx() function so I don't have to convert the data twice?
You can try to use numpy's lexsort:
a=a[np.lexsort(a[:,::-1].T)]
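For example, applied to the array from the question, this reproduces the output of sortx():
a = np.array([[1,4,0,2],[0,2,3,1]])
print(a[np.lexsort(a[:,::-1].T)])
# [[0 2 3 1]
#  [1 4 0 2]]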
On my machine this was about four times faster than your sortx method when applied to a 4x4 matrix. On a matrix with 100 rows, the speed difference is even more significant.
arr = np.random.randint(0, 100, (100, 4))
%timeit np.lexsort(arr[:,::-1].T)
# 6.29 µs ± 27.1 ns
%timeit sortx(arr)
# 112 µs ± 1.2 µs
Edit:
Andyk suggested an improved version of the sortx() method.
def sortx_andyk(a):
    return np.array(sorted(a.tolist()))
Timing of this method:
%timeit sortx_andyk(arr)
# 43 µs ± 169 ns
You can use np.sort(arr, axis=0)
In your case
import numpy as np
a = np.array([[1,4,0,2],[0,2,3,1]])
np.sort(a, axis=0)
Edit
I misunderstood the question. Even though I do not have an exact answer for your question, you might be able to use argsort. This returns the indices that would sort your array. However, it only does this along a single axis. It is possible to use this to sort your array based on a specific column, e.g. the first. Then you would use it as such:
a = a[a.argsort(axis=0)[:, 0]]
where [:, 0] specifies the column by which to sort, i.e. [:, n] will sort on the n-th column.
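Applied to the array from the question, sorting on the first column happens to give the same result as a full row-wise sort (note that ties in the first column would not be broken by later columns):
a = np.array([[1,4,0,2],[0,2,3,1]])
print(a[a.argsort(axis=0)[:, 0]])
# [[0 2 3 1]
#  [1 4 0 2]]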

Vectorization - how to append to an array without a for loop

I have the following code:
import numpy as np

x = range(100)
M = len(x)
sample = np.zeros((M, 41632))
for i in range(M):
    lista = np.load('sample' + str(i) + '.npy')
    for j in range(41632):
        sample[i, j] = np.array(lista[j])
    print(i)
to create an array made of sample_i numpy arrays.
sample0, sample1, sample2, etc. are numpy arrays and my expected output is an Mx41632 array like this:
sample = [[sample0],[sample1],[sample2],...]
How can I make this operation more compact and faster, without the for loop? M can also reach 1 million.
Or, how can I append my sample array if the starting point is, for example, 1000 instead of 0?
Thanks in advance
Initial load
You can make your code a lot faster by avoiding the inner loop and not initialising sample to zeros.
import numpy as np

x = range(100)
M = len(x)
sample = np.empty((M, 41632))
for i in range(M):
    sample[i, :] = np.load('sample' + str(i) + '.npy')
In my tests this took the reading code from 3 seconds to 60 milliseconds!
Adding rows
In general it is very slow to change the size of a numpy array. You can append a row once you have loaded the data in this way:
sample = np.insert(sample, len(sample), newrow, axis=0)
but this is almost never what you want to do, because it is so slow.
Better storage: HDF5
Also if M is very large you will probably start running out of memory.
I recommend that you have a look at PyTables which will allow you to store your sample results in one HDF5 file and manipulate the data without loading it into memory. This will in general be a lot faster than the .npy files you are using now.
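A minimal sketch of that idea with PyTables (the file name samples.h5 and the node name sample are illustrative, not from the question):
import numpy as np
import tables

M = 100  # number of sample files, as in the question

with tables.open_file('samples.h5', mode='w') as f:
    # extendable array: starts with 0 rows of width 41632 and grows along the first axis
    earr = f.create_earray(f.root, 'sample', atom=tables.Float64Atom(), shape=(0, 41632))
    for i in range(M):
        row = np.load('sample' + str(i) + '.npy')
        earr.append(row[None, :])  # append one row; the data stays on disk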
It is quite simple with numpy. Consider this example:
import numpy as np
l = [[1,2,3],[4,5,6],[7,8,9],[10,11,12]]
#create an array with 4 rows and 3 columns
arr = np.zeros([4,3])
arr[:,:] = l
You can also insert rows or columns separately:
#insert the first row
arr[0,:] = l[0]
You just have to make sure that the dimensions are the same.

Numpy incorrectly changes large floats to very small numbers when I append

I have an array of datetimes, times, and I'm appending them to an empty numpy array, but first I'm converting them to unix time. The conversion works fine, but when I add them to the array I'm getting crazy small values like e-310.
import time
import numpy as np

# times = [Array of datetimes]
time_unix = np.empty(len(times))
for t in times:
    temp_time = time.mktime(t.timetuple())
    np.append(time_unix, temp_time)
Results
For datetime: 2015-08-05 00:27:00
What time_unix[0] should be: 1438734420.0
What time_unix[0] actually is: 6.92520780368e-310
You shouldn't use np.append if you want to insert the values; np.append returns a new array rather than filling time_unix in place. What you are seeing is the content of an uninitialized np.empty cell, which could be anything.
Change your loop to:
for idx, t in enumerate(times):
    ...
    time_unix[idx] = temp_time
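Putting it together with the conversion from the question (times is the list of datetimes), a sketch of the full loop:
import time
import numpy as np

time_unix = np.empty(len(times))
for idx, t in enumerate(times):
    temp_time = time.mktime(t.timetuple())
    time_unix[idx] = temp_time  # write into the preallocated slot instead of calling np.append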
For iterative definition of arrays, start with a list, and append values to it. Make the array at the end. np.append consistently gives beginners problems, and should be banned.
In [393]: times =[]
In [394]: for i in range(3):
     ...:     times.append('2015-08-%02d 00:27:00' % (i+5))
     ...:
In [395]: times
Out[395]: ['2015-08-05 00:27:00', '2015-08-06 00:27:00', '2015-08-07 00:27:00']
In [396]: dates = np.array(times, np.datetime64)
In [397]: dates
Out[397]: array(['2015-08-05T00:27:00', '2015-08-06T00:27:00', '2015-08-07T00:27:00'], dtype='datetime64[s]')
Look into the use of np.datetime64. It makes array manipulation of data much easier.
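For example, since dates has dtype datetime64[s], getting unix-style timestamps is just a cast (a sketch; datetime64 is timezone-naive, so the values are interpreted as UTC and may differ from time.mktime, which uses local time):
time_unix = dates.astype('int64')  # seconds since the epoch
# array([1438734420, 1438820820, 1438907220])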

Simple question: In numpy how do you make a multidimensional array of arrays?

Right, perhaps I should be using the normal Python lists for this, but here goes:
I want a 9 by 4 multidimensional array/matrix (whatever really) that I want to store arrays in. These arrays will be 1-dimensional and of length 4096.
So, I want to be able to go something like
column = 0 #column to insert into
row = 7 #row to insert into
storageMatrix[column,row][0] = NEW_VALUE
storageMatrix[column,row][4092] = NEW_VALUE_2
etc..
I appreciate I could be doing something a bit silly/unnecessary here, but it will make it a lot easier for me to have it structured like this in my code (as there are a lot of these, and a lot of analysis to be done later).
Thanks!
Note that to leverage the full power of numpy, you'd be much better off with a 3-dimensional numpy array. Breaking apart the 3-d array into a 2-d array with 1-d values
may complicate your code and force you to use loops instead of built-in numpy functions.
It may be worth investing the time to refactor your code to use the superior 3-d numpy arrays.
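A minimal sketch of that 3-d layout for the sizes in the question (4 columns by 9 rows, each cell a length-4096 vector), mirroring the column/row indexing used below:
import numpy as np

storageMatrix = np.zeros((4, 9, 4096))

column = 0   # column to insert into
row = 7      # row to insert into
storageMatrix[column, row, 0] = 1.0       # NEW_VALUE
storageMatrix[column, row, 4092] = 2.0    # NEW_VALUE_2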
However, if that's not an option, then:
import numpy as np
storageMatrix=np.empty((4,9),dtype='object')
By setting the dtype to 'object', we are telling numpy to allow each element of storageMatrix to be an arbitrary Python object.
Now you must initialize each element of the numpy array to be an 1-d numpy array:
storageMatrix[column,row]=np.arange(4096)
And then you can access the array elements like this:
storageMatrix[column,row][0] = 1
storageMatrix[column,row][4092] = 2
The Tentative NumPy Tutorial says you can declare a 2D array using the comma operator:
x = ones( (3,4) )
and index into a 2D array like this:
>>> x[1,2] = 20
>>> x[1,:] # x's second row
array([ 1, 1, 20, 1])
>>> a = array([10, 20, -7, -3])
>>> x[0] = a # change first row of x
>>> x
array([[10, 20, -7, -3],
[ 1, 1, 20, 1],
[ 1, 1, 1, 1]])
