I want to access and do some basic operations on the columns of a huge NumPy array: dense (not sparse), full of floats, with shape 2,000,000 x 800.
So basically:
import numpy as np
a = np.ones(2000000) # 2 million elements
%timeit np.log10(a)
10 loops, best of 3: 28.8 ms per loop
But if I create a bigger array with 800 columns, accessing a single column becomes ridiculously slow:
a = np.ones((2000000, 800)) # takes ~12 GB of RAM
b = a[:, 0].copy() # 6 mins
%timeit np.log10(b)
10 loops, best of 3: 31.4 ms per loop
So my question is: how do you handle this situation? Are there better objects for storing such a big matrix and doing simple operations like log10, plots, etc.? Or do I need to work around it, for example by creating 800 np.arrays of shape (2000000,)? I've skimmed threads about PyTables and HDF5, but I don't know whether those solutions are appropriate for my problem, which looks very simple at first sight...
Thanks a lot!
Edit: I am on Windows XP 64-bit, and the dtype of the array is float64.
I have tested this column access time with several shapes, and something strange came up:
a = np.ones((1000000, 800), dtype="float64")
%timeit np.log10(a[:, 0])
1 loops, best of 3: 150 ms per loop
and
a = np.ones((1100000, 800), dtype="float64") # just a few more rows
b = np.log10(a[:, 0]) # 6 min 20 sec
There is a threshold in access time at around 1,050,000 rows, so I suspect the problem is related to the maximum size of a single memory allocation on a 64-bit machine?
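For what it's worth, the "create 800 separate arrays" idea above amounts to making each column contiguous in memory; a Fortran-ordered (column-major) array achieves the same layout in one object. A minimal sketch on a toy-sized array, as an illustration of the layout effect rather than a definitive fix:
import numpy as np
# Toy shape so the sketch runs quickly; the real array is (2000000, 800) float64.
a_c = np.ones((10000, 800))            # default C (row-major) order: rows are contiguous
a_f = np.asfortranarray(a_c)           # Fortran (column-major) order: columns are contiguous
# A column of the C-ordered array is a strided view (one element every 800*8 bytes),
# while a column of the Fortran-ordered array is a single contiguous block.
print(a_c[:, 0].flags['C_CONTIGUOUS']) # False
print(a_f[:, 0].flags['C_CONTIGUOUS']) # True
print(np.log10(a_f[:, 0])[:3])         # column-wise work now reads contiguous memory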
I have a NumPy array (users_to_remove) consisting of user IDs to remove (75,000 of them), and a pandas DataFrame (orders) from which I want to remove the rows that contain those IDs.
orders has about 35 million rows.
Here is how I currently proceed:
for i in users_to_remove:
    orders = orders[orders.user_id != i]
It's taking ages and still hasn't finished. I have 8 GB of RAM and a quad-core i5 at 3.2 GHz.
Is there an efficient way to do this with pandas, or should I use another language? Or is my computer just too slow for this?
Thank you
I think you need isin with boolean indexing:
orders = orders[~orders.user_id.isin(users_to_remove)]
Timings are similar (but I have only 4 GB of RAM and an i5 at 2.5 GHz, on Windows 7):
np.random.seed(100)
N = 35000000
users_to_remove = np.random.randint(75000, size=N)
orders = pd.DataFrame({'user_id':np.random.randint(100000, size=N)})
print (orders.head())
In [54]: %timeit (orders[~orders.user_id.isin(users_to_remove)])
1 loop, best of 3: 16.9 s per loop
In [55]: %timeit (orders[~np.in1d(orders.user_id.values, users_to_remove)])
1 loop, best of 3: 14.6 s per loop
When sampling randomly from distributions of varying sizes I was surprised to observe that execution time seems to scale mostly with the size of the dataset being sampled from, not the number of values being sampled. Example:
import pandas as pd
import numpy as np
import time as tm
#generate a small and a large dataset
testSeriesSmall = pd.Series(np.random.randn(10000))
testSeriesLarge = pd.Series(np.random.randn(10000000))
sampleSize = 10
tStart = tm.time()
currSample = testSeriesLarge.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesLarge), (tm.time() - tStart)))
tStart = tm.time()
currSample = testSeriesSmall.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesSmall), (tm.time() - tStart)))
sampleSize = 1000
tStart = tm.time()
currSample = testSeriesLarge.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesLarge), (tm.time() - tStart)))
tStart = tm.time()
currSample = testSeriesSmall.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesSmall), (tm.time() - tStart)))
The output is:
sample 10 from 10000000 values: 1.10504 s
sample 10 from 10000 values: 0.00126 s
sample 1000 from 10000000 values: 1.15000 s
sample 1000 from 10000 values: 0.00122 s
This seems counterintuitive. Maybe I'm dense, but the problem seems similar to generating a list of random indices, and I would have expected the number of values sampled to matter and the size of the dataset not to matter much. I've tried another implementation or two with similar results and it's beginning to feel like I'm just missing a fundamental issue, though.
My questions are twofold: (1) Is this a fundamental issue or a quirk of implementation in pandas? (2) Is there a significantly faster approach that can be taken to randomly sample from large datasets in this way?
pandas.Series.sample() in your case boils down to this:
rs = np.random.RandomState()
locs = rs.choice(axis_length, size=n, replace=False)
return self.take(locs)
The slow part is rs.choice():
%timeit rs.choice(100000000, size=1, replace=False)
1 loop, best of 3: 9.43 s per loop
It takes about 10 seconds to generate a single random number! If you divide the first argument by 10, it takes about 1 second. That's slow!
If you use replace=True it's super fast. So that's one workaround for you, if you don't mind having duplicate entries in your results.
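For illustration, the replace=True workaround applied to the Series from the question might look like this (a sketch; testSeriesLarge and sampleSize are the names used in the question):
# Sampling with replacement avoids the slow no-replacement path; duplicates are possible.
currSample = testSeriesLarge.sample(n=sampleSize, replace=True).values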
The NumPy documentation for choice(replace=False) says:
This is equivalent to np.random.permutation(np.arange(5))[:3]
Which pretty much explains the problem: it generates a huge array of possible values, shuffles them, then takes the first N. This is the root cause of your performance problem, and it has already been reported as an issue in NumPy here: https://github.com/numpy/numpy/pull/5158
It is apparently difficult to fix in NumPy, because people rely on the result of choice() not changing (between versions of NumPy) when using the same random seed value.
Since your use case is quite narrow, you can do something like this:
def sample(series, n):
    locs = np.random.randint(0, len(series), n*2)
    locs = np.unique(locs)[:n]
    assert len(locs) == n, "sample() assumes n << len(series)"
    return series.take(locs)
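Plugging it into the timing harness from the question would look something like this (a sketch using the names defined there):
tStart = tm.time()
currSample = sample(testSeriesLarge, sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesLarge), (tm.time() - tStart)))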
That gives much faster times:
sample 10 from 10000 values: 0.00735 s
sample 10 from 1000000 values: 0.00944 s
sample 10 from 100000000 values: 1.44148 s
sample 1000 from 10000 values: 0.00319 s
sample 1000 from 1000000 values: 0.00802 s
sample 1000 from 100000000 values: 0.01989 s
sample 100000 from 1000000 values: 0.05178 s
sample 100000 from 100000000 values: 0.93336 s
This looks to be an internal numpy issue. I believe the pandas sample method calls numpy.random.choice. Let's take a look at how numpy performs with various array sizes and sample sizes.
Create arrays
large = np.arange(1000000)
small = np.arange(1000)
Time the sample without replacement
%timeit np.random.choice(large, 10, replace=False)
10 loops, best of 3: 27.4 ms per loop
%timeit np.random.choice(small, 10, replace=False)
10000 loops, best of 3: 41.4 µs per loop
Time the sample with replacement
%timeit np.random.choice(large, 10, replace=True)
100000 loops, best of 3: 11.7 µs per loop
%timeit np.random.choice(small, 10, replace=True)
100000 loops, best of 3: 12.2 µs per loop
Very interestingly, when sampling without replacement, the large array takes nearly three orders of magnitude longer, and it is exactly three orders of magnitude larger. This suggests to me that NumPy is randomly shuffling the array and then taking the first 10 items.
When sampling with replacement, each value is independently chosen so the timings are identical.
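To make that concrete, here is roughly what the documented equivalence for choice(replace=False) looks like in code (a sketch; the actual implementation may differ in detail):
import numpy as np
large = np.arange(1000000)
# Without replacement: conceptually permute every candidate, then keep the first 10.
# The cost is dominated by the size of `large`, not by the 10 values we keep.
without_repl = np.random.permutation(large)[:10]
# With replacement: draw 10 independent indices; the cost depends only on the 10 draws.
with_repl = large[np.random.randint(0, len(large), size=10)]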
To speed up functions like np.std and np.sum along an axis of a huge n-dimensional NumPy array, it is recommended to apply them along the last axis.
So I use np.transpose to rotate the axis I want to operate on into the last position. Is this really reshuffling the data in memory, or just changing the way the axes are addressed?
When I tried to measure the time using %timeit, the transpose took only microseconds, much less than the time needed to copy the (112, 1024, 1024) array I had.
If it is not actually reordering the data in memory and only changing the addressing, will it still speed up np.sum or np.std when they are applied along the newly rotated last axis?
When I tried to measure it, it did seem to speed things up, but I don't understand how.
Update
It doesn't really speed up with a transpose. The fastest axis is the last one when the array is C-ordered, and the first one when it is Fortran-ordered. So there is no point in transposing before applying np.sum or np.std.
For my specific code, I solved the issue by passing order='FORTRAN' during array creation, which made the first axis the fastest.
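A minimal sketch of that workaround, using the shorter order='F' spelling and the shape from above:
import numpy as np
# Allocate the array in Fortran (column-major) order so axis 0 is contiguous in memory.
a = np.ones((112, 1024, 1024), dtype=np.float64, order='F')
# Reductions over the contiguous first axis now read memory sequentially.
s = a.sum(axis=0)
sd = a.std(axis=0)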
Thanks for all the answers.
Transpose just changes the strides; it doesn't touch the actual data. I think the reason why sum etc. along the final axis is recommended (I'd like to see the source for that, by the way) is that when an array is C-ordered, walking along the final axis preserves locality of reference. That won't be the case after you transpose, since the transposed array will be Fortran-ordered.
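A quick sketch to confirm this: the transposed array shares the same memory as the original, and only the strides are swapped:
import numpy as np
a = np.random.randn(500, 400)   # C-ordered float64: strides are (400*8, 8) bytes
b = a.T                         # transpose: no data is moved
print(a.strides, b.strides)     # (3200, 8) and (8, 3200) -- just swapped
print(np.shares_memory(a, b))   # True: same buffer, different addressing
print(b.flags['F_CONTIGUOUS'])  # True: the transpose is Fortran-ordered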
To elaborate on larsman's answer, here are some timings:
# normal C (row-major) order array
>>> %%timeit a = np.random.randn(500, 400)
>>> np.sum(a, axis=1)
1000 loops, best of 3: 272 us per loop
# transposing and summing along the first axis makes no real difference
# to performance
>>> %%timeit a = np.random.randn(500, 400)
>>> np.sum(a.T, axis=0)
1000 loops, best of 3: 269 us per loop
# however, converting to Fortran (column-major) order does improve speed...
>>> %%timeit a = np.asfortranarray(np.random.randn(500,400))
>>> np.sum(a, axis=1)
10000 loops, best of 3: 114 us per loop
# ... but only if you don't count the conversion in the timed operations
>>> %%timeit a = np.random.randn(500, 400)
>>> np.sum(np.asfortranarray(a), axis=1)
1000 loops, best of 3: 599 us per loop
In summary, it might make sense to convert your arrays to Fortran order if you're going to apply a lot of operations over the columns, but the conversion itself is costly and almost certainly not worth it for a single operation.
You can force a physical reordering with np.ascontiguousarray. For example...
a = np.ascontiguousarray(a.transpose())
In my code I am currently iterating and creating three lists:
data, row, col
There is a high repetition of (row, col) pairs, and in my final sparse matrix M I would like the value of M[row, col] to be the sum of all the corresponding elements in data. From reading the documentation, the coo_matrix format seems perfect and for small examples it works just fine.
The problem I am having is that when I scale up my problem size, it looks like the intermediate lists data, row, col use up all of my 8 GB of memory plus the swap space, and my script gets killed automatically.
So my question is:
Is there an appropriate format or an efficient way of incrementally building my summed matrix so I don't have to store the full intermediate lists / numpy arrays?
My program loops over a grid, creating local_data, local_row, local_col lists at each point, the elements of which are then appended to data, row, col, so being able to update the sparse matrix with lists as per the sparse matrix constructors would be the ideal case.
There are two things that may be killing you: the duplicates or the overhead of a list over an array. In either case, probably the right thing to do is to grow your list only so big before dumping it into a coo_matrix and adding it to your total. I took a couple of timings:
import numpy as np
import scipy.sparse as sps

rows = list(np.random.randint(100, size=(10000,)))
cols = list(np.random.randint(100, size=(10000,)))
values = list(np.random.rand(10000))
%timeit sps.coo_matrix((values, (rows, cols)))
100 loops, best of 3: 4.03 ms per loop
%timeit (sps.coo_matrix((values[:5000], (rows[:5000], cols[:5000]))) +
sps.coo_matrix((values[5000:], (rows[5000:], cols[5000:]))))
100 loops, best of 3: 5.24 ms per loop
%timeit sps.coo_matrix((values[:5000], (rows[:5000], cols[:5000])))
100 loops, best of 3: 2.16 ms per loop
So there is about a 25% overhead in splitting the lists in two, converting each to a coo_matrix and then adding them together. And it doesn't seem to be as bad if you do more splits:
%timeit (sps.coo_matrix((values[:2500], (rows[:2500], cols[:2500]))) +
sps.coo_matrix((values[2500:5000], (rows[2500:5000], cols[2500:5000]))) +
sps.coo_matrix((values[5000:7500], (rows[5000:7500], cols[5000:7500]))) +
sps.coo_matrix((values[7500:], (rows[7500:], cols[7500:]))))
100 loops, best of 3: 5.76 ms per loop
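A sketch of the "grow a chunk, then fold it into the total" pattern described above (the shapes, chunk size, and the flush helper are illustrative, not from your code; duplicate (row, col) entries get summed when the sparse matrices are built and added):
import numpy as np
import scipy.sparse as sps

n_rows, n_cols = 1000, 1000
chunk_limit = 100000                          # flush once the lists reach this many entries
total = sps.coo_matrix((n_rows, n_cols))      # running total, starts empty
data, row, col = [], [], []

def flush(total, data, row, col):
    # Fold the current chunk into the running total; duplicates are summed.
    chunk = sps.coo_matrix((data, (row, col)), shape=(n_rows, n_cols))
    return total + chunk

for _ in range(10):                           # stand-in for the loop over the grid
    local_row = np.random.randint(n_rows, size=50000)
    local_col = np.random.randint(n_cols, size=50000)
    local_data = np.random.rand(50000)
    row.extend(local_row); col.extend(local_col); data.extend(local_data)
    if len(data) >= chunk_limit:
        total = flush(total, data, row, col)
        data, row, col = [], [], []

if data:                                      # fold in whatever is left over
    total = flush(total, data, row, col)
total = total.tocsr()                         # CSR is usually more convenient afterwards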
I am trying to remove rows or columns from an image represented by a Numpy array. My image is of type uint16 and 2560 x 2176. As an example, say I want to remove the first 16 columns to make it 2560 x 2160.
I'm a MATLAB-to-NumPy convert, and in MATLAB I would use something like:
A = rand(2560, 2176);
A(:, 1:16) = [];
As I understand, this deletes the columns in place and saves a lot of time by not copying to a new array.
For Numpy, previous posts have used commands like numpy.delete. However, the documentation is clear that this returns a copy, and so I must reassign the copy to A. This seems like it would waste a lot of time copying.
import numpy as np
A = np.random.rand(2560, 2176)
A = np.delete(A, np.r_[:16], 1)
Is this truly as fast as an in-place deletion? I feel I must be missing a better method, or not understanding how Python handles array storage during deletion.
Relevant previous posts:
Removing rows in NumPy efficiently
Documentation for numpy.delete
Why not just do a slice? Here I'm removing the first 3000 columns instead of 16 to make the memory usage more clear:
import numpy as np
a = np.empty((5000, 5000))
a = a[:, 3000:]
This reduces the reported size of the array, as can be seen:
In [31]: a = np.zeros((5000, 5000), dtype='d')
In [32]: whos
Variable Type Data/Info
-------------------------------
a ndarray 5000x5000: 25000000 elems, type `float64`, 200000000 bytes (190 Mb)
In [33]: a = a[:, 3000:]
In [34]: whos
Variable Type Data/Info
-------------------------------
a ndarray 5000x2000: 10000000 elems, type `float64`, 80000000 bytes (76 Mb)
For this size of array a slice seems to be about 10,000x faster than your delete option:
%timeit a=np.empty((5000,5000), dtype='d'); a=np.delete(a, np.r_[:3000], 1)
1 loops, best of 3: 404 ms per loop
%timeit a=np.empty((5000,5000), dtype='d'); a=a[:, 3000:]
10000 loops, best of 3: 39.3 us per loop
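One caveat on the slicing approach (a short sketch of standard NumPy view semantics): the slice is a view, so the full 5000x5000 buffer stays allocated as long as the view is alive; an explicit copy is needed if you actually want to shrink the memory footprint:
import numpy as np
a = np.zeros((5000, 5000))
b = a[:, 3000:]                 # a view: no data is copied, so the slice itself is essentially free
print(b.base is a)              # True -- b still keeps the full 5000x5000 buffer alive
print(np.shares_memory(a, b))   # True
b = b.copy()                    # materialise only the 5000x2000 block
print(b.base is None)           # True -- the big buffer can be freed once `a` is also dropped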