Delete millions of pandas rows, w.r.t values in numpy array - python

I have a numpy array (users_to_remove) consisting of user IDs to remove (75,000 IDs in that array), and a pandas DataFrame (orders) from which I want to remove the rows that contain those IDs.
orders has about 35 million rows.
Here is how I currently proceed:
for i in users_to_remove:
    orders = orders[orders.user_id != i]
It's taking ages and still hasn't finished. I have 8 GB of RAM and a quad-core i5 at 3.2 GHz.
Is there an efficient way to do this with pandas, or should I use another language? Or is my computer just too slow for this?
Thank you

I think you need isin with boolean indexing:
orders = orders[~orders.user_id.isin(users_to_remove)]
Timings are similar (but I have only 4 GB RAM and an i5 at 2.5 GHz, on Win7):
import numpy as np
import pandas as pd

np.random.seed(100)
N = 35000000
users_to_remove = np.random.randint(75000, size=N)
orders = pd.DataFrame({'user_id': np.random.randint(100000, size=N)})
print(orders.head())
In [54]: %timeit (orders[~orders.user_id.isin(users_to_remove)])
1 loop, best of 3: 16.9 s per loop
In [55]: %timeit (orders[~np.in1d(orders.user_id.values, users_to_remove)])
1 loop, best of 3: 14.6 s per loop

Related

Time access of elements in huge numpy array

I want to access and do some basic operations on the columns of a huge numpy array: non-sparse, full of floats, with shape 2 million x 800.
So basically:
import numpy as np
a = np.ones(2000000) # 2 million elements
%timeit np.log10(a)
10 loops, best of 3: 28.8 ms per loop
but if I create a bigger array with 800 columns, the access time to a column is ridiculously high:
a = np.ones((2000000, 800)) # takes 12 GB of RAM
b = a[:, 0].copy() # 6 mins
%timeit np.log10(b)
10 loops, best of 3: 31.4 ms per loop
so my question is: how do you handle this situation? Are there better objects for storing such a big matrix and doing simple operations like log10, plots, etc.? Or do I need to work around it, e.g. by creating 800 np.arrays of shape (2000000,)? I've quickly read threads about PyTables and HDF, but I don't know whether those solutions are appropriate for my problem, which looks very simple at first sight...
Thanks a lot!
edit: I am on XP 64-bit, and the dtype of the array is float64.
I have tested this column access time with several shapes, and something strange came up:
a = np.ones((1000000, 800), dtype="float64")
%timeit np.log10(a[:, 0])
1 loops, best of 3: 150 ms per loop
and
a = np.ones((1100000, 800), dtype="float64") # just a few more rows
b = np.log10(a[:, 0]) # 6 min 20 sec
There is a threshold in the access time around 1,050,000 rows, so I suspect the problem is related to the maximum size of a single memory allocation on a 64-bit machine?
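Not an answer from the thread, but for context: a 2,000,000 x 800 float64 array is about 12.8 GB, so if that exceeds physical RAM, every column slice of a C-ordered array walks across the whole swapped-out matrix. Two ways to keep column access local are a column-major (Fortran-order) layout or a memory-mapped file; a rough sketch, with a deliberately reduced shape so it actually runs on a small machine:
import numpy as np

# Hypothetical sketch: make column access touch only contiguous memory.
n_rows, n_cols = 20000, 800   # reduced from the question's 2,000,000 x 800

# 1) Fortran (column-major) order: a[:, j] is one contiguous block.
a_f = np.ones((n_rows, n_cols), order='F', dtype='float64')
col = a_f[:, 0]               # contiguous view, no copy
np.log10(col)                 # only reads that column's memory

# 2) Memory-mapped file: the OS pages in only the parts that are accessed.
mm = np.memmap('big_matrix.dat', dtype='float64', mode='w+',
               shape=(n_rows, n_cols), order='F')
mm[:, 0] = 10.0               # write one column
np.log10(mm[:, 0])            # reads back only that column's pages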

Approach to speed up pandas multilevel index selection?

I have a data frame, and would like to work on a small partition each time for particular tuples of values of 'a', 'b','c'.
df = pd.DataFrame({'a': np.random.randint(0, 10, 10000),
                   'b': np.random.randint(0, 10, 10000),
                   'c': np.random.randint(0, 10, 10000),
                   'value': np.random.randint(0, 100, 10000)})
so I chose to use pandas multiindex:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
However, the performance is not great.
%timeit dfi.ix[(2,1,7)]  # 511 us
%timeit df[(df['a'].values == 2) &
           (df['b'].values == 1) & (df['c'].values == 7)]  # 247 us
I suspect there are some overheads somewhere. My program has ~1k tuples, so it takes 511 * 1000 = 0.5s for one run. How can I improve further?
update:
hmm, I forgot to mention that the number of tuples is less than the total Cartesian product of the distinct values of 'a', 'b', 'c' in df. Wouldn't groupby do an excess amount of work on index combinations that don't exist in my tuples?
It's not clear what 'work on' means, but I would do something like this; the applied function can be almost any function:
In [33]: %timeit df.groupby(['a','b','c']).apply(lambda x: x.sum())
10 loops, best of 3: 83.6 ms per loop
certain operations are cythonized so very fast
In [34]: %timeit df.groupby(['a','b','c']).sum()
100 loops, best of 3: 2.65 ms per loop
Doing selection on a MultiIndex one index at a time is not efficient.
If you are operating on a very small subset of the total groups, then you might want to directly index into the multi-index; groupby wins if you are operating on a fraction (maybe 20%) of the groups or more. You might also want to investigate filter which you can use to pre-filter the groups based on some criteria.
As noted above, the cartesian product of the groups indexers is irrelevant. Only the actual groups will be iterated by groupby (think of a MultiIndex as a sparse representation of the total possible space).
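For reference, a minimal sketch (not from the original answer) of the pre-filtering idea with GroupBy.filter, using the df defined in the question and a made-up threshold:
# Drop entire (a, b, c) groups whose total 'value' is below a threshold,
# then aggregate only the groups that survive.
g = df.groupby(['a', 'b', 'c'])
subset = g.filter(lambda x: x['value'].sum() > 500)
result = subset.groupby(['a', 'b', 'c']).sum()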
How about:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
value = dfi["value"].values
value[dfi.index.get_loc((2, 1, 7))]
The result is an ndarray without an index.

Is there a performance difference between Numpy and Pandas?

I've written a bunch of code on the assumption that I was going to use Numpy arrays. Turns out the data I am getting is loaded through Pandas. I remember now that I loaded it in Pandas because I was having some problems loading it in Numpy. I believe the data was just too large.
Therefore I was wondering, is there a difference in computational ability when using Numpy vs Pandas?
If Pandas is more efficient, then I would rather rewrite all my code for Pandas, but if there is no efficiency gain then I'll just use numpy arrays...
There can be a significant performance difference, of an order of magnitude for multiplications and multiple orders of magnitude for indexing a few random values.
I was actually wondering about the same thing and came across this interesting comparison:
http://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/
I think it's more about using the two strategically and shifting data around (from numpy to pandas or vice versa) based on the performance you see. As a recent example, I was trying to concatenate 4 small pickle files with 10k rows each (data.shape -> (10000, 4)) using numpy.
Code was something like:
import glob
import numpy as np
import joblib

n_concat = np.empty((0, 4))
for file_path in glob.glob('data/0*', recursive=False):
    n_data = joblib.load(file_path)
    n_concat = np.vstack((n_concat, n_data))  # copies the whole stack on every iteration
joblib.dump(n_concat, 'data/save_file.pkl', compress=True)
This crashed my laptop (8 GB, i5) which was surprising since the volume wasn't really that huge. The 4 compressed pickled files were roughly around 5 MB each.
The same thing worked great with pandas.
import glob
import pandas as pd
import joblib

for file_path in glob.glob('data/0*', recursive=False):
    n_data = joblib.load(file_path)
    try:
        df = pd.concat([df, pd.DataFrame(n_data, columns=[...])])
    except NameError:  # first iteration: df does not exist yet
        df = pd.DataFrame(n_data, columns=[...])
joblib.dump(df, 'data/save_file.pkl', compress=True)
On the other hand, when I was implementing gradient descent by iterating over a pandas DataFrame, it was horribly slow, while using numpy for the job was much quicker.
In general, I've seen that pandas usually works better for moving around/munging moderately large chunks of data and doing common column operations, while numpy works best for vectorized and recursive (maybe more math-intensive) work over smaller sets of data.
Moving data between the two is hassle free, so I guess, using both strategically is the way to go.
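As a small illustration (not from the original answer) of how cheap that round trip is; to_numpy is assumed to be available (use df.values on older pandas versions):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 3), columns=['a', 'b', 'c'])

arr = df.to_numpy()                          # pandas -> numpy
arr = np.log1p(arr)                          # do the math-heavy part in numpy
df2 = pd.DataFrame(arr, columns=df.columns)  # numpy -> pandas for munging/IO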
In my experiments on large numeric data, Pandas is consistently 20 TIMES SLOWER than Numpy. This is a huge difference, given that only simple arithmetic operations were performed: slicing of a column, mean(), searchsorted() - see below. Initially, I thought Pandas was based on numpy, or at least its implementation was C optimized just like numpy's. These assumptions turn out to be false, though, given the huge performance gap.
In examples below, data is a pandas frame with 8M rows and 3 columns (int32, float32, float32), without NaN values, column #0 (time) is sorted. data_np was created as data.values.astype('float32'). Results on Python 3.8, Ubuntu:
A. Column slices and mean():
# Pandas
%%timeit
x = data.x
for k in range(100): x[100000:100001+k*100].mean()
15.8 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Numpy
%%timeit
for k in range(100): data_np[100000:100001+k*100,1].mean()
874 µs ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Pandas is 18 times slower than Numpy (15.8ms vs 0.874 ms).
B. Search in a sorted column:
# Pandas
%timeit data.time.searchsorted(1492474643)
20.4 µs ± 920 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Numpy
%timeit data_np[0].searchsorted(1492474643)
1.03 µs ± 3.55 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Pandas is 20 times slower than Numpy (20.4µs vs 1.03µs).
EDIT: I implemented a namedarray class that bridges the gap between Pandas and Numpy in that it is based on Numpy's ndarray class and hence performs better than Pandas (typically ~7x faster) and is fully compatible with Numpy's API and all its operators; at the same time it keeps column names similar to Pandas' DataFrame, so that manipulating individual columns is easier. This is a prototype implementation. Unlike Pandas, namedarray does not allow for different data types for columns. The code can be found here: https://github.com/mwojnars/nifty/blob/master/math.py (search "namedarray").

What is the right SciPy sparse matrix format for incremental summation

In my code I am currently iterating and creating three lists:
data, row, col
There is a high repetition of (row, col) pairs, and in my final sparse matrix M I would like the value of M[row, col] to be the sum of all the corresponding elements in data. From reading the documentation, the coo_matrix format seems perfect and for small examples it works just fine.
The problem I am having is that when I scale up my problem size, the intermediate lists data, row, col look like they are using up all of my 8 GB of memory plus the swap space, and my script gets automatically killed.
So my question is:
Is there an appropriate format or an efficient way of incrementally building my summed matrix so I don't have to store the full intermediate lists / numpy arrays?
My program loops over a grid, creating local_data, local_row, local_col lists at each point, the elements of which are then appended to data, row, col, so being able to update the sparse matrix with lists as per the sparse matrix constructors would be the ideal case.
There are two things that may be killing you: the duplicates or the overhead of a list over an array. In either case, probably the right thing to do is to grow your list only so big before dumping it into a coo_matrix and adding it to your total. I took a couple of timings:
import numpy as np
import scipy.sparse as sps

rows = list(np.random.randint(100, size=(10000,)))
cols = list(np.random.randint(100, size=(10000,)))
values = list(np.random.rand(10000))
%timeit sps.coo_matrix((values, (rows, cols)))
100 loops, best of 3: 4.03 ms per loop
%timeit (sps.coo_matrix((values[:5000], (rows[:5000], cols[:5000]))) +
         sps.coo_matrix((values[5000:], (rows[5000:], cols[5000:]))))
100 loops, best of 3: 5.24 ms per loop
%timeit sps.coo_matrix((values[:5000], (rows[:5000], cols[:5000])))
100 loops, best of 3: 2.16 ms per loop
So there is about a 25% overhead in splitting the lists in two, converting each to a coo_matrix and then adding them together. And it doesn't seem to be as bad if you do more splits:
%timeit (sps.coo_matrix((values[:2500], (rows[:2500], cols[:2500]))) +
         sps.coo_matrix((values[2500:5000], (rows[2500:5000], cols[2500:5000]))) +
         sps.coo_matrix((values[5000:7500], (rows[5000:7500], cols[5000:7500]))) +
         sps.coo_matrix((values[7500:], (rows[7500:], cols[7500:]))))
100 loops, best of 3: 5.76 ms per loop
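For completeness, a rough sketch (not from the original answer) of the chunked accumulation described above; the chunk size, matrix shape, and grid loop are made up for illustration:
import numpy as np
import scipy.sparse as sps

shape = (100, 100)            # assumed final matrix shape, known up front
chunk_size = 100000           # flush threshold; tune to your memory budget

total = sps.coo_matrix(shape)
data, row, col = [], [], []

for _ in range(1000):         # stand-in for the grid loop in the question
    local_row = np.random.randint(100, size=500)
    local_col = np.random.randint(100, size=500)
    local_data = np.random.rand(500)
    row.extend(local_row)
    col.extend(local_col)
    data.extend(local_data)

    if len(data) >= chunk_size:
        # duplicate (row, col) pairs are summed when the COO data is
        # converted during the sparse addition
        total = total + sps.coo_matrix((data, (row, col)), shape=shape)
        data, row, col = [], [], []

if data:                      # flush the remainder
    total = total + sps.coo_matrix((data, (row, col)), shape=shape)

total = total.tocsr()         # compact CSR result for downstream use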

Efficient column indexing and selection in PANDAS

I'm looking for the most efficient way to select multiple columns from a data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(4,8), columns = list('abcdefgh'))
I want to select only the following columns: a, c, e, f, g, which can be done using indexing:
df.ix[:,[0,2,4,5,6]]
For a large data frame with many columns this seems an inefficient method, and I would much rather specify consecutive column indexes by range if at all possible, but attempts such as the following both throw syntax errors:
df.ix[:,[0,2,4:6]]
or
df.ix[:,[0,2,[4:6]]]
As soon as you select non-adjacent columns, you pay the price.
If your data is homogeneous, falling back to numpy gives you a notable improvement.
In [147]: %timeit df[['a','c','e','f','g']]
%timeit df.values[:,[0,2,4,5,6]]
%timeit df.ix[:,[0,2,4,5,6]]
%timeit pd.DataFrame(df.values[:,[0,2,4,5,6]],columns=df.columns[[0,2,4,5,6]])
100 loops, best of 3: 2.67 ms per loop
10000 loops, best of 3: 58.7 µs per loop
1000 loops, best of 3: 1.81 ms per loop
1000 loops, best of 3: 568 µs per loop
I think you can use range:
print [0,2] + range(4,7)
[0, 2, 4, 5, 6]
print df.ix[:, [0,2] + range(4,7)]
a c e f g
0 0.278231 0.192650 0.653491 0.944689 0.663457
1 0.416367 0.477074 0.582187 0.730247 0.946496
2 0.396906 0.877941 0.774960 0.057290 0.556719
3 0.119685 0.211581 0.526096 0.213282 0.492261
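Side note (not from the original answers): on newer pandas versions, where ix is removed, the same mixed selection can be written with iloc and np.r_, which concatenates scalar indices and slices into one index array:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))

cols = np.r_[0, 2, 4:7]       # -> array([0, 2, 4, 5, 6])
subset = df.iloc[:, cols]     # columns a, c, e, f, g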
Pandas is relatively well thought out; the shortest way is the most efficient:
df[['a','c','e','f','g']]
You don't need ix, as it would do a search in your data; for this approach you obviously need the names of the columns.
