Casting one-hot encoded array to bool arrays is slow - python

I have a large array of uuids, let's call it labels. For every distinct uuid in this array I need a boolean mask that shows me at which positions in the array that uuid occurs. I need this for later computations.
I use pandas' get_dummies() function to create a one-hot encoding of the labels array. Each column of the resulting dataframe is then cast to a boolean array and stored in a dictionary, keyed by the uuid.
Creating the dataframe with get_dummies() is always as fast as I need, but casting the columns to bool gets really slow:
import pandas as pd
import numpy as np
labels = np.random.randint(0, 10000, 500000)
%timeit -n 1 -r 1 d = pd.get_dummies(labels); d = {key: d[key].astype(bool) for i, key in enumerate(d.columns.values)}
>>52.5 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
#smaller dataset
labels = np.random.randint(0, 10000, 100000)
%timeit -n 1 -r 1 d = pd.get_dummies(labels); d = {key: d[key].astype(bool) for i, key in enumerate(d.columns.values)}
>>3.52 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
#without casting to bool
labels = np.random.randint(0, 10000, 500000)
%timeit -n 1 -r 1 d = pd.get_dummies(labels); d = {key: d[key] for i, key in enumerate(d.columns.values)}
>>1.24 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
How can I make this faster, i.e. how can I get my boolean masks from the one-hot encoding?

To convert the df to boolean values, you can convert it to a numpy array, compare it to 1, and build a df again:
%timeit pd.DataFrame(d.values==1)
1 loop, best of 3: 281 ms per loop
It's not a good idea to follow my original advice from the comment (the array was one zero short when I did the timings there):
%timeit d==1
1 loop, best of 3: 4.83 s per loop
I think pandas is much slower here because it iterates over the columns internally.
edit:
to retain the original index you can do:
e = pd.DataFrame(d.values==1)
e.index = d.index
edit2:
To save another 60 ms, it's also possible to use pandas' eval function:
%timeit pd.eval('d==1')
1 loop, best of 3: 220 ms per loop
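Putting the pieces together for the original goal, a dictionary with one boolean mask per uuid, might look like this (a minimal sketch of my own, not part of the original answer):
import numpy as np
import pandas as pd

labels = np.random.randint(0, 10000, 500000)

d = pd.get_dummies(labels)
vals = d.values == 1                               # one boolean matrix, built in a single pass
masks = {key: vals[:, i] for i, key in enumerate(d.columns)}

# The masks can also be built directly from labels, skipping get_dummies entirely:
# masks = {uuid: labels == uuid for uuid in np.unique(labels)}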

Related

Form an item index in a masked array, calculate the index of the same item in the original sorted array

I masked a sorted 1-D numpy array using the method below (which follows a solution proposed here):
def get_from_sorted(sorted, idx):
    mask = np.zeros(sorted.shape, bool)
    mask[idx] = True
    return sorted[mask]
The method returns the array after masking with the indices idx. For example, if sorted = np.array([0.1,0.2,0.3,0.4,0.5]) and idx = np.array([4,0,1]), then get_from_sorted returns np.array([0.1,0.2,0.5]) (note that the order in the original array is preserved).
Question: I need the mapping between the indices of the items in the masked array and their indices in the original array. In the example above, such a mapping is
0 -> 0
1 -> 1
2 -> 4
because 0.1, 0.2, and 0.5 are at positions 0, 1, and 4 in sorted.
How can I program this mapping efficiently?
Requirement on efficiency: efficiency is key in my problem. Here, "sorted" is a 1-D array of about 1 million elements and "idx" is a 1-D array of about 0.5 million elements (taken from an image processing application). Thus, checking the elements of the masked array one by one against the original array, even in a vectorized fashion, for example using np.where, would not perform well in my case. Ideally, there should be a relatively simple mathematical relation between the indices in the masked array and those in the original sorted array. Any idea?
I assume (from your example) that the original list is the sorted list. In which case, unless I misunderstand, you just do:
idx.sort()
and then the mapping is i-> idx[i]
Of course, if the original order of idx is important, make a copy first.
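A small worked example of that mapping, using the arrays from the question (my own illustration; the array is named sorted_arr to avoid shadowing the built-in):
import numpy as np

sorted_arr = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
idx = np.array([4, 0, 1])

mapping = np.sort(idx)                  # array([0, 1, 4])
masked = sorted_arr[mapping]            # array([0.1, 0.2, 0.5]), same result as get_from_sorted
# element i of the masked array came from position mapping[i] in sorted_arr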
The question is not entirely clear to me; it can have several interpretations.
mask -> idx (in ascending order):
Let me try with a fairly large dataset (10M values, 10% of them True):
x = np.random.choice(a=[False, True], size=(10000000,), p=[0.9, 0.1])
In this case usage of np.where is quite effective:
%timeit np.where(x)[0]
%timeit x.nonzero()[0]
%timeit np.arange(len(x))[x]
24.8 ms ± 551 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
24.5 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
52.4 ms ± 895 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
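Applied to the question, that index array is exactly the wanted mapping; a sketch of my own, reusing the answer's setup:
import numpy as np

x = np.random.choice(a=[False, True], size=(10000000,), p=[0.9, 0.1])
mapping = np.where(x)[0]        # mapping[i] is the original position of the i-th masked element

arr = np.linspace(0, 1, len(x))
assert np.array_equal(arr[x], arr[mapping])   # boolean masking and fancy indexing agree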
random items of sorted -> idx (in ascending order):
If you have lost any reference to the positions of the items you took from sorted, you can still recover idx as long as there are no duplicate items. This is O(n log n):
x = np.random.choice(a=[False, True], size=(10000000,), p=[0.9, 0.1])
arr = np.linspace(0,1,len(x))
sub_arr = arr[x]  # input data: skipping 90% of items
%timeit np.searchsorted(arr, sub_arr)  # output data
112 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
idx (in any order) -> idx (in ascending order):
This one is simple:
x = np.arange(10000000)
np.random.shuffle(x)
idx = x[:1000000] #input data: first 1M of random idx
%timeit np.sort(idx) #output data
65.3 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
If you need to know where the masked entries came from, you can use one of np.where, np.nonzero or np.flatnonzero. However, if you need to get the origins of only a subset of the indices, you can use a function I recently wrote as part of my library, haggis: haggis.npy_util.unmasked_index [1].
Given mask and the indices of some of your mask elements, you can retrieve a multi-dimensional index of the original locations with
unmasked_index(idx, mask)
If you ever need it, there is also an inverse function haggis.npy_util.masked_index that converts a location in a multidimensional input array into its index in the masked array.
[1] Disclaimer: I am the author of haggis.

Vectorized way for applying a function to a dataframe to create lists

I have seen a few questions like these:
Vectorized alternative to iterrows,
Faster alternative to iterrows,
Pandas: Alternative to iterrow loops,
for loop using iterrows in pandas,
python: using .iterrows() to create columns,
Iterrows performance.
But it seems like each one is a unique case rather than a generalized approach.
My question is also about .iterrows().
I am trying to pass the first and second column of each row to a function and create a list out of the results.
What I have:
I have a pandas DataFrame with two columns that look like this.
   I.D  Score
1   11     26
3   12     26
5   13     26
6   14     25
What I did:
my_points = [Points(int(row[0]),row[1]) for index, row in score.iterrows()]
where Points is a function I defined earlier.
What I am trying to do:
The faster and vectorized form of the above.
Try list comprehension:
score = pd.concat([score] * 1000, ignore_index=True)
def Points(a, b):
    return (a, b)
In [147]: %timeit [Points(int(a),b) for a, b in zip(score['I.D'],score['Score'])]
1.3 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [148]: %timeit [Points(int(row[0]),row[1]) for index, row in score.iterrows()]
259 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [149]: %timeit [Points(int(row[0]),row[1]) for row in score.itertuples()]
3.64 ms ± 80.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Have you ever tried the method .itertuples()?
my_points = [Points(int(row[1]), row[2]) for row in score.itertuples()]
It is a faster way to iterate over a pandas DataFrame. Note that itertuples() yields the index as the first element of each tuple, so the column values start at row[1]; alternatively, pass index=False and keep using row[0] and row[1].
I hope it helps.
The question is actually not about how you iterate through a DataFrame and return a list, but rather how you can apply a function to the values in each row of a DataFrame.
You can use pandas.DataFrame.apply with axis set to 1:
df.apply(func, axis=1)
To put the result in a list, it depends on what your function returns, but you could do:
df.apply(Points, axis=1).tolist()
If you want to apply on only some columns:
df[['Score', 'I.D']].apply(Points, axis=1)
If you want to apply a function that takes multiple arguments, use numpy.vectorize for speed:
np.vectorize(Points)(df['Score'], df['I.D'])
Or a lambda:
df.apply(lambda x: Points(x['Score'], x['I.D']), axis=1).tolist()
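For the two-column case in the question, the zip-based comprehension can be written directly against the sample data; this is my own sketch, assuming Points simply pairs its two arguments:
import pandas as pd

score = pd.DataFrame({'I.D': [11, 12, 13, 14],
                      'Score': [26, 26, 26, 25]},
                     index=[1, 3, 5, 6])

def Points(a, b):
    return (a, b)

my_points = [Points(int(a), b) for a, b in zip(score['I.D'], score['Score'])]
# [(11, 26), (12, 26), (13, 26), (14, 25)]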

how to improve searching index in dataframe

Given a pandas dataframe with a timestamp index, sorted.
I have a label and I need to find the closest index to that label.
Also, the match must be a smaller (earlier) timestamp, so the search should only consider timestamps before the label.
Here is my code:
import pandas as pd
import datetime
data = [i for i in range(100)]
dates = pd.date_range(start="01-01-2018", freq="min", periods=100)
dataframe = pd.DataFrame(data, dates)
label = "01-01-2018 00:10:01"
method = "pad"
tol = datetime.timedelta(seconds=60)
idx = dataframe.index.get_loc(key=label, method="pad", tolerance=tol)
print("Closest idx:"+str(idx))
print("Closest date:"+str(dataframe.index[idx]))
The search is too slow. Is there a way to improve it?
To improve performance, I recommend transforming what you're searching on. Instead of using get_loc, you can convert your DatetimeIndex to Unix time and use np.searchsorted on the underlying numpy array (as the name implies, this requires a sorted index).
get_loc:
(Your current approach)
label = "01-01-2018 00:10:01"
tol = datetime.timedelta(seconds=60)
idx = dataframe.index.get_loc(key=label, method="pad", tolerance=tol)
print(dataframe.iloc[idx])
0 10
Name: 2018-01-01 00:10:00, dtype: int64
And it's timings:
%timeit dataframe.index.get_loc(key=label, method="pad", tolerance=tol)
2.03 ms ± 81.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.searchsorted:
arr = dataframe.index.astype(int)//10**9
l = pd.to_datetime(label).timestamp()
idx = np.max(np.searchsorted(arr, l, side='left')-1, 0)
print(dataframe.iloc[idx])
0 10
Name: 2018-01-01 00:10:00, dtype: int64
And the timings:
%timeit np.max(np.searchsorted(arr, l, side='left')-1, 0)
56.6 µs ± 979 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(I didn't include the setup costs, because the initial array creation should be something you do once, then use for every single query, but even if I did include the setup costs, this method is faster):
%%timeit
arr = dataframe.index.astype(int)//10**9
l = pd.to_datetime(label).timestamp()
np.max(np.searchsorted(arr, l, side='left')-1, 0)
394 µs ± 3.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The above method does not enforce a tolerance of 60s, although this is trivial to check:
>>> np.abs(arr[idx]-l)<60
True
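Wrapping the idea in a small helper that also enforces the 60 s tolerance could look like this; the function name and the side='right' choice (so an exact match pads to itself) are mine, not part of the original answer:
import numpy as np
import pandas as pd

def pad_lookup(index, label, tol_seconds=60):
    # positions in seconds since the epoch; in practice compute arr once and reuse it per query
    arr = index.astype(int) // 10**9
    target = pd.to_datetime(label).timestamp()
    idx = max(np.searchsorted(arr, target, side='right') - 1, 0)
    return idx if abs(arr[idx] - target) < tol_seconds else None

dates = pd.date_range(start="01-01-2018", freq="min", periods=100)
dataframe = pd.DataFrame(range(100), dates)
print(pad_lookup(dataframe.index, "01-01-2018 00:10:01"))   # prints 10 for this example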

numpy array fromfunction using each previous value as input, with non-zero initial value

I would like to fill a numpy array with values using a function. I want the array to start with one initial value and be filled to a given length, using each previous value in the array as the input to the function.
Each array value f[i] should be f[i-1]*x**(y/z).
After a bit of work, I have got to:
import numpy as np
f = np.zeros([31,1])
f[0] = 20
fun = lambda i, j: i*2**(1/3)
f[1:] = np.fromfunction(np.vectorize(fun), (len(f)-1,1), dtype = int)
This fills an array with
[firstvalue=20, 0, 1*2**(1/3), 2*2**(1/3), ...]
I have arrived here having read
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.fromfunction.html
Use of numpy fromfunction
Most efficient way to map function over numpy array
Fastest way to populate a matrix with a function on pairs of elements in two numpy vectors?
How do I create a numpy array using a function?
But I'm just not getting how to translate it to my function.
Except for the initial 20, this produces the same values
np.arange(31)*2**(1/3)
Your iterative version (slightly modified):
def foo0(n):
    f = np.zeros(n)
    f[0] = 20
    for i in range(1, n):
        f[i] = f[i-1]*2**(1/3)
    return f
An alternative:
def foo1(n):
    g = [20]
    for i in range(n-1):
        g.append(g[-1]*2**(1/3))
    return np.array(g)
They produce the same thing:
In [25]: np.allclose(foo0(31), foo1(31))
Out[25]: True
Mine is a bit faster:
In [26]: timeit foo0(100)
35 µs ± 75 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [27]: timeit foo1(100)
23.6 µs ± 83.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
But we don't need to evaluate 2**(1/3) every time:
def foo2(n):
    g = [20]
    const = 2**(1/3)
    for i in range(n-1):
        g.append(g[-1]*const)
    return np.array(g)
That gives only minor time savings. But we're just multiplying each entry by the same constant, so we can use cumprod for a bigger saving:
def foo3(n):
    g = np.ones(n)*(2**(1/3))
    g[0] = 20
    return np.cumprod(g)
In [37]: timeit foo3(31)
14.9 µs ± 14.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [40]: np.allclose(foo0(31), foo3(31))
Out[40]: True
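Since the recurrence is just a geometric progression, there is also a closed form that avoids the cumulative product entirely; this foo4 is my own addition, equivalent to foo3 up to floating-point rounding:
import numpy as np

def foo4(n):
    # f[i] = 20 * (2**(1/3))**i, the closed form of f[i] = f[i-1]*2**(1/3) with f[0] = 20
    return 20 * (2**(1/3)) ** np.arange(n)

# np.allclose(foo3(31), foo4(31)) -> True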

How to turn Numpy array to set efficiently?

I used:
df['ids'] = df['ids'].values.astype(set)
to turn the lists into sets, but the output was still an array, not a set:
>>> x = np.array([[1, 2, 2.5],[12,35,12]])
>>> x.astype(set)
array([[1.0, 2.0, 2.5],
[12.0, 35.0, 12.0]], dtype=object)
Is there an efficient way to turn a list into a set in NumPy?
EDIT 1:
My input is as big as below:
I have 3,000 records. Each has 30,000 ids: [[1,...,12,13,...,30000], [1,..,43,45,...,30000],...,[...]]
First flatten your ndarray to obtain a single-dimensional array, then apply set() to it:
set(x.flatten())
Edit: since it seems you want a set per row, not a set of the whole array, you can do value = [set(v) for v in x] to obtain a list of sets.
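With the array from the question, the two variants give the following (my own illustration):
import numpy as np

x = np.array([[1, 2, 2.5], [12, 35, 12]])

set(x.flatten())        # {1.0, 2.0, 2.5, 12.0, 35.0}
[set(v) for v in x]     # [{1.0, 2.0, 2.5}, {12.0, 35.0}]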
The current state of your question (it can change at any time): how can I efficiently remove duplicate elements from a large array of large arrays?
import numpy as np
rng = np.random.default_rng()
arr = rng.random((3000, 30000))
out1 = list(map(np.unique, arr))
#or
out2 = [np.unique(subarr) for subarr in arr]
Runtimes in an IPython shell:
>>> %timeit list(map(np.unique, arr))
5.39 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit [np.unique(subarr) for subarr in arr]
5.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Update: as @hpaulj pointed out in his comment, my dummy example is biased, since floating-point random numbers will almost certainly all be unique. So here's a more lifelike example with integers:
>>> arr = rng.integers(low=1, high=15000, size=(3000, 30000))
>>> %timeit list(map(np.unique, arr))
4.98 s ± 83.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit [np.unique(subarr) for subarr in arr]
4.95 s ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In this case the elements of the output list have varying lengths, since there are actual duplicates to remove.
A couple of earlier 'row-wise' unique questions:
vectorize numpy unique for subarrays
Numpy: Row Wise Unique elements
Count unique elements row wise in an ndarray
In a couple of these the count is more interesting than the actual unique values.
If the number of unique values per row differs, then the result cannot be a (2d) array. That's a pretty good indication that the problem cannot be fully vectorized. You need some sort of iteration over the rows.
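That said, most of the work can still be pushed into vectorized NumPy by sorting each row once and keeping the elements that differ from their left neighbour; the result is ragged, so the last step stays a Python list. This is a sketch of my own, not taken from the answers above:
import numpy as np

rng = np.random.default_rng()
arr = rng.integers(low=1, high=15000, size=(3000, 30000))

srt = np.sort(arr, axis=1)                       # sort every row in one vectorized call
keep = np.ones(srt.shape, dtype=bool)
keep[:, 1:] = srt[:, 1:] != srt[:, :-1]          # True where a value differs from its left neighbour
out = [row[k] for row, k in zip(srt, keep)]      # one 1-D array of sorted uniques per row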
