Replace rarely occurring values in a pandas dataframe - python

I have a moderately large (~60,000 rows by 15 columns) csv file that I'm working on with pandas. Each row represents an individual and contains personal data. I want to render the data anonymous. One way I want to do so is by replacing values in a particular column where they are rare. I initially tried to do so as follows:
def clean_data(entry):
    if df[df.column_name == entry].index.size < 10:
        return 'RARE_VALUE'
    else:
        return entry

df['new_column_name'] = df.column_name.apply(clean_data)
But running it froze my system every time. This unfortunately means I have no useful debugging data. Does anyone know the correct way to do this? The column contains both strings and null values.

I think you want to groupby column name:
g = df.groupby('column_name')
You can use a filter, for example, to return only those rows whose value in column_name appears at least 10 times:
g.filter(lambda x: len(x) >= 10)
To overwrite the column with 'RARE_VALUE' you can use transform (which calculates the result once for each group, and spreads it around appropriately):
df.loc[g[col].transform(lambda x: len(x) < 10).astype(bool), col] = 'RARE_VALUE'
As DSM points out, the following trick is much faster:
df.loc[df[col].value_counts()[df[col]].values < 10, col] = "RARE_VALUE"
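To see why this works: indexing the result of value_counts() with the column itself yields, for each row, the count of that row's value, which can then be compared against the threshold. A tiny illustrative sketch (hypothetical column name and data):
import pandas as pd
demo = pd.DataFrame({'city': ['a', 'a', 'b', 'c', 'a']})
counts = demo['city'].value_counts()        # a -> 3, b -> 1, c -> 1
per_row = counts[demo['city']].values       # array([3, 3, 1, 1, 3]): each row's count
demo.loc[per_row < 2, 'city'] = 'RARE_VALUE'
One caveat for the question's data: value_counts() drops NaN by default, so null entries need separate handling (e.g. filling them first).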
Here's some timeit information (to show how impressive DSM's solution is!):
In [20]: df = pd.DataFrame(np.random.randint(1, 100, (1000, 2)))

In [21]: g = df.groupby(0)
In [22]: %timeit g.filter(lambda x: len(x) >= 10)
10 loops, best of 3: 67.2 ms per loop
In [23]: %timeit df.loc[g[1].transform(lambda x: len(x) < 10).values.astype(bool), 1]
10 loops, best of 3: 44.6 ms per loop
In [24]: %timeit df.loc[df[1].value_counts()[df[1]].values < 10, 1]
1000 loops, best of 3: 1.57 ms per loop

@Andy Hayden solves the issue in various ways. I would recommend using pipelines for this kind of task though. The following may seem more unwieldy, but it comes in handy if you want to save the whole pipeline as an object, or if you have to generalize predictions on a test set:
class RemoveScarceValuesFeatureEngineer:

    def __init__(self, min_occurences):
        self._min_occurences = min_occurences
        self._column_value_counts = {}

    def fit(self, X, y):
        for column in X.columns:
            self._column_value_counts[column] = X[column].value_counts()
        return self

    def transform(self, X):
        for column in X.columns:
            X.loc[self._column_value_counts[column][X[column]].values
                  < self._min_occurences, column] = "RARE_VALUE"
        return X

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)
You may find more information here: Pandas replace rare values
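As a minimal usage sketch (hypothetical toy data: fit the counts on a training frame, then apply them to a test frame):
import pandas as pd

X_train = pd.DataFrame({'color': ['red', 'red', 'red', 'blue']})
X_test = pd.DataFrame({'color': ['red', 'blue', 'red']})

fe = RemoveScarceValuesFeatureEngineer(min_occurences=2)
fe.fit(X_train, y=None)
print(fe.transform(X_test))
#         color
# 0         red
# 1  RARE_VALUE
# 2         red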

Related

Numpy searchsorted along many dimensions? [duplicate]

Assume that I have two arrays A and B, where both A and B are m x n. My goal is now, for each row of A and B, to find where I should insert the elements of row i of A in the corresponding row of B. That is, I wish to apply np.digitize or np.searchsorted to each row of A and B.
My naive solution is to simply iterate over the rows. However, this is far too slow for my application. My question is therefore: is there a vectorized implementation of either algorithm that I haven't managed to find?
We can add to each row an offset relative to the previous row, using the same offset scheme for both arrays. The idea is then to use np.searchsorted on the flattened versions of the input arrays, so that each row from b is restricted to finding sorted positions in the corresponding row of a. Additionally, to make it work for negative numbers too, we just need to offset by the minimum values as well.
So, we would have a vectorized implementation like so -
def searchsorted2d(a, b):
    m, n = a.shape
    max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1
    r = max_num * np.arange(a.shape[0])[:, None]
    p = np.searchsorted((a + r).ravel(), (b + r).ravel()).reshape(m, -1)
    return p - n * (np.arange(m)[:, None])
Runtime test -
In [173]: def searchsorted2d_loopy(a,b):
     ...:     out = np.zeros(a.shape,dtype=int)
     ...:     for i in range(len(a)):
     ...:         out[i] = np.searchsorted(a[i],b[i])
     ...:     return out
     ...:
In [174]: # Setup input arrays
...: a = np.random.randint(11,99,(10000,20))
...: b = np.random.randint(11,99,(10000,20))
...: a = np.sort(a,1)
...: b = np.sort(b,1)
...:
In [175]: np.allclose(searchsorted2d(a,b),searchsorted2d_loopy(a,b))
Out[175]: True
In [176]: %timeit searchsorted2d_loopy(a,b)
10 loops, best of 3: 28.6 ms per loop
In [177]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 13.7 ms per loop
The solution provided by @Divakar is ideal for integer data, but beware of precision issues for floating point values, especially if they span multiple orders of magnitude (e.g. [[1.0, 2.0, 3.0, 1.0e+20], ...]). In some cases r may be so large that applying a+r and b+r wipes out the original values you're trying to run searchsorted on, and you're just comparing r to r.
To make the approach more robust for floating-point data, you could embed the row information into the arrays as part of the values (as a structured dtype), and run searchsorted on these structured dtypes instead.
def searchsorted_2d(a, v, side='left', sorter=None):
    import numpy as np
    # Make sure a and v are numpy arrays.
    a = np.asarray(a)
    v = np.asarray(v)
    # Augment a with a row id.
    ai = np.empty(a.shape, dtype=[('row', int), ('value', a.dtype)])
    ai['row'] = np.arange(a.shape[0]).reshape(-1, 1)
    ai['value'] = a
    # Augment v with a row id.
    vi = np.empty(v.shape, dtype=[('row', int), ('value', v.dtype)])
    vi['row'] = np.arange(v.shape[0]).reshape(-1, 1)
    vi['value'] = v
    # Perform searchsorted on the augmented arrays. The row information is
    # embedded in the values, so only the equivalent rows between a and v
    # are considered.
    result = np.searchsorted(ai.flatten(), vi.flatten(), side=side, sorter=sorter)
    # Restore the original shape and decode the searchsorted indices so they
    # apply to the original data.
    result = result.reshape(vi.shape) - vi['row'] * a.shape[1]
    return result
Edit: The timing on this approach is abysmal!
In [21]: %timeit searchsorted_2d(a,b)
10 loops, best of 3: 92.5 ms per loop
You would be better off just using map over the array:
In [22]: %timeit np.array(list(map(np.searchsorted,a,b)))
100 loops, best of 3: 13.8 ms per loop
For integer data, @Divakar's approach is still the fastest:
In [23]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 7.26 ms per loop

Vectorized iterative fixpoint search

I have a function that will always converge to a fixpoint, e.g. f(x) = (x-a)/2 + a. I have a function that will find this fixpoint through repeated invocation of the function:
def find_fix_point(f, x):
    while f(x) > 0.1:
        x = f(x)
    return x
which works fine. Now I want to do this for a vectorized version:
def find_fix_point(f, x):
    while (f(x) > 0.1).any():
        x = f(x)
    return x
However, this is quite inefficient if most of the instances only need about 10 iterations and one needs 1000. What is a fast method to remove the `x` values that have already been found?
The code can use numpy or scipy.
One way to solve this would be to use recursion:
def find_fix_point_recursive(f, x):
    ind = x > 0.1
    if ind.any():
        x[ind] = find_fix_point_recursive(f, f(x[ind]))
    return x
With this implementation, we only call f on the points which need to be updated.
Note that by using recursion we avoid having to do the check x > 0.1 all the time, with each call working on smaller and smaller arrays.
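The timings below call an f that is not restated in this answer; a plausible stand-in, consistent with the question's example with a = 0, would be:
f = lambda x: x / 2   # assumed definition for the timings; fixpoint at 0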
%timeit x = np.zeros(10000); x[0] = 10000; find_fix_point(f, x)
1000 loops, best of 3: 1.04 ms per loop
%timeit x = np.zeros(10000); x[0] = 10000; find_fix_point_recursive(f, x)
10000 loops, best of 3: 141 µs per loop
First, for generality, I change the criterion to fit the fixpoint definition: we stop when |x - f(x)| <= epsilon.
You can mix boolean indexing and integer indexing to keep track of the still-active points at each step. Here is a way to do that:
def find_fix_point(f, x, epsilon):
    ind = np.mgrid[:len(x)]                     # initial indices
    while ind.size > 0:
        xind = x[ind]                           # integer indexing
        yind = f(xind)
        x[ind] = yind
        ind = ind[abs(yind - xind) > epsilon]   # boolean indexing
An example with a lot of fixpoints:
from matplotlib.pyplot import plot,show
x0=np.linspace(0,1,1000)
x = x0.copy()
def f(x): return x*np.sin(1/x)
find_fix_point(f,x,1e-5)
plot(x0,x,'.');show()
The general method is to use boolean indexing to compute only the entries that have not yet reached equilibrium.
I adapted the algorithm given by Jonas Adler to avoid hitting the maximum recursion depth:
def find_fix_point_vector(f, x):
    x = x.copy()
    x_fix = np.empty(x.shape)
    unfixed = np.full(x.shape, True, dtype=bool)
    while unfixed.any():
        x_new = f(x)                     # iterate on the still-unfixed values
        x_fix[unfixed] = x_new           # copy the latest values into the result
        cond = np.abs(x_new - x) > 1     # which entries are still changing
        unfixed[unfixed] = cond          # mark the converged entries as fixed
        x = x_new[cond]                  # keep only the x values that still need to be computed
    return x_fix
Edit:
Here I review the 3 solutions proposed. I will call the different functions according to their proposers: find_fix_Jonas, find_fix_Jurg, find_fix_BM. I changed the fixpoint condition in all functions according to BM (see the updated function of Jonas at the end).
Speed:
%timeit find_fix_BM(f, np.linspace(0,100000,10000),1)
100 loops, best of 3: 2.31 ms per loop
%timeit find_fix_Jonas(f, np.linspace(0,100000,10000))
1000 loops, best of 3: 1.52 ms per loop
%timeit find_fix_Jurg(f, np.linspace(0,100000,10000))
1000 loops, best of 3: 1.28 ms per loop
In terms of readability, I think Jonas's version is the easiest one to understand, so it should be chosen when speed does not matter very much.
Jonas's version, however, might raise a RecursionError (maximum recursion depth exceeded) when the number of iterations needed to reach the fixpoint is large (>1000). The other two solutions do not have this drawback.
The version of B.M., however, might be easier to understand than the version proposed by me.
Version of Jonas used:
def find_fix_Jonas(f, x):
    fx = f(x)
    ind = np.abs(fx - x) > 1
    if ind.any():
        fx[ind] = find_fix_Jonas(f, fx[ind])
    return fx
...remove `x` that have already been found?
Create a new array using boolean indexing with your condition.
>>> a = np.array([3,1,6,3,9])
>>> a != 3
array([False, True, True, False, True], dtype=bool)
>>> b = a[a != 3]
>>> b
array([1, 6, 9])
>>>

Find nearest indices for one array against all values in another array - Python / NumPy

I have a list of complex numbers for which I want to find the closest value in another list of complex numbers.
My current approach with numpy:
import numpy as np

refArray = np.random.random(16)
myArray = np.random.random(1000)

def find_nearest(array, value):
    idx = (np.abs(array - value)).argmin()
    return idx

for value in np.nditer(myArray):
    index = find_nearest(refArray, value)
    print(index)
Unfortunately, this takes ages for a large number of values.
Is there a faster or more "Pythonic" way of matching each value in myArray to the closest value in refArray?
FYI: I don't necessarily need numpy in my script.
Important: the order of both myArray as well as refArray is important and should not be changed. If sorting is to be applied, the original index should be retained in some way.
Here's one vectorized approach with np.searchsorted based on this post -
def closest_argmin(A, B):
    L = B.size
    sidx_B = B.argsort()
    sorted_B = B[sidx_B]
    sorted_idx = np.searchsorted(sorted_B, A)
    sorted_idx[sorted_idx == L] = L - 1
    mask = (sorted_idx > 0) & \
           (np.abs(A - sorted_B[sorted_idx-1]) < np.abs(A - sorted_B[sorted_idx]))
    return sidx_B[sorted_idx - mask]
Brief explanation:
Get the insertion positions of A into the sorted version of B. We do this with np.searchsorted(sorted_B, A) (side='left' is the default). Since searchsorted expects a sorted array as its first input, the preparatory work is to sort B and keep the sorting indices sidx_B.
For each element of A, compare the candidate just before the insertion position (sorted_B[sorted_idx-1]) with the candidate at the insertion position (sorted_B[sorted_idx]) and see which one is closer. We do this at the step that computes mask.
Based on which of the two candidates is closer, choose the respective one. This is done by subtracting the boolean mask, acting as a 0/1 offset, from sorted_idx, and then mapping through sidx_B to recover indices into the original, unsorted B.
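A tiny worked example (made-up numbers) of that comparison, assuming closest_argmin as defined above:
A = np.array([0.4, 0.05])
B = np.array([0.1, 0.5, 0.9])      # already sorted, so sidx_B is just the identity
closest_argmin(A, B)               # -> array([1, 0]): 0.4 is nearest to 0.5, 0.05 to 0.1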
Benchmarking
Original approach -
def org_app(myArray, refArray):
    out1 = np.empty(myArray.size, dtype=int)
    for i, value in enumerate(myArray):
        # find_nearest from the posted question
        index = find_nearest(refArray, value)
        out1[i] = index
    return out1
Timings and verification -
In [188]: refArray = np.random.random(16)
...: myArray = np.random.random(1000)
...:
In [189]: %timeit org_app(myArray, refArray)
100 loops, best of 3: 1.95 ms per loop
In [190]: %timeit closest_argmin(myArray, refArray)
10000 loops, best of 3: 36.6 µs per loop
In [191]: np.allclose(closest_argmin(myArray, refArray), org_app(myArray, refArray))
Out[191]: True
50x+ speedup for the posted sample and hopefully more for larger datasets!
An answer that is much shorter than that of @Divakar, also using broadcasting and even slightly faster:
abs(myArray[:, None] - refArray[None, :]).argmin(axis=-1)
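Keep in mind that this builds the full len(myArray) x len(refArray) matrix of absolute differences in memory, so it is best suited to cases like this one where refArray is small. A sketch wrapping it as a helper (hypothetical function name):
def closest_argmin_bcast(my_array, ref_array):
    # Pairwise absolute differences, shape (len(my_array), len(ref_array));
    # memory grows as M*N, so prefer the searchsorted approach for large inputs.
    return np.abs(my_array[:, None] - ref_array[None, :]).argmin(axis=-1)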

What is this set of all sets one line function doing?

I found this one line function on the python wiki that creates a set of all sets that can be created from a list passed as an argument.
f = lambda x: [[y for j, y in enumerate(set(x)) if (i >> j) & 1] for i in range(2**len(set(x)))]
Can someone please explain how this function works?
To construct the power set, iterating over range(2**len(set(x))) gives you all the binary combinations of the set:
range(2**len(set(x))) == [00000, 00001, 00010, ..., 11110, 11111]
Now you just need to test if the bit is set in i to see if you need to include it in the set, e.g.:
>>> i = 0b10010
>>> [y for j, y in enumerate(range(5)) if (i >> j) & 1]
[1, 4]
Though I'm not sure how efficient it is given the call to set(x) for every iteration. There is a small hack that would avoid that:
f = lambda x: [[y for j, y in enumerate(s) if (i >> j) & 1] for s in [set(x)] for i in range(2**len(s))]
A couple of other forms using itertools:
import itertools as it
f1 = lambda x: [list(it.compress(s, i)) for s in [set(x)] for i in it.product((0,1), repeat=len(s))]
f2 = lambda x: list(it.chain.from_iterable(it.combinations(set(x), r) for r in range(len(set(x))+1)))
Note: this last one could just return an iterable instead of a list if you remove the list() call; depending on the use case, this could save some memory.
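For instance, a quick sanity check of the chain/combinations form on a small input (assuming f2 as defined above):
>>> f2([1, 2, 3])
[(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]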
Looking at some timings for a list of 25 random numbers in the range 0-50:
%%timeit binary: 1 loop, best of 3: 20.1 s per loop
%%timeit binary+hack: 1 loop, best of 3: 17.9 s per loop
%%timeit compress/product: 1 loop, best of 3: 5.27 s per loop
%%timeit chain/combinations: 1 loop, best of 3: 659 ms per loop
Let's rewrite it a bit and break it down step by step:
f = lambda x: [[y for j, y in enumerate(set(x)) if (i >> j) & 1] for i in range(2**len(set(x)))]
is equivalent to:
def f(x):
    n = len(set(x))
    sets = []
    for i in range(2**n):               # all combinations of members of the set, in binary
        set_i = []
        for j, y in enumerate(set(x)):
            if (i >> j) & 1:            # check if bit nr j is set
                set_i.append(y)
        sets.append(set_i)
    return sets
for an input list like [1,2,3,4], the following happens:
n=4
range(2**n)=[0,1,2,3...15]
which, in binary is:
0,1,10,11,100...1110,1111
enumerate pairs each element y of the set with its index j, so in our case:
[(0,1),(1,2),(2,3),(3,4)]
The (i>>j) & 1 part might require some explanation.
(i>>j) shifts the number i j places to the right, e.g. in decimal: 4>>2=1, or in binary: 100>>2=001. The & is the bit-wise and operator. This checks, for every bit of both operands, if they are both 1 and returns the result as a number, acting like a filter: 10111 & 11001 = 10001.
In the case of our example, it checks if the bit at place j is 1. If it is, the corresponding value is added to the result list. This way the binary map of combinations is converted to a list of lists, which is returned.
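Putting it together, the one-liner produces the full power set for a small input, with the subsets ordered by the binary counter:
>>> f([1, 2])
[[], [1], [2], [1, 2]]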

Optimize Python: Large arrays, memory problems

I'm having a speed problem running Python / numpy code. I don't know how to make it faster; maybe someone else does?
Assume there is a surface with two triangulations, one fine (..._fine) with M points and one coarse with N points. Also, there is data on the coarse mesh at every point (N floats). I'm trying to do the following:
For every point on the fine mesh, find the k closest points on the coarse mesh and take the mean of their values. In short: interpolate data from coarse to fine.
My code right now looks like the following. With large data (in my case M = 2e6, N = 1e4) it runs for about 25 minutes, presumably because the explicit for loop does not vectorize under numpy. Any ideas how to solve this with smart indexing? M x N arrays blow up the RAM.
import numpy as np

# p_fine.shape => (m, 3)
# p.shape => (n, 3)
data_fine = np.empty((m,))
for i, ps in enumerate(p_fine):
    data_fine[i] = np.mean(data_coarse[np.argsort(np.linalg.norm(ps - p, axis=1))[:k]])
Cheers!
First of all, thanks for the detailed help.
First, Divakar, your solutions gave a substantial speed-up. With my data, the code ran for just below 2 minutes, depending a bit on the chunk size.
I also tried my way around sklearn and ended up with
def sklearnSearch_v3(p, p_fine, k):
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(p)
    # kneighbors returns (distances, indices); [1] picks the indices
    return data_coarse[neigh.kneighbors(p_fine)[1]].mean(axis=1)
which ended up being quite fast; for my data sizes, I get the following:
import numpy as np
from sklearn.neighbors import NearestNeighbors
m,n = 2000000,20000
p_fine = np.random.rand(m,3)
p = np.random.rand(n,3)
data_coarse = np.random.rand(n)
k = 3
yields
%timeit sklearnSearch_v3(p, p_fine, k)
1 loop, best of 3: 7.46 s per loop
Approach #1
We are working with large datasets and memory is an issue, so I will try to optimize the computations within the loop. Now, we can use np.einsum to replace the np.linalg.norm part and np.argpartition in place of the actual sorting with np.argsort, like so -
out = np.empty((m,))
for i, ps in enumerate(p_fine):
    subs = ps - p
    sq_dists = np.einsum('ij,ij->i', subs, subs)
    out[i] = data_coarse[np.argpartition(sq_dists, k)[:k]].sum()
out = out/k
Approach #2
Now, as another approach we can also use Scipy's cdist for a fully vectorized solution, like so -
from scipy.spatial.distance import cdist
out = data_coarse[np.argpartition(cdist(p_fine,p),k,axis=1)[:,:k]].mean(1)
But, since we are memory bound here, we can perform these operations in chunks. Basically, we would take chunks of rows from that tall array p_fine that has millions of rows, run cdist on each chunk, and thus at each iteration obtain a chunk of output elements instead of just one scalar. With this, we would divide the loop count by the length of that chunk.
So, finally we would have an implementation like so -
out = np.empty((m,))
L = 10   # length of chunk (to be used as a param)
num_iter = m//L
for j in range(num_iter):
    p_fine_slice = p_fine[L*j:L*j+L]
    out[L*j:L*j+L] = data_coarse[
        np.argpartition(cdist(p_fine_slice, p), k, axis=1)[:, :k]].mean(1)
Runtime test
Setup -
# Setup inputs
m,n = 20000,100
p_fine = np.random.rand(m,3)
p = np.random.rand(n,3)
data_coarse = np.random.rand(n)
k = 5
def original_approach(p, p_fine, m, n, k):
    data_fine = np.empty((m,))
    for i, ps in enumerate(p_fine):
        data_fine[i] = np.mean(data_coarse[np.argsort(
            np.linalg.norm(ps - p, axis=1))[:k]])
    return data_fine

def proposed_approach(p, p_fine, m, n, k):
    out = np.empty((m,))
    for i, ps in enumerate(p_fine):
        subs = ps - p
        sq_dists = np.einsum('ij,ij->i', subs, subs)
        out[i] = data_coarse[np.argpartition(sq_dists, k)[:k]].sum()
    return out/k

def proposed_approach_v2(p, p_fine, m, n, k, len_per_iter):
    L = len_per_iter
    out = np.empty((m,))
    num_iter = m//L
    for j in range(num_iter):
        p_fine_slice = p_fine[L*j:L*j+L]
        out[L*j:L*j+L] = data_coarse[
            np.argpartition(cdist(p_fine_slice, p), k, axis=1)[:, :k]].sum(1)
    return out/k
Timings -
In [134]: %timeit original_approach(p,p_fine,m,n,k)
1 loops, best of 3: 1.1 s per loop
In [135]: %timeit proposed_approach(p,p_fine,m,n,k)
1 loops, best of 3: 539 ms per loop
In [136]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=100)
10 loops, best of 3: 63.2 ms per loop
In [137]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=1000)
10 loops, best of 3: 53.1 ms per loop
In [138]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=2000)
10 loops, best of 3: 63.8 ms per loop
So, there's about a 2x improvement with the first proposed approach, and a 20x improvement over the original approach with the second one at its sweet spot, with the len_per_iter param set to 1000. Hopefully this will bring your 25-minute runtime down to a little over a minute. Not bad, I guess!
