Pandas: drop rows that are lower than another row in all columns - python

I have a dataframe with a lot of rows with numerical columns, such as:
 A  B  C  D
12  7  1  0
 7  1  2  0
 1  1  1  1
 2  2  0  0
I need to reduce the size of the dataframe by removing every row for which another row is at least as large in all columns.
In the example above I need to remove the last row, because the first row is at least as large in every column (for duplicate rows, I need to keep one of them).
The result should be:
 A  B  C  D
12  7  1  0
 7  1  2  0
 1  1  1  1
My fastest solution so far is the following:
import numpy

def complete_reduction(df, columns):
    def _single_reduction(row):
        # Flag the rows that are >= `row` in every column
        df["check"] = True
        for col in columns:
            df["check"] = df["check"] & (df[col] >= row[col])
        # Keep `row` only if it is the sole row that satisfies the check
        drop_index.append(df["check"].sum() == 1)
    df = df.drop_duplicates(subset=columns)
    drop_index = []
    df.apply(lambda x: _single_reduction(x), axis=1)
    df = df[numpy.array(drop_index).astype(bool)]
    return df
Any better ideas?
Update:
A new solution has been posted here:
https://stackoverflow.com/a/68528943/11327160
but I am hoping for something faster.

A more memory-efficient and faster solution than those proposed so far is to use Numba. With Numba there is no need to create a huge temporary array. Moreover, it is easy to write a parallel implementation that makes use of all CPU cores. Here is the implementation:
import numba as nb
import numpy as np

@nb.njit
def is_dominated(arr, k):
    n, m = arr.shape
    for i in range(n):
        if i != k:
            dominated = True
            for j in range(m):
                if arr[i, j] < arr[k, j]:
                    dominated = False
            if dominated:
                return True  # early stop: one dominating row is enough
    return False

# Precompile the function to native code for the most common types
@nb.njit(['(i4[:,::1],)', '(i8[:,::1],)'], parallel=True, cache=True)
def dominated_rows(arr):
    n, m = arr.shape
    toRemove = np.empty(n, dtype=np.bool_)
    for i in nb.prange(n):
        toRemove[i] = is_dominated(arr, i)
    return toRemove

# Special case: identical rows would otherwise dominate each other
df2 = df.drop_duplicates()
# Main computation (on the de-duplicated rows, so the mask matches df2)
result = df2[~dominated_rows(np.ascontiguousarray(df2.values))]
Benchmark
The test inputs are two random dataframes of shape 20000x5 and 5000x100 containing small integers (i.e. in [0, 100)). Tests were done on a 6-core i5-9600KF processor with 16 GiB of RAM on Windows. The version from @BingWang is the updated one of 2022-05-24. Here are the performance results of the approaches proposed so far:
Dataframe with shape 5000x100
- Initial code: 114_340 ms
- BENY: 2_716 ms (consumes a few GiB of RAM)
- Bing Wang: 2_619 ms
- Numba: 303 ms <----
Dataframe with shape 20000x5
- Initial code: (too long)
- BENY: 8_775 ms (consumes a few GiB of RAM)
- Bing Wang: 578 ms
- Numba: 21 ms <----
This solution is about 9 and 28 times faster, respectively, than the fastest alternative (the one from @BingWang). It also has the benefit of consuming far less memory: the @BENY implementation consumes a few GiB of RAM, while this one (and the one from @BingWang) consumes no more than a few MiB for this use case. The speed gain over the @BingWang implementation is due to the early stop, parallelism and native execution.
One can see that this Numba implementation and the one from @BingWang are both quite efficient when the number of columns is small. This makes sense for the @BingWang one, since its complexity should be O(N(logN)^(d-2)), where d is the number of columns. As for Numba, it is significantly faster because most rows are dominated in the second random dataset, which makes the early stop very effective in practice. I think the @BingWang algorithm might be faster when most rows are not dominated. However, this case should be very uncommon on dataframes with few columns and a lot of rows (at least, clearly so on uniformly random ones).

We can use NumPy broadcasting:
s = df.values
out = df[np.sum(np.all(s>=s[:,None],-1),1)==1]
Out[44]:
A B C D
0 12 7 1 0
1 7 1 2 0
2 1 1 1 1
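The comparison above materialises an n x n x m boolean array, which is where the memory goes on large inputs. Below is a minimal sketch of the same test done block by block. The function name `non_dominated_chunked` and the `chunk` parameter are illustrative choices, not part of the answer, and duplicates are dropped first so that a repeated row is kept once.

```python
import numpy as np
import pandas as pd

def non_dominated_chunked(df, chunk=1024):
    """Same test as the broadcasting one-liner, but `chunk` rows at a time."""
    df = df.drop_duplicates()          # assumption: keep one copy of repeated rows
    s = df.to_numpy()
    keep = np.empty(len(s), dtype=bool)
    for start in range(0, len(s), chunk):
        block = s[start:start + chunk]                      # shape (b, m)
        # ge[i, j] is True when row j of s is >= row i of the block in every column
        ge = np.all(s[None, :, :] >= block[:, None, :], axis=-1)
        # a row survives when the only row that is >= it everywhere is itself
        keep[start:start + chunk] = ge.sum(axis=1) == 1
    return df[keep]

df = pd.DataFrame([[12, 7, 1, 0], [7, 1, 2, 0], [1, 1, 1, 1], [2, 2, 0, 0]],
                  columns=["A", "B", "C", "D"])
result = non_dominated_chunked(df, chunk=2)
```

The temporary array now has shape (chunk, n, m), so peak memory is set by `chunk` rather than by the number of rows; the total amount of work is unchanged.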

Here is an attempt based on Kung et al. (1975):
http://www.eecs.harvard.edu/~htk/publication/1975-jacm-kung-luccio-preparata.pdf
The brute-force solution is from https://stackoverflow.com/a/68528943/11327160
I didn't test it robustly, but with these parameters it appears to give the same answer.
There is no guarantee that it is correct, or even that I am following the paper. Please test thoroughly. In addition, there is very likely a commercial solution that calculates this.
D=5 #dimension, or number of columns
N=2000 #number of data rows
M=1000 #upper bound for random integers
Changing to D=20 and N=20000, you can see that Kung75 completes in under 1 minute, while brute force takes more than 10x as long.
Even with D=1000, N=20000 and values in the range 0~999, it still completes in slightly over 1 minute.
This can be revised along the lines of merge sort (compute small chunks by brute force, then merge up with Filter), which makes it easier to switch to parallel computing.
Another way to speed it up is to turn off array bounds checking once you are comfortable with the code, because the code does a lot of array indexing. I would recommend C# if you want to try this path.
import pandas as pd
import numpy as np
import datetime

#generate fake data
D=1000 #dimension, or number of columns
N=20000 #number of data rows
M=1000 #upper bound for random integers
np.random.seed(12345) #set seed so this is reproducible
data=np.random.randint(0,M,(N,D))
for i in range(0,12):
    print(i,data[i])
#Compare w and v starting at dimension d
def Compare(w,v,d=0):
    cmp=3 #0b11, low bit is GE, high bit is LE, together means EQ
    while d<D:
        if w[d]>v[d]:
            cmp&=1
        elif w[d]<v[d]:
            cmp&=2
        if cmp>0:
            d+=1
        else:
            break
    return cmp # 0=incomparable, 1=GT, 2=LT, 3=EQ
#unit test:
#print(Compare(data[0],data[1]))
#print(Compare(data[0],data[1],4))
#print(Compare(data[1],data[11]))
#print(Compare(data[11],data[1]))
#print(Compare(data[1],data[1]))

def AuxSort(d,ndxArray): #stable sort desc by dimension d
    return [x[1] for x in sorted([(-data[n][d],n) for n in ndxArray])]
#unit test
#print(AuxSort(0,[0,4,3]))
#print(AuxSort(2,[0,1,2]))
#cumulatively find the pareto front. Time O(N^2), space O(N)
def N2BrutalForce(data,ndxArray=None,d=0):
    if len(data)==0:
        return []
    if not ndxArray: #by default check the entire data
        ndxArray=list(range(len(data)))
    #up to this point ndxArray is not empty
    result={ndxArray[0]:data[ndxArray[0]]}
    for i in range(1,len(ndxArray)):
        dominated=[]
        j=ndxArray[i]
        for k,v in result.items():
            c=Compare(data[j],v,d)
            if c>1: #row j is dominated by (or equal to) a kept row
                break
            elif c==1:
                dominated.append(k)
        else: #no break: row j joins the front and evicts the rows it dominates
            for o in dominated:
                del result[o]
            result[j]=data[j]
    return [r for r in result]

def resultPrinter(res, ShowCountOnly=False):
    if not ShowCountOnly:
        for r in sorted(res):
            print(r,data[r])
    print(len(res),'results found',datetime.datetime.today())
#unit test
#resultPrinter(N2BrutalForce(data),True)
#resultPrinter(N2BrutalForce(data,list(range(15))))
def FindT(R1,R2,S1,S2,d):
    S1R1=set(Filter(data,d,R1,S1))
    T1=[s for s in S1 if s in S1R1]
    S2R1=Filter(data,d+1,R1,S2)
    S2R2=set(Filter(data,d,R2,S2))
    T2=[s for s in S2R1 if s in S2R2]
    return T1+T2

def BreakAtPseudoMedian(sArray,d):
    sArray=AuxSort(d,sArray) #this could speed up by moving the sort to caller and avoid redo sorting
    if data[sArray[0]][d]==data[sArray[-1]][d]:
        return [],sArray
    L=len(sArray)
    mHigh=mLow=L//2
    while mLow>0 and data[sArray[mLow]][d]==data[sArray[mLow-1]][d]:
        mLow-=1
    if mLow>0:
        return sArray[:mLow],sArray[mLow:]
    while mHigh<L-1 and data[sArray[mHigh]][d]==data[sArray[mHigh+1]][d]:
        mHigh+=1
    return sArray[:mHigh],sArray[mHigh:]
def Filter(data,d,rArray,sArray):
    L=len(rArray)+len(sArray)
    if d==D-1 and rArray:
        R=max(data[r][d] for r in rArray)
        return [s for s in sArray if data[s][d]>R]
    elif len(rArray)*len(sArray)<=30 or len(rArray)<=2 or len(sArray)<=2:
        nonDominated=[]
        for s in sArray:
            for r in rArray:
                c=Compare(data[s],data[r],d)
                if c>1:
                    break
            else:
                nonDominated.append(s)
        return nonDominated
    S1,S2=BreakAtPseudoMedian(sArray,d)
    R1,R2=BreakAtRefValue(rArray,d,data[S2[0]][d])
    if not S1 and not R1:
        return Filter(data,d+1,rArray,sArray)
    return FindT(R1,R2,S1,S2,d)
#Filter(data,0,[0,1,2,3,4,5,6,7,8,9],[11])
def BreakAtRefValue(rArray,d,br):
    rArray=AuxSort(d,rArray)
    if data[rArray[0]][d]<=br:
        return [],rArray
    if data[rArray[-1]][d]>br:
        return rArray,[]
    mLow,mHigh=0,len(rArray)-1
    while mLow<mHigh-1 and data[rArray[mLow]][d]>br and data[rArray[mHigh]][d]<br:
        mid=(mLow+mHigh)//2
        if data[rArray[mid]][d]>br:
            mLow=mid
        elif data[rArray[mid]][d]<br:
            mHigh=mid
        else:
            mLow=mid
            break
    if data[rArray[mLow]][d]>br and data[rArray[mHigh]][d]<br:
        return rArray[:mHigh],rArray[mHigh:]
    if data[rArray[mLow]][d]==br:
        while data[rArray[mLow-1]][d]==br:
            mLow-=1
        return rArray[:mLow],rArray[mLow:]
    while data[rArray[mHigh-1]][d]==br:
        mHigh-=1
    return rArray[:mHigh],rArray[mHigh:]
def Kung75(data,d,ndxArray):
    L=len(ndxArray)
    if L<10:
        return N2BrutalForce(data,ndxArray,d)
    elif d==D-1:
        x,y=-1,-1
        for n in ndxArray:
            if y<0 or data[n][d]>x:
                x,y=data[n][d],n
        return [y]
    if data[ndxArray[0]][d]==data[ndxArray[-1]][d]:
        return Kung75(data,d+1,AuxSort(d+1,ndxArray))
    R,S=BreakAtPseudoMedian(ndxArray,d)
    R=Kung75(data,d,R)
    S=Kung75(data,d,S)
    T=Filter(data,d+1,R,S)
    return R+T

print('started at',datetime.datetime.today())
resultPrinter(Kung75(data,0,AuxSort(0,list(range(len(data))))),True)
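Since the answer itself asks for thorough testing, here is a small randomized check one could use for that. It is only a sketch: `pareto_bruteforce` and `agrees_with_bruteforce` are hypothetical helpers written for this illustration, and the candidate is assumed to be any function that takes a 2-D integer array and returns the indices of the rows it keeps.

```python
import numpy as np

def pareto_bruteforce(data):
    """Indices of rows that no other row dominates (first copy of duplicates kept)."""
    data = np.asarray(data)
    keep = []
    for j in range(len(data)):
        ge = np.all(data >= data[j], axis=1)   # rows at least as large as row j everywhere
        gt = np.any(data > data[j], axis=1)    # rows strictly larger than row j somewhere
        strictly_dominated = np.any(ge & gt)
        earlier_copy = np.any((ge & ~gt)[:j])  # an identical row that appears before j
        if not (strictly_dominated or earlier_copy):
            keep.append(j)
    return keep

def agrees_with_bruteforce(candidate, trials=20, n=50, d=3, m=5, seed=0):
    """Compare `candidate` with brute force on small random integer inputs."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        data = rng.integers(0, m, size=(n, d))
        # compare row contents, so either copy of a duplicate row is accepted
        expected = sorted(tuple(data[i]) for i in pareto_bruteforce(data))
        got = sorted(tuple(data[i]) for i in candidate(data))
        if got != expected:
            return False
    return True

front = pareto_bruteforce([[12, 7, 1, 0], [7, 1, 2, 0], [1, 1, 1, 1], [2, 2, 0, 0]])
```

Small values of `m` are used on purpose so that ties and duplicate rows show up often. Note that the functions in the answer read the module-level `data` and `D`, so a wrapper around `Kung75` would have to set those before each call.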

We take the cumulative maximum per column of the dataframe.
We then keep every row in which at least one column value equals that cumulative maximum, and drop duplicates using pandas drop_duplicates:
In [14]: df = pd.DataFrame(
...: [[12, 7, 1, 0], [7, 1, 2, 0], [1, 1, 1, 1], [2, 2, 0, 0]],
...: columns=["A", "B", "C", "D"],
...: )
In [15]: df[(df == df.cummax(axis=0)).any(axis=1)].drop_duplicates()
Out[15]:
A B C D
0 12 7 1 0
1 7 1 2 0
2 1 1 1 1
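One thing worth checking before relying on this: cummax depends on row order and compares each row with the column-wise running maxima rather than with individual rows. The sketch below uses two made-up inputs (not from the question) to show both effects; `cummax_filter` is just the expression above wrapped in a function.

```python
import pandas as pd

def cummax_filter(df):
    # the expression from the answer, wrapped for reuse
    return df[(df == df.cummax(axis=0)).any(axis=1)].drop_duplicates()

# Made-up input 1: the second row is larger in every column.
small_first = pd.DataFrame([[1, 1], [2, 2]], columns=["A", "B"])
large_first = small_first.iloc[::-1]
kept_small_first = cummax_filter(small_first)   # both rows survive
kept_large_first = cummax_filter(large_first)   # only [2, 2] survives

# Made-up input 2: no single row is >= [1, 1] in both columns,
# but the running maxima [3, 3] are, so [1, 1] is dropped.
mixed = pd.DataFrame([[3, 0], [0, 3], [1, 1]], columns=["A", "B"])
kept_mixed = cummax_filter(mixed)
```

On the question's four-row example the result happens to match the expected output, so it may be worth comparing against a brute-force check on your own data.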

df.sort_values(by=['A', 'B', 'C', 'D'], ascending=False, inplace=True)
df = df.iloc[:cutoff]
If this takes too long you could do it on subsets of the df until
it is small enough.

Related

Compare rows with conditions and generate a new dataframe in Pandas

I have a very big dataframe with this structure:
Timestamp Val1
Here you can see a real sample:
Timestamp Temp
0 1622471518.92911 36.443
1 1622471525.034114 36.445
2 1622471531.148139 37.447
3 1622471537.284337 36.449
4 1622471543.622588 43.345
5 1622471549.734765 36.451
6 1622471556.2518 36.454
7 1622471562.361368 41.461
8 1622471568.472718 42.468
9 1622471574.826475 36.470
What I want to do is compare the Temp column with itself and, if a value is higher than "X" (for example 4) and the time between the two readings is lower than "Y" (for example 180 min), save some data about them.
Right now I'm using two nested for loops, but this takes too much time, and pandas usually has an option to avoid this.
This is my code:
import datetime as dt

cap_time, maxim = 180, 4
cap_time = cap_time * 60
temps = df['Temp'].values
times = df['Timestamp'].values
results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        print(i,j,len(temps))
        if float(temps[j]) > float(temps[i])*maxim:
            timeIn = dt.datetime.fromtimestamp(float(times[i]))
            timeOut = dt.datetime.fromtimestamp(float(times[j]))
            diff = timeOut - timeIn
            tdiff = diff.total_seconds()
            if tdiff > cap_time:
                break
            else:
                res = [temps[i], temps[j], times[i], times[j], tdiff/60, cap_time/60, maxim]
                results.append(res)
                break
# Then I save it in a dataframe and do other things with it
Can pandas help me achieve my goal and reduce the execution time? I found DataFrame.diff(), but I'm not sure it is what I want (or I don't know how to use it).
Thank you very much.
Short of avoiding the nested for loops, you can already speed things up by avoiding all unnecessary calculations and conversions within the loops. In particular, you can use NumPy broadcasting to define a Boolean array beforehand, in which you can look up whether the condition is met:
import numpy as np
temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]
condition = np.logical_and(temps_diff > maxim,
                           times_diff < cap_time)
results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        if condition[i, j]:
            results.append([temps[i], temps[j],
                            times[i], times[j],
                            times_diff[i, j]])
results
[[36.443, 43.345, 1622471518.92911, 1622471543.622588, 24.693477869033813],
...
[36.454, 42.468, 1622471556.2518, 1622471568.472718, 12.22091794013977]]
To avoid the loops altogether, you could define a 3-dimensional full results array and then use the condition array as a Boolean mask to filter out the results you want:
import numpy as np
n = len(temps)
temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]
condition = np.logical_and(temps_diff > maxim,
times_diff < cap_time)
results_full = np.stack([np.repeat(temps[:, None], n, axis=1),
np.tile(temps, (n, 1)),
np.repeat(times[:, None], n, axis=1),
np.tile(times, (n, 1)),
times_diff])
results = results_full[np.stack(results_full.shape[0] * [condition])]
results.reshape((5, -1)).T
array([[ 3.64430000e+01, 4.33450000e+01, 1.62247152e+09,
1.62247154e+09, 2.46934779e+01],
...
[ 3.64540000e+01, 4.24680000e+01, 1.62247156e+09,
1.62247157e+09, 1.22209179e+01],
...
])
As you can see, the resulting numbers are the same as above, although this time the results array will contain more rows, because we didn't use the shortcut of starting the inner loop at i+1.
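If only the pairs with i < j are wanted, as in the loop version, they can be read straight off the condition matrix without building the 3-dimensional array. The following is a sketch under that assumption; `matching_pairs` is a hypothetical name, and the sample values are the ten rows shown in the question.

```python
import numpy as np

def matching_pairs(temps, times, maxim, cap_time):
    temps = np.asarray(temps, dtype=float)
    times = np.asarray(times, dtype=float)
    temps_diff = temps - temps[:, None]      # [i, j] = temps[j] - temps[i]
    times_diff = times - times[:, None]      # [i, j] = times[j] - times[i]
    condition = (temps_diff > maxim) & (times_diff < cap_time)
    # keep only the part above the diagonal, i.e. pairs with i < j
    i, j = np.nonzero(np.triu(condition, k=1))
    return np.column_stack([temps[i], temps[j], times[i], times[j], times_diff[i, j]])

temps = [36.443, 36.445, 37.447, 36.449, 43.345, 36.451, 36.454, 41.461, 42.468, 36.470]
times = [1622471518.92911, 1622471525.034114, 1622471531.148139, 1622471537.284337,
         1622471543.622588, 1622471549.734765, 1622471556.2518, 1622471562.361368,
         1622471568.472718, 1622471574.826475]
pairs = matching_pairs(temps, times, maxim=4, cap_time=180 * 60)
```

This still allocates n x n arrays, so for a very big dataframe it would have to be applied to windows of rows rather than to the whole column at once.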

What is the way to create DataFrame of length of intersections of a list of sets

I have a dictionary filled with sets. It might look something like this:
import pandas as pd
my_dict = {'gs_1': set(('ENS1', 'ENS2', 'ENS3')),
'gs_2': set(('ENS1', 'ENS4', 'ENS5', 'ENS7', 'ENS8')),
'gs_3': set(('ENS2', 'ENS3', 'ENS6'))}
I've also built a pandas DataFrame that looks something like this:
my_df = pd.DataFrame(columns=my_dict.keys())
my_df.gs_1=[0, 0, 0]
my_df.gs_2=[0, 0, 0]
my_df.gs_3=[0, 0, 0]
my_df.index = my_dict.keys()
my_df
Yields
gs_1 gs_2 gs_3
gs_1 0 0 0
gs_2 0 0 0
gs_3 0 0 0
My goal here is to populate the DataFrame with the length of the intersection between each set as efficiently as possible. The DataFrame doesn't strictly have to be built before-hand and then populated. Right now, my working solution is:
for gs_1 in my_df.index:
    for gs_2 in my_df.columns:
        my_df.loc[gs_1, gs_2] = len(my_dict[gs_1] & my_dict[gs_2])
my_df
Yields, correctly,
gs_1 gs_2 gs_3
gs_1 3 1 2
gs_2 1 5 0
gs_3 2 0 3
My problem is that this is far too slow. In practice, gs_n extends to around 6000, and my projected runtime for this approaches 2 hours. What's the best way to go here?
Here's my approach based on scipy.spatial.distance_matrix:
import numpy as np

# create unions of values
total = set()
for key, val in my_dict.items():
    total = total.union(val)
total = list(total)
# create data frame
df = pd.DataFrame({}, index=total)
for key, val in my_dict.items():
    df[key] = pd.Series(np.ones(len(val)), index=list(val))
df = df.fillna(0).astype(bool)
# return result:
x = df.values
np.sum(x[:,np.newaxis,:]&x[:,:,np.newaxis], axis=0)
#array([[3, 1, 2],
#       [1, 5, 0],
#       [2, 0, 3]], dtype=int32)
# if you want a data frame:
new_df = pd.DataFrame(np.sum(x[:,np.newaxis,:]&x[:,:,np.newaxis],
                             axis=0),
                      index=df.columns, columns=df.columns)
Took 11s for 6000 gs_ and 100 unique values:
max_total = 100
my_dict = {}
for i in range(6000):
    np.random.seed(i)
    sample_size = np.random.randint(1,max_total)
    my_dict[i] = np.random.choice(np.arange(max_total), replace=False, size=sample_size)
Edit: if you have a large number of unique values, you can work on small subsets and add up the partial results. Something like:
chunk_size = 100
num_gs = len(my_dict)
ans = np.zeros((num_gs, num_gs))
for start in range(0, len(total), chunk_size):
    chunk = total[start:start+chunk_size]
    df = pd.DataFrame({}, index=chunk)
    for key, val in my_dict.items():
        sub_set = val.intersection(set(chunk))
        df[key] = pd.Series(np.ones(len(sub_set)), index=list(sub_set))
    df = df.fillna(0).astype(bool)
    # return result:
    x = df.values
    ans += np.sum(x[:,np.newaxis,:]&x[:,:,np.newaxis], axis=0)
With 14000 unique values, that would take approximately 140 * 15 ≈ 2000 seconds. Not so fast, but significantly less than 2 hours :-).
You can also increase chunk_size if your memory allows. That was the limit of my 8 GB RAM system :-).
Also, it is possible to parallelize on the subsets (chunk) as well.
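The sum computed above, x[k, i] & x[k, j] added up over all values k, is the (i, j) entry of a matrix product of the membership table with its own transpose. Here is a sketch built on that observation, which avoids the 3-dimensional temporary; `intersection_sizes` is a hypothetical helper, and it assumes a dense 0/1 table of shape (number of sets, number of unique values) fits in memory.

```python
import numpy as np
import pandas as pd

def intersection_sizes(sets_by_name):
    names = list(sets_by_name)
    universe = sorted(set().union(*sets_by_name.values()))
    column = {value: k for k, value in enumerate(universe)}
    # membership[i, k] is 1 when set i contains value k
    membership = np.zeros((len(names), len(universe)), dtype=np.int32)
    for row, name in enumerate(names):
        membership[row, [column[v] for v in sets_by_name[name]]] = 1
    # (membership @ membership.T)[i, j] counts the values shared by sets i and j
    return pd.DataFrame(membership @ membership.T, index=names, columns=names)

my_dict = {'gs_1': {'ENS1', 'ENS2', 'ENS3'},
           'gs_2': {'ENS1', 'ENS4', 'ENS5', 'ENS7', 'ENS8'},
           'gs_3': {'ENS2', 'ENS3', 'ENS6'}}
sizes = intersection_sizes(my_dict)
```

For 6000 sets over 14000 values the int32 table is about 336 MB; beyond that, a sparse representation would probably be the next thing to try, which is not shown here.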
Quang's solution worked well, but it broke down when I tried to put it into practice; even with the chunking solution, I ran into memory issues at the last step:
ans += np.sum(x[:,np.newaxis,:]&x[:,:,np.newaxis], axis=0)
I decided to take an alternative approach, and I managed to find a solution that was both faster and more memory-efficient when applied to the problem:
import pandas as pd
import itertools
import numpy as np
my_dict = {'gs_1': set(('ENS1', 'ENS2', 'ENS3')),
'gs_2': set(('ENS1', 'ENS4', 'ENS5', 'ENS7', 'ENS8')),
'gs_3': set(('ENS2', 'ENS3', 'ENS6'))}
gs_series = pd.Series({a:b for a,b in zip(itertools.combinations_with_replacement(my_dict.keys(),2),
[len(c&d) for c,d in itertools.combinations_with_replacement(my_dict.values(),2)])})
gs_df = gs_series.unstack()
proper_index = gs_series.index.get_level_values(0).unique()
gs_df = gs_df.reindex(proper_index)[proper_index.values].copy()
i_lower = np.tril_indices(np.array(len(gs_df.columns)), -1)
gs_matrix = gs_df.values
gs_matrix[i_lower] = gs_matrix.T[i_lower]
gs_df
This yields, correctly,
gs_1 gs_2 gs_3
gs_1 3.0 1.0 2.0
gs_2 1.0 5.0 0.0
gs_3 2.0 0.0 3.0
The basic idea was to build a dictionary with the length of the intersection between each 2 sets using itertools, and convert that to a pd.Series. itertools.combinations_with_replacement performs each comparison once, so upon unstacking of the pd.Series, we have the (unordered) top right triangle of the matrix. Sorting the rows and the columns by our original index leaves us with a correctly populated top right triangle, and all that's left to do is reflect that onto the bottom left triangle of the matrix. I wound up using ~8 GB of RAM for a 5200x5200 matrix comparison, where there are ~17000 possible unique values to fill in each set and each set contains 10-1000 unique values. This finished in a matter of minutes.

Find when the values of a pandas.Series change by at least x

I have a time series s stored as a pandas.Series and I need to find when the value tracked by the time series changes by at least x.
In pseudocode:
print s(0)
s* = s(0)
for all t in ]0, t_max]:
    if |s(t) - s*| > x:
        s* = s(t)
        print s*
Naively, this can be coded in Python as follows:
import pandas as pd

def find_changes(s, x):
    changes = []
    s_last = None
    for index, value in s.items():  # Series.iteritems() was removed in pandas 2.0
        if s_last is None:
            s_last = value
        if value-s_last > x or s_last-value > x:
            changes += [index, value]
            s_last = value
    return changes
My data set is large, so I can't just use the method above. Moreover, I cannot use Cython or Numba due to limitations of the framework I will run this on. I can (and plan to) use pandas and NumPy.
I'm looking for some guidance on what NumPy vectorized/optimized methods to use and how.
Thanks!
EDIT: Changed code to match pseudocode.
I don't know if I am understanding you correctly, but here is how I interpreted the problem:
import pandas as pd
import numpy as np
# Our series of data.
data = pd.DataFrame(np.random.rand(10), columns = ['value'])
# The threshold.
threshold = .33
# For each point t, grab t - 1.
data['value_shifted'] = data['value'].shift(1)
# Absolute difference of t and t - 1.
data['abs_change'] = abs(data['value'] - data['value_shifted'])
# Test against the threshold.
data['change_exceeds_threshold'] = np.where(data['abs_change'] > threshold, 1, 0)
print(data)
Giving:
value value_shifted abs_change change_exceeds_threshold
0 0.005382 NaN NaN 0
1 0.060954 0.005382 0.055573 0
2 0.090456 0.060954 0.029502 0
3 0.603118 0.090456 0.512661 1
4 0.178681 0.603118 0.424436 1
5 0.597814 0.178681 0.419133 1
6 0.976092 0.597814 0.378278 1
7 0.660010 0.976092 0.316082 0
8 0.805768 0.660010 0.145758 0
9 0.698369 0.805768 0.107400 0
I don't think the pseudocode can be vectorized, because the next state of s* depends on the previous state. Here is a pure Python solution (a single pass):
import random
import pandas as pd
s = [random.randint(0,100) for _ in range(100)]
res = [] # record changes
thres = 20
ss = s[0]
for i in range(len(s)):
    if abs(s[i] - ss) > thres:
        ss = s[i]
        res.append([i, s[i]])
df = pd.DataFrame(res, columns=['index', 'value'])
I think there's no way to run faster than O(N) in this case.
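For a pandas.Series input, the same single pass can run over the underlying NumPy array, which avoids the per-element overhead of iterating the Series itself. This is a sketch only: it follows the pseudocode from the question (the first value is always reported), the name `find_changes` is reused from the question, and returning a sub-Series is my own choice.

```python
import pandas as pd

def find_changes(s, x):
    """Return the entries of `s` where the value moved by more than `x`
    from the last reported value (the first entry is always reported)."""
    values = s.to_numpy()
    if len(values) == 0:
        return s.iloc[:0]
    keep = [0]                 # positions to report
    last = values[0]           # s* in the pseudocode
    for pos in range(1, len(values)):
        v = values[pos]
        if abs(v - last) > x:
            keep.append(pos)
            last = v
    return s.iloc[keep]

s = pd.Series([0, 1, 5, 6, 12, 11, 0])
changes = find_changes(s, 4)
```

The loop is still O(N) in Python, in line with the point above that the recurrence cannot be removed; only the constant factor is reduced.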

Performance enhancement of ranking function by replacement of lambda x with vectorization

I have a ranking function that I apply to a large number of columns of several million rows which takes minutes to run. By removing all of the logic preparing the data for application of the .rank( method, i.e., by doing this:
ranked = df[['period_id', 'sector_name'] + to_rank].groupby(['period_id', 'sector_name']).transform(lambda x: (x.rank(ascending = True) - 1)*100/len(x))
I managed to get this down to seconds. However, I need to retain my logic, and am struggling to restructure my code: ultimately, the largest bottleneck is my double use of lambda x:, but clearly other aspects are slowing things down (see below). I have provided a sample data frame, together with my ranking functions below, i.e. an MCVE. Broadly, I think that my questions boil down to:
(i) How can one replace the .apply(lambda x usage in the code with a fast, vectorized equivalent?
(ii) How can one loop over multi-indexed, grouped, data frames and apply a function? In my case, to each unique combination of the date_id and category columns.
(iii) What else can I do to speed up my ranking logic? the main overhead seems to be in .value_counts(). This overlaps with (i) above; perhaps one can do most of this logic on df, perhaps via construction of temporary columns, before sending for ranking. Similarly, can one rank the sub-dataframe in one call?
(iv) Why use pd.qcut() rather than df.rank()? the latter is cythonized and seems to have more flexible handling of ties, but I cannot see a comparison between the two, and pd.qcut() seems most widely used.
Sample input data is as follows:
import pandas as pd
import numpy as np
import random
to_rank = ['var_1', 'var_2', 'var_3']
df = pd.DataFrame({'var_1' : np.random.randn(1000), 'var_2' : np.random.randn(1000), 'var_3' : np.random.randn(1000)})
df['date_id'] = np.random.choice(range(2001, 2012), df.shape[0])
df['category'] = ','.join(chr(random.randrange(97, 97 + 4 + 1)).upper() for x in range(1,df.shape[0]+1)).split(',')
The two ranking functions are:
def rank_fun(df, to_rank): # calls ranking function f(x) to rank each category at each date
    #extra data tidying logic here beyond scope of question - can remove
    ranked = df[to_rank].apply(lambda x: f(x))
    return ranked

def f(x):
    nans = x[np.isnan(x)] # Remove nans as these will be ranked with 50
    sub_df = x.dropna() #
    nans_ranked = nans.replace(np.nan, 50) # give nans rank of 50
    if len(sub_df.index) == 0: #check not all nan. If no non-nan data, then return with rank 50
        return nans_ranked
    if len(sub_df.unique()) == 1: # if all data has same value, return rank 50
        sub_df[:] = 50
        return sub_df
    #Check that we don't have too many clustered values, such that we can't bin due to overlap of ties, and reduce bin size provided we can at least quintile rank.
    max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
    max_bins = len(sub_df) / max_cluster
    if max_bins > 100: #if largest cluster <1% of available data, then we can percentile_rank
        max_bins = 100
    if max_bins < 5: #if we don't have the resolution to quintile rank then assume no data.
        sub_df[:] = 50
        return sub_df
    bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
    sub_df_ranked = pd.qcut(sub_df, bins, labels=False) #currently using pd.qcut. pd.rank( seems to have extra functionality, but overheads similar in practice
    sub_df_ranked *= (100 / bins) #Since we bin using the resolution specified in bins, to convert back to decile rank, we have to multiply by 100/bins. E.g. with quintiles, we'll have scores 1 - 5, so have to multiply by 100 / 5 = 20 to convert to percentile ranking
    ranked_df = pd.concat([sub_df_ranked, nans_ranked])
    return ranked_df
And the code to call my ranking function and recombine with df is:
# ensure don't get duplicate columns if ranking already executed
ranked_cols = [col + '_ranked' for col in to_rank]
ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
df = df.join(ranked[ranked_cols])
I am trying to get this ranking logic as fast as I can, by removing both lambda x calls; I can remove the logic in rank_fun so that only f(x)'s logic is applicable, but I also don't know how to process multi-index dataframes in a vectorized fashion. An additional question would be on differences between pd.qcut( and df.rank(: it seems that both have different ways of dealing with ties, but the overheads seem similar, despite the fact that .rank( is cythonized; perhaps this is misleading, given the main overheads are due to my usage of lambda x.
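As a sketch for question (i): if plain percentile ranks are acceptable in place of the binning logic in f(x), which is an assumption and not something the question settles, pandas can rank every column inside every group in one call with GroupBy.rank, with no lambda involved. `rank_within_groups` is a hypothetical helper; NaNs are given the score 50, as in f(x).

```python
import numpy as np
import pandas as pd

def rank_within_groups(df, group_cols, to_rank):
    # rank(pct=True) gives rank / number of non-NaN values in the group; NaN stays NaN
    pct = df.groupby(group_cols)[to_rank].rank(pct=True)
    return (pct * 100).fillna(50).add_suffix("_ranked")

df = pd.DataFrame({
    "date_id": [2001, 2001, 2001, 2002, 2002],
    "category": ["A", "A", "A", "A", "A"],
    "var_1": [3.0, np.nan, 1.0, 5.0, 4.0],
})
ranked = rank_within_groups(df, ["date_id", "category"], ["var_1"])
df = df.join(ranked)
```

The scores differ from both the qcut bins and the (rank - 1) * 100 / len(x) formula quoted earlier in the question, so the result would need rescaling if either of those conventions has to be kept.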
I ran %lprun on f(x) which gave me the following results, although the main overhead is the use of .apply(lambda x rather than a vectorized approach:
Line # Hits Time Per Hit % Time Line Contents
2 def tst_fun(df, field):
3 1 685 685.0 0.2 x = df[field]
4 1 20726 20726.0 5.8 nans = x[np.isnan(x)]
5 1 28448 28448.0 8.0 sub_df = x.dropna()
6 1 387 387.0 0.1 nans_ranked = nans.replace(np.nan, 50)
7 1 5 5.0 0.0 if len(sub_df.index) == 0:
8 pass #check not empty. May be empty due to nans for first 5 years e.g. no revenue/operating margin data pre 1990
9 return nans_ranked
10
11 1 65559 65559.0 18.4 if len(sub_df.unique()) == 1:
12 sub_df[:] = 50 #e.g. for subranks where all factors had nan so ranked as 50 e.g. in 1990
13 return sub_df
14
15 #Finally, check that we don't have too many clustered values, such that we can't bin, and reduce bin size provided we can at least quintile rank.
16 1 74610 74610.0 20.9 max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
17 # print(counts)
18 1 9 9.0 0.0 max_bins = len(sub_df) / max_cluster #
19
20 1 3 3.0 0.0 if max_bins > 100:
21 1 0 0.0 0.0 max_bins = 100 #if largest cluster <1% of available data, then we can percentile_rank
22
23
24 1 0 0.0 0.0 if max_bins < 5:
25 sub_df[:] = 50 #if we don't have the resolution to quintile rank then assume no data.
26
27 # return sub_df
28
29 1 1 1.0 0.0 bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
30
31 #should track bin resolution for all data. To add.
32
33 #if get here, then neither nans_ranked, nor sub_df are empty
34 # sub_df_ranked = pd.qcut(sub_df, bins, labels=False)
35 1 160530 160530.0 45.0 sub_df_ranked = (sub_df.rank(ascending = True) - 1)*100/len(x)
36
37 1 5777 5777.0 1.6 ranked_df = pd.concat([sub_df_ranked, nans_ranked])
38
39 1 1 1.0 0.0 return ranked_df
I'd build a function using numpy
I plan on using this within each group defined within a pandas groupby
def rnk(df):
    a = df.values.argsort(0)
    n, m = a.shape
    r = np.arange(a.shape[1])
    b = np.empty_like(a)
    b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    return pd.DataFrame(b / n, df.index, df.columns)
gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']
df.groupby(gcols)[rcols].apply(rnk).add_suffix('_ranked')
var_1_ranked var_2_ranked var_3_ranked
0 0.333333 0.809524 0.428571
1 0.160000 0.360000 0.240000
2 0.153846 0.384615 0.461538
3 0.000000 0.315789 0.105263
4 0.560000 0.200000 0.160000
...
How It Works
Because I know that ranking is related to sorting, I want to use some clever sorting to do this quicker.
numpy's argsort will produce a permutation that can be used to slice the array into a sorted array.
a = np.array([25, 300, 7])
b = a.argsort()
print(b)
[2 0 1]
print(a[b])
[ 7 25 300]
So, instead, I'm going to use the argsort to tell me where the first, second, and third ranked elements are.
# create an empty array that is the same size as b or a
# but these will be ranks, so I want them to be integers
# so I use empty_like(b) because b is the result of
# argsort and is already integers.
u = np.empty_like(b)
# now just like when I sliced a above with a[b]
# I slice u the same way but instead I assign to
# those positions, the ranks I want.
# In this case, I defined the ranks as np.arange(b.size) + 1
u[b] = np.arange(b.size) + 1
print(u)
[2 3 1]
And that was exactly correct. The 7 was in the last position but was our first rank. 300 was in the second position and was our third rank. 25 was in the first position and was our second rank.
Finally, I divide by the number of rows being ranked to get the percentiles. It so happens that because I used zero-based ranking np.arange(n), as opposed to one-based np.arange(1, n+1) or np.arange(n) + 1 as in our example, I can do the simple division to get the percentiles.
What's left to do is apply this logic to each group. We can do this in pandas with groupby.
Some of the missing details include how I use argsort(0) to get independent sorts per column and that I do some fancy slicing to rearrange each column independently.
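To make those two details concrete, here is the 2-D version of the small example above as a sketch. `column_ranks` is a hypothetical name for the core of rnk, and the 3 x 2 input is made up: its first column is the 1-D example, its second is arbitrary.

```python
import numpy as np

def column_ranks(values):
    a = values.argsort(axis=0)        # one independent sort permutation per column
    n, m = a.shape
    ranks = np.empty_like(a)
    # For column c, a[:, c] lists the row positions in ascending order of value.
    # Writing 0..n-1 into those positions, column by column, yields zero-based ranks.
    ranks[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    return ranks

values = np.array([[25, 3],
                   [300, 1],
                   [7, 2]])
ranks = column_ranks(values)   # [[1, 2], [2, 0], [0, 1]]
percentiles = ranks / len(values)
```

The index pair `(a, np.arange(m)[None, :])` broadcasts to shape (n, m), so every column gets its own assignment in a single vectorized step.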
Can we avoid the groupby and have numpy do the whole thing?
I'll also take advantage of numba's just in time compiling to speed up some things with njit
from numba import njit

@njit
def count_factor(f):
    c = np.arange(f.max() + 2) * 0
    for i in f:
        c[i + 1] += 1
    return c

@njit
def factor_fun(f):
    c = count_factor(f)
    cc = c[:-1].cumsum()
    return c[1:][f], cc[f]

def lexsort(a, f):
    n, m = a.shape
    f = f * (a.max() - a.min() + 1)
    return (f.reshape(-1, 1) + a).argsort(0)

def rnk_numba(df, gcols, rcols):
    tups = list(zip(*[df[c].values.tolist() for c in gcols]))
    f = pd.Series(tups).factorize()[0]
    a = lexsort(np.column_stack([df[c].values for c in rcols]), f)
    c, cc = factor_fun(f)
    c = c[:, None]
    cc = cc[:, None]
    n, m = a.shape
    r = np.arange(a.shape[1])
    b = np.empty_like(a)
    b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    return pd.DataFrame((b - cc) / c, df.index, rcols).add_suffix('_ranked')
How it works
Honestly, this is difficult to process mentally. I'll stick with expanding on what I explained above.
I want to use argsort again to drop rankings into the correct positions. However, I have to contend with the grouping columns. So what I do is compile a list of tuples and factorize them as was addressed in this question here
Now that I have a factorized set of tuples I can perform a modified lexsort that sorts within my factorized tuple groups. This question addresses the lexsort.
A tricky bit remains to be addressed: I must offset the newly found ranks by the size of each group so that I get fresh ranks for every group. This is taken care of in the tiny snippet b - cc in the code above. But calculating cc is a necessary component.
So that's some of the high level philosophy. What about @njit?
Note that when I factorize, I am mapping to the integers 0 to n - 1 where n is the number of unique grouping tuples. I can use an array of length n as a convenient way to track the counts.
In order to accomplish the groupby offset, I needed to track the counts and cumulative counts in the positions of those groups as they are represented in the list of tuples or the factorized version of those tuples. I decided to do a linear scan through the factorized array f and count the observations in a numba loop. While I had this information, I'd also produce the necessary information to produce the cumulative offsets I also needed.
numba provides an interface to produce highly efficient compiled functions. It is finicky and you have to acquire some experience to know what is possible and what isn't possible. I decided to numbafy two functions that are preceded by the numba decorator @njit. This code works just as well without those decorators, but is sped up with them.
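The counts and cumulative offsets described above can also be written without numba. The sketch below uses np.bincount as a stand-in for the counting loop; that substitution and the name `group_sizes_and_offsets` are mine, not the answer's. It is meant to show what c and cc contain for a small factorized array.

```python
import numpy as np

def group_sizes_and_offsets(f):
    counts = np.bincount(f)                                  # rows per group code
    # rows belonging to all lower-numbered groups (an exclusive cumulative sum)
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))
    return counts[f], offsets[f]                             # one entry per row

f = np.array([0, 1, 0, 2, 1, 0])      # factorized group codes, one per row
sizes, starts = group_sizes_and_offsets(f)
# sizes  -> [3, 2, 3, 1, 2, 3]
# starts -> [0, 3, 0, 5, 3, 0]
```

Subtracting `starts` from a global rank restarts the numbering at zero inside each group, and dividing by `sizes` turns it into a within-group fraction, which is what (b - cc) / c does.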
Timing
%%timeit
ranked_cols = [col + '_ranked' for col in to_rank]
ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
1 loop, best of 3: 481 ms per loop
gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']
%timeit df.groupby(gcols)[rcols].apply(rnk_numpy).add_suffix('_ranked')
100 loops, best of 3: 16.4 ms per loop
%timeit rnk_numba(df, gcols, rcols).head()
1000 loops, best of 3: 1.03 ms per loop
I suggest you try this code. It's 3 times faster than yours, and clearer.
rank function:
def rank(x):
    counts = x.value_counts()
    bins = int(0 if len(counts) == 0 else x.count() / counts.iloc[0])
    bins = 100 if bins > 100 else bins
    if bins < 5:
        return x.apply(lambda x: 50)
    else:
        return (pd.qcut(x, bins, labels=False) * (100 / bins)).fillna(50).astype(int)
single thread apply:
for col in to_rank:
    df[col + '_ranked'] = df.groupby(['date_id', 'category'])[col].apply(rank)
multi-process apply:
from multiprocessing import Pool
def tfunc(col):
    return df.groupby(['date_id', 'category'])[col].apply(rank)
pool = Pool(len(to_rank))
# sys.maxint no longer exists in Python 3; a plain get() blocks until all workers finish
result = pool.map_async(tfunc, to_rank).get()
for (col, val) in zip(to_rank, result):
    df[col + '_ranked'] = val

Is there a lexicographical version of searchsorted in numpy?

I have two arrays which are lex-sorted.
In [2]: a = np.array([1,1,1,2,2,3,5,6,6])
In [3]: b = np.array([10,20,30,5,10,100,10,30,40])
In [4]: ind = np.lexsort((b, a)) # sorts elements first by a and then by b
In [5]: print(a[ind])
[1 1 1 2 2 3 5 6 6]
In [7]: print(b[ind])
[ 10 20 30 5 10 100 10 30 40]
I want to do a binary search for (2, 7) and (5, 150) expecting (4, 7) as the answer.
In [6]: np.lexsearchsorted((a,b), ([2, 5], [7,150]))
We have the searchsorted function, but that works only on 1D arrays.
EDIT: Edited to reflect comment.
from math import floor
def comp_leq(t1, t2):
    # Returns 1 if t1 <= t2 in lexicographic order, 0 otherwise
    if (t1[0] > t2[0]) or ((t1[0] == t2[0]) and (t1[1] > t2[1])):
        return 0
    else:
        return 1
def bin_search(L, item):
    x = L[:]
    while len(x) > 1:
        index = int(floor(len(x)/2) - 1)
        # Check item
        if comp_leq(x[index], item):
            x = x[index+1:]
        else:
            x = x[:index+1]
    out = L.index(x[0])
    # If greater than all
    if tuple(item) >= L[-1]:
        return len(L)
    else:
        return out
def lexsearch(a, b, items):
    # zip is lazy in Python 3, so build a list that can be sliced and indexed
    z = list(zip(a, b))
    return [bin_search(z, item) for item in items]
if __name__ == '__main__':
    a = [1,1,1,2,2,3,5,6,6]
    b = [10,20,30,5,10,100,10,30,40]
    print(lexsearch(a, b, ([2,7],[5,150])))  # prints [4, 7]
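For plain Python lists, the same left insertion point can also be obtained from the standard library, because tuples already compare lexicographically. This is only a sketch under that assumption; `lexsearch_bisect` is a name I made up, and it is not the NumPy function the question asks for.

```python
from bisect import bisect_left


def lexsearch_bisect(a, b, items):
    """Leftmost insertion index of each (a, b) query in the lex-sorted rows."""
    rows = list(zip(a, b))  # rows must already be sorted by a, then by b
    return [bisect_left(rows, tuple(item)) for item in items]


a = [1, 1, 1, 2, 2, 3, 5, 6, 6]
b = [10, 20, 30, 5, 10, 100, 10, 30, 40]
print(lexsearch_bisect(a, b, ([2, 7], [5, 150])))  # [4, 7]
```

Unlike the slicing version, this does not copy the list on every step, so each lookup stays logarithmic once `rows` has been built.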
This code seems to do it for a set of (exactly) 2 lexsorted arrays.
You might be able to make it faster if you create a set of values[-1], and then create a dictionary with the boundaries for them.
I haven't checked other cases apart from the posted one, so please verify it's not bugged.
def lexsearchsorted_2(arrays, values, side='left'):
    assert len(arrays) == 2
    assert (np.lexsort(arrays) == np.arange(len(arrays[0]))).all()
    # here it will be faster to work on all equal values in 'values[-1]' at once
    boundries_l = np.searchsorted(arrays[-1], values[-1], side='left')
    boundries_r = np.searchsorted(arrays[-1], values[-1], side='right')
    # a recursive definition here will make it work for more than 2 lexsorted arrays
    return tuple([boundries_l[i] +
                  np.searchsorted(arrays[-2][boundries_l[i]:boundries_r[i]],
                                  values[-2][i],
                                  side=side)
                  for i in range(len(boundries_l))])
Usage:
import numpy as np
a = np.array([1,1,1,2,2,3,5,6,6])
b = np.array([10,20,30,5,10,100,10,30,40])
lexsearchsorted_2((b, a), ([7,150], [2, 5])) # return (4, 7)
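The speed-up hinted at above (look up the boundaries once per distinct value of the primary key and keep them in a dictionary) might be sketched as follows. The name `lexsearchsorted_cached` is hypothetical, the sketch still handles exactly two arrays, and I have not measured whether it is actually faster.

```python
import numpy as np


def lexsearchsorted_cached(arrays, values, side="left"):
    """Two-key search; arrays and values use np.lexsort order (last key is primary)."""
    secondary, primary = (np.asarray(x) for x in arrays)
    v_secondary, v_primary = (np.asarray(x) for x in values)
    # Compute the [left, right) run of each distinct primary query value once.
    bounds = {
        p: (np.searchsorted(primary, p, side="left"),
            np.searchsorted(primary, p, side="right"))
        for p in np.unique(v_primary)
    }
    out = np.empty(len(v_primary), dtype=np.intp)
    for i, (p, s) in enumerate(zip(v_primary, v_secondary)):
        lo, hi = bounds[p]
        # Search the secondary key only inside the run of equal primary keys.
        out[i] = lo + np.searchsorted(secondary[lo:hi], s, side=side)
    return out


a = np.array([1, 1, 1, 2, 2, 3, 5, 6, 6])
b = np.array([10, 20, 30, 5, 10, 100, 10, 30, 40])
print(lexsearchsorted_cached((b, a), ([7, 150], [2, 5])))  # [4 7]
```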
I ran into the same issue and came up with a different solution. You can treat the multi-column data instead as single entries using a structured data type. A structured data type will allow one to use argsort/sort on the data (instead of lexsort, although lexsort appears faster at this stage) and then use the standard searchsorted. Here is an example:
import numpy as np
from itertools import repeat
# Setup our input data
# Every row is an entry, every column what we want to sort by
# Unlike lexsort, this takes columns in decreasing priority, not increasing
a = np.array([1,1,1,2,2,3,5,6,6])
b = np.array([10,20,30,5,10,100,10,30,40])
data = np.transpose([a,b])
# Sort the data
data = data[np.lexsort(data.T[::-1])]
# Convert to a structured data-type
# (zip is lazy in Python 3, so wrap it in list() before passing it to np.dtype)
dt = np.dtype(list(zip(repeat(''), repeat(data.dtype, data.shape[1])))) # the structured dtype
# The dtype change leaves a trailing dimension of size 1; ascontiguousarray is required for the dtype change
data = np.ascontiguousarray(data).view(dt).squeeze(-1)
# You can also first convert to the structured data-type with the two lines above then use data.sort()/data.argsort()/np.sort(data)
# Search the data
values = np.array([(2,7),(5,150)], dtype=dt) # note: when using structured data types the rows must be a tuple
pos = np.searchsorted(data, values)
# pos is (4,7) in this example, exactly what you would want
This works for any number of columns, uses the built-in numpy functions, the columns remain in the "logical" order (decreasing priority), and it should be quite fast.
I compared the two numpy-based methods time-wise.
#1 is the recursive method from @j0ker5 (the one below extends his example with his suggestion of recursion and works with any number of lexsorted rows)
#2 is the structured array from me
They both take the same inputs, basically like searchsorted except a and v are as per lexsort.
import numpy as np
def lexsearch1(a, v, side='left', sorter=None):
    def _recurse(a, v):
        if a.shape[1] == 0: return 0
        if a.shape[0] == 1: return a.squeeze(0).searchsorted(v.squeeze(0), side)
        bl = np.searchsorted(a[-1,:], v[-1], side='left')
        br = np.searchsorted(a[-1,:], v[-1], side='right')
        return bl + _recurse(a[:-1,bl:br], v[:-1])
    a,v = np.asarray(a), np.asarray(v)
    if v.ndim == 1: v = v[:,np.newaxis]
    assert a.ndim == 2 and v.ndim == 2 and a.shape[0] == v.shape[0] and a.shape[0] > 1
    if sorter is not None: a = a[:,sorter]
    bl = np.searchsorted(a[-1,:], v[-1,:], side='left')
    br = np.searchsorted(a[-1,:], v[-1,:], side='right')
    for i in range(len(bl)): bl[i] += _recurse(a[:-1,bl[i]:br[i]], v[:-1,i])
    return bl
def lexsearch2(a, v, side='left', sorter=None):
    from itertools import repeat
    a,v = np.asarray(a), np.asarray(v)
    if v.ndim == 1: v = v[:,np.newaxis]
    assert a.ndim == 2 and v.ndim == 2 and a.shape[0] == v.shape[0] and a.shape[0] > 1
    # zip is lazy in Python 3, so wrap it in list() before passing it to np.dtype
    a_dt = np.dtype(list(zip(repeat(''), repeat(a.dtype, a.shape[0]))))
    v_dt = np.dtype(list(zip(a_dt.names, repeat(v.dtype, a.shape[0]))))
    # Same memory layout as asfortranarray(a[::-1,:]), but C-ordered: newer NumPy
    # versions may reject a dtype-size-changing view of an F-ordered array
    a = np.ascontiguousarray(a[::-1,:].T).view(a_dt).squeeze(-1)
    v = np.ascontiguousarray(v[::-1,:].T).view(v_dt).squeeze(-1)
    return a.searchsorted(v, side, sorter).ravel()
a = np.random.randint(100, size=(2,10000)) # Values to sort, rows in increasing priority
v = np.random.randint(100, size=(2,10000)) # Values to search for, rows in increasing priority
sorted_idx = np.lexsort(a)
a_sorted = a[:,sorted_idx]
And the timing results (in IPython):
# 2 rows
%timeit lexsearch1(a_sorted, v)
10 loops, best of 3: 33.4 ms per loop
%timeit lexsearch2(a_sorted, v)
100 loops, best of 3: 14 ms per loop
# 10 rows
%timeit lexsearch1(a_sorted, v)
10 loops, best of 3: 103 ms per loop
%timeit lexsearch2(a_sorted, v)
100 loops, best of 3: 14.7 ms per loop
Overall the structured array approach is faster, and can be made even faster if you design it to work with the flipped and transposed versions of a and v. Its advantage grows as the number of rows/keys goes up: it barely slows down when going from 2 rows to 10 rows.
I did not notice any significant timing difference between using a_sorted or a and sorter=sorted_idx so I left those out for clarity.
I believe that a really fast method could be made using Cython, but this is as fast as it is going to get with pure Python and numpy.
