Longest chain of points satisfying given condition - python

I have a graph of a discrete set of points.
y x
0 1.000000 1000.000000
1 0.999415 1000.000287
2 0.999420 1000.000358
3 0.999376 1000.000609
4 0.999239 1000.000788
5 0.999011 1000.000967
6 1.000389 1000.001433
7 0.999871 1000.001756
8 0.995070 1000.002723
9 0.996683 1000.003404
I want to determine the longest chain of consecutive points where the slope of the line connecting i-1 to i remains within a given range epsilon = 0.4.
def tangent(df, pt1, pt2):
    y = df.iloc[pt2]['y'] - df.iloc[pt1]['y']
    x = df.iloc[pt2]['x'] - df.iloc[pt1]['x']
    return x/y
The data has been normalized to scale the tangent results.
index = 1
while index < df.shape[0]:
    if abs(math.tan(tangent(df,index-1,index) * math.pi)) < epsilon:
        print("result:",index)
    index += 1
The snippet is the draft to detect all such points.

You can simplify the code using pandas methods which apply to whole column (Series):
import numpy as np
...
# equivalent of your `tangent` function
# df['x'].diff() will return a column where every row is
# actual rows difference with previous one
df['tang'] = df['x'].diff()/df['y'].diff()
# np.tan will calculate the tan of whole column values at once;
# np.abs and the factor np.pi mirror the check in your loop
matches = np.abs(np.tan(df['tang'] * np.pi)) < epsilon
# Get longest chain
(~matches).cumsum()[matches].value_counts().max()
More info:
Pandas diff function
Getting longest True chain
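To see why the cumsum trick finds the longest chain, here is a minimal sketch on a hand-made boolean Series. The helper name longest_true_run and the sample mask are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

def longest_true_run(mask: pd.Series) -> int:
    """Length of the longest run of consecutive True values in a boolean Series."""
    if not mask.any():
        return 0
    # Every False bumps the counter, so all True values of one run share an id.
    run_id = (~mask).cumsum()
    # Keep only the True rows and count how many rows carry each run id.
    return int(run_id[mask].value_counts().max())

mask = pd.Series([False, True, True, False, True, True, True, False])
print(longest_true_run(mask))  # 3
```

The explicit check for an all-False mask avoids returning NaN when no point matches.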

Related

Find where the slope changes in my data as a parameter that can be easily indexed and extracted

I have the following data:
0.8340502011561366
0.8423491600218922
0.8513456021654467
0.8458192388553084
0.8440111276014195
0.8489589671423143
0.8738088120491972
0.8845129900705279
0.8988298998926688
0.924633964692693
0.9544790734065157
0.9908034431246875
1.0236430466543138
1.061619773027915
1.1050038249835414
1.1371449802490126
1.1921182610371368
1.2752207659022576
1.344047620255176
1.4198117350668353
1.507943067143741
1.622137968203745
1.6814098429502085
1.7646810054280595
1.8485457435775694
1.919591124757554
1.9843144220593145
2.030158014640226
2.018184122476175
2.0323466012624207
2.0179200409023874
2.0316932950853723
2.013683870089898
2.03010703506514
2.0216151623726977
2.038855467786505
2.0453923522466093
2.03759031642753
2.019424996752278
2.0441806106428606
2.0607521369415136
2.059310067318373
2.0661157975162485
2.053216429539864
2.0715123971225564
2.0580473413362075
2.055814512721712
2.0808278560688964
2.0601637029377113
2.0539429365156003
2.0609648613513754
2.0585135712612646
2.087674625814453
2.062482961966647
2.066476100210777
2.0568444178944967
2.0587903943282266
2.0506399365756396
The data plotted looks like:
I want to find the point where the slope changes in sign (I circled it in black. Should be around index 26):
I need to find this point of change for several hundred files. So far I tried the recommendation from this post:
Finding the point of a slope change as a free parameter- Python
I think that because my data is a bit noisy, I am not getting a smooth transition in the change of the slope.
This is the code I have tried so far:
import sys
import numpy as np
#load 1-D data file
file = str(sys.argv[1])
y = np.loadtxt(file)
#create X based on file length
x = np.linspace(1,len(y), num=len(y))
#Find first derivative
m = np.diff(y)/np.diff(x)
print(m)
#Find second derivative
b = np.diff(m)
print(b)
#find Index
index = 0
for difference in b:
    index += 1
    if difference < 0:
        print(index, difference)
Since my data is noisy, I get some negative values before the index I want. The index I want to retrieve in this case is around 26 (which is where my data becomes constant). Does anyone have any suggestions on what I can do to solve this issue? Thank you!
A gradient approach adds nothing in this case, because you don't care about velocities or vector fields. Knowing the gradient gives no extra information for locating the maximum: the run (the spacing in x) is always positive, so it does not affect the sign of the gradient. A method based entirely on the rise (the difference between consecutive values) is enough.
Detect the indices at which the data decrease, take the differences between consecutive such indices, and locate the largest gap. That gap is the long sustained rise, so by index manipulation you can find the point where it ends, i.e. where the data reach their plateau.
data = '0.8340502011561366 0.8423491600218922 0.8513456021654467 0.8458192388553084 0.8440111276014195 0.8489589671423143 0.8738088120491972 0.8845129900705279 0.8988298998926688 0.924633964692693 0.9544790734065157 0.9908034431246875 1.0236430466543138 1.061619773027915 1.1050038249835414 1.1371449802490126 1.1921182610371368 1.2752207659022576 1.344047620255176 1.4198117350668353 1.507943067143741 1.622137968203745 1.6814098429502085 1.7646810054280595 1.8485457435775694 1.919591124757554 1.9843144220593145 2.030158014640226 2.018184122476175 2.0323466012624207 2.0179200409023874 2.0316932950853723 2.013683870089898 2.03010703506514 2.0216151623726977 2.038855467786505 2.0453923522466093 2.03759031642753 2.019424996752278 2.0441806106428606 2.0607521369415136 2.059310067318373 2.0661157975162485 2.053216429539864 2.0715123971225564 2.0580473413362075 2.055814512721712 2.0808278560688964 2.0601637029377113 2.0539429365156003 2.0609648613513754 2.0585135712612646 2.087674625814453 2.062482961966647 2.066476100210777 2.0568444178944967 2.0587903943282266 2.0506399365756396'
data = data.split()
import numpy as np
a = np.array(data, dtype=float)
diff = np.diff(a)
neg_indeces = np.where(diff<0)[0]
neg_diff = np.diff(neg_indeces)
i_max_dif = np.where(neg_diff == neg_diff.max())[0][0] + 1
i_max = neg_indeces[i_max_dif] - 1 # because the rise is a difference of two consecutive values
print(i_max, a[i_max])
Output
26 1.9843144220593145
Some details
print(neg_indeces) # all indices of the negative differences in the data
# [ 2 3 27 29 31 33 36 37 40 42 44 45 47 48 50 52 54 56]
print(neg_diff) # difference between such indices
# [ 1 24 2 2 2 3 1 3 2 2 1 2 1 2 2 2 2]
print(neg_diff.max()) # value with highest difference
# 24
print(i_max_dif) # location of the max index of neg_indeces -> 27
# 2
print(i_max) # index of the max of the original data
# 26
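Since the question mentions several hundred files, the same steps could be wrapped in a function. This is only a sketch of the approach above: the name end_of_longest_rise, the None fallback and the toy array are assumptions made for illustration.

```python
import numpy as np

def end_of_longest_rise(y):
    """Index just before the decrease that ends the longest gap between decreases.

    Returns None when there are fewer than two decreases, because then no gap
    between decreases can be measured.
    """
    y = np.asarray(y, dtype=float)
    drops = np.where(np.diff(y) < 0)[0]   # positions i where y[i+1] < y[i]
    if drops.size < 2:
        return None
    gaps = np.diff(drops)                 # distance between consecutive decreases
    k = int(np.argmax(gaps)) + 1          # decrease that closes the widest gap
    return int(drops[k]) - 1              # same "- 1" offset as in the answer

toy = [0.0, 1.0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 4.9, 5.0, 4.8]
print(end_of_longest_rise(toy))  # 6
```

For each file, one would load the values with np.loadtxt and pass them to this function.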
When the first derivative changes sign, that's when the slope sign changes. I don't think you need the second derivative, unless you want to determine the rate of change of the slope. You also aren't getting the second derivative. You're just getting the difference of the first derivative.
Also, you seem to be assigning arbitrary x values. If your y-values represent points that are equally spaced, then that's OK; otherwise the derivative will be wrong.
Here's an example of how to get first and second der...
import numpy as np
x = np.linspace(1, 100, 1000)
y = np.cos(x)
# Find first derivative:
m = np.diff(y)/np.diff(x)
#Find second derivative
m2 = np.diff(m)/np.diff(x[:-1])
print(m)
print(m2)
# Get x-values where slope sign changes
c = len(m)
changes_index = []
for i in range(1, c):
    prev_val = m[i-1]
    val = m[i]
    if prev_val < 0 and val > 0:
        changes_index.append(i)
    elif prev_val > 0 and val < 0:
        changes_index.append(i)
for i in changes_index:
    print(x[i])
Notice that I had to truncate the x values for the second derivative. That's because np.diff() returns one less point than the original input.
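The loop that collects sign changes can also be written without an explicit Python loop. The following is a sketch under the same assumptions as the example above (a smooth signal, with cos chosen only for illustration); the grid from 0 to 10 is an assumption made so that the turning points are easy to check by hand.

```python
import numpy as np

x = np.linspace(0, 10, 1001)
y = np.cos(x)
m = np.diff(y) / np.diff(x)   # slope between consecutive samples

sign = np.sign(m)
# A strictly negative product means two neighbouring slopes have opposite signs.
# The +1 turns the position of the first slope of each pair into the index of
# the sample the two slopes share.
changes_index = np.where(sign[:-1] * sign[1:] < 0)[0] + 1
print(x[changes_index])       # close to pi, 2*pi and 3*pi
```

Slopes that are exactly zero are ignored by the strict inequality, which matches the behaviour of the loop version.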

What is the first minimum OR saddle point (calculus derivative) of a numpy array?

I have the following numpy array which is depicted above.
Functions like
print(arr.argsort()[:3])
will return the indices of the three lowest values:
[69 66 70]
How do I return the first index where the first minimum or first saddle point (in the calculus sense) whichever comes first of an array?
In this case the two numbers 0.62026396 and 0.60566623 at indices 2 and 3 form a first saddle point (it isn't a true saddle point since the slope doesn't flatten, but it clearly breaks the hard downward slope there. Perhaps add a threshold of what "flattens" means). Since the function never goes up before the first saddle point, and the first minimum therefore occurs after the saddle point, that is the index I am interested in.
[1.04814804 0.90445908 0.62026396 0.60566623 0.32295758 0.26658469
0.19059289 0.10281547 0.08582772 0.05091265 0.03391474 0.03844931
0.03315003 0.02838656 0.03420759 0.03567401 0.038203 0.03530763
0.04394316 0.03876966 0.04156067 0.03937291 0.03966426 0.04438747
0.03690863 0.0363976 0.03171374 0.03644719 0.02989291 0.03166156
0.0323875 0.03406287 0.03691943 0.02829374 0.0368121 0.02971704
0.03427005 0.02873735 0.02843848 0.02101889 0.02114978 0.02128403
0.0185619 0.01749904 0.01441699 0.02118773 0.02091855 0.02431763
0.02472427 0.03186318 0.03205664 0.03135686 0.02838413 0.03206674
0.02638371 0.02048122 0.01502128 0.0162665 0.01331485 0.01569286
0.00901017 0.01343558 0.00908635 0.00990869 0.01041151 0.01063606
0.00822482 0.01312368 0.0115005 0.00620334 0.0084177 0.01058152
0.01198732 0.01451455 0.01605602 0.01823713 0.01685975 0.03161889
0.0216687 0.03052391 0.02220871 0.02420951 0.01651778 0.02066987
0.01999613 0.02532265 0.02589186 0.02748692 0.02191687 0.02612152
0.02309497 0.02744753 0.02619196 0.02281516 0.0254296 0.02732746
0.02567608 0.0199178 0.01831929 0.01776025]
This is how I would detect local maxima/minima, inflection points, and saddles.
Let's first define the following functions:
import numpy as np
def n_derivative(arr, degree=1):
    """Compute the n-th derivative."""
    result = arr.copy()
    for i in range(degree):
        result = np.gradient(result)
    return result
def sign_change(arr):
    """Detect sign changes."""
    sign = np.sign(arr)
    result = ((np.roll(sign, 1) - sign) != 0).astype(bool)
    result[0] = False
    return result
def zeroes(arr, threshold=1e-8):
    """Find zeroes of an array."""
    return sign_change(arr) | (abs(arr) < threshold)
We can now make use of the derivative test.
A critical point has its first derivative equal to zero.
def critical_points(arr):
    return zeroes(n_derivative(arr, 1))
If a critical point has the second-derivative non-zero, then the point is either a maximum or a minimum:
def maxima_minima(arr):
    return zeroes(n_derivative(arr, 1)) & ~zeroes(n_derivative(arr, 2))
def maxima(arr):
    return zeroes(n_derivative(arr, 1)) & (n_derivative(arr, 2) < 0)
def minima(arr):
    return zeroes(n_derivative(arr, 1)) & (n_derivative(arr, 2) > 0)
If the second-derivative is equal to zero but the third-derivative is non-zero, then the point is a point of inflection:
def inflections(arr):
    return zeroes(n_derivative(arr, 2)) & ~zeroes(n_derivative(arr, 3))
If a critical point has its second derivative equal to zero, but its third derivative is non-zero, then this is a saddle:
def saddles(arr):
    return zeroes(n_derivative(arr, 1)) & zeroes(n_derivative(arr, 2)) & ~zeroes(n_derivative(arr, 3))
Note that this method is numerically not stable, in the sense that, on one hand the zeroes are detected on some arbitrary threshold definition, and on the other hand different sampling may result in the function / array not being differentiable.
Hence, according to this definition, what you expect is actually not a saddle point.
To have a better approximation of a continuous function, one could use a cubic interpolation on a largely oversampled (as per K in the code) function, e.g.:
import scipy as sp
import scipy.interpolate
data = [
1.04814804, 0.90445908, 0.62026396, 0.60566623, 0.32295758, 0.26658469, 0.19059289,
0.10281547, 0.08582772, 0.05091265, 0.03391474, 0.03844931, 0.03315003, 0.02838656,
0.03420759, 0.03567401, 0.038203, 0.03530763, 0.04394316, 0.03876966, 0.04156067,
0.03937291, 0.03966426, 0.04438747, 0.03690863, 0.0363976, 0.03171374, 0.03644719,
0.02989291, 0.03166156, 0.0323875, 0.03406287, 0.03691943, 0.02829374, 0.0368121,
0.02971704, 0.03427005, 0.02873735, 0.02843848, 0.02101889, 0.02114978, 0.02128403,
0.0185619, 0.01749904, 0.01441699, 0.02118773, 0.02091855, 0.02431763, 0.02472427,
0.03186318, 0.03205664, 0.03135686, 0.02838413, 0.03206674, 0.02638371, 0.02048122,
0.01502128, 0.0162665, 0.01331485, 0.01569286, 0.00901017, 0.01343558, 0.00908635,
0.00990869, 0.01041151, 0.01063606, 0.00822482, 0.01312368, 0.0115005, 0.00620334,
0.0084177, 0.01058152, 0.01198732, 0.01451455, 0.01605602, 0.01823713, 0.01685975,
0.03161889, 0.0216687, 0.03052391, 0.02220871, 0.02420951, 0.01651778, 0.02066987,
0.01999613, 0.02532265, 0.02589186, 0.02748692, 0.02191687, 0.02612152, 0.02309497,
0.02744753, 0.02619196, 0.02281516, 0.0254296, 0.02732746, 0.02567608, 0.0199178,
0.01831929, 0.01776025]
samples = np.arange(len(data))
f = sp.interpolate.interp1d(samples, data, 'cubic')
K = 10
N = len(data) * K
x = np.linspace(min(samples), max(samples), N)
y = f(x)
Then, all these definitions can be visually tested with:
import matplotlib.pyplot as plt
plt.figure()
plt.plot(samples, data, label='data')
plt.plot(x, y, label='f')
plt.plot(x, n_derivative(y, 1), label='d1f')
plt.plot(x, n_derivative(y, 2), label='d2f')
plt.plot(x, n_derivative(y, 3), label='d3f')
plt.legend()
for w in np.where(inflections(y))[0]:
    plt.axvline(x=x[w])
plt.show()
but even in this case, that point is not a saddle.
You can use np.gradient or np.diff to evaluate differences (the first computes central differences, the second is just x[1:] - x[:-1]), then use np.sign to get the gradient sign and another np.diff to see where the sign changes. Then filter the positive sign changes (corresponding to minima):
np.where(np.diff(np.sign(np.gradient(x))) > 0)[0][0]+2 #add 2 as each time you call np.gradient or np.diff you are subtracting 1 in size, the first [0] is to get the positions, the second [0] is to get the "first" element
>> 8
x[np.where(np.diff(np.sign(np.gradient(x))) > 0)[0][0]+2]
>> 0.03420759
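A variant of the same idea that uses only np.diff makes the index bookkeeping easier to follow. This is a sketch, not the answer's exact expression: the function name first_local_minimum and the short test array are assumptions, and with np.diff the offset back to the original array is +1.

```python
import numpy as np

def first_local_minimum(arr):
    """Index of the first sample where the slope stops being negative, or None.

    A flat step after a descent also counts, because the sign of the slope
    goes from -1 to 0 there.
    """
    slope_sign = np.sign(np.diff(arr))
    # diff of the signs is positive where the slope sign increases.
    turns = np.where(np.diff(slope_sign) > 0)[0]
    # turns[0] is the last falling step; the minimum is the sample after it.
    return int(turns[0]) + 1 if turns.size else None

values = np.array([5.0, 3.0, 2.0, 2.5, 1.0, 4.0])
print(first_local_minimum(values))  # 2
```

On noisy data the first hit may be a small wiggle, so smoothing the array first or requiring a minimum rise after the turn may be needed.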
After looking around a little bit and from the two suggestions given (so far), I did this:
import scipy
from scipy import interpolate
x = np.arange(0, 100)
spl = scipy.interpolate.splrep(x,arr,k=3) # no smoothing, 3rd order spline
ddy = scipy.interpolate.splev(x,spl,der=2) # use those knots to get second derivative
print(ddy)
asign = np.sign(ddy)
signchange = ((np.roll(asign, 1) - asign) != 0).astype(int)
print(signchange)
This gives me the second derivative, which then I can analyse, for example, seeing where the sign changes happen:
[-0.894053 -0.14050616 0.61304067 -0.69407217 0.55458251 -0.16624336
-0.0073225 0.12481963 -0.067218 0.03648846 0.02876712 -0.02236204
0.00167794 0.01886512 -0.0136314 0.00953279 -0.01812436 0.03041855
-0.03436446 0.02418512 -0.01458896 0.00429809 0.01227133 -0.02679232
0.02168571 -0.0181437 0.02585209 -0.02876075 0.0214645 -0.00715966
0.0009179 0.00918466 -0.03056938 0.04419937 -0.0433638 0.03557532
-0.02904901 0.02010647 -0.0199739 0.0170648 -0.00298236 -0.00511529
0.00630525 -0.01015011 0.02218007 -0.01945341 0.01339405 -0.01211326
0.01710444 -0.01591092 0.00486652 -0.00891456 0.01715403 -0.01976949
0.00573004 -0.00446743 0.01479495 -0.01448144 0.01794968 -0.02533936
0.02904355 -0.02418628 0.01505374 -0.00499926 0.00302616 -0.00877499
0.01625907 -0.01240068 -0.00578862 0.01351128 -0.00318733 -0.0010652
0.0029 -0.0038062 0.0064102 -0.01799678 0.04422601 -0.0620881
0.05587037 -0.04856099 0.03535114 -0.03094757 0.03028399 -0.01912546
0.01726283 -0.01392421 0.00989012 -0.01948119 0.02504401 -0.02204667
0.0197554 -0.01270022 -0.00260326 0.01038581 -0.00299247 -0.00271539
-0.00744152 0.00784016 0.00103947 -0.00576122]
[0 0 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1
1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 1]

Performance enhancement of ranking function by replacement of lambda x with vectorization

I have a ranking function that I apply to a large number of columns of several million rows which takes minutes to run. By removing all of the logic preparing the data for application of the .rank( method, i.e., by doing this:
ranked = df[['period_id', 'sector_name'] + to_rank].groupby(['period_id', 'sector_name']).transform(lambda x: (x.rank(ascending = True) - 1)*100/len(x))
I managed to get this down to seconds. However, I need to retain my logic, and am struggling to restructure my code: ultimately, the largest bottleneck is my double use of lambda x:, but clearly other aspects are slowing things down (see below). I have provided a sample data frame, together with my ranking functions below, i.e. an MCVE. Broadly, I think that my questions boil down to:
(i) How can one replace the .apply(lambda x usage in the code with a fast, vectorized equivalent?
(ii) How can one loop over multi-indexed, grouped, data frames and apply a function? In my case, to each unique combination of the date_id and category columns.
(iii) What else can I do to speed up my ranking logic? The main overhead seems to be in .value_counts(). This overlaps with (i) above; perhaps one can do most of this logic on df, perhaps via construction of temporary columns, before sending for ranking. Similarly, can one rank the sub-dataframe in one call?
(iv) Why use pd.qcut() rather than df.rank()? The latter is cythonized and seems to have more flexible handling of ties, but I cannot see a comparison between the two, and pd.qcut() seems most widely used.
Sample input data is as follows:
import pandas as pd
import numpy as np
import random
to_rank = ['var_1', 'var_2', 'var_3']
df = pd.DataFrame({'var_1' : np.random.randn(1000), 'var_2' : np.random.randn(1000), 'var_3' : np.random.randn(1000)})
df['date_id'] = np.random.choice(range(2001, 2012), df.shape[0])
df['category'] = ','.join(chr(random.randrange(97, 97 + 4 + 1)).upper() for x in range(1,df.shape[0]+1)).split(',')
The two ranking functions are:
def rank_fun(df, to_rank): # calls ranking function f(x) to rank each category at each date
    #extra data tidying logic here beyond scope of question - can remove
    ranked = df[to_rank].apply(lambda x: f(x))
    return ranked
def f(x):
    nans = x[np.isnan(x)] # Remove nans as these will be ranked with 50
    sub_df = x.dropna() #
    nans_ranked = nans.replace(np.nan, 50) # give nans rank of 50
    if len(sub_df.index) == 0: #check not all nan. If no non-nan data, then return with rank 50
        return nans_ranked
    if len(sub_df.unique()) == 1: # if all data has same value, return rank 50
        sub_df[:] = 50
        return sub_df
    #Check that we don't have too many clustered values, such that we can't bin due to overlap of ties, and reduce bin size provided we can at least quintile rank.
    max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
    max_bins = len(sub_df) / max_cluster
    if max_bins > 100: #if largest cluster <1% of available data, then we can percentile_rank
        max_bins = 100
    if max_bins < 5: #if we don't have the resolution to quintile rank then assume no data.
        sub_df[:] = 50
        return sub_df
    bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
    sub_df_ranked = pd.qcut(sub_df, bins, labels=False) #currently using pd.qcut. pd.rank( seems to have extra functionality, but overheads similar in practice
    sub_df_ranked *= (100 / bins) #Since we bin using the resolution specified in bins, to convert back to decile rank, we have to multiply by 100/bins. E.g. with quintiles, we'll have scores 1 - 5, so have to multiply by 100 / 5 = 20 to convert to percentile ranking
    ranked_df = pd.concat([sub_df_ranked, nans_ranked])
    return ranked_df
And the code to call my ranking function and recombine with df is:
# ensure don't get duplicate columns if ranking already executed
ranked_cols = [col + '_ranked' for col in to_rank]
ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
df = df.join(ranked[ranked_cols])
I am trying to get this ranking logic as fast as I can, by removing both lambda x calls; I can remove the logic in rank_fun so that only f(x)'s logic is applicable, but I also don't know how to process multi-index dataframes in a vectorized fashion. An additional question would be on differences between pd.qcut( and df.rank(: it seems that both have different ways of dealing with ties, but the overheads seem similar, despite the fact that .rank( is cythonized; perhaps this is misleading, given the main overheads are due to my usage of lambda x.
I ran %lprun on f(x) which gave me the following results, although the main overhead is the use of .apply(lambda x rather than a vectorized approach:
Line # Hits Time Per Hit % Time Line Contents
2 def tst_fun(df, field):
3 1 685 685.0 0.2 x = df[field]
4 1 20726 20726.0 5.8 nans = x[np.isnan(x)]
5 1 28448 28448.0 8.0 sub_df = x.dropna()
6 1 387 387.0 0.1 nans_ranked = nans.replace(np.nan, 50)
7 1 5 5.0 0.0 if len(sub_df.index) == 0:
8 pass #check not empty. May be empty due to nans for first 5 years e.g. no revenue/operating margin data pre 1990
9 return nans_ranked
10
11 1 65559 65559.0 18.4 if len(sub_df.unique()) == 1:
12 sub_df[:] = 50 #e.g. for subranks where all factors had nan so ranked as 50 e.g. in 1990
13 return sub_df
14
15 #Finally, check that we don't have too many clustered values, such that we can't bin, and reduce bin size provided we can at least quintile rank.
16 1 74610 74610.0 20.9 max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
17 # print(counts)
18 1 9 9.0 0.0 max_bins = len(sub_df) / max_cluster #
19
20 1 3 3.0 0.0 if max_bins > 100:
21 1 0 0.0 0.0 max_bins = 100 #if largest cluster <1% of available data, then we can percentile_rank
22
23
24 1 0 0.0 0.0 if max_bins < 5:
25 sub_df[:] = 50 #if we don't have the resolution to quintile rank then assume no data.
26
27 # return sub_df
28
29 1 1 1.0 0.0 bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
30
31 #should track bin resolution for all data. To add.
32
33 #if get here, then neither nans_ranked, nor sub_df are empty
34 # sub_df_ranked = pd.qcut(sub_df, bins, labels=False)
35 1 160530 160530.0 45.0 sub_df_ranked = (sub_df.rank(ascending = True) - 1)*100/len(x)
36
37 1 5777 5777.0 1.6 ranked_df = pd.concat([sub_df_ranked, nans_ranked])
38
39 1 1 1.0 0.0 return ranked_df
I'd build a function using numpy
I plan on using this within each group defined within a pandas groupby
def rnk(df):
    a = df.values.argsort(0)
    n, m = a.shape
    r = np.arange(a.shape[1])
    b = np.empty_like(a)
    b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    return pd.DataFrame(b / n, df.index, df.columns)
gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']
df.groupby(gcols)[rcols].apply(rnk).add_suffix('_ranked')
var_1_ranked var_2_ranked var_3_ranked
0 0.333333 0.809524 0.428571
1 0.160000 0.360000 0.240000
2 0.153846 0.384615 0.461538
3 0.000000 0.315789 0.105263
4 0.560000 0.200000 0.160000
...
How It Works
Because I know that ranking is related to sorting, I want to use some clever sorting to do this quicker.
numpy's argsort will produce a permutation that can be used to slice the array into a sorted array.
a = np.array([25, 300, 7])
b = a.argsort()
print(b)
[2 0 1]
print(a[b])
[ 7 25 300]
So, instead, I'm going to use the argsort to tell me where the first, second, and third ranked elements are.
# create an empty array that is the same size as b or a
# but these will be ranks, so I want them to be integers
# so I use empty_like(b) because b is the result of
# argsort and is already integers.
u = np.empty_like(b)
# now just like when I sliced a above with a[b]
# I slice u the same way but instead I assign to
# those positions, the ranks I want.
# In this case, I defined the ranks as np.arange(b.size) + 1
u[b] = np.arange(b.size) + 1
print(u)
[2 3 1]
And that was exactly correct. The 7 was in the last position but was our first rank. 300 was in the second position and was our third rank. 25 was in the first position and was our second rank.
Finally, I divide by the number in the rank to get the percentiles. It so happens that because I used zero based ranking np.arange(n), as opposed to one based np.arange(1, n+1) or np.arange(n) + 1 as in our example, I can do the simple division to get the percentiles.
What's left to do is apply this logic to each group. We can do this in pandas with groupby
Some of the missing details include how I use argsort(0) to get independent sorts per column and that I do some fancy slicing to rearrange each column independently.
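The per-column step can be illustrated on a tiny two-column array. This is a sketch with made-up numbers; the variable names order and ranks are assumptions and stand in for a and b in the function above.

```python
import numpy as np

x = np.array([[25.0, 1.0],
              [300.0, 3.0],
              [7.0, 2.0]])

order = x.argsort(axis=0)            # one independent sort permutation per column
n, m = x.shape
ranks = np.empty_like(order)
# For column j, the row order[i, j] holds the i-th smallest value, so it gets rank i.
# The column index array is broadcast against order so each column is filled separately.
ranks[order, np.arange(m)[None, :]] = np.arange(n)[:, None]
print(ranks)
# [[1 0]
#  [2 2]
#  [0 1]]
print(ranks / n)                     # zero-based ranks scaled into [0, 1)
```

Dividing by n mirrors the b / n step in rnk.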
Can we avoid the groupby and have numpy do the whole thing?
I'll also take advantage of numba's just in time compiling to speed up some things with njit
from numba import njit
@njit
def count_factor(f):
    c = np.arange(f.max() + 2) * 0
    for i in f:
        c[i + 1] += 1
    return c
@njit
def factor_fun(f):
    c = count_factor(f)
    cc = c[:-1].cumsum()
    return c[1:][f], cc[f]
def lexsort(a, f):
    n, m = a.shape
    f = f * (a.max() - a.min() + 1)
    return (f.reshape(-1, 1) + a).argsort(0)
def rnk_numba(df, gcols, rcols):
    tups = list(zip(*[df[c].values.tolist() for c in gcols]))
    f = pd.Series(tups).factorize()[0]
    a = lexsort(np.column_stack([df[c].values for c in rcols]), f)
    c, cc = factor_fun(f)
    c = c[:, None]
    cc = cc[:, None]
    n, m = a.shape
    r = np.arange(a.shape[1])
    b = np.empty_like(a)
    b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    return pd.DataFrame((b - cc) / c, df.index, rcols).add_suffix('_ranked')
How it works
Honestly, this is difficult to process mentally. I'll stick with expanding on what I explained above.
I want to use argsort again to drop rankings into the correct positions. However, I have to contend with the grouping columns. So what I do is compile a list of tuples and factorize them as was addressed in this question here
Now that I have a factorized set of tuples I can perform a modified lexsort that sorts within my factorized tuple groups. This question addresses the lexsort.
A tricky bit remains to be addressed: I must offset the newly found ranks by the size of each group so that I get fresh ranks for every group. This is taken care of in the tiny snippet b - cc in the code above. But calculating cc is a necessary component.
So that's some of the high level philosophy. What about @njit?
Note that when I factorize, I am mapping to the integers 0 to n - 1 where n is the number of unique grouping tuples. I can use an array of length n as a convenient way to track the counts.
In order to accomplish the groupby offset, I needed to track the counts and cumulative counts in the positions of those groups as they are represented in the list of tuples or the factorized version of those tuples. I decided to do a linear scan through the factorized array f and count the observations in a numba loop. While I had this information, I'd also produce the necessary information to produce the cumulative offsets I also needed.
numba provides an interface to produce highly efficient compiled functions. It is finicky and you have to acquire some experience to know what is possible and what isn't possible. I decided to numbafy two functions that are preceded with a numba decorator @njit. This code works just as well without those decorators, but is sped up with them.
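The counts and cumulative offsets that factor_fun produces can be reproduced in plain NumPy, which may help when reading b - cc. The sketch below swaps in np.bincount and np.lexsort for the numba loop and the custom lexsort, so it is an equivalent illustration rather than the code above; the sample arrays are made up.

```python
import numpy as np

f = np.array([0, 1, 0, 2, 1, 0])                 # factorized group id per row
a = np.array([0.3, 0.9, 0.1, 0.5, 0.2, 0.7])     # values to rank within each group

counts = np.bincount(f)                           # rows per group: [3 2 1]
offsets = np.concatenate(([0], counts.cumsum()[:-1]))  # first global position per group: [0 3 5]
group_size = counts[f]                            # plays the role of c in rnk_numba
group_offset = offsets[f]                         # plays the role of cc in rnk_numba

order = np.lexsort((a, f))                        # sort by group first, then by value
pos = np.empty_like(order)
pos[order] = np.arange(len(a))                    # global position of every row
within = pos - group_offset                       # zero-based rank inside its own group
print(within)                                     # [1 1 0 0 0 2]
print(within / group_size)                        # scaled like (b - cc) / c
```

Subtracting the offset is what restarts the ranks at zero for each group.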
Timing
%%timeit
ranked_cols = [col + '_ranked' for col in to_rank]
ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
1 loop, best of 3: 481 ms per loop
gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']
%timeit df.groupby(gcols)[rcols].apply(rnk).add_suffix('_ranked')
100 loops, best of 3: 16.4 ms per loop
%timeit rnk_numba(df, gcols, rcols).head()
1000 loops, best of 3: 1.03 ms per loop
I suggest you try this code. It's 3 times faster than yours, and more clear.
rank function:
def rank(x):
    counts = x.value_counts()
    bins = int(0 if len(counts) == 0 else x.count() / counts.iloc[0])
    bins = 100 if bins > 100 else bins
    if bins < 5:
        return x.apply(lambda x: 50)
    else:
        return (pd.qcut(x, bins, labels=False) * (100 / bins)).fillna(50).astype(int)
single thread apply:
for col in to_rank:
    df[col + '_ranked'] = df.groupby(['date_id', 'category'])[col].apply(rank)
multi-process apply:
import sys
from multiprocessing import Pool
def tfunc(col):
    return df.groupby(['date_id', 'category'])[col].apply(rank)
pool = Pool(len(to_rank))
# sys.maxint exists only in Python 2; on Python 3, call .get() without a timeout
result = pool.map_async(tfunc, to_rank).get(sys.maxint)
for (col, val) in zip(to_rank, result):
    df[col + '_ranked'] = val
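For the simple percentile formula from the start of the question, (rank - 1) * 100 / len, the lambda can be avoided entirely with the built-in groupby rank. This is a sketch on a made-up frame with assumed column names g and v; it does not reproduce the NaN and tie-cluster handling of f(x).

```python
import pandas as pd

df = pd.DataFrame({
    "g": ["a", "a", "a", "b", "b"],
    "v": [3.0, 1.0, 2.0, 10.0, 5.0],
})

grouped = df.groupby("g")["v"]
ranks = grouped.rank(method="average", ascending=True)  # 1-based rank within each group
sizes = grouped.transform("size")                        # group length, like len(x)
df["v_ranked"] = (ranks - 1) * 100 / sizes
print(df)
```

Both calls run once over the whole column, so no Python function is invoked per group.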

Standard deviation of combinations of dices

I am trying to find stdev for a sequence of numbers that were extracted from combinations of dice (30) that sum up to 120. I am very new to Python, so this code makes the console freeze because the numbers are endless and I am not sure how to fit them all into a smaller, more efficient function. What I did is:
found all possible combinations of 30 dice;
filtered combinations that sum up to 120;
multiplied all items in the list within result list;
tried extracting standard deviation.
Here is the code:
import itertools
import numpy
dice = [1,2,3,4,5,6]
subset = itertools.product(dice, repeat = 30)
result = []
for x in subset:
    if sum(x) == 120:
        result.append(x)
my_result = numpy.prod(result, axis = 1).tolist()
std = numpy.std(my_result)
print(std)
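Note that D(X) = E(X^2) - E(X)^2, so you can solve this problem exactly with the following recurrences. Over all rolls of i dice that sum to N, f[i][N] is the sum of the products, g[i][N] is the sum of the squared products, and h[i][N] is the number of such rolls:
f[i][N] = sum(k*f[i-1][N-k]) (1<=k<=6)
g[i][N] = sum(k^2*g[i-1][N-k]) (1<=k<=6)
h[i][N] = sum(h[i-1][N-k]) (1<=k<=6)
f[1][k] = k (1<=k<=6)
g[1][k] = k^2 (1<=k<=6)
h[1][k] = 1 (1<=k<=6)
Then E(X) = f[30][120] / h[30][120] and E(X^2) = g[30][120] / h[30][120].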
Sample implementation:
import numpy as np
Nmax = 120
nmax = 30
min_value = 1
max_value = 6
f = np.zeros((nmax+1, Nmax+1), dtype ='object')
g = np.zeros((nmax+1, Nmax+1), dtype ='object') # the intermediate results will be really huge, to keep them accurate we have to utilize python big-int
h = np.zeros((nmax+1, Nmax+1), dtype ='object')
for i in range(min_value, max_value+1):
    f[1][i] = i
    g[1][i] = i**2
    h[1][i] = 1
for i in range(2, nmax+1):
    for N in range(1, Nmax+1):
        f[i][N] = 0
        g[i][N] = 0
        h[i][N] = 0
        for k in range(min_value, max_value+1):
            if N - k < 0: # a negative index would wrap around to the end of the row
                continue
            f[i][N] += k*f[i-1][N-k]
            g[i][N] += (k**2)*g[i-1][N-k]
            h[i][N] += h[i-1][N-k]
result = np.sqrt(float(g[nmax][Nmax]) / h[nmax][Nmax] - (float(f[nmax][Nmax]) / h[nmax][Nmax]) ** 2)
# result = 32128174994365296.0
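One way to gain confidence in the recurrences is to compare them with brute force on a case small enough to enumerate. The sketch below is an independent re-implementation with dictionaries instead of the arrays above; the function name product_moments and the choice of 3 dice summing to 10 are assumptions for illustration.

```python
import itertools
import math

def product_moments(n_dice, target, faces=range(1, 7)):
    """Return (h, f, g): count, sum of products and sum of squared products
    over all ordered rolls of n_dice whose faces add up to target."""
    # state[s] = (h, f, g) for the dice processed so far with running sum s.
    # With zero dice there is one (empty) roll whose product is 1.
    state = {0: (1, 1, 1)}
    for _ in range(n_dice):
        nxt = {}
        for s, (h, f, g) in state.items():
            for k in faces:
                if s + k > target:
                    continue
                h0, f0, g0 = nxt.get(s + k, (0, 0, 0))
                nxt[s + k] = (h0 + h, f0 + k * f, g0 + k * k * g)
        state = nxt
    return state.get(target, (0, 0, 0))

# Brute force for a small case: 3 dice that sum to 10.
rolls = [r for r in itertools.product(range(1, 7), repeat=3) if sum(r) == 10]
prods = [math.prod(r) for r in rolls]

h, f, g = product_moments(3, 10)
print(h == len(rolls), f == sum(prods), g == sum(p * p for p in prods))

std = math.sqrt(g / h - (f / h) ** 2)
print(std)
```

The same function can be called with 30 and 120; Python integers keep the intermediate sums exact.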
You are asking for an unfiltered list of 6^30 ≈ 2*10^23 combinations, which is impossible to handle as such.
There are two possibilities that can be combined:
1. Think more about the problem up front, e.g. about how to sample only those combinations with sum 120.
2. Do a Monte Carlo simulation instead, i.e. don't enumerate all combinations, but draw only a few thousand random ones to obtain a representative sample that determines the std accurately enough.
Now, I only apply (2), giving the brute force code:
import random
N = 30 # number of dice
M = 100000 # number of samples
S = 120 # required sum
result = [[random.randint(1,6) for _ in range(N)] for _ in range(M)]
result = [s for s in result if sum(s) == S]
Now, that result should be comparable to your result before using numpy.product ... that part I couldn't follow, though...
Ok, if you are after the standard deviation of the product of the 30 dice, that is what your code does. Then I need 1 000 000 samples to get roughly reproducible values for std (1 digit) - takes my PC about 20 seconds, still considerably less than 1 million years :-D.
Is a number like 3.22*10^16 what you are looking for?
Edit after comments:
Well, sampling the frequency of numbers instead gives only 6 independent variables - even 4 actually, by substituting in the constraints (sum = 120, total number = 30). My current code looks like this:
import itertools
import numpy
def p2(b, s):
    return 2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3]
hits = range(31)
subset = itertools.product(hits, repeat=4) # only 3,4,5,6 frequencies
product = []
permutations = []
for s in subset:
    b = 90 - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3]) # 2 frequency
    a = 30 - (b + sum(s)) # 1 frequency
    if 0 <= b <= 30 and 0 <= a <= 30:
        product.append(p2(b, s))
        permutations.append(1) # TODO: Replace 1 with possible permutations
print(numpy.std(product)) # TODO: calculate std manually, considering permutations
This computes in about 1 second, but the confusing part is that I get as a result 1.28737023733e+17. Either my previous approaches or this one has a bug - or both.
Sorry - not that easy: the samples do not all have the same probability, and that is the problem here. Each sample corresponds to a different number of possible combinations, which gives its weight, and that weight has to be considered before taking the standard deviation. I have drafted that in the code above.
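One way the two TODOs could be filled in, as a sketch under my own assumptions rather than a verified solution: a set of face frequencies (a, b, c3, c4, c5, c6) corresponds to n! / (a! b! c3! c4! c5! c6!) ordered rolls (the multinomial coefficient), and that count can serve as the weight. The function name `weighted_product_std` and its parameters are hypothetical; exact integer arithmetic is used until the final square root to avoid cancellation errors.

```python
import itertools
import math

def weighted_product_std(n_dice=30, target_sum=120):
    """Std of the product of n_dice dice over all ordered rolls whose sum
    is target_sum, weighting each frequency tuple by its multinomial count."""
    total_w = 0   # number of ordered rolls with the required sum
    sum_p = 0     # weighted sum of products
    sum_p2 = 0    # weighted sum of squared products
    for s in itertools.product(range(n_dice + 1), repeat=4):  # 3,4,5,6 frequencies
        # from sum(c_i) = n and sum(i * c_i) = S it follows that sum((i-1) * c_i) = S - n
        b = (target_sum - n_dice) - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3])  # 2 frequency
        a = n_dice - (b + sum(s))                                        # 1 frequency
        if a < 0 or b < 0:
            continue
        # multinomial coefficient: number of orderings of this frequency tuple
        weight = math.factorial(n_dice)
        for c in (a, b) + s:
            weight //= math.factorial(c)
        p = 2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3]
        total_w += weight
        sum_p += weight * p
        sum_p2 += weight * p * p
    # var = E[p^2] - E[p]^2, kept as one exact integer ratio
    variance = (sum_p2 * total_w - sum_p**2) / (total_w**2)
    return math.sqrt(variance)

if __name__ == "__main__":
    print(weighted_product_std(30, 120))
```

If the weighting is right, the value printed for 30 dice and sum 120 should be comparable to the `result` figure quoted at the top of this answer; for small cases it can be checked against a brute-force enumeration of all rolls.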

Delimiting contiguous regions with values above a certain threshold in Pandas DataFrame

I have a Pandas Dataframe of indices and values between 0 and 1, something like this:
6 0.047033
7 0.047650
8 0.054067
9 0.064767
10 0.073183
11 0.077950
I would like to retrieve tuples of the start and end points of regions of more than 5 consecutive values that are all over a certain threshold (e.g. 0.5). So that I would have something like this:
[(150, 185), (632, 680), (1500,1870)]
Here the first tuple describes a region that starts at index 150, has 35 consecutive values above 0.5, and ends at index 185 (non-inclusive).
I started by filtering for only values above 0.5, like so:
df = df[df['values'] >= 0.5]
And now I have values like this:
632 0.545700
633 0.574983
634 0.572083
635 0.595500
636 0.632033
637 0.657617
638 0.643300
639 0.646283
I can't show my actual dataset, but the following one should be a good representation
import numpy as np
from pandas import *
np.random.seed(seed=901212)
df = DataFrame(range(1,501), columns=['indices'])
df['values'] = np.random.rand(500)*.5 + .35
yielding:
1 0.491233
2 0.538596
3 0.516740
4 0.381134
5 0.670157
6 0.846366
7 0.495554
8 0.436044
9 0.695597
10 0.826591
...
Here the region (2,4) has two values above 0.5, but it would be too short. On the other hand, the region (25,44), with 19 values above 0.5 in a row, would be added to the list.
You can find the first and last element of each consecutive region by looking at the series and 1-row shifted values, and then filter the pairs which are adequately apart from each other:
# tag rows based on the threshold
df['tag'] = df['values'] > .5
# first row is a True preceded by a False
# (fill_value=False keeps the shifted column boolean, so ~ stays a logical NOT)
fst = df.index[df['tag'] & ~ df['tag'].shift(1, fill_value=False)]
# last row is a True followed by a False
lst = df.index[df['tag'] & ~ df['tag'].shift(-1, fill_value=False)]
# filter those which are adequately apart
pr = [(i, j) for i, j in zip(fst, lst) if j > i + 4]
so for example the first region would be:
>>> i, j = pr[0]
>>> df.loc[i:j]
    indices    values   tag
15       16  0.639992  True
16       17  0.593427  True
17       18  0.810888  True
18       19  0.596243  True
19       20  0.812684  True
20       21  0.617945  True
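The question asks for tuples whose end point is non-inclusive, while the pairs above hold the last row of each region. A minimal sketch of the same shift-based idea wrapped in a function that returns half-open tuples is given below; the name `regions_above` and its parameters are hypothetical, and it assumes a default integer index so that adding 1 to the last label is meaningful.

```python
import pandas as pd

def regions_above(values, threshold=0.5, min_len=6):
    """Return (start, stop) tuples, stop exclusive, for runs of at least
    min_len consecutive entries of `values` above `threshold`.
    Assumes `values` is a Series with a default integer index."""
    tag = values > threshold
    # first row of a run: True preceded by False (or by nothing)
    first = tag & ~tag.shift(1, fill_value=False)
    # last row of a run: True followed by False (or by nothing)
    last = tag & ~tag.shift(-1, fill_value=False)
    starts = values.index[first.to_numpy()]
    stops = values.index[last.to_numpy()] + 1  # make the end non-inclusive
    return [(int(a), int(b)) for a, b in zip(starts, stops) if b - a >= min_len]

if __name__ == "__main__":
    s = pd.Series([0.1, 0.6, 0.7, 0.2, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.1, 0.8])
    print(regions_above(s))             # only the run of six values qualifies
    print(regions_above(s, min_len=1))  # every run, however short
```

"More than 5 consecutive values" is expressed here as `min_len=6`, which matches the `j > i + 4` filter on inclusive end points.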
I think this prints what you want. It is based heavily on Joe Kington's answer (linked in the code below); I guess it is appropriate to up-vote that.
import numpy as np
import pandas as pd

# from Joe Kington's answer here https://stackoverflow.com/a/4495197/3751373
# with minor edits
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index (exclusive)."""
    # Find the indices of changes in "condition"
    d = np.diff(condition, n=1, axis=0)
    idx, _ = d.nonzero()
    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right. -JK
    # LB this copy to increment is horrible but I get
    # ValueError: output array is read-only without it
    mutable_idx = np.array(idx)
    mutable_idx += 1
    idx = mutable_idx
    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]
    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size]  # Edit
    # Reshape the result into two columns
    idx.shape = (-1, 2)
    return idx

def main():
    RUN_LENGTH_THRESHOLD = 5
    VALUE_THRESHOLD = 0.5
    np.random.seed(seed=901212)
    data = np.random.rand(500)*.5 + .35
    df = pd.DataFrame(data=data, columns=['values'])
    # df.values is a 2D (500, 1) array, which is what contiguous_regions expects here
    match_bools = df.values > VALUE_THRESHOLD
    print('with boolean array')
    for start, stop in contiguous_regions(match_bools):
        if stop - start > RUN_LENGTH_THRESHOLD:
            print(start, stop)

if __name__ == '__main__':
    main()
I would be surprised if there were not more elegant ways to do this.
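One possibly more compact variant, offered as a sketch rather than a tested replacement, labels each run with the `(~mask).cumsum()` trick and lets `groupby` collect the first and last position of every run. The function name `runs_by_groupby` and its parameters are hypothetical, and it assumes the Series has unique index labels.

```python
import numpy as np
import pandas as pd

def runs_by_groupby(values, threshold=0.5, min_len=6):
    """Return (start, stop) positions, stop exclusive, of runs of at least
    min_len consecutive entries of `values` above `threshold`."""
    mask = values > threshold
    # (~mask).cumsum() only increases on False rows, so all rows of one
    # True run share the same id; keep the ids of the True rows only
    run_id = (~mask).cumsum()[mask]
    # integer positions of the True rows, carrying the same index as run_id
    pos = pd.Series(np.arange(len(values)), index=values.index)[mask]
    spans = pos.groupby(run_id).agg(["first", "last"])
    return [(int(a), int(b) + 1)
            for a, b in zip(spans["first"], spans["last"])
            if b - a + 1 >= min_len]

if __name__ == "__main__":
    np.random.seed(seed=901212)
    s = pd.Series(np.random.rand(500) * .5 + .35)
    for start, stop in runs_by_groupby(s):
        print(start, stop)
```

The tuples are half-open positions, so `stop - start` is the run length, the same convention as `contiguous_regions` above.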
