Binning and then combining bins with minimum number of observations? - python
Let's say I create some data and then create bins of different sizes:
from __future__ import division
import numpy as np
import pandas as pd
x = np.random.rand(1,20)
new, = np.digitize(x,np.arange(1,x.shape[1]+1)/100)
new_series = pd.Series(new)
print(new_series.value_counts())
reveals:
20 17
16 1
4 1
2 1
dtype: int64
I basically want to transform the underlying data so that, if I set a minimum threshold of at least 2 per bin, new_series.value_counts() becomes:
20 17
16 3
dtype: int64
EDITED:
x = np.random.rand(1,100)
bins = np.arange(1,x.shape[1]+1)/100
new = np.digitize(x,bins)
n = new.copy()[0]  # this will hold the result
threshold = 2

for i in np.unique(n):
    if sum(n == i) <= threshold:
        n[n == i] += 1

n = n.clip(0, bins.size)  # avoid adding beyond the last bin (clip returns a new array, so assign it back)
n = n.reshape(1,-1)
This can move counts up multiple times, until a bin is filled sufficiently.
Instead of np.digitize, it might be simpler to use np.histogram, because it directly gives you the counts, so we don't need to sum them ourselves.
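A rough sketch of that suggestion (my own reworking, not code from the original answer); the label-to-count mapping and the cascading count update are assumptions on my part, and samples below the first edge (digitize label 0) are not covered by the histogram:

import numpy as np

x = np.random.rand(1, 100)
bins = np.arange(1, x.shape[1] + 1) / 100
threshold = 2

counts, _ = np.histogram(x, bins=bins)   # one count per interval [bins[j], bins[j+1]), no manual summing
n = np.digitize(x, bins)[0].copy()       # per-sample bin labels, as before

# counts[label - 1] corresponds to digitize label `label`
for label in range(1, bins.size):
    j = label - 1
    if 0 < counts[j] <= threshold:
        n[n == label] += 1               # push the sparse bin's samples up one bin
        if j + 1 < counts.size:
            counts[j + 1] += counts[j]   # carry the count so merges can cascade
        counts[j] = 0

n = n.clip(0, bins.size).reshape(1, -1)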
Find where the slope changes in my data as a parameter that can be easily indexed and extracted
I have the following data:

0.8340502011561366 0.8423491600218922 0.8513456021654467 0.8458192388553084 0.8440111276014195 0.8489589671423143 0.8738088120491972 0.8845129900705279 0.8988298998926688 0.924633964692693 0.9544790734065157 0.9908034431246875 1.0236430466543138 1.061619773027915 1.1050038249835414 1.1371449802490126 1.1921182610371368 1.2752207659022576 1.344047620255176 1.4198117350668353 1.507943067143741 1.622137968203745 1.6814098429502085 1.7646810054280595 1.8485457435775694 1.919591124757554 1.9843144220593145 2.030158014640226 2.018184122476175 2.0323466012624207 2.0179200409023874 2.0316932950853723 2.013683870089898 2.03010703506514 2.0216151623726977 2.038855467786505 2.0453923522466093 2.03759031642753 2.019424996752278 2.0441806106428606 2.0607521369415136 2.059310067318373 2.0661157975162485 2.053216429539864 2.0715123971225564 2.0580473413362075 2.055814512721712 2.0808278560688964 2.0601637029377113 2.0539429365156003 2.0609648613513754 2.0585135712612646 2.087674625814453 2.062482961966647 2.066476100210777 2.0568444178944967 2.0587903943282266 2.0506399365756396

In the plot of the data (not shown here), I want to find the point where the slope changes in sign; I circled it in black, and it should be around index 26. I need to find this point of change for several hundred files.

So far I tried the recommendation from this post: Finding the point of a slope change as a free parameter - Python. I think that because my data is a bit noisy, I am not getting a smooth transition in the change of the slope. This is the code I have tried so far:

import sys
import numpy as np

# load 1-D data file
file = str(sys.argv[1])
y = np.loadtxt(file)

# create x based on file length
x = np.linspace(1, len(y), num=len(y))

# find first derivative
m = np.diff(y)/np.diff(x)
print(m)

# find second derivative
b = np.diff(m)
print(b)

# find index
index = 0
for difference in b:
    index += 1
    if difference < 0:
        print(index, difference)

Since my data is noisy, I am getting some negative values before the index I want. The index I want it to retrieve in this case is around 26 (which is where my data becomes constant). Does anyone have any suggestions on what I can do to solve this issue? Thank you!
A gradient approach is useless in this case because you don't care about velocities or vector fields. Knowledge of the gradient doesn't add extra information for locating the maximum value, since the run is always positive and hence does not affect the sign of the difference. A method based entirely on the rise is suggested instead: detect the indices at which the data decreases, take the differences between those indices, and relate them to the location of the max value. Then, by index manipulation, you can find the index at which the data has its maximum.

data = '0.8340502011561366 0.8423491600218922 0.8513456021654467 0.8458192388553084 0.8440111276014195 0.8489589671423143 0.8738088120491972 0.8845129900705279 0.8988298998926688 0.924633964692693 0.9544790734065157 0.9908034431246875 1.0236430466543138 1.061619773027915 1.1050038249835414 1.1371449802490126 1.1921182610371368 1.2752207659022576 1.344047620255176 1.4198117350668353 1.507943067143741 1.622137968203745 1.6814098429502085 1.7646810054280595 1.8485457435775694 1.919591124757554 1.9843144220593145 2.030158014640226 2.018184122476175 2.0323466012624207 2.0179200409023874 2.0316932950853723 2.013683870089898 2.03010703506514 2.0216151623726977 2.038855467786505 2.0453923522466093 2.03759031642753 2.019424996752278 2.0441806106428606 2.0607521369415136 2.059310067318373 2.0661157975162485 2.053216429539864 2.0715123971225564 2.0580473413362075 2.055814512721712 2.0808278560688964 2.0601637029377113 2.0539429365156003 2.0609648613513754 2.0585135712612646 2.087674625814453 2.062482961966647 2.066476100210777 2.0568444178944967 2.0587903943282266 2.0506399365756396'
data = data.split()

import numpy as np
a = np.array(data, dtype=float)

diff = np.diff(a)
neg_indices = np.where(diff < 0)[0]
neg_diff = np.diff(neg_indices)
i_max_dif = np.where(neg_diff == neg_diff.max())[0][0] + 1
i_max = neg_indices[i_max_dif] - 1  # -1 because the rise is a difference of two consecutive values
print(i_max, a[i_max])

Output

26 1.9843144220593145

Some details

print(neg_indices)     # all indices where the data decreases
# [ 2  3 27 29 31 33 36 37 40 42 44 45 47 48 50 52 54 56]

print(neg_diff)        # differences between such indices
# [ 1 24  2  2  2  3  1  3  2  2  1  2  1  2  2  2  2]

print(neg_diff.max())  # value with the highest difference
# 24

print(i_max_dif)       # location of the max index of neg_indices -> 27
# 2

print(i_max)           # index of the max of the original data
# 26
When the first derivative changes sign, that's when the slope sign changes. I don't think you need the second derivative, unless you want to determine the rate of change of the slope. You also aren't getting the second derivative; you're just getting the difference of the first derivative. Also, you seem to be assigning arbitrary x values. If your y-values represent points that are equally spaced apart, then it's OK; otherwise the derivative will be wrong. Here's an example of how to get the first and second derivatives:

import numpy as np

x = np.linspace(1, 100, 1000)
y = np.cos(x)

# Find first derivative:
m = np.diff(y)/np.diff(x)

# Find second derivative
m2 = np.diff(m)/np.diff(x[:-1])

print(m)
print(m2)

# Get x-values where slope sign changes
c = len(m)
changes_index = []
for i in range(1, c):
    prev_val = m[i-1]
    val = m[i]
    if prev_val < 0 and val > 0:
        changes_index.append(i)
    elif prev_val > 0 and val < 0:
        changes_index.append(i)

for i in changes_index:
    print(x[i])

Notice I had to curtail the x values for the second derivative. That's because np.diff() returns one less point than the original input.
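As an aside (not from either answer), the explicit loop over m above can be replaced by a vectorized scan; a minimal sketch, with a cosine stand-in for the loaded data:

import numpy as np

y = np.cos(np.linspace(1, 100, 1000))        # stand-in for the loaded data
m = np.diff(y)                               # first differences (equal spacing assumed)
# indices in m where the slope flips sign; exact zeros in m would register twice
sign_flips = np.where(np.diff(np.sign(m)) != 0)[0] + 1
print(sign_flips)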
Find combinations (without "of size=r") from a set with decreasing sum value using Python
(Revised for clarity 02-08-2021)

This is similar to the question here: Find combinations of size r from a set with decreasing sum value. This is different from the answer posted in the link above because I am looking for answers without "size r=3".

I have a set (array) of numbers. I need to have the sums of combinations of the numbers sorted from largest to smallest and show the numbers from the array that were used to get the total for that row. Any number in the array can only be used once per row, but all the numbers don't have to be used in each row as the total decreases. If a number is not used then zero should be used as a placeholder instead, so I can create a CSV file with the columns aligned.

Input Example #1 with 7 numbers in the array: [30,25,20,15,10,5,1]

Desired Output Example #1 format, where the last number in each row is the total (sum) of the row:

Beginning of list
30,25,20,15,10,5,1,106
30,25,20,15,10,5,0,105
30,25,20,15,10,0,1,101
30,25,20,15,10,0,0,100
...(all number combinations in between)
30,0,0,0,0,0,0,30
0,25,0,0,0,5,0,30
...(all number combinations in between)
0,0,0,15,0,0,1,16
0,0,0,15,0,0,0,15
0,0,0,0,10,5,0,15
0,0,0,0,10,0,1,11
0,0,0,0,0,5,1,6
0,0,0,0,0,5,0,5
0,0,0,0,0,0,1,1
End of list

Also, duplicate totals are allowed and preferred, showing different combinations that have the same total (sum) of the row. For Example #1:

30,0,0,0,0,0,0,30
0,25,0,0,0,5,0,30

For example, this is one row of output based on the Input Example #1 above:

30,25,0,0,0,5,1,61

The last number in the row is the total. The total can also be the first number, but the important thing is that the output list is sorted in descending order by the total.

Input Example #2 with 5 numbers in the array: [20,15,10,5,1]

Desired Output Example #2 format, where the last number in each row is the total (sum) of the row:

Beginning of list
20,15,10,5,1,51
20,15,10,5,0,50
20,15,10,0,1,46
20,15,10,0,0,45
...(all number combinations in between)
20,0,10,0,0,30
0,15,10,5,0,30
...(all number combinations in between)
0,15,0,0,1,16
0,15,0,0,0,15
0,0,10,5,0,15
0,0,10,0,1,11
0,0,10,0,0,10
0,0,0,5,1,6
0,0,0,5,0,5
0,0,0,0,1,1
End of list

Input Example #1: [30,25,20,15,10,5,1]

Every row of the output should show each number in the array used only once at most per row to get the total for the row. The rows must be sorted in decreasing order by the sums of the numbers used to get the total.

The first output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 5 + 1 = 106
The second output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 5 + 0 = 105
The third output row in the list would show the result of 30 + 25 + 20 + 15 + 10 + 0 + 1 = 101
...The rest of the rows would continue with the total (sum) of the row getting smaller until it reaches 1...
The third-to-last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 5 + 1 = 6
The second-to-last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 5 + 0 = 5
The last output row in the list would show the result of 0 + 0 + 0 + 0 + 0 + 0 + 1 = 1

I started with the code provided by user Divyanshu, modified with different input numbers and the () added to the last line (but I need to use all the numbers in the array instead of size=4 as shown here):

import itertools

array = [30,25,20,15,10,5,1]
size = 4
answer = []  # to store all combinations
order = []   # to store order according to sum
number = 0   # index of combination

for comb in itertools.combinations(array, size):
    answer.append(comb)
    order.append([sum(comb), number])  # storing sum and index
    number += 1

order.sort(reverse=True)  # sorting in decreasing order

for key in order:
    print(key[0], answer[key[1]])  # key[0] is sum of combination

So this is what I need as an input (in this example): [30,25,20,15,10,5,1]

size=4 in the above code limits the output to 4 of the numbers in the array. If I take out size=4 I get an error. I need to use the entire array of numbers. I can manually change size=4 to size=1 and run it, then size=2 and run it, and so on. Entering size=1 through size=7 in the code and running it (7 times in this example) to get a list of all possible combinations gives me 7 different outputs. I could then manually put the lists together, but that won't work for larger sets (arrays) of numbers. Can I modify the code referenced above, or do I need to use a different approach?
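As an editorial aside before the answers: itertools can chain the combinations of every size into one sorted list, and the zero placeholders can be filled in with a membership test. A minimal sketch (not from the original post), which assumes the array values are distinct:

import itertools

array = [30, 25, 20, 15, 10, 5, 1]

# every combination of every size 1..len(array)
combos = itertools.chain.from_iterable(
    itertools.combinations(array, r) for r in range(1, len(array) + 1)
)
rows = sorted(combos, key=sum, reverse=True)   # largest totals first

for comb in rows:
    used = set(comb)                                  # values are unique, so a set is safe
    cells = [v if v in used else 0 for v in array]    # zero placeholders keep columns aligned
    print(','.join(map(str, cells + [sum(comb)])))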
I think you could do it as follows.

Import the following:

import pandas as pd
import numpy as np

The beginning of the code is as in the question:

import itertools

array = [30,25,20,15,10,5,1]
size = 4
answer = []  # to store all combinations
order = []   # to store order according to sum
number = 0   # index of combination

for comb in itertools.combinations(array, size):
    answer.append(comb)
    order.append([sum(comb), number])  # storing sum and index
    number += 1

order.sort(reverse=True)  # sorting in decreasing order

for key in order:
    print(key[0], answer[key[1]])  # key[0] is sum of combination

This code, I think, could help to obtain the final format:

array_len = array.__len__()

# Auxiliary dictionary mapping each value to its position in the original array
dict_array = {}
for i in range(0, array_len):
    print(i)
    dict_array[array[i]] = i

# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len + 1])
    for i in answer[key[1]]:
        print(i, dict_array[i])
        array_zeros[0][dict_array[i]] = i
    # Add the total
    array_zeros[0][array_len] = key[0]
    aux.append(array_zeros[0])

# Transform into a dataframe
aux = pd.DataFrame(aux)

# This is to add the names to the columns of the dataframe
aux.columns = array + ['total']
aux = aux.astype(int)

print(aux.head().astype(int))

   30  25  20  15  10  5  1  total
0  30  25  20  15   0  0  0     90
1  30  25  20   0  10  0  0     85
2  30  25   0  15  10  0  0     80
3  30  25  20   0   0  5  0     80
4  30  25  20   0   0  0  1     76

Now the generalization for all sizes:

import itertools
import pandas as pd
import numpy as np

array = [30,25,20,15,10,5,1]
array_len = array.__len__()

answer = []  # to store all combinations
order = []   # to store order according to sum
number = 0   # index of combination

for size in range(1, array_len + 1):
    print(size)
    for comb in itertools.combinations(array, size):
        answer.append(comb)
        order.append([sum(comb), number])  # storing sum and index
        number += 1

order.sort(reverse=True)  # sorting in decreasing order

for key in order:
    print(key[0], answer[key[1]])  # key[0] is sum of combination

# Auxiliary dictionary mapping each value to its position in the original array
dict_array = {}
for i in range(0, array_len):
    print(i)
    dict_array[array[i]] = i

# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len + 1])
    for i in answer[key[1]]:
        print(i, dict_array[i])
        array_zeros[0][dict_array[i]] = i
    # Add the total
    array_zeros[0][array_len] = key[0]
    aux.append(array_zeros[0])

# Transform into a dataframe
aux = pd.DataFrame(aux)

# This is to add the names to the columns of the dataframe
aux.columns = array + ['total']
aux = aux.astype(int)

print(aux.head().astype(int))

   30  25  20  15  10  5  1  total
0  30  25  20  15  10  5  1    106
1  30  25  20  15  10  5  0    105
2  30  25  20  15  10  0  1    101
3  30  25  20  15  10  0  0    100
4  30  25  20  15   0  5  1     96
Thanks to @RafaelValero (Rafael Valero) I was able to learn about pandas, numpy, and dataframes. I looked up options for pandas to get the desired output. Here is the final code, with some extra lines left in for reference but commented out:

import itertools
import pandas as pd
import numpy as np

array = [30,25,20,15,10,5,1]
array_len = array.__len__()

answer = []  # to store all combinations
order = []   # to store order according to sum
number = 0   # index of combination

for size in range(1, array_len + 1):
    # Commented out line below as it was giving extra information
    # print(size)
    for comb in itertools.combinations(array, size):
        answer.append(comb)
        order.append([sum(comb), number])  # storing sum and index
        number += 1

order.sort(reverse=True)  # sorting in decreasing order

# Commented out two lines below as they were from the original code and gave extra information
# for key in order:
#     print(key[0], answer[key[1]])  # key[0] is sum of combination

# Auxiliary dictionary mapping each value to its position in the original array
dict_array = {}
for i in range(0, array_len):
    # Commented out line below as it was giving extra information
    # print(i)
    dict_array[array[i]] = i

# Reorder the previous combinations
aux = []
for key in order:
    array_zeros = np.zeros([1, array_len + 1])
    for i in answer[key[1]]:
        # Commented out line below as it was giving extra information
        # print(i, dict_array[i])
        array_zeros[0][dict_array[i]] = i
    # Add the total
    array_zeros[0][array_len] = key[0]
    aux.append(array_zeros[0])

# Transform into a dataframe
aux = pd.DataFrame(aux)

# This is to add the names to the columns of the dataframe
# Update: removed the line below as I didn't need a header
# aux.columns = array + ['total']
aux = aux.astype(int)

# Tried the option below first but it was not necessary when using to_csv
# pd.set_option('display.max_rows', None)

print(aux.to_csv(index=False, header=None))

Searched references:

Similar question: Find combinations of size r from a set with decreasing sum value
Pandas references:
https://thispointer.com/python-pandas-how-to-display-full-dataframe-i-e-print-all-rows-columns-without-truncation/
https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_csv.html
Online compiler used: https://www.programiz.com/python-programming/online-compiler/

Output using Input Example #1 with 7 numbers in the array [30,25,20,15,10,5,1]:

30,25,20,15,10,5,1,106
30,25,20,15,10,5,0,105
30,25,20,15,10,0,1,101
30,25,20,15,10,0,0,100
30,25,20,15,0,5,1,96
30,25,20,15,0,5,0,95
30,25,20,0,10,5,1,91
30,25,20,15,0,0,1,91
30,25,20,0,10,5,0,90
30,25,20,15,0,0,0,90
30,25,0,15,10,5,1,86
30,25,20,0,10,0,1,86
30,25,0,15,10,5,0,85
30,25,20,0,10,0,0,85
30,0,20,15,10,5,1,81
30,25,0,15,10,0,1,81
30,25,20,0,0,5,1,81
30,0,20,15,10,5,0,80
30,25,0,15,10,0,0,80
30,25,20,0,0,5,0,80
0,25,20,15,10,5,1,76
30,0,20,15,10,0,1,76
30,25,0,15,0,5,1,76
30,25,20,0,0,0,1,76
0,25,20,15,10,5,0,75
30,0,20,15,10,0,0,75
30,25,0,15,0,5,0,75
30,25,20,0,0,0,0,75
0,25,20,15,10,0,1,71
30,0,20,15,0,5,1,71
30,25,0,0,10,5,1,71
30,25,0,15,0,0,1,71
0,25,20,15,10,0,0,70
30,0,20,15,0,5,0,70
30,25,0,0,10,5,0,70
30,25,0,15,0,0,0,70
0,25,20,15,0,5,1,66
30,0,20,0,10,5,1,66
30,0,20,15,0,0,1,66
30,25,0,0,10,0,1,66
0,25,20,15,0,5,0,65
30,0,20,0,10,5,0,65
30,0,20,15,0,0,0,65
30,25,0,0,10,0,0,65
0,25,20,0,10,5,1,61
30,0,0,15,10,5,1,61
0,25,20,15,0,0,1,61
30,0,20,0,10,0,1,61
30,25,0,0,0,5,1,61
0,25,20,0,10,5,0,60
30,0,0,15,10,5,0,60
0,25,20,15,0,0,0,60
30,0,20,0,10,0,0,60
30,25,0,0,0,5,0,60
0,25,0,15,10,5,1,56
0,25,20,0,10,0,1,56
30,0,0,15,10,0,1,56
30,0,20,0,0,5,1,56
30,25,0,0,0,0,1,56
0,25,0,15,10,5,0,55
0,25,20,0,10,0,0,55
30,0,0,15,10,0,0,55
30,0,20,0,0,5,0,55
30,25,0,0,0,0,0,55
0,0,20,15,10,5,1,51
0,25,0,15,10,0,1,51
0,25,20,0,0,5,1,51
30,0,0,15,0,5,1,51
30,0,20,0,0,0,1,51
0,0,20,15,10,5,0,50
0,25,0,15,10,0,0,50
0,25,20,0,0,5,0,50
30,0,0,15,0,5,0,50
30,0,20,0,0,0,0,50
0,0,20,15,10,0,1,46
0,25,0,15,0,5,1,46
30,0,0,0,10,5,1,46
0,25,20,0,0,0,1,46
30,0,0,15,0,0,1,46
0,0,20,15,10,0,0,45
0,25,0,15,0,5,0,45
30,0,0,0,10,5,0,45
0,25,20,0,0,0,0,45
30,0,0,15,0,0,0,45
0,0,20,15,0,5,1,41
0,25,0,0,10,5,1,41
0,25,0,15,0,0,1,41
30,0,0,0,10,0,1,41
0,0,20,15,0,5,0,40
0,25,0,0,10,5,0,40
0,25,0,15,0,0,0,40
30,0,0,0,10,0,0,40
0,0,20,0,10,5,1,36
0,0,20,15,0,0,1,36
0,25,0,0,10,0,1,36
30,0,0,0,0,5,1,36
0,0,20,0,10,5,0,35
0,0,20,15,0,0,0,35
0,25,0,0,10,0,0,35
30,0,0,0,0,5,0,35
0,0,0,15,10,5,1,31
0,0,20,0,10,0,1,31
0,25,0,0,0,5,1,31
30,0,0,0,0,0,1,31
0,0,0,15,10,5,0,30
0,0,20,0,10,0,0,30
0,25,0,0,0,5,0,30
30,0,0,0,0,0,0,30
0,0,0,15,10,0,1,26
0,0,20,0,0,5,1,26
0,25,0,0,0,0,1,26
0,0,0,15,10,0,0,25
0,0,20,0,0,5,0,25
0,25,0,0,0,0,0,25
0,0,0,15,0,5,1,21
0,0,20,0,0,0,1,21
0,0,0,15,0,5,0,20
0,0,20,0,0,0,0,20
0,0,0,0,10,5,1,16
0,0,0,15,0,0,1,16
0,0,0,0,10,5,0,15
0,0,0,15,0,0,0,15
0,0,0,0,10,0,1,11
0,0,0,0,10,0,0,10
0,0,0,0,0,5,1,6
0,0,0,0,0,5,0,5
0,0,0,0,0,0,1,1
Performance enhancement of ranking function by replacement of lambda x with vectorization
I have a ranking function that I apply to a large number of columns of several million rows, and it takes minutes to run. By removing all of the logic preparing the data for application of the .rank() method, i.e. by doing this:

ranked = df[['period_id', 'sector_name'] + to_rank].groupby(['period_id', 'sector_name']).transform(lambda x: (x.rank(ascending = True) - 1)*100/len(x))

I managed to get this down to seconds. However, I need to retain my logic, and am struggling to restructure my code: ultimately, the largest bottleneck is my double use of lambda x:, but clearly other aspects are slowing things down (see below). I have provided a sample data frame, together with my ranking functions, below, i.e. an MCVE. Broadly, I think that my questions boil down to:

(i) How can one replace the .apply(lambda x: ...) usage in the code with a fast, vectorized equivalent?

(ii) How can one loop over multi-indexed, grouped data frames and apply a function? In my case, to each unique combination of the date_id and category columns.

(iii) What else can I do to speed up my ranking logic? The main overhead seems to be in .value_counts(). This overlaps with (i) above; perhaps one can do most of this logic on df, perhaps via construction of temporary columns, before sending it for ranking. Similarly, can one rank the sub-dataframe in one call?

(iv) Why use pd.qcut() rather than df.rank()? The latter is cythonized and seems to have more flexible handling of ties, but I cannot see a comparison between the two, and pd.qcut() seems most widely used.

Sample input data is as follows:

import pandas as pd
import numpy as np
import random

to_rank = ['var_1', 'var_2', 'var_3']
df = pd.DataFrame({'var_1' : np.random.randn(1000), 'var_2' : np.random.randn(1000), 'var_3' : np.random.randn(1000)})
df['date_id'] = np.random.choice(range(2001, 2012), df.shape[0])
df['category'] = ','.join(chr(random.randrange(97, 97 + 4 + 1)).upper() for x in range(1,df.shape[0]+1)).split(',')

The two ranking functions are:

def rank_fun(df, to_rank): # calls ranking function f(x) to rank each category at each date
    # extra data tidying logic here beyond scope of question - can remove
    ranked = df[to_rank].apply(lambda x: f(x))
    return ranked


def f(x):
    nans = x[np.isnan(x)] # Remove nans as these will be ranked with 50
    sub_df = x.dropna()
    nans_ranked = nans.replace(np.nan, 50) # give nans rank of 50

    if len(sub_df.index) == 0: # check not all nan. If no non-nan data, then return with rank 50
        return nans_ranked

    if len(sub_df.unique()) == 1: # if all data has same value, return rank 50
        sub_df[:] = 50
        return sub_df

    # Check that we don't have too many clustered values, such that we can't bin due to overlap of ties,
    # and reduce bin size provided we can at least quintile rank.
    max_cluster = sub_df.value_counts().iloc[0] # value_counts sorts by counts, so first element will contain the max
    max_bins = len(sub_df) / max_cluster

    if max_bins > 100: # if largest cluster <1% of available data, then we can percentile_rank
        max_bins = 100

    if max_bins < 5: # if we don't have the resolution to quintile rank then assume no data.
        sub_df[:] = 50
        return sub_df

    bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
    sub_df_ranked = pd.qcut(sub_df, bins, labels=False) # currently using pd.qcut; pd.rank() seems to have extra functionality, but overheads are similar in practice
    sub_df_ranked *= (100 / bins) # Since we bin using the resolution specified in bins, to convert back to a percentile rank we have to multiply by 100/bins. E.g. with quintiles we'll have scores 1 - 5, so we multiply by 100 / 5 = 20 to convert to percentile ranking

    ranked_df = pd.concat([sub_df_ranked, nans_ranked])
    return ranked_df

And the code to call my ranking function and recombine with df is:

# ensure we don't get duplicate columns if ranking already executed
ranked_cols = [col + '_ranked' for col in to_rank]

ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
df = df.join(ranked[ranked_cols])

I am trying to get this ranking logic as fast as I can, by removing both lambda x calls; I can remove the logic in rank_fun so that only f(x)'s logic is applicable, but I also don't know how to process multi-indexed dataframes in a vectorized fashion. An additional question would be on the differences between pd.qcut() and df.rank(): it seems that both have different ways of dealing with ties, but the overheads seem similar, despite the fact that .rank() is cythonized; perhaps this is misleading, given that the main overheads are due to my usage of lambda x.

I ran %lprun on f(x), which gave me the following results, although the main overhead is the use of .apply(lambda x: ...) rather than a vectorized approach:

Line #  Hits      Time   Per Hit  % Time  Line Contents
     2                                    def tst_fun(df, field):
     3     1       685     685.0     0.2      x = df[field]
     4     1     20726   20726.0     5.8      nans = x[np.isnan(x)]
     5     1     28448   28448.0     8.0      sub_df = x.dropna()
     6     1       387     387.0     0.1      nans_ranked = nans.replace(np.nan, 50)
     7     1         5       5.0     0.0      if len(sub_df.index) == 0:
     8                                            pass # check not empty. May be empty due to nans for first 5 years e.g. no revenue/operating margin data pre 1990
     9                                            return nans_ranked
    10
    11     1     65559   65559.0    18.4      if len(sub_df.unique()) == 1:
    12                                            sub_df[:] = 50 # e.g. for subranks where all factors had nan so ranked as 50 e.g. in 1990
    13                                            return sub_df
    14
    15                                        # Finally, check that we don't have too many clustered values, such that we can't bin, and reduce bin size provided we can at least quintile rank.
    16     1     74610   74610.0    20.9      max_cluster = sub_df.value_counts().iloc[0] # value_counts sorts by counts, so first element will contain the max
    17                                        # print(counts)
    18     1         9       9.0     0.0      max_bins = len(sub_df) / max_cluster
    19
    20     1         3       3.0     0.0      if max_bins > 100:
    21     1         0       0.0     0.0          max_bins = 100 # if largest cluster <1% of available data, then we can percentile_rank
    22
    23
    24     1         0       0.0     0.0      if max_bins < 5:
    25                                            sub_df[:] = 50 # if we don't have the resolution to quintile rank then assume no data.
    26
    27                                        # return sub_df
    28
    29     1         1       1.0     0.0      bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
    30
    31                                        # should track bin resolution for all data. To add.
    32
    33                                        # if we get here, then neither nans_ranked nor sub_df is empty
    34                                        # sub_df_ranked = pd.qcut(sub_df, bins, labels=False)
    35     1    160530  160530.0    45.0      sub_df_ranked = (sub_df.rank(ascending = True) - 1)*100/len(x)
    36
    37     1      5777    5777.0     1.6      ranked_df = pd.concat([sub_df_ranked, nans_ranked])
    38
    39     1         1       1.0     0.0      return ranked_df
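As an editorial aside before the answers (not from the original post): for the simple percentile transform quoted at the top of the question, pandas' own GroupBy.rank(pct=True) is already vectorized per group and avoids the lambda entirely; it does not reproduce the NaN and cluster handling in f(x). A minimal sketch, using the MCVE's column names and a simplified stand-in for the category column:

import numpy as np
import pandas as pd

to_rank = ['var_1', 'var_2', 'var_3']
df = pd.DataFrame({c: np.random.randn(1000) for c in to_rank})
df['date_id'] = np.random.choice(range(2001, 2012), df.shape[0])
df['category'] = np.random.choice(list('ABCDE'), df.shape[0])  # simplified stand-in

# rank within each (date_id, category) group; pct=True gives values in (0, 1]
ranked = df.groupby(['date_id', 'category'])[to_rank].rank(pct=True) * 100
ranked.columns = [c + '_ranked' for c in to_rank]
df = df.join(ranked)
print(df.head())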
I'd build a function using numpy. I plan on using this within each group defined within a pandas groupby.

def rnk(df):
    a = df.values.argsort(0)
    n, m = a.shape
    r = np.arange(a.shape[1])
    b = np.empty_like(a)
    b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    return pd.DataFrame(b / n, df.index, df.columns)

gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']

df.groupby(gcols)[rcols].apply(rnk).add_suffix('_ranked')

   var_1_ranked  var_2_ranked  var_3_ranked
0      0.333333      0.809524      0.428571
1      0.160000      0.360000      0.240000
2      0.153846      0.384615      0.461538
3      0.000000      0.315789      0.105263
4      0.560000      0.200000      0.160000
...

How It Works

Because I know that ranking is related to sorting, I want to use some clever sorting to do this quicker. numpy's argsort will produce a permutation that can be used to slice the array into a sorted array.

a = np.array([25, 300, 7])
b = a.argsort()
print(b)
# [2 0 1]

print(a[b])
# [  7  25 300]

So, instead, I'm going to use the argsort to tell me where the first, second, and third ranked elements are.

# create an empty array that is the same size as b or a
# but these will be ranks, so I want them to be integers
# so I use empty_like(b) because b is the result of
# argsort and is already integers.
u = np.empty_like(b)

# now just like when I sliced a above with a[b]
# I slice u the same way but instead I assign to
# those positions, the ranks I want.
# In this case, I defined the ranks as np.arange(b.size) + 1
u[b] = np.arange(b.size) + 1

print(u)
# [2 3 1]

And that was exactly correct. The 7 was in the last position but was our first rank. 300 was in the second position and was our third rank. 25 was in the first position and was our second rank.

Finally, I divide by the number in the rank to get the percentiles. It so happens that because I used zero-based ranking np.arange(n), as opposed to one-based np.arange(1, n+1) or np.arange(n) + 1 as in our example, I can do the simple division to get the percentiles.

What's left to do is apply this logic to each group. We can do this in pandas with groupby.

Some of the missing details include how I use argsort(0) to get independent sorts per column, and that I do some fancy slicing to rearrange each column independently.

Can we avoid the groupby and have numpy do the whole thing? I'll also take advantage of numba's just-in-time compiling to speed up some things with njit.

from numba import njit

@njit
def count_factor(f):
    c = np.arange(f.max() + 2) * 0
    for i in f:
        c[i + 1] += 1
    return c

@njit
def factor_fun(f):
    c = count_factor(f)
    cc = c[:-1].cumsum()
    return c[1:][f], cc[f]

def lexsort(a, f):
    n, m = a.shape
    f = f * (a.max() - a.min() + 1)
    return (f.reshape(-1, 1) + a).argsort(0)

def rnk_numba(df, gcols, rcols):
    tups = list(zip(*[df[c].values.tolist() for c in gcols]))
    f = pd.Series(tups).factorize()[0]
    a = lexsort(np.column_stack([df[c].values for c in rcols]), f)
    c, cc = factor_fun(f)
    c = c[:, None]
    cc = cc[:, None]
    n, m = a.shape
    r = np.arange(a.shape[1])
    b = np.empty_like(a)
    b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    return pd.DataFrame((b - cc) / c, df.index, rcols).add_suffix('_ranked')

How it works

Honestly, this is difficult to process mentally. I'll stick with expanding on what I explained above. I want to use argsort again to drop rankings into the correct positions. However, I have to contend with the grouping columns. So what I do is compile a list of tuples and factorize them, as was addressed in this question here.

Now that I have a factorized set of tuples, I can perform a modified lexsort that sorts within my factorized tuple groups. This question addresses the lexsort.

A tricky bit remains to be addressed, where I must offset the newfound ranks by the size of each group so that I get fresh ranks for every group. This is taken care of in the tiny snippet b - cc in the code above. But calculating cc is a necessary component.

So that's some of the high-level philosophy. What about @njit?

Note that when I factorize, I am mapping to the integers 0 to n - 1, where n is the number of unique grouping tuples. I can use an array of length n as a convenient way to track the counts.

In order to accomplish the groupby offset, I needed to track the counts and cumulative counts in the positions of those groups as they are represented in the list of tuples, or the factorized version of those tuples. I decided to do a linear scan through the factorized array f and count the observations in a numba loop. While I had this information, I'd also produce the necessary information to produce the cumulative offsets I also needed.

numba provides an interface to produce highly efficient compiled functions. It is finicky, and you have to acquire some experience to know what is possible and what isn't possible. I decided to numbafy two functions that are preceded with the numba decorator @njit. This code works just as well without those decorators, but is sped up with them.

Timing

%%timeit
ranked_cols = [col + '_ranked' for col in to_rank]
ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
1 loop, best of 3: 481 ms per loop

gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']
%timeit df.groupby(gcols)[rcols].apply(rnk).add_suffix('_ranked')
100 loops, best of 3: 16.4 ms per loop

%timeit rnk_numba(df, gcols, rcols).head()
1000 loops, best of 3: 1.03 ms per loop
I suggest you try this code. It's three times faster than yours, and clearer.

The rank function:

def rank(x):
    counts = x.value_counts()
    bins = int(0 if len(counts) == 0 else x.count() / counts.iloc[0])
    bins = 100 if bins > 100 else bins
    if bins < 5:
        return x.apply(lambda x: 50)
    else:
        return (pd.qcut(x, bins, labels=False) * (100 / bins)).fillna(50).astype(int)

Single-threaded apply:

for col in to_rank:
    df[col + '_ranked'] = df.groupby(['date_id', 'category'])[col].apply(rank)

Parallel apply using a multiprocessing Pool:

import sys
from multiprocessing import Pool

def tfunc(col):
    return df.groupby(['date_id', 'category'])[col].apply(rank)

pool = Pool(len(to_rank))
result = pool.map_async(tfunc, to_rank).get(sys.maxint)  # sys.maxint is Python 2 only; on Python 3 use a large int such as sys.maxsize
for (col, val) in zip(to_rank, result):
    df[col + '_ranked'] = val
reordering cluster numbers for correct correspondence
I have a dataset that I clustered using two different clustering algorithms. The results are about the same, but the cluster numbers are permuted. Now, for displaying the color-coded labels, I want the label IDs to be the same for the same clusters. How can I get the correct permutation between the two sets of label IDs? I can do this by brute force, but perhaps there is a better/faster method. I would greatly appreciate any help or pointers. If possible, I am looking for a Python function.
The most well-known algorithm for finding the optimum matching is the Hungarian method. Because it cannot be explained in a few sentences, I have to refer you to a book of your choice or the Wikipedia article "Hungarian algorithm". You can probably get good results (even perfect ones, if the difference is indeed tiny) by simply picking the maximum of the correspondence matrix and then removing that row and column.
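For reference (an aside, not part of the answer above), SciPy ships an implementation of the Hungarian method as scipy.optimize.linear_sum_assignment. A minimal sketch using the example label arrays from the answer below, and assuming both labelings use the values 0..k-1:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

ref_cluster = np.array([2,2,3,1,0,0,0,1,2,1,2,2,0,3,3,3,3])
map_cluster = np.array([0,0,0,1,1,3,2,3,2,2,0,0,0,2,0,3,3])

# overlap counts between the two labelings; maximize total overlap
# by minimizing the negated matrix
cm = contingency_matrix(ref_cluster, map_cluster)
row_ind, col_ind = linear_sum_assignment(-cm)

# col_ind[i] (a map_cluster label) is matched to row_ind[i] (a ref_cluster label)
relabel = {m: r for r, m in zip(row_ind, col_ind)}
aligned = np.array([relabel[m] for m in map_cluster])
print(aligned)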
I have a function that works for me. But it may fail when the two cluster results are very inconsistent, which leads to duplicated max values in the contingency matrix. If your cluster results are about the same, it should work. Here is my code:

import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def align_cluster_index(ref_cluster, map_cluster):
    """
    Remap cluster index according to the ref_cluster.
    Both inputs must be np.ndarray and have the same number of unique cluster index values.
    Xin Niu Jan-15-2020
    """
    ref_values = np.unique(ref_cluster)
    map_values = np.unique(map_cluster)
    print(ref_values)
    print(map_values)

    num_values = ref_values.shape[0]

    if ref_values.shape[0] != map_values.shape[0]:
        print('error: both inputs must have same number of unique cluster index values.')
        return ()

    switched_col = set()
    while True:
        cont_mat = contingency_matrix(ref_cluster, map_cluster)
        print(cont_mat)

        # divide contingency_matrix by its row and col sums to avoid potential duplicated values:
        col_sum = np.matmul(np.ones((num_values, 1)), np.sum(cont_mat, axis=0).reshape(1, num_values))
        row_sum = np.matmul(np.sum(cont_mat, axis=1).reshape(num_values, 1), np.ones((1, num_values)))
        print(col_sum)
        print(row_sum)

        cont_mat = cont_mat / (col_sum + row_sum)
        print(cont_mat)

        # ignore columns that have been switched:
        cont_mat[:, list(switched_col)] = -1
        print(cont_mat)

        sort_0 = np.argsort(cont_mat, axis=0)
        sort_1 = np.argsort(cont_mat, axis=1)
        print('argsort contmat:')
        print(sort_0)
        print(sort_1)

        if np.array_equal(sort_1[:, -1], np.array(range(num_values))):
            break

        # switch values according to the max value in the contingency matrix:
        # get the position of the max value:
        idx_max = np.unravel_index(np.argmax(cont_mat, axis=None), cont_mat.shape)
        print(cont_mat)
        print(idx_max)

        if (cont_mat[idx_max] > 0) and (idx_max[0] not in switched_col):
            cluster_tmp = map_cluster.copy()
            print('switch', map_values[idx_max[1]], 'and:', ref_values[idx_max[0]])
            map_cluster[cluster_tmp == map_values[idx_max[1]]] = ref_values[idx_max[0]]
            map_cluster[cluster_tmp == map_values[idx_max[0]]] = ref_values[idx_max[1]]
            switched_col.add(idx_max[0])
            print(switched_col)
        else:
            break

    print('final argsort contmat:')
    print(sort_0)
    print(sort_1)

    print('final cont_mat:')
    cont_mat = contingency_matrix(ref_cluster, map_cluster)
    col_sum = np.matmul(np.ones((num_values, 1)), np.sum(cont_mat, axis=0).reshape(1, num_values))
    row_sum = np.matmul(np.sum(cont_mat, axis=1).reshape(num_values, 1), np.ones((1, num_values)))
    cont_mat = cont_mat / (col_sum + row_sum)
    print(cont_mat)

    return map_cluster

And here is some test code:

ref_cluster = np.array([2,2,3,1,0,0,0,1,2,1,2,2,0,3,3,3,3])
map_cluster = np.array([0,0,0,1,1,3,2,3,2,2,0,0,0,2,0,3,3])

c = align_cluster_index(ref_cluster, map_cluster)

print(ref_cluster)
print(c)

>>> [2 2 3 1 0 0 0 1 2 1 2 2 0 3 3 3 3]
>>> [2 2 2 1 1 3 0 3 0 0 2 2 2 0 2 3 3]
Delimiting contiguous regions with values above a certain threshold in Pandas DataFrame
I have a Pandas DataFrame of indices and values between 0 and 1, something like this:

6     0.047033
7     0.047650
8     0.054067
9     0.064767
10    0.073183
11    0.077950

I would like to retrieve tuples of the start and end points of regions of more than 5 consecutive values that are all over a certain threshold (e.g. 0.5), so that I would have something like this:

[(150, 185), (632, 680), (1500, 1870)]

Here the first tuple is for a region that starts at index 150, has 35 values that are all above 0.5 in a row, and ends at index 185 (non-inclusive).

I started by filtering for only values above 0.5, like so:

df = df[df['values'] >= 0.5]

And now I have values like this:

632    0.545700
633    0.574983
634    0.572083
635    0.595500
636    0.632033
637    0.657617
638    0.643300
639    0.646283

I can't show my actual dataset, but the following one should be a good representation:

import numpy as np
from pandas import *

np.random.seed(seed=901212)
df = DataFrame(range(1,501), columns=['indices'])
df['values'] = np.random.rand(500)*.5 + .35

yielding:

1     0.491233
2     0.538596
3     0.516740
4     0.381134
5     0.670157
6     0.846366
7     0.495554
8     0.436044
9     0.695597
10    0.826591
...

Here the region (2,4) has two values above 0.5, which would be too short. On the other hand, the region (25,44), with 19 values above 0.5 in a row, would be added to the list.
You can find the first and last element of each consecutive region by looking at the series and its 1-row-shifted values, and then filter the pairs which are adequately apart from each other:

# tag rows based on the threshold
df['tag'] = df['values'] > .5

# first row is a True preceded by a False
fst = df.index[df['tag'] & ~ df['tag'].shift(1).fillna(False)]

# last row is a True followed by a False
lst = df.index[df['tag'] & ~ df['tag'].shift(-1).fillna(False)]

# filter those which are adequately apart
pr = [(i, j) for i, j in zip(fst, lst) if j > i + 4]

so for example the first region would be:

>>> i, j = pr[0]
>>> df.loc[i:j]
    indices    values   tag
15       16  0.639992  True
16       17  0.593427  True
17       18  0.810888  True
18       19  0.596243  True
19       20  0.812684  True
20       21  0.617945  True
I think this prints what you want. It is based heavily on Joe Kington's answer here, so I guess it is appropriate to up-vote that.

import numpy as np

# from Joe Kington's answer here https://stackoverflow.com/a/4495197/3751373
# with minor edits
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""

    # Find the indices of changes in "condition"
    d = np.diff(condition, n=1, axis=0)
    idx, _ = d.nonzero()

    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right. -JK
    # LB this copy to increment is horrible but I get
    # ValueError: output array is read-only without it
    mutable_idx = np.array(idx)
    mutable_idx += 1
    idx = mutable_idx

    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]

    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size]  # Edit

    # Reshape the result into two columns
    idx.shape = (-1, 2)
    return idx

def main():
    import pandas as pd
    RUN_LENGTH_THRESHOLD = 5
    VALUE_THRESHOLD = 0.5

    np.random.seed(seed=901212)
    data = np.random.rand(500)*.5 + .35
    df = pd.DataFrame(data=data, columns=['values'])

    match_bools = df.values > VALUE_THRESHOLD

    print('with boolean array')
    for start, stop in contiguous_regions(match_bools):
        if (stop - start > RUN_LENGTH_THRESHOLD):
            print(start, stop)

if __name__ == '__main__':
    main()

I would be surprised if there were not more elegant ways.
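As a closing aside (not from either answer), the same run detection can also be written with the common shift/cumsum grouping idiom in pandas; a minimal sketch on the question's generated data, using the positional index rather than the 'indices' column:

import numpy as np
import pandas as pd

np.random.seed(seed=901212)
df = pd.DataFrame({'values': np.random.rand(500) * .5 + .35})

above = df['values'] > 0.5
run_id = (above != above.shift()).cumsum()       # new id whenever the boolean flips

runs = df.index.to_series().groupby(run_id).agg(['first', 'last'])
runs = runs[above.groupby(run_id).first()]               # keep only runs of True
runs = runs[runs['last'] - runs['first'] + 1 > 5]        # more than 5 consecutive rows
print(list(zip(runs['first'], runs['last'] + 1)))        # (start, end-exclusive) tuples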