I have a function "calc" that is being called via apply(). The question is: how can I provide the pandas column name dynamically to the calc function as an argument in my apply call (instead of explicitly mentioning 'AMOUNT' as in this case)? Thanks.
def calc(row):
    factor = 3
    h_value = int(row['AMOUNT']) // 100
    output = h_value * factor
    return output

df1['BILL_VALUE'] = df1.apply(calc, axis=1)
You can pass extra keyword arguments through apply's **kwds parameter:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)
def calc(row, param=''):
    factor = 3
    h_value = int(row[param]) // 100
    output = h_value * factor
    return output

df1['BILL_VALUE'] = df1.apply(calc, axis=1, param='AMOUNT')
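Alternatively, extra positional arguments can be passed through apply's args tuple; a minimal sketch with the same calc and df1 as above:

df1['BILL_VALUE'] = df1.apply(calc, axis=1, args=('AMOUNT',))  # 'AMOUNT' fills the param argument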
You don't need df.apply here.
Performance precedence of operations, as per this answer:
vectorization
using a custom cython routine
apply
a) reductions that can be performed in cython
b) iteration in python space
itertuples
iterrows
updating an empty frame (e.g. using loc one-row-at-a-time)
Change your function definition as below and pass appropriate arguments.
def calc(df, input_col, output_col, factor=3):
    df[output_col] = (df[input_col].astype("int") // 100) * factor
Example:
>>> def calc(df, input_col, output_col, factor=3):
...     df[output_col] = (df[input_col].astype("int") // 100) * factor
...
>>> df1 = pd.DataFrame([1200,100,2425], columns=["AMOUNT"])
>>> df1
AMOUNT
0 1200
1 100
2 2425
>>> calc(df1, "AMOUNT", "BILL_VALUE")
>>> df1
AMOUNT BILL_VALUE
0 1200 36
1 100 3
2 2425 72
>>> df2 = pd.DataFrame([3123,55,420], columns=["AMOUNT"])
>>> calc(df2, "AMOUNT", "BILL_VALUE")
>>> df2
AMOUNT BILL_VALUE
0 3123 93
1 55 0
2 420 12
Reference:
Does pandas iterrows have performance issues?
I've got a DataFrame data containing columns C0, S0 and M0, say,

C0    S0    M0
1     1     5
2     1     5
3     0.2   2

Now I want to judge whether C0 is between M0-2*S0 and M0+2*S0 for each row in data and write the result in a new column data['J0']. So I define such a 3-variable function J:
def J(mean, std, x):
    try:
        lowb = mean - 2 * std
        highb = mean + 2 * std
        if lowb <= x and highb >= x:
            return '-'
        if x > highb:
            return '↑'
        if x < lowb:
            return '↓'
    except:
        return np.nan
I think it is proper to use .apply to do this operation on columns M0, S0 and C0 and store the result in J0. However, I have only done this with single-variable lambda functions. How exactly do I write the .apply call with this 3-variable function (C0 -> x, S0 -> std, M0 -> mean)? Thank you in advance for your kind help!
You can define a function that uses J but takes a whole row at as argument:
def vJ(r):
    return J(*r)
data['J0'] = data[['M0', 'S0', 'C0']].apply(vJ, axis=1)
>>> data
C0 S0 M0 J0
0 1 1.0 5 ↓
1 2 1.0 5 ↓
2 3 0.2 2 ↑
However, note that this may be slow for large DataFrames. A faster option is to implement the logic of J with vectorized operations:
# faster (suitable for large df)
lob = data['M0'] - 2 * data['S0']
hib = data['M0'] + 2 * data['S0']
data['J0'] = '-'
data.loc[data['C0'] > hib, 'J0'] = '↑'
data.loc[data['C0'] < lob, 'J0'] = '↓'
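Another vectorized option, shown here as a sketch on the same data frame, is np.select:

import numpy as np

lob = data['M0'] - 2 * data['S0']
hib = data['M0'] + 2 * data['S0']
data['J0'] = np.select(
    [data['C0'] > hib, data['C0'] < lob],  # conditions, checked in order
    ['↑', '↓'],                            # matching labels
    default='-')                           # otherwise: inside the band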
You can use a lambda function to pass your n arguments using apply.
import pandas as pd
import numpy as np
df_sample = pd.DataFrame([[1, 1, 5],
                          [2, 1, 5],
                          [3, 0.2, 2]], columns=["C0", "S0", "M0"])
def J(mean, std, x):
    try:
        lowb = mean - 2 * std
        highb = mean + 2 * std
        if lowb <= x and highb >= x:
            return '-'
        if x > highb:
            return '↑'
        if x < lowb:
            return '↓'
    except:
        return np.nan
df_sample['J0'] = df_sample.apply(lambda x: J(x['M0'], x['S0'], x['C0']), axis=1)
print(df_sample)
I have a dataframe of part numbers stored as object with a string containing 3 digits of values of following format:
Either 1R2, where the R is the decimal separator,
or only digits, where the first two are significant and the third is the number of trailing zeros:
101 = 100
010 = 1
223 = 22000
476 = 47000000
My dataframe (important are positions 5~7):
MATNR
0 xx01B101KO3XYZC
1 xx03C010CA3GN5T
2 xx02L1R2CA3ANNR
Below code works fine for the 1R2 case and converts object to float64.
But I am stuck with getting the 2 significant numbers together with the number of 0s.
value_pos1 = 5
value_pos2 = 6
value_pos3 = 7
df['Value'] = pd.to_numeric(np.where(df['MATNR'].str.get(value_pos2)=='R',
df['MATNR'].str.get(value_pos1) + '.' + df['MATNR'].str.get(value_pos3),
df['MATNR'].str.slice(start=value_pos1, stop=value_pos3) + df['MATNR'].str.get(value_pos3)))
Result
MATNR object
Cap pF float64
dtype: object
Index(['MATNR', 'Value'], dtype='object')
MATNR Value
0 xx01B101KO3XYZC 101.0
1 xx03C010CA3GN5T 10.0
2 xx02L1R2CA3ANNR 1.2
It should be
MATNR Value
0 xx01B101KO3XYZC 100.0
1 xx03C010CA3GN5T 1.0
2 xx02L1R2CA3ANNR 1.2
I then tried the following, which raises errors; on top of that, a 0 at pos3 produces the wrong value (1 instead of 0).
df['Value'] = pd.to_numeric(np.where(df['MATNR'].str.get(value_pos2)=='R',
df['MATNR'].str.get(Value_pos1) + '.' + df['MATNR'].str.get(value_pos3),
df['MATNR'].str.slice(start=value_pos1, stop=value_pos3) + str(pow(10, pd.to_numeric(df['MATNR'].str.get(value_pos3))))))
Do you have an idea?
If I have understood your problem correctly, defining a method and applying it to all the values of the column seems most intuitive. The method takes a str input and returns a float number.
Here is a snippet of what the simple method would entail.
def get_number(strrep):
    if not strrep or len(strrep) < 8:
        return 0.0
    useful_str = strrep[5:8]
    if useful_str[1] == 'R':
        return float(useful_str[0] + '.' + useful_str[2])
    else:
        zeros = '0' * int(useful_str[2])
        return float(useful_str[0:2] + zeros)
Then you could simply create a new column with the numeric conversion of the strings. The easiest way possible is using list comprehension:
df['Value'] = [get_number(x) for x in df['MATNR']]
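Equivalently, the same helper can be applied with Series.apply:

df['Value'] = df['MATNR'].apply(get_number)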
Not sure where the bug in your code is, but another option that I tend to use when creating a new column based on other columns is pandas' apply function:
def create_value_col(row):
    if row['MATNR'][value_pos2] == 'R':
        # float() so the column gets a consistent numeric dtype
        val = float(row['MATNR'][value_pos1] + '.' + row['MATNR'][value_pos3])
    else:
        val = (int(row['MATNR'][value_pos1]) * 10 +
               int(row['MATNR'][value_pos2])) * 10 ** int(row['MATNR'][value_pos3])
    return val
df['Value'] = df.apply(lambda row: create_value_col(row), axis='columns')
This way, you can create a function that processes the data however you want and then apply it to every row and add the resulting series to your dataframe.
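For reference, the original vectorized attempt can also be made to work; the key issue is wrapping the 10**exponent term in str(), which stringifies the whole Series. A sketch, assuming the same df and value_pos variables as in the question:

is_r = df['MATNR'].str.get(value_pos2) == 'R'

# decimal case, e.g. '1R2' -> 1.2 (only meaningful where is_r is True)
dec = pd.to_numeric(df['MATNR'].str.get(value_pos1) + '.' + df['MATNR'].str.get(value_pos3),
                    errors='coerce')

# significant-digits case, e.g. '101' -> 10 * 10**1 = 100
sig = pd.to_numeric(df['MATNR'].str.slice(value_pos1, value_pos3), errors='coerce')
exp = pd.to_numeric(df['MATNR'].str.get(value_pos3), errors='coerce')

df['Value'] = np.where(is_r, dec, sig * 10.0 ** exp)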
Imagine we have different structures of dataframes in Pandas
# creating the first dataframe
df1 = pd.DataFrame({
    "width": [1, 5],
    "height": [5, 8]})

# creating the second dataframe
df2 = pd.DataFrame({
    "a": [7, 8],
    "b": [11, 23],
    "c": [1, 3]})

# creating the third dataframe
df3 = pd.DataFrame({
    "radius": [7, 8],
    "height": [11, 23]})
In general there might be more than 2 dataframes. Now, I want to create logic that maps column names to specific functions in order to create a new column "metric" (think of it as area for two columns and volume for three columns). I want to specify column name ensembles:
column_name_ensembles = {
    "1": {
        "ensemble": ['height', 'width'],
        "method": area},
    "2": {
        "ensemble": ['a', 'b', 'c'],
        "method": volume_cube},
    "3": {
        "ensemble": ['radius', 'height'],
        "method": volume_cylinder}}
def area(width, height):
    return width * height

def volume_cube(a, b, c):
    return a * b * c

def volume_cylinder(radius, height):
    return (3.14159 * radius ** 2) * height
Now, the area function creates a new column for the dataframe, df1['metric'] = df1['height'] * df1['width'], and the volume function creates a new column for the dataframe, df2['metric'] = df2['a'] * df2['b'] * df2['c']. Note that the functions can have arbitrary form, but they take the ensemble as parameters. The desired function metric(df, column_name_ensembles) should take an arbitrary dataframe as input and decide, by inspecting the column names, which function should be applied.
Example input output behaviour
df1_with_metric = metric(df1, column_name_ensembles)
print(df1_with_metric)
# output
# width height metric
# 0 1 5 5
# 1 5 8 40
df2_with_metric = metric(df2, column_name_ensembles)
print(df2_with_metric)
# output
# a b c metric
# 0 7 11 1 77
# 1 8 23 3 552
df3_with_metric = metric(df3, column_name_ensembles)
print(df3_with_metric)
# output
# radius height metric
# 0 7 11 1693.31701
# 1 8 23 4624.42048
The perfect solution would be a function that takes the dataframe and the column_name_ensembles as parameters and returns the dataframe with the appropriate 'metric' added to it.
I know this can be achieved by multiple if and else statements, but this does not seem to be the most intelligent solution. Maybe there is a design pattern that can solve this problem, but I am not an expert at design patterns.
Thank you for reading my question! I am looking forward to your answers.
You can use the inspect module to extract parameter names automatically and then map frozenset of parameter names to metric functions directly:
import inspect
metrics = {
frozenset(inspect.signature(f).parameters): f
for f in (area, volume_cube, volume_cylinder)
}
Then for a given data frame, if all columns are guaranteed to be arguments to the relevant metric, you can simply query that dictionary:
def apply_metric(df, metrics):
    metric = metrics[frozenset(df.columns)]
    args = tuple(df[p] for p in inspect.signature(metric).parameters)
    df['metric'] = metric(*args)
    return df
In case the input data frame has more columns than are required by the metric function you can use set intersection for finding the relevant metric:
def apply_metric(df, metrics):
    for parameters, metric in metrics.items():
        if parameters & set(df.columns) == parameters:
            args = tuple(df[p] for p in inspect.signature(metric).parameters)
            df['metric'] = metric(*args)
            break
    else:
        raise ValueError(f'No metric found for columns {df.columns}')
    return df
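A quick usage sketch, assuming the metrics dictionary built above and the data frames from the question:

print(apply_metric(df1.copy(), metrics))  # metric = width * height
print(apply_metric(df2.copy(), metrics))  # metric = a * b * c
print(apply_metric(df3.copy(), metrics))  # metric = pi * radius**2 * height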
The function that runs the model should be a fairly flexible apply. Assuming the calculations will always be limited to the data in a single row, this would probably work.
First, I modified the functions to use a common input. I added a triangle area calc to be sure this was extensible.
# def area(width, height):
#     return width * height
def area(row):
    return row['width'] * row['height']

# def volume_cube(a, b, c):
#     return a * b * c
def volume_cube(row):
    return row['a'] * row['b'] * row['c']

# def volume_cylinder(radius, height):
#     return (3.14159 * radius ** 2) * height
def volume_cylinder(row):
    return (3.14159 * row['radius'] ** 2) * row['height']

def area_triangle(row):
    return 0.5 * row['width'] * row['height']
This allows us to use the same apply for all of the functions. Because I'm a bit ocd, I changed the names of keys in the reference dictionary.
column_name_ensembles = {
    "area": {
        "ensemble": ['width', 'height'],
        "method": area},
    "volume_cube": {
        "ensemble": ['a', 'b', 'c'],
        "method": volume_cube},
    "volume_cylinder": {
        "ensemble": ['radius', 'height'],
        "method": volume_cylinder},
    "area_triangle": {
        "ensemble": ['width', 'height'],
        "method": area_triangle},
}
The metric function then is an apply to the df. You have to specify the function you are targeting in this version, but you could infer the ensemble method based on the columns. This version makes sure the required columns are available.
def metric(df, method_id):
    source_columns = list(df.columns)
    calc_columns = column_name_ensembles[method_id]['ensemble']
    if all(factor in source_columns for factor in calc_columns):
        df['metric'] = df.apply(lambda row: column_name_ensembles[method_id]['method'](row), axis=1)
        return df
    else:
        print('Column Mismatch')
You can then specify the dataframe and the ensemble method.
df1_with_metric = metric(df1,'area')
df2_with_metric = metric(df2,'volume_cube')
df3_with_metric = metric(df3,'volume_cylinder')
df1_with_triangle_metric = metric(df1,'area_triangle')
def metric(df, column_name_ensembles):
    df_cols_set = set(df.columns)
    # in case there is a need to overwrite a previously calculated 'metric' column
    df_cols_set.discard('metric')
    for column_name_ensemble in column_name_ensembles.items():
        # pick the first `column_name_ensemble` dictionary
        # whose 'ensemble' matches the df columns
        # (excluding the 'metric' column, if present);
        # compare `set` if the order of column names
        # in the ensemble does not matter (as per your df1 example),
        # else compare `list`
        if df_cols_set == set(column_name_ensemble[1]['ensemble']):
            df['metric'] = column_name_ensemble[1]['method'](**{col: df[col] for col in df_cols_set})
            break
    # if there is a match, return df with 'metric' calculated
    # else, return the original df untouched
    return df
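A usage sketch, assuming df1, df2, df3 and column_name_ensembles from the question (with the "method" values being the actual function objects):

for frame in (df1, df2, df3):
    print(metric(frame, column_name_ensembles), '\n')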
Solution
The idea is to make a function as generic as possible. To do that you should rely on df.apply using axis=1 to apply the function row wise.
The function would be:
def method(df_in, ensembles):
    # To avoid modifying the original dataframe
    df = df_in.copy()
    for data in ensembles.values():
        if set(df.columns) == set(data["ensemble"]):
            df["metric"] = df.apply(lambda row: data["method"](**row), axis=1)
    return df
Why does it always work?
This approach works even for functions that cannot operate on whole columns at once.
For example:
df = pd.DataFrame({
    "a": [1, 2],
    "b": [[1, 2], [3, 4]],
})

def a_in_b(a, b):
    return a in b
# This will work
df.apply(lambda row: a_in_b(**row), axis=1)
# This won't
a_in_b(df["a"], df["b"])
Here is an interesting way of doing this using pandas methods (details below).
def metric(dataframe, column_name_ensembles):
    func_df = pd.DataFrame(column_name_ensembles).T
    func_to_apply = func_df.loc[func_df['ensemble'].map(dataframe.columns.difference)
                                .str.len().eq(0), 'method'].iat[0]
    return dataframe.assign(metric=dataframe.apply(lambda x: func_to_apply(**x), axis=1))
print(metric(df1,column_name_ensembles),'\n')
print(metric(df2,column_name_ensembles),'\n')
print(metric(df3,column_name_ensembles))
width height metric
0 1 5 5
1 5 8 40
a b c metric
0 7 11 1 77
1 8 23 3 552
radius height metric
0 7 11 1693.31701
1 8 23 4624.42048
More details:
func_df = pd.DataFrame(column_name_ensembles).T
This creates a dataframe of column names and associated methods like below:
ensemble method
1 [height, width] <function area at 0x000002809540F9D8>
2 [a, b, c] <function volume_cube at 0x000002809540F950>
3 [radius, height] <function volume_cylinder at 0x000002809540FF28>
Using this dataframe, we find the row where the difference between the column names of the passed dataframe and the list of columns in ensemble is empty, using pd.Index.difference, Series.map, Series.str.len and Series.eq():
func_df['ensemble'].map(df1.columns.difference)
1 Index([], dtype='object') <- Row matches the df columns completely
2 Index(['height', 'width'], dtype='object')
3 Index(['width'], dtype='object')
Name: ensemble, dtype: object
func_df['ensemble'].map(df1.columns.difference).str.len().eq(0)
1 True
2 False
3 False
Next, where True, we pick the function in the method column:
func_df.loc[func_df['ensemble'].map(df1.columns.difference)
            .str.len().eq(0), 'method'].iat[0]
#<function __main__.area(width, height)>
and using apply and df.assign we create a new column, returning a copy of the passed dataframe.
I have a ranking function that I apply to a large number of columns of several million rows, which takes minutes to run. By removing all of the logic preparing the data for application of the .rank() method, i.e. by doing this:
ranked = df[['period_id', 'sector_name'] + to_rank].groupby(['period_id', 'sector_name']).transform(lambda x: (x.rank(ascending = True) - 1)*100/len(x))
I managed to get this down to seconds. However, I need to retain my logic, and am struggling to restructure my code: ultimately, the largest bottleneck is my double use of lambda x:, but clearly other aspects are slowing things down (see below). I have provided a sample data frame, together with my ranking functions below, i.e. an MCVE. Broadly, I think that my questions boil down to:
(i) How can one replace the .apply(lambda x: ...) usage in the code with a fast, vectorized equivalent? (ii) How can one loop over multi-indexed, grouped data frames and apply a function? In my case, to each unique combination of the date_id and category columns.
(iii) What else can I do to speed up my ranking logic? The main overhead seems to be in .value_counts(). This overlaps with (i) above; perhaps one can do most of this logic on df, perhaps via construction of temporary columns, before sending it for ranking. Similarly, can one rank the sub-dataframe in one call?
(iv) Why use pd.qcut() rather than df.rank()? The latter is cythonized and seems to have more flexible handling of ties, but I cannot see a comparison between the two, and pd.qcut() seems most widely used.
Sample input data is as follows:
import pandas as pd
import numpy as np
import random
to_rank = ['var_1', 'var_2', 'var_3']
df = pd.DataFrame({'var_1' : np.random.randn(1000), 'var_2' : np.random.randn(1000), 'var_3' : np.random.randn(1000)})
df['date_id'] = np.random.choice(range(2001, 2012), df.shape[0])
df['category'] = ','.join(chr(random.randrange(97, 97 + 4 + 1)).upper() for x in range(1,df.shape[0]+1)).split(',')
The two ranking functions are:
def rank_fun(df, to_rank):  # calls ranking function f(x) to rank each category at each date
    # extra data tidying logic here beyond scope of question - can remove
    ranked = df[to_rank].apply(lambda x: f(x))
    return ranked
def f(x):
    nans = x[np.isnan(x)]  # Remove nans as these will be ranked with 50
    sub_df = x.dropna()
    nans_ranked = nans.replace(np.nan, 50)  # give nans rank of 50
    if len(sub_df.index) == 0:  # check not all nan. If no non-nan data, then return with rank 50
        return nans_ranked
    if len(sub_df.unique()) == 1:  # if all data has same value, return rank 50
        sub_df[:] = 50
        return sub_df
    # Check that we don't have too many clustered values, such that we can't bin due to overlap of ties,
    # and reduce bin size provided we can at least quintile rank.
    max_cluster = sub_df.value_counts().iloc[0]  # value_counts sorts by counts, so first element will contain the max
    max_bins = len(sub_df) / max_cluster
    if max_bins > 100:  # if largest cluster <1% of available data, then we can percentile rank
        max_bins = 100
    if max_bins < 5:  # if we don't have the resolution to quintile rank then assume no data
        sub_df[:] = 50
        return sub_df
    bins = int(max_bins)  # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
    sub_df_ranked = pd.qcut(sub_df, bins, labels=False)  # currently using pd.qcut; pd.rank() seems to have extra functionality, but overheads similar in practice
    sub_df_ranked *= (100 / bins)  # Since we bin using the resolution specified in bins, to convert back to a percentile-style rank we multiply by 100/bins. E.g. with quintiles we have scores 1-5, so multiply by 100 / 5 = 20
    ranked_df = pd.concat([sub_df_ranked, nans_ranked])
    return ranked_df
And the code to call my ranking function and recombine with df is:
# ensure don't get duplicate columns if ranking already executed
ranked_cols = [col + '_ranked' for col in to_rank]
ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
df = df.join(ranked[ranked_cols])
I am trying to make this ranking logic as fast as I can by removing both lambda x calls; I can remove the logic in rank_fun so that only f(x)'s logic applies, but I also don't know how to process multi-index dataframes in a vectorized fashion. An additional question is on the differences between pd.qcut() and df.rank(): it seems that both have different ways of dealing with ties, but the overheads seem similar, despite the fact that .rank() is cythonized; perhaps this is misleading, given that the main overheads are due to my usage of lambda x.
I ran %lprun on f(x), which gave me the following results, although the main overhead is the use of .apply(lambda x: ...) rather than a vectorized approach:
Line # Hits Time Per Hit % Time Line Contents
2 def tst_fun(df, field):
3 1 685 685.0 0.2 x = df[field]
4 1 20726 20726.0 5.8 nans = x[np.isnan(x)]
5 1 28448 28448.0 8.0 sub_df = x.dropna()
6 1 387 387.0 0.1 nans_ranked = nans.replace(np.nan, 50)
7 1 5 5.0 0.0 if len(sub_df.index) == 0:
8 pass #check not empty. May be empty due to nans for first 5 years e.g. no revenue/operating margin data pre 1990
9 return nans_ranked
10
11 1 65559 65559.0 18.4 if len(sub_df.unique()) == 1:
12 sub_df[:] = 50 #e.g. for subranks where all factors had nan so ranked as 50 e.g. in 1990
13 return sub_df
14
15 #Finally, check that we don't have too many clustered values, such that we can't bin, and reduce bin size provided we can at least quintile rank.
16 1 74610 74610.0 20.9 max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
17 # print(counts)
18 1 9 9.0 0.0 max_bins = len(sub_df) / max_cluster #
19
20 1 3 3.0 0.0 if max_bins > 100:
21 1 0 0.0 0.0 max_bins = 100 #if largest cluster <1% of available data, then we can percentile_rank
22
23
24 1 0 0.0 0.0 if max_bins < 5:
25 sub_df[:] = 50 #if we don't have the resolution to quintile rank then assume no data.
26
27 # return sub_df
28
29 1 1 1.0 0.0 bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
30
31 #should track bin resolution for all data. To add.
32
33 #if get here, then neither nans_ranked, nor sub_df are empty
34 # sub_df_ranked = pd.qcut(sub_df, bins, labels=False)
35 1 160530 160530.0 45.0 sub_df_ranked = (sub_df.rank(ascending = True) - 1)*100/len(x)
36
37 1 5777 5777.0 1.6 ranked_df = pd.concat([sub_df_ranked, nans_ranked])
38
39 1 1 1.0 0.0 return ranked_df
I'd build a function using numpy
I plan on using this within each group defined within a pandas groupby
def rnk(df):
    a = df.values.argsort(0)
    n, m = a.shape
    r = np.arange(a.shape[1])
    b = np.empty_like(a)
    b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    return pd.DataFrame(b / n, df.index, df.columns)
gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']
df.groupby(gcols)[rcols].apply(rnk).add_suffix('_ranked')
var_1_ranked var_2_ranked var_3_ranked
0 0.333333 0.809524 0.428571
1 0.160000 0.360000 0.240000
2 0.153846 0.384615 0.461538
3 0.000000 0.315789 0.105263
4 0.560000 0.200000 0.160000
...
How It Works
Because I know that ranking is related to sorting, I want to use some clever sorting to do this quicker.
numpy's argsort will produce a permutation that can be used to slice the array into a sorted array.
a = np.array([25, 300, 7])
b = a.argsort()
print(b)
[2 0 1]
print(a[b])
[ 7 25 300]
So, instead, I'm going to use the argsort to tell me where the first, second, and third ranked elements are.
# create an empty array that is the same size as b or a
# but these will be ranks, so I want them to be integers
# so I use empty_like(b) because b is the result of
# argsort and is already integers.
u = np.empty_like(b)
# now just like when I sliced a above with a[b]
# I slice u the same way but instead I assign to
# those positions, the ranks I want.
# In this case, I defined the ranks as np.arange(b.size) + 1
u[b] = np.arange(b.size) + 1
print(u)
[2 3 1]
And that was exactly correct. The 7 was in the last position but was our first rank. 300 was in the second position and was our third rank. 25 was in the first position and was our second rank.
Finally, I divide by the number in the rank to get the percentiles. It so happens that because I used zero based ranking np.arange(n), as opposed to one based np.arange(1, n+1) or np.arange(n) + 1 as in our example, I can do the simple division to get the percentiles.
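A small sketch continuing the toy example above, this time with zero-based ranks and the division:

v = np.empty_like(b)
v[b] = np.arange(b.size)  # zero-based ranks: [1 2 0]
print(v / b.size)         # percentiles:      [0.333... 0.666... 0.]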
What's left to do is apply this logic to each group. We can do this in pandas with groupby
Some of the missing details include how I use argsort(0) to get independent sorts per column, and that I do some fancy slicing to rearrange each column independently.
Can we avoid the groupby and have numpy do the whole thing?
I'll also take advantage of numba's just in time compiling to speed up some things with njit
from numba import njit

@njit
def count_factor(f):
    c = np.arange(f.max() + 2) * 0
    for i in f:
        c[i + 1] += 1
    return c

@njit
def factor_fun(f):
    c = count_factor(f)
    cc = c[:-1].cumsum()
    return c[1:][f], cc[f]

def lexsort(a, f):
    n, m = a.shape
    f = f * (a.max() - a.min() + 1)
    return (f.reshape(-1, 1) + a).argsort(0)

def rnk_numba(df, gcols, rcols):
    tups = list(zip(*[df[c].values.tolist() for c in gcols]))
    f = pd.Series(tups).factorize()[0]
    a = lexsort(np.column_stack([df[c].values for c in rcols]), f)
    c, cc = factor_fun(f)
    c = c[:, None]
    cc = cc[:, None]
    n, m = a.shape
    r = np.arange(a.shape[1])
    b = np.empty_like(a)
    b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    return pd.DataFrame((b - cc) / c, df.index, rcols).add_suffix('_ranked')
How it works
Honestly, this is difficult to process mentally. I'll stick with expanding on what I explained above.
I want to use argsort again to drop rankings into the correct positions. However, I have to contend with the grouping columns. So what I do is compile a list of tuples and factorize them as was addressed in this question here
Now that I have a factorized set of tuples I can perform a modified lexsort that sorts within my factorized tuple groups. This question addresses the lexsort.
A tricky bit remains to be addressed where I must off set the new found ranks by the size of each group so that I get fresh ranks for every group. This is taken care of in the tiny snippet b - cc in the code below. But calculating cc is a necessary component.
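A tiny toy sketch of that group-offset idea (hypothetical values, not the answer's data):

keys = pd.Series(list(zip(['a', 'a', 'b', 'b'], [1, 1, 2, 2])))
f = keys.factorize()[0]                              # group codes: [0, 0, 1, 1]
vals = np.array([10, 5, 7, 3])

shifted = f * (vals.max() - vals.min() + 1) + vals   # groups never interleave when sorted
order = shifted.argsort()

ranks = np.empty_like(order)
ranks[order] = np.arange(len(vals))                  # global zero-based ranks

counts = np.bincount(f)                              # group sizes
offsets = np.concatenate(([0], counts.cumsum()[:-1]))[f]
print(ranks - offsets)                               # per-group ranks: [1 0 1 0]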
So that's some of the high level philosophy. What about @njit?
Note that when I factorize, I am mapping to the integers 0 to n - 1 where n is the number of unique grouping tuples. I can use an array of length n as a convenient way to track the counts.
In order to accomplish the groupby offset, I needed to track the counts and cumulative counts in the positions of those groups as they are represented in the list of tuples or the factorized version of those tuples. I decided to do a linear scan through the factorized array f and count the observations in a numba loop. While I had this information, I'd also produce the necessary information to produce the cumulative offsets I also needed.
numba provides an interface to produce highly efficient compiled functions. It is finicky and you have to acquire some experience to know what is and isn't possible. I decided to numbafy two functions, which are preceded by the numba decorator @njit. This code works just as well without those decorators, but is sped up with them.
Timing
%%timeit
ranked_cols = [col + '_ranked' for col in to_rank]
ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)
1 loop, best of 3: 481 ms per loop
gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']
%timeit df.groupby(gcols)[rcols].apply(rnk).add_suffix('_ranked')
100 loops, best of 3: 16.4 ms per loop
%timeit rnk_numba(df, gcols, rcols).head()
1000 loops, best of 3: 1.03 ms per loop
I suggest you try this code. It's 3 times faster than yours, and clearer.
rank function:
def rank(x):
    counts = x.value_counts()
    bins = int(0 if len(counts) == 0 else x.count() / counts.iloc[0])
    bins = 100 if bins > 100 else bins
    if bins < 5:
        return x.apply(lambda x: 50)
    else:
        return (pd.qcut(x, bins, labels=False) * (100 / bins)).fillna(50).astype(int)
single thread apply:
for col in to_rank:
    df[col + '_ranked'] = df.groupby(['date_id', 'category'])[col].apply(rank)
multiprocessing apply:
import sys
from multiprocessing import Pool

def tfunc(col):
    return df.groupby(['date_id', 'category'])[col].apply(rank)

pool = Pool(len(to_rank))
result = pool.map_async(tfunc, to_rank).get(sys.maxsize)  # sys.maxint on Python 2

for (col, val) in zip(to_rank, result):
    df[col + '_ranked'] = val
Trying to compute a range (confidence interval) to return two values in lambda mapped over a column.
M=12.4; n=10; T=1.3
dt = pd.DataFrame( { 'vc' : np.random.randn(10) } )
ci = lambda c : M + np.asarray( -c*T/np.sqrt(n) , c*T/np.sqrt(n) )
dt['ci'] = dt['vc'].map( ci )
print '\n confidence interval ', dt['ci'][:,1]
..er , so how does this get done?
then, how to unpack the tuple in a lambda?
(I want to check if the range >0, ie contains the mean)
neither of the following work:
appnd = lambda c2: c2[0]*c2[1] > 0 and 1 or 0
app2 = lambda x,y: x*y >0 and 1 or 0
dt[cnt] = dt['ci'].map(app2)
It's probably easier to see by defining a proper function for the CI, rather than a lambda.
As far as the unpacking goes, maybe you could let the function take an argument for whether to add or subtract, and then apply it twice.
You should also calculate the mean and size in the function, instead of assigning them ahead of time.
In [40]: def ci(arr, op, t=2.0):
    ...:     M = arr.mean()
    ...:     n = len(arr)
    ...:     rhs = arr * t / np.sqrt(n)
    ...:     return np.array(op(M, rhs))
You can import the add and sub functions from operator
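For completeness, that import is just:

from operator import add, sub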
From there it's just a one liner:
In [47]: pd.concat([dt.apply(ci, axis=1, op=x) for x in [sub, add]], axis=1)
Out[47]:
vc vc
0 -0.374189 1.122568
1 0.217528 -0.652584
2 -0.636278 1.908835
3 -1.132730 3.398191
4 0.945839 -2.837518
5 -0.053275 0.159826
6 -0.031626 0.094879
7 0.931007 -2.793022
8 -1.016031 3.048093
9 0.051007 -0.153022
[10 rows x 2 columns]
I'd recommend breaking that into a few steps for clarity.
Get the minus one with r1 = dt.apply(ci, axis=1, op=sub), and the plus with r2 = dt.apply(ci, axis=1, op=add). Combine with pd.concat([r1, r2], axis=1)
Basically, it's hard to tell from dt.apply what the output should look like, just seeing some tuples. By applying separately, we get two 10 x 1 arrays.