This is an example dataframe, my actual dataframe has 100s more rows.
nums_1 nums_2 nums_3
1 1 8
2 1 7
3 5 9
Is there a method that will calculate the 95% confidence interval across each row, ideally one that also works for a large dataframe?
df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3' : [8,7,9]})
You can use:
import numpy as np
from scipy import stats
df.apply(lambda x: stats.t.interval(0.95, len(x)-1, loc=np.mean(x), scale=stats.sem(x)), axis=1)
You will obtain essentially the same results by using the following:
import statsmodels.stats.api as sms
df.apply(lambda x: sms.DescrStatsW(x).tconfint_mean(), axis=1)
Both approaches return the same result: one tuple (lower bound, upper bound) per row.
The answer is described here: Compute a confidence interval from sample data
What is important to understand is that it works correctly if each row (each sample) is drawn independently from a normal distribution with an unknown standard deviation.
When it comes to large dataframes, the easy solution is to use swifter. However, it only makes the calculation about twice as fast. Nevertheless, it is worth trying: https://towardsdatascience.com/do-you-use-apply-in-pandas-there-is-a-600x-faster-way-d2497facfa66
import statsmodels.stats.api as sms
import swifter
df.swifter.apply(lambda x: sms.DescrStatsW(x).tconfint_mean(), axis=1)
Edit: if you want to round your results and maybe get two columns instead of one with tuples, you can use:
def get_conf_interv(x):
    res1, res2 = sms.DescrStatsW(x).tconfint_mean()
    return round(res1, 2), round(res2, 2)
df[['res1', 'res2']] = df.swifter.apply(get_conf_interv, axis=1, result_type='expand')
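For really large dataframes it may be worth skipping apply altogether. The following is only a minimal sketch of the same t-interval computed column-wise, assuming every row has the same number of non-missing values; the names ci_low and ci_high are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({'nums_1': [1, 2, 3], 'nums_2': [1, 1, 5], 'nums_3': [8, 7, 9]})

n = df.shape[1]                      # observations per row (assumes no NaNs)
row_mean = df.mean(axis=1)
row_sem = df.sem(axis=1)             # standard error of the mean, ddof=1 like scipy.stats.sem
t_crit = stats.t.ppf(0.975, n - 1)   # two-sided 95% critical value of the t distribution

ci = pd.DataFrame({'ci_low': row_mean - t_crit * row_sem,
                   'ci_high': row_mean + t_crit * row_sem})
print(ci.round(2))
```

Because the mean, the standard error and the critical value are each computed once for the whole frame, no Python function is called per row.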
I have an application that uses a Pandas dataframe to calculate the min/max value of each column. For example:
col_a col_b col_c
2 8 7
10 4 3
6 5 1
calling df.max() produces
col_a 10
col_b 8
col_c 7
Just for reference, I'm trying to convert the following code:
bin_stats = {'min': df.min(),
             'max': df.max(),
             'binwidth': (df.max()-df.min()+10**-6)/bincount}
# Transform data into bin positions for fast binning
data = ((df - bin_stats['min'])/bin_stats['binwidth']).apply(np.floor)
I'm converting my functionality to Vaex and I need to print out the maximum value of every column in my dataframe, like above. I have tried df.max(column_names) but I get the error:
ValueError: Could not find a class (AggMax_object), seems object is not supported.
How do I get an array of max values?
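In vaex you can use df.max(), but you need to pass an expression or a list of expressions for which you want to get the maximum value. The error message suggests that one of the columns you passed has dtype object, so restrict the list to numeric columns.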
Consider this example:
import vaex
df = vaex.example()
columns = df.get_column_names(dtype='numeric')
df.max(columns)
# returns array([ 3.2000000e+01,  1.3049751e+02,  6.0022778e+01,  5.4506802e+01,
#                 6.3641956e+02,  5.7964453e+02,  5.3974872e+02,  3.5941863e+04,
#                 3.7393040e+03,  1.7840929e+03, -3.0200911e-01], dtype=float32)
Note that vaex has a df.minmax() method that can get you the min and max values in a single pass over the data (i.e. faster if your data is large).
float_columns = df.get_column_names(dtype='float')
df.minmax(float_columns)
Having said all of this, vaex excels at binning stuff, so it might be worth looking into how to achieve what you want in a "vaex-native" way, instead of straight up translating pandas code into vaex. It should work, but you might not get optimal performance.
I’ve been struggling the past week trying to use apply to run functions over an entire pandas dataframe, including rolling windows, groupby, and especially multiple input columns and multiple output columns. I found a large number of questions on SO about this topic and many old & outdated answers. So I started to create a notebook for every possible combination of x inputs & outputs, rolling, rolling & groupby combined, and I focused on performance as well. Since I’m not the only one struggling with these questions I thought I’d provide my solutions here with working examples, hoping it helps any existing/future pandas-users.
Important notes
The combination of apply & rolling in pandas has a very strong output requirement. You have to return one single value. You can not return a pd.Series, not a list, not an array, not secretly an array within an array, but just one value, e.g. one integer. This requirement makes it hard to get a working solution when trying to return multiple outputs for multiple columns. I don’t understand why it has this requirement for 'apply & rolling', because without rolling 'apply' doesn’t have this requirement. Must be due to some internal pandas functions.
The combination of 'apply & rolling' combined with multiple input columns simply does not work! Imagine a dataframe with 2 columns, 6 rows and you want to apply a custom function with a rolling window of 2. Your function should get an input array with 2x2 values - 2 values of each column for 2 rows. But it seems pandas can’t handle rolling and multiple input columns at the same time. I tried to use the axis parameter to get it working but:
Axis = 0, will call your function per column. In the dataframe described above, it will call your function 10 times (not 12 because rolling=2) and since it’s per column, it only provides the 2 rolling values of that column…
Axis = 1, will call your function per row. This is what you probably want, but pandas will not provide a 2x2 input. It actually completely ignores the rolling and only provides one row with values of 2 columns...
When using 'apply' with multiple input columns, you can provide a parameter called raw (boolean). It’s False by default, which means the input will be a pd.Series and thus includes indexes next to the values. If you don’t need the indexes you can set raw to True to get a Numpy array, which often achieves a much better performance.
When combining 'rolling & groupby', it returns a multi-indexed Series which can’t easily serve as an input for a new column. The easiest solution is to append a reset_index(drop=True) as answered & commented here (Python - rolling functions for GroupBy object).
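To make the note about the raw parameter a bit more concrete, here is a small sketch (the helper record_type and the list seen are made up for illustration) that records what type of object a rolling apply hands to your function in both modes:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
seen = []

def record_type(window):
    # Remember the type of the object pandas passes in for each window
    seen.append(type(window).__name__)
    return window.sum()

s.rolling(2).apply(record_type, raw=False)  # windows arrive as pd.Series (with index)
s.rolling(2).apply(record_type, raw=True)   # windows arrive as plain NumPy arrays

print(sorted(set(seen)))
```

With raw=False you can use x.index inside the function, which the workarounds further down rely on; with raw=True you only get the values.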
You might ask me, when would you ever want to use a rolling, groupby custom function with multiple outputs!? Answer: I recently had to do a Fourier transform with sliding windows (rolling) over a dataset of 5 million records (speed/performance is important) with different batches within the dataset (groupby). And I needed to save both the power & phase of the Fourier transform in different columns (multiple outputs). Most people probably only need some of the basic examples below, but I believe that especially in the Machine Learning/Data-science sectors the more complex examples can be useful.
Please let me know if you have even better, clearer or faster ways to perform any of the solutions below. I'll update my answer and we can all benefit!
Code examples
Let’s create a dataframe first that will be used in all the examples below, including a group-column for the groupby examples.
For the rolling window and multiple input/output columns I just use 2 in all code examples below, but obviously this could be any number > 1.
df = pd.DataFrame(np.random.randint(0,5,size=(6, 2)), columns=list('ab'))
df['group'] = [0, 0, 0, 1, 1, 1]
df = df[['group', 'a', 'b']]
It will look like this:
group a b
0 0 2 2
1 0 4 1
2 0 0 4
3 1 0 2
4 1 3 2
5 1 3 0
Input 1 column, output 1 column
Basic
def func_i1_o1(x):
    return x+1
df['c'] = df['b'].apply(func_i1_o1)
Rolling
def func_i1_o1_rolling(x):
    return (x[0] + x[1])
df['d'] = df['c'].rolling(2).apply(func_i1_o1_rolling, raw=True)
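For this particular function (adding up the two values in the window) pandas already has a built-in that avoids calling a Python function per window. A quick sketch on a made-up column 'c', just to compare both routes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c': [3, 2, 5, 3, 3, 1]})  # illustrative values

def func_i1_o1_rolling(x):
    return x[0] + x[1]

via_apply = df['c'].rolling(2).apply(func_i1_o1_rolling, raw=True)
via_builtin = df['c'].rolling(2).sum()   # built-in rolling aggregation, no Python callback

print(np.allclose(via_apply, via_builtin, equal_nan=True))
```

The custom apply is only needed once your function is something the built-in rolling methods (sum, mean, min, max, std, ...) can't express.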
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
df['e'] = df.groupby('group')['c'].rolling(2).apply(func_i1_o1_rolling, raw=True).reset_index(drop=True)
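One thing to be aware of: reset_index(drop=True) throws away the original row labels, so the assignment only lines up if the rows are already sorted by group (as in the example dataframe). If your groups are interleaved, a possible alternative is to drop only the group level of the MultiIndex, so pandas can align on the original index. A sketch with made-up interleaved data:

```python
import pandas as pd

# Groups are interleaved on purpose, unlike the example dataframe above
df = pd.DataFrame({'group': [0, 1, 0, 1, 0, 1],
                   'c': [1, 2, 3, 4, 5, 6]})

rolled = df.groupby('group')['c'].rolling(2).sum()   # index: (group, original row label)

# Drop only the group level; the original row labels remain and are used for alignment
df['e'] = rolled.reset_index(level=0, drop=True)
print(df)
```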
Input 2 columns, output 1 column
Basic
def func_i2_o1(x):
    return np.sum(x)
df['f'] = df[['b', 'c']].apply(func_i2_o1, axis=1, raw=True)
Rolling
As explained in point 2 in the notes above, there isn't a 'normal' solution for 2 inputs. The workaround below uses the 'raw=False' to ensure the input is a pd.Series, which means we also get the indexes next to the values. This enables us to get values from other columns at the correct indexes to be used.
def func_i2_o1_rolling(x):
    values_b = x
    values_c = df.loc[x.index, 'c'].to_numpy()
    return np.sum(values_b) + np.sum(values_c)
df['g'] = df['b'].rolling(2).apply(func_i2_o1_rolling, raw=False)
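If you'd rather not look up the second column through the index, another option outside of pandas' rolling is NumPy's sliding_window_view, which does give you the 2x2 block per window. This is just a sketch, assuming NumPy 1.20 or newer and made-up values for 'b' and 'c':

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({'b': [2, 1, 4, 2, 2, 0],
                   'c': [3, 2, 5, 3, 3, 1]})  # illustrative values
window = 2

# Shape (n_rows - window + 1, n_columns, window): one block per window position
blocks = sliding_window_view(df[['b', 'c']].to_numpy(), window, axis=0)

# Same computation as func_i2_o1_rolling: sum of all values of both columns in the window
sums = blocks.sum(axis=(1, 2))

# Pad the first window-1 rows with NaN, as rolling() would
df['g'] = np.concatenate([np.full(window - 1, np.nan), sums])
print(df)
```

Any function that works on a (columns x window) array can replace the sum here, either vectorised over the first axis or in a plain loop over blocks.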
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
df['h'] = df.groupby('group')['b'].rolling(2).apply(func_i2_o1_rolling, raw=False).reset_index(drop=True)
Input 1 column, output 2 columns
Basic
You could use a 'normal' solution by returning pd.Series:
def func_i1_o2(x):
    return pd.Series((x+1, x+2))
df[['i', 'j']] = df['b'].apply(func_i1_o2)
Or you could use the zip/tuple combination which is about 8 times faster!
def func_i1_o2_fast(x):
    return x+1, x+2
df['k'], df['l'] = zip(*df['b'].apply(func_i1_o2_fast))
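A variation on the tuple approach, in case you have more than a couple of output columns: build one small dataframe from the list of tuples and join it back. This is only a sketch with made-up values; I haven't timed it against the zip version.

```python
import pandas as pd

df = pd.DataFrame({'b': [2, 1, 4]})  # illustrative values

def func_i1_o2_fast(x):
    return x + 1, x + 2

pairs = df['b'].apply(func_i1_o2_fast)   # Series of tuples, one tuple per row

# One column per tuple element, aligned on the original index
out = pd.DataFrame(pairs.tolist(), index=df.index, columns=['k', 'l'])
df = df.join(out)
print(df)
```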
Rolling
As explained in point 1 in the notes above, we need a workaround if we want to return more than 1 value when using rolling & apply combined. I found 2 working solutions.
1
def func_i1_o2_rolling_solution1(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    # (.loc is used because .at is meant for a single cell, not a list of columns)
    df.loc[x.index[-1], ['m', 'n']] = output_1, output_2
    return 0
df['m'], df['n'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i1_o2_rolling_solution1, raw=False)
Pros: Everything is done within 1 function.
Cons: You have to create the columns first and it is slower since it doesn't use the raw input.
2
rolling_w = 2
nan_prefix = (rolling_w - 1) * [np.nan]
output_list_1 = nan_prefix.copy()
output_list_2 = nan_prefix.copy()
def func_i1_o2_rolling_solution2(x):
    output_list_1.append(np.max(x))
    output_list_2.append(np.min(x))
    return 0
df['b'].rolling(rolling_w).apply(func_i1_o2_rolling_solution2, raw=True)
df['o'] = output_list_1
df['p'] = output_list_2
Pros: It uses the raw input which makes it about twice as fast. And since it doesn't use indexes to set the output values the code looks a bit more clear (to me at least).
Cons: You have to create the nan-prefix yourself and it takes a bit more lines of code.
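Note that for the max/min used in these examples you wouldn't need either workaround, because each output can be computed with its own built-in rolling method. A small sketch with made-up values for 'b':

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'b': [2, 1, 4, 2, 2, 0]})  # illustrative values

roll = df['b'].rolling(2)
df['o'] = roll.max()   # first output column
df['p'] = roll.min()   # second output column
print(df)
```

The workarounds above are meant for functions where the outputs come out of one expensive computation that you don't want to run twice (such as the Fourier transform mentioned in the notes).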
Rolling & Groupby
Normally, I would use the faster 2nd solution above. However, since we're combining groups and rolling, you'd have to manually set NaNs/zeros (depending on the number of groups) at the right indexes somewhere in the middle of the dataset. To me it seems that when combining rolling, groupby and multiple output columns, the first solution is easier because it handles the NaNs and the grouping automatically. Once again, I use the reset_index solution at the end.
def func_i1_o2_rolling_groupby(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    # (.loc is used because .at is meant for a single cell, not a list of columns)
    df.loc[x.index[-1], ['q', 'r']] = output_1, output_2
    return 0
df['q'], df['r'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i1_o2_rolling_groupby, raw=False).reset_index(drop=True)
Input 2 columns, output 2 columns
Basic
I suggest using the same 'fast' way as for i1_o2 with the only difference that you get 2 input values to use.
def func_i2_o2(x):
    return np.mean(x), np.median(x)
df['s'], df['t'] = zip(*df[['b', 'c']].apply(func_i2_o2, axis=1))
Rolling
As I use a workaround for applying rolling with multiple inputs and I use another workaround for rolling with multiple outputs, you can guess I need to combine them for this one.
1. Get values from other columns using indexes (see func_i2_o1_rolling)
2. Set the final multiple outputs on the correct index (see func_i1_o2_rolling_solution1)
def func_i2_o2_rolling(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])
    # Last index is where to place the final values: x.index[-1]
    # (.loc is used because .at is meant for a single cell, not a list of columns)
    df.loc[x.index[-1], ['u', 'v']] = output_1, output_2
    return 0
df['u'], df['v'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i2_o2_rolling, raw=False)
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
def func_i2_o2_rolling_groupby(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])
    # Last index is where to place the final values: x.index[-1]
    # (.loc is used because .at is meant for a single cell, not a list of columns)
    df.loc[x.index[-1], ['w', 'x']] = output_1, output_2
    return 0
df['w'], df['x'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i2_o2_rolling_groupby, raw=False).reset_index(drop=True)
I am currently using pandas groupby and transform to calculate something for each group (once) and then assign the result to each row of the group.
If the result of calculations is scalar it can be obtained like:
df['some_col'] = df.groupby('id')['some_col'].transform(lambda x:process(x))
The problem is that the result of my calculations is vector, and pd tries to make element-wise assignment of result vector to the group (quote from pandas docs):
The transform function must:
Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, grouped.transform(lambda x: x.iloc[-1])).
I could hardcode an external function that creates a group-sized list containing copies of the result (I'm currently on Python 3.6, so it's not possible to use assignment inside a lambda):
def return_group(x):
    result = process(x)
    return [result for item in x]
But I think that it's possible to solve this somehow "smarter". Remember that it's necessary to make calculations only once for each group.
Is it possible to force transform to treat an array-like result of the lambda function like a scalar (just copy it n times)?
I would be grateful for any advice.
P.S. I understand that it's possible to use a combination of apply and join to solve the original requirement, but a solution with transform has higher priority in my case.
Sometimes transform is a pain to work with. If that's not a problem for you, I'd suggest using groupby plus a left pd.merge, as in this example:
import pandas as pd
df = pd.DataFrame({"id":[1,1,2,2,2],
"col":[1,2,3,4,5]})
# this returns a list for every group
grp = df.groupby("id")["col"]\
        .apply(lambda x: list(x))\
        .reset_index(name="out")
# Then you merge it back onto the original df
df = pd.merge(df, grp, how="left")
And print(df) returns
id col out
0 1 1 [1, 2]
1 1 2 [1, 2]
2 2 3 [3, 4, 5]
3 2 4 [3, 4, 5]
4 2 5 [3, 4, 5]
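If you want to avoid the merge, a similar effect can probably be had by computing the per-group result once and mapping it back through the id column. A minimal sketch on the same toy data (per_group is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2, 2],
                   "col": [1, 2, 3, 4, 5]})

# One result per group, computed once; the index of per_group is the id
per_group = df.groupby("id")["col"].apply(list)

# Look up each row's id in per_group and copy the (vector) result to that row
df["out"] = df["id"].map(per_group)
print(df)
```

Note that all rows of a group end up referencing the same list object, so mutate with care.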
I have a dataframe ("ndate") with dates as rows and the dollar investment in each stock on that day as columns. I also have a Series ("portT") containing the total investment across all stocks for each date (series size: len(ndate)*1). Here is the code that calculates the weight of each stock on each date by dividing every element of each row of ndate by the total for that day:
(l,w)=port1.shape
for i in range(0,l):
    port1.iloc[i]=np.divide(ndate.iloc[i],portT.iloc[i])
The code works very slowly, could you please let me know how I can modify and speed it up? I tried to do this by vectorising, but did not succeed.
As this is just a simple division of two dataframes of the same shape (or you can formulate it as such), you can use the simple / operator; pandas will execute it element-wise (possibly with replication if shapes don't match, so be sure about that):
import pandas as pd
df1 = pd.DataFrame([[1,2], [3,4]])
df2 = pd.DataFrame([[2,2], [3,3]])
df_new = df1 / df2
#>>> pd.DataFrame([[0.5, 1.],[1., 1.3]])
This is most likely doing the same operations internally that you have specified in your example; however, internal assignments and checks are bypassed, which should give you some speed.
EDIT:
I was mistaken on the outline of your problem; maybe include a minimal self-contained code example next time. Still the /-operator also works for Dataframes and Series in combination:
import pandas as pd
df = pd.DataFrame([[1,2], [3,4]])
s = pd.Series([1,2])
new_df = df / s
#>>> pd.DataFrame([[1., 1.], [3., 2.]])  # the index of s is matched against the columns of df
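For the original question (dividing every row by that day's total), the Series has to be matched against the row index rather than the columns, which DataFrame.div does with axis=0. A minimal sketch, assuming ndate and portT share the same date index; the tickers and amounts are made up:

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2020-01-01', '2020-01-02'])
ndate = pd.DataFrame({'AAA': [10.0, 30.0],
                      'BBB': [30.0, 10.0],
                      'CCC': [60.0, 60.0]}, index=dates)  # illustrative data

portT = ndate.sum(axis=1)          # total investment per date

# axis=0 aligns portT with the row index, so each row is divided by its own total
port1 = ndate.div(portT, axis=0)
print(port1)
```

This replaces the explicit loop over rows with a single vectorised call.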
I have the following code and would like to create a new column, per Transaction Number and Description, that holds the 99th percentile of each row.
I am really struggling to achieve this; it seems that most posts cover calculating the percentile of a column.
Is there a way to achieve this? I would expect a new column to be created with two rows.
df_baseScenario = pd.DataFrame({'Transaction Number' : [1,10],
'Description' :['asf','def'],
'Calc_PV_CF_2479.0':[4418494.085,-3706270.679],
'Calc_PV_CF_2480.0':[4415476.321,-3688327.494],
'Calc_PV_CF_2481.0':[4421698.198,-3712887.034],
'Calc_PV_CF_2482.0':[4420541.944,-3706402.147],
'Calc_PV_CF_2483.0':[4396063.863,-3717554.946],
'Calc_PV_CF_2484.0':[4397897.082,-3695272.043],
'Calc_PV_CF_2485.0':[4394773.762,-3724893.702],
'Calc_PV_CF_2486.0':[4384868.476,-3741759.048],
'Calc_PV_CF_2487.0':[4379614.337,-3717010.873],
'Calc_PV_CF_2488.0':[4389307.584,-3754514.639],
'Calc_PV_CF_2489.0':[4400699.929,-3741759.048],
'Calc_PV_CF_2490.0':[4379651.262,-3714723.435]})
The following should work:
df['99th_percentile'] = df[cols].apply(lambda x: numpy.percentile(x, 99), axis=1)
I'm assuming here that the variable 'cols' contains a list of the columns you want to include in the percentile (You obviously can't use the Description in your calculation, for example).
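What this code does is loop over the rows of the dataframe and, for each row, call numpy.percentile to get the 99th percentile. You'll need to import numpy.
If you need more speed, you can try numpy.vectorize, at the expense of readability (untested). Note that the NumPy documentation describes vectorize as a convenience wrapper around a loop, so the gain may be small. The signature argument makes it pass one whole row at a time instead of single values:
perc99 = numpy.vectorize(lambda x: numpy.percentile(x, 99), signature='(n)->()')
df['99th_percentile'] = perc99(df[cols].to_numpy())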
Slightly modified from @mxbi.
import numpy as np
df = df_baseScenario.drop(['Transaction Number','Description'], axis=1)
df_baseScenario['99th_percentile'] = df.apply(lambda x: np.percentile(x, 99), axis=1)
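Since numpy.percentile accepts an axis argument, the per-row loop can also be dropped entirely. A short sketch on a cut-down version of the dataframe (only three of the Calc_PV_CF columns, and selecting them by prefix is my assumption about how you'd pick the columns):

```python
import numpy as np
import pandas as pd

df_baseScenario = pd.DataFrame({'Transaction Number': [1, 10],
                                'Description': ['asf', 'def'],
                                'Calc_PV_CF_2479.0': [4418494.085, -3706270.679],
                                'Calc_PV_CF_2480.0': [4415476.321, -3688327.494],
                                'Calc_PV_CF_2481.0': [4421698.198, -3712887.034]})

# Pick the numeric scenario columns by their common prefix
value_cols = df_baseScenario.filter(like='Calc_PV_CF_').columns

# axis=1 computes the percentile across the columns of every row in one call
df_baseScenario['99th_percentile'] = np.percentile(
    df_baseScenario[value_cols].to_numpy(), 99, axis=1)
print(df_baseScenario[['Transaction Number', '99th_percentile']])
```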