Python Pandas Calculating Percentile per row - python

I have the following code and would like to create a new column per Transaction Number and Description that represents the 99th percentile of each row.
I am really struggling to achieve this - it seems that most posts cover calculating the percentile on the column.
Is there a way to achieve this? I would expect a new column to be create with two rows.
df_baseScenario = pd.DataFrame({'Transaction Number' : [1,10],
'Description' :['asf','def'],
'Calc_PV_CF_2479.0':[4418494.085,-3706270.679],
'Calc_PV_CF_2480.0':[4415476.321,-3688327.494],
'Calc_PV_CF_2481.0':[4421698.198,-3712887.034],
'Calc_PV_CF_2482.0':[4420541.944,-3706402.147],
'Calc_PV_CF_2483.0':[4396063.863,-3717554.946],
'Calc_PV_CF_2484.0':[4397897.082,-3695272.043],
'Calc_PV_CF_2485.0':[4394773.762,-3724893.702],
'Calc_PV_CF_2486.0':[4384868.476,-3741759.048],
'Calc_PV_CF_2487.0':[4379614.337,-3717010.873],
'Calc_PV_CF_2488.0':[4389307.584,-3754514.639],
'Calc_PV_CF_2489.0':[4400699.929,-3741759.048],
'Calc_PV_CF_2490.0':[4379651.262,-3714723.435]})

The following should work:
df['99th_percentile'] = df[cols].apply(lambda x: numpy.percentile(x, 99), axis=1)
I'm assuming here that the variable 'cols' contains a list of the columns you want to include in the percentile (You obviously can't use the Description in your calculation, for example).
What this code does is loops over rows in the dataframe, and for each row, computes the numpy.percentile to get the 99th percentile. You'll need to import numpy.
If you need maximum speed, then you can use numpy.vectorize to remove all loops at the expense of readability (untested):
perc99 = np.vectorize(lambda x: numpy.percentile(x, 99))
df['99th_percentile'] = perc99(df[cols].values)

Slightly modified from #mxbi.
import numpy as np
df = df_baseScenario.drop(['Transaction Number','Description'], axis=1)
df_baseScenario['99th_percentile'] = df.apply(lambda x: np.percentile(x, 99), axis=1)

Related

Numpy Lambda Function Not Working as Expected

market['AAPL'] is a dataframe with Apple's daily stock return
I noticed that:
market['AAPL'].apply(lambda x: np.exp(x))
market['AAPL'].apply(lambda x: np.cumprod(np.exp(x)))
Both give the same results
Why is the np.cumprod not working?
You probably mean to apply the cumulative product across the AAPL column. Your current attempt doesn't work, because .apply works per row. As a result, np.cumprod is called each time for a single number, not for an array of numbers.
Instead, try something like this:
import pandas as pd
import numpy as np
aapl = {"AAPL": np.linspace(1, 2, 10)}
df = pd.DataFrame(appl)
# Calculate exp for the column, then calculate
# the cumulative product over the column
df['cum-AAPL'] = np.exp(df['AAPL']).cumprod())
Because x is a number, it's np.exp is a number, and a product of one number is itself.

Calculate 'NEW COLUMN' from column 'A' based on condition from column 'B"

I have a DF roughly with columns: date, amount, currency.
There are several CURRENCY types.
I need to create a new column (USD) which will be a calculation of
( AMOUNT*EXCHANGE RATE ) based on CURRENCY type.
There are multiple EXCHANGE RATES to be applied.
I cant figure out the code/approach to do so.
Maybe df.where() should help but i keep getting errors.
Thank you
df['RUR'] = df.where(df['CUR']=='KES', df['AMOUNT']*3, axis=1)
or
df['RUR'] = df['AMOUNT'].apply(lambda x: x*2 if df['CUR']=='KES' else None)
use np.where
import numpy as np
df['RUR'] = np.where(df['CUR']=='KES',df['AMOUNT']*3,np.nan)
second sol
you can use.loc and apply condition in it.
df.loc[df['CUR']=='KES','RUR']=df['AMOUNT']*3

Speed-up Python loop running by rows and elements in each row

I have a dataframe containing dates as rows and columns as $investment in each stock on a particular day ("ndate"). Also, I have a Series ("portT") containing the sum of the total investments in all stocks each date (series size: len(ndate)*1). Here is the code that calculates the weight of each stock/each date by dividing each element of each row of ndate by sum of that day:
(l,w)=port1.shape
for i in range(0,l):
port1.iloc[i]=np.divide(ndate.iloc[i],portT.iloc[i])
The code works very slowly, could you please let me know how I can modify and speed it up? I tried to do this by vectorising, but did not succeed.
as this is justa simple divison of two dataframes of the same shape (or you can formulate it as such) you can use the simple /-operator, pandas will execute it element-wise (possibly with replication if shapes don't match, so be sure about that):
import pandas as pd
df1 = pd.DataFrame([[1,2], [3,4]])
df2 = pd.DataFrame([[2,2], [3,3]])
df_new = df1 / df2
#>>> pd.DataFrame([[0.5, 1.],[1., 1.3]])
this is most likely internally doing the same operations that you have specified in your example, however, internal assignments and checks are by-passed, which should give you some speed
EDIT:
I was mistaken on the outline of your problem; maybe include a minimal self-contained code example next time. Still the /-operator also works for Dataframes and Series in combination:
import pandas as pd
df = pd.DataFrame([[1,2], [3,4]])
s = pd.Series([1,2])
new_df = df / s
#>>> pd.DataFrame([[1., 3.],[1., 2]])

fastest way to generate column with random elements based on another column

I have a dataframe of ~20M lines
I have a column called A that gives me an id (there are ~10K ids in total).
The value of this id defines a random distribution's parameters.
Now I want to generate a column B, that is randomly drawn from the distribution that is defined by the value in the column A
What is the fastest way to do this? Doing something with iterrows or apply is extremely slow. Another possiblity is to group by A, and generate all my data for each value of A (so I only draw from one distribution). But then I don't end up with a Dataframe but with a "groupBy" object, and I don't know how to go back to having the initial dataframe, plus my new column.
I think this approach is similar to what you were describing, where you generate the samples for each id. On my machine, it appears this would take around 5 minutes to run. I assume you can trivially get the ids.
import numpy as np
num_ids = 10000
num_rows = 20000000
ids = np.arange(num_ids)
loc_params = np.random.random(num_ids)
A = np.random.randint(0, num_ids, num_rows)
B = np.zeros(A.shape)
for idx in ids:
A_idxs = A == idx
B[A_idxs] = np.random.normal(np.sum(A_idxs), loc_params[idx])
This question is pretty vague, but how would this work for you?
df['B'] = df.apply(lambda row: distribution(row.A), axis=1)
Editing from question edits (apply is too slow):
You could create a mapping dictionary for the 10k ids to their generated value, then do something like
df['B'] = df['A'].map(dictionary)
I'm unsure if this will be faster than apply, but it will require fewer calls to your random distribution generator

Pandas how to apply multiple functions to dataframe

Is there a way to apply a list of functions to each column in a DataFrame like the DataFrameGroupBy.agg function does? I found an ugly way to do it like this:
df=pd.DataFrame(dict(one=np.random.uniform(0,10,100), two=np.random.uniform(0,10,100)))
df.groupby(np.ones(len(df))).agg(['mean','std'])
one two
mean std mean std
1 4.802849 2.729528 5.487576 2.890371
For Pandas 0.20.0 or newer, use df.agg (thanks to ayhan for pointing this out):
In [11]: df.agg(['mean', 'std'])
Out[11]:
one two
mean 5.147471 4.964100
std 2.971106 2.753578
For older versions, you could use
In [61]: df.groupby(lambda idx: 0).agg(['mean','std'])
Out[61]:
one two
mean std mean std
0 5.147471 2.971106 4.9641 2.753578
Another way would be:
In [68]: pd.DataFrame({col: [getattr(df[col], func)() for func in ('mean', 'std')] for col in df}, index=('mean', 'std'))
Out[68]:
one two
mean 5.147471 4.964100
std 2.971106 2.753578
In the general case where you have arbitrary functions and column names, you could do this:
df.apply(lambda r: pd.Series({'mean': r.mean(), 'std': r.std()})).transpose()
mean std
one 5.366303 2.612738
two 4.858691 2.986567
I tried to apply three functions into a column and it works
#removing new line character
rem_newline = lambda x : re.sub('\n',' ',x).strip()
#character lower and removing spaces
lower_strip = lambda x : x.lower().strip()
df = df['users_name'].apply(lower_strip).apply(rem_newline).str.split('(',n=1,expand=True)
I am using pandas to analyze Chilean legislation drafts. In my dataframe, the list of authors are stored as a string. The answer above did not work for me (using pandas 0.20.3). So I used my own logic and came up with this:
df.authors.apply(eval).apply(len).sum()
Concatenated applies! A pipeline!! The first apply transforms
"['Barros Montero: Ramón', 'Bellolio Avaria: Jaime', 'Gahona Salazar: Sergio']"
into the obvious list, the second apply counts the number of lawmakers involved in the project. I want the size of every pair (lawmaker, project number) (so I can presize an array where I will study which parties work on what).
Interestingly, this works! Even more interestingly, that last call fails if one gets too ambitious and does this instead:
df.autores.apply(eval).apply(len).apply(sum)
with an error:
TypeError: 'int' object is not iterable
coming from deep within /site-packages/pandas/core/series.py in apply

Categories