Numpy Lambda Function Not Working as Expected - python

market['AAPL'] is a DataFrame column with Apple's daily stock returns.
I noticed that
market['AAPL'].apply(lambda x: np.exp(x))
market['AAPL'].apply(lambda x: np.cumprod(np.exp(x)))
both give the same result.
Why is np.cumprod not working?

You probably mean to apply the cumulative product across the AAPL column. Your current attempt doesn't work because Series.apply calls the function once per element. As a result, np.cumprod is called each time with a single number, not with the whole column of numbers.
Instead, try something like this:
import pandas as pd
import numpy as np
aapl = {"AAPL": np.linspace(1, 2, 10)}
df = pd.DataFrame(aapl)
# Calculate exp for the column, then take
# the cumulative product over the column
df['cum-AAPL'] = np.exp(df['AAPL']).cumprod()

Because x is a single number, np.exp(x) is a single number, and the cumulative product of one number is just that number.
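A minimal sketch that makes the difference visible (using a small hand-made Series rather than the asker's market data):

```python
import numpy as np
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3])

# Per element: each call sees one number, so nothing accumulates
per_element = s.apply(lambda x: np.cumprod(np.exp(x))[0])

# Whole column: exp first, then a running product down the column
whole_column = np.exp(s).cumprod()

print(per_element.round(4).tolist())   # [1.1052, 1.2214, 1.3499] -- just exp
print(whole_column.round(4).tolist())  # [1.1052, 1.3499, 1.8221]
```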

Related

DataFrame Pandas for range

I have a problem with a DataFrame computed row by row.
In the first row I calculate and add the data;
each subsequent row depends on the previous one.
So the first formula is "different", and the rest are repeated.
I did this in a DataFrame and it works, but very slowly.
All the other data is already in the DataFrame.
import pandas as pd
import numpy as np
calc = pd.DataFrame(np.random.binomial(n=10, p=0.2, size=(5,1)))
calc['op_ol'] = calc[0]
calc['op_ol'][0] = calc[0][0]
for ee in range(1, 5):
    calc['op_ol'][ee] = 0 if calc['op_ol'][ee-1] == 0 else calc[0][ee-1] * calc['op_ol'][ee-1]
How could I speed this up?
Loops over pandas objects are generally slow. I suggest these two lines instead:
calc = pd.DataFrame(np.random.binomial(n=10, p=0.2, size=(5,1)))
calc['op_ol'] = (calc[0].cumprod() * calc[0][0]).shift(fill_value=calc[0][0])
Here cumprod gives the running product (scaled by the first value), shifted down one row, with the first value used as the fill.
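A quick check (my own sketch, not part of the original answer) that the one-liner reproduces the loop, with a fixed seed for repeatability:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
calc = pd.DataFrame(np.random.binomial(n=10, p=0.2, size=(5, 1)))

# the loop from the question, rebuilt on a plain list
loop = [calc[0][0]]
for ee in range(1, 5):
    loop.append(0 if loop[-1] == 0 else calc[0][ee - 1] * loop[-1])

# the vectorised version
vec = (calc[0].cumprod() * calc[0][0]).shift(fill_value=calc[0][0])

print(loop == vec.tolist())  # True
```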

Taking first value in a rolling window that is not numeric

This question follows one I previously asked here, which was answered for numeric values.
I now raise this second one for data of Period type.
While the example below looks simple, my actual windows are of variable size. Since I am interested in the first row of each window, I am looking for a technique that makes use of this definition.
import pandas as pd
from random import seed, randint
# DataFrame
pi1h = pd.period_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in pi1h]
df = pd.DataFrame({'Values' : values, 'Period' : pi1h}, index=pi1h)
# This works (numeric type)
df['first'] = df['Values'].rolling(3).agg(lambda rows: rows[0])
# This doesn't (Period type)
df['OpeningPeriod'] = df['Period'].rolling(3).agg(lambda rows: rows[0])
Result of 2nd command
DataError: No numeric types to aggregate
Any idea? Thanks for any help!
The first row of a rolling window of size 3 is the row 2 positions above the current one - just use pd.Series.shift(2):
df['OpeningPeriod'] = df['Period'].shift(2)
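For instance, on the question's own data (a sketch, using seed(1) as in the question):

```python
import pandas as pd
from random import seed, randint

pi1h = pd.period_range(start='2020-01-01', end='2020-01-02', freq='1h')
seed(1)
df = pd.DataFrame({'Values': [randint(0, 10) for _ in pi1h],
                   'Period': pi1h}, index=pi1h)

df['OpeningPeriod'] = df['Period'].shift(2)

# the window ending at row 2 started at row 0, so row 2's opening period is row 0's
print(df['OpeningPeriod'].iloc[2] == df['Period'].iloc[0])  # True
print(df['OpeningPeriod'].iloc[:2].isna().all())            # True, no full window yet
```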
For a variable window size (for the sake of example, I take the Values column as the window size):
import numpy as np
x = np.arange(len(df)) - df['Values']
df['OpeningPeriod'] = np.where(x.ge(0), df.loc[df.index[x.tolist()], 'Period'], np.nan)
Another option: convert your period[H] to a float, do the rolling, then convert back:
# convert to float
df['Period1'] = df['Period'].dt.to_timestamp().values.astype(float)
# rolling and convert back to period
df['OpeningPeriod'] = pd.to_datetime(df['Period1'].rolling(3)
                                     .agg(lambda rows: rows[0])).dt.to_period('1h')
# drop the helper column
df = df.drop(columns='Period1')

Multiply two columns in a groupby statement in pandas

I have a simplified data frame called df
import pandas as pd
df = pd.DataFrame({'num': [1,1,2,2],
'price': [12,11,15,13],
'y': [7,7,9,9]})
I want to group by num, then multiply price and y, take the sum, and divide by the sum of y.
I've been trying to get started with this and have been having trouble:
df.groupby('letter').agg(['price']*['quantity'])
Prior to the groupby operation, you can add a temporary column to the DataFrame that holds your intermediate result (price * y). Then group by num, sum the values, and use eval to divide the sum of temp by the sum of y. Cast the result back to a DataFrame and name the new column whatever you'd like.
>>> (df
.assign(temp=df.eval('price * y'))
.groupby('num')
.sum()
.eval('temp / y')
.to_frame('result')
)
result
num
1 11.5
2 14.0
Basically you want to compute a weighted mean.
One way to do this is:
import numpy as np
# define a custom aggregation with the 'y' column as weights
weights = lambda x: np.average(x, weights=df.loc[x.index, 'y'])
# aggregate using this new function
df.groupby('num').agg({'price': weights})
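Both answers give the same numbers on the example frame; a quick sketch comparing them:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'num': [1, 1, 2, 2],
                   'price': [12, 11, 15, 13],
                   'y': [7, 7, 9, 9]})

# temp-column route: sum(price * y) / sum(y) per group
a = (df.assign(temp=df['price'] * df['y'])
       .groupby('num')[['temp', 'y']].sum()
       .pipe(lambda g: g['temp'] / g['y']))

# weighted-mean route via np.average
b = df.groupby('num').apply(lambda g: np.average(g['price'], weights=g['y']))

print(a.tolist())  # [11.5, 14.0]
print(b.tolist())  # [11.5, 14.0]
```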

Python Pandas Calculating Percentile per row

I have the following code and would like to create a new column per Transaction Number and Description that holds the 99th percentile of each row.
I am really struggling to achieve this - most posts cover calculating the percentile down a column.
Is there a way to achieve this? I would expect a new column to be created with two rows.
df_baseScenario = pd.DataFrame({
    'Transaction Number': [1, 10],
    'Description': ['asf', 'def'],
    'Calc_PV_CF_2479.0': [4418494.085, -3706270.679],
    'Calc_PV_CF_2480.0': [4415476.321, -3688327.494],
    'Calc_PV_CF_2481.0': [4421698.198, -3712887.034],
    'Calc_PV_CF_2482.0': [4420541.944, -3706402.147],
    'Calc_PV_CF_2483.0': [4396063.863, -3717554.946],
    'Calc_PV_CF_2484.0': [4397897.082, -3695272.043],
    'Calc_PV_CF_2485.0': [4394773.762, -3724893.702],
    'Calc_PV_CF_2486.0': [4384868.476, -3741759.048],
    'Calc_PV_CF_2487.0': [4379614.337, -3717010.873],
    'Calc_PV_CF_2488.0': [4389307.584, -3754514.639],
    'Calc_PV_CF_2489.0': [4400699.929, -3741759.048],
    'Calc_PV_CF_2490.0': [4379651.262, -3714723.435]})
The following should work:
df['99th_percentile'] = df[cols].apply(lambda x: np.percentile(x, 99), axis=1)
I'm assuming here that the variable cols contains a list of the columns you want to include in the percentile (you obviously can't use Description in your calculation, for example).
What this code does is loop over the rows of the DataFrame and compute np.percentile for each row. You'll need import numpy as np.
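As a sanity check on a tiny frame with easy numbers (my own example, not the asker's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 10.0], 'b': [2.0, 20.0], 'c': [3.0, 30.0]})
cols = ['a', 'b', 'c']

row_perc = df[cols].apply(lambda x: np.percentile(x, 99), axis=1)
# linear interpolation: 99th percentile of [1, 2, 3] is 2 + 0.98 * 1
print([round(v, 2) for v in row_perc])  # [2.98, 29.8]
```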
If you need maximum speed, you can skip apply entirely and let NumPy compute all rows in one call (note that np.vectorize would not help here, since it applies the function element by element rather than row by row):
df['99th_percentile'] = np.percentile(df[cols].values, 99, axis=1)
Slightly modified from @mxbi's answer:
import numpy as np
df = df_baseScenario.drop(['Transaction Number','Description'], axis=1)
df_baseScenario['99th_percentile'] = df.apply(lambda x: np.percentile(x, 99), axis=1)

Separating pandas dataframe by offset string

Let's say I have a pandas.DataFrame that has hourly data for 3 days:
import pandas as pd
import numpy as np
import datetime as dt
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24,2),index=dates,columns=list('AB'))
I would like to take every, let's say, 6 hours of data and independently fit a curve to it. Since pandas' resample function has a how keyword that is supposed to accept any numpy array function, I thought I could use resample with polyfit, but apparently there is no way (right?).
So the only alternative I could think of is separating df into a sequence of DataFrames. I am trying to create a function that would work such as
l = splitDF(df, '6H')
and would return a list of DataFrames, each one with 6 hours of data (except maybe the first and last ones). So far I have nothing that works except something like the following manual method:
def splitDF(data, rule):
    res_index = data.resample(rule).index
    out = []
    cont = 0
    for date in data.index:
        ... check for date in res_index ...
        ... and start cutting at those points ...
But this method would be extremely slow and there is probably a faster way to do it. Is there a fast (maybe even pythonic) way of doing this?
Thank you!
EDIT
A better method (that needs some improvement but it's faster) would be the following:
def splitDF(data, rule):
    res_index = data.resample(rule).index
    out = []
    pdate = res_index[0]
    for date in res_index:
        out.append(data[pdate:date][:-1])
        pdate = date
    out.append(data[pdate:])
    return out
But still seems to me that there should be a better method.
Ok, so this sounds like a textbook case for using groupby. Here's my thinking:
import pandas as pd
import numpy as np

# let's define a function that'll group a datetime-indexed dataframe by hour-interval/date
def create_date_hour_groups(df, hr):
    new_df = df.copy()
    hr_int = int(hr)
    # integer division so each hr-hour block gets a single group label
    new_df['hr_group'] = new_df.index.hour // hr_int
    new_df['dt_group'] = new_df.index.date
    return new_df

# now we define a wrapper for polyfit to pass to groupby.apply
def polyfit_x_y(df, x_col='A', y_col='B', poly_deg=3):
    df_new = df.copy()
    coef_array = np.polyfit(df_new[x_col], df_new[y_col], poly_deg)
    poly_func = np.poly1d(coef_array)
    df_new['poly_fit'] = poly_func(df[x_col])
    return df_new

# to the actual stuff
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24, 2), index=dates, columns=list('AB'))
df = create_date_hour_groups(df, 6)
df_fit = df.groupby(['dt_group', 'hr_group'],
                    as_index=False).apply(polyfit_x_y)
How about np.array_split? (Note the integer division, needed in Python 3.)
np.array_split(df, len(df) // 6)
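A sketch of what that returns on the question's 72-hour frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3 * 24, freq='h')
df = pd.DataFrame(np.random.randn(3 * 24, 2), index=dates, columns=list('AB'))

chunks = np.array_split(df, len(df) // 6)  # 72 rows -> 12 chunks
print(len(chunks), len(chunks[0]))  # 12 6
```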
