I was using pandas for some analysis and ran into a scenario in which I had to run two functions for different groups in my data, so I decided to use pandas' .agg function.
Below are my 2 functions:
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

def pick_pacf(df,alpha=0.05,nlags=192):
    '''
    This function returns the lags in the time series that are highly correlated with the original time series
    Input
    1. df: pandas series, this is the column for which we are trying to find AR lag
    2. alpha: float, significance level used for the confidence interval
    3. nlags: int, the no. of lags to be tested
    Return
    1. lags: list, this contains the list of all the lags (# of timestamps) that are highly correlated
    '''
    values,conf_int = pacf(df.values,alpha=alpha,nlags=nlags)
    lags = []
    #in the pacf function, confidence interval is centered around pacf values
    #we need them to be centered around 0, this will produce the intervals we see in the graph
    conf_int_cntrd = [value[0] - value[1] for value in zip(conf_int,values)]
    for obs_index, obs in enumerate(zip(conf_int_cntrd,values)):
        if obs[1] >= obs[0][1]: #obs[0][1] contains the high value of the conf int
            lags.append(obs_index)
        elif obs[1] <= obs[0][0]: #obs[0][0] contains the low value of the conf_int
            lags.append(obs_index)
    lags.remove(0) #removing the 0 lag for auto-corr with itself
    return lags
def pick_acf(df,nlags=192):
    '''
    This function returns the ACF cut-off lags for an MA model of a time series
    Input
    1. df: pandas series, this is the series for which we want to find ACF value
    2. nlags: int, the number of lags to be taken into consideration for ACF
    Returns
    1. q: numpy array, the lag values at which the ACF cuts off
    '''
    acf_values = acf(df.values,nlags=nlags)
    acf_values = np.round(acf_values,1)
    q = np.where(acf_values == 0)[0]
    return q
There is no need to go through the functions line by line (you can if you want to); the main thing to focus on here is what the two functions return. pick_pacf returns a list, whereas pick_acf returns a numpy array.
The calls to these functions are like this:
pacf_values = train_ads[['INVERTER_ID','PER_TS_YIELD']].groupby('INVERTER_ID').agg(lambda x: pick_pacf(x))
acf_values = train_ads[['INVERTER_ID','PER_TS_YIELD']].groupby('INVERTER_ID').agg(lambda x: pick_acf(x))
PER_TS_YIELD is a numeric column and INVERTER_ID is an alphanumeric column.
The puzzling behaviour is that when I call the pick_pacf function, only the pandas series of the PER_TS_YIELD column is sent as input to the function for each INVERTER_ID.
Whereas, when I call the pick_acf function, first the pandas series of the PER_TS_YIELD column is sent and then the whole data frame made up of the INVERTER_ID and PER_TS_YIELD columns is sent to the function. This leads to an error, as I am doing some calculations that fail when an alphanumeric column is received.
Why is this happening? Does the behaviour of the .agg function depend on what is returned from the user-defined function?
Can someone please explain this to me? Thanks in advance.
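One way to sidestep the question of what .agg hands to the function, offered only as a sketch: select the numeric column before applying, so that the groupby object is a SeriesGroupBy and each call receives a Series. The data frame and the zero_positions helper below are made-up stand-ins for train_ads and pick_acf, not the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_ads (values invented for illustration)
train_ads = pd.DataFrame({
    "INVERTER_ID": ["A1", "A1", "A1", "B2", "B2", "B2"],
    "PER_TS_YIELD": [0.0, 1.5, 0.0, 2.0, 0.0, 3.0],
})

seen_types = []  # records what kind of object each call receives

def zero_positions(series):
    # Hypothetical stand-in for pick_acf: positions where the value is 0
    seen_types.append(type(series).__name__)
    return np.where(series.values == 0)[0].tolist()

# Selecting the column first means only PER_TS_YIELD is ever passed in
result = train_ads.groupby("INVERTER_ID")["PER_TS_YIELD"].apply(zero_positions)
print(result)
print(set(seen_types))
```

Because INVERTER_ID is only used as the grouping key here, the alphanumeric column never reaches the function.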
(new to python so I apologize if this question is basic)
Say I create a function that will calculate some equation
def plot_ev(accuracy,tranChance,numChoices,reward):
    ev=(reward-numChoices)*(1-np.power((1-accuracy),numChoices)*tranChance)
    return ev
accuracy, tranChance, and numChoices are each float arrays
e.g.
accuracy=np.array([.6,.7,.8])
tranChance=np.array([.6,.7,8])
numChoices=np.array([2,.3,4])
How would I run and plot plot_ev over my 3 arrays so that I end up with an output that has all combinations of elements (ideally without running 3 for loops)?
Ideally I would have a single plot showing the output of all combinations (1st element from accuracy with all elements from tranChance and numChoices, 2nd element from accuracy with all elements from tranChance and numChoices, and so on).
Thanks in advance!
Use numpy.meshgrid to make an array of all the combinations of values of the three variables.
products = np.array(np.meshgrid(accuracy, tranChance, numChoices)).T.reshape(-1, 3)
Then transpose this again and extract three longer arrays with the values of the three variables in every combination:
accuracy_, tranChance_, numChoices_ = products.T
Your function contains only operations that can be carried out on numpy arrays, so you can then simply feed these arrays as parameters into the function:
reward = ?? # you need to set the reward value
results = plot_ev(accuracy_, tranChance_, numChoices_, reward)
Alternatively consider using a pandas dataframe which will provide clearer labeling of the columns.
import pandas as pd
df = pd.DataFrame(products, columns=["accuracy", "tranChance", "numChoices"])
df["ev"] = plot_ev(df["accuracy"], df["tranChance"], df["numChoices"], reward)
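As a variation on the same idea, here is a minimal sketch that keeps the combinations on a 3-D grid instead of flattening them. The expected_value function uses an assumed parenthesisation of the formula in the question, and the input values and reward are hypothetical.

```python
import numpy as np

def expected_value(accuracy, tran_chance, num_choices, reward):
    # Assumed reading of the formula from the question
    return (reward - num_choices) * (1 - np.power(1 - accuracy, num_choices) * tran_chance)

# Hypothetical inputs for illustration
accuracy = np.array([0.6, 0.7, 0.8])
tran_chance = np.array([0.6, 0.7, 0.8])
num_choices = np.array([2, 3, 4])
reward = 10.0  # assumed value, not given in the question

# indexing="ij" keeps the axes in the order (accuracy, tran_chance, num_choices)
acc_grid, tran_grid, num_grid = np.meshgrid(
    accuracy, tran_chance, num_choices, indexing="ij"
)

# Element [i, j, k] is the result for accuracy[i], tran_chance[j], num_choices[k]
ev_grid = expected_value(acc_grid, tran_grid, num_grid, reward)
print(ev_grid.shape)
```

Keeping the grid shape makes it easy to slice one variable at a time when plotting, for example ev_grid[0] for every combination that uses the first accuracy value.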
I have a huge dataframe with a lot of zero values, and I want to calculate the average of the numbers between the zeros. To keep it simple: the data shows, for example, 10 consecutive values, then zeros, then values again. I just want Python to calculate the average of each patch of non-zero data.
The pic shows an example
First of all, I'm a little confused about why you are using a DataFrame. This looks more like data for a pd.Series, and I would suggest storing purely numeric data in a NumPy array. Assuming that you have a pd.Series in front of you and you are trying to calculate the moving average between each pair of consecutive points, there are two approaches you can follow:
1. zero-padding after the last value
2. assuming circularity and taking the average between the last and the first value
Here is the code:
import numpy as np
import pandas as pd
data_series = pd.Series([0,0,0.76231, 0.77669,0,0,0,0,0,0,0,0,0.66772, 1.37964, 2.11833, 2.29178, 0,0,0,0,0])
np_array = np.array(data_series)
#assuming zero padding: append a 0 so the last value has a neighbour
np_array_zero_pad = np.hstack((np_array, 0))
mvavrg_zeropad = [np.mean([np_array_zero_pad[i], np_array_zero_pad[i+1]]) for i in range(len(np_array_zero_pad)-1)]
#assuming circularity: append the first value so the last value wraps around to it
np_array_circ_arr = np.hstack((np_array, np_array[0]))
mvavrg_circ = [np.mean([np_array_circ_arr[i], np_array_circ_arr[i+1]]) for i in range(len(np_array_circ_arr)-1)]
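If the goal is instead one average per patch of non-zero values, as the question describes, a possible sketch is to label each run and group by that label. This assumes the patches are separated by exact zeros; the names nonzero, run_id and patch_means are only illustrative.

```python
import pandas as pd

data_series = pd.Series([0, 0, 0.76231, 0.77669, 0, 0, 0, 0, 0, 0, 0, 0,
                         0.66772, 1.37964, 2.11833, 2.29178, 0, 0, 0, 0, 0])

# True where the value belongs to a patch of data
nonzero = data_series != 0

# A new run starts wherever the zero/non-zero flag changes
run_id = nonzero.astype(int).diff().ne(0).cumsum()

# Average each non-zero run separately; the zero runs are dropped first
patch_means = data_series[nonzero].groupby(run_id[nonzero]).mean()
print(patch_means)
```

For this sample there are two patches, so patch_means has two entries, one per run of non-zero values.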
I have a dataframe with the total sales of around 500 product categories in each row, so there are 500 columns in my dataframe. I am trying to find the category most highly correlated with the columns of another dataframe.
I will use the Pearson correlation method for this.
But the total sales for all the categories are highly skewed, with skewness ranging from 10 to 40 across the category columns. So I want to transform this sales data using a Box-Cox transformation.
Since my sales data has 0 values as well, I want to use the boxcox1p function.
Can somebody help me calculate lambda for the boxcox1p function, since it is a mandatory parameter?
Also, is this the correct approach for finding highly correlated categories?
Assume df is your dataframe with many columns containing numeric values, and the lambda parameter of the Box-Cox transformation equals 0.25; then:
from scipy.special import boxcox1p
df_boxcox = df.apply(lambda x: boxcox1p(x,0.25))
Now the transformed values are in df_boxcox.
Unfortunately there is no built-in method to find the lambda for boxcox1p, but we can use PowerTransformer from sklearn.preprocessing instead:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
Note that method 'yeo-johnson' is used because it works with both positive and negative values. Method 'box-cox' would raise an error: ValueError: The Box-Cox transformation can only be applied to strictly positive data.
data = pd.DataFrame({'x':[-2,-1,0,1,2,3,4,5]}) #just sample data to explain
pt.fit(data)
print(pt.lambdas_)
[0.89691707]
Then apply the calculated lambda:
print(pt.transform(data))
Result:
[[-1.60758267]
[-1.09524803]
[-0.60974999]
[-0.16141745]
[ 0.26331586]
[ 0.67341476]
[ 1.07296428]
[ 1.46430326]]
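Another possible route, sketched here rather than taken from the answer above: boxcox1p(x, lmbda) is the Box-Cox transform of 1 + x, so when all values are non-negative, scipy.stats.boxcox can estimate lambda on the shifted data and that lambda can then be passed to boxcox1p. The sales array below is invented for illustration.

```python
import numpy as np
from scipy import stats
from scipy.special import boxcox1p

# Hypothetical skewed, non-negative sales figures (zeros included)
sales = np.array([0.0, 0.0, 1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 40.0, 120.0, 500.0])

# stats.boxcox needs strictly positive input, so fit on 1 + sales;
# with no lambda given it returns the transformed data and the fitted lambda
shifted_transformed, fitted_lambda = stats.boxcox(sales + 1.0)

# Using the fitted lambda in boxcox1p on the original data gives the same values
via_boxcox1p = boxcox1p(sales, fitted_lambda)
print(fitted_lambda)
```

For a dataframe with many columns, the same two steps would have to be repeated per column, since each column gets its own lambda.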
I have a dataframe with time series.
I'd like to compute the rolling correlation (periods=20) between columns.
store_corr=[] #empty list to store the rolling correlation of each pair
names=[] #empty list to store the column names
df=df.pct_change(periods=1).dropna(axis=0) #Prepare dataframe of time series
for i in range(0,len(df.columns)):
    for j in range(i,len(df.columns)):
        corr = df[df.columns[i]].rolling(20).corr(df[df.columns[j]])
        names.append('col '+str(i)+' -col '+str(j))
        store_corr.append(corr)
df_corr=pd.DataFrame(np.transpose(np.array(store_corr)),columns=names)
This solution works and gives me the rolling correlation. It was written with the help of Austin Mackillop (comments).
Is there another faster way? (I.e. I want to avoid this double for loop.)
This line:
corr=df.rolling(20).corr(df[df.columns[i]],df[df.columns[j]])
will produce an error because the second argument of corr (pairwise) expects a bool, but you passed a pandas object, which has an ambiguous truth value. See the pandas documentation for Rolling.corr.
Does applying the rolling method to the first column, as in the second line of code that you provided, achieve what you are trying to do?
corr = df[df.columns[i]].rolling(20).corr(df[df.columns[j]])
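As for avoiding the double loop, one option to try is calling corr on the rolling DataFrame without an other argument, which produces the pairwise correlations in one call. This is a sketch on made-up data (the prices frame and column names are hypothetical), and whether it is actually faster for a given dataset would need to be timed.

```python
import numpy as np
import pandas as pd

# Hypothetical price data for illustration
rng = np.random.default_rng(0)
prices = pd.DataFrame(rng.random((60, 3)) + 1.0, columns=["a", "b", "c"])
returns = prices.pct_change(periods=1).dropna(axis=0)

# With no `other`, corr is computed for every pair of columns;
# the result has a MultiIndex of (row label, column name)
pairwise = returns.rolling(20).corr()

# Move the column-name level into the columns: one column per (col, col) pair
wide = pairwise.unstack()

# Same pair computed the loop way, for comparison
loop_ab = returns["a"].rolling(20).corr(returns["b"])
print(wide[("a", "b")].tail())
```

The wide frame contains every pair twice (a/b and b/a) plus each column with itself, so the redundant columns can be dropped afterwards if only the unique pairs are needed.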
I would like to know how I could iterate through each column of a dataframe to perform some calculations and store the results in another dataframe.
df_empty = []
m = daily.iloc[:,-1] #Columns= stocks & Rows= daily returns
stocks = daily.iloc[:,:-1]
for col in range (len(stocks.columns)):
    s = daily.iloc[:,col]
    covmat = np.cov(s,m)
    beta = covmat[0,1]/covmat[1,1]
    return (beta)
    print(beta)
In the above example, I first want to calculate a covariance matrix between "s" (the columns representing the stocks' daily returns, which I want to iterate through one by one) and "m" (the market daily return, which is my reference column and the last column of my dataframe). Then I want to calculate the beta for each stock/market covariance pair.
I'm not sure why return(beta) gives me a single numerical result for one stock while print(beta) prints the beta for all stocks.
I'd like to find a way to create a dataframe with all these betas.
beta_df = df_empty.append(beta)
I have tried the above code but it returns 'None', as if it could not append the outcome.
Thank you for your help.
The return statement within your for-loop ends the loop itself the first time the return is encountered. Moreover, you are not saving the beta value anywhere because the for-loop itself does not return a value in python (it only has side effects).
Apart from that, you may choose a more pandas-like approach using apply on the data frame which basically iterates over the columns of the data frame and passes each column to a supplied function as the first parameter while returning the result of the function call. Here is a minimal working example with some dummy data:
import pandas as pd
import numpy as np
# create some dummy data
daily = pd.DataFrame(np.random.randint(100, size=(100, 5)))
# define reference column
cov_column = daily.iloc[:, -1]
# setup computation function
def compute(column):
    covmat = np.cov(column, cov_column)
    return covmat[0,1]/covmat[1,1]
# use apply to iterate over columns
result = daily.iloc[:, :-1].apply(compute)
# show output
print(result)
0 -0.125382
1 0.024777
2 0.011324
3 -0.017622
dtype: float64
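If you would rather keep the explicit loop from the question, a minimal sketch is to store each beta as you go and build the dataframe at the end. Note that list.append modifies the list in place and returns None, which is why assigning its result to beta_df gave None. The data and column names below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns: three stocks plus the market in the last column
rng = np.random.default_rng(0)
daily = pd.DataFrame(rng.normal(size=(100, 4)),
                     columns=["s1", "s2", "s3", "market"])

market = daily.iloc[:, -1]
stocks = daily.iloc[:, :-1]

betas = {}  # collect one beta per stock instead of returning inside the loop
for name in stocks.columns:
    covmat = np.cov(stocks[name], market)
    betas[name] = covmat[0, 1] / covmat[1, 1]

# Build the result once the loop has finished
beta_df = pd.DataFrame({"beta": pd.Series(betas)})
print(beta_df)
```

The dictionary keys become the index of beta_df, so each beta stays labelled with its stock.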