Conditional rolling computation in pandas - python

I would like to compute a quantity called "downside beta".
Let's suppose I have a dataframe df:
df = pd.DataFrame({'A': [-0.1,0.3,-0.4, 0.8,-0.5],'B': [-0.2,0.5,0.3,-0.5,0.1]},index=[0, 1, 2, 3,4])
I would like to add a column 'C' that computes this downside beta, defined as the covariance between columns A and B using only the rows where column A is negative (together with the corresponding values of B), divided by the variance of column A over those same rows.
In the above example, it should be equivalent to computing the covariance between the two series [-0.1, -0.4, -0.5] and [-0.2, 0.3, 0.1], divided by the variance of the series [-0.1, -0.4, -0.5].
The next step would be to roll this metric over the index of a large initial dataframe df.
Is there an efficient way to do that, in a vectorized manner? I guess combining pd.rolling_cov and np.where?
Thank you!

Is this what you're looking for? You can keep only the rows where column A is negative and then call the pandas cov and var functions on that subset:
v = df[df.A.lt(0)]
v.cov() / v.A.var()
A B
A 1.000000 -0.961538
B -0.961538 1.461538
If you just want the single off-diagonal value,
np.diag(v.cov() / v.A.var(), k=-1)
array([-0.96153846])
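Since the quantity you want is just cov(A, B) / var(A) over the filtered rows, you can also compute the single number directly on v and skip the full matrix (same result as the off-diagonal entry above):
# covariance of A and B over the negative-A rows, divided by the variance of A there
v['A'].cov(v['B']) / v['A'].var()
# -0.961538...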
For a rolling window, you may need to jump through a few hoops, but it is doable:
v = df[df.A.lt(0)]
i = v.rolling(3).cov().A.groupby(level=0).last()
j = v.rolling(3).A.var()
i / j
0 NaN
2 NaN
4 -0.961538
Name: A, dtype: float64
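If the metric has to roll over the index of the original df (fixed-length windows, filtering the negative-A rows inside each window), one non-vectorized sketch is a hand-written rolling apply; the function name downside_beta and the window size of 3 are illustrative choices, not taken from the answer above:
import numpy as np
import pandas as pd

def downside_beta(win):
    # keep only the rows of the window where A is negative
    neg = win[win['A'] < 0]
    if len(neg) < 2:  # need at least two points for a covariance/variance
        return np.nan
    return neg['A'].cov(neg['B']) / neg['A'].var()

window = 3
df['C'] = [downside_beta(df.iloc[max(0, i - window + 1):i + 1])
           for i in range(len(df))]
This trades speed for rolling over the original index exactly as described; the groupby trick above will be faster on large frames.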

Related

Trying to use a lambda function to add a new column with cumulative increase of the values in a DataFrame

Say I have a simple one column DataFrame:
df = pd.DataFrame([0.01,0.02,0.03,0.04,0.05,0.06,0.07])
I am trying to multiply the first two numbers and then multiply the result of that by the third number and then multiply the result of that by the fourth number, so on down the column.
For example:
df['chainlink'] = df.apply(lambda x: (1+x[0])*(1+x[1]))
This obviously creates a new column with the value 1.0302 in the first row and then NaNs after. What I am then trying to do is (1.0302)(1 + 0.03) = 1.0611, then (1.0611)(1 + 0.04) = 1.1036, and so on.
The new column should be a sort of cumulative increase of the values.
Use cumprod:
df['new'] = df[0].add(1).cumprod()
0 1.010000
1 1.030200
2 1.061106
3 1.103550
4 1.158728
5 1.228251
6 1.314229
Name: 0, dtype: float64
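For comparison, here is the same chain written out as an explicit loop, just to confirm it matches the cumprod result (a verification sketch, not part of the answer):
import pandas as pd

df = pd.DataFrame([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07])

chain = []
running = 1.0
for r in df[0]:
    running *= 1 + r      # multiply the running product by (1 + value)
    chain.append(running)
df['new_loop'] = chain    # identical to df[0].add(1).cumprod()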

Pandas: Calculating a Z-score to avoid "look ahead" bias

I have time series data in dataframe named "df", and, my code for calculating the z-score is given below:
mean = df.mean()
standard_dev = df.std()
z_score = (df - mean) / standard_dev
I would like to calculate the z-score for each observation using the respective observation and data that was known at the point of recording the observation. i.e. I do not want to use a standard deviation and mean that incorporates data that occurs after a specific point in time. I just want to use data from time t, t-1, t-2....
How do I do this?
Use .expanding(), with col being the column you want to compute your statistics for (drop [col] if you wish to compute it for the whole dataframe).
You might need to sort by the time column first, denoted time_col, in case it is not sorted already:
df=df.sort_values("time_col", axis=0)
Then:
df[col].sub(df[col].expanding().mean()).div(df[col].expanding().std())
Ref:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.expanding.html
For the sample data:
import pandas as pd
df=pd.DataFrame({"a": list("xyzpqrstuv"), "b": [6,5,7,1,-9,0,3,5,2,8]})
df["c"]=df["b"].sub(df["b"].expanding().mean()).div(df["b"].expanding().std())
Outputs:
a b c
0 x 6 NaN
1 y 5 -0.707107
2 z 7 1.000000
3 p 1 -1.425880
4 q -9 -1.677484
5 r 0 -0.281450
6 s 3 0.210502
7 t 5 0.534207
8 u 2 -0.046142
9 v 8 1.062430
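The same pattern extends to every numeric column at once, in case you want the whole frame z-scored without look-ahead (this is just the expanding version applied frame-wide, not part of the original answer):
num = df.select_dtypes("number")
z = num.sub(num.expanding().mean()).div(num.expanding().std())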
You could assign two new columns containing the mean and std of the items up to and including each row. Here I assume that your time series data is in the column 'time_series_data':
import numpy as np

len_ = len(df)
df['mean_past'] = [np.mean(df['time_series_data'][0:lv+1]) for lv in range(len_)]
df['std_past'] = [np.std(df['time_series_data'][0:lv+1]) for lv in range(len_)]
df['z_score'] = (df['time_series_data'] - df['mean_past']) / df['std_past']
Edit: if you want to z-score all columns, you could define a function that computes the z-score and apply it to all columns of your dataframe:
def z_score_column(column):
    len_ = len(column)
    mean = [np.mean(column[0:lv+1]) for lv in range(0, len_)]
    std = [np.std(column[0:lv+1]) for lv in range(0, len_)]
    return [(c - m) / s for c, m, s in zip(column, mean, std)]
df = pd.DataFrame(np.random.rand(10,5))
df.apply(z_score_column)
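One caveat for this answer: np.std defaults to the population standard deviation (ddof=0), while pandas .std() and .expanding().std() use the sample standard deviation (ddof=1), so its z-scores will differ slightly from the expanding() answer above. A small illustration:
import numpy as np
import pandas as pd

x = pd.Series([6, 5, 7])
np.std(x)           # 0.816...  (ddof=0, population std)
x.std()             # 1.0       (ddof=1, sample std)
np.std(x, ddof=1)   # 1.0       matches pandas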

Pandas: how to find row and column for values in a range?

I have a Pandas DataFrame that is generated from performing multiple correlations across variables.
corr = df.apply(lambda s: df.corrwith(s))
print('\n', 'Correlations')
print(corr.to_string())
The output looks like this:
Correlations
A B C D E
A 1.000000 -0.901104 0.662530 -0.772657 0.532606
B -0.901104 1.000000 -0.380257 0.946223 -0.830466
C 0.662530 -0.380257 1.000000 -0.227531 -0.102506
D -0.772657 0.946223 -0.227531 1.000000 -0.888768
E 0.532606 -0.830466 -0.102506 -0.888768 1.000000
However, this is a small sample of the correlation table, which can be over 300 rows x 300 cols. I'm trying to find a way to identify the coordinates for correlations within a specific value range.
For example, correlations between +0.25 and -0.25. My desired output would be:
E x C = -0.102506
D x C = -0.227531
In searching, I've found a few pandas functions that I'm unable to put together in a coherent way:
pandas iloc, loc, and between
How would you suggest I go about accomplishing this filtering?
Use masks + DataFrame.where. We'll use np.triu to get rid of duplicates since the correlation matrix is symmetric.
import numpy as np
corr.where(np.triu((corr.values <= 0.25) & (corr.values >= -0.25))).stack()
C D -0.227531
E -0.102506
dtype: float64
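np.triu includes the main diagonal by default; the diagonal only drops out here because 1.0 falls outside the range. To exclude it explicitly (and write the range check a bit more compactly), you can pass k=1, a minor variation of the same idea:
import numpy as np

mask = np.triu(corr.abs().le(0.25).values, k=1)  # upper triangle, diagonal excluded
corr.where(mask).stack()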

Pearson multiple correlation with Scipy

I am trying to do something quite simple: compute a Pearson correlation matrix of several variables that are given as columns of a DataFrame. I want it to ignore NaNs and also provide the p-values. scipy.stats.pearsonr is insufficient because it works only for two variables and cannot account for NaNs. There should be something better than that...
For example,
df = pd.DataFrame([[1,2,3],[6,5,4],[1,None,9]])
0 1 2
0 1 2.0 3
1 6 5.0 4
2 1 NaN 9
The columns of df are the variables and the rows are observations. I would like a command that returns a 3x3 correlation matrix, along with a 3x3 matrix of corresponding p-values. I want it to omit the None. That is, the correlation between [1,6,1],[2,5,NaN] should be the correlation between [1,6] and [2,5].
There must be a nice Pythonic way to do that, can anyone please suggest?
If you have your data in a pandas DataFrame, you can simply use df.corr().
From the docs:
DataFrame.corr(method='pearson', min_periods=1)
Compute pairwise correlation of columns, excluding NA/null values
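Note that df.corr() handles the missing values by pairwise deletion but does not return p-values. If you need those as well, one sketch (the helper name corr_with_pvalues is made up here) is to loop over the column pairs with scipy.stats.pearsonr, dropping NaNs per pair:
import numpy as np
import pandas as pd
from scipy import stats

def corr_with_pvalues(df):
    cols = df.columns
    r = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    p = pd.DataFrame(np.zeros((len(cols), len(cols))), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            pair = df[[a, b]].dropna()       # pairwise NaN removal
            if len(pair) < 2:                # not enough data for a correlation
                rho, pv = np.nan, np.nan
            else:
                rho, pv = stats.pearsonr(pair[a], pair[b])
            r.loc[a, b] = r.loc[b, a] = rho
            p.loc[a, b] = p.loc[b, a] = pv
    return r, p

corr_matrix, pval_matrix = corr_with_pvalues(df)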

apply pandas qcut function to subgroups

Let us assume we created a dataframe df using the code below. I have created a bin frequency count based on the 'value' column in df. Now how do I get the frequency count of the label = 1 samples within those previously created bins? Obviously, I should not run qcut again on just the label = 1 samples to get the count, since the bin edges would then not be the same as before.
import numpy as np
import pandas as pd
mu, sigma = 0, 0.1
theta = 0.3
s = np.random.normal(mu, sigma, 100)
group = np.random.binomial(1, theta, 100)
df = pd.DataFrame(np.vstack([s,group]).transpose())
df.columns = ['value','label']
factor = pd.qcut(df['value'], 5)
factor_bin_count = pd.value_counts(factor)
Update: I took the solution from Jeff:
df.groupby(['label',factor]).value.count()
If I understand your question, you want to take a grouping factor (e.g. the one you created using qcut to bin the continuous values) and another grouper (e.g. 'label'), then perform an operation, count in this case.
In [36]: df.groupby(['label',factor]).value.count()
Out[36]:
label value
0 [-0.248, -0.0864] 14
(-0.0864, -0.0227] 15
(-0.0227, 0.0208] 15
(0.0208, 0.0718] 17
(0.0718, 0.24] 13
1 [-0.248, -0.0864] 6
(-0.0864, -0.0227] 5
(-0.0227, 0.0208] 5
(0.0208, 0.0718] 3
(0.0718, 0.24] 7
Name: value, dtype: int64
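If you prefer the counts as a table with one column per label rather than a multi-indexed Series, unstacking the label level is a small follow-up (not part of the original answer):
df.groupby(['label', factor]).value.count().unstack('label')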
