Weird correlation results - python

I'm trying to calculate the correlation between two multi-index dataframes (a and b) in two ways:
1) calculate the date-to-date correlation directly with a.corr(b), which returns a result X
2) take the mean values for all dates and calculate the correlation with
a.mean().corr(b.mean()), which gives a result Y.
I also made a scatter plot, for which I needed both dataframes to have the same index.
So I decided to calculate:
a.mean().corr(b.reindex_like(a).mean()) and again got the value X.
This is strange to me because I expected to get Y. I thought the corr function reindexes the dataframes to one another. If not, what is this value Y that I am getting?
Thanks in advance!

I have found the answer: when I do the reindex, I cut away most of the values. One of the dataframes contains only one value per date, so its mean per date is equal to that single value.
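To illustrate (a toy sketch with made-up data, since the original frames aren't shown): reindex_like keeps only the rows whose index appears in a, so the per-date mean of the trimmed b is identical to the single value that survives for each date.

import pandas as pd

# Hypothetical frames: 'a' has one row per date, 'b' has three.
idx_a = pd.MultiIndex.from_product([['d1', 'd2', 'd3'], [0]], names=['date', 'n'])
idx_b = pd.MultiIndex.from_product([['d1', 'd2', 'd3'], [0, 1, 2]], names=['date', 'n'])
a = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=idx_a)
b = pd.DataFrame({'x': [2.0, 5.0, 8.0, 4.0, 1.0, 7.0, 6.0, 3.0, 9.0]}, index=idx_b)

trimmed = b.reindex_like(a)            # only the (date, 0) rows survive
print(trimmed.groupby('date').mean())  # equals the surviving values: 2, 4, 6
print(b.groupby('date').mean())        # the true per-date means: 5, 4, 6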

Related

Multiply by unique number based on which pandas interval a number falls within

I am trying to take a number and multiply it by a unique number depending on which interval it falls within.
I did a groupby on my pandas dataframe according to which bins a value fell into:
bins = pd.cut(df['A'], 50)
grouped = df['B'].groupby(bins)
interval_averages = grouped.mean()
A
(0.00548, 0.0209] 0.010970
(0.0209, 0.0357] 0.019546
(0.0357, 0.0504] 0.036205
(0.0504, 0.0651] 0.053656
(0.0651, 0.0798] 0.068580
(0.0798, 0.0946] 0.086754
(0.0946, 0.109] 0.094038
(0.109, 0.124] 0.114710
(0.124, 0.139] 0.136236
(0.139, 0.153] 0.142115
(0.153, 0.168] 0.161752
(0.168, 0.183] 0.185066
(0.183, 0.198] 0.205451
I need to be able to check which interval a number falls into, and then multiply it by the average value of the B column for that interval range.
From the docs I know I can use the in keyword to check whether a number is in an interval, but I cannot find how to access the value for a given interval. In addition, I don't want to loop through the Series checking whether the number is in each interval; that seems quite slow.
Does anybody know how to do this efficiently?
Thanks a lot.
You can store the numbers being tested in an array and run cut() over them with your existing bins to sort the values into their respective intervals. This returns the bin that each number falls into, which tells you which row of interval_averages holds the mean you need, so you can access the value positionally via iloc.
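As a rough sketch (with hypothetical stand-in data, since the full df isn't shown): cutting the new values with the same bin edges yields category codes that are integer positions into interval_averages, so no Python-level loop is needed.

import numpy as np
import pandas as pd

# Stand-in data; replace with your own df.
rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.uniform(0.005, 0.2, 500),
                   'B': rng.uniform(0.0, 0.25, 500)})

bins = pd.cut(df['A'], 50)
interval_averages = df['B'].groupby(bins).mean()

# Cut the numbers being tested with the *same* bin edges; the integer
# category codes are row positions in interval_averages (-1 means the
# value fell outside every bin).
values = np.array([0.02, 0.08, 0.15])
codes = pd.cut(values, bins=bins.cat.categories).codes
result = values * interval_averages.to_numpy()[codes]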
Hopefully this helps a bit

Python - Assign a variable in lambda apply function to calculate correlation

I have a dataframe whose number of columns has the potential to grow exponentially. I'm trying to calculate the correlation between two columns, multiple times; part of the correlation calculation involves this growing set of columns. I'm creating the columns needed for the correlation calculation in a for loop, and when I try to calculate the correlation I get an error saying:
'DataFrame' object has no attribute 'col'
I've tried assigning the new column name to a variable and putting that variable in the lambda function, but that also doesn't work.
How do I update the correlation piece of the code to use the new columns from the for loop?
Here is the for loop that creates the new columns. colname is a list of all column names:
for col in colname:
    df[col+'_RR'] = df['p_'+col] - df['r2500_ret']
    df[col+'_sec_rr'] = df['ret'] - df[col+'_RR']

# Calculate Correlation
dfcorr = df.groupby('symbol').apply(lambda v: v.col+'_sec_rr'.corr(v.col+'_RR')).to_frame().rename(columns={0:'jets_correlation'})
Tim Roberts answered the question in the first comment: it was a simple change in notation from . (attribute access) to [] (bracket indexing). Thanks Tim!
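For reference, a sketch of the corrected call (using the same column scheme as above, with col being one of the names from colname): the column name is assembled as a string first and looked up with brackets, so pandas never searches for an attribute named col.

# Corrected: bracket indexing builds the column name from the loop variable.
dfcorr = (df.groupby('symbol')
            .apply(lambda v: v[col + '_sec_rr'].corr(v[col + '_RR']))
            .to_frame()
            .rename(columns={0: 'jets_correlation'}))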

Pandas - Use values from rows with equal values in iteration

In case this has been answered in the past I want to apologize, I was not sure how to phrase the question.
I have a dataframe with 3d coordinates and, for each point in space, a row with a scalar value (the magnetic field in this case). I calculated the radius as the distance of each point from the line at (x,y) = (0,0). The unique radius and z values are transferred into a new dataframe. Now I want to calculate the scalar value for every point (Z,R) in the volume by averaging over all points in the 3d system with equal radius.
Currently I am iterating over all unique Z and R values. It works, but it is awfully slow.
df is the original dataframe, dfn is the new one which - in the beginning - only contains the unique combinations of R and Z values.
for r in dfn.R.unique():
    for z in df.Z.unique():
        dfn.loc[(df["R"]==r)&(df["Z"]==z), "B"] = df["B"][(df["R"]==r)&(df["Z"]==z)].mean()
Is there any way to speed this up with a single line of code, in which pandas is told to grab all rows from the original dataframe where Z and R match the values in each row of the new dataframe?
Thank you in advance for your help.
Try groupby!!!
It looks like you can achieve this with something like:
df[['R', 'Z', 'B']].groupby(['R', 'Z']).mean()
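If the means then need to be attached to dfn (which holds the unique (R, Z) pairs), one way, sketched with the column names from the question, is to merge the grouped result back:

# Per-(R, Z) means of B, merged onto dfn via the shared key columns.
means = df.groupby(['R', 'Z'], as_index=False)['B'].mean()
dfn = dfn.merge(means, on=['R', 'Z'], how='left')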

Understanding the percentiles= calculation in describe() of python

I am trying to understand the following:
1) how the percentiles are calculated;
2) why Python did not return the values to me in sorted order (which was my expectation) as output;
3) my requirement is to know the actual value below which x% of the population lies. How do I do that?
Thanks
Python 2:

import pandas as pd

new = pd.DataFrame({'a': range(10), 'b': [60510, 60053, 54968, 62269, 91107, 29812, 45503, 6460, 62521, 37128]})
print new.describe(percentiles=[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
1) how the percentiles are calculated
The 90th percentile/quantile means 10% of the data is greater than that value and 90% of the data falls below it. By default, it's based on a linear interpolation. This is why in your a column the values increment by 0.9 instead of being the original data values of [0, 1, 2, ...]. If you want to use nearest values instead of interpolation, you can use the quantile method instead of describe and change the interpolation parameter.
2) why Python did not return the values to me in sorted order (which was my expectation) as output
Your question is unclear here. It does return the values in sorted order, indexed by the output of the .describe method: count, mean, std, min, the quantiles from low to high, then max. If you only want quantiles and not the other statistics, you can use the quantile method instead.
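For instance, a minimal sketch using the frame from the question, getting only the quantiles with nearest-rank instead of linear interpolation:

# 'nearest' returns an actual data value rather than interpolating.
new['a'].quantile([0.1, 0.5, 0.9], interpolation='nearest')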
3) my requirement is to know the actual value below which x% of the population lies. How do I do that?
Nothing is wrong with the output. Those quantiles are accurate, although they aren't very meaningful when your data only has 10 observations.
Edit: It wasn't originally clear to me that you were attempting to do stats on a frequency table. I don't know of a direct solution in pandas that doesn't involve moving your data over to a numpy array. You could use numpy.repeat to get a raw list of observations to put back into pandas and do descriptive stats on.
import numpy as np

vals = np.array(new.a)
freqs = np.array(new.b)
observations = np.repeat(vals, freqs)
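From there, as a sketch, the expanded observations can go back into pandas for the percentile summary:

import pandas as pd

# Each value of 'a' is now repeated by its frequency in 'b', so the
# quantiles are taken over the population rather than the table rows.
pd.Series(observations).describe(percentiles=[0.1, 0.5, 0.9])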

Comparing two dataframes in pandas for all values greater than the other

I have two data frames with numeric values.
I want to compare them and check whether one has all values greater than the other.
I have a formula where, say, the mean is mr, the variance is vr, and alpha is a scalar value; I want to check whether r > (mr + alpha * vr), where mr is a dataframe of mean values, vr is a dataframe of variances, and r is an individual dataframe for comparison.
if (r > (mr + alpha * vr)):
    do something
For example, my r DataFrame is r = pd.DataFrame({"a": [5, 1, 8, 9, 10], "b": [4, 5, 6, 7, 8], "c": [11, 12, 12, 14, 15]}) and the entire right-hand side is, say, toCompare = pd.DataFrame({"a": [6, 7, 8, 9, 10], "b": [2, 3, 5, 6, 6], "c": [4, 5, 17, 8, 9]}).
So r > toCompare should result in True, since all the elements in "b" are greater.
I just needed to check whether values in the resulting boolean DataFrame are True. I finally got this to work; it was a bit difficult to figure out inside the large piece of code.
any((r>(mr+alpha*vr)).any())
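As a note on the reductions, sketched with the frames from the question: .any() and .all() first reduce per column, and a second reduction collapses that to a single boolean, so the pair you pick determines whether you ask for any element, every element, or some entirely greater column.

import pandas as pd

r = pd.DataFrame({"a": [5, 1, 8, 9, 10],
                  "b": [4, 5, 6, 7, 8],
                  "c": [11, 12, 12, 14, 15]})
toCompare = pd.DataFrame({"a": [6, 7, 8, 9, 10],
                          "b": [2, 3, 5, 6, 6],
                          "c": [4, 5, 17, 8, 9]})

mask = r > toCompare       # element-wise boolean DataFrame
print(mask.any().any())    # True if at least one element is greater
print(mask.all().all())    # True only if every element is greater
print(mask.all().any())    # True if some column is entirely greater ("b" here)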
