Pandas correlation on just one column containing np arrays - python

I'm working with a dataframe with a column containing a np.array per row (in this case representing the mean waveform of brain recordings trought the time). I want to calculate the pearson correlation of this column (array by array).
This is my code
lenght = len(df.Mean)
Mean = []
for i in range(len(df.Mean)):
Mean.append(df.Mean[i])
Correlation_p = np.zeros((lenght,lenght))
P_Value_p = np.zeros((lenght,lenght))
for i in range(lenght):
for j in range(lenght):
Correlation_p[i][j],P_Value_p[i][j] = stats.pearsonr(df.Mean[i],df.Mean[j])
This works, but I want to know if there is a more pythonic way to do it, maybe using df.corr(). I tried but I failed in how to do it.
EDIT: the output of df.Mean.head()
0 [-0.2559348091247745, 0.02743063113723536, 0.3...
1 [-0.37025615099744325, -0.11299328141596175, 0...
2 [-1.0543681894876467, -0.8452798699354909, -0....
3 [-0.23527437766943646, -0.28657810260136585, -...
4 [0.45557980303095674, 0.6055674269814991, 0.74...
Name: Mean, dtype: object

The arrays that you would like to correlate seem in single cells of the DataFrame, if I am not mistaken. The following brings it in a format where each single array occupies a single column.
I made an data example that resembles the format of df.Mean.head():
df = pd.DataFrame({'x':[np.random.randint(0,5,10), np.random.randint(0,5,10), np.random.randint(0,5,10)]})
You can turn these arrays into columns using this:
df = pd.DataFrame(np.array(df['x'].tolist()).transpose())
Adapt the reshape parameters according to your own dimensions.
From there, it would be fairly straightforward.
A correlation matrix can be created by:
df.corr()
A visualization of the correlation matrix:
import matplotlib.pyplot as plt
plt.matshow(df.corr())
plt.show()

Related

Normalizing all numeric columns in my dataset and compare before and after

I want to normalize all the numeric values in my dataset.
I have taken my whole dataset into a pandas dataframe.
My code to do this so far:
for column in numeric: #numeric=df._get_numeric_data()
x_array=np.array(df[column])
normalized_X=preprocessing.normalize([x_array])
But how do i verify this is correct though?
I tried plotting a histogram for one of the columns before normalizing and after adding this piece of code before and after my for loop:
x=df['Below.Primary'] #Below.Primary is one of my column names
plt.hist(x, bins=45)
The blue histogram was before the for loop and the orange, after.
My total code looked like this:
ln[21] plt.hist(df['Below.Primary'], bins=45)
ln[22] for column in numeric:
x_array=np.array(df[column])
normalized_X=preprocessing.normalize([x_array])
x=df['Below.Primary']
plt.hist(x, bins=45)
I don't see any reduction in scale. What have i done wrong? If not correct, can someone point out the correct way to do what i wanted to do?
Try use this:
scaler = preprocessing.StandardScaler()
df[col] = scaler.fit_transform(df[col])
A couple general things first.
If numeric is a list of column names (looks like this is the case), the for loop is not necessary.
A Pandas series using an ndarray under the hood so you can just request the ndarray with Series.values instead of calling np.array(). See this page on the Pandas Series.
I am assuming you are using preprocessing from sklearn.
I recommend using sklearn.preprocessing.Normalizer for this.
import pandas as pd
from sklearn.preprocessing import Normalizer
### Without the for loop (recommended)
# this version returns array
normalizer = Normalizer()
normalized_values = normalizer.fit_transform(df[numeric])
# normalized_values is a 2D array which is useful
# for many applications
# to convert back to DataFrame
df = pd.DataFrame(normalized_values, columns = numeric)
### with the for-loop (not recommended)
for column in numeric:
x_array = df[column].values.reshape(-1,1)
df[column] = normalizer.fit_transform(x_array)
You have to set normalized_X to the respective column while iterating.
for column in numeric:
x_array=np.array(df[column])
normalized_X=preprocessing.normalize([x_array])
df[column]= normalized_X #Setting normalized value in the column
x=df['Below.Primary']
plt.hist(x, bins=45)

Filling results of linear regression into a dataframe

I'm running a regression between two stocks:
(y=bank_matrix['EXO.MI']
and
x=bank_matrix['LDO.MI']).
My task is to update the slope coefficient every 20 days (lookback). In short, I want to have a list of slope coefficients starting from day 20 (my lookback). So I run this regression model called reg.
In the meantime, I create:
A)3 empty lists: Intercetta=[], Hedge=[], Residuals=[]
B)1 Dataframe called Regressione where I want to copy the results of my regression (Intercept,Slope and residuals) inside this dataframe columns (['Intercept','Hedge','Residuals']).
Now the whole code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader as pdr
from sklearn.linear_model import LinearRegression
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
tickers=['EXO.MI','LDO.MI']
end=datetime.date.today()
gap=datetime.timedelta(days=650)
start=end- gap
Bank=pdr.get_data_yahoo(tickers,start=start,end=end)
bank_matrix=Bank['Adj Close']
bank_matrix=bank_matrix.dropna()
exor=bank_matrix['EXO.MI']
leonardo=bank_matrix['LDO.MI']
Regressione=pd.DataFrame(data=np.zeros((len(exor),3)),columns=['Intercetta','Hedge','Residuals'],index=bank_matrix['EXO.MI'].index)
lookback=20
Hedge=[]
Intercetta=[]
Residuals=[]
for i in range(lookback,len(exor)):
reg=LinearRegression().fit(bank_matrix[['LDO.MI']][i-lookback+1:i],bank_matrix[['EXO.MI']][i-lookback+1:i])
# Regressione.iloc[Regressione[i,'Hedge']]=reg.coef_[0]
Hedge.append(reg.coef_[0])
Intercetta.append(reg.intercept_)
y_pred=reg.predict(bank_matrix[['LDO.MI']][lookback:])
Residuals.append(bank_matrix[['EXO.MI']][lookback:].to_numpy()-y_pred)
Regressione=pd.DataFrame(list(zip(Intercetta,Hedge,Residuals)),columns=['Intercetta','Hedge','Residuals'])
Regressione.set_index(bank_matrix[['EXO.MI']].index[lookback:],inplace=True)
NOW THE FINAL QUESTION: Why in my final dataframe 'Regressione', the third column('Residuals') is an horizontal array???
so, firstly I think these 2 lines you are doing completely wrong:
y_pred=reg.predict(bank_matrix[['LDO.MI']][lookback:])
Residuals.append(bank_matrix[['EXO.MI']][lookback:].to_numpy()-y_pred)
You basically try to run linear regression for all the points starting 1 to 20, then 2 to 21, 3 to 22 etc. Then you try to fit that regression to data from observation 20 onward. So you get the model for e.g. 5 to 24 and based on it you predict observations 20 till the end, and take the difference between that prediction and actuals (mind that bank_matrix[['EXO.MI']][lookback:].to_numpy() doesn't change during your for loop).
I suppose what would make more sense here would be:
y_pred=reg.predict(bank_matrix[['LDO.MI']][i-lookback+1:i])
Residuals.append(bank_matrix[['EXO.MI']][i-lookback+1:i].to_numpy()-y_pred)
So you would take error of the model, or:
y_pred=reg.predict(bank_matrix[['LDO.MI']][i:])
Residuals.append(bank_matrix[['EXO.MI']][i:].to_numpy()-y_pred)
So you would try to fit prediction based on the current time span to the data going forward.
Now first option will produce lists of 19 elements per row, while the other one will produce list of 430, decreasing by 1 per row, until 1 in the last row. Because these are residuals - so you have a line, with a slope, and hedge 1 per given time span, but then you have number of observation within this range producing each different result. So depending on how do you want to express it - you can make it sum of square residuals, or maybe take mean residual - you can make it one number only by applying some further transformation to it.
Hope this helps...
From the doc:
If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order of the columns in the resulting DataFrame will be unchanged.
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
You need to use df.loc for example to modify the data in your dataframe...

How to split random generated array values in two separate array using numpy and random?

how to create a vector called row_min that contains the minimum value for each of the 25 rows (this implies the shape of this vector will be (25,)) Create a vector called col_max that contains the maximum value for each of the 8 columns (col_max will be a vector of shape (8,))
I have developed the code and I'm new to vector concept, need some suggestions.
import random
import numpy
c = numpy.random.rand(25,8)
print("Random float array 25X8 between range of 0.0 to 1.0 \n")
print(c,"\n")
I didn't find the source to understand the concept.
You have to specify the axis np.max( .., axis=...) should work on:
import random
import numpy as np
c = np.random.rand(5,3) # smaller for less output
print(c,"\n")
print( np.max(c, axis=0)) # column
print( np.max(c, axis=1)) # row
Output:
[[0.47894278 0.80356294 0.34453725]
[0.33802491 0.82795648 0.28438504]
[0.46838701 0.73664987 0.82215448]
[0.66245476 0.59981989 0.43837083]
[0.28515865 0.86093323 0.92248524]]
# axis 0 (columns)
[0.66245476 0.86093323 0.92248524]
# axis 1 (rows)
[0.80356294 0.82795648 0.82215448 0.66245476 0.92248524]
See matrix.max() ... min() works the same.

Pearson correlation of two 3D arrays with NaN in python

I have two three-dimensional arrays a and b with [time,lat,lon]. I want to correlate the time series of each grid cell like correlate(a[:,0,0],b[:,0,0]), correlate(a[:,0,1],b[:,0,1]), ... . I'm aiming for two correlations. One with the entire time series and one only where array a surpasses a certain threshold.
The datasets also include some missing values in the time series and I read in both datasets with Xarray. Correlations and masking are done using numpy.
At the moment I walk through each latitude and longitude, grabbing the time series, mask it to account for nan and the threshold and correlate them. My code looks like this:
def correlate(A, B, var1, var2, TH):
name = "corr_"+var1+"_"+var2+"_TH_"+str(TH)+".nc"
a = xr.open_dataset(A).sel(time=slice('1950-03','2013-12'))
b = xr.open_dataset(B).sel(time=slice('1950-03','2013-12'))
corr = np.empty([a[var1].shape[1],a[var1].shape[2]],dtype=float)
corr_TH = corr
varname_TH = "r_TH_"+str(TH)
for lt in range(corr.shape[0]):
for ln in range(corr.shape[1]):
corr[lt,ln] = np.ma.corrcoef(a[var1][:,lt,ln],b[var2][:,lt,ln], rowvar=True)[0,1]
corr_TH[lt,ln] = np.ma.corrcoef(np.ma.masked_greater(a[var1][:,lt,ln],TH),b[var2][:,lt,ln], rowvar=True)[0,1]
# save whole correlations
ds = xr.Dataset({'r': (['lat', 'lon'], corr),varname_TH: (['lat', 'lon'], corr_TH)},coords={'lon': a['lon'],'lat': a['lat']})
return ds
This works in general but is super slow. I found the Xarray function array.stack() to flatten the arrays and tried something like:
A_stack = A.var1.stack(z=('lat','lon'))
B_stack = B.var2.stack(z=('lat','lon'))
cov = ((A_stack - A_stack.mean(axis=0))* (B_stack - B_stack.mean(axis=0))).mean(axis=0)
corr = cov / (A_stack.std(axis=0) * B_stack.std(axis=0))
The multi index 'z' over which the array is stacked is retained through the process, however, the correlation array in the end is empty. I suppose that's because of the Nans.
Does anyone have an idea of the do this?
thanks

Compact way of visualizing heat maps of correlated data

I am trying to visualize the correlation of the Result column with every other column.
A_B A_C B_C Result
0 0.318182 0.925311 0.860465 91
1 -0.384030 0.991803 0.996344 12
2 -0.818182 0.411765 0.920000 53
3 0.444444 0.978261 0.944444 64
A_B = (A-B)/(A+B) correspondingly all other values too.
which works for smaller no. of columns but if I increase the no. of columns then no. of rows in heatmap keeps on stacking up.Is there any compact way to represent it.
Following code will reproduce the output-
import pandas as pd
import seaborn as sns
data = {'A':[232,243,12,546,67,12,78,11,245],
'B':[120,546,120,210,56,120,56,89,12],
'C':[9,1,5,6,7,43,7,12,64],
'Result':[91,12,53,64,71,436,74,123,641],
}
df = pd.DataFrame(data,columns=['A','B','C','Result'])
#Responsible for (A-B)/(A+B) ,(A-C)/(A+C) and similarly
colnames = df.columns.tolist()[:-1]
for i,c in enumerate(colnames):
if i!=len(colnames):
for k in range(i+1,len(colnames)):
df[c+'_'+colnames[k]]=(df[c]-df[colnames[k]])/(df[c]+df[colnames[k]])
newdf = df[['A_B','A_C','B_C','Result']].copy()
#Plotting A_B,A_C,B_C by ignoring the output of result of itself
plot = pd.DataFrame(newdf.corr().iloc[:-1,-1])
sns.heatmap(plot,annot=True)
A technique which I heard but unable to find any source ,is representing each correlation factor in the mini-recangles like
So according to it, considering the given map as a matrix of 3*3 and (0,0) starting from left-bottom, A_B will be represented in (1,1)
A_C in (2,1),B_C in (2,2).
But ,I am not getting it how to do it ?
You can plot the correlation of each column against the Result column and other columns as well. Below is one way to do so. Providing the x- and y-ticklabels guides you better for comparing the correlations. You can also annotate the correlation values to be displayed on the heat map.
cor = newdf.corr()
sns.heatmap(cor, xticklabels=cor.columns.values,
yticklabels=cor.columns.values, annot=True)

Categories