Compact way of visualizing heat maps of correlated data - python

I am trying to visualize the correlation of the Result column with every other column.
   A_B       A_C       B_C       Result
0  0.318182  0.925311  0.860465  91
1 -0.384030  0.991803  0.996344  12
2 -0.818182  0.411765  0.920000  53
3  0.444444  0.978261  0.944444  64
Here A_B = (A-B)/(A+B), and every other pairwise column is computed the same way.
This works for a small number of columns, but as the number of columns grows, the rows in the heatmap keep stacking up. Is there a compact way to represent it?
The following code reproduces the output:
import pandas as pd
import seaborn as sns

data = {'A': [232, 243, 12, 546, 67, 12, 78, 11, 245],
        'B': [120, 546, 120, 210, 56, 120, 56, 89, 12],
        'C': [9, 1, 5, 6, 7, 43, 7, 12, 64],
        'Result': [91, 12, 53, 64, 71, 436, 74, 123, 641],
        }
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'Result'])

# Build (A-B)/(A+B), (A-C)/(A+C), and so on for every pair of columns
colnames = df.columns.tolist()[:-1]
for i, c in enumerate(colnames):
    for k in range(i + 1, len(colnames)):
        df[c + '_' + colnames[k]] = (df[c] - df[colnames[k]]) / (df[c] + df[colnames[k]])

newdf = df[['A_B', 'A_C', 'B_C', 'Result']].copy()

# Correlate each pair column with Result, dropping Result's correlation with itself
plot = pd.DataFrame(newdf.corr().iloc[:-1, -1])
sns.heatmap(plot, annot=True)
A technique I have heard of, but for which I cannot find any source, is representing each correlation value in mini-rectangles, like the following.
According to it, treating the given map as a 3*3 matrix with (0,0) at the bottom-left, A_B would be placed at (1,1), A_C at (2,1), and B_C at (2,2).
But I can't figure out how to do it.

You can plot the correlation of each column against the Result column, and against the other columns as well. Below is one way to do so. Providing the x- and y-tick labels makes it easier to compare the correlations, and annot=True displays the correlation values on the heat map.
cor = newdf.corr()
sns.heatmap(cor, xticklabels=cor.columns.values,
            yticklabels=cor.columns.values, annot=True)
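If you want the compact triangular layout described in the question, one option (a minimal sketch, not part of the original answer) is to mask the redundant upper triangle of the correlation matrix so that each pair appears exactly once:
import numpy as np
import seaborn as sns

cor = newdf.corr()
# Hide the upper triangle and the diagonal of 1s, which only duplicate the lower half
mask = np.triu(np.ones_like(cor, dtype=bool))
sns.heatmap(cor, mask=mask, annot=True)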

Related

Pandas + Seaborn: compute the number of 0s for categorical data

I'm currently struggling with my dataframe in Pandas (new to this).
I have a dataframe with 3 columns: Categorical_data1, Categorical_data2, Output (2400 rows x 3 columns).
Both categorical columns (the inputs) are strings, and the output depends on the inputs.
Categorical_data1 = ['type1', 'type2', ..., 'type6']
Categorical_data2 = ['rain1', 'rain2', 'rain3', 'rain4']
So there are 24 possible pairs of categorical values.
I want to plot a heatmap (using seaborn, for instance) of the number of 0s in the output for each pair of categorical values (Cat_data1, Cat_data2). I tried several things using booleans.
I tried to figure out how to compute the exact number of 0s:
count = ((df['Output'] == 0) & (df(['Categorical_Data1'] == 'type1') & (df(['Categorical_Data2'] == 'rain1')))).sum()
but it failed.
The output is either 0 or 1, with a large number of 0s (around 1200 out of 2400). My goal is to have something like this Source by jcdoming (I can't upload images...), with months = Categorical_Data1, years = Categorical_Data2, and the number of 0s in the outputs as the values.
Thank you for your help.
Use a seaborn countplot. It gives counts of categorical data occurrences in a certain feature. Use hue to add the second feature to the visualization:
import seaborn as sns
sns.countplot(data=dataframe, x='Categorical_Data1', hue='Categorical_Data2')
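If you specifically need a heatmap of how many 0s occur for each category pair, as the question asks, one way (a minimal sketch, assuming the column names from the question) is to count the zeros per group and unstack them into a grid:
import seaborn as sns

# Keep only the rows where Output == 0, count them per category pair,
# and pivot into a Categorical_Data1-by-Categorical_Data2 grid
zeros = (dataframe[dataframe['Output'] == 0]
         .groupby(['Categorical_Data1', 'Categorical_Data2'])
         .size()
         .unstack(fill_value=0))
sns.heatmap(zeros, annot=True, fmt='d')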

Pandas correlation on just one column containing np arrays

I'm working with a dataframe that has a column containing a np.array in each row (in this case representing the mean waveform of brain recordings through time). I want to calculate the Pearson correlation of this column (array by array).
This is my code:
import numpy as np
from scipy import stats

length = len(df.Mean)
Mean = []
for i in range(len(df.Mean)):
    Mean.append(df.Mean[i])

# Pairwise Pearson correlations and p-values between every pair of arrays
Correlation_p = np.zeros((length, length))
P_Value_p = np.zeros((length, length))
for i in range(length):
    for j in range(length):
        Correlation_p[i][j], P_Value_p[i][j] = stats.pearsonr(df.Mean[i], df.Mean[j])
This works, but I want to know if there is a more pythonic way to do it, maybe using df.corr(). I tried, but couldn't figure out how.
EDIT: the output of df.Mean.head():
0 [-0.2559348091247745, 0.02743063113723536, 0.3...
1 [-0.37025615099744325, -0.11299328141596175, 0...
2 [-1.0543681894876467, -0.8452798699354909, -0....
3 [-0.23527437766943646, -0.28657810260136585, -...
4 [0.45557980303095674, 0.6055674269814991, 0.74...
Name: Mean, dtype: object
The arrays that you would like to correlate seem to be in single cells of the DataFrame, if I am not mistaken. The following brings them into a format where each array occupies its own column.
I made a data example that resembles the format of df.Mean.head():
df = pd.DataFrame({'x': [np.random.randint(0, 5, 10), np.random.randint(0, 5, 10), np.random.randint(0, 5, 10)]})
You can turn these arrays into columns using this:
df = pd.DataFrame(np.array(df['x'].tolist()).transpose())
Adapt the conversion (the tolist/transpose step) to your own dimensions.
From there, it would be fairly straightforward.
A correlation matrix can be created by:
df.corr()
A visualization of the correlation matrix:
import matplotlib.pyplot as plt
plt.matshow(df.corr())
plt.show()
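A more direct alternative (a sketch, not from the original answer, using the original df from the question): stack the per-row arrays into a single 2-D matrix and let NumPy compute all pairwise Pearson correlations at once. This reproduces the Correlation_p matrix from the question, though not the p-values:
import numpy as np

# Each row of `waveforms` is one array from df.Mean
waveforms = np.stack(df['Mean'].to_numpy())
# Full pairwise Pearson correlation matrix (rows correlated against rows)
Correlation_p = np.corrcoef(waveforms)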

Multiple columns visualization with plotly or seaborn

I have data from factories and their error codes during production, such as below:
PlantID  A  B  C  D
1        0  1  2  4
1        3  0  2  0
3        0  0  0  1
4        0  1  1  5
Each row represents a production order.
I want to create a graph with PlantID on the x-axis and A, B, C, D as separate bars on the y-axis.
That way I can see in one graph which factory has the most D errors, which has the most A errors, and so on.
I usually use plotly and seaborn, but I couldn't find any solution for this; in every example the y-axis is a single column.
Thanks in advance,
Seaborn likes its data in long- or wide-form.
As mentioned above, seaborn will be most powerful when your datasets have a particular organization. This format is alternately called "long-form" or "tidy" data and is described in detail by Hadley Wickham in this academic paper. The rules can be simply stated:
Each variable is a column
Each observation is a row
The following code converts the original dataframe to long form by stacking the columns on top of each other, so that every row corresponds to a single record specifying the column name and the value (the count).
import numpy as np
import pandas as pd
import seaborn as sns

# Generate some example data
N = 20
PlantID = np.random.choice(np.arange(1, 4), size=N, replace=True)
data = dict((k, np.random.randint(0, 50, size=N)) for k in ['A', 'B', 'C', 'D'])
df = pd.DataFrame(data, index=PlantID)
df.index = df.index.set_names('PlantID')

# Stack the columns and reset the index to create a long format (with some renaming)
df = df.stack().reset_index().rename({'level_1': 'column', 0: 'count'}, axis=1)
sns.barplot(x='PlantID', y='count', hue='column', data=df)
Pandas has really clever built-in plotting functionality:
import matplotlib.pyplot as plt

df.plot(kind='bar')
plt.show()
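Since the question mentions plotly as well, here is a minimal sketch with plotly.express (an assumption: the long-form dataframe built above, with 'PlantID', 'column', and 'count' columns, is available):
import plotly.express as px

# Grouped bars: one bar per error-code column, grouped by plant
fig = px.bar(df, x='PlantID', y='count', color='column', barmode='group')
fig.show()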

Calculating angle between two points in time-series

I have time-series data and I am trying to calculate the angle (in degrees) between two points. Here is what I have done so far, but it doesn't seem to give the correct result:
import numpy as np
import pandas as pd

pts = 2  # look-back window in bars
df = pd.read_csv("EURUSD.csv")
df = df.reset_index()
df['A'] = np.rad2deg(np.arctan2(df['Low'] - df['Low'].shift(pts),
                                df['index'] - df['index'].shift(pts)))
df.dropna(inplace=True)
However, sometimes this gives me weird outputs like:
2693 3.141258
2702 -3.141383
2708 -3.141451
2719 -3.141033
2724 -3.140893
2734 3.141550
I have also tried the following code:
df['A'] = (df['Low'] - df['Low'].shift(pts)) / (df['index'] - df['index'].shift(pts))
2693 -0.000334
2702 0.000210
2708 0.000142
2719 0.000560
2724 0.000700
2734 -0.000043
What am I doing wrong here?
EDIT:
Here is a screenshot of what I'm trying to do. I'm simply trying to find that -48 degrees in Python. I am not trying to detect these points automatically; I have spotted them manually and just need to do the calculation.
I guess that your question is: how do I calculate the angle between two lines, where each line is defined by a single point and a common origin? You then want to perform this operation for a series of (x1, x2) points recorded over time.
Here you can find the arithmetic, and here an example.
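The arithmetic boils down to atan2 on the coordinate deltas (a sketch; segment_angle_deg is a hypothetical helper, not from the linked sources):
import numpy as np

# Angle of the segment (x1, y1) -> (x2, y2) relative to the horizontal axis
def segment_angle_deg(x1, y1, x2, y2):
    return np.degrees(np.arctan2(y2 - y1, x2 - x1))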
To get the line angle between the two points, you'll need the following:
the price difference (looks like 1.29250 - 1.29650 = -0.004)
the number of bars between the two points (that appears to be 10 bars)
the price-to-bar ratio (you'll have to look at the settings of that particular chart)
price_diff = -0.004
bars = 10
price_to_bar = unknown  # depends on the chart's scaling
X = bars * price_to_bar
Final output:
import numpy as np

round(np.angle(complex(X, price_diff), deg=True), 0)
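As a sanity check (a worked sketch with an assumed price-to-bar ratio, since the real one depends on the chart settings), a ratio of 0.00036 price units per bar reproduces the -48 degrees from the screenshot:
import numpy as np

price_diff = -0.004
bars = 10
price_to_bar = 0.00036   # assumed value, chosen here to match the chart
X = bars * price_to_bar  # horizontal run expressed in price units

# Equivalent to np.angle(complex(X, price_diff), deg=True)
angle = np.degrees(np.arctan2(price_diff, X))
print(round(angle))  # -48.0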

K-means clustering on 3 dimensions with sklearn

I'm trying to cluster data using lat/lon as X/Y axes and DaysUntilDueDate as my Z axis. I also want to retain the index column ('PM') so that I can create a schedule later using this clustering analysis. The tutorial I found here has been wonderful but I don't know if it's taking the Z-axis into account, and my poking around hasn't resulted in anything but errors. I think the essential point in the code is the parameters of the iloc bit of this line:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(A.iloc[:, :])
I tried changing this part to iloc[1:4] (to work only on columns 1-3), but that resulted in the following error:
ValueError: n_samples=3 should be >= n_clusters=4
So my question is: How can I set up my code to run clustering analysis on 3-dimensions while retaining the index ('PM') column?
Here's my python file, thanks for your help:
from sklearn.cluster import KMeans
import pandas as pd

# Import csv file with data in the following columns:
# [PM (index)] [Longitude] [Latitude] [DaysUntilDueDate]
df = pd.read_csv('point_data_test.csv', index_col=['PM'])

numProjects = len(df)
K = numProjects // 3  # Around three projects can be worked per day
print("Number of projects: ", numProjects)
print("K-clusters: ", K)

for k in range(1, K):
    # Create a kmeans model on our data, using k clusters.
    # random_state helps ensure that the algorithm returns the
    # same results each time.
    kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])

    # These are our fitted labels for clusters --
    # the first cluster has label 0, and the second has label 1.
    labels = kmeans_model.labels_

    # Sum of distances of samples to their closest cluster center
    SSE = kmeans_model.inertia_
    print("k:", k, " SSE:", SSE)

# Add labels to df
df['Labels'] = labels
#print(df)
df.to_csv('test_KMeans_out.csv')
It seems the issue is with the syntax of iloc[1:4].
From your question it appears you changed:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])
to:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[1:4])
It seems to me that either you have a typo or you don't understand how iloc works, so I will explain.
You should start by reading Indexing and Selecting Data from the pandas documentation.
In short, .iloc is an integer-based indexing method for selecting data by position.
Let's say you have the dataframe:
 A   B   C
 1   2   3
 4   5   6
 7   8   9
10  11  12
The use of iloc in the example you provided, iloc[:, :], selects all rows and all columns, producing the entire dataframe. In case you aren't familiar with Python's slice notation, take a look at the question Explain slice notation or the docs for An Informal Introduction to Python. The example you said caused your error, iloc[1:4], selects the rows at index 1-3. This would result in:
 A   B   C
 4   5   6
 7   8   9
10  11  12
Now, if you think about what you are trying to do and the error you received, you will realize that you selected fewer samples from your data than the number of clusters you asked for: 3 samples (rows 1, 2, 3), while telling KMeans to find 4 clusters, which just isn't possible.
What you really intended to do (as I understand it) was to select all rows and the columns 1-3 that correspond to your lat, lng, and z values. To do this, just add a colon as the first argument to iloc, like so:
df.iloc[:, 1:4]
Now you have selected all of your samples along with the columns at index 1, 2, and 3. Assuming you have enough samples, KMeans should work as you intended.
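To see the difference concretely, a minimal sketch with a hypothetical 4-column frame shaped like the question's CSV (the column names match the question; the values are made up for illustration):
import pandas as pd

df = pd.DataFrame({'PM': [101, 102, 103, 104],
                   'Longitude': [-104.9, -105.0, -104.8, -105.1],
                   'Latitude': [39.7, 39.8, 39.6, 39.9],
                   'DaysUntilDueDate': [3, 10, 1, 7]})

print(df.iloc[1:4].shape)     # (3, 4) -- only 3 samples: too few for 4 clusters
print(df.iloc[:, 1:4].shape)  # (4, 3) -- all samples, just the lng/lat/days columns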
