How to make clustered heatmap of a large dataset look nicer?

I have a distance matrix that I normalized. I trimmed the row and column headers with Python regular expressions and tried to make a clustered heatmap from it with the following code:
import re
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv('distance_matrix_Mult_Align(distance).csv', index_col=0)
# Normalize each row by its sum
row_sums = df.sum(axis=1)
new_matrix = df.div(row_sums, axis=0)

def acc_id(s):
    # Extract the text between the first and last '|' of a header
    match = re.search(r'\|(.*)\|', s)
    if match:
        return match.group(1)

sns.clustermap(new_matrix.rename(columns=acc_id, index=acc_id),
               row_cluster=False,
               xticklabels=True,
               yticklabels=True,
               cmap='RdBu',
               center=0,
               vmin=0,
               vmax=1)
plt.show()
My clustered map looks like this: [image]
I have tried reading the documentation for clustermap and pyplot:
https://seaborn.pydata.org/generated/seaborn.clustermap.html
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html#matplotlib.pyplot.figure
But I cannot work out how to make the plot look useful. I would really appreciate any help. Thanks!

The problem is your vmax=1 argument. The maximum value in the whole dataset, given by new_matrix.max().max(), is only about 0.17, so most of the color range goes unused.
Either remove vmax or set it to a lower value.
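
A minimal sketch of that fix, using synthetic data in place of the original CSV. The random 30x30 matrix and the name `data_max` are illustrative assumptions, and `center=0` is left out here as a choice of this sketch rather than part of the answer. It takes vmax from the data so the color scale covers only the values that are actually present:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the distance matrix (assumption: the real data
# comes from the CSV file in the question).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((30, 30)))

# Row-normalize, as in the question.
new_matrix = df.div(df.sum(axis=1), axis=0)

# Largest value actually present; well below 1 after normalization.
data_max = new_matrix.max().max()

g = sns.clustermap(new_matrix,
                   row_cluster=False,
                   xticklabels=True,
                   yticklabels=True,
                   cmap='RdBu',
                   vmin=0,
                   vmax=data_max)  # or omit vmax and let seaborn infer it
```

The returned grid can then be written out with g.savefig(...) or shown with plt.show() in an interactive session.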

Related

How to plot Multiline Graphs Via Seaborn library in Python?

I have written code that looks like this:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
T = np.array([10.03,100.348,1023.385])
power1 = np.array([100000,86000,73000])
power2 = np.array([1008000,95000,1009000])
df1 = pd.DataFrame(data = {'Size': T, 'Encrypt_Time': power1, 'Decrypt_Time': power2})
exp1= sns.lineplot(data=df1)
plt.savefig('exp1.png')
exp1_smooth= sns.lmplot(x='Size', y='Time', data=df, ci=None, order=4, truncate=False)
plt.savefig('exp1_smooth.png')
That gives me Graph_1: [image]
Size appears as a constant line, but as you can see in my code it varies (10, 100, 1000).
How does this produce a constant line? I want a multi-line graph with Size (T) on the x-axis and Encrypt_Time and Decrypt_Time (power1 and power2) on the y-axis.
I also wanted to plot a smoothed version of the same graph, but it gives me an error. What needs to be done to get a smooth multi-line graph with the same axes?

I don't think that is the issue: the line for Size looks constant, but it is not.
The values of Size are in the range 10-1000, while the smallest division of the y-axis is 20,000 (20 times bigger), which makes Size look like a horizontal line on your graph.
You can try larger values to see the slope clearly.
If you want 'Size' as the x-axis, you can try the example below:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
T = np.array([10.03,100.348,1023.385])
power1 = np.array([100000,86000,73000])
power2 = np.array([1008000,95000,1009000])
df1 = pd.DataFrame(data = {'Size': T, 'Encrypt_Time': power1, 'Decrypt_Time': power2})
fig, ax = plt.subplots()
# Draw both series on the same axes, with Size on the x-axis
sns.lineplot(data=df1, x='Size', y='Encrypt_Time', label='Encrypt_Time', ax=ax)
sns.lineplot(data=df1, x='Size', y='Decrypt_Time', label='Decrypt_Time', ax=ax)
ax.set_ylabel('Time')
plt.show()

Add a mean and 3*std to a scatter plot using matplotlib

I've got a scatter plot and want to add straight lines for the mean, mean + 3*std and mean - 3*std. I seem to have the mean plotting but can't work out the std! Thanks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
for element in df_na.loc[:, 'Ag_ppb':'Zr_ppb']:
    temp_df = df_na.loc[:, ['Date', element]].dropna()
    fig = plt.figure()
    plt.scatter(temp_df['Date'], temp_df[element], c='black', s=10)
    plt.plot(temp_df['Date'], [df_na[element].mean()]*len(x))
    plt.xlabel('Date')
    plt.xticks(rotation=90, fontsize=5)
    plt.ylabel(element)
    plt.show()

You want to use dataframe.std():
df_na.std(axis=0,skipna=True)[element]

So I incorporated the above, which works; see below:
plt.plot(temp_df['Date'],[temp_df[element].mean(axis=0,skipna=True)]*len(x), c='red',label='Mean')
but the following won't plot mean + 3*std:
plt.plot(temp_df['Date'],[temp_df[element].mean()]+[temp_df[element].std(axis=0,skipna=True)*3]*len(x),label='3xstd')

The std call above worked, but adding the mean to 3*std doesn't plot as a line.
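
The likely cause is that `[mean] + [3*std]*n` concatenates two Python lists instead of adding numbers, so the result has n + 1 entries and no longer matches the x values. A minimal sketch of one way around it, with synthetic data standing in for df_na (an assumption) and ax.axhline used in place of the plt.plot calls above: add the scalars first, then draw each horizontal line.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for one element column of df_na (assumption).
rng = np.random.default_rng(0)
element = 'Ag_ppb'
temp_df = pd.DataFrame({
    'Date': pd.date_range('2021-01-01', periods=50, freq='D'),
    element: rng.normal(10, 2, 50),
})

# Compute the three levels as plain numbers first.
mean = temp_df[element].mean()
std = temp_df[element].std()
upper = mean + 3 * std
lower = mean - 3 * std

fig, ax = plt.subplots()
ax.scatter(temp_df['Date'], temp_df[element], c='black', s=10)
# axhline draws a horizontal line across the whole axes at a given y value.
ax.axhline(mean, c='red', label='Mean')
ax.axhline(upper, c='blue', linestyle='--', label='Mean + 3*std')
ax.axhline(lower, c='blue', linestyle='--', label='Mean - 3*std')
ax.set_xlabel('Date')
ax.set_ylabel(element)
ax.legend()
```

If plt.plot is preferred, the same values work as `[upper] * len(temp_df)`, since the addition then happens before the list is built.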

Ticklabels in matplotlib don't match the plot values [duplicate]

I have an existing plot that was created with pandas like this:
df['myvar'].plot(kind='bar')
The y axis is formatted as floats and I want to change it to percentages. All of the solutions I found use ax.xyz syntax, and I can only place code below the line above that creates the plot (I cannot add ax=ax to that line).
How can I format the y axis as percentages without changing the line above?
Here is the solution I found, but it requires that I redefine the plot:
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.ticker as mtick
data = [8,12,15,17,18,18.5]
perc = np.linspace(0,100,len(data))
fig = plt.figure(1, (7,4))
ax = fig.add_subplot(1,1,1)
ax.plot(perc, data)
fmt = '%.0f%%' # Format you want the ticks, e.g. '40%'
xticks = mtick.FormatStrFormatter(fmt)
ax.xaxis.set_major_formatter(xticks)
plt.show()
Link to the above solution: Pyplot: using percentage on x axis

This is a few months late, but I have created PR#6251 with matplotlib to add a new PercentFormatter class. With this class you just need one line to reformat your axis (two if you count the import of matplotlib.ticker):
import ...
import matplotlib.ticker as mtick
ax = df['myvar'].plot(kind='bar')
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
PercentFormatter() accepts three arguments, xmax, decimals, symbol. xmax allows you to set the value that corresponds to 100% on the axis. This is nice if you have data from 0.0 to 1.0 and you want to display it from 0% to 100%. Just do PercentFormatter(1.0).
The other two parameters allow you to set the number of digits after the decimal point and the symbol. They default to None and '%', respectively. decimals=None will automatically set the number of decimal points based on how much of the axes you are showing.
Update
PercentFormatter was introduced into Matplotlib proper in version 2.1.0.
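
As a small illustration of the xmax and decimals parameters described above, here is a sketch with made-up fractional data (the values in 'myvar' are assumptions). It treats 1.0 as 100% and shows one digit after the decimal point:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pandas as pd

# Made-up fractions between 0 and 1 (assumption).
df = pd.DataFrame({'myvar': [0.08, 0.12, 0.15, 0.17, 0.18, 0.185]})

ax = df['myvar'].plot(kind='bar')

# xmax=1.0 means a value of 1.0 is shown as 100%.
formatter = mtick.PercentFormatter(xmax=1.0, decimals=1)
ax.yaxis.set_major_formatter(formatter)

# format_pct converts one value to its label text.
example_label = formatter.format_pct(0.5, 1.0)
```

Because the formatter is attached to the axis rather than to a fixed list of labels, the tick labels stay in step with the data if the axis limits change.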

A pandas DataFrame plot returns the Axes for you, and you can then manipulate the axes however you want.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100,5))
# you get ax from here
ax = df.plot()
type(ax) # matplotlib.axes._subplots.AxesSubplot
# manipulate
vals = ax.get_yticks()
ax.set_yticklabels(['{:,.2%}'.format(x) for x in vals])

Jianxun's solution did the job for me but broke the y value indicator at the bottom left of the window.
I ended up using FuncFormatter instead (and also stripped the unnecessary trailing zeroes as suggested here):
import pandas as pd
import numpy as np
from matplotlib.ticker import FuncFormatter
df = pd.DataFrame(np.random.randn(100,5))
ax = df.plot()
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
Generally speaking I'd recommend using FuncFormatter for label formatting: it's reliable, and versatile.

For those who are looking for the quick one-liner:
plt.gca().set_yticklabels([f'{x:.0%}' for x in plt.gca().get_yticks()])
this assumes
import: from matplotlib import pyplot as plt
Python >=3.6 for f-String formatting. For older versions, replace f'{x:.0%}' with '{:.0%}'.format(x)

I'm late to the game, but I just realized this: ax can be replaced with plt.gca() for those who are not using axes and just subplots.
Echoing @Mad Physicist's answer, using the PercentFormatter class it would be:
import matplotlib.ticker as mtick
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
#if you already have ticks in the 0 to 1 range. Otherwise see their answer

I propose an alternative method using seaborn.
Working code:
import numpy as np
import pandas as pd
import seaborn as sns
data = np.random.rand(10, 2) * 100
df = pd.DataFrame(data, columns=['A', 'B'])
ax = sns.lineplot(data=df, markers=True)
ax.set(xlabel='xlabel', ylabel='ylabel', title='title')
# change the y tick labels to percentages
y_value = ['{:,.2f}'.format(x) + '%' for x in ax.get_yticks()]
ax.set_yticklabels(y_value)

You can do this in one line without importing anything:
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter('{}%'.format))
If you want integer percentages, you can do:
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter('{:.0f}%'.format))
You can use either ax.yaxis or plt.gca().yaxis. FuncFormatter is still part of matplotlib.ticker, but you can also do plt.FuncFormatter as a shortcut.
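
The one-liner works because FuncFormatter calls the supplied function with two positional arguments, the tick value and the tick position, and str.format ignores positional arguments that its template never references. A short sketch with made-up values (the plotted data and variable names are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# str.format ignores the unused second argument (the tick position).
plain = '{}%'.format(12.5, 0)
rounded = '{:.0f}%'.format(12.6, 0)

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [10, 20, 30])

# A bound str.format method serves as the (value, position) callback.
formatter = plt.FuncFormatter('{:.0f}%'.format)
ax.yaxis.set_major_formatter(formatter)

# Calling the formatter directly shows the label for a tick at y=20.
label = formatter(20, 0)
```

Note that this only appends a percent sign; values in the 0 to 1 range would still need scaling, for example with PercentFormatter(1.0).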

Based on @erwanp's answer, you can use the formatted string literals of Python 3,
x = '2'
percentage = f'{x}%' # 2%
inside the FuncFormatter() and combined with a lambda expression.
All wrapped:
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: f'{y}%'))

Another one line solution if the yticks are between 0 and 1:
plt.yticks(plt.yticks()[0], ['{:,.0%}'.format(x) for x in plt.yticks()[0]])

Add one line of code (this assumes from matplotlib import ticker and an existing Axes named ax):
ax.yaxis.set_major_formatter(ticker.PercentFormatter())

Clustermapping in Python using Seaborn

I am trying to create a heatmap with dendrograms in Python using Seaborn, and I have a CSV file with about 900 rows. I'm importing the file as a pandas DataFrame and attempting to plot it, but a large number of the rows are not being represented in the heatmap. What am I doing wrong?
This is the code I have right now. But the heatmap only represents about 49 rows.
Here is an image of the clustermap I've obtained but it is not displaying all of my data.
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
# Data set
df = pd.read_csv('diff_exp_gene.csv', index_col = 0)
# Default plot
sns.clustermap(df, cmap = 'RdBu', row_cluster=True, col_cluster=True)
plt.show()
Thank you.

An alternative approach would be to use imshow in matplotlib. I'm not exactly sure what your question is, but here is a way to plot points on a plane from a CSV file:
import csv
import numpy as np
import matplotlib.pyplot as plt
# `types` (the TYPE value to count) must be defined beforehand; the columns
# 'TYPE', 'X' and 'Y' are assumed to exist in the CSV file.
temp = np.zeros((128, 128), dtype=int)
with open('diff_exp_gene.csv', newline='') as infile:
    for row in csv.DictReader(infile):
        if row['TYPE'] == types:
            # Count one hit at this (Y, X) grid position
            temp[int(row['Y']), int(row['X'])] += 1
plt.imshow(temp, cmap='hot', origin='lower')
plt.show()

As far as I know, keywords that apply to seaborn heatmaps also apply to clustermap, because sns.clustermap passes them on to sns.heatmap. In that case, all you need to do in your example is set yticklabels=True as a keyword argument in sns.clustermap(). That will make the labels of all 900 rows appear.
By default it is set to "auto", which skips labels to avoid overlap. The same applies to xticklabels. See more here: https://seaborn.pydata.org/generated/seaborn.heatmap.html
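
A minimal sketch of that suggestion on synthetic data (60 made-up genes by 6 samples instead of the roughly 900 rows in the question; the figure size and label size are illustrative assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the expression table (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(60, 6)),
                  index=[f'gene_{i}' for i in range(60)],
                  columns=[f'sample_{j}' for j in range(6)])

g = sns.clustermap(df, cmap='RdBu',
                   row_cluster=True, col_cluster=True,
                   yticklabels=True,   # label every row instead of "auto"
                   figsize=(6, 14))    # tall figure so the labels have room

# Shrink the row labels so they do not overlap.
g.ax_heatmap.tick_params(axis='y', labelsize=6)

n_labels = len(g.ax_heatmap.get_yticklabels())
```

For around 900 rows the figure would need to be much taller, or the label size much smaller, for every label to stay readable.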

Multiple single plots in seaborn with pandas groupby data

My issue is very specific, I guess, but I can't seem to find a proper solution, and I'm clueless about the error output that I get.
Anyway, I have a pandas DataFrame loaded from an SQLite database.
data_frame = pd.read_sql_query(
    "SELECT (total_comb + total_comb_rc) as total_comb, p_val, w_length from {tn}".format(
        tn=table_name), conn)
With that loaded, I group the data by the 'w_length' value.
for i, group in data_frame.groupby('w_length'):
Now, I want to plot a scatter plot for each group with seaborn's lmplot.
for i, group in data_frame.groupby('w_length'):
    sns.lmplot(x=group['total_comb'], y=group['p_val'],
               data=group,
               fit_reg=False)
    sns.despine()
    plt.savefig('test_scatter'+i+'.png', dpi=400)
But for some reason I'm getting this output:
'[ 6.95485628e-02 3.53641178e-01 3.46862200e+06 4.11684800e+06] not in index'
and no plot file.
I know I'm doing something wrong, but I can't seem to figure it out.
P.S. I know I can do something like this:
sns.lmplot(x='total_comb', y='p_val',
data=data_frame,
fit_reg=False,
hue="w_length", x_jitter=.1, col="w_length", col_wrap=3, size=4)
but I also need separate plots for each 'w_length'.
Thanks!!

Supposing the problem is not due to the data collection from the sql database, it's probably due to the fact that you call
sns.lmplot(x=group['total_comb'], y=group['p_val'], data=group)
instead of
sns.lmplot(x='total_comb', y='p_val', data=group)
Here is a working example, which produces two separate plots:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np; np.random.seed(42)
x = np.arange(24)
y = np.random.randint(1,10, len(x))
cat = np.random.choice(["A", "B"], size=len(x))
df = pd.DataFrame({"x": x, "y": y, "cat": cat})
for i, group in df.groupby('cat'):
    sns.lmplot(x="x", y="y", data=group, fit_reg=False)
    plt.savefig(__file__+str(i)+".png")
plt.show()
