I'm using seaborn to make a violinplot, which uses hues to identify who survived and who didn't. This is given by the column 'DEATH_EVENT', where 0 means the person survived and 1 means they didn't. The only issue I'm having is that I can't figure out how to set labels for this hue legend. As seen below, 'DEATH_EVENT' presents 0 and 1, but I want to change this into 'Survived' and 'Not survived'.
Current code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
sns.set()
plt.style.use('seaborn')
data = pd.read_csv('heart_failure_clinical_records_dataset.csv')
g = sns.violinplot(data=data, x='smoking', y='age', hue='DEATH_EVENT')
g.set_xticklabels(['No smoking', 'Smoking'])
I tried g.legend(labels=['Survived', 'Not survived']), but the resulting legend loses the colors and shows a thin and a thick line instead.
I'm aware I could just use:
data['DEATH_EVENT'].replace({0:'Survived', 1:'Not survived'}, inplace=True)
but I wanted to see if there was another way. I'm still a rookie, so I'm guessing that there's a reason why the CSV's author made it so that it uses integers to describe plenty of things. Ex: if someone smokes or not, sex, diabetic or not, etc. Maybe it runs faster?
Controlling Seaborn legends is still somewhat tricky (some extensions to matplotlib's API would be helpful). In this case, you could grab the handles from the just-created legend and reuse them for a new legend:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.DataFrame({"smoking": np.random.randint(0, 2, 200),
                     "survived": np.random.randint(0, 2, 200),
                     "age": np.random.normal(60, 10, 200),
                     "DEATH_EVENT": np.random.randint(0, 2, 200)})
ax = sns.violinplot(data=data, x='smoking', y='age', hue='DEATH_EVENT')
ax.set_xticklabels(['No smoking', 'Smoking'])
# legend_handles was named legendHandles before matplotlib 3.7
ax.legend(handles=ax.legend_.legend_handles, labels=['Survived', 'Not survived'])
Here is an approach that makes the change via the dataframe without modifying the original dataframe. To remove the legend title without accessing ax.legend_ at all, a trick is to rename the column to a blank string (and use that blank string for hue). Unless the dataframe is very long (millions of rows), the speed and memory overhead are quite modest.
names = {0: 'Survived', 1: 'Not survived'}
ax = sns.violinplot(data=data.replace({'DEATH_EVENT': names}).rename(columns={'DEATH_EVENT': ''}),
                    x='smoking', y='age', hue='')
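Another option that leaves the dataframe untouched is to edit the text of the legend seaborn has already drawn, so the colored handles stay as they are. The following is a minimal sketch on synthetic data, not a definitive recipe: the Agg backend, the random data and the new_labels mapping are assumptions made for illustration, and it assumes the legend entries are the strings "0" and "1".

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the heart failure dataset
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "smoking": rng.integers(0, 2, 200),
    "age": rng.normal(60, 10, 200),
    "DEATH_EVENT": rng.integers(0, 2, 200),
})

ax = sns.violinplot(data=data, x="smoking", y="age", hue="DEATH_EVENT")

# Reuse the existing legend: keep its handles, change only the text
legend = ax.get_legend()
legend.set_title("")
new_labels = {"0": "Survived", "1": "Not survived"}  # hypothetical mapping from the drawn text
for text in legend.get_texts():
    text.set_text(new_labels.get(text.get_text(), text.get_text()))

legend_texts = [t.get_text() for t in legend.get_texts()]
```

Because only the text objects change, the colored patches in the legend are the ones seaborn created for the violins.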
I have the following dataset
ids count
1 2000210
2 -23123
3 100
4 500
5 102300120
...
1 million 123213
I want a graph where I have group of ids (all unique ids) in the x axis and count in y axis and a distribution chart that looks like the following
How can I achieve this with a pandas DataFrame in Python?
I tried different ways but I am only getting a basic plot and not as complex as the drawing.
What I tried
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
df.plot(x="range", y="count");
But the plots don't make any sense. I am also new to plotting in pandas. I searched the internet for a long time for charts like this and could really use some help with such graphs.
From what I understood from your question and comments here is what you can do:
1) Import the libraries and set the default theme:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
2) Create your dataframe:
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
3) Plot your data
3.1) Simple take using only the seaborn library:
sns.kdeplot(data=df, x="count", weights="range")
Output:
3.2) More complex take using seaborn and matplotlib libraries:
sns.histplot(x=df["count"], weights=df["range"], discrete=True,
             color='darkblue', edgecolor='black',
             kde=True, kde_kws={'cut': 2}, line_kws={'linewidth': 4})
plt.ylabel("range")
plt.show()
Output:
Personal note: please make sure to check all the solutions, if they
are not enough comment and we will work together in order to find you
a solution
For a distribution plot of ids you can use:
import numpy as np
import pandas as pd
np.random.seed(seed=123)
df = pd.DataFrame(np.random.randn(1000000), columns=["ids"])
df['ids'].plot(kind='kde')
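If the goal is the distribution of the count values rather than of the ids, one possibility is to bin the counts with numpy.histogram and draw the bins as bars. This is only a sketch under stated assumptions: the data is synthetic (normally distributed counts, one per id), the number of bins is arbitrary, and the Agg backend is chosen so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
# Synthetic stand-in for the real table: one row per id, one count per row
df = pd.DataFrame({
    "ids": np.arange(1, 100_001),
    "count": rng.normal(loc=0, scale=1000, size=100_000),
})

# Bin the count values; each bin holds the number of ids whose count falls in it
bin_counts, bin_edges = np.histogram(df["count"], bins=50)
centers = (bin_edges[:-1] + bin_edges[1:]) / 2

fig, ax = plt.subplots()
ax.bar(centers, bin_counts, width=np.diff(bin_edges))
ax.set_xlabel("count")
ax.set_ylabel("number of ids")
```

Computing the bins explicitly keeps the heavy part in numpy, which matters with around a million rows, and leaves the plot itself with only 50 bars to draw.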
'''
I did a clustermap with thousands of genes, using seaborn. Because, I'm interested in only few genes, I'd like to display those genes of interest on the ytick. I'm trying to figure it out using the iris dataset. Please find below my code. I'm not sure how to get the samples of interest at their right indexes. Thank you in advance for helpful assistance.
'''
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset('iris')
samples = ['sample_'+str(x) for x in list(iris.index)] #creating sample ID lining up with the internal index.
iris.insert(0,'Sample_ID',samples)
samples_of_interest = ['sample_41','sample_34','sample_114','sample_55'] #samples to be visible on ytick
sns.clustermap(iris.iloc[:,1:-1],yticklabels=samples_of_interest) #Not giving the expected result, as the samples of interest are not at their right index
plt.show()
plt.close()
Here's why your code wasn't working:
See this about the yticklabels argument in the documentation:
If list-like, plot these alternate labels as the xticklabels.
So when you pass only a few tick labels, seaborn simply uses those names as the tick labels, with no knowledge of which tick positions they belong to. One way around this is shown below: build sample_labels, which holds a label for every tick but sets the non-interesting ones to None. You can then follow this answer to rotate the ticks:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset('iris')
samples = ['sample_'+str(x) for x in list(iris.index)]
iris.insert(0,'Sample_ID',samples)
samples_of_interest = ['sample_41','sample_34','sample_114','sample_55']
sample_labels = [i if i in samples_of_interest else None
                 for i in iris['Sample_ID']]
cm=sns.clustermap(iris.iloc[:,1:-1], yticklabels=sample_labels)
plt.setp(cm.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
But this is still not ideal, because there are tick marks at all the positions. I'm sure there is a way to edit this, but instead...
Here's a method I like more:
Get the new order of the samples from the clustergrid (the object returned by clustermap), then manually set the y-tick labels and positions (with credit to this post):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset('iris')
samples_of_interest = [41, 34, 114, 55]
sample_names = ['Sample ' + str(i) for i in samples_of_interest]
cm=sns.clustermap(iris.iloc[:,:-1]) #note the loc has changed!
reorder = cm.dendrogram_row.reordered_ind
new_positions = [reorder.index(i) for i in samples_of_interest]
cm.ax_heatmap.yaxis.set_ticks(new_positions)
cm.ax_heatmap.yaxis.set_ticklabels(sample_names)
The set_ticks and set_ticklabels calls do the work on their own. Wrapping them in plt.setp(...) without any property arguments only prints a listing of settable properties, which does not affect the outcome.
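The same idea can be sketched on synthetic data, which avoids downloading iris. One detail differs here: seaborn's heatmap draws row k between y=k and y=k+1, so adding 0.5 puts each tick at the center of its row. The random dataframe, the rows_of_interest list and the Agg backend are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
# Synthetic stand-in for the expression matrix: 60 samples, 4 features
df = pd.DataFrame(rng.normal(size=(60, 4)), columns=list("abcd"))
rows_of_interest = [41, 34, 14, 55]  # hypothetical row indices to label

cg = sns.clustermap(df)

# Row order after clustering: order[k] is the original row drawn at heatmap row k
order = list(cg.dendrogram_row.reordered_ind)
positions = [order.index(i) + 0.5 for i in rows_of_interest]

# Only the rows of interest get a tick and a label
cg.ax_heatmap.set_yticks(positions)
cg.ax_heatmap.set_yticklabels([f"Sample {i}" for i in rows_of_interest], rotation=0)
```

Since ticks exist only at the chosen positions, there are no leftover tick marks for the unlabeled rows.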
The hue parameter skips one integer.
d = {'column1':[1,2,3,4,5], 'column2':[2,4,5,2,3], 'cluster':[0,1,2,3,4]}
df = pd.DataFrame(data=d)
sns.relplot(x='column2', y='column1', hue='cluster', data=df)
Python 2.7
Seaborn 0.9.0
Ubuntu 16.04 LTS
"Full" legend
If the hue is in numeric format, seaborn will assume that it represents some continuous quantity and will decide to display what it thinks is a representative sample along the color dimension.
You can circumvent this by using legend="full".
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame({'column1':[1,2,3,4,5], 'column2':[2,4,5,2,3], 'cluster':[0,1,2,3,4]})
sns.relplot(x='column2', y='column1', hue='cluster', data=df, legend="full")
plt.show()
Categoricals
An alternative is to make sure the values are treated as categorical.
Unfortunately, even if you pass the numbers as strings, they are converted back to numbers and fall under the same mechanism described above. This may be seen as a bug.
However, one option is to use real categories, e.g. single letters.
'cluster':list("ABCDE")
works fine,
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
d = {'column1':[1,2,3,4,5], 'column2':[2,4,5,2,3], 'cluster':list("ABCDE")}
df = pd.DataFrame(data=d)
sns.relplot(x='column2', y='column1', hue='cluster', data=df)
plt.show()
Strings with customized palette
An alternative to the above is to use numbers converted to strings, and then make sure to use a custom palette with as many colors as there are unique hues.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
d = {'column1':[1,2,3,4,5], 'column2':[2,4,5,2,3], 'cluster':[1,2,3,4,5]}
df = pd.DataFrame(data=d)
df["cluster"] = df["cluster"].astype(str)
sns.relplot(x='column2', y='column1', hue='cluster', data=df,
            palette=["b", "g", "r", "indigo", "k"])
plt.show()
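A further variant is to convert the hue column to the pandas category dtype. This sketch assumes a recent seaborn release in which a categorical dtype is mapped as discrete levels; that behavior is not confirmed for the 0.9.0 version mentioned in the question. It uses scatterplot, the axes-level counterpart of relplot, because its legend entries are easy to read back from the axes. The Agg backend is also an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "column1": [1, 2, 3, 4, 5],
    "column2": [2, 4, 5, 2, 3],
    "cluster": [0, 1, 2, 3, 4],
})

# Mark the cluster ids as categories so they are not read as a continuous quantity
df["cluster"] = df["cluster"].astype("category")

ax = sns.scatterplot(data=df, x="column2", y="column1", hue="cluster")
handles, labels = ax.get_legend_handles_labels()
```

The numeric values stay in the dataframe unchanged; only the dtype tells seaborn how to interpret them.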
I am trying to create a heatmap with dendrograms on Python using Seaborn and I have a csv file with about 900 rows. I'm importing the file as a pandas dataframe and attempting to plot that but a large number of the rows are not being represented in the heatmap. What am I doing wrong?
This is the code I have right now. But the heatmap only represents about 49 rows.
Here is an image of the clustermap I've obtained but it is not displaying all of my data.
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
# Data set
df = pd.read_csv('diff_exp_gene.csv', index_col = 0)
# Default plot
sns.clustermap(df, cmap = 'RdBu', row_cluster=True, col_cluster=True)
plt.show()
Thank you.
An alternative approach would be to use imshow in matplotlib. I'm not exactly sure what your question is, but here is a way to plot points on a plane from a CSV file:
import csv
import numpy as np
import matplotlib.pyplot as plt
wanted_type = 'A'  # hypothetical value; the original snippet left this name undefined
temp = np.zeros((128, 128), dtype=int)
with open('diff_exp_gene.csv', newline='') as infile:
    # assumes the CSV has 'TYPE', 'X' and 'Y' columns, with X and Y integers below 128
    for row in csv.DictReader(infile):
        if row['TYPE'] == wanted_type:
            # count how many points fall on each (Y, X) cell
            temp[int(row['Y']), int(row['X'])] += 1
plt.imshow(temp, cmap='hot', origin='lower')
plt.show()
As far as I know, keywords that apply to seaborn heatmaps also apply to clustermap, since sns.clustermap passes them on to sns.heatmap. In that case, all you need to do in your example is set yticklabels=True as a keyword argument in sns.clustermap(). That will make the labels of all 900 rows appear.
By default, it is set as "auto" to avoid overlap. The same applies to the xticklabels. See more here: https://seaborn.pydata.org/generated/seaborn.heatmap.html
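A minimal sketch of that suggestion on synthetic data, since the original CSV is not available: the 900-row random dataframe, the gene and sample names, the tall figsize and the Agg backend are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Synthetic stand-in for diff_exp_gene.csv: 900 genes, 5 samples
df = pd.DataFrame(
    rng.normal(size=(900, 5)),
    index=[f"gene_{i}" for i in range(900)],
    columns=[f"s{j}" for j in range(5)],
)

# yticklabels=True asks for a label on every row instead of the "auto" subset;
# the tall figure is an assumption to give 900 labels some room
cg = sns.clustermap(df, cmap="RdBu", yticklabels=True, figsize=(6, 60))
n_ticks = len(cg.ax_heatmap.get_yticks())
```

All rows are part of the heatmap in either case; the setting only changes how many of them get a label.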
I know that seaborn.countplot has the attribute order which can be set to determine the order of the categories. But what I would like to do is have the categories be in order of descending count. I know that I can accomplish this by computing the count manually (using a groupby operation on the original dataframe, etc.) but I am wondering if this functionality exists with seaborn.countplot. Surprisingly, I cannot find an answer to this question anywhere.
This functionality is not built into seaborn.countplot as far as I know - the order parameter only accepts a list of strings for the categories, and leaves the ordering logic to the user.
This is not hard to do with value_counts() provided you have a DataFrame though. For example,
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='darkgrid')
titanic = sns.load_dataset('titanic')
sns.countplot(x='class',
              data=titanic,
              order=titanic['class'].value_counts().index)
plt.show()
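The same pattern on a small hand-made dataframe makes the resulting order easy to check. value_counts() sorts by count in descending order by default, so its index can be passed as order directly. The toy data and the Agg backend are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import pandas as pd
import seaborn as sns

# Toy data: 5 x Third, 3 x First, 2 x Second
df = pd.DataFrame({"class": ["Third"] * 5 + ["First"] * 3 + ["Second"] * 2})

# Categories sorted from most to least frequent
order = df["class"].value_counts().index.tolist()

ax = sns.countplot(data=df, x="class", order=order)
```

For ascending bars, value_counts(ascending=True) can be used in the same way.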
Most often, a seaborn countplot is not really necessary. Just plot with pandas bar plot:
import seaborn as sns; sns.set(style='darkgrid')
import matplotlib.pyplot as plt
df = sns.load_dataset('titanic')
df['class'].value_counts().plot(kind="bar")
plt.show()