Clustermapping in Python using Seaborn - python

I am trying to create a heatmap with dendrograms on Python using Seaborn and I have a csv file with about 900 rows. I'm importing the file as a pandas dataframe and attempting to plot that but a large number of the rows are not being represented in the heatmap. What am I doing wrong?
This is the code I have right now. But the heatmap only represents about 49 rows.
Here is an image of the clustermap I've obtained but it is not displaying all of my data.
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
# Data set
df = pd.read_csv('diff_exp_gene.csv', index_col = 0)
# Default plot
sns.clustermap(df, cmap = 'RdBu', row_cluster=True, col_cluster=True)
plt.show()
Thank you.

An alternative approach would be to use imshow in matplotlib. I'm not exactly sure what your question is, but here is a way to plot points on a plane from a csv file. It assumes the file has TYPE, X and Y columns, with X and Y in the range 0-127.
import csv
import numpy as np
import matplotlib.pyplot as plt
types = 'some_type'  # placeholder: the TYPE value to count (not given in the original answer)
temp = np.zeros((128, 128), dtype=int)
with open('diff_exp_gene.csv', newline='') as infile:
    for row in csv.DictReader(infile):
        if row['TYPE'] == types:
            # count the points that fall on each (Y, X) cell
            temp[int(row['Y']), int(row['X'])] += 1
plt.imshow(temp, cmap='hot', origin='lower')
plt.show()

As far as I know, keywords that apply to seaborn heatmaps also apply to clustermap, because sns.clustermap passes its extra keyword arguments on to sns.heatmap. Most likely all of your rows are already drawn in the heatmap and only about 49 of them get a tick label. In that case, all you need to do in your example is pass yticklabels=True as a keyword argument to sns.clustermap(). That will label every one of the 900 rows.
By default, it is set to "auto", which thins out the labels to avoid overlap. The same applies to xticklabels. See more here: https://seaborn.pydata.org/generated/seaborn.heatmap.html
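A minimal sketch of this, using random data in place of the csv. The 60x4 shape, the gene_ row names and the figsize value are assumptions made for illustration only. With many rows the figure also has to be tall enough for the labels to stay legible.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in for the csv: 60 rows with made-up names (assumption, not the asker's data)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(60, 4)),
                  index=[f"gene_{i}" for i in range(60)],
                  columns=list("ABCD"))

# yticklabels=True asks for a label on every row instead of the thinned-out "auto" set
g = sns.clustermap(df, cmap="RdBu", yticklabels=True, figsize=(6, 14))

# Collect the label texts to check that every row got one
labels = [t.get_text() for t in g.ax_heatmap.get_yticklabels()]
print(len(labels))
# Use g.savefig(...) or plt.show() with an interactive backend to view the figure
```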

Related

How can one create histograms with subplots according to grouped variables in seaborn?

I am attempting to create a histogram using seaborn and census data that displays 3 subplots for age composition, and I have the data grouped the way that I would like it, but I am struggling to turn that into a histogram.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
filename = "/scratch/%s_class_root/%s_class/materials/data/pums_short.csv.gz"
acs = pd.read_csv(filename)
R65_agg = acs.groupby(["R65", "PUMA"])["HINCP"]
R65_meds = R65_agg.agg(np.median).unstack()
R65_f = R65_meds.dropna()
R65_f = R65_meds.reset_index(drop = True)
I was expecting this code to give me data that I could plug into a histogram, but instead of being distinct subplots, the "0.0, 1.0, 2.0" groups in the final variable just get added together when I apply the .describe() function. Any advice for how I can convert this into a form that's readable with the sns.histplot() function?

How to make a bar chart with multiple series and count

I want to have x-axis = 'brand', y-axis = 'count', and 2 series for 'online_order' (True & False)
How can I do this on Python (using Jupyter?)
Right now, my Y axis comes on a scale of 0-1. I want to ensure that the Y axis is automated based on the values
This is the result I am getting :
Since the plot code is not included, this is just a guess, but the plot was probably made with something like the following:
df.groupby(['brand', 'online_order'])['count'].size().unstack().plot.bar(legend=True)
The issue is that .size() does not return the values in 'count'. It is GroupBy.size, which computes the number of rows in each group, and here every group contains exactly 1 row, so every bar has a height of 1.
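A small sketch of the difference, assuming the data is already aggregated to one row per brand and online_order value, as in this made-up test frame:

```python
import pandas as pd

# Test data: one row per (brand, online_order) pair (an assumption about the asker's data)
df = pd.DataFrame({'brand': ['Solex', 'Solex', 'Giant Bicycles', 'Giant Bicycles'],
                   'online_order': [False, True, True, False],
                   'count': [2122, 2047, 1640, 1604]})

grouped = df.groupby(['brand', 'online_order'])['count']

sizes = grouped.size().unstack()   # number of rows per group: 1 everywhere
totals = grouped.sum().unstack()   # the values actually stored in 'count'

print(sizes)
print(totals)
```

Plotting totals instead of sizes gives bars whose heights follow the 'count' values.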
Using seaborn
The easiest way to get the desired plot is using seaborn, which is a high-level API for matplotlib.
Use seaborn.barplot with hue='online_order'.
The dataframe does not need to be reshaped.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# test data
df = pd.DataFrame({'brand': ['Solex', 'Solex', 'Giant Bicycles', 'Giant Bicycles'], 'online_order': [False, True, True, False], 'count': [2122, 2047, 1640, 1604]})
# plot
plt.figure(figsize=(7, 5))
sns.barplot(x='brand', y='count', hue='online_order', data=df)
Using pandas.DataFrame.pivot
.pivot changes the shape of the dataframe to accommodate the plot API
This option also uses pandas.DataFrame.plot.bar
# keyword arguments are required in pandas 2.0 and later
df.pivot(index='brand', columns='online_order', values='count').plot.bar()
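For reference, a sketch of what the pivoted frame looks like with the test data above. DataFrame.plot.bar draws one group of bars per index row and one series per column, which is why this shape suits it.

```python
import pandas as pd

df = pd.DataFrame({'brand': ['Solex', 'Solex', 'Giant Bicycles', 'Giant Bicycles'],
                   'online_order': [False, True, True, False],
                   'count': [2122, 2047, 1640, 1604]})

# One row per brand, one column per online_order value
wide = df.pivot(index='brand', columns='online_order', values='count')
print(wide)
```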
If the data is in a csv file, you can use pandas and matplotlib to load it and draw the graph.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("file name here")
# use data['count'], because data.count is the DataFrame.count method, not the column
plt.bar(data['brand'], data['count'])
plt.xlabel("brand")
plt.ylabel("count")
plt.show()

Seaborn how to show specific samples of interest on ytick

I did a clustermap with thousands of genes, using seaborn. Because I'm interested in only a few genes, I'd like to display only those genes of interest on the y-tick labels. I'm trying to figure it out using the iris dataset. Please find my code below. I'm not sure how to get the samples of interest at their right indexes. Thank you in advance for your help.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset('iris')
samples = ['sample_'+str(x) for x in list(iris.index)] # create sample IDs that line up with the internal index
iris.insert(0,'Sample_ID',samples)
samples_of_interest = ['sample_41','sample_34','sample_114','sample_55'] # samples to be visible on ytick
sns.clustermap(iris.iloc[:,1:-1],yticklabels=samples_of_interest) # not giving the expected result: the samples of interest are not at their right index
plt.show()
plt.close()
Here's why your code wasn't working:
See this about the yticklabels argument in the documentation:
If list-like, plot these alternate labels as the xticklabels.
So basically, when you pass only a few tick labels, seaborn just assigns those names to ticks without any knowledge of which rows they belong to. One way around this is to build sample_labels, which has a label for every row but sets the non-interesting ones to None. You can then follow this answer to rotate the tick labels:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset('iris')
samples = ['sample_'+str(x) for x in list(iris.index)]
iris.insert(0,'Sample_ID',samples)
samples_of_interest = ['sample_41','sample_34','sample_114','sample_55']
sample_labels = [i if i in samples_of_interest else None
                 for i in iris['Sample_ID']]
cm=sns.clustermap(iris.iloc[:,1:-1], yticklabels=sample_labels)
plt.setp(cm.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
But this is still not ideal, because there are still ticks at all the positions. I'm sure there is a way to edit this, but instead...
Here's a method I like more:
Get the new order of the samples from the clustergrid (the object returned by clustermap), then manually set the y-tick labels and positions (with credit to this post):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset('iris')
samples_of_interest = [41, 34, 114, 55]
sample_names = ['Sample ' + str(i) for i in samples_of_interest]
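cm = sns.clustermap(iris.iloc[:, :-1])  # note the iloc slice has changed!
reorder = list(cm.dendrogram_row.reordered_ind)  # original row index shown at each heatmap row
# heatmap cells are centred at k + 0.5, so shift the ticks to the middle of each row
new_positions = [reorder.index(i) + 0.5 for i in samples_of_interest]
cm.ax_heatmap.yaxis.set_ticks(new_positions)
cm.ax_heatmap.yaxis.set_ticklabels(sample_names)
plt.show()
Wrapping the set_ticks and set_ticklabels calls in plt.setp(...) is not needed. With no properties to set, plt.setp only prints the settable properties of the objects those calls return, which does not affect the outcome.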
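To see what the reordered_ind lookup does, here is a sketch in plain Python. The leaf order below is hypothetical; in practice it comes from cm.dendrogram_row.reordered_ind.

```python
# Hypothetical leaf order for a 6-row matrix:
# heatmap row 0 shows original row 4, heatmap row 1 shows original row 1, and so on
reorder = [4, 1, 5, 0, 3, 2]
samples_of_interest = [0, 5]

# reorder.index(i) answers "at which heatmap row does original row i end up?"
row_positions = [reorder.index(i) for i in samples_of_interest]

# seaborn heatmap cells span [k, k + 1], so the centre of row k is at k + 0.5
tick_positions = [k + 0.5 for k in row_positions]

print(row_positions)   # [3, 2]
print(tick_positions)  # [3.5, 2.5]
```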

Box and whisker plot on multiple columns

I am trying to make a Box and Whisker plot on my dataset that looks something like this -
& the chart I'm trying to make
My current lines of code are below -
import seaborn as sns
import matplotlib.pyplot as plt
d = df3.boxplot(column = ['Northern California','New York','Kansas','Texas'], by = 'Banner')
d
Thank you
I've recreated a dummy version of your dataset:
import numpy as np
import pandas as pd
dictionary = {'Banner': ['Type1']*10 + ['Type2']*10,
              'Northern_California': np.random.rand(20),
              'Texas': np.random.rand(20)}
df = pd.DataFrame(dictionary)
What you need is to melt (unpivot) your dataframe in order to have the geographical zone stored in a column and not as a column name. You can use the pandas.melt method and list all the columns you want in your boxplot in the value_vars argument.
With my dummy dataset you can do this:
df = pd.melt(df, id_vars=['Banner'], value_vars=['Northern_California', 'Texas'],
             var_name='zone', value_name='amount')
Now you can apply a boxplot using the hue argument:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(9,9)) #for a bigger image
sns.boxplot(x="Banner", y="amount", hue="zone", data=df, palette="Set1")
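To make the reshaping step concrete, here is a sketch of what melt produces on a tiny frame. The two-row data and its values are made up for illustration.

```python
import pandas as pd

# Wide format: one column per zone (values are made up)
wide = pd.DataFrame({'Banner': ['Type1', 'Type2'],
                     'Kansas': [1.0, 2.0],
                     'Texas': [3.0, 4.0]})

# Long format: the zone name moves into the 'zone' column, its value into 'amount'
long = pd.melt(wide, id_vars=['Banner'], value_vars=['Kansas', 'Texas'],
               var_name='zone', value_name='amount')
print(long)
```

Each original row now appears once per zone, which is the layout the x, y and hue arguments of sns.boxplot expect.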

Plot stacked bar chart from pandas data frame

I have dataframe:
payout_df.head(10)
What would be the easiest, smartest and fastest way to replicate the following excel plot?
I've tried different approaches, but couldn't get everything into place.
Thanks
If you just want a stacked bar chart, then one way is to use a loop to plot each column in the dataframe and just keep track of the cumulative sum, which you then pass as the bottom argument of pyplot.bar
import pandas as pd
import matplotlib.pyplot as plt
# If it's not already a datetime
payout_df['payout'] = pd.to_datetime(payout_df.payout)
cumval = 0
fig = plt.figure(figsize=(12,8))
for col in payout_df.columns[~payout_df.columns.isin(['payout'])]:
    plt.bar(payout_df.payout, payout_df[col], bottom=cumval, label=col)
    cumval = cumval + payout_df[col]
_ = plt.xticks(rotation=30)
_ = plt.legend(fontsize=18)
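A self-contained sketch of the same loop on a made-up frame, since the question's data is not shown. The column names a and b, the values, and the string dates are all assumptions. It also records the bottom passed for each series so the running sum is visible.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for payout_df: a label column plus two value columns
payout_df = pd.DataFrame({'payout': ['2020-01', '2020-02', '2020-03'],
                          'a': [1, 2, 3],
                          'b': [4, 5, 6]})

fig, ax = plt.subplots(figsize=(6, 4))
cumval = pd.Series(0, index=payout_df.index)  # running height of each stack
bottoms = {}
for col in payout_df.columns.drop('payout'):
    bottoms[col] = cumval.tolist()  # where this series starts on each bar
    ax.bar(payout_df['payout'], payout_df[col], bottom=cumval, label=col)
    cumval = cumval + payout_df[col]
ax.legend()

print(bottoms)          # {'a': [0, 0, 0], 'b': [1, 2, 3]}
print(cumval.tolist())  # [5, 7, 9]
```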
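I have no data to test with, but I think the following code will produce the desired graph:
import pandas as pd
import matplotlib.pyplot as plt
# payout must be a datetime column, otherwise pd.Grouper won't work
payout_df['payout'] = pd.to_datetime(payout_df['payout'])
# monthly totals; recent pandas versions spell this frequency 'ME'
grouped = payout_df.groupby(pd.Grouper(key='payout', freq='M')).sum()
# label the bars by year; plot's x argument expects a column label, so set the index instead
grouped.index = grouped.index.year
grouped.plot(kind='bar', stacked=True)
plt.show()
I don't know how to reproduce the fancy x-axis style. Also, your payout column must be a datetime, otherwise pd.Grouper won't work (see the pandas documentation for the available frequencies).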
