Seaborn countplot not displaying correct frequencies - python

I am trying to create a simple count plot using Seaborn of the frequencies of 3 different categories. The plot I get shows completely incorrect counts.
I am using the count plot method:
sns.countplot(data = r.reset_index(), x = 'cat')
gives me a plot with incorrect counts.
This is the DataFrame I'm trying to plot:
        count
high       38
low        64
medium     30
I want the graph to display the correct counts for each category: high, medium, and low.

A countplot is going to count each occurrence of your x variable -- in this case, one observation per level.
From the API page for countplot:
Show the counts of observations in each categorical bin using bars.
A count plot can be thought of as a histogram across a categorical,
instead of quantitative, variable. The basic API and options are
identical to those for barplot(), so you can compare counts across
nested variables.
You want a simple barplot:
sns.barplot(data=df.reset_index(), x='index', y='count')
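To see why the count plot came out wrong, here is a minimal sketch. It assumes the pre-aggregated table from the question is held in a DataFrame called df with an unnamed index, so that reset_index() produces a column named 'index'. The helper plot_counts is a hypothetical name used only for illustration.

```python
import pandas as pd

# Pre-aggregated table from the question: one row per category.
df = pd.DataFrame({"count": [38, 64, 30]}, index=["high", "low", "medium"])
tidy = df.reset_index()  # the unnamed index becomes a column called "index"

# countplot tallies rows per category, and here each category has exactly one row.
rows_per_category = tidy["index"].value_counts()

def plot_counts(table):
    """Draw the stored counts as bar heights (hypothetical helper)."""
    import seaborn as sns  # imported lazily so the data steps above run without seaborn
    return sns.barplot(data=table, x="index", y="count")
```

Calling plot_counts(tidy) should draw bars of height 38, 64, and 30, whereas a count plot of the same table draws three bars of height 1.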

Related

How can I plot the sum of values (rather than the count) using seaborn violinplot?

I have a data set with various quantities which I would like to make a violin plot of using seaborn. However, instead of visualizing the COUNT of occurrences of all of the quantities I'd like to display the SUM of the quantities.
For example, if I have a data set like this...
df = pd.DataFrame({'quantity':[1,1,1,4]})
sns.violinplot(data=df)
Actual Result
I really want it to display something like this...
df = pd.DataFrame({'quantity':[1,1,1,4,4,4,4]})
sns.violinplot(data=df)
Expected Result
My data set is around 500k values ranging between 1 and 10k, so it's not possible to transform the data as above (i.e. [1,1,2,3] -> [1,1,2,2,3,3,3]).
I know I could do something like this below to get the values I want and then plot with a bar plot or something...
df.quantity.value_counts() * df.quantity.value_counts().index
However, I really like seaborn's violin plot and the ability to pair it with catplot, so if anyone knows a way to do this in seaborn I'd be very grateful.
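For reference, the per-value totals mentioned above can be computed as in this minimal sketch, which uses the small example frame from the question. It only produces the numbers (the sum contributed by each distinct quantity); it is not a way to make violinplot draw them.

```python
import pandas as pd

df = pd.DataFrame({"quantity": [1, 1, 1, 4]})

# Sum contributed by each distinct quantity: value * number of occurrences.
sum_per_value = df.groupby("quantity")["quantity"].sum()
```

Here the value 1 contributes 3 and the value 4 contributes 4, which matches the value_counts-based expression.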

Using Seaborn Catplot scatterplot creates a numerically unordered y-axis

Using this dataset, I tried to make a categorical scatterplot with two mutation types (WES + RNA-Seq & WES) shown on the x-axis and a small set of evenly spaced, numerically ordered tick values on the y-axis. Although I was able to get the x-axis the way I intended, the y-axis instead used every single value in the Mutation Count column as its own tick. In addition, the axis follows the descending order of the dataset, meaning it isn't numerically ordered either. How can I go about fixing these aspects of the graph?
The code and its output are shown below:
import seaborn as sns
g = sns.catplot(x="Sequencing Type", y="Mutation Count", hue="Sequencing Type", data=tets, height=16, aspect=0.8)
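One possible cause of this symptom, offered here only as an assumption because the dataset is not shown, is that the Mutation Count column was read as text; seaborn then treats every distinct string as its own category. The sketch below uses a small made-up frame (the values are hypothetical) and converts the column to numbers before plotting. The helper plot_mutations is also a hypothetical name.

```python
import pandas as pd

# Hypothetical stand-in for the question's `tets` frame; values are invented.
tets = pd.DataFrame({
    "Sequencing Type": ["WES", "WES + RNA-Seq", "WES", "WES + RNA-Seq"],
    "Mutation Count": ["120", "35", "7", "NA"],  # stored as text
})

# Convert text to numbers; entries that cannot be parsed become NaN.
tets["Mutation Count"] = pd.to_numeric(tets["Mutation Count"], errors="coerce")

def plot_mutations(frame):
    """Categorical scatterplot with a numeric y-axis (hypothetical helper)."""
    import seaborn as sns  # imported lazily so the conversion runs without seaborn
    return sns.catplot(x="Sequencing Type", y="Mutation Count",
                       hue="Sequencing Type", data=frame)
```

If the column is already numeric in the real data, this conversion changes nothing and the cause lies elsewhere.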

Center nested boxplots in Python/Seaborn with unequal classes

I have grouped data from which I want to generate boxplots using seaborn. However, not every group has all classes. As a result, the boxplots are not centered if classes are missing within one group:
Figure
The graph is generated using the following code:
sns.boxplot(x="label2", y="value", hue="variable",palette="Blues")
Is there any way to force seaborn to center these boxes? I didn't find any appropriate way.
Thank you in advance.
Yes, there is, but you are not going to like it.
Centering these will mean that you will have the same y value for median values, so normalize your data so that the median is 0.5 for each y value for each value of x. That will give you the plot you want, but you should note that somewhere in the plot so people will not be confused.

How do I plot my histogram for density rather than count? (Matplotlib)

I have a data frame called 'train' with a column 'string' and a column 'string length' and a column 'rank' which has ranking ranging from 0-4.
I want to create a histogram of the string length for each ranking and plot all of the histograms on one graph to compare. I am experiencing two issues with this:
The only way I can manage to do this is by creating separate datasets e.g. with the following type of code:
S0 = train.loc[train['rank'] == 0]
S1 = train.loc[train['rank'] == 1]
Then I create individual histograms for each dataset using:
plt.hist(train['string length'], bins = 100)
plt.show()
This code doesn't plot the density but instead plots the counts. How do I alter my code such that it plots density instead?
Is there also a way to do this without having to create separate datasets? I was told that my method is 'unpythonic'.
You could do something like:
df.loc[:, df.columns != 'string'].groupby('rank').hist(density=True, bins =10, figsize=(5,5))
Basically, it selects all columns except string, groups them by rank, and makes a histogram of each of them using the given arguments.
Setting density=True normalizes each histogram so that its total area is 1.
Hope this has helped.
EDIT:
If there are more variables and you want the histograms overlapped, try:
df.groupby('rank')['string length'].hist(density=True, histtype='step', bins =10,figsize=(5,5))
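As a rough check of what density=True does, the sketch below uses numpy.histogram, which normalizes in the same way, on a small made-up frame (the column values are invented for illustration and stand in for train). For each rank, the bar heights multiplied by the bin widths should sum to 1.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for `train`.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "rank": rng.integers(0, 5, size=200),
    "string length": rng.integers(1, 100, size=200),
})

areas = {}
for rank, lengths in train.groupby("rank")["string length"]:
    heights, edges = np.histogram(lengths, bins=10, density=True)
    areas[rank] = float(np.sum(heights * np.diff(edges)))  # total area under the bars
```

Because every group is normalized on its own, ranks with very different numbers of rows can be compared on the same axes.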

I want to create a pie chart using a dataframe column in python

I want to create a pie chart using a single column of my dataframe; say my column name is 'Score'. I have stored scores in this column as below:
Score
.92
.81
.21
.46
.72
.11
.89
Now I want to create a pie chart showing the share of each range as a percentage.
Say 0-0.4 is 30%, 0.4-0.7 is 35%, and 0.7+ is 35%.
I am using the code below:
df1['bins'] = pd.cut(df1['Score'],bins=[0,0.5,1], labels=["0-50%","50-100%"])
df1 = df.groupby(['Score', 'bins']).size().unstack(fill_value=0)
df1.plot.pie(subplots=True,figsize=(8, 3))
With the above code I am getting the pie chart, but I don't know how I can do this using percentages.
My pie chart looks like this for now:
Cutting the dataframe up into bins is the right first step. After that, you can use value_counts with normalize=True to get the relative frequencies of the values in the bins column. This lets you see the percentage of data that falls in each of the ranges defined by the bins.
In terms of plotting the pie chart, I'm not sure if I understood correctly, but it seemed like you would like to display the correct legend values and the percentage values in each slice of the pie.
pandas.DataFrame.plot is a good place to see all parameters that can be passed into the plot method. You can specify what are your x and y columns to use, and by default, the dataframe index is used as the legend in the pie plot.
To show the percentage values per slice, you can use the autopct parameter as well. As mentioned in this answer, you can use all the normal matplotlib plt.pie() flags in the plot method as well.
Bringing everything together, this is the resultant code and the resultant chart:
import pandas as pd

df = pd.DataFrame({'Score': [0.92,0.81,0.21,0.46,0.72,0.11,0.89]})
df['bins'] = pd.cut(df['Score'], bins=[0,0.4,0.7,1], labels=['0-0.4','0.4-0.7','0.7-1'], right=True)
# Percentage of rows per bin; plotting the Series avoids relying on its version-dependent name
bin_percent = (df['bins'].value_counts(normalize=True) * 100).rename('percent')
plot = bin_percent.plot.pie(figsize=(5, 5), autopct='%1.1f%%')
Plot of Pie Chart
