Add values to bars in barplot with different values - python

I am creating a bar plot using groupby that shows the success rate of each individual for calendar year 2012. This works well. The x axis shows the S_L values and the y axis shows the success rate (%). My dataset has a success column (1 or 0).
ax=df[df['CY']==2012].groupby('S_L').success.mean().sort_values(ascending=False).plot(kind='bar',stacked=False)
Instead of showing the value of each bar, I want to show the numbers behind the mean, i.e. the total count for each group and the number of rows where success (a flag) equals 1, which is the numerator. For example, if a bar shows 90%, calculated as 9 successes (numerator) out of 10 rows (overall count for the given S_L group), I want to show n=9 and n=10 for that bar.
I looked at posts such as "Add labels to barplots", and that approach works when I display the bar values themselves.
However, I don't know how to add the values behind the calculation. Because I am also sorting the values in descending order, I don't know how to keep the labels matched to the right bars. Please help.
My code:
import pandas as pd
from os import path
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
fname=path.expanduser(r'Test file.xlsx')
df=pd.read_excel(io=fname,sheet_name='Sheet1')
ax=df.groupby('S_L').success.mean().sort_values(ascending=False).plot(kind='bar',stacked=False)
vals = ax.get_yticks()
ax.set_ylabel('Success Rate')
ax.set_yticklabels(['{:,.2%}'.format(x) for x in vals])
Below is the dataset image
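One possible approach, sketched under assumptions: compute the rate, the numerator and the group size in a single aggregation, sort that table once, and label each bar from the same table so the sort order cannot drift. The sample data, the `stats` name and the label wording are made up for illustration; only the `CY`, `S_L` and `success` column names come from the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the Excel sheet: one row per attempt.
df = pd.DataFrame({
    "CY": [2012] * 10,
    "S_L": ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "success": [1, 1, 1, 0, 1, 0, 0, 1, 1, 1],
})

# Rate, numerator and denominator in one table, sorted once by rate.
stats = (
    df[df["CY"] == 2012]
    .groupby("S_L")["success"]
    .agg(rate="mean", successes="sum", total="count")
    .sort_values("rate", ascending=False)
)

ax = stats["rate"].plot(kind="bar")
ax.set_ylabel("Success Rate")

# Bars are drawn in the row order of `stats`, so patches and rows pair up.
labels = []
for patch, (_, row) in zip(ax.patches, stats.iterrows()):
    label = f"n={int(row['successes'])} of {int(row['total'])}"
    labels.append(label)
    x_center = patch.get_x() + patch.get_width() / 2
    ax.annotate(label, (x_center, patch.get_height()), ha="center", va="bottom")
```

Because the labels are read from the already sorted table, changing the sort key or direction does not require touching the annotation loop.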

Related

No Output: Bar Graph Using Matplotlib

I have a DataFrame of Airbnb data where each row represents an Airbnb listing. I am trying to plot two columns as a bar plot using Matplotlib.
fig,ax= plt.subplots()
ax.bar(airbnb['neighbourhood_group'],airbnb['revenue'])
plt.show()
What I expect is that this plots every neighbourhood group on the x axis and the average revenue per neighbourhood group on the y axis (I assume a bar graph takes the mean value per category by default).
This line of code keeps running without giving me any error, as if it had entered an infinite while loop.
Can someone please suggest what could be wrong?
In the following I have used a sample DataFrame, since none was provided.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Create sample DataFrame
y = np.random.rand(10,2)
y[:,0]= np.arange(10)
df = pd.DataFrame(y, columns=["neighbourhood_group", "revenue"])
Note that np.random produces different values for the revenue column each time you start the program.
df:
# bar plot
ax = df.plot(x="neighbourhood_group", y="revenue", kind="bar")
Regarding your statement that your code seems to run in an endless loop: it could be that the DataFrame holds so much data that drawing the bar chart takes a very long time. To say that for sure, however, you would have to provide us with a dataset.
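One thing worth checking: Matplotlib's `ax.bar` does not aggregate, it draws one bar per row, so a large listing table means a very large number of overlapping bars. A minimal sketch of aggregating first, assuming the column names from the question and made-up sample values:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical listings; the real airbnb DataFrame has many more rows.
airbnb = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Queens", "Queens", "Queens"],
    "revenue": [100.0, 200.0, 50.0, 150.0, 400.0],
})

# Reduce to one value per group before plotting.
mean_revenue = airbnb.groupby("neighbourhood_group")["revenue"].mean()

fig, ax = plt.subplots()
ax.bar(mean_revenue.index, mean_revenue.values)  # one bar per group
ax.set_ylabel("Mean revenue")
```

With the data reduced to one row per neighbourhood group, the plot only has to draw as many bars as there are groups.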

Question about Dataframe ploting with Pandas on Jupyter with multiple colors

1 - My goal is to create a bar plot of grades (y axis) against student IDs (x axis).
2 - Add an extra column with the mean() of the grades in a different color.
What's the best way of doing it?
I could create the first part, but when it came to changing the color of the extra column (the mean), I couldn't finish it.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
a = pd.read_excel('x.xlsx')
Felipe_stu = a['Teacher'] == 'Felipe'
Felipe_stu.plot(kind = 'bar', figsize = (20,5), color = 'gold')
Example of data (the first 10):
Example of plot:
I've already tried to create a list with all the colors of the respective items on the plot.
Such as:
my_color = []
for c in range(0, len(Jorge_stu)):
    my_color.append('gold')
my_color.append('blue')
So, I would make the last column (the mean) in the color that I chose (blue in this case). This didn't work.
Any ideas on how I can put the mean column on my plot?
Is it a better option to add an extra column to the plot or to add it in the proper dataframe and afterwards plot it?
You may need to do something like this:
How to create a matplotlib bar chart with a threshold line?
The threshold value in that example will be your mean line, and it can be calculated simply with df[score_column_name].mean()
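A minimal sketch that combines both ideas, the extra mean bar in its own color and a horizontal mean line. The `student_id` and `grade` column names and the sample values are assumptions, since the real spreadsheet is not shown:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical grades for one teacher's students.
grades = pd.DataFrame({
    "student_id": ["s1", "s2", "s3", "s4"],
    "grade": [6.0, 8.0, 7.0, 9.0],
})
mean_grade = grades["grade"].mean()

# Append the mean as one more bar and give it its own color.
bar_labels = list(grades["student_id"]) + ["mean"]
heights = list(grades["grade"]) + [mean_grade]
colors = ["gold"] * len(grades) + ["blue"]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(bar_labels, heights, color=colors)

# Optional: also draw the mean as a threshold line across all bars.
ax.axhline(mean_grade, color="blue", linestyle="--", linewidth=1)
ax.set_ylabel("Grade")
```

Building the plot lists separately leaves the original DataFrame untouched, which avoids mixing a summary row into the student data.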

No outlier detection in boxplot

I would like to plot boxplots of dataframes (see sample code below). What I'm wondering is: How can I disable the detection of outlier? I don't want to remove them, I just want a plot which visualizes the data by marking 0%, 25%, 50% and 75% of the datapoints without considering any criteria for outliers etc.
How do I have to modify my code to achieve this? Can I change the outlier detection criteria so that it behaves as if it were disabled?
I would be very grateful for any help, and if there is already another thread about this (which I didn't find), I would be happy to get a link to it.
Many thanks!
Jordin
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(1234)
df = pd.DataFrame(np.random.randn(10, 4),
                  columns=['Col1', 'Col2', 'Col3', 'Col4'])
plt.figure()
plt.boxplot(df.values)
plt.show()
EDIT:
I would like to include this outlier when drawing the whiskers and not just not show it.
You're looking for the whis parameter.
From the documentation:
whis : float, sequence, or string (default = 1.5)
As a float, determines the reach of the
whiskers beyond the first and third quartiles. In other words,
where IQR is the interquartile range (Q3-Q1), the upper whisker will
extend to the last datum less than Q3 + whis*IQR. Similarly, the lower
whisker will extend to the first datum greater than Q1 - whis*IQR.
Beyond the whiskers, data are considered outliers and are plotted as
individual points. Set this to an unreasonably high value to force the
whiskers to show the min and max values. Alternatively, set this to an
ascending sequence of percentiles (e.g., [5, 95]) to set the whiskers
at specific percentiles of the data. Finally, whis can be the string
'range' to force the whiskers to the min and max of the data.
Add it like so:
df.boxplot(whis=99)
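The percentile form mentioned in the documentation can express the same thing without a magic number. As far as I know, newer Matplotlib releases no longer accept the 'range' string, so `whis=(0, 100)` is the safer spelling. A small sketch using the question's sample data, with the `result` name being my own:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(1234)
df = pd.DataFrame(np.random.randn(10, 4),
                  columns=['Col1', 'Col2', 'Col3', 'Col4'])

fig, ax = plt.subplots()
# Whiskers at the 0th and 100th percentiles, i.e. at each column's min and max.
result = ax.boxplot(df.values, whis=(0, 100))

# Every point lies inside the whiskers, so no flier markers are drawn.
whisker_y = np.concatenate([line.get_ydata() for line in result["whiskers"]])
n_fliers = sum(len(line.get_ydata()) for line in result["fliers"])
```

Here the extreme values are included in the whiskers rather than merely hidden.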
If you add sym='' inside your plot function I think you will get what you ask for:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(1234)
df = pd.DataFrame(np.random.randn(10, 4),
                  columns=['Col1', 'Col2', 'Col3', 'Col4'])
df.boxplot(sym='')

Seaborn pairplot with different data on each triangle [duplicate]

This question already has an answer here:
seaborn two corner pairplot
(1 answer)
Closed 1 year ago.
I want to draw a pairplot from two different dataframes, data_up and data_low, using one for the upper part and the other for the lower part of the pair grid. Both dataframes have 4 columns, which correspond to the variables.
Looking at PairGrid, I did not find a way to give different data to each triangle.
e.g :
import numpy as np
import seaborn as sns
import pandas as pd
# Dummy data :
data_up = np.random.uniform(size=(100,4))
data_low = np.random.uniform(size=(100,4))
# The two pairplots I currently use and want to mix:
sns.pairplot(pd.DataFrame(data_up))
sns.pairplot(pd.DataFrame(data_low))
How can I have only the upper triangle of the first one plotted with the lower triangle of the second one? I don't really care what is plotted on the diagonal. Maybe a Q-Q plot between the two corresponding marginals would be nice, but I'll look at that later.
You could try to put all columns together in the dataframe, and then use x_vars=... to tell which columns to use for the x-direction. Similar for y.
import numpy as np
import seaborn as sns
import pandas as pd
# Dummy data :
data_up_down = np.random.uniform(size=(100,8))
df = pd.DataFrame(data_up_down)
# use columns 0..3 for the x and 4..7 for the y
sns.pairplot(df, x_vars=(0,1,2,3), y_vars=(4,5,6,7))
import matplotlib.pyplot as plt
plt.show()
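If the goal really is one dataset per triangle, another option is to build the grid by hand with plain Matplotlib. This is only a sketch that swaps seaborn for `plt.subplots`; the colors, marker size and the overlaid histograms on the diagonal are arbitrary choices of mine:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Dummy data, as in the question.
rng = np.random.default_rng(0)
data_up = pd.DataFrame(rng.uniform(size=(100, 4)))
data_low = pd.DataFrame(rng.uniform(size=(100, 4)))

n = data_up.shape[1]
fig, axes = plt.subplots(n, n, figsize=(8, 8))

for i in range(n):        # row index selects the y variable
    for j in range(n):    # column index selects the x variable
        ax = axes[i, j]
        if i < j:         # upper triangle: first dataset
            ax.scatter(data_up[j], data_up[i], s=5, color="tab:blue")
        elif i > j:       # lower triangle: second dataset
            ax.scatter(data_low[j], data_low[i], s=5, color="tab:orange")
        else:             # diagonal: both marginals overlaid
            ax.hist(data_up[i], alpha=0.5, color="tab:blue")
            ax.hist(data_low[i], alpha=0.5, color="tab:orange")
```

The diagonal branch is the place to swap in a Q-Q plot of the two marginals later.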

pandas groupby sum area plot

I'm looking to make a stacked area plot over time, based on summary data created by groupby and sum.
The groupby and sum part correctly groups and sums the data I want, but it seems the resultant format is nonsense in terms of plotting it.
I'm not sure where to go from here:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.DataFrame({'invoice':[1,2,3,4,5,6],'year':[2016,2016,2017,2017,2017,2017],'part':['widget','wonka','widget','wonka','wonka','wonka'],'dollars':[10,20,30,10,10,10]})
#drop the invoice number from the data since we don't need it
df=df[['dollars','part','year']]
#group by year and part, and add them up
df=df.groupby(['year','part']).sum()
#plotting this is nonsense:
df.plot.area()
plt.show()
To chart multiple series, it's easiest to have each series organized as a separate column, i.e. replace
df=df.groupby(['year','part']).sum()
with
df=df.groupby(['year', 'part']).sum().unstack(-1)
Then the rest of the code should work. But, I'm not sure if this is what you need because the desired output is not shown.
df.plot.area() then produces a stacked area chart with one band per part.
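To see what unstack changes, it can help to look at the reshaped table before plotting. A short sketch using the question's data; selecting the dollars column first and the `wide` name are my own choices:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'invoice': [1, 2, 3, 4, 5, 6],
    'year': [2016, 2016, 2017, 2017, 2017, 2017],
    'part': ['widget', 'wonka', 'widget', 'wonka', 'wonka', 'wonka'],
    'dollars': [10, 20, 30, 10, 10, 10],
})

# Sum dollars per (year, part), then move 'part' from the index to the columns:
# one row per year, one column per part.
wide = df.groupby(['year', 'part'])['dollars'].sum().unstack('part')

# Each column becomes one band of the stacked area chart.
ax = wide.plot.area()
ax.set_ylabel('dollars')
```

Selecting the dollars column before unstacking keeps the column labels flat (widget, wonka) instead of a two-level header.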
