seaborn - how to add mean bar to barplot grouped with hue - python

I've got code as follows:
import pandas as pd
import numpy as np
import seaborn as sns
data=[np.random.randint(2018,2020,size=(30)),
np.random.randint(1,13,size=(30)),
np.random.randint(1,101,size=(30)),
np.random.randint(1,101,size=(30))]
cols=['year','month','val','val1']
data=pd.DataFrame(data).T
data.columns=cols
data1=[np.random.randint(1,13,size=(30)),
np.random.randint(1,101,size=(30)),
np.random.randint(1,101,size=(30))]
cols1=['month','val','val1']
data1=pd.DataFrame(data1).T
data1.columns=cols1
sns.barplot(data=data,x='month',y='val',hue='year',ci=False)
sns.barplot(data=data,x='month',y='val',estimator=np.mean,ci=False)
This produces two separate bar plots: the first is grouped by year, and the second shows the mean for each month.
I would like a single plot with three bars for each month, including a mean bar. Could you help me with this?

You can use pandas' plot function:
(data.pivot_table(index='month',columns='year',
values='val', margins=True,
margins_name='Mean')
.drop('Mean')
.plot.bar()
)
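To see what the extra Mean column contains, here is a minimal sketch on a small hand-made frame (the five rows are made up for illustration and are not the question's random data). With the default aggregation, the margin is the mean of all rows for that month, which can differ from the average of the per-year bars when the years have different numbers of rows.

```python
import pandas as pd

# Made-up data: month 1 has one 2018 row and two 2019 rows.
data = pd.DataFrame({
    "year":  [2018, 2018, 2019, 2019, 2019],
    "month": [1, 2, 1, 1, 2],
    "val":   [10, 20, 30, 50, 40],
})

table = (
    data.pivot_table(index="month", columns="year", values="val",
                     margins=True, margins_name="Mean")
        .drop("Mean")  # drop the margin row, keep the margin column
)
print(table)
# Month 1: 2018 -> 10, 2019 -> 40 (mean of 30 and 50),
# Mean -> 30 (mean of 10, 30 and 50, not the mean of 10 and 40).
```

Calling table.plot.bar() on this table draws three bars per month: 2018, 2019 and Mean.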

Related

Plotting complex graph in pandas

I have the following dataset
ids count
1 2000210
2 -23123
3 100
4 500
5 102300120
...
1 million 123213
I want a graph where I have group of ids (all unique ids) in the x axis and count in y axis and a distribution chart that looks like the following
How can I achieve this with a pandas DataFrame in Python?
I tried different approaches, but I only get a basic plot, nothing as complex as the drawing.
What I tried:
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
df.plot(x="range", y="count");
But the plots don't make any sense. I am also new to plotting in pandas. I searched the internet for a long time for charts like this and could really use some help with such graphs.
From what I understood from your question and comments, here is what you can do:
1) Import the libraries and set the default theme:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
2) Create your dataframe:
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
3) Plot your data
3.1) Simple take using only the seaborn library:
sns.kdeplot(data=df, x="count", weights="range")
3.2) More complex take using seaborn and matplotlib libraries:
sns.histplot(x=df["count"], weights=df["range"], discrete=True,
color='darkblue', edgecolor='black',
kde=True, kde_kws={'cut': 2}, line_kws={'linewidth': 4})
plt.ylabel("range")
plt.show()
Personal note: please make sure to check all the solutions; if they are not enough, comment and we will work together to find you a solution.
For a distribution plot of ids you can use:
import numpy as np
import pandas as pd
np.random.seed(seed=123)
df = pd.DataFrame(np.random.randn(1000000), columns=["ids"])
df['ids'].plot(kind='kde')
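A kernel density estimate over a million rows can be slow, so one option is to estimate it from a random sample, or to bin the values with a plain histogram. The following is only a sketch under assumptions: the count values are stand-in random numbers, and the sample size (10,000) and the number of bins (50) are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
n = 1_000_000
# Stand-in data: one row per id with a random "count" value.
df = pd.DataFrame({"ids": np.arange(1, n + 1), "count": rng.normal(size=n)})

# A random sample is usually enough to see the shape of the distribution.
sample = df["count"].sample(n=10_000, random_state=0)

# Binning all values is cheap even for a million rows.
bin_counts, bin_edges = np.histogram(df["count"], bins=50)

# To draw them (matplotlib required; the KDE also needs scipy):
# sample.plot(kind="kde")
# df["count"].plot(kind="hist", bins=50)
```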

Box and whisker plot on multiple columns

I am trying to make a box-and-whisker plot from my dataset, which looks something like this:
and this is the chart I'm trying to make:
My current code is below:
import seaborn as sns
import matplotlib.pyplot as plt
d = df3.boxplot(column = ['Northern California','New York','Kansas','Texas'], by = 'Banner')
d
Thank you
I've recreated a dummy version of your dataset:
import numpy as np
import pandas as pd
dictionary = {'Banner':['Type1']*10+['Type2']*10,
'Northen_californina':np.random.rand(20),
'Texas':np.random.rand(20)}
df = pd.DataFrame(dictionary)
What you need is to melt (unpivot) your dataframe in order to store the geographical zone in a column rather than as a column name. You can use the pandas.melt method and list all the columns you want in your boxplot in the value_vars argument.
With my dummy dataset you can do this:
df = pd.melt(df,id_vars=['Banner'],value_vars=['Northen_californina','Texas'],
var_name='zone', value_name='amount')
Now you can apply a boxplot using the hue argument:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(9,9)) #for a bigger image
sns.boxplot(x="Banner", y="amount", hue="zone", data=df, palette="Set1")
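To see what the melt step does, here is a tiny sketch on a two-row frame. The values are made up, and Kansas and Texas are just two of the column names from the question.

```python
import pandas as pd

# Wide format: one column per geographical zone.
wide = pd.DataFrame({
    "Banner": ["Type1", "Type2"],
    "Kansas": [1.0, 2.0],
    "Texas": [3.0, 4.0],
})

# Long format: the zone name becomes a value in the "zone" column.
long = pd.melt(wide, id_vars=["Banner"], value_vars=["Kansas", "Texas"],
               var_name="zone", value_name="amount")
print(long)
```

Each (Banner, zone) pair now has its own row, which is the layout seaborn needs in order to use zone as the hue.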

Plot aggregate groupby Count data in SeaBorn Python?

How do I use a groupby result on the y-axis? The code below doesn't display what I expect, because of y = df.groupby('column1')['column2'].count():
import seaborn as sns
import pandas as pd
sns.set(style="whitegrid", color_codes=True)
sns.stripplot(x="column1", y = df.groupby('column1')['column2'].count(), data=df)
Seaborn doesn't work that way. You pass the names of the x and y columns along with the data frame, and seaborn's aggregating plots (for example barplot or countplot) do the aggregation themselves.
import seaborn as sns
sns.stripplot(x='column1', y='column2', data=df)
For the count, maybe what you need is countplot:
sns.countplot(x='column1', data=df)
The equivalent pandas code is:
df.groupby('column1').size().plot(kind='bar')
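If you would rather keep the groupby, one option is to aggregate first and then hand the aggregated frame to the plotting function, instead of passing the groupby result as y. A small sketch with made-up values (the column name n is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "column1": ["a", "b", "a", "c", "a", "b"],
    "column2": [1, 2, 3, 4, 5, 6],
})

# One row per group, with the count stored in a regular column.
counts = df.groupby("column1")["column2"].count().reset_index(name="n")
print(counts)

# counts can now be plotted directly, for example:
# sns.barplot(data=counts, x="column1", y="n")
```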
This code creates the horizontal-bar equivalent of a count plot, with the values sorted in descending order:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 16))
(df.groupby('Age').size()
   .sort_values(ascending=False)
   .plot(kind='barh', ax=ax))

Plot stacked bar chart from pandas data frame

I have dataframe:
payout_df.head(10)
What would be the easiest, smartest and fastest way to replicate the following excel plot?
I've tried different approaches, but couldn't get everything into place.
Thanks
If you just want a stacked bar chart, one way is to loop over the columns of the dataframe, plot each one, and keep track of the cumulative sum, which you pass as the bottom argument of pyplot.bar:
import pandas as pd
import matplotlib.pyplot as plt
# If it's not already a datetime
payout_df['payout'] = pd.to_datetime(payout_df.payout)
cumval = 0
fig = plt.figure(figsize=(12, 8))
# Stack every column except the date column
for col in payout_df.columns[~payout_df.columns.isin(['payout'])]:
    plt.bar(payout_df.payout, payout_df[col], bottom=cumval, label=col)
    cumval = cumval + payout_df[col]  # running total becomes the next bottom
_ = plt.xticks(rotation=30)
_ = plt.legend(fontsize=18)
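The bottom of each segment is simply the sum of the columns stacked before it. As a quick check of that bookkeeping, here is a sketch on a made-up frame (the columns a, b and c are hypothetical, since the real payout_df is not shown):

```python
import pandas as pd

payout_df = pd.DataFrame({
    "payout": pd.to_datetime(["2020-01-31", "2020-02-29"]),
    "a": [1, 2],
    "b": [3, 4],
    "c": [5, 6],
})

values = payout_df.drop(columns="payout")
# Cumulative sum across columns minus the column itself gives the height
# at which each segment starts, the same numbers the loop passes as bottom.
bottoms = values.cumsum(axis=1) - values
print(bottoms)
```

pandas can also draw the same stacked chart without the loop, via payout_df.set_index('payout').plot(kind='bar', stacked=True).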
Although no data was provided, I think the following code will produce the desired graph:
import pandas as pd
import matplotlib.pyplot as plt
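df['payout'] = pd.to_datetime(df['payout'])
# Monthly sums; the month-end alias 'M' is spelled 'ME' in pandas 2.2 and later
grouped = df.groupby(pd.Grouper(key='payout', freq='M')).sum()
# Recent pandas expects x to be a column label, so put the year labels in the index instead
grouped.set_index(grouped.index.year).plot(kind='bar', stacked=True)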
plt.show()
I don't know how to reproduce this fancy x-axis style. Also, your payout column must be a datetime, otherwise pd.Grouper won't work (available frequencies).
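The monthly grouping can be checked on a few rows before plotting. This sketch uses made-up data with hypothetical columns a and b, and it groups with dt.to_period('M') rather than pd.Grouper. That is a swapped-in alternative which should give the same monthly sums, indexed by period instead of by month-end date.

```python
import pandas as pd

df = pd.DataFrame({
    "payout": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-10"]),
    "a": [1, 2, 3],
    "b": [10, 20, 30],
})

# Group rows by calendar month and sum the value columns.
monthly = df.groupby(df["payout"].dt.to_period("M"))[["a", "b"]].sum()
print(monthly)

# monthly.plot(kind="bar", stacked=True) would then draw one stacked bar per month.
```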

Plot Overlapping Histograms Using Python

I have a .csv file (csv_test_1.csv) that is in this format:
durum_before_length,durum_before_reads,durum_after_length,durum_after_reads
0,0,0,0
10,0,10,0
20,0,20,0
30,0,30,1
40,0,40,4
50,0,50,5
60,0,60,0
70,0,70,1
80,0,80,4
90,0,90,1
100,4840,100,4704
110,4817,110,4706
120,4983,120,4860
130,4997,130,4851
140,5142,140,4980
150,5363,150,5192
160,5756,160,5530
170,6054,170,5725
180,6335,180,5989
190,7051,190,6651
200,9003,200,7157
210,8446,210,7812
220,9088,220,8314
230,9761,230,8955
240,10637,240,9660
250,11659,250,10408
260,12572,260,11178
270,13139,270,11538
280,13985,280,11950
290,113552,290,14304
300,954175,300,16383
,,310,17230
,,320,18368
,,330,19158
,,340,19733
,,350,20754
,,360,21698
,,370,21991
,,380,21937
,,390,22473
,,400,22655
,,410,22497
,,420,22460
,,430,22488
,,440,21941
,,450,21884
,,460,21350
,,470,21066
,,480,20812
,,490,19901
,,500,19716
,,510,19374
,,520,19000
,,530,18245
,,540,17220
,,550,15713
,,560,14042
,,570,11932
,,580,7204
,,590,29
You can see that the second two columns are longer than the first two columns. I would like to plot two overlapping histograms: the first histogram will be the first column as the x values plotted against the second column as the y-values, and the second histogram will be the third column as the x values plotted against the fourth column as the y-values.
I am thinking of using seaborn because it makes nice looking plots. The code I have thus far is as shown below. From here, I have no idea how to specify the x and y values and how to generate two overlapping histograms on the same plot. Any advice would be greatly appreciated.
import numpy as np
import pandas as pd
from pandas import read_csv
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
read_data = read_csv("csv_test_1.csv")
sns.set(style="white", palette="muted")
sns.despine()
plt.hist(read_data, normed=False)
plt.xlabel("Read Length")
plt.ylabel("Number of Reads")
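Because the file already contains one count per length bin, one possible approach is to draw the counts as two semi-transparent bar series on the same axes instead of calling a histogram function. The following is a sketch under assumptions: it reads a few rows copied from the data above from a string rather than from csv_test_1.csv, the helper name plot_overlap is hypothetical, and the bar width of 10 matches the spacing of the length values.

```python
import io
import pandas as pd

# A few rows in the same layout as csv_test_1.csv (the real file has more).
csv_text = """durum_before_length,durum_before_reads,durum_after_length,durum_after_reads
280,13985,280,11950
290,113552,290,14304
300,954175,300,16383
,,310,17230
,,320,18368
"""
reads = pd.read_csv(io.StringIO(csv_text))

# The "before" columns are shorter, so drop their empty rows.
before = reads[["durum_before_length", "durum_before_reads"]].dropna()
after = reads[["durum_after_length", "durum_after_reads"]].dropna()


def plot_overlap(before, after):
    """Draw both pre-binned series as overlapping, semi-transparent bars."""
    import matplotlib.pyplot as plt  # imported here so the data step runs without matplotlib

    fig, ax = plt.subplots()
    ax.bar(before["durum_before_length"], before["durum_before_reads"],
           width=10, alpha=0.5, label="before")
    ax.bar(after["durum_after_length"], after["durum_after_reads"],
           width=10, alpha=0.5, label="after")
    ax.set_xlabel("Read Length")
    ax.set_ylabel("Number of Reads")
    ax.legend()
    return ax
```

Calling plot_overlap(before, after) followed by plt.show() displays the figure. With the full file, the very large counts at lengths 290 and 300 will dominate a linear y-axis, so a logarithmic scale (ax.set_yscale('log')) may be worth trying.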
