I have a dataframe:
d = {'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'D', 'D', 'D', 'D', 'D'],
'value': [0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 1.0],
'count': [4, 3, 7, 3, 12, 14, 5, 10, 3, 8, 7, 15, 4]}
df = pd.DataFrame(data=d)
df
I want to plot multiple histograms in one figure: one for group A, one for group B, and one for group D. The group column should give the labels, count should be on the y-axis, and value on the x-axis.
I tried the following, but the y-axis values are wrong and it produces several subplots instead of one:
axarr = df.hist(column='value', by = 'group', bins = 20)
for ax in axarr.flatten():
    ax.set_xlabel("value")
    ax.set_ylabel("count")
Assuming that you are looking for a grouped bar plot (pending the clarification in the comments):
Plot:
Code:
import pandas as pd
d = {'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'D', 'D', 'D', 'D', 'D'],
'value': [0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 1.0],
'count': [4, 3, 7, 3, 12, 14, 5, 10, 3, 8, 7, 15, 4]}
df = pd.DataFrame(data=d)
df_pivot = pd.pivot_table(
    df,
    values="count",
    index="value",
    columns="group",
)
ax = df_pivot.plot(kind="bar")
fig = ax.get_figure()
fig.set_size_inches(7, 6)
ax.set_xlabel("value")
ax.set_ylabel("count")
Brief plot option:
pd.pivot(df, index="value", columns="group").plot(kind="bar", y='count')
Brief explanation:
Pivot table for preparation:
x-axis 'value' as index
different bars 'group' as columns
group     A     B     D
value
0.2     4.0  12.0   3.0
0.4     3.0  14.0   8.0
0.6     7.0   5.0   7.0
0.8     3.0  10.0  15.0
1.0     NaN   NaN   4.0
Pandas .plot() can draw that grouped bar plot directly once df_pivot has been prepared.
Its default backend is matplotlib, so usual commands apply (like fig.savefig).
Add-on appearance:
You can make it look more like a histogram along the x-axis by closing the gaps between the groups:
Just add width=0.95 (or width=1.0) to the .plot() calls.
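As a minimal sketch of the width tweak described above, the following rebuilds the pivot table and passes width=0.95 to the bar plot. The Agg backend line is an assumption made here so the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import pandas as pd

d = {'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'D', 'D', 'D', 'D', 'D'],
     'value': [0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 1.0],
     'count': [4, 3, 7, 3, 12, 14, 5, 10, 3, 8, 7, 15, 4]}
df = pd.DataFrame(data=d)

# One row per value, one column per group
df_pivot = df.pivot_table(values="count", index="value", columns="group")

# width is the total width shared by all bars at one x position;
# values close to 1.0 leave almost no gap between neighbouring positions
ax = df_pivot.plot(kind="bar", width=0.95)
ax.set_xlabel("value")
ax.set_ylabel("count")
```

With three groups, each bar ends up roughly 0.95 / 3 units wide, so the clusters nearly touch.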
Using plotnine in python, I'd like to add dashed horizontal lines to my plot (a scatterplot, but preferably an answer compatible with other plot types) representing the mean for every color separately. I'd like to do so without manually computing the mean values myself or adapting other parts of the data (e.g. adding columns for color values etc).
Additionally, the original plot is generated via a function (make_plot below) and the mean lines are to be added afterwards, yet need to have the same color as the points from which they are derived.
Consider the following as a minimal example:
import pandas as pd
import numpy as np
from plotnine import *
df = pd.DataFrame( { 'MSE': [0.1, 0.7, 0.5, 0.2, 0.3, 0.4, 0.8, 0.9 ,1.0, 0.4, 0.7, 0.9 ],
'Size': ['S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL'],
'Number': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] } )
def make_plot(df, var_x, var_y, var_fill) :
    plot = ggplot(df) + aes(x='Number', y='MSE', fill = 'Size') + geom_point()
    return plot
plot = make_plot(df, 'Number', 'MSE', 'Size')
I'd like to add 4 lines, one for each Size. The exact same can be done in R using ggplot, as shown by this question. Adding geom_line(stat="hline", yintercept="mean", linetype="dashed") to plot however results in an error PlotnineError: "'stat_hline' Not in Registry. Make sure the module in which it is defined has been imported." that I am unable to resolve.
Answers that can resolve the aforementioned issue, or propose another working solution entirely, are greatly appreciated.
You can do it by first computing the means as a vector and then using it in your function:
import pandas as pd
import numpy as np
from plotnine import *
from random import randint
df = pd.DataFrame( { 'MSE': [0.1, 0.7, 0.5, 0.2, 0.3, 0.4, 0.8, 0.9 ,1.0, 0.4, 0.7, 0.9 ],
'Size': ['S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL'],
'Number': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] } )
a = df.groupby(['Size'])['MSE'].mean() ### Defining your means
a = list(a)
def make_plot(df, var_x, var_y, var_fill):
    plot = ggplot(df) + aes(x='Number', y='MSE', fill = 'Size') + geom_point()+ geom_hline(yintercept =a,linetype="dashed")
    return plot
plot = make_plot(df, 'Number', 'MSE', 'Size')
which gives:
Note that two of the lines coincide:
a = [0.6666666666666666, 0.5, 0.4666666666666666, 0.6666666666666666]
To add different colors to each dashed line, you can do this:
import pandas as pd
import numpy as np
from plotnine import *
from random import randint
df = pd.DataFrame( { 'MSE': [0.1, 0.7, 0.5, 0.2, 0.3, 0.4, 0.8, 0.9 ,1.0, 0.4, 0.7, 0.9 ],
'Size': ['S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL'],
'Number': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] } )
### Generate a list of colors of the same length as your categories (Sizes)
color = []
n = len(list(set(df.Size)))
for i in range(n):
    color.append('#%06X' % randint(0, 0xFFFFFF))
######################################################
def make_plot(df, var_x, var_y, var_fill):
    plot = ggplot(df) + aes(x='Number', y='MSE', fill = 'Size') + geom_point()+ geom_hline(yintercept =list(df.groupby(['Size'])['MSE'].mean()),linetype="dashed", color =color)
    return plot
plot = make_plot(df, 'Number', 'MSE', 'Size')
which returns:
I am trying to plot a bar chart where I would like to have two bars, one stacked and another one not stacked by the side of the stacked one.
I have the first plot which is a stacked plot:
And another plot, with the same lines and columns:
I want to plot it side by side to the columns of the last plot, and not stack it:
This is a code snippet to replicate my problem:
d = pd.DataFrame({'DC': {'col0': 257334.0,
'col1': 0.0,
'col2': 0.0,
'col3': 186146.0,
'col4': 0.0,
'col5': 366431.0,
'col6': 461.0,
'col7': 0.0,
'col8': 0.0},
'DC - IDC': {'col0': 32665.0,
'col1': 0.0,
'col2': 156598.0,
'col3': 0.0,
'col4': 176170.0,
'col5': 0.0,
'col6': 0.0,
'col7': 0.0,
'col8': 0.0},
'No Address': {'col0': 292442.0,
'col1': 227.0,
'col2': 298513.0,
'col3': 117167.0,
'col4': 249.0,
'col5': 747753.0,
'col6': 271976.0,
'col7': 9640.0,
'col8': 211410.0}})
d[['DC', 'DC - IDC']].plot.barh(stacked=True)
d[['No Address']].plot.barh( stacked=False, color='red')
Use the position parameter to draw 2 columns on the same index:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
d[['DC', 'DC - IDC']].plot.barh(width=0.4, position=0, stacked=True, ax=ax)
d[['No Address']].plot.barh(width=0.4, position=1, stacked=True, ax=ax, color='red')
plt.show()
You can achieve this using only the matplotlib.pyplot library. First, import the NumPy and matplotlib libraries.
import matplotlib.pyplot as plt
import numpy as np
Then,
plt.figure(figsize=(15,8))
plt.barh(d.index, d['DC'], 0.4, label='DC', align='edge')
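# left=d['DC'] starts each bar where the DC bar ends, so the two are stacked rather than overlaid
plt.barh(d.index, d['DC - IDC'], 0.4, left=d['DC'], label='DC - IDC', align='edge')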
plt.barh(np.arange(len(d.index))-0.4, d['No Address'], 0.4, color='red', label='No Address', align='edge')
plt.legend();
Here is what I did:
Increase the figure size (optional)
Create a BarContainer for each column
Decrease the width of each bar to 0.4 to make them fit
Align the left edges of the bars with the y positions
The first two sets of bars share the same y positions. To put the red bars beside them, subtract the bar width (0.4) from each y coordinate: np.arange(len(d.index))-0.4
Finally, add a legend
It should look like that:
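The steps above can also be sketched with numeric y positions throughout, which avoids mixing categorical and numeric coordinates on the same axis. The data here is a three-row subset of the frame from the question (col0, col2, col3), and the Agg backend line is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Subset of the question's data (rows col0, col2, col3)
d = pd.DataFrame(
    {'DC': [257334.0, 0.0, 186146.0],
     'DC - IDC': [32665.0, 156598.0, 0.0],
     'No Address': [292442.0, 298513.0, 117167.0]},
    index=['col0', 'col2', 'col3'])

y = np.arange(len(d))  # one numeric position per row
h = 0.4                # bar height

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(y, d['DC'], height=h, align='edge', label='DC')
# left=d['DC'] stacks this segment on top of the DC segment
ax.barh(y, d['DC - IDC'], height=h, left=d['DC'], align='edge', label='DC - IDC')
# Shifted down by one bar height so it sits beside the stacked bar
ax.barh(y - h, d['No Address'], height=h, align='edge', color='red', label='No Address')
ax.set_yticks(y)
ax.set_yticklabels(d.index)
ax.legend()
```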
I am trying to fill the missing values of series 'X' by growing it backwards using series 'Y' (assumed to be a percentage growth rate), separately for each group in 'G'. When I debug, I can see that my function "Fillbackwards" does exactly what I want for each group. However, when I use apply to run this function on each group, it returns an empty dataframe. Does anyone know what I am missing?
Thanks
Edited to clarify I want to fill na by growing the series backwards using another series.
import pandas as pd
import numpy as np
df = pd.DataFrame({'X':[np.nan, np.nan, 6, 6.7, np.nan, 5, 9, 10],
'Y':[5.4, 5.7, 5.5, 6.1, 2.1, 1.5, 5.1, 2.1,],
'G': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']})
def Fillbackwards(DB, Sname, Growthrate):
    first_non_nan = DB[Sname].isnull().idxmin()
    while first_non_nan-DB.index[0] > 0:
        # Note the index of the group within the dataframe does not start at 0 as it's part of a larger frame - DB.index[0] restarts from zero
        DB.loc[first_non_nan-1-DB.index[0], Sname] = DB.loc[first_non_nan-DB.index[0], Sname]/(DB.loc[first_non_nan-DB.index[0], Growthrate]/100+1)
        first_non_nan -= 1
df = df.groupby('G').apply(lambda x: Fillbackwards(x, 'X', 'Y'))
Are you just looking to fill the NaN values in X with the values in Y?
import pandas as pd
import numpy as np
df = pd.DataFrame({'X':[np.nan, np.nan, 6, 6.7, np.nan, 5, 9, 10],
'Y':[5.4, 5.7, 5.5, 6.1, 2.1, 1.5, 5.1, 2.1,],
'G': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']})
df['X'] = df['X'].fillna(df['Y'])
pandas has a built-in way of handling this.
I have realized my mistake: I did not have a "return" at the end of my function, so it returned None for each group.
Also, I have tweaked some of the above code in terms of indexing so it works correctly now.
Anyway: here is some code that lets you fill in missing values of series "X" by "growing it backwards" using growth rates in series "Y" separately for each group in "G":
import pandas as pd
import numpy as np
df = pd.DataFrame({'X':[np.nan, np.nan, 6, 6.7, np.nan, 5, 9, 10],
'Y':[5.4, 5.7, 5.5, 6.1, 2.1, 1.5, 5.1, 2.1,],
'G': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']})
def Fillbackwards(DB, Sname, Growthrate):
    first_non_nan = DB[Sname].isnull().idxmin()
    while first_non_nan > DB.index[0]:
        # Index labels are used directly here; this relies on the default consecutive integer index, so label-1 is the previous row
        DB.loc[first_non_nan-1, Sname] = DB.loc[first_non_nan, Sname]/(DB.loc[first_non_nan, Growthrate]/100+1)
        first_non_nan -= 1
    return DB
df = df.groupby('G').apply(lambda x: Fillbackwards(x, 'X', 'Y'))
check = df
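A sketch of the same back-casting that works on row positions instead of index labels, so it does not depend on the index being consecutive integers. The name fill_backwards and the use of pd.concat over the groups (in place of groupby.apply) are choices made here, not part of the code above. Only leading NaNs in each group are filled.

```python
import numpy as np
import pandas as pd

def fill_backwards(group, value_col, rate_col):
    """Fill leading NaNs via x[t-1] = x[t] / (1 + rate[t] / 100)."""
    out = group.copy()  # leave the caller's frame untouched
    values = out[value_col].to_numpy(dtype=float).copy()
    rates = out[rate_col].to_numpy(dtype=float)
    valid = np.flatnonzero(~np.isnan(values))
    if valid.size:
        # Walk from the first known value back to the start of the group
        for pos in range(valid[0], 0, -1):
            values[pos - 1] = values[pos] / (1 + rates[pos] / 100)
    out[value_col] = values
    return out  # returning the frame is what the original function lacked

df = pd.DataFrame({'X': [np.nan, np.nan, 6, 6.7, np.nan, 5, 9, 10],
                   'Y': [5.4, 5.7, 5.5, 6.1, 2.1, 1.5, 5.1, 2.1],
                   'G': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']})

# Process each group separately, then restore the original row order
filled = pd.concat(
    [fill_backwards(g, 'X', 'Y') for _, g in df.groupby('G')]
).sort_index()
```

For group A, row 1 becomes 6 / 1.055 and row 0 becomes that result divided by 1.057; for group B, row 4 becomes 5 / 1.015.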
I am using the following code to produce a pie plot.
My question is, how do I mask/hide the numbers inside the pie chart?
I do not want the numbers 0.62, 0.31 and 0.02 inside the pie chart to be visible.
Thanks in advance.
import pandas as pd
import matplotlib.pyplot as plt
df99 = pd.DataFrame({
'Data': ['A', 'B', 'C'],
'Perc': [0.62, 0.31, 0.02]})
plt.pie(df99['Perc']*100,
        colors=['#002c4b','#392e2c','#92847a','#ccc2bb','#6b879d','#7FBAA4','#8E654C','#006CB8','#CBBBE9','#9778D3'],
        counterclock=False, startangle=-270, pctdistance=1.2, labeldistance=1.2,
        labels=df99['Data'],
        autopct=lambda p: f"{p*df99['Perc'].sum()/100:.2f}")
plt.show()
IIUC,
import pandas as pd
import matplotlib.pyplot as plt
df99 = pd.DataFrame({
'Data': ['A', 'B', 'C'],
'Perc': [0.62, 0.31, 0.02]})
plt.pie(df99['Perc']*100,
        colors=['#002c4b','#392e2c','#92847a','#ccc2bb','#6b879d','#7FBAA4','#8E654C','#006CB8','#CBBBE9','#9778D3'],
        counterclock=False, startangle=-270, pctdistance=1.2, labeldistance=1.2,
        labels=df99['Data'],
        autopct=None)
plt.show()
Output:
Let's use pandas plot also,
df99.set_index('Data').mul(100).plot.pie(y='Perc',colors=['#002c4b','#392e2c','#92847a','#ccc2bb','#6b879d','#7FBAA4','#8E654C','#006CB8','#CBBBE9','#9778D3'],counterclock=False,startangle=-270)
Output:
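Since autopct defaults to None, leaving the argument out has the same effect as passing autopct=None. A small sketch to confirm that only the category labels remain on the axes (the Agg backend line is an assumption so the snippet runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt

labels = ['A', 'B', 'C']
perc = [0.62, 0.31, 0.02]

fig, ax = plt.subplots()
# No autopct argument: matplotlib draws the wedges and their labels only
ax.pie([p * 100 for p in perc], labels=labels,
       counterclock=False, startangle=-270)

# Every text object on the axes is now a category label
wedge_texts = [t.get_text() for t in ax.texts]
```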