I have a slightly odd csv file where the month column is repeated as such. My goal is to create a bar graph where each month has two columns of y (from both a and b). I have tried to approach this by separating the data frame into two - a only and b only - but the repetition of the month column gets in the way. Fairly new to Python and Pandas so perhaps there is a function I'm not aware of? Any help is appreciated.
month cond. y
Jan a 4
Jan b 8
Feb a 2
Feb b 9
March a 3
March b 7
Perhaps the most common way to approach this problem is to reshape the long-form data to wide-form via pivot and then DataFrame.plot:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame({
'month': ['Jan', 'Jan', 'Feb', 'Feb', 'March', 'March'],
'cond.': ['a', 'b', 'a', 'b', 'a', 'b'],
'y': [4, 8, 2, 9, 3, 7]
})
df.pivot(index='month', columns='cond.', values='y').plot(kind='bar', rot=0)
plt.tight_layout()
plt.show()
One noticeable issue is that the x-axis categories appear out of order: pivot sorts the index alphabetically, not in calendar order. One option is to reindex before plotting. There would be more options if the month column were regular, but since it mixes abbreviations (Jan, Feb) with a full month name (March), manually reindexing is likely the best option.
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame({
'month': ['Jan', 'Jan', 'Feb', 'Feb', 'March', 'March'],
'cond.': ['a', 'b', 'a', 'b', 'a', 'b'],
'y': [4, 8, 2, 9, 3, 7]
})
(
df.pivot(index='month', columns='cond.', values='y')
.reindex(['Jan', 'Feb', 'March']) # Re-order so they appear correctly on x-axis
.plot(kind='bar', rot=0)
)
plt.tight_layout()
plt.show()
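If the rows already appear in the desired order in the source frame, the order can be taken from the data instead of being typed by hand. This is only a minimal sketch, assuming first-appearance order is the order wanted on the x-axis; the names `month_order` and `wide` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'month': ['Jan', 'Jan', 'Feb', 'Feb', 'March', 'March'],
    'cond.': ['a', 'b', 'a', 'b', 'a', 'b'],
    'y': [4, 8, 2, 9, 3, 7]
})

# Series.unique() returns the values in order of first appearance
# (assumption: that order is the one wanted on the x-axis)
month_order = df['month'].unique()

wide = (
    df.pivot(index='month', columns='cond.', values='y')
      .reindex(month_order)  # undo the alphabetical sort done by pivot
)
print(wide)
# wide.plot(kind='bar', rot=0) would then draw the bars in this order
```

This avoids maintaining a hard-coded list, but it only helps when the source data is itself ordered correctly.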
Seaborn is a popular choice for this type of question, as the hue argument allows the reshaping step to be skipped. Additionally, x will be in order of appearance in the frame, so reindex is also unnecessary (assuming the data appears in the correct order in the source DataFrame).
sns.barplot:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
sns.set_theme() # (optional) Use seaborn theme
df = pd.DataFrame({
'month': ['Jan', 'Jan', 'Feb', 'Feb', 'March', 'March'],
'cond.': ['a', 'b', 'a', 'b', 'a', 'b'],
'y': [4, 8, 2, 9, 3, 7]
})
sns.barplot(data=df, x='month', y='y', hue='cond.')
plt.tight_layout()
plt.show()
Using the hue parameter to categorize also works. Note that the column names must match the frame exactly ('month' and 'cond.'):
import seaborn as sns
sns.barplot(data=df, x='month', y='y', hue='cond.')
Related
I am trying to apply the following calculation to all the columns of a dataframe EXCEPT a list of 3 string columns. The code below works fine on the sample data, but in reality there are 100+ month columns, and one more is added every month, while the 3 string columns are fixed. I want to apply the /100, on rows where View Description == 'Percent change', to all columns except the Series ID, View Description and Country columns. How can I modify the list so that it contains just the 3 string columns and the .loc is applied to everything else?
import pandas as pd
df = pd.DataFrame({
'Series ID': ['Food', 'Drinks', 'Food at Home'],
'View Description': ['Percent change', 'Original Data Value', 'Original Data Value'],
'Jan': [219.98, 'B', 'A'],
'Feb': [210.98, 'B', 'A'],
'Mar': [205, 'B', 'A'],
'Apr': [202, 'B', 'A'],
'Country': ['Italy', 'B', 'A']
})
months = ['Jan', 'Feb', 'Mar', 'Apr']
df.loc[df['View Description'] == 'Percent change', months] /= 100
print(df)
Thanks!
You can change months to be a boolean array which omits the string columns:
months = ~df.columns.isin(['Series ID', 'View Description', 'Country'])
The command for applying the division will be the same as you have above. This change just programmatically selects the month columns by excluding the non-month columns.
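Put together with the sample data from the question, the approach might look like the sketch below. The variable names `string_cols` and `rows` are introduced here only for readability and are not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Series ID': ['Food', 'Drinks', 'Food at Home'],
    'View Description': ['Percent change', 'Original Data Value', 'Original Data Value'],
    'Jan': [219.98, 'B', 'A'],
    'Feb': [210.98, 'B', 'A'],
    'Mar': [205, 'B', 'A'],
    'Apr': [202, 'B', 'A'],
    'Country': ['Italy', 'B', 'A']
})

# The three fixed string columns (illustrative name)
string_cols = ['Series ID', 'View Description', 'Country']

# Boolean array: True for every column that is NOT one of the string columns
months = ~df.columns.isin(string_cols)

# Row mask: only the 'Percent change' rows are divided
rows = df['View Description'] == 'Percent change'

df.loc[rows, months] /= 100
print(df)
```

Because the mask is built from the columns to exclude, new month columns are picked up automatically without editing any list.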
import pandas as pd
import numpy as np
np.random.seed(365)
rows = 100
data = {'Month': np.random.choice(['2014-01', '2014-02', '2014-03', '2014-04'], size=rows),
'Code': np.random.choice(['A', 'B', 'C'], size=rows),
'ColA': np.random.randint(5, 125, size=rows),
'ColB': np.random.randint(0, 51, size=rows),}
df = pd.DataFrame(data)
df = df[((~((df.Code=='A')&(df.Month=='2014-04')))&(~((df.Code=='C')&(df.Month=='2014-03'))))]
dfg = df.groupby(['Code', 'Month']).sum()
For the above, I wish to plot a stacked bar plot:
dfg.unstack(level=0).plot(kind='bar', stacked=True)
I wish to stack over the 'Code' column, but the above is stacking over 'Month'. Why? Is there a better way to produce a stacked plot from this?
By default, plot.bar uses the index of the input dataframe for the x-values, and each column becomes one stacked segment.
IIUC, you need:
dfg.unstack(level=1).plot(kind='bar', stacked=True)
To move the legend outside the axes:
ax = dfg.unstack(level=1).plot(kind='bar', stacked=True, legend=False)
ax.figure.legend(loc='center left', bbox_to_anchor=(1, 0.5))
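To see why the level matters, it can help to look at what unstack leaves in the index, since that is what ends up on the x-axis. The following is a small sketch on a hand-made frame (the data and the names `by_month` and `by_code` are illustrative, not the random data from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Month': ['2014-01', '2014-01', '2014-02', '2014-02'],
    'Code': ['A', 'B', 'A', 'B'],
    'ColA': [1, 2, 3, 4],
})
dfg = df.groupby(['Code', 'Month']).sum()

# level=0 moves 'Code' into the columns; 'Month' stays as the index (x-axis)
by_month = dfg.unstack(level=0)

# level=1 moves 'Month' into the columns; 'Code' stays as the index (x-axis)
by_code = dfg.unstack(level=1)

print(by_month)
print(by_code)
```

Whichever level remains in the index becomes the bar positions, and the unstacked level supplies the segments of each stack.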
Given the following code:
import pandas as pd
import plotly.express as px
d = {'col1': ['a', 'a', 'b', 'b', 'b'], 'col2': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data=d)
fig = px.bar(df, y='col1', color='col1')
fig.show()
that generates the following bar plot:
How do I remove count from the hover data?
plotly==5.1.0
You can remove it by overriding the hovertemplate:
import pandas as pd
import plotly.express as px
d = {'col1': ['a', 'a', 'b', 'b', 'b'], 'col2': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data=d)
fig = px.bar(df, y='col1', color='col1').update_traces(hovertemplate='col1=%{y}<br><extra></extra>')
fig.show()
I am using the following code to produce a pie plot.
My question is, how do I mask/hide the numbers inside the pie chart?
I do not want the numbers 0.62, 0.31 and 0.02 inside the pie chart to be visible.
Thanks in advance.
import pandas as pd
import matplotlib.pyplot as plt
df99 = pd.DataFrame({
'Data': ['A', 'B', 'C'],
'Perc': [0.62, 0.31, 0.02]})
plt.pie(df99['Perc']*100, colors=['#002c4b','#392e2c','#92847a','#ccc2bb','#6b879d','#7FBAA4','#8E654C','#006CB8','#CBBBE9','#9778D3'],counterclock=False,startangle=-270,pctdistance=1.2,labeldistance=1.2,labels=df99['Data'],
autopct=lambda p: f"{p*df99['Perc'].sum()/100:.2f}")
plt.show()
IIUC, set autopct=None (its default) so that no numbers are drawn inside the pie:
import pandas as pd
import matplotlib.pyplot as plt
df99 = pd.DataFrame({
'Data': ['A', 'B', 'C'],
'Perc': [0.62, 0.31, 0.02]})
plt.pie(df99['Perc']*100,
colors=['#002c4b','#392e2c','#92847a','#ccc2bb','#6b879d','#7FBAA4','#8E654C','#006CB8','#CBBBE9','#9778D3'],counterclock=False,startangle=-270,pctdistance=1.2,labeldistance=1.2,
labels=df99['Data'],
autopct=None)
plt.show()
The same chart can also be drawn with pandas' plot.pie:
df99.set_index('Data').mul(100).plot.pie(y='Perc',colors=['#002c4b','#392e2c','#92847a','#ccc2bb','#6b879d','#7FBAA4','#8E654C','#006CB8','#CBBBE9','#9778D3'],counterclock=False,startangle=-270)
%matplotlib inline
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'index' : ['A', 'B', 'C', 'D'], 'first': [1.2, 1.23, 1.32, 1.08], 'second': [2, 2.2, 3, 1.08], 'max': [1.5, 3, 0.9, 'NaN']}).set_index('index')
I want to plot a horizontal bar chart with first and second as bars.
I want to use the max column to display a vertical line at the corresponding values of the other columns.
So far I have only managed the bar plot.
Like this:
Any hints on how to achieve this?
thx
I replaced the NaN with a finite value; you can then use the following code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'index' : ['A', 'B', 'C', 'D'], 'first': [1.2, 1.23, 1.32, 1.08],
                   'second': [2, 2.2, 3, 1.08], 'max': [1.5, 3, 0.9, 2.5]}).set_index('index')
# Draw the two bars of each row on either side of the tick position
plt.barh(range(4), df['first'], height=-0.25, align='edge')
plt.barh(range(4), df['second'], height=0.25, align='edge', color='red')
plt.yticks(range(4), df.index)
# Mark each row's max with a vertical line spanning both bars
for i, val in enumerate(df['max']):
    plt.vlines(val, i-0.25, i+0.25, color='limegreen')
plt.show()
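The answer above sidesteps the missing value by substituting a finite number. If the NaN should instead be kept and that row simply left without a line, one possible approach (an assumption about the desired behaviour, not something stated in the question) is to convert the column to numeric and skip missing entries. The names `max_vals` and `lines` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'index': ['A', 'B', 'C', 'D'],
                   'first': [1.2, 1.23, 1.32, 1.08],
                   'second': [2, 2.2, 3, 1.08],
                   'max': [1.5, 3, 0.9, 'NaN']}).set_index('index')

# The string 'NaN' makes the column object dtype; convert it to real floats
max_vals = pd.to_numeric(df['max'], errors='coerce')

# (row position, x value) pairs for the rows that actually have a max
lines = [(i, val) for i, val in enumerate(max_vals) if pd.notna(val)]
print(lines)
# Each pair could then be drawn with plt.vlines(val, i - 0.25, i + 0.25)
```

With the question's data, row D has no usable max and is dropped from `lines`, so no vertical line would be drawn for it.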