Simple barplot of column means using seaborn - python

I have a pandas dataframe with 26 columns of numerical data. I want to represent the mean of each column in a barplot with 26 bars. This is easy to do with the pandas plotting function: df.plot(kind = 'bar'). However, the results are ugly and the column labels are often truncated.
I'd like to do this with seaborn instead, but can't seem to find a way no matter how hard I look. Surely there's an easy way to do a simple barplot of column averages? Thanks.

You can try something like this:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
ax = df.mean().plot(kind='bar')  # pandas returns a matplotlib Axes here, not a Figure
plt.margins(0.02)
plt.ylabel('Your y-label')
plt.xlabel('Your x-label')
ax.set_xticklabels(df.columns, rotation = 45, ha="right")
plt.show()

If anyone finds this by a search, the easiest solution I've found (I'm OP) is to use the pandas.melt() function. This stacks all the columns into a single value column and adds a second column that holds the original column name for each value. The resulting dataframe can be passed directly to seaborn.
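A minimal sketch of the melt approach described above. The three-column dataframe and the names "column" and "value" are placeholders chosen for illustration, not part of the original question; sns.barplot aggregates with the mean by default, so each bar ends up showing one column mean.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the 26-column dataframe
df = pd.DataFrame({
    "alpha": [1.0, 2.0, 3.0],
    "beta": [4.0, 5.0, 6.0],
    "gamma": [7.0, 8.0, 9.0],
})

# Wide -> long: one row per (column name, value) pair
long_df = df.melt(var_name="column", value_name="value")

# barplot averages the values per category by default, so each bar is a column mean.
# Putting the categories on the y axis gives horizontal bars with readable labels.
ax = sns.barplot(data=long_df, x="value", y="column")
ax.set_xlabel("mean")
```

Call matplotlib.pyplot.show() afterwards to display the figure in a script.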

You can use sns.barplot, ideally as a horizontal barplot, which is better suited to so many categories, like this:
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]})
unstacked = df.unstack().to_frame()
sns.barplot(
    y=unstacked.index.get_level_values(0),
    x=unstacked[0]);

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]})
sns.barplot(x = df.mean().index, y = df.mean())
plt.show()

Related

Python Plotly: Percentage Axis Formatter

I want to create a diagram from a pandas dataframe where the axes ticks should be percentages.
With matplotlib there is a nice axes formatter which automatically calculates the percentage ticks based on the given maximum value:
Example:
import matplotlib.pyplot as plt
import matplotlib.ticker as pltticker
import numpy as np
import pandas as pd
df = pd.DataFrame( { 'images': np.arange(0, 355, 5) } ) # 71 items in total, max is 350
ax = df.plot()
ax.yaxis.set_major_formatter(pltticker.PercentFormatter(xmax=350))
loc = pltticker.MultipleLocator(base=50) # locator puts ticks at regular intervals
ax.yaxis.set_major_locator(loc)
Since matplotlib is rather tedious to use, I want to do the same with Plotly. I only found the option to format the tick labels as percentages, but no 'auto formatter' that calculates the ticks and percentages for me. Is there a way to use automatic percentage ticks, or do I have to calculate them by hand every time (urgh)?
import plotly.express as px
import pandas as pd
fig = px.line(df, x=df.index, y=df.images, labels={'index':'num of users', 'images':'num of img'})
fig.layout.yaxis.tickformat = ',.0%' # does not help
fig.show()
Thank you for any hints.
I'm not sure there's an axis option for percent, but it's relatively easy to get there by dividing y by its max: y = df.y/df.y.max(). These kinds of calculations, performed right inside the plot call, are really handy and I use them all the time.
NOTE: if negative values are possible, it gets more complicated (and uglier). Min-max scaling, y=(df.y-df.y.min())/(df.y.max()-df.y.min()), may be necessary and is the more general solution.
Full example:
import plotly.express as px
import pandas as pd
data = {'x': [0, 1, 2, 3, 4], 'y': [0, 1, 4, 9, 16]}
df = pd.DataFrame.from_dict(data)
fig = px.line(df, x=df.x, y=df.y/df.y.max())
#or# fig = px.line(df, x=df.x, y=(df.y-df.y.min())/(df.y.max()-df.y.min()))
fig.layout.yaxis.tickformat = ',.0%'
fig.show()
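To see how the two normalizations above differ when negative values are present, here is a small sketch using only pandas. The sample series is made up for illustration. The results are fractions, which a tick format such as ',.0%' then displays as percentages.

```python
import pandas as pd

# Made-up sample data containing a negative value
y = pd.Series([-4, 0, 4, 12])

# Dividing by the maximum keeps the sign: negative values become negative fractions
by_max = y / y.max()

# Min-max scaling maps the smallest value to 0 and the largest to 1
min_max = (y - y.min()) / (y.max() - y.min())

print(by_max.round(2).tolist())
print(min_max.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

Which variant is appropriate depends on whether 0% should mean "zero" or "the smallest observed value".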

Python Pandas Seaborn FacetGrid: use dataframe series' names to set up columns

I am using pandas dataframes to hold some volume calculation results, and trying to configure a seaborn FacetGrid setup to visualize results of 4 different types of volume calculations for a reservoir zone.
I believe I can handle the dataframe part; my problem is with the visualization part:
Each type of volume calculation is loaded in the dataframe as a series. The series name corresponds to the type of volume calculation. I want to create a number of plots, aligned so that each column of plots corresponds to one series in my dataframe.
Theory (documentation) says this should do it (example from tutorial at https://seaborn.pydata.org/tutorial/axis_grids.html):
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset("tips")
g=sns.FacetGrid(tips, col = "time")
I cannot find the referenced dataset "tips" for download, but I think that is a minor problem. From the code snippet above and after some testing on my own data, I infer that "time" in that dataset refers to the name of one series in the dataframe and that different times would then be different categories or other types of values in that series.
This is not how my dataset is ordered. I have the different types of volume calculations that I would see as individual plots (in columns) represented as series in my dataframe. How do I provide the series name as input to seaborn FacetGrid col= argument?
g = seaborn.FacetGrid(data=volumes_table, col=?????)
I cannot figure out how I can get col=dataframe.series and I cannot find any documented example of that.
Here's a setup with some exciting dummy names and dummy values:
import os
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
#provide some input data, using a small dictionary
volumes_categories = {'zone_numbers': [1, 2, 3, 4],
                      'zone_names': ['corona', 'hiv', 'h5n1', 'measles'],
                      'grv': [30, 90, 80, 100],
                      'nv': [20, 60, 20, 50],
                      'pv': [5, 12, 4, 25],
                      'hcpv': [4, 6, 1, 20]}
# create the dataframe
volumes_table = pandas.DataFrame(volumes_categories)
# set up for plotting
seaborn.set(style='ticks')
g= seaborn.FacetGrid(data=volumes_table, col='zone_names')
The above setup generates columns OK, but I cannot get the columns to represent the series in my dataframe (that is, the columns you see when viewing the dataframe as a table).
What do I need to do?
The main part of the solution is described in BBQuercus's answer: reshaping the nice, human-readable wide-format dataframe/table into a long-format table, which is simpler for seaborn to digest, using pandas.melt().
I implemented this by creating a copy of the original dataframe and melting the copy:
# first copy dataframe
vol_table2 = volumes_table.copy()
#melt it into long format
vol_table2 = pandas.melt(vol_table2, id_vars = ['zone_numbers','zone_names'], value_vars=['grv','nv','pv','hcpv'], var_name = "volume_type", value_name = "volume")
In the end I also decided to scrap the explicit FacetGrid and map setup and use seaborn.catplot (with FacetGrid functionality included).
Thanks for assistance
(PS: it would be a good idea for seaborn to accept series names for FacetGrid setup.)
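The catplot route mentioned in this answer could look roughly like the sketch below. The answer does not say which plot kind or axis mapping was used, so kind="bar", x="zone_names" and y="volume" are assumptions made for illustration.

```python
import pandas
import seaborn

# Same dummy data as in the question
volumes_table = pandas.DataFrame({
    'zone_numbers': [1, 2, 3, 4],
    'zone_names': ['corona', 'hiv', 'h5n1', 'measles'],
    'grv': [30, 90, 80, 100],
    'nv': [20, 60, 20, 50],
    'pv': [5, 12, 4, 25],
    'hcpv': [4, 6, 1, 20],
})

# Wide -> long: one row per (zone, volume type)
vol_long = pandas.melt(
    volumes_table,
    id_vars=['zone_numbers', 'zone_names'],
    value_vars=['grv', 'nv', 'pv', 'hcpv'],
    var_name='volume_type',
    value_name='volume',
)

# catplot builds the FacetGrid itself: one plot column per volume type.
# kind='bar' is an assumed choice; other kinds such as 'strip' or 'point' also work.
g = seaborn.catplot(data=vol_long, x='zone_names', y='volume',
                    col='volume_type', kind='bar')
```

The returned object is a FacetGrid, so the usual methods such as g.set_axis_labels() remain available.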
Once we have imported all requirements:
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')
The FacetGrid essentially just provides a canvas to draw on. You can then use the map function to "project" plotting functions onto the canvas:
# Blueprint
g = sns.FacetGrid(dataframe, col="dataframe.column", row="dataframe.column")
g = g.map(plotting.function, "dataframe.column")
# Example with the tips dataset
g = sns.FacetGrid(tips, col="time", row="smoker")
g = g.map(plt.hist, "total_bill")
plt.show()
In your case, as mentioned above, I would also melt the columns first to get a tidy data format and then plot as usual, changing the plotting function and columns as necessary:
volumes_table = volumes_table.melt(id_vars=['zone_numbers', 'zone_names'])
g = sns.FacetGrid(data=volumes_table, col='variable')
g = g.map(plt.scatter, 'zone_numbers', 'value')
plt.show()

Plotting categorical variable against numeric variable in matplotlib

My DataFrame's structure
trx.columns
Index(['dest', 'orig', 'timestamp', 'transcode', 'amount'], dtype='object')
I'm trying to plot transcode (transaction code) against amount to see how much money is spent per transaction. I made sure to convert transcode to a categorical type, as seen below.
trx['transcode']
...
Name: transcode, Length: 21893, dtype: category
Categories (3, int64): [1, 17, 99]
The result I get from doing plt.scatter(trx['transcode'], trx['amount']) is:
[Figure: scatter plot]
While the above plot is not entirely wrong, I would like the X axis to contain just the three possible values of transcode [1, 17, 99] instead of the entire [1, 100] range.
Thanks!
Since matplotlib 2.1 you can plot categorical variables by using strings, i.e. if you provide the x values as strings, matplotlib treats them as categories.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.choice([1,17,99], size=100),
"y" : np.random.rand(100)*100})
plt.scatter(df["x"].astype(str), df["y"])
plt.margins(x=0.5)
plt.show()
In order to obtain the same in matplotlib <=2.0, one would plot against an index instead.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.choice([1,17,99], size=100),
"y" : np.random.rand(100)*100})
u, inv = np.unique(df["x"], return_inverse=True)
plt.scatter(inv, df["y"])
plt.xticks(range(len(u)),u)
plt.margins(x=0.5)
plt.show()
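The index-based version above relies on what np.unique returns with return_inverse=True. A small sketch with made-up codes shows the two arrays: the sorted unique values (used as tick labels) and, for each original entry, its position among those unique values (used as x coordinates).

```python
import numpy as np

# Made-up transaction codes
codes = np.array([17, 1, 99, 17, 1])

# u: sorted unique values; inv: position of each original entry within u
u, inv = np.unique(codes, return_inverse=True)

print(u.tolist())    # [1, 17, 99]
print(inv.tolist())  # [1, 0, 2, 1, 0]

# Indexing u with inv reconstructs the original array
reconstructed = u[inv]
```

Plotting against inv therefore places the three categories at x = 0, 1 and 2, regardless of their numeric values.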
The same plot can be obtained using seaborn's stripplot:
import seaborn as sns
sns.stripplot(x="x", y="y", data=df)
And a potentially nicer representation can be done via seaborn's swarmplot:
sns.swarmplot(x="x", y="y", data=df)

Side-by-side boxplot of multiple columns of a pandas DataFrame

One year of sample data:
import pandas as pd
import numpy.random as rnd
import seaborn as sns
n = 365
df = pd.DataFrame(data = {"A":rnd.randn(n), "B":rnd.randn(n)+1},
index=pd.date_range(start="2017-01-01", periods=n, freq="D"))
I want to boxplot these data side-by-side grouped by the month (i.e., two boxes per month, one for A and one for B).
For a single column sns.boxplot(df.index.month, df["A"]) works fine. However, sns.boxplot(df.index.month, df[["A", "B"]]) throws an error (ValueError: cannot copy sequence with size 2 to array axis with dimension 365). Melting the data by the index (pd.melt(df, id_vars=df.index, value_vars=["A", "B"], var_name="column")) in order to use seaborn's hue property as a workaround doesn't work either (TypeError: unhashable type: 'DatetimeIndex').
(A solution doesn't necessarily need to use seaborn, if it is easier using plain matplotlib.)
Edit
I found a workaround that basically produces what I want. However, it becomes somewhat awkward to work with once the DataFrame includes more variables than I want to plot. So if there is a more elegant/direct way to do it, please share!
df_stacked = df.stack().reset_index()
df_stacked.columns = ["date", "vars", "vals"]
df_stacked.index = df_stacked["date"]
sns.boxplot(x=df_stacked.index.month, y="vals", hue="vars", data=df_stacked)
Produces:
Here's a solution using pandas melting and seaborn:
import pandas as pd
import numpy.random as rnd
import seaborn as sns
n = 365
df = pd.DataFrame(data = {"A": rnd.randn(n),
"B": rnd.randn(n)+1,
"C": rnd.randn(n) + 10, # will not be plotted
},
index=pd.date_range(start="2017-01-01", periods=n, freq="D"))
df['month'] = df.index.month
df_plot = df.melt(id_vars='month', value_vars=["A", "B"])
sns.boxplot(x='month', y='value', hue='variable', data=df_plot)
import matplotlib.pyplot as plt
# collect one sub-DataFrame per month
month_dfs = []
for group in df.groupby(df.index.month):
    month_dfs.append(group[1])
plt.figure(figsize=(30,5))
# one subplot per month, each with a box per column
for i,month_df in enumerate(month_dfs):
    axi = plt.subplot(1, len(month_dfs), i + 1)
    month_df.plot(kind='box', subplots=False, ax = axi)
    plt.title(i+1)
    plt.ylim([-4, 4])
plt.show()
Will give this
Not exactly what you're looking for but you get to keep a readable DataFrame if you add more variables.
You can also easily hide the y-axis of every subplot except the first by adding
    if i > 0:
        y_axis = axi.axes.get_yaxis()
        y_axis.set_visible(False)
inside the loop, before plt.show().
This is quite straightforward using Altair:
import altair as alt
alt.Chart(
    df.reset_index().melt(id_vars = ["index"], value_vars=["A", "B"]).assign(month = lambda x: x["index"].dt.month)
).mark_boxplot(
    extent='min-max'
).encode(
    alt.X('variable:N', title=''),
    alt.Y('value:Q'),
    column='month:N',
    color='variable:N'
)
The code above melts the DataFrame and adds a month column. Then Altair creates box-plots for each variable broken down by months as the plot columns.
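The reshaping step can be inspected on its own with pandas before handing the data to any plotting library. This is a sketch using the sample data from the question; the fixed random seed is an addition for reproducibility.

```python
import numpy as np
import pandas as pd

n = 365
rng = np.random.default_rng(0)  # seeded generator, added here for reproducibility
df = pd.DataFrame(
    {"A": rng.standard_normal(n), "B": rng.standard_normal(n) + 1},
    index=pd.date_range(start="2017-01-01", periods=n, freq="D"),
)

long_df = (
    df.reset_index()  # the unnamed DatetimeIndex becomes a column called "index"
      .melt(id_vars=["index"], value_vars=["A", "B"])  # one row per (date, variable)
      .assign(month=lambda x: x["index"].dt.month)     # month number, 1 to 12
)

print(long_df.head())
```

Resetting the index first is what avoids the "unhashable type: 'DatetimeIndex'" error from the question, because melt expects column names in id_vars rather than the index object itself.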

Can I add y-axis labels on a horizontal barchart using pandas?

I'm using the pandas wrapper around matplotlib to create a horizontal barchart and would like to add labels to the y-axis.
Sadly it doesn't seem to be as simple as just adding a labels=df['Labels'] parameter as we can with pie charts.
import pandas
import matplotlib.pyplot as plt
data = [['A', 1, 2], ['B', 2, 3], ['C', 3, 4]]
df = pandas.DataFrame(data, columns=['Label', 'Col1', 'Col2'])
df.plot(kind='barh')
plt.show()
Is this possible in pandas alone or am I going to have to move into matplotlib?
I've figured out what the problem is. If we set the 'Label' column as the index then the y-axis is labelled automatically.
df = pandas.DataFrame(data, columns=['Label', 'Col1', 'Col2'])
df.index = df['Label']
df.plot(kind='barh')
plt.show()
Modifying the DataFrame is not required. You can set the labels with plt.yticks after you have created the plot:
plt.yticks(range(3),df['Label'])
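Another option that leaves the original DataFrame untouched is to set the index only for the plotting call, since set_index returns a new DataFrame by default. A minimal sketch using the data from the question:

```python
import pandas

data = [['A', 1, 2], ['B', 2, 3], ['C', 3, 4]]
df = pandas.DataFrame(data, columns=['Label', 'Col1', 'Col2'])

# set_index returns a new DataFrame, so df itself keeps its 'Label' column
ax = df.set_index('Label').plot(kind='barh')

# The y tick labels now come from the 'Label' values
labels = [tick.get_text() for tick in ax.get_yticklabels()]
```

Call matplotlib.pyplot.show() afterwards to display the chart.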
