Where does this bottom margin come from? - python

I'm cramming lots of small line charts onto one single figure. Sometimes I am left with a relatively large bottom margin, depending on my data. This is not specific to subplots but can also happen for only one axes. An example:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.Series([1, 2, 2, 4, 5], index=pd.date_range('2023', periods=5))
df = df.drop_duplicates() # Without gaps all is well
fig = plt.figure()
plt.subplots_adjust(0, 0, 1, 1) # No margins
# ... Lots of stuff/subplots might happen here...
df.plot(xticks=[]) # Depending on df, leaves a bottom margin
plt.show()
This leaves a large margin at the bottom:
Why is this? Is there a workaround?

After some digging I found the cause myself. It turns out that pandas treats a date x-axis specially (see format_date_labels): unless the date range is completely regular (no gaps), it explicitly sets bottom=0.2 via fig.subplots_adjust, at least when the gap occurs in one of the bottom subplots.
This suggests a simple workaround:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.Series([1, 2, 2, 4, 5], index=pd.date_range('2023', periods=5))
df = df.drop_duplicates() # Without gaps all is well
fig = plt.figure()
plt.subplots_adjust(0, 0, 1, 1) # No margins
# ... Lots of stuff/subplots might happen here...
df.plot(xticks=[]) # Depending on df, leaves a bottom margin
plt.subplots_adjust(bottom=0) # Fix!
plt.show()
Now the result has no margin as expected:
I'm unsure whether this behavior is a pandas bug. I thought it would be a good idea to document the workaround. At least for me, as I will probably find it here in the future.
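One way to confirm this diagnosis is to read the figure's subplot parameters before and after plotting. The following is a minimal sketch, assuming a headless Agg backend and a small made-up series with a gap in its dates. The value pandas leaves behind after plotting may differ between pandas versions, so it is only printed here, not relied on.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up series with a gap (2023-01-03 is missing), i.e. an irregular date index.
dates = pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-04", "2023-01-05"])
s = pd.Series([1, 2, 4, 5], index=dates)

fig = plt.figure()
fig.subplots_adjust(left=0, bottom=0, right=1, top=1)  # no margins
bottom_before = fig.subplotpars.bottom

ax = fig.add_subplot(111)
s.plot(ax=ax, xticks=[])
bottom_after_plot = fig.subplotpars.bottom  # pandas may have changed this

fig.subplots_adjust(bottom=0)  # the workaround from the answer
bottom_fixed = fig.subplotpars.bottom

print(bottom_before, bottom_after_plot, bottom_fixed)
```

If the middle value differs from the first, the plot call adjusted the margin, and the final subplots_adjust call restores it.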

Related

How to change seaborn violinplot legend labels?

I'm using seaborn to make a violinplot, which uses hues to identify who survived and who didn't. This is given by the column 'DEATH_EVENT', where 0 means the person survived and 1 means they didn't. The only issue I'm having is that I can't figure out how to set labels for this hue legend. As seen below, 'DEATH_EVENT' presents 0 and 1, but I want to change this into 'Survived' and 'Not survived'.
Current code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
sns.set()
plt.style.use('seaborn')
data = pd.read_csv('heart_failure_clinical_records_dataset.csv')
g = sns.violinplot(data=data, x='smoking', y='age', hue='DEATH_EVENT')
g.set_xticklabels(['No smoking', 'Smoking'])
I tried g.legend(labels=['Survived', 'Not survived']), but the resulting legend loses the colors and shows a thin and a thick line instead, for some reason.
I'm aware I could just use:
data['DEATH_EVENT'].replace({0:'Survived', 1:'Not survived'}, inplace=True)
but I wanted to see if there was another way. I'm still a rookie, so I'm guessing that there's a reason why the CSV's author made it so that it uses integers to describe plenty of things. Ex: if someone smokes or not, sex, diabetic or not, etc. Maybe it runs faster?
Controlling Seaborn legends is still somewhat tricky (some extensions to matplotlib's API would be helpful). In this case, you could grab the handles from the just-created legend and reuse them for a new legend:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.DataFrame({"smoking": np.random.randint(0, 2, 200),
"survived": np.random.randint(0, 2, 200),
"age": np.random.normal(60, 10, 200),
"DEATH_EVENT": np.random.randint(0, 2, 200)})
ax = sns.violinplot(data=data, x='smoking', y='age', hue='DEATH_EVENT')
ax.set_xticklabels(['No smoking', 'Smoking'])
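# Note: Matplotlib 3.7 renamed this attribute to legend_handles (legendHandles is deprecated there)
ax.legend(handles=ax.legend_.legendHandles, labels=['Survived', 'Not survived'])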
Here is an approach that makes the change via the dataframe without modifying the original dataframe. To avoid accessing ax.legend_ altogether (for example, to remove the legend title), one trick is to rename the column to an empty string and use that empty string for hue. Unless the dataframe is very long (millions of rows), the speed and memory overhead are quite modest.
names = {0: 'Survived', 1: 'Not survived'}
ax = sns.violinplot(data=data.replace({'DEATH_EVENT': names}).rename(columns={'DEATH_EVENT': ''}),
x='smoking', y='age', hue='')
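The relabel-a-copy idea can also be written with DataFrame.assign and Series.map, which leaves the original integer column untouched. A minimal sketch, using a small made-up frame in place of the heart-failure CSV (only the column names follow the question; the values are illustrative):

```python
import pandas as pd

# Made-up stand-in for the CSV from the question.
data = pd.DataFrame({
    "smoking": [0, 1, 0, 1],
    "age": [60.0, 55.0, 70.0, 65.0],
    "DEATH_EVENT": [0, 1, 1, 0],
})

names = {0: "Survived", 1: "Not survived"}

# assign() returns a new frame, so `data` keeps its integer codes.
labeled = data.assign(DEATH_EVENT=data["DEATH_EVENT"].map(names))

print(labeled["DEATH_EVENT"].tolist())
print(data["DEATH_EVENT"].tolist())
```

The relabeled frame can then be passed to sns.violinplot(data=labeled, x='smoking', y='age', hue='DEATH_EVENT') in place of the original.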

Python Plotly: Percentage Axis Formatter

I want to create a diagram from a pandas dataframe where the axes ticks should be percentages.
With matplotlib there is a nice axes formatter which automatically calculates the percentage ticks based on the given maximum value:
Example:
import matplotlib.pyplot as plt
import matplotlib.ticker as pltticker
import numpy as np
import pandas as pd
df = pd.DataFrame( { 'images': np.arange(0, 355, 5) } ) # 71 items in total, max is 350
ax = df.plot()
ax.yaxis.set_major_formatter(pltticker.PercentFormatter(xmax=350))
loc = pltticker.MultipleLocator(base=50) # locator puts ticks at regular intervals
ax.yaxis.set_major_locator(loc)
Since using matplotlib is rather tedious, I want to do the same with Plotly. I only found the option to format the tick labels as percentages, but no 'auto formatter' that calculates the ticks and percentages for me. Is there a way to get automatic percentage ticks, or do I have to calculate them by hand every time (urgh)?
import plotly.express as px
import pandas as pd
fig = px.line(df, x=df.index, y=df.images, labels={'index':'num of users', '0':'num of img'})
fig.layout.yaxis.tickformat = ',.0%' # does not help
fig.show()
Thank you for any hints.
I'm not sure there's an axis option for percent, but it's relatively easy to get there by dividing y by its max: y = df.y/df.y.max(). These kinds of calculations, performed right inside the plot call, are really handy, and I use them all the time.
NOTE: if negative values are possible, it gets more complicated (and ugly). Something like y=(df.y-df.y.min())/(df.y.max()-df.y.min()) may be necessary as a more general solution.
Full example:
import plotly.express as px
import pandas as pd
data = {'x': [0, 1, 2, 3, 4], 'y': [0, 1, 4, 9, 16]}
df = pd.DataFrame.from_dict(data)
fig = px.line(df, x=df.x, y=df.y/df.y.max())
#or# fig = px.line(df, x=df.x, y=(df.y-df.y.min())/(df.y.max()-df.y.min()))
fig.layout.yaxis.tickformat = ',.0%'
fig.show()
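If the axis should keep the raw values and only label them as percentages (closer to what PercentFormatter(xmax=350) with MultipleLocator(base=50) does in Matplotlib), one option is to compute the tick positions and labels yourself and hand them to Plotly through the axis properties tickmode, tickvals, and ticktext. A minimal sketch; percent_ticks is a hypothetical helper, not part of Plotly:

```python
import numpy as np

def percent_ticks(xmax, step):
    """Build axis settings that label raw values as a percentage of xmax.

    Hypothetical helper: returns a dict using the Plotly axis property
    names tickmode, tickvals and ticktext.
    """
    tickvals = np.arange(0, xmax + step, step)
    tickvals = tickvals[tickvals <= xmax]  # drop any tick past the maximum
    ticktext = [f"{v / xmax:.0%}" for v in tickvals]
    return {"tickmode": "array", "tickvals": tickvals.tolist(), "ticktext": ticktext}

axis = percent_ticks(xmax=350, step=50)
print(axis["tickvals"])
print(axis["ticktext"])
```

The resulting dictionary could be applied with fig.update_yaxes(**axis), which leaves the plotted data unscaled.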

Changing the order of pandas/matplotlib line plotting without changing data order

Given the following example:
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
df.plot(linewidth=10)
The order of plotting puts the last column on top:
How can I make this keep the data & legend order but change the behaviour so that it plots X on top of Y on top of Z?
(I know I can change the data column order and edit the legend order but I am hoping for a simpler easier method leaving the data as is)
UPDATE: final solution used:
(Thanks to r-beginners.) I used get_lines to modify the z-order of each plotted line:
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
fig = plt.figure()
ax = fig.add_subplot(111)
df.plot(ax=ax, linewidth=10)
lines = ax.get_lines()
for i, line in enumerate(lines, -len(lines)):
    line.set_zorder(abs(i))
fig
In a notebook produces:
Get the default zorder and sort it in the desired order.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
np.random.seed(2021)
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
ax = df.plot(linewidth=10)
l = ax.get_children()
print(l)
l[0].set_zorder(3)
l[1].set_zorder(1)
l[2].set_zorder(2)
Before definition
After defining zorder
I will just put this answer here because it is a solution to the problem, but probably not the one you are looking for.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# generate data
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
# read columns in reverse order and plot them
# so normally, the legend will be inverted as well, but if we invert it again, you should get what you want
df[df.columns[::-1]].plot(linewidth=10, legend="reverse")
Note that in this example, you don't change the order of your data, you just read it differently, so I don't really know if that's what you want.
You can also make it easier on the eyes by creating a corresponding method.
def plot_dataframe(df: pd.DataFrame) -> None:
    df[df.columns[::-1]].plot(linewidth=10, legend="reverse")
# then you just have to call this
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
plot_dataframe(df)
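The z-order approach can also be wrapped in a small helper so that the first column ends up on top while the data order stays as it is. A minimal sketch, assuming a headless Agg backend; first_on_top is a hypothetical name:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

def first_on_top(ax):
    """Give earlier lines a higher z-order than later ones."""
    lines = list(ax.get_lines())
    base = max(line.get_zorder() for line in lines)
    # The last line keeps the base z-order; each earlier line gets one more.
    for rank, line in enumerate(reversed(lines)):
        line.set_zorder(base + rank)
    return {line.get_label(): line.get_zorder() for line in lines}

rng = np.random.default_rng(2021)
df = pd.DataFrame(rng.integers(1, 10, size=(8, 3)), columns=list("XYZ"))
ax = df.plot(linewidth=10)
zorders = first_on_top(ax)
print(zorders)
```

Because only the z-order of the existing Line2D objects changes, the dataframe itself is not reordered.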

Dataframe printing shows only 2 lines instead of the 20 lines

I want to print my dataframe. Unfortunately, the picture shows only 2 rows of the dataframe instead of all 20, the table sits at the bottom, and there is a huge empty area as well. Could someone help me get all 20 rows of the dataframe into the picture?
This is the source: How to save a pandas DataFrame table as a png
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import table # EDIT: see deprecation warnings below
from pathlib import Path
PATH_DATA = "../data"
PATH_RETAILROCKET = Path(PATH_DATA,"retailrocket/retailrocket/events.csv")
# RetailRocket
print(PATH_RETAILROCKET)
df = pd.read_csv(Path(PATH_RETAILROCKET))
df = df.head(20)
ax = plt.subplot(111, frame_on=False) # no visible frame
ax.xaxis.set_visible(False) # hide the x axis
ax.yaxis.set_visible(False) # hide the y axis
table(ax, df) # where df is your data frame
plt.savefig('mytable.png')
Tables are, by default, placed below the area occupied by the axes. Here you have most of the figure area occupied by an (invisible) axes, leaving little room for the table.
There are several ways to fix the issue depending on your desired output.
Here, I'm using the bbox= argument of Table to override the position, and make the table occupy the entirety of the figure.
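import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import table
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),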
columns=['a', 'b', 'c'])
fig = plt.figure()
ax = plt.subplot(111)
ax.axis('off')
table(ax, df, bbox=[0,0,1,1])
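For the 20-row case from the question, the same bbox idea can be combined with a figure whose height grows with the number of rows. A minimal sketch with made-up data standing in for events.csv; the 0.3 inch per row is an arbitrary assumption to tune by eye:

```python
import io
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import table

# Made-up 20-row frame in place of the real data.
df = pd.DataFrame(np.arange(60).reshape(20, 3), columns=["a", "b", "c"])

row_height_in = 0.3  # assumed height per table row, in inches
fig = plt.figure(figsize=(6, row_height_in * (len(df) + 1)))  # +1 for the header
ax = fig.add_subplot(111)
ax.axis("off")
tbl = table(ax, df, bbox=[0, 0, 1, 1])  # table fills the whole axes

# Render to memory instead of a file; swap in a filename to save a PNG.
buf = io.BytesIO()
fig.savefig(buf, format="png")

# Cell keys are (row, column); row 0 is the header, so the max is the row count.
n_rows_drawn = max(row for row, _ in tbl.get_celld())
print(n_rows_drawn)
```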

Simple barplot of column means using seaborn

I have a pandas dataframe with 26 columns of numerical data. I want to represent the mean of each column in a barplot with 26 bars. This is easy to do with pandas plotting function: df.plot(kind = 'bar'). However, the results are ugly and the column labels are often truncated, i.e.:
I'd like to do this with seaborn instead, but can't seem to find a way no matter how hard I look. Surely there's an easy way to do a simple barplot of column averages? Thanks.
You can try something like this:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
ax = df.mean().plot(kind='bar')  # Series.plot returns a matplotlib Axes
plt.margins(0.02)
plt.ylabel('Your y-label')
plt.xlabel('Your x-label')
ax.set_xticklabels(df.columns, rotation = 45, ha="right")
plt.show()
If anyone finds this by a search, the easiest solution I've found (I'm OP) is to use the pandas.melt() function. This stacks all the columns into a single value column and adds a second column that preserves the original column title next to each value. The resulting dataframe can be passed directly to seaborn.
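A minimal sketch of the melt step, using a tiny made-up frame; the names "column" and "value" are arbitrary choices for the two output columns:

```python
import pandas as pd

df = pd.DataFrame({"x": [0, 1], "y": [2, 3], "z": [4, 8]})

# Long format: one row per (original column, value) pair.
long_df = df.melt(var_name="column", value_name="value")

# Sanity check: grouping the long frame reproduces the column means.
means_from_long = long_df.groupby("column")["value"].mean()

print(long_df)
print(means_from_long)
```

Passing this long frame as sns.barplot(data=long_df, x='column', y='value') would draw one bar per original column, since barplot aggregates each category with the mean by default.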
You can use sns.barplot - especially for horizontal barplots more suitable for so many categories - like this:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]})
unstacked = df.unstack().to_frame()
sns.barplot(
y=unstacked.index.get_level_values(0),
x=unstacked[0]);
df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]})
sns.barplot(x = df.mean().index, y = df.mean())
plt.show()
