I want to create a diagram from a pandas dataframe where the axes ticks should be percentages.
With matplotlib there is a nice axes formatter which automatically calculates the percentage ticks based on the given maximum value:
Example:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as pltticker
import pandas as pd
df = pd.DataFrame( { 'images': np.arange(0, 355, 5) } ) # 71 items in total, max is 350
ax = df.plot()
ax.yaxis.set_major_formatter(pltticker.PercentFormatter(xmax=350))
loc = pltticker.MultipleLocator(base=50) # locator puts ticks at regular intervals
ax.yaxis.set_major_locator(loc)
Since using matplotlib is rather tedious, I want to do the same with Plotly. I only found the option to format the tick labels as percentages, but no 'auto formatter' that calculates the ticks and percentages for me. Is there a way to use automatic percentage ticks, or do I have to calculate them by hand every time (urgh)?
import plotly.express as px
import pandas as pd
fig = px.line(df, x=df.index, y=df.images, labels={'index':'num of users', 'images':'num of img'})
fig.layout.yaxis.tickformat = ',.0%' # does not help
fig.show()
Thank you for any hints.
I'm not sure there's an axis option for percent, but it's relatively easy to get there by dividing y by its max: y = df.y/df.y.max(). These kinds of calculations, performed right inside the plot call, are really handy and I use them all the time.
NOTE: if negative values are possible, it gets more complicated (and ugly). Something like y=(df.y-df.y.min())/(df.y.max()-df.y.min()) may be necessary, and it is the more general solution.
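A quick way to see the difference between the two formulas is to run both on a small series that contains a negative value. The sample numbers below are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample with a negative value (not from the original post)
s = pd.Series([-4, 0, 6, 16])

# Dividing by the max alone keeps the negative value as a negative fraction
by_max = s / s.max()

# Min-max scaling maps the smallest value to 0 and the largest to 1
min_max = (s - s.min()) / (s.max() - s.min())

print(by_max.tolist())
print(min_max.tolist())
```

With the first form the axis would show negative percentages; with the second it always spans 0% to 100%, at the cost of 0% no longer meaning a raw value of zero.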
Full example:
import plotly.express as px
import pandas as pd
data = {'x': [0, 1, 2, 3, 4], 'y': [0, 1, 4, 9, 16]}
df = pd.DataFrame.from_dict(data)
fig = px.line(df, x=df.x, y=df.y/df.y.max())
#or# fig = px.line(df, x=df.x, y=(df.y-df.y.min())/(df.y.max()-df.y.min()))
fig.layout.yaxis.tickformat = ',.0%'
fig.show()
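If you would rather keep the raw values on the axis and only relabel the ticks, one option is to compute tick positions and labels yourself. The helper below is a hypothetical sketch (the name percent_ticks and its parameters are my own, not part of Plotly); it only builds two plain lists.

```python
def percent_ticks(max_value, step_percent=10):
    """Return (tickvals, ticktext): tick positions in data units, labels as percent of max_value."""
    percents = range(0, 101, step_percent)
    tickvals = [max_value * p / 100 for p in percents]
    ticktext = [f"{p}%" for p in percents]
    return tickvals, ticktext

# Example with the maximum from the question (350) and a tick every 25%
tickvals, ticktext = percent_ticks(350, step_percent=25)
print(tickvals)
print(ticktext)
```

The two lists should be usable with fig.update_yaxes(tickvals=tickvals, ticktext=ticktext), in which case the tickformat line is not needed because the labels are already strings.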
I want to plot a histogram with row and column facets using plotly.express.histogram() where each subplot gets its own x- and y-axis (for better readability). When looking at the documentation (e.g. go to section "Histogram Facet Grids") I can see a lot of examples where the x- and y-axes are repeated. But in my case, this is somehow not done automatically.
import numpy as np
import pandas as pd
import plotly.express as px
# create a dummy dataframe with lots of variables
rng = np.random.default_rng(42)
n_vars = 3
n_samples = 10
random_vars = [rng.normal(size=n_samples) for v in range(n_vars)]
m = np.vstack(random_vars).T
columns = pd.MultiIndex.from_tuples([('a','b'),('a','c'),('b','c')],names=['src','tgt'])
df = pd.DataFrame(m,columns=columns)
# convert to long format
df_long = df.melt()
# plot with plotly
fig = px.histogram(df_long,x='value',facet_row='src',facet_col='tgt')
fig.update_layout(yaxis={'side': 'left'})
fig.show()
which gives me:
How do I post-hoc configure the figure so that the x- and y-axis are shown for each subplot?
All you need to do is to customize each y and x axis by:
fig.for_each_yaxis(lambda y: y.update(showticklabels=True,matches=None))
fig.for_each_xaxis(lambda x: x.update(showticklabels=True,matches=None))
Output
My data is in a dataframe of two columns: y and x. The data refers to the past few years. Dummy data is below:
np.random.seed(167)
rng = pd.date_range('2017-04-03', periods=365*3)
df = pd.DataFrame(
{"y": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365*3)]),
"x": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365*3)])
}, index=rng
)
In first attempt, I plotted a scatterplot with Seaborn using the following code:
import seaborn as sns
import matplotlib.pyplot as plt
def plot_scatter(data, title, figsize):
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_title(title)
    sns.scatterplot(data=data,
                    x=data['x'],
                    y=data['y'])
plot_scatter(data=df, title='dummy title', figsize=(10,7))
However, I would like to generate a 4x3 matrix including 12 scatterplots, one for each month with year as hue. I thought I could create a third column in my dataframe that tells me the year and I tried the following:
import seaborn as sns
import matplotlib.pyplot as plt
def plot_scatter(data, title, figsize):
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_title(title)
    sns.scatterplot(data=data,
                    x=data['x'],
                    y=data['y'],
                    hue=data.iloc[:, 2])
df['year'] = df.index.year
plot_scatter(data=df, title='dummy title', figsize=(10,7))
While this allows me to see the years, it still shows all the data in the same scatterplot instead of creating multiple scatterplots, one for each month, so it's not offering the level of detail I need.
I could slice the data by month and build a for loop that plots one scatterplot per month but I actually want a matrix where all the scatterplots use similar axis scales. Does anyone know an efficient way to achieve that?
To create multiple subplots at once, seaborn introduces figure-level functions. The col= argument indicates which column of the dataframe should be used to identify the subplots. col_wrap= can be used to tell how many subplots go next to each other before starting an additional row.
Note that you shouldn't create a figure, as the function creates its own new figure. It uses the height= and aspect= arguments to tell the size of the individual subplots.
The code below uses a sns.relplot() on the months. An extra column for the months is created; it is made categorical to fix an order.
To remove the month= in the title, you can loop through the generated axes (a recent seaborn version is needed for axes_dict). With sns.set(font_scale=...) you can change the default sizes of all texts.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(167)
dates = pd.date_range('2017-04-03', periods=365 * 3, freq='D')
df = pd.DataFrame({"y": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365 * 3)]),
"x": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365 * 3)])
}, index=dates)
df['year'] = df.index.year
month_names = pd.date_range('2017-01-01', periods=12, freq='M').strftime('%B')
df['month'] = pd.Categorical.from_codes(df.index.month - 1, month_names)
sns.set(font_scale=1.7)
g = sns.relplot(kind='scatter', data=df, x='x', y='y', hue='year', col='month', col_wrap=4, height=4, aspect=1)
# optionally remove the `month=` in the title
for name, ax in g.axes_dict.items():
    ax.set_title(name)
plt.setp(g.axes, xlabel='', ylabel='') # remove all x and y labels
g.axes[-2].set_xlabel('x', loc='left') # set an x label at the left of the second to last subplot
g.axes[4].set_ylabel('y') # set a y label to 5th subplot
plt.subplots_adjust(left=0.06, bottom=0.06) # set some more spacing at the left and bottom
plt.show()
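Why the month column is made categorical can be seen without any plotting. The sketch below uses calendar.month_name for the month names (an assumption on my part; it yields English names in the default locale) and compares the category order with plain alphabetical sorting.

```python
import calendar
import pandas as pd

dates = pd.date_range('2017-04-03', periods=365, freq='D')

# 'January' ... 'December' in calendar order (English in the default locale)
month_names = list(calendar.month_name)[1:]

# Codes 0..11 index into month_names, so the category order is the calendar order
months = pd.Categorical.from_codes(dates.month - 1, categories=month_names)

# Plain strings would be ordered alphabetically instead
alphabetical = sorted(set(dates.strftime('%B')))

print(list(months.categories)[:3])
print(alphabetical[:3])
```

The category order is fixed by the categories list, whereas plain strings carry no calendar order at all: sorted alphabetically they start with April and August.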
I would like to change this from a line of regression to a curve. Also to have the line reach either side of the graph. Here is my code:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = {'Days': [5, 10, 15, 20],
'Impact': [33.7561, 30.6281, 29.5748, 29.0482]
}
a = pd.DataFrame (data, columns = ['Days','Impact'])
print (a)
ax = sns.barplot(data=a, x='Days', y='Impact', color='lightblue' )
# put bars in background:
for c in ax.patches:
    c.set_zorder(0)
# plot regplot with numbers 0,..,len(a) as x value
ax = sns.regplot(x=np.arange(0,len(a)), y=a['Impact'], marker="+")
sns.despine(offset=10, trim=False)
ax.set_ylabel("")
ax.set_xticklabels(['5', '10','15','20'])
plt.show()
Alternatively, I would prefer to do it in matplotlib as a scatter plot instead of bar chart. Here is an example in excel, but ideally to have the curve extend beyond the outside markers at least a little.
Can anyone help?
So before anyone says, I'm not trying to create a horizontal bar plot. I'm trying to make a scatter graph that categorises the different plots based on the y values.
So this is my current code:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import datetime
import random
f = []
for i in range(10):
    f.append(random.randint(60,80))
df = pd.DataFrame({
"Weight": f, "Dates": ["01/12/20", "05/11/20", "12/02/20", "18/09/20", "22/04/20", "19/01/20", "18/02/20", "02/01/20", "28/11/20", "26/03/20"]
}, columns=["Weight", "Dates"])
df["Dates"] = pd.to_datetime(df["Dates"])
df.sort_values(by="Dates", inplace=True, ascending=True)
sns.set_theme(style="dark")
dates = [datetime.datetime.date(x) for x in df["Dates"]]
graph = sns.stripplot(data=df, x=dates, y="Weight")
graph.set_xticklabels(graph.get_xticklabels(), rotation=45)
plt.show()
So this is the current output:
But I want to be able to add some bars so I can categorise the data like (sorry for my poor drawing):
I still want to see the points afterwards, but I don't care about what colour they are.
I don't know if this is possible, but thanks!
EDIT: Answered by tmdavidson in comments.
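I would recommend axhspan, which was made for this very purpose:
bands = [77.5,72.5,67.5,60]
# one colour per band edge; plt.cm.tab10 avoids the deprecated plt.cm.get_cmap
colors = plt.cm.tab10(range(len(bands)))
# pair each edge with the next one to get the limits of each band
for y1,y2,c in zip(bands[0:], bands[1:], colors):
    graph.axhspan(ymin=y1, ymax=y2, color=c, zorder=0, alpha=0.5)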
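For reference, here is a self-contained sketch of the same idea on a plain matplotlib axes. The data points and band boundaries are hypothetical, and the Agg backend is selected only so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to get a window
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(range(5), [62, 66, 70, 74, 79])  # hypothetical weights

edges = [60, 67.5, 72.5, 77.5, 80]  # hypothetical band boundaries, ascending
colors = plt.cm.tab10(range(len(edges) - 1))  # one colour per band

spans = []
for lower, upper, color in zip(edges[:-1], edges[1:], colors):
    # zorder=0 keeps the band behind the points; alpha keeps the points readable
    spans.append(ax.axhspan(lower, upper, color=color, alpha=0.3, zorder=0))

print(len(spans))
```

Zipping the edge list with itself shifted by one gives one (lower, upper) pair per band, so n edges produce n - 1 bands.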
I'm trying to plot a pandas series with a 'pandas.tseries.index.DatetimeIndex'. The x-axis labels stubbornly overlap, and I cannot make them presentable, even with several suggested solutions.
I tried a Stack Overflow solution suggesting autofmt_xdate, but it doesn't help.
I also tried the suggestion to use plt.tight_layout(), which has no effect.
ax = test_df[(test_df.index.year ==2017) ]['error'].plot(kind="bar")
ax.figure.autofmt_xdate()
#plt.tight_layout()
print(type(test_df[(test_df.index.year ==2017) ]['error'].index))
UPDATE: That I'm using a bar chart is an issue. A regular time-series plot shows nicely-managed labels.
A pandas bar plot is a categorical plot. It shows one bar for each index at integer positions on the scale. Hence the first bar is at position 0, the next at 1, etc. The labels correspond to the dataframe's index. If you have 100 bars, you'll end up with 100 labels. This makes sense because pandas cannot know if those should be treated as categories or ordinal/numeric data.
If instead you use a normal matplotlib bar plot, it will treat the dataframe index numerically. This means the bars have their position according to the actual dates and labels are placed according to the automatic ticker.
import pandas as pd
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
datelist = pd.date_range('2017-01-01', periods=42).tolist()  # pd.datetime no longer exists in recent pandas
df = pd.DataFrame(np.cumsum(np.random.randn(42)),
columns=['error'], index=pd.to_datetime(datelist))
plt.bar(df.index, df["error"].values)
plt.gcf().autofmt_xdate()
plt.show()
An additional advantage is that matplotlib.dates locators and formatters can be used, e.g. to label the first and fifteenth of each month with a custom format:
import pandas as pd
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
datelist = pd.date_range('2017-01-01', periods=93).tolist()  # pd.datetime no longer exists in recent pandas
df = pd.DataFrame(np.cumsum(np.random.randn(93)),
columns=['error'], index=pd.to_datetime(datelist))
plt.bar(df.index, df["error"].values)
plt.gca().xaxis.set_major_locator(mdates.DayLocator((1,15)))
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%d %b %Y"))
plt.gcf().autofmt_xdate()
plt.show()
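The difference between the two kinds of bar plot described above can be checked by inspecting the tick positions directly. This is a small sketch with made-up data; the Agg backend is used only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to get a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

idx = pd.date_range("2017-01-01", periods=10)
s = pd.Series(np.arange(10), index=idx)

fig, (ax1, ax2) = plt.subplots(2, 1)
s.plot(kind="bar", ax=ax1)   # pandas: categorical, one tick per bar
ax2.bar(s.index, s.values)   # matplotlib: bars placed on a real date axis

pandas_ticks = ax1.get_xticks()
mpl_ticks = ax2.get_xticks()

print(pandas_ticks)            # integer positions 0..9
print(len(ax1.get_xticklabels()))
print(mpl_ticks)               # date numbers chosen by the automatic date locator
```

The pandas axes has exactly one tick and one label per bar, while the matplotlib axes leaves the number of ticks to its date locator.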
In your situation, the easiest would be to manually create labels and spacing, and apply that using ax.xaxis.set_major_formatter.
Here's a possible solution:
Since no sample data was provided, I tried to mimic the structure of your dataset in a dataframe with some random numbers.
The setup:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
# A dataframe with random numbers to run tests on
np.random.seed(123456)
rows = 100
df = pd.DataFrame(np.random.randint(-10,10,size=(rows, 1)), columns=['error'])
datelist = pd.date_range('2017-01-01', periods=rows).tolist()  # pd.datetime no longer exists in recent pandas
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
test_df = df.copy(deep = True)
# Plot of data that mimics the structure of your dataset
# Pass figsize to plot(); calling plt.figure() afterwards would open a second, empty figure
ax = test_df[(test_df.index.year ==2017) ]['error'].plot(kind="bar", figsize=(15,8))
ax.figure.autofmt_xdate()
A possible solution:
test_df = df.copy(deep = True)
# Pass figsize to plot(); calling plt.figure() afterwards would open a second, empty figure
ax = test_df[(test_df.index.year ==2017) ]['error'].plot(kind="bar", figsize=(15,8))
# Make a list of empty myLabels
myLabels = ['']*len(test_df.index)
# Set labels on every 20th element in myLabels
myLabels[::20] = [item.strftime('%Y - %m') for item in test_df.index[::20]]
ax.xaxis.set_major_formatter(ticker.FixedFormatter(myLabels))
ax.figure.autofmt_xdate()
# Tilt the labels
plt.setp(ax.get_xticklabels(), rotation=30, fontsize=10)
plt.show()
You can easily change the formatting of labels by checking strftime.org
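The label-thinning step can also be pulled out into a small helper and checked on its own. The function name sparse_labels and its parameters are hypothetical, chosen here for illustration.

```python
import pandas as pd

def sparse_labels(index, step, fmt="%Y - %m"):
    """Return one label per index entry, blank except for every step-th entry."""
    labels = [""] * len(index)
    labels[::step] = [ts.strftime(fmt) for ts in index[::step]]
    return labels

idx = pd.date_range("2017-01-01", periods=100)
labels = sparse_labels(idx, step=20)

print(len(labels))
print([label for label in labels if label])
```

The list keeps one entry per bar, which is what FixedFormatter expects here, and only the format string needs to change to get a different date style.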