Matplotlib pandas Quarterly bar chart with datetime as index not Working - python

I have a pandas Series with a datetime index, which I am trying to visualize as a bar graph.
My code is below, but the chart I get (pic below) does not look right. How do I fix this?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(100)
dti = pd.date_range('2012-12-31', periods=30, freq='Q')
s2 = pd.Series(np.random.randint(100,1000,size=(30)),index=dti)
df4 = s2.to_frame(name='count')
print('\ndf4:')
print(df4)
print(type(df4))
f2 = plt.figure("Quarterly",figsize=(10,5))
ax = plt.subplot(1,1,1)
ax.bar(df4.index,df4['count'])
plt.tight_layout()
plt.show()

Unfortunately, matplotlib's bar plots don't seem to play along very happily with pandas dates.
In theory, matplotlib expresses the bar widths in days. But if you try something like ax.bar(df4.index,df4['count'], width=30), you'll see the plot with extremely wide bars, almost completely filling the plot. Experimenting with the width, something weird happens. When width is smaller than 2, it looks like it is expressed in days. But with the width larger than 2, it suddenly jumps to something much wider.
On my system (matplotlib 3.1.2, pandas 0.25.3, Windows) it looks like:
A workaround uses the bar plots from pandas. These seem to make the bars categorical, with one tick per bar. But they get labelled with a full date including hours, minutes and seconds. You could relabel them, for example like:
df4.plot.bar(y='count', width=0.9, ax=ax)
plt.xticks(range(len(df4.index)),
[t.to_pydatetime().strftime("%b '%y") for t in df4.index],
rotation=90)
Investigating further, the inconsistent jumping around of matplotlib's bar width seems related to the frequency built into pandas times. So, a solution could be to convert the dates to matplotlib dates. Trying this, yes, the widths get expressed consistently in days.
Unfortunately, the quarterly dates don't have exactly the same number of days between them, resulting in some bars too wide, and others too narrow. A solution to this next problem is explicitly calculating the number of days for each bar. In order to get nice separations between the bars, it helps to draw their edges in white.
from datetime import datetime
x = [datetime.date(t) for t in df4.index] # convert the pandas Timestamps to plain Python dates, which matplotlib handles directly
widths = [t1-t0 for t0, t1 in zip(x, x[1:])] # time differences between dates
widths += [widths[-1]] # the very last bar didn't get a width, just repeat the last width
ax.bar(x, df4['count'], width=widths, edgecolor='white')
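As a minimal sketch of the same idea with widths computed as plain floats (in days), the dates can be converted with matplotlib.dates.date2num first. The five dates and the counts below are made-up illustration values, not the data from the question, and the dates are listed explicitly to avoid depending on a particular pandas frequency alias.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up quarter-end dates and counts, for illustration only
dti = pd.to_datetime(['2012-12-31', '2013-03-31', '2013-06-30',
                      '2013-09-30', '2013-12-31'])
counts = np.array([108, 892, 380, 935, 179])

# Convert to matplotlib date numbers (floats measured in days)
x = mdates.date2num(dti.to_pydatetime())

# Width of each bar = number of days until the next date;
# the last bar has no successor, so it reuses the previous gap
gaps = np.diff(x)
widths = np.append(gaps, gaps[-1])

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(x, counts, width=widths, edgecolor='white')
ax.xaxis_date()      # interpret the float x values as dates
fig.autofmt_xdate()  # rotate the date labels
# call plt.show() or fig.savefig(...) to view the result
```

Because the x values are already floats, the width is unambiguous here: it is a number of days. The bars remain centered on each date, which is the ax.bar default.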

You can set the width of the bars via the width argument in ax.bar() to some value larger than the default of 0.8:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(100)
dti = pd.date_range('2012-12-31', periods=30, freq='Q')
s2 = pd.Series(np.random.randint(100,1000,size=(30)),index=dti)
df4 = s2.to_frame(name='count')
f2 = plt.figure("Quarterly",figsize=(10,5))
ax = plt.subplot(1,1,1)
ax.bar(df4.index,df4['count'], width=70)
plt.tight_layout()
plt.show()
In this case the width is interpreted as a scalar in days.
Edit
For some reason the above only works correctly for older versions of matplotlib (tested with 2.2.3). To work with the current version (3.1.2), the following modification must be made:
# ...
dti = pd.date_range('2012-12-31', periods=30, freq='Q')
dti = [pd.to_datetime(t) for t in dti]
# ...
which will then produce the correct behavior in setting the widths of the bars.

Related

How to create a stacked barchart for a large dataset in Python?

For research purposes at my university, I need to create a stacked bar chart for speech data. I would like to represent the hours of speech on the y-axis and the frequency on the x-axis. The speech comes from different components, hence the stacked part of the chart. The data resides in a Pandas dataframe, which has a lot of columns, but the important ones are "component", "hours" and "ps_med_frequency" which are used in the graph.
A simplified view of the DF (it has 6.2k rows and 120 columns, a-k components):
component  filename   ps_med_freq (rounded to integer)  hours (length)  ...
a          fn0001_ps  230                               0.23
b          fn0002_ps  340                               0.12
c          fn003_ps   278                               0.09
I have already tried this with matplotlib, seaborn or just the plot method from the Pandas dataframe itself. None seem to work properly.
A snippet of seaborn code I have tried:
sns.barplot(data=meta_dataframe, x='ps_med_freq', y='hours', hue='component', dodge=False)
And basically all variations of this as well.
Below you can see one of the most "viable" results I've had so far:
[image: example of failed graph]
It seems to have a lot of inexplicable grey blobs, which I first attributed to the large dataset, but if I just plot it as a histogram and count the frequencies instead of showing them by hour, it works perfectly fine. Does anyone know a solution to this?
Thanks in advance!
P.S.: Yes, I realise this is a huge dataset and at first sight, the graph seems useless with that much data on it, but matplotlib has interactive graphs where you can zoom etc., where this kind of graph becomes useful for my purpose.
With sns.barplot you're creating a bar for each individual frequency value. You'll probably want to group similar frequencies together, as with sns.histplot(..., multiple='stack'). If you want a lot of detail, you can increase the number of bins for the histogram. Note that sns.barplot never creates stacks, it would just plot each bar transparently on top of the others.
You can create a histogram, using the hours as weights, so they get summed.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# create some suitable random test data
np.random.seed(20230104)
component_prob = np.random.uniform(0.1, 2, 7)
component_prob /= component_prob.sum()
df = pd.DataFrame({'component': np.random.choice([*'abcdefg'], 6200, p=component_prob),
'ps_med_freq': (np.random.normal(0.05, 1, 6200).cumsum() + 200).astype(int),
'hours': np.random.randint(1, 39, 6200) * .01})
# create bins for every range of 10, suitably rounded, and shifted by 0.001 to avoid floating point roundings
bins = np.arange(df['ps_med_freq'].min() // 10 * 10 - 0.001, df['ps_med_freq'].max() + 10, 10)
plt.figure(figsize=(16, 5))
ax = sns.histplot(data=df, x='ps_med_freq', weights='hours', hue='component', palette='bright',
multiple='stack', bins=bins)
# sns.move_legend(ax, loc='upper left', bbox_to_anchor=(1.01, 1.01)) # legend outside
sns.despine()
plt.tight_layout()
plt.show()
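For comparison, the same "sum the hours per frequency bin and component" logic can be sketched with pandas alone. This is only an illustrative alternative, not part of the answer above; the test data, the bin edges and the freq_bin column name are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Small made-up data set with the three relevant columns
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'component': rng.choice(list('abc'), 300),
    'ps_med_freq': rng.integers(150, 400, 300),   # values 150..399
    'hours': rng.integers(1, 39, 300) * 0.01,
})

# Assign every row to a 10-wide frequency bin: [150, 160), [160, 170), ...
bins = np.arange(150, 401, 10)
df['freq_bin'] = pd.cut(df['ps_med_freq'], bins=bins, right=False)

# Sum the hours per (bin, component), then put components in columns
stacked = (df.groupby(['freq_bin', 'component'], observed=False)['hours']
             .sum()
             .unstack(fill_value=0))

# One stacked bar per bin; bar height is the total hours in that bin
ax = stacked.plot.bar(stacked=True, width=1.0, figsize=(12, 4))
ax.set_ylabel('hours')
```

The resulting table makes the aggregation explicit, which can help when checking the numbers behind the histogram.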

Can I create a time series with a reverse-log axis scale, so more recent dates are wider apart?

In a few recent applications I have plotted some time series where the more recent values are of more interest, but the historical values are good for context. Here's an example, a toy tracker for progress on a written project:
Ideally I would like the more recent dates to be shown more spread out on the x-axis, but still labelled correctly as dates. This would look a bit like a logarithmic scale in reverse. I think this idea might be quite useful for timelines in contexts where progress accelerates over time, e.g. computation power.
I've looked at this matplotlib "timeline" example for inspiration, but can't quite see how to achieve the spacing and labelling I want.
Here is a minimal example to work with:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from io import StringIO
data = '''when,words
2019-11-11,537
2019-12-19,530
2020-03-16,876
2020-08-10,1488
2021-02-05,2222
2022-01-21,2839
2022-03-02,3050
2022-03-04,3296
2022-03-15,3370
2022-03-23,3575
2022-04-25,2711
2022-06-06,3014
2022-06-28,3130
'''
prog = pd.read_csv(StringIO(data), parse_dates=['when'])
fig, ax = plt.subplots(figsize=(6,6))
sns.lineplot(ax=ax,data=prog,x="when",y="words",markers=True,marker="o")
ax.set(ylim=(0,1.1*prog.words.max()),title="Dissertation Progress")
fig.autofmt_xdate()
fig.tight_layout()
Which produces:
You may try something along these lines, i.e. converting your dates to number of days and then applying np.log10() on them:
import numpy as np
prog.loc[:,'days'] = -np.log10(abs((prog.when - prog.iloc[-1].when).dt.days - 100))
g = sns.lineplot(x="days",y="words",data=prog,markers=True,marker="o")
_ = g.set_xticks(prog.days.to_list(), prog.when.dt.date.to_list(), rotation=90)
This yields this plot:
You can then handle overlapping labels by manually adjusting the xticklabels.
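To see why this transform spreads out recent dates, here is a small sketch that computes the transformed positions for a few of the dates from the question. The offset of 100 days matches the constant used above; the variable names are chosen for this illustration only.

```python
import numpy as np
import pandas as pd

# A few dates taken from the question's data
when = pd.to_datetime(['2019-11-11', '2021-02-05', '2022-03-02',
                       '2022-06-06', '2022-06-28'])

offset = 100  # days added so that the most recent date does not hit log10(0)

# Days between each date and the most recent one (0 for the last date)
days_before_last = np.asarray((when[-1] - when).days)

# Reverse-log position: older dates are pushed towards more negative values
pos = -np.log10(days_before_last + offset)

# Axis distance covered per calendar day, between consecutive dates
calendar_gap = -np.diff(days_before_last)
per_day = np.diff(pos) / calendar_gap
```

The per_day values grow towards the end of the series, meaning one calendar day takes up more axis space the closer it is to the most recent date. A smaller offset exaggerates this effect and a larger one flattens it.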

How to modify Matplot chart intervals in Python?

In Python, I am pulling in data from a data frame that should show me the number of COVID-19 cases by date. See example values for three dates:
date: 20201201; positive: 10000
date: 20201202; positive: 10500
date: 20201203; positive: 11000
I am hitting a roadblock when I try to format the plot I created. How can I increase the font size of the x and y axes and modify the intervals so that instead of each individual date being shown, I show only the first day of every month? Note that "date" is currently listed as an object and "positive" is listed as int64.
Also, what does the 121 represent in my code? I picked this up from somewhere else and noticed whenever I change the number, I get an error.
Thanks in advance.
import matplotlib.pyplot as plt
import numpy as np
x = data["date"]
y = data["positive"]
fig = plt.figure(figsize=(75, 25))
# Adds subplot on position 1
ax = fig.add_subplot(121)
ax.plot(x, y)
plt.show()
plt.xlabel('xlabel', fontsize=10) # for the label
plt.xticks(fontsize=10) # for the tick values
I would also suggest using the Seaborn library for plotting data; it is relatively more straightforward than Matplotlib.
Quick resource as an example: https://seaborn.pydata.org/examples/errorband_lineplots.html
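A sketch of one way to get monthly ticks and larger fonts, assuming the "date" column holds strings such as "20201201" (if it holds integers, converting with astype(str) first should work the same way). The data frame below is a made-up stand-in for the one in the question. On the 121: in add_subplot, a three-digit number is shorthand for rows, columns and index, so 121 means a grid of 1 row and 2 columns with this plot in position 1. A single plot filling the figure is 111.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the question's data frame
dates = pd.date_range('2020-12-01', '2021-03-15', freq='D')
data = pd.DataFrame({
    'date': dates.strftime('%Y%m%d'),                 # object dtype, like the question
    'positive': 10000 + 500 * np.arange(len(dates)),
})

# Parse the YYYYMMDD strings into real datetimes so matplotlib can space them
x = pd.to_datetime(data['date'], format='%Y%m%d')
y = data['positive']

fig = plt.figure(figsize=(12, 5))
ax = fig.add_subplot(111)  # 1 row, 1 column, first (and only) position
ax.plot(x, y)

# One major tick on the first day of every month
ax.xaxis.set_major_locator(mdates.MonthLocator(bymonthday=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))

# Larger fonts for tick values and axis labels
ax.tick_params(axis='both', labelsize=14)
ax.set_xlabel('date', fontsize=16)
ax.set_ylabel('positive', fontsize=16)
# call plt.show() to display the figure
```

Note that any font or tick settings need to be applied before plt.show() is called, otherwise they will not affect the figure that was displayed.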

Matplotlib bar chart - overlay bars similar to stacked

I want to create a matplotlib bar plot that has the look of a stacked plot without being additive from a multi-index pandas dataframe.
The below code gives the basic behaviour
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import io
data = io.StringIO('''Fruit,Color,Price
Apple,Red,1.5
Apple,Green,1.0
Pear,Red,2.5
Pear,Green,2.3
Lime,Green,0.5
Lime,Red,3.0
''')
df_unindexed = pd.read_csv(data)
df_unindexed
df = df_unindexed.set_index(['Fruit', 'Color'])
df.unstack().plot(kind='bar')
The plot command df.unstack().plot(kind='bar') shows all the apple prices grouped next to each other. If you choose the option df.unstack().plot(kind='bar',stacked=True) - it adds the prices for Red and Green together and stacks them.
I am wanting a plot that is halfway between the two - it shows each group as a single bar, but overlays the values so you can see them all. The below figure (done in powerpoint) shows what behaviour I am looking for -> I want the image on the right.
Short of calculating all the values and then using the stacked option, is this possible?
This seems (to me) like a bad idea, since this representation leads to several problems. Will a reader understand that those are not stacked bars? What happens when the front bar is taller than the ones behind?
In any case, to accomplish what you want, I would simply repeatedly call plot() on each subset of the data and using the same axes so that the bars are drawn on top of each other.
In your example, the "Red" prices are always higher, so I had to adjust the order to plot them in the back, or they would hide the "Green" bars.
fig,ax = plt.subplots()
my_groups = ['Red','Green']
df_group = df_unindexed.groupby("Color")
for color in my_groups:
    temp_df = df_group.get_group(color)
    temp_df.plot(kind='bar', ax=ax, x='Fruit', y='Price', color=color, label=color)
There are two problems with this kind of plot. (1) What if the background bar is smaller than the foreground bar? It would simply be hidden and not visible. (2) A chart like this is not distinguishable from a stacked bar chart. Readers will have severe problems interpreting it.
That being said, you can plot both columns individually.
import matplotlib.pyplot as plt
import pandas as pd
import io
data = io.StringIO('''Fruit,Color,Price
Apple,Red,1.5
Apple,Green,1.0
Pear,Red,2.5
Pear,Green,2.3
Lime,Green,0.5
Lime,Red,3.0''')
df_unindexed = pd.read_csv(data)
df = df_unindexed.set_index(['Fruit', 'Color']).unstack()
df.columns = df.columns.droplevel()
plt.bar(df.index, df["Red"].values, label="Red")
plt.bar(df.index, df["Green"].values, label="Green")
plt.legend()
plt.show()
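One possible way to soften problem (1) is to draw the front bars narrower than the back bars, so a taller front bar cannot completely hide the one behind it. This is only a sketch of that idea, not something the answers above propose; the widths 0.8 and 0.5 are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Prices from the question, one column per color
prices = pd.DataFrame(
    {'Red': [1.5, 2.5, 3.0], 'Green': [1.0, 2.3, 0.5]},
    index=['Apple', 'Pear', 'Lime'],
)

fig, ax = plt.subplots()
# Back layer: wide bars
ax.bar(prices.index, prices['Red'], width=0.8, color='red', label='Red')
# Front layer: narrower bars drawn on top, sharing the same baseline
ax.bar(prices.index, prices['Green'], width=0.5, color='green', label='Green')
ax.legend()
# call plt.show() to display the figure
```

The different widths also give readers a visual hint that the bars are overlaid rather than stacked.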

How to change the step size matplotlib uses when plotting timestamp objects?

I'm currently attempting to graph a fairly small dataset using the matplotlib and pandas libraries. The format of the dataset is a CSV file. Here is the dataset:
DATE,UNRATE
1948-01-01,3.4
1948-02-01,3.8
1948-03-01,4.0
1948-04-01,3.9
1948-05-01,3.5
1948-06-01,3.6
1948-07-01,3.6
1948-08-01,3.9
1948-09-01,3.8
1948-10-01,3.7
1948-11-01,3.8
1948-12-01,4.0
I loaded the dataset using pandas (as can be seen, the file that holds that dataset is named 'dataset.csv'):
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('dataset.csv')
dataset['DATE'] = pd.to_datetime(dataset['DATE'])
I then attempted to plot the dataset loaded, using matplotlib:
plt.plot(dataset['DATE'], dataset['UNRATE'])
plt.show()
The code above mostly worked fine, and displayed the following graph:
The problem, however, is that the data I wanted displayed on the x axis, seems to have only been plotted in intervals of two:
I found the question "Changing the “tick frequency” on x or y axis in matplotlib?", which does relate to my problem, but from my testing it only seems to work with integral values.
I also found the question, controlling the number of x ticks in pyplot, which seemed to provide a solution to my problem. The method the answer said to use, to_pydatetime, was a method of DatetimeIndex. Since my understanding is that pandas.to_datetime would return a DatetimeIndex by default, I could use to_pydatetime on dataset['DATE']:
plt.xticks(dataset['DATE'].to_pydatetime())
However, I instead received the error:
AttributeError: 'Series' object has no attribute 'to_pydatetime'
Since this appears to just be default behavior, is there a way to force matplotlib to graph each point along the x axis, rather than simply graphing every other point?
To get rid of the error you may convert the dates as follows and also set the labels accordingly:
plt.xticks(dataset['DATE'].tolist(),dataset['DATE'].tolist())
or, as has been mentioned in the comments,
plt.xticks(dataset['DATE'].dt.to_pydatetime(),dataset['DATE'].dt.to_pydatetime())
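The error comes from the return type: pd.to_datetime returns a DatetimeIndex when given a list, but a Series when given a Series (such as a data frame column), and a Series has no to_pydatetime method of its own. The short sketch below illustrates the difference using the first three rows of the dataset; wrapping the column in pd.DatetimeIndex is one assumed way to get back to an object that has to_pydatetime.

```python
import pandas as pd

dataset = pd.DataFrame({
    'DATE': ['1948-01-01', '1948-02-01', '1948-03-01'],
    'UNRATE': [3.4, 3.8, 4.0],
})

# Series in, Series out: no .to_pydatetime() on the result itself
as_series = pd.to_datetime(dataset['DATE'])

# List in, DatetimeIndex out: .to_pydatetime() is available
as_index = pd.to_datetime(dataset['DATE'].tolist())

# Wrapping the Series in a DatetimeIndex gives an array of datetime objects
tick_positions = pd.DatetimeIndex(as_series).to_pydatetime()
```

The resulting tick_positions array can be passed to plt.xticks in the same way as the conversions above.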
But let's look at some more useful options.
Plotting strings
First of all it is possible to plot the data as it is, i.e. as strings.
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('dateunrate.txt')
plt.plot(dataset['DATE'], dataset['UNRATE'])
plt.setp(plt.gca().get_xticklabels(), rotation=45, ha="right")
plt.show()
This is just like plotting plt.plot(["apple", "banana", "cherry"], [1,2,3]). This means that the successive dates are just placed one by one on the axes, regardless of whether they are a minute, a day or a year apart. E.g. if your dates were 2018-01-01, 2018-01-03, 2018-01-27 they would still appear equally spaced on the axes.
Plot dates with pandas (automatically)
Pandas can nicely plot dates out of the box if the dates are in the index of the dataframe. To this end you may read the dataframe in a way that the first csv column is parsed as the index.
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('dateunrate.txt', parse_dates=[0], index_col=0)
dataset.plot()
plt.show()
This is equivalent to
dataset = pd.read_csv('dateunrate.txt', parse_dates=[0])
dataset = dataset.set_index("DATE")
dataset.plot()
or
dataset = pd.read_csv('dateunrate.txt')
dataset["DATE"] = pd.to_datetime(dataset["DATE"])
dataset = dataset.set_index("DATE")
dataset.plot()
or even
dataset = pd.read_csv('dateunrate.txt')
dataset["DATE"] = pd.to_datetime(dataset["DATE"])
dataset.plot(x="DATE",y="UNRATE")
This works nicely in this case because you happen to have one date per month, and pandas will decide to show all 12 months as ticklabels.
For other cases this may result in different tick locations.
Plot dates with matplotlib or pandas (manually)
In the general case, you may use matplotlib.dates formatters and locators to tweak the tick(label)s in the way you want. Here, we might use a MonthLocator and set the ticklabel format to "%b %Y". This works well with matplotlib plot or pandas plot(x_compat=True).
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.dates as mdates
dataset = pd.read_csv('dateunrate.txt', parse_dates=[0], index_col=0)
plt.plot(dataset.index, dataset['UNRATE'])
## or use
#dataset.plot(x_compat=True) #note the x_compat argument
plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%b %Y"))
plt.setp(plt.gca().get_xticklabels(), rotation=45, ha="right")
plt.show()
