Matplotlib WeekdayLocator giving wrong dates/too many ticks - python

I've been working with matplotlib.pyplot to plot some data over date ranges, but have been running across some weird behavior, not too different from this question.
The primary difference between my issue and that one (aside from the suggested fix not working) is they refer to different locators (WeekdayLocator() in my case, AutoDateLocator() in theirs.) As some background, here's what I'm getting:
The expected and typical result, where my data is displayed with a reasonable date range:
And the very occasional result, where the data is given some ridiculous range of about 5 years (from what I can see):
I did some additional testing with a generic matplotlib.pyplot.plot, and the behavior appears whether I create the plot through the module directly or through a subplot:
plt.plot(some plot)
vs.
fig = plt.figure(...)
sub = fig.add_subplot(...)
sub.plot(some plot)
From what I could find, the odd behavior only happens when the data set has a single point (and therefore a single date to plot). The outrageous number of ticks comes from the WeekdayLocator(), which, for some reason, attempts to generate 1635 ticks for the x-axis date range (from about 2013 to 2018), based on this error output:
RuntimeError: RRuleLocator estimated to generate 1635 ticks from
2013-07-11 19:23:39+00:00 to 2018-01-02 00:11:39+00:00: exceeds Locator.MAXTICKS * 2 (20)
(This was from some experimenting with the WeekdayLocator().MAXTICKS member set to 10)
I then tried changing the Locator based on how many date points I had to plot:
# If all the entries in the plot dictionary have <= 1 data point to plot
if all(len(times[comp]) <= 1 for comp in times.keys()):
    sub.xaxis.set_major_locator(md.DayLocator())
else:
    sub.xaxis.set_major_locator(md.WeekdayLocator())
This worked for the edge cases where I'd have a line with 2 points and a line with 1 (or just a point) and wanted the normal ticking, since that didn't get messed up, but it only sort of fixed my problem:
Now I don't have a silly amount of tick marks, but my date range is still 5 years! (Side Note: I also tried using an HourLocator(), but it attempted to generate almost 40,000 tick marks...)
So I guess my question is this: is there some way to rein in the date range explosion when only having one date to plot, or am I at the mercy of a strange bug with Matplotlib's date plotting methods?
What I would like to have is something similar to the first picture, where the date range goes from a little before the first date and a little after the last date. Even if Matplotlib were to fill up the axis range to about match the frequency of ticks in the first image, I would expect it to only span the course of a month or so, not five whole years.
Edit:
Forgot to mention that the range explosion also appears to occur regardless of which Locator I use. Plotting with zero points just results in a blank x-axis (due to no date range at all), a single point gives me the described huge date range, and multiple points/lines give the expected date ranges.
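One possible workaround, sketched here under the assumption that setting the x-limits explicitly is acceptable, is to pad the limits around the data so that a single date never leaves the range up to autoscaling. The helper names and the three-day padding are hypothetical choices for illustration, not part of the original code.

```python
from datetime import datetime, timedelta


def padded_date_limits(dates, pad=timedelta(days=3)):
    """Return (lower, upper) x-limits extending `pad` beyond the data."""
    if not dates:
        raise ValueError("need at least one date")
    return min(dates) - pad, max(dates) + pad


def plot_with_padded_limits(dates, values, pad=timedelta(days=3)):
    """Plot dates against values with explicit, padded x-limits."""
    # Imported lazily so the helper above works without matplotlib installed.
    import matplotlib.pyplot as plt
    import matplotlib.dates as md

    fig, ax = plt.subplots()
    ax.plot(dates, values, marker="o")
    # Explicit limits: a lone point gets a window of 2 * pad around it.
    ax.set_xlim(*padded_date_limits(dates, pad))
    ax.xaxis.set_major_locator(md.DayLocator())
    return fig, ax


# A single data point yields a six-day window instead of a multi-year one.
limits = padded_date_limits([datetime(2015, 6, 1, 12, 0)])
print(limits)
```

With several points the same helper simply extends the range a little before the first date and a little after the last one, which matches the behavior asked for above.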

Related

How to start graph lines at 0 in the Y axis with Bokeh (Python)

I'm using Bokeh to show line graphs in my Django/Python web app.
By default, the graphs start at the minimum value provided, but I want them to start always at 0 in the Y-axis.
For instance, in the following example the Y-axis starts at 167 (the minimum value in that data set), but I want it to start at 0.
y_range seems to work fine if I want to define a minimum and a maximum, but I only want to define the minimum (0) and let the data "decide" the maximum.
I've tried using y_range=(0, None), min_border=0, start=0 and a bunch of other things, without success. ChatGPT keeps recommending me alternatives that don't really work or even exist.
This is my current code:
from bokeh.embed import components
from bokeh.plotting import figure

y = WHATEVER
x = WHATEVER
plot = figure(title='TITLE',
              x_axis_type='datetime',
              sizing_mode="stretch_width",
              max_width=600,
              height=400)
plot.line(x, y, line_width=4)
script, div = components(plot)
ChatGPT keeps recommending me alternatives that don't really work or even exist.
This should not be surprising: ChatGPT is not a serious or reliable source of accurate information.
In any case, the only thing you need to do is set:
plot.y_range.start = 0
with the default range (i.e. don't pass a range value to figure). That will keep auto-ranging for the upper y-axis but pin the start to 0.

How to set a limit to the number of elements that appear on a matplotlib bar chart's x-axis?

Let's say I have the following dataframe:
I have to find the 5 trips with the longest duration, but matplotlib gives me a very ugly graph because there are a lot of entries in the db. Can I somehow limit the number of bars the chart shows? I know there will be more than 1000 bars, but I want matplotlib to show only the first 10 or so.
I have this code, but it doesn't do what I need. It gives me an unreadable chart with every single trips_duration value.
trips_duration = trips_copy.copy()
trips_duration.duration.value_counts().plot(kind='bar', title='longest trips')
Simplest:
trips_duration.duration.nlargest(n=10).plot(kind='bar', title='longest trips')
Note that the keep= argument of nlargest allows you to decide how to break ties, if that matters for your use case.
If you need more options for how to sort, you could use sort_values and then subset the top ten values:
trips_duration.duration.sort_values(ascending=False)[:10].plot(kind='bar', title='longest trips')
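A small sketch with made-up durations may help show why the original value_counts call produced an unreadable chart: value_counts tallies how often each distinct duration occurs, whereas nlargest returns the longest durations themselves. The DataFrame below is invented purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the trips data; only the column name is from the question.
trips = pd.DataFrame({"duration": [5, 42, 17, 42, 8, 99, 3]})

# One entry per distinct duration: this is what made the original chart so crowded.
counts = trips.duration.value_counts()

# The three longest trips; ties keep their first occurrence by default.
top3 = trips.duration.nlargest(3)

print(len(counts), top3.tolist())
```

Calling .plot(kind='bar') on top3 would then draw only three bars.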

Python boxplot fails at automatic plot boundaries/limits

I am manually putting a bunch of boxplots in a plot.
The code I am using is this (I am computing mean_, iqr, CL, etc. elsewhere):
A = np.random.random(2)
D = plt.boxplot(A, positions=np.atleast_1d(dist_val), widths=np.min(unique_dists_vals) / 10.) # a simple case with just one variable to boxplot
D['medians'][0].set_ydata(median_)
D['boxes'][0]._xy[[0,1,4], 1] = iqr[0]
D['boxes'][0]._xy[[2,3],1] = iqr[1]
D['whiskers'][0].set_ydata(np.array([iqr[0], CL[0]]))
D['whiskers'][1].set_ydata(np.array([iqr[1], CL[1]]))
D['caps'][0].set_ydata(np.array([CL[0], CL[0]]))
D['caps'][1].set_ydata(np.array([CL[1], CL[1]]))
I do this in a loop, putting one box plot per some location x.
I am not making any changes to the axis limits. The resulting figure looks like this:
What is going on with the single x-tick?
The limits are off on both x and y.
Is this a bug?
And no, I cannot just manually set the limits etc. since this has to be a completely general code.
What I have tried so far is:
During the loop when I compute the box plots, try keeping track of the largest y value seen so far and the largest x value etc. and then at the end manually set the bound to this. Other issues come up here, however, such as boxes extending beyond the plot etc. and then I manually have to adjust the limits to extend beyond the box width etc.
I have tried both "ax.axis('auto')" and "ax.set_autoscale_on(True)" after plotting, right before plt.show(); neither works:
While the first item in the list above does technically work (not ideal) I would like to know if there is a generic way to simply say: "done plotting, fix limits" (should automatically be done while plotting I guess?).
Thank you.
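One direction that may avoid editing the artists after the fact is Axes.bxp, which draws boxes directly from precomputed statistics, so the values that reach the axes are the ones that are drawn. The following is only a sketch: it assumes median_ is a scalar and that iqr and CL are (lower, upper) pairs as in the code above, and the helper names box_stats and draw_boxes are hypothetical. Whether the resulting automatic limits suit a given figure has not been checked here.

```python
def box_stats(median, iqr, cl):
    """Build one statistics dict in the format Axes.bxp expects.

    iqr is (q1, q3); cl is (lower whisker end, upper whisker end).
    """
    return {
        "med": median,
        "q1": iqr[0],
        "q3": iqr[1],
        "whislo": cl[0],
        "whishi": cl[1],
        "fliers": [],  # no outlier points in this sketch
    }


def draw_boxes(ax, stats, positions, width):
    """Draw precomputed boxes on an existing matplotlib Axes."""
    return ax.bxp(stats, positions=positions,
                  widths=[width] * len(stats), showfliers=False)


# Two boxes with made-up statistics, one per x location.
stats = [box_stats(0.5, (0.3, 0.7), (0.1, 0.9)),
         box_stats(0.6, (0.4, 0.8), (0.2, 1.0))]
print(stats[0])
```

Given an Axes object ax, a call such as draw_boxes(ax, stats, positions=[1.0, 2.5], width=0.15) would draw both boxes in one go instead of one boxplot call per loop iteration.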

Why do sns.lmplot and FacetGrid+plt.scatter create different scatter points from the same data?

I'm quite new to Python, pandas DataFrames and Seaborn. When I was trying to understand Seaborn better, particularly sns.lmplot, I came across a difference between two figures made of the same data, that I thought were supposed to look alike, and I wonder why that is.
Data: My data is a pandas DataFrame that has 454 rows and 19 columns. The data relevant to this question includes 4 columns and looks something like this:
Column        Variable type
Av_density    Continuous
pred2         Continuous
LOC           Categorical (1...4)
Year          Categorical (2012...2014)
There are no missing data points.
My aim is to draw a 2x2 figure panel describing the relationship between Av_density and pred2 separately for each LOC(=location) with years marked with different colours. I call seaborn with:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
np.random.seed(sum(map(ord, "linear_categorical")))
(Side point: for some reason calling "linear_quantitative" does not work, i.e. I get:
File "<stdin>", line 2
sns.lmplot("Av_density", "pred2", Data, col="LOC", hue="YEAR", col_wrap=2);
^
SyntaxError: invalid syntax)
Figure method 1, FacetGrid + scatter:
sur=sns.FacetGrid(Data,col="LOC", col_wrap=2,hue="YEAR")
sur.map(plt.scatter, "Av_density", "pred2" );
plt.legend()
This produces a nice, accurate scatter of the data. You can see the picture here: https://drive.google.com/file/d/0B7h2wsx9mUBScEdUbGRlRk5PV1E/view?usp=sharing
Figure method 2, sns.lmplot:
sns.lmplot("Av_density", "pred2", Data, col="LOC", hue="YEAR", col_wrap=2);
This produces the figure panel divided by LOC accurately, with Years in different colours, but the scatter of the data points does not look right. Instead, it looks like lmplot has linearised the data points, and lost the original scatter points that it is supposed to be drawing in addition to the regression lines.
You can see the figure here: https://drive.google.com/file/d/0B7h2wsx9mUBSRkN5ZXhBeW9ob1E/view?usp=sharing
My data produces only three points per location per year, and at first I wondered whether this is what causes the "mistake" in the lmplot data points. Ideally I would have a shorter line describing the trend between years instead of a proper regression, but I have not figured out the code for this yet.
But before tackling that issue, I would really like to know if there is something I am doing wrong that I can fix, or if this is an issue of lmplot trying to handle my data?
Any help, comments and ideas on this are warmly welcome!
-TA-
Ps. I'm running Python 2.7.8 with Spyder 2.3.4
EDIT: I get shorter "trend lines" with the first method by adding:
sur.map(plt.plot,"Av_density", "pred2" );
Still would like to know what is messing the figure with lmplot.
The issue is probably only that the added regression line is messing up the y-axis, so that the variability in the data cannot be seen.
Try resetting the y-axis based on the variability in your original plot to see if they show the same thing, in your case e.g.
fig1 = sns.lmplot("Av_density", "pred2", Data, col="LOC", hue="YEAR", col_wrap=2);
fig1.set(ylim=(-0.03, 0.05))
plt.show()

Show only the n'th ticklabel in a pandas boxplot

I am new to pandas and matplotlib, but not to Python. I have two questions; a primary and a secondary one.
Primary:
I have a pandas boxplot with FICO score on the x-axis and interest rate on the y-axis.
My x-axis is all messed up since the FICO scores are overwriting each other.
I'd like to show only every 4th or 5th ticklabel on the x-axis for a couple of reasons:
in general it's less chart-junky
in this case it will allow the labels to actually be read.
My code snippet is as follows:
plt.figure()
loansmin = pd.read_csv('../datasets/loanf.csv')
p = loansmin.boxplot('Interest.Rate','FICO.Score')
I saved the return value in p as I thought I might need to manipulate the plot further which I do now.
Secondary:
How do I access the plot, subplot, axes objects from pandas boxplot.
p above is a matplotlib.axes.AxesSubplot object.
help(matplotlib.axes.AxesSubplot) gives a message saying:
AttributeError: 'module' object has no attribute 'AxesSubplot'
dir(matplotlib.axes) lists Axes, Subplot and SubplotBase as being in that namespace, but no AxesSubplot. How do I understand this returned object better?
As I explored further I found that one could explore the returned object p via dir().
Doing this I found a long list of useful methods, amongst which was set_xticklabels.
Doing help(p.set_xticklabels) gave some cryptic, but still useful, help - essentially suggesting passing in a list of strings for ticklabels.
I then tried doing the following - adding set_xticklabels to the end of the last line in the above code effectively chaining the invocations.
plt.figure()
loansmin = pd.read_csv('../datasets/loanf.csv')
p=loansmin.boxplot('Interest.Rate','FICO.Score').set_xticklabels(['650','','','','','700'])
This gave the desired result. I suspect there's a better way, closer to how matplotlib itself does it, that lets you show every n'th label. But for immediate use this works, and it also allows setting labels that are not periodic, if you need that.
As usual, writing out the question explicitly helped me find the answer. And if anyone can help me get to the underlying matplotlib object that is still an open question.
AxesSubplot (I think) is just another way to get at the Axes in matplotlib. set_xticklabels() is part of the matplotlib object-oriented interface (on axes). So, if you were using something like pylab, you might use xticks(ticks, labels), but here you have to split it into two calls, ax.set_xticks(ticks) and ax.set_xticklabels(labels), where ax is an Axes object.
Let's say you only want to set ticks at 650 and 700. You could do the following:
ticks = labels = [650, 700]
plt.figure()
loansmin = pd.read_csv('../datasets/loanf.csv')
p=loansmin.boxplot('Interest.Rate','FICO.Score')
p.set_xticks(ticks)
p.set_xticklabels(labels)
Similarly, you can use set_xlim and set_ylim to do the equivalent of xlim() and ylim() in plt.
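For the original goal of showing only every 4th or 5th label, the list passed to set_xticklabels can be generated rather than typed by hand. This is a minimal sketch; the helper name every_nth_label and the example score bins are hypothetical.

```python
def every_nth_label(values, n, offset=0):
    """Return tick labels with all but every n-th entry blanked out."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [str(v) if (i - offset) % n == 0 else ""
            for i, v in enumerate(values)]


# Hypothetical FICO score bins: 640, 645, ..., 695.
scores = list(range(640, 700, 5))
labels = every_nth_label(scores, 4)
print(labels)
```

With the Axes returned by boxplot, this could be applied as p.set_xticklabels(every_nth_label(sorted_scores, 4)), where sorted_scores is assumed to hold the distinct FICO.Score values in plotting order.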
