pyplot.scatter(dataframe) vs. dataframe.plot(kind='scatter') - python

I have several pandas dataframes. I want to plot several columns against one another in separate scatter plots, and combine them as subplots in a figure. I want to label each subplot accordingly. I had a lot of trouble with getting subplot labels working, until I discovered that there are two ways of plotting directly from dataframes, as far as I know; see SO and pandasdoc:
ax0 = plt.scatter(df.column0, df.column5)
type(ax0): matplotlib.collections.PathCollection
and
ax1 = df.plot(0,5,kind='scatter')
type(ax1): matplotlib.axes._subplots.AxesSubplot
ax.set_title('title') works on ax1 but not on ax0, which returns
AttributeError: 'PathCollection' object has no attribute 'set_title'
I don't understand why the two separate ways exist. What is the purpose of the first method using PathCollections? The second one was added in 17.0; is the first one obsolete or has it a different purpose?

As you have found, the pandas function returns an axes object. The PathCollection object can be interpreted as an axes object as well using the "get current axes" function. For instance:
plot = plt.scatter(df.column0, df.column5)
ax0 = plt.gca()
type(ax0)
< matplotlib.axes._subplots.AxesSubplot at 0x10d2cde10>
A more standard way you might see this is the following:
fig = plt.figure()
ax0 = plt.add_subplot()
ax0.scatter(df.column0, df.column5)
At this point you are welcome to do "set" commands such as your set_title.
Hope this helps.

The difference between the two is that they are from different libraries. The first one is from matplotlib, the second one from pandas. They do the same, which is create a matplotlib scatter plot, but the matplotlib version returns a collection of points, whereas the pandas version returns a matplotlib subplot. This makes the matplotlib version a bit more versatile, as you can use the collection of points in another plot.

Related

Why am I unable to make a plot containing subplots in plotly using a px.scatter plot?

I have been trying to make a figure using plotly that combines multiple figures together. In order to do this, I have been trying to use the make_subplots function, but I have found it very difficult to have the plots added in such a way that they are properly formatted. I can currently make singular plots (as seen directly below):
However, whenever I try to combine these singular plots using make_subplots, I end up with this:
This figure has the subplots set up completely wrong, since I need each of the four subplots to contain data pertaining to the four methods (A, B, C, and D). In other words, I would like to have four subplots that look like my singular plot example above.
I have set up the code in the following way:
for sequence in sequences:
#process for making sequence profile is done here
sequence_df = pd.DataFrame(sequence_profile)
row_number=1
grand_figure = make_subplots(rows=4, cols=1)
#there are four groups per sequence, so the grand figure should have four subplots in total
for group in sequence_df["group"].unique():
figure_df_group = sequence_df[(sequence_df["group"]==group)]
figure_df_group.sort_values("sample", ascending=True, inplace=True)
figure = px.line(figure_df_group, x = figure_df_group["sample"], y = figure_df_group["intensity"], color= figure_df_group["method"])
figure.update_xaxes(title= "sample")
figure.update_traces(mode='markers+lines')
#note: the next line fails, since data must be extracted from the figure, hence why it is commented out
#grand_figure.append_trace(figure, row = row_number, col=1)
figure.update_layout(title_text="{} Profile Plot".format(sequence))
grand_figure.append_trace(figure.data[0], row = row_number, col=1)
row_number+=1
figure.write_image(os.path.join(output_directory+"{}_profile_plot_subplots_in_{}.jpg".format(sequence, group)))
grand_figure.write_image(os.path.join(output_directory+"grand_figure_{}_profile_plot_subplots.jpg".format(sequence)))
I have tried following directions (like for example, here: ValueError: Invalid element(s) received for the 'data' property) but I was unable to get my figures added as is as subplots. At first it seemed like I needed to use the graph object (go) module in plotly (https://plotly.com/python/subplots/), but I would really like to keep the formatting/design of my current singular plot. I just want the plots to be conglomerated in groups of four. However, when I try to add the subplots like I currently do, I need to use the data property of the figure, which causes the design of my scatter plot to be completely messed up. Any help for how I can ameliorate this problem would be great.
Ok, so I found a solution here. Rather than using the make_subplots function, I just instead exported all the figures onto an .html file (Plotly saving multiple plots into a single html) and then converted it into an image (HTML to IMAGE using Python). This isn't exactly the approach I would have preferred to have, but it does work.
UPDATE
I have found that plotly express offers another solution, as the px.line object has the parameter of facet that allows one to set up multiple subplots within their plot. My code is set up like this, and is different from the code above in that the dataframe does not need to be iterated in a for loop based on its groups:
sequence_df = pd.DataFrame(sequence_profile)
figure = px.line(sequence_df, x = sequence_df["sample"], y = sequence_df["intensity"], color= sequence_df["method"], facet_col= sequence_df["group"])
Although it still needs more formatting, my plot now looks like this, which is works much better for my purposes:

In a pairplot, how can I not show confidence intervals but display grid lines instead? [duplicate]

I'm plotting two data series with Pandas with seaborn imported. Ideally I would like the horizontal grid lines shared between both the left and the right y-axis, but I'm under the impression that this is hard to do.
As a compromise I would like to remove the grid lines all together. The following code however produces the horizontal gridlines for the secondary y-axis.
import pandas as pd
import numpy as np
import seaborn as sns
data = pd.DataFrame(np.cumsum(np.random.normal(size=(100,2)),axis=0),columns=['A','B'])
data.plot(secondary_y=['B'],grid=False)
You can take the Axes object out after plotting and perform .grid(False) on both axes.
# Gets the axes object out after plotting
ax = data.plot(...)
# Turns off grid on the left Axis.
ax.grid(False)
# Turns off grid on the secondary (right) Axis.
ax.right_ax.grid(False)
sns.set_style("whitegrid", {'axes.grid' : False})
Note that the style can be whichever valid one that you choose.
For a nice article on this, refer to this site.
The problem is with using the default pandas formatting (or whatever formatting you chose). Not sure how things work behind the scenes, but these parameters are trumping the formatting that you pass as in the plot function. You can see a list of them here in the mpl_style dictionary
In order to get around it, you can do this:
import pandas as pd
pd.options.display.mpl_style = 'default'
new_style = {'grid': False}
matplotlib.rc('axes', **new_style)
data = pd.DataFrame(np.cumsum(np.random.normal(size=(100,2)),axis=0),columns=['A','B'])
data.plot(secondary_y=['B'])
This feels like buggy behavior in Pandas, with not all of the keyword arguments getting passed to both Axes. But if you want to have the grid off by default in seaborn, you just need to call sns.set_style("dark"). You can also use sns.axes_style in a with statement if you only want to change the default for one figure.
You can just set:
sns.set_style("ticks")
It goes back to normal.

Change single axis in plotly subplots

I generated a plot in plotly using plotly.express. I used a pandas.DataFrame and one column to differentiate the two subplots putting it into facet_row
Now my data has very different scales they are operating one, but plotly assigns the same range to both yaxis.
I've tried to assign a 'range' attribute to the .layout.yaxis1 dictionary (using a list), but this changes the yaxis for both the upper and the lower plot.
Minimal working example:
import plotly.express as px
px.bar(pd.DataFrame({'x':[50,60,45,.80],'y': [800,900,1,2],"dif":['a','a','b','b']}),x='x',y='y',facet_row='dif')
How can I change the first axis alone?
If you add .update_yaxes(matches=None) that will break the linkage between the Y-axis ranges.
This works because px.bar() (and any other px.*()) returns a Figure object which are are updatable via various .update_*() methods. These methods mutate and return the figure. px.bar(facet_row="whatever") by default sets the yxaxis*.matches attribute to y1 and so clearing that via .update_yaxes() will get the behaviour you want.
After that you can set the range independently however you like.
More information here: https://plot.ly/python/creating-and-updating-figures/

Why do matplotlib subplots start with 1

When creating subplots with matplotlib you need to start with 1, while most other python things start with zero. So to create the very first subplot (top left)
ax = fig.add_subplot(3,4,1)
Where I would have expected 0 to be the first subplot
ax = fig.add_subplot(3,4,0)
I've seen the explanation "we got this from matlab" but that seems like a particularly unsatisfying answer.
The answer really is: "it's meant for matlab-compatibility". There is one minor advantage in terms of the shortcut integer notation (subplot(231) instead of subplot(2,3,1)). You can't express a 0-based system that way without using strings instead. However, that shortcut notation is generally a bad idea, and should only ever be used in an interactive scenario where readability isn't a concern.
As #Cong Ma mentioned, in most cases, you'd use subplots and index a 2D array instead of the matlab-style numerical system. It's a better approach all-around.
For example:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=2, ncols=3)
axes[0, 0].plot(range(10))
plt.show()
It's not exactly identical, as it also adds all of the subplots, but you can always hide the ones you don't want to be visible (ax.axis('off') or ax.set(visible=False)).

Show only the n'th ticklabel in a pandas boxplot

I am new to pandas and matplotlib, but not to Python. I have two questions; a primary and a secondary one.
Primary:
I have a pandas boxplot with FICO score on the x-axis and interest rate on the y-axis.
My x-axis is all messed up since the FICO scores are overwriting each other.
I'd like to show only every 4th or 5th ticklabel on the x-axis for a couple of reasons:
in general it's less chart-junky
in this case it will allow the labels to actually be read.
My code snippet is as follows:
plt.figure()
loansmin = pd.read_csv('../datasets/loanf.csv')
p = loansmin.boxplot('Interest.Rate','FICO.Score')
I saved the return value in p as I thought I might need to manipulate the plot further which I do now.
Secondary:
How do I access the plot, subplot, axes objects from pandas boxplot.
p above is an matplotlib.axes.AxesSubplot object.
help(matplotlib.axes.AxesSubplot) gives a message saying:
'AttributeError: 'module' object has no attribute 'AxesSubplot'
dir(matplotlib.axes) lists Axes, Subplot and Subplotbase as in that namespace but no AxesSubplot. How do I understand this returned object better?
As I explored further I found that one could explore the returned object p via dir().
Doing this I found a long list of useful methods, amongst which was set_xticklabels.
Doing help(p.set_xticklabels) gave some cryptic, but still useful, help - essentially suggesting passing in a list of strings for ticklabels.
I then tried doing the following - adding set_xticklabels to the end of the last line in the above code effectively chaining the invocations.
plt.figure()
loansmin = pd.read_csv('../datasets/loanf.csv')
p=loansmin.boxplot('Interest.Rate','FICO.Score').set_xticklabels(['650','','','','','700'])
This gave the desired result. I suspect there's a better way as in the way matplotlib does it which allows you to show every n'th label. But for immediate use this works, and also allows setting labels where they are not periodic for whatever reason, if you need that.
As usual, writing out the question explicitly helped me find the answer. And if anyone can help me get to the underlying matplotlib object that is still an open question.
AxesSubplot (I think) is just another way to get at the Axes in matplotlib. set_xticklabels() is part of the matplotlib object oriented interface (on axes). So, if you were using something like pylab, you might use xticks(ticks, labels), but instead here you have to separate it into different calls ax.set_xticks(ticks), ax.set_xticklabels(labels). (where ax is an Axes object).
Let's say you only want to set ticks at 650 and 700. You could do the following:
ticks = labels = [650, 700]
plt.figure()
loansmin = pd.read_csv('../datasets/loanf.csv')
p=loansmin.boxplot('Interest.Rate','FICO.Score')
p.set_xticks(ticks)
p.set_xticklabels(labels)
Similarly, you can use set_xlim and set_ylim to do the equivalent of xlim() and ylim() in plt.

Categories