I have a dictionary of dataframes where the key is the name of each dataframe and the value is the dataframe itself.
I am looking to iterate through the dictionary and quickly plot the top 10 rows in each dataframe. Each dataframe would have its own plot. I've attempted this with the following:
for df in dfs:
    data = dfs[df].head(n=10)
    sns.barplot(data=data, x='x_col', y='y_col', color='indigo').set_title(df)
This works, but only returns a plot for the last dataframe in the iteration. Is there a way I can modify this so that I am also able to return the subsequent plots?
By default, seaborn.barplot() draws on the current Axes. If you don't specify an Axes, every call in the loop draws on the same one, so each plot is drawn over the previous one. To avoid this, either create a new figure on each iteration or plot on a different Axes by passing the ax argument.
import matplotlib.pyplot as plt
import seaborn as sns

for df in dfs:
    data = dfs[df].head(n=10)
    plt.figure()  # Create a new figure so each plot gets its own Axes.
    sns.barplot(data=data, x='x_col', y='y_col', color='indigo').set_title(df)
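The other option mentioned above, passing the ax argument, could look roughly like the sketch below. The two small dataframes in dfs are made-up stand-ins for the question's dictionary, and the stacked one-column layout is just one possible choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the question's dictionary of dataframes.
dfs = {
    "first": pd.DataFrame({"x_col": list("abc"), "y_col": [3, 1, 2]}),
    "second": pd.DataFrame({"x_col": list("abc"), "y_col": [5, 4, 6]}),
}

# One Axes per dataframe, stacked vertically. squeeze=False keeps `axes`
# a 2-D array even when the dictionary holds a single dataframe.
fig, axes = plt.subplots(nrows=len(dfs), ncols=1,
                         figsize=(6, 3 * len(dfs)), squeeze=False)

for (name, frame), ax in zip(dfs.items(), axes.flat):
    data = frame.head(n=10)
    # ax=ax tells seaborn exactly where to draw, so nothing is overwritten.
    sns.barplot(data=data, x="x_col", y="y_col", color="indigo", ax=ax)
    ax.set_title(name)

fig.tight_layout()
```

Call plt.show() or fig.savefig(...) afterwards to display or save the result. Compared with plt.figure() inside the loop, this keeps all plots in a single figure.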
I created a pandas dataframe using the below code. (See the extra code and plt.show() which is there to create a new plot every time or else we get one plot with all of them in the same plot)
%matplotlib inline
pd.DataFrame(
np.array([[
col,
plt.scatter(data[col], data['SalePrice']) and plt.show()]
for col in data.columns]),
columns=['Feature', 'Scatter Plot']
)
But what I get is this
And at the end of the dataframe, I get all the scatter plots separately.
What I want is, for those graphs to get printed inline, inside the columns, just like the other values.
I want to know how I can iterate through each column and plot a separate box plot for the values of each column using a for loop in Python.
Provided that you have a list of your columns (say, a list of lists), you can use this:
import matplotlib.pyplot as plt
data = ...
#data = [column1, column2, column3]
for elem in data:
    plt.plot(elem)
    plt.show()
This code will create one plot for each column, opening a new one each time you close the current one.
I guess you don't want to plot everything on the same graph, but you could do that by unindenting the last line of my example, plt.show(), by one level.
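Since the question asks for box plots of DataFrame columns, here is a minimal sketch of the same loop using boxplot and one explicit figure per column. The random DataFrame and the figures dictionary are illustrative assumptions, not part of the original answer.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical sample data; replace with your own DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

figures = {}
for col in df.columns:
    fig, ax = plt.subplots()                 # a fresh figure for every column
    ax.boxplot(df[col].dropna().to_numpy())  # box plot of this column only
    ax.set_title(col)
    figures[col] = fig                       # keep a handle, e.g. for savefig
```

A single plt.show() after the loop would then display all of the figures at once.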
It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.
I'm using this syntax for the boxplot so that it automatically generates the box plot for Y vs. X device without any external manipulation of the data frame:
df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)
One way I thought of doing this is to draw a line plot using the mean values from the boxplot, but I'm not sure how to extract that information from the plot.
You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.
First let's generate some sample data:
import pandas as pd
import numpy as np
import seaborn as sns
N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})
print(df.head())
group value
0 A 0.816847
1 A 0.468465
2 C 0.871975
3 B 0.933708
4 A 0.480170
...
Next, make the boxplot and save the axis object:
ax = df.boxplot(column='value', by='group', showfliers=True,
positions=range(df.group.unique().shape[0]))
Note: There's a curious positions argument in Pyplot/Pandas boxplot(), which can cause off-by-one errors. See more in this discussion, including the workaround I've employed here.
Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:
sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)
Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
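If you would rather not depend on seaborn for the overlay, the same idea can be sketched with matplotlib alone. This assumes the boxes sit at the default positions 1..n (no positions argument) and uses freshly generated sample data; the Axes is created up front and passed in so the handle is unambiguous.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Sample data in the same shape as above (values are random).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.random(150),
    "group": rng.choice(["A", "B", "C"], size=150),
})

fig, ax = plt.subplots()
df.boxplot(column="value", by="group", showmeans=True, ax=ax)

# groupby sorts its keys, which matches the left-to-right order of the boxes.
means = df.groupby("group")["value"].mean()

# Without a positions argument the boxes are drawn at x = 1, 2, ..., n,
# so the mean line uses the same x coordinates.
positions = np.arange(1, len(means) + 1)
ax.plot(positions, means.to_numpy(), marker="o", color="C1")
```

Swapping mean() for median() connects the medians instead.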
You can get the values of the medians by calling the .get_data() method of the matplotlib.lines.Line2D objects that draw them, without having to use seaborn.
Let bp be your boxplot created as bp=plt.boxplot(data). Then, bp is a dict containing the medians key, among others. That key contains a list of matplotlib.lines.Line2D, from which you can extract the (x,y) position as follows:
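import matplotlib.pyplot as plt
import numpy as np

bp = plt.boxplot(data)
X = []
Y = []
for m in bp['medians']:
    # Each median is a Line2D with two end points; take their midpoint.
    [[x0, x1], [y0, y1]] = m.get_data()
    X.append(np.mean((x0, x1)))
    Y.append(np.mean((y0, y1)))
plt.plot(X, Y, c='C1')  # connect the medians with a line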
For an arbitrary dataset (data), this script generates this figure. Hope it helps!
I'm trying to use the convenience of the plot method of a pandas dataframe while adjusting the size of the figure produced. (I'm saving the figures to file as well as displaying them inline in a Jupyter notebook). I found the method below successful most of the time, except when I plot two lines on the same chart - then the figure goes back to the default size.
I suspect this might be due to the differences between plot on a series and plot on a dataframe.
Setup example code:
data = {
'A': 90 + np.random.randn(366),
'B': 85 + np.random.randn(366)
}
date_range = pd.date_range('2016-01-01', '2016-12-31')
index = pd.Index(date_range, name='Date')
df = pd.DataFrame(data=data, index=index)
Control - this code produces the expected result (a wide plot):
fig = plt.figure(figsize=(10,4))
df['A'].plot()
plt.savefig("plot1.png")
plt.show()
Result:
Plotting two lines - figure size is not (10,4)
fig = plt.figure(figsize=(10,4))
df[['A', 'B']].plot()
plt.savefig("plot2.png")
plt.show()
Result:
What's the right way to do this so that the figure size is set consistently regardless of the number of series selected?
The reason for the difference between the two cases is a bit hidden inside the logic of pandas.DataFrame.plot(). As one can see in the documentation this method allows a lot of arguments to be passed such that it will handle all kinds of different cases.
Here in the first case, you create a matplotlib figure via fig = plt.figure(figsize=(10,4)) and then plot a single column (a Series). The internal logic of the pandas plot function checks whether a figure is already present in the matplotlib state machine and, if so, uses its current axes to plot the column's values. This works as expected.
However in the second case, the data consists of two columns. There are several options how to handle such a plot, including using different subplots with shared or non-shared axes etc. In order for pandas to be able to apply any of those possible requirements, it will by default create a new figure to which it can add the axes to plot to. The new figure will not know about the already existing figure and its size, but rather have the default size, unless you specify the figsize argument.
In the comments, you say that a possible solution is to use df[['A', 'B']].plot(figsize=(10,4)). This is correct, but you then need to omit the creation of your initial figure. Otherwise it will produce 2 figures, which is probably undesired. In a notebook this will not be visible, but if you run this as a usual python script with plt.show() at the end, there will be two figure windows opening.
So the solution which lets pandas take care of figure creation is
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"A":[2,3,1], "B":[1,2,2]})
df[['A', 'B']].plot(figsize=(10,4))
plt.show()
A way to circumvent the creation of a new figure is to supply the ax argument to the pandas.DataFrame.plot(ax=ax) function, where ax is an externally created axes. This axes can be the standard axes you obtain via plt.gca().
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"A":[2,3,1], "B":[1,2,2]})
plt.figure(figsize=(10,4))
df[['A', 'B']].plot(ax = plt.gca())
plt.show()
Alternatively use the more object oriented way seen in the answer from PaulH.
Always operate explicitly and directly on your Figure and Axes objects. Don't rely on the pyplot state machine. In your case that means:
fig1, ax1 = plt.subplots(figsize=(10,4))
df['A'].plot(ax=ax1)
fig1.savefig("plot1.png")
fig2, ax2 = plt.subplots(figsize=(10,4))
df[['A', 'B']].plot(ax=ax2)
fig2.savefig("plot2.png")
plt.show()
I am trying to generate a grid of subplots based off of a Pandas groupby object. I would like each plot to be based off of two columns of data for one group of the groupby object. Fake data set:
C1,C2,C3,C4
1,12,125,25
2,13,25,25
3,15,98,25
4,12,77,25
5,15,889,25
6,13,56,25
7,12,256,25
8,12,158,25
9,13,158,25
10,15,1366,25
I have tried the following code:
import pandas as pd
import csv
import matplotlib as mpl
import matplotlib.pyplot as plt
import math
#Path to CSV File
path = "..\\fake_data.csv"
#Read CSV into pandas DataFrame
df = pd.read_csv(path)
#GroupBy C2
grouped = df.groupby('C2')
#Figure out number of rows needed for 2 column grid plot
#Also accounts for odd number of plots
nrows = int(math.ceil(len(grouped)/2.))
#Setup Subplots
fig, axs = plt.subplots(nrows,2)
for ax in axs.flatten():
    for i,j in grouped:
        j.plot(x='C1',y='C3', ax=ax)
plt.savefig("plot.png")
But it generates 4 identical subplots with all of the data plotted on each (see example output below):
I would like to do something like the following to fix this:
for i,j in grouped:
    j.plot(x='C1',y='C3',ax=axs)
    next(axs)
but I get this error
AttributeError: 'numpy.ndarray' object has no attribute 'get_figure'
I will have a dynamic number of groups in the groupby object I want to plot, and many more elements than the fake data I have provided. This is why I need an elegant, dynamic solution and each group data set plotted on a separate subplot.
Sounds like you want to iterate over the groups and the axes in parallel, so rather than having nested for loops (which iterate over all groups for each axis), you want something like this:
for (name, df), ax in zip(grouped, axs.flat):
    df.plot(x='C1',y='C3', ax=ax)
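You have the right idea in your second code snippet, but you're getting an error because axs is an array of axes, whereas plot expects a single axis. A NumPy array is also not an iterator, so next(axs) cannot step through it. It should work to create an iterator once before the loop with axes_iter = iter(axs.flat), replace next(axs) with ax = next(axes_iter) at the top of the loop body, and change the argument of plot to ax=ax.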
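For completeness, here is a self-contained sketch of the zip approach using the fake data from the question, inlined as a string so it runs without the CSV file. The title format and the step that hides the unused subplot are additions for illustration, not part of the original answer.

```python
import io
import math

import matplotlib.pyplot as plt
import pandas as pd

# The fake data set from the question, inlined so no file is needed.
csv_text = """C1,C2,C3,C4
1,12,125,25
2,13,25,25
3,15,98,25
4,12,77,25
5,15,889,25
6,13,56,25
7,12,256,25
8,12,158,25
9,13,158,25
10,15,1366,25
"""
df = pd.read_csv(io.StringIO(csv_text))
grouped = df.groupby("C2")

# Two-column grid with enough rows for every group (odd counts included).
ncols = 2
nrows = math.ceil(grouped.ngroups / ncols)
fig, axs = plt.subplots(nrows, ncols, squeeze=False)
axes = axs.ravel()

# zip pairs each group with exactly one Axes and stops at the shorter input.
for (name, group), ax in zip(grouped, axes):
    group.plot(x="C1", y="C3", ax=ax, title=f"C2 = {name}")

# Hide any Axes left over when the number of groups is odd.
for ax in axes[grouped.ngroups:]:
    ax.set_visible(False)

fig.tight_layout()
```

Because the grid size is derived from grouped.ngroups, the same code adapts to any number of groups.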