I feel like I'm missing something ridiculously basic here.
If I'm trying to create a bar chart with values from a dataframe, what's the difference between calling .plot on the dataframe object and just entering the data within plt.plot's parentheses?
e.g.
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
VERSUS
df.groupby('category').count().plot(kind='bar')?
Can someone please walk me through what the difference is and when I should use either? I get that with plt.plot I'm calling the plot method of the plt (Matplotlib) library, whereas when I do df.plot I'm calling plot on the dataframe? What does that mean exactly -- that the dataframe has a plot object?
Those are different plotting methods. Fundamentally, they both produce a matplotlib object, which can be shown via one of the matplotlib backends.
There is however an important difference. Pandas bar plots are categorical in nature. This means, bars are positionned at subsequent integer numbers, and each bar gets a tick with a label according to the index of the dataframe.
For example:
import matplotlib.pyplot as plt
import pandas as pd
s = pd.Series([30,20,10,40], index=[1,4,5,9])
s.plot.bar()
plt.show()
Here, there are four bars, the first is at positon 0, with the first label of the series' index, 1. The second is at positon 1, with the label 4 etc.
In contrast, a matplotlib bar plot is numeric in nature. Compare this to
import matplotlib.pyplot as plt
import pandas as pd
s = pd.Series([30,20,10,40], index=[1,4,5,9])
plt.bar(s.index, s.values)
plt.show()
Here the bars are at the numerical position of the index; the first bar at 1, the second at 4 etc. and the axis labelling is independent of where the bars are.
Note that you can achieve a categorical bar plot with matplotlib by casting your values to strings.
plt.bar(s.index.astype(str), s.values)
The result looks similar to the pandas plot, except for some minor tweaks like rotated labels and bar widths. In case you are interested in tweaking some sophisticated properties, it will be easier to do with a matplotlib bar plot, because that directly returns the bar container with all the bars.
bc = plt.bar()
for bar in bc:
bar.set_some_property(...)
Pandas plot function is using Matplotlib's pyplot to do the plotting, but it's like a shortcut.
I was similarly confused when I started trying to visualise my data, but I decided in the end to learn matplotlib because in the end you get more control of the visualisation.
I think it depends on the data you have. If you have a clean data frame and you just want to print something quickly, then you can use df.plot. For example, you can group by a column and then specify x and y axes.
If you want a more complicated graph, then working directly with matplotlib is better. At the end, matplotlib will give you more options.
This is a good reference to start with: http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/understand-df-plot-in-pandas/
Related
I have created a seaborn scatterplot for a dataset, where I set the sizes parameter to one column, and the hue parameter to another. Now the hue parameter only consists of five different values and is supposed to help classifying my data, while the sizes parameter consists of a lot more to represent actual numeric data. In this current data set, my hue values only consist of 0, 2, and 4, but in the "brief" legend option, the legend labels are not synchronized to that, which is very confusing. In the "full" legend option, the hue-labels are correct, but the size-labels are way too many. Therefore I would like to display the full legend for my hue parameter, but only a brief legend for the sizes parameter, because it consists of lots of unique values.
How the overcrowded "full" legend looks
The "brief" legend that is confusingly labeled
Edit: I edited some code in that demonstrates the issue for a random dataset. To specify my question again, I want the "shape" parameters to get fully depicted on the legend, while the "size" parameters have to be shortened (equivalent to the legend setting "brief").
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
x_condition=np.arange(0,20,1)
y_condition=np.arange(0,20,1)
size=np.random.randint(0,200,20)
# I haven't made a random distribution here, because I wanted to make sure it contains at least one of each [0,2,4]
shape=[0,2,0,4]*5
df=pd.DataFrame({"x_condition":x_condition,"y_condition":y_condition,"size":size,"shape":shape})
sns.scatterplot("x_condition", "y_condition", hue="shape", size="size", data=df, palette="coolwarm", legend="brief")
plt.show()
sns.scatterplot("x_condition", "y_condition", hue="shape", size="size", data=df, palette="coolwarm", legend="full")
plt.show()
This may be a very stupid question, but when plotting a Pandas DataFrame using .plot() it is very quick and produces a graph with an appropriate index. As soon as I try to change this to a bar chart, it just seems to lose all formatting and the index goes wild. Why is this the case? And is there an easy way to just plot a bar chart with the same format as the line chart?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.date_range(start='01/01/2012', end='31/12/2018')
df['Value'] = np.random.randint(low=5, high=100, size=len(df))
df.set_index('Date', inplace=True)
df.plot()
plt.show()
df.plot(kind='bar')
plt.show()
Update:
For comparison, if I take the data and put it into Excel, then create a line plot and a bar ('column') plot it instantly will convert the plot and keep the axis labels as they were for the line plot. If I try to produce many (thousands) of bar charts in Python with years of daily data, this takes a long time. Is there just an equivalent way of doing this Excel transformation in Python?
Pandas bar plots are categorical in nature; i.e. each bar is a separate category and those get their own label. Plotting numeric bar plots (in the same manner a line plots) is not currently possible with pandas.
In contrast matplotlib bar plots are numerical if the input data is numbers or dates. So
plt.bar(df.index, df["Value"])
produces
Note however that due to the fact that there are 2557 data points in your dataframe, distributed over only some hundreds of pixels, not all bars are actually plotted. Inversely spoken, if you want each bar to be shown, it needs to be one pixel wide in the final image. This means with 5% margins on each side your figure needs to be more than 2800 pixels wide, or a vector format.
So rather than showing daily data, maybe it makes sense to aggregate to monthly or quarterly data first.
The default .plot() connects all your data points with straight lines and produces a line plot.
On the other hand, the .plot(kind='bar') plots each data point as a discrete bar. To get a proper formatting on the x-axis, you will have to modify the tick-labels post plotting.
I'm trying to use the convenience of the plot method of a pandas dataframe while adjusting the size of the figure produced. (I'm saving the figures to file as well as displaying them inline in a Jupyter notebook). I found the method below successful most of the time, except when I plot two lines on the same chart - then the figure goes back to the default size.
I suspect this might be due to the differences between plot on a series and plot on a dataframe.
Setup example code:
data = {
'A': 90 + np.random.randn(366),
'B': 85 + np.random.randn(366)
}
date_range = pd.date_range('2016-01-01', '2016-12-31')
index = pd.Index(date_range, name='Date')
df = pd.DataFrame(data=data, index=index)
Control - this code produces the expected result (a wide plot):
fig = plt.figure(figsize=(10,4))
df['A'].plot()
plt.savefig("plot1.png")
plt.show()
Result:
Plotting two lines - figure size is not (10,4)
fig = plt.figure(figsize=(10,4))
df[['A', 'B']].plot()
plt.savefig("plot2.png")
plt.show()
Result:
What's the right way to do this so that the figure size is consistency set regardless of number of series selected?
The reason for the difference between the two cases is a bit hidden inside the logic of pandas.DataFrame.plot(). As one can see in the documentation this method allows a lot of arguments to be passed such that it will handle all kinds of different cases.
Here in the first case, you create a matplotlib figure via fig = plt.figure(figsize=(10,4)) and then plot a single column DataFrame. Now the internal logic of pandas plot function is to check if there is already a figure present in the matplotlib state machine, and if so, use it's current axes to plot the columns values to it. This works as expected.
However in the second case, the data consists of two columns. There are several options how to handle such a plot, including using different subplots with shared or non-shared axes etc. In order for pandas to be able to apply any of those possible requirements, it will by default create a new figure to which it can add the axes to plot to. The new figure will not know about the already existing figure and its size, but rather have the default size, unless you specify the figsize argument.
In the comments, you say that a possible solution is to use df[['A', 'B']].plot(figsize=(10,4)). This is correct, but you then need to omit the creation of your initial figure. Otherwise it will produce 2 figures, which is probably undesired. In a notebook this will not be visible, but if you run this as a usual python script with plt.show() at the end, there will be two figure windows opening.
So the solution which lets pandas take care of figure creation is
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"A":[2,3,1], "B":[1,2,2]})
df[['A', 'B']].plot(figsize=(10,4))
plt.show()
A way to circumvent the creation of a new figure is to supply the ax argument to the pandas.DataFrame.plot(ax=ax) function, where ax is an externally created axes. This axes can be the standard axes you obtain via plt.gca().
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({"A":[2,3,1], "B":[1,2,2]})
plt.figure(figsize=(10,4))
df[['A', 'B']].plot(ax = plt.gca())
plt.show()
Alternatively use the more object oriented way seen in the answer from PaulH.
Always operate explicitly and directly on your Figure and Axes objects. Don't rely on the pyplot state machine. In your case that means:
fig1, ax1 = plt.subplots(figsize=(10,4))
df['A'].plot(ax=ax1)
fig1.savefig("plot1.png")
fig2, ax2 = plt.figure(figsize=(10,4))
df[['A', 'B']].plot(ax=ax2)
fig2.savefig("plot2.png")
plt.show()
The pairplot function from seaborn allows to plot pairwise relationships in a dataset.
According to the documentation (highlight added):
By default, this function will create a grid of Axes such that each variable in data will by shared in the y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.
It is also possible to show a subset of variables or plot different variables on the rows and columns.
I could find only one example of subsetting different variables for rows and columns, here (it's the 6th plot under the Plotting pairwise relationships with PairGrid and pairplot() section). As you can see, it's plotting many independent variables (x_vars) against the same single dependent variable (y_vars) and the results are pretty nice.
I'm trying to do the same plotting a single independent variable against many dependent ones.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
ages = np.random.gamma(6,3, size=50)
data = pd.DataFrame({"age": ages,
"weight": 80*ages**2/(ages**2+10**2)*np.random.normal(1,0.2,size=ages.shape),
"height": 1.80*ages**5/(ages**5+12**5)*np.random.normal(1,0.2,size=ages.shape),
"happiness": (1-ages*0.01*np.random.normal(1,0.3,size=ages.shape))})
pp = sns.pairplot(data=data,
x_vars=['age'],
y_vars=['weight', 'height', 'happiness'])
The problem is that the subplots get arranged vertically, and I couldn't find a way to change it.
I know that then the tiling structure would not be so neat as the Y axis should be labeled at every subplot. Also, I know I could generate the plots making it by hand with something like this:
fig, axes = plt.subplots(ncols=3)
for i, yvar in enumerate(['weight', 'height', 'happiness']):
axes[i].scatter(data['age'],data[yvar])
Still, I'm learning to use the seaborn and I find interface very convenient, so I wonder if there's a way. Also, this example is pretty easy, but for more complex datasets seaborn handles for you many more things that would make the raw-matplotlib approach much more complex quite quickly (hue, to start)
You can achieve what it seems you are looking for by swapping the variable names passed to the x_vars and y_vars parameters. So revisiting the sns.pairplot portion of your code:
pp = sns.pairplot(data=data,
y_vars=['age'],
x_vars=['weight', 'height', 'happiness'])
Note that all I've done here is swap x_vars for y_vars. The plots should now be displayed horizontally:
The x-axis will now be unique to each plot with a common y-axis determined by the age column.
I have a csv file of power levels at several stations (4 in this case, though "HUT4" is not in this short excerpt):
2014-06-21T20:03:21,HUT3,74
2014-06-21T21:03:16,HUT1,70
2014-06-21T21:04:31,HUT3,73
2014-06-21T21:04:33,HUT2,30
2014-06-21T22:03:50,HUT3,64
2014-06-21T23:03:29,HUT1,60
(etc . .)
The times are not synchronised across stations. The power level is (in this case) integer percent. Some machines report in volts (~13.0), which would be an additional issue when plotting.
The data is easy to read into a dataframe, to index the dataframe, to put into a dictionary. But I can't get the right syntax to make a meaningful plot. Either all stations on a single plot sharing a timeline that's big enough for all stations, or as separate plots, maybe a subplot for each station. If I do:
import pandas as pd
df = pd.read_csv('Power_Log.csv',names=['DT','Station','Power'])
df2=df.groupby(['Station']) # set 'Station' as the data index
d = dict(iter(df2)) # make a dictionary including each station's data
for stn in d.keys():
d[stn].plot(x='DT',y='Power')
plt.legend(loc='lower right')
plt.savefig('Station_Power.png')
I do get a plot but the X axis is not right for each station.
I have not figured out yet how to do four independent subplots, which would free me from making a wide-enough timescale.
I would greatly appreciate comments on getting a single plot right and/or getting good looking subplots. The subplots do not need to have synchronised X axes.
I'd rather plot the typical way, smth like:
import matplotlib.pyplot as plt
plt.plot([1,2,3,4], [1,4,9,16], 'ro')
plt.axis([0, 6, 0, 20])
plt.savefig()
( http://matplotlib.org/users/pyplot_tutorial.html )
Re more subplots: simply call plt.plot() multiple times, once for each data series.
P.S. you can set xticks this way: Changing the "tick frequency" on x or y axis in matplotlib?
Sorry for the comment above where I needed to add code. Still learning . .
From the 5th code line:
import matplotlib.dates as mdates
for stn in d.keys():
plt.figure()
d[stn].interpolate().plot(x='DT',y='Power',title=stn,rot=45)
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%D/%M/%Y'))
plt.savefig('Station_Power_'+stn+'.png')
Does more or less what I want to do except the DateFormatter line does not work. I would like to shorten my datetime data to show just date. If it places ticks at midnight that would be brilliant but not strictly necessary.
The key to getting a continuous plot is to use the interpolate() method in the plot.
With this data having different x scales from station to station a plot of all stations on the same graph does not work. HUT4 in my data has far fewer records and only plots to about 25% of the scale even though the datetime values cover more or less the same range as the other HUTs.