How do I create a multiline plot using seaborn? - python

I am trying out Seaborn to make my plot visually better than matplotlib. I have a dataset which has a column 'Year' which I want to plot on the X-axis and 4 Columns say A,B,C,D on the Y-axis using different coloured lines. I was trying to do this using the sns.lineplot method but it allows for only one variable on the X-axis and one on the Y-axis. I tried doing this
sns.lineplot(data_preproc['Year'],data_preproc['A'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['B'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['C'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['D'], err_style=None)
But this way I don't get a legend in the plot to show which coloured line corresponds to what. I tried checking the documentation but couldn't find a proper way to do this.

Seaborn favors the "long format" as input. The key ingredient to convert your DataFrame from its "wide format" (one column per measurement type) into long format (one column for all measurement values, one column to indicate the type) is pandas.melt. Given a data_preproc structured like yours, filled with random values:
import numpy as np
import pandas as pd
import seaborn as sns
num_rows = 20
years = list(range(1990, 1990 + num_rows))
data_preproc = pd.DataFrame({
    'Year': years,
    'A': np.random.randn(num_rows).cumsum(),
    'B': np.random.randn(num_rows).cumsum(),
    'C': np.random.randn(num_rows).cumsum(),
    'D': np.random.randn(num_rows).cumsum()})
A single plot with four lines, one per measurement type, is obtained with
sns.lineplot(x='Year', y='value', hue='variable',
             data=pd.melt(data_preproc, ['Year']))
(Note that 'value' and 'variable' are the default column names returned by melt, and can be adapted to your liking.)
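To see what lineplot actually receives, it can help to inspect the melted frame on its own. The sketch below is only an illustration: it uses a small seeded frame, and the names wide, long_df, 'series' and 'reading' are made up here to show how var_name and value_name replace the defaults.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the frame is reproducible
num_rows = 5
wide = pd.DataFrame({
    "Year": list(range(1990, 1990 + num_rows)),
    "A": rng.standard_normal(num_rows).cumsum(),
    "B": rng.standard_normal(num_rows).cumsum(),
})

# One row per (Year, column) pair; var_name/value_name override the
# default 'variable'/'value' column names.
long_df = pd.melt(wide, id_vars=["Year"], var_name="series", value_name="reading")
print(long_df)
```

With these names, the plotting call would become sns.lineplot(x='Year', y='reading', hue='series', data=long_df).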

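This:
sns.lineplot(data=data_preproc.set_index('Year'))
will do what you want. Given wide-form data, seaborn plots the index on the x-axis and draws one line per column, so 'Year' has to be the index rather than a regular column.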

See the documentation:
sns.lineplot(x="Year", y="signal", hue="label", data=data_preproc)
You probably need to re-organize your dataframe in a suitable way so that there is one column for the x data, one for the y data, and one which holds the label for the data point.
You can also just use matplotlib.pyplot. If you import seaborn, much of the improved design is also used for "regular" matplotlib plots. Seaborn is really "just" a collection of methods which conveniently feed data and plot parameters to matplotlib.
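As a rough sketch of the plain matplotlib route, assuming a frame shaped like data_preproc from the question (rebuilt here with seeded random values), one line per column can be drawn in a loop. The legend comes from the label argument of each plot call; the legend title "variable" is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
num_rows = 20
data_preproc = pd.DataFrame({
    "Year": list(range(1990, 1990 + num_rows)),
    **{name: rng.standard_normal(num_rows).cumsum() for name in "ABCD"},
})

fig, ax = plt.subplots()
for column in ["A", "B", "C", "D"]:
    # label= is what makes each line show up in the legend
    ax.plot(data_preproc["Year"], data_preproc[column], label=column)
ax.set_xlabel("Year")
legend = ax.legend(title="variable")
```

Call plt.show() (with an interactive backend) or fig.savefig(...) to see the result.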

Related

pandas boxplot returns series type object instead axes

I have a dataframe like the one below,
and I am trying to plot the size distribution of different species from different projects. Here is what I have been trying (very simple code, as I am new to Python):
test=pd.read_excel(file,sheet_name="test",engine='openpyxl')
test.set_index('Species')
test=test.groupby('Project ID')
ax=test.boxplot(column='sizes',by='Species',return_type='axes')
The plot is exactly what I need (below).
However, this returns ax as a Series object rather than an axes, which makes it hard to handle plot formatting (i.e. adding y labels, etc.) afterwards. Is there any way to fix this?
In matplotlib (which is what pandas uses), you always get one "axes" per subplot. Therefore, it makes sense that you have a collection (Series) of axes in your example (two subplots). This is actually good news, because now you can access the subplot you want to style very conveniently by name. Say, for example, you want to add a y-label to the left subplot, you can do:
ax_A = ax.loc["A"].loc["sizes"]
ax_B = ax.loc["B"].loc["sizes"]
ax_A.set_ylabel("My y-label")
Full example:
import numpy as np
import pandas as pd
test = pd.DataFrame({"Project ID": np.random.choice(["A", "B"], 100),
                     "Species": np.random.choice(["Plant1", "Plant2", "Plant3"], 100),
                     "sizes": np.random.random(100)})
test=test.groupby('Project ID')
ax=test.boxplot(column='sizes',by='Species',return_type='axes')
ax_A = ax.loc["A"].loc["sizes"]
ax_B = ax.loc["B"].loc["sizes"]
ax_A.set_ylabel("My y-label")
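If indexing into the returned Series feels awkward, one alternative (a sketch under my own assumptions, not what the answer above does) is to create the subplots yourself and pass each one to DataFrame.boxplot through its ax argument. You then hold ordinary axes objects from the start. The seeded data and the "Project ..." titles are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
test = pd.DataFrame({
    "Project ID": rng.choice(["A", "B"], 100),
    "Species": rng.choice(["Plant1", "Plant2", "Plant3"], 100),
    "sizes": rng.random(100),
})

projects = sorted(test["Project ID"].unique())
# squeeze=False keeps a 2-D array of axes even when there is one project
fig, axes = plt.subplots(1, len(projects), sharey=True, squeeze=False, figsize=(8, 4))
axes = axes[0]
for ax, project in zip(axes, projects):
    subset = test[test["Project ID"] == project]
    subset.boxplot(column="sizes", by="Species", ax=ax)
    ax.set_title(f"Project {project}")  # replace the default column-name title
    ax.set_xlabel("Species")
axes[0].set_ylabel("My y-label")
fig.suptitle("")  # remove the automatic "Boxplot grouped by ..." title
```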

Differences between bar plots in Matplotlib and pandas

I feel like I'm missing something ridiculously basic here.
If I'm trying to create a bar chart with values from a dataframe, what's the difference between calling .plot on the dataframe object and just entering the data within plt.plot's parentheses?
e.g.
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
VERSUS
df.groupby('category').count().plot(kind='bar')?
Can someone please walk me through what the difference is and when I should use either? I get that with plt.plot I'm calling the plot method of the plt (Matplotlib) library, whereas when I do df.plot I'm calling plot on the dataframe? What does that mean exactly -- that the dataframe has a plot object?
Those are different plotting methods. Fundamentally, they both produce a matplotlib object, which can be shown via one of the matplotlib backends.
There is, however, an important difference. Pandas bar plots are categorical in nature. This means that bars are positioned at consecutive integers, and each bar gets a tick with a label taken from the index of the dataframe.
For example:
import matplotlib.pyplot as plt
import pandas as pd
s = pd.Series([30,20,10,40], index=[1,4,5,9])
s.plot.bar()
plt.show()
Here, there are four bars. The first is at position 0, with the first label of the series' index, 1. The second is at position 1, with the label 4, etc.
In contrast, a matplotlib bar plot is numeric in nature. Compare this to
import matplotlib.pyplot as plt
import pandas as pd
s = pd.Series([30,20,10,40], index=[1,4,5,9])
plt.bar(s.index, s.values)
plt.show()
Here the bars are at the numerical position of the index; the first bar at 1, the second at 4 etc. and the axis labelling is independent of where the bars are.
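The difference can also be checked without looking at the figure. The following sketch draws both variants side by side and reads the bar centres back from the drawn rectangles; bar_centers is a small helper defined here for the illustration, not part of either library.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for an interactive window
import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series([30, 20, 10, 40], index=[1, 4, 5, 9])

fig, (ax_pd, ax_mpl) = plt.subplots(1, 2)
s.plot.bar(ax=ax_pd)                  # categorical: one bar per integer slot
bars = ax_mpl.bar(s.index, s.values)  # numeric: bars placed at the index values

def bar_centers(patches):
    """Return the x coordinate of the centre of each bar."""
    return [p.get_x() + p.get_width() / 2 for p in patches]

pandas_centers = bar_centers(ax_pd.patches)
mpl_centers = bar_centers(bars)
pandas_labels = [t.get_text() for t in ax_pd.get_xticklabels()]
print(pandas_centers, pandas_labels)
print(mpl_centers)
```

The pandas bars sit at 0, 1, 2, 3 and carry the index values as tick labels, while the matplotlib bars sit at 1, 4, 5, 9.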
Note that you can achieve a categorical bar plot with matplotlib by casting your values to strings.
plt.bar(s.index.astype(str), s.values)
The result looks similar to the pandas plot, except for some minor tweaks like rotated labels and bar widths. In case you are interested in tweaking some sophisticated properties, it will be easier to do with a matplotlib bar plot, because that directly returns the bar container with all the bars.
bc = plt.bar(s.index, s.values)
for bar in bc:
    bar.set_edgecolor("black")  # example only; any Rectangle setter can go here
The pandas plot function uses Matplotlib's pyplot to do the plotting; it is essentially a shortcut.
I was similarly confused when I started trying to visualise my data, but I eventually decided to learn matplotlib because it gives you more control over the visualisation.
I think it depends on the data you have. If you have a clean data frame and you just want to plot something quickly, then you can use df.plot. For example, you can group by a column and then specify the x and y axes.
If you want a more complicated graph, then working directly with matplotlib is better. In the end, matplotlib will give you more options.
This is a good reference to start with: http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/understand-df-plot-in-pandas/

Why is matplotlib .plot(kind='bar') plot so different to .plot()

This may be a very stupid question, but when plotting a Pandas DataFrame using .plot() it is very quick and produces a graph with an appropriate index. As soon as I try to change this to a bar chart, it just seems to lose all formatting and the index goes wild. Why is this the case? And is there an easy way to just plot a bar chart with the same format as the line chart?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.date_range(start='01/01/2012', end='31/12/2018')
df['Value'] = np.random.randint(low=5, high=100, size=len(df))
df.set_index('Date', inplace=True)
df.plot()
plt.show()
df.plot(kind='bar')
plt.show()
Update:
For comparison, if I take the data and put it into Excel, then create a line plot and a bar ('column') plot it instantly will convert the plot and keep the axis labels as they were for the line plot. If I try to produce many (thousands) of bar charts in Python with years of daily data, this takes a long time. Is there just an equivalent way of doing this Excel transformation in Python?
Pandas bar plots are categorical in nature; i.e. each bar is a separate category, and each category gets its own label. Plotting numeric bar plots (in the same manner as line plots) is not currently possible with pandas.
In contrast matplotlib bar plots are numerical if the input data is numbers or dates. So
plt.bar(df.index, df["Value"])
produces a bar plot on a numeric date axis.
Note, however, that because there are 2557 data points in your dataframe, distributed over only a few hundred pixels, not all bars are actually drawn. Conversely, if you want each bar to be shown, it needs to be at least one pixel wide in the final image. This means that with 5% margins on each side your figure needs to be more than 2800 pixels wide, or saved in a vector format.
So rather than showing daily data, maybe it makes sense to aggregate to monthly or quarterly data first.
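A minimal sketch of that aggregation, assuming monthly means are what you want (sums or medians work the same way) and using seeded random data in place of yours. The bar width of 20 is an arbitrary choice; on a date axis it is measured in days.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range(start="2012-01-01", end="2018-12-31")
df = pd.DataFrame({"Value": rng.integers(5, 100, size=len(dates))}, index=dates)

# "MS" groups by calendar month and labels each bin with the month start
monthly = df["Value"].resample("MS").mean()

fig, ax = plt.subplots(figsize=(10, 4))
# matplotlib places the bars numerically on the date axis
ax.bar(monthly.index, monthly.values, width=20)
ax.set_ylabel("Monthly mean of Value")
```

This reduces the 2557 daily values to 84 bars, which fit comfortably in a normal-sized figure.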
The default .plot() connects all your data points with straight lines and produces a line plot.
On the other hand, the .plot(kind='bar') plots each data point as a discrete bar. To get a proper formatting on the x-axis, you will have to modify the tick-labels post plotting.

Heatmap or other two variable histogram option?

I have a dataframe with two columns, the first one can have an integer from 0-15, the other one can have an integer from 0-10.
The df has approximately 10,000 rows.
I want to plot some sort of grid, (15x10) that can visually represent how many instances of each combination I have throughout the dataframe, ideally displaying the actual number on every grid cell.
I have tried both Seaborn and Matplotlib.
In Seaborn I tried a jointplot which almost did it but I can't get it to show an actual 15x10 grid. I also tried a heatmap but it gave me an error (see below) and I wasn't able to find anything on it.
I also tried plotting some sort of 3D histogram.
Finally I tried pivoting the data but Pandas calculates the numbers as values instead of treating them as "buckets".
Not sure where to go from here.
*heatmap error: "ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''"
sns.heatmap(x='pressure_bucket', y='rate_bucket', data=df)
The closest to what I want is something like this, ideally with the actual numbers in each cell
https://imgur.com/a/d4qWIod
Thanks to all in advance!
We can use plt.imshow to display a heat map,
# get the counts in form of a dataframe indexed by (c1,c2)
counts = df.groupby(['c1'])['c2'].value_counts().rename('value').reset_index()
# pivot to c1 as index, c2 as columns
counts = counts.pivot(index='c1', columns='c2', values='value')
# the pivot only contains values that occur in the data, so add the rest
# add rows for any missing value of c1 (0-15)
counts = counts.reindex(range(16))
# add columns for any missing value of c2 (0-10)
counts = counts.reindex(range(11), axis=1)
# fill all missing values with 0
counts = counts.fillna(0)
# imshow
plt.figure(figsize=(15,10))
plt.imshow(counts, cmap='hot')
plt.grid(False)
plt.show()
# sns would give a color bar legend
plt.figure(figsize=(15,10))
sns.heatmap(counts, cmap='hot')
plt.show()
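Since the question asks for the actual number in every cell, here is a hedged variation on the same idea. It swaps the groupby/pivot steps for pd.crosstab and writes the counts onto the image with ax.text. The column names c1 and c2 follow the answer above, the data is seeded random filler, and the colormap and text colour are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "c1": rng.integers(0, 16, size=n),  # integers 0-15
    "c2": rng.integers(0, 11, size=n),  # integers 0-10
})

# crosstab counts every (c1, c2) combination; reindex adds combinations
# that never occur, so the grid always has its full shape
counts = (pd.crosstab(df["c1"], df["c2"])
            .reindex(index=range(16), columns=range(11), fill_value=0))

fig, ax = plt.subplots(figsize=(8, 10))
image = ax.imshow(counts.to_numpy(), cmap="viridis")
fig.colorbar(image, ax=ax)
# write each count at the centre of its cell (x = column, y = row)
for (row, col), value in np.ndenumerate(counts.to_numpy()):
    ax.text(col, row, str(value), ha="center", va="center", color="white")
ax.set_xlabel("c2")
ax.set_ylabel("c1")
```

With seaborn, sns.heatmap(counts, annot=True, fmt="d") should give a similar annotated grid in one call.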

How to connect boxplot median values

It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.
I'm using this syntax to do the boxplot so that it automatically generate the box plot for Y vs. X device without having to do external manipulation of the data frame:
df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)
One way I thought of doing is to just do a line plot by getting the mean values from the boxplot, but I'm not sure how to extract that information from the plot.
You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.
First let's generate some sample data:
import pandas as pd
import numpy as np
import seaborn as sns
N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})
print(df.head())
group value
0 A 0.816847
1 A 0.468465
2 C 0.871975
3 B 0.933708
4 A 0.480170
...
Next, make the boxplot and save the axis object:
ax = df.boxplot(column='value', by='group', showfliers=True,
                positions=range(df.group.unique().shape[0]))
Note: There's a curious positions argument in Pyplot/Pandas boxplot(), which can cause off-by-one errors. See more in this discussion, including the workaround I've employed here.
Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:
sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)
Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
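For the median variant without seaborn, a rough sketch could look like this. It assumes the boxes are left at their default positions 1, 2, ..., n (so no positions argument is passed) and that they appear in sorted group order, which is also the order groupby returns. The seeded data stands in for yours.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 150
df = pd.DataFrame({
    "value": rng.random(N),
    "group": rng.choice(["A", "B", "C"], size=N),
})

fig, ax = plt.subplots()
df.boxplot(column="value", by="group", ax=ax)

# groupby sorts the group keys, matching the order of the boxes
medians = df.groupby("group")["value"].median()
# assumption: boxes sit at the default positions 1, 2, ..., n
positions = np.arange(1, len(medians) + 1)
(median_line,) = ax.plot(positions, medians.to_numpy(), marker="o", color="C1")
```

Swapping median() for mean() connects the means instead.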
You can get the values of the medians by calling the .get_data() method of the matplotlib.lines.Line2D objects that draw them, without having to use seaborn.
Let bp be your boxplot created as bp=plt.boxplot(data). Then, bp is a dict containing the medians key, among others. That key contains a list of matplotlib.lines.Line2D, from which you can extract the (x,y) position as follows:
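import numpy as np
import matplotlib.pyplot as plt
bp = plt.boxplot(data)  # data: your own dataset
X = []
Y = []
for m in bp['medians']:
    # each median is a short horizontal line; take its midpoint
    [[x0, x1], [y0, y1]] = m.get_data()
    X.append(np.mean((x0, x1)))
    Y.append(np.mean((y0, y1)))
plt.plot(X, Y, c='C1')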
For an arbitrary dataset (data), this script generates this figure. Hope it helps!
