Plotting as a group using pandas and Matplotlib - python

I want to plot as a group using pandas and Matplotlib. The plot would look like this kind of grouping:
Now let's assume I have a data file example.csv:
first,second,third,fourth,fifth,sixth
-42,11,3,La_c-,D
-42,21,2,La_c-,D0
-42,31,2,La_c-,D
-42,122,3,La_c-,L
print(df.head()) of the above is:
first second third fourth fifth sixth
0 -42 11 3 La_c- D NaN
1 -42 21 2 La_c- D0 NaN
2 -42 31 2 La_c- D NaN
3 -42 122 3 La_c- L NaN
In my case, each group on the x-axis would consist of the first and second columns, just as the plot above groups pies_2018, pies_2019, and pies_2020.
To do that, I have tried to plot a single column first:
#!/usr/bin/env python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#from scipy import stats
#import ast
filename = 'example.csv'
df = pd.read_csv(filename)
print(df.head())
df.plot(kind='bar', x=df.columns[1],y=df.columns[2],figsize=(12, 4))
plt.gcf().subplots_adjust(bottom=0.35)
I get a plot like this:
Now the problem is when I want to make a group I get the following error:
raise ValueError("x must be a label or position")
ValueError: x must be a label or position
The thing is that I was considering the numbers as a label.
The code I used:
#!/usr/bin/env python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#from scipy import stats
#import ast
filename = 'example.csv'
df = pd.read_csv(filename)
print(df.head())
df.plot(kind='bar', x=["first", "second"],y="third",figsize=(12, 4))
plt.gcf().subplots_adjust(bottom=0.35)
plt.xticks(rotation=90)
If I can plot the first and second as a group, in addition to the legends, I will want to mention the fifth column in the "first" bar and the sixth column in the "second" bar.

Try this. You can play around with it, but this gives you the bars side by side in groups.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
first = [-42, -42, -42, -42] #Use your column df['first']
second = [11, 21, 31, 122] #Use your column df['second']
third = [3, 2, 2, 3]
x = np.arange(len(third))
width = 0.25 #bar width
fig, ax = plt.subplots()
bar1 = ax.bar(x, third, width, label='first', color='blue')
bar2 = ax.bar(x + width, third, width, label='second', color='green')
ax.set_ylabel('third')
ax.set_xticks(x)
rects = ax.patches
labels = [str(i) for i in zip(first, second)] #You could use the columns df['first'] instead of the lists
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width() / 2, height, label,
            ha='center', va='bottom')
ax.legend()
EDITED & NEW Plot -
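The error message in the question suggests that `x` takes a single column label or position, not a list. One possible workaround, shown here only as a sketch, is to combine the two columns into one label column first. The column name `pair` is made up for illustration, and the CSV text is the sample from the question:

```python
import io

import pandas as pd
import matplotlib.pyplot as plt

# Sample data copied from the question (the sixth column is empty, so it reads as NaN)
csv_text = """first,second,third,fourth,fifth,sixth
-42,11,3,La_c-,D
-42,21,2,La_c-,D0
-42,31,2,La_c-,D
-42,122,3,La_c-,L
"""
df = pd.read_csv(io.StringIO(csv_text))

# 'pair' is a hypothetical helper column: one label per (first, second) combination
df["pair"] = df["first"].astype(str) + ", " + df["second"].astype(str)

# x is now a single column label, which is what df.plot expects
ax = df.plot(kind="bar", x="pair", y="third", figsize=(12, 4), rot=90)

# Write the value of the 'fifth' column above each bar
for rect, label in zip(ax.patches, df["fifth"]):
    ax.annotate(label,
                (rect.get_x() + rect.get_width() / 2, rect.get_height()),
                ha="center", va="bottom")

plt.gcf().subplots_adjust(bottom=0.35)
```

Call plt.show() or plt.savefig(...) afterwards to see the result. This draws one bar per row rather than truly separate bars for first and second, so treat it as a starting point.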

You can achieve it using ax.patches.
df:
a b c d
a1 66 92 98 17
a2 83 57 86 97
a3 96 47 73 32
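# After transposing, the columns are a1..a3, so y=df.columns would no longer match; plotting all columns is the default
ax = df.T.plot(width=0.8, kind='bar', figsize=(10,5))
for p in ax.patches:
    ax.annotate(str(round(p.get_height(),2)), (p.get_x() * 1.005, p.get_height() * 1.005),color='green')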
ax.axes.get_yaxis().set_ticks([])
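For reference, here is a self-contained sketch of the same idea. It assumes the df shown above (rebuilt by hand here) and centres each label over its bar instead of scaling the coordinates:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Rebuild the small example frame shown above
df = pd.DataFrame(
    {"a": [66, 83, 96], "b": [92, 57, 47], "c": [98, 86, 73], "d": [17, 97, 32]},
    index=["a1", "a2", "a3"],
)

# One group per original column (a, b, c, d), one bar per row (a1, a2, a3)
ax = df.T.plot(kind="bar", width=0.8, figsize=(10, 5))

# Each bar is a Rectangle in ax.patches; label it with its own height
for p in ax.patches:
    ax.annotate(f"{p.get_height():.0f}",
                (p.get_x() + p.get_width() / 2, p.get_height()),
                ha="center", va="bottom", color="green")

# Hide the y ticks, since the values are written on the bars
ax.get_yaxis().set_ticks([])
```

Call plt.show() to display it.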

Related

how to set x_axis label(not xtick label) for all subplots in relplot?

I tried drawing subplots with seaborn's relplot method. The problem is that the original dataset varies, so I don't always know how many subplots there will be in the end.
I set col_wrap to limit the number of columns, but sometimes the result doesn't look good. For example, with col_wrap = 3 and 5 subplots, I get the figure below:
As the figure shows, the x-axis label only appears in C, D and E, which seems strange. I want the x-axis label to be shown in all subplots (from A to E).
I already know that facet_kws={'sharex': 'col'} allows plots to have independent axis scales (according to "set axis limits on individual facets of seaborn facetgrid").
But I want to set x-axis labels for all subplots, and I haven't found a solution for that.
Keywords like set_xlabels on the FacetGrid object seem to be useless here, because the official documentation says they only label the x axis "on the bottom row of the grid".
FacetGrid.set_xlabels(label=None, clear_inner=True, **kwargs)
Label the x axis on the bottom row of the grid.
The following are my example data and my code:
city date value
0 A 1 9
1 B 1 20
2 C 1 4
3 D 1 33
4 E 1 2
5 A 2 22
6 B 2 32
7 C 2 27
8 D 2 32
9 E 2 18
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_excel("data/example_data.xlsx")
# print(df)
g = sns.relplot(data=df, x="date", y="value", kind="line", col="city", col_wrap=3,
errorbar=None, facet_kws={'sharex': 'col'})
(g.set_axis_labels("x_axis", "y_axis", )
.set_titles("{col_name}")
.tight_layout()
.add_legend()
)
plt.subplots_adjust(top=0.94, wspace=None, hspace=0.4)
plt.show()
Thanks in advance.
In order to reduce superfluous information, Seaborn makes these inner labels invisible. You can make them visible again:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': np.repeat([1, 2], 5),
'value': np.random.randint(1, 20, 10),
'city': np.tile([*'abcde'], 2)})
# print(df)
g = sns.relplot(data=df, x="date", y="value", kind="line", col="city", col_wrap=3,
errorbar=None, facet_kws={'sharex': 'col'})
g.set_titles("{col_name}")
g.add_legend()
for ax in g.axes.flat:
    ax.set_xlabel('x axis', visible=True)
    ax.set_ylabel('y axis', visible=True)
plt.subplots_adjust(top=0.94, wspace=None, hspace=0.4)
plt.show()

What is the most simple way to set scatterplot color based on category in python?

I'm trying to, in the most simple way, color points in a scatterplot using python. X is one column, y is another, and the last (let's say Z) has values (for example A, B, C). I would like to color the points (X, Y) using the value in Z.
I realize somewhat similar questions have been asked in the past, but this just isn't working out for me. Possibly because I had to force everything to be a float?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats as stats
df = pd.read_csv(r"C:\......combinedsheet2.csv")
df['crowd1'] = pd.to_numeric(df['c1'], errors='coerce')
df['crowd3'] = pd.to_numeric(df['c3'], errors='coerce')
df['dist1'] = pd.to_numeric(df['d1'], errors='coerce')
I'm not sure why these specific values were read as anything other than floats. Everything else was, and I haven't used this command enough to know whether it interfered with any later data analysis and may be the source of some of my trouble when trying to do mixed-model analysis and such.
To plot I use:
df.plot(x="c1", y="d1", c="black", kind="scatter")
ax = plt.gca()
ax.set_ylim([0, 610])
ax.set_xlim([0, 30])
And to plot all of my data together I use:
df.plot(x=["c1", "c2", "c3", "c4"], y=["d1", "d2", "d3", "d4"], c="black", kind="scatter")
ax = plt.gca()
ax.set_ylim([0, 450])
ax.set_xlim([0, 20])
Here is my csv file contents, minus a few decimal points in some cases (first 3 lines):
bwc  c1  d1     dbz  c2    d2     lmr  c3  d3     tti  c4    d4
A    12  67.00  F    20.0  454.2  I    4   405.4  L    14.0  137.9
B    8   122.0  G    20.0  265.0  J    3   490    M    0.0   144.9
A    0   217.0  F    15.0  235.0  I    0   62.80  N    11.0  418.7
In each instance, I would like to see each different point (A, B, C, etc.) in a different color. Thanks!
I suggest using the seaborn package to do this. The first plot can be created like this:
sns.scatterplot(data=df, x='c1', y='d1', hue='bwc')
When plotting all the data together, you first need to reshape the dataframe to have the x, y, and hue variables in single columns. There is more than one way to do this. The following example uses pd.wide_to_long which requires renaming the columns containing the letters:
import io
import pandas as pd # v 1.2.3
import seaborn as sns # v 0.11.1
data = """
bwc c1 d1 dbz c2 d2 lmr c3 d3 tti c4 d4
A 12 67.00 F 20.0 454.2 I 4 405.4 L 14.0 137.9
B 8 122.0 G 20.0 265.0 J 3 490 M 0.0 144.9
A 0 217.0 F 15.0 235.0 I 0 62.80 N 11.0 418.7
"""
df = pd.read_csv(io.StringIO(data), sep=r'\s+')  # whitespace-delimited; delim_whitespace=True is deprecated in newer pandas
# Melt dataframe to have x, y and hue variables in single columns
dfren = (df.rename(dict(bwc='let1', dbz='let2', lmr='let3', tti='let4'), axis=1)
.reset_index())
dfmelt = pd.wide_to_long(dfren, stubnames=['let', 'c', 'd'], i='index', j='j')
# Plot scatter plot with seaborn
ax = sns.scatterplot(data=dfmelt, x='c', y='d', hue='let')
ax.figure.set_size_inches(8,6)
ax.set_ylim([0, 450])
ax.set_xlim([0, 20]);
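If you would rather not add seaborn, a similar result for the first plot can be sketched with plain pandas and Matplotlib by drawing one scatter call per category. This assumes the three sample rows from the question; Matplotlib's default colour cycle then assigns a different colour to each group:

```python
import pandas as pd
import matplotlib.pyplot as plt

# First three rows of the question's data, first column group only
df = pd.DataFrame({
    "bwc": ["A", "B", "A"],
    "c1": [12, 8, 0],
    "d1": [67.0, 122.0, 217.0],
})

fig, ax = plt.subplots()

# One scatter call per category; each call gets the next colour in the cycle
for name, group in df.groupby("bwc"):
    ax.scatter(group["c1"], group["d1"], label=name)

ax.set_xlabel("c1")
ax.set_ylabel("d1")
ax.set_xlim([0, 30])
ax.set_ylim([0, 610])
legend = ax.legend(title="bwc")
```

Call plt.show() to display it. The same loop works on the melted dataframe above by grouping on 'let' and plotting 'c' against 'd'.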

How to merge two plots in Pandas?

I want to merge two plots. This is my dataframe:
df_inc.head()
id date real_exe_time mean mean+30% mean-30%
0 Jan 31 33.14 43.0 23.0
1 Jan 30 33.14 43.0 23.0
2 Jan 33 33.14 43.0 23.0
3 Jan 38 33.14 43.0 23.0
4 Jan 36 33.14 43.0 23.0
My first plot:
df_inc.plot.scatter(x = 'date', y = 'real_exe_time')
Then
My second plot:
df_inc.plot(x='date', y=['mean','mean+30%','mean-30%'])
When I try to merge with:
fig=plt.figure()
ax = df_inc.plot(x='date', y=['mean','mean+30%','mean-30%']);
df_inc.plot.scatter(x = 'date', y = 'real_exe_time', ax=ax)
plt.show()
I got the following:
How can I merge them the right way?
You should not repeat your mean values as an extra column. df.plot() for categorical data will be plotted against the index - hence you will see the original scatter plot (also plotted against the index) squeezed into the left corner.
Instead, you could create an additional aggregated dataframe and plot it into the same graph:
import matplotlib.pyplot as plt
import pandas as pd
#test data generation
import numpy as np
n=30
np.random.seed(123)
df = pd.DataFrame({"date": np.random.choice(list("ABCDEF"), n), "real_exe_time": np.random.randint(1, 100, n)})
df = df.sort_values(by="date").reindex()
#aggregate data for plotting
df_agg = df.groupby("date")["real_exe_time"].agg(mean="mean").reset_index()
df_agg["mean+30%"] = df_agg["mean"] * 1.3
df_agg["mean-30%"] = df_agg["mean"] * 0.7
#plot both into the same subplot
ax = df.plot.scatter(x = 'date', y = 'real_exe_time')
df_agg.plot(x='date', y=['mean','mean+30%','mean-30%'], ax=ax)
plt.show()
Sample output:
You could also consider using seaborn that has, for instance, pointplots for categorical data aggregation.
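Another option, sketched here with made-up sample values, is to skip df.plot() and call Matplotlib directly. When both calls receive the same string categories on the same axes, Matplotlib maps them to the same x positions, so the scatter and the lines should line up:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample data with the column names from the question
df = pd.DataFrame({
    "date": ["Jan", "Jan", "Feb", "Feb", "Mar"],
    "real_exe_time": [31, 30, 33, 38, 36],
})

# One row per date with the mean and the +/-30% bands
stats = (df.groupby("date", sort=False)["real_exe_time"]
           .mean()
           .reset_index(name="mean"))
stats["mean+30%"] = stats["mean"] * 1.3
stats["mean-30%"] = stats["mean"] * 0.7

fig, ax = plt.subplots()

# Both calls share the same categorical x axis
ax.scatter(df["date"], df["real_exe_time"], label="real_exe_time")
for col in ["mean", "mean+30%", "mean-30%"]:
    ax.plot(stats["date"], stats[col], label=col)

ax.legend()
```

Call plt.show() to display it.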
I'm guessing that you haven't transformed the date column to datetime objects, so the first thing you should do is this:
#Transform the date to datetime object
df_inc['date']=pd.to_datetime(df_inc['date'],format='%b')
fig=plt.figure()
ax = df_inc.plot(x='date', y=['mean','mean+30%','mean-30%']);
df_inc.plot.scatter(x = 'date', y = 'real_exe_time', ax=ax)
plt.show()

How to change pyplot background colour in region of interest?

I have a dataframe with a datetime index:
A B
date
2020-05-04 0 0
2020-05-05 5 0
2020-05-07 2 0
2020-05-09 2 0
2020-05-18 -5 0
2020-05-19 -1 0
2020-05-20 0 0
2020-05-21 1 0
2020-05-22 0 0
2020-05-23 3 0
2020-05-24 1 1
2020-05-25 0 1
2020-05-26 4 1
2020-05-27 3 1
I want to make a lineplot to track A over time and colour the background of the plot red when the values of B are 1. I have implemented this code to make the graph:
from matplotlib import dates as mdates
from matplotlib.colors import ListedColormap
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
cmap = ListedColormap(['white','red'])
ax.plot(data['A'])
ax.set_xlabel('')
plt.xticks(rotation = 30)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
ax.pcolorfast(ax.get_xlim(), ax.get_ylim(),
data['B'].values[np.newaxis],
cmap = cmap, alpha = 0.4)
plt.axhline(y = 0, color = 'black')
plt.tight_layout()
This gives me this graph:
But the red region incorrectly starts from 2020-05-21 rather than 2020-05-24 and it doesn't end at the end date in the dataframe. How can I alter my code to fix this?
If you replace ax.pcolorfast(ax.get_xlim(), ... with ax.pcolor(data.index, ... you get what you want. The problem with the current code is that ax.get_xlim() makes pcolorfast create a uniform rectangular grid, while your index is not uniform (dates are missing), so the colored mesh does not line up as expected. The whole thing is:
from matplotlib import dates as mdates
from matplotlib.colors import ListedColormap
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
cmap = ListedColormap(['white','red'])
fig = plt.figure()
ax = fig.add_subplot()
ax.plot(data['A'])
ax.set_xlabel('')
plt.xticks(rotation = 30)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
#here are the two changes use pcolor
ax.pcolor(data.index, #use data.index to create the proper grid
ax.get_ylim(),
data['B'].values[np.newaxis],
cmap = cmap, alpha = 0.4,
linewidth=0, antialiased=True)
plt.axhline(y = 0, color = 'black')
plt.tight_layout()
and you get
I prefer axvspan in this case, see here for more information.
This adaptation will color the areas where data.B==1, including the potential where data.B might not be a continuous block.
With a modified dataframe data from data1.csv (added some more points that are 1):
date A B
5/4/2020 0 0
5/5/2020 5 0
5/7/2020 2 1
5/9/2020 2 1
5/18/2020 -5 0
5/19/2020 -1 0
5/20/2020 0 0
5/21/2020 1 0
5/22/2020 0 0
5/23/2020 3 0
5/24/2020 1 1
5/25/2020 0 1
5/26/2020 4 1
5/27/2020 3 1
from matplotlib import dates as mdates
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data1.csv',index_col='date')
data.index = pd.to_datetime(data.index)
fig = plt.figure()
ax = fig.add_subplot()
ax.plot(data['A'])
plt.xticks(rotation = 30)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.axhline(y = 0, color = 'black')
# in this case I'm looking for a pair of ones to determine where to color
for i in range(1, len(data.B)):
    # .iloc makes the positional lookup explicit, since the index holds dates
    if data.B.iloc[i] == 1 and data.B.iloc[i-1] == 1:
        plt.axvspan(data.index[i-1], data.index[i], color='r', alpha=0.4, lw=0)
plt.tight_layout()
If data.B==1 will always be "one block" you can do away with the for loop and just use something like this in its place:
first = min(idx for idx, val in enumerate(data.B) if val == 1)
last = max(idx for idx, val in enumerate(data.B) if val == 1)
plt.axvspan(data.index[first], data.index[last], color='r', alpha=0.4, lw=0)
Regarding "why" your data does not align, @Ben.T has this solution.
UPDATE: as pointed out, the for loop could be too crude for large datasets. The following uses numpy to find the falling and rising edges of data.B and then loops on those results:
import numpy as np
diffB = np.append([0], np.diff(data.B))
up = np.where(diffB == 1)[0]
dn = np.where(diffB == -1)[0]
if diffB[np.argmax(diffB!=0)]==-1:
    # we have a falling edge before rising edge, must have started 'up'
    up = np.append([0], up)
if diffB[len(diffB) - np.argmax(diffB[::-1]) - 1]==1:
    # we have a rising edge that never fell, force it 'dn'
    dn = np.append(dn, [len(data.B)-1])
for i in range(len(up)):
    plt.axvspan(data.index[up[i]], data.index[dn[i]], color='r', alpha=0.4, lw=0)
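A slightly more compact variant of the edge detection, offered only as a sketch: padding B with a zero on both sides means every block of ones has both a rising and a falling edge, so the two special cases are not needed. The helper name runs_of_ones is made up here, and the data is the modified table from above. Unlike the version above, each span ends at the last date where B is 1:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def runs_of_ones(flags):
    """Return (start, end) index pairs, inclusive, for each block of ones."""
    padded = np.concatenate(([0], np.asarray(flags, dtype=int), [0]))
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)     # first index of each block
    ends = np.flatnonzero(edges == -1) - 1  # last index of each block
    return list(zip(starts.tolist(), ends.tolist()))


# The modified example data from above
dates = pd.to_datetime([
    "2020-05-04", "2020-05-05", "2020-05-07", "2020-05-09", "2020-05-18",
    "2020-05-19", "2020-05-20", "2020-05-21", "2020-05-22", "2020-05-23",
    "2020-05-24", "2020-05-25", "2020-05-26", "2020-05-27",
])
data = pd.DataFrame(
    {"A": [0, 5, 2, 2, -5, -1, 0, 1, 0, 3, 1, 0, 4, 3],
     "B": [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]},
    index=dates,
)

fig, ax = plt.subplots()
ax.plot(data.index, data["A"])
ax.axhline(y=0, color="black")

runs = runs_of_ones(data["B"])
for start, end in runs:
    # Shade from the first to the last date of each block where B == 1
    ax.axvspan(data.index[start], data.index[end], color="r", alpha=0.4, lw=0)
```

For this data, runs is [(2, 3), (10, 13)], i.e. 2020-05-07 to 2020-05-09 and 2020-05-24 to 2020-05-27.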

Pandas dataframe plotting - issue when switching from two subplots to single plot w/ secondary axis

I have two sets of data I want to plot together on a single figure. I have a set of flow data at 15 minute intervals I want to plot as a line plot, and a set of precipitation data at hourly intervals, which I am resampling to a daily time step and plotting as a bar plot. Here is what the format of the data looks like:
2016-06-01 00:00:00 56.8
2016-06-01 00:15:00 52.1
2016-06-01 00:30:00 44.0
2016-06-01 00:45:00 43.6
2016-06-01 01:00:00 34.3
At first I set this up as two subplots, with precipitation and flow rate on different axes. This works totally fine. Here's my code:
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
filename = 'manhole_B.csv'
plotname = 'SSMH-2A B'
plt.style.use('bmh')
# Read csv with precipitation data, change index to datetime object
pdf = pd.read_csv('precip.csv', delimiter=',', header=None, index_col=0)
pdf.columns = ['Precipitation[in]']
pdf.index.name = ''
pdf.index = pd.to_datetime(pdf.index)
pdf = pdf.resample('D').sum()
print(pdf.head())
# Read csv with flow data, change index to datetime object
qdf = pd.read_csv(filename, delimiter=',', header=None, index_col=0)
qdf.columns = ['Flow rate [gpm]']
qdf.index.name = ''
qdf.index = pd.to_datetime(qdf.index)
# Plot
f, ax = plt.subplots(2)
qdf.plot(ax=ax[1], rot=30)
pdf.plot(ax=ax[0], kind='bar', color='r', rot=30, width=1)
ax[0].get_xaxis().set_ticks([])
ax[1].set_ylabel('Flow Rate [gpm]')
ax[0].set_ylabel('Precipitation [in]')
ax[0].set_title(plotname)
f.set_facecolor('white')
f.tight_layout()
plt.show()
2 Axis Plot
However, I decided I want to show everything on a single axis, so I modified my code to put precipitation on a secondary axis. Now my flow data has disappeared from the plot, and even when I set the axis ticks to an empty set, I get these 00:15, 00:30 and 00:45 tick marks along the x-axis.
Secondary-y axis plots
Any ideas why this might be occurring?
Here is my code for the single axis plot:
f, ax = plt.subplots()
qdf.plot(ax=ax, rot=30)
pdf.plot(ax=ax, kind='bar', color='r', rot=30, secondary_y=True)
ax.get_xaxis().set_ticks([])
Here is an example:
Setup
In [1]: from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
df = pd.DataFrame({'x' : np.arange(10),
'y1' : np.random.rand(10,),
'y2' : np.square(np.arange(10))})
df
Out[1]: x y1 y2
0 0 0.451314 0
1 1 0.321124 1
2 2 0.050852 4
3 3 0.731084 9
4 4 0.689950 16
5 5 0.581768 25
6 6 0.962147 36
7 7 0.743512 49
8 8 0.993304 64
9 9 0.666703 81
Plot
In [2]: fig, ax1 = plt.subplots()
ax1.plot(df['x'], df['y1'], 'b-')
ax1.set_xlabel('Series')
ax1.set_ylabel('Random', color='b')
for tl in ax1.get_yticklabels():
    tl.set_color('b')
ax2 = ax1.twinx() # Note twinx, not twiny. I was wrong when I commented on your question.
ax2.plot(df['x'], df['y2'], 'ro')
ax2.set_ylabel('Square', color='r')
for tl in ax2.get_yticklabels():
    tl.set_color('r')
Out[2]:
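Applied to the flow and precipitation case, the same twinx approach might look like the sketch below. The data is synthetic and the variable names (flow, precip, daily) are made up. As far as I can tell, pandas bar plots place the bars at integer positions rather than at the dates, which would explain why the line and the bars do not share an x-axis. Calling Matplotlib's bar directly keeps both series on real dates:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-ins for the two CSV files (three days of data)
flow_index = pd.date_range("2016-06-01", periods=4 * 24 * 3,
                           freq=pd.Timedelta(minutes=15))
flow = pd.Series(50 + 10 * np.sin(np.arange(len(flow_index)) / 20),
                 index=flow_index)

precip_index = pd.date_range("2016-06-01", periods=24 * 3,
                             freq=pd.Timedelta(hours=1))
precip = pd.Series(np.full(len(precip_index), 0.01), index=precip_index)
daily = precip.resample("D").sum()  # hourly values summed to daily totals

fig, ax = plt.subplots()
ax.plot(flow.index, flow.values, color="C0")
ax.set_ylabel("Flow Rate [gpm]")

# Second y-axis sharing the same datetime x-axis
ax2 = ax.twinx()
# width is in days on a date axis; align='edge' starts each bar at midnight
ax2.bar(daily.index, daily.values, width=1.0, align="edge",
        color="r", alpha=0.4)
ax2.set_ylabel("Precipitation [in]")

fig.autofmt_xdate(rotation=30)
```

Call plt.show() to display it.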
