Pandas line plot with markers based on another column

I have a dataframe like the following:
df:
ind group people value value_50
1 1 5 100 1
1 2 2 90 1
2 1 10 80 1
2 2 20 40 0
3 1 7 10 0
3 2 23 30 0
I then pivoted it so that each group's metrics appear in separate columns:
df = data.pivot_table(index = data.ind, columns = ['group'], values = ['people', 'value','value_50'])
df
Then I tried to plot 'value' for both groups separately, with 'ind' on the x axis:
df.plot()
But I don't want to include all the columns in the graph. Instead, I am trying to color each marker based on df['value_50'] and size it based on df['people'], using the c and s parameters respectively.
This would help to identify certain points on the graph.
df['value'].plot(c =df['value_50'], s = df['value'])
but I receive an error:
AttributeError: Unknown property s
Is this also possible with cufflinks? I have tried
df['value'].iplot(c =df['value_50'], s = df['value'])
but that failed as well.
How can I do this with pandas or cufflinks?

You asked for plotly express, but it would be almost as easy and far more flexible to use plotly.graph_objs.
Code:
import numpy as np
import plotly.graph_objs as go
# plotly setup and traces
fig = go.Figure()
# lines 1
fig.add_trace(go.Scatter(x=df.index, y=df['value'][1].values,
                         name='value_1',
                         mode='lines'))
# lines 2
fig.add_trace(go.Scatter(x=df.index, y=df['value'][2].values,
                         name='value_2',
                         mode='lines'))
# markers 1
fig.add_trace(go.Scatter(x=df.index, y=df['value'][1].values,
                         name='people',
                         mode='markers',
                         marker=dict(color=df['value_50'][1], colorscale='viridis', colorbar=dict(title='value_50')),
                         marker_size=df['people'][1]*1.8
                         )
              )
# markers 2
fig.add_trace(go.Scatter(x=df.index, y=df['value'][2].values,
                         name='people',
                         mode='markers',
                         marker=dict(color=df['value_50'][2], colorscale='viridis', colorbar=dict(title='value_50')),
                         marker_size=df['people'][2]*1.8
                         )
              )
# adjust and show final figure
fig.update_layout(legend=dict(x=-.15, y=1))
fig.show()
I'm still not 100% sure what you're aiming to do here. Let me know how this works for you and we can have a look at the details.

I am using matplotlib to graph the data in the way you want. To recap your question: you want to plot value on the y axis against ind on the x axis, with the size of each point based on the value in the people column, and with the graph split by group.
Pivot DF
df_pv = df.pivot(index='ind', columns='group', values=['people', 'value', 'value_50'])
>> out
      people     value      value_50
group      1   2     1   2         1  2
ind
1          5   2   100  90         1  1
2         10  20    80  40         1  0
3          7  23    10  30         0  0
Graph
import random
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,figsize=(10,5))
ind = df_pv.index.values
# generate random hex color & create as many colors as groups.
r = lambda: random.randint(0,255)
colors = ['#%02X%02X%02X' % (r(), r(), r()) for i in range(len(df_pv.people.columns.values))]
labels = df_pv.people.columns.values
for i in range(len(df_pv.people.values[0])):
    val = df_pv.value.values[:,i]
    peop = df_pv.people.values[:,i]
    # draw each point separately so its size can follow the people column
    for j in range(len(peop)):
        plt.scatter(x=[ind[j]], y=[val[j]],
                    marker='o', linestyle='--',s=peop[j]*7, color=colors[i])
    # connect the points of this group with a line
    plt.plot(ind, val, color=colors[i], label=f'Group: {labels[i]}')
plt.legend()
plt.xticks(df_pv.index.unique())
plt.ylabel('Value')
plt.xlabel('Ind')
plt.title('Graph')
plt.show()
My hope, at first, was to create the graph and then access each individual marker to set its size. Unfortunately, I was not able to find a solution along those lines.
Instead, we plot each individual point for each group using plt.scatter(), taking the size of the point from the people column for that group. We then connect the points using plt.plot() and assign the label and the color.
The code is written to accept n different groups without having to assign any values (colors, points, etc.) manually.
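One possible shortcut, offered as a sketch rather than a correction: ax.scatter accepts array-like s and c arguments, so the markers of each group can be drawn in a single call, with sizes taken from people and colors from value_50. The example below rebuilds the sample frame from the question; the factor of 7, the viridis colormap, and the Agg backend line are illustrative assumptions, not part of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Sample data from the question
data = pd.DataFrame({
    "ind": [1, 1, 2, 2, 3, 3],
    "group": [1, 2, 1, 2, 1, 2],
    "people": [5, 2, 10, 20, 7, 23],
    "value": [100, 90, 80, 40, 10, 30],
    "value_50": [1, 1, 1, 0, 0, 0],
})
df_pv = data.pivot(index="ind", columns="group",
                   values=["people", "value", "value_50"])

fig, ax = plt.subplots(figsize=(10, 5))
for group in df_pv["value"].columns:
    # line connecting the points of this group
    ax.plot(df_pv.index, df_pv["value"][group], label=f"Group: {group}", zorder=1)
    # one scatter call per group: size follows people, color follows value_50
    ax.scatter(df_pv.index, df_pv["value"][group],
               s=df_pv["people"][group] * 7,
               c=df_pv["value_50"][group], cmap="viridis", vmin=0, vmax=1,
               edgecolors="k", zorder=2)
ax.legend()
ax.set_xticks(df_pv.index)
ax.set_xlabel("Ind")
ax.set_ylabel("Value")
```

Call plt.show() or fig.savefig(...) afterwards to view the result. The line colors come from matplotlib's default color cycle here, so the random color generation is not needed in this variant.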

Related

Matplotlib customize rank line plot

I have the following dataframe, which contains the best equipment in operation ranked from 1 to 300 (1 is the best, 300 is the worst) over a few days (the df columns):
Equipment 21-03-27 21-03-28 21-03-29 21-03-30 21-03-31 21-04-01 21-04-02
P01-INV-1-1 1 1 1 1 1 2 2
P01-INV-1-2 2 2 4 4 5 1 1
P01-INV-1-3 4 4 3 5 6 10 10
I would like to customize a line plot (example found here) but I'm having some troubles trying to modify the example code provided:
import matplotlib.pyplot as plt
import numpy as np
def energy_rank(data, marker_width=0.1, color='blue'):
    y_data = np.repeat(data, 2)
    x_data = np.empty_like(y_data)
    x_data[0::2] = np.arange(1, len(data)+1) - (marker_width/2)
    x_data[1::2] = np.arange(1, len(data)+1) + (marker_width/2)
    lines = []
    lines.append(plt.Line2D(x_data, y_data, lw=1, linestyle='dashed', color=color))
    for x in range(0,len(data)*2, 2):
        lines.append(plt.Line2D(x_data[x:x+2], y_data[x:x+2], lw=2, linestyle='solid', color=color))
    return lines
data = ranks.head(4).to_numpy() #ranks is the above dataframe
artists = []
for row, color in zip(data, ('red','blue','green','magenta')):
    artists.extend(energy_rank(row, color=color))
fig, ax = plt.subplots()
ax.set_xticklabels(ranks.columns) # set X axis to be dataframe columns
ax.set_xticklabels(ax.get_xticklabels(), rotation=35, fontsize = 10)
for artist in artists:
    ax.add_artist(artist)
ax.set_ybound([15,0])
ax.set_xbound([.5,8.5])
When using ax.set_xticklabels(ranks.columns), for some reason it only shows 5 of the 7 days from the ranks columns, specifically dropping the first and last values. I tried duplicating those values, but that did not work either.
In summary, I would like to know if it's possible to do 3 customizations:
1. Show all dates from the ranks columns on the X axis.
2. Reverse the Y axis. ax.set_ybound([15,0]) is not working. It would make more sense for the graph to start with 0 at the top, since 1 is the most important rank to look at.
3. Add labels at the end of each line at the last day (the last value on the X axis). I could add the little window label (the legend), but it often gets really messy when you plot more data, so adding just the text at the end of each line would look much cleaner.
Please let me know if those customizations are impossible to do and any help is really appreciated! Thank you in advance!

To show all the dates, use plt.xticks() and set_xbound to start at 0. To reverse the y axis, use ax.set_ylim(ax.get_ylim()[::-1]). To set the legends the way you described, you can use annotation and set the coordinates of the annotation at your last datapoint for each series.
fig, ax = plt.subplots()
plt.xticks(np.arange(len(ranks.columns)), list(ranks.columns), rotation = 35, fontsize = 10)
plt.xlabel('Date')
plt.ylabel('Rank')
for artist in artists:
    ax.add_artist(artist)
ax.set_ybound([0,15])
ax.set_ylim(ax.get_ylim()[::-1])
ax.set_xbound([0,8.5])
ax.annotate('Series 1', xy =(7.1, 2), color = 'red')
ax.annotate('Series 2', xy =(7.1, 1), color = 'blue')
ax.annotate('Series 3', xy =(7.1, 10), color = 'green')
plt.show()
Here is the plot for the three rows of data in your sample dataframe:
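The hard-coded annotation coordinates above could also be derived from the data. The following is a minimal sketch under a few assumptions: Equipment is taken to be the index of ranks, plain ax.plot calls stand in for the Line2D artists from the question, and the label offsets and the Agg backend line are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Sample rows from the question; Equipment is assumed to be the index.
ranks = pd.DataFrame(
    [[1, 1, 1, 1, 1, 2, 2],
     [2, 2, 4, 4, 5, 1, 1],
     [4, 4, 3, 5, 6, 10, 10]],
    index=["P01-INV-1-1", "P01-INV-1-2", "P01-INV-1-3"],
    columns=["21-03-27", "21-03-28", "21-03-29", "21-03-30",
             "21-03-31", "21-04-01", "21-04-02"],
)

fig, ax = plt.subplots()
x = list(range(len(ranks.columns)))
for (name, row), color in zip(ranks.iterrows(), ("red", "blue", "green")):
    ax.plot(x, row.to_numpy(), color=color, linestyle="dashed",
            marker="_", markersize=14)
    # Put the series name just to the right of its last data point.
    ax.annotate(name, xy=(x[-1], row.iloc[-1]), xytext=(8, 0),
                textcoords="offset points", va="center", color=color)

ax.set_xticks(x)  # one tick per date, so every column gets a label
ax.set_xticklabels(ranks.columns, rotation=35, fontsize=10)
ax.set_xlim(-0.5, len(x) + 1)  # leave room on the right for the labels
ax.set_xlabel("Date")
ax.set_ylabel("Rank")
ax.invert_yaxis()  # rank 1 at the top
```

Setting the tick positions explicitly before the tick labels is what keeps all seven dates visible; ax.invert_yaxis() is an alternative to reversing the limits by hand.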

Pandas stacked bar creating many individual plots with incorrect bottom values

Given a Dataframe (this is generated from a csv that contains the names and orders and updated everyday):
# Note that this is just an example df and the real can have N names in n shuffled orders
df = pd.read_csv('names_and_orders.csv', header=0)
print(df)
names order
0 mike 0
1 jo 1
2 mary 2
3 jo 0
4 mike 1
5 mary 2
6 mike 0
7 mary 1
8 jo 2
I am turning this into a stacked bar plot using pandas' stacked bar functionality and a for loop, as shown below.
# Create list of names from original df
names1 = df['names'].drop_duplicates().tolist()
N = len(names1)
viridis = cm.get_cmap('viridis', 100)
# Get count of each name at each order
df_count = df_o.groupby(['order', 'names']).size().reset_index(name='count')
# Plot count vs order in a stacked bar with the label as the current name
for i in range(len(names1)):
    values = list(df_count[df_count['names'] == names1[i]].loc[:, 'count'])
    df_count[df_count['names'] == names1[i]].plot.bar(x='order', y='count', color=viridis(i / N), stacked=True,
                                                      bottom=values, edgecolor='black', label=names1[i])
    values += values
# Add ticks, labels, title, and legend to plot
plt.xticks(np.arange(0, N, step=1))
plt.xlabel('Order')
plt.yticks(np.arange(0, df_count['count'].max(), step=1))
plt.ylabel('Count')
plt.title('How many times each person has been at each order number')
plt.legend()
plt.show()
Given this code, there are two main issues I am having:
1. It currently plots every name on a different figure instead of making one stacked bar plot.
2. I don't believe the values used for the bottom kwarg are correct.

I think you're overthinking this. Just unstack the groupby and plot:
df_count = df.groupby(['order', 'names']).size().unstack('names')
df_count.plot.bar(stacked=True)
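For reference, here is the same idea as a self-contained sketch using the sample rows from the question. The fill_value=0 argument, the edge color, the axis labels, and the Agg backend line are assumptions added for illustration; they are not part of the two-line answer above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop this line for interactive use
import pandas as pd

df = pd.DataFrame({
    "names": ["mike", "jo", "mary", "jo", "mike", "mary", "mike", "mary", "jo"],
    "order": [0, 1, 2, 0, 1, 2, 0, 1, 2],
})

# Rows are orders, columns are names, cells are counts.
# fill_value=0 puts a zero where a name never appears at an order.
df_count = df.groupby(["order", "names"]).size().unstack("names", fill_value=0)
print(df_count)

# One call draws a single stacked bar chart; pandas computes the bottoms itself.
ax = df_count.plot.bar(stacked=True, edgecolor="black")
ax.set_xlabel("Order")
ax.set_ylabel("Count")
```

Because the wide table has one column per name, there is no need to loop over names or to track the bottom values by hand.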

Plotting classification results in different dates?

I have a data frame (my_data) as follows:
0 2017-01 2017-02 2017-03 2017-04
0 S1 2 3 2 2
1 S2 2 0 2 0
2 S3 1 0 2 2
3 S4 3 2 2 2
4 … … … … …
5 … … … … …
6 S10 2 2 3 2
This data frame is a result of a classification problem in different dates for each sample (S1,.., S10). In order to simplify the plotting I converted the confusion matrix in different numbers as follows: 0 means ‘TP’, 1 means ‘FP’, 2 refers to ‘TN’ and 3 points to ‘FN’. Now, I want to plot this data frame like the below image.
I should mention that I already asked this question, but nobody could help me. So I have tried to make the question easier to understand so that I can get help.

Unfortunately, I don't know of a way to plot one set of data with different markers, so you will have to loop over your data and plot each point separately.
You can use matplotlib to plot your data. I'm not sure what your data looks like, but for a file with these contents:
2017-01,2017-02,2017-03,2017-04
2,3,2,2
2,0,2,0
1,0,2,2
3,2,2,2
2,2,3,2
You can use the following code to get the plot you want:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.lines import Line2D  # needed for the custom legend handles
fig, ax = plt.subplots()
df = pd.read_csv('dataframe.txt', parse_dates = True)
dates = list(df.columns.values) #get dates
number_of_dates = len(dates)
markers = ["o", "d", "^", "s"] #set marker shape
colors = ["g", "r", "m", "y"] #set marker color
# loop over the data in your dataframe
for i in range(df.shape[0]):
    # get a row of 1s, 2s, ... as you want your
    # data S1, S2, in one line on top of each other
    dataY = (i+1)*np.ones(number_of_dates)
    # get the data that will specify which marker to use
    data = df.loc[i]
    # plot dashed line first, setting it underneath markers with zorder
    plt.plot(dates, dataY, c="k", linewidth=1, dashes=[6, 2], zorder=1)
    # loop over each data point x is the date, y a constant number,
    # and data specifies which marker to use
    for _x, _y, _data in zip(dates, dataY, data):
        plt.scatter(_x, _y, marker=markers[_data], c=colors[_data], s=100, edgecolors="k", linewidths=0.5, zorder=2)
# label your ticks S1, S2, ...
ticklist = list(range(1,df.shape[0]+1))
l2 = [("S%s" % x) for x in ticklist]
ax.set_yticks(ticklist)
ax.set_yticklabels(l2)
# label order follows the encoding in the question: 0=TP, 1=FP, 2=TN, 3=FN
labels = ["TP","FP","TN","FN"]
legend_elements = []
for l,c, m in zip(labels, colors, markers):
    legend_elements.append(Line2D([0], [0], marker=m, color="w", label=l, markerfacecolor=c, markeredgecolor = "k", markersize=10))
ax.legend(handles=legend_elements, loc='upper right')
plt.show()
Plotting idea from this answer.
This results in a plot looking like this:
EDIT Added dashed line and outline for markers to look more like example in question.
EDIT2 Added legend.
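A variation on the loop above, sketched under assumptions: a single scatter call uses one marker shape, so the points can be grouped by class and drawn with one call per class instead of one call per point. The sketch builds the frame inline from the five example rows, labels the rows S1 to S5 by position as the answer does, and uses integer column positions on the x axis; the Agg backend line is also an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# The five example rows used in the answer above
df = pd.DataFrame({
    "2017-01": [2, 2, 1, 3, 2],
    "2017-02": [3, 0, 0, 2, 2],
    "2017-03": [2, 2, 2, 2, 3],
    "2017-04": [2, 0, 2, 2, 2],
})
# code -> (label, marker, color), following the encoding in the question
classes = {0: ("TP", "o", "g"), 1: ("FP", "d", "r"),
           2: ("TN", "^", "m"), 3: ("FN", "s", "y")}

values = df.to_numpy()
rows, cols = np.indices(values.shape)  # row and column position of every cell

fig, ax = plt.subplots()
for code, (label, marker, color) in classes.items():
    mask = values == code  # cells belonging to this class
    ax.scatter(cols[mask], rows[mask] + 1, marker=marker, color=color,
               s=100, edgecolors="k", linewidths=0.5, label=label, zorder=2)
# dashed guide line per sample, drawn underneath the markers
ax.hlines(np.arange(1, len(df) + 1), 0, len(df.columns) - 1,
          colors="k", linestyles="dashed", linewidth=1, zorder=1)
ax.set_xticks(range(len(df.columns)))
ax.set_xticklabels(df.columns)
ax.set_yticks(range(1, len(df) + 1))
ax.set_yticklabels([f"S{i}" for i in range(1, len(df) + 1)])
ax.legend(loc="upper right")
```

Since each scatter call carries its own label, ax.legend() can build the legend directly and the manual Line2D handles are not needed in this variant.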

Final plot in a series of matplotlib subplots has increased y tick label padding

I have a Pandas dataframe that contains columns representing year, month within year and a binary outcome (0/1). I want to plot a column of barcharts with one barchart per year. I've used the subplots() function in matplotlib.pyplot with sharex = True and sharey = True. The graphs look fine except the padding between the y-ticks and the y-tick labels is different on the final (bottom) graph.
An example dataframe can be created as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Generate dataframe containing dates over several years and a random binary outcome
tempDF = pd.DataFrame()
tempDF['date'] = pd.date_range(start = pd.to_datetime('2014-01-01'),end = pd.to_datetime('2017-12-31'))
tempDF['case'] = np.random.choice([0,1],size = [len(tempDF.index)],p = [0.9,0.1])
# Create a dataframe that summarises proportion of cases per calendar month
tempGroupbyCalendarMonthDF = tempDF.groupby([tempDF['date'].dt.year,tempDF['date'].dt.month]).agg({'case': sum,
'date': 'count'})
tempGroupbyCalendarMonthDF.index.names = ['year','month']
tempGroupbyCalendarMonthDF = tempGroupbyCalendarMonthDF.reset_index()
# Rename columns to something more meaningful
tempGroupbyCalendarMonthDF = tempGroupbyCalendarMonthDF.rename(columns = {'case': 'numberCases',
'date': 'monthlyTotal'})
# Calculate percentage positive cases per month
tempGroupbyCalendarMonthDF['percentCases'] = (tempGroupbyCalendarMonthDF['numberCases']/tempGroupbyCalendarMonthDF['monthlyTotal'])*100
The final dataframe looks something like:
year month monthlyTotal numberCases percentCases
0 2014 1 31 5 16.129032
1 2014 2 28 5 17.857143
2 2014 3 31 3 9.677419
3 2014 4 30 1 3.333333
4 2014 5 31 4 12.903226
.. ... ... ... ... ...
43 2017 8 31 2 6.451613
44 2017 9 30 2 6.666667
45 2017 10 31 3 9.677419
46 2017 11 30 2 6.666667
47 2017 12 31 1 3.225806
Then the plots are produced as shown below. The subplots() function is used to return an array of axes. The code steps through each axis and plots the values. The x-axis ticks and labels are only displayed on the final (bottom) plot. Finally, to get a common y-axis label, an additional subplot is added that covers all the bar graphs, but none of its axes and labels (except the y-axis label) are displayed.
# Calculate minimum and maximum years in dataset
minYear = tempDF['date'].min().year
maxYear = tempDF['date'].max().year
# Set a few parameters
barWidth = 0.80
labelPositionX = 0.872
labelPositionY = 0.60
numberSubplots = maxYear - minYear + 1
fig, axArr = plt.subplots(numberSubplots,figsize = [8,10],sharex = True,sharey = True)
# Keep track of which year to plot, starting with first year in dataset.
currYear = minYear
# Step through each subplot
for ax in axArr:
    # Plot the data
    rects = ax.bar(tempGroupbyCalendarMonthDF.loc[tempGroupbyCalendarMonthDF['year'] == currYear,'month'],
                   tempGroupbyCalendarMonthDF.loc[tempGroupbyCalendarMonthDF['year'] == currYear,'percentCases'],
                   width = barWidth)
    # Format the axes
    ax.set_xlim([0.8,13])
    ax.set_ylim([0,40])
    ax.grid(True)
    ax.tick_params(axis = 'both',
                   left = 'on',
                   bottom = 'off',
                   top = 'off',
                   right = 'off',
                   direction = 'out',
                   length = 4,
                   width = 2,
                   labelsize = 14)
    # Turn on the x-axis ticks and labels for final plot only
    if currYear == maxYear:
        ax.tick_params(bottom = 'on')
        xtickPositions = [1,2,3,4,5,6,7,8,9,10,11,12]
        ax.set_xticks([x + barWidth/2 for x in xtickPositions])
        ax.set_xticklabels(['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
    ytickPositions = [10,20,30]
    ax.set_yticks(ytickPositions)
    # Add label (in this case, the year) to each subplot
    # (The transform = ax.transAxes makes positions relative to current axis.)
    ax.text(labelPositionX, labelPositionY,currYear,
            horizontalalignment = 'center',
            verticalalignment = 'center',
            transform = ax.transAxes,
            family = 'sans-serif',
            fontweight = 'bold',
            fontsize = 18,
            color = 'gray')
    # Onto the next year...
    currYear = currYear + 1
# Fine-tune overall figure
# ========================
# Make subplots close to each other.
fig.subplots_adjust(hspace=0)
# To display a common y-axis label, create a large axis that covers all the subplots
fig.add_subplot(111, frameon=False)
# Hide ticks and tick labels of the big axes
plt.tick_params(labelcolor='none', top='off', bottom='off', left='off', right='off')
# Add y label that spans several subplots
plt.ylabel("Percentage cases (%)", fontsize = 16, labelpad = 20)
plt.show()
The figure that is produced is almost exactly what I want but the y-axis tick labels on the bottom plot are set further from the axis compared with all the other plots. If the number of plots produced is altered (by using a wider range of dates), the same pattern occurs, namely, it's only the final plot that appears different.
I'm almost certainly not seeing the wood for the trees but can anyone spot what I've done wrong?
EDIT
The above code was originally run on Matplotlib 1.4.3 (see the comment by ImportanceOfBeingErnest). However, after updating to Matplotlib 2.0.2 the code failed to run (KeyError: 0). The reason appears to be that the default setting in Matplotlib 2.x is for bars to be center-aligned. To get the above code to run, either adjust the x-axis range and tick positions so that the bars don't extend beyond the y-axis, or set align='edge' in the plotting function, i.e.:
rects = ax.bar(tempGroupbyCalendarMonthDF.loc[tempGroupbyCalendarMonthDF['year'] == currYear,'month'],
tempGroupbyCalendarMonthDF.loc[tempGroupbyCalendarMonthDF['year'] == currYear,'percentCases'],
width = barWidth,
align = 'edge')
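To illustrate the alignment difference described in the EDIT, here is a small sketch with made-up values (the three months, the heights, and the Agg backend line are assumptions for illustration). With a width of 0.8, a bar at x = 1 spans 0.6 to 1.4 when center-aligned and 1.0 to 1.8 with align='edge'.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop this line for interactive use
import matplotlib.pyplot as plt

months = [1, 2, 3]        # illustrative x positions
heights = [10, 20, 30]    # illustrative bar heights
bar_width = 0.8

fig, (ax_center, ax_edge) = plt.subplots(2, sharex=True)
# Default in Matplotlib 2.0 and later: x marks the center of each bar
centered = ax_center.bar(months, heights, width=bar_width)
# align='edge': x marks the left edge of each bar
edged = ax_edge.bar(months, heights, width=bar_width, align="edge")

# Left edge of the first bar under each alignment
left_centered = centered[0].get_x()
left_edged = edged[0].get_x()
print(left_centered, left_edged)
```

This is why tick positions computed as x + barWidth/2 line up with the bar centers only when align='edge' is used.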

How to add error bars on a grouped barplot from a column

I have a pandas data frame df that has four columns: Candidate, Sample_Set, Values, and Error. The Candidate column has, say, three unique entries: [X, Y, Z] and we have three sample sets, such that Sample_Set has three unique values as well: [1,2,3]. The df would roughly look like this.
import pandas as pd
data = {'Candidate': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
'Sample_Set': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'Values': [20, 10, 10, 200, 101, 99, 1999, 998, 1003],
'Error': [5, 2, 3, 30, 30, 30, 10, 10, 10]}
df = pd.DataFrame(data)
# display(df)
Candidate Sample_Set Values Error
0 X 1 20 5
1 Y 1 10 2
2 Z 1 10 3
3 X 2 200 30
4 Y 2 101 30
5 Z 2 99 30
6 X 3 1999 10
7 Y 3 998 10
8 Z 3 1003 10
I am using seaborn to create a grouped barplot out of this with x="Candidate", y="Values", hue="Sample_Set". All's good, until I try to add an error bar along the y-axis using the values under the column named Error. I am using the following code.
import seaborn as sns
ax = sns.factorplot(x="Candidate", y="Values", hue="Sample_Set", data=df,
size=8, kind="bar")
How do I incorporate the error?
I would appreciate a solution or a more elegant approach on the task.

As @ResMar pointed out in the comments, there seems to be no built-in functionality in seaborn to easily set individual error bars.
If you rather care about the result than the way to get there, the following (not so elegant) solution might be helpful, which builds on matplotlib.pyplot.bar. The seaborn import is just used to get the same style.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
def grouped_barplot(df, cat,subcat, val , err):
    u = df[cat].unique()
    x = np.arange(len(u))
    subx = df[subcat].unique()
    offsets = (np.arange(len(subx))-np.arange(len(subx)).mean())/(len(subx)+1.)
    width= np.diff(offsets).mean()
    for i,gr in enumerate(subx):
        dfg = df[df[subcat] == gr]
        plt.bar(x+offsets[i], dfg[val].values, width=width,
                label="{} {}".format(subcat, gr), yerr=dfg[err].values)
    plt.xlabel(cat)
    plt.ylabel(val)
    plt.xticks(x, u)
    plt.legend()
    plt.show()
cat = "Candidate"
subcat = "Sample_Set"
val = "Values"
err = "Error"
# call the function with df from the question
grouped_barplot(df, cat, subcat, val, err )
Note that by simply swapping the category and subcategory
cat = "Sample_Set"
subcat = "Candidate"
you can get a different grouping:

I suggest extracting the position coordinates from patches attributes, and then plotting the error bars.
ax = sns.barplot(data=df, x="Candidate", y="Values", hue="Sample_Set")
x_coords = [p.get_x() + 0.5*p.get_width() for p in ax.patches]
y_coords = [p.get_height() for p in ax.patches]
ax.errorbar(x=x_coords, y=y_coords, yerr=df["Error"], fmt="none", c= "k")

seaborn plots generate error bars when aggregating data, however this data is already aggregated, and has a specified error column.
The easiest solution is to use pandas to create the bar-chart with pandas.DataFrame.plot and kind='bar'
matplotlib is used by default as the plotting backend, and the plot API has a yerr parameter, which accepts the following:
As a DataFrame or dict of errors with column names matching the columns attribute of the plotting DataFrame or matching the name attribute of the Series.
As a str indicating which of the columns of plotting DataFrame contain the error values.
As raw values (list, tuple, or np.ndarray). Must be the same length as the plotting DataFrame/Series.
This can be accomplished by reshaping the dataframe from long form to wide form with pandas.DataFrame.pivot
See pandas User Guide: Plotting with error bars
Tested in python 3.8.12, pandas 1.3.4, matplotlib 3.4.3
# reshape the dataframe into a wide format for Values
vals = df.pivot(index='Candidate', columns='Sample_Set', values='Values')
# display(vals)
Sample_Set 1 2 3
Candidate
X 20 200 1999
Y 10 101 998
Z 10 99 1003
# reshape the dataframe into a wide format for Errors
yerr = df.pivot(index='Candidate', columns='Sample_Set', values='Error')
# display(yerr)
Sample_Set 1 2 3
Candidate
X 5 30 10
Y 2 30 10
Z 3 30 10
# plot vals with yerr
ax = vals.plot(kind='bar', yerr=yerr, logy=True, rot=0, figsize=(6, 5))
_ = ax.legend(title='Sample Set', bbox_to_anchor=(1, 1.02), loc='upper left')

You can get close to what you need using pandas plotting functionalities: see this answer
bars = data.groupby("Candidate").plot(kind='bar',x="Sample_Set", y= "Values", yerr=data['Error'])
This does not do exactly what you want, but pretty close. Unfortunately ggplot2 for python currently does not render error bars properly. Personally, I would resort to R ggplot2 in this case:
data <- read.csv("~/repos/tmp/test.csv")
data
library(ggplot2)
ggplot(data, aes(x=Candidate, y=Values, fill=factor(Sample_Set))) +
geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=Values-Error, ymax=Values+Error), width=.1, position=position_dodge(.9))
