Plot graph that includes time duration of event as width of bars - python

I'm trying to plot entries with different durations on a graph. I'm not sure whether the best way would be to plot a bar chart and have the duration variable define the width of each bar.
The data looks like this:
Variable 1  Variable 2  Duration (s)
50          36          14
70          41          25
60          40          20
55          18          27
Thanks in advance to anyone who can help out here!

plt.step draws a step function of the accumulated time. An extra zero time point and repeating the first entry makes sure all the values are shown.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data = [[50, 36, 14],
        [70, 41, 25],
        [60, 40, 20],
        [55, 18, 27]]
df = pd.DataFrame(data=data, columns=['Variable 1', 'Variable 2', 'Duration'])
xs = df['Duration'].cumsum()
for col in df.columns[:-1]:
    plt.step(np.append(0, xs), np.append(df[col][0], df[col]), where='pre')
plt.xticks(xs, df['Duration'])
plt.yticks(df.iloc[0, :-1], df.columns[:-1])
plt.tight_layout()
plt.show()

You can use cumsum to compute the cumulative duration, then plot with step:
# repeat the last row so the final duration is drawn as a step
(pd.concat([df, df.iloc[[-1]]])
   .assign(TimeDuration=lambda x: x['Duration (s)'].shift(fill_value=0).cumsum())
   .plot(x="TimeDuration", y=['Variable 1', 'Variable 2'], drawstyle='steps-post')
)
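The bar-chart idea from the question can also be tried directly. This is a minimal sketch, assuming each event starts where the previous one ends. The name `starts` is introduced here for illustration, and only Variable 1 is drawn to keep it short.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt

durations = np.array([14, 25, 20, 27])   # Duration (s)
variable_1 = np.array([50, 70, 60, 55])  # Variable 1

# Left edge of each bar: total duration of all earlier events (assumption: no gaps)
starts = np.concatenate(([0], np.cumsum(durations)[:-1]))

fig, ax = plt.subplots()
# align='edge' makes each bar span [start, start + duration] on the x-axis
bars = ax.bar(starts, variable_1, width=durations, align='edge', edgecolor='black')
ax.set_xlabel('Elapsed time (s)')
ax.set_ylabel('Variable 1')
# call plt.show() or fig.savefig(...) to view the figure
```

Compared with the step plots, this keeps each event visible as its own rectangle, which may read better when there are only a few entries.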

Related

set x axis as column names on barplot

I have a dataframe such as this:
data = {'name': ['Bob', 'Chuck', 'Daren', 'Elisa'],
'100m': [19, 14, 12, 11],
'200m': [36, 25, 24, 24],
'400m': [67, 64, 58, 57],
'800m': [117, 120, 123, 121]}
df = pd.DataFrame(data)
    name  100m  200m  400m  800m
1    Bob    19    36    67   117
2  Chuck    14    25    64   120
3  Daren    12    24    58   123
4  Elisa    11    24    57   121
My task is simple: plot the times along the y-axis, with the name of each event (100m, 200m, etc.) along the x-axis. The hue of each bar should be determined by the 'name' column, and look something like this.
Furthermore, I would like to overlay the results (not stack). However, there is no functionality in seaborn nor matplotlib to do this.
Instead of using seaborn, which is an API for matplotlib, plot df directly with pandas.DataFrame.plot. matplotlib is the default plotting backend for pandas.
Tested in python 3.11, pandas 1.5.1, matplotlib 3.6.2, seaborn 0.12.1
ax = df.set_index('name').T.plot.bar(alpha=.7, rot=0, stacked=True)
seaborn.barplot does not have an option for stacked bars, however, this can be implemented with seaborn.histplot, as shown in Stacked Bar Chart with Centered Labels.
df must be converted from a wide format to a long format with df.melt
# melt the dataframe
dfm = df.melt(id_vars='name')
# plot
ax = sns.histplot(data=dfm, x='variable', weights='value', hue='name', discrete=True, multiple='stack')
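Since the question asks for overlaid rather than stacked bars, one possible reading is to draw every runner's bars at the same x positions with some transparency. The following is only a sketch of that idea, not the answer's method; the `events` list and the alpha value are choices made here.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

data = {'name': ['Bob', 'Chuck', 'Daren', 'Elisa'],
        '100m': [19, 14, 12, 11],
        '200m': [36, 25, 24, 24],
        '400m': [67, 64, 58, 57],
        '800m': [117, 120, 123, 121]}
df = pd.DataFrame(data)

events = ['100m', '200m', '400m', '800m']
fig, ax = plt.subplots()
# One ax.bar call per runner at the same x positions, so bars overlap instead of stacking
for _, row in df.iterrows():
    ax.bar(events, row[events].to_numpy(dtype=float), alpha=0.4, label=row['name'])
ax.set_ylabel('time')
ax.legend()
```

Bars drawn later sit on top of earlier ones, so the transparency is what keeps the shorter bars visible.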

How to interpolate values between points

I have the dataset shown below:
temp = [0.1, 1, 4, 10, 15, 20, 25, 30, 35, 40]
sg =[0.999850, 0.999902, 0.999975, 0.999703, 0.999103, 0.998207, 0.997047, 0.995649, 0.99403, 0.99222]
sg_temp = pd.DataFrame({'temp' : temp,
'sg' : sg})
   temp        sg
0   0.1  0.999850
1   1.0  0.999902
2   4.0  0.999975
3  10.0  0.999703
4  15.0  0.999103
5  20.0  0.998207
6  25.0  0.997047
7  30.0  0.995649
8  35.0  0.994030
9  40.0  0.992220
I would like to interpolate all the values between 0.1 and 40 at a resolution of 0.001 with a spline interpolation and have those points in the dataframe as well. I have used resample() before but can't seem to find an equivalent for this case.
I have tried this based off of other questions but it doesn't work.
scale = np.linspace(0, 40, 40*1000)
interpolation_sg = interpolate.CubicSpline(list(sg_temp.temp), list(sg_temp.sg))
It works very well for me. What exactly does not work for you?
Have you correctly used the returned CubicSpline to generate your interpolated values? Or is there some kind of error?
Basically you obtain your interpolated y values by plugging in the new x values (scale) to your returned CubicSpline function:
y = interpolation_sg(scale)
I believe this is the issue here. You probably expect that the interpolation function returns you the values, but it returns a function. And you use this function to obtain your values.
If I plot this, I obtain this graph:
import matplotlib.pyplot as plt
plt.plot(sg_temp['temp'], sg_temp['sg'], marker='o', ls='') # Plots the original data
plt.plot(scale, interpolation_sg(scale)) # Plots the interpolated data
Call the result of the interpolation with scale:
from scipy import interpolate
out = pd.DataFrame(
    {'temp': scale,
     'sg': interpolate.CubicSpline(sg_temp['temp'],
                                   sg_temp['sg'])(scale)
    })
Code for the plot
ax = plt.subplot()
out.plot(x='temp', y='sg', label='interpolated', ax=ax)
sg_temp.plot(x='temp', y='sg', marker='o', label='sg', ls='', ax=ax)
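Note that np.linspace(0, 40, 40*1000) does not give a spacing of exactly 0.001 and starts below the first data point. If an exact 0.001 grid from 0.1 to 40 is wanted, a sketch along these lines may help. The names `spline` and `grid` are introduced here, and the point count 39901 follows from (40 - 0.1) / 0.001 + 1.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import CubicSpline

temp = [0.1, 1, 4, 10, 15, 20, 25, 30, 35, 40]
sg = [0.999850, 0.999902, 0.999975, 0.999703, 0.999103,
      0.998207, 0.997047, 0.995649, 0.99403, 0.99222]

# CubicSpline returns a callable; it has to be evaluated on new x values
spline = CubicSpline(temp, sg)

# 0.1 to 40.0 inclusive in steps of 0.001; rounding removes float noise in the grid
grid = np.round(np.linspace(0.1, 40.0, 39901), 3)
out = pd.DataFrame({'temp': grid, 'sg': spline(grid)})
```

Because the grid stays inside the measured range, the spline is only interpolating here and never extrapolating.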

Converting columns from float datatype to categorical datatype using binning

I wish to convert a data frame consisting of two columns.
Here is the sample df:
df:
   cost  numbers
1   360       23
2   120       35
3  2000       49
Both columns are float and I wish to convert them to categorical using binning.
I wish to create the following bins for each column when converting to categorical.
Bins for the numbers : 18-24, 25-44, 45-65, 66-92
Bins for cost column: >=1000, <1000
Finally, I want to convert each column in place rather than create a new one.
Here is my attempted code at this:
def PreprocessDataframe(df):
    #use binning to convert age and budget to categorical columns
    df['numbers'] = pd.cut(df['numbers'], bins=[18, 24, 25, 44, 45, 65, 66, 92])
    df['cost'] = pd.cut(df['cost'], bins=['=>1000', '<1000'])
    return df
I understand how to convert the "numbers" column but I am having trouble with the "cost" one.
Help would be nice on how to solve this.
Thanks in advance!
Cheers!
If you use bins=[18, 24, 25, 44, 45, 65, 66, 92], this is going to generate bins for 18-24, 24-25, 25-44, 44-45, etc... and you don't need the ones for 24-25, 44-45...
By default, each bin excludes its left edge and includes its right edge, so the bins run from the first value (not inclusive) to the last value (inclusive).
So, for numbers, you could use instead bins=[17, 24, 44, 65, 92] (note the 17 at the first position, so 18 is included).
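The edge behaviour can be checked on a few boundary values. This is a small sketch; the `ages` values are picked here only because they sit on the bin edges.

```python
import pandas as pd

ages = pd.Series([18, 24, 25, 44])  # hypothetical boundary values

# Edges from the question: 18 falls outside the first bin (left edge excluded),
# and 25 lands in the unwanted (24, 25] bin
naive = pd.cut(ages, bins=[18, 24, 25, 44, 45, 65, 66, 92])

# Edges from this answer: starting at 17 puts 18 in the first bin
fixed = pd.cut(ages, bins=[17, 24, 44, 65, 92])

print(naive)
print(fixed)
```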
The optional parameter labels lets you choose labels for the bins.
df['numbers'] = pd.cut(df['numbers'], bins=[17, 24, 44, 65, 92], labels=['18-24', '25-44', '45-65', '66-92'])
df['cost'] = pd.cut(df['cost'], bins=[0, 999.99, df['cost'].max()], labels=['<1000', '=>1000'])
print(df)
>>> df
     cost numbers
0   <1000   18-24
1   <1000   25-44
2  =>1000   45-65
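The 999.99 edge is only approximate: a cost between 999.99 and 1000 would be labelled '=>1000', and a cost of exactly 0 would fall outside the bins. A possible variant, assuming costs are never negative, uses right=False so that each bin includes its left edge, with np.inf as an open upper bound. The fourth row (cost 1000, number 66) is a made-up test value, not from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'cost': [360.0, 120.0, 2000.0, 1000.0],   # last value is hypothetical
                   'numbers': [23.0, 35.0, 49.0, 66.0]})     # last value is hypothetical

# right=False gives the bins [0, 1000) and [1000, inf), so exactly 1000 counts as '=>1000'
df['cost'] = pd.cut(df['cost'], bins=[0, 1000, np.inf], right=False,
                    labels=['<1000', '=>1000'])
df['numbers'] = pd.cut(df['numbers'], bins=[17, 24, 44, 65, 92],
                       labels=['18-24', '25-44', '45-65', '66-92'])
print(df)
```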

Plotting as a group using Panda and Matplotlib

I want to plot as a group using pandas and Matplotlib. The plot would look like this kind of grouping:
Now let's assume I have a data file example.csv:
first,second,third,fourth,fifth,sixth
-42,11,3,La_c-,D
-42,21,2,La_c-,D0
-42,31,2,La_c-,D
-42,122,3,La_c-,L
print(df.head()) of the above is:
   first  second  third fourth fifth  sixth
0    -42      11      3  La_c-     D    NaN
1    -42      21      2  La_c-    D0    NaN
2    -42      31      2  La_c-     D    NaN
3    -42     122      3  La_c-     L    NaN
In my case, on the x-axis, each group will consist of (first and the second column), just like in the above plot they have pies_2018,pies_2019,pies_2020.
To do that, I have tried to plot a single column first:
#!/usr/bin/env python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#from scipy import stats
#import ast
filename = 'example.csv'
df = pd.read_csv(filename)
print(df.head())
df.plot(kind='bar', x=df.columns[1],y=df.columns[2],figsize=(12, 4))
plt.gcf().subplots_adjust(bottom=0.35)
I get a plot like this:
Now the problem is when I want to make a group I get the following error:
raise ValueError("x must be a label or position")
ValueError: x must be a label or position
The thing is that I was considering the numbers as a label.
The code I used:
#!/usr/bin/env python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#from scipy import stats
#import ast
filename = 'example.csv'
df = pd.read_csv(filename)
print(df.head())
df.plot(kind='bar', x=["first", "second"],y="third",figsize=(12, 4))
plt.gcf().subplots_adjust(bottom=0.35)
plt.xticks(rotation=90)
If I can plot the first and second as a group, in addition to the legends, I will want to mention the fifth column in the "first" bar and the sixth column in the "second" bar.
Try this. You can play around but this gives you the stacked bars in groups.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
first = [-42, -42, -42, -42] #Use your column df['first']
second = [11, 21, 31, 122] #Use your column df['second']
third = [3, 2, 2, 3]
x = np.arange(len(third))
width = 0.25 #bar width
fig, ax = plt.subplots()
bar1 = ax.bar(x, third, width, label='first', color='blue')
bar2 = ax.bar(x + width, third, width, label='second', color='green')
ax.set_ylabel('third')
ax.set_xticks(x)
rects = ax.patches
labels = [str(i) for i in zip(first, second)] #You could use the columns df['first'] instead of the lists
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width() / 2, height, label,
            ha='center', va='bottom')
ax.legend()
EDITED & NEW Plot -
using ax.patches you can achieve it.
df:
     a   b   c   d
a1  66  92  98  17
a2  83  57  86  97
a3  96  47  73  32
ax = df.T.plot(width=0.8, kind='bar',y=df.columns,figsize=(10,5))
for p in ax.patches:
    ax.annotate(str(round(p.get_height(),2)), (p.get_x() * 1.005, p.get_height() * 1.005),color='green')
ax.axes.get_yaxis().set_ticks([])
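The ValueError in the question arises because the x argument of df.plot has to be a single column label. One way around it, sketched here under the assumption that a combined text label per bar is acceptable, is to merge the two columns into one before plotting. The column name `group` is introduced for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Same values as example.csv, built inline so the sketch is self-contained
df = pd.DataFrame({'first': [-42, -42, -42, -42],
                   'second': [11, 21, 31, 122],
                   'third': [3, 2, 2, 3],
                   'fifth': ['D', 'D0', 'D', 'L']})

# x must be one column, so combine 'first' and 'second' into a single label
df['group'] = df['first'].astype(str) + ', ' + df['second'].astype(str)
ax = df.plot(kind='bar', x='group', y='third', figsize=(12, 4), rot=90)
plt.gcf().subplots_adjust(bottom=0.35)
```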

Lines not showing up on Matplotlib graph

I am trying to plot three lines on the same plot in Matplotlib. They are InvoicesThisYear, DisputesThisYear, and PercentThisYear (Which is Disputes/Invoices)
The original input is two columns of dates -- one for the date of a logged dispute and one for the date of a logged invoice.
I use the dates to count up the number of disputes and invoices per month during a certain year.
Then I try to graph it, but it comes up empty. I started with just trying to print PercentThisYear and InvoicesThisYear.
PercentThisYear = (DisputesFYThisYear/InvoicesFYThisYear).fillna(0.0)
#Percent_ThisYear.plot(kind = 'line')
#InvoicesFYThisYear.plot(kind = 'line')
plt.plot(PercentThisYear)
plt.xlabel('Date')
plt.ylabel('Percent')
plt.title('Customer Disputes')
# Remove the plot frame lines. They are unnecessary chartjunk.
ax = plt.subplot(111)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax2 = ax.twinx()
ax2.plot(InvoicesFYThisYear)
# Ensure that the axis ticks only show up on the bottom and left of the plot.
# Ticks on the right and top of the plot are generally unnecessary chartjunk.
ax.get_xaxis().tick_bottom()
#ax.get_yaxis().tick_left()
# Limit the range of the plot to only where the data is.
# Avoid unnecessary whitespace.
datenow = datetime.datetime.now()
dstart = datetime.datetime(2015,4,1)
print(datenow)
#plt.ylim(0, .14)
plt.xlim(dstart, datenow)
firsts=[]
for i in range(dstart.month, datenow.month+1):
    firsts.append(datetime.datetime(2015,i,1))
plt.xticks(firsts)
plt.show()
This is the output... The dates are all messed up and nothing is plotted. But the scales on the axes look right. What am I doing wrong?
Here is the set up leading up to the graph if that is helpful
The Input looks like this:
InvoicesThisYear
Out[82]:
7 7529
5 5511
6 4934
8 3552
dtype: int64
DisputesThisYear
Out[83]:
2 211
1 98
7 54
4 43
3 32
6 29
5 21
8 8
dtype: int64
PercentThisYear
Out[84]:
1 0.000000
2 0.000000
3 0.000000
4 0.000000
5 0.003810
6 0.005877
7 0.007172
8 0.002252
dtype: float64
Matplotlib has no way of knowing which dates are associated with which data points. When you call plot with only one argument y, Matplotlib automatically assumes that the x-values are range(len(y)). You need to supply the dates as the first argument to plot. Assuming that InvoicesThisYear is a count of the number of invoices each month, starting at 1 and ending at 8, you could do something like
import datetime
import matplotlib.pyplot as plt
import pandas as pd
InvoicesFYThisYear = pd.DataFrame([0, 0, 0, 0, 5511, 4934, 7529, 3552])
Disputes = pd.DataFrame([98, 211, 32, 43, 21, 29, 54, 8])
PercentThisYear = (Disputes / InvoicesFYThisYear)
datenow = datetime.date.today()
ax = plt.subplot(111)
dates = [datetime.date(2015, i, 1) for i in range(1, 9)]
plt.plot(dates, PercentThisYear)
ax2 = ax.twinx()
ax2.plot(dates, InvoicesFYThisYear)
dstart = datetime.datetime(2015,4,1)
plt.xlim(dstart, datenow)
plt.xticks(dates, dates)
plt.show()
If your data is in a Pandas series and the index is an integer representing the month, all you have to do is change the index to datetime objects instead. The plot method for pandas.Series will handle things automatically from there. Here's how you might do that:
Invoices = pd.Series((211, 98, 54, 43, 32, 29, 21, 8), index = (2, 1, 7, 4, 3, 6, 5, 8))
dates = [datetime.date(2015, month, 1) for month in Invoices.index]
Invoices.index = dates
Invoices.plot()
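Putting the pieces together with the counts shown in the question, a sketch might look like the following. It assumes the integer index is the month of 2015, and it sorts the index first because the counts are not in month order. The lowercase variable names are introduced here.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Counts per month as printed in the question (index = month number)
invoices = pd.Series([7529, 5511, 4934, 3552], index=[7, 5, 6, 8])
disputes = pd.Series([211, 98, 54, 43, 32, 29, 21, 8], index=[2, 1, 7, 4, 3, 6, 5, 8])

# Division aligns on the month index; months without invoices give NaN, filled with 0
percent = (disputes / invoices).fillna(0.0).sort_index()

# Replace month numbers with real dates (assumption: all months are in 2015)
percent.index = pd.to_datetime([f'2015-{m:02d}-01' for m in percent.index])

fig, ax = plt.subplots()
ax.plot(percent.index, percent.to_numpy(), marker='o')  # dates passed explicitly as x
ax.set_xlabel('Date')
ax.set_ylabel('Percent')
```

Passing the dates as the x values means the axis limits set with plt.xlim and the plotted data are in the same units.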
