pandas DataFrame plot - impossible to set xtick intervals for timedelta values - python

I am trying to specify the x-axis interval when plotting DataFrames. I have several data files like,
0:0:0 29
0:5:0 85
0:10:0 141
0:15:0 198
0:20:0 251
0:25:0 308
0:30:0 363
0:35:0 413
The first column is the time in %H:%M:%S format, but the hours run past 24 (up to 48 hours).
When I read the file as below and plot it, it looks fine, but I want to set the xticks interval to 8 hours.
df0 = pd.read_csv(fil, names=['Time', 'Count'], delim_whitespace=True, parse_dates=['Time'])
df0 = df0.set_index('Time')
ax = matplotlib.pyplot.gca()
mkfunc = lambda x, pos: '%1.1fM' % (x * 1e-6) if x >= 1e6 else '%1.1fK' % (x * 1e-3) if x >= 1e3 else '%1.1f' % x
mkformatter = matplotlib.ticker.FuncFormatter(mkfunc)
ax.yaxis.set_major_formatter(mkformatter)
ax.xaxis.set_major_locator(mdates.HourLocator(interval=8))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H'))
df0.plot(ax=ax, x_compat=True, color='blue')
plt.grid()
plt.savefig('figure2.pdf',dpi=300, bbox_inches = "tight")
I tried the above method as specified by many answers here but that resulted in the following warning,
Locator attempting to generate 1874 ticks ([-28.208333333333332, ..., 596.125]), which exceeds Locator.MAXTICKS (1000).
The figure also displayed many vertical lines.
I tried converting my time column specifically to timedelta and it still did not help.
I converted to timedelta as below.
custom_date_parser = lambda x: pd.to_timedelta(x.split('.')[0])
df0 = pd.read_csv(fil, names=['Time', 'Count'], delim_whitespace=True, parse_dates=['Time'], date_parser=custom_date_parser)
Could you please help me to identify the issue and set the xticks interval correctly?

The problem here is that a) matplotlib/pandas don't have much support for timedelta objects and b) you cannot use the HourLocator with your data because after conversion to a datetime object, your axis would be labelled 0, 8, 16, 0, 8, 16...
Instead, we can convert the timedelta imported by your converter into hours and plot the numerical values:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter, MultipleLocator
import numpy as np
custom_date_parser = lambda x: pd.to_timedelta(x.split('.')[0])
df0 = pd.read_csv("test.txt", names=['Time', 'Count'], delim_whitespace=True, parse_dates=['Time'], date_parser=custom_date_parser)
#conversion into numerical hour value
df0["Time"] /= np.timedelta64(1, "h")
df0 = df0.set_index('Time')
ax = plt.gca()
df0.plot(ax=ax, x_compat=True, color='blue')
mkfunc = lambda x, pos: '%1.1fM' % (x * 1e-6) if x >= 1e6 else '%1.1fK' % (x * 1e-3) if x >= 1e3 else '%1.1f' % x
mkformatter = FuncFormatter(mkfunc)
ax.yaxis.set_major_formatter(mkformatter)
#set locator at regular hour intervals
ax.xaxis.set_major_locator(MultipleLocator(8))
ax.set_xlabel("Time (in h)")
plt.grid()
plt.show()
Sample output:
If for reasons unknown you actually need datetime objects, you can convert your timedelta values using an arbitrary offset, as you intend to ignore the day value:
df0["Time"] += pd.to_datetime("2000-01-01 00:00:00 UTC")
But I doubt this will be of advantage in your case.
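To see why hour-of-day labels would repeat after such a conversion, here is a minimal sketch. The elapsed values (0 to 32 hours) and the variable names are illustrative assumptions, not taken from the question's files:

```python
import pandas as pd

# Arbitrary offset date, as in the line above (the day itself is irrelevant).
offset = pd.Timestamp("2000-01-01")

# Illustrative elapsed times at 8-hour spacing, running past 24 hours.
elapsed = pd.to_timedelta([0, 8, 16, 24, 32], unit="h")

# Adding the offset yields real datetimes; "%H" only shows the hour of day,
# so the labels wrap around once the elapsed time exceeds 24 hours.
stamps = offset + elapsed
labels = stamps.strftime("%H").tolist()
print(labels)
```

The printed labels start over after the third tick, which is the 0, 8, 16, 0, 8, ... pattern described above. Plotting the numerical hour values avoids this.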
As an aside - for debugging, it is useful not to use regularly spaced test data. In your example, you probably did not notice that the graph was plotted against the index (0, 1, 2...) and then relabeled with strings, imitating regularly spaced datetime objects. The following test data immediately reveal the problem.
0:0:0 29
0:5:0 85
0:10:0 141
3:15:0 98
5:20:0 251
17:25:0 308
27:30:0 63
35:35:0 413
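As a rough check, the irregular test data above can be converted to numerical hours without any plotting. This sketch reads the data from an in-memory string instead of a file and uses `sep=r"\s+"`; the `Hours` column name is an assumption made for illustration:

```python
import io

import pandas as pd

# The irregularly spaced test data from above, held in a string for the demo.
raw = """0:0:0 29
0:5:0 85
0:10:0 141
3:15:0 98
5:20:0 251
17:25:0 308
27:30:0 63
35:35:0 413
"""

df = pd.read_csv(io.StringIO(raw), names=["Time", "Count"], sep=r"\s+")

# Parse the H:M:S strings as timedeltas, then express them as float hours.
df["Hours"] = pd.to_timedelta(df["Time"]) / pd.Timedelta(hours=1)

# Differences between consecutive rows: unequal steps show the data are
# not regularly spaced, unlike the row positions 0, 1, 2, ...
steps = df["Hours"].diff().dropna()
print(df[["Hours", "Count"]])
print(steps.round(2).tolist())
```

If a plot of these data shows evenly spaced points, it was drawn against the row position rather than the actual time values.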

Related

Month, Year with Value Plot, Pandas and MatPlotLib

I am trying to plot a time graph with month and year combined for my x and values for y. Python is reading my excel data with decimal points so won't allow to convert to %m %Y. Any ideas?
MY EXCEL DATA
How python reads my data
0 3.0-2015.0
1 5.0-2015.0
3 6.0-2017.0
...
68 nan-nan
69 nan-nan
70 nan-nan
71 nan-nan
# Code
import plotly
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import pandas as pd
import math
# Set Directory
workbook1 = 'GAP Insurance - 1.xlsx'
workbook2 = 'GAP Insurance - 2.xlsx'
workbook3 = 'GAP Insurance - 3.xlsx'
df = pd.read_excel(workbook1, 'Sheet1',)
# Set x axis
df['Time'] = (df['Month']).astype(str)+ '-' + (df['Year']).astype(str)
df['Time'] = pd.to_datetime(df['Time'], format='%m-%Y').dt.strftime('%m-%Y')
You could try converting to "int" before converting to "str" in this line:
df['Time'] = (df['Month']).astype(str)+ '-' + (df['Year']).astype(str)
This should ensure that what gets stored does not include decimal points.
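A minimal sketch of that suggestion follows. The small DataFrame is a stand-in for the Excel sheet, which is not available here. Dropping the empty rows first is an added assumption, because `astype(int)` cannot convert NaN:

```python
import pandas as pd

# Stand-in for the Excel data: floats with a trailing empty (NaN) row.
df = pd.DataFrame(
    {
        "Month": [3.0, 5.0, 6.0, None],
        "Year": [2015.0, 2015.0, 2017.0, None],
    }
)

# Rows without a month or year show up as "nan-nan"; remove them first.
df = df.dropna(subset=["Month", "Year"]).copy()

# int before str, so "3.0" becomes "3" and "2015.0" becomes "2015".
df["Time"] = df["Month"].astype(int).astype(str) + "-" + df["Year"].astype(int).astype(str)

# The cleaned strings now match the month-year format.
df["Date"] = pd.to_datetime(df["Time"], format="%m-%Y")
print(df[["Time", "Date"]])
```

Keeping the parsed `Date` column as a datetime, rather than converting it back to a string with `strftime`, should let the plotting library order the x-axis chronologically.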

Plotly: How to plot a range with a line in the center using a datetime index?

I would like to plot a line with a range around it, like on this photo:
I posted an original question, but didn't specify the index being a datetime index. I thought it wouldn't be important, but I was wrong.
There is an answer that covers it with a numerical index:
Plotly: How to make a figure with multiple lines and shaded area for standard deviations?
and documentation here:
https://plotly.com/python/continuous-error-bars/
but the issue of datetime index is not covered.
Here is some test data:
timestamp price min mean max
1596267946298 100.0 100 100.5 101
1596267946299 101.0 100 100.5 101
1596267946300 102.0 98 99.5 102
1596267948301 99.0 98 99.5 102
1596267948302 98.0 98 99.5 102
1596267949303 99.0 98 99.5 102
where I'd like the band to cover from min to max and the mean to be drawn in the center.
Another option is to take the code from the first answer to the question linked above (Plotly: How to make a figure with multiple lines and shaded area for standard deviations?) and change the data generation to:
index = pd.date_range('1/1/2000', periods=25, freq='T')
df = pd.DataFrame(dict(A=np.random.uniform(low=-1, high=2, size=25).tolist(),
B=np.random.uniform(low=-4, high=3, size=25).tolist(),
C=np.random.uniform(low=-1, high=3, size=25).tolist()),
index=index)
This works the same way but creates a datetime index.
Compared to the setup in the linked question, what causes trouble is the fact that x+x[::-1] doesn't work very well with a datetime index. But if you set x=df.index in:
# add line and shaded area for each series and standard deviation
for i, col in enumerate(df):
    new_col = next(line_color)
    # x = list(df.index.values+1)
    x = df.index
And then replace x+x[::-1] with x=x.append(x[::-1]):
    # standard deviation area
    fig.add_traces(go.Scatter(
        #x+x[::-1],
        x=x.append(x[::-1]),
Then things should work out perfectly well.
Plot:
Complete code:
# imports
import plotly.graph_objs as go
import plotly.express as px
import pandas as pd
import numpy as np
# sample data in a pandas dataframe
np.random.seed(1)
df=pd.DataFrame(dict(A=np.random.uniform(low=-1, high=2, size=25).tolist(),
B=np.random.uniform(low=-4, high=3, size=25).tolist(),
C=np.random.uniform(low=-1, high=3, size=25).tolist(),
))
df = df.cumsum()
# set daterange as index
df['dates'] = pd.date_range('2020', freq='D', periods=len(df))
df.set_index('dates', inplace=True)
# ---
# define colors as a list
colors = px.colors.qualitative.Plotly
# convert plotly hex colors to rgba to enable transparency adjustments
def hex_rgba(hex, transparency):
    col_hex = hex.lstrip('#')
    col_rgb = list(int(col_hex[i:i+2], 16) for i in (0, 2, 4))
    col_rgb.extend([transparency])
    areacol = tuple(col_rgb)
    return areacol
rgba = [hex_rgba(c, transparency=0.2) for c in colors]
colCycle = ['rgba'+str(elem) for elem in rgba]
# Make sure the colors run in cycles if there are more lines than colors
def next_col(cols):
    while True:
        for col in cols:
            yield col
line_color=next_col(cols=colCycle)
# plotly figure
fig = go.Figure()
# add line and shaded area for each series and standard deviation
for i, col in enumerate(df):
    new_col = next(line_color)
    x = df.index
    y1 = df[col]
    y1_upper = [(y + np.std(df[col])) for y in df[col]]
    y1_lower = [(y - np.std(df[col])) for y in df[col]]
    y1_lower = y1_lower[::-1]
    # standard deviation area
    fig.add_traces(go.Scatter(
        #x+x[::-1],
        x=x.append(x[::-1]),
        y=y1_upper+y1_lower,
        fill='tozerox',
        fillcolor=new_col,
        line=dict(color='rgba(255,255,255,0)'),
        showlegend=False,
        name=col))
    # line trace
    fig.add_traces(go.Scatter(x=df.index,
                              y=y1,
                              line=dict(color=new_col, width=2.5),
                              mode='lines',
                              name=col)
                   )
fig.update_layout(xaxis=dict(range=[df.index[1],df.index[-1]]))
fig.show()
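For the min/mean/max columns in the question, the same `append` trick can be sketched without any standard deviation. The example below only builds the band outline from the first five rows of the question's test data and assumes the timestamps are epoch milliseconds. The names `band_x` and `band_y` are illustrative:

```python
import pandas as pd

# First five rows of the question's test data.
df = pd.DataFrame(
    {
        "timestamp": [1596267946298, 1596267946299, 1596267946300,
                      1596267948301, 1596267948302],
        "min": [100, 100, 98, 98, 98],
        "mean": [100.5, 100.5, 99.5, 99.5, 99.5],
        "max": [101, 101, 102, 102, 102],
    }
)

# Assumption: the integers are milliseconds since the Unix epoch.
df.index = pd.to_datetime(df.pop("timestamp"), unit="ms")

# Walk forward along the upper edge, then backward along the lower edge.
x = df.index
band_x = x.append(x[::-1])
band_y = df["max"].tolist() + df["min"].tolist()[::-1]

print(len(band_x), band_y)
```

`band_x` and `band_y` could then be passed as `x` and `y` to a filled `go.Scatter` trace like the one above, with `df["mean"]` drawn as a separate line trace on top.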

pandas calculate delta time

Here's some code that will generate some random data and chart it, plus lines representing the 30th and 90th percentiles.
import pandas as pd
import numpy as np
from numpy.random import randint
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(10) # added for reproducibility
rng = pd.date_range('10/9/2018 00:00', periods=10, freq='1H')
df = pd.DataFrame({'Random_Number':randint(1, 100, 10)}, index=rng)
df.plot()
plt.axhline(df.quantile(0.3)[0], linestyle="--", color="g")
plt.axhline(df.quantile(0.90)[0], linestyle="--", color="r")
plt.show()
Outputs: (minus the highlighted part of the chart)
I'm trying to figure out if it's possible to calculate the time it takes in the data to get from the green line to the red line (highlighted in yellow).
I can manually enter in the data:
minStart = df.loc[df['Random_Number'] < 18].index[0]
maxStart = df.loc[df['Random_Number'] > 90].index[0]
hours = maxStart - minStart
hours
Which will output:
Timedelta('0 days 05:00:00')
But if I attempt to use:
minStart = df.loc[df['Random_Number'] < df.quantile(0.3)].index[0]
maxStart = df.loc[df['Random_Number'] > df.quantile(0.90)].index[0]
hours = maxStart - minStart
hours
This will throw a ValueError: Can only compare identically-labeled Series objects
Would there be a better method to the madness? Ideally I would create some sort of algorithm that calculates the time delta it takes to go from the 30th to the 90th percentile and then the delta back from the 90th to the 30th, but I may have to put some thought into how that could be accomplished.
minStart = df.loc[df['Random_Number'] < df.quantile(0.3)[0]].index[0]
maxStart = df.loc[df['Random_Number'] > df.quantile(0.90)[0]].index[0]
hours = maxStart - minStart
hours
df.quantile called on a DataFrame returns a Series rather than a number, so you need to take its first entry.
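An alternative sketch is to call `quantile` on the column itself, which returns a plain scalar and needs no indexing. The fixed values below replace the random data so the result is reproducible. They are illustrative, not the numbers from the question:

```python
import pandas as pd

# Hourly index as in the question; values are fixed for reproducibility.
rng = pd.date_range("2018-10-09 00:00", periods=10, freq=pd.Timedelta(hours=1))
df = pd.DataFrame(
    {"Random_Number": [50, 10, 20, 40, 60, 80, 95, 70, 30, 15]}, index=rng
)

# Series.quantile with a single q returns a scalar, not a Series.
s = df["Random_Number"]
low = s.quantile(0.3)
high = s.quantile(0.90)

# First timestamp below the 30th percentile and first above the 90th.
min_start = s[s < low].index[0]
max_start = s[s > high].index[0]
elapsed = max_start - min_start
print(low, high, elapsed)
```

As in the question's manual version, this compares the first reading below the lower line with the first reading above the upper line. It does not check that one comes before the other.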

Averaging several time-series together with confidence interval (with test code)

Sounds very complicated but a simple plot will make it easy to understand:
I have three curves of cumulative sum of some values over time, which are the blue lines.
I want to average (or somehow combine in a statistically correct way) the three curves into one smooth curve and add confidence interval.
I tried one simple solution - combining all the data into one curve, average it with the "rolling" function in pandas, getting the standard deviation for it. I plotted those as the purple curve with the confidence interval around it.
The problem with my real data, as illustrated in the plot above, is that the curve isn't smooth at all. There are also sharp jumps in the confidence interval, which isn't a good representation of the 3 separate curves because they have no jumps.
Is there a better way to represent the 3 different curves in one smooth curve with a nice confidence interval?
I supply test code, tested on python 3.5.1 with numpy and pandas (don't change the seed in order to get the same curves).
There are some constraints: increasing the number of points for the "rolling" function isn't a solution for me because some of my data is too short for that.
Test code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
np.random.seed(seed=42)
## data generation - cumulative analysis over time
df1_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df1_values = pd.DataFrame(np.random.randint(0,10000,size=100), columns=['vals'])
df1_combined_sorted = pd.concat([df1_time, df1_values], axis = 1).sort_values(by=['time'])
df1_combined_sorted_cumulative = np.cumsum(df1_combined_sorted['vals'])
df2_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df2_values = pd.DataFrame(np.random.randint(1000,13000,size=100), columns=['vals'])
df2_combined_sorted = pd.concat([df2_time, df2_values], axis = 1).sort_values(by=['time'])
df2_combined_sorted_cumulative = np.cumsum(df2_combined_sorted['vals'])
df3_time = pd.DataFrame(np.random.uniform(0,1000,size=50), columns=['time'])
df3_values = pd.DataFrame(np.random.randint(0,4000,size=100), columns=['vals'])
df3_combined_sorted = pd.concat([df3_time, df3_values], axis = 1).sort_values(by=['time'])
df3_combined_sorted_cumulative = np.cumsum(df3_combined_sorted['vals'])
## combining the three curves
df_all_vals_cumulative = pd.concat([df1_combined_sorted_cumulative,
    df2_combined_sorted_cumulative, df3_combined_sorted_cumulative]).reset_index(drop=True)
df_all_time = pd.concat([df1_combined_sorted['time'],
df2_combined_sorted['time'], df3_combined_sorted['time']]).reset_index(drop=True)
df_all = pd.concat([df_all_time, df_all_vals_cumulative], axis = 1)
## creating confidence intervals
df_all_sorted = df_all.sort_values(by=['time'])
ma = df_all_sorted.rolling(10).mean()
mstd = df_all_sorted.rolling(10).std()
## plotting
plt.fill_between(df_all_sorted['time'], ma['vals'] - 2 * mstd['vals'],
ma['vals'] + 2 * mstd['vals'],color='b', alpha=0.2)
plt.plot(df_all_sorted['time'],ma['vals'], c='purple')
plt.plot(df1_combined_sorted['time'], df1_combined_sorted_cumulative, c='blue')
plt.plot(df2_combined_sorted['time'], df2_combined_sorted_cumulative, c='blue')
plt.plot(df3_combined_sorted['time'], df3_combined_sorted_cumulative, c='blue')
matplotlib.use('Agg')
plt.show()
First of all, your sample code could be rewritten to make better use of pandas. For example:
np.random.seed(seed=42)
## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0,max_time,size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0,max_val,size=100), columns=['vals'])
    df = pd.concat([times, vals], axis = 1).sort_values(by=['time']).\
        reset_index().drop('index', axis=1)
    df['cumulative'] = df.vals.cumsum()
    return df
# generate the dataframes
df1,df2,df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)
# join
df_all = pd.concat(dfs, ignore_index=True).sort_values(by=['time'])
# render function
def render(window=10):
    # compute rolling means and confidence intervals
    mean_val = df_all.cumulative.rolling(window).mean()
    std_val = df_all.cumulative.rolling(window).std()
    min_val = mean_val - 2*std_val
    max_val = mean_val + 2*std_val
    plt.figure(figsize=(16,9))
    for df in dfs:
        plt.plot(df.time, df.cumulative, c='blue')
    plt.plot(df_all.time, mean_val, c='r')
    plt.fill_between(df_all.time, min_val, max_val, color='blue', alpha=.2)
    plt.show()
Your curves probably aren't smooth because the rolling window is not large enough. You can increase the window size to get smoother graphs. For example, render(20) gives:
while render(30) gives:
A better way, though, might be to interpolate each df['cumulative'] series over the entire time window and compute the mean and confidence interval across these series. With that in mind, we can modify the code as follows:
np.random.seed(seed=42)
## data generation - cumulative analysis over time
def get_data(max_val, max_time=1000):
    times = pd.DataFrame(np.random.uniform(0,max_time,size=50), columns=['time'])
    vals = pd.DataFrame(np.random.randint(0,max_val,size=100), columns=['vals'])
    # note that we set time as index of the returned data
    df = pd.concat([times, vals], axis = 1).dropna().set_index('time').sort_index()
    df['cumulative'] = df.vals.cumsum()
    return df
df1,df2,df3 = (df for df in map(get_data, [10000, 13000, 4000]))
dfs = (df1, df2, df3)
# rename column for later plotting
for i,df in zip(range(3),dfs):
    df.rename(columns={'cumulative':f'cumulative_{i}'}, inplace=True)
# concatenate the dataframes with common time index
df_all = pd.concat(dfs,sort=False).sort_index()
# interpolate each cumulative column linearly
df_all.interpolate(inplace=True)
# plot graphs
mean_val = df_all.iloc[:,1:].mean(axis=1)
std_val = df_all.iloc[:,1:].std(axis=1)
min_val = mean_val - 2*std_val
max_val = mean_val + 2*std_val
fig, ax = plt.subplots(1,1,figsize=(16,9))
df_all.iloc[:,1:4].plot(ax=ax)
plt.plot(df_all.index, mean_val, c='purple')
plt.fill_between(df_all.index, min_val, max_val, color='blue', alpha=.2)
plt.show()
and we get:
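The interpolate-then-aggregate idea can also be sketched with `np.interp` on a shared time grid. This swaps in a different technique from the `DataFrame.interpolate` call above. The two tiny curves and the grid are made-up illustration data:

```python
import numpy as np

# Two cumulative curves sampled at different, irregular times.
t1 = np.array([0.0, 10.0, 20.0])
c1 = np.array([0.0, 100.0, 200.0])
t2 = np.array([0.0, 5.0, 20.0])
c2 = np.array([0.0, 100.0, 400.0])

# Common grid covering the shared time window.
grid = np.linspace(0.0, 20.0, 5)

# Linearly interpolate every curve onto the grid, one row per curve.
curves = np.vstack([np.interp(grid, t1, c1), np.interp(grid, t2, c2)])

# Aggregate across curves at each grid point.
mean_val = curves.mean(axis=0)
std_val = curves.std(axis=0, ddof=1)
min_val = mean_val - 2 * std_val
max_val = mean_val + 2 * std_val
print(mean_val)
```

Because every curve contributes one value at every grid point, the mean and the band change gradually. A rolling window over the pooled points instead mixes values from different curves and can jump between them.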

ggplot multiple plots in one object

I've created a script to create multiple plots in one object. The results I am looking for are two plots one over the other such that each plot has different y axis scale but x axis is fixed - dates. However, only one of the plots (the top) is properly created, the bottom plot is visible but empty i.e the geom_line is not visible. Furthermore, the y-axis of the second plot does not match the range of values - min to max. I also tried using facet_grid (scales="free") but no change in the y-axis. The y-axis for the second graph has a range of 0 to 0.05.
I've limited the date range to the past few weeks. This is the code I am using:
df = df.set_index('date')
weekly = df.resample('w-mon',label='left',closed='left').sum()
data = weekly[-4:].reset_index()
data= pd.melt(data, id_vars=['date'])
pplot = ggplot(aes(x="date", y="value", color="variable", group="variable"), data)
#geom_line()
scale_x_date(labels = date_format('%d.%m'),
limits=(data.date.min() - dt.timedelta(2),
data.date.max() + dt.timedelta(2)))
#facet_grid("variable", scales="free_y")
theme_bw()
The dataframe sample (df): it's a daily dataset containing values for each variable x and a; in this case 'date' is the index:
date x a
2016-08-01 100 20
2016-08-02 50 0
2016-08-03 24 18
2016-08-04 0 10
The dataframe sample (to_plot) - weekly overview:
date variable value
0 2016-08-01 x 200
1 2016-08-08 x 211
2 2016-08-15 x 104
3 2016-08-22 x 332
4 2016-08-01 a 8
5 2016-08-08 a 15
6 2016-08-15 a 22
7 2016-08-22 a 6
Sorry for not adding the df dataframe before.
Your calls to the plot directives geom_line(), scale_x_date(), etc. are standing on their own in your script; you do not connect them to your plot object. Thus, they do not have any effect on your plot.
In order to apply a plot directive to an existing plot object, use the graphics language and "add" them to your plot object by connecting them with a + operator.
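Why a stand-alone call has no effect can be illustrated with a toy model. The `Plot` and `Layer` classes below are hypothetical stand-ins written for this sketch. They are not the ggplot library's classes or its actual implementation:

```python
class Layer:
    """Hypothetical stand-in for a plot directive such as geom_line()."""

    def __init__(self, name):
        self.name = name


class Plot:
    """Hypothetical plot object that collects layers through `+`."""

    def __init__(self, layers=()):
        self.layers = tuple(layers)

    def __add__(self, layer):
        # Return a new plot containing the extra layer.
        return Plot(self.layers + (layer,))


base = Plot()

# A directive on its own line is created and immediately discarded.
Layer("geom_line")
standalone = base

# Connecting directives with `+` attaches them to the plot object.
chained = base + Layer("geom_line") + Layer("facet_grid")

print(len(standalone.layers), [layer.name for layer in chained.layers])
```

The stand-alone version ends up with no layers at all, which is consistent with the empty panel described in the question.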
The result (as intended):
The full script:
from __future__ import print_function
import sys
import pandas as pd
import datetime as dt
from ggplot import *
if __name__ == '__main__':
    df = pd.DataFrame({
        'date': ['2016-08-01', '2016-08-08', '2016-08-15', '2016-08-22'],
        'x': [100, 50, 24, 0],
        'a': [20, 0, 18, 10]
    })
    df['date'] = pd.to_datetime(df['date'])
    data = pd.melt(df, id_vars=['date'])
    plt = ggplot(data, aes(x='date', y='value', color='variable', group='variable')) +\
        scale_x_date(
            labels=date_format('%y-%m-%d'),
            limits=(data.date.min() - dt.timedelta(2), data.date.max() + dt.timedelta(2))
        ) +\
        geom_line() +\
        facet_grid('variable', scales='free_y')
    plt.show()
