Timeseries plot from CSV data (Timestamp and events): x-label constant - python

(This question can be read alone, but is a sequel to: Timeseries from CSV data (Timestamp and events))
I would like to visualize CSV data (from 2 files) as shown below, by a timeseries representation, using python's pandas module (see links below).
Sample data of df1:
TIMESTAMP eventid
0 2017-03-20 02:38:24 1
1 2017-03-21 05:59:41 1
2 2017-03-23 12:59:58 1
3 2017-03-24 01:00:07 1
4 2017-03-27 03:00:13 1
The 'eventid' column always contains the value of 1, and I am trying to show the sum of events for each day in the dataset.
The 2nd dataset, df0, has similar structure but contains only zeros:
Sample data of df0:
TIMESTAMP eventid
0 2017-03-21 01:38:24 0
1 2017-03-21 03:59:41 0
2 2017-03-22 11:59:58 0
3 2017-03-24 01:03:07 0
4 2017-03-26 03:50:13 0
The x-axis label only shows the same date, and my question is: How can the different dates be shown? (What causes the same date to be shown multiple times on x labels?)
script so far:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
df1 = pd.read_csv('timestamp01.csv', parse_dates=True, index_col='TIMESTAMP')
df0 = pd.read_csv('timestamp00.csv', parse_dates=True, index_col='TIMESTAMP')
f, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(df0.resample('D').size())
ax1.set_xlim([pd.to_datetime('2017-01-27'), pd.to_datetime('2017-04-30')])
ax1.xaxis.set_major_formatter(ticker.FixedFormatter(df0.index.strftime('%Y-%m-%d')))
plt.setp(ax1.xaxis.get_majorticklabels(), rotation=15)
ax2.plot(df1.resample('D').size())
ax2.set_xlim([pd.to_datetime('2017-03-22'), pd.to_datetime('2017-04-29')])
ax2.xaxis.set_major_formatter(ticker.FixedFormatter(df1.index.strftime('%Y-%m-%d')))
plt.setp(ax2.xaxis.get_majorticklabels(), rotation=15)
plt.show()
Output: (https://www.dropbox.com/s/z21koflkzglm6c3/figure_1.png?dl=0)
Links I have tried to follow:
http://pandas.pydata.org/pandas-docs/stable/visualization.html
Multiple timeseries plots from Pandas Dataframe
Pandas timeseries plot setting x-axis major and minor ticks and labels
Any help is much appreciated.

Making the example reproducible, we can create the following text file (data/timestamp01.csv):
TIMESTAMP;eventid
2017-03-20 02:38:24;1
2017-03-21 05:59:41;1
2017-03-23 12:59:58;1
2017-03-24 01:00:07;1
2017-03-27 03:00:13;1
(same for data/timestamp00.csv). We can then read them in
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
df1 = pd.read_csv('data/timestamp01.csv', parse_dates=True, index_col='TIMESTAMP', sep=";")
df0 = pd.read_csv('data/timestamp00.csv', parse_dates=True, index_col='TIMESTAMP', sep=";")
Plotting them
f, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(df0.resample('D').size())
ax2.plot(df1.resample('D').size())
plt.setp(ax1.xaxis.get_majorticklabels(), rotation=30, ha="right")
plt.setp(ax2.xaxis.get_majorticklabels(), rotation=30, ha="right")
plt.show()
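results in the desired plot, with a different date on each tick.
The difference from the script in the question is that no ticker.FixedFormatter is set. FixedFormatter assigns its strings to the ticks by position (first string to the first tick, and so on), so passing the raw row timestamps from df.index labels the ticks with the dates of the first few rows of the file, regardless of where the ticks actually sit on the axis.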
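If an explicit date format on the ticks is still wanted, one option is a date-aware formatter instead of a fixed list of strings. The following is only a minimal sketch: the inline CSV text stands in for data/timestamp01.csv, the two-day tick spacing is an arbitrary choice, and the Agg backend line is an assumption so that the snippet also runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Inline stand-in for data/timestamp01.csv
csv_text = """TIMESTAMP;eventid
2017-03-20 02:38:24;1
2017-03-21 05:59:41;1
2017-03-23 12:59:58;1
2017-03-24 01:00:07;1
2017-03-27 03:00:13;1
"""
df1 = pd.read_csv(io.StringIO(csv_text), sep=";", parse_dates=True, index_col="TIMESTAMP")

# Number of events per calendar day (days without events count as 0)
daily = df1.resample("D").size()

fig, ax = plt.subplots()
ax.plot(daily.index, daily.values)

# The locator decides where the ticks go; the formatter turns each tick's
# own date value into text, so every label matches its tick position.
ax.xaxis.set_major_locator(mdates.DayLocator(interval=2))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
plt.setp(ax.xaxis.get_majorticklabels(), rotation=30, ha="right")
# call plt.show() here when working interactively
```

Unlike FixedFormatter, DateFormatter derives each label from the tick's value, so zooming or changing the limits keeps the labels correct.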

Related

How to create a Day/Hour heatmaps using Python [duplicate]

I need to generate a heat map with the days as columns and week_num as rows, coloured green for a positive day and red for a negative day.
There should be a visible break between each day and each week.
I have tried using the seaborn library but couldn't get this plot to work. Can anyone help me with this?
week_num day color_code
1 2020-05-01 red
1 2020-05-02 green
2 2020-05-05 red
2 2020-05-06 red
3 2020-05-13 green
3 2020-05-14 green
3 2020-05-15 red
I am guessing you mean the day of the week; otherwise it would be a really weird heatmap. You can try something like the code below: starting from a data frame like yours, add the day of the week as another column, pivot into a wide format, and plot. sns.heatmap does not take categorical values, so you need to replace them with 0 and 1 and label them accordingly on the colorbar:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
dates = pd.date_range(start='1/1/2018', periods=60, freq='1D')
color_code = np.random.choice(['green','red'],60)
df = pd.DataFrame({'dates':dates ,'color_code':color_code})
df['week_num'] = df['dates'].dt.strftime("%W")
df['day_num'] = df['dates'].dt.weekday
# encode the categories numerically, since sns.heatmap needs numbers
df['code'] = df['color_code'].map({'green':0,'red':1})
fig, ax = plt.subplots(1, 1, figsize = (5, 3))
# each (week, weekday) cell holds exactly one value, so 'first' just picks it
df_wide = df.pivot_table(index='week_num',columns='day_num',values='code',
                         aggfunc='first')
sns.heatmap(df_wide,cmap=["#2ecc71","#e74c3c"],
            linewidths=1.0,ax=ax)
colorbar = ax.collections[0].colorbar
colorbar.set_ticks([0.25,0.75])
colorbar.set_ticklabels(['green','red'])
plt.show()

How to merge two plots in Pandas?

I want to merge two plots. This is my dataframe:
df_inc.head()
id date real_exe_time mean mean+30% mean-30%
0 Jan 31 33.14 43.0 23.0
1 Jan 30 33.14 43.0 23.0
2 Jan 33 33.14 43.0 23.0
3 Jan 38 33.14 43.0 23.0
4 Jan 36 33.14 43.0 23.0
My first plot:
df_inc.plot.scatter(x = 'date', y = 'real_exe_time')
My second plot:
df_inc.plot(x='date', y=['mean','mean+30%','mean-30%'])
When I try to merge with:
fig=plt.figure()
ax = df_inc.plot(x='date', y=['mean','mean+30%','mean-30%']);
df_inc.plot.scatter(x = 'date', y = 'real_exe_time', ax=ax)
plt.show()
I got the following:
How can I merge them the right way?
You should not repeat your mean values as an extra column. df.plot() for categorical data will be plotted against the index - hence you will see the original scatter plot (also plotted against the index) squeezed into the left corner.
You could create instead an additional aggregation dataframe that you can plot then into the same graph:
import matplotlib.pyplot as plt
import pandas as pd
#test data generation
import numpy as np
n=30
np.random.seed(123)
df = pd.DataFrame({"date": np.random.choice(list("ABCDEF"), n), "real_exe_time": np.random.randint(1, 100, n)})
df = df.sort_values(by="date").reset_index(drop=True)
#aggregate data for plotting
df_agg = df.groupby("date")["real_exe_time"].agg(mean="mean").reset_index()
df_agg["mean+30%"] = df_agg["mean"] * 1.3
df_agg["mean-30%"] = df_agg["mean"] * 0.7
#plot both into the same subplot
ax = df.plot.scatter(x = 'date', y = 'real_exe_time')
df_agg.plot(x='date', y=['mean','mean+30%','mean-30%'], ax=ax)
plt.show()
Sample output:
You could also consider using seaborn that has, for instance, pointplots for categorical data aggregation.
I'm guessing that you haven't transformed the date column to datetime objects, so the first thing you should do is this:
#Transform the date to datetime object
df_inc['date']=pd.to_datetime(df_inc['date'],format='%b')
fig=plt.figure()
ax = df_inc.plot(x='date', y=['mean','mean+30%','mean-30%']);
df_inc.plot.scatter(x = 'date', y = 'real_exe_time', ax=ax)
plt.show()

Highlight time interval in multivariate time-series plot using matplotlib and seaborn

I want to annotate a plot of multivariate time-series with time intervals (in colour for each type of annotation).
data overview
An example dataset looks like this:
metrik_0 metrik_1 metrik_2 geospatial_id topology_id \
2020-01-01 -0.848009 1.305906 0.924208 12 4
2020-01-01 -0.516120 0.617011 0.623065 8 3
2020-01-01 0.762399 -0.359898 -0.905238 19 3
2020-01-01 0.708512 -1.502019 -2.677056 8 4
2020-01-01 0.249475 0.590983 -0.677694 11 3
cohort_id device_id
2020-01-01 1 1
2020-01-01 1 9
2020-01-01 2 13
2020-01-01 2 8
2020-01-01 1 12
The labels look like this:
cohort_id marker_type start end
0 1 a 2020-01-02 00:00:00 NaT
1 1 b 2020-01-04 05:00:00 2020-01-05 16:00:00
2 1 a 2020-01-06 00:00:00 NaT
desired result
multivariate plot of all the time-series of a cohort_id
highlighting for the markers (different color for each type)
notice the markers might overlay / transparency is useful
there will be attenuation around the marker type a (configured by the number of hours)
I thought about using seaborn/matplotlib for this task.
So far I have come up with:
%pylab inline
import seaborn as sns; sns.set()
import matplotlib.dates as mdates
aut_locator = mdates.AutoDateLocator(minticks=3, maxticks=7)
aut_formatter = mdates.ConciseDateFormatter(aut_locator)
g = df[df['cohort_id'] == 1].plot(figsize=(8,8))
g.xaxis.set_major_locator(aut_locator)
g.xaxis.set_major_formatter(aut_formatter)
plt.show()
which is rather chaotic.
I fear, it will not be possible to fit the metrics (multivariate data) into a single plot.
It should be facetted by each column.
However, this again would require reshaping the dataframe for seaborn FacetGrid to work, which also doesn't quite feel right, especially if the number of elements (time-series) in a cohort_id gets larger.
If FacetGrid is the right way, then something along the lines of: https://seaborn.pydata.org/examples/timeseries_facets.html would be the first part, but the labels would still be missing.
How could the labels be added?
How should the first part be accomplished?
An example of the desired result: https://imgur.com/9J1EcmI, i.e. one such plot for each metric value.
code for the example data
The datasets are generated from the code snippet below:
import pandas as pd
import numpy as np
import random
random_seed = 47
np.random.seed(random_seed)
random.seed(random_seed)
def generate_df_for_device(n_observations, n_metrics, device_id, geo_id, topology_id, cohort_id):
    df = pd.DataFrame(np.random.randn(n_observations,n_metrics), index=pd.date_range('2020', freq='H', periods=n_observations))
    df.columns = [f'metrik_{c}' for c in df.columns]
    df['geospatial_id'] = geo_id
    df['topology_id'] = topology_id
    df['cohort_id'] = cohort_id
    df['device_id'] = device_id
    return df
def generate_multi_device(n_observations, n_metrics, n_devices, cohort_levels, topo_levels):
    results = []
    for i in range(1, n_devices +1):
        #print(i)
        r = random.randrange(1, n_devices)
        cohort = random.randrange(1, cohort_levels)
        topo = random.randrange(1, topo_levels)
        df_single_dvice = generate_df_for_device(n_observations, n_metrics, i, r, topo, cohort)
        results.append(df_single_dvice)
        #print(r)
    return pd.concat(results)
# hourly data, 1 week of data
n_observations = 7 * 24
n_metrics = 3
n_devices = 20
cohort_levels = 3
topo_levels = 5
df = generate_multi_device(n_observations, n_metrics, n_devices, cohort_levels, topo_levels)
df = df.sort_index()
df.head()
marker_labels = pd.DataFrame({'cohort_id':[1,1, 1], 'marker_type':['a', 'b', 'a'], 'start':['2020-01-2', '2020-01-04 05', '2020-01-06'], 'end':[np.nan, '2020-01-05 16', np.nan]})
marker_labels['start'] = pd.to_datetime(marker_labels['start'])
marker_labels['end'] = pd.to_datetime(marker_labels['end'])
In general, you can use plt.fill_between for horizontal bands and plt.fill_betweenx for vertical bands. For "bands-within-bands" you can just call the method twice.
A basic example using your data would look like this. I've used fixed values for the position of the bands, but you can put them on the main dataframe and reference them dynamically inside the loop.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(3 ,figsize=(20, 9), sharex=True)
plt.subplots_adjust(hspace=0.2)
metriks = ["metrik_0", "metrik_1", "metrik_2"]
colors = ['#66c2a5', '#fc8d62', '#8da0cb'] #Set2 palette hexes
for i, metric in enumerate(metriks):
    df[[metric]].plot(ax=ax[i], color=colors[i], legend=None)
    ax[i].set_ylabel(metric)
    ax[i].fill_betweenx(y=[-3, 3], x1="2020-01-04 05:00:00",
                        x2="2020-01-05 16:00:00", color='gray', alpha=0.2)
    ax[i].fill_betweenx(y=[-3, 3], x1="2020-01-04 15:00:00",
                        x2="2020-01-05 00:00:00", color='gray', alpha=0.4)
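The bands can also be driven by the marker_labels table instead of fixed values. The following is a rough sketch, not a definitive implementation: it uses ax.axvspan (a swapped-in alternative to fill_betweenx that always spans the full axis height), a small synthetic frame in place of one cohort, and a hypothetical helper add_marker_bands whose pad_hours parameter is an assumption about how the window around markers without an end (type a) should be sized.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Small synthetic stand-in for the metrics of one cohort (one week, hourly)
idx = pd.date_range("2020-01-01", periods=7 * 24, freq="60min")
rng = np.random.default_rng(47)
data = pd.DataFrame(rng.standard_normal((len(idx), 3)), index=idx,
                    columns=["metrik_0", "metrik_1", "metrik_2"])

marker_labels = pd.DataFrame({
    "cohort_id": [1, 1, 1],
    "marker_type": ["a", "b", "a"],
    "start": pd.to_datetime(["2020-01-02 00:00", "2020-01-04 05:00", "2020-01-06 00:00"]),
    "end": pd.to_datetime([None, "2020-01-05 16:00", None]),
})


def add_marker_bands(ax, markers, colors, pad_hours=6):
    """Hypothetical helper: shade one vertical span per marker row.

    Rows without an end are drawn as start +/- pad_hours.
    Returns the number of spans drawn.
    """
    pad = pd.Timedelta(hours=pad_hours)
    n_drawn = 0
    for row in markers.itertuples(index=False):
        if pd.isna(row.end):
            lo, hi = row.start - pad, row.start + pad
        else:
            lo, hi = row.start, row.end
        # alpha < 1 keeps overlapping bands and the lines underneath visible
        ax.axvspan(lo, hi, color=colors[row.marker_type], alpha=0.3)
        n_drawn += 1
    return n_drawn


metrics = list(data.columns)
band_colors = {"a": "tab:orange", "b": "tab:gray"}
cohort_markers = marker_labels[marker_labels["cohort_id"] == 1]

fig, axes = plt.subplots(len(metrics), 1, figsize=(10, 6), sharex=True)
for ax, metric in zip(axes, metrics):
    ax.plot(data.index, data[metric], linewidth=0.8)
    ax.set_ylabel(metric)
    n_bands = add_marker_bands(ax, cohort_markers, band_colors)
# call plt.show() here when working interactively
```

Keeping one axis per metric avoids reshaping the frame for a FacetGrid, and the same helper can be reused for any cohort by filtering marker_labels differently.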

Wrong Dates in Dataframe and Subplots

I am trying to plot the data from my CSV file. Currently the dates are not shown properly in the plot, even though I am converting them. How can I change it to show the proper date format, defined as Y-m-d? My second question: I am currently plotting all the data in one plot, but I want one subplot for every Valuegroup.
My code looks like the following:
import pandas as pd
import matplotlib.pyplot as plt
csv_loader = pd.read_csv('C:/Test.csv', encoding='cp1252', sep=';', index_col=0).dropna()
csv_loader['Date'] = pd.to_datetime(csv_loader['Date'], format="%Y-%m-%d")
print(csv_loader)
fig, ax = plt.subplots()
csv_loader.groupby('Valuegroup').plot(x='Date', y='Value', ax=ax, legend=False, kind='line')
plt.grid(True)
The csv file looks like the following:
Calcgroup;Valuegroup;id;Date;Value
Group1;A;1;20080103;0.1
Group1;A;1;20080104;0.3
Group1;A;1;20080107;0.5
Group1;A;1;20080108;0.9
Group1;B;1;20080103;0.5
Group1;B;1;20080104;1.3
Group1;B;1;20080107;2.0
Group1;B;1;20080108;0.15
Group1;C;1;20080103;1.9
Group1;C;1;20080104;2.1
Group1;C;1;20080107;2.9
Group1;C;1;20080108;0.45
You can just tell pandas to parse that column as a datetime and it will just work:
In[151]:
import io
import pandas as pd
import matplotlib.pyplot as plt
t="""Calcgroup;Valuegroup;id;Date;Value
Group1;A;1;20080103;0.1
Group1;A;1;20080104;0.3
Group1;A;1;20080107;0.5
Group1;A;1;20080108;0.9
Group1;B;1;20080103;0.5
Group1;B;1;20080104;1.3
Group1;B;1;20080107;2.0
Group1;B;1;20080108;0.15
Group1;C;1;20080103;1.9
Group1;C;1;20080104;2.1
Group1;C;1;20080107;2.9
Group1;C;1;20080108;0.45"""
df = pd.read_csv(io.StringIO(t), parse_dates=['Date'], sep=';', index_col=0)
df
Out[151]:
Valuegroup id Date Value
Calcgroup
Group1 A 1 2008-01-03 0.10
Group1 A 1 2008-01-04 0.30
Group1 A 1 2008-01-07 0.50
Group1 A 1 2008-01-08 0.90
Group1 B 1 2008-01-03 0.50
Group1 B 1 2008-01-04 1.30
Group1 B 1 2008-01-07 2.00
Group1 B 1 2008-01-08 0.15
Group1 C 1 2008-01-03 1.90
Group1 C 1 2008-01-04 2.10
Group1 C 1 2008-01-07 2.90
Group1 C 1 2008-01-08 0.45
fig, ax = plt.subplots()
df.groupby('Valuegroup').plot(x='Date', y='Value', ax=ax, legend=False, kind='line')
plt.grid(True)
plt.show()
results in:
Besides your format string was incorrect anyway, it should be:
csv_loader['Date'] = pd.to_datetime(csv_loader['Date'], format="%Y%m%d")
however, this won't work as that column will have been loaded as int dtype so you would've needed to convert to string first:
csv_loader['Date'] = pd.to_datetime(csv_loader['Date'].astype(str), format="%Y%m%d")
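As a quick check of the conversion above, here is a minimal sketch on a hand-made Series (the variable names raw and parsed are only for illustration) that mimics a Date column loaded as integers:

```python
import pandas as pd

# Dates stored as integers, as read_csv would load the Date column by default
raw = pd.Series([20080103, 20080104, 20080107])

# Convert to strings first, then parse with the hyphen-free format
parsed = pd.to_datetime(raw.astype(str), format="%Y%m%d")
print(parsed)
```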
To format the dates on the x-axis you can use DateFormatter from matplotlib see related: Editing the date formatting of x-axis tick labels in matplotlib
from matplotlib.dates import DateFormatter
fig, ax = plt.subplots()
df.groupby('Valuegroup').plot(x='Date', y='Value', ax=ax, legend=False, kind='line')
plt.grid(True)
myFmt = DateFormatter("%d-%m-%Y")
ax.xaxis.set_minor_formatter(myFmt)
plt.show()
now gives plot:
You're parsing your dates wrong; "%Y-%m-%d" would work for dates like 2017-12-11 (which is Dec 11, 2017). Your dates are of the form "%Y%m%d", without the hyphens.
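For the second part of the question (one subplot per Valuegroup), a possible approach is to create one axis per group and loop over the groupby object. This is only a sketch under a few assumptions: the inline CSV text is a shortened stand-in for C:/Test.csv, the dates are read as strings and parsed explicitly, and the Agg backend line is there so the snippet also runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.dates import DateFormatter

# Shortened inline stand-in for the CSV file from the question
csv_text = """Calcgroup;Valuegroup;id;Date;Value
Group1;A;1;20080103;0.1
Group1;A;1;20080104;0.3
Group1;A;1;20080107;0.5
Group1;B;1;20080103;0.5
Group1;B;1;20080104;1.3
Group1;B;1;20080107;2.0
Group1;C;1;20080103;1.9
Group1;C;1;20080104;2.1
Group1;C;1;20080107;2.9
"""
df = pd.read_csv(io.StringIO(csv_text), sep=";", dtype={"Date": str})
df["Date"] = pd.to_datetime(df["Date"], format="%Y%m%d")

groups = df.groupby("Valuegroup")

# One row of the figure per Valuegroup, all sharing the same date axis
fig, axes = plt.subplots(groups.ngroups, 1, sharex=True, figsize=(6, 6))
for ax, (name, group) in zip(axes, groups):
    ax.plot(group["Date"], group["Value"], marker="o")
    ax.set_title(f"Valuegroup {name}")
    ax.xaxis.set_major_formatter(DateFormatter("%Y-%m-%d"))
    ax.grid(True)

fig.autofmt_xdate()  # rotate the shared date labels
# call plt.show() here when working interactively
```

Note that with a single group plt.subplots returns one Axes object rather than an array, so the loop would need axes wrapped in a list in that case.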

Pandas dataframe plotting - issue when switching from two subplots to single plot w/ secondary axis

I have two sets of data I want to plot together on a single figure. I have a set of flow data at 15 minute intervals I want to plot as a line plot, and a set of precipitation data at hourly intervals, which I am resampling to a daily time step and plotting as a bar plot. Here is what the format of the data looks like:
2016-06-01 00:00:00 56.8
2016-06-01 00:15:00 52.1
2016-06-01 00:30:00 44.0
2016-06-01 00:45:00 43.6
2016-06-01 01:00:00 34.3
At first I set this up as two subplots, with precipitation and flow rate on different axes. This works totally fine. Here's my code:
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
filename = 'manhole_B.csv'
plotname = 'SSMH-2A B'
plt.style.use('bmh')
# Read csv with precipitation data, change index to datetime object
pdf = pd.read_csv('precip.csv', delimiter=',', header=None, index_col=0)
pdf.columns = ['Precipitation[in]']
pdf.index.name = ''
pdf.index = pd.to_datetime(pdf.index)
pdf = pdf.resample('D').sum()
print(pdf.head())
# Read csv with flow data, change index to datetime object
qdf = pd.read_csv(filename, delimiter=',', header=None, index_col=0)
qdf.columns = ['Flow rate [gpm]']
qdf.index.name = ''
qdf.index = pd.to_datetime(qdf.index)
# Plot
f, ax = plt.subplots(2)
qdf.plot(ax=ax[1], rot=30)
pdf.plot(ax=ax[0], kind='bar', color='r', rot=30, width=1)
ax[0].get_xaxis().set_ticks([])
ax[1].set_ylabel('Flow Rate [gpm]')
ax[0].set_ylabel('Precipitation [in]')
ax[0].set_title(plotname)
f.set_facecolor('white')
f.tight_layout()
plt.show()
2 Axis Plot
However, I decided I want to show everything on a single axis, so I modified my code to put precipitation on a secondary axis. Now my flow data has disappeared from the plot, and even when I set the axis ticks to an empty set, I get these 00:15, 00:30 and 00:45 tick marks along the x-axis.
Secondary-y axis plots
Any ideas why this might be occurring?
Here is my code for the single axis plot:
f, ax = plt.subplots()
qdf.plot(ax=ax, rot=30)
pdf.plot(ax=ax, kind='bar', color='r', rot=30, secondary_y=True)
ax.get_xaxis().set_ticks([])
Here is an example:
Setup
In [1]: from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
df = pd.DataFrame({'x' : np.arange(10),
                   'y1' : np.random.rand(10,),
                   'y2' : np.square(np.arange(10))})
df
Out[1]: x y1 y2
0 0 0.451314 0
1 1 0.321124 1
2 2 0.050852 4
3 3 0.731084 9
4 4 0.689950 16
5 5 0.581768 25
6 6 0.962147 36
7 7 0.743512 49
8 8 0.993304 64
9 9 0.666703 81
Plot
In [2]: fig, ax1 = plt.subplots()
        ax1.plot(df['x'], df['y1'], 'b-')
        ax1.set_xlabel('Series')
        ax1.set_ylabel('Random', color='b')
        for tl in ax1.get_yticklabels():
            tl.set_color('b')
        ax2 = ax1.twinx() # Note twinx, not twiny. I was wrong when I commented on your question.
        ax2.plot(df['x'], df['y2'], 'ro')
        ax2.set_ylabel('Square', color='r')
        for tl in ax2.get_yticklabels():
            tl.set_color('r')
Out[2]:
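Applied to the flow and precipitation data, the same twinx idea could look like the sketch below. It rests on a few assumptions: the two frames are synthetic stand-ins for manhole_B.csv and precip.csv, and the bars are drawn with matplotlib's ax.bar rather than pandas' kind='bar'. The reason for that swap is that pandas bar plots place the bars at integer positions 0, 1, 2, ... instead of at their dates, which is a likely cause of the line and the bars not lining up on a shared x-axis.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-ins: 5 days of 15-minute flow data and hourly precipitation
flow_idx = pd.date_range("2016-06-01", periods=4 * 24 * 5, freq="15min")
qdf = pd.DataFrame({"Flow rate [gpm]": 50 + 10 * np.sin(np.arange(len(flow_idx)) / 20)},
                   index=flow_idx)
hourly = pd.Series(0.01, index=pd.date_range("2016-06-01", periods=24 * 5, freq="60min"))
pdf = hourly.resample("D").sum().to_frame("Precipitation [in]")

fig, ax = plt.subplots()
ax.plot(qdf.index, qdf["Flow rate [gpm]"])
ax.set_ylabel("Flow Rate [gpm]")

# Second y-axis sharing the same datetime x-axis
ax2 = ax.twinx()
# On a date axis, width is measured in days, so 1.0 covers the whole day
ax2.bar(pdf.index, pdf["Precipitation [in]"], width=1.0, align="edge",
        color="r", alpha=0.4)
ax2.set_ylabel("Precipitation [in]")

fig.autofmt_xdate()  # rotate the date labels
# call plt.show() here when working interactively
```

Because both artists use real datetime x-values, the bars and the line share one consistent time axis.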
