Pandas plot ONLY overlap between multiple data frames - python

I found the following solution on S.O. for plotting multiple data frames:
ax = df1.plot()
df2.plot(ax=ax)
But what if I only want to plot where they overlap?
Say that the index of df1 consists of timestamps spanning 24 hours, and the index of df2 consists of timestamps spanning 12 hours within the 24 hours of df1 (but not exactly the same timestamps as in df1).
What's the easiest way to plot only the 12 hours that both data frames cover?

A general answer to a general question:
You have three options:
Filter both DataFrames prior to plotting, such that they contain the same time interval.
Use the xlim keyword from the pandas plotting function.
Plot both dataframes and set the axes limits later on (ax.set_xlim())
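As a minimal sketch of the first option, the overlap window can be computed from the two indexes and used to slice both frames. The sample data and the names idx1, idx2, start and end are made up for illustration; df2 is deliberately put on a different time grid than df1.

```python
import numpy as np
import pandas as pd

# Hypothetical data: df2 covers 12 hours inside df1's 24 hours, on a different grid
idx1 = pd.date_range('2016-01-30 12:00', periods=25, freq=pd.Timedelta(hours=1))
idx2 = pd.date_range('2016-01-30 18:10', periods=49, freq=pd.Timedelta(minutes=15))
df1 = pd.DataFrame({'24 hrs': np.arange(idx1.size, dtype=float)}, index=idx1)
df2 = pd.DataFrame({'12 hrs': np.arange(idx2.size, dtype=float)}, index=idx2)

# The overlap runs from the later of the two starts to the earlier of the two ends
start = max(df1.index.min(), df2.index.min())
end = min(df1.index.max(), df2.index.max())

# Label-based slicing on a sorted DatetimeIndex includes both endpoints
df1_overlap = df1.loc[start:end]
df2_overlap = df2.loc[start:end]
```

The two sliced frames can then be plotted with the usual ax = df1_overlap.plot() followed by df2_overlap.plot(ax=ax). The same start and end values could instead be passed as xlim=(start, end) to the pandas plotting function, or to ax.set_xlim(start, end) after plotting.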

There are multiple ways to achieve this. The code snippet below shows two such ways as an example.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Make up time data for 24 hour period and 12 hour period
times1 = pd.date_range('1/30/2016 12:00:00', '1/31/2016 12:00:00', freq='H')
times2 = pd.date_range('1/30/2016 12:00:00', '1/31/2016 00:00:00', freq='H')
# Put time into DataFrame
df1 = pd.DataFrame(np.cumsum(np.random.randn(times1.size)), columns=['24 hrs'],
                   index=times1)
df2 = pd.DataFrame(np.cumsum(np.random.randn(times2.size)), columns=['12 hrs'],
                   index=times2)
# Method 1: Filter first dataframe according to second dataframe's time index
fig1, ax1 = plt.subplots()
df1.loc[times2].plot(ax=ax1)
df2.plot(ax=ax1)
# Method 2: Set the x limits of the axis
fig2, ax2 = plt.subplots()
df1.plot(ax=ax2)
df2.plot(ax=ax2)
ax2.set_xlim(times2.min(), times2.max())

To plot only the portion of df1 whose index lies within the index range of df2, you could do something like this:
ax = df1.loc[df2.index.min():df2.index.max()].plot()
There may be other ways to do it, but that's the one that occurs to me first.
Good luck!

Related

Plot each single day on one plot by extracting time of DatetimeIndex without for loop

I have a dataframe including random data over 7 days and each data point is indexed by DatetimeIndex. I want to plot data of each day on a single plot. Currently my try is the following:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
n =10000
i = pd.date_range('2018-04-09', periods=n, freq='1min')
ts = pd.DataFrame({'A': [np.random.randn() for i in range(n)]}, index=i)
dates = list(ts.index.map(lambda t: t.date()).unique())
for date in dates:
    ts['A'].loc[date.strftime('%Y-%m-%d')].plot()
The result is the following:
As you can see, when a DatetimeIndex is used the date of each point is kept, which is why each day is drawn after the previous one instead of on top of it.
Questions:
1- How can I fix the current code to have an x-axis which starts at midnight and ends at the next midnight?
2- Is there a pandas way to group the days and plot them over a single day without using a for loop?
You can split the index into dates and times and unstack the ts into a dataframe:
df = ts.set_index([ts.index.date, ts.index.time]).unstack(level=0)
df.columns = df.columns.get_level_values(1)
then plot all in one chart:
df.plot()
or in separate charts:
axs = df.plot(subplots=True, title=df.columns.tolist(), legend=False, figsize=(6,8))
axs[0].figure.tight_layout()  # execute_constrained_layout() is deprecated since Matplotlib 3.6
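To see what the unstack step produces, here is a small sketch on made-up data (two days, three readings per day). The values and the 8-hour spacing are assumptions chosen only to keep the result easy to read.

```python
import numpy as np
import pandas as pd

# Hypothetical series: 2 days x 3 readings (00:00, 08:00, 16:00)
i = pd.date_range('2018-04-09', periods=6, freq=pd.Timedelta(hours=8))
ts = pd.DataFrame({'A': np.arange(6.0)}, index=i)

# Rows become times of day, columns become calendar dates
df = ts.set_index([ts.index.date, ts.index.time]).unstack(level=0)
df.columns = df.columns.get_level_values(1)
print(df)
```

Each column now holds one day, indexed by time of day, so df.plot() draws all days over the same midnight-to-midnight axis.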

Highlight data gaps (NaN) in Matplotlib Scatter Plot

I am plotting some time-based data from pandas in matplotlib (it can be tens of thousands of rows) and I would like to highlight periods where there are NaNs in the data. The way I thought to accomplish this was to use axvspan to draw red boxes on the plot, starting and stopping where there are data gaps. I did think about just drawing a vertical line each time there was a NaN using axvline, but this could create thousands of objects on the plot and cause the resulting PNG to take a long time to write, so I think axvspan is more appropriate. However, where I am stuck is finding the start and stop indices of the groups of NaNs.
The code below isn't my actual code; it is just a basic mockup to show what I am trying to achieve.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
days = pd.date_range(datetime.now(), datetime.now() + timedelta(13), freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame({'idx': days, 'col': data})
df = df.set_index('idx')
print(df)
# Code to find the start index and stop index of the groups of NaNs
# results in a list which contains lists of each gap's start and stop datetime
gaps = []
plt.plot(df.index, df['col'])
for gap in gaps:
    plt.axvspan(gap[0], gap[1], facecolor='r', alpha=0.5)
plt.show()
The result would look something like the mockup below:
Other suggestions for visualizing the gaps would also be appreciated, such as a straight line in a different color connecting the data across the gap using some sort of fillna.
To find the start and stop indices of the groups of NaNs, first create a variable (is_nan) holding the boolean values of where col is NaN. With this variable you can find the rows where there is a transition between valid and NaN values. This can be done using shift (to offset the values by one row) and ne; this way you compare each row with the previous one and determine where the values alternate. After that, apply cumsum to label each contiguous run of valid or NaN values with its own group number (n_groups).
Now, using only the rows with NaN values (df[is_nan]), use groupby with n_groups to gather the rows of each gap into the same group. Next, apply aggregate to return a single tuple with the start and end timestamps of each group. DateOffset is used here to extend each rectangle to the adjacent valid points, matching the desired image output. Finally, select ['col'].values on the dataframe returned by aggregate to get the tuples as an array that can be looped over like a list.
...
...
df = df.set_index('idx')
print(df)
# Code to find the start index and stop index of the groups of NaNs
is_nan = df['col'].isna()
n_groups = is_nan.ne(is_nan.shift()).cumsum()
gap_list = df[is_nan].groupby(n_groups).aggregate(
    lambda x: (
        x.index[0] + pd.DateOffset(days=-1),
        x.index[-1] + pd.DateOffset(days=+1)
    )
)["col"].values
# results in an array which contains tuples of each gap's start and stop datetime
gaps = gap_list
plt.plot(df.index, df['col'], marker='o' )
plt.xticks(df.index, rotation=45)
for gap in gaps:
    plt.axvspan(gap[0], gap[1], facecolor='r', alpha=0.5)
plt.grid()
plt.show()
We can use fill_between to highlight areas. However, it is much easier to define the parts where data exist than the parts where they are missing, because highlighting only the missing parts would leave gaps between the highlighted area and the neighboring data points. So we simply highlight the entire plotting area, then paint over the areas where data exist in white, then plot:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
days = pd.date_range(datetime.now(), datetime.now() + timedelta(13), freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame({'idx': days, 'col': data})
df = df.set_index('idx')
fig, ax = plt.subplots()
ax.fill_between(df.index, df.col.min(), df.col.max(), where=df.col, facecolor="lightblue", alpha=0.5)
ax.fill_between(df.index, df.col.min(), df.col.max(), where=np.isfinite(df.col), facecolor="white", alpha=1)
ax.plot(df.index, df.col)
ax.xaxis.set_tick_params(rotation=45)
plt.tight_layout()
plt.show()
Sample output:
You can loop through the enumerated list of boolean values given by df['col'].isna() and compare each boolean value to the previous one to select the timestamps for the starts and stops of the gaps. Here is an example based on your code sample and where the plot is generated with the pandas plotting function:
import numpy as np # v 1.19.2
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
days = pd.date_range('2021-03-08', periods=14, freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame(dict(col=data), index=days)
ax = df.plot(y='col', marker='.', figsize=(8,4))
# Generate lists of starts and stops timestamps for gaps in time series,
# assuming that the first and last data points are not NaNs
starts, stops = [], []
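is_na = df['col'].isna()
# Start at 1 so that idx-1 never wraps around to the last row
for idx in range(1, len(is_na)):
    if is_na.iloc[idx] and not is_na.iloc[idx-1]:
        # Gap begins: start the span at the last valid point
        starts.append(df.index[idx-1])
    elif not is_na.iloc[idx] and is_na.iloc[idx-1]:
        # Gap ends: stop the span at the first valid point
        stops.append(df.index[idx])
# Plot red vertical spans for gaps in time series
for start, stop in zip(starts, stops):
    ax.axvspan(start, stop, facecolor='r', alpha=0.3)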
plt.show()
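For long series, the same starts and stops can probably be found without a Python-level loop. The following is only a sketch using NumPy on the same sample data; the names edges, first_nan and last_nan are made up here, and the clipping is an added assumption about how NaNs at the very start or end of the series should be handled.

```python
import numpy as np
import pandas as pd

days = pd.date_range('2021-03-08', periods=14, freq='D')
data = [2, 2.3, 3, np.nan, np.nan, 4.7, 3.4, 3.1, 2.7, np.nan, np.nan, np.nan, 4, 4.5]
col = pd.Series(data, index=days)

is_nan = col.isna().to_numpy()
# Pad with False on both sides, then diff: +1 where a gap begins, -1 just after it ends
edges = np.diff(np.concatenate(([False], is_nan, [False])).astype(int))
first_nan = np.flatnonzero(edges == 1)       # position of the first NaN of each gap
last_nan = np.flatnonzero(edges == -1) - 1   # position of the last NaN of each gap

# Extend each span to the neighbouring valid points, clipped to the index bounds
starts = col.index[np.clip(first_nan - 1, 0, len(col) - 1)]
stops = col.index[np.clip(last_nan + 1, 0, len(col) - 1)]
gaps = list(zip(starts, stops))
```

Each (start, stop) pair in gaps can be passed to ax.axvspan exactly as in the loop above.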
In the end I took a little from each of the provided answers (A, B and C); thanks for the feedback. Building the list of starts and stops was very slow for real-world data (tens to hundreds of thousands of rows). Since I didn't need a numerical answer, just a visual one, I did it using matplotlib alone with the following code:
ax[i].fill_between(data.index, 0, (is_nan*data.max()), color='r', step='mid', linewidth='0')
ax[i].plot(data.index, data, color='b', linestyle='-', marker=',', label=ylabel)
The fill_between call creates the shaded blocks where the NaNs are. Multiplying is_nan by data.max() makes the blocks span the entire y-axis. step='mid' squares off the sides. linewidth=0 hides the red line where is_nan is 0 (not NaN).

Dataframe changing question with time series data with pandas

I have this dataframe:
The event-time column holds the time of an event, and the date-time column has one row every 10 minutes with the price at that time. For each security, the rows run from 2 hours before the event to 4 hours after it. I have thousands of securities. I want to create a plot whose x-axis runs from -12 to 24 (in 10-minute steps, i.e. from 2 hours before the event to 4 hours after it) and whose y-axis is the price change. Is there any way to synchronize the date-time values across securities in Python?
If you're simply looking to plot the data, pandas should handle your datetimes for you, assuming they are stored as datetimes rather than strings.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
event_time = pd.to_datetime('2024-04-28T07:52:00')
date_time = pd.date_range(event_time, periods=24, freq=pd.to_timedelta(10, 'minute'))
df = pd.DataFrame({'date_time': date_time, 'change': np.random.normal(size=len(date_time))})
ax = df.plot(x='date_time', y='change')
plt.show()
However, if you want to remove the specific times from the x-axis and just count up from zero, you could use the index as the x-axis:
df['_index'] = df.index
ax = df.plot(x='_index', y='change')
plt.show()
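If the goal is the -12 to 24 axis from the question, one possible approach is to express each row as a number of 10-minute steps relative to its own event time and then pivot. This is only a sketch on made-up data: the column names security, event_time, date_time and price, the helper make_security, and the reading of "price change" as the difference from the price at the event are all assumptions.

```python
import numpy as np
import pandas as pd

step = pd.Timedelta(minutes=10)
rng = np.random.default_rng(0)

def make_security(name, event_time):
    """Build hypothetical prices from 2 h before to 4 h after the event."""
    event_time = pd.Timestamp(event_time)
    date_time = pd.date_range(event_time - 12 * step, event_time + 24 * step, freq=step)
    price = 100 + rng.normal(size=date_time.size).cumsum()
    return pd.DataFrame({'security': name, 'event_time': event_time,
                         'date_time': date_time, 'price': price})

df = pd.concat([make_security('AAA', '2024-04-28 07:50'),
                make_security('BBB', '2024-05-02 14:20')], ignore_index=True)

# Offset from the event in 10-minute steps: -12 ... 24
df['rel_step'] = ((df['date_time'] - df['event_time']) / step).round().astype(int)

# One column per security, all aligned on the same relative axis
aligned = df.pivot(index='rel_step', columns='security', values='price')
# Price change relative to the price at the event (rel_step == 0)
change = aligned - aligned.loc[0]
```

Calling change.plot() would then draw every security on a shared x-axis running from -12 to 24.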

Integrating over range of dates, and labeling the xaxis

I am trying to integrate 2 curves as they change through time using pandas. I am loading the data from a CSV file, shown below.
The dates are the X-axis and both the Oil & Water points are on the Y-axis. I have learned to use the cross-section option to isolate the "NAME" values, but am having trouble finding a good way to integrate with dates as the X-axis. I eventually would like to be able to take the integrals of both curves and stack them against each other. I am also having trouble with the plot defaulting the x-ticks to arbitrary values instead of the dates.
I can change the labels/ticks manually, but have a large CSV to process and would like to automate the process. Any help would be greatly appreciated.
NAME,DATE,O,W
A,1/20/2000,12,50
B,1/20/2000,25,28
C,1/20/2000,14,15
A,1/21/2000,34,50
B,1/21/2000,8,3
C,1/21/2000,10,19
A,1/22/2000,47,35
B,1/22/2000,4,27
C,1/22/2000,46,1
A,1/23/2000,19,31
B,1/23/2000,18,10
C,1/23/2000,19,41
Contents of CSV in text form above.
Further to my comment above, here is some sample code (using logic from the example mentioned) to label your xaxis with formatted dates. Hope this helps.
Data Collection / Imports:
Just re-creating your dataset for the example.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
header = ['NAME', 'DATE', 'O', 'W']
data = [['A','1/20/2000',12,50],
['B','1/20/2000',25,28],
['C','1/20/2000',14,15],
['A','1/21/2000',34,50],
['B','1/21/2000',8,3],
['C','1/21/2000',10,19],
['A','1/22/2000',47,35],
['B','1/22/2000',4,27],
['C','1/22/2000',46,1],
['A','1/23/2000',19,31],
['B','1/23/2000',18,10],
['C','1/23/2000',19,41]]
df = pd.DataFrame(data, columns=header)
df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%Y')
# Subset to just the 'A' labels.
df_a = df[df['NAME'] == 'A']
Plotting:
# Define the number of ticks you need.
nticks = 4
# Define the date format.
mask = '%m-%d-%Y'
# Create the set of custom date labels.
step = int(df_a.shape[0] / nticks)
xdata = np.arange(df_a.shape[0])
xlabels = df_a['DATE'].dt.strftime(mask).tolist()[::step]
# Create the plot.
fig, ax = plt.subplots(1, 1)
ax.plot(xdata, df_a['O'], label='Oil')
ax.plot(xdata, df_a['W'], label='Water')
ax.set_xticks(np.arange(df_a.shape[0], step=step))
ax.set_xticklabels(xlabels, rotation=45, horizontalalignment='right')
ax.set_title('Test in Naming Labels for the X-Axis')
ax.legend()
Output:
I'd recommend converting the X-axis into integers or floats (seconds, minutes, hours, or days since a certain time, depending on the precision that you need). You can then use the usual methods to integrate, and the x-axis would no longer default to arbitrary values.
See How to convert datetime to integer in python
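As a sketch of that suggestion, the dates for series 'A' can be converted to days since the first sample and integrated with the trapezoidal rule. The helper trapezoid is written out by hand here for illustration rather than taken from a library, and the result is in value-days.

```python
import numpy as np
import pandas as pd

# Rows for NAME == 'A' from the sample CSV
df_a = pd.DataFrame({
    'DATE': pd.to_datetime(['1/20/2000', '1/21/2000', '1/22/2000', '1/23/2000'],
                           format='%m/%d/%Y'),
    'O': [12, 34, 47, 19],
    'W': [50, 50, 35, 31],
})

# Numeric x-axis: days elapsed since the first date
days = ((df_a['DATE'] - df_a['DATE'].iloc[0]) / pd.Timedelta(days=1)).to_numpy()

def trapezoid(y, x):
    """Trapezoidal rule: sum of mean segment height times segment width."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

oil_total = trapezoid(df_a['O'], days)    # 96.5
water_total = trapezoid(df_a['W'], days)  # 125.5
```

Dividing by a different pd.Timedelta (hours, seconds) changes the unit of the x-axis and therefore of the integral.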

Inconsistent automatic pandas date labeling

I was wondering how exactly pandas formats the x-axis dates. I am using the same script on a bunch of data results, which all have the same pandas df format. However, pandas formats the dates of each df differently. How can this be made more consistent?
Each df has a DatetimeIndex like this, with dtype='datetime64[ns]':
>>> df.index
DatetimeIndex(['2014-10-02', '2014-10-03', '2014-10-04', '2014-10-05',
'2014-10-06', '2014-10-07', '2014-10-08', '2014-10-09',
'2014-10-10', '2014-10-11',
...
'2015-09-23', '2015-09-24', '2015-09-25', '2015-09-26',
'2015-09-27', '2015-09-28', '2015-09-29', '2015-09-30',
'2015-10-01', '2015-10-02'],
dtype='datetime64[ns]', name='Date', length=366, freq=None)
Eventually, I plot with df.plot() where the df has two columns.
But the axes of the plots have different styles, like this:
I would like all plots to have the x-axis style of the first plot. pandas should do this automatically, so I'd rather not start formatting xticks by hand, since I have quite a lot of data to plot. Could anyone explain what to do? Thanks!
EDIT:
I'm reading two csv-files from 2015. The first has the model results of about 200 stations, the second has the gauge measurements of the same stations. Later, I read another two csv-files from 2016 with the same format.
import pandas as pd
df_model = pd.read_csv(path_model, sep=';', index_col=0, parse_dates=True)
df_gauge = pd.read_csv(path_gauge, sep=';', index_col=0, parse_dates=True)
df = pd.DataFrame(columns=['model', 'gauge'], index=df_model.index)
df['model'] = df_model['station_1'].copy()
df['gauge'] = df_gauge['station_1'].copy()
df.plot()
I do this for each year, so the x-axis should look the same, right?
I do not think this is possible unless you make modifications to the pandas library. I looked around a bit for options that one may set in pandas, but couldn't find one. Pandas tries to intelligently select the type of axis ticks using logic implemented here (I THINK). So in my opinion, it would be best to define your own function to make the plots and then overwrite the tick formatting (although you do not want to do that).
There are many references around the internet which show how to do this. I used this one by "Simone Centellegher" and this stackoverflow answer to come up with a function that may work for you (tested in python 3.7.1 with matplotlib 3.0.2, pandas 0.23.4):
import pandas as pd
import numpy as np
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
## pass df with columns you want to plot
def my_plotter(df, xaxis, y_cols):
    fig, ax = plt.subplots()
    ax.plot(xaxis, df[y_cols])
    ax.xaxis.set_minor_locator(mdates.MonthLocator())
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.xaxis.set_minor_formatter(mdates.DateFormatter('%b'))
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%b\n%Y'))
    # Remove overlapping major and minor ticks
    majticklocs = ax.xaxis.get_majorticklocs()
    minticklocs = ax.xaxis.get_minorticklocs()
    minticks = ax.xaxis.get_minor_ticks()
    for i in range(len(minticks)):
        cur_mintickloc = minticklocs[i]
        if cur_mintickloc in majticklocs:
            minticks[i].set_visible(False)
    return fig, ax

df = pd.DataFrame({'values': np.random.randint(0, 1000, 36)},
                  index=pd.date_range(start='2014-01-01',
                                      end='2016-12-31', freq='M'))
fig, ax = my_plotter(df, df.index, ["values"])
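A lighter-weight variant that may be worth trying keeps df.plot() but passes x_compat=True, which asks pandas to leave tick handling to matplotlib, and then sets the same locator and formatter on every plot. This is a sketch on made-up data, not something tested against the original CSV files; ConciseDateFormatter requires Matplotlib 3.1 or later.

```python
import numpy as np
import pandas as pd
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

# Hypothetical stand-in for one year of daily model/gauge data
idx = pd.date_range('2014-10-02', periods=366, freq='D')
rng = np.random.default_rng(0)
df = pd.DataFrame({'model': rng.random(idx.size),
                   'gauge': rng.random(idx.size)}, index=idx)

# x_compat=True disables pandas' own date tick adjustment
ax = df.plot(x_compat=True)

# Apply the same matplotlib locator/formatter pair to every plot
locator = mdates.AutoDateLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))
```

Because the locator and formatter are chosen explicitly, frames covering similar date ranges should end up with the same tick style.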
