Highlight data gaps (NaN) in Matplotlib Scatter Plot

I am plotting some time-based data from pandas in matplotlib (it can be tens of thousands of rows) and I would like to highlight periods where there are NaNs in the data. The way I thought to accomplish this was to use axvspan to draw one or more red boxes on the plot, starting and stopping where there are data gaps. I did think about just drawing a vertical line each time there was a NaN using axvline, but this could create thousands of objects on the plot and cause the resulting PNG to take a long time to write, so I think axvspan is more appropriate. Where I am stuck is finding the start and stop indices of the groups of NaNs.
The code below isn't from my actual code; it is just a basic mockup to show what I am trying to achieve.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
days = pd.date_range(datetime.now(), datetime.now() + timedelta(13), freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame({'idx': days, 'col': data})
df = df.set_index('idx')
print(df)
# Code to find the start index and stop index of the groups of NaNs
# results in a list which contains lists of each gap start and stop datetime
gaps = []
plt.plot(df.index, df['col'])
for gap in gaps:
    plt.axvspan(gap[0], gap[1], facecolor='r', alpha=0.5)
plt.show()
The result would look something like the mockup below:
Other suggestions for visualizing the gaps would also be appreciated, such as a straight line in a different color connecting the data across the gap using some sort of fillna.

To find the start and stop indices of the groups of NaNs, first create a boolean variable that is True where col is NaN. With this variable you can find the rows where there is a transition between valid and NaN values. This can be done using shift (to move the values down by one row) and ne, so that each row is compared with the previous one and the points where the values alternate are flagged. After that, apply cumsum to number the contiguous runs of valid and NaN values as distinct groups.
Now, using only the rows with NaN values (df[is_nan]), use groupby with n_groups to gather the rows of each gap into the same group. Next, apply aggregate to return a single tuple with the start and end timestamps of each group. DateOffset is used here to extend each rectangle to the adjacent valid points, matching the desired image output. You can then use ['col'].values to access the result returned by aggregate as an array of tuples.
...
...
df = df.set_index('idx')
print(df)
# Code to find the start index and stop index of the groups of NaNs
is_nan = df['col'].isna()
n_groups = is_nan.ne(is_nan.shift()).cumsum()
gap_list = df[is_nan].groupby(n_groups).aggregate(
    lambda x: (
        x.index[0] + pd.DateOffset(days=-1),
        x.index[-1] + pd.DateOffset(days=+1)
    )
)["col"].values
# results in an array which contains tuples of each gap start and stop datetime
gaps = gap_list
plt.plot(df.index, df['col'], marker='o')
plt.xticks(df.index, rotation=45)
for gap in gaps:
    plt.axvspan(gap[0], gap[1], facecolor='r', alpha=0.5)
plt.grid()
plt.show()
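The shift / ne / cumsum grouping described above can also be inspected without a tuple-returning lambda. The following is only a sketch under assumptions: the fixed start date, the names col, group_id and timestamps, and the use of agg(["first", "last"]) are illustrative choices rather than part of the answer's code.

```python
import numpy as np
import pandas as pd

# Same values as the question's mockup; the fixed start date is an assumption
# made so the result is reproducible.
days = pd.date_range("2021-03-08", periods=14, freq="D")
data = [2, 2.3, 3, np.nan, np.nan, 4.7, 3.4, 3.1, 2.7, np.nan, np.nan, np.nan, 4, 4.5]
col = pd.Series(data, index=days, name="col")

is_nan = col.isna()
# A new group id starts every time the NaN flag differs from the previous row.
group_id = is_nan.ne(is_nan.shift()).cumsum()

# Keep only the NaN rows and take the first and last timestamp of each run.
timestamps = col.index.to_series()
gaps = timestamps[is_nan].groupby(group_id[is_nan]).agg(["first", "last"])
print(gaps)

# List of (start, stop) tuples, usable in a loop that calls plt.axvspan.
gap_list = list(gaps.itertuples(index=False, name=None))
```

Note that these are the first and last NaN timestamps of each gap; to widen the span to the neighbouring valid points, as the answer does with DateOffset, subtract and add one sampling interval.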

We can use fill_between to highlight areas. However, it is much easier to define the parts where data are than the ones where no data are without creating gaps to existing data points. So, we simply highlight the entire plotting area, then overwrite the areas where data are in white, then plot:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
days = pd.date_range(datetime.now(), datetime.now() + timedelta(13), freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame({'idx': days, 'col': data})
df = df.set_index('idx')
fig, ax = plt.subplots()
ax.fill_between(df.index, df.col.min(), df.col.max(), where=df.col, facecolor="lightblue", alpha=0.5)
ax.fill_between(df.index, df.col.min(), df.col.max(), where=np.isfinite(df.col), facecolor="white", alpha=1)
ax.plot(df.index, df.col)
ax.xaxis.set_tick_params(rotation=45)
plt.tight_layout()
plt.show()
Sample output:

You can loop through the enumerated list of boolean values given by df['col'].isna() and compare each boolean value to the previous one to select the timestamps for the starts and stops of the gaps. Here is an example based on your code sample and where the plot is generated with the pandas plotting function:
import numpy as np # v 1.19.2
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
days = pd.date_range('2021-03-08', periods=14, freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame(dict(col=data), index=days)
ax = df.plot(y='col', marker='.', figsize=(8,4))
# Generate lists of starts and stops timestamps for gaps in time series,
# assuming that the first and last data points are not NaNs
starts, stops = [], []
for idx, isna in enumerate(df['col'].isna()):
    # .iloc makes the lookup of the previous row explicitly positional
    if isna != df['col'].isna().iloc[idx-1] and isna:
        starts.append(df.index[idx-1])
    elif isna != df['col'].isna().iloc[idx-1] and not isna:
        stops.append(df.index[idx])
# Plot red vertical spans for gaps in time series
for start, stop in zip(starts, stops):
    ax.axvspan(start, stop, facecolor='r', alpha=0.3)
plt.show()

In the end I took a little from each of the provided answers (A, B and C), thanks for the feedback. Building the list of starts and stops was very slow for real-world data (tens to hundreds of thousands of rows). Since I didn't need a numerical answer, just a visual one, I did it using matplotlib alone with the following code:
ax[i].fill_between(data.index, 0, (is_nan*data.max()), color='r', step='mid', linewidth=0)
ax[i].plot(data.index, data, color='b', linestyle='-', marker=',', label=ylabel)
The fill_between call creates the shaded blocks where the NaNs are. Multiplying is_nan by data.max() makes the blocks reach from 0 up to the largest data value. step='mid' squares off the sides. linewidth=0 hides the red line where the fill height is 0 (not NaN).
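The two lines above rely on names (ax[i], data, is_nan, ylabel) defined elsewhere in the asker's script. A self-contained sketch of the same idea, assuming a single Axes and the mockup values from the question (the fixed start date and the variable names are assumptions), might look like this:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Mockup series from the question; the fixed start date is an assumption.
days = pd.date_range("2021-03-08", periods=14, freq="D")
values = [2, 2.3, 3, np.nan, np.nan, 4.7, 3.4, 3.1, 2.7, np.nan, np.nan, np.nan, 4, 4.5]
data = pd.Series(values, index=days)
is_nan = data.isna()

fig, ax = plt.subplots()
# The fill height is data.max() where the value is NaN and 0 elsewhere,
# so the red blocks appear only over the gaps.
ax.fill_between(data.index, 0, is_nan * data.max(),
                color="r", step="mid", linewidth=0)
ax.plot(data.index, data, color="b", linestyle="-", marker=".")
fig.autofmt_xdate()
```

Call plt.show() or fig.savefig(...) afterwards to view the result. Because nothing is looped over in Python, the cost stays low even for long series.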

Related

Graphing in Dataframe Pandas Python. How to plot a line after filtering a dataframe

So I have a pandas DataFrame with patient id, date of visit, location, weight, and heart rate. I need to graph the number of visits at one location in the dataset over a period of 12 months, with the month number on the horizontal axis.
Any other suggestions about how I may go about this?
I tried splitting the data into 3 data sets and then graphing the number of visits counted from each one, but creating new columns and assigning the values wasn't working. It only worked when I was graphing the values of all of the clinics; after splitting the data into 3 dataframes, it stopped working.

Here is a working example of filtering a DataFrame and using the filtered results to plot a chart.
import pandas as pd
import matplotlib.pyplot as plt
# larger dataframe example
d = {'x values':[1,2,3,4,5,6,7,8,9],'y values':[2,4,6,8,10,12,14,16,18]}
df = pd.DataFrame(d)
# apply filter
df = df[df['x values'] < 5]
# plot chart
plt.plot(df['x values'], df['y values'])
plt.show()
result:

Simply place your data into an ndarray and plot it with matplotlib.pyplot, or plot directly from a DataFrame column, for example plt.plot(df['something']).
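For the specific task in the question (visits per month at one location), a minimal sketch could filter first and then count per month. The column names patient_id, visit_date and location, the location label "North", and the sample rows are all made up for illustration, since the original DataFrame is not shown.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the asker's data (names and values are assumptions).
visits = pd.DataFrame({
    "patient_id": [1, 2, 3, 1, 4, 2, 5],
    "visit_date": pd.to_datetime([
        "2020-01-05", "2020-01-20", "2020-02-11", "2020-02-15",
        "2020-02-28", "2020-03-02", "2020-03-09",
    ]),
    "location": ["North", "South", "North", "North", "South", "North", "North"],
})

# Filter to one location, then count rows per calendar month.
north = visits[visits["location"] == "North"]
counts = (
    north.groupby(north["visit_date"].dt.month)
    .size()
    .reindex(range(1, 13), fill_value=0)  # months without visits become 0
)

fig, ax = plt.subplots()
ax.plot(counts.index, counts.values, marker="o")
ax.set_xlabel("Month number")
ax.set_ylabel("Visits at North")
ax.set_xticks(range(1, 13))
```

Filtering on a boolean mask avoids splitting the data into separate DataFrames; repeating the last steps with a different location value adds another line to the same axes.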

Graphing a dataframe line plot with a legend in Matplotlib

I'm working with a dataset that has grades and states and need to create line graphs by state showing what percent of each state's students fall into which bins.
My methodology (so far) is as follows:
First I import the dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
records = [{'Name':'A', 'Grade':'.15','State':'NJ'},{'Name':'B', 'Grade':'.15','State':'NJ'},{'Name':'C', 'Grade':'.43','State':'CA'},{'Name':'D', 'Grade':'.75','State':'CA'},{'Name':'E', 'Grade':'.17','State':'NJ'},{'Name':'F', 'Grade':'.85','State':'HI'},{'Name':'G', 'Grade':'.89','State':'HI'},{'Name':'H', 'Grade':'.38','State':'CA'},{'Name':'I', 'Grade':'.98','State':'NJ'},{'Name':'J', 'Grade':'.49','State':'NJ'},{'Name':'K', 'Grade':'.17','State':'CA'},{'Name':'K', 'Grade':'.94','State':'HI'},{'Name':'M', 'Grade':'.33','State':'HI'},{'Name':'N', 'Grade':'.22','State':'NJ'},{'Name':'O', 'Grade':'.7','State':'NJ'}]
df = pd.DataFrame(records)
df.Grade = df.Grade.astype(float)
Next I cut each grade into a bin
df['bin'] = pd.cut(df['Grade'],[-np.inf,.05,.1,.15,.2,.25,.3,.35,.4,.45,.5,.55,.6,.65,.7,.75,.8,.85,.9,.95,1],labels=False)/10
Then I create a pivot table giving me the count of people by bin in each state
df2 = pd.pivot_table(df,index=['bin'],columns='State',values=['Name'],aggfunc=pd.Series.nunique,margins=True)
df2 = df2.fillna(0)
Then I convert those n-counts into percentages and remove the margin rows
df3 = df2.div(df2.iloc[-1])
df3 = df3.iloc[:-1,:-1]
Now I want to create a line graph with multiple lines (one for each state) with the bin on the x axis and the percentage on the Y axis. df3.plot() will give me the chart I want but I would like to accomplish the same using matplotlib, because it offers me greater customization of the graph. However, running
plt.plot(df3)
gives me the lines I need, but I can't get the legend to work properly. Any thoughts on how to accomplish this?

It may not be the best way, but I use the pandas plot function to draw df3, then get the legend and get the new label names. Please note that the processing of the legend string is limited to this data.
line = df3.plot(kind='line')
handles, labels = line.get_legend_handles_labels()
label = []
for l in labels:
    label.append(l[7:-1])
plt.legend(handles, label, loc='best')

You can do this:
plt.plot(df3,label="label")
plt.legend()
plt.show()
For more information visit here
And if it helps you to solve your issues then don't forget to mark this as accepted answer.
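Passing a single label to plt.plot(df3, ...) gives every line the same legend entry. One way to get a label per state with plain matplotlib is to plot the columns one at a time. This is a sketch only: the small df3 below is a made-up stand-in with the same column layout as the pivot table in the question (a "Name" level above a "State" level), and its numbers are not derived from the question's data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative stand-in for df3: MultiIndex columns ("Name", <state>).
columns = pd.MultiIndex.from_product([["Name"], ["CA", "HI", "NJ"]],
                                     names=[None, "State"])
df3 = pd.DataFrame(
    [[0.25, 0.00, 0.50],
     [0.50, 0.25, 0.25],
     [0.25, 0.75, 0.25]],
    index=pd.Index([0.1, 0.2, 0.3], name="bin"),
    columns=columns,
)

fig, ax = plt.subplots()
# Plot each state's column separately so each line carries its own label.
for state in df3.columns.get_level_values("State"):
    ax.plot(df3.index, df3[("Name", state)], label=state)
ax.set_xlabel("bin")
ax.set_ylabel("share of students")
ax.legend(title="State")
```

Reading the labels from the "State" level of the columns avoids slicing the legend strings by character position.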

Matplotlib and Pandas treatment of timeseries without weekends

I am running into some issues adding Matplotlib lines into Pandas plot. I am trying to plot a straight line using the slope to determine what the start and end-points are. But the resultant graph does not look like a straight line at all.
I have simplified the case to the MVCE below. The initial part is for setup to replicate the key feature of the complicated dataframe I have.
import pandas as pd
import matplotlib.pyplot as plt
LEN_SER = 23
dates = pd.date_range('2015-07-03', periods=LEN_SER, freq='B')
df = pd.DataFrame(range(1,LEN_SER+1), index=dates)
ts = df.iloc[:,0]
# The above is the setup of the MVCE to replicate the issue.
fig = plt.figure()
ax1 = plt.subplot2grid((1, 1), (0, 0))
ax1.plot([ts.index[5], ts.index[20]],
         [ts[5], ts[5] + (1.0 * (20 - 5))], 'o-')
ts.plot(ax=ax1)
plt.show()
This gives a graph that has a wavy line due to the weekends. The Matplotlib is affecting how Pandas is plotting the series. If I take out the ax1.plot() line, then it becomes a straight line.
So the question is: How do I draw straight lines on my Pandas plot with Matplotlib? Put it another way, I want the plot to treat the axis labels as categories so weekends will be ignored. That way, I am hoping that Matplotlib and Pandas will both give a straight line.

As you correctly observe, if you delete the line ax1.plot(), then matplotlib treats your dates as categories, and the pandas plot is a nice straight line. However, in the command
ax1.plot([ts.index[5], ts.index[20]],
         [ts[5], ts[5] + (1.0 * (20 - 5))], 'o-')
you ask matplotlib to interpolate between two points, and in the process of interpolating matplotlib recognizes the dates on the x-axis. That is why the straight-line pandas plot with respect to date categories (5 a week) becomes a wavy line with respect to dates (7 a week). That is correct as well, because with respect to dates your data simply isn't represented by a straight line.
You can force the category interpretation replacing dates by strings through
df.index = df.reset_index().apply(lambda x: x['index'].strftime('%Y-%m-%d'), axis=1)
before defining ts. That results in the plot
Now the matplotlib plot is just two categories against two values and matplotlib does not bother to realize that the two categories are among the categories in the pandas plot. (Changing the order of the two plots saves your x-axis at least.) Modifying the matplotlib plot to
ax1.plot([5, 20], [ts[5], ts[5] + (1.0 * (20 - 5))], 'o-')
plots a line between categories 5 and 20, and finally gives you two straight lines with respect to a categories x-axis.
Full code:
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn') # (optional - style was set when I produced my graph; newer Matplotlib releases name it 'seaborn-v0_8')
LEN_SER = 23
dates = pd.date_range('2015-07-03', periods=LEN_SER, freq='B')
df = pd.DataFrame(range(1,LEN_SER+1), index=dates)
df.index = df.reset_index().apply(
    lambda x: x['index'].strftime('%Y-%m-%d'), axis=1) # dates -> categories (string)
ts = df.iloc[:,0]
# The above is the setup of the MVCE to replicate the issue.
fig = plt.figure()
ax1 = plt.subplot2grid((1, 1), (0, 0))
ax1.plot([5, 20], [ts[5], ts[5] + (1.0 * (20 - 5))], 'o-')
# x coordinates 'categories' 5 and 20
ts.plot(ax=ax1)
plt.show()

You already answered the question: " probably due to the weekends"
replace:
dates = pd.date_range('2015-07-03', periods=LEN_SER, freq='B')
with
dates = pd.date_range('2015-07-03', periods=LEN_SER, freq='D')
B - business day frequency
D - calendar day frequency
And your lines are straightened.

You're right - it is due to weekends. You can tell by the slope - five consecutive days have a sharper incline (+1 each day), than the three consecutive days (+1 total). So, what exactly do you want to plot? If you want to literally plot the blue line, you can interpolate the points between your two points like this:
...
# ts.plot(ax=ax1)
ts.iloc[[5,20]].resample('1D').interpolate(method='linear').plot(ax=ax1)
plt.show()

For simplicity I started from 2015-07-04. Does it work for you?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
LEN_SER = 21
dates = pd.date_range('2015-07-04', periods=LEN_SER, freq='B')
the_axes = []
# take the_axes like monday and friday for each week
for monday, friday in zip(dates[dates.weekday==0], dates[dates.weekday==4]):
    the_axes.append([monday.date(), friday.date()])
x = dates
y = range(1,LEN_SER+1)
n_Axes = len(the_axes)
fig,(axes) = plt.subplots(1, n_Axes, sharey=True, figsize=(15,8))
for i in range(n_Axes):
    ax = axes[i]
    ax.plot(x, y)
    ax.set_xlim(the_axes[i])
fig.autofmt_xdate()
print(dates)
plt.show()

Integrating over range of dates, and labeling the xaxis

I am trying to integrate 2 curves as they change through time using pandas. I am loading data from a CSV file (its contents are reproduced in text form below).
The dates are the X-axis and both the Oil and Water points are on the Y-axis. I have learned to use the cross-section option to isolate the "NAME" values, but am having trouble finding a good way to integrate with dates as the X-axis. I eventually would like to be able to take the integrals of both curves and stack them against each other. I am also having trouble with the plot defaulting the x-ticks to arbitrary values instead of the dates.
I can change the labels/ticks manually, but have a large CSV to process and would like to automate the process. Any help would be greatly appreciated.
NAME,DATE,O,W
A,1/20/2000,12,50
B,1/20/2000,25,28
C,1/20/2000,14,15
A,1/21/2000,34,50
B,1/21/2000,8,3
C,1/21/2000,10,19
A,1/22/2000,47,35
B,1/22/2000,4,27
C,1/22/2000,46,1
A,1/23/2000,19,31
B,1/23/2000,18,10
C,1/23/2000,19,41
Contents of CSV in text form above.

Further to my comment above, here is some sample code (using logic from the example mentioned) to label your xaxis with formatted dates. Hope this helps.
Data Collection / Imports:
Just re-creating your dataset for the example.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
header = ['NAME', 'DATE', 'O', 'W']
data = [['A','1/20/2000',12,50],
        ['B','1/20/2000',25,28],
        ['C','1/20/2000',14,15],
        ['A','1/21/2000',34,50],
        ['B','1/21/2000',8,3],
        ['C','1/21/2000',10,19],
        ['A','1/22/2000',47,35],
        ['B','1/22/2000',4,27],
        ['C','1/22/2000',46,1],
        ['A','1/23/2000',19,31],
        ['B','1/23/2000',18,10],
        ['C','1/23/2000',19,41]]
df = pd.DataFrame(data, columns=header)
df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%Y')
# Subset to just the 'A' labels.
df_a = df[df['NAME'] == 'A']
Plotting:
# Define the number of ticks you need.
nticks = 4
# Define the date format.
mask = '%m-%d-%Y'
# Create the set of custom date labels.
step = int(df_a.shape[0] / nticks)
xdata = np.arange(df_a.shape[0])
xlabels = df_a['DATE'].dt.strftime(mask).tolist()[::step]
# Create the plot.
fig, ax = plt.subplots(1, 1)
ax.plot(xdata, df_a['O'], label='Oil')
ax.plot(xdata, df_a['W'], label='Water')
ax.set_xticks(np.arange(df_a.shape[0], step=step))
ax.set_xticklabels(xlabels, rotation=45, horizontalalignment='right')
ax.set_title('Test in Naming Labels for the X-Axis')
ax.legend()
Output:

I'd recommend converting the X-axis into integers or floats (seconds, minutes, hours or days since a certain time, depending on the precision that you need). You can then use the usual methods to integrate, and the x-axis would no longer default to some other values.
See How to convert datetime to integer in python
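As a rough sketch of that suggestion, the dates can be turned into days elapsed since the first sample and the curves integrated with the trapezoidal rule. The rows below are the 'A' entries from the CSV in the question; the helper name trapezoid and the choice of days as the unit are assumptions.

```python
import numpy as np
import pandas as pd

# The 'A' rows from the question's CSV.
df_a = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["1/20/2000", "1/21/2000", "1/22/2000", "1/23/2000"], format="%m/%d/%Y"
    ),
    "O": [12, 34, 47, 19],
    "W": [50, 50, 35, 31],
})

# Convert the dates to a numeric axis: days elapsed since the first sample.
days = ((df_a["DATE"] - df_a["DATE"].iloc[0]) / pd.Timedelta(days=1)).to_numpy()

def trapezoid(y, x):
    """Trapezoidal rule: sum of the mean of adjacent y values times the x step."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))

oil_total = trapezoid(df_a["O"], days)    # units: O-value * days
water_total = trapezoid(df_a["W"], days)  # units: W-value * days
print(oil_total, water_total)
```

The numeric axis is used only for the integration; the original DATE column can still be used for plotting and tick labels.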

Seaborn pairplot and NaN values

I'm trying to understand why this fails, even though the documentation says:
dropna : boolean, optional
Drop missing values from the data before plotting.
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.__version__
# '0.7.dev'
# generate an example DataFrame
a = pd.DataFrame(data={
    'a': np.random.normal(size=(100,)),
    'b': np.random.lognormal(size=(100,)),
    'c': np.random.exponential(size=(100,))})
sns.pairplot(a) # this works as expected
# snip
b = a.copy()
b.iloc[5,2] = np.nan # replace one value in col 'c' by a NaN
sns.pairplot(b) # this fails with error
# "AttributeError: max must be larger than min in range parameter."
# raised in histogram(a, bins, range, normed, weights, density)
sns.pairplot(b, dropna=True) # same error as above

When you are using the data directly, i.e.
sns.pairplot(b) #Same as sns.pairplot(b, x_vars=['a','b','c'] , y_vars=['a','b','c'],dropna=True)
you are plotting against all the columns in the DataFrame, so make sure the number of rows is the same in all columns.
sns.pairplot(b, x_vars=['a','c'] , y_vars=['a','b','c'],dropna=True)
In this case it works fine, but there will be a minute difference in the graph because the NaN value is removed.
So, if you want to plot with the whole data, then:
either the null values must be replaced using fillna(),
or the whole row containing NaN values must be dropped:
b = b.drop(b.index[5])
sns.pairplot(b)

I'm going to post an answer to my own question, even though it doesn't exactly solve the problem in general, but at least it solves my problem.
The problem arises when trying to draw histograms. However, it looks like the kdes are much more robust to missing data. Therefore, this works, despite the NaN in the middle of the dataframe:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.__version__
# '0.7.dev'
# generate an example DataFrame
a = pd.DataFrame(data={
    'a': np.random.normal(size=(100,)),
    'b': np.random.lognormal(size=(100,)),
    'c': np.random.exponential(size=(100,))})
a.iloc[5,2] = np.nan # replace one value in col 'c' by a NaN
sns.pairplot(a, diag_kind='kde')

Something of a necro, but as I cracked the answer to this today I thought it might be worth sharing; I could not find this solution elsewhere on the web. If the Seaborn ignoreNa keyword has not worked for your data and you don't want to drop all rows that have any NaN, this should work for you.
All of this is in Seaborn 0.9 with pandas 0.23.4, assuming a data frame (df) with j rows (samples) that have n columns (attributes).
Seaborn cannot cope with arrays containing NaN being passed to it, which is a particular problem when you want to retain a row because it holds other useful data. The solution is to use a function that intercepts the pair-wise columns before they are passed to the PairGrid for plotting.
Functions can be passed to the grid sectors to carry out an operation per subplot. A simple example of this would be to calculate and annotate RMSE for a column pair (subplot) onto each plot:
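import math
import sklearn.metrics as skm  # assumption: skm refers to sklearn.metrics

def rmse(x, y, **kwargs):
    rmse_value = math.sqrt(skm.mean_squared_error(x, y))
    label = 'RMSE = ' + str(round(rmse_value, 2))
    ax = plt.gca()
    ax.annotate(label, xy = (0.1, 0.95), size = 20, xycoords = ax.transAxes)

# grid is a sns.PairGrid, created as in the blocks further down
grid = grid.map_upper(rmse)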
Therefore, by writing a function that Seaborn can take as a data plotting argument, and which drops NaNs on a column-pair basis as grid.map_ iterates over the main data frame, we can minimize data loss per sample (row). One NaN in a row will then not cause the entire row to be lost for all sub-plots; only the sub-plots for the column pairs involving that NaN will exclude the given row.
The following function carries out the pairwise NaN drop and draws the two resulting series on the current axes with matplotlib's scatter plot:
df = [YOUR DF HERE]
def col_nan_scatter(x,y, **kwargs):
    df = pd.DataFrame({'x':x[:],'y':y[:]})
    df = df.dropna()
    x = df['x']
    y = df['y']
    plt.gca()
    plt.scatter(x,y)
cols = df.columns
grid = sns.PairGrid(data= df, vars = cols, height = 4)
grid = grid.map_upper(col_nan_scatter)
The same can be done with seaborn plotting (with for example, just the x value):
def col_nan_kde_histo(x, **kwargs):
    df = pd.DataFrame({'x':x[:]})
    df = df.dropna()
    x = df['x']
    plt.gca()
    sns.kdeplot(x)
cols = df.columns
grid = sns.PairGrid(data= df, vars = cols, height = 4)
grid = grid.map_upper(col_nan_scatter)
# a function that takes only x belongs on the diagonal, so map_diag is used here
grid = grid.map_diag(col_nan_kde_histo)
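To see why dropping NaNs per column pair keeps more data than dropping whole rows, here is a small pandas-only sketch. The toy frame and the names rows_listwise and rows_pairwise are made up for illustration; no Seaborn call is involved.

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN in a different row of each column (illustrative values).
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [1.5, np.nan, 3.5, 4.5, 5.5],
    "c": [np.nan, 2.2, 3.2, 4.2, 5.2],
})

# Dropping every row that has any NaN keeps only the fully populated rows.
rows_listwise = len(df.dropna())

# Dropping NaNs separately for each column pair keeps every row that is
# complete for that particular pair.
cols = list(df.columns)
rows_pairwise = {}
for i, x in enumerate(cols):
    for y in cols[i + 1:]:
        rows_pairwise[(x, y)] = len(df[[x, y]].dropna())

print(rows_listwise)
print(rows_pairwise)
```

Each sub-plot therefore sees three of the five rows instead of two, which is the effect the pairwise functions above aim for.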
