I am plotting some time based data from pandas in matplotlib (can be tens of thousands of rows) and i would like to highlight periods where there are NaNs in the data. The way i though to accomplish this was to use axvspan to draw a red box(es) on the plot starting and stopping where there are data gaps. I did think about just drawing a vertical line each time there was a NaN using axvline, but this could create thousands of objects on the plot and cause the resultant PNG to take a long time to write. So the use of axvspan i think is more appropriate. However where I am stuck is finding the start and stop indices of the groups of NaNs.
The code below isn't from my actual code is just a basic mockup to show what i am trying to achieve.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
days = pd.date_range(datetime.now(), datetime.now() + timedelta(13), freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame({'idx': days, 'col': data})
df = df.set_index('idx')
print(df)
#Code to find the start index and stop index of the groups of NaNs
# resuls in list which contains lists of each gap start and stop datetime
gaps = []
plt.plot(df.index, df['col'])
for gap in gaps:
plt.axvspan(gap[0], gap[1], facecolor='r', alpha=0.5)
plt.show()
The result would look something like the mockup below:
Other suggestions for visualizing the gaps would also be appreciated. Such as a straight line in a different color connecting the data across the gap using some sort of fillna?
To find the start and stop indices of the groups of NaNs you can first create a variable to hold the boolean values where the col is NaN. With this variable you can find the rows where there's a transition between valid and NaN values. This can be done using the shift (to dislocate one row on the dataframe) and ne, this way you can compare two consecutive rows and determine where the values alternate. After that, apply cumsum to create distinct groups of contiguous data of valid and NaN values.
Now, using only the rows with NaN values (df[is_nan]) use groupby with n_groups to gather the gaps within the same group. Next, apply aggregate to return a single tuple with the start and end timestamps of each group. The use of DateOffset here is to extend the rectangle display to the adjacent points following the desired image output. You can now use ['col'].values to access the dataframe returned by aggregate and convert it into a list.
...
...
df = df.set_index('idx')
print(df)
# Code to find the start index and stop index of the groups of NaNs
is_nan = df['col'].isna()
n_groups = is_nan.ne(is_nan.shift()).cumsum()
gap_list = df[is_nan].groupby(n_groups).aggregate(
lambda x: (
x.index[0] + pd.DateOffset(days=-1),
x.index[-1] + pd.DateOffset(days=+1)
)
)["col"].values
# resuls in list which contains tuples of each gap start and stop datetime
gaps = gap_list
plt.plot(df.index, df['col'], marker='o' )
plt.xticks(df.index, rotation=45)
for gap in gaps:
plt.axvspan(gap[0], gap[1], facecolor='r', alpha=0.5)
plt.grid()
plt.show()
We can use fill_between to highlight areas. However, it is much easier to define the parts where data are than the ones where no data are without creating gaps to existing data points. So, we simply highlight the entire plotting area, then overwrite the areas where data are in white, then plot:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
days = pd.date_range(datetime.now(), datetime.now() + timedelta(13), freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame({'idx': days, 'col': data})
df = df.set_index('idx')
fig, ax = plt.subplots()
ax.fill_between(df.index, df.col.min(), df.col.max(), where=df.col, facecolor="lightblue", alpha=0.5)
ax.fill_between(df.index, df.col.min(), df.col.max(), where=np.isfinite(df.col), facecolor="white", alpha=1)
ax.plot(df.index, df.col)
ax.xaxis.set_tick_params(rotation=45)
plt.tight_layout()
plt.show()
Sample output:
You can loop through the enumerated list of boolean values given by df['col'].isna() and compare each boolean value to the previous one to select the timestamps for the starts and stops of the gaps. Here is an example based on your code sample and where the plot is generated with the pandas plotting function:
import numpy as np # v 1.19.2
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
days = pd.date_range('2021-03-08', periods=14, freq='D')
data = [2,2.3,3,np.nan, np.nan,4.7,3.4,3.1,2.7,np.nan,np.nan,np.nan,4,4.5]
df = pd.DataFrame(dict(col=data), index=days)
ax = df.plot(y='col', marker='.', figsize=(8,4))
# Generate lists of starts and stops timestamps for gaps in time series,
# assuming that the first and last data points are not NaNs
starts, stops = [], []
for idx, isna in enumerate(df['col'].isna()):
if isna != df['col'].isna()[idx-1] and isna:
starts.append(df.index[idx-1])
elif isna != df['col'].isna()[idx-1] and not isna:
stops.append(df.index[idx])
# Plot red vertical spans for gaps in time series
for start, stop in zip(starts, stops):
ax.axvspan(start, stop, facecolor='r', alpha=0.3)
plt.show()
In the end I took a little from column A, B and C from the provided answers, thanks for the feedback. Building the list of start stops was very slow for real world data (tens-hundreds of thousands of rows). Since i didn't need a numerical answer just a visual one i did it using matplotlib alone with the following code:
ax[i].fill_between(data.index, 0, (is_nan*data.max()), color='r', step='mid', linewidth='0')
ax[i].plot(data.index, data, color='b', linestyle='-', marker=',', label=ylabel)
The fill between creates my shaded blocks where the nans are. Multiplying them by the data.max() allows them to span the entire y axis. Step='mid' squares off the sides. Linewidth=0 hides the red line when data is 0 (not NaN).
Given the following example:
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
df.plot(linewidth=10)
The order of plotting puts the last column on top:
How can I make this keep the data & legend order but change the behaviour so that it plots X on top of Y on top of Z?
(I know I can change the data column order and edit the legend order but I am hoping for a simpler easier method leaving the data as is)
UPDATE: final solution used:
(Thanks to r-beginners) I used the get_lines to modify the z-order of each plot
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
fig = plt.figure()
ax = fig.add_subplot(111)
df.plot(ax=ax, linewidth=10)
lines = ax.get_lines()
for i, line in enumerate(lines, -len(lines)):
line.set_zorder(abs(i))
fig
In a notebook produces:
Get the default zorder and sort it in the desired order.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
np.random.seed(2021)
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
ax = df.plot(linewidth=10)
l = ax.get_children()
print(l)
l[0].set_zorder(3)
l[1].set_zorder(1)
l[2].set_zorder(2)
Before definition
After defining zorder
I will just put this answer here because it is a solution to the problem, but probably not the one you are looking for.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# generate data
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
# read columns in reverse order and plot them
# so normally, the legend will be inverted as well, but if we invert it again, you should get what you want
df[df.columns[::-1]].plot(linewidth=10, legend="reverse")
Note that in this example, you don't change the order of your data, you just read it differently, so I don't really know if that's what you want.
You can also make it easier on the eyes by creating a corresponding method.
def plot_dataframe(df: pd.DataFrame) -> None:
df[df.columns[::-1]].plot(linewidth=10, legend="reverse")
# then you just have to call this
df = pd.DataFrame(np.random.randint(1,10, size=(8,3)), columns=list('XYZ'))
plot_dataframe(df)
I want to create a figure with different violin plots on the same graph (but not on the same column).
My data are a list of dataframes and I want to create a violin plot of one column for each dataframe. (the names of the columns in the final figure I prefer to have as a name that is inside each dataframe in one other column).
I used this code:
for i in range(0,len(sta_list)):
plt.violinplot(sta_list[i]['diff_APS_1'])
I know that this is wrong, I want to split up the resulting plots in the figure.
You can specify the x-position of the violin plot for each column using positions argument
for i in range(0, len(sta_list)):
plt.violinplot(sta_list[i]['diff_APS_1'], positions=[i])
A sample answer for demonstration taking the dataset from this post
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.random.poisson(lam =3, size=100)
y = np.random.choice(["S{}".format(i+1) for i in range(4)], size=len(x))
df = pd.DataFrame({"Scenario":y, "LMP":x})
fig, ax = plt.subplots()
for i, key in enumerate(['S1', 'S2', 'S3', 'S4']):
ax.violinplot(df[df.Scenario == key]["LMP"].values, positions=[i])
I'm using python for the first time. I have a csv file with a few columns of data: location, height, density, day etc... I am plotting height (i_h100) v density (i_cd) and have managed to constrain the height to values below 50 with the code below. I now want to constrain the values on the y axis to be within a certain 'day' range say (85-260). I can't work out how to do this.
import pandas
import matplotlib.pyplot as plt
data=pandas.read_csv('data.csv')
data.plot(kind='scatter',x='i_h100',y='i_cd')
plt.xlim(right=50)
Use .loc to subset data going into graph.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Make some dummy data
np.random.seed(42)
df = pd.DataFrame({'a':np.random.randint(0,365,20),
'b':np.random.rand(20),
'c':np.random.rand(20)})
# all data: plot of 'b' vs. 'c'
df.plot(kind='scatter', x='b', y='c')
plt.show()
# use .loc to subset data displayed based on value in 'a'
# can also use .loc to restrict values of 'b' displayed rather than plt.xlim
df.loc[df['a'].between(85,260) & (df['b'] < 0.5)].plot(kind='scatter', x='b', y='c')
plt.show()
I have a large data set with over 10,000 rows with values between 0 and 400,000,000. I would like to plot those values vs. the mean of another column in matplotlib where the x axis increments by 50,000,000 but I am unsure how to do so. I can plot it using pandas but would really like to do it using matplotlib but unsure how. This is what I have in pandas:
mean_values = df.groupby(pd.cut(df['budget_adj'],np.arange(0,4000000000,50000000)))['vote_average'].mean()
mean_values.plot(kind='line',figsize=(12,5))
I think I figured out what your problem is
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# Create some data
df = pd.DataFrame({'budget_adj': np.random.uniform(0, 4000000000, 10000),
'vote_average': np.random.uniform(0, 100000, 10000)})
# Calculate the mean values
mean_values = df.groupby(pd.cut(df['budget_adj'],np.arange(0,4000000000,50000000)))['vote_average'].mean()
And this is what I suspect you do
# This wont work since mean_values.index is an interval
plt.plot(mean_values.index, mean_values)
This wont work since you index is a categorical interval. In order for plot to work your x-values have to be numbers. We can convert our intervals in many ways
# You can pick the left endpoint...
x_values = [i.left for i in mean_values.index]
# the right endpoint...
x_values = [i.right for i in mean_values.index]
# or the center value.
x_values = [i.mid for i in mean_values.index]
# And NOW you will get no error
plt.plot(x_values, mean_values)