change x-axis of a plot - python

I am trying to create a visualization of vehicles passing by in the first 25 weeks of the years 2015-2020 all in one graph (one curve for every year).
df_data_groups = df_data[(df_data['week']<=25)].groupby(['year','week'])
df_data_weekly = df_data_groups[['NO','nr_of_vehicles']].mean()
fig, ax = plt.subplots()
bp = df_data_weekly['nr_of_vehicles'].groupby('year').plot(ax=ax)
The following is what I get:
The x-axis is not right: it should contain only the weeks, not the year, but I don't know how to fix this. I also cannot create a legend showing which year each line color belongs to; the following does not work:
bp.set_legend()

The index shown is the index of the last dataframe in the group. That dataframe has a two-level index: the year and the week. Dropping the first level (the year) leaves only the week:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df_data = pd.DataFrame({'year': np.repeat(np.arange(2015, 2021), 52),
'week': np.tile(np.arange(1, 53), 6),
'nr_of_vehicles': 200_000 + np.random.randint(-9_000, 10_000, 52 * 6).cumsum()})
df_data_groups = df_data[(df_data['week'] <= 25)].groupby(['year', 'week'])
df_data_weekly = df_data_groups[['nr_of_vehicles']].mean()
fig, ax = plt.subplots()
for year, df in df_data_weekly['nr_of_vehicles'].groupby('year'):
    df.reset_index(level=0, drop=True).plot(ax=ax, label=year)
ax.legend()
ax.margins(x=0.02)
plt.show()
PS: In the question's code, bp is a pandas Series of axes, one entry per year, and all of them point to the same ax. To obtain the legend, call it on any one of them: bp[2015].legend() (or bp.iloc[0].legend()).
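An alternative to looping over the groups, offered here as a minimal sketch rather than as part of the answer above: unstack the year level so that each year becomes a column indexed by week. A single plot call then draws one line per year and builds the legend automatically. The random data and the name weekly_by_year are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Dummy data in the same shape as the question's (values are made up)
rng = np.random.default_rng(0)
df_data = pd.DataFrame({
    'year': np.repeat(np.arange(2015, 2021), 52),
    'week': np.tile(np.arange(1, 53), 6),
    'nr_of_vehicles': 200_000 + rng.integers(-9_000, 10_000, 52 * 6).cumsum(),
})

weekly = (df_data[df_data['week'] <= 25]
          .groupby(['year', 'week'])['nr_of_vehicles']
          .mean())

# Move the 'year' index level into the columns: rows are weeks, columns are years
weekly_by_year = weekly.unstack('year')

fig, ax = plt.subplots()
weekly_by_year.plot(ax=ax)  # one line per column; legend entries are the years
ax.set_xlabel('week')
ax.margins(x=0.02)
# call plt.show() to display the figure
```

Because the week is now the only index level, the x-axis shows weeks 1 to 25 without any year component.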

Related

Matrix of scatterplots by month-year

My data is in a dataframe of two columns: y and x. The data refers to the past few years. Dummy data is below:
np.random.seed(167)
rng = pd.date_range('2017-04-03', periods=365*3)
df = pd.DataFrame(
{"y": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365*3)]),
"x": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365*3)])
}, index=rng
)
In my first attempt, I plotted a scatterplot with Seaborn using the following code:
import seaborn as sns
import matplotlib.pyplot as plt
def plot_scatter(data, title, figsize):
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_title(title)
    sns.scatterplot(data=data,
                    x=data['x'],
                    y=data['y'])
plot_scatter(data=df, title='dummy title', figsize=(10,7))
However, I would like to generate a 4x3 matrix including 12 scatterplots, one for each month with year as hue. I thought I could create a third column in my dataframe that tells me the year and I tried the following:
import seaborn as sns
import matplotlib.pyplot as plt
def plot_scatter(data, title, figsize):
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_title(title)
    sns.scatterplot(data=data,
                    x=data['x'],
                    y=data['y'],
                    hue=data.iloc[:, 2])
df['year'] = df.index.year
plot_scatter(data=df, title='dummy title', figsize=(10,7))
While this allows me to see the years, it still shows all the data in the same scatterplot instead of creating multiple scatterplots, one for each month, so it's not offering the level of detail I need.
I could slice the data by month and build a for loop that plots one scatterplot per month but I actually want a matrix where all the scatterplots use similar axis scales. Does anyone know an efficient way to achieve that?
To create multiple subplots at once, seaborn introduces figure-level functions. The col= argument indicates which column of the dataframe should be used to identify the subplots. col_wrap= can be used to tell how many subplots go next to each other before starting an additional row.
Note that you shouldn't create a figure, as the function creates its own new figure. It uses the height= and aspect= arguments to tell the size of the individual subplots.
The code below uses a sns.relplot() on the months. An extra column for the months is created; it is made categorical to fix an order.
To remove the month= in the title, you can loop through the generated axes (a recent seaborn version is needed for axes_dict). With sns.set(font_scale=...) you can change the default sizes of all texts.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(167)
dates = pd.date_range('2017-04-03', periods=365 * 3, freq='D')
df = pd.DataFrame({"y": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365 * 3)]),
"x": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365 * 3)])
}, index=dates)
df['year'] = df.index.year
month_names = pd.date_range('2017-01-01', periods=12, freq='M').strftime('%B')
df['month'] = pd.Categorical.from_codes(df.index.month - 1, month_names)
sns.set(font_scale=1.7)
g = sns.relplot(kind='scatter', data=df, x='x', y='y', hue='year', col='month', col_wrap=4, height=4, aspect=1)
# optionally remove the `month=` in the title
for name, ax in g.axes_dict.items():
    ax.set_title(name)
plt.setp(g.axes, xlabel='', ylabel='') # remove all x and y labels
g.axes[-2].set_xlabel('x', loc='left') # set an x label at the left of the second to last subplot
g.axes[4].set_ylabel('y') # set a y label to 5th subplot
plt.subplots_adjust(left=0.06, bottom=0.06) # set some more spacing at the left and bottom
plt.show()
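The ordered month column is what keeps the subplots in calendar order even though the data starts in April. Here is a minimal sketch of that step on its own. It uses the standard library's calendar.month_name in place of the date_range call, an assumption made here to avoid the 'M' frequency alias, which newer pandas versions deprecate.

```python
import calendar

import pandas as pd

dates = pd.date_range('2017-04-03', periods=365 * 3, freq='D')

# calendar.month_name[0] is an empty string, so skip it
month_names = list(calendar.month_name)[1:]  # names follow the current locale

# Codes 0..11 map onto the names; the category order is the calendar order
months = pd.Categorical.from_codes(dates.month - 1, categories=month_names)

print(months.categories.tolist())
print(months[0])
```

The categories list starts with the first calendar month regardless of which month appears first in the data, and that category order is what the figure-level function follows when laying out the columns.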

Multiple boxplot in a single Graphic in Python

I'm a beginner in Python.
In my internship project I am trying to plot boxplots from data contained in a CSV file.
I need to plot boxplots for each of the 4 (four) variables shown above (AAG, DENS, SRG and RCG). Since each variable has values in columns numbered [001] to [100], there will be 100 boxplots for each variable, which need to be plotted in a single graph as shown in the image.
This is the graph I need to plot, but for each variable there will be 100 boxplots, as each one has 100 columns of values:
The x-axis is the "Year", which ranges from 2025 to 2030, so I need a graph like the one shown in figure 2 for each year, and the y-axis is the set of values for each variable.
Using Pandas-melt function and seaborn library I was able to plot only the boxplots of a column. But that's not what I need:
import pandas as pd
import seaborn as sns
df = pd.read_csv("2DBM_50x50_Central_Aug21_Sim.cliped.csv")
mdf= df.melt(id_vars=['Year'], value_vars='AAG[001]')
print(mdf)
ax=sns.boxplot(x='Year', y='value',width = 0.2, data=mdf)
Result of the code above:
What can I try to resolve this?
The following code gives you five subplots, where each subplot only contains the data of one variable. Then a boxplot is generated for each year. To change the range of columns used for each variable, change the upper limit in var_range = range(1, 101), and to see the outliers change showfliers to True.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df = pd.read_csv("2DBM_50x50_Central_Aug21_Sim.cliped.csv")
variables = ["AAG", "DENS", "SRG", "RCG", "Thick"]
period = range(2025, 2031)
var_range = range(1, 101)
fig, axes = plt.subplots(2, 3)
flattened_axes = fig.axes
flattened_axes[-1].set_visible(False)
for i, var in enumerate(variables):
    var_columns = [f"TB_acc_{var}[{j:05}]" for j in var_range]
    data = df.melt(id_vars=["Period"], value_vars=var_columns, value_name=var)
    ax = flattened_axes[i]
    sns.boxplot(x="Period", y=var, width=0.2, data=data, ax=ax, showfliers=False)
plt.tight_layout()
plt.show()
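Since the CSV file is not available here, the reshaping step can be tried on made-up data. The sketch below builds a small wide table with hypothetical column names in the question's pattern (AAG[001], AAG[002], ...) and melts it into the long form that sns.boxplot expects, with one row per value. The column pattern, sizes, and values are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
years = np.repeat(np.arange(2025, 2031), 4)  # 4 rows per year, made up

# Hypothetical wide table: one column per set of values of the variable
n_columns = 3  # the real file would have 100
wide = pd.DataFrame({'Year': years})
for j in range(1, n_columns + 1):
    wide[f'AAG[{j:03}]'] = rng.normal(loc=10, scale=2, size=len(wide))

value_columns = [c for c in wide.columns if c.startswith('AAG[')]

# Long form: every value gets its own row, tagged with its year and source column
long = wide.melt(id_vars=['Year'], value_vars=value_columns,
                 var_name='column', value_name='AAG')

print(long.head())
```

With this frame, sns.boxplot(x='Year', y='AAG', data=long) pools all columns into one box per year, while adding hue='column' draws one box per column within each year.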

Seaborn plot showing two labels at the start and end of a month [duplicate]

I am trying to create a heat map from a pandas dataframe using the seaborn library. Here is the code:
test_df = pd.DataFrame(np.random.randn(367, 5),
index = pd.DatetimeIndex(start='01-01-2000', end='01-01-2001', freq='1D'))
ax = sns.heatmap(test_df.T)
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_minor_locator(mdates.DayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
ax.xaxis.set_minor_formatter(mdates.DateFormatter('%d'))
However, I am getting a figure with nothing printed on the x-axis.
Seaborn heatmap is a categorical plot. It scales from 0 to number of columns - 1, in this case from 0 to 366. The datetime locators and formatters expect values that are dates (or, more precisely, numbers that correspond to dates). For the year in question those would be numbers between 730120 (= 01-01-2000) and 730486 (= 01-01-2001) with the date epoch matplotlib used before version 3.3; with the default epoch since 3.3 (1970-01-01) they are 10957 and 11323.
So in order to use matplotlib.dates formatters and locators, you need to convert your dataframe index to datetime objects first. You then cannot use a heatmap; use a plot that allows numerical axes instead, e.g. an imshow plot. You can then set the extent of that imshow plot to correspond to the date range you want to show.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# pd.DatetimeIndex(start=..., end=...) no longer works in pandas >= 1.0; use pd.date_range
df = pd.DataFrame(np.random.randn(367, 5),
                  index=pd.date_range(start='01-01-2000', end='01-01-2001', freq='1D'))
dates = df.index.to_pydatetime()
dnum = mdates.date2num(dates)
start = dnum[0] - (dnum[1]-dnum[0])/2.
stop = dnum[-1] + (dnum[1]-dnum[0])/2.
extent = [start, stop, -0.5, len(df.columns)-0.5]
fig, ax = plt.subplots()
im = ax.imshow(df.T.values, extent=extent, aspect="auto")
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_minor_locator(mdates.DayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
fig.colorbar(im)
plt.show()
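To see the numbers the date locators work with, the sketch below converts the index with mdates.date2num and builds the same extent as above. The absolute values depend on matplotlib's date epoch, so only the differences are relied on here. The index is built with pd.date_range, which is an assumption about what the original DatetimeIndex call was meant to produce.

```python
import matplotlib.dates as mdates
import pandas as pd

index = pd.date_range(start='2000-01-01', end='2001-01-01', freq='D')

# Days since matplotlib's epoch, as floats; consecutive days differ by 1
dnum = mdates.date2num(index.to_pydatetime())
step = dnum[1] - dnum[0]

n_rows = 5  # number of dataframe columns, shown as image rows
# Pad by half a step so each image column is centred on its date
extent = [dnum[0] - step / 2, dnum[-1] + step / 2, -0.5, n_rows - 0.5]

print(len(index), step, extent[1] - extent[0])
```

The width of the extent equals the number of days in the index, so every image column covers exactly one day on the date axis.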
I found this question when trying to do a similar thing and you can hack together a solution but it's not very pretty.
For example I get the current labels, loop over them to find the ones for January and set those to just the year, setting the rest to be blank.
This gives me year labels in the correct position.
xticklabels = ax.get_xticklabels()
for label in xticklabels:
    text = label.get_text()
    if text[5:7] == '01':
        label.set_text(text[0:4])
    else:
        label.set_text('')
ax.set_xticklabels(xticklabels)
Hopefully from that you can figure out what you want to do.

Matplotlib and Pandas treatment of timeseries without weekends

I am running into some issues adding Matplotlib lines into Pandas plot. I am trying to plot a straight line using the slope to determine what the start and end-points are. But the resultant graph does not look like a straight line at all.
I have simplified the case to the MVCE below. The initial part is for setup to replicate the key feature of the complicated dataframe I have.
import pandas as pd
import matplotlib.pyplot as plt
LEN_SER = 23
dates = pd.date_range('2015-07-03', periods=LEN_SER, freq='B')
df = pd.DataFrame(range(1,LEN_SER+1), index=dates)
ts = df.iloc[:,0]
# The above is the setup of the MVCE to replicate the issue.
fig = plt.figure()
ax1 = plt.subplot2grid((1, 1), (0, 0))
ax1.plot([ts.index[5], ts.index[20]],
[ts[5], ts[5] + (1.0 * (20 - 5))], 'o-')
ts.plot(ax=ax1)
plt.show()
This gives a graph that has a wavy line due to the weekends. The Matplotlib is affecting how Pandas is plotting the series. If I take out the ax1.plot() line, then it becomes a straight line.
So the question is: How do I draw straight lines on my Pandas plot with Matplotlib? Put it another way, I want the plot to treat the axis labels as categories so weekends will be ignored. That way, I am hoping that Matplotlib and Pandas will both give a straight line.
As you correctly observe, if you delete the line ax1.plot(), then matplotlib treats your dates as categories, and the pandas plot is a nice straight line. However, in the command
ax1.plot([ts.index[5], ts.index[20]],
[ts[5], ts[5] + (1.0 * (20 - 5))], 'o-')
you ask matplotlib to interpolate between two points, and in the process matplotlib recognizes the dates on the x-axis. That is why the straight-line pandas plot with respect to date categories (5 a week) becomes a wavy line with respect to dates (7 a week). This is correct as well, because with respect to dates your data simply isn't represented by a straight line.
You can force the category interpretation replacing dates by strings through
df.index = df.reset_index().apply(lambda x: x['index'].strftime('%Y-%m-%d'), axis=1)
before defining ts. That results in the plot
Now the matplotlib plot is just two categories against two values and matplotlib does not bother to realize that the two categories are among the categories in the pandas plot. (Changing the order of the two plots saves your x-axis at least.) Modifying the matplotlib plot to
ax1.plot([5, 20], [ts.iloc[5], ts.iloc[5] + (1.0 * (20 - 5))], 'o-')
plots a line between categories 5 and 20, and finally gives you two straight lines with respect to a categories x-axis.
Full code:
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8') # (optional - style was set when I produced my graph; it was named 'seaborn' before matplotlib 3.6)
LEN_SER = 23
dates = pd.date_range('2015-07-03', periods=LEN_SER, freq='B')
df = pd.DataFrame(range(1,LEN_SER+1), index=dates)
df.index = df.reset_index().apply(lambda x: \
    x['index'].strftime('%Y-%m-%d'), axis=1) # dates -> categories (string)
ts = df.iloc[:,0]
# The above is the setup of the MVCE to replicate the issue.
fig = plt.figure()
ax1 = plt.subplot2grid((1, 1), (0, 0))
ax1.plot([5, 20], [ts.iloc[5], ts.iloc[5] + (1.0 * (20 - 5))], 'o-') # .iloc: explicit positional access
# x coordinates 'categories' 5 and 20
ts.plot(ax=ax1)
plt.show()
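A related way to get an axis without weekend gaps, sketched here under the assumption that integer positions are acceptable as x coordinates: plot everything against 0, 1, 2, ... and let a FuncFormatter translate positions back into dates for the tick labels. The helper name format_position is hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.ticker import FuncFormatter

LEN_SER = 23
dates = pd.date_range('2015-07-03', periods=LEN_SER, freq='B')
ts = pd.Series(range(1, LEN_SER + 1), index=dates)

positions = np.arange(len(ts))  # one evenly spaced slot per business day

def format_position(x, pos=None):
    """Tick label for position x: the date stored at that slot, or '' if none."""
    i = int(round(x))
    if 0 <= i < len(dates) and abs(x - i) < 1e-9:
        return dates[i].strftime('%Y-%m-%d')
    return ''

fig, ax = plt.subplots()
ax.plot(positions, ts.values)
# Straight line from slot 5 to slot 20 with slope 1 per business day
ax.plot([5, 20], [ts.iloc[5], ts.iloc[5] + 1.0 * (20 - 5)], 'o-')
ax.xaxis.set_major_formatter(FuncFormatter(format_position))
fig.autofmt_xdate()
# call plt.show() to display the figure
```

Both lines share the same integer x coordinates, so neither one is affected by how the other interprets dates.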
You already answered the question: " probably due to the weekends"
replace:
dates = pd.date_range('2015-07-03', periods=LEN_SER, freq='B')
with
dates = pd.date_range('2015-07-03', periods=LEN_SER, freq='D')
B - business day frequency
D - calendar day frequency
And your lines are straightened.
You're right - it is due to weekends. You can tell by the slope - five consecutive days have a sharper incline (+1 each day), than the three consecutive days (+1 total). So, what exactly do you want to plot? If you want to literally plot the blue line, you can interpolate the points between your two points like this:
...
# ts.plot(ax=ax1)
ts.iloc[[5,20]].resample('1D').interpolate().plot(ax=ax1)  # linear interpolation by default
plt.show()
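To make the snippet above concrete, this sketch keeps only the two endpoints, upsamples them to calendar days, and fills the gap linearly. The result is a straight line on a date axis because it advances by the same amount every calendar day, weekends included. It reuses the question's setup, with the values stored as floats for the interpolation.

```python
import numpy as np
import pandas as pd

LEN_SER = 23
dates = pd.date_range('2015-07-03', periods=LEN_SER, freq='B')
ts = pd.Series(range(1, LEN_SER + 1), index=dates, dtype=float)

# Two endpoints -> one row per calendar day (NaN in between) -> linear fill
line = ts.iloc[[5, 20]].resample('D').interpolate()

print(line.head())
print(np.diff(line.values)[:3])
```

The constant daily increment is smaller than 1 because the 15 units between the endpoints are spread over calendar days rather than business days.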
For simplicity I started from 2015-07-04. Does it work for you?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
LEN_SER = 21
dates = pd.date_range('2015-07-04', periods=LEN_SER, freq='B')
the_axes = []
# take the_axes like monday and friday for each week
for monday, friday in zip(dates[dates.weekday==0], dates[dates.weekday==4]):
    the_axes.append([monday.date(), friday.date()])
x = dates
y = range(1,LEN_SER+1)
n_Axes = len(the_axes)
fig,(axes) = plt.subplots(1, n_Axes, sharey=True, figsize=(15,8))
for i in range(n_Axes):
    ax = axes[i]
    ax.plot(x, y)
    ax.set_xlim(the_axes[i])
fig.autofmt_xdate()
print(dates)
plt.show()

how to highlight weekends for time series line plot in python

I am trying to do analysis on a bike share dataset. Part of the analysis includes showing the weekends' demand in date wise plot.
My pandas dataframe looks like this (last 5 rows):
Here is my code for date vs total ride plot.
import seaborn as sns
sns.set_style("darkgrid")
plt.plot(d17_day_count)
plt.show()
I want to highlight weekends in the plot, so that it looks similar to this plot.
I am using Python with matplotlib and seaborn library.
You can easily highlight areas by using axvspan. To find the areas to highlight, run through the index of your dataframe and search for the weekend days. I've also added an example for highlighting 'occupied hours' during a working week (hopefully that doesn't confuse things).
I've created dummy data for a dataframe based on days and another one for hours.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# dummy data (Days)
dates_d = pd.date_range('2017-01-01', '2017-02-01', freq='D')
df = pd.DataFrame(np.random.randint(1, 20, (dates_d.shape[0], 1)))
df.index = dates_d
# dummy data (Hours)
dates_h = pd.date_range('2017-01-01', '2017-02-01', freq='H')
df_h = pd.DataFrame(np.random.randint(1, 20, (dates_h.shape[0], 1)))
df_h.index = dates_h
#two graphs
fig, axes = plt.subplots(nrows=2, ncols=1, sharex=True)
#plot lines
dfs = [df, df_h]
for i, df in enumerate(dfs):
    for v in df.columns.tolist():
        axes[i].plot(df[v], label=v, color='black', alpha=.5)
def find_weekend_indices(datetime_array):
    indices = []
    for i in range(len(datetime_array)):
        if datetime_array[i].weekday() >= 5:
            indices.append(i)
    return indices
def find_occupied_hours(datetime_array):
    indices = []
    for i in range(len(datetime_array)):
        if datetime_array[i].weekday() < 5:
            if datetime_array[i].hour >= 7 and datetime_array[i].hour <= 19:
                indices.append(i)
    return indices
def highlight_datetimes(indices, ax):
    # note: `df` is the global name, which refers to df_h after the loop above
    i = 0
    while i < len(indices)-1:
        ax.axvspan(df.index[indices[i]], df.index[indices[i] + 1], facecolor='green', edgecolor='none', alpha=.5)
        i += 1
#find to be highlighted areas, see functions
weekend_indices = find_weekend_indices(df.index)
occupied_indices = find_occupied_hours(df_h.index)
#highlight areas
highlight_datetimes(weekend_indices, axes[0])
highlight_datetimes(occupied_indices, axes[1])
#formatting..
axes[0].xaxis.grid(True, which='major', color='black', linestyle='--', alpha=1) #add xaxis gridlines (the b= keyword was removed in newer matplotlib)
axes[1].xaxis.grid(True, which='major', color='black', linestyle='--', alpha=1) #add xaxis gridlines
axes[0].set_xlim(min(dates_d), max(dates_d))
axes[0].set_title('Weekend days', fontsize=10)
axes[1].set_title('Occupied hours', fontsize=10)
plt.show()
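The weekend lookup above can also be written without an explicit loop. A minimal sketch, assuming the input is a pandas DatetimeIndex (the function name weekend_positions is hypothetical): DatetimeIndex.dayofweek numbers Monday as 0, so Saturday and Sunday are 5 and 6.

```python
import numpy as np
import pandas as pd

def weekend_positions(index):
    """Integer positions of the entries of a DatetimeIndex that fall on Sat/Sun."""
    return np.flatnonzero(index.dayofweek >= 5)

dates_d = pd.date_range('2017-01-01', '2017-02-01', freq='D')
print(weekend_positions(dates_d))
```

The returned positions can be passed to highlight_datetimes in place of the list built by find_weekend_indices.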
I tried using the code in the accepted answer but the way the indices are used, the last weekend in the time series does not get highlighted entirely, despite what the image currently shown suggests (this is noticeable mainly with a frequency of 6 hours or more). Also, it does not work if the frequency of the data is higher than daily. This is why I share here a solution that uses the x-axis units so that weekends (or any other recurring time period) can be highlighted without any problem related to the index.
This solution takes only 6 lines of code and it works with any frequency. In the example below, it highlights full weekend days which makes it more efficient than the accepted answer where small frequencies (e.g. 30 minutes) will produce many polygons to cover the whole weekend.
The x-axis limits are used to compute the range of time covered by the plot in terms of days, which is the unit used for matplotlib dates. Then a weekends mask is computed and passed to the where argument of the fill_between plotting function. The masks are processed as right-exclusive so in this case, they must contain Mondays for the highlights to be drawn up to Mondays 00:00. Because plotting these highlights can alter the x-axis limits when weekends occur near the limits, the x-axis limits are set back to the original values after plotting.
Note that contrary to axvspan, the fill_between function needs the y1 and y2 arguments. For some reason, using the default y-axis limits leaves a small gap between the plot frame and the tops and bottoms of the weekend highlights. This issue is solved by running ax.set_ylim(*ax.get_ylim()) just after creating the plot.
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
import matplotlib.dates as mdates
# Create sample dataset
rng = np.random.default_rng(seed=1234) # random number generator
dti = pd.date_range('2017-01-01', '2017-05-15', freq='D')
counts = 5000 + np.cumsum(rng.integers(-1000, 1000, size=dti.size))
df = pd.DataFrame(dict(Counts=counts), index=dti)
# Draw pandas plot: x_compat=True converts the pandas x-axis units to matplotlib
# date units (not strictly necessary when using a daily frequency like here)
ax = df.plot(x_compat=True, figsize=(10, 5), legend=None, ylabel='Counts')
ax.set_ylim(*ax.get_ylim()) # reset y limits to display highlights without gaps
# Highlight weekends based on the x-axis units
xmin, xmax = ax.get_xlim()
days = np.arange(np.floor(xmin), np.ceil(xmax)+2)
weekends = [(dt.weekday()>=5)|(dt.weekday()==0) for dt in mdates.num2date(days)]
ax.fill_between(days, *ax.get_ylim(), where=weekends, facecolor='k', alpha=.1)
ax.set_xlim(xmin, xmax) # set limits back to default values
# Create appropriate ticks using matplotlib date tick locators and formatters
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_minor_locator(mdates.MonthLocator(bymonthday=np.arange(5, 31, step=7)))
ax.xaxis.set_major_formatter(mdates.DateFormatter('\n%b'))
ax.xaxis.set_minor_formatter(mdates.DateFormatter('%d'))
# Additional formatting
ax.figure.autofmt_xdate(rotation=0, ha='center')
title = 'Daily count of trips with weekends highlighted from SAT 00:00 to MON 00:00'
ax.set_title(title, pad=20, fontsize=14);
As you can see, the weekends are always highlighted to the full extent, regardless of where the data starts and ends.
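The mask logic can be checked in isolation. The sketch below builds day numbers for eight consecutive days starting on Sunday 2017-01-01 and computes the same mask, with Monday included so that the fill runs up to Monday 00:00. The variable names are chosen for illustration.

```python
import datetime

import matplotlib.dates as mdates
import numpy as np

# Eight consecutive days as matplotlib date numbers: Sun, Mon, ..., Sat, Sun
start = mdates.date2num(datetime.datetime(2017, 1, 1))
days = start + np.arange(8)

# weekday(): Monday is 0, Saturday is 5, Sunday is 6
weekends = [(dt.weekday() >= 5) or (dt.weekday() == 0)
            for dt in mdates.num2date(days)]

for dt, flag in zip(mdates.num2date(days), weekends):
    print(dt.strftime('%Y-%m-%d'), dt.weekday(), flag)
```

Saturday, Sunday, and Monday are flagged, so fill_between shades the interval from Saturday 00:00 to Monday 00:00 and nothing beyond it.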
I have another suggestion to make in this regard, which takes inspiration from previous posts by other contributors. The code is as follows:
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
rng = np.random.default_rng(seed=42) # random number generator
dti = pd.date_range('2021-08-01', '2021-08-31', freq='D')
counts = 5000 + np.cumsum(rng.integers(-1000, 1000, size=dti.size))
df = pd.DataFrame(dict(Counts=counts), index=dti)
weekends = [d for d in df.index if d.isoweekday() in [6,7]]
weekend_list = []
for weekendday in weekends:
    d1 = weekendday
    d2 = weekendday + datetime.timedelta(days=1)
    weekend_list.append((d1, d2))
weekend_df = pd.DataFrame(weekend_list)
sns.set()
plt.figure(figsize=(15, 10), dpi=100)
df.plot(ax=plt.gca())  # draw on the figure created above; df.plot() alone opens a new one
plt.legend(bbox_to_anchor=(1.02, 0), loc="lower left", borderaxespad=0)
plt.ylabel("Counts")
plt.xlabel("Date of visit")
plt.xticks(rotation = 0)
plt.title("Daily counts of shop visits with weekends highlighted in green")
ax = plt.gca()
for d in weekend_df.index:
    print(weekend_df[0][d], weekend_df[1][d])
    ax.axvspan(weekend_df[0][d], weekend_df[1][d], facecolor="g", edgecolor="none", alpha=0.5)
ax.relim()
ax.autoscale_view()
plt.savefig("junk.png", dpi=100, bbox_inches='tight', pad_inches=0.2)
The result would be something like the following diagram:
