Seaborn heatmap change date frequency of yticks - python

My problem is similar to the one encountered on this topic: Change heatmap's yticks for multi-index dataframe
I would like to have yticks every 6 months, with them being the index of my dataframe. But I can't manage to make it work.
The issue is that my dataframe is 13500*290 and the answer given in the link takes a long time and doesn't really work (see image below).
This is an example of my code without the solution from the link, this part works fine for me:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
df = pd.DataFrame(index = pd.date_range(datetime(1984, 6, 10), datetime(2021, 1, 14), freq='1D') )
for i in range(0,290):
    df['Pt{0}'.format(i)] = np.random.random(size=len(df))
f, ax = plt.subplots(figsize=(20,20))
sns.heatmap(df, cmap='PuOr', vmin = np.min(np.min(df)), vmax = np.max(np.max(df)), cbar_kws={"label": "Ice Velocity (m/yr)"})
This part does not work for me and produces the figure below, which shouldn't have the stack of ylabels on the yaxis:
f, ax = plt.subplots(figsize=(20,20))
years = df.index.get_level_values(0)
ytickvalues = [year if index in (2, 7, 12) else '' for index, year in enumerate(years)]
sns.heatmap(df, cmap='PuOr', vmin = np.min(np.min(df)), vmax = np.max(np.max(df)), cbar_kws={"label": "Ice Velocity (m/yr)"}, yticklabels = ytickvalues)

Here are a couple ways to adapt that link for your use case (1 label per 6 months):
Either: Show an empty string except on Jan 1 and Jul 1 (i.e., when %m%d evals to 0101 or 0701)
labels = [date if date.strftime('%m%d') in ['0101', '0701'] else ''
          for date in df.index.date]
Or: Show an empty string except every ~365/2 days (i.e., when row % 183 == 0)
labels = [date if row % 183 == 0 else ''
          for row, date in enumerate(df.index.date)]
Note that you don't have a MultiIndex, so you can just use df.index.date (no need for get_level_values).
Here is the output with a minimized version of your df:
sns.heatmap(df, cmap='PuOr', cbar_kws={'label': 'Ice Velocity (m/yr)'},
            vmin=df.values.min(), vmax=df.values.max(),
            yticklabels=labels)
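For a frame with roughly 13,500 rows, passing one label per row (most of them empty) can be slow. A possible alternative, sketched below under stated assumptions, is to compute only the row positions that fall on Jan 1 or Jul 1 and place ticks there. The sketch uses Matplotlib's pcolormesh instead of seaborn so that it runs on its own, and it uses 5 columns instead of 290. Cell i is centred at i + 0.5 in pcolormesh, and the same offsets should apply to a seaborn heatmap drawn with yticklabels=False, but that part is an assumption to check on your side.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same date range as in the question, but only 5 columns to keep it small
idx = pd.date_range("1984-06-10", "2021-01-14", freq="D")
df = pd.DataFrame(np.random.random((len(idx), 5)), index=idx)

# Boolean mask of the rows that fall on Jan 1 or Jul 1
is_tick = (idx.day == 1) & idx.month.isin([1, 7])
positions = np.flatnonzero(is_tick)                # integer row numbers
labels = list(idx[is_tick].strftime("%Y-%m-%d"))   # one label per tick

fig, ax = plt.subplots(figsize=(6, 10))
ax.pcolormesh(df.values, cmap="PuOr")
ax.invert_yaxis()                  # first date at the top, as in a heatmap
ax.set_yticks(positions + 0.5)     # + 0.5 centres the tick on the row
ax.set_yticklabels(labels)
```

Only 74 ticks are created here, instead of one label object per row.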

Related

Clustered x-axis with the dates not showing clearly

I'm trying to plot a time series with monthly dates from 1959 to 2019. When I plot it, the x-axis is cluttered and the dates do not display properly. How can I drop the months and show only the years on the x-axis, so that it is less cluttered and the years are readable?
fig,ax = plt.subplots(2,1)
ax[0].hist(pca_function(sd_Data))
ax[0].set_ylabel ('Frequency')
ax[1].plot(pca_function(sd_Data))
ax[1].set_xlabel ('Years')
fig.suptitle('Histogram and Time series of Plot Factor')
plt.tight_layout()
# fig.savefig('factor1959.pdf')
pca_function(sd_Data)
comp_0
sasdate
1959-01 -0.418150
1959-02 1.341654
1959-03 1.684372
1959-04 1.981473
1959-05 1.242232
...
2019-08 -0.075270
2019-09 -0.402110
2019-10 -0.609002
2019-11 0.320586
2019-12 -0.303515
[732 rows x 1 columns]
From what I see, you do have years on your second subplot; they just overlap because too many of them are placed horizontally. Try increasing figsize and rotating the ticks:
# Builds an example dataframe.
df = pd.DataFrame(columns=['Years', 'Frequency'])
df['Years'] = pd.date_range(start='1/1/1959', end='1/1/2023', freq='M')
df['Frequency'] = np.random.normal(0, 1, size=(df.shape[0]))
fig, ax = plt.subplots(2,1, figsize=(20, 5))
ax[0].hist(df.Frequency)
ax[0].set_ylabel ('Frequency')
ax[1].plot(df.Years, df.Frequency)
ax[1].set_xlabel('Years')
for tick in ax[0].get_xticklabels():
    tick.set_rotation(45)
    tick.set_ha('right')
for tick in ax[1].get_xticklabels():
    tick.set_rotation(45)
    tick.set_ha('right')
fig.suptitle('Histogram and Time series of Plot Factor')
plt.tight_layout()
p.s. if the x-labels still overlap, try to increase your step size.
First off, store the result of the call to pca_function in a variable, e.g. result_pca_func. That way, the calculations (and possibly side effects or different randomization) are only done once.
Second, the dates should be converted to a datetime format, for example using pd.to_datetime(). That way, matplotlib can automatically place year ticks as appropriate.
Here is an example, starting from a dummy test dataframe:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': [f'{y}-{m:02d}' for y in range(1959, 2019) for m in range(1, 13)]})
df['Values'] = np.random.randn(len(df)).cumsum()
df = df.set_index('Date')
result_pca_func = df
result_pca_func.index = pd.to_datetime(result_pca_func.index)
fig, ax2 = plt.subplots(figsize=(10, 3))
ax2.plot(result_pca_func)
plt.tight_layout()
plt.show()
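If the automatic ticks are still not what you want, the tick spacing can also be set explicitly. The following is a minimal sketch, assuming the index holds 'YYYY-MM' strings as in the printed output of the question; the frame named result is a hypothetical stand-in for the real PCA output. It places a tick every 5 years with YearLocator and prints only the year with DateFormatter.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the PCA output: 'YYYY-MM' strings as the index
dates = [f"{y}-{m:02d}" for y in range(1959, 2020) for m in range(1, 13)]
result = pd.DataFrame({"comp_0": np.random.randn(len(dates)).cumsum()}, index=dates)

# Convert the strings to real timestamps so Matplotlib treats the axis as dates
result.index = pd.to_datetime(result.index, format="%Y-%m")

fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(result.index, result["comp_0"])
ax.xaxis.set_major_locator(mdates.YearLocator(base=5))    # one tick every 5 years
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))  # show the year only
ax.set_xlabel("Years")
fig.tight_layout()
```

Changing base is the "step size" knob: a larger value gives fewer, more widely spaced year labels.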

Cannot prepare proper labels in Matplotlib

I have very simple code:
from matplotlib import dates
import matplotlib.ticker as ticker
my_plot=df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90)
I've got:
but I would like to have fewer labels on the X axis. To do this I've added:
my_plot.xaxis.set_major_locator(ticker.MaxNLocator(12))
It generates fewer labels, but the labels show the wrong values (the first few labels from the whole list).
What am I doing wrong?
I have added additional information:
I forgot to show what is inside the DataFrame.
I have three columns:
reg_Date - datetime64 (index)
temperature - float64
Day - date converted from reg_Date to string, it looks like '2017-10' (YYYY-MM)
The box plot groups the data by 'Day', and I would like to show the 'Day' values as labels, but not all of them: for example, every third one.
You were almost there. Just set ticker.MultipleLocator.
The pandas.DataFrame.boxplot also returns axes, which is an object of class matplotlib.axes.Axes. So you can use this code snippet to customize your labels:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
center = np.random.randint(50,size=(10, 20))
spread = np.random.rand(10, 20) * 30
flier_high = np.random.rand(10, 20) * 30 + 30
flier_low = np.random.rand(10, 20) * -30
y = np.concatenate((spread, center, flier_high, flier_low))
fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(y)
x = ['Label '+str(i) for i in range(20)]
ax.set_xticklabels(x)
ax.set_xlabel('Day')
# Set a tick on each integer multiple of a base within the view interval.
ax.xaxis.set_major_locator(ticker.MultipleLocator(5))
plt.xticks(rotation=90)
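One caveat when only the locator is changed: a fixed list of tick labels is assigned to the ticks in order, so the i-th tick gets the i-th label regardless of where that tick sits. That would explain the wrong labels described in the question. A possible way around it, sketched here with made-up data and hypothetical label names, is to look the label up from the tick position with a FuncFormatter. It assumes the boxes are drawn at x = 1..N, which is the default for Matplotlib's boxplot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 20))               # 20 boxes, drawn at x = 1..20
names = [f"Label {i}" for i in range(1, 21)]   # hypothetical names, one per box

def label_for(pos, _tick_index=None):
    """Return the name of the box at x == pos, or '' if no box sits there."""
    i = int(round(pos))
    if abs(pos - i) < 1e-9 and 1 <= i <= len(names):
        return names[i - 1]
    return ""

fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(data)
ax.xaxis.set_major_locator(ticker.MultipleLocator(5))          # ticks at 5, 10, ...
ax.xaxis.set_major_formatter(ticker.FuncFormatter(label_for))  # label by position
ax.tick_params(axis="x", labelrotation=90)
```

Because the label depends on the position and not on the tick's rank, the tick at x = 5 shows the fifth box's name whichever locator is used.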
I think there is a compatibility issue with Pandas plots and Matplotlib formatters.
With the following code:
df = pd.read_csv('lt_stream-1001-full.csv', header=0, encoding='utf8')
df['reg_date'] = pd.to_datetime(df['reg_date'] , format='%Y-%m-%d %H:%M:%S')
df.set_index('reg_date', inplace=True)
df_h = df.resample(rule='H').mean()
df_h['Day']=df_h.index.strftime('%Y-%m')
print(df_h)
f, ax = plt.subplots()
my_plot = df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90, ax=ax)
locs, labels = plt.xticks()
i = 0
new_labels = list()
for l in labels:
    if i % 3 == 0:
        label = labels[i]
        i += 1
        new_labels.append(label)
    else:
        label = ''
        i += 1
        new_labels.append(label)
ax.set_xticklabels(new_labels)
plt.show()
You get this chart:
But I notice that this is grouped by month instead of by day. It may not be what you wanted.
Adding the day component to the string 'Day' messes up the chart as there seems to be too many boxes.
df = pd.read_csv('lt_stream-1001-full.csv', header=0, encoding='utf8')
df['reg_date'] = pd.to_datetime(df['reg_date'] , format='%Y-%m-%d %H:%M:%S')
df.set_index('reg_date', inplace=True)
df_h = df.resample(rule='H').mean()
df_h['Day']=df_h.index.strftime('%Y-%m-%d')
print(df_h)
f, ax = plt.subplots()
my_plot = df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90, ax=ax)
locs, labels = plt.xticks()
i = 0
new_labels = list()
for l in labels:
    if i % 15 == 0:
        label = labels[i]
        i += 1
        new_labels.append(label)
    else:
        label = ''
        i += 1
        new_labels.append(label)
ax.set_xticklabels(new_labels)
plt.show()
The for loop keeps one tick label every n periods and blanks out the rest. In the first chart the labels were set every 3 months; in the second one, every 15 days.
If you would like to see fewer grid lines:
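The label-thinning loop can be written more compactly. The helper below is a sketch, and thin_labels is a hypothetical name, not a Matplotlib function. It keeps every step-th entry of a list and replaces the others with empty strings.

```python
def thin_labels(labels, step):
    """Keep every `step`-th label and replace the others with empty strings."""
    return [label if i % step == 0 else "" for i, label in enumerate(labels)]

# Example with twelve monthly labels, keeping one label per quarter
months = [f"2017-{m:02d}" for m in range(1, 13)]
print(thin_labels(months, 3))
```

With the labels returned by plt.xticks(), the text can be extracted first, e.g. thin_labels([t.get_text() for t in labels], 3), and the result passed to ax.set_xticklabels.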
df = pd.read_csv('lt_stream-1001-full.csv', header=0, encoding='utf8')
df['reg_date'] = pd.to_datetime(df['reg_date'] , format='%Y-%m-%d %H:%M:%S')
df.set_index('reg_date', inplace=True)
df_h = df.resample(rule='H').mean()
df_h['Day']=df_h.index.strftime('%Y-%m-%d')
print(df_h)
f, ax = plt.subplots()
my_plot = df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90, ax=ax)
locs, labels = plt.xticks()
i = 0
new_labels = list()
new_locs = list()
for l in labels:
    if i % 3 == 0:
        label = labels[i]
        loc = locs[i]
        i += 1
        new_labels.append(label)
        new_locs.append(loc)
    else:
        i += 1
ax.set_xticks(new_locs)
ax.set_xticklabels(new_labels)
ax.grid(axis='y')
plt.show()
I've read about x_compat in Pandas plot in order to apply Matplotlib formatters, but I get an error when trying to apply it. I'll give it another shot later.
Old unsuccessful answer
The tick labels seem to be dates. If they are set as datetime in your dataframe, you can:
months = mdates.MonthLocator(bymonth=(1, 4, 7, 10)) #Choose the months you like the most
ax.xaxis.set_major_locator(months)
Otherwise, you can let Matplotlib know they are dates by:
ax.xaxis_date()
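Note that MonthLocator takes the months through its bymonth keyword. Here is a minimal sketch of that locator on an axis that really holds dates; the temperature series is hypothetical random data. It would not apply to the grouped boxplot above, whose x positions are plain integers and not dates.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical daily temperature series standing in for the real data
idx = pd.date_range("2017-01-01", "2018-12-31", freq="D")
temperature = pd.Series(np.random.randn(len(idx)).cumsum(), index=idx)

fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(temperature.index, temperature.values)

# One tick on the first day of January, April, July and October
ax.xaxis.set_major_locator(mdates.MonthLocator(bymonth=(1, 4, 7, 10)))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))

# Convert the tick positions back to datetimes to inspect them
tick_dates = [mdates.num2date(t) for t in ax.get_xticks()]
```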
Your comment:
I have added additional information:
I forgot to show what is inside the DataFrame.
I have three columns:
reg_Date - datetime64 (index)
temperature - float64
Day - date converted from reg_Date to string, it looks like '2017-10' (YYYY-MM)
The box plot groups the data by 'Day', and I would like to show the 'Day' values as labels, but not all of them: for example, every third one.
Based on your comment quoted above, I would use reg_Date as the input and the following lines:
days = mdates.DayLocator(interval=3)
daysFmt = mdates.DateFormatter('%Y-%m') #to format display
ax.xaxis.set_major_locator(days)
ax.xaxis.set_major_formatter(daysFmt)
I forgot to mention that you will need to:
import matplotlib.dates as mdates
Does this work?

Pandas - How to group-by and plot for each hour of each day of week

I need help figuring out how to plot sub-plots for easy comparison from my dataframe shown:
Date A B C
2017-03-22 15:00:00 obj1 value_a other_1
2017-03-22 14:00:00 obj2 value_ns other_5
2017-03-21 15:00:00 obj3 value_kdsa other_23
2014-05-08 17:00:00 obj2 value_as other_4
2010-07-01 20:00:00 obj1 value_as other_0
I am trying to graph the occurrences of each hour for each respective day of the week. So count the number of occurrences for each day of the week and hour and plot them on subplots like the ones shown below.
If this question sounds confusing please let me know if you have any questions. Thanks.
You can accomplish this with multiple groupby. Since we know there are 7 days in a week, we can specify that number of panels. If you groupby(df.Date.dt.dayofweek), you can use the group index as the index for your subplot axes:
Sample Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
n = 10000
np.random.seed(123)
df = pd.DataFrame({'Date': pd.date_range('2010-01-01', freq='1.09min', periods=n),
'A': np.random.randint(1,10,n),
'B': np.random.normal(0,1,n)})
Code:
fig, ax = plt.subplots(ncols=7, figsize=(30,5))
plt.subplots_adjust(wspace=0.05) #Remove some whitespace between subplots
for idx, gp in df.groupby(df.Date.dt.dayofweek):
    ax[idx].set_title(gp.Date.dt.day_name().iloc[0]) #Set title to the weekday
    (gp.groupby(gp.Date.dt.hour).size().rename_axis('Tweet Hour').to_frame('')
       .reindex(np.arange(0,24,1)).fillna(0)
       .plot(kind='bar', ax=ax[idx], rot=0, ec='k', legend=False))
    # Ticks and labels on leftmost only
    if idx == 0:
        _ = ax[idx].set_ylabel('Counts', fontsize=11)
    _ = ax[idx].tick_params(axis='both', which='major', labelsize=7,
                            labelleft=(idx == 0), left=(idx == 0))
# Consistent bounds between subplots.
lb, ub = list(zip(*[axis.get_ylim() for axis in ax]))
for axis in ax:
    axis.set_ylim(min(lb), max(ub))
plt.show()
If you'd like to make the aspect ratio less extreme, consider plotting a 2x4 grid (2 rows, 4 columns). It's a very similar plot to the one above, once we flatten the axes array. Integer division and the remainder are used to work out which axes need the labels.
fig, ax = plt.subplots(nrows=2, ncols=4, figsize=(20,10))
fig.delaxes(ax[1,3]) #7 days in a week, remove 8th panel
ax = ax.flatten() #Far easier to work with a flattened array
lsize=8
plt.subplots_adjust(wspace=0.05, hspace=0.15) #Remove some whitespace between subplots
for idx, gp in df.groupby(df.Date.dt.dayofweek):
    ax[idx].set_title(gp.Date.dt.day_name().iloc[0]) #Set title to the weekday
    (gp.groupby(gp.Date.dt.hour).size().rename_axis([None]).to_frame()
       .reindex(np.arange(0,24,1)).fillna(0)
       .plot(kind='bar', ax=ax[idx], rot=0, ec='k', legend=False))
    # Titles on correct panels
    if idx%4 == 0:
        _ = ax[idx].set_ylabel('Counts', fontsize=11)
    if (idx//4 == 1) | (idx%4 == 3):
        _ = ax[idx].set_xlabel('Tweet Hour', fontsize=11)
    # Ticks on correct panels
    _ = ax[idx].tick_params(axis='both', which='major', labelsize=lsize,
                            labelbottom=(idx//4 == 1) | (idx%4 == 3),
                            bottom=(idx//4 == 1) | (idx%4 == 3),
                            labelleft=(idx%4 == 0),
                            left=(idx%4 == 0))
# Consistent bounds between subplots.
lb, ub = list(zip(*[axis.get_ylim() for axis in ax]))
for axis in ax:
    axis.set_ylim(min(lb), max(ub))
plt.show()
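As a variation on the nested groupby, the counts can also be built once as a 7 x 24 table (weekday by hour) and then plotted row by row. This is only a sketch with made-up timestamps. The names dow, hour and counts are illustrative, and sharey=True is used instead of synchronising the y-limits by hand.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

n = 10000
df = pd.DataFrame({"Date": pd.date_range("2010-01-01", freq="7min", periods=n)})

# Group by weekday (Monday=0) and hour, then pivot the hours into columns
dow = df.Date.dt.dayofweek.rename("dow")
hour = df.Date.dt.hour.rename("hour")
counts = (df.groupby([dow, hour]).size()
            .unstack(fill_value=0)
            .reindex(index=range(7), columns=range(24), fill_value=0))

day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# sharey=True keeps the y-axis bounds identical across the seven panels
fig, axes = plt.subplots(ncols=7, figsize=(30, 5), sharey=True)
for day, ax in enumerate(axes):
    ax.bar(counts.columns, counts.loc[day], edgecolor="k")
    ax.set_title(day_names[day])
axes[0].set_ylabel("Counts")
```

The reindex call guarantees that hours or weekdays with no rows still appear, with a count of zero.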
What about using seaborn? sns.FacetGrid was made for this:
import pandas as pd
import seaborn as sns
# make some data
date = pd.date_range('today', periods=100, freq='2.5H')
# put in dataframe
df = pd.DataFrame({
'date' : date
})
# create day_of_week and hour columns
df['dow'] = df.date.dt.day_name()
df['hour'] = df.date.dt.hour
# create facet grid
g = sns.FacetGrid(data=df.groupby([
'dow',
'hour'
]).hour.count().to_frame(name='day_hour_count').reset_index(), col='dow', col_order=[
'Sunday',
'Monday',
'Tuesday',
'Wednesday',
'Thursday',
'Friday',
'Saturday'
], col_wrap=3)
# map barplot to each subplot
g.map(sns.barplot, 'hour', 'day_hour_count');

Matplotlib: custom ticker for pandas MultiIndex DataFrame

I have a large pandas MultiIndex DataFrame that I would like to plot. A minimal example would look like:
import pandas as pd
years = range(2015, 2018)
fields = range(4)
days = range(4)
bands = ['R', 'G', 'B']
index = pd.MultiIndex.from_product(
[years, fields], names=['year', 'field'])
columns = pd.MultiIndex.from_product(
[days, bands], names=['day', 'band'])
df = pd.DataFrame(0, index=index, columns=columns)
df.loc[(2015,), (0,)] = 1
df.loc[(2016,), (1,)] = 1
df.loc[(2017,), (2,)] = 1
If I plot this using plt.spy, I get:
However, the tick locations and labels are less than desirable. I would like the ticks to completely ignore the second level of the MultiIndex. Using IndexLocator and IndexFormatter, I'm able to do the following:
from matplotlib.ticker import IndexFormatter, IndexLocator
import matplotlib.pyplot as plt
ax = plt.gca()
plt.spy(df)
xbase = len(bands)
xoffset = xbase / 2
xlabels = df.columns.get_level_values('day')
ax.xaxis.set_major_locator(IndexLocator(base=xbase, offset=xoffset))
ax.xaxis.set_major_formatter(IndexFormatter(xlabels))
plt.xlabel('Day')
ax.xaxis.tick_bottom()
ybase = len(fields)
yoffset = ybase / 2
ylabels = df.index.get_level_values('year')
ax.yaxis.set_major_locator(IndexLocator(base=ybase, offset=yoffset))
ax.yaxis.set_major_formatter(IndexFormatter(ylabels))
plt.ylabel('Year')
plt.show()
This gives me exactly what I want:
But here's the problem. My actual DataFrame has 15 years, 4,000 fields, 365 days, and 7 bands. If I actually label every single day, the labels would be illegible. I could place a tick every 50 days, but I would like the ticks to be dynamic so that when I zoom in, the ticks become more fine-grained. Basically what I'm looking for is a custom MultiIndexLocator that combines the placement of IndexLocator with the dynamism of MaxNLocator.
Bonus: My data is really nice in the sense that there are always the same number of fields for every year and the same number of bands for every day. But what if this was not the case? I would love to contribute a generic MultiIndexLocator and MultiIndexFormatter to matplotlib that works for any MultiIndex DataFrame.
Matplotlib does not know about dataframes or MultiIndex. It simply plots the data you supply. I.e. you get the same as if you were plotting the numpy array of data, spy(df.values).
So I would suggest first setting the extent of the image correctly, so that you can use numeric tickers. A MaxNLocator should then work fine, as long as you do not zoom in too much.
import numpy as np
import pandas as pd
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
plt.rcParams['axes.formatter.useoffset'] = False
years = range(2000, 2018)
fields = range(9) #17
days = range(120) #365
bands = ['R', 'G', 'B', 'A']
index = pd.MultiIndex.from_product(
[years, fields], names=['year', 'field'])
columns = pd.MultiIndex.from_product(
[days, bands], names=['day', 'band'])
data = np.random.rand(len(years)*len(fields),len(days)*len(bands))
x,y = np.meshgrid(np.arange(data.shape[1]),np.arange(data.shape[0]))
data += 2*((y//len(fields)+x//len(bands)) % 2)
df = pd.DataFrame(data, index=index, columns=columns)
############
# Plotting
############
xbase = len(bands)
xlabels = df.columns.get_level_values('day')
ybase = len(fields)
ylabels = df.index.get_level_values('year')
extent = [xlabels.min()-np.diff(np.unique(xlabels))[0]/2.,
xlabels.max()+np.diff(np.unique(xlabels))[0]/2.,
ylabels.min()-np.diff(np.unique(ylabels))[0]/2.,
ylabels.max()+np.diff(np.unique(ylabels))[0]/2.,]
fig, ax = plt.subplots()
ax.imshow(df.values, extent=extent, aspect="auto")
ax.set_ylabel('Year')
ax.set_xlabel('Day')
ax.xaxis.set_major_locator(MaxNLocator(integer=True,min_n_ticks=1))
ax.yaxis.set_major_locator(MaxNLocator(integer=True,min_n_ticks=1))
plt.show()
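Another possible route, sketched below, leaves the image in row/column coordinates and translates each tick position into the outer-level label with a FuncFormatter, while MaxNLocator(integer=True) picks the positions dynamically on zoom. The helper level_formatter is a hypothetical name. Because the lookup goes through the label array itself, it does not rely on every year having the same number of fields. IndexFormatter, used in the question, was deprecated in Matplotlib 3.3, so FuncFormatter is used in its place.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.ticker import FuncFormatter, MaxNLocator

years, fields = range(2015, 2018), range(4)
days, bands = range(4), ["R", "G", "B"]
index = pd.MultiIndex.from_product([years, fields], names=["year", "field"])
columns = pd.MultiIndex.from_product([days, bands], names=["day", "band"])
df = pd.DataFrame(np.random.rand(len(index), len(columns)),
                  index=index, columns=columns)

def level_formatter(labels):
    """Build a formatter mapping a row/column position to its outer-level label."""
    def fmt(pos, _tick_index=None):
        i = int(round(pos))
        return str(labels[i]) if 0 <= i < len(labels) else ""
    return FuncFormatter(fmt)

fig, ax = plt.subplots()
ax.imshow(df.values, aspect="auto")
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
ax.xaxis.set_major_formatter(level_formatter(df.columns.get_level_values("day")))
ax.yaxis.set_major_formatter(level_formatter(df.index.get_level_values("year")))
ax.set_xlabel("Day")
ax.set_ylabel("Year")
```

A drawback of this sketch is that neighbouring ticks inside the same group repeat the same label when zoomed in far enough.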

Remove Saturdays (but not Sundays or other dataless periods) from Timeserie plot

I am plotting a financial time series (see below, here 1 month's worth of data)
I would like to remove the periods I show with red cross etc., which are Saturdays. Note that those periods are not all the time periods without data but only the Saturdays.
I know there are some examples of how to remove the gaps, for instance: http://matplotlib.org/examples/api/date_index_formatter.html.
This is not what I am after since they remove all the gaps. (NOT MY INTENT!).
I was thinking that the way to go might be to create a custom sequence of values for the xaxis. Since the days are ordinals (ie 1 day = a value of 1), it might be possible to create a sequence such as 1,2,3,4,5,6,8,9,10,11,12,13,15,16,etc. skipping 1 every seven days - that day skipped needing to be a Saturday of course.
I can imagine how to skip every Saturday using rrule from dateutil. It is done here (see below), where every Monday is marked with a stronger vertical line. But how would I go about passing it to the tick locator? There is in fact an RRuleLocator class in the matplotlib API, but the docs give no indication of how to use it: http://matplotlib.org/api/dates_api.html#matplotlib.dates.RRuleLocator.
Every suggestion welcome.
Here the code that I use for the current chart:
fig, axes = plt.subplots(2, figsize=(20, 6))
quotes = price_data.as_matrix() # as matrix() to remove the columns header of the df
mpf.candlestick_ohlc(axes[0], quotes, width=0.01)
plt.bar(quotes[:,0] , quotes[:,5], width = 0.01)
for i , axes[i] in enumerate(axes):
    axes[i].xaxis.set_major_locator(mdates.DayLocator(interval=1) )
    axes[i].xaxis.set_major_formatter(mdates.DateFormatter('%a, %b %d'))
    axes[i].grid(True)
    # show night times with a grey shade
    majors=axes[i].xaxis.get_majorticklocs()
    chart_start, chart_end = (axes[i].xaxis.get_view_interval()[0],
                              axes[i].xaxis.get_view_interval()[1])
    for major in majors:
        axes[i].axvspan(max (chart_start, major-(0.3333)),
                        min(chart_end, major+(0.3333)),color="0.95", zorder=-1 ) #0.33 corresponds to 1/3 of a day i.e. 8h
    # show mondays with a line
    mondays = list(rrule(WEEKLY, byweekday=MO, dtstart= mdates.num2date(chart_start),
                         until=mdates.num2date(chart_end)))
    for j, monday in enumerate(mondays):
        axes[i].axvline(mdates.date2num(mondays[j]), linewidth=0.75, color='k', zorder=1)
If your dates are datetime objects, or a DatetimeIndex in a pandas DataFrame, you can check which weekday a certain date is using .weekday (Monday is 0, so Saturday is 5). Then you just set the data on Saturdays to nan. See the example below.
Code
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import random
import datetime
import numpy as np
# generate some data with a datetime index
x = 400
data = pd.DataFrame([
random.random() for i in range(x)],
index=[datetime.datetime(2018, 1, 1, 0)
+ datetime.timedelta(hours=i) for i in range(x)])
# Set all data on a Saturday (5) to nan, so it doesn't show in the graph
data[data.index.weekday == 5] = np.nan
# Plot the data
fig, ax = plt.subplots(figsize=(12, 2.5))
ax.plot(data)
# Set a major tick on each weekday
days = mdates.DayLocator()
daysFmt = mdates.DateFormatter('%a')
ax.xaxis.set_major_locator(days)
ax.xaxis.set_major_formatter(daysFmt)
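Setting Saturdays to nan hides the line but keeps the empty slot on the axis. If the slot itself should disappear, one possible way to realise the "custom sequence" idea from the question is sketched below with hypothetical hourly data. It is not the method of the answer above. Every remaining calendar day gets a consecutive number, the time of day is added as a fraction, and the ticks are relabelled by hand. Other days without data keep their slot, since only Saturdays are taken out of the numbering.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical hourly data: 400 hours starting Monday 2018-01-01
idx = pd.Timestamp("2018-01-01") + pd.to_timedelta(np.arange(400), unit="h")
data = pd.Series(np.random.random(len(idx)), index=idx)

# Drop the Saturday rows (weekday 5; Monday is 0)
kept = data[data.index.weekday != 5]

# Number the calendar days that remain: 0, 1, 2, ... with Saturdays left out
all_days = pd.date_range(idx.min().normalize(), idx.max().normalize(), freq="D")
keep_days = all_days[all_days.weekday != 5]
day_number = keep_days.get_indexer(kept.index.normalize())

# x position = day number + fraction of the day that has elapsed
midnight = kept.index.normalize()
day_fraction = np.asarray((kept.index - midnight) / pd.Timedelta(days=1))
x = day_number + day_fraction

fig, ax = plt.subplots(figsize=(12, 2.5))
ax.plot(x, kept.to_numpy())
ax.set_xticks(np.arange(len(keep_days)))            # one tick per remaining day
ax.set_xticklabels(list(keep_days.strftime("%a %d")))
```

The trade-off is that the axis is no longer a true date axis, so date locators and formatters cannot be used on it directly.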
Result
