I need some guidance to plot:
a scatter plot of the df1 data: time vs. y, using column z for the hue
a line plot of the df2 data: time vs. y
a single line at y=c (c is a constant)
The y data in df1 and df2 are different, but they are in the same range.
I do not know where to begin. Any guidance is appreciated.
More explanation. A portion of data is presented here. I want to plot:
scatter plot of time vs CO2
the yearly rolling average of CO2 (from 01/01/2016 to 09/30/2019, based on hourly data; so the first average will be from "01/01/2016 00" to "12/31/2016 23" and the second average will be from "01/01/2016 01" to "01/01/2017 00") (like the trend in the plot below)
the maximum of all the data, drawn as a line across the plot (like the straight line below)
Sample data
data = {'Date':['0 01/14/2016 00', '01/14/2016 01','01/14/2016 02','01/14/2016 03','01/14/2016 04','01/14/2016 05','01/14/2016 06','01/14/2016 07','01/14/2016 08','01/14/2016 09','01/14/2016 10','01/14/2016 11','01/14/2016 12','01/14/2016 13','01/14/2016 14','01/14/2016 15','01/14/2016 16','01/14/2016 17','01/14/2016 18','01/14/2016 19'],
'CO2':[2415.9,2416.5,2429.8,2421.5,2422.2,2428.3,2389.1,2343.2,2444.,2424.8,2429.6,2414.7,2434.9,2420.6,2420.5,2397.1,2415.6,2417.4,2373.2,2367.9],
'Year':[2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016]}
# Create DataFrame
df = pd.DataFrame(data)
# DataFrame view
Date CO2 Year
0 01/14/2016 00 2415.9 2016
01/14/2016 01 2416.5 2016
01/14/2016 02 2429.8 2016
01/14/2016 03 2421.5 2016
01/14/2016 04 2422.2 2016
using matplotlib.pyplot:
plt.hlines to add a horizontal line at a constant
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# with synthetic data
np.random.seed(365)
data = {'CO2': [np.random.randint(2000, 2500) for _ in range(783)],
'Date': pd.bdate_range(start='1/1/2016', end='1/1/2019').tolist()}
# create the dataframe:
df = pd.DataFrame(data)
# verify Date is in datetime format
df['Date'] = pd.to_datetime(df['Date'])
# set Date as index so .rolling can be used
df.set_index('Date', inplace=True)
# add rolling mean
df['rolling'] = df['CO2'].rolling('365D').mean()
# plot the data
plt.figure(figsize=(8, 8))
plt.scatter(x=df.index, y='CO2', data=df, label='data')
plt.plot(df.index, 'rolling', data=df, color='black', label='365 day rolling mean')
plt.hlines(max(df['CO2']), xmin=min(df.index), xmax=max(df.index), color='red', linestyles='dashed', label='Max')
plt.hlines(np.mean(df['CO2']), xmin=min(df.index), xmax=max(df.index), color='green', linestyles='dashed', label='Mean')
plt.xticks(rotation=45)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
Plot using synthetic data:
Issues with the Date format in the data from the op:
Use a regular expression to fix the Date column
Place the code to fix Date, just before df['Date'] = pd.to_datetime(df['Date'])
import re
# your data
Date CO2 Year
0 01/14/2016 00 2415.9 2016
01/14/2016 01 2416.5 2016
01/14/2016 02 2429.8 2016
01/14/2016 03 2421.5 2016
01/14/2016 04 2422.2 2016
df['Date'] = df['Date'].apply(lambda x: (re.findall(r'\d{2}/\d{2}/\d{4}', x)[0]))
# fixed Date column
Date CO2 Year
01/14/2016 2415.9 2016
01/14/2016 2416.5 2016
01/14/2016 2429.8 2016
01/14/2016 2421.5 2016
01/14/2016 2422.2 2016
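Note that the pattern above keeps only the date and drops the hour. If the hourly resolution matters (the question describes hourly data), a variant along these lines keeps it. This is only a sketch: it assumes every value ends in `MM/DD/YYYY HH`, possibly preceded by stray characters such as the leading `0 ` in the sample, and it uses a three-row excerpt of the posted data.

```python
import pandas as pd

# Small sample in the question's format; the first value carries a stray "0 " prefix
df = pd.DataFrame({
    'Date': ['0 01/14/2016 00', '01/14/2016 01', '01/14/2016 02'],
    'CO2': [2415.9, 2416.5, 2429.8],
})

# Extract the trailing "MM/DD/YYYY HH" and discard anything before it
cleaned = df['Date'].str.extract(r'(\d{2}/\d{2}/\d{4} \d{2})$', expand=False)

# Parse with an explicit format so the hour is preserved
df['Date'] = pd.to_datetime(cleaned, format='%m/%d/%Y %H')
df = df.set_index('Date')
```

With the index parsed this way, `df['CO2'].rolling('365D').mean()` works on the hourly timestamps in the same way as in the code above.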
You can use a dual-axis chart. It should look the same as yours, provided both axes end up with the same scale. You can plot directly from the pandas DataFrames.
import matplotlib.pyplot as plt
import pandas as pd
# create a color map for the z column
color_map = {'z_val1':'red', 'z_val2':'blue', 'z_val3':'green', 'z_val4':'yellow'}
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = ax1.twinx() #second axis within the first
# define scatter plot
df1.plot.scatter(x = 'date',
y = 'CO2',
ax = ax1,
c = df1['z'].apply(lambda x:color_map[x]))
# define line plot
df2.plot.line(x = 'date',
y = 'MA_CO2', #moving average in dataframe 2
ax = ax2)
# plot the horizontal line at y = c (constant value)
ax1.axhline(y = c, color='r', linestyle='-')
# to fit the chart properly
plt.tight_layout()
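For reference, here is a self-contained sketch of the three layers drawn on a single Axes, which may be enough since both y series are in the same range. The data, the column names (`time`, `y`, `z`), the category labels and the constant `c` are all made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
time = pd.date_range('2016-01-01', periods=50, freq='D')

# Hypothetical stand-ins for df1, df2 and c
df1 = pd.DataFrame({'time': time,
                    'y': rng.uniform(10, 20, size=50),
                    'z': rng.choice(['a', 'b'], size=50)})
df2 = pd.DataFrame({'time': time, 'y': np.linspace(12, 18, 50)})
c = 19.0

fig, ax = plt.subplots(figsize=(8, 4))

# One scatter call per z category, so each gets its own colour and legend entry
for z_value, group in df1.groupby('z'):
    ax.scatter(group['time'], group['y'], label=f'z = {z_value}')

ax.plot(df2['time'], df2['y'], color='black', label='df2')       # line plot
ax.axhline(y=c, color='red', linestyle='--', label=f'y = {c}')   # constant line
ax.legend()
fig.autofmt_xdate()
```

Call `plt.show()` or `fig.savefig(...)` afterwards to view the result.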
Related
I'm trying to plot a time series with monthly dates from 1959 to 2019. When I plot it, I get a cluttered x-axis where the dates do not show properly. How can I drop the months and show only the years on the x-axis, so that it is less cluttered and the years are readable?
fig,ax = plt.subplots(2,1)
ax[0].hist(pca_function(sd_Data))
ax[0].set_ylabel ('Frequency')
ax[1].plot(pca_function(sd_Data))
ax[1].set_xlabel ('Years')
fig.suptitle('Histogram and Time series of Plot Factor')
plt.tight_layout()
# fig.savefig('factor1959.pdf')
pca_function(sd_Data)
comp_0
sasdate
1959-01 -0.418150
1959-02 1.341654
1959-03 1.684372
1959-04 1.981473
1959-05 1.242232
...
2019-08 -0.075270
2019-09 -0.402110
2019-10 -0.609002
2019-11 0.320586
2019-12 -0.303515
[732 rows x 1 columns]
From what I see, you do have years on your second subplot; they just overlap because there are too many of them placed horizontally. Try increasing figsize and rotating the ticks:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Builds an example dataframe.
df = pd.DataFrame(columns=['Years', 'Frequency'])
df['Years'] = pd.date_range(start='1/1/1959', end='1/1/2023', freq='M')
df['Frequency'] = np.random.normal(0, 1, size=(df.shape[0]))
fig, ax = plt.subplots(2,1, figsize=(20, 5))
ax[0].hist(df.Frequency)
ax[0].set_ylabel ('Frequency')
ax[1].plot(df.Years, df.Frequency)
ax[1].set_xlabel('Years')
for tick in ax[0].get_xticklabels():
    tick.set_rotation(45)
    tick.set_ha('right')
for tick in ax[1].get_xticklabels():
    tick.set_rotation(45)
    tick.set_ha('right')
fig.suptitle('Histogram and Time series of Plot Factor')
plt.tight_layout()
p.s. if the x-labels still overlap, try to increase your step size.
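One way to control the step size is a year locator. The sketch below uses made-up data and an arbitrary 5-year step (both are assumptions, not taken from the question) and puts a labelled tick every five years:

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up monthly series covering the same period as the question
dates = pd.date_range(start='1959-01-01', end='2019-12-01', freq='MS')
values = np.random.default_rng(0).normal(0, 1, size=len(dates))

fig, ax = plt.subplots(figsize=(12, 3))
ax.plot(dates, values)

# One major tick every 5 years, labelled with the year only
ax.xaxis.set_major_locator(mdates.YearLocator(base=5))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
ax.set_xlabel('Years')
fig.tight_layout()
```

This requires the x values to be real datetimes; with string labels such as `'1959-01'` the date locators have nothing to work with.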
First off, you need to store the result of the call to pca_function in a variable, e.g. one called result_pca_func. That way, the calculations (and possibly side effects or different randomization) are only done once.
Second, the dates should be converted to a datetime format. For example using pd.to_datetime(). That way, matplotlib can automatically put year ticks as appropriate.
Here is an example, starting from a dummy test dataframe:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': [f'{y}-{m:02d}' for y in range(1959, 2019) for m in range(1, 13)]})
df['Values'] = np.random.randn(len(df)).cumsum()
df = df.set_index('Date')
result_pca_func = df
result_pca_func.index = pd.to_datetime(result_pca_func.index)
fig, ax2 = plt.subplots(figsize=(10, 3))
ax2.plot(result_pca_func)
plt.tight_layout()
plt.show()
I am trying to plot a simple pandas Series object. It looks something like this:
2018-01-01 10
2018-01-02 90
2018-01-03 79
...
2020-01-01 9
2020-01-02 72
2020-01-03 65
It includes only the first month of each year, so it contains only January and its daily values.
When I try to plot it
# suppose the name of the series is dates_and_values
dates_and_values.plot()
It returns a plot like this (made using my current data)
It is clearly plotting by year and then month, so it looks squished and small, since I don't have any months other than January. Is there a way to plot it by year and day, so that the days are easier to see?
the x-axis is the index of the dataframe
dates are a continuous series, so the x-axis is continuous
changing the index to strings means it is no longer continuous, so the gaps between the Januaries are squeezed out of your graph
I have generated some sample data that only has January to demonstrate
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
cf = pd.tseries.offsets.CustomBusinessDay(weekmask="Sun Mon Tue Wed Thu Fri Sat",
holidays=[d for d in pd.date_range("01-jan-1990",periods=365*50, freq="D")
if d.month!=1])
d = pd.date_range("01-jan-2015", periods=200, freq=cf)
df = pd.DataFrame({"Values":np.random.randint(20,70,len(d))}, index=d)
fig, ax = plt.subplots(2, figsize=[14,6])
df.set_index(df.index.strftime("%Y %d")).plot(ax=ax[0])
df.plot(ax=ax[1])
I suggest that you convert the series to a dataframe and then pivot it to get one column for each year. This lets you plot the data for each year with a separate line, either in the same plot using different colors or in subplots. Here is an example:
import numpy as np # v 1.19.2
import pandas as pd # v 1.2.3
# Create sample series
rng = np.random.default_rng(seed=123) # random number generator
dt = pd.date_range('2018-01-01', '2020-01-31', freq='D')
dt_jan = dt[dt.month == 1]
series = pd.Series(rng.integers(20, 90, size=dt_jan.size), index=dt_jan)
# Convert series to dataframe and pivot it
df_raw = series.to_frame()
df_pivot = df_raw.pivot_table(index=df_raw.index.day, columns=df_raw.index.year)
df = df_pivot.droplevel(axis=1, level=0)
df.head()
# Plot all years together in different colors
ax = df.plot(figsize=(10,4))
ax.set_xlim(1, 31)
ax.legend(frameon=False, bbox_to_anchor=(1, 0.65))
ax.set_xlabel('January', labelpad=10, size=12)
for spine in ['top', 'right']:
    ax.spines[spine].set_visible(False)
# Plot years separately
axs = df.plot(subplots=True, color='tab:blue', sharey=True,
figsize=(10,8), legend=None)
for ax in axs:
    ax.set_xlim(1, 31)
    ax.grid(axis='x', alpha=0.3)
    handles, labels = ax.get_legend_handles_labels()
    ax.text(28.75, 80, *labels, size=14)
    # Axes.is_last_row() is gone in newer matplotlib versions; ask the subplot spec instead
    if ax.get_subplotspec().is_last_row():
        ax.set_xlabel('January', labelpad=10, size=12)
        ax.figure.subplots_adjust(hspace=0)
I would like to highlight a single point on my line plot using a marker. So far I have managed to create my plot and insert the highlight where I wanted.
The problem is that I have 4 different lines (4 different categorical attributes) and the marker is placed on every single line, as in the following image:
I would like to place the marker only on the 2020 line (the purple one). This is my code so far:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.gridspec as gridspec
import seaborn as sns
import numpy as np
fig = plt.figure(figsize=(15,10))
gs0 = gridspec.GridSpec(2,2, figure=fig, hspace=0.2)
ax1 = fig.add_subplot(gs0[0,:]) # lineplot
ax2 = fig.add_subplot(gs0[1,0]) #Used for another plot not shown here
ax3 = fig.add_subplot(gs0[1,1]) #Used for another plot not shown here
flatui = ["#636EFA", "#EF553B", "#00CC96", "#AB63FA"]
sns.lineplot(ax=ax1,x="number of weeks", y="avg streams", hue="year", data=df, palette=flatui, marker = 'o', markersize=20, fillstyle='none', markeredgewidth=1.5, markeredgecolor='black', markevery=[5])
ax1.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{:,.0f}'.format(x/1000) + 'K'))
ax1.set(title='Streams trend')
ax1.xaxis.set_major_locator(ticker.MultipleLocator(2))
I used the markevery parameter to place a marker at position 5. Is there a way to also specify which line/category the marker is placed on?
EDIT: This is my dataframe:
avg streams date year number of weeks
0 145502.475 01-06 2017 0
1 158424.445 01-13 2017 1
2 166912.255 01-20 2017 2
3 169132.215 01-27 2017 3
4 181889.905 02-03 2017 4
... ... ... ... ...
181 760505.945 06-26 2020 25
182 713891.695 07-03 2020 26
183 700764.875 07-10 2020 27
184 753817.945 07-17 2020 28
185 717685.125 07-24 2020 29
186 rows × 4 columns
markevery is a Line2D property. sns.lineplot doesn't return the lines so you need to get the line you want to annotate from the Axes. Remove all the marker parameters from the lineplot call and add ...
lines = ax1.get_lines()
If the 2020 line/data is the fourth in the series,
line = lines[3]
line.set_marker('o')
line.set_markersize(20)
line.set_markevery([5])
line.set_fillstyle('none')
line.set_markeredgewidth(1.5)
line.set_markeredgecolor('black')
# or
props = {'marker':'o','markersize':20, 'fillstyle':'none','markeredgewidth':1.5,
'markeredgecolor':'black','markevery': [5]}
line.set(**props)
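The same idea in a self-contained form, using plain matplotlib and made-up data. Looking the line up by its label is an assumption made for this sketch; with seaborn, check the order of `ax1.get_lines()` first, since the line labels may not be the year values.

```python
import matplotlib.pyplot as plt
import numpy as np

weeks = np.arange(30)
rng = np.random.default_rng(1)

fig, ax = plt.subplots()
for year in (2017, 2018, 2019, 2020):
    # Made-up values standing in for "avg streams"
    ax.plot(weeks, rng.uniform(1e5, 8e5, size=weeks.size), label=str(year))

# Pick the one line to highlight (here by label) and mark only its point at index 5
target = next(line for line in ax.get_lines() if line.get_label() == '2020')
target.set(marker='o', markersize=20, fillstyle='none',
           markeredgewidth=1.5, markeredgecolor='black', markevery=[5])
ax.legend()
```

Because the marker properties are set on one `Line2D` object only, the other three lines stay unmarked.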
Another option, inspired by Quang Hoang's comment, would be to add a circle around/at the point, deriving the point from the DataFrame.
x = 5 # your spec
wk = df['number of weeks']==5
yr = df['year']==2020
s = df[wk & yr]
y = s['avg streams'].to_numpy()
# or
y = df.loc[(df['year']==2020) & (df['number of weeks']==5),'avg streams'].to_numpy()
ax1.plot(x,y, 'ko', markersize=20, fillstyle='none', markeredgewidth=1.5)
df
Date Col1 Col2 Col3
2016-11-1 12 13 14
2016-10-1 2 3 1
2016-03-01 2 1 1
and so on
Code to decompose time series to get seasonality, trends, observed and residual values:
from statsmodels.tsa.seasonal import seasonal_decompose
from matplotlib import dates as mdates
years = mdates.YearLocator() # only print label for the years
months = mdates.MonthLocator() # mark months as ticks
years_fmt = mdates.DateFormatter('%Y')
fmt = mdates.DateFormatter('%b')
df = df.set_index('Date')
s_dec_multiplicative = seasonal_decompose(df['Col1'], model = "multiplicative")
s_dec_multiplicative.plot()
s_dec_multiplicative.xaxis.set_major_locator(years)
s_dec_multiplicative.xaxis.set_minor_locator(months)
s_dec_multiplicative.xaxis.set_major_formatter(years_fmt)
s_dec_multiplicative.xaxis.set_minor_formatter(fmt)
plt.show()
Problem: I want tick labels for all months (JAN, FEB, MAR, etc.). Years should be labelled as 2016, 2017 and so on, with the months in between as small ticks.
ERROR:
---> 12 s_dec_multiplicative.xaxis.set_major_locator(years)
AttributeError: 'DecomposeResult' object has no attribute 'xaxis'
Your problem is that you're trying to change an attribute of the DecomposeResult object, whereas you're supposed to work on the ax object.
Let's retrieve some toy time series data:
from pandas_datareader import data
goog = data.DataReader("GOOG", "yahoo")["Adj Close"]
goog.plot();
Now let's do the desired decomposition and put the results into Pandas' df:
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
# newer statsmodels versions take period= here; older ones used freq=
s_dec_multiplicative = seasonal_decompose(goog, model="multiplicative", period=12)
observed = s_dec_multiplicative.observed
seasonal = s_dec_multiplicative.seasonal
residual = s_dec_multiplicative.resid
df = pd.DataFrame({"observed": observed, "seasonal": seasonal, "residual": residual})
Finally, we're ready to plot:
import matplotlib.pyplot as plt
from matplotlib import dates as mdates
_, axes = plt.subplots(nrows=3, ncols=1, figsize=(20, 10))
for i, ax in enumerate(axes):
    ax = df.iloc[:, i].plot(ax=ax)
    # locators and formatters should not be shared between axes,
    # so create fresh ones for every subplot
    ax.xaxis.set_major_locator(mdates.YearLocator())    # major ticks at the years
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%b'))
    ax.xaxis.set_minor_locator(mdates.MonthLocator())   # minor ticks at the months
    ax.xaxis.set_minor_formatter(mdates.DateFormatter('%b'))
    ax.set_ylabel(df.iloc[:, i].name)
    plt.setp(ax.xaxis.get_minorticklabels(), rotation=90)
    plt.setp(ax.xaxis.get_majorticklabels(), rotation=90)
I am trying to plot stacked yearly line graphs by months.
I have a dataframe df_year as below:
Day Number of Bicycle Hires
2010-07-30 6897
2010-07-31 5564
2010-08-01 4303
2010-08-02 6642
2010-08-03 7966
with the index set to the date going from 2010 July to 2017 July
I want to plot a line graph for each year with the xaxis being months from Jan to Dec and only the total sum per month is plotted
I have achieved this by converting the dataframe to a pivot table as below:
pt = pd.pivot_table(df_year, index=df_year.index.month, columns=df_year.index.year, aggfunc='sum')
This creates the pivot table as below which I can plot as show in the attached figure:
Number of Bicycle Hires 2010 2011 2012 2013 2014
1 NaN 403178.0 494325.0 565589.0 493870.0
2 NaN 398292.0 481826.0 516588.0 522940.0
3 NaN 556155.0 818209.0 504611.0 757864.0
4 NaN 673639.0 649473.0 658230.0 805571.0
5 NaN 722072.0 926952.0 749934.0 890709.0
plot showing yearly data with months on xaxis
The only problem is that the months show up as integers and I would like them to be shown as Jan, Feb .... Dec with each line representing one year. And I am unable to add a legend for each year.
I have tried the following code to achieve this:
dims = (15,5)
fig, ax = plt.subplots(figsize=dims)
ax.plot(pt)
months = MonthLocator(range(1, 13), bymonthday=1, interval=1)
monthsFmt = DateFormatter("%b '%y")
ax.xaxis.set_major_locator(months) #adding this makes the month ints disappear
ax.xaxis.set_major_formatter(monthsFmt)
handles, labels = ax.get_legend_handles_labels() #legend is nowhere on the plot
ax.legend(handles, labels)
Please can anyone help me out with this, what am I doing incorrectly here?
Thanks!
There is nothing in your legend handles and labels. Furthermore, the DateFormatter is not returning the right values, because the values you're translating are not datetime objects.
You could set the index specifically for the dates, then drop the multiindex column level which is created by the pivot (the '0') and then use explicit ticklabels for the months whilst setting where they need to occur on your x-axis. As follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import datetime
# dummy data (Days)
dates_d = pd.date_range('2010-01-01', '2017-12-31', freq='D')
df_year = pd.DataFrame(np.random.randint(100, 200, (dates_d.shape[0], 1)), columns=['Data'])
df_year.index = dates_d #set index
pt = pd.pivot_table(df_year, index=df_year.index.month, columns=df_year.index.year, aggfunc='sum')
pt.columns = pt.columns.droplevel() # remove the double header (0) as pivot creates a multiindex.
ax = plt.figure().add_subplot(111)
ax.plot(pt)
ticklabels = [datetime.date(1900, item, 1).strftime('%b') for item in pt.index]
ax.set_xticks(np.arange(1,13))
ax.set_xticklabels(ticklabels) #add monthlabels to the xaxis
ax.legend(pt.columns.tolist(), loc='center left', bbox_to_anchor=(1, .5)) #add the column names as legend.
plt.tight_layout(rect=[0, 0, 0.85, 1])
plt.show()
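As a variation on the above (a sketch with the same kind of dummy data, not a change to the approach), pandas can draw the lines and the legend itself if the table has one plain column per year, and `calendar.month_abbr` supplies the month names. The column name `Data` and the random values are placeholders.

```python
import calendar

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Dummy daily data standing in for df_year
dates_d = pd.date_range('2010-01-01', '2017-12-31', freq='D')
df_year = pd.DataFrame({'Data': np.random.default_rng(0).integers(100, 200, len(dates_d))},
                       index=dates_d)

# Monthly totals: one row per month, one column per year
pt = df_year['Data'].groupby([df_year.index.month, df_year.index.year]).sum().unstack()

ax = pt.plot(figsize=(10, 4))   # one line per year, legend taken from the column names
ax.set_xticks(range(1, 13))
ax.set_xticklabels(calendar.month_abbr[1:13])
ax.legend(title='Year', loc='center left', bbox_to_anchor=(1, 0.5))
ax.figure.tight_layout()
```

Grouping a single column avoids the extra column level that `pivot_table` creates, so no `droplevel` step is needed here.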