I have a Kaggle dataset (link).
I read the dataset and set Date as the index column:
museum_data = pd.read_csv("museum_visitors.csv", index_col = "Date", parse_dates = True)
The resulting museum_data looks like this:
Date        Avila Adobe  Firehouse Museum  Chinese American Museum  America Tropical Interpretive Center
2014-01-01  24778        4486              1581                     6602
2014-02-01  18976        4172              1785                     5029
...         ...          ...               ...                      ...
2018-10-01  19280        4622              2364                     3775
2018-11-01  17163        4082              2385                     4562
Here is the code I use to plot the lineplot in seaborn:
plt.figure(figsize = (20,8))
sns.lineplot(data = museum_data)
plt.show()
And, this is what the result looks like:
What I want to know is how to show several months per year on the x-axis (not all of them; for example, the first month of each season).
Thank you all for your time, in advance.
You can use MonthLocator and perhaps ConciseDateFormatter to add minor ticks with a few months showing, something like the following:
import matplotlib.dates as mdates
...
fig, ax = plt.subplots(figsize = (20,8))
sns.lineplot(data = museum_data, ax=ax)
locator = mdates.MonthLocator(bymonth=[4,7,10])
ax.xaxis.set_minor_locator(locator)
ax.xaxis.set_minor_formatter(mdates.ConciseDateFormatter(locator))
Output:
Edit (closer): you can add the following to show January as well:
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b\n%Y'))
Output:
Edit 2 (there's probably a better way but I'm rusty):
length = plt.rcParams["xtick.minor.size"]
pad = plt.rcParams['xtick.minor.pad']
ax.tick_params('x', length=length, pad=pad)
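Putting the pieces of this answer together, a minimal sketch might look like the following. It assumes synthetic monthly data in place of the Kaggle file (the column names "Museum A" and "Museum B" and the random values are made up), it plots with plain Matplotlib instead of seaborn so that only pandas and Matplotlib are needed, and it swaps in a plain DateFormatter("%b") for the minor labels. sns.lineplot(data=museum_data, ax=ax) should slot in the same way.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for museum_visitors.csv (random values, not real data)
rng = np.random.default_rng(0)
dates = pd.date_range("2014-01-01", "2018-11-01", freq="MS")
museum_data = pd.DataFrame(
    rng.integers(1000, 25000, size=(len(dates), 2)),
    index=dates,
    columns=["Museum A", "Museum B"],  # hypothetical column names
)

fig, ax = plt.subplots(figsize=(20, 8))
for column in museum_data.columns:
    ax.plot(museum_data.index, museum_data[column], label=column)
ax.legend()

# Major ticks: 1 January of every year, labelled with month and year
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b\n%Y"))

# Minor ticks: April, July and October, labelled with the month name only
ax.xaxis.set_minor_locator(mdates.MonthLocator(bymonth=[4, 7, 10]))
ax.xaxis.set_minor_formatter(mdates.DateFormatter("%b"))

fig.canvas.draw()  # render once so ticks and labels are laid out
```

January is left to the major locator here so that the year label appears only once per year; adding 1 to bymonth would draw a minor tick on top of the major one.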
Related
I am trying to draw a stock market graph:
time series vs. closing price and time series vs. volume.
Somehow the x-axis shows the dates as 1970.
The graph and the code follow.
The code is:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
pd_data = pd.DataFrame(data, columns=['id', 'symbol', 'volume', 'high', 'low', 'open', 'datetime','close','datetime_utc','created_at'])
pd_data['DOB'] = pd.to_datetime(pd_data['datetime_utc']).dt.strftime('%Y-%m-%d')
pd_data.set_index('DOB')
print(pd_data)
print(pd_data.dtypes)
ax=pd_data.plot(x='DOB',y='close',kind = 'line')
ax.set_ylabel("price")
#ax.pd_data['volume'].plot(secondary_y=True, kind='bar')
ax1=pd_data.plot(y='volume',secondary_y=True, ax=ax,kind='bar')
ax1.set_ylabel('Volumne')
# Choose your xtick format string
date_fmt = '%d-%m-%y'
date_formatter = mdates.DateFormatter(date_fmt)
ax1.xaxis.set_major_formatter(date_formatter)
# set monthly locator
ax1.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
# set font and rotation for date tick labels
plt.gcf().autofmt_xdate()
plt.show()
I also tried plotting the two graphs independently, without ax=ax:
ax=pd_data.plot(x='DOB',y='close',kind = 'line')
ax.set_ylabel("price")
ax1=pd_data.plot(y='volume',secondary_y=True,kind='bar')
ax1.set_ylabel('Volumne')
Then the price graph shows the years properly, whereas the volume graph shows 1970.
And if I swap them:
ax1=pd_data.plot(y='volume',secondary_y=True,kind='bar')
ax1.set_ylabel('Volumne')
ax=pd_data.plot(x='DOB',y='close',kind = 'line')
ax.set_ylabel("price")
Now the volume graph shows the years properly, whereas the price graph shows the years as 1970.
I tried removing secondary_y and also changing bar to line, but no luck.
Somehow the first plot seems to change the year that pandas shows on the second.
I do not advise drawing a bar plot with this many bars.
This answer explains why there is an issue with the xtick labels, and how to resolve the issue.
Plotting with pandas.DataFrame.plot works without issue with .set_major_locator
Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.2
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas_datareader as web # conda install -c anaconda pandas-datareader or pip install pandas-datareader
# download data
df = web.DataReader('amzn', data_source='yahoo', start='2015-02-21', end='2021-04-27')
# plot
ax = df.plot(y='Close', color='magenta', ls='-.', figsize=(10, 6), ylabel='Price ($)')
ax1 = df.plot(y='Volume', secondary_y=True, ax=ax, alpha=0.5, rot=0, lw=0.5)
ax1.set(ylabel='Volume')
# format
date_fmt = '%d-%m-%y'
years = mdates.YearLocator() # every year
yearsFmt = mdates.DateFormatter(date_fmt)
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(yearsFmt)
plt.setp(ax.get_xticklabels(), ha="center")
plt.show()
Why are the OP x-tick labels starting from 1970?
With pandas, bar plot locations are 0-indexed (0, 1, 2, ...), and when read as a date, 0 corresponds to 1970.
See Pandas bar plot changes date format
Most solutions with bar plots simply reformat the label to the appropriate datetime, however this is cosmetic and will not align the locations between the line plot and bar plot
Solution 2 of this answer shows how to change the tick locators, but is really not worth the extra code, when plt.bar can be used.
print(pd.to_datetime(ax1.get_xticks()))
DatetimeIndex([ '1970-01-01 00:00:00',
'1970-01-01 00:00:00.000000001',
'1970-01-01 00:00:00.000000002',
'1970-01-01 00:00:00.000000003',
...
'1970-01-01 00:00:00.000001552',
'1970-01-01 00:00:00.000001553',
'1970-01-01 00:00:00.000001554',
'1970-01-01 00:00:00.000001555'],
dtype='datetime64[ns]', length=1556, freq=None)
ax = df.plot(y='Close', color='magenta', ls='-.', figsize=(10, 6), ylabel='Price ($)')
print(ax.get_xticks())
ax1 = df.plot(y='Volume', secondary_y=True, ax=ax, kind='bar')
print(ax1.get_xticks())
ax1.set_xlim(0, 18628.)
date_fmt = '%d-%m-%y'
years = mdates.YearLocator() # every year
yearsFmt = mdates.DateFormatter(date_fmt)
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(yearsFmt)
[out]:
[16071. 16436. 16801. 17167. 17532. 17897. 18262. 18628.] ← ax tick locations
[ 0 1 2 ... 1553 1554 1555] ← ax1 tick locations
With plt.bar the bar plot locations are indexed based on the datetime
ax = df.plot(y='Close', color='magenta', ls='-.', figsize=(10, 6), ylabel='Price ($)', rot=0)
plt.setp(ax.get_xticklabels(), ha="center")
print(ax.get_xticks())
ax1 = ax.twinx()
ax1.bar(df.index, df.Volume)
print(ax1.get_xticks())
date_fmt = '%d-%m-%y'
years = mdates.YearLocator() # every year
yearsFmt = mdates.DateFormatter(date_fmt)
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(yearsFmt)
[out]:
[16071. 16436. 16801. 17167. 17532. 17897. 18262. 18628.]
[16071. 16436. 16801. 17167. 17532. 17897. 18262. 18628.]
sns.barplot(x=df.index, y=df.Volume, ax=ax1) has xtick locations as [ 0 1 2 ... 1553 1554 1555], so the bar plot and line plot did not align.
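The difference in x positions can be checked on a small synthetic frame. The sketch below assumes made-up data (ten daily rows with arbitrary values) and compares where a pandas bar plot and Matplotlib's Axes.bar place the bars.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: ten consecutive days
idx = pd.date_range("2021-01-01", periods=10, freq="D")
df = pd.DataFrame({"Volume": np.arange(10, 20)}, index=idx)

fig, (ax_pandas, ax_mpl) = plt.subplots(2, 1)

# pandas bar plot: bars sit at integer positions 0..9, the dates are only labels
df.plot(y="Volume", kind="bar", ax=ax_pandas)
pandas_positions = ax_pandas.get_xticks()

# Axes.bar: bars are centred on the date values themselves
bars = ax_mpl.bar(df.index, df["Volume"])
bar_centers = np.array([b.get_x() + b.get_width() / 2 for b in bars])

# The same dates expressed in Matplotlib's date units (days since the epoch)
date_numbers = mdates.date2num(df.index)
```

Because the Axes.bar positions match date_numbers, a line plotted against the same DatetimeIndex shares the axis without any relabelling, whereas the pandas bar positions would need to be remapped.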
I could not find the reason for the 1970 dates, so instead I plotted with matplotlib.pyplot directly rather than indirectly through pandas, and passed a list of datetime objects instead of the pandas column.
The following code worked:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import datetime as dt
import numpy as np
pd_data = pd.read_csv("/home/stockdata.csv",sep='\t')
pd_data['DOB'] = pd.to_datetime(pd_data['datetime2']).dt.strftime('%Y-%m-%d')
dates=[dt.datetime.strptime(d,'%Y-%m-%d').date() for d in pd_data['DOB']]
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
plt.gca().xaxis.set_major_locator(mdates.MonthLocator(interval=2))
plt.bar(dates,pd_data['close'],align='center')
plt.gca().xaxis.set_minor_locator(plt.MultipleLocator(1))
plt.gcf().autofmt_xdate()
plt.show()
I created a dates list of datetime.date objects. If I plot using that, the dates are no longer shown as 1970.
open high low close volume datetime datetime2
35.12 35.68 34.79 35.58 1432995 1244385200000 2012-6-15 10:30:00
35.69 36.02 35.37 35.78 1754319 1244371600000 2012-6-16 10:30:00
35.69 36.23 35.59 36.23 3685845 1245330800000 2012-6-19 10:30:00
36.11 36.52 36.03 36.32 2635777 1245317200000 2012-6-20 10:30:00
36.54 36.6 35.8 35.9 2886412 1245303600000 2012-6-21 10:30:00
36.03 36.95 36.0 36.09 3696278 1245390000000 2012-6-22 10:30:00
36.5 37.27 36.18 37.11 2732645 1245376400000 2012-6-23 10:30:00
36.98 37.11 36.686 36.83 1948411 1245335600000 2012-6-26 10:30:00
36.67 37.06 36.465 37.05 2557172 1245322000000 2012-6-27 10:30:00
37.06 37.61 36.77 37.52 1780126 1246308400000 2012-6-28 10:30:00
37.47 37.77 37.28 37.7 1352267 1246394800000 2012-6-29 10:30:00
37.72 38.1 37.68 37.76 2194619 1246381200000 2012-6-30 10:30:00
The plot i get is
I am trying to plot a graph to represent a monthly river discharge dataset from 1980-01-01 to 2013-12-31.
Please check out this graph
The plan is to plot "Jan Feb Mar Apr May...Dec" as the x-axis and the discharge (m3/s) as the y-axis. The actual lines on the graphs would represent the years. Alternatively, the lines on the graph would showcase monthly average (from jan to dec) of every year from 1980 to 2013.
DAT = pd.read_excel('Modelled Discharge_UIB_1980-2013_Daily.xlsx',
sheet_name='Karhmong', header=None, skiprows=1,
names=['year', 'month', 'day', 'flow'],
parse_dates={ 'date': ['year', 'month', 'day'] },
index_col='date')
The above shows how the data is loaded; it looks like this:
date flow
1980-01-01 104.06
1980-01-02 103.81
1980-01-03 103.57
1980-01-04 103.34
1980-01-05 103.13
... ...
2013-12-27 105.65
2013-12-28 105.32
2013-12-29 105.00
2013-12-30 104.71
2013-12-31 104.42
Because I want to compare all the years with each other, I tried the following:
DAT1980 = DAT[DAT.index.year==1980]
DAT1980
DAT1981 = DAT[DAT.index.year==1981
DAT1981
...etc
To group the months for the x-axis, I tried:
datmonth = np.unique(DAT.index.month)
So far, none of these commands caused an error.
However, when I plotted the graph I got an error.
The plot command:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,6))
ax.plot(datmonth, DAT1980, color='purple', linestyle='--', label='1980')
ax.grid()
plt.legend()
ax.set_title('Monthly River Indus Discharge Comparison 1980-2013')
ax.set_ylabel('Discharge (m3/s)')
ax.set_xlabel('Month')
axs.set_xlim(3, 5)
axs.xaxis.set_major_formatter
fig.autofmt_xdate()
ax.legend(loc='upper left', bbox_to_anchor=(1, 1))
This gave the error "ValueError: x and y must have same first dimension, but have shapes (12,) and (366, 1)".
I then tried:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,6))
ax.plot(DAT.index.month, DAT.index.year==1980, color='purple', linestyle='--', label='1980')
ax.grid()
ax.plot(DAT.index.month, DAT.index.year==1981, color='black', marker='o', linestyle='-', label='C1981')
ax.grid()
plt.legend()
ax.set_title('Monthly River Indus Discharge Comparison 1980-2013')
ax.set_ylabel('Discharge (m3/s)')
ax.set_xlabel('Month')
#axs.set_xlim(1, 12)
axs.xaxis.set_major_formatter
fig.autofmt_xdate()
ax.legend(loc='upper left', bbox_to_anchor=(1, 1))
This worked better than the previous attempt, but it is still not what I wanted
(please check out the graph here),
as my intention is to create a graph similar to this.
I wholeheartedly appreciate any suggestion you may have! Thank you so so much and if you need any further information please do not hesitate to ask, I will reply as soon as possible.
Welcome to SO! Nice job creating a clear description of your issue and showing lots of code : )
There are a few syntax issues here and there, but the main issue I see is that you need to add a groupby/aggregation operation at some point. That is, you have daily data, but your desired plot has monthly resolution (for each year). It sounds like you want an average of the daily values for each month for each year (correct me if that is wrong).
Here is some fake data:
dr = pd.date_range('01-01-1980', '12-31-2013', freq='1D')
flow = np.random.rand(len(dr))
df = pd.DataFrame(flow, columns=['flow'], index=dr)
Looks like your example:
flow
1980-01-01 0.751287
1980-01-02 0.411040
1980-01-03 0.134878
1980-01-04 0.692086
1980-01-05 0.671108
...
2013-12-27 0.683654
2013-12-28 0.772894
2013-12-29 0.380631
2013-12-30 0.957220
2013-12-31 0.864612
[12419 rows x 1 columns]
You can use groupby to get a mean for each month, using the same datetime attributes you use above (with some additional methods to make the data easier to work with):
monthly = (df.groupby([df.index.year, df.index.month])
.mean()
.rename_axis(index=['year', 'month'],)
.reset_index())
monthly has flow data for each month for each year, i.e. what you want to plot:
year month flow
0 1980 1 0.514496
1 1980 2 0.633738
2 1980 3 0.566166
3 1980 4 0.553763
4 1980 5 0.537686
.. ... ... ...
403 2013 8 0.402805
404 2013 9 0.479226
405 2013 10 0.446874
406 2013 11 0.526942
407 2013 12 0.599161
[408 rows x 3 columns]
Now to plot an individual year, you index it from monthly and plot the flow data. I use most of your axes formatting:
# make figure
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(12,6))
# plotting for one year
sub = monthly[monthly['year'] == 1980]
ax.plot(sub['month'], sub['flow'], color='purple', linestyle='--', label='1980')
# some formatting
ax.set_title('Monthly River Indus Discharge Comparison 1980-2013')
ax.set_ylabel('Discharge (m3/s)')
ax.set_xlabel('Month')
ax.set_xticks(range(1, 13))
ax.set_xticklabels(['J','F','M','A','M','J','J','A','S','O','N','D'])
ax.legend()
ax.grid()
Producing the following:
You could instead plot several years using a loop of some sort:
years = [1980, 1981, 1982, ...]
for year in years:
    sub = monthly[monthly['year'] == year]
    ax.plot(sub['month'], sub['flow'], ...)
You may run into some other challenges here (like finding a way to set nice styling for 30+ lines, and doing so in a loop). You can open a new post (building off of this one) if you can't find out how to accomplish something through other posts here. Best of luck!
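One possible way to handle the styling of 30+ lines in a loop is to draw each year's colour from a sequential colormap. The sketch below is only an illustration under assumptions: it reuses the fake random data and the monthly frame built in this answer, and the choice of the "viridis" colormap and the two-column legend are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Fake daily data, as in the answer above
dr = pd.date_range("1980-01-01", "2013-12-31", freq="D")
df = pd.DataFrame({"flow": np.random.rand(len(dr))}, index=dr)

# Mean flow per (year, month)
monthly = (df.groupby([df.index.year, df.index.month])
             .mean()
             .rename_axis(index=["year", "month"])
             .reset_index())

years = sorted(monthly["year"].unique())
cmap = plt.get_cmap("viridis")  # arbitrary choice of sequential colormap

fig, ax = plt.subplots(figsize=(12, 6))
for i, year in enumerate(years):
    sub = monthly[monthly["year"] == year]
    # Spread the years evenly over the colormap: early years dark, late years light
    color = cmap(i / (len(years) - 1))
    ax.plot(sub["month"], sub["flow"], color=color, label=str(year))

ax.set_xticks(range(1, 13))
ax.set_xlabel("Month")
ax.set_ylabel("Discharge (m3/s)")
ax.legend(ncol=2, fontsize="small", loc="upper left", bbox_to_anchor=(1, 1))
```

A colour gradient makes a trend over the years visible at a glance, at the cost of making any single year harder to pick out; highlighting a few years of interest with distinct colours is another option.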
I feel like this question has an obvious answer and I'm just being a bit of a fool. Say you have a couple of dataframes with datetime indices, where each dataframe is for a different year. In my case the index is every day going from June 25th to June 24th the next year:
date var
2019-06-25 107.230294
2019-06-26 104.110004
2019-06-27 104.291506
2019-06-28 111.162552
2019-06-29 112.515364
...
2020-06-20 132.840242
2020-06-21 127.641148
2020-06-22 132.797584
2020-06-23 129.094451
2020-06-24 110.408866
What I want is a single plot with multiple lines, where each line represents a year. The y-axis is my variable, var, and the x-axis should be day of the year. The x-axis should start from June 25th and end at June 24th.
This is what I've tried so far but it messes up the x-axis. Anyone know a more elegant way to do this?
fig, ax = plt.subplots()
plt.plot(average_prices19.index.strftime("%d/%m"), average_prices19.var, label = "2019-20")
plt.plot(average_prices20.index.strftime("%d/%m"), average_prices20.var, label = "2020-21")
plt.legend()
plt.show()
Well, there is a twist in this question: the list of dates in a year is not constant: on leap years there is a 'Feb-29' that is otherwise absent.
If you are comfortable glossing over this (and always representing a potential 'Feb-29' date on your plot, with missing data for non-leap years), then the following will achieve what you are seeking (assuming the data is in df with the date as a DatetimeIndex):
import matplotlib.dates as mdates
fig, ax = plt.subplots()
for label, dfy in df.assign(
    # note: 2000 is a leap year; the choice is deliberate
    date=pd.to_datetime(df.index.strftime('2000-%m-%d')),
    label=df.index.strftime('%Y')
).groupby('label'):
    dfy.set_index('date')['var'].plot(ax=ax, label=str(label))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%m-%d"))
ax.legend()
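The choice of 2000 as the dummy year can be checked directly. The sketch below uses three made-up dates, one of them a Feb-29, and remaps them onto a leap year and onto a non-leap year (2001, picked only for contrast); errors="coerce" is used so that the impossible date becomes NaT instead of raising.

```python
import pandas as pd

# Three illustrative dates, including a leap day
idx = pd.DatetimeIndex(["2019-06-25", "2020-02-29", "2020-06-24"])

# Remapped onto a leap year: every month/day combination stays valid
on_leap_year = pd.to_datetime(idx.strftime("2000-%m-%d"))

# Remapped onto a non-leap year: 2001-02-29 does not exist and is coerced to NaT
on_non_leap_year = pd.to_datetime(idx.strftime("2001-%m-%d"), errors="coerce")

print(on_leap_year)
print(on_non_leap_year)
```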
Update
For larger amounts of data, however, the above does not produce very legible x-axis labels. Instead, we can use ConciseDateFormatter to customize the display of the labels (and remove the fake year 2000):
import matplotlib.dates as mdates
fig, ax = plt.subplots()
for label, dfy in df.assign(
    # note: 2000 is a leap year; the choice is deliberate
    date=pd.to_datetime(df.index.strftime('2000-%m-%d')),
    label=df.index.strftime('%Y')
).groupby('label'):
    dfy.set_index('date')['var'].plot(ax=ax, label=str(label))
ax.legend()
locator = mdates.AutoDateLocator(minticks=3, maxticks=7)
formatter = mdates.ConciseDateFormatter(
locator,
formats=['', '%b', '%d', '%H:%M', '%H:%M', '%S.%f'],
offset_formats=['', '', '%b', '%b-%d', '%b-%d', '%b-%d %H:%M']
)
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(formatter)
For the data in your example:
For more data:
# setup
idx = pd.date_range('2016-01-01', 'now', freq='QS')
df = pd.DataFrame(
{'var': np.random.uniform(size=len(idx))},
index=idx).resample('D').interpolate(method='polynomial', order=5)
Corresponding plot (with ConciseDateFormatter):
I am trying to plot a simple pandas Series object; it's something like this:
2018-01-01 10
2018-01-02 90
2018-01-03 79
...
2020-01-01 9
2020-01-02 72
2020-01-03 65
It includes only the first month of each year, so it only contains the month January and all its values through the days.
When i try to plot it
# suppose the name of the series is dates_and_values
dates_and_values.plot()
It returns a plot like this (made using my current data)
It is clearly plotting by year and then month, so it looks squished and small, since I don't have any months other than January. Is there a way to plot it by year and day, so that the days are easier to observe?
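The x-axis is the index of the dataframe.
Dates are a continuous series, so the x-axis is continuous and leaves room for the months that have no data, which is what squishes your graph.
Changing the index to strings means the axis is no longer continuous, so the empty months between the Januaries take up no space.
I have generated some sample data that only has January to demonstrate: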
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
cf = pd.tseries.offsets.CustomBusinessDay(weekmask="Sun Mon Tue Wed Thu Fri Sat",
holidays=[d for d in pd.date_range("01-jan-1990",periods=365*50, freq="D")
if d.month!=1])
d = pd.date_range("01-jan-2015", periods=200, freq=cf)
df = pd.DataFrame({"Values":np.random.randint(20,70,len(d))}, index=d)
fig, ax = plt.subplots(2, figsize=[14,6])
df.set_index(df.index.strftime("%Y %d")).plot(ax=ax[0])
df.plot(ax=ax[1])
I suggest that you convert the series to a dataframe and then pivot it to get one column for each year. This lets you plot the data for each year with a separate line, either in the same plot using different colors or in subplots. Here is an example:
import numpy as np # v 1.19.2
import pandas as pd # v 1.2.3
# Create sample series
rng = np.random.default_rng(seed=123) # random number generator
dt = pd.date_range('2018-01-01', '2020-01-31', freq='D')
dt_jan = dt[dt.month == 1]
series = pd.Series(rng.integers(20, 90, size=dt_jan.size), index=dt_jan)
# Convert series to dataframe and pivot it
df_raw = series.to_frame()
df_pivot = df_raw.pivot_table(index=df_raw.index.day, columns=df_raw.index.year)
df = df_pivot.droplevel(axis=1, level=0)
df.head()
# Plot all years together in different colors
ax = df.plot(figsize=(10,4))
ax.set_xlim(1, 31)
ax.legend(frameon=False, bbox_to_anchor=(1, 0.65))
ax.set_xlabel('January', labelpad=10, size=12)
for spine in ['top', 'right']:
    ax.spines[spine].set_visible(False)
# Plot years separately
axs = df.plot(subplots=True, color='tab:blue', sharey=True,
figsize=(10,8), legend=None)
for ax in axs:
    ax.set_xlim(1, 31)
    ax.grid(axis='x', alpha=0.3)
    handles, labels = ax.get_legend_handles_labels()
    ax.text(28.75, 80, *labels, size=14)
    # Axes.is_last_row() was removed in newer matplotlib; compare with the last axes instead
    if ax is axs[-1]:
        ax.set_xlabel('January', labelpad=10, size=12)
    ax.figure.subplots_adjust(hspace=0)
I am trying to plot stacked yearly line graphs by months.
I have a dataframe df_year as below:
Day Number of Bicycle Hires
2010-07-30 6897
2010-07-31 5564
2010-08-01 4303
2010-08-02 6642
2010-08-03 7966
with the index set to the date going from 2010 July to 2017 July
I want to plot a line graph for each year with the xaxis being months from Jan to Dec and only the total sum per month is plotted
I have achieved this by converting the dataframe to a pivot table as below:
pt = pd.pivot_table(df_year, index=df_year.index.month, columns=df_year.index.year, aggfunc='sum')
This creates the pivot table as below which I can plot as show in the attached figure:
Number of Bicycle Hires 2010 2011 2012 2013 2014
1 NaN 403178.0 494325.0 565589.0 493870.0
2 NaN 398292.0 481826.0 516588.0 522940.0
3 NaN 556155.0 818209.0 504611.0 757864.0
4 NaN 673639.0 649473.0 658230.0 805571.0
5 NaN 722072.0 926952.0 749934.0 890709.0
plot showing yearly data with months on xaxis
The only problem is that the months show up as integers and I would like them to be shown as Jan, Feb .... Dec with each line representing one year. And I am unable to add a legend for each year.
I have tried the following code to achieve this:
dims = (15,5)
fig, ax = plt.subplots(figsize=dims)
ax.plot(pt)
months = MonthLocator(range(1, 13), bymonthday=1, interval=1)
monthsFmt = DateFormatter("%b '%y")
ax.xaxis.set_major_locator(months) #adding this makes the month ints disapper
ax.xaxis.set_major_formatter(monthsFmt)
handles, labels = ax.get_legend_handles_labels() #legend is nowhere on the plot
ax.legend(handles, labels)
Please can anyone help me out with this, what am I doing incorrectly here?
Thanks!
There is nothing in your legend handles and labels. Furthermore, the DateFormatter is not returning the right values, because the values you are formatting are not datetime objects.
You could set the index specifically for the dates, then drop the multiindex column level which is created by the pivot (the '0') and then use explicit ticklabels for the months whilst setting where they need to occur on your x-axis. As follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import datetime
# dummy data (Days)
dates_d = pd.date_range('2010-01-01', '2017-12-31', freq='D')
df_year = pd.DataFrame(np.random.randint(100, 200, (dates_d.shape[0], 1)), columns=['Data'])
df_year.index = dates_d #set index
pt = pd.pivot_table(df_year, index=df_year.index.month, columns=df_year.index.year, aggfunc='sum')
pt.columns = pt.columns.droplevel() # remove the double header (0) as pivot creates a multiindex.
ax = plt.figure().add_subplot(111)
ax.plot(pt)
ticklabels = [datetime.date(1900, item, 1).strftime('%b') for item in pt.index]
ax.set_xticks(np.arange(1,13))
ax.set_xticklabels(ticklabels) #add monthlabels to the xaxis
ax.legend(pt.columns.tolist(), loc='center left', bbox_to_anchor=(1, .5)) #add the column names as legend.
plt.tight_layout(rect=[0, 0, 0.85, 1])
plt.show()
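As a variation on the code above, the month names can come from the standard library's calendar module, and plotting through pandas labels each line with its column, so the legend entries do not need to be passed by hand. This is a sketch under assumptions: the data is random dummy data covering two years only, and calendar.month_abbr follows the current locale (English abbreviations in the default C locale).

```python
import calendar

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Dummy daily data for two years (random values, for illustration only)
dates_d = pd.date_range("2010-01-01", "2011-12-31", freq="D")
df_year = pd.DataFrame({"Data": np.random.randint(100, 200, len(dates_d))},
                       index=dates_d)

# Monthly sums: one row per month, one column per year
pt = pd.pivot_table(df_year, index=df_year.index.month,
                    columns=df_year.index.year, aggfunc="sum")
pt.columns = pt.columns.droplevel(0)  # drop the 'Data' level, keep the years

fig, ax = plt.subplots(figsize=(15, 5))
pt.plot(ax=ax)  # pandas labels each line with its column name (the year)

# Ticks at 1..12, labelled with abbreviated month names
ax.set_xticks(range(1, 13))
ax.set_xticklabels(list(calendar.month_abbr)[1:])
ax.legend(title="Year", loc="center left", bbox_to_anchor=(1, 0.5))
fig.tight_layout()
```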