In Pandas, generate DateTime index from Multi-Index with years and weeks - python

I have a DataFrame dfp with columns saledate (datetime, dtype <M8[ns]) and price (dtype int64), such that if I plot them like
fig, ax = plt.subplots()
ax.plot_date(dfp['saledate'],dfp['price']/1000.0,'.')
ax.set_xlabel('Date of sale')
ax.set_ylabel('Price (1,000 euros)')
I get a scatter plot which looks like below.
Since there are so many points that it is difficult to discern an average trend, I'd like to compute the average sale price per week, and plot that in the same plot. I've tried the following:
dfp_week = dfp.groupby([dfp['saledate'].dt.year, dfp['saledate'].dt.week]).mean()
If I plot the resulting 'price' column like this
plt.figure()
plt.plot(dfp_week['price'].values/1000.0)
plt.ylabel('Price (1,000 euros)')
I can more clearly discern an increasing trend (see below).
The problem is that I no longer have a time axis to plot this Series in the same plot as the previous figure. The index of the grouped DataFrame starts like this:
                   longitude_4pp  postal_code_4pp     price  rooms  \
saledate saledate
2014     1              4.873140           1067.5  206250.0    2.5
         6              4.954779           1102.0  129000.0    3.0
         26             4.938828           1019.0  327500.0    3.0
         40             4.896904           1073.0  249000.0    2.0
         43             4.938828           1019.0  549000.0    5.0
How could I convert this Multi-Index with years and weeks back to a single DateTime index that I can plot my per-week-averaged data against?

If you group using pd.Grouper (called pd.TimeGrouper in older pandas versions) you'll keep datetimes in your index.
dfp.groupby(pd.Grouper(key='saledate', freq='W')).mean()
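A minimal sketch of this weekly grouping on made-up data, assuming a recent pandas in which pd.TimeGrouper has been replaced by pd.Grouper. The sales frame and its values are illustrative only; the column names mirror the question.

```python
import pandas as pd

# Illustrative data only; the column names mirror the question.
sales = pd.DataFrame({
    "saledate": pd.to_datetime(
        ["2014-01-02", "2014-01-03", "2014-02-05", "2014-06-25"]
    ),
    "price": [200000, 212500, 129000, 327500],
})

# Group on the datetime column itself; weeks without any sale become NaN.
weekly = sales.groupby(pd.Grouper(key="saledate", freq="W"))["price"].mean()

# Keep only the weeks that actually contain sales.
weekly = weekly.dropna()
```

With freq="W" the resulting DatetimeIndex is labelled with the Sunday that ends each week, so weekly.index can be passed to the same axes as the raw points.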

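Create a new index from the (year, week) pairs of the grouped DataFrame (called df here):
i = pd.DatetimeIndex([pd.Timestamp(year=int(year), month=1, day=1) + pd.Timedelta(days=7 * int(weeks)) for year, weeks in df.index])
Then set this new index on the DataFrame:
df.index = i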
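Adding seven days per week to 1 January only approximates where a calendar week falls. As a hedged alternative sketch, assuming pandas 1.1+ (for Series.dt.isocalendar) and Python 3.8+ (for date.fromisocalendar), the grouping can use the ISO year and ISO week, and each pair can then be mapped to the Monday of that week. The data below is made up for illustration.

```python
import datetime

import pandas as pd

# Illustrative data only; the column names mirror the question.
sales = pd.DataFrame({
    "saledate": pd.to_datetime(
        ["2014-01-02", "2014-01-03", "2014-02-05", "2014-06-25"]
    ),
    "price": [200000, 212500, 129000, 327500],
})

# ISO year and ISO week belong together; the calendar year can differ
# from the ISO year for dates around New Year.
iso = sales["saledate"].dt.isocalendar()
weekly = sales.groupby(
    [iso["year"].astype(int), iso["week"].astype(int)]
)["price"].mean()

# Map every (ISO year, ISO week) pair to the Monday of that week.
weekly.index = pd.to_datetime(
    [datetime.date.fromisocalendar(int(y), int(w), 1) for y, w in weekly.index]
)
```

Note that the first label here is 2013-12-30, because ISO week 1 of 2014 starts on that Monday.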

For the sake of completeness, here are the details of how I implemented the solution suggested by piRSquared:
fig, ax = plt.subplots()
ax.plot_date(dfp['saledate'],dfp['price']/1000.0,'.')
ax.set_xlabel('Date of sale')
ax.set_ylabel('Price (1,000 euros)')
# pd.Grouper replaces pd.TimeGrouper, which newer pandas versions no longer provide
dfp_week = dfp.groupby(pd.Grouper(key='saledate', freq='W')).mean()
plt.plot_date(dfp_week.index, dfp_week['price']/1000.0)
which yields the plot below.

Related

Clustered x-axis with the dates not showing clearly

I'm trying to plot a time series with monthly dates from 1959 to 2019, and when I plot it I get a cluttered x-axis where the dates do not show properly. How can I remove the months and show only the years on the x-axis, so that it is less cluttered and the years are readable?
fig,ax = plt.subplots(2,1)
ax[0].hist(pca_function(sd_Data))
ax[0].set_ylabel ('Frequency')
ax[1].plot(pca_function(sd_Data))
ax[1].set_xlabel ('Years')
fig.suptitle('Histogram and Time series of Plot Factor')
plt.tight_layout()
# fig.savefig('factor1959.pdf')
pca_function(sd_Data)
comp_0
sasdate
1959-01 -0.418150
1959-02 1.341654
1959-03 1.684372
1959-04 1.981473
1959-05 1.242232
...
2019-08 -0.075270
2019-09 -0.402110
2019-10 -0.609002
2019-11 0.320586
2019-12 -0.303515
[732 rows x 1 columns]
From what I see, you do have years on your second subplot; they just overlap because there are too many of them placed horizontally. Try increasing figsize and rotating the ticks:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Builds an example dataframe.
df = pd.DataFrame(columns=['Years', 'Frequency'])
df['Years'] = pd.date_range(start='1/1/1959', end='1/1/2023', freq='M')
df['Frequency'] = np.random.normal(0, 1, size=(df.shape[0]))
fig, ax = plt.subplots(2,1, figsize=(20, 5))
ax[0].hist(df.Frequency)
ax[0].set_ylabel ('Frequency')
ax[1].plot(df.Years, df.Frequency)
ax[1].set_xlabel('Years')
# Rotate the tick labels of both subplots so they do not overlap.
for tick in ax[0].get_xticklabels():
    tick.set_rotation(45)
    tick.set_ha('right')
for tick in ax[1].get_xticklabels():
    tick.set_rotation(45)
    tick.set_ha('right')
fig.suptitle('Histogram and Time series of Plot Factor')
plt.tight_layout()
p.s. if the x-labels still overlap, try to increase your step size.
First off, you need to store the result of the call to pca_function in a variable, e.g. result_pca_func. That way, the calculations (and possible side effects or different randomization) happen only once.
Second, the dates should be converted to a datetime format. For example using pd.to_datetime(). That way, matplotlib can automatically put year ticks as appropriate.
Here is an example, starting from a dummy test dataframe:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': [f'{y}-{m:02d}' for y in range(1959, 2019) for m in range(1, 13)]})
df['Values'] = np.random.randn(len(df)).cumsum()
df = df.set_index('Date')
result_pca_func = df
result_pca_func.index = pd.to_datetime(result_pca_func.index)
fig, ax2 = plt.subplots(figsize=(10, 3))
ax2.plot(result_pca_func)
plt.tight_layout()
plt.show()
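Once the index holds real datetimes, the tick placement can also be set explicitly. The following is a sketch under the assumption that one labelled tick per decade is wanted; the dummy series stands in for the pca_function output, and the 10-year spacing is an arbitrary choice.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Dummy monthly series standing in for the PCA output (732 months).
idx = pd.date_range("1959-01-01", "2019-12-01", freq="MS")
values = np.random.default_rng(0).standard_normal(len(idx)).cumsum()

fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(idx, values)

# One major tick every 10 years, labelled with the four-digit year only.
ax.xaxis.set_major_locator(mdates.YearLocator(base=10))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
ax.set_xlabel("Years")
fig.tight_layout()
```

Call plt.show() or fig.savefig(...) afterwards to view the result; a smaller base gives denser year labels.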

Plotting a pandas Series using dates and values too squished

I am trying to plot a simple pandas Series object; it looks something like this:
2018-01-01 10
2018-01-02 90
2018-01-03 79
...
2020-01-01 9
2020-01-02 72
2020-01-03 65
It includes only the first month of each year, so it contains only January and the values for each of its days.
When I try to plot it
# suppose the name of the series is dates_and_values
dates_and_values.plot()
It returns a plot like this (made using my current data)
It is clearly plotting by year and then by month, so it looks squished and small, since I don't have any months other than January. Is there a way to plot it by year and day, so that the plot makes the days easier to observe?
The x-axis is the index of the dataframe.
Dates are a continuous series, so the x-axis is continuous and the months without data still take up space, which squishes your graph.
Changing the index to strings means it is no longer continuous, so the gaps between the Januaries disappear.
I have generated some sample data that only has January to demonstrate:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Custom "business day" calendar that treats every day outside January as a holiday.
cf = pd.tseries.offsets.CustomBusinessDay(weekmask="Sun Mon Tue Wed Thu Fri Sat",
                                          holidays=[d for d in pd.date_range("01-jan-1990", periods=365*50, freq="D")
                                                    if d.month != 1])
d = pd.date_range("01-jan-2015", periods=200, freq=cf)
df = pd.DataFrame({"Values":np.random.randint(20,70,len(d))}, index=d)
fig, ax = plt.subplots(2, figsize=[14,6])
# Top: string index, so the points are equally spaced.
df.set_index(df.index.strftime("%Y %d")).plot(ax=ax[0])
# Bottom: datetime index, so the empty months take up space.
df.plot(ax=ax[1])
I suggest that you convert the series to a dataframe and then pivot it to get one column for each year. This lets you plot the data for each year with a separate line, either in the same plot using different colors or in subplots. Here is an example:
import numpy as np # v 1.19.2
import pandas as pd # v 1.2.3
# Create sample series
rng = np.random.default_rng(seed=123) # random number generator
dt = pd.date_range('2018-01-01', '2020-01-31', freq='D')
dt_jan = dt[dt.month == 1]
series = pd.Series(rng.integers(20, 90, size=dt_jan.size), index=dt_jan)
# Convert series to dataframe and pivot it
df_raw = series.to_frame()
df_pivot = df_raw.pivot_table(index=df_raw.index.day, columns=df_raw.index.year)
df = df_pivot.droplevel(axis=1, level=0)
df.head()
# Plot all years together in different colors
ax = df.plot(figsize=(10,4))
ax.set_xlim(1, 31)
ax.legend(frameon=False, bbox_to_anchor=(1, 0.65))
ax.set_xlabel('January', labelpad=10, size=12)
for spine in ['top', 'right']:
    ax.spines[spine].set_visible(False)
# Plot years separately
axs = df.plot(subplots=True, color='tab:blue', sharey=True,
              figsize=(10,8), legend=None)
for ax in axs:
    ax.set_xlim(1, 31)
    ax.grid(axis='x', alpha=0.3)
    handles, labels = ax.get_legend_handles_labels()
    ax.text(28.75, 80, *labels, size=14)
    # Older matplotlib versions spelled this ax.is_last_row()
    if ax.get_subplotspec().is_last_row():
        ax.set_xlabel('January', labelpad=10, size=12)
ax.figure.subplots_adjust(hspace=0)

What is the best way to plot numerical Y axis, X axis Time series for a categorical variable in Python?

My DataFrame is in the format below:
Amount Category Transactiondatetime
9445 A16 22-04-2015 19:42
2000 A23 23-04-2015 16:29
1398 A16 02-05-2015 15:17
1995 A7 27-06-2015 13:51
2000 A23 07-08-2015 17:31
Variable Description
Assume the Category variable holds product categories sold on a website.
There are around 15-20 categories.
Some products were sold 20 times in a year, some 50 times, and so on, each for different amounts.
The time series is spread across the year and the data has 6,000,000 rows.
Aim of my task
I am interested in seeing which category brings in the largest amount during which part of the year. This can get messy because the data is huge and the categories will overlap on a time axis.
So what would be the best way to visualize this kind of data - it can be matplotlib, seaborn or bokeh or any other library.
Will appreciate example with code.
Maybe just use a bar graph with amount on the y-axis and time on the x-axis?
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('something.csv')
# The sample dates are day-first (e.g. 22-04-2015 19:42), so state the format explicitly.
df['Transactiondatetime'] = pd.to_datetime(df['Transactiondatetime'], format='%d-%m-%Y %H:%M')
categories = list(set(df['Category'].tolist()))
fig, ax = plt.subplots()
bar_width = 2.0
# Draw one set of bars per category so each gets its own legend entry.
for category in categories:
    cat_df = df[df['Category'] == category]
    times = cat_df['Transactiondatetime'].tolist()
    values = cat_df['Amount'].tolist()
    ax.bar(times, values, bar_width, label=category)
ax.legend()
plt.xlabel('Transaction Date')
plt.ylabel('Amount')
plt.gcf().autofmt_xdate()
plt.show()
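With millions of rows, drawing one bar per transaction is likely to be slow and hard to read. One possible alternative, sketched here under the assumption that monthly totals per category are detailed enough, is to aggregate first and plot the small summary table. The five rows are the sample from the question; the names month, monthly and top_category are introduced here for illustration.

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "Amount": [9445, 2000, 1398, 1995, 2000],
    "Category": ["A16", "A23", "A16", "A7", "A23"],
    "Transactiondatetime": [
        "22-04-2015 19:42", "23-04-2015 16:29", "02-05-2015 15:17",
        "27-06-2015 13:51", "07-08-2015 17:31",
    ],
})
df["Transactiondatetime"] = pd.to_datetime(
    df["Transactiondatetime"], format="%d-%m-%Y %H:%M"
)

# Total amount per calendar month and category: one row per month,
# one column per category, 0 where a category had no sales.
month = df["Transactiondatetime"].dt.to_period("M").rename("Month")
monthly = df.groupby([month, "Category"])["Amount"].sum().unstack(fill_value=0)

# Category with the largest total in each month.
top_category = monthly.idxmax(axis=1)

# Stacked bars: one bar per month, one colour per category.
ax = monthly.plot(kind="bar", stacked=True)
ax.set_xlabel("Month")
ax.set_ylabel("Amount")
```

The aggregated table has at most one row per month, so the plot stays readable regardless of how many transactions went into it.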

Add months to xaxis and legend on a matplotlib line plot

I am trying to plot stacked yearly line graphs by months.
I have a dataframe df_year as below:
Day Number of Bicycle Hires
2010-07-30 6897
2010-07-31 5564
2010-08-01 4303
2010-08-02 6642
2010-08-03 7966
with the index set to the date going from 2010 July to 2017 July
I want to plot a line graph for each year with the xaxis being months from Jan to Dec and only the total sum per month is plotted
I have achieved this by converting the dataframe to a pivot table as below:
pt = pd.pivot_table(df_year, index=df_year.index.month, columns=df_year.index.year, aggfunc='sum')
This creates the pivot table below, which I can plot as shown in the attached figure:
Number of Bicycle Hires 2010 2011 2012 2013 2014
1 NaN 403178.0 494325.0 565589.0 493870.0
2 NaN 398292.0 481826.0 516588.0 522940.0
3 NaN 556155.0 818209.0 504611.0 757864.0
4 NaN 673639.0 649473.0 658230.0 805571.0
5 NaN 722072.0 926952.0 749934.0 890709.0
[Figure: plot showing yearly data with months on the x-axis]
The only problem is that the months show up as integers and I would like them to be shown as Jan, Feb .... Dec with each line representing one year. And I am unable to add a legend for each year.
I have tried the following code to achieve this:
dims = (15,5)
fig, ax = plt.subplots(figsize=dims)
ax.plot(pt)
months = MonthLocator(range(1, 13), bymonthday=1, interval=1)
monthsFmt = DateFormatter("%b '%y")
ax.xaxis.set_major_locator(months) #adding this makes the month ints disapper
ax.xaxis.set_major_formatter(monthsFmt)
handles, labels = ax.get_legend_handles_labels() #legend is nowhere on the plot
ax.legend(handles, labels)
Please can anyone help me out with this, what am I doing incorrectly here?
Thanks!
There is nothing in your legend handles and labels. Furthermore, the DateFormatter does not return the right values, because the x-values you are formatting are not datetime objects.
You could set the index explicitly to the dates, then drop the extra column level created by the pivot (the '0'), and then use explicit tick labels for the months while setting where they need to occur on your x-axis, as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import datetime
# dummy data (Days)
dates_d = pd.date_range('2010-01-01', '2017-12-31', freq='D')
df_year = pd.DataFrame(np.random.randint(100, 200, (dates_d.shape[0], 1)), columns=['Data'])
df_year.index = dates_d #set index
pt = pd.pivot_table(df_year, index=df_year.index.month, columns=df_year.index.year, aggfunc='sum')
pt.columns = pt.columns.droplevel() # remove the double header (0) as pivot creates a multiindex.
ax = plt.figure().add_subplot(111)
ax.plot(pt)
ticklabels = [datetime.date(1900, item, 1).strftime('%b') for item in pt.index]
ax.set_xticks(np.arange(1,13))
ax.set_xticklabels(ticklabels) #add monthlabels to the xaxis
ax.legend(pt.columns.tolist(), loc='center left', bbox_to_anchor=(1, .5)) #add the column names as legend.
plt.tight_layout(rect=[0, 0, 0.85, 1])
plt.show()
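A shorter variant of the same idea, offered as a sketch rather than a replacement: if the table is plotted through pandas, the legend is filled in from the column labels automatically, and calendar.month_abbr from the standard library supplies the month names. The constant dummy series below (one hire per day) is an assumption made only to keep the example checkable.

```python
import calendar

import pandas as pd

# Dummy daily data: one hire per day, starting mid-2010 like the question.
days = pd.date_range("2010-07-30", "2012-12-31", freq="D")
hires = pd.Series(1, index=days)

# Rows are months 1..12, columns are years, values are monthly sums.
pt = hires.groupby([hires.index.month, hires.index.year]).sum().unstack()

# pandas labels each line with its column name, so the legend lists the years.
ax = pt.plot(figsize=(15, 5))
ax.set_xticks(list(range(1, 13)))
ax.set_xticklabels(list(calendar.month_abbr)[1:])  # Jan ... Dec (locale dependent)
```

Months with no data (January to June 2010 here) stay NaN and are simply not drawn.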

Formatting X axis labels Pandas time series plot

I am trying to plot a DataFrame of multiple time series in pandas. Each series covers one year of daily points (length 365). The figure comes out all right, but I want to suppress the year tick label shown on the x-axis.
Specifically, I want to suppress the 1950 label shown in the left corner of the x-axis. Can anybody suggest something? My code:
dates = pandas.date_range('1950-01-01', '1950-12-31', freq='D')
data_to_plot12 = pandas.DataFrame(data=data_array, # values
index=homo_regions) # 1st column as index
dataframe1 = pandas.DataFrame.transpose(data_to_plot12)
dataframe1.index = dates
ax = dataframe1.plot(lw=1.5, marker='.', markersize=2, title='PRECT time series PI Slb Ocn CNTRL 60 years')
ax.set(xlabel="Months", ylabel="PRECT (mm/day)")
fig_name = 'dataframe1.pdf'
plt.savefig(fig_name)
You should be able to specify the xaxis major formatter like so
import matplotlib.dates as mdates
...
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
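A self-contained sketch of that formatter in use, with random data and the hypothetical column names region_a and region_b standing in for the real regions. It plots through matplotlib directly; as far as I know, DataFrame.plot may apply its own date tick handling, in which case passing x_compat=True to it is one way to keep matplotlib's date units.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# One year of daily dummy values for two made-up regions.
dates = pd.date_range("1950-01-01", "1950-12-31", freq="D")
frame = pd.DataFrame(
    np.random.default_rng(0).random((len(dates), 2)),
    index=dates,
    columns=["region_a", "region_b"],
)

fig, ax = plt.subplots()
ax.plot(frame.index, frame.to_numpy(), lw=1.5)

# One tick per month, labelled with the abbreviated month name only,
# so no year label appears on the axis.
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
ax.set(xlabel="Months", ylabel="PRECT (mm/day)")
```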
