pandas crosstab plotting not showing intervals that have zero count

file_path is an Excel file with a column 'Year' of year numbers ranging from 1940 to 2018 and another column 'Divide Year 1976' indicating Pre-1976 or 1976-Present.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the Excel sheet as a pandas DataFrame
data = pd.read_excel(file_path, sheet_name=5, skiprows=1)
data_frame = pd.DataFrame(data)
# Add a column that bins 'Year' into 10-year intervals from 1930 to 2020
data_frame['bin Year'] = pd.cut(data_frame.Year, bins=np.arange(1930, 2030, 10, dtype=int))
# Plot stacked bar plot
color_table = pd.crosstab(index=data_frame['bin Year'], columns=data_frame['Divide Year 1976'])
color_table.plot(kind='bar', figsize=(6.5, 3.5), stacked=True, legend=None, edgecolor='black')
# Add xticks
plt.xticks(locs, ['1930s','1940s','1950s','1960s','1970s','1980s','1990s','2000s','2010s'], fontsize=8, rotation=45)
The problem here is that color_table.plot() automatically leaves out any interval that has zero counts, which in my case is 1940-1950. How can I force the code to display intervals that have zero counts?

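Use the dropna parameter of pd.crosstab. Because 'bin Year' is categorical (it is the output of pd.cut), passing dropna=False keeps the intervals whose counts are all zero.
color_table = pd.crosstab(
    index=data_frame['bin Year'],
    columns=data_frame['Divide Year 1976'],
    dropna=False)
See the pandas.crosstab documentation.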
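To see the effect without the original spreadsheet, here is a minimal sketch on made-up data. The years and the way the 'Divide Year 1976' labels are derived are assumptions for illustration only; the bins and the crosstab call mirror the ones above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: no year falls in (1930, 1940] or (1940, 1950]
data_frame = pd.DataFrame({"Year": [1951, 1963, 1975, 1976, 1988, 1999, 2005, 2018]})
data_frame["Divide Year 1976"] = np.where(
    data_frame["Year"] < 1976, "Pre-1976", "1976-Present"
)
data_frame["bin Year"] = pd.cut(data_frame["Year"], bins=np.arange(1930, 2030, 10))

color_table = pd.crosstab(
    index=data_frame["bin Year"],
    columns=data_frame["Divide Year 1976"],
    dropna=False,  # keep intervals that contain no rows
)
print(color_table)
```

The table should then contain one row per interval, including the empty ones, so a stacked bar plot of it has a slot for every decade.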

Related

Plotting a pandas Series using dates and values too squished

I am trying to plot a simple pandas Series object that looks something like this:
2018-01-01 10
2018-01-02 90
2018-01-03 79
...
2020-01-01 9
2020-01-02 72
2020-01-03 65
It includes only the first month of each year, so it contains only January and its daily values.
When I try to plot it
# suppose the name of the series is dates_and_values
dates_and_values.plot()
it returns a plot like this (made using my current data)
It is clearly plotted along a continuous date axis, so it looks squished and small, since I have no months other than January. Is there a way to plot it by year and day so that the days are easier to observe?

the x-axis is the index of the dataframe
dates are a continuous series, so the x-axis is continuous
changing the index to strings means the axis is no longer continuous, so the empty months between Januaries are squeezed out of the graph
I have generated some sample data that contains only January to demonstrate
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# custom "business day" that treats every non-January day as a holiday
cf = pd.tseries.offsets.CustomBusinessDay(
    weekmask="Sun Mon Tue Wed Thu Fri Sat",
    holidays=[d for d in pd.date_range("01-jan-1990", periods=365*50, freq="D")
              if d.month != 1])
d = pd.date_range("01-jan-2015", periods=200, freq=cf)
df = pd.DataFrame({"Values": np.random.randint(20, 70, len(d))}, index=d)
fig, ax = plt.subplots(2, figsize=[14, 6])
# top: string index (year and day), bottom: continuous datetime index
df.set_index(df.index.strftime("%Y %d")).plot(ax=ax[0])
df.plot(ax=ax[1])
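The string-index idea can also be sketched without the custom offset, by filtering an ordinary daily range down to January. The variable names and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical January-only series covering three years
dates = pd.date_range("2018-01-01", "2020-01-31", freq="D")
jan = dates[dates.month == 1]
dates_and_values = pd.Series(np.arange(len(jan)), index=jan)

# Replace the datetime index with "year day" strings, so the points are
# plotted at equally spaced positions instead of along a continuous date axis
labeled = dates_and_values.copy()
labeled.index = dates_and_values.index.strftime("%Y %d")
```

Calling labeled.plot() should then draw the three Januaries back to back, at the cost of losing true date spacing on the axis.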

I suggest that you convert the series to a dataframe and then pivot it to get one column for each year. This lets you plot the data for each year with a separate line, either in the same plot using different colors or in subplots. Here is an example:
import numpy as np # v 1.19.2
import pandas as pd # v 1.2.3
# Create sample series
rng = np.random.default_rng(seed=123) # random number generator
dt = pd.date_range('2018-01-01', '2020-01-31', freq='D')
dt_jan = dt[dt.month == 1]
series = pd.Series(rng.integers(20, 90, size=dt_jan.size), index=dt_jan)
# Convert series to dataframe and pivot it
df_raw = series.to_frame()
df_pivot = df_raw.pivot_table(index=df_raw.index.day, columns=df_raw.index.year)
df = df_pivot.droplevel(axis=1, level=0)
df.head()
# Plot all years together in different colors
ax = df.plot(figsize=(10,4))
ax.set_xlim(1, 31)
ax.legend(frameon=False, bbox_to_anchor=(1, 0.65))
ax.set_xlabel('January', labelpad=10, size=12)
for spine in ['top', 'right']:
    ax.spines[spine].set_visible(False)
# Plot years separately
axs = df.plot(subplots=True, color='tab:blue', sharey=True,
              figsize=(10,8), legend=None)
for ax in axs:
    ax.set_xlim(1, 31)
    ax.grid(axis='x', alpha=0.3)
    handles, labels = ax.get_legend_handles_labels()
    ax.text(28.75, 80, *labels, size=14)
    # Axes.is_last_row() is gone from newer Matplotlib releases; ask the subplot spec instead
    if ax.get_subplotspec().is_last_row():
        ax.set_xlabel('January', labelpad=10, size=12)
ax.figure.subplots_adjust(hspace=0)
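As a possible alternative to pivot_table plus droplevel, the same day-by-year layout can be sketched with groupby and unstack. This is a variation, not the code used above, and it assumes the same kind of January-only sample series.

```python
import numpy as np
import pandas as pd

# Same kind of sample series as above: daily values for January 2018-2020
rng = np.random.default_rng(seed=123)
dt = pd.date_range("2018-01-01", "2020-01-31", freq="D")
dt_jan = dt[dt.month == 1]
series = pd.Series(rng.integers(20, 90, size=dt_jan.size), index=dt_jan)

# Group by (day of month, year) and move the year level into the columns
wide = series.groupby([series.index.day, series.index.year]).mean().unstack()
```

Here wide has one row per day of the month and one column per year, so wide.plot() draws one line per year.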

Python convert the day of year to month on an axis

I have a time series that I would like to plot year on year. I want the data to be daily, but the axis to show each month as "Jan", "Feb" etc.
At the moment I can get the daily data, BUT the axis is 1-366 (the day of the year).
Or I can get the monthly axis as 1, 2, 3 etc (by changing the index to df.index.month), BUT then the data is monthly.
How can I convert the day of year axis into months? Or how can I do this?
Code showing the daily data, but the axis is wrong:
# import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create fake time series dataframe
index = pd.date_range(start='01-Jan-2012', end='31-12-2018', freq='D')
data = np.random.randn(len(index))
df = pd.DataFrame(data, index, columns=['Data'])
# pivot to get by day in rows, then year in columns
df_pivot = pd.pivot_table(df, index=df.index.dayofyear, columns=df.index.year, values='Data')
df_pivot.plot()
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

This can be done using the xticks function. Simply add the following code before plt.show():
plt.xticks(np.linspace(0,365,13)[:-1], ('Jan', 'Feb' ... 'Nov', 'Dec'))
Or the following to have the month names appear in the middle of the month:
plt.xticks(np.linspace(15,380,13)[:-1], ('Jan', 'Feb' ... 'Nov', 'Dec'))
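The evenly spaced positions above are approximations of the month boundaries. A small sketch of computing the exact day of year on which each month starts follows; the choice of 2019 as a non-leap reference year is an assumption, and the label list relies on the default (English) locale.

```python
import calendar
import pandas as pd

# Day of year on which each month starts, for a non-leap reference year
month_starts = pd.date_range("2019-01-01", periods=12, freq="MS").dayofyear.tolist()
# Abbreviated month names; month_abbr[0] is an empty string, so skip it
month_labels = list(calendar.month_abbr)[1:]

# With a plot whose x-axis is the day of year, the ticks could then be set with:
# plt.xticks(month_starts, month_labels)
```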

It may be more straightforward to simply add a datetime index to your pivoted dataframe.
df_pivot.index = pd.date_range(
df.index.max() - pd.Timedelta(days=df_pivot.shape[0]),
freq='D', periods=df_pivot.shape[0])
df_pivot.plot()
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
The resulting plot has the axis as desired:
This method also has the advantage over the accepted answer of working irrespective of your start and end date. For example, if you change your index's end date to end='30-Jun-2018', the axis adapts nicely to fit the data:

Add months to xaxis and legend on a matplotlib line plot

I am trying to plot stacked yearly line graphs by months.
I have a dataframe df_year as below:
Day Number of Bicycle Hires
2010-07-30 6897
2010-07-31 5564
2010-08-01 4303
2010-08-02 6642
2010-08-03 7966
with the index set to the date going from 2010 July to 2017 July
I want to plot a line graph for each year with the xaxis being months from Jan to Dec and only the total sum per month is plotted
I have achieved this by converting the dataframe to a pivot table as below:
pt = pd.pivot_table(df_year, index=df_year.index.month, columns=df_year.index.year, aggfunc='sum')
This creates the pivot table below, which I can plot as shown in the attached figure:
   Number of Bicycle Hires
       2010      2011      2012      2013      2014
1       NaN  403178.0  494325.0  565589.0  493870.0
2       NaN  398292.0  481826.0  516588.0  522940.0
3       NaN  556155.0  818209.0  504611.0  757864.0
4       NaN  673639.0  649473.0  658230.0  805571.0
5       NaN  722072.0  926952.0  749934.0  890709.0
[Figure: plot showing yearly data with months on the x-axis]
The only problem is that the months show up as integers and I would like them to be shown as Jan, Feb .... Dec with each line representing one year. And I am unable to add a legend for each year.
I have tried the following code to achieve this:
from matplotlib.dates import MonthLocator, DateFormatter
dims = (15,5)
fig, ax = plt.subplots(figsize=dims)
ax.plot(pt)
months = MonthLocator(range(1, 13), bymonthday=1, interval=1)
monthsFmt = DateFormatter("%b '%y")
ax.xaxis.set_major_locator(months) # adding this makes the month ints disappear
ax.xaxis.set_major_formatter(monthsFmt)
handles, labels = ax.get_legend_handles_labels() # legend is nowhere on the plot
ax.legend(handles, labels)
Please can anyone help me out with this, what am I doing incorrectly here?
Thanks!

There is nothing in your legend handles and labels. Furthermore, the DateFormatter is not returning the right values because the values you are translating are plain integers, not datetime objects.
You could set the index specifically for the dates, then drop the extra column level that the pivot creates (the '0'), and then use explicit tick labels for the months while setting where they need to occur on your x-axis, as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import datetime
# dummy data (Days)
dates_d = pd.date_range('2010-01-01', '2017-12-31', freq='D')
df_year = pd.DataFrame(np.random.randint(100, 200, (dates_d.shape[0], 1)), columns=['Data'])
df_year.index = dates_d #set index
pt = pd.pivot_table(df_year, index=df_year.index.month, columns=df_year.index.year, aggfunc='sum')
pt.columns = pt.columns.droplevel() # remove the double header (0) as pivot creates a multiindex.
ax = plt.figure().add_subplot(111)
ax.plot(pt)
ticklabels = [datetime.date(1900, item, 1).strftime('%b') for item in pt.index]
ax.set_xticks(np.arange(1,13))
ax.set_xticklabels(ticklabels) #add monthlabels to the xaxis
ax.legend(pt.columns.tolist(), loc='center left', bbox_to_anchor=(1, .5)) #add the column names as legend.
plt.tight_layout(rect=[0, 0, 0.85, 1])
plt.show()
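The extra column level only appears because pivot_table is called without a values argument. A minimal sketch, assuming the same dummy 'Data' column (filled with ones here so the sums are easy to check), shows that passing values avoids the droplevel step:

```python
import numpy as np
import pandas as pd

# Dummy daily data: every day counts as 1, so monthly sums equal days per month
dates_d = pd.date_range("2010-01-01", "2017-12-31", freq="D")
df_year = pd.DataFrame({"Data": np.ones(len(dates_d), dtype=int)}, index=dates_d)

# Naming the value column yields single-level columns (one per year)
pt = pd.pivot_table(
    df_year,
    values="Data",
    index=df_year.index.month,
    columns=df_year.index.year,
    aggfunc="sum",
)
```

With single-level columns, pt.plot() labels each line with its year in the legend without further work.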

Superposing Pandas time series from different years in Seaborn plot

I have a Pandas data frame, and I want to explore the periodicity, trend, etc of the time series. Here is the data.
To visualize it, I want to superpose the "sub time series" for each year on the same plot (ie have the same x coordinate for data from 01/01/2000, 01/01/2001 and 01/01/2002).
Do I have to transform my date column so that each data has the same year?
Does anyone have an idea of how to do that?

Setup
This parses the data that you linked
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(
    'data.csv', sep=';', decimal=',',
    usecols=['date', 'speed', 'height', 'width'],
    index_col=0, parse_dates=[0]
)
My Hack
I replaced the year in every date with 2012, because 2012 is a leap year and will accommodate Feb-29. I split the original year into another level of a multi-index, then unstack and plot
idx = pd.MultiIndex.from_arrays([
pd.to_datetime(df.index.strftime('2012-%m-%d %H:%M:%S')),
df.index.year
])
ax = df.set_index(idx).unstack().speed.plot()
lg = ax.legend(bbox_to_anchor=(1.05, 1), loc=2, ncol=2)
In an effort to pretty this up
fig, axes = plt.subplots(3, 1, figsize=(15, 9))
idx = pd.MultiIndex.from_arrays([
pd.to_datetime(df.index.strftime('2012-%m-%d %H:%M:%S')),
df.index.year
])
d1 = df.set_index(idx).unstack().resample('W').mean()
d1.speed.plot(ax=axes[0], title='speed')
lg = axes[0].legend(bbox_to_anchor=(1.02, 1), loc=2, ncol=1)
d1.height.plot(ax=axes[1], title='height', legend=False)
d1.width.plot(ax=axes[2], title='width', legend=False)
fig.tight_layout()
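Since the linked data file is not reproduced here, the same re-dating trick can be sketched on synthetic daily data. The 'speed' values and date range below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data covering three years (2000 is a leap year)
idx = pd.date_range("2000-01-01", "2002-12-31", freq="D")
df = pd.DataFrame({"speed": np.arange(len(idx), dtype=float)}, index=idx)

# Map every date onto the leap year 2012 and keep the real year as a second level
common = pd.to_datetime(df.index.strftime("2012-%m-%d"))
multi = pd.MultiIndex.from_arrays([common, df.index.year])

# One column per year, all sharing the same 2012-based date axis
by_year = df.set_index(multi).speed.unstack()
```

Years without a Feb-29 simply get a missing value on that row, which line plots skip.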

One way you could do it is to create a common x-axis for all years like this (assuming date is a regular datetime column rather than the index):
df['yeartime'] = df.groupby(df.date.dt.year).cumcount()
where 'yeartime' is the running count of measurements within each year. Next, create a year column:
df['year'] = df.date.dt.year
Now, let's subset our data for Jan 1st of the years 2000, 2001, and 2002
subset_df = df.loc[df.date.dt.year.isin([2000, 2001, 2002]) & (df.date.dt.day == 1) & (df.date.dt.month == 1)]
And lastly, plot it.
import seaborn as sns
ax = sns.pointplot(x='yeartime', y='speed', hue='year', data=subset_df, markers='None')
_ = ax.get_xaxis().set_ticks([])
In Pandas, generate DateTime index from Multi-Index with years and weeks

I have a DataFrame dfp with columns saledate (datetime, dtype <M8[ns]) and price (dtype int64), such that if I plot them like
fig, ax = plt.subplots()
ax.plot_date(dfp['saledate'],dfp['price']/1000.0,'.')
ax.set_xlabel('Date of sale')
ax.set_ylabel('Price (1,000 euros)')
I get a scatter plot which looks like below.
Since there are so many points that it is difficult to discern an average trend, I'd like to compute the average sale price per week, and plot that in the same plot. I've tried the following:
dfp_week = dfp.groupby([dfp['saledate'].dt.year, dfp['saledate'].dt.week]).mean()
If I plot the resulting 'price' column like this
plt.figure()
plt.plot(dfp_week['price'].values/1000.0)
plt.ylabel('Price (1,000 euros)')
I can more clearly discern an increasing trend (see below).
The problem is that I no longer have a time axis to plot this DataSeries in the same plot as the previous figure. The time axis starts like this:
                   longitude_4pp  postal_code_4pp     price  rooms  \
saledate saledate
2014     1              4.873140           1067.5  206250.0    2.5
         6              4.954779           1102.0  129000.0    3.0
         26             4.938828           1019.0  327500.0    3.0
         40             4.896904           1073.0  249000.0    2.0
         43             4.938828           1019.0  549000.0    5.0
How could I convert this Multi-Index with years and weeks back to a single DateTime index that I can plot my per-week-averaged data against?

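If you group using pd.Grouper (called pd.TimeGrouper in older pandas versions, and since removed) you'll keep datetimes in your index. Since saledate is a column rather than the index, pass it as the key:
dfp.groupby(pd.Grouper(key='saledate', freq='W')).mean()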

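Create a new index from the (year, week) pairs of the grouped DataFrame:
# pd.datetime has been removed from recent pandas; pd.Timestamp replaces it
i = pd.DatetimeIndex([pd.Timestamp(year=int(year), month=1, day=1) + pd.Timedelta(days=7 * int(weeks)) for year, weeks in df.index])
Then set this new index on the DataFrame:
df.index = i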
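The arithmetic above treats week n as starting 7 * n days after January 1st, which is only approximate. If the week numbers are ISO weeks (an assumption; that is what .dt.isocalendar().week returns), a possible sketch using the standard library's date.fromisocalendar (Python 3.8+) maps each pair to the Monday of that week. The index values below are made up.

```python
from datetime import date

import pandas as pd

# Hypothetical (year, ISO week) index like the one produced by the groupby
idx = pd.MultiIndex.from_tuples(
    [(2014, 1), (2014, 6), (2014, 26)], names=["year", "week"]
)

# Monday (ISO weekday 1) of each ISO week, as a DatetimeIndex
week_starts = pd.to_datetime(
    [date.fromisocalendar(int(y), int(w), 1) for y, w in idx]
)
```

Note that ISO week 1 can begin in the previous calendar year, as it does for 2014.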

For the sake of completeness, here are the details of how I implemented the solution suggested by piRSquared:
fig, ax = plt.subplots()
ax.plot_date(dfp['saledate'],dfp['price']/1000.0,'.')
ax.set_xlabel('Date of sale')
ax.set_ylabel('Price (1,000 euros)')
dfp_week = dfp.groupby(pd.Grouper(key='saledate', freq='W')).mean()  # pd.Grouper replaces the removed pd.TimeGrouper
plt.plot_date(dfp_week.index, dfp_week['price']/1000.0)
which yields the plot below.
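For readers without the sales data, a self-contained sketch of the weekly grouping follows. The dates and prices are randomly generated stand-ins, and only the price column is averaged.

```python
import numpy as np
import pandas as pd

# Hypothetical sales: one sale per day with a random price
rng = np.random.default_rng(0)
dfp = pd.DataFrame({
    "saledate": pd.date_range("2014-01-01", periods=120, freq="D"),
    "price": rng.integers(100_000, 500_000, size=120),
})

# Weekly mean price; the result is indexed by week-ending dates (Sundays),
# so it can share a time axis with the raw scatter plot
dfp_week = dfp.groupby(pd.Grouper(key="saledate", freq="W"))["price"].mean()
```

Because dfp_week.index is a DatetimeIndex, it can be passed straight to a plotting call alongside dfp['saledate'].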
