I'm trying to plot a stacked bar chart of counts over 5-year periods:
import pandas as pd
categorical = ["RL","CD(others)","DL","ML","ML","ML","DL","ML","DL","DL"]
year = [2014,2014,2015,2015,2016,2017,2019,2021,2022,2022]
df = pd.DataFrame({'year':year,
'keywords':categorical})
df
I tried the approaches from several related posts to resolve the problem:
#solution1:Pivot table
df.pivot_table(index='year',
columns='keywords',
# values='paper_count',
aggfunc='sum')
#df.plot(x='year', y='paper_count', kind='bar')
#solution2: groupby
# reset_index() gives a column for counting after groupby uses year and category
ctdf = (df.reset_index()
.groupby(['year'], as_index=False)
.count()
# rename isn't strictly necessary here; it's just for readability
.rename(columns={'index':'paper_count'})
)
ctdf.plot(x='year', y='paper_count', kind='bar')
In the end, I couldn't figure out how to plot this by period, counting over every 5 years:
2000-2005, 2005-2010, 2010-2015, 2015-2020, 2020-2025.
expected output:
I don't fully follow the logic, assuming the expected output is supposed to match the data, but you can use pandas.cut to form the bins and then count the rows per bin and keyword. The code below plots a simple count per period; if you want a cumulative count instead, add a cumsum step (e.g. after unstack):
years = list(range(2000, 2030, 5))
# [2000, 2005, 2010, 2015, 2020, 2025]
labels = [f'{a}-{b}' for a,b in zip(years, years[1:])]
# ['2000-2005', '2005-2010', '2010-2015', '2015-2020', '2020-2025']
(df.assign(year=pd.cut(df['year'], bins=years, labels=labels))
.groupby(['year', 'keywords'])['year'].count()
.unstack()
.plot.bar(stacked=True)
)
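One detail worth checking in the binning above: pandas.cut closes bins on the right by default, so 2015 is counted in '2010-2015' rather than '2015-2020'. Here is a minimal sketch, assuming left-closed bins are what is wanted. The `right=False` choice and the `period` and `counts` names are my assumptions, not part of the answer above:

```python
import pandas as pd

categorical = ["RL", "CD(others)", "DL", "ML", "ML", "ML", "DL", "ML", "DL", "DL"]
year = [2014, 2014, 2015, 2015, 2016, 2017, 2019, 2021, 2022, 2022]
df = pd.DataFrame({"year": year, "keywords": categorical})

edges = list(range(2000, 2030, 5))
labels = [f"{a}-{b}" for a, b in zip(edges, edges[1:])]

# Default (right=True): bins look like (2010, 2015], so 2015 lands in '2010-2015'.
right_closed = pd.cut(df["year"], bins=edges, labels=labels)
# right=False: bins look like [2015, 2020), so 2015 lands in '2015-2020'.
period = pd.cut(df["year"], bins=edges, labels=labels, right=False)

# observed=False keeps the periods without any papers as all-zero rows.
counts = (
    df.groupby([period, "keywords"], observed=False)
      .size()
      .unstack(fill_value=0)
)
```

Calling `counts.plot.bar(stacked=True)` on the result should then draw one bar per period, including the empty ones.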
With the red line:
years = list(range(2000, 2030, 5))
# [2000, 2005, 2010, 2015, 2020, 2025]
labels = [f'{a}-{b}' for a,b in zip(years, years[1:])]
# ['2000-2005', '2005-2010', '2010-2015', '2015-2020', '2020-2025']
df2 = (df
.assign(year=pd.cut(df['year'], bins=years, labels=labels))
.groupby(['year', 'keywords'])['year'].count()
.unstack()
)
ax = df2.plot.bar(stacked=True)
# adding arbitrary shift (0.1)
df2.sum(axis=1).add(0.1).plot(ax=ax, color='red', marker='s', label='paper count')
ax.legend()
output:
file_path points to an Excel file with a 'Year' column of year numbers ranging from 1940 to 2018 and a 'Divide Year 1976' column indicating Pre-1976 or 1976-Present.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the Excel file as a pandas data_frame
data = pd.read_excel(file_path, sheet_name=5, skiprows=1)
data_frame = pd.DataFrame(data)
# create an extra column in data_frame with bin from 1930 to 2020 with 10 years interval
data_frame['bin Year'] = pd.cut(data_frame.Year, bins=np.arange(1930, 2030, 10, dtype=int))
# Plot stacked bar plot
color_table = pd.crosstab(index=data_frame['bin Year'], columns=data_frame['Divide Year 1976'])
color_table.plot(kind='bar', figsize=(6.5, 3.5), stacked=True, legend=None, edgecolor='black')
# Add xticks (locs was not defined in the snippet; one tick per decade bin is assumed here)
locs = np.arange(9)
plt.xticks(locs, ['1930s','1940s','1950s','1960s','1970s','1980s','1990s','2000s','2010s'], fontsize=8, rotation=45)
The problem here is that color_table.plot() automatically omits any interval that has 0 counts, which in my case is 1940-1950. How can I force the code to display bars for intervals that have zero counts?
Use the dropna parameter of crosstab:
color_table = pd.crosstab(
index=data_frame['bin Year'],
columns=data_frame['Divide Year 1976'],
dropna=False)
See the pandas.crosstab documentation.
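To see the effect without the Excel file, here is a small sketch on made-up data. The years below are hypothetical and chosen so that no row falls in the 1940-1950 bin; the column names follow the question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel sheet: nothing falls in (1940, 1950].
data_frame = pd.DataFrame({"Year": [1935, 1938, 1955, 1962, 1978, 1985, 1999, 2005, 2012]})
data_frame["Divide Year 1976"] = np.where(
    data_frame["Year"] < 1976, "Pre-1976", "1976-Present"
)
data_frame["bin Year"] = pd.cut(data_frame["Year"], bins=np.arange(1930, 2030, 10))

# dropna=False keeps every category of the binned column, even the empty one.
kept = pd.crosstab(
    index=data_frame["bin Year"],
    columns=data_frame["Divide Year 1976"],
    dropna=False,
)
```

`kept` has one row per decade bin, so a stacked bar plot of it leaves a visible gap for the empty decade and the nine tick labels line up with the nine bins.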
I have a dataframe and I want to make a stacked plot. The command that I use is:
df1 = df.groupby(['sample', 'species']).size().groupby(level=0).apply(lambda x: 100 * x / x.sum()).unstack()
df1.plot(kind='bar', stacked=True, colormap=cmap, ax=f.gca())
The plot looks good, but I would like to always use the same color for the same species across different datasets. To do so, I built a table that links each species name to an RGB color. However, I'm not able to link the species names in the plot to those colors.
How can I do it? Can anyone help, please?
A possible approach is to include all species values in every dataset, so that pandas does the linking automatically. Two steps are needed; follow the comments in the code. Hope it's close to your expectations.
import matplotlib.pyplot as plt
import pandas as pd
import random
num = 100
df1 = pd.DataFrame({
'sample': [random.randrange(10) for dummy in range(num)],
'species': [random.choice('abcdi') for dummy in range(num)]})
dfg1 = (
df1
.groupby(['sample', 'species'])
.size()
.groupby(level=0)
.apply(lambda x: 100 * x / x.sum()).unstack())
df2 = pd.DataFrame({
'sample': [random.randrange(10) for dummy in range(num)],
'species': [random.choice('defghi') for dummy in range(num)]})
dfg2 = (
df2
.groupby(['sample', 'species'])
.size()
.groupby(level=0)
.apply(lambda x: 100 * x / x.sum()).unstack())
# step 1 - get all values of species
all_columns = set(dfg1.columns) | set(dfg2.columns)
# step 2 - add all species to every dataset
for dfg in [dfg1, dfg2]:
    for col in all_columns.difference(set(dfg.columns)):
        dfg[col] = 0 # zeros for all added species
    dfg.sort_index(axis=1, inplace=True) # sort columns of every dataset
plt.figure()
ax1 = plt.subplot(121)
ax1.set_title('data_1')
dfg1.plot(kind='bar', stacked=True, ax=ax1, cmap='rainbow')
ax2 = plt.subplot(122)
ax2.set_title('data_2')
dfg2.plot(kind='bar', stacked=True, ax=ax2, cmap='rainbow')
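If an explicit species-to-color table already exists, another option is to look the colors up by column name instead of relying on a colormap. This is only a sketch: the `species_colors` dict, the RGB values, and the two tiny datasets are hypothetical, and `pd.crosstab(..., normalize='index')` is swapped in for the groupby chain to compute the percentages:

```python
import pandas as pd

# Hypothetical species-to-colour table; names and RGB values are illustrative.
species_colors = {
    "a": (0.12, 0.47, 0.71),
    "b": (1.00, 0.50, 0.05),
    "c": (0.17, 0.63, 0.17),
    "d": (0.84, 0.15, 0.16),
}

def percent_table(df):
    """Percentage of each species within each sample (rows sum to 100)."""
    return pd.crosstab(df["sample"], df["species"], normalize="index") * 100

data_1 = pd.DataFrame({"sample": [0, 0, 0, 1, 1], "species": list("aabab")})
data_2 = pd.DataFrame({"sample": [0, 0, 1, 1, 1], "species": list("bdddc")})

pct_1 = percent_table(data_1)
pct_2 = percent_table(data_2)

# Look colours up by column name, so 'b' gets the same colour in both datasets.
colors_1 = [species_colors[name] for name in pct_1.columns]
colors_2 = [species_colors[name] for name in pct_2.columns]
```

Passing `color=colors_1` to `pct_1.plot(kind='bar', stacked=True)` (and `colors_2` for the second table) should then keep each species on its own color, without adding zero-filled columns.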
When plotting the below data set:
date = ['2/18/2019','2/18/2019','2/18/2019','2/18/2019','2/25/2019','2/25/2019','2/25/2019','2/25/2019','3/4/2019','3/4/2019','3/4/2019','3/4/2019',
'3/11/2019','3/11/2019','3/11/2019','3/11/2019','3/18/2019','3/18/2019','3/18/2019','3/18/2019']
name = ['P','L','E','N','P','L','E','N','P','L','E','N','P','L','E','N','P','L','E','N']
count = [0,0,0,0,0,0,0,0,1,5,0,0,1,7,1,2,2,7,1,2]
df = pd.DataFrame({'date': date, 'name': name, 'count':count}).sort_values(['date','count'],ascending=[True, False])
I would like to maintain the order within each week, i.e. within every week the values should be ordered by count. For example, for 3/18 we should have L first, then either P or N, and then E.
However, the order is lost after pivoting, and the plot stacks the data alphabetically. Is there any way to make it plot by count within each week?
piv = df.pivot(index='date', columns='name', values='count')
piv = piv.reset_index(level=piv.index.names)
piv.plot(kind='bar', stacked=True, rot=0, grid=True)
The order of the columns determines the order in which the bars are stacked. If your pivot table has columns E, L, N, P, that is the order of the series (as in your current code). You can change this order, but every bar will use the same order. Here is an example that orders the columns by each letter's total count (e.g. E totals 2):
piv = df.pivot(index='date', columns='name', values='count')
piv = piv.reset_index(level=piv.index.names)
cols = ["date"] + piv[list("ELPN")].sum().sort_values(ascending=False).keys().tolist()
piv = piv[cols]
piv.plot(kind='bar', stacked=True, rot=0, grid=True)
I suspect you want a different order for each bar. I don't believe this is possible with Pandas, but it could probably be done with matplotlib directly.
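As a rough sketch of what "matplotlib directly" could look like: sort within each week, compute where each segment starts, and draw the segments with `ax.bar(..., bottom=...)`. The `bottom` and `label` columns are helper names introduced here, and converting the dates with `pd.to_datetime` is my assumption so that the weeks sort chronologically:

```python
import matplotlib.pyplot as plt
import pandas as pd

date = (["2/18/2019"] * 4 + ["2/25/2019"] * 4 + ["3/4/2019"] * 4
        + ["3/11/2019"] * 4 + ["3/18/2019"] * 4)
name = ["P", "L", "E", "N"] * 5
count = [0, 0, 0, 0, 0, 0, 0, 0, 1, 5, 0, 0, 1, 7, 1, 2, 2, 7, 1, 2]
df = pd.DataFrame({
    "date": pd.to_datetime(date, format="%m/%d/%Y"),
    "name": name,
    "count": count,
})

# Largest count first within each week; 'bottom' is where each segment starts.
ordered = df.sort_values(["date", "count"], ascending=[True, False])
ordered["bottom"] = ordered.groupby("date")["count"].cumsum() - ordered["count"]
ordered["label"] = ordered["date"].dt.strftime("%m/%d")

fig, ax = plt.subplots()
for letter, grp in ordered.groupby("name"):
    # One bar() call per name keeps a single colour and legend entry per name.
    ax.bar(grp["label"].tolist(), grp["count"].to_numpy(),
           bottom=grp["bottom"].to_numpy(), label=letter)
ax.legend()
```

Each name keeps one color across the chart, but its position inside a bar now depends on that week's counts.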
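You can sort the columns (axis 1) by the values of one row and then plot. The by label must match the index label exactly; with the string dates above that is '3/18/2019':
df.pivot(index = 'date', columns='name', values='count')\
.sort_values(by='3/18/2019', ascending=False, axis=1)\
.plot.bar(stacked = True, grid = True)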
I am trying to plot stacked yearly line graphs by months.
I have a dataframe df_year as below:
Day Number of Bicycle Hires
2010-07-30 6897
2010-07-31 5564
2010-08-01 4303
2010-08-02 6642
2010-08-03 7966
with the index set to the date, running from July 2010 to July 2017.
I want to plot one line per year, with the x-axis showing the months from Jan to Dec and only the total per month plotted.
I have achieved this by converting the dataframe to a pivot table as below:
pt = pd.pivot_table(df_year, index=df_year.index.month, columns=df_year.index.year, aggfunc='sum')
This creates the pivot table below, which I can plot as shown in the attached figure:
Number of Bicycle Hires 2010 2011 2012 2013 2014
1 NaN 403178.0 494325.0 565589.0 493870.0
2 NaN 398292.0 481826.0 516588.0 522940.0
3 NaN 556155.0 818209.0 504611.0 757864.0
4 NaN 673639.0 649473.0 658230.0 805571.0
5 NaN 722072.0 926952.0 749934.0 890709.0
plot showing yearly data with months on xaxis
The only problem is that the months show up as integers, and I would like them to be shown as Jan, Feb, ..., Dec, with each line representing one year. I am also unable to add a legend entry for each year.
I have tried the following code to achieve this:
dims = (15,5)
fig, ax = plt.subplots(figsize=dims)
ax.plot(pt)
months = MonthLocator(range(1, 13), bymonthday=1, interval=1)
monthsFmt = DateFormatter("%b '%y")
ax.xaxis.set_major_locator(months) # adding this makes the month ints disappear
ax.xaxis.set_major_formatter(monthsFmt)
handles, labels = ax.get_legend_handles_labels() # legend is nowhere on the plot
ax.legend(handles, labels)
Please can anyone help me out with this, what am I doing incorrectly here?
Thanks!
Your legend handles and labels are empty. Furthermore, the DateFormatter does not return the right values, because the values you're formatting are not datetime objects.
You could set the index explicitly to the dates, then drop the extra column level of the MultiIndex created by the pivot (the '0'), and then use explicit tick labels for the months while setting where they occur on the x-axis, as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import datetime
# dummy data (Days)
dates_d = pd.date_range('2010-01-01', '2017-12-31', freq='D')
df_year = pd.DataFrame(np.random.randint(100, 200, (dates_d.shape[0], 1)), columns=['Data'])
df_year.index = dates_d #set index
pt = pd.pivot_table(df_year, index=df_year.index.month, columns=df_year.index.year, aggfunc='sum')
pt.columns = pt.columns.droplevel() # remove the double header (0) as pivot creates a multiindex.
ax = plt.figure().add_subplot(111)
ax.plot(pt)
ticklabels = [datetime.date(1900, item, 1).strftime('%b') for item in pt.index]
ax.set_xticks(np.arange(1,13))
ax.set_xticklabels(ticklabels) #add monthlabels to the xaxis
ax.legend(pt.columns.tolist(), loc='center left', bbox_to_anchor=(1, .5)) #add the column names as legend.
plt.tight_layout(rect=[0, 0, 0.85, 1])
plt.show()
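A variation on the same idea, sketched under the assumption that English month abbreviations from the standard library's `calendar.month_abbr` are acceptable (they depend on the locale): rename the pivot's index before plotting, so the month names travel with the table. The `hires` column and the dummy data are hypothetical:

```python
import calendar
import numpy as np
import pandas as pd

# Dummy daily data standing in for df_year (the column name is illustrative).
days = pd.date_range("2010-07-30", "2012-07-31", freq="D")
df_year = pd.DataFrame({"hires": np.ones(len(days), dtype=int)}, index=days)

# Sum per (month, year), then spread the years into columns.
pt = (
    df_year["hires"]
    .groupby([df_year.index.month, df_year.index.year])
    .sum()
    .unstack()
)
# Replace the month numbers 1..12 with 'Jan'..'Dec'.
pt.index = [calendar.month_abbr[m] for m in pt.index]
pt.columns.name = "year"
```

Plotting `pt` with pandas should then give one line per year with the years in the legend; setting the ticks explicitly, as in the answer above, still helps if not every month label is shown.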
I have a Pandas data frame, and I want to explore the periodicity, trend, etc of the time series. Here is the data.
To visualize it, I want to superpose the "sub time series" for each year on the same plot (ie have the same x coordinate for data from 01/01/2000, 01/01/2001 and 01/01/2002).
Do I have to transform my date column so that each data has the same year?
Does anyone have an idea of how to do that?
Setup
This parses the data that you linked
df = pd.read_csv(
'data.csv', sep=';', decimal=',',
usecols=['date', 'speed', 'height', 'width'],
index_col=0, parse_dates=[0]
)
My Hack
I replaced the year in every date with 2012, because it is a leap year and so accommodates Feb 29. I split the original year into another level of a MultiIndex, unstack, and plot.
idx = pd.MultiIndex.from_arrays([
pd.to_datetime(df.index.strftime('2012-%m-%d %H:%M:%S')),
df.index.year
])
ax = df.set_index(idx).unstack().speed.plot()
lg = ax.legend(bbox_to_anchor=(1.05, 1), loc=2, ncol=2)
In an effort to pretty this up
fig, axes = plt.subplots(3, 1, figsize=(15, 9))
idx = pd.MultiIndex.from_arrays([
pd.to_datetime(df.index.strftime('2012-%m-%d %H:%M:%S')),
df.index.year
])
d1 = df.set_index(idx).unstack().resample('W').mean()
d1.speed.plot(ax=axes[0], title='speed')
lg = axes[0].legend(bbox_to_anchor=(1.02, 1), loc=2, ncol=1)
d1.height.plot(ax=axes[1], title='height', legend=False)
d1.width.plot(ax=axes[2], title='width', legend=False)
fig.tight_layout()
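The same trick can be tried on synthetic data when the linked file is not at hand. This is a sketch with made-up daily values; `common` and `wide` are names introduced here:

```python
import numpy as np
import pandas as pd

# Made-up daily series covering 2000 (a leap year), 2001 and 2002.
idx = pd.date_range("2000-01-01", "2002-12-31", freq="D")
speed = np.arange(len(idx), dtype=float)

# Map every timestamp onto the leap year 2012 and keep the real year as a level.
common = pd.to_datetime(idx.strftime("2012-%m-%d"))
wide = pd.Series(
    speed,
    index=pd.MultiIndex.from_arrays([common, idx.year], names=["date", "year"]),
).unstack("year")
```

`wide` has one row per calendar day and one column per year, so `wide.plot()` superposes the years; Feb 29 is simply NaN for the non-leap years.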
One way you could do it is to create a common x-axis for all years like this:
df['yeartime']=df.groupby(df.date.dt.year).cumcount()
where 'yeartime' represents the number of time measures in a year. Next, create a year column:
df['year'] = df.date.dt.year
Now, let's subset our data for the Jan 1st of years 2000, 2001, and 2002
subset_df = df.loc[df.date.dt.year.isin([2000, 2001, 2002]) & (df.date.dt.day == 1) & (df.date.dt.month == 1)]
And lastly, plot it.
import seaborn as sns
ax = sns.pointplot(x='yeartime', y='speed', hue='year', data=subset_df, markers='None')
_ =ax.get_xaxis().set_ticks([])
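To check what the `yeartime` counter produces, here is a small sketch on made-up daily data. It assumes every year is sampled at the same times, otherwise equal counters do not correspond to equal dates; the `wide` table is an addition for inspection:

```python
import pandas as pd

# Made-up daily data for 2000-2002; 'speed' is just a running number.
df = pd.DataFrame({"date": pd.date_range("2000-01-01", "2002-12-31", freq="D")})
df["speed"] = range(len(df))

df["year"] = df["date"].dt.year
# Position of each row within its own year: 0, 1, 2, ... restarting every Jan 1.
df["yeartime"] = df.groupby("year").cumcount()

# One column per year, aligned on the within-year position.
wide = df.pivot(index="yeartime", columns="year", values="speed")
```

Because 2000 is a leap year, its column has one more value than the others, and from March onward the same `yeartime` refers to dates one day apart.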