Ordering the plot of a pivot by count for each week - python

When plotting the data set below:
import pandas as pd
date = ['2/18/2019','2/18/2019','2/18/2019','2/18/2019','2/25/2019','2/25/2019','2/25/2019','2/25/2019','3/4/2019','3/4/2019','3/4/2019','3/4/2019',
'3/11/2019','3/11/2019','3/11/2019','3/11/2019','3/18/2019','3/18/2019','3/18/2019','3/18/2019']
name = ['P','L','E','N','P','L','E','N','P','L','E','N','P','L','E','N','P','L','E','N']
count = [0,0,0,0,0,0,0,0,1,5,0,0,1,7,1,2,2,7,1,2]
df = pd.DataFrame({'date': date, 'name': name, 'count':count}).sort_values(['date','count'],ascending=[True, False])
I would like to maintain the order within each week, i.e. within every week the values should be ordered by count. For example, for 3/18 we should have L first, then either P or N, and then E.
However, the order breaks after pivoting, and the plot shows the data alphabetically. Is there any way to make it plot by count within each week?
piv = df.pivot(index='date', columns='name', values='count')
piv = piv.reset_index(level=piv.index.names)
piv.plot(kind='bar', stacked=True, rot=0, grid=True)

The order of the columns determines how the bars are stacked. If your pivot table has the columns E, L, N, P, that is the order of the series (as in the current code). You can change this order, but all bars will then share the same order. Here is an example that orders the series by each letter's total count (e.g. E = 2):
piv = df.pivot(index='date', columns='name', values='count')
piv = piv.reset_index(level=piv.index.names)
cols = ["date"] + piv[list("ELPN")].sum().sort_values(ascending=False).keys().tolist()
piv = piv[cols]
piv.plot(kind='bar', stacked=True, rot=0, grid=True)
I suspect you want a different order for each bar. I don't believe this is possible with Pandas, but it could probably be done with matplotlib directly.
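To illustrate the matplotlib route mentioned above, here is a minimal sketch that draws every segment with its own ax.bar call, so each week can have its own stacking order. It is not the answerer's code: the two-week data subset, the color mapping, and the stack_order dictionary are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Small stand-in for the question's data (last two weeks only)
df = pd.DataFrame({
    "date": ["3/11/2019"] * 4 + ["3/18/2019"] * 4,
    "name": list("PLEN") * 2,
    "count": [1, 7, 1, 2, 2, 7, 1, 2],
})

# One fixed color per name, so segments stay identifiable when their order changes
colors = dict(zip("ELNP", ["tab:blue", "tab:orange", "tab:green", "tab:red"]))

fig, ax = plt.subplots()
stack_order = {}  # hypothetical helper: bottom-to-top order of names per week
for x, (week, grp) in enumerate(df.groupby("date", sort=False)):
    # Largest count first; mergesort is stable, so ties keep their row order
    grp = grp.sort_values("count", ascending=False, kind="mergesort")
    stack_order[week] = grp["name"].tolist()
    bottom = 0
    for name, cnt in zip(grp["name"], grp["count"]):
        ax.bar(x, cnt, bottom=bottom, color=colors[name])
        bottom += cnt

ax.set_xticks(range(len(stack_order)))
ax.set_xticklabels(list(stack_order))
# Build the legend by hand, because each name is drawn once per week
handles = [plt.Rectangle((0, 0), 1, 1, color=c) for c in colors.values()]
ax.legend(handles, list(colors))
```

The largest segment ends up at the bottom of each bar here; reverse the sort if it should sit on top instead.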

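You can sort the columns (axis 1) by the counts in one row and then plot. The by label has to match the index value exactly, which is '3/18/2019' for the string dates in the question; the resulting column order is then applied to every bar.
df.pivot(index = 'date', columns='name', values='count')\
.sort_values(by='3/18/2019', ascending=False, axis=1)\
.plot.bar(stacked = True, grid = True)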

Related

Add ONLY the total values on top of stacked bars in matplotlib

I am working with the following bar plot:
And I would like to add only the total amount of each index on top of the bars, like this:
However, when I use the following code, I only get parts of the stacks of each bar.
import pandas as pd
import matplotlib.pyplot as plt
data = [['0.01 - 0.1','A'],['0.1 - 0.5','B'],['0.5 - 1.0','B'],['0.01 - 0.1','C'],['> 2.5','A'],['1.0 - 2.5','A'],['> 2.5','A']]
df = pd.DataFrame(data, columns = ['Size','Index'])
### plot
df_new = df.sort_values(['Index'])
list_of_colors_element = ['green','blue','yellow','red','purple']
# Draw
piv = df_new.assign(dummy=1) \
.pivot_table('dummy', 'Index', 'Size', aggfunc='count', fill_value=0) \
.rename_axis(columns=None)
ax = piv.plot.bar(stacked=True, color=list_of_colors_element, rot=0, width=1)
ax.bar_label(ax.containers[0],fontsize=9)
# Decorations
plt.title("Index coloured by size", fontsize=22)
plt.ylabel('Amount')
plt.xlabel('Index')
plt.grid(color='black', linestyle='--', linewidth=0.4)
plt.xticks(range(3),fontsize=15)
plt.yticks(fontsize=15)
plt.show()
I have tried with different varieties of ax.bar_label(ax.containers[0],fontsize=9) but none displays the total of the bars.
As Trenton points out, bar_label is usable only if the topmost segment is never zero (i.e., exists in every stack) but otherwise not. Here are examples of the two cases.
If the topmost segment is never zero, use bar_label
In this example, the topmost segment (purple '>2.5') exists for all A, B, and C, so we can just use ax.bar_label(ax.containers[-1]):
df = pd.DataFrame({'Index': [*'AAAABBCBC'], 'Size': ['0.01-0.1', '>2.5', '1.0-2.5', '>2.5', '0.1-0.5', '0.5-1.0', '0.01-0.1', '>2.5', '>2.5']})
piv = pd.crosstab(df['Index'], df['Size'])
ax = piv.plot.bar(stacked=True)
# auto label since none of the topmost segments are missing
ax.bar_label(ax.containers[-1])
Otherwise, sum and label manually
In OP's example, the topmost segment (purple '>2.5') does not always exist (missing for B and C), so the totals need to be summed manually.
How to compute the totals will depend on your specific dataframe. In OP's case, A, B, and C are rows, so the totals should be computed as sum(axis=1):
df = pd.DataFrame({'Index': [*'AAAABBC'], 'Size': ['0.01-0.1', '>2.5', '1.0-2.5', '>2.5', '0.1-0.5', '0.5-1.0', '0.01-0.1']})
piv = pd.crosstab(df['Index'], df['Size'])
ax = piv.plot.bar(stacked=True)
# manually sum and label since some topmost segments are missing
for x, y in enumerate(piv.sum(axis=1)):
    ax.annotate(y, (x, y+0.1), ha='center')
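As a small illustration of the point that the totals depend on the layout of the dataframe, the sketch below computes the same totals for both orientations. The transposed frame piv_t is a hypothetical layout and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Index': [*'AAAABBC'],
                   'Size': ['0.01-0.1', '>2.5', '1.0-2.5', '>2.5',
                            '0.1-0.5', '0.5-1.0', '0.01-0.1']})

# Groups (A, B, C) as rows, as in the answer above: sum across the columns
piv = pd.crosstab(df['Index'], df['Size'])
totals_rows = piv.sum(axis=1)

# Groups as columns (hypothetical transposed layout): sum down the rows
piv_t = piv.T
totals_cols = piv_t.sum(axis=0)

print(totals_rows.to_dict())  # {'A': 4, 'B': 2, 'C': 1}
```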

Is there any nicer way to aggregate multiple columns on same grouped pandas dataframe?

I am trying to figure out how to manipulate my data so that I can aggregate multiple columns of the same grouped pandas data. I need this to build a stacked line chart that takes its data from different aggregations of the same grouped data. How can this be done in a compact way in pandas? Any ideas?
my current attempt:
import pandas as pd
import matplotlib.pyplot as plt
url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])
df_re = df[df['retail_item'].str.contains("GROUND BEEF")]
df_rei = df_re.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_rei = df_rei.reset_index(level=[0,1])
# DatetimeIndex.week was removed in pandas 2.0; the week is set from strftime below anyway
df_rei['year'] = pd.DatetimeIndex(df_rei['date']).year
df_rei['week'] = df_rei['date'].dt.strftime('%W').astype('uint8')  # %W week number, not the ISO week
df_ret_df1 = df_rei.groupby(['retail_item', 'week'])['number_of_ads'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
similarly, I need to do data aggregation also like this:
df_re['price_gap'] = df_re['high_price'] - df_re['low_price']
dff_rei1 = df_re.groupby(['date', 'retail_item']).agg({'price_gap': 'mean'})
dff_rei1 = dff_rei1.reset_index(level=[0,1])
# DatetimeIndex.week was removed in pandas 2.0; the week is set from strftime below anyway
dff_rei1['year'] = pd.DatetimeIndex(dff_rei1['date']).year
dff_rei1['week'] = dff_rei1['date'].dt.strftime('%W').astype('uint8')  # %W week number, not the ISO week
dff_ret_df2 = dff_rei1.groupby(['retail_item', 'week'])['price_gap'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
Problem
When I aggregate the data, these lines are nearly identical:
df_rei = df_re.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_ret_df1 = df_rei.groupby(['retail_item', 'week'])['number_of_ads'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
and
dff_rei1 = df_re.groupby(['date', 'retail_item']).agg({'price_gap': 'mean'})
dff_ret_df2 = dff_rei1.groupby(['retail_item', 'week'])['price_gap'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
I think a better way might be to write a custom function with *args, **kwargs that switches the aggregated columns, but how would I then show a stacked line chart where each y-axis shows a different quantity? Is that doable in pandas?
Line plot
I produced the line chart as follows:
import seaborn as sns
for g, d in df_ret_df1.groupby('retail_item'):
    fig, ax = plt.subplots(figsize=(7, 4), dpi=144)
    sns.lineplot(x='week', y='vals', hue='mm', data=d,alpha=.8)
    y1 = d[d.mm == 'max']
    y2 = d[d.mm == 'min']
    plt.fill_between(x=y1.week, y1=y1.vals, y2=y2.vals)
    for year in df['year'].unique():
        data = df_rei[(df_rei.date.dt.year == year) & (df_rei.retail_item == g)]
        sns.lineplot(x='week', y='price_gap', ci=None, data=data, palette=cmap,label=year,alpha=.8)
I want to reduce this duplication so that I can aggregate different columns and draw a stacked line chart (two vertical subplots) with the week on a shared x-axis: one subplot shows the number of ads on its y-axis and the other shows the price range for the same items across 52 weeks. Is there a better way of doing this? Any ideas?
This answer builds on the one by Andreas who has already answered the main question of how to produce aggregate variables of multiple columns in a compact way. The goal here is to implement that solution specifically to your case and to give an example of how to produce a single figure from the aggregated data. Here are some key points:
The dates in the original dataset are already on a weekly frequency so groupby('week') is not needed for df_ret_df1 and dff_ret_df2, which is why these contain identical values for min, max, and mean.
This example uses pandas and matplotlib so the variables do not need to be stacked as when using seaborn.
The aggregation step produces a MultiIndex for the columns. You can access the aggregated variables (min, max, mean) of each high-level variable by using df.xs.
The date is set as the index of the aggregated dataframe to use as the x variable. Using the DatetimeIndex as the x variable gives you more flexibility for formatting the tick labels and ensures that the data is always plotted in chronological order.
It is not clear in the question how the data for separate years should be displayed (in separate figures?) so here the entire time series is shown in a single figure.
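As a quick illustration of the df.xs point above, here is a minimal sketch on a tiny synthetic frame. The values are made up and only the column names follow the question, so treat it as an assumption-laden example rather than the real dataset.

```python
import pandas as pd

# Synthetic stand-in for the real dataset (values are invented for illustration)
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-04", "2021-01-04", "2021-01-11", "2021-01-11"]),
    "number_of_ads": [10, 30, 20, 40],
    "price_gap": [1.0, 3.0, 2.0, 6.0],
})

agg = df.groupby("date").agg({"number_of_ads": ["min", "max", "mean"],
                              "price_gap": ["min", "max", "mean"]})

# agg.columns is a two-level MultiIndex: (variable, statistic)
ads = agg.xs("number_of_ads", axis=1)      # columns: min, max, mean
means = agg.xs("mean", axis=1, level=1)    # columns: number_of_ads, price_gap
```

Selecting on the first level gives all statistics of one variable, while passing level=1 gives one statistic across all variables.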
Import dataset and aggregate it as needed
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
# Import dataset
url = 'https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/\
raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv'
df = pd.read_csv(url, parse_dates=['date'])
# Create dataframe containing data for ground beef products, compute
# aggregate variables, and set the date as the index
df_gbeef = df[df['retail_item'].str.contains('GROUND BEEF')].copy()
df_gbeef['price_gap'] = df_gbeef['high_price'] - df_gbeef['low_price']
agg_dict = {'number_of_ads': [min, max, 'mean'],
'price_gap': [min, max, 'mean']}
df_gbeef_agg = (df_gbeef.groupby(['date', 'retail_item']).agg(agg_dict)
.reset_index('retail_item'))
df_gbeef_agg
Plot aggregated variables in single figure containing small multiples
variables = ['number_of_ads', 'price_gap']
colors = ['tab:orange', 'tab:blue']
nrows = len(variables)
ncols = df_gbeef_agg['retail_item'].nunique()
fig, axs = plt.subplots(nrows, ncols, figsize=(10, 5), sharex=True, sharey='row')
for axs_row, var, color in zip(axs, variables, colors):
    for i, (item, df_item) in enumerate(df_gbeef_agg.groupby('retail_item')):
        ax = axs_row[i]
        # Select data and plot it
        data = df_item.xs(var, axis=1)
        ax.fill_between(x=data.index, y1=data['min'], y2=data['max'],
                        color=color, alpha=0.3, label='min/max')
        ax.plot(data.index, data['mean'], color=color, label='mean')
        ax.spines['bottom'].set_position('zero')
        # Format x-axis tick labels
        fmt = plt.matplotlib.dates.DateFormatter('%W') # is not equal to ISO week
        ax.xaxis.set_major_formatter(fmt)
        # Format subplot according to position within the figure
        # (recent Matplotlib releases removed ax.is_first_row() and friends,
        # so the checks go through the SubplotSpec instead)
        ss = ax.get_subplotspec()
        if ss.is_first_row():
            ax.set_title(item, pad=10)
        if ss.is_last_row():
            ax.set_xlabel('Week number', size=12, labelpad=5)
        if ss.is_first_col():
            ax.set_ylabel(var, size=12, labelpad=10)
        if ss.is_last_col():
            ax.legend(frameon=False)
fig.suptitle('Cross-regional weekly ads and price gaps of ground beef products',
             size=14, y=1.02)
fig.subplots_adjust(hspace=0.1);
I am not sure if this answers your question fully, but based on your headline I guess it all boils down to:
import pandas as pd
url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])
# define which columns to group and in which way
dct = {'low_price': [max, min],
'high_price': min,
'year': 'mean'}
# actually group the columns
df.groupby(['region']).agg(dct)
Output:
              low_price       high_price         year
                    max   min        min         mean
region
ALASKA            16.99  1.33       1.33  2020.792123
HAWAII            12.99  1.33       1.33  2020.738318
MIDWEST           28.73  0.99       0.99  2020.690159
NORTHEAST         19.99  1.20       1.99  2020.709916
NORTHWEST         16.99  1.33       1.33  2020.736397
SOUTH CENTRAL     28.76  1.20       1.49  2020.700980
SOUTHEAST         21.99  1.33       1.48  2020.699655
SOUTHWEST         16.99  1.29       1.29  2020.704341
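If the two-level column header is inconvenient, pandas named aggregation is one possible alternative that yields flat column names. The sketch below uses a small made-up frame instead of the CSV from the question, and the output column names are my own choice.

```python
import pandas as pd

# Made-up stand-in for the real data
df = pd.DataFrame({
    "region": ["ALASKA", "ALASKA", "HAWAII", "HAWAII"],
    "low_price": [1.5, 2.5, 3.0, 5.0],
    "high_price": [2.0, 4.0, 3.5, 6.0],
})

# Each keyword is an output column: name=(source column, aggregation)
flat = df.groupby("region").agg(
    low_price_max=("low_price", "max"),
    low_price_min=("low_price", "min"),
    high_price_min=("high_price", "min"),
)
```

The result has one plain column per keyword, which can be passed straight to plotting functions without df.xs.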

Matplotlib Plot points on an existing line, only by knowing x values

I have 2 dataframes:
'stock' is a dataframe with columns Date and Price.
'events' is a dataframe with columns Date and Text.
My goal is to produce a graph of the stock prices and place dots on the line where the events occur. However, I do not know how to get the 'y' value for the events dataframe, since it should be the price at that date in the stock dataframe.
I am able to plot the first dataframe fine with:
plt.plot('Date', 'Price', data=stock)
And I try to plot the event dots with:
plt.scatter('created_at', ???, data=events)
However, it is the ??? that I don't know how to set
Assuming Date and created_at are datetime:
stock = pd.DataFrame({'Date':['2021-01-01','2021-02-01','2021-03-01','2021-04-01','2021-05-01'],'Price':[1,5,3,4,10]})
events = pd.DataFrame({'created_at':['2021-02-01','2021-03-01'],'description':['a','b']})
stock.Date = pd.to_datetime(stock.Date)
events.created_at = pd.to_datetime(events.created_at)
Filter stock by events.created_at (or merge) and plot them onto the same ax :
stock_events = stock[stock.Date.isin(events.created_at)]
# or merge on the date columns
# stock_events = stock.merge(events, left_on='Date', right_on='created_at')
ax = stock.plot(x='Date', y='Price')
stock_events.plot.scatter(ax=ax, x='Date', y='Price', label='Event', c='r', s=50)
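The isin filter above only finds events whose timestamp exactly matches a stock date. If an event can fall between two price rows, one possible approach is pd.merge_asof, which picks the most recent earlier price. This is a sketch under that assumption; the sample rows are invented and the name marked is hypothetical.

```python
import pandas as pd

stock = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"]),
    "Price": [1, 5, 3, 4],
})
events = pd.DataFrame({
    "created_at": pd.to_datetime(["2021-02-10", "2021-03-01"]),  # first one has no exact match
    "description": ["a", "b"],
})

# Both frames must be sorted on their key; "backward" takes the last price at or before the event
marked = pd.merge_asof(events.sort_values("created_at"), stock.sort_values("Date"),
                       left_on="created_at", right_on="Date", direction="backward")
```

The merged frame carries Date and Price for each event, so it can be plotted with the same scatter call as stock_events.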

How to plot both Price and Volume in same Chart

I have a dataframe as mentioned below:
Date,Time,Price,Volume
31/01/2019,09:15:00,10691.50,600
31/01/2019,09:15:01,10709.90,13950
31/01/2019,09:15:02,10701.95,9600
31/01/2019,09:15:03,10704.10,3450
31/01/2019,09:15:04,10700.05,2625
31/01/2019,09:15:05,10700.05,2400
31/01/2019,09:15:06,10698.10,3000
31/01/2019,09:15:07,10699.90,5925
31/01/2019,09:15:08,10699.25,5775
31/01/2019,09:15:09,10700.45,5925
31/01/2019,09:15:10,10700.00,4650
31/01/2019,09:15:11,10699.40,8025
31/01/2019,09:15:12,10698.95,5025
31/01/2019,09:15:13,10698.45,1950
31/01/2019,09:15:14,10696.15,3900
31/01/2019,09:15:15,10697.15,2475
31/01/2019,09:15:16,10697.05,4275
31/01/2019,09:15:17,10696.25,3225
31/01/2019,09:15:18,10696.25,3300
The data frame contains approx. 8000 rows. I want to plot both price and volume in the same chart. (Volume Range: 0 - 8,00,000)
Suppose you want to compare price and volume vs time, try this:
df = pd.read_csv('your_path_here')
df.plot('Time', ['Price', 'Volume'], secondary_y='Price')
edit: x-axis customization
Since you want x-axis customization, try this (it is just a basic example you can follow):
# Create a Datetime column while parsing the csv file
df = pd.read_csv('your_path_here', parse_dates= {'Datetime': ['Date', 'Time']})
Then you need to create two lists, one containing the positions on the x-axis and the other the labels.
Say you want labels every 5 seconds (your request for 30-minute intervals is possible, but not with the data you provided):
positions = [p for p in df.Datetime if p.second in range(0, 60, 5)]
labels = [l.strftime('%H:%M:%S') for l in positions]
Then you plot passing the positions and labels lists to set_xticks and set_xticklabels
ax = df.plot('Datetime', ['Price', 'Volume'], secondary_y='Price')
ax.set_xticks(positions)
ax.set_xticklabels(labels)
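The dates in the sample are written day first (31/01/2019), so it may be safer to build the Datetime column with an explicit format instead of relying on automatic parsing. The sketch below inlines the first few sample rows so that it is self-contained; whether it suits the full file is an assumption.

```python
import io
import pandas as pd

# First rows of the question's sample, inlined for a self-contained example
csv_text = """Date,Time,Price,Volume
31/01/2019,09:15:00,10691.50,600
31/01/2019,09:15:01,10709.90,13950
31/01/2019,09:15:02,10701.95,9600
"""
df = pd.read_csv(io.StringIO(csv_text))

# Join the two text columns and parse them with an explicit day-first format
df["Datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                                format="%d/%m/%Y %H:%M:%S")
```

The resulting column can be used as the x column in the df.plot call above.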

Getting Pandas datetime column to display as Dates, not Numbers, on Matplotlib X-axis

Using pandas and wondering why the date column isn't showing up as the actual dates (type = pandas.tslib.Timestamp) but are showing up as numbers.
Take this replicable example:
todays_date = datetime.datetime.now().date()
columns = ['month','A','B','C','D']
_dates = pd.DataFrame(pd.date_range(todays_date-datetime.timedelta(10), periods=150, freq='M'))
_randomdata = pd.DataFrame(np.random.randn(150, 4))
data = pd.concat([_dates, _randomdata], axis=1)
data.plot(figsize = (10,6))
As you can see, the x-axis is showing up as numbers, not dates.
2 questions:
a) How do I change it so that the actual dates are showing up on the x-axis?
b) How do I change the frequency of the ticks and tick labels on the x-axis if I want more/fewer months showing up?
Thanks guys!
Just use the date_range as an index to the DataFrame:
import datetime
import numpy as np
import pandas as pd
todays_date = datetime.datetime.now().date()
columns = ['A','B','C','D']
data = pd.DataFrame(data=np.random.randn(150, 4),
index=pd.date_range(todays_date-datetime.timedelta(10), periods=150, freq='M'),
columns=columns)
data.plot(figsize = (10,6))
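For part (b) of the question, one possible way to control the tick frequency is to plot with matplotlib directly and set a date locator and formatter. This is a sketch, not part of the original answer: the fixed start date, the month-start frequency "MS", and the two-year tick spacing are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random data on a monthly DatetimeIndex (month starts, fixed start date)
data = pd.DataFrame(np.random.randn(150, 4),
                    index=pd.date_range("2015-01-01", periods=150, freq="MS"),
                    columns=list("ABCD"))

fig, ax = plt.subplots(figsize=(10, 6))
for col in data.columns:
    ax.plot(data.index, data[col], label=col)

# One major tick every 2 years, labeled as year-month
ax.xaxis.set_major_locator(mdates.YearLocator(2))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))
ax.legend()
```

Swapping YearLocator(2) for mdates.MonthLocator(interval=6) would give a tick every six months instead.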
