Ordering of elements in Pandas stacked bar chart - python

I'm trying to graph information about the portion of a household's income earned in a specific industry across 5 districts in a region.
I used groupby to sort the information in my data frame by district:
df = df_orig.groupby('District')['Portion of income'].value_counts(dropna=False)
df = df.groupby('District').transform(lambda x: 100*x/sum(x))
df = df.drop(labels=math.nan, level=1)
ax = df.unstack().plot.bar(stacked=True, rot=0)
ax.set_ylim(ymax=100)
display(df.head())
District  Portion of income
A         <25%                 12.121212
          25 - 50%              9.090909
          50 - 75%              7.070707
          75 - 100%             2.020202
Since this income falls into categories, I would like to order the elements in the stacked bar in a logical way. The graph Pandas produced is below. Right now, the ordering (starting from the bottom of each bar) is:
25 - 50%
50 - 75%
75 - 100%
<25%
Unsure
I realize that these are sorted in alphabetical order and was curious if there was a way to set a custom ordering. To be intuitive, I would like the order to be (again, starting from the bottom of the bar):
Unsure
<25%
25 - 50%
50 - 75%
75 - 100%
Then, I would like to flip the legend to display the reverse of this order (ie, I would like the legend to have 75 - 100 at the top, as that is what will be at the top of the bars).

To impose a custom sort order on the income categories, one way is to convert them to a CategoricalIndex.
To reverse the order of matplotlib legend entries, use the get_legend_handles_labels method from this SO question: Reverse legend order pandas plot
import pandas as pd
import numpy as np
import math
np.random.seed(2019)
# Hard-code the custom ordering of categories
categories = ['unsure', '<25%', '25 - 50%', '50 - 75%', '75 - 100%']
# Generate some example data
# I'm not sure if this matches your input exactly
df_orig = pd.DataFrame({'District': np.random.choice(list('ABCDE'), size=100),
                        'Portion of income': np.random.choice(categories + [np.nan], size=100)})
# Unchanged from your code. Note that value_counts() returns a
# Series, but you name it df
df = df_orig.groupby('District')['Portion of income'].value_counts(dropna=False)
df = df.groupby('District').transform(lambda x: 100*x/sum(x))
# In my example data, np.nan was cast to the string 'nan', so
# I have to drop it like this
df = df.drop(labels='nan', level=1)
# Instead of plotting right away, unstack the MultiIndex
# into columns, then convert those columns to a CategoricalIndex
# with custom sort order
df = df.unstack()
df.columns = pd.CategoricalIndex(df.columns.values,
                                 ordered=True,
                                 categories=categories)
# Sort the columns (axis=1) by the new categorical ordering
df = df.sort_index(axis=1)
# Plot
ax = df.plot.bar(stacked=True, rot=0)
ax.set_ylim(ymax=100)
# Matplotlib idiom to reverse legend entries
handles, labels = ax.get_legend_handles_labels()
ax.legend(reversed(handles), reversed(labels))
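The reordering step can be checked on its own, without plotting. The following is a minimal sketch with made-up percentages (the frame `toy` and its values are illustrative and not taken from the question). It only shows that sort_index(axis=1) follows the category order once the columns are a CategoricalIndex.

```python
import pandas as pd

# Desired bottom-to-top order of the stacked segments
categories = ['Unsure', '<25%', '25 - 50%', '50 - 75%', '75 - 100%']

# Illustrative data: columns start in alphabetical order, as unstack() leaves them
toy = pd.DataFrame(
    [[10.0, 20.0, 30.0, 25.0, 15.0]],
    index=['A'],
    columns=['25 - 50%', '50 - 75%', '75 - 100%', '<25%', 'Unsure'],
)

# Attach the custom ordering, then sort the columns by it
toy.columns = pd.CategoricalIndex(toy.columns, categories=categories, ordered=True)
toy = toy.sort_index(axis=1)

ordered_columns = list(toy.columns)
print(ordered_columns)
```

Any label missing from `categories` would become NaN in the new index, so the list should cover every column name exactly as spelled in the data.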

Related

How to plot multiple traces with trendlines?

I'm trying to plot trendlines on multiple traces on scatters in plotly. I'm kind of stumped on how to do it.
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'],
y=df_df['Height (meters)'],
name='Douglas Fir', mode='markers')
)
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'],
y=df_wp['Height (meters)'],
name='White Pine',mode='markers'),
)
fig.update_layout(title="Tree Circumference vs Height (meters)",
xaxis_title=df_df['Circumference (meters)'].name,
yaxis_title=df_df['Height (meters)'].name,
title_x=0.5)
fig.show()
Trying to get something like this:
Here's how I resolved it. Basically, I used numpy's polyfit function to calculate the slope and intercept for each data set. I then added the fitted line for each data set as a separate trace.
import numpy as np
import plotly.graph_objects as go
df_m, df_b = np.polyfit(df_df['Circumference (meters)'].to_numpy(), df_df['Height (meters)'].to_numpy(), 1)
wp_m, wp_b = np.polyfit(df_wp['Circumference (meters)'].to_numpy(), df_wp['Height (meters)'].to_numpy(), 1)
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'],
y=df_df['Height (meters)'],
name='Douglas Fir', mode='markers')
)
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'],
y=(df_m*df_df['Circumference (meters)'] + df_b),
name='douglas fir trendline',
mode='lines')
)
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'],
y=df_wp['Height (meters)'],
name='White Pine',mode='markers'),
)
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'],
y=(wp_m * df_wp['Circumference (meters)'] + wp_b),
name='white pine trendline',
mode='lines')
)
fig.update_layout(title="Tree Circumference vs Height (meters)",
xaxis_title=df_df['Circumference (meters)'].name,
yaxis_title=df_df['Height (meters)'].name,
title_x=0.5)
fig.show()
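np.polyfit with degree 1 returns its coefficients highest power first, that is, slope and then intercept, which is why the code above unpacks the result as m and b. A small sketch on synthetic points (the values are invented for illustration) makes this concrete:

```python
import numpy as np

# Synthetic points lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Degree-1 fit: coefficients come back as (slope, intercept)
m, b = np.polyfit(x, y, 1)

# Trendline values at the observed x positions
trend = m * x + b
print(round(m, 6), round(b, 6))
```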
You've already put together a procedure that solves your problem, but I would like to mention that you can use plotly.express and do the very same thing with only a very few lines of code. Using px.scatter() there are actually two slightly different ways, depending on whether your data is of a long or wide format. Your data seems to be of the latter format, since you're asking:
how can I make this work with separate traces?
So I'll start with that. And I'll use a subset of the built-in dataset px.data.stocks() since you haven't provided a data sample.
Code 1 - Wide data
fig_wide = px.scatter(df_wide, x = 'index', y = ['AAPL', 'MSFT'],
trendline = 'ols',
)
Code 2 - Long data
fig_long = px.scatter(df_long, x= 'index', y = 'value',
color = 'variable',
trendline = 'ols')
Plot 1 - Identical results
About the data:
A dataframe of a wide format typically has an index with unique values in the left-most column, variable names in the column headers, and corresponding values for each variable per index in the columns like this:
index AAPL MSFT
0 1.000000 1.000000
1 1.011943 1.015988
2 1.019771 1.020524
3 0.980057 1.066561
4 0.917143 1.040708
Here, adding information about another variable would require adding another column.
A dataframe of a long format, on the other hand, typically organizes the same data with only (though not necessarily only) three columns; index, variable and value:
index variable value
0 AAPL 1.000000
1 AAPL 1.011943
.
.
100 MSFT 1.720717
101 MSFT 1.752239
And contrary to the wide format, this means that index will have duplicate values. But for a good reason.
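The two layouts can be converted into each other. As a minimal sketch, using only the first three rows of the wide sample shown above, pd.melt produces the long layout with its default column names variable and value:

```python
import pandas as pd

# Wide format: one column per variable (first rows of the sample shown above)
df_wide = pd.DataFrame({
    'index': [0, 1, 2],
    'AAPL': [1.000000, 1.011943, 1.019771],
    'MSFT': [1.000000, 1.015988, 1.020524],
})

# Long format: one row per (index, variable) pair
df_long = pd.melt(df_wide, id_vars='index')
print(df_long)
```

Each index value now appears once per variable, which is the duplication mentioned above.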
So what's the difference?
If you look at Code 1 you'll see that the only thing you need to specify for px.scatter in order to get multiple traces with trendlines, in this case AAPL and MSFT on the y-axis versus an index on the x-axis, is trendline = 'ols'. This is because plotly.express automatically identifies the data format as wide and knows how to apply the trendlines correctly. Different columns mean different categories, for which a trace and trendline are produced.
As for the "long approach", you've got both AAPL and MSFT in the same variable column, and values for both of them in the value column. But setting color = 'variable' lets plotly.express know how to categorize the variable column, correctly separate the data in the value column, and thus correctly produce the trendlines. A different name in the variable column means that index and value in the same row belong to a different category, for which a new trace and trendline are built.
Any pros and cons?
The arguably only advantage with the wide format is that it's easier to read (particularly for those of us damaged by too many years of sub-excellent data handling with Excel). And one great advantage with the long format is that you can easily illustrate more dimensions of the data if you have more categories with, for example, different symbols or sizes for the markers.
Another advantage with the long format occurs if the dataset changes, for example with the addition of another variable 'AMZN'. Then the name and the values of that variable will occur in the already existing columns instead of adding another one like you would for the wide format. This means that you actually won't have to change the code in:
fig_long = px.scatter(df_long, x= 'index', y = 'value',
color = 'variable',
trendline = 'ols')
... in order to add the data to the figure.
While for the wide format, you would have to specify y = ['AAPL', 'MSFT', 'AMZN'] in:
fig_wide = px.scatter(df_wide, x = 'index', y = ['AAPL', 'MSFT', 'AMZN'],
trendline = 'ols',
)
And I would strongly argue that this outweighs the slight inconvenience of specifying color = 'variable' in:
fig_long = px.scatter(df_long, x= 'index', y = 'value',
color = 'variable',
trendline = 'ols')
Plot 2 - A new variable:
Complete code
# imports
import pandas as pd
import plotly.express as px
# data
df = px.data.stocks()
# df.date = pd.to_datetime(df.date)
df_wide = df.drop(['date', 'GOOG', 'AMZN', 'NFLX', 'FB'], axis = 1).reset_index()
# df_wide = df.drop(['date', 'GOOG', 'NFLX', 'FB'], axis = 1).reset_index()
df_long = pd.melt(df_wide, id_vars = 'index')
df_long
fig_wide = px.scatter(df_wide, x = 'index', y = ['AAPL', 'MSFT'],
trendline = 'ols',
)
fig_long = px.scatter(df_long, x= 'index', y = 'value',
color = 'variable',
trendline = 'ols')
# fig_long.show()
fig_wide.show()

Plotly: How to animate a bar chart with multiple groups using plotly express?

I have a dataframe that looks like this:
I want to have one bar for old freq and one for new freq. Currently I have a graph that looks like this:
This is what the code looks like:
freq_df['date'] = pd.to_datetime(freq_df['date'])
freq_df['hour'] = freq_df['hour'].astype(str)
fig = px.bar(freq_df, x="hour", y="old freq",hover_name = "date",
animation_frame= freq_df.date.dt.day)
fig.update_layout(transition = {'duration': 2000})
How do I add another bar?
Explanation about DF:
It has frequencies relevant to each hour in a specific date.
Edit:
One approach could be to create a category column and add old and new freq and assign values in another freq column. How do I do that :p ?
Edit:
Here is the DF
,date,hour,old freq,new freq
43,2020-09-04,18,273,224.0
44,2020-09-04,19,183,183.0
45,2020-09-04,20,99,111.0
46,2020-09-04,21,130,83.0
47,2020-09-04,22,48,49.0
48,2020-09-04,23,16,16.0
49,2020-09-05,0,8,6.0
50,2020-09-05,1,10,10.0
51,2020-09-05,2,4,4.0
52,2020-09-05,3,7,7.0
53,2020-09-05,4,25,21.0
54,2020-09-05,5,114,53.0
55,2020-09-05,6,284,197.0
56,2020-09-05,7,343,316.0
57,2020-09-05,8,418,419.0
58,2020-09-05,9,436,433.0
59,2020-09-05,10,469,396.0
60,2020-09-05,11,486,300.0
61,2020-09-05,12,377,140.0
62,2020-09-05,13,552,103.0
63,2020-09-05,14,362,117.0
64,2020-09-05,15,512,93.0
65,2020-09-05,16,392,41.0
66,2020-09-05,17,268,31.0
67,2020-09-05,18,223,30.0
68,2020-09-05,19,165,24.0
69,2020-09-05,20,195,15.0
70,2020-09-05,21,90,
71,2020-09-05,22,46,1.0
72,2020-09-05,23,17,1.0
The answer in two steps:
1. Perform a slight transformation of your data using pd.wide_to_long:
df_long = pd.wide_to_long(freq_df, stubnames='freq',
                          i=['date', 'hour'], j='type',
                          sep='_', suffix=r'\w+').reset_index()
2. Plot two groups of bar traces using:
fig1 = px.bar(df_long, x='hour', y = 'freq', hover_name = "date", color='type',
animation_frame= 'date', barmode='group')
This is the result:
The details:
If I understand your question correctly, you'd like to animate a bar chart where you've got one bar for each hour for your two frequencies freq_old and freq_new like this:
If that's the case, then your sample data is no good, since your animation criterion is hour per date and you've only got six observations (hours) for 2020-09-04 and then 24 observations for 2020-09-05. But don't worry: since your question triggered my interest, I made some sample data that will in fact work the way you seem to want it to.
The only real challenge is that px.bar will not accept y= [freq_old, freq_new], or something to that effect, to build your two bar series of different categories for you. But you can make px.bar build two groups of bars by providing a color argument.
However, you'll need a column to identify your different freqs like this:
0 new
1 old
2 new
3 old
4 new
5 old
6 new
7 old
8 new
9 old
In other words, you'll have to transform your dataframe, which originally has a wide format, to a long format like this:
date hour type day freq
0 2020-01-01 0 new 1 7.100490
1 2020-01-01 0 old 1 2.219932
2 2020-01-01 1 new 1 7.015528
3 2020-01-01 1 old 1 8.707323
4 2020-01-01 2 new 1 7.673314
5 2020-01-01 2 old 1 2.067192
6 2020-01-01 3 new 1 9.743495
7 2020-01-01 3 old 1 9.186109
8 2020-01-01 4 new 1 3.737145
9 2020-01-01 4 old 1 4.884112
And that's what this snippet does:
df_long = pd.wide_to_long(freq_df, stubnames='freq',
                          i=['date', 'hour'], j='type',
                          sep='_', suffix=r'\w+').reset_index()
stubnames uses a prefix to identify the columns you'd like to stack into a long format. And that's why I've renamed new_freq and old_freq to freq_new and freq_old, respectively. j='type' simply takes the last parts of your category names using sep='_' and produces the column that we need to tell the freqs from each other:
type
old
new
old
...
suffix='\w+' tells pd.wide_to_long that we're using non-integers as suffixes; the default pattern, '\d+', matches digits only.
And that's it!
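To see the transformation on the question's own numbers, here is a minimal sketch using three rows from the posted data. The rename step is an assumption about how the original 'old freq' and 'new freq' columns would be brought into the freq_old / freq_new naming that stubnames='freq' needs.

```python
import pandas as pd

# Three rows from the data posted in the question
freq_df = pd.DataFrame({
    'date': ['2020-09-04', '2020-09-04', '2020-09-04'],
    'hour': [18, 19, 20],
    'old freq': [273, 183, 99],
    'new freq': [224.0, 183.0, 111.0],
})

# wide_to_long expects the shared stub first, so 'old freq' becomes 'freq_old'
freq_df = freq_df.rename(columns={'old freq': 'freq_old', 'new freq': 'freq_new'})

# Stack freq_old / freq_new into one 'freq' column, labelled by 'type'
df_long = pd.wide_to_long(freq_df, stubnames='freq',
                          i=['date', 'hour'], j='type',
                          sep='_', suffix=r'\w+').reset_index()
print(df_long)
```

The result has two rows per (date, hour) pair, one with type 'old' and one with type 'new', which is the shape px.bar needs for color='type'.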
Complete code:
# imports
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import numpy as np
import random
# sample data
observations = 24*5
np.random.seed(5); cols = list('a')
freq_old = np.random.uniform(low=-1, high=1, size=observations).tolist()
freq_new = np.random.uniform(low=-1, high=1, size=observations).tolist()
timestamps = pd.date_range('2020', freq='H', periods=observations)
date = timestamps.strftime('%Y-%m-%d').tolist()
hour = timestamps.hour.tolist()
# sample dataframe of a wide format such as yours
freq_df=pd.DataFrame({'date': date,
'hour':hour,
'freq_new':freq_new,
'freq_old':freq_old})
freq_df['day']=pd.to_datetime(freq_df['date']).dt.day
# attempt to make my random data look a bit
# like your real world data.
# but don't worry too much about that...
freq_df.freq_new = abs(freq_df.freq_new.cumsum())
freq_df.freq_old = abs(freq_df.freq_old.cumsum())
# sample dataframe of a long format that px.bar likes
df_long = pd.wide_to_long(freq_df, stubnames='freq',
                          i=['date', 'hour'], j='type',
                          sep='_', suffix=r'\w+').reset_index()
# plotly express bar chart with multiple bar groups.
fig = px.bar(df_long, x='hour', y = 'freq', hover_name = "date", color='type',
animation_frame= 'date', barmode='group')
# set up a sensible range for the y-axis
fig.update_layout(yaxis=dict(range=[df_long['freq'].min()*0.8,df_long['freq'].max()*1.2]))
fig.show()
I was able to create the bars for both the old and new frequencies, however using a separate plot for each day (Plotly Express Bar Charts don't seem to have support for multiple series). Here is the code for doing so:
# Import packages
import pandas as pd
import numpy as np
import plotly.graph_objs as go
import plotly
import plotly.express as px
from plotly.offline import init_notebook_mode, plot, iplot, download_plotlyjs
init_notebook_mode(connected=True)
plotly.offline.init_notebook_mode(connected=True)
# Start formatting data
allDates = np.unique(original_df.date)
numDates = allDates.shape[0]
print(numDates)
# Draw one grouped bar chart per date
for i in range(numDates):
    df = original_df.loc[original_df.date == allDates[i]]
    oldFreqData = go.Bar(x=df["hour"].to_numpy(), y=df["old_freq"].to_numpy(), name="Old Frequency")
    newFreqData = go.Bar(x=df["hour"].to_numpy(), y=df["new_freq"].to_numpy(), name="New Frequency")
    fig = go.Figure(data=[oldFreqData,newFreqData])
    fig.update_layout(title=str(allDates[i]))
    fig.update_xaxes(title='Hour')
    fig.update_yaxes(title='Frequency')
    fig.show()
where original_df is the dataframe DF from your question, with the frequency columns named old_freq and new_freq.
Here is the output:
However, if you prefer the use of the animation frame from Plotly Express, you can have two separate plots: one for old frequencies and one for new using this code:
# Reformat data
df = original_df
dates = pd.to_datetime(np.unique(df.date)).strftime('%Y-%m-%d')
numDays = dates.shape[0]
print(numDays)
hours = np.arange(0,24)
numHours = hours.shape[0]
allDates = []
allHours = []
oldFreqs = []
newFreqs = []
for i in range(numDays):
    for j in range(numHours):
        allDates.append(dates[i])
        allHours.append(j)
        # Rows for this date and hour (empty if that hour is missing)
        row = df.loc[(df.date == dates[i]) & (df.hour == j)]
        if row.shape[0] != 0: # If data not missing
            oldFreqs.append(row.old_freq.to_numpy()[0])
            newFreqs.append(row.new_freq.to_numpy()[0])
        else:
            oldFreqs.append(0)
            newFreqs.append(0)
d = {'Date': allDates, 'Hour': allHours, 'Old_Freq': oldFreqs, 'New_Freq': newFreqs}
df2 = pd.DataFrame(data=d)
# Create px plot with animation
fig = px.bar(df2, x="Hour", y="Old_Freq", hover_data=["Old_Freq","New_Freq"], animation_frame="Date")
fig.show()
fig2 = px.bar(df2, x="Hour", y="New_Freq", hover_data=["Old_Freq","New_Freq"], animation_frame="Date")
fig2.show()
and here is the plot from that code:

Set Hue of Value Counts of Categories with a Numeric Column in Pandas

I have plotted out the value-counts of categories of a column in a bar plot - and now I want each category to be colored according to the average value of a numeric column for that category.
I also want a line plot showing the trend of the change in the average of the numeric column across different categories. So far, I've done this:
mask = lambda x: x>=50
n_mask = lambda x: x<50
true_mask = df['CatCol'].value_counts().loc[mask]
false_mask = df['CatCol'].value_counts().loc[n_mask].sum()
cols = true_mask
cols['Other Categories occurring less than 50 times'] = false_mask
cols.plot(kind='bar', alpha=0.7, figsize=(12,12))

Python pandas grouping for correlation analysis

Assume two dataframes, each with a datetime index, and each with one column of unnamed data. The dataframes are of different lengths and the datetime indexes may or may not overlap.
df1 is length 20. df2 is length 400. The data column consists of random floats.
I want to iterate through df2 taking 20 units per iteration, with each iteration incrementing the start vector by one unit - and similarly the end vector by one unit. On each iteration I want to calculate the correlation between the 20 units of df1 and the 20 units I've selected for this iteration of df2. This correlation coefficient and other statistics will then be recorded.
Once the loop is complete I want to plot df1 with the 20-unit vector of df2 that satisfies my statistical search - thus needing to keep up with some level of indexing to reacquire the vector once analysis has been completed.
Any thoughts?
Without knowing more specifics, such as why you are doing this or whether the dates matter, this will do what you asked. I'm happy to update based on your feedback.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
df1 = pd.DataFrame({'a':[random.randint(0, 20) for x in range(20)]}, index = pd.date_range(start = '2013-01-01',periods = 20, freq = 'D'))
df2 = pd.DataFrame({'b':[random.randint(0, 20) for x in range(400)]}, index = pd.date_range(start = '2013-01-10',periods = 400, freq = 'D'))
rows = []
window = len(df1) # 20 units per iteration
t0 = df1.reset_index()['a'] # grab the numbers from df1
for i in range(len(df2) - window + 1):
    t1 = df2.iloc[i:i+window].reset_index()['b'] # grab 20 days, incrementing by one each time
    t2 = df2.index[i] # label each window by its first day in df2
    rows.append(pd.DataFrame({'corr':t0.corr(t1)}, index = [t2])) # calculate the correlation and record it
corr = pd.concat(rows) # combine the per-window results into one DF
# plot it and save the graph
corr.plot()
plt.title("Correlation Graph")
plt.ylabel("(%)")
plt.grid(True)
plt.savefig('corr.png') # save before show(); afterwards the figure may already be closed
plt.show()
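The question also asks how to get back to the winning 20-unit vector once the loop is done. Because each result is labelled with the first date of its window, that label is enough to reacquire the slice. The sketch below assumes the "statistical search" means the highest correlation (an assumption, since the question does not say), uses invented random data, and plants a known match at position 100 so the result can be checked.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame({'a': rng.random(20)},
                   index=pd.date_range('2013-01-01', periods=20, freq='D'))
df2 = pd.DataFrame({'b': rng.random(400)},
                   index=pd.date_range('2013-01-10', periods=400, freq='D'))

# Plant a scaled copy of df1 at position 100 so the expected answer is known
df2.iloc[100:120, 0] = 3.0 * df1['a'].to_numpy() + 1.0

window = len(df1)
a = df1['a'].to_numpy()
b = df2['b'].to_numpy()
n_windows = len(df2) - window + 1

# Correlation of df1 with every 20-row window of df2, labelled by window start
corr = pd.Series(
    [np.corrcoef(a, b[i:i + window])[0, 1] for i in range(n_windows)],
    index=df2.index[:n_windows],
)

# Recover the best window from its start label
best_start = corr.idxmax()
best_pos = df2.index.get_loc(best_start)
best_window = df2.iloc[best_pos:best_pos + window]
print(best_start, corr.max())
```

best_window can then be plotted next to df1; resetting both indexes first puts them on a common 0 to 19 axis if the dates do not overlap.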

Scatter plot for non numeric data

I am learning to use matplotlib with pandas and I am having a little trouble with it. There is a dataframe which has districts and coffee shops as its y and x labels respectively. And the column values represent the start date of the coffee-shops in respective districts
starbucks cafe-cool barista ........ 60 shops
dist1 2008-09-18 2010-05-04 2007-02-21 ...............
dist2 2007-06-12 2011-02-17
dist3
.
.
100 districts
I want to plot a scatter plot with x axis as time series and y axis as coffee-shops. Since I couldn't figure out a direct one line way to plot this, I extracted the coffee-shops as one list and dates as other list.
shops = list(df.columns.values)
dt = pd.DataFrame(df.ix['dist1'])
dates = dt.set_index('dist1')
First I tried plt.plot(dates, shops). Got a ZeroDivisionError: integer division or modulo by zero - error. I could not figure out the reason for it. I saw on some posts that the data should be numeric, so I used ytick function.
y = [1, 2, 3, 4, 5, 6,...60]
Still, plt.plot(dates, y) threw the same ZeroDivisionError. If I could get past this, maybe I would be able to plot using the tick function. Source -
http://matplotlib.org/examples/ticks_and_spines/ticklabels_demo_rotation.html
I am trying to plot the graph for only first row/dist1. For that I fetched the first row as a dataframe df1 = df.ix[1] and then used the following
for badges, dates in df.iteritems():
    date = dates
    ax.plot_date(date, yval)
    # Record the number and label of the coffee shop
    label_ticks.append(yval)
    label_list.append(badges)
    yval+=1
.
I got an error at the line ax.plot_date(date, yval) saying x and y should have the same first dimension. Since I am plotting one by one for each coffee shop for dist1, shouldn't the length always be one for both x and y? PS: date is a datetime.date object
To achieve this you need to convert the dates to datetimes, see here for
an example. As mentioned you also need to convert the coffee shops into
some numbering system then change the tick labels accordingly.
Here is an attempt
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pandas as pd
from datetime import datetime
def get_datetime(string):
    "Converts string '2008-05-04' to datetime"
    return datetime.strptime(string, "%Y-%m-%d")
# Generate dataframe (None marks a shop with no start date in a district)
df = pd.DataFrame(dict(
    starbucks=["2008-09-18", "2007-06-12"],
    cafe_cool=["2010-05-04", "2011-02-17"],
    barista=["2007-02-21", None]),
    index=["dist1", "dist2"])
ax = plt.subplot(111)
label_list = []
label_ticks = []
yval = 1 # numbering system
# Iterate through coffee shops
for coffee_shop, dates in df.items():
    # Convert strings into datetime list, skipping missing dates
    datetimes = [get_datetime(date) for date in dates.dropna()]
    # Create list of yvals [yval, yval, ...] to plot against
    yval_list = np.zeros(len(datetimes))+yval
    ax.plot(datetimes, yval_list, 'o')
    # Record the number and label of the coffee shop
    label_ticks.append(yval)
    label_list.append(coffee_shop)
    yval+=1 # Change the number so they don't all sit at the same y position
# Now set the yticks appropriately
ax.set_yticks(label_ticks)
ax.set_yticklabels(label_list)
# Set the limits so we can see everything
ax.set_ylim(ax.get_ylim()[0]-1,
            ax.get_ylim()[1]+1)
plt.show()
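The same x and y values can also be prepared in pandas before any plotting. This is only a sketch of one possible route, on a toy frame modelled on the question's layout: melt reshapes the table to one row per (district, shop) pair, pd.to_datetime parses the strings, and pd.factorize supplies the numbering system. The names `points`, `district`, `shop` and `start` are made up for the illustration.

```python
import pandas as pd

# Toy layout modelled on the question; None marks a missing start date
df = pd.DataFrame(
    {'starbucks': ['2008-09-18', '2007-06-12'],
     'cafe_cool': ['2010-05-04', '2011-02-17'],
     'barista': ['2007-02-21', None]},
    index=['dist1', 'dist2'],
)

# One row per (district, shop) pair, dropping shops with no date
points = (df.rename_axis('district').reset_index()
            .melt(id_vars='district', var_name='shop', value_name='start')
            .dropna(subset=['start'])
            .reset_index(drop=True))

# Parse the date strings and give every shop an integer y position
points['start'] = pd.to_datetime(points['start'])
codes, shops = pd.factorize(points['shop'])
points['y'] = codes + 1
print(points)
```

points['start'] and points['y'] can then be passed to a single ax.plot(..., 'o') call, with `shops` used as the tick labels.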
