Stacked histogram fails for string values in X axis - python

I have the following code for a stacked histogram, and it works fine when FIELD is numeric. However, when I use FIELD_str, which holds abc1, abc2, abc3, etc. instead of 1, 2, 3, ..., it fails with the error TypeError: cannot concatenate 'str' and 'float' objects. How can I replace (directly or indirectly) the numbers on the X axis with their string values? This is needed to make the chart more readable:
filter = df["CLUSTER"] == 1
plt.ylabel("Absolute frequency")
plt.hist([df["FIELD"][filter],df["FIELD"][~filter]],stacked=True,
color=['#8A2BE2', '#EE3B3B'], label=['1','0'])
plt.legend()
plt.show()
DATASET:
s_field1 = pd.Series(["5","5","5","8","8","9","10"])
s_field1_str = pd.Series(["abc1","abc1","abc1","abc2","abc2","abc3","abc4"])
s_cluster = pd.Series(["1","1","0","1","0","1","0"])
df = pd.concat([s_field1, s_field1_str, s_cluster], axis=1)
df.columns = ['FIELD', 'FIELD_str', 'CLUSTER']
df
EDIT:
I tried to create a dictionary but cannot figure out how to put it inside the histogram:
# since python 2.7
import collections
yes = collections.Counter(df["FIELD_str"][filter])
no = collections.Counter(df["FIELD_str"][~filter])

You probably need a bar plot rather than a histogram: a histogram is by definition meant for data on a numeric (interval) scale, not a nominal (categorical) one. You can try this:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
s_field1 = pd.Series(["5","5","5","8","8","9","10"])
s_field1_str = pd.Series(["abc1","abc1","abc1","abc2","abc2","abc3","abc4"])
s_cluster = pd.Series(["1","1","0","1","0","1","0"])
df = pd.concat([s_field1, s_field1_str, s_cluster], axis=1)
df.columns = ['FIELD', 'FIELD_str', 'CLUSTER']
# calculate counts by FIELD_str and CLUSTER
counts = df.groupby(['FIELD_str', 'CLUSTER']).count().unstack()
counts.columns = counts.columns.get_level_values(1)
counts.index.name = 'xaxis label here'
ax = counts.plot.bar(stacked=True, title='Some title here')
ax.set_ylabel("yaxis label here")
plt.tight_layout()
plt.savefig("stacked_barplot.png")
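The same count table can also be built in one step with pd.crosstab. The following is only a minimal sketch under a few assumptions: the data are the sample values from the question, the colors are the ones the question assigns to clusters 1 and 0, and the Agg backend is selected merely so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# sample data from the question (string categories and string cluster labels)
df = pd.DataFrame({
    "FIELD_str": ["abc1", "abc1", "abc1", "abc2", "abc2", "abc3", "abc4"],
    "CLUSTER": ["1", "1", "0", "1", "0", "1", "0"],
})

# rows: categories for the x axis, columns: clusters, values: absolute frequency
counts = pd.crosstab(df["FIELD_str"], df["CLUSTER"])

# columns are sorted as "0", "1", so the colors are given in that order
ax = counts.plot.bar(stacked=True, color=["#EE3B3B", "#8A2BE2"])
ax.set_ylabel("Absolute frequency")
plt.tight_layout()
```

Combinations that never occur (for example abc3 in cluster 0) show up as 0 in the crosstab instead of NaN, which keeps the stacked bars well defined.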

Related

Python: Barplot colored according to a third variable

Currently I am trying to create a Barplot that shows the amount of reviews for an app per week. The bar should however be colored according to a third variable which contains the average rating of the reviews in each week (range: 1 to 5).
I followed the instructions of the following post to create the graph: Python: Barplot with colorbar
The code works fine:
# Import Packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
# Create Dataframe
data = [[1, 10, 3.4], [2, 15, 3.9], [3, 12, 3.6], [4, 30,1.2]]
df = pd.DataFrame(data, columns = ["week", "count", "score"])
# Convert to lists
data_x = list(df["week"])
data_hight = list(df["count"])
data_color = list(df["score"])
#Create Barplot:
data_color = [x / max(data_color) for x in data_color]
fig, ax = plt.subplots(figsize=(15, 4))
my_cmap = plt.cm.get_cmap('RdYlGn')
colors = my_cmap(data_color)
rects = ax.bar(data_x, data_hight, color=colors)
sm = ScalarMappable(cmap=my_cmap, norm=plt.Normalize(1,5))
sm.set_array([])
cbar = plt.colorbar(sm)
cbar.set_label('Color', rotation=270,labelpad=25)
plt.show()
Now to the issue: as you might notice, the average score in week 4 is 1.2, yet the color of the bar suggests a value of around 2.5. I understand that this stems from the following line, which normalizes the values by dividing them by the maximum value:
data_color = [x / max(data_color) for x in data_color]
Unfortunately, I am not able to change this line so that the colors reflect the absolute values of the scores; e.g. with an average score of 1.2, the last bar should be colored deep red, not light orange. I tried plugging in the raw (unnormalized) scores to solve the issue, but doing so draws all bars in the same green color. Since this is only my second Python project, I have a hard time understanding what is going on and would be very thankful for any advice or solution.
Cheers Neil
You correctly identified the normalization as the problem. In the linked code by valued SO user @ImportanceOfBeingEarnest, it is defined for the interval [0, 1]. If you want another normalization range [normmin, normmax], you have to take this into account when normalizing:
# Import Packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
# Create Dataframe
data = [[1, 10, 3.4], [2, 15, 3.9], [3, 12, 3.6], [4, 30,1.2]]
df = pd.DataFrame(data, columns = ["week", "mycount", "score"])
# Not necessary to convert to lists, pandas series or numpy array is also fine
data_x = df.week
data_hight = df.mycount
data_color = df.score
#Create Barplot:
normmin=1
normmax=5
data_color = [(x-normmin) / (normmax-normmin) for x in data_color] #see the difference here
fig, ax = plt.subplots(figsize=(15, 4))
my_cmap = plt.get_cmap('RdYlGn')
colors = my_cmap(data_color)
rects = ax.bar(data_x, data_hight, color=colors)
sm = ScalarMappable(cmap=my_cmap, norm=plt.Normalize(normmin,normmax))
sm.set_array([])
cbar = plt.colorbar(sm, ax=ax)
cbar.set_label('Color', rotation=270,labelpad=25)
plt.show()
Sample output:
Obviously, this does not check that all values are indeed within the range [normmin, normmax], so a better script would make sure that all values adhere to this specification. We could, alternatively, address this problem by clipping the values that are outside the normalization range:
#...
import numpy as np
#.....
#Create Barplot:
normmin=1
normmax=3.5
data_color = [(x-normmin) / (normmax-normmin) for x in np.clip(data_color, normmin, normmax)]
#....
You may also have noticed another change that I introduced: you don't have to provide lists; pandas Series or NumPy arrays are fine, too. And if your column names do not clash with pandas methods such as count, you can access them as df.ABC instead of df["ABC"].
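The manual rescaling can also be delegated to the same Normalize object that feeds the colorbar, so the bar colors and the colorbar cannot drift apart. This is only a sketch on the sample data from above; clip=True is my own addition and plays the role of np.clip, and the Agg backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

data = [[1, 10, 3.4], [2, 15, 3.9], [3, 12, 3.6], [4, 30, 1.2]]
df = pd.DataFrame(data, columns=["week", "mycount", "score"])

# one normalization for both the bars and the colorbar;
# clip=True maps scores outside [1, 5] onto the ends of the range
norm = Normalize(vmin=1, vmax=5, clip=True)
cmap = plt.get_cmap("RdYlGn")

scaled = norm(df["score"].to_numpy())  # (score - 1) / (5 - 1)
colors = cmap(scaled)                  # one RGBA row per bar

fig, ax = plt.subplots(figsize=(15, 4))
ax.bar(df["week"], df["mycount"], color=colors)
cbar = fig.colorbar(ScalarMappable(norm=norm, cmap=cmap), ax=ax)
cbar.set_label("Score", rotation=270, labelpad=25)
```

With this range the week-4 score of 1.2 is mapped to 0.05, i.e. close to the red end of RdYlGn.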

Merging legends of different lineplots using seaborn

I would like to plot two dataframes in 'long' format, with different y-axes, in one plot using sns.lineplot(). However, I fail to get a single legend containing the elements of both line plots.
The issue is similar to this: Secondary axis with twinx(): how to add to legend?, though I'd like to use seaborn.
A minimal working example up to the point I got stuck is given below.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import itertools
# mock dataset
lst = range(1,11)
steps1 = list(itertools.chain.from_iterable(itertools.repeat(x, 4) for x in lst))
labels1 = ['A','B']*20
values1 = list(np.random.uniform(0,1,40))
df1 = pd.DataFrame({'steps':steps1, 'lab':labels1, 'vals':values1})
lst = range(6,11)
steps2 = list(itertools.chain.from_iterable(itertools.repeat(x, 4) for x in lst))
labels2 = ['C','D']*10
values2 = list(np.random.uniform(10,20,20))
df2 = pd.DataFrame({'steps':steps2, 'lab2':labels2, 'others':values2})
# plotting
fig, ax = plt.subplots()
fig = sns.lineplot(x='steps',y='vals', data=df1, hue='lab',palette='bright', legend='brief')
ax2 = ax.twinx()
fig2 = sns.lineplot(x='steps',y='others', hue='lab2', data=df2 ,palette='dark', legend='brief')
# How do I merge the legends into one?
# the solution below gives me one merged and one separate legend
h1,l1 = fig.get_legend_handles_labels()
h2,l2 = fig2.get_legend_handles_labels()
ax.legend(loc=3, handles=h1+h2, labels = l1+l2)
I resolved it by removing the now-redundant legend of the second axes with ax2.get_legend().remove().
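Put together, the merge-then-remove pattern looks roughly like this. The sketch uses plain matplotlib lines as stand-ins for the two sns.lineplot calls (an assumption made so it runs without seaborn); the explicit ax2.legend() call imitates the legend that seaborn draws on the second axes, and the data are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
steps = np.arange(1, 11)

fig, ax = plt.subplots()
ax2 = ax.twinx()

# stand-ins for the two seaborn line plots (made-up values)
for lab in ("A", "B"):
    ax.plot(steps, rng.uniform(0, 1, steps.size), label=lab)
for lab in ("C", "D"):
    ax2.plot(steps[5:], rng.uniform(10, 20, 5), linestyle="--", label=lab)

ax2.legend()  # imitates the separate legend seaborn adds to the second axes

# collect the entries of both axes and draw them in one legend on ax
h1, l1 = ax.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()
ax.legend(handles=h1 + h2, labels=l1 + l2, loc="lower left")

# drop the leftover legend on the twin axes, if there is one
leftover = ax2.get_legend()
if leftover is not None:
    leftover.remove()
```

Checking for None before calling remove() keeps the snippet from failing when the second plot was drawn without a legend in the first place.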

How to plot a heatmap using seaborn or matplotlib?

I have a dataframe that I am trying to visualize as a heatmap. I used matplotlib to draw it, but the plot shows data that is not a part of my dataframe.
I created the heatmap by adapting an example I found online to my data. But on the left side and at the top of the graph there are seemingly random values that are not a part of my data, and I'm not sure how to remove them.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from io import StringIO
url = 'http://mcubed.net/ncaab/seeds.shtml'
#Getting the website text
data = requests.get(url).text
#Parsing the website
soup = BeautifulSoup(data, "html5lib")
#Create an empty list
dflist = []
#If we look at the html, we don't want the tag b, but whats next to it
#StringIO(b.next.next), takes the correct text and makes it readable to pandas
for b in soup.findAll({"b"})[2:-1]:
    dflist.append(pd.read_csv(StringIO(b.next.next), sep = r'\s+', header = None))
dflist[0]
#Created a new list, due to the melt we are going to do not been able to replace
#the dataframes in DFList
meltedDF = []
#The second item in the loop is the team number starting from 1
for df, teamnumber in zip(dflist, (np.arange(len(dflist))+1)):
    #Creating the team name
    name = "Team " + str(teamnumber)
    #Making the team name a column, with the values in df[0] and df[1] in our dataframes
    df[name] = df[0] + df[1]
    #Melting the dataframe to make the team name its own column
    meltedDF.append(df.melt(id_vars = [0, 1, 2, 3]))
# Concat all the melted DataFrames
allTeamStats = pd.concat(meltedDF)
# Final cleaning of our new single DataFrame
allTeamStats = allTeamStats.rename(columns = {0:name, 2:'Record', 3:'Win Percent', 'variable':'Team' , 'value': 'VS'})\
    .reindex(['Team', 'VS', 'Record', 'Win Percent'], axis = 1)
allTeamStats
#Graph visualization Making a HeatMap
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
y=["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16"]
x=["16","15","14","13","12","11","10","9","8","7","6","5","4","3","2","1"]
winp = []
for i in x:
    lst = []
    for j in y:
        percent = allTeamStats.loc[(allTeamStats["Team"]== 'Team '+i) &\
                                   (allTeamStats["VS"]== "vs.#"+j)]['Win Percent'].iloc[0]
        percent = float(percent[:-1])
        lst.append(percent)
    winp.append(lst)
winpercentage= np.array([[]])
fig,ax=plt.subplots(figsize=(18,18))
im= ax.imshow(winp, cmap='hot')
# We want to show all ticks...
ax.set_xticks(np.arange(len(y)))
ax.set_yticks(np.arange(len(x)))
# ... and label them with the respective list entries
ax.set_xticklabels(y)
ax.set_yticklabels(x)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(len(x)):
    for j in range(len(y)):
        text = ax.text(j, i, winp[i][j],
                       ha="center", va="center", color="red")
ax.set_title("Win Percentage of Each Matchup", fontsize= 40)
heatmap = plt.pcolor(winp)
plt.colorbar(heatmap)
ax.set_ylabel('Seeds', fontsize=40)
ax.set_xlabel('Seeds', fontsize=40)
plt.show()
The result is what I want except for the two strips on the left side and at the top of the heatmap. I'm unsure where these values come from; to see them more easily I used cmap='hot'. Could you help me fix my code so it plots correctly, or plot an entirely new heatmap of my data using seaborn (my TA told me to try seaborn, but I've never used it)? Anything helps. Thanks!
I think the culprit is this line in your code: im= ax.imshow(winp, cmap='hot'). Delete it and try again. Anything you plotted after that line was drawn on top of what that line created, so the left and top "margins" are the only parts of the underlying image that remain visible.
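The reverse choice is also possible: keep imshow and drop the plt.pcolor call. imshow centers each cell on integer coordinates and puts row 0 at the top, whereas pcolor by default draws cell (i, j) between j and j+1 and puts row 0 at the bottom, so the two layers are offset by half a cell, which would explain the visible strips. Since the ticks and text annotations in the question sit at integer positions, they line up with imshow. The following is only a sketch on a made-up 4x4 matrix, not the scraped data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# made-up win percentages standing in for the scraped data
rng = np.random.default_rng(0)
winp = rng.uniform(0, 100, size=(4, 4)).round(1)
cols = ["1", "2", "3", "4"]   # seeds along the x axis
rows = ["4", "3", "2", "1"]   # seeds along the y axis, top to bottom

fig, ax = plt.subplots(figsize=(6, 6))
im = ax.imshow(winp, cmap="viridis")  # the only image layer

# cell centers sit on integers, so ticks and labels go there
ax.set_xticks(np.arange(len(cols)))
ax.set_yticks(np.arange(len(rows)))
ax.set_xticklabels(cols)
ax.set_yticklabels(rows)

# annotate every cell with its value
for i in range(len(rows)):
    for j in range(len(cols)):
        ax.text(j, i, f"{winp[i, j]:.1f}", ha="center", va="center", color="white")

fig.colorbar(im, ax=ax)  # colorbar built from the imshow image, no pcolor needed
ax.set_title("Win Percentage of Each Matchup")
```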

Pandas groupby results on the same plot

I am dealing with the following data frame (only for illustration, actual df is quite large):
   seq      x1      y1
0    2  0.7725  0.2105
1    2  0.8098  0.3456
2    2  0.7457  0.5436
3    2  0.4168  0.7610
4    2  0.3181  0.8790
5    3  0.2092  0.5498
6    3  0.0591  0.6357
7    5  0.9937  0.5364
8    5  0.3756  0.7635
9    5  0.1661  0.8364
I am trying to plot a multi-line graph of the above coordinates ("x1" on the x-axis against "y1" on the y-axis).
Rows with the same "seq" form one path and have to be plotted as one separate line: all the x, y coordinates with seq = 2 belong to one line, and so on.
I am able to plot them, but only on separate graphs. I want all the lines on the same graph; I tried using subplots, but I am not getting it right.
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib notebook
df.groupby("seq").plot(kind = "line", x = "x1", y = "y1")
This creates hundreds of graphs (one per unique seq). Please suggest a way to get all the lines on the same graph.
UPDATE
To resolve the above problem, I implemented the following code:
fig, ax = plt.subplots(figsize=(12,8))
df.groupby('seq').plot(kind='line', x = "x1", y = "y1", ax = ax)
plt.title("abc")
plt.show()
Now, I want to plot the lines in specific colors. I am clustering the paths with seq = 2 and seq = 5 into cluster 1, and the path with seq = 3 into another cluster.
So there are two lines in cluster 1, which I want in red, and one line in cluster 2, which can be green.
How should I proceed with this?
You need to create the axes before plotting and pass them to plot, as in this example:
import pandas as pd
import matplotlib.pylab as plt
import numpy as np
# random df
df = pd.DataFrame(np.random.randint(0,10,size=(25, 3)), columns=['ProjID','Xcoord','Ycoord'])
# plot groupby results on the same canvas
fig, ax = plt.subplots(figsize=(8,6))
df.groupby('ProjID').plot(kind='line', x = "Xcoord", y = "Ycoord", ax=ax)
plt.show()
Consider the dataframe df
df = pd.DataFrame(dict(
ProjID=np.repeat(range(10), 10),
Xcoord=np.random.rand(100),
Ycoord=np.random.rand(100),
))
Then we create abstract art like this
df.set_index('Xcoord').groupby('ProjID').Ycoord.plot()
Another way:
for k,g in df.groupby('ProjID'):
    plt.plot(g['Xcoord'],g['Ycoord'])
plt.show()
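An explicit loop over the groups also makes it easy to pick a color per group, which is what the update to the question asks for. A minimal sketch, assuming the sample rows from the question; the two dictionaries cluster_of and color_of are hypothetical helpers that encode the clustering described there (seq 2 and 5 in cluster 1, seq 3 in cluster 2).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# sample rows from the question
df = pd.DataFrame({
    "seq": [2, 2, 2, 2, 2, 3, 3, 5, 5, 5],
    "x1": [0.7725, 0.8098, 0.7457, 0.4168, 0.3181,
           0.2092, 0.0591, 0.9937, 0.3756, 0.1661],
    "y1": [0.2105, 0.3456, 0.5436, 0.7610, 0.8790,
           0.5498, 0.6357, 0.5364, 0.7635, 0.8364],
})

# hypothetical lookup tables: which cluster a path belongs to, and its color
cluster_of = {2: 1, 3: 2, 5: 1}
color_of = {1: "red", 2: "green"}

fig, ax = plt.subplots(figsize=(12, 8))
for seq, path in df.groupby("seq"):
    # one line per path, colored by the cluster of that path
    ax.plot(path["x1"], path["y1"],
            color=color_of[cluster_of[seq]], label=f"seq {seq}")
ax.legend()
```

For a large dataframe the cluster assignment would more naturally live in its own column; the dictionaries are only the shortest way to show the idea.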
Here is a working example including the ability to adjust legend names.
fig, ax = plt.subplots() #Axes shared by all groups
grp = df.groupby('groupCol')
legendNames = grp.apply(lambda x: x.name) #Get group names using the name attribute.
#legendNames = list(grp.groups.keys()) #Alternative way to get group names. Someone else might be able to speak on speed. This might iterate through the grouper and find keys which could be slower? Not sure
plots = grp.plot('x1','y1',legend=True, ax=ax)
for txt, name in zip(ax.legend_.texts, legendNames):
    txt.set_text(name)
Explanation:
Legend values are stored in the attribute ax.legend_, which in turn holds a list of Text objects, one per group; the Text class is part of the matplotlib.text API. To set the text of such an object, you can use its setter method set_text(self, s).
As a side note, the Text class has a number of set_X() methods that allow you to change the font size, font, color, etc. I haven't used them, so I can't say for sure that they work, but I can't see why not.
Based on Serenity's answer, I made the legend better.
import pandas as pd
import matplotlib.pylab as plt
import numpy as np
# random df
df = pd.DataFrame(np.random.randint(0,10,size=(25, 3)), columns=['ProjID','Xcoord','Ycoord'])
# plot groupby results on the same canvas
grouped = df.groupby('ProjID')
fig, ax = plt.subplots(figsize=(8,6))
grouped.plot(kind='line', x = "Xcoord", y = "Ycoord", ax=ax)
ax.legend(labels=grouped.groups.keys()) ## better legend
plt.show()
and you can also do it like:
grouped = df.groupby('ProjID')
fig, ax = plt.subplots(figsize=(8,6))
g_plot = lambda x:x.plot(x = "Xcoord", y = "Ycoord", ax=ax, label=x.name)
grouped.apply(g_plot)
plt.show()
and it looks like:

Timeseries plot with min/max shading using Seaborn

I am trying to create a 3-line time series plot based on the following data, in a Week x Overload graph, where each Cluster is a different line.
I have multiple observations for each (Cluster, Week) pair (5 each at the moment, eventually 1000). I would like the points on the line to be the average Overload value for that (Cluster, Week) pair, and the band to span its min/max values.
I am currently using the following bit of code to plot it, but I'm not getting any lines, as I don't know what unit to specify with the current dataframe:
ax14 = sns.tsplot(data = long_total_cluster_capacity_overload_df, value = "Overload", time = "Week", condition = "Cluster")
GIST Data
I have a feeling I still need to re-shape my dataframe, but I have no idea how. Looking for a final results that looks like this
Based off this incredible answer, I was able to create a monkey patch to beautifully do what you are looking for.
import numpy as np
import pandas as pd
import seaborn as sns
import seaborn.timeseries
def _plot_range_band(*args, central_data=None, ci=None, data=None, **kwargs):
    upper = data.max(axis=0)
    lower = data.min(axis=0)
    #import pdb; pdb.set_trace()
    ci = np.asarray((lower, upper))
    kwargs.update({"central_data": central_data, "ci": ci, "data": data})
    seaborn.timeseries._plot_ci_band(*args, **kwargs)
seaborn.timeseries._plot_range_band = _plot_range_band
cluster_overload = pd.read_csv("TSplot.csv", delim_whitespace=True)
cluster_overload['Unit'] = cluster_overload.groupby(['Cluster','Week']).cumcount()
ax = sns.tsplot(time='Week',value="Overload", condition="Cluster", unit="Unit", data=cluster_overload,
                err_style="range_band", n_boot=0)
Output Graph:
Notice that the shaded regions line up with the true maximum and minimums in the line graph!
If you figure out why the unit variable is required, please let me know.
If you do not want them all on the same graph then:
import numpy as np
import pandas as pd
import seaborn as sns
import seaborn.timeseries
def _plot_range_band(*args, central_data=None, ci=None, data=None, **kwargs):
    upper = data.max(axis=0)
    lower = data.min(axis=0)
    #import pdb; pdb.set_trace()
    ci = np.asarray((lower, upper))
    kwargs.update({"central_data": central_data, "ci": ci, "data": data})
    seaborn.timeseries._plot_ci_band(*args, **kwargs)
seaborn.timeseries._plot_range_band = _plot_range_band
cluster_overload = pd.read_csv("TSplot.csv", delim_whitespace=True)
cluster_overload['subindex'] = cluster_overload.groupby(['Cluster','Week']).cumcount()
def customPlot(*args,**kwargs):
    df = kwargs.pop('data')
    pivoted = df.pivot(index='subindex', columns='Week', values='Overload')
    ax = sns.tsplot(pivoted.values, err_style="range_band", n_boot=0, color=kwargs['color'])
g = sns.FacetGrid(cluster_overload, row="Cluster", sharey=False, hue='Cluster', aspect=3)
g = g.map_dataframe(customPlot, 'Week', 'Overload','subindex')
Which produces the following, (you can obviously play with the aspect ratio if you think the proportions are off)
I finally used the good old plot with a design (subplots) that seems (to me) more readable.
import pandas as pd
import seaborn as sns
df = pd.read_csv('TSplot.csv', sep='\t', index_col=0)
# Compute the min, mean and max (could also be other values)
grouped = df.groupby(["Cluster", "Week"]).agg({'Overload': ['min', 'mean', 'max']}).unstack("Cluster")
# Plot with subplots since it is more readable
axes = grouped.loc[:,('Overload', 'mean')].plot(subplots=True)
# Getting the color palette used
palette = sns.color_palette()
# Initializing an index to get each cluster and each color
index = 0
for ax in axes:
    ax.fill_between(grouped.index, grouped.loc[:,('Overload', 'mean', index + 1)],
                    grouped.loc[:,('Overload', 'max', index + 1 )], alpha=.2, color=palette[index])
    ax.fill_between(grouped.index,
                    grouped.loc[:,('Overload', 'min', index + 1)] , grouped.loc[:,('Overload', 'mean', index + 1)], alpha=.2, color=palette[index])
    index +=1
I really thought I would be able to do it with seaborn.tsplot. But it does not quite look right. Here is the result I get with seaborn:
cluster_overload = pd.read_csv("TSplot.csv", delim_whitespace=True)
cluster_overload['Unit'] = cluster_overload.groupby(['Cluster','Week']).cumcount()
ax = sns.tsplot(time='Week',value="Overload", condition="Cluster", ci=100, unit="Unit", data=cluster_overload)
Outputs:
I am really confused as to why the unit parameter is necessary, since my understanding is that all the data is aggregated based on (time, condition). The seaborn documentation defines unit as:
"Field in the data DataFrame identifying the sampling unit (e.g. subject, neuron, etc.). The error representation will collapse over units at each time/condition observation. This has no role when data is an array."
I am not certain what 'collapse over' means, especially since by my reading it wouldn't be a required variable.
Anyways, here's the output if you want exactly what you discussed, though it is not nearly as pretty. I am not sure how to manually shade in those regions, but please share if you figure it out.
cluster_overload = pd.read_csv("TSplot.csv", delim_whitespace=True)
grouped = cluster_overload.groupby(['Cluster','Week'],as_index=False)
stats = grouped.agg(['min','mean','max']).unstack().T
stats.index = stats.index.droplevel(0)
colors = ['b','g','r']
ax = stats.loc['mean'].plot(color=colors, alpha=0.8, linewidth=3)
stats.loc['max'].plot(ax=ax,color=colors,legend=False, alpha=0.3)
stats.loc['min'].plot(ax=ax,color=colors,legend=False, alpha=0.3)
Outputs:
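One way to shade the regions manually on a single axes is matplotlib's fill_between, fed with the per-(Cluster, Week) min and max. This is only a sketch: the data are randomly generated stand-ins for TSplot.csv (3 clusters, 4 weeks, 5 observations each, all assumptions), and it does not depend on sns.tsplot, which recent seaborn releases no longer ship.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# made-up stand-in for TSplot.csv: 3 clusters x 4 weeks x 5 observations
rng = np.random.default_rng(0)
rows = [
    {"Cluster": c, "Week": w, "Overload": rng.normal(10 * c, 2)}
    for c in (1, 2, 3) for w in range(1, 5) for _ in range(5)
]
data = pd.DataFrame(rows)

# one row per (Cluster, Week) with the three statistics as columns
stats = data.groupby(["Cluster", "Week"])["Overload"].agg(["min", "mean", "max"])

fig, ax = plt.subplots()
for cluster, s in stats.groupby(level="Cluster"):
    s = s.droplevel("Cluster")  # index is now just Week
    (line,) = ax.plot(s.index, s["mean"], label=f"Cluster {cluster}")
    # band between min and max, in the same color as the mean line
    ax.fill_between(s.index, s["min"], s["max"],
                    color=line.get_color(), alpha=0.2)

ax.set_xlabel("Week")
ax.set_ylabel("Overload")
ax.legend()
```

No unit column is needed here, because the aggregation over observations is done explicitly by groupby before anything is drawn.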
