Facetgrid to plot stacked normalised counts - Seaborn - python

I'm trying to use a Seaborn FacetGrid to plot normalised counts of values rather than raw counts. With the code below, each row should show one unique value of Item, the x-axis should show Num, and the stacked values should come from Label.
However, the rows aren't being partitioned: the same data is displayed for every Item.
import pandas as pd
import seaborn as sns
df = pd.DataFrame({
'Num' : [1,2,1,2,3,2,1,3,2],
'Label' : ['A','B','C','B','A','C','C','A','B'],
'Item' : ['Up','Left','Up','Left','Down','Right','Up','Down','Right'],
})
g = sns.FacetGrid(df,
row = 'Item',
row_order = ['Up','Right','Down','Left'],
aspect = 2,
height = 4,
sharex = True,
legend_out = True
)
g.map(sns.histplot, x = 'Num', hue = 'Label', data = df, multiple = 'fill', shrink=.8)
g.add_legend()

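Try g.map_dataframe(sns.histplot, x='Num', hue = 'Label', multiple = 'fill', shrink=.8) instead. I'm not a seaborn expert; I just looked it up at https://seaborn.pydata.org/generated/seaborn.FacetGrid.html, and map_dataframe seems to work better than map here. Passing data = df to g.map hands the whole DataFrame to every facet, whereas map_dataframe passes each facet only its own subset.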

Related

Why do the bar chart ticks merge into one when plotting dataframe but work when plotting row?

I need to make a graph that would look like this:
Here's some sample data:
data = {"Small-Mid":367, "Large":0, "XXL":0, "FF":328, "AA":0, "Total":695}
df = pd.DataFrame([data], columns=data.keys())
It's a dataframe that has only one row, if I try to plot the whole dataframe I get this ugly thing:
fig, ax = plt.subplots(figsize=(11.96, 4.42))
df.plot(kind="bar")
plt.show()
The ugly thing, two graphs, one empty the other one just wrong:
If I plot by selecting the row then it looks fine:
fig, ax = plt.subplots(figsize=(11.96, 4.42))
row = df.iloc[0]
row.plot(kind='bar')
plt.show()
A much nicer graph:
The issue is that I need the Total bar to be a different colour than the other bars and I can't do that when plotting the row, because it only accepts a single value rather than a dictionary for colours.
What I don't understand is why plotting the whole dataframe returns two plots and puts all the bars on one tick mark. How do I make it work?
You should re-shape your dataframe with pandas.melt:
df = pd.melt(frame = df,
var_name = 'variable',
value_name = 'value')
Then you can plot your bar chart with seaborn.barplot:
sns.barplot(ax = ax, data = df, x = 'variable', y = 'value')
Complete Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = {"Small-Mid":367, "Large":0, "XXL":0, "FF":328, "AA":0, "Total":695}
df = pd.DataFrame([data], columns=data.keys())
df = pd.melt(frame = df,
var_name = 'variable',
value_name = 'value')
fig, ax = plt.subplots(figsize=(11.96, 4.42))
sns.barplot(ax = ax, data = df, x = 'variable', y = 'value')
plt.show()
If you want only 'Total' column to be a different color from others, you can define a color-correspondence dictionary:
colors = {"Small-Mid":'blue', "Large":'blue', "XXL":'blue', "FF":'blue', "AA":'blue', "Total":'red'}
and pass it to seaborn as palette parameter:
sns.barplot(ax = ax, data = df, x = 'variable', y = 'value', palette = colors.values())
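On the "why" part of the question: DataFrame.plot draws one tick per row and one bar per column, so a one-row frame puts every bar on a single tick. Because no ax is passed, pandas also opens its own figure next to the empty one created by plt.subplots. A minimal sketch that avoids both, assuming plain matplotlib is acceptable, passes one colour per bar to ax.bar (bar_colors is a name introduced here for illustration):

```python
import matplotlib.pyplot as plt
import pandas as pd

data = {"Small-Mid": 367, "Large": 0, "XXL": 0, "FF": 328, "AA": 0, "Total": 695}
df = pd.DataFrame([data])

# Take the single row as a Series: its index holds the bar names.
row = df.iloc[0]

# One colour per bar; only 'Total' is highlighted.
bar_colors = ["red" if name == "Total" else "blue" for name in row.index]

fig, ax = plt.subplots(figsize=(11.96, 4.42))
# Drawing on the existing ax keeps everything in one figure.
bars = ax.bar(row.index, row.values, color=bar_colors)
ax.set_ylabel("value")
```

Call plt.show() or fig.savefig(...) afterwards to display or store the chart.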

Generating multiple plots of categorical data counts to show trend over multiple years of data

I have a dataframe containing ~14000 rows and ~100 columns. I want to visualize how the frequencies of one column of categorical data have changed over time (a second column holds the year as YYYY). Here is a simplified data frame:
import pandas as pd
df = pd.DataFrame({
'Year': ('1999','1999','1999','2000','2000','2001','2001','2002','2003'),
'Cat': ('A','A','C','B','B','B','C','D','D')
})
Using Pandas groupby and reset_index, I am left with the data of interest in a nice table.
df = df.groupby(['Year', 'Cat'])['Cat'].size()
df = df.reset_index(name='count')
For each year, I'd like a plot showing the frequency (count) of each Cat (even if 0). As the dataset spans 16 years, I'd like it in a 4x4 matrix of bar charts (the test dataset above would be limited to 2x2).
I have experience with basic plotting in matplotlib and seaborn, but my python experience is limited and I can't seem to crack this yet.
You can use reindex to get zero values and then plot with seaborn FacetGrid. I used value_counts to get the dataframe first, but you could use set_index with the dataframe you have currently then use reset_index.
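import matplotlib.pyplot as plt
import seaborn as sns
# df is the original (ungrouped) frame with 'Year' and 'Cat' columns
df2 = df.groupby('Year')['Cat'].value_counts()
df2.name = 'count'
# Build every Year x Cat combination so that missing pairs become 0
ix = pd.MultiIndex.from_product([df2.index.levels[0], list('ABCD')], names=['Year', 'Cat'])
df2 = df2.reindex(ix, fill_value=0).reset_index()
g = sns.FacetGrid(df2, col='Year', col_wrap=2)
g.map(plt.bar, 'Cat', 'count')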
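The set_index route mentioned in this answer is not spelled out, so here is one possible sketch of it, starting from the asker's count table. It pivots Cat into columns so that missing pairs become 0, then melts back to long form; the names counts, wide and full are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Year': ('1999', '1999', '1999', '2000', '2000', '2001', '2001', '2002', '2003'),
    'Cat': ('A', 'A', 'C', 'B', 'B', 'B', 'C', 'D', 'D'),
})

# The asker's table: one row per (Year, Cat) pair that actually occurs.
counts = df.groupby(['Year', 'Cat']).size().reset_index(name='count')

# Pivot Cat into columns; pairs that never occur are filled with 0.
wide = counts.set_index(['Year', 'Cat'])['count'].unstack(fill_value=0)

# Back to long form: one row per Year x Cat combination, zeros included.
full = (wide.reset_index()
            .melt(id_vars='Year', var_name='Cat', value_name='count')
            .sort_values(['Year', 'Cat'])
            .reset_index(drop=True))
```

The resulting frame has the same Year, Cat and count columns as df2 in the answer, so it can be passed to the same FacetGrid call.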
I'm not quite sure what you want to do, but I hope this will help you.
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame({
'Year': ('1999','1999','1999','2000','2000','2001','2001','2002','2003'),
'Cat': ('A','A','C','B','B','B','C','D','D')
})
labels = df.Year
Cat = df.Cat
x = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, Cat, width, label='Cat')
# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Scoring or I dunno')
ax.set_title('Some Letters')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3), # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')
autolabel(rects1)
fig.tight_layout()
plt.show()
Result
https://matplotlib.org/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py

Color map based on countries' frequency counts

I have users data set with Country column, and want to plot a map of users' distribution across the countries. I converted the data set into a dictionary, where keys are Country names, and values - frequency counts for countries. The dictionary looks like this:
'usa': 139421,
'canada': 21601,
'united kingdom': 18314,
'germany': 17024,
'spain': 13096,
[...]
To plot distribution on a world map I used this code:
#Convert to dictionary
counts = users['Country'].value_counts().to_dict()
#Country names
def getList(dict):
    return [*dict]
countrs = getList(counts)
#Frequency counts
freqs = list(counts.values())
#Plotting
data = dict(
type = 'choropleth',
colorscale = 'Viridis',
reversescale = True,
locations = countrs,
locationmode = "country names",
z = freqs,
text = users['Country'],
colorbar = {'title' : 'Number of Users'},
)
layout = dict(title = 'Number of Users per Country',
geo = dict(showframe = False)
)
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap,validate=False)
This is the result I got:
The coloring is wrong; it shows that all countries fall into 0-20K range, which is false. Is there a way to fix this? Thank you
Without access to your complete dataset, this is really hard to answer. I'd suggest starting out with this example instead:
Plot 1:
Here you can simply replace lifeExp with your data, and everything should be fine as long as your data is correctly formatted. In the following snippet I've created a random number for each country to represent your counts variable.
Code:
import plotly.express as px
import numpy as np
np.random.seed(12)
gapminder = px.data.gapminder().query("year==2007")
gapminder['counts'] = np.random.uniform(low=100000, high=200000, size=len(gapminder)).tolist()
fig = px.choropleth(gapminder, locations="iso_alpha",
color="counts",
hover_name="country", # column to add to hover information
color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
Plot 2:
Let me know how this works out for you.
Edit: suggestion with your data:
If you have a dictionary with country names and counts, you can easily construct a dataframe of it and perform a left join to get this:
Plot 2:
Just make sure that your dictionary values are lists, and that the country names are spelled and formatted correctly.
Code 2:
import plotly.express as px
import numpy as np
import pandas as pd
np.random.seed(12)
gapminder = px.data.gapminder().query("year==2007")
#gapminder['counts'] = np.nan
d = {'United States': [139421],
'Canada': [21601],
'United Kingdom': [18314],
'Germany': [17024],
'Spain': [13096]}
yourdata = pd.DataFrame(d).T.reset_index()
yourdata.columns=['country', 'count']
df=pd.merge(gapminder, yourdata, how='left', on='country')
fig = px.choropleth(df, locations="iso_alpha",
color="count",
hover_name="country", # column to add to hover information
color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
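The dictionary in the question holds plain numbers keyed by lowercase names, so it needs a little preparation before the left join. A minimal sketch, assuming pandas only: the aliases lookup is hypothetical and covers just the one name in the sample that differs from the gapminder spelling by more than capitalisation.

```python
import pandas as pd

# Counts as produced by users['Country'].value_counts().to_dict() in the question.
counts = {'usa': 139421, 'canada': 21601, 'united kingdom': 18314,
          'germany': 17024, 'spain': 13096}

# Hypothetical lookup for names that need more than title-casing; extend as needed.
aliases = {'usa': 'United States'}

# A Series accepts scalar dictionary values directly, so no lists are required.
yourdata = (pd.Series(counts, name='count')
              .rename_axis('country')
              .reset_index())

# Normalise the spelling so the names can match the reference table.
yourdata['country'] = yourdata['country'].map(lambda c: aliases.get(c, c.title()))
```

After this step yourdata has the same country and count columns as in Code 2, so the pd.merge call there applies unchanged.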

Stacked histogram fails for string values in X axis

I have the following code for a stacked histogram, and it works fine when FIELD is numeric. However, when I use FIELD_str, which holds abc1, abc2, abc3, etc. instead of 1, 2, 3, ..., it fails with the error TypeError: cannot concatenate 'str' and 'float' objects. How can I replace (directly or indirectly) the numbers on the x-axis with their string values? This is needed to make the chart more readable:
filter = df["CLUSTER"] == 1
plt.ylabel("Absolute frequency")
plt.hist([df["FIELD"][filter],df["FIELD"][~filter]],stacked=True,
color=['#8A2BE2', '#EE3B3B'], label=['1','0'])
plt.legend()
plt.show()
DATASET:
s_field1 = pd.Series(["5","5","5","8","8","9","10"])
s_field1_str = pd.Series(["abc1","abc1","abc1","abc2","abc2","abc3","abc4"])
s_cluster = pd.Series(["1","1","0","1","0","1","0"])
df = pd.concat([s_field1, s_field1_str, s_cluster], axis=1)
df
EDIT:
I tried to create a dictionary but cannot figure out how to put it inside the histogram:
# since python 2.7
import collections
yes = collections.Counter(df["FIELD_str"][filter])
no = collections.Counter(df["FIELD_str"][~filter])
You probably have to use barplot instead of histogram, as histogram by definition is for data on numeric (interval) scale, not nominal (categorical) scale. You can try this:
import pandas as pd
import matplotlib.pyplot as plt
# %matplotlib inline  # uncomment when running in a Jupyter notebook
s_field1 = pd.Series(["5","5","5","8","8","9","10"])
s_field1_str = pd.Series(["abc1","abc1","abc1","abc2","abc2","abc3","abc4"])
s_cluster = pd.Series(["1","1","0","1","0","1","0"])
df = pd.concat([s_field1, s_field1_str, s_cluster], axis=1)
df.columns = ['FIELD', 'FIELD_str', 'CLUSTER']
# calculate counts by CLUSTER and FIELD_str
counts = df.groupby(['FIELD_str', 'CLUSTER']).count().unstack()
counts.columns = counts.columns.get_level_values(1)
counts.index.name = 'xaxis label here'
ax = counts.plot.bar(stacked=True, title='Some title here')
ax.set_ylabel("yaxis label here")
plt.tight_layout()
plt.savefig("stacked_barplot.png")
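A shorter variant of the same idea, offered as a sketch rather than as part of the answer above, uses pd.crosstab. Unlike groupby().count().unstack(), which leaves NaN for combinations that never occur, crosstab fills them with 0. The colour list reuses the two colours from the question, ordered to match the sorted CLUSTER columns '0' and '1'.

```python
import pandas as pd

df = pd.DataFrame({
    'FIELD_str': ["abc1", "abc1", "abc1", "abc2", "abc2", "abc3", "abc4"],
    'CLUSTER': ["1", "1", "0", "1", "0", "1", "0"],
})

# Rows are the x-axis categories, columns are the clusters, cells are counts.
counts = pd.crosstab(df['FIELD_str'], df['CLUSTER'])

# One stacked bar per FIELD_str value; '0' is drawn red and '1' purple.
ax = counts.plot.bar(stacked=True, color=['#EE3B3B', '#8A2BE2'])
ax.set_ylabel("Absolute frequency")
```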

Labels in table of visualization of Pandas

Hi, I am plotting a Pandas dataframe. The dataframe looks like this:
;Cosine;Neutralized
author;0.842075;0.641600
genre;0.839696;0.903227
author+genre;0.833966;0.681121
And the code for plotting that I am using is:
fig = ari_total.plot(kind="bar", legend = False, colormap= "summer",
figsize= ([7,6]), title = "Homogeinity "+corpora+" (texts: "+str(amount_texts)+")", table=True,
use_index=False, ylim =[0,1]).get_figure()
The result is nice, but it has a problem:
As you can see, the labels from the index of the table ("author", "genre" and "author+genre") are rendered over 0, 1 and 2.
My question: how can I remove these numbers while still using the same function? I am passing the argument use_index=False, which I thought would remove the labels from the bars, but it actually only replaces them with these numbers.
I would be very thankful if you could help. Regards!
Use fig.axes[0].get_xaxis().set_visible(False).
code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame()
df['Cosine'] = [0.842075,0.839696,0.833966]
df['Neutralized'] = [0.641600,0.903227,0.681121]
df.index = ['author', 'genre', 'author+genre']
fig = df.plot(kind="bar", legend = False, colormap= "summer",
figsize= ([7,6]), title = "whatever", table=True,
use_index=False, ylim =[0,1]).get_figure()
fig.axes[0].get_xaxis().set_visible(False)
plt.show()
result:
