Python, Seaborn: Plotting frequencies with zero-values

I have a Pandas series with values for which I'd like to plot counts. This creates roughly what I want:
dy = sns.countplot(x=rated.year, color="#53A2BE")
axes = dy.axes
dy.set(xlabel='Release Year', ylabel = "Count")
dy.spines['top'].set_color('none')
dy.spines['right'].set_color('none')
plt.show()
The problem comes with missing data. There are 31 years with ratings, but over a timespan of 42 years. That means there should be some empty bins, which are not being displayed. Is there a way to configure this in Seaborn/Matplotlib? Should I use another type of graph, or is there another fix for this?
I thought about looking into whether it is possible to configure it as a time series, but I have the same problem with rating scales. So, on a 1-10 scale the count for e.g. 4 might be zero, and therefore '4' is not in the Pandas data series, which means it also does not show up in the graph.
The result I'd like is the full scale on the x-axis, with counts (for steps of one) on the y-axis, and showing zero/empty bins for missing instances of the scale, instead of simply showing the next bin for which data is available.
EDIT:
The data (rated.year) looks something like this:
import pandas as pd
rated = pd.DataFrame(data = [2016, 2004, 2007, 2010, 2015, 2016, 2016, 2015,
2011, 2010, 2016, 1975, 2011, 2016, 2015, 2016,
1993, 2011, 2013, 2011], columns = ["year"])
It has more values, but the format is the same. As you can see in..
rated.year.value_counts()
..there are quite a few x values for which the count would have to be zero in the graph. The current plot skips those values.

I solved the problem by using the solution suggested by @mwaskom in the comments to my question, i.e. passing an 'order' to countplot that lists all valid values for year, including those with a count of zero. This is the code that produces the graph:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
rated = pd.DataFrame(data = [2016, 2004, 2007, 2010, 2015, 2016, 2016, 2015,
2011, 2010, 2016, 1975, 2011, 2016, 2015, 2016,
1993, 2011, 2013, 2011], columns = ["year"])
dy = sns.countplot(x=rated.year, color="#53A2BE", order=list(range(rated.year.min(), rated.year.max() + 1)))
axes = dy.axes
dy.set(xlabel='Release Year', ylabel = "Count")
dy.spines['top'].set_color('none')
dy.spines['right'].set_color('none')
plt.show()

Consider a seaborn barplot built from a reindexed series cast to a dataframe:
# REINDEXED DATAFRAME: one row per year in the full range, zero where no data exists
rated_ser = (rated['year'].value_counts()
    .reindex(range(rated.year.min(), rated.year.max() + 1), fill_value=0)
    .rename_axis('year')
    .reset_index(name='count'))
# SNS BAR PLOT
dy = sns.barplot(x='year', y='count', data=rated_ser, color="#53A2BE")
dy.tick_params(axis='x', rotation=90) # ROTATE LABELS, 90 DEG.
dy.set(xlabel='Release Year', ylabel="Count")
dy.spines['top'].set_color('none')
dy.spines['right'].set_color('none')
plt.show()
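The question also mentions rating scales, where a value such as 4 may never occur. The same reindexing idea can be sketched for that case. The ratings below are made up for illustration, and the 1-10 scale is taken from the question:

```python
import pandas as pd

# Hypothetical ratings on a 1-10 scale; several scale values never occur
ratings = pd.Series([7, 8, 8, 10, 5, 7, 9, 8, 3, 7])

# The full scale, including values that are absent from the data
full_scale = range(1, 11)

# Count occurrences, then reindex so that missing values get a count of 0
counts = ratings.value_counts().reindex(full_scale, fill_value=0)

print(counts.loc[4])  # 0: the value 4 never occurs but still has a row
print(len(counts))    # 10: one entry per step of the scale
```

The resulting series can be passed to a bar plot, for example with counts.plot.bar(), or the same range can be given to countplot through its order argument.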

Related

How can I plot a categorical stacked bar chart over periodic time?

I'm trying to plot a stacked bar chart over fixed periods (5 years):
import pandas as pd
categorical = ["RL","CD(others)","DL","ML","ML","ML","DL","ML","DL","DL"]
year = [2014,2014,2015,2015,2016,2017,2019,2021,2022,2022]
df = pd.DataFrame({'year':year,
'keywords':categorical})
df
I tried relevant post1, post2, post3 to resolve the problem:
#solution1:Pivot table
df.pivot_table(index='year',
columns='keywords',
# values='paper_count',
aggfunc='sum')
#df.plot(x='year', y='paper_count', kind='bar')
#solution2: groupby
# reset_index() gives a column for counting after groupby uses year and category
ctdf = (df.reset_index()
.groupby(['year'], as_index=False)
.count()
# rename isn't strictly necessary here; it's just for readability
.rename(columns={'index':'paper_count'})
)
ctdf.plot(x='year', y='paper_count', kind='bar')
In the end, I couldn't figure out how to plot this periodically by counting every 5 years:
2000-2005, 2005-2010, 2015-2020, 2020-2025.
expected output:

I don't fully understand the logic, if the provided example is supposed to match the data, but you can use pandas.cut to form the bins and then count the rows per bin and keyword. The code below produces a simple count; apply cumsum to the result if you want the cumulative sum instead:
years = list(range(2000, 2030, 5))
# [2000, 2005, 2010, 2015, 2020, 2025]
labels = [f'{a}-{b}' for a,b in zip(years, years[1:])]
# ['2000-2005', '2005-2010', '2010-2015', '2015-2020', '2020-2025']
(df.assign(year=pd.cut(df['year'], bins=years, labels=labels))
.groupby(['year', 'keywords'])['year'].count()
.unstack()
.plot.bar(stacked=True)
)
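One detail worth checking with pandas.cut is which edge of each bin is closed. By default the bins are right-closed, so a year that sits exactly on an edge, such as 2015, is assigned to the earlier period. A small sketch, using a made-up sample of years, compares the default with right=False:

```python
import pandas as pd

years = list(range(2000, 2030, 5))
labels = [f"{a}-{b}" for a, b in zip(years, years[1:])]

# Hypothetical sample containing years that fall exactly on bin edges
sample = pd.Series([2014, 2015, 2020])

# Default: bins are (2010, 2015], (2015, 2020], ...
default_bins = pd.cut(sample, bins=years, labels=labels)

# right=False: bins are [2010, 2015), [2015, 2020), ...
left_closed = pd.cut(sample, bins=years, labels=labels, right=False)

print(default_bins.astype(str).tolist())
# ['2010-2015', '2010-2015', '2015-2020']
print(left_closed.astype(str).tolist())
# ['2010-2015', '2015-2020', '2020-2025']
```

Which variant is appropriate depends on how the periods are meant to be read; the labels alone do not say whether 2015 belongs to 2010-2015 or 2015-2020.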
With the red line:
years = list(range(2000, 2030, 5))
# [2000, 2005, 2010, 2015, 2020, 2025]
labels = [f'{a}-{b}' for a,b in zip(years, years[1:])]
# ['2000-2005', '2005-2010', '2010-2015', '2015-2020', '2020-2025']
df2 = (df
.assign(year=pd.cut(df['year'], bins=years, labels=labels))
.groupby(['year', 'keywords'])['year'].count()
.unstack()
)
ax = df2.plot.bar(stacked=True)
# adding arbitrary shift (0.1)
df2.sum(axis=1).add(0.1).plot(ax=ax, color='red', marker='s', label='paper count')
ax.legend()
output:

Seaborn boxplot and lineplot not showing properly

I'm trying to overlay a seaborn lineplot on a seaborn boxplot.
The result is somewhat "shocking" :)
The two graphs seem to be drawn in the same figure but separately: the box plot is compressed on the left side and the line plot on the right side.
Note that if I run the two graphs separately, they work fine.
I cannot figure out how to make it work.
Thank you in advance for any help.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mydata = pd.DataFrame({
'a':[2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020],
'v':[383.00, 519.00, 366.00, 436.00, 1348.00, 211.00, 139.00, 614.00, 365.00, 365.00, 383.00, 602.00, 994.00, 719.00, 589.00, 365.00, 990.00, 1142.00, 262.00, 1263.00, 507.00, 222.00, 363.00, 274.00, 195.00, 730.00, 730.00, 592.00, 479.00, 607.00, 292.00, 657.00, 453.00, 691.00, 673.00, 705]
})
means =mydata.groupby('a').v.mean().reset_index()
fig, ax = plt.subplots(figsize=(15,8))
sns.boxplot(data=mydata, x='a', y='v', ax=ax, showfliers=False)
sns.lineplot(data=means, x='a', y='v', ax=ax)
plt.show()

Surprisingly, I did not find a duplicate for this question with a good answer, so I elevate my comment to one. Arise, Sir Comment:
Instead of lineplot, you should use pointplot
...
sns.boxplot(data=mydata, x='a', y='v', ax=ax, showfliers=False)
sns.pointplot(data=means, x='a', y='v', ax=ax)
plt.show()
Sample output:
Pointplot is the categorical counterpart of lineplot: it places its points on the same categorical axis that boxplot uses. The seaborn documentation has more on the difference between relational and categorical plotting.
The question came up why there is no problem with lineplot for the following data:
mydata = pd.DataFrame({'a':["m1", "m1", "m1", "m2", "m2", "m2", "m2", "m3", "m3", "m3", "m3", "m4", "m4", "m4", "m4"], 'v':[11.37, 11.31, 10.93, 9.43, 9.62, 6.61, 9.31, 11.27, 8.47, 11.86, 8.77, 8.8, 9.58, 12.26, 10] })
means =mydata.groupby('a').v.mean().reset_index()
print(means)
fig, ax = plt.subplots(figsize=(15,8))
sns.boxplot(data=mydata, x='a', y='v', ax=ax, showfliers=False)
sns.lineplot(data=means, x='a', y='v', ax=ax)
plt.show()
Output:
The difference is that this example is not ambiguous for lineplot. Seaborn's lineplot accepts both categorical and numerical data. Seemingly, the code first tries to plot the x values as numbers, and only if that is not possible does it treat them as categories (I don't know the source code). This is probably a good design decision by seaborn, because the alternative (not accepting categorical data) would cause far more problems than the rare case where people plot both categorical and numerical data in the same figure. A warning from seaborn would be a good thing, though.
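The mismatch described above can be made concrete with a small sketch. It assumes that boxplot draws its boxes at positions 0, 1, 2, ... in sorted category order, so a line drawn at those positions with plain matplotlib lands on top of the boxes, whereas a line drawn at x = 2012, 2013, ... would end up far to the right. The data is a shortened, illustrative subset of the question's data, and the Agg backend is only there so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Shortened subset of the question's data, for illustration only
mydata = pd.DataFrame({
    "a": [2012, 2012, 2013, 2013, 2014, 2014],
    "v": [383.0, 519.0, 1348.0, 211.0, 365.0, 602.0],
})
means = mydata.groupby("a")["v"].mean().reset_index()

fig, ax = plt.subplots()
sns.boxplot(data=mydata, x="a", y="v", ax=ax, showfliers=False)

# Assumption: box i sits at x = i, so plot the means at 0..n-1
# instead of at the year values themselves
positions = list(range(len(means)))
(mean_line,) = ax.plot(positions, means["v"], color="black", marker="o")
```

This is only an alternative to pointplot for cases where plain matplotlib styling is preferred; pointplot handles the category positions itself.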

Add a date event on a line chart in Python

So, I have a line chart that shows a random sales data from 2010 to 2020. But, I want to add a vertical line, or some visual resource to indicate something important that happened in 2014, for example. How can I do that in Python? Any library would do!

Try using plt.axvline() with matplotlib:
import matplotlib.pyplot as plt
x = [2015, 2016, 2017, 2018, 2019, 2020]
y = [1000, 1200, 2500, 1000, 1100, 250]
plt.plot(x, y)
plt.title("Sales line chart")
plt.xlabel("Year")
plt.ylabel("Sales")
# draw a vertical line at x = 2019
plt.axvline(x=2019, label='line at x = {}'.format(2019), c='red')
plt.legend()  # needed for the axvline label to be displayed
plt.show()
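For the case described in the question (data from 2010 to 2020 with an event in 2014), a slightly fuller sketch might also add a text annotation next to the line. The sales figures and the "Event" label are made up for illustration, and the Agg backend is only used so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Made-up sales data for 2010 through 2020
years = list(range(2010, 2021))
sales = [900, 950, 1000, 1100, 1500, 1400, 1200, 2500, 1000, 1100, 250]
event_year = 2014  # the year to highlight

fig, ax = plt.subplots()
ax.plot(years, sales, label="Sales")

# Vertical dashed line marking the event
event_line = ax.axvline(x=event_year, color="red", linestyle="--",
                        label=f"Event in {event_year}")

# Short text label placed just to the right of the line, near the top
ax.annotate("Event", xy=(event_year, max(sales)),
            xytext=(5, 0), textcoords="offset points", color="red")

ax.set_xlabel("Year")
ax.set_ylabel("Sales")
ax.legend()
# Call plt.show() or fig.savefig(...) to view the result
```

If the event spans a period rather than a single year, ax.axvspan(start, end, alpha=0.2) shades a range in a similar way.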

visualize two columns in the same data set

I am trying to group and sort four columns, count the values, and chart them in the same bar graph to see how the counts have changed over time.
Year  Month  Bl_year  Month
2018  Jan    2019     Jan
2018  Feb    2018     Mar
2018  Dec    2020     Dec
2019  Apr    2019     Sep
2020  Nov    2020     Dec
2019  Sep    2018     Jan
I tried to sort and group first, counting values by year and then by month.
df_Activity_count = df.sort_values(['year','month'],ascending = True).groupby('month')
df_Activity_count_BL = df.sort_values(['BL year','BL month'],ascending = True).groupby('BL month')
Now I am trying to compare these two in the same bar. Can someone please help.

Try to pass ax to your plot command:
df_Activity_count = df.sort_values(['year','month'],ascending = True).groupby('month')
df_Activity_count_BL = df.sort_values(['BL year','BL month'],ascending = True).groupby('BL month')
ax = df_Activity_count['year'].value_counts().unstack(0).plot.bar()
df_Activity_count_BL['BL year'].value_counts().unstack(0).plot.bar(ax=ax)

Since you tagged matplotlib, I will chip in a solution using pyplot
import matplotlib.pyplot as plt
# Create an axis object
fig, ax = plt.subplots()
# Define dataframes
df_Activity_count = df.sort_values(['year','month'],ascending = True).groupby('month')
df_Activity_count_BL = df.sort_values(['BL year','BL month'],ascending = True).groupby('BL month')
# Plot using the axis object ax defined above
df_Activity_count['year'].value_counts().unstack(0).plot.bar(ax=ax)
df_Activity_count_BL['BL year'].value_counts().unstack(0).plot.bar(ax=ax)
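When two bar plots are drawn onto the same axis, the second set of bars can cover the first. One possible alternative, sketched here under assumptions, is to combine the counts into a single dataframe first so that pandas draws them side by side. The column names 'year' and 'BL year' follow the code above, the rows are taken from the table in the question, and the Agg backend is only used so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import pandas as pd

# Rows from the question's table; column names assumed to match the code above
df = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019, 2020, 2019],
    "month": ["Jan", "Feb", "Dec", "Apr", "Nov", "Sep"],
    "BL year": [2019, 2018, 2020, 2019, 2020, 2018],
    "BL month": ["Jan", "Mar", "Dec", "Sep", "Dec", "Jan"],
})

# One column of counts per source column, aligned on the year value
counts = (
    pd.concat(
        {"year": df["year"].value_counts(),
         "BL year": df["BL year"].value_counts()},
        axis=1,
    )
    .fillna(0)      # years missing from one column get a count of 0
    .astype(int)
    .sort_index()
)

# Grouped bars: for each year, one bar per column
ax = counts.plot.bar()
ax.set_xlabel("Year")
ax.set_ylabel("Count")
```

The same pattern should work for the month columns, although months would need an explicit ordering (for example via reindex with a list of month names) to appear in calendar order.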

Seaborn chart converging on same point not visible

I have a dataframe with two columns, VOL and INVOL, and for a particular year their values are the same. As a result, when plotting with seaborn, I cannot see the value of one column where the two converge.
For example:
My dataframe is
When I use seaborn with the code below
df5_test = df5_test.melt('FY', var_name='cols', value_name='vals')
g = sns.catplot(x="FY", y="vals", hue='cols', data=df5_test, kind='point')
the chart does not show both points at the shared value of 0.06.
I tried pandas plotting and got the same result.
Please advise what I should do. Thanks in advance.

Your plot looks legitimate. The two lines overlap perfectly because the data from 2016 to 2018 is exactly the same. You could plot the two lines separately and add or subtract a small value from one of them to shift it slightly. For example:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'FY': [2012, 2013, 2014, 2015, 2016, 2017, 2018],
'VOL_PCT': [0, 0.08, 0.07, 0.06, 0, 0, 0.06],
'INVOL_PC': [0, 0, 0, 0, 0, 0, 0.06]})
# plot
fig, ax = plt.subplots()
sns.lineplot(x=df.FY, y=df.VOL_PCT, ax=ax)
sns.lineplot(x=df.FY + .01, y=df.INVOL_PC - .001, ax=ax)
In addition, given the type of your data, you could also consider using stack plots. For example:
fig, ax = plt.subplots()
labels = ['VOL_PCT', 'INVOL_PC']
ax.stackplot(df.FY, df.VOL_PCT, df.INVOL_PC, labels=labels)
ax.legend(loc='upper left');
Ref. Stackplot
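If shifting the data is undesirable, another option is to leave the values untouched and make the overlap visible through styling: a thick, semi-transparent line underneath and a thin dashed line with different markers on top. This is a minimal sketch with plain matplotlib, reusing the dataframe from above; the specific widths, markers, and alpha value are arbitrary choices, and the Agg backend is only used so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'FY': [2012, 2013, 2014, 2015, 2016, 2017, 2018],
                   'VOL_PCT': [0, 0.08, 0.07, 0.06, 0, 0, 0.06],
                   'INVOL_PC': [0, 0, 0, 0, 0, 0, 0.06]})

fig, ax = plt.subplots()

# Wide, semi-transparent line drawn first so it stays visible underneath
(vol_line,) = ax.plot(df["FY"], df["VOL_PCT"], linewidth=4, alpha=0.5,
                      marker="o", label="VOL_PCT")

# Thin dashed line with a different marker drawn on top
(invol_line,) = ax.plot(df["FY"], df["INVOL_PC"], linewidth=1.5,
                        linestyle="--", marker="x", label="INVOL_PC")

ax.set_xlabel("FY")
ax.legend(loc="upper left")
# Call plt.show() or fig.savefig(...) to view the result
```

Because no offset is applied, the plotted values stay exact, which may matter if readers take numbers off the chart.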
