Normalisation of data - python

I am trying to plot the data below in a pie chart, split first by Group and then by Id. Because the count is very small for some rows, those slices are not visible in the chart.
I would like to normalise the data, but I am not sure how to do that. Any help would be sincerely appreciated.
Group Id Count
G1 12 276938
G1 13 102
G2 12 27
G3 12 4683
G3 13 7
G4 12 301

Don't use a pie chart for data that doesn't suit one. With counts this far apart, a bar chart with a logarithmic y-axis keeps the small values visible:
(
df.groupby(['Group', 'Id'])
.sum().Count.sort_values(ascending=False)
.plot.bar(logy=True, subplots=True)
)
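If you still want normalised values, here is a minimal sketch, assuming the data from the question is in a DataFrame with columns Group, Id and Count. The column names PctOfGroup and PctOfTotal are made up for this illustration. Note that converting to percentages does not change the relative sizes, so the tiny rows stay tiny in a pie chart; that is why the log-scaled bar chart above is suggested.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "Group": ["G1", "G1", "G2", "G3", "G3", "G4"],
    "Id": [12, 13, 12, 12, 13, 12],
    "Count": [276938, 102, 27, 4683, 7, 301],
})

# Share of each Id within its own group, in percent
group_totals = df.groupby("Group")["Count"].transform("sum")
df["PctOfGroup"] = df["Count"] / group_totals * 100

# Share of each row in the overall total, in percent
df["PctOfTotal"] = df["Count"] / df["Count"].sum() * 100

print(df.round(3))
```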

Related

plot bar chart for multiple categorical values with groupby [duplicate]

Let's assume I have a dataframe and I'm looking at 2 columns of it (2 series).
Using one of the columns - "no_employees" below - Can someone kindly help me figure out how to create 6 different pie charts or bar charts (1 for each grouping of no_employees) that illustrate the value counts for the Yes/No values in the treatment column? I'll use matplotlib or seaborn, whatever you feel is easiest.
I'm using the following line of code to generate the output below:
dataframe_title.groupby(['no_employees']).treatment.value_counts()
But now I'm stuck. Do I use seaborn? .plot? This seems like it should be easy, and I know there are some cases where I can make subplots=True, but I'm really confused. Thank you so much.
no_employees treatment
1-5 Yes 88
No 71
100-500 Yes 95
No 80
26-100 Yes 149
No 139
500-1000 No 33
Yes 27
6-25 No 162
Yes 127
More than 1000 Yes 146
No 135
The importance of data encoding:
The purpose of data visualization is to more easily convey information (e.g. in this case, the relative number of 'treatments' per category)
A bar chart easily displays the important information:
how many in each group said 'Yes' or 'No'
the relative sizes of each group
A pie plot is more commonly used to display a single sample, where the groups within the sample sum to 100%.
Wikipedia: Pie Chart
Research has shown that comparison by angle is less accurate than comparison by length, in that people are less able to discern differences.
Statisticians generally regard pie charts as a poor method of displaying information, and they are uncommon in scientific literature.
This data is not well represented by a pie plot, because each company size is a separate population, which will require 6 pie plots to be correctly represented.
The data can be placed into a pie plot, as others have shown, but that doesn't mean it should be.
Regardless of the type of plot, the data must be in the correct shape for the plot API.
Tested with pandas 1.3.0, seaborn 0.11.1, and matplotlib 3.4.2
Set up a test DataFrame
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np # for sample data only
np.random.seed(365)
cats = ['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']
data = {'no_employees': np.random.choice(cats, size=(1000,)),
'treatment': np.random.choice(['Yes', 'No'], size=(1000,))}
df = pd.DataFrame(data)
# set a categorical order for the x-axis to be ordered
df.no_employees = pd.Categorical(df.no_employees, categories=cats, ordered=True)
no_employees treatment
0 26-100 No
1 1-5 Yes
2 >1000 No
3 100-500 Yes
4 500-1000 Yes
Plotting with pandas.DataFrame.plot():
This requires grouping the dataframe to get .value_counts, and unstacking with pandas.DataFrame.unstack.
# to get the dataframe in the correct shape, unstack the groupby result
dfu = df.groupby(['no_employees']).treatment.value_counts().unstack()
treatment No Yes
no_employees
1-5 78 72
6-25 83 86
26-100 83 76
100-500 91 84
500-1000 78 83
>1000 95 91
# plot
ax = dfu.plot(kind='bar', figsize=(7, 5), xlabel='Number of Employees in Company', ylabel='Count', rot=0)
ax.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
Plotting with seaborn
seaborn is a high-level API for matplotlib.
seaborn.barplot()
Requires a DataFrame in a tidy (long) format, which is done by grouping the dataframe to get .value_counts, and resetting the index with pandas.Series.reset_index
May also be done with the figure-level interface using sns.catplot() with kind='bar'
# groupby, get value_counts, and reset the index
dft = df.groupby(['no_employees']).treatment.value_counts().reset_index(name='Count')
no_employees treatment Count
0 1-5 No 78
1 1-5 Yes 72
2 6-25 Yes 86
3 6-25 No 83
4 26-100 No 83
5 26-100 Yes 76
6 100-500 No 91
7 100-500 Yes 84
8 500-1000 Yes 83
9 500-1000 No 78
10 >1000 No 95
11 >1000 Yes 91
# plot
p = sns.barplot(x='no_employees', y='Count', data=dft, hue='treatment')
p.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
p.set(xlabel='Number of Employees in Company')
seaborn.countplot()
Uses the original dataframe, df, without any transformations.
May also be done with the figure-level interface using sns.catplot() with kind='count'
p = sns.countplot(data=df, x='no_employees', hue='treatment')
p.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
p.set(xlabel='Number of Employees in Company')
Output of barplot and countplot
Let's reshape the dataframe and plot with subplots=True:
df_chart = df1.unstack()['Pct']
axs = df_chart.plot.pie(subplots=True, figsize=(4,9), layout=(2,1), legend=False, title=df_chart.columns.tolist())
ax_flat = axs.flatten()
# hide the y-axis label on every pie
for ax in ax_flat:
    ax.yaxis.label.set_visible(False)
Output:

Black-white/Gray bar charts in Python

I have the following small dataset:
Tom Dick Harry Jack
Sub
Maths 9 12 3 10
Science 16 40 1 10
English 12 11 4 15
French 17 15 2 15
Sports 23 19 3 15
I want to create a bar chart in black-white/gray colors for this data.
I can produce such a figure with the following code:
df.plot(kind='bar', colormap='gray')
plt.show()
However, the fourth bar (Jack's) is pure white, the same as the background. How can I avoid the last bar being pure white?
Use one of the other colormaps or enter the color names manually. Alternatively, you can change the background by using a different style sheet such as ggplot, seaborn or fivethirtyeight.
colors=['darkgray','gray','dimgray','lightgray']
df.plot(kind='bar',color=colors )
plt.show()
df.plot(kind='bar',colormap=plt.cm.viridis )
plt.show()
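Another option, sketched here as an assumption rather than something from the answer above, is to keep the gray colormap but sample it only over part of its range, so that no bar lands on pure white. The bounds 0.1 and 0.8 are arbitrary picks for illustration, and the DataFrame is rebuilt from the table in the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Data copied from the question
df = pd.DataFrame(
    {"Tom": [9, 16, 12, 17, 23], "Dick": [12, 40, 11, 15, 19],
     "Harry": [3, 1, 4, 2, 3], "Jack": [10, 10, 15, 15, 15]},
    index=pd.Index(["Maths", "Science", "English", "French", "Sports"], name="Sub"),
)

# Sample the gray colormap from 0.1 (near black) to 0.8 (light gray),
# stopping short of 1.0, which would be pure white
shades = plt.cm.gray(np.linspace(0.1, 0.8, len(df.columns)))

# One RGBA colour per column; a black edge keeps light bars readable
ax = df.plot(kind="bar", color=[tuple(c) for c in shades], edgecolor="black")
```

Call plt.show() afterwards to display the figure.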
Using the style sheets here:
https://matplotlib.org/3.1.1/gallery/style_sheets/style_sheets_reference.html
plt.style.use('seaborn')  # change the style sheet here; Matplotlib 3.6+ names this style 'seaborn-v0_8'
df.plot(kind='bar',colormap=plt.cm.gray)
plt.show()
Here is what the output looks like:

Stacked bar plot of large data in python

I would like to plot a stacked bar plot from a csv file in python. I have three columns of data
year word frequency
2018 xyz 12
2017 gfh 14
2018 sdd 10
2015 fdh 1
2014 sss 3
2014 gfh 12
2013 gfh 2
2012 gfh 4
2011 wer 5
2010 krj 4
2009 krj 4
2019 bfg 4
... 300+ rows of data.
I need to go through all the data and plot a stacked bar plot grouped by year: the x-axis is the word, the y-axis is the frequency, and the legend colours show the year. I want to see how each word evolved from year to year. Some of the technology words are used in every year, so the stacked bars should add the values on top of each other. For example, the word gfh plots 14 for 2017, and for 2014 I want gfh to plot a value of 12 (in a different colour) on top of the 2017 segment. How do I do this? So far I have read the csv file in my code, but I don't understand how to go over all the rows and stack the words appropriately, since some words repeat across the years. The years are in random order in the csv, but I sorted them by year to make it easier. I am just learning Python and trying to understand this plotting routine. Since I have 40 years of data and ~20 words, I thought a stacked bar plot would be the best way to represent them. Any other visualisation method is also welcome. Any help is highly appreciated.
This can be done using pandas:
import pandas as pd
df = pd.read_csv("file.csv")
# Aggregate data
df = df.groupby(["word", "year"], as_index=False).agg({"frequency": "sum"})
# Create list to sort by
sorter = (
df.groupby(["word"], as_index=False)
.agg({"frequency": "sum"})
.sort_values("frequency")["word"]
.values
)
# Pivot, reindex, and plot
df = df.pivot(index="word", columns="year", values="frequency")
df = df.reindex(sorter)
df.plot.bar(stacked=True)
Which outputs:
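The same reshaping can be tried without a CSV file. This is a small sketch using a few rows from the question typed in by hand; it swaps in pivot_table with fill_value=0 (my choice, not part of the answer above) so that word/year combinations with no data become 0 instead of NaN. The name wide is only illustrative.

```python
import pandas as pd

# A few rows copied from the question
rows = [(2018, "xyz", 12), (2017, "gfh", 14), (2018, "sdd", 10),
        (2014, "sss", 3), (2014, "gfh", 12), (2013, "gfh", 2)]
df = pd.DataFrame(rows, columns=["year", "word", "frequency"])

# One row per word, one column per year; duplicates are summed,
# missing word/year combinations are filled with 0
wide = df.pivot_table(index="word", columns="year", values="frequency",
                      aggfunc="sum", fill_value=0)

# Order the words by their total frequency, smallest first
wide = wide.loc[wide.sum(axis=1).sort_values().index]
print(wide)

# wide.plot.bar(stacked=True) then draws one stacked bar per word (needs matplotlib)
```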

Stacked area chart with datetime axis

I am attempting to create a Bokeh stacked area chart from the following Pandas DataFrame.
An example of the DataFrame (df) is as follows:
date tom jerry bill
2014-12-07 25 12 25
2014-12-14 15 16 30
2014-12-21 10 23 32
2014-12-28 12 13 55
2015-01-04 5 15 20
2015-01-11 0 15 18
2015-01-18 8 9 17
2015-01-25 11 5 16
The above DataFrame is a snippet of the total df, which spans a number of years and contains additional names to the ones shown.
I am attempting to use the datetime column date as the x-axis, with the count information for each name as the y-axis.
Any assistance that anyone could provide would be greatly appreciated.
You can create a stacked area chart by using the patch glyph. I first used df.cumsum to stack the values in the dataframe by row. After that I append two rows to the dataframe with the max and min date and Y value 0. I plot the patches in a reverse order of the column list (excluding the date column) so the person with the highest values is getting plotted first and the persons with lower values are plotted after.
Another implementation of a stacked area chart can be found here.
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.palettes import inferno
from bokeh.models.formatters import DatetimeTickFormatter
df = pd.read_csv('stackData.csv')
names = list(df)[1:]  # every column except the date column
# running total across each row, so each column holds the top edge of its band
df_stack = df[names].cumsum(axis=1)
df_stack['date'] = df['date'].astype('datetime64[ns]')
# close the polygons: add a zero-height point at the last date, then at the first date
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
for edge_date in (df_stack['date'].max(), df_stack['date'].min()):
    bot = {'date': edge_date, **{column: 0 for column in names}}
    df_stack = pd.concat([df_stack, pd.DataFrame([bot])], ignore_index=True)
p = figure(x_axis_type='datetime')
# newer Bokeh releases (3.x) expect a plain string here instead of a list
p.xaxis.formatter = DatetimeTickFormatter(days=["%d/%m/%Y"])
p.xaxis.major_label_orientation = 45
# draw the tallest cumulative band first so the lower ones are painted over it
for person, color in zip(names[::-1], inferno(len(names))):
    # legend_label replaces the older legend argument (Bokeh 1.4+)
    p.patch(x=df_stack['date'], y=df_stack[person], color=color, legend_label=person)
p.legend.click_policy = "hide"
show(p)
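To see what the cumulative sum step produces, here is a pandas-only sketch using the first three rows from the question. The names upper and lower are hypothetical and only label the top and bottom edge of each person's band; no Bokeh is needed to run it.

```python
import pandas as pd

# First three rows copied from the question
df = pd.DataFrame({
    "date": pd.to_datetime(["2014-12-07", "2014-12-14", "2014-12-21"]),
    "tom": [25, 15, 10],
    "jerry": [12, 16, 23],
    "bill": [25, 30, 32],
})
names = ["tom", "jerry", "bill"]

# Running total across each row: the top edge of each person's band
upper = df[names].cumsum(axis=1)

# Bottom edge of each band is the top edge minus that person's own value
lower = upper - df[names]

print(upper)
print(lower)
```

The last column of upper is the total height of the stack on each date, which is why the answer plots the columns in reverse order.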

pandas plot dataframe as multiple bar charts

This is my pandas dataframe df:
ab channel booked
0 control book_it 466
1 control contact_me 536
2 control instant 17
3 treatment book_it 494
4 treatment contact_me 56
5 treatment instant 22
I want to plot 3 groups of bars (one group per channel):
for each channel:
plot the control booked value vs the treatment booked value.
Hence I should get 6 bars in 3 groups, where each group has the control and treatment booked values.
So far I was only able to plot booked, but not grouped by ab:
ax = df_conv['booked'].plot(kind='bar',figsize=(15,10), fontsize=12)
ax.set_xlabel('dim_contact_channel',fontsize=12)
ax.set_ylabel('channel',fontsize=12)
plt.show()
This is what I want (only 4 bars are shown, but this is the gist):
Pivot the dataframe so control and treatment values are in separate columns.
df.pivot(index='channel', columns='ab', values='booked').plot(kind='bar')
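As a quick check of what the pivot produces, here is a self-contained sketch that rebuilds the DataFrame from the question by hand. The name wide is only illustrative.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "ab": ["control"] * 3 + ["treatment"] * 3,
    "channel": ["book_it", "contact_me", "instant"] * 2,
    "booked": [466, 536, 17, 494, 56, 22],
})

# One row per channel, one column per ab value
wide = df.pivot(index="channel", columns="ab", values="booked")
print(wide)

# wide.plot(kind='bar') then draws one group per channel
# with a control bar and a treatment bar in each group
```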
