histogram: setting y-axis label for pandas


I have a dataframe:
d = {'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'D', 'D', 'D', 'D', 'D'],
'value': [0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 1.0],
'count': [4, 3, 7, 3, 12, 14, 5, 10, 3, 8, 7, 15, 4]}
df = pd.DataFrame(data=d)
df
I want to plot multiple histograms in one figure: one for group A, one for group B and one for group D. The group column provides the labels, the y-axis is count, and the x-axis is value.
I tried this, but the values on the y-axis are incorrect and it builds several figures.
axarr = df.hist(column='value', by = 'group', bins = 20)
for ax in axarr.flatten():
    ax.set_xlabel("value")
    ax.set_ylabel("count")

Assuming that you are looking for a grouped bar plot (pending the clarification in the comments):
Plot:
Code:
import pandas as pd
d = {'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'D', 'D', 'D', 'D', 'D'],
'value': [0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 1.0],
'count': [4, 3, 7, 3, 12, 14, 5, 10, 3, 8, 7, 15, 4]}
df = pd.DataFrame(data=d)
df_pivot = pd.pivot_table(
    df,
    values="count",
    index="value",
    columns="group",
)
ax = df_pivot.plot(kind="bar")
fig = ax.get_figure()
fig.set_size_inches(7, 6)
ax.set_xlabel("value")
ax.set_ylabel("count")
Brief plot option:
pd.pivot(df, index="value", columns="group").plot(kind="bar", y='count')
Brief explanation:
Pivot table for preparation:
x-axis 'value' as index
different bars 'group' as columns
group     A     B     D
value
0.2     4.0  12.0   3.0
0.4     3.0  14.0   8.0
0.6     7.0   5.0   7.0
0.8     3.0  10.0  15.0
1.0     NaN   NaN   4.0
Pandas .plot() can draw that grouped bar plot directly once df_pivot is prepared.
Its default backend is matplotlib, so the usual commands apply (like fig.savefig).
Add-on appearance:
You can make it look more like a histogram along the x-axis by tightening the spacing between the groups:
Just add width=0.95 (or width=1.0) to the .plot() call.
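A minimal sketch of the width option, assuming the same data as above. The Agg backend is only an assumption made here so that the snippet runs without a display; it is not part of the original answer. Since every value/group pair occurs once, a plain pivot is used instead of pivot_table.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import pandas as pd

d = {'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'D', 'D', 'D', 'D', 'D'],
     'value': [0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8, 1.0],
     'count': [4, 3, 7, 3, 12, 14, 5, 10, 3, 8, 7, 15, 4]}
df = pd.DataFrame(data=d)

# One row per 'value', one column per 'group'; missing combinations become NaN.
df_pivot = df.pivot(index="value", columns="group", values="count")

# width=0.95 lets the bars belonging to one 'value' nearly fill their slot,
# which reads more like a histogram than the default spacing.
ax = df_pivot.plot(kind="bar", width=0.95)
ax.set_xlabel("value")
ax.set_ylabel("count")
```

Each group column ends up as its own bar series, so the legend shows A, B and D.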

Related

Python Plotnine (ggplot) add mean line per color to plot?

Using plotnine in python, I'd like to add dashed horizontal lines to my plot (a scatterplot, but preferably an answer compatible with other plot types) representing the mean for every color separately. I'd like to do so without manually computing the mean values myself or adapting other parts of the data (e.g. adding columns for color values etc).
Additionally, the original plot is generated via a function (make_plot below) and the mean lines are to be added afterwards, yet need to have the same color as the points from which they are derived.
Consider the following as a minimal example:
import pandas as pd
import numpy as np
from plotnine import *
df = pd.DataFrame( { 'MSE': [0.1, 0.7, 0.5, 0.2, 0.3, 0.4, 0.8, 0.9 ,1.0, 0.4, 0.7, 0.9 ],
'Size': ['S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL'],
'Number': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] } )
def make_plot(df, var_x, var_y, var_fill):
    plot = ggplot(df) + aes(x='Number', y='MSE', fill='Size') + geom_point()
    return plot
plot = make_plot(df, 'Number', 'MSE', 'Size')
I'd like to add 4 lines, one for each Size. The exact same can be done in R using ggplot, as shown by this question. Adding geom_line(stat="hline", yintercept="mean", linetype="dashed") to plot however results in an error PlotnineError: "'stat_hline' Not in Registry. Make sure the module in which it is defined has been imported." that I am unable to resolve.
Answers that can resolve the aforementioned issue, or propose another working solution entirely, are greatly appreciated.

You can do it by first computing the means as a list and then using that list inside your function:
import pandas as pd
import numpy as np
from plotnine import *
from random import randint
df = pd.DataFrame( { 'MSE': [0.1, 0.7, 0.5, 0.2, 0.3, 0.4, 0.8, 0.9 ,1.0, 0.4, 0.7, 0.9 ],
'Size': ['S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL'],
'Number': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] } )
a = df.groupby(['Size'])['MSE'].mean() ### Defining your means
a = list(a)
def make_plot(df, var_x, var_y, var_fill):
    plot = ggplot(df) + aes(x='Number', y='MSE', fill='Size') + geom_point() + geom_hline(yintercept=a, linetype="dashed")
    return plot
plot = make_plot(df, 'Number', 'MSE', 'Size')
which gives:
Note that two of the lines coincide:
a = [0.6666666666666666, 0.5, 0.4666666666666666, 0.6666666666666666]
To add different colors to each dashed line, you can do this:
import pandas as pd
import numpy as np
from plotnine import *
from random import randint
df = pd.DataFrame( { 'MSE': [0.1, 0.7, 0.5, 0.2, 0.3, 0.4, 0.8, 0.9 ,1.0, 0.4, 0.7, 0.9 ],
'Size': ['S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL'],
'Number': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] } )
### Generate a list of random colors of the same length as your categories (Sizes)
color = []
n = len(list(set(df.Size)))
for i in range(n):
    color.append('#%06X' % randint(0, 0xFFFFFF))
######################################################
def make_plot(df, var_x, var_y, var_fill):
    # color=color uses the list built above (the original snippet referred to an undefined name b)
    plot = ggplot(df) + aes(x='Number', y='MSE', fill='Size') + geom_point() + geom_hline(yintercept=list(df.groupby(['Size'])['MSE'].mean()), linetype="dashed", color=color)
    return plot
plot = make_plot(df, 'Number', 'MSE', 'Size')
which returns:
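Random colors will not match the colors of the points. A possible alternative, sketched here under assumptions, is to put the per-Size means into their own small DataFrame and map Size to the color aesthetic of geom_hline. The helper names size_means and add_mean_lines are hypothetical. The points use fill while the lines use color, so matching colors relies on both scales falling back to the same default palette, which I have not checked for every plotnine version. The means are still computed with pandas, so this does not meet the "without computing the means myself" wish from the question.

```python
import pandas as pd

df = pd.DataFrame({'MSE': [0.1, 0.7, 0.5, 0.2, 0.3, 0.4, 0.8, 0.9, 1.0, 0.4, 0.7, 0.9],
                   'Size': ['S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL'],
                   'Number': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]})

def size_means(df):
    """Mean MSE per Size, as a DataFrame with columns 'Size' and 'MSE'."""
    return df.groupby('Size', as_index=False)['MSE'].mean()

def add_mean_lines(plot, df):
    """Return `plot` plus one dashed horizontal line per Size, colored by Size."""
    # Imported inside the function so size_means() is usable without plotnine.
    from plotnine import aes, geom_hline
    return plot + geom_hline(
        aes(yintercept='MSE', color='Size'),
        data=size_means(df),
        linetype='dashed',
    )

means = size_means(df)
```

With the make_plot function from the question, usage would be plot = add_mean_lines(make_plot(df, 'Number', 'MSE', 'Size'), df), so the lines are added after the original plot has been created.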

How to make a cell higher in matplotlib using the plt.table function?

The following code uses colWidths to adjust the width of the cells:
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['font.sans-serif'] = ['FangSong']
mpl.rcParams['axes.unicode_minus'] = False
labels = ['A难度水平', 'B难度水平', 'C难度水平', 'D难度水平']
students = [0.35, 0.15, 0.2, 0.3]
explode = [0.1, 0.1, 0.1, 0.1]
colors = ['r', 'y', 'b', 'gray']
plt.pie(students, autopct='%3.1f%%',
labels=labels, textprops={'fontsize': 12,
'family': 'FangSong',
'fontweight': 'bold'},
explode=explode, colors=colors)
studentValues = [['A', 'B', 'C', 'D'], [350, 150, 200, 300], ['test', 'test', 'test', 'test']]
cellcolors = [['r', 'y', 'b', 'gray'], ['b', 'gray', 'y', 'r'], ['gray', 'y', 'b', 'r']]
rowLabels = ['aaaaa','bbbbb','ccccc']
plt.table(cellText=studentValues,
cellColours=cellcolors,
cellLoc='center', colWidths=[0.1] * 4,
rowLabels=rowLabels)
plt.show()
How could I adjust the height of the cell inside the plt.table function?

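Keep a reference to the table that plt.table returns and adjust the cell size with its scale method.
ytable = plt.table(cellText=studentValues,
cellColours=cellcolors,
cellLoc='center', colWidths=[0.1] * 4,
rowLabels=rowLabels)
# scale(xscale, yscale): a yscale above 1 makes the cells taller (1.0 would leave them unchanged)
ytable.scale(1, 1.5)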
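To see what scale does to the cells, here is a small sketch on a reduced table. The two-by-two cell texts and the Agg backend are assumptions chosen only to keep the example short and runnable without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.axis("off")

table = ax.table(
    cellText=[["A", "B"], ["350", "150"]],
    cellLoc="center",
    colWidths=[0.2] * 2,
    loc="center",
)

# Cell sizes are stored per cell; look at the top-left one.
height_before = table.get_celld()[(0, 0)].get_height()
width_before = table.get_celld()[(0, 0)].get_width()

table.scale(1, 2)  # keep the width, double the height of every cell

height_after = table.get_celld()[(0, 0)].get_height()
width_after = table.get_celld()[(0, 0)].get_width()
```

The first argument multiplies the cell widths and the second the cell heights, so scale(1, 2) only affects the height.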

How to optimize grouping of a DataFrame and performing operations on the groups

Here's an example of my dataframe:
d = {'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'd', 'd'], \
'round': [3, 3, 2, 1, 3, 1, 3, 3, 3, 2, 1], \
'score': [0.3, 0.1, 0.6, 0.8, 0.2, 0.5, 0.5, 0.6, 0.4, 0.9, 0.1]}
df = pd.DataFrame(d)
df
group round score
0 a 3 0.3
1 a 3 0.1
2 a 2 0.6
3 b 1 0.8
4 b 3 0.2
5 b 1 0.5
6 b 3 0.5
7 b 3 0.6
8 c 3 0.4
9 d 2 0.9
10 d 1 0.1
My actual dataframe has 6 columns and > 1,000,000 rows. I'm trying to figure out the fastest way to do the following:
For each group find the average of scores and perform some calculation with it for each of 3 rounds. If there are no scores, write 'NA'.
I'm not sure whether it would be faster to build a list of lists and convert it into a dataframe, or to create a new dataframe and populate that, so I went with the list first:
def test_df(data):
    value_counts = data['group'].value_counts().to_dict()
    avgs = []
    for key, val in value_counts.items():
        row = data[data['group'] == key]
        x = [key]
        if val < 2:
            x.extend([10 * row['score'].values[0] + 1 if i == row['round'].values[0] else 'NA' for i in range(1, 4)])
        else:
            x.extend([(10 * row[row['round'] == i]['score'].mean() + 1) if len(row[row['round'] == i]) > 0 else 'NA' for i in range(1, 4)])
        avgs.append(x)
    return avgs
Here I created a separate case because about 80% of the groups in my data have only one row, so I figured it might speed things up.
This returns the correct results in the format [group, round 1, round 2, round 3]:
[['b', 7.5, 'NA', 5.333333333333333],
['a', 'NA', 7.0, 3.0],
['d', 2.0, 10.0, 'NA'],
['c', 'NA', 'NA', 5.0]]
but it's looking like it's going to take a really really long time on the actual dataframe...
Does anyone have any better ideas?

It looks to me like you're basically doing a groupby/mean and a pivot.
import pandas as pd
d = {'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'd', 'd'], \
'round': [3, 3, 2, 1, 3, 1, 3, 3, 3, 2, 1], \
'score': [0.3, 0.1, 0.6, 0.8, 0.2, 0.5, 0.5, 0.6, 0.4, 0.9, 0.1]}
df = pd.DataFrame(d)
df = (df.groupby(['group','round'])['score'].mean()*10+1).reset_index()
df.pivot_table(index='group',columns='round',values='score', fill_value='NA').reset_index().values
Output
array([['a', 'NA', 7.0, 3.0],
['b', 7.5, 'NA', 5.333333333333333],
['c', 'NA', 'NA', 5.0],
['d', 2.0, 10.0, 'NA']], dtype=object)
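A variation on the same idea, as a sketch: unstack the grouped result instead of calling pivot_table, and reindex the columns so that all three rounds are present even if one never occurs in the data. Keeping NaN instead of the string 'NA' is a choice made here so that the columns stay numeric; it differs from the output format asked for in the question.

```python
import pandas as pd

d = {'group': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c', 'd', 'd'],
     'round': [3, 3, 2, 1, 3, 1, 3, 3, 3, 2, 1],
     'score': [0.3, 0.1, 0.6, 0.8, 0.2, 0.5, 0.5, 0.6, 0.4, 0.9, 0.1]}
df = pd.DataFrame(d)

result = (
    (df.groupby(['group', 'round'])['score'].mean() * 10 + 1)  # per (group, round) mean, transformed
    .unstack('round')             # rounds become columns, missing pairs become NaN
    .reindex(columns=[1, 2, 3])   # guarantee one column per round
)
```

If the 'NA' strings are really needed, result.astype(object).fillna('NA') can be applied at the very end, after all numeric work is done.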

An imbalanced dataset may show different results, but I tested with the script below and found that pandas performs acceptably even with a DataFrame of this size. You can always compare it with a native Python data structure.
import random
import datetime
import pandas as pd
def generate_data(): # augmentation
    data = {'group': [], 'round': [], 'score': []}
    for index in range(10 ** 6): # sample size
        data['group'].append(random.choice(['a', 'b', 'c', 'd']))
        data['round'].append(random.randrange(1, 4))
        data['score'].append(round(random.random(), 1))
    return data
def calc_with_native_ds(data): # native python data structure
    pass
def calc_with_pandas_df(df): # pandas dataframe
    return df.groupby(['group', 'round']).mean()
if __name__ == '__main__':
    data = generate_data()
    df = pd.DataFrame(data)
    print(df.shape)
    start_datetime = datetime.datetime.now()
    # calc_with_native_ds(data)
    calc_with_pandas_df(df)
    end_datetime = datetime.datetime.now()
    elapsed_time = round((end_datetime - start_datetime).total_seconds(), 5)
    print(f"elapsed_time: {elapsed_time}")
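The calc_with_native_ds function in the script is left as a placeholder. One way it could be filled in for a comparison, purely as a hypothetical sketch and not the answerer's implementation, is a single pass that accumulates sums and counts in plain dictionaries:

```python
from collections import defaultdict

def calc_with_native_ds(data):
    """Mean score per (group, round) using only built-in data structures."""
    totals = defaultdict(float)  # running sum of scores per (group, round)
    counts = defaultdict(int)    # number of scores per (group, round)
    for group, rnd, score in zip(data['group'], data['round'], data['score']):
        totals[(group, rnd)] += score
        counts[(group, rnd)] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Tiny hand-made sample (an assumption for illustration, not the generated data).
sample = {'group': ['a', 'a', 'b'], 'round': [1, 1, 2], 'score': [0.2, 0.4, 0.5]}
means = calc_with_native_ds(sample)
```

It takes the same dict of lists that generate_data returns, so it can be timed in place of the commented-out call.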

bar plot with vertical lines for each bar

%matplotlib inline
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'index' : ['A', 'B', 'C', 'D'], 'first': [1.2, 1.23, 1.32, 1.08], 'second': [2, 2.2, 3, 1.08], 'max': [1.5, 3, 0.9, 'NaN']}).set_index('index')
I want to plot a horizontal bar chart with first and second as bars.
I want to use the max column to display a vertical line at the corresponding value, next to the bars of the other columns.
So far I have only managed the bar plot.
Any hints on how to achieve this?
Thanks.

I have replaced the NaN with a finite value; you can then use the following code:
df = pd.DataFrame({'index' : ['A', 'B', 'C', 'D'], 'first': [1.2, 1.23, 1.32, 1.08],
'second': [2, 2.2, 3, 1.08], 'max': [1.5, 3, 0.9, 2.5]}).set_index('index')
plt.barh(range(4), df['first'], height=-0.25, align='edge')
plt.barh(range(4), df['second'], height=0.25, align='edge', color='red')
plt.yticks(range(4), df.index);
for i, val in enumerate(df['max']):
    plt.vlines(val, i-0.25, i+0.25, color='limegreen')
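If the missing value should stay missing instead of being replaced, a possible sketch is to convert the max column to numbers and skip the rows where it is NaN. The Agg backend and the lines_drawn counter are assumptions added here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'index': ['A', 'B', 'C', 'D'],
                   'first': [1.2, 1.23, 1.32, 1.08],
                   'second': [2, 2.2, 3, 1.08],
                   'max': [1.5, 3, 0.9, 'NaN']}).set_index('index')

# The question stores the string 'NaN'; turn the column into real floats.
df['max'] = pd.to_numeric(df['max'], errors='coerce')

fig, ax = plt.subplots()
positions = list(range(len(df)))
ax.barh(positions, df['first'], height=-0.25, align='edge')
ax.barh(positions, df['second'], height=0.25, align='edge', color='red')
ax.set_yticks(positions)
ax.set_yticklabels(df.index)

lines_drawn = 0
for i, val in enumerate(df['max']):
    if pd.notna(val):  # no line for rows without a max value
        ax.vlines(val, i - 0.25, i + 0.25, color='limegreen')
        lines_drawn += 1
```

Row D then simply gets no vertical line, while A, B and C are drawn as in the code above.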

New column based on a filter and an index of multiples columns?

I've been trying to search/think about an answer, probably with a melt or stack, but still can't seem to do it.
Here's my DF :
d = {'type' : [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
'company' : ['A', 'B', 'C', 'D', 'E','A', 'B', 'C', 'D', 'E'],
'value type': ['value car','value car','value car','value car','value car', 'value train','value train','value train','value train','value train',],
'value': [0.1, 0.2, 0.3, 0.4, 0.5, 0.15, 0.25, 0.35, 0.45, 0.55] }
df = pd.DataFrame(d)
Here is what I want (I have the array on the left, I want the one on the right):
As you can see, I want a new column "train value" based on the combination (type,company)
Something like
for each row :
if (df['value type'] == 'value train'):
#and (type,company) is the same
df['train value'] = df['value']
remove row
For example, the company A from type 1 will have a new value in a new column for its train value.
Is there a way to do this properly ?
EDIT::: There was a good answer but I didn't explain myself clearly. I want only a new column with only "one value type". For example my new DF :
d = {'type' : [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
'company' : ['A', 'B', 'C', 'D', 'E','A', 'B', 'C', 'D', 'E'],
'month' : ['jan', 'feb', 'marc', 'apr', 'may', 'jan', 'feb', 'marc', 'apr', 'sep'],
'business' : ['business1', 'business2', 'business3', 'business4', 'business5', 'business6', 'business7', 'business8', 'business9', 'business10'],
'value time': ['past', 'past', 'past', 'past', 'present', 'present', 'present', 'present', 'future', 'future'],
'value': [0.1, 0.2, 0.3, 0.4, 0.11, 0.21, 0.31, 0.41, 0.45, 0.55] }
df = pd.DataFrame(d)
Here's what I want this time:
If possible, only the values with the "present" will be in the new column. Something like
if df['value time'] == 'present' then add to new column

You should pivot your dataframe:
company_to_type = df.set_index('company')['type'].to_dict()
df = df.pivot(index='company', columns='value type', values='value').reset_index()
df['type'] = df.company.map(company_to_type)
df = df.rename_axis(None, axis=1)
df = df[['type', 'company', 'value train', 'value car']]
and you'll get
   type company  value train  value car
0     1       A         0.15        0.1
1     2       B         0.25        0.2
2     3       C         0.35        0.3
3     4       D         0.45        0.4
4     5       E         0.55        0.5
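Since the question asks for the combination (type, company), another option is to pivot on both columns at once, which avoids the separate company-to-type mapping. This is a sketch for the first DataFrame in the question and assumes a pandas version where pivot accepts a list for index (1.1 or later, as far as I know).

```python
import pandas as pd

d = {'type': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
     'company': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E'],
     'value type': ['value car', 'value car', 'value car', 'value car', 'value car',
                    'value train', 'value train', 'value train', 'value train', 'value train'],
     'value': [0.1, 0.2, 0.3, 0.4, 0.5, 0.15, 0.25, 0.35, 0.45, 0.55]}
df = pd.DataFrame(d)

wide = (
    df.pivot(index=['type', 'company'], columns='value type', values='value')
    .reset_index()               # turn type and company back into ordinary columns
    .rename_axis(None, axis=1)   # drop the leftover 'value type' columns name
)
```

Each (type, company) pair must occur at most once per value type, otherwise pivot raises an error about duplicate entries.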
