How can I put labels in two charts using matplotlib - python

I'm trying to plot two histograms from the result of a groupby, but the axis labels appear on only one of the charts.
How can I put the labels on both charts?
And how can I give each chart a different title (e.g. the first as Men's grades and the second as Women's grades)?
import pandas as pd
import matplotlib.pyplot as plt
microdataEnem = pd.read_csv('C:\\Users\\Lucas\\AppData\\Local\\Programs\\Python\\Python39\\Scripts\\Data Science\\Data Analysis\\Projects\\ENEM\\DADOS\\MICRODADOS_ENEM_2019.csv', sep = ';', encoding = 'ISO-8859-1', nrows=10000)
sex_essaygrade = ['TP_SEXO', 'NU_NOTA_REDACAO']
filter_sex_essaygrade = microdataEnem.filter(items = sex_essaygrade)
filter_sex_essaygrade.dropna(subset = ['NU_NOTA_REDACAO'], inplace = True)
filter_sex_essaygrade.groupby('TP_SEXO').hist()
plt.xlabel('Grade')
plt.ylabel('Number of students')
plt.show()

plt.xlabel and plt.ylabel act only on the current axes, which is why only one chart gets labeled. Instead of using filter_sex_essaygrade.groupby('TP_SEXO').hist(), you can try the following form: axs = filter_sex_essaygrade['NU_NOTA_REDACAO'].hist(by=filter_sex_essaygrade['TP_SEXO']). This automatically titles each histogram with its group name.
Assign the returned array of axes to a variable (axs here) so that you can set the x and y labels on both plots.
I created some data similar to yours, and I get the following result:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(42)
sex_essaygrade = ['TP_SEXO', 'NU_NOTA_REDACAO']
## create two distinct sets of grades
sample_grades = np.concatenate((np.random.randint(low=70,high=100,size=100), np.random.randint(low=80,high=100,size=100)))
filter_sex_essaygrade = pd.DataFrame({
'NU_NOTA_REDACAO': sample_grades,
'TP_SEXO': ['Men']*100 + ['Women']*100
})
axs = filter_sex_essaygrade['NU_NOTA_REDACAO'].hist(by=filter_sex_essaygrade['TP_SEXO'])
for ax in axs.flatten():
    ax.set_xlabel("Grade")
    ax.set_ylabel("Number of students")
plt.show()
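For the second part of the question (a different title per chart), one possible sketch is to overwrite the automatic group-name titles after plotting. The group codes 'M' and 'F', the titles mapping, and the random grades below are assumptions for illustration and are not taken from the ENEM file:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in data: 'M'/'F' codes and a 0-1000 grade range are assumptions.
df = pd.DataFrame({
    "NU_NOTA_REDACAO": rng.integers(0, 1001, size=200),
    "TP_SEXO": ["M"] * 100 + ["F"] * 100,
})

# Hypothetical mapping from group code to the title we want to display.
titles = {"M": "Men's grades", "F": "Women's grades"}

axs = df["NU_NOTA_REDACAO"].hist(by=df["TP_SEXO"])

for ax in np.ravel(axs):
    group = ax.get_title()                 # pandas sets the title to the group key
    ax.set_title(titles.get(group, group)) # fall back to the key if it is not mapped
    ax.set_xlabel("Grade")
    ax.set_ylabel("Number of students")
```

Call plt.show() afterwards to display the figure. Looking the title up by group key avoids relying on the order in which pandas draws the groups.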

Related

Annotate Min/Max/Median in Matplotlib Violin Plot

Given this example code:
import pandas as pd
import matplotlib.pyplot as plt
data = 'https://raw.githubusercontent.com/marsja/jupyter/master/flanks.csv'
df = pd.read_csv(data, index_col=0)
# Subsetting using Pandas query():
congruent = df.query('TrialType == "congruent"')['RT']
incongruent = df.query('TrialType == "incongruent"')['RT']
# Combine data
plot_data = list([incongruent, congruent])
fig, ax = plt.subplots()
xticklabels = ['Incongruent', 'Congruent']
ax.set_xticks([1, 2])
ax.set_xticklabels(xticklabels)
ax.violinplot(plot_data, showmedians=True)
Which results in the following plot:
How can I annotate the min, max, and median lines with their respective values?
I haven't been able to find examples online that show how to annotate violin plots in this way. If we set plot = ax.violinplot(plot_data, showmedians=True), then we can access attributes like plot['cmaxes'], but I can't quite figure out how to use that for annotations.
Here is an example of what I am trying to achieve:
This turned out to be as easy as computing the medians/mins/maxes, enumerating them, and adding each annotation with plt.text, using small offsets for positioning (shown here with my own results_df data):
medians = results_df.groupby(['model_cat'])['test_f1'].median()
for i, v in enumerate(medians):
    plt.text((i+.85), (v+.001), str(round(v, 3)), fontsize = 12)
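The same idea can be sketched against the plot_data layout from the question. This is a minimal example under stated assumptions: the reaction times are synthetic stand-ins for the flanks.csv columns, and the 0.05 horizontal offset is an arbitrary choice. The statistics are computed with NumPy rather than read back from the dictionary that violinplot returns:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the incongruent/congruent reaction times (assumption).
plot_data = [rng.normal(0.60, 0.10, 200), rng.normal(0.50, 0.08, 200)]

fig, ax = plt.subplots()
ax.violinplot(plot_data, showmedians=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["Incongruent", "Congruent"])

# violinplot places the violins at x = 1, 2, ... by default,
# so enumerate from 1 to line the text up with each violin.
for pos, values in enumerate(plot_data, start=1):
    for stat in (np.min(values), np.median(values), np.max(values)):
        # Nudge the text slightly to the right of the violin's centre line.
        ax.text(pos + 0.05, stat, f"{stat:.3f}", va="bottom", fontsize=9)
```

If custom positions are passed to violinplot, the same positions would need to be used for the text.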

How to display different values of the same column in plot graph/chart

The column class has 2 options for the value, either 'b' or 's'. I am trying to display a graph/chart that shows how many are 'b' and how many are 's'. I can't figure out how to do this when they are both in the same column.
The current code shows a scatter plot, but I'd like to use the data from column 'class'.
import pandas as pd
import matplotlib.pylab as plt
import numpy as np
#df = df.groupby('class')['class'].count()
#print(df)
df = pd.DataFrame(np.random.randint(0,10,size=(5, 2)), columns=['x','y'])
df['class'] = ['Benign','Malware','Benign','Malware','Malware']
# plot groupby results on the same canvas
fig, ax = plt.subplots(figsize=(8,6))
for n, grp in df.groupby('class'):
    ax.scatter(x = "x", y = "y", data=grp, label=n)
ax.legend(title="Label")
plt.show()
Don't group by class and loop; instead, aggregate (with value_counts) and then plot:
df['class'].value_counts().plot.bar()
or with matplotlib's functions:
s = df['class'].value_counts()
ax.bar(s.index, s)
output:
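Put together as a self-contained sketch, the matplotlib variant could look like the following. It reuses the five 'class' values from the question's sample frame; the count labels above the bars and the axis labels are additions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Same 'class' values as the sample frame in the question.
df = pd.DataFrame({"class": ["Benign", "Malware", "Benign", "Malware", "Malware"]})

# One row per distinct value, holding how often it occurs.
counts = df["class"].value_counts()

fig, ax = plt.subplots(figsize=(8, 6))
bars = ax.bar(counts.index, counts.values)

# Write each count just above its bar (an optional extra, not part of the answer).
for bar, n in zip(bars, counts.values):
    ax.text(bar.get_x() + bar.get_width() / 2, n, str(n), ha="center", va="bottom")

ax.set_xlabel("class")
ax.set_ylabel("count")
```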

How to plot Multiline Graphs Via Seaborn library in Python?

I have written code that looks like this:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
T = np.array([10.03,100.348,1023.385])
power1 = np.array([100000,86000,73000])
power2 = np.array([1008000,95000,1009000])
df1 = pd.DataFrame(data = {'Size': T, 'Encrypt_Time': power1, 'Decrypt_Time': power2})
exp1= sns.lineplot(data=df1)
plt.savefig('exp1.png')
exp1_smooth= sns.lmplot(x='Size', y='Time', data=df, ci=None, order=4, truncate=False)
plt.savefig('exp1_smooth.png')
That gives me Graph_1:
Size, which I intended as the x-axis, is drawn as a constant line, even though in my code it varies (roughly 10, 100, 1000).
How does this produce a constant line? I want a multi-line graph with x-axis = Size (T) and y-axis = Encrypt_Time and Decrypt_Time (power1 and power2).
I also want to plot a smoothed version of the same graph, but that gives me an error. What needs to be done to get a smooth multi-line graph with x-axis = Size (T) and y-axis = Encrypt_Time and Decrypt_Time (power1 and power2)?
I don't think that is the problem: the Size line looks constant, but it is not.
The Size values range from about 10 to 1000, while the smallest division on the y-axis is 20,000 (20 times larger), which makes Size look like a horizontal line on your graph.
You can try larger values to see the slope clearly.
If you want 'Size' as the x-axis, you can try the example below:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
T = np.array([10.03,100.348,1023.385])
power1 = np.array([100000,86000,73000])
power2 = np.array([1008000,95000,1009000])
df1 = pd.DataFrame(data = {'Size': T, 'Encrypt_Time': power1, 'Decrypt_Time': power2})
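fig, ax = plt.subplots()
# sns.lineplot returns an Axes, so draw both lines on the same ax and label them
sns.lineplot(data=df1, x='Size', y='Encrypt_Time', label='Encrypt_Time', ax=ax)
sns.lineplot(data=df1, x='Size', y='Decrypt_Time', label='Decrypt_Time', ax=ax)
ax.set_ylabel('Time')
plt.show()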

Change line shape to aggregated plot

I'm attempting to make a plot of net sentiment score by genre over time. I currently have a line plot that works, but I'd like to make each genre a different color/shape. However, I have about 12 different genres and I'd prefer to not manually assign a color/shape to each genre. Is there a way for matplotlib to dynamically assign a unique combination to each genre? Here's my current code.
# plot net sentiment per year by genre
fig, ax = plt.subplots(figsize=(15,7))
genre_known.groupby(['year', 'genre']).mean()['net_sentiment'].unstack().plot(ax=ax)
plt.xlabel('Year')
plt.ylabel('Sentiment')
plt.title('Sentiment by Year')
You can use a styling dictionary like this:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df_tips = sns.load_dataset('tips')
fig, ax = plt.subplots(figsize=(15,7))
style_dic = {k:v for k,v in zip(df_tips['day'].unique(),['ro-','b^-','gs-','m+-'])}
df_tips.groupby(['sex','day'])['total_bill'].mean().unstack().plot(ax=ax, style=style_dic);
Output:
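If listing the styles by hand is impractical (the question mentions about 12 genres), one option is to generate the style strings by cycling through colors and markers. This is a sketch under assumptions: the genre names and random values are hypothetical stand-ins for the unstacked genre_known frame, and the particular color and marker sets are arbitrary:

```python
from itertools import cycle

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical stand-in for the year x genre table produced by unstack().
genres = [f"genre_{i}" for i in range(12)]
data = pd.DataFrame(rng.normal(size=(10, 12)),
                    index=range(2000, 2010), columns=genres)

colors = "bgrcmyk"    # 7 single-letter matplotlib colors
markers = "o^sv+xD*"  # 8 matplotlib markers

# Cycling both sequences in step yields 56 distinct color/marker pairs
# (7 and 8 share no common factor), which is plenty for 12 genres.
styles = {g: f"{c}{m}-" for g, c, m in zip(genres, cycle(colors), cycle(markers))}

fig, ax = plt.subplots(figsize=(15, 7))
data.plot(ax=ax, style=styles)
ax.set_xlabel("Year")
ax.set_ylabel("Sentiment")
```

With real data, the keys of the dictionary would be the columns of the unstacked frame instead of the hypothetical genres list.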

Plotting pandas dataframe with two groups

I'm using pandas and matplotlib to try to replicate this graph from Tableau:
So far, I have this code:
group = df.groupby(["Region","Rep"]).sum()
total_price = group["Total Price"].groupby(level=0, group_keys=False)
total_price.nlargest(5).plot(kind="bar")
Which produces this graph:
It correctly groups the data, but is it possible to get it grouped similar to how Tableau shows it?
You can create the separator lines and group labels using the respective matplotlib methods (ax.axvline and ax.text).
import pandas as pd
import numpy as np; np.random.seed(5)
import matplotlib.pyplot as plt
a = ["West"]*25+ ["Central"]*10+ ["East"]*10
b = ["Mattz","McDon","Jeffs","Warf","Utter"]*5 + ["Susanne","Lokomop"]*5 + ["Richie","Florence"]*5
c = np.random.randint(5,55, size=len(a))
df=pd.DataFrame({"Region":a, "Rep":b, "Total Price":c})
group = df.groupby(["Region","Rep"]).sum()
total_price = group["Total Price"].groupby(level=0, group_keys=False)
gtp = total_price.nlargest(5)
ax = gtp.plot(kind="bar")
#draw lines and titles
count = gtp.groupby("Region").count()
cum = np.cumsum(count)
for i in range(len(count)):
    title = count.index.values[i]
    # use .iloc for positional access, since count and cum are indexed by region name
    ax.axvline(cum.iloc[i]-.5, lw=0.8, color="k")
    ax.text(cum.iloc[i]-(count.iloc[i]+1)/2., 1.02, title, ha="center",
            transform=ax.get_xaxis_transform())
# shorten xticklabels
ax.set_xticklabels([l.get_text().split(", ")[1][:-1] for l in ax.get_xticklabels()])
plt.show()
