Plotting multiple scatter plots of multiple years in Python - python

I have a dataframe that looks like:
Date Faculty Target Avg
2012-01-01 Arts 80 60
2012-01-01 Science 70 60
2012-02-01 Arts 91 89
2012-02-01 Gym 80 89
.
.
2012-07-01 Arts 83 67
2012-07-01 Science 72 67
2012-08-01 Arts 81 83
2012-08-01 Science 70 83
I want to plot all Faculty on a single scatter plot with each of their respective Target values (Y-Axis) and Avg values (X-Axis).
I'm trying to use (pseudo code) a scatterplot like:
ax1 = data.plot(kind='scatter', x='Avg', y='Target(Arts)', color='r', label='Arts')
ax2 = data.plot(kind='scatter', x='Avg', y='Target(Science)', color='g', ax=ax1, label='Science')
ax3 = data.plot(kind='scatter', x='Avg', y='Target(Gym)', color='b', ax=ax1, label='Gym')
I'd like all Faculties (there are 28 of them total) on the same plot for every Target value (marked by different colors) but there are too many to manually enter with loc (or at least I'd like to avoid this). I can't use iloc to count by index because each number of Faculty counts is different on each date.
Is there a simple way to do this?

You can groupby the Faculty, and iterate over the groups, plotting each one:
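import matplotlib.pyplot as plt

g = df.groupby('Faculty')
# one scatter call per faculty; each call gets the next color in the cycle
for faculty, data in g:
    plt.scatter(data['Avg'], data['Target'], label=faculty)
plt.xlabel('Avg')
plt.ylabel('Target')
plt.legend()
plt.show()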
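With 28 faculties, matplotlib's default color cycle (10 colors) repeats, so several groups end up with the same color. A minimal sketch of one way around that, assuming made-up sample data (the faculty names and values below are hypothetical) and an arbitrarily chosen colormap, nipy_spectral:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample data: 28 made-up faculties with 8 rows each
rng = np.random.default_rng(0)
faculties = [f"Faculty {i:02d}" for i in range(1, 29)]
df = pd.DataFrame({
    "Faculty": np.repeat(faculties, 8),
    "Target": rng.integers(60, 100, size=28 * 8),
    "Avg": rng.integers(50, 95, size=28 * 8),
})

groups = df.groupby("Faculty")
n_groups = groups.ngroups
# Assumed choice of colormap; any colormap with enough distinct hues would do
cmap = plt.get_cmap("nipy_spectral")

fig, ax = plt.subplots(figsize=(9, 6))
for i, (faculty, data) in enumerate(groups):
    # sample one color per group, evenly spaced along the colormap
    ax.scatter(data["Avg"], data["Target"], color=cmap(i / n_groups), label=faculty, s=20)
ax.set_xlabel("Avg")
ax.set_ylabel("Target")
# place the long legend outside the axes, in two columns
ax.legend(ncol=2, fontsize="small", bbox_to_anchor=(1.02, 1), loc="upper left")
fig.tight_layout()
```

Call plt.show() afterwards to display the figure. Even with distinct colors, 28 groups are hard to tell apart by eye, so varying the marker shape as well may help.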

Related

plot bar chart for multiple categorical values with groupby [duplicate]

Let's assume I have a dataframe and I'm looking at 2 columns of it (2 series).
Using one of the columns - "no_employees" below - Can someone kindly help me figure out how to create 6 different pie charts or bar charts (1 for each grouping of no_employees) that illustrate the value counts for the Yes/No values in the treatment column? I'll use matplotlib or seaborn, whatever you feel is easiest.
I'm using the line of code below to generate the output that follows:
dataframe_title.groupby(['no_employees']).treatment.value_counts()
But now I'm stuck. Do I use seaborn? .plot? This seems like it should be easy, and I know there are some cases where I can make subplots=True, but I'm really confused. Thank you so much.
no_employees    treatment
1-5             Yes           88
                No            71
100-500         Yes           95
                No            80
26-100          Yes          149
                No           139
500-1000        No            33
                Yes           27
6-25            No           162
                Yes          127
More than 1000  Yes          146
                No           135
The importance of data encoding:
The purpose of data visualization is to more easily convey information (e.g. in this case, the relative number of 'treatments' per category).
A bar chart easily displays the important information: how many in each group said 'Yes' or 'No', and the relative sizes of each group.
A pie plot is more commonly used to display a sample, where the groups within the sample sum to 100%.
Wikipedia: Pie Chart
Research has shown that comparison by angle is less accurate than comparison by length, in that people are less able to discern differences.
Statisticians generally regard pie charts as a poor method of displaying information, and they are uncommon in scientific literature.
This data is not well represented by a pie plot, because each company size is a separate population, which will require 6 pie plots to be correctly represented.
The data can be placed into a pie plot, as others have shown, but that doesn't mean it should be.
Regardless of the type of plot, the data must be in the correct shape for the plot API.
Tested with pandas 1.3.0, seaborn 0.11.1, and matplotlib 3.4.2
Set up a test DataFrame
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np # for sample data only
np.random.seed(365)
cats = ['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']
data = {'no_employees': np.random.choice(cats, size=(1000,)),
        'treatment': np.random.choice(['Yes', 'No'], size=(1000,))}
df = pd.DataFrame(data)
# set a categorical order for the x-axis to be ordered
df.no_employees = pd.Categorical(df.no_employees, categories=cats, ordered=True)
no_employees treatment
0 26-100 No
1 1-5 Yes
2 >1000 No
3 100-500 Yes
4 500-1000 Yes
Plotting with pandas.DataFrame.plot():
This requires grouping the dataframe to get .value_counts, and unstacking with pandas.DataFrame.unstack.
# to get the dataframe in the correct shape, unstack the groupby result
dfu = df.groupby(['no_employees']).treatment.value_counts().unstack()
treatment No Yes
no_employees
1-5 78 72
6-25 83 86
26-100 83 76
100-500 91 84
500-1000 78 83
>1000 95 91
# plot
ax = dfu.plot(kind='bar', figsize=(7, 5), xlabel='Number of Employees in Company', ylabel='Count', rot=0)
ax.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
Plotting with seaborn
seaborn is a high-level API for matplotlib.
seaborn.barplot()
Requires a DataFrame in a tidy (long) format, which is done by grouping the dataframe to get .value_counts, and resetting the index with pandas.Series.reset_index
May also be done with the figure-level interface using sns.catplot() with kind='bar'
# groupby, get value_counts, and reset the index
dft = df.groupby(['no_employees']).treatment.value_counts().reset_index(name='Count')
no_employees treatment Count
0 1-5 No 78
1 1-5 Yes 72
2 6-25 Yes 86
3 6-25 No 83
4 26-100 No 83
5 26-100 Yes 76
6 100-500 No 91
7 100-500 Yes 84
8 500-1000 Yes 83
9 500-1000 No 78
10 >1000 No 95
11 >1000 Yes 91
# plot
p = sns.barplot(x='no_employees', y='Count', data=dft, hue='treatment')
p.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
p.set(xlabel='Number of Employees in Company')
seaborn.countplot()
Uses the original dataframe, df, without any transformations.
May also be done with the figure-level interface using sns.catplot() with kind='count'
p = sns.countplot(data=df, x='no_employees', hue='treatment')
p.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
p.set(xlabel='Number of Employees in Company')
Output of barplot and countplot
Let's reshape the dataframe and plot with subplots=True:
df_chart = df1.unstack()['Pct']
axs = df_chart.plot.pie(subplots=True, figsize=(4,9), layout=(2,1), legend=False, title=df_chart.columns.tolist())
ax_flat = axs.flatten()
for ax in ax_flat:
    ax.yaxis.label.set_visible(False)
Output:
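For the six pie charts the question asks for, the counts need to be arranged with one column per company-size group, because DataFrame.plot.pie with subplots=True draws one pie per column. A rough sketch, assuming the counts are typed in directly from the question's groupby output (the variable name counts and the layout and figure size are illustrative choices):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Counts copied from the question's output.
# Rows are the treatment answers; each column is one company-size group.
counts = pd.DataFrame(
    {
        "1-5": {"Yes": 88, "No": 71},
        "6-25": {"Yes": 127, "No": 162},
        "26-100": {"Yes": 149, "No": 139},
        "100-500": {"Yes": 95, "No": 80},
        "500-1000": {"Yes": 27, "No": 33},
        "More than 1000": {"Yes": 146, "No": 135},
    }
)

# subplots=True draws one pie per column; layout arranges them in a 2x3 grid
axs = counts.plot.pie(
    subplots=True, layout=(2, 3), figsize=(12, 7), legend=False, autopct="%1.0f%%"
)
for ax, name in zip(axs.flatten(), counts.columns):
    ax.set_title(name)
    ax.set_ylabel("")  # hide the column name pandas puts on the y-axis
plt.tight_layout()
```

Starting from the grouped Series instead, unstacking the no_employees level into columns should give the same shape.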

How to set a Seaborn scatterplot hue equal to features and not the values of a single feature?

I would like to create a scatterplot in Seaborn that sets the hue parameter so that values are coloured based on what feature they are from. (All features have values 0 to 100)
For example, supposing I have the following dataframe:
Happiness Kindness Sadness
1 100 70 0
2 60 50 1
3 34 32 10
4 23 65 54
5 43 54 87
When plotting the values, I would like to set all the values under happiness to red, kindness to blue, sadness to green. However, with Seaborn's scatterplot, the hue parameter only accepts one variable under the dataframe. Documentation here: http://man.hubwiz.com/docset/Seaborn.docset/Contents/Resources/Documents/generated/seaborn.scatterplot.html
I want to know if there are any workarounds to the only one variable being accepted feature so I can use the hue parameter to distinguish between the different variables.
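One possible workaround, sketched here under assumptions: reshape the dataframe from wide to long with pandas melt, so that the feature name becomes a single column that hue can use. The column names row, feature and value are made up for this sketch, and plotting each value against its row label is an assumed choice, since the question does not say what the axes should be.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# The example dataframe from the question
df = pd.DataFrame(
    {
        "Happiness": [100, 60, 34, 23, 43],
        "Kindness": [70, 50, 32, 65, 54],
        "Sadness": [0, 1, 10, 54, 87],
    },
    index=[1, 2, 3, 4, 5],
)

# Wide -> long: one row per (row label, feature) pair
long_df = (
    df.rename_axis("row")
      .reset_index()
      .melt(id_vars="row", var_name="feature", value_name="value")
)

# Map each feature name to the color requested in the question
palette = {"Happiness": "red", "Kindness": "blue", "Sadness": "green"}
ax = sns.scatterplot(data=long_df, x="row", y="value", hue="feature", palette=palette)
```

After the reshape, hue still receives a single variable, but that variable now records which feature each value came from.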

Unintended Additional line drawn by Plotly express in Python

Plotly draws an extra diagonal line from the start to the endpoint of the original line graph.
Other data, other graphs work fine.
Only this data adds the line.
Why does this happen?
How can I fix this?
Below is the code
temp = pd.DataFrame(df[{KEY_WORD}])
temp['date'] = temp.index
fig=px.line(temp.melt(id_vars="date"), x='date', y='value', color='variable')
fig.show()
plotly.offline.plot(fig,filename='Fig_en1')
Just had the same issue -- try checking for duplicate values on the X axis. I was using the following code:
fig = px.line(df, x="weekofyear", y="interest", color="year")
fig.show()
That created the following plot:
I realised that this was because, in certain years, some of the dates fell in week 52/53 of the previous year, which created duplicate week numbers within a year, e.g. index 93 and 145 below:
date interest query year weekofyear
39 2015-12-20 44 home insurance 2015 51
40 2015-12-27 55 home insurance 2015 52
41 2016-01-03 69 home insurance 2016 53
92 2016-12-25 46 home insurance 2016 51
93 2017-01-01 64 home insurance 2017 52
144 2017-12-24 51 home insurance 2017 51
145 2017-12-31 79 home insurance 2017 52
196 2018-12-23 46 home insurance 2018 51
197 2018-12-30 64 home insurance 2018 52
248 2019-12-22 57 home insurance 2019 51
249 2019-12-29 73 home insurance 2019 52
By amending these (for week numbers that are high for dates in Jan, I subtracted 1 from the year column) I seem to have got rid of the phenomenon:
NB: there may be some other differences between the charts due to the dataset being somewhat fluid.
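The check for duplicates can be sketched with pandas alone. This is only an illustration using the dates from the table above: it assumes the week numbers are ISO week numbers (which matches the table), and the iso_year column is a name introduced here. Pairing the ISO week with the ISO year, rather than the calendar year, is a variation on the manual fix described above, not what the answer itself did.

```python
import pandas as pd

# Dates taken from the table above
dates = pd.to_datetime([
    "2015-12-20", "2015-12-27", "2016-01-03", "2016-12-25", "2017-01-01",
    "2017-12-24", "2017-12-31", "2018-12-23", "2018-12-30", "2019-12-22",
    "2019-12-29",
])
df = pd.DataFrame({"date": dates})
iso = df["date"].dt.isocalendar()  # columns: year, week, day (ISO 8601)

df["year"] = df["date"].dt.year            # calendar year, as in the table
df["weekofyear"] = iso["week"].astype(int)

# Rows that share the same (year, weekofyear) pair: these make the line double back
dupes_calendar = df[df.duplicated(subset=["year", "weekofyear"], keep=False)]
print(dupes_calendar)

# Assumed alternative fix: pair the ISO week with the ISO year instead
df["iso_year"] = iso["year"].astype(int)
dupes_iso = df[df.duplicated(subset=["iso_year", "weekofyear"], keep=False)]
```

With the calendar year, 2017-01-01 and 2017-12-31 both land on (2017, 52). With the ISO year, 2017-01-01 moves to (2016, 52) and no pair repeats.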
A similar question has been asked and answered in the post How to disable trendline in plotly.express.line?, but in your case I'm pretty sure the problem lies in temp.melt(id_vars="date"), x='date', y='value', color='variable'. It seems you're transforming your data from a wide to a long format. You're using color='variable' without specifying that in temp.melt(id_vars="date"). And when the color specification does not properly correspond to the structure of your dataset, an extra line like yours can occur. Just take a look at this:
Command 1:
fig = px.line(data_frame=df_long, x='Timestamp', y='value', color='stacked_values')
Plot 1:
Command 2:
fig = px.line(data_frame=df_long, x='Timestamp', y='value')
Plot 2:
See the difference? That's why I think there's a mis-specification in your fig=px.line(temp.melt(id_vars="date"), x='date', y='value', color='variable').
So please share your data, or a sample of your data that reproduces the problem, and I'll have a better chance of verifying your problem.

Plotting multiple lines in one graph with pandas and matplotlib, using climate data

I'm trying to create a graph that shows whether or not average temperatures in my city are increasing. I'm using data provided by NOAA and have a DataFrame that looks like this:
DATE TAVG MONTH YEAR
0 1939-07 86.0 07 1939
1 1939-08 84.8 08 1939
2 1939-09 82.2 09 1939
3 1939-10 68.0 10 1939
4 1939-11 53.1 11 1939
5 1939-12 52.5 12 1939
This is saved in a variable called "avgs", and I then use groupby and plot functions like so:
avgs.groupby(["YEAR"]).plot(kind='line',x='MONTH', y='TAVG')
This produces a line graph (see below for example) for each year that shows the average temperature for each month. That's great stuff, but I'd like to be able to put all of the yearly line graphs into one graph, for the purposes of visual comparison (to see if the monthly averages are increasing).
Example output
I'm a total noob with matplotlib and pandas, so I don't know the best way to do this. Am I going wrong somewhere and just don't realize it? And if I'm on the right track, where should I go from here?
Very similar to the other answer (by Anake), but here you get control over the legend (in the other answer, the legend entries for all years will be "TAVG"). I changed the last two rows of your data to a new year just to show this.
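import io
import pandas as pd
import matplotlib.pyplot as plt

# sample data as text; the last two rows are set to 1940 to give a second year
data = '''
 DATE TAVG MONTH YEAR
0 1939-07 86.0 07 1939
1 1939-08 84.8 08 1939
2 1939-09 82.2 09 1939
3 1939-10 68.0 10 1939
4 1940-11 53.1 11 1940
5 1940-12 52.5 12 1940
'''
# parse the whitespace-separated text; the unnamed first column becomes the index
avgs = pd.read_csv(io.StringIO(data), sep=r'\s+')

ax = plt.subplot()
# one line per year, labelled with the year
for key, group in avgs.groupby("YEAR"):
    ax.plot(group.MONTH, group.TAVG, label=key)
ax.set_xlabel('Month')
ax.set_ylabel('TAVG')
plt.legend()
plt.show()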
will result in
You can do:
ax = None
for group in df.groupby("YEAR"):
    ax = group[1].plot(x="MONTH", y="TAVG", ax=ax)
plt.show()
Each plot() returns the matplotlib Axes instance where it drew the plot. So by feeding that back in each time, you can repeatedly draw on the same set of axes.
I don't think you can do that directly in the functional style as you have tried unfortunately.

Getting appropriate labels to show up in grouped DataFrameGroupBy.plot() legend

I'm wondering how to get the proper legend labels to show up in a DataFrame plot after grouped using the group by method. One would expect the legend labels to be the group names; instead they are showing up as redundant dependent variable names.
For example, I have a DataFrame that looks like this:
ECMO Year Sex Runs p_ecmo
102 111 2011 M 2106 0.052707
104 31 2012 F 1801 0.017213
105 42 2012 M 2664 0.015766
107 59 2013 F 1039 0.056785
108 72 2013 M 1386 0.051948
And I am trying to group by sex, and plot p_ecmo by year. Thus,
ecmo_by_year[ecmo_by_year.Year>=1996].groupby('Sex').plot('Year', 'p_ecmo', grid=False)
ylabel('Proportion ECMO')
plt.legend()
But what I get is this, which is not helpful. The legend should display the group name, not the variable.
It doesn't look like you need groupby since you are not splitting, combining, or applying on p_ecmo
You can use a pivot_table to get the desired result.
# current pandas uses index= and columns= (the older rows= and cols= arguments were removed)
pt = ecmo_by_year.pivot_table(['p_ecmo'], index='Year', columns='Sex').fillna(0)
pt['p_ecmo'].plot()
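A self-contained sketch of the pivot approach, assuming only the five sample rows shown in the question and current pandas argument names. Unlike the answer's code, it passes values as a plain string so the pivoted columns are just the Sex values, and it leaves missing combinations as NaN rather than filling them with 0.

```python
import pandas as pd
import matplotlib.pyplot as plt

# The five sample rows from the question
ecmo_by_year = pd.DataFrame(
    {
        "ECMO": [111, 31, 42, 59, 72],
        "Year": [2011, 2012, 2012, 2013, 2013],
        "Sex": ["M", "F", "M", "F", "M"],
        "Runs": [2106, 1801, 2664, 1039, 1386],
        "p_ecmo": [0.052707, 0.017213, 0.015766, 0.056785, 0.051948],
    },
    index=[102, 104, 105, 107, 108],
)

# One column per Sex, one row per Year; each column becomes one line
pt = ecmo_by_year.pivot_table(values="p_ecmo", index="Year", columns="Sex")

ax = pt.plot(marker="o", grid=False)
ax.set_ylabel("Proportion ECMO")
ax.legend(title="Sex")  # legend entries are the group names F and M
```

Because each group is now its own column, the legend picks up the group names instead of repeating the name of the plotted variable.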
