I'm trying to visually compare two columns in a data frame, but the result is a weird table or plot with 'frequency' in place of one of the columns.
I tried these options:
ct1=pd.crosstab(df['releaseyear'],df['score'],normalize=True)
ct1.plot()
df.plot( x='releaseyear', y='score', kind='hist')
I also tried a scatter plot, which gets the x and y right, but I don't know how to normalize it so that it shows only the average for each year rather than all the data:
plt.scatter(df['releaseyear'], df['score'])
plt.show()
No sample data was provided to reproduce the dataframe, and there is no indication of what it looks like.
This answer is based on my understanding, assuming the data looks like this:
year score
2001 20
2001 18
2002 12
2002 16
First use groupby to group the data by year and apply the required aggregate function:
df = df.groupby('year').mean().reset_index()
Output:
year score
0 2001 19.0
1 2002 14.0
You can then plot the aggregated data accordingly.
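As a minimal sketch of the whole flow, assuming the data really looks like the small table above (the column names `year` and `score` and the choice of a bar chart are assumptions, not something the question confirms):

```python
import pandas as pd
import matplotlib.pyplot as plt

# sample data copied from the table in this answer
df = pd.DataFrame({'year': [2001, 2001, 2002, 2002],
                   'score': [20, 18, 12, 16]})

# one row per year holding the mean score for that year
yearly = df.groupby('year', as_index=False)['score'].mean()

# one bar per year; kind='line' works the same way for longer ranges
ax = yearly.plot(x='year', y='score', kind='bar', legend=False, rot=0)
ax.set_ylabel('mean score')
# call plt.show() to display the figure when running as a script
```

With the question's own frame, the same pattern would use `releaseyear` in place of `year`.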
Let's assume I have a dataframe and I'm looking at 2 columns of it (2 series).
Using one of the columns - "no_employees" below - Can someone kindly help me figure out how to create 6 different pie charts or bar charts (1 for each grouping of no_employees) that illustrate the value counts for the Yes/No values in the treatment column? I'll use matplotlib or seaborn, whatever you feel is easiest.
I'm using the following line of code to generate the output below.
dataframe_title.groupby(['no_employees']).treatment.value_counts()
But now I'm stuck. Do I use seaborn? .plot? This seems like it should be easy, and I know there are some cases where I can make subplots=True, but I'm really confused. Thank you so much.
no_employees    treatment
1-5             Yes           88
                No            71
100-500         Yes           95
                No            80
26-100          Yes          149
                No           139
500-1000        No            33
                Yes           27
6-25            No           162
                Yes          127
More than 1000  Yes          146
                No           135
The importance of data encoding:
The purpose of data visualization is to more easily convey information (e.g. in this case, the relative number of 'treatments' per category)
The bar chart accommodates easily displaying the important information
how many in each group said 'Yes' or 'No'
the relative sizes of each group
A pie plot is more commonly used to display a single sample whose groups sum to 100%.
Wikipedia: Pie Chart
Research has shown that comparison by angle is less accurate than comparison by length, in that people are less able to discern differences.
Statisticians generally regard pie charts as a poor method of displaying information, and they are uncommon in scientific literature.
This data is not well represented by a pie plot, because each company size is a separate population, which will require 6 pie plots to be correctly represented.
The data can be placed into a pie plot, as others have shown, but that doesn't mean it should be.
Regardless of the type of plot, the data must be in the correct shape for the plot API.
Tested with pandas 1.3.0, seaborn 0.11.1, and matplotlib 3.4.2
Set up a test DataFrame
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np # for sample data only
np.random.seed(365)
cats = ['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']
data = {'no_employees': np.random.choice(cats, size=(1000,)),
'treatment': np.random.choice(['Yes', 'No'], size=(1000,))}
df = pd.DataFrame(data)
# set a categorical order for the x-axis to be ordered
df.no_employees = pd.Categorical(df.no_employees, categories=cats, ordered=True)
no_employees treatment
0 26-100 No
1 1-5 Yes
2 >1000 No
3 100-500 Yes
4 500-1000 Yes
Plotting with pandas.DataFrame.plot():
This requires grouping the dataframe to get .value_counts, and unstacking with pandas.DataFrame.unstack.
# to get the dataframe in the correct shape, unstack the groupby result
dfu = df.groupby(['no_employees']).treatment.value_counts().unstack()
treatment No Yes
no_employees
1-5 78 72
6-25 83 86
26-100 83 76
100-500 91 84
500-1000 78 83
>1000 95 91
# plot
ax = dfu.plot(kind='bar', figsize=(7, 5), xlabel='Number of Employees in Company', ylabel='Count', rot=0)
ax.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
Plotting with seaborn
seaborn is a high-level API for matplotlib.
seaborn.barplot()
Requires a DataFrame in a tidy (long) format, which is done by grouping the dataframe to get .value_counts, and resetting the index with pandas.Series.reset_index
May also be done with the figure-level interface using sns.catplot() with kind='bar'
# groupby, get value_counts, and reset the index
dft = df.groupby(['no_employees']).treatment.value_counts().reset_index(name='Count')
no_employees treatment Count
0 1-5 No 78
1 1-5 Yes 72
2 6-25 Yes 86
3 6-25 No 83
4 26-100 No 83
5 26-100 Yes 76
6 100-500 No 91
7 100-500 Yes 84
8 500-1000 Yes 83
9 500-1000 No 78
10 >1000 No 95
11 >1000 Yes 91
# plot
p = sns.barplot(x='no_employees', y='Count', data=dft, hue='treatment')
p.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
p.set(xlabel='Number of Employees in Company')
seaborn.countplot()
Uses the original dataframe, df, without any transformations.
May also be done with the figure-level interface using sns.catplot() with kind='count'
p = sns.countplot(data=df, x='no_employees', hue='treatment')
p.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
p.set(xlabel='Number of Employees in Company')
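For the figure-level route mentioned above, a rough sketch with `sns.catplot()` and `kind='count'` could look like the following. It rebuilds the same kind of random test frame as the setup section; the `height` and `aspect` values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# same style of random sample data as the test DataFrame above
np.random.seed(365)
cats = ['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']
df = pd.DataFrame({'no_employees': np.random.choice(cats, size=(1000,)),
                   'treatment': np.random.choice(['Yes', 'No'], size=(1000,))})
df['no_employees'] = pd.Categorical(df['no_employees'], categories=cats, ordered=True)

# catplot returns a FacetGrid instead of an Axes, and places the legend itself
g = sns.catplot(data=df, x='no_employees', hue='treatment', kind='count',
                height=5, aspect=1.4)
g.set_axis_labels('Number of Employees in Company', 'Count')
```

Because the return value is a `FacetGrid`, labels are set through grid methods such as `set_axis_labels` rather than `p.set(...)`.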
Output of barplot and countplot
Let's reshape the dataframe and plot with subplots=True:
df_chart = df1.unstack()['Pct']
axs = df_chart.plot.pie(subplots=True, figsize=(4,9), layout=(2,1), legend=False, title=df_chart.columns.tolist())
ax_flat = axs.flatten()
for ax in ax_flat:
    ax.yaxis.label.set_visible(False)
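The snippet above relies on a `df1` with a `Pct` column that is not shown here. As a self-contained sketch of the same `subplots=True` idea, the counts from the question can be typed in directly; the 2x3 layout, figure size, and `autopct` format are assumptions made for illustration.

```python
import pandas as pd

# counts copied from the value_counts() output in the question
counts = pd.DataFrame(
    {'Yes': [88, 95, 149, 27, 127, 146],
     'No': [71, 80, 139, 33, 162, 135]},
    index=['1-5', '100-500', '26-100', '500-1000', '6-25', 'More than 1000'],
)

# DataFrame.plot.pie draws one pie per column, so company sizes go in the columns
by_size = counts.T
axs = by_size.plot.pie(subplots=True, layout=(2, 3), figsize=(12, 7),
                       legend=False, autopct='%1.0f%%')

for ax, name in zip(axs.flatten(), by_size.columns):
    ax.set_title(name)
    ax.set_ylabel('')  # hide the default y-label, which repeats the column name
```

Each pie then covers one company size, so its Yes/No slices sum to 100% of that group.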
df
SKU Comp Brand Jan_Sales Feb_Sales Mar_sales Apr_sales Dec_sales..
A AC BA 122 100 50 200 300
B BC BB 100 50 80 90 250
C CC BC 40 30 100 10 11
and so on
Now I want a graph that plots Jan sales, Feb sales, and so on through Dec as one line for SKU A, with another line on the same graph for SKU B, and likewise for SKU C.
I read a few answers which say that I need to transpose my data, something like this:
df.T.plot()
However, my first column is SKU, and I want to plot based on that. The rest of the columns are numeric. I want each line to be labelled with its SKU name, and the plotting should be row-wise.
EDIT (added after receiving some answers, as I am facing this issue in a few other datasets):
Let's say I don't want columns such as Company, Brand, etc. What should I do then?
Use DataFrame.set_index to convert SKU to the index and then transpose:
df.set_index('SKU').T.plot()
Use set_index then transpose:
df.set_index("SKU").T.plot()
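For the follow-up in the EDIT (leaving out text columns such as Comp and Brand), one possible sketch is to keep only the numeric columns before transposing. The frame below copies the first four months from the question; using `select_dtypes` here is a suggestion for illustration, not something the answers above state.

```python
import pandas as pd

# sample rows copied from the question (months after April omitted)
df = pd.DataFrame({
    'SKU': ['A', 'B', 'C'],
    'Comp': ['AC', 'BC', 'CC'],
    'Brand': ['BA', 'BB', 'BC'],
    'Jan_Sales': [122, 100, 40],
    'Feb_Sales': [100, 50, 30],
    'Mar_sales': [50, 80, 100],
    'Apr_sales': [200, 90, 10],
})

# keep SKU as the index and drop every non-numeric column
monthly = df.set_index('SKU').select_dtypes('number')
# equivalent when the unwanted names are known:
# monthly = df.set_index('SKU').drop(columns=['Comp', 'Brand'])

# after transposing, each SKU is a column and therefore its own labelled line
ax = monthly.T.plot(marker='o')
ax.set_xlabel('Month')
ax.set_ylabel('Sales')
```

The legend then lists A, B, and C, one entry per line.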
I have two dataframes: let's say the first one corresponds to operational power plants and the second one to pipeline power plants. I want to plot both in the same area chart. They should be distinguished by dark and light colors, as in the plots below. I have tried to insert a column into each dataframe and set it as an index (besides Country and Fuel), but I couldn't either set a new column as an index or plot the dataframes in the same plot. I would really appreciate some ideas on how to do that.
df1
2010 2020 2030 2040 2050
Country Fuel
A Gas 100 110 120 130 140
Coal 100 110 120 130 140
df2
2010 2020 2030 2040 2050
Country Fuel
A Gas 100 110 120 130 140
Coal 100 110 120 130 140
I find that hierarchical indexes get in the way for certain tasks (like this one) and it is better to do all of the manipulation on a flat dataframe. In the steps below, I use reset_index to turn the index levels into ordinary columns, and then set_index to put them back again in preparation for the plotting step.
# reset the indexes, and add a new column to both dataframes
df1.reset_index(inplace=True)
df1['Plant Type'] = 'Operational'
df2.reset_index(inplace=True)
df2['Plant Type'] = 'Pipeline'
# concatenate the two dataframes
df_combined = pd.concat([df1, df2])
# set the index back to how it was, but also include the new column, and then plot
df_combined.set_index(['Plant Type', 'Country', 'Fuel']).plot.area()
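A self-contained sketch of these steps, using the numbers from the question, is shown below. Two things in it are assumptions rather than part of the answer: the combined frame is transposed so that the years run along the x-axis, and the dark/light color names are picked arbitrarily.

```python
import pandas as pd

years = [2010, 2020, 2030, 2040, 2050]
index = pd.MultiIndex.from_tuples([('A', 'Gas'), ('A', 'Coal')],
                                  names=['Country', 'Fuel'])
values = [[100, 110, 120, 130, 140], [100, 110, 120, 130, 140]]
df1 = pd.DataFrame(values, index=index, columns=years)  # operational plants
df2 = pd.DataFrame(values, index=index, columns=years)  # pipeline plants

# flatten the indexes and tag each frame with its plant type
flat1 = df1.reset_index().assign(**{'Plant Type': 'Operational'})
flat2 = df2.reset_index().assign(**{'Plant Type': 'Pipeline'})
df_combined = pd.concat([flat1, flat2]).set_index(['Plant Type', 'Country', 'Fuel'])

# transpose so each (plant type, country, fuel) combination becomes one stacked area
by_year = df_combined.T
by_year.index = by_year.index.astype(int)

# dark shades for operational plants, light shades for pipeline plants
colors = ['tab:blue', 'tab:orange', 'lightskyblue', 'navajowhite']
ax = by_year.plot.area(color=colors)
ax.set_xlabel('Year')
```

The color list follows the column order of `by_year`, so it needs adjusting if more countries or fuels are added.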
I'm trying to create a graph that shows whether or not average temperatures in my city are increasing. I'm using data provided by NOAA and have a DataFrame that looks like this:
DATE TAVG MONTH YEAR
0 1939-07 86.0 07 1939
1 1939-08 84.8 08 1939
2 1939-09 82.2 09 1939
3 1939-10 68.0 10 1939
4 1939-11 53.1 11 1939
5 1939-12 52.5 12 1939
This is saved in a variable called "avgs", and I then use groupby and plot functions like so:
avgs.groupby(["YEAR"]).plot(kind='line',x='MONTH', y='TAVG')
This produces a line graph (see below for example) for each year that shows the average temperature for each month. That's great stuff, but I'd like to be able to put all of the yearly line graphs into one graph, for the purposes of visual comparison (to see if the monthly averages are increasing).
Example output
I'm a total noob with matplotlib and pandas, so I don't know the best way to do this. Am I going wrong somewhere and just don't realize it? And if I'm on the right track, where should I go from here?
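Very similar to the other answer (by Anake), but here you get control over the legend (in the other answer, the legend entry for every year will be "TAVG"). I changed some entries in your data to a new year just to show this.
import io
import pandas as pd
import matplotlib.pyplot as plt

data = '''
   DATE     TAVG  MONTH  YEAR
0  1939-07  86.0  07     1939
1  1939-08  84.8  08     1939
2  1939-09  82.2  09     1939
3  1939-10  68.0  10     1939
4  1940-11  53.1  11     1940
5  1940-12  52.5  12     1940
'''
# parse the sample text; the unnamed first column becomes the index
avgs = pd.read_csv(io.StringIO(data), sep=r'\s+')

ax = plt.subplot()
# draw one line per year on the same axes, labelled with that year
for key, group in avgs.groupby("YEAR"):
    ax.plot(group.MONTH, group.TAVG, label=key)
ax.set_xlabel('Month')
ax.set_ylabel('TAVG')
plt.legend()
plt.show()
This puts every year on the same axes, with one legend entry per year.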
You can do:
ax = None
# groupby yields (year, sub-frame) pairs; reuse the same Axes on every pass
for _, group in df.groupby("YEAR"):
    ax = group.plot(x="MONTH", y="TAVG", ax=ax)
plt.show()
Each plot() returns the matplotlib Axes instance where it drew the plot. So by feeding that back in each time, you can repeatedly draw on the same set of axes.
Unfortunately, I don't think you can do that directly in the functional style you tried.
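One alternative that avoids the explicit loop, not taken from either answer, is to pivot the data so that each year becomes its own column and then call `plot()` once. The sketch below reuses the modified sample rows from the earlier answer, with MONTH and YEAR stored as integers (an assumption about how the real frame is typed).

```python
import pandas as pd

# sample rows from the answer above, with MONTH and YEAR as integers
avgs = pd.DataFrame({
    'DATE': ['1939-07', '1939-08', '1939-09', '1939-10', '1940-11', '1940-12'],
    'TAVG': [86.0, 84.8, 82.2, 68.0, 53.1, 52.5],
    'MONTH': [7, 8, 9, 10, 11, 12],
    'YEAR': [1939, 1939, 1939, 1939, 1940, 1940],
})

# rows are months, columns are years; months missing for a year become NaN
wide = avgs.pivot(index='MONTH', columns='YEAR', values='TAVG')

# a single plot call draws one line per year, labelled with the year
ax = wide.plot(marker='o')
ax.set_xlabel('Month')
ax.set_ylabel('TAVG')
```

Months with no reading for a given year show up as gaps in that year's line.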