Pandas (and seaborn) violinplot of state vs. year - python

I'm learning Pandas (watching these helpful videos) and am currently playing around with a UFO sightings table:
import pandas as pd
ufo = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo.head()
ufo.Time = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year
ufo.head()
Now, I'd like to use Seaborn to make a violinplot with the state on the x-axis and the year on the y-axis, so that the plot shows the density of sightings across the years for each state.
If I use
ufo.State.value_counts()
I can get a Pandas Series of the counts for each state. But how do I separate this data by year? Do I somehow need to get the number of UFO sightings per year per state?
Am I on the right track to create a Seaborn violinplot? Or going in completely the wrong direction?

The violinplot documentation gives the following example:
ax = sns.violinplot(x="day", y="total_bill", data=tips)
You assign columns to the axes directly by passing one column name to x= and another to y=. The following shows the structure of the tips data:
In [ ]: tips.head()
Out[ ]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
You want a violinplot with ufo.State on the x-axis and ufo.Year on the y-axis. I believe ufo.State.value_counts(), or even a groupby, is unnecessary here, because the ufo data already has one row per sighting, which is the format violinplot expects.
You can achieve this by passing the two ufo columns directly to x= and y=. See the code below:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
ufo = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo.Time = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year
ufo.head()
City Colors Reported Shape Reported State \
0 Ithaca NaN TRIANGLE NY
1 Willingboro NaN OTHER NJ
2 Holyoke NaN OVAL CO
3 Abilene NaN DISK KS
4 New York Worlds Fair NaN LIGHT NY
Time Year
0 1930-06-01 22:00:00 1930
1 1930-06-30 20:00:00 1930
2 1931-02-15 14:00:00 1931
3 1931-06-01 13:00:00 1931
4 1933-04-18 19:00:00 1933
fig, ax = plt.subplots(figsize=(12,8))
ax = sns.violinplot(x=ufo.State, y=ufo.Year)
# ax = sns.violinplot(x='State', y='Year', data=ufo) # equivalent to the line above
plt.show()
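If the per-year, per-state counts asked about in the question are still wanted (for a table rather than for the violinplot), a minimal sketch could look like the following. The tiny `ufo` frame here is made up for illustration and only mimics the State and Year columns of the real data.

```python
import pandas as pd

# Made-up stand-in for the real UFO table: one row per sighting.
ufo = pd.DataFrame({
    "State": ["NY", "NY", "NJ", "NY", "NJ"],
    "Year": [1930, 1930, 1930, 1931, 1931],
})

# Long form: a Series indexed by (State, Year) holding the number of sightings.
per_state_year = ufo.groupby(["State", "Year"]).size()

# Wide form: states as rows, years as columns, counts as values.
table = pd.crosstab(ufo["State"], ufo["Year"])

print(per_state_year)
print(table)
```

Neither result is needed by violinplot itself, which estimates the density from the raw rows.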

Related

Seaborn boxplot with grouped data into categories with count column

I run into a problem when trying to plot my dataset with a seaborn boxplot. I've got a dataset received grouped from database like:
region age total
0 STC 2.0 11024
1 PHA 84.0 3904
2 OLK 55.0 12944
3 VYS 72.0 5592
4 PAK 86.0 2168
... ... ... ...
1460 KVK 62.0 4600
1461 MSK 41.0 26568
1462 LBK 13.0 6928
1463 JHC 18.0 8296
1464 HKK 88.0 2408
And I would like to create a box plot with the region on an x-scale, age on a y-scale, based on the total number of observations.
When I try ax = sns.boxplot(x='region', y='age', data=df), I get a simple boxplot that doesn't take the total column into account. One brute-force option is to repeat each row as many times as its total, but I don't like this solution.
sns.histplot and sns.kdeplot support a weights= parameter, but sns.boxplot doesn't. Simply repeating values isn't necessarily a bad solution, but in this case the numbers are very large. You could create a new dataframe with repeated data, but divide the 'total' column to keep the number of rows manageable.
In the sample data every row has a different region, which makes a boxplot rather strange. The code below assumes there aren't too many regions (1400 regions certainly wouldn't work well).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from io import StringIO
df_str = ''' region age total
STC 2.0 11024
STC 84.0 3904
STC 55.0 12944
STC 72.0 5592
STC 86.0 2168
PHA 62.0 4600
PHA 41.0 26568
PHA 13.0 6928
PHA 18.0 8296
PHA 88.0 2408'''
df = pd.read_csv(StringIO(df_str), sep=r'\s+')  # delim_whitespace=True is deprecated in recent pandas
# use a scaled down version of the totals as a repeat factor
repeats = df['total'].to_numpy(dtype=int) // 100
df_total = pd.DataFrame({'region': np.repeat(df['region'].values, repeats),
'age': np.repeat(df['age'].values, repeats)})
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 4))
sns.kdeplot(data=df, x='age', weights='total', hue='region', ax=ax1)
sns.boxplot(data=df_total, y='age', x='region', ax=ax2)
plt.tight_layout()
plt.show()
An alternative would be to do everything outside seaborn, using statsmodels.stats.weightstats.DescrStatsW to calculate the percentiles and plot the boxplots via matplotlib. Outliers would still have to be calculated separately. (See also this post)
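As a rough sketch of that "outside seaborn" route, the example below computes weighted quartiles and hands them to matplotlib's `Axes.bxp`. Note the substitution: instead of `DescrStatsW`, it uses a small hypothetical NumPy helper, `weighted_quantile`, which returns the smallest value whose cumulative weight reaches the requested fraction. The whiskers are simply set to the minimum and maximum age (an assumption made for brevity), and outliers are not computed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["STC"] * 5 + ["PHA"] * 5,
    "age": [2.0, 84.0, 55.0, 72.0, 86.0, 62.0, 41.0, 13.0, 18.0, 88.0],
    "total": [11024, 3904, 12944, 5592, 2168, 4600, 26568, 6928, 8296, 2408],
})


def weighted_quantile(values, weights, q):
    """Hypothetical helper: smallest value whose cumulative weight >= q * total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    idx = np.searchsorted(cum, np.asarray(q) * cum[-1], side="left")
    return values[order][idx]


stats = []
for region, grp in df.groupby("region", sort=False):
    q1, med, q3 = weighted_quantile(grp["age"], grp["total"], [0.25, 0.5, 0.75])
    stats.append({
        "label": region,
        "q1": q1, "med": med, "q3": q3,
        # Simplification: whiskers span the full observed range.
        "whislo": grp["age"].min(), "whishi": grp["age"].max(),
    })

fig, ax = plt.subplots(figsize=(6, 4))
ax.bxp(stats, showfliers=False)  # draw boxes from precomputed statistics
ax.set_xlabel("region")
ax.set_ylabel("age")
```

Because the statistics are computed from the weights directly, no rows need to be repeated, however large the totals are.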

Getting labels for legend after graphing pivot of dataframe in pandas

I am trying to have my plot show a legend in which each line is labeled with the column it came from. I did not separate the plt.plot() call from the pivot step, but I'd like to know whether a legend is still possible. No legend shows up at all, and if I add
plt.plot(df_EPErrorPercentByWeekAndDC.pivot(index='hellofresh delivery week', columns='DC', values='Error Percent'), label='DC')
it just uses that string as every label. If I pass df_EPErrorPercentByWeekAndDC['DC'] instead, it shows just one letter of it per legend item. Here is the code I have:
print("### Graphing Error Rates by Week and DC EP ###")
# remove percent sign from percent in place
df_EPErrorPercentByWeekAndDC['Error Percent'] = df_EPErrorPercentByWeekAndDC['Error Percent'].str[:-1].astype(float)
plt.xticks(rotation = 90)
plt.plot(df_EPErrorPercentByWeekAndDC.pivot(index='hellofresh delivery week', columns='DC', values='Error Percent'))
plt.legend()
plt.savefig('EPErrorPercentByWeekAndDC.png', bbox_inches="tight", dpi=500)
plt.close()
I can't share any of the data, but it is in the format of a pivot table whose columns are state names, each full of percentages. The graph works fine, but the legend isn't there.
Either save the pivoted frame and then pass its columns to the legend:
df['Error Percent'] = df['Error Percent'].str[:-1].astype(float)
plt.xticks(rotation=90)
pivoted_df = df.pivot(index='hellofresh delivery week', columns='DC',
values='Error Percent')
plt.plot(pivoted_df)
plt.legend(pivoted_df.columns, title=pivoted_df.columns.name)
plt.tight_layout()
plt.show()
Or use DataFrame.plot directly on the pivoted DataFrame which will handle the legend automatically:
df['Error Percent'] = df['Error Percent'].str[:-1].astype(float)
pivoted_df = df.pivot(index='hellofresh delivery week', columns='DC',
values='Error Percent')
ax = pivoted_df.plot(rot=90)
plt.tight_layout()
plt.show()
Sample DataFrame and imports used:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
np.random.seed(5)
df = pd.DataFrame({
'hellofresh delivery week': np.repeat(
pd.date_range('2021-05-01', periods=5, freq='W'), 2),
'DC': ['A', 'B'] * 5,
'Error Percent': pd.Series(np.random.randint(10, 100, 10), dtype=str) + '%'
})
df:
hellofresh delivery week DC Error Percent
0 2021-05-02 A 88%
1 2021-05-02 B 71%
2 2021-05-09 A 26%
3 2021-05-09 B 83%
4 2021-05-16 A 18%
5 2021-05-16 B 72%
6 2021-05-23 A 37%
7 2021-05-23 B 40%
8 2021-05-30 A 90%
9 2021-05-30 B 17%
pivoted_df:
DC A B
hellofresh delivery week
2021-05-02 88.0 71.0
2021-05-09 26.0 83.0
2021-05-16 18.0 72.0
2021-05-23 37.0 40.0
2021-05-30 90.0 17.0
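As for why label='DC' showed up on every legend entry: plotting a 2D table creates one line per column, and a single label string is applied to all of them. A minimal sketch of a third option, labeling the returned line objects one by one, is below. The small frame and its 'week' column are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data in the same long format as the question.
df = pd.DataFrame({
    "week": [1, 1, 2, 2],
    "DC": ["A", "B", "A", "B"],
    "Error Percent": [88.0, 71.0, 26.0, 83.0],
})
pivoted_df = df.pivot(index="week", columns="DC", values="Error Percent")

fig, ax = plt.subplots()
lines = ax.plot(pivoted_df)  # returns one Line2D per pivoted column

# Attach each column name to its own line, then build the legend from them.
for line, name in zip(lines, pivoted_df.columns):
    line.set_label(name)
legend = ax.legend(title=pivoted_df.columns.name)

legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```

This gives the same legend as passing pivoted_df.columns to the legend call, but keeps the labels attached to the lines themselves.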

How to make multiple scatter subplots with sharing one-axis?

date  name   amount
1     harry  100
1     joe    20
2     harry  50
3     joe    60
3     lee    25
4     lee    60
4     harry  200
4     joe    90
I am trying to share the 'date' axis (x-axis) across plots for 432 person names. The image was too large to show.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dec=pd.read_csv('december.csv')
sns.lmplot(x='date', y='amount',
data= dec, fit_reg=False, hue='name', legend=True, palette='Set1')
This code gives one graph with 432 hues, but I want 432 graphs. How can I do that?
Use the same code you wrote, but replace hue='name' with col='name'; that should give you the expected behavior:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dec = pd.DataFrame(
[
[1,'harry',100],
[1,'joe',20],
[2,'harry',50],
[3,'joe',60],
[3,'lee',25],
[4,'lee',60],
[4,'harry',200],
[4,'joe',90],
],
columns=['date','name','amount'],
)
sns.lmplot(
x='date',
y='amount',
data= dec,
fit_reg=False,
col='name',
legend=True,
palette='Set1',
)
If you want to break the rows, you can define a column wrapper with col_wrap (number of plots per row):
sns.lmplot(
x='date',
y='amount',
data= dec,
fit_reg=False,
col='name',
col_wrap=1,
legend=True,
palette='Set1',
)
EDIT: using the groupby() method, you can easily get aggregates such as number of dots per plot and total amount per group.
The main idea is to group the records in the dec dataframe by name (as was done implicitly in the plot above).
Continuing on the code above, you can have a preview of the groupby operation using the describe method:
dec.groupby('name').describe()
Out[2]:
date amount
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
name
harry 3.0 2.333333 1.527525 1.0 1.50 2.0 3.00 4.0 3.0 116.666667 76.376262 50.0 75.00 100.0 150.00 200.0
joe 3.0 2.666667 1.527525 1.0 2.00 3.0 3.50 4.0 3.0 56.666667 35.118846 20.0 40.00 60.0 75.00 90.0
lee 2.0 3.500000 0.707107 3.0 3.25 3.5 3.75 4.0 2.0 42.500000 24.748737 25.0 33.75 42.5 51.25 60.0
Using the pandas groupby method, we group records by 'name' and pick any column (here: 'amount') to get the count (the count is the same for every column, since it counts the occurrences of each distinct 'name'):
counts = dec.groupby('name')['amount'].count()
counts
Out[3]:
name
harry 3
joe 3
lee 2
Name: amount, dtype: int64
To get the total amount, we do the same, we pick the 'amount' column and call the sum() method instead of the count() method:
total_amounts = dec.groupby('name')['amount'].sum()
total_amounts
Out[4]:
name
harry 350
joe 170
lee 85
Name: amount, dtype: int64
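As an aside, both aggregates can also be computed in a single groupby pass with agg. This is only a sketch of an alternative; the two separate series above work just as well for the titles.

```python
import pandas as pd

dec = pd.DataFrame(
    [
        [1, 'harry', 100],
        [1, 'joe', 20],
        [2, 'harry', 50],
        [3, 'joe', 60],
        [3, 'lee', 25],
        [4, 'lee', 60],
        [4, 'harry', 200],
        [4, 'joe', 90],
    ],
    columns=['date', 'name', 'amount'],
)

# One row per name, with the number of records and the summed amount as columns.
summary = dec.groupby('name')['amount'].agg(['count', 'sum'])
print(summary)
```

Here summary['count'] and summary['sum'] hold the same values as the counts and total_amounts series.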
We now have two series indexed by 'name' containing the information we want: counts and total_amounts.
We're gonna use these two series to build a title for each subplot:
plot = sns.lmplot(
x='date',
y='amount',
data=dec,
fit_reg=False,
col='name',
legend=True,
palette='Set1',
)
for name in plot.axes_dict:
    subplot_title = f'name = {name}, number of dots = {counts[name]}, total amount = {total_amounts[name]}'
    plot.axes_dict[name].set_title(subplot_title)
plot.fig
It prints:

How do I make a line graph out of this?

I've imported seaborn and typed in this:
Bunker2019_Jan_to_Jun.plot(x='2019', y='Total')
Bunker2019_Jan_to_Jun.plot(x='2019', y='MGO')
and it shows two graphs. Is there any way I can show the year 2019 (Jan to Dec) and 2020 (Jan to Jun) together?
If you want them on the same plot, you need to combine the data frames. I'm not sure what your "2019" column is (date or string?), so below I create data frames that resemble yours:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
mths = pd.date_range(start='1/1/2019', periods=12,freq="M").strftime("%b").to_list()
Bunker2019 = pd.DataFrame({'2019':mths,'Total':np.random.uniform(0,1,12),
'MGO':np.random.uniform(0,1,12)})
Bunker2020 = pd.DataFrame({'2020':mths,'Total':np.random.uniform(0,1,12),
'MGO':np.random.uniform(0,1,12)})
Simple way to add the year to create a new date:
Bunker2019['Date'] = '2019_'+ Bunker2019['2019'].astype(str)
Bunker2020['Date'] = '2020_'+ Bunker2020['2020'].astype(str)
We concat and melt, setting an order:
df = pd.concat([Bunker2019[['Date','Total','MGO']],Bunker2020[['Date','Total','MGO']]])
df = df.melt(id_vars='Date')
df['Date'] = pd.Categorical(df['Date'],categories=df['Date'].unique(),ordered=True)
So now it is a long format, containing information for both 2020 and 2019:
Date variable value
0 2019_Jan Total 0.187751
1 2019_Feb Total 0.091374
2 2019_Mar Total 0.929739
3 2019_Apr Total 0.621981
4 2019_May Total 0.371236
5 2019_Jun Total 0.027078
6 2019_Jul Total 0.719046
7 2019_Aug Total 0.138531
Now to plot:
plt.figure(figsize=(12,5))
ax = sns.lineplot(data=df,x='Date',y='value',hue='variable')
sns.scatterplot(data=df,x='Date',y='value',hue='variable',ax=ax,legend=False)
plt.xticks(rotation=65, horizontalalignment='right')
plt.show()
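The ordered categorical in the concat-and-melt step matters because labels such as '2019_Jan' are plain strings, and strings sort alphabetically rather than by calendar month. A small sketch of the difference, using four made-up labels:

```python
import pandas as pd

labels = pd.Series(["2019_Jan", "2019_Feb", "2019_Mar", "2019_Apr"])

# Plain strings sort alphabetically, which scrambles the months.
plain_sorted = labels.sort_values().tolist()

# An ordered categorical remembers the order in which the labels first appeared.
ordered = pd.Series(
    pd.Categorical(labels, categories=labels.unique(), ordered=True)
)
cat_sorted = ordered.sort_values().astype(str).tolist()

print(plain_sorted)  # ['2019_Apr', '2019_Feb', '2019_Jan', '2019_Mar']
print(cat_sorted)    # ['2019_Jan', '2019_Feb', '2019_Mar', '2019_Apr']
```

This only works if the rows are already in chronological order when the categories are taken from unique(), as they are after the concat above.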
I created the source DataFrame as:
Month MGO MFO
0 2019-01 79.1 85.0
1 2019-02 69.9 91.2
2 2019-03 68.9 90.4
3 2019-04 71.1 87.0
4 2019-05 75.9 85.6
5 2019-06 60.9 82.1
6 2019-07 68.4 75.0
7 2019-08 75.8 60.7
8 2019-09 82.0 58.8
9 2019-10 95.3 56.6
10 2019-11 90.2 59.7
11 2019-12 86.5 57.7
12 2020-01 79.1 50.0
13 2020-02 88.9 52.2
14 2020-03 74.9 54.4
15 2020-04 87.1 51.0
16 2020-05 92.9 52.6
17 2020-06 105.9 53.1
(for now, the Month column is a string).
If you have 2 separate source DataFrames, concatenate them.
The first processing step is to convert the Month column to datetime type and set it as the index:
df.Month = pd.to_datetime(df.Month)
df.set_index('Month', inplace=True)
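A quick check of this step, sketched on the first three rows of the sample: after the conversion the index is a DatetimeIndex, which is what time-based operations such as daily resampling need. The example uses the default linear interpolation purely as an assumption to keep it dependency-free.

```python
import pandas as pd

# First three rows of the sample data, with Month still a string.
df = pd.DataFrame({
    "Month": ["2019-01", "2019-02", "2019-03"],
    "MGO": [79.1, 69.9, 68.9],
    "MFO": [85.0, 91.2, 90.4],
})

df["Month"] = pd.to_datetime(df["Month"])  # "2019-01" becomes 2019-01-01
df = df.set_index("Month")

print(type(df.index).__name__)  # DatetimeIndex

# Daily resampling fills the gaps between monthly points (linear by default).
daily = df.resample("D").interpolate()
print(len(df), "monthly rows ->", len(daily), "daily rows")
```

The original monthly values are kept unchanged at the first day of each month; only the days in between are interpolated.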
The first, more straightforward way to create the plot is:
df.plot(style='-x');
For my data sample I got:
The second possibility is to generate the picture with smoothed lines.
To do this, draw two plots on a single axes:
first, a smoothed line from the resampled DataFrame, with interpolation but without markers, as there are now many more points;
second, only the markers, taken from the original DataFrame;
both with the same list of colors.
The code to do it is:
fig, ax = plt.subplots()
color = ['blue', 'orange']
df.resample('D').interpolate('quadratic').plot(ax=ax, color=color)
df.plot(ax=ax, marker='x', linestyle='None', legend=False, color=color);
This time the result is:

Stacked Histogram not working in Pandas

I'm playing with Pandas and have the following code:
tips.hist(stacked=True, column="total_bill", by="time")
The resulting graph looks nice:
However, it is not stacked! I wanted them both on one plot, stacked on top of each other. I wanted it to look like the one in the docs: http://pandas.pydata.org/pandas-docs/stable/visualization.html#histograms
Any help would be greatly appreciated.
You need the values in separate columns.
tips = pd.read_csv('https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv')
>>> tips[['time', 'tip']].pivot(columns='time').plot(kind='hist', stacked=True)
>>> tips[['time', 'tip']].pivot(columns='time').head()
tip
time Dinner Lunch
0 1.01 NaN
1 1.66 NaN
2 3.50 NaN
3 3.31 NaN
4 3.61 NaN
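To see why separate columns are needed: a stacked histogram stacks one series per column, so each category has to become its own column first. Below is a minimal sketch on a made-up five-row stand-in for tips, passing values='tip' so the result has flat column names (an optional variation on the pivot above).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd

# Made-up stand-in for the tips data.
tips = pd.DataFrame({
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Dinner"],
    "tip": [1.01, 1.66, 2.00, 3.50, 3.31],
})

# One column per 'time' value; rows keep their original index, so each row
# has a value in exactly one column and NaN in the other.
wide = tips.pivot(columns="time", values="tip")
print(wide)

# Each column is drawn as its own series, stacked on the previous one.
ax = wide.plot(kind="hist", stacked=True, bins=5)
```

The NaN entries are simply ignored when the histogram is computed.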
