I've imported seaborn and typed in this:
Bunker2019_Jan_to_Jun.plot(x='2019', y='Total')
Bunker2019_Jan_to_Jun.plot(x='2019', y='MGO')
and it shows two graphs. Is there any way I can show both 2019 (Jan to Dec) and 2020 (Jan to Jun)?
If you want them on the same plot, you need to combine the data frames. I'm not sure what your "2019" column holds (dates or strings?), so below I create data frames that resemble yours:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
mths = pd.date_range(start='1/1/2019', periods=12, freq="MS").strftime("%b").to_list()
Bunker2019 = pd.DataFrame({'2019':mths,'Total':np.random.uniform(0,1,12),
'MGO':np.random.uniform(0,1,12)})
Bunker2020 = pd.DataFrame({'2020':mths,'Total':np.random.uniform(0,1,12),
'MGO':np.random.uniform(0,1,12)})
A simple way to add the year and build a new date label:
Bunker2019['Date'] = '2019_'+ Bunker2019['2019'].astype(str)
Bunker2020['Date'] = '2020_'+ Bunker2020['2020'].astype(str)
We concatenate and melt, then fix the order of the dates:
df = pd.concat([Bunker2019[['Date','Total','MGO']],Bunker2020[['Date','Total','MGO']]])
df = df.melt(id_vars='Date')
df['Date'] = pd.Categorical(df['Date'],categories=df['Date'].unique(),ordered=True)
So now it is in long format, containing information for both 2019 and 2020:
Date variable value
0 2019_Jan Total 0.187751
1 2019_Feb Total 0.091374
2 2019_Mar Total 0.929739
3 2019_Apr Total 0.621981
4 2019_May Total 0.371236
5 2019_Jun Total 0.027078
6 2019_Jul Total 0.719046
7 2019_Aug Total 0.138531
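To see what the melt step and the ordered categorical do in isolation, here is a minimal sketch on a two-row frame. The frame `wide` and its values are made up for illustration; only `melt` and `pd.Categorical` are taken from the code above.

```python
import pandas as pd

# Hypothetical two-month sample in the same wide layout as above
wide = pd.DataFrame({
    'Date': ['2019_Jan', '2019_Feb'],
    'Total': [1.0, 2.0],
    'MGO': [0.5, 0.7],
})

# melt stacks the value columns: one row per (Date, variable) pair
long = wide.melt(id_vars='Date')

# An ordered categorical keeps the dates in order of appearance;
# plain strings would sort alphabetically (2019_Feb before 2019_Jan)
long['Date'] = pd.Categorical(long['Date'],
                              categories=long['Date'].unique(),
                              ordered=True)
print(long)
```

The `variable` column is what `hue='variable'` picks up when plotting.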
Now to plot:
plt.figure(figsize=(12,5))
ax = sns.lineplot(data=df,x='Date',y='value',hue='variable')
sns.scatterplot(data=df,x='Date',y='value',hue='variable',ax=ax,legend=False)
plt.xticks(rotation=65, horizontalalignment='right')
plt.show()
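If the labels can be parsed, an alternative to the ordered categorical is a real datetime column, which is then plotted on a continuous time axis. This is only a sketch: it assumes English month abbreviations such as 'Jan' and labels in the 'YYYY_Mon' form; the names `labels` and `dates` are mine.

```python
import pandas as pd

# Labels in the same 'YYYY_Mon' form as the Date column (sample values)
labels = pd.Series(['2019_Jan', '2019_Feb', '2020_Jan'])

# %Y = four-digit year, %b = abbreviated month name (locale dependent)
dates = pd.to_datetime(labels, format='%Y_%b')
print(dates.dt.strftime('%Y-%m-%d').tolist())
```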
I created the source DataFrame as:
Month MGO MFO
0 2019-01 79.1 85.0
1 2019-02 69.9 91.2
2 2019-03 68.9 90.4
3 2019-04 71.1 87.0
4 2019-05 75.9 85.6
5 2019-06 60.9 82.1
6 2019-07 68.4 75.0
7 2019-08 75.8 60.7
8 2019-09 82.0 58.8
9 2019-10 95.3 56.6
10 2019-11 90.2 59.7
11 2019-12 86.5 57.7
12 2020-01 79.1 50.0
13 2020-02 88.9 52.2
14 2020-03 74.9 54.4
15 2020-04 87.1 51.0
16 2020-05 92.9 52.6
17 2020-06 105.9 53.1
(for now, the Month column is a string).
If you have 2 separate source DataFrames, concatenate them.
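A minimal sketch of that concatenation, assuming both frames share the same column names. The frames `df_2019` and `df_2020` are hypothetical and contain only a few rows copied from the sample above.

```python
import pandas as pd

# Hypothetical per-year frames with identical columns
df_2019 = pd.DataFrame({'Month': ['2019-11', '2019-12'],
                        'MGO': [90.2, 86.5], 'MFO': [59.7, 57.7]})
df_2020 = pd.DataFrame({'Month': ['2020-01', '2020-02'],
                        'MGO': [79.1, 88.9], 'MFO': [50.0, 52.2]})

# Stack the rows; ignore_index=True renumbers the index from 0
df = pd.concat([df_2019, df_2020], ignore_index=True)
print(df)
```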
The first processing step is to convert the Month column to datetime
type and set it as the index:
df.Month = pd.to_datetime(df.Month)
df.set_index('Month', inplace=True)
The first, more straightforward way to create
the plot is:
df.plot(style='-x');
For my data sample I got:
The second possibility is to generate the picture with smoothed lines.
To do this, you can draw two plots on a single axes:
first - the smoothed line, from the resampled DataFrame, with
interpolation but without markers, as there are now many more points,
second - only the markers, taken from the original DataFrame,
both with the same list of colors.
The code to do it is (the 'quadratic' interpolation method requires SciPy):
fig, ax = plt.subplots()
color = ['blue', 'orange']
df.resample('D').interpolate('quadratic').plot(ax=ax, color=color)
df.plot(ax=ax, marker='x', linestyle='None', legend=False, color=color);
This time the result is:
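To see what the resampling step does to the data itself, here is a small sketch that uses linear interpolation instead, which works without SciPy. The three values are the first MGO entries from the sample above; the names `monthly` and `daily` are mine.

```python
import pandas as pd

# First three MGO values from the sample, indexed by month start
monthly = pd.Series(
    [79.1, 69.9, 68.9],
    index=pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01']),
    name='MGO',
)

# Upsample to one row per day, then fill the new rows by interpolation
daily = monthly.resample('D').interpolate('linear')

print(len(monthly), 'monthly points ->', len(daily), 'daily points')
```

The original monthly values stay where they were, so the markers drawn from the original DataFrame sit exactly on the interpolated line.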
I ran into a problem when trying to plot my dataset with a seaborn boxplot. My dataset comes grouped from the database, like this:
region age total
0 STC 2.0 11024
1 PHA 84.0 3904
2 OLK 55.0 12944
3 VYS 72.0 5592
4 PAK 86.0 2168
... ... ... ...
1460 KVK 62.0 4600
1461 MSK 41.0 26568
1462 LBK 13.0 6928
1463 JHC 18.0 8296
1464 HKK 88.0 2408
And I would like to create a box plot with the region on the x-axis and age on the y-axis, weighted by the total number of observations.
When I try ax = sns.boxplot(x='region', y='age', data=df), I get a simple boxplot that doesn't take the total column into account. One brute-force option is to repeat each row 'total' times, but I don't like this solution.
sns.histplot and sns.kdeplot support a weights= parameter, but sns.boxplot doesn't. Repeating values isn't necessarily a bad solution, but in this case the numbers are very large. You could create a new dataframe with repeated data, but divide the 'total' column by a constant to keep the row count manageable.
In the sample data every region is different, which makes a boxplot rather strange. The code below assumes there aren't too many regions (1400 regions certainly wouldn't work well).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from io import StringIO
df_str = ''' region age total
STC 2.0 11024
STC 84.0 3904
STC 55.0 12944
STC 72.0 5592
STC 86.0 2168
PHA 62.0 4600
PHA 41.0 26568
PHA 13.0 6928
PHA 18.0 8296
PHA 88.0 2408'''
df = pd.read_csv(StringIO(df_str), sep=r'\s+')
# use a scaled down version of the totals as a repeat factor
repeats = df['total'].to_numpy(dtype=int) // 100
df_total = pd.DataFrame({'region': np.repeat(df['region'].values, repeats),
'age': np.repeat(df['age'].values, repeats)})
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 4))
sns.kdeplot(data=df, x='age', weights='total', hue='region', ax=ax1)
sns.boxplot(data=df_total, y='age', x='region', ax=ax2)
plt.tight_layout()
plt.show()
An alternative would be to do everything outside seaborn, using statsmodels.stats.weightstats.DescrStatsW to calculate the percentiles and plot the boxplots via matplotlib. Outliers would still have to be calculated separately. (See also this post)
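As a rough illustration of computing weighted box statistics without repeating rows, here is a plain NumPy sketch. It is not the statsmodels routine mentioned above: `weighted_quantile` is a hypothetical helper that uses the simple inverse-CDF definition (the smallest value whose cumulative weight reaches the requested fraction), so results may differ slightly from libraries that interpolate between points.

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight reaches q * total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    idx = np.searchsorted(cum, q * cum[-1], side='left')
    return values[min(idx, len(values) - 1)]

# Ages and totals of the five STC rows from the sample above
ages = [2.0, 84.0, 55.0, 72.0, 86.0]
totals = [11024, 3904, 12944, 5592, 2168]

q1, med, q3 = (weighted_quantile(ages, totals, q) for q in (0.25, 0.5, 0.75))
print(q1, med, q3)
```

Statistics computed this way could then be handed to matplotlib's Axes.bxp, which draws boxes from precomputed values; whiskers and outliers would still need their own calculation.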
I have these charts that I've created in Excel from dataframes with a structure like this:
so that the chart can be created like this, stacking the 5-Year Range area on top of the Min range (no fill) so that the range area is shaded. The min/max/range/avg columns are all calculated from 2016-2020.
I know that I can plot lines for multiple years on the same axis by using a date index and applying month labels, but is there a way to replicate the shading of this chart? More specifically, my dataframes are in a simple date index/value format, like so:
Quantity
1/1/2016 6
2/1/2016 4
3/1/2016 1
4/1/2016 10
5/1/2016 7
6/1/2016 10
7/1/2016 10
8/1/2016 2
9/1/2016 1
10/1/2016 2
11/1/2016 3
… …
1/1/2020 4
2/1/2020 8
3/1/2020 3
4/1/2020 5
5/1/2020 8
6/1/2020 6
7/1/2020 6
8/1/2020 7
9/1/2020 8
10/1/2020 5
11/1/2020 4
12/1/2020 3
1/1/2021 9
2/1/2021 7
3/1/2021 7
I haven't been able to find anything similar in the plot libraries.
Two-step process:
1. restructure the DF so that years are columns and rows are indexed by a uniform date time
2. plot using matplotlib
import datetime as dt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# straight date as index, quantity as column
d = pd.date_range("1-Jan-2016", "1-Mar-2021", freq="MS")
df = pd.DataFrame({"Quantity":np.random.randint(1, 10, len(d))}, index=d)
# re-structure as multi-index, make year column
# add calculated columns
dfg = (df.set_index(pd.MultiIndex.from_arrays([df.index.map(lambda d: dt.date(dt.date.today().year, d.month, d.day)),
df.index.year], names=["month","year"]))
.unstack("year")
.droplevel(0, axis=1)
.assign(min=lambda dfa: dfa.loc[:,[c for c in dfa.columns if dfa[c].count()==12]].min(axis=1),
max=lambda dfa: dfa.loc[:,[c for c in dfa.columns if dfa[c].count()==12]].max(axis=1),
avg=lambda dfa: dfa.loc[:,[c for c in dfa.columns if dfa[c].count()==12]].mean(axis=1).round(1),
)
)
fig, ax = plt.subplots(1, figsize=[14,4])
# now plot all the parts
ax.fill_between(dfg.index, dfg["min"], dfg["max"], label="5y range", facecolor="oldlace")
ax.plot(dfg.index, dfg[2020], label="2020", c="r")
ax.plot(dfg.index, dfg[2021], label="2021", c="g")
ax.plot(dfg.index, dfg.avg, label="5 yr avg", c="y", ls=(0,(1,2)), lw=3)
# adjust axis
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
ax.legend(loc = 'best')
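The restructuring step can also be written with pivot, which may be easier to follow than the MultiIndex chain. This is a sketch under assumptions, not the code above: it uses deterministic stand-in data, a plain month number (1 to 12) as the row index instead of dates, and the names `wide`, `complete` and `band` are mine.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in data: two complete years plus a partial third one
idx = pd.date_range("2019-01-01", "2021-03-01", freq="MS")
df = pd.DataFrame({"Quantity": np.arange(len(idx))}, index=idx)

# One row per month, one column per year
wide = (df.assign(year=df.index.year, month=df.index.month)
          .pivot(index="month", columns="year", values="Quantity"))

# Keep only the years that have all 12 months, then compute the band
complete = wide.dropna(axis=1)
band = pd.DataFrame({
    "min": complete.min(axis=1),
    "max": complete.max(axis=1),
    "avg": complete.mean(axis=1),
})
print(band.head(3))
```

band["min"] and band["max"] can be passed to ax.fill_between in the same way as dfg["min"] and dfg["max"], with band.index on the x-axis.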
date  name   amount
1     harry  100
1     joe    20
2     harry  50
3     joe    60
3     lee    25
4     lee    60
4     harry  200
4     joe    90
I was trying to share the 'date' axis (x-axis) across 432 person names. The image was too large to show.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dec=pd.read_csv('december.csv')
sns.lmplot(x='date', y='amount',
data= dec, fit_reg=False, hue='name', legend=True, palette='Set1')
This code gives one graph with 432 hues, but I want 432 graphs. How can I do that?
Using the same code you wrote, replace hue='name' with col='name' and it should give you the expected behavior:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dec = pd.DataFrame(
[
[1,'harry',100],
[1,'joe',20],
[2,'harry',50],
[3,'joe',60],
[3,'lee',25],
[4,'lee',60],
[4,'harry',200],
[4,'joe',90],
],
columns=['date','name','amount'],
)
sns.lmplot(
x='date',
y='amount',
data= dec,
fit_reg=False,
col='name',
legend=True,
palette='Set1',
)
If you want to wrap the plots onto several rows, set col_wrap (the number of plots per row):
sns.lmplot(
x='date',
y='amount',
data= dec,
fit_reg=False,
col='name',
col_wrap=1,
legend=True,
palette='Set1',
)
EDIT: using the groupby() method, you can easily get aggregates such as the number of dots per plot and the total amount per group.
The main idea is to group the records in the dec dataframe by name (as was implicitly done in the plot above).
Continuing from the code above, you can preview the groupby operation using the describe method:
dec.groupby('name').describe()
Out[2]:
date amount
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
name
harry 3.0 2.333333 1.527525 1.0 1.50 2.0 3.00 4.0 3.0 116.666667 76.376262 50.0 75.00 100.0 150.00 200.0
joe 3.0 2.666667 1.527525 1.0 2.00 3.0 3.50 4.0 3.0 56.666667 35.118846 20.0 40.00 60.0 75.00 90.0
lee 2.0 3.500000 0.707107 3.0 3.25 3.5 3.75 4.0 2.0 42.500000 24.748737 25.0 33.75 42.5 51.25 60.0
Using the pandas groupby method, we group records by 'name' and pick any column (here: 'amount') to get the count (as long as no values are missing, the count is the same for every column, since it counts the occurrences of each distinct 'name'):
counts = dec.groupby('name')['amount'].count()
counts
Out[3]:
name
harry 3
joe 3
lee 2
Name: amount, dtype: int64
To get the total amount, we do the same: we pick the 'amount' column and call the sum() method instead of the count() method:
total_amounts = dec.groupby('name')['amount'].sum()
total_amounts
Out[4]:
name
harry 350
joe 170
lee 85
Name: amount, dtype: int64
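Both aggregates can also be computed in a single call with named aggregation. A short sketch on the same sample data; the output column names `n_points` and `total` are my own choice.

```python
import pandas as pd

dec = pd.DataFrame(
    [[1, 'harry', 100], [1, 'joe', 20], [2, 'harry', 50], [3, 'joe', 60],
     [3, 'lee', 25], [4, 'lee', 60], [4, 'harry', 200], [4, 'joe', 90]],
    columns=['date', 'name', 'amount'],
)

# keyword = output column name, value = aggregation to apply to 'amount'
summary = dec.groupby('name')['amount'].agg(n_points='count', total='sum')
print(summary)
```

summary.loc[name, 'n_points'] and summary.loc[name, 'total'] then give the same numbers as the two separate series.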
We now have two series indexed by 'name' containing the information we want: counts and total_amounts.
We're gonna use these two series to build a title for each subplot:
plot = sns.lmplot(
x='date',
y='amount',
data=dec,
fit_reg=False,
col='name',
legend=True,
palette='Set1',
)
for name in plot.axes_dict:
    subplot_title = f'name = {name}, number of dots = {counts[name]}, total amount = {total_amounts[name]}'
    plot.axes_dict[name].set_title(subplot_title)
plot.fig
It prints:
I want to make a boxplot whose x-axis splits the x variable into ranges, e.g. 0-5, 5-10, 10+. Is there a way to do this efficiently in Matplotlib/Seaborn without having to create new columns of uneven length by subsetting? For the example dataset below, I want a boxplot with 3 boxes, 0-5 (1a4j,1a6u,1ahc), 5-10 (1brq,1bya), 10+ (1bya,1bbs), based on the rot_bonds variable:
structure rot_bonds no_atoms logP
0 1a4j 3 37 2.46
1 1a6u 4 17 1.58
2 1ahc 0 10 -0.06
3 1bbs 20 51 4.81
4 1brq 5 21 5.51
5 1bya 10 45 -9.75
Thanks in advance.
With seaborn you can pass the values binned into ranges as the x axis and, for example, 'no_atoms' as the y-values for the boxplot:
from matplotlib import pyplot as plt
from io import StringIO
import pandas as pd
import seaborn as sns
s = ''' structure rot_bonds no_atoms logP
0 1a4j 3 37 2.46
1 1a6u 4 17 1.58
2 1ahc 0 10 -0.06
3 1bbs 20 51 4.81
4 1brq 5 21 5.51
5 1bya 10 45 -9.75'''
df = pd.read_csv(StringIO(s), sep=r'\s+')
# include_lowest=True keeps rot_bonds == 0 in the first bin
sns.boxplot(x=pd.cut(df['rot_bonds'], [0, 5, 10, 1000], include_lowest=True), y='no_atoms', data=df)
plt.show()
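How pd.cut assigns the boundary values depends on which side of each interval is closed. By default the bins are closed on the right, so 5 lands in the first bin and 10 in the second. The sketch below shows the left-closed variant with readable labels; the label strings and the `range` column name are my own choice.

```python
import pandas as pd

df = pd.DataFrame({
    'structure': ['1a4j', '1a6u', '1ahc', '1bbs', '1brq', '1bya'],
    'rot_bonds': [3, 4, 0, 20, 5, 10],
})

# right=False gives the intervals [0, 5), [5, 10), [10, inf)
df['range'] = pd.cut(df['rot_bonds'],
                     bins=[0, 5, 10, float('inf')],
                     labels=['0-5', '5-10', '10+'],
                     right=False)
print(df)
```

The new column can be used directly as the x variable, e.g. sns.boxplot(x='range', y=..., data=df).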
I'm learning Pandas (watching these helpful videos) and currently playing around with a UFO sightings table:
import pandas as pd
ufo = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo.head()
ufo.Time = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year
ufo.head()
Now, I'd like to use Seaborn to make a violinplot with the states on the x-axis and the year on the y-axis, so that the plot shows the frequency density of sightings in any given year, in any given state.
If I use
ufo.State.value_counts()
I can get a Pandas Series of the counts for each state. But how do I separate this data by year? Do I somehow need to get the number of UFO sightings per year per state?
Am I on the right track to create a Seaborn violinplot? Or going in completely the wrong direction?
Following the example in the violinplot documentation:
ax = sns.violinplot(x="day", y="total_bill", data=tips)
You can assign your desired columns to the axes directly by passing the column names to the x= and y= parameters. The following output shows the structure of the tips data:
In [ ]: tips.head()
Out[ ]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
You want a violinplot with ufo.State on the x-axis and ufo.Year on the y-axis. I therefore believe that neither ufo.State.value_counts() nor groupby is needed, since the ufo data already has one row per sighting, which is the format violinplot expects.
You can achieve this by passing the ufo columns directly as x= and y=. See the code below:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
ufo = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo.Time = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year
ufo.head()
City Colors Reported Shape Reported State \
0 Ithaca NaN TRIANGLE NY
1 Willingboro NaN OTHER NJ
2 Holyoke NaN OVAL CO
3 Abilene NaN DISK KS
4 New York Worlds Fair NaN LIGHT NY
Time Year
0 1930-06-01 22:00:00 1930
1 1930-06-30 20:00:00 1930
2 1931-02-15 14:00:00 1931
3 1931-06-01 13:00:00 1931
4 1933-04-18 19:00:00 1933
fig, ax = plt.subplots(figsize=(12,8))
ax = sns.violinplot(x=ufo.State, y=ufo.Year)
# ax = sns.violinplot(x='State', y='Year', data=ufo)  # equivalent to the line above
plt.show()
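The violinplot does not need pre-aggregated counts, but if you do want the number of sightings per year per state as a table, groupby with size gives it. A sketch on a tiny made-up sample so that it runs without downloading the CSV; the rows in `sample` are hypothetical.

```python
import pandas as pd

# Hypothetical mini-sample with the same State/Time columns as the UFO data
sample = pd.DataFrame({
    'State': ['NY', 'NY', 'NJ', 'NY', 'NJ'],
    'Time': ['1930-06-01 22:00', '1930-06-30 20:00', '1931-02-15 14:00',
             '1931-06-01 13:00', '1931-04-18 19:00'],
})
sample['Year'] = pd.to_datetime(sample['Time']).dt.year

# Count rows per (State, Year), then spread the years into columns
counts = sample.groupby(['State', 'Year']).size().unstack(fill_value=0)
print(counts)
```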