date  name   amount
1     harry  100
1     joe    20
2     harry  50
3     joe    60
3     lee    25
4     lee    60
4     harry  200
4     joe    90
I was trying to share the 'date' axis (x-axis) across 432 person names. The image was too large to show.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dec=pd.read_csv('december.csv')
sns.lmplot(x='date', y='amount',
data= dec, fit_reg=False, hue='name', legend=True, palette='Set1')
This code gives one graph with 432 hues, but I want 432 graphs. How can I do that?
Use the same code you wrote, but pass col='name' instead of hue='name'; that should give you the expected behavior:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dec = pd.DataFrame(
[
[1,'harry',100],
[1,'joe',20],
[2,'harry',50],
[3,'joe',60],
[3,'lee',25],
[4,'lee',60],
[4,'harry',200],
[4,'joe',90],
],
columns=['date','name','amount'],
)
sns.lmplot(
x='date',
y='amount',
data= dec,
fit_reg=False,
col='name',
legend=True,
palette='Set1',
)
If you want to break the rows, you can define a column wrapper with col_wrap (number of plots per row):
sns.lmplot(
x='date',
y='amount',
data= dec,
fit_reg=False,
col='name',
col_wrap=1,
legend=True,
palette='Set1',
)
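With hundreds of names, one panel per row (or per column) becomes unwieldy, so a larger col_wrap combined with a small height may be more practical. The following is only a sketch on made-up data: the 12 names, the random amounts, and the chosen col_wrap, height and aspect values are all assumptions for illustration. As far as I know, lmplot shares the x-axis across panels by default.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: 12 hypothetical names with 5 dates each
rng = np.random.default_rng(0)
names = [f"person_{i:02d}" for i in range(12)]
demo = pd.DataFrame({
    "date": np.tile(np.arange(1, 6), len(names)),
    "name": np.repeat(names, 5),
    "amount": rng.integers(10, 200, size=5 * len(names)),
})

grid = sns.lmplot(
    x="date",
    y="amount",
    data=demo,
    fit_reg=False,
    col="name",
    col_wrap=4,   # 4 panels per row, so 12 names give 3 rows
    height=2,     # small panels keep the overall figure size manageable
    aspect=1.2,
)
n_panels = len(grid.axes_dict)  # one Axes per distinct name
```

For 432 names the same call would produce 108 rows of 4 panels, so splitting the names into batches and drawing one figure per batch may still be necessary.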
EDIT: using the groupby() method, you can easily get aggregates such as number of dots per plot and total amount per group.
The main idea is to group the records in the dec dataframe by name (as was implicitly done in the plot above).
Continuing on the code above, you can have a preview of the groupby operation using the describe method:
dec.groupby('name').describe()
Out[2]:
date amount
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
name
harry 3.0 2.333333 1.527525 1.0 1.50 2.0 3.00 4.0 3.0 116.666667 76.376262 50.0 75.00 100.0 150.00 200.0
joe 3.0 2.666667 1.527525 1.0 2.00 3.0 3.50 4.0 3.0 56.666667 35.118846 20.0 40.00 60.0 75.00 90.0
lee 2.0 3.500000 0.707107 3.0 3.25 3.5 3.75 4.0 2.0 42.500000 24.748737 25.0 33.75 42.5 51.25 60.0
Using the pandas groupby method, we group records by 'name' and pick any column (here: 'amount') to get the count (the count is the same for every column, as it counts the occurrences of each distinct 'name', provided there are no missing values):
counts = dec.groupby('name')['amount'].count()
counts
Out[3]:
name
harry 3
joe 3
lee 2
Name: amount, dtype: int64
To get the total amount, we do the same, we pick the 'amount' column and call the sum() method instead of the count() method:
total_amounts = dec.groupby('name')['amount'].sum()
total_amounts
Out[4]:
name
harry 350
joe 170
lee 85
Name: amount, dtype: int64
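As an aside, both aggregates can also be computed in a single groupby pass with agg. This is a small self-contained sketch that rebuilds the sample data; the variable name summary is just chosen here for illustration.

```python
import pandas as pd

# Same sample data as in the answer above
dec = pd.DataFrame(
    [
        [1, 'harry', 100],
        [1, 'joe', 20],
        [2, 'harry', 50],
        [3, 'joe', 60],
        [3, 'lee', 25],
        [4, 'lee', 60],
        [4, 'harry', 200],
        [4, 'joe', 90],
    ],
    columns=['date', 'name', 'amount'],
)

# One row per name, with one column per aggregate
summary = dec.groupby('name')['amount'].agg(['count', 'sum'])
print(summary)
```

The result is a dataframe indexed by 'name' with a 'count' and a 'sum' column, so summary.loc['harry', 'sum'] gives the same 350 as the separate series.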
We now have two series indexed by 'name' containing the information we want: counts and total_amounts.
We're gonna use these two series to build a title for each subplot:
plot = sns.lmplot(
x='date',
y='amount',
data=dec,
fit_reg=False,
col='name',
legend=True,
palette='Set1',
)
# axes_dict maps each value of the col variable to its subplot
for name, ax in plot.axes_dict.items():
    subplot_title = f'name = {name}, number of dots = {counts[name]}, total amount = {total_amounts[name]}'
    ax.set_title(subplot_title)
plot.fig
It prints:
The data frame looks like:
import pandas as pd
import numpy as np # used for the nan values
data = {'card_name': ['Diamonds', 'Clovers', 'HorseShoe'], '$20': [1000.0, 10.0, np.nan], '$25': [500.0, np.nan, 1873.0], '$30': [25, 213, 4657], '$40': [np.nan, 2199.0, np.nan], '$50': [1500.0, np.nan, 344.0], '$70': [np.nan, 43.0, 239.0], '$75': [30.0, 2.0, np.nan], '$100': [1.0, np.nan, 748.0]}
df = pd.DataFrame(data)
card_name $20 $25 $30 $40 $50 $70 $75 $100
0 Diamonds 1000 500 25 NaN 1500 NaN 30 1
1 Clovers 10 NaN 213 2199 NaN 43 2 NaN
2 HorseShoe NaN 1873 4657 NaN 344 239 NaN 748
The figure under the dollar signed column names is how many prizes there are on the corresponding card_name.
I'm trying to graph each card_name and show how many prizes there are for all the columns.
I'm using Python and Pandas with Matplotlib/Seaborn
The shape of the required dataframe depends on which plot API is being used to plot.
pandas and seaborn are both dependent upon matplotlib, but require a different shape to get the same result.
pandas
Set 'card_name' as the index, and then transpose the dataframe with .T.
Plot the dataframe directly with pandas.DataFrame.plot and kind='bar'. The index is plotted as the x-axis.
# set the index and transpose
dft = df.set_index('card_name').T
# display(dft)
card_name Diamonds Clovers HorseShoe
$20 1000.0 10.0 NaN
$25 500.0 NaN 1873.0
$30 25.0 213.0 4657.0
$40 NaN 2199.0 NaN
$50 1500.0 NaN 344.0
$70 NaN 43.0 239.0
$75 30.0 2.0 NaN
$100 1.0 NaN 748.0
# plot
dft.plot(kind='bar', rot=0)
seaborn
Convert the dataframe from a wide to long format using pandas.DataFrame.melt
Plot the data with seaborn.barplot, or with seaborn.catplot and kind='bar', then use hue= to specify the column to color by.
# convert the dataframe to long format
dfm = df.melt(id_vars='card_name')
# display(dfm.head())
card_name variable value
0 Diamonds $20 1000.0
1 Clovers $20 10.0
2 HorseShoe $20 NaN
3 Diamonds $25 500.0
4 Clovers $25 NaN
ax = sns.barplot(data=dfm, x='variable', y='value', hue='card_name')
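The default 'variable' and 'value' column names produced by melt can be replaced with more descriptive ones via var_name and value_name. The sketch below uses only two of the price columns to stay short; the names 'prize' and 'n_prizes' are my own choice, and dropping the NaN rows is optional.

```python
import numpy as np
import pandas as pd

# Reduced version of the sample frame (two price columns only)
df = pd.DataFrame({
    'card_name': ['Diamonds', 'Clovers', 'HorseShoe'],
    '$20': [1000.0, 10.0, np.nan],
    '$25': [500.0, np.nan, 1873.0],
})

# wide -> long, with descriptive names for the melted columns
long_df = df.melt(id_vars='card_name', var_name='prize', value_name='n_prizes')

# optionally drop combinations that have no prizes recorded
long_df = long_df.dropna(subset=['n_prizes'])
print(long_df)
```

With these names the plotting call would use x='prize' and y='n_prizes' instead of x='variable' and y='value'.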
subplots
pandas
add the parameter subplots=True
# using the previously transformed dataframe dft
axes = dft.plot(kind='bar', rot=0, subplots=True, figsize=(6, 10))
seaborn
It's easier to use .catplot to get subplots by specifying the row= and/or col= parameter.
# using the previously transformed dataframe dfm
p = sns.catplot(kind='bar', data=dfm, x='variable', y='value', row='card_name', height=3, aspect=1.5)
I run into a problem when trying to plot my dataset with a seaborn boxplot. I've got a dataset received grouped from database like:
region age total
0 STC 2.0 11024
1 PHA 84.0 3904
2 OLK 55.0 12944
3 VYS 72.0 5592
4 PAK 86.0 2168
... ... ... ...
1460 KVK 62.0 4600
1461 MSK 41.0 26568
1462 LBK 13.0 6928
1463 JHC 18.0 8296
1464 HKK 88.0 2408
And I would like to create a box plot with the region on an x-scale, age on a y-scale, based on the total number of observations.
When I try ax = sns.boxplot(x='region', y='age', data=df), I get a simple boxplot that doesn't take the total column into account. One hard-coded option is to repeat each row by its total, but I don't like this solution.
sns.histplot and sns.kdeplot support a weights= parameter, but sns.boxplot doesn't. Simply repeating values isn't necessarily a bad solution, but in this case the numbers are very large. You could create a new dataframe with repeated data, but divide the 'total' column to make the values manageable.
The sample data have all different regions, which makes creating a boxplot rather strange. The code below supposes there aren't too many regions (1400 regions certainly wouldn't work well).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from io import StringIO
df_str = ''' region age total
STC 2.0 11024
STC 84.0 3904
STC 55.0 12944
STC 72.0 5592
STC 86.0 2168
PHA 62.0 4600
PHA 41.0 26568
PHA 13.0 6928
PHA 18.0 8296
PHA 88.0 2408'''
# sep=r'\s+' splits on whitespace (delim_whitespace=True is deprecated in recent pandas)
df = pd.read_csv(StringIO(df_str), sep=r'\s+')
# use a scaled down version of the totals as a repeat factor
repeats = df['total'].to_numpy(dtype=int) // 100
df_total = pd.DataFrame({'region': np.repeat(df['region'].values, repeats),
'age': np.repeat(df['age'].values, repeats)})
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 4))
sns.kdeplot(data=df, x='age', weights='total', hue='region', ax=ax1)
sns.boxplot(data=df_total, y='age', x='region', ax=ax2)
plt.tight_layout()
plt.show()
An alternative would be to do everything outside seaborn, using statsmodels.stats.weightstats.DescrStatsW to calculate the percentiles and plot the boxplots via matplotlib. Outliers would still have to be calculated separately. (See also this post)
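A rough sketch of that "outside seaborn" route, using plain NumPy instead of statsmodels: the helper weighted_quantile is my own (hypothetical) function, it uses the simple inverse-CDF definition (so results can differ slightly from interpolated quantiles on repeated data), and the whiskers are simply set to the minimum and maximum age rather than the usual 1.5 IQR rule. matplotlib's Axes.bxp draws boxes from precomputed statistics.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def weighted_quantile(values, weights, q):
    """Inverse-CDF weighted quantile: the smallest value whose
    cumulative share of the total weight reaches q."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum_share = np.cumsum(weights) / weights.sum()
    idx = np.searchsorted(cum_share, q, side="left")
    return values[np.minimum(idx, len(values) - 1)]

# Same sample data as in the answer above
df = pd.DataFrame({
    "region": ["STC"] * 5 + ["PHA"] * 5,
    "age": [2.0, 84.0, 55.0, 72.0, 86.0, 62.0, 41.0, 13.0, 18.0, 88.0],
    "total": [11024, 3904, 12944, 5592, 2168, 4600, 26568, 6928, 8296, 2408],
})

# One dict of box statistics per region, in the format Axes.bxp expects
stats = []
for region, grp in df.groupby("region", sort=False):
    q1, med, q3 = weighted_quantile(grp["age"], grp["total"], [0.25, 0.5, 0.75])
    stats.append({
        "label": region,
        "q1": q1, "med": med, "q3": q3,
        "whislo": grp["age"].min(),  # simplification: whiskers span the full range
        "whishi": grp["age"].max(),
        "fliers": [],                # outliers are not computed in this sketch
    })

fig, ax = plt.subplots()
artists = ax.bxp(stats, showfliers=False)
ax.set_xlabel("region")
ax.set_ylabel("age")
```

This avoids materialising the repeated rows entirely, at the cost of having to decide on a whisker and outlier rule yourself.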
I've imported seaborn and typed in this:
Bunker2019_Jan_to_Jun.plot(x='2019', y='Total')
Bunker2019_Jan_to_Jun.plot(x='2019', y='MGO')
and it shows two graphs. Is there any way I can show the year 2019(Jan to Dec) and 2020(Jan to Jun)?
If you want them on the same plot, you need to combine the data frames. I'm not sure what your "2019" column is (date or string?), so below I create data frames that should resemble yours:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
mths = pd.date_range(start='1/1/2019', periods=12,freq="M").strftime("%b").to_list()
Bunker2019 = pd.DataFrame({'2019':mths,'Total':np.random.uniform(0,1,12),
'MGO':np.random.uniform(0,1,12)})
Bunker2020 = pd.DataFrame({'2020':mths,'Total':np.random.uniform(0,1,12),
'MGO':np.random.uniform(0,1,12)})
Simple way to add the year to create a new date:
Bunker2019['Date'] = '2019_'+ Bunker2019['2019'].astype(str)
Bunker2020['Date'] = '2020_'+ Bunker2020['2020'].astype(str)
We concat and melt, setting an order:
df = pd.concat([Bunker2019[['Date','Total','MGO']],Bunker2020[['Date','Total','MGO']]])
df = df.melt(id_vars='Date')
df['Date'] = pd.Categorical(df['Date'],categories=df['Date'].unique(),ordered=True)
So now it is a long format, containing information for both 2020 and 2019:
Date variable value
0 2019_Jan Total 0.187751
1 2019_Feb Total 0.091374
2 2019_Mar Total 0.929739
3 2019_Apr Total 0.621981
4 2019_May Total 0.371236
5 2019_Jun Total 0.027078
6 2019_Jul Total 0.719046
7 2019_Aug Total 0.138531
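The ordered categorical keeps the labels in insertion order. If real datetimes are preferred (for example to get a proper time axis), the 'Date' strings could instead be parsed with pd.to_datetime. This is a sketch assuming the 'YYYY_Mon' label format built above and English month abbreviations; the labels series here is a stand-in for the 'Date' column.

```python
import pandas as pd

# Stand-in for the 'Date' column built above
labels = pd.Series(['2019_Jan', '2019_Feb', '2020_Jan', '2020_Jun'])

# %Y = four-digit year, %b = abbreviated month name (locale dependent)
dates = pd.to_datetime(labels, format='%Y_%b')
print(dates)
```

Each label becomes the first day of the corresponding month, so the values sort chronologically without needing an explicit category order.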
Now to plot:
plt.figure(figsize=(12,5))
ax = sns.lineplot(data=df,x='Date',y='value',hue='variable')
sns.scatterplot(data=df,x='Date',y='value',hue='variable',ax=ax,legend=False)
plt.xticks(rotation=65, horizontalalignment='right')
plt.show()
I created the source DataFrame as:
Month MGO MFO
0 2019-01 79.1 85.0
1 2019-02 69.9 91.2
2 2019-03 68.9 90.4
3 2019-04 71.1 87.0
4 2019-05 75.9 85.6
5 2019-06 60.9 82.1
6 2019-07 68.4 75.0
7 2019-08 75.8 60.7
8 2019-09 82.0 58.8
9 2019-10 95.3 56.6
10 2019-11 90.2 59.7
11 2019-12 86.5 57.7
12 2020-01 79.1 50.0
13 2020-02 88.9 52.2
14 2020-03 74.9 54.4
15 2020-04 87.1 51.0
16 2020-05 92.9 52.6
17 2020-06 105.9 53.1
(for now Month column as string).
If you have 2 separate source DataFrames, concatenate them.
The first processing step is to convert Month column to datetime
type and set it as the index:
df.Month = pd.to_datetime(df.Month)
df.set_index('Month', inplace=True)
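For the case of two separate source DataFrames mentioned above, the concatenation step could look like the following sketch. The names df_2019 and df_2020 and the shortened data are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical per-year frames with the same columns (shortened data)
df_2019 = pd.DataFrame({'Month': ['2019-11', '2019-12'],
                        'MGO': [90.2, 86.5], 'MFO': [59.7, 57.7]})
df_2020 = pd.DataFrame({'Month': ['2020-01', '2020-02'],
                        'MGO': [79.1, 88.9], 'MFO': [50.0, 52.2]})

# Stack the rows, then convert Month and use it as the index
df = pd.concat([df_2019, df_2020], ignore_index=True)
df['Month'] = pd.to_datetime(df['Month'], format='%Y-%m')
df = df.set_index('Month')
print(df)
```

The resulting frame has a single DatetimeIndex running from 2019 into 2020, which is what the plotting code relies on.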
The first, more straightforward possibility to create
the drawing is:
df.plot(style='-x');
For my data sample I got:
The second possibility is to generate the picture with smoothed lines.
To do this, you can draw two plots on a single Axes:
first - the smoothed line, from the resampled DataFrame, with
interpolation, but without markers, as there are now many more points,
second - only the markers, taken from the original DataFrame,
both with the same list of colors.
The code to do it is:
fig, ax = plt.subplots()
color = ['blue', 'orange']
df.resample('D').interpolate('quadratic').plot(ax=ax, color=color)
df.plot(ax=ax, marker='x', linestyle='None', legend=False, color=color);
This time the result is:
I wish to create a (2x3) stacked barchart subplot from results using a groupby.size command, let me explain. I have a list of dataframes: list_df = [df_2011, df_2012, df_2013, df_2014, df_2015, df_2016]. A small example of these df's would be:
... Create Time Location Area Id Beat Priority ... Closed Time
2011-01-01 00:00:00 ST&SAN PABLO AV 1.0 06X 1.0 ... 2011-01-01 00:28:17
2011-01-01 00:01:11 ST&HANNAH ST 1.0 07X 1.0 ... 2011-01-01 01:12:56
.
.
.
(can only add a few columns as the layout messes up)
I'm using a groupby.size command to get a required count of events for these databases, see below:
list_df = [df_2011, df_2012, df_2013, df_2014, df_2015, df_2016]
for i in list_df:
    print(i.groupby(['Beat', 'Priority']).size())
    print(' ')
Producing:
Beat Priority
01X 1.0 394
2.0 1816
02X 1.0 644
2.0 1970
02Y 1.0 661
2.0 2309
03X 1.0 857
2.0 2962
.
.
.
I wish to identify the top 10 beats by TOTAL count, summed over both priorities. For example, the totals above are:
Beat Priority Total for Beat
01X 1.0 394
2.0 1816 2210
02Y 1.0 661
2.0 2309 2970
03X 1.0 857
2.0 2962 3819
.
.
.
So far I have used plot over my groupby.size but it hasn't done the collective total as I described above. Check out below:
list_df = [df_2011, df_2012, df_2013, df_2014, df_2015, df_2016]
fig, axes = plt.subplots(2, 3)
for d, i in zip(list_df, range(6)):
    ax = axes.ravel()[i]
    d.groupby(['Beat', 'Priority']).size().nlargest(10).plot(ax=ax, kind='bar', figsize=(15, 7), stacked=True, legend=True)
    ax.set_title(f"Top 10 Beats for {i + 2011}")
plt.tight_layout()
I wish to have the 2x3 subplot layout, but with stacked barcharts like this one I have done previously:
Thanks in advance. This has been harder than I thought it would be!
The data series need to be the columns, so you probably want
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# create fake input data
ncols = 300
list_df = [pd.DataFrame({'Beat': np.random.choice(['{:02d}X'.format(i) for i in range(15)], ncols),
'Priority': np.random.choice(['1', '2'], ncols),
'othercolumn1': range(ncols),
'othercol2': range(ncols),
'year': [yr] * ncols}) for yr in range(2011, 2017)]
In [22]: print(list_df[0].head(5))
Beat Priority othercolumn1 othercol2 year
0 06X 1 0 0 2011
1 05X 1 1 1 2011
2 04X 1 2 2 2011
3 01X 2 3 3 2011
4 00X 1 4 4 2011
fig, axes = plt.subplots(2, 3)
for i, d in enumerate(list_df):
    ax = axes.flatten()[i]
    # one row per Beat, one column per Priority, values = number of events
    dplot = d[['Beat', 'Priority']].pivot_table(index='Beat', columns='Priority', aggfunc=len)
    # keep the 10 beats with the largest total over all priorities
    dplot = (dplot.assign(total=lambda x: x.sum(axis=1))
                  .sort_values('total', ascending=False)
                  .head(10)
                  .drop('total', axis=1))
    dplot.plot.bar(ax=ax, figsize=(15, 7), stacked=True, legend=True)
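The same table can also be reached from the groupby(...).size() result used in the question, by unstacking 'Priority' into columns. This is a sketch on a tiny made-up frame (the calls data and the top-2 cut-off are assumptions; the real case would use 10):

```python
import pandas as pd

# Tiny made-up stand-in for one of the yearly dataframes
calls = pd.DataFrame({
    'Beat': ['01X', '01X', '01X', '02X', '02X', '03X'],
    'Priority': [1.0, 2.0, 2.0, 1.0, 2.0, 2.0],
})

# Beat as rows, Priority as columns; missing combinations become 0
counts = calls.groupby(['Beat', 'Priority']).size().unstack(fill_value=0)

# Rank beats by their total over all priorities and keep the top ones
top_beats = counts.sum(axis=1).nlargest(2).index
top = counts.loc[top_beats]
print(top)

# top.plot.bar(stacked=True) would then draw one stacked bar per beat
```

Using fill_value=0 avoids NaN entries for beats that have no events at one of the priority levels.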
I'm learning Pandas, (watching these helpful videos) and currently playing around with a UFO sighting table
import pandas as pd
ufo = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo.head()
ufo.Time = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year
ufo.head()
Now, I'd like to use Seaborn to make a violinplot of each state (on the x-axis) and the year (on the y-axis). Hence the plot shows the frequency density of sightings at any given year, in any given state.
If I use
ufo.State.value_counts()
I can get a Pandas Series of all the counts for each state. But how do I separate this data by year? I somehow need to get data with the ufo sightings per year per state?
Am I on the right track to create a Seaborn violinplot? Or going in completely the wrong direction?
The violinplot documentation gives the following example:
ax = sns.violinplot(x="day", y="total_bill", data=tips)
You can directly assign your desired columns into x-axis by supplying the column name into x= and y-axis to the y= parameter. The following code shows the data structure of tips variable.
In [ ]: tips.head()
Out[ ]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Your question is how to draw a violin plot with ufo.State on the x-axis and ufo.Year on the y-axis. I therefore believe neither ufo.State.value_counts() nor groupby is needed, since the ufo data already has one row per sighting, which matches the format violinplot's parameters expect.
You can achieve it by passing the ufo columns directly to x= and y=. See the code below:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
ufo = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
# convert Time to datetime and derive Year before inspecting the frame
ufo.Time = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year
ufo.head()
City Colors Reported Shape Reported State \
0 Ithaca NaN TRIANGLE NY
1 Willingboro NaN OTHER NJ
2 Holyoke NaN OVAL CO
3 Abilene NaN DISK KS
4 New York Worlds Fair NaN LIGHT NY
Time Year
0 1930-06-01 22:00:00 1930
1 1930-06-30 20:00:00 1930
2 1931-02-15 14:00:00 1931
3 1931-06-01 13:00:00 1931
4 1933-04-18 19:00:00 1933
fig, ax = plt.subplots(figsize=(12,8))
ax = sns.violinplot(x=ufo.State, y=ufo.Year)
# ax = sns.violinplot(x='State', y='Year', data=ufo) # equivalent to the line above
plt.show()
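Although the violin plot does not need pre-aggregated data, the "sightings per year per state" table the question asks about can be obtained with groupby and size. The sketch below uses a tiny made-up frame (ufo_demo) rather than downloading the real CSV, so the values are purely illustrative.

```python
import pandas as pd

# Tiny made-up stand-in for the ufo dataframe
ufo_demo = pd.DataFrame({
    'State': ['NY', 'NY', 'NJ', 'NY', 'NJ'],
    'Time': ['1930-06-01 22:00:00', '1930-06-30 20:00:00',
             '1931-02-15 14:00:00', '1931-06-01 13:00:00',
             '1931-04-18 19:00:00'],
})
ufo_demo['Time'] = pd.to_datetime(ufo_demo['Time'])
ufo_demo['Year'] = ufo_demo['Time'].dt.year

# Number of sightings for every (State, Year) pair, as a State x Year table
per_state_year = ufo_demo.groupby(['State', 'Year']).size().unstack(fill_value=0)
print(per_state_year)
```

Such a table suits a heatmap or line plot; the violin plot itself works from the raw rows.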