My data is in a dataframe of two columns: y and x. The data refers to the past few years. Dummy data is below:
import numpy as np
import pandas as pd

np.random.seed(167)
rng = pd.date_range('2017-04-03', periods=365*3)
df = pd.DataFrame(
    {"y": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365*3)]),
     "x": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365*3)])
    }, index=rng
)
In my first attempt, I plotted a scatterplot with Seaborn using the following code:
import seaborn as sns
import matplotlib.pyplot as plt
def plot_scatter(data, title, figsize):
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_title(title)
    sns.scatterplot(data=data,
                    x=data['x'],
                    y=data['y'])

plot_scatter(data=df, title='dummy title', figsize=(10,7))
However, I would like to generate a 4x3 matrix including 12 scatterplots, one for each month with year as hue. I thought I could create a third column in my dataframe that tells me the year and I tried the following:
import seaborn as sns
import matplotlib.pyplot as plt
def plot_scatter(data, title, figsize):
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_title(title)
    sns.scatterplot(data=data,
                    x=data['x'],
                    y=data['y'],
                    hue=data.iloc[:, 2])

df['year'] = df.index.year
plot_scatter(data=df, title='dummy title', figsize=(10,7))
While this allows me to see the years, it still shows all the data in the same scatterplot instead of creating multiple scatterplots, one for each month, so it's not offering the level of detail I need.
I could slice the data by month and build a for loop that plots one scatterplot per month but I actually want a matrix where all the scatterplots use similar axis scales. Does anyone know an efficient way to achieve that?
To create multiple subplots at once, seaborn introduces figure-level functions. The col= argument indicates which column of the dataframe should be used to identify the subplots. col_wrap= can be used to tell how many subplots go next to each other before starting an additional row.
Note that you shouldn't create a figure, as the function creates its own new figure. It uses the height= and aspect= arguments to tell the size of the individual subplots.
The code below uses a sns.relplot() on the months. An extra column for the months is created; it is made categorical to fix an order.
To remove the month= in the title, you can loop through the generated axes (a recent seaborn version is needed for axes_dict). With sns.set(font_scale=...) you can change the default sizes of all texts.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
np.random.seed(167)
dates = pd.date_range('2017-04-03', periods=365 * 3, freq='D')
df = pd.DataFrame({"y": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365 * 3)]),
"x": np.cumsum([np.random.uniform(-0.01, 0.01) for _ in range(365 * 3)])
}, index=dates)
df['year'] = df.index.year
# 'MS' (month start) is used because the 'M' alias is deprecated in recent pandas versions
month_names = pd.date_range('2017-01-01', periods=12, freq='MS').strftime('%B')
df['month'] = pd.Categorical.from_codes(df.index.month - 1, month_names)
sns.set(font_scale=1.7)
g = sns.relplot(kind='scatter', data=df, x='x', y='y', hue='year', col='month', col_wrap=4, height=4, aspect=1)
# optionally remove the `month=` in the title
for name, ax in g.axes_dict.items():
    ax.set_title(name)
plt.setp(g.axes, xlabel='', ylabel='') # remove all x and y labels
g.axes[-2].set_xlabel('x', loc='left') # set an x label at the left of the second to last subplot
g.axes[4].set_ylabel('y') # set a y label to 5th subplot
plt.subplots_adjust(left=0.06, bottom=0.06) # set some more spacing at the left and bottom
plt.show()
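Regarding the request for similar axis scales: as far as I know, sns.relplot() shares the x and y axes between the subplots by default, so no extra argument should be needed. The following is only a small sketch to check this, and to check that the categorical month column fixes the subplot order. Using calendar.month_name for the month names is my own substitution (it assumes the default English locale), and the Agg backend is only there so the sketch runs without a display.

```python
import calendar

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import numpy as np
import pandas as pd
import seaborn as sns

# Dummy data similar to the question (different random generator, so other values)
rng = np.random.default_rng(167)
dates = pd.date_range("2017-04-03", periods=365 * 3, freq="D")
df = pd.DataFrame(
    {
        "x": rng.uniform(-0.01, 0.01, len(dates)).cumsum(),
        "y": rng.uniform(-0.01, 0.01, len(dates)).cumsum(),
    },
    index=dates,
)
df["year"] = df.index.year

# calendar.month_name[0] is an empty string, so skip it
month_names = list(calendar.month_name)[1:]
# codes 0..11 map to January..December, even though the data starts in April
df["month"] = pd.Categorical.from_codes(df.index.month - 1, categories=month_names)

g = sns.relplot(data=df, x="x", y="y", hue="year", col="month", col_wrap=4, height=2)

# The subplots follow the category order, not the order of appearance in the data
facet_order = list(g.axes_dict)
# With shared axes, every subplot reports the same limits
x_limits = {ax.get_xlim() for ax in g.axes.flat}
y_limits = {ax.get_ylim() for ax in g.axes.flat}
print(facet_order[:3], len(x_limits), len(y_limits))
```

If independent scales are wanted instead, facet_kws={'sharex': False, 'sharey': False} can be passed to relplot.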
I'm a beginner in Python.
In my internship project I am trying to plot boxplots from data contained in a CSV file.
I need to plot boxplots for each of the 4 (four) variables shown above (AAG, DENS, SRG and RCG). Since each variable has values in the range from [001] to [100], there will be 100 boxplots for each variable, which need to be plotted in a single graph as shown in the image.
This is the graph I need to plot, but for each variable there will be 100 boxplots, as each one has 100 columns of values:
The x-axis is the "Year", which ranges from 2025 to 2030, so I need a graph like the one shown in figure 2 for each year, and the y-axis is the set of values for each variable.
Using the pandas melt function and the seaborn library I was able to plot only the boxplots of a single column, but that's not what I need:
import pandas as pd
import seaborn as sns
df = pd.read_csv("2DBM_50x50_Central_Aug21_Sim.cliped.csv")
mdf = df.melt(id_vars=['Year'], value_vars='AAG[001]')
print(mdf)
ax = sns.boxplot(x='Year', y='value', width=0.2, data=mdf)
Result of the code above:
What can I try to resolve this?
The following code gives you five subplots, where each subplot only contains the data of one variable. Then a boxplot is generated for each year. To change the range of columns used for each variable, change the upper limit in var_range = range(1, 101), and to see the outliers change showfliers to True.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df = pd.read_csv("2DBM_50x50_Central_Aug21_Sim.cliped.csv")
variables = ["AAG", "DENS", "SRG", "RCG", "Thick"]
period = range(2025, 2031)
var_range = range(1, 101)
fig, axes = plt.subplots(2, 3)
flattened_axes = fig.axes
flattened_axes[-1].set_visible(False)
for i, var in enumerate(variables):
    # collect all columns that belong to this variable
    var_columns = [f"TB_acc_{var}[{j:05}]" for j in var_range]
    # long format: one row per (period, column) pair
    data = df.melt(id_vars=["Period"], value_vars=var_columns, value_name=var)
    ax = flattened_axes[i]
    sns.boxplot(x="Period", y=var, width=0.2, data=data, ax=ax, showfliers=False)
plt.tight_layout()
plt.show()
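To see what melt does before handing the result to sns.boxplot, here is a small sketch on made-up data. The column names ("Year", "AAG[001]", ...) follow the question, not the real CSV, and only three value columns are generated to keep it short; all values are random placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical wide table: 4 rows per year, one column per numbered series
years = np.repeat(np.arange(2025, 2031), 4)
wide = pd.DataFrame({"Year": years})
for j in range(1, 4):  # the real data would use range(1, 101)
    wide[f"AAG[{j:03}]"] = rng.normal(size=len(years))

# Stack all AAG columns into a single value column, keeping the year of each row
value_columns = [c for c in wide.columns if c.startswith("AAG[")]
long = wide.melt(id_vars="Year", value_vars=value_columns,
                 var_name="column", value_name="AAG")

print(long.shape)
print(long.columns.tolist())
```

Each year then contributes rows from every AAG column, so sns.boxplot(x="Year", y="AAG", data=long) would draw one box per year over all of them.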
output:
In Pandas, I am doing:
bp = p_df.groupby('class').plot(kind='kde')
p_df is a dataframe object.
However, this is producing two plots, one for each class.
How do I force one plot with both classes in the same plot?
Version 1:
You can create your axis, and then use the ax keyword of DataFrameGroupBy.plot to add everything to these axes:
import matplotlib.pyplot as plt
import pandas as pd

p_df = pd.DataFrame({"class": [1,1,2,2,1], "a": [2,3,2,3,2]})
fig, ax = plt.subplots(figsize=(8,6))
bp = p_df.groupby('class').plot(kind='kde', ax=ax)
This is the result:
Unfortunately, the labeling of the legend does not make too much sense here.
Version 2:
Another way would be to loop through the groups and plot the curves manually:
classes = ["class 1"] * 5 + ["class 2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
p_df = pd.DataFrame({"class": classes, "vals": vals})
fig, ax = plt.subplots(figsize=(8,6))
for label, df in p_df.groupby('class'):
    df.vals.plot(kind="kde", ax=ax, label=label)
plt.legend()
This way you can easily control the legend. This is the result:
import matplotlib.pyplot as plt
p_df.groupby('class').plot(kind='kde', ax=plt.gca())
Another approach would be to use the seaborn module. It plots the two density estimates on the same axes without needing a variable to hold the axes (using some of the data frame setup from the other answer):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# data to create an example data frame
classes = ["c1"] * 5 + ["c2"] * 5
vals = [1,3,5,1,3] + [2,6,7,5,2]
# the data frame
df = pd.DataFrame({"cls": classes, "vals": vals})
# this is to plot the kde
sns.kdeplot(df.vals[df.cls == "c1"], label='c1')
sns.kdeplot(df.vals[df.cls == "c2"], label='c2')
plt.legend()  # show the labels; recent seaborn versions do not add the legend automatically
# beautifying the labels
plt.xlabel('value')
plt.ylabel('density')
plt.show()
This results in the following image.
There are two easy methods to plot each group in the same plot.
When using pandas.DataFrame.groupby, the column to be plotted (e.g. the aggregation column) should be specified.
Use seaborn.kdeplot or seaborn.displot and specify the hue parameter
Using pandas v1.2.4, matplotlib 3.4.2, seaborn 0.11.1
The OP is specific to plotting the kde, but the steps are the same for many plot types (e.g. kind='line', sns.lineplot, etc.).
Imports and Sample Data
For the sample data, the groups are in the 'kind' column, and the kde of 'duration' will be plotted, ignoring 'waiting'.
import pandas as pd
import seaborn as sns
df = sns.load_dataset('geyser')
# display(df.head())
duration waiting kind
0 3.600 79 long
1 1.800 54 short
2 3.333 74 long
3 2.283 62 short
4 4.533 85 long
Plot with pandas.DataFrame.plot
Reshape the data using .groupby or .pivot
.groupby
Specify the aggregation column, ['duration'], and kind='kde'.
ax = df.groupby('kind')['duration'].plot(kind='kde', legend=True)
.pivot
ax = df.pivot(columns='kind', values='duration').plot(kind='kde')
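To make clear why .pivot works here, the sketch below shows the reshaped frame for the five rows displayed above (typed in by hand instead of loading the dataset). Each group becomes its own column, with NaN where a row belongs to the other group; pandas then draws one curve per column.

```python
import pandas as pd

# The first five rows of the geyser sample shown above, entered manually
df = pd.DataFrame({
    "duration": [3.600, 1.800, 3.333, 2.283, 4.533],
    "waiting": [79, 54, 74, 62, 85],
    "kind": ["long", "short", "long", "short", "long"],
})

# One column per group; the original row index is kept
wide = df.pivot(columns="kind", values="duration")
print(wide)
print(wide.count().to_dict())  # number of non-missing values per group
```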
Plot with seaborn.kdeplot
Specify hue='kind'
ax = sns.kdeplot(data=df, x='duration', hue='kind')
Plot with seaborn.displot
Specify hue='kind' and kind='kde'
g = sns.displot(data=df, kind='kde', x='duration', hue='kind')  # displot returns a FacetGrid, not a figure
Plot
Maybe you can try this:
fig, ax = plt.subplots(figsize=(10,8))
classes = df['class'].unique()  # `class` is a Python keyword, so df.class is a syntax error
for c in classes:
    df2 = df.loc[df['class'] == c]
    df2['vals'].plot(kind="kde", ax=ax, label=c)
plt.legend()
I am trying to create a visualization of vehicles passing by in the first 25 weeks of the years 2015-2020 all in one graph (one curve for every year).
df_data_groups = df_data[(df_data['week']<=25)].groupby(['year','week'])
df_data_weekly = df_data_groups[['NO','nr_of_vehicles']].mean()
fig, ax = plt.subplots()
bp = df_data_weekly['nr_of_vehicles'].groupby('year').plot(ax=ax)
The following is what I get:
The x-axis is not right. It should not contain the year, only the weeks, but I don't know how to solve this correctly. I also can't create a legend showing which line color belongs to which year when I use:
bp.set_legend()
The index shown is the index of the last dataframe in the group. This dataframe has a 2-level index: the year and the week. Dropping the first level (the year) leaves only the week:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df_data = pd.DataFrame({'year': np.repeat(np.arange(2015, 2021), 52),
'week': np.tile(np.arange(1, 53), 6),
'nr_of_vehicles': 200_000 + np.random.randint(-9_000, 10_000, 52 * 6).cumsum()})
df_data_groups = df_data[(df_data['week'] <= 25)].groupby(['year', 'week'])
df_data_weekly = df_data_groups[['nr_of_vehicles']].mean()
fig, ax = plt.subplots()
for year, df in df_data_weekly['nr_of_vehicles'].groupby('year'):
    # drop the 'year' level so only the week remains on the x-axis
    df.reset_index(level=0, drop=True).plot(ax=ax, label=year)
ax.legend()
ax.margins(x=0.02)
plt.show()
PS: Note that in the question's code, bp is a pandas Series of axes, one entry per year. In this case, all of them point to the same ax. To obtain the legend, call it on one of the axes: bp[2015].legend() (or bp.iloc[0].legend()).
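As an alternative to the loop (not what the code above does), the year level can be moved into the columns with unstack, so that the week is the only index level left. This is just a sketch on the same kind of dummy data; the names weekly and per_year are mine.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_data = pd.DataFrame({
    "year": np.repeat(np.arange(2015, 2021), 52),
    "week": np.tile(np.arange(1, 53), 6),
    "nr_of_vehicles": 200_000 + rng.integers(-9_000, 10_000, 52 * 6).cumsum(),
})

# Mean per (year, week) for the first 25 weeks, as in the question
weekly = (
    df_data[df_data["week"] <= 25]
    .groupby(["year", "week"])["nr_of_vehicles"]
    .mean()
)

# Move the 'year' level into the columns: rows are weeks, columns are years
per_year = weekly.unstack("year")
print(per_year.shape)
print(per_year.columns.tolist())

# per_year.plot(ax=ax) would then draw one labelled line per year against the week
```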
I have a dataframe which stores the number of clients, predicted revenue, and actual revenue for a discrete set of products. I would like to plot a combo chart with number of clients on the first y axis as a bar plot, and both predicted and actual revenue plotted on the second y axis with the same scale.
I'm able to create a combo chart with a single secondary y axis using the following:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.DataFrame({
'product' : ['A','B','C','D'],
'number_of_clients' : [234,473,325,389],
'pred_turnover' : [1287,2311,5283,3211],
'act_turnover' : [1221,1927,5433,3888]})
df['number_of_clients'].plot.bar()
df['pred_turnover'].plot(secondary_y=True)
However, I am stuck on how to add a second variable to the secondary y axis using the same scale.
Here is what I would like to create as an end product:
I rewrote it using matplotlib calls directly instead of the df.plot format.
The key is ax1 = ax.twinx(), which creates a second y-axis that shares the same x-axis. The range between the first and last tick on the right axis is divided by the number of tick intervals on the left axis, so that both axes end up with evenly spaced ticks.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.DataFrame({
'product' : ['A','B','C','D'],
'number_of_clients' : [234,473,325,389],
'pred_turnover' : [1287,2311,5283,3211],
'act_turnover' : [1221,1927,5433,3888]})
fig, ax = plt.subplots(figsize=(8,4))
ax.bar(df['product'], df['number_of_clients'], label='number_of_clients')
ax1 = ax.twinx()
ax1.plot(df['product'], df['pred_turnover'], lw=2, color='orange', label='pred_turnover')
# use as many ticks on the right axis as on the left; linspace includes the last tick, which arange would drop
ax1.set_yticks(np.linspace(ax1.get_yticks()[0], ax1.get_yticks()[-1], len(ax.get_yticks())))
ax.grid(which='major', axis='y')
plt.show()
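The code above only draws pred_turnover. To put both turnover columns on the secondary axis with the same scale, it should be enough to plot both on the same twin axes, since every line drawn on one Axes shares its y scale. A minimal sketch of that idea (the colors, markers and axis labels are my own choices, and the Agg backend is only there so it runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'number_of_clients': [234, 473, 325, 389],
    'pred_turnover': [1287, 2311, 5283, 3211],
    'act_turnover': [1221, 1927, 5433, 3888]})

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(df['product'], df['number_of_clients'], color='lightgrey', label='number_of_clients')
ax.set_ylabel('number of clients')

# One twin axis holds both turnover lines, so they share a single y scale
ax2 = ax.twinx()
for column in ['pred_turnover', 'act_turnover']:
    ax2.plot(df['product'], df[column], marker='o', lw=2, label=column)
ax2.set_ylabel('turnover')

# Combine the legend entries of both axes into one legend
handles1, labels1 = ax.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(handles1 + handles2, labels1 + labels2, loc='upper left')

fig.tight_layout()
```

In a notebook or script with a display, plt.show() at the end would display the figure.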