Side-by-side boxplot of multiple columns of a pandas DataFrame


One year of sample data:
import pandas as pd
import numpy.random as rnd
import seaborn as sns
n = 365
df = pd.DataFrame(data={"A": rnd.randn(n), "B": rnd.randn(n) + 1},
                  index=pd.date_range(start="2017-01-01", periods=n, freq="D"))
I want to boxplot these data side-by-side grouped by the month (i.e., two boxes per month, one for A and one for B).
For a single column sns.boxplot(df.index.month, df["A"]) works fine. However, sns.boxplot(df.index.month, df[["A", "B"]]) throws an error (ValueError: cannot copy sequence with size 2 to array axis with dimension 365). Melting the data by the index (pd.melt(df, id_vars=df.index, value_vars=["A", "B"], var_name="column")) in order to use seaborn's hue property as a workaround doesn't work either (TypeError: unhashable type: 'DatetimeIndex').
(A solution doesn't necessarily need to use seaborn, if it is easier using plain matplotlib.)
Edit
I found a workaround that basically produces what I want. However, it becomes somewhat awkward to work with once the DataFrame includes more variables than I want to plot. So if there is a more elegant/direct way to do it, please share!
df_stacked = df.stack().reset_index()
df_stacked.columns = ["date", "vars", "vals"]
df_stacked.index = df_stacked["date"]
sns.boxplot(x=df_stacked.index.month, y="vals", hue="vars", data=df_stacked)

Here's a solution using pandas melt and seaborn:
import pandas as pd
import numpy.random as rnd
import seaborn as sns
n = 365
df = pd.DataFrame(data={"A": rnd.randn(n),
                        "B": rnd.randn(n) + 1,
                        "C": rnd.randn(n) + 10,  # will not be plotted
                        },
                  index=pd.date_range(start="2017-01-01", periods=n, freq="D"))
df['month'] = df.index.month
df_plot = df.melt(id_vars='month', value_vars=["A", "B"])
sns.boxplot(x='month', y='value', hue='variable', data=df_plot)
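To see what seaborn actually receives, it can help to inspect the melted frame on its own. This is only a sketch of the reshaping step (no plotting); the seeded generator is an assumption added so the output is reproducible, and 'variable' and 'value' are the default column names that melt produces.

```python
import numpy as np
import pandas as pd

n = 365
gen = np.random.default_rng(0)  # assumption: seeded only for reproducibility
df = pd.DataFrame(
    {
        "A": gen.standard_normal(n),
        "B": gen.standard_normal(n) + 1,
        "C": gen.standard_normal(n) + 10,  # extra column that is not plotted
    },
    index=pd.date_range(start="2017-01-01", periods=n, freq="D"),
)

# add the grouping key as an ordinary column, then go from wide to long form
df["month"] = df.index.month
df_plot = df.melt(id_vars="month", value_vars=["A", "B"])

print(df_plot.head())
# one row per (day, plotted column); "C" is dropped because it is not in value_vars
print(df_plot.groupby(["month", "variable"]).size().head())
```

Each row of df_plot is one observation tagged with its month and its source column, which is the long-form layout that the hue argument expects.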

import matplotlib.pyplot as plt
# collect one sub-DataFrame per calendar month
month_dfs = []
for group in df.groupby(df.index.month):
    month_dfs.append(group[1])
plt.figure(figsize=(30, 5))
# draw one box-plot panel per month, side by side
for i, month_df in enumerate(month_dfs):
    axi = plt.subplot(1, len(month_dfs), i + 1)
    month_df.plot(kind='box', subplots=False, ax=axi)
    plt.title(i + 1)
    plt.ylim([-4, 4])
plt.show()
This draws one panel per month, each with one box per column.
Not exactly what you're looking for, but you get to keep a readable DataFrame if you add more variables.
You can also easily hide the y-axis on all but the first panel by adding
    if i > 0:
        y_axis = axi.axes.get_yaxis()
        y_axis.set_visible(False)
inside the loop, before plt.show().

This is quite straight-forward using Altair:
import altair as alt
alt.Chart(
    df.reset_index().melt(id_vars=["index"], value_vars=["A", "B"]).assign(month=lambda x: x["index"].dt.month)
).mark_boxplot(
    extent='min-max'
).encode(
    alt.X('variable:N', title=''),
    alt.Y('value:Q'),
    column='month:N',
    color='variable:N'
)
The code above melts the DataFrame and adds a month column. Then Altair creates box-plots for each variable broken down by months as the plot columns.

Related

Python vs matplotlib - Chart generation issue

I have the Python code below, but as output it gives a chart like the one in the attachment, and it's really messy. Can anybody tell me how to fix the issue and put the days in ascending order on the x axis?
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("C/desktop/data.xlsx")
df = df.loc[df['month'] == 8]
df = df.astype({'day': str})
plt.plot( 'day', 'cases', data=df)
In the first instance, I didn't convert the day to str, so it came out like this.
Because the axis showed decimal numbers, I converted the day to str. Now this happens.

What you got is typical of an unsorted dataset with many points per group.
As you did not provide an example, here is one:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(0)
df = pd.DataFrame({'day': np.random.randint(1, 21, size=100),
                   'cases': np.random.randint(0, 50000, size=100),
                   })
plt.plot('day', 'cases', data=df)
There is no reason to plot a line in this case, you can use a scatter plot instead:
plt.scatter('day', 'cases', data=df)
To make more sense of your data, you can also compute an aggregated value per day (e.g. the mean):
plt.plot('day', 'cases', data=df.groupby('day', as_index=False)['cases'].mean())
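Since the question was about getting the days in ascending order, here is a small sketch of the data preparation only, using the same made-up data as above. It assumes 'day' is kept numeric rather than converted to str, so that sorting is numeric and not alphabetical.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    "day": np.random.randint(1, 21, size=100),
    "cases": np.random.randint(0, 50000, size=100),
})

# sort the raw points so that a line drawn through them moves left to right
df_sorted = df.sort_values("day")

# or reduce to one value per day; groupby returns the days in ascending order
daily_mean = df.groupby("day", as_index=False)["cases"].mean()

print(df_sorted.head())
print(daily_mean.head())
```

Either frame can then be passed as data= to plt.plot or plt.scatter in place of the unsorted df.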

how do I change the frequency while producing a bar plot

This is a follow-up to the original question I asked here:
Unable to change the tick frequency on my chart
The answer there works absolutely fine, but when my index starts from, say, 2100 (instead of 0 as in my original question), the graph looks incorrect.
How do I fix it?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,10,(90,1)),columns=['Values'])
df.index = np.arange(2100,2190,1)
df.plot(kind='bar', xticks=np.arange(2100,2190,5))

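With bar plots, the bars are drawn at integer positions 0, 1, ..., len(df) - 1, regardless of the index values; the index only supplies the tick labels. So you want: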
df = pd.DataFrame(np.random.randint(1,10,(90,1)),columns=['Values'])
df.index = np.arange(2100,2190,1)
ax = df.plot(kind='bar')
ax.set_xticks(np.arange(0,len(df),5))
ax.set_xticklabels(df.index[::5]);
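To make the position/label pairing explicit, here is a sketch that only computes the two arrays passed to set_xticks and set_xticklabels, without drawing anything. The names step, tick_positions and tick_labels are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 10, (90, 1)), columns=["Values"])
df.index = np.arange(2100, 2190, 1)

step = 5  # illustrative: show every 5th tick

# positions are counted in bars (0 .. len(df) - 1), not in index values
tick_positions = np.arange(0, len(df), step)
# labels are the index values sitting at those positions
tick_labels = df.index[::step]

for pos, label in list(zip(tick_positions, tick_labels))[:3]:
    print(pos, label)
```

Passing xticks=np.arange(2100, 2190, 5) asks for ticks at positions 2100 and up, far beyond the 90 bars, which is why the original plot looked wrong.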

Seaborn tsplot not showing data

I'm trying to use seaborn to make a simple tsplot, but for reasons that aren't clear to me nothing shows up when I run the code. Here's a minimal example:
import numpy as np
import seaborn as sns
import pandas as pd
df = pd.DataFrame({'value': np.random.rand(31), 'time': range(31)})
ax = sns.tsplot(data=df, value='value', time='time')
sns.plt.show()
Usually with tsplot you supply multiple data points for each time point, but does it just not work if you only supply one?
I know matplotlib can be used to do this pretty easily, but I wanted to use seaborn for some of its other functionality.

You are missing individual units. When using a data frame, the idea is that multiple time series, one per unit, have been recorded and can each be individually identified in the data frame. The error is then calculated across the different units.
So for one series only, you can get it working again like this:
df = pd.DataFrame({'value': np.random.rand(31), 'time': range(31)})
df['subject'] = 0
sns.tsplot(data=df, value='value', time='time', unit='subject')
Just to see how the error is computed, look at this example:
dfs = []
for i in range(10):
    df = pd.DataFrame({'value': np.random.rand(31), 'time': range(31)})
    df['subject'] = i
    dfs.append(df)
all_dfs = pd.concat(dfs)
sns.tsplot(data=all_dfs, value='value', time='time', unit='subject')
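To illustrate why the units matter, here is a pandas-only sketch that computes a per-time-point mean and spread across subjects. Note the assumption: it uses the plain standard error as a stand-in, which is not necessarily the interval seaborn itself draws. Also, as far as I know, tsplot is no longer available in recent seaborn releases, where lineplot is the replacement.

```python
import numpy as np
import pandas as pd

dfs = []
for i in range(10):
    d = pd.DataFrame({"value": np.random.rand(31), "time": range(31)})
    d["subject"] = i
    dfs.append(d)
all_dfs = pd.concat(dfs, ignore_index=True)

# ten subjects per time point: both a mean and a spread can be estimated
summary = all_dfs.groupby("time")["value"].agg(["mean", "sem"])

# a single subject: one observation per time point, so the spread is undefined
single = (
    all_dfs[all_dfs["subject"] == 0]
    .groupby("time")["value"]
    .agg(["mean", "sem"])
)

print(summary.head())
print(single.head())
```

With only one unit, the 'sem' column is NaN everywhere, which is the pandas analogue of there being no error band to draw.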

You can use set_index to make the time column the index, and then plot the resulting Series:
import matplotlib.pyplot as plt
df = pd.DataFrame({'value': np.random.rand(31), 'time': range(31)})
df = df.set_index('time')['value']
ax = sns.tsplot(data=df)
plt.show()

Annotate timeseries plot by merging two timeseries

Given I have two time series (or two columns in a data frame) like this:
rng = pd.date_range('1/1/2017', periods=3, freq='H')
ts1 = pd.Series(np.random.randn(len(rng)), index=rng)
ts2 = pd.Series(['HE','NOT','SHE'], index=rng)
I want to plot ts1 with ts1.plot(), using ts2 to annotate the ts1 time series. HOWEVER, I only want to annotate the timestamps whose label is not 'NOT'.
From what I have found so far, markers seem to be what I'm looking for: for example, one marker for 'HE', another for 'SHE', and no marker for 'NOT'. However, I can't figure out how to use another time series as input and annotate only the timestamps whose label differs from some value.

You can use the pandas dataframe groupby method to split the dataset by the labels you're using and just ignore the values you don't want to plot.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
rng = pd.date_range('1/1/2017', periods=3, freq='H')
ts1 = pd.Series(np.random.randn(len(rng)), index=rng)
ts2 = pd.Series(['HE','NOT','SHE'], index=rng)
df = pd.concat([ts1, ts2], keys=['foo', 'bar'], axis=1)
ax = None # trick to keep everything plotted on a single axis
labels = [] # keep track of the labels you actually use
for key, dat in df.groupby('bar'):
    if key == 'NOT':
        continue
    labels.append(key)
    ax = dat.plot(ax=ax, marker='s', ls='none', legend=False)
# handle the legend through matplotlib directly, rather than pandas' interface
ax.legend(ax.get_lines(), labels)
plt.show()
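If you would rather select the points first and decide how to draw them afterwards, a boolean mask does the same filtering as skipping 'NOT' in the loop. This is just a sketch of the selection step; the names mask and to_annotate are made up for illustration, and the date range is built from explicit start and end times rather than a frequency string.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2017-01-01 00:00", "2017-01-01 02:00", periods=3)
ts1 = pd.Series(np.random.randn(len(rng)), index=rng)
ts2 = pd.Series(["HE", "NOT", "SHE"], index=rng)

# True wherever the label should be shown
mask = ts2 != "NOT"

# value and label side by side, only for the timestamps to annotate
to_annotate = pd.DataFrame({"value": ts1[mask], "label": ts2[mask]})

for stamp, row in to_annotate.iterrows():
    print(stamp, row["label"], round(row["value"], 3))
```

Each (timestamp, value, label) triple could then be handed to something like matplotlib's ax.annotate, or used to pick a marker per label.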

Date removed from x axis on overlaid plots matplotlib

I am trying to show time series lines representing an effort amount using matplotlib and pandas.
I've got all my DataFrames to overlay in one plot; however, when I do, Python seems to strip the dates from the x axis and put in some numbers instead. (I'm not sure where these come from, but at a guess, not all days contain the same data, so Python has reverted to using an index id number.) If I plot any one of them on its own, it comes up with dates on the x axis.
Any hints or solutions to make the x axis show date for the multiple plot would be much appreciated.
This is the single figure plot with time axis:
Code I'm using to plot is
fig = pl.figure()
ax = fig.add_subplot(111)
ax.plot(b342,color='black')
ax.plot(b343,color='blue')
ax.plot(b344,color='red')
ax.plot(b345,color='green')
ax.plot(b346,color='pink')
ax.plot(fi,color='yellow')
plt.show()
This is the multiple plot fig with weird x axis:

One option would be to manually specify the x-axis based on the DataFrame index, and then plot directly using matplotlib.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# make up some data
n = 100
dates = pd.date_range(start = "2015-01-01", periods = n, name = "yearDate")
dfs = []
for i in range(3):
    df = pd.DataFrame(data = np.random.random(n)*(i + 1), index = dates,
                      columns = ["FishEffort"] )
    df.df_name = str(i)
    dfs.append(df)
# plot it directly using matplotlib instead of through the DataFrame
fig = plt.figure()
ax = fig.add_subplot()
for df in dfs:
    plt.plot(df.index,df["FishEffort"], label = df.df_name)
plt.legend()
plt.show()
Another option would be to concatenate your DataFrames and plot using Pandas. If you give your "FishEffort" field the correct label name when loading the data or via DataFrame.rename then the labels will be specified automatically.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
n = 100
dates = pd.date_range(start = "2015-01-01", periods = n, name = "yearDate")
dfs = []
for i in range(3):
    df = pd.DataFrame(data = np.random.random(n)*(i + 1), index = dates,
                      columns = ["DataFrame #" + str(i) ] )
    df.df_name = str(i)
    dfs.append(df)
df = pd.concat(dfs, axis = 1)
df.plot()
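As a sketch of the DataFrame.rename route mentioned above: if every frame arrives with the same "FishEffort" column, renaming it before concatenating gives each line its own legend label. The names in the list below are only placeholders borrowed from the variable names in the question; the data is made up.

```python
import numpy as np
import pandas as pd

n = 100
dates = pd.date_range(start="2015-01-01", periods=n, name="yearDate")
names = ["b342", "b343", "b344"]  # placeholder labels, one per frame

frames = []
for i, name in enumerate(names):
    frame = pd.DataFrame({"FishEffort": np.random.random(n) * (i + 1)}, index=dates)
    # give the shared column a unique name so it becomes the legend entry
    frames.append(frame.rename(columns={"FishEffort": name}))

# align on the date index; days missing from one frame would simply become NaN
combined = pd.concat(frames, axis=1)
print(combined.head())
```

Calling combined.plot() would then keep the dates on the x axis and use the column names as labels.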

I've found an answer that does what I want. It seems that calling plt.plot directly wasn't using the date as the x axis, whereas plotting through the pandas .plot method, as shown in the pandas documentation, did the trick.
ax = b342.plot(label='342')
b343.plot(ax=ax, label='test')
b344.plot(ax=ax)
b345.plot(ax=ax)
b346.plot(ax=ax)
fi.plot(ax=ax)
plt.show()
I was wondering if anyone knows how to change the labels here?
