I have a dataframe with a categorical column and a column of float values and would like to visualise the sum of the floats by categorical group as a fraction of total (e.g. 20% vs 80%). In addition, I would like to visualise the uncertainty, i.e. plot the confidence interval around the point estimates.
Here is a stylised example:
import pandas as pd
df = pd.DataFrame(data={
'flag':[True, False, True, True, True, False, True, False, True, True, True, True],
'revenue': [1,2,3,4,5,6,7,8,9,10,11,12]
})
print(df.groupby('flag').revenue.sum()/df.revenue.sum())
flag
False 0.205128
True 0.794872
Name: revenue, dtype: float64
I tried to specify the estimator in the seaborn barplot function (the resulting barplot image is omitted here):
import numpy as np
import seaborn as sns

sns.barplot(data=df, x='flag', y='revenue', estimator=lambda x: np.sum(x)/df.revenue.sum())
The problem is that seaborn allows the confidence interval of the bars to extend beyond 100%, i.e. to more than the total (while it should be capped at 100%).
Here is a quick-and-dirty code sample of bootstrapping the confidence intervals properly:
def bootstrap_ratio(df):
    df_bs = df.sample(n=len(df), replace=True)
    return df_bs.groupby('flag').revenue.sum()/df_bs.revenue.sum()

N = 1000  # number of bootstrap samples
pd.concat([bootstrap_ratio(df) for i in range(N)], axis=1).quantile([0.05, 0.95], axis=1)
flag     False     True
0.05  0.052912  0.575636
0.95  0.427810  0.973333
In this case, the 90% confidence interval of flag==True is [58%, 97%], with the upper edge not more than 100%.
How would I get seaborn to do this? Or at least how could I specify confidence interval values for seaborn to plot (instead of compute)?
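To illustrate what I mean by plotting (rather than computing) the interval, here is a rough plain-matplotlib mock-up of the desired result, reusing the bootstrap_ratio helper above and drawing the 5%/95% quantiles as asymmetric error bars (just a sketch, not necessarily the idiomatic seaborn way):
import numpy as np
import matplotlib.pyplot as plt
point = df.groupby('flag').revenue.sum() / df.revenue.sum()  # point estimates
ci = pd.concat([bootstrap_ratio(df) for _ in range(1000)], axis=1).quantile([0.05, 0.95], axis=1)
yerr = np.abs(ci.values - point.values)  # row 0: distance to lower bound, row 1: to upper bound
plt.bar(point.index.astype(str), point.values, yerr=yerr, capsize=5)
plt.ylabel('share of total revenue')
plt.show()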
I'm new to pandas and trying to compare peak/trough values in time-series price data to determine whether they are higher or lower than the previous peaks/troughs. I'd like to find three consecutively higher peaks and three consecutively higher troughs, and vice versa. If this is true, I'd perform some function. I've used the following code to create a dataframe in QuantConnect which returns a True value for each peak/trough; history is the name of the dataframe.
from scipy.signal import argrelextrema
import matplotlib.pyplot as plt
import numpy as np
qb = QuantBook()
spy = qb.AddEquity("BA")
history = qb.History(qb.Securities.Keys, 360, Resolution.Daily).reset_index(level=0)
ilocs_min = argrelextrema(history.close.values, np.less_equal, order=3)[0]
ilocs_max = argrelextrema(history.close.values, np.greater_equal, order=3)[0]
print(ilocs_min)
history.close.plot(figsize=(20,8), alpha=.3)
# filter prices that are peaks and plot them differently to be visible on the plot
history.iloc[ilocs_max].close.plot(style='.', lw=10, color='green', marker="v");
history.iloc[ilocs_min].close.plot(style='.', lw=10, color='red', marker="^");
history['weekly_max'] = False
history['weekly_min'] = False
history.loc[history.iloc[ilocs_min].index, 'weekly_min'] = True
history.loc[history.iloc[ilocs_max].index, 'weekly_max'] = True
history.close.plot(figsize=(20,8), alpha =.3)
history[history['weekly_max']].close.plot(style='.', lw=10, color='green', marker="v");
history[history['weekly_min']].close.plot(style='.', lw=10, color='red', marker="^");
This returns the following dataframe:
[image: the resulting dataframe with boolean weekly_max and weekly_min columns]
Any help would be amazing!
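To be concrete, this is roughly the kind of check I'm hoping to build on top of those weekly_max/weekly_min columns (an untested sketch; the helper name is my own):
import numpy as np
peaks = history.loc[history['weekly_max'], 'close']
troughs = history.loc[history['weekly_min'], 'close']
def three_consecutive_higher(series):
    # True if the last three values are strictly increasing
    if len(series) < 3:
        return False
    return bool(np.all(np.diff(series.iloc[-3:].values) > 0))
if three_consecutive_higher(peaks) and three_consecutive_higher(troughs):
    print("Three consecutively higher peaks and troughs")  # perform some function here
elif three_consecutive_higher(-peaks) and three_consecutive_higher(-troughs):
    print("Three consecutively lower peaks and troughs")   # or the opposite function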
Is there a way I could modify properties of individual points on a matplotlib scatter plot, for example make certain points invisible or change their size/shape?
Let's consider an example data set using pandas.DataFrame():
import pandas as pd
import matplotlib.pyplot as plt
import random
df = pd.DataFrame()
df['name'] = ['cat', 'dog', 'bird', 'fish', 'frog']
df['id'] = [1, 1, 1, 2, 2]
df['x'] = [random.randint(-10, 10) for n in range(5)]
df['y'] = [random.randint(-10, 10) for n in range(5)]
Let's plot it on a scatter plot:
sc = plt.scatter(df['x'].tolist(), df['y'].tolist())
plt.show()
#easy-peasy
Plot was generated.
Let's say I want all data points that have id=1 in df removed from the existing plot (for example on a button click). By removed I don't necessarily mean deleted; setting them invisible or something similar will be OK. In general, I'm interested in a way to iterate over each point existing on the plot and do something with it.
EDIT #1
Using the inspect module I noticed that the sc plot object holds a property named sc._offsets.
It seems to be a 2D numpy array holding the coordinates of the data points on the scatter plot (for a 2D plot).
This _offsets property consists of three components, should I say: "data" (a 2D array of coordinates), "mask" (a 2D array of bool values, in this case all False) and "fill_value", which seems to be of no concern to me.
I've managed to remove points of choice from the scatter plot by deleting _offsets elements at certain indexes like this:
sc._offsets = numpy.delete(sc._offsets, [0, 1, 3], axis=0)
and then re-drawing the plot:
sc.figure.canvas.draw()
Since the values in the 'id' column of the dataframe and the coordinates in sc._offsets are aligned, I can remove coordinates by index wherever the 'id' value is (for example) 1.
This does what I wanted, because the original dataframe with the dataset remains intact, so I can re-create the points on the scatter plot on demand.
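Concretely, the index-based removal looks something like this (a sketch using the public get_offsets()/set_offsets() accessors rather than assigning to _offsets directly; it assumes the row order of df matches the order of the points passed to scatter):
import numpy as np
drop_idx = np.where(df['id'].to_numpy() == 1)[0]               # indexes of the points to remove
sc.set_offsets(np.delete(sc.get_offsets(), drop_idx, axis=0))  # drop those coordinates
sc.figure.canvas.draw()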
I think I could use the "mask" to somehow hide/show points of choice on scatter plot but I don't yet know how. I'm investigating it.
SOLVED
The answer is setting the mask of the numpy.ma.MaskedArray that lies under the sc._offsets property of the matplotlib scatter plot.
This can be done in the following way, both during plot generation and after the plot has been generated, in interactive mode:
#before change:
#sc._offsets.mask = [[False, False], [False, False], [False, False], [False, False], [False, False]]
sc._offsets.mask = [[1, 1], [1, 1], [1, 1], [0, 0], [0, 0]]
#after change:
#sc._offsets.mask = [[True, True], [True, True], [True, True], [False, False], [False, False]]
#then re-draw plot
sc.figure.canvas.draw() #docs say that it's better to use draw_idle() but I don't see difference
Setting the value corresponding to the index of the point you would like to exclude to True removes that particular point from the plot. It does not "delete" it; points can be restored by setting the bool values back to False. Note that it is a 2D array, so passing a simple [1, 1, 1, 0, 0] will not do; you need to take into account both the x and y coordinates of each point.
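For example, the mask can be built straight from the dataframe like this (a sketch, assuming sc._offsets is the masked array described above and that df rows and plot points are in the same order):
import numpy as np
hide = (df['id'] == 1).to_numpy()                 # one bool per point
sc._offsets.mask = np.column_stack([hide, hide])  # repeat it for the x and y coordinate
sc.figure.canvas.draw()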
Consult numpy docs for details:
https://numpy.org/doc/stable/reference/maskedarray.generic.html#accessing-the-mask
I'll edit if something comes up.
Thank you all for help.
A basic solution: if your dataset is not a big one, and you know the conditions that differentiate the data you want to plot differently, you can create one column per condition and plot each one with different markers and colors.
Suppose you want to plot the y values that are greater than 3 differently:
import pandas as pd
import matplotlib.pyplot as plt
import random
df = pd.DataFrame()
df['name'] = ['cat', 'dog', 'bird', 'fish', 'frog']
df['id'] = [1, 1, 1, 2, 2]
df['x'] = [random.randint(-10, 10) for n in range(5)]
df['y'] = [random.randint(-10, 10) for n in range(5)]
_mask = df.y > 3
df.loc[_mask, 'y_case_2'] = df.y
df.loc[~_mask, 'y_case_1'] = df.y
sc = plt.scatter(df.x, df.y_case_1, marker='*', color='r')
sc = plt.scatter(df.x, df.y_case_2, marker='.', color='b')
plt.show()
df
Note: be aware that the random data might not contain any values greater than 3. If so, just run it again.
I'm making a barplot using 3 datasets in seaborn; however, each data point overlays the previous one, regardless of whether it is now hiding the previous plot. E.g.:
sns.barplot(x="Portfolio", y="Factor", data=d2,
label="Portfolio", color="g")
sns.barplot(x="Benchmark", y="Factor", data=d2,
label="Benchmark", color="b")
sns.barplot(x="Active Exposure", y="Factor", data=d2,
label="Active", color="r")
ax.legend(frameon=True)
ax.set(xlim=(-.1, .5), ylabel="", xlabel="Sector Decomposition")
sns.despine(left=True, bottom=True)
However, I want it to show the green, even if the blue being overlaid is greater. Any ideas?
Without being able to see your data, I can only guess that your dataframe is not in long-form. There's a section in the seaborn tutorial on the shape of DataFrame that seaborn expects; I'd take a look there for more info, specifically the part on messy data.
Because I can't see your DataFrame, I have made some assumptions about its shape:
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame({
"Factor": list("ABC"),
"Portfolio": np.random.random(3),
"Benchmark": np.random.random(3),
"Active Exposure": np.random.random(3),
})
# Active Exposure Benchmark Factor Portfolio
# 0 0.140177 0.112653 A 0.669687
# 1 0.823740 0.078819 B 0.072474
# 2 0.450814 0.702114 C 0.039068
We can melt this DataFrame to get the long-form data seaborn wants:
d2 = df.melt(id_vars="Factor", var_name="exposure")
# Factor exposure value
# 0 A Active Exposure 0.140177
# 1 B Active Exposure 0.823740
# 2 C Active Exposure 0.450814
# 3 A Benchmark 0.112653
# 4 B Benchmark 0.078819
# 5 C Benchmark 0.702114
# 6 A Portfolio 0.669687
# 7 B Portfolio 0.072474
# 8 C Portfolio 0.039068
Then, finally, we can plot our bar plot using seaborn's built-in aggregations:
ax = sns.barplot(x="value", y="Factor", hue="exposure", data=d2)
ax.set(ylabel="", xlabel="Sector Decomposition")
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
Which produces:
Here's the plot params I used to make this chart:
import matplotlib as mpl
# Plot configuration
mpl.style.use("seaborn-pastel")
mpl.rcParams.update(
{
"font.size": 14,
"figure.facecolor": "w",
"axes.facecolor": "w",
"axes.spines.right": False,
"axes.spines.top": False,
"axes.spines.bottom": False,
"xtick.top": False,
"xtick.bottom": False,
"ytick.right": False,
"ytick.left": False,
}
)
If you are fine without using seaborn you can use pandas plotting to create a stacked horizontal bar chart (barh):
import pandas as pd
import matplotlib as mpl
# Plot configuration
mpl.style.use("seaborn-pastel")
mpl.rcParams.update(
{
"font.size": 14,
"figure.facecolor": "w",
"axes.facecolor": "w",
"axes.spines.right": False,
"axes.spines.top": False,
"axes.spines.bottom": False,
"xtick.top": False,
"xtick.bottom": False,
"ytick.right": False,
"ytick.left": False,
}
)
df = pd.DataFrame({
"Factor": list("ABC"),
"Portfolio": [0.669687, 0.072474, 0.039068],
"Benchmark": [0.112653, 0.078819, 0.702114],
"Active Exposure": [0.140177, 0.823740, 0.450814],
}).set_index("Factor")
ax = df.plot.barh(stacked=True)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set_ylabel("")
ax.set_xlabel("Sector Decomposition")
Notice in the code above the index is set to Factor which then becomes the y axis.
If you don't set stacked=True you get almost the same chart as seaborn produced:
ax = df.plot.barh(stacked=False)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax.set_ylabel("")
ax.set_xlabel("Sector Decomposition")
I am working on hyperparameter tuning of neural networks and going through examples. I came across this code in one example:
train_X, train_Y = sklearn.datasets.make_circles(n_samples=300, noise=.05)
I understand that adding noise has a regularization effect on the data. Reading the documentation tells me that it adds Gaussian noise. However, in the above code, I could not understand what it means to add 0.05 noise to the data. How would this affect the data mathematically here?
I tried the code below. I could see the values changing, but could not figure out how, for example, the row 1 values of x in the first array changed into the corresponding row of the second array (x_1) when noise=.05 was added.
import numpy as np
import sklearn.datasets

np.random.seed(0)
x, y = sklearn.datasets.make_circles()
print(x[:5,:])
x_1, y_1 = sklearn.datasets.make_circles(noise=.05)
print(x_1[:5,:])
Output:
[[-9.92114701e-01 -1.25333234e-01]
[-1.49905052e-01 -7.85829801e-01]
[ 9.68583161e-01 2.48689887e-01]
[ 6.47213595e-01 4.70228202e-01]
[-8.00000000e-01 -2.57299624e-16]]
[[-0.66187208 0.75151712]
[-0.86331995 -0.56582111]
[-0.19574479 0.7798686 ]
[ 0.40634757 -0.78263011]
[-0.7433193 0.26658851]]
According to the documentation:
sklearn.datasets.make_circles(n_samples=100, *, shuffle=True, noise=None, random_state=None, factor=0.8)
Make a large circle containing a smaller circle in 2d. A simple toy dataset to visualize clustering and classification algorithms.
noise: double or None (default=None)
Standard deviation of Gaussian noise added to the data.
The statement make_circles(noise=0.05) means that it is creating random circles with a little bit of variation following a Gaussian distribution, also known as a normal distribution. You should already know that a Gaussian distribution means that the numbers being generated have some mean and standard deviation. In this case, the call make_circles(noise=0.05) means that the standard deviation is 0.05.
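To make that concrete, here is a small sketch (not part of the original example) of what a standard deviation of 0.05 does: generate the noiseless circles, then perturb the coordinates ourselves with Gaussian noise of that scale, which is what make_circles does internally when noise is set.
import numpy as np
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=100, noise=None, random_state=0)
rng = np.random.RandomState(0)
X_noisy = X + rng.normal(scale=0.05, size=X.shape)  # shift each coordinate by N(0, 0.05)
print(X[:3])
print(X_noisy[:3])
# Roughly 95% of the shifts fall within +/- 0.1 of the original coordinates.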
Let's invoke the function, check out its output, and see what's the effect of changing the parameter noise. I'll borrow liberally from this nice tutorial on generating scikit-learn dummy data.
Let's first call make_circles() with noise=0.0 and take a look at the data. I'll use a Pandas dataframe so we can see the data in a tabular way.
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
import pandas as pd
n_samples = 100
noise = 0.00
features, labels = make_circles(n_samples=n_samples, noise=noise)
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
print(df.head())
# x y label
# 0 -0.050232 0.798421 1
# 1 0.968583 0.248690 0
# 2 -0.809017 0.587785 0
# 3 -0.535827 0.844328 0
# 4 0.425779 -0.904827 0
You can see that make_circles returns data instances where each instance is a point with two features, x and y, and a label. Let's plot them to see what they actually look like.
# Collect the points together by label, either 0 or 1
grouped = df.groupby('label')
colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(7,7))
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.show()
So it looks like it's creating two concentric circles, each with a different label.
Let's increase the noise to noise=0.05 and see the result:
n_samples = 100
noise = 0.05 # <--- The only change
features, labels = make_circles(n_samples=n_samples, noise=noise)
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
grouped = df.groupby('label')
colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(7,7))
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.show()
It looks like the noise is added to each of the x, y coordinates to make each point shift around a little bit. When we inspect the code for make_circles() we see that the implementation does exactly that:
def make_circles( ..., noise=None, ...):
    ...
    if noise is not None:
        X += generator.normal(scale=noise, size=X.shape)
So now we've seen two visualizations of the dataset with two values of noise. But two visualizations isn't cool. You know what's cool? Five visualizations with the noise increasing progressively by 10x. Here's a function that does it:
def make_circles_plot(n_samples, noise):
    assert n_samples > 0
    assert noise >= 0
    # Use make_circles() to generate random data points with noise.
    features, labels = make_circles(n_samples=n_samples, noise=noise)
    # Create a dataframe for later plotting.
    df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
    grouped = df.groupby('label')
    colors = {0:'red', 1:'blue'}
    fig, ax = plt.subplots(figsize=(5, 5))
    for key, group in grouped:
        group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
    plt.title('Points with noise=%f' % noise)
    plt.xlim(-2, 2)
    plt.ylim(-2, 2)
    plt.grid()
    plt.tight_layout()
    plt.show()
Calling the above function with different values of noise, it can clearly be seen that increasing this value makes the points move around more, i.e. it makes them more "noisy", exactly as we should expect intuitively.
for noise in [0.0, 0.01, 0.1, 1.0, 10.0]:
    make_circles_plot(500, noise)
I have a Pandas dataframe of data on generator plant capacity (MW) by fuel type. I wanted to show the estimated distribution of plant capacity in two different ways: by plant (easy) and by MW (harder). Here's an example:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
# generate empty dataframe
df = pd.DataFrame(data=None,columns=['Fuel','MW'])
# create and seed a randomstate object (to make #s repeatable below)
rnd = np.random.RandomState(7)
# generate fake data for each fuel type and append to df
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
    mymean = rnd.uniform(low=2.8, high=3.2)
    mysigma = rnd.uniform(low=0.6, high=1.0)
    df = df.append(
        pd.DataFrame({'Fuel': myfuel,
                      'MW': np.array(rnd.lognormal(mean=mymean, sigma=mysigma, size=1000))
                     }),
        ignore_index=True
    )
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=df,
inner=None,
scale='area',
cut=0,
linewidth=0.5
)
And here's the plot of the estimated distributions of plant size (in MW) that this code makes:
This violinplot is very deceptive without more context. Because it is not weighted, the thin tail at the top of each category hides the fact that the relatively few plants in the tail contain a lot of (maybe even most of) the MWs of capacity. So I want a second plot with the distribution by MWs, basically a weighted version of this first violinplot.
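(As a quick sanity check of that claim, a small sketch like this prints, for each fuel, the share of total MW held by the largest 10% of its plants; the exact numbers depend on the random data above.)
for fuel, grp in df.groupby('Fuel'):
    mw = grp['MW'].astype(float).sort_values(ascending=False)
    n_top = max(1, len(mw) // 10)  # largest 10% of plants
    print(fuel, round(mw.head(n_top).sum() / mw.sum(), 2))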
I wanted to know if anyone has figured out an elegant way to make such a "weighted" violinplot, or has a good idea about how best to do it.
I figured I could loop through each row of my plant-level dataframe and decompose the plant data (into a new dataframe) into MW-level data. For instance, for a row in the plant-level dataframe that shows a plant with 350 MW, I could decompose that into 3500 new rows of my new dataframe, each representing 100 kW of capacity. (I think I have to go to at least the 100 kW level of resolution, because some of these plants are pretty small, in the 100 kW range.) That new dataframe would be enormous, but I could then do a violinplot of that decomposed data. That seems slightly brute-force. Any better ideas for an approach?
Update:
I implemented the brute force method described above. Here's what it looks like if anyone is interested. This is not the "answer" to this question, because I still would be interested if anyone knows a more elegant/simple/efficient way to do this. So please chime in if you know of such a way. Otherwise, I hope this brute force approach might be helpful to someone in the future.
So that it's easy to see that the weighted violinplot makes sense, I replaced the random data with a simple uniform series of numbers from 0 to 10. Under this new approach, the violinplot of df should look pretty uniform, and the violinplot of the weighted data (dfw) should get steadily wider towards the top of the violins. That's exactly what happens (see image of violinplots below).
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# generate empty dataframe
df = pd.DataFrame(data=None,columns=['Fuel','MW'])
# generate fake data for each fuel type and append to df
for myfuel in ['Biomass','Coal','Hydro','Natural Gas','Oil','Solar','Wind','Other']:
    df = df.append(
        pd.DataFrame({'Fuel': myfuel,
                      # To make it easy to see that the violinplot of dfw (below)
                      # makes sense, here we'll just use a simple range list from
                      # 0 to 10
                      'MW': np.array(range(11))
                     }),
        ignore_index=True
    )
# I have to recast the data type here to avoid an error when using violinplot below
df.MW = df.MW.astype(float)
# create another empty dataframe
dfw = pd.DataFrame(data=None,columns=['Fuel','MW'])
# since dfw will be huge, specify data types (in particular, use "category" for Fuel to limit dfw size)
dfw = dfw.astype(dtype={'Fuel':'category', 'MW':'float'})
# Define the MW size by which to normalize all of the units
# Careful: too big -> loss of fidelity in data for small plants
# too small -> dfw will need to store an enormous amount of data
norm = 0.1 # this is in MW, so 0.1 MW = 100 kW
# Define a var to represent (for each row) how many basic units
# of size = norm there are in each row
mynum = 0
# loop through rows of df
for index, row in df.iterrows():
    # calculate and store the number of norm MW there are within the MW of each plant
    mynum = int(round(row['MW']/norm))
    # insert mynum rows into dfw, each with Fuel = row['Fuel'] and MW = row['MW']
    dfw = dfw.append(
        pd.DataFrame({'Fuel': row['Fuel'],
                      'MW': np.array([row['MW']]*mynum, dtype='float')
                     }),
        ignore_index=True
    )
# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey='row')
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=df,
inner=None,
scale='area',
cut=0,
linewidth=0.5,
ax = ax1
)
# make violinplot
sns.violinplot(x = 'Fuel',
y = 'MW',
data=dfw,
inner=None,
scale='area',
cut=0,
linewidth=0.5,
ax = ax2
)
# loop through the set of tick labels for both axes
# set tick label size and rotation
for item in (ax1.get_xticklabels() + ax2.get_xticklabels()):
    item.set_fontsize(8)
    item.set_rotation(30)
    item.set_horizontalalignment('right')
plt.show()
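For what it's worth, the same decomposition can also be written without the row-by-row append loop by repeating each row in proportion to its capacity, which should be much faster on a real plant table (a sketch, reusing the df and norm defined above):
counts = (df['MW'] / norm).round().astype(int)  # number of 0.1 MW units per plant
dfw = df.loc[df.index.repeat(counts)].reset_index(drop=True)
# dfw now has one row per 100 kW of capacity, with the same columns as df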