Let's say I have a DataFrame that looks (simplified) like this
>>> df
freq
2 2
3 16
1 25
where the index column represents a value, and the freq column represents the frequency of occurrence of that value, as in a frequency table.
I'd like to plot a density plot for this table like the one obtained from plot kind 'kde'. However, this kind is apparently only meant for a pd.Series of raw observations. My df is too large to flatten out into a 1D Series, i.e. df = [2, 2, 3, 3, 3, ..., 1, 1].
How can I plot such a density plot under these circumstances?
I know you have asked for the case where df is too large to flatten out, but the following answer works where this isn't the case:
pd.Series(df.index.repeat(df.freq)).plot.kde()
Or more generally, when the values are in a column called val and not the index:
df.val.repeat(df.freq).plot.kde()
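For the case where the table really is too large to expand, one alternative worth considering: scipy's gaussian_kde accepts a weights argument (since SciPy 1.2), so the frequencies can be used directly without ever materializing the repeated values. A minimal sketch on the example table:

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# the example frequency table: index holds the values, freq the counts
df = pd.DataFrame({"freq": [2, 16, 25]}, index=[2, 3, 1])

# weight each index value by its frequency instead of repeating it
kde = gaussian_kde(df.index.to_numpy(float), weights=df["freq"].to_numpy(float))

# evaluate the estimated density on a grid (plot with plt.plot(xs, density))
xs = np.linspace(df.index.min() - 2, df.index.max() + 2, 200)
density = kde(xs)
```

This scales with the number of distinct values rather than the total count.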
You can plot a density distribution using a bar plot if you normalize the y values by the total number of observations. With unit-width bars this makes the area covered by the bars equal to 1.
import matplotlib.pyplot as plt

plt.bar(
    df.index,
    df.freq / df.freq.sum(),  # relative frequencies
    width=-1,
    align='edge'
)
The width and align parameters are to make sure each bar covers the interval (k-1, k].
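To see that this normalization does make the total bar area equal to one (assuming unit-width bars), a quick numerical check on the example table:

```python
import pandas as pd

# the example frequency table from the question
df = pd.DataFrame({"freq": [2, 16, 25]}, index=[2, 3, 1])

heights = df["freq"] / df["freq"].sum()

# each bar has width 1, so the total area is just the sum of the heights
total_area = (heights * 1).sum()
```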
Somebody with better knowledge of statistics should answer whether kernel density estimation actually makes sense for discrete distributions.
Maybe this will work:
import matplotlib.pyplot as plt
plt.plot(df.index, df['freq'])
plt.show()
Seaborn was built to do this on top of Matplotlib and automatically calculates kernel density estimates if you want.
import numpy as np
import pandas as pd
import seaborn as sns

x = pd.Series(np.random.randint(0, 20, size=10000), name='freq')
sns.distplot(x, kde=True)  # note: distplot is deprecated in newer seaborn; sns.histplot(x, kde=True) is the replacement
I have used the seaborn pairplot function and would like to extract a data array.
import seaborn as sns
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")
I want to get an array of the points I show below in black color:
Thanks.
Just this line:
data = iris[iris['species'] == 'setosa']['sepal_length']
You are interested in the blue line, i.e. the 'setosa' species. In order to filter the iris dataframe, I create this filter:
iris['species'] == 'setosa'
which is a boolean array, whose values are True if the corresponding row in the 'species' columns of the iris dataframe is 'setosa', False otherwise. With this line of code:
iris[iris['species'] == 'setosa']
I apply the filter to the dataframe, in order to extract only the rows associated with the 'setosa' species. Finally, I extract the 'sepal_length' column:
iris[iris['species'] == 'setosa']['sepal_length']
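The same filter-then-select can also be written in a single step with .loc, which avoids the intermediate copy. A small sketch on a toy stand-in for the iris dataframe:

```python
import pandas as pd

# toy stand-in for the iris dataframe
iris = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor"],
    "sepal_length": [5.1, 4.9, 7.0],
})

# boolean mask as row selector, column label as column selector
data = iris.loc[iris["species"] == "setosa", "sepal_length"]
```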
If I plot a KDE for this data array with this code:
data = iris[iris['species'] == 'setosa']['sepal_length']
sns.kdeplot(data)
I get:
which is the plot above that you are interested in. The values differ slightly from the plot above because of the way the KDE is calculated.
I quote this reference:
The y-axis in a density plot is the probability density function for
the kernel density estimation. However, we need to be careful to
specify this is a probability density and not a probability. The
difference is the probability density is the probability per unit on
the x-axis. To convert to an actual probability, we need to find the
area under the curve for a specific interval on the x-axis. Somewhat
confusingly, because this is a probability density and not a
probability, the y-axis can take values greater than one. The only
requirement of the density plot is that the total area under the curve
integrates to one. I generally tend to think of the y-axis on a
density plot as a value only for relative comparisons between
different categories.
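The point in the quote, that density values can exceed one while the total area stays one, is easy to verify numerically. A sketch using scipy's gaussian_kde on tightly clustered data:

```python
import numpy as np
from scipy.stats import gaussian_kde

# tightly clustered observations produce a very tall, narrow density peak
data = np.array([0.0, 0.01, 0.02, 0.01, 0.0])
kde = gaussian_kde(data)

# evaluate on a fine grid around the data
xs = np.linspace(-0.1, 0.1, 2001)
ys = kde(xs)

# the peak density is far above 1, yet the curve still integrates to 1
```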
My data consists of a 2-D array of masses and distances. I want to produce a plot where the x-axis is distance and the y axis is the number of data elements with distance <= x (i.e. a cumulative histogram plot). What is the most efficient way to do this with Python?
PS: the masses are irrelevant since I already have filtered by mass, so all I am trying to produce is a plot using the distance data.
Example plot below:
You can combine numpy.cumsum() and plt.step():
import matplotlib.pyplot as plt
import numpy as np
N = 15
distances = np.random.uniform(1, 4, N).cumsum()
counts = np.random.uniform(0.5, 3, N)
plt.step(distances, counts.cumsum())
plt.show()
Alternatively, plt.bar can be used to draw a histogram, with the widths defined by the differences between successive distances. Note that an extra distance needs to be appended to give the last bar a width.
plt.bar(distances, counts.cumsum(), width=np.diff(distances, append=distances[-1]+1), align='edge')
plt.autoscale(enable=True, axis='x', tight=True) # make x-axis tight
Instead of appending a value, e.g. a zero could be prepended, depending on the exact interpretation of the data.
plt.bar(distances, counts.cumsum(), width=-np.diff(distances, prepend=0), align='edge')
This is what I figured I can do given a 1D array of data:
plt.figure()
counts = np.ones(len(data))
plt.step(np.sort(data), counts.cumsum())
plt.show()
This also works with duplicate elements, since the cumulative count still increases by one for each occurrence of the same x.
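As a quick sanity check of the duplicate handling, the cumulative count at a given x can be read off with np.searchsorted:

```python
import numpy as np

data = np.array([3.0, 1.0, 2.0, 2.0])
xs = np.sort(data)                 # 1, 2, 2, 3
ys = np.ones(len(data)).cumsum()   # 1, 2, 3, 4

# number of elements <= 2.0 should be 3 (one 1 and two 2s)
n_le_2 = ys[np.searchsorted(xs, 2.0, side="right") - 1]
```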
Is it possible to add some spacing in the heatmaps created by using mark_rect() in Altair python plots? The heatmap in figure 1 will be converted to the one in figure 2. You can assume that this is from a dataframe and each column corresponds to a variable. I deliberately drew the white bars like this to avoid any hardcoded indexed solution. Basically, I am looking for a solution where I can provide the column name and/or the index name to get white spacings drawn both vertically and/or horizontally.
You can specify the spacing within heatmaps using the scale.bandPaddingInner configuration parameter, which is a number between zero and one that specifies the fraction of the rectangle mark that should be padded, and defaults to zero. For example:
import altair as alt
import numpy as np
import pandas as pd
# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(),
                       'y': y.ravel(),
                       'z': z.ravel()})
alt.Chart(source).mark_rect().encode(
x='x:O',
y='y:O',
color='z:Q'
).configure_scale(
bandPaddingInner=0.1
)
One way to create these bands would be to facet the chart using custom bins. Here is a way to do that, using pandas.cut to create the bins.
import pandas as pd
import altair as alt
df = (pd.util.testing.makeDataFrame()
        .reset_index(drop=True)  # drop string index
        .reset_index()           # add an index column
        .melt(id_vars=['index'], var_name="column"))
# To include all the indices and not create NaNs, I add -1 and max(indices) + 1 to the desired bins.
bins = [-1, 3, 9, 15, 27, 30]
df['bins'] = pd.cut(df['index'], bins, labels=range(len(bins) - 1))
# This was done for the index, but a similar approach could be taken for the columns as well.
alt.Chart(df).mark_rect().encode(
x=alt.X('index:O', title=None),
y=alt.Y('column:O', title=None),
color="value:Q",
column=alt.Column("bins:O",
title=None,
header=alt.Header(labelFontSize=0))
).resolve_scale(
x="independent"
).configure_facet(
spacing=5
)
Note the resolve_scale(x='independent') to avoid repeating the axis in each facet, and the spacing parameter in configure_facet to control the width of the spacing. I set labelFontSize=0 in the header so that the bin names are not shown on top of each facet.
I'm trying to plot the relationship of two independent variables x and y with a dependent variable score as a heatmap: x and y are integer values from 0 to infinity and score is a real value between 0 and 1.
Desired appearance
There are a large number of observed values for x and y, so I would like it to look more like a typical density plot, like the example below, since the exact values for each individual (x, y) pair are not of great importance:
(example taken from Seaborn's documentation)
Current approach
Currently, I'm trying to use Seaborn's heatmap(..) function to plot the data, but the resulting plot is almost unreadable, with a large amount of space between each discrete data point rather than a "continuous" gradient. The logic for plotting used is as follows:
import pandas as pd
from matplotlib.pyplot import cm
import seaborn as sns
sns.set_style("whitegrid")
df = read_df_using_pandas(...)
table = df.pivot_table(
    values="score",
    index="y",
    columns="x",
    aggfunc="mean")
ax = sns.heatmap(table, cmap=cm.magma_r)
ax.invert_yaxis()
fig = ax.get_figure()
fig.savefig("some_outfile.png", format="png")
The result plot looks like the following, which is wrong, as it does not match the desired appearance described in the section above:
I do not know why there is a large amount of space between each discrete data point rather than a "continuous" gradient. How can I plot the relationship between my data composed of two discrete values (x and y) which is represented as a third, scalar value (score), in a way which mimics the style of a gradient density plot? The solution need not use either Seaborn or even matplotlib.
Use imshow. Here is an example that works for me, where toplot is a matrix containing the values you want the heatmap for:
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(figsize=(6, 6))
plt.clf()
ax = fig.add_subplot(111)
toplot = ...  # INSERT MATRIX HERE: a 2-D array of the values to plot
res = ax.imshow(toplot, cmap=plt.cm.viridis, vmin=0)
cb = fig.colorbar(res, fraction=0.046, pad=0.04)
plt.title('Heatmap')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
# mark the position of the maximum value with a star
row = np.where(toplot == toplot.max())[0][0]
column = np.where(toplot == toplot.max())[1][0]
plt.plot(column, row, '*')
plt.savefig('plots/heatmap.png', format='png')
I also added a star, indicating the highest point in the plot, which I needed.
In a classifieds website I maintain, I'm comparing classifieds that receive greater-than-median views vs classifieds that are below median in this criterion. I call the former "high performance" classifieds. Here's a simple countplot showing this:
The hue is simply the number of photos the classified had.
My question is - is there a plot type in seaborn or matplotlib which shows proportions instead of absolute counts?
I essentially want the same countplot, but with each bar as a % of the total items in that particular category. For example, notice that in the countplot, classifieds with 3 photos make up a much larger proportion of the high perf category. It takes a while to glean that information. If each bar's height was instead represented by its % contribution to its category, it'd be a much easier comparison. That's the motivation behind my question.
An illustrative example would be great.
Instead of trying to find a special case plotting function that would do exactly what you want, I would suggest to consider keeping data generation and visualization separate. At the end what you want is to plot a bar graph of some values, so the idea would be to generate the data in such a way that they can easily be plotted.
To this end, you may crosstab the two columns in question and divide each row (or column) in the resulting table by its sum. This table can then easily be plotted using the pandas plotting wrapper.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(42)
import pandas as pd
plt.rcParams["figure.figsize"] = 5.6, 7.0
n = 100
df = pd.DataFrame({"performance": np.random.choice([0, 1], size=n, p=[0.7, 0.3]),
                   "photo": np.random.choice(range(4), size=n, p=[0.6, 0.1, 0.2, 0.1]),
                   "someothervalue": np.random.randn(n)})
fig, (ax,ax2, ax3) = plt.subplots(nrows=3)
freq = pd.crosstab(df["performance"],df["photo"])
freq.plot(kind="bar", ax=ax)
relative = freq.div(freq.sum(axis=1), axis=0)
relative.plot(kind="bar", ax=ax2)
relative = freq.div(freq.sum(axis=0), axis=1)
relative.plot(kind="bar", ax=ax3)
ax.set_title("countplot of absolute frequency")
ax2.set_title("barplot of relative frequency by performance")
ax3.set_title("barplot of relative frequency by photo")
for a in [ax, ax2, ax3]: a.legend(title="Photo", loc=6, bbox_to_anchor=(1.02,0.5))
plt.subplots_adjust(right=0.8,hspace=0.6)
plt.show()
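As a side note, newer pandas versions can do the row or column normalization directly through crosstab's normalize parameter ('index', 'columns' or 'all'), so the manual div step can be skipped:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 100
df = pd.DataFrame({"performance": np.random.choice([0, 1], size=n, p=[0.7, 0.3]),
                   "photo": np.random.choice(range(4), size=n, p=[0.6, 0.1, 0.2, 0.1])})

# normalize="index" divides each row by its sum,
# equivalent to freq.div(freq.sum(axis=1), axis=0) above
relative = pd.crosstab(df["performance"], df["photo"], normalize="index")
```

The resulting table can be plotted the same way with relative.plot(kind="bar").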