I have used the seaborn pairplot function and would like to extract a data array.
import seaborn as sns
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species")
I want to get an array of the points I show below in black color:
Thanks.
Just this line:
data = iris[iris['species'] == 'setosa']['sepal_length']
You are interested in the blue line, i.e. the 'setosa' species. In order to filter the iris dataframe, I create this filter:
iris['species'] == 'setosa'
which is a boolean array whose values are True if the corresponding row in the 'species' column of the iris dataframe is 'setosa', and False otherwise. With this line of code:
iris[iris['species'] == 'setosa']
I apply the filter to the dataframe in order to extract only the rows associated with the 'setosa' species. Finally, I extract the 'sepal_length' column:
iris[iris['species'] == 'setosa']['sepal_length']
If I plot a KDE for this data array with this code:
data = iris[iris['species'] == 'setosa']['sepal_length']
sns.kdeplot(data)
I get:
which is the plot above that you are interested in.
The values differ from the plot above because of the way the KDE is calculated.
I quote this reference:
The y-axis in a density plot is the probability density function for
the kernel density estimation. However, we need to be careful to
specify this is a probability density and not a probability. The
difference is the probability density is the probability per unit on
the x-axis. To convert to an actual probability, we need to find the
area under the curve for a specific interval on the x-axis. Somewhat
confusingly, because this is a probability density and not a
probability, the y-axis can take values greater than one. The only
requirement of the density plot is that the total area under the curve
integrates to one. I generally tend to think of the y-axis on a
density plot as a value only for relative comparisons between
different categories.
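The "total area under the curve integrates to one" property from the quote can be checked numerically. A small sketch (using scipy's gaussian_kde on synthetic data, purely for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=1.0, size=500)

# Fit a kernel density estimate to the sample
kde = gaussian_kde(sample)

# Evaluate the density on a grid extending well beyond the sample range
grid = np.linspace(sample.min() - 5, sample.max() + 5, 2000)
density = kde(grid)

# Approximate the area under the curve with a Riemann sum:
# it should come out very close to 1, even though individual
# density values are free to exceed 1 for tightly clustered data
area = density.sum() * (grid[1] - grid[0])
print(area)  # approximately 1.0
```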
Related
Using this dataset, I tried to make a categorical scatterplot with two mutation types (WES + RNA-Seq & WES) on the x-axis and a small set of numerically ordered values, spaced apart by a scale, on the y-axis. Although I got the x-axis the way I intended, the y-axis instead uses every single value in the Mutation Count column as a separate tick. On top of that, the axis follows the descending order of the dataset, so it isn't numerically ordered either. How can I fix these aspects of the graph?
The code and its output are shown below:
import seaborn as sns
g = sns.catplot(x="Sequencing Type", y="Mutation Count", hue="Sequencing Type", data=tets, height=16, aspect=0.8)
I have a data frame called 'train' with a column 'string' and a column 'string length' and a column 'rank' which has ranking ranging from 0-4.
I want to create a histogram of the string length for each ranking and plot all of the histograms on one graph to compare. I am experiencing two issues with this:
The only way I can manage to do this is by creating separate datasets e.g. with the following type of code:
S0 = train.loc[train['rank'] == 0]
S1 = train.loc[train['rank'] == 1]
Then I create individual histograms for each dataset using:
plt.hist(S0['string length'], bins=100)
plt.show()
This code doesn't plot the density but instead plots the counts. How do I alter my code such that it plots density instead?
Is there also a way to do this without having to create separate datasets? I was told that my method is 'unpythonic'
You could do something like:
df.loc[:, df.columns != 'string'].groupby('rank').hist(density=True, bins=10, figsize=(5, 5))
Basically, this selects all columns except 'string', groups them by 'rank', and makes a histogram of each group using the given arguments.
The density=True argument draws each histogram in a normalized manner, so that the area under it sums to one.
Hope this has helped.
EDIT:
If there are more variables and you want the histograms overlapped, try:
df.groupby('rank')['string length'].hist(density=True, histtype='step', bins=10, figsize=(5, 5))
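As a self-contained illustration, here is the overlapped version with a small synthetic frame standing in for the real 'train' data (the column names and value ranges are made up for the example):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy stand-in for the 'train' frame: string lengths for three rank groups
train = pd.DataFrame({
    "rank": rng.integers(0, 3, size=300),
    "string length": rng.integers(1, 50, size=300),
})

# One density-normalized histogram per rank, overlapped on the same axes;
# no intermediate S0/S1/... frames are needed
axes = train.groupby("rank")["string length"].hist(
    density=True, histtype="step", bins=10
)
plt.show()
```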
Is the color in seaborn heatmap based on z score?
Does anybody know the answer?
The color in a seaborn heatmap is based on the raw values; there is no normalization. It will only reflect Z-scores if your values are already Z-score normalized.
If you want to base your heatmap on Z scores without precomputing them, you can use seaborn's clustermap, which accepts a z_score argument. The default is None, but it can be set to 0 or 1: 0 means the z score is calculated on a row basis, 1 on a column basis.
If you do not want to display the clustering in your final heatmap you also need to set col_cluster and row_cluster to False.
import numpy as np
import seaborn as sns

data_example = np.array([[100, 50, -50, 67], [0, 1, -2, 3], [4000, -4000, 2000, -1000]]).T
sns.clustermap(data_example, z_score=1, col_cluster=False, row_cluster=False, cmap="RdBu_r")
This results in a heatmap that uses z scores instead of the original values.
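If you would rather precompute the z scores yourself and stick with heatmap, a manual per-column standardization works too. A sketch (this mirrors what z_score=1 is expected to do, i.e. subtract the column mean and divide by the column standard deviation):

```python
import numpy as np
import pandas as pd
import seaborn as sns

data_example = np.array([[100, 50, -50, 67],
                         [0, 1, -2, 3],
                         [4000, -4000, 2000, -1000]]).T

df = pd.DataFrame(data_example)

# Per-column z score: subtract the column mean, divide by the column std
z = (df - df.mean()) / df.std()

# Each standardized column now has mean ~0 and std ~1,
# so the color scale reflects z scores rather than raw magnitudes
sns.heatmap(z, cmap="RdBu_r")
```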
Let's say I have a DataFrame that looks (simplified) like this
>>> df
freq
2 2
3 16
1 25
where the index column represents a value and the freq column represents the frequency of occurrence of that value, as in a frequency table.
I'd like to plot a density plot for this table like the one obtained from plot kind 'kde'. However, that kind is apparently only meant for a pd.Series. My df is too large to flatten out into a 1-D Series, i.e. df = [2, 2, 3, 3, 3, ..., 1, 1].
How can I plot such a density plot under these circumstances?
I know you have asked for the case where df is too large to flatten out, but the following answer works where this isn't the case:
pd.Series(df.index.repeat(df.freq)).plot.kde()
Or more generally, when the values are in a column called val and not the index:
df.val.repeat(df.freq).plot.kde()
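To make the repeat trick concrete, here is a sketch using the frequency table from the question:

```python
import pandas as pd

# Frequency table: the index holds the value, 'freq' holds its count
df = pd.DataFrame({"freq": [2, 16, 25]}, index=[2, 3, 1])

# Expand back to one row per observation: 2 twos, 16 threes, 25 ones
expanded = pd.Series(df.index.repeat(df.freq))

print(len(expanded))  # 43 observations in total (2 + 16 + 25)

# The expanded object is an ordinary 1-D Series, so .plot.kde() applies:
# expanded.plot.kde()
```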
You can plot a density distribution using a bar plot if you normalize the y values by the total number of observations. Since each bar has unit width, this makes the area covered by the bars equal to 1.
plt.bar(
df.index,
df.freq / df.freq.sum(),
width=-1,
align='edge'
)
The width and align parameters are to make sure each bar covers the interval (k-1, k].
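A quick check that the normalization works as claimed (a sketch; with unit-width bars, the total area equals the total height):

```python
import pandas as pd

# Frequency table from the question
df = pd.DataFrame({"freq": [2, 16, 25]}, index=[2, 3, 1])

# Normalized bar heights: each frequency divided by the total count
heights = df.freq / df.freq.sum()

# With width-1 bars, total area == total height == 1 (up to rounding)
print(heights.sum())
```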
Somebody with better knowledge of statistics should answer whether kernel density estimation actually makes sense for discrete distributions.
Maybe this will work:
import matplotlib.pyplot as plt
plt.plot(df.index, df['freq'])
plt.show()
Seaborn was built to do this on top of Matplotlib and automatically calculates kernel density estimates if you want.
import seaborn as sns
import numpy as np
import pandas as pd

x = pd.Series(np.random.randint(0, 20, size=10000), name='freq')
sns.distplot(x, kde=True)  # note: in newer seaborn versions, distplot is deprecated in favor of histplot/displot
My goal is to obtain a plot with the spatial frequencies of an image - kind of like doing a fourier transformation on it. I don't care about the position on the image of features with the frequency f (for instance); I'd just like to have a graphic which tells me how much of every frequency I have (the amplitude for a frequency band could be represented by the sum of contrasts with that frequency).
I am trying to do this via the numpy.fft.fft2 function.
Here is a link to a minimal example portraying my use case.
As it turns out I only get distinctly larger values for frequencies[:30,:30], and of these the absolute highest value is frequencies[0,0]. How can I interpret this?
What exactly does the amplitude of each value stand for?
What does it mean that my highest value is in frequency[0,0]? What is a 0 Hz frequency?
Can I bin the values somehow so that my frequency spectrum is orientation agnostic?
freq has a few very large values, and lots of small values. You can see that by plotting
plt.hist(freq.ravel(), bins=100)
(See below.) So, when you use
ax1.imshow(freq, interpolation="none")
Matplotlib uses freq.min() as the lowest value in the color range (which is by default colored blue), and freq.max() as the highest value in the color range (which is by default colored red). Since almost all the values in freq are near the blue end, the plot as a whole looks blue.
You can get a more informative plot by rescaling the values in freq so that the low values are more widely distributed on the color range.
For example, you can get a better distribution of values by taking the log of freq. (You probably don't want to throw away the highest values, since they correspond to frequencies with the highest power.)
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

file_path = "data"
image = np.asarray(Image.open(file_path).convert('L'))
freq = np.fft.fft2(image)
freq = np.abs(freq)
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(14, 6))
ax[0,0].hist(freq.ravel(), bins=100)
ax[0,0].set_title('hist(freq)')
ax[0,1].hist(np.log(freq).ravel(), bins=100)
ax[0,1].set_title('hist(log(freq))')
ax[1,0].imshow(np.log(freq), interpolation="none")
ax[1,0].set_title('log(freq)')
ax[1,1].imshow(image, interpolation="none")
plt.show()
From the docs:
The output, analogously to fft, contains the term for zero frequency
in the low-order corner of the transformed axes,
Thus, freq[0,0] is the "zero frequency" term. In other words, it is the constant term in the discrete Fourier Transform.
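This can be verified directly: the zero-frequency term of fft2 is the sum of all input values (the DC component), so for an image it equals the total pixel intensity. A small sketch:

```python
import numpy as np

# A tiny stand-in "image"
image = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

freq = np.fft.fft2(image)

# The [0, 0] entry is the constant (zero-frequency) term:
# it equals the sum of all pixel values
print(freq[0, 0].real)  # 10.0
print(image.sum())      # 10.0
```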