Vary legend properties for different data in seaborn scatterplots - python

I have created a seaborn scatterplot for a dataset, mapping the size parameter to one column and the hue parameter to another. The hue column can take only five different values and is meant to help classify my data, while the size column holds many more values because it represents actual numeric data. In the current dataset, the hue values are only 0, 2, and 4, but with the "brief" legend option the legend labels do not match those values, which is very confusing. With the "full" legend option, the hue labels are correct, but there are far too many size labels. I would therefore like to display the full legend for the hue parameter but only a brief legend for the size parameter, because it has many unique values.
How the overcrowded "full" legend looks
The "brief" legend that is confusingly labeled
Edit: I have added code that reproduces the issue with a random dataset. To restate my question: I want the values of the "shape" column to be shown in full in the legend, while the values of the "size" column should be abbreviated (as with the legend setting "brief").
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
x_condition=np.arange(0,20,1)
y_condition=np.arange(0,20,1)
size=np.random.randint(0,200,20)
# I haven't made a random distribution here, because I wanted to make sure it contains at least one of each [0,2,4]
shape=[0,2,0,4]*5
df=pd.DataFrame({"x_condition":x_condition,"y_condition":y_condition,"size":size,"shape":shape})
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
sns.scatterplot(x="x_condition", y="y_condition", hue="shape", size="size", data=df, palette="coolwarm", legend="brief")
plt.show()
sns.scatterplot(x="x_condition", y="y_condition", hue="shape", size="size", data=df, palette="coolwarm", legend="full")
plt.show()
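One possible workaround, offered as a sketch rather than a confirmed answer: with legend="brief", seaborn abbreviates only variables it treats as numeric. Assuming that behaviour holds in your seaborn version, casting the "shape" column to strings makes the hue categorical, so all of its levels appear in the legend while the numeric "size" column is still abbreviated. The seeded random generator and the legend_labels variable are only there for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
df = pd.DataFrame({
    "x_condition": np.arange(20),
    "y_condition": np.arange(20),
    "size": rng.integers(0, 200, 20),
    "shape": [0, 2, 0, 4] * 5,
})

# Assumption: a string-valued hue column is treated as categorical,
# so "brief" lists every level instead of evenly spaced numeric ticks.
df["shape"] = df["shape"].astype(str)

ax = sns.scatterplot(x="x_condition", y="y_condition", hue="shape", size="size",
                     data=df, palette="coolwarm", legend="brief")

# Collect the legend entries to inspect what was drawn
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
# call plt.show() (from matplotlib.pyplot) to display the figure
```

The trade-off is that the hue is no longer mapped as a continuous quantity; the palette is applied to the three discrete levels instead.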

Related

Seaborn clustermap showing less columns that the input dataframe has [duplicate]

I'm trying to visualize what the filters are learning in a CNN text classification model. To do this, I extracted the feature maps of text samples right after the convolutional layer, and for the size-3 filters I got a (filter_num)*(length_of_sentences) sized tensor.
df = pd.DataFrame(-np.random.randn(50,50), index = range(50), columns= range(50))
g= sns.clustermap(df,row_cluster=True,col_cluster=False)
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0) # ytick rotate
g.cax.remove() # remove colorbar
plt.show()
This code results in a plot in which I can't see all the ticks on the y-axis. I need all of them because I need to see which filters learn which information. Is there any way to show all the ticks on the y-axis?
kwargs from sns.clustermap get passed on to sns.heatmap, which has an option yticklabels, whose documentation states (emphasis mine):
If True, plot the column names of the dataframe. If False, don’t plot the column names. If list-like, plot these alternate labels as the xticklabels. If an integer, use the column names but plot only every n label. If “auto”, try to densely plot non-overlapping labels.
Here, the easiest option is to set it to an integer n, so it will plot every nth label. We want every label, so we set it to 1, i.e.:
g = sns.clustermap(df, row_cluster=True, col_cluster=False, yticklabels=1)
In your complete example:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame(-np.random.randn(50,50), index=range(50), columns=range(50))
g = sns.clustermap(df, row_cluster=True, col_cluster=False, yticklabels=1)
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0) # ytick rotate
g.cax.remove() # remove colorbar
plt.show()
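To see the effect of the integer setting in isolation, here is a small sketch that uses sns.heatmap directly (the function that clustermap forwards yticklabels to) and counts the resulting tick labels. The variable names n_all and n_tenth are illustrative only.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame(np.random.randn(50, 50))

fig, (ax_all, ax_tenth) = plt.subplots(ncols=2, figsize=(10, 6))
# yticklabels=1 labels every row; yticklabels=10 labels every 10th row
sns.heatmap(df, yticklabels=1, cbar=False, ax=ax_all)
sns.heatmap(df, yticklabels=10, cbar=False, ax=ax_tenth)

n_all = len(ax_all.get_yticklabels())
n_tenth = len(ax_tenth.get_yticklabels())
print(n_all, n_tenth)
# call plt.show() to display the figure
```

With many rows, labelling every row may require a taller figure or a smaller font to keep the labels from overlapping.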

Histogram shows unlimited bins despite bin specification in matplotlib

I have some error data, and when I tried to plot a histogram of it, the intervals (bin sizes) came out very large, as shown in the image below.
Below is the code
import matplotlib.pyplot as plt
plt.figure()
plt.hist(error)
plt.title('histogram of error')
plt.xlabel('error')
plt.show()
When I specified the bins explicitly, as we usually do and as in the code below, I got the histogram shown below.
plt.figure()
plt.hist(error, bins=[-4,-3,-2,-1, 0,1, 2,3, 4,])
#plt.hist(error, bins = 6)
plt.title('histogram of error')
plt.xlabel('error')
plt.show()
I would like the histogram to look nice, something like the one below (an example from Google), with clearly defined bins.
I tried seaborn's displot, and it gave a nice plot, as shown below.
import seaborn as sns
sns.displot(error, bins=[-4,-3,-2,-1, 0,1, 2,3, 4,])
plt.title('histogram of error')
plt.xlabel('error')
plt.show()
Why is matplotlib not able to make this plot? Did I miss anything, or do I need to set something to get the usual histogram? Please advise.
The matplotlib documentation for plt.hist() explains that the first parameter can be either a 1D array or a sequence of 1D arrays. The latter case also applies if you pass in a 2D array: each column is treated as a separate dataset and drawn as its own set of bars with cycling colors.
This is what we see in your example: The X-axis ticks still correspond to the bin-edges that were passed in - but for each bin there are many bars. So, I'm assuming you passed in a multidimensional array.
To fix this, simply flatten your data before passing it to matplotlib, e.g. plt.hist(np.ravel(error), bins=bins).
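A minimal sketch of the difference, assuming the error data is a 2D array (the shape (100, 5) and the random values below are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
error = rng.normal(size=(100, 5))  # hypothetical 2D error array
bins = [-4, -3, -2, -1, 0, 1, 2, 3, 4]

fig, (ax_2d, ax_flat) = plt.subplots(ncols=2)

# 2D input: one set of bars per column, hence many bars per bin
counts_2d, _, _ = ax_2d.hist(error, bins=bins)

# Flattened input: a single dataset, hence one bar per bin
counts_flat, _, patches = ax_flat.hist(np.ravel(error), bins=bins)

print(np.asarray(counts_2d).shape, counts_flat.shape)
# call plt.show() to display the figure
```

Checking error.shape (or np.ndim(error)) before plotting is a quick way to confirm whether this is the cause.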

Differences between bar plots in Matplotlib and pandas

I feel like I'm missing something ridiculously basic here.
If I'm trying to create a bar chart with values from a dataframe, what's the difference between calling .plot on the dataframe object and just entering the data within plt.plot's parentheses?
e.g.
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
VERSUS
df.groupby('category').count().plot(kind='bar')?
Can someone please walk me through what the difference is and when I should use either? I get that with plt.plot I'm calling the plot method of the plt (Matplotlib) library, whereas when I do df.plot I'm calling plot on the dataframe? What does that mean exactly -- that the dataframe has a plot object?
Those are different plotting methods. Fundamentally, they both produce a matplotlib object, which can be shown via one of the matplotlib backends.
There is, however, an important difference. Pandas bar plots are categorical in nature. This means that bars are positioned at consecutive integers, and each bar gets a tick with a label according to the index of the dataframe.
For example:
import matplotlib.pyplot as plt
import pandas as pd
s = pd.Series([30,20,10,40], index=[1,4,5,9])
s.plot.bar()
plt.show()
Here, there are four bars. The first is at position 0, with the first label of the series' index, 1. The second is at position 1, with the label 4, and so on.
In contrast, a matplotlib bar plot is numeric in nature. Compare this to
import matplotlib.pyplot as plt
import pandas as pd
s = pd.Series([30,20,10,40], index=[1,4,5,9])
plt.bar(s.index, s.values)
plt.show()
Here the bars are at the numerical position of the index; the first bar at 1, the second at 4 etc. and the axis labelling is independent of where the bars are.
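The positions described above can be checked directly by reading the bar rectangles back from each Axes. This is only a sketch; bar_centers is a helper name introduced here for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series([30, 20, 10, 40], index=[1, 4, 5, 9])

fig, (ax_pd, ax_mpl) = plt.subplots(ncols=2)
s.plot.bar(ax=ax_pd)           # pandas: categorical positions
ax_mpl.bar(s.index, s.values)  # matplotlib: numeric positions


def bar_centers(ax):
    """Return the x coordinate of the centre of every bar on the Axes."""
    return [p.get_x() + p.get_width() / 2 for p in ax.patches]


pandas_centers = bar_centers(ax_pd)
mpl_centers = bar_centers(ax_mpl)
print(pandas_centers)
print(mpl_centers)
# call plt.show() to display the figure
```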
Note that you can achieve a categorical bar plot with matplotlib by casting your values to strings.
plt.bar(s.index.astype(str), s.values)
The result looks similar to the pandas plot, except for some minor tweaks like rotated labels and bar widths. In case you are interested in tweaking some sophisticated properties, it will be easier to do with a matplotlib bar plot, because that directly returns the bar container with all the bars.
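bc = plt.bar(s.index.astype(str), s.values)  # returns a BarContainer
for bar in bc:
    # placeholder: replace with a real Rectangle setter, e.g. bar.set_edgecolor("black")
    bar.set_some_property(...)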
The pandas plot function uses Matplotlib's pyplot to do the plotting; it is essentially a shortcut.
I was similarly confused when I started visualising my data, but in the end I decided to learn matplotlib because it gives you more control over the visualisation.
I think it depends on the data you have. If you have a clean data frame and just want to plot something quickly, you can use df.plot. For example, you can group by a column and then specify the x and y axes.
If you want a more complicated graph, working directly with matplotlib is better. In the end, matplotlib will give you more options.
This is a good reference to start with: http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/understand-df-plot-in-pandas/

Matplotlib plotting range of values as a bar

I am stuck with the following problem: Using Matplotlib I need to plot an array of data, where the abscissa is a range of values (i.e. [1000..2000]), while the ordinate is represented by a single value.
I need to plot the data in a form of a bar, which starts at the value of 1000 (from the example above), and finishes at 2000. While in ordinate, the bar is located at the level of certain value defined above.
Any ideas ? I looked through various examples, but I only see bars and histograms which do something different.
Just use plot to make a wide line:
import matplotlib.pyplot as plt
# Set the cap style to "butt"; otherwise the line is drawn slightly longer than required.
plt.plot([1000, 2000], [5, 5], lw=10, color="orange", solid_capstyle="butt")
plt.yticks(range(10))
plt.xticks(range(500, 3000, 500))
plt.margins(0.5)
plt.show()
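As an alternative to the wide line (a different technique from the one above, shown only as a sketch), a horizontal bar with an explicit left edge also spans a range on the x-axis. Unlike lw, which is measured in points, the bar height here is in data units. The names x_start, x_end, and level are illustrative.

```python
import matplotlib.pyplot as plt

x_start, x_end, level = 1000, 2000, 5

fig, ax = plt.subplots()
# A horizontal bar from x_start to x_end, centred vertically on `level`
bars = ax.barh(y=level, width=x_end - x_start, left=x_start,
               height=0.4, color="orange")
ax.set_yticks(range(10))
ax.set_xticks(range(500, 3000, 500))

rect = bars[0]  # the single Rectangle in the returned BarContainer
print(rect.get_x(), rect.get_x() + rect.get_width())
# call plt.show() to display the figure
```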

Compare 1 independent vs many dependent variables using seaborn pairplot in an horizontal plot

The pairplot function from seaborn allows plotting pairwise relationships in a dataset.
According to the documentation (highlight added):
By default, this function will create a grid of Axes such that each variable in data will by shared in the y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.
It is also possible to show a subset of variables or plot different variables on the rows and columns.
I could find only one example of subsetting different variables for rows and columns, here (it's the 6th plot under the Plotting pairwise relationships with PairGrid and pairplot() section). As you can see, it's plotting many independent variables (x_vars) against the same single dependent variable (y_vars) and the results are pretty nice.
I'm trying to do the same plotting a single independent variable against many dependent ones.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
ages = np.random.gamma(6,3, size=50)
data = pd.DataFrame({"age": ages,
                     "weight": 80*ages**2/(ages**2+10**2)*np.random.normal(1,0.2,size=ages.shape),
                     "height": 1.80*ages**5/(ages**5+12**5)*np.random.normal(1,0.2,size=ages.shape),
                     "happiness": (1-ages*0.01*np.random.normal(1,0.3,size=ages.shape))})
pp = sns.pairplot(data=data,
                  x_vars=['age'],
                  y_vars=['weight', 'height', 'happiness'])
The problem is that the subplots get arranged vertically, and I couldn't find a way to change it.
I know that the tiling would then not be as neat, since the y-axis would have to be labeled on every subplot. I also know I could generate the plots by hand with something like this:
fig, axes = plt.subplots(ncols=3)
for i, yvar in enumerate(['weight', 'height', 'happiness']):
    axes[i].scatter(data['age'], data[yvar])
Still, I'm learning to use seaborn and I find its interface very convenient, so I wonder if there's a way. Also, this example is pretty simple, but for more complex datasets seaborn handles many things for you (hue, to start) that would quickly make the raw-matplotlib approach much more complex.
You can achieve what it seems you are looking for by swapping the variable names passed to the x_vars and y_vars parameters. So revisiting the sns.pairplot portion of your code:
pp = sns.pairplot(data=data,
                  y_vars=['age'],
                  x_vars=['weight', 'height', 'happiness'])
Note that all I've done here is swap x_vars for y_vars. The plots should now be displayed horizontally:
The x-axis will now be unique to each plot with a common y-axis determined by the age column.
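The grid layout follows the lengths of y_vars (rows) and x_vars (columns), which can be checked on the returned PairGrid. The sketch below uses a simplified, noise-free version of the question's data with only two dependent variables; pp_vertical and pp_horizontal are illustrative names.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
ages = rng.gamma(6, 3, size=50)
data = pd.DataFrame({
    "age": ages,
    "weight": 80 * ages**2 / (ages**2 + 10**2),
    "height": 1.80 * ages**5 / (ages**5 + 12**5),
})

# One x variable, two y variables: two rows, one column
pp_vertical = sns.pairplot(data=data, x_vars=["age"], y_vars=["weight", "height"])

# Swapped: one y variable, two x variables: one row, two columns
pp_horizontal = sns.pairplot(data=data, x_vars=["weight", "height"], y_vars=["age"])

print(pp_vertical.axes.shape, pp_horizontal.axes.shape)
# call plt.show() (from matplotlib.pyplot) to display the figures
```

Note that the swap also moves age to the y-axis, so the independent variable is no longer on the horizontal axis of each subplot.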
