How to access/change properties of individual points on matplotlib scatter plot - python

Is there a way to modify properties of individual points on a matplotlib scatter plot, for example to make certain points invisible or change their size/shape?
Let's consider example data set using pandas.DataFrame():
import pandas as pd
import matplotlib.pyplot as plt
import random
df = pd.DataFrame()
df['name'] = ['cat', 'dog', 'bird', 'fish', 'frog']
df['id'] = [1, 1, 1, 2, 2]
df['x'] = [random.randint(-10, 10) for n in range(5)]
df['y'] = [random.randint(-10, 10) for n in range(5)]
Let's plot it on scatter plot:
sc = plt.scatter(df['x'].tolist(), df['y'].tolist())
plt.show()
#easy-peasy
Plot was generated.
Let's say I want all data points that have id=1 in df removed from the existing plot (for example with a button click). By removed I don't necessarily mean deleted. Setting them invisible or something similar will be OK. In general I'm interested in a way to iterate over each point existing on the plot and do something with it.
EDIT #1
Using the inspect module I noticed that the sc plot object holds a property named sc._offsets.
It seems to be a 2D numpy array holding the coordinates of the data points on the scatter plot (for a 2D plot).
This _offsets property consists of three components: "data" (a 2D array of coordinates), "mask" (a 2D array of bool values, in this case all False) and "fill value", which seems to be of no concern to me.
I've managed to remove points of choice from the scatter plot by deleting _offsets elements at certain indexes like this:
sc._offsets = numpy.delete(sc._offsets, [0, 1, 3], axis=0)
and then re-drawing the plot:
sc.figure.canvas.draw()
Since values in 'id' column of the dataframe and coordinates in sc._offsets are aligned, I can remove coordinates by index where 'id' value was (for example) = 1.
This does what I wanted because the original dataframe with the dataset remains intact, so I can re-create points on the scatter plot on demand.
I think I could use the "mask" to somehow hide/show points of choice on scatter plot but I don't yet know how. I'm investigating it.
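The same row removal can probably be done without touching the private _offsets attribute. The sketch below is an assumption on my side, not part of the original post: it uses the public set_offsets() and get_offsets() methods of the object returned by scatter, and fixed coordinates instead of random ones so the outcome is reproducible.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Fixed coordinates (made up for this sketch) instead of random ones.
df = pd.DataFrame({
    "name": ["cat", "dog", "bird", "fish", "frog"],
    "id": [1, 1, 1, 2, 2],
    "x": [-4, 0, 3, 7, -8],
    "y": [5, -2, 9, 1, -6],
})

fig, ax = plt.subplots()
sc = ax.scatter(df["x"], df["y"])

# Keep only the rows whose id is not 1 and hand their coordinates to the plot.
keep = df["id"] != 1
sc.set_offsets(df.loc[keep, ["x", "y"]].to_numpy())
fig.canvas.draw_idle()

# The dataframe is untouched, so every point can be restored later with:
# sc.set_offsets(df[["x", "y"]].to_numpy())
remaining = sc.get_offsets()
```

Because the coordinates are rebuilt from the dataframe each time, the plot and the data cannot drift out of alignment.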

SOLVED
The answer is to set the mask of the numpy.ma.MaskedArray that sits behind sc._offsets, via the sc._offsets.mask property of the matplotlib scatter plot.
This can be done in the following way both during plot generation and after plot has been generated, in interactive mode:
#before change:
#sc._offsets.mask = [[False, False], [False, False], [False, False], [False, False], [False, False]]
sc._offsets.mask = [[1, 1], [1, 1], [1, 1], [0, 0], [0, 0]]
#after change:
#sc._offsets.mask = [[True, True], [True, True], [True, True], [False, False], [False, False]]
#then re-draw plot
sc.figure.canvas.draw() #docs say that it's better to use draw_idle() but I don't see difference
Setting the mask to True at the index of a point you would like to exclude removes that particular point from the plot. It does not delete it: points can be restored by setting the bool values back to False. Note that the mask is a 2D array, so passing a simple [1, 1, 1, 0, 0] will not do; you need to account for both the x and y coordinates of each point.
Consult numpy docs for details:
https://numpy.org/doc/stable/reference/maskedarray.generic.html#accessing-the-mask
I'll edit if something comes up.
Thank you all for help.
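For anyone who wants to build the 2D mask from the dataframe instead of typing it by hand, here is a minimal sketch. It assumes that set_offsets() keeps the mask of a masked array, which I believe holds for recent Matplotlib versions but have not checked for old ones. The coordinates are made up, and hide and mask2d are names chosen only for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "x": [-4, 0, 3, 7, -8],
    "y": [5, -2, 9, 1, -6],
})

fig, ax = plt.subplots()
sc = ax.scatter(df["x"], df["y"])

# One flag per point, shape (5,): True means "hide this point".
hide = (df["id"] == 1).to_numpy()
# The mask needs one flag per coordinate, so repeat it for x and y: shape (5, 2).
mask2d = np.column_stack([hide, hide])

xy = df[["x", "y"]].to_numpy(dtype=float)
# Assumption: the mask survives set_offsets() in the installed Matplotlib.
sc.set_offsets(np.ma.masked_array(xy, mask=mask2d))
fig.canvas.draw_idle()
```

Passing the same coordinates with an all-False mask should bring the hidden points back.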

A basic solution: if your dataset is not a big one and you know the conditions that distinguish the data you want to plot differently, you can create one column per condition and plot each one with different markers and colors.
Suppose you want to plot differently the points whose y is greater than 3:
import pandas as pd
import matplotlib.pyplot as plt
import random
df = pd.DataFrame()
df['name'] = ['cat', 'dog', 'bird', 'fish', 'frog']
df['id'] = [1, 1, 1, 2, 2]
df['x'] = [random.randint(-10, 10) for n in range(5)]
df['y'] = [random.randint(-10, 10) for n in range(5)]
_mask = df.y > 3
df.loc[_mask, 'y_case_2'] = df.y
df.loc[~_mask, 'y_case_1'] = df.y
sc = plt.scatter(df.x, df.y_case_1, marker='*', color='r')
sc = plt.scatter(df.x, df.y_case_2, marker='.', color='b')
plt.show()
df
Note: Be aware that the random data might not contain any y greater than 3. If so, run it again.
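A variation on the same idea, sketched under the assumption that helper columns are not needed: select the rows with a boolean mask and draw one scatter per group. Each call returns its own collection, so a whole group can be hidden later with set_visible(False). The coordinates are fixed here so that both groups are guaranteed to be non-empty.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Fixed data (made up for this sketch) with y values on both sides of 3.
df = pd.DataFrame({"x": [-4, 0, 3, 7, -8], "y": [5, -2, 9, 1, -6]})
above = df["y"] > 3

fig, ax = plt.subplots()
# One scatter call per condition, each with its own marker and color.
low = ax.scatter(df.loc[~above, "x"], df.loc[~above, "y"], marker="*", color="r")
high = ax.scatter(df.loc[above, "x"], df.loc[above, "y"], marker=".", color="b")

# Hide the whole "low" group without deleting anything.
low.set_visible(False)
fig.canvas.draw_idle()
```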

Related

Python Plotly hoverover information for two or more heatmaps

I am plotting 3 heatmaps in plotly on top of each other and would like to display the z value of all 3 when I hover over the (x,y) points.
I have seen that for scatter plots you can use unified x to display the info of all plots on hover. Is there a similar way to do a unified z for heatmap plots?
I have also seen that you can create a data frame of custom texts and use those as hoverlabels but that seems a little too excessive for what I'm trying to do.
Thanks
This is effectively the same answer as Plotly Python - Heatmap - Change Hovertext (x,y,z):
simulate 3 heatmaps on top of each other
build a text array holding the z values across all three layers
import pandas as pd
import numpy as np
import plotly.graph_objects as go
dfs = [pd.DataFrame(index=list("abcd"), columns=list("ab"),
                    data=np.where(np.random.randint(1, 8, [4, 2]) == 1,
                                  np.nan, np.random.randint(1, 500, [4, 2])))
       for i in range(3)]
# create text array same shape as z
text = pd.concat(dfs).groupby(level=0).agg({c:lambda v: ", ".join(v.astype(str)) for c in dfs[0].columns}).values
# figure
go.Figure([go.Heatmap(z=df.values, x=df.columns, y=df.index, name=i, text=text, hoverinfo="text")
           for i, df in enumerate(dfs)])
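To make the text array easier to follow, here is a small sketch that builds it with plain numpy on fixed values. It is an alternative formulation I am assuming is equivalent in spirit to the concat/groupby line above, not the answer's own code. The names layers, stacked and hover_text are invented for this example, and plotly is left out so that only the array construction is shown.

```python
import numpy as np
import pandas as pd

# Three small "heatmap layers" with fixed values (made up for this sketch).
layers = [
    pd.DataFrame([[1, 2], [3, 4]], index=list("ab"), columns=list("xy")),
    pd.DataFrame([[10, 20], [30, 40]], index=list("ab"), columns=list("xy")),
    pd.DataFrame([[100, 200], [300, 400]], index=list("ab"), columns=list("xy")),
]

# Stack the z values so that stacked[:, i, j] holds cell (i, j) of every layer.
stacked = np.stack([layer.to_numpy() for layer in layers])  # shape (3, 2, 2)

# For each cell, join the values of all layers into one hover string.
rows, cols = stacked.shape[1], stacked.shape[2]
hover_text = np.array([
    [", ".join(str(v) for v in stacked[:, i, j]) for j in range(cols)]
    for i in range(rows)
])
```

The resulting array has the same shape as each z matrix, which is what the text argument in the figure code above expects.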

Python: Barplot colored according to a third variable

Currently I am trying to create a Barplot that shows the amount of reviews for an app per week. The bar should however be colored according to a third variable which contains the average rating of the reviews in each week (range: 1 to 5).
I followed the instructions of the following post to create the graph: Python: Barplot with colorbar
The code works fine:
# Import Packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
# Create Dataframe
data = [[1, 10, 3.4], [2, 15, 3.9], [3, 12, 3.6], [4, 30,1.2]]
df = pd.DataFrame(data, columns = ["week", "count", "score"])
# Convert to lists
data_x = list(df["week"])
data_hight = list(df["count"])
data_color = list(df["score"])
#Create Barplot:
data_color = [x / max(data_color) for x in data_color]
fig, ax = plt.subplots(figsize=(15, 4))
my_cmap = plt.cm.get_cmap('RdYlGn')
colors = my_cmap(data_color)
rects = ax.bar(data_x, data_hight, color=colors)
sm = ScalarMappable(cmap=my_cmap, norm=plt.Normalize(1,5))
sm.set_array([])
cbar = plt.colorbar(sm)
cbar.set_label('Color', rotation=270,labelpad=25)
plt.show()
Now to the issue: As you might notice the value of the average score in week 4 is "1.2". The Barplot does however indicate that the value lies around "2.5". I understand that this stems from the following code line, which standardizes the values by dividing it with the max value:
data_color = [x / max(data_color) for x in data_color]
Unfortunately I am not able to change this command in a way that makes the colors reflect the absolute values of the scores, e.g. with an average score of 1.2 the last bar should be colored deep red, not light orange. I tried to just plug in the regular (not standardized) score values to solve the issue; however, doing so colors all bars in the same green... Since this is only my second python project, I have a hard time comprehending the process behind this matter and would be very thankful for any advice or solution.
Cheers Neil
You identified correctly that the normalization is the problem here. In the linked code by valued SO user @ImportanceOfBeingEarnest, it is defined for the interval [0, 1]. If you want another normalization range [normmin, normmax], you have to take this into account during the normalization:
# Import Packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
# Create Dataframe
data = [[1, 10, 3.4], [2, 15, 3.9], [3, 12, 3.6], [4, 30,1.2]]
df = pd.DataFrame(data, columns = ["week", "mycount", "score"])
# Not necessary to convert to lists, pandas series or numpy array is also fine
data_x = df.week
data_hight = df.mycount
data_color = df.score
#Create Barplot:
normmin=1
normmax=5
data_color = [(x-normmin) / (normmax-normmin) for x in data_color] #see the difference here
fig, ax = plt.subplots(figsize=(15, 4))
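my_cmap = plt.get_cmap('RdYlGn') # plt.cm.get_cmap is no longer available in newer matplotlib releases
colors = my_cmap(data_color)
rects = ax.bar(data_x, data_hight, color=colors)
sm = ScalarMappable(cmap=my_cmap, norm=plt.Normalize(normmin,normmax))
sm.set_array([])
cbar = plt.colorbar(sm, ax=ax) # newer matplotlib releases need the target axes here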
cbar.set_label('Color', rotation=270,labelpad=25)
plt.show()
Sample output:
Obviously, this does not check that all values are indeed within the range [normmin, normmax], so a better script would make sure that all values adhere to this specification. We could, alternatively, address this problem by clipping the values that are outside the normalization range:
#...
import numpy as np
#.....
#Create Barplot:
normmin=1
normmax=3.5
data_color = [(x-normmin) / (normmax-normmin) for x in np.clip(data_color, normmin, normmax)]
#....
You may also have noticed another change that I introduced. You don't have to provide lists; pandas Series or numpy arrays are fine, too. And if your column names do not clash with pandas methods such as count, you can access them as df.ABC instead of df["ABC"].
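As a possible shortcut, the manual arithmetic and the clipping can both be delegated to matplotlib.colors.Normalize, reusing the same object for the bars and the colorbar. This is a sketch of that variant rather than the approach used above; the colorbar label is my own choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

weeks = [1, 2, 3, 4]
counts = [10, 15, 12, 30]
scores = np.array([3.4, 3.9, 3.6, 1.2])

# Normalize maps [vmin, vmax] onto [0, 1]; clip=True pins out-of-range scores to the ends.
norm = Normalize(vmin=1, vmax=5, clip=True)
cmap = plt.get_cmap("RdYlGn")
normalized = np.asarray(norm(scores))
colors = cmap(normalized)  # one RGBA row per bar

fig, ax = plt.subplots(figsize=(15, 4))
ax.bar(weeks, counts, color=colors)
# The same norm drives the colorbar, so bar colors and scale cannot disagree.
cbar = fig.colorbar(ScalarMappable(norm=norm, cmap=cmap), ax=ax)
cbar.set_label("Average score", rotation=270, labelpad=25)
```

With the range [1, 5], the score 1.2 lands at 0.05 on the colormap, i.e. near the red end.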

Pyplot/Matplotlib: Binary data with strings on x-axis

I know it's such a basic thing, but due to ridiculous time constraints and the severity of the situation I'm forced to ask something like this:
I've got two arrays of 160 000 entries. One contains strings(names I need to use), the other contains corresponding 1's and 0's.
I'm trying to make a simple "step" graph in pyplot with the array of names along the X-axis and 0 and 1 along the Y-axis.
I have this currently:
import numpy as np
import matplotlib.pyplot as plt
data = [1, 2, 4, 5, 9]
bindata = [0,1,1,0,1,1,0,0,0,1]
xaxis = np.arange(0, data[-1] + 1)
yaxis = np.array(bindata)
plt.step(xaxis, yaxis)
plt.xlabel('Filter Degree Combinations')
plt.ylabel('Negative Or Positive')
plt.title("Car 1")
#plt.savefig('foo.png') #For saving
plt.show()
It gives me this:
But I want something like this:
I cobbled the code together from some examples, tutorials and stackoverflow questions, but I run into "ValueError: x and y must have same first dimension" so often that I'm not getting anywhere when I try to experiment my way forward.
You can achieve the desired plot by specifying the tick positions and their labels on the x-axis using plt.xticks. The first argument, range(0, 10, 2), gives the positions; the second gives the strings to show there:
import numpy as np
import matplotlib.pyplot as plt
data = [1, 2, 4, 5, 9]
bindata = [0,1,1,0,1,1,0,0,0,1]
xaxis = np.arange(0, data[-1] + 1)
yaxis = np.array(bindata)
plt.step(xaxis, yaxis)
xlabels = ['Josh', 'Anna', 'Kevin', 'Sophie', 'Steve'] # <-- specify tick-labels
plt.xlabel('Filter Degree Combinations')
plt.ylabel('Negative Or Positive')
plt.title("Car 1")
plt.xticks(range(0, 10, 2), xlabels) # <-- assign tick-labels
plt.show()
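If there is one name per data point, as in the question, the positions and labels can be derived from the arrays themselves. The following is only a sketch: the names are placeholders generated for the example, and labelling every second point is an arbitrary choice. Building x from len(bindata) also keeps x and y the same length, which avoids the ValueError mentioned in the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

bindata = [0, 1, 1, 0, 1, 1, 0, 0, 0, 1]
# Placeholder names, one per data point (hypothetical, for illustration only).
names = [f"name_{i}" for i in range(len(bindata))]

x = np.arange(len(bindata))  # same length as y by construction
y = np.array(bindata)

every = 2  # label every second point to keep the axis readable
positions = x[::every]
labels = names[::every]

fig, ax = plt.subplots()
ax.step(x, y)
ax.set_xticks(positions)
ax.set_xticklabels(labels, rotation=90)
ax.set_ylabel("Negative Or Positive")
```

With 160 000 entries, a much larger value of every would be needed for the labels to stay legible.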

Swap leafs of Python scipy's dendrogram/linkage

I generated a dendrogram plot for my dataset and I am not happy how the splits at some levels have been ordered. I am thus looking for a way to swap the two branches (or leaves) of a single split.
If we look at the code and dendrogram plot at the bottom, there are two labels, 11 and 25, split away from the rest of the big cluster. I am really unhappy about this, and would like the branch with 11 and 25 to be the right branch of the split and the rest of the cluster to be the left branch. The shown distances would still be the same, and thus the data would not be changed, just the aesthetics.
Can this be done? And how? I am specifically looking for a manual intervention because the optimal leaf ordering algorithm supposedly does not work in this case.
import numpy as np
import matplotlib.pyplot as plt
# random data set with two clusters
np.random.seed(65) # for repeatability of this tutorial
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[20,])
X = np.concatenate((a, b),)
# create linkage and plot dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, 'ward')
plt.figure(figsize=(15, 5))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    leaf_rotation=90., # rotates the x axis labels
    leaf_font_size=12., # font size for the x axis labels
)
plt.show()
I had a similar problem and solved it by using the optimal_ordering option in linkage. I attach the code and result for your case, which might not be exactly what you want but seems much improved to me.
import numpy as np
import matplotlib.pyplot as plt
# random data set with two clusters
np.random.seed(65) # for repeatability of this tutorial
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[10,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[20,])
X = np.concatenate((a, b),)
# create linkage and plot dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, 'ward', optimal_ordering = True)
plt.figure(figsize=(15, 5))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    leaf_rotation=90., # rotates the x axis labels
    leaf_font_size=12., # font size for the x axis labels
    distance_sort=False,
    show_leaf_counts=True,
    count_sort=False
)
plt.show()
result of using optimal_ordering in linkage
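For the manual swap the question asks about, one thing that might work is exchanging the two child columns of a single row in the linkage matrix before plotting. This is my own assumption, not part of the answer above: it relies on dendrogram drawing the first child of each merge on the left when count_sort and distance_sort are left off. The tiny data set is made up so the effect is easy to see.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Four 1-D points forming two obvious pairs (made up for this sketch).
X = np.array([[0.0], [1.0], [10.0], [12.0]])
Z = linkage(X, "single")

# no_plot=True returns the layout without drawing, which is enough to read the leaf order.
original_leaves = dendrogram(Z, no_plot=True)["leaves"]

# Swap the two children of the top-level merge (the last row of Z).
Z_swapped = Z.copy()
top = len(Z_swapped) - 1
Z_swapped[top, [0, 1]] = Z_swapped[top, [1, 0]]

swapped_leaves = dendrogram(Z_swapped, no_plot=True)["leaves"]
```

Only the child order of one row changes, so the merge distances, and therefore the branch heights, stay the same. For the data set in the question, the row to swap would be the one that joins the 11/25 branch with the rest of the cluster.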

pandas 3x3 scatter-matrix missing labels

I create a pandas scatter-matrix using the following code:
import numpy as np
import pandas as pd
a = np.random.normal(1, 3, 100)
b = np.random.normal(3, 1, 100)
c = np.random.normal(2, 2, 100)
df = pd.DataFrame({'A':a,'B':b,'C':c})
pd.scatter_matrix(df, diagonal='kde')
This result in the following scatter-matrix:
The first row has no ytick labels, the 3rd column has no xtick labels, and the 3rd item 'C' is not labeled.
Any idea how to complete this plot with the missing labels?
Access the subplot in question and change its settings like so.
import matplotlib.pyplot as plt
axes = pd.plotting.scatter_matrix(df, diagonal='kde') # pd.scatter_matrix in old pandas versions
ax = axes[2, 2] # your bottom-right subplot
ax.xaxis.set_visible(True)
plt.draw()
You can inspect how the scatter_matrix function goes about labeling at the link below. If you find yourself doing this over and over, consider copying the code into file and creating your own custom scatter_matrix function.
https://github.com/pydata/pandas/blob/master/pandas/tools/plotting.py#L160
Edit, in response to a rejected comment:
The obvious extensions of this, doing axes[0, 0].xaxis.set_visible(True) and so forth, do not work. For some reason, scatter_matrix seems to set up ticks and labels for axes[2, 2] without making them visible, but it does not set up ticks and labels for the rest. If you decide that it is necessary to display ticks and labels on other subplots, you'll have to dig deeper into the code linked above.
Specifically, change the conditions on the if statements to:
if i == 0
if i == n-1
if j == 0
if j == n-1
respectively. I haven't tested that, but I think it will do the trick.
Since I can't reply above, a version that does not require changing the source code, for anybody googling, is:
n = len(features)
for x in range(n):
    for y in range(n):
        sax = axes[x, y]
        if ((x%2)==0) and (y==0):
            if not sax.get_ylabel():
                sax.set_ylabel(features[-1])
            sax.yaxis.set_visible(True)
        if (x==(n-1)) and ((y%2)==0):
            sax.xaxis.set_visible(True)
        if ((x%2)==1) and (y==(n-1)):
            if not sax.get_ylabel():
                sax.set_ylabel(features[-1])
            sax.yaxis.set_visible(True)
        if (x==0) and ((y%2)==1):
            sax.xaxis.set_visible(True)
features is the list of column names
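For readers on a current pandas, where the function lives at pd.plotting.scatter_matrix, a simpler loop might be enough. It is sketched here under the assumption that showing the x axis on the bottom row and the y axis on the left column is the layout you want. I have not checked which pandas versions still show the original problem, and diagonal='hist' is used only to avoid the scipy dependency of 'kde'.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is reproducible
df = pd.DataFrame({
    "A": rng.normal(1, 3, 100),
    "B": rng.normal(3, 1, 100),
    "C": rng.normal(2, 2, 100),
})

axes = pd.plotting.scatter_matrix(df, diagonal="hist")
n = axes.shape[0]
for i in range(n):
    for j in range(n):
        ax = axes[i, j]
        ax.xaxis.set_visible(i == n - 1)  # x ticks and label only on the bottom row
        ax.yaxis.set_visible(j == 0)      # y ticks and label only on the left column
```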
