Removing datapoints outside interval for both axes of a plot - python

I am trying to plot some data using matplotlib.
import numpy as np
import matplotlib.pyplot as plt
x_data = np.arange(0,100)
y_data = np.random.randint(11, size=(100,))
plt.plot(x_data, y_data)
plt.show()
This, of course, works fine. However, I would like to remove the data that is outside a given interval (e.g. 4 <= y_data <= 6). For the y_data, this is done by
y_data_2 = [x for x in y_data if 4 <= x <= 6]
However, since the first dimensions are no longer equal, you are no longer able to plot y_data_2 vs. x_data. If you try to
plt.plot(x_data, y_data_2)
you will, of course, get an error stating that
ValueError: x and y must have same first dimension, but have shapes (100,) and (35,)
My question is thus twofold: is there a simple way for me to remove the equivalent datapoints in x_data? Also, is there a way I could find the indices of the points that are to be removed?
Thank you.

You can use masking together with indexing. Here you create a boolean mask that is True for the y values that lie between 4 and 6 (inclusive). You then apply this mask to both x_data and y_data to get the corresponding values. This way you don't need any for loops or list comprehensions.
import numpy as np
import matplotlib.pyplot as plt
x_data = np.arange(0,100)
y_data = np.random.randint(11, size=(100,))
mask = (y_data>=4) & (y_data<=6)  # True where 4 <= y <= 6
plt.plot(x_data[mask], y_data[mask], 'bo')
plt.show()
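
The second part of the question, finding the indices of the removed points, can be answered from the same mask. The following is a minimal sketch; the small fixed arrays and the names kept_idx and removed_idx are only illustrative and replace the random data so that the result is reproducible.

```python
import numpy as np

# Small fixed example instead of random data, so the result is reproducible.
x_data = np.arange(8)
y_data = np.array([0, 4, 5, 9, 6, 3, 10, 4])

# True where the point should be kept (4 <= y <= 6).
mask = (y_data >= 4) & (y_data <= 6)

# Integer indices of the kept points and of the removed points.
kept_idx = np.flatnonzero(mask)
removed_idx = np.flatnonzero(~mask)

# The same mask filters both arrays, so their lengths stay equal.
x_kept, y_kept = x_data[mask], y_data[mask]

print(kept_idx)     # [1 2 4 7]
print(removed_idx)  # [0 3 5 6]
```

Inverting the mask with ~ avoids computing the condition twice, and indexing both arrays with the same mask keeps them aligned for plotting.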

First, find the indices in y_data of the values that were kept in y_data_2, then use those indices to take the matching subarray x_data_2 from x_data. Finally, plot x_data_2 against y_data_2.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
x_data = np.arange(0,100)
y_data = np.random.randint(11, size=(100,))
y = pd.Series(y_data)
y_data_2 = [x for x in y_data if 4 <= x <= 6]
index = y[y.isin(y_data_2)].index
print(index)
x_data_2 = x_data[index]
plt.plot(x_data, y_data)
plt.scatter(x_data_2, y_data_2)
plt.show()

Related

Is there a way to slice an x,y array diagonally?

I have a 3D array (time, y direction, x direction), and I want to split it up spatially. However, is there a way to slice a spatial array diagonally instead of just in y and x?
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
data = np.random.rand(100,45,60)
data_1 = data[:,0:30,0:30]
X,Y = np.meshgrid(np.arange(0,60,1),np.arange(0,45,1))
plt.contourf(X,Y,data[2])
plt.show()
plt.contourf(data_1[2])
plt.xlim(0,60)
plt.ylim(0,45)
plt.show()
The first graph shows the contour plot of data, and the second shows data_1. Is there a way to slice it diagonally, for example along the red line?
By slicing I mean selecting only sections of the 3D data array in the x and y directions, for example only the data under the red arrow.
import numpy as np
from numpy import ma
import matplotlib.pyplot as plt
data = np.random.rand(5,45,60)
data1 = data[2,0:30,0:30]
x2, y2 = np.meshgrid(np.arange(0, 30, 1), np.arange(0, 30, 1))
data1 = ma.masked_where(x2 + y2 > 30, data1)
plt.contourf(x2, y2, data1)
plt.xlim(0,60)
plt.ylim(0,45)
plt.show()
I have used a masked array above, but it is also possible to use np.where instead and set values to np.nan (the np.NaN alias was removed in NumPy 2.0):
data1 = np.where(x2 + y2 > 30, np.nan, data1)
Matplotlib will also not plot NaN values.
Setting values to NaN, however, will lose the original values, while a mask simply hides them (removing the mask will retrieve the original values). NaNs can also be tricky in comparisons. So a mask may be better.
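
The difference between the two approaches can be checked on the full 3D array. This is only a sketch under assumptions: the tiny (2, 3, 4) array, the diagonal threshold of 2, and the variable names are made up for illustration, and np.indices is used here to build the condition for every time step at once.

```python
import numpy as np
from numpy import ma

# Tiny (time, y, x) array so the result is easy to inspect.
data = np.arange(24, dtype=float).reshape(2, 3, 4)

# Index of every cell along each axis; the time index is not needed.
_, yy, xx = np.indices(data.shape)
beyond_diagonal = xx + yy > 2

masked = ma.masked_where(beyond_diagonal, data)       # hides the values
nan_filled = np.where(beyond_diagonal, np.nan, data)  # overwrites them

print(masked.count())                     # 12 cells remain visible
print(np.array_equal(masked.data, data))  # True: originals are still stored
print(int(np.isnan(nan_filled).sum()))    # 12 values were replaced by NaN
```

Because the condition has the same shape as the data, the same diagonal cut is applied to every time step.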

y axis has decreasing values instead of increasing ones for plt

I am trying to build a histogram and here is my code:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
x = ['0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31','32','33','34','35','36','38','40','41','42','43','44','45','48','50','51','53','54','57','60','64','70','77','93','104','108','147'] #sample names
y = ['164','189','288','444','311','216','122','111','92','54','45','31','31','30','18','15','15','10','4','15','2','8','6','4','7','5','3','3','1','10','3','3','3','2','4','2','1','1','1','2','2','1','1','1','1','1','2','1','2','2','2','1','1','2','1','1','1','1']
plt.bar(x, y)
plt.xlabel('Number of Methods')
plt.ylabel('Variables')
plt.show()
Here is the histogram I obtain:
I would like the values in the y axis to be in an increasing order. This means that 1 should be first followed by 3, 5, 7, etc. How can I fix this?
They're not decreasing; they're in the order in which they appear in the list, because the list items are strings and are treated as categories. Try
x = [int(i) for i in x]
y = [int(i) for i in y]
to convert them to numbers before plotting.
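
A short sketch of why the conversion matters, using a shortened version of the lists from the question: strings are compared character by character, so they do not follow numeric order.

```python
# Shortened sample of the lists from the question.
x = ['0', '1', '2', '10', '11', '20']
y = ['164', '189', '288', '45', '31', '2']

# Convert the labels and counts to integers before plotting.
x_num = [int(i) for i in x]
y_num = [int(i) for i in y]

# Strings sort character by character, so '2' lands between '189' and '288'.
print(sorted(y))      # ['164', '189', '2', '288', '31', '45']
print(sorted(y_num))  # [2, 31, 45, 164, 189, 288]
```

Passing x_num and y_num to plt.bar gives a numeric y axis that increases from bottom to top.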

How to adjust branch lengths of dendrogram in matplotlib (like in astrodendro)? [Python]

Here is my resulting plot below but I would like it to look like the truncated dendrograms in astrodendro such as this:
There is also a really cool looking dendrogram from this paper that I would like to recreate in matplotlib.
Below is the code for generating an iris data set with noise variables and plotting the dendrogram in matplotlib.
Does anyone know how to either: (1) truncate the branches like in the example figures; and/or (2) to use astrodendro with a custom linkage matrix and labels?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
import astrodendro
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial import distance
def iris_data(noise=None, palette="hls", desat=1):
    # Iris dataset
    X = pd.DataFrame(load_iris().data,
                     index = [*map(lambda x:f"iris_{x}", range(150))],
                     columns = [*map(lambda x: x.split(" (cm)")[0].replace(" ","_"), load_iris().feature_names)])
    y = pd.Series(load_iris().target,
                  index = X.index,
                  name = "Species")
    # Note: map_colors is a helper that is not defined in this post
    c = map_colors(y, mode=1, palette=palette, desat=desat)#y.map(lambda x:{0:"red",1:"green",2:"blue"}[x])
    if noise is not None:
        X_noise = pd.DataFrame(
            np.random.RandomState(0).normal(size=(X.shape[0], noise)),
            index=X.index,
            columns=[*map(lambda x:f"noise_{x}", range(noise))]
        )
        X = pd.concat([X, X_noise], axis=1)
    return (X, y, c)
def dism2linkage(DF_dism, method="ward"):
    """
    Input: A (m x m) dissimilarity Pandas DataFrame object where the diagonal is 0
    Output: Hierarchical clustering encoded as a linkage matrix
    Further reading:
    http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.linkage.html
    https://pypi.python.org/pypi/fastcluster
    """
    # Linkage matrix (DataFrame.as_matrix() was removed in pandas 1.0; use .values)
    Ar_dist = distance.squareform(DF_dism.values)
    return linkage(Ar_dist,method=method)
# Get data
X_iris_with_noise, y_iris, c_iris = iris_data(50)
# Get distance matrix
df_dism = 1- X_iris_with_noise.corr().abs()
# Get linkage matrix
Z = dism2linkage(df_dism)
#Create dendrogram
with plt.style.context("seaborn-white"):
    fig, ax = plt.subplots(figsize=(13,3))
    D_dendro = dendrogram(
        Z,
        labels=df_dism.index,
        color_threshold=3.5,
        count_sort = "ascending",
        #link_color_func=lambda k: colors[k]
        ax=ax
    )
    ax.set_ylabel("Distance")
I'm not sure this really constitutes a practical answer, but it does allow you to generate dendrograms with truncated hanging lines. The trick is to generate the plot as normal, then manipulate the resulting matplotlib plot to recreate the lines.
I couldn't get your example to work locally, so I've just created a dummy dataset.
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np
a = np.random.multivariate_normal([0, 10], [[3, 1], [1, 4]], size=[5,])
b = np.random.multivariate_normal([0, 10], [[3, 1], [1, 4]], size=[5,])
X = np.concatenate((a, b),)
Z = linkage(X, 'ward')
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
dendrogram(Z, ax=ax)
The resulting plot is the usual long-arm dendrogram.
Now for the more interesting bit. A dendrogram is made up of a number of LineCollection objects (one for each colour). To update the lines we iterate through these, extracting the details about their constituent paths, modifying these to remove any lines reaching to a y of zero, and then recreating a LineCollection for these modified paths.
The updated path is then added to the axes, and the original is removed.
The one tricky part is determining what height to draw to instead of zero. Since we are iterating over each dendrogram's path, we don't know which point came before, so we basically have no idea where we are. However, we can exploit the fact that hanging lines hang vertically. Assuming there are no two hanging lines on the same x, we can look up the other known y values for a given x and use the lowest of them as the basis for our new y. The downside is that, to make sure we have this number, we have to pre-scan the data.
Note: If you can get dendrogram hanging lines on the same x, you would need to include the y and search for nearest y above this x to do this.
import numpy as np
from matplotlib.path import Path
from matplotlib.collections import LineCollection
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
dendrogram(Z, ax=ax);
for c in ax.collections[:]: # use [:] to get a copy, since we're adding to the same list
    paths = []
    for path in c.get_paths():
        segments = []
        y_at_x = {}
        # Pre-pass over all elements, to find the lowest y value at each x value.
        # We can use this to calculate where to cut our lines.
        for n, seg in enumerate(path.iter_segments()):
            x, y = seg[0]
            # Don't store if the y is zero, or if it's higher than the current low.
            if y > 0 and y < y_at_x.get(x, np.inf):
                y_at_x[x] = y
        for n, seg in enumerate(path.iter_segments()):
            x, y = seg[0]
            if y == 0:
                # If we know the last y at this x, use it - 0.5, limit > 0
                y = max(0, y_at_x.get(x, 0) - 0.5)
            segments.append([x,y])
        paths.append(segments)
    lc = LineCollection(paths, colors=c.get_colors()) # Recreate a LineCollection with the same params
    ax.add_collection(lc)
    c.remove() # Remove the original LineCollection (ax.collections is read-only in recent Matplotlib)
The resulting dendrogram looks like this:
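
As a possible alternative to editing the drawn artists, the coordinates can be trimmed before anything is plotted. SciPy's dendrogram can be called with no_plot=True, in which case it returns a dict whose "icoord" and "dcoord" entries hold four x and four y values per link (a U shape). The sketch below is not the method used above: truncate_leaf_legs is a hypothetical helper, and the dcoord list is written by hand so the example runs without SciPy.

```python
def truncate_leaf_legs(dcoord, drop=0.5):
    """Shorten legs that reach y == 0 so they end `drop` below the merge height."""
    trimmed = []
    for left, top_left, top_right, right in dcoord:
        # A leg that starts at 0 belongs to a leaf; lift its lower end.
        new_left = max(0.0, top_left - drop) if left == 0 else left
        new_right = max(0.0, top_right - drop) if right == 0 else right
        trimmed.append([new_left, top_left, top_right, new_right])
    return trimmed


# Hand-written stand-in for the "dcoord" list: two leaves merge at height 2,
# then a third leaf joins that cluster at height 3.
dcoord = [[0.0, 2.0, 2.0, 0.0], [0.0, 3.0, 3.0, 2.0]]

print(truncate_leaf_legs(dcoord))
# [[1.5, 2.0, 2.0, 1.5], [2.5, 3.0, 3.0, 2.0]]
```

Each trimmed entry can then be drawn together with its matching "icoord" entry using ax.plot(xs, ys). Since every link carries its own merge height, no pre-scan by x is needed. One caveat of this sketch: a merge at height exactly 0 would be mistaken for a leaf.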

Plot multiple bars for categorical data

I'm looking for a way to plot multiple bars per value in matplotlib. For numerical data, this can be achieved by adding an offset to the X data, as described, for example, here:
import numpy as np
import matplotlib.pyplot as plt
X = np.array([1,3,5])
Y = [1,2,3]
Z = [2,3,4]
plt.bar(X - 0.4, Y) # offset of -0.4
plt.bar(X + 0.4, Z) # offset of 0.4
plt.show()
plt.bar() (and ax.bar()) also handle categorical data automatically:
X = ['A','B','C']
Y = [1,2,3]
plt.bar(X, Y)
plt.show()
Here, it is obviously not possible to add an offset, as the categories are not directly associated with a value on the axis. I can manually assign numerical values to the categories and set labels on the x axis with plt.xticks():
X = ['A','B','C']
Y = [1,2,3]
Z = [2,3,4]
_X = np.arange(len(X))
plt.bar(_X - 0.2, Y, 0.4)
plt.bar(_X + 0.2, Z, 0.4)
plt.xticks(_X, X) # set labels manually
plt.show()
However, I'm wondering if there is a more elegant way that makes use of the automatic category handling of bar(), especially if the number of categories and bars per category is not known in advance (this causes some fiddling with the bar widths to avoid overlaps).
There is no automatic support of subcategories in matplotlib.
Placing bars with matplotlib
You may go the way of placing the bars numerically, like you propose yourself in the question. You can of course let the code manage the unknown number of subcategories.
import numpy as np
import matplotlib.pyplot as plt
X = ['A','B','C']
Y = [1,2,3]
Z = [2,3,4]
def subcategorybar(X, vals, width=0.8):
    n = len(vals)
    _X = np.arange(len(X))
    for i in range(n):
        plt.bar(_X - width/2. + i/float(n)*width, vals[i],
                width=width/float(n), align="edge")
    plt.xticks(_X, X)
subcategorybar(X, [Y,Z,Y])
plt.show()
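
The placement arithmetic inside subcategorybar can be looked at on its own. The helper below is a hypothetical sketch (the name subcategory_edges and its return format are not part of matplotlib); it only computes the left edge of every bar and the width of a single bar, following the same formula as above.

```python
import numpy as np


def subcategory_edges(n_categories, n_series, width=0.8):
    """Return (left_edges, bar_width); left_edges has shape (n_series, n_categories)."""
    centers = np.arange(n_categories)  # one tick position per category
    bar_width = width / n_series       # each group shares `width` equally
    # Offset of each series relative to the category center.
    offsets = -width / 2 + np.arange(n_series) * bar_width
    return centers + offsets[:, None], bar_width


edges, bar_width = subcategory_edges(n_categories=3, n_series=2)
print(edges)      # row i holds the left edges of series i
print(bar_width)  # 0.4
```

With two series and the default group width of 0.8, the first series starts 0.4 to the left of each tick and the second starts at the tick itself, so the pair is centered on the category.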
Using pandas
You may also use pandas plotting wrapper, which does the work of figuring out the number of subcategories. It will plot one group per column of a dataframe.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
X = ['A','B','C']
Y = [1,2,3]
Z = [2,3,4]
df = pd.DataFrame(np.c_[Y,Z,Y], index=X)
df.plot.bar()
plt.show()

Looping within matplotlib

I am trying to plot multiple graphs on a single set of axes.
I have a 2D array of data and want to break it down into 111 1D arrays and plot them. Here is an example of my code so far:
from numpy import *
import matplotlib.pyplot as plt
x = linspace(1, 130, 130) # create a 1D array of 130 integers to set as the x axis
y = Te25117.data # set 2D array of data as y
plt.plot(x, y[1], x, y[2], x, y[3])
This code works fine, but I cannot see a way of writing a loop which will loop within the plot itself. I can only make the code work if I explicitly write a number 1 to 111 each time, which is not ideal! (The range of numbers I need to loop over is 1 to 111.)
Let me guess...long time matlab user?
Matplotlib automatically adds each new line to the current plot if you don't create a new figure. So your code can simply be:
from numpy import *
import matplotlib.pyplot as plt
x = linspace(1, 130, 130) # create a 1D array of 130 integers to set as the x axis
y = Te25117.data # set 2D array of data as y
L = len(y) # I assume you can infer the size of the data in this way...
#L = 111 # this is if you don't know any better
for i in range(L):
    plt.plot(x, y[i], color='k', linewidth=1) # replace 'k' with any valid color
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1,2])
y = np.array([[1,2],[3,4]])
In [5]: x
Out[5]: array([1, 2])
In [6]: y
Out[6]:
array([[1, 2],
[3, 4]])
In [7]: for y_i in y:
   ....:     plt.plot(x, y_i)
This plots all of the rows in one figure.
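
The loop can also be avoided altogether, because plot() accepts a 2D y and draws one line per column. The sketch below assumes the data has shape (n_curves, 130), one row per curve, which is why it is transposed; the stand-in array y is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(1, 130, 130)
# Stand-in for the 2D data: 5 rows (one per curve), 130 samples each.
y = np.arange(5)[:, None] * np.ones(130)

fig, ax = plt.subplots()
# plot() draws one line per column, so transpose to get one line per row.
lines = ax.plot(x, y.T, linewidth=1)
print(len(lines))  # 5
```

For interactive use, drop the matplotlib.use("Agg") line and call plt.show() at the end.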
