I am currently trying to create a hierarchical clustering using seaborn.
My code for this is currently
clustermap = sns.clustermap(df, metric='yule', col_cluster=True,
figsize=(7, 5))
where my dataframe has 10 row names (strings) and a couple of thousand column names (numeric, ascending 0-3000), with every value being 0 or 1.
When I try this with euclidean and other metrics, I have no issues. However when trying this with the Yule distance I get "ValueError: The condensed distance matrix must contain only finite values.".
I checked that there were no NA or blank values in df, and tried df.replace(np.nan, 0) to double-check that this was not the issue.
Additionally I could not find any values other than 0 or 1 in the dataframe.
A small example of what's going on is below:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# initialize list of lists
data = [[1,0,0,0,0,0], [1,0,1,1,0,1], [0,0,1,0,1,0], [0,0,1,0,1,0], [0,0,1,0,1,0], [0,0,1,0,1,0], [0,0,1,0,1,0], [0,0,1,0,1,0], [0,0,1,0,1,0], [0,0,1,0,1,0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['col1','col2','col3','col4','col5','col6'])
df = df.transpose()
print(df)
clustermap = sns.clustermap(df, metric='yule', col_cluster=True, figsize=(8, 6))
plt.show()
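For what it's worth, clustermap hands the data to scipy's pdist under the hood, so the failing step can be reproduced directly to see where the non-finite values come from. A sketch of that diagnostic (note that the all-zero second column in the example data becomes an all-zero row after the transpose, which can make Yule's denominator zero on some SciPy versions):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# same data as the example above; the second column is all zeros,
# so it becomes an all-zero row after df.transpose()
data = [[1,0,0,0,0,0], [1,0,1,1,0,1]] + [[0,0,1,0,1,0]] * 8
X = np.array(data).T                 # shape (6, 10), like df.transpose()

d = pdist(X, metric='yule')          # the condensed matrix clustermap builds internally
# on SciPy versions where Yule's 0/0 case is not special-cased, d contains nan here
if not np.isfinite(d).all():
    bad = np.argwhere(~np.isfinite(squareform(d)))
    print("non-finite distances between rows:", bad)
```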
Is there some check that I am missing specific to this metric? How can I fix this?
Thanks!!
I'm using OpenCV's cv2.HoughCircles function in Python.
I want to find circles in an image like this:
In this image there is a big circle and many little circles. I want only the biggest.
The image is 280x300 pixels. If I set the parameters minRadius=80 and maxRadius=150,
circles = cv2.HoughCircles(edges, cv2.cv.CV_HOUGH_GRADIENT, 1, 30,
                           param1=20,
                           param2=10,
                           minRadius=80,
                           maxRadius=150)
print(circles)
I find an output like this:
[[[ 149.5 125.5 141.63510132]
[ 141.5 155.5 112.5544281 ]
[ 173.5 144.5 103.35617828]
[ 115.5 134.5 98.32852936]
[ 173.5 105.5 87.82083893]
[ 174.5 176.5 85.20856476]
[ 130.5 99.5 83.69289398]
[ 105.5 165.5 81.62413788]
[ 141.5 187.5 80.62567902]
[ 75.5 134.5 104.03124237]]]
So all of these circles are possible candidates, but one of them is probably better than the others. How can I find it?
The function's fourth parameter is the minimum distance between the centers of the detected circles. To find only one circle, set this parameter to a larger value.
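Alternatively, if you keep minDist small, you can pick the winner yourself: each row of the result is (center_x, center_y, radius), so selecting the largest candidate is a one-liner. A small sketch using the first rows of the output above (as far as I know, HoughCircles also sorts rows by accumulator votes, so circles[0][0] is the most confident detection):

```python
import numpy as np

# candidate circles as returned by cv2.HoughCircles: shape (1, n, 3),
# each row is (center_x, center_y, radius)
circles = np.array([[[149.5, 125.5, 141.63510132],
                     [141.5, 155.5, 112.5544281],
                     [173.5, 144.5, 103.35617828]]])

candidates = circles[0]
biggest = candidates[np.argmax(candidates[:, 2])]  # row with the largest radius
print(biggest)  # the circle with radius ~141.6
```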
I am struggling with NumPy's implementation of the fast Fourier transform. My signal is not periodic, so it is certainly not an ideal candidate, but the result of the FFT is far from what I was expecting: it is the same signal, simply stretched by some factor. I plotted a sine curve approximating my signal next to it, which should illustrate that I use the FFT function correctly:
import numpy as np
from matplotlib import pyplot as plt
signal = np.array([[ 0.], [ 0.1667557 ], [ 0.31103874], [ 0.44339886], [ 0.50747922],
[ 0.47848347], [ 0.64544846], [ 0.67861755], [ 0.69268326], [ 0.71581176],
[ 0.726552 ], [ 0.75032795], [ 0.77133769], [ 0.77379966], [ 0.80519187],
[ 0.78756476], [ 0.84179849], [ 0.85406538], [ 0.82852684], [ 0.87172407],
[ 0.9055542 ], [ 0.90563205], [ 0.92073452], [ 0.91178145], [ 0.8795554 ],
[ 0.89155587], [ 0.87965686], [ 0.91819571], [ 0.95774404], [ 0.95432073],
[ 0.96326252], [ 0.99480947], [ 0.94754962], [ 0.9818627 ], [ 0.9804966 ],
[ 1.], [ 0.99919711], [ 0.97202208], [ 0.99065786], [ 0.90567128],
[ 0.94300558], [ 0.89839004], [ 0.87312245], [ 0.86288378], [ 0.87301008],
[ 0.78184963], [ 0.73774451], [ 0.7450479 ], [ 0.67291666], [ 0.63518575],
[ 0.57036157], [ 0.5709147 ], [ 0.63079811], [ 0.61821523], [ 0.49526048],
[ 0.4434457 ], [ 0.29746173], [ 0.13024641], [ 0.17631683], [ 0.08590552]])
sinus = np.sin(np.linspace(0, np.pi, 60))
plt.plot(signal)
plt.plot(sinus)
The blue line is my signal, the green line is the sine.
transformed_signal = abs(np.fft.fft(signal)[:30] / len(signal))
transformed_sinus = abs(np.fft.fft(sinus)[:30] / len(sinus))
plt.plot(transformed_signal)
plt.plot(transformed_sinus)
The blue line is transformed_signal, the green line is the transformed_sinus.
Plotting only transformed_signal illustrates the behavior described above:
Can someone explain to me what's going on here?
UPDATE
It was indeed a problem with how I called the FFT. This is the correct call and the correct result:
transformed_signal = abs(np.fft.fft(signal,axis=0)[:30] / len(signal))
NumPy's fft is applied along the last axis by default, i.e. over rows. Since your signal variable is a column vector, fft is applied to each row of a single element and returns the one-point FFT of each element.
Use the axis argument of fft to specify that you want the FFT applied over the columns of signal, i.e.:
transformed_signal = abs(np.fft.fft(signal,axis=0)[:30] / len(signal))
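To see why the axis matters, compare the default call with the axis=0 call on a column vector; the default transforms each length-1 row, which is a no-op (a minimal sketch):

```python
import numpy as np

col = np.sin(np.linspace(0, np.pi, 60))[:, None]  # column vector, shape (60, 1)

default = np.fft.fft(col)           # FFT along the last axis: 60 one-point FFTs
correct = np.fft.fft(col, axis=0)   # FFT down the column: one 60-point FFT

# the one-point FFT of a sample is the sample itself, so `default`
# is just the input again, while `correct` is the actual spectrum
print(np.allclose(default, col))    # True
```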
[EDIT] I overlooked the crucial point stated by Stelios! Nevertheless, I leave my answer here: while it does not spot the root cause of your trouble, it is still true and contains things you have to reckon with for a usable FFT.
As you say, you're transforming a non-periodic signal.
Your signal has some ripples (higher harmonics) which nicely show up in the FFT.
The sine has far fewer high-frequency components and consists largely of a DC component.
So far so good. What I don't understand is that your signal also has a DC component, which doesn't show up at all. Could be that this is a matter of scale.
The core of the matter is that while the sine and your signal look quite similar, they have totally different harmonic content.
Most notably, neither contains a frequency component corresponding to the half sine, because a 'half sine' cannot be built by summing whole sines. In other words: the underlying full sine wave is not in the spectral content of a sine over half its period.
BTW, having only 60 samples is a bit meager. Shannon states that your sampling frequency should be at least twice the highest signal frequency; otherwise aliasing happens (frequencies get mapped to the wrong place). In other words, your signal should appear visually smooth after sampling (unless, of course, it is discontinuous or has a discontinuous derivative, like a square or triangle wave). In your case, the sharp peaks look like an artifact of undersampling.
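The claim that a half sine is not spectrally a single sine is easy to verify numerically: its DFT shows a strong DC term plus a decaying tail of harmonics (roughly falling off as 1/(4k^2 - 1)), rather than one dominant bin. A small check, assuming the same 60-sample grid as above:

```python
import numpy as np

n = 60
half_sine = np.sin(np.linspace(0, np.pi, n))
spectrum = np.abs(np.fft.fft(half_sine)) / n

# a strong DC component (mean of the signal, ~2/pi) plus decaying harmonics;
# the energy is spread over several bins, not concentrated in a single one
print(spectrum[:4])
```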
I have a question similar to the question asked here:
simple way of fusing a few close points. I want to replace points that are located close to each other with the average of their coordinates; the closeness is specified by the user (I am talking about Euclidean distance).
In my case I have a lot of points (about 1 million). The linked method works, but it is time-consuming since it uses a double for loop.
Is there a faster way to detect and fuse close points in a numpy 2d array?
To be complete I added an example:
points = np.array([[ 382.49056159, 640.1731949 ],
[ 496.44669161, 655.8583119 ],
[ 1255.64762859, 672.99699399],
[ 1070.16520917, 688.33538171],
[ 318.89390168, 718.05989421],
[ 259.7106383 , 822.2 ],
[ 141.52574427, 28.68594436],
[ 1061.13573287, 28.7094536 ],
[ 820.57417943, 84.27702407],
[ 806.71416007, 108.50307828]])
A scatterplot of the points is visible below. The red circle indicates the points located close to each other (in this case a distance of 27.91 between the last two points in the array). So if the user would specify a minimum distance of 30 these points should be fused.
In the output of the fuse function, the last two points are fused. It will look like:
#output
array([[ 382.49056159, 640.1731949 ],
[ 496.44669161, 655.8583119 ],
[ 1255.64762859, 672.99699399],
[ 1070.16520917, 688.33538171],
[ 318.89390168, 718.05989421],
[ 259.7106383 , 822.2 ],
[ 141.52574427, 28.68594436],
[ 1061.13573287, 28.7094536 ],
[ 813.64416975, 96.390051175]])
If you have a large number of points then it may be faster to build a k-D tree using scipy.spatial.KDTree, then query it for pairs of points that are closer than some threshold:
import numpy as np
from scipy.spatial import KDTree
tree = KDTree(points)
rows_to_fuse = tree.query_pairs(r=30)
print(rows_to_fuse)
# {(8, 9)}
for i, j in rows_to_fuse:
    print(points[[i, j]])
# [[ 820.57417943   84.27702407]
#  [ 806.71416007  108.50307828]]
The major advantage of this approach is that you don't need to compute the distance between every pair of points in your dataset.
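To go from the pairs the query returns to the fused array, one possible sketch (assuming, as in the example, that each point belongs to at most one close pair; chains of overlapping pairs would need a union-find style grouping):

```python
import numpy as np
from scipy.spatial import KDTree

def fuse_close_points(points, r):
    """Replace each pair of points closer than r by the average of the two."""
    tree = KDTree(points)
    pairs = tree.query_pairs(r=r)                     # set of (i, j) index pairs
    in_pair = {i for pair in pairs for i in pair}
    keep = [p for i, p in enumerate(points) if i not in in_pair]
    merged = [points[[i, j]].mean(axis=0) for i, j in pairs]
    return np.array(keep + merged)

# the last two points are ~27.9 apart and collapse to their mean
points = np.array([[ 141.52574427,   28.68594436],
                   [1061.13573287,   28.7094536 ],
                   [ 820.57417943,   84.27702407],
                   [ 806.71416007,  108.50307828]])
out = fuse_close_points(points, 30)
print(out.shape)  # (3, 2)
```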
You can use scipy's distance functions such as pdist in order to quickly find which points should be merged:
import numpy as np
from scipy.spatial.distance import pdist, squareform
d = squareform(pdist(points))
d = np.ma.array(d, mask=np.isclose(d, 0))  # mask the zero self-distances
points[d.min(axis=1) < 30]
#array([[ 820.57417943,   84.27702407],
#       [ 806.71416007,  108.50307828]])
NOTE
For large samples this method can cause memory errors since it is storing a full matrix containing the relative distances.
I am trying to get the vertices of the hexagons drawn by the hexbin() method, using matplotlib and Python. I got the number of points in each hexagon using .get_array(), and tried to get the vertex coordinates with get_paths(), but it returns just one Path (i.e. the vertices of a single hexagon).
How can I retrieve the vertices of all the hexagons? The code I tried is below, with its output.
import numpy as np
from matplotlib import pyplot as plt

x, y = np.random.normal(size=(2, 10000))
fig, ax = plt.subplots()
im = ax.hexbin(x, y, gridsize=20)
paths=im.get_paths()
print(paths)
fig.colorbar(im, ax=ax)
(The generated plot contains many hexagons; I can't upload the image here due to account restrictions since I am new to this site.)
[Path(array([[ 0.18489907, -0.1102285 ],
[ 0.18489907, 0.1102285 ],
[ 0. , 0.22045701],
[-0.18489907, 0.1102285 ],
[-0.18489907, -0.1102285 ],
[ 0. , -0.22045701],
[ 0.18489907, -0.1102285 ]]), array([ 1, 2, 2, 2, 2, 2, 79], dtype=uint8))]
I am answering my own question with the way I solved it, and it worked.
Step 1: Get the sample hexagon co-ordinates with im.get_paths() as done above in the question.
Step 2: Get offsets of all the hexagons created by hexbin() method using im.get_offsets(). It gives x and y offsets for all the hexagons.
Step 3: Just add the offsets to the sample 'path' co-ordinates obtained in step 1 and this will give the actual co-ordinates of the hexagon collection.
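The three steps above can be sketched as follows, broadcasting the single template hexagon over all offsets:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
from matplotlib import pyplot as plt

x, y = np.random.normal(size=(2, 10000))
fig, ax = plt.subplots()
im = ax.hexbin(x, y, gridsize=20)

template = im.get_paths()[0].vertices     # step 1: template hexagon (7 closed points)
centers = np.asarray(im.get_offsets())    # step 2: one (x, y) offset per hexagon
all_vertices = centers[:, None, :] + template[None, :, :]  # step 3: broadcast add
print(all_vertices.shape)                 # (number_of_hexagons, 7, 2)
```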