Plot multidimensional vectors in Python

I have a matrix that contains N users and K items. I want to plot that matrix in Python by treating each row as a vector with multiple coordinates. For example, a simple point plot requires X and Y. My vectors have K coordinates, and I want to plot each of those N vectors as a point to see their similarities. Can anyone help me with that?
UPDATE:
#Matrix M shape = (944, 1683)
plt.figure()
plt.imshow(M, interpolation='nearest', cmap=plt.cm.ocean)
plt.colorbar()
plt.show()
but this gave me the following result:
What I want is something like this:

It is difficult from this question to be sure if my answer is relevant, but here's my best guess. I believe deltascience is asking how multidimensional vectors are generally plotted into two-dimensional space, as would be the case with a scatter plot. I think the best answer is that some kind of dimension reduction algorithm is generally performed. In other words, you don't do this by finding the right matplotlib code; you get your data into the right shape (one list for the X axis, and another list for the Y axis) and you then plot it using a typical matplotlib approach:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
M = np.random.rand(944, 1683)
pca = PCA(n_components=2)
reduced = pca.fit_transform(M)
# We need a 2 x 944 array, not 944 by 2 (all X coordinates in one list)
t = reduced.transpose()
plt.scatter(t[0], t[1])
plt.show()
Here are some relevant links:
https://stats.stackexchange.com/questions/63589/how-to-project-high-dimensional-space-into-a-two-dimensional-plane
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
https://towardsdatascience.com/the-art-of-effective-visualization-of-multi-dimensional-data-6c7202990c57
https://www.evl.uic.edu/documents/etemadpour_choosingvisualization_springer2016.pdf
July 2019 Addendum: It didn't occur to me at the time, but another way people often visualize multi-dimensional data is with network visualization. Each multi-dimensional array in this context would be a node, and the edge weight would be something like the cosine similarity of two nodes, or the Euclidean distance. Networkx in Python has some really nice visualization options.
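The edge weights mentioned in the addendum could be computed along these lines. This is only a minimal sketch in plain NumPy: the random matrix stands in for real data, and the helper name cosine_similarity_matrix is made up for illustration. The resulting matrix could then be handed to a graph library for drawing.

```python
import numpy as np

def cosine_similarity_matrix(M):
    """Return the pairwise cosine similarity between the rows of M."""
    M = np.asarray(M, dtype=float)
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = M / norms         # scale every row to unit length
    return unit @ unit.T     # dot products of unit vectors = cosines

# Assumption: small random stand-in for the real (users x items) matrix
rng = np.random.default_rng(0)
M = rng.random((6, 20))
S = cosine_similarity_matrix(M)  # S[a, b] is the edge weight between rows a and b
```

Each off-diagonal entry of S can serve as the weight of the edge between two rows; thresholding small values keeps the graph readable.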

Related

How to plot an array as if the indices i,j were the x,y coordinates?

Hi guys first question here, looked for an answer but could not find anything, I will try to give it my best.
I am currently working on a problem in the field of Computational Physics and I am solving the Navier-Stokes equations numerically using the Finite Difference Method. It's my first time working with Python (using a Google Colaboratory notebook with Python 3). I am solving the equations for a grid of points in a two-dimensional plane. I created this grid using NumPy arrays
import numpy as np
import matplotlib.pyplot as plt
N = 10
data = np.zeros((N,N))
and then manipulating it. For example
for i in range(N):
    for j in range(N):
        data[i,j] = i
which makes the values of the array increase with index i. However, if I plot my data-array now using
x = np.arange(N)
y = np.arange(N)
plt.contourf(x, y, data)
plt.colorbar()
The result of the example:
It shows that the plotted data increases along the y-axis even though my manipulation of the array should make it increase along the x-axis.
I noticed this happens because the indexing of arrays (i,j) is different from the standard orientation of x- and y-axis, but how can I plot my data-array as if i=x and j=y?
You can use numpy's ndindex function to get the indices based on shape and then unzip the result.
x,y=list(zip(*np.ndindex((N,N))))
The data is indexed row by column, and the same index arrays can be obtained with meshgrid. If you're interested in the same manipulation, you can make the data with meshgrid as
dx,dy=np.meshgrid(np.arange(N),np.arange(N))
and then plot dx to get variation along the x axis (dy reproduces the original array, which varies along the y axis).
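Another option is to transpose the array before plotting, because contourf treats the first array axis as y (rows) and the second as x (columns). A minimal sketch, assuming the same N and data as in the question; the helper name plot_ij_as_xy is made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

N = 10
# Same array as in the question: data[i, j] = i
data = np.repeat(np.arange(N, dtype=float)[:, None], N, axis=1)

def plot_ij_as_xy(data):
    """Draw data so that index i runs along x and index j along y."""
    x = np.arange(data.shape[0])
    y = np.arange(data.shape[1])
    fig, ax = plt.subplots()
    # contourf expects Z with shape (len(y), len(x)), hence the transpose
    filled = ax.contourf(x, y, data.T)
    fig.colorbar(filled)
    return fig, ax

fig, ax = plot_ij_as_xy(data)
```

Calling plt.show() afterwards displays the figure; the values should now increase from left to right.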

Plotting a set of given points to form a closed curve in matplotlib

I have (tons of) coordinates of points for closed curve(s) sorted in x-increasing order.
When I plot them in the regular way, the result I get is this:
(circle as an example only, the shapes I currently have can be, at best, classified as amoeboids)
But the result I am looking for is something like this:
I have looked through matplotlib and I couldn't find anything. (Maybe I had my keywords wrong...?)
I have tried to reformat the data in the following ways:
Pick a point at random, find its nearest neighbor, then the next nearest neighbor, and so on.
It fails at the edges, where the data sometimes isn't too consistent (the nearest neighbor may be on the opposite side of the curve).
To account for inconsistent data, I tried to check whether the slope between two points (which are being considered as nearest neighbors) matches the previously connected slope. This fails for reasons I could not find (I spent a considerable number of hours before I gave up).
Pick x_minimum and x_maximum (and the corresponding y coordinates), draw an imaginary line between them, and sort the points on either side of the line. This fails when the curve looks like a banana.
Is there a Python package/library that can help me get to where I want? Or can you help me with ideas to sort my data points better? Thanks in advance.
EDIT:
Tried the ConcaveHull on the circle I had. Any idea why the lines are overlapping in places? Here's the image:
EDIT2:
The problem was sorted out by changing part of my code as suggested by @Reblochon Masque in the comment section of his answer.
If you don't know how your points are ordered (if you do, I recommend following that order, since it will be faster), you can use ConvexHull from scipy:
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import ConvexHull
# RANDOM DATA
x = np.random.normal(0,1,100)
y = np.random.normal(0,1,100)
xy = np.hstack((x[:,np.newaxis],y[:,np.newaxis]))
# PERFORM CONVEX HULL
hull = ConvexHull(xy)
# PLOT THE RESULTS
plt.scatter(x,y)
plt.plot(x[hull.vertices], y[hull.vertices])
plt.show()
which, for the example above, results in this:
Note that this method draws the convex hull of your points, i.e. the smallest convex polygon that encloses them.
Here is an example that may do what you want and solve your problem:
more info here
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import ConvexHull
points = np.random.rand(30, 2) # 30 random points in 2-D
hull = ConvexHull(points)
#xs = np.array([point[0] for point in points])
#ys = np.array([point[1] for point in points])
#xh = np.array([point[0] for point in hull.points])
#yh = np.array([point[1] for point in hull.points])
plt.plot(points[:,0], points[:,1], 'o')
for simplex in hull.simplices:
    plt.plot(points[simplex, 0], points[simplex, 1], 'k-')
plt.plot(points[hull.vertices,0], points[hull.vertices,1], 'r--', lw=2)
plt.plot(points[hull.vertices[0],0], points[hull.vertices[0],1], 'ro')
plt.show()
The points on the convex hull are plotted separately and joined to form a polygon. You can further manipulate them if you want.
I think this is maybe a good solution (easy and cheap) to implement in your case. It will work well if your shapes are convex.
In case your shapes are not all convex, one approach that might be successful could be to sort the points according to which neighbor is closest, and draw a polygon from this sorted set.
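The nearest-neighbor ordering suggested above might be sketched as follows. This is an illustrative greedy version, not a robust solution: the helper name order_by_nearest_neighbor is made up, the circle is stand-in data, and noisy or unevenly sampled curves can still make the walk jump to the wrong side of the shape.

```python
import numpy as np

def order_by_nearest_neighbor(points, start=0):
    """Greedily order 2-D points by repeatedly stepping to the closest unvisited point."""
    points = np.asarray(points, dtype=float)
    remaining = list(range(len(points)))
    order = [remaining.pop(start)]
    while remaining:
        last = points[order[-1]]
        # distances from the current point to every unvisited point
        dists = np.linalg.norm(points[remaining] - last, axis=1)
        order.append(remaining.pop(int(np.argmin(dists))))
    return points[order]

# Stand-in data: points on a circle, sorted by increasing x as in the question
theta = np.linspace(0, 2 * np.pi, 40, endpoint=False)
circle = np.column_stack((np.cos(theta), np.sin(theta)))
circle = circle[np.argsort(circle[:, 0])]

ordered = order_by_nearest_neighbor(circle)
closed = np.vstack((ordered, ordered[:1]))  # repeat the first point to close the curve
```

Plotting closed[:, 0] against closed[:, 1] with plt.plot then draws the outline as one closed loop.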

K-means Clustering in Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
x = [916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,389,388,
380,374,371,369,357,356,340,338,328,317,316,315,313,303,283,257,255,254,245,
234,232,227,227,222,221,221,219,214,201,200,194,169,155,140]
kmeans = KMeans(n_clusters=4)
a = kmeans.fit(np.reshape(x,(len(x),1)))
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["g.","r.","y.","b."]
for i in range(len(x)):
    plt.plot(x[i], colors[labels[i]], markersize = 10)
plt.scatter(centroids[:, 0], marker = "x", s = 150, linewidths = 5, zorder = 10)
plt.show()
The code above displays 4 clusters, but they are definitely not something I want to have.
I also get an error, which makes it even worse. The output I get is in the picture below.
The error I get is: TypeError: scatter() missing 1 required positional argument: 'y'. The error is not a big deal because I don't like what I have anyway.
Following is the image of how I want my output of clusters to look like.
Your data is one-dimensional (a line). If you want a two-dimensional visualization like the picture in your post, you need two-dimensional or multi-dimensional data, for example [[1,3], [2,3], [1,5]].
After k-means the points are divided into k clusters, and you can use scatter to visualize the output. By the way, scatter takes both x and y; it is a two-dimensional visualization.
I suggest you take a look at Orange, a Python data mining tool. You can do k-means by drag and drop
and visualize the output of k-means easily.
Good luck! Data mining is fun :-)
Your data is one-dimensional.
Don't expect a pretty 2D plot without making up data.
To get rid of the error, you can set y=x. But it will not change much; the data will still lie on a one-dimensional line.
You could of course add random noise and set y to random values. But that means making up fake data.
For one-dimensional data, I recommend not using clustering at all. These algorithms are designed for complex multivariate data where you cannot afford a good statistical model anymore. One-dimensional data can be sorted, which allows for much more efficient algorithms. You can easily do KDE on such data and fit thousands of statistical distributions. This will give you a much more meaningful model with higher statistical power.
From a quick look at your plot, I'd say there are no clusters. Instead, your data looks to me like a skewed normal distribution with one clear outlier (to be expected at this data set size). Please try a more statistical approach.
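The KDE suggestion could look roughly like this. It is a minimal sketch using scipy.stats.gaussian_kde with its default bandwidth; the padding of 400 around the data range and the grid size are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Data from the question
x = np.array([916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,389,388,
              380,374,371,369,357,356,340,338,328,317,316,315,313,303,283,257,255,254,245,
              234,232,227,227,222,221,221,219,214,201,200,194,169,155,140], dtype=float)

kde = gaussian_kde(x)  # Gaussian kernels, default bandwidth rule

# Evaluate the estimated density on a grid that extends past the data range
grid = np.linspace(x.min() - 400, x.max() + 400, 500)
density = kde(grid)
```

Plotting grid against density with plt.plot shows the shape of the distribution directly, which makes a skewed bulk and a far-right outlier easy to judge by eye.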
Since you work with only one dimension, you should understand what exactly you are computing. With KMeans, you extract four average values; the best thing you can do here is draw your data as below, with four horizontal lines showing these values. I get the following picture with the code below. This picture is the 1D equivalent of the 2D picture you are showing.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
x = [916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,389,388,
380,374,371,369,357,356,340,338,328,317,316,315,313,303,283,257,255,254,245,
234,232,227,227,222,221,221,219,214,201,200,194,169,155,140]
kmeans = KMeans(n_clusters=4)
a = kmeans.fit(np.reshape(x,(len(x),1)))
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["g.","r.","y.","b."]
for i in centroids: plt.plot( [0, len(x)-1],[i,i], "k" )
for i in range(len(x)):
    plt.plot(i, x[i], colors[labels[i]], markersize = 10)
plt.show()
Computing kmeans with 1D data is more interesting with curves like the following one (from the page http://lasp.colorado.edu/home/sorce/2013/01/28/the-sorce-mission-celebrates-ten-years/) because you can obviously see two distinct average values:

How to plot a multi-dimensional data point in python

Some background first:
I want to plot Mel-Frequency Cepstral Coefficients of various songs and compare them.
I calculate MFCC's throughout a song and then average them to get one array of 13 coefficients. I want this to represent one point on a graph that I plot.
I'm new to Python and very new to any form of plotting (though I've seen some recommendations to use matplotlib).
I want to be able to visualize this data. Any thoughts on how I might go about doing this?
Firstly, if you want to represent an array of 13 coefficients as a single point in your graph, then you need to break the 13 coefficients down to the number of dimensions in your graph as yan king yin pointed out in his comment.
For projecting your data into 2 dimensions you can either create relevant indicators yourself, such as max/min/standard deviation, or apply methods of dimensionality reduction such as PCA.
Whether or not to do so and how to do so is another topic.
Then, plotting is easy and is done as here:
http://matplotlib.org/api/pyplot_api.html
I provide example code for this solution:
import matplotlib.pyplot as plt
import numpy as np
#fake example data
song1 = np.asarray([1, 2, 3, 4, 5, 6, 2, 35, 4, 1])
song2 = song1*2
song3 = song1*1.5
#list of arrays containing all data
data = [song1, song2, song3]
#calculate 2d indicators
def indic(data):
    # alternatively you can calculate any other indicators
    # (names chosen so the built-ins max and min are not shadowed)
    data_max = np.max(data, axis=1)
    data_min = np.min(data, axis=1)
    return data_max, data_min
x,y = indic(data)
plt.scatter(x, y, marker='x')
plt.show()
The result looks like this:
Yet I want to suggest another solution to your underlying problem, namely plotting multidimensional data.
I recommend using something like a parallel coordinate plot, which can be constructed with the same fake data:
import pandas as pd
pd.DataFrame(data).T.plot()
plt.show()
Then the result shows all coefficients for each song along the x axis and their value along the y axis. It looks as follows:
UPDATE:
In the meantime I have discovered the Python Image Gallery, which contains two nice examples of high-dimensional visualization with reference code:
Radar chart
Parallel plot
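For the parallel plot, pandas ships a helper, pandas.plotting.parallel_coordinates, that labels each line by a class column. The sketch below reuses the fake song data from the answer; the column names c0..c9 and the "song" label column are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Fake example data, same values as in the answer above
song1 = np.asarray([1, 2, 3, 4, 5, 6, 2, 35, 4, 1])
rows = [song1, song1 * 2, song1 * 1.5]

# One row per song, one column per coefficient (hypothetical column names)
frame = pd.DataFrame(rows, columns=[f"c{k}" for k in range(len(song1))])
frame["song"] = ["song1", "song2", "song3"]  # label column used for the legend

fig, ax = plt.subplots()
parallel_coordinates(frame, "song", ax=ax)
```

Call plt.show() to display it. Each song becomes one line across the coefficient axes, so songs with similar coefficients trace similar paths.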

Solving for zeroes in interpolated data in numpy/matplotlib

I have some data over a 2D range that I am interested in analyzing. These data were originally in lists x,y, and z where z[i] was the value for the point located at (x[i],y[i]). I then interpolated this data onto a regular grid using
x=np.array(x)
y=np.array(y)
z=np.array(z)
xi=np.linspace(minx,maxx,100)
yi=np.linspace(miny,maxy,100)
zi=griddata(x,y,z,xi,yi)
I then plotted the xi,yi,zi data using
plt.contour(xi,yi,zi)
plt.pcolormesh(xi,yi,zi,cmap=plt.get_cmap('PRGn'),norm=plt.Normalize(-10,10),vmin=-10,vmax=10)
This produced this plot:
In this plot you can see the S-like curve where the values are equal to zero (aside: the data doesn't vary as rapidly as shown in the colorbar -- that's simply a result of me normalizing the data to -10-10 when it actually extends far beyond that range; I did this to make the zero-valued region show up better -- maybe there's a better way of doing this too...).
The scattered dots are simply the points at which I have original data (yes, in this case my data was already on a regular grid). What I'm curious about is whether there is a good way for me to extract the values for which the curve is zero and obtain x,y pairs that, if plotted as a line, would trace that zero-region in the colormesh. I could interpolate to a really fine grid and then just brute force search for the values which are closest to zero. But is there a more automatic way of doing this, or a more automatic way of plotting this "zero-line"?
And a secondary question: I am using griddata correctly, right? I have these simple 1D arrays although elsewhere people use various meshgrids, loading texts, etc., before calling griddata.
Here is a full example:
import numpy as np
import matplotlib.pyplot as plt
y, x = np.ogrid[-1.5:1.5:200j, -1.5:1.5:200j]
f = (x**2 + y**2)**4 - (x**2 - y**2)**2
plt.figure(figsize=(9,4))
plt.subplot(121)
extent = [np.min(x), np.max(x), np.min(y), np.max(y)]
cs = plt.contour(f, extent=extent, levels=[0.1],
                 colors=["b", "r"], linestyles=["solid", "dashed"], linewidths=[2, 2])
plt.subplot(122)
# get the points on the lines
# (cs.collections was deprecated in Matplotlib 3.8; on newer versions use cs.allsegs)
for c in cs.collections:
    data = c.get_paths()[0].vertices
    plt.plot(data[:,0], data[:,1],
             color=c.get_color()[0], linewidth=c.get_linewidth()[0])
plt.show()
Here is the output:
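To pull out the zero line itself as x,y pairs, one possibility is to ask contour for the single level 0 and read the vertices from the allsegs attribute of the returned ContourSet. This is a sketch under assumptions: the grid and the function below are stand-ins (zero on the unit circle), not the data from the question.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in gridded data: zi is zero exactly on the unit circle
xi = np.linspace(-2, 2, 201)
yi = np.linspace(-2, 2, 201)
X, Y = np.meshgrid(xi, yi)
zi = X**2 + Y**2 - 1.0

fig, ax = plt.subplots()
cs = ax.contour(xi, yi, zi, levels=[0.0])

# allsegs[k] is a list of (n, 2) vertex arrays for the k-th requested level;
# only one level was requested, so index 0 holds the zero line(s)
zero_lines = cs.allsegs[0]
for seg in zero_lines:
    ax.plot(seg[:, 0], seg[:, 1], "k--")
```

Each array in zero_lines holds the x,y pairs of one connected piece of the zero contour, interpolated between grid points, so no brute-force search on a finer grid is needed.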
