How to find Mahalanobis distance between two 1D arrays in Python?
I have two 1D arrays, and I need to find out the Mahalanobis distance between them.
Array 1
-0.125510275,0.067021735,0.140631825,-0.014300184,-0.122152582,0.002372072,-0.050777748,-0.106606245,0.149123222,-0.159149423,0.210138127,0.031959131,-0.068411253,-0.038253143,-0.024590122,0.101361006,-0.160774037,-0.183688596,-0.07163775,-0.096662685,-0.000117288,0.14251323,-0.030461289,-0.006710192,-0.217195332,-0.338565469,-0.030219197,-0.100772612,0.144092739,-0.092911556,-0.008420993,0.042907588,-0.212668449,-0.009366207,-7.01E-05,0.134508118,-0.015715659,-0.050884761,0.18804647,0.04946585,-0.242626131,0.099951334,0.053660966,0.275807977,0.216019884,-0.009127878,0.019819722,-0.043750495,0.12940146,-0.259942383,0.061821692,0.107142501,0.098196507,0.022301452,0.079412982,-0.131031215,-0.049483716,0.126781181,-0.195536733,0.077051811,0.061049294,-0.039563753,0.02573989,0.025330214,0.204785526,0.099218346,-0.050533134,-0.109173119,0.205652237,-0.168003649,-0.062734045,0.100320764,-0.063513778,-0.120843001,-0.223983109,0.075016715,0.481291831,0.107607022,-0.141365036,0.075003348,-0.042418435,-0.041501854,0.096700639,0.083469011,-0.033227846,-0.050748199,-0.045331556,0.065955319,0.26927036,0.082820699,-0.014033476,0.176714703,0.042264186,-0.011814327,0.041769091,-0.00132945,-0.114337325,-0.013483777,-0.111367472,-0.051828772,-0.022199111,0.030011443,0.015529033,0.171916366,-0.172722578,0.214662731,-0.0219073,-0.067695767,0.040487193,0.04814541,0.003313571,-0.01360167,0.115932293,-0.235844463,0.185181856,0.130868644,0.010789306,0.171733275,0.059378762,0.003508842,0.039326921,0.024174646,-0.195897669,-0.088932432,0.025385177,-0.134177506,0.08158315,0.049005955
And, Array 2
-0.120652862,0.030241199,0.146165773,-0.044423241,-0.138606027,-0.048646796,-0.00780057,-0.101798892,0.185339138,-0.210505784,0.1637595,0.015000292,-0.10359703,0.102251172,-0.043159217,0.183324724,-0.171825036,-0.173819616,-0.112194099,-0.161590934,-0.002507193,0.163269699,-0.037766434,0.041060638,-0.178659558,-0.268946916,-0.055348843,-0.11808344,0.113775767,-0.073903576,-0.039505914,0.032382272,-0.159118786,0.007761603,0.057116233,0.043675732,-0.057895001,-0.104836114,0.22844176,0.055832602,-0.245030299,0.006276659,0.140012532,0.21449241,0.159539059,-0.049584024,0.016899824,-0.074179329,0.119686954,-0.242336214,-0.001390997,0.097442642,0.059720818,0.109706804,0.073196828,-0.16272822,0.022305552,0.102650747,-0.192103565,0.104134969,0.099571452,-0.101140082,-0.038911857,0.071292967,0.202927336,0.12729995,-0.047885433,-0.165100336,0.220239595,-0.19612211,-0.075948663,0.096906625,-0.07410948,-0.108219706,-0.155030385,-0.042231761,0.484629512,0.093194947,-0.105109185,0.072906494,-0.056871444,-0.057923764,0.101847053,0.092042476,-0.061295755,-0.031595342,-0.01854251,0.074671492,0.266587347,0.052284949,0.003548023,0.171518356,0.053180017,-0.022400264,0.061757766,0.038441688,-0.139473096,-0.05759665,-0.101672307,-0.074863717,-0.02349415,-0.011674869,0.010008151,0.141401738,-0.190440938,0.216421023,-0.028323224,-0.078021556,-0.011468113,0.100600921,-0.019697987,-0.014288296,0.114862509,-0.162037179,0.171686187,0.149788797,-0.01235011,0.136169329,0.008751356,0.024811052,0.003802934,0.00500867,-0.1840965,-0.086204343,0.018549766,-0.110649876,0.068768717,0.03012047
I found that SciPy already implements the function (scipy.spatial.distance.mahalanobis). However, I am confused about what the value of IV should be. I tried the following:
import numpy as np
from scipy.spatial.distance import mahalanobis
V = np.cov(np.array([array_1, array_2]))
IV = np.linalg.inv(V)
print(mahalanobis(array_1, array_2, IV))
But, I get the following error:
File
"C:\Users\XXXXXX\AppData\Local\Continuum\anaconda3\envs\face\lib\site-packages\scipy\spatial\distance.py",
line 1043, in mahalanobis
m = np.dot(np.dot(delta, VI), delta)
ValueError: shapes (128,) and (2,2) not aligned: 128 (dim 0) != 2 (dim 0)
EDIT:
array_1 = [-0.10577646642923355, 0.09617947787046432, 0.029290344566106796, 0.02092641592025757, -0.021434104070067406, -0.13410840928554535, 0.028282659128308296, -0.12082239985466003, 0.21936850249767303, -0.06512433290481567, 0.16812698543071747, -0.03302834928035736, -0.18088334798812866, -0.04598559811711311, -0.014739632606506348, 0.06391328573226929, -0.15650317072868347, -0.13678401708602905, 0.01166679710149765, -0.13967938721179962, 0.14632365107536316, 0.025218486785888672, 0.046839646995067596, 0.09690812975168228, -0.13414686918258667, -0.2883925437927246, -0.1435326784849167, -0.17896348237991333, 0.10746842622756958, -0.09142691642045975, 0.04860316216945648, 0.031577128916978836, -0.17280976474285126, -0.059613555669784546, -0.05718057602643967, 0.0401446670293808, 0.026440180838108063, -0.017025159671902657, 0.22091664373874664, 0.024703698232769966, -0.15607595443725586, -0.0018572667613625526, -0.037675946950912476, 0.3210170865058899, 0.10884962230920792, 0.030370134860277176, 0.056784629821777344, -0.030112050473690033, 0.023124486207962036, -0.1449904441833496, 0.08885903656482697, 0.17527811229228973, 0.08804896473884583, 0.038310401141643524, -0.01704210229218006, -0.17355971038341522, -0.018237406387925148, 0.030551932752132416, -0.23085585236549377, 0.13475817441940308, 0.16338199377059937, -0.06968289613723755, -0.04330683499574661, 0.04434924200177193, 0.22637797892093658, 0.07463733851909637, -0.15070196986198425, -0.07500549405813217, 0.10863590240478516, -0.22288714349269867, 0.0010778247378766537, 0.057608842849731445, -0.12828609347343445, -0.17236559092998505, -0.23064571619033813, 0.09910193085670471, 0.46647992730140686, 0.0634111613035202, -0.13985536992549896, 0.052741192281246185, -0.1558966338634491, 0.022585246711969376, 0.10514408349990845, 0.11794176697731018, -0.06241249293088913, 0.06389056891202927, -0.14145469665527344, 0.060088545083999634, 0.09667345881462097, -0.004665130749344826, -0.07927791774272919, 0.21978208422660828, -0.0016187895089387894, 0.04876316711306572, 0.03137822449207306, 0.08962501585483551, -0.09108036011457443, -0.01795950159430504, -0.04094596579670906, 0.03533276170492172, 0.01394269522279501, -0.08244197070598602, -0.05095399543642998, 0.04305890575051308, -0.1195211187005043, 0.16731074452400208, 0.03894471749663353, -0.0222858227789402, -0.07944411784410477, 0.0614166259765625, -0.1481470763683319, -0.09113290905952454, 0.14758692681789398, -0.24051085114479065, 0.164126917719841, 0.1753545105457306, -0.003193420823663473, 0.20875433087348938, 0.03357946127653122, 0.1259773075580597, -0.00022807717323303223, -0.039092566817998886, -0.13582147657871246, -0.01937306858599186, 0.015938198193907738, 0.00787206832319498, 0.05792934447526932, 0.03294186294078827]
array_2 = [-0.1966051608324051, 0.0940953716635704, -0.0031937970779836178, -0.03691547363996506, -0.07240629941225052, -0.07114037871360779, -0.07133384048938751, -0.1283963918685913, 0.15377545356750488, -0.091400146484375, 0.10803385823965073, -0.09235749393701553, -0.1866973638534546, -0.021168243139982224, -0.09094691276550293, 0.07300164550542831, -0.20971564948558807, -0.1847742646932602, -0.009817334823310375, -0.05971141159534454, 0.09904412180185318, 0.0278592761605978, -0.012554554268717766, 0.09818517416715622, -0.1747943013906479, -0.31632938981056213, -0.0864541232585907, -0.13249783217906952, 0.002135572023689747, -0.04935726895928383, 0.010047778487205505, 0.04549024999141693, -0.26334646344184875, -0.05263081565499306, -0.013573898002505302, 0.2042253464460373, 0.06646320968866348, 0.08540669083595276, 0.12267164140939713, -0.018634958192706108, -0.19135263562202454, 0.01208433136343956, 0.09216200560331345, 0.2779296934604645, 0.1531585156917572, 0.10681629925966263, -0.021275708451867104, -0.059720948338508606, 0.06610126793384552, -0.21058350801467896, 0.005440462380647659, 0.18833838403224945, 0.08883830159902573, 0.025969548150897026, 0.0337764173746109, -0.1585341989994049, 0.02370697632431984, 0.10416869819164276, -0.19022507965564728, 0.11423652619123459, 0.09144753962755203, -0.08765758574008942, -0.0032832929864525795, -0.0051014479249715805, 0.19875964522361755, 0.07349056005477905, -0.1031823456287384, -0.10447365045547485, 0.11358538269996643, -0.24666038155555725, -0.05960353836417198, 0.07124857604503632, -0.039664581418037415, -0.20122921466827393, -0.31481748819351196, -0.006801256909966469, 0.41940364241600037, 0.1236235573887825, -0.12495145946741104, 0.12580059468746185, -0.02020396664738655, -0.03004150651395321, 0.11967054009437561, 0.09008713811635971, -0.07470540702342987, 0.09324200451374054, -0.13763070106506348, 0.07720538973808289, 0.19568027555942535, 0.036567769944667816, 0.030284458771348, 0.14119629561901093, -0.03820852190256119, 0.06232285499572754, 0.036639824509620667, 0.07704029232263565, -0.12276224792003632, -0.0035170004703104496, -0.13103705644607544, 0.027697769924998283, -0.01527332328259945, -0.04027168080210686, -0.03659897670149803, 0.03330300375819206, -0.12293602526187897, 0.09043421596288681, -0.019673841074109077, -0.07563626766204834, -0.13991905748844147, 0.014788001775741577, -0.07630413770675659, 0.00017269013915210962, 0.16345393657684326, -0.25710681080818176, 0.19869503378868103, 0.19393865764141083, -0.07422225922346115, 0.19553625583648682, 0.09189949929714203, 0.051557887345552444, -0.0008843056857585907, -0.006250975653529167, -0.1680600494146347, -0.10320111364126205, 0.03232177346944809, -0.08931156992912292, 0.11964476853609085, 0.00814182311296463]
The covariance matrix of the above arrays turns out to be singular, so I am unable to invert it. Why does it end up being a singular matrix?
EDIT 2: Solution
Since the covariance matrix here is a singular matrix, I had to pseudo-invert it using np.linalg.pinv(V).
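A minimal sketch of that workaround, assuming array_1 and array_2 are the 128-element lists from the edit above (as the answer below notes, a covariance matrix estimated from only two observations is degenerate, so the resulting distance should be treated with caution):
import numpy as np
from scipy.spatial.distance import mahalanobis
X = np.array([array_1, array_2])  # shape (2, 128)
V = np.cov(X.T)                   # 128 x 128, but rank <= 1, hence singular
IV = np.linalg.pinv(V)            # Moore-Penrose pseudo-inverse instead of inv
print(mahalanobis(array_1, array_2, IV))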
From the numpy.cov docs, the first argument should be an array m such that:
Each row of m represents a variable, and each column a single observation of all those variables.
So to fix your code just take the transpose (with .T) of your array before you call cov:
V = np.cov(np.array([array_1, array_2]).T)
IV = np.linalg.inv(V)
print(mahalanobis(array_1, array_2, IV))
I just tested this out on some random data, and I can confirm it works.
Also, calculating covariance from just two observations is a bad idea, and not likely to be very accurate. If your data is coming from an image, you should use the entire image img (or at least the entire region of interest) when calculating the covariance matrix, then use that matrix to find the Mahalanobis distance between the two vectors of interest:
V = np.cov(np.array(img))
IV = np.linalg.inv(V)
print(mahalanobis(array_1, array_2, IV))
You may or may not need to replace img with img.T, depending on how you generated array_1 and array_2 in the first place.
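As a self-contained illustration of that approach, here is a sketch with random data standing in for the image/region of interest (a hypothetical 1000 observations of 128 variables, so the 128x128 covariance matrix is well-conditioned):
import numpy as np
from scipy.spatial.distance import mahalanobis
rng = np.random.default_rng(0)
observations = rng.normal(size=(1000, 128))  # rows = observations, columns = variables
V = np.cov(observations, rowvar=False)       # 128 x 128 covariance matrix
IV = np.linalg.inv(V)                        # invertible: 1000 observations >> 128 variables
array_1, array_2 = observations[0], observations[1]
print(mahalanobis(array_1, array_2, IV))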
If you're getting singular covariance matrices, what you have is a math problem, not a code problem. It's apparently a common enough problem that the question "why is my covariance matrix singular?" has already been asked and answered. Very broadly, it happens when the data points are linearly dependent, i.e. "too similar" in some sense. Using just two data points guarantees it here: the sample covariance of 2 observations of a 128-dimensional variable has rank at most 1, so a 128x128 covariance matrix built from them is necessarily singular.
Related
Chumpy minimization of gaussian pyramid leads to dimension mismatch
I am attempting to minimize an energy function between a rendered 3D scene and an image with opendr, as in the example given in the OpenDR paper. I have essentially copied the code from the paper, but used my own image and 3D object. Here is the code to create the renderer:
V = ch.array(m.v)
rn = TexturedRenderer()
rn.frustum = {'near': 1., 'far': 1., 'width': 350, 'height': 500}
rn.camera = ProjectPoints(v=m.v, t=np.array([0, 0, 1]), rt=np.zeros(3), f=[450, 450], c=[350/2, 500/2], k=np.zeros(5))
rn.set(v=m.v, f=m.f, texture_image=img, ft=m.ft, vt=m.vt, bgcolor=ch.zeros(3))
A = SphericalHarmonics(vn=VertNormals(v=V, f=rn.f), components=ch.array([4, 0, 0, 0]), light_color=ch.array([1, 1, 1]))
rn.set(vc=A)
where m is the mesh of the object loaded with opendr's load_mesh. Then I calculate the energy and use ch.minimize just as in the paper:
translation, rotation = ch.array([0, 0, 4]), ch.zeros(3)
rn.v = translation + m.v.dot(Rodrigues(rotation))
# Create the energy
difference = rn - load_image(img_file)
E = gaussian_pyramid(difference, n_levels=5, normalization='SSE')
# Minimize the energy
light_parms = A.components
ch.minimize(E, x0=[translation])
ch.minimize(E, x0=[translation, rotation, light_parms])
I've confirmed that the renderer and the image shape are both 500x350x3. It always returns a dimension mismatch, which occurs in a scipy sparse matmul operation when the state is updated in chumpy's minimize_dogleg function. I've modified that file to print the shapes of the matrices in the matmul operation through the iterations; with n_levels=5 the shape pairs it returns are
(129978, 519912) (519912, 525000)
(31992, 127452) (127452, 129978)
(7686, 30744) (30744, 31992)
(1800, 7080) (7080, 7686)
(378, 1512) (1512, 1800)
(256, 256) (105279, 3)
with the last pair being where it returns the error. I've altered the number of pyramid levels and the results are the same, and the final shapes of the two matrices are also always the same (unless I go too high, in which case it gives a different error because the matrix becomes too small). One thing I notice is that the error always occurs at n_levels + 1 iterations, which seems odd to me since I would think it stops at n_levels, but I have little understanding of what is actually happening in the minimization. For a little extra context, the mesh has 35093 vertices and E has a length of 686970. Could anyone explain what exactly is happening here, how the minimization works, and why the error occurs at n_levels + 1 iterations?
The problem was fixed by instead using the opendr version from https://github.com/polmorenoc/opendr as opposed to the mattloper version
Resampling 2-d array using Fourier transform method
I have a question on resampling a 2-d array. Sometimes the original size of geoscience data needs to be transformed to another size. If the ratio is the same for each axis, the task is simple: block-summing with reshape turns a 100x100 2-d array into a 50x50 one without data loss. The code is shown below:
import numpy as np
## create the original data
xc1, xc2, yc1, yc2 = 100, 110, 35, 45
XSIZE, YSIZE = 100, 100
lon, lat = np.linspace(xc1, xc2, XSIZE), np.linspace(yc1, yc2, YSIZE)
pop = np.random.uniform(low=1000, high=50000, size=(XSIZE*YSIZE,)).reshape(YSIZE, XSIZE)
## reshape
shape = np.array(pop.shape, dtype=float)
coarseness = 2  # the new shape is 50 x 50
new_shape = coarseness * np.ceil(shape/coarseness).astype(int)
zp_pop = np.zeros(new_shape)
zp_pop[:int(shape[0]), :int(shape[1])] = pop
temp = zp_pop.reshape((new_shape[0] // coarseness, coarseness, new_shape[1] // coarseness, coarseness))
coarse_pop = np.sum(temp, axis=(1, 3))
print(pop.sum())
print(coarse_pop.sum())
However, when the coarsening factor is different for each axis, this method cannot be used, so I turned to another approach. Here is an example where I tried to use the FFT to generate a 60x80 array as output:
from scipy import fftpack
pop_fft = fftpack.fft2(pop, shape=(60, 80))
pop_res = fftpack.ifft2(pop_fft).real
print(pop.sum())
print(pop_res.sum())
254208134.8356425
122048754.13639387
The data loss was significant, which is why I posted my issue here. Maybe the resampling function I used was not correct, or perhaps there is a better approach to deal with this situation. Any advice or comments are highly appreciated!
When you set up the 'coarse array' yourself, you sum over adjacent entries instead of computing the average or interpolating. This way the sum over all elements of the coarse and the original array are identical:
str((coarse_pop.sum()-pop.sum())/(0.5*(pop.sum()+coarse_pop.sum())))
gives '-1.1638426077573779e-16', only a tiny numerical error. If you instead compare the mean of the fftpack-resampled coarse array, it matches up:
print(pop.mean())
print(pop_res.mean())
25606.832220313503
25496.03271480075
Alternatively, you can correct for the number of elements yourself:
print(pop.sum())
print(pop_res.sum()*100*100/(60*80))
256068322.20313504
254960327.14800745
I don't know about your particular problem, but the fftpack way of downsampling the array (which preserves the mean rather than the sum) makes more sense to me. If it's not what you want, you can apply the prefactor to the original array, like
pop_fft = fftpack.fft2(pop*100*100/(60*80), shape=(60, 80))
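For illustration, a short self-contained sketch of that normalization idea, with random data standing in for the population grid (the exact numbers will differ from those above):
import numpy as np
from scipy import fftpack

pop = np.random.uniform(low=1000, high=50000, size=(100, 100))  # stand-in for the original grid

# FFT-based resampling to 60x80 roughly preserves the mean, not the sum
pop_fft = fftpack.fft2(pop, shape=(60, 80))
pop_res = fftpack.ifft2(pop_fft).real
print(pop.mean(), pop_res.mean())                    # close to each other

# to compare totals, rescale the coarse sum by the ratio of element counts
print(pop.sum(), pop_res.sum() * pop.size / pop_res.size)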
How should I modify the test data for SVM method to be able to use the `precomputed` kernel function without error?
I am using sklearn.svm.SVR for a regression task in which I want to use my own customized kernel. Here are some dataset samples and the code:
index  density  speed      label
0      14       58.844020  77.179139
1      29       67.624946  78.367394
2      44       77.679100  79.143744
3      59       79.361877  70.048869
4      74       72.529289  74.499239
.... and so on
from sklearn import svm
import pandas as pd
import numpy as np
density = np.random.randint(0, 100, size=(3000, 1))
speed = np.random.randint(20, 80, size=(3000, 1)) + np.random.random(size=(3000, 1))
label = np.random.randint(20, 80, size=(3000, 1)) + np.random.random(size=(3000, 1))
d = np.hstack((density, speed, label))
data = pd.DataFrame(d, columns=['density', 'speed', 'label'])
data.density = data.density.astype(dtype=np.int32)
def my_kernel(X, Y):
    return np.dot(X, X.T)
svr = svm.SVR(kernel=my_kernel)
x = data[['density', 'speed']].iloc[:2000]
y = data['label'].iloc[:2000]
x_t = data[['density', 'speed']].iloc[2000:3000]
y_t = data['label'].iloc[2000:3000]
svr.fit(x, y)
y_preds = svr.predict(x_t)
The problem happens in the last line, svr.predict, which says:
X.shape[1] = 1000 should be equal to 2000, the number of samples at training time
I searched the web for a way to deal with the problem, but many similar questions (like {1}, {2}, {3}) were left unanswered. I had used SVM methods with the rbf, sigmoid, ... kernels before and the code worked just fine, but this was my first time using a customized kernel, and I suspected that must be the reason this error happened. After a little research and reading the documentation, I found out that when using precomputed kernels, the matrix passed to SVR.predict() must have shape [n_samples_test, n_samples_train]. How should I modify x_t so that predictions work with no problem, just like when we don't use customized kernels? If possible, please also explain why the input to svm.predict for a precomputed kernel differs from the other kernels. I really hope the related unanswered questions can be answered as well.
The problem is in your kernel function: it doesn't do the job. As the documentation (https://scikit-learn.org/stable/modules/svm.html#using-python-functions-as-kernels) says, "Your kernel must take as arguments two matrices of shape (n_samples_1, n_features), (n_samples_2, n_features) and return a kernel matrix of shape (n_samples_1, n_samples_2)." The sample kernel on the same page satisfies this criterion:
def my_kernel(X, Y):
    return np.dot(X, Y.T)
In your function the second argument of dot is X.T, so the output has shape (n_samples_1, n_samples_1), which is not what is expected.
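A sketch of the question's setup with the corrected kernel, using random stand-in data (variable names follow the question, not any fixed API):
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)
density = rng.integers(0, 100, size=(3000, 1)).astype(float)
speed = rng.uniform(20, 80, size=(3000, 1))
label = rng.uniform(20, 80, size=3000)
X = np.hstack((density, speed))

def my_kernel(X, Y):
    # returns shape (n_samples_X, n_samples_Y), as sklearn expects from a callable kernel
    return np.dot(X, Y.T)

svr = svm.SVR(kernel=my_kernel)
svr.fit(X[:2000], label[:2000])
y_preds = svr.predict(X[2000:])  # the kernel matrix now has shape (1000, 2000) internally
print(y_preds.shape)             # (1000,)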
The shape mismatch means the test data and train data don't have compatible shapes; always think in terms of numpy matrices or arrays. For any arithmetic operation you need compatible shapes, which is why we check array.shape. You can reshape the data to [n_samples_test, n_samples_train], but that is not the best idea; array.shape, reshape, and resize are the tools used for that.
How to represent a Word2Vec model on a graph? (or convert a 1x300 numpy array to just a 1x2 array)
I have a 1x300 numpy array from my Word2Vec model which is returns like this: [ -2.55022556e-01 1.06162608e+00 -5.86191297e-01 -4.43067521e-01 4.46810514e-01 4.31743741e-01 2.16610283e-01 9.27684903e-01 -4.47879761e-01 -9.11142007e-02 3.27048987e-01 -8.05553675e-01 -8.54483843e-02 -2.85595834e-01 -2.70745698e-02 -3.08014955e-02 1.53204888e-01 3.16114485e-01 -2.82659411e-01 -2.98218042e-01 -1.03240972e-02 2.12806061e-01 1.63605273e-01 9.42423999e-01 1.20789325e+00 4.11570221e-01 -5.46323597e-01 1.95108235e-01 -4.53743488e-01 -1.28625661e-01 -7.43277609e-01 1.11551750e+00 -4.51873302e-01 -1.14495361e+00 -6.69551417e-02 6.88364863e-01 -6.01781428e-01 -2.36386538e-01 -3.64305973e-01 1.18274912e-01 2.03438237e-01 -1.01153564e+00 6.67958856e-01 1.80363625e-01 1.26524955e-01 -2.96024203e-01 -9.93479714e-02 -4.93405871e-02 1.02504417e-01 7.63318688e-02 -3.68398607e-01 3.03587675e-01 -2.90227026e-01 1.51891649e-01 -6.93689287e-03 -3.99766594e-01 -1.86124116e-01 -2.86920428e-01 2.04880714e-01 1.39914978e+00 1.84370011e-01 -4.58923727e-01 3.91094625e-01 -7.52937734e-01 3.05261135e-01 -4.55163687e-01 7.22679734e-01 -3.76093656e-01 6.05900526e-01 3.26470852e-01 4.72957864e-02 -1.18182398e-01 3.51043999e-01 -3.07209432e-01 -6.10330477e-02 4.14131492e-01 7.57511556e-02 -6.48704231e-01 1.42518353e+00 -9.20495167e-02 6.36665523e-01 5.48510313e-01 5.92754841e-01 -6.29535854e-01 -4.47180003e-01 -8.99413109e-01 -1.52441502e-01 -1.98326513e-01 4.74154204e-01 -2.07036674e-01 -6.70400202e-01 6.67807996e-01 -1.04234733e-01 7.16163218e-01 3.32825005e-01 8.20083246e-02 5.88186264e-01 4.06852067e-01 2.66174138e-01 -5.35981596e-01 3.26077454e-02 -4.04357493e-01 2.19569445e-01 -2.74264365e-01 -1.65187627e-01 -4.06753153e-01 6.12065434e-01 -1.89857081e-01 -5.56927800e-01 -6.78636551e-01 -7.52498448e-01 1.04564428e+00 5.32510102e-01 5.05628288e-01 1.95120305e-01 -6.40793025e-01 5.73082231e-02 -1.58281475e-02 -2.62718409e-01 1.74351722e-01 -6.95129633e-02 3.44214857e-01 -4.24746841e-01 -2.75907904e-01 -6.60992935e-02 -1.19041657e+00 -6.01056278e-01 5.67718685e-01 -6.47478551e-02 1.55902460e-01 -2.48480186e-01 5.56753576e-01 1.29889056e-01 3.91534269e-01 1.28707469e-01 1.29670590e-01 -6.98880851e-01 2.43386969e-01 7.70289376e-02 -1.14947490e-01 -4.31593180e-01 -6.16873622e-01 6.03831768e-01 -2.07050622e-01 1.23276520e+00 -1.67524610e-02 -4.67656374e-01 1.00281858e+00 5.17916441e-01 -7.99495637e-01 -4.22653735e-01 -1.45487636e-01 -8.71369673e-04 1.25453219e-01 -1.25869447e-02 4.66426492e-01 5.07026255e-01 -6.53024793e-01 7.53435045e-02 8.33864748e-01 3.37398499e-01 7.50920832e-01 -4.80326146e-01 -4.52838868e-01 5.92808545e-01 -3.57870340e-01 -1.07011057e-01 -1.13945460e+00 3.97635132e-01 1.23554178e-01 4.81683850e-01 5.47445454e-02 -2.18614921e-01 -2.00085923e-01 -3.73975009e-01 8.74632657e-01 6.71471596e-01 -4.01738763e-01 4.76147681e-01 -5.79257011e-01 -1.51511624e-01 1.43170074e-01 5.00052273e-01 1.46719962e-01 2.43085429e-01 5.89158475e-01 -5.25088668e-01 -2.65306592e-01 2.18211919e-01 3.83228660e-01 -2.51622144e-02 2.32621357e-01 8.06669474e-01 1.37254462e-01 4.59401071e-01 5.63044667e-01 -5.79878241e-02 2.68106610e-01 5.47239482e-01 -5.05441546e-01] It's so frustrating to read because I just want to get a 1x2 array like [12,19] so I can represent it to graph and make a cosine distance measurement to the 1x2 array. How to do it? Or how to represent the 1x300 Word2Vec model to a 2D graph?
There are many ways to apply "dimensionality reduction" to high-dimensional data, for aid in interpretation or graphing.
One super-simple way to reduce your 300 dimensions to just 2 dimensions, for plotting on a flat screen/paper: just discard 298 of the dimensions! You'll have something to plot – such as the point (-0.255022556, 1.06162608) if taking just the 1st 2 dimensions of your example vector. However, starting from word2vec vectors, those won't likely be very interesting points, individually or when you start plotting multiple words. The exact axes/dimensions of such vectors are unlikely to be intuitively meaningful to humans, and you're throwing away about 99.3% of the meaning per vector – quite likely including the dimensions which (in concert with each other) capture semantically-meaningful relationships.
So you'd be more likely to do some more thoughtful dimensionality reduction. A super-simple technique would be to pick two vector-directions that are thought to be meaningful as your new X and Y axes. In the word2vec world, these wouldn't necessarily be existing vectors in the set – though they could be – but might be the difference between two vectors. (The analogy-solving power of word2vec vectors essentially comes from discovering the difference between two vectors A and B, then applying that difference to a third vector C to find a 4th vector D, at which point D often has the same human-intuitive analogical relationship to C as B had to A.)
For example, you might difference the word-vectors for 'man' and 'woman' to get a vector which bootstraps your new X-axis, then difference the word-vectors for 'parent' and 'worker' to get a vector which bootstraps your new Y-axis. Then, for every candidate 300-dimensional vector you want to plot, find that candidate's "new X" by calculating the magnitude of its projection onto your X-direction vector, and its "new Y" by calculating the magnitude of its projection onto your Y-direction vector. This might result in a set of relative values that, on a 2-D chart, vaguely match human intuitions about often-observed linguistic relationships between gender and familial/workplace roles.
As @poorna-prudhvi's comment mentions, PCA and t-SNE are other techniques – which may specifically do better at preserving certain interesting qualities of the full-dimensional data. t-SNE, especially, was invented to support machine learning and plotting, and tries to keep the distance-relationships that existed in the higher number of dimensions similar in the lower number of dimensions.
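A minimal numpy sketch of that projection idea, using random stand-ins for the word vectors (in practice they would come from your trained model, e.g. model.wv['man'] in gensim):
import numpy as np

rng = np.random.default_rng(0)
# stand-ins for real 300-d word vectors
vec_man, vec_woman = rng.normal(size=300), rng.normal(size=300)
vec_parent, vec_worker = rng.normal(size=300), rng.normal(size=300)
candidate = rng.normal(size=300)  # the 1x300 vector you want to plot

x_dir = vec_woman - vec_man       # direction used as the new X axis
y_dir = vec_worker - vec_parent   # direction used as the new Y axis

def project(v, direction):
    # magnitude of the projection of v onto the given direction
    return np.dot(v, direction) / np.linalg.norm(direction)

point_2d = np.array([project(candidate, x_dir), project(candidate, y_dir)])
print(point_2d)  # a 1x2 point you can plot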
In addition to @gojomo's answer, if it's only for experimenting I'd recommend using TensorFlow's embedding projector, which provides a nice GUI for out-of-the-box (approximate) PCA and t-SNE. Just use numpy.savetxt to format your vectors properly.
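For example, one way to write the vectors as tab-separated files the projector can load (the file names and labels here are arbitrary):
import numpy as np

vectors = np.random.normal(size=(100, 300))          # one 300-d vector per row (stand-in data)
words = ["word_%d" % i for i in range(100)]          # a label for each row

np.savetxt("vectors.tsv", vectors, delimiter="\t")   # the projector expects tab-separated values
with open("metadata.tsv", "w") as f:
    f.write("\n".join(words))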
signal.correlate 'same'/'full' meaning?
I am wondering what the mode argument in signal.correlate (or numpy.correlate) means?
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
def crossCorrelator(sig1, sig2):
    correlate = signal.correlate(sig1, sig2, mode='same')
    return correlate
flux0 = [ 0.02006948 0.01358697 -0.06196026 -0.03842506 -0.09023056 -0.05464169 -0.02530553 -0.01937054 -0.01237411 0.03472263 0.17865012 0.27441767 0.23532932 0.16358341 0.08743969 0.12166425 0.10287468 0.13430794 0.08262321 0.0515434 0.04657624 0.09017276 0.09131331 0.04696824 -0.03901519 -0.01413654 0.05448175 0.1236946 0.09968044 -0.001584 -0.06094561 -0.02998289 -0.00113092 0.04336605 0.01105071 0.0527657 0.03825847 0.02309524]
flux1 = [-0.02946104 -0.02590192 -0.02274955 0.00485888 -0.0149776 0.01757462 0.02820086 0.0379213 0.03580811 0.06507382 0.09995243 0.12814133 0.16109725 0.12371425 0.08273643 0.09433014 0.05137761 0.04057405 -0.08171598 -0.06541216 0.00126869 0.09223577 0.06811737 0.0795967 0.08689563 0.0928949 0.09971169 0.05413958 0.05410236 0.00120439 0.02454734 0.06450544 0.01508899 -0.06100537 -0.10038889 -0.00651572 0.01095773 0.05517478]
correlation = crossCorrelator(flux0, flux1)
f, axarr = plt.subplots(2)
axarr[0].plot(np.arange(len(flux0)), flux0)
axarr[0].plot(np.arange(len(flux1)), flux1)
axarr[1].plot(np.arange(len(correlation)), correlation)
plt.show()
When I use mode='same', the correlation array has the same length as the fluxes; for 'full' it is roughly double. Why? If len(flux0) and len(flux1) correspond to time, what would len(correlation) correspond to? I am really looking for a mathematical explanation; the answers I have found so far were more technical in nature...
Given two sequences (a[0], ..., a[A-1]) and (b[0], ..., b[B-1]) of lengths A and B, respectively, the convolution is calculated as
c[n] = sum_m a[m] * b[n-m]
(Correlating a with b is the same as convolving a with the reversed, conjugated b, so the statements below about output lengths apply equally to signal.correlate.)
If mode=="full", the convolution is calculated for n ranging from 0 to A+B-2, so the returned array has A+B-1 elements.
If mode=="same", scipy.signal.correlate computes the convolution for n ranging from (B-1)/2 to A-1+(B-1)/2, where integer division is assumed; the returned array has A elements. numpy.correlate behaves the same way only if A >= B; if A is less than B, it switches the two arrays (and the returned array has B elements).
If mode=="valid", the convolution is calculated for n ranging from min(A,B)-1 to max(A,B)-1, and therefore has max(A,B) - min(A,B) + 1 elements.
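A quick sketch to confirm those lengths for arrays of the same size as the flux arrays in the question (A = B = 38):
import numpy as np
from scipy import signal

a = np.random.random(38)
b = np.random.random(38)

print(len(signal.correlate(a, b, mode='full')))   # A + B - 1 = 75
print(len(signal.correlate(a, b, mode='same')))   # A = 38
print(len(signal.correlate(a, b, mode='valid')))  # max(A,B) - min(A,B) + 1 = 1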