Understanding the output of scipy.stats.multivariate_normal - python

I am trying to build a multidimensional Gaussian model using scipy.stats.multivariate_normal, and to use the output of scipy.stats.multivariate_normal.pdf() to figure out whether a test value fits reasonably well in the observed distribution.
From what I understand, high values indicate a better fit to the given model and low values otherwise.
However, on my dataset I see extremely large PDF(x) results, which makes me question whether I understand things correctly. The area under the PDF curve must be 1, so such large values are hard to make sense of.
For example, consider:
x = [-0.0007569417915494715, -0.01394295997613827, 0.000982078369890444, -0.03633664354397629, -0.03730583036106844, 0.013920453054506978, -0.08115836865224338, -0.07208494497398354, -0.06255237023298793, -0.0531888840386906, -0.006823760545565131]
mean = [0.01663645201261102, 0.07800335614699873, 0.016291452384234965, 0.012042931155488702, 0.0042637244100103885, 0.016531331606477996, -0.021702714746699842, -0.05738646649459681, 0.00921296058625439, 0.027940994009345254, 0.07548111758006244]
covariance = [[0.07921927017771506, 0.04780185747873293, 0.0788086850274493, 0.054129466248481264, 0.018799028456661045, 0.07523731808137141, 0.027682748950487425, -0.007296954729572955, 0.07935165417756569, 0.0569381100965656, 0.04185848489472492], [0.04780185747873293, 0.052300105044833595, 0.047749467098423544, 0.03254872837949123, 0.010582358713999951, 0.045792252383799206, 0.01969282984717051, -0.006089301208961258, 0.05067712814145293, 0.03146214776997301, 0.04452949330387575], [0.0788086850274493, 0.047749467098423544, 0.07841809405745602, 0.05374461924031552, 0.01871005609017673, 0.07487015790787396, 0.02756781074862818, -0.007327131572569985, 0.07895548129950304, 0.056417456686115544, 0.04181063355048408], [0.054129466248481264, 0.03254872837949123, 0.05374461924031552, 0.04538801863296238, 0.015795381235224913, 0.05055944754764062, 0.02017033995851422, -0.006505939129684573, 0.05497361331950649, 0.043858860182247515, 0.029356699144606032], [0.018799028456661045, 0.010582358713999951, 0.01871005609017673, 0.015795381235224913, 0.016260640022897347, 0.015459548918222347, 0.0064542528152879705, -0.0016656858963383602, 0.018761682220822192, 0.015361512546799405, 0.009832025009280924], [0.07523731808137141, 0.045792252383799206, 0.07487015790787396, 0.05055944754764062, 0.015459548918222347, 0.07207012779105286, 0.026330967917717253, -0.006907504360835279, 0.0753380831201204, 0.05335128471397023, 0.03998397595850863], [0.027682748950487425, 0.01969282984717051, 0.02756781074862818, 0.02017033995851422, 0.0064542528152879705, 0.026330967917717253, 0.020837940236441078, -0.003320408544812026, 0.027859582829638897, 0.01967636950969646, 0.017105000942890598], [-0.007296954729572955, -0.006089301208961258, -0.007327131572569985, -0.006505939129684573, -0.0016656858963383602, -0.006907504360835279, -0.003320408544812026, 0.024529061074105817, -0.007869287828047853, -0.006228903058681195, -0.0058974553248417995], [0.07935165417756569, 0.05067712814145293, 0.07895548129950304, 0.05497361331950649, 0.018761682220822192, 0.0753380831201204, 0.027859582829638897, -0.007869287828047853, 0.08169291677188911, 0.05731196406065222, 0.04450058445993234], [0.0569381100965656, 0.03146214776997301, 0.056417456686115544, 0.043858860182247515, 0.015361512546799405, 0.05335128471397023, 0.01967636950969646, -0.006228903058681195, 0.05731196406065222, 0.05064023101024737, 0.02830810316675855], [0.04185848489472492, 0.04452949330387575, 0.04181063355048408, 0.029356699144606032, 0.009832025009280924, 0.03998397595850863, 0.017105000942890598, -0.0058974553248417995, 0.04450058445993234, 0.02830810316675855, 0.040658283674780395]]
For this, if I compute y = multivariate_normal.pdf(x, mean, covariance),
the result is 342562705.3859754.
How could this be the case? Am I missing something?
Thanks.

This is fine. A probability density function can be larger than 1 at a specific point; it is the integral that must equal 1.
The intuition that values cannot exceed 1 is correct for discrete variables, where you work with a probability mass function. For continuous variables, however, the pdf is not a probability: it is a density that only becomes a probability once integrated. That is, its integral from minus infinity to infinity, over all dimensions, is equal to 1.
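For intuition, here is a small sketch with made-up numbers (not the data from the question) showing how a density exceeds 1 as soon as the distribution is concentrated enough:

import numpy as np
from scipy.stats import norm, multivariate_normal

# 1-D: the density at the mean is 1 / (sigma * sqrt(2*pi)), which is
# already above 1 for any sigma below roughly 0.4.
print(norm.pdf(0.0, loc=0.0, scale=0.01))        # about 39.9

# Several dimensions: a tight covariance gives a huge peak density,
# yet the density still integrates to 1 over all of R^3.
print(multivariate_normal.pdf([0.0, 0.0, 0.0],
                              mean=[0.0, 0.0, 0.0],
                              cov=0.0001 * np.eye(3)))   # about 6.3e4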

Related

High difference in predictions on different train test split sizes

I am unable to figure out the reason for the stark difference in predictions between different train/test splits when fitting a linear model with LinearRegression.
This is my initial attempt on the data:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# 80/20 split; the fourth return value holds the true test targets
x_train, x_test, y_train, true_p = train_test_split(train, y, random_state=121, test_size=0.2, shuffle=True)
lreg = LinearRegression()
lreg.fit(x_train, y_train)
train_pred = lreg.predict(x_train)
test_pred = lreg.predict(x_test)
this is the output in train_pred:
train_pred
array([12.37512481, 11.67234874, 11.82821202, ..., 12.61139596,
12.13886881, 12.42435563])
this is the output in test_pred:
test_pred
array([ 1.21885520e+01, 1.13462088e+01, 1.14144208e+01, 1.22832932e+01,
1.29980626e+01, 1.17641183e+01, 1.20982465e+01, 1.15846156e+01,
1.17403904e+01, 4.17353113e+07, 1.27941840e+01, 1.21739628e+01,
..., 1.22022858e+01, 1.15779229e+01, 1.24931376e+01, 1.26387188e+01,
1.18341585e+01, 1.18411881e+01, 1.21475986e+01, 1.25104774e+01])
The two sets of predictions differ hugely, and the latter clearly contains wrong predictions.
I tried increasing the test size to 0.4, and now the predictions look reasonable.
x_train, x_test, y_train, true_p = train_test_split(train, y, random_state=121, test_size=0.4, shuffle=True)
lreg = LinearRegression()
lreg.fit(x_train, y_train)
train_pred = lreg.predict(x_train)
test_pred = lreg.predict(x_test)
These are the outputs of train_pred and test_pred:
train_pred
array([11.95505983, 12.66847164, 11.81978843, 12.82992812, 12.44707462,
11.78809995, 11.92753084, 12.6082893 , 12.22644843, 11.93325658,
12.2449481 ,..., 11.69256008, 11.67984786, 12.54313682, 12.30652695])
test_pred
array([12.22133867, 11.18863973, 11.46923967, 12.26340761, 12.99240451,
11.77865948, 12.04321231, 11.44137667, 11.71213919, 11.44206212,
..., 12.15412777, 12.39184805, 10.96310233, 12.06243916, 12.11383494,
12.28327695, 11.19989021, 12.61439939, 12.22474378])
What is the reason behind this, and how can I rectify the problem with the 0.2 train/test split?
Thank you
Check the scale of your test_pred: the values are printed in scientific notation (e+01 means x10). If you change numpy's print settings to remove the scientific notation with np.set_printoptions(suppress=True) and then print test_pred, you should see that it looks very similar to train_pred. So, in short, nothing is wrong.
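A quick sketch of that check (train_pred and test_pred are the arrays produced by the code above):

import numpy as np

np.set_printoptions(suppress=True)   # print plain decimals instead of scientific notation
print(train_pred)
print(test_pred)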
When the data has very high variance, a very small test set can produce significant differences in predictions; I would call this underfitting.
Start by analysing your dataset: basic descriptive statistics (plots, measures of position and dispersion, etc.) will show you the main sources of this variance. After that, increase the size of your test set so that it is balanced, otherwise your study will be biased.
But from what I saw, everything is fine; the only "problem" is the notation, since e+01 just means the number is multiplied by 10.

Extracting confidence from scikit PassiveAggressiveClassifier() for single prediction

I have trained a PassiveAggressiveClassifier on a set of 165 categories.
I can already use it to predict certain inputs, but it sometimes fails, and it would be very helpful to know how "confident" the classifier is about each prediction and what the other candidate categories are.
As far as I understand, I get the distances for each category using decision_function:
distances = np.array(ppl.decision_function(sample))
which gives me something like this for the distances:
[-1.4222 -1.5083 -2.6488 -2.3428 -1.3167 -3.9615 -2.7804 -1.9563 -0.5054
-1.9524 -3.0026 -3.422 -2.1301 -2.0119 -2.1381 -2.2186 -2.0848 -2.4514
-1.9478 -2.3101 -2.4044 -1.9155 -1.569 -1.31 -1.4865 -2.3251 -1.7773
-1.304 -1.5215 -2.0634 -1.6987 -1.9217 -2.2863 -1.8166 -2.0219 -1.9594
-1.747 -2.1503 -2.162 -1.9507 -1.5971 -3.4499 -1.8946 -2.4328 -2.2415
-1.9045 -2.065 -1.9671 -1.8592 -1.6283 -1.7626 -2.2175 -2.1725 -3.7855
-5.1397 -3.6485 -4.4072 -2.2109 -2.048 -2.4887 -2.2324 -2.7897 -1.2932
-1.975 -1.516 -1.6127 -1.7135 -1.8243 -1.4887 -2.8973 -1.9656 -2.2236
-2.2466 -2.1224 -1.2247 -1.9657 -1.6138 -2.7787 -1.5004 -2.0136 -1.1001
-1.7226 -1.5829 -2.0317 -1.0834 -1.7444 -1.356 -2.3453 -1.7161 -2.2683
-2.2725 -0.4512 -4.5038 -2.0386 -2.1849 -2.4256 -1.5678 -1.8114 -2.2138
-2.2654 -1.8823 -2.7489 -1.8477 -2.1383 -1.6019 -2.84 -2.2595 -2.0764
-1.6758 -2.4279 -2.3489 -2.1884 -2.1888 -1.6289 -1.7358 -1.2989 -1.5656
-1.3362 -1.888 -2.1061 -1.4517 -2.0572 -2.4971 -2.2966 -2.6121 -2.4728
-2.8977 -1.7571 -2.4363 -1.4775 -1.7144 -2.047 -3.9252 -1.9907 -2.1808
-2.066 -1.9862 -1.4898 -2.3335 -2.6088 -2.4554 -2.4139 -1.7187 -2.2909
-1.4846 -1.8696 -2.444 -2.6253 -1.7738 -1.7192 -1.8737 -1.9977 -1.9948
-1.7667 -2.0704 -3.0147 -1.9014 -1.7713 -2.2551]
Now I have two questions:
1st: Is it possible to map the distances back to the categories? The length of the array (159) does not match my categories array.
2nd: How can I calculate a confidence for a single prediction using the distances?
Question 1
As per the comment, make sure all your classes are contained in the training set. You can achieve this, for example, by using the train_test_split function and passing your targets to the stratify parameter, as in the sketch below.
Once you do this, the problem will disappear and there will be one classifier per class. As a result, if you pass a sample to the decision_function method, there will be one distance to the hyperplane for each class.
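A small sketch of such a stratified split (X and y stand for your features and targets, which are not shown in the question):

from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions, so every one of the 165 categories
# present in y also ends up in the training set (each class needs at least
# two samples for this to work).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)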
Question 2
You can turn the distances into probabilities through rescaling and normalizing (i.e. softmax). This is already implemented internally in the _predict_proba_lr method. See the source code here.
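As an illustration, a minimal softmax over the distances (ppl and sample are the objects from the question; note that sklearn's _predict_proba_lr actually squashes each distance through a sigmoid before normalising, so its numbers will differ slightly):

import numpy as np

def softmax(d):
    e = np.exp(d - np.max(d))   # shift by the max for numerical stability
    return e / e.sum()

distances = np.asarray(ppl.decision_function(sample)).ravel()
confidences = softmax(distances)            # pseudo-probabilities that sum to 1
best_idx = int(np.argmax(confidences))
print(ppl.classes_[best_idx], confidences[best_idx])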

Using k-means clustering to cluster based on single variable

I’m just trying to get my head around clustering.
I have a series of data points, y, which have a (Gaussian) noise function associated with them.
There are two classes of values, 0 and >0 (obviously with noise), and I’m trying to find the centre point of the group that is >0.
I’ve plotted the points with a simple moving average to be able to eyeball the data.
Moving average plot:
How can I cluster the data just based on the y value?
I’d like to have two clusters: one covering the points on the left and right (roughly x < 120 and x > 260 by the looks of it) and the other for the middle points (x = 120 to 260).
If I try with two clusters I get this:
k means plot - k=2:
How should I amend my code to achieve this?
import numpy as np
import matplotlib.pyplot as plt

x = range(315)
y= [-0.0019438692324050865, 0.0028994208839327852, 0.0051483573976274649, -0.0033242993359676809, -0.007205517954705391, 0.0023493638544448323, 0.0021109981155292179, 0.0035990200904119076, -0.0039516797159245328, 0.0046512034107712786, -0.0019248189368846083, 0.0036744109953683823, 0.0007898612768152954, 0.0050059088808496474, -0.0021084425769681558, 0.0014692258570182986, -0.0030711206115484175, -0.0026614801222815628, 0.0022816301256991535, 0.00019923934682088178, -0.0013181161659271139, -0.0021956355547661358, 0.0012941895041076283, 0.00337197586896105, -0.0019792508536746402, -0.002020497762984554, 0.0014495021773240431, 0.0011887337096206894, 0.0016667792145975404, -0.0010119590445208419, -0.0024506337087077676, 0.0072264471843846339, -0.0014126073097276062, -0.00065673498034648755, -0.0011355352304356647, -0.00042657980930307281, -0.0032875547481258042, -0.002351265010099495, -0.00073344218847348742, -0.0031555991687002589, 0.0026170287799315104, 0.0019289080666337198, -0.0021804765064623076, 0.0026221290350876979, 0.0019831827145683828, -0.005422907223254632, -0.0014107046201467732, -0.0049438583709020423, 0.00081884635937855494, 0.0054783747880986361, -0.0011282600170147909, -0.00436581779762948, 0.0024421851848953177, -0.0018564229613786095, -0.0052492274840120123, 0.0051775747035086306, 0.0052413417491534494, 0.0030817295096650732, -0.0014106391941506153, 0.00074380887788818206, -0.0041507550699856439, -0.00074928547462217287, -9.3938667619130614e-05, -0.00060592968804004362, 0.0064913597798387348, 0.0018098075166183621, 0.00099550852535854441, 0.0037322288350247917, 0.0027039351321340869, 0.0060238021513650541, -0.006567405116575234, 0.0020858553839503175, -0.0040329574871009084, -0.0029337227854833213, 0.0020743996957790969, 0.0041249738085716511, -0.0016678673351373336, -0.00081387164524554967, -0.0028411340446090278, 0.00013572776045231967, -0.00025350369023925548, 0.00071609777542998309, -0.0018427036825796074, -0.0015513575887011904, -0.0016357115978466398, 0.0038235991426514866, 0.0017693050063256977, -0.00029816429542494152, -0.0016071303644783605, -0.0031883070092131086, -0.0010340123778528594, -0.0049194467790889653, 0.0012109237666701397, 0.0024532524488299246, 0.0069307209537693721, 0.0009573350812806618, -6.0022322637651027e-05, -0.00050143013334696311, 0.0023415017810229548, 0.0033053845403900849, -0.0061156769150035222, 0.00022216114877491691, 0.0017257349557975464, 4.6919738262423826e-05, -0.0035257466102171162, -0.0043673831041441185, -0.0016592116617178102, -0.003298933045964781, -0.001667158964114637, 0.0011283739877531254, -0.0055098513985193534, 0.0023564462221116358, 0.0041971132878626258, 0.0061727231077443314, 0.0047583822927202779, 0.0022475414486232245, 0.0048682822792560521, 0.0022415648209199016, 0.00044859963858686957, -0.0018519391698513449, 0.0031460918774998763, 0.0038614233082916809, -0.0043409564348247066, -0.0055560805453666326, -0.00025133196059449212, 0.012436346397552794, 0.01136022093203152, 0.011244278807602391, 0.01470018209739289, 0.0075560289478025277, 0.012568781764361209, 0.0076068752709663838, 0.011022209533236597, 0.010545997929846045, 0.01084340614623565, 0.011728388118710915, 0.0075043238708055885, 0.012860298948366296, 0.0097297636410632864, 0.0098800557729756874, 0.011536517297700085, 0.0082316420968713416, 0.012612386004592427, 0.016617154743589352, 0.0091391582296167315, 0.014952150276251052, 0.011675391002362373, 0.01568297072839233, 0.01537664322062633, 0.01622711654371662, 0.010708828344561546, 0.016625354383482532, 
0.010757807468539406, 0.016867909081979202, 0.010354635736138377, 0.014345365677006765, 0.011114328315579219, 0.010034249196973242, 0.015846180181371881, 0.014303841146954242, 0.011608682896746103, 0.0086826955459553216, 0.0088576104599897426, 0.011250553207393772, 0.005522552439745569, 0.011185993425936373, 0.010241377537878162, 0.0079206732150164348, 0.0052965651546758108, 0.011104715912291204, 0.010506408714857187, 0.010153282642128673, 0.010286986015082572, 0.01187330766677645, 0.014541420264499783, 0.013092204890199896, 0.012979246400649271, 0.012595814351669916, 0.014714607377710237, 0.011727516021525658, 0.011035077266739704, 0.0089698030032708698, 0.0087245475140550147, 0.011139467365240661, 0.0094505568595650603, 0.014430361388952871, 0.0089241578716030695, 0.014616210804585136, 0.013295072783119581, 0.014430633057603408, 0.01200577022494694, 0.011315388654675421, 0.013359877656434442, 0.017704146495248471, 0.0089900858719559155, 0.014731590728415532, 0.0053244009632545759, 0.011199377929150522, 0.0098899254166580439, 0.012220397221188688, 0.015315682643295272, 0.0042842773538990919, 0.0098560854848898077, 0.0088592602102698509, 0.011682575531316278, 0.0098450268165344631, 0.015508017179782136, 0.0083959771972897564, 0.0057504382506886418, 0.010149849298310511, 0.011467172305959087, 0.019354427705224483, 0.013200207481702888, 0.0084555200083286791, 0.011458643458455485, 0.0067582116806278788, 0.01083616691886825, 0.013189184991857963, 0.011774794518724967, 0.014419252448288828, 0.011252283438046358, 0.013346699363583018, 0.0070752340082163006, 0.013215300343131422, 0.0083841320189162287, 0.0067600805611729283, 0.014043517055899181, 0.0098241497159076551, 0.011466675085574904, 0.01155354571355972, 0.012051701509217881, 0.010150596813866767, 0.0093930906430917619, 0.003368481869910186, 0.0048359029438027378, 0.0072083852964288445, 0.010112266453748613, 0.014009345326404186, 0.0050187514558796657, 0.0076315122645601551, 0.0098572381625301152, 0.0114902035403828, 0.018390212262653569, 0.020552166087412803, 0.010428735773226807, 0.011717974670325962, 0.011586303572796604, 0.0092978832913345726, 0.0040060048273946845, 0.012302496528511328, 0.0076707934776137684, 0.014700766223305586, 0.013491092168119941, 0.016244916923257174, 0.010387716692694397, 0.0072564046806323553, 0.0089420045528720883, 0.012125390630607462, 0.013274623392811291, 0.012783388635585766, 0.013859113028817658, 0.0080975189401925642, 0.01379241865445455, 0.012648552766643405, 0.011380280655911323, 0.010109646424218717, 0.0098577688652478051, 0.0064661895943772208, 0.010848835432253455, -0.0010986941731458047, -0.00052875821639583262, 0.0020423603076171414, 0.0035710440970171805, 0.001652886517437206, 0.0023512717524485573, -0.002695275440737862, 0.002253880812688683, -0.0080855104018828141, -0.0020090808966136161, -0.0029794078852333791, 0.00047537441103425869, -0.0010168825525621432, 0.0028683012479151873, -0.0014733214239664142, 0.0019432702158397569, -0.0012411849653504801, -0.00034507088510895141, -0.0023587874349834145, 0.0018156591123708393, 0.0040923006067568324, 0.0043522232127477072, -0.0055992642684123371, -0.0019368557792245147, 0.0026257395447205848, 0.0025594329536029635, 0.00053681548609292378, 0.0032186216144045742, -0.003338121135450386, 0.00065996843114729585, 0.006711173245189642, 0.0032877327776177517, 0.0039528629317296367, 0.0063732674764248719, -0.0026207617244284023, 0.0061381482567009048, -0.003024741769256066, -0.0023891419421980839, -0.004011235930513047, 0.0018372067754070733, 
-0.0045928077859572689, -0.0021420171112169601, 0.001665179522797816, 0.0074356736689407859, 0.0065680163280897891, -0.0038116640825467678]
data = np.column_stack([x,y])
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
y_kmeans = kmeans.predict(data)
plt.scatter(data[:, 0], data[:, 1], c=y_kmeans, s=5, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
plt.grid()
I’d also like to be able to return the max, min and average for the values in each cluster - is this possible?
Some ideas on your problem.
k-means is actually a multivariate method, so it is probably not a good choice in your case. You can take advantage of the one-dimensionality of your data by looking for minima of a kernel density estimate of the y-data. A plot of the density estimate will show a bimodal density function, with the two modes divided by a minimum, which is the y-value at which you want to split the two clusters.
Have a look at http://scikit-learn.org/stable/modules/density.html#kernel-density
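A rough sketch of that idea with scikit-learn, using the y list from the question (the bandwidth is a guess that would need tuning):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

y_arr = np.asarray(y).reshape(-1, 1)

# Fit a Gaussian kernel density estimate to the y-values only.
kde = KernelDensity(kernel='gaussian', bandwidth=0.002).fit(y_arr)

# Evaluate the estimated density on a grid and plot it; the dip between
# the two modes is the y-value at which to split the two clusters.
grid = np.linspace(y_arr.min(), y_arr.max(), 500).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))
plt.plot(grid[:, 0], density)
plt.xlabel('y')
plt.ylabel('estimated density')
plt.show()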
To get the x-values at which you divide, you could use the moving average you already computed.
However, there might be methods better suited to your kind of data. You might want to ask your question at https://stats.stackexchange.com/ as it is not really a programming problem but one about the appropriate method.
You can reshape your data to an n x 1 array.
But if you want to take time into account, I suggest you look into change detection in time series instead; it can detect a change in mean.
Using your code, the simplest way to get what you want is to change:
kmeans.fit(data)
y_kmeans = kmeans.predict(data)
to
kmeans.fit(data[:,1].reshape(-1,1))
y_kmeans = kmeans.predict(data[:,1].reshape(-1,1))
You can get the max, min, mean, etc. by indexing, for example:
np.max(data[:,1][y_kmeans == 1])
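And for all three statistics over both clusters, something along these lines (continuing from the variables above):

for label in np.unique(y_kmeans):
    values = data[:, 1][y_kmeans == label]
    print(label, values.min(), values.max(), values.mean())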

How to represent Word2Vec model to graph? (or convert a 1x300 numpy array to just 1x2 array)

I have a 1x300 numpy array from my Word2Vec model, which is returned like this:
[ -2.55022556e-01 1.06162608e+00 -5.86191297e-01 -4.43067521e-01
4.46810514e-01 4.31743741e-01 2.16610283e-01 9.27684903e-01
-4.47879761e-01 -9.11142007e-02 3.27048987e-01 -8.05553675e-01
-8.54483843e-02 -2.85595834e-01 -2.70745698e-02 -3.08014955e-02
1.53204888e-01 3.16114485e-01 -2.82659411e-01 -2.98218042e-01
-1.03240972e-02 2.12806061e-01 1.63605273e-01 9.42423999e-01
1.20789325e+00 4.11570221e-01 -5.46323597e-01 1.95108235e-01
-4.53743488e-01 -1.28625661e-01 -7.43277609e-01 1.11551750e+00
-4.51873302e-01 -1.14495361e+00 -6.69551417e-02 6.88364863e-01
-6.01781428e-01 -2.36386538e-01 -3.64305973e-01 1.18274912e-01
2.03438237e-01 -1.01153564e+00 6.67958856e-01 1.80363625e-01
1.26524955e-01 -2.96024203e-01 -9.93479714e-02 -4.93405871e-02
1.02504417e-01 7.63318688e-02 -3.68398607e-01 3.03587675e-01
-2.90227026e-01 1.51891649e-01 -6.93689287e-03 -3.99766594e-01
-1.86124116e-01 -2.86920428e-01 2.04880714e-01 1.39914978e+00
1.84370011e-01 -4.58923727e-01 3.91094625e-01 -7.52937734e-01
3.05261135e-01 -4.55163687e-01 7.22679734e-01 -3.76093656e-01
6.05900526e-01 3.26470852e-01 4.72957864e-02 -1.18182398e-01
3.51043999e-01 -3.07209432e-01 -6.10330477e-02 4.14131492e-01
7.57511556e-02 -6.48704231e-01 1.42518353e+00 -9.20495167e-02
6.36665523e-01 5.48510313e-01 5.92754841e-01 -6.29535854e-01
-4.47180003e-01 -8.99413109e-01 -1.52441502e-01 -1.98326513e-01
4.74154204e-01 -2.07036674e-01 -6.70400202e-01 6.67807996e-01
-1.04234733e-01 7.16163218e-01 3.32825005e-01 8.20083246e-02
5.88186264e-01 4.06852067e-01 2.66174138e-01 -5.35981596e-01
3.26077454e-02 -4.04357493e-01 2.19569445e-01 -2.74264365e-01
-1.65187627e-01 -4.06753153e-01 6.12065434e-01 -1.89857081e-01
-5.56927800e-01 -6.78636551e-01 -7.52498448e-01 1.04564428e+00
5.32510102e-01 5.05628288e-01 1.95120305e-01 -6.40793025e-01
5.73082231e-02 -1.58281475e-02 -2.62718409e-01 1.74351722e-01
-6.95129633e-02 3.44214857e-01 -4.24746841e-01 -2.75907904e-01
-6.60992935e-02 -1.19041657e+00 -6.01056278e-01 5.67718685e-01
-6.47478551e-02 1.55902460e-01 -2.48480186e-01 5.56753576e-01
1.29889056e-01 3.91534269e-01 1.28707469e-01 1.29670590e-01
-6.98880851e-01 2.43386969e-01 7.70289376e-02 -1.14947490e-01
-4.31593180e-01 -6.16873622e-01 6.03831768e-01 -2.07050622e-01
1.23276520e+00 -1.67524610e-02 -4.67656374e-01 1.00281858e+00
5.17916441e-01 -7.99495637e-01 -4.22653735e-01 -1.45487636e-01
-8.71369673e-04 1.25453219e-01 -1.25869447e-02 4.66426492e-01
5.07026255e-01 -6.53024793e-01 7.53435045e-02 8.33864748e-01
3.37398499e-01 7.50920832e-01 -4.80326146e-01 -4.52838868e-01
5.92808545e-01 -3.57870340e-01 -1.07011057e-01 -1.13945460e+00
3.97635132e-01 1.23554178e-01 4.81683850e-01 5.47445454e-02
-2.18614921e-01 -2.00085923e-01 -3.73975009e-01 8.74632657e-01
6.71471596e-01 -4.01738763e-01 4.76147681e-01 -5.79257011e-01
-1.51511624e-01 1.43170074e-01 5.00052273e-01 1.46719962e-01
2.43085429e-01 5.89158475e-01 -5.25088668e-01 -2.65306592e-01
2.18211919e-01 3.83228660e-01 -2.51622144e-02 2.32621357e-01
8.06669474e-01 1.37254462e-01 4.59401071e-01 5.63044667e-01
-5.79878241e-02 2.68106610e-01 5.47239482e-01 -5.05441546e-01]
This is hard to work with, because I just want a 1x2 array like [12, 19] so I can plot it on a graph and compute a cosine distance against another 1x2 array.
How can I do that? Or how can I represent the 1x300 Word2Vec vector on a 2D graph?
There are many ways to apply "dimensionality reduction" to high-dimensional data, for aid in interpretation or graphing.
One super-simple way to reduce your 300-dimensions to just 2-dimensions, for plotting on a flat screen/paper: just discard 298 of the dimensions! You'll have something to plot – such as the point (-0.255022556, 1.06162608) if taking just the 1st 2 dimensions of your example vector.
However, starting from word2vec vectors, those won't likely be very interesting points, individually or when you start plotting multiple words. The exact axes dimensions of such vectors are unlikely to be intuitively meaningful to humans, and you're throwing 99.7% of all the meaning per vector away – and quite likely the dimensions which (in concert with each other) capture semantically-meaningful relationships.
So you'd be more likely to do some more thoughtful dimensionality reduction. A super-simple technique would be to pick two vector-directions that are thought to be meaningful as your new X and Y axes. In the word2vec world, these wouldn't necessarily be existing vectors in the set (though they could be) but might be the difference between two vectors. (The analogy-solving power of word2vec vectors essentially comes from discovering the difference between two vectors A and B, then applying that difference to a third vector C to find a fourth vector D, at which point D often has the same human-intuitive analogical relationship to C as B had to A.)
For example, you might take the difference of the word-vectors for 'man' and 'woman' to get a vector which bootstraps your new X-axis, then the difference of the word-vectors for 'parent' and 'worker' to get a vector which bootstraps your new Y-axis. Then, for every candidate 300-dimensional vector you want to plot, find its "new X" by calculating the magnitude of its projection onto your X-direction-vector, and its "new Y" by calculating the magnitude of its projection onto your Y-direction-vector. This might result in a set of relative values that, on a 2-D chart, vaguely match human intuitions about often-observed linguistic relationships between gender and familial/workplace roles.
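A rough sketch of that projection idea, assuming a trained gensim model in model whose vocabulary contains the example words (the word choices are illustrative only):

import numpy as np

def project(vec, axis):
    # scalar projection of vec onto the direction of axis
    return np.dot(vec, axis) / np.linalg.norm(axis)

x_axis = model.wv['man'] - model.wv['woman']        # assumed "gender" direction
y_axis = model.wv['parent'] - model.wv['worker']    # assumed "family vs. work" direction

for word in ['king', 'queen', 'doctor', 'nurse']:
    v = model.wv[word]
    print(word, project(v, x_axis), project(v, y_axis))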
As @poorna-prudhvi's comment mentions, PCA and t-SNE are other techniques, ones which may do better at preserving certain interesting qualities of the full-dimensional data. t-SNE, especially, was invented to support machine-learning visualization and plotting, and tries to keep the distance relationships that existed in the higher number of dimensions similar in the lower number of dimensions.
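For example, with scikit-learn, stacking the word-vectors into an (n_words, 300) matrix and reducing it to 2-D (model and words are assumptions here; perplexity and the other settings need tuning for real data):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

vectors = np.vstack([model.wv[w] for w in words])           # shape (n_words, 300)

coords_pca = PCA(n_components=2).fit_transform(vectors)     # each row is now a 1x2 point
coords_tsne = TSNE(n_components=2, perplexity=5).fit_transform(vectors)   # needs n_words > perplexity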
In addition to @gojomo's answer, if it's only for experimenting I'd recommend using TensorFlow's projector, which provides a nice GUI for out-of-the-box (approximate) PCA and t-SNE.
Just use numpy.savetxt to format your vectors properly.
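For instance, a sketch (the vectors matrix and words list are assumptions, as are the file names); the projector expects one tab-separated vector per line, plus an optional metadata file of labels:

import numpy as np

np.savetxt('vectors.tsv', vectors, delimiter='\t')   # one 300-d vector per row
with open('metadata.tsv', 'w') as f:
    f.write('\n'.join(words))                        # one word label per row, same order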

computing an integral using an empirical integrand

I have an empirical probability function p(z). The first column contains z and the second column contains the p(z) values. The data is given as follows:
data.cat
+0.01234 +0.002816
+0.03693 +0.003265
+0.06152 +0.003551
+0.08611 +0.006612
+0.1107 +0.008898
+0.1353 +0.01041
+0.1599 +0.01269
+0.1845 +0.01404
+0.2091 +0.01616
+0.2336 +0.01657
+0.2582 +0.01865
+0.2828 +0.01951
+0.3074 +0.02024
+0.332 +0.02131
+0.3566 +0.0222
+0.3812 +0.02306
+0.4058 +0.02241
+0.4304 +0.02347
+0.4549 +0.02461
+0.4795 +0.02306
+0.5041 +0.02298
+0.5287 +0.02212
+0.5533 +0.02392
+0.5779 +0.02118
+0.6025 +0.02359
+0.6271 +0.024
+0.6517 +0.02196
+0.6762 +0.02155
+0.7008 +0.02314
+0.7254 +0.02037
+0.75 +0.02065
+0.7746 +0.0211
+0.7992 +0.02037
+0.8238 +0.0198
+0.8484 +0.01984
+0.873 +0.01959
+0.8976 +0.01869
+0.9221 +0.01873
+0.9467 +0.01861
+0.9713 +0.01714
+0.9959 +0.01739
+1.02 +0.01678
+1.045 +0.01633
+1.07 +0.01624
+1.094 +0.01543
+1.119 +0.01494
+1.143 +0.01547
+1.168 +0.01445
+1.193 +0.01384
+1.217 +0.01339
+1.242 +0.01384
+1.266 +0.01298
+1.291 +0.0109
+1.316 +0.0122
+1.34 +0.0111
+1.365 +0.0109
+1.389 +0.009592
+1.414 +0.01114
+1.439 +0.0111
+1.463 +0.009061
+1.488 +4.082e-05
I have to compute the following integral using the empirical probability density, by some kind of interpolation:
[integral expression given as an image]
or it can be re-written as
[equivalent form given as an image]
where the inner quantity is defined as
[definition given as an image]
and a is given as
[value given as an image]
I am wondering how I could compute this complicated integral, given that an empirical probability density function appears inside it.
Do I need to do some kind of interpolation?
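Since the integral expressions themselves are not reproduced above, here is only the generic pattern as a sketch: interpolate the tabulated p(z) so it can be evaluated anywhere, then hand the (placeholder) integrand to a quadrature routine. The file name and the form of the integrand are assumptions.

import numpy as np
from scipy.interpolate import interp1d
from scipy.integrate import quad

# Load the two columns shown above: z and p(z).
z, p = np.loadtxt('data.cat', unpack=True)

# Interpolate the empirical density so it can be evaluated between the tabulated points.
p_of_z = interp1d(z, p, kind='cubic', bounds_error=False, fill_value=0.0)

def integrand(zz):
    # Placeholder: replace with the actual expression from the question,
    # using p_of_z(zz) wherever the empirical density p(z) appears.
    return p_of_z(zz)

result, error = quad(integrand, z.min(), z.max())
print(result, error)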
