Label Propagation - Array is too big - python

I am using label propagation in scikit-learn for semi-supervised classification. I have 17,000 data points with 7 dimensions, and I am unable to use it on this data set: it throws a numpy "array is too big" error. However, it works fine on a relatively small data set, say 200 points. Can anyone suggest a fix?
label_prop_model.fit(np.array(data), labels)
File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 58, in fit
graph_matrix = self._build_graph()
File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 108, in _build_graph
affinity_matrix = self._get_kernel(self.X_) # get the affinty martix from the data using rbf kernel
File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 26, in _get_kernel
return rbf_kernel(X, X, gamma=self.gamma)
File "/usr/lib/pymodules/python2.7/sklearn/metrics/pairwise.py", line 350, in rbf_kernel
K = euclidean_distances(X, Y, squared=True)
File "/usr/lib/pymodules/python2.7/sklearn/metrics/pairwise.py", line 173, in euclidean_distances
distances = safe_sparse_dot(X, Y.T, dense_output=True)
File "/usr/lib/pymodules/python2.7/sklearn/utils/extmath.py", line 79, in safe_sparse_dot
return np.dot(a, b)
ValueError: array is too big.

How much memory does your computer have?
What sklearn might be doing here (I haven't gone through the source, so I might be wrong) is calculating squared euclidean distances between every pair of data points from the 17000xK data matrix. This yields squared euclidean distances for all data points, but unfortunately produces an NxN output matrix if you have N data points. As far as I know numpy uses double precision, which results in a 17000x17000x8-byte matrix, approximately 2.15 GB.
If your memory can't hold a matrix of that size that would cause trouble. Try creating a matrix of this size with numpy:
import numpy
mat = numpy.ones((17000, 17000))  # the shape must be passed as a tuple
If it succeeds I'm mistaken and the problem is something else (though certainly related to memory size and matrices sklearn is trying to allocate).
Off the top of my head, one way to resolve this might be to propagate labels in parts by subsampling the unlabeled data points (and possibly the labeled points, if you have many of them). If you are able to run the algorithm for 17000/2 data points and you have L labeled points, build each new data set by randomly drawing (17000-L)/2 of the unlabeled points from the original set and combining them with the L labeled points. Run the algorithm for each partition of the full set; a rough sketch is given below.
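A sketch of that partitioning idea, assuming (as scikit-learn's LabelPropagation does) that unlabeled points are marked with -1; the two-way split and the RBF kernel are placeholder choices:

import numpy as np
from sklearn.semi_supervised import LabelPropagation

data = np.asarray(data)        # the (17000, 7) feature matrix from the question
labels = np.asarray(labels)    # -1 marks unlabeled points

labeled = np.where(labels != -1)[0]
unlabeled = np.where(labels == -1)[0]
np.random.shuffle(unlabeled)

predicted = np.empty(len(labels), dtype=labels.dtype)
predicted[labeled] = labels[labeled]

n_parts = 2                    # split the unlabeled points into halves
for part in np.array_split(unlabeled, n_parts):
    idx = np.concatenate([labeled, part])      # all labeled points + one partition
    model = LabelPropagation(kernel='rbf')
    model.fit(data[idx], labels[idx])
    predicted[part] = model.transduction_[len(labeled):]   # labels inferred for this partition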
Note that this probably will reduce the performance of the label propagation algorithm, since it will have fewer data points to work with. Inconsistencies between labels in each of the sets might also cause trouble.
Use with extreme caution and only if you have some way to evaluate the performance :)
A safer approach would be to A: get more memory, or B: get a label propagation algorithm that is less memory intensive. It is certainly possible to exchange memory complexity for time complexity by recalculating euclidean distances when needed rather than constructing a full all-pairs distance matrix, as scikit appears to be doing here.

Related

Python - How to resample a 2D shape?

I am writing a python script for some geometrical data manipulation (calculating motion trajectories for a multi-drive industrial machine). Generally, the idea is that there is a given shape (let's say an ellipse, but in the general case it can be any convex shape defined by a series of 2D points), which is rotated, and its uppermost tangent point must be followed. I don't have a problem with the latter part, but I need a little hint with the 2D shape preparation.
Let's say that the ellipse was defined with too few points, for example 25. (As I said, ultimately this can be any shape, for example a rounded hexagon.) To maintain the necessary precision I need far more points (let's say 1000), preferably equally distributed over the whole shape, or with a higher density of points near corners, sharp curves, etc.
I have a few ideas ringing in my head. I guess that the DFT (FFT) would be a good starting point for this resampling; looking at scipy.signal.resample() I have found that there are far more functions in the scipy.signal package which sound promising to me...
What I'm asking for is a suggestion of which way I should follow and which tool I should try for this job, i.e. which may be the most suitable. Maybe there is a tool meant exactly for what I'm looking for, or maybe I'm overthinking this and one of the implementations of the FFT, like resample(), will work just fine (of course, after some adjustments at the starting and ending point of the shape to make sure it closes without issues)?
scipy.signal sounds promising; however, as far as I understand, it is meant to work with time-series data, not geometrical data. I guess this may cause some problems, as my data isn't a function (in the mathematical sense).
Thanks and best regards!
As far as I understood, what you want is to get an interpolated version of your original data.
The DFT (or FFT) will not achieve this purpose, since it performs a Fourier transform (which is not what you want).
Theoretically speaking, what you need in order to interpolate your data is to define a function that computes the result at the new data points.
So, let's say your data contains 5 points, each of which stores a 1D value (to simplify), and you want a new array with 10 points, filled with a linear interpolation of your original data.
Using numpy.interp:
import numpy as np
original_data = [2, 0, 3, 5, 1] # define your data in 1D
new_data_resolution = 0.5 # define new sampling distance (i.e, your x-axis resolution)
interp_data = np.interp(
    x=np.arange(0, 5 - 1 + new_data_resolution, new_data_resolution),  # new sampling points (new axis)
    xp=np.arange(len(original_data)),  # original sampling points (0, 1, ..., 4)
    fp=original_data
)
# now interp_data contains (5-1) / 0.5 + 1 = 9 points
After this, you will have an array of length (5-1) / new_resolution + 1 (which is greater than 5, since new_resolution < 1), whose values are (in this case) a linear interpolation of your original data.
After you have achieved/understood this example, you can dive into the scipy.interpolate module to get a better understanding of the interpolation functions (my example uses a linear function to fill in the data at the missing points).
Applying this to n-dimensional arrays is straightforward: iterate over each dimension of your data.
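For a closed 2D shape specifically, one option worth trying is a periodic parametric spline from scipy.interpolate. The sketch below is only illustrative: an ellipse stands in for the 25-point outline, and uniform parameter spacing stands in for any fancier density scheme; splprep/splev fit the curve and resample it at 1000 positions:

import numpy as np
from scipy import interpolate

# 25 points on an ellipse, standing in for the original sparse outline
t = np.linspace(0, 2 * np.pi, 25, endpoint=False)
x = 3.0 * np.cos(t)
y = 1.0 * np.sin(t)

# close the curve by repeating the first point, then fit a periodic parametric spline
x = np.r_[x, x[0]]
y = np.r_[y, y[0]]
tck, u = interpolate.splprep([x, y], s=0, per=True)

# resample the closed outline at 1000 evenly spaced parameter values
u_new = np.linspace(0, 1, 1000, endpoint=False)
x_new, y_new = interpolate.splev(u_new, tck)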

Is t-SNE's computational bottleneck its memory complexity?

I've been exploring different dimensionality reduction algorithms, specifically PCA and t-SNE. I'm taking a small subset of the MNIST dataset (with ~780 dimensions) and attempting to reduce the raw data down to three dimensions to visualize as a scatter plot. t-SNE is described in great detail here.
I'm using PCA as an intermediate dimensionality reduction step prior to t-SNE, as described by the original creators of t-SNE in the source code on their website.
I'm finding that T-SNE takes forever to run (10-15 minutes to go from a 2000 x 25 to a 2000 x 3 feature space), while PCA runs relatively quickly (a few seconds for a 2000 x 780 => 2000 X 20).
Why is this the case? My theory is that in the PCA implementation (directly from the primary author's Python source code), he utilizes NumPy dot products involving X and X.T:
def pca(X = Math.array([]), no_dims = 50):
    """Runs PCA on the NxD array X in order to reduce its dimensionality to no_dims dimensions."""
    print "Preprocessing the data using PCA..."
    (n, d) = X.shape;
    X = X - Math.tile(Math.mean(X, 0), (n, 1));
    (l, M) = Math.linalg.eig(Math.dot(X.T, X));
    Y = Math.dot(X, M[:,0:no_dims]);
    return Y;
As far as I recall, this is significantly more efficient than scalar operations, and it also means that only 2N (where N is the number of rows) entries of data need to be loaded into memory at a time (you load one row of X and one column of X.T).
However, I don't think this is the root reason. T-SNE definitely also contains vector operations, for example, when calculating the pairwise distances D:
D = Math.add(Math.add(-2 * Math.dot(X, X.T), sum_X).T, sum_X);
Or, when calculating P (higher dimension) and Q (lower dimension). In t-SNE, however, you have to create two N x N matrices to store the pairwise distances between every pair of data points, one for the original high-dimensional space representation and the other for the reduced-dimensional space.
In computing your gradient, you also have to create another N X N matrix called PQ, which is P - Q.
It seems to me that the memory complexity here is the bottleneck. T-SNE requires 3N^2 of memory. There is no way this can fit in local memory, so the algorithm experiences significant cache line misses and needs to go to global memory to retrieve the values.
Is this correct? How do I explain to a client or a reasonable non-technical person why t-SNE is slower than PCA?
The co-author's Python implementation is found here.
The main reason for t-SNE being slower than PCA is that no analytical solution exists for the criterion that is being optimised. Instead, a solution must be approximated through gradient descent iterations.
In practice, this means lots of for loops. Not least the main iteration for-loop at line 129, which runs up to max_iter=1000 times. Additionally, the x2p function iterates over all data points with a for loop.
The reference implementation is optimised for readability, not for computational speed. The authors link to an optimised Torch implementation as well, which should speed up the computation a lot. If you want to stay in pure Python, I recommend the implementation in Scikit-Learn, which should also be a lot faster.
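A minimal sketch of that scikit-learn route, mirroring the setup in the question (the random 2000 x 780 array and the choice of 50 intermediate components are placeholders):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(2000, 780)                    # stand-in for the MNIST subset

X_pca = PCA(n_components=50).fit_transform(X)    # cheap intermediate reduction
X_3d = TSNE(n_components=3, random_state=0).fit_transform(X_pca)
print(X_3d.shape)                                # (2000, 3), ready for a scatter plot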
t-SNE tries to lower the dimensionality while preserving the distributions of distances between elements.
This requires computing distances between all the points. The pairwise distance matrix has N^2 entries, where N is the number of examples.

How to use sklearn's IncrementalPCA partial_fit

I've got a rather large dataset that I would like to decompose but is too big to load into memory. Researching my options, it seems that sklearn's IncrementalPCA is a good choice, but I can't quite figure out how to make it work.
I can load in the data just fine:
f = h5py.File('my_big_data.h5')
features = f['data']
And from this example, it seems I need to decide what size chunks I want to read from it:
num_rows = data.shape[0] # total number of rows in data
chunk_size = 10 # how many rows at a time to feed ipca
Then I can create my IncrementalPCA, stream the data chunk-by-chunk, and partially fit it (also from the example above):
ipca = IncrementalPCA(n_components=2)
for i in range(0, num_rows//chunk_size):
    ipca.partial_fit(features[i*chunk_size : (i+1)*chunk_size])
This all goes without error, but I'm not sure what to do next. How do I actually do the dimension reduction and get a new numpy array I can manipulate further and save?
EDIT
The code above was for testing on a smaller subset of my data – as @ImanolLuengo correctly points out, it would be way better to use a larger number of dimensions and a larger chunk size in the final code.
As you well guessed the fitting is done properly, although I would suggest increasing the chunk_size to 100 or 1000 (or even higher, depending on the shape of your data).
What you have to do now is actually transform the data:
out = my_new_features_dataset # shape N x 2
for i in range(0, num_rows//chunk_size):
    out[i*chunk_size:(i+1)*chunk_size] = ipca.transform(features[i*chunk_size : (i+1)*chunk_size])
And that should give you your new transformed features. If you still have too many samples to fit in memory, I would suggest using out as another hdf5 dataset.
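A sketch of that suggestion, reusing the question's file and dataset names; the output dataset name 'transformed' and the chunk size of 1000 are made up here:

import h5py
from sklearn.decomposition import IncrementalPCA

with h5py.File('my_big_data.h5', 'r+') as f:
    features = f['data']
    num_rows = features.shape[0]
    chunk_size = 1000                 # each chunk must hold at least n_components rows

    ipca = IncrementalPCA(n_components=2)
    for i in range(0, num_rows, chunk_size):
        ipca.partial_fit(features[i:i + chunk_size])

    # pre-allocate the output on disk and write each transformed chunk to it
    out = f.create_dataset('transformed', shape=(num_rows, 2), dtype='float64')
    for i in range(0, num_rows, chunk_size):
        out[i:i + chunk_size] = ipca.transform(features[i:i + chunk_size])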
Also, I would argue that reducing a huge dataset to 2 components is probably not a very good idea, but it is hard to say without knowing the shape of your features. I would suggest reducing them to sqrt(features.shape[1]), as it is a decent heuristic, or, pro tip: use ipca.explained_variance_ratio_ to determine the best number of features for your affordable information-loss threshold.
Edit: as for explained_variance_ratio_, it returns a vector of dimension n_components (the n_components that you pass as a parameter to IPCA) where each value i indicates the percentage of the variance of your original data explained by the i-th new component.
You can follow the procedure in this answer to extract how much information is preserved by the first n components:
>>> print(ipca.explained_variance_ratio_.cumsum())
[ 0.32047581 0.59549787 0.80178824 0.932976 1. ]
Note: the numbers are fictitious, taken from the answer above and assuming that you have reduced the IPCA to 5 components. The i-th number indicates how much of the original data is explained by the first [0, i] components, as it is the cumulative sum of the explained variance ratio.
Thus, what is usually done is to fit your PCA with the same number of components as your original data:
ipca = IncrementalPCA(n_components=features.shape[1])
Then, after training on your whole data (with iteration + partial_fit) you can plot explained_variance_ratio_.cumsum() and choose how much information you want to lose. Or do it automatically:
k = np.argmax(ipca.explained_variance_ratio_.cumsum() > 0.9) + 1
np.argmax returns the first index of the cumsum array where the value is > 0.9; adding 1 turns that index into the number of PCA components that preserve at least 90% of the original data.
Then you can tweak the transformation to reflect it:
cs = chunk_size
out = my_new_features_dataset # shape N x k
for i in range(0, num_rows//chunk_size):
    out[i*cs:(i+1)*cs] = ipca.transform(features[i*cs:(i+1)*cs])[:, :k]
NOTE the slicing to :k to select only the first k components while ignoring the rest.

How to perform linear/non-linear regression between two 2-D numpy arrays and visualize it with matplotlib?

First I'd like to make clear that I need to perform regression on data relating a disease to a number of other environmental factors for a particular large country, so I have a lot of data.
Now I have this data stored in tiff files, and I'm reading them into numpy arrays through gdal. Each dataset is read into a numpy array of shape (54, 53). I have several such arrays for each dataset, and I need to perform regression between two such 2-D numpy arrays. The values in the arrays are float64. Here's an example:
[[ 162.32145691 158.19345093 153.15704346 ..., 123.77481079 123.63883972 123.6770401 ]
[ 164.55152893 160.59266663 155.75968933 ..., 121.28504181 121.1164093 121.16275024] ...,
[ 321.38272095 329.53326416 338.85699463 ..., 193.69404602 192.50938416 191.42672729]]
Like DiseaseDataset vs EnvironmentFactor1, DiseaseDataset vs EnvironmentFactor2, etc. Since the relationship is rather unknown, arbitrary and complex, I want to plot these 2-D arrays first, but I could not find an appropriate way.
So how do I plot the 2-D arrays in a scatter plot in matplotlib? I said scatter plot because it'd be easier for me to infer the relationship and move on to an appropriate regression model (linear, non-linear, logarithmic, etc.). I used the following code to plot the relationship row-wise between the two numpy arrays:
for i in range(55):
    plt.scatter(JanTemp[i], can02[i])
    plt.title('Disease vs Temperature')
    plt.ylabel('DiseaseCases')
    plt.xlabel('Temp')
    plt.show()
Here can02 is the response variable and JanTemp is the predictor variable. As expected, I got 54 consecutive graphs, in the same color for both variables, which is frustrating (it's my first ever experience with matplotlib and I don't know how to give each variable its own color). Is there a better way to do it? If yes, please suggest one. I think it would need 3-D visualization, but then how would I be able to infer anything from it? So please suggest a way to visualize in 2-D space, but better than the above.
Since I couldn't get much info from the plots, I decided to begin with linear regression. I used scipy.stats.linregress similarly, iteratively for each row, in the following manner:
months = [JanTemp, FebTemp, MarTemp1, AprTemp, MayTemp, JunTemp, JulTemp, AugTemp, SepTemp, OctTemp, NovTemp, DecTemp]
for month in months:
    csum = 0
    pcsum = 0
    for i in range(54):
        slope, intercept, r_value, p_value, std_err = stats.linregress(month[i], can02[i])
        csum += r_value
        pcsum += (r_value**2)*100
    print "mean correlation coefficient is", csum/53
    print "The avg COD is", pcsum/53
Here JanTemp, FebTemp, etc. are each files of dimension 54 x 53. For each file, I'm doing row-vs-row regression 53 times. This is also rather tedious. Is there a better way to do it, like a function, module, etc.?
The other method I was aware of was using Ordinary Least Square(OLS) of statsmodels.api module in the following manner:
y = can02
x = JanTemp
X = sm.add_constant(x) #Adds a constant to the linear eq of regression
est = sm.OLS(y, X) #OLS performs the regression of predictor on response
est = est.fit() #fit method of OLS fits the model
est.summary() #Gives the summary of whole calculation
est.params #gives the coefficient of regression
But I get the following long error:
Traceback (most recent call last):
File "H:\Python\results.py", line 77, in <module>
est.summary() #Gives the summary of whole calculation
File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 1230, in summary
top_right = [('R-squared:', ["%#8.3f" % self.rsquared]),
File "C:\Python27\lib\site-packages\statsmodels\tools\decorators.py", line 95, in __get__
_cachedval = self.fget(obj)
File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 959, in rsquared
return 1 - self.ssr/self.centered_tss
File "C:\Python27\lib\site-packages\statsmodels\tools\decorators.py", line 95, in __get__
_cachedval = self.fget(obj)
File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 931, in ssr
return np.dot(wresid, wresid)
ValueError: matrices are not aligned
I didn't get how the matrices are not aligned. Anyway, sticking to my original question: is there any other way, similar to this, to perform regression, and how would I do it on 2-D arrays?
Thanks, I know I took a lot of your precious time in this long question but I wanted to be clear. I've searched numerous questions on this site and at other websites but I couldn't find an appropriate or related solution. Thanks.
Do you actually have 3D data with axes location, parameter, year? Then there is very little geographical about this.
I do not think the problem is numpy at all, but rather how to analyze the data. (Tool-wise, you might be interested in pandas, as soon as you know what you want.)
There are some very sophisticated statistical methods for this type of work, but you may start with some simple concepts as you have done with the linear regression. First, you should separate the dependent variables (outcomes, e.g. diseases) and independent variables (e.g. temperatures) and look at one dependent variable at a time.
A simple example: Take just one disease. For that you have the number of cases at N locations during M years. Then take all P environmental factors you have. Now you can calculate the time series correlation at each location between the disease and all P environmental factors. This results in P numbers for each N locations.
If you show this as an image (N rows, P columns), you may look for columns with high intensity. They represent disease-environmental factor pairs which seem to repeat in many locations. This is not a statistically rigorous method, but it gives a quick overview.
I am not giving too many code examples, as the statistical basis needs to be thought of before making any visualizations. The visualization part is then usually easier. Unfortunately, there is no simple visualization for the type of data you have.
But for the scatter plot: http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.scatter . For example, to draw red markers instead of blue ones: scatter(x, y, c='r'). If you only want a single color per data series, you may also use plt.plot(x, y, 'r.') ('r' defines the color, '.' that we want separate data points).
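That said, a rough sketch of the correlation-image idea above might look like this (the disease and factors arrays are made-up placeholders, shaped locations x years):

import numpy as np
import matplotlib.pyplot as plt

disease = np.random.rand(54, 12)                      # N locations x M years, one disease
factors = [np.random.rand(54, 12) for _ in range(6)]  # P environmental factors

corr = np.empty((disease.shape[0], len(factors)))     # N x P correlation image
for j, fac in enumerate(factors):
    for i in range(disease.shape[0]):
        corr[i, j] = np.corrcoef(disease[i], fac[i])[0, 1]

plt.imshow(corr, aspect='auto', cmap='coolwarm', vmin=-1, vmax=1)
plt.xlabel('Environmental factor')
plt.ylabel('Location')
plt.colorbar(label='Correlation')
plt.show()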

Using Numpy to find the average distance in a set of points

I have an array of points in unknown dimensional space, such as:
data=numpy.array(
[[ 115, 241, 314],
[ 153, 413, 144],
[ 535, 2986, 41445]])
and I would like to find the average euclidean distance between all points.
Please note that I have over 20,000 points, so I would like to do this as efficiently as possible.
Thanks.
If you have access to scipy, you could try the following:
scipy.spatial.distance.cdist(data,data)
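That gives the full N x N distance matrix; a short sketch of turning it into the average follows. Note that for 20,000 points the full matrix is about 3 GB in double precision, so the condensed form from pdist, which stores each pair only once, may be preferable:

import numpy as np
from scipy.spatial.distance import cdist, pdist

data = np.random.rand(1000, 3)                  # stand-in for your point set

dists = cdist(data, data)                       # (N, N) matrix, every pair counted twice
avg = dists[np.triu_indices_from(dists, k=1)].mean()

# equivalent, with roughly half the memory: pdist returns each pair once
avg_condensed = pdist(data).mean()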
Well, I don't think that there is a super fast way to do this, but this should do it:
tot = 0.
for i in xrange(data.shape[0]-1):
    tot += ((((data[i+1:]-data[i])**2).sum(1))**.5).sum()
avg = tot/((data.shape[0]-1)*(data.shape[0])/2.)
Now that you've stated your goal of finding the outliers, you are probably better off computing the sample mean and, with that, the sample variance, since both those operations will give you an O(nd) operation. With that, you should be able to find outliers (e.g. excluding points further from the mean than some fraction of the std. dev.), and that filtering process should be possible to perform in O(nd) time for a total of O(nd).
You might be interested in a refresher on Chebyshev's inequality.
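A small sketch of that O(nd) filtering idea; the 3-standard-deviation cutoff is purely an illustrative choice:

import numpy as np

data = np.random.rand(20000, 3)                 # stand-in for your points

center = data.mean(axis=0)                      # sample mean, O(nd)
d = np.linalg.norm(data - center, axis=1)       # distance of each point to the mean
cutoff = d.mean() + 3 * d.std()                 # illustrative threshold
outliers = data[d > cutoff]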
Is it ever worthwhile to optimize without a working solution? Also, computation of a distance matrix over the entire data set rarely needs to be fast because you only do it once; when you need to know the distance between two points, you just look it up, as it's already calculated.
So if you don't have a place to start, here's one. If you want to do this in NumPy without the need to write any inline Fortran or C, that should be no problem, though perhaps you want to include the small vector-based virtual machine called "numexpr" (available on PyPI, trivial to install), which in this case gave a 5x performance boost versus NumPy alone.
Below I've calculated a distance matrix for 10,000 points in 2D space (a 10k x 10k matrix giving the distance between all 10k points). This took 59 seconds on my MBP.
import numpy as NP
import numexpr as NE
# data are points in 2D space (x, y)--obviously, this code can accept data of any dimension
x = NP.random.randint(0, 10, 10000)
y = NP.random.randint(0, 10, 10000)
fnx = lambda q : q - NP.reshape(q, (len(q), 1))
delX = fnx(x)
delY = fnx(y)
dist_mat = NE.evaluate("(delX**2 + delY**2)**0.5")
There's no getting around the number of evaluations: the number of pairs is Sum[n - i, {i, 0, n}] = n(n + 1)/2.
But you can save yourself the expense of all those square roots if you can get by with an approximate result. It depends on your needs.
If you're going to calculate an average, I would advise you to not try putting all the values into an array before calculating. Just calculate the sum (and sum of squares if you need standard deviation as well) and throw away each value as you calculate it.
[Two formula images from the original answer are missing here.] I don't know if this means you have to multiply by two somewhere.
If you want a fast and inexact solution, you could probably adapt the Fast Multipole Method algorithm.
Points that are separated by a small distance have a smaller contribution to the final average distance, so it would make sense to group points into clusters and compare the clusters distances.
