Possibility of pymoo working within a candidate search space - Python

I have a problem with two objective functions, three variables, and zero constraints.
I also have a search space for these variables, read from a CSV file.
Is it possible for pymoo to use that search space of variables (instead of xl and xu) to find the combination that maximizes the two functions?
import math
import numpy as np
from pymoo.model.problem import Problem  # pymoo.core.problem in newer pymoo releases

class MyProblem(Problem):
    def __init__(self):
        super().__init__(n_var=3,
                         n_obj=2,
                         n_constr=0,
                         # I want to use the search space of the three variables (I already have)
                         xl=np.array([0.0, 0.0, 0.0]),
                         xu=np.array([1.0, 1.0, 1.0]))

    def _evaluate(self, X, out, *args, **kwargs):
        # Maximizing the triangle area of the three variables
        # (math.sin expects radians, hence math.radians(120))
        f1 = -1 * (0.5 * math.sin(math.radians(120)) * (X[:, 0] * X[:, 1] + X[:, 2] * X[:, 1] + X[:, 0] * X[:, 2]))
        # Maximizing the sum of the variables
        f2 = -1 * (X[:, 0] + X[:, 1] + X[:, 2])
        out["F"] = np.column_stack([f1, f2])

problem = MyProblem()
When I use xl and xu, it always converges to the combination of ones [1.0, 1.0, 1.0], but I want to get the best combination from my NumPy multi-dimensional array.
import csv
import numpy as np

with open("sample_data/dimensions.csv", "r") as f:
    dimensions = list(csv.reader(f, delimiter=","))

dimensions = np.array(dimensions[1:])                   # drop the header row
dimensions = np.array(dimensions[:, 1:], dtype=float)   # drop the index column (np.float is deprecated)
dimensions
that looks like the following:
array([[0.27 , 0.45 , 0.23 ],
[0. , 0.23 , 0.09 ],
[0.82 , 0.32 , 0.27 ],
[0.64 , 0.55 , 0.32 ],
[0.77 , 0.55 , 0.36 ],
[0.25 , 0.86 , 0.18 ],
[0. , 0.68 , 0.09 ],...])
Thanks for your help!

Have you tried sampling with a numpy.array?
class pymoo.algorithms.nsga2.NSGA2(self, pop_size=100, sampling=numpy.array)
where (from the pymoo API):
The sampling process defines the initial set of solutions which are
the starting point of the optimization algorithm. Here, you have three
different options by passing
(i) A Sampling implementation which is an implementation of a random
sampling method.
(ii) A Population object containing the variables to be evaluated
initially OR already evaluated solutions (F needs to be set in this
case).
(iii) Pass a two dimensional numpy.array with (n_individuals, n_var)
which contains the variable space values for each individual.
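Putting option (iii) together with the array already loaded from the CSV in the question, a minimal sketch could look as follows (untested; it reuses the dimensions array and MyProblem from the question, and the import paths follow the older pymoo API quoted above — newer releases moved NSGA2 to pymoo.algorithms.moo.nsga2):
from pymoo.algorithms.nsga2 import NSGA2   # pymoo.algorithms.moo.nsga2 in newer releases
from pymoo.optimize import minimize

# `dimensions` is the (n_individuals, 3) array loaded from the CSV above.
# Option (iii): pass it directly as the initial population.
algorithm = NSGA2(pop_size=len(dimensions), sampling=dimensions)

res = minimize(MyProblem(), algorithm, ("n_gen", 100), verbose=True)
print(res.X)
print(res.F)
Keep in mind that this only seeds the initial population with your candidate points; crossover and mutation will still explore values anywhere between xl and xu.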

Related

Scikit FDA - Landmark_registration Problem

After a smoothing procedure, I have a problem with the landmark registration in this line:
skfda.preprocessing.registration.landmark_registration_warping(fd, land)
It returns the following error:
ValueError: `x` must be strictly increasing sequence.
fd is an FDataGrid (the typical data type used to represent the functions) with 5 samples, while land is an array of the landmarks that I want to align, and it is an increasing sequence of points (see below).
land <- array([[[0.1 , 0.134, 0.258, 0.292, 0.328, 0.558, 0.602],
[0.1 , 0.126, 0.23 , 0.256, 0.292, 0.454, 0.474],
[0.1 , 0.148, 0.25 , 0.278, 0.34 , 0.514, 0.568],
[0.1 , 0.116, 0.25 , 0.276, 0.298, 0.508, 0.612],
[0.1 , 0.132, 0.258, 0.286, 0.376, 0.59 , 0.648]]])
fd <-
Can somebody help me? I'm using the scikit-fda package to perform this kind of analysis.
This is the link to the function that I'm using:
https://fda.readthedocs.io/en/latest/modules/preprocessing/autosummary/skfda.preprocessing.registration.landmark_registration.html#skfda.preprocessing.registration.landmark_registration
I had this error when finding my own landmarks. I forgot to pass in the actual domain value at that point (in my case the peak(s) I wanted to align). Once I did that, my error changed to: ValueError: Sample points must be within the domain range. Which brings me to my next point:
Manually specifying the end-result landmark locations allowed the code to run and, from what I can tell, "work." I'm not sure if this is a bug or if I am doing something wrong myself. However, the examples they provide do explicitly state that the end-result landmark locations shouldn't have to be specified.
Additionally, the end-result landmark locations do not seem to end up at the specified points; they end up at the closest point in the grid_points array. This may not be obvious or a problem for high-sample-rate data, but for the demo GAIT data scikit-fda provides there are only 20 sample points, so it is clearly visible that the landmarks do not go exactly where specified. This is also the case when converting to a basis representation. One could possibly toy around with the interpolation options and see if that helps.
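For reference, a rough sketch of the workaround described above (manually passing the target landmark locations). The location keyword and the compose call are assumptions based on the skfda documentation linked in the question, so check them against your installed version; the target values below are purely hypothetical:
import skfda

# Hypothetical target positions, one value per landmark; in practice these
# could be, e.g., the mean landmark positions across the samples.
location = [0.1, 0.13, 0.25, 0.28, 0.33, 0.52, 0.58]

warping = skfda.preprocessing.registration.landmark_registration_warping(
    fd, land, location=location)     # `location` keyword assumed from the linked docs
fd_registered = fd.compose(warping)  # FDataGrid.compose, as used in the skfda examples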

Labels obtained from clustering seem visually incorrect

I have the following distance matrix based on 10 datapoints:
import numpy as np
distance_matrix = np.array([[0. , 0.00981376, 0.0698306 , 0.01313118, 0.05344448,
0.0085152 , 0.01996724, 0.14019663, 0.03702411, 0.07054652],
[0.00981376, 0. , 0.06148157, 0.00563764, 0.04473798,
0.00905327, 0.01223233, 0.13140022, 0.03114453, 0.06215728],
[0.0698306 , 0.06148157, 0. , 0.05693448, 0.02083512,
0.06390897, 0.05107812, 0.07539802, 0.04003773, 0.00703263],
[0.01313118, 0.00563764, 0.05693448, 0. , 0.0408836 ,
0.00787845, 0.00799949, 0.12779965, 0.02552774, 0.05766039],
[0.05344448, 0.04473798, 0.02083512, 0.0408836 , 0. ,
0.04846382, 0.03638932, 0.0869414 , 0.03579818, 0.0192329 ],
[0.0085152 , 0.00905327, 0.06390897, 0.00787845, 0.04846382,
0. , 0.01284173, 0.13540522, 0.03010677, 0.0646998 ],
[0.01996724, 0.01223233, 0.05107812, 0.00799949, 0.03638932,
0.01284173, 0. , 0.12310601, 0.01916205, 0.05188323],
[0.14019663, 0.13140022, 0.07539802, 0.12779965, 0.0869414 ,
0.13540522, 0.12310601, 0. , 0.11271352, 0.07346808],
[0.03702411, 0.03114453, 0.04003773, 0.02552774, 0.03579818,
0.03010677, 0.01916205, 0.11271352, 0. , 0.04157886],
[0.07054652, 0.06215728, 0.00703263, 0.05766039, 0.0192329 ,
0.0646998 , 0.05188323, 0.07346808, 0.04157886, 0. ]])
I transform the distance_matrix into an affinity_matrix using the following:
delta = 0.1
affinity_matrix = np.exp(- distance_matrix ** 2 / (2. * delta ** 2))
Which gives
affinity_matrix = np.array([[1. , 0.99519608, 0.7836321 , 0.99141566, 0.86691389,
0.99638113, 0.98026285, 0.37427863, 0.93375682, 0.77970427],
[0.99519608, 1. , 0.82778719, 0.99841211, 0.90477015,
0.9959103 , 0.99254642, 0.42176757, 0.95265821, 0.82433657],
[0.7836321 , 0.82778719, 1. , 0.85037594, 0.97852875,
0.81528476, 0.8777015 , 0.75258369, 0.92297697, 0.99753016],
[0.99141566, 0.99841211, 0.85037594, 1. , 0.91982353,
0.99690131, 0.99680552, 0.44191509, 0.96794184, 0.84684633],
[0.86691389, 0.90477015, 0.97852875, 0.91982353, 1. ,
0.88919645, 0.93593511, 0.68527137, 0.9379342 , 0.98167476],
[0.99638113, 0.9959103 , 0.81528476, 0.99690131, 0.88919645,
1. , 0.9917884 , 0.39982486, 0.95569077, 0.81114925],
[0.98026285, 0.99254642, 0.8777015 , 0.99680552, 0.93593511,
0.9917884 , 1. , 0.46871776, 0.9818083 , 0.87407117],
[0.37427863, 0.42176757, 0.75258369, 0.44191509, 0.68527137,
0.39982486, 0.46871776, 1. , 0.52982057, 0.76347268],
[0.93375682, 0.95265821, 0.92297697, 0.96794184, 0.9379342 ,
0.95569077, 0.9818083 , 0.52982057, 1. , 0.91719051],
[0.77970427, 0.82433657, 0.99753016, 0.84684633, 0.98167476,
0.81114925, 0.87407117, 0.76347268, 0.91719051, 1. ]])
I transform the distance_matrix into a heatmap to get a better visual of the data:
import pandas as pd
import seaborn as sns

distance_matrix_df = pd.DataFrame(distance_matrix)
distance_matrix_df.columns = [x + 1 for x in range(10)]
distance_matrix_df.index = [x + 1 for x in range(10)]
sns.heatmap(distance_matrix_df, cmap='RdYlGn_r', annot=True, linewidths=0.5)
Next I want to cluster the affinity_matrix into 3 clusters. Before running the actual clustering, I inspect the heatmap to anticipate the clusters. Clearly #8 is an outlier and will be a cluster on its own.
Next I run the actual clustering.
from sklearn.cluster import SpectralClustering

clustering = SpectralClustering(n_clusters=3,
                                assign_labels='kmeans',
                                affinity='precomputed').fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
The output yields:
[1, 1, 2, 1, 2, 1, 1, 2, 3, 2]
So, #8 is part of cluster 2, which contains three other data points. Initially, I would have assumed that it would be a cluster on its own. Did I do something wrong? Or can someone show me why #8 looks like #3, #5, and #10? Please advise.
When we move away from relatively simple clustering algorithms, such as k-means, whatever intuition we may carry along regarding algorithm results and expected behavior breaks down; indeed, the scikit-learn documentation on spectral clustering gives an implicit warning about that:
Apply clustering to a projection of the normalized Laplacian.
In practice Spectral Clustering is very useful when the structure of
the individual clusters is highly non-convex or more generally when a
measure of the center and spread of the cluster is not a suitable
description of the complete cluster. For instance when clusters are
nested circles on the 2D plane.
Now, even if one pretends to understand exactly what "a projection of the normalized Laplacian" means (I won't), the rest of the description makes it clear enough that we should not expect results similar to those of more intuitive, distance-based clustering algorithms like k-means.
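To make that description a bit more concrete, here is a rough sketch of the two stages (an approximation of what SpectralClustering does, not its exact implementation): embed the affinity matrix via the normalized graph Laplacian, then run k-means in that embedded space rather than on the raw affinities:
from sklearn.manifold import spectral_embedding
from sklearn.cluster import KMeans

# "Projection of the normalized Laplacian": a low-dimensional graph embedding
embedding = spectral_embedding(affinity_matrix, n_components=3, drop_first=False)

# k-means is then applied to the embedding, not to the original matrix
labels = KMeans(n_clusters=3, random_state=42).fit_predict(embedding)
print(labels + 1)  # shift to 1-based labels, as in the question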
Nevertheless, your own intuition is not unfounded, and it shows if you just try a k-means clustering instead of a spectral one; using your exact data, we get
from sklearn.cluster import KMeans
clustering = KMeans(n_clusters=3, random_state=42).fit(affinity_matrix)
clusters = clustering.labels_.copy()
clusters = clusters.astype(np.int32) + 1
clusters
# result:
array([2, 2, 1, 2, 1, 2, 2, 3, 2, 1], dtype=int32)
where indeed sample #8 stands out as an outlier in a cluster of its own (#3).
Nevertheless, the same intuition is not necessarily applicable or useful with other clustering algorithms, whose value is arguably exactly that they can uncover regularities of different kinds in the data; they would not be that useful if they just replicated the results of existing algorithms like k-means, would they?
The scikit-learn vignette Comparing different clustering algorithms on toy datasets might be useful for getting an idea of how different clustering algorithms behave on some toy 2D datasets; the summary figure there shows each algorithm's labeling of each dataset side by side.

'list' object has no attribute 'matmul'

I have the code below to compute Markov chain iterations. There are two matrices: the current state matrix and the transition matrix. Given a number of iterations (multiplications of the state matrix by the transition matrix), the code should save the resulting state matrix after each iteration and use it in the next one, and so on. When running the code, I get the error:
AttributeError: 'list' object has no attribute 'matmul'
I'm working with NumPy version 1.17. How can I solve it?
import numpy as np

transitionalMatrix = ([0.42, 0.16, 0.36, 0.02], [0.05, 0.43, 0.04, 0.11], [0.24, 0.16, 0.51, 0.04], [0.01, 0.31, 0.01, 0.59])
stateMatrix = ([0.20461531, 0.26104588, 0.19799357, 0.14561973])
maxIterations = 6
res = [stateMatrix]
for iteration in range(1, maxIterations):
    prev = res[iteration - 1]
    res.append(prev.matmul(transitionalMatrix))
As the error says, you are trying to call matmul on a list, which has no such attribute. Assuming that what you want to use is np.matmul(), what you should be doing is:
res.append(np.matmul(prev, transitionalMatrix))
However, as Prune pointed out, the lack of a minimal, reproducible example makes it impossible to help you any further.
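For completeness, a minimal corrected sketch of the loop, converting both inputs to NumPy arrays and using np.matmul (the @ operator would work just as well):
import numpy as np

transitionalMatrix = np.array([[0.42, 0.16, 0.36, 0.02],
                               [0.05, 0.43, 0.04, 0.11],
                               [0.24, 0.16, 0.51, 0.04],
                               [0.01, 0.31, 0.01, 0.59]])
stateMatrix = np.array([0.20461531, 0.26104588, 0.19799357, 0.14561973])
maxIterations = 6

res = [stateMatrix]
for iteration in range(1, maxIterations):
    # row vector times transition matrix, stored for the next iteration
    res.append(np.matmul(res[iteration - 1], transitionalMatrix))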

Difference in output between numpy linspace and numpy logspace

numpy.linspace returns evenly spaced numbers over a specified interval. numpy.logspace returns numbers spaced evenly on a log scale.
I don't understand why numpy.logspace often returns values "out of range" of the bounds I set. Take numbers between 0.02 and 2.0:
import numpy as np
print(np.linspace(0.02, 2.0, num=20))
print(np.logspace(0.02, 2.0, num=20))
The output for the first is:
[ 0.02 0.12421053 0.22842105 0.33263158 0.43684211 0.54105263
0.64526316 0.74947368 0.85368421 0.95789474 1.06210526 1.16631579
1.27052632 1.37473684 1.47894737 1.58315789 1.68736842 1.79157895
1.89578947 2. ]
That looks correct. However, the output for np.logspace() is wrong:
[ 1.04712855 1.33109952 1.69208062 2.15095626 2.73427446
3.47578281 4.41838095 5.61660244 7.13976982 9.07600522
11.53732863 14.66613875 18.64345144 23.69937223 30.12640904
38.29639507 48.68200101 61.88408121 78.6664358 100. ]
Why does it output 1.047 to 100.0?
2017 update: NumPy 1.12 includes a function that does exactly what the original question asked for, i.e. it returns a range between two values evenly sampled in log space.
The function is numpy.geomspace:
>>> np.geomspace(0.02, 2.0, 20)
array([ 0.02 , 0.0254855 , 0.03247553, 0.04138276, 0.05273302,
0.06719637, 0.08562665, 0.1091119 , 0.13903856, 0.17717336,
0.22576758, 0.28768998, 0.36659614, 0.46714429, 0.59527029,
0.75853804, 0.96658605, 1.23169642, 1.56951994, 2. ])
logspace computes its start and end points as base**start and base**stop respectively. The base value can be specified, but is 10.0 by default.
For your example you have a start value of 10**0.02 == 1.047 and a stop value of 10**2 == 100.
You could use the following parameters (calculated with np.log10) instead:
>>> np.logspace(np.log10(0.02) , np.log10(2.0) , num=20)
array([ 0.02 , 0.0254855 , 0.03247553, 0.04138276, 0.05273302,
0.06719637, 0.08562665, 0.1091119 , 0.13903856, 0.17717336,
0.22576758, 0.28768998, 0.36659614, 0.46714429, 0.59527029,
0.75853804, 0.96658605, 1.23169642, 1.56951994, 2. ])
This is pretty simple.
NumPy gives you numbers evenly distributed in log space, i.e. 10^value, where value is evenly spaced between your start and stop values.
You'll note that 10^0.02 is 1.04712... and 10^2 is 100.
From the documentation for numpy.logspace():
numpy.logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None)
Return numbers spaced evenly on a log scale.
In linear space, the sequence starts at base ** start (base to the
power of start) and ends with base ** stop (see endpoint below).
For your case, base defaults to 10, so it's going from 10 raised to 0.02 up to 10 raised to 2 (i.e. 100).
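A quick way to verify the relationship described in these answers: logspace(a, b) is simply base ** linspace(a, b), and taking log10 of the bounds (or using geomspace) recovers the intended 0.02–2.0 range:
import numpy as np

# logspace interprets its arguments as exponents of the base (10 by default)
a = np.logspace(0.02, 2.0, num=20)
b = 10.0 ** np.linspace(0.02, 2.0, num=20)
print(np.allclose(a, b))  # True

# log10-transformed bounds and geomspace both give values between 0.02 and 2.0
c = np.logspace(np.log10(0.02), np.log10(2.0), num=20)
d = np.geomspace(0.02, 2.0, num=20)
print(np.allclose(c, d))  # True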

Proximity Matrix in sklearn.ensemble.RandomForestClassifier

I'm trying to perform clustering in Python using Random Forests. In the R implementation of Random Forests, there is a flag you can set to get the proximity matrix. I can't seem to find anything similar in the Python scikit-learn version of Random Forest. Does anyone know if there is an equivalent calculation for the Python version?
We don't implement proximity matrix in Scikit-Learn (yet).
However, this could be done by relying on the apply function provided in our implementation of decision trees. That is, for all pairs of samples in your dataset, iterate over the decision trees in the forest (through forest.estimators_) and count the number of times they fall in the same leaf, i.e., the number of times apply gives the same node id for both samples in the pair.
Hope this helps.
Based on Gilles Louppe's answer, I have written a function. I don't know how efficient it is, but it works. Best regards.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

def proximityMatrix(model, X, normalize=True):
    # leaf index of every sample in every tree of the forest
    terminals = model.apply(X)
    nTrees = terminals.shape[1]

    # start with the co-occurrence counts from the first tree
    a = terminals[:, 0]
    proxMat = 1 * np.equal.outer(a, a)

    # add the counts from the remaining trees
    for i in range(1, nTrees):
        a = terminals[:, i]
        proxMat += 1 * np.equal.outer(a, a)

    if normalize:
        proxMat = proxMat / nTrees

    return proxMat

train = load_breast_cancer()
model = RandomForestClassifier(n_estimators=500, max_features=2, min_samples_leaf=40)
model.fit(train.data, train.target)
proximityMatrix(model, train.data, normalize=True)
## array([[ 1. , 0.414, 0.77 , ..., 0.146, 0.79 , 0.002],
## [ 0.414, 1. , 0.362, ..., 0.334, 0.296, 0.008],
## [ 0.77 , 0.362, 1. , ..., 0.218, 0.856, 0. ],
## ...,
## [ 0.146, 0.334, 0.218, ..., 1. , 0.21 , 0.028],
## [ 0.79 , 0.296, 0.856, ..., 0.21 , 1. , 0. ],
## [ 0.002, 0.008, 0. , ..., 0.028, 0. , 1. ]])
There is nothing currently implemented for this in Python. I took a first try at it here. It would be great if somebody were interested in adding these methods to scikit-learn.
