I know my question is a bit odd, but I am trying to implement my own version of the CLARANS algorithm for the sake of learning. To understand it better, I tried to work through the CLARANS code of the pyclustering library (which is outdated but still seems popular, and I've seen it used in a few places). Here it is:
https://github.com/annoviko/pyclustering/blob/master/pyclustering/cluster/clarans.py
I understood everything (or thought I did) until line 210, just before the cost calculation takes place:
distance_nearest = float('inf')
if ( (point_medoid_index != candidate_medoid_index) and (point_medoid_index != current_medoid_cluster_index) ):
    distance_nearest = euclidean_distance_square(self.__pointer_data[point_index], self.__pointer_data[point_medoid_index])
Is there a bug inside the library? Let's say we have 1000 data points and 3 clusters (so 3 medoids of course).
Why would we set distance_nearest = float('inf') for any of the points (especially since we add distance_nearest to the cost later in the code)? And what is more, why would we compare the index of the analyzed point's medoid (which could be, say, 400) to current_medoid_cluster_index (which can only take values from 0 to 2)? What's the point of that?
I'm sorry if it's a bit chaotic, but I'm honestly looking for someone either interested in going through the code or someone who understands the code already - I am willing to elaborate further if needed.
If someone understands it and knows there is no bug, could you please explain the cost calculation part?
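For what it's worth, here is the brute-force version of what a swap is evaluated against, the way PAM/CLARANS is usually described in textbooks (a sketch with made-up data, not pyclustering's code). pyclustering computes the same delta incrementally, per point, which is where the float('inf') comes from: when the analyzed point's own medoid is the one being swapped out, there is no "keep the current assignment" option, so that branch is seeded with infinity and only the remaining alternatives can win the min.

```python
import numpy as np

def total_cost(data, medoids):
    """Total clustering cost: sum over non-medoid points of the squared
    Euclidean distance to their nearest medoid."""
    cost = 0.0
    for p in range(len(data)):
        if p in medoids:
            continue
        cost += min(float((data[p] - data[m]) @ (data[p] - data[m]))
                    for m in medoids)
    return cost

def swap_cost_delta(data, medoids, current, candidate):
    """Cost change if medoid `current` (a data index) is replaced by the
    non-medoid point `candidate` (also a data index)."""
    new_medoids = [candidate if m == current else m for m in medoids]
    return total_cost(data, new_medoids) - total_cost(data, medoids)

data = np.array([[0, 0], [0, 1], [10, 0], [10, 1], [5, 5]], dtype=float)
# evaluate replacing medoid 0 with point 1; a negative delta means the
# swap improves the clustering
delta = swap_cost_delta(data, [0, 2], current=0, candidate=1)
```

On this toy data the swap lowers the total cost from 52 to 43 (delta of -9), so CLARANS would accept it. The incremental per-point branches in pyclustering exist purely to compute this same delta without re-scanning every medoid for every point.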
I've been playing around with GEKKO for solving flow optimizations and I have come across behavior that is confusing me.
Context:
Sources --> [mixing and delivery] --> Sinks
I have multiple sources (where my flow is coming from) and multiple sinks (where my flow goes to). For a given source (e.g., SOURCE_1), the total flow to the resulting sinks must equal the volume from SOURCE_1. This is my notion of conservation of mass, where the 'mixing' plant blends all the source volumes together.
Constraint Example (DOES NOT WORK AS INTENDED):
When I try to create a constraint for the two SINK volumes, and the one SOURCE volume:
m.Equation(volume_sink_1[i] + volume_sink_2[i] == max_volumes_for_source_1)
I end up with weird results. By that I mean the solution is not actually optimal; values are assigned very poorly, and I am off from the optimum by at least 10% (I tried with different max volumes).
Constraint Example (WORKS BUT I DON'T GET WHY):
When I try to create a constraint for the two SINK volumes, and the one SOURCE volume like this:
m.Equation(volume_sink_1[i] + volume_sink_2[i] <= max_volumes_for_source_1 * 0.999999)
With this, I get MUCH closer to the actual optimum, to the point where I can just treat it as the optimum. Note that I had to change the constraint to a less-than-or-equal and also multiply by 0.999999; I arrived at that by trial and error after fiddling with it nonstop.
Also note that this uses practically all of the source (up to 99.9999% of it), as I would expect. So both formulations make sense to me, but the first approach doesn't work.
The only explanation I can think of for this behavior is that == is stricter to solve for than <=. That still doesn't explain why I have to multiply by 0.999999, though.
Why is this the case? Also, is there a way for me to debug occurrences like this easier?
This same improvement occurs with complementarity constraints for conditional statements: s1*s2<=0 (easier to solve) versus s1*s2==0 (harder to solve).
From the research papers I've seen, the justification is that the solver has more room to search for the optimal solution, even though it always ends up at s1*s2==0. It also sounds like your problem may have multiple local minima, if it converges to a solution that isn't the global optimum.
If you can post a complete and minimal problem that demonstrates the issue, we can give more specific suggestions.
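To make the effect concrete, here is a minimal sketch of the complementarity example using SciPy's SLSQP rather than GEKKO (the toy objective and starting point are made up; the same pattern applies to GEKKO models). Both formulations describe the same feasible set, but the inequality leaves the solver slack while it is still infeasible:

```python
import numpy as np
from scipy.optimize import minimize

# Toy complementarity problem (made up for illustration): both s1 and s2
# "want" to be 1, but with s1, s2 >= 0 at most one may be nonzero.
# s1*s2 == 0 and s1*s2 <= 0 describe the SAME feasible set.
obj = lambda s: (s[0] - 1.0) ** 2 + (s[1] - 1.0) ** 2
bounds = [(0.0, None), (0.0, None)]
x0 = [0.6, 0.3]  # slightly asymmetric, infeasible starting point

res_eq = minimize(obj, x0, bounds=bounds, method="SLSQP",
                  constraints=[{"type": "eq", "fun": lambda s: s[0] * s[1]}])
# SciPy's "ineq" convention means fun(s) >= 0, so s1*s2 <= 0 is -s1*s2 >= 0
res_le = minimize(obj, x0, bounds=bounds, method="SLSQP",
                  constraints=[{"type": "ineq", "fun": lambda s: -s[0] * s[1]}])
```

Both runs should end with s1*s2 approximately 0 at one of the two equivalent optima, (1, 0) or (0, 1), but in practice the inequality form tends to converge more robustly, which is the same effect the papers describe. Your 0.999999 factor plays a similar role: it backs the constraint off the degenerate boundary slightly, giving the solver room to maneuver.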
I'm learning about memoization and although I have no trouble implementing it when it's needed in Python, I still don't know how to identify situations where it will be needed.
I know it's implemented when there are overlapping subcalls, but how do you identify if this is the case? Some recursive functions seem to go deep before an overlapping call is made. How would you identify this in a coding interview? Would you draw out all the calls (even if some of them go 5-6 levels deep in an O(2^n) complexity brute force solution)?
Cases like the Fibonacci sequence make sense because the overlap happens immediately (the 'fib(i-1)' call will almost immediately overlap with the 'fib(i-2)' call). But for other cases, like the knapsack problem, I still can't wrap my head around how anyone can identify that memoization should be used while at an interview. Is there a quick way to check for an overlap?
I hope my question makes sense. If someone could point me to a good resource, or give me clues to look for, I would really appreciate it. Thanks in advance.
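One practical trick for spotting overlap without drawing the whole call tree: instrument the brute-force recursion and count how often each argument shows up. A minimal sketch with naive Fibonacci:

```python
from collections import Counter

calls = Counter()

def fib(n):
    calls[n] += 1  # record every argument the recursion is asked about
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

result = fib(10)
total_calls = sum(calls.values())  # every call, including repeats
distinct_args = len(calls)         # how many different subproblems exist
```

fib(10) makes 177 calls but only ever sees 11 distinct arguments; whenever total calls vastly exceed distinct arguments, the recursion keeps re-solving the same subproblems and memoization will collapse the tree. The same instrumentation works for knapsack by counting (index, capacity) pairs.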
In order to reach the "memoization" solution, you first need to identify the following two properties in the problem:
1. Overlapping subproblems
2. Optimal substructure
Looks like you do understand (1). For (2):
Optimal substructure: a problem S of size n has optimal substructure if its optimal solution can be calculated by JUST looking at the optimal solutions of subproblems of size < n (and NOT all solutions to the subproblems), and combining those optimal sub-solutions yields an optimal solution for S.
For more detail, please take a look at the following answer:
https://stackoverflow.com/a/59461413/12346373
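To connect this to the knapsack example from the question: the recursion's state is the pair (item index, remaining capacity), and different include/exclude paths can land on the same state deep in the tree; that state collision is the overlap. A sketch with made-up weights and values, memoized with functools.lru_cache:

```python
from functools import lru_cache

# Made-up instance chosen so that different paths collide on a state:
# taking items 0 and 1 (weights 2 + 3) consumes the same capacity as
# taking item 2 alone (weight 5), so both paths reach state (3, cap - 5).
weights = [2, 3, 5, 4]
values = [3, 4, 8, 5]

@lru_cache(maxsize=None)
def best(i, cap):
    # state = (next item index, remaining capacity)
    if i == len(weights) or cap == 0:
        return 0
    skip = best(i + 1, cap)
    if weights[i] > cap:
        return skip
    return max(skip, values[i] + best(i + 1, cap - weights[i]))

answer = best(0, 10)
hits = best.cache_info().hits  # > 0 proves the subproblems really overlap
```

Here the cache records at least one hit even on a 4-item instance, which you could never see just by eyeballing the first few levels of the tree; that is exactly the kind of check that answers "is there overlap?" mechanically.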
I'm using SciPy's optimization functions (in particular shgo()) in order to optimize my problem. Right now I'm managing to get a valid solution, however I would like to improve this a little bit.
My function solves an NLU problem. Basically, I have a tokenized sentence, and for each word I have several potential interpretations. For each combination I can apply black-box grammar rules, which produce a score.
The problem with this is that in terms of complexity it can be disastrous, since it's O(exp(n)).
For this reason I'm using the shgo() optimization algorithm (or similar things), which so far gives me good results; the only issue is that the function being minimized uses real values, while my parameters are integers (word 1 = interpretation 2, word 2 = interpretation 1, ..., word N = interpretation I).
In the end, even for options that are fairly obvious (one interpretation or fewer per word) it takes 170 runs, because the minimizer tries to pin down an exact real value while it's exploring the range [0, 1[, which is all the same thing for me.
I would like to have integer steps, but after playing with the different parameters a bit I couldn't find how to tell the minimizer to take coarser steps. Even if it's not strictly integers, just having it stop when it's within 0.5 of a solution would already be a wonderful improvement.
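For what it's worth, shgo() has no native integer support, so one common workaround (sketched below with a made-up stand-in for the black-box grammar score) is to snap the real-valued candidate to integers inside the objective, so that the entire cell around each integer assignment has a single objective value:

```python
import numpy as np
from scipy.optimize import shgo

# Hypothetical stand-in for the black-box grammar score: pretend the best
# interpretation indices for a 3-word sentence are (2, 1, 3).
def score(interp):
    target = np.array([2, 1, 3])
    return float(np.sum((np.asarray(interp) - target) ** 2))

def objective(x):
    # Snap the real-valued candidate to integer interpretation indices,
    # so every point in [k - 0.5, k + 0.5) maps to the same value and the
    # solver can no longer waste runs refining decimals inside one cell.
    return score(np.rint(x).astype(int))

bounds = [(0, 4)] * 3  # e.g. five candidate interpretations per word
res = shgo(objective, bounds)
best_assignment = np.rint(res.x).astype(int)
```

After the solver finishes, rounding res.x recovers the integer assignment. Note this makes the objective piecewise constant, so the result still hinges on how the space is sampled; it is a pragmatic hack, not true integer programming.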
Edit: you can have a look at the code if you want.
Thanks!
I'm taking the Python course on Codecademy. I have never been good at math; honestly, I never put much effort into it. My question is about the exercise below. When they say "Set bool_three equal to the result of 100 ** 0.5 >= 50 or False", is "100 ** 0.5 >= 50 or False" just a made-up example, or will I need expressions like that when I start coding Python on my own? I have been doing great in the course so far, but when I get to questions like that I go brain-dead for a second. I end up figuring them out (though some I do need to look up), but is this going to be a common theme when I'm coding, or did they just explain it this way?
Thanks
Time to practice with or!
Set bool_one equal to the result of 2**3 == 108 % 100 or 'Cleese' == 'King Arthur'
Set bool_two equal to the result of True or False
Set bool_three equal to the result of 100**0.5 >= 50 or False
Set bool_four equal to the result of True or True
Set bool_five equal to the result of 1**100 == 100**1 or 3 * 2 * 1 != 3 + 2 + 1
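To make the exercise concrete, here is how each expression actually evaluates; what's being tested is operator precedence and how or works, not advanced math:

```python
bool_one = 2 ** 3 == 108 % 100 or 'Cleese' == 'King Arthur'
# 2**3 is 8 and 108 % 100 is 8, so 8 == 8 is True; True or ... is True

bool_two = True or False  # True

bool_three = 100 ** 0.5 >= 50 or False
# 100**0.5 is the square root of 100, i.e. 10.0; 10.0 >= 50 is False,
# and False or False is False

bool_four = True or True  # True

bool_five = 1 ** 100 == 100 ** 1 or 3 * 2 * 1 != 3 + 2 + 1
# 1**100 is 1 and 100**1 is 100, so the left side is False;
# 3*2*1 and 3+2+1 are both 6, so the right side is also False -> False
```

Note that or short-circuits: in bool_one, Python never even compares the strings, because the left side is already True.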
What kind of math you will need depends entirely on your application. YOUR code may not use math at all, with all the math that is needed being done by libraries and frameworks you use. There is always math involved somewhere; whether you come into contact with it is another question.
If you develop an equation solver, then chances are most of your code will end up "being math".
If you meant to ask whether it's common to mix boolean logic with arithmetic like this, then in my experience, no, it isn't. Typically you let flow control dictate which piece of math is used and then proceed; you don't mangle it all into a single expression.
Math (algebra) and programming are closely coupled in my head. Being good at math sharpens your problem-solving skills.
Don't worry! That's something you can acquire by learning more about problem solving. You don't have to take math classes (even though taking some would still improve the math side).
From the example you gave, I can see that it caught your attention, possibly along the lines of:
OMG! Math?!!!!
The example is really nothing to be scared of, and this kind of math is worth understanding while you take the programming course, especially if you want to become a good programmer.
Stack Overflow is here and always will be. Whenever you have a problem you don't know how to work out, just post your question. That's how we all learn ;)
Happy learning.
I've already posted it on robotics.stackexchange but I had no relevant answer.
I'm currently developing a SLAM software on a robot, and I tried the Scan Matching algorithm to solve the odometry problem.
I read this article: "Metric-Based Iterative Closest Point Scan Matching for Sensor Displacement Estimation".
I found it really well explained, and I strictly followed the formulas given in the article to implement the algorithm.
You can see my implementation in python there :
ScanMatching.py
The problem I have is that during my tests, the right rotation is found, but the translation is completely wrong; the translation values are extremely high.
Do you guys have any idea what the problem in my code could be?
Otherwise, should I post my question on Mathematics Stack Exchange?
The ICP part should be correct, as I have tested it many times, but the least-squares minimization doesn't seem to give good results.
As you'll notice, I used many bigfloat.BigFloat values, because sometimes the max float was not big enough to hold some values.
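I can't see the linked implementation here, but since the symptom is a correct rotation with a wildly wrong translation, it may help to cross-check the least-squares step against the standard closed-form SVD (Kabsch) solution, sketched below in 2D. This is not the method from the article, just a known-good reference; a classic cause of exactly this symptom is estimating the rotation on uncentered points:

```python
import numpy as np

def rigid_fit(src, dst):
    """Least-squares rigid transform (R, t) with dst_i ~ R @ src_i + t,
    using the standard SVD (Kabsch) construction.
    src, dst: (n, 2) arrays of matched points."""
    c_src = src.mean(axis=0)
    c_dst = dst.mean(axis=0)
    # The rotation must be estimated on CENTERED point sets; skipping the
    # centering is a classic cause of a correct rotation paired with a
    # huge, wrong translation.
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = c_dst - R @ c_src     # translation comes from the centroids
    return R, t

# sanity check against a known displacement: 30 degrees plus (2, -1)
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([2.0, -1.0])
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])
dst = src @ R_true.T + t_true
R, t = rigid_fit(src, dst)
```

If your ICP correspondences are right but your minimization gives a different t than this on the same point pairs, the bug is in the minimization step rather than in ICP. Note there is no need for bigfloat here; with centered coordinates the intermediate values stay small.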
I don't know if you have already solved this issue.
I didn't read the full article, but I noticed it is rather old.
IMHO (I'm not the expert here), I would try combining specialized algorithms: feature detection and description to get a point cloud, a descriptor matcher to relate points, and bundle adjustment to get the roto-translation matrix.
I am myself going to try sba (http://users.ics.forth.gr/~lourakis/sba/), or more specifically cvsba (http://www.uco.es/investiga/grupos/ava/node/39/), because I'm on OpenCV.
If you have enough CPU/GPU power, give the AKAZE feature detector and descriptor a chance.