Python data structure to implement photon simulation - python

The Problem
I have a function, with 2 independent variables, using which I need to construct a 2-D grid, with parameters of my choice. These parameters will be the ranges of the two independent variables, smallest/largest cell size etc.
Visually this can be thought of as a space partition data structure whose geometry is provided by my function. To each cell, I will be then assigning some properties (that will also be determined by the function and other things).
My Idea
Once such data structure is prepared, I can simulate a photon (in a Monte-Carlo based approach) which will travel to a cell randomly (with some constraints, which are given by the function and cell properties), will be absorbed or scattered (re-emitted) from that cell with some probabilities (at which point I will be solving the radiative transfer equation in that cell). After this, the photon, if re-emitted, with its wavelength now different, moves to a neighbouring cell, and will keep moving till it escapes the grid (whose boundaries were decided by the parameters) or is completely absorbed in one of the cells. So, my data structure should also be in such a way that it can access the nearest neighboring cells efficiently from a computation point of view.
What I have done
I looked into scipy.spatial.kdtree, but I am not sure I will be able to assign properties to cells the way I want (though if someone can explain that, it would be really helpful, as it is very good at accessing nearest neighbours). I have also looked at other tree-based algorithms, but I am a bit lost as to how to implement them. NumPy arrays are, at the end of the day, matrices, which do not suit my function (a dense array wastes memory on cells that are never used).
So any suggestions on data structures I can use (and some nudge towards how I can) will be really nice.
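One possible starting point, offered as a sketch rather than a definitive answer: store only the cells that exist in a dictionary keyed by an integer index, and attach the properties to each cell object. The names `Cell`, `SparseGrid`, and the `opacity` property are hypothetical, and the sketch assumes every cell can be given an integer (i, j) index; cells of strongly varying size would need a tree instead.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """One grid cell; the property names stored in props are placeholders."""
    x_range: tuple
    y_range: tuple
    props: dict = field(default_factory=dict)


class SparseGrid:
    """Stores only the cells that exist, keyed by integer (i, j) index."""

    def __init__(self):
        self.cells = {}

    def add(self, i, j, cell):
        self.cells[(i, j)] = cell

    def neighbours(self, i, j):
        """Return indices of existing cells among the 8 surrounding positions."""
        return [(i + di, j + dj)
                for di in (-1, 0, 1) for dj in (-1, 0, 1)
                if (di, dj) != (0, 0) and (i + di, j + dj) in self.cells]


grid = SparseGrid()
for i in range(3):
    for j in range(3):
        if (i, j) != (2, 2):  # leave one position empty on purpose
            grid.add(i, j, Cell((i, i + 1), (j, j + 1), {"opacity": 0.1 * (i + j)}))
```

Each neighbour lookup costs a fixed number of dictionary probes, and empty regions take no memory, which addresses the objection to a dense array.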

Related

Match the vertices of two identical labelled graphs

I have a rather simple problem to define but I did not find a simple answer so far.
I have two graphs (ie sets of vertices and edges) which are identical. Each of them has independently labelled vertices. Look at the example below:
How can the computer detect, without prior knowledge of it, that 1 is identical to 9, 2 to 10 and so on?
Note that in the case of symmetry, there may be several possible one to one pairings which give complete equivalence, but just finding one of them is sufficient to me.
This is in the context of a Python implementation. Does someone have a pointer towards a simple algorithm publicly available on the Internet? The problem sounds simple, but I simply lack the mathematical knowledge to come up with it myself or to find the proper keywords to search for.
EDIT: Note that I also have atom types (ie labels) for each graph, as well as the full distance matrix for the two graphs to align. However, the positions may be similar but not exactly equal.
This is known as the graph isomorphism problem. It is probably very hard, although exactly how hard is still a subject of research.
(But things look better if your graphs are planar.)
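For small labelled graphs, a plain backtracking search is often enough in practice. The following is a minimal sketch under stated assumptions, not a reference algorithm: graphs are given as adjacency dictionaries, labels (for example atom types) as separate dictionaries, and the function name `find_isomorphism` is made up for illustration. The worst case is still exponential.

```python
def find_isomorphism(adj1, labels1, adj2, labels2):
    """Return a dict mapping vertices of graph 1 to graph 2, or None.

    adj*: dict vertex -> set of neighbouring vertices (undirected).
    labels*: dict vertex -> label (e.g. atom type).
    """
    if len(adj1) != len(adj2):
        return None
    # Visit high-degree vertices first; they usually have the fewest candidates.
    order = sorted(adj1, key=lambda v: -len(adj1[v]))

    def extend(mapping, used):
        if len(mapping) == len(order):
            return dict(mapping)
        v = order[len(mapping)]
        for w in adj2:
            if w in used:
                continue
            # Prune on label and degree before checking edges.
            if labels1[v] != labels2[w] or len(adj1[v]) != len(adj2[w]):
                continue
            # Every already-mapped vertex must agree on adjacency.
            if all((u in adj1[v]) == (mapping[u] in adj2[w]) for u in mapping):
                mapping[v] = w
                used.add(w)
                result = extend(mapping, used)
                if result is not None:
                    return result
                del mapping[v]
                used.discard(w)
        return None

    return extend({}, set())


# A three-vertex chain C-C-O, labelled independently in the two graphs.
adj_a = {1: {2}, 2: {1, 3}, 3: {2}}
lab_a = {1: "C", 2: "C", 3: "O"}
adj_b = {9: {10}, 10: {9, 11}, 11: {10}}
lab_b = {9: "O", 10: "C", 11: "C"}
mapping = find_isomorphism(adj_a, lab_a, adj_b, lab_b)
```

When symmetry allows several valid pairings, this returns the first one found, which matches the requirement that any single pairing is sufficient.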
So, after searching for it a bit, I think that I found a solution that works most of the time for moderate computational cost. This is a kind of genetic algorithm which uses a bit of randomness, but it is practical enough for my purposes it seems. I didn't have any aberrant configuration with my samples so far even if it is theoretically possible that this happens.
Here is how I proceeded:
Determine the complete set of 2-paths, 3-paths and 4-paths
Determine vertex types using both atom type and surrounding topology, creating an "identity card" for each vertex
Do the following ten times:
Start with a random candidate set of pairings complying with the allowed vertex types
Evaluate how many of the 2-paths, 3-paths and 4-paths correspond between the two graphs under this pairing, scoring one point for each corresponding vertex (also using the atom type as an additional descriptor)
Evaluate all other shortlisted candidates for a given vertex by permuting the pairings for this candidate with its other positions in the same way
Sort the scores in descending order
For each score, check if the configuration is among the excluded configurations, and if it is not, take it as the new configuration and put it into the excluded configurations.
If the score is perfect (ie all of the 2-paths, 3-paths and 4-paths correspond), then stop the loop and calculate the sum of absolute differences between the distance matrices of the two graphs to pair using the selected pairing, otherwise go back to 4.
Stop this process after it has been done 10 times
Check the difference between distance matrices and take the pairings associated with the minimal sum of absolute differences between the distance matrices.
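The final selection step above can be sketched as follows. This is an illustration only, not the answer's actual code: the function names are hypothetical, distance matrices are assumed to be nested lists indexed by vertex number, and the sum runs over unordered vertex pairs (half of the full-matrix sum, which ranks pairings identically).

```python
def pairing_cost(dist1, dist2, pairing):
    """Sum of |d1(a, b) - d2(pairing[a], pairing[b])| over unordered vertex pairs."""
    verts = list(pairing)
    return sum(abs(dist1[a][b] - dist2[pairing[a]][pairing[b]])
               for i, a in enumerate(verts) for b in verts[i + 1:])


def best_pairing(dist1, dist2, candidates):
    """Pick the candidate pairing with the smallest distance-matrix mismatch."""
    return min(candidates, key=lambda p: pairing_cost(dist1, dist2, p))


# Graph 2 is graph 1 with vertices 0 and 2 exchanged and slightly perturbed distances.
dist1 = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.5], [2.0, 1.5, 0.0]]
dist2 = [[0.0, 1.4, 2.0], [1.4, 0.0, 1.1], [2.0, 1.1, 0.0]]
candidates = [{0: 0, 1: 1, 2: 2}, {0: 2, 1: 1, 2: 0}]
chosen = best_pairing(dist1, dist2, candidates)
```

Because the positions are only approximately equal, the cost is never expected to reach zero; only the ordering of candidates matters.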

Clustering on large, mixed type data

I'm dealing with a dataframe of dimension 4 million x 70. Most columns are numeric and some are categorical, with occasional missing values. It is essential that the clustering is run on all data points, and we are looking to produce around 400,000 clusters (so subsampling the dataset is not an option).
I have looked at using Gower's distance metric for mixed type data, but this produces a dissimilarity matrix of dimension 4 million x 4 million, which is just not feasible to work with since it has about 1.6 x 10^13 elements. So, the method needs to avoid dissimilarity matrices entirely.
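To make the scale concrete, here is the arithmetic as a short sketch. The 8 bytes per value is an assumption (a float64 entry); a narrower dtype would shrink the totals proportionally.

```python
n_rows = 4_000_000

n_elements = n_rows * n_rows            # full square dissimilarity matrix
n_unique = n_rows * (n_rows - 1) // 2   # upper triangle only (matrix is symmetric)

bytes_per_value = 8                     # assumption: one float64 per entry
full_tb = n_elements * bytes_per_value / 1e12
triangle_tb = n_unique * bytes_per_value / 1e12

print(f"{n_elements:.2e} elements, {full_tb:.0f} TB full, {triangle_tb:.0f} TB triangle")
```

Even storing only the upper triangle leaves tens of terabytes, so halving the matrix does not rescue the approach.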
Ideally, we would use an agglomerative clustering method, since we want a large number of clusters.
What would be a suitable method for this problem? I am struggling to find a method which meets all of these requirements, and I realise it's a big ask.
Plan B is to use a simple rules-based grouping method based on categorical variables alone, handpicking only a few variables to cluster on since we will suffer from the curse of dimensionality otherwise.
The first step is going to be turning those categorical values into numbers somehow, and the second step is going to be putting the now all numeric attributes into the same scale.
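Those two steps could look roughly like this. It is a pure-Python sketch for illustration only: the function name, the row-as-dict layout, and the handling of missing values (mean imputation for numerics, an all-zero block for categoricals) are assumptions, and at 4 million rows a dataframe or array library would be the practical choice.

```python
from statistics import mean, pstdev


def encode_and_scale(rows, numeric_cols, categorical_cols):
    """One-hot encode categorical columns and z-score numeric ones."""
    stats = {}
    for c in numeric_cols:
        values = [r[c] for r in rows if r[c] is not None]
        # Guard against a zero standard deviation for constant columns.
        stats[c] = (mean(values), pstdev(values) or 1.0)
    levels = {c: sorted({r[c] for r in rows if r[c] is not None})
              for c in categorical_cols}

    out = []
    for r in rows:
        vec = []
        for c in numeric_cols:
            m, s = stats[c]
            # Missing numeric values are imputed with the mean, i.e. 0 after scaling.
            vec.append(0.0 if r[c] is None else (r[c] - m) / s)
        for c in categorical_cols:
            # A missing category yields an all-zero block.
            vec.extend(1.0 if r[c] == level else 0.0 for level in levels[c])
        out.append(vec)
    return out


rows = [
    {"age": 20, "colour": "red"},
    {"age": 40, "colour": "blue"},
    {"age": None, "colour": "red"},
]
vectors = encode_and_scale(rows, ["age"], ["colour"])
```

One-hot columns take values 0 or 1 and are left unscaled here; whether to rescale them relative to the numeric columns is a modelling choice the sketch does not settle.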
Clustering is computationally expensive, so you might try a third step of representing this data by the top 10 components of a PCA (or however many components have an eigenvalue > 1) to reduce the columns.
For the clustering step, you'll have your choice of algorithms. I would think something hierarchical would be helpful for you: even though you expect a high number of clusters, it makes intuitive sense that those clusters would fall under larger clusters that continue to make sense all the way down to a small number of "parent" clusters. A popular choice might be HDBSCAN, but I tend to prefer trying OPTICS. The implementation in the free ELKI toolkit seems to be the fastest (it takes some messing around to figure out) because it runs in Java. The output of ELKI is a little strange: it writes one file per cluster, so you then have to use Python to loop through the files and create your final mapping, unfortunately. But it's all doable from Python (including executing the ELKI command) if you're building an automated pipeline.

Appropriate encoding using Particle Swarm Optimization

The Problem
I've been doing a bit of research on Particle Swarm Optimization, so I said I'd put it to the test.
The problem I'm trying to solve is the Balanced Partition Problem - or reduced simply to the Subset Sum Problem (where the sum is half of all the numbers).
It seems the generic formula for updating velocities for particles is [formula image not preserved], but I won't go into too much detail for this question.
Since there's no PSO attempt online for the Subset Sum Problem, I looked at the Travelling Salesman Problem instead.
Their approach for updating velocities involved taking sets of visited towns, subtracting one from another and doing some manipulation on that.
I saw no relation between that and the formula above.
My Approach
So I scrapped the formula and tried my own approach to the Subset Sum Problem.
I basically used gbest and pbest to determine the probability of removing or adding a particular element to the subset.
i.e - if my problem space is [1,2,3,4,5] (target is 7 or 8), and my current particle (subset) has [1,None,3,None,None], and the gbest is [None,2,3,None,None] then there is a higher probability of keeping 3, adding 2 and removing 1, based on gbest
I can post code but don't think it's necessary, you get the idea (I'm using python btw - hence None).
So basically, this worked to an extent, I got decent solutions out but it was very slow on larger data sets and values.
My Question
Am I encoding the problem and updating the particle "velocities" in a smart way?
Is there a way to determine if this will converge correctly?
Is there a resource I can use to learn how to create convergent "update" formulas for specific problem spaces?
Thanks a lot in advance!
Encoding
Yes, you're encoding this correctly: each of your bit-maps (that's effectively what your 5-element lists are) is a particle.
Concept
Your conceptual problem with the equation is because your problem space is a discrete lattice graph, which doesn't lend itself immediately to the update step. For instance, if you want to get a finer granularity by adjusting your learning rate, you'd generally reduce it by some small factor (say, 3). In this space, what does it mean to take steps only 1/3 as large? That's why you have problems.
The main possibility I see is to create 3x as many particles, but then have the transition probabilities all divided by 3. This still doesn't satisfy very well, but it does simulate the process somewhat decently.
Discrete Steps
If you have a very large graph, where a high velocity could give you dozens of transitions in one step, you can utilize a smoother distance (loss or error) function to guide your model. With something this small, where you have no more than 5 steps between any two positions, it's hard to work with such a concept.
Instead, you utilize an error function based on the estimated distance to the solution. The easy one is to subtract the particle's total from the nearer of 7 or 8. A harder one is to estimate distance based on that difference and the particle elements "in play".
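The "easy" error function can be written in a few lines. This is a sketch with a made-up function name; it represents a particle as a list of booleans (True where the question's lists hold a value, False where they hold None) and assumes integer values.

```python
def subset_error(values, mask):
    """Distance of the selected subset's sum from the nearer half-total target."""
    total = sum(values)
    subset_sum = sum(v for v, keep in zip(values, mask) if keep)
    # For an odd total the two targets are floor(total / 2) and ceil(total / 2);
    # for an even total they coincide.
    low, high = total // 2, (total + 1) // 2
    return min(abs(subset_sum - low), abs(subset_sum - high))


values = [1, 2, 3, 4, 5]                       # targets are 7 and 8
particle = [True, False, True, False, False]   # [1, None, 3, None, None]
gbest = [False, True, True, False, False]      # [None, 2, 3, None, None]
```

With these inputs the particle sums to 4 and gbest to 5, so gbest is one unit closer to a target than the particle is.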
Proof of Convergence
Yes, there is a way to do it, but it requires some functional analysis. In general, you want to demonstrate that the error function is convex over the particle space. In other words, you'd have to prove that your error function is a reliable distance metric, at least as far as relative placement goes (i.e. prove that a lower error does imply you're closer to a solution).
Creating update formulae
No, this is a heuristic field, based on shape of the problem space as defined by the particle coordinates, the error function, and the movement characteristics.
Extra recommendation
Your current allowable transitions are "add" and "delete" element.
Include "swap elements" to this: trade one present member for an absent one. This will allow the trivial error function to define a convex space for you, and you'll converge in very little time.
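To illustrate the three move types, here is a small greedy descent over add, delete, and swap transitions. Note that this is a plain local search, not particle swarm optimization, and it does not verify the convexity claim; the function names and the boolean-mask encoding are assumptions made for the sketch.

```python
def error(values, mask):
    """Distance of the subset sum from the nearer half-total target."""
    total = sum(values)
    s = sum(v for v, keep in zip(values, mask) if keep)
    return min(abs(s - total // 2), abs(s - (total + 1) // 2))


def neighbours(mask):
    """Yield every mask reachable by one add, one delete, or one swap."""
    n = len(mask)
    for i in range(n):                      # add or delete element i
        flipped = list(mask)
        flipped[i] = not flipped[i]
        yield flipped
    for i in range(n):                      # swap: present i leaves, absent j enters
        for j in range(n):
            if mask[i] and not mask[j]:
                swapped = list(mask)
                swapped[i], swapped[j] = False, True
                yield swapped


def descend(values, mask):
    """Move to the best neighbour until no neighbour lowers the error."""
    current = list(mask)
    while True:
        best = min(neighbours(current), key=lambda m: error(values, m))
        if error(values, best) >= error(values, current):
            return current
        current = best


values = [1, 2, 3, 4, 5]
start = [True, False, True, False, False]   # the subset {1, 3}
result = descend(values, start)
```

In a PSO setting the same neighbour generator could supply the candidate transitions, with pbest and gbest biasing which one is chosen.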

Half-integer indices

I am working on a fluid dynamics simulation tool in Python. Traditionally (at least in my field), integer indices refer to the center of a cell. Some quantities are stored on the faces between cells, and in the literature are denoted by half-integer indices. In codes, however, these are shifted to integers to fit into arrays. The problem is, the shift is not always consistent: do you add a half or subtract? If you switch between codes enough, you quickly lose track of the conventions for any particular code. And honestly, there are enough quirky conventions in each code that eliminating a few would be a big help... but not if the remedy is worse than the original problem.
I have thought about using even indices for cell centers and odd for faces, but that is counterintuitive. Also, it's rare for a quantity to exist on both faces and in cell centers, so you never use half of your indices. You could also implement functions plus_half(i) and minus_half(i), but that gets verbose and inelegant (at least in my opinion). And of course floating point comparisons are always problematic in case someone gets cute in how they calculate 1/2.
Does anyone have a suggestion for an elegant way to implement half-integer indices in Python? I'm sure I'm not the first person to wish for this, but I've never seen it done in a simple way that is obvious to a user (without requiring the user to memorize the shift convention you've chosen).
And just to clarify: I assume there is likely to be a remap step hidden from the user to get to integer indices (I intend to wrap NumPy arrays for my data grids). I'm thinking of the interface exposed to the user, rather than how the data is stored.
You can make __getitem__ which takes arbitrary objects as indices (and floating point numbers in particular).
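A minimal sketch of that idea follows. The class name `FaceArray` and the storage layout are assumptions, and a plain list stands in for the wrapped NumPy array. Converting the index through `fractions.Fraction` keeps the half-integer check exact, which sidesteps the floating point comparison worry as long as the caller passes a value that is exactly a half-integer.

```python
from fractions import Fraction


class FaceArray:
    """Values stored on cell faces, addressed by half-integer index."""

    def __init__(self, n_cells):
        # n_cells cells have n_cells + 1 faces: -1/2, 1/2, ..., n_cells - 1/2.
        self._data = [0.0] * (n_cells + 1)

    def _slot(self, index):
        doubled = Fraction(index) * 2
        # A half-integer doubles to an odd whole number.
        if doubled.denominator != 1 or doubled.numerator % 2 == 0:
            raise IndexError(f"{index!r} is not a half-integer face index")
        slot = (doubled.numerator + 1) // 2
        if not 0 <= slot < len(self._data):
            raise IndexError(f"face index {index!r} out of range")
        return slot

    def __getitem__(self, index):
        return self._data[self._slot(index)]

    def __setitem__(self, index, value):
        self._data[self._slot(index)] = value


flux = FaceArray(4)
flux[0.5] = 3.0                # face between cells 0 and 1
flux[Fraction(3, 2)] = 7.0     # face between cells 1 and 2
```

The user writes `flux[i + 0.5]` or `flux[i - 0.5]` exactly as in the literature, and an accidental integer index raises an error instead of silently reading the wrong face.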

A container for accessing contents by 2d/3d coordinates

There are a lot of games that can generally be viewed as a bunch of objects spread out through space, and a very common operation is to pick all objects in a sub-area. The typical example would be a game with tons of units across a large map, and an explosion that affects units in a certain radius. This requires picking every unit in the radius in order to apply the effects of the explosion.
Now, there are several ways to store objects that allow efficiently picking a sub-area. The easiest method is probably to divide the map into a grid; picking units in an area would involve selecting only the parts of the grid that are affected, and doing a fine-grained coordinate check on grid tiles that aren't 100% inside the area.
What I don't like about this approach is answering "How large should the grid tiles be?" Too large, and efficiency may become a real problem. Too small, and the grid takes up tons of memory if the game world is large enough (and can become ridiculous if the game is 3d). There may not even be a suitable golden mean.
The obvious solution to the above is to make a large grid with some kind of intelligent subdivision, like a pseudo tree-structure. And it is at this point I know for sure I am far into premature optimization. (Then there are proper dynamic quad/octrees, but that's even more complex to code and I'm not even confident it will perform any better.)
So my question is: Is there a standard solution to the above problem? Something along the lines of an STL container that can just store any object with a coordinate and retrieve a list of objects within a certain area? It doesn't have to be different from what I described above, as long as it's something that has been thought out and deemed "good enough" for a start.
Bonus points if there is an implementation of the algorithm in Python, but C would also do.
The first step to writing a practical program is accepting that choices for some constants come from real-world considerations and not transcendent mathematical truths. This especially applies to game design/world simulation type coding, where you'd never get anywhere if you persisted in trying to optimally model the real world. :-)
If your objects will all be of fairly uniform size, I would just choose a grid size proportional to the average object size, and go with that. It's the simplest - and keep in mind simplicity will buy you some speed even if you end up searching over a few more objects than absolutely necessary!
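A fixed-size grid of this kind fits in a few lines if the tiles are kept in a dictionary, so that empty tiles cost no memory. This is a sketch, not a tuned implementation; the class name `GridIndex` and its methods are made up for illustration, and it handles 2D points only.

```python
import math
from collections import defaultdict


class GridIndex:
    """Uniform grid over 2D points; only occupied tiles are stored."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.buckets = defaultdict(list)

    def _key(self, x, y):
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def insert(self, obj, x, y):
        self.buckets[self._key(x, y)].append((x, y, obj))

    def query_radius(self, cx, cy, r):
        """Return all objects within distance r of (cx, cy)."""
        i0, j0 = self._key(cx - r, cy - r)
        i1, j1 = self._key(cx + r, cy + r)
        hits = []
        for i in range(i0, i1 + 1):
            for j in range(j0, j1 + 1):
                # .get avoids creating empty buckets during queries.
                for x, y, obj in self.buckets.get((i, j), ()):
                    if (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                        hits.append(obj)
        return hits


index = GridIndex(cell_size=2.0)
index.insert("a", 1.0, 1.0)
index.insert("b", 5.0, 5.0)
index.insert("c", 2.5, 1.0)
near = sorted(index.query_radius(1.0, 1.0, 2.0))
```

The sketch runs the exact distance check on every candidate tile rather than skipping it for tiles fully inside the circle; that optimisation can be added later if profiling calls for it.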
Things get a bit harder if your objects vary greatly in size - for example if you're trying to use the same engine to deal with bullets, mice, humans, giant monsters, vehicles, asteroids, planets, etc. If that's the case, a commonly accepted (but ugly) approach is to have different 'modes' of play depending on the type of situation you're in. Short of that, one idea might be to use a large grid with a binary-tree subdivision of grid cells after they accumulate too many small objects.
One aside: if you're using floating point coordinates, you need to be careful with precision and rounding issues for your grid size, since points close to the origin have a lot more precision than those far away, which could lead to errors where grid cells miss some objects.
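The precision point is easy to check directly. The snippet below uses `math.ulp` (available from Python 3.9) to show the spacing between adjacent double-precision values at a few magnitudes; the sample coordinates are arbitrary.

```python
import math

# Gap between a coordinate and the next representable float at that magnitude.
spacing = {x: math.ulp(x) for x in (1.0, 1e6, 1e12)}

for x, gap in spacing.items():
    print(f"coordinate {x:g}: adjacent floats differ by {gap:g}")
```

The gap grows with the magnitude of the coordinate, so a grid cell size that is comfortably resolvable near the origin may approach the representable spacing far from it.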
Here is a free book available online that will answer your question.
Specifically look at Chapter 18 on collision detection and intersection.
I don't know anything about games programming, but I would imagine (based on intuition and what I've read in the past) that a complete grid will get very inefficient for large spaces; you'll lose out in both storage, and also in time, because you'll melt the cache.
STL containers are fundamentally one-dimensional. Yes, things like set and map allow you to define arbitrary sort relationships, but it's still ordered in only one dimension. If you want to do better, you'll probably need to use a quad-tree, a kd-tree, or something like that.
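For comparison with a fixed grid, here is a compact point quadtree with a rectangular range query. Treat it as a sketch under stated assumptions: all points must lie inside the root square, the class and parameter names are invented for illustration, and `min_size` exists only to stop endless splitting when many points coincide.

```python
class QuadTree:
    """Point quadtree over the square [x0, x0 + size] x [y0, y0 + size]."""

    def __init__(self, x0, y0, size, capacity=4, min_size=1e-6):
        self.x0, self.y0, self.size = x0, y0, size
        self.capacity, self.min_size = capacity, min_size
        self.points = []       # (x, y, obj) tuples held while this node is a leaf
        self.children = None   # four sub-trees once the leaf has split

    def insert(self, x, y, obj):
        if self.children is not None:
            self._child_for(x, y).insert(x, y, obj)
            return
        self.points.append((x, y, obj))
        if len(self.points) > self.capacity and self.size > self.min_size:
            self._split()

    def _split(self):
        half = self.size / 2
        # Child order matches _child_for: index = dx + 2 * dy.
        self.children = [
            QuadTree(self.x0 + dx * half, self.y0 + dy * half, half,
                     self.capacity, self.min_size)
            for dy in (0, 1) for dx in (0, 1)
        ]
        points, self.points = self.points, []
        for x, y, obj in points:
            self._child_for(x, y).insert(x, y, obj)

    def _child_for(self, x, y):
        half = self.size / 2
        return self.children[(x >= self.x0 + half) + 2 * (y >= self.y0 + half)]

    def query(self, qx0, qy0, qx1, qy1):
        """Return objects with qx0 <= x <= qx1 and qy0 <= y <= qy1."""
        if (qx1 < self.x0 or qx0 > self.x0 + self.size or
                qy1 < self.y0 or qy0 > self.y0 + self.size):
            return []  # query rectangle misses this node entirely
        if self.children is None:
            return [o for x, y, o in self.points
                    if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        return [o for c in self.children for o in c.query(qx0, qy0, qx1, qy1)]


tree = QuadTree(0.0, 0.0, 100.0, capacity=2)
for name, (px, py) in {"a": (10, 10), "b": (12, 11), "c": (80, 80),
                       "d": (50, 50), "e": (11, 12)}.items():
    tree.insert(px, py, name)
```

The tree subdivides only where points cluster, which is the property the question was after; whether it beats a well-chosen grid depends on how unevenly the objects are distributed.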
