Making a Marginal Distribution from a Joint Distribution in Python

I have three arrays: X values, Y values, and probabilities. I'm trying to do two things, but I imagine they're practically the same coding-wise.
I want to find all the X values that are the same and add up the corresponding probabilities into another array. (So if my X values are [3,7,4,7] and my probabilities are [.2,.3,.1,.4], I would want to add .3 and .4 together.) I'm trying to do this with a loop, but because I only picked up Python two weeks ago, I'm struggling.
My thought process that I want to try:
MargX = np.unique(X)
MargXp = np.zeros(len(MargX))
for Ind in range(len(MargX)):
    ?
(Here I want to take the values in my X array that are equal, grab the corresponding value from my p array, and then add them into my zero array MargXp)
I've tried a couple of different ways to set up my loop so that it would add the values into the zero arrays that I made, but to no avail because I keep getting syntax errors and various other things.

If you compact the X's down with unique, finding the corresponding probabilities means searching the X array for the matching indices and using those to look up the probabilities. I think you'd be happier using Python's dictionary to associate keys (x-values) with values (probabilities). A defaultdict lets you specify a default value for keys that aren't already in the dictionary. In this case, start with the idea that an arbitrary x-value has zero probability. As you iterate through the x/probability pairs, you can use += to add each probability to the stored or default value associated with that x.
The result looks something like this:
from collections import defaultdict
# synchronized arrays from your example.
x = [3, 7, 4, 7]
probs = [0.2, 0.3, 0.1, 0.4]
marginal = defaultdict(lambda: 0.0)  # 0.0 is the default value
for key, value in zip(x, probs):  # zip combines the arrays as pairs
    marginal[key] += value  # increment to accumulate total probability
# The following line is not strictly needed since all values are
# in the dictionary, but by default key values are not ordered.
orderedkeys = sorted(marginal.keys())
for key in orderedkeys:
    print(key, marginal[key])
which produces:
3 0.2
4 0.1
7 0.7
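Since the question starts from np.unique, here is a minimal sketch of the same accumulation done with NumPy alone. It assumes x and the probabilities are one-dimensional arrays of equal length; the function name marginal_from_joint is made up for illustration.

```python
import numpy as np

def marginal_from_joint(x, probs):
    """Sum the probabilities that share the same x-value (hypothetical helper)."""
    x = np.asarray(x)
    probs = np.asarray(probs, dtype=float)
    # marg_x: sorted distinct x-values; inverse[i]: position of x[i] within marg_x
    marg_x, inverse = np.unique(x, return_inverse=True)
    # bincount with weights adds up the probabilities that land on the same position
    marg_xp = np.bincount(inverse, weights=probs)
    return marg_x, marg_xp

marg_x, marg_xp = marginal_from_joint([3, 7, 4, 7], [0.2, 0.3, 0.1, 0.4])
for value, p in zip(marg_x, marg_xp):
    print(value, p)
```

The marginal over Y follows by passing the Y array in place of x.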

Related

check if subarray is in array of arrays

I've got an array of arrays where I store x,y,z coordinates and a measurement at that coordinate like:
measurements = [[x1,y1,z1,val1],[x2,y2,z2,val2],[...]]
Now, before adding a measurement for a certain coordinate, I want to check whether there is already a measurement for that coordinate, so that I keep only the measurement with the maximum val.
So the question is:
Is [xn, yn, zn, ...] already in measurements?
My approach so far would be to iterate over the array and compare with a sliced entry, like
for measurement in measurements:
    if measurement_new[:3] == measurement[:3]:
        measurement[3] = measurement_new[3] if measurement_new[3] > measurement[3] else measurement[3]
But with the measurements array getting bigger, this is very inefficient.
Another approach would be two separate arrays, coords = [[x1,y1,z1], [x2,y2,z2], [...]] and vals = [val1, val2, ...]
This would allow checking for existing coordinates efficiently with [x,y,z] in coords, but I would have to merge the arrays later on.
Can you suggest a more efficient method for solving this problem?
If you want to stick to built-in types (if not, see the last point in the Notes below), I suggest using a dict for the measurements:
measurements = {(x1,y1,z1): val1,
                (x2,y2,z2): val2}
Then adding a new value (x,y,z,val) can simply be:
measurements[(x,y,z)] = max(measurements.get((x,y,z), 0), val)
Notes:
The value 0 in measurements.get is supposed to be the lower bound of the values you are expecting. If you have values below 0, change it to an appropriate lower bound, so that whenever (x,y,z) is not present in measurements, get returns the lower bound and max therefore returns val. You can also avoid having to specify a lower bound and write:
measurements[(x,y,z)] = max(measurements.get((x,y,z), val), val)
You need to use a tuple as the type for your keys, hence the (x,y,z). This is because lists cannot be hashed and so are not permitted as keys.
Finally, depending on the complexity of the task you are performing, consider using more complex data types. I would recommend having a look at pandas DataFrames; they are ideal for this kind of thing.
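To make the dict approach concrete, here is a small runnable sketch. The helper name add_measurement and the sample coordinates are invented for illustration; the update line is the lower-bound-free variant from the notes above.

```python
def add_measurement(measurements, x, y, z, val):
    """Keep only the largest val seen so far for the coordinate (x, y, z)."""
    key = (x, y, z)  # tuples are hashable, so they can serve as dict keys
    # get() falls back to val itself when the key is new, so no lower bound is needed
    measurements[key] = max(measurements.get(key, val), val)

measurements = {}
samples = [(0, 0, 1, 2.5), (1, 0, 0, -4.0), (0, 0, 1, 3.5), (1, 0, 0, -7.0)]
for x, y, z, val in samples:
    add_measurement(measurements, x, y, z, val)

print(measurements)
```

Each insertion is an average constant-time hash lookup, so the cost no longer grows with the number of stored measurements.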

Finding mode in np.array 1d and get the first one

I want to find the mode of a NumPy array and, when there is a tie, get the first one.
For example,
in [1,2,2,3,3,4] there are two modes (most frequently appearing values): 2 and 3.
However, in this case I want to get the leftmost one, 2.
There are some examples of getting the mode with numpy, scipy, or statistics.
My array is a NumPy array, so it would be simplest if I could do this with numpy alone.
How can I do this?
Have you had a look at collections.Counter?
import numpy as np
import collections
x = np.array([1, 2, 3, 2, 4, 3])
c = collections.Counter(x)
largest_num, count = c.most_common(1)[0]
The documentation of Counter.most_common states:
Elements with equal counts are ordered in the order first encountered
If you want to use only numpy, you could try this, based on the scipy mode implementation:
a = np.array([1,2,2,3,3,4])
# Unique values, in sorted order (a plain set has no guaranteed iteration order)
scores = np.unique(a)
# Retrieve keys, counts
keys, counts = zip(*[(score, np.sum(a == score)) for score in scores])
# Key with the maximum count; ties resolve to the smallest value,
# which is also the leftmost one when the array is sorted as it is here
keys[np.argmax(counts)]
>>> 2
where the argmax documentation states:
"In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned."

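If the array is not sorted, the smallest mode and the leftmost mode can differ. The following is a sketch, not taken from either answer, that picks the leftmost one with numpy only; the function name first_mode is made up, and it relies on np.unique reporting the index of each value's first occurrence.

```python
import numpy as np

def first_mode(a):
    """Return the most frequent value; ties go to the value that appears first."""
    a = np.asarray(a).ravel()
    values, first_index, counts = np.unique(a, return_index=True, return_counts=True)
    # Candidates: every value whose count ties for the maximum
    tied = counts == counts.max()
    # Among the candidates, take the one whose first occurrence is earliest
    return values[tied][np.argmin(first_index[tied])]

print(first_mode([1, 2, 2, 3, 3, 4]))
print(first_mode([3, 3, 1, 2, 2, 4]))
```
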
Selecting randomly from two arrays based upon condition in Python

Suppose I have two arrays of equal length:
a = [0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1]
b = [0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0]
Now I want to build a new array of the same length as a and b by randomly choosing, at each position, between the value in a and the value in b, so that the overall ratio is a:b = 1:4.68, i.e. for every 1 value picked from a there should be 4.68 values picked from b in the resulting array.
So effectively the resulting array could be something like:
res = [0,1,1,0,1, 1(from a) ,0(from a),1,1,0,0,1,1,0, 0(from a),0,0]
In res, the first 5 values are from b, the 6th and 7th from a, the 8th through 14th from b, the 15th from a, and the 16th and 17th from b.
The overall ratio of values from b to values from a in this example is 4.67 (from a = 3, from b = 14).
Thus the values have to be chosen at random between the two arrays, but the sequence must be maintained, i.e. I cannot take the 7th value from one array and the 3rd value from the other. If the 3rd value of the resulting array is being filled, the choice is at random between the 3rd elements of the two input arrays. The overall ratio needs to be maintained as well.
Can you please help me develop an efficient, Pythonic way of reaching this result? The solution need not produce the same values on every run.
Borrowing the a_count calculation from Barmar's answer (because it seems to work and I can't be bothered to reinvent it), this solution preserves the ordering of the values chosen from a and b:
from future_builtins import zip # Only on Python 2, to avoid temporary list of tuples
import random
# int() unnecessary on Python 3
a_count = int(round(1/(1 + 4.68) * len(a)))
# Use range on Python 3, xrange on Python 2, to avoid making actual list
a_indices = frozenset(random.sample(xrange(len(a)), a_count))
res = [aval if i in a_indices else bval for i, (aval, bval) in enumerate(zip(a, b))]
The basic idea here is that you determine how many a values you need, get a unique sample of the possible indices of that size, then iterate a and b in parallel, keeping the a value for the selected indices, and the b value for all others.
If you don't like the complexity of the list comprehension, you could use a different approach, copying b, then filling in the a values one by one:
res = b[:] # Copy b in its entirety
# Replace selected indices with a values
# No need to convert to frozenset for efficiency here, and it's clean
# enough to just iterate the sample directly without storing it
for i in random.sample(xrange(len(a)), a_count):
    res[i] = a[i]
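For readers on Python 3 only, the same idea can be wrapped in a small function. This is a sketch under assumptions: the name mix_in_order and the parameter b_per_a (how many b values per a value) are invented here, and the inputs are lists of labels so that the origin of each element is visible.

```python
import random

def mix_in_order(a, b, b_per_a=4.68, rng=random):
    """Take roughly 1 value from a for every b_per_a values from b, position by position."""
    # Number of positions that should take their value from a
    a_count = round(len(a) / (1 + b_per_a))
    # Distinct positions, sampled without replacement
    a_indices = set(rng.sample(range(len(a)), a_count))
    return [aval if i in a_indices else bval
            for i, (aval, bval) in enumerate(zip(a, b))]

res = mix_in_order(["a"] * 17, ["b"] * 17)
print(res.count("a"), res.count("b"))
```

Passing a seeded random.Random instance as rng makes a run reproducible, which can help when checking the ratio.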
I believe this should work. You specify how many values you want from a (you can simply use your ratio to figure out that number), randomly generate a 'mask' of numbers, and choose from a or b based on a cutoff (notice that you only sort to figure out the cutoff, but you use the unsorted mask later):
import numpy as np
a = [0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1]
b = [0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0]
mask = np.random.random(len(a))
from_a = 3
# The (from_a + 1)-th smallest mask value; barring ties, exactly from_a entries lie below it
cutoff = np.sort(mask)[from_a]
res = []
for i in range(len(a)):
    if mask[i] < cutoff:
        res.append(a[i])
    else:
        res.append(b[i])
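The same selection can be sketched without an explicit loop or a cutoff by drawing the positions directly. This variant is an assumption on my part rather than part of the answer; it uses NumPy's Generator.choice with replace=False so that the from_a positions are distinct.

```python
import numpy as np

a = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1])
b = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0])
from_a = 3

rng = np.random.default_rng()
# from_a distinct positions, drawn without replacement
idx = rng.choice(len(a), size=from_a, replace=False)

res = b.copy()     # start from b everywhere
res[idx] = a[idx]  # overwrite the chosen positions with the values from a
print(sorted(idx.tolist()), res.tolist())
```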

Finding the closest possible values from two dictionaries

Let's suppose you have two existing dictionaries, A and B.
If you have already chosen an initial item from each of the dictionaries A and B, with values A1 = 1.0 and B1 = 2.0 respectively, is there any way to find two other existing items in A and B whose values (A2 and B2) differ from A1 and B1 and that minimize (A2-A1)**2 + (B2-B1)**2?
The number of items in the dictionaries is not fixed and could exceed 100,000.
Edit - This is important: the keys for A and B are the same, but the values corresponding to those keys in A and B are different. A particular choice of key yields an ordered pair (A1,B1) that is different from any other possible ordered pair (A2,B2); different keys have different ordered pairs. For example, both A and B will have the key (3,4), and this will yield a value of 1.0 for dict A and 2.0 for B. This one key will then be compared to every other possible key to find the other ordered pair (i.e. both the key and the values of the items in A and B) that minimizes the squared difference between them.
You'll need a specialized data structure, not a standard Python dictionary. Look up quad-tree or kd-tree. You are effectively minimizing the Euclidean distance between two points (your objective function is just a square root away from Euclidean distance, and your dictionary A is storing x-coordinates, B y-coordinates.). Computational-geometry people have been studying this for years.
Well, maybe I am misreading your question and making it harder than it is. Are you saying that you can pick any value from A and any value from B, regardless of whether their keys are the same? For instance, the pick from A could be K:V (3,4):2.0, and the pick from B could be (5,6):3.0? Or does it have to be (3,4):2.0 from A and (3,4):6.0 from B? If the former, the problem is easy: just run through the values from A and find the closest to A1; then run through the values from B and find the closest to B1. If the latter, my first paragraph was the right answer.
Your comment says that the harder problem is the one you want to solve, so here is a little more. Sedgewick's slides explain how the static grid, the 2d-tree, and the quad-tree work. http://algs4.cs.princeton.edu/lectures/99GeometricSearch.pdf . Slides 15 through 29 explain mainly the 2d-tree, with 27 through 29 covering the solution to the nearest-neighbor problem. Since you have the constraint that the point the algorithm finds must share neither x- nor y-coordinate with the query point, you might have to implement the algorithm yourself or modify an existing implementation. One alternative strategy is to use a kNN data structure (k nearest neighbors, as opposed to the single nearest neighbor), experiment with k, and hope that your chosen k will always be large enough to find at least one neighbor that meets your constraint.
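Before building a tree, a plain linear scan gives a baseline to check any faster implementation against. The sketch below is not a kd-tree; it simply visits every key once, applies the constraint as read above (the result must differ from the query in both values), and keeps the smallest squared distance. The function name closest_other_key and the sample data are made up.

```python
def closest_other_key(A, B, query_key):
    """Linear scan for the key whose (A, B) pair is nearest to the query's pair."""
    a1, b1 = A[query_key], B[query_key]
    best_key, best_dist = None, float("inf")
    for key in A:
        a2, b2 = A[key], B[key]
        # Skip the query itself and any point sharing either value with it
        if key == query_key or a2 == a1 or b2 == b1:
            continue
        dist = (a2 - a1) ** 2 + (b2 - b1) ** 2
        if dist < best_dist:
            best_key, best_dist = key, dist
    return best_key, best_dist

A = {(3, 4): 1.0, (5, 6): 1.0, (7, 8): 1.5, (9, 10): 4.0}
B = {(3, 4): 2.0, (5, 6): 2.1, (7, 8): 2.5, (9, 10): 2.0}
print(closest_other_key(A, B, (3, 4)))
```

One scan is linear in the number of keys, which is likely acceptable for a single query over 100,000 items; the tree structures pay off when many queries are made against the same data.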

Finding unique maximum values in a list using python

I have a list of points, as shown below:
points=[ [x0,y0,v0], [x1,y1,v1], [x2,y2,v2].......... [xn,yn,vn]]
Some of the points have duplicate x,y values. What I want to do is extract, for each unique x,y, the point with the maximum value.
For example, if I have the points [1,2,5] [1,1,3] [1,2,7] [1,7,3],
I would like to obtain the list [1,1,3] [1,2,7] [1,7,3].
How can I do this in Python?
Thanks
For example:
import itertools
import operator
def getxy(point): return point[:2]
sortedpoints = sorted(points, key=getxy)
results = []
for xy, g in itertools.groupby(sortedpoints, key=getxy):
    results.append(max(g, key=operator.itemgetter(2)))
that is: sort and group the points by xy, for every group with fixed xy pick the point with the maximum z. Seems straightforward if you're comfortable with itertools (and you should be, it's really a very powerful and useful module!).
Alternatively you could build a dict with (x,y) tuples as keys and lists of z as values and do one last pass on that one to pick the max z for each (x, y), but I think the sort-and-group approach is preferable (unless you have many millions of points so that the big-O performance of sorting worries you for scalability purposes, I guess).
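The dict-of-lists alternative described in the last paragraph could look roughly like this. It is a sketch using the sample points from the question; the variable name zs_by_xy is invented.

```python
from collections import defaultdict

points = [[1, 2, 5], [1, 1, 3], [1, 2, 7], [1, 7, 3]]

# First pass: collect every z seen for each (x, y)
zs_by_xy = defaultdict(list)
for x, y, z in points:
    zs_by_xy[(x, y)].append(z)

# Last pass: keep the maximum z for each (x, y)
results = [[x, y, max(zs)] for (x, y), zs in zs_by_xy.items()]
print(sorted(results))
```

This avoids the sort entirely, at the cost of holding every z value in memory until the final pass.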
You can use a dict to achieve this, using the property that "If a given key is seen more than once, the last value associated with it is retained in the new dictionary." This code sorts the points to make sure that the highest values come later, creates a dictionary whose keys are tuples of the first two values and whose values are the third coordinate, then translates that back into a list:
points = [[1,2,5], [1,1,3], [1,2,7], [1,7,3]]
sp = sorted(points)
d = dict(((a, b), c) for (a, b, c) in sp)
results = [list(k) + [v] for (k, v) in d.items()]
There may be a way to further improve that, but it satisfies all your requirements.
If I understand your question, maybe use a dictionary to map (x,y) to the max z,
something like this (not tested):
best = {}
for x, y, z in points:
    if (x, y) in best:
        best[(x, y)] = max(best[(x, y)], z)
    else:
        best[(x, y)] = z
Though the original ordering of the points may be lost.
