check if subarray is in array of arrays - python

I've got an array of arrays where I store x,y,z coordinates and a measurement at that coordinate like:
measurements = [[x1,y1,z1,val1],[x2,y2,z2,val2],[...]]
Now before adding a measurement for a certain coordinate I want to check if there is already a measurement for that coordinate, so that I keep only the measurement with the maximum val.
So the question is:
Is [xn, yn, zn, ...] already in measurements?
My approach so far would be to iterate over the array and compare with a sliced entry like
for measurement in measurements:
    if measurement_new[:3] == measurement[:3]:
        measurement[3] = measurement_new[3] if measurement_new[3] > measurement[3] else measurement[3]
But with the measurements array getting bigger this is very inefficient.
Another approach would be two separate arrays coords = [[x1,y1,z1], [x2,y2,z2], [...]] and vals = [val1, val2, ...]
This would allow me to check for existing coordinates efficiently with [x,y,z] in coords, but I would have to merge the arrays later on.
Can you suggest a more efficient method for solving this problem?

If you want to stick to built-in types (if not, see the last point in the Notes below), I suggest using a dict for the measurements:
measurements = {(x1,y1,z1): val1,
                (x2,y2,z2): val2}
Then adding a new value (x,y,z,val) can simply be:
measurements[(x,y,z)] = max(measurements.get((x,y,z), 0), val)
Notes:
The value 0 in measurements.get is supposed to be the lower bound of the values you are expecting. If you have values below 0, change it to an appropriate lower bound, so that whenever (x,y,z) is not present in measurements, get returns the lower bound and max therefore returns val. You can also avoid having to specify the lower bound and write:
measurements[(x,y,z)] = max(measurements.get((x,y,z), val), val)
You need to use a tuple as the type for your keys, hence the (x,y,z). This is because lists cannot be hashed and so are not permitted as keys.
Finally, depending on the complexity of the task you are performing, consider using more complex data types. I would recommend having a look at pandas DataFrames; they are well suited to this kind of task.
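To make the dict approach concrete, here is a minimal sketch. The helper name add_measurement and the sample numbers are made up for illustration; only the tuple-keyed dict itself comes from the answer above. It avoids the lower-bound question by checking whether the key exists first.

```python
# Tuple-keyed dict: (x, y, z) -> largest val seen so far
measurements = {}

def add_measurement(x, y, z, val):
    """Store val for (x, y, z) unless a larger value is already stored."""
    key = (x, y, z)  # tuples are hashable, lists are not
    if key not in measurements or val > measurements[key]:
        measurements[key] = val

# Hypothetical sample data
add_measurement(1, 2, 3, 0.5)
add_measurement(1, 2, 3, 0.9)   # replaces 0.5
add_measurement(1, 2, 3, 0.2)   # ignored, smaller than 0.9
add_measurement(4, 5, 6, -1.0)  # negative values need no special lower bound

# Convert back to the original [x, y, z, val] row layout if needed
as_rows = [[x, y, z, val] for (x, y, z), val in measurements.items()]
print(as_rows)
```

Each insert is a single hash lookup on average, so the cost no longer grows with the number of stored measurements the way the loop over the list does.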

Related

Can anyone explain why the maximum value of a concatenation between two arrays is so much higher than the max value in either single array?

I have the following two datasets - both from netCDF files:
ds1 = observed_1979_01
ds2 = observed_1979_02
I want to extract the variable labelled 'swvl1' from both datasets, and I do this by:
m = ds1.variables['swvl1'][0,:,:]
n = ds2.variables['swvl1'][0,:,:]
I want to concatenate these two arrays, which I do using np.dstack (though the same problem outlined here occurs with np.concatenate as well), like this:
d = np.dstack((m,n))
Now if I look at the maximum value in either array, I get that:
max_m = 0.76293164
max_n = 0.76335037
However, the max value of the concatenated arrays is:
max_d = 9.96921e+36
Why is this happening? I believe something must be going massively wrong in the concatenating of the two arrays to give a different maximum value, but I can't figure out what it is. Does anyone have any ideas?
The maximum value 9.96921e+36 is identical to the default _FillValue, which could indicate that your arrays contain uninitialized values before (and after) they are concatenated. Be sure all values are initialized to valid values before computing the maximum, and/or give the routine that computes the maximum the value 9.96921e+36 as the missing value to ignore.
Responding to question in comment below:
Yes. Uninitialized in this context means that the variable was defined and space allocated on disk to hold its values, however, no values were ever written. By default in netCDF, unwritten values appear as 9.96921e+36 when read.
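As a rough sketch of the "give the routine the missing value to ignore" suggestion, one option is to mask the fill value with NumPy before taking the maximum. The small arrays below are invented stand-ins for m and n (no netCDF file is read), and FILL_VALUE is simply the number quoted above.

```python
import numpy as np

FILL_VALUE = 9.96921e+36  # fill value as quoted in the answer above

# Hypothetical stand-ins for m and n, each with one unwritten cell
m = np.array([[0.1, FILL_VALUE], [0.5, 0.76293164]])
n = np.array([[0.2, 0.76335037], [FILL_VALUE, 0.3]])

d = np.dstack((m, n))  # shape (2, 2, 2)

# Mask every entry that is (approximately) equal to the fill value
masked = np.ma.masked_values(d, FILL_VALUE)

print(d.max())       # dominated by the fill value
print(masked.max())  # maximum over the valid entries only
```

np.ma.masked_values compares with a tolerance, which is safer than an exact == test when the data were stored as 32-bit floats.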

Data structure for a diamond-shaped array in python

I have two arrays that are related to each other via a mapping operation. I will call them S(fk,fq) and Z(fi,αj). The arguments are all sampling frequencies. The mapping rule is fairly straightforward:
fi = 0.5 · (fk - fq)
αj = fk + fq
S is the result of several FFTs and complex multiplications and is defined on a rectangular grid. However, Z is defined on a diamond-shaped grid and it is not clear to me how best to store this. The image below is an attempt at visualizing the operation for a simple example of a 4×4 array, but in general the dimensions are not equal and are much larger (maybe 64×16384, but this is user-selectable). Blue points are the resulting values of fi and αj and the text describes how these are related to fk, fq, and the discrete indices.
The diamond-shaped nature of Z means that in one "row" there will be "columns" that fall in between the "columns" of adjacent "rows". Another way to think of this is that fi can take on fractional index values!
Note that using zeros or NaNs to fill in elements that don't exist in any given row has two drawbacks: 1) it inflates the size of what may already be a very large 2-D array and 2) it does not really represent the true nature of Z (e.g. the array size will not really be correct).
Currently I am using a dictionary indexed on the actual values of αj to store the results:
import numpy as np
from collections import defaultdict
nrows = 64
ncolumns = 16384
fk = np.fft.fftfreq(nrows)
fq = np.fft.fftfreq(ncolumns)
# using random numbers here to simplify the example
# in practice S is the result of several FFTs and complex multiplications
S = np.random.random(size=(nrows,ncolumns)) + 1j*np.random.random(size=(nrows,ncolumns))
ret = defaultdict(lambda: {"fi":[],"Z":[]})
for k in range(-nrows//2,nrows//2):
    for q in range(-ncolumns//2,ncolumns//2):
        fi = 0.5*(fk[k] - fq[q])  # parenthesized to match the mapping rule fi = 0.5 · (fk - fq)
        alphaj = fk[k] + fq[q]
        Z = S[k,q]
        ret[alphaj]["fi"].append(fi)
        ret[alphaj]["Z"].append(Z)
I still find this a bit cumbersome to work with and wonder if anyone has suggestions for a better approach? "Better" here would be defined as more computationally and memory efficient and/or easier to interact with and visualize using something like matplotlib.
Note: This is related to another question about how to get rid of those nasty for-loops. Since this is about storing the results I thought it would be better to create two separate questions.
You can still view it as a straight two-dimensional array. But you can represent it as an array of rows, each row of which has a different number of items. For example, here's your 4x4 as a 2D array: (each 0 here is a unique data item)
xxx0xxx
xx0x0xx
x0x0x0x
0x0x0x0
x0x0x0x
xx0x0xx
xxx0xxx
Its sparse representation would be:
[
[0],
[0,0],
[0,0,0],
[0,0,0,0],
[0,0,0],
[0,0],
[0]
]
With this representation you eliminate the empty space. There's a little math involved in converting from Color Temperature to row, and from Spectral Frequency to column (and vice-versa), but that's tractable. You know the bounds and that items are evenly spaced out across each row. So it should be easy enough to do the translation.
Unless I'm missing something . . .
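A small sketch of that row-of-rows idea for the 4×4 case follows. It uses the integer index sum k + q as a stand-in for αj, which assumes both axes share the same spacing; the names and the sample array are illustrative only.

```python
import numpy as np

n = 4
S = np.arange(n * n).reshape(n, n)  # placeholder data, one unique item per cell

# One list per diamond row; row index k + q runs from 0 to 2*n - 2
rows = [[] for _ in range(2 * n - 1)]
for k in range(n):
    for q in range(n):
        rows[k + q].append(S[k, q])

row_lengths = [len(row) for row in rows]
print(row_lengths)
```

The row lengths grow and then shrink, matching the 1, 2, 3, 4, 3, 2, 1 layout drawn above, and no padding elements are stored.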
It turns out that the answer to a related question on optimization effectively solved my problem of how to better store the data. The new code returns 2-D arrays for fi and αj, and these can be used to directly index S. So to get all values of S for αj = 0, for example, one can do
S[alphaj == 0]
I can use this pretty effectively and it seems like the quickest way to create a reasonable data structure.
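The code from the related question is not reproduced here, so the following is only a sketch of how such 2-D arrays could be built with NumPy broadcasting. It assumes the mapping rule stated in the question and uses a deliberately small 4×8 grid.

```python
import numpy as np

nrows, ncolumns = 4, 8  # small sizes chosen for illustration
fk = np.fft.fftfreq(nrows)
fq = np.fft.fftfreq(ncolumns)

rng = np.random.default_rng(0)
S = rng.random((nrows, ncolumns)) + 1j * rng.random((nrows, ncolumns))

# Broadcasting a column vector against a row vector gives full 2-D grids
fi = 0.5 * (fk[:, None] - fq[None, :])
alphaj = fk[:, None] + fq[None, :]

# Boolean mask indexing picks out every S value on the alphaj == 0 line
selected = S[alphaj == 0]
print(selected.shape)
```

Exact comparison with 0 works here because these frequencies are exact binary fractions; for other grid sizes np.isclose(alphaj, 0) is the safer test.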

looping through complicated nested dictionary

I have a rather complex list of dictionaries with nested dictionaries and arrays. I am trying to figure out a way to either
1. make the list of data less complicated and then loop through the raster points, or
2. find a way to loop through the array of raster points as is.
What I am ultimately trying to do is loop through all raster points within each polygon, perform a simple greater than or less than on the value assigned to that raster point (values are elevation values). If greater than a given value assign 1, if less than given value assign 0. I would then create a separate array of these 1s and 0s of which I can then get an average value.
I have found all these points (allpoints within pts), but they are in arrays within a dictionary within another dictionary within a list (of all polygons). At least I think so; I could be wrong about the organization, as dictionaries are rather new to me.
The following is my code:
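import numpy as np
from rasterstats import zonal_stats  # assumed: zonal_stats comes from the rasterstats package

def mystat(x):
    # Return all raster points of the zone under the key 'allpoints'
    return {'allpoints': x}

stats = zonal_stats('acp.shp', 'myGeoTIFF.tif')
pts = zonal_stats('acp.shp', 'myGeoTIFF.tif', add_stats={'mystat': mystat})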
Link to my documents. Any help or direction would be greatly appreciated!
I assume you are using the rasterstats package. You could try something like this:
threshold_value = 15  # You may change this threshold value to yours
for o_idx in range(len(pts)):
    data = pts[o_idx]['mystat']['allpoints'].data
    for d_idx in range(len(data)):
        for p_idx in range(len(data[d_idx])):
            # You may change the conditions below as you want
            if data[d_idx][p_idx] > threshold_value:
                data[d_idx][p_idx] = 1
            elif data[d_idx][p_idx] <= threshold_value:
                data[d_idx][p_idx] = 0
This updates the data within the pts list in place.
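If the goal is just the average of the 1s and 0s per polygon, a vectorized variant may be simpler. This sketch assumes each 'allpoints' entry is a NumPy masked array, and it uses a hand-made pts list in place of real zonal_stats output, so the values are purely illustrative.

```python
import numpy as np

threshold_value = 15

# Hypothetical stand-in for the pts list built in the question
pts = [
    {"mystat": {"allpoints": np.ma.masked_array(
        [[10, 20], [30, 5]], mask=[[False, False], [False, True]])}},
    {"mystat": {"allpoints": np.ma.masked_array(
        [[16, 14]], mask=[[False, False]])}},
]

fractions = []
for zone in pts:
    # compressed() returns the unmasked cells as a flat 1-D array
    values = zone["mystat"]["allpoints"].compressed()
    flags = (values > threshold_value).astype(int)  # 1 above threshold, else 0
    fractions.append(flags.mean())

print(fractions)
```

Unlike the in-place loop, this leaves the elevation values untouched and skips cells that fall outside the polygon.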

Calculating cartesian coordinates using python

Having not worked with cartesian graphs since high school, I have now found a need for them in real life. It may be a strange need, but I have to allocate data to points on a cartesian graph that will be accessible by calling cartesian coordinates. The graph needs to hold infinitely many points. For example:
^
[-2-2,a ][ -1-2,f ][0-2,k ][1-2,p][2-2,u]
[-2-1,b ][ -1-1,g ][0-1,l ][1-1,q][1-2,v]
<[-2-0,c ][ -1-0,h ][0-0,m ][1-0,r][2-0,w]>
[-2--1,d][-1--1,i ][0--1,n][1-1,s][2-1,x]
[-2--2,e][-1--2,j ][0--2,o][1-2,t][2-2,y]
v
The actual values aren't important. But, say I am on variable m, this would be 0-0 on the cartesian graph. I need to calculate the cartesian coordinates for if I moved up one space, which would leave me on l.
Theoretically, say I have a python variable which == ("0-1"), I believe I need to split it at the -, which would leave x=0, y=1. Then, I would need to perform (int(y)+1), then re-attach x to y with a '-' in between.
What I want to be able to do is call a function with the argument (x+1,y+0), and for the program to perform the above, and then return the cartesian coordinate it has calculated.
I don't actually need to retrieve the value of the space, just the cartesian coordinate. I imagine I could utilise re.sub(), however I am not sure how to format this function correctly to split around the '-', and I'm also not sure how to perform the calculation correctly.
How would I do this?
To represent an infinite lattice, use a dictionary which maps tuples (x,y) to values.
grid = {}
grid[(0,0)] = 'm'
grid[(0,1)] = 'l'
print(grid[(0,0)])
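With tuple keys, moving around the lattice becomes plain integer arithmetic. A minimal sketch, where the move helper and the three sample cells are illustrative names rather than anything from the question:

```python
# Sparse lattice: only cells that hold data need an entry
grid = {(0, 0): "m", (0, 1): "l", (1, 0): "r"}

def move(position, dx, dy):
    """Return the coordinate reached by stepping dx, dy from position."""
    x, y = position
    return (x + dx, y + dy)

new_position = move((0, 0), 0, 1)  # one step up from m
print(new_position, grid.get(new_position))
```

grid.get returns None for cells that were never filled, so the lattice can be treated as unbounded without storing empty cells.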
I'm not sure I fully understand the problem but I would suggest using a list of lists to get the 2D structure.
Then to look up a particular value you could do coords[x-minX][y-minY] where x,y are the integer indices you want, and minX and minY are the minimum values (-2 in your example).
You might also want to look at NumPy which provides an n-dim object array type that is much more flexible, allowing you to 'slice' each axis or get subranges. The NumPy documentation might be helpful if you are new to working with arrays like this.
EDIT:
To split a string like 0-1 into the constituent integers you can use:
s = '0-1'
[int(x) for x in s.split('-')]
You want to create a bidirectional mapping between the variable names and the coordinates, then you can look up coordinates by variable name, apply your function to it, then find the next variable using the new set of coordinates produced by your function.
Mapping between numeric tuples you can apply your function to, and strings usable as keys in a dict, and back, is easy.
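If the "x-y" string form has to be kept, note that a plain split('-') breaks on negative coordinates such as "-2--1". The sketch below parses with a regular expression instead; parse, fmt, move and the two lookup dicts are hypothetical names used only for illustration.

```python
import re

# Two optionally negative integers separated by a single '-'
COORD_PATTERN = re.compile(r"(-?\d+)-(-?\d+)")

def parse(coord):
    """Turn a string like '-2--1' into the tuple (-2, -1)."""
    match = COORD_PATTERN.fullmatch(coord)
    if match is None:
        raise ValueError(f"not a coordinate string: {coord!r}")
    return int(match.group(1)), int(match.group(2))

def fmt(x, y):
    """Turn two integers back into the 'x-y' string form."""
    return f"{x}-{y}"

def move(coord, dx, dy):
    """Shift a coordinate string by (dx, dy) and return the new string."""
    x, y = parse(coord)
    return fmt(x + dx, y + dy)

# Bidirectional mapping between variable names and coordinate strings
name_to_coord = {"m": "0-0", "l": "0-1", "r": "1-0"}
coord_to_name = {coord: name for name, coord in name_to_coord.items()}

print(move("0-0", 0, 1))
print(coord_to_name[move(name_to_coord["m"], 0, 1)])
```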

Finding unique maximum values in a list using python

I have a list of points as shown below
points=[ [x0,y0,v0], [x1,y1,v1], [x2,y2,v2].......... [xn,yn,vn]]
Some of the points have duplicate x,y values. What I want to do is to extract the unique maximum value x,y points
For example, if I have points [1,2,5] [1,1,3] [1,2,7] [1,7,3]
I would like to obtain the list [1,1,3] [1,2,7] [1,7,3]
How can I do this in python?
Thanks
For example:
import itertools
import operator

def getxy(point):
    return point[:2]

sortedpoints = sorted(points, key=getxy)
results = []
for xy, g in itertools.groupby(sortedpoints, key=getxy):
    # keep the point with the largest third value in each (x, y) group
    results.append(max(g, key=operator.itemgetter(2)))
that is: sort and group the points by xy, for every group with fixed xy pick the point with the maximum z. Seems straightforward if you're comfortable with itertools (and you should be, it's really a very powerful and useful module!).
Alternatively you could build a dict with (x,y) tuples as keys and lists of z as values and do one last pass on that one to pick the max z for each (x, y), but I think the sort-and-group approach is preferable (unless you have many millions of points so that the big-O performance of sorting worries you for scalability purposes, I guess).
You can use a dict to achieve this, using the property that "If a given key is seen more than once, the last value associated with it is retained in the new dictionary." This code sorts the points to make sure that the highest values come later, creates a dictionary whose keys are a tuple of the first two values and whose value is the third coordinate, then translates that back into a list.
points = [[1,2,5], [1,1,3], [1,2,7], [1,7,3]]
sp = sorted(points)
d = dict(((a,b), c) for (a,b,c) in sp)
results = [list(k) + [v] for (k,v) in d.items()]  # items() replaces the Python 2 iteritems()
There may be a way to further improve that, but it satisfies all your requirements.
If I understand your question .. maybe use a dictionary to map (x,y) to the max z
something like this (not tested)
best = {}
for x, y, z in points:
    if (x, y) in best:
        best[(x, y)] = max(best[(x, y)], z)
    else:
        best[(x, y)] = z
Though the ordering will be lost
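For completeness, here is a small runnable sketch of the dict approach on the sample points from the question, including the conversion back to a list. Sorting the result at the end is an added choice to give a predictable order, not something the question asked for.

```python
points = [[1, 2, 5], [1, 1, 3], [1, 2, 7], [1, 7, 3]]

# Map each (x, y) pair to the largest third value seen for it
best = {}
for x, y, v in points:
    key = (x, y)
    if key not in best or v > best[key]:
        best[key] = v

# Rebuild [x, y, v] rows; sorted() gives a deterministic order
result = sorted([x, y, v] for (x, y), v in best.items())
print(result)
```

This is a single pass over the points, so it avoids the sort that the groupby approach needs up front.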
