Re-distributing 2d data with max in middle - python

Hey all, I have a set of seemingly random 2D data that I want to reorder. This is really for an image with specific values at each pixel, but the concept will be the same.
I have a large 2d array that looks very random, say:
x = 100
y = 120
np.random.random((x,y))
and I want to redistribute the 2d matrix so that the maximum value is in the center and the values fall away from it on all sides, giving it a sort of gaussian fall-off from the center.
small example:
output = [[0.0,0.5,1.0,1.0,1.0,0.5,0.0],
          [0.0,0.5,1.0,1.5,1.0,0.5,0.0],
          [0.5,1.0,1.5,2.0,1.5,1.0,0.5],
          [0.0,0.5,1.0,1.5,1.0,0.5,0.0],
          [0.0,0.5,1.0,1.0,1.0,0.5,0.0]]
I know it won't really be a gaussian; I'm just trying to give a visualization of what I would like. I was thinking of sorting the 2d array into a list from max to min and then using that to create a new 2d array, but I'm not sure how to distribute the values to fill the matrix the way I want.
Thank you very much!

If anyone looks at this in the future and needs help, here is some advice on how to do this effectively for a lot of data. The code is posted below.
import itertools
import numpy as np

def datasort(inputarray, spot_in_x, spot_in_y):
    # get the dimensions and centre of the data
    center_of_y = spot_in_y
    center_of_x = spot_in_x
    M = len(inputarray[0])
    N = len(inputarray)
    l_list = list(itertools.chain(*inputarray))   # flattened data
    l_sorted = sorted(l_list, reverse=True)       # flattened data, sorted descending
    # Reorder
    to_reorder = list(np.arange(0, len(l_sorted), 1))
    x = np.linspace(-1, 1, M)
    y = np.linspace(-1, 1, N)
    centerx = int(M/2 - center_of_x) * 0.01
    centery = int(N/2 - center_of_y) * 0.01
    [X, Y] = np.meshgrid(x, y)
    # distance of every grid cell from the (possibly offset) centre
    R = np.sqrt((X + centerx)**2 + (Y + centery)**2)
    R_list = list(itertools.chain(*R))
    # pair each cell's distance with its flat index; sorting puts the
    # closest cells first
    values = zip(R_list, to_reorder)
    sortedvalues = sorted(values)
    unzip = list(zip(*sortedvalues))
    unzip2 = unzip[1]
    # assign the largest data values to the smallest distances, then put
    # everything back in flat-index order and reshape
    l_reorder = zip(unzip2, l_sorted)
    l_reorder = sorted(l_reorder)
    l_unzip = list(zip(*l_reorder))
    l_unzip2 = l_unzip[1]
    sorted_list = np.reshape(l_unzip2, (N, M))
    return sorted_list
This code basically takes your data and reorders it into a sorted list, then zips it together with a list of indices sorted by a circular distance function. Using zip and sorted you can then create the distribution of data you wish to have based on your distribution function; in my case it's a circle whose centre can be offset.
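For reference, here is a minimal usage sketch (the array shape and spot coordinates are made up for illustration):
import numpy as np
data = np.random.random((100, 120))
# hypothetical spot coordinates; inside datasort these offset the circular
# centre relative to the middle of the array
result = datasort(data, 40, 30)
print(result.shape)   # (100, 120)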

Related

scipy interpolate griddata return indices

I have been able to successfully utilize scipy's interpolate griddata function on multiple different datasets. However, I have now reached the point where I want to extract the same region from a large grid and interpolate over different datapoints. In other words, I can do the following:
# Get one sample of data
sample_data = alldata[0,:,:]
small_data = griddata((largelats.flatten(),largelons.flatten()),sample_data.flatten(),(smalllats,smalllons),'nearest')
Now, suppose I want to loop through this data along the first axis of alldata:
final_data = np.zeros([len(alldata), smalllats.shape[0], smalllons.shape[1]])
for i in range(0, len(alldata)):
    sample_data = alldata[i,:,:]
    small_data = griddata((largelats.flatten(),largelons.flatten()),sample_data.flatten(),(smalllats,smalllons),'nearest')
    final_data[i,:,:] = small_data
The problem with the above method is that the griddata call recalculates the extraction indices on every pass through the loop. Is there something that can be done to return the indices so that I could, instead, do something like the following:
xind = []
yind = []
final_data = np.zeros([len(alldata), smalllats.shape[0], smalllons.shape[1]])
for i in range(0, len(alldata)):
    sample_data = alldata[i,:,:]
    if len(xind) == 0:
        small_data, return_x_inds_of_large_grid, return_y_inds_of_large_grid = griddata((largelats.flatten(),largelons.flatten()),sample_data.flatten(),(smalllats,smalllons),'nearest')
        xind = return_x_inds_of_large_grid
        yind = return_y_inds_of_large_grid
    final_data[i,xind,yind] = sample_data[xind,yind]
Basically, I would like to avoid calling griddata inside the loop; I would like it to return the indices once so that they can be reused for the other iterations. Can something like this be done?
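For what it's worth, one possible approach (a sketch, not from the original post, assuming method='nearest' as above) is to precompute the nearest-neighbour mapping once with scipy.spatial.cKDTree, which is what griddata uses internally for nearest interpolation, and then reuse the flat indices on every time step:
import numpy as np
from scipy.spatial import cKDTree

# build the tree once from the large-grid coordinates
tree = cKDTree(np.column_stack((largelats.flatten(), largelons.flatten())))
# flat index into the large grid of each small-grid point's nearest neighbour
_, flat_ind = tree.query(np.column_stack((smalllats.flatten(), smalllons.flatten())))

final_data = np.zeros([len(alldata), smalllats.shape[0], smalllons.shape[1]])
for i in range(len(alldata)):
    final_data[i,:,:] = alldata[i,:,:].flatten()[flat_ind].reshape(smalllats.shape)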

Generate simulated data in Python while meeting a range of correlations with respect to a predefined variable

Let refVar denote a variable of interest that contains experimental data.
For the simulation study, I would like to generate other variables V0.05, V0.10, V0.15, ..., up to V0.95.
Note that in each variable name, the value following V represents the correlation between that variable and refVar (to make them easy to track in the final dataframe).
My readings led me to multivariate_normal() from numpy. However, that function generates two 1D arrays that are both random. What I want is to keep refVar fixed and generate other arrays of random numbers that meet the specified correlation with it.
Please find my code below. To cut it short, I have no clue how to generate the other variables relative to my experimental variable refVar. Ideally, I would like to build a data frame containing the following columns: refVar, V0.05, V0.10, ..., V0.95. I hope you get my point, and thank you in advance for your time.
import numpy as np
import pandas as pd
from numpy.random import multivariate_normal as mvn

refVar = [75.25,77.93,78.2,61.77,80.88,71.95,79.88,65.53,85.03,61.72,60.96,56.36,23.16,73.36,64.18,83.07,63.25,49.3,78.2,30.96]
mean_refVar = np.mean(refVar)
for r in np.arange(0, 1, 0.05):
    var1 = 1
    var2 = 1
    cov = r
    cov_matrix = [[var1, cov],
                  [cov, var2]]
    data = mvn([mean_refVar, mean_refVar], cov_matrix, size=len(refVar))
    output = 'corr_' + str(r.round(2)) + '.txt'
    df = pd.DataFrame(data, columns=['refVar', 'V' + str(r.round(2))])
    df.to_csv(output, sep='\t', index=False)  # Ideally, instead of creating an output for each correlation, I would like to generate a DF with refVar and all these newly created Series
Following this answer, we can generate the sequence as follows:
def rand_with_corr(refVar, corr):
    # center and normalize refVar
    X = np.array(refVar) - np.mean(refVar)
    X = X / np.linalg.norm(X)
    # random sampling Y
    Y = np.random.rand(len(X))
    # centralize Y
    Y = Y - Y.mean()
    # find the orthogonal component to X
    Y = Y - Y.dot(X) * X
    # normalize Y
    Y = Y / np.linalg.norm(Y)
    # output
    return Y + (1 / np.tan(np.arccos(corr))) * X

# test
out = rand_with_corr(refVar, 0.05)
pd.Series(out).corr(pd.Series(refVar))
# out
# 0.050000000000000086
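To get the single data frame the question asks for, a sketch along these lines should work (an assumption on my part: the generated series is rescaled linearly into refVar's units, which leaves the correlation unchanged):
df = pd.DataFrame({'refVar': refVar})
for r in np.arange(0.05, 1.0, 0.05):
    simulated = rand_with_corr(refVar, r)
    # optional linear rescaling into refVar's units; a linear transform
    # does not change the correlation
    df['V' + format(r, '.2f')] = simulated * np.std(refVar) + np.mean(refVar)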

combining/merging multiple 2d arrays into single array by using python

I have four 2 dimensional np arrays. The shape of each array is (203, 135). Now I want to join all these arrays into one single array with respect to latitude and longitude.
I have used the code below to read the data:
import pandas as pd
import numpy as np
import os
import glob
from pyhdf import SD
import datetime
import mpl_toolkits.basemap.pyproj as pyproj

DATA = {}
files = glob.glob('MOD04*')
files.sort()
for n, f in enumerate(files):
    SDS_NAME = 'Deep_Blue_Aerosol_Optical_Depth_550_Land'
    hdf = SD.SD(f)
    lat = hdf.select('Latitude')
    latitude = lat[:]
    min_lat = latitude.min()
    max_lat = latitude.max()
    lon = hdf.select('Longitude')
    longitude = lon[:]
    min_lon = longitude.min()
    max_lon = longitude.max()
    sds = hdf.select(SDS_NAME)
    data = sds.get()
    p = pyproj.Proj(proj='utm', zone=45, ellps='WGS84')
    x, y = p(longitude, latitude)
    def set_element(elements, x, y, data):
        # Set element with two coordinates.
        elements[x + (y * 10)] = data
    elements = []
    set_element(elements, x, y, data)
But I got this error: only integer arrays with one element can be converted to an index
you can find the data: https://drive.google.com/open?id=0B2rkXkOkG7ExMElPRDd5YkNEeDQ
I have created toy datasets for this problem, as requested.
What I want is to get one single array from the four arrays (a, b, c, d), whose dimensions should be something like (406, 270):
a = (np.random.rand(27405)).reshape(203,135)
b = (np.random.rand(27405)).reshape(203,135)
c = (np.random.rand(27405)).reshape(203,135)
d = (np.random.rand(27405)).reshape(203,135)
a_x = (np.random.uniform(10,145,27405)).reshape(203,135)
a_y = (np.random.uniform(204,407,27405)).reshape(203,135)
d_x = (np.random.uniform(150,280,27405)).reshape(203,135)
d_y = (np.random.uniform(204,407,27405)).reshape(203,135)
b_x = (np.random.uniform(150,280,27405)).reshape(203,135)
b_y = (np.random.uniform(0,202,27405)).reshape(203,135)
c_x = (np.random.uniform(10,145,27405)).reshape(203,135)
c_y = (np.random.uniform(0,202,27405)).reshape(203,135)
any help?
This should be a comment, but the comment space is not enough for these questions, so I am posting it here:
You say that you have 4 input arrays (a, b, c, d) which are somehow to be integrated into an output array. As far as I understand, two of these arrays contain positional information (x, y) such as longitude and latitude. The only line in your code where you combine several input arrays is here:
def set_element(elements, x, y, data):
    # Set element with two coordinates.
    elements[x + (y * 10)] = data
Here you have four input variables (elements, x, y, data), which I assume to be your input arrays (a, b, c, d). Yet in this operation you do not combine them; you overwrite one element of elements (at index x + 10y) with a new value (data).
Therefore, I do not understand your target output.
When I was asking for toy data, I had something like this in mind:
a = [[1,2]]
b = [[3,4]]
c = [[5,6]]
d = [[7,8]]
This would be such an easy example that you could easily say:
What I want is this:
res = [[[1,2],[3,4]],[[5,6],[7,8]]]
Then we could help you to find an answer.
Please provide more information about the operation that you want to conduct, either mathematically notated (such as x = a + b*c + d) or with toy data, so that we can deduce the function you are asking for.
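For what it's worth, if (and this is an assumption read off the toy coordinate ranges above) the four tiles simply form a 2 x 2 mosaic, with a and d covering the upper y range and c and b the lower one, then numpy can assemble them directly. Each block is (203, 135), so the result is (406, 270):
import numpy as np
a = np.random.rand(203, 135)
b = np.random.rand(203, 135)
c = np.random.rand(203, 135)
d = np.random.rand(203, 135)
# layout assumed from the x/y ranges: a left / d right on one row,
# c left / b right on the other; swap rows or columns to match your
# coordinate convention
mosaic = np.block([[a, d],
                   [c, b]])
print(mosaic.shape)   # (406, 270)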

Indexing and vectorizing a nested-loop

I am running a particular script that will calculate the fractal dimension of the input data. While the script does run fine, it is very slow, and a look at it using cProfile showed that the function boxcount accounts for around 90% of the run time. I have had similar issues in previous questions, More efficient way to loop? and Vectorization of a nested for-loop. Looking at cProfile, the function itself is not slow, but the script needs to call it a very large number of times. I'm struggling to find a way to rewrite this to eliminate the large number of function calls. Here is the code below:
# Loop over rows
for j in range(starty, endy):
    jmin = j - half_tile
    jmax = j + half_tile + 1
    # Loop over columns
    for i in range(startx, endx):
        imin = i - half_tile
        imax = i + half_tile + 1
        # Extract a subset of points from the input grid, centered on the current
        # point. The size of tile is given by the current entry of the tile list.
        z = surface[imin:imax, jmin:jmax]
        # Calculate fractal dimension of the tile using 3D box-counting
        fd, intercept = boxcount(z, dx, nside, cell, slice_size, box_size)
        FractalDim[i,j] = fd
        Lacunarity[i,j] = intercept
My real problem is that each pass through i, j computes imin, imax, jmin, jmax, which define a subset of the input data centred on the point (i, j). The function of interest, boxcount, is then evaluated over that imin:imax, jmin:jmax slice. For this example, half_tile is 6, and starty, endy, startx, endx are 6, 271, 5, 210 respectively. The values dx, cell, nside, slice_size, box_size are all just constants used in the boxcount function.
I have done problems similar to this, just not with the added complication of centering the slice of data around a particular point. Can this be vectorized, or improved at all?
EDIT
Here is the code for the function boxcount as requested.
import math
from math import ceil
import numpy as np

def boxcount(z, dx, nside, cell, slice_size, box_size):
    # fractal dimension calculation using box-counting method
    n = 5      # number of graph points for simple linear regression
    gx = []    # x coordinates of graph points
    gy = []    # y coordinates of graph points
    boxCount = np.zeros(5)
    cell_set = np.zeros((nside**3, 5))
    nslice = nside**2
    # Box is centered at the mid-point of the tile. Calculate, for each point
    # in the tile, which voxel contains the point
    z0 = z[nside//2, nside//2] - dx*nside/2   # integer // needed for indexing in Python 3
    for j in range(1, 13):
        for i in range(1, 13):
            ij = (j-1)*12 + i
            delz1 = z[i-1,j-1] - z0
            delz2 = z[i-1,j] - z0
            delz3 = z[i,j-1] - z0
            delz4 = z[i,j] - z0
            delz = 0.25*(delz1 + delz2 + delz3 + delz4)
            if delz < 0.0:
                break
            slice = ceil(delz)
            # Identify the voxel occupied by the current point
            ijk = int(slice - 1.)*nslice + (j-1)*nside + i
            for k in range(5):
                if cell_set[cell[ijk,k],k] != 1:
                    cell_set[cell[ijk,k],k] = 1
                    # Set any cells deeper than this one equal to one as well
                    # index = cell[ijk,k]
                    # for l in range(int(index), box_size[k], slice_size[k]):
                    #     cell_set[l,k] = 1
    # Count number of filled boxes for each box size
    boxCount = np.sum(cell_set, axis=0)
    for ib in range(1, n+1):
        gx.append(math.log(1.0/ib))
        gy.append(math.log(boxCount[ib-1]))
    # simple linear regression
    m, b = np.polyfit(gx, gy, 1)
    # fd = m - 1
    fd = max(2., m)
    return (fd, b)
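Not a full answer, but one possible direction (my own sketch, not from the original post): with numpy >= 1.20, sliding_window_view can extract every tile up front, so the per-(i, j) slicing arithmetic disappears and only the boxcount call remains inside the loop:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

tile = 2 * half_tile + 1
# tiles[r, c] is the (tile x tile) window whose top-left corner is (r, c),
# i.e. the window centred on (r + half_tile, c + half_tile)
tiles = sliding_window_view(surface, (tile, tile))
for j in range(starty, endy):
    for i in range(startx, endx):
        z = tiles[i - half_tile, j - half_tile]
        FractalDim[i, j], Lacunarity[i, j] = boxcount(z, dx, nside, cell, slice_size, box_size)
This only removes the slicing overhead; the real gain would come from vectorizing boxcount itself, since the sheer number of calls is what dominates.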

python - combining argsort with masking to get nearest values in moving window

I have some code for calculating missing values in an image, based on neighbouring values in a 2D circular window. It also uses the values from one or more temporally-adjacent images at the same locations (i.e. the same 2D window shifted in the 3rd dimension).
For each position that is missing, I need to calculate the value based not necessarily on all the values available in the whole window, but only on the spatially-nearest n cells that do have values (in both images / Z-axis positions), where n is some value less than the total number of cells in the 2D window.
At the moment it is actually quicker to calculate using everything in the window, because my means of sorting to get the nearest n cells that have data is the slowest part of the function: it has to be repeated every time, even though the distances in window coordinates never change. I'm not sure this is necessary, and feel I must be able to get the sorted distances once and then mask them in the process of selecting only the available cells.
Here's my code for selecting the data to use within a window around the gap cell location:
import numpy as np

# radius will in reality be ~100
radius = 2
y, x = np.ogrid[-radius:radius+1, -radius:radius+1]
dist = np.sqrt(x**2 + y**2)
circle_template = dist > radius
# this will in reality be a very large 3 dimensional array
# representing daily images with some gaps, indicated by 0s
dataStack = np.zeros((2,5,5))
dataStack[1] = (np.random.random(25) * 100).reshape(dist.shape)
dataStack[0] = (np.random.random(25) * 100).reshape(dist.shape)
testdata = dataStack[1]
alternatedata = dataStack[0]
random_gap_locations = (np.random.random(25) * 30).reshape(dist.shape) > testdata
testdata[random_gap_locations] = 0
testdata[radius, radius] = 0
# in reality we will go through every gap (zero) location in the data
# for each image and for each gap use slicing to get a window of
# size (radius*2+1, radius*2+1) around it from each image, with the
# gap being at the centre i.e.
# testgaplocation = [radius, radius]
# and the variables testdata, alternatedata below will refer to these
# slices
locations_to_exclude = np.logical_or(circle_template, np.logical_or
(testdata==0, alternatedata==0))
# the places that are inside the circular mask and where both images
# have data
locations_to_include = ~locations_to_exclude
number_available = np.count_nonzero(locations_to_include)
# we only want to do the interpolation calculations from the nearest n
# locations that have data available, n will be ~100 in reality
number_required = 3
available_distances = dist[locations_to_include]
available_data = testdata[locations_to_include]
available_alternates = alternatedata[locations_to_include]
if number_available > number_required:
    # In this case we need to find the closest number_required elements, based
    # on the distances recorded in dist, from available_data and available_alternates.
    # Having to repeat this argsort for each gap-cell calculation is slow and feels
    # like it should be avoidable
    sortedDistanceIndices = available_distances.argsort(kind='mergesort', axis=None)
    requiredIndices = sortedDistanceIndices[0:number_required]
    selected_data = np.take(available_data, requiredIndices)
    selected_alternates = np.take(available_alternates, requiredIndices)
else:
    # we just use available_data and available_alternates as they are...
    selected_data = available_data
    selected_alternates = available_alternates
# now do stuff with the selected data to calculate a value for the gap cell
This works, but over half of the total time of the function is taken by the argsort of the masked spatial distance data (~900 µs of a total 1.4 ms, and this function will be running tens of billions of times, so this is an important difference!).
I am sure that I must be able to do this argsort just once, outside the function, when the spatial distance window is originally set up, and then include those sort indices in the masking, to get the first number_required indices without having to redo the sort. The answer might involve putting the various bits that we are extracting from into a record array, but I can't figure out how, if so. Can anyone see how I can make this part of the process more efficient?
So you want to do the sorting outside of the loop:
sorted_dist_idcs = dist.argsort(kind='mergesort', axis=None)
Then, using some variables from the original code, this is what I could come up with, though it still feels like a major round-trip:
loc_to_incl_sorted = locations_to_include.take(sorted_dist_idcs)
sorted_dist_idcs_to_incl = sorted_dist_idcs[loc_to_incl_sorted]
required_idcs = sorted_dist_idcs_to_incl[:number_required]
selected_data = testdata.take(required_idcs)
selected_alternates = alternatedata.take(required_idcs)
Note that required_idcs refers to locations in testdata, not available_data as in the original code. In this snippet I used take to conveniently index the flattened array.
@moarningsun - thanks for the comment and answer. These got me on the right track, but they don't quite work for me when the gap is < radius from the edge of the data: in this case I use a window around the gap cell which is "trimmed" to the data bounds. In that situation the indices reflect the "full" window and thus can't be used to select cells from the bounded window.
Unfortunately I edited that part of my code out when I clarified the original question, but it has turned out to be relevant.
I've realised now that if you use argsort again on the output of argsort then you get ranks; i.e. the position that each item would have when the overall array was sorted. We can safely mask these and then take the smallest number_required of them (and do this on a structured array to get the corresponding data at the same time).
This implies another sort within the loop, but in fact we can use partitioning rather than a full sort, because all we need is the smallest num_required items. If num_required is substantially less than the number of data items then this is much faster than doing the argsort.
For example with num_required = 80 and num_available = 15000 the full argsort takes ~900µs whereas argpartition followed by index and slice to get the first 80 takes ~110µs. We still need to do the argsort to get the ranks at the outset (rather than just partitioning based on distance) in order to get the stability of the mergesort, and thus get the "right one" when distance is not unique.
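A tiny illustration (mine, not from the original post) of the double-argsort trick:
import numpy as np
d = np.array([3.0, 1.0, 2.0])
order = d.argsort(kind='mergesort')   # [1, 2, 0]: indices that would sort d
ranks = order.argsort()               # [2, 0, 1]: the rank of each element of d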
My code as shown below now runs in ~610 µs on real data, including the actual calculations that aren't shown here. I'm happy with that now, but there seem to be several other apparently minor factors that can have an influence on the runtime that are hard to understand.
For example putting the circle_template in the structured array along with dist, ranks, and another field not shown here, doubles the runtime of the overall function (even if we don't access circle_template in the loop!). Even worse, using np.partition on the structured array with order=['ranks'] increases the overall function runtime by almost two orders of magnitude vs using np.argpartition as shown below!
# radius will in reality be ~100
radius = 2
y, x = np.ogrid[-radius:radius+1, -radius:radius+1]
dist = np.sqrt(x**2 + y**2)
circle_template = dist > radius
ranks = dist.argsort(axis=None, kind='mergesort').argsort().reshape(dist.shape)
diam = radius * 2 + 1
# this will in reality be a very large 3 dimensional array
# representing daily images with some gaps, indicated by 0s
dataStack = np.zeros((2, 5, 5))
dataStack[1] = (np.random.random(25) * 100).reshape(dist.shape)
dataStack[0] = (np.random.random(25) * 100).reshape(dist.shape)
testdata = dataStack[1]
alternatedata = dataStack[0]
random_gap_locations = (np.random.random(25) * 30).reshape(dist.shape) > testdata
testdata[random_gap_locations] = 0
testdata[radius, radius] = 0
# putting circle_template in this array too doubles overall function runtime!
# (dataStack is defined above so that its dtype can be used here)
fullWindowArray = np.zeros((diam, diam), dtype=[('ranks', ranks.dtype.str),
                                                ('thisdata', dataStack.dtype.str),
                                                ('alternatedata', dataStack.dtype.str),
                                                ('dist', dist.dtype.str)])
fullWindowArray['ranks'] = ranks
fullWindowArray['dist'] = dist
# in reality we will loop here to go through every gap (zero) location in the data
# for each image
gapz, gapy, gapx = 1, radius, radius
desLeft, desRight = gapx - radius, gapx + radius+1
desTop, desBottom = gapy - radius, gapy + radius+1
extentB, extentR = dataStack.shape[1:]
# handle the case where the gap is < search radius from the edge of
# the data. If this is the case, we can't use the full
# diam * diam window
dataL = max(0, desLeft)
maskL = 0 if desLeft >= 0 else abs(dataL - desLeft)
dataT = max(0, desTop)
maskT = 0 if desTop >= 0 else abs(dataT - desTop)
dataR = min(desRight, extentR)
maskR = diam if desRight <= extentR else diam - (desRight - extentR)
dataB = min(desBottom,extentB)
maskB = diam if desBottom <= extentB else diam - (desBottom - extentB)
# get the slice of the full window that we will be working within;
# the 'ranks' and 'dist' fields are already populated
boundedWindowArray = fullWindowArray[maskT:maskB, maskL:maskR]
boundedWindowArray['alternatedata'] = alternatedata[dataT:dataB, dataL:dataR]
boundedWindowArray['thisdata'] = testdata[dataT:dataB, dataL:dataR]
# circle_template is kept as a separate array (storing it in the structured
# array was much slower), so slice it to the same bounded window here
locations_to_exclude = np.logical_or(circle_template[maskT:maskB, maskL:maskR],
                                     np.logical_or(boundedWindowArray['thisdata'] == 0,
                                                   boundedWindowArray['alternatedata'] == 0))
# the places that are inside the circular mask and where both images
# have data
locations_to_include = ~locations_to_exclude
number_available = np.count_nonzero(locations_to_include)
# we only want to do the interpolation calculations from the nearest n
# locations that have data available, n will be ~100 in reality
number_required = 3
data_to_use = boundedWindowArray[locations_to_include]
if number_available > number_required:
    # argpartition seems to be very fast when number_required is
    # substantially < data_to_use.size.
    # But partition on the structured array itself with order=['ranks']
    # is almost 2 orders of magnitude slower!
    reqIndices = np.argpartition(data_to_use['ranks'], number_required)[:number_required]
    data_to_use = np.take(data_to_use, reqIndices)
else:
    # we just use data_to_use as it is...
    pass
# now do stuff with the selected data to calculate a value for the gap cell
