How to create a two-column matrix in rpy2 - python

I'm using rpy2 to run a method from an R library. According to the documentation:
method_name(x, range.x)
x: a two-column numeric matrix.
range.x: a list containing two vectors.
And it includes an example:
data(geyser, package="MASS")
x <- cbind(geyser$duration, geyser$waiting)
est <- method_name(x, range.x)
I checked the type of geyser$duration and geyser$waiting and both are double. I also tried replacing geyser$duration and geyser$waiting by g = c(.016, 2.15, 4.00) and h = c(.012, 2.11, 2.50) in R, and the code still works.
In my current Python code, I have:
import numpy as np
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector, FloatVector, ListVector # I tried these before too
from rpy2.robjects import numpy2ri, pandas2ri
numpy2ri.activate()
pandas2ri.activate()
base = rpackages.importr('base')
a = np.array([1.2, 2.1, 2.5]); b = np.array([5.2, 1.3, 2.15])
x = base.cbind(base.c(a), base.c(b))
ranges = base.range(x)
result = method_name(x, ranges)
As you can see, I'm trying to make my code as similar to the example as possible. However, I can't make the method work. I get the error Error in seq.default(a[2L], b[2L], length = M[2L]) which probably has to do with a problem in the arguments.
There's an obvious problem with ranges because it contains just two values, the minimum and maximum of x. However, it should contain two minimum values and two maximum values (one pair for each column of the matrix). I can achieve that by doing this:
ranges = base.cbind(base.range(a), base.range(b))
But this implies that there's a problem with the way I'm creating the matrix. Otherwise, I would get two pairs of values just by using base.range(x).
I also tried x = robjects.r.matrix(x, ncol = 2) but it didn't work. I still get just a global minimum and maximum value for the whole matrix when calling range.
What is the correct way of creating this matrix so that the method can run?

According to the documentation of the range function, it accepts vectors (one-dimensional arrays) as input. It therefore works when applied to each column of your matrix, or when applied directly to the a and b vectors as you mentioned. So your second approach should work:
# Define a,b vectors
a = np.array([1.2, 2.1, 2.5]); b = np.array([5.2, 1.3, 2.15])
# Calculate vector ranges
range_a = base.range(base.c(a))
range_b = base.range(base.c(b))
# Define the matrix
x = base.cbind(base.c(a), base.c(b))
print(x)
>>> [[1.2 5.2 ]
[2.1 1.3 ]
[2.5 2.15]]
# Define the ranges
ranges = base.cbind(range_a, range_b)
print(ranges)
>>> [[1.2 1.3]
[2.5 5.2]]
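If it helps, here is a variant of the same idea that builds the matrix on the R side and lets R compute the per-column ranges with apply. This is only a sketch (I have not run it against the specific method from the question), and it assumes numpy2ri is activated as in the question:
import numpy as np
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects import numpy2ri

numpy2ri.activate()
base = rpackages.importr('base')

a = np.array([1.2, 2.1, 2.5])
b = np.array([5.2, 1.3, 2.15])

# build a proper two-column R matrix from a single 2-D numpy array
x = robjects.r.matrix(np.column_stack((a, b)), ncol=2)

# apply(x, 2, range) in R returns a 2x2 matrix: one (min, max) pair per column
ranges = base.apply(x, 2, base.range)
print(np.asarray(ranges))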


Efficiently get indices of histogram bins in Python

Short Question
I have a large 10000x10000-element image, which I bin into a few hundred different sectors/bins. I then need to perform some iterative calculation on the values contained within each bin.
How do I extract the indices of each bin to efficiently perform my calculation using the bin values?
What I am looking for is a solution which avoids the bottleneck of having to select ind == j from my large array every time. Is there a way to obtain directly, in one go, the indices of the elements belonging to every bin?
Detailed Explanation
1. Straightforward Solution
One way to achieve what I need is to use code like the following (see e.g. this related answer), where I digitize my values and then have a j-loop selecting the digitized indices equal to j, as below:
import numpy as np
# This function func() is just a placemark for a much more complicated function.
# I am aware that my problem could be easily sped up in the specific case of
# of the sum() function, but I am looking for a general solution to the problem.
def func(x):
    y = np.sum(x)
    return y
vals = np.random.random(10**8)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
result = [func(vals[ind == j]) for j in range(1, nbins)]
This is not what I want as it selects every time ind == j from my large array. This makes this solution very inefficient and slow.
2. Using binned_statistics
The above approach turns out to be the same as the one implemented in scipy.stats.binned_statistic for the general case of a user-defined function. Using SciPy directly, an identical output can be obtained with the following:
import numpy as np
from scipy.stats import binned_statistic
vals = np.random.random(10**8)
results = binned_statistic(vals, vals, statistic=func, bins=100, range=[0, 1])[0]
3. Using labeled_comprehension
Another Scipy alternative is to use scipy.ndimage.measurements.labeled_comprehension. Using that function, the above example would become
import numpy as np
from scipy.ndimage import labeled_comprehension
vals = np.random.random(10**8)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
result = labeled_comprehension(vals, ind, np.arange(1, nbins), func, float, 0)
Unfortunately, this form is also inefficient and, in particular, it has no speed advantage over my original example.
4. Comparison with IDL language
To further clarify, what I am looking for is a functionality equivalent to the REVERSE_INDICES keyword in the HISTOGRAM function of the IDL language. Can this very useful functionality be efficiently replicated in Python?
Specifically, using the IDL language the above example could be written as
vals = randomu(s, 1e8)
nbins = 100
bins = [0:1:1./nbins]
h = histogram(vals, MIN=bins[0], MAX=bins[-2], NBINS=nbins, REVERSE_INDICES=r)
result = dblarr(nbins)
for j=0, nbins-1 do begin
  jbins = r[r[j]:r[j+1]-1] ; Selects indices of bin j
  result[j] = func(vals[jbins])
endfor
The above IDL implementation is about 10 times faster than the Numpy one, due to the fact that the indices of the bins do not have to be selected for every bin. And the speed difference in favour of the IDL implementation increases with the number of bins.
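For what it's worth, the REVERSE_INDICES bookkeeping itself can be emulated in NumPy with one argsort of the bin labels plus a cumulative histogram. The sketch below is only an illustration of that idea (the helper name reverse_indices and its signature are made up here, not part of NumPy or IDL):
import numpy as np

def reverse_indices(vals, nbins, lo=0.0, hi=1.0):
    """Return (order, offsets) such that order[offsets[j]:offsets[j+1]]
    are the indices of vals falling in bin j (mimicking IDL's REVERSE_INDICES)."""
    labels = np.clip(((vals - lo) * nbins / (hi - lo)).astype(int), 0, nbins - 1)
    counts = np.bincount(labels, minlength=nbins)
    order = np.argsort(labels, kind='stable')    # indices grouped by bin
    offsets = np.concatenate(([0], np.cumsum(counts)))
    return order, offsets

vals = np.random.random(10**6)
order, offsets = reverse_indices(vals, 100)
bin3 = order[offsets[3]:offsets[4]]              # indices of the elements in bin 3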
I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic but can be orders of magnitude faster
import numpy as np
from scipy.sparse import csr_matrix
def binned_statistic(x, values, func, nbins, range):
    '''The usage is nearly the same as scipy.stats.binned_statistic'''
    N = len(values)
    r0, r1 = range
    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
    S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))
    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]
I avoided np.digitize because it doesn't use the fact that all bins are equal width and hence is slow, but the method I used instead may not handle all edge cases perfectly.
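For clarity, here is roughly how the csr_matrix-based function above would be called, mirroring the scipy call from section 2 (func is the same placeholder as in the question):
import numpy as np

def func(x):   # placeholder from the question
    return np.sum(x)

vals = np.random.random(10**6)
# uses the binned_statistic defined above (the csr_matrix version), not scipy's
results = binned_statistic(vals, vals, func, nbins=100, range=(0.0, 1.0))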
I assume that the binning, done in the example with digitize, cannot be changed. This is one way to go, where you do the sorting once and for all.
import numpy as np
import matplotlib.pyplot as plt

vals = np.random.random(10**4)
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
new_order = np.argsort(ind)
ind = ind[new_order]
ordered_vals = vals[new_order]
# slower way of calculating first_hit (first version of this post)
# _, first_hit = np.unique(ind, return_index=True)
# faster way:
first_hit = np.searchsorted(ind, np.arange(1, nbins-1))
first_hit.sort()
# example of using the data:
for j in range(len(first_hit) - 1):
    # I am using a plotting function for your f, to show that they cluster
    plt.plot(ordered_vals[first_hit[j]:first_hit[j+1]], 'o')
The resulting figure shows that the bins are indeed clusters, as expected.
You can halve the computation time by sorting the array first and then using np.searchsorted.
vals = np.random.random(10**8)
vals.sort()
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)
results = [func(vals[np.searchsorted(ind,j,side='left'):
np.searchsorted(ind,j,side='right')])
for j in range(1,nbins)]
Using 1e8 as my test case, I go from 34 seconds of computation to about 17.
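If the values are sorted anyway, a possibly simpler variant (just a sketch, not benchmarked here) is to search the bin edges directly into the sorted data, skipping np.digitize entirely:
import numpy as np

vals = np.sort(np.random.random(10**6))
nbins = 100
bins = np.linspace(0, 1, nbins + 1)
# position of each bin edge within the sorted data
edges = np.searchsorted(vals, bins)
# func as defined in the question
results = [func(vals[edges[j]:edges[j + 1]]) for j in range(nbins)]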
One efficient solution is using the numpy_indexed package (disclaimer: I am its author):
import numpy_indexed as npi
npi.group_by(ind).split(vals)
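The resulting groups can then be passed straight to the user-defined function; a minimal sketch, reusing ind, vals and func from the snippets above:
import numpy_indexed as npi

groups = npi.group_by(ind).split(vals)   # list of arrays, one per distinct bin label
results = [func(g) for g in groups]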
Pandas has very fast grouping code (I think it's written in C), so if you don't mind loading the library, you could do that:
import pandas as pd
pdata=pd.DataFrame({'vals':vals,'ind':ind})
resultsp = pdata.groupby('ind').sum().values
or more generally:
pdata=pd.DataFrame({'vals':vals,'ind':ind})
resultsp = pdata.groupby('ind').agg(func).values
Note that the latter is slower for standard aggregation functions (like sum, mean, etc.).

NumPy or SciPy to calculate weighted median

I'm trying to automate a process that JMP does (Analyze->Distribution, entering column A as the "Y value", using subsequent columns as the "weight" value). In JMP you have to do this one column at a time - I'd like to use Python to loop through all of the columns and create an array showing, say, the median of each column.
For example, if the mass array is [0, 10, 20, 30], and the weight array for column 1 is [30, 191, 9, 0], the weighted median of the mass array should be 10. However, I'm not sure how to arrive at this answer.
So far I've
imported the csv showing the weights as an array, masking values of 0, and
created an array of the "Y value" the same shape and size as the weights array (113x32). I'm not entirely sure I need to do this, but thought it would be easier than a for loop for the purpose of weighting.
I'm not sure exactly where to go from here. Basically the "Y value" is a range of masses, and all of the columns in the array represent the number of data points found for each mass. I need to find the median mass, based on the frequency with which they were reported.
I'm not an expert in Python or statistics, so if I've omitted any details that would be useful let me know!
Update: here's some code for what I've done so far:
#Boilerplate & Import files
import csv
import scipy as sp
from scipy import stats
from scipy.stats import norm
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt
inputFile = '/Users/cl/prov.csv'
origArray = genfromtxt(inputFile, delimiter = ",")
nArray = np.array(origArray)
dimensions = nArray.shape
shape = np.asarray(dimensions)
#Mask values ==0
maTest = np.ma.masked_equal(nArray,0)
#Create array of masses the same shape as the weights (nArray)
fieldLength = shape[0]
rowLength = shape[1]
massArr = []
for i in range(rowLength):
    createArr = np.arange(0, fieldLength*10, 10)
    nCreateArr = np.array(createArr)
    massArr.append(nCreateArr)
nCreateArr = np.array(massArr)
nmassArr = nCreateArr.transpose()
What we can do, if I understood your problem correctly, is to sum up the observations; dividing that total by 2 gives us the observation number corresponding to the median. From there we need to figure out which observation this number corresponds to.
One trick here is to calculate the observation sums with np.cumsum, which gives us a running cumulative sum.
Example:
np.cumsum([1,2,3,4]) -> [ 1, 3, 6, 10]
Each element is the sum of all previous elements plus itself. We have 10 observations here, so the median would be the 5th observation. (We get 5 by dividing the last element by 2.)
Now, looking at the cumsum result, we can easily see that the median must be the observation falling between the second and third elements (cumulative counts 3 and 6), i.e. in the third element.
So all we need to do is figure out the index of where the median (5) will fit.
np.searchsorted does exactly what we need. It will find the index at which to insert an element into an array so that it stays sorted.
The code to do it looks like this:
import numpy as np
#my test data
freq_count = np.array([[30, 191, 9, 0], [10, 20, 300, 10], [10,20,30,40], [100,10,10,10], [1,1,1,100]])
c = np.cumsum(freq_count, axis=1)
indices = [np.searchsorted(row, row[-1]/2.0) for row in c]
masses = [i * 10 for i in indices] #Correct if the masses are indeed 0, 10, 20,...
#This is just for explanation.
print "median masses is:", masses
print freq_count
print np.hstack((c, c[:, -1, np.newaxis]/2.0))
Output will be:
median masses is: [10 20 20 0 30]
[[ 30 191 9 0] <- The test data
[ 10 20 300 10]
[ 10 20 30 40]
[100 10 10 10]
[ 1 1 1 100]]
[[ 30. 221. 230. 230. 115. ] <- cumsum results with median added to the end.
[ 10. 30. 330. 340. 170. ] you can see from this where they fit in.
[ 10. 30. 60. 100. 50. ]
[ 100. 110. 120. 130. 65. ]
[ 1. 2. 3. 103. 51.5]]
wquantiles is a small python package that will do exactly what you need. It just uses np.cumsum() and np.interp() under the hood.
Since this is the top hit on Google for weighted median in NumPy, I will add my minimal function to select the weighted median from two arrays without changing their contents, and with no assumptions about the order of the values (on the off-chance that anyone else comes here looking for a quick recipe for the same exact pre-conditions).
def weighted_median(values, weights):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, 0.5 * c[-1])]]
Using argsort lets us maintain the alignment between the two arrays without changing or copying their content. It should be straightforward to extend it to an arbitrary number of arbitrary quantiles.
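As a quick sanity check against the example in the original question (masses [0, 10, 20, 30] weighted by [30, 191, 9, 0], expected weighted median 10):
import numpy as np

masses = np.array([0, 10, 20, 30])
weights = np.array([30, 191, 9, 0])
print(weighted_median(masses, weights))   # -> 10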
Update
Since it may not be fully obvious at first blush exactly how easy it is to extend to arbitrary quantiles, here is the code:
def weighted_quantiles(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, np.array(quantiles) * c[-1])]]
This defaults to the median, but you can pass in any quantile, or a list of quantiles. The return type is equivalent to what you pass in as quantiles, with lists promoted to NumPy arrays. With enough uniformly distributed values, you can verify that the estimated quantiles land close to the requested ones:
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99])
array([0.01235101, 0.05341077, 0.25355715, 0.50678338, 0.75697424, 0.94962936, 0.98980785])
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), 0.5)
0.5036283072043176
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.5])
array([0.49851076])
Update 2
In small data sets where the median/quantile is not actually observed, it may be important to be able to interpolate a point between two observations. This can be fairly easily added by calculating the midpoint between two numbers in the case where the weight mass is equally (or quantile/1-quantile) divided between them. Due to the need for a conditional, this function always returns a NumPy array, even when quantiles is a single scalar. The inputs also need to be NumPy arrays now (except quantiles, which may still be a single number).
def weighted_quantiles_interpolate(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    q = np.searchsorted(c, quantiles * c[-1])
    return np.where(c[q]/c[-1] == quantiles, 0.5 * (values[i[q]] + values[i[q+1]]), values[i[q]])
This function will fail with arrays smaller than 2 (the original would handle non-empty arrays).
>>> weighted_quantiles_interpolate(np.array([2, 1]), np.array([1, 1]), 0.5)
array(1.5)
Note that this extension is fairly unlikely to be needed when working with actual data sets, where we typically have (a) large data sets, and (b) real-valued weights that make the odds of ending up exactly at a quantile edge very long (and when it does happen, it is probably due to rounding errors). Including it for completeness nonetheless.
I ended up writing this function based on @muzzle's and @maesers' replies:
def weighted_quantiles(values, weights, quantiles=0.5, interpolate=False):
    i = values.argsort()
    sorted_weights = weights[i]
    sorted_values = values[i]
    Sn = sorted_weights.cumsum()
    if interpolate:
        Pn = (Sn - sorted_weights/2) / Sn[-1]
        return np.interp(quantiles, Pn, sorted_values)
    else:
        return sorted_values[np.searchsorted(Sn, quantiles * Sn[-1])]
The difference between interpolate True and False is as follows:
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4))
> 2
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), interpolate=True)
> 2.5
(there is no difference for odd-length arrays such as [1, 2, 3, 4, 5])
Speed tests show it is just as performant as #maesers' function in the uninterpolated case, and it is twice as performant in the interpolated case.
Sharing some code that I got a hand with. It lets you run stats on each column of an Excel spreadsheet.
import xlrd
import sys
import csv
import numpy as np
import itertools
from itertools import chain
book = xlrd.open_workbook('/filepath/workbook.xlsx')
sh = book.sheet_by_name("Sheet1")
ofile = '/outputfilepath/workbook.csv'
masses = sh.col_values(0, start_rowx=1) # first column has mass
ages = sh.row_values(0, start_colx=1) # first row has age ranges
count = 1
age = []
for a in ages:
    age.append(sh.col_values(count, start_rowx=1))
    count += 1
stats = []
count = 0
for a in ages:
    expanded = []
    # create a tuple with the mass vector
    age_mass = zip(masses, age[count])
    count += 1
    # replicate element[0] for element[1] times
    expanded = list(list(itertools.repeat(am[0], int(am[1]))) for am in age_mass)
    # separate into one big list
    medianlist = [x for t in expanded for x in t]
    # convert to array and mask out zeroes
    npa = np.array(medianlist)
    npa = np.ma.masked_equal(npa, 0)
    median = np.median(npa)
    meanMass = np.average(npa)
    maxMass = np.max(npa)
    minMass = np.min(npa)
    stdev = np.std(npa)
    stats1 = [median, meanMass, maxMass, minMass, stdev]
    print(stats1)
    stats.append(stats1)
np.savetxt(ofile, (stats), fmt="%d")

Plot using pandas

I have some event times in a list and I would like to plot an exponentially weighted moving average of them. I can do this using the following code.
import numpy as np
import matplotlib.pyplot as plt
print("Code running")
a = 0.01
l = [3.0, 7.0, 10.0, 20.0, 200.0]
y = np.zeros(1000)
for item in l:
    y[int(item)] = 1
s = np.zeros(1000)
x = np.linspace(0, 1000, 1000)
for i in range(1000):
    s[i] = a*y[i-1] + (1-a)*s[i-1]
plt.plot(x, s)
plt.show()
This is clearly a horrible way to use python however. What's the right way to do this? Is it possible to do it without making all these extra sparse arrays?
The output should look like this.
Pandas comes to mind for this task:
import numpy as np
import pandas as pd
l = [3.0, 7.0, 10.0, 20.0, 200.0]
s = pd.Series(np.ones_like(l), index=l)
y = s.reindex(range(1000), fill_value=0)
pd.ewma(y, span=199).plot()
The span of 199 is related to your parameter alpha = 0.01 via a = 2/(n+1).
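Note that pd.ewma has since been removed from pandas; on current versions the same idea would presumably be written with the Series.ewm accessor, roughly like this sketch:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

l = [3.0, 7.0, 10.0, 20.0, 200.0]
s = pd.Series(np.ones_like(l), index=l)
y = s.reindex(range(1000), fill_value=0)
# span=199 corresponds to alpha = 2/(span+1) = 0.01, matching the question
y.ewm(span=199).mean().plot()
plt.show()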
AFAIK there's not a very good way to do this with numpy or the scipy.sparse module -- the sparse matrices in scipy.sparse are designed to be 2D matrices, and to create one in the first place you'd basically need to use the code you've already written in your first loop (i.e., to set all of the nonzero locations in a sparse matrix), with the additional complexity of always having to specify two index values.
As if that's not bad enough, np.convolve doesn't work with sparse arrays, so you'd still need to write out the computation in your second loop to compute the moving average.
My recommendation, which probably isn't much help if you're looking for a fancy numpy version, is to fall back on Python's excellent support as a general-purpose language:
import numpy as np
import matplotlib.pyplot as plt

a = 0.01
l = set([3, 7, 10, 20, 200])
s = np.zeros(1000)
for i in range(len(s)):
    s[i] = a * int(i-1 in l) + (1-a) * s[i-1]
plt.plot(s)
plt.show()
Here, I've stored the event index values in l, just as you did, but I used a set to make lookup times O(1) -- though if len(l) isn't very large, you might even be better off with a plain list or tuple, you'd need to measure it to be sure. Then you can avoid creating the y array and just rely on Iverson's convention to convert the Boolean value x in y into an int. You might not even need the explicit cast, but I find it helpful to be explicit.
I think you're looking for something like this:
import numpy as np
import matplotlib.pyplot as plt
from scikits.timeseries.lib.moving_funcs import mov_average_expw
l = [ 3.0, 7.0, 10.0, 20.0, 200.0 ]
y = np.zeros(1000)
y[[l]] = 1
emav = mov_average_expw(y, 199)
plt.plot(emav)
plt.show()
This makes use of mov_average_expw from scikits.timeseries. Check that method's documentation to see how I came up with the span parameter based on your code's a variable.

Moving average of an array in Python

I have an array where discrete sine wave values are recorded and stored. I want to find the max and min of the waveform. Since the sine wave data is voltage recorded using a DAQ, there will be some noise, so I want to do a weighted average. Assuming self.yArray contains my sine wave values, here is my code so far:
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range(0, length-(filtersize+1)):
    for y in range(0, filtersize):
        summation = sum(self.yArray[x+y])
    ave = summation/filtersize
    filterarray.append(ave)
My issue seems to be in the second for loop, where depending on my averaging window size (filtersize), I want to sum up the values in the window to take the average of them. I receive an error saying:
summation = sum(self.yArray[x+y])
TypeError: 'float' object is not iterable
I am an EE with very little experience in programming, so any help would be greatly appreciated!
The other answers correctly describe your error, but this type of problem really calls out for using numpy. Numpy will run faster, be more memory efficient, and is more expressive and convenient for this type of problem. Here's an example:
import numpy as np
import matplotlib.pyplot as plt
# make a sine wave with noise
times = np.arange(0, 10*np.pi, .01)
noise = .1*np.random.ranf(len(times))
wfm = np.sin(times) + noise
# smoothing it with a running average in one line using a convolution
# using a convolution, you could also easily smooth with other filters
# like a Gaussian, etc.
n_ave = 20
smoothed = np.convolve(wfm, np.ones(n_ave)/n_ave, mode='same')
plt.plot(times, wfm, times, -.5+smoothed)
plt.show()
If you don't want to use numpy, it should also be noted that there's a logical error in your program that results in the TypeError. The problem is that in the line
summation = sum(self.yArray[x+y])
you're using sum within the loop where you're also calculating the sum. So either you need to use sum without the loop, or loop through the array and add up all the elements, but not both (and it's doing both, i.e., applying sum to the indexed array element, that leads to the error in the first place). That is, here are two solutions:
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range(0, length-(filtersize+1)):
    summation = sum(self.yArray[x:x+filtersize]) # sum over section of array
    ave = summation/filtersize
    filterarray.append(ave)
or
filterarray = []
filtersize = 2
length = len(self.yArray)
for x in range(0, length-(filtersize+1)):
    summation = 0.
    for y in range(0, filtersize):
        summation += self.yArray[x+y]
    ave = summation/filtersize
    filterarray.append(ave)
self.yArray[x+y] is returning a single item out of the self.yArray list. If you are trying to get a subset of the yArray, you can use the slice operator instead:
summation = sum(self.yArray[x:y])
to return an iterable that the sum builtin can use.
A bit more information about python slices can be found here (scroll down to the "Sequences" section): http://docs.python.org/2/reference/datamodel.html#the-standard-type-hierarchy
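A tiny illustration of the difference, using a plain list of floats in place of self.yArray:
yArray = [1.0, 2.0, 3.0, 4.0]
print(sum(yArray[0:2]))   # 3.0 -- a slice is a list, so sum() can iterate over it
# sum(yArray[0])          # TypeError: 'float' object is not iterable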
You could use numpy, like:
import numpy
filtersize = 2
ysums = numpy.cumsum(numpy.array(self.yArray, dtype=float))
ylags = numpy.roll(ysums, filtersize)
ylags[0:filtersize] = 0.0
moving_avg = (ysums - ylags) / filtersize
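A small worked example of what that cumulative-sum recipe produces (note that the first filtersize-1 entries are averages over partial windows):
import numpy
yArray = [1.0, 2.0, 3.0, 4.0, 5.0]
filtersize = 2
ysums = numpy.cumsum(numpy.array(yArray, dtype=float))
ylags = numpy.roll(ysums, filtersize)
ylags[0:filtersize] = 0.0
moving_avg = (ysums - ylags) / filtersize
print(moving_avg)   # [0.5 1.5 2.5 3.5 4.5]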
Your original code attempts to call sum on the float value stored at yArray[x+y], where x+y is evaluating to some integer representing the index of that float value.
Try:
summation = sum(self.yArray[x:y])
Indeed numpy is the way to go. One of the nice features of Python is list comprehensions, allowing you to do away with the typical nested for loop constructs. Here is an example for your particular problem:
import numpy as np
step=2
res = [np.sum(myarr[i:i+step], dtype=float)/step for i in range(len(myarr)-step+1)]

Nx3 column data to 2d matrix for image processing

I am trying to find local maxima and contours in Nx3 data in the format ('x','y','value') that I read from a text file; 'x' and 'y' form an evenly spaced grid and there is a single value for every combination of 'x' and 'y'. It looks like this:
3.0, -0.4, 56.94369888305664
3.0, -0.3, 56.97200012207031
3.0, -0.2, 56.77149963378906
3.0, -0.1, 56.41230010986328
3.0, 0, 55.8302001953125
3.0, 0.1, 55.81560134887695
3.0, 0.2, 55.600399017333984
3.0, 0.3, 55.51969909667969
3.0, 0.4, 55.18550109863281
3.2, -0.4, 56.26380157470703
3.2, -0.3, 56.228599548339844
...
The problem is that the image code I am trying to use (link) requires the data to be in a different 2d matrix format for image processing. This is the relevant part of the code:
# Construct some test data
x, y = np.ogrid[-np.pi:np.pi:100j, -np.pi:np.pi:100j]
r = np.sin(np.exp((np.sin(x)**3 + np.cos(y)**2)))
# Find contours at a constant value of 0.8
contours = measure.find_contours(r, 0.8)
Can somebody help transform my data to the required 'grided' format?
EDIT: I finally went for pandas, but I find the chosen answer better in the general case. This is what I did:
from pandas import read_csv
data = read_csv(filename, names=['x', 'y', 'values']).pivot(index='x', columns='y', values='values')
After this data.values holds the table in 2d 'image form' the like I wanted.
y -0.4 -0.3 -0.2 -0.1
x
3.0 86.9423 87.6398 87.5256 89.5779
3.2 76.9414 77.7743 78.8633 76.8955
3.4 71.4146 72.8257 71.7210 71.5232
The best solution really depends on details you're not giving. By the way, you should really give your code, or at least the np.loadtxt instruction.
In the following, "data" is the array loaded from the file using:
data = np.loadtxt('file.txt', dtype=[('x', float), ('y', float), ('value', float)], delimiter=',')
1) Direct reshape:
Following on what #tom10 said
If you know that your (x,y,value) data is stored in the specific order:
[(x0,y0,v00), (x0,y1,v01), .... , (x1,y0,v10),(x1,y1,v11), ... ,(xN,yM,vNM)]
And that the values of all (x,y) pairs are given. Then the best is to make a 1D numpy array from your list of values and reshape it:
x = np.unique(data['x'])
y = np.unique(data['y'])
r = data['value'].reshape((x.size,y.size))
2) General cases:
see Populate arrays in python (numpy)? for a similar question and an other solution using dictionaries
If you cannot guarantee anything other than having (x,y,value) tuples:
# indexing: list of x and y coordinates, and functions that map them to index
x = np.unique(data['x']).tolist()
y = np.unique(data['y']).tolist()
ix = np.vectorize(lambda i: x.index(i), otypes='i')
iy = np.vectorize(lambda j: y.index(j), otypes='i')
# create output array
r = np.zeros((len(x), len(y)), float) # default value is 0
r[ix(data['x']), iy(data['y'])] = data['value']
Note: In the reference given above, another approach using dictionaries is given. I think it is more readable, but I did not test their relative speed.
3) Intermediate cases?
You might have an intermediate case, between a regular grid coordinates given in a specific order and no constraint at all. The general case being potentially very slow, you should design your algorithm to take advantage of any rule your data follow.
One example is if you know that the x-y indexing follows a specific rule, but the rows are not necessarily given in order. For instance, if you know that the x and y are equally spaced "grid" coordinates of the form:
coordinate = min_coordinate + i*step
Then find min_coordinate and step (for both x and y), and find i by solving this equation. This way, you avoid the costly index mapping np.vectorize(... list.index(...)):
x = np.unique(data['x'])
y = np.unique(data['y'])
ix = (data['x']-x.min())/(x[1]-x[0])
iy = (data['y']-y.min())/(y[1]-y[0])
# create output array
r = np.ones((x.size,y.size), float)*np.nan # default value is NaN
r[ix.astype(int), iy.astype(int)] = data['value']
For the program you're using, you just need the data to be a rectangular array of z values (in the example they give, they just use x and y to construct z, but then never use them again). It looks like you have an array that's 9 by N (where N is something you don't show). One easy way to get this is to just read the data in as a flat collection of z values, skipping the x,y values, and reshape it to the shape you'd like. (I can't really write the code for this because you haven't given enough info, but it shouldn't be difficult.)
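For what it's worth, a sketch of that idea with np.loadtxt, under the assumption that the file is comma-separated and every x block contains the same 9 y values in the same order (adjust ncols to match your grid):
import numpy as np
from skimage import measure

ncols = 9   # number of distinct y values per x (guessed from the sample shown)

# read only the z column, then reshape into a (num_x, num_y) image
z = np.loadtxt('file.txt', delimiter=',', usecols=[2])
r = z.reshape(-1, ncols)

contours = measure.find_contours(r, 56.0)   # contour level chosen arbitrarily for illustration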
