I'll explain with a simple example and then go into detail.
Say I have a list of numbers:
t_original = [180,174,168,166,162,94,70,80,128,131,160,180]
If we graph this, it goes down from 180 to 70 and then climbs back up to 180.
But if we suddenly replace the fourth value (166) with 700, the list becomes
t = [180,174,168,700,162,94,70,80,128,131,160,180]
which does not make sense in the graph.
I want to treat the fourth value (700) as a wrong value.
I want to replace it with a plausible value: not necessarily the original one, but one consistent with the previous two elements (174, 168).
I want to do the same across the whole list whenever another wrong value appears.
We can call this [filling gaps in a list of numbers].
I'm trying to apply the same idea to a bigger example.
Below is the method I have tried.
I'll share my code with its output; "filtered" means the gap-filling function has been applied.
My code:
def preprocFN(*U):
    prePlst = []  # after preprocessing list
    # preprocessing Fc = | 2*LF1 prev by 1 - LF2 prev by 2 |
    c0 = -2  # (previous) by 2
    c1 = -1  # (previous)
    c2 = 0   # (current)
    c3 = 1   # (next)
    preP = U[0]  # original list
    if c2 == 0:
        prePlst.append(preP[0])
        prePlst.append(preP[1])
        c1 += 2
        c2 += 2
        c0 += 2
    oldlen = len(preP)
    while oldlen > c2:
        Equ = abs(2*preP[c1] - preP[c0])  # fn of preprocessing  # removed abs()
        formatted_float = "{:.2f}".format(Equ)  # with .2 number only
        equu = float(formatted_float)  # from string float to float
        prePlst.insert(c2, equu)  # insert the preprocessed value to the list
        c1 += 1
        c2 += 1
        c0 += 1
    return prePlst
With my input: https://textuploader.com/t1py9
the output is: https://textuploader.com/t1pyk
I then printed the values higher than 180 (wrong values, since no human joint can exceed an angle of 180 degrees):
result_list = [item for item in list if item > 180]
the output was [183.6, 213.85, 221.62, 192.05, 203.39, 197.22, 188.45, 182.48, 180.41, 200.09, 200.67, 198.14, 199.44, 198.45, 200.55, 193.25, 204.19, 204.35, 200.59, 211.4, 180.51, 183.4, 217.91, 218.94, 213.79, 205.62, 221.35, 182.39, 180.62, 183.06, 180.78, 231.09, 227.33, 224.49, 237.02, 212.53, 207.0, 212.92, 182.28, 254.02, 232.49, 224.78, 193.92, 216.0, 184.82, 214.68, 182.04, 181.07, 234.68, 233.63, 182.84, 193.94, 226.8, 223.69, 222.77, 180.67, 184.72, 180.39, 183.99, 186.44, 233.35, 228.02, 195.31, 183.97, 185.26, 182.13, 207.09, 213.21, 238.41, 229.38, 181.57, 211.19, 180.05, 181.47, 199.69, 213.59, 191.99, 194.65, 190.75, 199.93, 221.43, 181.51, 181.42, 180.22]
So the gap-filling function from the proposed method doesn't do its job.
Any suggestions for applying the same concept in a different way?
Extra info that may help:
The filtered graph is produced by applying the gap-filling function and then a normalizing function.
I don't think the problem comes from the normalizing function, since the output of the gap-filling function already looks incorrect to me. I may be wrong, so here are the normalization steps to show how the final filtered graph was made.
fn :
My Code :
def outLiersFN(*U):
    outliers = []  # after preprocessing list
    # preprocessing Fc = | 2*LF1 prev by 1 - LF2 prev by 2 |
    c0 = -2  # (previous) by 2  # from original
    c1 = -1  # (previous)  # from original
    c2 = 0   # (current)  # from original
    c3 = 1   # (next)  # from original
    preP = U[0]  # original list
    if c2 == 0:
        outliers.append(preP[0])
        c1 += 1
        c2 += 1
        c0 += 1
        c3 += 1
    oldlen = len(preP)
    M_RangeOfMotion = 90
    while oldlen > c2:
        if c3 == oldlen:
            outliers.insert(c2, preP[c2])  # preP[c2] >> last element in old list
            break
        if (preP[c2] > M_RangeOfMotion and preP[c2] < (preP[c1] + preP[c3])/2) or (preP[c2] < M_RangeOfMotion and preP[c2] > (preP[c1] + preP[c3])/2):  # Check Paper 3.3.1
            Equ = (preP[c1] + preP[c3])/2  # fn of preprocessing  # From third index  # ==== inserting current frame
            formatted_float = "{:.2f}".format(Equ)  # with .2 number only
            equu = float(formatted_float)  # from string float to float
            outliers.insert(c2, equu)  # insert the preprocessed value to the list
            c1 += 1
            c2 += 1
            c0 += 1
            c3 += 1
        else:
            Equ = preP[c2]  # fn of preprocessing  # put same element (do nothing)
            formatted_float = "{:.2f}".format(Equ)  # with .2 number only
            equu = float(formatted_float)  # from string float to float
            outliers.insert(c2, equu)  # insert the preprocessed value to the list
            c1 += 1
            c2 += 1
            c0 += 1
            c3 += 1
    return outliers
I suggest the following algorithm:
data point t[i] is considered an outlier if it deviates from the average of t[i-2], t[i-1], t[i], t[i+1], t[i+2] by more than the standard deviation of these 5 elements.
outliers are replaced by the average of the two elements around them.
import matplotlib.pyplot as plt
from statistics import mean, stdev
t = [180,174,168,700,162,94,70,80,128,131,160,180]
def smooth(t):
    new_t = []
    for i, x in enumerate(t):
        neighbourhood = t[max(i-2,0): i+3]
        m = mean(neighbourhood)
        s = stdev(neighbourhood, xbar=m)
        if abs(x - m) > s:
            # replace the outlier by the average of its two neighbours (mirrored at the ends)
            x = ( t[i - 1 + (i==0)*2] + t[i + 1 - (i+1==len(t))*2] ) / 2
        new_t.append(x)
    return new_t
new_t = smooth(t)
plt.plot(t)
plt.plot(new_t)
plt.show()
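As a possible variation on the same idea, here is a minimal sketch that swaps the mean and standard deviation for the median and the median absolute deviation (often called a Hampel-style filter), which are less affected by the outlier itself. The function name `hampel_like`, the window size and the `n_sigmas` threshold are assumptions chosen for illustration, not part of the answer above.

```python
from statistics import median

def hampel_like(values, half_window=2, n_sigmas=3.0):
    """Replace points that sit far from the local median (hypothetical helper)."""
    cleaned = list(values)
    for i, x in enumerate(values):
        # local window of up to 2*half_window + 1 points, truncated at the ends
        window = values[max(i - half_window, 0): i + half_window + 1]
        med = median(window)
        # median absolute deviation, scaled to be comparable to a standard deviation
        mad = median([abs(v - med) for v in window])
        if abs(x - med) > n_sigmas * 1.4826 * mad:
            cleaned[i] = med  # replace the outlier by the local median
    return cleaned

t = [180, 174, 168, 700, 162, 94, 70, 80, 128, 131, 160, 180]
new_t = hampel_like(t)
print(new_t)
```

On this short list only the 700 is replaced (by the local median, 168). On real joint-angle data the window and threshold would likely need tuning.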
Related
I have the following data:
0.8340502011561366
0.8423491600218922
0.8513456021654467
0.8458192388553084
0.8440111276014195
0.8489589671423143
0.8738088120491972
0.8845129900705279
0.8988298998926688
0.924633964692693
0.9544790734065157
0.9908034431246875
1.0236430466543138
1.061619773027915
1.1050038249835414
1.1371449802490126
1.1921182610371368
1.2752207659022576
1.344047620255176
1.4198117350668353
1.507943067143741
1.622137968203745
1.6814098429502085
1.7646810054280595
1.8485457435775694
1.919591124757554
1.9843144220593145
2.030158014640226
2.018184122476175
2.0323466012624207
2.0179200409023874
2.0316932950853723
2.013683870089898
2.03010703506514
2.0216151623726977
2.038855467786505
2.0453923522466093
2.03759031642753
2.019424996752278
2.0441806106428606
2.0607521369415136
2.059310067318373
2.0661157975162485
2.053216429539864
2.0715123971225564
2.0580473413362075
2.055814512721712
2.0808278560688964
2.0601637029377113
2.0539429365156003
2.0609648613513754
2.0585135712612646
2.087674625814453
2.062482961966647
2.066476100210777
2.0568444178944967
2.0587903943282266
2.0506399365756396
The data plotted looks like:
I want to find the point where the slope changes in sign (I circled it in black. Should be around index 26):
I need to find this point of change for several hundred files. So far I tried the recommendation from this post:
Finding the point of a slope change as a free parameter- Python
I think that because my data is a bit noisy, I am not getting a smooth transition where the slope changes.
This is the code I have tried so far:
import sys
import numpy as np
# load 1-D data file
file = str(sys.argv[1])
y = np.loadtxt(file)
# create X based on file length
x = np.linspace(1, len(y), num=len(y))
# Find first derivative
m = np.diff(y)/np.diff(x)
print(m)
# Find second derivative
b = np.diff(m)
print(b)
# find index
index = 0
for difference in b:
    index += 1
    if difference < 0:
        print(index, difference)
Since my data is noisy, I am getting some negative values before the index I want. The index I want to retrieve in this case is around 26 (which is where my data becomes constant). Does anyone have any suggestions on what I can do to solve this issue? Thank you!
A gradient approach is not much use in this case because you don't care about velocities or vector fields. Knowing the gradient adds no extra information for locating the maximum value: the run (the x-step) is always positive, so it does not affect the sign of the gradient. I suggest a method based entirely on the rise (the difference between consecutive values).
Detect the indices at which the data decreases, take the differences between those indices, and find the location of the largest gap. Then, by index manipulation, you can find the index at which the data reaches its maximum.
data = '0.8340502011561366 0.8423491600218922 0.8513456021654467 0.8458192388553084 0.8440111276014195 0.8489589671423143 0.8738088120491972 0.8845129900705279 0.8988298998926688 0.924633964692693 0.9544790734065157 0.9908034431246875 1.0236430466543138 1.061619773027915 1.1050038249835414 1.1371449802490126 1.1921182610371368 1.2752207659022576 1.344047620255176 1.4198117350668353 1.507943067143741 1.622137968203745 1.6814098429502085 1.7646810054280595 1.8485457435775694 1.919591124757554 1.9843144220593145 2.030158014640226 2.018184122476175 2.0323466012624207 2.0179200409023874 2.0316932950853723 2.013683870089898 2.03010703506514 2.0216151623726977 2.038855467786505 2.0453923522466093 2.03759031642753 2.019424996752278 2.0441806106428606 2.0607521369415136 2.059310067318373 2.0661157975162485 2.053216429539864 2.0715123971225564 2.0580473413362075 2.055814512721712 2.0808278560688964 2.0601637029377113 2.0539429365156003 2.0609648613513754 2.0585135712612646 2.087674625814453 2.062482961966647 2.066476100210777 2.0568444178944967 2.0587903943282266 2.0506399365756396'
data = data.split()
import numpy as np
a = np.array(data, dtype=float)
diff = np.diff(a)
neg_indeces = np.where(diff<0)[0]
neg_diff = np.diff(neg_indeces)
i_max_dif = np.where(neg_diff == neg_diff.max())[0][0] + 1
i_max = neg_indeces[i_max_dif] - 1 # because the rise is a difference of two consecutive values
print(i_max, a[i_max])
Output
26 1.9843144220593145
Some details
print(neg_indeces) # all indices of the negative differences in the data
# [ 2 3 27 29 31 33 36 37 40 42 44 45 47 48 50 52 54 56]
print(neg_diff) # differences between those indices
# [ 1 24 2 2 2 3 1 3 2 2 1 2 1 2 2 2 2]
print(neg_diff.max()) # largest gap between them
# 24
print(i_max_dif) # position in neg_indeces just after the largest gap -> neg_indeces[2] is 27
# 2
print(i_max) # index of the max of the original data
# 26
When the first derivative changes sign, that's when the slope sign changes. I don't think you need the second derivative, unless you want to determine the rate of change of the slope. You also aren't getting the second derivative. You're just getting the difference of the first derivative.
Also, you seem to be assigning arbitrary x values. If your y-values represent points that are equally spaced, then that's OK; otherwise the derivative will be wrong.
Here's an example of how to get first and second der...
import numpy as np
x = np.linspace(1, 100, 1000)
y = np.cos(x)
# Find first derivative:
m = np.diff(y)/np.diff(x)
#Find second derivative
m2 = np.diff(m)/np.diff(x[:-1])
print(m)
print(m2)
# Get x-values where slope sign changes
c = len(m)
changes_index = []
for i in range(1, c):
    prev_val = m[i-1]
    val = m[i]
    if prev_val < 0 and val > 0:
        changes_index.append(i)
    elif prev_val > 0 and val < 0:
        changes_index.append(i)
for i in changes_index:
    print(x[i])
Notice that I had to truncate the x values for the second derivative. That's because np.diff() returns one less point than the original input.
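For noisy data like the series in the question, one option is to smooth before differentiating. The following is only a sketch under assumptions: the function name `plateau_start`, the moving-average window of 5 and the 10% relative tolerance are all illustrative choices, and it looks for the point where the smoothed slope becomes small rather than for a strict sign change.

```python
import numpy as np

def plateau_start(y, window=5, rel_tol=0.1):
    """Return the first index after the steepest rise where the smoothed slope
    drops below rel_tol * (steepest smoothed slope), or None if it never does."""
    y = np.asarray(y, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(y, kernel, mode="valid")  # moving average
    slope = np.diff(smoothed)                        # slope of the smoothed curve
    steepest = int(np.argmax(slope))
    flat = np.where(slope[steepest:] < rel_tol * slope[steepest])[0]
    if flat.size == 0:
        return None
    return steepest + int(flat[0])

# Synthetic check: a linear ramp up to index 26 followed by a plateau.
ramp = 0.05 * np.arange(27)
y_demo = np.concatenate([ramp, np.full(30, ramp[-1])])
knee = plateau_start(y_demo)
print(knee)
```

The returned index refers to the start of the first averaging window that lies on the flat part, so on real data it can be off by a few samples depending on `window` and `rel_tol`.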
I have the following program. The amp and period at the end seem to print out as a list of lists (see below), and I am unable to plot them (I want to plot period against amp).
I have tried the methods in "How to make a flat list out of list of lists?" to combine the output of amp and period so that they are plottable, but nothing worked.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp
def derivatives(t,y,q,F):
    return [y[1], -np.sin(y[0])-q*y[1]+F*np.sin((2/3)*t)]
t = np.linspace(0.0, 100, 10000)
#initial conditions
theta0 = np.linspace(0.0,np.pi,100)
q = 0.0 #alpha / (mass*g), resistive term
F = 0.0 #G*np.sin(2*t/3)
for i in range (0,100):
    sol = solve_ivp(derivatives, (0.0,100.0), (theta0[i], 0.0), method = 'RK45', t_eval = t,args = (q,F))
    velocity = sol.y[1]
    time = sol.t
    zero_cross = 0
    value = []
    amp = []
    period = []
    for k in range (len(velocity)-1):
        if (velocity[k+1]*velocity[k]) < 0:
            zero_cross += 1
            value.append(k)
        else:
            zero_cross += 0
    zero_cross = zero_cross - zero_cross % 2 # makes the total number of zero-crossings even
    if zero_cross != 0:
        amp.append(theta0[i])
        # period calculated using the time evolved between the first and last zero-crossing detected
        period.append((2*(time[value[zero_cross - 1]] - time[value[0]])) / (zero_cross -1))
If I print out amp inside the loop, it displays as follows:
[0.03173325912716963]
[0.06346651825433926]
[0.0951997773815089]
[0.12693303650867852]
[0.15866629563584814]
[0.1903995547630178]
[0.2221328138901874]
[0.25386607301735703]
[0.28559933214452665]
[0.3173325912716963]
[0.3490658503988659]
[0.3807991095260356]
[0.4125323686532052]
[0.4442656277803748]
[0.47599888690754444]
[0.5077321460347141]
[0.5394654051618837]
[0.5711986642890533]
[0.6029319234162229]
[0.6346651825433925]
[0.6663984416705622]
[0.6981317007977318]
[0.7298649599249014]
[0.7615982190520711]
[0.7933314781792408]
[0.8250647373064104]
[0.85679799643358]
[0.8885312555607496]
[0.9202645146879193]
[0.9519977738150889]
[0.9837310329422585]
[1.0154642920694281]
[1.0471975511965979]
[1.0789308103237674]
[1.110664069450937]
[1.1423973285781066]
[1.1741305877052763]
[1.2058638468324459]
[1.2375971059596156]
[1.269330365086785]
[1.3010636242139548]
[1.3327968833411243]
[1.364530142468294]
[1.3962634015954636]
[1.4279966607226333]
[1.4597299198498028]
[1.4914631789769726]
[1.5231964381041423]
[1.5549296972313118]
[1.5866629563584815]
[1.618396215485651]
[1.6501294746128208]
[1.6818627337399903]
[1.71359599286716]
[1.7453292519943295]
[1.7770625111214993]
[1.8087957702486688]
[1.8405290293758385]
[1.872262288503008]
[1.9039955476301778]
[1.9357288067573473]
[1.967462065884517]
[1.9991953250116865]
[2.0309285841388562]
[2.0626618432660258]
[2.0943951023931957]
[2.126128361520365]
[2.1578616206475347]
[2.1895948797747042]
[2.221328138901874]
[2.2530613980290437]
[2.284794657156213]
[2.3165279162833827]
[2.3482611754105527]
[2.379994434537722]
[2.4117276936648917]
[2.443460952792061]
[2.475194211919231]
[2.5069274710464007]
[2.53866073017357]
[2.57039398930074]
[2.6021272484279097]
[2.633860507555079]
[2.6655937666822487]
[2.6973270258094186]
[2.729060284936588]
[2.7607935440637577]
[2.792526803190927]
[2.824260062318097]
[2.8559933214452666]
[2.887726580572436]
[2.9194598396996057]
[2.9511930988267756]
[2.982926357953945]
[3.0146596170811146]
[3.141592653589793]
[Finished in 3.822s]
I am not sure what type of output that is or how to handle it. Any help would be appreciated!
You are declaring the lists inside the loop, which means they will be reset to empty at every iteration. Consider declaring amp, period, and any array that should be set to empty only once (as initial state) before the loop, like so:
#initialize arrays, executes only once before the loop
amp = []
period = []
for i in range (0,100):
    #your logic here, plus appending values to `amp` and `period`
#now `amp` and `period` should contain all desired values
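To make the pattern concrete, here is a small self-contained sketch. It does not call solve_ivp; the velocity signals are stand-ins (plain sine waves with a known period of 2*pi) and the helper name `period_from_zero_crossings` is hypothetical. The point is only that `amp` and `period` are created once, before the loop, so they accumulate one entry per iteration.

```python
import numpy as np

def period_from_zero_crossings(time, velocity):
    """Estimate the period from velocity sign changes, which are half a period apart."""
    crossings = np.where(velocity[:-1] * velocity[1:] < 0)[0]
    if len(crossings) < 2:
        return None
    return 2 * (time[crossings[-1]] - time[crossings[0]]) / (len(crossings) - 1)

time = np.linspace(0.0, 100.0, 10000)

amp = []      # created once, before the loop
period = []
for a in (0.1, 0.2, 0.3):              # stand-in amplitudes
    velocity = -a * np.sin(time)       # stand-in for sol.y[1]
    p = period_from_zero_crossings(time, velocity)
    if p is not None:
        amp.append(a)
        period.append(p)

print(amp)
print(period)
```

After the loop both lists are flat and have the same length, so something like plt.plot(amp, period) should work directly.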
I am currently using Python and NumPy to calculate correlations between two lists, data_0 and data_1. Each list contains sorted times, t0 and t1 respectively.
I want to find all the events where 0 < t1 - t0 < t_max.
for time_0 in np.nditer(data_0):
    delta_time = np.subtract(data_1, np.full(data_1.size, time_0))
    delta_time = delta_time[delta_time >= 0]
    delta_time = delta_time[delta_time < time_max]
Because the lists are sorted, this amounts to selecting a subarray of data_1 of the form data_1[index_min: index_max].
So in fact I just need to find two indices to get what I want.
What's interesting is that when I move on to the next time_0, since data_0 is also sorted, I only need to find the new index_min / index_max such that new_index_min >= index_min and new_index_max >= index_max.
This means I don't need to scan all of data_1 again from scratch.
I have implemented such a solution without NumPy methods (just a while loop). It gives the same results as before, but it is much slower (it takes 15 times longer!).
Since it should require fewer calculations, I think there must be a way to make it faster using NumPy methods, but I don't know how to do it.
Does anyone have an idea?
I am not sure if I am being clear, so if you have any questions, do not hesitate to ask.
Thank you in advance,
Paul
Here is a vectorized approach using argsort. It uses a strategy similar to your avoid-full-scan idea:
import numpy as np
def find_gt(ref, data, incl=True):
    out = np.empty(len(ref) + len(data) + 1, int)
    total = (data, ref) if incl else (ref, data)
    out[1:] = np.argsort(np.concatenate(total), kind='mergesort')
    out[0] = -1
    split = (out < len(data)) if incl else (out >= len(ref))
    if incl:
        out[~split] -= len(data)
    split[0] = False
    return np.maximum.accumulate(np.where(split, -1, out))[split] + 1
def find_intervals(ref, data, span, incl=(True, True)):
    index_min = find_gt(ref, data, incl[0])
    index_max = len(ref) - find_gt(-ref[::-1], -span-data[::-1], incl[1])[::-1]
    return index_min, index_max
ref = np.sort(np.random.randint(0,20000,(10000,)))
data = np.sort(np.random.randint(0,20000,(10000,)))
span = 2
idmn, idmx = find_intervals(ref, data, span, (True, True))
print('checking')
for d,mn,mx in zip(data, idmn, idmx):
    assert mn == len(ref) or ref[mn] >= d
    assert mn == 0 or ref[mn-1] < d
    assert mx == len(ref) or ref[mx] > d+span
    assert mx == 0 or ref[mx-1] <= d+span
print('ok')
It works by
indirectly sorting both sets together,
finding, for each time in one set, the preceding time in the other
(this is done using np.maximum.accumulate),
applying the preceding steps twice, the second time with the times in one set shifted by span.
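A simpler alternative, sketched here under the assumption that both arrays are already sorted, is to let np.searchsorted find both window edges for every t0 at once. The function name `coincidence_windows` and the small example arrays are illustrative; the variable names data_0, data_1 and t_max follow the question.

```python
import numpy as np

def coincidence_windows(data_0, data_1, t_max):
    """For each t0, return (index_min, index_max) such that
    data_1[index_min:index_max] holds every t1 with 0 < t1 - t0 < t_max."""
    data_0 = np.asarray(data_0, dtype=float)
    data_1 = np.asarray(data_1, dtype=float)
    # first index with t1 > t0 (strict); use side="left" to allow t1 == t0
    index_min = np.searchsorted(data_1, data_0, side="right")
    # first index with t1 >= t0 + t_max, i.e. one past the last valid t1
    index_max = np.searchsorted(data_1, data_0 + t_max, side="left")
    return index_min, index_max

data_0 = np.array([1.0, 5.0])
data_1 = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 10.0])
index_min, index_max = coincidence_windows(data_0, data_1, t_max=3.0)
counts = index_max - index_min   # number of matching t1 per t0
print(index_min, index_max, counts)
```

Each lookup is a binary search, so this does not exploit the monotonic index_min / index_max property directly, but it avoids building a full delta_time array for every t0.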
The following is my script. Each equal part has self.number samples, and in0 is the input sample. I get the following error:
pn[i] = pn[i] + d
IndexError: list index out of range
Is this a problem with the size of pn? How can I define a list of a given size without putting specific values in it?
for i in range(0,len(in0)/self.number):
    pn = []
    m = i*self.number
    for d in in0[m: m + self.number]:
        pn[i] += d
        if pn[i] >= self.alpha:
            out[i] = 1
        elif pn[i] <= self.beta:
            out[i] = 0
        else:
            if pn[i] >= self.noise:
                out[i] = 1
            else:
                out[i] = 0
    if pn[i] >= self.noise:
        out[i] = 1
    else:
        out[i] = 0
There are a number of problems in the code as posted, however, the gist seems to be something that you'd want to do with numpy arrays instead of iterating over lists.
For example, the set of if/else cases that check if pn[i] >= some_value and then sets a corresponding entry into another list with the result (true/false) could be done as a one-liner with an array operation much faster than iterating over lists.
import numpy as np
# for example, assuming you have 9 numbers in your list
# and you want them divided into 3 sublists of 3 values each
# in0 is your original list, which for example might be:
in0 = [1.05, -0.45, -0.63, 0.07, -0.71, 0.72, -0.12, -1.56, -1.92]
# convert into array
in2 = np.array(in0)
# reshape to 3 rows, the -1 means that numpy will figure out
# what the second dimension must be.
in2 = in2.reshape((3,-1))
print(in2)
output:
[[ 1.05 -0.45 -0.63]
[ 0.07 -0.71 0.72]
[-0.12 -1.56 -1.92]]
With this 2-d array structure, element-wise summing is super easy. So is element-wise threshold checking. Plus 'vectorizing' these operations has big speed advantages if you are working with large data.
# add corresponding entries, we want to add the columns together,
# as each row should correspond to your sub-lists.
pn = in2.sum(axis=0) # you can sum row-wise or column-wise, or all elements
print(pn)
output: [ 1. -2.72 -1.83]
# it is also trivial to check the threshold conditions
# here I check each entry in pn against a scalar
alpha = 0.0
out1 = ( pn >= alpha )
print(out1)
output: [ True False False]
# you can easily convert booleans to 1/0
x = out1.astype('int') # or simply out1 * 1
print(x)
output: [1 0 0]
# if you have a list of element-wise thresholds
beta = np.array([0.0, 0.5, -2.0])
out2 = (pn >= beta)
print(out2)
output: [True False True]
I hope this helps. Using the correct data structures for your task can make the analysis much easier and faster. There is a wealth of documentation on numpy, which is the standard numeric library for python.
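If each block of consecutive samples should be summed on its own (one sum per row of the reshaped array), the sum would run along axis=1 rather than axis=0. The sketch below shows that variant together with the three thresholds from the question. The function name `block_decisions`, the threshold values, and the reading that alpha is checked first, then beta, with noise as the fallback are all assumptions.

```python
import numpy as np

def block_decisions(in0, number, alpha, beta, noise):
    """Sum each block of `number` consecutive samples and threshold the sums."""
    samples = np.asarray(in0, dtype=float)
    n_blocks = len(samples) // number            # drop an incomplete last block
    blocks = samples[: n_blocks * number].reshape(n_blocks, number)
    pn = blocks.sum(axis=1)                      # one sum per block (row)
    # alpha first, then beta, otherwise fall back to the noise threshold
    out = np.where(pn >= alpha, 1,
                   np.where(pn <= beta, 0, (pn >= noise).astype(int)))
    return pn, out

in0 = [1.05, -0.45, -0.63, 0.07, -0.71, 0.72, -0.12, -1.56, -1.92]
pn, out = block_decisions(in0, number=3, alpha=0.05, beta=-1.0, noise=0.0)
print(pn)
print(out)
```

Here pn holds one value per block, so no list needs to be pre-sized and no index can run out of range.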
You initialize pn to an empty list just inside the for loop, never assign anything into it, and then attempt to access an index i. There is nothing at index i because there is nothing at any index in pn yet.
for i in range(0, len(in0) / self.number):
    pn = []
    m = i*self.number
    for d in in0[m: m + self.number]:
        pn[i] += d
If you are trying to add the value d to the pn list, you should do this instead:
pn.append(d)
I have the following dataset in numpy
indices | real data (X) |targets (y)
| |
0 0 | 43.25 665.32 ... |2.4 } 1st block
0 0 | 11.234 |-4.5 }
0 1 ... ... } 2nd block
0 1 }
0 2 } 3rd block
0 2 }
1 0 } 4th block
1 0 }
1 0 }
1 1 ...
1 1
1 2
1 2
2 0
2 0
2 1
2 1
2 1
...
These are my variables
idx1 = data[:,0]
idx2 = data[:,1]
X = data[:,2:-1]
y = data[:,-1]
I also have a variable W which is a 3D array.
What I need to do in the code is loop through all the blocks in the dataset and return a scalar number for each block after some computation, then sum up all the scalars, and store it in a variable called cost. Problem is that the looping implementation is very slow, so I'm trying to do it vectorized if possible. This is my current code. Is it possible to do this without for loops in numpy?
IDX1 = 0
IDX2 = 1
# get unique indices
idx1s = np.arange(len(np.unique(data[:,IDX1])))
idx2s = np.arange(len(np.unique(data[:,IDX2])))
# initialize global sum variable to 0
cost = 0
for i1 in idx1s:
    for i2 in idx2s:
        # for each block in the dataset
        mask = np.nonzero((data[:,IDX1] == i1) & (data[:,IDX2] == i2))
        # get variables for that block
        curr_X = X[mask,:]
        curr_y = y[mask]
        curr_W = W[:,i2,i1]
        # calculate a scalar
        pred = np.dot(curr_X,curr_W)
        sigm = 1.0 / (1.0 + np.exp(-pred))
        loss = np.sum((sigm- (0.5)) * curr_y)
        # add result to global cost
        cost += loss
Here is some sample data
data = np.array([[0,0,5,5,7],
                 [0,0,5,5,7],
                 [0,1,5,5,7],
                 [0,1,5,5,7],
                 [1,0,5,5,7],
                 [1,1,5,5,7]])
W = np.zeros((2,2,2))
idx1 = data[:,0]
idx2 = data[:,1]
X = data[:,2:-1]
y = data[:,-1]
That W was tricky... Actually, your blocks are pretty irrelevant, apart from getting the right slice of W to do the np.dot with the corresponding X, so I went the easy route of creating an aligned_W array as follows:
aligned_W = W[:, idx2, idx1]
This is an array of shape (2, rows) where rows is the number of rows of your data set. You can now proceed to do your whole calculation without any for loops as:
from numpy.core.umath_tests import inner1d
pred = inner1d(X, aligned_W.T)
sigm = 1.0 / (1.0 + np.exp(-pred))
loss = (sigm - 0.5) * y
cost = np.sum(loss)
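In case inner1d is not available in your NumPy version, the row-wise dot product can also be written with np.einsum. The following sketch uses the sample data from the question and compares the vectorized cost with a plain per-block loop. The random W (instead of the all-zero W in the sample) is an assumption made so that the comparison is not trivial.

```python
import numpy as np

data = np.array([[0, 0, 5, 5, 7],
                 [0, 0, 5, 5, 7],
                 [0, 1, 5, 5, 7],
                 [0, 1, 5, 5, 7],
                 [1, 0, 5, 5, 7],
                 [1, 1, 5, 5, 7]])
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 2, 2))   # assumed non-zero weights for the check

idx1, idx2 = data[:, 0], data[:, 1]
X, y = data[:, 2:-1], data[:, -1]

# vectorized: pick the matching weight column for every row, then row-wise dot
aligned_W = W[:, idx2, idx1]                       # shape (features, rows)
pred = np.einsum('rk,kr->r', X, aligned_W)
sigm = 1.0 / (1.0 + np.exp(-pred))
cost = np.sum((sigm - 0.5) * y)

# reference: explicit loop over blocks
cost_loop = 0.0
for i1 in np.unique(idx1):
    for i2 in np.unique(idx2):
        mask = (idx1 == i1) & (idx2 == i2)
        pred_b = X[mask] @ W[:, i2, i1]
        cost_loop += np.sum((1.0 / (1.0 + np.exp(-pred_b)) - 0.5) * y[mask])

print(cost, cost_loop)
```

Both values should agree up to floating-point rounding.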
My guess is that the main reason your code is slow is the following line:
mask = np.nonzero((data[:,IDX1] == i1) & (data[:,IDX2] == i2))
because it repeatedly scans the whole input arrays for a small number of rows of interest. Instead, you can build a sorted key once, before the loop:
ni1 = len(np.unique(data[:,IDX1]))
ni2 = len(np.unique(data[:,IDX2]))
idx1s = np.arange(ni1)
idx2s = np.arange(ni2)
key = data[:,IDX1] * ni2 + data[:,IDX2] # 1D key to the rows
sortids = np.argsort(key) #indices to the sorted key
Then inside the loop instead of
mask=np.nonzero(...)
you need to do
curid = i1 * ni2 + i2
left = np.searchsorted(key, curid, 'left', sorter=sortids)
right=np.searchsorted(key, curid, 'right', sorter=sortids)
mask = sortids[left:right]
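Put together on the sample data from the question, the lookup might look like the sketch below. The helper name `block_rows` is hypothetical, and the stable sort is an added assumption so that rows within a block keep their original order.

```python
import numpy as np

data = np.array([[0, 0, 5, 5, 7],
                 [0, 0, 5, 5, 7],
                 [0, 1, 5, 5, 7],
                 [0, 1, 5, 5, 7],
                 [1, 0, 5, 5, 7],
                 [1, 1, 5, 5, 7]])
IDX1, IDX2 = 0, 1

ni2 = len(np.unique(data[:, IDX2]))
key = data[:, IDX1] * ni2 + data[:, IDX2]   # one integer key per row
sortids = np.argsort(key, kind="stable")    # computed once, outside the loop

def block_rows(i1, i2):
    """Row indices of block (i1, i2), found by binary search on the sorted key."""
    curid = i1 * ni2 + i2
    left = np.searchsorted(key, curid, side="left", sorter=sortids)
    right = np.searchsorted(key, curid, side="right", sorter=sortids)
    return sortids[left:right]

rows_01 = block_rows(0, 1)
print(rows_01)
```

Each block lookup then costs two binary searches instead of a full pass over the data.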
I don't think there is a way to compare NumPy arrays of different sizes without using for loops. It would be hard to decide on the meaning and shape of the output of something like
[0,1,2,3,4] == [3,4,2]
The only suggestion I can give you is to get rid of one of the for loops using itertools.product:
import itertools as it
[...]
idx1s = np.unique(data[:,IDX1])
idx2s = np.unique(data[:,IDX2])
# initialize global sum variable to 0
cost = 0
for i1, i2 in it.product(idx1s, idx2s):
    # for each block in the dataset
    mask = np.nonzero((data[:,IDX1] == i1) & (data[:,IDX2] == i2))
    # get variables for that block
    curr_X = X[mask,:]
    curr_y = y[mask]
[...]
You can also keep mask as a boolean array:
mask = (data[:,IDX1] == i1) & (data[:,IDX2] == i2)
The result is the same, and the boolean array has to be created in memory anyway. Skipping np.nonzero saves you some memory and a function call.
EDIT
If you know that the indices have no holes, or only a few, it might be worth removing the part where you define idx1s and idx2s and changing the for loop to
max1, max2 = data[:,[IDX1, IDX2]].max(axis=0)
for i1, i2 in it.product(xrange(max1 + 1), xrange(max2 + 1)):
    [...]
Both xrange and it.product are lazy, so i1 and i2 are only created when you need them.
PS: if you are on Python 3.x, use range instead of xrange.