I have an array like this in a data file:
0 822.6 1391.3 1
0 822.6 1391.3 2
0 708.3 1501.2 3
0 708.3 1501.2 4
0 632.5 1585.8 5
0 632.5 1585.8 6
0 552.4 1652.6 7
0 552.4 1652.6 8
250 850.8 1358.6 1
250 803.3 1406.2 2
250 732.0 1481.9 3
250 694.9 1519 4
250 642.9 1566.5 5
250 613.2 1594.7 6
250 570.2 1637.8 7
250 537.5 1663 8
I want to create separate data sets depending on the last column.
In other words I want something like this:
while data[:,3] != 9:
    if data[:,3] == 1:
        x1 = data[:,0]
        y1 = (data[:,1]-data[:,2])**2
    if data[:,3] == 2:
        x2 = data[:,0]
        y2 = (data[:,1]-data[:,2])**2
And so on...
I only wrote "does not equal 9" because the last column always contains values from 1 to 8.
I know this is completely wrong, but I need help.
Consider the following:
a = array([[ 0.64910219, 0.06868991, -0.34844128, 0. ],
[-1.34767042, -1.77338287, 0.693539 , 1. ],
[ 1.31245883, -2.08879047, -0.83514187, 3. ],
[ 0.43156959, 0.31388795, 0.2856625 , 1. ],
[-0.60531108, -0.63226693, 0.32063803, 2. ],
[-0.47538621, -0.64196643, -0.82296546, 3. ],
[ 0.3491207 , -1.25406403, 1.21754411, 0. ],
[-1.1573242 , 1.1636706 , 0.63733285, 2. ]])
d0 = a[a[:,3]==0,:]
d1 = a[a[:,3]==1,:]
d2 = a[a[:,3]==2,:]
d3 = a[a[:,3]==3,:]
The variables d0, d1, d2, d3 contain the rows with the appropriate values in the right-most column.
>>> d0
array([[ 0.64910219, 0.06868991, -0.34844128, 0. ],
[ 0.3491207 , -1.25406403, 1.21754411, 0. ]])
>>> d1
array([[-1.34767042, -1.77338287, 0.693539 , 1. ],
[ 0.43156959, 0.31388795, 0.2856625 , 1. ]])
>>> d2
array([[-0.60531108, -0.63226693, 0.32063803, 2. ],
[-1.1573242 , 1.1636706 , 0.63733285, 2. ]])
>>> d3
array([[ 1.31245883, -2.08879047, -0.83514187, 3. ],
[-0.47538621, -0.64196643, -0.82296546, 3. ]])
To create the x1, y1, etc. that you mention in the post, you just need to manipulate those arrays.
x1 = d1[:,0]
y1 = (d1[:,1] - d1[:,2])**2
And so on for the other values of the fourth column. For a small number of possible values (1 - 8), hardcoding the different variables isn't too bad, but this method generalizes easily: loop over the possible values and collect the computed outputs in a list or dict.
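For instance, a minimal sketch of that generalization, collecting one (x, y) pair per unique label; the small `a` array here is hypothetical, standing in for your data:

```python
import numpy as np

a = np.array([[0.5, 1.0, 2.0, 0.],
              [1.5, 3.0, 1.0, 1.],
              [2.5, 2.0, 2.0, 0.],
              [3.5, 4.0, 1.0, 1.]])

# One (x, y) pair per unique label in the last column
groups = {}
for label in np.unique(a[:, 3]):
    d = a[a[:, 3] == label]                         # rows with this label
    groups[label] = (d[:, 0], (d[:, 1] - d[:, 2]) ** 2)
```

Each entry of `groups` then plays the role of one of the hardcoded `x1, y1` pairs.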
For instance, I have the following numpy arrays:
a = numpy.array( [ [ 1 , 3 , 3 ] , [ 2 , 5 , 5 ] , [ 3 , 7 , 7 ] ] )
b = numpy.array( [ 1 , 2 , 3 ] )
I want to write a piece of code that will emulate:
a[ a == b ] = 0
which the output will be:
[ [ 0 , 3 , 3 ] , [ 0 , 5 , 5 ] , [ 0 , 7 , 7 ] ]
How can I achieve this without a for loop? This is just a small example; in reality the arrays are very large, and a for loop takes too long to run.
You could do the following:
import numpy as np

a = np.array([[1, 3, 3], [2, 5, 5], [3, 7, 7]])
b = np.array([1, 2, 3])

def f(b, a):
    return np.where(a == b, 0, a)

print(np.array([*map(f, b, a)]))
which gives:
[[0 3 3]
[0 5 5]
[0 7 7]]
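Note that `map` still loops in Python. If the arrays are large, a fully broadcast sketch of the same row-wise comparison avoids that loop entirely:

```python
import numpy as np

a = np.array([[1, 3, 3], [2, 5, 5], [3, 7, 7]])
b = np.array([1, 2, 3])

# b[:, None] has shape (3, 1), so element (i, j) of the comparison
# tests a[i, j] == b[i], i.e. each row of a against its entry of b
result = np.where(a == b[:, None], 0, a)
print(result)
```

This performs the whole comparison in one vectorized call.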
I am trying to execute the following code:
def calculate_squared_dist_sliced_data(self, data, output, proc_numb):
    for k in range(1, self.calc_border):
        print("Calculating", k, "of", self.calc_border, "\n", (self.calc_border - k), "to go!")
        kmeans = KMeansClusterer.KMeansClusterer(k, data)
        print("inertia in round", k, ": ", kmeans.calc_custom_params(data, k).inertia_)
        output.put(proc_numb, (kmeans.calc_custom_params(self.data, k).inertia_))

def calculate_squared_dist_mp(self):
    length = np.shape(self.data)[0]
    df_array = []
    df_array[0] = self.data[int(length/4), :]
    df_array[1] = self.data[int((length/4)+1):int(length/2), :]
    df_array[2] = self.data[int((length/2)+1):int(3*length/4), :]
    df_array[3] = self.data[int((3*length/4)+1):int(length/4), :]
    output = mp.Queue()
    processes = [mp.Process(target=self.calculate_squared_dist_sliced_data, args=(df_array[x], output, x)) for x in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    results = [output.get() for p in processes]
When executing df_array[0] = self.data[int(length/4), :], I get the following error:
IndexError: list assignment index out of range
The variable length has the value 20195 (which is correct). I want to run the method calculate_squared_dist_sliced_data via multiprocessing, so I need to split the array data that is passed to this class.
Here is an example of how this numpy array looks:
[[ 0. 0. 0.02072968 ..., -0.07872599 -0.10147049 -0.44589 ]
[ 0. -0.11091352 0.11208243 ..., 0.08164318 -0.02754813
-0.44921876]
[ 0. -0.10642599 0.0028097 ..., 0.1185457 -0.22482443
-0.25121125]
...,
[ 0. 0. 0. ..., -0.03617197 0.00921685 0. ]
[ 0. 0. 0. ..., -0.08241634 -0.05494423
-0.10988845]
[ 0. 0. 0. ..., -0.03010139 -0.0925091
-0.02145017]]
Now I want to split this whole array into four equal pieces to give each one to a process. However, when selecting the rows I get the exception mentioned above. Can someone help me?
Maybe a more theoretical sketch of what I want to do:
A B C D
1 2 3 4
5 6 7 8
9 5 4 3
1 8 4 3
As a result I want to have for example two arrays, each containing two rows:
A B C D
1 2 3 4
5 6 7 8
and
A B C D
9 5 4 3
1 8 4 3
Can someone help me?
The left side of the assignment fails because your list has length 0, so index 0 is out of range.
Either fix it to:
df_array = [None, None, None, None]
or use
df_array.append(self.data[int(length/4), :])
...
instead.
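As an aside, np.array_split can produce the four row blocks directly, and it also copes with lengths that are not divisible by 4; a small sketch with stand-in data:

```python
import numpy as np

data = np.arange(20 * 3).reshape(20, 3)   # stand-in for self.data

# Split along axis 0 into four nearly equal blocks of rows
df_array = np.array_split(data, 4, axis=0)
print([part.shape for part in df_array])
```

Each element of `df_array` can then be handed to one process, and no rows are dropped between the slices.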
I just noticed that I tried to use a list like an array...
I've got a numpy array that looks like this:
1 0 0 0 200 0 0 0 1
6 0 0 0 2 0 0 0 4.3
5 0 0 0 1 0 0 0 7.1
expected out put would be
1 100 100 100 200 100 100 100 1
6 4 4 4 2 3.15 3.15 3.15 4.3
5 3 3 3 1 4.05 4.05 4.05 7.1
and I would like to replace all the 0 values with an average of their neighbours. Any hints welcome! Many thanks!
If the structure in the sample array is preserved throughout your array, then this code will work:
In [159]: def avg_func(r):
              lavg = (r[0] + r[4])/2.0
              ravg = (r[4] + r[-1])/2.0
              r[1:4] = lavg
              r[5:-1] = ravg
              return r
In [160]: np.apply_along_axis(avg_func, 1, arr)
Out[160]:
array([[ 1. , 100.5 , 100.5 , 100.5 , 200. , 100.5 , 100.5 ,
100.5 , 1. ],
[ 6. , 4. , 4. , 4. , 2. , 3.15, 3.15,
3.15, 4.3 ],
[ 5. , 3. , 3. , 3. , 1. , 4.05, 4.05,
4.05, 7.1 ]])
But, as you can see, this is somewhat messy because of the hardcoded indexes; you just have to get creative when defining avg_func. Also note that this implementation modifies the input array in place.
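If the column layout really is fixed as in the sample (values at columns 0, 4 and 8, zeros between), a vectorized sketch with plain slicing and broadcasting does the same in-place fill without apply_along_axis:

```python
import numpy as np

arr = np.array([[1., 0, 0, 0, 200, 0, 0, 0, 1.],
                [6., 0, 0, 0, 2, 0, 0, 0, 4.3],
                [5., 0, 0, 0, 1, 0, 0, 0, 7.1]])

# Average of the left/right anchor columns, one value per row;
# [:, None] broadcasts that column vector across the zero columns
arr[:, 1:4] = ((arr[:, 0] + arr[:, 4]) / 2.0)[:, None]
arr[:, 5:8] = ((arr[:, 4] + arr[:, 8]) / 2.0)[:, None]
print(arr)
```

This touches the whole array in two assignments instead of one Python call per row.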
UPDATED:
In my dataset I have 3 columns (x,y) and VALUE.
It's looking like this(sorted already):
df1:
x , y ,value
1 , 1 , 12
2 , 2 , 12
4 , 3 , 12
1 , 1 , 11
2 , 2 , 11
4 , 3 , 11
1 , 1 , 33
2 , 2 , 33
4 , 3 , 33
I need to get those rows where the distance between them (in the X and Y columns) is <= 1; let's say that is my radius. At the same time, I need to group them and keep only those rows where the value is equal.
I had problems comparing within one dataset because there was one header, so I created a second dataset with Python commands:
df:
x , y ,value
1 , 1 , 12
2 , 2 , 12
4 , 3 , 12
x , y ,value
1 , 1 , 11
2 , 2 , 11
4 , 3 , 11
x , y ,value
1 , 1 , 33
2 , 2 , 33
4 , 3 , 33
I have tried to use this code:
def dist_value_comp(row):
    x_dist = abs(df['y'] - row['y']) <= 1
    y_dist = abs(df['x'] - row['x']) <= 1
    xy_dist = x_dist & y_dist
    max_value = df.loc[xy_dist, 'value'].max()
    return row['value'] == max_value

df['keep_row'] = df.apply(dist_value_comp, axis=1)
df.loc[df['keep_row'], ['x', 'y', 'value']]
and
filtered_df = df[df.apply(lambda line: abs(line['x'] - line['y']) <= 1, 1)]
for i in filtered_df.groupby('value'):
    print(i)
Earlier I received errors connected with a bad data frame; I have repaired it, but I still get no results on the output.
This is how I am creating my new data frame df from df1 (if you have any better idea, please post it here): it has a big minus because it always prints the table, and when I test it again this function gives me an empty DataFrame.
VALUE1 = df1.VALUE.unique()

def separator():
    lst = []
    for VALUE in VALUE1:
        abc = df1[df1.VALUE == VALUE]
        print abc
    return lst

ab = separator()
df = pd.DataFrame(ab)
When I try the normal dataset df1, the output contains all the data without taking the radius = 1 into account.
I need to get an output table like this one:
x , y ,value
1 , 1 , 12
2 , 2 , 12
x , y ,value
1 , 1 , 11
2 , 2 , 11
x , y ,value
1 , 1 , 33
2 , 2 , 33
UPDATE 2:
I am working right now with this code:
filtered_df = df[df.apply(lambda line: abs(line['x'] - line['y']) <= 1, 1)]
for i in filtered_df.groupby('value'):
    print(i)
It seems to be OK (I am taking df1 as input), but when I look at the output,
it is doing nothing, because it does not know from which value it should apply the radius +/-1; that is the reason, I think.
In my dataset I have more columns, so let's take into account my 4th and 5th columns, 'D' and 'E'; the radius should then be taken from the row holding the minimum value in columns D and E at the same time.
df1:
x , y ,value ,D ,E
1 , 1 , 12 , 1 , 2
2 , 2 , 12 , 2 , 3
4 , 3 , 12 , 3 , 4
1 , 1 , 11 , 2 , 1
2 , 2 , 11 , 3 , 2
4 , 3 , 11 , 5 , 3
1 , 1 , 33 , 1 , 3
2 , 2 , 33 , 2 , 3
4 , 3 , 33 , 3 , 3
So the output should be the same as the one I want, but now I know from which value the radius +/-1 should start in this case.
Can anyone help me?
Sorry for the misunderstanding!
From what I understand, the order of your operations (filtering rows with distance <= 1 and grouping them) does not matter.
Here is my take:
# First, select the lines with the right distance
filtered_df = df[df.apply(lambda line: abs(line['x'] - line['y']) <= 1, 1)]

# Then group
for i in filtered_df.groupby('value'):
    print(i)
    # Or do whatever you want
Let me know if you want some explanations on how some part of the code works.
I have 4 lists that I need to iterate over so that I get the following:
x y a b
Lists a and b are of equal length and I iterate over both using the zip function, the code:
for a, b in zip(aL, bL):
    print(a, "\t", b)
List x contains 1000 items and list y contains 750 items; after the loop is finished, I am supposed to have 750,000 lines.
What I want to achieve is the following:
1 1 a b
1 2 a b
1 3 a b
1 4 a b
.....
1000 745 a b
1000 746 a b
1000 747 a b
1000 748 a b
1000 749 a b
1000 750 a b
How can I achieve this? I have tried enumerate and izip but both results are not what I am seeking.
Thanks.
EDIT:
I have followed your code and used it, since it is way faster. My output now looks like this:
[[[ 0.00000000e+00 0.00000000e+00 4.00000000e+01 2.30000000e+01]
[ 1.00000000e+00 0.00000000e+00 8.50000000e+01 1.40000000e+01]
[ 2.00000000e+00 0.00000000e+00 7.20000000e+01 2.00000000e+00]
...,
[ 1.44600000e+03 0.00000000e+00 9.20000000e+01 4.60000000e+01]
[ 1.44700000e+03 0.00000000e+00 5.00000000e+01 6.10000000e+01]
[ 1.44800000e+03 0.00000000e+00 8.40000000e+01 9.40000000e+01]]]
I now have 750 lists, and each of those contains another 1000. I have tried to flatten them to get 4 values (x, y, a, b) per line, but that just takes forever. Is there another way to flatten them?
EDIT2
I have tried
np.fromiter(itertools.chain.from_iterable(arr), dtype='int')
but it gave an error: setting an array element with a sequence, so I tried
np.fromiter(itertools.chain.from_iterable(arr[0]), dtype='int')
but this just gave one list back with what I suspect is the whole first list in the array.
EDIT v2
Now using np.stack instead of np.dstack, and handling file output.
This is considerably simpler than the solutions proposed below.
import numpy as np
import numpy.random as nprnd
aL = nprnd.randint(0,100,size=10) # 10 random ints
bL = nprnd.randint(0,100,size=10) # 10 random ints
xL = np.linspace(0,100,num=5) # 5 evenly spaced ints
yL = np.linspace(0,100,num=2) # 2 evenly spaced ints
xv,yv = np.meshgrid(xL,yL)
arr = np.stack((np.ravel(xv), np.ravel(yv), aL, bL), axis=-1)
np.savetxt('out.out', arr, delimiter=' ')
Using np.meshgrid gives us the following two arrays:
xv = [[ 0. 25. 50. 75. 100.]
[ 0. 25. 50. 75. 100.]]
yv = [[ 0. 0. 0. 0. 0.]
[ 100. 100. 100. 100. 100.]]
which, when we ravel, become:
np.ravel(xv) = [ 0. 25. 50. 75. 100. 0. 25. 50. 75. 100.]
np.ravel(yv) = [ 0. 0. 0. 0. 0. 100. 100. 100. 100. 100.]
These arrays have the same shape as aL and bL,
aL = [74 79 92 63 47 49 18 81 74 32]
bL = [15 9 81 44 90 93 24 90 51 68]
so all that's left is to stack all four arrays along axis=-1:
arr = np.stack((np.ravel(xv), np.ravel(yv), aL, bL), axis=-1)
arr = [[ 0. 0. 62. 41.]
[ 25. 0. 4. 42.]
[ 50. 0. 94. 71.]
[ 75. 0. 24. 91.]
[ 100. 0. 10. 55.]
[ 0. 100. 41. 81.]
[ 25. 100. 67. 11.]
[ 50. 100. 21. 80.]
[ 75. 100. 63. 37.]
[ 100. 100. 27. 2.]]
From here, saving is trivial:
np.savetxt('out.out', arr, delimiter=' ')
ORIGINAL ANSWER
idx = 0
out = []
for x in xL:
    for y in yL:
        v1 = aL[idx]
        v2 = bL[idx]
        out.append((x, y, v1, v2))
        # print(x, y, v1, v2)
        idx += 1
But it's slow, and it only gets slower with more coordinates. I'd consider using the numpy package instead. Here's an example with a 2 x 5 dataset.
import numpy as np
import numpy.random as nprnd

aL = nprnd.randint(0,100,size=10) # 10 random ints
bL = nprnd.randint(0,100,size=10) # 10 random ints
xL = np.linspace(0,100,num=5) # 5 evenly spaced values
yL = np.linspace(0,100,num=2) # 2 evenly spaced values
lenx = len(xL) # 5
leny = len(yL) # 2
arr = np.ndarray(shape=(leny,lenx,4)) # create a 3-d array
This creates a 3-dimensional array with a shape of 2 rows x 5 columns. Along the third axis (length 4) we populate the array with the data you want.
for x in range(leny):
    arr[x,:,0] = xL
This syntax is a little confusing at first. You can learn more about it here. In short, it iterates over the rows and sets a particular slice of the array to xL. In this case, the slice we have selected is the zeroth index in all columns of row x (the : means "select all indices on this axis"). For our small example, this would yield:
[[[ 0 0 0 0]
[ 25 0 0 0]
[ 50 0 0 0]
[ 75 0 0 0]
[100 0 0 0]]
[[ 0 0 0 0]
[ 25 0 0 0]
[ 50 0 0 0]
[ 75 0 0 0]
[100 0 0 0]]]
Now we do the same for each column:
for y in range(lenx):
    arr[:,y,1] = yL
-----
[[[ 0 0 0 0]
[ 25 0 0 0]
[ 50 0 0 0]
[ 75 0 0 0]
[100 0 0 0]]
[[ 0 100 0 0]
[ 25 100 0 0]
[ 50 100 0 0]
[ 75 100 0 0]
[100 100 0 0]]]
Now we need to address arrays aL and bL. These arrays are flat, so we must first reshape them to conform to the shape of arr. In our simple example, this takes an array of length 10 and reshapes it into a 2 x 5 two-dimensional array.
a_reshaped = aL.reshape(leny,lenx)
b_reshaped = bL.reshape(leny,lenx)
To insert the reshaped arrays into arr, we select the 2nd and 3rd index for all rows and all columns (note the two :'s this time):
arr[:,:,2] = a_reshaped
arr[:,:,3] = b_reshaped
----
[[[ 0 0 3 38]
[ 25 0 63 89]
[ 50 0 4 25]
[ 75 0 72 1]
[100 0 24 83]]
[[ 0 100 55 85]
[ 25 100 39 9]
[ 50 100 43 85]
[ 75 100 63 57]
[100 100 6 63]]]
This runs considerably faster than the nested loop solution. Hope it helps!
Sounds like you need a nested loop over x and y:
for x in xL:
    for y in yL:
        for a, b in zip(aL, bL):
            print("%d\t%d\t%s\t%s" % (x, y, a, b))
Try this:
for i, j in zip(zip(a, b), zip(c, d)):
    print("%d\t%d\t%s\t%s" % (i[0], i[1], j[0], j[1]))
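For the cross-product pairing the question describes (every x with every y, walked alongside zip(aL, bL)), a pure-Python sketch with itertools.product; the small lists here are hypothetical stand-ins for the 1000- and 750-item ones:

```python
from itertools import product

# Small stand-in lists: len(aL) must equal len(xL) * len(yL)
xL = [1, 2]
yL = [10, 20, 30]
aL = ["a%d" % i for i in range(6)]
bL = ["b%d" % i for i in range(6)]

# product walks every (x, y) pair in order, in step with zip(aL, bL)
rows = [(x, y, a, b) for (x, y), (a, b) in zip(product(xL, yL), zip(aL, bL))]
for row in rows:
    print("%d\t%d\t%s\t%s" % row)
```

This avoids both the manual index bookkeeping and any NumPy dependency.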