Create all x,y pairs from two coordinate arrays - python

I have 4 lists that I need to iterate over so that I get the following:
x y a b
Lists aL and bL are of equal length, and I iterate over both using the zip function:
for a, b in zip(aL, bL):
    print(a, "\t", b)
List xL contains 1000 items and list yL contains 750 items; after the loop is finished I am supposed to have 750,000 lines.
What I want to achieve is the following:
1 1 a b
1 2 a b
1 3 a b
1 4 a b
.....
1000 745 a b
1000 746 a b
1000 747 a b
1000 748 a b
1000 749 a b
1000 750 a b
How can I achieve this? I have tried enumerate and izip, but neither result is what I am seeking.
Thanks.
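(For what it's worth, the pairing described above is just the Cartesian product of the two coordinate lists; a minimal pure-Python sketch, with short made-up lists standing in for the real 1000- and 750-item ones:)

```python
import itertools

# made-up short lists standing in for the real data
xL = [1, 2, 3]
yL = [10, 20]
aL = list(range(6))        # one a-value per (x, y) pair
bL = list(range(6, 12))    # one b-value per (x, y) pair

# itertools.product yields every (x, y) pair, x varying slowest
rows = [(x, y, a, b)
        for (x, y), a, b in zip(itertools.product(xL, yL), aL, bL)]
# len(rows) == len(xL) * len(yL) == 6
```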
EDIT:
I have followed your code and used it, since it is way faster. My output now looks like this:
[[[ 0.00000000e+00 0.00000000e+00 4.00000000e+01 2.30000000e+01]
[ 1.00000000e+00 0.00000000e+00 8.50000000e+01 1.40000000e+01]
[ 2.00000000e+00 0.00000000e+00 7.20000000e+01 2.00000000e+00]
...,
[ 1.44600000e+03 0.00000000e+00 9.20000000e+01 4.60000000e+01]
[ 1.44700000e+03 0.00000000e+00 5.00000000e+01 6.10000000e+01]
[ 1.44800000e+03 0.00000000e+00 8.40000000e+01 9.40000000e+01]]]
I have now 750 lists and each of those have another 1000. I have tried to flatten those to get 4 values (x,y,a,b) per line. This just takes forever. Is there another way to flatten those?
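(A note on the flattening: a nested array like that usually does not need element-wise flattening; if the final row width is known, a single reshape gives one (x, y, a, b) row per line. A small sketch with made-up numbers:)

```python
import numpy as np

# a 3-d array shaped like the one above: 1 x 6 x 4
arr = np.arange(24, dtype=float).reshape(1, 6, 4)

# collapse all leading axes, keep 4 values per row
flat = arr.reshape(-1, 4)
# flat.shape == (6, 4)
```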
EDIT2
I have tried
np.fromiter(itertools.chain.from_iterable(arr), dtype='int')
but it gave an error: setting an array element with a sequence, so I tried
np.fromiter(itertools.chain.from_iterable(arr[0]), dtype='int')
but this just gave one list back with what I suspect is the whole first list in the array.

EDIT v2
Now using np.stack instead of np.dstack, and handling file output.
This is considerably simpler than the solutions proposed below.
import numpy as np
import numpy.random as nprnd

aL = nprnd.randint(0, 100, size=10)  # 10 random ints
bL = nprnd.randint(0, 100, size=10)  # 10 random ints
xL = np.linspace(0, 100, num=5)      # 5 evenly spaced values
yL = np.linspace(0, 100, num=2)      # 2 evenly spaced values

xv, yv = np.meshgrid(xL, yL)
arr = np.stack((np.ravel(xv), np.ravel(yv), aL, bL), axis=-1)
np.savetxt('out.out', arr, delimiter=' ')
Using np.meshgrid gives us the following two arrays:
xv = [[ 0. 25. 50. 75. 100.]
[ 0. 25. 50. 75. 100.]]
yv = [[ 0. 0. 0. 0. 0.]
[ 100. 100. 100. 100. 100.]]
which, when we ravel, become:
np.ravel(xv) = [ 0. 25. 50. 75. 100. 0. 25. 50. 75. 100.]
np.ravel(yv) = [ 0. 0. 0. 0. 0. 100. 100. 100. 100. 100.]
These arrays have the same shape as aL and bL,
aL = [74 79 92 63 47 49 18 81 74 32]
bL = [15 9 81 44 90 93 24 90 51 68]
so all that's left is to stack all four arrays along axis=-1:
arr = np.stack((np.ravel(xv), np.ravel(yv), aL, bL), axis=-1)
arr = [[ 0. 0. 62. 41.]
[ 25. 0. 4. 42.]
[ 50. 0. 94. 71.]
[ 75. 0. 24. 91.]
[ 100. 0. 10. 55.]
[ 0. 100. 41. 81.]
[ 25. 100. 67. 11.]
[ 50. 100. 21. 80.]
[ 75. 100. 63. 37.]
[ 100. 100. 27. 2.]]
From here, saving is trivial:
np.savetxt('out.out', arr, delimiter=' ')
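(As an optional sanity check, not part of the answer itself: np.savetxt output can be round-tripped through np.loadtxt, here via an in-memory buffer so no file is written:)

```python
import io
import numpy as np

arr = np.array([[0., 0., 62., 41.],
                [25., 0., 4., 42.]])

buf = io.StringIO()
np.savetxt(buf, arr, delimiter=' ')  # same call as above, file-like target
buf.seek(0)
back = np.loadtxt(buf)
# np.array_equal(arr, back) -> True
```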
ORIGINAL ANSWER
You could do this with a nested loop and a running index:
idx = 0
out = []
for x in xL:
    for y in yL:
        v1 = aL[idx]
        v2 = bL[idx]
        out.append((x, y, v1, v2))
        # print(x, y, v1, v2)
        idx += 1
This works, but it's slow, and only gets slower with more coordinates. I'd consider using the numpy package instead. Here's an example with a 2 x 5 dataset.
import numpy as np
import numpy.random as nprnd

aL = nprnd.randint(0, 100, size=10)  # 10 random ints
bL = nprnd.randint(0, 100, size=10)  # 10 random ints
xL = np.linspace(0, 100, num=5)      # 5 evenly spaced values
yL = np.linspace(0, 100, num=2)      # 2 evenly spaced values

lenx = len(xL)  # 5
leny = len(yL)  # 2
arr = np.ndarray(shape=(leny, lenx, 4))  # create a 3-d array
This creates a 3-dimensional array having a shape of 2 rows x 5 columns. Along the third axis (length 4) we populate the array with the data you want.
for x in range(leny):
    arr[x, :, 0] = xL
This syntax is a little confusing at first. You can learn more about it here. In short, it iterates over the number of rows and sets a particular slice of the array to xL. In this case, the slice we have selected is the zeroth index in all columns of row x (the : means "select all indices on this axis"). For our small example, this yields:
[[[ 0 0 0 0]
[ 25 0 0 0]
[ 50 0 0 0]
[ 75 0 0 0]
[100 0 0 0]]
[[ 0 0 0 0]
[ 25 0 0 0]
[ 50 0 0 0]
[ 75 0 0 0]
[100 0 0 0]]]
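(The slice-assignment-plus-broadcasting idea can be checked in isolation with a tiny array; sizes here are made up:)

```python
import numpy as np

xL = np.array([0., 25., 50.])
arr = np.zeros((2, 3, 4))    # 2 rows x 3 columns x 4 fields

for row in range(2):
    arr[row, :, 0] = xL      # fill field 0 of every column in this row
# arr[:, :, 0] now holds xL in both rows; the other fields stay 0
```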
Now we do the same for each column:
for y in range(lenx):
    arr[:, y, 1] = yL
-----
[[[ 0 0 0 0]
[ 25 0 0 0]
[ 50 0 0 0]
[ 75 0 0 0]
[100 0 0 0]]
[[ 0 100 0 0]
[ 25 100 0 0]
[ 50 100 0 0]
[ 75 100 0 0]
[100 100 0 0]]]
Now we need to address the arrays aL and bL. These arrays are flat, so we must first reshape them to conform to the shape of arr. In our simple example, this takes an array of length 10 and reshapes it into a 2 x 5 two-dimensional array.
a_reshaped = aL.reshape(leny,lenx)
b_reshaped = bL.reshape(leny,lenx)
To insert the reshaped arrays into our arr, we select the 2nd and 3rd index for all rows and all columns (note the two :'s this time):
arr[:,:,2] = a_reshaped
arr[:,:,3] = b_reshaped
----
[[[ 0 0 3 38]
[ 25 0 63 89]
[ 50 0 4 25]
[ 75 0 72 1]
[100 0 24 83]]
[[ 0 100 55 85]
[ 25 100 39 9]
[ 50 100 43 85]
[ 75 100 63 57]
[100 100 6 63]]]
This runs considerably faster than the nested-loop solution. Hope it helps!
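(An alternative sketch, my addition rather than part of the answer above: the same grid columns that meshgrid-plus-ravel produces can be built with np.tile for the fast axis and np.repeat for the slow one:)

```python
import numpy as np

xL = np.array([0., 25., 50., 75., 100.])
yL = np.array([0., 100.])

x_col = np.tile(xL, len(yL))      # x varies fastest: 0,25,...,100,0,25,...
y_col = np.repeat(yL, len(xL))    # y constant within each block of 5
grid = np.column_stack((x_col, y_col))
# grid has len(xL) * len(yL) == 10 rows of (x, y)
```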

Sounds like you need a nested loop for x and y:
for x in xL:
    for y in yL:
        for a, b in zip(aL, bL):
            print "%d\t%d\t%s\t%s" % (x, y, a, b)

Try this:
for i, j in zip(zip(a, b), zip(c, d)):
    print "%d\t%d\t%s\t%s" % (i[0], i[1], j[0], j[1])

Related

Generating a numpy-ndarray from dataframe for keras data

This is a task I have been wondering how to do. I have a DataFrame containing motion characteristics of users (by user id) similar to the one below:
>>> df
id speed1 speed2 acc1 acc2 label
0 1 19 12 5 2 0
1 1 10 11 9 3 0
2 1 12 10 4 -1 0
3 1 29 13 8 4 0
4 1 30 23 9 10 0
5 1 18 11 2 -1 0
6 1 10 6 -3 -2 0
7 2 5 1 0 0 1
8 2 7 2 1 3 1
9 2 6 2 1 0 1
From this dataframe, I would like to generate a numpy ndarray (or should I rather say a list of arrays?) of fixed-length segments by splitting each user's (i.e. id) records, so that each segment has the shape (1, 5, 4) and can be fed to a neural network this way:
each segment (thus the 1) consists of five arrays (thus the 5) of the motion characteristics speed1 speed2 acc1 acc2 (thus the 4) in the above dataframe.
Where the rows cannot make up five arrays, the remaining arrays are filled up with zeros (i.e. zero-padded).
Then the label column should also be a separate array, matching the size of the new array, by duplicating the label's value in the position of the zero-padded arrays for the padded segments.
In the given df example above, the expected output would be:
>>>input_array
[
[
[19 12 5 2]
[10 11 9 3]
[12 10 4 -1]
[29 13 8 4]
[30 23 9 10]
]
[
[18 11 2 -1]
[10 6 -3 -2]
[0 0 0 0]
[0 0 0 0]
[0 0 0 0]
]
[
[5 1 0 0]
[7 2 1 3]
[6 2 1 0]
[0 0 0 0]
[0 0 0 0]
]
]
id=1 has 7 rows, so the last 3 rows of its second segment are zero-padded. Similarly, id=2 has 3 rows, so the last 2 rows of its segment are zero-padded.
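(The zero-padding step on its own, as a hedged sketch separate from the full answer below, can be done with np.pad, padding only the row axis:)

```python
import numpy as np

seg = np.array([[18, 11, 2, -1],
                [10, 6, -3, -2]])   # a short segment (2 of 5 rows present)
chunk_size = 5

# pad (chunk_size - len(seg)) zero rows at the bottom, no column padding
padded = np.pad(seg, [(0, chunk_size - len(seg)), (0, 0)], mode='constant')
# padded.shape == (5, 4); the last three rows are all zeros
```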
EDIT
I noticed 2 bugs with the function given in the answer.
The function introduces an all-zero array in some cases.
For example in this:
df2 = {
'id': [1,1,1,1,1,1,1,1,1,1,1,1],
'speed1': [17.63,17.63,0.17,1.41,0.61,0.32,0.18,0.43,0.30,0.46,0.75,0.37],
'speed2': [0.00,-0.09,1.24,-0.80,-0.29,-0.14,0.25,-0.13,0.16,0.29,-0.38,0.27],
'acc1': [0.00,0.01,-2.04,0.51,0.15,0.39,-0.38,0.29,0.13,-0.67,0.65,0.52],
'acc2': [29.03,56.12,18.49,11.85,36.75,27.52,81.08,51.06,19.85,10.76,14.51,24.27],
'label' : [3,3,3,3,3,3,3,3,3,3,3,3] }
df2 = pd.DataFrame.from_dict(df2)
X , y = transform(df2[:10])
X
array([[[[ 1.763e+01, 0.000e+00, 0.000e+00, 2.903e+01],
[ 1.763e+01, -9.000e-02, 1.000e-02, 5.612e+01],
[ 1.700e-01, 1.240e+00, -2.040e+00, 1.849e+01],
[ 1.410e+00, -8.000e-01, 5.100e-01, 1.185e+01],
[ 6.100e-01, -2.900e-01, 1.500e-01, 3.675e+01]]],
[[[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00]]],
[[[ 3.200e-01, -1.400e-01, 3.900e-01, 2.752e+01],
[ 1.800e-01, 2.500e-01, -3.800e-01, 8.108e+01],
[ 4.300e-01, -1.300e-01, 2.900e-01, 5.106e+01],
[ 3.000e-01, 1.600e-01, 1.300e-01, 1.985e+01],
[ 4.600e-01, 2.900e-01, -6.700e-01, 1.076e+01]]]])
Notice how the function introduced an all-zero array as the second element. Ideally the output should contain only the first and last arrays.
When passed a df with more than 10 rows, the function fails with an "index can't contain negative values" error.
So if you pass the full df2 you get this:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-71-743489875901> in <module>()
----> 1 X , y = transform(df2)
2 X
2 frames
<ipython-input-55-f6e028a2e8b8> in transform(dataframe, chunk_size)
24 inpt = np.pad(
25 inpt, [(0, chunk_size-len(inpt)),(0, 0)],
---> 26 mode='constant')
27 # add each inputs split to accumulators
28 X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
<__array_function__ internals> in pad(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py in pad(array, pad_width, mode, **kwargs)
746
747 # Broadcast to shape (array.ndim, 2)
--> 748 pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
749
750 if callable(mode):
/usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py in _as_pairs(x, ndim, as_index)
517
518 if as_index and x.min() < 0:
--> 519 raise ValueError("index can't contain negative values")
520
521 # Converting the array with `tolist` seems to improve performance
ValueError: index can't contain negative values
[EDITED] Bugs fixed. The implementation below should now give desired output:
import pandas as pd
import numpy as np
df = {
'id': [1,1,1,1,1,1,1,1,1,1,1,1],
'speed1': [17.63,17.63,0.17,1.41,0.61,0.32,0.18,0.43,0.30,0.46,0.75,0.37],
'speed2': [0.00,-0.09,1.24,-0.80,-0.29,-0.14,0.25,-0.13,0.16,0.29,-0.38,0.27],
'acc1': [0.00,0.01,-2.04,0.51,0.15,0.39,-0.38,0.29,0.13,-0.67,0.65,0.52],
'acc2': [29.03,56.12,18.49,11.85,36.75,27.52,81.08,51.06,19.85,10.76,14.51,24.27],
'label' : [3,3,3,3,3,3,3,3,3,3,3,3] }
df = pd.DataFrame.from_dict(df)
def transform(dataframe, chunk_size=5):
    grouped = dataframe.groupby('id')
    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])
    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:
        inputs = group.loc[:, 'speed1':'acc2'].values
        label = group.loc[:, 'label'].values[0]
        # calculate number of splits
        N = (len(inputs)-1) // chunk_size
        if N > 0:
            inputs = np.array_split(
                inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]
        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)), (0, 0)],
                mode='constant')
            # add each inputs split to accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0)
    return X, y
X, y = transform(df)
print('X shape =', X.shape)
print('X =', X)
print('Y shape =', y.shape)
print('Y =', y)
# >> out:
# X shape = (3, 1, 5, 4)
# X = [[[[17.63 0. 0. 29.03]
# [17.63 -0.09 0.01 56.12]
# [ 0.17 1.24 -2.04 18.49]
# [ 1.41 -0.8 0.51 11.85]
# [ 0.61 -0.29 0.15 36.75]]]
#
#
# [[[ 0.32 -0.14 0.39 27.52]
# [ 0.18 0.25 -0.38 81.08]
# [ 0.43 -0.13 0.29 51.06]
# [ 0.3 0.16 0.13 19.85]
# [ 0.46 0.29 -0.67 10.76]]]
#
#
# [[[ 0.75 -0.38 0.65 14.51]
# [ 0.37 0.27 0.52 24.27]
# [ 0. 0. 0. 0. ]
# [ 0. 0. 0. 0. ]
# [ 0. 0. 0. 0. ]]]]
# Y shape = (3,)
# Y = [3. 3. 3.]

Python numpy array split index out of range

I am trying to execute the following code:
def calculate_squared_dist_sliced_data(self, data, output, proc_numb):
    for k in range(1, self.calc_border):
        print("Calculating", k, "of", self.calc_border, "\n", (self.calc_border - k), "to go!")
        kmeans = KMeansClusterer.KMeansClusterer(k, data)
        print("inertia in round", k, ": ", kmeans.calc_custom_params(data, k).inertia_)
        output.put(proc_numb, (kmeans.calc_custom_params(self.data, k).inertia_))

def calculate_squared_dist_mp(self):
    length = np.shape(self.data)[0]
    df_array = []
    df_array[0] = self.data[int(length/4), :]
    df_array[1] = self.data[int((length/4)+1):int(length/2), :]
    df_array[2] = self.data[int((length/2)+1):int(3*length/4), :]
    df_array[3] = self.data[int((3*length/4)+1):int(length/4), :]
    output = mp.Queue()
    processes = [mp.Process(target=self.calculate_squared_dist_sliced_data, args=(df_array[x], output, x)) for x in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    results = [output.get() for p in processes]
When executing df_array[0] = self.data[int(length/4), :], I get the following error:
IndexError: list assignment index out of range
The variable length has the value 20195 (which is correct). I want to run the method calculate_squared_dist_sliced_data with multiprocessing, so I need to split the array data that is passed to this class.
Here is an example of how this numpy array looks:
[[ 0. 0. 0.02072968 ..., -0.07872599 -0.10147049 -0.44589 ]
[ 0. -0.11091352 0.11208243 ..., 0.08164318 -0.02754813
-0.44921876]
[ 0. -0.10642599 0.0028097 ..., 0.1185457 -0.22482443
-0.25121125]
...,
[ 0. 0. 0. ..., -0.03617197 0.00921685 0. ]
[ 0. 0. 0. ..., -0.08241634 -0.05494423
-0.10988845]
[ 0. 0. 0. ..., -0.03010139 -0.0925091
-0.02145017]]
Now I want to split this whole array into four equal pieces to give each one to a process. However, when selecting the rows I get the exception mentioned above. Can someone help me?
Maybe as a more theoretical description of what I want to do:
A B C D
1 2 3 4
5 6 7 8
9 5 4 3
1 8 4 3
As a result I want to have for example two arrays, each containing two rows:
A B C D
1 2 3 4
5 6 7 8
and
A B C D
9 5 4 3
1 8 4 3
Can someone help me?
The left side of the assignment is not allowed, as your list has length 0.
Either fix it to:
df_array = [None, None, None, None]
or use
df_array.append(self.data[int(length/4), :])
...
instead.
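(As an aside, and not part of the fix above: np.array_split can do the four-way row split directly, which also avoids the off-by-one slicing in the original code. A sketch with a stand-in array:)

```python
import numpy as np

data = np.arange(20195 * 2).reshape(20195, 2)   # stand-in for self.data
df_array = np.array_split(data, 4, axis=0)      # four nearly equal row blocks
# every original row lands in exactly one block
```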
I just noticed that I tried to use a list like an array...

Creating an array based on a plot of custom function (Python)

I'm trying to use Numpy to create a y vector that will correspond to the following plot:
The x values will run from 0 to 24, the y values should be:
0 to 6 will be 0
6 to 18 will be sort of parabola
18 to 24 will be 0 again
What is a good way to do it? I don't have any practical ideas yet (I thought about some sort of interpolation).
Thank you!
I have done it assuming that you want a circle shape instead of a parabola (based on your scheme).
import numpy as np
length = 24
radius = 6
x = np.arange(length)
y = np.sqrt(radius**2-(x-(length/2))**2)
y = np.nan_to_num(y)
print(x)
# [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
print(y)
# [0. 0. 0. 0. 0. 0.
# 0. 3.31662479 4.47213595 5.19615242 5.65685425 5.91607978
# 6. 5.91607978 5.65685425 5.19615242 4.47213595 3.31662479
# 0. 0. 0. 0. 0. 0. ]

Array building in Python

I have an array like this in a data file:
0 822.6 1391.3 1
0 822.6 1391.3 2
0 708.3 1501.2 3
0 708.3 1501.2 4
0 632.5 1585.8 5
0 632.5 1585.8 6
0 552.4 1652.6 7
0 552.4 1652.6 8
250 850.8 1358.6 1
250 803.3 1406.2 2
250 732.0 1481.9 3
250 694.9 1519 4
250 642.9 1566.5 5
250 613.2 1594.7 6
250 570.2 1637.8 7
250 537.5 1663 8
I want to create separate data sets depending on the last column.
In other words I want something like this:
while data[:,3] != 9:
    if data[:,3] == 1:
        x1 = data[:,0]
        y1 = (data[:,1]-data[:,2])**2
    if data[:,3] == 2:
        x2 = data[:,0]
        y2 = (data[:,1]-data[:,2])**2
And so on...
I only put does not equal 9 because I only have from 1-8 in the last column always.
I know this is completely wrong, but I need help.
Consider the following:
import numpy as np
a = np.array([[ 0.64910219,  0.06868991, -0.34844128,  0.        ],
              [-1.34767042, -1.77338287,  0.693539  ,  1.        ],
              [ 1.31245883, -2.08879047, -0.83514187,  3.        ],
              [ 0.43156959,  0.31388795,  0.2856625 ,  1.        ],
              [-0.60531108, -0.63226693,  0.32063803,  2.        ],
              [-0.47538621, -0.64196643, -0.82296546,  3.        ],
              [ 0.3491207 , -1.25406403,  1.21754411,  0.        ],
              [-1.1573242 ,  1.1636706 ,  0.63733285,  2.        ]])
d0 = a[a[:,3]==0,:]
d1 = a[a[:,3]==1,:]
d2 = a[a[:,3]==2,:]
d3 = a[a[:,3]==3,:]
The variables d0, d1, d2, d3 contain the rows with the appropriate values in the right-most column.
>>> d0
array([[ 0.64910219, 0.06868991, -0.34844128, 0. ],
[ 0.3491207 , -1.25406403, 1.21754411, 0. ]])
>>> d1
array([[-1.34767042, -1.77338287, 0.693539 , 1. ],
[ 0.43156959, 0.31388795, 0.2856625 , 1. ]])
>>> d2
array([[-0.60531108, -0.63226693, 0.32063803, 2. ],
[-1.1573242 , 1.1636706 , 0.63733285, 2. ]])
>>> d3
array([[ 1.31245883, -2.08879047, -0.83514187, 3. ],
[-0.47538621, -0.64196643, -0.82296546, 3. ]])
To create the x1, y1, etc.. that you mention in the post, you just need to manipulate those arrays.
x1 = d1[:,0]
y1 = (d1[:,1] - d1[:,2])**2
And so on for the other values of the fourth column. For a small number of possible values (1 - 8), hardcoding the different variables isn't too bad, but this method can easily generalize to an arbitrary list with a list of the computed outputs.
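(The generalization mentioned above can be sketched with a dictionary comprehension keyed on the unique label values; my addition, with a small made-up array:)

```python
import numpy as np

a = np.array([[ 0.6,  0.1, -0.3, 0.],
              [-1.3, -1.8,  0.7, 1.],
              [ 0.4,  0.3,  0.3, 1.],
              [ 0.3, -1.3,  1.2, 0.]])

# one sub-array per distinct value in the fourth column
parts = {int(k): a[a[:, 3] == k] for k in np.unique(a[:, 3])}
# and the computed outputs for each group
ys = {k: (v[:, 1] - v[:, 2]) ** 2 for k, v in parts.items()}
```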

Adjacent cells of multiple cell patches in a numpy array

this is a followup question arising from this solution.
The solution to count adjacent cells works pretty well unless you have multiple patches in the array.
So this time the array for instance looks like this.
import numpy
from scipy import ndimage

s = ndimage.generate_binary_structure(2, 2)
a = numpy.zeros((6, 6), dtype=int)  # example array
a[1:3, 1:3] = 1; a[2:4, 4:5] = 1
print a
[[0 0 0 0 0 0]
 [0 1 1 0 0 0]
 [0 1 1 0 1 0]
 [0 0 0 0 1 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]
# Number of nonoverlapping cells
c = ndimage.binary_dilation(a,s).astype(a.dtype)
b = c - a
numpy.sum(b) # returns 19
# However the correct number of non overlapping cells should be 22 (12+10)
Is there any smart solution to solve this dilemma without using any loops or iterating through the array? The reason is that the array could be quite big.
idea 1:
Just thought over it and a way to do it might be to check for more than one patch in the iterating structure. For the total count number to be correct those cells below have to be equal 2 (or more) in the dilation. Anyone got any idea how to turn this thought into code?
[1 1 1 1 0 0]
[1 0 0 2 1 1]
[1 0 0 2 0 1]
[1 1 1 2 0 1]
[0 0 0 1 1 1]
[0 0 0 0 0 0]
You can use label from ndimage to segment each patch of ones.
Then you just ask where the returned array equals 1, 2, 3 etc. and perform your algorithm on it (or you use ndimage.distance_transform_cdt, but with your foreground/background inverted, for each labeled segment).
Edit 1:
This code will take your array a and do what you ask:
b, c = ndimage.label(a)
e = numpy.zeros(a.shape)
for i in xrange(c):
e += ndimage.distance_transform_cdt((b == i + 1) == 0) == 1
print e
I realize it is a bit ugly with all the equals there but it outputs:
In [41]: print e
[[ 1. 1. 1. 1. 0. 0.]
[ 1. 0. 0. 2. 1. 1.]
[ 1. 0. 0. 2. 0. 1.]
[ 1. 1. 1. 2. 0. 1.]
[ 0. 0. 0. 1. 1. 1.]
[ 0. 0. 0. 0. 0. 0.]]
Edit 2 (Alternative solution):
This code should do the same stuff and hopefully faster (however, it will not find the cells where two patches only touch at the corners).
b = ndimage.binary_closing(a) - a
b = ndimage.binary_dilation(b.astype(bool))
c = ndimage.distance_transform_cdt(a == 0) == 1
e = c.astype(numpy.int) * b + c
print e
