Generating a numpy-ndarray from dataframe for keras data - python

This is a task I have been trying to figure out. I have a DataFrame containing motion characteristics of users (keyed by user id), similar to the one below:
>>> df
   id  speed1  speed2  acc1  acc2  label
0   1      19      12     5     2      0
1   1      10      11     9     3      0
2   1      12      10     4    -1      0
3   1      29      13     8     4      0
4   1      30      23     9    10      0
5   1      18      11     2    -1      0
6   1      10       6    -3    -2      0
7   2       5       1     0     0      1
8   2       7       2     1     3      1
9   2       6       2     1     0      1
From this DataFrame, I would like to generate a numpy ndarray (or should I rather say a list of arrays?) of fixed-length segments by splitting each user's (i.e. id) records, so that each segment has the shape (1, 5, 4) and can be fed to a neural network, where:
each segment (hence the 1) consists of five rows (hence the 5) of the four motion characteristics speed1, speed2, acc1, acc2 (hence the 4) from the DataFrame above;
where the remaining rows cannot make up five, the segment is filled up with zeros (i.e. zero-padded);
the label column should then become a separate array matching the length of the new array, duplicating the label's value for the zero-padded segments.
In the given df example above, the expected output would be:
>>> input_array
[
  [
    [19 12  5  2]
    [10 11  9  3]
    [12 10  4 -1]
    [29 13  8  4]
    [30 23  9 10]
  ]
  [
    [18 11  2 -1]
    [10  6 -3 -2]
    [ 0  0  0  0]
    [ 0  0  0  0]
    [ 0  0  0  0]
  ]
  [
    [ 5  1  0  0]
    [ 7  2  1  3]
    [ 6  2  1  0]
    [ 0  0  0  0]
    [ 0  0  0  0]
  ]
]
id=1 has 7 rows, so the last 3 rows are zero-padded. Similarly, id=2 has 3 rows, so the last 2 rows are zero-padded.
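As a minimal sketch of just the padding step for one group (using the id=2 rows from the example df), np.pad can grow a (n, 4) block to (5, 4) with trailing zero rows:
import numpy as np

# the three rows for id=2 in the example df
block = np.array([[5, 1, 0, 0],
                  [7, 2, 1, 3],
                  [6, 2, 1, 0]])

# pad (before, after) rows on axis 0 and nothing on axis 1
segment = np.pad(block, [(0, 5 - len(block)), (0, 0)], mode='constant')
print(segment.shape)  # (5, 4)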
EDIT
I noticed two bugs in the function given in the answer.
First, the function introduces an all-zero array in some cases. For example, given this input:
df2 = {
    'id': [1,1,1,1,1,1,1,1,1,1,1,1],
    'speed1': [17.63,17.63,0.17,1.41,0.61,0.32,0.18,0.43,0.30,0.46,0.75,0.37],
    'speed2': [0.00,-0.09,1.24,-0.80,-0.29,-0.14,0.25,-0.13,0.16,0.29,-0.38,0.27],
    'acc1': [0.00,0.01,-2.04,0.51,0.15,0.39,-0.38,0.29,0.13,-0.67,0.65,0.52],
    'acc2': [29.03,56.12,18.49,11.85,36.75,27.52,81.08,51.06,19.85,10.76,14.51,24.27],
    'label': [3,3,3,3,3,3,3,3,3,3,3,3]
}
df2 = pd.DataFrame.from_dict(df2)
X, y = transform(df2[:10])
X
array([[[[ 1.763e+01, 0.000e+00, 0.000e+00, 2.903e+01],
[ 1.763e+01, -9.000e-02, 1.000e-02, 5.612e+01],
[ 1.700e-01, 1.240e+00, -2.040e+00, 1.849e+01],
[ 1.410e+00, -8.000e-01, 5.100e-01, 1.185e+01],
[ 6.100e-01, -2.900e-01, 1.500e-01, 3.675e+01]]],
[[[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00],
[ 0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00]]],
[[[ 3.200e-01, -1.400e-01, 3.900e-01, 2.752e+01],
[ 1.800e-01, 2.500e-01, -3.800e-01, 8.108e+01],
[ 4.300e-01, -1.300e-01, 2.900e-01, 5.106e+01],
[ 3.000e-01, 1.600e-01, 1.300e-01, 1.985e+01],
[ 4.600e-01, 2.900e-01, -6.700e-01, 1.076e+01]]]])
Notice how the function introduced an all-zero array as the second element. Ideally the output should contain only the first and last arrays.
Second, when passed a df with more than 10 rows, the function fails with an "index can't contain negative values" error. So if you pass the full df2 you get this:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-71-743489875901> in <module>()
----> 1 X , y = transform(df2)
2 X
2 frames
<ipython-input-55-f6e028a2e8b8> in transform(dataframe, chunk_size)
24 inpt = np.pad(
25 inpt, [(0, chunk_size-len(inpt)),(0, 0)],
---> 26 mode='constant')
27 # add each inputs split to accumulators
28 X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
<__array_function__ internals> in pad(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py in pad(array, pad_width, mode, **kwargs)
746
747 # Broadcast to shape (array.ndim, 2)
--> 748 pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
749
750 if callable(mode):
/usr/local/lib/python3.6/dist-packages/numpy/lib/arraypad.py in _as_pairs(x, ndim, as_index)
517
518 if as_index and x.min() < 0:
--> 519 raise ValueError("index can't contain negative values")
520
521 # Converting the array with `tolist` seems to improve performance
ValueError: index can't contain negative values

[EDITED] Bugs fixed. The implementation below should now give the desired output:
import pandas as pd
import numpy as np

df = {
    'id': [1,1,1,1,1,1,1,1,1,1,1,1],
    'speed1': [17.63,17.63,0.17,1.41,0.61,0.32,0.18,0.43,0.30,0.46,0.75,0.37],
    'speed2': [0.00,-0.09,1.24,-0.80,-0.29,-0.14,0.25,-0.13,0.16,0.29,-0.38,0.27],
    'acc1': [0.00,0.01,-2.04,0.51,0.15,0.39,-0.38,0.29,0.13,-0.67,0.65,0.52],
    'acc2': [29.03,56.12,18.49,11.85,36.75,27.52,81.08,51.06,19.85,10.76,14.51,24.27],
    'label': [3,3,3,3,3,3,3,3,3,3,3,3]
}
df = pd.DataFrame.from_dict(df)

def transform(dataframe, chunk_size=5):
    grouped = dataframe.groupby('id')
    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])
    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:
        inputs = group.loc[:, 'speed1':'acc2'].values
        label = group.loc[:, 'label'].values[0]
        # number of split points; the -1 avoids creating an empty trailing split
        N = (len(inputs) - 1) // chunk_size
        if N > 0:
            inputs = np.array_split(
                inputs, [chunk_size + (chunk_size * i) for i in range(N)])
        else:
            inputs = [inputs]
        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size - len(inpt)), (0, 0)],
                mode='constant')
            # add each split to the accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0)
    return X, y
X, y = transform(df)
print('X shape =', X.shape)
print('X =', X)
print('Y shape =', y.shape)
print('Y =', y)
# >> out:
# X shape = (3, 1, 5, 4)
# X = [[[[17.63 0. 0. 29.03]
# [17.63 -0.09 0.01 56.12]
# [ 0.17 1.24 -2.04 18.49]
# [ 1.41 -0.8 0.51 11.85]
# [ 0.61 -0.29 0.15 36.75]]]
#
#
# [[[ 0.32 -0.14 0.39 27.52]
# [ 0.18 0.25 -0.38 81.08]
# [ 0.43 -0.13 0.29 51.06]
# [ 0.3 0.16 0.13 19.85]
# [ 0.46 0.29 -0.67 10.76]]]
#
#
# [[[ 0.75 -0.38 0.65 14.51]
# [ 0.37 0.27 0.52 24.27]
# [ 0. 0. 0. 0. ]
# [ 0. 0. 0. 0. ]
# [ 0. 0. 0. 0. ]]]]
# Y shape = (3,)
# Y = [3. 3. 3.]
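For completeness, a minimal sketch of feeding X and y into a network, assuming TensorFlow 2.x Keras; the layer sizes and loss are illustrative choices, not part of the question:
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(1, 5, 4)),  # one padded segment per sample
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1)  # illustrative head; match this to your real task
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=2, batch_size=2)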

Related

In python, how to quickly set the value of each row of a 2 dimensional array by condition on the corresponding element of a one dimensional array?

For instance, I have the following numpy arrays:
a = numpy.array([[1, 3, 3], [2, 5, 5], [3, 7, 7]])
b = numpy.array([1, 2, 3])
I want to write a piece of code that will emulate:
a[a == b] = 0
so that the output will be:
[[0, 3, 3], [0, 5, 5], [0, 7, 7]]
How can I achieve this without a for loop? This is just an example; in reality the arrays are very large and a for loop takes too much time to run.
You could do the following:
import numpy as np

a = np.array([[1, 3, 3], [2, 5, 5], [3, 7, 7]])
b = np.array([1, 2, 3])

def f(b, a):
    # compare one row of a against the corresponding scalar of b; zero out matches
    return np.where(a == b, 0, a)

print(np.array([*map(f, b, a)]))
which gives:
[[0 3 3]
[0 5 5]
[0 7 7]]
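A fully vectorized alternative (a sketch, no map needed): reshape b into a column so broadcasting compares each row of a against its own element of b:
import numpy as np

a = np.array([[1, 3, 3], [2, 5, 5], [3, 7, 7]])
b = np.array([1, 2, 3])

# b[:, None] has shape (3, 1), so a == b[:, None] compares row i of a with b[i]
result = np.where(a == b[:, None], 0, a)
print(result)
# [[0 3 3]
#  [0 5 5]
#  [0 7 7]]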

Map each tensor value to the closest value in a list

I have a tensor A with size [batchSize,2,2,2] where batchSize is a placeholder. In a custom layer, I would like to map each value of this tensor to the closest value in a list c with length n. The list is my codebook and I would like to quantize each value in the tensor based on this codebook; i.e. find the closest value to each tensor value in the list and replace the tensor value with that.
I could not figure out a 'clean' tensor operation that will quickly do that. I can not loop over the batchSize. Is there a method to do this in Tensorflow?
If I understand correctly, this is doable with tf.contrib.lookup.HashTable. As an illustration, I used a normal distribution with mean=0, stddev=4.
a = tf.random.normal(
    shape=[batch, 2, 2, 2],
    mean=0.0,
    stddev=4
)
And I used a quantization with only 5 buckets, numbered 0 through 4. This is extensible to any length n. Note I intentionally made the buckets have variable width.
My codebook is therefore:
a <= -2          -> bucket 4
-2 < a < -0.5    -> bucket 3
-0.5 <= a < 0.5  -> bucket 0
0.5 <= a < 2.5   -> bucket 1
a >= 2.5         -> bucket 2
The idea is to pre-create a key/value mapping from a scaled a to the bucket number. (The number of <key, value> pairs depends on the input granularity you need; here I scaled by 10.) Below is the code to initialize the mapping table and the produced mapping (input scaled by 10).
# The boundaries are chosen based on clipping by min=-4, max=4;
# after scaling, the boundaries become -40 and 40.
keys = range(-40, 41)
values = []
for k in keys:
    if k <= -20:
        values.append(4)
    elif k < -5:
        values.append(3)
    elif k < 5:
        values.append(0)
    elif k < 25:
        values.append(1)
    else:
        values.append(2)

for (k, v) in zip(keys, values):
    print("%2d -> %2d" % (k, v))
-40 -> 4
-39 -> 4
...
-22 -> 4
-21 -> 4
-20 -> 4
-19 -> 3
-18 -> 3
...
-7 -> 3
-6 -> 3
-5 -> 0
-4 -> 0
...
3 -> 0
4 -> 0
5 -> 1
6 -> 1
...
23 -> 1
24 -> 1
25 -> 2
26 -> 2
...
40 -> 2
batch = 3
a = tf.random.normal(
    shape=[batch, 2, 2, 2],
    mean=0.0,
    stddev=4,
    dtype=tf.dtypes.float32
)
clip_a = tf.clip_by_value(a, clip_value_min=-4, clip_value_max=4)
SCALE = 10
scaled_clip_a = tf.cast(clip_a * SCALE, tf.int32)
table = tf.contrib.lookup.HashTable(
    tf.contrib.lookup.KeyValueTensorInitializer(keys, values), -1)
quantized_a = tf.reshape(
    table.lookup(tf.reshape(scaled_clip_a, [-1])),
    [batch, 2, 2, 2])

with tf.Session() as sess:
    table.init.run()
    a, clip_a, scaled_clip_a, quantized_a = sess.run(
        [a, clip_a, scaled_clip_a, quantized_a])
    print('a\n%s' % a)
    print('clip_a\n%s' % clip_a)
    print('scaled_clip_a\n%s' % scaled_clip_a)
    print('quantized_a\n%s' % quantized_a)
Result:
a
[[[[-0.26980758 -5.56331968]
[ 5.04240322 -7.18292665]]
[[-7.11545467 -3.24369478]
[ 1.01861215 -0.04510783]]]
[[[-0.28768024 0.2472897 ]
[ 2.17780781 -5.79106379]]
[[ 8.45582008 4.53902292]
[ 0.138162 -6.19155598]]]
[[[-7.5134449 4.56302166]
[-0.30592337 -0.60313278]]
[[-0.06204566 3.42917275]
[-1.14547718 3.31167102]]]]
clip_a
[[[[-0.26980758 -4. ]
[ 4. -4. ]]
[[-4. -3.24369478]
[ 1.01861215 -0.04510783]]]
[[[-0.28768024 0.2472897 ]
[ 2.17780781 -4. ]]
[[ 4. 4. ]
[ 0.138162 -4. ]]]
[[[-4. 4. ]
[-0.30592337 -0.60313278]]
[[-0.06204566 3.42917275]
[-1.14547718 3.31167102]]]]
scaled_clip_a
[[[[ -2 -40]
[ 40 -40]]
[[-40 -32]
[ 10 0]]]
[[[ -2 2]
[ 21 -40]]
[[ 40 40]
[ 1 -40]]]
[[[-40 40]
[ -3 -6]]
[[ 0 34]
[-11 33]]]]
quantized_a
[[[[0 4]
[2 4]]
[[4 4]
[1 0]]]
[[[0 0]
[1 4]]
[[2 2]
[0 4]]]
[[[4 2]
[0 3]]
[[0 2]
[3 2]]]]
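If the codebook is a list of arbitrary target values rather than bucket boundaries, a broadcasting alternative avoids the lookup table entirely. A sketch (the codebook values here are made up; in TF 1.x the tensors would still need a session to evaluate):
import tensorflow as tf

a = tf.random.normal(shape=[3, 2, 2, 2], mean=0.0, stddev=4.0)
c = tf.constant([-3.0, -1.0, 0.0, 1.5, 3.0])  # hypothetical codebook, length n=5

# [..., 1] against [n] broadcasts to distances of shape [..., n]
dists = tf.abs(tf.expand_dims(a, -1) - c)
idx = tf.argmin(dists, axis=-1)   # index of the closest codebook entry
quantized = tf.gather(c, idx)     # replace each value with that entry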

Weighting Data Using Numpy

My data looks like:
list=[44359, 16610, 8364, ..., 1, 1, 1]
For each element in the list I want to compute x[i] * (x[i+1] + x[i-1]) / 2, where x[i] is an element of the list and x[i+1] and x[i-1] are its adjacent elements.
For some reason I cannot seem to do this cleanly in NumPy.
Here's what I've tried:
weights = []
weights.append(1)
for i in range(len(hoff[3]) - 1):
    weights.append((hoff[3][i-1] + hoff[3][i+1]) / 2)
where I append 1 to the weights list so that lengths will match at the end. I arbitrarily picked 1; I'm not sure how to deal with the leftmost and rightmost points either.
You can use numpy's array operations to represent your "loop". If you think of the data as below, where pL and pR are the values you choose to "pad" your data with on the left and right:
[pL, 0, 1, 2, ..., N-2, N-1, pR]
What you're trying to do is this:
[0, ..., N - 1] * ([pL, 0, ..., N-2] + [1, ..., N -1, pR]) / 2
Written in code it looks something like this:
import numpy as np
data = np.random.random(10)
padded = np.concatenate(([data[0]], data, [data[-1]]))
data * (padded[:-2] + padded[2:]) / 2.
Repeating the first and last value is known as "extending" in image processing, but there are other edge handling methods you could try.
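For instance, np.pad can express the extension directly, and switching its mode gives the other edge treatments (a sketch):
import numpy as np

data = np.random.random(10)

# mode='edge' repeats the boundary values ("extending")
padded = np.pad(data, 1, mode='edge')
weights = data * (padded[:-2] + padded[2:]) / 2.0

# other choices: mode='reflect' (mirror) or mode='constant' (zeros)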
I would use pandas for this, filling in the missing left- and right-most values with 1 (but you can use any value you want):
import numpy
import pandas

numpy.random.seed(0)
data = numpy.random.randint(0, 10, size=15)
df = (
    pandas.DataFrame({'hoff': data})
    .assign(before=lambda df: df['hoff'].shift(1).fillna(1).astype(int))
    .assign(after=lambda df: df['hoff'].shift(-1).fillna(1).astype(int))
    .assign(weight=lambda df: df['hoff'] * df[['before', 'after']].mean(axis=1))
)
print(df.to_string(index=False))
And that gives me:
hoff before after weight
5 1 0 2.5
0 5 3 0.0
3 0 3 4.5
3 3 7 15.0
7 3 9 42.0
9 7 3 45.0
3 9 5 21.0
5 3 2 12.5
2 5 4 9.0
4 2 7 18.0
7 4 6 35.0
6 7 8 45.0
8 6 8 56.0
8 8 1 36.0
1 8 1 4.5
A pure numpy-based solution would look like this (again, filling with 1):
before_after = numpy.ones((data.shape[0], 2))
before_after[1:, 0] = data[:-1]
before_after[:-1, 1] = data[1:]
weights = data * before_after.mean(axis=1)
print(weights)
array([ 2.5, 0. , 4.5, 15. , 42. , 45. , 21. , 12.5, 9. ,
18. , 35. , 45. , 56. , 36. , 4.5])
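The neighbor average can also be phrased as a convolution; a sketch, noting that np.convolve's mode='same' implicitly pads the edges with zeros, so the endpoints differ from the fill-with-1 versions above:
import numpy as np

np.random.seed(0)
data = np.random.randint(0, 10, size=15).astype(float)

# the kernel [0.5, 0, 0.5] averages each element's two neighbors
neighbor_mean = np.convolve(data, [0.5, 0.0, 0.5], mode='same')
weights = data * neighbor_mean
print(weights)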

FutureWarning when applying a condition on a pandas dataframe to filter an array

I have applied PCA to an array of around 1000 observations, but I only want to keep an observation in the new array IF one of the features from the original array equals something.
I have a numpy array df2 and a dataframe df. I want to find all rows in df2 where df.Position is CDM.
My actual data:
df2
[[ -6.00987823e+00 4.46585005e+00]
[ -7.09055159e+00 1.89437600e+00]
[ -5.91044431e+00 -1.97888707e+00]
[ -4.85698965e+00 -1.09936724e+00]
[ -4.01780368e-01 -2.57178392e+00]
[ -2.97351215e+00 -3.15940358e+00]
[ -4.27973589e+00 2.82707326e+00]
[ 3.95086576e+00 1.08281922e+00]
[ -2.94075361e+00 -1.95544661e+00]
[ -4.83788056e+00 2.32369496e+00]
[ -5.00473716e+00 -3.37680552e-01]
[ -4.88905829e+00 -1.55527476e+00]
[ -3.38202709e+00 -1.04402867e+00]
[ -2.14261510e+00 -5.30757477e-01]
[ 3.00813803e-01 -2.11010985e+00]
[ -2.67824986e+00 -1.83303905e+00]
[ -1.64547049e+00 -2.48056250e+00]
[ -2.92550543e+00 -3.02363170e+00]
[ -4.01116933e+00 2.90363840e+00]
[ -1.04571206e+00 7.58064433e-01]
[ 2.34068739e-01 -2.33981296e+00]
[ 3.15597517e+00 1.09429188e+00]
[ -3.83828970e+00 1.14195305e-01]
[ -7.33794066e-01 -3.70152816e+00]
[ 8.21789967e-01 -4.77818413e-01]
[ -3.29257688e+00 -1.61887349e+00]
[ -4.24297171e+00 2.27187714e+00]
[ 1.45714199e+00 -3.56024788e+00]
[ 1.79855738e+00 -3.71818328e-01]
[ 3.68171085e-01 -3.52961707e+00]
[ 3.77585412e+00 -3.01627595e-01]
[ -4.21740128e+00 -1.30913719e+00]
[ -3.85041585e+00 -1.05515969e+00]
[ -5.01752378e+00 4.67348167e-01]
[ 3.65943448e+00 9.21016483e-01]
[ 3.12159896e+00 -1.25707872e-01]
[ -4.50219722e+00 -4.06752784e+00]
[ -3.92172250e+00 -2.88567430e+00]
[ -2.68908475e-01 -2.17506629e+00]
[ -1.13728112e+00 -2.66843007e+00]
[ -8.73467957e-01 -1.24389494e+00]
[ 3.21966300e+00 -1.35271239e-01]
[ -4.31060796e+00 -1.90505910e+00]
[ 3.73904981e+00 7.70228802e-01]
[ 1.02646986e+00 -5.91828676e-01]
[ 8.43840480e-01 -1.49636218e+00]
[ 1.54065978e+00 -1.65086030e+00]
[ 2.96602068e+00 -7.41024474e-01]
[ 6.53636345e-01 3.04647288e-01]
[ 2.59236989e+00 -6.70435261e-02]
[ 2.00184665e-01 -1.55230314e+00]
[ -7.29533092e-01 -2.73390749e+00]
[ -2.93578745e+00 -2.18118257e+00]
[ -4.37481195e+00 1.02701222e+00]
[ 1.00713302e+00 -1.39943282e+00]
...]
df
(simply playing position in football/soccer - FB, CB, CDM, CM, AM, FW)
Position
FW
FW
FW
FW
FB
AM
FW
CB
AM
FW
AM
FW
AM
CM
FB
AM
CM
CM
FW
CM
CDM
CB
AM
FB
CDM
FW
FW
CDM
FB
CDM
CB
AM
...
AM
When filtering, I get unexpected output along with a FutureWarning. Where am I going wrong, and how can I filter the data appropriately?
The FutureWarning is probably a result of your numpy and pandas versions being out of date. You can upgrade them using:
pip install --upgrade numpy pandas
As for the filtering, there are quite a few options. Here I mention each one with some dummy data.
Setup
df

    name colour  a  b  c  d  e  f
0   john    red  1  2  3  4  5  6
1  james    red  2  3  4  5  6  7
2   jane   blue  1  2  3  5  7  8

df2

       0      1
0  0.122  0.222
1  0.343  0.345
2  0.345  0.563
Option 1: boolean indexing

df2[df.colour == 'red']
Out[726]:
       0      1
0  0.122  0.222
1  0.343  0.345

Option 2: df.eval

df2[df.eval('colour == "red"')]
Out[732]:
       0      1
0  0.122  0.222
1  0.343  0.345
Note that both these options work even if df2 is a numpy array of the form:
array([[ 0.122, 0.222],
[ 0.343, 0.345],
[ 0.345, 0.563]])
For your actual data, you'll need to do something along the same lines:
df2
array([[-6.01 , 4.466],
[-7.091, 1.894],
[-5.91 , -1.979],
[-4.857, -1.099],
[-0.402, -2.572],
[-2.974, -3.159],
[-4.28 , 2.827],
[ 3.951, 1.083],
[-2.941, -1.955],
[-4.838, 2.324],
[-5.005, -0.338],
[-4.889, -1.555],
[-3.382, -1.044],
[-2.143, -0.531],
[ 0.301, -2.11 ],
[-2.678, -1.833],
[-1.645, -2.481],
[-2.926, -3.024],
[-4.011, 2.904],
[-1.046, 0.758],
[ 0.234, -2.34 ],
[ 3.156, 1.094],
[-3.838, 0.114],
[-0.734, -3.702],
[ 0.822, -0.478],
[-3.293, -1.619],
[-4.243, 2.272],
[ 1.457, -3.56 ],
[ 1.799, -0.372],
[ 0.368, -3.53 ],
[ 3.776, -0.302],
[-4.217, -1.309]])
df
Position
0 FW
1 FW
2 FW
3 FW
4 FB
5 AM
6 FW
7 CB
8 AM
9 FW
10 AM
11 FW
12 AM
13 CM
14 FB
15 AM
16 CM
17 CM
18 FW
19 CM
20 CDM
21 CB
22 AM
23 FB
24 CDM
25 FW
26 FW
27 CDM
28 FB
29 CDM
30 CB
31 AM
df2[df.Position == 'CDM']
array([[ 0.234, -2.34 ],
[ 0.822, -0.478],
[ 1.457, -3.56 ],
[ 0.368, -3.53 ]])
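As an aside, if indexing a numpy array with a pandas boolean Series ever raises a FutureWarning on some version combinations, converting the mask to a plain ndarray first sidesteps it. A minimal sketch (the small arrays are just stand-ins):
import numpy as np
import pandas as pd

df2 = np.array([[0.234, -2.34], [0.822, -0.478], [1.457, -3.56]])
df = pd.DataFrame({'Position': ['CDM', 'FW', 'CDM']})

mask = (df['Position'] == 'CDM').to_numpy()  # use .values on older pandas
print(df2[mask])  # keeps only the CDM rows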
I think you need boolean indexing:
from sklearn.decomposition import PCA
import pandas as pd

d = {'d': [4, 5, 5],
     'a': [1, 2, 1],
     'name': ['john', 'james', 'jane'],
     'e': [5, 6, 7],
     'f': [6, 7, 8],
     'c': [3, 4, 3],
     'b': [2, 3, 2],
     'colour': ['red', 'red', 'blue']}
cols = ['name', 'colour', 'a', 'b', 'c', 'd', 'e', 'f']
df = pd.DataFrame(d, columns=cols)
print(df)
    name colour  a  b  c  d  e  f
0   john    red  1  2  3  4  5  6
1  james    red  2  3  4  5  6  7
2   jane   blue  1  2  3  5  7  8
#create mask by condition
mask = df['colour'] == 'red'
#for multiple values
#mask = df['colour'].isin(['red', 'green', 'blue'])
print (mask)
0 True
1 True
2 False
Name: colour, dtype: bool
#filter only numeric values and convert to numpy array
arr = df.drop(['name','colour'], axis=1).values
print (arr)
[[1 2 3 4 5 6]
[2 3 4 5 6 7]
[1 2 3 5 7 8]]
pca = PCA(n_components=3)  # n_components can be at most min(n_samples, n_features)
pca.fit(arr)
print (pca.components_ )
[[-0.0463861 -0.0463861 -0.0463861 -0.35279184 -0.65919758 -0.65919758]
[ 0.55515147 0.55515147 0.55515147 0.21897879 -0.11719389 -0.11719389]
[ 0.62531284 -0.13184966 -0.136648 -0.71363037 0.17840759 0.17840759]]
#filter by condition
arr1 = pca.components_[mask]
print (arr1)
[[-0.0463861 -0.0463861 -0.0463861 -0.35279184 -0.65919758 -0.65919758]
[ 0.55515147 0.55515147 0.55515147 0.21897879 -0.11719389 -0.11719389]]

Create all x,y pairs from two coordinate arrays

I have 4 lists that I need to iterate over so that I get the following:
x y a b
Lists a and b are of equal length, and I iterate over both using the zip function:
for a, b in zip(aL, bL):
    print(a, "\t", b)
List x contains 1000 items and list y contains 750 items; after the loop is finished I am supposed to have 750,000 lines.
What I want to achieve is the following:
1 1 a b
1 2 a b
1 3 a b
1 4 a b
.....
1000 745 a b
1000 746 a b
1000 747 a b
1000 748 a b
1000 749 a b
1000 750 a b
How can I achieve this? I have tried enumerate and izip but both results are not what I am seeking.
Thanks.
EDIT:
I have followed your code and used it since it is way faster. My output now looks like this:
[[[ 0.00000000e+00 0.00000000e+00 4.00000000e+01 2.30000000e+01]
[ 1.00000000e+00 0.00000000e+00 8.50000000e+01 1.40000000e+01]
[ 2.00000000e+00 0.00000000e+00 7.20000000e+01 2.00000000e+00]
...,
[ 1.44600000e+03 0.00000000e+00 9.20000000e+01 4.60000000e+01]
[ 1.44700000e+03 0.00000000e+00 5.00000000e+01 6.10000000e+01]
[ 1.44800000e+03 0.00000000e+00 8.40000000e+01 9.40000000e+01]]]
I now have 750 lists, each of which contains another 1000. I have tried to flatten those to get 4 values (x, y, a, b) per line, but this just takes forever. Is there another way to flatten them?
EDIT2
I have tried
np.fromiter(itertools.chain.from_iterable(arr), dtype='int')
but it gave an error: setting an array element with a sequence, so I tried
np.fromiter(itertools.chain.from_iterable(arr[0]), dtype='int')
but this just gave one list back with what I suspect is the whole first list in the array.
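Regarding the flattening in the edits above: since the stacked result is a regular 3-d numeric array, a plain reshape flattens it with no Python-level loop. A sketch, assuming the (750, 1000, 4) shape described in the first edit:
import numpy as np

arr = np.zeros((750, 1000, 4))  # stand-in for the stacked result

flat = arr.reshape(-1, 4)       # one (x, y, a, b) row per line
print(flat.shape)               # (750000, 4)
np.savetxt('out.out', flat, delimiter=' ')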
EDIT v2
Now using np.stack instead of np.dstack, and handling file output.
This is considerably simpler than the solutions proposed below.
import numpy as np
import numpy.random as nprnd
aL = nprnd.randint(0,100,size=10) # 10 random ints
bL = nprnd.randint(0,100,size=10) # 10 random ints
xL = np.linspace(0,100,num=5) # 5 evenly spaced values
yL = np.linspace(0,100,num=2) # 2 evenly spaced values
xv,yv = np.meshgrid(xL,yL)
arr = np.stack((np.ravel(xv), np.ravel(yv), aL, bL), axis=-1)
np.savetxt('out.out', arr, delimiter=' ')
Using np.meshgrid gives us the following two arrays:
xv = [[ 0. 25. 50. 75. 100.]
[ 0. 25. 50. 75. 100.]]
yv = [[ 0. 0. 0. 0. 0.]
[ 100. 100. 100. 100. 100.]]
which, when we ravel, become:
np.ravel(xv) = [ 0. 25. 50. 75. 100. 0. 25. 50. 75. 100.]
np.ravel(yv) = [ 0. 0. 0. 0. 0. 100. 100. 100. 100. 100.]
These arrays have the same shape as aL and bL,
aL = [74 79 92 63 47 49 18 81 74 32]
bL = [15 9 81 44 90 93 24 90 51 68]
so all that's left is to stack all four arrays along axis=-1:
arr = np.stack((np.ravel(xv), np.ravel(yv), aL, bL), axis=-1)
arr = [[ 0. 0. 62. 41.]
[ 25. 0. 4. 42.]
[ 50. 0. 94. 71.]
[ 75. 0. 24. 91.]
[ 100. 0. 10. 55.]
[ 0. 100. 41. 81.]
[ 25. 100. 67. 11.]
[ 50. 100. 21. 80.]
[ 75. 100. 63. 37.]
[ 100. 100. 27. 2.]]
From here, saving is trivial:
np.savetxt('out.out', arr, delimiter=' ')
ORIGINAL ANSWER
idx = 0
out = []
for x in xL:
    for y in yL:
        v1 = aL[idx]
        v2 = bL[idx]
        out.append((x, y, v1, v2))
        # print(x, y, v1, v2)
        idx += 1
But it's slow, and only gets slower with more coordinates. I'd consider using the numpy package instead. Here's an example with a 2 x 5 dataset.
aL = nprnd.randint(0,100,size=10) # 10 random ints
bL = nprnd.randint(0,100,size=10) # 10 random ints
xL = np.linspace(0,100,num=5) # 5 evenly spaced ints
yL = np.linspace(0,100,num=2) # 2 evenly spaced ints
lenx = len(xL) # 5
leny = len(yL) # 2
arr = np.ndarray(shape=(leny,lenx,4)) # create a 3-d array
this creates an 3-dimensional array having a shape of 2 rows x 5 columns. On the third axis (length 4) we populate the array with the data you want.
for x in range(leny):
    arr[x,:,0] = xL
This syntax is a little confusing at first; you can learn more about it here. In short, it iterates over the rows and sets a particular slice of the array to xL. In this case, the slice we have selected is the zeroth index in all columns of row x (the : means "select all indices on this axis"). For our small example, this would yield:
[[[ 0 0 0 0]
[ 25 0 0 0]
[ 50 0 0 0]
[ 75 0 0 0]
[100 0 0 0]]
[[ 0 0 0 0]
[ 25 0 0 0]
[ 50 0 0 0]
[ 75 0 0 0]
[100 0 0 0]]]
now we do the same for each column:
for y in range(lenx):
    arr[:,y,1] = yL
-----
[[[ 0 0 0 0]
[ 25 0 0 0]
[ 50 0 0 0]
[ 75 0 0 0]
[100 0 0 0]]
[[ 0 100 0 0]
[ 25 100 0 0]
[ 50 100 0 0]
[ 75 100 0 0]
[100 100 0 0]]]
now we need to address arrays aL and bL. these arrays are flat, so we must first reshape them to conform to the shape of arr. In our simple example, this would take an array of length 10 and reshape it into a 2 x 5 2-dimensional array.
a_reshaped = aL.reshape(leny,lenx)
b_reshaped = bL.reshape(leny,lenx)
to insert the reshaped arrays into our arr, we select the 2nd and 3rd index for all rows and all columns (note the two :'s this time):
arr[:,:,2] = a_reshaped
arr[:,:,3] = b_reshaped
----
[[[ 0 0 3 38]
[ 25 0 63 89]
[ 50 0 4 25]
[ 75 0 72 1]
[100 0 24 83]]
[[ 0 100 55 85]
[ 25 100 39 9]
[ 50 100 43 85]
[ 75 100 63 57]
[100 100 6 63]]]
This runs considerably faster than the nested-loop solution. Hope it helps!
Sounds like you need a nested loop over x and y:
for x in xL:
    for y in yL:
        for a, b in zip(aL, bL):
            print("%d\t%d\t%s\t%s" % (x, y, a, b))
Try this:
for i, j in zip(zip(a, b), zip(c, d)):
    print("%d\t%d\t%s\t%s" % (i[0], i[1], j[0], j[1]))
