Merge 3D numpy array into pandas Dataframe + 1D vector - python

I have a dataset which is a numpy array with shape (1536 x 16 x 48). A quick explanation of these dimensions might be helpful:
The dataset consists of data collected by EEG sensors at a 256 Hz sampling rate (1 second = 256 measures/values);
1536 values represent 6 seconds of EEG data (256 * 6 = 1536);
16 is the number of electrodes used to collect data;
48 is the number of samples.
In summary: I have 48 samples of 6 seconds (1536 values each) of EEG data, collected by 16 electrodes.
I need to create a pandas dataframe with all this data, and therefore turn this 3D array into 2D. The depth dimension (48) can be removed if I stack all samples one on top of another. So the new dataset will be shaped (1536 * 48) x 16.
In addition to that, since this is a classification problem, I have a vector with 48 values that represents the class of each EEG sample. The new dataset should also have this as a "class" column, and then the real shape would be: (1536 * 48) x 16 + 1 (class).
I could easily do that by looping through the depth dimension of the 3D array and concatenating everything into a new 2D one. But this looks bad since I will be dealing with many datasets like this one, and performance is an issue. I would like to know if there's a cleverer way of doing it.
I tried to provide as much information as I could for this question, but since it is not a trivial task, feel free to ask for further details if needed.
Thanks in advance.

Setup
>>> import numpy as np
>>> import pandas as pd
>>> a = np.zeros((4,3,3),dtype=int) + [0,1,2]
>>> a *= 10
>>> a += np.array([1,2,3,4])[:,None,None]
>>> a
array([[[ 1, 11, 21],
        [ 1, 11, 21],
        [ 1, 11, 21]],

       [[ 2, 12, 22],
        [ 2, 12, 22],
        [ 2, 12, 22]],

       [[ 3, 13, 23],
        [ 3, 13, 23],
        [ 3, 13, 23]],

       [[ 4, 14, 24],
        [ 4, 14, 24],
        [ 4, 14, 24]]])
Split evenly along the last dimension; stack those elements, reshape, feed to DataFrame. Using the lengths of the array's dimensions simplifies the process.
>>> d0,d1,d2 = a.shape
>>> pd.DataFrame(np.stack(np.dsplit(a,d2)).reshape(d0*d2,d1))
     0   1   2
0    1   1   1
1    2   2   2
2    3   3   3
3    4   4   4
4   11  11  11
5   12  12  12
6   13  13  13
7   14  14  14
8   21  21  21
9   22  22  22
10  23  23  23
11  24  24  24
>>>
Using your shape.
>>> b = np.random.random((1536, 16, 48))
>>> d0,d1,d2 = b.shape
>>> df = pd.DataFrame(np.stack(np.dsplit(b,d2)).reshape(d0*d2,d1))
>>> df.shape
(73728, 16)
>>>
After making the DataFrame from the 3d array, add the classification column to it with df['class'] = data (see "Column selection, addition, deletion" in the pandas docs).
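Since the class vector holds one label per sample (48 values) while the DataFrame has 1536 * 48 rows, each label has to be expanded to cover its sample's rows first. A minimal sketch, assuming the row order produced above (all 1536 rows of sample 0, then sample 1, and so on) and a hypothetical labels vector:
labels = np.random.randint(0, 2, d2)   # hypothetical: one class label per sample
df['class'] = np.repeat(labels, d0)    # repeat each label 1536 times to match the stacked rows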

For the numpy part
x = np.random.random((1536, 16, 48))  # ndarray with a similar shape
x = x.swapaxes(1, 2)                  # swap axes 1 and 2, i.e. 16 and 48
x = x.reshape((-1, 16), order='C')    # order is important, you may want to check the docs
c = np.zeros((x.shape[0], 1))         # class column, shape=(73728, 1)
x = np.hstack((x, c))                 # final dataset
x.shape
Output
(73728, 17)
or in one line
x = np.hstack((x.swapaxes(1,2).reshape((-1, 16), order='C'), c))
Finally,
x = pd.DataFrame(x)
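Note that with this reshape the 48 samples are interleaved per time step (the sample index varies fastest along the rows). So if the real 48-element class vector were used instead of the zeros placeholder, it would have to be tiled rather than repeated; a small sketch under that assumption, with a hypothetical labels vector:
labels = np.arange(48)                    # hypothetical: one class label per sample
c = np.tile(labels, 1536).reshape(-1, 1)  # shape (73728, 1), replaces the zeros placeholder above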

Related

How to manually select values from an array

For example I have a matrix array
a = np.arange(25).reshape(5, 5)
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
How do I make a 1D array of elements that I would like to choose manually? For example [2,3], [4,1], [1,0] and [2,2], so I get the following:
b = [13, 21, 5, 12]
The array b should be a reference rather than a copy.
You can make a function for this.
import numpy as np

# defining the function
def get_value(matrix, row_list, col_list):
    # fancy indexing with paired row/column lists picks one element per (row, col) pair
    return matrix[row_list, col_list]

# initializing the array
a = np.arange(0, 25, 1).reshape(5, 5)
# getting the required values and printing
b = get_value(a, [2, 4, 1, 2], [3, 1, 0, 2])
# output: [13 21  5 12]
print(b)
Edit
I'll leave the previous answer as is, in case anyone else stumbles upon it and needs it.
What the question wants is to take a value from b (e.g. b[0], which is 13), find where that value sits in the original matrix a, and change it there.
def change_the_value(old_mat, val_to_change, new_val):
    # find the (row, col) coordinates of the first occurrence and overwrite it in place
    mat_coor = np.argwhere(old_mat == val_to_change)[0]
    old_mat[mat_coor[0], mat_coor[1]] = new_val

a = np.arange(0, 25, 1).reshape(5, 5)
b = [13, 16, 5, 12]
change_the_value(a, b[0], 0)
a = np.arange(25).reshape(5, 5)
search = [[2, 3], [4, 1], [1, 0], [2, 2]]
for row, col in search:
    print(row, col, a[row][col])
output:
r c result
2 3 13
4 1 21
1 0 5
2 2 12
First of all, I've found that constructing a non-contiguous view of a NumPy array is not natively possible, because NumPy relies on the regular, strided memory layout of an array, which is what enables its dramatic speed.
Here's a solution I found that works the best so far:
Instead of having a view of the array, I construct a collection of the indices I would like to process: [2,3], [4,1], [1,0], [2,2].
The collection type I chose is a set, because it excludes duplicates and its add and discard methods do not require a search. Keeping order was not necessary.
To use it for indexing an array, it has to be converted from a set of tuples {(2,3), (4,1), (1,0), (2,2)} to a tuple of arrays (array([2, 4, 1, 2]), array([3, 1, 0, 2])).
This can be achieved by unzipping the set and constructing a tuple of arrays:
import numpy as np
a = np.arange(25).reshape(5, 5)
>>> [[ 0  1  2  3  4]
     [ 5  6  7  8  9]
     [10 11 12 13 14]
     [15 16 17 18 19]
     [20 21 22 23 24]]
my_set = {(2,3),(4,1),(1,0),(2,2)}
uzip_set = list(zip(*my_set))
seq_from_set = (np.asarray(uzip_set[0]),np.asarray(uzip_set[1]))
print(seq_from_set)
>>> (array([2, 4, 1, 2]), array([3, 1, 0, 2]))
And array a can be manipulated by providing such a sequence of indices:
b = a[seq_from_set]
print(b)
>>> [13 21  5 12]
a[seq_from_set] = 0
print(a)
>>> [[ 0  1  2  3  4]
     [ 0  6  7  8  9]
     [10 11  0  0 14]
     [15 16 17 18 19]
     [20  0 22 23 24]]
The solution is a bit more elaborate than something native, but it works surprisingly fast. It allows easy management of the collection of indices and quick conversion to index arrays on demand.
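For example, the set can be updated and the index arrays rebuilt whenever the tracked positions change (the coordinates added and discarded below are arbitrary, just to illustrate the workflow):
my_set.add((0, 4))        # start tracking another cell
my_set.discard((1, 0))    # stop tracking one
rows, cols = (np.asarray(idx) for idx in zip(*my_set))
print(a[rows, cols])      # values at the currently tracked positions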

Why does indexing a numpy array using an array change the shape?

I'm trying to index a 2-dimensional array down to certain values using numpy.where(), but unless I index along the first axis without a : slice, it always adds a dimension. I can't seem to find an explanation for this in the documentation.
For example, say I have an array a:
a = np.arange(20)
a = np.reshape(a,(4,5))
print("a = ",a)
print("a shape = ", a.shape)
Output:
a =  [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]]
a shape = (4, 5)
If I have two indexing arrays, one in the 'x' direction and one in the 'y' direction:
x = np.arange(5)
y = np.arange(4)
xindx = np.where((x>=2)&(x<=4))
yindx = np.where((y>=1)&(y<=2))
and then index a using the 'y' index like so, there's no problem:
print(a[yindx])
print(a[yindx].shape)
Output:
[[ 5 6 7 8 9]
[10 11 12 13 14]]
(2, 5)
But if I have : in one of the indices then I have an extra dimension of size 1:
print(a[yindx,:])
print(a[yindx,:].shape)
print(a[:,xindx])
print(a[:,xindx].shape)
Output:
[[[ 5  6  7  8  9]
  [10 11 12 13 14]]]
(1, 2, 5)
[[[ 2  3  4]]

 [[ 7  8  9]]

 [[12 13 14]]

 [[17 18 19]]]
(4, 1, 3)
I run into this issue with one-dimensional arrays, too. How do I fix this?
If xindx and yindx were numpy arrays, the result would be as expected. However, they are tuples, each containing a single array.
Easiest (and pretty dumb) fix:
xindx = np.where((x>=2)&(x<=4))[0]
yindx = np.where((y>=1)&(y<=2))[0]
With only the condition given, np.where will return indices of matching elements in a tuple. This use is explicitly discouraged in the documentation.
More realistically, you probably need something like:
xindx = np.arange(2, 5)
yindx = np.arange(1, 3)
... but it really depends on the context we don't see
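To make the tuple-versus-array point concrete, here is a small demonstration (the variable names mirror the question, and the printed shapes match the question's output):
import numpy as np

y = np.arange(4)
yindx = np.where((y >= 1) & (y <= 2))
print(type(yindx), yindx)       # <class 'tuple'> (array([1, 2]),)

a = np.arange(20).reshape(4, 5)
print(a[yindx, :].shape)        # (1, 2, 5): the nested tuple is treated as a (1, 2) index array
print(a[yindx[0], :].shape)     # (2, 5): a plain 1-D index array behaves as expected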

Summing numpy array blockwise to form a smaller array [duplicate]

This question already has answers here:
How to evaluate the sum of values within array blocks
(3 answers)
Closed 3 years ago.
We have a matrix N x N consisting of n x n blocks, so we have (N/n) x (N/n) blocks. We further divide it into large blocks so that each large block contains m x m of the smaller blocks. We then need to sum (block-wise) the smaller blocks inside each larger block. For example, in the figure below each A is n x n and m = 2.
(figure showing the block layout omitted)
What is the simplest and possibly fast way of doing that with numpy array?
One fast way of doing this is to reshape your (N, N) array into (m, n, m, n) and then sum along the axes of size m:
import numpy as np
m = 3
n = 2
N = m * n
arr = np.arange((N * N)).reshape((N, N))
print(arr)
# [[ 0 1 2 3 4 5]
# [ 6 7 8 9 10 11]
# [12 13 14 15 16 17]
# [18 19 20 21 22 23]
# [24 25 26 27 28 29]
# [30 31 32 33 34 35]]
reshaped = arr.reshape((m, n, m, n))
summed = np.sum(reshaped, axis=(0, 2))
print(summed)
# [[126 135]
# [180 189]]
# ...checking a couple of blocks
# the "first m" (index 0) identifies blocks along rows
# the "second m" (index 2) identifies blocks along columns
print(reshaped[0, :, 0, :])
# [[0 1]
# [6 7]]
print(reshaped[1, :, 2, :])
# [[16 17]
# [22 23]]
# ...manually checking that the (0, 0) element of `summed` is correct
sum([0, 2, 4, 12, 14, 16, 24, 26, 28])
# 126
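The question's more general setting (several large blocks, each made of m x m smaller n x n blocks) can be handled the same way with one extra pair of axes. A sketch, assuming N is divisible by m * n; K, A, blocks and result are names chosen for this sketch:
import numpy as np

n, m, K = 2, 2, 3        # small blocks are n x n, each large block holds m x m of them, K large blocks per side
N = K * m * n            # full matrix is N x N
A = np.arange(N * N).reshape(N, N)

# expose (large-block, small-block, within-block) axes for rows and columns,
# then sum over the two small-block axes of size m
blocks = A.reshape(K, m, n, K, m, n)
result = blocks.sum(axis=(1, 4)).reshape(K * n, K * n)   # one summed n x n block per large block
print(result.shape)      # (6, 6)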

Multiplying arrays with alternating column selection

I have got a file testforce.dat that contains values arranged in 9 columns and 3 rows. The columns represent:
p1 p2 p3 f1 f2 f3 r1 r2 r3
18 5 27 20 21 8 14 12 25
9 26 23 1 4 10 7 16 24
19 22 15 13 17 6 11 2 3
I have got 100 files of this fashion.
I now want to calculate for the file force_00000.dat the vector g = [sum(p1*f1), sum(p2*f2), sum(p3*f3)] but for the next file force_00001.dat the vector should use other columns h = [sum(p1*r1), sum(p2*r2), sum(p3*r3)].
At the moment I am using the glob function to read my files into arrays. It puts every row into one array.
I am not sure how to get my alternating array multiplication done and would appreciate any suggestions :)
import numpy as np
import glob
i = 100
for x in range(0, int(i)):
    ## turn x into a string and pad it with "0"s to get a fixed number of digits;
    y = str(x).zfill(5)
    ## the structure of the force file name is "force_[00000-00099]";
    files = sorted(glob.glob('.//results/force/force_%s.dat' % y))
    column_names = ('#position')
    print files
    ## load the file data into arrays
    arrays = [np.loadtxt(filename) for filename in files]
    print arrays
Edit: I tested loading the first file with:
b=np.array(arrays)
print b.shape
And I get (1,3,9) for the shape of my generated array.
Edit2: I had the idea to use "usecols" and then multiply the desired values:
xposition=[np.loadtxt(filename,usecols= (0,1,2)) for filename in files]
xforce1=[np.loadtxt(filename,usecols= (3,4,5)) for filename in files]
print xposition
print xforce1
xp=np.asarray(xposition)
xf1=np.asarray(xforce1)
print xp
g=np.multiply(xp,xf1)
print g
this generated the following output:
[[[ 360. 105. 216.]
[ 9. 104. 230.]
[ 247. 374. 90.]]]
which means I have (p11 and f11 being the values of the first row, p21 from second row...)
[[[p11*f11 p12*f12 p13*f13]
[p21*f21 p22*f22 p23*f23]
[p31*f31 p32*f32 p33*f33]]]
which seems like I am almost done, at least for one file. The desired g = (g1, g2, g3) should look like:
p11*f11+p21*f21+p31*f31= g1
p12*f12+p22*f22+p32*f32= g2
p13*f13+p23*f23+p33*f33= g3
Sorry if that is a totally newbie question but I am not so familiar with Python yet :)
For the issue with the alternating values I was thinking about using an if statement that checks whether "i" in the loop is an even number.
loadtxt returns an array. [loadtxt(name) for name in filenames] produces a list of arrays, one array per name. np.array([...]) produces an array from that list. If the individual arrays are all the same size, the resulting array will be 3d.
If you need to treat every other file differently you could access them as a set with indexing
arr[::2,...]
arr[1::2,...]
To multiply the 2 sets of columns from your example file:
In [558]: txt=b"""p1 p2 p3 f1 f2 f3 r1 r2 r3
...: 18 5 27 20 21 8 14 12 25
...: 9 26 23 1 4 10 7 16 24
...: 19 22 15 13 17 6 11 2 3"""
In [560]: arr = np.loadtxt(txt.splitlines(),skiprows=1,dtype=int)
In [561]: arr
Out[561]:
array([[18, 5, 27, 20, 21, 8, 14, 12, 25],
[ 9, 26, 23, 1, 4, 10, 7, 16, 24],
[19, 22, 15, 13, 17, 6, 11, 2, 3]])
In [562]: arr[:, 0:3]*arr[:, 3:6]
Out[562]:
array([[360, 105, 216],
[ 9, 104, 230],
[247, 374, 90]])
In [563]: arr[:, 0:3]*arr[:, 6:9]
Out[563]:
array([[252, 60, 675],
[ 63, 416, 552],
[209, 44, 45]])
If arr were a 3d array built from loading multiple files:
arr1 = arr[::2,...]
arr2 = arr[1::2,...]
arr1[:,:,0:3] * arr1[:,:,3:6]
etc
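Putting the pieces together for the stated goal (g as the per-column sums of the products, alternating between the f and r columns from file to file), a sketch assuming all 100 files are loaded into one 3d array and that even-numbered files (force_00000.dat, force_00002.dat, ...) use the f columns:
import glob
import numpy as np

files = sorted(glob.glob('.//results/force/force_*.dat'))
arr = np.array([np.loadtxt(name) for name in files])     # shape (100, 3, 9)

g = (arr[::2, :, 0:3] * arr[::2, :, 3:6]).sum(axis=1)    # p*f column sums for even-numbered files
h = (arr[1::2, :, 0:3] * arr[1::2, :, 6:9]).sum(axis=1)  # p*r column sums for odd-numbered files
# g and h each contain one row [g1, g2, g3] per file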

Shift several graphs under each other

I want to shift several graphs under each other. I read in the data as an array with 4 columns:
# load data for variable intensities
data_with_minimum = []
for i in [6, 12, 25, 50, 100]:
    data_with_minimum.append(np.loadtxt('data{0}.dat'.format(i)))
Then I search for a characteristic point, in this case a minimum in the first 5000 rows (I know that there always is a minimum), and save the indices.
# open arrays for minimum value and index
m = []
mi = []
for k in range(5):
    m.append(0)
    mi.append(0)
    # search minimum in first 5000 data points
    for i in range(5000):
        if m[k] > data_with_minimum[k][i,1]:
            m[k] = data_with_minimum[k][i,1]
            mi[k] = i
Lastly, I want to shift every minimum from the first column under each other:
# shift x-axis
for i in range(30000 - m_max):
    for k in range(5):
        data_with_minimum[k][i,1] = data_with_minimum[k][i+(mi[k]-min(mi)),1]
Unfortunately this is not working, because the values overwrite themselves as they are shifted. Because I'm quite new to Python I got stuck, so any suggestions would be helpful. Or is there maybe an easier way to solve this problem in general? This approach seems inconvenient to me. Thank you!
edit:
1) Unfortunately I can't post images because I don't have enough reputation points, so I need to post this link: shift graphs. Sorry for that. My goal is that the minima of all graphs are at the same point. This graph was plotted with the command:
plt.figure(0)
for i in range(5):
    plt.plot(data_with_minimum[i][:,0], data_with_minimum[i][:,1])
Minimum data example:
x y(file1) y(file2) y(file3)
1 5 8 3
2 3 6 1
3 1 5 5
4 2 3 8
5 5 1 10
6 8 3 13
7 10 4 15
8 14 7 18
9 16 10 20
...
this should become
x y(file1) y(file2) y(file3)
1 3 3 3
2 1 1 1
3 2 3 5
4 5 4 8
5 8 7 10
6 10 10 13
7 14 - 15
8 16 - 18
9 - - 20
...
with 1 being the minimum. But note that there may be an additional minimum after the first 5000 data points.
And the beginning of the real data of one file:
0.000000 -1.057758
0.000200 -1.051918
0.000400 -1.063922
0.000600 -1.065220
0.000800 -1.069438
0.001000 -1.065220
0.001400 -1.065545
0.001600 -1.077549
0.001800 -1.072682
0.002000 -1.082416
0.002200 -1.078847
0.002400 -1.090203
0.002600 -1.087283
0.002800 -1.095069
0.003000 -1.090527
0.003200 -1.098314
0.003400 -1.100261
0.003600 -1.108372
0.003800 -1.103505
0.004000 -1.111292
0.004200 -1.107074
0.004400 -1.113887
0.004600 -1.112590
0.004800 -1.127514
0.005000 -1.115510
0.005200 -1.127514
...
2) changed columns to rows in the passage "in this case a minimum in the first 5000 columns"
First of all, you can find the minima indices much more easily and quickly using numpy's argmin:
import numpy
# setup example data
x = numpy.arange(9)
data_with_minimum = numpy.array(
    [[ 5,  3,  1,  2,  5,  8, 10, 14, 16],
     [ 8,  6,  5,  3,  1,  3,  4,  7, 10],
     [ 3,  1,  5,  8, 10, 13, 15, 18, 20]])
mi = numpy.argmin(data_with_minimum, axis=1)
Then, I wanted to point you to numpy.roll, which could be used to shift/align the arrays. But if you are interested in plotting, it is much more elegant and logical not to modify the arrays at all (and not to have to deal with boundary issues), and just to shift the line plots:
import matplotlib.pyplot as plt
plt.clf()
for i, row in enumerate(data_with_minimum):
    plt.plot(x - mi[i], row)
plt.xlabel('offset from minimum')
plt.show()
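For completeness, a sketch of the numpy.roll route mentioned above (values wrap around at the ends, which is exactly the boundary issue that plotting with an offset avoids):
# shift each row so its minimum lines up with the earliest one; wrapped-around values are the caveat
aligned = numpy.stack([numpy.roll(row, mi.min() - mi[i])
                       for i, row in enumerate(data_with_minimum)])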
This is hard to answer without a MWE. But here's how I would plot two lines so that their minimum values are aligned:
import numpy as np
np.random.seed(1)
a = np.random.random_sample(10)
b = np.random.random_sample(10)
# say we want to align "b" to "a" based on
# the minima as you describe
a_indices = np.arange(0, len(a))
b_indices = a_indices + (a.argmin() - b.argmin())
import matplotlib.pyplot as plt
plt.plot(a_indices, a)
plt.plot(b_indices, b)
plt.show()
