Performing the non-paired t-test column-wise in my data - python

When we have two 1-D arrays:
import numpy as np
import scipy.stats as stats
a=np.array([0.36619718309859156,
0.32558139534883723,
0.3333333333333333,
0.3333333333333333,
0.2549019607843137,
0.3695652173913043,
0.3157894736842105,
0.3625])
and
b=np.array([0.938456,
0.3239485723,
0.300,
0.8658,
1.254901137,
2.3695,
0.75,
1.3625])
we can perform the t-test by the following:
stats.ttest_ind(a=a, b=b, equal_var=np.amax([np.var(a),np.var(b)])/np.amin([np.var(a),np.var(b)])<4)
However, I would like to compare the columns of A and B where A and B are 2-D arrays:
A=np.array([[0, 0.375, 0.5, 0.3917],
[0, 0.333, 0.4, 0.4285],
[0, 0.27272727, 0.0, 0.2],
[0.0, 0.25, 0.36365, 0.272],
[0, 0.285857, 0.4, 0.25],
[0, 0.416667, 0.33, 0.375],
[0, 0.28, 0.083, 0.41667],
[0, 0.2858, 0.25, 0.41666]])
B=np.array([[0, 0.4, 0.333, 0.142],
[0, 0.33333, 0.4, 0.1111111],
[0, 0.25, 0.285, 0.333333],
[0.0, 0.5, 0.380, 0.333],
[0.0, 0.5, 0.33, 0.375],
[0, 0.25, 0.294, 0.5],
[0.0, 0.5, 0.333, 0.2068965],
[0, 0.5, 0.3846, 0.2]])
That is, I would like to run the t-test on the first column of A against the first column of B, then on the second column of A against the second column of B, and so on. (I tried specifying the axis, but I am not sure how to incorporate the variance-ratio rule, i.e. equal_var is True only when the ratio is below 4, in this case.)

You can transpose the data and then go through both at the same time:
def non_paired_t_test(a, b):
    return stats.ttest_ind(a=a, b=b, equal_var=np.amax([np.var(a),np.var(b)])/np.amin([np.var(a),np.var(b)])<4)
for a, b in zip(A.transpose(), B.transpose()):
    print(non_paired_t_test(a, b))

You can use A[:,i], which selects all the rows of the i-th column:
for i in range(0,4):
    print(stats.ttest_ind(a=A[:,i], b=B[:,i], equal_var=np.amax([np.var(A[:,i]),np.var(B[:,i])])/np.amin([np.var(A[:,i]),np.var(B[:,i])])<4))
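
The same per-column rule can probably also be applied without a Python loop. The following is only a sketch: the helper name columnwise_ttest and the ratio_limit parameter are made up for illustration, and the demo arrays are random data rather than the A and B from the question. It runs both the pooled-variance test and Welch's test along axis 0 and then picks one result per column.

```python
import numpy as np
from scipy import stats

def columnwise_ttest(A, B, ratio_limit=4.0):
    """Unpaired t-test per column; pooled or Welch chosen by variance ratio."""
    var_a = np.var(A, axis=0)
    var_b = np.var(B, axis=0)
    larger = np.maximum(var_a, var_b)
    smaller = np.minimum(var_a, var_b)
    # Same rule as max/min < ratio_limit, written without a division so that
    # zero-variance columns do not cause a divide-by-zero in this step.
    equal_var = larger < ratio_limit * smaller

    pooled = stats.ttest_ind(A, B, axis=0, equal_var=True)
    welch = stats.ttest_ind(A, B, axis=0, equal_var=False)
    t_stat = np.where(equal_var, pooled.statistic, welch.statistic)
    p_value = np.where(equal_var, pooled.pvalue, welch.pvalue)
    return t_stat, p_value

# Illustrative random data only (not the arrays from the question).
rng = np.random.default_rng(0)
A_demo = rng.normal(0.3, 0.1, size=(8, 4))
B_demo = rng.normal(0.4, 0.3, size=(8, 4))
t_stat, p_value = columnwise_ttest(A_demo, B_demo)
print(t_stat)
print(p_value)
```

Note that a column that is constant in both arrays (such as the all-zero first column of A and B) has no variance, so the t-test result for it is expected to be nan whichever approach is used.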

Related

How to replace all the elements of a numpy array?

Given a numpy array with multiple arrays inside, how do I replace all the values of the array with values from another array?
For example:
import numpy
first_array = numpy.array([[1,2],[3,4],[5,6],[7,8],[9,10]])
second_array = numpy.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6,
0.7, 0.8, 0.9, 1])
Given these arrays, how do I replace 1, 2 with 0.1, 0.2, and so on?
Use np.reshape
# import numpy as np
>>> m
array([[ 1, 2],
[ 3, 4],
[ 5, 6],
[ 7, 8],
[ 9, 10]])
>>> n
array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
>>> n.reshape(m.shape)
array([[0.1, 0.2],
[0.3, 0.4],
[0.5, 0.6],
[0.7, 0.8],
[0.9, 1. ]])
first_array = np.array([[1,2],[3,4],[5,6],[7,8],[9,10]])
second_array = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6,0.7, 0.8, 0.9, 1])
np.set_printoptions(formatter={'float':"{0:0.1f}".format})
first_array = first_array.astype(float)
for i in range(np.shape(first_array)[0]):
    for j in range(np.shape(first_array)[1]):
        first_array[i][j] = second_array[2*i+j]
print(first_array)
Output:
[[0.1 0.2]
[0.3 0.4]
[0.5 0.6]
[0.7 0.8]
[0.9 1.0]]
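
One detail worth seeing in code is why the loop-based answer converts first_array with astype(float) before assigning. This is a small sketch using the arrays from the question; the names replaced, int_target and float_target are introduced here only for illustration.

```python
import numpy as np

first_array = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
second_array = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])

# reshape gives a float array with the same shape as first_array
replaced = second_array.reshape(first_array.shape)

# Assigning floats into an integer array casts each value to int,
# so the fractional parts are lost without any error being raised.
int_target = first_array.copy()
int_target[...] = replaced

# Converting to float first keeps the values intact.
float_target = first_array.astype(float)
float_target[...] = replaced

print(int_target)
print(float_target)
```

If a new array is acceptable, the reshape alone is enough; the in-place assignment only matters when other code holds a reference to the original array.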

Get indices greater than value and keep value

I have a 2D array that looks like this:
[[0.1, 0.2, 0.4, 0.6, 0.9]
[0.3, 0.7, 0.8, 0.3, 0.9]
[0.7, 0.9, 0.4, 0.6, 0.9]
[0.1, 0.2, 0.6, 0.6, 0.9]]
And I want to save the indices where the array is higher than 0.6 but I also want to keep the value of that position, so the output would be:
[0, 3, 0.6]
[0, 4, 0.9]
[1, 2, 0.7]
and so on.
To get the indices I did this:
x = np.where(PPCF> 0.6)
high_pc = np.asarray(x).T.tolist()
but how do I keep the value in a third position?
Simple, no loops:
x = np.where(PPCF > 0.6) # condition to screen values
vals = PPCF[x] # find values by indices
np.concatenate((np.array(x).T, vals.reshape(vals.size, 1)), axis = 1) # resulting array
Feel free to convert it to a list.
This should work:
x = np.where(PPCF> 0.6)
high_pc = np.asarray(x).T.tolist()
for i in high_pc:
    i.append(float(PPCF[i[0],i[1]]))
You could just run a loop along the columns and rows and check if each element is greater than the threshold and save them in a list.
a = [[0.1, 0.2, 0.4, 0.6, 0.9],
[0.3, 0.7, 0.8, 0.3, 0.9],
[0.7, 0.9, 0.4, 0.6, 0.9],
[0.1, 0.2, 0.6, 0.6, 0.9]]
def find_ix(a, threshold = 0.6):
    res_list = []
    for i in range(len(a)):
        for j in range(len(a[i])):
            if a[i][j] >= threshold:
                res_list.append([i, j, a[i][j]])
    return res_list
print("Resulting list = \n ", find_ix(a))
import numpy as np
arr = np.array([[0.1, 0.2, 0.4, 0.6, 0.9],
[0.3, 0.7, 0.8, 0.3, 0.9],
[0.7, 0.9, 0.4, 0.6, 0.9],
[0.1, 0.2, 0.6, 0.6, 0.9]])
rows, cols = np.where(arr > 0.6) # Get rows and columns where arr > 0.6
values = arr[rows, cols] # Get all values > 0.6 in arr
result = np.column_stack((rows, cols, values)) # Stack three columns to create final array
"""
Result -
[[ 0. 4. 0.9]
[ 1. 1. 0.7]
[ 1. 2. 0.8]
[ 1. 4. 0.9]
[ 2. 0. 0.7]
[ 2. 1. 0.9]
[ 2. 4. 0.9]
[ 3. 4. 0.9]]
"""
You can convert result into a list.
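
Stacking indices and values into one NumPy array turns the indices into floats, as the output above shows. If integer indices are wanted in the final list, a variant along these lines might help. It is a sketch using np.argwhere and a boolean mask, both of which return results in the same row-major order; the names mask, indices and triples are illustrative.

```python
import numpy as np

arr = np.array([[0.1, 0.2, 0.4, 0.6, 0.9],
                [0.3, 0.7, 0.8, 0.3, 0.9],
                [0.7, 0.9, 0.4, 0.6, 0.9],
                [0.1, 0.2, 0.6, 0.6, 0.9]])

mask = arr > 0.6
indices = np.argwhere(mask)   # one (row, col) pair per matching element
values = arr[mask]            # matching values, in the same order

# Build plain Python tuples so indices stay ints and values stay floats
triples = [(int(r), int(c), float(v)) for (r, c), v in zip(indices, values)]
print(triples)
```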

Why does random.shuffle fail on numpy lists?

I have an array of row vectors, upon which I run random.shuffle:
#!/usr/bin/env python
import random
import numpy as np
zzz = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
[0.6, 0.7, 0.8, 0.9, 1. ]])
iterations = 100000
f = 0
for _ in range(iterations):
    random.shuffle(zzz)
    if np.array_equal(zzz[0], zzz[1]):
        print(zzz)
        f += 1
print(float(f)/float(iterations))
Between 99.6% and 100% of the time, using random.shuffle on zzz leaves the array with two identical rows, e.g.:
$ ./test.py
...
[[ 0.1 0.2 0.3 0.4 0.5]
[ 0.1 0.2 0.3 0.4 0.5]]
0.996
Using numpy.random.shuffle appears to pass this test and shuffle row vectors correctly. I'm curious to know why random.shuffle fails.
If you look at the code of random.shuffle it performs swaps in the following way:
x[i], x[j] = x[j], x[i]
which for a numpy.array would fail, without raising any error. Example:
>>> zzz[1], zzz[0] = zzz[0], zzz[1]
>>> zzz
array([[0.1, 0.2, 0.3, 0.4, 0.5],
[0.1, 0.2, 0.3, 0.4, 0.5]])
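The reason is that Python first evaluates the right-hand side completely and then performs the assignments (this is why the single-line swap works for Python lists). For a numpy array, however, the right-hand side holds views into the array rather than copies, so the first assignment overwrites data that the second assignment still needs to read.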
With numpy:
>>> arr = np.array([[1],[1]])
>>> arr[0], arr[1] = arr[0]+1, arr[0]
>>> arr
array([[2],
[2]])
With a Python list:
>>> l = [1,1]
>>> l[0], l[1] = l[0]+1, l[0]
>>> l
[2, 1]
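
To make the view behaviour concrete, here is a short sketch (the variable names are illustrative). It checks that indexing a row returns a view sharing memory with the array, and shows that copying the right-hand side makes the tuple swap behave as it does for lists.

```python
import numpy as np

zzz = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
                [0.6, 0.7, 0.8, 0.9, 1.0]])

# zzz[0] is a view: it refers to the same memory as zzz itself
row_is_view = np.shares_memory(zzz[0], zzz)
print(row_is_view)

# Copying the rows on the right-hand side breaks the link to zzz,
# so the second assignment no longer reads overwritten data.
zzz[0], zzz[1] = zzz[1].copy(), zzz[0].copy()
print(zzz)
```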
Try it like this :
#!/usr/bin/env python
import random
import numpy as np
zzz = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
[0.6, 0.7, 0.8, 0.9, 1. ]])
iterations = 100000
f = 0
for _ in range(iterations):
    random.shuffle(zzz[0])
    random.shuffle(zzz[1])
    if np.array_equal(zzz[0], zzz[1]):
        print(zzz)
        f += 1
print(float(f)/float(iterations))
In [200]: zzz = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
...: [0.6, 0.7, 0.8, 0.9, 1. ]])
...:
In [201]: zl = zzz.tolist()
In [202]: zl
Out[202]: [[0.1, 0.2, 0.3, 0.4, 0.5], [0.6, 0.7, 0.8, 0.9, 1.0]]
random.shuffle is probably using an in-place assignment like:
In [203]: zzz[0],zzz[1]=zzz[1],zzz[0]
In [204]: zzz
Out[204]:
array([[0.6, 0.7, 0.8, 0.9, 1. ],
[0.6, 0.7, 0.8, 0.9, 1. ]])
Note the replication.
But applied to a list of lists:
In [205]: zl[0],zl[1]=zl[1],zl[0]
In [206]: zl
Out[206]: [[0.6, 0.7, 0.8, 0.9, 1.0], [0.1, 0.2, 0.3, 0.4, 0.5]]
In [207]: zl[0],zl[1]=zl[1],zl[0]
In [208]: zl
Out[208]: [[0.1, 0.2, 0.3, 0.4, 0.5], [0.6, 0.7, 0.8, 0.9, 1.0]]
I tested zl = list(zzz) and still got the array behavior. This zl is a list with views of zzz. tolist makes a list of lists that's totally independent of zzz.
In short, random.shuffle cannot handle in-place modifications of an ndarray correctly. np.random.shuffle is designed to work with the 1st dim of an array, so it gets it right.
The correct assignment for an ndarray is:
In [211]: zzz = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
...: [0.6, 0.7, 0.8, 0.9, 1. ]])
...:
In [212]: zzz[[0,1]] = zzz[[1,0]]
In [213]: zzz
Out[213]:
array([[0.6, 0.7, 0.8, 0.9, 1. ],
[0.1, 0.2, 0.3, 0.4, 0.5]])
In [214]: zzz[[0,1]] = zzz[[1,0]]
In [215]: zzz
Out[215]:
array([[0.1, 0.2, 0.3, 0.4, 0.5],
[0.6, 0.7, 0.8, 0.9, 1. ]])
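
The fancy-indexing assignment above generalizes to shuffling any number of rows. A minimal sketch, assuming NumPy's Generator API (np.random.default_rng) is available; the seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
zzz = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
                [0.6, 0.7, 0.8, 0.9, 1.0]])
original_rows = {tuple(row) for row in zzz}

# Option 1: index with a random permutation of the row numbers.
# Fancy indexing returns a copy, so no row is overwritten midway.
zzz = zzz[rng.permutation(len(zzz))]

# Option 2: shuffle in place along the first axis.
rng.shuffle(zzz)

print(zzz)
```

Either way the rows are moved as whole units, so the set of rows is unchanged and no row is duplicated.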

How to sum/average a specific subset of columns or rows and return the new ndarray in numpy?

For the sake of illustration, imagine I have the following ndarray:
x = [[0.5, 0.3, 0.1, 0.1],
[0.4, 0.1, 0.3, 0.2],
[0.4, 0.3, 0.2, 0.1],
[0.6, 0.1, 0.1, 0.2]]
I want to sum the two vectors at columns 1 and 2 (starting the count from 0) so that the new ndarray would be:
y = [[0.5, 0.4, 0.1],
[0.4, 0.4, 0.2],
[0.4, 0.5, 0.1],
[0.6, 0.2, 0.2]]
And then, I want to average the vectors at rows 1 and 2 so that the final result would be:
z = [[0.5, 0.4, 0.1 ],
[0.4, 0.45, 0.15],
[0.6, 0.2, 0.2 ]]
Is there an efficient way to do that in numpy in one command? I really need efficiency as this operation is going to be applied in a nested loop.
Thanks in advance
@hpaulj's solution is very good; be sure to read it.
You can sum columns quite easily:
a_summed = np.sum(a[:,1:3], axis=1)
You can also take the mean of multiple rows:
a_mean = np.mean(a[1:3], axis=0)
All you have to do is replace and delete the remaining columns, so it becomes:
import numpy as np
a_summed = np.sum(a[:,1:3], axis=1)
a[:, 1] = a_summed
a = np.delete(a, 2, 1)
a_mean = np.mean(a[1:3], axis=0)
a[1] = a_mean
a = np.delete(a, 2, 0)
print(a)
Since you are changing the size of the original matrix, it would be better to do it in two steps, as mentioned in the previous answers. If you want to do it in one command, though, you could do it as follows, which also makes for a nicely generalized solution:
import numpy as np
x = np.array(([0.5, 0.3, 0.1, 0.1, 1],
[0.4, 0.1, 0.3, 0.2, 1],
[0.4, 0.3, 0.2, 0.1, 1],
[0.6, 0.1, 0.1, 0.2, 1]))
def sum_columns(matrix, col_start, col_end):
    # Replace columns col_start..col_end (inclusive) with their row-wise sum
    return np.column_stack((matrix[:, 0:col_start],
                            np.sum(matrix[:, col_start:col_end + 1], axis=1),
                            matrix[:, col_end + 1:]))
def avgRows_summedColumns(matrix, row_start, row_end):
    # Replace rows row_start..row_end (inclusive) with their column-wise mean
    return np.vstack((matrix[0:row_start, :],
                      np.mean(matrix[row_start:row_end + 1, :], axis=0),
                      matrix[row_end + 1:, :]))
# call the entire operation in one command
print(avgRows_summedColumns(sum_columns(x, 1, 2), 1, 2))
This way it doesn't matter how big your matrix is.
In [68]: x = [[0.5, 0.3, 0.1, 0.1],
...: [0.4, 0.1, 0.3, 0.2],
...: [0.4, 0.3, 0.2, 0.1],
...: [0.6, 0.1, 0.1, 0.2]]
In [69]: x=np.array(x)
ufuncs like np.add have a reduceat method that lets us perform the action over groups of rows or columns. With that the first reduction is easy (but takes a little playing to understand the parameters):
In [70]: np.add.reduceat(x,[0,1,3], axis=1)
Out[70]:
array([[0.5, 0.4, 0.1],
[0.4, 0.4, 0.2],
[0.4, 0.5, 0.1],
[0.6, 0.2, 0.2]])
Apparently mean is not a ufunc, so I had to settle for add to reduce the rows:
In [71]: np.add.reduceat(Out[70],[0,1,3],axis=0)
Out[71]:
array([[0.5, 0.4, 0.1],
[0.8, 0.9, 0.3],
[0.6, 0.2, 0.2]])
and then divide by the row count to get the mean. I could generalize that to use the same [0,1,3] used in the reduceat, but for now just use a column array:
In [72]: np.add.reduceat(Out[70],[0,1,3],axis=0)/np.array([1,2,1])[:,None]
Out[72]:
array([[0.5 , 0.4 , 0.1 ],
[0.4 , 0.45, 0.15],
[0.6 , 0.2 , 0.2 ]])
and the whole thing in one expression:
In [73]: np.add.reduceat(np.add.reduceat(x,[0,1,3], axis=1),[0,1,3],axis=0)/ np.array([1,2,1])[:,None]
Out[73]:
array([[0.5 , 0.4 , 0.1 ],
[0.4 , 0.45, 0.15],
[0.6 , 0.2 , 0.2 ]])
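
The generalization mentioned above, deriving the divisor from the same [0,1,3] index list, could look roughly like this. It is a sketch: the helper name group_mean_rows is made up, and it assumes the start indices are strictly increasing, since reduceat behaves differently otherwise.

```python
import numpy as np

def group_mean_rows(x, starts):
    """Mean over row groups that begin at each index in `starts`."""
    starts = np.asarray(starts)
    sums = np.add.reduceat(x, starts, axis=0)
    # Group sizes: distance between consecutive starts, with the last
    # group running to the end of the array.
    counts = np.diff(np.append(starts, x.shape[0]))
    return sums / counts[:, None]

x = np.array([[0.5, 0.3, 0.1, 0.1],
              [0.4, 0.1, 0.3, 0.2],
              [0.4, 0.3, 0.2, 0.1],
              [0.6, 0.1, 0.1, 0.2]])

summed_cols = np.add.reduceat(x, [0, 1, 3], axis=1)
z = group_mean_rows(summed_cols, [0, 1, 3])
print(z)
```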

Combining rows of the same key into single array

I have a pandas dataframe as follows:
error
0: [[0.1,0.4,-0.3]]
1: [[-0.6,-0.3,0.2]]
.
.
.
99: [[0.4,-0.7,0.1]]
I would like to combine all values into a single array like this:
[0.1,0.4,-0.3,-0.6,-0.3,0.2,...,0.4,-0.7,0.1]
Is there a fast way to do this using pandas or do I need to iterate over the data and build the array "manually" ?
The data order, in this case, is not important.
In a more general case, how to combine arrays that don't have the same size (e.g. row 0 contains an array of 3 elements, row 1 contains an array of 6 elements,etc...) ?
Use numpy.ravel:
L = np.array(df['error'].values.tolist()).ravel().tolist()
print (L)
[0.1, 0.4, -0.3, -0.6, -0.3, 0.2, 0.4, -0.7, 0.1]
More general solutions, using str[0] to select the nested lists:
print (df)
error
0 [[0.1,0.4,-0.3]]
1 [[-0.6,-0.3]]
99 [[0.4,-0.7,0.1]]
from itertools import chain
L = list(chain.from_iterable(df['error'].str[0]))
print (L)
[0.1, 0.4, -0.3, -0.6, -0.3, 0.4, -0.7, 0.1]
L = np.concatenate(df['error'].str[0].values).tolist()
print (L)
[0.1, 0.4, -0.3, -0.6, -0.3, 0.4, -0.7, 0.1]
df=pd.DataFrame([[0.1,0.4,-0.3],[-0.6,-0.3,0.2]])
df.values.flatten()
will return:
array([ 0.1, 0.4, -0.3, -0.6, -0.3, 0.2])
If you would like to flatten the values column by column instead:
df.values.flatten(order='F')
then it will return:
array([ 0.1, -0.6, 0.4, -0.3, -0.3, 0.2])
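
Since the question does not show how the dataframe is built, here is a self-contained sketch of the ragged case. The construction of df is an assumption (a single "error" column whose cells each hold a list wrapping one inner list), and the names inner and flat are illustrative.

```python
import numpy as np
import pandas as pd

# Assumed layout: each cell is a list containing one inner list,
# and the inner lists may have different lengths.
df = pd.DataFrame(
    {"error": [[[0.1, 0.4, -0.3]], [[-0.6, -0.3]], [[0.4, -0.7, 0.1]]]},
    index=[0, 1, 99],
)

inner = df["error"].str[0]              # unwrap the outer list in each cell
flat = np.concatenate(inner.tolist())   # join the ragged inner lists
print(flat.tolist())
```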
