Remove specific values from dataframe - python

I have the following correlation matrix:
symbol   abc   xyz   ghj
symbol
abc        1   0.1  -0.2
xyz      0.1     1   0.3
ghj     -0.2   0.3     1
I need to find the standard deviation for the whole dataframe, but it has to exclude the perfect-correlation values, i.e. the standard deviation must not take into account abc:abc, xyz:xyz, ghj:ghj.
I am able to get the standard deviation for the entire dataframe using:
df.stack().std()
But this takes into account every single value, which is not correct. The standard deviation should not include row/column combinations where an item is correlated with itself (i.e. the 1s on the diagonal). Is there a way to remove abc:abc, xyz:xyz, ghj:ghj and then calculate the standard deviation?
Perhaps converting it to a dict or something?

If you use numpy you can utilize np.extract and np.std:
In [61]: import numpy as np

In [62]: a = np.array([[ 1. ,  0.1, -0.2],
                       [ 0.1,  1. ,  0.3],
                       [-0.2,  0.3,  1. ]])

In [63]: a
Out[63]:
array([[ 1. ,  0.1, -0.2],
       [ 0.1,  1. ,  0.3],
       [-0.2,  0.3,  1. ]])

In [64]: calc_std = np.std(np.extract(a != 1, a))

In [65]: calc_std
Out[65]: 0.20548046676563256
np.extract(a != 1, a) returns an array containing each element of a which is not equal to 1.
The returned array looks like this:
In [66]: np.extract(a != 1, a)
Out[66]: array([ 0.1, -0.2,  0.1,  0.3, -0.2,  0.3])
After this extraction you can easily calculate the standard deviation with np.std().
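If you would rather stay in pandas, an alternative is to mask the diagonal positions explicitly instead of testing for the value 1, which also keeps any off-diagonal correlation that happens to be exactly 1. A minimal sketch, assuming a square DataFrame with matching index and columns as in the question:

import numpy as np
import pandas as pd

# the correlation matrix from the question
df = pd.DataFrame([[ 1.0,  0.1, -0.2],
                   [ 0.1,  1.0,  0.3],
                   [-0.2,  0.3,  1.0]],
                  index=['abc', 'xyz', 'ghj'],
                  columns=['abc', 'xyz', 'ghj'])

# boolean mask that is False on the diagonal and True everywhere else
mask = ~np.eye(len(df), dtype=bool)

# standard deviation of the off-diagonal values only
print(df.values[mask].std())  # 0.20548046676563256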

Related

Find column index of maximum element for each layer of 3d numpy array

I have a 3D NumPy array arr. Here is an example:
>>> arr
array([[[0.05, 0.05, 0.9 ],
        [0.4 , 0.5 , 0.1 ],
        [0.7 , 0.2 , 0.1 ],
        [0.1 , 0.2 , 0.7 ]],

       [[0.98, 0.01, 0.01],
        [0.2 , 0.3 , 0.95],
        [0.33, 0.33, 0.34],
        [0.33, 0.33, 0.34]]])
For each layer of the cube (i.e., for each matrix), I want to find the index of the column containing the largest number in the matrix. For example, let's take the first layer:
>>> arr[0]
array([[0.05, 0.05, 0.9 ],
       [0.4 , 0.5 , 0.1 ],
       [0.7 , 0.2 , 0.1 ],
       [0.1 , 0.2 , 0.7 ]])
Here, the largest element is 0.9, and it is found in the third column (i.e. index 2). In the second layer, instead, the max is found in the first column (the largest number is 0.98, so the column index is 0).
The expected result from the previous example is:
array([2, 0])
Here's what I have done so far:
tmp = arr.max(axis=-1)
argtmp = arr.argmax(axis=-1)
indices = np.take_along_axis(
    argtmp,
    tmp.argmax(axis=-1).reshape((arr.shape[0], -1)),
    1,
).reshape(-1)
The code above works, but I'm wondering if it can be simplified further, as it seems overly complicated to me.
Find the maximum in each column before applying argmax:
arr.max(-2).argmax(-1)
Reducing the column to a single maximum value will not change which column has the largest value. Since you don't care about the row index, this saves you a lot of trouble.
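Applied to the example array, the one-liner reproduces the expected result:

import numpy as np

arr = np.array([[[0.05, 0.05, 0.9 ],
                 [0.4 , 0.5 , 0.1 ],
                 [0.7 , 0.2 , 0.1 ],
                 [0.1 , 0.2 , 0.7 ]],

                [[0.98, 0.01, 0.01],
                 [0.2 , 0.3 , 0.95],
                 [0.33, 0.33, 0.34],
                 [0.33, 0.33, 0.34]]])

# arr.max(-2) collapses the row axis, keeping the per-column maxima of each
# layer (shape (2, 3)); argmax(-1) then picks the column of each layer's maximum
print(arr.max(-2).argmax(-1))  # [2 0]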

Matrix created from a function, and concatenated column vector of the matrix

We have a function f(x,y). We want to calculate the matrix B_ij = f(x_i, x_j) = f(ih, jh) for 1 <= i,j <= n and h = 1/(n+1), such that:
If f(x,y) = x + y, then B_ij = ih + jh and the matrix becomes (here, n = 3, so h = 0.25):

[[ 0.5   0.75  1.  ]
 [ 0.75  1.    1.25]
 [ 1.    1.25  1.5 ]]

I would like to program a function calculating the column vector b that concatenates all the columns of B. With my previous example, we would have b equal to the column vector (0.5, 0.75, 1, 0.75, 1, 1.25, 1, 1.25, 1.5)^T.
Here is what I have done; the function and n can be changed, here with f(x,y) = x + y:
import numpy as np

n = 3
def f(i, j):
    h = 1.0 / (n + 1)
    a = ((i + 1) * h) + ((j + 1) * h)
    return a

B = np.fromfunction(f, (n, n))
print(B)
But I don't know how to build the vector b. With
np.concatenate((B[:, 0], B[:, 1], B[:, 2]))
I get a row vector, not a column vector. Could you help me? Sorry for my bad English; I'm a beginner in Python.
The ravel function along with a new axis should do the trick:
import numpy as np
x = np.array([[0.5 , 0.75, 1.  ],
              [0.75, 1.  , 1.25],
              [1.  , 1.25, 1.5 ]])

x.T.ravel()[:, np.newaxis]
# array([[ 0.5 ],
#        [ 0.75],
#        [ 1.  ],
#        [ 0.75],
#        [ 1.  ],
#        [ 1.25],
#        [ 1.  ],
#        [ 1.25],
#        [ 1.5 ]])
Ravel stitches together all the rows, so we first transpose the matrix (with .T). The result is a row-vector, and we change it to a column vector by adding a new axis.
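Equivalently, you can let ravel walk the array in column-major (Fortran) order instead of transposing first, or use reshape to add the trailing axis; both of these one-liners give the same column vector:

x.ravel(order='F')[:, np.newaxis]  # column-major traversal, then add an axis
x.T.reshape(-1, 1)                 # transpose, then reshape into one column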
import numpy as np

# create sample matrix `m`
m = np.matrix([[0.5, 0.75, 1], [0.75, 1, 1.25], [1, 1.25, 1.5]])

# convert matrix `m` to a 'flat' matrix
m_flat = m.flatten()
print(m_flat)

# `m_flat` is still a matrix; in case you need an array:
m_flat_arr = np.squeeze(np.asarray(m_flat))
print(m_flat_arr)
The snippet uses .flatten(), np.asarray() and np.squeeze() to convert the original matrix m, being

matrix([[ 0.5 ,  0.75,  1.  ],
        [ 0.75,  1.  ,  1.25],
        [ 1.  ,  1.25,  1.5 ]])

into an array m_flat_arr of:

array([ 0.5 ,  0.75,  1.  ,  0.75,  1.  ,  1.25,  1.  ,  1.25,  1.5 ])

Note that .flatten() walks the matrix row by row; it matches the requested column concatenation here only because the example matrix is symmetric. For a general matrix, flatten the transpose instead: m.T.flatten().

Normalize values between -1 and 1 inclusive

I am trying to generate a .wav file in python using Numpy. I have voltages ranging between 0-5V and I need to normalize them between -1 and 1 to use them in a .wav file.
I have seen this website which uses numpy to generate a wav file, but the algorithm used to normalize is no longer available.
Can anyone explain how I would go about generating these values in Python on my Raspberry Pi?
Isn't this just a simple calculation? Divide by half the maximum value and subtract 1:
In [12]: data = np.linspace(0, 5, 21)

In [13]: data
Out[13]:
array([ 0.  ,  0.25,  0.5 ,  0.75,  1.  ,  1.25,  1.5 ,  1.75,  2.  ,
        2.25,  2.5 ,  2.75,  3.  ,  3.25,  3.5 ,  3.75,  4.  ,  4.25,
        4.5 ,  4.75,  5.  ])

In [14]: data/2.5 - 1.
Out[14]:
array([-1. , -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1,  0. ,
        0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ])
The following function should do what you want, irrespective of the range of the input data; i.e., it also works if you have negative values.

import numpy as np

def my_norm(a):
    # as you want your data to be between -1 and 1, everything should be
    # scaled to 2 (the width of that range); if your desired min and max
    # are other values, replace 2 with your_max - your_min
    ratio = 2 / (np.max(a) - np.min(a))
    # now you need to shift the centre of the data range to the middle;
    # note this is the midrange, not the average of the values
    shift = (np.max(a) + np.min(a)) / 2
    return (a - shift) * ratio

my_norm(data)
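Applied to the 0-5 V linspace example above, np.max(a) = 5 and np.min(a) = 0 give ratio = 0.4 and shift = 2.5, so my_norm reproduces the data/2.5 - 1. result exactly:

data = np.linspace(0, 5, 21)
print(my_norm(data))  # [-1.  -0.9 -0.8 ...  0.8  0.9  1. ]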
You can use the fit_transform method of sklearn.preprocessing.StandardScaler. This method removes the mean from your data and scales each column to unit variance:
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.asarray([[0, 0, 0],
                   [1, 1, 1],
                   [2, 1, 3]])
data = StandardScaler().fit_transform(data)
And if you print out data, you will now have:
[[-1.22474487 -1.41421356 -1.06904497]
 [ 0.          0.70710678 -0.26726124]
 [ 1.22474487  0.70710678  1.33630621]]
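Note that StandardScaler standardizes each column to zero mean and unit variance; as the output above shows, this does not confine values to [-1, 1]. If you need the result bounded to exactly [-1, 1], sklearn's MinMaxScaler with feature_range=(-1, 1) does that instead; a minimal sketch:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.asarray([[0, 0, 0],
                   [1, 1, 1],
                   [2, 1, 3]])

# rescale each column linearly so its minimum maps to -1 and its maximum to 1
print(MinMaxScaler(feature_range=(-1, 1)).fit_transform(data))
# [[-1.         -1.         -1.        ]
#  [ 0.          1.         -0.33333333]
#  [ 1.          1.          1.        ]]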

In numpy, calculating a matrix where each cell contains the product of all the other entries in that row

I have a matrix
A = np.array([[0.2, 0.4, 0.6],
              [0.5, 0.5, 0.5],
              [0.6, 0.4, 0.2]])
I want a new matrix, where the value of the entry in row i and column j is the product of all the entries of the ith row of A, except for the cell of that row in the jth column.
array([[ 0.24,  0.12,  0.08],
       [ 0.25,  0.25,  0.25],
       [ 0.08,  0.12,  0.24]])
The solution that first occurred to me was
np.repeat(np.prod(A, 1, keepdims = True), 3, axis = 1) / A
But this only works as long as no entries have the value zero.
Any thoughts? Thank you!
Edit: I have developed
B = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        B[i, j] = np.prod(A[i, [x for x in range(3) if x != j]])
but surely there is a more elegant way to accomplish this, which makes use of numpy's efficient C backend instead of inefficient Python loops?
If you're willing to tolerate a single loop:
B = np.empty_like(A)
for col in range(A.shape[1]):
    B[:, col] = np.prod(np.delete(A, col, 1), 1)
That computes what you need, a single column at a time. It is not as efficient as theoretically possible because np.delete() creates a copy; if you care a lot about memory allocation, use a mask instead:
B = np.empty_like(A)
mask = np.ones(A.shape[1], dtype=bool)
for col in range(A.shape[1]):
    mask[col] = False
    B[:, col] = np.prod(A[:, mask], 1)
    mask[col] = True
A variation on your repeat-based solution uses [:, None] to keep the row products broadcastable:
np.prod(A, axis=1)[:, None] / A
My 1st stab at handling 0s is:
In [21]: B
Out[21]:
array([[ 0.2,  0.4,  0.6],
       [ 0. ,  0.5,  0.5],
       [ 0.6,  0.4,  0.2]])

In [22]: np.prod(B, axis=1)[:, None] / (B + np.where(B == 0, 1, 0))
Out[22]:
array([[ 0.24,  0.12,  0.08],
       [ 0.  ,  0.  ,  0.  ],
       [ 0.08,  0.12,  0.24]])
But as the comment pointed out, the [1,0] cell should be 0.25.
This corrects that problem, but now has problems when there are multiple 0s in a row:
In [30]: I = B == 0

In [31]: B1 = B + np.where(I, 1, 0)

In [32]: B2 = np.prod(B1, axis=1)[:, None] / B1

In [33]: B3 = np.prod(B, axis=1)[:, None] / B1

In [34]: np.where(I, B2, B3)
Out[34]:
array([[ 0.24,  0.12,  0.08],
       [ 0.25,  0.  ,  0.  ],
       [ 0.08,  0.12,  0.24]])
In [55]: C
Out[55]:
array([[ 0.2,  0.4,  0.6],
       [ 0. ,  0.5,  0. ],
       [ 0.6,  0.4,  0.2]])

In [64]: np.where(I, sum1[:, None], sum[:, None]) / C1
Out[64]:
array([[ 0.24,  0.12,  0.08],
       [ 0.5 ,  0.  ,  0.5 ],
       [ 0.08,  0.12,  0.24]])
Blaz Bratanic's epsilon approach is the best non-iterative solution (so far):
In [74]: np.prod(C + eps, axis=1)[:, None] / (C + eps)  # eps as in that answer, e.g. 1e-10
A different solution iterating over the columns:
def paulj(A):
    P = np.ones_like(A)
    for i in range(1, A.shape[1]):
        P *= np.roll(A, i, axis=1)
    return P
In [130]: paulj(A)
Out[130]:
array([[ 0.24,  0.12,  0.08],
       [ 0.25,  0.25,  0.25],
       [ 0.08,  0.12,  0.24]])

In [131]: paulj(B)
Out[131]:
array([[ 0.24,  0.12,  0.08],
       [ 0.25,  0.  ,  0.  ],
       [ 0.08,  0.12,  0.24]])

In [132]: paulj(C)
Out[132]:
array([[ 0.24,  0.12,  0.08],
       [ 0.  ,  0.  ,  0.  ],
       [ 0.08,  0.12,  0.24]])
I tried some timings on a large matrix
In [13]: A = np.random.randint(0, 100, (1000, 1000)) * 0.01

In [14]: timeit paulj(A)
1 loops, best of 3: 23.2 s per loop

In [15]: timeit blaz(A)
10 loops, best of 3: 80.7 ms per loop

In [16]: timeit zwinck1(A)
1 loops, best of 3: 15.3 s per loop

In [17]: timeit zwinck2(A)
1 loops, best of 3: 65.3 s per loop
The epsilon approximation is probably the best speed we can expect, but has some rounding issues. Having to iterate over many columns hurts the speed. I'm not sure why the np.prod(A[:,mask], 1) approach is slowest.
eeclo (https://stackoverflow.com/a/22441825/901925) suggested using as_strided. Here's what I think he has in mind (adapted from an overlapping-blocks question, https://stackoverflow.com/a/8070716/901925):
def strided(A):
    h, w = A.shape
    A2 = np.hstack([A, A])
    x, y = A2.strides
    strides = (y, x, y)
    shape = (w, h, w - 1)
    blocks = np.lib.stride_tricks.as_strided(A2[:, 1:], shape=shape, strides=strides)
    P = blocks.prod(2).T  # faster to prod on the last dim
    # alt: shape = (w-1, h, w), and P = blocks.prod(0)
    return P
Timing for the (1000,1000) array is quite an improvement over the column iterations, though still much slower than the epsilon approach.
In [153]: timeit strided(A)
1 loops, best of 3: 2.51 s per loop
Another indexing approach, while relatively straightforward, is slower and produces memory errors sooner.
def foo(A):
    h, w = A.shape
    I = np.arange(w)[:, None] + np.arange(1, w)
    I1 = np.array(I) % w
    P = A[:, I1].prod(2)
    return P
I'm on the run, so I don't have time to work out this solution fully; but what I'd do is create a contiguous circular view over the last axis, by concatenating the array to itself along that axis, and then use np.lib.stride_tricks.as_strided to select the appropriate elements to take an np.prod over. No Python loops, no numerical approximation.
edit: here you go:
import numpy as np

A = np.array([[0.2, 0.4, 0.6],
              [0.5, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.6, 0.4, 0.2]])
B = np.concatenate((A, A), axis=1)
C = np.lib.stride_tricks.as_strided(
    B,
    A.shape + A.shape[1:],
    B.strides + B.strides[1:])
D = np.prod(C[..., 1:], axis=-1)
print(D)
Note: this method is not ideal, as it is O(n^3). See my other posted solution, which is O(n^2)
If you are willing to tolerate a small error, you could use the solution you first proposed:
A += 1e-10
np.around(np.repeat(np.prod(A, 1, keepdims=True), 3, axis=1) / A, 9)
Here is an O(n^2) method without Python loops or numerical approximation:
def double_cumprod(A):
    B = np.empty((A.shape[0], A.shape[1] + 1), A.dtype)
    B[:, 0] = 1
    B[:, 1:] = A
    L = np.cumprod(B, axis=1)            # L[:, j]: product of entries left of column j
    B[:, 1:] = A[:, ::-1]
    R = np.cumprod(B, axis=1)[:, ::-1]   # R[:, j+1]: product of entries right of column j
    return L[:, :-1] * R[:, 1:]
Note: it appears to be about twice as slow as the numerical approximation method, which is in line with expectation.
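As a quick check (using the A and B matrices from earlier in this thread), the prefix/suffix cumulative products reproduce the exact results, including for rows that contain zeros:

import numpy as np

A = np.array([[0.2, 0.4, 0.6],
              [0.5, 0.5, 0.5],
              [0.6, 0.4, 0.2]])
B = np.array([[0.2, 0.4, 0.6],
              [0.0, 0.5, 0.5],
              [0.6, 0.4, 0.2]])

print(double_cumprod(A))  # [[0.24 0.12 0.08]
                          #  [0.25 0.25 0.25]
                          #  [0.08 0.12 0.24]]
print(double_cumprod(B))  # [[0.24 0.12 0.08]
                          #  [0.25 0.   0.  ]
                          #  [0.08 0.12 0.24]]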

setting an array element with a sequence

So I'm not the best at Python, but I need to create this program for one of my courses and I keep getting this error.
Basically I have w_array = linspace(0.6, 1.1, 11), and then zq = array([1, 1, w_array, 1]),
which comes up with the error message:
ValueError: setting an array element with a sequence.
The basic function of the code is to take a Bézier-spline aerofoil with control points and weights, run the data through XFOIL, and print cd and cl values; this addition is to show a graph of the range of cd for a certain control point.
Hope it makes sense; any help would be greatly appreciated.
If you want zq to be an array containing both ints and an array, use the dtype parameter:
In [300]: zq = array([1, 1, w_array, 1], dtype=object)

In [301]: zq
Out[301]:
array([1, 1,
       array([ 0.6 ,  0.65,  0.7 ,  0.75,  0.8 ,  0.85,  0.9 ,  0.95,  1.  ,
               1.05,  1.1 ]),
       1], dtype=object)
Is this your intended result?
In [2]: numpy.hstack((1, 1, numpy.linspace(0.6, 1.1, 11), 1))
Out[2]:
array([ 1.  ,  1.  ,  0.6 ,  0.65,  0.7 ,  0.75,  0.8 ,  0.85,  0.9 ,
        0.95,  1.  ,  1.05,  1.1 ,  1.  ])
You probably want the resulting array to have a float64 dtype rather than object, a mixed bag of dtypes, as @DSM pointed out.
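For what it's worth, np.r_ expresses the same concatenation a little more tersely, and like hstack it upcasts the scalars and the array to a common float64 dtype:

import numpy as np

w_array = np.linspace(0.6, 1.1, 11)
zq = np.r_[1, 1, w_array, 1]  # one flat float64 array, same as the hstack result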
