I have a list, pct_change, and I need to calculate the standard deviation of its values while ignoring the zeros. I tried the code below, but it is not working as expected.
import numpy as np
m = np.ma.masked_equal(pct_change, 0)
value = m.mask.std()
Input value: pct_change
0 0.00
1 0.00
2 0.00
3 18523.94
4 15501.94
5 14437.03
6 13402.43
7 18986.14
The code should ignore the three zero values and calculate the standard deviation of the remaining ones.
Filter for values unequal to zero first:
>>> a
array([ 0. , 0. , 0. , 18523.94, 15501.94, 14437.03,
13402.43, 18986.14])
>>> a[a!=0].std()
2217.2329816471693
One approach would be to convert the zeros to NaNs and then use np.nanstd, which ignores NaNs when computing the standard deviation -
np.nanstd(np.where(np.isclose(a,0), np.nan, a))
Sample run -
In [296]: a
Out[296]: [0.0, 0.0, 0.0, 18523.94, 15501.94, 14437.03, 13402.43, 18986.14]
In [297]: np.nanstd(np.where(np.isclose(a,0), np.nan, a))
Out[297]: 2217.2329816471693
Note that we are using np.isclose(a, 0) because we are dealing with floating-point numbers here, and it's not a good idea to compare directly against zero to detect them in a float dtype array.
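For reference, the masked-array attempt from the question also works once the standard deviation is taken on the masked array itself rather than on its boolean .mask attribute (a minimal sketch using the sample values):
import numpy as np

a = np.array([0., 0., 0., 18523.94, 15501.94, 14437.03, 13402.43, 18986.14])

m = np.ma.masked_equal(a, 0)   # mask the zero entries
print(m.std())                 # 2217.2329816471693, same as the approaches above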
I have an archive of data, (xx, yy, EXTRA), and I want to divide the data into grids of equal size. For example, let's suppose that the data is:
xx=np.array([0.1, 0.2, 3, 4.1, 3, 0.1])
yy=np.array([0.35, 0.15, 1.5, 4.5, 3.5, 3])
EXTRA=np.array([0.01,0.003,2.002,4.004,0.5,0.2])
I want to make square grid cells of size 1x1 and then obtain the sum of "EXTRA" for every cell of the grid.
This is what I tried
import math

for i in range(0, 5):
    for j in range(0, 5):
        for x, y in zip(xx, yy):
            k = math.floor(x)
            kk = math.floor(y)
            if i <= k < i + 1.0 and j <= kk < j + 1.0:
                print("(x,y)=", x, ",", y, ",", "(i,j)=", i, ",", j, "Unknown sum of EXTRA")
I obtain as output
(x,y)= 0.1 , 0.35 , (i,j)= 0 , 0 Unknown sum of EXTRA
(x,y)= 0.2 , 0.15 , (i,j)= 0 , 0 Unknown sum of EXTRA
(x,y)= 0.1 , 3.0 , (i,j)= 0 , 3 Unknown sum of EXTRA
(x,y)= 3.0 , 1.5 , (i,j)= 3 , 1 Unknown sum of EXTRA
(x,y)= 3.0 , 3.5 , (i,j)= 3 , 3 Unknown sum of EXTRA
(x,y)= 4.1 , 4.5 , (i,j)= 4 , 4 Unknown sum of EXTRA
So, the first two points have coordinates (0.1, 0.35) and (0.2, 0.15) and fall inside the quadrant (0, 0). Looking at "EXTRA", I know that for the quadrant (0, 0) the sum should be Sum_extra = 0.01 + 0.003. However, I can't figure out how to compute that sum in code.
More information
My real problem is that I have "particles" inside a big cubic box, and I want to subdivide the box into smaller boxes and, for each of the smaller boxes, obtain the sum of the "mass" it contains (in my example, EXTRA = mass).
I suspect that the way I classify whether a particle belongs to a quadrant is slow, which would be a problem since I have a lot of data. Any suggestions will be appreciated.
Combine the three arrays with zip and sort the result on the xx and yy values. Then group that by the xx and yy values. Get the sum of the EXTRA values for each group.
import operator, itertools

important = operator.itemgetter(0, 1)   # the (xx, yy) cell a point falls in
xtra = operator.itemgetter(-1)          # the EXTRA value of a point

# Sort by cell so that groupby sees each cell's points consecutively.
data = sorted(zip(xx.astype(int), yy.astype(int), EXTRA), key=important)
gb = itertools.groupby(data, important)

for key, group in gb:
    values = list(map(xtra, group))
    print(key, values, sum(values))
    # or just
    # print(key, sum(map(xtra, group)))
Same concept using a Pandas DataFrame.
import pandas as pd
xx, yy = xx.astype(int),yy.astype(int)
In [25]: df = pd.DataFrame({'xx':xx,'yy':yy,'EXTRA':EXTRA})
In [26]: df.groupby(['xx','yy'])['EXTRA'].sum()
Out[26]:
xx yy
0 0 0.013
3 0.200
3 1 2.002
3 0.500
4 4 4.004
Name: EXTRA, dtype: float64
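Since the question mentions performance, a fully vectorized alternative is a weighted 2D histogram; this is a sketch that assumes 1x1 cells with bin edges 0 through 5 taken from the sample data:
import numpy as np

xx = np.array([0.1, 0.2, 3, 4.1, 3, 0.1])
yy = np.array([0.35, 0.15, 1.5, 4.5, 3.5, 3])
EXTRA = np.array([0.01, 0.003, 2.002, 4.004, 0.5, 0.2])

edges = np.arange(6)  # cell boundaries 0, 1, 2, 3, 4, 5
sums, _, _ = np.histogram2d(xx, yy, bins=[edges, edges], weights=EXTRA)

print(sums[0, 0])  # 0.013 for the (0, 0) cell
print(sums[3, 1])  # 2.002 for the (3, 1) cell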
I have a numpy array a like this
In [318]: a
Out[318]:
array([[0. , 1. , 2. , 3. ],
[0.5, 0.3, 0.2, 0.25]])
I need to sort along the second row (the one with [0.5,0.3,0.2,0.25]), while having the first row changed accordingly. In this case, the expected result is
2,   3,    1,   0
0.2, 0.25, 0.3, 0.5
How can I do this? Thank you. I tried np.sort with axis=-1 and 0; they are not what I need.
Important note: performance is key in my problem. My arrays, from an image processing application, generally have N columns, with N close to 4 million.
Use np.argsort() on the second row to get the column order that sorts it, then use that index array to reorder the columns of the whole array:
In [69]: mask = np.argsort(a[1])
In [70]: a[:, mask]
Out[70]:
array([[2. , 3. , 1. , 0. ],
[0.2 , 0.25, 0.3 , 0.5 ]])
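For reference, a self-contained sketch of the same approach on the sample array:
import numpy as np

a = np.array([[0.0, 1.0, 2.0, 3.0],
              [0.5, 0.3, 0.2, 0.25]])

order = np.argsort(a[1])   # column order that sorts the second row
print(a[:, order])
# [[2.   3.   1.   0.  ]
#  [0.2  0.25 0.3  0.5 ]]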
I am trying to figure out a way to calculate quantiles in pandas or Python based on a column value. Also, can I calculate multiple different quantiles in one output?
For example I want to calculate the 0.25, 0.50 and 0.9 quantiles for
the column Minutes in df, once for the rows where it is <= 5 and once for the rows where it is > 5 and <= 10:
df[df['Minutes'] <=5]
df[(df['Minutes'] >5) & (df['Minutes']<=10)]
where Minutes is just a column containing numerical minute values.
Thanks!
DataFrame.quantile (and Series.quantile) accept an array of quantile values.
Try
df['Minutes'].quantile([0.25, 0.50, 0.9])
Or filter the data first,
df.loc[df['Minutes'] <= 5, 'Minutes'].quantile([0.25, 0.50, 0.9])
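If both ranges are wanted in a single output, one option (a sketch assuming a hypothetical df with a Minutes column, with bin edges taken from the question) is to bin the column with pd.cut and group on the bins:
import pandas as pd

# Hypothetical sample data standing in for the real df.
df = pd.DataFrame({'Minutes': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

bins = pd.cut(df['Minutes'], [0, 5, 10])  # (0, 5] and (5, 10]
print(df.groupby(bins)['Minutes'].quantile([0.25, 0.50, 0.9]))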
I am trying to create a table of pairwise correlations for a model that I am building, and I have some numpy.nan (NaN) values in my dataset. For some reason, I get different results when I compute the correlation with np.corrcoef() than with pd.DataFrame.corr():
for instance:
dataset = np.array([[1,np.nan,np.nan,1,1],[1,np.nan,np.nan,3000,1]])
pandas_data = pd.DataFrame(dataset.transpose())
print np.corrcoef(dataset)
to which I get:
[[ nan nan]
[ nan nan]]
but with the pandas dataframe I do have one result:
print pandas_data.corr()
0 1
0 NaN NaN
1 NaN 1
Is there a fundamental difference in the way they handle NaN, or I missed something? (Also, why is my correlation 1 if I do have different values?) Thanks
NumPy's default behavior is to propagate NaNs. That is, it performs the computations with the entire array, and every time something is added to NaN (or multiplied by, etc), the result is NaN. This is reasonable: if a = 5 and b = NaN, a + b should be NaN. Consequently, the variance of an array containing at least one NaN is NaN, and so is the correlation of that array with any other array.
The raw-data-oriented nature of pandas leads to different design decisions: it tries to extract as much information as possible from incomplete data. In particular, the corr method is designed (and documented) to exclude NaN.
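A minimal illustration of the two behaviors on a toy array:
import numpy as np
import pandas as pd

a = np.array([1.0, np.nan, 2.0])

print(np.std(a))           # nan  -- NumPy propagates the NaN
print(pd.Series(a).std())  # 0.707... -- pandas drops the NaN first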
To reproduce the pandas behavior in NumPy, use a boolean mask, valid, as below: it keeps only the positions where neither row contains a NaN.
import numpy as np
import pandas as pd

dataset = np.array([[1, 2, 3, 4, np.nan], [1, 0, np.nan, 8, 9]])

valid = ~np.isnan(dataset).any(axis=0)        # keep only columns where neither row is NaN
numpy_corr = np.corrcoef(dataset[:, valid])   # correlate the complete pairs only

pandas_data = pd.DataFrame(dataset.transpose())
pandas_corr = pandas_data.corr()
Both correlation methods now return the same result:
array([[1.        , 0.90112711],
       [0.90112711, 1.        ]])
The diagonal entries represent the correlation of an array with itself, which is always 1 (theoretically; in practice it's 1 within machine precision).
I have the following dataframe which is the result of performing a standard pandas correlation:
df.corr()
abc xyz jkl
abc 1 0.2 -0.01
xyz -0.34 1 0.23
jkl 0.5 0.4 1
I have a few things that need to be done with these correlations; however, the calculations need to exclude all the cells where the value is 1. The 1 values are the cells where an item has a perfect correlation with itself, so I am not interested in them:
Determine the maximum correlation pair. The result is 'jkl' and 'abc' which has a correlation of 0.5
Determine the minimum correlation pair. The result is 'abc' and 'xyz' which has a correlation of -0.34
Determine the average/mean for the whole dataframe (again, this needs to exclude all the values which are 1). The result would be (0.2 + -0.01 + -0.34 + 0.23 + 0.5 + 0.4) / 6 = 0.163333333
Check this:
import numpy as np
import pandas as pd

a = pd.DataFrame(columns=['abc', 'xyz', 'jkl'])
a.loc['abc'] = [1, 0.2, -0.01]
a.loc['xyz'] = [-0.34, 1, 0.23]
a.loc['jkl'] = [0.5, 0.4, 1]

b = a.values.astype(float).copy()   # work on a copy so the DataFrame keeps its diagonal
np.fill_diagonal(b, np.nan)         # mask the self-correlations

imax = np.unravel_index(np.nanargmax(b), b.shape)
imin = np.unravel_index(np.nanargmin(b), b.shape)

print(a.index[imax[0]], a.columns[imax[1]])   # maximum pair: jkl abc
print(a.index[imin[0]], a.columns[imin[1]])   # minimum pair: xyz abc
print(np.nanmean(b))                          # mean of the off-diagonal values
Please don't forget to copy your data first; otherwise np.fill_diagonal will overwrite the diagonal of the original array.
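A pandas-only variant of the same idea, sketched here with the correlation matrix rebuilt from the question's numbers: replace the diagonal with NaN via where, stack the remaining cells into a Series, and use idxmax/idxmin/mean:
import numpy as np
import pandas as pd

corr = pd.DataFrame([[1, 0.2, -0.01],
                     [-0.34, 1, 0.23],
                     [0.5, 0.4, 1]],
                    index=['abc', 'xyz', 'jkl'],
                    columns=['abc', 'xyz', 'jkl'])

off_diag = corr.where(~np.eye(len(corr), dtype=bool)).stack()  # drop the diagonal

print(off_diag.idxmax(), off_diag.max())  # ('jkl', 'abc') 0.5
print(off_diag.idxmin(), off_diag.min())  # ('xyz', 'abc') -0.34
print(off_diag.mean())                    # 0.163333...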