Getting mean, max, min from pandas dataframe - python

I have the following dataframe which is the result of performing a standard pandas correlation:
df.corr()
      abc   xyz   jkl
abc  1     0.2  -0.01
xyz -0.34  1     0.23
jkl  0.5   0.4   1
I need to do a few things with these correlations, and all of the calculations must exclude the cells where the value is 1. Those are the cells where an item has a perfect correlation with itself, so I am not interested in them:
Determine the maximum correlation pair. The result is 'jkl' and 'abc', which have a correlation of 0.5.
Determine the minimum correlation pair. The result is 'abc' and 'xyz', which have a correlation of -0.34.
Determine the average/mean for the whole dataframe (again excluding all values equal to 1). The result would be (0.2 + -0.01 + -0.34 + 0.23 + 0.5 + 0.4) / 6 = 0.163333333.

Check this:
import numpy as np
import pandas as pd

a = pd.DataFrame(columns=['abc', 'xyz', 'jkl'])
a.loc['abc'] = [1, 0.2, -0.01]
a.loc['xyz'] = [-0.34, 1, 0.23]
a.loc['jkl'] = [0.5, 0.4, 1]

# work on a copy of the underlying array so the DataFrame stays intact
b = a.values.astype(float).copy()
np.fill_diagonal(b, np.nan)
imax = np.unravel_index(np.nanargmax(b), b.shape)
imin = np.unravel_index(np.nanargmin(b), b.shape)
print(a.index[imax[0]], a.columns[imax[1]])
print(a.index[imin[0]], a.columns[imin[1]])
print(np.nanmean(b))
Please don't forget to copy your data first; otherwise np.fill_diagonal will overwrite the diagonal values of the original array.
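If you prefer to stay in pandas, the same three results can be read off a diagonal-masked, stacked copy of the matrix. A sketch using the example values above:

```python
import numpy as np
import pandas as pd

corr = pd.DataFrame([[1.0, 0.2, -0.01],
                     [-0.34, 1.0, 0.23],
                     [0.5, 0.4, 1.0]],
                    index=['abc', 'xyz', 'jkl'],
                    columns=['abc', 'xyz', 'jkl'])

# Mask the diagonal with NaN, then stack into a Series of off-diagonal pairs
# (stack drops the NaN entries by default)
pairs = corr.mask(np.eye(len(corr), dtype=bool)).stack()

print(pairs.idxmax(), pairs.max())  # ('jkl', 'abc') 0.5
print(pairs.idxmin(), pairs.min())  # ('xyz', 'abc') -0.34
print(pairs.mean())                 # mean of the six off-diagonal values
```

The `idxmax`/`idxmin` results are (row, column) label pairs, which is often more convenient than translating positional indices back through `unravel_index`.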


Norm for all columns in a pandas dataframe

With a dataframe like this:
index col_1 col_2 ... col_n
0 0.2 0.1 0.3
1 0.2 0.1 0.3
2 0.2 0.1 0.3
...
n 0.4 0.7 0.1
How can one get the norm for each column, where the norm is the square root of the sum of the squares?
I am able to do this for each column sequentially, but am unsure how to vectorize it (avoiding a for loop):
import pandas as pd
import numpy as np
norm_col_1 = np.linalg.norm(df['col_1'])
norm_col_2 = np.linalg.norm(df['col_2'])
norm_col_n = np.linalg.norm(df['col_n'])
The answer would be a new pandas Series like this:
norms
col_1 0.111
col_2 0.202
col_3 0.55
...
col_n 0.100
You can pass the entire DataFrame to np.linalg.norm, along with an axis argument of 0 to tell it to apply it column-wise:
np.linalg.norm(df, axis=0)
To create a series with appropriate column names, try:
results = pd.Series(data=np.linalg.norm(df, axis=0), index=df.columns)
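As a quick self-contained check, here is a sketch with made-up two-row data (the column names and values are illustrative only), together with an equivalent pure-pandas form:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': [3.0, 4.0], 'col_2': [1.0, 1.0]})

# Column-wise L2 norm in one vectorized call
norms = pd.Series(np.linalg.norm(df, axis=0), index=df.columns, name='norms')

# Equivalent without np.linalg: square root of the column-wise sum of squares
norms_alt = (df ** 2).sum() ** 0.5

print(norms['col_1'])  # 5.0, since sqrt(3^2 + 4^2) = 5
```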

Dividing 3D data into cubic subsets and counting the points inside the cube

I have an archive of data (xx, yy, EXTRA), and I want to divide the data into grids of equal size. For example, let's suppose that the data is:
xx=np.array([0.1, 0.2, 3, 4.1, 3, 0.1])
yy=np.array([0.35, 0.15, 1.5, 4.5, 3.5, 3])
EXTRA=np.array([0.01,0.003,2.002,4.004,0.5,0.2])
I want to make square grid cells of size 1x1 and then obtain the sum of EXTRA for every cell of the grid.
This is what I tried:
import math

for i in range(0, 5):
    for j in range(0, 5):
        for x, y in zip(xx, yy):
            k = math.floor(x)
            kk = math.floor(y)
            if i <= k < i + 1.0 and j <= kk < j + 1.0:
                print("(x,y)=", x, ",", y, ",", "(i,j)=", i, ",", j, "Unknown sum of EXTRA")
I obtain as output:
(x,y)= 0.1 , 0.35 , (i,j)= 0 , 0 Unknown sum of EXTRA
(x,y)= 0.2 , 0.15 , (i,j)= 0 , 0 Unknown sum of EXTRA
(x,y)= 0.1 , 3.0 , (i,j)= 0 , 3 Unknown sum of EXTRA
(x,y)= 3.0 , 1.5 , (i,j)= 3 , 1 Unknown sum of EXTRA
(x,y)= 3.0 , 3.5 , (i,j)= 3 , 3 Unknown sum of EXTRA
(x,y)= 4.1 , 4.5 , (i,j)= 4 , 4 Unknown sum of EXTRA
So the first two points, (0.1, 0.35) and (0.2, 0.15), fall in quadrant (0, 0). Looking at EXTRA, I know that in quadrant (0, 0) the sum of EXTRA should be sum_extra = 0.01 + 0.003. However, I can't figure out how to compute that sum in code.
More information
My real problem is that I have "particles" inside a big cubic box, and I want to subdivide the box into smaller boxes and obtain the sum of the particles' "mass" in each one (in my example, EXTRA = mass).
I suspect that the way I classify whether a particle belongs to a quadrant is slow, which would be a problem since I have a lot of data. Any suggestions will be appreciated.
Combine the three arrays with zip and sort the result on the xx and yy values. Then group by the xx and yy values and take the sum of the EXTRA values for each group.
import operator, itertools

important = operator.itemgetter(0, 1)
xtra = operator.itemgetter(-1)

data = sorted(zip(xx.astype(int), yy.astype(int), EXTRA), key=important)
gb = itertools.groupby(data, important)
for key, group in gb:
    values = list(map(xtra, group))
    print(key, values, sum(values))
    # or just
    # print(key, sum(map(xtra, group)))
Same concept using a Pandas DataFrame.
import pandas as pd

xx, yy = xx.astype(int), yy.astype(int)
df = pd.DataFrame({'xx': xx, 'yy': yy, 'EXTRA': EXTRA})
df.groupby(['xx', 'yy'])['EXTRA'].sum()
which gives:
xx yy
0 0 0.013
3 0.200
3 1 2.002
3 0.500
4 4 4.004
Name: EXTRA, dtype: float64
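Since speed on large particle data is a concern, it is worth noting that np.histogram2d with a weights argument computes all the per-cell sums in a single vectorized pass, with no Python-level loop. A sketch, assuming the coordinates lie in [0, 5] so that edges 0..5 cover them:

```python
import numpy as np

xx = np.array([0.1, 0.2, 3, 4.1, 3, 0.1])
yy = np.array([0.35, 0.15, 1.5, 4.5, 3.5, 3])
EXTRA = np.array([0.01, 0.003, 2.002, 4.004, 0.5, 0.2])

# Bin edges 0, 1, ..., 5 give 1x1 cells; weights=EXTRA makes each cell
# hold the sum of EXTRA for the points that fall inside it
edges = np.arange(0, 6)
grid, _, _ = np.histogram2d(xx, yy, bins=[edges, edges], weights=EXTRA)

print(grid[0, 0])  # sum of EXTRA in cell (0, 0): 0.01 + 0.003 = 0.013
```

grid[i, j] is then the mass in cell (i, j), and the same call generalizes to 3D boxes via np.histogramdd.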

Get the percentile of a column ordered by another column

I have a dataframe with two columns, score and order_amount. I want to find the score Y that represents the Xth percentile of order_amount. I.e. if I sum up all of the values of order_amount where score <= Y I will get X% of the total order_amount.
I have a solution below that works, but it seems like there should be a more elegant way with pandas.
import pandas as pd
test_data = {'score': [0.3, 0.1, 0.2, 0.4, 0.8],
             'value': [10, 100, 15, 200, 150]}
df = pd.DataFrame(test_data)
df
score value
0 0.3 10
1 0.1 100
2 0.2 15
3 0.4 200
4 0.8 150
# Now we can order by `score` and use `cumsum` to calculate what we want
df_order = df.sort_values('score')
df_order['percentile_value'] = 100*df_order['value'].cumsum()/df_order['value'].sum()
df_order
score value percentile_value
1 0.1 100 21.052632
2 0.2 15 24.210526
0 0.3 10 26.315789
3 0.4 200 68.421053
4 0.8 150 100.000000
# Now can find the first value of score with percentile bigger than 50% (for example)
df_order[df_order['percentile_value']>50]['score'].iloc[0]
Use Series.searchsorted:
idx = df_order['percentile_value'].searchsorted(50)
print (df_order.iloc[idx, df.columns.get_loc('score')])
0.4
Or get the first value of the filtered Series with next and iter, returning a default value if there is no match:
s = df_order.loc[df_order['percentile_value'] > 50, 'score']
print (next(iter(s), 'no match'))
0.4
One line solution:
out = next(iter(df.sort_values('score')
                  .assign(percentile_value=lambda x: 100 * x['value'].cumsum() / x['value'].sum())
                  .query('percentile_value > 50')['score']), 'no match')
print (out)
0.4
Here is another way, starting from the original dataframe, using np.searchsorted on the cumulative sum (searchsorted requires a sorted array, and a cumulative sum of non-negative values is sorted). The 50% threshold is simply half of the total:
df = df.sort_values('score')
cumulative = df['value'].cumsum()
df['score'].iloc[np.searchsorted(cumulative, 0.5 * df['value'].sum())]
Or similarly with positional column indexing, if the index is not the default:
df.iloc[np.searchsorted(df['value'].cumsum(), 0.5 * df['value'].sum()),
        df.columns.get_loc('score')]
0.4
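For reuse across different cutoffs, the cumulative-share logic can be wrapped in a small helper (score_at_percentile is a hypothetical name, not part of any of the answers above):

```python
import pandas as pd

def score_at_percentile(df, pct, score_col='score', value_col='value'):
    """Return the smallest score whose cumulative share of value exceeds pct percent."""
    ordered = df.sort_values(score_col)
    share = 100 * ordered[value_col].cumsum() / ordered[value_col].sum()
    return ordered.loc[share > pct, score_col].iloc[0]

df = pd.DataFrame({'score': [0.3, 0.1, 0.2, 0.4, 0.8],
                   'value': [10, 100, 15, 200, 150]})
print(score_at_percentile(df, 50))  # 0.4
```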

Floating point comparison not working on pandas groupby output

I am facing issues with pandas row filtering. I am trying to filter out teams whose weights do not sum to one.
dfteam
Team Weight
A 0.2
A 0.5
A 0.2
A 0.1
B 0.5
B 0.25
B 0.25
dfteamtemp = dfteam.groupby(['Team'], as_index=False)['Weight'].sum()
dfweight = dfteamtemp[(dfteamtemp['Weight'].astype(float)!=1.0)]
dfweight
Team Weight
0 A 1.0
I am not sure about the reason for this output. I should get an empty dataframe, but it is giving me Team A even though the sum is 1.
You are a victim of floating point inaccuracies. The first sum does not come out to exactly 1.0:
df.groupby('Team').Weight.sum().iat[0]
0.99999999999999989
You can resolve this by using np.isclose instead -
np.isclose(df.groupby('Team').Weight.sum(), 1.0)
array([ True, True], dtype=bool)
And filter on this array. Or, as #ayhan suggested, use groupby + filter -
df.groupby('Team').filter(lambda x: not np.isclose(x['Weight'].sum(), 1))
Empty DataFrame
Columns: [Team, Weight]
Index: []
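The effect is easy to reproduce without pandas at all: 0.1 and 0.2 have no exact binary floating point representation, so summing Team A's weights misses 1.0 by one unit in the last place:

```python
import numpy as np

# Team A's weights summed in order; the total lands just below 1.0
total = 0.2 + 0.5 + 0.2 + 0.1

print(total == 1.0)            # False
print(np.isclose(total, 1.0))  # True
```

This is why exact equality tests against the expected sum fail, while np.isclose (which compares within a tolerance) succeeds.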

Subtracting multiple columns and appending results in pandas DataFrame

I have a table of sensor data, for which some columns are measurements and some columns are sensor bias. For example, something like this:
df=pd.DataFrame({'x':[1.0,2.0,3.0],'y':[4.0,5.0,6.0],
'dx':[0.25,0.25,0.25],'dy':[0.5,0.5,0.5]})
dx dy x y
0 0.25 0.5 1.0 4.0
1 0.25 0.5 2.0 5.0
2 0.25 0.5 3.0 6.0
I can add a column to the table by subtracting the bias from the measurement like this:
df['newX'] = df['x'] - df['dx']
dx dy x y newX
0 0.25 0.5 1.0 4.0 0.75
1 0.25 0.5 2.0 5.0 1.75
2 0.25 0.5 3.0 6.0 2.75
But I'd like to do that for many columns at once. This doesn't work:
df[['newX','newY']] = df[['x','y']] - df[['dx','dy']]
for two reasons, it seems:
1. When subtracting DataFrames, the column labels are used to align the subtraction, so I wind up with a four-column result ['x', 'y', 'dx', 'dy'].
2. It seems I can insert a single column into the DataFrame using indexing, but not more than one.
Obviously I can iterate over the columns and do each one individually, but is there a more compact way to accomplish what I'm trying to do that is more analogous to the one column solution?
DataFrames generally align operations such as arithmetic on column and row indices. Since df[['x','y']] and df[['dx','dy']] have different column names, the dx column is not subtracted from the x column, and similarly for the y columns.
In contrast, if you subtract a NumPy array from a DataFrame, the operation is done elementwise, since the NumPy array has no pandas-style indices to align upon.
Hence, if you use df[['dx','dy']].values to extract a NumPy array consisting of the values in df[['dx','dy']], then your assignment can be done as desired:
import pandas as pd
df = pd.DataFrame({'x':[1.0,2.0,3.0],'y':[4.0,5.0,6.0],
'dx':[0.25,0.25,0.25],'dy':[0.5,0.5,0.5]})
df[['newx','newy']] = df[['x','y']] - df[['dx','dy']].values
print(df)
yields
dx dy x y newx newy
0 0.25 0.5 1.0 4.0 0.75 3.5
1 0.25 0.5 2.0 5.0 1.75 4.5
2 0.25 0.5 3.0 6.0 2.75 5.5
Beware that if you try to assign a NumPy array (on the right-hand side) to a DataFrame (on the left-hand side), the column names specified on the left must already exist.
In contrast, when assigning a DataFrame on the right-hand side to a DataFrame on the left, new columns can be created, since in this case pandas zips the keys (new column names) on the left with the columns on the right and assigns values in column order instead of aligning by column name:
for k1, k2 in zip(key, value.columns):
self[k1] = value[k2]
Thus, using a DataFrame on the right
df[['newx','newy']] = df[['x','y']] - df[['dx','dy']].values
works, but using a NumPy array on the right
df[['newx','newy']] = df[['x','y']].values - df[['dx','dy']].values
does not.
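An alternative, not from the answer above, avoids .values entirely: rename the bias columns so the labels match before subtracting, making pandas' alignment work for you rather than against you. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0],
                   'dx': [0.25, 0.25, 0.25], 'dy': [0.5, 0.5, 0.5]})

# Rename the bias columns to the measurement names so the labels line up,
# then the DataFrame-to-DataFrame subtraction aligns column by column
bias = df[['dx', 'dy']].rename(columns={'dx': 'x', 'dy': 'y'})
df = df.join((df[['x', 'y']] - bias).add_prefix('new'))

print(df['newx'].tolist())  # [0.75, 1.75, 2.75]
```

Here add_prefix produces the new column names ('newx', 'newy') and join attaches them by index, so no column list needs to be spelled out on the left-hand side.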
