Standardize pandas groupby results

I am using pandas to get subgroup averages, and the basics work fine. For instance,

import numpy as np
import pandas as pd
from pprint import pprint

d = np.array([[1, 4], [1, 1], [0, 1], [1, 1]])
m = d.mean(axis=1)
p = pd.DataFrame(m, index='A1,A2,B1,B2'.split(','), columns=['Obs'])
pprint(p)
x = p.groupby([v[0] for v in p.index])
pprint(x.mean())
x = p.groupby([v[1] for v in p.index])
pprint(x.mean())
YIELDS:

    Obs
A1  2.5
A2  1.0
B1  0.5
B2  1.0

    Obs
A  1.75   <<<< 1.75 is (2.5 + 1.0) / 2
B  0.75

   Obs
1  1.5
2  1.0
But, I also need to know how much A and B (1 and 2) deviate from their common mean. That is, I'd like to have tables like:
    Obs    Dev
A  1.75   0.50   <<< deviation of the Obs average, i.e., 1.75 - 1.25
B  0.75  -0.50   <<< 0.75 - 1.25 = -0.50

   Obs    Dev
1  1.5   0.25
2  1.0  -0.25
I can do this using loc, apply, etc., but this seems silly. Can anyone think of an elegant way to do this using groupby or something similar?

Aggregate the means, then compute the difference to the mean of means:

(p.groupby(p.index.str[0])
   .agg(Obs=('Obs', 'mean'))
   .assign(Dev=lambda d: d['Obs'] - d['Obs'].mean())
)

Or, if the groups have a variable number of items and you want the difference to the overall mean (not the mean of means!):

(p.groupby(p.index.str[0])
   .agg(Obs=('Obs', 'mean'))
   .assign(Dev=lambda d: d['Obs'] - p['Obs'].mean())  # notice the p (not d)
)
output:

    Obs  Dev
A  1.75  0.5
B  0.75 -0.5
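The same pattern should cover the second table as well, grouping on the digit instead of the letter; a minimal sketch along the same lines (not part of the original answer):

(p.groupby(p.index.str[1])
   .agg(Obs=('Obs', 'mean'))
   .assign(Dev=lambda d: d['Obs'] - d['Obs'].mean())
)

which should give the 1/2 table with Dev values of 0.25 and -0.25.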

Related

Distances between subsets of rows in pandas

I have a matrix of the format
    S1   S2  id var
0  1.2  3.2  A1   A
1  3.4  0.4  A2   A
2 -2.3  1.2  A3   A
3  0.1 -1.3  B1   B
4  4.5  1.3  B2   B
5 -2.3 -1.2  C1   C
And I want to compare the pairwise distances between all sets of A vs all B, then A vs C, and B vs C such that I get an average for dist_AB, dist_AC, and dist_BC. In other words:
dist_AB = ((A1 - B1) + (A1 - B2) + (A2 - B1) + (A2 - B2))/4
dist_AC = ((A1 - C1) + (A2 - C1))/2
dist_BC = ((B1 - C1) + (B2 - C1))/2
The challenge here is to do it on subsets. To implement this I can use loops:

import io
import itertools

import numpy as np
import pandas as pd

TESTDATA = """
S1 S2 id var
1.2 3.2 A1 A
3.4 0.4 A2 A
-2.3 1.2 A3 A
0.1 -1.3 B1 B
4.5 1.3 B2 B
-2.3 -1.2 C1 C
"""
df = pd.read_csv(io.StringIO(TESTDATA), sep=r"\s+")

vars_set = df[['id', 'var']].groupby('var')['id'].agg(list)
distances = pd.DataFrame()
for v1, v2 in itertools.combinations(vars_set.keys(), 2):
    print(v1 + v2)
    data1 = df.loc[df['var'] == v1]
    data2 = df.loc[df['var'] == v2]
    for row1 in data1.index:
        for row2 in data2.index:
            data1_row = data1.loc[row1]
            data2_row = data2.loc[row2]
            dist = np.linalg.norm(
                data1_row[['S1', 'S2']] - data2_row[['S1', 'S2']]
            )
            out = pd.Series([v1 + v2, data1_row['id'], data2_row['id'], dist],
                            index=['var', 'id1', 'id2', 'dist'])
            distances = pd.concat([distances, out], axis=1)
distances = distances.T
distances = distances.groupby('var')['dist'].agg('mean').reset_index()
distances
### returns the mean distances
var dist
0 AB 3.973345
1 AC 4.647527
2 BC 4.823540
My question is regarding the implementation. As I will be doing this calculation over many thousands of rows, this is very inefficient. Is there any more elegant and efficient way of doing it?
I have a solution without using itertools, but it involves a few steps. Let me know if it works with your larger dataset.
First we create a dataframe containing every combination using df.merge():

df2 = df.merge(df, how='cross')

Then we need to remove some combinations: same-group pairs such as A-A, and duplicate orientations, since e.g. A1-B1 is the same pair as B1-A1.

df2 = df2[df2.var_x != df2.var_y].reset_index(drop=True)
df2 = df2[pd.DataFrame(np.sort(df2[['id_x', 'id_y']].values, axis=1)).duplicated()]
Now we compute the distance:
df2['distance'] = np.linalg.norm(df2[['S1_x', 'S2_x']] - df2[['S1_y', 'S2_y']].values, axis = 1)
And finally using groupby we can compute the mean distance between the variables:
df2.groupby(['var_x', 'var_y']).distance.mean()
I hope it speeds up your computations!
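For reference, here is the whole approach assembled into one runnable sketch (same df as in the question; only the imports and the final print are added):

import io
import numpy as np
import pandas as pd

TESTDATA = """
S1 S2 id var
1.2 3.2 A1 A
3.4 0.4 A2 A
-2.3 1.2 A3 A
0.1 -1.3 B1 B
4.5 1.3 B2 B
-2.3 -1.2 C1 C
"""
df = pd.read_csv(io.StringIO(TESTDATA), sep=r"\s+")

# every pairwise combination of rows
df2 = df.merge(df, how='cross')
# drop same-group pairs, then keep only one orientation of each pair
df2 = df2[df2.var_x != df2.var_y].reset_index(drop=True)
df2 = df2[pd.DataFrame(np.sort(df2[['id_x', 'id_y']].values, axis=1)).duplicated()]
# vectorized Euclidean distance for each remaining pair
df2['distance'] = np.linalg.norm(
    df2[['S1_x', 'S2_x']].values - df2[['S1_y', 'S2_y']].values, axis=1)
# mean distance per pair of groups
print(df2.groupby(['var_x', 'var_y']).distance.mean())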

Converting a column into floats except the first few rows, i.e. selective conversion in a column

I'm trying to do some data processing but I'm getting the same error every time.
My dataframe (con_tc) looks as follows:

Index   u_p0   u_p1   u_p2   ...   u_p100
x       0      0      0            0
y       0      0      0            0
z       30     50     75           1000
0.01    0.5    0.6    0.43         0.83
0.02    0.56   0.94   0.94         0.7
....
1000    0.4    0.5    0.45         0.56
When I run this line of code

con_tc.index = con_tc.index.map(lambda w: float(w) if (w not in 'xyz') else w)

which tries to convert the index entries to float, I get the error

TypeError: 'in <string>' requires string as left operand, not float

The aim is to convert all the numeric index values into floats, except x, y and z.
In basic terms, the index should end up as:
Index
x
y
z
0.01
0.02
....
1000
If anyone can help me out it will be really helpful.
Maybe you can do it this way:

con_tc['float_u_p_0'] = con_tc.apply(
    lambda row: float(row.u_p0) if row.name not in ('x', 'y', 'z') else row.u_p0,
    axis=1
)
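For the index conversion the question actually asks about, the TypeError comes from using a float as the left operand of "in"; a minimal sketch that guards against non-string labels first (an assumption about how the labels are stored, not tested on the real data):

con_tc.index = con_tc.index.map(
    lambda w: w if isinstance(w, str) and w in ('x', 'y', 'z') else float(w)
)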

Choosing a random value from a discrete distribution

I came across the following code while reading up on RL. The probs vector contains the probability of each action to be taken, and I believe the loop tries to choose an action randomly from that distribution. Why/how does this work?
a = 0
rand_select = np.random.rand()
while True:
    rand_select -= probs[a]
    if rand_select < 0 or a + 1 == n_actions:
        break
    a += 1
actions = a
After going through similar code, I realised that "actions" contains the final action to be taken.
You can view the probabilities as a distribution of contiguous parts on the line from 0.0 to 1.0.
If we have A: 0.2, B: 0.3, C: 0.5, the line could be

0.0 --A--> 0.2
0.2 --B--> 0.5
0.5 --C--> 1.0

1.0 in total.
The algorithm chooses a random location between 0.0 and 1.0 and finds out where it "landed" (A, B or C) by sequentially ruling out parts.
Suppose we draw 0.73. We can visualize it like this (selection marked with *):

0.0 ---------------------------> 1.0
                         *
0.0 --A--> 0.2 --B--> 0.5 --C--> 1.0

0.73 - 0.2 > 0, so we subtract 0.2 (leaving 0.53) and rule out A:

0.2 --B--> 0.5
0.5 --C--> 1.0

0.53 - 0.3 > 0, so we subtract 0.3 (leaving 0.23) and rule out B:

0.5 --C--> 1.0

0.23 - 0.5 < 0, so we know the part we drew was C.
The selection is distributed according to the probabilities, and the algorithm is O(n), where n is the number of probabilities.
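For completeness, a self-contained sketch of the same loop with an example distribution (probs and n_actions are made-up values, not from the original snippet):

import numpy as np

probs = [0.2, 0.3, 0.5]         # example distribution for actions A, B, C
n_actions = len(probs)

a = 0
rand_select = np.random.rand()  # uniform draw in [0.0, 1.0)
while True:
    rand_select -= probs[a]     # rule out the current segment
    if rand_select < 0 or a + 1 == n_actions:
        break
    a += 1
print(a)  # index of the chosen action, distributed according to probs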

Issue replacing values in numpy array

I am trying to copy an array and replace all values in the copy below a threshold, but keep the original array intact.
Here is a simplified example of what I need to do.
import numpy as np
A = np.arange(0,1,.1)
B = A
B[B<.3] = np.nan
print ('A =', A)
print ('B =', B)
Which yields
A = [ nan nan nan 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
B = [ nan nan nan 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
I can't understand why the values in A below .3 are also overwritten.
Can someone explain this to me and suggest a workaround?
Change B = A to B = A.copy() and this should work as expected. As written, B and A refer to the same object in memory.
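A minimal sketch of the fixed example (only B = A is changed):

import numpy as np

A = np.arange(0, 1, .1)
B = A.copy()         # independent copy, so A stays untouched
B[B < .3] = np.nan
print('A =', A)      # original values preserved
print('B =', B)      # NaNs only in the copy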

Pandas new column with constant increments

I need a new column that increases in constant increments, in this case .02.
DF before:
x y x2 y2
0 1.022467 1.817298 1.045440 3.302572
1 1.026426 1.821669 1.053549 3.318476
2 1.018198 1.818419 1.036728 3.306648
3 1.013077 1.813290 1.026325 3.288020
4 1.017878 1.811058 1.036076 3.279930
DF after:
x y x2 y2 t
0 1.022467 1.817298 1.045440 3.302572 0.000000
1 1.026426 1.821669 1.053549 3.318476 0.020000
2 1.018198 1.818419 1.036728 3.306648 0.040000
3 1.013077 1.813290 1.026325 3.288020 0.060000
4 1.017878 1.811058 1.036076 3.279930 0.080000
5 1.016983 1.814031 1.034254 3.290708 0.100000
I have looked around for a while and cannot find a good solution. The only way that comes to mind is to build a standard Python list and bring it in. There has to be a better way. Thanks
Because your index is the perfect range for this (i.e. 0...n), just multiply it by your constant:
df['t'] = .02 * df.index.values
>>> df
x y x2 y2 t
0 1.022467 1.817298 1.045440 3.302572 0.00
1 1.026426 1.821669 1.053549 3.318476 0.02
2 1.018198 1.818419 1.036728 3.306648 0.04
3 1.013077 1.813290 1.026325 3.288020 0.06
4 1.017878 1.811058 1.036076 3.279930 0.08
You could also use a list comprehension:
df['t'] = [0.02 * i for i in range(len(df))]
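If the index is ever not a clean 0...n range, a position-based variant should also work (a sketch using np.arange, not part of the original answer):

import numpy as np

df['t'] = np.arange(len(df)) * 0.02   # 0.00, 0.02, 0.04, ... regardless of the index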
