I have a matrix in the following format:
S1 S2 id var
0 1.2 3.2 A1 A
1 3.4 0.4 A2 A
2 -2.3 1.2 A3 A
3 0.1 -1.3 B1 B
4 4.5 1.3 B2 B
5 -2.3 -1.2 C1 C
I want to compute the pairwise distances between all rows of A and all rows of B, then A vs C, and B vs C, so that I get an average for dist_AB, dist_AC, and dist_BC. In other words, writing d(x, y) for the Euclidean distance between the (S1, S2) coordinates of two rows:
dist_AB = (d(A1,B1) + d(A1,B2) + d(A2,B1) + d(A2,B2) + d(A3,B1) + d(A3,B2)) / 6
dist_AC = (d(A1,C1) + d(A2,C1) + d(A3,C1)) / 3
dist_BC = (d(B1,C1) + d(B2,C1)) / 2
The challenge here is to do it on subsets. To implement this I can use loops:
import io
import itertools

import numpy as np
import pandas as pd

TESTDATA = """
S1 S2 id var
1.2 3.2 A1 A
3.4 0.4 A2 A
-2.3 1.2 A3 A
0.1 -1.3 B1 B
4.5 1.3 B2 B
-2.3 -1.2 C1 C
"""
df = pd.read_csv(io.StringIO(TESTDATA), sep=r"\s+")

# list of ids per group, e.g. A -> [A1, A2, A3]
vars_set = df[['id', 'var']].groupby('var')['id'].agg(list)
distances = pd.DataFrame()
for v1, v2 in itertools.combinations(vars_set.keys(), 2):
    print(v1 + v2)
    data1 = df.loc[df['var'] == v1]
    data2 = df.loc[df['var'] == v2]
    for row1 in data1.index:
        for row2 in data2.index:
            data1_row = data1.loc[row1]
            data2_row = data2.loc[row2]
            dist = np.linalg.norm(
                data1_row[['S1', 'S2']] - data2_row[['S1', 'S2']]
            )
            out = pd.Series([v1 + v2, data1_row['id'], data2_row['id'], dist],
                            index=['var', 'id1', 'id2', 'dist'])
            distances = pd.concat([distances, out], axis=1)
distances = distances.T
distances = distances.groupby('var')['dist'].agg('mean').reset_index()
distances
### returns the mean distances
var dist
0 AB 3.973345
1 AC 4.647527
2 BC 4.823540
My question is regarding the implementation. As I will be doing this calculation over many thousands of rows, this is very inefficient. Is there any more elegant and efficient way of doing it?
I have a solution without using itertools, but it involves a few steps. Let me know if it works with your larger dataset.
First we create a dataframe containing every combination using df.merge():
df2 = df.merge(df, how='cross')
Then we need to remove unwanted combinations: pairs within the same group (e.g. A-A), and duplicated orderings (A1-B1 is the same pair as B1-A1).
df2 = df2[df2.var_x != df2.var_y].reset_index(drop=True)
# after sorting the ids within each row, every unordered pair appears
# twice; .duplicated() flags the second occurrence, which we keep
df2 = df2[pd.DataFrame(np.sort(df2[['id_x', 'id_y']].values, axis=1)).duplicated()]
Now we compute the distance:
df2['distance'] = np.linalg.norm(df2[['S1_x', 'S2_x']] - df2[['S1_y', 'S2_y']].values, axis=1)
And finally using groupby we can compute the mean distance between the variables:
df2.groupby(['var_x', 'var_y']).distance.mean()
I hope it speeds up your computations!
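If that is still too slow at scale, here is a minimal sketch of an alternative built on scipy.spatial.distance.cdist (assuming SciPy is available; mean_group_distances is a made-up helper name). It avoids materializing the full cross join and reproduces the means from the question:

import itertools

import pandas as pd
from scipy.spatial.distance import cdist

def mean_group_distances(df):
    # one coordinate array per group
    coords = {v: g[['S1', 'S2']].to_numpy() for v, g in df.groupby('var')}
    means = {}
    for v1, v2 in itertools.combinations(sorted(coords), 2):
        # cdist returns the full |v1| x |v2| Euclidean distance matrix
        means[v1 + v2] = cdist(coords[v1], coords[v2]).mean()
    return pd.Series(means, name='dist')

mean_group_distances(df)
# AB    3.973345
# AC    4.647527
# BC    4.823540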
Supposing I have a data frame that looks like:
col1 col2
0 10
1 23
2 21
3 15
I want to sequentially subtract from each value in col2 the result computed for the previous row, so that each subtraction feeds into the next:
col1 col2
0 10 # left unchanged as index == 0
1 13 # 23 - 10
2 8 # 21 - 13
3 7 # 15 - 8
Other solutions I have found all subtract the previous value as-is, not the newly computed value. I would like to avoid for loops since I have a very large dataset.
Expand the recurrence to understand the 'previously subtracted' value:
b1 = a1
b2 = a2 - b1 = a2 - a1
b3 = a3 - b2 = a3 - a2 + a1
b4 = a4 - b3 = a4 - a3 + a2 - a1
b5 = a5 - b4 = a5 - a4 + a3 - a2 + a1
So we just do
s = np.arange(len(df)) % 2
s = s + s - 1   # alternating signs: -1, 1, -1, 1, ...
# the lower-triangular sign matrix encodes the expanded sums above
df['new'] = np.tril(np.multiply.outer(s, s)).dot(df.col2)
df['new'].values
# array([10, 13, 8, 7])
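The same recurrence can also be evaluated directly; a minimal sketch for cross-checking, using itertools.accumulate from the standard library:

from itertools import accumulate

# b1 = a1, then b_n = a_n - b_{n-1}
df['new'] = list(accumulate(df.col2, lambda prev, cur: cur - prev))
# [10, 13, 8, 7]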
Below is a simple pure-pandas approach (no numpy import needed) that is conceptually more straightforward and easy to understand directly from the code, without additional explanation:
Let's first define a function which will do the required work:
def ssf(val):
    # carries the running result in a module-level variable
    global last_val
    last_val = val - last_val
    return last_val
Using the function above the code for creating the new column will be:
last_val = 0
df['new'] = df.col2.apply(ssf)   # df['new'] -> 10, 13, 8, 7
Let's compare the number of functions/methods used by the pure-pandas approach with the numpy approach in the other answer.
The pandas approach uses 2 functions/methods, ssf() and .apply(), plus 1 operation: a simple subtraction.
The numpy approach uses 5 functions/methods, .arange(), len(), .tril(), .multiply.outer() and .dot(), plus 3 operations: array addition, array subtraction, and modulo division.
I am using pandas to get subgroup averages, and the basics work fine. For instance,
import numpy as np
import pandas as pd
from pprint import pprint

d = np.array([[1, 4], [1, 1], [0, 1], [1, 1]])
m = d.mean(axis=1)
p = pd.DataFrame(m, index='A1,A2,B1,B2'.split(','), columns=['Obs'])
pprint(p)
# group by the first character of the index (A/B)
x = p.groupby([v[0] for v in p.index])
pprint(x.mean())
# group by the second character of the index (1/2)
x = p.groupby([v[1] for v in p.index])
pprint(x.mean())
YIELDS:
Obs
A1 2.5
A2 1.0
B1 0.5
B2 1.0
Obs
A 1.75 <<<< 1.75 is (2.5 + 1.0) / 2
B 0.75
Obs
1 1.5
2 1.0
But, I also need to know how much A and B (1 and 2) deviate from their common mean. That is, I'd like to have tables like:
Obs Dev
A 1.75 0.50 <<< deviation of the Obs average, i.e., 1.75 - 1.25
B 0.75 -0.50 <<< 0.75 - 1.25 = -0.50
Obs Dev
1 1.5 0.25
2 1.0 -0.25
I can do this using loc, apply, etc., but that seems silly. Can anyone think of an elegant way to do it using groupby or something similar?
Aggregate the means, then compute the difference to the mean of means:
(p.groupby(p.index.str[0])
.agg(Obs=('Obs', 'mean'))
.assign(Dev=lambda d: d['Obs']-d['Obs'].mean())
)
Or, if the groups have different numbers of items and you want the difference to the overall mean (not the mean of group means!):
(p.groupby(p.index.str[0])
.agg(Obs=('Obs', 'mean'))
.assign(Dev=lambda d: d['Obs']-p['Obs'].mean()) # notice the p (not d)
)
output:
Obs Dev
A 1.75 0.5
B 0.75 -0.5
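The same pattern grouped on the second character of the index reproduces the question's second table:

(p.groupby(p.index.str[1])
   .agg(Obs=('Obs', 'mean'))
   .assign(Dev=lambda d: d['Obs'] - d['Obs'].mean())
)
# Obs Dev
# 1 1.5 0.25
# 2 1.0 -0.25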
I have a feature matrix with mixed data types, and corresponding output labels for each row. I am interested in finding the hierarchy among the output labels (i.e., classes or intents). The following is a sample:
token probability intent
--------------------------------------------------------
t1 0.2 a
t2 0.7 a
t3 0.1 a
t1 0.3 b
t4 0.6 b
t3 0.1 b
t5 0.3 c
t6 0.3 c
t7 0.25 c
t8 0.15 c
t1 0.5 d
t2 0.5 d
Based on this data I want to generate a tree to represent a relationship among the output labels:
()
/ \
() \
/ |\ \
/ | \ \
a b d c
I have looked into dendrograms, and for mixed data types the distance matrix that can be used is the Gower distance. I suspect these are the right tools, but I was not able to find a way to put them together. Also note that a node can merge more than two children at once (a, b, d). Is there any way to do this, along these lines or otherwise?
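No worked answer is given here, but a minimal sketch of one way to combine the pieces: pivot the data to one feature vector per intent, build a Gower distance matrix with the third-party gower package (pip install gower), and hand it to scipy's hierarchical clustering. The pivot shape and the fill value are assumptions about your data, and plotting the dendrogram requires matplotlib:

import gower  # third-party package
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# one row per intent, token probabilities as columns
# (tokens absent from an intent filled with 0 -- an assumption)
features = df.pivot_table(index='intent', columns='token',
                          values='probability', fill_value=0)

# Gower distance copes with mixed types; after this pivot everything
# is numeric, so a plain Euclidean metric would also work
dist = gower.gower_matrix(features)

# linkage() expects a condensed double-precision distance matrix
Z = linkage(squareform(dist.astype(float), checks=False), method='average')
dendrogram(Z, labels=features.index.tolist())

Note that scipy's linkage always merges two clusters at a time, so a multi-way grouping like (a, b, d) would come from cutting the binary tree at a height threshold, e.g. with scipy.cluster.hierarchy.fcluster.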
I have a dataframe with 2 identifiers (ID1, ID2), 3 numeric columns (X1, X2, X3), and a column titled 'input' (6 columns in total), with n rows. For each row, I want the index k of the last column such that the running sum X1 + X2 + ... + Xk is still within the value of input (0 if even X1 exceeds it).
How can I do this in Python?
In R I did this by using:
tmp = data
for (i in 4:5)
{
  data[, i] <- tmp$input - rowSums(tmp[, 3:i])
}
output <- apply(data[, 3:5], 1, function(x) max(which(x > 0)))
data$output <- output
I am trying to translate this into Python. What might be the best way to do this? There can be N such rows, and M such columns.
Sample Data:
ID1 ID2 X1 X2 X3 INPUT OUTPUT (explanation)
a   b   1  2  3    3      2   (X1 = 1, X1+X2 = 3, X1+X2+X3 = 6 ... after 2 sums, input < sums)
a1  a2  5  2  1    4      0   (X1 = 5, X1+X2 = 7, X1+X2+X3 = 8 ... even for 1 sum, input < sums)
a2  b2  0  4  5  100      3   (X1 = 0, X1+X2 = 4, X1+X2+X3 = 9 ... even after 3 sums, input > sums)
You can use the pandas module, which handles this very effectively in Python.
import pandas as pd

# taking a sample data here
df = pd.DataFrame([
    ['A', 'B', 1, 3, 4, 0.1],
    ['K', 'L', 10, 3, 14, 0.5],
    ['P', 'H', 1, 73, 40, 0.6]],
    columns=['ID1', 'ID2', 'X2', 'X3', 'X4', 'INPUT'])

# below code does the functionality you want: count how many of the
# leading cumulative sums stay within INPUT
sums = df[['X2', 'X3', 'X4']].cumsum(axis=1)
df['new_column'] = sums.le(df['INPUT'], axis=0).sum(axis=1)
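Applied to the question's own sample (column names as in the question), this reproduces the expected OUTPUT column:

sample = pd.DataFrame({'ID1': ['a', 'a1', 'a2'], 'ID2': ['b', 'a2', 'b2'],
                       'X1': [1, 5, 0], 'X2': [2, 2, 4], 'X3': [3, 1, 5],
                       'INPUT': [3, 4, 100]})
sums = sample[['X1', 'X2', 'X3']].cumsum(axis=1)
sample['OUTPUT'] = sums.le(sample['INPUT'], axis=0).sum(axis=1)
# OUTPUT: 2, 0, 3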
I would like to have a function defined for percentage diff calculation between any two pandas columns.
Let's say my dataframe is defined by:
R1 R2 R3 R4 R5 R6
A B 1 2 3 4
I would like my calculation defined as
df['R7'] = df[['R3','R4']].apply( method call to calculate perc diff)
and
df['R8'] = df[['R5','R6']].apply(same method call to calculate perc diff)
How can I do that?
I have tried the below:
df['perc_cnco_error'] = df[['CumNetChargeOffs_x','CumNetChargeOffs_y']].apply(lambda x,y: percCalc(x,y))

def percCalc(x, y):
    if x < 1e-9:
        return 0
    else:
        return (y - x) * 100 / x
and it gives me the error message:
TypeError: ('<lambda>() takes exactly 2 arguments (1 given)', u'occurred at index CumNetChargeOffs_x')
In its simplest terms:
def percentage_change(col1,col2):
return ((col2 - col1) / col1) * 100
You can apply it to any 2 columns of your dataframe:
df['a'] = percentage_change(df['R3'],df['R4'])
df['b'] = percentage_change(df['R6'],df['R5'])
>>> print(df)
R1 R2 R3 R4 R5 R6 a b
0 A B 1 2 3 4 100.0 -25.0
Equivalently, using pandas' arithmetic operation methods:

def percentage_change(col1, col2):
    return col2.sub(col1).div(col1).mul(100)

See Series.sub, Series.div and Series.mul.
You can also utilise pandas built-in pct_change which computes the percentage change across all the columns passed, and select the column you want to return:
df['R7'] = df[['R3','R4']].pct_change(axis=1)['R4']
df['R8'] = df[['R6','R5']].pct_change(axis=1)['R5']
>>> print(df)
R1 R2 R3 R4 R5 R6 a b R7 R8
0 A B 1 2 3 4 100.0 -25.0 1.0 -0.25
Setup:
df = pd.DataFrame({'R1': 'A', 'R2': 'B',
                   'R3': 1, 'R4': 2, 'R5': 3, 'R6': 4},
                  index=[0])
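For completeness, the .apply pattern from the question also works once the lambda takes a single row; the original TypeError comes from apply handing the function one object (a whole row, with axis=1), not two separate arguments. A sketch reusing the question's column names and percCalc:

df['perc_cnco_error'] = df[['CumNetChargeOffs_x', 'CumNetChargeOffs_y']].apply(
    lambda row: percCalc(row.iloc[0], row.iloc[1]), axis=1)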
To calculate percent diff between R3 and R4 you can use:
df['R7'] = (df.R3 - df.R4) / df.R3 * 100
This would give you the deviation in percentage (selecting the two numeric columns first, since the frame also contains non-numeric ones):
df[['R3', 'R4']].apply(lambda row: (row.iloc[0] - row.iloc[1]) / row.iloc[0] * 100, axis=1)
For a different pair of columns, e.g. R3 and R5, try
df[['R3', 'R5']].apply(lambda row: (row.iloc[0] - row.iloc[1]) / row.iloc[0] * 100, axis=1)