Tensor indexing with matrix - python

I have a (3 x 15) matrix dummies with sequences of tokens as rows:
[[ 1 66 67 68 0 0 0 0 0 0 0 0 0 0 0]
[ 1 66 67 66 68 66 67 66 0 0 0 0 0 0 0]
[ 1 66 67 68 18 19 20 21 22 23 24 25 26 17 0]]
Also, there's a tensor probs of shape (3 x 15 x n_tokens) with token probabilities.
From probs I need to select only probabilities of tokens in dummies.
I think it may be possible to use the matrix as indices into the tensor, but I haven't found out how to do that.

You can do that with tf.gather_nd, building an index tensor whose last dimension holds (row, column, token) triples:
import tensorflow as tf

dummies = ...
probs = ...

# Row and column index vectors matching the shape of dummies.
s = tf.shape(dummies)
i = tf.range(s[0])
j = tf.range(s[1])
# With 'ij' indexing, ii holds each element's row id and jj its column id.
ii, jj = tf.meshgrid(i, j, indexing='ij')
# Stack (row, column, token) triples along the last axis...
idx = tf.stack([ii, jj, dummies], axis=-1)
# ...and gather: result[i, j] = probs[i, j, dummies[i, j]].
result = tf.gather_nd(probs, idx)
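On TensorFlow versions that support the batch_dims argument of tf.gather (1.14+ and 2.x), the same lookup can be written more compactly. This is my addition, not part of the original answer:

# Treat the first two axes as batch dimensions and pick, along axis 2,
# result[i, j] = probs[i, j, dummies[i, j]].
result = tf.gather(probs, dummies, axis=2, batch_dims=2)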

Related

argsort() only positive and negative values separately and add a new pandas column

I have a dataframe that has a column, 'col', with both positive and negative numbers. I would like to run a ranking separately on the positive and the negative numbers, with zeros excluded so they don't mess up the ranking. My issue is that my code below is updating the 'col' column. I must be keeping a reference to it, but I'm not sure where?
import random
import numpy as np
import pandas as pd

data = {'col': [random.randint(-1000, 1000) for _ in range(100)]}
df = pd.DataFrame(data)
pos_idx = np.where(df.col > 0)[0]
neg_idx = np.where(df.col < 0)[0]
p = df[df.col > 0].col.values
n = df[df.col < 0].col.values
# Double argsort turns values into ranks, then scale to 0-100.
p_rank = np.round(p.argsort().argsort() / (len(p) - 1) * 100, 1)
n_rank = np.round((n * -1).argsort().argsort() / (len(n) - 1) * 100, 1)
pc = df.col.values
pc[pc > 0] = p_rank
pc[pc < 0] = n_rank
df['ranking'] = pc
One way to do it is to avoid mutating the original dataframe. df.col.values returns a view of the dataframe's underlying array, not a copy, so writing into pc writes straight through to df. Replace this line in your code:
pc = df.col.values
with:
pc = df.copy().col.values
So that:
print(df)
# Output
     col  ranking
0   -492       49
1    884       93
2   -355       36
3    741       77
4   -210       24
..   ...      ...
95   564       57
96   683       63
97  -129       18
98  -413       44
99   810       81

[100 rows x 2 columns]
I was able to figure it out on my own: I created a new column of zeros, then used .loc to update the values at their respective index locations.
df['ranking'] = 0
df.loc[df.col > 0, 'ranking'] = p_rank
df.loc[df.col < 0, 'ranking'] = n_rank
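For what it's worth, pandas can compute percentile ranks directly with Series.rank; a minimal alternative sketch (my suggestion, not from the thread; note that rank(pct=True) scales ranks into (0, 1] rather than starting at 0 like the double-argsort version):

import random
import pandas as pd

df = pd.DataFrame({'col': [random.randint(-1000, 1000) for _ in range(100)]})
df['ranking'] = 0.0
pos = df.col > 0
neg = df.col < 0
# Percentile rank within each sign group, scaled to 0-100;
# negatives are negated first so the most negative ranks highest.
df.loc[pos, 'ranking'] = df.loc[pos, 'col'].rank(pct=True) * 100
df.loc[neg, 'ranking'] = (-df.loc[neg, 'col']).rank(pct=True) * 100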

selecting indexes with multiple years of observations

I wish to select only the rows that have observations across multiple years. For example, suppose
import numpy as np
import pandas as pd

mlIndx = pd.MultiIndex.from_tuples([('x', 0), ('x', 1), ('z', 0), ('y', 1), ('t', 0), ('t', 1)])
df = pd.DataFrame(np.random.randint(0, 100, (6, 2)), columns=['a', 'b'], index=mlIndx)
In [18]: df
Out[18]:
      a   b
x 0   6   1
  1  63  88
z 0  69  54
y 1  27  27
t 0  98  12
  1  69  31
My desired output is
Out[19]:
      a   b
x 0   6   1
  1  63  88
t 0  98  12
  1  69  31
My current solution is blunt, so something that can scale up more easily would be great. You can assume a sorted index.
df.reset_index(level=0, inplace=True)
df[df.level_0.duplicated() | df.level_0.duplicated(keep='last')]
Out[30]:
  level_0   a   b
0       x   6   1
1       x  63  88
0       t  98  12
1       t  69  31
You can figure this out with groupby (on the first level of the index) + transform, and then use boolean indexing to filter out those rows:
df[df.groupby(level=0).a.transform('size').gt(1)]
      a   b
x 0  67  83
  1   2  34
t 0  18  87
  1  63  20
Details
Output of the groupby -
df.groupby(level=0).a.transform('size')
x  0    2
   1    2
z  0    1
y  1    1
t  0    2
   1    2
Name: a, dtype: int64
Filtering from here is straightforward, just find those rows with size > 1.
Use the groupby filter. You can pass a function that returns a boolean to it:
df.groupby(level=0).filter(lambda x: len(x) > 1)
      a   b
x 0   7  33
  1  31  43
t 0  71  18
  1  68  72
I've spent my fair share of time focused on speed. Not all solutions need to be the fastest, but since the subject has come up, I'll offer what I think should be a fast one; it is my intent to keep future readers informed.
Results of Time Test
res.plot(loglog=True)
res.div(res.min(1), 0).T
                        10         30         100         300         1000         3000
cs                4.425970   4.643234    5.422120    3.768960     3.912819     3.937120
wen               2.617455   4.288538    6.694974   18.489803    57.416648   148.860403
jp                6.644870  21.444406   67.315362  208.024627   569.421257  1525.943062
pir               6.043569  10.358355   26.099766   63.531397   165.032540   404.254033
pir_pd_factorize  1.153351   1.132094    1.141539    1.191434     1.000000     1.000000
pir_np_unique     1.058743   1.000000    1.000000    1.000000     1.021489     1.188738
pir_best_of       1.000000   1.006871    1.030610    1.086425     1.068483     1.025837
Simulation Details
from timeit import timeit

import numpy as np
import pandas as pd

def pir_pd_factorize(df):
    f, u = pd.factorize(df.index.get_level_values(0))
    m = np.bincount(f)[f] > 1
    return df[m]

def pir_np_unique(df):
    u, f = np.unique(df.index.get_level_values(0), return_inverse=True)
    m = np.bincount(f)[f] > 1
    return df[m]

def pir_best_of(df):
    if len(df) > 1000:
        return pir_pd_factorize(df)
    else:
        return pir_np_unique(df)

def cs(df):
    return df[df.groupby(level=0).a.transform('size').gt(1)]

def pir(df):
    return df.groupby(level=0).filter(lambda x: len(x) > 1)

def wen(df):
    s = df.a.count(level=0)
    return df.loc[s[s > 1].index.tolist()]

def jp(df):
    return df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000],
    columns='cs wen jp pir pir_pd_factorize pir_np_unique pir_best_of'.split(),
    dtype=float
)

np.random.seed([3, 1415])
for i in res.index:
    d = pd.DataFrame(
        dict(a=range(i)),
        pd.MultiIndex.from_arrays([
            np.random.randint(i // 4 * 3, size=i),
            range(i)
        ])
    )
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import d, {j}'
        res.at[i, j] = timeit(stmt, setp, number=100)
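If you're wondering why the factorize/bincount approach wins: factorize maps each first-level label to a small integer code, bincount counts the occurrences of each code, and indexing those counts with the codes broadcasts each group's size back onto its rows, all in vectorized NumPy with no per-group Python calls. A tiny standalone illustration (mine, not part of the original answer):

import numpy as np
import pandas as pd

labels = pd.Index(['x', 'x', 'z', 'y', 't', 't'])
codes, uniques = pd.factorize(labels)  # codes: [0 0 1 2 3 3]
counts = np.bincount(codes)            # per-label counts: [2 1 1 2]
mask = counts[codes] > 1               # per-row "group size > 1"
print(mask)                            # [ True  True False False  True  True]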
Just a new way:
s = df.a.count(level=0)
df.loc[s[s > 1].index.tolist()]
Out[12]:
      a   b
x 0   1  31
  1  70  29
t 0  42  26
  1  96  29
And if you want to keep using duplicated:
s = df.index.get_level_values(level=0)
df.loc[s[s.duplicated()].tolist()]
Out[18]:
      a   b
x 0   1  31
  1  70  29
t 0  42  26
  1  96  29
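A caveat for readers on newer pandas (my note, not part of the original answers): the level keyword of Series.count was deprecated in the 1.x line and removed in pandas 2.0, so on current versions the equivalent spelling is:

s = df.a.groupby(level=0).count()
df.loc[s[s > 1].index.tolist()]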
I'm not convinced groupby is necessary:
df = df.sort_index()
df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]
#       a   b
# x 0  16   3
#   1  97  36
# t 0   9  18
#   1  37  30
Some benchmarking:
df = pd.concat([df]*10000).sort_index()

def cs(df):
    return df[df.groupby(level=0).a.transform('size').gt(1)]

def pir(df):
    return df.groupby(level=0).filter(lambda x: len(x) > 1)

def wen(df):
    s = df.a.count(level=0)
    return df.loc[s[s > 1].index.tolist()]

def jp(df):
    return df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]

%timeit cs(df)   # 19.5 ms
%timeit pir(df)  # 33.8 ms
%timeit wen(df)  # 17.0 ms
%timeit jp(df)   # 22.3 ms

outputting python/numpy arrays as columns

I'm very new to Python, but have been using it to calculate and filter through data. I'm trying to output my array so I can pass it to other programs, but the output is one solid piece of text, with brackets and commas separating it.
I understand there are ways of manipulating this, but I want to understand why my code outputs it in this format, and how to make it output nice columns instead.
The array was generated with:
#! /usr/bin/env python
import numpy as np
import networkx
import gridData
from scipy.spatial.distance import euclidean

INPUT1 = open("test_area.xvg", 'r')
INPUT2 = open("test_atom.xvg", 'r')
OUTPUT1 = open("negdist.txt", 'w')
area = []
pointneg = []
posneg = []
negdistance = []
negresarea = []
while True:
    line = INPUT1.readline()
    if not line:
        break
    col = line.split()
    if col:
        area.append(((col[0]), float(col[1])))
pointneg.append((-65.097000, 5.079000, -9.843000))
while True:
    line = INPUT2.readline()
    if not line:
        break
    col = line.split()
    if col:
        pointneg.append((float(col[5]), float(col[6]), float(col[7])))
        posneg.append((col[4]))
for col in posneg:
    negresarea.append(area[int(col)-1][1])
a = len(pointneg)
for x in xrange(a-1):
    negdistance.append((-1, (negresarea[x]), euclidean((pointneg[0]), (pointneg[x]))))
print >> OUTPUT1, negdistance
example output:
[(-1, 1.22333, 0.0), (-1, 1.24223, 153.4651968428021), (-1, 1.48462, 148.59335545709976), (-1, 1.39778, 86.143305392816202), (-1, 0.932278, 47.914688322058403), (-1, 1.04997, 28.622555546282022),
desired output:
[-1, 1.22333, 0.0
-1, 1.24223, 153.4651968428021
-1, 1.48462, 148.59335545709976
-1, 1.39778, 86.143305392816202
-1, 0.932278, 47.914688322058403
-1, 1.04997, 28.622555546282022...
Example inputs:
example input1
1 2.12371 0
2 1.05275 0
3 0.865794 0
4 0.933986 0
5 1.09092 0
6 1.22333 0
7 1.54639 0
8 1.24223 0
9 1.10928 0
10 1.16232 0
11 0.60942 0
12 1.40117 0
13 1.58521 0
14 1.00011 0
15 1.18881 0
16 1.68442 0
17 0.866275 0
18 1.79196 0
19 1.4375 0
20 1.198 0
21 1.01645 0
22 1.82221 0
23 1.99409 0
24 1.0728 0
25 0.679654 0
26 1.15578 0
27 1.28326 0
28 1.00451 0
29 1.48462 0
30 1.33399 0
31 1.13697 0
32 1.27483 0
33 1.18738 0
34 1.08141 0
35 1.15163 0
36 0.93699 0
37 0.940171 0
38 1.92887 0
39 1.35721 0
40 1.85447 0
41 1.39778 0
42 1.97309 0
Example Input2
ATOM 35 CA GLU 6 56.838 -5.202 -102.459 1.00273.53 C
ATOM 55 CA GLU 8 54.729 -6.650 -96.930 1.00262.73 C
ATOM 225 CA GLU 29 5.407 -2.199 -58.801 1.00238.62 C
ATOM 321 CA GLU 41 -24.633 -0.327 -34.928 1.00321.69 C
The problem is the multiple parentheses when you append: you are appending tuples.
What you want is to be adding lists, i.e. the ones with square brackets.
import numpy as np

area = []
with open('example2.txt') as filehandle:
    for line in filehandle:
        if line.strip() == '':
            continue
        line = line.strip().split(',')
        area.append([int(line[0]), float(line[1]), float(line[2])])
area = np.array(area)
print(area)
'example2.txt' is the data you provided, saved as a CSV.
I didn't really get an answer that enabled me to understand the problem; the one suggested above just prevented the whole code from working properly. I did find a workaround by including the print command in the loop defining my final output:
for x in xrange(a-1):
    negdistance.append((-1, (negresarea[x]), euclidean((pointneg[0]), (pointneg[x]))))
    print negdistance
    negdistance = []
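For what it's worth, a third option (my suggestion, not from the thread): once the rows are collected, np.savetxt writes them as plain columns without brackets or commas, one row per line:

import numpy as np

# negdistance as built above: a list of (-1, area, distance) rows.
np.savetxt("negdist.txt", np.asarray(negdistance), fmt="%d %.6f %.10f")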

2D circular convolution Vs convolution FFT [Matlab/Octave/Python]

I am trying to understand the FFT and convolution (cross-correlation) theory, and for that reason I have created the following code to explore it. The code is Matlab/Octave; however, I could also do it in Python.
In 1D:
x = [5 6 8 2 5];
y = [6 -1 3 5 1];
x1 = [x zeros(1,4)];
y1 = [y zeros(1,4)];
c1 = ifft(fft(x1).*fft(y1));
c2 = conv(x,y);
c1 = 30 31 57 47 87 47 33 27 5
c2 = 30 31 57 47 87 47 33 27 5
In 2D:
X = [1 2 3;4 5 6; 7 8 9];
y = [-1 1];
conv1 = conv2(X,y)
conv1 =
  -1  -1  -1   3
  -4  -1  -1   6
  -7  -1  -1   9
Here is where I run into the problem: padding a matrix and a vector. How should I do it? Should I pad X with zeros all around, or just on one side? And what about y? I know that the length of the convolution should be M+L-1 when x and y are vectors, but what about when they are matrices?
How could I continue my example here?
You need to zero-pad each variable with:
as many zero-columns as the number of columns of the other variable, minus one;
as many zero-rows as the number of rows of the other variable, minus one.
In Matlab, it looks like this:
% 1D
x = [5 6 8 2 5];
y = [6 -1 3 5 1];
x1 = [x zeros(1,size(x,2))];  % pad so the circular convolution covers the linear one
y1 = [y zeros(1,size(y,2))];
c1 = ifft(fft(x1).*fft(y1));
c2 = conv(x,y,'full');
% 2D
X = [1 2 3;4 5 6; 7 8 9];
Y = [-1 1];
% Pad X with size(Y,2)-1 zero-columns and size(Y,1)-1 zero-rows...
X1 = [X zeros(size(X,1),size(Y,2)-1);zeros(size(Y,1)-1,size(X,2)+size(Y,2)-1)];
% ...and embed Y at the top-left of a zero matrix of the same padded size.
Y1 = zeros(size(X1)); Y1(1:size(Y,1),1:size(Y,2)) = Y;
c1 = ifft2(fft2(X1).*fft2(Y1));
c2 = conv2(X,Y,'full');
(The original answer also included a picture illustrating the zero-padding; it is not reproduced here.)
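Since the question mentions Python as an option, here is a NumPy translation of the same idea (my sketch, not part of the original answer): pad both arrays to (M+P-1) x (N+Q-1), multiply in the frequency domain, and check against scipy's direct 2D convolution.

import numpy as np
from scipy.signal import convolve2d

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
Y = np.array([[-1, 1]])

# Full linear convolution size: sum of sizes minus one, per axis.
shape = (X.shape[0] + Y.shape[0] - 1, X.shape[1] + Y.shape[1] - 1)

# fft2's size argument zero-pads each array to `shape` before transforming.
c1 = np.real(np.fft.ifft2(np.fft.fft2(X, shape) * np.fft.fft2(Y, shape)))
c2 = convolve2d(X, Y, mode='full')

print(np.allclose(c1, c2))  # True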

replace zeroes in numpy array with the median value

I have a numpy array like this:
foo_array = [38,26,14,55,31,0,15,8,0,0,0,18,40,27,3,19,0,49,29,21,5,38,29,17,16]
I want to replace all the zeros with the median value of the whole array (where the zero values are not to be included in the calculation of the median)
So far I have this going on:
foo_array = [38,26,14,55,31,0,15,8,0,0,0,18,40,27,3,19,0,49,29,21,5,38,29,17,16]
foo = np.array(foo_array)
foo = np.sort(foo)
print "foo sorted:",foo
#foo sorted: [ 0 0 0 0 0 3 5 8 14 15 16 17 18 19 21 26 27 29 29 31 38 38 40 49 55]
nonzero_values = foo[0::] > 0
nz_values = foo[nonzero_values]
print "nonzero_values?:",nz_values
#nonzero_values?: [ 3 5 8 14 15 16 17 18 19 21 26 27 29 29 31 38 38 40 49 55]
size = np.size(nz_values)
middle = size / 2
print "median is:",nz_values[middle]
#median is: 26
Is there a clever way to achieve this with numpy syntax?
Thank you
This solution takes advantage of numpy.median:
import numpy as np
foo_array = [38,26,14,55,31,0,15,8,0,0,0,18,40,27,3,19,0,49,29,21,5,38,29,17,16]
foo = np.array(foo_array)
# Compute the median of the non-zero elements
m = np.median(foo[foo > 0])
# Assign the median to the zero elements
foo[foo == 0] = m
Just a note of caution: the median of your array (with the zeroes excluded) is 23.5, but as written this sticks in 23, because foo has an integer dtype.
foo2 = foo.copy()  # note: foo[:] would give a view, not a copy, for a numpy array
foo2[foo2 == 0] = nz_values[middle]
Instead of foo2, you could just update foo if you want. Numpy's smart array syntax can combine a few lines of the code you made. For example, instead of
nonzero_values = foo[0::] > 0
nz_values = foo[nonzero_values]
You can just do
nz_values = foo[foo > 0]
You can find out more about "fancy indexing" in the documentation.
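To make the caution above concrete: if you want the zeros to become the true 23.5, cast the array to float before assigning. A small sketch of the fix (my addition):

import numpy as np

foo = np.array(foo_array, dtype=float)  # float dtype so 23.5 isn't truncated
foo[foo == 0] = np.median(foo[foo > 0])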
