Creating a value matrix in Python

I have a dataset as follows:
import pandas as pd
d = {'dist': [100, 200, 200, 400], 'id': [1, 2, 3, 4]}
df = pd.DataFrame(data=d)
I would like to create a value matrix over the ids, computed as dist(id1) - dist(id2):
  id |    1 |    2 |    3 |    4
   1 |    0 |  100 |  100 |  300
   2 | -100 |    0 |    0 |  200
   3 | -100 |    0 |    0 |  200
   4 | -300 | -200 | -200 |    0
Any advice will be appreciated.

(Edit) Here's the simplified version via the beauty of numpy broadcasting:
import numpy as np
d = {'dist': [100, 200, 200, 400], 'id': [1, 2, 3, 4]}
a = np.array(d['dist']).reshape(1, -1)  # row vector
b = a.reshape(-1, 1)                    # column vector
# the solution: broadcasting subtracts every pair
print(a - b)
# [[   0  100  100  300]
#  [-100    0    0  200]
#  [-100    0    0  200]
#  [-300 -200 -200    0]]
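If you want the result back with the ids as labels, here is a small sketch that wraps the same broadcasting trick in a DataFrame, using the df from the question:
import numpy as np
import pandas as pd
d = {'dist': [100, 200, 200, 400], 'id': [1, 2, 3, 4]}
df = pd.DataFrame(data=d)
dist = df['dist'].values
matrix = pd.DataFrame(dist.reshape(1, -1) - dist.reshape(-1, 1),
                      index=df['id'], columns=df['id'])
print(matrix)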
(Old Answer) You can do it with a little matrix algebra:
import numpy as np
d = {'dist': [100, 200, 200, 400], 'id': [1, 2, 3, 4]}
a = np.array(d['dist']).reshape(1, -1)
b = a.reshape(-1, 1)
# some matrix algebra: the outer product c[i, j] = dist_i * dist_j
c = b.dot(a)
e = c / a  # e[i, j] = dist_i
f = c / b  # f[i, j] = dist_j
# the solution
print(f - e)
# [[   0  100  100  300]
#  [-100    0    0  200]
#  [-100    0    0  200]
#  [-300 -200 -200    0]]

I'm not familiar with numpy, but you could create the matrix, given the existing data structure, using this mildly complicated dictionary comprehension:
matrix = {col: {row: d["dist"][i] - d["dist"][j]
                for j, row in enumerate(d["id"])}
          for i, col in enumerate(d["id"])}
Keys of the matrix are the columns, and keys of each column are the rows. You could probably write this more neatly, but this is a built-ins-only answer that conforms to your request.
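For example, a quick check against the table above:
print(matrix[2][1])  # 100, i.e. dist(2) - dist(1)
print(matrix[1][4])  # -300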

Related

Aggregate list dataframe in pandas using dedicated function

I have the following dataframe in pandas:
import pandas as pd
data = {'ID_1': {0: '10A00', 1: '10B00', 2: '20001', 3: '20001'},
        'ID_2_LIST': {0: [20009, 30006], 1: [20001, 30006],
                      2: [30009, 30006], 3: [20001, 30003]},
        'ID_OCCURRENCY_LIST': {0: [1, 2], 1: [5, 6], 2: [2, 4], 3: [1, 3]}}
# create df
df = pd.DataFrame(data)
| | ID_1 | ID_2_LIST | ID_OCCURRENCY_LIST |
|---:|:-------|:---------------|:---------------------|
| 0 | 10A00 | [20009, 30006] | [1, 2] |
| 1 | 10B00 | [20001, 30006] | [5, 6] |
| 2 | 20001 | [30009, 30006] | [2, 4] |
| 3 | 20001 | [20001, 30003] | [1, 3] |
I would like to aggregate by the ID_1 field, applying an external function to identify similar ID_1 values (say similarID(ID1, ID2), which returns ID1 or ID2 according to some internal rules), then re-generate the list of ID_2 values and sum the occurrences for all equal ID_2 values.
The outcome should be:
**INDEX  ID_1   ID_2_LIST                     ID_OCCURRENCY_LIST**
0        10A00  [20009, 30006, 20001]         [1, 8, 5]
1        10B00  [20001, 30006, 30003, 20001]  [5, 6, 4, 2]
2        20001  [30009, 30006, 20001, 30003]  [2, 4, 1, 3]
EDIT
The code for the function is the following (s1 = first string, c1 = second string, p1 = similarity percentage, l1 = confidence level; pyDamerauLevenschtein is the Damerau-Levenshtein distance from the literature):
def pySimilar(s1, c1, p1, l1):
    if s1 is None or c1 is None:
        return 0
    if len(s1) <= 5 or len(c1) <= 5:
        return 0
    s1 = s1.strip()
    c1 = c1.strip()
    s = s1
    c = c1
    if s1[3:len(s1)] == c1[3:len(c1)]:
        return 1
    if len(s1) >= len(c1):
        ITERATIONLENGTH = len(c1) / 2
    else:
        ITERATIONLENGTH = len(s1) / 2
    if len(s1) >= len(c1):
        a = int(len(c1) / 2) + 1
        if s1.find(c1[3:a]) < 0:
            return 0
    else:
        b = int(len(s1) / 2) + 1
        if c1.find(s1[3:b]) < 0:
            return 0
    v = []
    CNT = 0
    TMP = 0
    max_res = 0
    search = s1
    while CNT < ITERATIONLENGTH:
        TMP = (100 - ((pyDamerauLevenschtein(s[3:len(s)], c[3:len(c)])) * 100) / (len(c) - 3)) * ((len(search) - 3) / (len(s1) - 3))
        v.append(TMP)
        CNT = CNT + 1
        if TMP > max_res:
            max_res = TMP
        #s = s[0:len(s)-CNT]
        search = s1[0:len(s1) - CNT]
        s = s1[0:len(s1) - CNT]
        c = c1[0:len(c1) - CNT]
    if ((p1 - (l1 * p1 / 100) <= sum(v) / len(v) and sum(v) / len(v) <= p1 + (l1 * p1 / 100)) or sum(v) / len(v) >= p1 + (l1 * p1 / 100)):
        return 1
    else:
        return 0
I have implemented a function to be applied to the dataframe, but it is very slow:
def aggregateListAndOccurrencies(list1, list2):
    final = []
    final_cnt = []
    output = []
    cnt_temp = 0
    while list1:
        elem = list1.pop(0)
        cnt = list2.pop(0)
        i = 0
        cnt_temp = cnt
        for item in list1:
            if pyMATCHSIMILARPN(elem, item, 65, 20) == 1:
                cnt_temp = list2[i] + cnt_temp
                list1.pop(i)
                list2.pop(i)
            i += 1
        final.append(elem)
        final_cnt.append(cnt_temp)
    output.append(final)
    output.append(final_cnt)
    return output
How could I apply this in pandas? Any suggestions?
You can simply do a groupby over ID_1 and sum the ID_2_LIST and ID_OCCURRENCY_LIST columns (summing lists concatenates them):
df.groupby('ID_1').agg({'ID_2_LIST': 'sum', 'ID_OCCURRENCY_LIST': 'sum'})
If there's a specific function you'd like the groupby to apply, you can use a lambda to add it in the .agg:
df.groupby('ID_1').agg({'ID_2_LIST': 'sum', 'ID_OCCURRENCY_LIST': lambda x: ' '.join(x)})
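If, additionally, duplicate ID_2 values inside each aggregated list should be collapsed and their counts summed, here is a minimal exact-match sketch (merge_counts is a hypothetical helper; it ignores the similarity function, which you could plug in where the keys are compared):
from collections import OrderedDict

def merge_counts(ids, counts):
    # sum the counts of identical ids, keeping first-seen order
    acc = OrderedDict()
    for i, c in zip(ids, counts):
        acc[i] = acc.get(i, 0) + c
    return list(acc.keys()), list(acc.values())

grouped = df.groupby('ID_1').agg({'ID_2_LIST': 'sum',
                                  'ID_OCCURRENCY_LIST': 'sum'}).reset_index()
merged = [merge_counts(a, b) for a, b in zip(grouped['ID_2_LIST'],
                                             grouped['ID_OCCURRENCY_LIST'])]
grouped['ID_2_LIST'] = [m[0] for m in merged]
grouped['ID_OCCURRENCY_LIST'] = [m[1] for m in merged]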

Python: multiplying a fixed array element-wise by a second array

Example code:
import numpy as np
a = np.arange(1,11)
b = np.arange(1,11)
b[:] = 0
b[3] = 10
b[4] = 10
print(a, b)
[ 1 2 3 4 5 6 7 8 9 10] [ 0 0 0 10 10 0 0 0 0 0]
I am attempting to multiply b element-wise by the a array such that my resulting array is the following:
[0 0 10 30 50 70 90 110 130 150]
Any help would be greatly appreciated.
It looks like you want the convolution of both arrays:
np.convolve(a, b)[1:len(a) + 1]
# array([  0,   0,  10,  30,  50,  70,  90, 110, 130, 150])
Note that plain elementwise multiplication of b with a would give you [0, 0, 0, 40, 50, 0, 0, 0, 0, 0], and not what you have stated.
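A self-contained check of both operations (a sketch reproducing the arrays from the question):
import numpy as np

a = np.arange(1, 11)
b = np.zeros(10, dtype=int)
b[3] = b[4] = 10

print(a * b)                            # plain elementwise product
# [ 0  0  0 40 50  0  0  0  0  0]
print(np.convolve(a, b)[1:len(a) + 1])  # sliding weighted sum
# [  0   0  10  30  50  70  90 110 130 150]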

Index of identical rows in a NumPy array

I already asked a variation of this question, but I still have a problem regarding the runtime of my code.
Given a numpy array consisting of 15000 rows and 44 columns. My goal is to find out which rows are equal and add them to a list, like this:
1 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 0 0 0 0
1 2 3 4 5
Result:
equal_rows1 = [1,2,3]
equal_rows2 = [0,4]
What I have done up till now is use the following code:
import numpy as np

input_data = np.load('IN.npy')
equal_inputs1 = []
equal_inputs2 = []
for i in range(len(input_data)):
    for j in range(i + 1, len(input_data)):
        if np.array_equal(input_data[i], input_data[j]):
            equal_inputs1.append(i)
            equal_inputs2.append(j)
The problem is that it takes a lot of time to return the desired arrays, and that this approach only allows for two different lists of similar rows although there can be more. Is there any better solution for this, especially regarding the runtime?
This is pretty simple with pandas groupby:
df
A B C D E
0 1 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 1 2 3 4 5
[g.index.tolist() for _, g in df.groupby(df.columns.tolist()) if len(g.index) > 1]
# [[1, 2, 3], [0, 4]]
If you are dealing with many rows and many unique groups, this might get a bit slow. The performance depends on your data. Perhaps there is a faster NumPy alternative, but this is certainly the easiest to understand.
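One NumPy-only alternative, sketched below, is to group the rows with np.unique and return_inverse (the axis argument requires NumPy 1.13+):
import numpy as np

arr = np.array([[1, 0, 0, 0, 0],
                [0, 0, 0, 0, 0],
                [0, 0, 0, 0, 0],
                [0, 0, 0, 0, 0],
                [1, 0, 0, 0, 0],
                [1, 2, 3, 4, 5]])

# inverse[i] is the id of the unique row equal to arr[i]
_, inverse = np.unique(arr, axis=0, return_inverse=True)
groups = [np.flatnonzero(inverse == g).tolist()
          for g in range(inverse.max() + 1)]
print([g for g in groups if len(g) > 1])
# [[1, 2, 3], [0, 4]]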
You can use collections.defaultdict, which retains the row values as keys:
from collections import defaultdict

dd = defaultdict(list)
for idx, row in enumerate(df.values):
    dd[tuple(row)].append(idx)

print(list(dd.values()))
# [[0, 4], [1, 2, 3], [5]]
print(dd)
# defaultdict(<class 'list'>, {(1, 0, 0, 0, 0): [0, 4],
#                              (0, 0, 0, 0, 0): [1, 2, 3],
#                              (1, 2, 3, 4, 5): [5]})
You can, if you wish, filter out unique rows via a dictionary comprehension.
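For example, continuing from dd above:
duplicates = {k: v for k, v in dd.items() if len(v) > 1}
print(duplicates)
# {(1, 0, 0, 0, 0): [0, 4], (0, 0, 0, 0, 0): [1, 2, 3]}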

TensorFlow - numpy-like tensor indexing

In numpy, we can do this:
x = np.random.random((10,10))
a = np.random.randint(0,10,5)
b = np.random.randint(0,10,5)
x[a,b] # gives 5 entries from x, indexed according to the corresponding entries in a and b
When I try something equivalent in TensorFlow:
xt = tf.constant(x)
at = tf.constant(a)
bt = tf.constant(b)
xt[at,bt]
The last line gives a "Bad slice index tensor" exception. It seems TensorFlow doesn't support indexing like numpy or Theano.
Does anybody know if there is a TensorFlow way of doing this (indexing a tensor by arbitrary values)? I've seen the tf.nn.embedding part, but I'm not sure it can be used for this, and even if it can, it's a huge workaround for something this straightforward.
(Right now, I'm feeding the data from x as an input and doing the indexing in numpy but I hoped to put x inside TensorFlow to get higher efficiency)
You can actually do that now with tf.gather_nd. Let's say you have a matrix m like the following:
| 1 2 3 4 |
| 5 6 7 8 |
And you want to build a matrix r of size, let's say, 4x2, built from elements of m, like this:
| 3 6 |
| 2 7 |
| 5 3 |
| 1 1 |
Each element of r corresponds to a row and column of m, and you can have matrices rows and cols with these indices (zero-based, since we are programming, not doing math!):
| 0 1 | | 2 1 |
rows = | 0 1 | cols = | 1 2 |
| 1 0 | | 0 2 |
| 0 0 | | 0 0 |
Which you can stack into a 3-dimensional tensor like this:
| | 0 2 | | 1 1 | |
| | 0 1 | | 1 2 | |
| | 1 0 | | 2 0 | |
| | 0 0 | | 0 0 | |
This way, you can get from m to r through rows and cols as follows:
import numpy as np
import tensorflow as tf
m = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
rows = np.array([[0, 1], [0, 1], [1, 0], [0, 0]])
cols = np.array([[2, 1], [1, 2], [0, 2], [0, 0]])
x = tf.placeholder('float32', (None, None))
idx1 = tf.placeholder('int32', (None, None))
idx2 = tf.placeholder('int32', (None, None))
result = tf.gather_nd(x, tf.stack((idx1, idx2), -1))
with tf.Session() as sess:
    r = sess.run(result, feed_dict={
        x: m,
        idx1: rows,
        idx2: cols,
    })
print(r)
Output:
[[ 3.  6.]
 [ 2.  7.]
 [ 5.  3.]
 [ 1.  1.]]
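Applied to the exact arrays from the question, the same tf.gather_nd approach reduces to one stack call. A minimal sketch, using the same graph-mode API as above:
import numpy as np
import tensorflow as tf

x = np.random.random((10, 10))
a = np.random.randint(0, 10, 5)
b = np.random.randint(0, 10, 5)

xt = tf.constant(x)
idx = tf.stack((a, b), axis=-1)  # shape (5, 2): one (row, col) pair per pick
picked = tf.gather_nd(xt, idx)   # equivalent to x[a, b] in numpy

with tf.Session() as sess:
    print(sess.run(picked))
    print(x[a, b])  # same values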
LDGN's comment is correct. This is not possible at the moment, and is a requested feature. If you follow issue #206 on GitHub you'll get updated if/when this is available. Many people would like this feature.
For TensorFlow 0.11, basic indexing has been implemented. More advanced indexing (like boolean indexing) is still missing, but apparently is planned for future versions.
Advanced indexing can be tracked with https://github.com/tensorflow/tensorflow/issues/4638

Indexing on DataFrame with MultiIndex

I have a large pandas DataFrame that I need to fill.
Here is my code:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

trains = np.arange(1, 101)
# The above are example values; it's actually 900 integers between 1 and 20000
tresholds = np.arange(10, 70, 10)
tuples = []
for i in trains:
    for j in tresholds:
        tuples.append((i, j))
index = pd.MultiIndex.from_tuples(tuples, names=['trains', 'tresholds'])
df = pd.DataFrame(np.zeros((len(index), len(trains))), index=index, columns=trains, dtype=float)

metrics = dict()
for i in trains:
    m = binary_metric_train(True, i)
    # The above function returns a binary array of length 35
    # Example: [1, 0, 0, 1, ...]
    metrics[i] = m

for i in trains:
    for j in tresholds:
        trA = binary_metric_train(True, i, tresh=j)
        for k in trains:
            if k != i:
                trB = metrics[k]
                corr = abs(pearsonr(trA, trB)[0])
                df[k][i][j] = corr
            else:
                df[k][i][j] = np.nan
My problem is that when this piece of code is finally done computing, my DataFrame df still contains nothing but zeros. Even the NaNs are not inserted. I think that my indexing is correct. Also, I have tested my binary_metric_train function separately, and it does return an array of length 35.
Can anyone spot what I am missing here?
EDIT: For clarity, this DataFrame looks like this:
1 2 3 4 5 ...
trains tresholds
1 10
20
30
40
50
60
2 10
20
30
40
50
60
...
As @EdChum noted, you should take a look at pandas indexing. Here's some test data for the purpose of illustration, which should clear things up.
import numpy as np
import pandas as pd

trains     = [ 1,  1,  1,  2,  2,  2]
thresholds = [10, 20, 30, 10, 20, 30]
data       = [ 1,  0,  1,  0,  1,  0]

df = pd.DataFrame({
    'trains': trains,
    'thresholds': thresholds,
    'C1': data,
    'C2': data
}).set_index(['trains', 'thresholds'])
print(df)

df.ix[(2, 30), 0] = 3      # using the column index
# or...
df.ix[(2, 30), 'C1'] = 3   # using the column name
df.loc[(2, 30), 'C1'] = 3  # using the column name
# but not...
df.loc[(2, 30), 1] = 3     # this creates a new column
print(df)
Which outputs the DataFrame before and after modification:
C1 C2
trains thresholds
1 10 1 1
20 0 0
30 1 1
2 10 0 0
20 1 1
30 0 0
C1 C2 1
trains thresholds
1 10 1 1 NaN
20 0 0 NaN
30 1 1 NaN
2 10 0 0 NaN
20 1 1 NaN
30 3 0 3
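For the question's loop itself, the chained assignment df[k][i][j] = corr most likely writes to a temporary copy; a sketch of the fix is a single .loc call with the full MultiIndex tuple:
# inside the innermost loop from the question
if k != i:
    df.loc[(i, j), k] = abs(pearsonr(trA, metrics[k])[0])
else:
    df.loc[(i, j), k] = np.nan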
