Matlab to python conversion matrix operations - python

Hi, I am trying to convert this distance formula for rectilinear distance from MATLAB to Python. X1 and X2 are two matrices of two-dimensional points and could have differing lengths.
nd = size(X1); n = nd(1);
d = nd(2);
m = size(X2,1);
D = abs(X1(:,ones(1,m)) - X2(:,ones(1,n))') + ...
    abs(X1(:,2*ones(1,m)) - X2(:,2*ones(1,n))');
I think the biggest problem I am having in Python is reproducing the ones() indexing trick with X1 and X2, since they are np.arrays.

First, your code run in Octave:
octave:1> X1=[0,1,2,3;2,3,1,1]'
octave:2> X2=[2,3,2;4,2,4]'
<your code>
octave:21> D
D =
4 3 4
2 3 2
3 2 3
4 1 4
Matching numpy code:
import numpy as np

X1 = np.array([[0,1,2,3],[2,3,1,1]]).T
X2 = np.array([[2,3,2],[4,2,4]]).T
D = np.abs(X1[:,None,:] - X2[None,:,:]).sum(axis=-1)
produces, D:
array([[4, 3, 4],
       [2, 3, 2],
       [3, 2, 3],
       [4, 1, 4]])
numpy broadcasts automatically, so it doesn't need the ones() indexing to expand the dimensions. Instead I use None (same as np.newaxis) to create new dimensions. The difference is then 3D, and summing it over the last axis gives the 2D distance matrix.
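For intuition, here is a minimal sketch of the broadcasting step, using the shapes from the example above (nothing new is introduced, only the intermediate array is made explicit):
import numpy as np

X1 = np.array([[0, 1, 2, 3], [2, 3, 1, 1]]).T    # shape (4, 2)
X2 = np.array([[2, 3, 2], [4, 2, 4]]).T          # shape (3, 2)

# X1[:, None, :] has shape (4, 1, 2) and X2[None, :, :] has shape (1, 3, 2);
# broadcasting the subtraction gives a (4, 3, 2) array of coordinate differences.
diff = X1[:, None, :] - X2[None, :, :]
print(diff.shape)                                # (4, 3, 2)

# summing |differences| over the last axis gives the (4, 3) rectilinear distances
D = np.abs(diff).sum(axis=-1)
print(D)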
I forgot how spoilt we are with the numpy broadcasting. Though newer Octave has something similar:
D = sum(abs(reshape(X1,[],1,2)-reshape(X2,1,[],2)),3)

Related

Speed up algorithm pandas

Before turning to the question, I have to explain the algorithm that I am using.
For this purpose, say I have a dataframe as follows:
import numpy as np
import pandas as pd

# initialize list of lists
data = [[2, [4], None], [4, [9, 18, 6], None], [6, [], 9], [7, [2], None],
        [9, [4], 7], [14, [18, 6], 3], [18, [7], 1]]
# Create a mock pandas DataFrame
df = pd.DataFrame(data, columns=['docdb', 'cited_docdb', 'frontier'])
Now I will define a distance measure which is 0 wherever the frontier variable is not NaN.
The algorithm basically updates the distance variable as follows:
Look for all docdb having a distance=0 within the variable cited_docdb (which is a list for each observation);
Assign a value of 0 to them within cited_docdb;
Assign a distance of 1 to all docdb having at least a 0 within their cited_docdb
Repeat the process with distance=1,2,3,..., max_cited_docdb (the maximum number of docdb cited)
The algorithm works as follows:
df.replace(' NaN', np.NaN)
df['distance'] = np.where(df['frontier'] > 0, 0, np.nan)

for k in range(max(max_cited_docdb)):
    s = df.set_index('docdb')['distance'].dropna()[df.set_index('docdb')['distance'].dropna() >= k]
    df['cited_docdb'] = [[s.get(i, i) for i in x] for x in df['cited_docdb']]
    m = [k in x for x in df['cited_docdb']]
    df.loc[m & df['distance'].isna(), 'distance'] = k + 1
Now, my problem is that my original database has 3 million observations, and the docdb that cites the most other docdb has 9500 values (i.e. the longest cited_docdb list has 9500 values). Hence, the algorithm above is extremely slow. Is there a way to speed it up (e.g. by modifying the algorithm somehow, or with dask?), or not?
Thanks a lot
It looks like a graph problem where you want to get the shortest distance between the nodes in docdb and a fixed terminal node (here NaN).
You can approach this with networkx.
Here is your graph:
import networkx as nx

G = nx.from_pandas_edgelist(df.explode('cited_docdb'),
                            source='docdb', target='cited_docdb',
                            create_using=nx.DiGraph)

# get shortest path length (minus 2 to exclude the two endpoints of the path)
d = {n: len(nx.shortest_path(G, n, np.nan)) - 2
     for n in df.loc[df['frontier'].isna(), 'docdb']}
# {2: 2, 4: 1, 7: 3}

# map values
df['distance'] = df['docdb'].map(d).fillna(0, downcast='infer')
output:
docdb cited_docdb frontier distance
0 2 [4] NaN 2
1 4 [9, 18, 6] NaN 1
2 6 [] 9.0 0
3 7 [2] NaN 3
4 9 [4] 7.0 0
5 14 [18, 6] 3.0 0
6 18 [7] 1.0 0
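A hedged side note on the 3-million-row scale (not part of the answer above, just a sketch under the assumption that G and df are built exactly as shown): calling nx.shortest_path once per node repeats a lot of work, while a single reverse traversal from the NaN terminal node yields all path lengths at once.
import networkx as nx
import numpy as np

# one traversal from the NaN terminal node; for each source node this gives the
# number of edges on its shortest path to NaN, so subtract 1 (the final hop into
# NaN) to match the distances computed above
lengths = dict(nx.shortest_path_length(G, target=np.nan))
d = {n: lengths[n] - 1 for n in df.loc[df['frontier'].isna(), 'docdb']}
df['distance'] = df['docdb'].map(d).fillna(0, downcast='infer')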

numpy vectorized operation for a large array

I am trying to do some computations on a numpy array in Python 3.
the array:
    c0  c1  c2  c3
r0   1   5   2   7
r1   3   9   4   6
r2   8   2   1   3
Here the "cx" and "rx" are column and row names.
For each row, I need to compute the difference between every element and the element at a given column index for that row; the element at that index itself stays unchanged.
e.g.
given a column list [0, 2, 1] # they are column indices
which means that
for r0, we need to calculate the difference between the c0 and all other columns, so we have
[1, 5-1, 2-1, 7-1]
for r1, we need to calculate the difference between the c2 and all other columns, so we have
[3-4, 9-4, 4, 6-4]
for r2, we need to calculate the difference between the c1 and all other columns, so we have
[8-2, 2, 1-2, 3-2]
so, the result should be
1 4 1 6
-1 5 4 2
6 2 -1 1
Because the array could be very large, I would like to do the calculation by numpy vectorized operation, e.g. broadcasting.
But I am not sure how to do it efficiently.
I have checked Vectorizing operation on numpy array, Vectorizing a Numpy slice operation, Vectorize large NumPy multiplication, Replace For Loop with Numpy Vectorized Operation, Vectorize numpy array for loop.
But, none of them work for me.
Thanks for any help!
Extract the values from the array first and then do subtraction:
import numpy as np

a = np.array([[1, 5, 2, 7],
              [3, 9, 4, 6],
              [8, 2, 1, 3]])
cols = [0, 2, 1]

# create the index for advanced indexing
idx = np.arange(len(a)), cols
# extract values
vals = a[idx]
# subtract array by the values
a -= vals[:, None]
# add original values back to corresponding position
a[idx] += vals
print(a)
#[[ 1  4  1  6]
# [-1  5  4  2]
# [ 6  2 -1  1]]
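As an optional variant (a sketch of the same idea, not the answer's code), the subtraction can also be done without modifying a in place, using a boolean mask and np.where:
import numpy as np

a = np.array([[1, 5, 2, 7],
              [3, 9, 4, 6],
              [8, 2, 1, 3]])
cols = np.array([0, 2, 1])

vals = a[np.arange(len(a)), cols]              # the value picked in each row
keep = np.arange(a.shape[1]) == cols[:, None]  # True at the picked column of each row
res = np.where(keep, a, a - vals[:, None])     # keep the picked value, subtract elsewhere
print(res)
#[[ 1  4  1  6]
# [-1  5  4  2]
# [ 6  2 -1  1]]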

Pandas transform inconsistent behavior for list

I have a sample snippet that works as expected:
import pandas as pd
df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0,0,0,0]})
df['new'] = df.groupby(['label'])[['wave']].transform(tuple)
The result is:
label wave y new
0 a 1 0 (1,)
1 b 2 0 (2, 3)
2 b 3 0 (2, 3)
3 c 4 0 (4,)
It works analogously if, instead of tuple, I pass set, frozenset, or dict to transform, but if I pass list I get a completely unexpected result:
df['new'] = df.groupby(['label'])[['wave']].transform(list)
label wave y new
0 a 1 0 1
1 b 2 0 2
2 b 3 0 3
3 c 4 0 4
There is a workaround to get expected result:
df['new'] = df.groupby(['label'])[['wave']].transform(tuple)['wave'].apply(list)
label wave y new
0 a 1 0 [1]
1 b 2 0 [2, 3]
2 b 3 0 [2, 3]
3 c 4 0 [4]
I thought about mutability/immutability (list/tuple) but for set/frozenset it is consistent.
The question is: why does it work this way?
I've come across a similar issue before. I think the underlying issue is that when the number of elements in the list matches the number of records in the group, transform tries to unpack the list so that each element of the list maps to a record in the group.
For example, this will cause the list to unpack, as the len of the list matches the length of each group:
df.groupby(['label'])[['wave']].transform(lambda x: list(x))
wave
0 1
1 2
2 3
3 4
However, if the length of the list is not the same as each group, you will get desired behaviour:
df.groupby(['label'])[['wave']].transform(lambda x: list(x)+[0])
wave
0 [1, 0]
1 [2, 3, 0]
2 [2, 3, 0]
3 [4, 0]
I think this is a side effect of the list unpacking functionality.
I think that is a bug in pandas. Can you open a ticket on their github page please?
At first I thought it might be because list is just not handled correctly as an argument to .transform, but if I do:
def create_list(obj):
    print(type(obj))
    return obj.to_list()

df.groupby(['label'])[['wave']].transform(create_list)
I get the same unexpected result. If however the agg method is used, it works directly:
df.groupby(['label'])['wave'].agg(list)
Out[179]:
label
a [1]
b [2, 3]
c [4]
Name: wave, dtype: object
I can't imagine that this is intended behavior.
Btw, I also find the different behavior suspicious that shows up if you apply tuple to a grouped series versus a grouped dataframe. E.g. if transform is applied to a series instead of a DataFrame, the result is not a series containing tuples, but a series containing ints (remember that [['wave']] creates a one-column dataframe, for which transform(tuple) indeed returned tuples):
df.groupby(['label'])['wave'].transform(tuple)
Out[177]:
0 1
1 2
2 3
3 4
Name: wave, dtype: int64
If I do that again with agg instead of transform, it works for both ['wave'] and [['wave']].
I was using version 0.25.0 on an Ubuntu x86_64 system for my tests.
Since DataFrames are mainly designed to handle 2D data, storing arrays instead of scalar values can stumble upon a caveat such as this one.
pd.DataFrame.transform is originally implemented on top of .agg:
# pandas/core/generic.py
@Appender(_shared_docs["transform"] % dict(axis="", **_shared_doc_kwargs))
def transform(self, func, *args, **kwargs):
    result = self.agg(func, *args, **kwargs)
    if is_scalar(result) or len(result) != len(self):
        raise ValueError("transforms cannot produce " "aggregated results")
    return result
However, transform always returns a DataFrame that must have the same length as self, which is essentially the input.
When you do an .agg function on the DataFrame, it works fine:
df.groupby('label')['wave'].agg(list)
label
a [1]
b [2, 3]
c [4]
Name: wave, dtype: object
The problem gets introduced when transform tries to return a Series with the same length.
In the process of transforming a groupby element, which is a slice from self, and then concatenating it again, lists get unpacked to the same length as the index, as @Allen mentioned.
However, when the lengths don't align, they don't get unpacked:
df.groupby(['label'])[['wave']].transform(lambda x: list(x) + [1])
wave
0 [1, 1]
1 [2, 3, 1]
2 [2, 3, 1]
3 [4, 1]
A workaround for this problem might be to avoid transform:
df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0,0,0,0]})
df = df.merge(df.groupby('label')['wave'].agg(list).rename('new'), on='label')
df
label wave y new
0 a 1 0 [1]
1 b 2 0 [2, 3]
2 b 3 0 [2, 3]
3 c 4 0 [4]
Another interesting work around, that works for strings, is:
df = df.applymap(str) # Make them all strings... would be best to use on non-numeric data.
df.groupby(['label'])['wave'].transform(' '.join).str.split()
Output:
0 [1]
1 [2, 3]
2 [2, 3]
3 [4]
Name: wave, dtype: object
The suggested answers do not work on Pandas 1.2.4 anymore. Here is a workaround for it:
df.groupby(['label'])[['wave']].transform(lambda x: [list(x) + [1]]*len(x))
The idea behind it is the same as explained in other answers (e.g. @Allen's answer). The solution here is to wrap the list into another list and repeat it as many times as the group length, so that when pandas transform unwraps it, each row gets the inner list.
output:
wave
0 [1, 1]
1 [2, 3, 1]
2 [2, 3, 1]
3 [4, 1]

Checking min distance between multiple slices of dataframe of co-ordinates

Having two lists of cell IDs
A = [4, 6, 2493, 2494, 2495]
B = [3, 7, 4983, 4982, 4984, 4981, 4985, 2492, 2496]
Each cell from the lists above has X, Y coordinates in separate columns of a df:
df
cell_ID; X; Y
1; 5; 6
2; 10; 6
...
The values in the A and B lists are the ones in the cell_ID column. How can I find the sum of distances between cells in A and B, primarily looking at cells in A in relationship to B? So for each cell in B I have to calculate 5 (A's length) distances, take the min() of those 5, and sum() all those nine min values. I hope that makes sense.
I was thinking the following:
Take the first value from list A (the cell with id = 4), calculate the distances to all cells in B, and keep only the min value.
Repeat step 1 for all other values in A.
Make a sum() of all those min distances.
I tried with the code below... but failed
def sum_distances(df, i, col_X='X', col_Y='Y'):
    for i in range(A):
        return (((df.iloc[B][col_X] - df.iloc[i,2])**2 + (df.iloc[B][col_Y] - df.iloc[i,3])**2)**0.5).min
I don't know how to integrate min() and sum() at the same time.
If I'm not mistaken, you're looking for the Euclidean distance between (x,y) co-ordinates. Here is one possible approach (based on this SO post)
Generate some dummy data in the same format as the OP:
import pandas as pd
import numpy as np
A = [0, 1, 2, 3, 4]
B = [10, 11, 12, 13, 14, 9, 8, 7, 6]
df = pd.DataFrame(np.random.rand(15,2), columns=['X','Y'], index=range(15))
df.index.name = 'CellID'
print('Raw data\n{}'.format(df))
Raw data
X Y
CellID
0 0.125591 0.890772
1 0.754238 0.644081
2 0.952322 0.099627
3 0.090804 0.809511
4 0.514346 0.041740
5 0.678598 0.230225
6 0.594182 0.432634
7 0.005777 0.891440
8 0.925187 0.045035
9 0.903591 0.238609
10 0.187591 0.255377
11 0.252635 0.149840
12 0.513432 0.972749
13 0.433606 0.550940
14 0.104991 0.440052
To get the minimum distance between each index of B and A
# Get df at indexes from list A: df_A
df_A = df.iloc[A,]
# For df at each index from list B (df.iloc[b,]), get distance to df_A: d
dist = []
for b in B:
    d = (pd.DataFrame(df_A.values - df.iloc[b,].values)**2).sum(1)**0.5
    dist.append(d.min())
print('Sum of minimum distances is {}'.format(sum(dist)))
Output (for sum of minimum distances between each index of B and A)
Sum of minimum distances is 2.36509386378
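As an optional aside (a sketch, not part of the answer above), the same per-B minimum can be computed without the Python loop by broadcasting, assuming the same df, A and B as in the dummy data:
import numpy as np

XA = df.iloc[A][['X', 'Y']].to_numpy()   # shape (len(A), 2)
XB = df.iloc[B][['X', 'Y']].to_numpy()   # shape (len(B), 2)

# pairwise Euclidean distances, shape (len(B), len(A))
pairwise = np.sqrt(((XB[:, None, :] - XA[None, :, :]) ** 2).sum(axis=-1))
print('Sum of minimum distances is {}'.format(pairwise.min(axis=1).sum()))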

How to sum NaN in numpy?

I have to sum two values obtained by np.average, as follows:
for i in x:
    a1 = np.average(function1(i))
    a2 = np.average(function2(i))
    plt.plot(i, a1+a2, 'o')
But np.average may return NaN. Then only points for which both a1 and a2 are available will be plotted.
How can I use zero instead of NaN to make the sum for all points?
I tried to find a function in numpy to do so, but numpy.nan_to_num is for arrays.
You can use numpy like this:
import numpy as np
a = [1, 2, np.nan]
a_sum = np.nansum(a)
a_mean = np.nanmean(a)
print('a = ', a) # [1, 2, nan]
print("a_sum = {}".format(a_sum)) # 3.0
print("a_mean = {}".format(a_mean)) # 1.5
You can also use:
clean_x = x[~np.isnan(x)]
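To tie this back to the original loop, here is a minimal runnable sketch of treating a NaN average as zero before summing; x, function1 and function2 below are dummy stand-ins for the question's objects (assumptions, only there to make the sketch self-contained):
import numpy as np
import matplotlib.pyplot as plt

# dummy stand-ins for the question's x, function1 and function2
x = [1, 2, 3]
function1 = lambda i: np.array([i, np.nan]) if i == 2 else np.array([i, i + 1.0])
function2 = lambda i: np.array([i, 2.0 * i])

for i in x:
    a1 = np.average(function1(i))
    a2 = np.average(function2(i))
    # np.nan_to_num also accepts scalars, so a NaN average becomes 0.0
    a1 = np.nan_to_num(a1)
    a2 = np.nan_to_num(a2)
    plt.plot(i, a1 + a2, 'o')
plt.show()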
