Checking min distance between multiple slices of dataframe of co-ordinates - python

I have two lists of cell IDs:
A = [4, 6, 2493, 2494, 2495]
B = [3, 7, 4983, 4982, 4984, 4981, 4985, 2492, 2496]
Each cell in the lists above has X, Y coordinates stored in separate columns of a dataframe:
df
cell_ID  X   Y
1        5   6
2        10  6
...
The values in the A and B lists correspond to entries in the cell_ID column. How can I find the sum of distances between cells in A and B, primarily looking at cells in A in relation to B? So I have to calculate 5 (the length of A) distances for each cell in A, take the min() of those 5, and sum() all those nine min values. I hope that makes sense.
I was thinking of the following:
1. Take the first value from list A (the cell with id = 4), calculate the distance to every cell in B, and keep only the min value.
2. Repeat step 1 for all other values in A.
3. Take the sum() of all those min distances.
I tried with the code below... but failed
def sum_distances(df, i, col_X='X', col_Y='Y'):
    for i in range(A):
        return (((df.iloc[B][col_X] - df.iloc[i,2])**2 + (df.iloc[B][col_Y] - df.iloc[i,3])**2)**0.5).min
I don't know how to integrate min() and sum() at the same time.

If I'm not mistaken, you're looking for the Euclidean distance between (x,y) co-ordinates. Here is one possible approach (based on this SO post)
Generate some dummy data in the same format as the OP's:
import pandas as pd
import numpy as np
A = [0, 1, 2, 3, 4]
B = [10, 11, 12, 13, 14, 9, 8, 7, 6]
df = pd.DataFrame(np.random.rand(15,2), columns=['X','Y'], index=range(15))
df.index.name = 'CellID'
print('Raw data\n{}'.format(df))
Raw data
X Y
CellID
0 0.125591 0.890772
1 0.754238 0.644081
2 0.952322 0.099627
3 0.090804 0.809511
4 0.514346 0.041740
5 0.678598 0.230225
6 0.594182 0.432634
7 0.005777 0.891440
8 0.925187 0.045035
9 0.903591 0.238609
10 0.187591 0.255377
11 0.252635 0.149840
12 0.513432 0.972749
13 0.433606 0.550940
14 0.104991 0.440052
To get the minimum distance between each index of B and A
# Get df at indexes from list A: df_A
df_A = df.iloc[A,]
# For df at each index from list B (df.iloc[b,]), get distance to df_A: d
dist = []
for b in B:
    d = (pd.DataFrame(df_A.values - df.iloc[b,].values)**2).sum(1)**0.5
    dist.append(d.min())
print('Sum of minimum distances is {}'.format(sum(dist)))
Output (for sum of minimum distances between each index of B and A)
Sum of minimum distances is 2.36509386378
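For larger inputs, the loop can be replaced by a single distance-matrix call. This is only a minimal sketch, assuming SciPy is available (scipy.spatial.distance.cdist computes all pairwise Euclidean distances at once):
from scipy.spatial.distance import cdist

# all pairwise distances between B-cells (rows) and A-cells (columns)
dists = cdist(df.iloc[B][['X', 'Y']].to_numpy(), df.iloc[A][['X', 'Y']].to_numpy())
# minimum over A for each cell in B, then sum the nine minima
print('Sum of minimum distances is {}'.format(dists.min(axis=1).sum()))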


Speed up algorithm pandas

Before turning to the question, I have to explain the algorithm that I am using.
For this purpose, say I have a dataframe as follows:
import pandas as pd
import numpy as np

# initialize list of lists
data = [[2, [4], None], [4, [9, 18, 6], None], [6, [], 9], [7, [2], None], [9, [4], 7], [14, [18, 6], 3], [18, [7], 1]]
# Create a mock pandas DataFrame
df = pd.DataFrame(data, columns=['docdb', 'cited_docdb', 'frontier'])
Now I will define a distance measure which is 0 wherever the frontier variable is not NaN.
The algorithm basically updates the distance variable as follows:
1. Look for all docdb having a distance=0 within the variable cited_docdb (which is a list for each observation);
2. Assign a value of 0 to them within cited_docdb;
3. Assign a distance of 1 to all docdb having at least one 0 within their cited_docdb;
4. Repeat the process with distance=1, 2, 3, ..., max_cited_docdb (the maximum number of docdb cited).
The algorithm works as follows:
df = df.replace(' NaN', np.NaN)
df['distance'] = np.where(df['frontier'] > 0, 0, np.nan)
for k in range(max(max_cited_docdb)):
    s = df.set_index('docdb')['distance'].dropna()[df.set_index('docdb')['distance'].dropna() >= k]
    df['cited_docdb'] = [[s.get(i, i) for i in x] for x in df['cited_docdb']]
    m = [k in x for x in df['cited_docdb']]
    df.loc[m & df['distance'].isna(), 'distance'] = k + 1
Now, my problem is that my original database has 3 million observations, and the docdb that cites the most other docdb has 9500 values (i.e. the longest cited_docdb list has 9500 entries). Hence, the algorithm above is extremely slow. Is there a way to speed it up (e.g. by modifying the algorithm somehow, perhaps with dask?), or not?
Thanks a lot
It looks like a graph problem where you want to get the shortest distance between the nodes in docdb and a fixed terminal node (here NaN).
You can approach this with networkx.
Here is your graph:
import networkx as nx
G = nx.from_pandas_edgelist(df.explode('cited_docdb'),
                            source='docdb', target='cited_docdb',
                            create_using=nx.DiGraph)

# get shortest path length (number of nodes in the path minus the 2 endpoints)
d = {n: len(nx.shortest_path(G, n, np.nan)) - 2
     for n in df.loc[df['frontier'].isna(), 'docdb']}
# {2: 2, 4: 1, 7: 3}
# map values
df['distance'] = df['docdb'].map(d).fillna(0, downcast='infer')
output:
docdb cited_docdb frontier distance
0 2 [4] NaN 2
1 4 [9, 18, 6] NaN 1
2 6 [] 9.0 0
3 7 [2] NaN 3
4 9 [4] 7.0 0
5 14 [18, 6] 3.0 0
6 18 [7] 1.0 0
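If installing networkx is not an option, the same shortest-distance idea can be written as a plain multi-source BFS over the reversed citation graph, which runs in time linear in the number of citation links. This is only a sketch under the docdb / cited_docdb / frontier layout shown above; on the mock data it reproduces the distances {4: 1, 2: 2, 7: 3}, with 0 elsewhere.
from collections import deque
import pandas as pd

# distance 0 for every docdb whose frontier is not NaN
dist = {d: 0 for d, f in zip(df['docdb'], df['frontier']) if pd.notna(f)}

# reverse adjacency: cited docdb -> list of docdbs that cite it
cited_by = {}
for d, cited in zip(df['docdb'], df['cited_docdb']):
    for c in cited:
        cited_by.setdefault(c, []).append(d)

# breadth-first search starting from all distance-0 nodes at once
queue = deque(dist)
while queue:
    node = queue.popleft()
    for nxt in cited_by.get(node, []):
        if nxt not in dist:          # first visit is the shortest distance
            dist[nxt] = dist[node] + 1
            queue.append(nxt)

df['distance'] = df['docdb'].map(dist)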

How to union some records in a pandas data frame that has intersection

I have a data frame with two columns. Each row contains the start and end of a range, and the data frame is sorted.
I want to merge (union) every pair of ranges that intersect, until no pair of ranges intersects.
My solution uses a for loop to iterate over all rows and union them, but it is very slow. Can anyone suggest a faster way to do this?
Example
Input:
A   B
1   5
2   4
7   9
11  20
12  21
Output:
A   B
1   5
7   9
11  21
To create the data frame, use the code below:
import pandas as pd
a = [1, 2, 7, 11, 12]
b = [5, 4, 9, 20, 21]
df = pd.DataFrame({"A": a, "B": b})
Here is a solution that is faster than a for loop on large data frames. Suppose the data frame is named df:
df["merger"] = df.A.gt(df.B.shift(1)).astype("Int8").cumsum()
df = df.groupby(by=["merger"]).agg({"A": "min", "B": "max"}).reset_index(drop=True)
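If a range can be entirely contained in an earlier, wider one (for example [1, 50] followed by [2, 4]), comparing against only the previous B can start a new group too early. A sketch of that variant, using the running maximum of B instead:
# a new group starts only where A exceeds the largest B seen so far
df["merger"] = df["A"].gt(df["B"].cummax().shift(1)).cumsum()
df = df.groupby(by=["merger"]).agg({"A": "min", "B": "max"}).reset_index(drop=True)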

GroupBy aggregate function that computes two values at once

I have a dataframe like the following:
import pandas as pd
df = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 2],
    'B': [1, 2, 3, 4, 5, 6],
    'C': [4, 5, 6, 7, 8, 9],
})
Now I want to group and aggregate with two values being produced per group. The result should be similar to the following:
expected = df.groupby('A').agg([min, max])
# B C
# min max min max
# A
# 1 1 3 4 6
# 2 4 6 7 9
However, in my case, instead of two distinct functions min and max, I have one function that computes these two values at once:
def minmax(x):
"""This function promises to compute the min and max in one go."""
return min(x), max(x)
Now my question is, how can I use this one function to produce two aggregation values per group?
It's kind of related to this answer, but I couldn't figure out how to do it. The best I could come up with is a doubly-nested apply; however, this is not very elegant, and it also produces the multi-index on the rows rather than on the columns:
result = df.groupby('A').apply(
    lambda g: g.drop(columns='A').apply(
        lambda h: pd.Series(dict(zip(['min', 'max'], minmax(h))))
    )
)
# B C
# A
# 1 min 1 4
# max 3 6
# 2 min 4 7
# max 6 9
If you are stuck with a function that returns a tuple of values, I'd:
Define a new function that wraps the tuple values into a dict such that you predefine the dict.keys() to align with what you want the column names to be.
Use a careful for loop that doesn't waste time and space.
Wrap Function
# Given Function
def minmax(x):
    """This function promises to compute the min and max in one go."""
    return min(x), max(x)

# wrapped function
def minmax_dict(x):
    return dict(zip(['min', 'max'], minmax(x)))
Careful for loop
I'm aiming to pass this dictionary into the pd.DataFrame constructor. That means, I want tuples of the MultiIndex column elements in the keys. I want the values to be dictionaries with keys being the index elements.
dat = {}
for a, d in df.set_index('A').groupby('A'):
    for cn, c in d.items():  # use iteritems() on older pandas versions
        for k, v in minmax_dict(c).items():
            dat.setdefault((cn, k), {})[a] = v

pd.DataFrame(dat).rename_axis('A')
B C
min max min max
A
1 1 3 4 6
2 4 6 7 9
Added Detail
Take a look at the crafted dictionary:
dat
{('B', 'min'): {1: 1, 2: 4},
('B', 'max'): {1: 3, 2: 6},
('C', 'min'): {1: 4, 2: 7},
('C', 'max'): {1: 6, 2: 9}}
One other solution:
pd.concat({k: d.agg(minmax).set_axis(['min', 'max'])
           for k, d in df.drop('A', axis=1).groupby(df['A'])
           })
Output:
B C
1 min 1 4
max 3 6
2 min 4 7
max 6 9
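If the layout from the question (min/max on the columns rather than on the rows) is needed, unstacking the inner index level of that result gets there. A small sketch; note that the min/max columns may come back in sorted order:
out = pd.concat({k: d.agg(minmax).set_axis(['min', 'max'])
                 for k, d in df.drop('A', axis=1).groupby(df['A'])})
out.unstack(level=1).rename_axis('A')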

numpy vectorized operation for a large array

I am trying to do some computations on a NumPy array in Python 3.
The array:
c0 c1 c2 c3
r0 1 5 2 7
r1 3 9 4 6
r2 8 2 1 3
Here the "cx" and "rx" are column and row names.
For each row, I need to subtract the value at a given column index from every other element in that row; the element at the given index itself stays unchanged.
e.g.
given a column list [0, 2, 1] # they are column indices
which means that
for r0, we need to calculate the difference between the c0 and all other columns, so we have
[1, 5-1, 2-1, 7-1]
for r1, we need to calculate the difference between the c2 and all other columns, so we have
[3-4, 9-4, 4, 6-4]
for r2, we need to calculate the difference between the c1 and all other columns, so we have
[8-2, 2, 1-2, 3-2]
so, the result should be
1 4 1 6
-1 5 4 2
6 2 -1 1
Because the array could be very large, I would like to do the calculation by numpy vectorized operation, e.g. broadcasting.
But I am not sure how to do it efficiently.
I have checked Vectorizing operation on numpy array, Vectorizing a Numpy slice operation, Vectorize large NumPy multiplication, Replace For Loop with Numpy Vectorized Operation, Vectorize numpy array for loop.
But, none of them work for me.
Thanks for any help!
Extract the values from the array first and then do subtraction:
import numpy as np
a = np.array([[1, 5, 2, 7],
              [3, 9, 4, 6],
              [8, 2, 1, 3]])
cols = [0,2,1]
# create the index for advanced indexing
idx = np.arange(len(a)), cols
# extract values
vals = a[idx]
# subtract array by the values
a -= vals[:, None]
# add original values back to corresponding position
a[idx] += vals
print(a)
#[[ 1 4 1 6]
# [-1 5 4 2]
# [ 6 2 -1 1]]
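If mutating a in place is not acceptable (for example, the original array is still needed afterwards), the same result can be produced without modification, starting again from the original a and cols. A sketch:
rows = np.arange(len(a))
vals = a[rows, cols][:, None]                              # per-row pivot values, shape (3, 1)
mask = np.arange(a.shape[1]) == np.array(cols)[:, None]    # True at each row's pivot column
out = np.where(mask, a, a - vals)                          # keep pivots, subtract everywhere else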

How to apply a Pandas filter on a data frame based on entries in a column from a different data frame (no join)

As an example, I have one data frame (df_1) with one column which contains some text data. The second data frame (df_2) contains some numbers. How do I check if the text contains the numbers from the second data frame?
df_1
Note
0 The code to this is 1003
1 The code to this is 1004
df_2
Code_Number
0 1006
1 1003
So I want to check if the entries in [Note] from df_1 contains the entries from [Code_Number] from df_2
I have tried using the following code: df_1[df_1['Note'].str.contains(df_2['Code_Number'])] and I know I cannot use a join as I do not have a key to join on.
The final result which I am looking for after the filtering has been applied is:
Note
0 The code to this is 1003
Try this and see if it covers your use case: take the Cartesian product of both columns using itertools' product, and filter based on the condition:
from itertools import product

m = [left for left, right
     in product(df_1.Note, df_2.Code_Number)
     if str(right) in left]
pd.DataFrame(m, columns=['Note'])
Note
0 The code to this is 1003
Do this:
df_1.loc[df_1['Note'].apply(lambda x: any(str(number) in x for number in df_2['Code_Number']))]
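An equivalent vectorised variant (a sketch): build a single regex alternation from the codes and let str.contains do the matching, instead of looping over df_2 for every row:
pattern = '|'.join(df_2['Code_Number'].astype(str))   # '1006|1003'
df_1[df_1['Note'].str.contains(pattern)]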
First, create a column in df_1 holding the list of numbers present in each Note, and then compare that list column with the numbers from df_2 (both should be in list format).
#Extract Numbers from Notes
a_string = "0abcadda1 11 def 23 10007"
numbers = [int(word) for word in a_string.split() if word.isdigit()]
print(numbers)
# Finding common elements of both lists
L1 = [2,3,4]
L2 = [1,2]
[i for i in L1 if i in L2]
S1 = set(L1)
S2 = set(L2)
print(S1.intersection(S2))
#If you want to find out the common element
def common_data(list1, list2):
    result = False
    # traverse in the 1st list
    for x in list1:
        # traverse in the 2nd list
        for y in list2:
            # if one element is common, return immediately
            if x == y:
                result = True
                return result
    return result
# driver code
a = [1, 2, 3, 4, 5]
b = [5, 6, 7, 8, 9]
print(common_data(a, b))
a = [1, 2, 3, 4, 5]
b = [6, 7, 8, 9]
print(common_data(a, b))
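Pulling those pieces together into the filter the question actually asks for (a sketch, assuming the df_1 / df_2 layout above and whitespace-separated numbers inside Note):
codes = set(df_2['Code_Number'])
# keep rows whose Note contains at least one of the codes
has_code = df_1['Note'].apply(
    lambda s: bool(codes & {int(w) for w in s.split() if w.isdigit()})
)
df_1[has_code]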
