Is there a way to write fast nested loops in python?

I wrote a function that takes a set of multiple variables and matches it to a list of dataframes (each of which contains 5572 rows recording the presence or absence of those variables), then calculates a distance matrix.
In other words, I have a list of 25 presence/absence vectors (filled with 0s and ONE 1) and a list of 25 dataframes (also filled with 0s and 1s). Vectors and dataframes are in matching order, meaning veclist[0] matches a column in dflist[0] (called PAM in the function) and so on. The function has a nested loop, with the outer loop running over the dataframes in order and the inner loop running over the rows (it finds in which rows the variable occurs in the dataframe, for all 25 variables, and gives it a value).
To illustrate it better I recreated a short version of my data below.
The list of vectors looks like this:
import numpy as np
import pandas as pd
import math
A = np.random.randint(1, size=6).reshape(1, 6)
B = np.random.randint(1, size=11).reshape(1, 11)
A[:,2]=1
B[:,7]=1
vec1 = pd.DataFrame(A, columns=["a","b","c","d","e","f"])
vec2 = pd.DataFrame(B, columns=["a","b","c","d","e","f","g","h","i","j","k"])
veclist=[vec1,vec2]
print(veclist[0])
and the list of dataframes looks like this:
A2 = np.random.randint(2, size=600).reshape(100,6)
B2 = np.random.randint(2, size=1100).reshape(100,11)
df1 = pd.DataFrame(A2, columns=["a","b","c","d","e","f"])
df2 = pd.DataFrame(B2, columns=["a","b","c","d","e","f","g","h","i","j","k"])
dflist=[df1,df2]
print(dflist[0])
This is the code for the function I wrote.
def find_distance(veclist,dflist):
    ncol=len(dflist)
    nrow=dflist[0].shape[0]
    distance=np.zeros((nrow,ncol))
    for k in range(ncol):
        pres=np.where(veclist[k]==1) #getting matching column (where the vector matches the header)
        PAM=dflist[k]
        for m in range(nrow):
            if (veclist[k].iloc[pres]==1).bool() & (PAM.iloc[m,pres[1]]==1).bool():
                a = 2
            else:
                a = 0
            if (veclist[k].iloc[pres]==1).bool() & (PAM.iloc[m,pres[1]]==0).bool():
                b = 1
            else:
                b = 0
            if (veclist[k].iloc[pres]==0).bool() & (PAM.iloc[m,pres[1]]==1).bool():
                c = 1
            else:
                c = 0
            d = (2 * a)/(2 * a + b + c)
            d = math.sqrt(1-d)
            distance[m,k]=d
    return distance
This function works: it does everything I need it to do. However, it is very slow. With my real data, it is taking up to a minute to run the inner loop. I am more familiar with R, where this same function runs in seconds. So why is this function taking up to 25 minutes to run in python?
What did I do wrong?
I am guessing that the problem is the code needs to be more pythonic. I am slowly migrating from R to python, and still have some difficulties. For example, I don't know how to do away with the nested loop and use Numpy, since everything needs to be perfectly matched and saved. Any help with this problem would be much appreciated.
I am using python 3.7 and the spyder IDE.
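A possible vectorized rewrite, offered as a sketch rather than a definitive answer. It assumes, as the question states, that every vector holds exactly one 1 and every dataframe holds only 0s and 1s. Under that assumption the three branches collapse: where the dataframe has a 1 in the matched column, a=2 and b=c=0, so the distance is 0; where it has a 0, a=0 and b=1, so the distance is 1. The name find_distance_vectorized and the demo data are made up for illustration.

```python
import numpy as np
import pandas as pd

def find_distance_vectorized(veclist, dflist):
    """Sketch: same result as find_distance when each vector has exactly one 1."""
    nrow = dflist[0].shape[0]
    distance = np.zeros((nrow, len(dflist)))
    for k, (vec, pam) in enumerate(zip(veclist, dflist)):
        # Position of the single 1 in the presence/absence vector
        col = int(np.argmax(vec.to_numpy().ravel()))
        present = pam.to_numpy()[:, col] == 1
        # present: sqrt(1 - 4/4) = 0 ; absent: sqrt(1 - 0/1) = 1
        distance[:, k] = np.where(present, 0.0, 1.0)
    return distance

# Small demo data shaped like the example in the question
rng = np.random.default_rng(0)
vec1 = pd.DataFrame([[0, 0, 1, 0, 0, 0]], columns=list("abcdef"))
df1 = pd.DataFrame(rng.integers(0, 2, size=(100, 6)), columns=list("abcdef"))
dist = find_distance_vectorized([vec1], [df1])
print(dist.shape)
```

Only the loop over the 25 dataframes remains; the loop over the 5572 rows is replaced by one NumPy operation per dataframe, which is where most of the time in the original goes (each .iloc call inside the inner loop is comparatively expensive).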

Related

faster way to run a for loop for a very large dataframe list

I am using two nested for loops to calculate a value from combinations of elements in a list of dataframes. The list consists of a large number of dataframes, and the two for loops take a considerable amount of time.
Is there a way I can do the operation faster?
The functions I refer to with dummy names are the ones where I calculate the results.
My code looks like this:
conf_list = []
for tr in range(len(trajectories)):
    df_1 = trajectories[tr]
    if len(df_1) == 0:
        continue
    for tt in range(len(trajectories)):
        df_2 = trajectories[tt]
        if len(df_2) == 0:
            continue
        if df_1.equals(df_2) or df_1['time'].iloc[0] > df_2['time'].iloc[-1] or df_2['time'].iloc[0] > df_1['time'].iloc[-1]:
            continue
        df_temp = cartesian_product_basic(df_1,df_2)
        flg, df_temp = another_function(df_temp)
        if flg == 0:
            continue
        flg_h = some_other_function(df_temp)
        if flg_h == 1:
            conf_list.append(1)
My input list consists of around 5000 dataframes, each with several hundred rows, looking like this:
id x y z time
1 5 7 2 5
What I do is take the cartesian product of each pair of dataframes, and for each pair I calculate another value 'c'. If this value meets a condition, I add an element to conf_list so that I can get the final number of pairs meeting the requirement.
For further info:
cartesian_product_basic(df_1, df_2) is a function that returns the cartesian product of two dataframes.
another_function looks like this:
def another_function(df_temp):
    # nwh is not defined in the post; it is presumably an alias for np.where
    df_temp['z_dif'] = nwh((df_temp['time_x'] == df_temp['time_y'])
                           , abs(df_temp['z_x']- df_temp['z_y']) , np.nan)
    df_temp = df_temp.dropna()
    df_temp['vert_conf'] = nwh((df_temp['z_dif'] >= 1000)
                               , np.nan , 1)
    df_temp = df_temp.dropna()
    if len(df_temp) == 0:
        flg = 0
    else:
        flg = 1
    return flg, df_temp
and some_other_function looks like this:
def some_other_function(df_temp):
    df_temp['x_dif'] = df_temp['x_x']*df_temp['x_y']
    df_temp['y_dif'] = df_temp['y_x']*df_temp['y_y']
    df_temp['hor_dif'] = hypot(df_temp['x_dif'], df_temp['y_dif'])
    df_temp['conf'] = np.where((df_temp['hor_dif']<=5)
                               , 1 , np.nan)
    flg_h = 0  # default, so the function always has a value to return
    if df_temp['conf'].sum()>0:
        flg_h = 1
    return flg_h
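One way to avoid building the full cartesian product, sketched under assumptions: since another_function keeps only rows where time_x equals time_y, an inner merge on time should produce the same surviving rows directly. The function name has_conflict and its parameters are hypothetical; the limits 1000 and 5 come from the question. Note one loud assumption: the horizontal separation here is computed from coordinate differences, whereas the posted code multiplies the coordinates, which I read as a typo given the column names x_dif and y_dif.

```python
import numpy as np
import pandas as pd

def has_conflict(df_1, df_2, vert_limit=1000, hor_limit=5):
    """Sketch: True if the trajectories are within both limits at a shared time stamp."""
    # An inner join on 'time' yields exactly the rows that would survive the
    # time_x == time_y filter applied to the full cartesian product.
    pairs = df_1.merge(df_2, on="time", suffixes=("_x", "_y"))
    # Keep only rows with vertical separation below the limit
    pairs = pairs[(pairs["z_x"] - pairs["z_y"]).abs() < vert_limit]
    # ASSUMPTION: horizontal separation uses differences, not products
    hor = np.hypot(pairs["x_x"] - pairs["x_y"], pairs["y_x"] - pairs["y_y"])
    return bool((hor <= hor_limit).any())

a = pd.DataFrame({"id": 1, "x": [0, 0, 0], "y": [0, 0, 0], "z": [0, 0, 0], "time": [1, 2, 3]})
b = pd.DataFrame({"id": 2, "x": [3, 90, 90], "y": [4, 90, 90], "z": [500, 0, 0], "time": [2, 3, 4]})
c = pd.DataFrame({"id": 3, "x": [3, 3, 3], "y": [4, 4, 4], "z": [5000, 5000, 5000], "time": [1, 2, 3]})
print(has_conflict(a, b), has_conflict(a, c))
```

With several hundred rows per dataframe, the merge produces at most a few hundred rows per pair instead of tens of thousands. Because the check is symmetric in this sketch, iterating over itertools.combinations(trajectories, 2) instead of two full loops would also halve the number of pairs, if counting each pair once is acceptable.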

The following are ways to make your code run faster:
Instead of a for-loop, use a list comprehension.
Use built-in functions like map, filter, sum, etc.; this will make your code faster.
Avoid repeated attribute lookups (the '.' operator) inside hot loops, for example:
import datetime
A = datetime.datetime.now()  # don't use this: two attribute lookups on every call
from datetime import datetime
timenow = datetime.now  # bind the method to a name once
A = timenow()  # use this
Use c/c++ based operation libraries like numpy.
Don't convert datatypes unnecessarily.
In infinite loops, "while 1" was faster than "while True" only in Python 2; in Python 3 the two are equivalent.
Use built-in Libraries.
if the data would not change, convert it to a tuple
Concatenate strings with str.join rather than repeated + in a loop.
Use Multiple Assignments
Use Generators
When using if-else to check a Boolean value, avoid using the comparison operator.
# Instead of the approach below
if a==1:
    print('a is 1')
else:
    print('a is 0')
# Try this approach
if a:
    print('a is 1')
else:
    print('a is 0')
# This saves the time otherwise spent comparing the two values.
Useful references:
Speeding up Python Code: Fast Filtering and Slow Loops
Speed Up Python Code
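To check tips like these on your own machine rather than take them on faith, a small timing sketch can help. The three functions below are made-up examples that compute the same sum of squares with a plain loop, a list comprehension, and NumPy; the actual timings depend on your hardware and Python version, so none are quoted here.

```python
import timeit
import numpy as np

N = 100_000
values = list(range(N))
arr = np.arange(N, dtype=np.int64)  # int64 so the squares cannot overflow

def loop_sum(xs):
    total = 0
    for x in xs:
        total += x * x
    return total

def comprehension_sum(xs):
    return sum([x * x for x in xs])

def numpy_sum(a):
    return int((a * a).sum())

for name, fn, data in [
    ("for loop", loop_sum, values),
    ("list comprehension", comprehension_sum, values),
    ("numpy", numpy_sum, arr),
]:
    seconds = timeit.timeit(lambda: fn(data), number=20)
    print(f"{name:>18}: {seconds:.4f} s for 20 runs")
```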

Python list comparison numpy optimization

I basically have a dataframe (df1) with 7 columns. The values are always integers.
I have another dataframe (df2), which has 3 columns. One of these columns is a list of lists with a sequence of 7 integers. Example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(columns = ['A','B','C','D','E','F','G'],
                   data = np.random.randint(1,5,(100,7)))
df2 = pd.DataFrame(columns = ['Name','Location','Sequence'],
                   data = [['Alfred','Chicago',
                            np.random.randint(1,5,(100,7))],
                           ['Nicola','New York',
                            np.random.randint(1,5,(100,7))]])
I now want to compare the sequence of the rows in df1 with the 'Sequence' column in df2 and get a percentage of overlap. In a primitive for loop this would look like this:
df2['Overlap'] = 0.
for i in range(len(df2)):
    c = sum(el in list(df2.at[i, 'Sequence']) for el in df1.values.tolist())
    df2.at[i, 'Overlap'] = c/len(df1)
Now the problem is that my df2 has 500000 rows and my df1 usually around 50-100. This means that the task easily gets very time consuming. I know that there must be a way to optimize this with numpy, but I cannot figure it out. Can someone please help me?

By default pandas uses Cython as its engine, but you can also change the engine to Numba or use the njit decorator to speed things up. Look up the "Enhancing performance" (enhancingperf) section of the pandas documentation.
Numba converts Python code to optimized machine code; pandas is tightly integrated with NumPy and hence with Numba as well. You can experiment with the parallel, nogil, cache and fastmath options for extra speedup. This method shines for huge inputs where speed is needed.
With Numba you can either compile eagerly, or accept that the first execution takes a little time for compilation; subsequent calls will be fast.
import numba as nb
import numpy as np
import pandas as pd
df1 = pd.DataFrame(columns = ['A','B','C','D','E','F','G'],
                   data = np.random.randint(1,5,(100,7)))
df2 = pd.DataFrame(columns = ['Name','Location','Sequence'],
                   data = [['Alfred','Chicago',
                            np.random.randint(1,5,(100,7))],
                           ['Nicola','New York',
                            np.random.randint(1,5,(100,7))]])
a = df1.values
# Also possible to add `parallel=True`
f = nb.njit(lambda x: (x == a).mean())
# This is just illustration, not correct logic. Change the logic according to needs
# @nb.njit((nb.int64,))
# def f(x):
#     sum = 0
#     for i in nb.prange(x.shape[0]):
#         for j in range(a.shape[0]):
#             sum += (x[i] == a[j]).sum()
#     return sum
# Experiment with engine
print(df2['Sequence'].apply(f))

You can use direct comparison of the arrays and sum the identical values. Use apply to perform the comparison per row in df2:
df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
output:
0 0.270000
1 0.298571
To save the output in your original dataframe:
df2['Overlap'] = df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
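Note that the element-wise comparison above counts matching cells at the same position, while the loop in the question appears to ask how many whole rows of df1 occur anywhere in a sequence. If the latter is what is wanted, a broadcasting sketch like the following may serve; row_overlap is a hypothetical helper name, and the small arrays are made-up test data.

```python
import numpy as np

def row_overlap(seq, rows):
    """Fraction of `rows` that appear as a complete row somewhere in `seq`."""
    seq = np.asarray(seq)
    rows = np.asarray(rows)
    # equal[i, j] is True when rows[i] matches seq[j] in every position
    equal = (rows[:, None, :] == seq[None, :, :]).all(axis=2)
    return equal.any(axis=1).mean()

rows = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 1, 1]])
seq = np.array([[4, 5, 6], [1, 2, 3], [0, 0, 0]])
print(row_overlap(seq, rows))  # 0.5
```

Applied to the dataframes in the question, this would be something like df2['Sequence'].apply(lambda s: row_overlap(s, df1.to_numpy())). The intermediate boolean array has len(rows) * len(seq) * 7 entries per call, which stays small for the sizes described (50-100 rows each).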

Appending the each dataframe from a list of dataframe with another list of dataframes

I have 2 sets of split data frames from a big data frame. Say for example,
import pandas as pd, numpy as np
np.random.seed([3,1415])
ind1 = ['A_p','B_p','C_p','D_p','E_p','F_p','N_p','M_p','O_p','Q_p']
col1 = ['sap1','luf','tur','sul','sul2','bmw','aud']
df1 = pd.DataFrame(np.random.randint(10, size=(10, 7)), columns=col1,index=ind1)
ind2 = ['G_l','I_l','J_l','K_l','L_l','M_l','R_l','N_l']
col2 = ['sap1','luf','tur','sul','sul2','bmw','aud']
df2 = pd.DataFrame(np.random.randint(20, size=(8, 7)), columns=col2,index=ind2)
# Split the dataframes into two parts
pc_1,pc_2 = np.array_split(df1, 2)
lnc_1,lnc_2 = np.array_split(df2, 2)
And now, I need to concatenate each split data frame from df1 (pc_1, pc_2) with each data frame from df2 (lnc_1, lnc_2). Currently, I am doing it as follows:
# concatenate each split data frame pc1 with lnc1
pc1_lnc_1 =pd.concat([pc_1,lnc_1])
pc1_lnc_2 =pd.concat([pc_1,lnc_2])
pc2_lnc1 =pd.concat([pc_2,lnc_1])
pc2_lnc2 =pd.concat([pc_2,lnc_2])
On every concatenated data frame I need to run a correlation analysis function, for example,
correlation(pc1_lnc_1)
And I wanted to save the results separately, for example,
pc1_lnc1= correlation(pc1_lnc_1)
pc1_lnc2= correlation(pc1_lnc_2)
......
pc1_lnc1.to_csv(output,sep='\t')
The question is whether there is a way to automate the concatenation part above with some sort of loop, rather than coding it line by line for every concatenated data frame. Currently I am also running the correlation function separately on each one, and I have a pretty long list of split data frames.

You can loop over the split dataframes:
for pc in np.array_split(df1, 2):
    for lnc in np.array_split(df2, 2):
        print(correlation(pd.concat([pc,lnc])))

Here is another thought,
def correlation(data):
    # do some complex operation..
    return data
# {"pc_1" : split_1, "pc_2" : split_2}
pc = {f"pc_{i + 1}": v for i, v in enumerate(np.array_split(df1, 2))}
lc = {f"lc_{i + 1}": v for i, v in enumerate(np.array_split(df2, 2))}
for pc_k, pc_v in pc.items():
    for lc_k, lc_v in lc.items():
        # (pc_1, lc_1), (pc_1, lc_2) ..
        correlation(pd.concat([pc_v, lc_v])). \
            to_csv(f"{pc_k}_{lc_k}.csv", sep="\t", index=False)
# will create csv like pc_1_lc_1.csv, pc_1_lc_2.csv.. in the current working dir

If you don't have your individual dataframes in an array (and assuming you have a nontrivial number of dataframes), the easiest way (with minimal code modification) would be to throw an exec and an eval into a loop.
Something like
for counter in range(0, n):
    for counter2 in range(0, n):
        exec("pc{}_lnc{}=correlation(pd.concat([pc_{},lnc_{}]))".format(counter,counter2,counter,counter2))
        eval("pc{}_lnc{}.to_csv(filename,sep='\t')".format(counter,counter2))
The standard disclaimer around eval and exec does still apply (don't do it, because it's lazy programming practice and unsafe inputs could cause all kinds of problems in your code).
See here for more details about why eval is bad
edit Updating answer for updated question.
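As an alternative to generating variable names with exec, the results can be kept in a dictionary keyed by the pair of split numbers. The following is a minimal sketch: correlation here is a hypothetical stand-in (the poster's real function is not shown), split_rows is a made-up helper that splits by row position, and the data is random demo data with the same column names as in the question.

```python
from itertools import product

import numpy as np
import pandas as pd

def correlation(df):
    # Hypothetical stand-in for the poster's function: row-by-row Pearson correlation
    return df.T.corr()

def split_rows(df, n_parts):
    # Split by row position into n_parts nearly equal pieces
    return [df.iloc[idx] for idx in np.array_split(np.arange(len(df)), n_parts)]

rng = np.random.default_rng(0)
cols = ['sap1', 'luf', 'tur', 'sul', 'sul2', 'bmw', 'aud']
df1 = pd.DataFrame(rng.integers(0, 10, size=(10, 7)), columns=cols,
                   index=[f"p{i}" for i in range(10)])
df2 = pd.DataFrame(rng.integers(0, 20, size=(8, 7)), columns=cols,
                   index=[f"l{i}" for i in range(8)])

pc_parts = split_rows(df1, 2)
lnc_parts = split_rows(df2, 2)

# One entry per (pc split, lnc split) pair, numbered from 1
results = {
    (i, j): correlation(pd.concat([pc, lnc]))
    for (i, pc), (j, lnc) in product(enumerate(pc_parts, 1), enumerate(lnc_parts, 1))
}
for (i, j), res in results.items():
    print(f"pc{i}_lnc{j}", res.shape)
```

Each result can then be written out inside the same loop, for example with res.to_csv(f"pc{i}_lnc{j}.tsv", sep="\t"), and the number of splits can change without touching the rest of the code.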

New Dataframe column as a generic function of other rows (pandas)

What is the fastest (and most efficient) way to create a new column in a DataFrame that is a function of other rows in pandas ?
Consider the following example:
import pandas as pd
d = {
'id': [1, 2, 3, 4, 5, 6],
'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
}
pandas_df = pd.DataFrame(d)
Which yields:
id word
0 1 cat
1 2 hat
2 3 hag
3 4 hog
4 5 dog
5 6 elephant
Suppose I want to create a new column bar containing a value that is based on the output of using a function foo to compare the word in the current row to the other rows in the dataframe.
def foo(word1, word2):
    # do some calculation
    return foobar # in this example, the return type is numeric
threshold = some_threshold
for index, _id, word in pandas_df.itertuples():
    value = sum(
        pandas_df[pandas_df['word'] != word].apply(
            lambda x: foo(x['word'], word),
            axis=1
        ) < threshold
    )
    pandas_df.loc[index, 'bar'] = value
This does produce the correct output, but it uses itertuples() and apply(), which is not performant for large DataFrames.
Is there a way to vectorize (is that the correct term?) this approach? Or is there another better (faster) way to do this?
Notes / Updates:
In the original post, I used edit distance/levenshtein distance as the foo function. I have changed the question in an attempt to be more generic. The idea is that the function to be applied is to compare the current rows value against all other rows and return some aggregate value.
If foo was nltk.metrics.distance.edit_distance and the threshold was set to 2 (as in the original post), this produces the output below:
id word bar
0 1 cat 1.0
1 2 hat 2.0
2 3 hag 2.0
3 4 hog 2.0
4 5 dog 1.0
5 6 elephant 0.0
I have the same question for spark dataframes as well. I thought it made sense to split these into two posts so they are not too broad. However, I have generally found that solutions to similar pandas problems can sometimes be modified to work for spark.
Inspired by this answer to my spark version of this question, I tried to use a cartesian product in pandas. My speed tests indicate that this is slightly faster (though I suspect that may vary with the size of the data). Unfortunately, I still can't get around calling apply().
Example code:
import numpy as np
from nltk.metrics.distance import edit_distance as edit_dist
pandas_df2 = pd.DataFrame(d)
i, j = np.where(np.ones((len(pandas_df2), len(pandas_df2))))
cart = pandas_df2.iloc[i].reset_index(drop=True).join(
    pandas_df2.iloc[j].reset_index(drop=True), rsuffix='_r'
)
cart['dist'] = cart.apply(lambda x: edit_dist(x['word'], x['word_r']), axis=1)
pandas_df2 = (
    cart[cart['dist'] < 2].groupby(['id', 'word']).count()['dist'] - 1
).reset_index()

Let's try to analyze the problem for a second:
If you have N rows, then you have N*N "pairs" to consider in your similarity function. In the general case, there is no escape from evaluating all of them (sounds very rational, but I can't prove it). Hence, you have at least O(n^2) time complexity.
What you can try, however, is to play with the constant factors of that time complexity.
The possible options I found are:
1. Parallelization:
Since you have a large DataFrame, parallelizing the processing is the most obvious choice. It does not change the time complexity, but it gives you an (almost) linear speedup in the number of workers, so with 16 workers you gain an (almost) 16x improvement.
For example, we can partition the rows of the df into disjoint parts, and process each part individually, then combine the results.
A very basic parallel code might look like this:
from multiprocessing import cpu_count,Pool
import numpy as np
def work(part):
    """
    Args:
        part (DataFrame) : a part (collection of rows) of the whole DataFrame.
    Returns:
        DataFrame: the same part, with the desired property calculated and added as a new column
    """
    # Note that we are using the original df (pandas_df) as a global variable
    # But changes made in this function will not be global (a side effect of using multiprocessing).
    for index, _id, word in part.itertuples(): # iterate over the "part" tuples
        value = sum(
            pandas_df[pandas_df['word'] != word].apply( # Calculate the desired function using the whole original df
                lambda x: foo(x['word'], word),
                axis=1
            ) < threshold
        )
        part.loc[index, 'bar'] = value
    return part
# New code starts here ...
cores = cpu_count() #Number of CPU cores on your system
data_split = np.array_split(pandas_df, cores) # Split the DataFrame into parts
pool = Pool(cores) # Create a new process pool
new_parts = pool.map(work , data_split) # apply the function `work` to each part, this will give you a list of the new parts
pool.close() # close the pool
pool.join()
new_df = pd.concat(new_parts) # Concatenate the new parts
Note: I've tried to keep the code as close to OP's code as possible. This is just a basic demonstration code and a lot of better alternatives exist.
2. "Low level" optimizations:
Another solution is to try to optimize the similarity function computation and iterating/mapping. I don't think this will gain you much speedup compared to the previous option or the next one.
3. Function-dependent pruning:
The last thing you can try are similarity-function-dependent improvements. This doesn't work in the general case, but will work very well if you can analyze the similarity function. For example:
Assuming you are using Levenshtein distance (LD), you can observe that the distance between any two strings is >= the difference between their lengths. i.e. LD(s1,s2) >= abs(len(s1)-len(s2)) .
You can use this observation to prune the possible similar pairs to consider for evaluation. So for each string with length l1, compare it only with strings having length l2 having abs(l1-l2) <= limit. (limit is the maximum accepted dis-similarity, 2 in your provided example).
Another observation is that LD(s1,s2) = LD(s2,s1). That cuts the number of pairs by a factor of 2.
This solution may actually get you down to O(n) time complexity (depends highly on the data).
Why? you may ask.
That's because if we had 10^9 rows, but on average only 10^3 rows with a "close" length to each row, then we need to evaluate the function for about 10^9 * 10^3 / 2 pairs, instead of 10^9 * 10^9 pairs. But that (again) depends on the data. This approach will be useless if (in this example) your strings all have length 3.
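The length-based pruning and the symmetry trick can be sketched as follows. This is only an illustration under assumptions: edit_distance is a plain dynamic-programming Levenshtein implementation written here so the example is self-contained (the question used nltk's), count_close_words is a made-up name, and the words are assumed to be distinct, as in the example data.

```python
from collections import defaultdict
from itertools import combinations

def edit_distance(s, t):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]

def count_close_words(words, threshold=2):
    """For each word, count the other words with edit distance < threshold."""
    by_len = defaultdict(list)
    for idx, w in enumerate(words):
        by_len[len(w)].append(idx)
    counts = [0] * len(words)

    def check(i, j):
        # Symmetry: one evaluation updates both words of the pair
        if edit_distance(words[i], words[j]) < threshold:
            counts[i] += 1
            counts[j] += 1

    for length, idxs in by_len.items():
        for i, j in combinations(idxs, 2):      # same-length pairs
            check(i, j)
        for delta in range(1, threshold):       # only lengths that can still be close
            for i in idxs:
                for j in by_len.get(length + delta, ()):
                    check(i, j)
    return counts

words = ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
print(count_close_words(words))  # [1, 2, 2, 2, 1, 0]
```

Here 'elephant' is never compared with anything, because no other word has length 7, 8 or 9; the expensive distance function runs only on the ten same-length pairs.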

Thoughts about preprocessing (groupby)
Because you are looking for edit distance less than 2, you can first group by the length of strings. If the difference of length between groups is greater than or equal to 2, you do not need to compare them. (This part is quite similar to section 3 of Qusai Alothman's answer.)
Thus, the first thing is to group by the length of the string.
df["length"] = df.word.str.len()
df.groupby("length")[["id", "word"]]
Then, you compute the edit distance between every two consecutive groups if the difference in length is less than 2. This does not directly relate to your question but I hope it would be helpful.
Potential vectorization (after groupby)
After that, you may also try to vectorize the computation by splitting each string into characters. Note that if the cost of splitting is greater than the benefit of vectorizing, you should not do this. Alternatively, when you are creating the data frame, just create one with characters rather than words.
We will use the answer in Pandas split dataframe column for every character to split a string into a list of characters.
# assuming we had grouped the df.
df_len_3 = pd.DataFrame({"word": ['cat', 'hat', 'hag', 'hog', 'dog']})
# turn it into chars
splitted = df_len_3.word.apply(lambda x: pd.Series(list(x)))
0 1 2
0 c a t
1 h a t
2 h a g
3 h o g
4 d o g
splitted.loc[0] == splitted # compare one word to all words
0 1 2
0 True True True -> comparing to itself is always all true.
1 False True True
2 False True False
3 False False False
4 False False False
splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1
0 1
1 2
2 2
3 2
4 1
dtype: int64
Explanation of splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1
For each row, lambda x: (x == splitted) compares each row to the whole df just like splitted.loc[0] == splitted above. It will generate a true/false table.
Then, we sum up the table horizontally with a .sum(axis=1) following (x == splitted).
Then, we want to find out which words are similar. Thus, we apply a ge function that checks the number of true is over a threshold. Here, we only allow difference to be 1, so it is set to be len(x)-1.
Finally, we will have to subtract the whole array by 1 because we compare each word with itself in operation. We will want to exclude self-comparison.
Note, this vectorization part only works for within-group similarity checking. You still need to check groups with different length with the edit distance approach, I suppose.
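The same within-group comparison can also be written with plain NumPy broadcasting instead of a row-wise apply. This is a sketch under the same assumption as above, namely that all words in the group have the same length; the variable names are made up for illustration.

```python
import numpy as np

words = ["cat", "hat", "hag", "hog", "dog"]      # one length group
chars = np.array([list(w) for w in words])       # shape (n_words, word_length)

# matches[i, j] = number of positions where word i and word j agree
matches = (chars[:, None, :] == chars[None, :, :]).sum(axis=2)

# Two words are "close" when they differ in at most one position
close = matches >= chars.shape[1] - 1

# Subtract 1 to drop the comparison of each word with itself
counts = close.sum(axis=1) - 1
print(counts.tolist())  # [1, 2, 2, 2, 1]
```

The intermediate boolean array has n_words * n_words * word_length entries, so for very large groups it may need to be processed in chunks.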

Using scipy pdist on pandas data frames (and) lists

I've run into an odd problem yet again.
Suppose I have the following dummy data frame (by way of demonstrating my problem):
import numpy as np
import pandas as pd
import string
# Test data frame
N = 3
col_ids = string.ascii_uppercase[:N]
df = pd.DataFrame(
    np.random.randn(5, 3*N),
    columns=['{}_{}'.format(letter, coord) for letter in col_ids for coord in list('xyz')])
df
This produces:
A_x A_y A_z B_x B_y B_z C_x C_y C_z
0 -1.339040 0.185817 0.083120 0.498545 -0.569518 0.580264 0.453234 1.336992 -0.346724
1 -0.938575 0.367866 1.084475 1.497117 0.349927 -0.726140 -0.870142 -0.371153 -0.881763
2 -0.346819 -1.689058 -0.475032 -0.625383 -0.890025 0.929955 0.683413 0.819212 0.102625
3 0.359540 -0.125700 -0.900680 -0.403000 2.655242 -0.607996 1.117012 -0.905600 0.671239
4 1.624630 -1.036742 0.538341 -0.682000 0.542178 -0.001380 -1.126426 0.756532 -0.701805
Now I would like to use scipy.spatial.distance.pdist on this pandas data frame. This turns out to be a rather non-trivial process. What pdist does is to compute the distance between m points using Euclidean distance (2-norm) as the distance metric between the points. The points are arranged as m n-dimensional row vectors in the matrix X (source).
So, there are a couple of things that one has to do to create a function that operates on a pandas data frame, such that the pdist function can be used. You will note that pdist is convenient when the number of points gets very large. I've tried making my own, which works for a one-row data-frame, but I cannot get it to work, ideally, on the whole data frame at once.
Here's my attempt:
from scipy.spatial.distance import pdist, squareform
import numpy as np
import pandas as pd
import string
def Euclidean_distance(df):
    EcDist = pd.DataFrame(index=df.index) # results container
    arr = df.values # Store data frame values into a numpy array
    tag_list = [num for elem in arr for num in elem] # flatten numpy array into single list
    tag_list_3D = list(zip(*[iter(tag_list)]*3)) # separate list into length = 3 sub-lists, that pdist() can work with
    EcDist = pdist(tag_list_3D) # the distance between m points using Euclidean distance (2-norm)
    return EcDist
First I begin by creating a results container in pandas form, to store the result in. Secondly I save the pandas data frame as a numpy array, in order to get it into list form in the next step. It has to be in list form because the pdist function only operates on lists. When saving the data frame into an array, it stores it as a list within a list. This has to be flattened, and the result is saved in the 'tag_list' variable. Thirdly, the tag_list is further reduced into sub-lists of length three, such that the x, y and z coordinates can be obtained for each point, which can then be used to find the Euclidean distance between all of these points (in this example there are three points: A, B and C, each being three dimensional).
As said, the function works if the data frame is a single row, but when using the function in the given example it calculates the Euclidean distance for 5x3 points, which yields a total of 105 distances. What I want it to do is to calculate the distances per row (so pdist should only work on a 1x3 vector at a time). Such that my final results, for this example, would look something like this:
dist_1 dist_2 dist_3
0 0.807271 0.142495 1.759969
1 0.180112 0.641855 0.257957
2 0.196950 1.334812 0.638719
3 0.145780 0.384268 0.577387
4 0.044030 0.735428 0.549897
(these are just dummy numbers to show the desired shape)
Hence how do I get my function to apply to the data frame in a row-wise fashion?
Or better yet, how can I get it to perform the function on the entire data frame at once, and then store the result in a new data frame?
Any help would be very appreciated. Thanks.

If I understand correctly, you have "groups" of points. In your example each group has three points, which you call A, B and C. A is represented by three columns A_x, A_y, A_z, and likewise for B and C.
What I suggest is that you restructure your "wide-form" data into a "long" form in which each row contains only one point. Each row then will have only three columns for the coordinates, and then you will add an additional column to represent which group a point is in. Here's an example:
>>> d = pandas.DataFrame(np.random.randn(12, 3), columns=["X", "Y", "Z"])
>>> d["Group"] = np.repeat([1, 2, 3, 4], 3)
>>> d
X Y Z Group
0 -0.280505 0.888417 -0.936790 1
1 0.823741 -0.428267 1.483763 1
2 -0.465326 0.005103 -1.107431 1
3 -1.009077 -1.618600 -0.443975 2
4 0.535634 0.562617 1.165269 2
5 1.544621 -0.858873 -0.349492 2
6 0.839795 0.720828 -0.973234 3
7 -2.273654 0.125304 0.469443 3
8 -0.179703 0.962098 -0.179542 3
9 -0.390777 -0.715896 -0.897837 4
10 -0.030338 0.746647 0.250173 4
11 -1.886581 0.643817 -2.658379 4
The three points with Group==1 correspond to A, B and C in your first row; the three points with Group==2 correspond to A, B, and C in your second row; etc.
With this structure, computing the pairwise distances by group using pdist becomes straightforward:
>>> d.groupby('Group')[["X", "Y", "Z"]].apply(lambda g: pandas.Series(distance.pdist(g), index=["D1", "D2", "D3"]))
D1 D2 D3
Group
1 2.968517 0.918435 2.926395
2 3.119856 2.665986 2.309370
3 3.482747 1.314357 2.346495
4 1.893904 2.680627 3.451939
It is possible to do a similar thing with your existing setup, but it will be more awkward. The problem with the way you set it up is that you have encoded critical information in a difficult-to-extract way. The information about which columns are X coordinates and which are Y or Z coordinates, as well as the information about which columns refer to point A versus B or C, in your setup, is encoded in the textual names of the columns. You as a human can see which columns are X values just by looking at them, but specifying that programmatically requires parsing the string names of the columns.
You can see this in how you made the column names with your '{}_{}'.format(letter, coord) business. This means that in order to get to use pdist on your data, you will have to do the reverse operation of parsing the column names as strings in order to decide which columns to compare. Needless to say, this will be awkward. On the other hand, if you put the data into "long" form, there is no such difficulty: the X coordinates of all points line up in one column, and likewise for Y and Z, and the information about which points are to be compared is also contained in one column (the "Group" column).
When you want to do large-scale operations on subsets of data, it's usually better to split out things into separate rows. This allows you to leverage the power of groupby, and is also usually what is expected by scipy tools.
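If reshaping to long form is not convenient, the wide layout can also be handled by reshaping the underlying array, as in the sketch below. It rests on one assumption stated loudly: the columns are ordered exactly as A_x, A_y, A_z, B_x, ... so that each row can be viewed as N points of 3 coordinates. The output column names dist_AB, dist_AC and dist_BC are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

N = 3
rng = np.random.default_rng(0)
cols = [f"{point}_{coord}" for point in "ABC" for coord in "xyz"]
df = pd.DataFrame(rng.standard_normal((5, 3 * N)), columns=cols)

# ASSUMPTION: column order is A_x, A_y, A_z, B_x, ... so each row reshapes to (N, 3)
points = df.to_numpy().reshape(len(df), N, 3)

# pdist on one (3, 3) block returns the distances in the order (A,B), (A,C), (B,C)
dists = np.array([pdist(row_points) for row_points in points])
result = pd.DataFrame(dists, index=df.index, columns=["dist_AB", "dist_AC", "dist_BC"])
print(result.shape)
```

This still calls pdist once per row, so it is a convenience rather than a full vectorization, but it avoids parsing the column names and keeps the result aligned with the original index.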
