Speed up algorithm pandas - python

Before turning to the question, I have to explain the algorithm that I am using.
For this purpose, say I have a dataframe as follows:
import pandas as pd

# initialize list of lists
data = [[2, [4], None], [4, [9,18,6], None], [6, [], 9], [7, [2], None], [9, [4], 7], [14, [18,6], 3], [18, [7], 1]]
# Create a mock pandas DataFrame
df = pd.DataFrame(data, columns=['docdb', 'cited_docdb', 'frontier'])
Now I will define a distance measure which is 0 wherever the frontier variable is not NaN.
The algorithm basically updates the distance variable as follows:
Look for all docdb having a distance=0 within the variable cited_docdb (which is a list for each observation);
Assign a value of 0 to them within cited_docdb;
Assign a distance of 1 to all docdb having at least a 0 within their cited_docdb
Repeat the process with distance=1,2,3,..., max_cited_docdb (the maximum number of docdb cited)
The algorithm works as follows:
import numpy as np

df = df.replace(' NaN', np.nan)
df['distance'] = np.where(df['frontier'] > 0, 0, np.nan)

max_cited_docdb = df['cited_docdb'].map(len).max()  # the maximum number of docdb cited
for k in range(max_cited_docdb):
    # docdb whose distance is already known and >= k
    s = df.set_index('docdb')['distance'].dropna()
    s = s[s >= k]
    # replace already-labelled docdb inside each cited_docdb list by their distance
    df['cited_docdb'] = [[s.get(i, i) for i in x] for x in df['cited_docdb']]
    m = np.array([k in x for x in df['cited_docdb']])
    df.loc[m & df['distance'].isna(), 'distance'] = k + 1
Now, my problem is that my original database has 3 million observations, and the docdb that cites the most other docdb has 9500 values (i.e. the longest cited_docdb list has 9500 values). Hence, the algorithm above is extremely slow. Is there a way to speed it up (e.g. by modifying the algorithm somehow, with dask?) or not?
Thanks a lot

It looks like a graph problem where you want to get the shortest distance between the nodes in docdb and a fixed terminal node (here NaN).
You can approach this with networkx.
Here is your graph:
import networkx as nx

G = nx.from_pandas_edgelist(df.explode('cited_docdb'),
                            source='docdb', target='cited_docdb',
                            create_using=nx.DiGraph)
# get shortest path length (minus 2 for the path endpoints)
d = {n: len(nx.shortest_path(G, n, np.nan)) - 2
     for n in df.loc[df['frontier'].isna(), 'docdb']}
# {2: 2, 4: 1, 7: 3}
# map values
df['distance'] = df['docdb'].map(d).fillna(0, downcast='infer')
output:
docdb cited_docdb frontier distance
0 2 [4] NaN 2
1 4 [9, 18, 6] NaN 1
2 6 [] 9.0 0
3 7 [2] NaN 3
4 9 [4] 7.0 0
5 14 [18, 6] 3.0 0
6 18 [7] 1.0 0
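If performance matters (the original question mentions 3 million rows), one shortest-path call per node can get expensive. A minimal sketch of an alternative, assuming the graph G and the dataframe built above: run a single BFS from the terminal NaN node on the reversed graph and read every distance off at once.

# single BFS from the NaN terminal on the reversed graph (a sketch, not part
# of the original answer); lengths counts edges, so subtract 1 to match the
# len(path) - 2 convention used above
lengths = nx.single_source_shortest_path_length(G.reverse(), np.nan)
d = {n: lengths[n] - 1
     for n in df.loc[df['frontier'].isna(), 'docdb'] if n in lengths}
df['distance'] = df['docdb'].map(d).fillna(0, downcast='infer')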

Pandas transform inconsistent behavior for list

I have a sample snippet that works as expected:
import pandas as pd
df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0,0,0,0]})
df['new'] = df.groupby(['label'])[['wave']].transform(tuple)
The result is:
label wave y new
0 a 1 0 (1,)
1 b 2 0 (2, 3)
2 b 3 0 (2, 3)
3 c 4 0 (4,)
It works analogously if, instead of tuple, I pass set, frozenset, or dict to transform, but if I pass list I get a completely unexpected result:
df['new'] = df.groupby(['label'])[['wave']].transform(list)
label wave y new
0 a 1 0 1
1 b 2 0 2
2 b 3 0 3
3 c 4 0 4
There is a workaround to get the expected result:
df['new'] = df.groupby(['label'])[['wave']].transform(tuple)['wave'].apply(list)
label wave y new
0 a 1 0 [1]
1 b 2 0 [2, 3]
2 b 3 0 [2, 3]
3 c 4 0 [4]
I thought about mutability/immutability (list/tuple) but for set/frozenset it is consistent.
The question is: why does it work this way?
I've come across a similar issue before. The underlying issue, I think, is that when the number of elements in the list matches the number of records in the group, it tries to unpack the list so that each element of the list maps to a record in the group.
For example, this will cause the list to unpack, as the len of the list matches the length of each group:
df.groupby(['label'])[['wave']].transform(lambda x: list(x))
wave
0 1
1 2
2 3
3 4
However, if the length of the list is not the same as the length of each group, you will get the desired behaviour:
df.groupby(['label'])[['wave']].transform(lambda x: list(x)+[0])
wave
0 [1, 0]
1 [2, 3, 0]
2 [2, 3, 0]
3 [4, 0]
I think this is a side effect of the list unpacking functionality.
I think that is a bug in pandas. Can you open a ticket on their github page please?
At first I thought it might be because list is just not handled correctly as an argument to .transform, but if I do:
def create_list(obj):
    print(type(obj))
    return obj.to_list()

df.groupby(['label'])[['wave']].transform(create_list)
I get the same unexpected result. If however the agg method is used, it works directly:
df.groupby(['label'])['wave'].agg(list)
Out[179]:
label
a [1]
b [2, 3]
c [4]
Name: wave, dtype: object
I can't imagine that this is intended behavior.
Btw, I also find the difference in behavior suspicious that shows up when you apply tuple to a grouped series versus a grouped dataframe. E.g. if transform is applied to a series instead of a DataFrame, the result is also not a series of tuples, but a series containing ints (remember that for [['wave']], which creates a one-column dataframe, transform(tuple) indeed returned tuples):
df.groupby(['label'])['wave'].transform(tuple)
Out[177]:
0 1
1 2
2 3
3 4
Name: wave, dtype: int64
If I do that again with agg instead of transform, it works for both ['wave'] and [['wave']].
I was using version 0.25.0 on an Ubuntu x86_64 system for my tests.
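For completeness, a short runnable check of that last remark (a sketch based on the agg(list) examples already in this thread, not part of the original post):

import pandas as pd

df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0, 0, 0, 0]})

# agg keeps the grouped values intact for both selections, unlike transform
print(df.groupby(['label'])['wave'].agg(list))    # Series of lists
print(df.groupby(['label'])[['wave']].agg(list))  # one-column DataFrame of lists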
Since DataFrames are mainly designed to handle 2D data, storing arrays instead of scalar values can run into caveats such as this one.
pd.DataFrame.transform was originally implemented on top of .agg:
# pandas/core/generic.py
@Appender(_shared_docs["transform"] % dict(axis="", **_shared_doc_kwargs))
def transform(self, func, *args, **kwargs):
    result = self.agg(func, *args, **kwargs)
    if is_scalar(result) or len(result) != len(self):
        raise ValueError("transforms cannot produce aggregated results")
    return result
However, transform always returns a DataFrame that must have the same length as self, which is essentially the input.
When you do an .agg function on the DataFrame, it works fine:
df.groupby('label')['wave'].agg(list)
label
a [1]
b [2, 3]
c [4]
Name: wave, dtype: object
The problem gets introduced when transform tries to return a Series with the same length.
In the process of transforming a groupby element, which is a slice from self, and then concatenating it again, lists get unpacked to the same length as the index, as @Allen mentioned.
However, when the lengths don't align, they don't get unpacked:
df.groupby(['label'])[['wave']].transform(lambda x: list(x) + [1])
wave
0 [1, 1]
1 [2, 3, 1]
2 [2, 3, 1]
3 [4, 1]
A workaround for this problem might be to avoid transform:
df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0,0,0,0]})
df = df.merge(df.groupby('label')['wave'].agg(list).rename('new'), on='label')
df
label wave y new
0 a 1 0 [1]
1 b 2 0 [2, 3]
2 b 3 0 [2, 3]
3 c 4 0 [4]
Another interesting workaround, which works for strings, is:
df = df.applymap(str) # Make them all strings... would be best to use on non-numeric data.
df.groupby(['label'])['wave'].transform(' '.join).str.split()
Output:
0 [1]
1 [2, 3]
2 [2, 3]
3 [4]
Name: wave, dtype: object
The suggested answers do not work on pandas 1.2.4 anymore. Here is a workaround:
df.groupby(['label'])[['wave']].transform(lambda x: [list(x) + [1]]*len(x))
The idea behind it is the same as explained in the other answers (e.g. @Allen's answer). The solution here is to wrap the list in another list and repeat it as many times as the group length, so that when pandas transform unwraps it, each row gets the inner list.
output:
wave
0 [1, 1]
1 [2, 3, 1]
2 [2, 3, 1]
3 [4, 1]

Nearest neighbor matching in Pandas

Given two DataFrames (t1, t2), both with a column 'x', how would I append a column to t1 with the ID of the t2 row whose 'x' value is nearest to the 'x' value in t1?
t1:
id x
1 1.49
2 2.35
t2:
id x
3 2.36
4 1.5
output:
id id2
1 4
2 3
I can do this by creating a new DataFrame, iterating on t1.groupby(), doing lookups on t2 and then merging, but this takes incredibly long given a 17-million-row t1 DataFrame.
Is there a better way to accomplish this? I've scoured the pandas docs regarding groupby, apply, transform, agg, etc., but an elegant solution has yet to present itself despite my thought that this would be a common problem.
Using merge_asof
df = pd.merge_asof(df1.sort_values('x'),
                   df2.sort_values('x'),
                   on='x',
                   direction='nearest',
                   suffixes=['', '_2'])
print(df)
Out[975]:
id x id_2
0 3 0.87 6
1 1 1.49 5
2 2 2.35 4
Method 2 reindex
df1['id2']=df2.set_index('x').reindex(df1.x,method='nearest').values
df1
id x id2
0 1 1.49 4
1 2 2.35 3
Convert t1 and t2 to lists, sort them, and then match the ids with the zip() function:
list1 = t1.values.tolist()
list2 = t2.values.tolist()
list1.sort()  # ascending or descending, you decide
list2.sort()
list3 = list(zip(list1, list2))
print(list3)
# after that you should see output like (1, 4), (2, 3)
You can calculate a new array with the distance from each element in t1 to each element in t2, and then take the argmin along the rows to get the right index. This has the advantage that you can choose whatever distance function you like, and it does not require the dataframes to be of equal length.
It creates one intermediate array of size len(t1) * len(t2). Using a pandas builtin might be more memory-efficient, but this should be as fast as you can get as everything is done on the C side of numpy. You could always do this method in batches if memory is a problem.
import numpy as np
import pandas as pd
t1 = pd.DataFrame({"id": [1, 2], "x": np.array([1.49, 2.35])})
t2 = pd.DataFrame({"id": [3, 4], "x": np.array([2.36, 1.5])})
Now comes the part doing the actual work. The .to_numpy() calls are important, since otherwise pandas tries to align on the indices. The first line uses broadcasting to create horizontal and vertical "repetitions" in a memory-efficient way.
dist = np.abs(t1["x"].to_numpy()[:, np.newaxis] - t2["x"].to_numpy()[np.newaxis, :])
closest_idx = np.argmin(dist, axis=1)          # nearest t2 row for each t1 row
closest_id = t2["id"].to_numpy()[closest_idx]
output = pd.DataFrame({"id1": t1["id"], "id2": closest_id})
print(output)
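The "do it in batches" remark above can look like the sketch below; nearest_ids is a hypothetical helper (not part of the original answer) and the chunk size is arbitrary. Processing t1 in chunks keeps the intermediate distance matrix at roughly chunk_size * len(t2) entries.

import numpy as np
import pandas as pd

def nearest_ids(t1, t2, chunk_size=100_000):
    # nearest t2 id for each t1 row, computed chunk by chunk over t1
    x2 = t2["x"].to_numpy()
    id2 = t2["id"].to_numpy()
    out = []
    for start in range(0, len(t1), chunk_size):
        chunk = t1["x"].to_numpy()[start:start + chunk_size]
        dist = np.abs(chunk[:, np.newaxis] - x2[np.newaxis, :])
        out.append(id2[np.argmin(dist, axis=1)])
    return np.concatenate(out)

t1 = pd.DataFrame({"id": [1, 2], "x": [1.49, 2.35]})
t2 = pd.DataFrame({"id": [3, 4], "x": [2.36, 1.5]})
print(pd.DataFrame({"id1": t1["id"], "id2": nearest_ids(t1, t2)}))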
Alternatively, you can round to 1 decimal place and merge on the rounded 'x':
t1 = {'id': [1, 2], 'x': [1.49, 2.35]}
t2 = {'id': [3, 4], 'x': [2.36, 1.5]}
df1 = pd.DataFrame(t1)
df2 = pd.DataFrame(t2)
df = df1.round(1).merge(df2.round(1), on='x', suffixes=('', '2')).drop(columns='x')
print(df)
id id2
0 1 4
1 2 3
Add .drop(columns='x') to remove the shared merge column 'x'.
Add suffixes=('', '2') to distinguish the two id columns.

Checking min distance between multiple slices of dataframe of co-ordinates

I have two lists of cell IDs:
A = [4, 6, 2493, 2494, 2495]
B = [3, 7, 4983, 4982, 4984, 4981, 4985, 2492, 2496]
Each cell from the lists above has X, Y coordinates in separate columns of a df:
df
cell_ID   X    Y
1         5    6
2        10    6
...
The values in the A and B lists are the ones in the cell_ID column. How can I find the sum of distances between cells in A and B, but primarily looking at cells in A in relation to B? So I have to calculate 5 (the length of A) distances for each cell, take the min() of those 5, and sum() all those nine min values. I hope that makes sense.
I was thinking the following:
Take the first value from list A (this is the cell with id = 4), calculate the distance to all cells in B, and keep only the min value
Repeat step 1 with all other values in A
Make a sum() of all distances
I tried with the code below... but failed
def sum_distances(df, i, col_X='X', col_Y='Y'):
    for i in A:
        return (((df.iloc[B][col_X] - df.iloc[i, 2])**2
                 + (df.iloc[B][col_Y] - df.iloc[i, 3])**2)**0.5).min()
I don't know how to integrate min() and sum() at the same time.
If I'm not mistaken, you're looking for the Euclidean distance between (x,y) co-ordinates. Here is one possible approach (based on this SO post)
Generate some dummy data in the same format as the OP's:
import pandas as pd
import numpy as np
A = [0, 1, 2, 3, 4]
B = [10, 11, 12, 13, 14, 9, 8, 7, 6]
df = pd.DataFrame(np.random.rand(15,2), columns=['X','Y'], index=range(15))
df.index.name = 'CellID'
print('Raw data\n{}'.format(df))
Raw data
X Y
CellID
0 0.125591 0.890772
1 0.754238 0.644081
2 0.952322 0.099627
3 0.090804 0.809511
4 0.514346 0.041740
5 0.678598 0.230225
6 0.594182 0.432634
7 0.005777 0.891440
8 0.925187 0.045035
9 0.903591 0.238609
10 0.187591 0.255377
11 0.252635 0.149840
12 0.513432 0.972749
13 0.433606 0.550940
14 0.104991 0.440052
To get the minimum distance between each index of B and A
# Get df at indexes from list A: df_A
df_A = df.iloc[A, ]
# For df at each index from list B (df.iloc[b, ]), get distance to df_A: d
dist = []
for b in B:
    d = (pd.DataFrame(df_A.values - df.iloc[b, ].values)**2).sum(1)**0.5
    dist.append(d.min())
print('Sum of minimum distances is {}'.format(sum(dist)))
Output (for sum of minimum distances between each index of B and A)
Sum of minimum distances is 2.36509386378
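If scipy is available (an assumption, it is not used in the original answer), the same sum can be computed without the Python loop by building the full B-by-A distance matrix with cdist and reducing it, using the df, A and B defined above:

from scipy.spatial.distance import cdist

# pairwise Euclidean distances: rows = cells in B, columns = cells in A
pairwise = cdist(df.iloc[B][['X', 'Y']], df.iloc[A][['X', 'Y']])
print('Sum of minimum distances is {}'.format(pairwise.min(axis=1).sum()))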

Is there a better/more efficient way to do this (vectorised)? Very slow performance with Pandas apply

So in R I'd use an optimized apply function for this, but I've read that pandas' apply function is an abstracted loop and is perhaps even slower than an explicit one, and it shows in the performance. On my machine, it took 30 minutes to process 60k rows.
So essentially I'm looking to calculate a moving average on a dataset with different groups, and there are a lot of these groups. So I essentially first have to subset the dataset on a row/cell-wise basis, and only then calculate the moving average.
So I'm trying to come up with a vectorised solution, but I can't seem to figure out how you'd go about subsetting the dataframe within a vectorised approach.
My current solution makes use of an apply function which is really easy to understand and maintain:
def SMA(row):
    Subset = df[(df['group'] == row['group']) & (df['t'] <= row['t'])].reset_index()
    Subset2 = Subset[len(Subset.index) - 2:len(Subset.index)]
    return Subset2['val'].mean()

df['SMA'] = df.apply(SMA, axis=1)
This is my expected output (which I am currently getting, just very, very slowly).
This is the dataframe; in this example I want the moving average over two time points, "t":
t group val moving average
1 A 1 NA
2 A 2 1.5
3 A 3 2.5
1 B 4 NA
2 B 5 4.5
3 B 6 5.5
1 C 7 NA
2 C 8 7.5
3 C 9 8.5
This kind of operation (splitting into groups) is handled by the .groupby method in pandas. If we take care to set the index to time, it also handles giving us the right output with a time index back.
Here's an example which does basically the same as your code:
import pandas

df = pandas.DataFrame(
    [[1, 'A', 1],
     [2, 'A', 2],
     [3, 'A', 3],
     [1, 'B', 4],
     [2, 'B', 5],
     [3, 'B', 6],
     [1, 'C', 7],
     [2, 'C', 8],
     [3, 'C', 9]],
    columns=['t', 'group', 'val'])
df = df.set_index('t')
moving_avg = df.groupby('group').rolling(2).mean()
moving_avg is now a new dataframe. Note that because I set the index to be t in the first part, it's handled correctly in the grouping and rolling averages:
val
group t
A 1 NaN
2 1.5
3 2.5
B 1 NaN
2 4.5
3 5.5
C 1 NaN
2 7.5
3 8.5
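If you want the rolling mean attached back to the original frame as a column (like the OP's df['SMA']) rather than as a separate grouped frame, one possible sketch keeps 't' as a regular column and uses transform, which aligns the result with the original row order:

import pandas as pd

df = pd.DataFrame(
    [[1, 'A', 1], [2, 'A', 2], [3, 'A', 3],
     [1, 'B', 4], [2, 'B', 5], [3, 'B', 6],
     [1, 'C', 7], [2, 'C', 8], [3, 'C', 9]],
    columns=['t', 'group', 'val'])

# group-wise rolling mean, aligned back to the original rows
df['SMA'] = df.groupby('group')['val'].transform(lambda s: s.rolling(2).mean())
print(df)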

pandas dataframe to features and labels

Here is a dataframe which I want to convert to features and label list/arrays.
The dataframe represents Fedex Ground Shipping rates for weight and zone Ids (columns of the dataframe).
The features need to be like below
[weight,zone]
e.g. [[1,2],[1,3] ...[1,25],[2,2],[2,3] ...[2,25]....[8,25]]
And the labels corresponding to them are basically the shipping charges so,
[[shipping charge]]
e.g. [[8.95],[9.44] .....[35.18]]
I am using the following code, but I am sure there has to be a faster, more optimized, and perhaps more direct way to achieve this, either with the dataframe or numpy:
i = 0
j = 0
for weight in df_ground.Weight:
    for column in column_list[1:]:  # skipping the weight column!
        features[j] = [df_ground.Weight[i], column]
        labels[j] = df_ground[column][df_ground['Weight'] == df_ground.Weight[i]]
        j += 1
    i += 1
For a dataframe of size 2700 this code takes between 1 and 2 seconds. I am asking for suggestions on a more optimized way.
First, make 'Weight' the index and stack the remaining columns:
mixed = df_ground.set_index('Weight').stack()
#Weight
#1 2 8.95
# 3 9.44
# 4 9.89
#....
#2 2 9.24
# 3 9.92
# 4 10.41
Now, your new index is your features and the data column is your labels:
features = [list(x) for x in mixed.index]
#[[1, 2], [1, 3], [1, 4], ..., [2, 2], [2, 3], [2, 4], ...]
labels = [[x] for x in mixed.values]
#[[8.95],[9.44],[9.89],[9.24],[9.92],[10.41]])
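If numpy arrays are acceptable instead of Python lists (an assumption, since many ML libraries take arrays directly), the same features and labels can be read straight off the stacked Series:

import numpy as np

# note: if the zone columns are labelled with strings, cast them to int first
features = np.array(mixed.index.tolist())   # shape (n, 2): [weight, zone]
labels = mixed.to_numpy().reshape(-1, 1)    # shape (n, 1): [shipping charge]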
