Memory efficient looping in dataframe - python

I have the following dataframe
product_id    weight_in_g
1             50
2             120
3             130
4             200
5             42
6             90
I am trying to match products based on weights within a deviation of 50 using this loop
list1 = []
for row in df[['product_id', 'weight_in_g']].itertuples():
    high = row[1] + 50
    low = row[1] - 50
    id = df['product_id'].loc[((df['weight_in_g'] >= low) & (df['weight_in_g'] <= high)) | (df['weight_in_g'] == 0)]
    list1.append(id)
df['weight_matches'] = list1
del list1
Which gives me the output:
product_id    weight_in_g    weight_matches
1             50             1, 5, 6
2             120            2, 3, 6
3             130            3, 2, 6
4             200            4, 6
5             42             5, 6, 1
6             90             6, 1
I'm using this as a filter together with text embedding. For that reason I'm including all values with "0", which is about 35% of the dataset (I'd rather keep those values than fail to match 35% of my dataset).
This works with 10,000 and 20,000 rows, but if I go higher than that my notebook runs out of memory (13 GB RAM).
Is there any way I can make this more memory efficient?

You are using memory very inefficiently. The id variable is a pd.Series, which stores two things: the indexes and the values, plus some small overhead. pandas defaults to int64 for integers, so it takes 128 bits / 16 bytes to store a single product ID.
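For illustration, here is a small sketch (not from the original answer) showing that a boolean .loc selection keeps an int64 index next to the int64 values, so every kept ID costs roughly 16 bytes:
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000), dtype="int64")
subset = s.loc[s % 2 == 0]    # the filtered int64 index is kept alongside the int64 values
print(subset.memory_usage(deep=True) / len(subset))    # roughly 16 bytes per kept ID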
Unrelated to memory, you are also using loops, which are slow. You can have it run faster by using numpy broadcasting to take advantage of vectorization.
import numpy as np

# If your product ID can fit into int32, convert it to conserve memory
product_ids = df["product_id"].to_numpy(dtype="int32")

# Convert columns to numpy arrays
weight = df["weight_in_g"].to_numpy()
weight_high = weight + 50
weight_low = weight - 50

# Raise weight by one dimension to prepare for broadcasting
weight = weight[:, None]

# A naive implementation of numpy broadcasting will require n*n bytes for the
# `mask` array. When n = 50_000, the memory demand is *at least* 2.5GB, which my
# weak PC can't afford. Chunking is a trade-off: a bigger chunk size results in
# faster performance but demands more memory. Pick a size that best suits you.
chunk_size = 5000
weight_matches = []
for i in np.arange(0, len(weight), chunk_size):
    chunk = weight[i : i + chunk_size]
    mask = (chunk == 0) | ((weight_low <= chunk) & (chunk <= weight_high))
    weight_matches += [product_ids[np.nonzero(m)] for m in mask]

df["weight_matches"] = weight_matches
The above code took 8.7s to run 50k rows. I was able to tweak it to run on 100k rows. However, keep in mind that every solution has a limit. If your requirement exceeds that limit, you need to find a new solution.

Here is one way to do it
# do a self join
# query where difference is +/- 50 in weights
# groupby and accumulate the matching products as list
# finally reset and rename the column
(df.merge(df, how='cross', suffixes=(None, '_y'))
   .query('abs(weight_in_g - weight_in_g_y) <= 50')
   .groupby(['product_id', 'weight_in_g'])['product_id_y'].agg(list)
   .reset_index()
   .rename(columns={'product_id_y': 'weight_matches'})
)
product_id weight_in_g weight_matches
0 1 50 [1, 5, 6]
1 2 120 [2, 3, 6]
2 3 130 [2, 3, 6]
3 4 200 [4]
4 5 42 [1, 5, 6]
5 6 90 [1, 2, 3, 5, 6]
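The question also keeps every product whose weight is 0; as a rough sketch (reusing the same column names), that condition can be added to the query, though note the cross join itself still materialises n*n rows:
(df.merge(df, how='cross', suffixes=(None, '_y'))
   .query('abs(weight_in_g - weight_in_g_y) <= 50 or weight_in_g_y == 0')
   .groupby(['product_id', 'weight_in_g'])['product_id_y'].agg(list)
   .reset_index()
   .rename(columns={'product_id_y': 'weight_matches'})
)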

Related

making a lowest value list in python from three other lists

I have a large data set with 400 items_1 that are being compared with 400 other items_2, which gives a score based on how each item compares to the other, producing a weighted value_1. So I have three columns of about 160,000 rows each.
input example
input list

Item1    Item2    weight
342      352      1.815
228      352      2.024
310      109      2.045
170      339      2.062
374      224      2.087
25       109      2.104
184      224      2.127
120      361      2.128
348      109      2.138
253      11       2.146
317      224      2.148
120      109      2.15
43       171      2.177
241      224      2.18
310      361      2.181
203      224      2.183
Each position in the lists is correlated with the others, so the first values of Item1, Item2 and weight are all correlated values.
I am trying to build a list from this data set that will filter all this data and give me item 1 matched with item 2 along with its weight, taking the lowest values.
Item 1 ranges from 1 to 400 and item 2 ranges from 1 to 400, without repeating values, but the weighted values need to be minimized.
output example
output list

Item 1    Item 2    weight
1         32        1.2
2         2         1.3
3         10        1.1
4         5         1.2
5         1         1.4
6         8         1.6
7         3         2.1
8         15        1.8
9         45        1.7
10        18        1.6
So I would be using the input list as a reference to look up values when I match item 1 with item 2 and find the lowest weighted values for each of the 400 comparisons.
Any ideas on how I can get started or where to begin? I would appreciate any help on this.
I tried using Excel =INDEX(B2:B401,MATCH(MIN(C2:C401),C2:C401,0)) but this won't update based on my first column and sometimes gives me duplicate values. I was also thinking of using merge sort, so I found some code that will sort one array, but I don't think that's what I want.
I am familiar with Python and bash and can use either of those languages. I think I just need a little help in the right direction.
If I understood your goal correctly, you can forge the 3 dependent lists together and sort them with an anonymous lambda function as the key. For testing, I created 3 lists according to your example. For better understanding, the values at the same indices of items_1 and items_2 are summed and written into weights_1, also at the same index, meaning weights_1[x] = items_1[x] + items_2[x]:
items_1 = [1, 4, 8, 3, 2]
items_2 = [4, 2, 1, 10, 11]
weights_1 = [5, 6, 9, 13, 13]
The following function takes these 3 lists, forges them together into a list of lists via a simple list comprehension (using each iteration value and the current iteration number as indices into the other 2 lists) and sorts them by the "items_1" value at index 0 (key=lambda x: x[0]); you can change that to whatever value you want to sort the lists by.
def sort_lists(items1, items2, weights):
    combined_list = sorted([[items1[i], items2[i], weight] for i, weight in enumerate(weights)], key=lambda x: x[0])
    list_1 = [value[0] for value in combined_list]
    list_2 = [value[1] for value in combined_list]
    list_3 = [value[2] for value in combined_list]
    return list_1, list_2, list_3
To assign the results, you can just call the function something like this:
items_1, items_2, weights_1 = sort_lists(items_1, items_2, weights_1)
OUTPUT:
[1, 2, 3, 4, 8] <- sorted values while still being coherent
[4, 11, 10, 2, 1]
[5, 13, 13, 6, 9]
You can do the same with the values of items_2 by just changing the lambda x: x[0] to lambda x: x[1]; I think you get the idea.
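For instance, a minimal sketch of sorting by items_2 instead:
# same idea, but ordered by the items_2 value (index 1)
combined = sorted(zip(items_1, items_2, weights_1), key=lambda x: x[1])
items_1_s, items_2_s, weights_1_s = (list(t) for t in zip(*combined))
print(items_2_s)  # items_2 ascending, with items_1 and weights_1 kept aligned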
Note that this is definitely not the best way to write this function, especially in terms of reusability. It's just an example to show you how it CAN be done!

Faster way to sum all combinations of rows in dataframe

I have a dataframe of 10,000 rows that I am trying to sum all possible combinations of those rows. According to my math, that's about 50 million combinations. I'll give a small example to simplify what my data looks like:
df = Ratio Count Score
1 6 11
2 7 12
3 8 13
4 9 14
5 10 15
And here's the desired result:
results = Min Ratio Max Ratio Total Count Total Score
1 2 13 23
1 3 21 36
1 4 30 50
1 5 40 65
2 3 15 25
2 4 24 39
2 5 34 54
3 4 17 27
3 5 27 42
4 5 19 29
This is the code that I came up with to complete the calculation:
for i in range(len(df)):
    j = i + 1
    while j <= len(df):
        range_to_calc = df.iloc[i:j]
        total_count = range_to_calc['Count'].sum()
        total_score = range_to_calc['Score'].sum()
        new_row = {'Min Ratio': range_to_calc.at[range_to_calc.first_valid_index(), 'Ratio'],
                   'Max Ratio': range_to_calc.at[range_to_calc.last_valid_index(), 'Ratio'],
                   'Total Count': total_count,
                   'Total Score': total_score}
        results = results.append(new_row, ignore_index=True)
        j = j + 1
This code works, but according to my estimates after running it for a few minutes, it would take 200 hours to complete. I understand that using numpy would be a lot faster, but I can't wrap my head around how to build multiple arrays to add together. (I think it would be easy if I was doing just 1+2, 2+3, 3+4, etc., but it's a lot harder because I need 1+2, 1+2+3, 1+2+3+4, etc.) Is there a more efficient way to complete this calculation so it can run in a reasonable amount of time? Thank you!
P.S.: If you're wondering what I want to do with a 50 million-row dataframe, I don't actually need that in my final results. I'm ultimately looking to divide the Total Score of each row in the results by its Total Count to get a Total Score Per Total Count value, and then display the 1,000 highest Total Scores Per Total Count, along with each associated Min Ratio, Max Ratio, Total Count, and Total Score.
After these improvements it takes ~2 minutes to run for 10k rows.
For the sum computation, you can pre-compute the cumulative sum (cumsum) and save it. sum(i to j) is equal to sum(0 to j) - sum(0 to i-1).
Now sum(0 to j) is cumsum[j] and sum(0 to i - 1) is cumsum[i-1].
So sum(i to j) = cumsum[j] - cumsum[i - 1].
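For example, with the Count column from the sample data above:
import numpy as np

counts = np.array([6, 7, 8, 9, 10])
cumsum = np.cumsum(counts)               # [ 6 13 21 30 40]
# sum of rows 1..3 = cumsum[3] - cumsum[0] = 30 - 6 = 24
assert counts[1:4].sum() == cumsum[3] - cumsum[0]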
This gives significant improvement over computing sum each time for different combination.
Operations on numpy arrays are faster than operations on pandas Series, hence convert every column to a numpy array and then do the computation over it.
(From other answers): Instead of appending to a list, initialise an empty numpy array of size ((n*(n+1)//2) - n, 4) and use it to save the results.
Use:
import numpy as np
import pandas as pd

count_cumsum = np.cumsum(df.Count.values)
score_cumsum = np.cumsum(df.Score.values)
ratios = df.Ratio.values
n = len(df)
rowInCombination = (n * (n + 1) // 2) - n
arr = np.empty(shape=(rowInCombination, 4), dtype=int)
k = 0
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        arr[k, :] = ([
            count_cumsum[j] - count_cumsum[i-1] if i > 0 else count_cumsum[j],
            score_cumsum[j] - score_cumsum[i-1] if i > 0 else score_cumsum[j],
            ratios[i],
            ratios[j]])
        k = k + 1
out = pd.DataFrame(arr, columns=['Total_Count', 'Total_Score',
                                 'Min_Ratio', 'Max_Ratio'])
Input:
df = pd.DataFrame({'Ratio': [1, 2, 3, 4, 5],
                   'Count': [6, 7, 8, 9, 10],
                   'Score': [11, 12, 13, 14, 15]})
Output:
>>>out
Min_Ratio Max_Ratio Total_Count Total_Score
0 1 2 13 23
1 1 3 21 36
2 1 4 30 50
3 1 5 40 65
4 2 3 15 25
5 2 4 24 39
6 2 5 34 54
7 3 4 17 27
8 3 5 27 42
9 4 5 19 29
First of all, you can improve the algorithm. Then, you can speed up the computation using Numpy vectorization/broadcasting.
Here are the interesting points for improving the performance of the algorithm:
append of Pandas is slow because it recreates a new dataframe. You should never use it in a costly loop. Instead, you can append the lines to a Python list or even directly write the items into a pre-allocated Numpy vector.
computing partial sums takes O(n) time, while you can pre-compute the cumulative sums and then find any partial sum in constant time.
CPython loops are very slow, but the inner loop can be vectorized using Numpy thanks to broadcasting.
Here is the resulting code:
import numpy as np
import pandas as pd
def fastImpl(df):
    n = len(df)
    resRowCount = (n * (n+1)) // 2
    k = 0
    cumCounts = np.concatenate(([0], df['Count'].astype(int).cumsum()))
    cumScores = np.concatenate(([0], df['Score'].astype(int).cumsum()))
    ratios = df['Ratio'].astype(int)
    minRatio = np.empty(resRowCount, dtype=int)
    maxRatio = np.empty(resRowCount, dtype=int)
    count = np.empty(resRowCount, dtype=int)
    score = np.empty(resRowCount, dtype=int)
    for i in range(n):
        kStart, kEnd = k, k+(n-i)
        jStart, jEnd = i+1, n+1
        minRatio[kStart:kEnd] = ratios[i]
        maxRatio[kStart:kEnd] = ratios[i:n]
        count[kStart:kEnd] = cumCounts[jStart:jEnd] - cumCounts[i]
        score[kStart:kEnd] = cumScores[jStart:jEnd] - cumScores[i]
        k = kEnd
    assert k == resRowCount
    return pd.DataFrame({
        'Min Ratio': minRatio,
        'Max Ratio': maxRatio,
        'Total Count': count,
        'Total Score': score
    })
Note that this code gives the same results as the code in your question, but the original code does not give the expected results stated in the question. Note also that since the inputs are integers, I forced Numpy to use integers for the sake of performance (although the algorithm should work with floats too).
This code is hundreds of thousands of times faster than the original code on big dataframes, and it computes a dataframe of 10,000 rows in 0.7 seconds.
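For reference, a minimal way to call it on the question's sample data (a sketch, rebuilding the example df from the question) would be:
df = pd.DataFrame({'Ratio': [1, 2, 3, 4, 5],
                   'Count': [6, 7, 8, 9, 10],
                   'Score': [11, 12, 13, 14, 15]})
out = fastImpl(df)
print(out.head())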
Others have explained why your algorithm was so slow, so I will not dive into that.
Let's take a different approach to your problem. In particular, look at how the Total Count and Total Score columns are calculated:
Calculate the cumulative sum for every row from 1 to n
Calculate the cumulative sum for every row from 2 to n
...
Calculate the cumulative sum for every row from n to n
Since cumulative sums build on each other, we only need to calculate the cumsum once, for row 1 to row n:
The cumsum of (2 to n) is the cumsum of (1 to n) - (row 1)
The cumsum of (3 to n) is the cumsum of (2 to n) - (row 2)
And so on...
In other words, the current cumsum is the previous cumsum minus its first row, then dropping the first row.
As you have theorized, pandas is a lot slower than numpy, so we will convert everything into numpy for speed:
arr = df[['Ratio', 'Count', 'Score']].to_numpy() # Convert to numpy array
tmp = np.cumsum(arr[:, 1:3], axis=0) # calculate cumsum for row 1 to n
tmp = np.insert(tmp, 0, arr[0, 0], axis=1) # create the Min Ratio column
tmp = np.insert(tmp, 1, arr[:, 0], axis=1) # create the Max Ratio column
results2 = [tmp]
for i in range(1, len(arr)):
    tmp = results2[-1][1:].copy()  # current cumsum is the previous cumsum without the first row (copy so the previous result is not modified in place)
    diff = results2[-1][0]         # the previous cumsum's first row
    tmp -= diff                    # adjust the current cumsum
    tmp[:, 0] = arr[i, 0]          # new Min Ratio
    tmp[:, 1] = arr[i:, 0]         # new Max Ratio
    results2.append(tmp)
# Assemble the result
results2 = np.concatenate(results2).reshape(-1,4)
results2 = pd.DataFrame(results2, columns=['Min Ratio', 'Max Ratio', 'Total Count', 'Total Score'])
During my test, this produces the results for a 10k row data frame in about 2 seconds.
Sorry to write late on this topic, but I was just looking for a solution to a similar problem. The solution for this issue is simple because the combination is only in pairs. It can be solved by uploading the dataframe to any DB and executing the following query, whose duration is less than 10 seconds:
SELECT f1.*, f2.*, f1.score + f2.score
FROM table_with_data_source f1, table_with_data_source f2
WHERE f1.ratio <> f2.ratio;
The database will do it very fast even if there are 100,000 records or more.
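If you prefer to stay in Python, here is a rough sketch of the same idea with an in-memory SQLite database (the table and column names are assumptions, not from the original answer):
import sqlite3
import pandas as pd

df = pd.DataFrame({'ratio': [1, 2, 3, 4, 5], 'score': [11, 12, 13, 14, 15]})

con = sqlite3.connect(':memory:')
df.to_sql('table_with_data_source', con, index=False)

# self join, excluding a row paired with itself
pairs = pd.read_sql_query('''
    SELECT f1.ratio AS ratio_1, f2.ratio AS ratio_2, f1.score + f2.score AS total_score
    FROM table_with_data_source f1, table_with_data_source f2
    WHERE f1.ratio <> f2.ratio
''', con)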
However, none of the algorithms I saw in the answers actually performs a true combinatorial of values; they only do it in pairs. The problem really gets complicated when it's a true combinatorial, for example:
Given: a, b, c, d and e as records:
a
b
c
d
e
The real combination would be:
a+b
a+c
a+d
a+e
a+b+c
a+b+d
a+b+e
a+c+d
a+c+e
a+d+e
a+b+c+d
a+b+c+e
a+b+d+e
a+c+d+e
a+b+c+d+e
b+c
b+d
b+e
b+c+d
b+c+e
b+d+e
b+c+d+e
c+d
c+e
c+d+e
d+e
This is a true combinatorial, which covers all possible combinations. For this case I have not been able to find a suitable solution, since it really strains the performance of any hardware. Does anyone have an idea how to perform a true combinatorial using Python? Doing it at the database level affects the general performance of the database.
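For what it's worth, the full set of combinations of two or more records can be enumerated in Python with itertools, though the count grows as 2^n, so this is only feasible for small n; a minimal sketch:
from itertools import combinations

records = ['a', 'b', 'c', 'd', 'e']

# all combinations of 2 or more records
all_combos = [combo for r in range(2, len(records) + 1)
              for combo in combinations(records, r)]
print(len(all_combos))                        # 2**5 - 5 - 1 = 26
print(['+'.join(c) for c in all_combos[:5]])  # ['a+b', 'a+c', 'a+d', 'a+e', 'b+c']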

Is there a way to vectorize code that currently iterates over rows in a Pandas dataframe?

I have some code right now that works fine, but it's entirely too slow. I'm trying to add up the weighted sum of squares for every row in a Pandas dataframe. I'd like to vectorize the operations, since that seems to run much, much faster, but there's a wrinkle in the code that has defeated my attempts to vectorize.
totalDist = 0.0
for index, row in pU.iterrows():
    totalDist += (row['distance'][row['schoolChoice']]**2.0*float(row['students']))
The row has 'students' (an integer), 'distance' (a numpy array of length n), and 'schoolChoice' (an integer less than or equal to n-1 which designates which element of the distance array I'm using for the calculation). Basically, I'm pulling a row-specific value from the numpy array. I've used df.lookup, but that actually seems to be slower and is being deprecated. Any suggestions on how to make this run faster? Thanks in advance!
If all else fails you can use .apply() on each row
totalSum = df.apply(lambda row: row.distance[row.schoolChoice] ** 2 * row.students, axis=1).sum()
To go faster you can import numpy
totalSum = (numpy.stack(df.distance)[range(len(df.schoolChoice)), df.schoolChoice] ** 2 * df.students).sum()
The numpy method requires distance be the same length for each row - however it is possible to pad them to the same length if needed. (Though this may affect any gains made.)
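For example, a hypothetical padding step (the filler value and column access are assumptions) might look like:
import numpy as np

# pad each row's distance list with zeros up to the longest one
max_len = df['distance'].map(len).max()
padded = np.array([list(d) + [0] * (max_len - len(d)) for d in df['distance']])
# `padded` can then stand in for numpy.stack(df.distance) in the expression above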
Tested on a df of 150,000 rows like:
distance schoolChoice students
0 [1, 2, 3] 0 4
1 [4, 5, 6] 2 5
2 [7, 8, 9] 2 6
3 [1, 2, 3] 0 4
4 [4, 5, 6] 2 5
Timings:
method time
0 for loop 15.9s
1 df.apply 4.1s
2 numpy 0.7s

memory efficient way to create a column that indicates a unique combination of values from a set of columns

I want to find a more efficient way (in terms of peak memory usage and possibly time) to do the work of pandas' groupby.ngroup so that I don't run into memory issues when working with large datasets (I provide reasons for why this column is useful to me below). Take this example with a small dataset. I can accomplish this task easily using groupby.ngroup.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array(
    [[0, 1, 92],
     [0, 0, 39],
     [0, 0, 32],
     [1, 0, 44],
     [1, 1, 50],
     [0, 1, 11],
     [0, 0, 14]]), columns=['male', 'edu', 'wage'])
df['group_id'] = df.groupby(['male', 'edu']).ngroup()
df
male edu wage group_id
0 0 1 92 1
1 0 0 39 0
2 0 0 32 0
3 1 0 44 2
4 1 1 50 3
5 0 1 11 1
6 0 0 14 0
But when I start using larger datasets, the memory usage and computation time explode, and the memory usage of the groupby, as a ratio over the memory usage of the dataframe, increases almost three-fold for N=100,000,000 compared to N=100,000. See below.
from memory_profiler import memory_usage
import time

N_values = [10**k for k in range(4, 9)]
stats = pd.DataFrame(index=N_values, dtype=float, columns=['time', 'basemem', 'groupby_mem'])

for N in N_values:
    df = pd.DataFrame(
        np.hstack([np.random.randint(0, 2, (N, 2)), np.random.normal(5, 1, (N, 1))]),
        columns=['male', 'edu', 'wage']
    )

    def groupby_ngroup():
        df.groupby(['male', 'edu']).ngroup()

    def foo():
        pass

    basemem = max(memory_usage(proc=foo))
    tic = time.time()
    mem = max(memory_usage(proc=groupby_ngroup))
    toc = time.time() - tic

    stats.loc[N, 'basemem'] = basemem
    stats.loc[N, 'groupby_mem'] = mem
    stats.loc[N, 'time'] = toc

stats['mem_ratio'] = stats.eval('groupby_mem/basemem')
stats
time basemem groupby_mem mem_ratio
10000 0.037834 104.781250 105.359375 1.005517
100000 0.051785 108.187500 113.125000 1.045638
1000000 0.143642 128.156250 182.437500 1.423555
10000000 0.644650 334.148438 820.183594 2.454549
100000000 6.074531 2422.585938 7095.437500 2.928869
Why am I interested in this group identifier? Because I want to create columns that utilize pandas' groupby functions such as groupby.mean using the .map method as opposed to groupby.transform which takes a lot of memory and time. Furthermore, the .map approach can be used with dask dataframes as dask currently doesn't support .transform. With a column for "group_id" I can simply do means = df.groupby(['group_id'])['wage'].mean() and df['mean_wage'] = df['group_id'].map(means) to do the work of transform.
How about not using ngroup, and instead writing our own function to create group_id column?
Here is a code snippet that seems to give a slightly better performance:
from memory_profiler import memory_usage
import time
import pandas as pd
import numpy as np

N_values = [10**k for k in range(4, 9)]
stats = pd.DataFrame(index=N_values, dtype=float, columns=['time', 'basemem', 'groupby_mem'])

for N in N_values:
    df = pd.DataFrame(
        np.hstack([np.random.randint(0, 2, (N, 2)), np.random.normal(5, 1, (N, 1))]),
        columns=['male', 'edu', 'wage']
    )

    def groupby_ngroup():
        #df.groupby(['male', 'edu']).ngroup()
        df['group_id'] = 2*df.male + df.edu

    def foo():
        pass

    basemem = max(memory_usage(proc=foo))
    tic = time.time()
    mem = max(memory_usage(proc=groupby_ngroup))
    toc = time.time() - tic

    stats.loc[N, 'basemem'] = basemem
    stats.loc[N, 'groupby_mem'] = mem
    stats.loc[N, 'time'] = toc

stats['mem_ratio'] = stats.eval('groupby_mem/basemem')
stats
time basemem groupby_mem mem_ratio
10000 0.117921 2370.792969 79.761719 0.033643
100000 0.026921 84.265625 84.324219 1.000695
1000000 0.067960 130.101562 130.101562 1.000000
10000000 0.220024 308.378906 536.140625 1.738577
100000000 0.751135 2367.187500 3651.171875 1.542409
Essentially, we use the fact that the columns are numerical and treat them as binary numbers. The group_ids shall be the decimal equivalents.
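As a quick sanity check (assuming male and edu really are 0/1 columns), the arithmetic id reproduces the same labels ngroup() assigns in the question's example, since ngroup sorts the keys (0,0) < (0,1) < (1,0) < (1,1):
# 2*male + edu yields 0, 1, 2, 3 in the same order as ngroup()
assert (df.groupby(['male', 'edu']).ngroup() == 2*df['male'] + df['edu']).all()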
Scaling it for three columns gives a similar result. For that, replace the dataframe initialization to the following:
df = pd.DataFrame(
    np.hstack([np.random.randint(0, 2, (N, 3)), np.random.normal(5, 1, (N, 1))]),
    columns=['male', 'edu', 'random1', 'wage']
)
and group_id function to:
def groupby_ngroup():
    df['group_id'] = 4*df.male + 2*df.edu + df.random1
Following are the results of that test:
time basemem groupby_mem mem_ratio
10000 0.050006 78.906250 78.980469 1.000941
100000 0.033699 85.007812 86.339844 1.015670
1000000 0.066184 147.378906 147.378906 1.000000
10000000 0.322198 422.039062 691.179688 1.637715
100000000 1.233054 3167.921875 5183.183594 1.636146
Let us try using hash
list(map(hash,df.to_records().tolist()))
[4686582722376372986, 3632587615391525059, 2578593961740479157, -48845846747569345, 2044051356115000853, -583388452461625474, -1637380652526859201]
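Note that hashing the full records (including wage) gives every row its own id; a hedged variant that hashes only the key columns, so identical (male, edu) pairs share an id, could be:
# hash only the groupby columns so rows with the same (male, edu) get the same id
df['group_id'] = list(map(hash, df[['male', 'edu']].to_records(index=False).tolist()))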
For a groupby where the groupby variables have an unknown pattern, it seems that groupby.ngroup may be as good as it gets. But if your groupby variables are all categorical, e.g., taking values 0,1,2,3..., then we can take inspiration from the solution given by @saurjog.
To generate the group ID, we can build a numerical expression that evaluates a special sum of the groupby variables. Consider the following functions
def gen_groupby_numexpr(cols, numcats):
    txt = [cols[0]]
    k = numcats[0]
    for c, k_ in zip(cols[1:], numcats[1:]):
        txt.append('{}*{}'.format(k, c))
        k = k*k_
    return ' + '.join(txt)

def ngroup_cat(df, by, numcats):
    '''
    by : list
        the categorical (0,1,2,3...) groupby column names
    numcats : list
        the number of unique values for each column in "by"
    '''
    expr = gen_groupby_numexpr(by, numcats)
    return df.eval(expr)
The function gen_groupby_numexpr generates the numerical expression and ngroup_cat generates the group id for the groupby variables in by with unique value counts numcats. Thus, consider the following dataset that matches our use case. It contains 3 categorical variables we will use to form the groupby, two of which take values in {0,1} and one takes values in {0,1,2}.
df2 = pd.DataFrame(np.hstack([np.random.randint(0, 2, (100, 2)),
                              np.random.randint(0, 3, (100, 1)),
                              np.random.randint(0, 20, (100, 1))]),
                   columns=['male', 'mar', 'edu', 'wage'])
If we generate the numerical expression we get:
'male + 2*mar + 4*edu'
Putting this altogether, we can generate the group id with
df2['group_id'] = ngroup_cat(df2, ['male', 'mar', 'edu'], [2, 2, 3])
from which we get 2*2*3=12 unique group IDs:
df2[['male', 'mar', 'edu', 'group_id']].drop_duplicates().sort_values(['group_id'])
male mar edu group_id
1 0 0 0 0
13 1 0 0 1
8 0 1 0 2
10 1 1 0 3
4 0 0 1 4
12 1 0 1 5
2 0 1 1 6
6 1 1 1 7
7 0 0 2 8
5 1 0 2 9
44 0 1 2 10
0 1 1 2 11
When I benchmark the solution above against groupby.ngroup it runs nearly 3 times faster on a dataset of N=10,000,000 and uses significantly less additional memory.
Now we can estimate these groupby means and then map them back to the whole dataframe to do the work of transform. I computed some benchmarks with mixed results on whether using transform or groupby-then-map is faster and less memory intensive. If you are computing means for groups of many variables then I think the latter is more efficient. Further, the latter can also be done in dask, where transform is not yet supported.
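As a small sketch of that map-based "transform" (reusing the group_id column built above; column names follow the example):
# compute the group means once, then broadcast them back with map ...
means = df2.groupby('group_id')['wage'].mean()
df2['mean_wage'] = df2['group_id'].map(means)

# ... which should agree with groupby.transform, only with a different memory/time profile
assert np.allclose(df2['mean_wage'],
                   df2.groupby(['male', 'mar', 'edu'])['wage'].transform('mean'))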

Fast python algorithm (in numpy or pandas?) to find indices of array elements that match elements in another array

I am looking for a fast method to determine the cross-matching indices of two arrays, defined as follows.
I have two very large (>1e7 elements) structured arrays, one called members, and another called groups. Both arrays have a groupID column. The groupID entries of the groups array are unique, the groupID entries of the members array are not.
The groups array has a column called mass. The members array has a (currently empty) column called groupmass. I want to assign the correct groupmass to those elements of members with a groupID that matches one of the groups. This would be accomplished via:
members['groupmass'][idx_matched_members] = groups['mass'][idx_matched_groups]
So what I need is a fast routine to compute the two index arrays idx_matched_members and idx_matched_groups. This sort of task seems so common that it seems very likely that a package like numpy or pandas would have an optimized solution. Does anyone know of a solution, professionally developed, homebrewed, or otherwise?
This can be done with pandas using map to map the data from one column using the data of another. Here's an example with sample data:
import numpy as np
import pandas

members = pandas.DataFrame({
    'id': np.arange(10),
    'groupID': np.arange(10) % 3,
    'groupmass': np.zeros(10)
})

groups = pandas.DataFrame({
    'groupID': np.arange(3),
    'mass': np.random.randint(1, 10, 3)
})
This gives you this data:
>>> members
groupID groupmass id
0 0 0 0
1 1 0 1
2 2 0 2
3 0 0 3
4 1 0 4
5 2 0 5
6 0 0 6
7 1 0 7
8 2 0 8
9 0 0 9
>>> groups
groupID mass
0 0 3
1 1 7
2 2 4
Then:
>>> members['groupmass'] = members.groupID.map(groups.set_index('groupID').mass)
>>> members
groupID groupmass id
0 0 3 0
1 1 7 1
2 2 4 2
3 0 3 3
4 1 7 4
5 2 4 5
6 0 3 6
7 1 7 7
8 2 4 8
9 0 3 9
If you will often want to use the groupID as the index into groups, you can set it that way permanently so you won't have to use set_index every time you do this.
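For example (a small sketch, reusing the frames above):
groups = groups.set_index('groupID')           # set the index once
members['groupmass'] = members['groupID'].map(groups['mass'])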
Here's an example of setting the mass with just numpy. It does use iteration, so for large arrays it won't be fast.
For just 10 rows, this is much faster than the pandas equivalent. But as the data set becomes larger (eg. M=10000), pandas is much better. The setup time for pandas is larger, but the per row iteration time much lower.
Generate test arrays:
dt_members = np.dtype({'names':['groupID','groupmass'], 'formats': [int, float]})
dt_groups = np.dtype({'names':['groupID', 'mass'], 'formats': [int, float]})
N, M = 5, 10
members = np.zeros((M,), dtype=dt_members)
groups = np.zeros((N,), dtype=dt_groups)
members['groupID'] = np.random.randint(101, 101+N, M)
groups['groupID'] = np.arange(101, 101+N)
groups['mass'] = np.arange(1,N+1)
def getgroup(id):
    idx = id == groups['groupID']
    return groups[idx]

members['groupmass'][:] = [getgroup(id)['mass'] for id in members['groupID']]
In python2 the iteration could use map:
members['groupmass'] = map(lambda x: getgroup(x)['mass'], members['groupID'])
I can improve the speed by about 2x by minimizing the repeated subscripting, eg.
def setmass(members, groups):
    gmass = groups['mass']
    gid = groups['groupID']
    mass = [gmass[id == gid] for id in members['groupID']]
    members['groupmass'][:] = mass
But if groups['groupID'] can be mapped onto arange(N), then we can get a big jump in speed. By applying the same mapping to members['groupID'], it becomes a simple array indexing problem.
In my sample arrays, groups['groupID'] is just arange(N)+101. So the mapping just subtracts that minimum.
def setmass1(members, groups):
    members['groupmass'][:] = groups['mass'][members['groupID'] - groups['groupID'].min()]
This is 300x faster than my earlier code, and 8x better than the pandas solution (for 10000,500 arrays).
I suspect pandas does something like this. pgroups.set_index('groupID').mass is the mass Series, with an added .index attribute. (I could test this with a more general array)
In a more general case, it might help to sort groups, and if necessary, fill in some indexing gaps.
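A rough sketch of that more general case (my own assumption, not from the original answer), using np.searchsorted so the groupIDs no longer need to form a contiguous range:
def setmass_general(members, groups):
    # sort groups by groupID, then locate each member's groupID by binary search
    order = np.argsort(groups['groupID'])
    sorted_ids = groups['groupID'][order]
    pos = np.searchsorted(sorted_ids, members['groupID'])
    # assumes every members['groupID'] occurs somewhere in groups['groupID']
    members['groupmass'][:] = groups['mass'][order][pos]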
Here's a 'vectorized' solution - no iteration. But it has to calculate a very large matrix (length of groups by length of members), so does not gain much speed (np.where is the slowest step).
def setmass2(members, groups):
    idx = np.where(members['groupID'] == groups['groupID'][:, None])
    members['groupmass'][idx[1]] = groups['mass'][idx[0]]
