Smoothing / interpolating categorical data (fast) - python

I am currently working with an array containing categorical data.
Categories are organised like this: None, zoneA, zoneB.
My array holds sensor measurements: it tells me whether, at any given time, the sensor is in zoneA, zoneB, or not in a zone.
My goal here is to smooth those values.
For example, the sensor could be out of zoneA or zoneB for a period of 30 measures; if that happens, I want those measures to be "smoothed".
Example:
array[zoneA, zoneA, zoneA, None, None, zoneA, zoneA, None, None, None, zoneA]
should give
array[zoneA, zoneA, zoneA, zoneA, zoneA, zoneA, zoneA, None, None, None, zoneA]
with a threshold of 2.
Currently, I am iterating over the array, but that is too expensive and can take 1 to 2 minutes to compute. Is there an existing algorithm that solves this problem?
My current code:
def smooth(self, df: pd.DataFrame) -> pd.DataFrame:
    """
    Args:
        df (pd.DataFrame): dataframe with landlot column to smooth.

    Returns: dataframe smoothed
    """
    df_iter = df
    last = "None"
    last_index = 0
    for num, line in df_iter.iterrows():
        if (
            (line.landlot != "None")
            and (line.landlot == last)
            and (num - last_index <= self.delay)
            and (
                df_iter.iloc[(num - 1), df_iter.columns.get_loc("landlot")]
                == "None"
            )
        ):
            df_iter.iloc[
                last_index: (num + 1),  # noqa: E203
                df_iter.columns.get_loc("landlot"),
            ] = last
        if line.landlot != "None":
            last = line.landlot
            last_index = num
    return df_iter

Python implementation
I like to start these kinds of things clean and simple, so I just wrote a simple class that does exactly what is needed, without thinking too much about optimization. I call it Interpolator, as this looks like categorical interpolation to me.
class Interpolator:
    def __init__(self, data):
        self.data = data
        self.current_idx = 0
        self.current_nan_region_start = None
        self.result = None
        self.maxgap = 1

    def run(self, maxgap=2):
        # Initialization
        self.result = [None] * len(self.data)
        self.maxgap = maxgap
        self.current_nan_region_start = None
        prev_isnan = 0

        for idx, item in enumerate(self.data):
            isnan = item is None
            self.current_idx = idx
            if isnan:
                if prev_isnan:
                    # Result is already filled with empty data.
                    # Do nothing.
                    continue
                else:
                    self.entered_nan_region()
                    prev_isnan = 1
            else:  # not nan
                if prev_isnan:
                    self.exited_nan_region()
                    prev_isnan = 0
                else:
                    self.continuing_in_categorical_region()

    def entered_nan_region(self):
        self.current_nan_region_start = self.current_idx

    def continuing_in_categorical_region(self):
        self.result[self.current_idx] = self.data[self.current_idx]

    def exited_nan_region(self):
        nan_region_end = self.current_idx - 1
        nan_region_length = nan_region_end - self.current_nan_region_start + 1
        # Always copy the empty region endpoint even if gap is not filled
        self.result[self.current_idx] = self.data[self.current_idx]

        if nan_region_length > self.maxgap:
            # Do not interpolate as exceeding maxgap
            return
        if self.current_nan_region_start == 0:
            # Special case. data starts with "None"
            # -> Cannot interpolate
            return
        if self.data[self.current_nan_region_start - 1] != self.data[self.current_idx]:
            # Do not fill as both ends of missing data
            # region do not have same value
            return

        # Fill the gap
        for idx in range(self.current_nan_region_start, self.current_idx):
            self.result[idx] = self.data[self.current_idx]
def interpolate(data, maxgap=2):
    """
    Interpolate categorical variables over missing
    values (None's).

    Parameters
    ----------
    data: list of objects
        The data to interpolate. Holds
        categorical data, such as 'cat', 'dog'
        or 108. None is handled as missing data.
    maxgap: int
        The maximum gap to interpolate over.
        For example, with maxgap=2, ['car', None,
        None, 'car', None, None, None, 'car']
        would become ['car', 'car', 'car', 'car',
        None, None, None, 'car'].

    Note: Interpolation will only occur on missing
    data regions where both ends contain the same value.
    For example, [1, None, 2, None, 2] will become
    [1, None, 2, 2, 2].
    """
    interpolator = Interpolator(data)
    interpolator.run(maxgap=maxgap)
    return interpolator.result
This is how one would use it (code for get_data() below):
data = get_data(k=100)
interpolated_data = interpolate(data)
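As a quick sanity check (a minimal sketch reusing the example array from the question, with a threshold of 2), the interpolator reproduces the expected output:

data = ["zoneA", "zoneA", "zoneA", None, None, "zoneA", "zoneA", None, None, None, "zoneA"]
print(interpolate(data, maxgap=2))
# ['zoneA', 'zoneA', 'zoneA', 'zoneA', 'zoneA', 'zoneA', 'zoneA', None, None, None, 'zoneA']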
Copy-paste Cython implementation
Most probably the Python implementation is fast enough: with an array size of 1,000,000, processing the data takes about 0.504 seconds on my laptop. Anyway, creating a Cython version is fun and might give a small additional speed bonus.
Needed steps:
Copy-paste the Python implementation into a new file called fast_categorical_interpolate.pyx
Create setup.py in the same folder, with the following contents:
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize(
        "fast_categorical_interpolate.pyx",
        language_level="3",
    ),
)
Run python setup.py build_ext --inplace to build the Cython extension. You'll see something like fast_categorical_interpolate.cp38-win_amd64.pyd in the same folder.
Now, you may use the interpolator like this:
import fast_categorical_interpolate as fpi
data = get_data(k=100)
interpolated_data = fpi.interpolate(data)
Of course, there might be some optimizations that you could do in the Cython code to make this even faster, but on my machine the speed improvement was 38% out of the box with N=1,000,000 and 126% when N=10,000.
Timings on my machine
When N=100 (number of items in the list), the Python implementation is about 160x, and the Cython implementation about 250x, faster than smooth:
In [8]: timeit smooth(test_df, delay=2)
10.2 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: timeit interpolate(data)
64.8 µs ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: timeit fpi.interpolate(data)
41.3 µs ± 4.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
When N=10,000, the timing difference is about 190x (Python) to 302x (Cython):
In [5]: timeit smooth(test_df, delay=2)
1.08 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: timeit interpolate(data)
5.69 ms ± 852 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: timeit fpi.interpolate(data)
3.57 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
When N=1,000,000, the Python implementation is about 210x faster and the Cython implementation is about 287x faster:
In [9]: timeit smooth(test_df, delay=2)
1min 45s ± 24.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: timeit interpolate(data)
504 ms ± 67.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: timeit fpi.interpolate(data)
365 ms ± 38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Appendix
Test data creator get_data()
import random
random.seed(0)
def get_data(k=100):
    return random.choices(population=[None, "ZoneA", "ZoneB"], weights=[4, 3, 2], k=k)
Function and test data for testing smooth()
import pandas as pd
data = get_data(k=1000)
test_df = pd.DataFrame(dict(landlot=data)).fillna("None")
def smooth(df: pd.DataFrame, delay=2) -> pd.DataFrame:
    """
    Args:
        df (pd.DataFrame): dataframe with landlot column to smooth.

    Returns: dataframe smoothed
    """
    df_iter = df
    last = "None"
    last_index = 0
    for num, line in df_iter.iterrows():
        if (
            (line.landlot != "None")
            and (line.landlot == last)
            and (num - last_index <= delay)
            and (df_iter.iloc[(num - 1), df_iter.columns.get_loc("landlot")] == "None")
        ):
            df_iter.iloc[
                last_index: (num + 1),  # noqa: E203
                df_iter.columns.get_loc("landlot"),
            ] = last
        if line.landlot != "None":
            last = line.landlot
            last_index = num
    return df_iter
Note on the "current code"
I think there must be a copy-paste error somewhere, as the "current code" does not work at all. I replaced self.delay with a delay=2 keyword argument to indicate the max gap; I assume that is what it was supposed to be. Even with that, the logic did not work correctly on the simple example data you provided.
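If you want to run the interpolator against the original DataFrame directly, here is a minimal sketch (assuming, as above, that the landlot column stores missing values as the string "None"):

# Convert the "None" strings back to real None values, then smooth the column
values = [None if v == "None" else v for v in test_df["landlot"]]
test_df["landlot_smoothed"] = interpolate(values, maxgap=2)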

Related

How to determine indices in array between a set of values (in batch)

I was wondering if anyone has an idea on how to speed up the identification of which indices are between a set of values.
Let's say I have a 1d array of sorted values (~50k) and a large list (>100k) of min/max value pairs, and I want to determine which (if any) indices of the 1d array fall between each pair. I must also be able to do this many times, where the 1d array changes in size/shape.
My current approach is to use numpy and numba and list comprehension but unfortunately it doesn't really scale. It's okay if I try to look for ~1k values but when the number is much larger, it's too slow to be able to repeat it 1000s of times.
Current code:
import numpy as np
import numba
@numba.njit()
def find_between_batch(array: np.ndarray, min_value: np.ndarray, max_value: np.ndarray):
    """Find indices between specified boundaries for many items."""
    res = []
    for i in range(len(min_value)):
        res.append(np.where(np.logical_and(array >= min_value[i], array <= max_value[i]))[0])
    return res
Here is an example of the input:
x = np.linspace(0, 2000, 50000) # input 1d array
# these are the boundaries for which we should find the indices
mins = np.sort(np.random.choice(x, 10000)) - 0.01 # lower values to search for
maxs = mins + 0.02 # upper values to search for
And the current performance
# pre-compile
result = find_between_batch(x, mins, maxs)
%timeit -r 3 -n 10 find_between_batch(x, mins, maxs)
616 ms ± 4.11 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
And example output
result
[array([11]),
array([14]),
array([19]),
array([23]),
...
]
Does anyone have a suggestion on how to speed this up or if there is another approach that could give me the same results?
Thanks for the suggestion to use np.searchsorted - I've come up with a solution that is approx. 10-100x faster than my initial attempt.
@numba.njit()
def find_between_batch2(array: np.ndarray, min_value: np.ndarray, max_value: np.ndarray):
    """Find indices between specified boundaries for many items."""
    min_indices = np.searchsorted(array, min_value, side="left")
    max_indices = np.searchsorted(array, max_value, side="right")
    res = []
    for i in range(len(min_value)):
        # find_between is the single-pair helper from the original approach (not shown here)
        _array = array[min_indices[i]:max_indices[i]]
        res.append(min_indices[i] + find_between(_array, min_value[i], max_value[i]))
    return res
Original code:
%timeit -r 3 -n 10 find_between_batch(x, mins, maxs)
616 ms ± 4.11 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
Updated code:
%timeit -r 3 -n 10 find_between_batch2(x, mins, maxs)
6.36 ms ± 73.6 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
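Since the 1d array is sorted, the two np.searchsorted calls already bound each result as a contiguous index range, so the remaining call to the single-pair helper can be dropped entirely. A minimal sketch (find_between_batch3 is a hypothetical name, not from the original post):

import numpy as np

def find_between_batch3(array: np.ndarray, min_value: np.ndarray, max_value: np.ndarray):
    # For a sorted array, every index between the two searchsorted bounds matches,
    # so each result is just a contiguous range of indices.
    min_indices = np.searchsorted(array, min_value, side="left")
    max_indices = np.searchsorted(array, max_value, side="right")
    return [np.arange(lo, hi) for lo, hi in zip(min_indices, max_indices)]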

Faster way to find same integers in sequence in numpy array

Right now I am just looping through using np.nditer() and comparing to the previous element. Is there a (vectorised) approach which is faster?
An added bonus is that I don't always have to go to the end of the array; as soon as a sequence of max_len has been found, I am done searching.
import numpy as np
max_len = 3
streak = 0
prev = np.nan
a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])
for c in np.nditer(a):
    if c == prev:
        streak += 1
        if streak == max_len:
            print(c)
            break
    else:
        prev = c
        streak = 1
An alternative I thought about is using np.diff(), but this just shifts the problem: we are now looking for a sequence of zeroes in its result. I also doubt it would be faster, since it has to calculate the difference for every integer, whereas in practice the sequence will more often than not occur before reaching the end of the list.
I developed a numpy-only version that works, but after testing, I found that it performs quite poorly because it can't take advantage of short-circuiting. Since that's what you asked for, I describe it below. However, there is a much better approach using numba with a lightly modified version of your code. (Note that all of these return the index of the first match in a, rather than the value itself. I find that approach more flexible.)
@numba.jit(nopython=True)
def find_reps_numba(a, max_len):
    streak = 1
    val = a[0]
    for i in range(1, len(a)):
        if a[i] == val:
            streak += 1
            if streak >= max_len:
                return i - max_len + 1
        else:
            streak = 1
            val = a[i]
    return -1
This turns out to be ~100x faster than the pure Python version.
The numpy version uses the rolling window trick and the argmax trick. But again, this turns out to be far slower than even the pure Python version, by a substantial ~30x.
def rolling_window(a, window):
    a = numpy.ascontiguousarray(a)  # This approach requires a C-ordered array
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def find_reps_numpy(a, max_len):
    windows = rolling_window(a, max_len)
    return (windows == windows[:, 0:1]).sum(axis=1).argmax()
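As a quick check (a sketch using the example array from the question), both functions return the index of the first run of length max_len:

import numpy

a = numpy.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])
find_reps_numba(a, 3)  # -> 5, i.e. a[5:8] is [2, 2, 2]
find_reps_numpy(a, 3)  # -> 5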
I tested both of these against a non-jitted version of the first function. (I used Jupyter's %%timeit feature for testing.)
a = numpy.random.randint(0, 100, 1000000)
%%timeit
find_reps_numpy(a, 3)
28.6 ms ± 553 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
find_reps_orig(a, 3)
4.04 ms ± 40.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
find_reps_numba(a, 3)
8.29 µs ± 89.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Note that these numbers can vary dramatically depending on how deep into a the functions have to search. For a better estimate of expected performance, we can regenerate a new set of random numbers each time, but it's difficult to do so without including that step in the timing. So for comparison here, I include the time required to generate the random array without running anything else:
a = numpy.random.randint(0, 100, 1000000)
9.91 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_numpy(a, 3)
38.2 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_orig(a, 3)
13.7 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_numba(a, 3)
9.87 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, find_reps_numba is so fast that the variance in the time it takes to run numpy.random.randint(0, 100, 1000000) is much larger — hence the illusory speedup between the first and last tests.
So the big moral of the story is that numpy solutions aren't always best. Sometimes even pure Python is faster. In those cases, numba in nopython mode may be the best option by far.
You can use groupby from the itertools package.
import numpy as np
from itertools import groupby
max_len = 3
best = ()
a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])
for k, g in groupby(a):
    tup_g = tuple(g)
    if len(tup_g) >= max_len:  # compare the length of the run, not the tuple itself
        best = tup_g
        break
    if len(tup_g) > len(best):
        best = tup_g
best
# returns:
(2, 2, 2)
You could create sub-arrays of length max_length, moving one position to the right each time (like ngrams), and check if the sum of one sub_array divided by max_length is equal to the first element of that sub-array.
If that's True, then you have found your consecutive sequence of integers of length max_length.
def get_conseq(array, max_length):
    sub_arrays = zip(*[array[i:] for i in range(max_length)])
    for e in sub_arrays:
        if sum(e) / len(e) == e[0]:
            print("Found : {}".format(e))
            return e
    print("Nothing found")
    return []
For example, this array [1,2,2,3,4,5], with max_length = 2, will be 'split' like this:
[1,2]
[2,2]
[2,3]
[3,4]
[4,5]
On the second sub-array, (2, 2), the sum is 4; divided by max_length this gives 2, which matches the first element of that sub-array, so the function returns.
You can break if that's what you prefer to do, instead of returning like I do.
You could also add a few rules to capture edge cases, to make things clean (empty array, max_length superior to the length of the array, etc).
Here are a few example calls:
>>> get_conseq([1,2,3,4,5,6], 2)
Nothing found
>>> get_conseq([1,2,2,3,4,5,6], 3)
Nothing found
>>> get_conseq([1,2,3,3,3], 3)
Found : (3, 3, 3)
>>> get_conseq([1,2,2,3,3], 2)
Found : (2, 2)
Hope this helps !
Assuming you are looking for the element that appears at least max_len times consecutively, here's one NumPy-based way:
m = np.r_[True, a[:-1] != a[1:], True]
idx0 = np.flatnonzero(m)
m2 = np.diff(idx0) >= max_len
out = None  # None for the case where no such streak is found
if m2.any():
    out = a[idx0[m2.argmax()]]
Another, with binary erosion:
from scipy.ndimage.morphology import binary_erosion

m = np.r_[False, a[:-1] == a[1:]]
m2 = binary_erosion(m, np.ones(max_len - 1, dtype=bool))
out = None
if m2.any():
    out = a[m2.argmax()]
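As a quick check (a sketch), both snippets pick out the value 2 for the example array from the question:

import numpy as np
from scipy.ndimage.morphology import binary_erosion

a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])
max_len = 3

# Run-boundary / diff approach
m = np.r_[True, a[:-1] != a[1:], True]
idx0 = np.flatnonzero(m)
m2 = np.diff(idx0) >= max_len
print(a[idx0[m2.argmax()]] if m2.any() else None)  # -> 2

# Binary erosion approach
m = np.r_[False, a[:-1] == a[1:]]
m2 = binary_erosion(m, np.ones(max_len - 1, dtype=bool))
print(a[m2.argmax()] if m2.any() else None)  # -> 2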
Finally, for completeness, you can also look into numba. Your existing code would work as it is, with a direct-looping over a, i.e. for c in a:.

Python pandas return value from other column

I have a file "specieslist.txt" which contain the following information:
Bacillus,genus
Borrelia,genus
Burkholderia,genus
Campylobacter,genus
Now, I want Python to look for a variable in the first column (in this example "Campylobacter") and return the value of the second ("genus"). I wrote the following code:
import csv
import pandas as pd
species_import = 'Campylobacter'
df = pd.read_csv('specieslist.txt', header=None, names = ['species', 'level'] )
input = df.loc[df['species'] == species_import]
print (input['level'])
However, my code returns too much, while I only want "genus":
3 genus
Name: level, dtype: object
You can select the first value of a Series with iat:
species_import = 'Campylobacter'
out = df.loc[df['species'] == species_import, 'level'].iat[0]
#alternative
#out = df.loc[df['species'] == species_import, 'level'].values[0]
print (out)
genus
A better solution, which also works if no value is matched and an empty Series is returned; in that case it returns 'no match'.
(Per @jpp's comment: this solution is better only when you have a large Series and the matched value is expected to be near the top.)
species_import = 'Campylobacter'
out = next(iter(df.loc[df['species'] == species_import, 'level']), 'no match')
print (out)
genus
EDIT:
Idea from the comments, thanks @jpp:
def get_first_val(val):
    try:
        return df.loc[df['species'] == val, 'level'].iat[0]
    except IndexError:
        return 'no match'
print (get_first_val(species_import))
genus
print (get_first_val('aaa'))
no match
EDIT:
df = pd.DataFrame({'species':['a'] * 10000 + ['b'], 'level':np.arange(10001)})
def get_first_val(val):
    try:
        return df.loc[df['species'] == val, 'level'].iat[0]
    except IndexError:
        return 'no match'
In [232]: %timeit next(iter(df.loc[df['species'] == 'a', 'level']), 'no match')
1.3 ms ± 33.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [233]: %timeit (get_first_val('a'))
1.1 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [235]: %timeit (get_first_val('b'))
1.48 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [236]: %timeit next(iter(df.loc[df['species'] == 'b', 'level']), 'no match')
1.24 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Performance of various methods, to demonstrate when it is useful to use next(...).
n = 10**6
df = pd.DataFrame({'species': ['b']+['a']*n, 'level': np.arange(n+1)})
def get_first_val(val):
    try:
        return df.loc[df['species'] == val, 'level'].iat[0]
    except IndexError:
        return 'no match'
%timeit next(iter(df.loc[df['species'] == 'b', 'level']), 'no match') # 123 ms per loop
%timeit get_first_val('b') # 125 ms per loop
%timeit next(idx for idx, val in enumerate(df['species']) if val == 'b') # 20.3 µs per loop
get
With pandas.Series.get, you can return either a scalar value if the 'species' is unique or a pandas.Series if not unique.
f = df.set_index('species').level.get
f('Campylobacter')
'genus'
If not in the data, you can provide a default
f('X', 'Not In Data')
'Not In Data'
We could also use dict.get and only return scalars. If not unique, this will return the last one.
f = dict(zip(df.species, df.level)).get
If you want to return the first one, you can do that a few ways
f = dict(zip(df.species[::-1], df.level[::-1])).get
Or
f = df.drop_duplicates('species').pipe(
    lambda d: dict(zip(d.species, d.level)).get
)
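A quick usage sketch with the example file contents from the question (the DataFrame literal here just stands in for the parsed specieslist.txt):

import pandas as pd

df = pd.DataFrame({'species': ['Bacillus', 'Borrelia', 'Burkholderia', 'Campylobacter'],
                   'level': ['genus', 'genus', 'genus', 'genus']})
f = dict(zip(df.species, df.level)).get
f('Campylobacter')     # 'genus'
f('X', 'Not In Data')  # 'Not In Data'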
# Change the last line of your code to
print(input['level'].values)

# For an explanation, refer to the code below
import csv
import pandas as pd

species_import = 'Campylobacter'
df = pd.read_csv('specieslist.txt', header=None, names=['species', 'level'])

input = df['species'] == species_import  # returns a boolean pandas Series (a row mask)
print(type(df[input]))            # a pandas DataFrame
print(type(df[input]['level']))   # a pandas Series

# To obtain the value from this Series:
print(df[input]['level'].values)  # prints ['genus']

Do both lines return a string?

I am currently working through the beginner Codebat track. Both pieces of code work; however, is there anything fundamentally wrong or different between the two ways of writing the code below?
thanks,
def mine(myStr, x):
    myResult = myStr * x
    return myResult

def codebat(thierStr, i):
    codeResult = ''
    for i in range(i):
        codeResult += thierStr
    return codeResult
import string  # string.ascii_letters = 'abcde...ABCDE...'

def mine(s, x):
    return s * x  # fixed your code so it multiplies by x, not 4

def theirs(s, x):  # renamed but the same as codebat
    res = ''
    for _ in range(x):
        res += s
    return res
We can see they give the same results
mine(string.ascii_letters, 10) == theirs(string.ascii_letters, 10) # --> True
We can, however, test the time efficiency of these functions:
%timeit mine(string.ascii_letters, 1000)
2.27 µs ± 9.69 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit theirs(string.ascii_letters, 1000)
202 µs ± 4.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As you can see, mine is almost 100 times more efficient, because under the hood Python pre-allocates the memory needed for the new string. In theirs, it has to keep reallocating memory each time the string length increases.
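For comparison (not from the original answer, just a sketch): if you do need to build a string from many pieces, str.join is the usual idiom that also avoids repeated reallocation, although for plain repetition s * x stays the simplest and fastest option.

def joined(s, x):
    # Collect the pieces first, then allocate the final string once
    return ''.join([s] * x)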

Pandas: improve running time looping over string contains substring

I got a Pandas dataframe which contains a column with pretty long strings (let's say URL_paths) and a list of unique substrings (reference list). For every row in my dataframe, I want to determine the corresponding reference element in my list. Hence, if the URL in a given row is for example abcd1234, and one of the reference values is cd123, then I want to add cd123 as reference to my dataframe, to categorize this row/URL.
I got my code working (see the example below), but it's pretty slow due to a for loop (I guess) which I can't get rid of. I have the feeling that my code could be much faster, but I can't think of a way to improve it.
How can I improve running time?
See working example below:
import string
import secrets
import pandas as pd
import time
from random import randint
n_ref = 100
n_target = 1000000
## Build reference Series, and target dataframe
reference = pd.Series(
    ''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(randint(10, 19)))
    for _ in range(n_ref)
)

target = pd.Series(reference.sample(n=n_target, replace=True)).reset_index().iloc[:, 1]

dfTarget = pd.DataFrame({
    'target': target,
    'pre-string': pd.Series(
        ''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(randint(1, 10)))
        for _ in range(n_target)
    ),
    'post-string': pd.Series(
        ''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(randint(1, 10)))
        for _ in range(n_target)
    ),
    'reference': pd.Series(),
})

dfTarget['target_combined'] = dfTarget[['pre-string', 'target', 'post-string']].apply(lambda x: ''.join(x), axis=1)

## Fill in reference column
## Loop over references and return reference in reference column
start_time = time.time()
for x in reference:
    dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] = x
print("--- %s seconds ---" % (time.time() - start_time))
Out: 42.60... seconds
On my machine, I see a 17x improvement using pd.Series.apply:
reference_set = set(reference)

def calculator(x):
    return next((i for i in reference_set if i in x), None)

dfTarget['reference'] = dfTarget['target_combined'].apply(calculator)
But for optimal performance, see @unutbu's solution.
Here is a slightly (4.3 times) faster approach:
RegEx pattern:
In [23]: pat = '.*({}).*'.format(reference.str.cat(sep='|'))
In [24]: pat
Out[24]: '.*(J6BUVB2BRDLL3IR9S1J|ZOXS91UK513RR18YREI|92KWUFKOK4G9XJAHIBJ|PMEH6N96091AK9XCA5J|3CICA38SDIXLFVED74I|V48OJCY2DS|...|N0QZCN6NADW|W2CTEUISOHUY).*'
(output truncated here; the pattern is simply the alternation of all 100 reference strings)
Replacement:
dfTarget['reference'] = dfTarget['target_combined'].str.replace(pat, r'\1')
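Note that on newer pandas versions (1.4 and later) Series.str.replace defaults to regex=False, so the pattern has to be passed with an explicit flag:

# Equivalent call with the regex flag spelled out (needed on pandas >= 1.4)
dfTarget['reference'] = dfTarget['target_combined'].str.replace(pat, r'\1', regex=True)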
Timing against a 10,000-row DF:
In [25]: %%timeit
...: dfTarget['reference'] = dfTarget['target_combined'].str.replace(pat, r'\1')
...:
617 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [26]: %%timeit
...: [dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] for x in reference]
...:
1.96 s ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [27]: %%timeit
...: for x in reference:
...: dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] = x
...:
2.64 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [28]: 2.64/0.617
Out[28]: 4.278768233387359
In [29]: 2.64/1.96
Out[29]: 1.3469387755102042
