I am currently working with an array, containing categorical data.
Categories are organised like this: None, zoneA, zoneB.
My array holds sensor measurements: at each time step it tells me whether the sensor is in zoneA, in zoneB, or not in any zone.
My goal here is to smooth those values.
For example, the sensor could be out of zoneA or zoneB for a period of up to 30 measures; when that happens I want those measures to be "smoothed".
Example:
array[zoneA, zoneA, zoneA, None, None, zoneA, zoneA, None, None, None, zoneA]
should give
array[zoneA, zoneA, zoneA, zoneA, zoneA, zoneA, zoneA, None, None, None, zoneA]
with a threshold of 2.
Currently I iterate over the arrays, but this is too expensive and can take 1 to 2 minutes. Is there an existing algorithm for this problem?
My current code :
def smooth(self, df: pd.DataFrame) -> pd.DataFrame:
    """
    Args:
        df (pd.DataFrame): dataframe with landlot column to smooth.

    Returns:
        dataframe smoothed
    """
    df_iter = df
    last = "None"
    last_index = 0
    for num, line in df_iter.iterrows():
        if (
            (line.landlot != "None")
            and (line.landlot == last)
            and (num - last_index <= self.delay)
            and (
                df_iter.iloc[(num - 1), df_iter.columns.get_loc("landlot")]
                == "None"
            )
        ):
            df_iter.iloc[
                last_index : (num + 1),  # noqa: E203
                df_iter.columns.get_loc("landlot"),
            ] = last
        if line.landlot != "None":
            last = line.landlot
            last_index = num
    return df_iter
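For reference, the smoothing rule described above can also be vectorized in pandas (a sketch, not the asker's code; `smooth_runs` and `threshold` are illustrative names): label consecutive runs with shift/cumsum, then fill short "None" runs whose two neighbouring runs carry the same zone.

```python
import pandas as pd

def smooth_runs(s: pd.Series, threshold: int = 2) -> pd.Series:
    """Fill runs of 'None' no longer than `threshold` when both
    neighbouring runs carry the same zone label."""
    run_id = (s != s.shift()).cumsum()            # label consecutive runs
    runs = s.groupby(run_id).agg(['first', 'size'])
    prev_val = runs['first'].shift()
    next_val = runs['first'].shift(-1)
    fill = (
        (runs['first'] == 'None')
        & (runs['size'] <= threshold)
        & (prev_val == next_val)
        & prev_val.notna()                        # don't fill a leading None run
    )
    runs.loc[fill, 'first'] = prev_val[fill]
    # Expand the per-run values back to one value per original row
    return runs['first'].reindex(run_id.values).reset_index(drop=True)

s = pd.Series(['zoneA', 'zoneA', 'zoneA', 'None', 'None', 'zoneA',
               'zoneA', 'None', 'None', 'None', 'zoneA'])
print(smooth_runs(s, threshold=2).tolist())
```

This avoids the per-row `iterrows` loop entirely, which is where most of the cost of the original code comes from.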
Python implementation
I like to start these kinds of things clean and simple, so I just wrote a simple class that does exactly what is needed, without thinking too much about optimization. I call it Interpolator, as this looks like categorical interpolation to me.
class Interpolator:
    def __init__(self, data):
        self.data = data
        self.current_idx = 0
        self.current_nan_region_start = None
        self.result = None
        self.maxgap = 1

    def run(self, maxgap=2):
        # Initialization
        self.result = [None] * len(self.data)
        self.maxgap = maxgap
        self.current_nan_region_start = None
        prev_isnan = 0
        for idx, item in enumerate(self.data):
            isnan = item is None
            self.current_idx = idx
            if isnan:
                if prev_isnan:
                    # Result is already filled with empty data.
                    # Do nothing.
                    continue
                else:
                    self.entered_nan_region()
                    prev_isnan = 1
            else:  # not nan
                if prev_isnan:
                    self.exited_nan_region()
                    prev_isnan = 0
                else:
                    self.continuing_in_categorical_region()

    def entered_nan_region(self):
        self.current_nan_region_start = self.current_idx

    def continuing_in_categorical_region(self):
        self.result[self.current_idx] = self.data[self.current_idx]

    def exited_nan_region(self):
        nan_region_end = self.current_idx - 1
        nan_region_length = nan_region_end - self.current_nan_region_start + 1
        # Always copy the empty region endpoint even if gap is not filled
        self.result[self.current_idx] = self.data[self.current_idx]
        if nan_region_length > self.maxgap:
            # Do not interpolate as exceeding maxgap
            return
        if self.current_nan_region_start == 0:
            # Special case: data starts with None
            # -> cannot interpolate
            return
        if self.data[self.current_nan_region_start - 1] != self.data[self.current_idx]:
            # Do not fill, as both ends of the missing data
            # region do not have the same value
            return
        # Fill the gap
        for idx in range(self.current_nan_region_start, self.current_idx):
            self.result[idx] = self.data[self.current_idx]
def interpolate(data, maxgap=2):
    """
    Interpolate categorical variables over missing
    values (None's).

    Parameters
    ----------
    data: list of objects
        The data to interpolate. Holds
        categorical data, such as 'cat', 'dog'
        or 108. None is handled as missing data.
    maxgap: int
        The maximum gap to interpolate over.
        For example, with maxgap=2, ['car', None,
        None, 'car', None, None, None, 'car']
        would become ['car', 'car', 'car', 'car',
        None, None, None, 'car'].

    Note: Interpolation will only occur on missing
    data regions where both ends contain the same value.
    For example, [1, None, 2, None, 2] will become
    [1, None, 2, 2, 2].
    """
    interpolator = Interpolator(data)
    interpolator.run(maxgap=maxgap)
    return interpolator.result
This is how one would use it (code for get_data() below):
data = get_data(k=100)
interpolated_data = interpolate(data)
Copy-paste Cython implementation
Most probably the Python implementation is fast enough: with an array size of 1,000,000, it processes the data in 0.504 seconds on my laptop. Anyway, creating a Cython version is fun and might give a small additional timing bonus.
Needed steps:
Copy-paste the Python implementation into a new file called fast_categorical_interpolate.pyx
Create setup.py in the same folder, with the following contents:
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize(
        "fast_categorical_interpolate.pyx",
        language_level="3",
    ),
)
Run python setup.py build_ext --inplace to build the Cython extension. You'll see something like fast_categorical_interpolate.cp38-win_amd64.pyd in the same folder.
Now, you may use the interpolator like this:
import fast_categorical_interpolate as fpi
data = get_data(k=100)
interpolated_data = fpi.interpolate(data)
Of course, there might be some optimizations you could do in the Cython code to make this even faster, but on my machine the speed improvement was 38% out of the box with N=1,000,000 and 126% when N=10,000.
Timings on my machine
When N=100 (the number of items in the list), the Python implementation is about 160x, and the Cython implementation about 250x, faster than smooth:
In [8]: timeit smooth(test_df, delay=2)
10.2 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: timeit interpolate(data)
64.8 µs ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: timeit fpi.interpolate(data)
41.3 µs ± 4.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
When N=10,000, the timing difference is about 190x (Python) to 302x (Cython).
In [5]: timeit smooth(test_df, delay=2)
1.08 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: timeit interpolate(data)
5.69 ms ± 852 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: timeit fpi.interpolate(data)
3.57 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
When N=1,000,000, the Python implementation is about 210x faster and the Cython implementation is about 287x faster.
In [9]: timeit smooth(test_df, delay=2)
1min 45s ± 24.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: timeit interpolate(data)
504 ms ± 67.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: timeit fpi.interpolate(data)
365 ms ± 38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Appendix
Test data creator get_data()
import random

random.seed(0)

def get_data(k=100):
    return random.choices(
        population=[None, "ZoneA", "ZoneB"], weights=[4, 3, 2], k=k
    )
Function and test data for testing smooth()
import pandas as pd
data = get_data(k=1000)
test_df = pd.DataFrame(dict(landlot=data)).fillna("None")
def smooth(df: pd.DataFrame, delay=2) -> pd.DataFrame:
    """
    Args:
        df (pd.DataFrame): dataframe with landlot column to smooth.

    Returns:
        dataframe smoothed
    """
    df_iter = df
    last = "None"
    last_index = 0
    for num, line in df_iter.iterrows():
        if (
            (line.landlot != "None")
            and (line.landlot == last)
            and (num - last_index <= delay)
            and (df_iter.iloc[(num - 1), df_iter.columns.get_loc("landlot")] == "None")
        ):
            df_iter.iloc[
                last_index : (num + 1),  # noqa: E203
                df_iter.columns.get_loc("landlot"),
            ] = last
        if line.landlot != "None":
            last = line.landlot
            last_index = num
    return df_iter
Note on the "current code"
I think there must be a copy-paste error somewhere, as the "current code" does not work at all. I replaced self.delay with a delay=2 keyword argument to indicate the max gap; I assume that is what it was supposed to be. Even with that, the logic did not work correctly on the simple example data you provided.
I have to compute a large number of 3x3 linear transformations (eg. rotations). This is what I have so far:
import numpy as np
from scipy import sparse
from numba import jit

n = 100000  # number of transformations
k = 100     # number of vectors for each transformation

A = np.random.rand(n, 3, k)   # vectors
Op = np.random.rand(n, 3, 3)  # operators
sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n + 1)))  # same as Op but as block-diag

def dot1():
    """ naive approach: many times np.dot """
    return np.stack([np.dot(o, a) for o, a in zip(Op, A)])

@jit(nopython=True)
def dot2():
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in range(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot3():
    """ using einsum """
    return np.einsum("ijk,ikl->ijl", Op, A)

def dot4():
    """ using sparse block diag matrix """
    return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)
On a 2012 MacBook Pro, this gives me:
In [62]: %timeit dot1()
783 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [63]: %timeit dot2()
261 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [64]: %timeit dot3()
293 ms ± 2.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [65]: %timeit dot4()
281 ms ± 6.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Apart from the naive approach, all approaches are similar. Is there a way to accelerate this significantly?
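As a quick sanity check (not from the post), NumPy's `@` operator broadcasts matrix multiplication over the leading axis, so one batched call reproduces the per-item loop exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 100                 # smaller n than in the post, for a quick check
A = rng.random((n, 3, k))        # vectors
Op = rng.random((n, 3, 3))       # operators

naive = np.stack([np.dot(o, a) for o, a in zip(Op, A)])  # dot1 style
batched = Op @ A                 # one broadcast matmul over axis 0

assert np.allclose(naive, batched)
```

The batched form lets NumPy dispatch the whole stack in one call instead of n Python-level `np.dot` calls.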
Edit
(The cuda approach is the best when available. The following is comparing the non-cuda versions)
Following the various suggestions, I modified dot2, added the Op @ A method, and a version based on #59356461.
from numba import njit, prange

@njit(fastmath=True, parallel=True)
def dot2(Op, A):
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in prange(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot5(Op, A):
    """ using matmul """
    return Op @ A

@njit(fastmath=True, parallel=True)
def dot6(Op, A):
    """ another numba.jit with parallel (based on #59356461) """
    new = np.empty_like(A)
    for i_n in prange(A.shape[0]):
        for i_k in range(A.shape[2]):
            for i_x in range(3):
                acc = 0.0j
                for i_y in range(3):
                    acc += Op[i_n, i_x, i_y] * A[i_n, i_y, i_k]
                new[i_n, i_x, i_k] = acc
    return new
This is what I get (on a different machine) with benchit:
def gen(n, k):
    Op = np.random.rand(n, 3, 3) + 1j * np.random.rand(n, 3, 3)
    A = np.random.rand(n, 3, k) + 1j * np.random.rand(n, 3, k)
    return Op, A

# benchit
import benchit
funcs = [dot1, dot2, dot3, dot4, dot5, dot6]
inputs = {n: gen(n, 100) for n in [100, 1000, 10000, 100000, 1000000]}
t = benchit.timings(funcs, inputs, multivar=True, input_name='Number of operators')
t.plot(logy=True, logx=True)
You've gotten some great suggestions, but I wanted to add one more due to this specific goal:
Is there a way to accelerate this significantly?
Realistically, if you need these operations to be significantly faster (which often means > 10x) you probably would want to use a GPU for the matrix multiplication. As a quick example:
import numpy as np
import cupy as cp

n = 100000  # number of transformations
k = 100     # number of vectors for each transformation

# CPU version
A = np.random.rand(n, 3, k)   # vectors
Op = np.random.rand(n, 3, 3)  # operators

def dot5():  # the suggested, best CPU approach
    return Op @ A

# GPU version using a V100
gA = cp.asarray(A)
gOp = cp.asarray(Op)

# run once to ignore JIT overhead before benchmarking
gOp @ gA;

%timeit dot5()
%timeit gOp @ gA; cp.cuda.Device().synchronize()  # need to sync for a fair benchmark
112 ms ± 546 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.19 ms ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use Op @ A, as suggested by @hpaulj in the comments.
Here is a comparison using benchit:
def dot1(A, Op):
    """ naive approach: many times np.dot """
    return np.stack([np.dot(o, a) for o, a in zip(Op, A)])

@jit(nopython=True)
def dot2(A, Op):
    """ same as above, but jitted """
    new = np.empty_like(A)
    for i in range(Op.shape[0]):
        new[i] = np.dot(Op[i], A[i])
    return new

def dot3(A, Op):
    """ using einsum """
    return np.einsum("ijk,ikl->ijl", Op, A)

def dot4(A, Op):
    """ using sparse block diag matrix """
    n = A.shape[0]
    sOp = sparse.bsr_matrix((Op, np.arange(n), np.arange(n + 1)))  # same as Op but as block-diag
    return sOp.dot(A.reshape(3 * n, -1)).reshape(n, 3, -1)

def dot5(A, Op):
    return Op @ A

in_ = {n: [np.random.rand(n, 3, k), np.random.rand(n, 3, 3)] for n in [100, 1000, 10000, 100000, 1000000]}
They seem to be close in performance at larger scales, with dot5 being slightly faster.
In one answer Nick mentioned using the GPU - which is the best solution of course.
But, as a general rule, what you're doing is likely CPU-limited. Therefore (with the exception of the GPU approach), the best bang for your buck is to make use of all the cores on your machine in parallel.
For that you would want to use multiprocessing (not Python's multithreading!) to split the job into pieces that run on each core in parallel.
This is not trivial, but also not too hard, and there are many good examples/guides online.
If you had an 8-core machine, it would likely give you an almost 8x speed increase, as long as you avoid memory bottlenecks: don't pass many small objects between processes, but pass them all in one group at the start.
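The splitting-and-reassembly part of that plan can be sketched as follows (illustrative names; the actual `multiprocessing.Pool` dispatch is elided, since only the chunking is specific to this problem):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, n_workers = 10_000, 100, 8
A = rng.random((n, 3, k))
Op = rng.random((n, 3, 3))

def worker(op_chunk, a_chunk):
    # The unit of work one process would receive: a plain batched
    # matmul on its slice of the operators and vectors.
    return op_chunk @ a_chunk

# Split once along the leading axis; in a real run each (op, a) pair
# would be submitted to a multiprocessing.Pool in one go, following the
# advice above to pass data in large groups rather than many small objects.
pairs = zip(np.array_split(Op, n_workers), np.array_split(A, n_workers))
result = np.concatenate([worker(op_c, a_c) for op_c, a_c in pairs])
```

Because each chunk is independent, the per-chunk results can simply be concatenated back in order.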
I recently played with Cython and Numba to accelerate small pieces of a Python program that does numerical simulation. At first, developing with Numba seems easier, yet I found it difficult to understand when Numba will provide better performance and when it will not.
One example of unexpected performance drop is when I use the function np.zeros() to allocate a big array in a compiled function. For example, consider the three function definitions:
import numpy as np
from numba import jit

def pure_python(n):
    mat = np.zeros((n, n), dtype=np.double)
    # do something
    return mat.reshape((n**2))

@jit(nopython=True)
def pure_numba(n):
    mat = np.zeros((n, n), dtype=np.double)
    # do something
    return mat.reshape((n**2))

def mixed_numba1(n):
    return mixed_numba2(np.zeros((n, n)))

@jit(nopython=True)
def mixed_numba2(array):
    n = len(array)
    # do something
    return array.reshape((n, n))

# To compile
pure_numba(10)
mixed_numba1(10)
Since the "# do something" part is empty, I do not expect the pure_numba function to be faster. Yet, I was not expecting such a performance drop:
n=10000
%timeit x = pure_python(n)
%timeit x = pure_numba(n)
%timeit x = mixed_numba1(n)
I obtain (Python 3.7.7, Numba 0.48.0 on a Mac):
4.96 µs ± 65.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
344 ms ± 7.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.8 µs ± 30.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Here, the numba code is much slower when I use the function np.zeros() inside the compiled function. It works normally when the np.zeros() is outside the function.
Am I doing something wrong here or should I always allocate big arrays like these outside functions that are compiled by numba?
Update
This seems related to a lazy initialization of the matrices by np.zeros((n,n)) when n is large enough (see Performance of zeros function in Numpy ).
for n in [1000, 2000, 5000]:
    print('n =', n)
    %timeit x = pure_python(n)
    %timeit x = pure_numba(n)
    %timeit x = mixed_numba1(n)
gives me:
n = 1000
468 µs ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
296 µs ± 6.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
300 µs ± 2.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
n = 2000
4.79 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.45 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.54 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
n = 5000
270 µs ± 4.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
104 ms ± 599 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
119 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
tl;dr Numpy uses C memory functions whereas Numba must assign zeros
I wrote a script to plot the time several options take to complete, and it appears that Numba has a severe drop in performance when the size of the np.zeros array reaches 2048*2048*8 = 32 MB on my machine, as shown in the diagram below.
Numba's implementation of np.zeros is just as fast as creating an empty array and filling it with zeros by iterating over the dimensions of the array (the green "Numba nested loop" curve in the diagram). This can actually be double-checked by setting the NUMBA_DUMP_IR environment variable before running the script (see below); compared with the dump for numba_loop, there is not much difference.
Interestingly, np.zeros gets a little boost past the 32 MB threshold.
My best guess, although I am far from an expert, is that the 32 MB limit is an OS or hardware bottleneck coming from the amount of data that can fit in a cache for the same process. If it is exceeded, moving data in and out of the cache to operate on it becomes very time-consuming.
By contrast, Numpy uses calloc to get a memory segment, with a promise that the data will be zero-filled when it is accessed.
This is as far as I got, and I realise it's only half an answer, but maybe someone more knowledgeable can shed some light on what is actually going on.
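A quick way to observe the calloc behaviour (a sketch; exact timings are machine-dependent, so none are asserted here): np.zeros on a large array returns almost immediately because the OS hands back lazily zeroed pages, while filling an np.empty array of the same size must touch every page up front.

```python
import time
import numpy as np

n = 4096  # 4096 * 4096 * 8 bytes = 128 MB, well past the ~32 MB threshold

t0 = time.perf_counter()
a = np.zeros((n, n))            # calloc: pages are zeroed lazily by the OS
t_zeros = time.perf_counter() - t0

t0 = time.perf_counter()
b = np.empty((n, n))
b[:] = 0.0                      # explicit fill: every page is touched now
t_fill = time.perf_counter() - t0

print(f"np.zeros: {t_zeros:.6f} s, empty + fill: {t_fill:.6f} s")
```

On a typical machine the second timing is markedly larger, mirroring the Numba-vs-Numpy gap discussed above.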
Numba IR dump:
---------------------------IR DUMP: pure_numba_zeros----------------------------
label 0:
n = arg(0, name=n) ['n']
$2load_global.0 = global(np: <module 'numpy' from '/lib/python3.8/site-packages/numpy/__init__.py'>) ['$2load_global.0']
$4load_attr.1 = getattr(value=$2load_global.0, attr=zeros) ['$2load_global.0', '$4load_attr.1']
del $2load_global.0 []
$10build_tuple.4 = build_tuple(items=[Var(n, script.py:15), Var(n, script.py:15)]) ['$10build_tuple.4', 'n', 'n']
$12load_global.5 = global(np: <module 'numpy' from '/lib/python3.8/site-packages/numpy/__init__.py'>) ['$12load_global.5']
$14load_attr.6 = getattr(value=$12load_global.5, attr=double) ['$12load_global.5', '$14load_attr.6']
del $12load_global.5 []
$18call_function_kw.8 = call $4load_attr.1($10build_tuple.4, func=$4load_attr.1, args=[Var($10build_tuple.4, script.py:15)], kws=[('dtype', Var($14load_attr.6, script.py:15))], vararg=None) ['$10build_tuple.4', '$14load_attr.6', '$18call_function_kw.8', '$4load_attr.1']
del $4load_attr.1 []
del $14load_attr.6 []
del $10build_tuple.4 []
mat = $18call_function_kw.8 ['$18call_function_kw.8', 'mat']
del $18call_function_kw.8 []
$24load_method.10 = getattr(value=mat, attr=reshape) ['$24load_method.10', 'mat']
del mat []
$const28.12 = const(int, 2) ['$const28.12']
$30binary_power.13 = n ** $const28.12 ['$30binary_power.13', '$const28.12', 'n']
del n []
del $const28.12 []
$32call_method.14 = call $24load_method.10($30binary_power.13, func=$24load_method.10, args=[Var($30binary_power.13, script.py:16)], kws=(), vararg=None) ['$24load_method.10', '$30binary_power.13', '$32call_method.14']
del $30binary_power.13 []
del $24load_method.10 []
$34return_value.15 = cast(value=$32call_method.14) ['$32call_method.14', '$34return_value.15']
del $32call_method.14 []
return $34return_value.15 ['$34return_value.15']
The script to produce the diagram:
import numpy as np
from numba import jit
from time import time
import os
import matplotlib.pyplot as plt

os.environ['NUMBA_DUMP_IR'] = '1'

def numpy_zeros(n):
    mat = np.zeros((n, n), dtype=np.double)
    return mat.reshape((n**2))

@jit(nopython=True)
def numba_zeros(n):
    mat = np.zeros((n, n), dtype=np.double)
    return mat.reshape((n**2))

@jit(nopython=True)
def numba_loop(n):
    mat = np.empty((n * 2, n), dtype=np.float32)
    for i in range(mat.shape[0]):
        for j in range(mat.shape[1]):
            mat[i, j] = 0.
    return mat.reshape((2 * n**2))

# To compile
numba_zeros(10)
numba_loop(10)
os.environ['NUMBA_DUMP_IR'] = '0'

max_n = 4100
time_deltas = {
    'numpy_zeros': [],
    'numba_zeros': [],
    'numba_loop': [],
}
call_count = 10
for n in range(0, max_n, 10):
    for f in (numpy_zeros, numba_zeros, numba_loop):
        start = time()
        for i in range(call_count):
            x = f(n)
        delta = time() - start
        time_deltas[f.__name__].append(delta / call_count)
        print(f'{f.__name__:25} n = {n}: {delta}')
    print()

size = np.arange(0, max_n, 10) ** 2 * 8 / 1024 ** 2
fig, ax = plt.subplots()
plt.xticks(np.arange(0, size[-1], 16))
plt.axvline(x=32, color='gray', lw=0.5)
ax.plot(size, time_deltas['numpy_zeros'], label='Numpy zeros (calloc)')
ax.plot(size, time_deltas['numba_zeros'], label='Numba zeros')
ax.plot(size, time_deltas['numba_loop'], label='Numba nested loop')
ax.set_xlabel('Size of array in MB')
ax.set_ylabel(r'Mean $\Delta$t in s')
plt.legend(loc='upper left')
plt.show()
I am trying to accelerate the code below, which produces a list of lists with a different type for each column. I originally created a pandas dataframe and then converted it to a list, but this seems fairly slow. How can I create this list faster, say by an order of magnitude? All columns are constant except one.
import pandas as pd
import numpy as np
import time
import datetime

def overflow_check(x):
    # in SQL code the column is decimal(13, 2)
    p = 13
    s = 3
    max_limit = float("9" * (p - s) + "." + "9" * s)
    #min_limit = 0.01  # float("0" + "." + "0"*(s-2) + '1')
    #min_limit = 0.1
    if np.logical_not(isinstance(x, np.ndarray)) or len(x) < 1:
        raise Exception("Non-numeric or empty array.")
    else:
        #print(x)
        return x * (np.abs(x) < max_limit) + np.sign(x) * max_limit * (np.abs(x) >= max_limit)

def list_creation(y_forc):
    backcast_length = len(y_forc)
    backcast = pd.DataFrame(data=np.full(backcast_length, 2),
                            columns=['TypeId'])
    backcast['id2'] = None
    backcast['Daily'] = 1
    backcast['ForecastDate'] = y_forc.index.strftime('%Y-%m-%d')
    backcast['ReportDate'] = pd.to_datetime('today').strftime('%Y-%m-%d')
    backcast['ForecastMethodId'] = 1
    backcast['ForecastVolume'] = overflow_check(y_forc.values)
    backcast['CreatedBy'] = 'test'
    backcast['CreatedDt'] = pd.to_datetime('today')
    return backcast.values.tolist()

i = pd.date_range('05-01-2010', '21-05-2018', freq='D')
x = pd.DataFrame(index=i, data=np.random.randint(0, 100, len(i)))

t = time.perf_counter()
y = list_creation(x)
print(time.perf_counter() - t)
This should be a bit faster, it just directly creates the list:
def list_creation1(y_forc):
    zipped = zip(y_forc.index.strftime('%Y-%m-%d'), overflow_check(y_forc.values)[:, 0])
    t = pd.to_datetime('today').strftime('%Y-%m-%d')
    t1 = pd.to_datetime('today')
    return [
        [2, None, 1, i, t,
         1, v, 'test', t1]
        for i, v in zipped
    ]
%%timeit
list_creation(x)
> 29.3 ms ± 468 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
list_creation1(x)
> 17.1 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Edit: one of the big sources of slowness is the time it takes to convert datetimes to the specified format. We can get rid of that by phrasing the problem as follows:
def list_creation1(i, v):
    zipped = zip(i, overflow_check(np.array([[_x] for _x in v]))[:, 0])
    t = pd.to_datetime('today').strftime('%Y-%m-%d')
    t1 = pd.to_datetime('today')
    return [
        [2, None, 1, i, t,
         1, v, 'test', t1]
        for i, v in zipped
    ]

start = datetime.datetime.strptime("05-01-2010", "%d-%m-%Y")
end = datetime.datetime.strptime("21-05-2018", "%d-%m-%Y")
i = [(start + datetime.timedelta(days=x)).strftime("%d-%m-%Y") for x in range(0, (end - start).days)]
x = np.random.randint(0, 100, len(i))
Then this is now a lot faster:
%%timeit
list_creation1(i, x)
> 1.87 ms ± 24.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I have a pandas dataframe that contains a column with pretty long strings (let's say URL paths) and a list of unique substrings (a reference list). For every row in my dataframe, I want to determine the corresponding reference element from my list. So if the URL in a given row is, for example, abcd1234, and one of the reference values is cd123, then I want to add cd123 as the reference for that row, to categorize the row/URL.
I got my code working (see the example below), but it's pretty slow, due (I guess) to a for loop I can't get rid of. I have the feeling my code could be much faster, but I can't think of a way to improve it.
How can I improve the running time?
See working example below:
import string
import secrets
import pandas as pd
import time
from random import randint

n_ref = 100
n_target = 1000000

## Build reference Series and target dataframe
reference = pd.Series(
    ''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(randint(10, 19)))
    for _ in range(n_ref)
)
target = pd.Series(reference.sample(n=n_target, replace=True)).reset_index().iloc[:, 1]

dfTarget = pd.DataFrame({
    'target': target,
    'pre-string': pd.Series(
        ''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(randint(1, 10)))
        for _ in range(n_target)
    ),
    'post-string': pd.Series(
        ''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(randint(1, 10)))
        for _ in range(n_target)
    ),
    'reference': pd.Series()})

dfTarget['target_combined'] = dfTarget[['pre-string', 'target', 'post-string']].apply(lambda x: ''.join(x), axis=1)

## Fill in reference column
## Loop over references and return reference in reference column
start_time = time.time()
for x in reference:
    dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] = x
print("--- %s seconds ---" % (time.time() - start_time))
Out: 42.60... seconds
On my machine, I see a 17x improvement using pd.Series.apply:
reference_set = set(reference)

def calculator(x):
    return next((i for i in reference_set if i in x), None)

dfTarget['reference'] = dfTarget['target_combined'].apply(calculator)
But for optimal performance, see @unutbu's solution.
Here is a (roughly 4.3 times) faster approach:
RegEx pattern:
In [23]: pat = '.*({}).*'.format(reference.str.cat(sep='|'))
In [24]: pat
Out[24]: '.*(J6BUVB2BRDLL3IR9S1J|ZOXS91UK513RR18YREI|92KWUFKOK4G9XJAHIBJ|PMEH6N96091AK9XCA5J|3CICA38SDIXLFVED74I|V48OJCY2DS|LX8KGGBORWP6A|7H
V3NN71MU|JMA2K7QSHK72X|CNAOYI3C8T|NZE9SFKPYX|EU9K88XA29YATWR|SB871PEZ7TOPCG8|ZPP76BSDULM8|3QHLISVYEBWH|ST8VOI959D8YPCZ0|02BW83KYG3TEPWMOP|TG
I3P5QZC988GNM8FI0|GJG9MC18G5TU1TIDQB6|V7V5ZZJ5W7O|51KMJ07HEBIX|27GPT3B9DLY|O8KSR85BUB6WBKRC|ZKUEEFX5JFRE0IFRN0|FH8CUWHDETQ5TXWHSS1|N77FTB9VG
LK|JS4RUUQLD7IFP|3R45N7LOY1BZ8RR6O|JY3RXZ0OTC|YJQYOO03G0N7H7E56D|RVJ2VFNK6T7P30|GKPGAK6WAQ2QCAU6H3|7XNJ7A24CHWO1PK|1DVD5G1AE3I40|9F7CCWKHMMF
MBYD18|FWPEUWOWNK2SXR36SG|VTE64VCRY5|YGM8TT19EZTX|GKJYM3QS9ONTERQY1O0|KWMB1TMQTWMC6QCY|JS9SY7W5HI0KK|WNSHPK9KNEP77B|7EIS883NUXSO5Q6|K3HL2UYW
458LCBOSL|XI1FRVGHN0IL0F53CK4|F4HL7GKMOL2Q4Y13|IAXPAA4OX2J1X1|SXPLPYVB6EFSN4U5ZW|5L947F08PX8UW|IONNAOC26A|VQVHXHGYP8634|509ALPOKABO|SUJA66H2
DS7UOXFV|3GYIZATSZAXF8283SZO|A5612XI7X3N4|IH3RB3640D23Q28O|MH0YD83OELSI|RIFFPNRIV0XCY|Y0CXWE6GZPQ3FKH|WSCWR598Z8GBW9G|7C9O59EIA23POSI|UG4D5H
AAOYU5E|F249VSIILZ6KXDQSX|06XZSJHWSM|X01Y9AZ2W5V8HZ|1JLPWMPRGRFWIK|3ZVBSLEQ8DO|WMLKKETELHC|WDPHDS7A7XN7|6X4O4AE2IB3OS|V5J5HWO9RO19ZW2LGT|MK9
P8D9N8V4AJZB|0VT48C38I4T1V6S|R987QUQBTPRHCT7QWA4|D4XXBMCYWQ1172OY|ZUY1O565D2W5GSAL8|V8AR792X1K5UL9DLCKV|CXYK6IQWK3MUC3CO|6X7B6240VC9YL|4QV2D
13ZY15A9D5M1H|WJ7HOMK2FNBZZ6N2Z|QCOWSA3RLR|81I6Z0I5GM|KRD9Y1H3E2WEY9710Q|0161MNQHKEC30E8UI|HGB4XB0QDVHM4H92|RWD6L6EZJUSRK|6U9WOE3YVYKY31K8Q0
K|KCXWHL43B16MRQ1|EO330WAPN7XMX4|VYUX5W2NN277W09NMDB|J8EXE4YIMN0FB|SHE8D14C5A3X|PMPYKSY2FVXFR4Y8X3W|G3YU894U5QGOOM3Z|58J37WJPJBOC7QNKV|NE9WE
JSRXTYFXYZ0TBI|7UPR5XSVOJ244HHZ|N0QZCN6NADW|W2CTEUISOHUY).*'
Replacement:
dfTarget['reference'] = dfTarget['target_combined'].str.replace(pat, r'\1')
Timing against a 10,000-row DF:
In [25]: %%timeit
...: dfTarget['reference'] = dfTarget['target_combined'].str.replace(pat, r'\1')
...:
617 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [26]: %%timeit
...: [dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] for x in reference]
...:
1.96 s ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [27]: %%timeit
...: for x in reference:
...: dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] = x
...:
2.64 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [28]: 2.64/0.617
Out[28]: 4.278768233387359
In [29]: 2.64/1.96
Out[29]: 1.3469387755102042
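A related vectorized option (a sketch, not from the answers above; the small data here is illustrative): `Series.str.extract` with a single alternation pattern pulls out the first matching reference directly, and `re.escape` guards against regex metacharacters in the references.

```python
import re
import pandas as pd

# Hypothetical small data standing in for the post's generated frame.
reference = pd.Series(['CD123', 'XY9', 'QQQ7'])
df = pd.DataFrame({'target_combined': ['ABCD123Z', 'PPXY9K', 'NOMATCH']})

# One alternation pattern with a single capture group; extract returns
# the first hit per row, or NaN when no reference matches.
pat = '({})'.format('|'.join(map(re.escape, reference)))
df['reference'] = df['target_combined'].str.extract(pat, expand=False)
```

Unlike the `str.replace` trick, rows without any matching reference come back as NaN instead of keeping the full original string.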