Add leading zeros based on condition in Python

I have a dataframe with 5 million rows. Let's say the dataframe looked like below:
>>> df = pd.DataFrame(data={"Random": "86 7639103627 96 32 1469476501".split()})
>>> df
Random
0 86
1 7639103627
2 96
3 32
4 1469476501
Note that the Random column is stored as a string.
If the number in column Random has fewer than 9 digits, I want to add leading zeros to make it 9 digits. If the number has 9 or more digits, I want to add leading zeros to make it 20 digits.
What I have done is this:
for i in range(0, len(df['Random'])):
    if len(df['Random'][i]) < 9:
        df['Random'][i] = df['Random'][i].zfill(9)
    else:
        df['Random'][i] = df['Random'][i].zfill(20)
Since the number of rows is over 5 million, this process takes a lot of time (throughput was about 5 it/s measured with tqdm, so the estimated time to completion was in days!).
Is there an easier and faster way of performing this task?

Let us do np.where combined with zfill; alternatively, you can check str.pad (a short str.pad sketch follows the output below).
df.Random = np.where(df.Random.str.len() < 9, df.Random.str.zfill(9), df.Random.str.zfill(20))
df
Out[9]:
Random
0 000000086
1 00000000007639103627
2 000000096
3 000000032
4 00000000001469476501
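For completeness, the str.pad route mentioned above would look roughly like the minimal sketch below (Series.str.pad with side='left' and fillchar='0' pads the same way zfill does for these digit-only strings):
df.Random = np.where(df.Random.str.len() < 9,
                     df.Random.str.pad(9, side='left', fillchar='0'),
                     df.Random.str.pad(20, side='left', fillchar='0'))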

I used 'apply' combined with the fill_zeros function written below to get a run time of 603ms over a dataframe of 1,000,000 rows.
import pandas as pd
from random import randint

data = {
    'Random': [str(randint(0, 100_000_000)) for i in range(0, 1_000_000)]
}
df = pd.DataFrame(data)

def fill_zeros(x):
    if len(x) < 9:
        return x.zfill(9)
    else:
        return x.zfill(20)
%timeit df['Random'].apply(fill_zeros)
603 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Compared to:
%timeit np.where(df.Random.str.len()<9,df.Random.str.zfill(9),df.Random.str.zfill(20))
1.57 s ± 6.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Since you are asking about efficiency: string operations are one of the common "gotchas" with Pandas. They are vectorized, in the sense that you can apply them to an entire Series in one go, but that does not mean they are more efficient than looping. This is one example where looping is actually going to be faster than using the string accessor, which tends to be more about convenience than speed.
When in doubt, make sure you time functions on your actual data, since something you think may be clunky and slow may be faster than something that looks clean!
I'm going to propose a very basic looping function that I think will beat any approach using the string accessor.
def loopy(series):
    return pd.Series(
        (
            el.zfill(9) if len(el) < 9 else el.zfill(20)
            for el in series
        ),
        name=series.name,
    )

# to compare more fairly with the apply version
def cache_loopy(series, _len=len, _zfill=str.zfill):
    return pd.Series(
        (_zfill(el, 9 if _len(el) < 9 else 20) for el in series), name=series.name)
Now let's check the timings, using the setup code provided by Martijn (in his answer below) and simple_benchmark.
Functions
def loopy(series):
    series.copy()  # not necessary but just to make timings fair
    return pd.Series(
        (
            el.zfill(9) if len(el) < 9 else el.zfill(20)
            for el in series
        ),
        name=series.name,
    )

def str_accessor(series):
    target = series.copy()
    mask = series.str.len() < 9
    unmask = ~mask
    target[mask] = target[mask].str.zfill(9)
    target[unmask] = target[unmask].str.zfill(20)
    return target

def np_where_str_accessor(series):
    target = series.copy()
    return np.where(target.str.len() < 9, target.str.zfill(9), target.str.zfill(20))

def fill_zeros(x, _len=len, _zfill=str.zfill):
    # len() and str.zfill() are cached as parameters for performance
    return _zfill(x, 9 if _len(x) < 9 else 20)

def apply_fill(series):
    series = series.copy()
    return series.apply(fill_zeros)

def cache_loopy(series, _len=len, _zfill=str.zfill):
    series.copy()
    return pd.Series(
        (_zfill(el, 9 if _len(el) < 9 else 20) for el in series), name=series.name)
Setup
import pandas as pd
import numpy as np
from random import choices, randrange
from simple_benchmark import benchmark

def randvalue(chars="0123456789", _c=choices, _r=randrange):
    return "".join(_c(chars, k=randrange(5, 30))).lstrip("0")

fns = [loopy, str_accessor, np_where_str_accessor, apply_fill, cache_loopy]
args = {2**i: pd.Series([randvalue() for _ in range(2**i)]) for i in range(14, 21)}

b = benchmark(fns, args, 'Series Length')
b.plot()

You need to vectorize this; select the rows using a boolean index and use .str.zfill() on the resulting subsets:
# select the right rows to avoid wasting time operating on longer strings
shorter = df.Random.str.len() < 9
longer = ~shorter
df.Random[shorter] = df.Random[shorter].str.zfill(9)
df.Random[longer] = df.Random[longer].str.zfill(20)
Note: I did not use np.where() because we wouldn't want to double the work. A vectorized df.Random.str.zfill() is faster than looping over the rows, but doing it twice still takes more time than doing it just once for each set of rows.
Speed comparison on 1 million rows of strings with values of random lengths (from 5 characters all the way up to 30):
In [1]: import numpy as np, pandas as pd
In [2]: import platform; print(platform.python_version_tuple(), platform.platform(), pd.__version__, np.__version__, sep="\n")
('3', '7', '3')
Darwin-17.7.0-x86_64-i386-64bit
0.24.2
1.16.4
In [3]: !sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
In [4]: from random import choices, randrange
In [5]: def randvalue(chars="0123456789", _c=choices, _r=randrange):
   ...:     return "".join(_c(chars, k=randrange(5, 30))).lstrip("0")
   ...:
In [6]: df = pd.DataFrame(data={"Random": [randvalue() for _ in range(10**6)]})
In [7]: %%timeit
...: target = df.copy()
...: shorter = target.Random.str.len() < 9
...: longer = ~shorter
...: target.Random[shorter] = target.Random[shorter].str.zfill(9)
...: target.Random[longer] = target.Random[longer].str.zfill(20)
...:
...:
825 ms ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %%timeit
...: target = df.copy()
...: target.Random = np.where(target.Random.str.len()<9,target.Random.str.zfill(9),target.Random.str.zfill(20))
...:
...:
929 ms ± 69.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(The target = df.copy() line is needed to make sure that each repeated test run is isolated from the one before.)
Conclusion: on 1 million rows, using np.where() is about 10% slower.
However, using df.Random.apply(), as proposed by jackbicknell14, beats either method by a huge margin:
In [9]: def fill_zeros(x, _len=len, _zfill=str.zfill):
...: # len() and str.zfill() are cached as parameters for performance
...: return _zfill(x, 9 if _len(x) < 9 else 20)
In [10]: %%timeit
...: target = df.copy()
...: target.Random = target.Random.apply(fill_zeros)
...:
...:
299 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's about 3 times faster!

Series.where() keeps the values where the condition holds and takes the other operand elsewhere, so this fits in one line:
df.Random.str.zfill(9).where(df.Random.str.len() < 9, df.Random.str.zfill(20))

Is there a faster method than np.isin and np.where for large arrays?

I have a 1xN array A and a 2xM array B. I want to make two new 1xN arrays
a boolean one that checks whether the first column of B is in A
another one with entries i that are B[1,i] if B[0,i] is in A, and np.nan otherwise
Whatever method I use needs to be super fast as it’ll be called a lot. I can do the first part using this: Is there method faster than np.isin for large array?
But I’m stumped on a good way to do the second part. Here’s what I’ve got so far (adapting the code in the post above):
import numpy as np
import numba as nb

@nb.jit(parallel=True)
def isinvals(arr, vals):
    n = len(arr)
    result = np.full(n, False)
    result_vals = np.full(n, np.nan)
    set_vals = set(vals[0,:])
    list_vals = list(vals[0,:])
    for i in nb.prange(n):
        if arr[i] in set_vals:
            ind = list_vals.index(arr[i])  ## THIS LINE IS WAY TOO SLOW
            result[i] = True
            result_vals[i] = vals[1,ind]
    return result, result_vals
N = int(1e5)
M = int(20e3)
num_arr = 100e3
num_vals = 20e3
num_types = 6
arr = np.random.randint(0, num_arr, N)
vals_col1 = np.random.randint(0, num_vals, M)
vals_col2 = np.random.randint(0, num_types, M)
vals = np.array([vals_col1, vals_col2])
%timeit result, result_vals = isinvals(arr,vals)
46.4 ms ± 3.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The line I've marked above (list_vals.index(arr[i])) is the slow part. If I don't use that I can make a super fast version:
@nb.jit(parallel=True)
def isinvals_cheating(arr, vals):
    n = len(arr)
    result = np.full(n, False)
    result_vals = np.full(n, np.nan)
    set_vals = set(vals[0,:])
    list_vals = list(vals[0,:])
    for i in nb.prange(n):
        if arr[i] in set_vals:
            ind = 0  ## TEMPORARILY SETTING TO 0 TO INDICATE SPEED DIFFERENCE
            result[i] = True
            result_vals[i] = vals[1,ind]
    return result, result_vals
%timeit result, result_vals = isinvals_cheating(arr,vals)
1.13 ms ± 59.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
i.e. that single line is making it 40 times slower.
Any ideas? I've also tried using np.where() but it's even slower.
Assuming OP's solution gives the desired result (the question seems ambiguous for non-unique values in vals[0, idx] with different corresponding values vals[1, idx]): a lookup table is faster, but it needs len(arr) additional space.
@nb.njit  # tested with numba 0.55.1
def isin_nb(arr, vals):
    lookup = np.empty(len(arr), np.float32)
    lookup.fill(np.nan)
    lookup[vals[0, ::-1]] = vals[1, ::-1]
    res_val = lookup[arr]
    return ~np.isnan(res_val), res_val
With the example data used in the question
res, res_val = isin_nb(arr, vals)
# %timeit 1000 loops, best of 5: 294 µs per loop
Asserting equal results
np.testing.assert_equal(res, result)
np.testing.assert_equal(res_val, result_vals)
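If the values in vals[0] were not small non-negative integers, the lookup table above would not apply directly, since it uses them as indices into an array of length len(arr). A hedged alternative for that case is the np.searchsorted sketch below (the name isin_searchsorted is made up for illustration; the stable sort is meant to reproduce the first-occurrence-wins behaviour of isin_nb):
import numpy as np

def isin_searchsorted(arr, vals):
    # sort the keys once; a stable sort keeps the first occurrence of duplicate keys first
    order = np.argsort(vals[0], kind="stable")
    keys, mapped = vals[0][order], vals[1][order]
    pos = np.searchsorted(keys, arr)
    pos = np.minimum(pos, len(keys) - 1)  # clamp positions past the last key
    found = keys[pos] == arr
    res_val = np.where(found, mapped[pos].astype(np.float64), np.nan)
    return found, res_val
It will not be as fast as the direct lookup table, but it drops the assumption about the key range.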

Smoothing / interpolating categorical data (fast)

I am currently working with an array containing categorical data.
The categories are organised like this: None, zoneA, zoneB.
My array is a series of sensor measurements; it tells me whether, at any time, the sensor is in zoneA, in zoneB, or not in a zone.
My goal here is to smooth those values.
For example, the sensor could read as being out of zoneA or zoneB for a period of 30 measures, and if that happens I want those measures to be "smoothed".
Ex :
array[zoneA, zoneA, zoneA, None, None, zoneA, zoneA, None, None, None, zoneA]
should give
array[zoneA, zoneA, zoneA, zoneA, zoneA, zoneA, zoneA, None, None, None, zoneA]
with a threshold of 2.
Currently I am iterating over the array, but the computation is too expensive and can take 1 or 2 minutes. Is there an existing algorithm to solve this problem?
My current code :
def smooth(self, df: pd.DataFrame) -> pd.DataFrame:
    """
    Args:
        df (pd.DataFrame): dataframe with landlot column to smooth.
    Returns: dataframe smoothed
    """
    df_iter = df
    last = "None"
    last_index = 0
    for num, line in df_iter.iterrows():
        if (
            (line.landlot != "None")
            and (line.landlot == last)
            and (num - last_index <= self.delay)
            and (
                df_iter.iloc[(num - 1), df_iter.columns.get_loc("landlot")]
                == "None"
            )
        ):
            df_iter.iloc[
                last_index: (num + 1),  # noqa: E203
                df_iter.columns.get_loc("landlot"),
            ] = last
        if line.landlot != "None":
            last = line.landlot
            last_index = num
    return df_iter
Python implementation
I like to start these kinds of things clean and simple, so I just wrote a simple class that does exactly what is needed, without thinking too much about optimization. I call it Interpolator, as this looks like categorical interpolation to me.
class Interpolator:
    def __init__(self, data):
        self.data = data
        self.current_idx = 0
        self.current_nan_region_start = None
        self.result = None
        self.maxgap = 1

    def run(self, maxgap=2):
        # Initialization
        self.result = [None] * len(self.data)
        self.maxgap = maxgap
        self.current_nan_region_start = None
        prev_isnan = 0
        for idx, item in enumerate(self.data):
            isnan = item is None
            self.current_idx = idx
            if isnan:
                if prev_isnan:
                    # Result is already filled with empty data.
                    # Do nothing.
                    continue
                else:
                    self.entered_nan_region()
                prev_isnan = 1
            else:  # not nan
                if prev_isnan:
                    self.exited_nan_region()
                    prev_isnan = 0
                else:
                    self.continuing_in_categorical_region()

    def entered_nan_region(self):
        self.current_nan_region_start = self.current_idx

    def continuing_in_categorical_region(self):
        self.result[self.current_idx] = self.data[self.current_idx]

    def exited_nan_region(self):
        nan_region_end = self.current_idx - 1
        nan_region_length = nan_region_end - self.current_nan_region_start + 1
        # Always copy the empty region endpoint even if gap is not filled
        self.result[self.current_idx] = self.data[self.current_idx]
        if nan_region_length > self.maxgap:
            # Do not interpolate as exceeding maxgap
            return
        if self.current_nan_region_start == 0:
            # Special case. data starts with "None"
            # -> Cannot interpolate
            return
        if self.data[self.current_nan_region_start - 1] != self.data[self.current_idx]:
            # Do not fill as both ends of missing data
            # region do not have same value
            return
        # Fill the gap
        for idx in range(self.current_nan_region_start, self.current_idx):
            self.result[idx] = self.data[self.current_idx]
def interpolate(data, maxgap=2):
    """
    Interpolate categorical variables over missing
    values (None's).

    Parameters
    ----------
    data: list of objects
        The data to interpolate. Holds
        categorical data, such as 'cat', 'dog'
        or 108. None is handled as missing data.
    maxgap: int
        The maximum gap to interpolate over.
        For example, with maxgap=2, ['car', None,
        None, 'car', None, None, None, 'car']
        would become ['car', 'car', 'car', 'car',
        None, None, None, 'car'].

    Note: Interpolation will only occur on missing
    data regions where both ends contain the same value.
    For example, [1, None, 2, None, 2] will become
    [1, None, 2, 2, 2].
    """
    interpolator = Interpolator(data)
    interpolator.run(maxgap=maxgap)
    return interpolator.result
This is how one would use it (code for get_data() below):
data = get_data(k=100)
interpolated_data = interpolate(data)
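To connect this back to the question's DataFrame, a minimal sketch (assuming a DataFrame df with a landlot column in which missing zones are stored as the string "None", as in the test data further below) could be:
# convert "None" strings to real None, interpolate, then write back
values = [None if v == "None" else v for v in df["landlot"]]
df["landlot"] = interpolate(values, maxgap=2)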
Copy-paste Cython implementation
Most probably the Python implementation is fast enough: with an array size of 1,000,000, the amount of time needed to process the data is 0.504 seconds on my laptop. Anyway, creating a Cython version is fun and might give a small additional timing bonus.
Needed steps:
Copy-paste the Python implementation into a new file called fast_categorical_interpolate.pyx.
Create a setup.py in the same folder, with the following contents:
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize(
        "fast_categorical_interpolate.pyx",
        language_level="3",
    ),
)
Run python setup.py build_ext --inplace to build the Cython extension. You'll see something like fast_categorical_interpolate.cp38-win_amd64.pyd in the same folder.
Now, you may use the interpolator like this:
import fast_categorical_interpolate as fpi
data = get_data(k=100)
interpolated_data = fpi.interpolate(data)
Of course, there might be some optimizations you could do in the Cython code to make this even faster, but on my machine the speed improvement was 38% out of the box with N=1,000,000 and 126% with N=10,000.
Timings on my machine
When N=100 (the number of items in the list), the Python implementation is about 160x, and the Cython implementation about 250x, faster than smooth:
In [8]: timeit smooth(test_df, delay=2)
10.2 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: timeit interpolate(data)
64.8 µs ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: timeit fpi.interpolate(data)
41.3 µs ± 4.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
When N=10,000, the timing difference is about 190x (Python) to 302x (Cython):
In [5]: timeit smooth(test_df, delay=2)
1.08 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: timeit interpolate(data)
5.69 ms ± 852 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: timeit fpi.interpolate(data)
3.57 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
When N=1,000,000, the Python implementation is about 210x faster and the Cython implementation is about 287x faster:
In [9]: timeit smooth(test_df, delay=2)
1min 45s ± 24.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: timeit interpolate(data)
504 ms ± 67.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: timeit fpi.interpolate(data)
365 ms ± 38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Appendix
Test data creator get_data()
import random

random.seed(0)

def get_data(k=100):
    return random.choices(population=[None, "ZoneA", "ZoneB"], weights=[4, 3, 2], k=k)
Function and test data for testing smooth()
import pandas as pd

data = get_data(k=1000)
test_df = pd.DataFrame(dict(landlot=data)).fillna("None")

def smooth(df: pd.DataFrame, delay=2) -> pd.DataFrame:
    """
    Args:
        df (pd.DataFrame): dataframe with landlot column to smooth.
    Returns: dataframe smoothed
    """
    df_iter = df
    last = "None"
    last_index = 0
    for num, line in df_iter.iterrows():
        if (
            (line.landlot != "None")
            and (line.landlot == last)
            and (num - last_index <= delay)
            and (df_iter.iloc[(num - 1), df_iter.columns.get_loc("landlot")] == "None")
        ):
            df_iter.iloc[
                last_index : (num + 1),  # noqa: E203
                df_iter.columns.get_loc("landlot"),
            ] = last
        if line.landlot != "None":
            last = line.landlot
            last_index = num
    return df_iter
Note on the "current code"
I think there must be some copy-paste error somewhere, as the "current code" does not work at all. I replaced self.delay with a delay=2 keyword argument to indicate the max gap; I assume that is what it was supposed to be. Even with that, the logic did not work correctly with the simple example data you provided.

Faster way to find same integers in sequence in numpy array

Right now I am just looping through using np.nditer() and comparing to the previous element. Is there a (vectorised) approach which is faster?
Added bonus is the fact that I don't always have to go to the end of the array; as soon as a sequence of max_len has been found I am done searching.
import numpy as np

max_len = 3
streak = 0
prev = np.nan
a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])

for c in np.nditer(a):
    if c == prev:
        streak += 1
        if streak == max_len:
            print(c)
            break
    else:
        prev = c
        streak = 1
An alternative I thought about is using np.diff(), but this just shifts the problem: we are now looking for a sequence of zeroes in its result. I also doubt it will be faster, since it has to calculate the difference for every element, whereas in practice the sequence will occur before reaching the end of the array more often than not.
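For what it's worth, a rough sketch of that diff-based idea (assuming NumPy >= 1.20 for sliding_window_view, and max_len >= 2) could look like this; as suspected, it cannot short-circuit:
import numpy as np

a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])
max_len = 3

# a run of max_len equal values shows up as max_len - 1 consecutive zeros in np.diff(a)
eq = np.diff(a) == 0
windows = np.lib.stride_tricks.sliding_window_view(eq, max_len - 1)
hits = np.flatnonzero(windows.all(axis=1))
if hits.size:
    print(a[hits[0]])  # first element of the first qualifying run -> 2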
I developed a numpy-only version that works, but after testing, I found that it performs quite poorly because it can't take advantage of short-circuiting. Since that's what you asked for, I describe it below. However, there is a much better approach using numba with a lightly modified version of your code. (Note that all of these return the index of the first match in a, rather than the value itself. I find that approach more flexible.)
import numba

@numba.jit(nopython=True)
def find_reps_numba(a, max_len):
    streak = 1
    val = a[0]
    for i in range(1, len(a)):
        if a[i] == val:
            streak += 1
            if streak >= max_len:
                return i - max_len + 1
        else:
            streak = 1
            val = a[i]
    return -1
This turns out to be ~100x faster than the pure Python version.
The numpy version uses the rolling window trick and the argmax trick. But again, this turns out to be far slower than even the pure Python version, by a substantial ~30x.
import numpy

def rolling_window(a, window):
    a = numpy.ascontiguousarray(a)  # This approach requires a C-ordered array
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def find_reps_numpy(a, max_len):
    windows = rolling_window(a, max_len)
    return (windows == windows[:, 0:1]).sum(axis=1).argmax()
I tested both of these against a non-jitted version of the first function. (I used Jupyter's %%timeit feature for testing.)
a = numpy.random.randint(0, 100, 1000000)
%%timeit
find_reps_numpy(a, 3)
28.6 ms ± 553 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
find_reps_orig(a, 3)
4.04 ms ± 40.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
find_reps_numba(a, 3)
8.29 µs ± 89.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Note that these numbers can vary dramatically depending on how deep into a the functions have to search. For a better estimate of expected performance, we can regenerate a new set of random numbers each time, but it's difficult to do so without including that step in the timing. So for comparison here, I include the time required to generate the random array without running anything else:
a = numpy.random.randint(0, 100, 1000000)
9.91 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_numpy(a, 3)
38.2 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_orig(a, 3)
13.7 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
a = numpy.random.randint(0, 100, 1000000)
find_reps_numba(a, 3)
9.87 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, find_reps_numba is so fast that the variance in the time it takes to run numpy.random.randint(0, 100, 1000000) is much larger — hence the illusory speedup between the first and last tests.
So the big moral of the story is that numpy solutions aren't always best. Sometimes even pure Python is faster. In those cases, numba in nopython mode may be the best option by far.
You can use groupby from the itertools package.
import numpy as np
from itertools import groupby

max_len = 3
best = ()
a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])

for k, g in groupby(a):
    tup_g = tuple(g)
    if len(tup_g) >= max_len:
        best = tup_g
        break
    if len(tup_g) > len(best):
        best = tup_g

best
# returns:
(2, 2, 2)
You could create sub-arrays of length max_length, moving one position to the right each time (like ngrams), and check if the sum of one sub_array divided by max_length is equal to the first element of that sub-array.
If that's True, then you have found your consecutive sequence of integers of length max_length.
def get_conseq(array, max_length):
    sub_arrays = zip(*[array[i:] for i in range(max_length)])
    for e in sub_arrays:
        if sum(e) / len(e) == e[0]:
            print("Found : {}".format(e))
            return e
    print("Nothing found")
    return []
For example, this array [1,2,2,3,4,5], with max_length = 2, will be 'split' like this:
[1,2]
[2,2]
[2,3]
[3,4]
[4,5]
On the second element, [2,2], the sum is 4, divided by max_length gives 2, and that matches the first element of that subgroup, and the function returns.
You can break if that's what you prefer to do, instead of returning like I do.
You could also add a few rules to capture edge cases, to keep things clean (empty array, max_length greater than the length of the array, etc.).
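A hedged sketch of such guard rules (the wrapper name get_conseq_checked is made up for illustration) might be:
def get_conseq_checked(array, max_length):
    # hypothetical wrapper adding the edge-case rules mentioned above
    if len(array) == 0 or max_length <= 0 or max_length > len(array):
        print("Nothing found")
        return []
    return get_conseq(array, max_length)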
Here are a few example calls:
>>> get_conseq([1,2,3,4,5,6], 2)
Nothing found
>>> get_conseq([1,2,2,3,4,5,6], 3)
Nothing found
>>> get_conseq([1,2,3,3,3], 3)
Found : (3, 3, 3)
>>> get_conseq([1,2,2,3,3], 2)
Found : (2, 2)
Hope this helps!
Assuming you are looking for the element that appears at least max_len times consecutively, here's one NumPy-based way -
m = np.r_[True, a[:-1] != a[1:], True]
idx0 = np.flatnonzero(m)
m2 = np.diff(idx0) >= max_len
out = None  # None for no such streak found case
if m2.any():
    out = a[idx0[m2.argmax()]]
Another one with binary erosion -
from scipy.ndimage.morphology import binary_erosion

m = np.r_[False, a[:-1] == a[1:]]
m2 = binary_erosion(m, np.ones(max_len - 1, dtype=bool))
out = None
if m2.any():
    out = a[m2.argmax()]
Finally, for completeness, you can also look into numba. Your existing code would work as it is, with direct looping over a, i.e. for c in a:.
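As a minimal sketch of that numba route (assuming numba is installed; the function name and the -1 sentinel are mine, and it mirrors find_reps_numba above except that it returns the value rather than the index):
import numba
import numpy as np

@numba.njit
def first_streak_value(a, max_len):
    streak = 1
    prev = a[0]
    for c in a[1:]:          # direct looping over the array, as suggested
        if c == prev:
            streak += 1
            if streak == max_len:
                return prev  # the repeated value
        else:
            prev = c
            streak = 1
    return -1                # sentinel: no run of max_len found

a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])
print(first_streak_value(a, 3))  # -> 2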

Why does a numpy array not appear to be much faster than a standard python list?

From what I understand, numpy arrays can handle operations more quickly than python lists because they're handled in a parallel rather than iterative fashion. I tried to test that out for fun, but I didn't see much of a difference.
Was there something wrong with my test? Does the difference only matter with arrays much bigger than the ones I used? I made sure to create a python list and numpy array in each function to cancel out differences creating one vs. the other might make, but the time delta really seems negligible. Here's my code:
My final outputs were numpy function: 6.534756324786595s, list function: 6.559365831783256s
import timeit
import numpy as np

a_setup = 'import timeit; import numpy as np'

std_fx = '''
def operate_on_std_array():
    std_arr = list(range(0,1000000))
    np_arr = np.asarray(std_arr)
    for index,elem in enumerate(std_arr):
        std_arr[index] = (elem**20)*63134
    return std_arr
'''

parallel_fx = '''
def operate_on_np_arr():
    std_arr = list(range(0,1000000))
    np_arr = np.asarray(std_arr)
    np_arr = (np_arr**20)*63134
    return np_arr
'''

def operate_on_std_array():
    std_arr = list(range(0,1000000))
    np_arr = np.asarray(std_arr)
    for index,elem in enumerate(std_arr):
        std_arr[index] = (elem**20)*63134
    return std_arr

def operate_on_np_arr():
    std_arr = list(range(0,1000000))
    np_arr = np.asarray(std_arr)
    np_arr = (np_arr**20)*63134
    return np_arr

print('std', timeit.timeit(setup=a_setup, stmt=std_fx, number=80000000))
print('par', timeit.timeit(setup=a_setup, stmt=parallel_fx, number=80000000))
# operate_on_np_arr()
# operate_on_std_array()
The timeit docs show that the statement you pass in is supposed to execute something, but the statements you pass in just define functions; otherwise, 80000000 trials on a 1-million-element array would take far longer than they did.
Other issues you have in your test:
np_arr = (np_arr**20)*63134 may create a copy of np_arr, but your Python list equivalent only mutates an existing array.
Numpy math is different than Python math. 100**20 in Python returns a huge number because Python has unbounded-length integers, but Numpy uses C-style fixed-length integers that overflow. (In general, you have to imagine doing the operation in C when you use Numpy because other unintuitive things may apply, like garbage in uninitialized arrays.)
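To illustrate the first point, here is a minimal sketch of how the benchmark could actually call the function (reusing operate_on_np_arr from the question, with a much smaller number since each repetition now does real work):
import timeit

setup = '''
import numpy as np

def operate_on_np_arr():
    np_arr = np.asarray(list(range(0, 1000000)))
    return (np_arr ** 20) * 63134
'''
# stmt now calls the function instead of merely defining it
print(timeit.timeit(stmt='operate_on_np_arr()', setup=setup, number=10))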
Here's a test where I modify both in place, multiplying then dividing by 31 each time so the values don't change over time or overflow:
import numpy as np
import timeit

std_arr = list(range(0,100000))
np_arr = np.array(std_arr)
np_arr_vec = np.vectorize(lambda n: (n * 31) / 31)

def operate_on_std_array():
    for index,elem in enumerate(std_arr):
        std_arr[index] = elem * 31
        std_arr[index] = elem / 31
    return std_arr

def operate_on_np_arr():
    np_arr_vec(np_arr)
    return np_arr

import time

def test_time(f):
    count = 100
    start = time.time()
    for i in range(count):
        f()
    dur = time.time() - start
    return dur

print(test_time(operate_on_std_array))
print(test_time(operate_on_np_arr))
Results:
3.0798873901367188 # standard array time
2.221336841583252 # np array time
Edit: As @user2357112 pointed out, the proper Numpy way to do it is this:
def operate_on_np_arr():
    global np_arr
    np_arr *= 31
    np_arr //= 31  # integer division, not double
    return np_arr
Makes it much faster. I see 0.1248 seconds.
Here are some timings using the IPython %%timeit magic to initialize the lists and/or arrays, so the results focus on the calculations:
In [103]: %%timeit alist = list(range(10000))
     ...: for i,e in enumerate(alist):
     ...:     alist[i] = (e*3)*20
     ...:
4.13 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [104]: %%timeit arr = np.arange(10000)
...: z = (arr*3)*20
...:
20.6 µs ± 439 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [105]: %%timeit alist = list(range(10000))
...: z = [(e*3)*20 for e in alist]
...:
...:
1.71 ms ± 2.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Looking at the effect of array creation times:
In [106]: %%timeit alist = list(range(10000))
...: arr = np.array(alist)
...: z = (arr*3)*20
...:
...:
1.01 ms ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Ok, the calculation isn't the same. If I use **3 instead, all times are about 2x larger. Same relative relations.

Pandas: improve running time looping over string contains substring

I got a Pandas dataframe which contains a column with pretty long strings (let's say URL_paths) and a list of unique substrings (reference list). For every row in my dataframe, I want to determine the corresponding reference element in my list. Hence, if the URL in a given row is for example abcd1234, and one of the reference values is cd123, then I want to add cd123 as reference to my dataframe, to categorize this row/URL.
I got my code working (see the example below), but it's pretty slow, due to a for loop (I guess) that I can't get rid of. I have the feeling that my code can be much faster, but I can't think of a way to improve it.
How can I improve running time?
See working example below:
import string
import secrets
import pandas as pd
import time
from random import randint

n_ref = 100
n_target = 1000000

## Build reference Series, and target dataframe
reference = pd.Series(''.join(secrets.choice(string.ascii_uppercase + string.digits)
                              for _ in range(randint(10, 19)))
                      for _ in range(n_ref))
target = pd.Series(reference.sample(n = n_target, replace = True)).reset_index().iloc[:,1]

dfTarget = pd.DataFrame({
    'target' : target,
    'pre-string' : pd.Series(''.join(secrets.choice(string.ascii_uppercase + string.digits)
                                     for _ in range(randint(1, 10)))
                             for _ in range(n_target)),
    'post-string' : pd.Series(''.join(secrets.choice(string.ascii_uppercase + string.digits)
                                      for _ in range(randint(1, 10)))
                              for _ in range(n_target)),
    'reference' : pd.Series()})

dfTarget['target_combined'] = dfTarget[['pre-string', 'target', 'post-string']].apply(lambda x: ''.join(x), axis=1)

## Fill in reference column
## Loop over references and return reference in reference column
start_time = time.time()
for x in reference:
    dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] = x
print("--- %s seconds ---" % (time.time() - start_time))
Out: 42.60... seconds
On my machine, I see a 17x improvement using pd.Series.apply:
reference_set = set(reference)

def calculator(x):
    return next((i for i in reference_set if i in x), None)

dfTarget['reference'] = dfTarget['target_combined'].apply(calculator)
But for optimal performance, see @unutbu's solution.
Here is an approach that is roughly 4.3 times faster:
RegEx pattern:
In [23]: pat = '.*({}).*'.format(reference.str.cat(sep='|'))
In [24]: pat
Out[24]: '.*(J6BUVB2BRDLL3IR9S1J|ZOXS91UK513RR18YREI|92KWUFKOK4G9XJAHIBJ|PMEH6N96091AK9XCA5J|3CICA38SDIXLFVED74I|V48OJCY2DS|LX8KGGBORWP6A|7H
V3NN71MU|JMA2K7QSHK72X|CNAOYI3C8T|NZE9SFKPYX|EU9K88XA29YATWR|SB871PEZ7TOPCG8|ZPP76BSDULM8|3QHLISVYEBWH|ST8VOI959D8YPCZ0|02BW83KYG3TEPWMOP|TG
I3P5QZC988GNM8FI0|GJG9MC18G5TU1TIDQB6|V7V5ZZJ5W7O|51KMJ07HEBIX|27GPT3B9DLY|O8KSR85BUB6WBKRC|ZKUEEFX5JFRE0IFRN0|FH8CUWHDETQ5TXWHSS1|N77FTB9VG
LK|JS4RUUQLD7IFP|3R45N7LOY1BZ8RR6O|JY3RXZ0OTC|YJQYOO03G0N7H7E56D|RVJ2VFNK6T7P30|GKPGAK6WAQ2QCAU6H3|7XNJ7A24CHWO1PK|1DVD5G1AE3I40|9F7CCWKHMMF
MBYD18|FWPEUWOWNK2SXR36SG|VTE64VCRY5|YGM8TT19EZTX|GKJYM3QS9ONTERQY1O0|KWMB1TMQTWMC6QCY|JS9SY7W5HI0KK|WNSHPK9KNEP77B|7EIS883NUXSO5Q6|K3HL2UYW
458LCBOSL|XI1FRVGHN0IL0F53CK4|F4HL7GKMOL2Q4Y13|IAXPAA4OX2J1X1|SXPLPYVB6EFSN4U5ZW|5L947F08PX8UW|IONNAOC26A|VQVHXHGYP8634|509ALPOKABO|SUJA66H2
DS7UOXFV|3GYIZATSZAXF8283SZO|A5612XI7X3N4|IH3RB3640D23Q28O|MH0YD83OELSI|RIFFPNRIV0XCY|Y0CXWE6GZPQ3FKH|WSCWR598Z8GBW9G|7C9O59EIA23POSI|UG4D5H
AAOYU5E|F249VSIILZ6KXDQSX|06XZSJHWSM|X01Y9AZ2W5V8HZ|1JLPWMPRGRFWIK|3ZVBSLEQ8DO|WMLKKETELHC|WDPHDS7A7XN7|6X4O4AE2IB3OS|V5J5HWO9RO19ZW2LGT|MK9
P8D9N8V4AJZB|0VT48C38I4T1V6S|R987QUQBTPRHCT7QWA4|D4XXBMCYWQ1172OY|ZUY1O565D2W5GSAL8|V8AR792X1K5UL9DLCKV|CXYK6IQWK3MUC3CO|6X7B6240VC9YL|4QV2D
13ZY15A9D5M1H|WJ7HOMK2FNBZZ6N2Z|QCOWSA3RLR|81I6Z0I5GM|KRD9Y1H3E2WEY9710Q|0161MNQHKEC30E8UI|HGB4XB0QDVHM4H92|RWD6L6EZJUSRK|6U9WOE3YVYKY31K8Q0
K|KCXWHL43B16MRQ1|EO330WAPN7XMX4|VYUX5W2NN277W09NMDB|J8EXE4YIMN0FB|SHE8D14C5A3X|PMPYKSY2FVXFR4Y8X3W|G3YU894U5QGOOM3Z|58J37WJPJBOC7QNKV|NE9WE
JSRXTYFXYZ0TBI|7UPR5XSVOJ244HHZ|N0QZCN6NADW|W2CTEUISOHUY).*'
Replacement:
dfTarget['reference'] = dfTarget['target_combined'].str.replace(pat, r'\1')
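One caveat worth noting: on newer pandas releases (the default of the regex parameter of Series.str.replace changed to False in pandas 2.0), this pattern-based replacement needs the flag passed explicitly:
dfTarget['reference'] = dfTarget['target_combined'].str.replace(pat, r'\1', regex=True)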
Timing against a 10,000-row DF:
In [25]: %%timeit
...: dfTarget['reference'] = dfTarget['target_combined'].str.replace(pat, r'\1')
...:
617 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [26]: %%timeit
...: [dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] for x in reference]
...:
1.96 s ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [27]: %%timeit
...: for x in reference:
...: dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] = x
...:
2.64 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [28]: 2.64/0.617
Out[28]: 4.278768233387359
In [29]: 2.64/1.96
Out[29]: 1.3469387755102042
