Go through every row of a dataframe without iteration - Python

This is my sample data (Inventory is based on a Product):
Customer  Product  Quantity  Inventory
1         A        100       800
2         A        1000      800
3         A        700       800
4         A        50        800
5         B        20        100
6         B        50        100
7         B        40        100
8         B        30        100
Code required to create this data:
import pandas as pd

data = {
    'Customer': [1, 2, 3, 4, 5, 6, 7, 8],
    'Product': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Quantity': [100, 1000, 700, 50, 20, 50, 40, 30],
    'Inventory': [800, 800, 800, 800, 100, 100, 100, 100]
}
df = pd.DataFrame(data)
I need to add a new column, Available to Promise, which is calculated by subtracting the order quantity from the previous available-to-promise value (starting from the Inventory); the subtraction only happens if the previously available quantity is at least the order quantity, otherwise the value carries over unchanged.
Here is my expected output:
Customer  Product  Quantity  Inventory  Available to Promise
1         A        100       800        700   (800 - 100 = 700)
2         A        1000      800        700   (1000 greater than 700, so same value)
3         A        700       800        0     (700 - 700 = 0)
4         A        50        800        0     (50 greater than 0)
5         B        20        100        80    (100 - 20 = 80)
6         B        50        100        30    (80 - 50 = 30)
7         B        40        100        30    (40 greater than 30)
8         B        30        100        0     (30 - 30 = 0)
I have achieved this using a for loop with iterrows in pandas.
This is my code:
import numpy as np

master_df = df[['Product', 'Inventory']].drop_duplicates()
master_df['free'] = df['Inventory']
df['available_to_promise'] = np.NaN

for i, row in df.iterrows():
    if i % 1000 == 0:
        print(i)
    try:
        available = master_df[row['Product'] == master_df['Product']]['free'].reset_index(drop=True).iloc[0]
        if available - row['Quantity'] >= 0:
            df.at[i, 'available_to_promise'] = available - row['Quantity']
            a = master_df.loc[row['Product'] == master_df['Product']].reset_index()['index'].iloc[0]
            master_df.at[a, 'free'] = available - row['Quantity']
        else:
            df.at[i, 'available_to_promise'] = available
    except Exception as e:
        print(i)
        print(e)

print(df.columns)
df = df.fillna(0)
Because the for loop is so slow in Python, this loop takes a very long time on large inputs and my AWS Lambda function is failing.
Can you help me optimize this code by introducing a better alternative to this loop that can execute in a few seconds?

I am not sure it is simple to write vectorized and performant code that replicates the desired logic.
However, it is relatively simple to write it in a way that it is easy to accelerate with Numba.
Firstly, let us write your code as a (pure) function of the dataframe, returning the values to eventually put in df["Available to Promise"].
Then it is easy to incorporate its result into the original dataframe with:
df["Available to Promise"] = calc_avail_OP(df)
The OP's code, save for exception handling and printing (and incorporation into the original dataframe as just discussed), is equivalent to the following:
import numpy as np
import pandas as pd


def calc_avail_OP(df):
    temp_df = df[["Product", "Inventory"]].drop_duplicates()
    temp_df["free"] = df["Inventory"]
    result = np.zeros(len(df), dtype=df["Inventory"].dtype)
    for i, row in df.iterrows():
        available = (
            temp_df[row["Product"] == temp_df["Product"]]["free"]
            .reset_index(drop=True)
            .iloc[0]
        )
        if available - row["Quantity"] >= 0:
            result[i] = available - row["Quantity"]
            a = (
                temp_df.loc[row["Product"] == temp_df["Product"]]
                .reset_index()["index"]
                .iloc[0]
            )
            temp_df.at[a, "free"] = available - row["Quantity"]
        else:
            result[i] = available
    return result
Now, if the input is sorted so that the unique products appear consecutively, the same can be achieved with a few scalar temporary variables on native NumPy objects, and this can be effectively accelerated with Numba:
import numba as nb


@nb.njit
def _calc_avail_nb(products, quantities, stocks):
    n = len(products)
    avails = np.empty(n, dtype=stocks.dtype)
    last_product = products[0]
    avail = stocks[0]
    for i in range(n):
        if products[i] != last_product:
            last_product = products[i]
            avail = stocks[i]
        qty = quantities[i]
        if avail >= qty:
            avail -= qty
        avails[i] = avail
    return avails


def calc_avail_nb(df):
    return _calc_avail_nb(
        df["Product"].to_numpy(dtype="U"),
        df["Quantity"].to_numpy(),
        df["Inventory"].to_numpy()
    )
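As a minimal usage sketch (assuming the column names from the question, and using a stable sort so that each product's rows appear consecutively, as this function requires):

# Hedged usage sketch: sort so that rows of the same product are consecutive,
# then compute the new column on the sorted frame.
df_sorted = df.sort_values("Product", kind="stable").reset_index(drop=True)
df_sorted["Available to Promise"] = calc_avail_nb(df_sorted)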
If the input is not guaranteed to be sorted, one could keep track of inventory information with a dict():
import numba as nb


@nb.njit
def _calc_avail_dict_nb(products, quantities, stocks):
    inventory = {products[0]: stocks[0]}
    n = len(products)
    avails = np.empty(n, dtype=stocks.dtype)
    for i in range(n):
        product = products[i]
        avail = inventory.setdefault(product, stocks[i])
        qty = quantities[i]
        if avail >= qty:
            avail -= qty
            inventory[product] = avail
        avails[i] = avail
    return avails


def calc_avail_dict_nb(df):
    return _calc_avail_dict_nb(
        df["Product"].to_numpy(dtype="U"),
        df["Quantity"].to_numpy(),
        df["Inventory"].to_numpy()
    )
The following includes a comparison with some approaches from the other answers:
a generator-based approach (based on @Vitalizzare's answer):

def stock(val):
    s = val
    q = yield
    while True:
        s = s - q if s >= q else s
        q = yield s


def exaust_stock(df):
    st = stock(df.iloc[0]['Inventory']).send
    st(None)
    return df['Quantity'].apply(st)


def calc_avail_gen(df):
    return (
        df
        .groupby('Product')
        .apply(exaust_stock)
        .reset_index(level=0, drop=True)
        .to_numpy()
    )
another Numba-accelerated approach (based on @NathanFurnal's answer):

@nb.njit
def _calc_avail_grouped_nb(quant, inv):
    stock = inv[0]
    n = len(quant)
    out = np.zeros((n,), dtype=np.int_)
    for i in range(n):
        if stock > 0 and quant[i] <= stock:
            stock -= quant[i]
            out[i] = stock
        else:
            out[i] = stock
    return out


def calc_avail_grouped_nb(df):
    return (
        df
        .groupby('Product')
        .apply(lambda x: _calc_avail_grouped_nb(x['Quantity'].to_numpy(), x['Inventory'].to_numpy()))
        .explode()
        .to_numpy(dtype=np.int_)
    )
The tests indicate that, while all approaches produce the same results, calc_avail_nb() and calc_avail_dict_nb() provide a speed increase of roughly 200x over the original on the test input.
data = {
    'Customer': [1, 2, 3, 4, 5, 6, 7, 8],
    'Product': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Quantity': [100, 1000, 700, 50, 20, 50, 40, 30],
    'Inventory': [800, 800, 800, 800, 100, 100, 100, 100]
}
df = pd.DataFrame(data)

funcs = calc_avail_OP, calc_avail_nb, calc_avail_dict_nb, calc_avail_gen, calc_avail_grouped_nb
base = funcs[0](df)
timings = {}
n = len(df)
timings[n] = []
for func in funcs:
    res = func(df)
    is_good = np.allclose(base, res)
    timed = %timeit -n 8 -r 8 -q -o func(df)
    timing = timed.best * 1e6
    timings[n].append(timing if is_good else None)
    print(f"{func.__name__:>24} {is_good!s:5} {timing:10.3f} µs {timings[n][0] / timing:5.1f}x")
# calc_avail_OP            True   11699.373 µs   1.0x
# calc_avail_nb            True      52.821 µs 221.5x
# calc_avail_dict_nb       True      57.198 µs 204.5x
# calc_avail_gen           True    3360.806 µs   3.5x
# calc_avail_grouped_nb    True    1099.665 µs  10.6x
Similar tests on larger inputs seem to point to an even larger speed gain.
The timings are computed with the following:
import string
import random


def gen_df(n, m=None, max_stock=None):
    if not m:
        m = 2 + n // 16
    if not max_stock:
        max_stock = n
    k = n.bit_length()
    inventory = {
        "".join(
            random.choices(string.ascii_letters, k=random.randint(1, 2 + k))
        ): random.randint(max_stock // 2, max_stock)
        for _ in range(m)
    }
    products = random.choices(list(inventory.keys()), k=n)
    return pd.DataFrame(
        {
            "Customer": np.random.randint(1, int(1.1 * max_stock), n),
            "Product": products,
            "Quantity": np.random.randint(1, int(1.1 * max_stock), n),
            "Inventory": [inventory[product] for product in products],
        }
    )


np.random.seed(0)
random.seed(0)

timings = {}
for i in range(3, 18, 3):
    n = 2 ** i
    print(f"i={i}, n={n}")
    df = gen_df(n)
    base = funcs[0](df)
    timings[n] = []
    for func in funcs:
        res = func(df)
        is_good = np.allclose(base, res)
        timed = %timeit -n 1 -r 1 -q -o func(df)
        timing = timed.best * 1e3
        timings[n].append(timing if is_good else None)
        print(f"{func.__name__:>24} {is_good!s:5} {timing:10.3f} ms {timings[n][0] / timing:5.1f}x")
and plotted with:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(data=timings, index=[func.__name__ for func in funcs]).transpose()
df.plot(marker='o', xlabel='Input Size / #', ylabel='Best timing / µs', figsize=(6, 4))
fig = plt.gcf()
fig.patch.set_facecolor('white')
df = pd.DataFrame(data=timings, index=[func.__name__ for func in funcs]).transpose()
df = df[[funcs[0].__name__]].to_numpy() / df
df.plot(marker='o', xlabel='Input Size / #', ylabel='Speed increase / %x', figsize=(6, 4))
fig = plt.gcf()
fig.patch.set_facecolor('white')
to obtain, respectively, a plot of the best timings and a plot of the speed increase as a function of input size (plots not reproduced here).

How to use generators to apply functions with intermediate states to pandas data frames
def stock(val):
    s = val
    q = yield
    while True:
        q = yield (s := s - q) if s >= q else s


def exaust_stock(df):
    st = stock(df.iloc[0]['Inventory']).send
    st(None)
    return df['Quantity'].apply(st)


df['Stock'] = (
    df
    .groupby('Product')
    .apply(exaust_stock)
    .reset_index(level=0, drop=True)
)
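For illustration, here is a minimal walk-through of how such a generator keeps its running state between .send() calls; the numbers follow product B from the sample data above:

# Prime the generator, then feed quantities one at a time (product B starts at 100).
g = stock(100)
g.send(None)       # advance to the first `yield`
print(g.send(20))  # 80  (100 - 20)
print(g.send(50))  # 30  (80 - 50)
print(g.send(40))  # 30  (40 > 30, so the stock is unchanged)
print(g.send(30))  # 0   (30 - 30)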

You're doing a lot of manipulating of the two dataframes you have, and I think that might be the cause of the speed issue.
I would use a dict to keep track of the available inventory.
I'm actually curious what the speed comparison looks like if you apply this to a large dataframe (see my edit below for that).
import pandas as pd

data = {
    'Customer': [1, 2, 3, 4, 5, 6, 7, 8],
    'Product': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Quantity': [100, 1000, 700, 50, 20, 50, 40, 30],
    'Inventory': [800, 800, 800, 800, 100, 100, 100, 100]
}
df = pd.DataFrame(data)
df["Available to Promise"] = 0

# create availability tracking
available = {k: None for k in set(df.Product)}

for idx, row in df.iterrows():
    if available[row.Product] is None:
        if row.Quantity <= row.Inventory:
            available[row.Product] = row.Inventory - row.Quantity
            df.at[idx, "Available to Promise"] = available[row.Product]
        else:
            df.at[idx, "Available to Promise"] = row.Inventory
            available[row.Product] = 0
    elif available[row.Product] > 0:
        if row.Quantity <= available[row.Product]:
            available[row.Product] = available[row.Product] - row.Quantity
            df.at[idx, "Available to Promise"] = available[row.Product]
        else:
            df.at[idx, "Available to Promise"] = available[row.Product]
            available[row.Product] = 0

print(df)
output

   Customer Product  Quantity  Inventory  Available to Promise
0         1       A       100        800                   700
1         2       A      1000        800                   700
2         3       A       700        800                     0
3         4       A        50        800                     0
4         5       B        20        100                    80
5         6       B        50        100                    30
6         7       B        40        100                    30
7         8       B        30        100                     0
EDIT:
After norok2's comment below I did a speed comparison.
Adjusted code with timeit included:
import pandas as pd

data = {
    'Customer': [1, 2, 3, 4, 5, 6, 7, 8],
    'Product': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Quantity': [100, 1000, 700, 50, 20, 50, 40, 30],
    'Inventory': [800, 800, 800, 800, 100, 100, 100, 100]
}
df = pd.DataFrame(data)
df["Available to Promise"] = 0


def do_stuff(df):
    available = {k: None for k in set(df.Product)}
    for idx, row in df.iterrows():
        if available[row.Product] is None:
            if row.Quantity <= row.Inventory:
                available[row.Product] = row.Inventory - row.Quantity
                df.at[idx, "Available to Promise"] = available[row.Product]
            else:
                df.at[idx, "Available to Promise"] = row.Inventory
                available[row.Product] = 0
        elif available[row.Product] > 0:
            if row.Quantity <= available[row.Product]:
                available[row.Product] = available[row.Product] - row.Quantity
                df.at[idx, "Available to Promise"] = available[row.Product]
            else:
                df.at[idx, "Available to Promise"] = available[row.Product]
                available[row.Product] = 0


import timeit
import statistics

timings = []
for _ in range(1000):
    timings.append(timeit.timeit("do_stuff(df)", setup="from __main__ import do_stuff, df", number=1))
print(f"Mine:\n Mean: {statistics.mean(timings)}\n Min: {min(timings)}\n Max: {max(timings)}")
I then used the calc_avail_OP(df) function that norok2 created, and timed it in the same way as I did mine, with this piece of code:
import timeit
import statistics

timings = []
for _ in range(1000):
    timings.append(timeit.timeit("calc_avail_OP(df)", setup="from __main__ import calc_avail_OP, df", number=1))
print(f"OP's:\n Mean: {statistics.mean(timings)}\n Min: {min(timings)}\n Max: {max(timings)}")
output for both
Mine:
Mean: 0.0003488006000061432
Min: 0.0003338999995321501
Max: 0.001021500000206288
OP's:
Mean: 0.0037762733999825286
Min: 0.003618599999754224
Max: 0.005391000000599888
So, with %timeit I get these results:
%timeit -n 16 -r 16 do_stuff(df)
365 µs ± 19.5 µs per loop (mean ± std. dev. of 16 runs, 16 loops each)
%timeit -n 16 -r 16 calc_avail_nb(df)
30 µs ± 13.2 µs per loop (mean ± std. dev. of 16 runs, 16 loops each)
%timeit -n 16 -r 16 calc_avail_OP(df)
3.95 ms ± 258 µs per loop (mean ± std. dev. of 16 runs, 16 loops each)
norok2's is still the fastest; on a larger df the difference becomes very obvious.
With a 100k-row dataframe:
%timeit -n 16 -r 16 do_stuff(df)
3.26 s ± 153 ms per loop (mean ± std. dev. of 16 runs, 16 loops each)
%timeit -n 16 -r 16 calc_avail_nb(df)
82.3 ms ± 15.9 ms per loop (mean ± std. dev. of 16 runs, 16 loops each)
%timeit -n 16 -r 16 calc_avail_OP(df)
39.3 s ± 3.01 s per loop (mean ± std. dev. of 16 runs, 16 loops each)

I have a bit of a solution. It's not incredibly powerful because it still uses loops, but it has the advantage of being simpler and easy to optimize.
import pandas as pd
import numpy as np


def func_no_jit(quant, inv):
    stock = inv[0]
    n = len(quant)
    out = np.zeros((n,), dtype=np.int64)
    for i in range(n):
        if stock > 0 and quant[i] <= stock:
            stock -= quant[i]
            out[i] = stock
        else:
            out[i] = stock
    return out


res = (
    df.groupby('Product')
    .apply(lambda x: func_no_jit(x['Quantity'].values, x['Inventory'].values))
    .explode()
)
df["Promise"] = res
A possible improvement is to use Numba. When I used it, I could cut the time the process took in half for a dataframe of 100_000 elements; it has no real effect on small dataframes, though.
from numba import njit


@njit
def func(quant, inv):
    stock = inv[0]
    n = len(quant)
    out = np.zeros((n,), dtype=np.int64)
    for i in range(n):
        if stock > 0 and quant[i] <= stock:
            stock -= quant[i]
            out[i] = stock
        else:
            out[i] = stock
    return out
See the results here:
In [11]: big_df
Out[11]:
Customer Product Quantity Inventory
0 0 I 328 282
1 1 A 668 874
2 2 H 51 496
3 3 A 561 526
4 4 H 143 421
... ... ... ... ...
99995 99995 D 43 392
99996 99996 F 162 540
99997 99997 C 565 902
99998 99998 H 633 936
99999 99999 A 731 810
[100000 rows x 4 columns]
big_df.sort_values('Product', inplace=True) # Sort to keep track of indices
In [12]: %timeit big_df.groupby('Product').apply(lambda x: func_no_jit(x["Quantity"].values, x["Inventory"].values)).explode()
33.3 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]: %timeit big_df.groupby('Product').apply(lambda x: func(x["Quantity"].values, x["Inventory"].values)).explode()
12.5 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
OP's solution on the 100_000 elements dataframe:
product_set = set(big_df.Product)
available = dict(zip(list(product_set), [None for _ in range(len(product_set))]))


def op_func():
    big_df['Available to Promise'] = 0
    for idx, row in big_df.iterrows():
        if available[row.Product] is None:
            if row.Quantity <= row.Inventory:
                available[row.Product] = row.Inventory - row.Quantity
                big_df.at[idx, "Available to Promise"] = available[row.Product]
            else:
                big_df.at[idx, "Available to Promise"] = row.Inventory
                available[row.Product] = 0
        elif available[row.Product] > 0:
            if row.Quantity <= available[row.Product]:
                available[row.Product] = available[row.Product] - row.Quantity
                big_df.at[idx, "Available to Promise"] = available[row.Product]
            else:
                big_df.at[idx, "Available to Promise"] = available[row.Product]
                available[row.Product] = 0
In [11]: %timeit op_func()
3.53 s ± 433 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


What is the fastest way to calculate / create powers of ten?

If you provide the (integer) power as the input, what is the fastest way to create the corresponding power of ten? Here are four alternatives I could come up with, and the fastest seems to be using an f-string:
from functools import partial
from time import time

import numpy as np


def fstring(power):
    return float(f'1e{power}')


def asterisk(power):
    return 10**power


methods = {
    'fstring': fstring,
    'asterisk': asterisk,
    'pow': partial(pow, 10),
    'np.pow': partial(np.power, 10, dtype=float)
}
# "dtype=float" is necessary because otherwise it will raise:
# ValueError: Integers to negative integer powers are not allowed.
# see https://stackoverflow.com/a/43287598/5472354

powers = [int(i) for i in np.arange(-10000, 10000)]

for name, method in methods.items():
    start = time()
    for i in powers:
        method(i)
    print(f'{name}: {time() - start}')
Results:
fstring: 0.008975982666015625
asterisk: 0.5190775394439697
pow: 0.4863283634185791
np.pow: 0.046906232833862305
I guess the f-string approach is the fastest because nothing is actually calculated, though it only works for integer powers of ten, whereas the other methods are more complicated operations that also work with any real number as the base and power. So is the f-string actually the best way to go about it?
You're comparing apples to oranges here. 10 ** n computes an integer (when n is non-negative), whereas float(f'1e{n}') computes a floating-point number. Those won't take the same amount of time, but they solve different problems so it doesn't matter which one is faster.
But it's worse than that, because there is the overhead of calling a function, which is included in your timing for all of your alternatives, but only some of them actually involve calling a function. If you write 10 ** n then you aren't calling a function, but if you use partial(pow, 10) then you have to call it as a function to get a result. So you're not actually comparing the speed of 10 ** n fairly.
Instead of rolling your own timing code, use the timeit library, which is designed for doing this properly. The results are in seconds for 1,000,000 repetitions (by default), or equivalently they are the average time in microseconds for one repetition.
Here's a comparison for computing integer powers of 10:
>>> from timeit import timeit
>>> timeit('10 ** n', setup='n = 500')
1.09881673199925
>>> timeit('pow(10, n)', setup='n = 500')
1.1821871869997267
>>> timeit('f(n)', setup='n = 500; from functools import partial; f = partial(pow, 10)')
1.1401332350014854
And here's a comparison for computing floating-point powers of 10: note that computing 10.0 ** 500 or 1e500 is pointless because the result is simply an OverflowError or inf.
>>> timeit('10.0 ** n', setup='n = 200')
0.12391662099980749
>>> timeit('pow(10.0, n)', setup='n = 200')
0.17336435099969094
>>> timeit('f(n)', setup='n = 200; from functools import partial; f = partial(pow, 10.0)')
0.18887039500077663
>>> timeit('float(f"1e{n}")', setup='n = 200')
0.44305286100097874
>>> timeit('np.power(10.0, n, dtype=float)', setup='n = 200; import numpy as np')
1.491982370000187
>>> timeit('f(n)', setup='n = 200; from functools import partial; import numpy as np; f = partial(np.power, 10.0, dtype=float)')
1.6273324920002779
So the fastest of these options in both cases is the obvious one: 10 ** n for integers and 10.0 ** n for floats.
Another contender for the floats case, precompute all possible nonzero finite results and look them up:
0.0 if n < -323 else f[n] if n < 309 else inf
The preparation:
f = [10.0 ** i for i in [*range(309), *range(-323, 0)]]
inf = float('inf')
Benchmark with kaya3's exponent n = 200 as well as n = -200 as negative exponent with nonzero result and n = -5000 / n = 5000 as medium-size negative/positive exponents from your original range:
n = 200
487 ns 487 ns 488 ns float(f'1e{n}')
108 ns 108 ns 108 ns 10.0 ** n
128 ns 129 ns 130 ns 10.0 ** n if n < 309 else inf
72 ns 73 ns 73 ns 0.0 if n < -323 else f[n] if n < 309 else inf
n = -200
542 ns 544 ns 545 ns float(f'1e{n}')
109 ns 109 ns 110 ns 10.0 ** n
130 ns 130 ns 131 ns 10.0 ** n if n < 309 else inf
76 ns 76 ns 76 ns 0.0 if n < -323 else f[n] if n < 309 else inf
n = -5000
291 ns 291 ns 291 ns float(f'1e{n}')
99 ns 99 ns 100 ns 10.0 ** n
119 ns 120 ns 120 ns 10.0 ** n if n < 309 else inf
34 ns 34 ns 34 ns 0.0 if n < -323 else f[n] if n < 309 else inf
n = 5000
292 ns 293 ns 293 ns float(f'1e{n}')
error error error 10.0 ** n
33 ns 33 ns 33 ns 10.0 ** n if n < 309 else inf
53 ns 53 ns 53 ns 0.0 if n < -323 else f[n] if n < 309 else inf
Benchmark code (Try it online!):
from timeit import repeat

solutions = [
    "float(f'1e{n}')",
    '10.0 ** n',
    '10.0 ** n if n < 309 else inf',
    '0.0 if n < -323 else f[n] if n < 309 else inf',
]

for n in 200, -200, -5000, 5000:
    print(f'{n = }')
    setup = f'''
n = {n}
f = [10.0 ** i for i in [*range(309), *range(-323, 0)]]
inf = float('inf')
'''
    for solution in solutions:
        try:
            ts = sorted(repeat(solution, setup))[:3]
        except OverflowError:
            ts = [None] * 3
        print(*('%3d ns ' % (t * 1e3) if t else ' error ' for t in ts), solution)
    print()
You could try it with a logarithmic approach using math.log and math.exp but the range of values will be limited (which you can handle with try/except).
This seems to be just as fast as fstring if not a bit faster.
import math

ln10 = math.log(10)


def mPow(power):
    try:
        return math.exp(ln10 * power)
    except:
        return 0 if power < 0 else math.inf
[EDIT] Given that we are constrained by the capabilities of floats, we might as well just prepare a list with the 617 possible powers of 10 (that can be held in a float) and get the answer by index:
import math

minP10, maxP10 = -308, 308
powersOf10 = [10**i for i in range(maxP10 + 1)] + [10**i for i in range(minP10, 0)]


def tenPower(power):
    if power < minP10: return 0
    if power > maxP10: return math.inf
    return powersOf10[power]  # negative indexes for powers -308...-1
This one is definitely faster than fstring
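As a quick sanity check of the boundary behavior (the expected values follow directly from the lookup above):

print(tenPower(5))     # 100000
print(tenPower(-3))    # 0.001
print(tenPower(-400))  # 0 (below the smallest representable power)
print(tenPower(400))   # inf (above the largest representable power)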

Most efficient way of testing whether a value is in a list in pandas

I have a dataframe, loaded from a CSV, that I am testing various aspects of. These tests all seem to go along the lines of either "is this column like this regex" or "is this column in this list".
So I have the dataframe a bit like this:
import pandas as pd
df = pd.DataFrame({'full_name': ['Mickey Mouse', 'M Mouse', 'Mickey RudeWord Mouse'], 'nationality': ['Mouseland', 'United States', 'Canada']})
I am generating new columns based on that content like so:
def full_name_metrics(full_name):
    lst_rude_words = ['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA']
    # metric of whether full name has less than two distinct elements
    full_name_less_than_2_parts = len(full_name.split(' ')) < 2
    # metric of whether full_name contains an initial
    full_name_with_initial = 1 in [len(x) for x in full_name.split(' ')]
    # metric of whether name matches an offensive word
    full_name_with_offensive_word = any(item in full_name.upper().split(' ') for item in lst_rude_words)
    return pd.Series([full_name_less_than_2_parts, full_name_with_initial, full_name_with_offensive_word])


df[['full_name_less_than_2_parts', 'full_name_with_initial', 'full_name_with_offensive_word']] = df.apply(lambda x: full_name_metrics(x['full_name']), axis=1)
   full_name              nationality    full_name_less_than_2_parts  full_name_with_initial  full_name_with_offensive_word
0  Mickey Mouse           Mouseland      False                        False                   False
1  M Mouse                United States  False                        True                    False
2  Mickey RudeWord Mouse  Canada         False                        False                   True
It works, but for 25k records and more of these types of controls it's taking more time than I'd like.
So is there a better way? Am I better off having the rude word list as another dataframe, or am I barking up the wrong tree?
If it is the list checking that you want to speed up - then probably the Series.str.contains method can help -
lst_rude_words_as_str = '|'.join(lst_rude_words)
df['full_name_with_offensive_word'] = df['full_name'].str.upper().str.contains(lst_rude_words_as_str, regex=True)
Here's how the %timeit looks for me:
def func_in_list(full_name):
    '''Your function - just removed the other two columns.'''
    lst_rude_words = ['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA']
    full_name_with_offensive_word = any(item in full_name.upper().split(' ') for item in lst_rude_words)
    return full_name_with_offensive_word


%timeit df.apply(lambda x: func_in_list(x['full_name']), axis=1)  # 3.15 ms
%timeit df['full_name'].str.upper().str.contains(lst_rude_words_as_str, regex=True)  # 505 µs
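One caveat, in case the word list ever contains regex metacharacters: it may be safer to escape each entry before joining, e.g. with re.escape (a small sketch, not part of the original answer):

import re

lst_rude_words = ['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA']
# Escape each word so any regex metacharacters are treated literally.
lst_rude_words_as_str = '|'.join(map(re.escape, lst_rude_words))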
EDIT
I added the other two columns that I'd left out before - here's the full code
import pandas as pd

df = pd.DataFrame({'full_name': ['Mickey Mouse', 'M Mouse', 'Mickey Rudeword Mouse']})


def df_metrics(input_df):
    input_df['full_name_less_than_2_parts'] = input_df['full_name'].str.split().map(len) < 2
    input_df['full_name_with_initial'] = input_df['full_name'].str.split(expand=True)[0].map(len) == 1
    lst_rude_words = ['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA']
    lst_rude_words_as_str = '|'.join(lst_rude_words)
    input_df['full_name_with_offensive_word'] = input_df['full_name'].str.upper().str.contains(lst_rude_words_as_str, regex=True)
    return input_df
RESULTS
For the 3 row dataset - there is no difference between the two functions -
%timeit df_metrics(df)
#3.5 ms ± 67.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df[['full_name_less_than_2_parts', 'full_name_with_initial', 'full_name_with_offensive_word']] = df.apply(lambda x: full_name_metrics(x['full_name']), axis=1)
#3.7 ms ± 59.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But when I increase the size of the dataframe - then there is some speed up
df_big = pd.concat([df] * 10000)
%timeit df_metrics(df_big)
#135 ms ± 7.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df_big[['full_name_less_than_2_parts', 'full_name_with_initial', 'full_name_with_offensive_word']] = df_big.apply(lambda x: full_name_metrics(x['full_name']), axis=1)
#11.5 s ± 173 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm going to answer piecemeal...
All your ops rely on splitting the full name column on whitespace so do it once:
>>> stuff = df.full_name.str.split()
For name less than two parts:
>>> df['full_name_less_than_2_parts'] = stuff.agg(len) < 2
>>> df
full_name nationality full_name_less_than_2_parts
0 Mickey Mouse Mouseland False
1 M Mouse United States False
2 Mickey RudeWord Mouse Canada False
Name with only an initial:
explode the split Series, find the items with length one, then group by the index to consolidate the exploded Series and aggregate with any.
>>> q = (stuff.explode().agg(len) == 1)
>>> df['full_name_with_initial'] = q.groupby(q.index).agg('any')
>>> df
full_name nationality full_name_less_than_2_parts full_name_with_initial
0 Mickey Mouse Mouseland False False
1 M Mouse United States False True
2 Mickey RudeWord Mouse Canada False False
Check for undesirable words.
Make a regular expression pattern from the undesirable words list and use it as an argument to the .str.contains method.
>>> rude_words =r'|'.join( ['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA'])
>>> df['rude'] = df.full_name.str.upper().str.contains(rude_words,regex=True)
>>> df
full_name nationality full_name_less_than_2_parts full_name_with_initial rude
0 Mickey Mouse Mouseland False False False
1 M Mouse United States False True False
2 Mickey RudeWord Mouse Canada False False True
Put them together in a function (mainly to do a timing test) that returns three Series.
import pandas as pd
from timeit import Timer

df = pd.DataFrame(
    {
        "full_name": ["Mickey Mouse", "M Mouse", "Mickey RudeWord Mouse"] * 8000,
        "nationality": ["Mouseland", "United States", "Canada"] * 8000,
    }
)

rude_words = r'|'.join(['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA'])


def f(df):
    rude_words = r'|'.join(['RUDEWORD', 'ANOTHERRUDEWORD', 'YOUGETTHEIDEA'])
    stuff = df.full_name.str.split()
    s1 = stuff.agg(len) < 2
    stuff = (stuff.explode().agg(len) == 1)
    s2 = stuff.groupby(stuff.index).agg('any')
    s3 = df.full_name.str.upper().str.contains(rude_words, regex=True)
    return s1, s2, s3


t = Timer('f(df)', 'from __main__ import pd,df,f')
print(t.timeit(1))  # <--- 0.12 seconds on my computer

x, y, z = f(df)
df.loc[:, 'full_name_less_than_2_parts'] = x
df.loc[:, 'full_name_with_initial'] = y
df.loc[:, 'rude'] = z
# print(df.head(100))

Elegant way to get range of values from two columns using pandas

I have a dataframe like the one shown below (run the full code below).
df1 = pd.DataFrame({'person_id': [11, 21, 31, 41, 51],
                    'date_birth': ['05/29/1967', '01/21/1957', '7/27/1959', '01/01/1961', '12/31/1961']})
df1 = df1.melt('person_id', value_name='date_birth')
df1['birth_dates'] = pd.to_datetime(df1['date_birth'])
df_ranges = df1.assign(until_prev_year_days=(df1['birth_dates'].dt.dayofyear - 1),
                       until_next_year_days=((df1['birth_dates'] + pd.offsets.YearEnd(0)) - df1['birth_dates']).dt.days)
f = {'until_prev_year_days': 'min', 'until_next_year_days': 'min'}
min_days = df_ranges.groupby('person_id', as_index=False).agg(f)
min_days.columns = ['person_id', 'no_days_to_prev_year', 'no_days_to_next_year']
df_offset = pd.merge(df_ranges[['person_id', 'birth_dates']], min_days, on='person_id', how='inner')
See below for what I tried to get the range:
df_offset['range_to_shift'] = "[" + (-1 * df_offset['no_days_to_prev_year']).map(str) + "," + df_offset['no_days_to_next_year'].map(str) + "]"
Though my approach works, I would like to know whether there is a better and more elegant way to do the same.
Please note that for the values from no_days_to_prev_year, we have to prefix a minus sign.
I expect my output to be as shown below.
Use DataFrame.mul along with DataFrame.to_numpy:
cols = ['no_days_to_prev_year', 'no_days_to_next_year']
df_offset['range_to_shift'] = df_offset[cols].mul([-1, 1]).to_numpy().tolist()
Result:
# print(df_offset)
person_id birth_dates no_days_to_prev_year no_days_to_next_year range_to_shift
0 11 1967-05-29 148 216 [-148, 216]
1 21 1957-01-21 20 344 [-20, 344]
2 31 1959-07-27 207 157 [-207, 157]
3 41 1961-01-01 0 364 [0, 364]
4 51 1961-12-31 364 0 [-364, 0]
timeit performance results:
df_offset.shape
(50000, 5)
%%timeit -n100
cols = ['no_days_to_prev_year', 'no_days_to_next_year']
df_offset['range_to_shift'] = df_offset[cols].mul([-1, 1]).to_numpy().tolist()
15.5 ms ± 464 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
IIUC, you can use zip to create your list of ranges:
df = pd.DataFrame({'person_id': [11, 21, 31, 41, 51],
                   'date_birth': ['05/29/1967', '01/21/1957', '7/27/1959', '01/01/1961', '12/31/1961']})
df['date_birth'] = pd.to_datetime(df['date_birth'], format="%m/%d/%Y")
df["day_to_prev"] = df['date_birth'].dt.dayofyear - 1
df["day_to_next"] = (pd.offsets.YearEnd(0) + df['date_birth'] - df["date_birth"]).dt.days
df["range_to_shift"] = [[-x, y] for x, y in zip(df["day_to_prev"], df["day_to_next"])]
print(df)
person_id date_birth day_to_prev day_to_next range_to_shift
0 11 1967-05-29 148 216 [-148, 216]
1 21 1957-01-21 20 344 [-20, 344]
2 31 1959-07-27 207 157 [-207, 157]
3 41 1961-01-01 0 364 [0, 364]
4 51 1961-12-31 364 0 [-364, 0]

Python and Pandas - Moving Average Crossover

There is a pandas DataFrame object with some stock data. The SMAs are moving averages calculated from the previous 45/15 days.
Date Price SMA_45 SMA_15
20150127 102.75 113 106
20150128 103.05 100 106
20150129 105.10 112 105
20150130 105.35 111 105
20150202 107.15 111 105
20150203 111.95 110 105
20150204 111.90 110 106
I want to find all dates, when SMA_15 and SMA_45 intersect.
Can it be done efficiently using Pandas or Numpy? How?
EDIT:
What I mean by 'intersection':
The data row where:
the long SMA(45) value was bigger than the short SMA(15) value for longer than the short SMA period (15) and then became smaller, or
the long SMA(45) value was smaller than the short SMA(15) value for longer than the short SMA period (15) and then became bigger.
I'm taking a crossover to mean when the SMA lines -- as functions of time --
intersect, as depicted on this investopedia
page.
Since the SMAs represent continuous functions, there is a crossing when,
for a given row, (SMA_15 is less than SMA_45) and (the previous SMA_15 is
greater than the previous SMA_45) -- or vice versa.
In code, that could be expressed as
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
            | ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
If we change your data to
Date Price SMA_45 SMA_15
20150127 102.75 113 106
20150128 103.05 100 106
20150129 105.10 112 105
20150130 105.35 111 105
20150202 107.15 111 105
20150203 111.95 110 105
20150204 111.90 110 106
so that there are crossings,
then
import pandas as pd

df = pd.read_table('data', sep='\s+')

previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
            | ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
crossing_dates = df.loc[crossing, 'Date']
print(crossing_dates)
yields
1 20150128
2 20150129
Name: Date, dtype: int64
The following method gives similar results, but takes less time than the previous method:
df['position'] = df['SMA_15'] > df['SMA_45']
df['pre_position'] = df['position'].shift(1)
df.dropna(inplace=True) # dropping the NaN values
df['crossover'] = np.where(df['position'] == df['pre_position'], False, True)
Time taken for this approach: 2.7 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time taken for previous approach: 3.46 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As an alternative to unutbu's answer, something like the following can also be done to find the indices where SMA_15 crosses SMA_45.
diff = df['SMA_15'] < df['SMA_45']
diff_forward = diff.shift(1)
crossing = np.where(abs(diff - diff_forward) == 1)[0]
print(crossing)
>>> [1,2]
print(df.iloc[crossing])
>>>
Date Price SMA_15 SMA_45
1 20150128 103.05 100 106
2 20150129 105.10 112 105

using numpy percentile on binned data

Suppose house sale figures are presented for a town in ranges:
< $100,000 204
$100,000 - $199,999 1651
$200,000 - $299,999 2405
$300,000 - $399,999 1972
$400,000 - $500,000 872
> $500,000 1455
I want to know which house-price bin a given percentile falls into. Is there a way of using numpy's percentile function to do this? I can do it by hand:
import numpy as np
a = np.array([204., 1651., 2405., 1972., 872., 1455.])
b = np.cumsum(a)/np.sum(a) * 100
q = 75
len(b[b <= q])
4 # ie bin $300,000 - $399,999
But is there a way to use np.percentile instead?
You were almost there:
cs = np.cumsum(a)
bin_idx = np.searchsorted(cs, np.percentile(cs, 75))
At least for this case (and a couple others with larger a arrays), it's not any faster, though:
In [9]: %%timeit
...: b = np.cumsum(a)/np.sum(a) * 100
...: len(b[b <= 75])
...:
10000 loops, best of 3: 38.6 µs per loop
In [10]: %%timeit
....: cs = np.cumsum(a)
....: np.searchsorted(cs, np.percentile(cs, 75))
....:
10000 loops, best of 3: 125 µs per loop
So unless you want to check for multiple percentiles, I'd stick with what you have.
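For completeness, if you did want to check several percentiles at once, searchsorted maps them all to bin indices in one vectorized call (a small sketch using the same cumsum-based approach; the percentile values are just an example):

import numpy as np

a = np.array([204., 1651., 2405., 1972., 872., 1455.])
cs = np.cumsum(a)

# Map several percentiles to bin indices in one call.
qs = [25, 50, 75, 90]
bin_idx = np.searchsorted(cs, np.percentile(cs, qs))
print(bin_idx)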
