using numpy percentile on binned data

using numpy percentile on binned data - python

Suppose house sale figures are presented for a town in ranges:
< $100,000 204
$100,000 - $199,999 1651
$200,000 - $299,999 2405
$300,000 - $399,999 1972
$400,000 - $500,000 872
> $500,000 1455
I want to know which house-price bin a given percentile falls. Is there a way of using numpy's percentile function to do this? I can do it by hand:
import numpy as np
a = np.array([204., 1651., 2405., 1972., 872., 1455.])
b = np.cumsum(a)/np.sum(a) * 100
q = 75
len(b[b <= q])
4 # ie bin $300,000 - $399,999
But is there a way to use np.percentile instead?

You were almost there:
cs = np.cumsum(a)
bin_idx = np.searchsorted(cs, np.percentile(cs, 75))
At least for this case (and a couple others with larger a arrays), it's not any faster, though:
In [9]: %%timeit
...: b = np.cumsum(a)/np.sum(a) * 100
...: len(b[b <= 75])
...:
10000 loops, best of 3: 38.6 µs per loop
In [10]: %%timeit
....: cs = np.cumsum(a)
....: np.searchsorted(cs, np.percentile(cs, 75))
....:
10000 loops, best of 3: 125 µs per loop
So unless you want to check for multiple percentiles, I'd stick with what you have.

Related

go through every rows of a dataframe without iteration

this is my sample data :
Inventory is based on a Product
Customer Product Quantity Inventory
1 A 100 800
2 A 1000 800
3 A 700 800
4 A 50 800
5 B 20 100
6 B 50 100
7 B 40 100
8 B 30 100
Code require to create this data :
data = {
'Customer':[1,2,3,4,5,6,7,8],
'Product':['A','A','A','A','B','B','B','B'],
'Quantity':[100,1000,700,50,20,50,40,30],
'Inventory':[800,800,800,800,100,100,100,100]
}
df = pd.DataFrame(data)
I need to get a new column which is known Available to promise which is calculated by subtracting the quantity from previously available to promise and calculation only happens if the previously available inventory is greater than the order quantity .
here is my expected output:
Customer Product Quantity Inventory Available to Promise
1 A 100 800 700 (800-100 = 700)
2 A 1000 800 700 (1000 greater than 700 so same value)
3 A 700 800 0 (700-700 = 0)
4 A 50 800 0 (50 greater than 0)
5 B 20 100 80 (100-20 = 80)
6 B 50 100 30 (80-50 = 30)
7 B 40 100 30 (40 greater than 30)
8 B 30 100 0 (30 - 30 = 0)
i have achieved this using for loop and itterows in python pandas
this is my code:
master_df = df[['Product','Inventory']].drop_duplicates()
master_df['free'] = df['Inventory']
df['available_to_promise']=np.NaN
for i,row in df.iterrows():
if i%1000==0:
print(i)
try:
available = master_df[row['Product']==master_df['Product']]['free'].reset_index(drop=True).iloc[0]
if available-row['Quantity']>=0:
df.at[i,'available_to_promise']=available-row['Quantity']
a = master_df.loc[row['Product']==master_df['Product']].reset_index()['index'].iloc[0]
master_df.at[a,'free'] = available-row['Quantity']
else:
df.at[i,'available_to_promise']=available
except Exception as e:
print(i)
print(e)
print((df.columns))
df = df.fillna(0)
Due to for loop is so slow in python, when there is a huge data input this loop take so much time to execute thus my aws lambda function is failing
Can you guys help me to optimize this code by introducing a better alternative to this loop which can execute in a few seconds ?

I am not sure it is simple to write a vectorized and performant code that replicates the desired logic.
However, it is relatively simple to write it in a way that it is easy to accelerate with Numba.
Firstly, let us write your code as a (pure) function of the dataframe, returning the values to eventually put in df["Available to Promise"].
Eventually, it is easy to inglobate its result into the original dataframe with:
df["Available to Promise"] = calc_avail_OP(df)
The OP's code, save for exception handling and printing (and incorporation into the original dataframe as just discussed) is equivalent to the following:
import numpy as np
import pandas as pd
def calc_avail_OP(df):
temp_df = df[["Product", "Inventory"]].drop_duplicates()
temp_df["free"] = df["Inventory"]
result = np.zeros(len(df), dtype=df["Inventory"].dtype)
for i, row in df.iterrows():
available = (
temp_df[row["Product"] == temp_df["Product"]]["free"]
.reset_index(drop=True)
.iloc[0]
)
if available - row["Quantity"] >= 0:
result[i] = available - row["Quantity"]
a = (
temp_df.loc[row["Product"] == temp_df["Product"]]
.reset_index()["index"]
.iloc[0]
)
temp_df.at[a, "free"] = available - row["Quantity"]
else:
result[i] = available
return result
Now, if the input is sorted so that the unique products appear consecutively, the same can be achieved with a few scalar temporary variables on native NumPy objects, and this can be effectively accelerated with Numba:
import numba as nb
#nb.njit
def _calc_avail_nb(products, quantities, stocks):
n = len(products)
avails = np.empty(n, dtype=stocks.dtype)
last_product = products[0]
avail = stocks[0]
for i in range(n):
if products[i] != last_product:
last_product = products[i]
avail = stocks[i]
qty = quantities[i]
if avail >= qty:
avail -= qty
avails[i] = avail
return avails
def calc_avail_nb(df):
return _calc_avail_nb(
df["Product"].to_numpy(dtype="U"),
df["Quantity"].to_numpy(),
df["Inventory"].to_numpy()
)
If the input is not guaranteed to be sorted, one could keep track of inventory information with a dict():
import numba as nb
#nb.njit
def _calc_avail_dict_nb(products, quantities, stocks):
inventory = {products[0]: stocks[0]}
n = len(products)
avails = np.empty(n, dtype=stocks.dtype)
for i in range(n):
product = products[i]
avail = inventory.setdefault(products[i], stocks[i])
qty = quantities[i]
if avail >= qty:
avail -= qty
inventory[products[i]] = avail
avails[i] = avail
return avails
def calc_avail_dict_nb(df):
return _calc_avail_dict_nb(
df["Product"].to_numpy(dtype="U"),
df["Quantity"].to_numpy(),
df["Inventory"].to_numpy()
)
The following text include a comparison with some approaches from other answers:
a generator-based approach (based on #Vitalizzare's answer):
def stock(val):
s = val
q = yield
while True:
s = s - q if s >= q else s
q = yield s
def exaust_stock(df):
st = stock(df.iloc[0]['Inventory']).send
st(None)
return df['Quantity'].apply(st)
def calc_avail_gen(df):
return (
df
.groupby('Product')
.apply(exaust_stock)
.reset_index(level=0, drop=True)
.to_numpy()
)
another Numba-accelerated approach (based on #NathanFurnal's answer):
#nb.njit
def _calc_avail_grouped_nb(quant, inv):
stock = inv[0]
n = len(quant)
out = np.zeros((n,), dtype=np.int_)
for i in range(n):
if stock > 0 and quant[i] <= stock:
stock -= quant[i]
out[i] = stock
else:
out[i] = stock
return out
def calc_avail_grouped_nb(df):
return (
df
.groupby('Product')
.apply(lambda x: _calc_avail_grouped_nb(x['Quantity'].to_numpy(), x['Inventory'].to_numpy()))
.explode()
.to_numpy(dtype=np.int_)
)
The test indicate that while they do provide the same results, calc_avail_nb() and calc_avail_dict_nb() provide a speed increase of ~200x on the test input.
data = {
'Customer':[1,2,3,4,5,6,7,8],
'Product':['A','A','A','A','B','B','B','B'],
'Quantity':[100,1000,700,50,20,50,40,30],
'Inventory':[800,800,800,800,100,100,100,100]
}
df = pd.DataFrame(data)
funcs = calc_avail_OP, calc_avail_nb, calc_avail_dict_nb, calc_avail_gen, calc_avail_grouped_nb
base = funcs[0](df)
timings = {}
n = len(df)
timings[n] = []
for func in funcs:
res = func(df)
is_good = np.allclose(base, res)
timed = %timeit -n 8 -r 8 -q -o func(df)
is_good = True
timing = timed.best * 1e6
timings[n].append(timing if is_good else None)
print(f"{func.__name__:>24} {is_good!s:5} {timing:10.3f} µs {timings[n][0] / timing:5.1f}x")
# calc_avail_OP True 11699.373 µs 1.0x
# calc_avail_nb True 52.821 µs 221.5x
# calc_avail_dict_nb True 57.198 µs 204.5x
# calc_avail_gen True 3360.806 µs 3.5x
# calc_avail_grouped_nb True 1099.665 µs 10.6x
Similar tests on larger inputs seem to point to an even larger speed gain.
The timings are computed with the following:
import string
import random
def gen_df(n, m=None, max_stock=None):
if not m:
m = 2 + n // 16
if not max_stock:
max_stock = n
k = n.bit_length()
inventory = {
"".join(
random.choices(string.ascii_letters, k=random.randint(1, 2 + k))
): random.randint(max_stock // 2, max_stock)
for _ in range(m)
}
products = random.choices(list(inventory.keys()), k=n)
return pd.DataFrame(
{
"Customer": np.random.randint(1, int(1.1 * max_stock), n),
"Product": products,
"Quantity": np.random.randint(1, int(1.1 * max_stock), n),
"Inventory": [inventory[product] for product in products],
}
)
np.random.seed(0)
random.seed(0)
timings = {}
for i in range(3, 18, 3):
n = 2 ** i
print(f"i={i}, n={n}")
df = gen_df(n)
base = funcs[0](df)
timings[n] = []
for func in funcs:
res = func(df)
is_good = np.allclose(base, res)
timed = %timeit -n 1 -r 1 -q -o func(df)
is_good = True
timing = timed.best * 1e3
timings[n].append(timing if is_good else None)
print(f"{func.__name__:>24} {is_good!s:5} {timing:10.3f} ms {timings[n][0] / timing:5.1f}x")
and plotted with:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(data=timings, index=[func.__name__ for func in funcs]).transpose()
df.plot(marker='o', xlabel='Input Size / #', ylabel='Best timing / µs', figsize=(6, 4))
fig = plt.gcf()
fig.patch.set_facecolor('white')
df = pd.DataFrame(data=timings, index=[func.__name__ for func in funcs]).transpose()
df = df[[funcs[0].__name__]].to_numpy() / df
df.plot(marker='o', xlabel='Input Size / #', ylabel='Speed increase / %x', figsize=(6, 4))
fig = plt.gcf()
fig.patch.set_facecolor('white')
to obtain, respectively:
and

How to use generators to apply functions with intermediate states to pandas data frames
def stock(val):
s = val
q = yield
while True:
q = yield (s:=s-q) if s >= q else s
def exaust_stock(df):
st = stock(df.iloc[0]['Inventory']).send
st(None)
return df['Quantity'].apply(st)
df['Stock'] = (
df
.groupby('Product')
.apply(exaust_stock)
.reset_index(level=0, drop=True)
)

You're doing a lot of manipulating of the two dataframes you have, and I think that might be the cause of the speed issue.
I would use a dict to keep track of the available inventory.
I'm actually curious what the speed comparison if you apply this on a large dataframe... (see my edit below for that)
import pandas as pd
data = {
'Customer':[1,2,3,4,5,6,7,8],
'Product':['A','A','A','A','B','B','B','B'],
'Quantity':[100,1000,700,50,20,50,40,30],
'Inventory':[800,800,800,800,100,100,100,100]
}
df = pd.DataFrame(data)
df["Available to Promise"] = 0
# create availability tracking
available = {k: None for k in set(df.Product)}
for idx, row in df.iterrows():
if available[row.Product] == None:
if row.Quantity <= row.Inventory:
available[row.Product] = row.Inventory - row.Quantity
df.at[idx, "Available to Promise"] = available[row.Product]
else:
df.at[idx, "Available to Promise"] = row.Inventory
available[row.Product] = 0
elif available[row.Product] > 0:
if row.Quantity <= available[row.Product]:
available[row.Product] = available[row.Product] - row.Quantity
df.at[idx, "Available to Promise"] = available[row.Product]
else:
df.at[idx, "Available to Promise"] = available[row.Product]
available[row.Product] = 0
print(df)
output
Customer Product Quantity Inventory Available to Promise
0 1 A 100 800 700
1 2 A 1000 800 700
2 3 A 700 800 0
3 4 A 50 800 0
4 5 B 20 100 80
5 6 B 50 100 30
6 7 B 40 100 30
7 8 B 30 100 0
EDIT:
After norok2's comment below I did a speed comparison.
adjusted code with timeit included
import pandas as pd
data = {
'Customer':[1,2,3,4,5,6,7,8],
'Product':['A','A','A','A','B','B','B','B'],
'Quantity':[100,1000,700,50,20,50,40,30],
'Inventory':[800,800,800,800,100,100,100,100]
}
df = pd.DataFrame(data)
df["Available to Promise"] = 0
def do_stuff(df):
available = {k: None for k in set(df.Product)}
for idx, row in df.iterrows():
if available[row.Product] == None:
if row.Quantity <= row.Inventory:
available[row.Product] = row.Inventory - row.Quantity
df.at[idx, "Available to Promise"] = available[row.Product]
else:
df.at[idx, "Available to Promise"] = row.Inventory
available[row.Product] = 0
elif available[row.Product] > 0:
if row.Quantity <= available[row.Product]:
available[row.Product] = available[row.Product] - row.Quantity
df.at[idx, "Available to Promise"] = available[row.Product]
else:
df.at[idx, "Available to Promise"] = available[row.Product]
available[row.Product] = 0
import timeit
import statistics
timings=[]
for _ in range(1000):
timings.append(timeit.timeit("do_stuff(df)", setup="from __main__ import do_stuff, df", number=1))
print(f"Mine:\n Mean: {statistics.mean(timings)}\n Min: {min(timings)}\n Max: {max(timings)}")
I then used the function calc_avail_OP(df, label="Avail") that norok2 created, and timed it in the same way as I did mine, with this piece of code:
import timeit
import statistics
timings=[]
for _ in range(1000):
timings.append(timeit.timeit("calc_avail_OP(df)", setup="from __main__ import calc_avail_OP, df", number=1))
print(f"OP's:\n Mean: {statistics.mean(timings)}\n Min: {min(timings)}\n Max: {max(timings)}")
output for both
Mine:
Mean: 0.0003488006000061432
Min: 0.0003338999995321501
Max: 0.001021500000206288
OP's:
Mean: 0.0037762733999825286
Min: 0.003618599999754224
Max: 0.005391000000599888
so, with %timeit I get this result:
%timeit -n 16 -r 16 do_stuff(df)
365 µs ± 19.5 µs per loop (mean ± std. dev. of 16 runs, 16 loops each)
%timeit -n 16 -r 16 calc_avail_nb(df)
30 µs ± 13.2 µs per loop (mean ± std. dev. of 16 runs, 16 loops each)
%timeit -n 16 -r 16 calc_avail_OP(df)
3.95 ms ± 258 µs per loop (mean ± std. dev. of 16 runs, 16 loops each)
norok2's is still the fastest, on a larger df the difference becomes very obvious
with a 100k row dataframe:
%timeit -n 16 -r 16 do_stuff(df)
3.26 s ± 153 ms per loop (mean ± std. dev. of 16 runs, 16 loops each)
%timeit -n 16 -r 16 calc_avail_nb(df)
82.3 ms ± 15.9 ms per loop (mean ± std. dev. of 16 runs, 16 loops each)
%timeit -n 16 -r 16 calc_avail_OP(df)
39.3 s ± 3.01 s per loop (mean ± std. dev. of 16 runs, 16 loops each)

I have a bit of a solution, it's not incredibly powerful because it still uses loops but it has the advantage of being simpler and easy to optimize.
import pandas as pd
import numpy as np
def func_no_jit(quant, inv):
stock = inv[0]
n = len(quant)
out = np.zeros((n,), dtype=np.int64)
for i in range(n):
if stock > 0 and quant[i] <= stock:
stock -= quant[i]
out[i] = stock
else:
out[i] = stock
return out
res = (
df.groupby('Product')
.apply(lambda x: func(x['Quantity'].values, x['Inventory'].values))
.explode()
)
df["Promise"] = res
A possible solution is to use numba, When I used it, I could cut the time the process took in half, for a dataframe of 100_000 elements, it has no real effect on small dataframes though.
from numba import njit
#njit
def func(quant, inv):
stock = inv[0]
n = len(quant)
out = np.zeros((n,), dtype=np.int64)
for i in range(n):
if stock > 0 and quant[i] <= stock:
stock -= quant[i]
out[i] = stock
else:
out[i] = stock
return out
See the results here:
In [11]: big_df
Out[11]:
Customer Product Quantity Inventory
0 0 I 328 282
1 1 A 668 874
2 2 H 51 496
3 3 A 561 526
4 4 H 143 421
... ... ... ... ...
99995 99995 D 43 392
99996 99996 F 162 540
99997 99997 C 565 902
99998 99998 H 633 936
99999 99999 A 731 810
[100000 rows x 4 columns]
big_df.sort_values('Product', inplace=True) # Sort to keep track of indices
In [12]: %timeit big_df.groupby('Product').apply(lambda x : func_no_jit(x["Quantity"].values
...: ,x["Inventory"].values)).explode()
33.3 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [13]: %timeit big_df.groupby('Product').apply(lambda x : func(x["Quantity"].values,x["Inv
...: entory"].values)).explode()
12.5 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
OP's solution on the 100_000 elements dataframe:
product_set = set(big_df.Product)
available = dict(zip(list(product_set), [None for _ in range(len(product_set))]))
def op_func():
big_df['Available to Promise'] = 0
for idx, row in big_df.iterrows():
if available[row.Product] == None:
if row.Quantity <= row.Inventory:
available[row.Product] = row.Inventory - row.Quantity
big_df.at[idx, "Available to Promise"] = available[row.Product]
else:
big_df.at[idx, "Available to Promise"] = row.Inventory
available[row.Product] = 0
elif available[row.Product] > 0:
if row.Quantity <= available[row.Product]:
available[row.Product] = available[row.Product] - row.Quantity
big_df.at[idx, "Available to Promise"] = available[row.Product]
else:
big_df.at[idx, "Available to Promise"] = available[row.Product]
available[row.Product] = 0
In [11]: %timeit op_func()
3.53 s ± 433 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

What is the fastest way to filter a numpy array by another array?

I have one pretty large np.array a (10,000-50,000 elements, each coordinates (x,y)) and another larger np.array b (100,000-200,000 coordinates). I need to remove as quickly as possible the elements of a that are not present in b and leave only the elements of a that are present in b. All coordinates are integers. For example:
a = np.array([[2,5],[6,3],[4,2],[1,4]])
b = np.array([[2,7],[4,2],[1,5],[6,3]])
Desired output:
a
>> [6,3],[4,2]
What is the fastest way of doing this for arrays of the size I mentioned?
I am OK with solutions that use any other packages or imports too (e.g., converting to a base Python list or set, using Pandas, etc.) besides those within Numpy.

This appears to depend a lot on the array size and "sparseness" (likely due to hash table magic).
The answer from Get intersecting rows across two 2D numpy arrays is the so_8317022 function.
The takeaways seem to be (on my machine) that:
the Pandas approach has an edge with large sparse sets
set intersection is very, very fast with small array sizes (though admittedly it returns a set, not a numpy array)
the other Numpy answer can be faster than set intersection with larger array sizes.
from collections import defaultdict
import numpy as np
import pandas as pd
import timeit
import matplotlib.pyplot as plt
def pandas_merge(a, b):
return pd.DataFrame(a).merge(pd.DataFrame(b)).to_numpy()
def set_intersection(a, b):
return set(map(tuple, a.tolist())) & set(map(tuple, b.tolist()))
def so_8317022(a, b):
nrows, ncols = a.shape
dtype = {
"names": ["f{}".format(i) for i in range(ncols)],
"formats": ncols * [a.dtype],
}
C = np.intersect1d(a.view(dtype), b.view(dtype))
return C.view(a.dtype).reshape(-1, ncols)
def test_fn(f, a, b):
number, time_taken = timeit.Timer(lambda: f(a, b)).autorange()
return number / time_taken
def test(size, max_coord):
a = np.random.default_rng().integers(0, max_coord, size=(size, 2))
b = np.random.default_rng().integers(0, max_coord, size=(size, 2))
return {fn.__name__: test_fn(fn, a, b) for fn in (pandas_merge, set_intersection, so_8317022)}
series = []
datas = defaultdict(list)
for size in (100, 1000, 10000, 100000):
for max_coord in (50, 500, 5000):
print(size, max_coord)
series.append((size, max_coord))
for fn, result in test(size, max_coord).items():
datas[fn].append(result)
print("size", "sparseness", "func", "ops/sec")
for fn, values in datas.items():
for (size, max_coord), value in zip(series, values):
print(size, max_coord, fn, int(value))
The results on my machine are
size
sparseness
func
ops/sec
100
50
pandas_merge
895
100
500
pandas_merge
777
100
5000
pandas_merge
708
1000
50
pandas_merge
740
1000
500
pandas_merge
751
1000
5000
pandas_merge
660
10000
50
pandas_merge
513
10000
500
pandas_merge
460
10000
5000
pandas_merge
436
100000
50
pandas_merge
11
100000
500
pandas_merge
61
100000
5000
pandas_merge
49
100
50
set_intersection
42281
100
500
set_intersection
44050
100
5000
set_intersection
43584
1000
50
set_intersection
3693
1000
500
set_intersection
3234
1000
5000
set_intersection
3900
10000
50
set_intersection
453
10000
500
set_intersection
287
10000
5000
set_intersection
300
100000
50
set_intersection
47
100000
500
set_intersection
13
100000
5000
set_intersection
13
100
50
so_8317022
8927
100
500
so_8317022
9736
100
5000
so_8317022
7843
1000
50
so_8317022
698
1000
500
so_8317022
746
1000
5000
so_8317022
765
10000
50
so_8317022
89
10000
500
so_8317022
48
10000
5000
so_8317022
57
100000
50
so_8317022
10
100000
500
so_8317022
3
100000
5000
so_8317022
3

Not sure if this is the fastest way to do it, but if you turn it into a pandas index you can use its intersection method. Since it is using low-level c-code under the hood, the intersection step is probably pretty fast, but converting it into a pandas index may take some time
import numpy as np
import pandas as pd
a = np.array([[2, 5], [6, 3], [4, 2], [1, 4]])
b = np.array([[2, 7], [4, 2], [1, 5], [6, 3]])
df_a = pd.DataFrame(a).set_index([0, 1])
df_b = pd.DataFrame(b).set_index([0, 1])
intersection = df_a.index.intersection(df_b.index)
Result look like this
print(intersection.values)
[(6, 3) (4, 2)]
EDIT2:
Out of curiosity I made a comparison between the methods. Now with a larger list of indices. I have compared my first index method with a slightly improved method which does not require to create a dataframe first, but immediately creates the index, and then with the dataframe merge method proposed as well.
This is the code
from random import randint, seed
import time
import numpy as np
import pandas as pd
seed(0)
n_tuple = 100000
i_min = 0
i_max = 10
a = [[randint(i_min, i_max), randint(i_min, i_max)] for _ in range(n_tuple)]
b = [[randint(i_min, i_max), randint(i_min, i_max)] for _ in range(n_tuple)]
np_a = np.array(a)
np_b = np.array(b)
def method0(a_array, b_array):
index_a = pd.DataFrame(a_array).set_index([0, 1]).index
index_b = pd.DataFrame(b_array).set_index([0, 1]).index
return index_a.intersection(index_b).to_numpy()
def method1(a_array, b_array):
index_a = pd.MultiIndex.from_arrays(a_array.T)
index_b = pd.MultiIndex.from_arrays(b_array.T)
return index_a.intersection(index_b).to_numpy()
def method2(a_array, b_array):
df_a = pd.DataFrame(a_array)
df_b = pd.DataFrame(b_array)
return df_a.merge(df_b).to_numpy()
def method3(a_array, b_array):
set_a = {(_[0], _[1]) for _ in a_array}
set_b = {(_[0], _[1]) for _ in b_array}
return set_a.intersection(set_b)
for cnt, intersect in enumerate([method0, method1, method2, method3]):
t0 = time.time()
if cnt < 3:
intersection = intersect(np_a, np_b)
else:
intersection = intersect(a, b)
print(f"method{cnt}: {time.time() - t0}")
The output looks like:
method0: 0.1439347267150879
method1: 0.14012742042541504
method2: 4.740894317626953
method3: 0.05933070182800293
Conclusion: the merge method of dataframes (method2) is about 50 times slower than using intersections on the index. The version based on multiindex (method1) is only slightly faster than method0 (my first proposal)
EDIT2: As proposed by the comment of #AKX: if you do not use numpy but pure list and sets, you can again gain a speed up of about a factor 3. But it is clear you should not used the merge method.

Apply column operations to get a new column in pandas

i have the data like this
ID 8-Jan 15-Jan 22-Jan 29-Jan 5-Feb 12-Feb LowerBound UpperBound
001 618 720 645 573 503 447 - -
002 62 80 67 94 81 65 - -
003 32 10 23 26 26 31 - -
004 22 13 1 28 19 25 - -
005 9 7 9 6 8 4 - -
I want to create two columns with lower bounds and upper bounds for each product using 95% confidence intervals. I know manual way of writing a function which loops through each product ID
import numpy as np
import scipy as sp
import scipy.stats
# Method copied from http://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data
def mean_confidence_interval(data, confidence=0.95):
a = 1.0*np.array(data)
n = len(a)
m, se = np.mean(a), scipy.stats.sem(a)
h = se * sp.stats.t._ppf((1+confidence)/2., n-1)
return m-h, m+h
Is there an efficient way in Pandas or (one liner kind of thing) ?

Of course, you want df.apply. Note you need to modify mean_confidence_interval to return pd.Series([m-h, m+h]).
df[['LowerBound','UpperBound']] = df.apply(mean_confidence_interval, axis=1)

Standard error of the mean is pretty straightforward to calculate so you can easily vectorize this:
import scipy.stats as ss
df.mean(axis=1) + ss.t.ppf(0.975, df.shape[1]-1) * df.std(axis=1)/np.sqrt(df.shape[1])
will give you the upper bound. Use - ss.t.ppf for the lower bound.
Also, pandas seems to have a sem method. If you have a large dataset, I don't suggest using apply over rows. It is pretty slow. Here are some timings:
df = pd.DataFrame(np.random.randn(100, 10))
%timeit df.apply(mean_confidence_interval, axis=1)
100 loops, best of 3: 18.2 ms per loop
%%timeit
dist = ss.t.ppf(0.975, df.shape[1]-1) * df.sem(axis=1)
mean = df.mean(axis=1)
mean - dist, mean + dist
1000 loops, best of 3: 598 µs per loop

Since you already created a function for calculating the confidence interval, simply apply it to each row of your data:
def mean_confidence_interval(data):
confidence = 0.95
m = data.mean()
se = scipy.stats.sem(data)
h = se * sp.stats.t._ppf((1 + confidence) / 2, data.shape[0] - 1)
return pd.Series((m - h, m + h))
interval = df.apply(mean_confidence_interval, axis=1)
interval.columns = ("LowerBound", "UpperBound")
pd.concat([df, interval],axis=1)

How to speed up pandas groupby - apply function to be comparable to R's data.table

I have data like this
location sales store
0 68 583 17
1 28 857 2
2 55 190 59
3 98 517 64
4 94 892 79
...
For each unique pair (location, store), there are 1 or more sales. I want to add a column, pcnt_sales that shows what percent of the total sales for that (location, store) pair was made up by the sale in the given row.
location sales store pcnt_sales
0 68 583 17 0.254363
1 28 857 2 0.346543
2 55 190 59 1.000000
3 98 517 64 0.272105
4 94 892 79 1.000000
...
This works, but is slow
import pandas as pd
import numpy as np
df = pd.DataFrame({'location':np.random.randint(0, 100, 10000), 'store':np.random.randint(0, 100, 10000), 'sales': np.random.randint(0, 1000, 10000)})
import timeit
start_time = timeit.default_timer()
df['pcnt_sales'] = df.groupby(['location', 'store'])['sales'].apply(lambda x: x/x.sum())
print(timeit.default_timer() - start_time) # 1.46 seconds
By comparison, R's data.table does this super fast
library(data.table)
dt <- data.table(location=sample(100, size=10000, replace=TRUE), store=sample(100, size=10000, replace=TRUE), sales=sample(1000, size=10000, replace=TRUE))
ptm <- proc.time()
dt[, pcnt_sales:=sales/sum(sales), by=c("location", "store")]
proc.time() - ptm # 0.007 seconds
How do I do this efficiently in Pandas (especially considering my real dataset has millions of rows)?

For performance you want to avoid apply. You could use transform to get the result of the groupby expanded to the original index instead, at which point a division would work at vectorized speed:
>>> %timeit df['pcnt_sales'] = df.groupby(['location', 'store'])['sales'].apply(lambda x: x/x.sum())
1 loop, best of 3: 2.27 s per loop
>>> %timeit df['pcnt_sales2'] = (df["sales"] /
df.groupby(['location', 'store'])['sales'].transform(sum))
100 loops, best of 3: 6.25 ms per loop
>>> df["pcnt_sales"].equals(df["pcnt_sales2"])
True

Python and Pandas - Moving Average Crossover

There is a Pandas DataFrame object with some stock data. SMAs are moving averages calculated from previous 45/15 days.
Date Price SMA_45 SMA_15
20150127 102.75 113 106
20150128 103.05 100 106
20150129 105.10 112 105
20150130 105.35 111 105
20150202 107.15 111 105
20150203 111.95 110 105
20150204 111.90 110 106
I want to find all dates, when SMA_15 and SMA_45 intersect.
Can it be done efficiently using Pandas or Numpy? How?
EDIT:
What I mean by 'intersection':
The data row, when:
long SMA(45) value was bigger than short SMA(15) value for longer than short SMA period(15) and it became smaller.
long SMA(45) value was smaller than short SMA(15) value for longer than short SMA period(15) and it became bigger.

I'm taking a crossover to mean when the SMA lines -- as functions of time --
intersect, as depicted on this investopedia
page.
Since the SMAs represent continuous functions, there is a crossing when,
for a given row, (SMA_15 is less than SMA_45) and (the previous SMA_15 is
greater than the previous SMA_45) -- or vice versa.
In code, that could be expressed as
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
| ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
If we change your data to
Date Price SMA_45 SMA_15
20150127 102.75 113 106
20150128 103.05 100 106
20150129 105.10 112 105
20150130 105.35 111 105
20150202 107.15 111 105
20150203 111.95 110 105
20150204 111.90 110 106
so that there are crossings,
then
import pandas as pd
df = pd.read_table('data', sep='\s+')
previous_15 = df['SMA_15'].shift(1)
previous_45 = df['SMA_45'].shift(1)
crossing = (((df['SMA_15'] <= df['SMA_45']) & (previous_15 >= previous_45))
| ((df['SMA_15'] >= df['SMA_45']) & (previous_15 <= previous_45)))
crossing_dates = df.loc[crossing, 'Date']
print(crossing_dates)
yields
1 20150128
2 20150129
Name: Date, dtype: int64

The following methods gives the similar results, but takes less time than the previous methods:
df['position'] = df['SMA_15'] > df['SMA_45']
df['pre_position'] = df['position'].shift(1)
df.dropna(inplace=True) # dropping the NaN values
df['crossover'] = np.where(df['position'] == df['pre_position'], False, True)
Time taken for this approach: 2.7 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Time taken for previous approach: 3.46 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As an alternative to the unutbu's answer, something like below can also be done to find the indices where SMA_15 crosses SMA_45.
diff = df['SMA_15'] < df['SMA_45']
diff_forward = diff.shift(1)
crossing = np.where(abs(diff - diff_forward) == 1)[0]
print(crossing)
>>> [1,2]
print(df.iloc[crossing])
>>>
Date Price SMA_15 SMA_45
1 20150128 103.05 100 106
2 20150129 105.10 112 105

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

using numpy percentile on binned data - python

Related

go through every rows of a dataframe without iteration

What is the fastest way to filter a numpy array by another array?

Apply column operations to get a new column in pandas

How to speed up pandas groupby - apply function to be comparable to R's data.table

Python and Pandas - Moving Average Crossover

Categories

Resources