I have data like this:
ID 8-Jan 15-Jan 22-Jan 29-Jan 5-Feb 12-Feb LowerBound UpperBound
001 618 720 645 573 503 447 - -
002 62 80 67 94 81 65 - -
003 32 10 23 26 26 31 - -
004 22 13 1 28 19 25 - -
005 9 7 9 6 8 4 - -
I want to create two columns with lower and upper bounds for each product using 95% confidence intervals. I know the manual way: write a function that loops through each product ID.
import numpy as np
import scipy as sp
import scipy.stats

# Method copied from http://stackoverflow.com/questions/15033511/compute-a-confidence-interval-from-sample-data
def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * sp.stats.t.ppf((1 + confidence) / 2., n - 1)  # public ppf rather than the private _ppf
    return m - h, m + h
Is there an efficient way to do this in Pandas (a one-liner kind of thing)?
Of course, you want df.apply. Note you need to modify mean_confidence_interval to return pd.Series([m-h, m+h]).
df[['LowerBound','UpperBound']] = df.apply(mean_confidence_interval, axis=1)
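A minimal sketch of that modification, assuming the row-wise calculation is restricted to the week columns so the placeholder LowerBound/UpperBound columns don't get mixed in (imports as in the question; the column list is taken from the sample data):
def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    return pd.Series([m - h, m + h])   # a Series so apply expands into two columns

value_cols = ['8-Jan', '15-Jan', '22-Jan', '29-Jan', '5-Feb', '12-Feb']
df[['LowerBound', 'UpperBound']] = df[value_cols].apply(mean_confidence_interval, axis=1)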
Standard error of the mean is pretty straightforward to calculate so you can easily vectorize this:
import scipy.stats as ss
df.mean(axis=1) + ss.t.ppf(0.975, df.shape[1]-1) * df.std(axis=1)/np.sqrt(df.shape[1])
will give you the upper bound. Use - ss.t.ppf for the lower bound.
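Spelled out as a sketch that fills both columns at once (restricted to the numeric week columns; 0.975 is the two-sided quantile for a 95% interval):
import numpy as np
import scipy.stats as ss

vals = df[['8-Jan', '15-Jan', '22-Jan', '29-Jan', '5-Feb', '12-Feb']]  # numeric columns only
n = vals.shape[1]
half_width = ss.t.ppf(0.975, n - 1) * vals.std(axis=1) / np.sqrt(n)
df['LowerBound'] = vals.mean(axis=1) - half_width
df['UpperBound'] = vals.mean(axis=1) + half_width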
Also, pandas seems to have a sem method. If you have a large dataset, I don't suggest using apply over rows. It is pretty slow. Here are some timings:
df = pd.DataFrame(np.random.randn(100, 10))
%timeit df.apply(mean_confidence_interval, axis=1)
100 loops, best of 3: 18.2 ms per loop
%%timeit
dist = ss.t.ppf(0.975, df.shape[1]-1) * df.sem(axis=1)
mean = df.mean(axis=1)
mean - dist, mean + dist
1000 loops, best of 3: 598 µs per loop
Since you already created a function for calculating the confidence interval, simply apply it to each row of your data:
def mean_confidence_interval(data):
    confidence = 0.95
    m = data.mean()
    se = scipy.stats.sem(data)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2, data.shape[0] - 1)
    return pd.Series((m - h, m + h))

interval = df.apply(mean_confidence_interval, axis=1)
interval.columns = ("LowerBound", "UpperBound")
pd.concat([df, interval], axis=1)
I have one pretty large np.array a (10,000-50,000 elements, each an (x, y) coordinate pair) and another, larger np.array b (100,000-200,000 coordinates). I need to remove, as quickly as possible, the elements of a that are not present in b and leave only the elements of a that are present in b. All coordinates are integers. For example:
a = np.array([[2,5],[6,3],[4,2],[1,4]])
b = np.array([[2,7],[4,2],[1,5],[6,3]])
Desired output:
a
>> [6,3],[4,2]
What is the fastest way of doing this for arrays of the size I mentioned?
I am OK with solutions that use any other packages or imports too (e.g., converting to a base Python list or set, using Pandas, etc.) besides those within Numpy.
This appears to depend a lot on the array size and "sparseness" (likely due to hash table magic).
The answer from Get intersecting rows across two 2D numpy arrays is the so_8317022 function.
The takeaways seem to be (on my machine) that:
- the Pandas approach has an edge with large sparse sets
- set intersection is very, very fast with small array sizes (though admittedly it returns a set, not a numpy array)
- the other Numpy answer can be faster than set intersection with larger array sizes
from collections import defaultdict
import numpy as np
import pandas as pd
import timeit
import matplotlib.pyplot as plt

def pandas_merge(a, b):
    return pd.DataFrame(a).merge(pd.DataFrame(b)).to_numpy()

def set_intersection(a, b):
    return set(map(tuple, a.tolist())) & set(map(tuple, b.tolist()))

def so_8317022(a, b):
    nrows, ncols = a.shape
    dtype = {
        "names": ["f{}".format(i) for i in range(ncols)],
        "formats": ncols * [a.dtype],
    }
    C = np.intersect1d(a.view(dtype), b.view(dtype))
    return C.view(a.dtype).reshape(-1, ncols)

def test_fn(f, a, b):
    number, time_taken = timeit.Timer(lambda: f(a, b)).autorange()
    return number / time_taken

def test(size, max_coord):
    a = np.random.default_rng().integers(0, max_coord, size=(size, 2))
    b = np.random.default_rng().integers(0, max_coord, size=(size, 2))
    return {fn.__name__: test_fn(fn, a, b) for fn in (pandas_merge, set_intersection, so_8317022)}

series = []
datas = defaultdict(list)
for size in (100, 1000, 10000, 100000):
    for max_coord in (50, 500, 5000):
        print(size, max_coord)
        series.append((size, max_coord))
        for fn, result in test(size, max_coord).items():
            datas[fn].append(result)

print("size", "sparseness", "func", "ops/sec")
for fn, values in datas.items():
    for (size, max_coord), value in zip(series, values):
        print(size, max_coord, fn, int(value))
The results on my machine are:
size    sparseness  func              ops/sec
100     50          pandas_merge      895
100     500         pandas_merge      777
100     5000        pandas_merge      708
1000    50          pandas_merge      740
1000    500         pandas_merge      751
1000    5000        pandas_merge      660
10000   50          pandas_merge      513
10000   500         pandas_merge      460
10000   5000        pandas_merge      436
100000  50          pandas_merge      11
100000  500         pandas_merge      61
100000  5000        pandas_merge      49
100     50          set_intersection  42281
100     500         set_intersection  44050
100     5000        set_intersection  43584
1000    50          set_intersection  3693
1000    500         set_intersection  3234
1000    5000        set_intersection  3900
10000   50          set_intersection  453
10000   500         set_intersection  287
10000   5000        set_intersection  300
100000  50          set_intersection  47
100000  500         set_intersection  13
100000  5000        set_intersection  13
100     50          so_8317022        8927
100     500         so_8317022        9736
100     5000        so_8317022        7843
1000    50          so_8317022        698
1000    500         so_8317022        746
1000    5000        so_8317022        765
10000   50          so_8317022        89
10000   500         so_8317022        48
10000   5000        so_8317022        57
100000  50          so_8317022        10
100000  500         so_8317022        3
100000  5000        so_8317022        3
Not sure if this is the fastest way to do it, but if you turn it into a pandas index you can use its intersection method. Since that runs low-level C code under the hood, the intersection step is probably pretty fast, but converting to a pandas index may take some time.
import numpy as np
import pandas as pd
a = np.array([[2, 5], [6, 3], [4, 2], [1, 4]])
b = np.array([[2, 7], [4, 2], [1, 5], [6, 3]])
df_a = pd.DataFrame(a).set_index([0, 1])
df_b = pd.DataFrame(b).set_index([0, 1])
intersection = df_a.index.intersection(df_b.index)
The result looks like this:
print(intersection.values)
[(6, 3) (4, 2)]
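If you need that result back as an (N, 2) NumPy array rather than an index of tuples, a small follow-up sketch:
import numpy as np
intersection_arr = np.array(list(intersection))   # list of (x, y) tuples -> shape (N, 2) integer array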
EDIT:
Out of curiosity I made a comparison between the methods, now with a larger list of indices. I compared my first index method with a slightly improved version that does not require creating a DataFrame first but immediately creates the index, and with the DataFrame merge method proposed as well.
This is the code:
from random import randint, seed
import time
import numpy as np
import pandas as pd

seed(0)

n_tuple = 100000
i_min = 0
i_max = 10

a = [[randint(i_min, i_max), randint(i_min, i_max)] for _ in range(n_tuple)]
b = [[randint(i_min, i_max), randint(i_min, i_max)] for _ in range(n_tuple)]
np_a = np.array(a)
np_b = np.array(b)

def method0(a_array, b_array):
    index_a = pd.DataFrame(a_array).set_index([0, 1]).index
    index_b = pd.DataFrame(b_array).set_index([0, 1]).index
    return index_a.intersection(index_b).to_numpy()

def method1(a_array, b_array):
    index_a = pd.MultiIndex.from_arrays(a_array.T)
    index_b = pd.MultiIndex.from_arrays(b_array.T)
    return index_a.intersection(index_b).to_numpy()

def method2(a_array, b_array):
    df_a = pd.DataFrame(a_array)
    df_b = pd.DataFrame(b_array)
    return df_a.merge(df_b).to_numpy()

def method3(a_array, b_array):
    set_a = {(_[0], _[1]) for _ in a_array}
    set_b = {(_[0], _[1]) for _ in b_array}
    return set_a.intersection(set_b)

for cnt, intersect in enumerate([method0, method1, method2, method3]):
    t0 = time.time()
    if cnt < 3:
        intersection = intersect(np_a, np_b)
    else:
        intersection = intersect(a, b)
    print(f"method{cnt}: {time.time() - t0}")
The output looks like:
method0: 0.1439347267150879
method1: 0.14012742042541504
method2: 4.740894317626953
method3: 0.05933070182800293
Conclusion: the merge method of DataFrames (method2) is roughly 30 times slower than using intersections on the index (about 4.7 s versus 0.14 s). The version based on MultiIndex (method1) is only slightly faster than method0 (my first proposal).
EDIT2: As proposed in the comment by #AKX: if you use plain lists and sets rather than numpy, you gain another speed-up of roughly a factor of 3. But it is clear you should not use the merge method.
I have a column like the one shown below:
Data
0 A
1 Av
2 Zcef
I want the desired output below, using some function like
def len_mul(a, b):
    return len(a) * len(b)
The function itself can be replaced; the desired output is:
Data A Av Zcef
A 1 2 4
Av 2 4 8
Zcef 4 8 16
I am able to do this using a for loop, but I don't want to use a for loop.
I am trying pd.crosstab, but I am stuck at aggfunc.
The len_mul function is just an example kept simple for clarity; the approach should work for an arbitrary custom function.
Using your custom function:
def len_mul(a, b):
    return len(a) * len(b)

idx = pd.MultiIndex.from_product([df['Data'], df['Data']])
df_out = pd.Series(idx.map(lambda x: len_mul(*x)), idx).unstack()
df_out
Output:
A Av Zcef
A 1 2 4
Av 2 4 8
Zcef 4 8 16
This was adapted from a #piRSquared SO post.
You can use np.outer with pd.DataFrame constructor:
lens = df['Data'].str.len()
pd.DataFrame(np.outer(lens,lens), index = df['Data'], columns=df['Data'])
Output:
Data A Av Zcef
Data
A 1 2 4
Av 2 4 8
Zcef 4 8 16
Let's take this as an elaborated comment. I think it mostly depends on your len_mul function. If you want to do exactly what is in your question, you can use a little linear algebra; in particular, the fact that multiplying an n×q matrix by a q×m matrix yields an n×m matrix.
import pandas as pd
import numpy as np

df = pd.DataFrame({"Data": ["A", "Av", "Zcef"]})

# the length of each entry
v = df["Data"].str.len().values
# v.reshape((-1, 1)) views it as a (3, 1) matrix
# v.reshape((1, -1)) views it as a (1, 3) matrix

arr = df["Data"].values
# the matrix multiplication (outer product of the lengths)
m = v.reshape((-1, 1)).dot(v.reshape((1, -1)))

# your expected output
df_out = pd.DataFrame(m,
                      columns=arr,
                      index=arr)
Update
I agree that Scott Boston's solution is good for the general case of a custom function, but I think you should look for a way to translate your function into something you can express with linear algebra.
Here are some timings:
import pandas as pd
import numpy as np
import string

alph = list(string.ascii_letters)
n = 10000
data = ["".join(np.random.choice(alph, np.random.randint(1, 10)))
        for i in range(n)]
data = sorted(list(set(data)))
df = pd.DataFrame({"Data": data})

def len_mul(a, b):
    return len(a) * len(b)
Scott Boston 1st solution
%%time
idx = pd.MultiIndex.from_product([df['Data'], df['Data']])
df_out1 = pd.Series(idx.map(lambda x: len_mul(*x)), idx).unstack()
CPU times: user 1min 32s, sys: 10.3 s, total: 1min 43s
Wall time: 1min 43s
Scott Boston 2nd solution
%%time
lens = df['Data'].str.len()
arr = df['Data'].values
df_out2 = pd.DataFrame(np.outer(lens,lens),
index=arr,
columns=arr)
CPU times: user 99.7 ms, sys: 232 ms, total: 332 ms
Wall time: 331 ms
Vectorial solution
%%time
v = df["Data"].str.len().values
arr = df["Data"].values
m = v.reshape((-1,1)).dot(v.reshape((1,-1)))
df_out3 = pd.DataFrame(m,
columns=arr,
index=arr)
CPU times: user 477 ms, sys: 188 ms, total: 666 ms
Wall time: 666 ms
Conclusions:
The clear winner is Scott Boston's 2nd solution, with mine about 2x slower. The 1st solution is 311x and 154x slower than those, respectively.
My suggestion would be to build the array with a list comprehension instead of a loop.
That way, you can easily create a DataFrame from it afterwards.
Example usage:
import numpy as np
import pandas as pd

array = ['A', 'B', 'C']

def function(X):
    return X**2

L = [[function(X) for X in np.arange(3)] for Y in np.arange(3)]
L
>>> [[0, 1, 4], [0, 1, 4], [0, 1, 4]]
pd.DataFrame(L, columns=array, index=array)
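Adapting the same idea to the original question, a minimal sketch (the len_mul body is copied from the question; df is the frame with the Data column):
import pandas as pd

df = pd.DataFrame({'Data': ['A', 'Av', 'Zcef']})

def len_mul(a, b):
    return len(a) * len(b)

vals = df['Data'].tolist()
L = [[len_mul(a, b) for b in vals] for a in vals]   # nested list comprehension instead of explicit loops
pd.DataFrame(L, index=vals, columns=vals)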
some text on it: https://www.pythonforbeginners.com/basics/list-comprehensions-in-python
hope it helps!
I have data like this
location sales store
0 68 583 17
1 28 857 2
2 55 190 59
3 98 517 64
4 94 892 79
...
For each unique (location, store) pair there are one or more sales. I want to add a column, pcnt_sales, that shows what percent of the total sales for that (location, store) pair is made up by the sale in the given row.
location sales store pcnt_sales
0 68 583 17 0.254363
1 28 857 2 0.346543
2 55 190 59 1.000000
3 98 517 64 0.272105
4 94 892 79 1.000000
...
This works, but is slow
import pandas as pd
import numpy as np
df = pd.DataFrame({'location':np.random.randint(0, 100, 10000), 'store':np.random.randint(0, 100, 10000), 'sales': np.random.randint(0, 1000, 10000)})
import timeit
start_time = timeit.default_timer()
df['pcnt_sales'] = df.groupby(['location', 'store'])['sales'].apply(lambda x: x/x.sum())
print(timeit.default_timer() - start_time) # 1.46 seconds
By comparison, R's data.table does this super fast
library(data.table)
dt <- data.table(location=sample(100, size=10000, replace=TRUE), store=sample(100, size=10000, replace=TRUE), sales=sample(1000, size=10000, replace=TRUE))
ptm <- proc.time()
dt[, pcnt_sales:=sales/sum(sales), by=c("location", "store")]
proc.time() - ptm # 0.007 seconds
How do I do this efficiently in Pandas (especially considering my real dataset has millions of rows)?
For performance you want to avoid apply. You could use transform to get the result of the groupby expanded to the original index instead, at which point a division would work at vectorized speed:
>>> %timeit df['pcnt_sales'] = df.groupby(['location', 'store'])['sales'].apply(lambda x: x/x.sum())
1 loop, best of 3: 2.27 s per loop
>>> %timeit df['pcnt_sales2'] = (df["sales"] /
df.groupby(['location', 'store'])['sales'].transform(sum))
100 loops, best of 3: 6.25 ms per loop
>>> df["pcnt_sales"].equals(df["pcnt_sales2"])
True
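Written as a plain assignment, this is equivalent to the timed expression above (transform also accepts the aggregation name as a string):
df['pcnt_sales'] = df['sales'] / df.groupby(['location', 'store'])['sales'].transform('sum')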
I have a fairly large (~5000 rows) DataFrame, with a number of variables, say 2 ['max', 'min'], sorted by 4 parameters, ['Hs', 'Tp', 'wd', 'seed']. It looks like this:
>>> data.head()
Hs Tp wd seed max min
0 1 9 165 22 225 18
1 1 9 195 16 190 18
2 2 5 165 43 193 12
3 2 10 180 15 141 22
4 1 6 180 17 219 18
>>> len(data)
4500
I want to keep only the first two parameters and get, for each variable, the maximum over 'wd' of the standard deviation across 'seed's (the standard deviation is calculated separately for each 'wd').
In the end, I'm left with unique (Hs, Tp) pairs with the maximum standard deviations for each variable. Something like:
>>> stdev.head()
Hs Tp max min
0 1 5 43.31321 4.597629
1 1 6 43.20004 4.640795
2 1 7 47.31507 4.569408
3 1 8 41.75081 4.651762
4 1 9 41.35818 4.285991
>>> len(stdev)
30
The following code does what I want, but since I have little understanding of DataFrames, I'm wondering whether these nested loops can be done in a different, more DataFrame-y way =)
import pandas as pd
import numpy as np

#data = pd.read_table('data.txt')

# don't worry too much about this ugly generator,
# it just emulates the format of my data...
total = 4500
data = pd.DataFrame()
data['Hs'] = np.random.randint(1, 4, size=total)
data['Tp'] = np.random.randint(5, 15, size=total)
data['wd'] = [[165, 180, 195][np.random.randint(0, 3)] for _ in xrange(total)]
data['seed'] = np.random.randint(1, 51, size=total)
data['max'] = np.random.randint(100, 250, size=total)
data['min'] = np.random.randint(10, 25, size=total)

# and here it starts. would the creators of pandas pull their hair out if they see this?
# can this be made better?
stdev = pd.DataFrame(columns=['Hs', 'Tp', 'max', 'min'])
i = 0
for hs in set(data['Hs']):
    data_Hs = data[data['Hs'] == hs]
    for tp in set(data_Hs['Tp']):
        data_tp = data_Hs[data_Hs['Tp'] == tp]
        stdev.loc[i] = [
            hs,
            tp,
            max([np.std(data_tp[data_tp['wd'] == wd]['max']) for wd in set(data_tp['wd'])]),
            max([np.std(data_tp[data_tp['wd'] == wd]['min']) for wd in set(data_tp['wd'])])]
        i += 1
Thanks!
PS: if curious, this is statistics on variables depending on sea waves. Hs is wave height, Tp wave period, wd wave direction; the seeds represent different realizations of an irregular wave train, and min and max are the peaks of my variable during a certain exposure time. After all this, by means of the standard deviation and average, I can fit a distribution to the data, like Gumbel.
This could be a one-liner, if I understood you correctly:
data.groupby(['Hs', 'Tp', 'wd'])[['max', 'min']].std(ddof=0).max(level=[0, 1])
(include reset_index() on the end if you want)
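On recent pandas versions, where max(level=...) is deprecated, an equivalent sketch uses a second groupby on the index levels (reset_index gives flat columns like in the question):
stdev = (data.groupby(['Hs', 'Tp', 'wd'])[['max', 'min']]
             .std(ddof=0)
             .groupby(level=['Hs', 'Tp'])
             .max()
             .reset_index())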
I'm trying to work out how to speed up a Python function which uses numpy. The output I received from line_profiler is below, and it shows that the vast majority of the time is spent on the line ind_y, ind_x = np.where(seg_image == i).
seg_image is an integer array which is the result of segmenting an image, thus finding the pixels where seg_image == i extracts a specific segmented object. I am looping through lots of these objects (in the code below I'm just looping through 5 for testing, but I'll actually be looping through over 20,000), and it takes a long time to run!
Is there any way in which the np.where call can be sped up? Or, alternatively, can the penultimate line (which also takes a good proportion of the time) be sped up?
The ideal solution would be to run the code on the whole array at once, rather than looping, but I don't think this is possible as there are side-effects to some of the functions I need to run (for example, dilating a segmented object can make it 'collide' with the next region and thus give incorrect results later on).
Does anyone have any ideas?
Line # Hits Time Per Hit % Time Line Contents
==============================================================
5 def correct_hot(hot_image, seg_image):
6 1 239810 239810.0 2.3 new_hot = hot_image.copy()
7 1 572966 572966.0 5.5 sign = np.zeros_like(hot_image) + 1
8 1 67565 67565.0 0.6 sign[:,:] = 1
9 1 1257867 1257867.0 12.1 sign[hot_image > 0] = -1
10
11 1 150 150.0 0.0 s_elem = np.ones((3, 3))
12
13 #for i in xrange(1,seg_image.max()+1):
14 6 57 9.5 0.0 for i in range(1,6):
15 5 6092775 1218555.0 58.5 ind_y, ind_x = np.where(seg_image == i)
16
17 # Get the average HOT value of the object (really simple!)
18 5 2408 481.6 0.0 obj_avg = hot_image[ind_y, ind_x].mean()
19
20 5 333 66.6 0.0 miny = np.min(ind_y)
21
22 5 162 32.4 0.0 minx = np.min(ind_x)
23
24
25 5 369 73.8 0.0 new_ind_x = ind_x - minx + 3
26 5 113 22.6 0.0 new_ind_y = ind_y - miny + 3
27
28 5 211 42.2 0.0 maxy = np.max(new_ind_y)
29 5 143 28.6 0.0 maxx = np.max(new_ind_x)
30
31 # 7 is + 1 to deal with the zero-based indexing, + 2 * 3 to deal with the 3 cell padding above
32 5 217 43.4 0.0 obj = np.zeros( (maxy+7, maxx+7) )
33
34 5 158 31.6 0.0 obj[new_ind_y, new_ind_x] = 1
35
36 5 2482 496.4 0.0 dilated = ndimage.binary_dilation(obj, s_elem)
37 5 1370 274.0 0.0 border = mahotas.borders(dilated)
38
39 5 122 24.4 0.0 border = np.logical_and(border, dilated)
40
41 5 355 71.0 0.0 border_ind_y, border_ind_x = np.where(border == 1)
42 5 136 27.2 0.0 border_ind_y = border_ind_y + miny - 3
43 5 123 24.6 0.0 border_ind_x = border_ind_x + minx - 3
44
45 5 645 129.0 0.0 border_avg = hot_image[border_ind_y, border_ind_x].mean()
46
47 5 2167729 433545.8 20.8 new_hot[seg_image == i] = (new_hot[ind_y, ind_x] + (sign[ind_y, ind_x] * np.abs(obj_avg - border_avg)))
48 5 10179 2035.8 0.1 print obj_avg, border_avg
49
50 1 4 4.0 0.0 return new_hot
EDIT: I have left my original answer at the bottom for the record, but I actually looked into your code in more detail over lunch, and I think that using np.where is a big mistake:
In [63]: a = np.random.randint(100, size=(1000, 1000))
In [64]: %timeit a == 42
1000 loops, best of 3: 950 us per loop
In [65]: %timeit np.where(a == 42)
100 loops, best of 3: 7.55 ms per loop
You could get a boolean array (that you can use for indexing) in 1/8 of the time you need to get the actual coordinates of the points!!!
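To illustrate, a minimal sketch that swaps the coordinate arrays for a boolean mask in the places that only need fancy indexing (names follow the question's code; the bounding-box bookkeeping still needs coordinates or, better, find_objects as shown below):
mask = seg_image == i                   # boolean mask, much cheaper than np.where
obj_avg = hot_image[mask].mean()        # same value as hot_image[ind_y, ind_x].mean()
new_hot[mask] += sign[mask] * np.abs(obj_avg - border_avg)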
There is of course the cropping of the features that you do, but ndimage has a find_objects function that returns enclosing slices, and appears to be very fast:
In [66]: %timeit ndimage.find_objects(a)
100 loops, best of 3: 11.5 ms per loop
This returns a list of tuples of slices enclosing all of your objects, in only 50% more time than it takes to find the indices of one single object.
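For reference, a tiny usage sketch of find_objects (labels are assumed to start at 1, so entry j of the list corresponds to label j + 1):
from scipy import ndimage
slices = ndimage.find_objects(seg_image)   # list of (slice_y, slice_x) tuples, one per label
first_obj_view = seg_image[slices[0]]      # bounding-box view of the object labelled 1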
It may not work out of the box as I cannot test it right now, but I would restructure your code into something like the following:
def correct_hot_bis(hot_image, seg_image):
    # Need this to not index out of bounds when computing border_avg
    hot_image_padded = np.pad(hot_image, 3, mode='constant',
                              constant_values=0)
    new_hot = hot_image.copy()
    sign = np.ones_like(hot_image, dtype=np.int8)
    sign[hot_image > 0] = -1
    s_elem = np.ones((3, 3))
    for j, slice_ in enumerate(ndimage.find_objects(seg_image)):
        hot_image_view = hot_image[slice_]
        seg_image_view = seg_image[slice_]
        new_shape = tuple(dim + 6 for dim in hot_image_view.shape)
        new_slice = tuple(slice(dim.start,
                                dim.stop + 6,
                                None) for dim in slice_)
        indices = seg_image_view == j + 1

        obj_avg = hot_image_view[indices].mean()

        obj = np.zeros(new_shape)
        obj[3:-3, 3:-3][indices] = True
        dilated = ndimage.binary_dilation(obj, s_elem)
        border = mahotas.borders(dilated)
        border &= dilated

        border_avg = hot_image_padded[new_slice][border == 1].mean()

        new_hot[slice_][indices] += (sign[slice_][indices] *
                                     np.abs(obj_avg - border_avg))
    return new_hot
You would still need to figure out the collisions, but you could get about a 2x speed-up by computing all the indices simultaneously using a np.unique based approach:
a = np.random.randint(100, size=(1000, 1000))

def get_pos(arr):
    pos = []
    for j in xrange(100):
        pos.append(np.where(arr == j))
    return pos

def get_pos_bis(arr):
    unq, flat_idx = np.unique(arr, return_inverse=True)
    pos = np.argsort(flat_idx)
    counts = np.bincount(flat_idx)
    cum_counts = np.cumsum(counts)
    multi_dim_idx = np.unravel_index(pos, arr.shape)
    return zip(*(np.split(coords, cum_counts) for coords in multi_dim_idx))
In [33]: %timeit get_pos(a)
1 loops, best of 3: 766 ms per loop
In [34]: %timeit get_pos_bis(a)
1 loops, best of 3: 388 ms per loop
Note that the pixels for each object are returned in a different order, so you can't simply compare the returns of both functions to assess equality. But they should both return the same.
One thing you could do to save a little bit of time is to cache the result of seg_image == i so that you don't need to compute it twice. You're computing it on lines 15 and 47; you could add seg_mask = seg_image == i and then reuse that result (it might also be good to separate out that piece for profiling purposes).
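A minimal sketch of that change inside the loop (seg_mask is just an illustrative name; the other variables are from the question's loop body):
seg_mask = seg_image == i                        # computed once per object
ind_y, ind_x = np.where(seg_mask)                # line 15 in the profile
# ... rest of the loop body unchanged ...
new_hot[seg_mask] = (new_hot[ind_y, ind_x] +
                     (sign[ind_y, ind_x] * np.abs(obj_avg - border_avg)))  # line 47 reuses the mask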
While there are some other minor things you could do to eke out a little more performance, the root issue is that you're using an O(M * N) algorithm, where M is the number of segments and N is the size of your image. It's not obvious to me from your code whether there is a faster algorithm that accomplishes the same thing, but that's the first place I'd look for a speedup.