Pandas: create new column which swaps values of other rows - python

I'm trying to create a pandas dataframe like this:
         x2        x3
0  3.536220  0.681269
1  0.681269  3.536220
2 -0.402380  2.303833
3  2.303833 -0.402380
4  2.032329  3.334412
5  3.334412  2.032329
6  0.371338  5.879732
...
So x2 is a column of random numbers, and x3 has the values of row 0 and 1 in x2 swapped, the values of 2 and 3 swapped, and so on. My current code is like this:
import numpy as np
import pandas as pd
x2 = pd.Series(np.random.normal(loc = 2, scale = 2.5, size = 1000))
x3 = pd.Series([x2[i + 1] if i % 2 == 0 else x2[i - 1] for i in range(1000)])
df = pd.DataFrame({'x2': x2, 'x3': x3})
I'm wondering if there is any faster or more elegant way, particularly if I want to have many rows (e.g. 1 million?) or do this over and over again (e.g. Monte Carlo simulation)?

Instead of
[x2[i + 1] if i % 2 == 0 else x2[i - 1] for i in range(1000)]
you could use
def swap(arr):
    result = np.empty_like(arr)
    result[::2] = arr[1::2]
    result[1::2] = arr[::2]
    return result
For a sequence of length 1000, using swap is over 3000x faster:
In [84]: %timeit [x2[i + 1] if i % 2 == 0 else x2[i - 1] for i in range(1000)]
100 loops, best of 3: 12.7 ms per loop
In [98]: %timeit swap(x2.values)
100000 loops, best of 3: 3.82 µs per loop
import numpy as np
import pandas as pd

np.random.seed(2017)
x2 = pd.Series(np.random.normal(loc=2, scale=2.5, size=1000))
x3 = [x2[i + 1] if i % 2 == 0 else x2[i - 1] for i in range(1000)]

def swap(arr):
    result = np.empty_like(arr)
    result[::2] = arr[1::2]
    result[1::2] = arr[::2]
    return result

df = pd.DataFrame({'x2': x2, 'x3': x3, 'x4': swap(x2.values)})
print(df.head())
prints
         x2        x3        x4
0 -0.557363  1.649005  1.649005
1  1.649005 -0.557363 -0.557363
2  2.497731  3.433690  3.433690
3  3.433690  2.497731  2.497731
4  1.013555  0.679394  0.679394
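An equivalent reshape-based formulation (a minimal sketch; it assumes an even number of rows, which holds for the 1000 used here):

import numpy as np
import pandas as pd

x2 = pd.Series(np.random.normal(loc=2, scale=2.5, size=1000))
# View the values as consecutive pairs, reverse each pair, and flatten back out
x3 = x2.values.reshape(-1, 2)[:, ::-1].ravel()
df = pd.DataFrame({'x2': x2, 'x3': x3})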

Related

Speed up turn probabilities into binary features

I have a dataframe with 3 columns; each row holds the probabilities that, for this row, the feature T has the value 1, 2 or 3
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({"T1" : [0.8,0.5,0.01],"T2":[0.1,0.2,0.89],"T3":[0.1,0.3,0.1]})
For row 0, T is 1 with 80% chance, 2 with 10% and 3 with 10%
I want to simulate the value of T for each row and change the columns T1,T2, T3 to binary features.
I have a solution, but it needs to loop over the rows of the dataframe and is really slow (my real dataframe has over 1 million rows):
possib = df.columns
for i in range(df.shape[0]):
    probas = df.iloc[i][possib].tolist()
    choix_transp = np.random.choice(possib, 1, p=probas)[0]
    for pos in possib:
        if pos == choix_transp:
            df.iloc[i][pos] = 1
        else:
            df.iloc[i][pos] = 0
Is there a way to vectorize this code ?
Thank you !
Here's one based on vectorized random.choice with a given matrix of probabilities -
def matrixprob_to_onehot(ar):
    # Get one-hot encoded boolean array based on a matrix of probabilities
    c = ar.cumsum(axis=1)
    idx = (np.random.rand(len(c), 1) < c).argmax(axis=1)
    ar_out = np.zeros(ar.shape, dtype=bool)
    ar_out[np.arange(len(idx)), idx] = 1
    return ar_out

ar_out = matrixprob_to_onehot(df.values)
df_out = pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)
Verify with a large number of simulations that the sampled frequencies match the given probabilities -
In [139]: df = pd.DataFrame({"T1" : [0.8,0.5,0.01],"T2":[0.1,0.2,0.89],"T3":[0.1,0.3,0.1]})
In [140]: df
Out[140]:
     T1    T2   T3
0  0.80  0.10  0.1
1  0.50  0.20  0.3
2  0.01  0.89  0.1
In [141]: p = np.array([matrixprob_to_onehot(df.values) for i in range(100000)]).argmax(2)
In [142]: np.array([np.bincount(p[:,i])/100000.0 for i in range(len(df))])
Out[142]:
array([[0.80064, 0.0995 , 0.09986],
       [0.50051, 0.20113, 0.29836],
       [0.01015, 0.89045, 0.0994 ]])
In [145]: np.round(_,2)
Out[145]:
array([[0.8 , 0.1 , 0.1 ],
       [0.5 , 0.2 , 0.3 ],
       [0.01, 0.89, 0.1 ]])
Timings on 1,000,000 rows -
# Setup input
In [169]: N = 1000000
     ...: a = np.random.rand(N, 3)
     ...: df = pd.DataFrame(a/a.sum(1, keepdims=1), columns=['T1', 'T2', 'T3'])
# #gmds's soln
In [171]: %timeit pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
1 loop, best of 3: 4.82 s per loop
# Soln from this post
In [172]: %%timeit
...: ar_out = matrixprob_to_onehot(df.values)
...: df_out = pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)
10 loops, best of 3: 43.1 ms per loop
We can use numpy for this:
result = pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
This generates a single column of random values and compares it to the column-wise cumsum of the dataframe, which yields a DataFrame of booleans where the first False in each row marks the "bucket" the random value falls into. With idxmin, we get the column label of that bucket, which we can then convert back to binary columns with pd.get_dummies.
Example:
import numpy as np
import pandas as pd
np.random.seed(0)
data = np.random.rand(10, 3)
normalised = data / data.sum(axis=1)[:, np.newaxis]
df = pd.DataFrame(normalised)
result = pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
print(result)
Output:
   0  1  2
0  1  0  0
1  0  0  1
2  0  1  0
3  0  1  0
4  1  0  0
5  0  0  1
6  0  1  0
7  0  1  0
8  0  0  1
9  0  1  0
A note:
Most of the slowdown comes from pd.get_dummies; if you use Divakar's method of pd.DataFrame(result.view('i1'), index=df.index, columns=df.columns), it gets a lot faster.
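For illustration, a hedged sketch of that combination, building the one-hot array directly from the boolean comparison and skipping pd.get_dummies entirely (variable names here are illustrative):

import numpy as np
import pandas as pd

comp = np.random.rand(len(df), 1) > df.values.cumsum(axis=1)  # True until the chosen bucket
idx = comp.argmin(axis=1)                                     # first False marks the bucket
onehot = np.zeros(df.shape, dtype=bool)
onehot[np.arange(len(df)), idx] = True
df_out = pd.DataFrame(onehot.view('i1'), index=df.index, columns=df.columns)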

Cumulative apply within window defined by other columns

I am trying to apply a function, cumulatively, to values that lie within a window defined by 'start' and 'finish' columns. So, 'start' and 'finish' define the intervals where the value is 'active'; for each row, I want to get a sum of all 'active' values at the time.
Here is a 'bruteforce' example that does what I am after - is there a more elegant, faster or more memory efficient way of doing this?
df = pd.DataFrame(data=[[1, 3, 100], [2, 4, 200], [3, 6, 300], [4, 6, 400], [5, 6, 500]],
                  columns=['start', 'finish', 'val'])
df['dummy'] = 1
df = df.merge(df, on=['dummy'], how='left')
df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
val = df.groupby('start_x')['val_y'].sum()
Originally, df is:
   start  finish  val
0      1       3  100
1      2       4  200
2      3       6  300
3      4       6  400
4      5       6  500
The result I am after is:
1     100
2     300
3     500
4     700
5    1200
numba
import numpy as np
from numba import njit

@njit
def pir_numba(S, F, V):
    mn = S.min()
    mx = F.max()
    out = np.zeros(mx)
    for s, f, v in zip(S, F, V):
        out[s:f] += v
    return out[mn:]

pir_numba(*[df[c].values for c in ['start', 'finish', 'val']])
np.bincount
s, f, v = [df[col].values for col in ['start', 'finish', 'val']]
np.bincount([i - 1 for r in map(range, s, f) for i in r], v.repeat(f - s))
array([ 100., 300., 500., 700., 1200.])
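To recover the labelled Series from the question, a minimal sketch wrapping the bincount output (assuming integer, 1-based start values as in the example):

import numpy as np
import pandas as pd

s, f, v = (df[c].values for c in ['start', 'finish', 'val'])
counts = np.bincount([i - 1 for r in map(range, s, f) for i in r],  # active positions, 0-based
                     v.repeat(f - s))                               # each value over its interval
result = pd.Series(counts, index=np.arange(s.min(), f.max()))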
Comprehension
This depends on the index being unique
pd.Series({
    (k, i): v
    for i, s, f, v in df.itertuples()
    for k in range(s, f)
}).sum(level=0)

1     100
2     300
3     500
4     700
5    1200
dtype: int64
With no dependence on index
pd.Series({
    (k, i): v
    for i, (s, f, v) in enumerate(zip(*map(df.get, ['start', 'finish', 'val'])))
    for k in range(s, f)
}).sum(level=0)
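Note: newer pandas versions removed the level argument of Series.sum; there, the equivalent of .sum(level=0) is a groupby on the index level:

pd.Series({
    (k, i): v
    for i, s, f, v in df.itertuples()
    for k in range(s, f)
}).groupby(level=0).sum()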
Using numpy broadcasting. Unfortunately it is still an O(n*m) solution, but it should be faster than the groupby. So far, based on my tests, Pir's solution performs best.
s1 = df['start'].values
s2 = df['finish'].values
np.sum(((s1 <= s1[:, None]) & (s2 > s1[:, None])) * df.val.values, 1)
Out[44]: array([ 100,  300,  500,  700, 1200], dtype=int64)
Some timing
# df = pd.concat([df]*1000)
%timeit merged(df)
1 loop, best of 3: 5.02 s per loop

%timeit npb(df)
1 loop, best of 3: 283 ms per loop

%timeit PIR(df)
100 loops, best of 3: 9.8 ms per loop
def merged(df):
    df['dummy'] = 1
    df = df.merge(df, on=['dummy'], how='left')
    df = df[(df['start_y'] <= df['start_x']) & (df['finish_y'] > df['start_x'])]
    val = df.groupby('start_x')['val_y'].sum()
    return val

def npb(df):
    s1 = df['start'].values
    s2 = df['finish'].values
    return np.sum(((s1 <= s1[:, None]) & (s2 > s1[:, None])) * df.val.values, 1)

Python pandas dataframe get all combinations of column values?

I have a pandas dataframe which looks like this:
   colour  points
0     red       1
1  yellow      10
2   black      -3
Then I'm trying to do the following algorithm:
combos = []
points = []
for i1 in range(len(df)):
    for i2 in range(len(df)):
        colour_main = df['colour'].values[i1]
        colour_secondary = df['colour'].values[i2]
        combo = colour_main + "_" + colour_secondary
        point1 = df['points'].values[i1]
        point2 = df['points'].values[i2]
        new_points = point1 + point2
        combos.append(combo)
        points.append(new_points)

df_new = pd.DataFrame({'colours': combos,
                       'points': points})
print(df_new)
I want to get all combinations and sum points:
if the colour is used as main, I want to add its value
if the colour is used as secondary, I want to add the opposite (negated) value
Example:
red_yellow = 1 + (-10) = -9
red_black = 1 + ( +3) = 4
black_red = -3 + ( -1) = -4
The output I currently get:
colours points
0 red_red 2
1 red_yellow 11
2 red_black -2
3 yellow_red 11
4 yellow_yellow 20
5 yellow_black 7
6 black_red -2
7 black_yellow 7
8 blac_kblack -6
The output I'm looking for:
red_yellow -9
red_black 4
yellow_red 9
yellow_black 13
black_red -4
black_yellow -13
I don't know how to apply my logic to this code. Also, I bet there is a simpler way to get all combinations without two loops, but currently that's the only approach that comes to mind.
I would like to:
get the desired output
improve the performance for larger inputs (e.g. 20 colours)
remove duplicates like red_red
Here is a timeit comparison of a few alternatives.
| method | ms per loop |
|--------------------+-------------|
| alt2 | 2.36 |
| using_concat | 3.26 |
| using_double_merge | 22.4 |
| orig | 22.6 |
| alt | 45.8 |
The timeit results were generated using IPython:
In [138]: df = make_df(20)
In [143]: %timeit alt2(df)
100 loops, best of 3: 2.36 ms per loop
In [140]: %timeit orig(df)
10 loops, best of 3: 22.6 ms per loop
In [142]: %timeit alt(df)
10 loops, best of 3: 45.8 ms per loop
In [169]: %timeit using_double_merge(df)
10 loops, best of 3: 22.4 ms per loop
In [170]: %timeit using_concat(df)
100 loops, best of 3: 3.26 ms per loop
import itertools

import numpy as np
import pandas as pd

def alt(df):
    df['const'] = 1
    result = pd.merge(df, df, on='const', how='outer')
    result = result.loc[(result['colour_x'] != result['colour_y'])]
    result['color'] = result['colour_x'] + '_' + result['colour_y']
    result['points'] = result['points_x'] - result['points_y']
    result = result[['color', 'points']]
    return result

def alt2(df):
    points = np.add.outer(df['points'], -df['points'])
    color = pd.MultiIndex.from_product([df['colour'], df['colour']])
    mask = color.codes[0] != color.codes[1]
    color = color.map('_'.join)
    result = pd.DataFrame({'points': points.ravel(), 'color': color})
    result = result.loc[mask]
    return result

def orig(df):
    combos = []
    points = []
    for i1 in range(len(df)):
        for i2 in range(len(df)):
            colour_main = df['colour'].iloc[i1]
            colour_secondary = df['colour'].iloc[i2]
            if colour_main != colour_secondary:
                combo = colour_main + "_" + colour_secondary
                point1 = df['points'].values[i1]
                point2 = df['points'].values[i2]
                new_points = point1 - point2
                combos.append(combo)
                points.append(new_points)
    return pd.DataFrame({'color': combos, 'points': points})

def using_concat(df):
    """https://stackoverflow.com/a/51641085/190597 (RafaelC)"""
    d = df.set_index('colour').to_dict()['points']
    s = pd.Series(list(itertools.combinations(df.colour, 2)))
    s = pd.concat([s, s.transform(lambda k: k[::-1])])
    v = s.map(lambda k: d[k[0]] - d[k[1]])
    df2 = pd.DataFrame({'comb': s.str.get(0) + '_' + s.str.get(1), 'values': v})
    return df2

def using_double_merge(df):
    """https://stackoverflow.com/a/51641007/190597 (sacul)"""
    new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
           .reset_index()
           .drop(['colour', 'points'], 1)
           .merge(df.set_index('colour'), left_on='level_0', right_index=True)
           .merge(df.set_index('colour'), left_on='level_1', right_index=True))
    new['points_y'] *= -1
    new['sum'] = new.sum(axis=1)
    new = new[new.level_0 != new.level_1].drop(['points_x', 'points_y'], 1)
    new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x), 1)
    return new[['colours', 'sum']]

def make_df(N):
    df = pd.DataFrame({'colour': np.arange(N),
                       'points': np.random.randint(10, size=N)})
    df['colour'] = df['colour'].astype(str)
    return df
The main idea in alt2 is to use np.add.outer to construct an addition table out of df['points']:
In [149]: points = np.add.outer(df['points'], -df['points'])
In [151]: points
Out[151]:
array([[  0,  -9,   4],
       [  9,   0,  13],
       [ -4, -13,   0]])
ravel is used to make the array 1-dimensional:
In [152]: points.ravel()
Out[152]: array([ 0, -9, 4, 9, 0, 13, -4, -13, 0])
and the color combinations are generated with pd.MultiIndex.from_product:
In [153]: color = pd.MultiIndex.from_product([df['colour'], df['colour']])
In [155]: color = color.map('_'.join)
In [156]: color
Out[156]:
Index(['red_red', 'red_yellow', 'red_black', 'yellow_red', 'yellow_yellow',
       'yellow_black', 'black_red', 'black_yellow', 'black_black'],
      dtype='object')
A mask is generated to remove duplicates:
mask = color.codes[0] != color.codes[1]
and then the result is generated from these parts:
result = pd.DataFrame({'points':points.ravel(), 'color':color})
result = result.loc[mask]
The idea behind alt is explained in my original answer, here.
This is a bit long-winded, but gets you the output you want:
new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
       .reset_index()
       .drop(['colour', 'points'], 1)
       .merge(df.set_index('colour'), left_on='level_0', right_index=True)
       .merge(df.set_index('colour'), left_on='level_1', right_index=True))
new['points_y'] *= -1
new['sum'] = new.sum(axis=1)
new = new[new.level_0 != new.level_1].drop(['points_x', 'points_y'], 1)
new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x), 1)

>>> new
  level_0 level_1  sum       colours
3  yellow     red    9    yellow_red
6   black     red   -4     black_red
1     red  yellow   -9    red_yellow
7   black  yellow  -13  black_yellow
2     red   black    4     red_black
5  yellow   black   13  yellow_black
import itertools

d = df.set_index('colour').to_dict()['points']
s = pd.Series(list(itertools.combinations(df.colour, 2)))
s = pd.concat([s, s.transform(lambda k: k[::-1])])
v = s.map(lambda k: d[k[0]] - d[k[1]])
df2 = pd.DataFrame({'comb': s.str.get(0) + '_' + s.str.get(1), 'values': v})

           comb  values
0    red_yellow      -9
1     red_black       4
2  yellow_black      13
0    yellow_red       9
1     black_red      -4
2  black_yellow     -13
You have to change this line in your code
new_points = point1 + point2
to this
new_points = point1 - point2
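Beyond that one-line fix, a compact way to also drop the red_red-style duplicates and avoid the explicit double loop is itertools.permutations (a hedged sketch; df_new is an illustrative name):

import itertools

import pandas as pd

d = dict(zip(df['colour'], df['points']))   # colour -> points lookup
rows = [(f'{a}_{b}', d[a] - d[b])           # main value plus the negated secondary value
        for a, b in itertools.permutations(d, 2)]
df_new = pd.DataFrame(rows, columns=['colours', 'points'])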

Marking all groups in DataFrame smaller than N

I'm trying to mark (in ok) all groups in a pandas DataFrame which are smaller than 'N'. I have a working solution but it's slow, is there a way to speed this up?
import pandas as pd
df = pd.DataFrame([
    [1, 2, 1],
    [1, 2, 2],
    [1, 2, 3],
    [2, 3, 1],
    [2, 3, 2],
    [4, 5, 1],
    [4, 5, 2],
    [4, 5, 3],
], columns=['x', 'y', 'z'])
keys = ['x', 'y']
N = 3

df['ok'] = True
c = df.groupby(keys)['ok'].count()
for vals in c[c < N].index:
    local_dict = dict(zip(keys, vals))
    query = ' & '.join(f'{key}==@{key}' for key in keys)
    idx = df.query(query, local_dict=local_dict).index
    df.loc[idx, 'ok'] = False
print(df)
Instead of using groupby/count, use groupby/transform/count to form a Series which is the same length as the original DataFrame df:
c = df.groupby(keys)['z'].transform('count')
Then you can form a boolean mask which has the same length as df:
In [35]: c<N
Out[35]:
0    False
1    False
2    False
3     True
4     True
5    False
6    False
7    False
Name: z, dtype: bool
Assignment to ok goes much more smoothly now, without a loop, querying or sub-indexing:
df['ok'] = c >= N
import pandas as pd

df = pd.DataFrame([
    [1, 2, 1],
    [1, 2, 2],
    [1, 2, 3],
    [2, 3, 1],
    [2, 3, 2],
    [4, 5, 1],
    [4, 5, 2],
    [4, 5, 3],
], columns=['x', 'y', 'z'])
keys = ['x', 'y']
N = 3

c = df.groupby(keys)['z'].transform('count')
df['ok'] = c >= N
print(df)
yields
   x  y  z     ok
0  1  2  1   True
1  1  2  2   True
2  1  2  3   True
3  2  3  1  False
4  2  3  2  False
5  4  5  1   True
6  4  5  2   True
7  4  5  3   True
Since the builtin groupby/transform methods (such as transform('count')) are Cythonized, they are in general faster than calling groupby/transform with a custom lambda function.
Thus, computing the ok column in two steps using
c = df.groupby(keys)['z'].transform('count')
df['ok'] = c >= N
is faster than
df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
In addition, vectorized operations over an entire column (such as c >= N) are faster than multiple operations over subgroups. transform(lambda x: x.size >= N) performs the comparison x.size >= N once for each group, so when there are many groups, computing c >= N instead yields a substantial performance improvement.
For example, with this 1000-row DataFrame:
import numpy as np
import pandas as pd
np.random.seed(2017)
df = pd.DataFrame(np.random.randint(10, size=(1000, 3)), columns=['x', 'y', 'z'])
keys = ['x', 'y']
N = 3
using transform('count') is about 12x faster:
In [37]: %%timeit
....: c = df.groupby(keys)['z'].transform('count')
....: df['ok'] = c >= N
1000 loops, best of 3: 1.69 ms per loop
In [38]: %timeit df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
1 loop, best of 3: 20.2 ms per loop
In [39]: 20.2/1.69
Out[39]: 11.95266272189349
In the example above there were 100 groups:
In [47]: df.groupby(keys).ngroups
Out[47]: 100
The speed advantage of using transform('count') increases as the number of
groups increase. For example, with 955 groups:
In [48]: np.random.seed(2017); df = pd.DataFrame(np.random.randint(100, size=(1000, 3)), columns=['x', 'y', 'z'])
In [51]: df.groupby(keys).ngroups
Out[51]: 955
the transform('count') method performs about 92x faster:
In [49]: %%timeit
....: c = df.groupby(keys)['z'].transform('count')
....: df['ok'] = c >= N
1000 loops, best of 3: 1.88 ms per loop
In [50]: %timeit df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
10 loops, best of 3: 173 ms per loop
In [52]: 173/1.88
Out[52]: 92.02127659574468
Input variables:
keys = ['x','y']
N = 3
Calculate okay or not with groupby, transform and size:
df.assign(ok=df.groupby(keys)['z'].transform(lambda x: x.size >= N))
Output:
   x  y  z     ok
0  1  2  1   True
1  1  2  2   True
2  1  2  3   True
3  2  3  1  False
4  2  3  2  False
5  4  5  1   True
6  4  5  2   True
7  4  5  3   True
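A closely related variant that keeps the fast Cythonized path (a minimal sketch; 'size' counts every row in the group, whereas 'count' skips NaNs, so the two agree on this data):

# Mark groups of at least N rows using group sizes instead of counts
df['ok'] = df.groupby(keys)['z'].transform('size').ge(N)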

count how many elements in a numpy array are within delta of every other element

consider the array x and delta variable d
np.random.seed([3,1415])
x = np.random.randint(100, size=10)
d = 10
For each element in x, I want to count how many elements in x are within distance d of it (note that the count includes the element itself).
So x looks like
print(x)
[11 98 74 90 15 55 13 11 13 26]
The results should be
[5 2 1 2 5 1 5 5 5 1]
what I've tried
Strategy:
Use broadcasting to take the outer difference
Absolute value of outer difference
count how many fall within the threshold
(np.abs(x[:, None] - x) <= d).sum(-1)
[5 2 1 2 5 1 5 5 5 1]
This works great. However, it doesn't scale. That outer difference is O(n^2) time. How can I get the same solution that doesn't scale with quadratic time?
Listed in this post are two more variants based on the searchsorted strategy from OP's answer post.
def pir3(a, d):  # Short & less efficient
    sidx = a.argsort()
    p1 = a.searchsorted(a + d, 'right', sorter=sidx)
    p2 = a.searchsorted(a - d, sorter=sidx)
    return p1 - p2

def pir4(a, d):  # Long & more efficient
    s = a.argsort()
    y = np.empty(s.size, dtype=np.int64)
    y[s] = np.arange(s.size)
    a_ = a[s]
    return (
        a_.searchsorted(a_ + d, 'right')
        - a_.searchsorted(a_ - d)
    )[y]
The more efficient approach borrows the trick for computing s.argsort() cheaply, by inverting the permutation directly, from this post; a standalone demonstration follows.
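A minimal demonstration of that trick (assigning into an empty array inverts the permutation in O(n), matching a second argsort):

import numpy as np

s = np.array([2, 0, 1])                    # some permutation (e.g. from argsort)
inv = np.empty(s.size, dtype=np.int64)
inv[s] = np.arange(s.size)                 # O(n) inverse of the permutation
assert np.array_equal(inv, s.argsort())    # same result as the O(n log n) argsort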
Runtime test -
In [155]: # Inputs
...: a = np.random.randint(0,1000000,(10000))
...: d = 10
In [156]: %timeit pir2(a,d) ## piRSquared's post solution
...: %timeit pir3(a,d)
...: %timeit pir4(a,d)
...:
100 loops, best of 3: 2.43 ms per loop
100 loops, best of 3: 4.44 ms per loop
1000 loops, best of 3: 1.66 ms per loop
Strategy
Since x is not necessarily sorted, we'll sort it and track the sorting permutation via argsort so we can reverse the permutation.
We'll use np.searchsorted on x with x - d to find the starting place for when values of x start to exceed x - d.
Do it again on the other side, except with the np.searchsorted parameter side='right' and x + d
Take the difference between right and left searchsorts to calculate number of elements that are within +/- d of each element
Use argsort to reverse the sorting permutation
Define the method presented in the question as pir1:
def pir1(a, d):
    return (np.abs(a[:, None] - a) <= d).sum(-1)
We'll define a new function pir2
def pir2(a, d):
    s = a.argsort()
    a_ = a[s]
    return (
        a_.searchsorted(a_ + d, 'right')
        - a_.searchsorted(a_ - d)
    )[s.argsort()]
demo
pir1(x, d)
[5 2 1 2 5 1 5 5 5 1]

pir2(x, d)
[5 2 1 2 5 1 5 5 5 1]
timing
pir2 is the clear winner!
code
functions
def pir1(a, d):
    return (np.abs(a[:, None] - a) <= d).sum(-1)

def pir2(a, d):
    s = a.argsort()
    a_ = a[s]
    return (
        a_.searchsorted(a_ + d, 'right')
        - a_.searchsorted(a_ - d)
    )[s.argsort()]

#######################
# From Divakar's post #
#######################
def pir3(a, d):  # Short & less efficient
    sidx = a.argsort()
    p1 = a.searchsorted(a + d, 'right', sorter=sidx)
    p2 = a.searchsorted(a - d, sorter=sidx)
    return p1 - p2

def pir4(a, d):  # Long & more efficient
    s = a.argsort()
    y = np.empty(s.size, dtype=np.int64)
    y[s] = np.arange(s.size)
    a_ = a[s]
    return (
        a_.searchsorted(a_ + d, 'right')
        - a_.searchsorted(a_ - d)
    )[y]
test
from timeit import timeit

results = pd.DataFrame(
    index=np.arange(1, 50),
    columns=['pir%s' % i for i in range(1, 5)])

for i in results.index:
    np.random.seed([3, 1415])
    x = np.random.randint(1000000, size=i)
    for j in results.columns:
        setup = 'from __main__ import x, {}'.format(j)
        results.loc[i, j] = timeit('{}(x, 10)'.format(j), setup=setup, number=10000)

results.plot()
Extended out to larger arrays, with pir1 dropped (its quadratic cost dominates at these sizes):
from timeit import timeit

results = pd.DataFrame(
    index=np.arange(1, 11) * 1000,
    columns=['pir%s' % i for i in range(2, 5)])

for i in results.index:
    np.random.seed([3, 1415])
    x = np.random.randint(1000000, size=i)
    for j in results.columns:
        setup = 'from __main__ import x, {}'.format(j)
        results.loc[i, j] = timeit('{}(x, 10)'.format(j), setup=setup, number=100)

results.insert(0, 'pir1', 0)
results.plot()
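As a sanity check, a minimal sketch (assuming pir1 through pir4 are defined as above) that all four implementations agree:

np.random.seed([3, 1415])
x = np.random.randint(1000000, size=1000)
ref = pir1(x, 10)
# The searchsorted-based variants should reproduce the brute-force counts exactly
assert all(np.array_equal(ref, f(x, 10)) for f in (pir2, pir3, pir4))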
