While loop repeats the same output when it should be different - python

I don't understand why the following code outputs the same random values for simulated_returns_pr from the SECOND loop iteration onward (the same happens for the 2 charts produced in the loop). I removed some code for brevity, but every downstream variable that should differ is also identical from the SECOND iteration on. I am missing something but cannot see what. Any contribution would be appreciated.
My code:
logR = timeseries
i = 1
while i < 5:
    simulated_returns_pr = np.random.normal(loc=mean(logR)*30, scale=stdev(logR)*np.sqrt(30.), size=30)
    seed = 2
    N = 30
    def Brownian(seed, N):
        np.random.seed(seed)
        dt = 1./N  # time step
        b = simulated_returns_pr*np.sqrt(dt)
        W = np.cumsum(b)  # brownian path
        return W, b
    b = Brownian(seed, N)[1]
    W = Brownian(seed, N)[0]
    W = np.insert(W, 0, 0.)
    plt.rcParams['figure.figsize'] = (10,8)
    xb = np.linspace(1, len(b), len(b))
    plt.plot(xb, b)
    plt.title('Brownian Increments')
    plt.show()
    xw = np.linspace(1, len(W), len(W))
    plt.plot(xw, W)
    plt.title('Brownian Motion')
    plt.show()
    i += 1
Output simulated_returns_pr:
[ 0.012191 1.16322303 -0.23225735 -0.12357125 0.35687974 1.02187274
0.25248517 0.74665974 0.54373161 0.43677913 0.69960184 -0.81226681
0.50380517 -0.25108897 0.47459444 0.49541601 0.79958083 -0.20233765
0.5142276 -0.31340253 0.46332258 0.48350956 0.06662023 0.53800548
-0.01440759 -0.23280276 -0.07377719 -0.29948791 0.15798112 0.10707121]
[-0.10796927 0.07350919 -0.97356921 0.9275805 -0.80101665 -0.32191758
0.35499571 -0.52506813 -0.43075947 -0.35577774 0.37944815 1.25577886
0.12274682 -0.4609512 0.37320789 -0.19828379 0.09220437 0.69335439
-0.27465829 0.10637854 -0.3402222 0.02308293 0.2309978 -0.3959363
-0.06873477 -0.01706476 -0.21917336 -0.49603296 -0.61363441 0.02456247]
[-0.10796927 0.07350919 -0.97356921 0.9275805 -0.80101665 -0.32191758
0.35499571 -0.52506813 -0.43075947 -0.35577774 0.37944815 1.25577886
0.12274682 -0.4609512 0.37320789 -0.19828379 0.09220437 0.69335439
-0.27465829 0.10637854 -0.3402222 0.02308293 0.2309978 -0.3959363
-0.06873477 -0.01706476 -0.21917336 -0.49603296 -0.61363441 0.02456247]
[-0.10796927 0.07350919 -0.97356921 0.9275805 -0.80101665 -0.32191758
0.35499571 -0.52506813 -0.43075947 -0.35577774 0.37944815 1.25577886
0.12274682 -0.4609512 0.37320789 -0.19828379 0.09220437 0.69335439
-0.27465829 0.10637854 -0.3402222 0.02308293 0.2309978 -0.3959363
-0.06873477 -0.01706476 -0.21917336 -0.49603296 -0.61363441 0.02456247]
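The pattern in the output (first array differs, the rest are identical) is what you would expect if the global NumPy seed is reset inside the loop: Brownian calls np.random.seed(seed) with the same seed on every pass, so every np.random.normal call after the first one starts from the same generator state. A minimal sketch of the effect, assuming nothing else in the removed code touches the random state; the function names draw_with_reseed and draw_with_generator are made up for illustration:

```python
import numpy as np

def draw_with_reseed(n_iter, seed=2, size=3):
    """Mimic the loop: draw, then reset the global seed on each pass."""
    draws = []
    for _ in range(n_iter):
        draws.append(np.random.normal(size=size))
        np.random.seed(seed)  # same state restored before the next draw
    return draws

def draw_with_generator(n_iter, seed=2, size=3):
    """Seed once, outside the loop, and keep drawing from one generator."""
    rng = np.random.default_rng(seed)
    return [rng.normal(size=size) for _ in range(n_iter)]

reseeded = draw_with_reseed(4)
independent = draw_with_generator(4)
# Draws 2..4 repeat when the seed is reset inside the loop
print(np.array_equal(reseeded[1], reseeded[2]))
# Draws differ when the seed is set only once
print(np.array_equal(independent[1], independent[2]))
```

Moving the seeding out of the loop (or dropping it from Brownian, which does not draw any random numbers itself) should give different values on each pass while keeping the run reproducible.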

Related

How to get rid of excess nested array in Python

Nested Array
I want to turn the above into the below. This happened by accident while I was doing a linear regression whose output was already in a 1x1 array; let me know if you would like to see more of my code. It looks like my betas variable is the cause of the nesting.
Normal Array
Generally speaking, I am just trying to get the output from
[[ array([x]), array([x]), array([x]), array([x]), array([x])]]
to
[[x, x, x, x, x ]]
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
def si_model():
    dj_data = pd.read_csv("/data.tsv", sep = "\t")
    dj_data = dj_data.pct_change().dropna()
    ann_dj_data = dj_data * 252
    dj_index = ann_dj_data['^DJI']
    ann_dj_data = ann_dj_data.drop('^DJI', axis='columns')
    # Function to Linear Regress Each Stock onto DJ
    def model_regress(stock):
        # Fit DJ to Index Data
        DJ = np.array(dj_index).reshape(len(stock), 1)
        # Regression of each stock onto DJ
        lm = LinearRegression().fit(DJ, y=stock.to_numpy())
        resids = stock.to_numpy() - lm.predict(DJ)
        return lm.coef_, lm.intercept_, resids.std()
    # Run model regression on each stock
    lm_all = ann_dj_data.apply(lambda stock: model_regress(stock)).T
    # Table of the Coefficients
    lm_all = lm_all.rename(columns={0: 'Beta ', 1: 'Intercept', 2: 'Rsd Std'})
    # Variance of the index's returns
    dj_index_var = dj_index.std() ** 2
    betas = lm_all['Beta '].to_numpy()
    resid_vars = lm_all['Rsd Std'].to_numpy() ** 2
    # Single index approximation of covariance matrix using identity matrix (np.eye)
    Qsi = dj_index_var * betas * betas.reshape(-1, 1) + np.eye(len(betas)) * resid_vars
    return Qsi
# Printing first five rows of approximation
Qsi = si_model()
print("Covariance Matrix")
print(Qsi[:5, :5])
You can use squeeze().
Here is a small example similar to yours:
import numpy as np
a = np.array([17.1500691])
b = np.array([5.47690856])
c = np.array([5.47690856])
d = np.array([11.7700696])
e = list([[a,b],[c,d]])
print(e)
f = np.squeeze(np.array(e), axis=2)
print(f)
Output:
[[array([17.1500691]), array([5.47690856])], [array([5.47690856]), array([11.7700696])]]
[[17.1500691 5.47690856]
[ 5.47690856 11.7700696 ]]
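As an alternative to squeezing the final matrix, the nesting can probably be removed at the source. In the question, lm.coef_ is a length-1 array for a single-feature regression, so the betas column ends up as an object array whose elements are themselves arrays. A minimal sketch, assuming that is indeed what betas looks like; the sample values and the name flatten_betas are invented for illustration:

```python
import numpy as np

def flatten_betas(nested):
    """Turn an object array of length-1 arrays into a flat float array."""
    return np.concatenate(list(nested)).astype(float)

# Build an object array of 1-element arrays, similar to a column of lm.coef_ values
nested = np.empty(3, dtype=object)
for i, value in enumerate([0.8, 1.1, 0.95]):
    nested[i] = np.array([value])

betas = flatten_betas(nested)
print(betas.shape, betas.dtype)
# The outer product now yields a plain 2D float matrix
outer = betas * betas.reshape(-1, 1)
print(outer.shape)
```

Returning lm.coef_[0] instead of lm.coef_ from model_regress would have the same effect without a separate flattening step.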

What is the reason for this many errors in my former MATLAB, now Python code?

I am currently translating this piece of MATLAB code to Python. The code calculates the fractal dimension of an image by converting it to a binary array and applying the box-counting method.
But errors such as "found bad_character" and "end of statement expected" are reported all at once in some parts of the code.
import numpy as np
import math as mt
import cv2
import matplotlib.pyplot
I = cv2.imread()  # place image location here
h, w, c = I.cv2.shape
#print('width: ', w)
#print('height: ', h)
#print('channel:', c)
largerLength = np.maximum(h, w)
power = np.ceil(mt.log2(largerLength))
lengthNum =2**power
grayIm = cv2.cvtColor(I, cv2.COLOR_BGR2GRAY)
ret, binaryIm = cv2.threshold(grayIm,125,256,cv2.THRESH_BINARY)
#get the amount of padding to add
padRow = lengthNum - I.shape
padCol = lengthNum − I.shape
#pad I with 0’s after its last row and column
I = np.padarray(I , [padRow , padCol], ’post’) #??
boxCountstore = np.zeros(1 , power) #??
#boxcountstore = zeros(1, power)
scalestore = np.zeros(1 , power) #??
#scalestore = zeros(1, power)
boxNum = 1
#use the for loop to shrink the box size
for i in range(1, power):
    boxCount=0
    for boxrow in range(1,2**i):#i was i-1
        for boxcol in range(1,2**i):
            # the four terms below are the index range
            # of the current box we are checking
            var1 = lengthNum/boxNum
            var2 = boxrow-1
            minRow = 1 + var1∗var2
            minCol = 1 + var1∗var2
            maxRow = var1∗boxrow
            maxCol = var1∗boxcol
            contain=0
            for row in range(minRow,maxRow):
                for col in range(minCol,maxCol):
                    if I(row,col): ###????
                        # if true, then the current box
                        # contains the object
                        boxCount=boxCount+1
                        contain=1
                        break # break from the ”col” loop
                if contain:
                    break # break from the ”row” loop
    scale=1/(lengthNum/boxNum)
    boxNum = 2*boxNum # double the number of boxes
    # per dimension
# fit a line for the log − log plot in the least square
# sense
FD = np.polyfit(np.log(scalestore),np.log(boxCountstore),1)
# return the slope
#FD=FD(1)
# (225, 400, 3)
# <class 'tuple'>
original code, matlab_pt1
original code, matlab_pt2
I couldn't figure what is wrong with the part that I use the variables var1 and var2.
Could anyone help me solve this problem?
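One likely source of the "bad character" errors is that the code was copied from a PDF or typeset document: the lines using var1 and var2 contain the typographic asterisk (U+2217) instead of *, and other lines contain the minus sign (U+2212) and curly quotes (U+2019), none of which Python accepts as operators or string delimiters. A small sketch for locating such characters in a source string; the helper name find_non_ascii is hypothetical and the sample lines are written with escapes so the snippet itself stays ASCII:

```python
def find_non_ascii(source):
    """Return (line, column, character) for every non-ASCII character."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((lineno, col, ch))
    return hits

# Two lines resembling the pasted code, with the typographic operators
sample = "minRow = 1 + var1\u2217var2\npadCol = lengthNum \u2212 n"
for lineno, col, ch in find_non_ascii(sample):
    print(f"line {lineno}, column {col}: U+{ord(ch):04X}")
```

After replacing those characters with their ASCII equivalents, the remaining problems are ordinary translation issues, for example MATLAB's I(row,col) indexing, which in NumPy is written I[row, col] with zero-based indices.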

Try to get a multiple plot with different X-axes

I need to get a plot like this:
I know I have to give each curve its own time axis, but I don't know how. There is also something wrong with my plot: I think it draws more than one curve and they lie on top of each other.
Update:
With the help of @Zephyr, it now looks like this:
But it should look like this:
Here is the code:
import mpmath as mp
import numpy as np
import sympy
import matplotlib.pyplot as plt
beta = 0.25;
L=[0.04980905361573844, 0.0208207352451072, 0.012368753716465475, 0.009117292529338674, 0.007461219338976698, 0.006510364609693688, 0.005899506135250773, 0.005485130537183343, 0.0051898472561455, 0.004961157595209418, 0.004778617403698715, 0.00463084459959999, 0.004510113095117956, 0.004410195593051274, 0.004330450690278247]
Lc=[1.7509008762765992, 0.14986486457338544, 0.03453912303580302, 0.014622269851256788, 0.00831141421008418, 0.005660123321843252, 0.004287823173522503, 0.0034922189865254395, 0.0029879534061896186, 0.0026315863522143363, 0.002367747989524076, 0.0021671535838986545, 0.0020116727106455415, 0.0018885896058416002, 0.0017939246597491803]
for k0 in [0.052,0.12,0.252,0.464,0.792,1.264,1.928,2.824,4,5.600,7.795,10.806,14.928,20.599,28.000]:
    t=np.linspace(k0,30,50)
    for i in range(len(L)):
        j0 = L[i];
        j1 = Lc[i];
        G = []
        def f(s):
            return s**(beta - 1)/(j0*s**beta + j1*sympy.gamma(beta + 1))
        for j in range(len(t)):
            G.append(mp.invertlaplace(f, t[j], method = 'dehoog', dps = 10, degree = 50))
        plt.plot(t,G)
The result should look like that, but with the following version I get an error:
import numpy as np
import sympy
beta = 0.25;
L=[0.04980905361573844, 0.0208207352451072, 0.012368753716465475, 0.009117292529338674, 0.007461219338976698, 0.006510364609693688, 0.005899506135250773, 0.005485130537183343, 0.0051898472561455, 0.004961157595209418, 0.004778617403698715, 0.00463084459959999, 0.004510113095117956, 0.004410195593051274, 0.004330450690278247]
Lc=[1.7509008762765992, 0.14986486457338544, 0.03453912303580302, 0.014622269851256788, 0.00831141421008418, 0.005660123321843252, 0.004287823173522503, 0.0034922189865254395, 0.0029879534061896186, 0.0026315863522143363, 0.002367747989524076, 0.0021671535838986545, 0.0020116727106455415, 0.0018885896058416002, 0.0017939246597491803]
for k0 in [0.052,0.12,0.252,0.464,0.792,1.264,1.928,2.824,4,5.600,7.795,10.806,14.928,20.599,28.000]:
    t[k0]=np.linspace(k0,30,50)
    for i in range(len(L)):
        j0 = L[i];
        j1 = Lc[i];
        G = []
        def f(s):
            return s**(beta - 1)/(j0*s**beta + j1*sympy.gamma(beta + 1))
        for j in range(len(t[k0])):
            G.append(mp.invertlaplace(f, t[k0][j], method = 'dehoog', dps = 10, degree = 50))
        plt.plot(t[k0],G)
The function for the dashed line is:
31.9279939766313*exp(-0.18*sympy.sqrt(7)/sympy.sqrt(t))
Something like this should be done for each step.
You are looping one time too many; this is why you are plotting more than one line, overlapped.
You should restructure your code in this way:
import mpmath as mp
import numpy as np
import sympy
import matplotlib.pyplot as plt
beta = 0.25
L = [0.04980905361573844, 0.0208207352451072, 0.012368753716465475, 0.009117292529338674, 0.007461219338976698, 0.006510364609693688, 0.005899506135250773, 0.005485130537183343, 0.0051898472561455, 0.004961157595209418, 0.004778617403698715, 0.00463084459959999, 0.004510113095117956, 0.004410195593051274, 0.004330450690278247]
Lc = [1.7509008762765992, 0.14986486457338544, 0.03453912303580302, 0.014622269851256788, 0.00831141421008418, 0.005660123321843252, 0.004287823173522503, 0.0034922189865254395, 0.0029879534061896186, 0.0026315863522143363, 0.002367747989524076, 0.0021671535838986545, 0.0020116727106455415, 0.0018885896058416002, 0.0017939246597491803]
K = [0.052, 0.12, 0.252, 0.464, 0.792, 1.264, 1.928, 2.824, 4, 5.600, 7.795, 10.806, 14.928, 20.599, 28.000]
def f(s):
    return s**(beta - 1)/(j0*s**beta + j1*sympy.gamma(beta + 1))
for k0, j0, j1 in zip(K, L, Lc):
    t = np.linspace(k0, 30, 50)
    G = []
    for j in range(len(t)):
        G.append(mp.invertlaplace(f, t[j], method = 'dehoog', dps = 10, degree = 50))
    plt.plot(t, G)
plt.show()
Thanks to zip you can iterate over the K, L and Lc lists together, picking one element from each list at the same time; there is no need for a separate index counter over L and Lc.
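To make the difference concrete, here is a minimal sketch comparing nested loops with zip on shortened lists. The three-element lists are truncated, rounded stand-ins for the real data, used only for illustration:

```python
# Shortened, rounded stand-ins for the K, L and Lc lists
K = [0.052, 0.12, 0.252]
L = [0.0498, 0.0208, 0.0124]
Lc = [1.7509, 0.1499, 0.0345]

# Nested loops visit every combination: 3 * 3 = 9 curves
nested = [(k0, j0) for k0 in K for j0 in L]

# zip pairs the lists element by element: 3 curves, one per start time
paired = list(zip(K, L, Lc))

print(len(nested), len(paired))
print(paired[0])
```

With the full 15-element lists, the nested version draws 225 curves while the zip version draws the intended 15.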

Monte Carlo with Metropolis algorithm extremely slow in Python

I'm trying to implement a simple Monte Carlo simulation in Python (to which I'm fairly new). Coming from C, I'm probably going about it the wrong way, since my code is far too slow for what I'm asking of it: I have a hard-sphere-like potential (see V_pot(r) in the code) for 60 particles in 3D with periodic boundary conditions (PBC), so I defined the following functions
import timeit
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
from numpy import inf
#
L, kb, d, eps, DIM = 100, 1, 1, 1, 3
r_c, T = L/2, eps/(.5*kb)
beta = 1/(kb*T)
#
def dist(A, B):
    d = A - B
    d -= L*np.around(d/L)
    return np.sqrt(np.sum(d**2))
#
def V_pot(r):
    V = -eps*(d**6/r**6 - d**6/r_c**6)
    if r > r_c:
        V = 0
    elif r < d:
        V = inf
    return V
#
def ener(config):
    V_jk_val, j = 0, N
    #
    while (j > 0):
        j -= 1
        i = 0
        while (i < j):
            V_jk_val += V_pot(dist(config[j,:], config[i,:]))
            i += 1
    #
    return V_jk_val
#
def acc(en_n, en_o):
    d_en = en_n-en_o
    if (d_en <= 0):
        acc_val = 1
    else:
        acc_val = np.exp(-beta*(d_en))
    return acc_val
#
then, starting from the configuration (where every line of the array represents the coordinates of a 3D particle)
config = np.array([[16.24155657, 57.41672173, 94.39565792],
[76.38121764, 55.88334066, 5.72255163],
[38.41393783, 58.09432145, 6.26448054],
[86.44286438, 61.37100899, 91.97737383],
[37.7315366 , 44.52697269, 23.86320444],
[ 0.59231801, 39.20183376, 89.63974115],
[38.00998141, 3.84363202, 52.74021401],
[99.53480756, 69.97688928, 21.43528924],
[49.62030291, 93.60889503, 15.73723259],
[54.49195524, 0.6431965 , 25.37401196],
[33.82527814, 25.37776021, 67.4320553 ],
[64.61952893, 46.8407798 , 4.93960443],
[60.47322732, 16.48140136, 33.26481306],
[19.71667792, 46.56999616, 35.61044526],
[ 5.33252557, 4.44393836, 60.55759256],
[44.95897856, 7.81728046, 10.26000715],
[86.5548395 , 49.74079452, 4.80480133],
[52.47965686, 42.831448 , 22.03890639],
[ 2.88752006, 59.84605062, 22.75760029],
[ 9.49231045, 42.08653603, 40.63380097],
[13.90093641, 74.40377984, 32.62917915],
[97.44839233, 90.47695772, 91.60794836],
[51.29501624, 27.03796277, 57.09525454],
[10.30180295, 21.977336 , 69.54173272],
[59.61327648, 14.29582325, 11.70942289],
[89.52722796, 26.87758644, 76.34934637],
[82.03736088, 78.5665713 , 23.23587395],
[79.77571695, 66.140968 , 53.6784269 ],
[82.86070472, 40.82189833, 51.48739072],
[99.05647523, 98.63386809, 6.33888993],
[31.02997123, 66.99709163, 95.88332332],
[97.71654767, 59.24793618, 5.20183793],
[ 6.79964473, 45.01258652, 48.69477807],
[93.34977049, 55.20537774, 82.35693526],
[17.35577815, 20.45936211, 29.27981422],
[55.51942207, 52.22875901, 3.6616131 ],
[61.45612224, 36.50170405, 62.89796773],
[23.55822368, 7.09069623, 37.38274914],
[39.57082799, 58.95457592, 48.0304924 ],
[93.94997617, 64.34383203, 77.63346308],
[17.47989107, 90.01113402, 81.00648645],
[86.79068539, 66.35768515, 56.64402907],
[98.71924121, 38.33749023, 73.4715132 ],
[ 0.42356139, 78.32172925, 15.19883322],
[77.75572529, 2.60088767, 56.4683935 ],
[49.76486142, 3.01800153, 93.48019286],
[42.54483899, 4.27174457, 4.38942325],
[66.75777178, 41.1220603 , 19.64484167],
[19.69520773, 41.09230171, 2.51986091],
[73.20493772, 73.16590392, 99.19174281],
[94.16756184, 72.77653334, 10.32128552],
[29.95281655, 27.58596604, 85.12791195],
[ 2.44803886, 32.82333962, 41.6654683 ],
[23.9665915 , 49.94906612, 37.42701059],
[30.40282934, 39.63854309, 47.16572743],
[56.04809276, 30.19705527, 29.15729635],
[ 2.50566522, 70.37965564, 16.78016719],
[28.39713572, 4.04948368, 27.72615789],
[26.11873563, 41.49557167, 14.38703697],
[81.91731981, 12.10514972, 12.03083427]])
I make the 5000 time steps of the simulation with the following code
N = 60
TIME_MC = 5000
DELTA_LIST = [d]
#d/6, d/3, d, 2*d, 3*d
np.random.seed(19680801)
en_mc_delta = np.zeros((TIME_MC, len(DELTA_LIST)))
start = timeit.default_timer()
config_tmp = config
#
for iD, Delta in enumerate(DELTA_LIST):
    t=0
    while (t < TIME_MC):
        for k in range(N):
            RND = np.random.rand()
            config_tmp[k,:] = config[k,:] + Delta*(np.random.random_sample((1,3))-.5)
            en_o, en_n = ener(config), ener(config_tmp)
            ACC = acc(en_n, en_o)
            if (RND < ACC):
                config[k,:] = config_tmp[k,:]
                en_o = en_n
        en_mc_delta[t][iD] = en_o
        t += 1
stop = timeit.default_timer()
print('Time: ', stop-start)
following the Metropolis rule for accepting the proposed move generated with config_tmp[k,:] = config[k,:] + Delta*(np.random.random_sample((1,3))-.5).
I made some attempts to find where the code gets stuck, and I found that the function ener (partly because of the function dist) is extremely slow: it takes roughly 0.02s to calculate the energy of a configuration, which means around 6000s to run the complete simulation (60 particles, 5000 proposed moves).
The outer for loop is just there to calculate the results for different values of Delta.
Running this code with TIME_MC=60 gives an idea of how slow it is (~218s); the same thing takes just a few seconds when implemented in C. I read some other questions about how to speed up Python code, but I can't understand how to apply them here.
EDIT:
I'm now almost sure that the problem is in the function dist, since calculating the PBC distance between two 3D vectors takes around 0.0012s, which gives extremely long times when it is calculated 5000*60 times.
Note that this is a partial answer continued from comments on the original question.
Here's an example of how "unrolling" numpy's function can improve performance when replaced with a more direct calculation of the distance. Note that this was not verified to be equivalent, especially concerning the rounding. The principle still applies, I think.
import random
import time
import numpy as np
L = 100
inv_L = 0.01
vec_length = 10
repetitions = 100000
def dist_np(A, B):
    d = A - B
    d -= L*np.around(d/L)
    return np.sqrt(np.sum(d**2))
def dist_direct(A, B):
    total = 0
    # NOTE: A has shape (1, vec_length), so len(A) is 1 and this loop runs
    # only once; use A.shape[1] to cover every component of the vector.
    for i in range(0, len(A)):
        diff = (A[0,i] - B[0,i])
        # int() truncates toward zero, whereas np.around rounds to nearest
        diff -= L * int(diff * inv_L)
        total += diff * diff
    return np.sqrt(total)
vec1 = np.zeros((1,vec_length))
vec2 = np.zeros((1,vec_length))
for i in range(0, vec_length):
    vec1[0,i] = random.random()
    vec2[0,i] = random.random()
print("with numpy method:")
start = time.time()
for i in range(0, repetitions):
    dist_np(vec1, vec2)
print("done in {}".format(time.time() - start))
print("with direct method:")
start = time.time()
for i in range(0, repetitions):
    dist_direct(vec1, vec2)
print("done in {}".format(time.time() - start))
Output:
with numpy method:
done in 6.332799911499023
with direct method:
done in 1.0938000679016113
Play around with the average vector length and the repetitions to see where the sweet spot is. I expect the performance gain is not constant when varying these meta-parameters.
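A different direction, often suggested for this kind of problem, is to keep NumPy but remove the Python-level double loop in ener by computing all pair distances at once with broadcasting. The sketch below is an untimed illustration, not a drop-in replacement: the function names pairwise_pbc_distances and total_energy are invented here, the potential is assumed to be the V_pot from the question, and the default parameters mirror the question's constants (d = 1, eps = 1, r_c = L/2 = 50).

```python
import numpy as np

def pairwise_pbc_distances(config, box):
    """All pair distances under the minimum-image convention."""
    config = np.asarray(config, dtype=float)
    # delta[i, j, :] = config[i] - config[j], shape (N, N, 3)
    delta = config[:, None, :] - config[None, :, :]
    delta -= box * np.round(delta / box)
    return np.sqrt((delta ** 2).sum(axis=-1))

def total_energy(config, box, d=1.0, eps=1.0, r_c=50.0):
    """Sum of the pair potential over all unique pairs i < j."""
    r = pairwise_pbc_distances(config, box)
    r = r[np.triu_indices(len(r), k=1)]  # keep each pair once
    if np.any(r < d):
        return np.inf  # overlapping hard cores
    v = -eps * (d**6 / r**6 - d**6 / r_c**6)
    v[r > r_c] = 0.0  # cutoff beyond r_c
    return v.sum()

# Two particles on opposite sides of the box are 2 apart through the boundary
pair = np.array([[1.0, 0.0, 0.0], [99.0, 0.0, 0.0]])
print(pairwise_pbc_distances(pair, 100.0)[0, 1])
print(total_energy(pair, 100.0))
```

Since a Metropolis step moves a single particle, only the N-1 pair terms involving that particle change, so recomputing just those terms instead of the full energy would likely reduce the cost further.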

How to get the index of a list items in another list?

Consider I have these lists:
l = [5,6,7,8,9,10,5,15,20]
m = [10,5]
I want to get the index of m in l. I used list comprehension to do that:
[(i,i+1) for i,j in enumerate(l) if m[0] == l[i] and m[1] == l[i+1]]
Output : [(5,6)]
But if I have more numbers in m, I feel it's not the right way. So is there an easy approach in Python or with NumPy?
Another example:
l = [5,6,7,8,9,10,5,15,20,50,16,18]
m = [10,5,15,20]
The output should be:
[(5,6,7,8)]
The easiest way (using pure Python) would be to iterate over the items and first only check if the first item matches. This avoids doing sublist comparisons when not needed. Depending on the contents of your l this could outperform even NumPy broadcasting solutions:
def func(haystack, needle):  # obviously needs a better name ...
    if not needle:
        return
    # just optimization
    lengthneedle = len(needle)
    firstneedle = needle[0]
    for idx, item in enumerate(haystack):
        if item == firstneedle:
            if haystack[idx:idx+lengthneedle] == needle:
                yield tuple(range(idx, idx+lengthneedle))
>>> list(func(l, m))
[(5, 6, 7, 8)]
In case you're interested in speed, I checked the performance of the approaches (borrowing from my setup here):
import random
import numpy as np
# strided_app is from https://stackoverflow.com/a/40085052/
def strided_app(a, L, S ):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size-L)//S)+1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows,L), strides=(S*n,n))
def pattern_index_broadcasting(all_data, search_data):
    n = len(search_data)
    all_data = np.asarray(all_data)
    all_data_2D = strided_app(np.asarray(all_data), n, S=1)
    return np.flatnonzero((all_data_2D == search_data).all(1))
# view1D is from https://stackoverflow.com/a/45313353/
def view1D(a, b):  # a, b are arrays
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()
def pattern_index_view1D(all_data, search_data):
    a = strided_app(np.asarray(all_data), L=len(search_data), S=1)
    a0v, b0v = view1D(np.asarray(a), np.asarray(search_data))
    return np.flatnonzero(np.in1d(a0v, b0v))
def find_sublist_indices(haystack, needle):
    if not needle:
        return
    # just optimization
    lengthneedle = len(needle)
    firstneedle = needle[0]
    restneedle = needle[1:]
    for idx, item in enumerate(haystack):
        if item == firstneedle:
            if haystack[idx+1:idx+lengthneedle] == restneedle:
                yield tuple(range(idx, idx+lengthneedle))
def Divakar1(l, m):
    return np.squeeze(pattern_index_broadcasting(l, m)[:,None] + np.arange(len(m)))
def Divakar2(l, m):
    return np.squeeze(pattern_index_view1D(l, m)[:,None] + np.arange(len(m)))
def MSeifert(l, m):
    return list(find_sublist_indices(l, m))
# Timing setup
timings = {Divakar1: [], Divakar2: [], MSeifert: []}
sizes = [2**i for i in range(5, 20, 2)]
# Timing
for size in sizes:
    l = [random.randint(0, 50) for _ in range(size)]
    m = [random.randint(0, 50) for _ in range(10)]
    larr = np.asarray(l)
    marr = np.asarray(m)
    for func in timings:
        # first timings:
        # res = %timeit -o func(l, m)
        # second timings:
        if func is MSeifert:
            res = %timeit -o func(l, m)
        else:
            res = %timeit -o func(larr, marr)
        timings[func].append(res)
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(1)
ax = plt.subplot(111)
for func in timings:
    ax.plot(sizes,
            [time.best for time in timings[func]],
            label=str(func.__name__))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()
In case your l and m are lists, my function outperforms the NumPy solutions for all sizes:
But in case you have these as NumPy arrays, you'll get faster results for large arrays (size > 1000 elements) when using Divakar's NumPy solutions:
You are basically looking for the starting indices of a list in another list.
Approach #1 : One approach is to create sliding windows over the list being searched, giving us a 2D array, and then use NumPy broadcasting to compare the search list against each row of that 2D sliding-window array. Thus, one method would be -
# strided_app is from https://stackoverflow.com/a/40085052/
def strided_app(a, L, S ):  # Window len = L, Stride len/stepsize = S
    nrows = ((a.size-L)//S)+1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows,L), strides=(S*n,n))
def pattern_index_broadcasting(all_data, search_data):
    n = len(search_data)
    all_data = np.asarray(all_data)
    all_data_2D = strided_app(np.asarray(all_data), n, S=1)
    return np.flatnonzero((all_data_2D == search_data).all(1))
out = np.squeeze(pattern_index_broadcasting(l, m)[:,None] + np.arange(len(m)))
Sample runs -
In [340]: l = [5,6,7,8,9,10,5,15,20,50,16,18]
...: m = [10,5,15,20]
...:
In [341]: np.squeeze(pattern_index_broadcasting(l, m)[:,None] + np.arange(len(m)))
Out[341]: array([5, 6, 7, 8])
In [342]: l = [5,6,7,8,9,10,5,15,20,50,16,18,10,5,15,20]
...: m = [10,5,15,20]
...:
In [343]: np.squeeze(pattern_index_broadcasting(l, m)[:,None] + np.arange(len(m)))
Out[343]:
array([[ 5, 6, 7, 8],
[12, 13, 14, 15]])
Approach #2 : Another method would be to get the sliding windows and then take a row-wise scalar view of both the data to be searched and the data to search for, giving us 1D data to work with, like so -
# view1D is from https://stackoverflow.com/a/45313353/
def view1D(a, b):  # a, b are arrays
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()
def pattern_index_view1D(all_data, search_data):
    a = strided_app(np.asarray(all_data), L=len(search_data), S=1)
    a0v, b0v = view1D(np.asarray(a), np.asarray(search_data))
    return np.flatnonzero(np.in1d(a0v, b0v))
out = np.squeeze(pattern_index_view1D(l, m)[:,None] + np.arange(len(m)))
2020 Versions
In search of more easy/compact approaches, we could look into scikit-image's view_as_windows for getting sliding windows with a built-in. I am assuming arrays as inputs for less messy code. For lists as input, we have to use np.asarray() as shown earlier.
Approach #3 : Basically a derivative of pattern_index_broadcasting with view_as_windows for a one-liner with a as the larger data and b is the array to be searched -
from skimage.util import view_as_windows
np.flatnonzero((view_as_windows(a,len(b))==b).all(1))[:,None]+np.arange(len(b))
Approach #4 : For a small number of matches from b in a, we could optimize, by looking for first element match from b to reduce the dataset size for searches -
mask = a[:len(a)-len(b)+1]==b[0]
mask[mask] = (view_as_windows(a,len(b))[mask]==b).all(1)
out = np.flatnonzero(mask)[:,None]+np.arange(len(b))
Approach #5 : For a small sized b, we could simply run a loop for each of the elements in b and perform bitwise and-reduction -
mask = np.bitwise_and.reduce([a[i:len(a)-len(b)+1+i]==b[i] for i in range(len(b))])
out = np.flatnonzero(mask)[:,None]+np.arange(len(b))
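For readers without scikit-image, the same sliding-window idea can be sketched with NumPy alone, assuming NumPy 1.20 or newer, where sliding_window_view is available. The wrapper name find_subarray_indices is made up for this example, and unlike the np.squeeze variants above it always returns a 2D array with one row per match:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def find_subarray_indices(a, b):
    """Return one row of indices into a for every occurrence of b."""
    a = np.asarray(a)
    b = np.asarray(b)
    if b.size == 0 or b.size > a.size:
        return np.empty((0, b.size), dtype=int)
    windows = sliding_window_view(a, b.size)  # shape (len(a)-len(b)+1, len(b))
    starts = np.flatnonzero((windows == b).all(axis=1))
    return starts[:, None] + np.arange(b.size)

l = [5, 6, 7, 8, 9, 10, 5, 15, 20, 50, 16, 18, 10, 5, 15, 20]
m = [10, 5, 15, 20]
print(find_subarray_indices(l, m))
```

Keeping the result 2D means the caller does not have to distinguish between one match and several.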
Just making the point that @MSeifert's approach can, of course, also be implemented in numpy:
def pp(h,n):
    nn = len(n)
    NN = len(h)
    c = (h[:NN-nn+1]==n[0]).nonzero()[0]
    if c.size==0: return
    for i,l in enumerate(n[1:].tolist(),1):
        c = c[h[i:][c]==l]
        if c.size==0: return
    return np.arange(c[0],c[0]+nn)
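from collections import defaultdict
def get_data(l1,l2):
    # map each value in l1 to the list of positions where it occurs
    d=defaultdict(list)
    for index,item in enumerate(l1):
        d[item].append(index)
    print(d)
This uses a defaultdict to store the indices at which each element of the searched list occurs.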
