fast implementation of stepwise regression


From Wikipedia (https://en.wikipedia.org/wiki/Stepwise_regression):
Forward selection, which involves starting with no variables in the
model, testing the addition of each variable using a chosen model
comparison criterion, adding the variable (if any) that improves the
model the most, and repeating this process until none improves the
model.
I think the implementation of this algorithm is very interesting because it can be seen as a combinatorial version of the hill climbing algorithm, where the neighbour function amounts to adding one variable to the current model.
I do not have enough experience to write this algorithm in an optimized way. This is my current implementation:
import numpy as np
from collections import deque
from sklearn.linear_model import LinearRegression
class FSR():
    def __init__(self, n_components):
        self.n_components = n_components
    def cost(self, index):
        # residual norm of a linear fit on the selected columns
        lr = LinearRegression().fit(self.x[:, index], self.y)
        hat_y = lr.predict(self.x[:, index])
        e = np.linalg.norm(hat_y - self.y)
        return e
    def next_step_fsr(self, comp, cand):
        """ given the current components and candidates the function
        return the new components, the new candidates and the new EV"""
        if comp == []:
            er = np.inf
        else:
            er = self.cost(comp)
        for i in range(len(cand)):
            e = cand.popleft()
            comp.append(e)
            new_er = self.cost(comp)
            if new_er < er:
                new_comp = comp.copy()
                new_cand = deque(i for i in cand)
                er = new_er
            comp.pop()
            cand.append(e)
        return new_comp, new_cand, new_er
    def fsr(self):
        n, p = self.x.shape
        er = []
        comp = []
        cand = deque(range(p))
        for i in range(self.n_components):
            comp, cand, new_er = self.next_step_fsr(comp, cand)
            er.append(new_er)
        return comp, er
    def fit(self, x, y):
        self.x = x
        self.y = y
        self.comp_, self.er_ = self.fsr()
I would like to know how I can improve the speed of this code.
x = np.random.normal(0,1, (100,20))
y = x[:,1] + x[:,2] + np.random.normal(0,.1, 100)
fsr = FSR(n_components=2)
fsr.fit(x,y)
print('selected component = ', fsr.comp_)
I want the final code to look not too different from the posted one, because I would like to extend the approach to other combinatorial problems with different cost functions as well.
I think that the function that should be changed is next_step_fsr, which, given the currently selected variables, tries each candidate to find the best variable to include in the model. In particular I am interested in situations where x has a lot of columns (like 10000). I think that the current bottleneck is the line new_cand = deque(i for i in cand), where the list of candidates is copied.
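One possible restructuring, given as a sketch rather than a tuned implementation: instead of copying the candidate container on every improvement, each step remembers only the best candidate and removes it from a set once the step is over. The function names, the use of np.linalg.lstsq with an explicit intercept column in place of LinearRegression, and the choice to always add the best candidate are assumptions made to keep the example self-contained. For a large number of columns the model fits, one per candidate per step, are likely to cost far more than the copy, so the cost function is the part worth swapping for something cheaper.

```python
import numpy as np


def cost(x, y, index):
    """Residual norm of a least-squares fit on the selected columns.

    Assumption: plain least squares with an intercept column, standing in
    for the LinearRegression-based cost of the question.
    """
    design = np.column_stack([np.ones(len(y)), x[:, index]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.linalg.norm(design @ coef - y)


def next_step(x, y, comp, cand):
    """Try every candidate once and remember only the best one."""
    best_er, best_c = np.inf, None
    for c in cand:
        er = cost(x, y, comp + [c])
        if er < best_er:
            best_er, best_c = er, c
    return best_c, best_er


def forward_selection(x, y, n_components):
    comp, errors = [], []
    cand = set(range(x.shape[1]))  # O(1) removal, nothing is copied
    for _ in range(n_components):
        best_c, best_er = next_step(x, y, comp, cand)
        if best_c is None:  # no candidates left
            break
        comp.append(best_c)
        cand.remove(best_c)
        errors.append(best_er)
    return comp, errors


rng = np.random.default_rng(0)
x = rng.normal(0, 1, (100, 20))
y = x[:, 1] + x[:, 2] + rng.normal(0, 0.1, 100)
comp, errors = forward_selection(x, y, n_components=2)
print("selected components =", comp)
```

Because cost is a free function of the data and an index list, a different combinatorial problem only needs a different cost; next_step and forward_selection stay the same.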

Related

Optimization of A* implementation in python

To solve problem 83 of Project Euler I tried to use the A* algorithm. The algorithm works fine for the given problem and I get the correct result. But when I visualized the algorithm I realized that it seems to check way too many nodes. Is it because I didn't implement the algorithm properly, or am I missing something else? I tried using two different heuristic functions, which you can see in the code below, but the output didn't change much.
Are there any tips to make the code efficient?
import heapq
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from matplotlib import animation
import numpy as np
class PriorityQueue:
    def __init__(self):
        self.elements = []
    def empty(self):
        return not self.elements
    def put(self, item, priority):
        heapq.heappush(self.elements, (priority, item))
    def get(self):
        return heapq.heappop(self.elements)[1]
class A_star:
    def __init__(self, data, start, end):
        self.data = data
        self.start = start
        self.end = end
        self.a = len(self.data)
        self.b = len(self.data[0])
    def h_matrix(self):
        elements = sorted([self.data[i][j] for j in range(self.b) for i in range(self.a)])
        n = self.a + self.b - 1
        minimum = elements[:n]
        h = []
        for i in range(self.a):
            h_i = []
            for j in range(self.b):
                h_i.append(sum(minimum[:(n-i-j-1)]))
            h.append(h_i)
        return h
    def astar(self):
        h = self.h_matrix()
        open_list = PriorityQueue()
        open_list.put(self.start, 0)
        came_from = {}
        cost_so_far = {}
        came_from[self.start] = None
        cost_so_far[self.start] = self.data[0][0]
        checked = []
        while not open_list.empty():
            current = open_list.get()
            checked.append(current)
            if current == self.end:
                break
            neighbors = [(current[0]+x, current[1]+y) for x, y in {(-1,0), (0,-1), (1,0), (0,1)}
                         if 0 <= current[0]+x < self.a and 0 <= current[1]+y < self.b]
            for next in neighbors:
                new_cost = cost_so_far[current] + self.data[next[0]][next[1]]
                if next not in cost_so_far or new_cost < cost_so_far[next]:
                    cost_so_far[next] = new_cost
                    priority = new_cost + h[next[0]][next[1]]
                    open_list.put(next, priority)
                    came_from[next] = current
        return came_from, checked, cost_so_far[self.end]
    def reconstruct_path(self):
        paths = self.astar()[0]
        best_path = [self.end]
        while best_path[0] is not None:
            new = paths[best_path[0]]
            best_path.insert(0, new)
        return best_path[1:]
    def minimum(self):
        return self.astar()[2]
if __name__ == "__main__":
    liste = [[131, 673, 234, 103, 18], [201, 96, 342, 965, 150], [630, 803, 746, 422, 111], [537, 699, 497, 121, 956], [805, 732, 524, 37, 331]]
    path = A_star(liste, (0,0), (4,4))
    print(path.astar())
    #print(path.reconstruct_path())
    # plot_path (the visualization method) is not included in the posted code
    path.plot_path(speed=200)
Here you can see my visualization for the 80x80 matrix given in the problem. Blue are all the points in checked and red is the optimal path. From my understanding it shouldn't be the case that every point in the matrix is in checked i.e. blue.
https://i.stack.imgur.com/LKkdh.png
My initial guess would be that my heuristic function is not good enough. If I choose h=0, which amounts to Dijkstra's algorithm, the length of my checked list is 6400. In contrast, if I use my custom h the length is 6455. But how can I improve the heuristic function for an arbitrary matrix?

Let me start at the end of your post:
Marking cells as checked
if I use my custom h the length is 6455.
The size of checked should never exceed the number of cells. So let me first suggest an improvement for that: instead of using a list, use a set, and skip anything popped from the priority queue that is already in the set. The relevant code will then look like this:
checked = set()
while not open_list.empty():
    current = open_list.get()
    if current in checked: # don't repeat the same search
        continue
    checked.add(current)
And if in the end you need the list version of checked, just return it that way:
return came_from, list(checked), cost_so_far[self.end]
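To see the skip in isolation, here is a small self-contained sketch of the same pattern with h = 0 (plain Dijkstra) on the 5x5 example matrix from the question. The function name dijkstra_grid is made up for this illustration; the point is only that stale queue entries are discarded when popped, so no cell is processed twice.

```python
import heapq


def dijkstra_grid(grid, start, end):
    """Cheapest path cost on a grid, counting the start cell's own cost."""
    rows, cols = len(grid), len(grid[0])
    cost_so_far = {start: grid[start[0]][start[1]]}
    heap = [(cost_so_far[start], start)]
    checked = set()
    while heap:
        cost, current = heapq.heappop(heap)
        if current in checked:  # stale duplicate entry: skip it
            continue
        checked.add(current)
        if current == end:
            break
        r, c = current
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                new_cost = cost + grid[nr][nc]
                if (nr, nc) not in cost_so_far or new_cost < cost_so_far[(nr, nc)]:
                    cost_so_far[(nr, nc)] = new_cost
                    heapq.heappush(heap, (new_cost, (nr, nc)))
    return cost_so_far[end], checked


liste = [[131, 673, 234, 103, 18], [201, 96, 342, 965, 150],
         [630, 803, 746, 422, 111], [537, 699, 497, 121, 956],
         [805, 732, 524, 37, 331]]
total, checked = dijkstra_grid(liste, (0, 0), (4, 4))
print(total, len(checked))
```

Since checked is a set of cells, its size is bounded by the number of cells, here 25.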
Now to the main question:
Improving heuristic function
From my understanding it shouldn't be the case that every point in the matrix is in checked i.e. blue. My initial guess would be that my heuristic function is not good enough.
That is the right explanation. In addition, the given matrix has many paths whose total costs are quite close to each other, so the region in which they compete covers much of the matrix.
how can I improve the heuristic function for an arbitrary matrix?
One idea is to consider the following. A path must include at least one element from each "forward" diagonal (/). So if we work with the minimum value on each such diagonal and create running sums (backwards -- starting from the target), we'll have a workable value for h.
Here is the code for that idea:
def h_matrix(self):
    min_diagonals = [float("inf")] * (self.a + self.b - 1)
    # For each diagonal get the minimum cost on that diagonal
    for i, row in enumerate(self.data):
        for j, cost in enumerate(row):
            min_diagonals[i+j] = min(min_diagonals[i+j], cost)
    # Create a running sum in backward direction
    for i in range(len(min_diagonals) - 2, -1, -1):
        min_diagonals[i] += min_diagonals[i + 1]
    min_diagonals.append(0) # Add an entry for allowing the next logic to work
    # These sums are a lower bound for the remaining cost towards the target
    h = [
        [min_diagonals[i + j + 1] for j in range(self.b)]
        for i in range(self.a)
    ]
    return h
With these improvements, we get these counts:
len(cost_so_far) == 6374
len(checked) == 6339
This is still a large portion of the matrix, but at least a few cells were left out.
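As a worked check of the diagonal idea, the sketch below applies it as a standalone function to the 5x5 example matrix from the question. The name diagonal_heuristic is an assumption for this illustration. The minima of diagonals 1 to 8 are 201, 96, 103, 18, 150, 111, 37 and 331, so the estimate at the start cell is their sum, 1047, which stays below the true remaining cost of 2297 - 131 = 2166.

```python
def diagonal_heuristic(grid):
    """Lower bound on the remaining cost from each cell to the bottom-right cell."""
    rows, cols = len(grid), len(grid[0])
    mins = [float("inf")] * (rows + cols - 1)
    # Minimum cell cost on each diagonal i + j
    for i, row in enumerate(grid):
        for j, cost in enumerate(row):
            mins[i + j] = min(mins[i + j], cost)
    # suffix[d] = sum of the minima on diagonals d, d+1, ..., last
    suffix = mins + [0]
    for d in range(len(mins) - 1, -1, -1):
        suffix[d] = mins[d] + suffix[d + 1]
    # A cell on diagonal i + j still has to cross every later diagonal
    return [[suffix[i + j + 1] for j in range(cols)] for i in range(rows)]


liste = [[131, 673, 234, 103, 18], [201, 96, 342, 965, 150],
         [630, 803, 746, 422, 111], [537, 699, 497, 121, 956],
         [805, 732, 524, 37, 331]]
h = diagonal_heuristic(liste)
print(h[0][0], h[4][4])  # prints: 1047 0
```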

Different forms of genetic algorithm

I wrote code that implements a simple genetic algorithm to maximize:
f(x) = 15x - x^2
The function has its maximum at 7.5, so the code output should be 7 or 8 since the population consists of integers.
When I run the code 10 times I get 7 or 8 around three times out of 10.
What modification should I make to further improve the algorithm and what are different types of genetic algorithms?
Here is the code:
from random import randint
import numpy as np
#fitness function
def fit(x):
    return 15*x -x**2
#convert binary list to decimal number
def to_dec(x):
    return int("".join(str(e) for e in x), 2)
#picks pairs from the original population
def gen_pairs(populationl, prob):
    pairsl = []
    test = [0, 1, 2, 3, 4, 5]
    for i in range(3):
        pair = []
        for j in range(2):
            temp = np.random.choice(test, p=prob)
            pair.append(populationl[temp].copy())
        pairsl.append(pair)
    return pairsl
#mating function
def cross_over(prs, mp):
    new = []
    # enumerate instead of prs.index(pr), which returns the first match when two pairs are equal
    for k, pr in enumerate(prs):
        if mp[k] == 1:
            index = np.random.choice([1,2,3], p=[1/3, 1/3, 1/3])
            pr[0][:index], pr[1][:index] = pr[1][:index], pr[0][:index]
    for pr in prs:
        new.append(pr[0])
        new.append(pr[1])
    return new
#mutation
def mutation(x):
    for chromosome in x:
        for k, gene in enumerate(chromosome):
            mutation_prob = np.random.choice([0, 1], p=[0.999, .001])
            if mutation_prob == 1:
                # flip the bit in the list itself; assigning to the loop variable changes nothing
                chromosome[k] = 1 - gene
    return x
#generate initial population
randlist = lambda n:[randint(0,1) for b in range(1, n+1)]
for j in range(10):
    population = [randlist(4) for i in range(6)]
    for _ in range(20):
        fittness = [fit(to_dec(y)) for y in population]
        s = sum(fittness)
        prob = [e/s for e in fittness]
        pairsg = gen_pairs(population.copy(), prob)
        mating_prob = []
        for i in pairsg:
            mating_prob.append(np.random.choice([0,1], p=[0.4,0.6]))
        new_population = cross_over(pairsg, mating_prob)
        mutated = mutation(new_population)
        decimal_p = [to_dec(i)for i in population]
        decimal_new = [to_dec(i)for i in new_population]
        # print(decimal_p)
        # print(decimal_new)
        population = new_population
    print(decimal_new)

This is a very typical situation with evolutionary algorithms. Success rate is a quite common metric, and 30% is a decent result.
As an example, I recently implemented a GP/GE solver for the Santa Fe Trail problem, and it shows a success rate of 30% or less.
How to improve success rate
A personal interpretation of the problem based on limited experience follows.
An evolutionary algorithm fails to find a solution close to the global optimum when it converges around a local optimum or gets stuck on a large plateau, and its population does not have enough diversity to escape this trap by finding a better region.
You may try to supply your algorithm with more diversity by increasing the size of the population. Or you may look into techniques like novelty search, and quality diversity.
By the way, here is a very nice interactive demonstration of novelty search vs. fitness search: http://eplex.cs.ucf.edu/noveltysearch/userspage/demo.html

Find a formula that describes a given list of integers

I'm trying to find some Python library or method that will create a formula that describes an arbitrary list of integers like: [1,1,2,4,3,1,2,3,4,1,...]
For example:
from some_awesome_library import magic_method
seq = [1,1,2,4,3,1,2,3,4,1,4,3,2,2,4,3]
def my_func(sequence):
    equation = magic_method(sequence)
    return equation
print(my_func(seq))
The order of the sequence matters, but it has certain rules. For example, all integers will be between 1 and 4, and there will be an equal number of each integer within the sequence.
I've looked into numpy.polyfit and scipy.optimize.leastsq. I suspect that scipy is what I need, but it'd be great to have confirmation of that approach and any suggestions for the types of mathematical functions I should look into using (I'm not much of a maths person; I only studied up to college calculus). Maybe some sort of modulo function? Maybe a sine wave?
Thanks in advance for any help or suggestions you have.
EDIT: Thanks for your comments below. I'm curious about Sudoku puzzles, specifically N=2 puzzles. I suspect that if you take the entire solution space and line them up in a certain order that patterns will emerge that might be useful for solving Sudoku faster. I've got a class that represents the solution space called SolutionManager and returns "slices" of the solution space that look like the list of integers shown above. Below is an image of one such example for an N=2 puzzle solution space (generated with Jupyter Notebook):
I think I can see patterns in the data, but I'm trying to figure out how to develop formulas that represent these patterns. I also suspect that reordering the solutions will make for simpler equations describing those patterns.
To prove that, I'm trying to write a genetic algorithm that will reorder the solutions described in the SolutionManager according to how simple the equations describing them are. I'm stuck on writing the fitness function, which should rate the SolutionManager instance by how simple its descriptive equations are.
Code for the SolutionManager is below:
class SolutionManager:
    """Object that contains all possible Solutions for an n=2 Sudoku array"""
    def __init__(self, solutions_file):
        with open(solutions_file, 'r') as input_file:
            grids = [r.strip() for r in input_file.readlines()]
        list_of_solutions = []
        i = 0
        for grid in grids:
            list_of_cubes = []
            i += 1
            for r in range(4):
                for c in range(4):
                    pos = r * 4 + c
                    digit = int(grid[pos])
                    cube = Cube(i, c, r, digit)
                    list_of_cubes.append(cube)
            list_of_solutions.append(Solution(list_of_cubes))
        self.solutions = list_of_solutions
        assert isinstance(self.solutions, list)
        assert all(isinstance(x, Solution) for x in self.solutions)
    def get_vertical_slice(self, slice_index):
        """Get a vertical slice of the Solution Space"""
        # valid indices are 0..3 (the original check also let 4 through)
        assert 0 <= slice_index < 4
        slice = []
        for sol in self.solutions:
            slice.append(sol.get_column(slice_index))
        return slice
    def get_horizontal_slice(self, slice_index):
        """Get a horizontal slice of the Solution Space"""
        assert 0 <= slice_index < 4
        slice = []
        for sol in self.solutions:
            slice.append(sol.get_row(slice_index))
        return slice
    def sort_solutions_by_vertical_axis(self, axis_index):
        """Sorts the solutions by a vertical axis using an algorithm"""
        pass
class Solution:
    def __init__(self, cubes):
        assert (len(cubes) == 16)
        self.solution_cubes = cubes
    # Note: the posted code compared row with c and column with r (swapped), and the
    # two *_value methods used names that were not their parameters; fixed below.
    def get_column(self, c):
        return list(_cube for _cube in self.solution_cubes if _cube.column == c)
    def get_row(self, r):
        return list(_cube for _cube in self.solution_cubes if _cube.row == r)
    def get_column_value(self, c, v):
        single_cube = list(_cube for _cube in self.solution_cubes if _cube.column == c and _cube.value == v)
        assert (len(single_cube) == 1)
        return single_cube[0]
    def get_row_value(self, r, v):
        single_cube = list(_cube for _cube in self.solution_cubes if _cube.row == r and _cube.value == v)
        assert (len(single_cube) == 1)
        return single_cube[0]
    def get_position(self, r, c):
        single_cube = list(_cube for _cube in self.solution_cubes if _cube.column == c and _cube.row == r)
        assert (len(single_cube) == 1)
        return single_cube[0]
class Cube:
    def __init__(self, d, c, r, v):
        self.depth = d
        self.column = c
        self.row = r
        self.value = v
    def __str__(self):
        return str(self.value)

Quickly counting particles in grid

I've written some Python code to calculate a certain quantity from a cosmological simulation. It does this by checking whether a particle is contained within a box of size 8,000^3, starting at the origin and advancing the box once all particles contained within it have been found. As I am counting ~2 million particles altogether, and the total size of the simulation volume is 150,000^3, this is taking a long time.
I'll post my code below. Does anybody have any suggestions on how to improve it?
Thanks in advance.
from __future__ import division
import numpy as np
def check_range(pos, i, j, k):
    a = 0
    if i <= pos[2] < i+8000:
        if j <= pos[3] < j+8000:
            if k <= pos[4] < k+8000:
                a = 1
    return a
def sigma8(data):
    N = []
    to_do = data
    print 'Counting number of particles per cell...'
    for k in range(0,150001,8000):
        for j in range(0,150001,8000):
            for i in range(0,150001,8000):
                temp = []
                n = []
                for count in range(len(to_do)):
                    n.append(check_range(to_do[count],i,j,k))
                    to_do[count][1] = n[count]
                    if to_do[count][1] == 0:
                        temp.append(to_do[count])
                #Only particles that have not been found are
                # searched for again
                to_do = temp
                N.append(sum(n))
            print 'Next row'
        print 'Next slice, %i still to find' % len(to_do)
    print 'Calculating sigma8...'
    if not sum(N) == len(data):
        return 'Error!\nN measured = {0}, total N = {1}'.format(sum(N), len(data))
    else:
        return 'sigma8 = %.4f, variance = %.4f, mean = %.4f' % (np.sqrt(sum((N-np.mean(N))**2)/len(N))/np.mean(N), np.var(N),np.mean(N))

I'll try to post some code, but my general idea is the following: create a Particle class that knows about the box that it lives in, which is calculated in the __init__. Each box should have a unique name, which might be the coordinate of the bottom left corner (or whatever you use to locate your boxes).
Get a new instance of the Particle class for each particle, then use a Counter (from the collections module).
Particle class looks something like:
# static consts - outside so that every instance of Particle doesn't take them along
# for the ride...
MAX_X = 150000  # no thousands separator: 150,000 would be the tuple (150, 0)
X_STEP = 8000
# etc.
class Particle(object):
    def __init__(self, data):
        # xvalue, yvalue, zvalue are placeholders for the positions of the coordinates in data
        self.x = data[xvalue]
        self.y = data[yvalue]
        self.z = data[zvalue]
        self.compute_box_label()
    def compute_box_label(self):
        import math
        x_label = math.floor(self.x / X_STEP)
        y_label = math.floor(self.y / Y_STEP)
        z_label = math.floor(self.z / Z_STEP)
        self.box_label = str(x_label) + '-' + str(y_label) + '-' + str(z_label)
Anyway, I imagine your sigma8 function might look like:
def sigma8(data):
    import collections as col
    particles = [Particle(x) for x in data]
    boxes = col.Counter([x.box_label for x in particles])
    counts = boxes.most_common()
    #some other stuff
counts will be a list of tuples which map a box label to the number of particles in that box. (Here we're treating particles as indistinguishable.)
A list comprehension is usually faster than building the same list with an explicit loop; I think the reason is that more of the work happens in the underlying C, but I'm not the person to ask. Counter is (supposedly) highly optimized as well.
Note: None of this code has been tested, so you shouldn't try the cut-and-paste-and-hope-it-works method here.
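A compact variant of the same labelling idea, sketched with NumPy instead of a Particle class: every coordinate is floor-divided by the box size, which gives each particle an integer box label in a single pass, and a Counter tallies the labels. The function name count_per_box and the (x, y, z) column layout of positions are assumptions for this example; the question's data keeps the coordinates in columns 2 to 4.

```python
from collections import Counter

import numpy as np

BOX = 8000  # box edge length, taken from the question


def count_per_box(positions, box=BOX):
    """Map each (x, y, z) row to an integer box label and count the labels."""
    labels = np.floor_divide(positions, box).astype(int)
    return Counter(map(tuple, labels.tolist()))


positions = np.array([
    [100.0, 200.0, 300.0],      # box (0, 0, 0)
    [7999.0, 0.0, 0.0],         # box (0, 0, 0)
    [8000.0, 0.0, 0.0],         # box (1, 0, 0)
    [149999.0, 149999.0, 1.0],  # box (18, 18, 0)
])
counts = count_per_box(positions)
print(counts.most_common())
```

Boxes that contain no particles do not appear in the Counter, so they would have to be added as zeros before computing a mean or variance over all boxes.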

Is it possible to minimise with PyMinuit using a dictionary for parameter reference

Is it possible to carry out a PyMinuit function minimisation by passing a dictionary of parameters to the minimiser?
For example, the usual use of PyMinuit would be called using something like:
def f(x, a, b): return a + b*x
def chi2(a,b):
    c2 = 0.
    for x, y, yerr in data:
        c2 += (f(x, a, b) - y)**2 / yerr**2
    return c2
m = minuit.Minuit(chi2)
m.migrad()
From this question, I understand PyMinuit uses introspection to determine the parameters x and y (but I am not entirely sure what that means). Ideally, I would like to be able to do something like:
p = dict()
p['x'] = 0.
p['y'] = 0.
def f(x,a,b): return a + b*x
def chi2():
    c2 = 0.
    for x, y, yerr in data:
        c2 += (f(x, a, b) - y)**2 / yerr**2
    return c2
m = minuit.Minuit(chi2,**p)
m.migrad()
or even:
p = <dictionary of parameters + initial values>
model = <list containing strings representing functions e.g. 'a*b+a**2*x'>
data = x, y, yerr, model
def chi2():
    c2 = 0.
    for x, y, yerr, model in data:
        c2 += (eval(model,{"__builtins__":None},p) - y)**2 / yerr**2
    return c2
m = minuit.Minuit(chi2)
m.migrad()
I saw a work-around to a similar problem on the google groups issues page where they generated 'fake code' and 'fake functions' from an integer input (follow link to see). I tried something similar with my dictionary p:
class fake_code:
    def __init__(self,p):
        self.co_argcount = len(p)
        self.co_varnames = tuple(p.keys())
        print tuple(p.keys())
class fake_function:
    def __init__(self,p):
        self.func_code = fake_code(p)
    def __call__(self,*args):
        c2 = 0.
        print args
        for x, y, yerr in data:
            c2 += (f(x, a, b) - y)**2 / yerr**2
        return c2
but for some reason all the parameters are classed as 'fixed' and I can't seem to 'unfix' them.
I think it should be possible to do it this way, but I do not know enough about python to say if this is the best way, or even if it should be attempted. If anyone can shed some light onto this I'd be grateful to know. :)
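On what introspection means here: the minimiser looks at the function object itself to find out how many arguments it takes and what they are called. A minimal sketch of that kind of lookup follows, in Python 3 spelling. That PyMinuit reads exactly these attributes is an assumption, although the fake_code class above fakes the same two (co_argcount and co_varnames). In Python 2 the code object is reached as func_code rather than __code__.

```python
def chi2(a, b):
    # Toy cost function, used only to show what can be read off a function object
    return (a - 1.0) ** 2 + (b + 2.0) ** 2


code = chi2.__code__  # spelled chi2.func_code in Python 2
param_names = code.co_varnames[:code.co_argcount]
print(param_names)  # prints: ('a', 'b')
```

This is why a function declared as chi2() with no arguments gives the minimiser nothing to vary: there are no names to discover.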

OK, I don't like answering my own questions, but I think I've found a solution using exec. If one defines the chi2 function in a template and builds it at run-time with a function make_chi_squared then it is possible. The solution I've managed to come up with is shown below.
import minuit
import numpy
chi_squared_template = """
def chi_squared(%(params)s):
    li = [%(params)s]
    for i,para in enumerate(li):
        p[l[i]] = para
    return (((f(data_x, p) - data_y) / errors) ** 2).sum()
"""
l = ['a1','a2','a3','a4']
p = dict()
p['a1'] = 1.
p['a2'] = 1.
p['a3'] = 1.
p['a4'] = 1.
def make_chi_squared(f, data_x, data_y, errors):
    params = ", ".join(l)
    exec chi_squared_template % {"params": params}
    return chi_squared
def f(x,p):
    return eval('a1 + a2*x + a3*x**2 + a4*x**3',
                {"__builtins__":locals()},
                p)
data_x = numpy.arange(50)
errors = numpy.random.randn(50) * 0.3
data_y = data_x**3 + errors
chi_squared = make_chi_squared(f, data_x, data_y, errors)
m = minuit.Minuit(chi_squared)
m.printMode = 1
m.migrad()
print m.values
p = m.values
print p
It's a bit messy, and I'm not sure if it's the best way of handling this type of problem, but it works!
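For reference, the same template trick can be sketched in Python 3, where exec is a function and the generated definition has to be fetched from an explicit namespace. Nothing here calls a minimiser; the sketch only shows that the generated function ends up with one named argument per dictionary key. The names TEMPLATE, make_chi_squared and cubic, and the noise-free data, are assumptions for this illustration.

```python
import inspect

import numpy as np

TEMPLATE = """
def chi_squared({params}):
    values = dict(zip(names, [{params}]))
    return (((model(data_x, values) - data_y) / errors) ** 2).sum()
"""


def make_chi_squared(names, model, data_x, data_y, errors):
    """Build chi_squared(a1, a2, ...) with one named argument per parameter."""
    if not all(name.isidentifier() for name in names):
        raise ValueError("parameter names must be valid identifiers")
    # The generated function looks these names up as its globals
    namespace = {"names": names, "model": model, "data_x": data_x,
                 "data_y": data_y, "errors": errors}
    exec(TEMPLATE.format(params=", ".join(names)), namespace)
    return namespace["chi_squared"]


def cubic(x, p):
    return p["a1"] + p["a2"] * x + p["a3"] * x ** 2 + p["a4"] * x ** 3


data_x = np.arange(50.0)
data_y = data_x ** 3
errors = np.full(50, 0.3)
chi_squared = make_chi_squared(["a1", "a2", "a3", "a4"], cubic,
                               data_x, data_y, errors)
print(list(inspect.signature(chi_squared).parameters))
print(chi_squared(0.0, 0.0, 0.0, 1.0))
```

Passing the model as a plain function of x and a parameter dictionary avoids the eval call on an expression string; whether that fits depends on where the model expressions come from.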

The following is largely untested, which I usually try to avoid, but I am making an exception to better explain the simplified approach I referred to in my comments, which might work for this. It's based on the first example shown here.
import minuit
def minuit_call(func, **kwargs):
    CALL_TEMPLATE = "minuit.Minuit({0.__name__}, {1})"
    arg_str = ', '.join('{}={}'.format(k, v) for k,v in kwargs.iteritems())
    return eval(CALL_TEMPLATE.format(func, arg_str))
def f(x, y):
    return ((x-2) / 3)**2 + y**2 + y**4
m = minuit_call(f, x=0, y=0)
m.migrad()
As you can see, the template used is fairly trivial, and creating it didn't require manually translating any of the code in the body of the function to be minimized into a format string.

This might be a late answer, but try iminuit. I wrote it because this specific feature, among others, was missing.
http://iminuit.github.com/iminuit/
See an example of how you would write a generic cost function here:
http://nbviewer.ipython.org/urls/raw.github.com/iminuit/iminuit/master/tutorial/hard-core-tutorial.ipynb
However, although it's easy to write a chi^2/likelihood function, one is already written for you in probfit:
http://iminuit.github.com/probfit/
