python statsmodels.predict does not work

import numpy as np
import statsmodels.api as sm

x = np.arange(100)
y = np.sin(x)
result = sm.OLS(x, y).fit()
result.predict(x)
Gives:
ValueError: matrices are not aligned
This is very simple code, so I'm not sure why it isn't working. I searched lots of forums but could not find an exact solution.

Quick answer:
I think you want x and y reversed:
result = sm.OLS(y, x).fit()
The dependent variable (y) comes first, and then the array of explanatory variables (x).
The call to predict works with statsmodels master, but in an older version you may need a 2-D x:
result.predict(x[:, None])
to make the explanatory variable into a column array. (I don't remember when this was changed for 1-D x.)
Note also that no constant/intercept is added automatically when you don't use the formula interface.
The predictions for the sample (training) data can also be accessed through results.fittedvalues.
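To make the fix concrete, here is a minimal sketch of the corrected call, with the intercept added explicitly via sm.add_constant (the out-of-sample x values are made up for illustration):
import numpy as np
import statsmodels.api as sm

x = np.arange(100)
y = np.sin(x)

X = sm.add_constant(x)            # add the intercept column explicitly
result = sm.OLS(y, X).fit()       # dependent variable first, then the explanatory variables
in_sample = result.fittedvalues   # predictions for the training data
x_new = sm.add_constant(np.arange(100, 110))
out_of_sample = result.predict(x_new)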

Related

Traceback in Lagrange Interpolation code in Python

Can someone please explain where I went wrong? I used the same code for a different problem where I had to create arrays from the given specifications, but I don't understand where this code went wrong.
I think your problem is that you haven't quite got straight what your variables mean.
In the function definition, you use Xp to mean a single value.
You also define it as a single value.
However, just before you call the function, you treat it as though it were a list:
[ F(X,Y,i) for i in Xp ]
One fix would be to set Xp = [302], not 302.
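To make that concrete, here is a minimal, self-contained sketch; the body of F below is my own stand-in for the question's interpolation code, and the sample data are made up:
def F(X, Y, xp):
    # Lagrange interpolation of the points (X, Y), evaluated at the single value xp
    total = 0.0
    for j in range(len(X)):
        term = Y[j]
        for k in range(len(X)):
            if k != j:
                term *= (xp - X[k]) / (X[j] - X[k])
        total += term
    return total

X = [300.0, 301.0, 303.0]   # made-up sample points
Y = [2.0, 4.0, 8.0]         # made-up sample values
Xp = [302]                  # a list, not the bare scalar 302
print([F(X, Y, i) for i in Xp])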
Preventing future similar errors
Even better would be to use a more mnemonic variable name, such as Xp_list, so that you don't fall into a similar trap in future. In my code I would typically call those variables:
x_list
y_list
xp_list
Or
xs
ys
xps

Is there any way to use tf.keras.Model.predict within a tf.data.Dataset.map?

I have a Dataset that uses a keras model call within a map like this toy example:
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([tf.ones([1]) for i in range(10)])
model = tf.keras.models.Sequential([tf.keras.layers.Dense(1), tf.keras.layers.Dense(1)])
ds = ds.batch(4).map(lambda x: model(x))
I was wondering whether there is any way to use the built-in model.predict(x) instead, since the map using the model call is quite slow (in my real project). I have tried
ds = tf.data.Dataset.from_tensor_slices([tf.ones([1]) for i in range(10)])
model = tf.keras.models.Sequential([tf.keras.layers.Dense(1), tf.keras.layers.Dense(1)])
ds = ds.batch(4).map(lambda x: model.predict(x))
and
ds = tf.data.Dataset.from_tensor_slices([tf.ones([1]) for i in range(10)])
model = tf.keras.models.Sequential([tf.keras.layers.Dense(1), tf.keras.layers.Dense(1)])

def predict(x):
    return model.predict(x)

def predict_wrapper(x):
    y = tf.py_function(predict, [x], tf.float32)
    y.set_shape([None, None])
    return y

ds = ds.batch(4).map(predict_wrapper)
for x in ds:
    print(x)
Would it be possible? Would it make any difference in speed? My guess is probably not, since Dataset is already optimized for distributed strategies, and it would be like distributing ops within distributed ops. But since I have no idea about the matter, I thought I would ask.
Also, I am working in Google Colab, if that makes any difference.
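For reference, a minimal sketch of the usual alternative: tf.keras.Model.predict can consume a batched tf.data.Dataset directly, outside of map (this is standard Keras API; whether it fits the real pipeline here is a separate question):
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([tf.ones([1]) for i in range(10)])
model = tf.keras.models.Sequential([tf.keras.layers.Dense(1), tf.keras.layers.Dense(1)])

preds = model.predict(ds.batch(4))   # Keras iterates over the batched dataset itself
print(preds.shape)                   # (10, 1)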

theano.function - how does it work?

I've read the official documentation and the comments here: https://github.com/Theano/theano/blob/ddfd7d239a1e656cee850cdbc548da63f349c37d/theano/compile/function.py#L74-L324, and someone told me that it tells Theano to compile the symbolic computation graph into an actual program that you can run.
However, I still cannot figure out how it knows, for example in this code:
self.update_fun = theano.function(
    inputs=[self.input_mat, self.output_mat],
    outputs=self.cost,
    updates=updates,
    allow_input_downcast=True)
how to compute all that, if it has no body? I mean, all those things are computed in some code above these pasted lines, but... is theano.function actually looking at the source code to find out how to compute those things? I'm just guessing and would really like to know how it works.
Maybe the problem I have with the explanation that "it tells Theano to compile the symbolic computation graph into an actual program" is that I have no clue what a symbolic computation graph is, so that would be another question, closely related to the previous one.
An explanation would be appreciated.
I'm no expert, but here's my attempt at explaining it:
Yes, the 'body' is defined in the code above. But Theano doesn't 'interpret' that code directly the way the Python interpreter would. The code in question just creates Theano objects that will allow Theano to compile the desired function. Let's take a simple example: how you would create the function f(x) = 2x + x**3.
You first create a symbolic input variable x. Then you define the 'body' of the function by building the symbolic expression of f(x):
y = 2 * x + x**3 # defines a new symbolic variable which depends on x
This y object is equivalent to a graph representing the formula. Something like Plus(Times(2,x), Power(x,3)).
You finally call theano.function with inputs=[x] and outputs=y. Then Theano does its magic and compiles the actual function f(x) = y = 2 * x + x**3 from the information (the graph) 'contained' in y.
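Putting that together, a minimal runnable sketch (standard Theano API; the variable names are my own):
import theano
import theano.tensor as T

x = T.dscalar('x')          # symbolic input variable; holds no value yet
y = 2 * x + x ** 3          # builds the symbolic graph Plus(Times(2, x), Power(x, 3))

f = theano.function(inputs=[x], outputs=y)   # compile the graph into a callable
print(f(3.0))               # 33.0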
Does it make things clearer?

Understanding numpy.random.lognormal

I'm translating Matlab code (written by someone else) to Python.
In one section of the Matlab code, a variable X_new is set to a value drawn from a log-normal distribution as follows:
% log normal distribution
X_new = exp(normrnd(log(X_old), sigma));
That is, a random value is drawn from a normal distribution centered at log(X_old), and X_new is set to e raised to this value.
The direct translation of this code to Python is as follows:
import numpy as np
X_new = np.exp(np.random.normal(np.log(X_old), sigma))
But numpy includes a log-normal distribution which can be sampled directly.
My question is, is the line of code that follows equivalent to the lines of code above?
X_new = np.random.lognormal(np.log(X_old), sigma)
I think I'm going to have to answer my own question here.
From the documentation for np.random.lognormal, we have
A variable x has a log-normal distribution if log(x) is normally distributed.
Let's think about X_new from the Matlab code as a particular instance of a random variable x. The question is, is log(x) normally distributed here? Well, log(X_new) is just normrnd(log(X_old), sigma). So the answer is yes.
Now let's move to the call to np.random.lognormal in the second version of the Python code. X_new is again a particular instance of a random variable we can call x. Is log(x) normally distributed here? Yes, it must be, else numpy would not call this function lognormal. The mean of the underlying normal distribution is log(X_old) which is the same as the mean of the normal distribution in the Matlab code.
Hence, all implementations of the log-normal distribution in the question are equivalent (ignoring any very low-level implementation differences between the languages).
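As a quick empirical check (my own addition, with made-up values for X_old and sigma), the two samplers produce statistically indistinguishable draws:
import numpy as np

X_old, sigma, n = 5.0, 0.25, 1_000_000
mu = np.log(X_old)

a = np.exp(np.random.normal(mu, sigma, n))   # direct translation of the Matlab code
b = np.random.lognormal(mu, sigma, n)        # numpy's built-in log-normal sampler

print(a.mean(), b.mean())   # agree to within sampling noise
print(a.std(), b.std())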

Speed up NumPy loop

I'm running a model in Python and I'm trying to speed up the execution time. Through profiling the code I've found that a huge amount of the total processing time is spent in the cell_in_shadow function below. I'm wondering if there is any way to speed it up?
The aim of the function is to provide a boolean response stating whether the specified cell in the NumPy array is shadowed by another cell (in the x direction only). It does this by stepping backwards along the row checking each cell against the height it must be to make the given cell in shadow. The values in shadow_map are calculated by another function not shown here - for this example, take shadow_map to be an array with values similar to:
[0] = 0 (not used)
[1] = 3
[2] = 7
[3] = 18
The add_x function is used to ensure that the array indices loop around (using clock-face arithmetic), as the grid has periodic boundaries (anything going off one side will re-appear on the other side).
def cell_in_shadow(x, y):
    """Returns True if the specified cell is in shadow, False if not."""
    # Get the global variables we need
    global grid
    global shadow_map
    global x_len
    # Record the original position and move to the left
    orig_x = x
    x = add_x(x, -1)
    while x != orig_x:
        # Get the height needed from the shadow_map (the array index is the distance, using clock-face arithmetic)
        height_needed = shadow_map[(x - orig_x) % x_len]
        if grid[y, x] - grid[y, orig_x] >= height_needed:
            return True
        # Go to the cell to the left
        x = add_x(x, -1)
    return False

def add_x(a, b):
    """Adds the two numbers using clock-face arithmetic with x_len."""
    global x_len
    return (a + b) % x_len
I do agree with Sancho that Cython will probably be the way to go, but here are a couple of small speed-ups:
A. Store grid[y, orig_x] in some variable before you start the while loop and use that variable instead. This will save a bunch of look-up calls to the grid array.
B. Since you are basically just starting at x_len - 1 in shadow_map and working down to 1, you can avoid using the modulus so much. Basically, change:
while x != orig_x:
    height_needed = shadow_map[(x - orig_x) % x_len]
to
for i in xrange(x_len - 1, 0, -1):
    height_needed = shadow_map[i]
or just get rid of the height_needed variable altogether with:
if grid[y, x] - grid[y, orig_x] >= shadow_map[i]:
These are small changes, but they might help a little bit.
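Putting suggestions A and B together, here is a sketch of the revised loop (same semantics as the original; xrange as in the Python 2 era code above):
def cell_in_shadow(x, y):
    """Returns True if the specified cell is in shadow, False if not."""
    global grid, shadow_map, x_len
    base_height = grid[y, x]              # suggestion A: look this up once
    xi = add_x(x, -1)
    for i in xrange(x_len - 1, 0, -1):    # suggestion B: no modulus inside the loop
        if grid[y, xi] - base_height >= shadow_map[i]:
            return True
        xi = add_x(xi, -1)
    return False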
Also, if you plan on going the Cython route, I would consider having your function do this process for the whole grid, or at least a row at a time. That will save a lot of the function call overhead. However, you might not be able to really do this depending on how you are using the results.
Lastly, have you tried using Psyco? It takes less work than Cython though it probably won't give you quite as big of a speed boost. I would certainly try it first.
If you're not limited to strict Python, I'd suggest using Cython for this. It can allow static typing of the indices and efficient, direct access to a numpy array's underlying data buffer at c speed.
Check out a short tutorial/example at http://wiki.cython.org/tutorials/numpy
In that example, which is doing operations very similar to what you're doing (incrementing indices, accessing individual elements of numpy arrays), adding type information to the index variables cut the time in half compared to the original. Adding efficient indexing into the numpy arrays by giving them type information cut the time to about 1% of the original.
Most Python code is already valid Cython, so you can just use what you have and add annotations and type information where needed to give you some speed-ups.
I suspect you'd get the most out of adding type information to your indices x, y, orig_x and to the numpy arrays.
The following guide compares several different approaches to optimising numerical code in python:
Scipy PerformancePython
It is a bit out of date, but still helpful. Note that it refers to pyrex, which has since been forked to create the Cython project, as mentioned by Sancho.
Personally I prefer f2py, because I think that Fortran 90 has many of the nice features of numpy (e.g. adding two arrays together with one operation), but has the full speed of compiled code. On the other hand, if you don't know Fortran then this may not be the way to go.
I briefly experimented with Cython, and the trouble I found was that by default Cython generates code which can handle arbitrary Python types, but which is still very slow. You then have to spend time adding all the necessary Cython declarations to get it to be more specific and fast, whereas if you go with C or Fortran you will tend to get fast code straight out of the box. Again, this is biased by me already being familiar with those languages, whereas Cython may be more appropriate if Python is the only language you know.
