I am using sklearn.svm.SVR for a regression task in which I want to use my own custom kernel. Here are some dataset samples and the code:
index density speed label
0 14 58.844020 77.179139
1 29 67.624946 78.367394
2 44 77.679100 79.143744
3 59 79.361877 70.048869
4 74 72.529289 74.499239
.... and so on
from sklearn import svm
import pandas as pd
import numpy as np
density = np.random.randint(0,100, size=(3000, 1))
speed = np.random.randint(20,80, size=(3000, 1)) + np.random.random(size=(3000, 1))
label = np.random.randint(20,80, size=(3000, 1)) + np.random.random(size=(3000, 1))
d = np.hstack((density, speed, label))
data = pd.DataFrame(d, columns=['density', 'speed', 'label'])
data.density = data.density.astype(dtype=np.int32)
def my_kernel(X, Y):
    return np.dot(X, X.T)
svr = svm.SVR(kernel=my_kernel)
x = data[['density', 'speed']].iloc[:2000]
y = data['label'].iloc[:2000]
x_t = data[['density', 'speed']].iloc[2000:3000]
y_t = data['label'].iloc[2000:3000]
svr.fit(x,y)
y_preds = svr.predict(x_t)
The problem happens in the last line, svr.predict, which says:
X.shape[1] = 1000 should be equal to 2000, the number of samples at training time
I searched the web for a way to deal with the problem, but many similar questions (like {1}, {2}, {3}) were left unanswered.
I had used SVM methods with the rbf, sigmoid, and other built-in kernels before and the code worked just fine, but this was my first time using a custom kernel, and I suspected that must be the reason this error happened.
After a little research and reading the documentation, I found out that when using precomputed kernels, the matrix passed to SVR.predict() must have shape [n_samples_test, n_samples_train].
I wonder how to modify x_t in order to get predictions, so that everything works just as it does when we don't use custom kernels. If possible, please also explain why the input to svm.predict for precomputed kernels differs from that for the other kernels. I really hope the related unanswered questions can be answered as well.
The problem is in your kernel function: it doesn't do the job.
As the documentation (https://scikit-learn.org/stable/modules/svm.html#using-python-functions-as-kernels) says, "Your kernel must take as arguments two matrices of shape (n_samples_1, n_features), (n_samples_2, n_features) and return a kernel matrix of shape (n_samples_1, n_samples_2)." The sample kernel on the same page satisfies this criterion:
def my_kernel(X, Y):
    return np.dot(X, Y.T)
In your function the second argument of dot is X.T, so the output has shape (n_samples_1, n_samples_1), which is not what is expected.
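To see the fix end to end, here is a minimal sketch with random stand-in data (not the asker's dataset): with np.dot(X, Y.T) the Gram matrix shapes work out, and predict accepts a test set of a different size than the training set:

import numpy as np
from sklearn import svm

def my_kernel(X, Y):
    # (n_samples_1, n_features) . (n_features, n_samples_2) -> (n_samples_1, n_samples_2)
    return np.dot(X, Y.T)

rng = np.random.RandomState(0)
x_train, y_train = rng.rand(2000, 2), rng.rand(2000)
x_test = rng.rand(1000, 2)

svr = svm.SVR(kernel=my_kernel)
svr.fit(x_train, y_train)
preds = svr.predict(x_test)  # the kernel is evaluated between test and training samples: (1000, 2000)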
"The shape does not match" means the test data and the training data do not have compatible shapes; always think in terms of NumPy arrays and matrices. Any arithmetic operation needs compatible shapes, which is why we check array.shape. You could force the test data into the [n_samples_test, n_samples_train] shape with array.shape, reshape, or resize, but modifying shapes by hand is not the best idea.
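For completeness, a minimal sketch of the kernel='precomputed' route the question mentions, assuming a simple linear kernel: fit takes the (n_train, n_train) Gram matrix, and predict takes the kernel values between test and training samples, shape (n_test, n_train):

import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
x_train, y_train = rng.rand(2000, 2), rng.rand(2000)
x_test = rng.rand(1000, 2)

svr = svm.SVR(kernel='precomputed')
K_train = np.dot(x_train, x_train.T)  # (2000, 2000) Gram matrix of the training data
svr.fit(K_train, y_train)

K_test = np.dot(x_test, x_train.T)    # (1000, 2000): each test sample against each training sample
preds = svr.predict(K_test)

This is also why predict's input differs for precomputed kernels: the model never sees the raw features, so you must supply the test-vs-train kernel values yourself.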
Is there any real difference between the math functions performed by NumPy and TensorFlow, for example the exponential function or the max function?
The only difference I noticed is that TensorFlow takes tensors as input, not NumPy arrays.
Is this the only difference, or do the functions also differ in the values they return?
As has been mentioned, there is the performance difference. TensorFlow has the advantage that it has been designed to work on both CPUs or GPUs, so if you have a CUDA-enabled GPU, chances are TensorFlow is going to be much faster. You can find several benchmarks on the web with different comparisons, and also with other packages such as Numba or Theano.
However, I think that you are talking about whether NumPy and TensorFlow operations are exactly equivalent. The answer is basically yes, that is, the meaning of the operations is the same. However, since they are completely separate libraries with different implementations for everything, you will find small differences in the results. Take this code, for example (TensorFlow 1.2.0, NumPy 1.13.1):
# Force TensorFlow to run on CPU only
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
import numpy as np
import tensorflow as tf
# float32 NumPy array
a = np.arange(100, dtype=np.float32)
# The same array with the same dtype in TensorFlow
a_tf = tf.constant(a, dtype=tf.float32)
# Square root with NumPy
sqrt = np.sqrt(a)
# Square root with TensorFlow
with tf.Session() as sess:
    sqrt_tf = sess.run(tf.sqrt(a_tf))
You would expect to get pretty much the same output from both, I mean, a square root doesn't sound like an extremely complex operation after all. However, printing these arrays in my computer I get:
print(sqrt)
>>> array([ 0. , 1. , 1.41421354, 1.73205078, 2. ,
2.23606801, 2.44948983, 2.64575124, 2.82842708, 3. ,
3.1622777 , 3.31662488, 3.46410155, 3.60555124, 3.7416575 ,
3.87298346, 4. , 4.12310553, 4.2426405 , 4.35889912,
4.47213602, 4.5825758 , 4.69041586, 4.79583168, 4.89897966,
5. , 5.09901953, 5.19615221, 5.29150248, 5.38516474,
5.47722578, 5.56776428, 5.65685415, 5.74456263, 5.83095169,
5.91608 , 6. , 6.08276272, 6.16441393, 6.24499798,
6.3245554 , 6.40312433, 6.48074055, 6.55743837, 6.63324976,
6.70820379, 6.78233004, 6.85565472, 6.92820311, 7. ,
7.07106781, 7.14142847, 7.21110249, 7.28010988, 7.34846926,
7.41619825, 7.48331499, 7.54983425, 7.6157732 , 7.68114567,
7.74596691, 7.81024981, 7.8740077 , 7.93725395, 8. ,
8.06225777, 8.1240387 , 8.18535233, 8.24621105, 8.30662346,
8.36660004, 8.42614937, 8.48528099, 8.54400349, 8.60232544,
8.66025448, 8.71779823, 8.77496433, 8.83176041, 8.88819408,
8.94427204, 9. , 9.05538559, 9.11043358, 9.1651516 ,
9.21954441, 9.2736187 , 9.32737923, 9.38083172, 9.43398094,
9.48683262, 9.53939247, 9.59166336, 9.64365101, 9.69536018,
9.7467947 , 9.79795933, 9.84885788, 9.89949512, 9.94987392], dtype=float32)
print(sqrt_tf)
>>> array([ 0. , 0.99999994, 1.41421342, 1.73205078, 1.99999988,
2.23606801, 2.44948959, 2.64575124, 2.82842684, 2.99999976,
3.1622777 , 3.31662488, 3.46410155, 3.60555077, 3.74165726,
3.87298322, 3.99999976, 4.12310553, 4.2426405 , 4.35889864,
4.47213602, 4.58257532, 4.69041538, 4.79583073, 4.89897919,
5. , 5.09901857, 5.19615221, 5.29150248, 5.38516474,
5.47722483, 5.56776428, 5.65685368, 5.74456215, 5.83095121,
5.91607952, 5.99999952, 6.08276224, 6.16441393, 6.24499846,
6.3245554 , 6.40312433, 6.48074055, 6.5574379 , 6.63324976,
6.70820427, 6.78233004, 6.85565472, 6.92820311, 6.99999952,
7.07106733, 7.14142799, 7.21110153, 7.28010893, 7.34846973,
7.41619825, 7.48331451, 7.54983425, 7.61577368, 7.68114567,
7.74596643, 7.81025028, 7.8740077 , 7.93725395, 7.99999952,
8.06225681, 8.12403774, 8.18535233, 8.24621105, 8.30662346,
8.36660004, 8.42614937, 8.48528099, 8.54400253, 8.60232449,
8.66025352, 8.71779728, 8.77496433, 8.83176041, 8.88819408,
8.94427204, 8.99999905, 9.05538464, 9.11043262, 9.16515064,
9.21954441, 9.27361774, 9.32737923, 9.38083076, 9.43398094,
9.48683357, 9.53939152, 9.59166145, 9.64365005, 9.69535923,
9.7467947 , 9.79795837, 9.84885788, 9.89949417, 9.94987392], dtype=float32)
So, okay, it's similar, but there are obvious differences. TensorFlow couldn't even get the square roots of 1, 4 or 9 exactly right, for example. And you would probably get yet another result if you ran it on a GPU (because the GPU kernels differ from the CPU kernels, and because of the dependence on CUDA routines implemented by NVIDIA, another player in the field).
My impression (although I may be wrong) is that TensorFlow is more willing to sacrifice a bit of precision in exchange for performance (which would make sense considering its typical use case). I have even seen some more complicated operations produce (very slightly) different results when run twice on the same hardware, probably due to an unspecified order in aggregation and averaging operations causing rounding errors (I generally use float32, so that's a factor too, I guess).
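To get a feel for how much of this is plain float32 rounding, one can compare the two arrays from the snippet above directly and rerun the experiment in float64 (a sketch reusing those variables; the exact numbers are machine-dependent):

# Elementwise comparison of the two float32 results
print(np.max(np.abs(sqrt - sqrt_tf)))  # on the order of 1e-6 here, i.e. a couple of float32 ulps

# Repeat in float64: the discrepancy should shrink by many orders of magnitude
a64 = np.arange(100, dtype=np.float64)
with tf.Session() as sess:
    sqrt_tf64 = sess.run(tf.sqrt(tf.constant(a64, dtype=tf.float64)))
print(np.max(np.abs(np.sqrt(a64) - sqrt_tf64)))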
Of course there is a real difference. NumPy works on arrays with highly optimized vectorized computations, and it does pretty well on a CPU, whereas TensorFlow's math functions are optimized for the GPU, where many matrix multiplications matter much more. So the question is where you want to use what: for the CPU, I would just go with NumPy, whereas for the GPU, it makes sense to use TF operations.
I would like to write a TensorFlow op in python, but I would like it to be differentiable (to be able to compute a gradient).
This question asks how to write an op in python, and the answer suggests using py_func (which has no gradient): Tensorflow: Writing an Op in Python
The TF documentation describes how to add an op starting from C++ code only: https://www.tensorflow.org/versions/r0.10/how_tos/adding_an_op/index.html
In my case, I am prototyping so I don't care about whether it runs on GPU, and I don't care about it being usable from anything other than the TF python API.
Yes, as mentioned in @Yaroslav's answer, it is possible, and the key is the links he references: here and here. I want to elaborate on this answer by giving a concrete example.
Modulo operation: Let's implement the element-wise modulo operation in TensorFlow (it already exists, but its gradient is not defined; for the sake of the example we will implement it from scratch).
NumPy function: The first step is to define the operation we want for NumPy arrays. The element-wise modulo operation is already implemented in NumPy, so it is easy:
import numpy as np
def np_mod(x, y):
    return (x % y).astype(np.float32)
The reason for the .astype(np.float32) is that TensorFlow defaults to float32, and if you give it float64 (the NumPy default) it will complain.
Gradient function: Next we need to define the gradient function for our operation with respect to each input, as a TensorFlow function. The function needs to take a very specific form: it takes the TensorFlow representation of the operation, op, and the gradient of the output, grad, and says how to propagate the gradients. In our case, the gradients of the mod operation are easy: the derivative is 1 with respect to the first argument and -floor(x/y) with respect to the second (almost everywhere; it is infinite at a finite number of spots, but let's ignore that; see https://math.stackexchange.com/questions/1849280/derivative-of-remainder-function-wrt-denominator for details). So we have
def modgrad(op, grad):
    x = op.inputs[0]  # the first argument (normally you need the inputs to calculate the gradient, e.g. the gradient of x^2 is 2x)
    y = op.inputs[1]  # the second argument
    # the propagated gradients with respect to the first and second argument respectively
    return grad * 1, grad * tf.neg(tf.floordiv(x, y))
The grad function needs to return an n-tuple where n is the number of arguments of the operation. Notice that we need to return tensorflow functions of the input.
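For comparison, the _MySquareGrad referenced in a comment of the next snippet follows the same pattern for a one-argument op; since d(x^2)/dx = 2x, a sketch of it is simply:

def _MySquareGrad(op, grad):
    x = op.inputs[0]     # the op's only input
    return grad * 2 * x  # chain rule: incoming gradient times d(x^2)/dx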
Making a TF function with gradients: As explained in the sources mentioned above, there is a hack to define gradients of a function using tf.RegisterGradient [doc] and tf.Graph.gradient_override_map [doc].
Copying the code from harpone we can modify the tf.py_func function to make it define the gradient at the same time:
import numpy as np
import tensorflow as tf

def py_func(func, inp, Tout, stateful=True, name=None, grad=None):
    # Need to generate a unique name to avoid duplicates:
    rnd_name = 'PyFuncGrad' + str(np.random.randint(0, 1E+8))
    tf.RegisterGradient(rnd_name)(grad)  # see _MySquareGrad for a grad example
    g = tf.get_default_graph()
    with g.gradient_override_map({"PyFunc": rnd_name}):
        return tf.py_func(func, inp, Tout, stateful=stateful, name=name)
The stateful option tells TensorFlow whether the function always gives the same output for the same input (stateful=False), in which case TensorFlow can simplify the graph; this is our case and will probably be the case in most situations.
Combining it all together: Now that we have all the pieces, we can combine them all together:
from tensorflow.python.framework import ops

def tf_mod(x, y, name=None):
    with ops.op_scope([x, y], name, "mod") as name:
        z = py_func(np_mod,
                    [x, y],
                    [tf.float32],
                    name=name,
                    grad=modgrad)  # <-- here's the call to the gradient
        return z[0]
tf.py_func acts on lists of tensors (and returns a list of tensors), that is why we have [x,y] (and return z[0]).
And now we are done, and we can test it.
Test:
with tf.Session() as sess:
    x = tf.constant([0.3, 0.7, 1.2, 1.7])
    y = tf.constant([0.2, 0.5, 1.0, 2.9])
    z = tf_mod(x, y)
    gr = tf.gradients(z, [x, y])
    tf.initialize_all_variables().run()
    print(x.eval(), y.eval(), z.eval(), gr[0].eval(), gr[1].eval())
[ 0.30000001 0.69999999 1.20000005 1.70000005] [ 0.2 0.5 1. 2.9000001] [ 0.10000001 0.19999999 0.20000005 1.70000005] [ 1. 1. 1. 1.] [ -1. -1. -1. 0.]
Success!
Here's an example of adding a gradient to a specific py_func:
https://gist.github.com/harpone/3453185b41d8d985356cbe5e57d67342
Here's the issue discussion
I'm trying to build a program to map a 2d coordinate (latitude, longitude) to a float value. I have about 1 million rows of training data like
(41.140359, -8.612964) -> 65
... -> ...
I think this is a regression problem, except all of the regression examples I've found use only one dimension, so I'm not sure.
What algorithm (or category of algorithms) should I use in this instance?
Before trying to find a function, plot your data in Excel or with a Python plot; you may see the kind of function you are looking for.
In addition, Excel has a regression computation module.
It is a regression problem, and you can freely use e.g. linear regression to solve it. The examples are often one-dimensional so that they are easy to understand, but the methods work for an arbitrary number of dimensions.
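For instance, a minimal sketch with scikit-learn; apart from the first row, which comes from the question, the data here is hypothetical:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[41.140359, -8.612964],   # (latitude, longitude) pairs
              [41.159559, -8.630838],
              [41.179999, -8.601234]])
y = np.array([65.0, 58.0, 71.0])        # the float values to learn

model = LinearRegression().fit(X, y)
print(model.predict([[41.15, -8.62]]))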
You can try linear regression first.
Let's give an example using numpy.linalg.lstsq:
>>> import numpy as np
>>> x = np.random.rand(10, 2)
>>> x
array([[ 0.7920302 , 0.05650698],
[ 0.76380636, 0.07123805],
[ 0.18650694, 0.89150851],
[ 0.22730377, 0.83013102],
[ 0.72369719, 0.07772721],
[ 0.26277287, 0.44253368],
[ 0.44421399, 0.98533921],
[ 0.91476656, 0.27183732],
[ 0.74745802, 0.08840694],
[ 0.60000819, 0.67162258]])
>>> y = np.random.rand(10)
>>> y
array([ 0.53341968, 0.63964031, 0.46097061, 0.68602146, 0.20041928,
0.42642768, 0.34039486, 0.93539655, 0.29946688, 0.57526445])
>>> c1, c2 = np.linalg.lstsq(x, y)[0]  # one coefficient per feature; no intercept is fitted
>>> print c1, c2
0.605269341974 0.370359070752
See the numpy.linalg.lstsq documentation for more information about what those values represent.
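Note that the call above fits y as a weighted sum of the two features with no intercept. To fit an intercept as well, a common trick (a sketch continuing the session above) is to append a column of ones, whose coefficient becomes the intercept:

>>> x_aug = np.column_stack([x, np.ones(len(x))])
>>> c1, c2, intercept = np.linalg.lstsq(x_aug, y)[0]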
Can anyone explain the difference between ols in statsmodels.formula.api and OLS in statsmodels.api?
Using the Advertising data from the ISLR text, I ran an OLS regression using both and got different results. I then compared with scikit-learn's LinearRegression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
df = pd.read_csv("C:\...\Advertising.csv")
x1 = df.loc[:,['TV']]
y1 = df.loc[:,['Sales']]
print "Statsmodel.Formula.Api Method"
model1 = smf.ols(formula='Sales ~ TV', data=df).fit()
print model1.params
print "\nStatsmodel.Api Method"
model2 = sm.OLS(y1, x1)
results = model2.fit()
print results.params
print "\nSci-Kit Learn Method"
model3 = LinearRegression()
model3.fit(x1, y1)
print model3.coef_
print model3.intercept_
The output is as follows:
Statsmodel.Formula.Api Method
Intercept 7.032594
TV 0.047537
dtype: float64
Statsmodel.Api Method
TV 0.08325
dtype: float64
Sci-Kit Learn Method
[[ 0.04753664]]
[ 7.03259355]
The statsmodels.api method returns a different parameter for TV than the statsmodels.formula.api and scikit-learn methods do.
What kind of OLS algorithm is statsmodels.api running that would produce a different result? Does anyone have a link to documentation that could help answer this question?
Came across this issue today and wanted to elaborate on @stellasia's answer, because the statsmodels documentation is perhaps a bit ambiguous.
Unless you are using actual R-style string formulas when instantiating OLS, you need to add a constant (literally a column of 1s) under both statsmodels.formula.api and plain statsmodels.api. @Chetan is using R-style formatting here (formula='Sales ~ TV'), so he will not run into this subtlety, but for people with some Python knowledge but no R background this could be very confusing.
Furthermore, it doesn't matter whether you specify the hasconst parameter when building the model (which is kind of silly). In other words, unless you are using R-style string formulas, hasconst is ignored even though it is supposed to
"[Indicate] whether the RHS includes a user-supplied constant"
because, as the footnotes say,
"No constant is added by the model unless you are using formulas."
The example below shows that both .formula.api and .api require a user-added column vector of 1s if you are not using R-style string formulas.
import numpy as np
import statsmodels.api as sm

# Generate some relational data
np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
x_with_ones = sm.add_constant(x, prepend=False)
beta = [.1, .5, 1]
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e
Now throw x and y into Excel and run Data>Data Analysis>Regression, making sure "Constant is zero" is unchecked. You'll get the following coefficients:
Intercept 1.497761024
X Variable 1 0.012073045
X Variable 2 0.623936056
Now, try running this regression on x, not x_with_ones, in either statsmodels.formula.api or statsmodels.api with hasconst set to None, True, or False. You'll see that in each of those 6 scenarios, there is no intercept returned. (There are only 2 parameters.)
import statsmodels.formula.api as smf
import statsmodels.api as sm
print('smf models')
print('-' * 10)
for hc in [None, True, False]:
    model = smf.OLS(endog=y, exog=x, hasconst=hc).fit()
    print(model.params)
# smf models
# ----------
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
Now let's run things correctly, with a column vector of 1.0s added to x. You can use smf here, but it's really not necessary if you're not using formulas.
print('sm models')
print('-' * 10)
for hc in [None, True, False]:
    model = sm.OLS(endog=y, exog=x_with_ones, hasconst=hc).fit()
    print(model.params)
# sm models
# ----------
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
The difference is due to the presence or absence of an intercept:
in statsmodels.formula.api, similarly to the R approach, a constant is automatically added to your data and an intercept is fitted;
in statsmodels.api, you have to add a constant yourself (see the documentation here). Try using add_constant from statsmodels.api:
x1 = sm.add_constant(x1)
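With that constant added, sm.OLS should reproduce the formula version's results; a minimal sketch using the question's variables:

results = sm.OLS(y1, x1).fit()
print(results.params)  # now reports both the intercept (const) and TV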
I had a similar issue with the Logit function.
(I used patsy to create my matrices, so the intercept was there.)
My sm.Logit was not converging, while my sm.formula.logit was converging, even though the data going in was exactly the same.
When I changed the solver method to 'newton', sm.Logit converged as well.
Is it possible the two versions have different default solver methods?
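For reference, both APIs let you pick the solver explicitly via the method argument of fit(); a sketch with hypothetical y and X (a design matrix that already includes the intercept column):

import statsmodels.api as sm

result = sm.Logit(y, X).fit(method='newton')  # other options include 'bfgs', 'lbfgs', 'nm'
print(result.params)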