I am trying to implement Non-negative Matrix Factorization to find the missing values of a matrix for a recommendation engine project. I am using the nimfa library for the factorization, but I can't figure out how to predict the missing values.
The missing values in this matrix are represented by 0.
a=[[ 1. 0.45643546 0. 0.1 0.10327956 0.0225877 ]
[ 0.15214515 1. 0.04811252 0.07607258 0.23570226 0.38271325]
[ 0. 0.14433757 1. 0.07905694 0. 0.42857143]
[ 0.1 0.22821773 0.07905694 1. 0. 0.27105237]
[ 0.06885304 0.47140452 0. 0. 1. 0.13608276]
[ 0.00903508 0.4592559 0.17142857 0.10842095 0.08164966 1. ]]
import numpy
import nimfa
model = nimfa.Lsnmf(a, max_iter=100000, rank=4)
# fit the model
fit = model()
# get U and V matrices from fit
U = fit.basis()
V = fit.coef()
print(numpy.dot(U, V))
But the result is nearly the same as a, so I can't predict the zero values.
Please tell me which method to use or any other implementations possible and any possible resources.
I want to use this function to minimize the error in predicting the values.
error=|| a - UV ||_F + c*||U||_F + c*||V||_F
where _F denotes the Frobenius norm.
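I have not used nimfa before, so I cannot say exactly how to do that there, but with sklearn you can use a preprocessing step to replace the missing values. (Imputer was removed in scikit-learn 0.22; in newer versions the equivalent class is sklearn.impute.SimpleImputer.) For example: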
In [28]: import numpy as np
In [29]: from sklearn.preprocessing import Imputer
# prepare a numpy array
In [30]: a = np.array(a)
In [31]: a
Out[31]:
array([[ 1. , 0.45643546, 0. , 0.1 , 0.10327956,
0.0225877 ],
[ 0.15214515, 1. , 0.04811252, 0.07607258, 0.23570226,
0.38271325],
[ 0. , 0.14433757, 1. , 0.07905694, 0. ,
0.42857143],
[ 0.1 , 0.22821773, 0.07905694, 1. , 0. ,
0.27105237],
[ 0.06885304, 0.47140452, 0. , 0. , 1. ,
0.13608276],
[ 0.00903508, 0.4592559 , 0.17142857, 0.10842095, 0.08164966,
1. ]])
In [32]: pre = Imputer(missing_values=0, strategy='mean')
# transform missing_values as "0" using mean strategy
In [33]: pre.fit_transform(a)
Out[33]:
array([[ 1. , 0.45643546, 0.32464951, 0.1 , 0.10327956,
0.0225877 ],
[ 0.15214515, 1. , 0.04811252, 0.07607258, 0.23570226,
0.38271325],
[ 0.26600665, 0.14433757, 1. , 0.07905694, 0.35515787,
0.42857143],
[ 0.1 , 0.22821773, 0.07905694, 1. , 0.35515787,
0.27105237],
[ 0.06885304, 0.47140452, 0.32464951, 0.27271009, 1. ,
0.13608276],
[ 0.00903508, 0.4592559 , 0.17142857, 0.10842095, 0.08164966,
1. ]])
You can read more here.
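As an alternative to imputing before factorizing, the regularized objective from the question can be minimized over the observed entries only, so that the zeros do not pull the reconstruction toward zero. The following is a minimal numpy sketch of weighted NMF with multiplicative updates, not nimfa's API. The function name masked_nmf and its parameters are hypothetical, and it uses squared Frobenius norms, which is an assumption that differs slightly from the formula in the question.

```python
import numpy as np

def masked_nmf(a, rank=4, c=0.01, n_iter=500, seed=0, eps=1e-9):
    """Weighted NMF sketch: zeros in `a` are treated as missing values."""
    a = np.asarray(a, dtype=float)
    mask = (a != 0).astype(float)        # 1 where a value was observed
    rng = np.random.default_rng(seed)
    U = rng.random((a.shape[0], rank))   # non-negative random start
    V = rng.random((rank, a.shape[1]))
    for _ in range(n_iter):
        # Multiplicative updates; only observed entries enter the error term,
        # and c * U / c * V come from the Frobenius-norm penalty.
        U *= ((mask * a) @ V.T) / ((mask * (U @ V)) @ V.T + c * U + eps)
        V *= (U.T @ (mask * a)) / (U.T @ (mask * (U @ V)) + c * V + eps)
    return U, V

a = np.array([
    [1.0, 0.45643546, 0.0, 0.1, 0.10327956, 0.0225877],
    [0.15214515, 1.0, 0.04811252, 0.07607258, 0.23570226, 0.38271325],
    [0.0, 0.14433757, 1.0, 0.07905694, 0.0, 0.42857143],
    [0.1, 0.22821773, 0.07905694, 1.0, 0.0, 0.27105237],
    [0.06885304, 0.47140452, 0.0, 0.0, 1.0, 0.13608276],
    [0.00903508, 0.4592559, 0.17142857, 0.10842095, 0.08164966, 1.0],
])

U, V = masked_nmf(a)
pred = U @ V
# Keep the observed values and fill only the missing ones with predictions.
filled = np.where(a == 0, pred, a)
print(np.round(filled, 3))
```

The rank and the penalty weight c are tuning choices; with so few observed entries the predictions depend noticeably on both and on the random start.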
I'm currently trying to develop a function that performs matrix multiplication while expanding a differential equation with odeint in Python and am seeing strange results.
I converted the function:
def f(x, t):
    return [
        -0.1 * x[0] + 2 * x[1],
        -2 * x[0] - 0.1 * x[1]
    ]
to the below so that I can incorporate different matrices.
I have the below matrix of values and function that takes specific values of that matrix:
import numpy as np
from scipy.integrate import odeint
x0_train = [2,0]
dt = 0.01
t = np.arange(0, 1000, dt)
matrix_a = np.array([-0.09999975, 1.999999, -1.999999, -0.09999974])
# Function to run odeint with
def f(x, t, a):
    return [
        a[0] * x[0] + a[1] * x[1],
        a[2] * x[0] - a[3] * x[1]
    ]
odeint(f, x0_train, t, args=(matrix_a,))
>>> array([[ 2. , 0. ],
[ 1.99760115, -0.03999731],
[ 1.99440529, -0.07997867],
...,
[ 1.69090227, 1.15608741],
[ 1.71199436, 1.12319701],
[ 1.73240339, 1.08985846]])
This seems right, but when I create my own function to perform multiplication/regression, I see the results at the bottom of the array are completely different. I have two sparse arrays that provide the same conditions as matrix_a but with zeros around them.
from sklearn.preprocessing import PolynomialFeatures
new_matrix_a = np.array([[ 0. , -0.09999975, 1.999999 , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , -1.999999 , -0.09999974, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. ]])
# New function
def f_new(x, t, parameters):
    polynomials = PolynomialFeatures(degree=5)
    x = np.array(x).reshape(-1,2)
    #x0_train_array_reshape = x0_train_array.reshape(1,2)
    polynomial_transform = polynomials.fit(x)
    polynomial_features = polynomial_transform.fit_transform(x).T
    x_ode = np.matmul(parameters[0],polynomial_features)
    y_ode = np.matmul(parameters[1],polynomial_features)
    return np.concatenate((x_ode, y_ode), axis=None).tolist()
odeint(f_new, x0_train, t, args=(new_matrix_a,))
>>> array([[ 2.00000000e+00, 0.00000000e+00],
[ 1.99760142e+00, -3.99573216e-02],
[ 1.99440742e+00, -7.98188169e-02],
...,
[-3.50784051e-21, -9.99729456e-22],
[-3.50782881e-21, -9.99726119e-22],
[-3.50781711e-21, -9.99722781e-22]])
As you can see, I'm getting completely different values at the end of the array. I've been running through my code and can't seem to find a reason why they would be different. Does anybody have a clear reason why or if I'm doing something wrong with my f_new? Ideally, I'd like to develop a function that can take any values in that matrix_a, which is why I'm trying to create this new function.
Thanks in advance.
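The two functions differ because the hand-written f contains a sign error: its second component subtracts a[3] * x[1], whereas the matrix product in f_new adds it. Using numpy for the multiplication in the first version avoids this kind of slip:
def f(x, t, a):
    return a.reshape([2,2]) @ x  # or use matmul, or a.reshape([2,2]).dot(x)
or, for efficiency, pass the already reshaped a.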
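To make the discrepancy concrete, here is a small check that evaluates both right-hand sides at a single point, without running odeint. The names f_handwritten and f_matrix are introduced only for this illustration.

```python
import numpy as np

matrix_a = np.array([-0.09999975, 1.999999, -1.999999, -0.09999974])

def f_handwritten(x, t, a):
    # Version from the question: note the minus sign in front of a[3].
    return [a[0] * x[0] + a[1] * x[1],
            a[2] * x[0] - a[3] * x[1]]

def f_matrix(x, t, a):
    # Matrix form: every coefficient enters with the sign stored in `a`.
    return a.reshape(2, 2) @ np.asarray(x)

x = [1.0, 1.0]
hand = f_handwritten(x, 0.0, matrix_a)
mat = f_matrix(x, 0.0, matrix_a)
print(hand)
print(mat)
# The first components agree; the second differ by -2 * a[3] * x[1].
print(hand[1] - mat[1])
```

Because a[3] is negative, the hand-written version turns the damping term into a growth term in the second equation, which is why the two long-time solutions look so different.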
So I have an array XsN of shape (590,) and I am trying to standardise the data.
This is an example of one of the 590 elements in my array:
print(XsN[:1])
[array([[ 0. , 0.27229556, -1.8033657 , ..., 0. ,
0. , 0. ],
[ 0. , 0.20665401, -1.9340569 , ..., 0. ,
0. , 0. ],
[ 4. , 0. , 0.04352444, ..., 0. ,
0. , 0. ],
...,
[10. , 0. , -0.5655 , ..., 0. ,
0. , 0. ],
[10. , 0. , 0.9150001 , ..., 0. ,
0. , 0. ],
[10. , 0. , 1.0005 , ..., 0. ,
0. , 0. ]], dtype=float32)]
I'm then reshaping it so that it has shape (590,1):
XsN_2 = XsN.reshape(-1,1)
Now when I use StandardScaler:
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(XsN_2)
I get the error that
TypeError: only size-1 arrays can be converted to Python scalars
and
ValueError: setting an array element with a sequence.
I understand that it expects a number but finds an ndarray instead. However, I'm not sure how to standardise data of shape (590,) where each element is its own ndarray.
Edit 1:
Referring to this csv file: https://gofile.io/?c=YGxCWQ
Here is some code with a sample data:
import pandas as pd
from sklearn.preprocessing import StandardScaler
imp = pd.read_csv('foo.csv', sep=',', header=None)
data = imp.values
print(data)
standardized_data = StandardScaler().fit_transform(data)
The error I get now is:
ValueError: could not convert string to float
Is there any way I can standardise this data?
Without access to your original data in the form of a valid .csv file it is a little difficult to debug this. From the look of what you printed it seems like XsN is a list of arrays, so you may want to loop through each in turn or convert it into an array with expanded dimensions.
Here is an example of standardizing some dummy data which I think resembles the structure of your data. Hope that helps.
import numpy as np
n = 100
# Create feature 1
mean1 = 10
standard_dev1 = 2
col1 = np.random.normal(loc=mean1,scale=standard_dev1,size=[n,1])
# Create feature 2
mean2 = 20
standard_dev2 = 4
col2 = np.random.normal(loc=mean2,scale=standard_dev2,size=[n,1])
data = np.concatenate([col1,col2],axis=1)
print(f"means of raw data: {data.mean(axis=0)}")
>>>
means of raw data: [10.15783287 19.82541124]
print(f"standard deviations of raw data: {data.std(axis=0)}")
>>>
standard deviations of raw data: [2.00049111 3.87277793]
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)
print(f"means of standardized data: {standardized_data.mean(axis=0)}")
>>>
means of standardized data: [-6.92779167e-16 -1.78745907e-15]
print(f"standard deviations of standardized data: {standardized_data.std(axis=0)}")
>>>
standard deviations of standardized data: [1. 1.]
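If XsN really is a collection of 2D arrays, one possible approach is to compute the column statistics over all rows of all arrays and then scale each array with them. The sketch below uses plain numpy and assumes that every array has the same number of columns; standardize_arrays and XsN_demo are hypothetical names, and the data is random dummy data rather than the data from the question.

```python
import numpy as np

def standardize_arrays(arrays, eps=1e-12):
    """Scale a sequence of 2D arrays with shared per-column mean and std."""
    stacked = np.vstack(list(arrays))   # all rows of all arrays together
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0)           # population std, as StandardScaler uses
    std[std < eps] = 1.0                # leave constant columns unscaled
    return [(np.asarray(arr) - mean) / std for arr in arrays]

rng = np.random.default_rng(0)
# Dummy stand-in for XsN: arrays with different row counts, same columns.
XsN_demo = [rng.normal(10, 2, size=(n_rows, 3)) for n_rows in (5, 8, 4)]

scaled = standardize_arrays(XsN_demo)
all_rows = np.vstack(scaled)
print(all_rows.mean(axis=0))
print(all_rows.std(axis=0))
```

Whether the statistics should be shared across all arrays or computed per array depends on what the 590 elements represent, so treat the shared version as one option.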
I am using the following to calculate the running gradients between data in the same indexes across multiple matrices:
import numpy as np
array_1 = np.array([[1,2,3], [4,5,6]])
array_2 = np.array([[2,3,4], [5,6,7]])
array_3 = np.array([[1,8,9], [9,6,7]])
flat_1 = array_1.flatten()
flat_2 = array_2.flatten()
flat_3 = array_3.flatten()
print('flat_1: {0}'.format(flat_1))
print('flat_2: {0}'.format(flat_2))
print('flat_3: {0}'.format(flat_3))
data = []
gradient_list = []
for item in zip(flat_1,flat_2,flat_3):
    data.append(list(item))
    print('items: {0}'.format(list(item)))
    grads = np.gradient(list(item))
    print('grads: {0}'.format(grads))
    gradient_list.append(grads)
grad_array=np.array(gradient_list)
print('grad_array: {0}'.format(grad_array))
This doesn't look like an optimal way of doing this - is there a vectorized way of calculating gradients between data in 2d arrays?
numpy.gradient takes an axis parameter, so you can stack the arrays and then calculate the gradient along the stacking axis. For instance, np.dstack stacks along the third axis, so pass axis=2 to np.gradient. If you need a different shape for the result, use the reshape method:
np.gradient(np.dstack((array_1, array_2, array_3)), axis=2)
#array([[[ 1. , 0. , -1. ],
# [ 1. , 3. , 5. ],
# [ 1. , 3. , 5. ]],
# [[ 1. , 2.5, 4. ],
# [ 1. , 0.5, 0. ],
# [ 1. , 0.5, 0. ]]])
Or, if you flatten the arrays first:
np.gradient(np.column_stack((array_1.ravel(), array_2.ravel(), array_3.ravel())), axis=1)
#array([[ 1. , 0. , -1. ],
# [ 1. , 3. , 5. ],
# [ 1. , 3. , 5. ],
# [ 1. , 2.5, 4. ],
# [ 1. , 0.5, 0. ],
# [ 1. , 0.5, 0. ]])
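As a sanity check, the vectorized result can be compared against the original loop. This sketch uses np.stack with axis=-1, which for these 2D inputs should be equivalent to np.dstack; the variable names vectorized and looped are introduced here for illustration.

```python
import numpy as np

array_1 = np.array([[1, 2, 3], [4, 5, 6]])
array_2 = np.array([[2, 3, 4], [5, 6, 7]])
array_3 = np.array([[1, 8, 9], [9, 6, 7]])

# Stack along a new last axis and differentiate along it: shape (2, 3, 3).
stacked = np.stack((array_1, array_2, array_3), axis=-1)
vectorized = np.gradient(stacked, axis=-1).reshape(-1, 3)

# Loop version from the question, kept only for comparison.
looped = np.array([
    np.gradient(list(item))
    for item in zip(array_1.ravel(), array_2.ravel(), array_3.ravel())
])

print(np.allclose(vectorized, looped))  # True
```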
I want to solve the following ODE:
KT + CT' = Q
My code for the given example data is below:
import numpy as np
import scipy as sp
import scipy.integrate  # ensures sp.integrate.odeint is available
# Solve the following ODE
# K*T + C*T' = Q
# T' = C^-1 ( Q - K * T )
T_start = np.array([ 151.26, 132.18, 131.64, 146.55, 147.87, 137.87])
K = np.array([[-0.01761969, 0.02704873, 0.00572222, 0. , 0. ,
0. ],
[ 0.02704873, -0.03546941, 0. , 0. , 0.00513177,
0. ],
[ 0.00572222, 0. , 0.03001858, -0.04752982, 0. ,
0.02030505],
[ 0. , 0. , -0.04752982, 0.0444405 , 0.00308932,
0. ],
[ 0. , 0.00513177, 0. , 0.00308932, 0.02629577,
-0.01793915],
[ 0. , 0. , 0.02030505, 0. , -0.01793915,
0.00084506]])
Q = np.array([ 1.66342077, 0.16187956, 0.65115035, 0.71274755, 2.54614269, 0.13680399])
C_invers = np.array([[ 3.44827586, 0. , 0. , 0. , 0. ,
-0. ],
[ 0. , 1.5625 , 0. , 0. , 0. ,
-0. ],
[ 0. , 0. , 2.63157895, 0. , 0. ,
-0. ],
[ 0. , 0. , 0. , 2.17391304, 0. ,
-0. ],
[ 0. , 0. , 0. , 0. , 1.63934426,
-0. ],
[ 0. , 0. , 0. , 0. , 0. ,
2.38095238]])
time = np.linspace(0, 20, 10000)
#T_real = sp.array([[ 151.26, 132.18, 131.64, 146.55, 147.87, 137.87]])
def deriv(T, t):
    # T' = C^-1 (Q - K T)
    return np.dot(C_invers, Q - np.dot(K, T))
T_sol = sp.integrate.odeint(deriv, T_start, time)
I know that the result is
np.array([ 151.26, 132.18, 131.64, 146.55, 147.87, 137.87])
The solution is "stable" if and only if I use this as the T_start condition,
but if I change my start condition, for example to
T_start=np.array([ 0, 0, 0, 0, 0, 0])
it won't converge, and I get the following result:
Where is my mistake? Negative values make no sense for my system. Can you help me? Thanks.
The array
array([ 151.26, 132.18, 131.64, 146.55, 147.87, 137.87])
is the equilibrium of your system (approximately). You can find this by setting the right-hand side of your system of equations to 0, which leads to Teq = inv(K)*Q:
In [9]: Teq = np.linalg.solve(K, Q)
In [10]: Teq
Out[10]:
array([ 151.25960795, 132.17972469, 131.6402527 , 146.55025359,
147.87025015, 137.87029892])
That's why your solution appears to be stable when you use these values for the starting point. The solution is very close to the equilibrium, so it doesn't change much.
Long term, however, the solution will eventually diverge away from Teq, because that equilibrium point is unstable. Your system, T' = inv(C)*(Q - K*T), is linear in T, so you can determine the stability by computing the eigenvalues of the coefficient matrix of T. That is, write T' = inv(C)*Q - inv(C)*K*T. The coefficient matrix of T is -inv(C)*K. Here's how you can find the eigenvalues of that matrix:
In [11]: A = -C_invers.dot(K)
In [12]: np.linalg.eigvals(A)
Out[12]:
array([-0.2089754 , 0.12257481, -0.06349952, -0.01489581, 0.00146708,
0.05878143])
The coefficient matrix A has three positive eigenvalues. Those correspond to modes that will grow exponentially in time. That is, the equilibrium is unstable, so the growth that you see is to be expected.
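The same eigenvalue test can be wrapped in a small helper and tried on toy matrices. This is only an illustrative sketch: growing_modes, K_stable and K_unstable are hypothetical names, and the 2x2 matrices are made up for the example rather than taken from the system above.

```python
import numpy as np

def growing_modes(C_inv, K):
    """Return the eigenvalues of A = -C_inv @ K that have positive real part."""
    A = -C_inv @ K
    eigvals = np.linalg.eigvals(A)
    return eigvals[eigvals.real > 0]

C_inv = np.eye(2)
K_stable = np.array([[2.0, -1.0], [-1.0, 2.0]])    # eigenvalues 1 and 3
K_unstable = np.array([[-1.0, 2.0], [2.0, -1.0]])  # eigenvalues 1 and -3

print(growing_modes(C_inv, K_stable))    # no growing modes
print(growing_modes(C_inv, K_unstable))  # one growing mode, rate 3
```

If the helper returns an empty array, every mode decays and the solution approaches Teq from any starting point; any returned eigenvalue gives the exponential growth rate of an unstable mode.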
'car3.csv' file download link
import csv
num = open('car3.csv')
nums = csv.reader(num)
nums_list = []
for i in nums:
    nums_list.append(i)
import numpy as np
nums_arr = np.array(nums_list, dtype = np.float32)
print(nums_arr)
print(np.std(nums_arr, axis=0))
The result is this.
[[ 1. 1. 2.]
[ 1. 1. 2.]
[ 1. 1. 2.]
...,
[ 0. 0. 5.]
[ 0. 0. 5.]
[ 0. 0. 5.]]
[ 0.5 0.5 1.11803401]
There are lots of spaces that I didn't expect.
How can I handle these anyway?
That is not a spacing problem. All you need to do is save the output of the standard deviation. Then you can access each value like this:
std_arr = np.std(nums_arr, axis=0) # array which holds std of each column
# now, you can access them by indexing:
print(std_arr[0]) # output here is 0.5
print(std_arr[1]) # output here is 0.5
print(std_arr[2]) # output here is 1.118034
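If the goal is only to control how the values are printed, the elements can also be formatted explicitly instead of relying on numpy's column-aligned array display. A minimal sketch, with the array values typed in by hand for illustration rather than read from car3.csv:

```python
import numpy as np

# Stand-in for np.std(nums_arr, axis=0); values copied from the output above.
std_arr = np.array([0.5, 0.5, 1.118034], dtype=np.float32)

# Format each element yourself to avoid numpy's padding between columns.
formatted = ", ".join(f"{float(v):.3f}" for v in std_arr)
print(formatted)  # 0.500, 0.500, 1.118
```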