linalg.svd and JacobiSVD<MatrixXf> svd ,The results are different - python

I'm translating the Python code into a C + + version,But I found that the two functions(linalg.svd and JacobiSVD svd ) produce different results.What should I do?
A = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16]])
U, S, V = svd(A,0)
print("U =\n", U)
print("S =\n", S)
print("V =\n", V)
MatrixXf m = MatrixXf::Zero(4,4);
m << 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16;
cout << "Here is the matrix m:" << endl << m << endl;
JacobiSVD<MatrixXf> svd(m, ComputeFullU | ComputeFullV);
cout << "Its singular values are:" << endl << svd.singularValues() << endl;
cout << "Its left singular vectors are the columns of the thin U matrix:" << endl << endl << svd.matrixU() << endl;
cout << "Its right singular vectors are the columns of the thin V matrix:" << endl << endl << svd.matrixV() << endl;
Forgive me for not being clear, but here are the Python code and C ++
code results
U =
[[-0.13472212 -0.82574206 0.54255324 0.07507318]
[-0.3407577 -0.4288172 -0.77936056 0.30429774]
[-0.54679327 -0.03189234 -0.06893859 -0.83381501]
[-0.75282884 0.36503251 0.30574592 0.45444409]]
S =
[3.86226568e+01 2.07132307e+00 1.57283823e-15 3.14535571e-16]
V =
[[-0.4284124 -0.47437252 -0.52033264 -0.56629275]
[ 0.71865348 0.27380781 -0.17103786 -0.61588352]
[-0.19891147 -0.11516042 0.82705525 -0.51298336]
[ 0.51032757 -0.82869661 0.12641052 0.19195853]]
C++
Here is the matrix m:
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Its singular values are:
38.6227
2.07132
2.69062e-16
6.823e-17
Its left singular vectors are the columns of the thin U matrix:
0.134722 0.825742 0.0384608 0.546371
0.340758 0.428817 0.35596 -0.757161
0.546793 0.0318923 -0.827301 -0.12479
0.752829 -0.365033 0.432881 0.33558
Its right singular vectors are the columns of the thin V matrix:
0.428412 -0.718653 -0.124032 0.533494
0.474373 -0.273808 -0.232267 -0.803774
0.520333 0.171038 0.83663 0.00706489
0.566293 0.615884 -0.480331 0.263215
Although it turns out that there are some small deviations, will this affect my work?

Related

Difference between Eigen Svd and np.linag svd

I'm currently trying to implement some prototype Python code into C++ and am running into an issue where I am getting different results for svd calculate between the 2 when using the same exact array
Code:
python:
(U, S, V) = np.linalg.svd(H)
print("h\n",H)
print("u\n",U)
print("v\n",V)
print("s\n",S)
rotation_matrix = np.dot(U, V)
prints
h
[[ 1.19586781e+00 -1.36504900e+00 3.04707238e+00]
[-3.24276981e-01 4.25640964e-01 -6.78455372e-02]
[ 4.58970250e-02 -7.33566042e-02 -2.96605698e-03]]
u
[[-0.99546325 -0.09501679 0.0049729 ]
[ 0.09441242 -0.97994807 0.17546529]
[-0.01179897 0.17513875 0.98447306]]
v
[[-0.34290622 0.39295764 -0.85322893]
[ 0.49311955 -0.6977843 -0.51954806]
[-0.79953014 -0.59890012 0.04549948]]
s
[3.5624894 0.43029207 0.00721429]
C++
code:
std::cout << "H\n" << HTest << std::endl;
Eigen::JacobiSVD<Eigen::MatrixXd> svd;
svd.compute(HTest, Eigen::ComputeThinV | Eigen::ComputeThinU);
std::cout << "h is" << std::endl << HTest << std::endl;
std::cout << "Its singular values are:" << std::endl << svd.singularValues() << std::endl;
std::cout << "Its left singular vectors are the columns of the thin U matrix:" << std::endl << -1*svd.matrixU() << std::endl;
std::cout << "Its right singular vectors are the columns of the thin V matrix:" << std::endl << -1*svd.matrixV() << std::endl;
prints:
h is
1.19587 -1.36505 3.04707
-0.324277 0.425641 -0.0678455
0.045897 -0.0733566 -0.00296606
Its singular values are:
3.56249
0.430292
0.00721429
Its left singular vectors are the columns of the thin U matrix:
-0.995463 -0.0950168 0.0049729
0.0944124 -0.979948 0.175465
-0.011799 0.175139 0.984473
Its right singular vectors are the columns of the thin V matrix:
-0.342906 0.49312 -0.79953
0.392958 -0.697784 -0.5989
-0.853229 -0.519548 0.0454995
So H,U,S are eqivalent between the 2, but V is not. What could cause this?
I didn't notice that the V's were just transposes of each other. user chrslg has a good explanation for why this is so Ill just copy it here:
"I'd say, "because" :-). I don't think there is a good reason. Just 2 implementations. In maths lessons, you've probably learned SVD decomposition with formula M=U.S.Vᵀ. So C++ library probably stick to this formula, and gives U, S, V such as M=U.S.Vᵀ. Where as linalg documentation says that it returns U,S,V such as M=(U*S)#V. So one call V what the other calls Vᵀ. Hard to say which one is right. As long as they do what their doc say they do"

Eigen OLS vs python statsmodel.api.OLS

I need to calculate the slope,intercept of the line for a regression between 2 vectors with data. So i made a prototype with python below code:
A = [1,2,5,7,14,17,19]
b = [2,14,6,7,13,27,29]
A = sm.add_constant(A)
results = sm.OLS(A, b).fit()
print("results: ", results.params)
output: [0.04841897 0.64278656]
Now I need to replicate this using Eigen lib in C++ and as I understood, I need to pass a 1 column in the matrix of A. If I do so, I get totally different results for the regression than if I just use no second column or a 0 column. C++ code below:
Eigen::VectorXd A(7);
Eigen::VectorXd b(7);
A << 1,2,5,7,14,17,19;
b << 2,14,6,7,13,27,29;
MatrixXd new_A(A.rows(), 2);
VectorXd d = VectorXd::Constant(A.rows(), 1);
new_A << A, d;
Eigen::MatrixXd res = new_A.bdcSvd(Eigen::ComputeThinU | Eigen::ComputeThinV).solve(b);
cout << " slope: " << res.coeff(0, 0) << " intercept: " << res.coeff(1, 0) << endl;
cout << "dbl check: " << (new_A.transpose() * new_A).ldlt().solve(new_A.transpose() * b) << endl;
output with '1' column added to new_A -> slope: 1.21644 intercept:
2.70444 output with '0' or no column added -> slope: 0.642787 intercept: 0
How to get same results in C++? Which one is the right one, I seem to trust more the python one since I get the same when I use 0 column.
thank you,
Merlin
It seems i had to invert new_A with b, and replace ComputeThin with ComputeFull so that it builds.
Eigen::MatrixXd res = b.bdcSvd(Eigen::ComputeFullU | Eigen::ComputeFullV).solve(new_A);

save data (matrix or ndarray) with python then it load in c++ (as OpenCV Mat)

I've created some data in numpy that I would like to use in a separate C++ program. Therefore I need to save the data using python and later load it in C++. What is the best way of doing this?
My numpy ndarray is float 32 and of shape [10000 x 18 x 5]. I can save it for example using
numpy.save(filename, data)
Is there an easy way to load such data in C++? Target structure could be an Eigen::Matrix for example.
After searching for hours I found my year-old example files.
Caveat:
solution only covers 2D matrices
not suited for 3 dimensional or generic ndarrays
Write numpy array to ascii file with header specifying nrows, ncols:
def write_matrix2D_to_ascii(filename, matrix2D):
nrows, ncols = matrix2D.shape
with open(filename, "w") as file:
# write header [rows x cols]
nrows, ncols = matrix2D.shape
file.write(f"{nrows} {ncols}")
file.write("\n")
# write values
for row in range(nrows):
for col in range(ncols):
value = matrix2D[row, col]
file.write(str(value))
file.write(" ")
file.write("\n")
Example output data-file.txt looks like this (first row is header specifying nrows and ncols):
2 3
1.0 2.0 3.0
4.0 5.0 6.0
Cpp function to read matrix from ascii file into OpenCV matrix:
#include <iostream>
#include <fstream>
#include <iomanip> // set precision of output string
#include <opencv2/core/core.hpp> // OpenCV matrices for storing data
using namespace std;
using namespace cv;
void readMatAsciiWithHeader( const string& filename, Mat& matData)
{
cout << "Create matrix from file :" << filename << endl;
ifstream inFileStream(filename.c_str());
if(!inFileStream){
cout << "File cannot be found" << endl;
exit(-1);
}
int rows, cols;
inFileStream >> rows;
inFileStream >> cols;
matData.create(rows,cols,CV_32F);
cout << "numRows: " << rows << "\t numCols: " << cols << endl;
matData.setTo(0); // init all values to 0
float *dptr;
for(int ridx=0; ridx < matData.rows; ++ridx){
dptr = matData.ptr<float>(ridx);
for(int cidx=0; cidx < matData.cols; ++cidx, ++dptr){
inFileStream >> *dptr;
}
}
inFileStream.close();
}
Driver code to use above mentioned function in cpp program:
Mat myMatrix;
readMatAsciiWithHeader("path/to/data-file.txt", myMatrix);
For completeness, some code to save the data using C++:
int saveMatAsciiWithHeader( const string& filename, Mat& matData)
{
if (matData.empty()){
cout << "File could not be saved. MatData is empty" << endl;
return 0;
}
ofstream oStream(filename.c_str());
// Create header
oStream << matData.rows << " " << matData.cols << endl;
// Write data
for(int ridx=0; ridx < matData.rows; ridx++)
{
for(int cidx=0; cidx < matData.cols; cidx++)
{
oStream << setprecision(9) << matData.at<float>(ridx,cidx) << " ";
}
oStream << endl;
}
oStream.close();
cout << "Saved " << filename.c_str() << endl;
return 1;
}
Future work:
solution for 3D matrices
conversion to Eigen::Matrix

Converting 8bit values into 24bit values

I am reading in ADC values, assuming I am reading things right I'm still new to this, from a [NAU7802](14www.nuvoton.com/resource-files/NAU7802 Data Sheet V1.7.pdf0) and I am getting values outputted as an 8 bit integer (i.e. 0-255) as three bytes. How do I merge the three bytes together to get the output as a 24bit value (0-16777215)?
Here is the code I am using, if I am assuming I did this right, I am still new to I2C communication.
from smbus2 import SMBus
import time
bus = SMBus(1)
address = 0x2a
bus.write_byte_data(0x2a, 0x00, 6)
data = bus.read_i2c_block_data(0x2a,0x12,3)
print bus.read_i2c_block_data(0x2a,0x12,3)
adc1 = bin(data[2])
adc2 = bin(data[1])
adc3 = bin(data[0])
print adc1
print adc2
print adc3
When I convert the binary manually I get and output that corresponds to what I am inputting to the adc.
Ouput:
[128, 136, 136]
0b10001001
0b10001000
0b10000000
try this:
data=[128, 136, 136]
data[0] + (data[1] << 8) + (data[2] << 16)
# 8947840
or
((data[2] << 24) | (data[1] << 16) | (data[0] << 8)) >> 8
# 8947840
(8947840 & 0xFF0000) >> 16
#136
(8947840 & 0x00FF00) >> 8
#136
(8947840 & 0x0000FF)
#128
Here's an example on unpacking 3 different numbers:
data=[118, 123, 41]
c = data[0] + (data[1] << 8) + (data[2] << 16)
#2718582
(c & 0xFF0000) >> 16
#41
(c & 0x00FF00) >> 8
#123
(c & 0x0000FF)
#118

How to find closest locations for a list of locations in more efficient way?

Looking for the help with an algorithm for local machine or cluster (Python, R, JavaScript, any languages).
I have a list of locations with coordinates.
# R script
n <- 10
set.seed(1)
index <- paste0("id_",c(1:n))
lat <- runif(n, 32.0, 41)
lon <- runif(n, 84, 112)*(-1)
values <- as.integer(runif(n, 50, 100))
df <- data.frame(index, lat, lon, values, stringsAsFactors = FALSE)
names(df) <- c('loc_id','lat','lon', 'value')
loc_id lat lon value
1 id_1 34.38958 -89.76729 96
2 id_2 35.34912 -88.94359 60
3 id_3 37.15568 -103.23664 82
4 id_4 40.17387 -94.75490 56
5 id_5 33.81514 -105.55556 63
6 id_6 40.08551 -97.93558 69
7 id_7 40.50208 -104.09332 50
8 id_8 37.94718 -111.77337 69
9 id_9 37.66203 -94.64099 93
10 id_10 32.55608 -105.76847 67
I need to find 3 closets locations for each location in the table.
This is my code in R:
# R script
require(dplyr)
require(geosphere)
start.time <- Sys.time()
d1 <- df
sample <- 999999999999
distances <- list("init1" = sample, "init2" = sample, "init3" = sample)
d1$distances <- apply(d1, 1, function(x){distances})
n_rows = nrow(d1)
for (i in 1:(n_rows-1)) {
# current location
dot1 <- c(d1$lon[i], d1$lat[i])
for (k in (i+1):n_rows) {
# next location
dot2 <- c(d1$lon[k], d1$lat[k])
# distance between locations
meters_between <- as.integer(distm(dot1, dot2, fun = distHaversine))
# updating current location distances
distances <- d1$distances[[i]]
distances[d1$loc_id[k]] <- meters_between
d1$distances[[i]] <- distances[order(unlist(distances), decreasing=FALSE)][1:3]
# updating next location distances
distances <- d1$distances[[k]]
distances[d1$loc_id[i]] <- meters_between
d1$distances[[k]] <- distances[order(unlist(distances), decreasing=FALSE)][1:3]
}
}
But it takes too much time:
# [1] "For 10 rows and 45 iterations takes 0.124729156494141 sec. Average sec 0.00277175903320313 per row."
# [1] "For 100 rows and 4950 iterations takes 2.54944682121277 sec. Average sec 0.000515039761861165 per row."
# [1] "For 200 rows and 19900 iterations takes 10.1178169250488 sec. Average sec 0.000508433011308986 per row."
# [1] "For 500 rows and 124750 iterations takes 73.7151870727539 sec. Average sec 0.000590903303188408 per row."
I did the same in Python:
# Python script
import pandas as pd
import numpy as np
n = 10
np.random.seed(1)
data_m = np.random.uniform(0, 5, 5)
data = {'loc_id':range(1, n+1),
'lat':np.random.uniform(32, 41, n),
'lon':np.random.uniform(84, 112, n)*(-1),
'values':np.random.randint(50, 100, n)}
df = pd.DataFrame(data)[['loc_id', 'lat', 'lon', 'values']]
df['loc_id'] = df['loc_id'].apply(lambda x: 'id_{0}'.format(x))
df = df.reset_index().drop('index', axis = 1).set_index('loc_id')
from geopy.distance import distance
from datetime import datetime
start_time = datetime.now()
sample = 999999999999
df['distances'] = np.nan
df['distances'] = df['distances'].apply(lambda x: [{'init1': sample}, {'init2': sample}, {'init3': sample}])
n_rows = len(df)
rows_done = 0
for i, row_i in df.head(n_rows-1).iterrows():
dot1 = (row_i['lat'], row_i['lon'])
rows_done = rows_done + 1
for k, row_k in df.tail(n_rows-rows_done).iterrows():
dot2 = (row_k['lat'], row_k['lon'])
meters_between = int(distance(dot1,dot2).meters)
distances = df.at[i, 'distances']
distances.append({k: meters_between})
distances_sorted = sorted(distances, key=lambda x: x[next(iter(x))])[:3]
df.at[i, 'distances'] = distances_sorted
distances = df.at[k, 'distances']
distances.append({i: meters_between})
distances_sorted = sorted(distances, key=lambda x: x[next(iter(x))])[:3]
df.at[k, 'distances'] = distances_sorted
print df
Almost the same performance.
Anybody knows if there is a better approach? In my task it has to be done for 90000 locations. Even thought about Hadoop/MpRc/Spark, but have no idea how to do in distributed mode.
I am glad to hear any ideas or suggestions.
If Euclidean distance is ok then nn2 uses kd-trees and C code so it should be fast:
library(RANN)
nn2(df[2:3], k = 4)
This took a total of 0.06 to 0.11 seconds on my not particularly fast laptop to process n = 10,000 rows and a total of 1.00 to 1.25 seconds for 90,000 rows.
I can offer a python solution with scipy
from scipy.spatial import distance
from geopy.distance import vincenty
v=distance.cdist(df[['lat','lon']].values,df[['lat','lon']].values,lambda u, v: vincenty(u, v).kilometers)
np.sort(v,axis=1)[:,1:4]
Out[1033]:
array([[384.09948155, 468.15944729, 545.41393271],
[270.07677993, 397.21974571, 659.96238603],
[384.09948155, 397.21974571, 619.616239 ],
[203.07302273, 483.54687912, 741.21396029],
[203.07302273, 444.49156394, 659.96238603],
[437.31308598, 468.15944729, 494.91879983],
[494.91879983, 695.91437812, 697.27399161],
[270.07677993, 444.49156394, 483.54687912],
[530.54946479, 626.29467739, 695.91437812],
[437.31308598, 545.41393271, 697.27399161]])
Here's how to solve this problem with C++ and my library
GeographicLib (version 1.47 or later). This uses true ellipsoidal geodesic
distances and a
vantage point tree
to optimize the search for nearest neighbors.
#include <exception>
#include <vector>
#include <fstream>
#include <string>
#include <GeographicLib/NearestNeighbor.hpp>
#include <GeographicLib/Geodesic.hpp>
using namespace std;
using namespace GeographicLib;
// A structure to hold a geographic coordinate.
struct pos {
string id;
double lat, lon;
pos(const string& _id = "", double _lat = 0, double _lon = 0) :
id(_id), lat(_lat), lon(_lon) {}
};
// A class to compute the distance between 2 positions.
class DistanceCalculator {
private:
Geodesic _geod;
public:
explicit DistanceCalculator(const Geodesic& geod) : _geod(geod) {}
double operator() (const pos& a, const pos& b) const {
double d;
_geod.Inverse(a.lat, a.lon, b.lat, b.lon, d);
if ( !(d >= 0) )
// Catch illegal positions which result in d = NaN
throw GeographicErr("distance doesn't satisfy d >= 0");
return d;
}
};
int main() {
try {
// Read in pts
vector<pos> pts;
string id;
double lat, lon;
{
ifstream is("pts.txt"); // lines of "id lat lon"
if (!is.good())
throw GeographicErr("pts.txt not readable");
while (is >> id >> lon >> lat)
pts.push_back(pos(id, lat, lon));
if (pts.size() == 0)
throw GeographicErr("need at least one location");
}
// Define a distance function object
DistanceCalculator distance(Geodesic::WGS84());
// Create NearestNeighbor object
NearestNeighbor<double, pos, DistanceCalculator>
ptsset(pts, distance);
vector<int> ind;
int n = 3; // Find 3 nearest neighbors
for (unsigned i = 0; i < pts.size(); ++i) {
ptsset.Search(pts, distance, pts[i], ind,
n, numeric_limits<double>::max(),
// exclude the point itself
0.0);
if (ind.size() != n)
throw GeographicErr("unexpected number of results");
cout << pts[i].id;
for (unsigned j = 0; j < ind.size(); ++j)
cout << " " << pts[ind[j]].id;
cout << "\n";
}
int setupcost, numsearches, searchcost, mincost, maxcost;
double mean, sd;
ptsset.Statistics(setupcost, numsearches, searchcost,
mincost, maxcost, mean, sd);
long long
totcost = setupcost + searchcost,
exhaustivecost = ((pts.size() - 1) * pts.size())/2;
cerr
<< "Number of distance calculations = " << totcost << "\n"
<< "With an exhaustive search = " << exhaustivecost << "\n"
<< "Ratio = " << double(totcost) / exhaustivecost << "\n"
<< "Efficiency improvement = "
<< 100 * (1 - double(totcost) / exhaustivecost) << "%\n";
}
catch (const exception& e) {
cerr << "Caught exception: " << e.what() << "\n";
return 1;
}
}
This reads in a set of points (in the form "id lat lon") for pts.txt,
puts them in a VP tree. Then for each point it looks up the 3 nearest
neighbors and prints the id and the id's of the neighbors (ranked by
distance).
Compile this with, e.g.,
g++ -O3 -o nearest nearest.cpp -lGeographic
If pts.txt contains 90000 points, then the computation completes in
about 6 secs (or 70 μs per point) on my home computer after doing about 3380000 distance
calculations. This
is about 1200 times more efficient than a brute force calculaion
(doing all N(N − 1)/2 distance calculations).
You could speed this up (by a factor of a "few") by using a crude
approximation to the distance (e.g., spherical or euclidean); just
modify the DistanceCalculator class appropriately. For example, this
version of DistanceCalculator returns the spherical distance in
degrees:
// A class to compute the spherical distance between 2 positions.
class DistanceCalculator {
public:
explicit DistanceCalculator(const Geodesic& /*geod*/) {}
double operator() (const pos& a, const pos& b) const {
double sphia, cphia, sphib, cphib, somgab, comgab;
Math::sincosd(a.lat, sphia, cphia);
Math::sincosd(b.lat, sphib, cphib);
Math::sincosd(Math::AngDiff(a.lon, b.lon), somgab, comgab);
return Math::atan2d(Math::hypot(cphia * sphib - sphia * cphib * comgab,
cphib * somgab),
sphia * sphib + cphia * cphib * comgab);
}
};
But now you have the added burden of ensuring that the approximation
is good enough. I recommend just using the correct geodesic distance
in the first place.
Details of the implementation of VP trees in GeographicLib are given
here.

Categories