I'm currently trying to port some prototype Python code to C++ and am running into an issue where I get different results for the SVD calculation between the two when using the exact same array.
Code:
Python:
(U, S, V) = np.linalg.svd(H)
print("h\n",H)
print("u\n",U)
print("v\n",V)
print("s\n",S)
rotation_matrix = np.dot(U, V)
prints
h
[[ 1.19586781e+00 -1.36504900e+00 3.04707238e+00]
[-3.24276981e-01 4.25640964e-01 -6.78455372e-02]
[ 4.58970250e-02 -7.33566042e-02 -2.96605698e-03]]
u
[[-0.99546325 -0.09501679 0.0049729 ]
[ 0.09441242 -0.97994807 0.17546529]
[-0.01179897 0.17513875 0.98447306]]
v
[[-0.34290622 0.39295764 -0.85322893]
[ 0.49311955 -0.6977843 -0.51954806]
[-0.79953014 -0.59890012 0.04549948]]
s
[3.5624894 0.43029207 0.00721429]
C++ code:
std::cout << "H\n" << HTest << std::endl;
Eigen::JacobiSVD<Eigen::MatrixXd> svd;
svd.compute(HTest, Eigen::ComputeThinV | Eigen::ComputeThinU);
std::cout << "h is" << std::endl << HTest << std::endl;
std::cout << "Its singular values are:" << std::endl << svd.singularValues() << std::endl;
std::cout << "Its left singular vectors are the columns of the thin U matrix:" << std::endl << -1*svd.matrixU() << std::endl;
std::cout << "Its right singular vectors are the columns of the thin V matrix:" << std::endl << -1*svd.matrixV() << std::endl;
prints:
h is
1.19587 -1.36505 3.04707
-0.324277 0.425641 -0.0678455
0.045897 -0.0733566 -0.00296606
Its singular values are:
3.56249
0.430292
0.00721429
Its left singular vectors are the columns of the thin U matrix:
-0.995463 -0.0950168 0.0049729
0.0944124 -0.979948 0.175465
-0.011799 0.175139 0.984473
Its right singular vectors are the columns of the thin V matrix:
-0.342906 0.49312 -0.79953
0.392958 -0.697784 -0.5989
-0.853229 -0.519548 0.0454995
So H, U, and S are equivalent between the two, but V is not. What could cause this?
I didn't notice that the V's were just transposes of each other. User chrslg has a good explanation for why this is so, so I'll just copy it here:
"I'd say, "because" :-). I don't think there is a good reason. Just 2 implementations. In maths lessons, you've probably learned SVD decomposition with the formula M = U.S.Vᵀ. So the C++ library probably sticks to this formula, and gives U, S, V such that M = U.S.Vᵀ. Whereas the linalg documentation says that it returns U, S, V such that M = (U*S)@V. So one calls V what the other calls Vᵀ. Hard to say which one is right. As long as they do what their doc says they do."
I need to calculate the slope and intercept of a regression line between two data vectors. So I made a prototype in Python; code below:
import statsmodels.api as sm

A = [1,2,5,7,14,17,19]
b = [2,14,6,7,13,27,29]
A = sm.add_constant(A)
results = sm.OLS(A, b).fit()
print("results: ", results.params)
output: [0.04841897 0.64278656]
Now I need to replicate this using the Eigen library in C++, and as I understood it, I need to append a column of ones to the matrix A. If I do so, I get totally different results for the regression than if I use no second column or a column of zeros. C++ code below:
Eigen::VectorXd A(7);
Eigen::VectorXd b(7);
A << 1, 2, 5, 7, 14, 17, 19;
b << 2, 14, 6, 7, 13, 27, 29;
// append a column of ones for the intercept
Eigen::MatrixXd new_A(A.rows(), 2);
Eigen::VectorXd d = Eigen::VectorXd::Constant(A.rows(), 1);
new_A << A, d;
// least-squares solve of new_A * res = b via SVD
Eigen::MatrixXd res = new_A.bdcSvd(Eigen::ComputeThinU | Eigen::ComputeThinV).solve(b);
cout << " slope: " << res.coeff(0, 0) << " intercept: " << res.coeff(1, 0) << endl;
cout << "dbl check: " << (new_A.transpose() * new_A).ldlt().solve(new_A.transpose() * b) << endl;
output with the '1' column added to new_A -> slope: 1.21644 intercept: 2.70444
output with a '0' or no column added -> slope: 0.642787 intercept: 0
How do I get the same results in C++? And which one is the right one? I'm inclined to trust the Python one, since I get the same value when I use a 0 column.
thank you,
Merlin
It seems I had to swap new_A with b, and replace ComputeThin with ComputeFull so that it builds:
Eigen::MatrixXd res = b.bdcSvd(Eigen::ComputeFullU | Eigen::ComputeFullV).solve(new_A);
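In hindsight this makes sense: sm.OLS(endog, exog) expects the dependent variable first, so my sm.OLS(A, b) was regressing the columns of A on b, and the swapped Eigen call replicates exactly that. For reference, a minimal sketch of the conventional regression of b on A, which reproduces the first Eigen result:

import statsmodels.api as sm

A = [1, 2, 5, 7, 14, 17, 19]
b = [2, 14, 6, 7, 13, 27, 29]
X = sm.add_constant(A)        # prepends the intercept column
results = sm.OLS(b, X).fit()  # endog (y) first, exog (X) second
print(results.params)         # [2.70444444 1.21644444] -> intercept, slope
# matches the Eigen output "slope: 1.21644 intercept: 2.70444" from new_A with the ones column
# (add_constant prepends the ones, while new_A appended them, hence the reversed order)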
I've created some data in numpy that I would like to use in a separate C++ program. Therefore I need to save the data using python and later load it in C++. What is the best way of doing this?
My numpy ndarray is float32 and of shape [10000 x 18 x 5]. I can save it, for example, using
numpy.save(filename, data)
Is there an easy way to load such data in C++? Target structure could be an Eigen::Matrix for example.
After searching for hours I found my year-old example files.
Caveats:
solution only covers 2D matrices
not suited for 3-dimensional or generic ndarrays
Write the numpy array to an ASCII file with a header specifying nrows, ncols:
def write_matrix2D_to_ascii(filename, matrix2D):
    nrows, ncols = matrix2D.shape
    with open(filename, "w") as file:
        # write header [rows x cols]
        file.write(f"{nrows} {ncols}")
        file.write("\n")
        # write values
        for row in range(nrows):
            for col in range(ncols):
                value = matrix2D[row, col]
                file.write(str(value))
                file.write(" ")
            file.write("\n")
Example output data-file.txt looks like this (first row is header specifying nrows and ncols):
2 3
1.0 2.0 3.0
4.0 5.0 6.0
C++ function to read a matrix from an ASCII file into an OpenCV matrix:
#include <iostream>
#include <fstream>
#include <iomanip>               // set precision of output string
#include <opencv2/core/core.hpp> // OpenCV matrices for storing data

using namespace std;
using namespace cv;

void readMatAsciiWithHeader(const string& filename, Mat& matData)
{
    cout << "Create matrix from file: " << filename << endl;
    ifstream inFileStream(filename.c_str());
    if (!inFileStream) {
        cout << "File cannot be found" << endl;
        exit(-1);
    }
    int rows, cols;
    inFileStream >> rows;
    inFileStream >> cols;
    matData.create(rows, cols, CV_32F);
    cout << "numRows: " << rows << "\t numCols: " << cols << endl;
    matData.setTo(0); // init all values to 0
    float* dptr;
    for (int ridx = 0; ridx < matData.rows; ++ridx) {
        dptr = matData.ptr<float>(ridx);
        for (int cidx = 0; cidx < matData.cols; ++cidx, ++dptr) {
            inFileStream >> *dptr;
        }
    }
    inFileStream.close();
}
Driver code to use the above function in a C++ program:
Mat myMatrix;
readMatAsciiWithHeader("path/to/data-file.txt", myMatrix);
For completeness, some code to save the data using C++:
int saveMatAsciiWithHeader(const string& filename, Mat& matData)
{
    if (matData.empty()) {
        cout << "File could not be saved. MatData is empty" << endl;
        return 0;
    }
    ofstream oStream(filename.c_str());
    // Create header
    oStream << matData.rows << " " << matData.cols << endl;
    // Write data
    for (int ridx = 0; ridx < matData.rows; ridx++)
    {
        for (int cidx = 0; cidx < matData.cols; cidx++)
        {
            oStream << setprecision(9) << matData.at<float>(ridx, cidx) << " ";
        }
        oStream << endl;
    }
    oStream.close();
    cout << "Saved " << filename.c_str() << endl;
    return 1;
}
Future work:
solution for 3D matrices (a first sketch below)
conversion to Eigen::Matrix
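For the 3D case, a minimal Python sketch extending the same ASCII format so the header carries all three dimensions; the layout is my assumption, not an established format, and a C++ reader would have to follow the same convention:

import numpy as np

def write_ndarray3D_to_ascii(filename, arr):
    # header: "dim0 dim1 dim2", then dim0 blocks of dim1 rows x dim2 columns
    d0, d1, d2 = arr.shape
    with open(filename, "w") as file:
        file.write(f"{d0} {d1} {d2}\n")
        for block in arr:              # each block is a d1 x d2 matrix
            for row in block:
                file.write(" ".join(str(value) for value in row))
                file.write("\n")

# usage, e.g. for the [10000 x 18 x 5] float32 array from the question:
# write_ndarray3D_to_ascii("data-file-3d.txt", data)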
I am reading in ADC values (assuming I am reading things right; I'm still new to this) from a [NAU7802](https://www.nuvoton.com/resource-files/NAU7802%20Data%20Sheet%20V1.7.pdf), and I am getting the output as three bytes, each an 8-bit integer (i.e. 0-255). How do I merge the three bytes together to get the output as a 24-bit value (0-16777215)?
Here is the code I am using; I assume I did this right, but I am still new to I2C communication.
from smbus2 import SMBus
import time

bus = SMBus(1)
address = 0x2a
bus.write_byte_data(address, 0x00, 6)
data = bus.read_i2c_block_data(address, 0x12, 3)
print(bus.read_i2c_block_data(address, 0x12, 3))
adc1 = bin(data[2])
adc2 = bin(data[1])
adc3 = bin(data[0])
print(adc1)
print(adc2)
print(adc3)
When I convert the binary manually, I get an output that corresponds to what I am inputting to the ADC.
Output:
[128, 136, 136]
0b10001001
0b10001000
0b10000000
try this:
data=[128, 136, 136]
data[0] + (data[1] << 8) + (data[2] << 16)
# 8947840
or
((data[2] << 24) | (data[1] << 16) | (data[0] << 8)) >> 8
# 8947840
(8947840 & 0xFF0000) >> 16
#136
(8947840 & 0x00FF00) >> 8
#136
(8947840 & 0x0000FF)
#128
Here's an example of unpacking 3 different numbers:
data=[118, 123, 41]
c = data[0] + (data[1] << 8) + (data[2] << 16)
#2718582
(c & 0xFF0000) >> 16
#41
(c & 0x00FF00) >> 8
#123
(c & 0x0000FF)
#118
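One caveat worth checking in the datasheet: load-cell ADCs like the NAU7802 generally report the 24-bit result as a signed two's-complement number. If that applies here, a minimal sketch of the sign extension (assuming the same byte order as above, data[0] = LSB):

def to_signed_24bit(data):
    # assumes data[0] is the LSB and data[2] the MSB, as in the packing above;
    # check which register holds the high byte on your device
    value = data[0] | (data[1] << 8) | (data[2] << 16)
    if value & 0x800000:  # sign bit of a 24-bit two's-complement value
        value -= 1 << 24
    return value

print(to_signed_24bit([128, 136, 136]))  # -7829376: bit 23 is set, so the raw 8947840 wraps negative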
Looking for help with an algorithm for a local machine or cluster (Python, R, JavaScript, any language).
I have a list of locations with coordinates.
# R script
n <- 10
set.seed(1)
index <- paste0("id_",c(1:n))
lat <- runif(n, 32.0, 41)
lon <- runif(n, 84, 112)*(-1)
values <- as.integer(runif(n, 50, 100))
df <- data.frame(index, lat, lon, values, stringsAsFactors = FALSE)
names(df) <- c('loc_id','lat','lon', 'value')
loc_id lat lon value
1 id_1 34.38958 -89.76729 96
2 id_2 35.34912 -88.94359 60
3 id_3 37.15568 -103.23664 82
4 id_4 40.17387 -94.75490 56
5 id_5 33.81514 -105.55556 63
6 id_6 40.08551 -97.93558 69
7 id_7 40.50208 -104.09332 50
8 id_8 37.94718 -111.77337 69
9 id_9 37.66203 -94.64099 93
10 id_10 32.55608 -105.76847 67
I need to find the 3 closest locations for each location in the table.
This is my code in R:
# R script
require(dplyr)
require(geosphere)

start.time <- Sys.time()
d1 <- df
sample <- 999999999999
distances <- list("init1" = sample, "init2" = sample, "init3" = sample)
d1$distances <- apply(d1, 1, function(x){distances})
n_rows = nrow(d1)
for (i in 1:(n_rows-1)) {
  # current location
  dot1 <- c(d1$lon[i], d1$lat[i])
  for (k in (i+1):n_rows) {
    # next location
    dot2 <- c(d1$lon[k], d1$lat[k])
    # distance between locations
    meters_between <- as.integer(distm(dot1, dot2, fun = distHaversine))
    # updating current location distances
    distances <- d1$distances[[i]]
    distances[d1$loc_id[k]] <- meters_between
    d1$distances[[i]] <- distances[order(unlist(distances), decreasing=FALSE)][1:3]
    # updating next location distances
    distances <- d1$distances[[k]]
    distances[d1$loc_id[i]] <- meters_between
    d1$distances[[k]] <- distances[order(unlist(distances), decreasing=FALSE)][1:3]
  }
}
But it takes too much time (the pairwise approach is O(N²), so 90,000 locations would need roughly 4 billion distance computations):
# [1] "For 10 rows and 45 iterations takes 0.124729156494141 sec. Average sec 0.00277175903320313 per row."
# [1] "For 100 rows and 4950 iterations takes 2.54944682121277 sec. Average sec 0.000515039761861165 per row."
# [1] "For 200 rows and 19900 iterations takes 10.1178169250488 sec. Average sec 0.000508433011308986 per row."
# [1] "For 500 rows and 124750 iterations takes 73.7151870727539 sec. Average sec 0.000590903303188408 per row."
I did the same in Python:
# Python script
import pandas as pd
import numpy as np

n = 10
np.random.seed(1)
data_m = np.random.uniform(0, 5, 5)
data = {'loc_id': range(1, n+1),
        'lat': np.random.uniform(32, 41, n),
        'lon': np.random.uniform(84, 112, n)*(-1),
        'values': np.random.randint(50, 100, n)}
df = pd.DataFrame(data)[['loc_id', 'lat', 'lon', 'values']]
df['loc_id'] = df['loc_id'].apply(lambda x: 'id_{0}'.format(x))
df = df.reset_index().drop('index', axis=1).set_index('loc_id')

from geopy.distance import distance
from datetime import datetime

start_time = datetime.now()
sample = 999999999999
df['distances'] = np.nan
df['distances'] = df['distances'].apply(
    lambda x: [{'init1': sample}, {'init2': sample}, {'init3': sample}])
n_rows = len(df)
rows_done = 0
for i, row_i in df.head(n_rows-1).iterrows():
    # current location
    dot1 = (row_i['lat'], row_i['lon'])
    rows_done = rows_done + 1
    for k, row_k in df.tail(n_rows-rows_done).iterrows():
        # next location
        dot2 = (row_k['lat'], row_k['lon'])
        meters_between = int(distance(dot1, dot2).meters)
        # updating current location distances
        distances = df.at[i, 'distances']
        distances.append({k: meters_between})
        distances_sorted = sorted(distances, key=lambda x: x[next(iter(x))])[:3]
        df.at[i, 'distances'] = distances_sorted
        # updating next location distances
        distances = df.at[k, 'distances']
        distances.append({i: meters_between})
        distances_sorted = sorted(distances, key=lambda x: x[next(iter(x))])[:3]
        df.at[k, 'distances'] = distances_sorted
print(df)
Almost the same performance.
Does anybody know a better approach? In my task it has to be done for 90,000 locations. I even thought about Hadoop/MapReduce/Spark, but I have no idea how to do it in distributed mode.
I am glad to hear any ideas or suggestions.
If Euclidean distance is OK, then nn2 uses kd-trees and C code, so it should be fast (k = 4 because each point's nearest neighbor is itself, leaving the three true neighbors):
library(RANN)
nn2(df[2:3], k = 4)
This took a total of 0.06 to 0.11 seconds on my not particularly fast laptop to process n = 10,000 rows and a total of 1.00 to 1.25 seconds for 90,000 rows.
I can offer a Python solution with scipy:
from scipy.spatial import distance
from geopy.distance import vincenty  # removed in geopy 2.0; geopy.distance.geodesic is the modern equivalent

v = distance.cdist(df[['lat','lon']].values, df[['lat','lon']].values,
                   lambda u, v: vincenty(u, v).kilometers)
np.sort(v, axis=1)[:, 1:4]
Out[1033]:
array([[384.09948155, 468.15944729, 545.41393271],
[270.07677993, 397.21974571, 659.96238603],
[384.09948155, 397.21974571, 619.616239 ],
[203.07302273, 483.54687912, 741.21396029],
[203.07302273, 444.49156394, 659.96238603],
[437.31308598, 468.15944729, 494.91879983],
[494.91879983, 695.91437812, 697.27399161],
[270.07677993, 444.49156394, 483.54687912],
[530.54946479, 626.29467739, 695.91437812],
[437.31308598, 545.41393271, 697.27399161]])
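np.sort gives the three smallest distances but discards which locations they belong to; a small follow-up sketch with np.argsort recovers the neighbor ids (assuming df is indexed by loc_id as in the question):

import numpy as np

idx = np.argsort(v, axis=1)[:, 1:4]   # column 0 is each point itself (distance 0)
neighbor_ids = df.index.values[idx]   # map column positions back to loc_id labels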
Here's how to solve this problem with C++ and my library GeographicLib (version 1.47 or later). This uses true ellipsoidal geodesic distances and a vantage point tree to optimize the search for nearest neighbors.
#include <exception>
#include <iostream>
#include <limits>
#include <vector>
#include <fstream>
#include <string>
#include <GeographicLib/NearestNeighbor.hpp>
#include <GeographicLib/Geodesic.hpp>

using namespace std;
using namespace GeographicLib;

// A structure to hold a geographic coordinate.
struct pos {
  string id;
  double lat, lon;
  pos(const string& _id = "", double _lat = 0, double _lon = 0) :
    id(_id), lat(_lat), lon(_lon) {}
};

// A class to compute the distance between 2 positions.
class DistanceCalculator {
private:
  Geodesic _geod;
public:
  explicit DistanceCalculator(const Geodesic& geod) : _geod(geod) {}
  double operator() (const pos& a, const pos& b) const {
    double d;
    _geod.Inverse(a.lat, a.lon, b.lat, b.lon, d);
    if ( !(d >= 0) )
      // Catch illegal positions which result in d = NaN
      throw GeographicErr("distance doesn't satisfy d >= 0");
    return d;
  }
};

int main() {
  try {
    // Read in pts
    vector<pos> pts;
    string id;
    double lat, lon;
    {
      ifstream is("pts.txt"); // lines of "id lat lon"
      if (!is.good())
        throw GeographicErr("pts.txt not readable");
      while (is >> id >> lat >> lon)
        pts.push_back(pos(id, lat, lon));
      if (pts.size() == 0)
        throw GeographicErr("need at least one location");
    }
    // Define a distance function object
    DistanceCalculator distance(Geodesic::WGS84());
    // Create NearestNeighbor object
    NearestNeighbor<double, pos, DistanceCalculator>
      ptsset(pts, distance);
    vector<int> ind;
    int n = 3; // Find 3 nearest neighbors
    for (unsigned i = 0; i < pts.size(); ++i) {
      ptsset.Search(pts, distance, pts[i], ind,
                    n, numeric_limits<double>::max(),
                    // exclude the point itself
                    0.0);
      if (ind.size() != n)
        throw GeographicErr("unexpected number of results");
      cout << pts[i].id;
      for (unsigned j = 0; j < ind.size(); ++j)
        cout << " " << pts[ind[j]].id;
      cout << "\n";
    }
    int setupcost, numsearches, searchcost, mincost, maxcost;
    double mean, sd;
    ptsset.Statistics(setupcost, numsearches, searchcost,
                      mincost, maxcost, mean, sd);
    long long
      totcost = setupcost + searchcost,
      exhaustivecost = ((pts.size() - 1) * pts.size())/2;
    cerr
      << "Number of distance calculations = " << totcost << "\n"
      << "With an exhaustive search = " << exhaustivecost << "\n"
      << "Ratio = " << double(totcost) / exhaustivecost << "\n"
      << "Efficiency improvement = "
      << 100 * (1 - double(totcost) / exhaustivecost) << "%\n";
  }
  catch (const exception& e) {
    cerr << "Caught exception: " << e.what() << "\n";
    return 1;
  }
}
This reads in a set of points (in the form "id lat lon") from pts.txt and puts them in a VP tree. Then, for each point, it looks up the 3 nearest neighbors and prints its id and the ids of the neighbors (ranked by distance).
Compile this with, e.g.,
g++ -O3 -o nearest nearest.cpp -lGeographic
If pts.txt contains 90000 points, then the computation completes in about 6 secs (or 70 μs per point) on my home computer, after doing about 3380000 distance calculations. This is about 1200 times more efficient than a brute-force calculation (doing all N(N − 1)/2 distance calculations).
You could speed this up (by a factor of a "few") by using a crude approximation to the distance (e.g., spherical or Euclidean); just modify the DistanceCalculator class appropriately. For example, this version of DistanceCalculator returns the spherical distance in degrees:
// A class to compute the spherical distance between 2 positions.
class DistanceCalculator {
public:
  explicit DistanceCalculator(const Geodesic& /*geod*/) {}
  double operator() (const pos& a, const pos& b) const {
    double sphia, cphia, sphib, cphib, somgab, comgab;
    Math::sincosd(a.lat, sphia, cphia);
    Math::sincosd(b.lat, sphib, cphib);
    Math::sincosd(Math::AngDiff(a.lon, b.lon), somgab, comgab);
    return Math::atan2d(Math::hypot(cphia * sphib - sphia * cphib * comgab,
                                    cphib * somgab),
                        sphia * sphib + cphia * cphib * comgab);
  }
};
But now you have the added burden of ensuring that the approximation
is good enough. I recommend just using the correct geodesic distance
in the first place.
Details of the implementation of VP trees in GeographicLib are given
here.