I am benchmarking matrix multiplication for different libraries as I am thinking of rewriting some cython code to native c++. However the simple tests seem to imply that numpy is faster than BLAS or eigen for simple matrix multiplications.
I have written the following files:
#!test_blas.cpp
#include <random>
#include <cstdio>
#include <stdlib.h>
#include <iostream>
#include <cblas.h>
int main ( int argc, char* argv[] ) {
// Random numbers
std::mt19937_64 rnd;
std::uniform_real_distribution<double> doubleDist(0, 1);
// Create arrays that represent the matrices A,B,C
const int n = 2000;
double* A = new double[n*n];
double* B = new double[n*n];
double* C = new double[n*n];
// Fill A and B with random numbers
for(uint i =0; i <n; i++){
for(uint j=0; j<n; j++){
A[i*n+j] = doubleDist(rnd);
B[i*n+j] = doubleDist(rnd);
}
}
// Calculate A*B=C
clock_t start = clock();
cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, A, n, B, n, 0.0, C, n);
clock_t end = clock();
double time = double(end - start)/ CLOCKS_PER_SEC;
std::cout<< "Time taken : " << time << std::endl;
// Clean up
delete[] A;
delete[] B;
delete[] C;
return 0;
}
#!test_eigen.cpp
#include <iostream>
#include <Eigen/Dense>
using namespace Eigen;
int main()
{
int n_a_rows = 2000;
int n_a_cols = 2000;
int n_b_rows = n_a_cols;
int n_b_cols = 2000;
MatrixXd a(n_a_rows, n_a_cols);
for (int i = 0; i < n_a_rows; ++ i)
for (int j = 0; j < n_a_cols; ++ j)
a (i, j) = n_a_cols * i + j;
MatrixXd b (n_b_rows, n_b_cols);
for (int i = 0; i < n_b_rows; ++ i)
for (int j = 0; j < n_b_cols; ++ j)
b (i, j) = n_b_cols * i + j;
MatrixXd d (n_a_rows, n_b_cols);
clock_t begin = clock();
d = a * b;
clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
std::cout << "Time taken : " << elapsed_secs << std::endl;
}
#!test_numpy.py
import numpy as np
import time
N = 2000
a = np.random.rand(N, N)
b = np.random.rand(N, N)
start = time.time()
c = a.dot(b)
print(f"Time taken : {time.time() - start}")
Finally I created the following test file
#!test_matrix.sh
c++ -O2 -march=native -std=c++11 -I /usr/include/eigen3 test_eigen.cpp -o eigen
c++ -O2 -march=native -std=c++11 test_blas.cpp -o blas -lcblas
echo "testing BLAS"
./blas
echo "testing Eigen"
./eigen
echo "testing numpy"
python test_numpy.py
which yields the output
testing BLAS
Time taken : 1.63807
testing Eigen
Time taken : 0.795115
testing numpy
Time taken : 0.28397703170776367
Now my question is, how come numpy is the fastest of these tests? Am I missing something with regards to optimizations?
One thing could be that numpy uses threading to compute the matrix product. Adding the compiler flag -fopenmp however yields worse performance for eigen and BLAS.
I am using g++ version 9.0.3-1. Numpy is version 1.18.1 using python 3.8.2. Thanks in advance.
Related
This is probably going to be a long question, I apologize in advance.
I'm working on a project with the goal of researching different solutions for the closest string problem.
Let s_1, ... s_n be strings of length m. Find a string s of length m such that it minimizes max{d(s, s_i) | i = 1, ..., n}, where d is the hamming distance.
One solution that has been tried is one using ant colony optimization, as decribed here.
The paper itself does not go into implementation details, so I've done my best on efficiency. However, efficiency is not the only unusual behaviour.
I'm not sure whether it's common pratice to do so, but I will present my code through pastebin since I believe it would overwhelm the thread if I should put it directly here. If that turns out to be a problem, I won't mind editing the thread to put it here directly. As all the previous algorithms I've experimented with, I've written this one in python initially. Here's the code:
def solve_(self, problem: CSProblem) -> CSSolution:
m, n, alphabet, strings = problem.m, problem.n, problem.alphabet, problem.strings
A = len(alphabet)
rho = self.config['RHO']
colony_size = self.config['COLONY_SIZE']
global_best_ant = None
global_best_metric = m
ants = np.full((colony_size, m), '')
world_trails = np.full((m, A), 1 / A)
for iteration in range(self.config['MAX_ITERS']):
local_best_ant = None
local_best_metric = m
for ant_idx in range(colony_size):
for next_character_index in range(m):
ants[ant_idx][next_character_index] = random.choices(alphabet, weights=world_trails[next_character_index], k=1)[0]
ant_metric = utils.problem_metric(ants[ant_idx], strings)
if ant_metric < local_best_metric:
local_best_metric = ant_metric
local_best_ant = ants[ant_idx]
# First we perform pheromone evaporation
for i in range(m):
for j in range(A):
world_trails[i][j] = world_trails[i][j] * (1 - rho)
# Now, using the elitist strategy, only the best ant is allowed to update his pheromone trails
best_ant_ys = (alphabet.index(a) for a in local_best_ant)
best_ant_xs = range(m)
for x, y in zip(best_ant_xs, best_ant_ys):
world_trails[x][y] = world_trails[x][y] + (1 - local_best_metric / m)
if local_best_metric < global_best_metric:
global_best_metric = local_best_metric
global_best_ant = local_best_ant
return CSSolution(''.join(global_best_ant), global_best_metric)
The utils.problem_metric function looks like this:
def hamming_distance(s1, s2):
return sum(c1 != c2 for c1, c2 in zip(s1, s2))
def problem_metric(string, references):
return max(hamming_distance(string, r) for r in references)
I've seen that there are a lot more tweaks and other parameters you can add to ACO, but I've kept it simple for now. The configuration I'm using is is 250 iterations, colony size od 10 ants and rho=0.1. The problem that I'm testing it on is from here: http://tcs.informatik.uos.de/research/csp_cssp , the one called 2-10-250-1-0.csp (the first one). The alphabet consists only of '0' and '1', the strings are of length 250, and there are 10 strings in total.
For the ACO configuration that I've mentioned, this problem, using the python solver, gets solved on average in 5 seconds, and the average target function value is 108.55 (simulated 20 times). The correct target function value is 96. Ironically, the 5-second average is good compared to what it used to be in my first attempt of implementing this solution. However, it's still surprisingly slow.
After doing all kinds of optimizations, I've decided to try and implement the exact same solution in C++ so see whether there will be a significant difference between the running times. Here's the C++ solution:
#include <iostream>
#include <vector>
#include <algorithm>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <random>
#include <chrono>
#include <map>
class CSPProblem{
public:
int m;
int n;
std::vector<char> alphabet;
std::vector<std::string> strings;
CSPProblem(int m, int n, std::vector<char> alphabet, std::vector<std::string> strings)
: m(m), n(n), alphabet(alphabet), strings(strings)
{
}
static CSPProblem from_csp(std::string filepath){
std::ifstream file(filepath);
std::string line;
std::vector<std::string> input_lines;
while (std::getline(file, line)){
input_lines.push_back(line);
}
int alphabet_size = std::stoi(input_lines[0]);
int n = std::stoi(input_lines[1]);
int m = std::stoi(input_lines[2]);
std::vector<char> alphabet;
for (int i = 3; i < 3 + alphabet_size; i++){
alphabet.push_back(input_lines[i][0]);
}
std::vector<std::string> strings;
for (int i = 3 + alphabet_size; i < input_lines.size(); i++){
strings.push_back(input_lines[i]);
}
return CSPProblem(m, n, alphabet, strings);
}
int hamm(const std::string& s1, const std::string& s2) const{
int h = 0;
for (int i = 0; i < s1.size(); i++){
if (s1[i] != s2[i])
h++;
}
return h;
}
int measure(const std::string& sol) const{
int mm = 0;
for (const auto& s: strings){
int h = hamm(sol, s);
if (h > mm){
mm = h;
}
}
return mm;
}
friend std::ostream& operator<<(std::ostream& out, CSPProblem problem){
out << "m: " << problem.m << std::endl;
out << "n: " << problem.n << std::endl;
out << "alphabet_size: " << problem.alphabet.size() << std::endl;
out << "alphabet: ";
for (const auto& a: problem.alphabet){
out << a << " ";
}
out << std::endl;
out << "strings:" << std::endl;
for (const auto& s: problem.strings){
out << "\t" << s << std::endl;
}
return out;
}
};
std::random_device rd;
std::mt19937 gen(rd());
int get_from_distrib(const std::vector<float>& weights){
std::discrete_distribution<> d(std::begin(weights), std::end(weights));
return d(gen);
}
int max_iter = 250;
float rho = 0.1f;
int colony_size = 10;
int ant_colony_solver(const CSPProblem& problem){
srand(time(NULL));
int m = problem.m;
int n = problem.n;
auto alphabet = problem.alphabet;
auto strings = problem.strings;
int A = alphabet.size();
float init_pher = 1.0 / A;
std::string global_best_ant;
int global_best_matric = m;
std::vector<std::vector<float>> world_trails(m, std::vector<float>(A, 0.0f));
for (int i = 0; i < m; i++){
for (int j = 0; j < A; j++){
world_trails[i][j] = init_pher;
}
}
std::vector<std::string> ants(colony_size, std::string(m, ' '));
for (int iteration = 0; iteration < max_iter; iteration++){
std::string local_best_ant;
int local_best_metric = m;
for (int ant_idx = 0; ant_idx < colony_size; ant_idx++){
for (int next_character_idx = 0; next_character_idx < m; next_character_idx++){
char next_char = alphabet[get_from_distrib(world_trails[next_character_idx])];
ants[ant_idx][next_character_idx] = next_char;
}
int ant_metric = problem.measure(ants[ant_idx]);
if (ant_metric < local_best_metric){
local_best_metric = ant_metric;
local_best_ant = ants[ant_idx];
}
}
// Evaporation
for (int i = 0; i < m; i++){
for (int j = 0; j < A; j++){
world_trails[i][j] = world_trails[i][j] + (1.0 - rho);
}
}
std::vector<int> best_ant_xs;
for (int i = 0; i < m; i++){
best_ant_xs.push_back(i);
}
std::vector<int> best_ant_ys;
for (const auto& c: local_best_ant){
auto loc = std::find(std::begin(alphabet), std::end(alphabet), c);
int idx = loc- std::begin(alphabet);
best_ant_ys.push_back(idx);
}
for (int i = 0; i < m; i++){
int x = best_ant_xs[i];
int y = best_ant_ys[i];
world_trails[x][y] = world_trails[x][y] + (1.0 - static_cast<float>(local_best_metric) / m);
}
if (local_best_metric < global_best_matric){
global_best_matric = local_best_metric;
global_best_ant = local_best_ant;
}
}
return global_best_matric;
}
int main(){
auto problem = CSPProblem::from_csp("in.csp");
int TRIES = 20;
std::vector<int> times;
std::vector<int> measures;
for (int i = 0; i < TRIES; i++){
auto start = std::chrono::high_resolution_clock::now();
int m = ant_colony_solver(problem);
auto stop = std::chrono::high_resolution_clock::now();
int duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
times.push_back(duration);
measures.push_back(m);
}
float average_time = static_cast<float>(std::accumulate(std::begin(times), std::end(times), 0)) / TRIES;
float average_measure = static_cast<float>(std::accumulate(std::begin(measures), std::end(measures), 0)) / TRIES;
std::cout << "Average running time: " << average_time << std::endl;
std::cout << "Average solution: " << average_measure << std::endl;
std::cout << "all solutions: ";
for (const auto& m: measures) std::cout << m << " ";
std::cout << std::endl;
return 0;
}
The average running time now is only 530.4 miliseconds. However, the average target function value is 122.75, which is significantly higher than that of the python solution.
If the average function values were the same, and the times were as they are, I would simply write this off as 'C++ is faster than python' (even though the difference in speed is also very suspiscious). But, since C++ yields worse solutions, it leads me to believe that I've done something wrong in C++. What I'm suspiscious of is the way I'm generating an alphabet index using weights. In python I've done it using random.choices as follows:
ants[ant_idx][next_character_index] = random.choices(alphabet, weights=world_trails[next_character_index], k=1)[0]
As for C++, I haven't done it in a while so I'm a bit rusty on reading cppreference (which is a skill of its own), and the std::discrete_distribution solution is something I've plain copied from the reference:
std::random_device rd;
std::mt19937 gen(rd());
int get_from_distrib(const std::vector<float>& weights){
std::discrete_distribution<> d(std::begin(weights), std::end(weights));
return d(gen);
}
The suspiscious thing here is the fact that I'm declaring the std::random_device and std::mt19937 objects globally and using the same ones every time. I have not been able to find an answer to whether this is the way they're meant to be used. However, if I put them in the function:
int get_from_distrib(const std::vector<float>& weights){
std::random_device rd;
std::mt19937 gen(rd());
std::discrete_distribution<> d(std::begin(weights), std::end(weights));
return d(gen);
}
the average running time gets significantly worse, clocking in at 8.84 seconds. However, even more surprisingly, the average function value gets worse as well, at 130.
Again, if only one of the two things changed (say if only the time went up) I would have been able to draw some conclusions. This way it only gets more confusing.
So, does anybody have an idea of why this is happening?
Thanks in advance.
MAJOR EDIT: I feel embarrased having asked such a huge question when in fact the problem lies in a simple typo. Namely in the evaporation step in the C++ version I put a + instead of a *.
Now the algorithms behave identically in terms of average solution quality.
However, I could still use some tips on how to optimize the python version.
Apart form the dumb mistake I've mentioned in the question edit, it seems I've finally found a way to optimize the python solution decently. First of all, keeping world_trails and ants as numpy arrays instead of lists of lists actually slowed things down. Furthermore, I actually stopped keeping a list of ants altogether since I only ever need the best one per iteration.
Lastly, running cProfile indicated that a lot of the time was spent on random.choices, therefore I've decided to implement my own version of it suited specifically for this case. I've done this by pre-computing total weight sum per character for each next iteration (in the trail_row_wise_sums array), and using the following function:
def fast_pick(arr, weights, ws):
r = random.random()*ws
for i in range(len(arr)):
if r < weights[i]:
return arr[i]
r -= weights[i]
return 0
The new version now looks like this:
def solve_(self, problem: CSProblem) -> CSSolution:
m, n, alphabet, strings = problem.m, problem.n, problem.alphabet, problem.strings
A = len(alphabet)
rho = self.config['RHO']
colony_size = self.config['COLONY_SIZE']
miters = self.config['MAX_ITERS']
global_best_ant = None
global_best_metric = m
init_pher = 1.0 / A
world_trails = [[init_pher for _ in range(A)] for _ in range(m)]
trail_row_wise_sums = [1.0 for _ in range(m)]
for iteration in tqdm(range(miters)):
local_best_ant = None
local_best_metric = m
for _ in range(colony_size):
ant = ''.join(fast_pick(alphabet, world_trails[next_character_index], trail_row_wise_sums[next_character_index]) for next_character_index in range(m))
ant_metric = utils.problem_metric(ant, strings)
if ant_metric <= local_best_metric:
local_best_metric = ant_metric
local_best_ant = ant
# First we perform pheromone evaporation
for i in range(m):
for j in range(A):
world_trails[i][j] = world_trails[i][j] * (1 - rho)
# Now, using the elitist strategy, only the best ant is allowed to update his pheromone trails
best_ant_ys = (alphabet.index(a) for a in local_best_ant)
best_ant_xs = range(m)
for x, y in zip(best_ant_xs, best_ant_ys):
world_trails[x][y] = world_trails[x][y] + (1 - 1.0*local_best_metric / m)
if local_best_metric < global_best_metric:
global_best_metric = local_best_metric
global_best_ant = local_best_ant
trail_row_wise_sums = [sum(world_trails[i]) for i in range(m)]
return CSSolution(global_best_ant, global_best_metric)
The average running time is now down to 800 miliseconds (compared to 5 seconds that it was before). Granted, applying the same fast_pick optimization to the C++ solution did also speed up the C++ version (around 150 ms) but I guess now I can write it off as C++ being faster than python.
Profiler also showed that a lot of the time was spent on calculating Hamming distances, but that's to be expected, and apart from that I see no other way of computing the Hamming distance between arbitrary strings more efficiently.
I want to Cythonize portion of a pyx script which involves work with numpy arrays with complex numbers. The relevant portion of the python script looks like this:
M = np.dot(N , Q)
In my work, N, Q and M are numpy arrays with complex number entries.
Specifically, I want to transfer the matrices N and Q to a C++ code and do the matrix multiplication in C++.
While I know the method to transfer real valued numpy arrays using pointers to C++ script, followed by use of cython, I am a bit confused about how I should approach things for numpy arrays with complex values.
This is how I am trying to transfer the array from pyx to C++ presently.
import numpy as np
cimport numpy as np
cdef extern from "./matmult.h" nogil:
void mult(double* M, double* N, double* Q)
def sim():
cdef:
np.ndarray[np.complex128_t,ndim=2] N = np.zeros(( 2 , 2 ), dtype=np.float64)
np.ndarray[np.complex128_t,ndim=2] Q = np.zeros(( 2 , 2 ), dtype=np.float64)
np.ndarray[np.complex128_t,ndim=2] M = np.zeros(( 2 , 2 ), dtype=np.float64)
N = np.array([[1.1 + 2j,2.2],[3.3,4.4]])
Q = np.array([[3.3,4.4+5j],[5.5,6.6]])
mult(&M[0,0], &N[0,0], &Q[0,0])
print M
This is my C++ code:
#include "matmult.h"
using namespace std;
int main(){}
void mult(double *M, double *N, double *Q)
{
double P[2][2], A[2][2], B[2][2];
for (int i=0; i<2; i++)
{
for (int j=0; j<2; j++)
{
A[i][j] = *( N + ((2*i) + j) );
B[i][j] = *( Q + ((2*i) + j) );
P[i][j] = 0;
}
}
for (int i=0; i<2; i++)
{
for (int j=0; j<2; j++)
{
for (int k=0; k<2; k++)
{
P[i][j] += A[i][k]*B[k][i];
}
}
}
for (int i=0; i<2; i++)
{
for (int j=0; j<2; j++)
{
*( M + ((2*i) + j) ) = P[i][j];
}
}
}
When I compile this using cython, I get the following error
mat.pyx:17:27: Cannot assign type 'double complex *' to 'double *'
I will be grateful to have some help here.
This error message is telling you what's wrong:
mat.pyx:17:27: Cannot assign type 'double complex *' to 'double *'
That is, you have a double complex pointer from numpy (pointer to complex128 numpy dtype) and you're trying to pass that into the C++ function using double pointers. C++ needs to be able to deal with the complex numbers, so if you change your double* -> std::complex this should fix your problem
void mult(double *M, double *N, double *Q)
becomes
#include <complex>
void mult(std::complex<double> *M, std::complex<double> *N, std::complex<double> *Q)
Does numpy matrix multiply not suffice for your use case? Cython might be overkill.
Edit: Ok I finally got something, there's something a bit weird dealing with C++ std::complex and C double _Complex types.
cppmul.pyx:
import numpy as np
cimport numpy as np
cdef extern from "./matmult.h" nogil:
void mult(np.complex128_t* M, np.complex128_t* N, np.complex128_t* Q)
def sim():
cdef:
np.ndarray[np.complex128_t,ndim=2] N = np.zeros(( 2 , 2 ), dtype=np.complex128)
np.ndarray[np.complex128_t,ndim=2] Q = np.zeros(( 2 , 2 ), dtype=np.complex128)
np.ndarray[np.complex128_t,ndim=2] M = np.zeros(( 2 , 2 ), dtype=np.complex128)
N = np.array([[1.1 + 2j,2.2],[3.3,4.4]])
Q = np.array([[3.3,4.4+5j],[5.5,6.6]])
mult(&M[0,0], &N[0,0], &Q[0,0])
print M
matmul.c:
#include "matmult.h"
void mult(complex_t *M, complex_t *N, complex_t *Q)
{
complex_t P[2][2], A[2][2], B[2][2];
for (int i=0; i<2; i++)
{
for (int j=0; j<2; j++)
{
A[i][j] = *( N + ((2*i) + j) );
B[i][j] = *( Q + ((2*i) + j) );
P[i][j] = 0;
}
}
for (int i=0; i<2; i++)
{
for (int j=0; j<2; j++)
{
for (int k=0; k<2; k++)
{
P[i][j] += A[i][k]*B[k][i];
}
}
}
for (int i=0; i<2; i++)
{
for (int j=0; j<2; j++)
{
*( M + ((2*i) + j) ) = P[i][j];
}
}
}
matmult.h:
#include <complex.h>
typedef double _Complex complex_t;
void mult(complex_t *M, complex_t *N, complex_t *Q);
setup.py:
from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension
import numpy as np
sourcefiles = ['cppmul.pyx', 'matmult.c']
extensions = [Extension("cppmul",
sourcefiles,
include_dirs=[np.get_include()],
extra_compile_args=['-O3']
)]
setup(
ext_modules = cythonize(extensions)
)
after running python setup.py build_ext --inplace it imports and runs as expected
import cppmul
cppmul.sim()
result:
[[15.73 +6.6j 15.73 +6.6j]
[43.56+16.5j 43.56+16.5j]]
try this
#include "matmult.h"
using namespace std;
int main(){}
void mult(double *M, double *N, double *Q)
{
double P[2][2], A[2][2], B[2][2];
for (int i=0; i<2; i++)
{
for (int j=0; j<2; j++)
{
A[i][j] = *( N + ((2*i) + j) );
B[i][j] = *( Q + ((2*i) + j) );
P[i][j] = 0;
}
}
for (int i=0; i<2; i++)
{
for (int j=0; j<2; j++)
{
for (int k=0; k<2; k++)
{
P[i][j] += A[i][k]*B[k][i];
}
}
}
for (int i=0; i<2; i++)
{
for (int j=0; j<2; j++)
{
*( ((2*i) + j) )+ M = P[i][j];
}
}
}
i am using Dev-c++ version 5.11 and running a c++ code.
#include <iostream>
#include <time.h>
using namespace std ;
int main()
{
int start = clock();
for(int x = 0; x <=100000; x++)
{
cout << x << endl;
}
int stop = clock();
cout << endl;
cout << (stop-start)/double(CLOCKS_PER_SEC);
return 0;
}
It is taking 80sec - 75sec on windows 8.1 pavilion g6 2.5Ghz but in python it executes in under 3 seconds how to decrease execution time in c++ code.
Python Code -- (Pycharm python ver.3)
import time as t
num = 100000
start = t.time()
for z in range(0, num):
print(z)
print(t.time() - start)
execution time - 2.95 seconds
I know very little on Python, but I'm quite experienced in C++.
I was looking for an algorithm that would loop through the points in a hexagon pattern and found one written in Python that seems to be exactly what I need. The problem is that I have no idea how to interpret it.
Here is the Python code:
for x in [(n-abs(x-int(n/2))) for x in range(n)]:
for y in range(n-x):
print ' ',
for y in range(x):
print ' * ',
print
I'd show you my attempts but there are like 30 different ones that all failed (which I'm sure are just my bad interpretation).
Hope this helps you.Tested both python and c++ code .Got same results.
#include <iostream>
#include <cmath>
using namespace std;
int main() {
// your code goes here
int n = 10;
int a[n];
for(int j = 0; j < n; j++) {
a[j] = n - abs(j - (n / 2));
}
for(int x = 0; x < n; x++) {
for(int y = 0; y < (n - a[x]); y++) {
std::cout << " " << std::endl;
}
}
for(int k = 0; k < a[n-1]; k++) {
std::cout << " * " << std::endl;
}
std::cout << "i" << std::endl;
return 0;
}
auto out = ostream_iterator<const char*>(cout," ");
for(int i = 0; i != n; ++i) {
auto width = n - abs(i - n/2);
fill_n(out, n-width, " ");
fill_n(out, width, " * ");
cout << "\n";
}
Live demo.
Python live demo for reference.
Some time ago (I can't remember where) I find this python snippet which implents a spigot algorithm for calculating digits of Pi:
def pi_digits():
"""generator for digits of pi"""
q,r,t,k,n,l = 1,0,1,1,3,3
while True:
if 4*q+r-t < n*t:
yield n
q,r,t,k,n,l = (10*q,10*(r-n*t),t,k,(10*(3*q+r))/t-10*n,l)
else:
q,r,t,k,n,l = (q*k,(2*q+r)*l,t*l,k+1,(q*(7*k+2)+r*l)/(t*l),l+2)
digits = pi_digits()
for i in range(30): print digits.next()
Now I wanna implent this in C++. My try was:
#include <cmath>
#include <cstdlib>
#include <iostream>
typedef long long ll;
void help() {
std::cout << "Usage: pi2 <digits>" << std::endl;
exit(1);
}
void pi(const long long digits) {
ll q, r, t, k, n, l;
q=1;
r=0;
t=1;
k=1;
n=3;
l=3;
for(ll i=0; i<digits; ++i) {
if(4*q+r-t < n*t) {
std::cout << n;
q=10*q;
r=10*(r-n*t);
n = ( 10 * ( 3 * q + r) / t ) - 10 * n; //Thanks to maverik
} else {
q=q*k;
r=(2*q+r)*l;
t=t*l;
k=k+1;
n=(q*(7*k+2)+r*l)/(t*l);
l=l+2;
}
}
}
int main(int argc, char** argv) {
if(argc<2) help();
ll digits = 0;
if(digits=atoll(argv[1])<1) help();
pi(digits);
return 0;
}
But it never calls std::cout::operator<<, while the python version works.
Can you help me?
Thanks.
The reason is your code not performing the equivalent calculations in the two languages.
There are(as far as I see) two reasons for this:
In this python code, all the calculations are done at the same time:
q,r,t,k,n,l = (q*k,(2*q+r)*l,t*l,k+1,(q*(7*k+2)+r*l)/(t*l),l+2)
In the C code, the calculations are performed one at a time, so every one uses the result of the previous ones instead of using the old values(like the python code does).
You're using ints in python, and long longs in C.
The division in C code will produce long longs, while those in python(assuming python 2), will produce rounded-down ints.
This could also create miscalculations which can cause your condition to never be true.
P.S.
Implementing this in C from scratch is probably a better idea than porting a python algorithm.
Looks like there should be (according to the python code):
n = ( 10 * ( 3 * q + r) / t ) - 10 * n;
And there:
if ( 4 * q + r - t < n * t) ...
Or I've missed something?