Eigen Matrix vs Numpy Array multiplication performance

Eigen Matrix vs Numpy Array multiplication performance - python

I read in this question that eigen has very good performance. However, I tried to compare eigen MatrixXi multiplication speed vs numpy array multiplication. And numpy performs better (~26 seconds vs. ~29). Is there a more efficient way to do this eigen?
Here is my code:
Numpy:
import numpy as np
import time
n_a_rows = 4000
n_a_cols = 3000
n_b_rows = n_a_cols
n_b_cols = 200
a = np.arange(n_a_rows * n_a_cols).reshape(n_a_rows, n_a_cols)
b = np.arange(n_b_rows * n_b_cols).reshape(n_b_rows, n_b_cols)
start = time.time()
d = np.dot(a, b)
end = time.time()
print "time taken : {}".format(end - start)
Result:
time taken : 25.9291000366
Eigen:
#include <iostream>
#include <Eigen/Dense>
using namespace Eigen;
int main()
{
int n_a_rows = 4000;
int n_a_cols = 3000;
int n_b_rows = n_a_cols;
int n_b_cols = 200;
MatrixXi a(n_a_rows, n_a_cols);
for (int i = 0; i < n_a_rows; ++ i)
for (int j = 0; j < n_a_cols; ++ j)
a (i, j) = n_a_cols * i + j;
MatrixXi b (n_b_rows, n_b_cols);
for (int i = 0; i < n_b_rows; ++ i)
for (int j = 0; j < n_b_cols; ++ j)
b (i, j) = n_b_cols * i + j;
MatrixXi d (n_a_rows, n_b_cols);
clock_t begin = clock();
d = a * b;
clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
std::cout << "Time taken : " << elapsed_secs << std::endl;
}
Result:
Time taken : 29.05
I am using numpy 1.8.1 and eigen 3.2.0-4.

My question has been answered by #Jitse Niesen and #ggael in the comments.
I need to add a flag to turn on the optimizations when compiling: -O2 -DNDEBUG (O is capital o, not zero).
After including this flag, eigen code runs in 0.6 seconds as opposed to ~29 seconds without it.

Change:
a = np.arange(n_a_rows * n_a_cols).reshape(n_a_rows, n_a_cols)
b = np.arange(n_b_rows * n_b_cols).reshape(n_b_rows, n_b_cols)
into:
a = np.arange(n_a_rows * n_a_cols).reshape(n_a_rows, n_a_cols)*1.0
b = np.arange(n_b_rows * n_b_cols).reshape(n_b_rows, n_b_cols)*1.0
This gives factor 100 boost at least at my laptop:
time taken : 11.1231250763
vs:
time taken : 0.124922037125
Unless you really want to multiply integers. In Eigen it is also quicker to multiply double precision numbers (amounts to replacing MatrixXi with MatrixXd three times), but there I see just 1.5 factor: Time taken : 0.555005 vs 0.846788.

Is there a more efficient way to do this eigen?
Whenever you have a matrix multiplication where the matrix on the left side of the = does not also appear on the right side, you can safely tell the compiler that there is no aliasing taking place. This will safe you one unnecessary temporary variable and assignment operation, which for big matrices can make an important difference in performance. This is done with the .noalias() function as follows.
d.noalias() = a * b;
This way a*b is directly evaluated and stored in d. Otherwise, to avoid aliasing problems, the compiler will first store the product into a temporary variable and then assign the this variable to your target matrix d.
So, in your code, the line:
d = a * b;
is actually compiled as follows:
temp = a*b;
d = temp;

Related

How to do computation with a c function on a numpy array of a non square matrix of size `n` times `m` (n≠m)

Edit
I created a similar question which I found more understandable and practical there: How to copy a 2D array (matrix) from python with a C function (and do some computer heavy computation) which return a 2D array (matrix) in python?
Original question
I want to use C in python to perform a computation an all entry of a big non square matrix of size n times m. I copied the code from the excellent tutorial there: https://medium.com/spikelab/calling-c-functions-from-python-104e609f2804. The code there is for a square matrix
I first compiled the c_sum.c script
$ cc -fPIC -shared -o c_sum.so c_sum.c
and then ran the python script:
$ python main.py
and that ran well. However if I set the values of n and m in the main.py to different values, I get a segmentation fault. I guess one has to allocate memory separately for n and m but my knowledge of C is to rudimentary to know how to do it. What would be a code that would work with, let's say, m=3000 and n=2000?
Here are the script c_sum.c:
#include <stdlib.h>
double * c_sum(const double * matrix, int n, int m){
double * results = (double *)malloc(sizeof(double) * n);
int index = 0;
for(int i=0; i< n*m; i+=n){
results[index] = 0;
for(int j=0; j<m; j++){
results[index] += matrix[i+j];
}
index += 1;
}
return results;
}
Here is the main.c script:
# https://medium.com/spikelab/calling-c-functions-from-python-104e609f2804
from ctypes import c_void_p, c_double, c_int, cdll
from numpy.ctypeslib import ndpointer
import numpy as np
import time
import pdb
def py_sum(matrix: np.array, n: int, m: int) -> np.array:
result = np.zeros(n)
for i in range(0, n):
for j in range(0, m):
result[i] += matrix[i][j]
return result
n = 3000
m = 3000
matrix = np.random.randn(n, m)
time1 = time.time()
py_result = py_sum(matrix, n, m)
time2 = time.time() - time1
print("py running time in seconds:", time2)
py_time = time2
lib = cdll.LoadLibrary("c_sum.so")
c_sum = lib.c_sum
c_sum.restype = ndpointer(dtype=c_double,
shape=(n,))
time1 = time.time()
result = c_sum(c_void_p(matrix.ctypes.data),
c_int(n),
c_int(m))
time2 = time.time() - time1
print("c running time in seconds:", time2)
c_time = time2
print("speedup:", py_time/c_time)

I assume you want to compute sum along last axis for a (n,m) matrix. Segmentation fault occurs when you access memory which you have no access. The issue lies in the the erroneous outer loop. You need to iterate over both dimensions but you iterate over same dimension twice.
double * results = (double *)malloc(sizeof(double) * n); /* you allocate n doubles.
Do you free this Outside function? If not, you are having a memory leak.
An alternative way is to pass the output array to function, so that you can avoid creating memory in the function*/
for(int i=0; i< n*m; i+=n){ /* i+=n => you are iterating for m times. also you are iterating over last dimension */
results[index] = 0; /* when index > n ; you are accessing data which
you don't have access leading to segmentation fault */
for(int j=0; j<m; j++) /* you are iterating again over last axis*/
{
results[index] += matrix[i+j];
}
index += 1; /* this leads to index > n as you iterate for m times and m>n in this case.
For a square matrix, m=n, so you don't have any issue */
}
TLDR: To fix the segmentation fault, you need to replace for(int i=0; i< n*m; i+=n) with for(int i=0; i< n*m; i+=m) so that you only iterate for n times and over both dimensions.

Truncation erros in python

Try to calculate:
by storing 1/N and X as float variable. Which result do you get for N=10000, 100000 and 1000000?
Now try to use double variables. Does it change outcome?
In order to do this I wrote this code:
#TRUNCATION ERRORS
import numpy as np #library for numerical calculations
import matplotlib.pyplot as plt #library for plotting purposes
x = 0
n = 10**6
X = []
N = []
for i in range(1,n+1):
x = x + 1/n
item = float(x)
item2 = float(n)
X.append(item)
N.append(item2)
plt.figure() #block for plot purpoes
plt.plot(N,X,marker=".")
plt.xlabel('N')
plt.ylabel('X')
plt.grid()
plt.show()
The output is:
This is wrong because the output should be like that (showed in the lecture):

First, you want to plot N on the x-axis, but you're actually plotting 1/N.
Second, you aren't calculating the expression you think you're calculating. It looks like you're calculating sum_{i=1..N}(1/i).
You need to calculate sum_{i=1..N}(1/N) which is 1/N + 1/N + ... + 1/N repeated N times. In other words, you want to calculate N * 1 / N, which should be equal to 1. Your exercise is showing you that it won't be when you use floating-point math, because reasons.
To do this correctly, let's first define a list of values for N
Nvals = [1, 10, 100, 1000, 10000, 100000, 1000000]
Let's define a function that will calculate our summation for a single value of N:
def calc_sum(N):
total = 0
for i in range(N):
total += 1 / N
return total
Next, let's create an empty list of Xvals and fill it up with the calculated sum for each N
Xvals = []
for N in Nvals:
Xvals.append(calc_sum(N))
or, as a listcomprehension:
Xvals = [calc_sum(N) for N in Nvals]
Now we get this value of Xvals:
[1.0,
0.9999999999999999,
1.0000000000000007,
1.0000000000000007,
0.9999999999999062,
0.9999999999980838,
1.000000000007918]
Clearly, they are not all equal to 1.
You can increase the number of values in Nvals to get a denser plot, but the idea is the same.
Now pay attention to what #khelwood said in their comment:
"float variables" and "double variables" are not a thing in Python. Variables don't have types. And floats are 64 bit
Python floats are all 64-bit floating-point numbers, so you can't do your exercise in python. If you used a language like C or C++ that actually has 32-bit float and 64-bit double types, you'd get something like this:
Try it online
#include <iostream>
float calc_sum_f(int N) {
float total = 0.0;
for (int i = 0; i < N; i++)
total += ((float)1 / N);
return total;
}
double calc_sum_d(int N) {
double total = 0.0;
for (int i = 0; i < N; i++)
total += ((double)1 / N);
return total;
}
int main()
{
int Nvals[7] = { 1, 10, 100, 1000, 10000, 100000, 1000000 };
std::cout << "N\tdouble\tfloat" << std::endl;
for (int ni = 0; ni < 7; ni++) {
int N = Nvals[ni];
double x_d = calc_sum_d(N);
float x_f = calc_sum_f(N);
std::cout << N << "\t" << x_d << "\t" << x_f << std::endl;
}
}
Output:
N double float
1 1 1
10 1 1
100 1 0.999999
1000 1 0.999991
10000 1 1.00005
100000 1 1.00099
1000000 1 1.00904
Here you can see that 32-bit floats don't have enough precision beyond a certain value of N to accurately calculate N * 1 / N. There's no reason the plot should look like your hand-drawn plot, because there's no reason it will decrease consistently as we can evidently see here.
Using numpy Thanks for the suggestion #Kelly to get 32-bit and 64-bit floating point types in python, we can similarly define two functions:
def calc_sum_64(N):
c = np.float64(0)
one_over_n = np.float64(1) / np.float64(N)
for i in range(N):
c += one_over_n
return c
def calc_sum_32(N):
c = np.float32(0)
one_over_n = np.float32(1) / np.float32(N)
for i in range(N):
c += one_over_n
return c
Then, we find Xvals_64 and Xvals_32
Nvals = [10**i for i in range(7)]
Xvals_32 = [calc_sum_32(N) for N in Nvals]
Xvals_64 = [calc_sum_64(N) for N in Nvals]
And we get:
Xvals_32 = [1.0, 1.0000001, 0.99999934, 0.9999907, 1.0000535, 1.0009902, 1.0090389]
Xvals_64 = [1.0,
0.9999999999999999,
1.0000000000000007,
1.0000000000000007,
0.9999999999999062,
0.9999999999980838,
1.000000000007918]
I haven't vectorized my numpy code to make it easier for you to understand what's going on, but Kelly shows a great way to vectorize it to speed up the calculation:
sum(1/N) from i = 1 to N is (1 / N) + (1 / N) + (1 / N) + ... {N times} , which is an array of N ones, divided by N and then summed. You could write the calc_sum_32 and calc_sum_64 functions like so:
def calc_sum_32(N):
return (np.ones((N,), dtype=np.float32) / np.float32(N)).sum()
def calc_sum_64(N):
return (np.ones((N,), dtype=np.float64) / np.float64(N)).sum()
You can then call these functions for every value of N you care about, and get a plot that looks like so, which shows the result oscillating about 1 for float32, but barely any oscillation for float64:

Unusual behaviour of Ant Colony Optimization for Closest String Problem in Python and C++

This is probably going to be a long question, I apologize in advance.
I'm working on a project with the goal of researching different solutions for the closest string problem.
Let s_1, ... s_n be strings of length m. Find a string s of length m such that it minimizes max{d(s, s_i) | i = 1, ..., n}, where d is the hamming distance.
One solution that has been tried is one using ant colony optimization, as decribed here.
The paper itself does not go into implementation details, so I've done my best on efficiency. However, efficiency is not the only unusual behaviour.
I'm not sure whether it's common pratice to do so, but I will present my code through pastebin since I believe it would overwhelm the thread if I should put it directly here. If that turns out to be a problem, I won't mind editing the thread to put it here directly. As all the previous algorithms I've experimented with, I've written this one in python initially. Here's the code:
def solve_(self, problem: CSProblem) -> CSSolution:
m, n, alphabet, strings = problem.m, problem.n, problem.alphabet, problem.strings
A = len(alphabet)
rho = self.config['RHO']
colony_size = self.config['COLONY_SIZE']
global_best_ant = None
global_best_metric = m
ants = np.full((colony_size, m), '')
world_trails = np.full((m, A), 1 / A)
for iteration in range(self.config['MAX_ITERS']):
local_best_ant = None
local_best_metric = m
for ant_idx in range(colony_size):
for next_character_index in range(m):
ants[ant_idx][next_character_index] = random.choices(alphabet, weights=world_trails[next_character_index], k=1)[0]
ant_metric = utils.problem_metric(ants[ant_idx], strings)
if ant_metric < local_best_metric:
local_best_metric = ant_metric
local_best_ant = ants[ant_idx]
# First we perform pheromone evaporation
for i in range(m):
for j in range(A):
world_trails[i][j] = world_trails[i][j] * (1 - rho)
# Now, using the elitist strategy, only the best ant is allowed to update his pheromone trails
best_ant_ys = (alphabet.index(a) for a in local_best_ant)
best_ant_xs = range(m)
for x, y in zip(best_ant_xs, best_ant_ys):
world_trails[x][y] = world_trails[x][y] + (1 - local_best_metric / m)
if local_best_metric < global_best_metric:
global_best_metric = local_best_metric
global_best_ant = local_best_ant
return CSSolution(''.join(global_best_ant), global_best_metric)
The utils.problem_metric function looks like this:
def hamming_distance(s1, s2):
return sum(c1 != c2 for c1, c2 in zip(s1, s2))
def problem_metric(string, references):
return max(hamming_distance(string, r) for r in references)
I've seen that there are a lot more tweaks and other parameters you can add to ACO, but I've kept it simple for now. The configuration I'm using is is 250 iterations, colony size od 10 ants and rho=0.1. The problem that I'm testing it on is from here: http://tcs.informatik.uos.de/research/csp_cssp , the one called 2-10-250-1-0.csp (the first one). The alphabet consists only of '0' and '1', the strings are of length 250, and there are 10 strings in total.
For the ACO configuration that I've mentioned, this problem, using the python solver, gets solved on average in 5 seconds, and the average target function value is 108.55 (simulated 20 times). The correct target function value is 96. Ironically, the 5-second average is good compared to what it used to be in my first attempt of implementing this solution. However, it's still surprisingly slow.
After doing all kinds of optimizations, I've decided to try and implement the exact same solution in C++ so see whether there will be a significant difference between the running times. Here's the C++ solution:
#include <iostream>
#include <vector>
#include <algorithm>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <random>
#include <chrono>
#include <map>
class CSPProblem{
public:
int m;
int n;
std::vector<char> alphabet;
std::vector<std::string> strings;
CSPProblem(int m, int n, std::vector<char> alphabet, std::vector<std::string> strings)
: m(m), n(n), alphabet(alphabet), strings(strings)
{
}
static CSPProblem from_csp(std::string filepath){
std::ifstream file(filepath);
std::string line;
std::vector<std::string> input_lines;
while (std::getline(file, line)){
input_lines.push_back(line);
}
int alphabet_size = std::stoi(input_lines[0]);
int n = std::stoi(input_lines[1]);
int m = std::stoi(input_lines[2]);
std::vector<char> alphabet;
for (int i = 3; i < 3 + alphabet_size; i++){
alphabet.push_back(input_lines[i][0]);
}
std::vector<std::string> strings;
for (int i = 3 + alphabet_size; i < input_lines.size(); i++){
strings.push_back(input_lines[i]);
}
return CSPProblem(m, n, alphabet, strings);
}
int hamm(const std::string& s1, const std::string& s2) const{
int h = 0;
for (int i = 0; i < s1.size(); i++){
if (s1[i] != s2[i])
h++;
}
return h;
}
int measure(const std::string& sol) const{
int mm = 0;
for (const auto& s: strings){
int h = hamm(sol, s);
if (h > mm){
mm = h;
}
}
return mm;
}
friend std::ostream& operator<<(std::ostream& out, CSPProblem problem){
out << "m: " << problem.m << std::endl;
out << "n: " << problem.n << std::endl;
out << "alphabet_size: " << problem.alphabet.size() << std::endl;
out << "alphabet: ";
for (const auto& a: problem.alphabet){
out << a << " ";
}
out << std::endl;
out << "strings:" << std::endl;
for (const auto& s: problem.strings){
out << "\t" << s << std::endl;
}
return out;
}
};
std::random_device rd;
std::mt19937 gen(rd());
int get_from_distrib(const std::vector<float>& weights){
std::discrete_distribution<> d(std::begin(weights), std::end(weights));
return d(gen);
}
int max_iter = 250;
float rho = 0.1f;
int colony_size = 10;
int ant_colony_solver(const CSPProblem& problem){
srand(time(NULL));
int m = problem.m;
int n = problem.n;
auto alphabet = problem.alphabet;
auto strings = problem.strings;
int A = alphabet.size();
float init_pher = 1.0 / A;
std::string global_best_ant;
int global_best_matric = m;
std::vector<std::vector<float>> world_trails(m, std::vector<float>(A, 0.0f));
for (int i = 0; i < m; i++){
for (int j = 0; j < A; j++){
world_trails[i][j] = init_pher;
}
}
std::vector<std::string> ants(colony_size, std::string(m, ' '));
for (int iteration = 0; iteration < max_iter; iteration++){
std::string local_best_ant;
int local_best_metric = m;
for (int ant_idx = 0; ant_idx < colony_size; ant_idx++){
for (int next_character_idx = 0; next_character_idx < m; next_character_idx++){
char next_char = alphabet[get_from_distrib(world_trails[next_character_idx])];
ants[ant_idx][next_character_idx] = next_char;
}
int ant_metric = problem.measure(ants[ant_idx]);
if (ant_metric < local_best_metric){
local_best_metric = ant_metric;
local_best_ant = ants[ant_idx];
}
}
// Evaporation
for (int i = 0; i < m; i++){
for (int j = 0; j < A; j++){
world_trails[i][j] = world_trails[i][j] + (1.0 - rho);
}
}
std::vector<int> best_ant_xs;
for (int i = 0; i < m; i++){
best_ant_xs.push_back(i);
}
std::vector<int> best_ant_ys;
for (const auto& c: local_best_ant){
auto loc = std::find(std::begin(alphabet), std::end(alphabet), c);
int idx = loc- std::begin(alphabet);
best_ant_ys.push_back(idx);
}
for (int i = 0; i < m; i++){
int x = best_ant_xs[i];
int y = best_ant_ys[i];
world_trails[x][y] = world_trails[x][y] + (1.0 - static_cast<float>(local_best_metric) / m);
}
if (local_best_metric < global_best_matric){
global_best_matric = local_best_metric;
global_best_ant = local_best_ant;
}
}
return global_best_matric;
}
int main(){
auto problem = CSPProblem::from_csp("in.csp");
int TRIES = 20;
std::vector<int> times;
std::vector<int> measures;
for (int i = 0; i < TRIES; i++){
auto start = std::chrono::high_resolution_clock::now();
int m = ant_colony_solver(problem);
auto stop = std::chrono::high_resolution_clock::now();
int duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
times.push_back(duration);
measures.push_back(m);
}
float average_time = static_cast<float>(std::accumulate(std::begin(times), std::end(times), 0)) / TRIES;
float average_measure = static_cast<float>(std::accumulate(std::begin(measures), std::end(measures), 0)) / TRIES;
std::cout << "Average running time: " << average_time << std::endl;
std::cout << "Average solution: " << average_measure << std::endl;
std::cout << "all solutions: ";
for (const auto& m: measures) std::cout << m << " ";
std::cout << std::endl;
return 0;
}
The average running time now is only 530.4 miliseconds. However, the average target function value is 122.75, which is significantly higher than that of the python solution.
If the average function values were the same, and the times were as they are, I would simply write this off as 'C++ is faster than python' (even though the difference in speed is also very suspiscious). But, since C++ yields worse solutions, it leads me to believe that I've done something wrong in C++. What I'm suspiscious of is the way I'm generating an alphabet index using weights. In python I've done it using random.choices as follows:
ants[ant_idx][next_character_index] = random.choices(alphabet, weights=world_trails[next_character_index], k=1)[0]
As for C++, I haven't done it in a while so I'm a bit rusty on reading cppreference (which is a skill of its own), and the std::discrete_distribution solution is something I've plain copied from the reference:
std::random_device rd;
std::mt19937 gen(rd());
int get_from_distrib(const std::vector<float>& weights){
std::discrete_distribution<> d(std::begin(weights), std::end(weights));
return d(gen);
}
The suspiscious thing here is the fact that I'm declaring the std::random_device and std::mt19937 objects globally and using the same ones every time. I have not been able to find an answer to whether this is the way they're meant to be used. However, if I put them in the function:
int get_from_distrib(const std::vector<float>& weights){
std::random_device rd;
std::mt19937 gen(rd());
std::discrete_distribution<> d(std::begin(weights), std::end(weights));
return d(gen);
}
the average running time gets significantly worse, clocking in at 8.84 seconds. However, even more surprisingly, the average function value gets worse as well, at 130.
Again, if only one of the two things changed (say if only the time went up) I would have been able to draw some conclusions. This way it only gets more confusing.
So, does anybody have an idea of why this is happening?
Thanks in advance.
MAJOR EDIT: I feel embarrased having asked such a huge question when in fact the problem lies in a simple typo. Namely in the evaporation step in the C++ version I put a + instead of a *.
Now the algorithms behave identically in terms of average solution quality.
However, I could still use some tips on how to optimize the python version.

Apart form the dumb mistake I've mentioned in the question edit, it seems I've finally found a way to optimize the python solution decently. First of all, keeping world_trails and ants as numpy arrays instead of lists of lists actually slowed things down. Furthermore, I actually stopped keeping a list of ants altogether since I only ever need the best one per iteration.
Lastly, running cProfile indicated that a lot of the time was spent on random.choices, therefore I've decided to implement my own version of it suited specifically for this case. I've done this by pre-computing total weight sum per character for each next iteration (in the trail_row_wise_sums array), and using the following function:
def fast_pick(arr, weights, ws):
r = random.random()*ws
for i in range(len(arr)):
if r < weights[i]:
return arr[i]
r -= weights[i]
return 0
The new version now looks like this:
def solve_(self, problem: CSProblem) -> CSSolution:
m, n, alphabet, strings = problem.m, problem.n, problem.alphabet, problem.strings
A = len(alphabet)
rho = self.config['RHO']
colony_size = self.config['COLONY_SIZE']
miters = self.config['MAX_ITERS']
global_best_ant = None
global_best_metric = m
init_pher = 1.0 / A
world_trails = [[init_pher for _ in range(A)] for _ in range(m)]
trail_row_wise_sums = [1.0 for _ in range(m)]
for iteration in tqdm(range(miters)):
local_best_ant = None
local_best_metric = m
for _ in range(colony_size):
ant = ''.join(fast_pick(alphabet, world_trails[next_character_index], trail_row_wise_sums[next_character_index]) for next_character_index in range(m))
ant_metric = utils.problem_metric(ant, strings)
if ant_metric <= local_best_metric:
local_best_metric = ant_metric
local_best_ant = ant
# First we perform pheromone evaporation
for i in range(m):
for j in range(A):
world_trails[i][j] = world_trails[i][j] * (1 - rho)
# Now, using the elitist strategy, only the best ant is allowed to update his pheromone trails
best_ant_ys = (alphabet.index(a) for a in local_best_ant)
best_ant_xs = range(m)
for x, y in zip(best_ant_xs, best_ant_ys):
world_trails[x][y] = world_trails[x][y] + (1 - 1.0*local_best_metric / m)
if local_best_metric < global_best_metric:
global_best_metric = local_best_metric
global_best_ant = local_best_ant
trail_row_wise_sums = [sum(world_trails[i]) for i in range(m)]
return CSSolution(global_best_ant, global_best_metric)
The average running time is now down to 800 miliseconds (compared to 5 seconds that it was before). Granted, applying the same fast_pick optimization to the C++ solution did also speed up the C++ version (around 150 ms) but I guess now I can write it off as C++ being faster than python.
Profiler also showed that a lot of the time was spent on calculating Hamming distances, but that's to be expected, and apart from that I see no other way of computing the Hamming distance between arbitrary strings more efficiently.

Parallelize code which is doing bit wise operation

I have this code to make a 100k*200k matrix occupy less space by packing 8 elements per byte of this AU matrix into A to reduce the memory consumption. This code takes forever to run as you can expect and i am planning on increasing the number of rows to 200k as well. I am running the code on a pretty powerful instance (CPU and GPU)and can scale it so can anyone help parallelize this code so that it is quicker.
import numpy as np
colm = int(2000000/8)
rows = 1000000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
start_time = time.time()
A = np.empty((rows,colm), dtype=np.uint8)
for i in range(A.shape[0]):
for j in range(A.shape[1]):
A[i,j] = 0
for k in range(8):
if AU[i,(j*8)+k] == 1:
A[i,j] = A[i,j] | (1<<(7-k))

Warning: You try to allocate a huge amount of memory: about 2 TB of memory that you probably do not have.
Assuming you have enough memory or you can reduce the size of the dataset, you can write a much much faster implementation using the Numba JIT. Moreover, you can parallelize the code and replace the slow conditional with a branchless implementation to significantly speed up the computation since AU is filled with binary values. Finally, you can unroll the inner loop working on k to make the code even faster. Here is the resulting implementation:
import numpy as np
import numba as nb
colm = int(2000000/8)
rows = 1000000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
A = np.empty((rows,colm), dtype=np.uint8)
#nb.njit('void(uint8[:,:],int8[:,:])', parallel=True)
def compute(A, AU):
for i in nb.prange(A.shape[0]):
for j in range(A.shape[1]):
offset = j * 8
res = AU[i,offset] << 7
res |= AU[i,offset+1] << 6
res |= AU[i,offset+2] << 5
res |= AU[i,offset+3] << 4
res |= AU[i,offset+4] << 3
res |= AU[i,offset+5] << 2
res |= AU[i,offset+6] << 1
res |= AU[i,offset+7]
A[i,j] = res
compute(A, AU)
On my machine, this code is 37851 times faster than the original implementation on a smaller dataset (with colm=int(20000/8) and rows=10000). The original implementation took 6min3s while the optimized one took 9.6ms.
This code is memory bound on my machine. With the current inputs, this code is close to be optimal as it spends most of its time reading the AU input matrix. A good additional optimization would be to "compress" the AU matrix to a smaller one (if possible).

Python solution:
import numpy as np
import time
def compute(A, AU):
A[:,:] = 0
# Put every 8 columns in AU into A
for i in range(A.shape[1]):
A[:, i//8] = np.bitwise_or(A[:, i//8], np.left_shift(AU[:, i], i % 8))
colm = int(20000/8)
rows = 10000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
start_time = time.time()
A = np.empty((rows,colm), dtype=np.uint8)
start_time = time.time()
compute(A, AU)
end_time = time.time()
print(end_time - start_time)
Packs bits in 1/2 second
Same code in C:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main(int argc, char* argv[]) {
int colm = 200000/8;
int rows = 10000;
int cols = colm*8;
unsigned char *A = (unsigned char *)malloc(rows * colm * sizeof(unsigned char));
unsigned char *AU = (unsigned char *)malloc(rows * cols * sizeof(unsigned char));
int i, j;
clock_t begin;
clock_t end;
double time_spent;
begin = clock();
// Create AU
for (i = 0; i < rows; i++)
for (j = 0; j < cols; j++)
*(AU + i*cols + j) = (unsigned char) (rand() & 0x01);
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("%lf seconds to create AU\n", time_spent);
begin = clock();
// Create a zeroed out A
for (i = 0; i < rows; i++)
for (j = 0; j < colm; j++)
*(A + i*colm + j) = (unsigned char) 0;
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("%lf seconds to create A\n", time_spent);
begin = clock();
// Pack into bits
for (i = 0; i < rows; i++)
for (j = 0; j < colm; j++) {
int au_idx = i*cols + j*8;
for (int k=0; k<8; k++)
*(A + i*colm + j) |= *(AU + au_idx + k) << k;
}
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("%lf seconds to pack\n", time_spent);
free(A);
free(AU);
return 0;
}
Tested with colm=200,000. Bit packing takes 0.27 seconds against 0.64 seconds for the optimized Python version provided by Jérôme Richard. Calls to rand() are expensive and greatly increase overall runtime. In terms of memory, the C version peaks at 2GB against Python's 4.2GB. Further code optimization and parallelization would certainly reduce runtime.
Julia version:
using Random
colm = 200000÷8
rows = 30000
cols = colm*8
AU = zeros(UInt8, (rows, cols))
rand!(AU)
AU .&= 0x01
A = zeros(UInt8, (rows, colm))
function compute(A, AU)
for i in 1:size(A)[2]
start_col = (i-1) << 3
#views A[:, i] .= AU[:, start_col + 1] .|
(AU[:, start_col + 2] .<< 1) .|
(AU[:, start_col + 3] .<< 2) .|
(AU[:, start_col + 4] .<< 3) .|
(AU[:, start_col + 5] .<< 4) .|
(AU[:, start_col + 6] .<< 5) .|
(AU[:, start_col + 7] .<< 6) .|
(AU[:, start_col + 8] .<< 7)
end
end
#time compute(A, AU)
Julia scales well in terms of performance. Results with colm=25,000 and rows=30,000:
Language Total Run Time (secs) Bit Packing Time (secs) Peak Memory (GB)
Python 22.1 3.0 6
Julia 11.7 1.2 6

After reading your post initially yesterday I actually planned to write a routine myself with the just-in-time compiler numba but Jérôme was faster than me and delivered you an excellent solution. But I have an alternative to offer: Why reinvent the wheel when there already exists a numpy function which does exactly the same: numpy.packbits.
import numpy as np
A = np.packbits(AU,axis=-1)
does the job. From my tests it seems to be quite a bit slower than Jérôme's version but anyways way faster than your initial version.

C running slower than PyPy

I'm running these two codes. They both perform the same mathematical procedure (calculate series value up to large terms), and also, as expected, produce the same output.
But for some reason, the PyPy code is running significantly faster than the C code.
I cannot figure out why this is happening, as I expected the C code to run faster.
I'd be thankful if anyone could help me by clarifying that (maybe there is a better way to write the C code?)
C code:
#include <stdio.h>
#include <math.h>
int main()
{
double Sum = 0.0;
long n;
for(n = 2; n < 1000000000; n = n + 1) {
double Sign;
Sign = pow(-1.0, n % 2);
double N;
N = (double) n;
double Sqrt;
Sqrt = sqrt(N);
double InvSqrt;
InvSqrt = 1.0 / Sqrt;
double Ln;
Ln = log(N);
double LnSq;
LnSq = pow(Ln, 2.0);
double Term;
Term = Sign * InvSqrt * LnSq;
Sum = Sum + Term;
}
double Coeff;
Coeff = Sum / 2.0;
printf("%0.14f \n", Coeff);
return 0;
}
PyPy code (faster implementation of Python):
from math import log, sqrt
Sum = 0
for n in range(2, 1000000000):
Sum += ((-1)**(n % 2) * (log(n))**2) / sqrt(n)
print(Sum / 2)

This is far from surprising, PyPy does a number of run-time optimizations by default, where as C compilers by default do not perform any optimization. Dave Beazley's 2012 PyCon Keynote covers this pretty explicitly and provides an deep explanation of why this happens.
Per the referenced talk, C should surpass PyPy when compiled with optimization level 2 or 3 (you can watch the full section on the performance of fibonacci generation in cpython, pypy and C starting here).

Additionally to compiler's optimisation level, you can improve your code as well:
int main()
{
double Sum = 0.0;
long n;
for(n = 2; n < 1000000000; ++n)
{
double N = n; // cast is implicit, only for code readability, no effect on runtime!
double Sqrt = sqrt(N);
//double InvSqrt; // spare that:
//InvSqrt = 1.0/Sqrt; // you spare this division with!
double Ln = log(N);
double LnSq;
//LnSq = pow(Ln,2.0);
LnSq = Ln*Ln; // more efficient
double Term;
//Term = Sign * InvSqrt * LnSq;
Term = LnSq / Sqrt;
if(n % 2)
Term = -Term; // just negating, no multiplication
// (IEEE provided: just one bit inverted)
Sum = Sum + Term;
}
// ...
Now we can simplify the code a little more:
int main()
{
double Sum = 0.0;
for(long n = 2; n < 1000000000; ++n)
// ^^^^ possible since C99, better scope, no runtime effect
{
double N = n;
double Ln = log(N);
double Term = Ln * Ln / sqrt(N);
if(n % 2)
Sum -= Term;
else
Sum += Term;
}
// ...

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Eigen Matrix vs Numpy Array multiplication performance - python

My question has been answered by #Jitse Niesen and #ggael in the comments. I need to add a flag to turn on the optimizations when compiling: -O2 -DNDEBUG (O is capital o, not zero). After including this flag, eigen code runs in 0.6 seconds as opposed to ~29 seconds without it.

Related

How to do computation with a c function on a numpy array of a non square matrix of size `n` times `m` (n≠m)

Truncation erros in python

Unusual behaviour of Ant Colony Optimization for Closest String Problem in Python and C++

Parallelize code which is doing bit wise operation

C running slower than PyPy

Categories

Resources