Eigen + MKL or OpenBLAS slower than Numpy/Scipy + OpenBLAS - python

I'm starting with C++ at the moment and want to work with matrices and speed things up in general. I worked with Python + NumPy + OpenBLAS before.
I thought C++ + Eigen + MKL might be faster, or at least not slower.
My C++ code:
#define EIGEN_USE_MKL_ALL
#include <iostream>
#include <Eigen/Dense>
#include <Eigen/LU>
#include <chrono>
using namespace std;
using namespace Eigen;
int main()
{
int n = Eigen::nbThreads( );
cout << "#Threads: " << n << endl;
uint16_t size = 4000;
MatrixXd a = MatrixXd::Random(size,size);
clock_t start = clock ();
PartialPivLU<MatrixXd> lu = PartialPivLU<MatrixXd>(a);
float timeElapsed = double( clock() - start ) / CLOCKS_PER_SEC;
cout << "Elasped time is " << timeElapsed << " seconds." << endl ;
}
My Python code:
import numpy as np
from time import time
from scipy import linalg as la
size = 4000
A = np.random.random((size, size))
t = time()
LU, piv = la.lu_factor(A)
print(time()-t)
My timings:
C++ 2.4s
Python 1.2s
Why is C++ slower than Python?
I am compiling the C++ code using:
g++ main.cpp -o main -lopenblas -O3 -fopenmp -DMKL_LP64 -I/usr/local/include/mkl/include
MKL is definitely working: if I disable it, the running time is around 13 s.
I also tried C++ + OpenBLAS, which gives me around 2.4 s as well.
Any ideas why C++ and Eigen are slower than NumPy/SciPy?

The timing is just wrong. That's a typical symptom of comparing wall-clock time against CPU time: Python's time.time() measures wall-clock time, while clock() in the C++ code measures CPU time, summed over all threads. When I use system_clock from the <chrono> header instead, it “magically” becomes faster.
#define EIGEN_USE_MKL_ALL
#include <iostream>
#include <Eigen/Dense>
#include <Eigen/LU>
#include <chrono>
int main()
{
int const n = Eigen::nbThreads( );
std::cout << "#Threads: " << n << std::endl;
int const size = 4000;
Eigen::MatrixXd a = Eigen::MatrixXd::Random(size,size);
auto start = std::chrono::system_clock::now();
Eigen::PartialPivLU<Eigen::MatrixXd> lu(a);
auto stop = std::chrono::system_clock::now();
std::cout << "Elasped time is "
<< std::chrono::duration<double>{stop - start}.count()
<< " seconds." << std::endl;
}
I compile with
icc -O3 -mkl -std=c++11 -DNDEBUG -I/usr/include/eigen3/ test.cpp
and get the output
#Threads: 1
Elapsed time is 0.295782 seconds.
Your Python version reports 0.399146080017 on my machine.
Alternatively, to obtain comparable timing you could use time.clock() (CPU time) in Python instead of time.time() (wall clock time).

This is not a fair comparison. The Python routine is operating on float precision while the C++ code needs to crunch doubles. This exactly doubles the computation time.
>>> type(np.random.random_sample())
<type 'float'>
You should compare with MatrixXf instead of MatrixXd and your MKL code should be equally fast.
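For reference, a minimal sketch of that single-precision variant, reusing the chrono-based timing from the first answer (drop the EIGEN_USE_MKL_ALL define if you want to try plain Eigen; PartialPivLU works on MatrixXf exactly as on MatrixXd):
#define EIGEN_USE_MKL_ALL
#include <iostream>
#include <chrono>
#include <Eigen/Dense>
#include <Eigen/LU>

int main()
{
    const int size = 4000;
    // Single-precision matrix, so the factorization runs on floats,
    // matching a float32 NumPy array.
    Eigen::MatrixXf a = Eigen::MatrixXf::Random(size, size);

    auto start = std::chrono::system_clock::now();
    Eigen::PartialPivLU<Eigen::MatrixXf> lu(a);
    auto stop = std::chrono::system_clock::now();

    std::cout << "Elapsed time is "
              << std::chrono::duration<double>{stop - start}.count()
              << " seconds." << std::endl;
}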

Related

Why is writing to I2C too slow in Python comparing to C?

I am trying to find out why the same code runs 25 times slower from Python than from C, even though I use CDLL, when writing to I2C. Below I describe in detail what I am doing, step by step.
The version of Raspberry PI: Raspberry PI 3 Model B
OS: Raspbian Buster Lite Version:July 2019
GCC version: gcc (Raspbian 8.3.0-6+rpi1) 8.3.0
Python version: Python 3.7.3
The device I am working with through I2C is an MCP23017. All I do is write 0 and 1 to pin B0. Here is my code written in C:
// test1.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/i2c-dev.h>
#include <time.h>
int init() {
int fd = open("/dev/i2c-1", O_RDWR);
ioctl(fd, I2C_SLAVE, 0x20);
return fd;
}
void deinit(int fd) {
close(fd);
}
void makewrite(int fd, int v) {
char buffer[2] = { 0x13, 0x00 };
buffer[1] = v;
write(fd, buffer, 2);
}
void mytest() {
clock_t tb, te;
int n = 1000;
int fd = init();
tb = clock();
int v = 1;
for (int i = 0; i < n; i++) {
makewrite(fd, v);
v = 1 - v;
}
te = clock();
printf("Time: %.3lf ms\n", (double)(te - tb) / n / CLOCKS_PER_SEC * 1e3);
deinit(fd);
}
int main() {
mytest();
return 0;
}
I compile and run it with the command:
gcc test1.c -o test1 && ./test1
It gives me the result:
pi@raspberrypi:~/dev/i2c_example $ gcc test1.c -o test1 && ./test1
Time: 0.020 ms
I may conclude that writing to the pin takes 0.02 milliseconds.
After that I create SO-file to be able to access the written functions from my Python script:
gcc -c -fPIC test1.c -o test1.o && gcc test1.o -shared -o test1.so
And my Python script to test:
# test1.py
import ctypes
from time import time
test1so = ctypes.CDLL("/home/pi/dev/i2c_example/test1.so")
test1so.mytest()
n = 1000
fd = test1so.init()
tb = time()
v = 1
for _ in range(n):
test1so.makewrite(fd, v)
v = 1 - v
te = time()
print("Time: {:.3f} ms".format((te - tb) / n * 1e3))
test1so.deinit(fd)
This provides me the result:
pi@raspberrypi:~/dev/i2c_example $ python test1.py
Time: 0.021 ms
Time: 0.516 ms
I cannot understand why the call of makewrite is 25 times slower from Python even though I am actually calling the same C code. I also found that if I comment out write(fd, buffer, 2); in test1.c, or change fd to 1, the times reported by the C and Python programs are comparable; there is no such huge difference.
// in test1.c
write(fd, buffer, 2); -> write(1, buffer, 2);
Running C-program:
pi@raspberrypi:~/dev/i2c_example $ gcc test1.c -o test1 && ./test1
...Time: 0.012 ms
Running Python-program:
pi@raspberrypi:~/dev/i2c_example $ python3 test1.py
...Time: 0.009 ms
...Time: 0.021 ms
This confused me a lot. Can anybody tell me why this happens, and how I can improve the performance of I2C access from Python using the C shared library?
Summary:
Descriptor: 1 (stdout)
Execution time of makewrite in C purely: 0.009 ms
Execution time of makewrite in C called as C-DLL function from Python: 0.021 ms
This result is expected. The difference is not that large, and can be explained by the Python loop and its statements being less efficient than their C counterparts, which increases the execution time.
Descriptor: I2C
Execution time of makewrite in C purely: 0.021 ms
Execution time of makewrite in C called as DLL function from Python: 0.516 ms
After switching the file descriptor to I2C, the execution time in pure C increased by around 0.012 ms, so I would expect the time when calling from Python to be 0.021 ms + 0.012 ms = 0.033 ms, because all my changes are inside makewrite and Python should not be affected by this internal detail (it is packed away in the .so file). But I get 0.516 ms instead of 0.033 ms, which confuses me.

How to divide a binary file to 6-byte blocks in C++ or Python with fast speed? [closed]

I’m reading a file in C++ and Python as a binary file. I need to divide the binary into blocks, each 6 bytes. For example, if my file is 600 bytes, the result should be 100 blocks, each 6 bytes.
I have tried struct (in C++ and Python) and array (Python). Neither of them divides the binary into 6-byte blocks; they only divide it into blocks whose size is a power of two (1, 2, 4, 8, 16, etc.).
The array approach was very fast, reading 1 GB of binary data in under a second as 4-byte blocks. The other methods I tried were extremely slow, taking tens of minutes for just a few megabytes.
How can I read the binary as blocks of 6 bytes as fast as possible? Any help in either C++ or Python will be great. Thank you.
EDIT - The Code:
struct Block
{
char data[6];
};
class BinaryData
{
private:
char data[6];
public:
BinaryData() {};
~BinaryData() {};
void readBinaryFile(string strFile)
{
Block block;
ifstream binaryFile;
int size = 0;
binaryFile.open(strFile, ios::out | ios::binary);
binaryFile.seekg(0, ios::end);
size = (int)binaryFile.tellg();
binaryFile.seekg(0, ios::beg);
cout << size << endl;
while ( (int)binaryFile.tellg() < size )
{
cout << binaryFile.tellg() << " , " << size << " , " <<
size - (int)binaryFile.tellg() << endl;
binaryFile.read((char*)block.data,sizeof(block.data));
cout << block.data << endl;
//cin >> block.data;
if (size - (int)binaryFile.tellg() > size)
{
break;
}
}
binaryFile.close();
}
};
Notes:
the numbers in the file are stored in big-endian order
the goal is to read them as fast as possible and then sort them in ascending order
Let's start simple, then optimize.
Simple Loop
uint8_t array1[6];
while (my_file.read((char *) &array1[0], 6))
{
Process_Block(&array1[0]);
}
The above code reads in a file, 6 bytes at a time and sends the block to a function.
Meets the requirements, not very optimal.
Reading Larger Blocks
Files are streaming devices. They have an overhead to start streaming, but are very efficient to keep streaming. In other words, we want to read as much data per transaction to reduce the overhead.
static const unsigned int CAPACITY = 6 * 1024;
uint8_t block1[CAPACITY];
while (my_file.read((char *) &block1[0], CAPACITY) || my_file.gcount() > 0)
{
const size_t bytes_read = my_file.gcount(); // may be less than CAPACITY on the final read
size_t blocks_read = bytes_read / 6; // not const: decremented below
uint8_t const * block_pointer = &block1[0];
while (blocks_read > 0)
{
Process_Block(block_pointer);
block_pointer += 6;
--blocks_read;
}
}
The above code reads up to 1024 blocks in one transaction. After reading, each block is sent to a function for processing.
This version is more efficient than the Simple Loop, as it reads more data per transaction. Adjust the CAPACITY to find the optimal size on your platform.
Loop Unrolling
The previous code reduces the first bottleneck of input transfer speed (although there is still room for optimization). Another technique is to reduce the overhead of the processing loop by performing more data processing inside the loop. This is called loop unrolling.
const size_t bytes_read = my_file.gcount();
size_t blocks_read = bytes_read / 6; // not const: decremented below
uint8_t const * block_pointer = &block1[0];
while ((blocks_read / 4) != 0)
{
Process_Block(block_pointer);
block_pointer += 6;
Process_Block(block_pointer);
block_pointer += 6;
Process_Block(block_pointer);
block_pointer += 6;
Process_Block(block_pointer);
block_pointer += 6;
blocks_read -= 4;
}
while (blocks_read > 0)
{
Process_Block(block_pointer);
block_pointer += 6;
--blocks_read;
}
You can adjust the quantity of operations in the loop, to see how it affects your program's speed.
Multi-Threading & Multiple Buffers
Another two techniques for speeding up the reading of the data, are to use multiple threads and multiple buffers.
One thread, an input thread, reads the file into a buffer. After reading into the first buffer, the thread sets a semaphore indicating there is data to process. The input thread reads into the next buffer. This repeats until the data is all read. (For a challenge, figure out how to reuse the buffers and notify the other thread of which buffers are available).
The second thread is the processing thread. This processing thread is started first and waits for the first buffer to be completely read. After the buffer has the data, the processing thread starts processing the data. After the first buffer has been processed, the processing thread starts on the next buffer. This repeats until all the buffers have been processed.
The goal here is to use as many buffers as necessary to keep the processing thread running and not waiting.
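A hedged sketch of that two-thread, two-buffer idea, under a few assumptions: the input file name, the CAPACITY, and Process_Block are placeholders, and a mutex plus condition variable stands in for the semaphore mentioned above; it is one possible implementation, not the only one:
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <mutex>
#include <thread>
#include <vector>

static void Process_Block(const uint8_t *block)
{
    (void)block; // placeholder for whatever per-block work you do
}

struct Buffer
{
    std::vector<uint8_t> data;
    std::size_t bytes = 0; // valid bytes placed by the input thread
    bool ready = false;    // true once the input thread has filled it
    bool done = false;     // true when the input thread has hit end of file
};

int main()
{
    const std::size_t CAPACITY = 6 * 1024;
    std::ifstream my_file("input.bin", std::ios::binary); // assumed file name
    Buffer buffers[2];
    buffers[0].data.resize(CAPACITY);
    buffers[1].data.resize(CAPACITY);
    std::mutex m;
    std::condition_variable cv;

    // Input thread: alternately fills buffer 0 and buffer 1.
    std::thread reader([&] {
        int index = 0;
        for (;;)
        {
            Buffer &buf = buffers[index];
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !buf.ready; }); // wait until the buffer is free
            lock.unlock();
            my_file.read(reinterpret_cast<char *>(buf.data.data()), CAPACITY);
            const std::size_t n = static_cast<std::size_t>(my_file.gcount());
            lock.lock();
            buf.bytes = n;
            buf.done = (n == 0);
            buf.ready = true;
            lock.unlock();
            cv.notify_all();
            if (n == 0)
                break;
            index = 1 - index;
        }
    });

    // Processing thread (here the main thread) consumes the buffers in the same order.
    int index = 0;
    for (;;)
    {
        Buffer &buf = buffers[index];
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return buf.ready; }); // wait until the buffer is filled
        const bool done = buf.done;
        const std::size_t n = buf.bytes;
        lock.unlock();
        const uint8_t *p = buf.data.data();
        for (std::size_t i = 0; i + 6 <= n; i += 6, p += 6)
            Process_Block(p);
        lock.lock();
        buf.ready = false; // hand the buffer back to the input thread
        lock.unlock();
        cv.notify_all();
        if (done)
            break;
        index = 1 - index;
    }
    reader.join();
}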
Edit 1: Other techniques
Memory Mapped Files
Some operating systems support memory mapped files. The OS reads a portion of the file into memory. When a location outside the memory is accessed, the OS loads another portion into memory. Whether this technique improves performance needs to be measured (profiled).
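For measuring it, here is a hedged POSIX-only sketch of the memory-mapped approach (the byte sum is just a stand-in for whatever per-block processing you actually need):
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2)
    {
        std::cerr << "Usage: " << argv[0] << " <file>" << std::endl;
        return 1;
    }
    const int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    const size_t size = static_cast<size_t>(st.st_size);

    // Map the whole file read-only; the OS pages it in on demand.
    void *mapped = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) { perror("mmap"); return 1; }

    const uint8_t *p = static_cast<const uint8_t *>(mapped);
    const size_t blocks = size / 6;
    uint64_t sum = 0; // stand-in for real processing of each 6-byte block
    for (size_t b = 0; b < blocks; ++b, p += 6)
        for (int j = 0; j < 6; ++j)
            sum += p[j];

    std::cout << blocks << " blocks, byte sum " << sum << std::endl;
    munmap(mapped, size);
    close(fd);
}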
Parallel Processing & Threading
Adding multiple threads may show negligible performance gain. Computers have a data bus (data highway) connecting many hardware devices, including memory, file I/O and the processor. Devices will be paused to let other devices use the data highway. With multiple cores or processors, one processor may have to wait while the other processor is using the data highway. This waiting may cause negligible performance gain when using multiple threads or parallel processing. Also, the operating system has overhead when constructing and maintaining threads.
Try this: the input file is given as an argument to the program. As you said, I assume the 6-byte values in the file are written in big-endian order, but I make no assumption about the machine running the program; it works on both little-endian and big-endian hosts (the case is checked at runtime).
#include <iostream>
#include <fstream>
#include <vector>
#include <cstdint>
#include <algorithm>
#include <limits.h> // CHAR_BIT
using namespace std;
#if CHAR_BIT != 8
# error that code supposes a char has 8 bits
#endif
int main(int argc, char ** argv)
{
if (argc != 2)
cerr << "Usage: " << argv[1] << " <file>" << endl;
else {
ifstream in(argv[1], ios::binary);
if (!in.is_open())
cerr << "Cannot open " << argv[1] << endl;
else {
in.seekg(0, ios::end);
size_t n = (size_t) in.tellg() / 6;
vector<uint64_t> values(n);
uint64_t * p = values.data(); // for performance
uint64_t * psup = p + n;
in.seekg(0, ios::beg);
int i = 1;
if (*((char *) &i)) {
// little endian
unsigned char s[6];
uint64_t v = 0;
while (p != psup) {
if (!in.read((char *) s, 6))
return -1;
((char *) &v)[0] = s[5];
((char *) &v)[1] = s[4];
((char *) &v)[2] = s[3];
((char *) &v)[3] = s[2];
((char *) &v)[4] = s[1];
((char *) &v)[5] = s[0];
*p++ = v;
}
}
else {
// big endian
uint64_t v = 0;
while (p != psup) {
if (!in.read(((char *) &v) + 2, 6))
return -1;
*p++ = v;
}
}
cout << "file successfully read" << endl;
sort(values.begin(), values.end());
cout << "values sort" << endl;
// DEBUG, DO ON A SMALL FILE ;-)
for (auto v : values)
cout << v << endl;
}
}
}

MPI Bcast or Scatter to specific ranks

I have some array of data. What I was trying to do is like this:
Use rank 0 to bcast data to 50 nodes. Each node has one MPI process on it, with 16 cores available to that process. Each MPI process then calls Python multiprocessing; some calculations are done, and the MPI process saves the data that was calculated with multiprocessing. The MPI process then changes some variable and runs multiprocessing again, and so on.
So the nodes do not need to communicate with each other besides the initial startup in which they all receive some data.
The multiprocessing is not working out so well, so now I want to use MPI for everything.
How can I (or is it even possible to) use an array of integers that refers to MPI ranks for bcast or scatter? For example, with ranks 1-1000 and 12 cores per node, I want to bcast the data to every 12th rank, and then have each of those ranks scatter data to the ranks on its own node (ranks 12n+1 to 12n+12).
This requires the first bcast to communicate with totalrank/12 processes; each of those ranks is then responsible for sending data to the ranks on the same node, gathering the results, saving them, and then sending more data to the ranks on the same node.
I don't know enough mpi4py to give you a code sample with it, but here is what could be a solution in C++. I'm sure you can easily infer the Python code from it.
#include <mpi.h>
#include <iostream>
#include <cstdlib> /// for abs
#include <zlib.h> /// for crc32
using namespace std;
int main( int argc, char *argv[] ) {
MPI_Init( &argc, &argv );
// get size and rank
int rank, size;
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
// get the compute node name
char name[MPI_MAX_PROCESSOR_NAME];
int len;
MPI_Get_processor_name( name, &len );
// get a unique positive int from each node name
// using crc32 from zlib (just a possible solution)
uLong crc = crc32( 0L, Z_NULL, 0 );
int color = crc32( crc, ( const unsigned char* )name, len );
color = abs( color );
// split the communicator into processes of the same node
MPI_Comm nodeComm;
MPI_Comm_split( MPI_COMM_WORLD, color, rank, &nodeComm );
// get the rank on the node
int nodeRank;
MPI_Comm_rank( nodeComm, &nodeRank );
// create comms of processes of the same local ranks
MPI_Comm peersComm;
MPI_Comm_split( MPI_COMM_WORLD, nodeRank, rank, &peersComm );
// now, masters are all the processes of nodeRank 0
// they can communicate among them with the peersComm
// and with their local slaves with the nodeComm
int worktoDo = 0;
if ( rank == 0 ) worktoDo = 1000;
cout << "Initially [" << rank << "] on node "
<< name << " has " << worktoDo << endl;
MPI_Bcast( &worktoDo, 1, MPI_INT, 0, peersComm );
cout << "After first Bcast [" << rank << "] on node "
<< name << " has " << worktoDo << endl;
if ( nodeRank == 0 ) worktoDo += rank;
MPI_Bcast( &worktoDo, 1, MPI_INT, 0, nodeComm );
cout << "After second Bcast [" << rank << "] on node "
<< name << " has " << worktoDo << endl;
// cleaning up
MPI_Comm_free( &peersComm );
MPI_Comm_free( &nodeComm );
MPI_Finalize();
return 0;
}
As you can see, you first create communicators containing the processes on the same node. Then you create peer communicators containing all processes with the same local rank across nodes.
From there, your master process of global rank 0 sends the data to the local masters, and they distribute the work on the node they are responsible for.
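If the payload is an array rather than a single int, the same nodeComm can also be used with MPI_Scatter for the node-local distribution step. A hedged fragment, assumed to be placed inside main() after the second Bcast above (the element counts are purely illustrative, and <vector> would need to be included):
// Assumes every node runs the same number of processes.
int nodeSize;
MPI_Comm_size( nodeComm, &nodeSize );

const int perRank = 4;                              // elements each process receives
std::vector<int> sendBuf;                           // only filled on the local master
if ( nodeRank == 0 )
    sendBuf.resize( perRank * nodeSize, worktoDo ); // e.g. data received via peersComm

std::vector<int> recvBuf( perRank );
MPI_Scatter( sendBuf.data(), perRank, MPI_INT,
             recvBuf.data(), perRank, MPI_INT,
             0, nodeComm );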

Detect the sequence of blinking lights

I'm looking for an example or just a starting point to achieve the following:
Using Python and OpenCV I want to detect a sequence of blinking lights, i.e. on off on off = match.
Is this possible, and could someone start by showing me a simple example? I'm hoping I can learn from it; I learn better from examples and cannot find any that achieve this sort of functionality.
If the light source is very prominent in your image you can use the mean intensity of the image for detecting changes.
Here is a very simple example. I use this video for testing.
You probably need to adjust the thresholds for your video.
If your video is not as simple as the one I used for testing you might need to make some adjustments. For example, you could try to segment the light source first if there is too much distraction in the other parts of the image. Or, if the intensity changes between consecutive images are not big enough, you might need to look at the changes over several images.
Edit:
I just saw the question was tagged with python, but my source code is C++. But I leave it for now. Maybe it helps you to get the general idea so you can port it to python yourself.
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>
using namespace cv;
int main(int argc, char** argv)
{
VideoCapture capture(argv[1]);
Mat frame;
if( !capture.isOpened() )
throw "Error when reading video";
double lastNorm = 0.0;
int lastCounter = 0;
int counter = 0;
int currentState = 0;
namedWindow( "w", 1);
for( ; ; )
{
capture >> frame;
if (frame.empty())
break; // end of video, otherwise imshow would be called on an empty frame
imshow("w", frame);
double currentNorm = norm(frame);
double diffNorm = currentNorm - lastNorm;
if (diffNorm > 20000 && currentState == 0)
{
currentState = 1;
std::cout << "on - was off for " << counter - lastCounter << " frames" << std::endl;
lastCounter = counter;
}
if (diffNorm < -20000 && currentState == 1)
{
currentState = 0;
std::cout << "off - was on for " << counter - lastCounter << " frames" << std::endl;
lastCounter = counter;
}
waitKey(20); // waits to display frame
lastNorm = currentNorm;
counter++;
}
waitKey(0); // key press to close window
// releases and window destroy are automatic in C++ interface
}
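If you would rather key off the mean intensity mentioned at the top than the frame norm, here is a hedged standalone variant using the same headers as above; the threshold of 5.0 is a guess and needs tuning for your video:
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>

int main(int argc, char** argv)
{
    if (argc != 2)
        return -1;
    cv::VideoCapture capture(argv[1]);
    if (!capture.isOpened())
        return -1;

    cv::Mat frame;
    double lastMean = 0.0;
    int currentState = 0;
    for (int counter = 0; ; ++counter)
    {
        capture >> frame;
        if (frame.empty())
            break; // end of video
        cv::Scalar channelMeans = cv::mean(frame); // per-channel mean intensity
        double currentMean = (channelMeans[0] + channelMeans[1] + channelMeans[2]) / 3.0;
        double diffMean = currentMean - lastMean;
        if (diffMean > 5.0 && currentState == 0)
        {
            currentState = 1;
            std::cout << "on at frame " << counter << std::endl;
        }
        if (diffMean < -5.0 && currentState == 1)
        {
            currentState = 0;
            std::cout << "off at frame " << counter << std::endl;
        }
        lastMean = currentMean;
    }
}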

md5 a string multiple times get different result on different platform

t.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/md5.h>
static char* unsigned_to_signed_char(const unsigned char* in , int len) {
char* res = (char*)malloc(len * 2 + 1);
int i = 0;
memset(res , 0 , len * 2 + 1);
while(i < len) {
sprintf(res + i * 2 , "%02x" , in[i]);
i ++;
};
return res;
}
static unsigned char * md5(const unsigned char * in) {
MD5_CTX ctx;
unsigned char * result1 = (unsigned char *)malloc(MD5_DIGEST_LENGTH);
MD5_Init(&ctx);
printf("len: %lu \n", strlen(in));
MD5_Update(&ctx, in, strlen(in));
MD5_Final(result1, &ctx);
return result1;
}
int main(int argc, char *argv[])
{
const char * i = "abcdef";
unsigned char * data = (unsigned char *)malloc(strlen(i) + 1);
strncpy(data, i, strlen(i));
unsigned char * result1 = md5(data);
free(data);
printf("%s\n", unsigned_to_signed_char(result1, MD5_DIGEST_LENGTH));
unsigned char * result2 = md5(result1);
free(result1);
printf("%s\n", unsigned_to_signed_char(result2, MD5_DIGEST_LENGTH));
unsigned char * result3 = md5(result2);
free(result2);
printf("%s\n", unsigned_to_signed_char(result3, MD5_DIGEST_LENGTH));
return 0;
}
makefile
all:
cc t.c -Wall -L/usr/local/lib -lcrypto
and t.py
#!/usr/bin/env python
import hashlib
import binascii
src = 'abcdef'
a = hashlib.md5(src).digest()
b = hashlib.md5(a).digest()
c = hashlib.md5(b).hexdigest().upper()
print binascii.b2a_hex(a)
print binascii.b2a_hex(b)
print c
The results of the Python script on Debian 6 x86 and MacOS 10.6 are the same:
e80b5017098950fc58aad83c8c14978e
b91282813df47352f7fe2c0c1fe9e5bd
85E4FBD1BD400329009162A8023E1E4B
The C version on MacOS prints:
len: 6
e80b5017098950fc58aad83c8c14978e
len: 48
eac9eaa9a4e5673c5d3773d7a3108c18
len: 64
73f83fa79e53e9415446c66802a0383f
Why is it different from Debian 6?
Debian environment:
gcc (Debian 4.4.5-8) 4.4.5
Python 2.6.6
Linux shuge-lab 2.6.26-2-686 #1 SMP Thu Nov 25 01:53:57 UTC 2010 i686 GNU/Linux
OpenSSL was installed from testing repository.
MacOS environment:
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5666) (dot 3)
Python 2.7.1
Darwin Lees-Box.local 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386
OpenSSL was installed from MacPort.
openssl #1.0.0d (devel, security)
OpenSSL SSL/TLS cryptography library
I think you are allocating exactly enough bytes for the MD5 result, without a terminating \0. You then calculate the MD5 of a block of memory that starts with the result of the previous MD5 calculation but is followed by some random bytes. You should allocate one more byte for the result and set it to \0.
My proposal:
...
unsigned char * result1 = (unsigned char *)malloc(MD5_DIGEST_LENGTH + 1);
result1[MD5_DIGEST_LENGTH] = 0;
...
The answers so far don't seem to me to have stated the issue clearly enough. Specifically, the problem is the line:
MD5_Update(&ctx, in, strlen(in));
The data block you pass in is not '\0' terminated, so strlen() may report a length beyond the end of the MD5_DIGEST_LENGTH-byte buffer, and the update call will then process those extra bytes. In short, stop using strlen() to work out the length of an arbitrary buffer of bytes: you know how long the buffers are supposed to be, so pass the length around.
You don't '\0' terminate the string you're passing to md5 (which I suppose takes a '\0' terminated string, since you don't pass it the length). The code
memset( data, 0, sizeof( strlen( i ) ) );
memcpy( data, i, strlen( i ) );
is completely broken: sizeof( strlen( i ) ) is the same as sizeof( size_t ), 4 or 8 on typical machines. But you don't want the memset anyway. Try replacing these with:
strcpy( data, i );
Or better yet:
std::string i( "abcdef" );
then pass i.c_str() to md5 (and declare md5 to take a char const*). (I'd use a std::vector<unsigned char> in md5() as well, and have it return it. And unsigned_to_signed_char would take the std::vector<unsigned char> and return std::string.)
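A hedged sketch of what that suggestion could look like, using the same OpenSSL MD5_* calls as the question (deprecated in OpenSSL 3.0 but still available) and passing explicit lengths everywhere, so intermediate digests, which are raw bytes rather than strings, are handled correctly:
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>
#include <openssl/md5.h>

// Hash an explicit byte range; no strlen() anywhere.
static std::vector<unsigned char> md5(const std::vector<unsigned char> &in)
{
    std::vector<unsigned char> out(MD5_DIGEST_LENGTH);
    MD5_CTX ctx;
    MD5_Init(&ctx);
    MD5_Update(&ctx, in.data(), in.size());
    MD5_Final(out.data(), &ctx);
    return out;
}

static std::string to_hex(const std::vector<unsigned char> &in)
{
    std::string res;
    char buf[3];
    for (unsigned char c : in) {
        std::snprintf(buf, sizeof buf, "%02x", c);
        res += buf;
    }
    return res;
}

int main()
{
    const std::string src = "abcdef";
    std::vector<unsigned char> input(src.begin(), src.end());
    std::vector<unsigned char> a = md5(input);
    std::vector<unsigned char> b = md5(a);
    std::vector<unsigned char> c = md5(b);
    std::cout << to_hex(a) << "\n" << to_hex(b) << "\n" << to_hex(c) << std::endl;
}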
