MPI Bcast or Scatter to specific ranks - python

I have some array of data. What I was trying to do is like this:
Use rank 0 to bcast data to 50 nodes. Each node has 1 MPI process on it with 16 cores available to that process. Then, each MPI process will call Python multiprocessing. Some calculations are done, then the MPI process saves the data that was calculated with multiprocessing. The MPI process then changes some variable and runs multiprocessing again, and so on.
So the nodes do not need to communicate with each other besides the initial startup in which they all receive some data.
The multiprocessing is not working out so well, so now I want to use MPI for everything.
How can I (or is it even possible to) use an array of integers that refers to MPI ranks for bcast or scatter? For example, with ranks 1-1000 and 12 cores per node, I want to bcast the data to every 12th rank. Then I want every 12th rank to scatter data to ranks 12th+1 through 12th+12.
This requires the first bcast to communicate with totalrank/12 ranks; then each of those ranks is responsible for sending data to the ranks on the same node, gathering the results, saving them, and then sending more data to the ranks on the same node.

I don't know enough mpi4py to give you a code sample with it, but here is what could be a solution in C++. I'm sure you can easily infer the Python equivalent from it.
#include <mpi.h>
#include <iostream>
#include <cstdlib> // for abs
#include <zlib.h>  // for crc32
using namespace std;

int main( int argc, char *argv[] ) {
    MPI_Init( &argc, &argv );

    // get size and rank
    int rank, size;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    // get the compute node name
    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name( name, &len );

    // get a unique positive int from each node name
    // using crc32 from zlib (just a possible solution)
    uLong crc = crc32( 0L, Z_NULL, 0 );
    int color = crc32( crc, ( const unsigned char* )name, len );
    color = abs( color );

    // split the communicator into processes of the same node
    MPI_Comm nodeComm;
    MPI_Comm_split( MPI_COMM_WORLD, color, rank, &nodeComm );

    // get the rank on the node
    int nodeRank;
    MPI_Comm_rank( nodeComm, &nodeRank );

    // create comms of processes of the same local ranks
    MPI_Comm peersComm;
    MPI_Comm_split( MPI_COMM_WORLD, nodeRank, rank, &peersComm );

    // now, masters are all the processes of nodeRank 0
    // they can communicate among them with the peersComm
    // and with their local slaves with the nodeComm
    int worktoDo = 0;
    if ( rank == 0 ) worktoDo = 1000;
    cout << "Initially [" << rank << "] on node "
         << name << " has " << worktoDo << endl;

    MPI_Bcast( &worktoDo, 1, MPI_INT, 0, peersComm );
    cout << "After first Bcast [" << rank << "] on node "
         << name << " has " << worktoDo << endl;

    if ( nodeRank == 0 ) worktoDo += rank;
    MPI_Bcast( &worktoDo, 1, MPI_INT, 0, nodeComm );
    cout << "After second Bcast [" << rank << "] on node "
         << name << " has " << worktoDo << endl;

    // cleaning up
    MPI_Comm_free( &peersComm );
    MPI_Comm_free( &nodeComm );

    MPI_Finalize();
    return 0;
}
As you can see, you first create communicators with the processes on the same node. Then you create peer communicators with all processes of the same local rank on each node.
From there, your master process of global rank 0 will send data to the local masters, and they will distribute the work on the node they are responsible for.
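Since the question is about mpi4py, here is a rough sketch of the same structure in Python. It is only an illustration of the idea above, not tested code: the crc32-based color is just one way of getting a per-node integer (as in the C++ version, hash collisions are theoretically possible), and the variable names are mine.
from mpi4py import MPI
import zlib

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# derive a non-negative int from the node name (same crc32 trick as above)
name = MPI.Get_processor_name()
color = zlib.crc32(name.encode()) & 0x7FFFFFFF

# communicator grouping the processes running on the same node
node_comm = comm.Split(color, rank)
node_rank = node_comm.Get_rank()

# communicator grouping the processes with the same local rank ("peers")
peers_comm = comm.Split(node_rank, rank)

work_to_do = 1000 if rank == 0 else 0
# global master -> local masters (the nodeRank-0 processes)
work_to_do = peers_comm.bcast(work_to_do, root=0)
# each local master -> the other processes on its node
work_to_do = node_comm.bcast(work_to_do, root=0)

print("[%d] on node %s has %d" % (rank, name, work_to_do))

peers_comm.Free()
node_comm.Free()
If your MPI library supports MPI-3, comm.Split_type(MPI.COMM_TYPE_SHARED) should give you node_comm directly, without hashing the processor name.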

Related

Convert reinterpret_cast from C++ to Python

I have this piece of code in C++ that I want to convert to Python:
#include <iostream>
#include <cstdint>

int main() {
    char buf[21] = "211000026850KBAALHAA";
    std::cout << reinterpret_cast<const uint32_t*>(buf+12) << "\n";
    const uint32_t *num = reinterpret_cast<const uint32_t*>(buf+12);
    std::cout << num[0] << " " << num[1] << "\n";
}
which prints 1094795851 1094797388. I want to write a similar function to have input of KBAALHAA and output of 1094795851 1094797388. I cannot seem to find the equivalent of reinterpret_cast function in Python.
The unpack_from function from the struct module is perfect for this.
https://docs.python.org/3.9/library/struct.html#struct.unpack_from
Unpack from buffer starting at position offset, according to the format string format. The result is a tuple even if it contains exactly one item. The buffer’s size in bytes, starting at position offset, must be at least the size required by the format, as reflected by calcsize().
The format string used here is II, I being a 4-byte unsigned int (in native byte order; use "<II" or ">II" if you need to force little- or big-endian interpretation).
import struct
string = b"211000026850KBAALHAA"
num = struct.unpack_from("II", string, 12)
print(num[0], num[1])
Output:
1094795851 1094797388

String manipulation: Multiply letters like in Python

In Python you are able to do things such as
word = "e" * 5
print(word)
To get
"eeeee"
But when I attempt the same thing in C++ I get issues where the output doesn't contain any text. Here's the code I'm attempting:
playerInfo.name + ('_' * (20 - sizeof(playerInfo.name)))
I'm trying to balance the length of the string so everything on the player list lines up.
Thanks in advance for any help.
If your actual problem is that you want to display names to a certain width, then don't modify the underlying data. Instead, take advantage of ostream's formatting capabilities to set alignment and fill width. Underlying data should not be modified to cater to display. The displaying function should be able to take the underlying data and format it as required.
This is taken from https://en.cppreference.com/w/cpp/io/manip/left which describes specifically the std::left function, but shows examples of std::setw and std::setfill, which should get you what you want. You will need to #include <iomanip> to use these functions.
#include <iostream>
#include <iomanip>

int main(int argc, char** argv)
{
    const char name1[] = "Yogesh";
    const char name2[] = "John";
    std::cout << "|" << std::setfill(' ') << std::setw(10) << std::left << name1 << "|\n";
    std::cout << "|" << std::setfill('*') << std::setw(10) << std::right << name2 << "|\n";
}
Outputs
|Yogesh    |
|******John|
A note on the persistence of std::cout and ostreams
Note that std::cout is a std::ostream object, and by default lives for the lifetime of your program (or for enough of your program that it's close enough to the lifetime). As an object, it has member variables. When we call std::setfill('*') we're setting one of those member variables (the fill character) and overwriting the default fill character. When we call std::setw(10) we're setting the underlying width of the stream until another function clears it.
std::setfill, std::left, and std::right will persist until you explicitly set them to something else (they don't return to defaults automatically). std::setw will persist until one of the following is used (from https://en.cppreference.com/w/cpp/io/manip/setw):
operator<<(basic_ostream&, char) and operator<<(basic_ostream&, char*)
operator<<(basic_ostream&, basic_string&)
std::put_money (inside money_put::put())
std::quoted (when used with an output stream)
So std::setw will persist until basically the next std::string or const char * output.
In case of repeating a single character, you can use std::string(size_type count, CharT ch) like this:
std::string str(5, 'e');
std::cout << str << std::endl; // eeeee
There is no overloaded * operator for std::string. You have to write a custom function:
std::string multiply_str(std::string in, size_t count)
{
    std::string ret;
    ret.reserve(in.size() * count);
    for (size_t i = 0; i < count; i++)
        ret += in;
    return ret;
}
Or, if it is only one character:
std::string(how_many, your_char_here)

How to divide a binary file to 6-byte blocks in C++ or Python with fast speed? [closed]

I’m reading a file in C++ and Python as a binary file. I need to divide the binary into blocks, each 6 bytes. For example, if my file is 600 bytes, the result should be 100 blocks, each 6 bytes.
I have tried struct (in C++ and Python) and array (Python). None of them divide the binary into blocks of 6 bytes; they can only divide it into blocks whose size is a power of two (1, 2, 4, 8, 16, etc.).
The array algorithm was very fast, reading 1 GB of binary data in less than a second as blocks of 4 bytes. In contrast, I used some other methods, but all of them are extremely slow, taking tens of minutes to do it for a few megabytes.
How can I read the binary as blocks of 6 bytes as fast as possible? Any help in either C++ or Python will be great. Thank you.
EDIT - The Code:
struct Block
{
    char data[6];
};

class BinaryData
{
private:
    char data[6];

public:
    BinaryData() {};
    ~BinaryData() {};

    void readBinaryFile(string strFile)
    {
        Block block;
        ifstream binaryFile;
        int size = 0;

        binaryFile.open(strFile, ios::out | ios::binary);
        binaryFile.seekg(0, ios::end);
        size = (int)binaryFile.tellg();
        binaryFile.seekg(0, ios::beg);
        cout << size << endl;

        while ( (int)binaryFile.tellg() < size )
        {
            cout << binaryFile.tellg() << " , " << size << " , " <<
                    size - (int)binaryFile.tellg() << endl;
            binaryFile.read((char*)block.data, sizeof(block.data));
            cout << block.data << endl;
            //cin >> block.data;
            if (size - (int)binaryFile.tellg() > size)
            {
                break;
            }
        }
        binaryFile.close();
    }
};
Notes:
in the file, the numbers are stored in big-endian order
the goal is to read them as fast as possible, then sort them in ascending order
Let's start simple, then optimize.
Simple Loop
uint8_t array1[6];
while (my_file.read((char *) &array1[0], 6))
{
    Process_Block(&array1[0]);
}
The above code reads in a file, 6 bytes at a time and sends the block to a function.
Meets the requirements, not very optimal.
Reading Larger Blocks
Files are streaming devices. They have an overhead to start streaming, but are very efficient to keep streaming. In other words, we want to read as much data as possible per transaction to reduce the overhead.
static const unsigned int CAPACITY = 6 * 1024;
uint8_t block1[CAPACITY];
while (my_file)
{
    my_file.read((char *) &block1[0], CAPACITY);
    const size_t bytes_read = my_file.gcount();   // also counts a short final chunk
    if (bytes_read == 0)
        break;
    size_t blocks_read = bytes_read / 6;          // not const: counted down below
    uint8_t const * block_pointer = &block1[0];
    while (blocks_read > 0)
    {
        Process_Block(block_pointer);
        block_pointer += 6;
        --blocks_read;
    }
}
The above code reads up to 1024 blocks in one transaction. After reading, each block is sent to a function for processing.
This version is more efficient than the Simple Loop, as it reads more data per transaction. Adjust the CAPACITY to find the optimal size on your platform.
Loop Unrolling
The previous code reduces the first bottleneck of input transfer speed (although there is still room for optimization). Another technique is to reduce the overhead of the processing loop by performing more data processing inside the loop. This is called loop unrolling.
const size_t bytes_read = my_file.gcount();
size_t blocks_read = bytes_read / 6;              // not const: counted down below
uint8_t const * block_pointer = &block1[0];

while ((blocks_read / 4) != 0)
{
    Process_Block(block_pointer);
    block_pointer += 6;
    Process_Block(block_pointer);
    block_pointer += 6;
    Process_Block(block_pointer);
    block_pointer += 6;
    Process_Block(block_pointer);
    block_pointer += 6;
    blocks_read -= 4;
}
while (blocks_read > 0)
{
    Process_Block(block_pointer);
    block_pointer += 6;
    --blocks_read;
}
You can adjust the quantity of operations in the loop, to see how it affects your program's speed.
Multi-Threading & Multiple Buffers
Another two techniques for speeding up the reading of the data, are to use multiple threads and multiple buffers.
One thread, an input thread, reads the file into a buffer. After reading into the first buffer, the thread sets a semaphore indicating there is data to process. The input thread reads into the next buffer. This repeats until the data is all read. (For a challenge, figure out how to reuse the buffers and notify the other thread of which buffers are available).
The second thread is the processing thread. This processing thread is started first and waits for the first buffer to be completely read. After the buffer has the data, the processing thread starts processing the data. After the first buffer has been processed, the processing thread starts on the next buffer. This repeats until all the buffers have been processed.
The goal here is to use as many buffers as necessary to keep the processing thread running and not waiting.
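To make the idea concrete on the Python side of the question, here is a rough sketch of the reader/processor pair using the standard threading and queue modules. The file name, chunk size and process_block are placeholders, and a bounded queue stands in for the semaphore-plus-buffer-pool bookkeeping described above.
import queue
import threading

CHUNK = 6 * 1024                      # many 6-byte blocks per read
chunks = queue.Queue(maxsize=8)       # bounded, so the reader cannot run far ahead

def reader(path):
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            chunks.put(data)
    chunks.put(None)                  # sentinel: nothing more to read

def process_block(block):
    pass                              # placeholder for the real per-block work

def processor():
    while True:
        data = chunks.get()
        if data is None:
            break
        usable = len(data) - len(data) % 6
        for i in range(0, usable, 6):
            process_block(data[i:i + 6])

t_proc = threading.Thread(target=processor)
t_read = threading.Thread(target=reader, args=("data.bin",))   # hypothetical file name
t_proc.start()
t_read.start()
t_read.join()
t_proc.join()
In CPython the reading thread releases the GIL while blocked in f.read, so I/O and processing can overlap even with threads; whether that actually wins anything for your workload has to be measured.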
Edit 1: Other techniques
Memory Mapped Files
Some operating systems support memory mapped files. The OS reads a portion of the file into memory. When a location outside the memory is accessed, the OS loads another portion into memory. Whether this technique improves performance needs to be measured (profiled).
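For reference, a minimal sketch of the memory-mapped variant in Python, using the standard mmap module (the file name is a placeholder; as said above, whether this beats plain buffered reads has to be measured):
import mmap

with open("data.bin", "rb") as f:                            # hypothetical file name
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        n_blocks = len(mm) // 6
        for i in range(n_blocks):
            block = mm[i * 6:(i + 1) * 6]                    # a 6-byte bytes object
            # ... process block ...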
Parallel Processing & Threading
Adding multiple threads may show negligible performance gain. Computers have a data bus (data highway) connecting many hardware devices, including memory, file I/O and the processor. Devices are paused to let other devices use the data highway. With multiple cores or processors, one processor may have to wait while another processor is using the data highway; because of this waiting, multiple threads or parallel processing may bring only a negligible performance gain. The operating system also has overhead for constructing and maintaining threads.
Try this. The input file is given as an argument to the program. As you said, I assume the 6-byte values in the file are written in big-endian order, but I make no assumption about the machine that reads and sorts them; the program works on both little-endian and big-endian hosts (the case is checked at execution time).
#include <iostream>
#include <fstream>
#include <vector>
#include <cstdint>
#include <algorithm>
#include <limits.h> // CHAR_BIT
using namespace std;

#if CHAR_BIT != 8
# error that code supposes a char has 8 bits
#endif

int main(int argc, char ** argv)
{
    if (argc != 2)
        cerr << "Usage: " << argv[0] << " <file>" << endl;
    else {
        ifstream in(argv[1], ios::binary);

        if (!in.is_open())
            cerr << "Cannot open " << argv[1] << endl;
        else {
            in.seekg(0, ios::end);
            size_t n = (size_t) in.tellg() / 6;
            vector<uint64_t> values(n);
            uint64_t * p = values.data(); // for performance
            uint64_t * psup = p + n;

            in.seekg(0, ios::beg);

            int i = 1;
            if (*((char *) &i)) {
                // little endian
                unsigned char s[6];
                uint64_t v = 0;

                while (p != psup) {
                    if (!in.read((char *) s, 6))
                        return -1;
                    ((char *) &v)[0] = s[5];
                    ((char *) &v)[1] = s[4];
                    ((char *) &v)[2] = s[3];
                    ((char *) &v)[3] = s[2];
                    ((char *) &v)[4] = s[1];
                    ((char *) &v)[5] = s[0];
                    *p++ = v;
                }
            }
            else {
                // big endian
                uint64_t v = 0;

                while (p != psup) {
                    if (!in.read(((char *) &v) + 2, 6))
                        return -1;
                    *p++ = v;
                }
            }

            cout << "file successfully read" << endl;

            sort(values.begin(), values.end());
            cout << "values sorted" << endl;

            // DEBUG, ONLY DO THIS ON A SMALL FILE ;-)
            for (auto v : values)
                cout << v << endl;
        }
    }
}
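For the Python half of the question, here is one possible approach (a sketch only, assuming NumPy is acceptable): read the whole file as raw bytes, left-pad every big-endian 6-byte block to 8 bytes, and reinterpret the buffer as unsigned 64-bit integers so they can be sorted in a single vectorized pass.
import numpy as np

raw = np.fromfile("data.bin", dtype=np.uint8)        # hypothetical file name
raw = raw[:raw.size - raw.size % 6]                  # drop any trailing partial block
blocks = raw.reshape(-1, 6)

# left-pad each 6-byte big-endian value with two zero bytes so that every row
# is the big-endian representation of an 8-byte unsigned integer
padded = np.zeros((blocks.shape[0], 8), dtype=np.uint8)
padded[:, 2:] = blocks

# reinterpret the rows as big-endian uint64, convert to native order, sort
values = padded.view(">u8").ravel().astype(np.uint64)
values = np.sort(values)
print(values[:10])                                   # peek at the smallest values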

Read pipe (C/C++), no error, but not all data

In a C++ program, I want to fetch some data that a python program can easily provide. The C++ program invokes popen(), reads the data (a serialized protobuf) and continues on. This worked fine but has recently begun to fail with a shorter string received than sent.
I am trying to understand why I am not reading what I have written (despite no error reported) and how to generate further hypotheses. Fwiw, this is on linux (64 bit) and both processes are local. Python is 2.7.
(It's true the data size has gotten large (now 17MB where once 500 KB), but this should not lead to failure, although it's a sure signal I need to make some changes for the sake of efficiency.)
On the python side, I compute a dict of group_id mapping to group (a RegistrationProgress, cf. below):
payload = RegistrationProgressArray()
for group_id, group in groups.items():
    payload.group.add().CopyFrom(group)
payload.num_entries = len(groups)
print('{a}, {p}'.format(a=len(groups), p=len(payload.group)),
      file=sys.stderr)
print(payload.SerializeToString())
print('size={s}'.format(s=len(payload.SerializeToString())),
      file=sys.stderr)
Note that a and p match (correctly!) on the python side. The size will be about 17MB. On the C++ side,
string FetchProtoFromXXXXX<string>(const string& command_name) {
    ostringstream fetch_command;
    fetch_command << /* ... */ ;
    if (GetMode(kVerbose)) {
        cout << "FetchProtoFromXXXXX()" << endl;
        cout << endl << fetch_command.str() << endl << endl;
    }
    FILE* fp = popen(fetch_command.str().c_str(), "r");
    if (!fp) {
        perror(command_name.c_str());
        return "";
    }
    // There is, sadly, no even remotely portable way to create an
    // ifstream from a FILE* or a file descriptor. So we do this the
    // C way, which is of course just fine.
    const int kBufferSize = 1 << 16;
    char c_buffer[kBufferSize];
    ostringstream buffer;
    while (!feof(fp) && !ferror(fp)) {
        size_t bytes_read = fread(c_buffer, 1, kBufferSize, fp);
        if (bytes_read < kBufferSize && ferror(fp)) {
            perror("FetchProtoFromXXXXX() failed");
            // Can we even continue? Let's try, but expect that it
            // may set us up for future sadness when the protobuf
            // isn't readable.
        }
        buffer << c_buffer;
    }
    if (feof(fp) && GetMode(kVerbose)) {
        cout << "Read EOF from pipe" << endl;
    }
    int ret = pclose(fp);
    const string out_buffer(buffer.str());
    if (ret || GetMode(kVerbose)) {
        cout << "Pipe closed with exit status " << ret << endl;
        cout << "Read " << out_buffer.size() << " bytes." << endl;
    }
    return out_buffer;
}
The size will be about 144KB.
The protobuf I'm sending looks like this. The num_entries was a bit of paranoia, since it should be the same as group_size() which is the same as group().size().
message RegistrationProgress { ... }

message RegistrationProgressArray {
    required int32 num_entries = 1;
    repeated RegistrationProgress group = 2;
}
Then what I run is
array = FetchProtoFromXXXXX("my_command.py");
cout << "size=" << array.num_entries() << endl;
if (array.num_entries() != array.group_size()) {
    cout << "Something is wrong: array.num_entries() == "
         << array.num_entries()
         << " != array.group_size() == " << array.group_size()
         << " " << array.group().size()
         << endl;
    throw MyExceptionType();
}
and the output of running it is
122, 122
size=17106774
Read EOF from pipe
Pipe closed with exit status 0
Read 144831 bytes.
size=122
Something is wrong: array.num_entries() == 122 != array.focus_group_size() == 1 1
Inspecting the deserialized protobuf, it appears that group is an array of length one containing only the first element of the array I expected.
This...
buffer << c_buffer;
...requires that c_buffer contain ASCIIZ content, but in your case you're not NUL-terminating it.
Instead, make sure the exact number of bytes read are captured (even if there are embedded NULs):
buffer.write(c_buffer, bytes_read);
You catenate each chunk to the output buffer with this:
buffer << c_buffer;
As Tony D explains in his answer, you do not null-terminate c_buffer before doing so, and you invoke undefined behavior if c_buffer does not contain embedded null characters.
Conversely, if c_buffer does contain embedded null characters, portions of the stream are stripped and ignored.
Are you sure the streaming protocol does not contain embedded '\0' bytes?
You should also read Why is “while ( !feof (file) )” always wrong? although in your case, I don't think this is causing your problem.

Detect the sequence of blinking lights

I'm looking for an example or just a starting point to achieve the following:
Using Python openCV I want to detect the sequence of blinking lights. i.e. on off on off = match
Is this possible, and could someone start by showing me a simple example? I'm hoping I can learn from this. I learn better by examples and cannot find any that achieve this sort of functionality.
If the light source is very prominent in your image, you can use the mean intensity of the image to detect changes.
Here is a very simple example. I use this video for testing.
You probably need to adjust the thresholds for your video.
If your video is not as simple as the one I used for testing, you might need to make some adjustments. For example, you could try to segment the light source first if there is too much distraction in the other parts of the image. Or, if the intensity changes between consecutive frames are not big enough, you might need to look at the changes over several frames.
Edit:
I just saw that the question was tagged with python, but my source code is C++. I'll leave it for now; maybe it helps you get the general idea so you can port it to Python yourself.
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>

using namespace cv;

int main(int argc, char** argv)
{
    VideoCapture capture(argv[1]);
    Mat frame;

    if( !capture.isOpened() )
        throw "Error when reading video";

    double lastNorm = 0.0;
    int lastCounter = 0;
    int counter = 0;
    int currentState = 0;

    namedWindow( "w", 1);
    for( ; ; )
    {
        capture >> frame;
        if (frame.empty())      // stop at the end of the video
            break;
        imshow("w", frame);

        double currentNorm = norm(frame);
        double diffNorm = currentNorm - lastNorm;

        if (diffNorm > 20000 && currentState == 0)
        {
            currentState = 1;
            std::cout << "on - was off for " << counter - lastCounter << " frames" << std::endl;
            lastCounter = counter;
        }
        if (diffNorm < -20000 && currentState == 1)
        {
            currentState = 0;
            std::cout << "off - was on for " << counter - lastCounter << " frames" << std::endl;
            lastCounter = counter;
        }

        waitKey(20); // waits to display frame
        lastNorm = currentNorm;
        counter++;
    }
    waitKey(0); // key press to close window
    // releases and window destroy are automatic in C++ interface
}
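Since the question is tagged python, here is a rough port of the C++ code above using cv2. It is a sketch rather than tested code: the video path is a placeholder, and the 20000 threshold, as noted above, will need tuning for your own footage.
import cv2

capture = cv2.VideoCapture("blinking_light.mp4")     # hypothetical video file
if not capture.isOpened():
    raise IOError("Error when reading video")

last_norm = 0.0
last_counter = 0
counter = 0
current_state = 0                                    # 0 = light off, 1 = light on

while True:
    ok, frame = capture.read()
    if not ok:                                       # end of video
        break
    cv2.imshow("w", frame)

    current_norm = cv2.norm(frame)                   # L2 norm, like cv::norm above
    diff_norm = current_norm - last_norm

    if diff_norm > 20000 and current_state == 0:
        current_state = 1
        print("on - was off for", counter - last_counter, "frames")
        last_counter = counter
    elif diff_norm < -20000 and current_state == 1:
        current_state = 0
        print("off - was on for", counter - last_counter, "frames")
        last_counter = counter

    cv2.waitKey(20)                                  # waits to display the frame
    last_norm = current_norm
    counter += 1

capture.release()
cv2.destroyAllWindows()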
