[Linux] How MPI processes coordinate, and the buffering of standard output

Introduction

Background

A tweet by Mr. Robota about MPI made me curious: how do the processes coordinate with each other (mainly with respect to aggregating standard output), and how is buffering controlled? So I did a rough investigation.

Caution

This time I have not described the environment etc. in detail, so please verify the behavior yourself rather than taking it at face value. Also, note that the terminology I use is fairly ad hoc.

I tried this on Linux, with three MPI implementations: Intel MPI, SGI MPT, and OpenMPI. Fortran is not covered at all; everything here is about C/C++.

What is standard output buffering?

What is buffering?

First of all, about "buffering".

When a program writes to standard output, it goes through stdout in C and std::cout in C++. However, even when you call printf or std::ostream::operator<<, the output is not necessarily written out immediately. Calling the OS APIs that actually perform the output (write, send, etc.) in many small pieces is generally bad for performance, so the standard library accumulates data in a buffer to some extent and writes it out in one chunk. This behavior is called **buffering**.

Buffering in the standard library

There are three buffering modes:

- unbuffered: output is written out immediately
- line buffered: output is written out when a newline appears (or the buffer fills)
- fully buffered: output is written out only when the buffer fills (or on an explicit flush)

For C stdio, buffering is controlled with setbuf/setvbuf. The default depends on the file/device that standard output is connected to: line buffered for a TTY/PTY, fully buffered otherwise. A flush is performed with the fflush function.

For C++ iostream, control is done by setting or clearing the std::ios_base::unitbuf flag on std::cout (via std::cout.setf or unsetf). If it is set, the stream is effectively unbuffered; if not, it is fully buffered. There is no line-buffered mode. The default is fully buffered. A flush is performed with the I/O manipulators std::flush or std::endl.

So C and C++ are controlled differently, but in mixed C/C++ code it is usually a problem if the output order gets scrambled, so by default the two are synchronized. In other words, even if the C++ side is fully buffered, if the C side is not, the C++ side gets dragged along with it. This behavior can be turned off with std::ios_base::sync_with_stdio(false).

How MPI works

Mechanism overview

Roughly speaking, MPI is a library and a set of tools that, given the nodes that can be used and the number of processes to run, launches the same (or even different) programs across multiple nodes and lets them perform cooperative computation.

The launched programs cooperate over various networks, including high-speed ones such as InfiniBand, but this time we only care about the "flow of standard output". For that, it is enough to consider the three actors shown in the figure below.

(figure: the three actors in an MPI job)

**Using my own, completely arbitrary terminology**, I classify them as:

- front: the process the user launches (mpirun etc.), which produces the final output
- manager: a process on each node that launches the workers and relays their output
- worker: the MPI application processes themselves

In the figure above, the front and the other nodes are drawn as if they run on different nodes, but they may in fact be the same node.

Differences by MPI

Intel MPI

First is Intel MPI. It looks like the following figure.

(figure: Intel MPI process layout)

The roles and connections are as follows.

After startup, the manager connects to the front over TCP/IP, aggregates the output piped from the workers, and forwards it to the front.

SGI MPT

Next is SGI MPT.

(figure: SGI MPT process layout)

The roles and connections are as follows.

The division of roles is similar to Intel MPI, but launching the manager from the front requires a daemon called arrayd (even on the local node), which ships with SGI MPT.

OpenMPI

Finally, OpenMPI.

(figure: OpenMPI process layout)

The roles and connections are as follows.

The big difference from the two MPIs above, apart from the handling of the local node, is that the channel between manager and workers is a PTY.

Buffering standard output in MPI

Reorganization of output path

Now let us look at what happens when output from an MPI program is aggregated and finally written out by the front.

As organized above, three kinds of programs (front, manager, and worker) cooperate when an MPI job runs, and **worker output is aggregated to the front via the manager**. Buffering therefore needs to be considered separately for each segment of the route. That is, there are three places:

- worker → manager
- manager → front
- front → final output destination

Differences in buffering for each MPI

Intel MPI

In the case of Intel MPI, the output of the workers gets mixed together even in the middle of a line, so buffering appears to be disabled.

This is because **the MPI library internally calls setbuf/setvbuf during MPI_Init to put the workers into an unbuffered state**. In other words, the worker → manager segment has buffering disabled, and the manager → front and front → final output segments just pass data through without any particular control, so buffering appears to be disabled overall.

Therefore, you can re-enable buffering by calling setbuf/setvbuf again after MPI_Init. Also, neither MPI_Init nor MPI::Init seems to touch the flags of std::cout, so in a pure C++ application you can alternatively enable buffering by disabling the C/C++ synchronization.

Incidentally, the Intel MPI reference documents an -ordered-output option for exactly this situation: "Use this option to avoid intermingling of data output from the MPI processes. This option affects both the standard output and the standard error streams. NOTE: When using this option, end the last output line of each process with the end-of-line '\n' character. Otherwise the application may stop responding."

SGI MPT

In the case of SGI MPT, the output is organized line by line, which is equivalent to line-buffered behavior.

The mechanism behind this is a bit complicated.

In short, the line-by-line assembly is achieved by the front's own effort. Conversely, this design may reflect a wish not to have the MPI application (worker) side tamper with the buffer settings on its own.

OpenMPI

Like SGI MPT, OpenMPI behaves as if line buffered.

The mechanism here is very simple: the channel between worker and manager is a PTY, and stdio's default for a PTY is line buffered. The manager → front and front → final output segments do not appear to do anything special. In other words, OpenMPI itself does nothing in particular about buffering control; it is all left to the standard library.

Summary

So far we have seen how the buffering control differs between the MPI implementations.

Although each MPI differs, if you want to make sure buffering is in effect, I think the safest approach is to call setbuf/setvbuf immediately after MPI_Init.

Reference

Below are the source code and the session log from trying out the behavior with Intel MPI, for reference.

Operation log


$ cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core) 
$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2018 Update 1 Build 20171011 (id: 17941)
Copyright (C) 2003-2017, Intel Corporation. All rights reserved.
$ icpc --version
icpc (ICC) 18.0.1 20171018
Copyright (C) 1985-2017 Intel Corporation.  All rights reserved.

$ mpiicpc -std=gnu++11 -o test test.cpp
$ mpirun -np 2 ./test
abababababababababbababababababababababa

abababababababababababababababababababab
a
bbabaababababababababababababababababab
a
bababbaababababbaababababababababababab

babaabababababababababababababababababab

$ mpirun -np 2 ./test --nosync
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
$ mpirun -np 2 ./test --setvbuf
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
$ mpirun -np 2 ./test --nosync --unitbuf
abababbaabababababababababababababababab
a
bababababababababababababababababababab

babababababababababababababababababababa

ababababbabababababababababababababababa

abababababababababababababababababababab

$ mpiicpc -std=gnu++11 -o test2 test2.cpp
$ mpirun -np 2 ./test2
abababababbaababababababbaababababababab

babaabababbaabababababababababababababab

babababababababababababababababababababa

ababababbaabababababbabaabababababababab
a
bababababababababababababababababababab

$ mpirun -np 2 ./test2 -f
aaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbb
$ mpirun -np 2 ./test2 -l
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbb
$ 

test.cpp


#include <mpi.h>
#include <iostream>
#include <thread>
#include <string>
#include <cstdio>
#include <chrono>   // std::chrono::milliseconds for sleep_for

static char stdoutbuf[8192];

int main(int argc, char **argv) {
  MPI::Init(argc,argv);
  MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_THROW_EXCEPTIONS);
  int rank = MPI::COMM_WORLD.Get_rank();

  for ( int i=1; i<argc; i++ ) {
    std::string opt(argv[i]);
    if ( opt == "--nosync" ) {
      // detach C++-iostream from C-stdio
      std::ios_base::sync_with_stdio(false);
    }
    else if ( opt == "--setvbuf" ) {
      // re-setvbuf for C-stdio
      std::setvbuf(stdout,stdoutbuf,_IOFBF,sizeof(stdoutbuf));
    }
    else if ( opt == "--unitbuf" ) {
      // disable buffering on C++-iostream
      std::cout.setf(std::ios_base::unitbuf);
    }
    else if ( rank == 0 ) {
      std::cerr << "invalid option: " << opt << std::endl;
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
  }

  char c='a'+rank;
  for ( int i=0; i<5; i++ ) {
    MPI::COMM_WORLD.Barrier();
    for ( int j=0; j<20; j++ ) {
      std::cout << c;
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    std::cout << std::endl;
  }
  MPI::Finalize();
}

test2.cpp


#include <mpi.h>
#include <iostream>
#include <thread>
#include <string>
#include <cstdio>
#include <chrono>   // std::chrono::milliseconds for sleep_for

static char stdoutbuf[8192];

int main(int argc, char **argv) {
  MPI::Init(argc,argv);
  MPI::COMM_WORLD.Set_errhandler(MPI::ERRORS_THROW_EXCEPTIONS);
  int rank = MPI::COMM_WORLD.Get_rank();

  if ( argc > 1 ) {
    std::string opt(argv[1]);
    if ( opt == "-f" ) {
      // full buffered
      std::setvbuf(stdout,stdoutbuf,_IOFBF,sizeof(stdoutbuf));
    }
    else if ( opt == "-l" ) {
      // line buffered
      std::setvbuf(stdout,stdoutbuf,_IOLBF,sizeof(stdoutbuf));
    }
  }

  char c='a'+rank;
  for ( int i=0; i<5; i++ ) {
    MPI::COMM_WORLD.Barrier();
    for ( int j=0; j<20; j++ ) {
      std::cout << c;
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    std::cout << '\n';
  }
  std::cout << std::flush;
  MPI::Finalize();
}
