
Experimental support for fault tolerance & asymmetric PGAS

zbeekman released this 27 May 21:05
1.9.0
fdc783d


Experimental failed-image detection

This feature is experimental and requires an MPI implementation that provides certain experimental, proposed MPIX functions and constants. These are present in MPICH 3.2, which is now the default, officially supported MPI back end. Some or all of these features are also available in OpenMPI through the ULFM project. If the build system detects that the required features are present, it enables failed-image support by default.

See the src/tests/unit/fail_images subdirectory for demonstrations of the new support for Fortran 2015 features related to fault-tolerance, including the following:

  • The iso_fortran_env intrinsic module now contains a new stat_failed_image value, which the compiler and runtime library assign to the stat argument of parallel synchronization and communication statements to signal that an image has stopped responding, a scenario considered increasingly likely as computing platforms approach exascale.
  • A new failed_images() function returns an array containing the image numbers of failed images.

Richer support for fault-tolerant execution requires the Fortran 2015 teams feature. This release nonetheless lets users start experimenting with fault tolerance ahead of the anticipated team support.
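The following is a minimal sketch, not taken from the OpenCoarrays test suite, of how a program might react to a failed image after a synchronization. It assumes a compiler and runtime that support stat_failed_image and failed_images() as described above; the module and procedure names (failure_check, checked_sync) are hypothetical. Because stat= is present, an image failure is reported through istat instead of terminating the program.

```fortran
! Hypothetical helper module: synchronize all images and report failures.
module failure_check
  use iso_fortran_env, only: stat_failed_image
  implicit none
contains
  ! Run SYNC ALL with stat= so that a failed image is reported through
  ! istat instead of causing error termination. If istat equals
  ! stat_failed_image, return the image numbers of the failed images;
  ! otherwise return an empty array.
  subroutine checked_sync(istat, failed)
    integer, intent(out) :: istat
    integer, allocatable, intent(out) :: failed(:)

    sync all (stat=istat)
    if (istat == stat_failed_image) then
      failed = failed_images()   ! image numbers of failed images
    else
      allocate(failed(0))        ! no failed image reported
    end if
  end subroutine checked_sync
end module failure_check
```

A caller could inspect `failed` after each call and decide whether to continue on the surviving images or stop cleanly; that policy is application-specific.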

Additional experimental support for derived-type coarrays with allocatable components

This builds on the incomplete implementation, introduced in the 1.8.0 release, of support for derived-type coarrays with allocatable components. Fortran requires that array coarrays have the same shape and bounds on each image. For coarrays of intrinsic type, this implies memory allocations that are identical on every image. With coarrays of derived type, however, allocatable components can have a different size and shape on each image:

type foo
   real, allocatable :: bar(:)
end type
type(foo) :: foobar[*]

which is a powerful enabler when used judiciously in problems that require distributed, non-uniform memory allocations. This feature requires GCC/GFortran 7.1 because it depended on changes to the compiler-side interface. It is still experimental and not yet fully implemented, so we do not yet recommend using it in production.
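As a minimal sketch of such non-uniform allocations, the hypothetical module below declares the foobar coarray from the example and gives each image a bar(:) component whose length equals its image number. Using this_image() as the length is purely illustrative, and the sketch assumes a GFortran 7.1 build with this release. Allocating a component of a coarray is not an image control statement, which is why images are free to choose different sizes.

```fortran
! Hypothetical module: a derived-type coarray whose allocatable
! component has a different length on each image.
module ragged_coarray
  implicit none

  type foo
    real, allocatable :: bar(:)
  end type foo

  ! One scalar coarray of type foo per image (module variables are saved).
  type(foo) :: foobar[*]

contains

  ! Allocate foobar%bar with this_image() elements on the executing
  ! image and fill it with 1.0, 2.0, ..., n.
  subroutine allocate_ragged()
    integer :: n, i

    n = this_image()
    if (allocated(foobar%bar)) deallocate(foobar%bar)
    allocate(foobar%bar(n))
    foobar%bar = [(real(i), i = 1, n)]

    ! Wait until every image has finished allocating and filling its copy.
    sync all
  end subroutine allocate_ragged

end module ragged_coarray
```

After every image calls allocate_ragged(), size(foobar%bar) equals this_image() on each image, so image 1 holds one element and image 4 holds four. Reading another image's data, for example foobar[2]%bar(1), is valid Fortran, but given the experimental status noted above, this release may not handle every such remote access.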

Bug fixes

  • #309 stop statements with numeric and string arguments were not handled correctly. This is now fixed.
  • #342 A maintainer flag was added to enable tests intended only for OpenCoarrays developers. Turn it on by setting the environment variable OPENCOARRAYS_DEVELOPER=TRUE or by enabling the advanced CMake option CAF_RUN_DEVELOPER_TESTS.
  • #354 sync all and sync images statements without stat= did not terminate with an error under certain error conditions. This is now resolved.
  • #376 The CI build matrix was expanded for more complete test coverage using GCC 6 and 7 for compiling the library.
  • #383 cafrun had a typo (missing space) with the -v flag. Thanks to @LaHaine for pointing this out.
  • #384 install.sh did not work on HPC Linux. A new script was added to install OpenCoarrays on HPC Linux.
  • #385 install.sh was not correctly reporting the path to the newly installed CMake under certain circumstances. This is now fixed.
  • #388 Better build system robustness and diagnostics
  • Excessive debug output has been reduced when building the Debug configuration
  • Oversubscription in the test suite has been reduced

Installation

Please see the installation instructions for more details on how to build and install this version of OpenCoarrays.
