To manipulate entire structures in one step, it is useful to have a collection of operations that perform such manipulations. These global operations may be very problem-dependent, but certain ones have been found to be widely useful. For example, the average filtering example above made use of shift operations to move an array around. Broadcast is another common global operation, used to send data from one processor to all other processors. Extensions of the broadcast operation include simultaneously performing a broadcast within every (predetermined and distinct) subset of processors. For example, suppose matrix $A$ has been partitioned into submatrices allocated to different processors, and one needs to broadcast the first row of $A$ so that if a processor contains any elements of column $i$, then it obtains the value of $a_{1,i}$. In this situation, the more general form of a subset-based broadcast can be used.
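A column-wise broadcast of this kind can be expressed directly in MPI by splitting the processors into one communicator per column. The sketch below is a minimal illustration in C; the grid shape (GRID_COLS) and the row values are invented for the example.

```c
/* A minimal sketch of a subset (column-wise) broadcast in C with MPI.
   Assumes a row-major logical processor grid; GRID_COLS and the data
   layout are illustrative, not taken from the text. */
#include <mpi.h>
#include <stdio.h>

#define GRID_COLS 4   /* hypothetical grid width */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Processors holding elements of the same column share a color,
       so MPI_Comm_split groups them into one communicator each. */
    int col = rank % GRID_COLS;
    MPI_Comm col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, col, rank, &col_comm);

    /* The processor holding row 1 of this column (local rank 0 in a
       row-major layout) broadcasts its value to the column's subset. */
    double a1i = (rank < GRID_COLS) ? 10.0 * col : 0.0;
    MPI_Bcast(&a1i, 1, MPI_DOUBLE, 0, col_comm);

    printf("rank %d, column %d: a(1,%d) = %.1f\n", rank, col, col, a1i);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
```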
Besides operating within subsets of processors, many global operations are defined in terms of a commutative, associative semigroup operator $\otimes$. Examples of such semigroup operators include minimum, maximum, and sum. For example, suppose there is a set of values $V(1), V(2), \ldots, V(n)$, and the goal is to obtain the maximum of these values. Then $\otimes$ would represent maximum, and the operation of applying $\otimes$ to all $n$ values is called reduction. If the value of the reduction is broadcast to all processors, then it is sometimes known as report. A more general form of the reduction operation involves labeled data items, i.e., each data item is embedded in a record that also contains a label, and at the end of the reduction operation the result of applying $\otimes$ to all values with the same label is recorded in the record.
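In MPI terms, reduction and report correspond to MPI_Reduce and MPI_Allreduce. Below is a minimal sketch in C with maximum as the semigroup operator; the local values are invented for illustration.

```c
/* A minimal sketch of reduction and report in C with MPI, using
   maximum as the semigroup operator; the local values are invented. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int v = (rank * 37) % 101;   /* this processor's value V(i) */

    /* Reduction: the maximum of all n values lands on processor 0. */
    int vmax;
    MPI_Reduce(&v, &vmax, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);

    /* Report: a reduction whose result reaches every processor. */
    int vmax_all;
    MPI_Allreduce(&v, &vmax_all, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

    printf("rank %d: local %d, report %d\n", rank, v, vmax_all);
    MPI_Finalize();
    return 0;
}
```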
Global operations provide a useful way to describe major actions in parallel programs. Further, since several of these operations are widely useful, they are often made available in highly optimized implementations. The language APL provided a model for several of these operations, and some parallel versions of APL have appeared. Several data-parallel languages provide some forms of global operations, as do message-passing systems such as MPI. Reduction operations are so important that most parallelizing compilers detect them automatically, even if they have no explicit support for other global operations.
Besides broadcast, reduction, and shift, other important global operations include the following.
Sort: Let $X = \{x_1, x_2, \ldots, x_n\}$ be an ordered set such that $x_i < x_{i+1}$ for all $1 \le i < n$. (That is, $X$ is a subset of a linearly ordered data type.) Given that the $n$ elements of $X$ are arbitrarily distributed among a set of $p$ processors, the sort operation will (re)arrange the members of $X$ so that they are ordered with respect to the processors. That is, after sorting, elements $x_1, \ldots, x_{\lceil n/p \rceil}$ will be in the first processor, elements $x_{\lceil n/p \rceil + 1}, \ldots, x_{2\lceil n/p \rceil}$ will be in the second processor, and so forth. Note that this assumes an ordering on the processors, as well as on the elements.
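The sketch below illustrates only the postcondition of the sort operation, using the deliberately naive strategy of gathering all elements to one processor, sorting sequentially, and scattering them back; practical parallel sorts (e.g., sample sort) avoid this bottleneck. The per-processor count LOCAL is an assumption of the example.

```c
/* A naive sketch of the sort operation's postcondition in C with MPI:
   gather, sort sequentially, scatter back so that elements end up
   ordered with respect to the processors. Illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL 4   /* elements per processor; n = p * LOCAL */

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int mine[LOCAL];
    srand(rank + 1);                       /* invented local data */
    for (int i = 0; i < LOCAL; i++) mine[i] = rand() % 1000;

    int *all = NULL;
    if (rank == 0) all = malloc(p * LOCAL * sizeof(int));
    MPI_Gather(mine, LOCAL, MPI_INT, all, LOCAL, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) qsort(all, p * LOCAL, sizeof(int), cmp_int);

    /* After the scatter, processor 0 holds the smallest LOCAL elements,
       processor 1 the next LOCAL, and so forth. */
    MPI_Scatter(all, LOCAL, MPI_INT, mine, LOCAL, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) free(all);
    MPI_Finalize();
    return 0;
}
```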
Merge: Suppose that sets $D_1$ and $D_2$ are subsets of some linearly ordered data type, and $D_1$ and $D_2$ are each distributed in an ordered fashion among disjoint sets of processors $P_1$ and $P_2$, respectively. Then the merge operation combines $D_1$ and $D_2$ to yield a single sorted set stored in ordered fashion in the entire set of processors $P = P_1 \cup P_2$.
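The combining logic at the heart of the merge is the standard two-pointer merge of sorted sequences, sketched sequentially below. In the parallel operation the inputs are distributed over $P_1$ and $P_2$ and the output remains distributed over $P_1 \cup P_2$ rather than being collected on one processor; this shows only the combining step.

```c
/* A minimal sequential sketch of the merge step: a two-pointer merge
   of sorted arrays d1 and d2. The data are invented for illustration. */
#include <stdio.h>

static void merge(const int *d1, int n1, const int *d2, int n2, int *out) {
    int i = 0, j = 0, k = 0;
    while (i < n1 && j < n2)
        out[k++] = (d1[i] <= d2[j]) ? d1[i++] : d2[j++];
    while (i < n1) out[k++] = d1[i++];   /* drain the remainder of d1 */
    while (j < n2) out[k++] = d2[j++];   /* drain the remainder of d2 */
}

int main(void) {
    int d1[] = {1, 4, 9}, d2[] = {2, 3, 10, 12}, out[7];
    merge(d1, 3, d2, 4, out);
    for (int k = 0; k < 7; k++) printf("%d ", out[k]);
    printf("\n");
    return 0;
}
```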
Associative read/write: These operations start with a set of master records indexed by unique keys. In the associative read, each processor specifies a key and ends up with the data in the master record indexed by that key, if such a record exists, or else a flag indicating that there is no such record. In the associative write, each processor specifies a key and a value, and each master record is updated by applying $\otimes$ to all values sent to it. (Master records are generated for all keys written.)
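As a minimal illustration of associative-write semantics, the sequential sketch below uses sum as the combining operator $\otimes$; the (key, value) submissions stand in for what individual processors would send, and the table size and data are invented for the example.

```c
/* A minimal sequential sketch of associative-write semantics, with
   sum as the combining operator. On a real machine the (key, value)
   pairs would be routed to the processors owning the master records. */
#include <stdio.h>
#include <string.h>

#define MAX_KEYS 16   /* illustrative key range */

typedef struct { int present; int value; } MasterRecord;

int main(void) {
    MasterRecord master[MAX_KEYS];
    memset(master, 0, sizeof master);

    /* One (key, value) submission per processor, invented for the example. */
    int keys[]   = {3, 7, 3, 1};
    int values[] = {5, 2, 4, 9};

    for (int i = 0; i < 4; i++) {
        int k = keys[i];
        if (!master[k].present) {         /* records are created for all keys written */
            master[k].present = 1;
            master[k].value = values[i];
        } else {
            master[k].value += values[i]; /* apply the combining operator */
        }
    }

    for (int k = 0; k < MAX_KEYS; k++)
        if (master[k].present)
            printf("key %d -> %d\n", k, master[k].value);
    return 0;
}
```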
These operations are extensions of the CRCW PRAM operations. They model a PRAM with associative memory and a powerful combining operation for concurrent writes. On most distributed memory machines the time to perform these more powerful operations is within a multiplicative constant of the time needed to simulate the usual concurrent read and concurrent write, and the use of the more powerful operations can result in significant algorithmic simplifications and speedups.