Global Operations

To manipulate entire structures in one step, it is useful to have a collection of operations that perform such manipulations. These global operations may be very problem-dependent, but certain ones have been found to be widely useful. For example, the average filtering example above made use of shift operations to move an array around. Broadcast is another common global operation, used to send data from one processor to all other processors. Extensions of the broadcast operation include simultaneously performing a broadcast within every (predetermined and disjoint) subset of processors. For example, suppose matrix $A$ has been partitioned into submatrices allocated to different processors, and one needs to broadcast the first row of $A$ so that if a processor contains any elements of column $i$, then it obtains the value of $A(1, i)$. In this situation, the subset-based form of broadcast can be used, as sketched below.
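To make the operation concrete, the following minimal sketch uses MPI through the mpi4py Python bindings (an assumed environment; the discussion itself is language-neutral). A plain broadcast sends a value from one processor to all others, while splitting the communicator into disjoint groups yields the subset-based broadcast, one independent broadcast per group; the grouping rule used here is purely hypothetical.

```python
# Minimal sketch of broadcast and subset broadcast, assuming mpi4py.
# Run with, e.g., mpiexec -n 4 python bcast_sketch.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Plain broadcast: processor 0 sends its data to every other processor.
row = [1.0, 2.0, 3.0] if rank == 0 else None
row = comm.bcast(row, root=0)

# Subset broadcast: partition the processors into disjoint groups
# (a hypothetical rule assigning processors to "columns"), then
# perform an independent broadcast within each group.
col = rank % 2
col_comm = comm.Split(color=col, key=rank)
val = (100 + col) if col_comm.Get_rank() == 0 else None
val = col_comm.bcast(val, root=0)   # every group member receives the value
col_comm.Free()
```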

Besides operating within subsets of processors, many global operations are defined in terms of a commutative, associative semigroup operator $\otimes$. Examples of such semigroup operators include $\mathtt{minimum}$, $\mathtt{maximum}$, $\mathtt{or}$, $\mathtt{and}$, $\mathtt{sum}$, and $\mathtt{product}$. For example, suppose there is a set of values $V(i)$, $1 \leq i \leq n$, and the goal is to obtain the maximum of these values. Then $\otimes$ would represent maximum, and the operation of applying $\otimes$ to all $n$ values is called reduction. If the value of the reduction is broadcast to all processors, then the operation is sometimes known as report. A more general form of reduction involves labeled data items, i.e., each data item is embedded in a record that also contains a label; at the end of the reduction, each record holds the result of applying $\otimes$ to all values with the same label.
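In MPI terms, reduction and report correspond to MPI_Reduce and MPI_Allreduce, respectively. The sketch below again assumes the mpi4py bindings; the labeled variant is simulated here by combining values separately within each label at one processor, and the labels themselves are illustrative.

```python
# Minimal sketch of reduction, report, and labeled reduction, assuming mpi4py.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

v = (rank * 7) % 5                       # stand-in for the local value V(i)

# Reduction: apply the semigroup operator (here maximum) to all values;
# the combined result ends up at the root processor only.
m = comm.reduce(v, op=MPI.MAX, root=0)

# Report: the reduction result is delivered to every processor.
m_all = comm.allreduce(v, op=MPI.MAX)

# Labeled reduction, simulated: combine values separately within each label.
label = rank % 2                         # hypothetical labels
pairs = comm.gather((label, v), root=0)
if rank == 0:
    out = {}
    for k, x in pairs:
        out[k] = max(out.get(k, x), x)   # apply maximum within each label
```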

Global operations provide a useful way to describe major actions in parallel programs. Further, since several of these operations are widely useful, they are often made available in highly optimized implementations. The language APL provided a model for several of these operations, and some parallel versions of APL have appeared. Languages such as C* provide for some forms of global operations, as do message-passing systems such as MPI. Reduction operations are so important that most parallelizing compilers detect them automatically, even if they have no explicit support for other global operations.

Besides broadcast, reduction, and shift, other important global operations include the following.

Sort: Let $X = \{x_0 , x_1 , \dots , x_{n-1}\}$ be an ordered set such that $x_i < x_{i+1}$, for all $0 \leq i < n - 1$. (That is, $X$ is a subset of a linearly ordered data type.) Given that the $n$ elements of $X$ are arbitrarily distributed among a set of $p$ processors, the sort operation will (re)arrange the members of $X$ so that they are ordered with respect to the processors. That is, after sorting, elements $x_0 , \dots , x_{\lceil n/p \rceil - 1}$ will be in the first processor, elements $x_{\lceil n/p \rceil}, \dots, x_{2\lceil n/p \rceil - 1}$ will be in the second processor, and so forth. Note that this assumes an ordering on the processors, as well as on the elements.
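The definition can be made concrete with a deliberately naive sketch (again assuming mpi4py): gather every element at one processor, sort sequentially, and scatter blocks of $\lceil n/p \rceil$ elements back in processor order. A genuine parallel sort (e.g., a sample sort) would avoid this sequential bottleneck; the sketch only pins down what "ordered with respect to the processors" means.

```python
# Naive sketch of the sort operation's semantics, assuming mpi4py.
import random
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

random.seed(rank)
local = [random.randint(0, 99) for _ in range(4)]    # arbitrary distribution

# Gather all elements, sort them centrally, and scatter contiguous
# blocks back so that processor order agrees with element order.
chunks = None
gathered = comm.gather(local, root=0)
if rank == 0:
    flat = sorted(x for block in gathered for x in block)
    q = -(-len(flat) // p)                           # ceil(n / p)
    chunks = [flat[i * q:(i + 1) * q] for i in range(p)]
local = comm.scatter(chunks, root=0)
```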

Merge: Suppose that sets $D_1$ and $D_2$ are subsets of some linearly ordered data type, and $D_1$ and $D_2$ are each distributed in an ordered fashion among disjoint sets of processors $P_1$ and $P_2$, respectively. Then the merge operation combines $D_1$ and $D_2$ to yield a single sorted set stored in ordered fashion in the entire set of processors $P = P_1 \cup P_2$.
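The same naive gather-and-redistribute style illustrates merge (assuming mpi4py; the data and the split of $P$ into halves are hypothetical). Here $P_1$ holds the even numbers and $P_2$ the odds, each stored in ordered fashion across its group, and the merged result is redistributed over all of $P$.

```python
# Naive sketch of the merge operation, assuming mpi4py.
import heapq
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

in_p1 = rank < p // 2                        # hypothetical split into P1, P2
group = comm.Split(color=0 if in_p1 else 1, key=rank)
g = group.Get_rank()
local = [2 * (3 * g + j) + (0 if in_p1 else 1) for j in range(3)]

# Gather the two sorted sequences (gathering in rank order preserves
# each one), merge them, and redistribute blocks across P = P1 U P2.
runs = comm.gather((in_p1, local), root=0)
chunks = None
if rank == 0:
    d1 = [x for is1, run in runs if is1 for x in run]
    d2 = [x for is1, run in runs if not is1 for x in run]
    merged = list(heapq.merge(d1, d2))
    q = -(-len(merged) // p)
    chunks = [merged[i * q:(i + 1) * q] for i in range(p)]
local = comm.scatter(chunks, root=0)
group.Free()
```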

Associative read/write: These operations start with a set of master records indexed by unique keys. In the associative read, each processor specifies a key and ends up with the data in the master record indexed by that key, if such a record exists, or else a flag indicating that there is no such record. In the associative write, each processor specifies a key and a value, and each master record is updated by applying $\otimes$ to all values sent to it. (Master records are generated for all keys written.) A sketch of both operations appears at the end of this section.

These operations are extensions of the CRCW PRAM operations. They model a PRAM with associative memory and a powerful combining operation for concurrent writes. On most distributed-memory machines, the time to perform these more powerful operations is within a multiplicative constant of the time needed to simulate the usual concurrent read and concurrent write, and the use of the more powerful operations can result in significant algorithmic simplifications and speedups.
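As an illustration, the following sketch (assuming mpi4py, with hypothetical keys and $\otimes = \mathtt{sum}$) simulates the associative write by combining all values sent to the same key at one processor, and the associative read by a lookup into the broadcast master records.

```python
# Minimal sketch of associative write and read, assuming mpi4py.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Associative write: each processor supplies a (key, value) pair; values
# written to the same key are combined with the semigroup operator
# (sum, in this hypothetical instance), creating master records as needed.
key, value = rank % 2, rank + 1
pairs = comm.gather((key, value), root=0)
master = None
if rank == 0:
    master = {}
    for k, v in pairs:
        master[k] = master.get(k, 0) + v
master = comm.bcast(master, root=0)

# Associative read: each processor names a key and receives the matching
# master record, or a flag if no such record exists.
want = rank % 3
result = master.get(want)               # None signals "no such record"
```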