Parallel Binary Search

Suppose now that we change the model to $p$ processors with a shared memory, that is, a parallel RAM (PRAM) model. Can we speed up binary search? First, we have to define how the memory is accessed concurrently. The most commonly used model is concurrent read, exclusive write (CREW); otherwise it is difficult to define the final value of a memory cell after a write operation.

In a CREW PRAM, we can use the following simple parallel binary search. We divide the sorted set into $p + 1$ segments (so there are $p$ internal segment boundaries). Processor $i$ compares the key with the element stored at the $i$th boundary and writes a $0$ into a variable $c_i$ if the key is greater, or a $1$ if it is smaller (in case of equality, the search ends). All the processors do this in parallel. After this step, there is an index $j$ such that $c_j = 0$ and $c_{j+1} = 1$ (we assume that $c_0 = 0$ and $c_{p+1} = 1$), which indicates the segment where the key must lie. Then processor $i$ compares $c_i$ with $c_{i+1}$ and, if they differ, writes the boundaries of the new segment where the search continues recursively (see figure [*]). This step is also done in parallel (processors $1$ and $p$ take care of the extreme cases). When the segment has $p$ or fewer elements, each processor compares one element and the search ends. Hence, the worst-case number of parallel key comparisons is given by
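To make the round structure concrete, the following is a minimal sequential simulation of this scheme in Python (an illustrative sketch, not part of the original presentation): each iteration of the outer loop stands for one parallel round, and the loop over the $p$ boundaries stands for the $p$ processors comparing simultaneously; the function name \texttt{parallel\_search} and the segment bookkeeping are assumptions made here for illustration.

\begin{verbatim}
def parallel_search(a, key, p):
    """Return the index of `key` in the sorted list `a`, or -1 if absent."""
    lo, hi = 0, len(a) - 1                      # current segment (inclusive)
    while hi - lo + 1 > p:
        size = hi - lo + 1
        # The p internal boundaries splitting the segment into p + 1 pieces.
        b = [lo + (i * size) // (p + 1) for i in range(1, p + 1)]
        # "Processor i": write c_i = 0 if the key is greater than the
        # boundary element, 1 if it is smaller; equality ends the search.
        for pos in b:
            if a[pos] == key:
                return pos
        c = [0] + [1 if key < a[pos] else 0 for pos in b] + [1]
        # The unique j with c_j = 0 and c_{j+1} = 1 tells us in which of
        # the p + 1 pieces the key can lie (sentinels cover the extremes).
        j = next(i for i in range(p + 1) if c[i] == 0 and c[i + 1] == 1)
        lo = b[j - 1] + 1 if j > 0 else lo
        hi = b[j] - 1 if j < p else hi
    # Final round: the segment has at most p elements, one per processor.
    for pos in range(lo, hi + 1):
        if a[pos] == key:
            return pos
    return -1

if __name__ == "__main__":
    data = list(range(0, 1000, 3))              # a sorted set
    assert parallel_search(data, 300, 4) == data.index(300)
    assert parallel_search(data, 301, 4) == -1
\end{verbatim}

Note that the simulation performs the $p$ boundary comparisons one after another; on an actual CREW PRAM they would take a single parallel step, with the shared variables $c_i$ written concurrently by distinct processors.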

$\displaystyle U_n = 1 + U_{n/(p+1)}
$

for $U_i = 1$ when $i \leq p$, which gives $U_n = \log_{p+1} n + O(1)$ (a short unrolling of the recurrence is given below). That is, $U_n = O(\log n/\log (p+1))$. Note that for $p = 1$ we obtain the binary search result, as expected. It can be proved that this cannot be improved. In the PRAM model, the speedup is optimal when the total work done by the $p$ processors (that is, $p$ times the parallel time) equals the work of the optimal sequential algorithm. In this case, the total work is $p \log n/ \log (p+1)$, which is larger than $\log n$. In other words, searching a sorted set cannot be solved with optimal speedup. If we also restrict the PRAM model to exclusive reads (EREW), then $U_n = O(\log n - \log p)$, which is even worse. This is because, at every recursive step, not all the processors can read the new segment boundaries concurrently, which slows down the whole process.
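To see where the CREW bound comes from, one can unroll the recurrence above (a sketch, not a formal proof): after $k$ parallel rounds the remaining segment has size about $n/(p+1)^k$, and the recursion bottoms out once this size is at most $p$, that is, after roughly $\log_{p+1}(n/p)$ rounds. Counting the final round,

$\displaystyle U_n \approx \log_{p+1}\frac{n}{p} + 1 = \frac{\log n - \log p}{\log (p+1)} + 1 = \frac{\log n}{\log (p+1)} + O(1),
$

in agreement with the bound stated above.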

Figure: Parallel binary search.
\includegraphics[scale=0.3]{bin-ser}