Woodstock Blog

a tech blog for general algorithmic interview questions

[Design] Median of Array in Distributed Computers

Question

link

What is the distributed algorithm to determine the median of arrays of 1 billion integers located on different computers?

Solution

  1. Suppose you have a master node (or are able to use a consensus protocol to elect a master from among your servers).

  2. The master queries all servers for the size of their sets of data, so that it knows to look for the k = n/2 largest element.

  3. The master then selects a random server and queries a random element.

  4. The master broadcasts this element to each server.

  5. Each server partitions its elements into those larger than or equal to the broadcasted element and those smaller than the broadcasted element.

  6. Each server returns to the master the size of the larger-than partition, call this m.

    1. If the sum of these sizes is greater than k, the master indicates to each server to disregard the ‘less-than’ set.

    2. If it is less than k, the master indicates to disregard the larger-than sets and updates k = k - m.

    3. If it is exactly k, the algorithm terminates and the value returned is the pivot selected at the beginning of the iteration.

  7. Recurse beginning with selecting a new random pivot from the remaining elements.

ref