[Design] MapReduce Common Friends Example

Aug 5th, 2014 12:00 am

Question

Facebook’s “You and Joe have 230 friends in common” feature. Sure you could use a decent caching strategy, but for now, use MapReduce.

Analysis

This list doesn’t change frequently so it’d be wasteful to recalculate it every time. We’re going to use mapreduce to calculate everyone’s common friends once a day and store those results.

Note that friends are bi-directional. If I’m your friend, you’re mine. (follow-up: how to do it for follower system?)

Input

Assume the friends are stored as Person->[List of Friends], our friends list is then:

A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E
E -> B C D

Map

The mapper will output a key-value pair. The key will be a friend along with the person. The value will be the list of friends. The key will be sorted so that the friends are in order, causing all pairs of friends to go to the same reducer.

For map(A -> B C D) :

(A B) -> B C D
(A C) -> B C D
(A D) -> B C D

For map(B -> A C D E) : (Note that A comes before B in the key)

(A B) -> A C D E
(B C) -> A C D E
(B D) -> A C D E
(B E) -> A C D E

For map(C -> A B D E) :

(A C) -> A B D E
(B C) -> A B D E
(C D) -> A B D E
(C E) -> A B D E

For map(D -> A B C E) :

(A D) -> A B C E
(B D) -> A B C E
(C D) -> A B C E
(D E) -> A B C E

And finally for map(E -> B C D):

(B E) -> B C D
(C E) -> B C D
(D E) -> B C D

Before we send these key-value pairs to the reducers, we group them by their keys and get:

(A B) -> (A C D E) (B C D)
(A C) -> (A B D E) (B C D)
(A D) -> (A B C E) (B C D)
(B C) -> (A B D E) (A C D E)
(B D) -> (A B C E) (A C D E)
(B E) -> (A C D E) (B C D)
(C D) -> (A B C E) (A B D E)
(C E) -> (A B D E) (B C D)
(D E) -> (A B C E) (B C D)

Each line will be passed as an argument to a reducer.

Reduce

The reduce function will simply intersect the lists of values and output the same key with the result of the intersection. After reduction:

(A B) -> (C D)
(A C) -> (B D)
(A D) -> (B C)
(B C) -> (A D E)
(B D) -> (A C E)
(B E) -> (C D)
(C D) -> (A B E)
(C E) -> (B D)
(D E) -> (B C)

Result

Now when D visits B’s profile, we can quickly look up (B D) and see that they have three friends in common, (A C E).

[Design] Amazon Web Services

Aug 5th, 2014 12:00 am

Overview

Amazon Web Services (AWS) is a collection of remote computing services that together make up a cloud computing platform. AWS provides a large computing capacity much faster and cheaper than building a physical server farm.

The most well-known is Amazon EC2 and Amazon S3.

Amazon is an example of an IAAS (Infrastructure as a Service) provider.

Terminologies

AWS

Amazon Web Services, a division of Amazon focusing on hosting our applications and data

SimpleDB

AWS’s always-available replacement for RDBMSs. Specifically SimpleDB is their hosted, replicated key-value store that is always available and accessible as a web service

S3

(a.k.a Simple Storage Service) AWS’s always-available file storage solution accessible as a web service

SQS

(a.k.a Simple Queue Service) AWS’s always-available queueing service accessible as a web service

ELB

(a.k.a Elastic Load Balancer) AWS’s always-available load balancing service accessible as a web service

EC2

(a.k.a Elastic Compute Cloud) AWS’s on-demand server offering accessible as a web service

CloudFront

AWS’s CDN (a.k.a Content Delivery Network) offering accessible as a web service

Benefits

Basically, by moving these services out of your data center and into the cloud, you:

No longer need to run and maintain a data center. You no longer need the associated staff
Get a more scalable and available infrastructure without trying to build it yourself

[Google] Traveller Path Problem

Aug 4th, 2014 12:00 am

Question

link

Traveler wants to travel from city “A” to city “D”. There is a path from city “A” to city “D”. Path consists of steps, i.e. travel from city “A” to city “B”. Path is encoded as sequence of steps.

Sequence is in incorrect order. Your task is to restore order of steps in the path.

Input (unordered sequence):

C -> D 
A -> B 
B -> C

Output (Correctly ordered list which represents path):

A, B, C, D

Implement following API:

class Step {
    String start;
    String finish;
};

class Node {
    String value;
    Node next;
}

List<String> findPath(List<Step> steps) {
}

Solution

This question is not stated clear enough. I found one tentative algorithm:

First, initialize a result - sortedPath, and build 2 maps (String to String) - startsToFinishes and finishesToStarts (O(N))

Then, find the one key on startsToFinishes, that is not a key in finishesToStarts - this is the city from which the path begins. (O(N))

Then, iteratively, city by city, build the path.

Follow up on Sep 2nd, 2014:

This post talks about this question. The answer would be similar to what’s writtne above.

[Design] Overview of Big Data Technology

Aug 4th, 2014 12:00 am

link

Traditional RDBMS

Data is organized in a highly-structured manner, following the relational model.

The need for the data to be well-structured actually becomes a substantial burden at extremely large volumes.

NoSQL

A completely different framework of databases that allows for high-performance, agile processing of information at massive scale.

NoSQL centers around the concept of distributed databases.

It’s horizontally scalable; as data continues to explode, just add more hardware to keep up.

Hadoop

Hadoop is not a type of database, but rather a software ecosystem that allows for massively parallel computing.

Hadoop is an open source implementation of the MapReduce programming model. Hadoop relies not on Google File System (GFS), but on its own Hadoop Distributed File System (HDFS).

MapReduce

An example of the Hadoop ecosystem is MapReduce.

It’s a computational model that basically takes intensive data processes and spreads the computation across a potentially endless number of servers.

[Design] Model–view–controller (MVC)

Aug 4th, 2014 12:00 am

Overview

Model–view–controller (MVC) is a software architectural pattern for implementing user interfaces.

From MIT handouts 3: MVC uses separate programming entities to store the data(model), display the data(view), and modify the data(controller).

Components

The central component of MVC, the model, captures the application’s behavior in terms of its problem domain, independent of the user interface. The model directly manages the application’s data, logic and rules.

A view can be any output representation of information, such as a chart or a diagram; multiple views of the same information are possible, such as a bar chart for management and a tabular view for accountants.

The third part, the controller, accepts input and converts it to commands for the model or view.

Early web MVC frameworks took a thin client approach that placed almost the entire MVC on server. As client technologies have matured, MVC components can be executed partly on the client(AJAX).

Other good explanation

MVC is a user interface design pattern.

Controller – Represents interactions, typically with the mouse or keyboard, or in the case of web applications, in the form of HTTP requests.
View – Renders the graphical output of the application
Model – Everything else. In particular this includes the data and business logic.

[Design] Hadoop Cluster

Aug 4th, 2014 12:00 am

link

A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment.

Such clusters run Hadoop’s open source distributed processing software on low-cost commodity computers.

Typically one machine in the cluster is designated as the NameNode and another machine the as JobTracker; these are the masters. The rest of the machines in the cluster act as both DataNode and TaskTracker; these are the slaves.

Hadoop clusters are known for boosting the speed of data analysis applications. They also are highly scalable.

As of early 2013, Facebook was recognized as having the largest Hadoop cluster in the world.

[Google] Find Occurance Greater Than Index

Aug 4th, 2014 12:00 am

Question

link

Given an unsorted array of integers, you need to return maximum possible n such that the array consists at least n values greater than or equals to n. Array can contain duplicate values.

Sample input : [1, 2, 3, 4] – output : 2

Sample input : [900, 2, 901, 3, 1000] – output: 3

Solution

The idea of ‘couting sort’, solves in O(n) time.

Lets say the array has M numbers. So, we can count the number of existing values between 1 and M.

Then, process the values backwards (M to 1) to find the answer, adding the counts of the values processed so far.

This is not an easy question.

Code

not written by me

int Solve(const vector<int>& values) {
    int n = values.size();
    vector<int> count(n+1, 0);
    for (auto val: values)
        if (val >= n)
            count[n]++;
        else if (val > 0) // ignore negative values
            count[val]++;
    int am = 0;
    for (int i = n; i > 0; i--) {
        am += count[i];  // amount of numbers >= i
        if (am >= i)
            return i;
    }
    return 0;
}

[Google] Find Nearest Point in a 2D Space

Aug 4th, 2014 12:00 am

Question

link

You are given information about hotels in a country/city. X and Y coordinates of each hotel are known. You need to suggest the list of nearest hotels to a user who is querying from a particular point (X and Y coordinates of the user are given). Distance is calculated as the straight line distance between the user and the hotel coordinates.

Solution

Build a 2-D Tree (by using Binary Tree).

First find the root, then divide the rest of nodes by left-side and right-side. Then, divide by up-side and down-side.

There’s a very good example video here, which talked about how to construct a 2-D tree.

After this preprocessing, the search for nearest hotels in about O(lgn). However, if we were to return a list of nearest nodes, I’ve got no idea.

[CC150v4] 17.1 Type a URL in Browser and Hit Enter

Aug 3rd, 2014 12:00 am

Question

17.1 Explain what happens, step by step, after you type a URL into a browser Use as much detail as possible.

Short Answer

Browser contacts the DNS server to find the IP address of URL
DNS returns back the IP address of the site
Browser opens TCP connection to the web server at port 80
The browser sends a GET request to the server, asking for the file
Browser fetches the html code
Browser renders the HTML in the display window
Browser terminates the connection when window is closed

ref

More Details

Phase 1:

browser checks cache; if requested object is in cache and is fresh, skip to Phase 3 #4
browser asks OS for server’s IP address
OS makes a DNS lookup and replies the IP address to the browser
browser opens a TCP connection to server (this step is much more complex with HTTPS)
browser sends the HTTP request through TCP connection

ref

Phase 2:

That computer receives the HTTP request from the TCP/IP connection and passes it to the web server program.
Web server reads the hostname and path and finds or generates the data that you’ve asked for.
Web server generates an HTTP response containing that data.
Web server sends that HTTP response back down the TCP/IP connection to your machine.

ref

Phase 3:

browser receives HTTP response and may close the TCP connection, or reuse it for another request
browser checks if the response is a redirect (3xx result status codes), authorization request (401), error (4xx and 5xx), etc.; these are handled differently from normal responses (2xx)
if cacheable, response is stored in cache
browser decodes response (e.g. if it’s gzipped)
browser determines what to do with response (e.g. is it a HTML page, is it an image, is it a sound clip?)
browser renders response, or offers a download dialog for unrecognized types

Miscellaneous:

Each domain name server is divided into zones A single server may only be responsible for knowing the host names and IP addresses for a small subset of a zone, but DNS servers can work together to map all domain names to their IP addresses That means if one domain name server is unable to find the IP addresses of a requested domain then it requests the information from other domain name servers

The firewall will control connections to & from your computer. For the most part it will just be controlling who can connect to your computer and on what ports. For web browsing your firewall generally won’t be doing a whole lot.

Your router essentially guides your request through the network, helping the packets get from computer to computer and potentially doing some NAT (Network Address Tranlator) to translate IP addresses along the way (so your internat LAN request can be transitioned onto the wider internet and back).

Even more

An even more detailed article can be found here. And a __great step-by-step example using facebook.com can be found here.

[Google] Three Keys Data Structure

Aug 3rd, 2014 12:00 am

Question

link

Give efficient implementation of the following problem.

An item consist of different keys say k1, k2, k3. User can insert any number of items in database, search for item using any (one) key, delete it using any key and iterate through all the items in sorted order using any key. Give the most efficient way such that it supports insertion, search based on a key, iteration and deletion.

Solution

There’re 3 keys, so we need 3 maps to store search map for 3 types of keys. For example, the DS is like this:

(date, name, country) -> ItemObject

Then we would have:

date -> a list of ItemObject

name -> a list of ItemObject

country -> a list of ItemObject

Since we need to iterate in order, we choose TreeMap over HashMap.

For scalability purpose, we use another HashMap<KeyType, TreeMap> and put 3 items in.

Final Data Structure

3 DS needed. source

ArrayList list;
TreeMap<KeyValue, List> index;
HashMap<KeyType, TreeMap<KeyValue, List>> mapOfIndexes;

Method add(item): void

add item in ArrayList.
For each key, get TreeMap from HashMap and add into TreeMap.

Method search(key): List

Get TreeMap from HashMap for provided key.
Search the TreeMap
Return List Of Items

Method delete(key): List

Using search method get List Of item
Delete items from ArrayList and also delete its mapping from all (3) TreeMap

Method orderedIterator(keytype): Iterator

Get TreeMap from HashMap for provided key
Get EntrySet from TreeMap
Return EntrySet.iterator();

← Older Blog Archives Newer →