Woodstock Blog

a tech blog for general algorithmic interview questions

[Design] HBase and HDFS

Overview

Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers. HDFS, MapReduce, and YARN form the core of Apache™ Hadoop®.

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.

Hadoop

Hadoop is basically 2 things, a FS (Hadoop Distributed File System) and a computation framework (MapReduce).

HDFS allows you store huge amounts of data in a distributed (provides faster read/write access) and redundant (provides better availability) manner. And MapReduce allows you to process this huge data in a distributed and parallel manner.

But MapReduce is not limited to just HDFS.

HDFS and HBase

Being a FS, HDFS lacks the random read/write capability. It is good for sequential data access. And this is where HBase comes into picture. It is a NoSQL database that runs on top your Hadoop cluster and provides you random real-time read/write access to your data.

Now, the Comparison

HDFS is a distributed file system. ref

  1. It is optimized for streaming access of large files. You would typically store files that are in the 100s of MB upwards on HDFS and access them through MapReduce to process them in batch mode.
  2. HDFS files are write once files. You can append to files in some of the recent versions but that is not a feature that is very commonly used. Consider HDFS files as write-once and read-many files. There is no concept of random writes.
  3. HDFS doesn’t do random reads very well.

HBase is a database that stores it’s data in a distributed filesystem.

The filesystem of choice typically is HDFS owing to the tight integration between HBase and HDFS. It doesn’t mean that HBase can’t work on any other filesystem.

  1. Low latency access to small amounts of data from within a large data set. You can access single rows quickly from a billion row table.
  2. Flexible data model to work with and data is indexed by the row key.
  3. Fast scans across tables.
  4. Scale in terms of writes as well as total volume of data.

[Design] Speed Up Webpage for Slow Connection (4)

What hijacks your load time

1. Too Many HTTP Requests

This is the single biggest contributor to performance problems in most web pages.

  1. Concatenate scripts and stylesheets

  2. Combine images with sprites (put common images into a single large image file, then use CSS to position and selectively display the appropriate portion of the sprite image)

  3. Use fewer images, more CSS.

2. Minimal Client-side Processing

  1. Validation on client. (eg. form input)

  2. Use web standards and MVC separation, making a maintainable, accessible, future-proof and max-performance website.

    Think of the HTML as the model, the CSS as the view, and the JavaScript as the controller. This separation tends to make code more efficient and maintainable, and makes many optimization techniques much more practical to apply.

  3. Push presentation code into the client tier (eg. Charts and graphs — push raw data to the client in JSON format, and use JavaScript and CSS to create pretty graphs.)

  4. Leverage Ajax techniques (only requiring small parts of the page to change in response to user actions)

3. Low Number of Parallel Requests

Fetch a script, parse and execute it, then fetch another one… this will use up all the available connections. There are things you can do to your HTML to allow virtually any browser to make many requests at once, which has a huge impact on latency.

  1. Use browser-appropriate domain sharding

  2. Use intelligent script loading

  3. Leverage Keep-Alive (reuse the same TCP connection for multiple HTTP request/response cycles)

4. Failure to leverage browser cache / local storage

  1. HTTP cache overview

  2. Leverage local storage

5. Third-party widgets

  1. Avoid third-party widgets!
  2. Try to use widgets that provide asynchronous implementations, so their inevitably terrible performance impacts their widget without dragging down your entire UX with it.

7. Failure to Use a Global Network

Amazon S3.

Ref: http://www.sitepoint.com/seven-mistakes-that-make-websites-slow/

[Design] Speed Up Webpage for Slow Connection (3)

Website KPI

There are 3 interesting phases of a web site from an end-user performance perspective.

  1. First Impression
  2. OnLoad
  3. Fully Loaded Time.

Loading Time

Question: what percentage of the time a user spends waiting for your page to load is spent after the HTML comes back to their browser?

It is typically over 90%.

Most of the time users spend waiting on your website is spent after the HTML page has been retrieved by their browser.

Fetching the HTML is just the beginning

In a nutshell, browsers parse your page’s HTML, sequentially discovering its assets (such as scripts, stylesheets, and images), requesting and then either parsing and executing them or displaying them as appropriate.

But these assets are not simply fetched all at once. Instead, the browser opens a limited number of connections to the server(s) referenced by the page. There is overhead involved in establishing TCP and HTTP connections, and some unavoidable latency in pushing the request and response bytes back and forth across the network.

So, in general, round trips between the browser and server are expensive. The structure of the HTML markup, the number and the ordering of its assets, are absolutely critical factors in its performance.

[Design] Speed Up Webpage for Slow Connection (2)

(… continued from last post)

9. Use a Content Delivery Network (CDN)

Your site’s speed is greatly affected by where the user’s location is, relative to your web server. The farther away they are, the more distance the data being transmitted has to travel. Having your content cached across multiple, strategically placed geographical locations helps take care of this problem.

A CDN will often make your operating cost a little higher, but you definitely gain a speed bonus. Check out MaxCDN or Amazon Simple Storage Service (Amazon S3).

10. Optimize Web Caching

Along with using caching systems, you should create websites that utilize web caching as much as possible. Web caching is when files are cached by the web browser for later use. Things that browsers can cache include CSS files, JavaScript files, and images.

Aside from the basics, such as putting CSS and JavaScript code that are used in multiple pages in external files, there are many ways to make sure that you are caching your files in the most efficient way possible.

For example, you can set HTTP response headers such as Expires and Last-Modified to reduce the need of re-downloading certain files when the user comes back to your site. To learn more, read about caching in HTTP and leveraging browser caching.

To set up HTTP Expires headers in Apache, read this tutorial on adding future expires headers.

Other mentions

  1. Avoid Resizing Images in HTML (using HTML’s width and height attributes), for the sake of smaller file size.

  2. Optimize Image Sizes by Using the Correct File Format. Eg. JPG format often displays photographic images at smaller file sizes than PNG.

  3. Optimize the Way You Write Code. For example, instead of using (h1)(em)Your heading(em)(h1), you can easily use CSS to make your headings italics.

    Writing code efficiently not only reduces file sizes of your HTML and CSS documents, but also makes it easier to maintain.

Ref: http://sixrevisions.com/web-development/site-speed-performance/

[Design] Speed Up Webpage for Slow Connection (1)

ref

Question

Suppose you are handling Amazon website and you have 10 MB size home page. Optimize the homepage for a customer who has 100 kbps internet connection.

Further he asked for the customer who has 100 mbps internet connection.

Reading

This is a very nice article about website speedup. Below is the full quoted text.

1. Defer Loading Content When Possible

Ajax allows us to build web pages that can be asynchronously updated at any time. This means that instead of reloading an entire page when a user performs an action, we can simply update parts of that page.

We can use an image gallery as an example. Image files are big and heavy; they can slow down page-loading speeds of web pages. Instead of loading all of the images when a user first visits the web page, we can just display thumbnails of the images and then when the user clicks on them, we can asynchronously request the full-size images from the server and update the page. This way, if a user only wants to see a few pictures, they don’t have to suffer waiting for all of the pictures to download. This development pattern is called lazy loading.

Ajax/web development libraries like jQuery, Prototype, and MooTools can make deferred content-loading easier to implement.

2. Use External JS and CSS Files

When the user first loads your web page, the browser will cache external resources like CSS and JavaScript files. Thus, instead of inline JavaScript and CSS files, it’s best to place them in external files.

Using inline CSS also increases the rendering time of a web page; having everything defined in your main CSS file lets the browser do less work when rendering the page, since it already knows all the style rules that it needs to apply.

As a bonus, using external JavaScript and CSS files makes site maintenance easier because you only need to maintain global files instead of code scattered in multiple web pages.

3. Use Caching Systems

If you find that your site is connecting to your database in order to create the same content, it’s time to start using a caching system. By having a caching system in place, your site will only have to create the content once instead of creating the content every time the page is visited by your users. Don’t worry, caching systems periodically refresh their caches depending on how you set it up — so even constantly-changing web pages (like a blog post with comments) can be cached.

Popular content management systems like WordPress and Drupal will have static caching features that convert dynamically generated pages to static HTML files to reduce unnecessary server processing. For WordPress, check out WP Super Cache (one of the six critical WordPress plugins which, claimed by the author, enjoyed a improvement by 259.1% for content-heavy pages). Drupal has a page-caching feature in the core.

There are also database caching and server-side scripts caching systems that you can install on your web server (if you have the ability to do so). For example, PHP has extensions called PHP accelerators that optimize performance through caching and various other methods; one example of a PHP accelerator is APC. Database caching improves performance and scalability of your web applications by reducing the work associated with database read/write/access processes; memcached, for example, caches frequently used database queries.

8. Load JavaScript at the End of Your Document

It’s best if you have your scripts loading at the end of the page rather than at the beginning. It allows for the browser to render everything before getting started with the JavaScript. This makes your web pages feel more responsive because the way JavaScript works is that it blocks anything below it from rendering until it has finished downloading. If possible, reference JavaScript right before the closing (body) tag of your HTML documents. To learn more, read about deferring the loading of JavaScript.

(to be continued…)

[LinkedIn] Sort Part to Make Entire Array Sorted

Question

link

Given an unsorted array arr[0..n-1] of size n, find the minimum length subarray arr[s..e] such that sorting this subarray makes the whole array sorted.

Solution

The solution from gfg is definitely good, but I find this graphic explanation a really great resource.

The idea is, find min and max in the unsorted proportion, and trim the original array.

Code

not written.

[Google] Minimum Adjustments

Question

link

Given an integer array, adjust each integers so that the difference of every adjacent integers are not greater than a given number target.

If the array before adjustment is A, the array after adjustment is B, you should minimize the sum of |A[i]-B[i]|

Given [1,4,2,3] and target=1, one of the solutions is [2,3,2,3], the minimal adjustment cost is 2.

Solution

This is a difficult DP. Now we define 2D array like this:

dp[i][j]: the minimal adjust cost on changing A[i] to j

And the equation:

dp[i][j] = min{dp[i-1][k] + |j-A[i]|} s.t. |k-j| <= target

Refer to the post by simon.zhangsm

Code

Not written.

Available at OJ

[LinkedIn] Unique Combination of Factors (因式分解)

Question

link

A question by dongfeiwww:

打印一个数的所有乘数组合,从大到小,不要有重复

Print all unique combinations of factors of a positive integer. For example given 24:

24*1
12*2
8*3
6*4
6*2*2
4*3*2
3*2*2*2

Solution

Simple DFS.

Code

public List<List<Integer>> factorCombinations(int n) {
    List<List<Integer>> ans = new ArrayList<List<Integer>>();
    helper(ans, n, n / 2, new ArrayList<Integer>());
    return ans;
}

private void helper(List<List<Integer>> ans, int num, int largestFactor,
        List<Integer> path) {
    if (num == 1) {
        ans.add(new ArrayList<Integer>(path));
        return;
    }
    for (int i = largestFactor; i > 1; i--) {
        if (num % i == 0) {
            path.add(i);
            helper(ans, num / i, i, path);
            path.remove(path.size() - 1);
        }
    }
}

[LinkedIn] Sum of Integer Weighted by Depth

Question

link

/** 
* Given a nested list of integers, returns the sum of all integers in the list weighted by their depth 
* For example, given the list {(1,1),2,(1,10)} the function should return 10 (four 1's at depth 2, one 2 at depth 1) 
* Given the list {1,{4,{6}}} the function should return 27 (one 1 at depth 1, one 4 at depth 2, and one 6 at depth 3) 
*/ 
public int depthSum (List<NestedInteger> input) 
{ // ur implementation here} 


** 
* This is the interface that represents nested lists. 
* You should not implement it, or speculate about its implementation. 
*/ 
public interface NestedInteger 
{ 
/** @return true if this NestedInteger holds a single integer, rather than a nested list */ 
boolean isInteger(); 

/** @return the single integer that this NestedInteger holds, if it holds a single integer 
* Return null if this NestedInteger holds a nested list */ 
Integer getInteger(); 

/** @return the nested list that this NestedInteger holds, if it holds a nested list 
* Return null if this NestedInteger holds a single integer */ 
List<NestedInteger> getList(); 
}

Solution

DFS recurse.

Code

The algo:

public int depthSum(List<NestedInteger> input, int weight) {
    // ur implementation here
    int sum = 0;
    for (NestedInteger ni : input) {
        if (ni.isInteger()) {
            sum += ni.getInteger() * weight;
        } else {
            sum += depthSum(ni.getList(), weight + 1);
        }
    }
    return sum;
}

Interface and Impl:

/*
 * This is the interface that represents nested lists. You should not implement
 * it, or speculate about its implementation.
 */
interface NestedInteger {
    /**
     * @return true if this NestedInteger holds a single integer, rather than a
     *         nested list
     */
    boolean isInteger();

    /**
     * @return the single integer that this NestedInteger holds, if it holds a
     *         single integer Return null if this NestedInteger holds a nested
     *         list
     */
    Integer getInteger();

    /**
     * @return the nested list that this NestedInteger holds, if it holds a
     *         nested list Return null if this NestedInteger holds a single
     *         integer
     */
    List<NestedInteger> getList();
}

class NestedIntegerImpl implements NestedInteger {

    int num;
    List<NestedInteger> list = new ArrayList<NestedInteger>();

    public NestedIntegerImpl(int number) {
        num = number;
        list = null;
    }

    public NestedIntegerImpl(List<NestedInteger> inputList) {
        num = -1;
        list = inputList;
    }

    @Override
    public boolean isInteger() {
        return list == null;
    }

    @Override
    public Integer getInteger() {
        if (isInteger()) {
            return num;
        }
        return -1;
    }

    @Override
    public List<NestedInteger> getList() {
        return list;
    }
}

[LinkedIn] Executive's Schedule

Question

link

Develop an algorithm to schedule an executive’s day, given a list of people the executive has to meet with, the amount of time they request to see the executive, and the priority of the meeting.

Solution

Refer to [Question] 0-1 Knapsack Problem.