[Question] Trie Wildcard String Matching

May 22nd, 2017 12:00 am

Question

link

Given a trie data struture, implement add() and search() method, which support wildcard matching.

eg. If trie contains [“cool”, “kid”, “kindle”, “kind”]

These would match: “col”, “kind”, “**”

These won’t: “*”, “kid”, “coo”

Solution

First, know how to build a trie, then know how to search in Trie. This could be tricky. Do Read On!

Trie (root) start with an empty node.
When do searching, ALWAYS assumes that current TrieNode is matched. I.e. do not check value of current TrieNode. Instead, check current TrieNode’s children, and find a match.
The only case you return true, is when matching reaches current TrieNode, and current TrieNode is an terminating node for a word.
Be careful and do not confuse yourself for which node you gonna match, and when you return true.

Read the code below. If you still have a hard time, read #2 and #3 above, again.

Code

public boolean solve(TrieRoot trie, String word) {
    if (word == null || word.length() == 0) {
        return false;
    }
    return matchFromChildren(trie.getRoot(), word, 0);
}

private boolean matchFromChildren(TrieNode node, String word, int index) {
    // [important note] 
    // regardless of the value of node.letter, match word[index...] from node.children

    if (index > word.length()) {
        // impossible to reach here
        return false;
    } else if (index == word.length()) {
        // word is now fully matched, check if isTerminalNode(node)
        return node.isEndOfWord();
    }

    char curLetter = word.charAt(index);
    if (curLetter == '*') {
        for (TrieNode child: node.getAllChildren()) {
            if (matchFromChildren(child, word, index + 1)) {
                return true;
            }
        }
        return false;
    } else {
        TrieNode nextNode = node.getChild(curLetter);
        if (nextNode == null) {
            return false;
        } else {
            return matchFromChildren(nextNode, word, index + 1);
        }
    }
}

[Question] Dutch National Flag Problem

May 13th, 2017 12:00 am

Question

link

Given an array A[] consisting 0s, 1s and 2s, write a function that sorts A[]. The functions should put all 0s first, then all 1s and all 2s in last.

Example
Input = {0, 1, 1, 0, 1, 2, 1, 2, 0, 0, 0, 1};
Output = {0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2}

Solution

3-way sort: http://www.geeksforgeeks.org/3-way-quicksort-dutch-national-flag/

Code

public int[] threePointerSort(int[] input) {
    if (input == null || input.length <= 1) {
        return input;
    }

    // left is at the leftmost non-0 position
    // right is at rightmost non-2
    // mid is currently being evaluated (before mid, it's sorted)
    int left = 0, right = input.length - 1;
    int mid = left;

    while (mid <= right) {
        if (input[mid] > 1) {
            Common.swap(input, mid, right);
            right--;
        } else if (input[mid] < 1) {
            Common.swap(input, left, mid);
            left++;
            mid++;
        } else {
            mid++;
        }
    }

    return input;
}

[Design] How to Design Logging

Aug 1st, 2016 12:00 am

Logging

Conecting between information and knowledge.

Information: it’s just bits
Knowledge: drives product direction

Logging is all about Event

Q1. shall we log it live? what should we log? how to do it (on different platforms)?

Not live. Do it asynchronously.

Demand and product design drives logging.

Logs represent event, so all your logging should be based around the events.

eg. who click the button, the user ID, but do not log the age, because you can find it somewhere else!

Think about “what” to log and “why”

“How” is just coding. Happy Logging!

[Design] MVC, MVP and MVVM

Jul 31st, 2016 12:00 am

MVC Pattern

Model-View-Controller.

The model and controller logic are decoupled from user interface (view).

Model

business model + data access operations (i.e. data model)
View

only for displaying data (received from the controller, transforms the model into UI)
Controller (IMPORTANT)

process incoming requests. It receives input from users via the View, then process the user’s data with the help of Model and passing the results back to the View. Typically, it acts as the coordinator between the View and the Model.

example

Ruby on Rails, Spring Framework, Apple iOS Development and ASP.NET MVC.

Softwares (not web apps).

MVP pattern

separate the presentation layer from the logic

The Presenter is responsible for handling all UI events on behalf of the view.

Unlike view and controller, view and presenter are completely decoupled from each other’s and communicate to each other’s by an interface.

Also, presenter does not manage the incoming request traffic as controller.

Key Points about MVP

User interacts with the View.
There is one-to-one relationship between View and Presenter means one View is mapped to only one Presenter.
View has a reference to Presenter but View has not reference to Model.
Provides two way communication between View and Presenter.

example

Android, ASP.NET Web Forms applications

MVVM pattern

This pattern supports two-way data binding between view and View model.

This enables automatic propagation of changes, within the state of view model to the View.

Typically, the view model uses the observer pattern to notify changes in the view model to model.

details

The View Model is responsible for exposing methods, commands, and other properties that helps to maintain the state of the view, manipulate the model as the result of actions on the view, and trigger events in the view itself.

Key Points about MVVM

User interacts with the View.
There is many-to-one relationship between View and ViewModel means many View can be mapped to one ViewModel.
View has a reference to ViewModel but View Model has no information about the View.
Supports two-way data binding between View and ViewModel.

example

Ember.js, WPF, Silverlight

Ref

Main: http://www.dotnet-tricks.com/Tutorial/designpatterns/2FMM060314-Understanding-MVC,-MVP-and-MVVM-Design-Patterns.html

TLDR: http://www.beyondjava.net/blog/model-view-whatever/

[Design] Design Twitter

Jul 11th, 2016 12:00 am

System design evaluation form

work solution
special cases
analysis
trade off
knowledge base

Design guideline: 4S

Scenario

ask, features, qps, DAU, interfaces
Service

split, application, module
Storage

schema, data, sql, NoSql, file system
Scale

sharding, optimize, special case

Scenario

DAU?

Whta’s the DAU/MAU rate?

Chatting apps like wechat/whatapp has a rate of around 75%, but facebook/twitter is lower at 60%.

Enumerate the functions

registration
user profile display/edit
upload image/video
search
post a tweet
share a tweet
timeline
newsfeed
follow/unfollow

QPS

concurrent user

150M user * 60 query/user / 606024s = average QPS = 100K

peak QPS = 3 * average QPS = 300K

fast growing product = 2 * peak QPS = 600K
read qps: 300K QPS
write qps: 5K QPS

On average, a web server support around 1000 QPS, thus in this case, we need 300 servers to support the system.

Service

4 services for Twitter

user service
1. register
2. login
tweet service
1. post tweet
2. news feed
3. timeline
media service
1. upload image
2. upload video
friendship service
1. follow
2. upfollow

Storage

SQL

Good for accurate, small amount of data, more read than write. user table
NoSQL

Good for large amount of read/write, high scalability. tweets social graph (follower)
File System

Good for media files photo, video

Select the right DB

Question: can we use file system for tweets?

No, it’s hard to query. Eg. query all tweets of my friends.

Design data schema

(optional) 3 tables needed:

user table
tweet table
friendship table: this is not as straight forward, as it shall contain double directions info

Important: News Feed

pull model

Read top 50 feeds from top 100 friends, then merge sort by date. (note that user is getting sync-blocked here).

Post tweet is simple 1 DB write.

This design is bad, because file-system/DB read is slow. If you have N friends, you query O(N) DB queries. It’s too slow (and user is getting sync-blocked, too). We should have, ideally, <= 7 DB queries per web page.

problem

synchronously block user from getting news feed
too many DB reads

push model

Each person have a list storing new feeds. When friend post tweet, fanout to my feed list.

When I read, I simply read top 100 from the feed list. So read is 1 DB read.

Post tweet is N DB writes, which is slow. However this is done async, so it does not matter.

One example of async implementation: RabbitMQ

Scale

optimize pull model

Although it looks like push is faster than pull, facebook and twitter both use pull model.

add cache for DB, reduce # of DB read
also cache each user’s news feed

your yesterday’s feeds are all cached, thus don’t need to read everytime.

optimize push model

disk waste a lot, although disk is cheap
inactive user!

rank follower by weight, and don’t write to inactive user (eg. last login time)
if follower is toooo much, like Lady Gaga, user pull for Lady Gaga.

Tradeoff: Push + Pull model.

optimize ‘Like’ info

In tweet table, if we need to count(user) who liked, it’s gonna take forever.

We must denormalize this data!

Denormalize: it’s duplicate info but we store in table, because of performance improvement.

Shortcoming: inconsistency!

unless using SQL transaction, async failure can result in wrong counting number
race condition

Solution: 1. use atomic operation 2. every day, schedule to update this number.

optimize thundering herd problem

When cache fails, all DB query will go to DB. This results in DB crash.

Hot spot (thundering herd)

Solution: hold all incoming queries (who fails cache), and only send 1 DB query. When result is returned, return to every query.

[Octopress] Add Aside Content to Octopress

Jun 11th, 2016 12:00 am

Aside

Example of this is the ‘categories’ list on the right hand side of the blog page. It’s always pinned on RHS, regardless of page content.

Instruction

Add to source/_includes/asides/category_list.html with the following content. (remember to delete [delete-this-tag])

 <section class="well">
   <h1>Categories</h1>
   <ul id="categories" class="nav nav-list">
     {[delete-this-tag]% category_list %[delete-this-tag]}
   </ul>
 </section>

Go to _config.yml and modify default_asides.

 default_asides: [asides/category_list.html, asides/recent_posts.html, asides/github.html, asides/delicious.html, asides/pinboard.html, asides/googleplus.html, asides/advertise.html]

Since the ‘category_list’ tag is not natively supported. Add plugins/category_list_tag.rb with the followin content.

 module Jekyll
   class CategoryListTag < Liquid::Tag
     def render(context)
       html = ""
       categories = context.registers[:site].categories.keys
       categories.sort.each do |category|
         posts_in_category = context.registers[:site].categories[category].size
         category_dir = context.registers[:site].config['category_dir']
         category_url = File.join(category_dir, category.gsub(/_|\P{Word}/, '-').gsub(/-{2,}/, '-').downcase)
         html << "<li class='category'><a href='/#{category_url}/'>#{category} (#{posts_in_category})</a></li>\n"
       end
       html
     end
   end
 end

 Liquid::Template.register_tag('category_list', Jekyll::CategoryListTag)

rake generate && rake preview

[Design] User Registry Table Design

May 8th, 2016 12:00 am

First word

Designing a system like twitter, facebook or airbnb, first step is often User Registry.

The tables, we must use RDBMS, as it’s more reliable.

Table design

Friendship table is important:

[Design] Designing a Simple Web Crawler

Nov 22nd, 2015 12:00 am

1. Choose a framework

Assuming we use Python to do this.

plain python?

We can write a simple Python crawler with the code below:

import re, urllib

textfile = file('depth_1.txt','wt')
print "Enter the URL you wish to crawl.."
print 'Usage  - "http://phocks.org/stumble/creepy/" <-- With the double quotes'
myurl = input("@> ")
for i in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(myurl).read(), re.I):
    print i  
    for ee in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(i).read(), re.I):
        print ee
        textfile.write(ee+'\n')
textfile.close()

Scrapy?

You only define the rules, Scrapy do the rest
easily plugin extensions
portable + python runtime.

Why Scrapy

scrapy has the tools to manage every stage of a web crawl, just to name a few:

Requests manager - in charge of downloading pages all concurrently behind the scenes! You won’t need to invest a lot of time in concurrent architecture.

Selectors - parse the html document (eg. XPath)

Pipelines - after you retrieve the data, there’s a bunch of functions to modify the data.

Following the spirit of other don’t repeat yourself frameworks, such as Django:

it makes it easier to build and scale large crawling projects by allowing developers to re-use their code.

For more, read Scrapy Architecture .

Scrapy Engine

control data flow
Scheduler

receives requests from the engine and enqueues them for feeding them later
Downloader
Spiders
Item Pipeline
Downloader middlewares

specific hooks that sit between the Engine and the Downloader and process requests
Spider middlewares

specific hooks that sit between the Engine and the Spiders and are able to process spider input (responses) and output (items and requests).

2. Schedule a Scrapy job

APScheduler? (todo)

add/remove jobs

3. Choose a DB

I chose NoSQL/MongoDB. But why?

there’s only a few tables with few columns
no overly complex associations between nodes
huge amount of time-based data
scaling requirements: MongoDB better horizontal scaling
different field names: dynamical storage

4. Technical Difficulty?

4.1 differrent way to crawl.

We need to check AJAX response sometime and study each website’s API.

Some site would close certain APIs if they found out too many queries requests.

4.2 Difficulty navigating pages

Study their URL structure.

eg.

www.abc.com/index.html?page=milk&start_index=0

Just play with the url params!

4.3 What is key?

I defined extra column only to store keys (combine a few key columns, and convert to lower-case).

We can search using regex though, but:

Mongo (current version 2.0.0) doesn’t allow case-insensitive searches against indexed fields. For non-indexed fields, the regex search should be fine.

How to go about it:

searching with regex’s case insensitive means that mongodb cannot search by index, so queries against large datasets can take a long time.

Even with small datasets, it’s not very efficient… which could become an issue if you are trying to achieve scale.

As an alternative, you can store an uppercase copy and search against that…

If your field is large, such as a message body, duplicating data is probably not a good option. I believe using an extraneous indexer like Apache Lucene is the best option in that case.

4.4 A lot bad data

write a sophisticated pipeline()
try not let bad data reach pipeline() - better

Make your spider better!

4.5 NLP: brand names

how? (todo)

[Question] Swizzle Sort

Nov 21st, 2015 12:00 am

Question

link

输入一个数组，要求输出满足：a[0]<=a[1]>=a[2]<=a[3]>=…

Solution

O(n)，一边扫描即可。发现不符合条件的只要跟前面一个数对调就可以，

Code

public int[] solve(int[] input) {
    boolean incr = true;
    int len = input.length;
    int p = 1;
    while (p < len) {
        if (incr ^ (input[p - 1] < input[p])) {
            Common.swap(input, p - 1, p);
        }
        p++;
        incr = !incr;
    }
    return input;
}

[Design] How to Generate Maze

Nov 21st, 2015 12:00 am

Question

link

Design a 2D dungeon crawling game. It must allow for various items in the maze - walls, objects, and computer-controlled characters.

Part 1: API design

Serialize:

if a certain cell has a wall to the North and West but not to the South or East, it would be represented as 1001, or 9… (e.g., “9,6,11,12\n3,10,10,4\n13,9,12,5\n3,6,1,6” in a 4x4 maze)

Design API~

Part 2: Algorithm

Depth-first search

This is most common and one of the simplest ways to generate a maze using a computer. It’s commonly implemented using Recursive backtrack.

from a random cell, select a random neighbour that hasn’t been visited.
removes the ‘wall’ and adds the new cell to a stack.
a cell with no unvisited neighbours is considered dead-end.
When at a dead-end it backtracks through the path until it reaches a cell with unvisited neighbours, continuing from there.
until every cell has been visited, the computer would backtrack all the way to the beginning cell.
Entire maze space is guaranted a complete visit.

side note

To add difficulty and a fun factor to the DFS, you can influence the likelihood of which neighbor you should visit, instead of completely random.

By making it more likely to visit neighbors to your sides, you can have a more horizontal maze generation.

← Older Blog Archives