Woodstock Blog

a tech blog for general algorithmic interview questions

[Design] TCP 3-Way Handshake

Handshaking

Handshaking is an automated process of negotiation that dynamically sets parameters of a communications channel established between two entities before normal communication over the channel begins.

It is usually a process that takes place when a computer is about to communicate with a foreign device to establish rules for communication.

TCP three-way handshake

The TCP three-way handshake is the method used by TCP to set up a TCP/IP connection over an Internet Protocol based network.

It’s commonly referred to as “SYN-SYN-ACK”.

Process

  1. Host A sends a TCP SYNchronize packet to Host B
  2. Host B receives A’s SYN
  3. Host B sends a SYNchronize-ACKnowledgement
  4. Host A receives B’s SYN-ACK
  5. Host A sends ACKnowledge
  6. Host B receives ACK.
  7. TCP socket connection is ESTABLISHED.

Alternatively, there’s a good illustration on Wikipedia:

Establishing a normal TCP connection requires three separate steps:

  1. The first host (Alice) sends the second host (Bob) a “synchronize” (SYN) message with its own sequence number x, which Bob receives.

  2. Bob replies with a synchronize-acknowledgment (SYN-ACK) message with its own sequence number y and acknowledgement number x + 1, which Alice receives.

  3. Alice replies with an acknowledgment message with acknowledgement number y + 1, which Bob receives and to which he doesn’t need to reply.

Two more things

Note that FTP, Telnet, HTTP, HTTPS, SMTP, POP3, IMAP, SSH and any other protocol that rides over TCP also has a three-way handshake performed as the connection is opened.

TCP ‘rides’ on top of Internet Protocol (IP) in the protocol stack. IP handles IP addressing and routing and gets the packets from one place to another, but TCP manages the actual communication sockets between endpoints.
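The handshake itself is performed by the operating system’s TCP stack, not by application code. As a minimal sketch in Java (the port number is arbitrary), simply opening a Socket triggers the SYN, SYN-ACK, ACK exchange under the hood:

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class HandshakeDemo {
    public static void main(String[] args) throws IOException {
        // Host B: listen. accept() returns once the kernel has
        // completed the three-way handshake with a client.
        try (ServerSocket server = new ServerSocket(9090)) {
            // Host A: connect. The constructor blocks while the kernel
            // sends SYN, receives SYN-ACK, and replies with ACK.
            try (Socket client = new Socket("localhost", 9090);
                 Socket accepted = server.accept()) {
                System.out.println("ESTABLISHED: " + client.isConnected());
            }
        }
    }
}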

[Google] Count Complete Binary Tree

Question

link

Given the root of a complete binary tree, count the total number of nodes in the tree.

Hint: an O(n) recursive count is the obvious approach; for this problem, use the properties of a complete binary tree to come up with a more efficient algorithm.

Solution

Similar in spirit to binary search. Each recursive call discards one perfect subtree, so there are O(log n) calls, each doing an O(log n) height computation, for O(log^2 n) total.

The idea is to count one perfect subtree of nodes at a time and recurse into the other. The following code is refactored and written by me.

Code

read it from here

// Use TreeNodeUtil.getLeftChildNode(TreeNode) to get the left child
// Use TreeNodeUtil.getRightChildNode(TreeNode) to get the right child
// Use TreeNodeUtil.isNullNode(TreeNode) to check whether a node is null
public class CountCompleteBinaryTreeNodes {
    public int countNodes(TreeNode root) {
        if (TreeNodeUtil.isNullNode(root)) {
            return 0;
        }
        // Height of the leftmost path in each subtree.
        int hl = height(TreeNodeUtil.getLeftChildNode(root));
        int hr = height(TreeNodeUtil.getRightChildNode(root));
        if (hl == hr) {
            // Left subtree is perfect: 2^hl - 1 nodes, plus the root
            // makes 2^hl. Only the right subtree needs recursion.
            return (int) Math.pow(2, hl) + countNodes(TreeNodeUtil.getRightChildNode(root));
        } else {
            // hl == hr + 1: right subtree is perfect with 2^hr - 1 nodes,
            // plus the root makes 2^hr. Only the left subtree needs recursion.
            return (int) Math.pow(2, hr) + countNodes(TreeNodeUtil.getLeftChildNode(root));
        }
    }

    // Number of nodes along the leftmost path (0 for a null node).
    private int height(TreeNode root) {
        int count = 0;
        while (!TreeNodeUtil.isNullNode(root)) {
            root = TreeNodeUtil.getLeftChildNode(root);
            count++;
        }
        return count;
    }
}

[Design] Big Data - Fuzzy Search Url (Bloom Filter)

Question

link

You are given two files, A and B, each containing 5 billion URLs, with each URL taking 64 bytes. The memory limit is 4 GB. Find the URLs common to both A and B. What if there were three, or even n, files?

Bloom Filter

Since Burton Bloom proposed it in the 1970s, the Bloom filter has been widely used in spell checkers, database systems, and more.

Basic principle and key points

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.

Clearly this process does not guarantee a 100% correct lookup. It also does not support deleting an inserted key, because the bits set for that key may be shared with other keys.

A simple improvement is the counting Bloom filter, which replaces the bit array with an array of counters, making deletion possible.
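To make the insert and lookup mechanics concrete, here is a minimal sketch in Java; simulating the k hash functions by re-hashing with k different seeds is an illustrative choice, not the only one:

import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;  // number of bits in the array
    private final int k;  // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Simulate k independent hash functions with k different seeds.
    private int position(String key, int seed) {
        int h = key.hashCode() * 31 + seed * 0x9E3779B9;
        h ^= (h >>> 16);
        return Math.floorMod(h, m);
    }

    public void add(String key) {
        for (int i = 0; i < k; i++) {
            bits.set(position(key, i));
        }
    }

    // May return a false positive, but never a false negative.
    public boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}

For the question above, build the filter from file A’s URLs, then stream file B through mightContain, keeping the hits as candidate common URLs (and verifying them exactly afterwards if false positives are unacceptable).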

Error rate

m: length of BF array (in bits)
n: number of input elements
k: number of hash functions

With the optimal number of hash functions, k = (m/n) · ln 2, a Bloom filter with a 1% error rate requires only about 9.6 bits per element (m ≈ 9.6n); in general m/n ≈ 1.44 · log2(1/p) for a target error rate p.

Usage

A Bloom filter can be used to implement a data dictionary, to test data for duplicates, or to compute the intersection of sets.

Solution

Of course we can always use the standard 【divide and conquer + trie/hash + min-heap】 solution, but for fuzzy (approximate) search, a Bloom filter is the best fit.

4 GB = 2^32 bytes, roughly 4 billion; × 8 that is roughly 34 billion bits. With n = 5 billion and an error rate of 0.01, we need about 48 billion bits (9.6 × 5 billion). About 34 billion are available, which is not far off, so the actual error rate will rise somewhat.

[Design] Big Data - Find Median Numbers

Question

link

Find the median of 500 million ints.

Analysis

Categorized under two-level bucketing.

Idea: first divide the int range into 2^16 regions, then read the data and count how many numbers fall into each region. From these counts we can tell which region contains the median, and what rank within that region the median has. In a second scan we only need to examine the numbers that fall into that region.

Details

Allocate an int array of size 65536. In the first pass, bucket each Int32 by its high 16 bits (equivalent to dividing the number by 65536): for each number read, increment the corresponding counter. After the first pass, walk the array accumulating counts to find which bucket k the median falls into.

The second pass is similar, but it only counts numbers that fall into bucket k, this time bucketing them by the low 16 bits. Accumulating those counts locates the median within the bucket; combining the high 16 bits (k) and the low 16 bits gives the result.

This can also be done in more than two passes, depending on the input width (e.g. for int64 input, split on 24/20/20 bits across three passes).
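A minimal sketch of the two passes in Java, assuming the data can be streamed twice (the readInts supplier below is a stand-in for re-reading the file, and kthSmallest is an illustrative name); the sign bit is flipped so that bucket order matches signed int order:

import java.util.function.Supplier;
import java.util.stream.IntStream;

public class TwoPassMedian {

    // Returns the k-th smallest (0-based) value in the stream.
    static int kthSmallest(Supplier<IntStream> readInts, long k) {
        // Pass 1: histogram over the high 16 bits. The sign bit is
        // flipped so that bucket order matches signed int order.
        long[] high = new long[1 << 16];
        readInts.get().forEach(x -> high[(x >>> 16) ^ 0x8000]++);

        int bucket = 0;
        long skipped = 0;
        while (skipped + high[bucket] <= k) {
            skipped += high[bucket];
            bucket++;
        }
        final int b = bucket;
        final long rankInBucket = k - skipped;

        // Pass 2: histogram over the low 16 bits, restricted to bucket b.
        long[] low = new long[1 << 16];
        readInts.get()
                .filter(x -> ((x >>> 16) ^ 0x8000) == b)
                .forEach(x -> low[x & 0xFFFF]++);

        int lo = 0;
        long seen = 0;
        while (seen + low[lo] <= rankInBucket) {
            seen += low[lo];
            lo++;
        }
        // Undo the sign flip on the high bits and recombine.
        return ((b ^ 0x8000) << 16) | lo;
    }

    public static void main(String[] args) {
        int[] data = {5, -3, 9, 1, 7};
        // Median of an odd-length input is the (n/2)-th smallest, 0-based.
        System.out.println(kthSmallest(() -> IntStream.of(data), data.length / 2));  // 5
    }
}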

[Design] Big Data - Find Existence of a Number

Question

link

Given 4 billion unsorted, non-repeating unsigned ints, and one more number, how do you quickly determine whether that number is among the 4 billion?

Analysis

Categorized under BitMap.

There are 2^32 ≈ 4.3 billion possible unsigned 32-bit integers, so the input covers almost the whole range. This is a perfect question to use BitMap.

Allocate 512 MB of memory, letting one bit represent one unsigned int value. Read the 4 billion numbers and set the corresponding bits. Then read the query number and check its bit: 1 means the number exists, 0 means it does not.
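A minimal sketch in Java: 2^32 bits packed into 2^26 longs is exactly 512 MB, and each int is reinterpreted as an unsigned index:

public class BitMap {
    // 2^32 bits packed into 2^26 longs (512 MB).
    private final long[] words = new long[1 << 26];

    public void set(int value) {
        long idx = value & 0xFFFFFFFFL;  // reinterpret as unsigned
        words[(int) (idx >>> 6)] |= 1L << (idx & 63);
    }

    public boolean contains(int value) {
        long idx = value & 0xFFFFFFFFL;
        return (words[(int) (idx >>> 6)] & (1L << (idx & 63))) != 0;
    }
}

Call set on each of the 4 billion inputs, after which contains answers any query with a single bit test (the JVM needs a heap larger than 512 MB, e.g. -Xmx1g).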

[Design] Big Data - Top K Frequency (Case Analysis)

Question

link

Finding the K most frequent numbers in a massive dataset, or finding the K largest numbers in a massive dataset, is usually called the “top K” problem:

  1. top K value
  2. top K frequency

Analysis

The standard solution is 【divide and conquer + trie/hash + min-heap/quickselect】, which I covered in another post, Big Data - Top k Frequency. Briefly, it is 3 steps:

  1. First split the dataset into many small datasets by hashing,
  2. use a trie or hash map to count the term frequencies in each small dataset,
  3. use a min-heap or quickselect to find the K most frequent terms in each dataset.

But there are other scenarios where different solutions may apply. Consider:

  1. Single core vs. multiple cores

  2. Single PC vs. multiple PCs

  3. Large RAM vs. limited RAM

  4. Distributed system

1. Single machine + single core + ample RAM

Assume each query term takes 8 bytes on average; 1 billion query terms then need about 10^9 × 8 = 8 GB of memory. If you have that much memory, sort the terms directly in memory and scan once to pick the 10 most frequent. This is simple, fast, and practical. Alternatively, first use a HashMap to count each term’s frequency, then select the 10 terms with the highest counts.
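A minimal sketch in Java of the HashMap + min-heap variant, shrunk to a toy example (the input data and k are illustrative):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.stream.Collectors;

public class TopKFrequency {
    static List<String> topK(Iterable<String> terms, int k) {
        // Count frequencies.
        Map<String, Long> freq = new HashMap<>();
        for (String t : terms) {
            freq.merge(t, 1L, Long::sum);
        }
        // Min-heap of size k ordered by frequency: the root is always
        // the least frequent candidate, evicted whenever a more
        // frequent term arrives.
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Long> e : freq.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll();
            }
        }
        return heap.stream().map(Map.Entry::getKey).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> data = List.of("a", "b", "a", "c", "b", "a");
        System.out.println(topK(data, 2));  // [a, b] in some order
    }
}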

2. Single machine + single core + limited RAM

In this case, cut the original data file into small files, e.g. by hash(x) % M, splitting the original data into M small files. If a small file is still larger than memory, keep splitting it by hashing until every small file fits in memory. Then process each small file in turn using the method from section 1.

3. Single machine + multiple cores + ample RAM

Here we can hash the data in memory into n partitions and hand each partition to one thread. Each thread processes its partition as in section 1, and a final thread merges the results.

This method has a bottleneck that clearly hurts efficiency: data skew. Threads may run at different speeds, fast threads must wait for slow ones, and the overall speed is set by the slowest thread. The fix is to divide the data into c × n partitions (c > 1); each thread, after finishing its current partition, grabs the next unprocessed partition, until all the data is processed; finally one thread merges the results.
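A minimal sketch in Java of this work-pulling fix, assuming partitions is the pre-split data; countPartition is a hypothetical stand-in for the per-partition counting of section 1:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class SkewFreeTopK {

    static Map<String, Long> countAll(String[][] partitions, int nThreads)
            throws InterruptedException {
        AtomicInteger next = new AtomicInteger(0);
        Map<String, Long> merged = new HashMap<>();
        Thread[] workers = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            workers[i] = new Thread(() -> {
                // Pull partitions until none remain: fast threads simply
                // take on more partitions instead of idling.
                int p;
                while ((p = next.getAndIncrement()) < partitions.length) {
                    Map<String, Long> local = countPartition(partitions[p]);
                    synchronized (merged) {  // single-point merge
                        local.forEach((t, c) -> merged.merge(t, c, Long::sum));
                    }
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return merged;
    }

    // Hypothetical per-partition counter (the HashMap step of section 1).
    static Map<String, Long> countPartition(String[] terms) {
        Map<String, Long> freq = new HashMap<>();
        for (String t : terms) {
            freq.merge(t, 1L, Long::sum);
        }
        return freq;
    }

    public static void main(String[] args) throws InterruptedException {
        String[][] parts = {{"a", "b", "a"}, {"b", "a"}, {"c"}};
        System.out.println(countAll(parts, 2));  // {a=3, b=2, c=1}
    }
}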

4. Multiple machines + limited RAM

In this case, to make good use of the machines’ resources, distribute the data across them (hash + socket is one way to do the distribution), and let each machine handle its local share using the strategy from section 3.

5. Distributed

The top-K problem is a natural fit for the MapReduce framework: write one map function and two reduce functions, then submit the job to Hadoop (using mapchain and reducechain) to solve it.

A map function. The map function applies a hash so that records with the same hash value go to the same reduce task.

2 reduce functions. The first reduce function uses a HashMap to count each term’s frequency; the second reduce function computes the top K over the outputs of all the first-stage reduce tasks.

6. Other

In practice, a company usually does not write a one-off program for this; it submits the job to its core data-processing platform. The platform may compute less efficiently than a hand-written program, but it offers scalability and fault tolerance, which is what enterprises value most.

[Design] Big Data - Remove Duplicate Numbers

Question

link

Find the number of integers that appear only once among 250 million integers; memory is insufficient to hold all 250 million of them.

Analysis

Categorized under two-level bucketing or BitMap.

There are 2^32 possible integers. We can divide these 2^32 numbers into 2^8 regions (for instance, one file per region), separate the data into the regions, and then each region can be solved directly with a bitmap.

But how exactly do we use a BitMap? Extend the bit-map to use 2 bits per number: 0 means not seen, 1 means seen once, 2 means seen twice or more.
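A minimal sketch of the 2-bit map in Java, sized per region (size distinct values) to match the file-per-region split above:

public class TwoBitMap {
    // 2 bits per value: 00 = unseen, 01 = seen once, 10 = seen twice or more.
    private final long[] words;

    public TwoBitMap(long size) {
        // 32 two-bit slots fit in one long.
        words = new long[(int) ((size + 31) / 32)];
    }

    // Record one occurrence of value (0-based within this region),
    // saturating at the "seen twice or more" state.
    public void add(long value) {
        int w = (int) (value / 32);
        int shift = (int) (value % 32) * 2;
        long state = (words[w] >>> shift) & 3;
        if (state < 2) {
            words[w] = (words[w] & ~(3L << shift)) | ((state + 1) << shift);
        }
    }

    // Count the values seen exactly once.
    public long countUnique() {
        long count = 0;
        for (long word : words) {
            for (int s = 0; s < 64; s += 2) {
                if (((word >>> s) & 3) == 1) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        TwoBitMap map = new TwoBitMap(100);
        for (long v : new long[] {5, 7, 5, 9}) {
            map.add(v);
        }
        System.out.println(map.countUnique());  // 2 (7 and 9 appear once)
    }
}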

[Testing] Test Command Line Copy

Question

link

Testing the copy command in a Windows environment.

Solution

The most important point is to come up with different domains of inputs and scenarios.

Copying between:

  1. network share
  2. A really slow network share across the Internet
  3. partitions
  4. disks
  5. disks of different types (flash, usb, external sata, SSD, etc…)
  6. directories
  7. within the same directory

Naming

  1. Normal file name
  2. Filename that exceeds 8.3 (verylongfilename.whatever)
  3. Copying a very long file name, but referencing it by its 8.3 name (copy verylo~1.wha d:)
  4. A full directory path that exceeds MAX_PATH (260) characters (e.g. c:\a\very\long\directory\name\that\goes\on\forever\in\length……foo.txt)
  5. By absolute addressing (\\?\c:\foo\foo.txt)
  6. Wildcards (e.g. *.*, *.txt, foo?.txt)
  7. A filename with unicode characters
  8. A filename with illegal characters in it (there are creative ways to get these files on disk)

Attributes

  1. Testing with different file attributes (read-only, hidden, system, archive, etc…)
  2. Validate timestamp is preserved across copies
  3. Validate timestamp is preserved across network file share copies when the destination machine is in another timezone
  4. NTFS ACLs are preserved

Addressing types

  1. reference by absolute path (e.g. copy c:\some\directory\foo.txt c:\other\place\foo.txt)
  2. reference by relative path (e.g. copy ..\..\documents\whatever\foo.txt subdirectory\foo.txt)
  3. By absolute drive letter into the current working directory of the destination, with no path (e.g. copy foo.txt d:)
  4. Network share mounted to a drive letter

Failure cases, edge cases, and hack attacks

  1. Try to copy a file onto itself (e.g. copy c:\foo.txt c:\foo.txt)
  2. Copy when the network share is down.
  3. Unplug the network cable in the middle of a network file copy
  4. copy to a read only directory
  5. copy when the source file is locked
  6. copy when the destination file exists and is read-only
  7. Detach the external disk right before the file copy starts
  8. disk is near full (But would be full before the entire copy finishes)
  9. disk is full
  10. Unplug the power cable in the middle of the copy!
  11. During a very long copy, start another copy with the same source file, but to another destination
  12. During a very long copy, start another copy with a different source file, but the same destination
  13. During a very long copy, start another copy with the same source and destination files!

File types

  1. ascii file
  2. unicode file
  3. binary file

Environments

  1. RAID configurations
  2. FAT and NTFS
  3. Windows XP, Vista, 7, Server 2003, etc… (you can scope this by asking the “which OS?” requirement up front)
  4. Virtual Machine (VMWare, virtual PC, hypervisor, etc…)
  5. Intel and AMD

[Design] Process vs. Thread

First difference: inclusion

A process can contain multiple threads. When you open Microsoft Word, you start a process that runs Word. The OS creates a process and begins executing the primary thread of that process.

Second difference: address space

Threads within the same process share the same address space, whereas different processes do not.

This facilitates communication between threads: it allows threads to read from and write to the same data structures and variables.

Communication between processes – also known as IPC, or inter-process communication – is quite difficult and resource-intensive.

Interprocess Communication Mechanisms

Processes communicate with each other and with the kernel to coordinate their activities. Linux supports a number of Inter-Process Communication (IPC) mechanisms. Signals and pipes are two of them, but Linux also supports the System V IPC mechanisms, named after the Unix™ release in which they first appeared.

Signals

Signals are one of the oldest inter-process communication methods used by Unix™ systems. They are used to signal asynchronous events to one or more processes. A signal could be generated by a keyboard interrupt or an error condition such as the process attempting to access a non-existent location in its virtual memory. Signals are also used by the shells to signal job control commands to their child processes. There are a set of defined signals that the kernel can generate or that can be generated by other processes in the system, provided that they have the correct privileges. You can list a system's set of signals using the kill command (kill -l).

Pipes

The common Linux shells all allow redirection. For example:

$ ls | pr | lpr

pipes the output from the ls command listing the directory's files into the standard input of the pr command which paginates them. Finally the standard output from the pr command is piped into the standard input of the lpr command which prints the results on the default printer. Pipes then are unidirectional byte streams which connect the standard output from one process into the standard input of another process. Neither process is aware of this redirection and behaves just as it would normally. It is the shell which sets up these temporary pipes between the processes.
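The same pipeline can be set up programmatically. Here is a minimal sketch in Java using ProcessBuilder.startPipeline (Java 9+), assuming the same ls, pr, and lpr commands exist on the machine; the method wires each process’s standard output to the next one’s standard input, just as the shell does:

import java.io.IOException;
import java.util.List;

public class PipeDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Equivalent of: ls | pr | lpr
        List<ProcessBuilder> stages = List.of(
                new ProcessBuilder("ls"),
                new ProcessBuilder("pr"),
                new ProcessBuilder("lpr"));
        // startPipeline connects stdout of each stage to stdin of the next.
        List<Process> processes = ProcessBuilder.startPipeline(stages);
        // Wait for the last stage to finish.
        processes.get(processes.size() - 1).waitFor();
    }
}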


In Linux, a pipe is implemented using two file data structures which both point at the same temporary VFS inode, which itself points at a physical page in memory. Each file data structure contains pointers to a different file-operation routine vector: one for writing to the pipe, the other for reading from it.

System V IPC Mechanisms

  • Message Queues: Message queues allow one or more processes to write messages, which will be read by one or more reading processes. Linux maintains a list of message queues, the msgque vector; each element of it points to a msqid_ds data structure that fully describes the message queue. When a message queue is created, a new msqid_ds data structure is allocated from system memory and inserted into the vector.
  • System V IPC Mechanisms: Linux supports three types of interprocess communication mechanisms that first appeared in Unix™ System V (1983): message queues, semaphores, and shared memory. These System V IPC mechanisms all share common authentication methods. Processes may access these resources only by passing a unique reference identifier to the kernel via system calls. Access to these System V IPC objects is checked using access permissions, much like accesses to files are checked. The access rights to a System V IPC object are set by the creator of the object via system calls. Each mechanism uses the object's reference identifier as an index into a table of resources. It is not a straightforward index, but requires some manipulation to generate.
  • Semaphores: In its simplest form, a semaphore is a location in memory whose value can be tested and set by more than one process. The test-and-set operation is, so far as each process is concerned, uninterruptible or atomic; once started, nothing can stop it. The result of the test-and-set operation is the addition of the current value of the semaphore and the set value, which can be positive or negative. Depending on the result, one process may have to sleep until the semaphore's value is changed by another process. Semaphores can be used to implement critical regions: areas of critical code that only one process at a time should be executing.

Say you had many cooperating processes reading records from and writing records to a single data file. You would want that file access to be strictly coordinated. You could use a semaphore with an initial value of 1 and, around the file operating code, put two semaphore operations, the first to test and decrement the semaphore's value and the second to test and increment it. The first process to access the file would try to decrement the semaphore's value and it would succeed, the semaphore's value now being 0. This process can now go ahead and use the data file, but if another process wishing to use it now tries to decrement the semaphore's value, it would fail as the result would be -1. That process will be suspended until the first process has finished with the data file. When the first process has finished with the data file, it will increment the semaphore's value, making it 1 again. Now the waiting process can be woken, and this time its attempt to increment the semaphore will succeed.
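The same mutual-exclusion pattern, sketched with Java's java.util.concurrent.Semaphore as a stand-in for the System V semaphore described above (the file-access body is hypothetical):

import java.util.concurrent.Semaphore;

public class FileGuard {
    // Initial value 1: the first acquire succeeds, later ones block.
    private static final Semaphore fileSemaphore = new Semaphore(1);

    static void useDataFile() throws InterruptedException {
        fileSemaphore.acquire();      // "test and decrement"; sleeps if 0
        try {
            // ... read or write records in the shared data file ...
        } finally {
            fileSemaphore.release();  // "increment"; wakes one waiter
        }
    }
}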

  • Shared Memory: Shared memory allows one or more processes to communicate via memory that appears in all of their virtual address spaces. The pages of that virtual memory are referenced by page table entries in each of the sharing processes' page tables. The memory does not have to be at the same address in all of the processes' virtual address spaces. As with all System V IPC objects, access to shared memory areas is controlled via keys and access-rights checking. Once the memory is being shared, there are no checks on how the processes use it. They must rely on other mechanisms, for example System V semaphores, to synchronize access to the memory.

Inspired by tldp.org.

Ref: http://stackoverflow.com/a/17503877

Behind the scenes: system stack and heap

Each thread gets its own stack, while all threads share the same heap. Stack size is fixed, and a thread's stack is reclaimed when the thread ends.

Heap allocation is handled by the heap allocator, and there are no fixed rules; the heap can be expanded if the OS approves. Because the heap holds resources shared across threads, it must be thread-safe, which makes it complex to manage.

For more, refer to [Design] Stack and Heap.

[Design] Median of Array in Distributed Computers

Question

link

What is the distributed algorithm to determine the median of arrays of 1 billion integers located on different computers?

Solution

  1. Suppose you have a master node (or are able to use a consensus protocol to elect a master from among your servers).

  2. The master queries all servers for the sizes of their data sets, so that it knows the total count n and can look for the k-th largest element, where k = n/2.

  3. The master then selects a random server and queries a random element.

  4. The master broadcasts this element to each server.

  5. Each server partitions its elements into those larger than or equal to the broadcasted element and those smaller than the broadcasted element.

  6. Each server returns to the master the size of its larger-than-or-equal partition; call the sum of these sizes m.

    1. If m is greater than k, the master indicates to each server to disregard its ‘less-than’ set.

    2. If m is less than k, the master indicates to disregard the larger-than-or-equal sets and updates k = k - m.

    3. If it is exactly k, the algorithm terminates and the value returned is the pivot selected at the beginning of the iteration.

  7. Recurse, beginning by selecting a new random pivot from the remaining elements.
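A minimal sketch in Java that simulates the protocol in one process, with each inner list standing in for one server's remaining elements (the network layer is elided, and values are assumed distinct for simplicity):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class DistributedKthLargest {
    static int kthLargest(List<List<Integer>> servers, long k) {
        Random rnd = new Random();
        while (true) {
            // Master picks a random pivot from a random non-empty server.
            List<Integer> owner;
            do {
                owner = servers.get(rnd.nextInt(servers.size()));
            } while (owner.isEmpty());
            int pivot = owner.get(rnd.nextInt(owner.size()));

            // Each server partitions against the broadcast pivot and
            // reports the size of its >= partition; m is the sum.
            long m = 0;
            List<List<Integer>> ge = new ArrayList<>();
            List<List<Integer>> lt = new ArrayList<>();
            for (List<Integer> s : servers) {
                List<Integer> g = new ArrayList<>();
                List<Integer> l = new ArrayList<>();
                for (int x : s) {
                    if (x >= pivot) g.add(x); else l.add(x);
                }
                m += g.size();
                ge.add(g);
                lt.add(l);
            }

            if (m == k) {
                return pivot;   // the pivot is the k-th largest
            } else if (m > k) {
                servers = ge;   // answer lies within the >= sets
            } else {
                servers = lt;   // discard the >= sets
                k -= m;
            }
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> servers = List.of(
                new ArrayList<>(List.of(9, 1, 7)),
                new ArrayList<>(List.of(4, 8)),
                new ArrayList<>(List.of(2, 6, 3, 5)));
        // n = 9, so the median is the k = 5-th largest element (here: 5).
        System.out.println(kthLargest(servers, 5));
    }
}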

ref