Woodstock Blog

a tech blog for general algorithmic interview questions

[Design] Designing a Simple Web Crawler

1. Choose a framework

Let's assume we use Python for this.

Plain Python?

We can write a simple crawler in plain Python that follows links two levels deep and saves the second-level links to a file:

import re
import urllib.request
from urllib.parse import urljoin

def links(url):
    # Fetch a page and return every http(s) href on it, resolved to an absolute URL
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
    hrefs = re.findall(r'''href=["']([^"']+)["']''', html, re.I)
    return [u for u in (urljoin(url, h) for h in hrefs) if u.startswith("http")]

print("Enter the URL you wish to crawl, e.g. http://phocks.org/stumble/creepy/")
myurl = input("@> ")

with open("depth_1.txt", "wt") as textfile:
    for i in links(myurl):          # depth 1: links on the start page
        print(i)
        for ee in links(i):         # depth 2: links on each linked page
            print(ee)
            textfile.write(ee + "\n")

Scrapy?

  1. You only define the rules; Scrapy does the rest
  2. extensions can be plugged in easily
  3. portable: it only needs a Python runtime

Why Scrapy

Scrapy has tools to manage every stage of a web crawl. To name just a few:

  1. Requests manager - downloads pages concurrently behind the scenes, so you don’t need to invest a lot of time in concurrency yourself.

  2. Selectors - parse the HTML document (e.g. with XPath or CSS expressions); see the short spider sketch after this list.

  3. Pipelines - once the data is retrieved, a chain of functions can clean and modify it.
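Here is a minimal sketch of what such a spider could look like (the spider name, start URL and output field are assumptions, not taken from a real project):

import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"                                      # hypothetical spider name
    start_urls = ["http://phocks.org/stumble/creepy/"]

    def parse(self, response):
        # Selectors: pull every href out of the page with an XPath expression
        for href in response.xpath("//a/@href").getall():
            yield {"url": response.urljoin(href)}       # items flow on to the pipelines

Run it with something like scrapy runspider link_spider.py -o links.json, and Scrapy takes care of scheduling, concurrent downloading and output serialization.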

Following the spirit of other don’t-repeat-yourself frameworks such as Django, it makes it easier to build and scale large crawling projects by allowing developers to re-use their code.

For more, read Scrapy Architecture. The main components are:

  1. Scrapy Engine

    controls the data flow between all the other components

  2. Scheduler

    receives requests from the engine and enqueues them, feeding them back to the engine when it asks for them

  3. Downloader

    fetches web pages and hands the responses back to the engine

  4. Spiders

    user-written classes that parse responses and extract items or follow-up requests

  5. Item Pipeline

    processes items once the spiders have extracted them: cleaning, validation, persistence (a minimal pipeline sketch follows this list)

  6. Downloader middlewares

    specific hooks that sit between the Engine and the Downloader and process requests and responses

  7. Spider middlewares

    specific hooks that sit between the Engine and the Spiders and are able to process spider input (responses) and output (items and requests)
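As an illustration of the Item Pipeline stage, here is a tiny sketch of a pipeline that drops obviously bad items; the class name and field are made up, and it would be enabled through the ITEM_PIPELINES setting:

from scrapy.exceptions import DropItem

class CleanUrlPipeline:
    # Enabled in settings.py, e.g. ITEM_PIPELINES = {"myproject.pipelines.CleanUrlPipeline": 300}
    def process_item(self, item, spider):
        url = (item.get("url") or "").strip()
        if not url:
            raise DropItem("missing url")   # bad data stops here
        item["url"] = url.lower()
        return item                         # good data continues down the chain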

2. Schedule a Scrapy job

APScheduler? (todo)

add/remove jobs at runtime (sketched below)
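A minimal sketch of what this could look like, assuming the hypothetical "links" spider above lives inside a Scrapy project; the crawl runs in a subprocess because Twisted's reactor cannot be restarted inside a long-lived scheduler process:

import subprocess
from apscheduler.schedulers.blocking import BlockingScheduler

def run_crawl():
    # Launch the (hypothetical) "links" spider as its own process
    subprocess.run(["scrapy", "crawl", "links"])

sched = BlockingScheduler()
sched.add_job(run_crawl, "interval", hours=1, id="crawl-links")   # add a job
# sched.remove_job("crawl-links")                                 # ...and remove it by id later
sched.start()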

3. Choose a DB

I chose NoSQL/MongoDB. But why?

  1. there are only a few tables, each with few columns

  2. no overly complex associations between records

  3. a huge amount of time-based data

  4. scaling requirements: MongoDB offers better horizontal scaling

  5. different field names across items: dynamic, schema-less storage (see the sketch after this list)
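A small sketch of how scraped items could land in MongoDB from a Scrapy pipeline; the connection URI, database and collection names are assumptions:

import pymongo

class MongoPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")  # assumed local instance
        self.collection = self.client["crawler"]["items"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Schema-less insert: items with different field names share one collection
        self.collection.insert_one(dict(item))
        return item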

4. Technical Difficulties

4.1 Different ways to crawl

We sometimes need to inspect AJAX responses and study each website’s API.

Some sites will shut down certain APIs if they detect too many requests.
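For example, once an AJAX endpoint has been spotted in the browser’s network tab, it can often be queried directly; the endpoint and response shape below are invented purely for illustration:

import time
import requests

resp = requests.get(
    "https://www.abc.com/api/search",         # hypothetical JSON endpoint
    params={"q": "milk", "start_index": 0},
    headers={"User-Agent": "Mozilla/5.0"},    # many sites reject the default client UA
    timeout=10,
)
resp.raise_for_status()
for item in resp.json().get("results", []):   # "results" is an assumed response field
    print(item)
time.sleep(1)                                 # throttle, so the API is less likely to be closed on us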

4.2 Difficulty navigating pages

Study their URL structure, e.g.

www.abc.com/index.html?page=milk&start_index=0

and just play with the URL params!
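A sketch of that idea, generating page URLs by stepping start_index (the page size of 20 and the number of pages are assumptions):

# Walk the result pages by incrementing start_index in steps of the assumed page size
base = "http://www.abc.com/index.html?page=milk&start_index={}"
page_size = 20
urls = [base.format(i) for i in range(0, 10 * page_size, page_size)]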

4.3 What is the key?

I defined an extra column just to store the key (combining a few key columns and converting them to lower case).

We could search with a case-insensitive regex instead, but:

Mongo (current version 2.0.0) doesn’t allow case-insensitive searches against indexed fields. For non-indexed fields, the regex search should be fine.

How to go about it:

Searching with a case-insensitive regex means that MongoDB cannot use an index, so queries against large datasets can take a long time.

Even with small datasets, it’s not very efficient… which could become an issue if you are trying to achieve scale.

As an alternative, you can store an uppercase copy and search against that…

If your field is large, such as a message body, duplicating data is probably not a good option. I believe using an external indexer like Apache Lucene is the best option in that case.
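A sketch of the lower-cased key approach with pymongo (the field names, key format and connection details are all illustrative):

from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://localhost:27017")["crawler"]   # assumed local instance

def make_key(brand, product):
    # Combine a few key columns and convert to lower case
    return f"{brand}|{product}".strip().lower()

db.items.create_index([("key", ASCENDING)])                # exact matches on "key" can use this index
doc = {"brand": "Nestle", "product": "Milk Powder"}
doc["key"] = make_key(doc["brand"], doc["product"])
db.items.update_one({"key": doc["key"]}, {"$set": doc}, upsert=True)

# Case-insensitive lookup without a regex: normalize the query the same way
print(db.items.find_one({"key": make_key("NESTLE", "milk powder")}))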

4.4 A lot of bad data

  1. write a sophisticated pipeline

  2. better: try not to let bad data reach the pipeline in the first place

Make your spider better! A sketch of that second approach follows.
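Here the spider skips obviously bad rows before they ever become items; the CSS selectors and field names are invented for illustration:

def parse(self, response):
    for row in response.css("table.products tr"):
        name = row.css("td.name::text").get(default="").strip()
        price = row.css("td.price::text").get(default="").strip()
        # Drop obviously bad rows here rather than cleaning them up in a pipeline later
        if not name or not price.replace(".", "", 1).isdigit():
            continue
        yield {"name": name, "price": float(price)}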

4.5 NLP: brand names

How? (todo)