How to detect duplicate documents in billions of URLs

Problem You have a billion URLs, where each points to a huge page. How do you detect the duplicate documents? Solution Characteristics of a page: pages are huge, so bringing all of them into memory is costly. We need a shorter in-memory representation of each page, and a hash is an obvious choice. The hash can depend on the links inside the page, the title, the creation date, and so on. With billions of URLs, we don't want to compare every page with every other page, since that would be O(n^2). [Read More]
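The hashing idea above can be sketched as follows. This is a minimal illustration, not a production pipeline: it assumes page contents are already fetched as strings, and it uses a full-content SHA-256 digest as the fingerprint (a real system might hash only stable features such as the title and outbound links).

```python
import hashlib

def page_fingerprint(content: str) -> str:
    # Compact, fixed-size stand-in for the huge page.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def find_duplicates(pages):
    """pages: iterable of (url, content) pairs.
    Yields (duplicate_url, first_url_with_same_content)."""
    seen = {}  # fingerprint -> first URL observed with it
    for url, content in pages:
        fp = page_fingerprint(content)
        if fp in seen:
            yield url, seen[fp]
        else:
            seen[fp] = url

pages = [
    ("a.com/1", "hello world"),
    ("b.com/2", "unique text"),
    ("c.com/3", "hello world"),
]
dups = list(find_duplicates(pages))
print(dups)  # [('c.com/3', 'a.com/1')]
```

Note that memory now scales with the number of fingerprints, not with page sizes; at billions of URLs the `seen` map itself would be sharded across machines, e.g. by fingerprint prefix.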

Design a database with terabytes of data that supports efficient range queries

Problem You have to design a database that can store terabytes of data. It should support efficient range queries. How would you do it? Solution This brings a couple of points to mind. I believe the interviewer was expecting a data structure like a hash table, B-tree, or B+ tree. If we switch to a distributed-computing solution, especially when "100 billion records" are involved, I would suggest looking into Distributed Hash Tables and MapReduce (for parallel query processing). [Read More]
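To make the range-query requirement concrete, here is a toy in-memory sketch of the property a B+ tree provides: keys kept in sorted order so a range scan costs O(log n) to find the start plus O(m) for the m matching records. The class name and API below are illustrative assumptions; a real terabyte-scale database would use a disk-resident B+ tree, not a sorted Python list.

```python
import bisect

class RangeIndex:
    """Toy sorted index standing in for a B+ tree's leaf-level order."""

    def __init__(self, items):
        self._items = sorted(items)              # (key, value) pairs by key
        self._keys = [k for k, _ in self._items]  # parallel key list for bisect

    def range_query(self, lo, hi):
        """Return all values with lo <= key <= hi."""
        start = bisect.bisect_left(self._keys, lo)   # first key >= lo
        end = bisect.bisect_right(self._keys, hi)    # first key > hi
        return [v for _, v in self._items[start:end]]

idx = RangeIndex([(5, "e"), (1, "a"), (3, "c"), (9, "i")])
print(idx.range_query(2, 6))  # ['c', 'e']
```

A hash table alone would fail this requirement: it answers point lookups in O(1) but scatters adjacent keys, so a range query degenerates into a full scan.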

How to avoid getting into infinite loops when designing a web crawler

Problem If you were designing a web crawler, how would you avoid getting into infinite loops? Solution First, how does the crawler get into a loop? The answer is simple: when we re-parse an already-parsed page. As answered by Lirik (per Fr0zenFyr's comment): if one uses the AOPIC algorithm for selecting pages, then it's fairly easy to avoid bot traps of the infinite-loop kind. Here is a summary of how AOPIC works:

1. Get a set of N seed pages.
2. Allocate X amount of credit to each page, such that each page has X/N credit (i.e. an equal amount) before crawling starts.
3. Select the page P with the highest amount of credit (if all pages have the same amount, crawl a random page).
4. Crawl page P (let's say P had 100 credits when it was crawled).
5. Extract all the links from page P (let's say there are 10 of them).
6. Set P's credits to 0.
7. Take a 10% "tax" and allocate it to a Lambda page.
8. Allocate an equal share of P's remaining credit to each link found on P: (100 credits - 10% tax) / 10 links = 9 credits per link.
9. Repeat from step 3.

Since the Lambda page continuously collects tax, eventually it becomes the page with the largest amount of credit and we have to "crawl" it. "Crawl" is in quotes because we don't actually make an HTTP request for the Lambda page; we just take its credits and distribute them equally to all of the pages in our database. Since bot traps only give credits to their internal links and rarely receive credit from the outside, they continually leak credits (through taxation) to the Lambda page. The Lambda page distributes those credits evenly to all pages in the database, so on each cycle a bot-trap page loses more and more credit, until it has so little that it almost never gets crawled again. This does not happen to good pages, because they often receive credits from back-links on other pages. This also yields a dynamic page rank: any time you take a snapshot of your database and order the pages by the amount of credit they have, they will most likely be ordered roughly according to their true page rank. This only avoids bot traps of the infinite-loop kind; there are many other bot traps to watch out for, and there are ways around those too. If you want a detailed answer, take a look at section 3.8 of this paper, which describes the URL-seen test of a modern crawler: [Read More]
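The credit scheme above can be sketched in a few lines. This is a simulation under simplifying assumptions, not a crawler: `links_of` is a hypothetical callback returning a page's outlinks (here the whole link graph is known up front), and "crawling" just moves credit around.

```python
def aopic_crawl(seeds, links_of, steps=100, tax=0.10):
    """Simulate AOPIC credit flow. seeds: starting pages;
    links_of(page) -> list of pages it links to."""
    credits = {p: 1.0 / len(seeds) for p in seeds}
    lambda_credit = 0.0
    for _ in range(steps):
        if lambda_credit >= max(credits.values()):
            # "Crawl" the Lambda page: redistribute its credit evenly.
            share = lambda_credit / len(credits)
            for p in credits:
                credits[p] += share
            lambda_credit = 0.0
            continue
        page = max(credits, key=credits.get)  # highest-credit page next
        budget = credits[page]
        credits[page] = 0.0
        lambda_credit += budget * tax         # 10% tax to Lambda
        outlinks = links_of(page)
        if outlinks:
            share = budget * (1 - tax) / len(outlinks)
            for q in outlinks:                # remainder split over outlinks
                credits[q] = credits.get(q, 0.0) + share
    return credits

# Two good pages linking to each other, plus a self-linking bot trap.
graph = {"a": ["b", "trap"], "b": ["a"], "trap": ["trap"]}
credits = aopic_crawl(list(graph), graph.__getitem__, steps=50)
```

Because the trap only links to itself, it pays tax on every crawl but never gains credit from back-links, so over time its priority decays exactly as the prose describes.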

Find duplicate elements in the array with 4KB memory only

Problem You have an array with all the numbers from 1 to N, where N is at most 32,000. The array may have duplicate entries and you do not know what N is. With only 4 KB of memory available, how would you print all duplicate elements in the array? Solution 4 KB = 32,768 bits, which is enough for one bit per possible value. We use a bit vector in which bit i records whether we have seen the number i: the first time we see i we set its bit to 1; if the bit is already 1, i is a duplicate and we print it. (A byte per number would need 32,000 bytes, well over the 4 KB budget, which is why we use bits.) [Read More]
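A minimal sketch of the bit-vector approach, using a `bytearray` and manual bit indexing (function name and the choice to return rather than print are mine, for illustration):

```python
def find_duplicates_bitvector(numbers, max_n=32000):
    """Report each value the second (and any later) time it appears.
    32,000 bits fit in 4,000 bytes, within the 4 KB budget."""
    bit_vector = bytearray(max_n // 8 + 1)
    dups = []
    for num in numbers:
        byte, bit = num // 8, num % 8
        if bit_vector[byte] & (1 << bit):
            dups.append(num)            # bit already set: duplicate
        else:
            bit_vector[byte] |= 1 << bit  # first sighting: mark it
    return dups

print(find_duplicates_bitvector([1, 2, 3, 2, 5, 1]))  # [2, 1]
```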

Generate an integer which is not in a file with 4 billion integers, limited memory

Problem Given an input file with four billion integers, provide an algorithm to generate an integer which is not contained in the file. Assume you have 1 GB of memory. FOLLOW UP What if you have only 10 MB of memory? Solution Case 1 - If very large integers are allowed, then one can generate a number that is statistically likely to be unique in O(1) time. For example, a pseudo-random 128-bit integer such as a GUID will collide with one of the existing four billion integers in fewer than one out of every 64 billion billion billion cases. [Read More]
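For the deterministic 1 GB case, the standard approach is a bit vector with one bit per possible 32-bit integer: 2^32 bits = 512 MB, which fits in 1 GB. The sketch below (names and the tiny demo universe are mine; a real run would use `universe_bits=32` and stream the file) marks each value seen, then scans for the first unset bit.

```python
def find_missing(values, universe_bits=32):
    """Return the smallest integer in [0, 2**universe_bits) not in values,
    or None if every value is present."""
    n = 1 << universe_bits
    bits = bytearray(max(n // 8, 1))      # one bit per candidate integer
    for v in values:
        bits[v // 8] |= 1 << (v % 8)      # mark v as seen
    for i in range(n):
        if not (bits[i // 8] & (1 << (i % 8))):
            return i                      # first unmarked integer
    return None

# Tiny universe (2**3 = 8 values) so the demo is instant.
print(find_missing([0, 1, 2, 4, 5], universe_bits=3))  # 3
```

The 10 MB follow-up uses the same idea in two passes: first count values per fixed-size bucket to find a bucket with a missing element, then bit-map only that bucket.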