It has been a year and a couple of months since GoogleMIRC was shown at RSNA. GoogleMIRC was a radiology vertical search engine that served as a research project. Incidentally it was my last research project at the Baltimore VA. This post will be in place of an article that I have written and rewritten but never thought was really any good. I originally intended to publish an article in Radiographics. This is obviously never going to happen now. Fortunately I can write much more informally and tell you the story of GoogleMIRC.
Before we go any farther I want to acknowledge the other participants in the project who made it possible. None of this would have been possible without Khan Siddiqui. We came up with the idea together in his office while discussing some of the limitations of RSNA’s MIRC project. He worked with me to make it all possible. I want to thank Paul Wheeler currently at Positronic who helped out with a couple of crucial fixes including speeding up the search algorithm and balancing the urls that were sent to the crawler. Also Eliot Siegel whose expectations we constantly tried to exceed. Also thanks to the rest of the group including Woojin Kim, Nabile Safdar, Bill Boonn and Krishna Juluru. Additionally thanks must be offered to everyone whose web server I abused for this project, particularly the University of Alabama teaching file.
Originally GoogleMIRC was conceived as an idea to simply replace the search functionality in MIRC. Khan and I came up with the idea during one of our late afternoon discussions. Every afternoon we had an ice cream break, usually around 4:30 or 5 and discussed interesting things. We discussed adding simply a summary to the search results like google has for each result. MIRC simply showed the title of the case. Also at the time (I don’t know if it is still true) MIRC provided little to no relevance ranking for results. The results were partitioned by which server they came from which is really not what the user is looking for. So with that I set out study search technology. It was a good thing that none of us had any idea what we were getting into. This occurred at the end of January 2006.
The project quickly expanded into covering as many teaching files as possible. We wanted to provide radiologists with a tool that they could use in clinical practice that added value. We judged that radiologists would want to be able to quickly access content that was radiology specific. After all the radiologist wants his information immediately and in a form that allows him to better perform his job. An article about the disease in nature is not particularly useful at the time of diagnosis, not matter how interesting it might be.
I spent the next two months reading and researching search technology. There are a plethora of books, articles and other resources on the topic. My interest in technology which had been waning was definitely recharged. After beginning to understand some of the problems involved (which are immense) I built the first test crawler. It was quite limited being non distributed. It was very impolite since it ignored robots.txt files and tended to hammer servers since it did not throttle requests. I learned a great many important lessons though about how a web crawler works and how to process HTML data.
The processing of HTML data is very nontrivial. First thanks to browsers being very forgiving of web designers HTML that is downloaded is often broken. Missing tags. Unclosed tags. Things that start and stop suddenly. There were many hours spent in the debugger and adding a module to clean the incoming HTML and prepare it for processing. The decision that Netscape made back in the mid 90′s still haunts us today with poorly written HTML. Commercial search engines such as Google and Yahoo do much more with the HTML data including determining word importance by its location in the document and how large the font is relative to other words in the document.
So the first crawler was built in April and by early May I had decided to completely do away with it. I had never really intended it as the final version and it had become a huge mess as I had added features. The new crawler was a distributed crawler with a central controller and services running on different computers that downloaded the pages. It throttled its requests to specific hosts, contacting a remote computer no more than once every 30 seconds. How did the crawl work? Basically I used Radiology Education to seed the crawler with about 400 URLs. Big sites that were not really relevant such as google, microsoft, and flickr were removed by hand. From there the crawl went out and crawled all sites that it found.
By June we crawler was fully functional with a plethora of features such as Whitelist/Blacklist, throttling, a new URL extractor, and code to recrawl a page a couple of times in the event of an error. The crawler at this point was very much improved over what it had been and existed in basically this form for the duration of the project. I also implemented a special component of the crawler for retrieving data from sites running RSNA MIRC software. Since there was a cap on the results that were returned to the user I implemented a paging system that allowed the crawler to retrieve all the results.
In June I started seriously working on indexing. I built an inverted index to allow the text to be searched. I computed PageRank for the currently known graph of urls. The PageRank computation was handled as was described in Larry and Sergey’s original paper, using a single machine and the computation took several hours to run each iteration. I was able to get convergence at around 10 iterations which is consistent with the literature. This was actually a bit more work than these words do justice to. I also began to work on document classification with a Bayesian classifier. The classifier used teaching files from a commercial DVD as training documents. Common words were removed. This classifier did allow us to determine if a page was related to radiology or not by its content. I will note here that this is a very primitive attempt. Using the data we had I could have incorporated a variety of other information into the algorithm such as content on pages that linked to it or that it linked to.
July and August were spent working on various analysis projects as well as building a search algorithm. I used the Vector Space Model because of its simplicity even though it tends to be biased toward shorter documents. In July I had a completely working version although it was still far short of where I wanted it to be. I built a stemmer using porter stemming and built in support for both go and stop words. Stemming reduces words to their root so that radiologist and radiologists would both appear in a search for radiologist. Go words are never stemmed and stop words are words that are not indexed. Stop words are common words such as a, an, of, etc…
At the end of August I decided to leave the VA for the purposes of commercializing a vertical search engine on the web for radiologists. When I left at the end of September we were in fairly good shape for RSNA. There was still a scramble to polish it for RSNA. It never really reached the point that I wanted it to.
There were many interesting things we found. One was how bad misspelling is on the Internet and even on commercial teaching files. Several that were utilized for various things were definitely not run through spell checker. The crawler was the best working part of the whole system. It was able to sustain about 2 Mbps of traffic and download millions of pages. Further work would be need to make it scale which would include partitioning the URL database and allowing multiple crawl managers to work on different lists of URLs. The crawler was powerful enough to crawl through the radiology portion of the web. One of the reasons that this does not really make a good scientific article is the lack of measurable data. We did not collect data on radiologist satisfaction with GoogleMIRC. We did not measure recall and precision, two traditional measures of search engine quality.
The project had a number of limitations. First was my own choice of technology. I am a heavy .net user and I implemented GoogleMIRC in .net. That was not a bad decision. However I decided to use SQL Server 2005 as the data store. This was a very poor decision that I did not understand the ramifications of at the time. It did have a lot of developer time which I judged to be more valuable for the purposes of the project since I was the only person programming on it. I wish I have known about Lucene at the time and used the .net port of it. That would have saved a tremendous amount of time on building the index and search algorithm and probably led to better results. There definitely would have been more features, like thumbnails. Further more I which I had known about Nutch and Hadoop. When I found them about a year ago I kicked myself. Nutch is an open source search engine built in Java. Hadoop is a distributed computing platform that replicates Google’s infrastructure. Building in Java may have been wiser due to the amount of mathematical open source libraries to perform tasks such as singular value decomposition, a crucial piece of a technique called latent semantic indexing.
Most limitations really centered around the fact that there was only one developer on the project. It is crazy to try to build a search engine yourself. There are a lot of moving moving pieces. It is actually on challenging if you really want to make it scale up since many techniques that work on one machine will not work across multiple machines.
I personally got a tremendous amount out of the project. For instance since I used SQL Server and built my own index and search algorithms I gained a solid understanding of the issues there. I know how to build a crawler that scales reasonably well. Working on a project like this you gain a knew found understanding of the scale of the web. I tried lots of things that did not work out at the time such as singular value decomposition for finding common concepts in documents that I have since gotten to work.
What comes after? Yottalook builds on many of the idea and leverages Google’s custom search technology. I have not stopped working on search and hope to publicly show what I have been working on this year.