Category Archives: RSNA

GoogleMIRC

It has been a year and a couple of months since GoogleMIRC was shown at RSNA. GoogleMIRC was a radiology vertical search engine that served as a research project; incidentally, it was my last research project at the Baltimore VA. This post takes the place of an article that I wrote and rewrote but never thought was really any good. I originally intended to publish it in Radiographics, which is obviously never going to happen now. Fortunately I can write much more informally here and tell you the story of GoogleMIRC.

Before we go any further I want to acknowledge the other participants who made this project possible. None of it would have happened without Khan Siddiqui. We came up with the idea together in his office while discussing some of the limitations of RSNA’s MIRC project, and he worked with me every step of the way. I want to thank Paul Wheeler, currently at Positronic, who helped out with a couple of crucial fixes, including speeding up the search algorithm and balancing the URLs that were sent to the crawler. Thanks also to Eliot Siegel, whose expectations we constantly tried to exceed, and to the rest of the group, including Woojin Kim, Nabile Safdar, Bill Boonn and Krishna Juluru. Finally, thanks must be offered to everyone whose web server I abused for this project, particularly the University of Alabama teaching file.

Originally GoogleMIRC was conceived simply as a replacement for the search functionality in MIRC. Khan and I came up with the idea during one of our late afternoon discussions; every afternoon we had an ice cream break, usually around 4:30 or 5, and discussed interesting things. We talked about adding a summary to each search result, the way Google does, where MIRC showed only the title of the case. Also, at the time (I don’t know if it is still true) MIRC provided little to no relevance ranking; results were partitioned by which server they came from, which is really not what the user is looking for. So with that I set out to study search technology. It was a good thing that none of us had any idea what we were getting into. This was at the end of January 2006.

The project quickly expanded into covering as many teaching files as possible. We wanted to provide radiologists with a tool that added value in clinical practice. We judged that radiologists would want to quickly access content that was radiology specific; after all, the radiologist wants information immediately and in a form that helps him do his job better. An article about the disease in Nature is not particularly useful at the time of diagnosis, no matter how interesting it might be.

I spent the next two months reading and researching search technology. There is a plethora of books, articles and other resources on the topic, and my interest in technology, which had been waning, was definitely recharged. After beginning to understand some of the problems involved (which are immense) I built the first test crawler. It was quite limited, being non-distributed, and it was very impolite: it ignored robots.txt files and tended to hammer servers because it did not throttle requests. I learned a great many important lessons, though, about how a web crawler works and how to process HTML data.

Processing HTML data is very nontrivial. First, thanks to browsers being very forgiving of web designers, the HTML you download is often broken: missing tags, unclosed tags, things that start and stop suddenly. Many hours were spent in the debugger and on a module to clean the incoming HTML and prepare it for processing. The decisions Netscape made back in the mid 90’s still haunt us today in the form of poorly written HTML. Commercial search engines such as Google and Yahoo do much more with the HTML, including judging a word’s importance by its location in the document and how large its font is relative to other words.
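To give a concrete flavor of that cleanup step, here is a minimal sketch of pulling indexable text out of messy HTML. It uses HtmlAgilityPack, a forgiving .NET parser, purely for illustration; GoogleMIRC used its own hand-built cleaning module, and the class and method names below are my own.

    // Minimal sketch only: extract plain text from broken HTML using a
    // tolerant parser. This is NOT the GoogleMIRC cleaning module.
    using System.Linq;
    using System.Net;
    using HtmlAgilityPack;

    static class HtmlTextExtractor
    {
        public static string ExtractText(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);   // tolerates missing and unclosed tags

            // Script and style blocks carry no indexable text, so drop them.
            var junk = doc.DocumentNode.SelectNodes("//script|//style");
            if (junk != null)
                foreach (var node in junk.ToList())
                    node.Remove();

            // InnerText flattens whatever markup is left into plain text.
            return WebUtility.HtmlDecode(doc.DocumentNode.InnerText);
        }
    }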

So the first crawler was built in April, and by early May I had decided to do away with it completely. I had never really intended it as the final version, and it had become a huge mess as I added features. The new crawler was a distributed crawler with a central controller and services running on different computers that downloaded the pages. It throttled its requests to specific hosts, contacting a remote computer no more than once every 30 seconds. How did the crawl work? Basically I used Radiology Education to seed the crawler with about 400 URLs. Big sites that were not really relevant, such as Google, Microsoft, and Flickr, were removed by hand. From there the crawler went out and crawled every site it found.
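The throttling logic is simple in principle: remember when each host was last contacted and refuse to touch it again inside the politeness window. The sketch below illustrates that idea under invented names; it is not the actual crawler code.

    // Minimal sketch of per-host politeness: a URL is only released to a
    // downloader if its host has not been contacted in the last 30 seconds.
    // Illustrative names only; not the actual GoogleMIRC classes.
    using System;
    using System.Collections.Generic;

    class HostThrottle
    {
        private readonly TimeSpan _delay = TimeSpan.FromSeconds(30);
        private readonly Dictionary<string, DateTime> _lastContact =
            new Dictionary<string, DateTime>();

        public bool TryAcquire(Uri url)
        {
            string host = url.Host;
            DateTime last;
            if (_lastContact.TryGetValue(host, out last) &&
                DateTime.UtcNow - last < _delay)
                return false;          // too soon; requeue the URL for later

            _lastContact[host] = DateTime.UtcNow;
            return true;
        }
    }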

By June the crawler was fully functional, with a whitelist/blacklist, throttling, a new URL extractor, and code to recrawl a page a couple of times in the event of an error. The crawler at this point was much improved over what it had been and existed in basically this form for the duration of the project. I also implemented a special component of the crawler for retrieving data from sites running RSNA MIRC software: since there was a cap on the results returned to the user, I implemented a paging system that allowed the crawler to retrieve all the results.
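As an example of the error handling, recrawling a page a couple of times before giving up can be as simple as a small retry loop around the download. Again, this is a sketch under my own naming, not the original code.

    // Minimal sketch of "recrawl on error": retry a failed download a few
    // times with a pause between attempts before giving up.
    using System;
    using System.Net;
    using System.Threading;

    static class PageFetcher
    {
        public static string FetchWithRetry(Uri url, int maxAttempts = 3)
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    using (var client = new WebClient())
                        return client.DownloadString(url);
                }
                catch (WebException)
                {
                    if (attempt >= maxAttempts)
                        throw;                              // give up after a few tries
                    Thread.Sleep(TimeSpan.FromSeconds(10)); // pause, then recrawl
                }
            }
        }
    }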

In June I started working seriously on indexing. I built an inverted index to allow the text to be searched and computed PageRank for the currently known graph of URLs. The PageRank computation was handled as described in Larry and Sergey’s original paper, on a single machine, and each iteration took several hours to run. I was able to get convergence at around 10 iterations, which is consistent with the literature. This was actually a bit more work than these words do justice to. I also began to work on document classification with a Bayesian classifier. The classifier used teaching files from a commercial DVD as training documents, with common words removed, and it allowed us to determine whether a page was related to radiology from its content. I will note here that this was a very primitive attempt; with the data we had I could have incorporated a variety of other signals, such as the content of pages that linked to it or that it linked to.
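For readers who have not seen it, the PageRank computation is just repeated redistribution of rank along links. Below is a minimal in-memory sketch of the power iteration described in the original paper; the real computation ran against the URL database and took far longer, and the dangling-page handling here is deliberately simplified.

    // Minimal sketch of the PageRank power iteration. links[i] holds the
    // indices of pages that page i links to. Simplified: dangling pages
    // (no outlinks) simply keep their rank out of circulation.
    using System.Collections.Generic;
    using System.Linq;

    static class PageRank
    {
        public static double[] Compute(List<int>[] links,
                                       double damping = 0.85,
                                       int iterations = 10)
        {
            int n = links.Length;
            double[] rank = Enumerable.Repeat(1.0 / n, n).ToArray();

            for (int iter = 0; iter < iterations; iter++)
            {
                double[] next = Enumerable.Repeat((1.0 - damping) / n, n).ToArray();
                for (int i = 0; i < n; i++)
                {
                    if (links[i].Count == 0)
                        continue;
                    double share = damping * rank[i] / links[i].Count;
                    foreach (int target in links[i])
                        next[target] += share;
                }
                rank = next;
            }
            return rank;
        }
    }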

July and August were spent working on various analysis projects as well as building a search algorithm. I used the vector space model because of its simplicity, even though it tends to be biased toward shorter documents. In July I had a completely working version, although it was still far short of where I wanted it to be. I built a stemmer using Porter stemming and built in support for both go words and stop words. Stemming reduces words to their root so that radiologist and radiologists both appear in a search for radiologist. Go words are never stemmed, and stop words are words that are not indexed at all: common words such as a, an, of, and so on.
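To make the vector space model concrete: each document (and the query) becomes a bag of weighted terms, and documents are ranked by the cosine of the angle between their vector and the query vector, which is also where the bias toward short documents creeps in. The sketch below uses plain TF-IDF weights and invented names, and it omits the stemming and stop word handling described above.

    // Minimal sketch of vector space scoring: TF-IDF term vectors compared
    // by cosine similarity. Stemming, go words and stop words are omitted.
    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class VectorSpaceModel
    {
        public static Dictionary<string, double> ToVector(
            IEnumerable<string> terms, Dictionary<string, double> idf)
        {
            var vector = new Dictionary<string, double>();
            foreach (var term in terms)                     // raw term frequency
                vector[term] = vector.TryGetValue(term, out double tf) ? tf + 1 : 1;
            foreach (var term in vector.Keys.ToList())      // weight by IDF
                vector[term] *= idf.TryGetValue(term, out double w) ? w : 0;
            return vector;
        }

        public static double Cosine(Dictionary<string, double> a,
                                    Dictionary<string, double> b)
        {
            double dot = a.Where(kv => b.ContainsKey(kv.Key))
                          .Sum(kv => kv.Value * b[kv.Key]);
            double normA = Math.Sqrt(a.Values.Sum(x => x * x));
            double normB = Math.Sqrt(b.Values.Sum(x => x * x));
            return normA == 0 || normB == 0 ? 0 : dot / (normA * normB);
        }
    }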

At the end of August I decided to leave the VA to commercialize a vertical search engine on the web for radiologists. When I left at the end of September we were in fairly good shape for RSNA, though there was still a scramble to polish it for the meeting, and it never really reached the point I wanted it to.

We found many interesting things. One was how bad misspelling is on the Internet, even in commercial teaching files; several that we relied on had definitely not been run through a spell checker. The crawler was the best working part of the whole system. It was able to sustain about 2 Mbps of traffic and download millions of pages. Further work would have been needed to make it scale, including partitioning the URL database and allowing multiple crawl managers to work on different lists of URLs, but it was powerful enough to crawl through the radiology portion of the web. One of the reasons this does not really make a good scientific article is the lack of measurable data: we did not collect data on radiologist satisfaction with GoogleMIRC, and we did not measure recall and precision, two traditional measures of search engine quality.

The project had a number of limitations. First was my own choice of technology. I am a heavy .NET user and I implemented GoogleMIRC in .NET, which was not a bad decision. However, I also decided to use SQL Server 2005 as the data store, a very poor decision whose ramifications I did not understand at the time. It did save a lot of developer time, which I judged to be more valuable for the purposes of the project since I was the only person programming on it. I wish I had known about Lucene at the time and used the .NET port of it; that would have saved a tremendous amount of time building the index and search algorithm, probably led to better results, and allowed more features, like thumbnails. Furthermore, I wish I had known about Nutch and Hadoop. When I found them about a year ago I kicked myself. Nutch is an open source search engine built in Java, and Hadoop is a distributed computing platform that replicates Google’s infrastructure. Building in Java might also have been wiser because of the number of open source mathematical libraries for tasks such as singular value decomposition, a crucial piece of a technique called latent semantic indexing.

Most limitations really centered on the fact that there was only one developer on the project. It is crazy to try to build a search engine yourself; there are a lot of moving pieces. It is especially challenging if you really want to make it scale, since many techniques that work on one machine will not work across multiple machines.

I personally got a tremendous amount out of the project. Since I used SQL Server and built my own index and search algorithms, I gained a solid understanding of the issues involved, and I now know how to build a crawler that scales reasonably well. Working on a project like this gives you a newfound appreciation of the scale of the web. I also tried lots of things that did not work out at the time, such as singular value decomposition for finding common concepts in documents, which I have since gotten to work.

What comes after? Yottalook builds on many of these ideas and leverages Google’s custom search technology. I have not stopped working on search and hope to publicly show what I have been working on this year.


What’s an EMR?

Neil Versel has a conversation with a sales rep at RSNA about what an EMR is. Really quite scary.


More GoogleMIRC Links

http://www.health-itworld.com/newsitems/2006/dec/12-01-06-rsna

http://www.diagnosticimaging.com/showArticle.jhtml?articleID=196600932


What was cool at RSNA

There was only one thing I saw at RSNA this year that was cool and innovative: the next-generation software from TeraRecon, called iNtuition. TeraRecon has always had a great product in their AquariusNET software, which provides 3D visualization of medical images using server-side rendering. I will talk about this in detail some other time.

The new features were great. Automatic anatomy labeling. Whoa. The software can now pick out anatomy, which has huge implications for the future since it will open up many new applications. iNtuition also provides good workflow, which the previous software did not. The beginnings of what could be the first 3D-native PACS may be emerging.

All in all this is a huge step forward. It was also the only really cool and new thing shown, which was disappointing.


RSNA Writeup

This year’s RSNA was awesome for me. I had a great time and met lots of great people. I was gone for a week and had a blast in Chicago; it is such a great city to visit. From Saturday until Wednesday we also had absolutely great weather, not windy or cold, and I was fine at night in just my suit jacket.

I flew in at 10:30 AM on Saturday, the day before the convention opened, and was at the convention center around 12:30, setting up my exhibits in InfoRad, now called Informatics. I had 4 exhibits to set up. After dealing with a server that at first would not boot and a portable hard drive that died, we were mostly up and running. I will note that I had to convert all the hard drives on the computers rented for us from FAT32 to NTFS, which I found absolutely outrageous. Who in the year 2006 does not run NTFS on Windows computers?

Saturday night David Channin, who leads the informatics group at Northwestern, hosted an event at his house for people interested in imaging-related topics. A lot of great people came, including Kurt Langlotz, Dan Rubin, and Ben Johnson. From our group we brought Khan Siddiqui, Woojin Kim, Bill Boonn and Nabile Safdar. It was a great group of people and we had lots of great conversation.

On Sunday the meeting opened. I spent the first 2 hours setting up the last exhibit. The exhibits that I showed were:

  • GoogleMIRC – A search engine for web content that only indexes radiology content
  • Performance Dexa – A report creator for bone densitometry
  • Lightweight MIRC Authoring Tools – Tools for the RSNA’s MIRC project that are simple to use
  • Web2Mirc – A tool for converting web based teaching files into MIRC documents

GoogleMIRC and Lightweight MIRC were both invited to Radiographics, and Lightweight MIRC won a certificate of merit.

Sunday was a busy day. I went out to dinner with TeraRecon that night at Nick’s Fish House.

Monday was more of the same. Up early. I had my talk at 11 AM and it was really well received. Paul Chang and Dave Avrin chaired my session, and I was privileged to be presenting with other notables including Dan Rubin, Chuck Khan, and Reuben Mezrich. My talk was covered by the press, which was exciting; the photographer was there to take pictures for the article that would appear in the conference daily newspaper the following morning. I had been interviewed the previous week for the article. After my talk I had a great time talking to Paul Chang, whom I had never had a chance to sit down with before.

Monday night I went to the GE party at the House of Blues. I stopped into the Emageon event at ESPN Zone, but there was not a lot going on there, so I went back to the GE party and spent some time with people I know, including Mark Morita and Jeff Whipple. Then I went down to the John Hancock building to hang out with some people from TeraRecon and closed out the night at a bar with a live band.

Tuesday I got up early to get my copy of the paper and see myself in it. I was there, and it was really exciting. I spent most of the day at my search engine exhibit, and lots of people stopped by to see it. The most common question was when it would be available; availability will be announced here first. I met a lot of fantastic people. Tuesday was also supposed to be the blogger meetup, but the NetWorks bar was packed and I only found Dalai, though I really had a good time talking to him. I hung out that night at the W Hotel, which has a great bar on top of it.

Wednesday I spent a lot of time in the vendor hall. There were a lot fewer people there than on the previous days, so it was easier to talk to vendors. I had dinner that night with Ed Parker and Karen from Evolve Technologies, who are also great people to work with.

Thursday was the last main day of the conference and I had a full day with lots of meetings. That night we went out with GE, including Mark Morita and Prakash Mahesh (one of the company’s software architects), first for pizza and then bowling. The guys at GE are old friends and I love hanging out with them.

Friday I packed up my exhibits and managed to miss my flight thanks to the snow in Chicago. I spent the night there and flew home on Saturday.

I will write in more detail about specific topics such as what was hot and what was not in future posts.


RSNA

RSNA has been crazy. I am in the RSNA Daily Bulletin today, and my work on search in radiology has been really well received. I will write a lot more when/if things slow down. Looking forward to the blogger event tonight.

MyPACS.Net and GE

Aunt Minnie has a brief article about MyPACS.Net (a teaching file) being integrated into GE Centricity. GE is also part of the IHE TCE demo at RSNA, showing how any system can be used to author a teaching file. I recommend that you stop by and check it out. I was involved in the project, so this is also shameless self-promotion.


Bloggers at RSNA 2006

Tim Gee has the details of a blogger meetup at RSNA. Hope to see more people come.
