Category Archives: Technology

Talk on High Frequency Trading

James Thomas from Headlands Technologies did an interview on a podcast earlier this year. If you are involved in or follow the HFT industry it will be of limited interest, but if you are tech savvy and want to hear an industry insider talk about the field, it is worth a listen. Coincidentally, Headlands has an office located just a few blocks from our offices in San Francisco.

Paper on using OCaml for Trading

If you have not read the excellent paper by Yaron Minsky and Stephen Weeks titled Caml Trading – Experiences with Functional Programming on Wall Street, I suggest you do. It covers why Jane Street picked OCaml as their primary language. Many of the reasons relate to code safety.

Many of the same reasons drove Alpha Heavy Industries to pick Haskell as its primary development platform. I will publish in the future what I see as some of the shortcomings of that platform.

On to eBay

So, as has been well documented on the web, Positronic has been acquired by eBay. I spent last week on campus at eBay learning about every aspect of the company, from how the business is run and how buyer and seller needs are balanced to the technology. I will be relocating to the Bay Area next month along with the rest of Positronic. I am very excited about this opportunity to work on search technology for an extremely large scale service. For eBayers, I will be running an internal blog. I will post the address when it is up. It will be updated far more frequently than this blog ever will be.

The last 9 months have seen my abilities grow tremendously. Possibly the biggest influence on me has been Nathan Howell, formerly of the spam team at Microsoft. I have been fully embracing asynchronous programming and functional programming (F#). Nathan has also taught me a lot about building extremely scalable systems. I wish I could dive into the details of some of our projects, but alas I cannot.

On my own time I have been working on scratching one of my itches. I hate it when blogs talk about the same things, particularly tech blogs. Who really needs to read about Cuil from 10 different sources? I am a huge fan of figuring out how to digest the largest amount of information possible. Part of that is eliminating duplicate information to make the best use of my time. To that end I am working on a project that uses clustering techniques to eliminate duplicate news stories. The early results are encouraging, however I am still working on getting the base tech running reliably. I am not tackling the problem of finding relevant information, just eliminating duplicates. I will post more here over the next few weeks/months.
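I won't share the project's internals yet, but one standard way to detect near-duplicate stories, in the same spirit as what I am doing, is word shingling plus Jaccard similarity. Here is a minimal sketch in Python; the names and the threshold are illustrative, not my actual system:

```python
def shingles(text, k=3):
    """Break text into overlapping k-word shingles (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_duplicate(story_a, story_b, threshold=0.5):
    """Treat two stories as duplicates if their shingle overlap is high."""
    return jaccard(shingles(story_a), shingles(story_b)) >= threshold
```

Two stories that share most of their word sequences score high even if one has a few extra words tacked on, which is exactly the case with lightly rewritten wire stories.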

.Net 4.0 is shaping up to be awesome. I love LINQ in 3.5. It has really changed the way I write a lot of code as well as the way I think about problems. There is a whole class of components and technologies that can probably be represented as LINQ expression trees. I have been following a couple of projects that are LINQ providers for CUDA and probably, eventually, OpenCL. I am extremely interested in doing some GPU programming, and I am going to procure a CUDA card in the near future, as soon as I crack open my case and check my power supply wattage.

On the topic of power, I have been reading James Hamilton's blog and have found it to be very interesting. James breaks everything in a data center down into numbers, and that makes for very interesting reading. I really like that when he makes a statement about why something is better or worse, he has numbers to go with it. If you have not read his posts on solid state drives, they are an absolute must. If you consider yourself to have an influence on hardware architecture at your organization, James's blog is a must read.

My goal for the next few months is to become a much better PowerShell user. I have started using it and I want to be much faster and be able to use it for many more things than I do right now.


Everyone who writes systems at any reasonable scale needs a high performance logging solution. A number of these exist with varying degrees of functionality and complexity. I have been using NTrace, which is a managed wrapper around Event Tracing for Windows (ETW). ETW is a high performance trace system implemented as a device driver that was first introduced in Windows 2000. There is not a lot of information out there on ETW, although it is beginning to gain some traction. I know that it is used internally at Microsoft as a tracing and instrumentation solution in large systems. For a background on ETW see this MSDN article.

NTrace is an elegant wrapper with an easy to use syntax.

EtwTrace.Trace("Item {0} Was not added to the database. Message: {1}", itemId, ex.Message);

The messages are compiled into TMF files, which combine the text with the message IDs and data from the ETW system when viewing a trace. Other awesome features include adjusting the trace level on the fly, a must for high performance applications. ETW supports circular log files, which is highly useful. The performance of ETW is ample for even the most demanding applications. It is possible to instrument an application at a very detailed level with very little performance cost.

Microsoft allows both WCF and WPF to log to ETW. I am moving all my WCF logging into ETW. The Service Trace Viewer can still open the resulting files.

The only downside is that the precompiler does not provide much information about the nature of any errors you made.

See Andy Hooper’s blog.

Google my health, or not

Well, the open secret is now not so secret with today's announcement. I see this as pretty much a non-event. As Neil Versel said today, show me a PHR with more than a handful of active users. It is true that Google could become a dominant player in this market, however there are a lot of people who have a vested interest in that not happening. I can't comment on the specifics of Google's implementation since I don't have access, but PHRs don't seem to have come up with a compelling reason for their existence. The "if we build it, they will come" attitude has not worked here either.

In a larger sense, EMRs have not been shown to increase anything except cost. They have not saved money or lives on any large scale. Maybe they will in the future. In fact I am sure they will. But not yet. We can't even exchange data yet.

So Google making a PHR doesn’t solve anything. I am sure they will throw a nice party at HIMSS though.


FDA Regulation

Some emails this morning prompted me to write this. I have many thoughts on FDA regulation, formed mainly through my contact with companies that comply with FDA regulations. Note that I have no experience getting something FDA certified. If you do and want to comment, please do so.

First, I note that FDA regulation does not produce quality software. Whatever the process includes, it is not rigorous testing. How do I know? The number of bugs that appear in FDA regulated software. I have seen all sorts of bugs, from showing the wrong patient data to crashing on legal DICOM to processing HL7 messages incorrectly. That is to say nothing of crashes, system hangs and other phenomena. One of my colleagues could crash a leading vendor's 3D workstation on demand. It was always funny to do it in their booth and watch their people squirm.

Second, some vendors have used and still use regulation as an excuse to sell hardware at hugely inflated prices. Some are worse than others, and at the slightest modification to a system they throw up their hands and talk about how they cannot guarantee that their system will work. Once a vendor told us that hooking a trackball to a PACS workstation instead of a mouse could cause the software to click on random things. After all, the mouse had not been validated. Fortunately our administrator told them just where they could stick that line. Anyone familiar with software knows that mouse input is handled through an abstraction layer, and the software has no idea what device it is talking to.

So just what is this regulation getting the customer? Not a lot other than expense. Quality software is built by having good developers and employing good software development techniques. People interested in this should check out the book ‘Dreaming in Code’ by Scott Rosenberg.

Now obviously this is anecdotal, and you, the reader, may cry foul and say that this is all not really true. I know of no company that compiles data on software quality in healthcare. Healthcare software companies would throw up all kinds of roadblocks if this were attempted, since I think most of them know what would be found. So all we have today is anecdote.

In summary, I view FDA regulation of software as a waste of time. It is good that they try to make sure that companies are not cooking people with radiation, and that at some level they keep the pharma companies in line, although that is a completely separate and very complicated issue.


It has been a year and a couple of months since GoogleMIRC was shown at RSNA. GoogleMIRC was a radiology vertical search engine that served as a research project. Incidentally, it was my last research project at the Baltimore VA. This post will take the place of an article that I wrote and rewrote but never thought was really any good. I originally intended to publish an article in Radiographics. That is obviously never going to happen now. Fortunately I can write much more informally and tell you the story of GoogleMIRC.

Before we go any further I want to acknowledge the other participants who made the project possible. None of this would have been possible without Khan Siddiqui. We came up with the idea together in his office while discussing some of the limitations of RSNA's MIRC project. He worked with me to make it all possible. I want to thank Paul Wheeler, currently at Positronic, who helped out with a couple of crucial fixes, including speeding up the search algorithm and balancing the URLs that were sent to the crawler. Also Eliot Siegel, whose expectations we constantly tried to exceed. Also thanks to the rest of the group, including Woojin Kim, Nabile Safdar, Bill Boonn and Krishna Juluru. Additionally, thanks must be offered to everyone whose web server I abused for this project, particularly the University of Alabama teaching file.

Originally GoogleMIRC was conceived as an idea to simply replace the search functionality in MIRC. Khan and I came up with the idea during one of our late afternoon discussions. Every afternoon we had an ice cream break, usually around 4:30 or 5, and discussed interesting things. We discussed simply adding a summary to the search results, like Google has for each result. MIRC simply showed the title of the case. Also, at the time (I don't know if it is still true) MIRC provided little to no relevance ranking for results. The results were partitioned by which server they came from, which is really not what the user is looking for. So with that I set out to study search technology. It was a good thing that none of us had any idea what we were getting into. This occurred at the end of January 2006.

The project quickly expanded into covering as many teaching files as possible. We wanted to provide radiologists with a tool that they could use in clinical practice that added value. We judged that radiologists would want to be able to quickly access content that was radiology specific. After all, the radiologist wants his information immediately and in a form that allows him to better perform his job. An article about the disease in Nature is not particularly useful at the time of diagnosis, no matter how interesting it might be.

I spent the next two months reading and researching search technology. There is a plethora of books, articles and other resources on the topic. My interest in technology, which had been waning, was definitely recharged. After beginning to understand some of the problems involved (which are immense) I built the first test crawler. It was quite limited, being non-distributed. It was also very impolite: it ignored robots.txt files and tended to hammer servers because it did not throttle requests. I learned a great many important lessons, though, about how a web crawler works and how to process HTML data.

The processing of HTML data is very nontrivial. First, because browsers are very forgiving of web designers, the HTML that is downloaded is often broken. Missing tags. Unclosed tags. Things that start and stop suddenly. Many hours were spent in the debugger and on adding a module to clean the incoming HTML and prepare it for processing. The decision that Netscape made back in the mid 90's still haunts us today with poorly written HTML. Commercial search engines such as Google and Yahoo do much more with the HTML data, including determining word importance by its location in the document and how large the font is relative to other words in the document.
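To give a flavor of what tolerant parsing looks like (my module was in .net; this is just an illustration), Python's html.parser will chew through unclosed and mismatched tags without complaint, so a tiny text extractor in its spirit might look like:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Pull visible text out of possibly broken HTML. html.parser
    tolerates unclosed and mismatched tags instead of raising."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style content.
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.chunks)
```

Note that the unclosed `<b>` and the missing `</p>` below never trigger an error; the parser just keeps going, which is exactly the forgiving behavior a crawler needs.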

So the first crawler was built in April, and by early May I had decided to completely do away with it. I had never really intended it as the final version, and it had become a huge mess as I added features. The new crawler was a distributed crawler with a central controller and services running on different computers that downloaded the pages. It throttled its requests to specific hosts, contacting a remote computer no more than once every 30 seconds. How did the crawl work? Basically, I used Radiology Education to seed the crawler with about 400 URLs. Big sites that were not really relevant, such as Google, Microsoft, and Flickr, were removed by hand. From there the crawl went out and crawled all the sites that it found.
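The per-host throttle amounts to remembering the last fetch time for each host and refusing to contact that host again within the interval. A minimal sketch of the idea in Python (the real crawler was a distributed .net system; these names are mine):

```python
import time
from urllib.parse import urlparse

class HostThrottle:
    """Track the last fetch time per host and enforce a minimum
    interval (30 seconds here) between requests to the same host."""
    def __init__(self, min_interval=30.0):
        self.min_interval = min_interval
        self.last_fetch = {}  # host -> timestamp of last request

    def ready(self, url, now=None):
        """True if we may contact this URL's host right now."""
        now = time.time() if now is None else now
        host = urlparse(url).netloc
        last = self.last_fetch.get(host)
        return last is None or now - last >= self.min_interval

    def record(self, url, now=None):
        """Remember that we just fetched from this URL's host."""
        now = time.time() if now is None else now
        self.last_fetch[urlparse(url).netloc] = now
```

URLs whose hosts are not yet ready simply go back on the queue; different hosts are independent, so the crawler stays busy without hammering anyone.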

By June the crawler was fully functional with a plethora of features, such as whitelists/blacklists, throttling, a new URL extractor, and code to recrawl a page a couple of times in the event of an error. The crawler at this point was very much improved over what it had been and existed in basically this form for the duration of the project. I also implemented a special component of the crawler for retrieving data from sites running RSNA MIRC software. Since there was a cap on the number of results returned to the user, I implemented a paging system that allowed the crawler to retrieve all the results.

In June I started seriously working on indexing. I built an inverted index to allow the text to be searched. I computed PageRank for the currently known graph of URLs. The PageRank computation was handled as described in Larry and Sergey's original paper, using a single machine, and the computation took several hours to run each iteration. I was able to get convergence at around 10 iterations, which is consistent with the literature. This was actually a bit more work than these words do justice to. I also began to work on document classification with a Bayesian classifier. The classifier used teaching files from a commercial DVD as training documents. Common words were removed. This classifier allowed us to determine whether a page was related to radiology or not by its content. I will note here that this was a very primitive attempt. Using the data we had, I could have incorporated a variety of other information into the algorithm, such as content on pages that linked to it or that it linked to.
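The PageRank power iteration itself is simple enough to sketch. Here is a toy single-machine version in Python (my real computation ran against SQL Server in .net; this is only the core recurrence, assuming every link target is a known page):

```python
def pagerank(links, damping=0.85, iterations=10):
    """Power-iteration PageRank over a dict {page: [outgoing links]}.
    Every link target must itself be a key of the dict. Around 10
    iterations is typically enough for rough convergence on small
    graphs, consistent with the original paper."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Everyone starts with the teleportation share (1 - d) / n.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:
                # Dangling page: spread its rank evenly to all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new_rank[q] += damping * rank[p] / len(outs)
        rank = new_rank
    return rank
```

Total rank stays at 1.0 every iteration, which is a handy sanity check when debugging a run that takes hours.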

July and August were spent working on various analysis projects as well as building a search algorithm. I used the Vector Space Model because of its simplicity, even though it tends to be biased toward shorter documents. In July I had a completely working version, although it was still far short of where I wanted it to be. I built a stemmer using Porter stemming and built in support for both go and stop words. Stemming reduces words to their root, so that radiologist and radiologists would both appear in a search for radiologist. Go words are never stemmed, and stop words are words that are not indexed. Stop words are common words such as a, an, of, etc.
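For flavor, here is a toy sketch of the pipeline in Python. The crude plural-stripper below stands in for real Porter stemming, the scoring uses plain term-frequency cosine rather than a full weighted model, and all the names are illustrative, not my original code:

```python
import math
from collections import Counter

def tokenize(text, stop_words=frozenset({"a", "an", "of", "the"})):
    """Lowercase, drop stop words, and crudely stem plurals.
    (A stand-in for real Porter stemming, for illustration only.)"""
    out = []
    for w in text.lower().split():
        if w in stop_words:
            continue  # stop words are never indexed
        if w.endswith("s") and len(w) > 3:
            w = w[:-1]  # 'radiologists' -> 'radiologist'
        out.append(w)
    return out

def cosine(query, doc):
    """Cosine similarity between term-frequency vectors, the core of
    the Vector Space Model."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0
```

The short-document bias mentioned above falls straight out of the math: the document norm in the denominator penalizes long documents even when they contain just as many query terms.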

At the end of August I decided to leave the VA to commercialize a vertical search engine on the web for radiologists. When I left at the end of September we were in fairly good shape for RSNA. There was still a scramble to polish it for the show. It never really reached the point that I wanted it to.

There were many interesting things we found. One was how bad misspelling is on the Internet, even on commercial teaching files. Several that were utilized for various things were definitely not run through a spell checker. The crawler was the best working part of the whole system. It was able to sustain about 2 Mbps of traffic and download millions of pages. Further work would be needed to make it scale, which would include partitioning the URL database and allowing multiple crawl managers to work on different lists of URLs. The crawler was powerful enough to crawl through the radiology portion of the web. One of the reasons that this does not really make a good scientific article is the lack of measurable data. We did not collect data on radiologist satisfaction with GoogleMIRC. We did not measure recall and precision, two traditional measures of search engine quality.
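That partitioning could be sketched simply: hash each URL's host to pick a crawl manager, so all of a host's URLs, and therefore its politeness state, live in one place. A minimal illustration in Python (names are mine; this was never part of the actual system):

```python
import hashlib
from urllib.parse import urlparse

def assign_manager(url, num_managers):
    """Map a URL to a crawl manager by hashing its host. Hashing the
    host (not the full URL) keeps every URL from one site on the same
    manager, so per-host throttling state never has to be shared."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_managers
```

The trade-off is that one enormous host still lands entirely on one manager, which is fine for politeness but means the load balance is only as good as the host distribution.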

The project had a number of limitations. First was my own choice of technology. I am a heavy .net user and I implemented GoogleMIRC in .net. That was not a bad decision. However, I decided to use SQL Server 2005 as the data store. That was a very poor decision whose ramifications I did not understand at the time. It did save a lot of developer time, which I judged to be more valuable for the purposes of the project since I was the only person programming on it. I wish I had known about Lucene at the time and used the .net port of it. That would have saved a tremendous amount of time on building the index and search algorithm and probably led to better results. There definitely would have been more features, like thumbnails. Furthermore, I wish I had known about Nutch and Hadoop. When I found them about a year ago I kicked myself. Nutch is an open source search engine built in Java. Hadoop is a distributed computing platform that replicates Google's infrastructure. Building in Java may have been wiser due to the number of open source mathematical libraries available to perform tasks such as singular value decomposition, a crucial piece of a technique called latent semantic indexing.

Most limitations really centered around the fact that there was only one developer on the project. It is crazy to try to build a search engine yourself. There are a lot of moving pieces. It is especially challenging if you really want to make it scale, since many techniques that work on one machine will not work across multiple machines.

I personally got a tremendous amount out of the project. For instance, since I used SQL Server and built my own index and search algorithms, I gained a solid understanding of the issues there. I know how to build a crawler that scales reasonably well. Working on a project like this, you gain a newfound appreciation of the scale of the web. I tried lots of things that did not work out at the time, such as singular value decomposition for finding common concepts in documents, that I have since gotten to work.

What comes after? Yottalook builds on many of the ideas and leverages Google's custom search technology. I have not stopped working on search, and I hope to publicly show what I have been working on this year.


Why scalability should concern you when you are small

Scalability is a topic often ignored by small companies. We will get to it later. We don't have enough traffic. We don't have the time or resources. Well, that's all well and good until something happens. Maybe that something is you becoming successful. Now your service is in demand. Your database is being pounded. Adding web servers isn't a problem, but getting them data is. Your database server thrashes in agony. And then you get upset. How could this happen? We bought big beefy machines from Dell. We paid a lot of money for them.

Engineering for scalability does matter. You don't need to own all the hardware required to stay up when TechCrunch links to you, but I would say that just having your developers model what they would do is a good thing. Then at least you have a plan that has been tested. You know what you will do as traffic increases. People who know me have heard me talk about playbooks. I don't always advocate doing exactly what your playbook says, but I strongly recommend having one, since it gives you a place to start. Think of it like a business plan. You won't follow your business plan exactly, but it gives you a place to start.


Microsoft is serious about healthcare

So after the HealthVault launch, everyone could be comfortable that Microsoft was serious about healthcare. In case you still were not convinced, they just made another acquisition. Buying software that focuses on the developing world is extremely smart. Some countries have exploding middle classes that are going to want to consume healthcare. Don't be surprised to see Microsoft entering the EMR market in a big way over the next few years.


Funny iPhone Airplane Story

Check out this story from an iPhone user on a plane with his phone in “Airplane Mode”.
