Category Archives: Software

AHGTH: Comments on Snap/Yesod

I spent some time last night looking at Snap, which is a web development framework for Haskell. I have seen repeated comments that Snap is superior to Yesod, so I finally poked around myself. We use Yesod at Alpha Heavy Industries for internal status pages and tools. I have found it a bit kludgy to use, and Snap was advertising these Snaplets, which I figured held some promise. My frame of reference for comparison is ASP.Net around versions 3.5 and 4.0, which I also used to build a lot of tools.

First off, I am not really interested in the technical merits of either platform. Both advertise very high request throughput and have made design choices to support it. I have no need of such a solution; our webserver is lucky if it has two people making requests at the same time. I am mostly concerned with ease of development. In ASP.Net I could build a quick webapp to display data or do data entry (my most common tasks) in a couple of hours. In Yesod it often turns into a day-long affair. I have realized how much ASP.Net insulated me from the mechanics of HTML, building forms, etc. Yesod is a much lower-level affair, and in some ways a higher-level one.

At the risk of inciting your wrath I will admit that I have repeatedly compared Yesod to many half-baked Microsoft technologies like WCF and WWF. Both were very easy to get up and running with. Both demoed fantastically. Everything was great until a scenario that was not considered came up. I feel the same in Yesod. It's easy to get going, especially now, but then you need to do something a bit complicated and you have to dive into the details of the framework. Admittedly, as my Haskell wizardry has increased I have been able to devise simpler solutions to complex problems in Yesod.

Snap doesn't seem to be much different. I will leave it to others to argue about what the right templating syntax is. It seems to have most of the same features, although quite a bit less documentation. I was disappointed that most of the snaplets were built to interface with databases; I was really hoping for some powerful controls. I may be missing something, and if I am, I invite anyone to fill me in.

ASP.Net's controls are one thing I miss from the old world. It was super easy to build a data grid or a series of charts bound to arbitrary data sources. Many will say that to build highly scalable websites you can't use these things, and they are right, but most of the websites out there are not serving thousands of pages per second. What I really think either framework could benefit from is a layer on top of what exists now: a set of full-featured controls for charts, data grids, calendars and things I have never thought of.

In Yesod I am still not sure what this would look like. Is a chart a Widget? My chart control today is a Widget. In fact most of my widgets are controls, and they are hacky and nasty since they are JavaScript-powered. I have done more JavaScript code generation than I care to mention. Wait, aren't I supposed to be writing Haskell?
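For the curious, here is roughly the shape such a "control" takes today. This is only a minimal sketch: it assumes a scaffolded Yesod site (so Widget, whamlet, julius and addScriptRemote are all in scope via the usual imports), a placeholder script URL, and a hypothetical drawChart function provided by whatever JavaScript charting library gets loaded.

{-# LANGUAGE OverloadedStrings, QuasiQuotes #-}

-- A chart as a Yesod widget: emit a placeholder element, pull in a
-- charting script, and glue them together with hand-written JavaScript.
chartWidget :: Widget
chartWidget = do
    addScriptRemote "//example.com/chart-library.js"             -- placeholder library URL
    [whamlet|<div #chart-container>|]                             -- element the chart renders into
    toWidget [julius|drawChart("#chart-container", chartData);|]  -- the JavaScript glue I complain about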

All that being said, both seem to have solid technical foundations. More work just needs to be spent on the higher levels. I look forward to seeing both evolve.

P.S. A reader may ask: why use Yesod if I don't like it? I don't want to have to rewrite blocks of functionality in another language. It requires less time overall to just suck it up and make the best of it.

P.P.S. If anyone knows how reusable controls in Yesod should be developed so they can just be linked in, I will develop any future controls in such a manner and release them.

Haskell And My Conversion To The IO Monad

I first started learning Haskell in early 2010. Coming from a .NET (C#/F#) background, I thought that Haskell would be very similar to F#. I was very, very wrong. When I first started with Haskell I was, like many people, extremely frustrated with the IO monad, which in Haskell is where all IO happens. A chief complaint I had was that types get infected with an IO prefix. For instance, say a string is read from a file. Its type is not just String; it is instead IO String.

Adding IO to the type, it turns out, is very useful. IO is used to denote unsafe operations which happen outside the runtime, such as reading from a file or a socket. The IO annotation simply marks the places where something unsafe and unpredictable can happen. By contrast, the rest of the code is what is called pure, which means the compiler can prove that it is deterministic. That guarantee has great properties for optimization, since the optimizer in the compiler can rewrite pure code any way it wants.
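A minimal sketch of the split:

import Data.Char (toUpper)

-- Reading a file is an effect, so its result carries the IO tag.
readConfig :: IO String
readConfig = readFile "config.txt"

-- Pure code never sees the tag; it just transforms values.
shout :: String -> String
shout = map toUpper

main :: IO ()
main = do
    contents <- readConfig      -- running the action yields a plain String
    putStrLn (shout contents)   -- pure and impure code meet here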

There is a reason that this IO annotation is so important. I spent the last year writing Scala code on Hadoop. While Hadoop is frustratingly limited, using Scala with it made writing both MapReduce jobs and Hive UDFs much less verbose than writing them in Java. I started to like Scala, aside from the inherent limitations of the JVM. I then started building my first large-scale standalone app in Scala. That turned out to be a painful disaster. One of the most frustrating things I found was the varying semantics of Java streams. I also found writing highly asynchronous code in Scala to be very painful. I began to realize that the reason I liked Scala was because all my code was pure; Hadoop was handling the IO (whether it was doing a good job of it is a different question entirely).

Using the IO monad to reason about IO turns out to be much easier than the alternatives. When writing large-scale, high-performance apps in languages such as C++, tools and libraries typically have to be developed to compensate for the lack of the IO type annotation. In Haskell, however, the type checker in the compiler can alert us when we have done something incorrectly, as the types will not line up. All this worrying about IO becomes very important when writing highly reliable software. Having your language do as much work as possible toward eliminating entire classes of bugs is very important in driving down the cost of developing top-quality software.
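For example, something like this is rejected outright, which is exactly the alarm I want:

import Data.Char (toUpper)

-- Will not compile: readFile produces an IO String, but map toUpper
-- wants a plain String, so GHC reports a type mismatch (roughly
-- "Couldn't match expected type [Char] with actual type IO String").
broken :: String
broken = map toUpper (readFile "log.txt")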

I am very pleased with the fact that the software I write now often works once it compiles, as the type system is able to provide assurances that languages with less rich type systems, such as Java or C#, cannot provide. Consider me a convert.

Scala 2.9 ArrayStack

So I moved my current codebase to be compiled against Scala 2.9.0.1. I had been using 2.8 and I wanted access to the parallel collection operations, which I had been missing dearly from .NET. Much of my code used ArrayStack as a mutable collection. I have increasingly preferred immutable approaches; however, when simply tying together existing Java components this can be impractical due to the way some APIs work. With the upgrade to 2.9 I noticed two specific problems.

1. If you call ArrayStack.reverse and iterate through the results, the first item returned is actually the second item added to the collection.

2. When iterating through larger collections, the iterator starts with null elements at the beginning. This is probably returning unused elements of the backing buffer that are allocated as part of the growth strategy, though I have not spent enough time on it to say conclusively.

Given these problems with a basic collection, I have removed ArrayStack from all of my code and have been using java.util.Vector[A] instead. This seems to have cleared up much of the bizarre behavior.

NTrace

Everyone who writes systems at any reasonable scale needs a high-performance logging solution. A number of these exist with varying degrees of functionality and complexity. I have been using NTrace, which is a managed wrapper around the Event Tracing for Windows (ETW) libraries. ETW is a high-performance, kernel-level trace system that was first introduced in Windows 2000. There is not a lot of info out there on ETW, although it is beginning to gain some traction. I know that it is used internally at Microsoft as a tracing and instrumentation solution in large systems. For a background on ETW see this MSDN article.

NTrace is an elegant wrapper with an easy-to-use syntax:

EtwTrace.Trace("Item {0} Was not added to the database. Message: {1}", itemId, ex.Message);

The messages are compiled into TMF files, which combine the text with the message IDs and data from the ETW system when viewing a trace. Other awesome features include adjusting the trace level on the fly, a must for high-performance applications, and support for circular log files, which is highly useful. The performance of ETW is ample for even the most demanding applications; it is possible to instrument an application at a very detailed level with very little performance cost.

Microsoft allows both WCF and WPF to log to ETW. I am moving all my WCF logging into ETW; the Service Trace Viewer can still open the resulting files.

The only downside is that the precompiler does not provide a lot of information about the nature of any errors you made.

See Andy Hooper’s blog.

GoogleMIRC

It has been a year and a couple of months since GoogleMIRC was shown at RSNA. GoogleMIRC was a radiology vertical search engine that served as a research project; incidentally, it was my last research project at the Baltimore VA. This post is in place of an article that I wrote and rewrote but never thought was really any good. I originally intended to publish it in Radiographics. That is obviously never going to happen now. Fortunately I can write much more informally here and tell you the story of GoogleMIRC.

Before we go any farther I want to acknowledge the other participants who made the project possible. None of this would have happened without Khan Siddiqui. We came up with the idea together in his office while discussing some of the limitations of RSNA's MIRC project, and he worked with me to make it all possible. I want to thank Paul Wheeler, currently at Positronic, who helped out with a couple of crucial fixes, including speeding up the search algorithm and balancing the URLs that were sent to the crawler. Thanks also to Eliot Siegel, whose expectations we constantly tried to exceed, and to the rest of the group, including Woojin Kim, Nabile Safdar, Bill Boonn and Krishna Juluru. Additionally, thanks must be offered to everyone whose web server I abused for this project, particularly the University of Alabama teaching file.

Originally GoogleMIRC was conceived simply as a replacement for the search functionality in MIRC. Khan and I came up with the idea during one of our late-afternoon discussions; every afternoon we had an ice cream break, usually around 4:30 or 5, and discussed interesting things. We talked about adding a summary to each search result, the way Google does. MIRC simply showed the title of the case. Also, at the time (I don't know if it is still true) MIRC provided little to no relevance ranking; the results were partitioned by which server they came from, which is really not what the user is looking for. So with that I set out to study search technology. It was a good thing that none of us had any idea what we were getting into. This was at the end of January 2006.

The project quickly expanded into covering as many teaching files as possible. We wanted to provide radiologists with a tool that they could use in clinical practice and that added value. We judged that radiologists would want to be able to quickly access content that was radiology-specific. After all, the radiologist wants his information immediately and in a form that lets him better perform his job. An article about the disease in nature is not particularly useful at the time of diagnosis, no matter how interesting it might be.

I spent the next two months reading and researching search technology. There is a plethora of books, articles and other resources on the topic, and my interest in technology, which had been waning, was definitely recharged. After beginning to understand some of the problems involved (which are immense) I built the first test crawler. It was quite limited, being non-distributed, and it was very impolite: it ignored robots.txt files and tended to hammer servers since it did not throttle requests. I learned a great many important lessons, though, about how a web crawler works and how to process HTML data.

Processing HTML data is very nontrivial. First, thanks to browsers being very forgiving of web designers, the HTML you download is often broken: missing tags, unclosed tags, things that start and stop suddenly. Many hours were spent in the debugger and on a module to clean the incoming HTML and prepare it for processing. The decision Netscape made back in the mid-90s to tolerate malformed markup still haunts us today in the form of poorly written HTML. Commercial search engines such as Google and Yahoo do much more with the HTML, including determining a word's importance from its location in the document and how large its font is relative to other words.

So the first crawler was built in April, and by early May I had decided to do away with it completely. I had never really intended it to be the final version, and it had become a huge mess as I added features. The new crawler was a distributed one, with a central controller and services running on different computers that downloaded the pages. It throttled its requests to specific hosts, contacting a remote computer no more than once every 30 seconds. How did the crawl work? Basically, I used Radiology Education to seed the crawler with about 400 URLs. Big sites that were not really relevant, such as Google, Microsoft and Flickr, were removed by hand. From there the crawl went out and crawled all the sites that it found.
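The politeness rule itself is simple. My implementation was in .NET, but the idea fits in a few lines; here is a sketch of just the rule (the host names and 30-second figure are the only things taken from the project):

import qualified Data.Map.Strict as Map
import Data.Time (UTCTime, diffUTCTime)

type LastContact = Map.Map String UTCTime   -- host -> time of last request

-- May we fetch from this host right now?  Only if 30 seconds have
-- passed since the last request we sent it, or we have never seen it.
mayContact :: UTCTime -> String -> LastContact -> Bool
mayContact now host seen =
    case Map.lookup host seen of
        Nothing   -> True
        Just prev -> now `diffUTCTime` prev >= 30

recordContact :: UTCTime -> String -> LastContact -> LastContact
recordContact now host = Map.insert host now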

By June the crawler was fully functional, with a plethora of features such as whitelists/blacklists, throttling, a new URL extractor, and code to recrawl a page a couple of times in the event of an error. The crawler at this point was much improved over what it had been and existed in basically this form for the duration of the project. I also implemented a special component of the crawler for retrieving data from sites running RSNA's MIRC software: since there was a cap on the results returned to the user, I implemented a paging system that allowed the crawler to retrieve all of them.

In June I started seriously working on indexing. I built an inverted index to allow the text to be searched and computed PageRank for the currently known graph of URLs. The PageRank computation was handled as described in Larry and Sergey's original paper, on a single machine, and it took several hours to run each iteration. I was able to get convergence at around 10 iterations, which is consistent with the literature. This was actually a bit more work than these words do justice to. I also began to work on document classification with a Bayesian classifier. The classifier used teaching files from a commercial DVD as training documents, with common words removed, and it allowed us to determine whether a page was related to radiology by its content. I will note here that this was a very primitive attempt; using the data we had, I could have incorporated a variety of other information into the algorithm, such as the content of pages that linked to it or that it linked to.
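To give a flavor of the PageRank computation: my actual code was .NET on top of SQL Server, so treat this as a sketch of the idea rather than the real implementation, written here in Haskell for brevity and ignoring dangling pages and efficiency.

import qualified Data.Map.Strict as Map

type Url  = String
type Rank = Map.Map Url Double

-- One power-iteration step, as in the original paper:
--   PR(u) = (1 - d) / n  +  d * sum over pages v linking to u of PR(v) / outdegree(v)
-- `links` maps every url to the urls it links out to.  The inbound scan
-- below is O(n^2); a real implementation inverts the link graph first.
step :: Double -> Map.Map Url [Url] -> Rank -> Rank
step d links ranks = Map.mapWithKey newRank ranks
  where
    n           = fromIntegral (Map.size ranks)
    share v     = (ranks Map.! v) / fromIntegral (length (links Map.! v))
    inbound u   = [ v | (v, outs) <- Map.toList links, u `elem` outs ]
    newRank u _ = (1 - d) / n + d * sum (map share (inbound u))

-- Start from a uniform vector and iterate; around 10 rounds was enough
-- for convergence on our graph, consistent with the literature.
pagerank :: Map.Map Url [Url] -> Rank
pagerank links = iterate (step 0.85 links) start !! 10
  where
    n     = fromIntegral (Map.size links)
    start = Map.map (const (1 / n)) links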

July and August were spent working on various analysis projects as well as building a search algorithm. I used the vector space model because of its simplicity, even though it tends to be biased toward shorter documents. In July I had a completely working version, although it was still far short of where I wanted it to be. I built a stemmer using Porter stemming and built in support for both go and stop words. Stemming reduces words to their root so that radiologist and radiologists would both appear in a search for radiologist. Go words are never stemmed, and stop words are words that are not indexed; stop words are common words such as a, an, of, etc.
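The core of the vector space model is small enough to sketch. Again, this is only an illustration of the idea in Haskell, not the code I wrote: the stop-word list is a made-up sample, and Porter stemming and go words are elided, but a real version would stem each term before counting it.

import qualified Data.Map.Strict as Map
import Data.Char (isAlpha, toLower)

stopWords :: [String]
stopWords = ["a", "an", "of", "the", "and"]

-- A crude term-frequency vector: lowercase, split on non-letters,
-- drop stop words.
termVector :: String -> Map.Map String Double
termVector doc =
    Map.fromListWith (+)
        [ (w, 1) | w <- words (map normalise doc), w `notElem` stopWords ]
  where
    normalise c = if isAlpha c then toLower c else ' '

-- Cosine similarity between a query vector and a document vector:
-- the heart of the vector space model's ranking.
cosine :: Map.Map String Double -> Map.Map String Double -> Double
cosine q d
    | nq == 0 || nd == 0 = 0
    | otherwise          = dot / (nq * nd)
  where
    dot  = sum (Map.elems (Map.intersectionWith (*) q d))
    norm = sqrt . sum . map (^ 2) . Map.elems
    nq   = norm q
    nd   = norm d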

At the end of August I decided to leave the VA for the purpose of commercializing a vertical web search engine for radiologists. When I left at the end of September we were in fairly good shape for RSNA, though there was still a scramble to polish it, and it never really reached the point that I wanted it to.

There were many interesting things we found. One was how bad misspelling is on the Internet, even in commercial teaching files; several teaching files that we used for various things were definitely never run through a spell checker. The crawler was the best-working part of the whole system. It was able to sustain about 2 Mbps of traffic and download millions of pages. Further work would be needed to make it scale, including partitioning the URL database and allowing multiple crawl managers to work on different lists of URLs, but it was powerful enough to crawl through the radiology portion of the web. One of the reasons this does not really make a good scientific article is the lack of measurable data: we did not collect data on radiologist satisfaction with GoogleMIRC, and we did not measure recall and precision, two traditional measures of search engine quality.

The project had a number of limitations. First was my own choice of technology. I am a heavy .NET user and I implemented GoogleMIRC in .NET; that was not a bad decision. However, I also decided to use SQL Server 2005 as the data store, a very poor decision whose ramifications I did not understand at the time. It did save a lot of developer time, which I judged to be more valuable for the purposes of the project since I was the only person programming on it. I wish I had known about Lucene at the time and used the .NET port of it. That would have saved a tremendous amount of time on building the index and search algorithm and probably led to better results, and there definitely would have been more features, like thumbnails. Furthermore, I wish I had known about Nutch and Hadoop. When I found them about a year ago I kicked myself. Nutch is an open source search engine built in Java. Hadoop is a distributed computing platform that replicates Google's infrastructure. Building in Java may have been wiser due to the number of open source mathematical libraries for tasks such as singular value decomposition, a crucial piece of a technique called latent semantic indexing.

Most of the limitations really centered around the fact that there was only one developer on the project. It is crazy to try to build a search engine yourself; there are a lot of moving pieces. It is especially challenging if you really want to make it scale, since many techniques that work on one machine will not work across multiple machines.

I personally got a tremendous amount out of the project. For instance, since I used SQL Server and built my own index and search algorithms, I gained a solid understanding of the issues there. I know how to build a crawler that scales reasonably well. Working on a project like this, you gain a newfound understanding of the scale of the web. I tried lots of things that did not work out at the time, such as singular value decomposition for finding common concepts in documents, which I have since gotten to work.

What comes after? Yottalook builds on many of the ideas and leverages Google's custom search technology. I have not stopped working on search and hope to publicly show what I have been working on this year.


Xobni

Check out Xobni…a social network for your inbox.


Why scalability should concern you when you are small

Scalability is a topic often ignored by small companies. "We will get to it later." "We don't have enough traffic." "We don't have the time or resources." Well, that's all well and good until something happens. Maybe that something is you becoming successful. Now your service is in demand. Your database is being pounded. Adding web servers isn't a problem, but getting them data is. Your database server thrashes in agony. And then you get upset. How could this happen? We bought big beefy machines from Dell. We paid a lot of money for them.

Engineering for scalability does matter. You don't need to own all the hardware it would take to stay up when TechCrunch links to you, but I would say that just having your developers model what they would do is a good thing. Then at least you have a plan that has been tested; you know what you will do as traffic increases. People who know me have heard me talk about playbooks. I don't always advocate doing exactly what your playbook says, but I strongly recommend having one since it gives you a place to start. Think of it like a business plan: you won't follow it exactly, but it gives you a place to start.


Microsoft is serious about healthcare

So after the HealthVault launch everyone could be comfortable that Microsoft was serious about healthcare. In case you still were not convinced, they just made another acquisition. Buying software that focuses on the developing world is extremely smart; some countries have exploding middle classes that are going to want to consume healthcare. Don't be surprised to see Microsoft entering the EMR market in a big way over the next few years.


The future platform

I want for a moment to address a weakness that I think exists in Microsoft's tool offerings. Microsoft does not offer tools for building ultra-high-performance, extremely fault-tolerant and widely distributed applications. While some may consider this a niche market, I do not. Furthermore, the applications, platforms and tools that require such technology will be integral to a much wider array of applications through the services that they offer. A concrete example would be a search engine: search engines have massive computing needs and need to scale almost without limit. Or consider an application like an Electronic Medical Record being run as an ASP. In order to handle the plethora of data from thousands of sources, and to keep the application always available, new computing approaches need to be developed.

Forget everything that you have learned about traditional applications. N-tier applications have difficulty scaling up to the size required by a large-scale web application. Traditional components like databases will continue to exist and be important, but the features that matter will change. For instance, startups and large companies building web-scale applications have increasingly turned to MySQL. MySQL enables horizontal scaling, so to increase capacity you simply add servers. This is very different from something like Microsoft SQL Server, where you need to scale up onto larger machines. Don't think MySQL is ready for prime time? Yahoo, Amazon, Nokia, and Google disagree.

However, that is not a significant departure from traditional applications. Enter Hadoop. Hadoop is a distributed computing platform with many features similar to Google's base technology. It implements MapReduce. It has a distributed file system, an advanced filesystem that is fault tolerant and can copy and replicate data on demand for frequently accessed objects. Hadoop provides a simple infrastructure for developers to build applications that can deal with enormous volumes of data. Why would companies that are not search engines be interested? Simple: Hadoop can be used to solve very demanding problems very cheaply. Want to build an EMR with advanced data mining functionality? Hadoop would enable the data to be analyzed in a fast and inexpensive manner.
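The programming model is the main attraction: a job is essentially two functions, a mapper and a reducer, and the framework handles distribution, shuffling and fault tolerance. Here is the canonical word-count example, sketched in Haskell purely to show the shape of the model; real Hadoop jobs are written in Java against Hadoop's own APIs and run over data spread across the cluster.

import qualified Data.Map.Strict as Map

-- map phase: each input record (here, a line) becomes (key, value) pairs
mapper :: String -> [(String, Int)]
mapper line = [ (w, 1) | w <- words line ]

-- reduce phase: all values collected for one key are combined
reducer :: String -> [Int] -> Int
reducer _word counts = sum counts

-- the framework's job: run mappers, group ("shuffle") by key, run reducers
wordCount :: [String] -> Map.Map String Int
wordCount inputs =
    Map.mapWithKey reducer
        (Map.fromListWith (++) [ (k, [v]) | line <- inputs, (k, v) <- mapper line ])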

Hadoop is built in Java and runs natively on Linux. For Microsoft that is a problem. Already many web applications, especially Web 2.0 applications, run on LAMP. Microsoft has competitors to these pieces that are good, even if more expensive; whether Microsoft tools save enough developer time to make up for more expensive software is a separate issue for another time. However, deploying applications that run on a large number of machines can lead to huge licensing costs. Each computer needs its own copy of Windows even though the only thing these machines will be used for is computing and storage; few Windows services will ever be used. So Linux makes sense in this situation.

Microsoft should do three things. First, it should sponsor a port of Hadoop to .NET, which would give Microsoft developers the same tools as their open source counterparts; Microsoft will need to hire a bunch of FTEs to make this happen. Second, SQL Server needs to scale horizontally across many machines easily and cheaply. Third, there needs to be a version of Windows to support this. Windows Compute Cluster is not what we are looking for: it does not support .NET natively (although it can use P/Invoke) and is targeted, I think, at people with legacy applications. Information on the website is short on details; if someone wants to correct me on this, please do.

Microsoft needs to do something to address this niche but rapidly growing market segment.


O’Reilly on Multi-core

Tim O'Reilly posted pieces of a conversation about the wall that we have almost hit with processor speed. I just wanted to add that not only do most applications not take advantage of multi-core processors, but the complexity of writing multi-threaded applications means that most applications will never see the advantages of multi-core processors. What is needed are tools that let the underlying operations be parallelized without the programmer having to immerse himself in those details. There are already examples of this, web applications for instance: a webserver handles this problem beautifully. The programmer creates the site, it easily services multiple users, and many sites can scale to millions of users. Of course you get into other problems, but that is not the point. Finally, I want to point out that many problems will never be able to take advantage of multi-core processors since they are not easily broken into multiple paths of execution.
