I have been writing in Haskell almost exclusively since the early summer. Before that I was writing in Scala, and I have significant experience in both C# and F#. One of the most frustrating things about learning Haskell is that there is no good map from the way things are done in the OO world of Java and .NET to the way things should be done in Haskell. Even for a good developer the transition can be difficult, especially when writing high-performance, highly concurrent applications.
I am starting this series to document things that I have found work in Haskell. I don’t claim to be a Haskell expert, but I do build code that ships and works. I tend to use fewer of the fancy language features than other people, and I prefer to write very readable code even if it is slightly more verbose than it might otherwise be. I hope that others can learn from my experience and move along the learning curve faster.
For our inaugural topic I will cover our development environment. We operate in a mixed-language environment using Haskell, C++ and a smattering of both Python and R. We develop primarily in vim (I use mvim) on OS X. We deploy code onto a custom Gentoo Linux running on both our own boxes and EC2 machines.
As our codebase grew we quickly became frustrated with cabal: it has no notion of recompiling everything that depends on a package, so when a change was made in a common package everyone had to manually recompile their packages in the correct order. We evaluated cabal-dev but ultimately decided that it did not meet our needs, primarily because we needed good C++ support and wanted the flexibility to add custom build steps, which did not look like it was going to be particularly easy with cabal-dev. Ultimately we built our own build system, cabal-waf, on top of waf. You can find Nathan’s blog post with more details here.
For debugging we have increasingly moved to a custom build of the RTS that carries our own debugging extensions, specifically around heap analysis. Most debugging is done through GDB, which in and of itself is a terrible experience compared to using WinDbg. In the future we will add more debugging functionality as we need it to debug production issues. I have not found GHCi to be useful for anything other than trivial issues, as it runs far too slow. All core applications and libraries are compiled with -Wall and -Werror to catch any potential bugs the compiler can pick up.
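For readers unfamiliar with where those flags live, here is a minimal sketch of a package description with warnings promoted to errors; the module and package layout is made up, but ghc-options is the standard field:

```cabal
library
  -- hypothetical module layout for illustration
  exposed-modules:  Trading.Core
  build-depends:    base >= 4
  -- -Wall enables the broad warning set; -Werror turns every warning
  -- into a compile error so nothing slips through.
  ghc-options:      -Wall -Werror
```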
We make extensive use of our own fork of the LLVM bindings for Haskell to do code generation. There was functionality we required that depended on type-unsafe code which the maintainers did not want to accept, so we maintain our own fork.
It turns out that we have developed our own set of core libraries, as many companies do. We will probably release a couple of these in the future. We have our own time library that takes most of its design from Joda Time. I was actually very shocked by how undeveloped this functionality was in Haskell, and we had to do a fair amount of work to get complex time manipulation working initially. Apparently we are not the only ones with problems, as I have found another implementation that takes its inspiration from .NET’s DateTime.
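Our time library is not public, but as a rough illustration of the kind of calendar arithmetic a Joda-style API has to get right, here is a sketch using the standard time package: "one month after January 31st" has no literal answer, so the usual behavior is to clip to the last valid day of the target month.

```haskell
import Data.Time.Calendar (Day, addGregorianMonthsClip, fromGregorian)

-- Joda Time's plusMonths clips Jan 31 + 1 month to the end of February
-- rather than overflowing into March; time's "Clip" variant does the same.
oneMonthLater :: Day -> Day
oneMonthLater = addGregorianMonthsClip 1

main :: IO ()
main = print (oneMonthLater (fromGregorian 2012 1 31))  -- 2012-02-29 (leap year)
```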
Information on low-latency trading (HFT) is hard to come by. Recently Low Latency posted the audio from the London Low Latency Summit. This set of interviews and discussions contains some of the best publicly available information on topics such as FPGA feed handlers, which inter-continental fiber links are used, some of the costs, typical deployments, metropolitan fiber networks and strategy deployment. If you are interested in the HFT space, check it out.
If you have not read the excellent paper by Yaron Minsky and Stephen Weeks titled Caml trading – experiences with functional programming on Wall Street, I suggest you do. It covers why Jane Street picked OCaml as its primary language; many of the reasons relate to code safety.
Many of the same reasons drove Alpha Heavy Industries to pick Haskell as our primary development platform. I will publish in the future what I see as some of the shortcomings of the platform.
At Alpha Heavy Industries we are using Bloomberg Open Symbology (BSYM) in all our internal tools. As BSYM is in the public domain, we are making a versioned history of BSYM available in a git repo. These files will be updated daily. Please feel free to use them in your own projects.
I first started learning Haskell in early 2010. Coming from a .NET (C#/F#) background, I thought that Haskell would be very similar to F#. I was very, very wrong. When I first started with Haskell I was, like many people, extremely frustrated with the IO monad, which in Haskell is where all I/O happens. A chief complaint I had was that types get infected with an IO prefix. For instance, say a string is read from a file: its type is not just String, it is IO String.
It turns out that adding IO to the type is very useful. IO is used to denote unsafe operations that happen outside the runtime, such as reading from a file or a socket. The IO annotation simply marks the places where something unsafe and unpredictable can happen. By contrast, the rest of the code is what is called pure, which means the compiler can prove it is deterministic. That proof has great properties for optimization, since the optimizer in the compiler can rewrite pure code any way it wants.
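To make the distinction concrete, here is a minimal sketch (the file name is made up): the pure function's type promises nothing effectful can happen, while the file read wears IO in its result type.

```haskell
import Data.Char (toUpper)

-- Pure: no IO in the type, so the compiler knows this is deterministic.
shout :: String -> String
shout = map toUpper

-- Effectful: the result is not a String but an IO String.
readGreeting :: FilePath -> IO String
readGreeting = readFile

main :: IO ()
main = do
  writeFile "greeting.txt" "hello"  -- made-up file name for the demo
  s <- readGreeting "greeting.txt"
  putStrLn (shout s)                -- prints "HELLO"
```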
There is a reason this IO distinction is so important. I spent the last year writing Scala code on Hadoop. While Hadoop is frustratingly limited, using Scala with it made writing both map-reduce jobs and Hive UDFs much less verbose than writing them in Java. I started to like Scala, aside from the inherent limitations of the JVM. I then started building my first large-scale standalone app in Scala. That turned out to be a painful disaster. One of the most frustrating things I found was the varying semantics of Java streams. I also found writing highly asynchronous code in Scala to be very painful. I began to realize that the reason I had liked Scala was that all my code was pure: Hadoop was handling the IO (whether it did a good job of it is a different question entirely).
Using the IO monad to reason about I/O turns out to be much easier than the alternatives. When writing large-scale, high-performance apps in languages such as C++, tools and libraries typically have to be developed to compensate for the lack of an IO type annotation. In Haskell, the type checker alerts us when we have done something incorrectly, because the types will not line up. All this worrying about IO becomes very important when writing highly reliable software. Having your language do as much work as possible toward eliminating entire classes of bugs is very important in driving down the cost of developing top-quality software.
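As a sketch of the type checker at work (the file name and helper functions are hypothetical): trying to perform I/O from pure code simply does not type-check, so the effect has to be pulled out and sequenced in IO explicitly.

```haskell
-- Pure: the type says no effects can happen here.
double :: Int -> Int
double x = x * 2
-- double x = x * length (readFile "n.txt")
--   ^ rejected by the compiler: readFile yields IO String, not String

-- The effectful version wears IO in its type, so every caller
-- knows it touches the outside world.
doubleFromFile :: FilePath -> IO Int
doubleFromFile path = do
  contents <- readFile path
  return (double (read contents))

main :: IO ()
main = do
  writeFile "n.txt" "21"   -- hypothetical input file for the demo
  n <- doubleFromFile "n.txt"
  print n                  -- prints 42
```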
I am very pleased that the software I write now often works once it compiles, because the type system provides assurances that languages without as rich a type system, such as Java or C#, cannot. Consider me a convert.
Cloud computing has got to be the most overused term today. As an abstract term it serves a useful purpose for marketers, who understand that cloud computing will mean to the audience whatever the audience wants it to mean. For me, cloud computing means platforms like Amazon Web Services (AWS) and Azure, which at their core offer compute and storage services in a remote data center. If you don’t have EC2- and S3-like services, I don’t consider your offering a cloud. Note that I don’t particularly care that iCloud does not fit this definition, as it is a consumer service. I will be discussing the cloud from the perspective of a developer and of a business that prefers not to own physical infrastructure.
Last week saw NYSE announce its cloud. The release, as with many “enterprise” software press releases, is short on details. It probably means that NYSE will host more back-office applications in its data centers. It seems to be all the rage in finance today to call SaaS a cloud. See this piece about a back-office application being hosted remotely and called a cloud.
What would be cool is infrastructure for front-end strategy development. That would look something like Amazon’s EC2 and S3, with NYSE providing readily available data sets such as SuperFeed historical and realtime data, low-latency connections to the exchange data center, a large compute farm with a Tesla card in every machine, and APIs for accessing other pieces of NYSE’s infrastructure.
I am going to continue to use AWS. At least one company is building a finance focused platform on AWS although the approach I am taking relies on capabilities that are unlikely to be built into a commercially available platform. AWS has GPU compute instances which give it a decisive advantage over competitors such as Azure for building financial/trading applications.
So I moved my current codebase to compile against Scala 2.9. I had been using 2.8 and I wanted access to the parallel collections, which I was missing dearly from .NET. Much of my code used ArrayStack as a mutable collection. I have increasingly preferred immutable approaches; however, when simply tying together existing Java components this can be impractical due to the way some APIs work. With the upgrade to 2.9 I noticed two specific problems.
1. If you call ArrayStack.reverse and iterate through the result, the first item returned is actually the second item added to the collection.
2. When iterating through larger collections, the iterator yields null elements at the beginning. These are probably unused slots in the backing buffer, allocated as part of the growth strategy. I have not spent enough time to confirm this conclusively.
Given these problems with basic collections, I have removed ArrayStack from all of my code and have been using java.util.Vector[A] instead. This seems to have cleared up much of the bizarre behavior.
So Carlos Slim has taken a stake in The New York Times. I have been wondering who might step up to the plate, since the clock is ticking. Either Carlos knows something we don’t, or he is going to learn some of the same lessons as Sam Zell, who bought Tribune Co. Tribune is currently in bankruptcy. The game has changed and there are going to be winners and losers. I would bet that The New York Times is going to be a loser. Even people I know who read it online every day would not pay for it. News is now expected to be free. There are a few sites I follow that may survive: WSJ, FT and other sites that are important to high-income people. However, the idea that a newspaper is going to be at the forefront of modern media is patently ludicrous. The flip side is that Digg is not the answer either. Digg has never made money, and I don’t expect it to anytime in the near future, if ever.
So where does that leave us? NYT is going to hobble along as it slips into irrelevance. There will be consolidation in the media market. I expect that Reuters and AP will be the ultimate winners, providing global content to media outlets. Newspapers do not have the scale or the value add to be an effective product in today’s market.
If you consider yourself a financial neophyte, or are confused by what is going on today, check out The Ascent of Money. It is a good overview even if it glosses over some of the more technical details of modern finance. I don’t completely agree with his assessment of quantitative finance: many of the firms that are down considerably are not quant shops, and RenTech still seems to be raking in the money.
Considering his example of LTCM, it seems that an unhedged (or partially hedged) bet on volatility across the world is asking for trouble, and all strategies that revolve around it can become correlated. If I have learned anything, it is that in a crisis all (or almost all) correlations go to 1. Still, black-box systems can and do work, and program trading will continue to grow.
So, as has been well documented on the web, Positronic has been acquired by eBay. I spent last week on campus at eBay learning about every aspect of the company, from how the business is run and balancing buyer and seller needs to technology. I will be relocating to the Bay Area next month along with the rest of Positronic. I am very excited about this opportunity to work on search technology for an extremely large-scale service. For eBayers: I will be running an internal blog, and I will post the address when it is up. It will be updated far more frequently than this blog ever will.
The last 9 months have seen my abilities grow tremendously. Possibly the biggest influence on me has been Nathan Howell, formerly of the spam team at Microsoft. I have been fully embracing asynchronous programming and functional programming (F#). Nathan has also taught me a lot about building extremely scalable systems. I wish I could dive into the details of some of our projects, but alas I cannot.
On my own time I have been working on scratching one of my itches. I hate it when blogs talk about the same things, particularly tech blogs. Who really needs to read about Cuil from 10 different sources? I am a huge fan of figuring out how to digest the largest amount of information possible, and part of that is eliminating duplicate information to maximize time utilization. To that end I am working on a project that uses clustering techniques to eliminate duplicate news stories. The early results are encouraging; however, I am still working on getting the base tech running reliably. I am not tackling the problem of finding relevant information, just eliminating duplicates. I will post more here over the next few weeks/months.
.NET 4.0 is shaping up to be awesome. I love LINQ in 3.5. It has really changed the way that I write a lot of code, as well as the way I think about problems. There is a whole class of components and technologies that can probably be represented as LINQ expression trees. I have been following a couple of projects that are LINQ providers for CUDA, and probably eventually OpenCL. I am extremely interested in doing some GPU programming, and I am going to procure a CUDA card in the near future, as soon as I crack my case and check my power supply wattage.
On the topic of power, I have been reading James Hamilton’s blog and have found it to be very interesting. James breaks everything in a data center down into numbers, and that makes for very interesting reading. I really like how, when he makes a statement about why something is better or worse, he has numbers to go with it. If you have not read his posts on solid state drives, they are an absolute must. If you consider yourself to have any influence on hardware architecture at your organization, James’s blog is a must-read.
My goal for the next few months is to become a much better PowerShell user. I have started using it, and I want to be much faster and be able to use it for many more things than I do right now.