25 Jan 12 Scala Dynamo - a Scala (and Java!) API for Amazon’s Dynamo
Last week saw the announcement of Amazon Web Services’ NoSQL database as a service, Dynamo. Dynamo has a number of very interesting features, not least that it runs as a managed service with certain guarantees with regard to scalability, resilience and performance. From the pricing structure, I would say that Dynamo is probably more expensive than using something like MongoDB for larger datasets. However, I think Dynamo has an interesting role to play for startups and companies with smaller storage needs: Dynamo is perfect as an “early NoSQL database” for startups - you outsource the expensive tasks of managing and scaling the database to Amazon until the point where your business concept has either been proved or disproved. If your concept is dead in the water, your data storage requirements may never reach the threshold where other solutions are appreciably cheaper. If your concept is a roaring success, well, then you have every reason in the world to put in the effort to migrate from Dynamo to Mongo, Cassandra or something else.
A simple Dynamo API for Scala and Java
With that short intro to where I think Dynamo fits in, I thought I’d unveil Scala Dynamo, an open source project I quickly put together to make working with Dynamo a breeze in Scala or Java. The standard AWS Java APIs are verbose to say the least, but with Scala Dynamo you will be able to save and load, in a single line, Scala case classes or Java beans that fit the Dynamo way of storing data. The main thrust of Scala Dynamo so far has been serializing case classes and Java bean classes into Dynamo and deserializing them back. I might add more features in the future as I familiarize myself further with Dynamo, but in the meantime, I think the library should already be quite useful to anyone using a JVM language and wanting to work with Dynamo.
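To give a feel for what mapping a case class to and from Dynamo-style attributes involves, here is a minimal sketch. It is not Scala Dynamo’s actual API (the GitHub repository has the real examples): the `AttributeFormat` trait, the `Person` class and the in-memory `FakeTable` are hypothetical names made up for illustration, and no AWS calls are made.

```scala
// Hypothetical sketch: NOT the Scala Dynamo API, and no AWS calls are made.
// It only illustrates turning a case class into a flat attribute map and back.

// Type class describing how to convert A to and from a flat attribute map.
trait AttributeFormat[A] {
  def toAttributes(a: A): Map[String, String]
  def fromAttributes(attrs: Map[String, String]): A
}

case class Person(id: String, name: String, age: Int)

object Person {
  // Hand-written mapping; a real library would typically derive this for you.
  implicit val format: AttributeFormat[Person] = new AttributeFormat[Person] {
    def toAttributes(p: Person): Map[String, String] =
      Map("id" -> p.id, "name" -> p.name, "age" -> p.age.toString)
    def fromAttributes(attrs: Map[String, String]): Person =
      Person(attrs("id"), attrs("name"), attrs("age").toInt)
  }
}

// In-memory stand-in for a table, keyed by the "id" attribute.
class FakeTable {
  private var items = Map.empty[String, Map[String, String]]

  def save[A](a: A)(implicit f: AttributeFormat[A]): Unit = {
    val attrs = f.toAttributes(a)
    items = items.updated(attrs("id"), attrs)
  }

  def load[A](id: String)(implicit f: AttributeFormat[A]): Option[A] =
    items.get(id).map(f.fromAttributes)
}

object AttributeSketchDemo {
  def main(args: Array[String]): Unit = {
    val table = new FakeTable
    table.save(Person("42", "Ada", 36)) // one line to save
    println(table.load[Person]("42"))   // one line to load
  }
}
```

The point of the type class is that the save and load calls stay one-liners, while the verbose attribute handling lives in one place per class.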
For examples, source code etc, check out Scala Dynamo’s GitHub repository!
12 Jan 12 On Complexity, Scala & Development Practices
Over the last few days there has been a lot of discussion about the perceived complexity of Scala. It follows Scala creator Martin Odersky’s response to a critical blog post, in which he suggested that compiler flags may be introduced to let people turn on and off certain language features that may be perceived as “complex”.
The Nature of Complexity
There are actually two things that worry me about Odersky’s response: firstly, it is an implicit admission that maybe Scala is actually too complex. Secondly, if that is the case, the response is the wrong one, as it will inevitably result in even more complexity.
Let’s start with my second point - complexity, especially in software, is often the result of one of several common pitfalls:
- Ill-considered design decisions where functionality is too coarse-grained to be easily composable for the actual use-case.
- Too many ways of doing the same thing, leading to confusion over what the best practices are.
- Unnecessary features/“options” being added without there being a clear-cut need for them.
- “Flags” being used instead of clean, composable abstractions.
By any measure, adding options will only serve to add complexity - if you want to remove complexity and make things simpler, you should be looking at what can be removed or reduced, not at what could be added. In my opinion, either things should remain as they are, or if the features that are being questioned truly are too complex (which I’m not sure about), they should slowly be phased out of the language with the admission that they might have gone too far. Either a feature belongs in the language or it doesn’t; it’s pretty simple.
It bears repeating: complexity is almost always the result of trying to do too many things, trying to please too many people and giving people too many alternatives. Adding yet another alternative will only increase complexity, even if the purpose is the opposite. Good design is based on doing the bare minimum, the bare essentials and nothing more, YAGNI and all of that.
Dealing With A Learning Curve and PERCEIVED complexity
I’m not sure Scala is actually as complex as some claim it is, but I don’t dispute in the least that anything that has a learning curve will initially be perceived as complex. Teams adopting Scala obviously need to manage that learning curve by using solid and agreed development practices. But this is not a problem that is unique to Scala: teams developing in any language will typically put in place a set of commonly agreed principles and practices. These practices will usually range from granularity of testing (what is a unit test, what is an integration test etc?), how to guard against common bugs (running something like FindBugs) and how build processes work, to how code is formatted. There are many excellent build plugins for most languages and build tools that let teams enforce these practices.
My point here is that most teams have to deal with these issues anyway, so for a team adopting Scala, it’s all about managing it in the traditional way. In this sense it would make a LOT more sense if guarding against potentially confusing Scala features was done as part of the team’s regular work, and perhaps enforced by some useful build tool (sbt) plugin. The language or compiler level seems to me to be the wrong place to fix an issue that is mostly about managing a collective learning curve. Yes, a compiler flag might be the quick, expedient way of addressing it, but I’m afraid it is the wrong place, and also risks the usual pitfalls that come with “quick, expedient fixes”.
Final Thoughts
In my opinion, Martin Odersky and his team need to work out for themselves whether Scala truly is too complex and some rough edges need to be entirely phased out of the language, or if it is simply an issue of managing learning curves (which is my opinion). If it is the latter, then I believe it should be dealt with by optional sbt plugins and IDE features that help teams manage their learning curve, and not anywhere else.
20 Dec 11 The Dark Side of Big Data: Pseudo-Science & Fooled By Randomness
Over the last couple of months I have read up on volumes of Technical Analysis (“TA”) information and back tested probably hundreds of automated trading strategies against massive amounts of data, both exchange intraday and tick data as well as other sources. Some of these strategies have been massively profitable in back testing, others not so much.
I discarded some of the TA patterns before they even left the book, because they did not stand up to any sort of scientific scrutiny: they lacked a clear predictive thesis, were riddled with forward-looking bias (“Head and Shoulders patterns”), and in some cases were just plain bulls**t (“Elliott Wave Principle” comes to mind).
The outcomes of my testing have made me think about the implications of large scale data analysis in general: it is very easy to get fooled by randomness. In many cases the results of my testing have been amazing, but I cannot come up with a plausible causal explanation as to why, and when I nudge the parameters ever so slightly, outcomes can look entirely different.
Taking a step back from the data, looking at it in a larger perspective, I’m inclined to conclude that if data across multiple parameter variations looks like a random walk and lacks a plausible causal explanation, then it is a random walk.
If I cannot say “X is caused by A and B”, I’m inclined to believe that the actual reason is “X is the result because A and B fit the historical data D, but may not do so in the future”.
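The parameter-sensitivity check described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not one of the strategies I actually tested: the price series is a seeded Gaussian random walk (so by construction there is no pattern to find), the strategy is a hypothetical “long when above the moving average” rule, and the names `randomWalk` and `backtest` are made up for this example.

```scala
import scala.util.Random

object FooledByRandomness {
  // A pure random walk: each step adds Gaussian noise, so there is no signal to discover.
  def randomWalk(n: Int, seed: Long): Vector[Double] = {
    val rng = new Random(seed)
    Vector.iterate(100.0, n)(_ + rng.nextGaussian())
  }

  // Toy strategy: be long for the next step whenever the current price is above
  // its moving average over the last `window` prices. Returns profit in price points.
  def backtest(prices: Vector[Double], window: Int): Double =
    (window until prices.length - 1).map { t =>
      val avg = prices.slice(t - window + 1, t + 1).sum / window
      if (prices(t) > avg) prices(t + 1) - prices(t) else 0.0
    }.sum

  def main(args: Array[String]): Unit = {
    val windows   = 5 to 50 by 5
    val inSample  = randomWalk(2000, seed = 1L)
    val outSample = randomWalk(2000, seed = 2L)

    // Same strategy, slightly different parameter: compare the spread of outcomes.
    val results = windows.map(w => w -> backtest(inSample, w))
    results.foreach { case (w, pnl) => println(f"window=$w%2d in-sample P&L=$pnl%8.2f") }

    // Pick the "best" parameter with hindsight, then try it on fresh random data.
    val (bestWindow, bestPnl) = results.maxBy(_._2)
    println(f"best window=$bestWindow in-sample=$bestPnl%.2f " +
            f"out-of-sample=${backtest(outSample, bestWindow)}%.2f")
  }
}
```

Whatever numbers this prints, any difference between windows is noise, because the data has no structure. If a real back test shows the same kind of scatter across neighbouring parameters, that is a hint that the “pattern” is a fit to the historical data D rather than a cause.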
And herein lies the crux of the matter: how many data scientists are inclined to take a step back, rather than just assume that there is a pattern there? How many are prepared to do so if their livelihood is largely based on them finding patterns, rather than discarding them because they do not hold up to deeper scrutiny? I’d say very few.
My conclusion is that the age of Big Data will see a radical increase in pseudo-scientific “discoveries”, driven by an interest in announcing great new “patterns”. This pseudo-science will pervade academia, the public sector and the private sector alike. God knows I’ve seen a fair number of academic research papers already that simply do not hold up if you investigate their thesis more deeply.
I suspect we will arrive at a point much like with any new technology whereby people will tire of the claims made by “Big Data Scientists”, because at least half of what they say will have been proven to be hokey and pseudo-science in the pursuit of being able to make even more outlandish claims in a game of one-upping the competition. Some of this will be driven by malice and self-interest, but I suspect in equal parts it will be driven by ignorance and perverted incentives putting blinders on people in the business.
29 Nov 11 TextMinr Progress Update
Since I announced TextMinr, the text mining as a service platform, last week, the interest has been overwhelming: it got to the top 3 of Hacker News, and it got retweeted by Tim O’Reilly of O’Reilly Media and many other influential people in the tech scene.
Progress over the last week has been good: the public API is coming along and it’s quite likely we will start to drip-feed invitations to the beta towards the end of next week. I say “drip-feed” because so many people have declared interest that we’ll have to do it slowly to scale the service in a managed way as the load grows. Everyone who has signed up will get an invite (and people can still sign up); it just might take a week or two longer before everyone gets in.
Further, the first beta drop will likely be primarily a developer-oriented release. We will provide APIs and documentation for:
- web scraping
- name/entity recognition
- subject identification
- context searching/extraction (extracting important words and sentences for a search term)
- document similarity (comparing two documents/urls for similarity of content)
- RSS listening that hooks into all of the above
The other functionality that has been mentioned, such as sentiment analysis, classification, Twitter and Facebook monitoring and analytics will be rolled out in the weeks thereafter (it’s a Beta after all - functionality will be continuously improved and added).
The first release may be mostly developer-oriented, but in coming releases we’ll hopefully improve the polish of the web experience as well, so that non-developers will eventually also be able to create their own analysis pipelines and analytics reports.
Stay tuned!
21 Nov 11 Making Text Mining Accessible to Any Developer & Non-Expert
On the back of what I wrote the other week about machine intelligence, I think another important step is democratizing use of machine learning & intelligence software: making it accessible to people and companies that don’t have a PhD or deep pockets to hire one. This has thus far been the domain of experts and laborious manual work. I think this has to change.
In that spirit, I’m launching TextMinr - Text Mining as a Service, and we’re accepting applications for Beta users as of now.
The plan is to expose large parts of the underlying technology that drives GreedAndFearIndex in the shape of REST APIs and a web console/dashboard, so that others can innovate on top of it and make use of state-of-the-art text mining and natural language processing technology without having to spend years learning how it all works.
Pricing is still to be decided, but it will definitely be accessible to anyone with an idea: our current thinking is simple pay-as-you-go pricing, where anyone will be able to dip their toes in and test our technology out without having to pay an arm and a leg. I think it’s the fair way to go: if you barely use it, then you barely pay for it; if you process and analyze half the internet on a daily basis, well, then you’ll probably pay a little bit more.
You can sign up right now for the Beta, which will be available soonish; all we want in return is your feedback. So if you’re interested, please do sign up!
4 Nov 11 The Big Picture: True Machine Intelligence & Predictive Power
At the beginning of last week, I launched GreedAndFearIndex - a SaaS platform that automatically reads thousands of financial news articles daily to deduce what companies are in the news and whether financial sentiment is positive or negative.
It’s an app built largely on Scala, with MongoDB and Akka playing prominent roles to be able to deal with the massive amounts of data on a relatively small and cheap amount of hardware.
The app itself took about 4-5 weeks to build, although the underlying technology in terms of web crawling, data cleansing/normalization, text mining, sentiment analysis, name recognition, language grammar comprehension such as subject-action-object resolution and the underlying “God”-algorithm that underpins it all took considerably longer to get right.
Doing it all was not only lots of late nights of coding, but also reading more academic papers than I ever did at university, not only on machine learning but also on neuroscience and research on the human neocortex.
What I am getting at is that financial news and sentiment analysis might be a good showcase and the beginning, but it is only part of a bigger picture and problem to solve.
Unlocking True Machine Intelligence & Predictive Power
The human brain is an amazing pattern matching & prediction machine - in terms of being able to pull together, associate, correlate and understand causation between disparate, seemingly unrelated strands of information it is unsurpassed in nature and also makes much of what has passed for “Artificial Intelligence” look like a joke.
However, the human brain is also severely limited: it is slow and its immediate memory is small. We can famously only keep track of 7 (+/-2) things at any one time unless we put considerable effort into it. We are awash in amounts of data, information and noise that our brain is evolutionarily not yet adapted to deal with.
So what I’m working on is not just a SaaS sentiment analysis tool; it is the first step towards a bigger goal (which, admittedly, I may not solve, or not solve in my lifetime):
What if we could make machines match our own ability to find patterns based on seemingly unrelated data, but far quicker and with far more than 5-9 pieces of information at a time?
What if we could accurately predict the movements of financial markets, the best price point for a product, the likelihood of natural disasters, the spreading patterns of infectious diseases or even unlock the secrets of solving disease and aging themselves?
The Enablers
I see a number of enablers that are making this future a real possibility within my lifetime:
- Advances in neuroscience: our understanding of the human brain is getting better year by year. The fact that we can now look inside the brain on a very small scale and that we are starting to build a basic understanding of the neocortex will be the key to the future of machine learning. Computer Science and Neuroscience must intermingle to a higher degree to further both fields.
- Cloud Computing, parallelism & increased computing power: computing power is cheaper than ever with the cloud, the software to take advantage of multi-core computers is finally starting to arrive and Moore’s law is still advancing as ever (the latest generation of MacBook Pros has roughly 2.5 times the performance of my barely 2 year old MBP).
- “Big Data”: the data needed to both train and apply the next generation of machine learning algorithms is abundantly available to us. It is no longer locked away in the silos of corporations or the pages of paper archives; it’s available and accessible to anyone online.
- Crowdsourcing: There are two things that are very time intensive when working with machine learning - training the algorithms, and once in production, providing them with feedback (“on the job training”) to continually improve and correct. The internet and crowdsourcing lower the barriers immensely. Digg, Reddit, Tweetmeme and DZone are all early examples of simplistic crowdsourcing with little learning, but where participants have a personal interest in participating in the crowdsourcing. Combine that with machine learning and you have a very powerful tool at your disposal.
Babysteps & The Perfect Storms
All things considered, I think we are getting closer to the perfect storm for taking machine intelligence out of the dark ages where it has lingered far too long and quite literally into a brave new world where one day we may struggle to distinguish machine from man and artificial intelligence from biological intelligence.
It will be a road fraught with setbacks, trial and error where the errors will seem insurmountable, but we’ll eventually get there one babystep at a time.
I’m betting on it and the first natural step is predictive analytics & adaptive systems able to automatically detect and solve problems within well-defined domains.
30 Sep 11 Developers Must Feel the Pain of Operations
I firmly believe that software developers not being responsible for their software in production is as damaging, bad and stupid as bankers not being responsible for their losses. To further the analogy by paraphrasing a commonly used derogatory term about banking, developers not being responsible for the daily running of their software encourages “casino software development”.
Developers become enticed to take shortcuts, since they know they will probably be on a different project altogether once the software actually goes into production - it won’t be their problem anymore.
The disconnect between “software developer”, “tester”, “support engineer” and “systems administrator” that is so common today is one of the most destructive practices we have in software engineering. In many contemporary organizations, software developers rarely have to live with the pain of the shortcomings of their software, except for what turns up in testing. What is forgotten in that equation is that testing is often limited: it does not deal with the pain of evolving, improving and maintaining software in an ongoing operation.
There is one simple rule to human behaviour as it pertains to business and the workplace: most of us are not too concerned with pain/issues caused by our actions if they don’t fall on us and it is highly unlikely that anyone will be able to pin them on us. It is an unfortunate order of things, but you only have to see the maintenance and operations issues in just about any software product where the developers move on after getting “sign off”.
In simple terms, I think the traditional way of doing software development is wrong. We should not be having “software developers”, “testers”, “system administrators” and “support engineers” as separate roles. They should all be rolled into a single role. Yes, we may have people with slightly different expertise, spending slightly different proportions of their time on the different concerns, but on the whole, the pain of both creating and running software should be one, shared by the whole of the team.
If everyone knows they have to live with the pain of any shortcuts they take today, they are much less likely to take them in the first place, and if they do take them, much more likely to do so as a carefully weighed conscious decision and much more likely to address them at the first opportunity when they encounter the pain.
UPDATE
I omitted the role of “analyst” from this post originally, mostly because I think it more than anything shouldn’t exist. Everyone should be an analyst, ready to challenge and firm up requirements based on what the ultimate goal is.
13 Sep 11 Testing IO Client Code Easily with Functional Programming
Testing code that uses Input/Output is hard, right? I certainly used to think so before my forays into Functional Programming.
The code below is in quite a common style of programming that you’ll find in a lot of Java applications that write to or communicate with some remote network location, where you have to transform some input data into the target output:
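A minimal sketch of that style follows. All names here (`SomeInput`, `ReportUploader`, `processAndUpload`, the path) are hypothetical, `RemoteFS` is a local stand-in that only pretends to make a network call, and the sketch is written in Scala rather than Java to match the rest of this post.

```scala
// Illustrative sketch only; all names are hypothetical.
case class SomeInput(lines: List[String])
case class SomeOutput(payload: String)

// Stand-in for a client that writes to a remote file system over the network.
object RemoteFS {
  def write(path: String, output: SomeOutput): Unit =
    println(s"(pretend network call) writing ${output.payload.length} chars to $path")
}

class ReportUploader {
  // Transforms the input and writes it straight to the remote location.
  // Nothing is returned, so a test can only check the result on the remote side.
  def processAndUpload(input: SomeInput): Unit = {
    val output = SomeOutput(input.lines.map(_.trim.toUpperCase).mkString("\n"))
    RemoteFS.write("/reports/latest.txt", output)
  }
}
```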
There are a couple of problems with this code: Firstly, even during testing of the actual code, you will hit the remote file location. Secondly, the method does not return anything, thus the only way to assert that you got the expected result is to actually check the remote location for the result, once again creating an environment and network dependency for simple testing.
Of course, there are two obvious ways to solve this problem: mocking, or stubbing out the remoteFS interface with a stub implementation that is injected. However, I don’t like either of these solutions - apart from the overhead, I think they have a couple of issues. Firstly, creating an interface and stubbing it out creates an unneeded interface where you likely only have one implementation; in other words, you’re introducing the Java disease of “MyInterface” and “MyInterfaceImpl” classes littered unnecessarily all over the place. As for mocking, it might introduce the above, in addition to the general issue I find with mocks: you are not blackbox testing behavior, but instead you find yourself testing that the internal calls of a method are done in an expected order - you’re breaking up the black box.
Scala and Functional Programming to the Rescue!
As I write most of my code in Scala these days, there is a very simple solution, given that we can use functions as first class members. Consider the following code:
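A minimal sketch of the idea, assuming hypothetical names throughout (`SomeInput`, `ReportUploader`, `processAndUpload`, the path) and a local `RemoteFS` stand-in that only pretends to make a network call:

```scala
// Illustrative sketch only; all names other than SomeOutput are hypothetical.
case class SomeInput(lines: List[String])
case class SomeOutput(payload: String)

// Stand-in for a client that writes to a remote file system over the network.
object RemoteFS {
  def write(path: String, output: SomeOutput): Unit =
    println(s"(pretend network call) writing ${output.payload.length} chars to $path")
}

class ReportUploader {
  // `writer` defaults to the real remote write, so production callers can ignore it.
  // Tests pass their own function and get a handle on the output instead.
  def processAndUpload(
      input: SomeInput,
      writer: SomeOutput => Unit =
        (out: SomeOutput) => RemoteFS.write("/reports/latest.txt", out)
  ): Unit = {
    val output = SomeOutput(input.lines.map(_.trim.toUpperCase).mkString("\n"))
    writer(output)
  }
}

object ReportUploaderCheck {
  def main(args: Array[String]): Unit = {
    // Test-style usage: capture the output rather than writing it anywhere.
    var captured: Option[SomeOutput] = None
    new ReportUploader().processAndUpload(SomeInput(List(" a ", "b")), out => captured = Some(out))
    assert(captured == Some(SomeOutput("A\nB")))
  }
}
```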
So what have we done here? Well, the main difference is that we are passing in a function that takes a value of type “SomeOutput” as an argument into the function that does most of the work, with a default argument already provided for production runtime. Most actual client code will never change this default argument, hence will not need to bother with it, while during testing, we actually have a useful hook in to decouple us from the remote environment and protocol in a simple way, while at the same time effectively getting a “return value” by virtue of being able to pass in a test function that gives us a handle to the result instead of writing it anywhere.
On the surface, this solution may look a lot like stubbing an interface out, but the benefit is that we do not afflict our system with the “1 interface, 1 implementation” disease, we don’t have to create a lot of implementations of any testing forced interfaces/traits and we gain what amounts to an “almost return value” for the function if we wish to.
This is a relatively simple example, but it demonstrates how to make IO client code more testable and get rid of an all too common code smell in IO client code in a very non-intrusive way if you’re using Scala.
I would also add that in Functional Programming, I, at least, consider functions that do not return anything to be a code smell. However, in the real world, you may genuinely have a few cases (though fewer than you’d think) where things go in or out, but nothing comes back (writing to a log or file would be one such case) - the above simple pattern is one way of partially getting rid of that code smell.
9 Sep 11 Continuous Deployment = Continuous Business Improvement
One of the benefits of working on a startup project of which I am the owner is that I can do things the way I want to.
One of the things I was keen to try out was continuous deployment, and for the last week or so I have done it, although there are about a hundred rough edges to the process to shave off before it is truly “continuous” and fully automated.
Surprisingly, the greatest benefits have not been technical, although those have been abundantly clear as well. From a technical point of view, it’s quite clear to me that smaller incremental releases de-risk change a lot, as long as you have thorough testing in place, a fallback plan and judicious datastore backups should something go horribly wrong.
But aside from the technical benefits, the business and product development benefits stand out to me as the greatest wins: being able to quickly see and use the system, not only in a demo mode but in production, means that my understanding of the product I am building and its uses is evolving much quicker than if I tried to do “big bang” releases with lots of functionality crammed in. The cost of change is low, as is the cost of “getting something wrong”. But most of all, continuous deployments are allowing me to develop the business proposition of the product iteratively and rapidly, as assumptions and theories can quickly be validated or discarded.
To me, it is becoming abundantly clear that continuous deployment is not primarily about technology, even though the technology benefits too. It’s about continuous improvement of your business, both in terms of market fit and day-to-day operations.
21 Aug 11 The Problem with the Scala Community
I think Functional Programming in general and Scala in particular suffer from a serious dysfunction that risks impeding the general acceptance of FP (and FP inspired) languages such as Scala.
No, it’s not the “complexity of Scala”, because Scala really is no more complex than Java or any other language, regardless of the FUD that people too lazy to learn and too vested in the old ways perpetuate to excuse themselves from learning anything new.
Most of the Scala community is made up of great, smart and very helpful people, the sort of people that are a pleasure to interact with and learn from. But then there’s a small subset of the community which is made up of fragile, insecure egos who seem to get off on intellectual intimidation, wilfully confusing others to make themselves feel smart and superior at the expense of others.
This is not a new phenomenon; in fact, it’s probably as old as the IT industry itself. I’ve seen techies spout off dense technobabble for as long as I’ve been in the industry, almost always in an attempt to intellectually intimidate and confuse “the other party”, hoping that they never ask the obvious question or point out that the emperor has no clothes. Whenever the early adopters find new toys to get excited about, it inevitably happens: 10 years ago it was the clusterf*** that was J2EE and J2EE “design patterns”, today it is all about actors, monads, iteratees and applicative functors. Different names, same overexcitement, overindulgence and insecure egos coming out to assert their “intellectual superiority”.
Functional programming has its roots in mathematics, which means that there is some dense language and notation to be memorized at times. But, as is often the case with mathematics, the language and notation are harder than the actual concepts and problem solving: most of them can be quite easily explained in plain English. There is practically no good reason why FP concepts, algorithms and data structures could not be explained in plain English assisted by code examples, and quite a lot of the time such explanations would actually turn out to be much briefer.
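As a hedged illustration of that claim, using only the standard library’s `Option`: for everyday purposes a monad can be introduced, roughly, as “a wrapper with a `flatMap` that lets you chain steps, each of which may fail”. The helpers `parseAge` and `checkAdult` are hypothetical names made up for this example.

```scala
import scala.util.Try

object PlainEnglishMonad {
  // Each step may fail, which it signals by returning None.
  def parseAge(s: String): Option[Int] = Try(s.trim.toInt).toOption
  def checkAdult(age: Int): Option[Int] = if (age >= 18) Some(age) else None

  // flatMap chains the steps and stops at the first None.
  def adultAge(s: String): Option[Int] = parseAge(s).flatMap(checkAdult)

  def main(args: Array[String]): Unit = {
    println(adultAge("42"))  // both steps succeed
    println(adultAge("12"))  // second step fails
    println(adultAge("abc")) // first step fails
  }
}
```

One sentence and a dozen lines of code will not cover the formal laws, but they are usually enough for a newcomer to start using the concept.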
I love the Scala community and the eco-system around the language, but quite frankly, to foster the continued growth of the community, we need to call out the small minority of bullies on their bullshit whenever it occurs: being able to confuse and intellectually intimidate newbies to the language and community is not proving your supposed “intellectual superiority”, it is merely a reflection of an insecure ego and social incompetence.
UPDATE
Please read Dhananjay Nene’s comment in the comments section; it’s an insightful comment on the subject, worthy of equal billing with this blog post.
- update 2: by calling out “bullies”, I mean calling out the behaviour and correcting it, rather than pointing out people and scapegoating. Anything else is as unhelpful as the original act.
