Latest Blog Posts
- This blog has moved
- Using Emacs for Scala Development - A Setup Tutorial
- Software is the Central Nervous System of Modern Business
- Most of the code I wrote >6 months ago sucks
- Killing the “U” in CRUD: making databases immutable
- Scala Dynamo - a Scala (and Java!) API for Amazon’s Dynamo
- On Complexity, Scala & Development Practices
- The Dark Side of Big Data: Pseudo-Science & Fooled By Randomness
- TextMinr Progress Update
- Making Text Mining Accessible to Any Developer & Non-Expert
8 Nov 12 This blog has moved
This blog has moved. Old posts will remain here, but new posts will appear at www.recursivity.com/blog
Recently I’ve become disillusioned with the state of the JVM IDE: I haven’t touched Eclipse in anger in 3 years, and have no intention to do so anytime soon. IntelliJ, on the other hand, has served me decently, but it suffers from enough odd bugs when dealing with Scala code to be a constant minor nuisance in my workflow. Not only that, “modern” IDEs are massive screen-space hogs: try working with IntelliJ without at least a 1920x1200 resolution screen.
For these reasons, I’ve decided to try using Emacs for development for a month or so to see how it compares, and so far, so good.
Setting Up Emacs for Scala Development
Setting up Emacs for Scala development is fairly easy; just do the following:
Create a folder called ~/.emacs.d/scala-mode (in your home folder). You might already have an .emacs.d folder there; if not, create it.
Copy all contents from $SCALA_HOME/misc/scala-tool-support/emacs into ~/.emacs.d/scala-mode. I found that this folder is missing from the latest Scala 2.9.2 distribution, but you can get it from the 2.9.1 distribution. The current equivalent content on GitHub also has some issues, so until a new version is released, prefer the 2.9.1 distribution for Emacs tool support.
Create a folder called ~/.emacs.d/ensime.
Download ensime, and extract the contents of the zip file into ~/.emacs.d/ensime.
Add the following contents into a file called ~/.emacs.d/init.el: https://gist.github.com/2499183
You should now have ensime and Scala-mode installed in Emacs. Now it’s time to set up SBT!
Setting up SBT
To set up sbt, all you need to do is create a file called ~/.sbt/plugins/build.sbt with the following content (or add the content to the file if it already exists): https://gist.github.com/2499204
This sets up the ensime sbt plugin as a global sbt plugin, available to all of your projects.
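For reference, a minimal sketch of what such a file might look like is below. The exact plugin coordinates and version are in the gist above; the ones shown here are assumptions for illustration only.

```scala
// ~/.sbt/plugins/build.sbt
// Minimal sketch only: the group id, artifact id and version below are
// placeholders - use the content of the linked gist for the real thing.
addSbtPlugin("org.ensime" % "ensime-sbt-cmd" % "0.1.0")
```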
Using Emacs with SBT and Scala
This should conclude the setup of your environment. For existing SBT projects, it may be a good idea to clean out any target and project/target folders, as I have found that stale build output sometimes interferes with the ensime sbt plugin (making it unavailable in sbt).
To generate a .ensime file, which lets Emacs use the same classpath as your SBT project, simply start sbt in your project and enter “ensime generate” at the sbt console.
Once you have done this, start Emacs from the folder of your project, run M-x ensime to start the ensime project and off you go!
When running Ensime and Scala-mode, Emacs will support code completion, compilation, organising imports, refactorings like renaming, and much more. For complete details, please refer to the Ensime documentation.
If you want a little more, like class templates and the like, you may want to check out Yasnippet as well (I have not yet had time to do so).
Software is the Central Nervous System of modern business. This is something Bill Gates asserted way back in 1999 in his book “Business @ the Speed of Thought”, and it is even more true in 2012 than it was in 1999.
Software drives modern business; it is everywhere. It drives the ad campaigns that attract customers; it drives the sales, whether they come in via the web, a call centre or a till in a shop. It drives the stock management system that checks whether goods are available and need resupplying. It drives the supply chain, it drives the pace with which goods or services are delivered, and it drives the financial transactions that support it all. In short, the claim that software is the Central Nervous System of modern business is a relatively uncontroversial one.
Organisational Parkinson’s or Alzheimer’s
Yet senior management in many organisations treat software with a large amount of contempt, as if it were not important. Imagine suffering from a cancerous tumour in your brain: would any right-thinking person scour every corner of the earth not for the best, but for the cheapest surgeon to remove it?
I wouldn’t think so, yet this is exactly how senior management in many organisations deal with the surgeons of their organisational Central Nervous System: never mind ability, who can make the loftiest promises at the cheapest quoted price? It doesn’t take a genius to see that this is madness, and in many cases akin to suicide by inviting a slow degenerative disease of your organisation’s central nervous system.
You have probably seen this in your travels if you have been in software long enough: siloed systems and data, the left hand unaware of what the right hand is doing, endless death marches failing to deliver, and many other things that eventually result in the organisation slowly but surely becoming unable to react to changing market conditions, unable to seize or even recognise the opportunities in front of it.
Though you may not realise it, this is exactly what the organisational version of Parkinson’s and Alzheimer’s disease looks like.
Senior Managers and Executives everywhere are treating the Central Nervous Systems of their businesses with dangerous disdain by means of neglect, prioritising cost over value and seeing software as separate from the rest of their business operations. It can only end in one way: the slow onset of a degenerative disease that at first will seem like a minor nuisance, but that will eventually ensure that what may once have been a glorious, successful organisation slowly fades away and becomes a pale shadow of what it was, before it ultimately meets its end.
The thing about learning a new programming language is that it is a humbling experience. You may have been an expert in another language after years of honing intricate knowledge of its behaviours and foibles, yet when learning a new language that counts for very little: at best your experience will help you accelerate your learning of the new language, at worst it will be a ball and chain of old, bad habits holding you back.
After 2-3 years of using Scala, I’d like to think I am approaching the level of a competent Scala developer (though Dunning-Kruger is always a risk), but I know that there is undoubtedly a lot of depth whose surface I have not even scratched yet. I find myself taking spurts of improvement about every 2-3 months, and find myself cringing at some of the code I wrote further back than that. Code that is older than 6 months I positively want to disown and deny any knowledge of (though that might be hard, as much of it is open source).
Going from an imperative style to more functional style, I’ve noticed the following changes in my own code over the last 6-9 months:
- Code I thought was fairly “functional” 6 months ago in fact has a lot of very obvious imperative “smells” when I look at it today.
- Though I knew mutability was bad, 9 months ago I still “cheated” occasionally as my mind had not yet shifted fully to solve all problems immutably.
- I define fewer vals (and no vars) these days, preferring more concise composition of functions.
- These days I find functions that return “Unit” smelly: returning Unit to me implies that a function is probably side-effecting.
- A lot of code now boils down to simply manipulating Lists and Options with filter, folds, maps and flatMaps (see point 3 - and maybe LISP got it right in the naming of the language; there is a small sketch of this style after this list).
- 6 months ago I really didn’t get implicits. At all. Now I feel I’m starting to have a decent appreciation of where they fit and where they don’t.
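To make the last few points a little more concrete, here is a small, contrived sketch of the style I mean (the domain and data are made up purely for illustration): no vars, no Unit-returning functions, just Lists and Options composed with filter, flatMap and fold.

```scala
// Hypothetical domain type, for illustration only.
case class User(name: String, age: Int, email: Option[String])

val users = List(
  User("Alice", 34, Some("alice@example.com")),
  User("Bob", 17, None),
  User("Carol", 28, Some("carol@example.com"))
)

// Emails of all adult users: filter + flatMap instead of loops and mutation.
val adultEmails: List[String] = users.filter(_.age >= 18).flatMap(_.email)

// Total age as a fold rather than a var accumulated in a loop.
val totalAge: Int = users.foldLeft(0)(_ + _.age)
```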
These are a few of the things I have noticed in the evolution of my Scala code. I’m sure looking back in another 6 months, I will find a lot of bad smells in code I write today: I would be slightly concerned if that wasn’t the case, as it would imply I haven’t been learning enough.
To ensure I think my Scala code from today “stinks” in six months’ time, I will try to integrate more perspectives into my toolbox: though I’m unlikely to use any of them for production code, I will try to improve my Clojure skills, and perhaps learn me some Haskell, you know, for great good. Even if you only use a few of your tools, awareness of a wide variety of tools in your toolbox can only improve you as a programmer.
As I’ve gotten deeper into Scala and functional programming, a natural consequence is getting rid of mutable state and favouring immutability. Immutable data structures on their own get rid of a whole raft of defects and anti-patterns that traditional Java applications usually suffer from - immutability gives you a whole new level of confidence in the integrity of your code. If you come from an imperative background like me, it takes a while to shift, but once you see the benefits, there’s no going back.
A natural extension of shifting from mutability to immutability is to change your thinking about state: from a set of variables that change over time, to a sequence of (immutable) values over time - in formal terms, Functional Reactive Programming. If you’re not already doing it, give it a try: once your mind makes the shift, your application code will be much more concise, flexible and free of defects.
However, this way of programming may seem at odds with the traditional way we think about databases, be it SQL databases or NoSQL databases: most of us are used to mutating state in data stores, creating, reading, deleting and most pertinently updating data.
Enter Event Sourcing
When it comes to updating data in databases, it is worth asking one fundamental question: do we really need to, or is there another way of dealing with “updates”? If you think about how most data in most applications works, it tends to fall into two core categories (though not always):
- Data that has one natural entry (user settings, preferences etc)
- Data where each event leading to the end-state is important, and where the end-state can be deduced from the sequence of events (account entries in a bookkeeping application)
In the first instance, not updating may be a little contrived, but it’s still workable: version the data and keep a few versions back, retrieving data by the latest version.
The second example, on the other hand, falls naturally into the category of data that really shouldn’t be updated. In many traditional applications the end-state may be kept in a database column, but in actual fact the end-state is likely more reliably arrived at by summing up all the events that lead up to it. If there was a mistake in an entry, it can either be deleted in its entirety, or cancelled out by an event that resets the total state to its previous state.
Apparently I’m not talking out of my nether regions, because there is a name for this: Event Sourcing.
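Here is a minimal sketch of the idea in Scala, using the bookkeeping example from above (the types and names are mine, purely for illustration): events are immutable and never updated, and the end-state - the balance - is derived by folding over the sequence of events.

```scala
sealed trait AccountEvent
case class Deposited(amount: BigDecimal) extends AccountEvent
case class Withdrawn(amount: BigDecimal) extends AccountEvent

// The end-state is a pure function of the event history.
def balance(events: Seq[AccountEvent]): BigDecimal =
  events.foldLeft(BigDecimal(0)) {
    case (total, Deposited(amount)) => total + amount
    case (total, Withdrawn(amount)) => total - amount
  }

// A mistaken entry is never updated in place; it is cancelled out by
// appending a compensating event.
val events = List(Deposited(100), Withdrawn(30), Deposited(30)) // refund of the mistaken withdrawal
val endState = balance(events) // 100
```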
There are a lot of reasons to go down the route of Event Sourcing instead of doing traditional database updates on a traditional data model:
- Remove the impedance mismatch between an immutable application and a mutable datastore - application and datastore become better aligned. Your database effectively becomes a more-reliable-than-RAM offloading point for your FRP style application.
- Keep the history of events that led to a state: understanding the events that lead up to something, rather than just having the end result can lead to deeper insights (see the current buzz around “Big Data”).
- Be able to assert data consistency in NoSQL data stores that do not explicitly support transactionality through versioning.
- Storage, memory and processing are cheap: in the olden days these were at a premium, so there was an incentive to keep storage needs to a minimum. This no longer holds true, as the value of data has overtaken the cost of storage and processing.
I realise there may be a bunch of things I have not yet considered, as this blog post is a braindump of ideas that have slowly gestated in my mind lately, driven forward by both offline and online conversations - so any and all feedback is more than welcome!
Last week saw the announcement of Amazon Web Services’ NoSQL database as a service, Dynamo. Dynamo has a number of very interesting features, not least that it runs as a managed service with certain guarantees with regards to scalability, resilience and performance. From the pricing structure, I would say that Dynamo is probably more expensive than something like MongoDB for larger datasets. However, I think Dynamo has an interesting role to play for startups and companies with smaller storage needs: Dynamo is perfect as an “early NoSQL database” for startups - you outsource the expensive tasks of managing and scaling the database to Amazon until the point where your business concept has either been proved or disproved. If your concept is dead in the water, your data storage requirements may never reach a threshold where other solutions are appreciably cheaper, and if your concept is a roaring success, well, then you have every reason in the world to put in the effort to migrate from Dynamo to Mongo, Cassandra or something else.
A simple Dynamo API for Scala and Java
With that short intro to where I think Dynamo fits in, I thought I’d unveil Scala Dynamo, an open source project I quickly put together to make working with Dynamo a breeze in Scala or Java. The standard AWS Java APIs are verbose to say the least, but with Scala Dynamo you will be able to save and load Scala case classes or Java beans that fit the Dynamo way of storing data in a single line of code. The main thrust of Scala Dynamo so far has been serializing and deserializing case classes and Java bean classes into and from Dynamo. I might add more features in the future as I familiarize myself further with Dynamo, but in the meantime, I think the library should already be quite useful to anyone using a JVM language who wants to work with Dynamo.
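To give a feel for what “a single line” means, here is a purely hypothetical sketch - the method names below are my own illustration, not the actual Scala Dynamo API, so treat them as assumptions and look at the repository for real usage.

```scala
// Hypothetical case class to be stored in a Dynamo table.
case class Person(id: String, name: String, age: Int)

// Illustrative only: assumes some "dynamo" client object wrapping AWS
// credentials and the table name. The real method names live in the repo.
// dynamo.save(Person("id-1", "Jane", 42))
// val jane: Option[Person] = dynamo.load[Person]("id-1")
```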
For examples, source code etc, check out Scala Dynamo’s GitHub repository!
Over the last few days there has been a lot of discussion about the perceived complexity of Scala, following Scala creator Martin Odersky’s response to a critical blog post, in which he suggested that compiler flags may be introduced to allow people to turn on and off certain features of the language that may be perceived as “complex”.
The Nature of Complexity
There are actually two big things that worry me about Odersky’s response: firstly, it is an implicit admission that maybe Scala is actually too complex. Secondly, if that is the case, the response is the wrong one, as it will inevitably result in even more complexity.
Let’s start with my second point - complexity, especially in software, is often the result of one of several common pitfalls:
- Ill-considered design decisions where functionality is too coarse-grained to be easily composable for the actual use-case.
- Too many ways of doing the same thing, leading to confusion over what the best practices are.
- Unnecessary features/”options” being added without there being a clear-cut need for them.
- “Flags” being used instead of clean, composable abstractions.
By any measure, adding options will only serve to add complexity - if you want to remove complexity and make things simpler, you should be looking at what can be removed or reduced, not at what could be added. In my opinion, either things should remain as they are, or, if the features being questioned truly are too complex (which I’m not sure about), they should slowly be phased out of the language with the admission that they might have gone too far. Either it belongs in the language or it doesn’t; pretty simple.
It bears repeating: complexity is almost always the result of trying to do too many things, please too many people and give people too many alternatives. Adding yet another alternative will only increase complexity, even if the purpose is the opposite. Good design is based on doing the bare minimum, the bare essentials and nothing more - YAGNI and all of that.
Dealing With a Learning Curve and PERCEIVED Complexity
I’m not sure Scala is actually as complex as some claim it is, but I do not in the least dispute that anything with a learning curve will initially be perceived as complex. Teams adopting Scala obviously need to manage that learning curve by using solid, agreed development practices. But this is not a problem unique to Scala: teams developing in any language will typically put in place a set of commonly agreed principles and practices. These will usually range from the granularity of testing (what is a unit test, what is an integration test, etc.), to guarding against common bugs (running something like FindBugs), to how build processes work and how code is formatted. There are many excellent build plugins for most languages and build tools that let teams enforce these practices.
My point here is that most teams have to deal with these issues anyway, so for a team adopting Scala, it’s all about managing them in the traditional way. In this sense it would make a LOT more sense if guarding against potentially confusing Scala features were done as part of the team’s regular work, and perhaps enforced by some useful build tool (sbt) plugin. Trying to fix it at the language or compiler level seems to me to be the wrong place to address an issue that is mostly about managing a collective learning curve. Yes, a compiler flag might be the quick, expedient way of addressing it, but I’m afraid it is the wrong place, and it also risks the usual pitfalls that come with “quick, expedient fixes”.
In my opinion, Martin Odersky and his team need to work out for themselves whether Scala truly is too complex and some rough edges need to be phased out of the language entirely, or whether it is simply an issue of managing learning curves (which is my opinion). If it is the latter, then I believe it should be dealt with by optional sbt plugins and IDE features that help teams manage their learning curve, and not anywhere else.
Over the last couple of months I have read up on volumes of Technical Analysis (“TA”) information, and I have back-tested probably hundreds of automated trading strategies against massive amounts of data - both exchange intraday and tick data, as well as other sources. Some of these strategies have been massively profitable in back-testing, others not so much.
Some of the TA patterns I discarded before they even left the book, because they did not stand up to any sort of scientific scrutiny: they lacked a clear predictive thesis, were riddled with forward-looking bias (“Head and Shoulders” patterns), and in some cases were just plain bulls**t (“Elliott Wave Principle” comes to mind).
The outcomes of my testing have made me think about the implications of large-scale data analysis in general: it is very easy to get fooled by randomness. In many cases my testing results have been amazing, but I cannot come up with a plausible causal explanation as to why, and when I gently nudge the parameters just ever so slightly, the outcomes can look entirely different.
Taking a step back from the data, looking at it in a larger perspective, I’m inclined to conclude that if data across multiple parameter variations looks like a random walk and lacks a plausible causal explanation, then it is a random walk.
If I cannot say “X is caused by A and B”, I’m inclined to believe that the actual story is “X is the result of A and B fitting the historical data D, but they may not do so in the future”.
And herein lies the crux of the matter: how many data scientists are inclined to take a step back, rather than just assume that there is a pattern there? How many are prepared to do so if their livelihood is largely based on them finding patterns, rather than discarding them because they do not hold up to deeper scrutiny? I’d say very few.
My conclusion is that the age of Big Data will see a radical increase in pseudo-scientific “discoveries”, driven by an interest in announcing great new “patterns”. This pseudo-science will pervade academia, the public sector and the private sector alike - God knows I’ve already seen a fair number of academic research papers that simply do not hold up if you investigate their thesis more deeply.
I suspect we will arrive at a point, much like with any new technology, where people will tire of the claims made by “Big Data Scientists”, because at least half of what they say will have been proven to be hokum and pseudo-science in the pursuit of ever more outlandish claims in a game of one-upping the competition. Some of this will be driven by malice and self-interest, but I suspect it will be driven in equal parts by ignorance and perverse incentives putting blinders on people in the business.
29 Nov 11 TextMinr Progress Update
Since I announced TextMinr, the text-mining-as-a-service platform, last week, the interest has been overwhelming: it reached the top 3 on Hacker News and was retweeted by Tim O’Reilly of O’Reilly Media and many other influential people in the tech scene.
Progress over the last week has been good, the public API is coming along and it’s quite likely we will start to dripfeed invitations to the beta towards the end of next week. I say “dripfeed”, because there are so many people who have declared interest that we’ll have to do it slowly to scale the service in a managed way as the load grows. Everyone who has signed up will get an invite (and people can still sign up), it just might take a week or two longer before everyone gets in.
Further, the first beta drop will likely be primarily a developer-oriented release: we will provide APIs and documentation for:
- web scraping
- name/entity recognition
- subject identification
- context searching/extraction (extracting important words and sentences for a search term)
- document similarity (comparing two documents/urls for similarity of content)
- RSS listening that hooks into all of the above
The other functionality that has been mentioned, such as sentiment analysis, classification, and Twitter and Facebook monitoring and analytics, will be rolled out in the weeks thereafter (it’s a Beta after all - functionality will be continuously improved and added).
The first release may be mostly developer oriented, but in coming releases we’ll hopefully improve the polish of the web experience as well, so that non-developers will eventually also be able to create their own analysis pipelines and analytics reports.
On the back of what I wrote the other week about machine intelligence, I think another important step is democratizing use of machine learning & intelligence software: making it accessible to people and companies that don’t have a PhD or deep pockets to hire one. This has thus far been the domain of experts and laborious manual work. I think this has to change.
In that spirit, I’m launching TextMinr - Text Mining as a Service, and we’re accepting applications for Beta users as of now.
The plan is to expose large parts of the underlying technology that drives GreedAndFearIndex in the shape of REST APIs and a web console/dashboard, so that others can innovate on top of it and make use of state-of-the-art text mining and natural language processing technology without having to spend years learning how it all works.
Pricing is still to be decided, but it will definitely be accessible to anyone with an idea: our current thinking is simple pay-as-you-go pricing, where anyone will be able to dip their feet in and test our technology without having to pay an arm and a leg. I think it’s the fair way to go: if you barely use it, you barely pay for it; if you process and analyze half the internet on a daily basis, well, then you’ll probably pay a little bit more.
You can sign up right now for the Beta, which will be available soonish; all we want in return is your feedback. So if you’re interested, please do sign up!