******Please read this post at my new blog site, AInsightful. Thanks!******
R has some very useful statistics libraries. It’s an excellent language for manipulating and graphing data. What would take multiple lines of Java (or even Scala) code can be written elegantly in R without the code looking obscure. One downside is that all that good stuff is hidden from much of the software industry, which mostly uses mainstream languages such as Java, C# and C++.
That is why I recently put together dockerr, a dockerized R server (Rserve) with sample Java/Scala clients. All the ingredients were available. All I did was put things together and document them.
Any engineer would naturally want to avoid complications and look for a solution that doesn’t involve calling yet another service. A new service is an additional worry: you need a host, monitoring and upgrades. However, the only mainstream language with excellent coverage of statistical and machine learning methods is Python, through scikit-learn and the supporting data analysis libraries pandas, NumPy and Matplotlib. Although Weka is written in Java, it doesn’t have a strong community like R and scikit-learn do. Being written in Java has its downsides too: import statements, class declarations, the OO paradigm and Java’s general verbosity make it less favourable for data analysis tasks. Here, for example, are code fragments that work out the mean of three numbers in R, Scala and Java.
# R
mean(c(1, 2, 3))
// Scala (doubles, so the division is not truncated)
val x = List(1.0, 2.0, 3.0)
x.sum / x.length
// Java
List<Double> numbers = Arrays.asList(1.0, 2.0, 3.0);
double mean = numbers.stream().mapToDouble(Double::doubleValue).sum() / numbers.size();
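To make the verbosity point concrete, here is a minimal sketch of the Java fragment above written out as a complete, compilable program, with the imports and class declaration it needs. The class name `MeanExample` and the helper `mean` are made up for this illustration, and I have used the standard `DoubleStream.average()` instead of dividing the sum by the size by hand.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class MeanExample {

    // Arithmetic mean of a list of doubles; returns NaN for an empty list.
    static double mean(List<Double> values) {
        return values.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        List<Double> numbers = Arrays.asList(1.0, 2.0, 3.0);
        System.out.println(mean(numbers)); // prints 2.0
    }
}
```

That is around twenty lines of Java for what is a single line in R.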
One doesn’t suddenly compile code against a data analysis library and start using it in production. There is a bit of work to do before that happy moment: data cleaning and several iterations of data plumbing, data visualisation, training, parameter tweaking, testing and graphing results. So it’s wise to pick a language that makes these tasks relatively straightforward. And, as I discovered, it’s wise not to spend too much time searching for good data analysis libraries written in Java.
There are also MLlib and H2O, powerful machine learning frameworks that run on the JVM. They put big data at the core of their architecture. The array of algorithms they support and the size of their communities are no match for what R and Python have to offer, so support is limited. If the method you are after isn’t implemented, you may have to write it yourself or wait until someone else does. Their suitability for big data comes at a cost too, particularly for MLlib, since setting up the framework and following the Spark programming model can be overkill for small problems. H2O’s SDKs for R and Python make it more attractive here. Running an H2O server is easy as well: you just download and run the jar.
Since I talked about Rserve, I should also mention JRI, the Java/R Interface. It provides a Java API to a locally installed R. It sounds nice at first, and although I haven’t investigated it in anger, I wouldn’t choose it as a solution because it leads to a monolithic application that runs inside a single JVM. The microservice architecture works better here.
I’m now convinced that calling Rserve from Java isn’t a bad idea after all. Dockerr is just a start. More can be done to improve an Rserve container. It can be set up to load functions only once. You can also put a number of containers behind a load balancer for better performance. One big flaw is that the containers won’t be able to work together. If you have huge data and an algorithm that needs multiple machines to crunch it, then H2O and MLlib are worth considering.
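As a rough illustration of spreading calls over several Rserve containers, here is a minimal client-side round-robin sketch. It is not part of dockerr, and a real deployment would more likely use a dedicated load balancer in front of the containers. The class `RoundRobinEndpoints` and the host names are hypothetical; 6311 is Rserve’s default port.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: rotates through a fixed list of Rserve endpoints.
class RoundRobinEndpoints {
    private final List<String> endpoints;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinEndpoints(List<String> endpoints) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("need at least one endpoint");
        }
        // Defensive copy so later changes by the caller have no effect.
        this.endpoints = new ArrayList<>(endpoints);
    }

    // Returns the next endpoint in rotation; safe to call from several threads.
    String next() {
        // floorMod keeps the index non-negative even if the counter overflows.
        int i = Math.floorMod(counter.getAndIncrement(), endpoints.size());
        return endpoints.get(i);
    }

    public static void main(String[] args) {
        // Placeholder host names, assumed for this example only.
        RoundRobinEndpoints pool = new RoundRobinEndpoints(
                Arrays.asList("rserve-1:6311", "rserve-2:6311", "rserve-3:6311"));
        for (int i = 0; i < 4; i++) {
            System.out.println(pool.next()); // fourth call wraps back to rserve-1:6311
        }
    }
}
```

Each call to `next()` would give the host and port to open the next Rserve connection against. Because the containers share nothing, every request has to carry all the data it needs, which is the flaw mentioned above.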
There will be other solutions implemented in other languages and frameworks too. No single technology solves all problems. Data manipulation, graphing, availability of algorithm implementations and performance are some of the factors behind a data scientist’s choice of technology. Anyone who is serious about data analysis will need to be flexible about technologies. Remember the law of the instrument.