Big Data Tools Turn Days Into Minutes
Apache Hadoop, the distributed computing platform, has been getting a lot of attention recently, and rightly so: it is one of the very few viable tools for processing big data, breaking work up for parallel processing across clusters of up to thousands of nodes.
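To make the "breaking up work" idea concrete, here is a minimal sketch of the canonical word-count job written against Hadoop's Java MapReduce API: each mapper runs independently on a block of the input, and the reducer aggregates the partial counts. The input and output paths are supplied on the command line, and a working cluster configuration is assumed.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles the hard parts, scheduling map tasks near the data, shuffling intermediate results, and retrying failed nodes, which is what lets the same pattern scale from one machine to thousands.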
SAP AG, the enterprise software company, has for some time been developing its own SAP HANA platform, based on the principle of keeping large data sets, both structured and unstructured, in memory rather than continually transferring data to and from disk. SAP recently announced further big data capabilities through integration with Hadoop environments, allowing SAP HANA to utilise the Hadoop Distributed File System (HDFS) and Hive for reading and writing data. According to Dan Woods, in a recent Forbes article, "The HIVE project is an attempt to make data in the HDFS store accessible through an SQL like interface."
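As a rough illustration of what that SQL-like access looks like from application code, the sketch below queries Hive over JDBC via HiveServer2. The host name, credentials, table and column names are placeholders for illustration, not details from SAP's announcement.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Load the Hive JDBC driver (HiveServer2)
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Host, port, database, user, table and column names are placeholders
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hive-host:10000/default", "user", "");
         Statement stmt = conn.createStatement()) {

      // HiveQL reads like SQL, but Hive executes it as distributed
      // jobs over files stored in HDFS
      ResultSet rs = stmt.executeQuery(
          "SELECT category, COUNT(*) FROM events GROUP BY category");
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
```

The appeal is that analysts and applications can keep speaking SQL while the heavy lifting happens on the Hadoop cluster, which is precisely the bridge the HANA integration exploits.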
One particular client SAP highlights in its announcement is Mitsui Knowledge Industry, which utilises SAP and Hadoop for cancer research and has, according to SAP, "found a way to shorten the genome analysis time from several days down to only 20 minutes." Genome sequencing researchers have been looking for ways to manage the massive volumes of data involved for some time. In 2010, researchers explored "sharding", splitting genome data into smaller, more manageable chunks for cluster-based processing, utilising the Hadoop framework and the Genome Analysis Toolkit (GATK), which is used for projects such as The Cancer Genome Atlas.
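As a loose sketch of the sharding idea, and not GATK's actual implementation, the following splits a sequence into fixed-size chunks and processes them in parallel. The chunk size, thread count, and per-shard computation are all arbitrary illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShardingSketch {

  // Split a sequence into fixed-size chunks. Real pipelines often add
  // overlap between chunks so features spanning a boundary are not lost;
  // that is omitted here so the aggregate count stays exact.
  static List<String> shard(String sequence, int chunkSize) {
    List<String> shards = new ArrayList<>();
    for (int start = 0; start < sequence.length(); start += chunkSize) {
      shards.add(sequence.substring(start,
          Math.min(start + chunkSize, sequence.length())));
    }
    return shards;
  }

  public static void main(String[] args) throws Exception {
    String genome = "ACGT".repeat(25_000); // stand-in for real sequence data
    List<String> shards = shard(genome, 10_000);

    // Process shards in parallel; on a Hadoop cluster each shard would be
    // handled by a separate map task rather than a local thread.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    List<Future<Long>> perShard = new ArrayList<>();
    for (String s : shards) {
      perShard.add(pool.submit(() -> s.chars().filter(c -> c == 'G').count()));
    }
    long total = 0;
    for (Future<Long> f : perShard) {
      total += f.get();
    }
    pool.shutdown();
    System.out.println(shards.size() + " shards, " + total + " G bases");
  }
}
```

The speed-up SAP reports comes from exactly this shape of decomposition: once the genome is in independent chunks, analysis time falls roughly in proportion to the number of nodes working on it.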