Big Data Tools Turn Days Into Minutes

Apache Hadoop, the distributed computing platform, has been getting a lot of attention recently, and rightly so, as one of the very few viable utilities for processing big data by breaking up work across clusters of up to thousands of nodes.

SAP AG, the enterprise software company, has been working on its own SAP HANA implementation for some time, based on the principle of keeping large data sets, both structured and unstructured, in memory rather than continually transferring data to and from disk. SAP recently announced further big data capabilities through the integration of Hadoop environments, thereby allowing SAP HANA to utilise the Hadoop Distributed File System and Hive for reading and writing data. According to Dan Woods, in a recent Forbes article, “The HIVE project is an attempt to make data in the HDFS store accessible through an SQL like interface.”

One particular client SAP highlight in their announcement is Mitsui Knowledge Industry, who utilise SAP and Hadoop for cancer research and have, according to SAP, “found a way to shorten the genome analysis time from several days down to only 20 minutes.” Genome sequencing researchers have been considering methods to manage the massive amount of data involved in genome sequencing for some time. In 2010 researchers considered the process of “sharding”, splitting genome data into smaller, more manageable chunks for cluster based processing, utilising the Hadoop framework and the Genome Analysis Toolkit (GATK), used for projects such as The Cancer Genome Atlas.
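The sharding idea described above can be illustrated with a minimal Python sketch. This is not how Hadoop or GATK work internally: the function names, the toy sequence, and the per-shard task (counting base letters) are all hypothetical, and a thread pool on one machine stands in for a cluster of nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def shard(sequence: str, shard_size: int) -> list[str]:
    """Split a sequence into consecutive chunks of at most shard_size characters."""
    return [sequence[i:i + shard_size] for i in range(0, len(sequence), shard_size)]


def count_bases(chunk: str) -> Counter:
    """Per-shard work: count how often each base letter appears in one chunk."""
    return Counter(chunk)


def analyse(sequence: str, shard_size: int, workers: int = 4) -> Counter:
    """Process every shard independently, then merge the partial results."""
    shards = shard(sequence, shard_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_bases, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the counts from each shard
    return total


# Hypothetical toy data, not a real genome.
toy_genome = "ACGTTGCAACGT" * 1000
base_counts = analyse(toy_genome, shard_size=500)
print(base_counts)
```

Because each shard is processed without reference to any other, the same split, process, and merge pattern can in principle be spread over many machines instead of threads.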

Tip: Dealing With Off Screen Textboxes In Word 2010

Running a spell check may be the first indication of a rogue textbox in Microsoft Word. A rogue textbox, in this instance, is a textbox that is positioned beyond the margins of the page and will, most likely, be inaccessible from many of the views including the standard Print Layout view.

A spell check will highlight any spelling or grammar issues in the textbox, and show any offending text, but will not display the textbox on screen, which can be extremely frustrating, cause a lot of head scratching, and generally be somewhat time consuming to track down.

Most advice regarding textboxes in Word requires the user to click on the border of the textbox, but if the textbox is off page then this is not possible.

If a rogue textbox is suspected, there are a couple of ways of dealing with the problem. The first step will be to switch to the Drawing Tools panel and enable the Selection Pane, on the Ribbon Bar, which should list all text boxes under the “Selection and Visibility” panel.

Next, try switching to the Web Layout view, under the View menu, which will usually show the current document regardless of the limitations of the margins. If this is the case then grabbing the textbox and dragging it back onscreen, or removing it altogether, is a relatively simple fix.

If Web Layout view does not work, another option is to take any word found by the spell checker and perform a Find. Then, from the Format menu under Drawing Tools, click the Position icon in the Ribbon bar and select one of the options, such as “In Line With Text” or “With Text Wrapping”, which should bring the textbox back within the margins of the current page.

Big Data Tools Are On Their Way

Big data and big data analytics are moving on from being the latest “buzz words” as products begin to emerge that start to address the problems of handling vast amounts of unstructured data.

Yahoo! is the latest company to make an announcement, with Yahoo! Genome building upon the years of experience the internet services company has in handling huge data sets from its user base, which the company say now exceeds half a billion people. According to the platform website, Genome, to be available around July 2012, can combine information from its own database and existing customer relationship management applications to provide more detailed information regarding customers and their activities.

Compuware have also announced an update to its dynaTrace Enterprise product, as part of its APM Platform Spring Release 2012, which will support big data environments, emphasising cloud performance management and automatic baselining of systems. As enterprise begins to take advantage of cloud technology and its ability to dynamically scale its architecture, web based applications that handle big data require radically new tools.

Data integration specialist Informatica recently announced version 9.5 of its data integration platform with more emphasis on maximising return on big data, that is “the value of data divided by its cost”, a view also recently expressed by IBM at its RoI: Return On Information event. When it comes to big data analytics, Big Blue already has an array of tools on offer, including InfoSphere Streams for analysing continuous streams of data and InfoSphere BigInsights, an Apache Hadoop based tool, for analysing both structured and unstructured information.

Review: Apple iPad Smart Cover

Covers and cases for the Apple iPad seem to crop up pretty regularly as alternatives to the official products, but it is always interesting to see an official product, as the designers intended, to see how well it complements the primary product. In this review we will be taking a look at the official Apple iPad Smart Cover, Leather Tan, designed for the iPad 2 (model MD302ZM/A), which also works with the 3rd generation of iPad. Whilst this review will focus on the leather version, there are two main types of Smart Cover available: the Polyurethane design and the premium aniline-dyed Italian leather design.

The Smart Cover is interesting in that, as well as protecting the screen, it can also:

  • Wake up the iPad when the cover is opened
  • Put the iPad to sleep when the cover is closed
  • Fold for use as a stand (keyboard stand and film stand).

For starters the cover is very slim. The product specification talks about the cover being slim but it really is quite slim compared to many other covers and cases seen. The box the product arrives in is slim too and includes very little else other than the cover and a protective sheet (for shipping).

The leather tan cover contrasts well with the iPad (the black version of the iPad was used for the review) and the warm feel of the cover contrasts nicely with the cool aluminium base of the iPad. Folding back the first segment of the cover wakes the iPad, which gives a real sense of purpose.

The cover utilises two aluminium hinges which align magnetically to the iPad, whilst magnets embedded in the opposite side of the cover should keep the cover in place. In early operation the cover takes a few minutes of getting used to, for example in the initial connection to the iPad and in figuring out how best to fold the cover (there are no instructions on how to fold it – presumably it should be easy), but after a short practice things start to work out.

A really neat feature is that the smart cover also includes a microfibre lining to help keep the screen of the iPad clear of those dreaded finger smudges. There has clearly been a lot of thought given to the overall design, and function, of the smart cover and it is highly recommended.

The Apple iPad Smart Cover, Leather Tan, is an excellent addition to the iPad and complements the design and craftsmanship very well. The cover is currently available from the Apple Store and other vendors including John Lewis, who provided this unit for review, for £59 (price checked 11th May 2012).

For more information about this cover and others head over to the John Lewis website: iPad accessories.

Big Data To Be Worth $50 Billion By 2017

The big data market looks set to be worth over 50 billion dollars by 2017, up from just 5 billion today, according to research from technology research and advisory community Wikibon. In 2011, the market leaders in the field of big data, by revenue, were familiar names including IBM, Intel, HP and Oracle, with Microsoft down at 23rd place. The split of big data spending is between services (44%), hardware (31%), and software (25%).
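To get a feel for what a tenfold rise implies, the forecast can be turned into a compound annual growth rate. The sketch below is only illustrative: the source says “today” without naming a year, so the 2012 starting point is an assumption, and the variable names are hypothetical.

```python
start_value = 5.0    # billions of dollars, the "today" figure quoted above
end_value = 50.0     # billions of dollars, the 2017 forecast quoted above
years = 2017 - 2012  # ASSUMPTION: "today" means 2012; the text does not state the year

# Compound annual growth rate: the constant yearly rate that turns
# start_value into end_value over the given number of years.
cagr = (end_value / start_value) ** (1 / years) - 1
print(f"Implied growth rate: {cagr:.1%} per year")
```

Under that assumption the market would need to grow by a little over 58% every year to reach the forecast.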

Big data adds levels of complexity that require new skills. Typical organised data, such as that from an ERP or CRM, may fall into the gigabytes of size but big data really starts to climb when meta data is added. This meta data can be from a wide array of areas including social networking services, data feeds, and service logs but also from other analytical sources such as sentiment analysis, sensors, RFID tags, and click streams (for example clicks through a website).

Two Thirds Of Organisations See Big Data As An Opportunity

According to a recent study by data integration company Informatica, more than 67% of respondents viewed “big data” as a business opportunity rather than a risk. The survey asked almost 600 business and IT professionals their views on a range of big data issues, and over a third were either testing or piloting big data related projects, or already had them in production, with the aims of achieving greater operational efficiencies and increasing business agility.

One of the big challenges identified by the respondents was the lack of tools, or the lack of maturity of tools, to handle the task of big data analysis. Reflecting on this, one problem with the existing tools for big data is that they are often quite complicated to operate. With data now falling outside the realms of the traditional information technology department and directly into the hands of the user base, these tools need to become simpler and more intuitive.

Business intelligence tools are already playing catch up but dealing with big data requires a different perspective as often big data can sit outside of databases and can include additional layers of meta-data. Hadoop is one such open tool that utilises the power of distributed processing architectures for handling big data sets.

At a recent conference IBM outlined how they are tackling big data, and in doing so it looks as if Big Blue is set to migrate from being a builder of things to an information services company. Microsoft are aware of the challenge of big data and are already incorporating tools such as Apache Hadoop and Windows Azure into its “Big Data Solutions”. Big data plays a key role in the latest revisions of SQL Server 2012, which also ties into the various business intelligence tools to allow analysis, as Microsoft say, “on all data, including those in Hadoop.”

One company that is well versed in crunching large amounts of information, and doing it quickly, is Google, who are working on an online analytical processing system dubbed BigQuery that allows users to run secure SQL-like queries against massive sets of data (terabytes of data and trillions of records) and process queries that return, as Google puts it, “up to billions of rows”. Access to the service is through a subscription web service.

Review: Tubes: Behind The Scenes At The Internet

In this review we will be taking a look at the book “Tubes: Behind the Scenes at the Internet”, by Andrew Blum, to be released on 7th June 2012 and published by Viking, an imprint of Penguin Books (also available as a Penguin eBook). The author, Andrew Blum, is a correspondent at Wired magazine and has had work published in a number of other publications including “The New York Times”, so you might expect him to know a thing or two about the internet.

Getting Behind The Scenes At The Internet

So how is it possible to take a look behind the scenes for something as intangible as the internet? For us to get any real sense of the internet as a physical presence we can start by looking at our own computer, or mobile device, and follow the communication link back through our local environment. Perhaps we could trace this back to some kind of hub, router or wireless gateway somewhere, either at work, home, or some other place, and we could follow this connection further backwards until we hit another router, phone line, or some other mechanism for connecting to the “network”. Most likely, for home users and business users, this link may go back as far as the local telecommunications exchange.

But what happens next? What happens if we trace the path all the way back to its original source? Would we find a physical manifestation of “the internet” – the source? Or is the internet already all around us? And if so, what does it look like? Where does our information go? How does it get there? And how does it get back?


In the prologue of the book the author sets the scene for the investigation, describing the disconnect that occurred when a squirrel munched on a backyard cable, and this is where the story begins, as summed up by the author: “Because this much I knew: the wire in the backyard led to another wire, and another behind that – beyond to a whole world of wires.” As a result of this disconnect Andrew Blum embarks upon a physical journey to find the very real internet: “You write an email. You hit send. It appears ten thousand miles away. How did that happen?”

Only many of these wires are actually fibre optic cables, long tubes filled with glass fibres, that carry light from one end to another in fractions of seconds. In 2006 Senator Ted Stevens of Alaska described the Internet as “a series of tubes,” and it is in this sense, of making this virtual world we know as the “internet” into something physical, that we can touch and feel, that the book aims to explore.

When talking about the topology of the Internet the author explains how the commonly accepted perception of the “information highway” does not account for the multi-layered nature of the internet, that “networks carry networks”, and the author continues on to explain that the internet is “more like the trucks on a highway than the highway itself.” This metaphor is ideal in showing how the internet, as we understand it, is constantly evolving and it is this kind of explanation that makes the book extremely accessible.

The book itself, around 304 pages, is split into seven primary sections including “The Map”, “The Whole Internet”, “Cities of Light”, and “Where Data Sleeps” with each section focusing on a particular, tangible, part of the internet. Prologue, Epilogue, Acknowledgements, Notes and Index sections are in support.

The narrative of the book is told almost like a dramatic storyline, only this story is very real, and the book is a riveting read, helped by the author clearly having a passion for this subject which shines through in every chapter. My only wish would be that these physical places that Andrew finds upon his travels, from the backyard where the squirrel munched through to the Kubin-Nicholson building, TeleGeography, 8100 Boone, and PAIX, could be supported by the inclusion of photographs and diagrams, which would add a very real and physical extra dimension to the story. However, there are details in the words that draw mental images of the internet, and if you enjoyed the book “How the Web was born” you are sure to enjoy this book.

In April 2011, a seventy-five year old woman deprived Armenia of its internet access when she sliced through a buried cable with her garden spade. That January, Egyptian authorities simply switched off 70% of the country’s internet connections in an attempt to quell a revolution. In 2009, a squirrel chewed through a wire in Andrew Blum’s backyard, slowing his broadband to a trickle and catapulting him on a quest to find out what this so-called ‘internet’ actually is. – Tubes: Behind the Scenes at the Internet

If you have any interest in what this thing we know of as “the internet” really is then this is a definitive book on the subject.

Tubes: Behind the scenes at the Internet (ISBN: 9780670918980) is currently available for pre-order, from Amazon for £9.09 (price checked on 14th May 2012), together with a number of other vendors with an RRP of £12.99 whilst an ePub version is also available for eBook readers.

For more information head over to the Penguin website, Tubes: Behind the Scenes at the Internet.

How Much Data Is Big Data?

With information collection climbing at an increasingly astonishing rate the demand for tools to analyse, and make sense of, this “big data” is becoming a critical component of board level strategy plans even though “big data” itself appears to have no consistent definition.

How Much Data Is Big Data?

According to a recent article in Data Science Series, a report by IDC suggests that the classification for “Big Data” starts at sets of data over 100 terabytes in size (a terabyte being around a thousand gigabytes), although SearchCloudComputing say that big data “doesn’t refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data”, where a petabyte is around a thousand terabytes and an exabyte around a thousand petabytes. To put big data into perspective, the complete works of Shakespeare take around 5 megabytes (around 0.000005 terabytes) whilst the complete audio collection of the works of Beethoven takes around 20 gigabytes (around 0.02 terabytes).
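The unit conversions above are easy to check with a few lines of Python. This is a small sketch using decimal (power of 1,000) units, in keeping with the “around a thousand” wording; the `convert` helper and the variable names are hypothetical.

```python
# Bytes per unit, using decimal prefixes (1 TB = 1,000 GB and so on).
UNITS = {"MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18}


def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Convert a size between units by going through bytes."""
    return value * UNITS[from_unit] / UNITS[to_unit]


shakespeare_tb = convert(5, "MB", "TB")     # complete works of Shakespeare
beethoven_tb = convert(20, "GB", "TB")      # complete Beethoven audio collection
idc_threshold_gb = convert(100, "TB", "GB") # IDC's starting point for big data

# How many copies of the complete works would reach the 100 terabyte mark?
shakespeares_per_threshold = convert(100, "TB", "MB") / 5

print(shakespeare_tb, beethoven_tb, idc_threshold_gb, shakespeares_per_threshold)
```

On these figures, it would take twenty million copies of the complete works of Shakespeare to reach the 100 terabyte threshold.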

Big Data Is All Around Us

It may be generally agreed that understanding the structure of any data set and performing responsive analysis upon it can be potentially beneficial, but just how much data is big data and where are the tools to analyse it? We might consider for a moment that data sets of the big data magnitude are beyond the reach of many of us; however, this is where things become really interesting. If we turn to life sciences we find that they are already well versed in collecting and analysing extremely large amounts of information, but is it big enough?

The analysis of Deoxyribonucleic acid (DNA), the blueprint for making life, is one such example of a large data set, and the United States DNA Database, according to the FBI, held over 10 million offender profiles in 2012. A report in Genetic Future highlighted that a data file containing “each and every DNA letter in your genome” would take around 1.5 gigabytes which, when combined with the 10 million records held by the US DNA Database, produces a data set of around 13 petabytes – certainly in the category of big data.
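The size of this hypothetical data set depends on which definition of a petabyte is used. The sketch below assumes the 1.5 gigabyte file is measured in decimal gigabytes and that every one of the 10 million profiles is stored as a full genome file; the variable names are illustrative only.

```python
GB = 10**9    # decimal gigabyte, in bytes
PB = 10**15   # decimal petabyte, in bytes
PIB = 2**50   # binary petabyte (pebibyte), in bytes

genome_file_gb = 1.5     # size of one full genome file, from the report quoted above
profiles = 10_000_000    # "over 10 million" profiles, rounded down to 10 million

total_bytes = genome_file_gb * GB * profiles
total_pb = total_bytes / PB    # size in decimal petabytes
total_pib = total_bytes / PIB  # size in binary petabytes

print(f"{total_pb:.1f} PB (decimal), {total_pib:.1f} PiB (binary)")
```

In decimal units the product is 15 petabytes; counted in binary units of 2^50 bytes it is about 13.3, which is consistent with the figure of around 13 petabytes. Either way the total sits well above the 100 terabyte mark.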

Interestingly, as far back as 1999, projects like SETI@home were taking advantage of distributed systems, back then focusing on grid computing as the primary platform, for handling and analysing large amounts of information. These distributed grid systems would take large sets of information, break them down into smaller sets, and distribute the analysis to many hundreds, even thousands, of volunteer computers. An interesting notion is that back in 2001 a Berkeley Report outlined that the SETI project needed to handle 39 terabytes of data, with data sets recorded onto 35 gigabyte tapes, so in the modern definition of big data this would not reach the 100 terabyte threshold outlined by IDC yet it serves as an example of just how far data collection has come.
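The SETI figures can be put side by side with the IDC threshold using some simple arithmetic. This is a rough sketch in decimal units (1 terabyte taken as 1,000 gigabytes, an assumption), with hypothetical variable names.

```python
import math

GB_PER_TB = 1000        # ASSUMPTION: decimal units

seti_data_tb = 39       # data the SETI project needed to handle (2001 report)
tape_capacity_gb = 35   # capacity of each tape
idc_threshold_tb = 100  # IDC's starting point for "big data"

# Number of tapes needed to hold the whole data set, rounding up to whole tapes.
tapes_needed = math.ceil(seti_data_tb * GB_PER_TB / tape_capacity_gb)

# Fraction of the IDC threshold that the SETI data set represents.
share_of_threshold = seti_data_tb / idc_threshold_tb

print(tapes_needed, f"{share_of_threshold:.0%}")
```

On these numbers the project's data would fill more than a thousand tapes yet still come to well under half of what IDC would class as big data.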

Get Ready For Big Data

Smart meters are an area likely to receive a large amount of scrutiny, and public interest, primarily because they sit in the consumer arena. Utility companies are readying themselves for the big data leap and are, presumably, already considering the kind of insights the data is expected to provide. One company, SAP, is already utilising its HANA architecture to build analytical solutions for this sort of data.

Too Much Data

There may be a temptation to simply collect as much, and as varied, data as possible in the hope that one day it may yield some useful insights. However, too much data can be more problematic than too little data, so what is important is that enterprise begins to question the data it is collecting, asking itself whether the information is really necessary and, if so, how long it needs to be retained. The temptation might be to simply let the data quietly collect, perhaps in a data warehouse or large drive share, in the hope that one day a tool may arrive to make sense of the information. In many enterprises, however, business moves quickly and today's information may not have quite the same benefits tomorrow.