Big data is a term commonly used by the press and analysts, yet few people really understand what it means or how it might affect them. At its core, Big Data represents a very tangible pattern for IT workers and demands a plan of action. For those who understand it, the ability to create an actionable plan to use the knowledge tied up in the data can provide new opportunities and rewards.
Let’s first solidify our understanding of Big Data. Big Data is not about larger ones and zeros, nor is it a measurement of the overall size of data under your stewardship. Simply stated, one does not suddenly have “big data” when a database grows past a certain size. Big Data is a pattern in IT. The pattern captures the fact that many data collections containing information related to an enterprise’s primary business are now accessible and actionable for that enterprise. The data is often distributed and in a variety of formats, which makes it hard to curate or use; hence Big Data represents a problem as much as it does a situation. In many cases, just knowing that the data even exists is a preliminary problem that many IT workers are finding hard to solve. The peripheral data is often available from governments, sensor readouts, or the public domain, or is simply made available through APIs into other organizations’ data. How do we know it is there, how can we get at it, and how can we extract the interesting parts? These are all first-class worries with respect to the Big Data problem.
To help illustrate the concepts involved in Big Data, we will use a hospital as an example. A hospital may need to plan for future capacity, and to do so it needs to understand aging patterns from demographic data available from the national census organization in the country where it operates. It also knows that supplementary data is available: how many people search for disease-related terms on search engines, and the percentage of the population that smokes, does not live a healthy lifestyle, or participates in certain activities. This may have to be compared to current client lists and the ability to predict health outcomes for known patients of a specific hospital, augmented with the demographic data from the larger surrounding population.
The ability to plan for future capacity at a health institute may require that all of this data, plus numerous other data repositories, be searched for evidence to support or disprove the hypothesis that the hospital will face greater demand for healthcare in ten years.
A second situation, juxtaposed to illustrate other aspects of Big Data, is a single patient arriving at the hospital with an unknown disease or infection. Hospital workers may benefit from knowing the patient’s background yet may be unaware of where that data is. Such data may reside in the patient’s social media accounts, such as FourSquare, a website that gamifies visits to businesses. The hospital IT workers in this scenario need to find a proverbial needle in a haystack. By searching across all known data sources, the IT workers might be able to scrape together a history of the patient’s social media declarations, which might provide valuable information about the person’s drinking patterns (scraped from FourSquare visits to licensed establishments), exercise habits (from a site like socialcyclist.com) and general lifestyle (scraped from Facebook, Twitter and other such sites). When this data is retrieved and combined with data from LinkedIn (data about the patient’s business life), a fairly accurate history can be established. By combining photos from Flickr and Facebook, doctors could actually see the physical changes in the way the patient looks over time.
This last example illustrates that the Big Data pattern is not always about using large amounts of data. Sometimes it involves finding the smaller atoms of data within large data collections and finding intersections with other data. Together, these two hospital examples show how Big Data patterns can provide benefits to an enterprise and help it carry out its primary objectives.
Gaining access to the data is one matter. Just knowing the data is available, and how to get at it, is a primary problem. Knowing how the data relates to other data, and being able to tease knowledge out of each data repository, is a secondary problem that many organizations face.
Some of our staff members recently worked on a Big Data project for the United States Department of Energy related to geothermal prospecting. The Big Data problem there involved finding areas that may be promising in terms of supporting a commercially viable geothermal energy plant, one that must operate for ten or more years to provide a valid ROI for investors. Once the rough locations are listed, a huge amount of other data needs to be collected to help determine the viability of each location.
Some examples of the other questions that had to be answered with Big Data:
- What is the permeability of the materials near the hot spot and what are the heat flow capabilities?
- How much water or other fluids are available on a year round basis to help collect thermal energy and turn it into kinetic energy?
- How close is the point of energy production to the point of energy consumption?
- Is the location accessible by current roads or other methods of transportation?
- How close is the location to transmission lines?
- Is the property currently under any moratoriums?
- Is the property parkland or subject to other special-use planning?
- Does the geothermal potential overlap with existing gas and oil claims or other mineral rights or leases?
All of this data is available, some of it in well-structured digital formats and some of it not in digital format at all. An example of a non-digital format might be a drill casing stored in a drawer in the basement of a university, representing the underground materials near the heat dome. By studying the casing’s structure, researchers can estimate the rate of heat exchange through the material, which provides clues about the rate of thermal energy potentially available to the primary exchange core.
In order to keep track of all the data that exists and how to get at it, many IT shops are starting to use graphs and graph database technologies to represent the data. The graph databases might not store the actual data itself; instead, they may store the knowledge of what protocols and credentials to use to connect to the data, what format the data is in, where the data is located and how much data is available. Additionally, the power of a graph database is that its structure is very good at tracking relationships between clusters of data, capturing how one dataset relates to another. This is a very important piece of the puzzle.
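To make the idea concrete, here is a minimal sketch of such a metadata graph in plain Python. The dataset names, fields and locations are purely illustrative assumptions, not real endpoints; a production system would use an actual graph database rather than in-memory dictionaries.

```python
# Sketch of a metadata registry as a graph: nodes describe datasets
# (not the data itself), edges describe how datasets relate.
# All names, protocols and locations below are hypothetical examples.

class MetadataGraph:
    def __init__(self):
        self.nodes = {}   # dataset name -> metadata dict
        self.edges = []   # (source, relation, target) triples

    def add_dataset(self, name, **metadata):
        """Register what we know ABOUT a dataset: format, protocol, location."""
        self.nodes[name] = metadata

    def relate(self, source, relation, target):
        """Record how one dataset relates to another."""
        self.edges.append((source, relation, target))

    def related_to(self, name):
        """Datasets reachable from `name` in one hop, with the relation."""
        return [(rel, tgt) for src, rel, tgt in self.edges if src == name]


registry = MetadataGraph()
registry.add_dataset("census_demographics",
                     format="CSV", protocol="HTTPS",
                     location="https://census.example.gov/data", size_gb=12)
registry.add_dataset("patient_records",
                     format="SQL", protocol="JDBC",
                     location="jdbc:postgresql://hospital/emr",
                     credentials="vault:emr-readonly")
registry.relate("patient_records", "joins_on_postcode", "census_demographics")

print(registry.related_to("patient_records"))
# → [('joins_on_postcode', 'census_demographics')]
```

The point of the sketch is that the registry answers “what exists, how do I connect, and what does it join to” without ever holding the patient or census data itself.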
The conclusion of this introductory post on Big Data is that Big Data exists already. It is not something that will be created. The new Big Data IT movement is about implementing systems to track and understand what data exists, how it can be retrieved, how it can be ingested and used, and how it relates (semantically) to other data. Every IT shop in the world has done this to some degree, from a low-tech “just use Google for everything” approach to a full-blown data registry/repository implemented to track all metadata about the data.
The real wins will come when systems can be built that automatically find and use the data required for a specific endeavor in real time. Becoming truly Big Data ready is going to require some planning and major architecture work over the next three to five years.