Is big data really a new computing paradigm?

December 1, 2014

Gordon Plunkett

Big data has become a popular buzzword and trend in the computing industry. But are the challenges that big data presents new or have these challenges been around for some time? In particular, the study of geography is known for collecting vast volumes of data, so are big data issues affecting how GIS and Web mapping are performed?

It’s clear that big data has now hit prime time. It’s a topic you read and hear about with increasing regularity. IBM defines big data as “an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications.” In GIS circles, big data is certainly making a big impression. At a High Performance Geoprocessing Symposium I attended at Carleton University, sponsored in part by the University’s Geomatics and Cartographic Research Centre, Professor Fraser Taylor and Professor John ApSimon noted in their introductory remarks that big data is becoming more common in the GIS industry. They remarked that geospatial practitioners need to become more familiar with big data management and more skilled in big data analytics. Also, Professor Yvan Bedard recently gave a GeoConnections Webinar on topics such as the fundamentals of big data and the challenges and opportunities of big data in geomatics.

However, is big data really a new thing or has it been with us for some time? Today, a big data challenge could include processing a terabyte data file on several high performance servers. But is this really any different from the ‘old’ days when we were trying to process a 100 MB file on a slow PC with a 640K memory limit? It is the same problem, except the orders of magnitude have changed. Today, the processors are many times faster, the disk storage systems are many times larger and throughput is much higher. But in the old days, we also spent a lot of time figuring out how to ram lots of data through a state-of-the-art processor without having to wait a week to get the results.

So what’s changed? Isn’t the problem proportionally the same as in the past, given that today’s processors are 1,000 times faster and disks are 1,000 times bigger? Why is big data such an important issue today, when a comparable problem has been around almost since computers were invented?


Well, I think many things have changed to make big data one of today’s great computing challenges. In no particular order, here are some issues creating today’s big data challenge:

Data Complexity
The rise of social media, sensor inputs, the Internet of Things and data archives means data is getting more and more complex. No two tweets are the same, so how does a computer make sense of what a human tweeter is trying to say? Sensors are collecting more complex and interrelated data. The Internet of Things generally has simple sensors, but the sheer number of these sensors makes it difficult to mine the data they generate. Also, we now have a digital history that we didn’t have decades ago, so we can try to determine what has changed, because this may be the key to determining what’s happening today.

More Complex Questions
Users are asking more complex questions today than in the past. This is a good thing, but it makes obtaining the user input more complicated; plus, it makes the analytics more intricate. One often-overlooked aspect of this issue is that the more complex a question is, the harder it becomes for users to determine whether they have obtained the correct results. In short, complex questions yield complex answers, which means one must rely more and more on (you guessed it) big data.

Scalability
In the old days, PC processors were generally dedicated to a particular process, and the OS would be set to run a big process without interruption until it was complete. This is much less common today because servers are used to support a multitude of users who are doing various tasks and asking many questions. So systems need to pull in other processors when the system gets busy, hence cloud computing. For example, a few years back, I was told by a major company that they could create a completely new 20-scale map cache of the world in two days. Without cloud computing, it would take enormous computing resources to accomplish this.
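To get a rough sense of why a 20-scale world cache is such a large job, here is a minimal Python sketch. It assumes a conventional quadtree tiling scheme in which the first scale is a single tile and each further scale has four times as many tiles as the one before. The article does not say which tiling scheme the company used, and the function names are purely illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tiles_at_level(level: int) -> int:
    """Tiles at one scale, assuming a quadtree pyramid (assumption, not from the article)."""
    return 4 ** level


def total_tiles(num_levels: int) -> int:
    """Total tiles across scales 0 .. num_levels - 1."""
    return sum(tiles_at_level(level) for level in range(num_levels))


def required_tiles_per_second(num_levels: int, days: float) -> float:
    """Average rendering rate needed to build the full cache in the given time."""
    return total_tiles(num_levels) / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    levels = 20  # the "20-scale" cache mentioned in the text
    print(f"Tiles in a {levels}-scale pyramid: {total_tiles(levels):,}")
    print(f"Average rate for a two-day build: "
          f"{required_tiles_per_second(levels, 2):,.0f} tiles per second")
```

Under these assumptions the full pyramid holds several hundred billion tiles, so a two-day build works out to roughly two million tiles per second on average. Real caches often skip empty ocean tiles or stop short of the finest scales in remote areas, so the true workload may be considerably smaller, but the order of magnitude shows why the job is spread across many machines.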

For the Community Map of Canada, an update of the map cache is most often performed only when and where the data has changed. However, if the map cartography has changed, then a new multiscale map cache needs to be produced, and this can take a long time if only a few computing resources are available. In this case, cloud computing helps smooth out these peak loads, when a great deal of extra computing power is required for a short period of time.
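Updating the cache only "where the data has changed" can be sketched as finding the tiles that overlap the changed area at each scale. The Python sketch below assumes the widely used Web Mercator XYZ tiling convention; the article does not state which scheme the Community Map of Canada uses, and the function names and the example bounding box around Ottawa are hypothetical.

```python
import math


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> tuple[int, int]:
    """Convert a longitude/latitude to XYZ tile indices (Web Mercator assumed)."""
    n = 2 ** zoom
    x = math.floor((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the map edge still fall inside the tile grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_bbox(west: float, south: float, east: float, north: float,
                   zoom: int) -> set[tuple[int, int, int]]:
    """Return the (zoom, x, y) tiles that overlap a changed bounding box."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # top-left corner
    x_max, y_max = lonlat_to_tile(east, south, zoom)  # bottom-right corner
    return {(zoom, x, y)
            for x in range(x_min, x_max + 1)
            for y in range(y_min, y_max + 1)}


if __name__ == "__main__":
    # Hypothetical changed area: a box roughly around Ottawa.
    west, south, east, north = -76.0, 45.0, -75.0, 46.0
    for zoom in (5, 10, 15):
        stale = tiles_for_bbox(west, south, east, north, zoom)
        print(f"zoom {zoom}: {len(stale)} tiles to re-render")
```

Only the tiles returned for each scale would need to be re-rendered, which is why a local data edit is cheap while a cartography change, which touches every tile, is not.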

So is big data the new computing paradigm? To me, the problem has been around for quite some time, but what has changed is the complexity and interconnectedness of the input data and the challenging questions that users are posing and striving to answer.

For more on GIS and big data, visit Esri’s Web site for insight, resources, tools and case studies.