Data storage options for GeoAnalytics Server
Big data can generally be defined as data that are beyond the capability of traditional systems to store, process, and analyze. This blog post serves as an introduction to the nature of big data, both in general and in the spatial domain in particular, and introduces readers to the data source options that exist for GeoAnalytics Server, Esri's big data capability for ArcGIS Enterprise. It is the first in a series of blog posts discussing big data platforms, such as Apache Hadoop and cloud stores, which will be followed by a companion set of tutorials explaining how to read data from one of these platforms and use it inside the Esri ecosystem to perform the sorts of big data analyses that standard tools cannot easily handle.
Three key properties make big data, whether spatial or not, difficult to store and process within the traditional relational databases that are common to most commercial GIS software. Specifically:
- Volume: the volume of data that companies routinely collect each day has skyrocketed in recent years. Moreover, the advent of the Internet of Things (IoT) and the widespread presence of high-volume, data-generating sensor networks in almost every aspect of modern life have propelled the need to rethink and rearchitect traditional data storage and processing paradigms. Apart from the fact that relational databases have a practical storage limit in the petabyte range, storing large volumes of data in a conventional relational database is no longer a cost-effective or efficient strategy;
- Velocity: big data tend to arrive at high velocity, so read/write operations must be very fast and run in parallel to keep up with the huge amount of data generated in a short period of time. Read/write operations in the relational database model are not fast enough to capture real-time, high-velocity data streams. One of the major drawbacks of the relational model when dealing with such big data flows and storage is its tabular format. In a nutshell, in the relational model, data are stored in a number of related tables with predefined schemas. This means that the required tables, the relationships between tables, the data type of each table column, and other constraints, such as primary and foreign key field definitions, must be designed in advance. This model might make sense when dealing with small to medium-sized (structured) datasets on the order of gigabytes (or even terabytes), but not always when datasets are on the order of petabytes. Simply stated, the performance of read/write operations drops significantly when these restrictions are imposed unnecessarily.
- Variety: big data come from different sources, in different formats and structures. With highly varied data feeds that include video, audio, and other datatypes, incoming data can be structured, semi-structured, unstructured, or a combination of all three. Structured data refer to information that has a predefined schema and fits neatly into a fixed row-column format. Unstructured data refer to information that, unlike structured data, does not fit into the traditional row-column relational database format; text, images, and videos are some of the major unstructured data types. Between the structured and unstructured extremes lies a third category, semi-structured data. Similar to unstructured data, semi-structured data do not fit well into the tabular format of relational databases, but it is still possible to create semantic properties (such as tags or keys) that can be used to interpret the data content. XML- and JSON-formatted documents are two prime examples of semi-structured data. The relational database model was primarily developed to store structured data, and hence is not well suited to the other two major datatypes that have recently grown in prominence.
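To make the structured/semi-structured distinction concrete, the short Python sketch below (an illustrative example, not part of the original post; the field names are hypothetical) contrasts fixed-schema records, which map directly onto a relational table, with JSON records whose keys can vary from one record to the next.

```python
import json

# Structured: every record conforms to the same fixed schema
# (station, timestamp, pm25), so it maps directly onto a table.
structured_rows = [
    ("station_001", "2024-05-01T00:00:00Z", 12.4),
    ("station_002", "2024-05-01T00:00:00Z", 8.7),
]

# Semi-structured: the JSON records share some keys, but the "extras"
# object holds sensor-specific fields with no fixed schema.
semi_structured = json.loads("""
[
  {"station": "station_001", "pm25": 12.4,
   "extras": {"wind_kph": 14.2}},
  {"station": "station_002", "pm25": 8.7,
   "extras": {"humidity_pct": 61, "battery_v": 3.9}}
]
""")

for rec in semi_structured:
    # Optional keys must be handled defensively -- no schema
    # guarantees their presence, unlike the structured rows above.
    print(rec["station"], rec["pm25"], rec["extras"].get("wind_kph"))
```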
Because of the above-mentioned properties associated with big data, non-relational data stores, such as NoSQL databases, distributed file systems, and object storage services, outperform relational databases when it comes to storing and retrieving big data. A sizeable portion of big data are inherently spatial, with a geographical component attached to them: examples include road traffic feeds; weather station, air quality, and wind speed recordings; human mobility, car accident, and crime tracking; and endangered species monitoring. Having a robust framework to process and analyze spatial big data is therefore extremely important. Esri's ArcGIS GeoAnalytics Server provides a reliable framework for spatial big data analytics. However, for GeoAnalytics functionality to execute successfully, the first step is to decide where the required data should be stored. The next section discusses GeoAnalytics Server's data storage options, which are well suited to the challenge of storing big data.
GeoAnalytics Server
Esri's ArcGIS GeoAnalytics Server is an ArcGIS Enterprise solution developed specifically to handle big data processing and analysis scenarios such as those described above. To expose the GeoAnalytics platform's capabilities, an ArcGIS Server site must first be licensed for the GeoAnalytics Server role. The GeoAnalytics toolbox itself contains a set of powerful tools that have been developed to provide high-performance spatial analysis on big data. Before running any GeoAnalytics tool, the location of the large datasets that the platform will process must be specified.
Required data for GeoAnalytics tools can be accessed from four sources within the Esri ecosystem:
- Feature layers hosted in a relational ArcGIS Data Store: this approach is suitable for GIS data;
- Feature layers hosted in a spatiotemporal big data store: this approach is a non-relational data store solution within the Esri ecosystem. The spatiotemporal big data store supports horizontal scaling and provides high-performance read/write operations. This method enables archiving of high-volume real-time data that come through the GeoEvent Server or location tracking data that are collected by applications such as ArcGIS Tracker;
- Feature services: this approach is suitable for GIS or non-spatial tabular data; and
- Stream services: with this approach, GeoAnalytics tools can receive real-time data feeds through the GeoEvent Server.
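As a concrete illustration of the first of these internal sources, the sketch below uses the ArcGIS API for Python to run a GeoAnalytics tool against a hosted feature layer. This is a minimal sketch, not a definitive recipe: the portal URL, credentials, and item ID are hypothetical placeholders, and the exact tool parameters may vary by release.

```python
from arcgis.gis import GIS
from arcgis.geoanalytics.summarize_data import aggregate_points

# Connect to an ArcGIS Enterprise portal with a federated
# GeoAnalytics Server (URL and credentials are placeholders).
gis = GIS("https://myportal.example.com/portal", "analyst", "password")

# Look up a hosted point feature layer by its (hypothetical) item ID.
item = gis.content.get("0123456789abcdef0123456789abcdef")
points = item.layers[0]

# Aggregate the points into 1 km square bins; the result is written
# back as a new hosted feature layer managed by ArcGIS Enterprise.
result = aggregate_points(
    point_layer=points,
    bin_type="Square",
    bin_size=1,
    bin_size_unit="Kilometers",
    output_name="aggregated_points_demo",
)
print(result)
```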
In addition to the above four internal data sources, four types of external sources that exist outside of the Esri ecosystem can also be registered with the GeoAnalytics Server to provide the big data required for GeoAnalytics tools:
- A file share: this approach references a directory that resides on local storage or on a server in your network that is accessible to all instances of the GeoAnalytics Server. All required data must be placed under this directory;
- The Apache Hadoop Distributed File System (HDFS): this approach references a parent directory (folder) located in the HDFS environment (proper permission is also required for access);
- Apache Hive: this approach references a Hive metastore database. Apache Hive is a data warehouse in the Hadoop ecosystem that provides an SQL-like interface. The metastore is the central repository of Apache Hive, storing metadata for Hive tables; and
- A cloud store: this approach references cloud storage services that contain data. Four types of cloud storage services can be registered with the GeoAnalytics Server, namely Amazon Simple Storage Service (S3) buckets, Microsoft Azure Blob containers, Alibaba Cloud Object Storage Service (OSS), and Microsoft Azure Data Lake Store.
Spatial data, such as shapefiles, or non-spatial data, such as text or CSV files, can be registered with the GeoAnalytics Server through any of the above-mentioned external locations to provide the data access required by GeoAnalytics tools. Moreover, the outputs of GeoAnalytics tools can be written to a registered file share, HDFS, or cloud store.
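Registration itself can be scripted. The sketch below is a minimal example, assuming the ArcGIS API for Python and a GeoAnalytics Server federated with the portal; the share name and UNC path are hypothetical placeholders, and the `add_bigdata` call pattern shown here may vary by API release.

```python
from arcgis.gis import GIS
from arcgis.geoanalytics import get_datastores

# Connect to the portal (URL and credentials are placeholders).
gis = GIS("https://myportal.example.com/portal", "admin", "password")

# Get the datastore manager for the federated GeoAnalytics Server.
datastores = get_datastores(gis=gis)

# Register a network file share as a big data file share. Every
# GeoAnalytics Server machine must be able to read this UNC path.
bigdata_fileshare = datastores.add_bigdata(
    name="SensorArchive",
    server_path=r"\\fileserver\share\sensor_data",
)

# Once registered, the share's datasets (e.g., CSVs, shapefiles)
# can be passed as inputs to GeoAnalytics tools.
print(bigdata_fileshare)
```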
Although the file share option may provide increased security, it might not be a reasonable option for large volumes of data because of the following drawbacks:
- Each machine has a limited capacity for storing data;
- The performance of read operations, which is crucial for analyzing big data, drops significantly when reading from a single machine; and
- Hard drives are fragile; they may fail at any time, and data may be lost.
Despite these potential drawbacks, a file share may nonetheless be a good option for prototyping and development, or in situations where a local system has enough storage capacity and performance is not a concern. However, for many real-world applications, referencing data using the other three options (i.e., HDFS, Hive, and a cloud store) is a more practical approach. HDFS is Hadoop's primary data storage layer, allowing users to store large volumes of data across hundreds of machines. Hive is a component of the Hadoop ecosystem that uses a metastore service to store metadata about datasets, while the actual datasets are stored in a Hadoop-compatible file system, such as HDFS or Ozone. Accessing data through the metastore usually provides better performance than reading it directly from HDFS. Cloud-based storage provides many benefits of its own, such as scalability, availability, and cost-effectiveness.
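Outside the Esri ecosystem, the same HDFS directory that gets registered with GeoAnalytics Server can be inspected with standard tooling. The sketch below is a generic illustration (not part of the original post) using the `pyarrow` library, assuming a reachable HDFS NameNode and a local `libhdfs` installation; the host, port, and paths are hypothetical.

```python
from pyarrow import fs, csv

# Connect to the HDFS NameNode (host/port are placeholders; pyarrow's
# HadoopFileSystem requires the native libhdfs library to be present).
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# List the contents of the directory that would be registered with
# GeoAnalytics Server as a big data source.
for info in hdfs.get_file_info(fs.FileSelector("/data/sensor_data")):
    print(info.path, info.size)

# Read one delimited file into an Arrow table for a quick sanity check.
with hdfs.open_input_file("/data/sensor_data/readings.csv") as f:
    table = csv.read_csv(f)
print(table.num_rows, table.column_names)
```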
With these introductory, yet important, points in mind, the next blog post will explain in more detail the drawbacks of traditional data storage systems in dealing with big data and how non-relational data stores substantially overcome these issues. The discussion will focus specifically on the Apache HDFS architecture and demonstrate how HDFS provides a very reliable, cost effective, and high-performance solution for storing and retrieving big data.