The Hadoop Distributed File System

November 20, 2020

Hossein Hosseini

This blog post addresses the new era of storage systems by focusing on the Hadoop Distributed File System (HDFS). HDFS is one of the supported external data sources that can be registered with Esri’s GeoAnalytics Server. Spatial and non-spatial data can be stored in HDFS and referenced in GeoAnalytics Server to provide the required data for GeoAnalytics tools. Moreover, the outputs of GeoAnalytics tools can be written to a registered HDFS location.

The era of storing all datasets on a single computer in multiple, linked tables is substantially a thing of the past. In the same way that the Internet has come to dominate person-to-person, person-to-business and business-to-business communications, the use of sensors and mobile devices has also become ubiquitous in modern society. Together, this massive exchange and resulting collection of massive amounts of substantially unstructured data have quickly surpassed the storage capability and capacity of relational databases within a single repository on a single device. This blog addresses the new era of massive data collection and storage. Specifically, it introduces the concept of distributed data storage which has been developed to address these new data storage and data management needs, focusing specifically on the Apache Hadoop Distributed File System (HDFS).

Traditionally, file systems have provided data custodians with a method for storing, organizing, and finding data efficiently on a storage device. Well-known examples of file systems include Microsoft’s file allocation table 32-bit (FAT32) and the more efficient new technology file system (NTFS), and on Linux the equivalents are the extended file system 3 and 4 (Ext3, Ext4), and XFS, which was created by Silicon Graphics and ported to Linux in 2001. These file systems are called local file systems as they have a local, machine-centred, view of the data and storage that they manage. Computer operating systems rely on these file systems to store and manage local data. Similar to those systems noted above, HDFS is also a file system. However, unlike local file system types it has a distributed view across a cluster of machines of the data that it manages. In other words, HDFS operates on top of the local file system that is specific to each machine. Distributed file systems are essentially a component of distributed systems, which use clusters of commodity hardware to store and process data. One of the most popular distributed systems that can be integrated into Esri’s GIS software ecosystem is Hadoop. HDFS is the Hadoop primary data storage component that allows users to store massive, or more commonly known as ‘big’, data collections across potentially hundreds of machines (nodes) within a cluster.

As discussed in the previous blog post titled “Data storage options for GeoAnalytics Server”, HDFS is one of the supported external data sources that can be used by GeoAnalytics Server. Spatial data, such as shapefiles, or non-spatial data, such as text, csv, or other file types, can be stored in HDFS and referenced in the GeoAnalytics Server to provide the required data for any of the analysis tools provided by the GeoAnalytics platform. Moreover, the outputs of GeoAnalytics tools can also be written to a registered HDFS location. Writing the results of GeoAnalytics tools to HDFS is supported in ArcGIS Enterprise 10.7 and later. The following sections of this blog post explains when and why to use HDFS (or other distributed data storage).

When to use HDFS?

With increasing use of distributed non-relational data stores, such as Esri’s spatiotemporal big data store, NoSQL databases, object storage, or file systems, it is germane to ask, “is the relational database model is dead”? The short answer to this question is no. Relational databases are still very powerful for online, transaction-oriented tasks where a high level of consistency is crucial. For example, many banks continue to use the relational database model to store all online banking transactions, since having real-time consistency is very important for their business. Consistency in banking data means that all users of the system must be able to see the same data on demand. It is not acceptable to have different views of data for a user’s account residing on different servers. All users have to be able to see the last updated value irrespective of which server processes database queries. It means that someone in Canada will see the same version of a data item as someone in Europe at a given time irrespective of which server executes a query.

Hence, database consistency is a major property of a relational database that most distributed NoSQL data storage and other file systems cannot guarantee. However, there are a few exceptions to this. For example, MongoDB inherits the consistency model from the relational database model. Rather than real-time consistency, distributed storage systems achieve eventual consistency, which means that all users of a system will see the last updated value after a short delay before the update of a data item propagates to all other servers in a cluster. For example, someone in Canada may see the most updated version of a data item earlier than someone in Europe if their query is executed on different servers.

A further question often asked about distributed non-relational data stores is why have they become so popular in recent years? This question can be answered simply with two words, namely scalability and performance. When big data are analyzed, the priority changes from consistency to performance in terms of read/write operations, and to scalability to provide enough resources for massive storage and efficient processing. When it comes to big data, distributed non-relational data stores are usually the best option. Big data come in all types of formats, from structured to unstructured and semi-structured data types. Unstructured data refer to information that, unlike structured data, does not fit into the traditional row-column relational database format. Text, images, and videos are some of the major unstructured data types. Between the two structured and unstructured extremes, there is a third category called semi-structured. Similar to unstructured data, semi-structured data do not fit well into the tabular format of relational databases, but it is still possible to create semantic properties (such as tags or keys) that can be used to interpret the data content. Extensible markup language (XML) and JavaScript Object Notation (JSON) formatted documents are two prime examples of semi-structured data. Non-relational data stores are the best option for storing such semi-structured data formats. For example, MongoDB, which stores data as binary JSON (BSON) documents, and Couchbase are good options for storing and querying JSON documents because of the functionality they provide for processing JSON documents.

To make the discussion more intuitive, suppose the type of data a developer needs to work with falls in one of the following quadrants based on data changes and data structure:

Dynamic data types get changed (insert/update/delete) continuously. On the other hand, static data remain unchanged after being collected, or change only infrequently. As it can be seen in the above table, relational databases are preferred for structured, dynamic data. For transactional data that need online updates, relational databases perform very well. However, if the size of data is extremely large, some NoSQL databases, such as MongoDB can be used to store data. For static structured data, non-relational data stores or warehouses are preferred because of their query performance. A data warehouse is still a relational database, but the database design is different since it is used for analytical purposes rather than transactional processing. For example, denormalizing data and storing them in fewer tables is a common practice in data warehousing. As a matter of fact, data warehouses must also get updated, but unlike transactional databases the updates don’t need to be reflected instantaneously. Depending on the type of business, a data warehouse may get updated at the end of each week or day, or even multiple times a day. However, the updates don’t happen online. Non-relational data stores, such as HDFS, or distributed data warehouses, such as Hive, are other good options for storing structured static data because they provide high performance read operations.

For semi-structured and unstructured static data, non-relational data stores, such as Esri’s spatiotemporal big data store, file systems (e.g., HDFS), NoSQL databases (e.g., MongoDB or HBase), or object storage (e.g., Amazon S3) systems can be used. As a general rule, distributed data stores deliver the highest performance when data are written once and read many times. In reality, a large portion of big data are static and rarely updated after creation. Storing semi-structured and unstructured dynamic data is more challenging. In situations where data are updated continuously, NoSQL databases are often the best option because they provide a better performance.

Why use HDFS?

To resolve the inherent limitations of single machine storage and processing, computing resources can be expanded in two ways. The first approach is vertical scaling (scale up) and the second is horizontal scaling (scale out). With vertical scaling, resources such as RAM and CPU can be added to the existing machine or the machine’s configuration can be expanded. With horizontal scaling, the computing power of the infrastructure is expanded by adding more machines, usually referred to as nodes, to the server site. Although vertical scaling might be an option in some scenarios, the capacity of one machine is still limited relative to horizontal scaling. Hadoop HDFS is based on the concept of horizontal scaling. Horizontal scaling provides many benefits for data storage, including:

Unlimited increase of system capacity: hundreds of machines can be added to a distributed system to increase its capacity (i.e., storage and computing power);
Fault tolerance: a distributed system can guarantee that if one or more components of a single machine fails, for example, a storage disk, the system continues operating uninterrupted;
Software failure: a distributed system can handle software failure in situations where a part of a distributed job that is assigned to a node fails; and
High availability: high availability is the result of a system’s fault tolerance capability. High availability means a system operates continuously without failing. System availability is measured by a system’s uptime, which shows the percentage of total running time. For a well-designed distributed system, the uptime measure is almost 100%.

It is worth mentioning that attempts have been made to provide scalability and high availability in relational databases. For example, Oracle introduced the maximum availability architecture (MAA) to enhance availability and provide fault tolerance. This architecture allows for horizontal scaling of relational Oracle databases across a pool of nodes. Moreover, there are some techniques, such as data sharding, which physically split data so they are distributed in the cluster such that each instance of a relational database has a subset of data to work with on its local disk (see Figure 1).

Figure 1: Table sharding in Oracle (Source: Overview of Oracle Sharding)

However, storing structured and normalized data in a distributed fashion may cause performance issues when querying the data, because in most cases tables that are located on different nodes need to be joined before returning the output to a user. Moreover, as explained earlier, relational databases are the best option for storing online transaction-oriented data and therefore the consistency is crucial. On the other hand, according to the Consistency-Availability-Partition tolerance (CAP) theorem, there is a trade-off between consistency and availability in distributed data storage (for more information visit CAP theorem). Hence, when preserving perfect consistency is crucial, having a highly available system is not guaranteed. HDFS does not have these limitations as it is usually used to store static datasets.

In order to see how the Hadoop framework provides scalability, fault tolerance, and availability features associated with horizontal scaling, the remainder of this blog post will provide an overview of the HDFS architecture and read/write operations in HDFS.

HDFS architecture

HDFS adopts a master-slave architecture, where one server machine operates as the master machine, called NameNode, and a number of slave (worker) machines, called DataNodes, that are used to store data (see Figure 2).

Figure 2: Architecture of the HDFS file system

Operations in HDFS are performed at a block level, meaning that when a file is stored in HDFS, the file is split into multiple pieces called data blocks or data ‘chunks’. The default size of a data block is 128 MB (in Hadoop 2.x and 3.x), but this size is configurable as needed. HDFS stores these data blocks by distributing them across a cluster of DataNodes according to its policy. HDFS almost follows a random distribution policy, meaning that it places each replica in a randomly selected DataNode. However, in this process HDFS takes other factors into consideration. For example, to improve fault tolerance, HDFS places one replica of each data block on a different rack with the rack awareness configured in an HDFS cluster (to learn more about the rack awareness in Hadoop, visit and read the ‘Hadoop rack awareness’ documentation).

Moreover, it is possible to configure the HDFS storage policies and tell HDFS how to store data. For instance, you can tell HDFS to store one replica in solid state drive (SSD) storage. If only some of the DataNodes are solid state, HDFS has to make sure one replica of a data block will be transferred to a DataNode that has SSD storage (to learn more about different HDFS storage policies, visit and read the ‘HDFS storage policies’ documentation). NameNode does not store any data blocks, as it is responsible for managing the data blocks that are located on the DataNodes. The mapping of data blocks to their associated file is stored in a metadata file called FsImage that is located on the NameNode. When a client sends a read file request to the NameNode, it uses the FsImage to give information to the client about the DataNodes where the relevant data blocks are located.

A distributed storage system such as this handles fault tolerance and provides a high-level of uptime through replication of data blocks. The default replication factor for a Hadoop distribution is three, meaning that every data block is replicated three times across the DataNodes in the cluster. Therefore, if one or even two DataNodes fail, a data block would still be accessible through the third replica. The default replication factor can be modified when required. For example, if data are very sensitive or there is the need for 100% data availability, the replication factor can be increased. Moreover, the capacity of storage can limitlessly increase by adding more DataNodes.

As explained earlier, HDFS is a platform for storing and processing static data for analytical purposes, and as such it should not be used for transactional data. Indeed, HDFS follows a write-once-read-many (WORM) access model for files. It means that data are written to the HDFS once and they do not get updated (although it is possible to do this). Since HDFS is built on the concept of distributed storage, updating a file may take a long time (usually longer than the case of relational databases). Hence, the major goal of HDFS is to provide a very fast reading operation when users want to fetch data for analysis. In order to store data in HDFS, it is important to understand read/write operations in HDFS to get a better insight into the workflow. This is discussed in the next section.

Note: some storage layers, such as Delta Lake and Apache Hudi, allow users to perform read-heavy or write-heavy operations in HDFS with low latency.

Read/Write operations in HDFS

Before starting to explain how read/write operations are carried out in HDFS, it should be noted that the actual read/write pipeline that is used by Hadoop is very detailed, and what is discussed here is a simplified version of the actual process.

Consider a scenario where the intention is to store a 228 MB text file named log.txt in HDFS with the replication factor set to three. The file will be split into two blocks, one 128 MB and the other 100 MB (and suppose these are called b1 and b2 respectively). After a client interacts with the NameNode, the NameNode prepares and sends the metadata back to the client. The client then uses the metadata received from the NameNode to transfer the data blocks to the DataNodes. The NameNode response may logically resemble the following:

Log.txt, b1:DataNode_A, copy_b1:DataNode_C & DataNode_D

Log.txt, b2:DataNode_B, copy_b2:DataNode_C & DataNode_D.

Figure 3 shows a simplified workflow in HDFS.

Figure 3: HDFS write pipeline

In this workflow,

the client machine interacts with the NameNode to create a file and the NameNode checks that the file doesn’t already exist and the client has permission for writing it. If these checks pass, a new file is created by the NameNode with no data inside it;
the NameNode returns a list of all DataNodes’ IP addresses where the client can write the data blocks to. Moreover, the NameNode sends a security token to the client;
the client uses the security token that it has received from the NameNode to connect to the selected DataNodes;
the client starts writing the data blocks directly on the selected DataNodes.

Writing data blocks is performed in parallel but writing replicas of a data block is done sequentially. For example, the client starts writing log—b1 on DataNode_A and log—b2 on DataNode_B in parallel. DataNode_A and DataNode_B are selected first because they are respectively the first DataNodes in the list for log—b1 and log—b2. These DataNodes are called primary DataNodes.

Once writing the first data blocks (i.e., log—b1 and log—b2) is finished, DataNode_A is responsible for replicating log—b1 on DataNode_C and DataNode_B is responsible for replicating log—b2 on DataNode_C. Once completed, DataNode_C is responsible for replicating log—b1 and log—b2 on DataNode_D. When writing and replicating each data block are completed successfully, all selected DataNodes send an acknowledgment to the client machine (three acknowledgements will be sent for each data block for a replication factor of three). Moreover, the primary DataNodes (DataNode_A and DataNode_B) update the block information on the NameNode. NameNode uses the data block information associated with a file when a client machine wants to read the file.

Finally, the client interacts with the NameNode to signal completion of the write operation.

For a read operation (see Figure 4), the following steps are initiated:

the client calls the NameNode to receive the locations of data blocks associated with the requested file;
NameNode first checks if the client has the privilege to read the file. If these checks pass, the NameNode returns a list of DataNodes’ IP addresses that have a copy of each data block. Here, the NameNode sends addresses for three DataNodes for each data block since the replication factor is set to three;
For each data block, DataNodes are sorted based on their geographic proximity to the client machine. The client then interacts with the closest DataNode for each data block and pulls the data in parallel. In Figure 4, it is assumed that DataNode_A is the closest node storing b1 and DataNode_B is the closest node storing b2;
Once reading all data blocks is completed successfully, the client closes the connection to the DataNodes.
Figure 4: HDFS read pipeline

This completes the read and write operations as performed in a simple example within the Hadoop system.

Summary

To conclude, this blog post has introduced you to the Hadoop HDFS, one of the high performance external data sources that can be registered with Esri’s GeoAnalytics Server to provide big data as input to GeoAnalytics’ processing tools. The next blog post will discuss in more detail the use of a cloud store approach, which is another widely used method to store big data in an external storage architecture that can be accessed by the GeoAnalytics Server environment.

This post was translated to French and can be viewed here.