Data lakes are quickly exploding into data oceans. The original 3 V's of data (volume, variety, and velocity) have been extended with 4 more V's. Together, these 7 V's make up the domain of a data fabric.
Organizations across industries are currently examining their data strategy. As they look to modernize their infrastructure and pursue digital transformation, their current storage and data management solutions pose significant obstacles. A new approach is required to support new applications, new technologies including containers, and new development through microservices.
Additionally, as IoT data sources, from high-resolution sensors and smart applications to Industrial IoT devices, continue to multiply explosively, the ability to easily access and process shared data is required. Without a mechanism to collect and analyze that data and apply the results to operational systems, much of the value is lost.
Now, there is an emerging trend to implement a “data fabric” to provide a scalable, flexible solution that converges capabilities across data types and across locations.
With any emerging trend, there are typically different technologies vying for prominence. Some data fabric approaches are extensions of traditional storage pools. Others are built on an ETL foundation and focus on a fabric consisting of sources and destinations. Still others are built on a virtualization substrate that relies on a layer of abstraction to produce a “fabric”.
Before diving into a detailed review of each of these solutions, you should first take a step back and understand the role and requirements of a data fabric.
A data fabric must support the modernization of storage and data management, and move away from the proliferation of data silos. But a data fabric must also integrate with legacy systems, without requiring their presence for the long term. To work effectively, a data fabric must be broad and support a vast array of applications and data types at scale across locations. While data fabrics are a significant change from the assumptions that usually surround data storage and processing, the requirements have their roots in big data.
The big data era was driven by the three V’s – Volume, Variety and Velocity. A data fabric does encompass these requirements but goes well beyond. In fact, an interesting way to summarize the requirements of a data fabric is the seven V’s – Volume, Variety, Velocity, Veracity, Vicinity, Visibility, and Value. When considering a data fabric solution, evaluate approaches on the basis of these seven areas:
The vast volume of data requires a solution with vast scalability, not only in terms of data sizes that extend to exabytes, but also in the number of files, with the ability to support trillions of them. The fabric must continue to scale linearly, with no bottleneck in how data is stored, accessed, processed, and protected.
Variety calls for support for diverse data types, which helps prevent silo proliferation. A solution needs to support structured and unstructured content, including native support for files, tables, streams, documents, objects, and so on. To that end, the data fabric must also support a wide variety of processing for new and complex applications. Both modern and legacy applications need to run smoothly together directly on the data fabric.
Gone are the days when organizations could tolerate long delays to extract, manipulate, process, and load data. Support for high velocity extends not only to how data is captured and collected, but all the way through the process until the data drives a resulting business action. Velocity is a key requirement for integrating low-latency analytics directly into operational applications to “operationalize” the analytics in real time.
Most big data solutions are content to address the first three V’s. The fourth V, veracity, is what forms the foundation for an enterprise data fabric. Eventual consistency, append-only designs, and other workarounds that fail to provide consistency and data integrity at scale fall well short of a data fabric, which requires cloud-grade reliability to serve as a system of record.
Visibility is the ability to see a logical view of all the globally distributed data, regardless of where it is physically stored in the fabric. It is an essential enabler of many core administrative functions. A global namespace is a required underpinning for global management, global access, and global protection.
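To make the idea of a global namespace concrete, here is a minimal Python sketch of one way a logical path could be resolved to a physical location. It is an illustration only, not how any particular data fabric product works: the class names (Location, GlobalNamespace), the site labels, and the paths are all hypothetical, and longest-prefix matching is simply one plausible resolution rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    site: str  # hypothetical site label, e.g. "edge-01" or "cloud-a"
    root: str  # hypothetical physical root directory on that site


class GlobalNamespace:
    """Toy resolver: logical path -> (site, physical path) by longest-prefix match."""

    def __init__(self):
        self._mounts = {}

    def mount(self, logical_prefix, location):
        # Store prefixes without a trailing slash so matching stays uniform.
        self._mounts[logical_prefix.rstrip("/")] = location

    def resolve(self, logical_path):
        # Collect every mounted prefix that covers the path on a path boundary.
        candidates = [
            prefix for prefix in self._mounts
            if logical_path == prefix or logical_path.startswith(prefix + "/")
        ]
        if not candidates:
            raise KeyError(f"no mount covers {logical_path}")
        # The most specific (longest) prefix wins.
        prefix = max(candidates, key=len)
        location = self._mounts[prefix]
        return location.site, location.root + logical_path[len(prefix):]


ns = GlobalNamespace()
ns.mount("/fabric", Location("cloud-a", "/buckets/fabric"))
ns.mount("/fabric/sensors", Location("edge-01", "/mnt/local/sensors"))

sensor_file = ns.resolve("/fabric/sensors/2024/a.csv")
report_file = ns.resolve("/fabric/reports/q1.pdf")
```

Callers only ever use the logical /fabric/... path. Where the bytes physically live can change by remounting a prefix, without touching the applications that read the data.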
Vicinity means that a data fabric must extend from on-premises to multiple cloud installations and all the way to the edge. A data fabric that is capable of spanning from edge to edge and into the cloud is needed to create vast IoT systems of intelligence. With the proper data fabric, job execution and data placement can be performed based on cost, speed, and compliance requirements. Multi-temperature data policies can also be enabled to control the automated placement of hot, warm, and cold data.
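A multi-temperature policy can be sketched in a few lines of Python. The example below is a simplified illustration under stated assumptions: the 7-day and 90-day thresholds, the tier names, and the must_stay_on_prem compliance flag are all hypothetical, and a real fabric would likely weigh cost and speed as well as access recency.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds: data touched within a week is "hot",
# within roughly a quarter is "warm", and anything older is "cold".
HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=90)

# Hypothetical storage tiers for each temperature.
TIER_FOR_TEMPERATURE = {
    "hot": "edge-ssd",
    "warm": "on-prem-disk",
    "cold": "cloud-archive",
}


def temperature(last_access, now):
    """Classify data by how recently it was accessed."""
    age = now - last_access
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"


def place(last_access, now, must_stay_on_prem=False):
    """Return (temperature, tier), honoring a simple compliance constraint."""
    temp = temperature(last_access, now)
    tier = TIER_FOR_TEMPERATURE[temp]
    # Compliance overrides cost: restricted data never moves to a cloud tier.
    if must_stay_on_prem and tier.startswith("cloud"):
        tier = "on-prem-disk"
    return temp, tier


now = datetime(2024, 1, 31)
recent = place(datetime(2024, 1, 30), now)
older = place(datetime(2023, 12, 1), now)
archived = place(datetime(2023, 1, 1), now)
restricted = place(datetime(2023, 1, 1), now, must_stay_on_prem=True)
```

The design choice worth noting is the ordering: temperature is computed first from access recency, and the compliance rule is applied last so that it can override the cheaper placement.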
The final V, value, is by far the most important. The main reason for a data fabric is to enable organizations to pursue digital transformation. Transformation is not limited to a single application or a single data source. It is the culmination of many different applications running on top of a multi-tenant, secure data fabric, one that has a common underlying architecture to simplify management and development and drive down costs while supporting innovation.
With a nomenclature for data, we can now move forward with understanding sources and uses of the data. The challenge is to ensure that data lakes are not just stuffed with more data, but that they are filled with valuable data. ~ Mike Serrano, Sr Product Marketing Manager, NETSCOUT