Data Lake Platform is a managed, Kubernetes-based service that delivers integrated solutions (Data Flow, Data Transformation, Data as a Service) for ingesting data from multiple sources into a Data Lake. It provides data workflows that orchestrate data transformation and data curation for business and operational analytics.
It also provides a centralized data repository that stores all structured and unstructured data as-is, at any scale, without requiring the data to be structured first. From that repository you can run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions. Metadata (Catalog) Management makes the data visible and easily accessible to consumers.
The Data Lake Platform eliminates the extensive data modeling and ETL work normally required by schema-on-write data warehouse solutions. Instead, it uses a schema-on-read approach that enables fast ingestion and consumption of both structured and unstructured data, shortening the time-to-value for fast-changing business requirements.
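To illustrate the schema-on-read idea, here is a minimal sketch using PySpark (Spark is one of the platform's components, described below); the storage path and column names are hypothetical. The structure is applied when the data is queried, not before it is loaded:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Schema-on-read: ingest raw JSON events as-is; Spark infers the
# structure at read time rather than requiring it up front.
events = spark.read.json("s3a://example-data-lake/raw/events/")  # hypothetical path

# The schema is applied at query time: filter and aggregate on
# columns that were never declared before ingestion.
(events
    .filter(events["event_type"] == "purchase")
    .groupBy("customer_id")
    .count()
    .show())
```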
Data Lake Platform lets a user create an Enterprise Data Stack through the dpStudio UI by configuring the data flow, transformation, repository, and access settings.
dpStudio also lets a user manage the lifecycle of a Data Lake Stack (start, update, terminate) and monitor the Stack's runtime behavior through built-in Observability features that track data usage and the internal health of the components.
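As a rough sketch of the kind of query such Observability features build on (the platform ships Elastic for this, as listed below), the following uses the elasticsearch Python client (8.x) to pull recent error-level component logs; the endpoint, index pattern, and field names are assumptions, not platform defaults:

```python
from elasticsearch import Elasticsearch

# Hypothetical endpoint and index pattern; adjust to your deployment.
es = Elasticsearch("http://localhost:9200")

# Fetch the most recent error-level log events emitted by stack components.
resp = es.search(
    index="logs-*",                               # assumed index pattern
    query={"match": {"log.level": "error"}},      # assumed field name
    sort=[{"@timestamp": {"order": "desc"}}],
    size=10,
)

for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))
```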
Data Lake Platform includes the following software components:
AWS EKS is used to provision Data Lake Platform stacks in the customer's AWS account.
Google GKE is used to provision Data Lake Platform stacks in the customer's Google Cloud account.
snapblocs provisions Data Lake Platforms following well-architected guides (e.g., the AWS Well-Architected Framework and the Google Cloud Architecture Framework) for provisioning and configuring production-grade Kubernetes clusters and deploying workloads into those clusters. It draws on patterns that have been used successfully for many customers in production environments, and it makes the platform easy to get started with and easy to configure properly.
Kubernetes is an open-source container-orchestration system for automating application deployment, scaling, and management. It is used to deploy the selected Components.
Kafka is used to build real-time data pipelines and streaming applications by integrating data from multiple sources and locations into a single, central Event Streaming Platform (a minimal producer sketch follows after this list).
Elastic is used to provide observability (monitoring, alerting, APM), answering questions about what is happening inside the system from its externally observable behavior.
Grafana is used to build visualizations and analytics: querying, visualizing, and exploring metrics and setting alerts so that system problems can be identified quickly and disruption to services minimized.
StreamSets Data Collector is a low-latency ingest infrastructure tool used to create continuous data-ingest pipelines with a drag-and-drop UI within an integrated development environment (IDE). It ingests source data, in stream or batch mode, into other data platforms such as a Data Lake, whether on-premises or in a cloud datacenter.
Dremio is a data lake engine that liberates your data with live, interactive queries directly on cloud data lake storage.
Spark is an open-source unified analytics engine for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Airflow is an open-source workflow management platform for programmatically authoring and scheduling workflows and monitoring them via the built-in Airflow user interface (a minimal DAG sketch follows below).
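To make the ingestion path concrete, here is a minimal sketch of publishing an event to Kafka with the kafka-python client; the broker address and topic name are assumptions, and in a real stack they would come from the Data Flow configuration:

```python
import json
from kafka import KafkaProducer

# Hypothetical broker address; a deployed stack supplies this value.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a raw event into the central Event Streaming Platform.
producer.send("raw-events", {"event_type": "purchase", "customer_id": 42})
producer.flush()  # block until the message is actually delivered
```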
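Similarly, as a rough sketch of the orchestration layer, the following Airflow DAG (Airflow 2.x API) chains an ingest step and a transform step; the DAG name, schedule, and task bodies are placeholders, not platform defaults:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    # Placeholder: pull a batch from the source system into the lake.
    print("ingesting raw data")

def transform():
    # Placeholder: curate the ingested batch for analytics.
    print("transforming curated data")

with DAG(
    dag_id="lake_ingest_and_curate",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # assumed schedule
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    ingest_task >> transform_task  # transform runs only after ingest succeeds
```

Chaining tasks with `>>` is what lets the platform's workflow layer express curation as a dependency of ingestion, so downstream analytics only ever see fully processed data.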