How to configure a Dremio component

Dremio is a data lake engine that enables live, interactive queries directly on cloud data lake storage. It delivers secure, self-service data access and fast queries on AWS and Azure (GCP is not yet supported). In addition, its vertically integrated semantic layer and Apache Arrow-based SQL engine reduce the time to analytics insight while increasing data team productivity and lowering infrastructure costs.
Choose an option to integrate Dremio with the Data Platform Stack.
  1. Create a new Dremio cluster within the Data Platform Stack
  2. Integrate externally created Dremio cluster with Data Platform Stack
(Option 1) Create a new Dremio cluster within the Data Platform Stack (recommended)
Choose this option to create a new Dremio cluster that is managed by the Data Platform Kubernetes infrastructure and integrates seamlessly with other Data Platform components, without affecting external applications outside the Data Platform.
There are several Dremio sub-components to be configured.
Master Coordinator
The master coordinator node has the unique function of managing metadata. The master coordinator node is also responsible for:
  1. Query planning
  2. Serving Dremio’s UI
  3. Handling client connections, including the REST API
Master Coordinator and Executor Settings
See the memory and CPU requirements for the Coordinator and Executor here.
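As a starting point, the coordinator and executor sizing can be expressed in the Helm values used to deploy Dremio. The snippet below is a hypothetical sketch: the field names follow the general shape of Dremio's community Helm chart, and the numbers are illustrative only; verify both against the chart version deployed with the Data Platform Stack and the linked requirements.

```yaml
# Hypothetical Helm values sketch for sizing the Dremio master
# coordinator and executors. Field names and values are assumptions;
# check them against the deployed chart's values.yaml.
coordinator:
  cpu: 4          # CPU cores for the master coordinator
  memory: 16384   # memory in MB
  count: 0        # additional (non-master) coordinators, if any
executor:
  cpu: 8
  memory: 32768
  count: 3        # number of executor pods
```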

ZooKeeper
Dremio utilizes Apache ZooKeeper behind the scenes for cluster coordination.
A ZooKeeper cluster is installed externally, rather than as an embedded ZooKeeper on the coordinator node, to provide High Availability by default.
See the configuration for running ZooKeeper in production here.
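Pointing Dremio at the external ensemble is done in its configuration file. The fragment below is a sketch of the relevant dremio.conf settings; the hostnames are placeholders for the ZooKeeper service endpoints in the cluster, and the property names should be verified against the Dremio documentation for the version in use.

```
# dremio.conf sketch (assumed property names; hostnames are placeholders)
# Connection string for the external ZooKeeper ensemble:
zookeeper: "zk-0.zk-hs:2181,zk-1.zk-hs:2181,zk-2.zk-hs:2181"
# Disable the ZooKeeper embedded in the master coordinator:
services.coordinator.master.embedded-zookeeper.enabled: false
```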
NOTE:
Motivation for CPU requests and limits
By configuring the CPU requests and limits of the containers that run in a cluster, you can make efficient use of the CPU resources available on cluster nodes. Keeping a Pod's CPU request low gives the Pod a good chance of being scheduled. Having a CPU limit that is greater than the CPU request accomplishes two things:
  • The Pod can have bursts of activity where it uses CPU resources that happen to be available.
  • The amount of CPU resources a Pod can use during a burst is limited to a reasonable amount.
Not specifying a CPU limit for a Container can result in one of the following:
  • The Container has no upper bound on the CPU resources it can use. The Container could use all of the CPU resources available on the Node where it is running.
  • The Container runs in a namespace that has a default CPU limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the CPU limit.
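The note above can be illustrated with two standard Kubernetes manifests: a Pod whose container requests a small amount of CPU but is allowed to burst up to a higher limit, and a LimitRange that gives containers without an explicit limit a namespace default. The resource names and values are illustrative, not part of the Data Platform Stack itself.

```yaml
# Illustrative Pod with a low CPU request (easy to schedule) and a
# higher CPU limit (bounded bursts). Names and values are examples.
apiVersion: v1
kind: Pod
metadata:
  name: dremio-executor-example
spec:
  containers:
  - name: dremio-executor
    image: dremio/dremio-oss   # illustrative image reference
    resources:
      requests:
        cpu: "500m"   # low request improves schedulability
      limits:
        cpu: "2"      # bursts allowed, capped at 2 cores
---
# A LimitRange lets cluster administrators assign a default CPU limit
# to containers in the namespace that do not specify one.
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-limit-range
spec:
  limits:
  - default:
      cpu: "1"        # default CPU limit
    defaultRequest:
      cpu: "500m"     # default CPU request
    type: Container
```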

(Option 2) Integrate externally created Dremio cluster with a Data Platform Stack
Choose this option to integrate an externally created Dremio cluster with the Data Platform. Be aware that changes made and load generated while running the Data Platform may impact external applications or systems that depend on the Dremio cluster. For example, the runtime load added by the Data Platform may degrade the performance of those external applications.