How to configure a NiFi component

Choose one of two options to integrate Apache NiFi with the Data Platform Stack.
  1. Create a new NiFi cluster within the Data Platform Stack
  2. Integrate an externally created NiFi cluster with the Data Platform Stack
(Option 1) Create a new NiFi cluster within the Data Platform (recommended option)
Choose this option to create a new NiFi cluster that is managed by the Data Platform Kubernetes infrastructure and integrates seamlessly with other Data Platform components without affecting applications outside the Data Platform.

NiFi automates and manages the flow of data between systems. It is mainly used for data acquisition, transport, and guaranteed delivery, and it can handle complicated and diverse data flows, including data-based event processing with buffering and prioritized queuing. NiFi is highly configurable, and flows can be modified at run time, so organizations can make immediate changes and tighten the feedback loop. NiFi also provides data provenance capabilities that track data flows from end to end.

NiFi provides a user-friendly drag-and-drop interface that lets data administrators quickly build out data flows and simplifies operations with real-time control and monitoring. NiFi is designed for Big Data workloads and works with structured, unstructured, or semi-structured data of any size and format, with or without a schema.

For more details, see the NiFi User Guide.

NiFi Settings
  1. Replicas: Number of NiFi nodes.

    Set the number of replicas higher than 1 for high availability (H/A) and failover (F/O).

  2. NiFi Auth Configuration Storage

    Storage capacity for the NiFi Auth Configuration.
    Default: 100Mi

  3. Data Directory Storage

    Storage capacity for the 'data' directory, which is used to hold things such as the flow.xml.gz, configuration, state, etc.
    Default: 1Gi

  4. FlowFile Repository Storage

    Storage capacity for the FlowFile repository.
    Default: 10Gi

  5. Content Repository Storage

    Storage capacity for the Content repository.
    Default: 10Gi

  6. Provenance Repository Storage

    Storage capacity for the Provenance repository.
    Default: 10Gi

  7. NiFi Log Storage

    Storage capacity for the NiFi logs.
    Default: 5Gi

  8. NiFi Java Heap Memory

    Amount of memory to allocate to the NiFi Java heap.
    Default: 2
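
To get a feel for how much storage the defaults above add up to, the following minimal Python sketch totals them. It is illustrative only: the dictionary keys and function names are hypothetical and not part of the product, and it assumes that every replica is provisioned with its own copy of each volume, which this article does not state explicitly.

```python
# Default storage values copied from the settings list above.
# The key names are illustrative; they are not product field names.
DEFAULT_STORAGE = {
    "auth_config": "100Mi",
    "data_directory": "1Gi",
    "flowfile_repository": "10Gi",
    "content_repository": "10Gi",
    "provenance_repository": "10Gi",
    "logs": "5Gi",
}

# Only the two binary suffixes used in this article are handled.
_UNITS_IN_MIB = {"Mi": 1, "Gi": 1024}


def to_mib(quantity: str) -> int:
    """Convert a quantity such as '100Mi' or '1Gi' to mebibytes."""
    for suffix, factor in _UNITS_IN_MIB.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported quantity: {quantity!r}")


def total_storage_mib(storage: dict, replicas: int = 1) -> int:
    """Total storage in MiB, assuming one set of volumes per replica."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    per_node = sum(to_mib(value) for value in storage.values())
    return per_node * replicas


if __name__ == "__main__":
    for replicas in (1, 3):
        total = total_storage_mib(DEFAULT_STORAGE, replicas)
        print(f"{replicas} replica(s): {total} MiB (~{total / 1024:.1f} GiB)")
```

Under that assumption, raising Replicas for H/A multiplies the storage to plan for, so it is worth checking the total before increasing the node count.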

Motivation for CPU requests and limits
Configure the CPU requests and limits of the Containers that run in the cluster to make efficient use of the CPU resources available on the cluster nodes. Keeping a Pod's CPU request low gives the Pod a good chance of being scheduled. Having a CPU limit that is greater than the CPU request accomplishes two things:
  • The Pod can have bursts of activity, making use of CPU resources that happen to be available.
  • The amount of CPU resources a Pod can use during a burst of activity is limited to a reasonable amount.
If a CPU limit is not specified for a Container, one of these situations applies:
  • The Container has no upper bound on the CPU resources it can use. The Container could use all of the CPU resources available on the node where it is running.
  • The Container runs in a namespace with a default CPU limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the CPU limit.
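
The relationship between a CPU request and a CPU limit can be sketched in Python as follows. This is a minimal illustration, not the Data Platform's own configuration code: the helper names and the example values (`250m`, `1`) are hypothetical, while the `requests`/`limits` layout follows the `resources` section of a Kubernetes container spec.

```python
from typing import Optional


def cpu_to_millicores(value: str) -> int:
    """Convert a Kubernetes CPU quantity ('250m', '1', '0.5') to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return round(float(value) * 1000)


def build_cpu_resources(request: str, limit: Optional[str] = None) -> dict:
    """Build the 'resources' section of a container spec for CPU only.

    When no limit is given, the 'limits' key is omitted, which corresponds to
    the two situations described above (no upper bound, or a namespace default).
    """
    resources = {"requests": {"cpu": request}}
    if limit is not None:
        # A limit lower than the request makes no sense for a burstable Pod.
        if cpu_to_millicores(limit) < cpu_to_millicores(request):
            raise ValueError("CPU limit must not be lower than the CPU request")
        resources["limits"] = {"cpu": limit}
    return resources


if __name__ == "__main__":
    # Hypothetical values: a low request for easy scheduling, a higher limit for bursts.
    print(build_cpu_resources("250m", "1"))
    # No limit: the Container is unbounded unless a LimitRange supplies a default.
    print(build_cpu_resources("250m"))
```

A low request with a higher limit, as in the first call, matches the burst behavior described above; the second call shows the case where the limit is left to the namespace defaults.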

(Option 2) Integrate an externally created NiFi cluster with the Data Platform Stack
Choose this option if the NiFi cluster was created outside the Data Platform and is to be used from within the Data Platform. Be aware that any changes and usage that occur while running the Data Platform may impact external applications (systems) that depend on this NiFi cluster.
