How to configure a StreamSets Data Collector component

Choose one of two options to integrate StreamSets Data Collector with the Data Platform Stack.
  1. Create a new StreamSets Data Collector cluster within the Data Platform Stack
  2. Integrate externally created StreamSets Data Collector cluster with Data Platform Stack
(Option 1) Create a new StreamSets Data Collector cluster within the Data Platform (recommended option)
Choose this option to create a new StreamSets Data Collector cluster that is managed by the Data Platform Kubernetes infrastructure and integrates seamlessly with other Data Platform components, without affecting external applications outside the Data Platform.

StreamSets Data Collector (DC) is a lightweight, powerful design and execution engine that streams data in real time. Use Data Collector to route and process data in data streams.
To define the flow of data, design a pipeline in Data Collector. A pipeline consists of stages that represent the origin and destination of the pipeline, plus any additional processing to be performed. After designing the pipeline, click Start, and Data Collector goes to work.
Data Collector processes data as it arrives at the origin and waits quietly when not needed. You can view real-time statistics about the data, inspect data as it passes through the pipeline, or take a closer look at a snapshot of data.
Data Collector enables fast data ingestion and light transformation without hand-coding, using drag-and-drop pre-built connectors between various sources and destinations.

For more details, see the Data Collector User Guide.

StreamSets Data Collector Settings
  • Use Timeout in Seconds to control how long the Kubernetes controller waits for the successful creation of StreamSets Data Collector and its sub-components. If the timeout is reached, the Data Collector component is reported as failed. Increase this value if you frequently experience timeouts.
  • Set Replica to 1. The community version of StreamSets Data Collector doesn't support High Availability.
  • Use CPU/Memory Limit and Request to configure the CPU and memory requests and limits of the containers that run in the cluster. To use the CPU resources available on the cluster Nodes efficiently, keep a Pod's CPU request low, which gives the Pod a good chance of being scheduled. Setting a CPU limit greater than the CPU request accomplishes two things:
    • The Pod can have bursts of activity where it uses CPU resources that happen to be available
    • The amount of CPU resources a Pod can use during a burst is limited to some reasonable amount.
  • If you do not specify a CPU limit for a Container, one of the following situations can occur:
    • The Container has no upper bound on the CPU resources it can use. The Container could use all of the CPU resources available on the Node where it is running.
    • The Container runs in a namespace with a default CPU limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the CPU limit.
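As a rough sketch, the settings above correspond to standard Kubernetes Deployment fields. The manifest below is illustrative only; the Data Platform generates its own manifests, and the deployment name, image, and resource values shown here are assumptions, not the stack's actual configuration:

```yaml
# Illustrative sketch only -- the Data Platform renders its own manifests.
# Names, image, and values are assumptions for demonstration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: streamsets-dc
spec:
  replicas: 1                 # community StreamSets DC does not support HA
  selector:
    matchLabels:
      app: streamsets-dc
  template:
    metadata:
      labels:
        app: streamsets-dc
    spec:
      containers:
        - name: datacollector
          image: streamsets/datacollector:latest
          resources:
            requests:
              cpu: "500m"     # low request improves schedulability
              memory: "1Gi"
            limits:
              cpu: "2"        # limit > request allows bounded bursts
              memory: "2Gi"
```

With no `limits.cpu` set, the container could consume all CPU on its Node, or it would inherit a namespace default if the cluster administrator has defined a LimitRange.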
(Option 2) Integrate externally created StreamSets Data Collector cluster with Data Platform Stack
Choose this option if the StreamSets Data Collector cluster was created outside the Data Platform and you want to integrate it with the Data Platform Stack. Be aware that any changes and usage that occur while running the Data Platform may impact external applications (systems) that depend on this StreamSets Data Collector cluster.
