Choose one of two options to integrate StreamSets Data Collector with the Data Platform Stack.
Create a new StreamSets Data Collector cluster within the Data Platform Stack
Integrate an externally created StreamSets Data Collector cluster with the Data Platform Stack
(Option 1) Create a new StreamSets Data Collector cluster within the Data Platform (recommended option)
Choose this option to create a new StreamSets Data Collector cluster that is managed by the Data Platform Kubernetes infrastructure and integrates seamlessly with other Data Platform components, without affecting applications outside the Data Platform.
StreamSets Data Collector is a lightweight, powerful design and execution engine that streams data in real time. Use Data Collector to route and process data in data streams.
To define the flow of data, design a pipeline in Data Collector. A pipeline consists of stages that represent the origin and destination of the pipeline and any additional processing to be performed. After you design the pipeline, click Start, and Data Collector goes to work.
Data Collector processes data when it arrives at the origin and waits quietly when not needed. View real-time statistics about the data, inspect data as it passes through the pipeline, or take a closer look at a snapshot of data.
Data Collector enables fast data ingestion and light transformation without hand-coding by providing drag-and-drop, pre-built connectors between various sources and destinations.
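The pipeline model described above (an origin, optional processing stages, and a destination) can be pictured with a small sketch. This is only a conceptual illustration in plain Python, not the Data Collector API; the names run_pipeline, uppercase_name, and sink are hypothetical and exist only for this example.

```python
from typing import Callable, Iterable, List

Record = dict


def run_pipeline(
    origin: Iterable[Record],
    processors: List[Callable[[Record], Record]],
    destination: Callable[[Record], None],
) -> int:
    """Pass each record from the origin through every processor, then write it out."""
    count = 0
    for record in origin:
        for process in processors:
            record = process(record)
        destination(record)
        count += 1
    return count


def uppercase_name(record: Record) -> Record:
    # A light transformation stage: returns a copy with the "name" field upper-cased.
    return {**record, "name": record["name"].upper()}


# Hypothetical in-memory origin and destination, used only for illustration.
sink: List[Record] = []
processed = run_pipeline(
    origin=[{"name": "alice"}, {"name": "bob"}],
    processors=[uppercase_name],
    destination=sink.append,
)
```

In Data Collector itself, the equivalent stages are configured visually rather than written as code.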
StreamSets Data Collector Settings
Use Timeout in Seconds to control how long the Kubernetes controller waits for the successful creation of StreamSets Data Collector and its sub-components. If the timeout elapses before creation succeeds, the Data Collector component is reported as failed. Increase this value if you frequently experience timeouts.
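The effect of a timeout setting like this can be sketched as a wait loop with a deadline. The following is a minimal illustration of the general pattern, not the controller's actual implementation; wait_until_ready, is_ready, and poll_interval are hypothetical names chosen for this example.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_seconds: float,
    poll_interval: float = 1.0,
) -> bool:
    """Poll is_ready until it returns True or timeout_seconds elapses.

    Returns True on success and False on timeout, which a caller could
    then report as a failed component.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if is_ready():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))
```

A larger timeout gives slow-starting components more polling attempts before the wait is abandoned.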
Set Replica to 1. The community version of StreamSets Data Collector doesn't support High Availability.
Use CPU/Memory Limit and Request to configure the CPU requests and limits of the Containers that run in the cluster. Keeping a Pod's CPU request low gives the Pod a good chance of being scheduled and makes efficient use of the CPU resources available on the cluster Nodes. Having a CPU limit that is greater than the CPU request accomplishes two things:
- The Pod can have bursts of activity in which it uses CPU resources that happen to be available.
- The amount of CPU resources the Pod can use during a burst is capped at a reasonable amount.
If you do not specify a CPU limit for a Container, one of the following situations applies:
- The Container has no upper bound on the CPU resources it can use. The Container could use all of the CPU resources available on the Node where it is running.
- The Container runs in a namespace with a default CPU limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the CPU limit.
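As a rough illustration of the request and limit relationship, the sketch below builds the resources section of a Kubernetes container spec and checks that the CPU request does not exceed the CPU limit. The helper names parse_cpu and build_resources are hypothetical, and the values 500m and 1 are example figures, not recommended settings for Data Collector.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to a number of cores."""
    if quantity.endswith("m"):
        # The "m" suffix means millicores: 1000m equals one core.
        return int(quantity[:-1]) / 1000
    return float(quantity)


def build_resources(cpu_request: str, cpu_limit: str) -> dict:
    """Return a container "resources" mapping after checking request <= limit."""
    if parse_cpu(cpu_request) > parse_cpu(cpu_limit):
        raise ValueError("CPU request must not exceed CPU limit")
    return {
        "requests": {"cpu": cpu_request},
        "limits": {"cpu": cpu_limit},
    }


# Example values only: a low request for easy scheduling, a higher limit for bursts.
resources = build_resources(cpu_request="500m", cpu_limit="1")
```

With these example values, the Pod is scheduled on the basis of half a core but may burst up to one full core when capacity is available.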
(Option 2) Integrate an externally created StreamSets Data Collector cluster with the Data Platform Stack
Choose this option if the StreamSets Data Collector cluster is external to the Data Platform, that is, it was not created within the Data Platform. Be aware that any changes and usage that occur while running the Data Platform may impact external applications (systems) that depend on this StreamSets Data Collector cluster.