Operator API

Twister2 supports a DataFlow model for operators. A DataFlow program models a computation as a graph, with the nodes of the graph performing user-defined computations and the edges representing the communication links between the nodes. The data flowing through this graph is termed events or messages. It is important to note that even though, by definition, dataflow programming means data flowing through a graph, this may not necessarily be the case physically, especially in batch applications. Big data systems employ different APIs for creating the dataflow graph. For example, Flink and Spark provide distributed dataset-based APIs for creating the graph, while systems such as Storm and Hadoop provide task-level APIs.

We support the following operators for batch and streaming applications. The operators can go from M tasks to N tasks, from M tasks to 1 task, or from 1 task to N tasks.

Twister2 Batch Operations

The batch operators work on a set of input data from a source. All of this input data is processed in a single operation.

Operator     Semantics
Reduce       M tasks to 1, reduce the values to a single value
Gather       M tasks to 1, gather the values from M tasks
Broadcast    1 task to N, broadcast a value from 1 task to N tasks
Partition    M tasks to N, distribute the values in M tasks to N tasks according to a user-specified criterion
AllReduce    M tasks to N, reduce values from M tasks and broadcast the result to N tasks
AllGather    M tasks to N, gather values from M tasks and broadcast them to N tasks
KeyedReduce  M tasks to N, reduce values of a given key; only available with windowed data sets
KeyedGather  M tasks to N, gather values according to a user-specified key; keys can be sorted; only available with windowed data sets
Join         M tasks to N, join values based on a user-specified key

Twister2 Streaming Operations

The streaming operators work on single data items.

Operator     Semantics
Reduce       M tasks to 1, reduce the values to a single value
Gather       M tasks to 1, gather the values from M tasks
Broadcast    1 task to N, broadcast a value from 1 task to N tasks
Partition    M tasks to N, distribute the values in M tasks to N tasks according to a user-specified criterion
AllReduce    M tasks to N, reduce values from M tasks and broadcast the result to N tasks
AllGather    M tasks to N, gather values from M tasks and broadcast them to N tasks
KeyedReduce  M tasks to N, reduce values of a given key; only available with windowed data sets
KeyedGather  M tasks to N, gather values according to a user-specified key; keys can be sorted; only available with windowed data sets
Join         M tasks to N, join values based on a user-specified key

Dataflow communications are overlaid on top of worker processes using logical ids.

We support both streaming and batch versions of these operations.
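
As a concrete illustration of a user-specified criterion for the partition operation, the following is a minimal, hypothetical sketch (plain Java, not the Twister2 Operator API) of a selector that routes each value to one of the target tasks' logical ids using hash partitioning:

    import java.util.List;

    // Hypothetical partition selector: maps a value to the logical id of one of the
    // target tasks. This is an illustrative sketch, not the Twister2 Operator API.
    interface PartitionSelector<T> {
      int selectTarget(T value, List<Integer> targetTaskIds);
    }

    // Example criterion: hash partitioning, so equal values (or values with equal
    // keys) are always routed to the same target task.
    class HashPartitionSelector<T> implements PartitionSelector<T> {
      @Override
      public int selectTarget(T value, List<Integer> targetTaskIds) {
        int index = Math.floorMod(value.hashCode(), targetTaskIds.size());
        return targetTaskIds.get(index);
      }
    }

Routing by a stable function of the value's key is also what keyed operators such as KeyedReduce and KeyedGather typically rely on to bring all values with the same key to the same task.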

Reduce

The reduce operation collects values from a set of tasks and combines them in the manner specified by a user-defined reduce function. To look at a simple example, let us assume we are performing a reduce operation on the values {1,2,3,4,5,6,7,8}: if the function specified for the reduce operation is 'Sum', the result is 36; if the function is 'Multiplication', the result is 40320.
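
As a minimal sketch of how such a reduce function is applied (plain Java; the names here are illustrative and not the Twister2 Operator API):

    import java.util.function.BinaryOperator;
    import java.util.stream.IntStream;

    // Illustrative sketch: applying a user-specified reduce function to a set of values.
    public class ReduceFunctionExample {
      static int reduce(int[] values, BinaryOperator<Integer> reduceFunction) {
        int result = values[0];
        for (int i = 1; i < values.length; i++) {
          result = reduceFunction.apply(result, values[i]);
        }
        return result;
      }

      public static void main(String[] args) {
        int[] values = IntStream.rangeClosed(1, 8).toArray();  // {1,2,...,8}
        System.out.println(reduce(values, Integer::sum));      // Sum -> 36
        System.out.println(reduce(values, (a, b) -> a * b));   // Multiplication -> 40320
      }
    }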

Since this reduce takes place in a distributed environment, the same function needs to be applied to data collected from several distributed workers, which means a considerable amount of communication is involved in transferring data between distributed nodes. To make this communication as efficient as possible, the reduce operation performs its reduction over an inverted binary tree structure. Let's look at a more detailed example to get a better idea of how this operation works. Please note that some details are left out to simplify the explanation.

Example:

In this example we have a distributed deployment with 4 worker nodes named w-0 to w-3. The reduce example that we are running has 8 source tasks and a single sink task (the task at which the reduction ends up). The source tasks are given logical ids 0 to 7, and the sink task is given the logical id 8.

How each task is assigned to workers is not explained in this section; we assume that the following task-worker assignments are in place. The tree structure used by the reduce operation takes the task-worker assignments into account to optimize the operation. The diagram below shows the assignments and the paths of communication.

Reduce Operation Tree

Black arrows in the diagram show the paths along which communication happens in the reduce operation. The inverted binary tree structure is clearer if you follow the red arrow. This structure allows the reduce operation to scale to a large number of tasks very easily.

If we assume that each task generates a data array of {1,2,3}, the final result of the reduce, which will be available at the sink task, is {8,16,24}. From the diagram it is clear that the sink task only receives values from tasks 0, 1, 2 and 4. To further optimize the operation, each task performs a partial reduce before sending data on to the next destination, as shown in the sketch after the list below. The data that each task sends to the sink task in this example is as follows:

  • 0 -> 8 : {1,2,3}
  • 1 -> 8 : {2,4,6}
  • 2 -> 8 : {4,8,12}
  • 4 -> 8 : {1,2,3}
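
The partial reduction at each step is simply the element-wise application of the reduce function to the arrays received so far. A minimal sketch of the final step at the sink task, assuming 'Sum' as the reduce function (plain Java; not the Twister2 API):

    import java.util.Arrays;

    // Illustrative sketch of element-wise partial reduction with 'Sum' as the function.
    public class PartialReduceExample {
      // Combine two arrays element-wise into a single partial result.
      static int[] reduce(int[] left, int[] right) {
        int[] result = new int[left.length];
        for (int i = 0; i < left.length; i++) {
          result[i] = left[i] + right[i];
        }
        return result;
      }

      public static void main(String[] args) {
        // Partial results arriving at the sink task (logical id 8) in the example above.
        int[] fromTask0 = {1, 2, 3};
        int[] fromTask1 = {2, 4, 6};
        int[] fromTask2 = {4, 8, 12};
        int[] fromTask4 = {1, 2, 3};

        // The sink applies the same reduce function to the partial results it receives.
        int[] finalResult = reduce(reduce(fromTask0, fromTask1), reduce(fromTask2, fromTask4));
        System.out.println(Arrays.toString(finalResult)); // [8, 16, 24]
      }
    }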

Gather

The gather operation is similar in construct to the reduce operation. However, unlike the reduce operation, which uses a reduction function to combine the collected values, the gather operation simply bundles them together. The structure in which the gather communication happens is similar to reduce: an inverted binary tree.

Example:

Let's take the same example discussed for the reduce operation. In the reduce example, the final result at the sink task with logical id 8 was {8,16,24}. In the gather, since we collect all the data that is sent from each source task, the final result received at the sink task is a set of arrays, one array for each source task.

Final result at 8 -> {{1,2,3},{1,2,3},{1,2,3},{1,2,3},{1,2,3},{1,2,3},{1,2,3},{1,2,3}}

If you look at what each of the tasks that actually send messages to sink task 8 transmits, the values are as follows; notice the similarity to the partial results in the reduce operation. A small sketch of this bundling follows the list.

  • 0 -> 8 : {1,2,3}
  • 1 -> 8 : {{1,2,3},{1,2,3}}
  • 2 -> 8 : {{1,2,3},{1,2,3},{1,2,3},{1,2,3}}
  • 4 -> 8 : {1,2,3}
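
A minimal sketch of this bundling, continuing the same example (plain Java; the names are illustrative and not the Twister2 API):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Illustrative sketch of gather: partial results are bundled (concatenated), not reduced.
    public class PartialGatherExample {
      // Bundle two partial gathers into one.
      static List<int[]> gather(List<int[]> left, List<int[]> right) {
        List<int[]> result = new ArrayList<>(left);
        result.addAll(right);
        return result;
      }

      public static void main(String[] args) {
        int[] data = {1, 2, 3}; // every source task produces the same array in this example

        // Partial gathers arriving at the sink task (logical id 8).
        List<int[]> fromTask0 = Arrays.asList(data);
        List<int[]> fromTask1 = Arrays.asList(data, data);
        List<int[]> fromTask2 = Arrays.asList(data, data, data, data);
        List<int[]> fromTask4 = Arrays.asList(data);

        List<int[]> finalResult = gather(gather(fromTask0, fromTask1), gather(fromTask2, fromTask4));
        System.out.println(finalResult.size() + " arrays gathered"); // 8 arrays gathered
      }
    }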

AllReduce
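
The allreduce operation combines a reduce and a broadcast: the values from the source tasks are reduced using the specified function, and the reduced result is then broadcast to all the target tasks, so every target ends up with the same value. Continuing the reduce example above (each source task contributing {1,2,3} with 'Sum' as the function), a minimal sketch in plain Java; the structure is illustrative only and not the Twister2 API:

    import java.util.Arrays;

    // Illustrative sketch of allreduce: a reduce followed by a broadcast of the result.
    public class AllReduceExample {
      public static void main(String[] args) {
        int numSourceTasks = 8;
        int numTargetTasks = 4;            // hypothetical number of target tasks
        int[] contribution = {1, 2, 3};    // each source task contributes the same array

        // Reduce step: element-wise sum over all source-task contributions.
        int[] reduced = new int[contribution.length];
        for (int task = 0; task < numSourceTasks; task++) {
          for (int i = 0; i < contribution.length; i++) {
            reduced[i] += contribution[i];
          }
        }

        // Broadcast step: every target task receives the same reduced result.
        for (int target = 0; target < numTargetTasks; target++) {
          System.out.println("target " + target + " -> " + Arrays.toString(reduced)); // [8, 16, 24]
        }
      }
    }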

The BSP APIs are provided by Harp and the MPI specification (OpenMPI).

The DataFlow operators are implemented by Twister2 as the Twister:Net library.
