Twister2

Twister2

  • Getting Started
  • Docs
  • Tutorial
  • AI
  • Examples
  • Contribute
  • Download
  • Configurations
  • Java Docs
  • GitHub
  • Blog

›APIs

Compiling

  • Overview
  • Linux
  • MacOS
  • Maven Artifacts

App Development

  • API Overview
  • Developing Applications
  • Streaming Jobs
  • Batch Jobs

APIs

  • Worker API
  • Data API
  • Compute API
  • Operator API
  • Windowing API
  • Storm API
  • Apache Beam
  • Python API

Deployment

  • Job Submit
  • Standalone
  • Docker
  • Kubernetes
  • Minikube
  • Mesos
  • Nomad
  • Slurm
  • Dashboard
  • Logging
  • Configurations

Concepts

  • Overview
  • Architecture
  • Operators
  • Task System
  • Data Access

Resources

  • Publications

Data API

TSets provides a convenient API for functional style distributed application programming in Twister2. TSets is a simplified abstraction toTask API. Its functionality is similar to Spark API, Flink API or Heron Streamlet APIs.

The user program is written as a set of data transformation steps/ data flow. This Dataflow would typically be a DAG (Directed Acyclic Graph). In the backend, user program is translated to Task API Task Graph.

Example TSet Program

Here is an example TSet program. We would start off with implementing Twister2Worker interface and initializing BatchEnvironment. And then we can create a source, transformations and finally a sink.

public class ExampleTSet implements Twister2Worker, Serializable {
  @Override
  public void execute(WorkerEnvironment workerEnv) {
    BatchEnvironment env = TSetEnvironment.initBatch(workerEnv);
    SourceTSet<Integer> source = env.createSource(new TestBaseSource(), 4).setName("Source");
    ReduceTLink<Integer> reduce = source.reduce(Integer::sum);

    reduce.forEach(i -> LOG.info("result: " + i));
  }
}

TSets are executed lazily. Once an action such as TSet.forEach(...) is called, the underlying Dataflow graph will be created and executed based on the TSet execution chain.

Following are the important aspects of TSets.

  1. Operation Modes
  2. TSetIWorker
  3. TSetEnvironment
  4. TSets and TLinks
  5. Twister2 Communication IMessage content types
  6. TSetGraph
  7. TSetOps
  8. TSetFunctions

Operation Modes

As in the, Task API there are two operation modes.

  1. Batch mode
  2. Streaming mode

Users can choose the operation mode by initializing the proper environment: BatchEnvironment or StreamingEnvironment. At the moment, batch and streaming modes can not be used together in a single Twister2Worker. These environment objects are singleton. They are initialized with a static init method.

    BatchEnvironment env = TSetEnvironment.initBatch(workerEnv);
    StreamingEnvironment env = TSetEnvironment.initStreaming(workerEnv);

TSetEnvironment

TSetEnvironment provides the entry point to the TSet API.

It can be used to,

  • access the WorkerEnvironment object for configurations and worker information
  • create Sources (Batch/ Streaming)
  • access the TSetGraph
  • execute/ run TSetGraph explicitly using,
    public void run()
  • add inputs to the execution

BatchTSetEnvironment provides following additional methods for execution

  • run a particular subgraph from the TSetGraph by
  public void run(BaseTSet leafTset)
  • run a particular subgraph from the TSetGraph and output the results as a DataObject
public <T> DataObject<T> runAndGet(BaseTSet leafTset)

TSets

This is the data abstraction which executes an operation on certain chunk of data.

There are two main distinctions,

  • TSets - Used for homogeneously typed data
  • TupleTSets - Used for keyed data

Each TSet is divided into Batch and Streaming to closely reflect Twister2 Communication semantics.

TLink

This is the communication abstraction which links two multiple TSets together. We can perform any communication operation supported by the Twister2:Net communication fabric using a TLink.

There are two main distinctions based on the Twister2 Communication IMessage content type,

  • SingleTLink - For communications that produces a single output
  • IteratorTLink - For communications that produces an iterator
  • GatherTLink - Specialized TLink for Gather operations (gather, allgather)

Each TLink is also divided into Batch and Streaming to closely reflect Twister2 Communication semantics.

##Twister2 Communication IMessage content types Understanding Twister2 Communication IMessage content types is important to determine the internals of TLinks.

Batch Comms

commmessage contentparallelism relationshipTLinkComment
ReduceTm to 1SingleTLink
AllreduceTm to 1SingleTLink
DirectIteratorm to mIteratorTLink
BroadcastIterator1 to mIteratorTLinkpar(source) = 1, one to many
GathergatherWithIndex/ gatherIterator<Tuple<Integer, T>>m to 1GatherTLinkpar(dest) = 1, many to one, int --> taskIndex of parents
gather / gatherWithoutIndexIterator
AllgatherallGatherWithIndexIterator<Tuple<Integer, T>>m to 1GatherTLinkint --> taskIndex of parents
allgatherIterator
KeyedGatherIterator<Tuple<K, Iterator>>m to nIteratorTLink<Tuple<K, Iterator>>
KeyedReduceIterator<Tuple<K, T>>m to nIteratorTLink<Tuple<K, T>>
PartitionIteratorm to nIteratorTLinkmany to many communication
KeyedPartitionIterator<Tuple<K, T>>m to nIteratorTLink<Tuple<K, T>>
JoinIterator<JoinedTuple<K, U, V>>m to nIteratorTLink<JoinedTuple<K, U, V>>

Streaming Comms

commmessage contentparallelism relationshipTLink
ReduceTm to 1SingleTLink
AllreduceTm to 1SingleTLink
DirectTm to mSingleTLink
BroadcastT1 to mSingleTLink
GathergatherWithIndexIterator<Tuple<Integer, T>>m to 1GatherTLink
gatherIterator
AllgatherallGatherWithIndexIterator<Tuple<Integer, T>>m to 1GatherTLink
allgatherIterator
PartitionTm to nSingleTLink
KeyedPartitionTuple<K, T>m to nSingleTLink<Tuple<K, T>>

TSetGraph

Users can create a chain of execution using TSets and TLinks. A TSet would expose a set of methods which exposes the downstream TLinks and similarly, a TLink would expose a set of methods which exposes the TSets which it can connect into.

Example:

src --> direct --> forEach 
            |
             ----> map --> direct --> forEach 

The above data flow graph can be represented by the following TSet Graph

    SourceTSet<Integer> src = dummySource(...).setName("src");

    DirectTLink<Integer> direct = src.direct().setName("direct");

    direct.forEach(i -> LOG.info("foreach: " + i));

    direct.map(i -> i.toString() + "$$").setName("map")
        .direct()
        .forEach(s -> LOG.info("map: " + s));

TSet API Overview

TSet API Overview

Batch Operations

Twister2 supports these batch operations.

OperationDescription
DirectA one to one mapping from a TSet to another
ReduceReduces a TSet into a single value
AllReduceReduces a TSet into a single value and replicate this value
GatherGather a distributed set of values
AllGtherGather a distributed set of values and replicate it
PartitionRe-distributes the values
BroadcastReplicate a single value to multiple
Keyed-ReduceReduce based on a key
Keyed-GatherGather based on a key
Keyed-PartitionPartition based on a key
JoinInner join with a key
UnionUnion of two TSets

Stream Operations

OperationDescription
DirectA one to one mapping from a TSet to another
ReduceReduces a TSet into a single value
AllReduceReduces a TSet into a single value and replicate this value
GatherGather a distributed set of values
AllGtherGather a distributed set of values and replicate it
PartitionRe-distributes the values
BroadcastReplicate a single value to multiple
Keyed-PartitionPartition based on a key

Cacheable TSets

Users can cache data of TSets using the TSet.cache() method. This would execute the chain upto that TSet and load the results to memory.

← Worker APICompute API →
  • Example TSet Program
  • Operation Modes
  • TSetEnvironment
  • TSets
  • TLink
    • Batch Comms
    • Streaming Comms
  • TSetGraph
  • TSet API Overview
  • Batch Operations
  • Stream Operations
  • Cacheable TSets
Twister2
Docs
Getting Started (Quickstart)Guides (Programming Guides)
Community
Stack OverflowProject Chat
More
BlogGitHubStar
Copyright © 2020 Indiana University