Twister2 Release 0.1.0
Twister2 0.1.0 is the first public open source release of Twister2. We are excited to bring a high-performance data analytics hosting environment that can work in both cloud and HPC environments. This is the first step towards building a complete end-to-end high-performance solution for data analytics, ranging from streaming to batch analysis to machine learning applications. Our vision is to make the system work seamlessly in both cloud and HPC environments, from single machines to large clusters.
You can download the source code from GitHub.
Major Features
This release includes the core components for realizing the above goals.
- Resource provisioning component to bring up and manage parallel workers in cluster environments
  - Standalone
  - Kubernetes
  - Mesos
  - Slurm
  - Nomad
- Parallel and distributed communications in HPC and cloud environments
  - Twister2:Net - a data-level dataflow communication library for streaming and large-scale batch analysis
  - Harp - an innovative BSP (Bulk Synchronous Parallel) collective framework for parallel applications and machine learning, at message level
  - OpenMPI (HPC environments only), at message level
- Task Graph - Create dataflow graphs for streaming and batch analysis, including iterative computations
- Task Scheduler - Schedule the task graph onto cluster resources, supporting different scheduling algorithms
  - Data locality scheduling
  - Round-robin scheduling
  - First-fit scheduling
- Executor - Execution of the task graph
  - Batch executor
  - Streaming executor
- API for creating Task Graph and Communication
  - Communication API
  - Task-based API
- Support for storage systems
  - HDFS
  - Local file systems
  - NFS for persistent storage
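To make the difference between two of the scheduling strategies listed above concrete, here is a minimal Python sketch. It is not Twister2 code and does not use the Twister2 API: the function names `round_robin` and `first_fit`, the task names, and the model of a worker as a single memory capacity are all assumptions made for illustration.

```python
from typing import Dict, List


def round_robin(tasks: List[str], num_workers: int) -> Dict[int, List[str]]:
    """Assign tasks to workers in cyclic order, ignoring resource needs."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    plan: Dict[int, List[str]] = {w: [] for w in range(num_workers)}
    for i, task in enumerate(tasks):
        plan[i % num_workers].append(task)
    return plan


def first_fit(task_mem: Dict[str, int], worker_capacity: int) -> List[List[str]]:
    """Place each task on the first worker with enough free memory.

    A new worker is opened when no existing worker can hold the task.
    (Hypothetical model: one memory number per task and per worker.)
    """
    workers: List[List[str]] = []  # tasks placed on each worker
    free: List[int] = []           # remaining memory on each worker
    for task, mem in task_mem.items():
        if mem > worker_capacity:
            raise ValueError(f"task {task!r} does not fit on any worker")
        for idx, remaining in enumerate(free):
            if mem <= remaining:
                workers[idx].append(task)
                free[idx] -= mem
                break
        else:
            # No existing worker had room, so allocate a new one.
            workers.append([task])
            free.append(worker_capacity - mem)
    return workers


if __name__ == "__main__":
    print(round_robin(["source", "map", "reduce", "sink"], 2))
    print(first_fit({"source": 2, "map": 3, "reduce": 2, "sink": 1}, 4))
```

Round-robin spreads tasks evenly by count, while first-fit packs tasks by resource demand and may use fewer workers. A data locality scheduler would additionally need to know where the input data lives, which this sketch does not model.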
These features translate to running the following types of applications natively with high performance.
- Streaming computations
- Data operations in batch mode
- Iterative computations
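As a rough illustration of what an iterative computation looks like when data is partitioned across workers, the following Python sketch runs one-dimensional K-Means with a local step per partition followed by a combine step. It is a conceptual sketch only and does not use any Twister2 API: `partial_sums`, `all_reduce`, and `kmeans_1d` are hypothetical names, and `all_reduce` here is an in-process stand-in for a collective operation.

```python
from typing import List, Tuple


def partial_sums(points: List[float], centers: List[float]) -> List[Tuple[float, int]]:
    """Per-partition step: sum and count of the points nearest to each center."""
    sums: List[Tuple[float, int]] = [(0.0, 0) for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
        total, count = sums[nearest]
        sums[nearest] = (total + p, count + 1)
    return sums


def all_reduce(parts: List[List[Tuple[float, int]]]) -> List[Tuple[float, int]]:
    """Combine the partial results of all partitions (stand-in for a collective)."""
    return [
        (sum(p[c][0] for p in parts), sum(p[c][1] for p in parts))
        for c in range(len(parts[0]))
    ]


def kmeans_1d(partitions: List[List[float]], centers: List[float], iterations: int) -> List[float]:
    """Repeat the local step and the combine step for a fixed number of iterations."""
    for _ in range(iterations):
        combined = all_reduce([partial_sums(part, centers) for part in partitions])
        # Keep the old center if no point was assigned to it.
        centers = [
            total / count if count else old
            for (total, count), old in zip(combined, centers)
        ]
    return centers


if __name__ == "__main__":
    data = [[1.0, 2.0, 9.0], [3.0, 10.0, 11.0]]
    print(kmeans_1d(data, [0.0, 5.0], 5))
```

Each iteration depends on the centers produced by the previous one, which is what distinguishes this pattern from a single-pass batch operation.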
Examples
With this release we include several examples to demonstrate various features of Twister2.
- A Hello World example
- Communication examples - how to use communications for streaming and batch
- Task examples - how to create task graphs with different operators for streaming and batch
- K-Means
- Sorting of records
- Word count
- Iterative examples
- Harp example
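For readers unfamiliar with the word count pattern named in the list above, here is a minimal Python sketch of the idea in dataflow terms: words are routed to workers by key, and each worker counts only the words it receives. This is not the bundled Twister2 example and uses no Twister2 API; `partition_by_key` and `word_count` are hypothetical names, and the workers are simulated in a single process.

```python
import zlib
from collections import Counter
from typing import Dict, List


def partition_by_key(words: List[str], num_workers: int) -> List[List[str]]:
    """Route each word to a worker chosen by a deterministic hash of the word."""
    buckets: List[List[str]] = [[] for _ in range(num_workers)]
    for word in words:
        buckets[zlib.crc32(word.encode("utf-8")) % num_workers].append(word)
    return buckets


def word_count(lines: List[str], num_workers: int) -> Dict[str, int]:
    """Split lines into words, partition by key, and count on each worker."""
    words = [w.lower() for line in lines for w in line.split()]
    per_worker = [Counter(bucket) for bucket in partition_by_key(words, num_workers)]
    # A given word always lands on the same worker, so the per-worker
    # counts never overlap and can simply be merged.
    result: Dict[str, int] = {}
    for counts in per_worker:
        result.update(counts)
    return result


if __name__ == "__main__":
    print(word_count(["a b a", "b c"], 3))
```

Because the partitioning is keyed, no cross-worker aggregation is needed after the local counts are computed.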
Road map
We have started working on our next major release that will connect the core components we have developed into a full data analytics environment. In particular it will focus on providing APIs around the core capabilities of Twister2 and integration of applications in a single dataflow.
Next release (End of December 2018)
- Hierarchical task scheduling - Ability to run different types of jobs in a single dataflow
- Fault tolerance
- Data API including DataSet similar to Spark RDD, Flink DataSet and Heron Streamlet
- Support for different APIs, including Storm, Spark, and Beam
- Heterogeneous resource allocation
- Web UI for monitoring Twister2 Jobs
- More resource managers - Pilot Jobs, YARN
- More example applications
Beyond next release
- Implementing core parts of Twister2 with C/C++ for high performance
- Python APIs
- Direct use of RDMA
- FaaS APIs
- SQL interface
- Native MPI support for cloud deployments
License
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0