TeraSort
TeraSort is a common benchmark to measure and compare high performance big data frameworks such as Twister2. The idea is to measure the time to sort one terabyte of randomly distributed data.
This terasort is implemented according to the requirements listed in the https://sortbenchmark.org/ website. It uses partition and sort method. First we globally partition the data into available tasks. Then the data in each task is sorted. This gives a global sorting among the tasks.
Because terasort can run with data larger than the memory, we can use file system to store and sort the data. All this handle internally to twister2 and you can configure how much data to keep in memory etc.
This implementation can read from a file generated by the input generator in gensort Data Generator or it can use an in-memory generator.
When it runs with in-memory mode, we generate a batch of tuples, each having a random byte array as the key and a byte array of configurable length as the value. These tuples will be sent from multiple sources to a single sinks with keyed gather operation applied as the connection. Keyed Gather operation is configured with a comparator to sort by key.
Generating data files
gensort Data Generator can be used to generate data files.
In memory example command
./bin/twister2 submit standalone jar examples/libexamples-java.jar edu.iu.dsc.tws.examples.batch.terasort.TeraSort -size .5 -valueSize 90 -keySize 10 -instances 8 -instanceCPUs 1 -instanceMemory 1024 -sources 8 -sinks 8 -memoryBytesLimit 4000000000 -fileSizeBytes 100000000
File mode example command
./bin/twister2 submit standalone jar examples/libexamples-java.jar edu.iu.dsc.tws.examples.batch.terasort.TeraSort -size .5 -valueSize 90 -keySize 10 -instances 8 -instanceCPUs 1 -instanceMemory 1024 -sources 8 -sinks 8 -memoryBytesLimit 4000000000 -fileSizeBytes 100000000 -inputFile /path/to/file-%d
TeraSort parameters
Data Configuration - File Based mode
Parameter | Description | Default Value |
---|---|---|
inputFile | Path to the input file. This path can contain, %d which will be replaced with the task index at runtime. For example, if file is specified as input-%d, when executing task with index 0, it will search for file input-0. | Mandatory |
valueSize | Size of the value component of the tuple in bytes | Mandatory |
keySize | Size of the value component of the tuple in bytes | Mandatory |
Data Configuration - Non File Based mode
Terasort will switch to non file based mode, if filePath is not specified.
Parameter | Description | Default Value |
---|---|---|
size | Total size of data generated by sources. | Mandatory |
valueSize | Size of the value component of the tuple in bytes | Mandatory |
keySize | Size of the value component of the tuple in bytes | Mandatory |
Resource Configuration
Parameter | Description | Default Value |
---|---|---|
instances | No of twister2 worker instances | Mandatory |
instanceCpus | No of CPUs to allocate per each instance(Might not be applicable for all twister2 modes) | Mandatory |
instanceMemory | Amount of memory to allocate in MB for each instance | Mandatory |
sources | No. of sources to generate data. This will multiply the data size of your current configuration. For example, if you have specified, 512GB as the data size and 4 as the No. of sources, twister2 will produce and sort 1TB of data in total | Mandatory |
sinks | No. of sinks to receive the globally sorted data | Mandatory |
Tuneup Parameters
Parameter | Description | Default Value |
---|---|---|
memoryBytesLimit | Maximum amount of random access memory(in bytes) to utilize to hold incoming tuples. Tuples will be written to the disk, once this limit is exceeded. | 6400 |
fileSizeBytes | Size of a single file to use when going to disk | 64 |