Proposals:GridComputing

From KitwarePublic
Jump to navigationJump to search

Development and extension of ITK to be applied to grid computing

For more information contact: Simon Warfield, Steve Pieper, Ron Kikinis

There is an increasing trend to deploy grid computing infrastructures to support computations on extremely large datasets like those found in the Visible Human Project. We therefore believe that it is important for critical aspects of the architecture of ITK revisited and refined to support the emerging standards in the grid computing community and to develop example applications to demonstrate the power of the ITK/grid combination in real-world research computing scenarios.

Task 1: Survey of existing grid infrastructure software including the Globus Toolkit and associated middleware. Needs analysis of the bottlenecks in the current ITK as applied to collaborator projects within the SPL. Creation of initial ITK on grid computing examples using task-level parallelization of large population data sets (n > 1000).

Other examples of task level parallelism include parametric search in which an algorithm is applied to a single or small number of data sets many times (n > 1000) with different parameter settings.

Task 2: Focus on optimizing the current ITK inner-loops and threading model for CPU architectures and compilers deployed as compute grid nodes. Work with computer science collaborators to identify bottlenecks that can benefit from machine-level code optimizations to accommodate issues such as cache size, special instruction sets, and memory bandwidth hierarchies in the context of a heterogeneous compute grid environment.

Cache size varies from CPU model to CPU model and can have a tremendous impact upon performance. Also, utilization of vector instruction sets such as SSE, and scheduling and branch prediction must be carefully optimized for the specific CPU model, instruction cache size, data cache size and scheduling pipeline for best performance. Compilers, such as GCC or the Intel compiler, can generate optimized code that outperforms hand-tuned code. Our intention is to investigate the potential for performance gains on certain platforms through the generation of cache-optimized, instruction-set optimized, scheduler optimized binaries. We will also test the accuracy and precision of highly optimized code to ensure correct results are obtained.

We also want to optimize and accelerate specific algorithms, such as STAPLE. We think that at this stage, performance boosts can come from optimized thread parallel implementations of key algorithms and inner loops.

Task 3: Integrate message passing parallelism using the globus compatible MPI implementations (mpich-g2 or its successors) first to distribute ITK processing pipelines to among grid processing nodes and second to break individual algorithms across nodes.

The grid protocols have been undergoing tremendous change recently. There is no clear preferred protocol for distributing ITK processing pipelines among grid processing nodes at this time. Scheduling and access across Teragrid nodes has also been evolving. Some Teragrid sites are using tools such as PBS for managing jobs. We will focus our efforts on utilizing the protocols which are most stable and provide for the most significant benefits for the effort invested.

Task 4: Ongoing support of ITK grid tools at the task, threading, and message passing layers. Review of any newly developed grid middleware emerging during the project period and rework of ITK architecture to accommodate same.

In our initial effort we used Image Guided Neurosurgery as the clinical application driver, and optimized algorithms and performance especially suited for this application. This application requirements range from the need for highly optimized special instruction set inner loops and thread parallel execution for specific algorithms, such as intrasubject registration, to task level parallelism inherent in the statistical atlas construction that underlies robust segmentation methods. This also leverages our NSF funded ITR 0426558 A Novel Grid Architecture Integrating Real-Time Data and Intervention During Image Guided Therapy which provides active collaboration with Teragrid sites and access to expertise in the evolving grid protocols. We intend to maintain these application specific optimizations as ITK evolves, while adding new application drivers that ensure the changes we make are relevant and will have significant clinical impact.

Compute platforms

We would like to get feedback about the compute platforms used in the ITK community. In particular, we would like to know what effort on which hardware platforms would have the largest impact on the community.

CPU model family, user count:

 Intel  :  1
 AMD    :  0
 SPARC  :  0



Specific Tasks

  1. Thread-level parallelization of three key algorithms
    1. STAPLE

STAPLE is an expectation-maximization algorithm for the simultaneous truth and performance level estimation of a collection of segmentations via the computation of a probabilistic estimate of the true segmentation and a measure of the performance level represented by each segmentation. The source of each segmentation in the collection may be an appropriately trained human rater or raters, or may be an automated segmentation algorithm. The probabilistic estimate of the true segmentation is formed by estimating an optimal combination of the segmentations, weighting each segmentation depending upon the estimated performance level, and incorporating a prior model for the spatial distribution of structures being segmented as well as spatial homogeneity constraints.

The following tasks will be performed:

      1. Needs analysis of the current state of STAPLE in ITK. The current implementation is from the algorithm published by Warfield, Zou and Wells in MICCAI 2002 which is limited to binary labels. We will extend the implementation to allow for multi-category labels as described in (Warfield, Zou, Wells IEEE TMI 2004).
      2. Profiling of the current STAPLE algorithm to find areas which are relatively fast, those that are slow and can be optimized, and any true performance bottlenecks.
      3. Evaluation of thread scaling performance as a function of number of CPUs (1 to 16 CPUs on SMP machines).
      4. Evaluation of impact of different hardware platform on performance.
      5. Thread-level parallelization within STAPLE.
      6. CPU specific cache size, pipeline scheduling and instruction set optimizations for different hardware platforms.
    1. Resampling
    2. Mattes Mutal Information Rigid Registration
  1. Fourier Transform in ITK
  1. Grid protocols and Batch-level processing with ITK