ITK/PerformanceOptimization
Recent performance improvement efforts:
* [http://review.source.kitware.com/#/c/22387/ itk::ThreadPool refactoring]
* [http://review.source.kitware.com/#/c/22377/ Change itkMultiThreader to use OpenMP]
* [http://www.insight-journal.org/browse/publication/974 ITK TBB Insight Journal article]




== ITK Performance Optimization Meeting Notes ==

=== 2017-04-07 ===
Links:
* [http://www.insight-journal.org/browse/publication/974 TBB Insight Journal article]
* [https://github.com/InsightSoftwareConsortium/ITKModuleTemplate ITK Module Template]
* [https://itk.org/ITKSoftwareGuide/html/Book1/ITKSoftwareGuide-Book1ch9.html#x48-1440009 ITK Module Documentation]
* [https://github.com/hjmjohnson/tbb CMake'ified TBB]
* [https://blog.kitware.com/simple-parallel-computing-with-vtksmptools-2/ VTK SMPTools]
* [https://itk.org/Doxygen/html/classitk_1_1ImageScanlineIterator.html itkScanlineIterator]
* [https://itk.org/Doxygen/html/classitk_1_1ForwardFFTImageFilter.html ITK FFT classes]


Possible next steps:
* TBB Insight Journal update
* TBB Insight Journal -> Remote Module
* Improved compiler optimization flags
* MKL backend for FFT
* CMake-ified TBB upstreaming / available as an ITK module
* Improved awareness of ImageScanlineIterator

== Performance Analysis ==
=== Per Thread Cost of Job Dispatch ===
An important metric in analyzing the scalability of multi-threaded algorithms is the cost of spawning or using an additional thread for a task. Knowing this cost gives an estimate of the improvement that additional threads can achieve for a perfectly scalable algorithm. It also gives an estimate of the threading overhead: for example, if you are running X filters per second with N threads, you can compute the time spent on thread dispatch. This helps determine whether the larger expected improvement comes from improving the threading model or from improving the algorithms.
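One way to write that overhead estimate as a formula is sketched below (the function and parameter names are illustrative, not part of ITK or SimpleITK; the N - 1 factor follows the "additional threads" accounting used below):

 def estimated_dispatch_overhead(filters_per_second, num_threads, cost_per_thread):
     # Each filter run pays the dispatch cost once for every additional thread,
     # so this is the time per second spent on thread dispatch alone.
     return filters_per_second * (num_threads - 1) * cost_per_thread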
The overhead for spawning threads is computed by measuring the time it takes the `AddImageFilter` to run with 1 thread on 1 pixel, and the time it takes to run with N threads on N pixels, where each thread performs the trivial one-pixel operation. The difference in execution time is considered the overhead for spawning the threads, and dividing it by the number of additional threads gives the per-thread cost of "spawning" or dispatching. To estimate this cost, the following SimpleITK code was initially used:
 import timeit
 import SimpleITK as sitk
 
 def compute_thread_cost(n):
     # Time the trivial 1-pixel Add with a single thread.
     sitk.ProcessObject_SetGlobalDefaultNumberOfThreads(1)
     img = sitk.Image(1, 1, sitk.sitkFloat32)
     t1 = min(timeit.repeat(lambda img=img: sitk.Add(img, img), number=1, repeat=25))
     # Time an n-pixel Add with n threads, one trivial pixel per thread.
     sitk.ProcessObject_SetGlobalDefaultNumberOfThreads(n)
     img = sitk.Image(1, n, sitk.sitkFloat32)
     t2 = min(timeit.repeat(lambda img=img: sitk.Add(img, img), number=1, repeat=25))
     print("1 Thread: {0}\n{1} Threads: {2}".format(t1, n, t2))
     # The extra time, spread over the n-1 additional threads, is the dispatch cost.
     cost = (t2 - t1) / (n - 1.0)
     print("Cost per thread: {0}".format(cost))
SimpleITK was recompiled against the default ITK, the [http://review.source.kitware.com/#/c/22387/5 rewritten-thread-pool] topic, and the [http://review.source.kitware.com/#/c/22377/1 TheReturnOfOpenMP] topic.
The following results were obtained on this system:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                88
On-line CPU(s) list:  0-87
Thread(s) per core:    2
Core(s) per socket:    22
Socket(s):            2
NUMA node(s):          2
Vendor ID:            GenuineIntel
CPU family:            6
Model:                79
Model name:            Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Stepping:              1
CPU MHz:              2800.617
BogoMIPS:              4410.78
Virtualization:        VT-x
L1d cache:            32K
L1i cache:            32K
L2 cache:              256K
L3 cache:              56320K
NUMA node0 CPU(s):    0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86
NUMA node1 CPU(s):    1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87
Here are the results.
Default ITK:
>compute_thread_cost(88)
1 Thread: 7.9870223999e-05
88 Threads: 0.00240182876587
Cost per thread: 2.66891786422e-05 (seconds)
rewriting-thread-pool:
>compute_thread_cost(88)
1 Thread: 7.60555267334e-05
88 Threads: 0.000521183013916
Cost per thread: 5.11640789865e-06 (seconds)
TheReturnOfOpenMP:
>compute_thread_cost(88)
1 Thread: 7.79628753662e-05
88 Threads: 0.000182867050171
Cost per thread: 1.2057951127e-06 (seconds)
To summarize these results: the rewriting-thread-pool topic has a 5.2X speedup in thread dispatch, and the TheReturnOfOpenMP topic has a 22X speedup.
Re-running the test code shows some variability (~20%, with occasional outliers). The results are quite similar when spawning a smaller number of threads, so this type of testing can also be done on a smaller system.
So if you are running 100 filters per second with 100 threads each, the threading overhead will take roughly 1% to 26% of the time, depending on the threading model. That is a lot of filters and threads to encounter this potential bottleneck; however, we need to know the current usage and the expected number of filters and threads run per second.
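As a back-of-the-envelope check of that 1% to 26% range, the per-thread costs measured above can be plugged into the same estimate (a sketch; the variable names are illustrative):

 # Per-thread dispatch costs measured above, in seconds.
 costs = {"Default ITK": 2.66891786422e-05,
          "rewriting-thread-pool": 5.11640789865e-06,
          "TheReturnOfOpenMP": 1.2057951127e-06}
 for name, cost in costs.items():
     # 100 filters per second, each dispatching 99 additional threads.
     overhead = 100 * (100 - 1) * cost
     print("{0}: {1:.1%} of each second spent on dispatch".format(name, overhead))

This gives roughly 26% for the default ITK, 5% for rewriting-thread-pool, and 1.2% for TheReturnOfOpenMP.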
