<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hello,<div><br></div><div>Well, I did get to it before you:</div><div><br></div><div><a href="http://review.source.kitware.com/#/c/6614/">http://review.source.kitware.com/#/c/6614/</a></div><div><br></div><div>I also increased the size of the image in your test by 100x; here is the current performance on my system:</div><div><br></div><div><div>System: <a href="http://victoria.nlm.nih.gov">victoria.nlm.nih.gov</a></div><div>Processor: Intel(R) Xeon(R) CPU X5670 @ 2.93GHz</div><div> Serial #: </div><div> Cache: 32768</div><div> Clock: 2794.27</div><div> Cores: 12 cpus x 24 Cores = 288</div><div>OSName: Mac OS X</div><div> Release: 10.6.8</div><div> Version: 10K549</div><div> Platform: x86_64</div><div> Operating System is 64 bit</div><div>ITK Version: 3.20.1</div><div>Virtual Memory: Total: 256 Available: 228</div><div>Physical Memory: Total:65536 Available: 58374</div>
<table border="1" cellspacing="0" cellpadding="2">
<tr><th>Probe Name</th><th>Count</th><th>Min</th><th>Mean</th><th>Stdev</th><th>Max</th><th>Total</th></tr>
<tr><td>MeanSquares_1_threads</td><td>20</td><td>0.344348</td><td>0.347567</td><td>0.00244733</td><td>0.352629</td><td>6.95134</td></tr>
<tr><td>MeanSquares_2_threads</td><td>20</td><td>0.251223</td><td>0.300869</td><td>0.0179305</td><td>0.321404</td><td>6.01738</td></tr>
<tr><td>MeanSquares_4_threads</td><td>20</td><td>0.215516</td><td>0.348677</td><td>0.173645</td><td>0.678274</td><td>6.97355</td></tr>
<tr><td>MeanSquares_8_threads</td><td>20</td><td>0.138184</td><td>0.182681</td><td>0.0297812</td><td>0.237129</td><td>3.65362</td></tr>
</table>
<div><br></div><div>System: <a href="http://victoria.nlm.nih.gov">victoria.nlm.nih.gov</a></div><div>Processor: </div><div> Serial #: </div><div> Cache: 32768</div><div> Clock: 2930</div><div> Cores: 12 cpus x 24 Cores = 288</div><div>OSName: Mac OS X</div><div> Release: 10.6.8</div><div> Version: 10K549</div><div> Platform: x86_64</div><div> Operating System is 64 bit</div><div>ITK Version: 4.2.0</div><div>Virtual Memory: Total: 256 Available: 228</div><div>Physical Memory: Total:65536 Available: 58371</div>
<table border="1" cellspacing="0" cellpadding="2">
<tr><th>Probe Name</th><th>Count</th><th>Min</th><th>Mean</th><th>Stdev</th><th>Max</th><th>Total</th></tr>
<tr><td>MeanSquares_1_threads</td><td>20</td><td>0.382481</td><td>0.383342</td><td>0.00186954</td><td>0.391027</td><td>7.66685</td></tr>
<tr><td>MeanSquares_2_threads</td><td>20</td><td>0.211908</td><td>0.335328</td><td>0.0777408</td><td>0.435574</td><td>6.70655</td></tr>
<tr><td>MeanSquares_4_threads</td><td>20</td><td>0.271531</td><td>0.315688</td><td>0.0390751</td><td>0.385683</td><td>6.31377</td></tr>
<tr><td>MeanSquares_8_threads</td><td>20</td><td>0.147544</td><td>0.192132</td><td>0.0299427</td><td>0.240976</td><td>3.84263</td></tr>
</table></div><div><br></div><div><br></div><div>In the patch provided, the Jacobian is allocated implicitly on assignment, on a per-thread basis. What was most unexpected was that when the Jacobian was instead allocated explicitly outside the threaded part, the time went up by 50%! I presume that allocating the Jacobians sequentially in the master thread placed their doubles next to each other in memory, which may be a more insidious form of false sharing. Below are the numbers from this run; notice the lack of speedup with more threads:</div><div><br></div><div><div>System: <a href="http://victoria.nlm.nih.gov">victoria.nlm.nih.gov</a></div><div>Processor: </div><div> Serial #: </div><div> Cache: 32768</div><div> Clock: 2930</div><div> Cores: 12 cpus x 24 Cores = 288</div><div>OSName: Mac OS X</div><div> Release: 10.6.8</div><div> Version: 10K549</div><div> Platform: x86_64</div><div> Operating System is 64 bit</div><div>ITK Version: 4.2.0</div><div>Virtual Memory: Total: 256 Available: 226</div><div>Physical Memory: Total:65536 Available: 57091</div>
<table border="1" cellspacing="0" cellpadding="2">
<tr><th>Probe Name</th><th>Count</th><th>Min</th><th>Mean</th><th>Stdev</th><th>Max</th><th>Total</th></tr>
<tr><td>MeanSquares_1_threads</td><td>20</td><td>0.403931</td><td>0.40648</td><td>0.00213043</td><td>0.41389</td><td>8.1296</td></tr>
<tr><td>MeanSquares_2_threads</td><td>20</td><td>0.243789</td><td>0.367603</td><td>0.0894637</td><td>0.65006</td><td>7.35206</td></tr>
<tr><td>MeanSquares_4_threads</td><td>20</td><td>0.281336</td><td>0.354749</td><td>0.0431082</td><td>0.440161</td><td>7.09497</td></tr>
<tr><td>MeanSquares_8_threads</td><td>20</td><td>0.24615</td><td>0.301576</td><td>0.0552998</td><td>0.446528</td><td>6.03151</td></tr>
</table></div><div><br></div><div><br></div><div>Brad</div><div><br></div><div><br><div><div>On Jul 26, 2012, at 8:56 AM, Rupert Brooks wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite">Brad,<div><br></div><div>The false sharing issue is a good point; however, I don't think this is the cause of the
performance degradation. This part of the class (m_Threader, etc.) has not changed since 3.20. (I used the optimized metrics in my 3.20 builds, so it's in Review/itkOptMeanSquares....) It also does not explain the performance drop in single-threaded mode.</div>
<div><br></div><div>Testing will tell. It seems like a Friday afternoon project to me, unless someone else gets there first.</div><div><br></div><div>Rupert</div><div><br clear="all">--------------------------------------------------------------<br>
Rupert Brooks<br><a href="mailto:rupert.brooks@gmail.com">rupert.brooks@gmail.com</a><br><br>
<br><br><div class="gmail_quote">On Wed, Jul 25, 2012 at 5:18 PM, Bradley Lowekamp <span dir="ltr"><<a href="mailto:blowekamp@mail.nih.gov" target="_blank">blowekamp@mail.nih.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="word-wrap:break-word">Hello,<div><br></div><div>Continuing to glance at the class.... I also see the following member variables for the MeanSquares class:</div><div><br></div><div><div> MeasureType * m_ThreaderMSE;</div>
<div> DerivativeType *m_ThreaderMSEDerivatives;</div></div><div><br></div><div>These are indexed by thread ID and accessed simultaneously by all the threads, which creates the potential for false sharing; that can be a MAJOR problem for threaded algorithms.</div>
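<div>To make the concern concrete, here is a minimal C++17 sketch. It is not ITK code, and every name in it (PaddedSlot, RunWorkers, AccumulatePacked, AccumulatePadded) is hypothetical. It contrasts per-thread accumulators packed into one array indexed by thread ID, the layout of m_ThreaderMSE, with accumulators padded out to their own cache line. It assumes a 64-byte cache line, which is typical for x86-64 processors such as the Xeon above but is not guaranteed.</div>

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines (typical for x86-64, not guaranteed).
constexpr std::size_t kCacheLine = 64;

// Padded slot: alignas pads each accumulator out to a full cache line.
struct alignas(kCacheLine) PaddedSlot {
  double value = 0.0;
};

// Start `threads` workers; worker t adds 1.0 to the slot returned by
// slotFor(t), `iterations` times. Writes go through a volatile pointer so
// every update reaches memory instead of staying in a register.
template <typename SlotFor>
void RunWorkers(std::size_t threads, std::size_t iterations, SlotFor slotFor) {
  assert(threads > 0);
  std::vector<std::thread> workers;
  workers.reserve(threads);
  for (std::size_t t = 0; t < threads; ++t) {
    workers.emplace_back([=] {
      volatile double* slot = slotFor(t);
      for (std::size_t i = 0; i < iterations; ++i) {
        *slot = *slot + 1.0;
      }
    });
  }
  for (std::thread& worker : workers) {
    worker.join();
  }
}

// Packed layout, like a plain array indexed by thread ID: up to eight doubles
// share each 64-byte line, so neighbouring workers contend for the same line.
std::vector<double> AccumulatePacked(std::size_t threads, std::size_t iterations) {
  std::vector<double> slots(threads, 0.0);
  RunWorkers(threads, iterations, [&slots](std::size_t t) { return &slots[t]; });
  return slots;
}

// Padded layout: each worker's accumulator sits on its own cache line.
std::vector<double> AccumulatePadded(std::size_t threads, std::size_t iterations) {
  std::vector<PaddedSlot> slots(threads);
  RunWorkers(threads, iterations, [&slots](std::size_t t) { return &slots[t].value; });
  std::vector<double> sums;
  sums.reserve(threads);
  for (const PaddedSlot& slot : slots) {
    sums.push_back(slot.value);
  }
  return sums;
}
```

<div>Both functions compute the same sums; only the memory layout differs. Timing them with std::chrono on a multi-core machine will usually show the packed version scaling worse as threads are added, although the size of the effect depends on the hardware.</div>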
<div><br></div><div>I think a good solution would be to create a per-thread data structure holding the Jacobian, MeasureType, and DerivativeType, plus padding to prevent false sharing, or, equivalently, to align the structure to a cache-line boundary.</div>
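<div>One way that per-thread structure could look, as a sketch only: it assumes C++17 (so std::vector honours over-aligned element types) and a 64-byte cache line. The names PerThreadData, MakeWorkspace, PrepareThread, and OnDifferentCacheLines are hypothetical, and a std::vector of double stands in for ITK's Jacobian and DerivativeType.</div>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 64-byte cache lines (typical for x86-64, not guaranteed).
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical per-thread workspace. std::vector<double> stands in for the
// Jacobian (row-major, dimension x parameters) and for DerivativeType.
// alignas starts every array element on its own cache line, and the compiler
// pads sizeof up to a multiple of the alignment.
struct alignas(kCacheLineSize) PerThreadData {
  std::vector<double> jacobian;
  double measure = 0.0;            // this thread's partial mean-squares sum
  std::vector<double> derivative;  // this thread's partial derivative
};

// Always true (sizeof is a multiple of alignof); documents the intent.
static_assert(sizeof(PerThreadData) % kCacheLineSize == 0,
              "neighbouring threads' entries never share a cache line");

// One entry per thread; C++17 aligned allocation keeps each entry aligned.
std::vector<PerThreadData> MakeWorkspace(std::size_t threads) {
  return std::vector<PerThreadData>(threads);
}

// Intended to be called by worker thread t on its own entry, so that each
// thread allocates and first touches its own Jacobian and derivative buffers.
void PrepareThread(PerThreadData& data, std::size_t dimension, std::size_t parameters) {
  assert(dimension > 0 && parameters > 0);
  data.jacobian.assign(dimension * parameters, 0.0);
  data.derivative.assign(parameters, 0.0);
  data.measure = 0.0;
}

// True if the two addresses fall on different lines, under the assumed size.
bool OnDifferentCacheLines(const void* a, const void* b) {
  const std::uintptr_t lineA = reinterpret_cast<std::uintptr_t>(a) / kCacheLineSize;
  const std::uintptr_t lineB = reinterpret_cast<std::uintptr_t>(b) / kCacheLineSize;
  return lineA != lineB;
}
```

<div>The struct header (including the frequently written measure) is padded, but the heap buffers behind the two vectors are separate allocations; sizing them from inside each worker, rather than from the master thread, is the design choice this sketch assumes.</div>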
<div><br></div><div>Rupert, would you like to take a stab at this fix?</div><div><br></div><div>Brad</div><div><div class="h5"><div><br></div><div><br><div><div>On Jul 25, 2012, at 4:31 PM, Rupert Brooks wrote:</div><br><blockquote type="cite">
Sorry if this repeats; I just got a bounce from Insight Developers, so I'm trimming the message and resending.<br clear="all">--------------------------------------------------------------<br>Rupert Brooks<br><a href="mailto:rupert.brooks@gmail.com" target="_blank">rupert.brooks@gmail.com</a><br>
<br>
<br><br><div class="gmail_quote">On Wed, Jul 25, 2012 at 4:12 PM, Rupert Brooks <span dir="ltr"><<a href="mailto:rupert.brooks@gmail.com" target="_blank">rupert.brooks@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Aha. Here is the code around line 183 of itkTranslationTransform:<div><div><br></div><div>// Compute the Jacobian in one position</div><div>template <class TScalarType, unsigned int NDimensions></div><div>void</div><div>TranslationTransform<TScalarType, NDimensions>::ComputeJacobianWithRespectToParameters(</div>
<div> const InputPointType &,</div><div> JacobianType & jacobian) const</div><div>{</div><div> // the Jacobian is constant for this transform, and it has already been</div><div> // initialized in the constructor, so we just need to return it here.</div>
<div> jacobian = this->m_IdentityJacobian;</div><div> return;</div><div>}</div><div><br></div><div>That's probably the culprit, although the root cause may be that the Jacobian is reallocated every time through the loop.</div>
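<div>A small model of the reallocation cost, as a sketch under stated assumptions: Jacobian here is a hypothetical stand-in, not itk::Array2D, and its AssignFrom is assumed to allocate a new buffer only when the sizes differ. For a translation transform the Jacobian is the same identity matrix at every point, so declaring it inside the per-point loop allocates on every iteration, while declaring it once before the loop (one per thread) allocates once and afterwards only copies.</div>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Counts heap buffers created by Jacobian::AssignFrom in this sketch.
std::size_t g_jacobianAllocations = 0;

// Hypothetical stand-in for a Jacobian matrix (row-major); not itk::Array2D.
struct Jacobian {
  std::vector<double> values;

  // Copy `source` in, allocating a new buffer only when the sizes differ.
  void AssignFrom(const Jacobian& source) {
    if (values.size() != source.values.size()) {
      values = std::vector<double>(source.values);  // fresh heap buffer
      ++g_jacobianAllocations;
    } else {
      std::copy(source.values.begin(), source.values.end(), values.begin());
    }
  }
};

// A translation transform has one parameter per dimension; its Jacobian is
// the dimension x dimension identity at every point.
Jacobian MakeTranslationJacobian(std::size_t dimension) {
  Jacobian identity;
  identity.values.assign(dimension * dimension, 0.0);
  for (std::size_t i = 0; i < dimension; ++i) {
    identity.values[i * dimension + i] = 1.0;
  }
  return identity;
}

// Visit `points` sample points and fetch the Jacobian at each one.
// hoisted == false: a fresh Jacobian is declared inside the loop.
// hoisted == true: one Jacobian is declared before the loop and reused.
// Returns the number of buffer allocations made while fetching.
std::size_t CountJacobianAllocations(std::size_t points, bool hoisted, std::size_t dimension) {
  assert(dimension > 0);
  const Jacobian identity = MakeTranslationJacobian(dimension);
  g_jacobianAllocations = 0;
  double checksum = 0.0;  // sums element (0,0) so each fetch is actually used
  if (hoisted) {
    Jacobian jacobian;
    for (std::size_t p = 0; p < points; ++p) {
      jacobian.AssignFrom(identity);
      checksum += jacobian.values[0];
    }
  } else {
    for (std::size_t p = 0; p < points; ++p) {
      Jacobian jacobian;  // starts empty on every iteration
      jacobian.AssignFrom(identity);
      checksum += jacobian.values[0];
    }
  }
  assert(checksum == static_cast<double>(points));
  return g_jacobianAllocations;
}
```

<div>For a 3-D translation, CountJacobianAllocations(1000, false, 3) returns 1000 and CountJacobianAllocations(1000, true, 3) returns 1. Whether ITK's own assignment operator avoids reallocation in the same way is an assumption of this model.</div>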
<div>
<div><br></div><div>Rupert</div><div><br></div><div><snipped></div></div></div></blockquote></div>
</blockquote></div><br></div></div></div></div></blockquote></div><br></div>
</blockquote></div><br><div>
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; font-size: 12px; "><p style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" style="font: normal normal normal 12px/normal Helvetica; ">========================================================</font></p><p style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" style="font: normal normal normal 12px/normal Helvetica; ">Bradley Lowekamp<span class="Apple-converted-space"> </span><span class="Apple-converted-space"> </span></font></p><p style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" style="font: normal normal normal 12px/normal Helvetica; ">Medical Science and Computing for</font></p><p style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" style="font: normal normal normal 12px/normal Helvetica; ">Office of High Performance Computing and Communications</font></p><p style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" style="font: normal normal normal 12px/normal Helvetica; ">National Library of Medicine<span class="Apple-converted-space"> </span></font></p><p style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" style="font: normal normal normal 12px/normal Helvetica; "><a href="mailto:blowekamp@mail.nih.gov">blowekamp@mail.nih.gov</a></font></p><br class="Apple-interchange-newline"></div><br class="Apple-interchange-newline">
</div>
<br></div></body></html>