<html><head><meta http-equiv="Content-Type" content="text/html charset=iso-8859-1"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hans,<div><br></div><div>I have seen poor speedup from using more threads in my brief usage of ANTs. If I recall correctly, more threads were actually slower for whatever I was doing at the time.&nbsp;</div><div><br></div><div>It sounds like you are on the right track with the tools you are using and with trying to remove unnecessary use of the smart pointers' atomic reference counting.</div><div><br></div><div>Another thing to keep in mind is the memory layout of the data being processed, and the impact of multithreaded memory allocation.</div><div><br></div><div>When I looked at these methods, I didn't have time to figure out all the data structures and methods involved to get a good handle on what was going on.</div><div><br></div><div>I'd be interested in seeing speedup numbers for some of this problematic code, that is, the speedup with 2, 3, 4, 8, and 16 threads (on true cores, not hyper-threaded), etc. Or perhaps the ratio of total CPU time (summed across threads) to wall time, or something similar, would be a simple, fair number.&nbsp;</div><div><br></div><div>Good luck,</div><div>Brad</div><div><br><div><div>On Sep 23, 2013, at 7:46 AM, Brian Avants &lt;<a href="mailto:stnava@gmail.com">stnava@gmail.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><div dir="ltr">hi hans&nbsp;<div><br></div><div>thanks for looking at this - i suppose the good news is that there is plenty of room for improvement.&nbsp;</div><div><br></div><div>do you have a sense of whether this is a registration-specific issue or if this is multi-threading in itk, in general?</div>

<div><br></div><div>am wondering if there is a simplified case that we can invent or find that will help clarify/isolate the issues.</div><div><br></div><div>see you tomorrow, probably. &nbsp;i just got in @ 6pm .... &nbsp;</div>

</div><div class="gmail_extra"><br clear="all"><div><div><br></div>brian<br><div><br></div><div><br></div></div>

<br><br><div class="gmail_quote">On Mon, Sep 23, 2013 at 8:16 AM, Johnson, Hans J <span dir="ltr">&lt;<a href="mailto:hans-johnson@uiowa.edu" target="_blank">hans-johnson@uiowa.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; position: static; z-index: auto; ">

<div style="font-size:14px;font-family:Calibri,sans-serif;word-wrap:break-word">

<div>All,</div>

<div><br>

</div>

<div>

<div>Based on a discussion with Nick Tustison on the train from Nagoya airport to the MICCAI conference, I started some profiling to determine what is actually causing registration to be so slow. &nbsp;Some fixes have already been pushed to gerrit (<a href="http://review.source.kitware.com/#/c/12747/" target="_blank">http://review.source.kitware.com/#/c/12747/</a>) and they have shown about a 15% speed improvement. &nbsp;This, however, appears to be only the tip of the iceberg.&nbsp;</div>

</div>

<div><br>

</div>

<div>In addition, &nbsp;I have been&nbsp;greatly disappointed that converting to floating point precision did not result in performance improvement (even though all my past experience indicates that it should be a performance improvement!). &nbsp;If these multithreading issues

 turn out to be the problem, that would explain why improving floating point performance does not improve overall performance. &nbsp;</div>

<div><br>

</div>

<div>

<div>=================</div>

</div>

<div><br>

</div>

<div>So far, everything I've profiled with regard to ANTs registration indicates that there is a serious flaw in the multi-threaded implementation.</div>

<div><br>

</div>

<div>20 of the 52 seconds are spent waiting for condition variables to clear (i.e., variables are shared and require synchronization to complete). &nbsp;The thread concurrency histogram is particularly troubling: only 1 or 2 threads are actually doing productive work at the same time. &nbsp;NOTE: THIS IS A REAL program that is actually in use for affine registration. &nbsp;I use it every day and have been terribly disappointed in its speed. &nbsp;Every ANTs registration that you do likely has this behavior.</div>
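<div><br></div>
<div>One rough way to put a single number on this kind of under-utilization is the ratio of total CPU time (summed across threads) to wall time for a whole run. The sketch below is only an illustration and is not part of the profiling described here: the helper name <code>cpu_to_wall_ratio</code> is hypothetical, it assumes a POSIX system where <code>os.times()</code> reports CPU time for waited-for child processes, and the command to measure is whatever the caller supplies.</div>
<pre>
```python
import os
import subprocess
import time


def cpu_to_wall_ratio(command, env=None):
    """Run `command` once and return (cpu_seconds, wall_seconds, ratio).

    cpu_seconds is user + system CPU time of the child process, summed
    over all of its threads. A ratio near N means about N threads were
    busy on average; a ratio near 1 means the run was effectively serial.
    Assumes POSIX: on Windows the children_* fields are always zero.
    """
    before = os.times()
    start = time.perf_counter()
    # subprocess.run waits for the child, so its CPU time is accounted
    # for in the children_* fields by the time it returns.
    subprocess.run(command, check=True, env=env)
    wall = time.perf_counter() - start
    after = os.times()

    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return cpu, wall, cpu / wall
```
</pre>
<div>Repeating the measurement with the thread count set to 2, 3, 4, 8, and 16 (for ITK-based tools this is typically done through an environment variable passed in <code>env</code>, which should be checked against the build in use) gives both the wall-time speedup and the CPU-to-wall ratio at each setting. Note that threads which spin while waiting still accumulate CPU time, so the ratio is an upper bound on useful concurrency.</div>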

<div><br>

</div>

<div>=================</div>

<div><br>

</div>

<div>I'll continue to track down where the issues are, but they appear to be in places where a transform is referenced from multiple threads and each reference requires updating the internal reference count of the smart pointer. &nbsp;Each smart pointer reference count update requires a global lock on that object to do the increment/decrement.</div>

<div><br>

</div>

<div>More testing to follow.</div>

<div><br>

</div>

<div>Hans</div>

<div><br>

</div>

<div><span>&lt;BB93F41A-C611-4E82-8897-59D419BC5E08.png&gt;</span></div>

<br>

<br>

<br>


</div>

</blockquote></div><br></div>

</blockquote></div><br></div></body></html>