[ITK] [ITK-dev] [ITK Community] [Insight-developers] non-deterministic v4 registrations in 4.5.x

Mon Mar 31 10:12:00 EDT 2014

Hi Matt,

As noted in my previous message, the magnitude of the divergence doesn't seem
to be very large for typical usage patterns, so from a practical point of view
it might be better to bound it, but I haven't done a very careful analysis
yet.

The most straightforward way to avoid this sort of thing is to always do
the sum the same way regardless of the number of threads. The simplest approach
is to accumulate all the constituents into a (stably) ordered data
structure and perform the sum at the end.  Of course, this costs you
both memory and transaction overhead, which may not be acceptable. On
the other hand, with enough elements you'll want some partitioning anyway
for accuracy, but this is best determined by the data range, not the number
of threads.
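
A minimal sketch of that idea, not the ITK implementation: the function name
DeterministicSum and the blockSize and numWorkers parameters are hypothetical.
Block boundaries depend only on the data length, each block is summed in index
order, and the partial sums are combined in block order. The workers are run
one after another here for brevity; since each partial sum is written by
exactly one worker, which worker handles which block does not affect the bits
of the result (assuming the compiler is not allowed to reorder floating-point
operations, e.g. no -ffast-math).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sum `values` in fixed-size blocks. The partition is a function of the data
// length and `blockSize` only, never of `numWorkers`.
double DeterministicSum(const std::vector<double> & values,
                        std::size_t blockSize,
                        unsigned int numWorkers)
{
  assert(blockSize > 0 && numWorkers > 0);
  const std::size_t numBlocks = (values.size() + blockSize - 1) / blockSize;
  std::vector<double> partial(numBlocks, 0.0);

  // Worker w handles blocks w, w + numWorkers, w + 2 * numWorkers, ...
  // (simulated sequentially; each slot of `partial` has a single writer).
  for (unsigned int w = 0; w < numWorkers; ++w)
  {
    for (std::size_t b = w; b < numBlocks; b += numWorkers)
    {
      const std::size_t begin = b * blockSize;
      const std::size_t end = std::min(begin + blockSize, values.size());
      double blockSum = 0.0;
      for (std::size_t i = begin; i < end; ++i)
      {
        blockSum += values[i];
      }
      partial[b] = blockSum;
    }
  }

  // Final reduction always runs in block order, whatever the worker count.
  double total = 0.0;
  for (double p : partial)
  {
    total += p;
  }
  return total;
}
```

The price is the one mentioned above: one stored partial per block instead of
one per thread, plus a serial pass at the end.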

Most of the other methods I've used to avoid or mitigate this in the past
require detailed analysis of local computation and bounding the error at
each step, so you can safely truncate the final result to known digits, and
regardless of order the final estimate will be identical.  You can
sometimes avoid any speed or space costs this way, but it's fiddly work and
has to be reevaluated with any change.
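
As a rough illustration of the truncation step only (the error analysis that
justifies the digit count is the hard part and is not shown), here is a
sketch with a hypothetical helper, QuantizeToDecimals. The number of decimals
is an assumption that would have to come from a bound on the accumulated
error; values that straddle a rounding boundary can still differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Round `value` to `decimals` decimal places and return it as an integer
// count of 10^-decimals units, so that two estimates can be compared exactly.
std::int64_t QuantizeToDecimals(double value, int decimals)
{
  assert(decimals >= 0 && decimals <= 15);
  const double scale = std::pow(10.0, decimals);
  return static_cast<std::int64_t>(std::llround(value * scale));
}
```

Applied to the three metric values quoted below (1, 2 and 8 threads), rounding
to 9 decimals gives the same integer for all three, while rounding to 12
decimals does not.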

On Sun, Mar 30, 2014 at 10:07 AM, Matt McCormick <matt.mccormick at kitware.com> wrote:

> Hi Simon,
>
> Thanks for taking a look.
>
> Yes, your assessment is correct.  What is a strategy that would avoid
> this, though?
>
> More eyes on the optimizers are greatly welcome!
>
> Thanks,
> Matt
>
> On Fri, Mar 28, 2014 at 4:57 PM, Simon Alexander <skalexander at gmail.com>
> wrote:
> > There is a lot going on here, and I'm not certain that I've got all the
> > moving pieces straight in my mind yet, but I've had a quick look at the
> > implementation now. I believe the Mattes v4 implementation is similar to
> > other metrics in its approach.
> >
> > As I suggested earlier in the thread: I believe accumulations like this:
> >
> >>  for( ThreadIdType threadID = 1; threadID < this->GetNumberOfThreadsUsed(); threadID++ )
> >>     {
> >>     this->m_ThreaderJointPDFSum[0] += this->m_ThreaderJointPDFSum[threadID];
> >>     }
> >
> >
> > will guarantee that we don't have absolutely consistent results between
> > different thread counts, due to the lack of associativity of
> > floating-point addition.
> >
> > When I perform only transform initialization and a single evaluation of the
> > metric (i.e. outside of the registration routines), I get results consistent
> > with this. For example, results for a center-of-mass initialization between
> > two MR image volumes give me (double precision):
> >
> > 1 thread :  -0.396771472451519
> > 2 threads: -0.396771472450998
> > 8 threads: -0.396771472451149
> >
> > for the metric evaluation (i.e. via GetValue() of the metric).
> >
> > AFAICS, this is consistent with the magnitude of delta expected from the
> > above.  It will mean no chance of binary equivalence between different
> > thread counts/partitionings, but you can do this accumulation quite a few
> > times before the accumulated divergence gets into digits to worry about.
> > This sort of thing is avoidable, but at some space/speed cost.
> >
> > However, in the registration for this case it takes only about twenty steps
> > for divergence in the third significant digit between metric estimates (via
> > registration->GetOptimizer()->GetCurrentMetricValue())!
> >
> > Clearly the optimizer is not following the same path, so I think something
> > else must be going on.
> >
> > So at this point I don't think the data partitioning of the metric is the
> > root cause, but I will have a more careful look later.
> >
> > Any holes in this analysis you can see so far?
> >
> > When I have time to get back into this, I plan to have a look at the
> > optimizer next, unless you have better suggestions of where to look next.
> >
> > cheers,
> > Simon
> >
> >
> >
> > On Wed, Mar 19, 2014 at 12:56 PM, Simon Alexander <skalexander at gmail.com>
> > wrote:
> >>
> >> Brian, my apologies for the typo.
> >>
> >> I assume you all are at least as busy as I am; just didn't want to leave
> >> the impression that I would definitely be able to pursue this, but I will
> >> try.
> >>
> >>
> >> On Wed, Mar 19, 2014 at 12:45 PM, brian avants <stnava at gmail.com> wrote:
> >>>
> >>> it's brian - and, yes, we all have "copious free time" of course.
> >>>
> >>>
> >>> brian
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, Mar 19, 2014 at 12:43 PM, Simon Alexander <skalexander at gmail.com>
> >>> wrote:
> >>>>
> >>>> Thanks for the summary Brain.
> >>>>
> >>>> A lot of partitioning issues fundamentally come down to the lack of
> >>>> associativity & distributivity of fp operations.  Not sure I can do
> >>>> anything practical to improve it, but I will have a look if I can find
> >>>> a bit of my "copious free time".
> >>>>
> >>>>
> >>>> On Wed, Mar 19, 2014 at 12:29 PM, brian avants <stnava at gmail.com> wrote:
> >>>>>
> >>>>> yes - i understand.
> >>>>>
> >>>>> * matt mccormick implemented compensated summation to address - it
> >>>>> helps but is not a full fix
> >>>>>
> >>>>> * truncating floating point precision greatly reduces the effect you
> >>>>> are talking about but is unsatisfactory to most people ... not sure if
> >>>>> the functionality for that truncation was taken out of the v4 metrics
> >>>>> but it was in there at one point.
> >>>>>
> >>>>> * there may be a small and undiscovered bug that contributes to this in
> >>>>> mattes specifically but i dont think that's the issue.  we saw this
> >>>>> effect even in mean squares.  if there is a bug it may be beyond just
> >>>>> mattes.   we cannot disprove that there is a bug.  if anyone knows of a
> >>>>> way to do that, let me know.
> >>>>>
> >>>>> * any help is appreciated
> >>>>>
> >>>>>
> >>>>> brian
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 19, 2014 at 12:24 PM, Simon Alexander
> >>>>> <skalexander at gmail.com> wrote:
> >>>>>>
> >>>>>> Brain,
> >>>>>>
> >>>>>> I could have sworn I had initially added a follow up email clarifying
> >>>>>> this but since I can't find it in the current quoted exchange, let me
> >>>>>> reiterate:
> >>>>>>
> >>>>>> This is not a case of different results on different systems.
> >>>>>> This is a case of different results on the same system if you use a
> >>>>>> different number of threads.
> >>>>>>
> >>>>>> So while that possibly could be some odd intrinsics issue, for
> >>>>>> example, the far more likely thing is that data partitioning is not
> >>>>>> being handled in a way that ensures consistency.
> >>>>>>
> >>>>>> Originally I was also seeing intra-system differences due to internal
> >>>>>> precision, but that was a separate issue and has been solved.
> >>>>>>
> >>>>>> Hope that is more clear!
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Mar 19, 2014 at 12:13 PM, Simon Alexander
> >>>>>> <skalexander at gmail.com> wrote:
> >>>>>>>
> >>>>>>> Brian,
> >>>>>>>
> >>>>>>> Do you mean the generality of my AVX  internal precision problem?
> >>>>>>>
> >>>>>>> I agree that is a very common issue. The surprising thing there was
> >>>>>>> that we were already constraining the code generation in a way that
> >>>>>>> worked over the different processor generations and types we used, up
> >>>>>>> until we hit the first Haswell cpus with AVX2 support (even though no
> >>>>>>> AVX2 instructions were generated).  Perhaps it shouldn't have
> >>>>>>> surprised me, but it took me a few tests to work that out because the
> >>>>>>> problem was confounded with the problem I discuss in this thread
> >>>>>>> (which is unrelated).  Once I separated them it was easy to spot.
> >>>>>>>
> >>>>>>> So that is a solved issue for now, but I am still interested in the
> >>>>>>> partitioning issue in the image metric, as I only have a workaround
> >>>>>>> for now.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Mar 19, 2014 at 11:24 AM, brian avants <stnava at gmail.com>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler
> >>>>>>>>
> >>>>>>>> just as an example of the generality of this problem
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> brian
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Mar 19, 2014 at 11:22 AM, Simon Alexander
> >>>>>>>> <skalexander at gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Brian, Luis,
> >>>>>>>>>
> >>>>>>>>> Thanks.  I have been using Mattes as you suspect.
> >>>>>>>>>
> >>>>>>>>> I don't quite understand how precision is specifically the issue
> >>>>>>>>> with # of cores.  There are all kinds of issues with precision and
> >>>>>>>>> order of operations in numerical analysis, but often data
> >>>>>>>>> partitioning (i.e. for concurrency) schemes can be set up so that
> >>>>>>>>> the actual sums are done the same way regardless of number of
> >>>>>>>>> workers, which keeps your final results identical.  Is there some
> >>>>>>>>> reason this can't be done for the Mattes metric?  I really should
> >>>>>>>>> look at the implementation to answer that, of course.
> >>>>>>>>>
> >>>>>>>>> Do you have a pointer to earlier discussions?  If I can find the
> >>>>>>>>> time I'd like to dig into this a bit, but I'm not sure when I'll
> >>>>>>>>> have the bandwidth.  I've "solved" this currently by constraining
> >>>>>>>>> the core count.
> >>>>>>>>>
> >>>>>>>>> Perhaps interestingly, my earlier experiments were confounded a bit
> >>>>>>>>> by a precision issue, but that had to do with intrinsics generation
> >>>>>>>>> on my compiler behaving differently on systems with AVX2 (even
> >>>>>>>>> though only AVX intrinsics were being generated).  So that made
> >>>>>>>>> things confusing at first until I separated the issues.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Mar 19, 2014 at 9:49 AM, brian avants <stnava at gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> yes - we had several discussions about this during v4 development.
> >>>>>>>>>>
> >>>>>>>>>> experiments showed that differences are due to precision.
> >>>>>>>>>>
> >>>>>>>>>> one solution was to truncate precision to the point that is
> >>>>>>>>>> reliable.
> >>>>>>>>>>
> >>>>>>>>>> but there are problems with that too.   last i checked, this was
> >>>>>>>>>> an open problem, in general, in computer science.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> brian
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Mar 19, 2014 at 9:16 AM, Luis Ibanez
> >>>>>>>>>> <luis.ibanez at kitware.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Simon,
> >>>>>>>>>>>
> >>>>>>>>>>> We are aware of some multi-threading related issues in
> >>>>>>>>>>> the registration process that result in metric values changing
> >>>>>>>>>>> depending on the number of cores used.
> >>>>>>>>>>>
> >>>>>>>>>>> Are you using the MattesMutualInformationMetric ?
> >>>>>>>>>>>
> >>>>>>>>>>> At some point it was suspected that the problem was the
> >>>>>>>>>>> result of accumulative rounding, in the contributions that
> >>>>>>>>>>> each pixel makes to the metric value.... this may or may
> >>>>>>>>>>> not be related to what you are observing.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>    Thanks
> >>>>>>>>>>>
> >>>>>>>>>>>        Luis
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Feb 20, 2014 at 3:27 PM, Simon Alexander
> >>>>>>>>>>> <skalexander at gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> I've been finding some regressions in registration results when
> >>>>>>>>>>>> using systems with different numbers of cores (so the thread
> >>>>>>>>>>>> count is different).  This is resolved by fixing the global max.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It's difficult for me to run the identical code against
> >>>>>>>>>>>> 4.4.2, but similar experiments were run in that timeframe
> >>>>>>>>>>>> without these regressions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I recall that there were changes affecting multithreading in the
> >>>>>>>>>>>> v4 registration in the 4.5.0 release, so I thought this might be
> >>>>>>>>>>>> a side effect.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So a few questions:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is this behaviour expected?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Am I correct that this was not the behaviour in 4.4.x ?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Does anyone who has a feel for the recent changes 4.4.2 ->
> >>>>>>>>>>>> 4.5.[0,1] have a good idea where to start looking?  I haven't
> >>>>>>>>>>>> yet dug into the multithreading architecture, but this "smells"
> >>>>>>>>>>>> like a data partitioning issue to me.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Any other thoughts?
> >>>>>>>>>>>>
> >>>>>>>>>>>> cheers,
> >>>>>>>>>>>> Simon
> >>>>>>>>>>>>
> >>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>> Powered by www.kitware.com
> >>>>>>>>>>>>
> >>>>>>>>>>>> Visit other Kitware open-source projects at
> >>>>>>>>>>>> http://www.kitware.com/opensource/opensource.html
> >>>>>>>>>>>>
> >>>>>>>>>>>> Kitware offers ITK Training Courses, for more information
> visit:
> >>>>>>>>>>>> http://kitware.com/products/protraining.php
> >>>>>>>>>>>>
> >>>>>>>>>>>> Please keep messages on-topic and check the ITK FAQ at:
> >>>>>>>>>>>> http://www.itk.org/Wiki/ITK_FAQ
> >>>>>>>>>>>>
> >>>>>>>>>>>> Follow this link to subscribe/unsubscribe:
> >>>>>>>>>>>> http://www.itk.org/mailman/listinfo/insight-developers
> >>>>>>>>>>>>
> >>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>> Community mailing list
> >>>>>>>>>>>> Community at itk.org
> >>>>>>>>>>>> http://public.kitware.com/cgi-bin/mailman/listinfo/community
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>