[ITK-dev] [ITK] About the last optimization on VariableLengthVector

Tue Nov 17 10:03:19 EST 2015

On Tue, Nov 17, 2015 at 3:46 AM, Luc Hermitte <luc.hermitte at c-s.fr> wrote:
> Le 16/11/2015 19:44, Matt McCormick a écrit :
>> On Mon, Nov 16, 2015 at 8:55 AM, Luc Hermitte <luc.hermitte at c-s.fr> wrote:
>>
>>>>> Third Problem: Local Variables
>>>>> 3.1- Auxiliary variables
>>>>> Some algorithms (e.g. BSplines interpolations) require dynamic data
>>>>> structures like matrices. This implies allocation at each iteration.
>>>>> This particular problem isn't addressed in this patch. The solution to
>>>>> 3.2 will help to provide an solution.
>>>>>
>>>>> 3.2- Factorized computations
>>>>> [...]
>>>>>
>>>>> We need a C++98 solution.
>>>>> The only way I see is to rely on cached variables.
>>>>> This means that the ImageFunctions are no longer [[pure]] functions :
>>>>> they'll mutate some variables that we want to allocate once per
>>>>> ImageFunction and ... per thread.
>>>>> Indeed the ImageFunctions shall be reentrant.
>>>>>
>>>>> I was first hoping to use thread_local variables. Alas they cannot be
>>>>> non-static attributes. This is a problem: if I'm not wrong, a same
>>>>> ImageFunction could be used multiple times in a same processing chain.
>>>>>
>>>>> We still have a simple solution, but it requires a second modification
>>>>> to Evaluate() signature: we need to pass the current ThreadId.
>>>>>
>>>>> The new drawback: code migration requires more work. We can not write
>>>>> Evaluate() in terms of EvaluateNoCopy() as the latter requires the
>>>>> threadId that the former doesn't know. This is the reason why I expect
>>>>> the tests to fail.
>>>>> Instead we need to convert all client code (like the WarpImageFilter)
>>>>> and have the default implementation of EvaluateNoCopy() to call the
>>>>> already written and not yet converted Evalutate().
>>>>
>>>> This would be messy, and it would not work very well since the
>>>> ImageFunction's are not involved in orchestration of the threading.
>>>> Instead, it would be better to make the necessary auxiliary variables
>>>> private variables.  Parallel code that uses ImageFunction's should
>>>> create one ImageFunction instance per thread.
>>>
>>> I'm afraid that with this choice, code that currently works well in
>>> parallel won't work anymore. End users will need to rewrite all uses of
>>> image functions in order to use them in MT code. And I guess we will
>>> also need to rewrite several test cases.
>>>
>>> Unless we duplicate the Evaluate() code : i.e. unless we keep the
>>> current version, and we add the new and optimized rewrite (for VLV in
>>> most cases, or for every kind of variables in some other cases (e.g.
>>> BSpline interpolation where caching matrix allocations helps reduce
>>> computation time))
>>> Duplicating image function algorithms doesn't seem to be a good choice
>>> either.
>>
>> Right. Yes, this will not be maintainable.
>>
>> A feasible way forward seems to use a C++11 code path when available
>> that has enhanced performance (i.e., use auto with C++11).  The C++98
>> version will not be as fast, but it will work. ITK leans towards
>> readability, maintainability, and program-ability over performance in
>> cases like this.
>
> With this solution, we loose computation factorizations.
>
>    auto p4 = p0 + (p1 - p0) * d0
>    auto p5 = p2 + (p3 - p2) * d1
>    output = p4 + (p5 - p4) * d2
>
> p4 computations will be done twice on VLVs and std::valarrays.
> // I suppose the dangling reference issue solved.
>
> Using cached variables and introducing a new threadId parameter greatly
> improve computation on VLV. But also on some other algorithms.
>
> Readability and maintainability won't be that reduced.

We are working towards better multi-threading infrastructure a la vtkSMPTools:

  http://www.kitware.com/blog/home/post/915

Generally, the number of threads used are not known a priori.
Moreover, they may change during parallel operations. Adding cache
variables to ImageFunction's requires making them re-entrant, which
will likely have a much more negative impact on performance than any
cached variable gains in these other models. Depending on the backend,
it could get very messy or it may not even be possible.

>>>>> Fourth Problem: heterogeneous data types (i.e. castings)
>>>>> The input image type, the output image type, and the internal pixel type
>>>>> may all be different. That's why the code contain so many static_casts.
>>>>>
>>>>> Alas a static_cast<> on pixels will induce the creation of new pixel values.
>>>>> This is contrary to the purpose of this patch.
>>>>> Copy constructors and assignment operators already support on-the-fly
>>>>> implicit static_casting.
>>>>> We still need to be able to require casting only when it's required.
>>>>> That's where the new function CastInto(), MoveInto() and As() come into
>>>>> rescue.
>>>>
>>>> So compiler's are not intelligent to make a static_cast<> a no-op when
>>>> the types are the same?
>>>
>>> Unfortunately, it doesn't seem so. Check
>>> http://stackoverflow.com/questions/19106826/can-static-cast-to-same-type-introduce-runtime-overhead
>>
>> Thanks for the link. That thread seems to indicate that POD types are
>> not an issue, and non-POD types may not be an issue outside of -O0.
>>
>>
>>> Moreover, in the case of converting VLV<double> to VLV<int>, with a
>>> static_cast, we'll force the creation of a new variable which is what I
>>> striving to avoid.
>>>
>>> Hence the new helper functions that'll use the converting assignment
>>> operators in case of VLV, or static_cast in all other cases.
>>
>> Sounds reasonable.
>>
>> What are the different definitions and uses for  CastInto(),
>> MoveInto(), and As()?
>
> MoveInto will try to swap() the content of two heavy variables like VLV
> when both are of the same type and equivalent to lvalues. If they are
> VLV but if their ValueTypes differ, the on-the-fly converting assignment
> operator is used. Otherwise, static_cast is used.
> The objective is not to use C++11 std::move as for the moment it isn't
> implemented as a move-and-swap, but as a destructive move.
>
> CastInto is to be used when swap cannot be used.
> This is the case when one of the pixel is actually a proxy/pixel_view.
> (i.e. when m_ManageMemory == false). Swapping pixels would alter the
> input image, this is unacceptable. Hence, CastInto.
>
> Both could be used independently of Expression Templates.
>
> As<> produces an expression template (on VLV) that will eventually
> execute a static_cast on each pixel component. On other types, it
> becomes a simple static_cast.

Cool!

Matt