[ITK-dev] About the last optimization on VariableLengthVector

Thu Nov 12 12:17:59 EST 2015

Hello,

I've pushed a patch that illustrate a way to have better performances
with VariableLengthVectors (VLV).
    http://review.source.kitware.com/#/c/20389/
I suspect, this patch will need some discussions.

You'll find a descriptions of where I wish to go, of the problems
encountered on the way, and of possible solutions.

The biggest issue is how to have thread-safe cached variables in
functions that were pure functions until now, and still keep the code as
simple as possible, and minimize the number of modifications required to
improve the code base (regarding performances on VLV)

Regards
--Luc Hermitte

This is the final step of the performances enhancements to be made on
VLV in order to have performances on par with monoband itk::Image<>.

This patch won't pass tests. This is known.
I still have to reorganize the new code and to document it.

It's meant to illustrate different possibilities and issues to be
solved. Here is a short summary. Let discuss on the mailing list.

The *purpose* of this patch (and the previous ones as well) is to
eliminate allocations made of each pixel.

This will permit (on a Linear Interpolation test case to reduce the
computation time, on a VectorImage, from 85s to 27s -- it was around 2
minutes before the first patches, and on the same image read as an
itk::Image it takes ~15-17s)

First problem: PixelType Evalutate(...);
Currently, when we apply an ImageFunction, we call Evaluate() function
which requires the creation of a new VLV. Indeed this function return a
new pixel value by value.
The only way to get rid of the variable creation is to return the new
value as an output reference parameter.

This means all classes that inherit from ImageFunction will need to
change the signature of their EvaluateXxxx() functions. And that all
client classes must take this into account. In other words, many
modifications will be required if we want to have better performances on
VLV.

Note however, it's quite simple to migrate gradually the source code to
the new interface : we can have all client code explicitly use the new
(and badly named) EvaluateNoCopy(), which in turn call the old
Evaluate() -- or the other way around (which has a drawback regarding
cached variables).

Second Problem: Temporary pixel values
Some computations will produce new pixel values. For instance
  p = p0 + (p1 - p0) * d;
produces 3 new pixels values.
This has already been taken care of in the previous patches thanks to
Expression Templates.

Third Problem: Local Variables
3.1- Auxiliary variables
Some algorithms (e.g. BSplines interpolations) require dynamic data
structures like matrices. This implies allocation at each iteration.
This particular problem isn't addressed in this patch. The solution to
3.2 will help to provide an solution.

3.2- Factorized computations
Sometimes, a complex computation will take advantage of code
factorisation (easier to maintain, and better performances). e.g.

   Pixel p4 = p0 + (p1 - p0) * d0
   Pixel p5 = p2 + (p3 - p2) * d1
   output = p4 + (p5 - p4) * d2

This implies new pixel values, hence dynamic allocations.

In c++11, we could (almost) have written
   auto p4 = p0 + (p1 - p0) * d0

Except that p4 is an expression template. Its type is quite complex to
express in plain C++98/03. But the problem will be that the current
implementation of expression templates uses references. Here p4 will be
made of dangling references, the code would be bugged.
Workarounds would be possible in C++11 as it will permit us to
distinguish lvalue references from rvalue references in order to copy
the latter, and keep references to the former.
We are not in C++11 yet. Moreover, this would not factorize the
computations.

We need a C++98 solution.
The only way I see is to rely on cached variables.
This means that the ImageFunctions are no longer [[pure]] functions :
they'll mutate some variables that we want to allocate once per
ImageFunction and ... per thread.
Indeed the ImageFunctions shall be reentrant.

I was first hoping to use thread_local variables. Alas they cannot be
non-static attributes. This is a problem: if I'm not wrong, a same
ImageFunction could be used multiple times in a same processing chain.

We still have a simple solution, but it requires a second modification
to Evaluate() signature: we need to pass the current ThreadId.

The new drawback: code migration requires more work. We can not write
Evaluate() in terms of EvaluateNoCopy() as the latter requires the
threadId that the former doesn't know. This is the reason why I expect
the tests to fail.
Instead we need to convert all client code (like the WarpImageFilter)
and have the default implementation of EvaluateNoCopy() to call the
already written and not yet converted Evalutate().

Fourth Problem: heterogeneous data types (i.e. castings)
The input image type, the output image type, and the internal pixel type
may all be different. That's why the code contain so many static_casts.

Alas a static_cast<> on pixels will induce the creation of new pixel values.
This is contrary to the purpose of this patch.
Copy constructors and assignment operators already support on-the-fly
implicit static_casting.
We still need to be able to require casting only when it's required.
That's where the new function CastInto(), MoveInto() and As() come into
rescue.

Alas, all ImageFunction codes will need to be converted to use these
functions.