[Insight-developers] Empty FixedArray destructor: Performance hit using gcc (times 2) : __attribute__ ((aligned (8)))

Fri Jun 6 10:10:35 EDT 2008

Hi Tom,

More on this,  Bradley Lowekamp kindly pointed us to the following
GCC mechanism for specifying the alignment of structures:

http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/Type-Attributes.html

Thanks, Brad!

---

The Attribute:

          __attribute__ ((aligned (8)))

also does the trick.

When we add it to the end of the minimalistic array:

class MyArray
{
public:
   MyArray() {};
   ~MyArray() {};
   double operator[](unsigned int k)
     {
     return foo[k];
     }
   double foo[2];
} __attribute__ ((aligned (8))) ;

we can compile without -malign-double and the structure
is still aligned to 8 bytes, despite the fact that the
destructor is still present.

We could create ITK macros for this attribute option,
and define the macros at configuration time by using
TRY_COMPILE.

The remaining question is:

     Are there any drawbacks to this approach ?

At first sight, it is much better than the global
-malign-double option, and we can apply it only
to structures that we know must be aligned.

One challenge here is that although we want FixedArray<double,N>
to be 8-byte aligned, we don't always want FixedArray<T,N>
to be aligned this way.  For example: FixedArray<char,3>  ??

One option could be to create your pixel type as a class derived
from FixedArray<double,2>, and see if we can apply the attribute
just to the derived class....

In this way, this will be an application specific issue, as opposed
to something that has to be done pervasively in ITK.

   Any suggestions ?

       Luis

---------------------
Luis Ibanez wrote:
> 
> Hi Tom,
> 
> 
> Trying to understand why this alignment happens, we have reduced
> the test to a minimalistic implementation of FixedArray:
> 
> 
> 
> class MyArray
> {
> public:
>   MyArray() {};
>   ~MyArray() {};
>   double operator[](unsigned int k)
>     {
>     return foo[k];
>     }
>   double foo[2];
> };
> 
> 
> 
> With this implementation we have reproduced your observation
> that:
> 
> 
>   a) When the destructor exists, an array of MyArray(s) is
>      allocated on a 4-byte boundary
> 
>   b) When the destructor does not exist, an array of
>      MyArray(s) is allocated on an 8-byte boundary
> 
> 
> Then, by Googling about it we found this GCC flag:
> 
>               -malign-double
> 
> When compiling with this flag, your test is always aligned
> to 8 bytes, regardless of whether the destructor is present
> or not.
> 
> 
> 
> "Optimization in GCC"
> January 26th, 2005 by M. Tim Jones
> http://www.linuxjournal.com/article/7269
> 
> 
> <quote>
> Alignment Optimizations
> 
> In the second optimization level, we saw that a number of alignment
> optimizations were introduced that had the effect of increasing
> performance but also increasing the size of the resulting image. Three
> additional alignment optimizations specific to this architecture are
> available. The -malign-int option allows types to be aligned on 32-bit
> boundaries. If you're running on a 16-bit aligned target, -mno-align-int
> can be used. The -malign-double controls whether doubles, long doubles
> and long-longs are aligned on two-word boundaries (disabled with
> -mno-align-double). Aligning doubles provides better performance on
> Pentium architectures at the expense of additional memory.
> 
> Stacks also can be aligned by using the option
> -mpreferred-stack-boundary. The developer specifies a power of two for
> alignment. For example, if the developer specified
> -mpreferred-stack-boundary=4, the stack would be aligned on a 16-byte
> boundary (the default). On the Pentium and Pentium Pro targets, stack
> doubles should be aligned on 8-byte boundaries, but the Pentium III
> performs better with 16-byte alignment.
> </quote>
> 
> 
> 
> The conundrum is that structures that combine doubles with
> other types will consume more memory.
> 
> 
> Consider for example a structure such as:
> 
>      class A
>      {
>      private:
>         char    foor;
>         double  bar;
>      };
> 
> 
> with the flag -mno-align-double (the default) this will use 12 bytes,
> versus 16 bytes that will be used when -malign-double is set.
> (measured with sizeof(A)).
> 
> 
> We still have not answered the fundamental question:
> 
>    Why is it that the presence of a non-virtual destructor
>    changes the alignment ?
> 
> 
> 
>     Luis
> 
> 
> ------------------------
> Tom Vercauteren wrote:
> 
>> Hi,
>>
>> Thanks for your tests, it's great to see such reactivity!
>>
>> Below is another test that will show the performance hit. You don't
>> need to recompile ITK to use it. What we did was to run a simple loop
>> on a C array of FixedArray. Then we hack around to get an 8 byte
>> aligned C array of FixedArray and run the loop again.
>>
>> In this case, the performance hit is clearly not as large as the one
>> we get in the real world case but is still large enough to be
>> conclusive.
>>
>>    Initial alignment: 4
>>    Initial execution time: 920ms
>>    New alignment: 0
>>    Execution time: 880ms
>>
>> Let me know what it gives on your setup.
>>
>> If the destructor is not implemented you would get ( Initial
>> alignment: 0 ) and the same timing results.
>>
>> Tom
>>
>>
>>
>> #include <cstring>   // memset
>> #include <ctime>     // clock, clock_t, CLOCKS_PER_SEC
>> #include <iostream>
>> #include <itkFixedArray.h>
>>
>> int main()
>> {
>>    // Define the number of elements in the array
>>    const unsigned int nelements = 10000000;
>>
>>    // Define the number of runs used for timing
>>    const unsigned int nrun = 10;
>>
>>    // Declare a simple timer
>>    clock_t t;
>>
>>    typedef itk::FixedArray<double,2> ArrayType;
>>
>>    // Declare an array of nelements FixedArray
>>    // and add a small margin to play with pointers
>>    // but not map outside the allocated memory
>>    ArrayType * vec = new ArrayType[nelements+8];
>>
>>    // Fill it up with zeros
>>    memset(vec,0,(nelements+8)*sizeof(ArrayType));
>>
>>
>>
>>
>>    // Display the alignment of the array
>>    std::cout << "Initial alignment: " << (reinterpret_cast<size_t>(vec) & 7) << "\n";
>>
>>    // Start a simple experiment
>>    t = clock();
>>    double acc1 = 0.0;
>>    for (unsigned int i=0;i<nrun;++i)
>>    {
>>       for (unsigned int j=0;j<nelements;++j)
>>       {
>>          acc1+=vec[j][0];
>>       }
>>    }
>>
>>    // Get the final timing and display it
>>    t=clock() - t;
>>
>>    std::cout << "Initial execution time: "
>>              << (t*1000.0) / CLOCKS_PER_SEC << "ms\n";
>>
>>
>>
>>
>>
>>    // We now emulate an 8 bytes aligned array
>>
>>    // Cast the pointer to char to play with bytes
>>    char * p = reinterpret_cast<char*>( vec );
>>
>>    // Move the char pointer until is aligned on 8 bytes
>>    while (reinterpret_cast<size_t>(p) % 8) ++p;
>>
>>    // Cast the 8 bytes aligned pointer back to the original type
>>    ArrayType * vec2 = reinterpret_cast<ArrayType*>( p );
>>
>>    // Make sure the new pointer is well aligned by
>>    // displaying the alignment
>>    std::cout << "New alignment: " << (reinterpret_cast<size_t>(vec2) & 7) << "\n";
>>
>>    // Start the simple experiment on the 8 byte aligned array
>>    t = clock();
>>    double acc2 = 0.0;
>>    for (unsigned int i=0;i<nrun;++i)
>>    {
>>       for (unsigned int j=0;j<nelements;++j)
>>       {
>>          acc2+=vec2[j][0];
>>       }
>>    }
>>
>>    // Get the final timing and display it
>>    t=clock() - t;
>>
>>    std::cout << "Execution time: "
>>              << (t*1000.0) / CLOCKS_PER_SEC << "ms\n";
>>
>>
>>
>>
>>    // Free up the memory
>>    delete [] vec;
>>
>>    // Make sure we do something with the sums otherwise everything
>>    // could be optimized away by the compiler
>>    return acc1+acc2;
>> }
>>
>>
>>
>> On Thu, Jun 5, 2008 at 5:04 PM, Gert Wollny <gert at die.upm.es> wrote:
>>
>>> On Thursday, 05.06.2008, at 10:24 -0400, Luis Ibanez wrote:
>>>
>>>> Hi Gert,
>>>>
>>>> Thanks for the quick report !
>>>>
>>>> It makes sense that the -g flag will prevent the method
>>>> from being optimized away.
>>>>
>>>> If you have a chance,
>>>> could you please test what happens when no -g is
>>>> used, and the optimization flag is set to -O3 ?
>>>
>>>
>>> It was not optimized away, and valgrind/kcachegrind tells me the
>>> destructor is located in libITKCommon.so.
>>>
>>> Actually, with -O3 the whole loop was optimized away. This is weird, to
>>> say the least, because if the compiler doesn't see the implementation
>>> of the constructor and the destructor and uses the explicitly
>>> instantiated ones, it cannot know whether something essential is done
>>> in either of them, like changing a global variable.
>>>
>>> I've added some code to force the loop (attached).
>>>
>>> BTW: I think -g doesn't change the optimizers at all (with g++).
>>>
>>> Best
>>>
>>> Gert
>>>
>>>
>>>
>>>
>>>
>>
>>
> 
