[Insight-developers] Re: multi-thread efficient malloc for LevelSetNode

Joshua Cates cates@poblano.sci.utah.edu
Wed, 17 Apr 2002 10:58:06 -0600 (MDT)


Hi Simon,

This is great information and much appreciated.  I'll be digging through
it for ideas.  Your suggestion of more tests to determine the actual
extent of mutex contention is well taken, as constructing an alternative
memory manager is a tricky job.

Thanks,

Josh.

______________________________
 Josh Cates
 School of Computer Science
 University of Utah
 Email: cates@sci.utah.edu
 Phone: (801) 587-7697
 URL:   www.cs.utk.edu/~cates


On Wed, 17 Apr 2002, Simon Warfield wrote:

>
> Hi Joshua,
>
> > We are not using smart pointers for the LevelSetNodes, but calling new and
> > delete has the same mutex problem (there is no way around this, right?).
>
>   It is possible to implement an efficient memory allocator (i.e. malloc(),
> free(), new, delete) that minimizes mutex contention.  Basically, rather
> than a single global heap protected by a single mutex that all threads
> contend for, you use a set of heaps and arrange for each thread to allocate
> new memory from its own heap.
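>
>   In rough outline the structure is something like this (only a sketch: the
> Heap type below is a thin stand-in over malloc, whereas a real allocator
> such as ptmalloc or Hoard keeps its own free lists and size classes inside
> each heap):
>
>     #include <cstdlib>
>     #include <functional>
>     #include <mutex>
>     #include <thread>
>
>     struct Heap {
>       std::mutex lock;
>       void* Allocate(std::size_t n) { return std::malloc(n); }
>       void  Free(void* p)           { std::free(p); }
>     };
>
>     const std::size_t kNumHeaps = 16;   // >= number of worker threads
>     Heap g_heaps[kNumHeaps];
>
>     // Each thread hashes its id to pick "its" heap, so the lock it takes
>     // is almost always uncontended.  (A real allocator also records which
>     // heap a block came from so it can be freed back to the same heap.)
>     Heap& HeapForThisThread() {
>       std::size_t i =
>         std::hash<std::thread::id>()(std::this_thread::get_id()) % kNumHeaps;
>       return g_heaps[i];
>     }
>
>     void* ThreadHeapAlloc(std::size_t n) {
>       Heap& h = HeapForThisThread();
>       std::lock_guard<std::mutex> guard(h.lock);
>       return h.Allocate(n);
>     }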
>
> There have been several implementations of this idea.  One with a good
> reputation is Hoard:
>   http://www.cs.utexas.edu/users/emery/hoard/
>   ftp://ftp.cs.utexas.edu/pub/emery/papers/asplos2000.pdf
>
>   I used a different implementation (ptmalloc) a few years ago with good
> results.  ptmalloc() has become the default memory allocator in the GNU libc,
> so in fact the assumption that the default malloc() involves contending for
> a single mutex isn't true on systems using GNU libc (i.e. Linux) and may not
> be true for other systems either.
>   One paper on performance tests for ptmalloc() is at:
> http://www.citi.umich.edu/techreports/reports/citi-tr-00-5.pdf
> However, Hoard is apparently even better.
>
>   All of that said, you might be surprised not to find much of a performance
> hit in doing 'parallel' memory allocation.  If your threads are not
> allocating memory at the same time, then lock acquisition for an uncontended
> mutex can be very fast (Sun said 2 microseconds for their mutex
> implementation on their hardware a couple of years ago).
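>
>   For a rough number on your own hardware, a single-threaded loop like the
> following times the best (uncontended) case; contention in a real run can
> only make it slower:
>
>     #include <chrono>
>     #include <cstdio>
>     #include <mutex>
>
>     int main() {
>       std::mutex m;
>       const long N = 10000000;
>       auto start = std::chrono::steady_clock::now();
>       for (long i = 0; i < N; ++i) {
>         m.lock();      // never contended: no other thread exists
>         m.unlock();
>       }
>       auto stop = std::chrono::steady_clock::now();
>       auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
>                   stop - start).count();
>       std::printf("%.1f ns per uncontended lock/unlock pair\n",
>                   double(ns) / N);
>       return 0;
>     }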
>
>   You can also mitigate the contention problem with the usual solution of
> allocating more than you need at each allocation, and then handing it out
> as you need it.  From your description below you seem to be doing something
> like this.  The idea is: when you need to allocate one LevelSetNode,
> allocate ten, and don't do another allocation until you have used the other
> nine.  This reduces memory system contention by a factor of ten at the cost
> of slightly increased memory usage.  I suggest doing careful timing of the
> actual impact of mutex contention and memory allocation costs before going
> to a lot of trouble to optimize for this.  One simple test would be to drop
> in Hoard and see whether it has much impact on runtime.  If it doesn't,
> then this probably isn't worth optimizing for.
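>
>   In code, the chunking amounts to a small per-thread cache along these
> lines (a sketch only; cleanup of the chunks is left out):
>
>     #include <vector>
>
>     struct LevelSetNode { /* index, value, links, ... */ };
>
>     // One instance per thread; never shared, so Next() takes no lock.
>     // Only every tenth request reaches the (possibly contended) global
>     // allocator, which is where the factor-of-ten reduction comes from.
>     class ChunkedNodeSource {
>     public:
>       LevelSetNode* Next() {
>         if (m_Unused.empty()) {
>           for (int i = 0; i < 10; ++i) m_Unused.push_back(new LevelSetNode);
>         }
>         LevelSetNode* node = m_Unused.back();
>         m_Unused.pop_back();
>         return node;
>       }
>     private:
>       std::vector<LevelSetNode*> m_Unused;
>     };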
>
> > Date: Tue, 16 Apr 2002 12:21:36 -0600 (MDT)
> > From: Joshua Cates <cates@poblano.sci.utah.edu>
> > To: "Miller, James V (Research)" <millerjv@crd.ge.com>
> > Cc: Insight-Developers <insight-developers@public.kitware.com>
> > Subject: RE: [Insight-developers] design issue
> >
> > Hi Jim,
> >
> > Thanks for the input.  I'll take a look at these articles.
> >
> > >
> > > Can the LevelSetNode be a stack-based class? And hence not use
> > > SmartPointers at all? And hence avoid the mutexes?
> > >
> >
> > We are not using smart pointers for the LevelSetNodes, but calling new and
> > delete has the same mutex problem (there is no way around this, right?).
> > I have considered using std::vector for the storage class of my object.
> > It has the advantage that different allocators could be used, e.g. a
> > non-thread-safe allocator could gain some speed if you know you don't need
> > thread safety.
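> >
> > A minimal allocator shell already fits std::vector's allocator parameter
> > (sketch only; this one just forwards to malloc, but the point is that it
> > can use any non-locking scheme internally):
> >
> >     #include <cstdlib>
> >     #include <new>
> >     #include <vector>
> >
> >     template <typename T>
> >     struct PrivateAllocator {
> >       typedef T value_type;
> >       PrivateAllocator() {}
> >       template <typename U> PrivateAllocator(const PrivateAllocator<U>&) {}
> >
> >       T* allocate(std::size_t n) {
> >         if (void* p = std::malloc(n * sizeof(T))) return static_cast<T*>(p);
> >         throw std::bad_alloc();
> >       }
> >       void deallocate(T* p, std::size_t) { std::free(p); }
> >     };
> >
> >     template <typename T, typename U>
> >     bool operator==(const PrivateAllocator<T>&, const PrivateAllocator<U>&)
> >     { return true; }
> >     template <typename T, typename U>
> >     bool operator!=(const PrivateAllocator<T>&, const PrivateAllocator<U>&)
> >     { return false; }
> >
> >     struct LevelSetNode { /* ... */ };
> >     typedef std::vector<LevelSetNode, PrivateAllocator<LevelSetNode> >
> >       NodeList;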
> >
> > Josh.
> >
> > ______________________________
> >  Josh Cates
> >  School of Computer Science
> >  University of Utah
> >  Email: cates@sci.utah.edu
> >  Phone: (801) 587-7697
> >  URL:   www.cs.utk.edu/~cates
> >
> >
> > On Tue, 16 Apr 2002, Miller, James V (Research) wrote:
> >
> > > Here are my thoughts:
> > >
> > > Not knowing anything about your implementation or level sets....
> > >
> > > Can the LevelSetNode be a stack-based class? And hence not use
> > > SmartPointers at all? And hence avoid the mutexes?
> > >
> > > You could create an std::vector of LevelSetNodes for each thread.
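> > >
> > > E.g., roughly (with numberOfThreads standing in for however the filter
> > > learns its thread count):
> > >
> > >     #include <cstddef>
> > >     #include <vector>
> > >
> > >     struct LevelSetNode { /* ... */ };
> > >
> > >     // One private vector of nodes per thread; thread i only ever
> > >     // touches nodeLists[i], so the lists can grow without locking.
> > >     void SetupPerThreadLists(std::size_t numberOfThreads) {
> > >       std::vector< std::vector<LevelSetNode> > nodeLists(numberOfThreads);
> > >       // ... hand nodeLists[i] to thread i ...
> > >     }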
> > >
> > > In the current issue of C/C++ Users Journal (May 2002), there are a
> > > number of articles on threading.  One article covers the boost thread
> > > library, which looks pretty nice.  Other articles are on "thread-local
> > > storage" (globals per thread).  These are slower than real globals but
> > > faster than mutex-protected code.
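> > >
> > > For example, roughly (the counters are just placeholders to show the
> > > difference):
> > >
> > >     #include <mutex>
> > >
> > >     // A real global: every thread shares it, so every update needs the
> > >     // lock.
> > >     std::mutex g_lock;
> > >     long g_sharedCount = 0;
> > >     void CountShared() {
> > >       std::lock_guard<std::mutex> guard(g_lock);
> > >       ++g_sharedCount;
> > >     }
> > >
> > >     // A "global per thread": each thread sees its own copy, so no lock
> > >     // is needed, at the cost of an extra indirection relative to a
> > >     // plain global.
> > >     thread_local long t_localCount = 0;
> > >     void CountLocal() { ++t_localCount; }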
> > >
> > > Another article talks about different mechanisms for mutexes.  One
> > > interesting idea is a recursive mutex, where a thread pays the locking
> > > penalty only the first time it grabs the mutex.  As long as it doesn't
> > > release the mutex, subsequent calls to acquire it are fast.
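> > >
> > > Something along these lines (sketch; DoWork and Helper are made-up
> > > names):
> > >
> > >     #include <mutex>
> > >
> > >     std::recursive_mutex g_listLock;
> > >
> > >     void Helper() {
> > >       // Already held by this thread: re-acquiring cannot block and
> > >       // just bumps a recursion count.
> > >       std::lock_guard<std::recursive_mutex> guard(g_listLock);
> > >       // ... touch the shared lists ...
> > >     }
> > >
> > >     void DoWork() {
> > >       // Pays the full acquisition cost once.
> > >       std::lock_guard<std::recursive_mutex> guard(g_listLock);
> > >       Helper();
> > >       Helper();
> > >     }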
> > >
> > > I'll be looking at these articles in a little more detail to see
> > > if there is anything we want to fold into ITK.
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Joshua Cates [mailto:cates@poblano.sci.utah.edu]
> > > Sent: Monday, April 15, 2002 2:41 PM
> > > To: Insight-Developers
> > > Subject: [Insight-developers] design issue
> > >
> > >
> > > Hi all,
> > >
> > > I'm soliciting feedback on a design issue.
> > >
> > > In our SparseFieldLevelSetImageFilter we manage lists of small data
> > > objects (call them LevelSetNodes).  These lists must grow and shrink
> > > dynamically.  Because the filter is multithreaded, allocating and
> > > freeing LevelSetNodes one at a time as they are needed is not an option.
> > > It's one of those cases where a naive implementation could actually be
> > > slower in a multithreaded environment because of mutex locks.
> > >
> > > So what we do is to pre-allocate a "store" of objects for each thread from
> > > which that thread borrows to grow its lists.  This allows us to operate in
> > > a distributed memory mode.  Each thread has a private reserve of memory
> > > that will not be touched by other threads, eliminating the need for mutex
> > > locks. Blocking only occurs when a thread runs out of memory and has to
> > > enlarge its store.  I am planning to encapsulate this functionality into
> > > an itk::DataObject subclass.
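> > >
> > > In rough outline the class would look something like this (only a
> > > sketch; ObjectStore, Borrow and Return are placeholders for whatever
> > > interface itkObjectStore ends up with):
> > >
> > >     #include <cstddef>
> > >     #include <vector>
> > >
> > >     struct LevelSetNode { /* index, value, links, ... */ };
> > >
> > >     // One ObjectStore per thread, so Borrow()/Return() never take a
> > >     // lock.  The only synchronization left is inside ::operator new
> > >     // when a store runs dry and has to grow, which should be rare if
> > >     // growthSize is generous.
> > >     class ObjectStore {
> > >     public:
> > >       explicit ObjectStore(std::size_t growthSize = 1024)
> > >         : m_GrowthSize(growthSize) {}
> > >
> > >       ~ObjectStore() {
> > >         for (std::size_t i = 0; i < m_Blocks.size(); ++i)
> > >           delete [] m_Blocks[i];
> > >       }
> > >
> > >       LevelSetNode* Borrow() {
> > >         if (m_Free.empty()) this->Grow();
> > >         LevelSetNode* node = m_Free.back();
> > >         m_Free.pop_back();
> > >         return node;
> > >       }
> > >
> > >       void Return(LevelSetNode* node) { m_Free.push_back(node); }
> > >
> > >     private:
> > >       void Grow() {
> > >         LevelSetNode* block = new LevelSetNode[m_GrowthSize];
> > >         m_Blocks.push_back(block);
> > >         for (std::size_t i = 0; i < m_GrowthSize; ++i)
> > >           m_Free.push_back(block + i);
> > >       }
> > >
> > >       std::size_t m_GrowthSize;
> > >       std::vector<LevelSetNode*> m_Blocks;  // owned chunks
> > >       std::vector<LevelSetNode*> m_Free;    // nodes available to borrow
> > >     };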
> > >
> > > Questions are these:
> > >
> > > Have other developers come up with different solutions to this same
> > > problem?
> > >
> > > If so, what memory management objects might already exist in ITK (or the
> > > STL) to help?
> > >
> > > Would other developers find this "itkObjectStore" class useful?
> > >
> > > Josh.
> > >
> > >
> > >
> > > ______________________________
> > >  Josh Cates
> > >  School of Computer Science
> > >  University of Utah
> > >  Email: cates@cs.utah.edu
> > >  Phone: (801) 587-7697
> > >  URL:   www.cs.utk.edu/~cates
>
> --
> Simon Warfield, Ph.D. warfield@bwh.harvard.edu Phone:617-732-7090
> http://www.spl.harvard.edu/~warfield           FAX:  617-582-6033
> Assistant Professor of Radiology,          Harvard Medical School
> Director, Computational Radiology Laboratory
> Thorn 329, Dept Radiology,  Brigham and Women's Hospital
> 75 Francis St, Boston, MA, 02115
> _______________________________________________
> Insight-developers mailing list
> Insight-developers@public.kitware.com
> http://public.kitware.com/mailman/listinfo/insight-developers
>