Traditional C++ memory pools are often quite dynamic, in that the block size of each allocation is passed as a function parameter to the implementation, which decides how to allocate memory and configure the memory pools. This article presents techniques to differentiate memory pools of different block sizes with a C++ template parameter, so that it is certain at compile time how many memory pools are used in an application, and which memory pool is used at a specific code position. It is therefore quite static compared to the traditional way. I shall describe here the design and implementation of such a static memory pool, and study its performance and peculiarities.
Take the memory pool implementation in the SGI STL as an example. Its interface looks like:
```cpp
template <bool threads, int inst>
class __default_alloc_template {
public:
    static void* allocate(size_t);
    static void deallocate(void*, size_t);
};
```
The memory pool knows the block size only when the user calls allocate. However, since the exact size of an object is known when the new operation occurs in C++, we can use the template technique to fix the block size at compile time. The simplified interface looks like:
```cpp
template <size_t Sz>
class MemoryPool {
public:
    static void* allocate();
    static void deallocate(void*);
};
```
Using this technique, the C++ compiler separates the memory pools for users, so that memory pools with different block sizes become completely different types. It is more natural, and high run-time performance can be achieved for memory allocation/deallocation operations.
Now I will give the first implementation of memory_pool.h in Listing 1. It is the simplest implementation that conforms to the above interface, and I'll use it as a proof of concept. Of course I need to test it: the test program is in Listing 2. It is designed to run under GCC 2.95.3/3.x, and under STLport. This restriction is only to guarantee that we have an implementation of the SGI allocator to compare with. I ran the test on the following platforms/compilers:
- Linux GCC 3.2.2
- Linux GCC 2.96
- MinGW GCC 3.2.3
- MinGW GCC 2.95.3-8
- MSVC 6.0 with STLport 4.5.1 (built with `-D_STLP_NO_IOSTREAMS`)

Test results (on a Celeron 550 MHz box) are here:
 | new/delete | SGI allocator | MemoryPool |
---|---|---|---|
Linux GCC 3.2.2 | 21.97 | 3.13 | 0.94 |
Linux GCC 2.96 | 21.97 | 1.14 | 0.86 |
MinGW GCC 3.2.3 | 43.172 | 3.695 | 1.101 |
MinGW GCC 2.95.3-8 | 44.534 | 1.071 | 0.911 |
MSVC 6.0/STLport 4.5.1 | 39.186 | 3.194 | 0.651 |
Excepting the case of SGI STL in GCC 2.95/2.96, where my `MemoryPool` template performs similarly to the SGI `single_client_alloc` (which, in my test, is really `__default_alloc_template<false, 0>`), the speed of my `MemoryPool` template is several times faster in all other combinations, using the commonest optimization options (“-O2 -DNDEBUG” in all the above compilers).
So it seems to work well! I am interested enough to investigate deeper.
The basic implementation given above makes the different memory pools completely separate and unknown to each other. There are chances that pools of some specific block sizes grab much unused memory and make allocations of other block sizes fail, since the other pools have no way to tell them to return the unused memory. C++ provides no way to know which types are currently used in a program.
This naturally leads to my solution of implementing the memory pools as singletons, which are described in detail in [Alexandrescu2001]. It is much easier to manipulate objects, in contrast to types, at run time. Therefore, the code is reorganized and, actually, mostly rewritten. Its characteristics are as follows:

- `MemoryPool`s inherit from `MemoryPoolBase`, an abstract class that provides a virtual destructor and a pure virtual `recycle` function;
- a `MemoryPoolSet`, which is also a singleton (based on static construction and destruction for simplicity), records all `MemoryPool`s (via pointers to `MemoryPoolBase`).

The updated memory_pool.h is in Listing 3, and memory_pool.cpp in Listing 4. It is now quite a complete implementation.
One thing to note is that `allocate()` will not throw exceptions, but returns a null pointer when it fails, and `deallocate()` does not accept a null pointer. This is in contrast to the standard behaviour of `new`/`delete`, but is intentional to achieve execution efficiency and maximum flexibility. I shall deal with this topic later on.
Exception safety is one thing often ignored, but not here :-). There are three places where exceptions could occur:
- `m_oMemoryPoolSet` in `MemoryPoolSet::MemoryPoolSet()`;
- `std::set<MemoryPoolBase*>::insert(MemoryPoolBase*)` in `MemoryPoolSet::add(MemoryPoolBase*)`;
- `::operator new(sizeof(MemoryPool<Sz>))` in `MemoryPool<Sz>::createInstance()`.
At all three points there might be memory problems, though the chances are really rare. From a user's point of view, they all occur at the call to `MemoryPool<Sz>::instance()`, so a `try`/`catch` block around `instance()` will be enough. Care is taken that no resource leakage shall happen, as long as the compiler and the standard library are standard-conformant :-).
I originally did not use a memory_pool.cpp, but put everything in the header file to make linking simpler, and I would very much have preferred inline functions. However, tests and disassembly show that inlining can have a negative impact on the performance (mainly because of caches, I suppose), as well as the size, of the resulting binary. Currently I still use many inline functions, but I deliberately un-inline some functions so that the happy path contains as little code as possible, i.e., code that is rarely executed is moved out of the frequently used inline functions.
The tests are also updated; they are in Listing 5 and Listing 6, respectively. The difference between the two tests is that one directly calls `instance()` of `MemoryPool`, while the other calls `instanceKnown()` when it is known that an instance already exists. The latter method does show some performance advantage. I recommend that you, dear reader, try the tests yourself on your target platform.
Up to this point, during the process of making my code execute faster and faster, I realized that this static memory pool has no theoretical performance advantage over a dynamic one: the pool selection in an inline member function of a dynamic memory pool, when given a constant argument, can be done at compile time instead of run time by a decent optimizing compiler. Thus a dynamic memory pool is not necessarily slower than a static memory pool (although my static memory pool implementation is always amongst the fastest). This is why the SGI allocator, in MinGW GCC 2.95.3-8, can be faster than my `MemoryPool`. The cases where it is slower are mainly caused by implementation choices, such as inlining. So the key task for me is not to fall down. :-)
Sadly, this time my test won't compile under MSVC 6. And it is time to consider compiler compatibility. I set the following compatibility requirements for my code:
It was just for compatibility that I chose to use `size_t` instead of `std::size_t`, and at the same time to include `<stddef.h>` instead of `<cstddef>`, for standard conformance. However, this time we encountered a problem that cannot be worked around in a standard way:
```
memory_pool.cpp(40) : error C2248: 'MemoryPoolSet::~MemoryPoolSet' : cannot access private member declared in class 'MemoryPoolSet'
        memory_pool.h(36) : see declaration of 'MemoryPoolSet::~MemoryPoolSet'
```
BCC 5.5.1 will complain too:
```
memory_pool.cpp:
Error E2166 memory_pool.cpp 58: Destructor for 'MemoryPoolSet' is not accessible
```
This is really a bug in the compilers. However, it is relatively easy to work around. I use a special macro `__PRIVATE`, which is defined as “`public`” for these two compilers, and “`private`” for others. Ugly, but it works.
Some compilers, e.g. DMC 8.38, still will not compile the code: the `nothrow` placement `new` may not be supported. To make it work, I have to allow the use of `malloc`/`free` or other memory routines. Three new macros are introduced: `MEM_POOL_ALLOCATE`/`MEM_POOL_DEALLOCATE` to abstract the system memory routines, and `MEM_POOL_USE_MALLOC` to choose between `malloc`/`free` and `::operator new`/`delete`. To reduce namespace pollution, they are only effective in memory_pool.cpp.
Currently the allocated memory is not returned to the system at program exit, since that is not necessary: returning it only slows down the exit operation. However, this will puzzle some memory leak detectors. To stop them from complaining, I let the destructor free all allocated memory when the macro `_DEBUG` is defined. This way, I can also assure myself that my code does not leak memory.
The revised version of memory_pool.h is in Listing 7, and that of memory_pool.cpp in Listing 8. Tests are unchanged. Test results of all tested compilers follow, as the conclusion of this section:
 | SGI allocator | MemoryPool (1) | MemoryPool (2) |
---|---|---|---|
Linux GCC 3.2.2 | 3.15 | 0.86 | 1.1 |
Linux GCC 2.96 | 1.13 | 1.11 | 1.09 |
MinGW GCC 3.2.3 | 3.775 | 1.201 | 1.171 |
MinGW GCC 2.95.3-8 | 0.911 | 1.281 | 1.011 |
MSVC 6.0/STLport 4.5.1 | 3.154 | 1.012 | 0.511 |
BCC 5.5.1/STLport 4.5.3 | 9.714 | 1.202 | 1.012 |
DMC 8.38/STLport 4.5.3 | 7.211 | 2.874 | 1.472 |
(The results under “SGI allocator” are from test1; those of “MemoryPool (1)” from test1; and those of “MemoryPool (2)” from test2.)
One of my original purposes is to use the memory pools transparently, i.e., one may freely use `new Obj` and `delete Obj`, but actually construct and destroy the object in the memory pool. One may think about using a template technique like this:

```cpp
template <class T>
class UseMemPool {
    static void* operator new(…)
    {
        return MemoryPool<sizeof(T)>…
    }
    static void operator delete(…)
    {
        …
    }
};

class Obj : public UseMemPool<Obj> {
    …
};
```
Sadly, there is a fatal problem with inheritance:

```cpp
class Derived : public Obj, public UseMemPool<Derived> {
    …
};
```
When we have a statement “`ptr = new Derived;`”, there is an ambiguity about whether `UseMemPool<Obj>::operator new` or `UseMemPool<Derived>::operator new` should be called. We are at a dead end.
So I have to give up the elegant template technique, and resort to the dear old macro. It is quite straightforward:
```cpp
#define DECLARE_MEMORY_POOL(Cls) \
public: \
    static void* operator new(size_t s) \
    { \
        assert(s == sizeof(Cls)); \
        return MemoryPool<sizeof(Cls)> \
                ::instance().allocate(); \
    } \
    static void operator delete(void* p) \
    { \
        if (p != NULL) \
            MemoryPool<sizeof(Cls)> \
                ::instance().deallocate(p); \
    }

class Obj {
    …
    DECLARE_MEMORY_POOL(Obj)
};
```
Wait, do you see anything wrong?
Yes, exceptions! According to the C++ standard, operator new should throw an exception instead of returning a null pointer on failure, unless it has the `throw()` exception specification. Obviously I am doing neither. Should I just add “`throw()`” after “`operator new(…)`”? It depends. It depends on the chances of memory insufficiency, and on the allowed overhead when memory insufficiency does happen. This ought to be decided by the user, not me. So I defined two different macros, as in Listing 9. It should be noted that I used `instanceKnown()` instead of `instance()` in `DECLARE_MEMORY_POOL__NOTHROW`, because the `instance()` call may throw an exception under rare circumstances, and thus violate the `nothrow` exception specification. So a `PREPARE_MEMORY_POOL` macro is specially added to easily “prepare” the memory pool in a `try`/`catch` block to prevent this problem. One must call `PREPARE_MEMORY_POOL(Obj)` before `new Obj` is really called, where `Obj` is a class that uses my `MemoryPool` via `DECLARE_MEMORY_POOL__NOTHROW`, as shown below:
```cpp
// In class definition
class Obj {
    …
    DECLARE_MEMORY_POOL__NOTHROW(Obj)
};

// During initialization
PREPARE_MEMORY_POOL(Obj);

// Normal usage
Obj* p = new Obj;
…
delete p;
```
The implementation is quite complicated, but the usage is clear and neat, right?
So far, thread safety has not been taken care of. You cannot allocate from a `MemoryPool` in one thread and then deallocate in another, or else anything hellish could happen. One of the worst things about threads is that they are not standardized: no standard APIs can be used across different platforms, so my implementation has to become a little platform-dependent anyway.
I thought about using an existing thread wrapper, say, that of Boost or ACE, but refrained because of the dependency incurred. I would have seriously considered using Loki ([Alexandrescu2001]) if it had supported pthreads, but it didn't. At last, I wrote two wrappers myself: one is a `fast_mutex` class to abstract the thread-locking semantics, the other a Loki-like `class_level_lock` using the `fast_mutex` as its underlying mechanism. My `MemoryPool` only uses `class_level_lock`, whose main divergence from Loki is that the template has an additional template parameter `_RealLock` to boost the performance in non-locking scenarios (esp. when your C++ compiler supports class template partial specialization). I will not go deeper into it here. It just needs to be noted that the MSVC `-MT` or `-MD` option and the GCC `-mthreads` or `-pthread` option will turn multi-threading on; in other cases one might have to manually specify `-D_PTHREADS` or `-D_WIN32THREADS`, or even modify fast_mutex.h to support new platforms, but memory_pool.h need not be modified.
Even if one can transparently use memory pools in a multi-threaded environment, it should be noted that multi-threading defeats part of the purpose of memory pools: mutex locking/unlocking is really time-consuming. Even a non-blocking mutex operation can take longer than the pooled allocation/deallocation operation itself. And it is possible that one needs to allocate/deallocate from a memory pool of one specific size in only one thread, although the application is multi-threaded and operations on memory pools of other sizes need to be protected from simultaneous access: why should one pay for what one does not actually need? For this purpose I shall apply grouping (SGI used a similar technique in its STL implementation). In my implementation of `MemoryPool`, grouping is used as follows:

- memory pools of non-negative group IDs are meant to be used within a single thread, and are never protected by a mutex;
- the memory pool of the default group (`-1`) has mutex protection when multi-threading is triggered on (but not otherwise).
So finally the complete production code of the static memory pool is given, in Listings 10, 11, 12, 13, 14, and 15. It has changed much since I first began to write this article: it has been largely refactored and reorganized, and even the coding style has changed. The final code also has an additional macro `_STATIC_MEM_POOL_DEBUG` for debugging purposes. Try it when you use the code on a new compiler or have doubts about something.
An earlier version of this memory pool had trouble with static object initialization/de-initialization in multi-threaded applications. This was because the initialization order of static objects in different translation units cannot be guaranteed, and it was possible that the compiler had not yet initialized the suitable mutex (a template static data member) when one tried to lock/unlock it. The solution is to ignore all lock/unlock operations when the `fast_mutex` has not been constructed or has already been destroyed.
Test results (of test.cpp in Listing 16) under some compilers follow:
Notes:

1. The libstdc++ shipped with MinGW GCC 3.2/3.3 has serious performance problems regarding mutex operations in `alloc`, because it uses the slower Win32 `MUTEX` instead of `CRITICAL_SECTION`.
2. The `single_client_alloc` test crashes under MinGW GCC 2.95.3-8 when compiled with `-mthreads`.
3. To compile for single-threaded applications in GCC versions earlier than 3.3 with the `-Wall` option, you may want to also specify `-Wno-unused` to get rid of the incorrect warnings.
 | Single-threaded¹ | Multi-threaded² |
---|---|---|
SGI alloc | 5.04 | 33.02 |
SGI single_client_alloc | 3.02 | 2.96 |
static_mem_pool<100,-1> | 0.91 | 30.41 |
static_mem_pool<100,0> | 0.86 | 0.99 |

¹ Compiled with “-O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION”.
² Compiled with “-O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION -pthread”.

 | Single-threaded¹ | Multi-threaded² |
---|---|---|
SGI alloc | 8.301 | 26.998 |
SGI single_client_alloc | 8.242 | 15.392 |
static_mem_pool<100,-1> | 1.152 | 8.413 |
static_mem_pool<100,0> | 1.132 | 1.011 |

¹ Compiled with “-O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION -D_STLP_NOTHREADS”.
² Compiled with “-O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION -mthreads”.

 | Single-threaded¹ | Multi-threaded² |
---|---|---|
SGI alloc | 3.214 | 13.409 |
SGI single_client_alloc | 3.215 | 3.194 |
static_mem_pool<100,-1> | 0.661 | 7.181 |
static_mem_pool<100,0> | 0.661 | 2.233 |

¹ Compiled with “-GX -O2 -DNDEBUG -D_STLP_NO_OWN_IOSTREAMS -D_STLP_NOTHREADS”.
² Compiled with “-GX -O2 -DNDEBUG -D_STLP_NO_OWN_IOSTREAMS -MT”.

 | Single-threaded¹ | Multi-threaded² |
---|---|---|
SGI alloc | 2.223 | 12.938 |
SGI single_client_alloc | 2.203 | 2.834 |
static_mem_pool<100,-1> | 0.621 | 7.261 |
static_mem_pool<100,0> | 0.561 | 1.031 |

¹ Compiled with “-GX -O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION -D_STLP_NO_OWN_IOSTREAMS -D_STLP_NOTHREADS”.
² Compiled with “-GX -O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION -D_STLP_NO_OWN_IOSTREAMS -MT”.
One might notice that GCC 3.4 is not used in the testing. There are several reasons: 1) the SGI `alloc` is not there; 2) STLport 4.6.2 does not work with it; and 3) when the static memory pool and the test were first written, GCC 3.4 had not yet been released. I finally decided to add a test that simulates a real scenario, with and without the static memory pool, and that is able to run under GCC 3.4. The test code is in Listing 17, and the test results are as follows:
 | new/delete | static_mem_pool (default) | static_mem_pool (thread-specific) |
---|---|---|---|
Linux GCC 3.2.2¹ | 5.87 | 0.84 | 0.83 |
Linux GCC 3.4.2¹ | 5.91 | 2.61 | 0.77 |
Linux GCC 3.2.2² | 12.82 | 8.77 | 0.84 |
Linux GCC 3.4.2² | 12.73 | 8.45 | 0.76 |
¹ Compiled with “-O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION”.
² Compiled with “-O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION -pthread”.

I must confess that my original purpose, that a static memory pool could be faster, does not prove valid: a decent dynamic memory pool implementation can be as fast, at least in many cases, because the compiler can work its magic and do the calculations at compile time. In fact, since I introduced the singleton pattern in order to manipulate the pools at run time, the static memory pools can even be slower. I am still a little hesitant about whether this is worthwhile (however, without singletons there would be object lifetime problems, and no safe way to return pooled memory to the system on program exit; see also Chapter 12 of [Carrol&Ellis1995]).
That said, I think the static memory pool does show some unique features.
And the strongest point of a static memory pool is flexibility: it is trivial to specialize the template somewhere in the middle to make it behave in some other way, without changing a byte in the client, and without changing a byte in memory_pool.h! This is the abstraction the C++ language gives us. It is exactly because of the possibility of such abstractions that I love C++ much more than C.
From the engineering point of view, my implementation is quite decent now: it could avoid unnecessary mutex locking in a multi-threaded application, and make allocation/deallocation of objects in the memory pool transparent, and is quite flexible and efficient. However, there are many possible improvements to be done, e.g.:
- record the size of each `static_mem_pool` in the `static_mem_pool_set` and use this information to optimize how to recycle memory;
- change how `static_mem_pool::recycle` works to reduce possible memory fragmentation (indeed, I have rarely had the need to carefully handle out-of-memory conditions in real applications).

Suggestions are always welcome. More recent versions, if released, will be found at the SourceForge project Stones of Nvwa.
So I have presented here the design and implementation of a static memory pool. I hope you will find my description useful, or at least some of the ideas behind it. This is an example of how the template technique can make better abstractions. I love this C++ way.
[Alexandrescu2001] | Andrei Alexandrescu. Modern C++ Design: Generic Programming and Design Patterns Applied. Addison-Wesley, 2001. |
[Carrol&Ellis1995] | Martin Carroll, Margaret Ellis. Designing and Coding Reusable C++. Addison-Wesley, 1995. |
2004-3-10, first draft by Wu Yongwei
2005-1-11, revised and put online by Wu Yongwei
2005-11-24, last content update by Wu Yongwei
2017-12-26, last typographical change (besides style sheet) by Wu Yongwei
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 Licence.