Design and Implementation of a Static Memory Pool

Traditional C++ memory pools are often quite dynamic, in that the block size of each allocation is passed as a function parameter to the implementation, which decides how to allocate memory and configure the memory pools. This article will present techniques how to differentiate memory pools of different block sizes with a C++ template parameter, and thus make it certain how many memory pools are used in an application and which memory pool is used in a specific code position at compile time. Therefore, it is quite static compared to the traditional way. I shall describe here the design and implementation of such a static memory pool and study its performances and peculiarities.

Principle

Take the memory pool implementation in SGI STL for example. Its interface is like

template <bool threads, int inst>
class __default_alloc_template {
public:
  static void* allocate(size_t);
  static void deallocate(void*, size_t);
};

The memory pool knows the block size only when the user calls allocate. However, since the very size of an object is known when the new operation occurs in C++, we can use the template technique to make the block size fixed at compile time. The simplified interface is like

template <size_t Sz>
class MemoryPool {
public:
  static void* allocate();
  static void deallocate(void*);
};

Using this technique, the C++ compiler will separate the memory pools for users so that memory pools with different memory block sizes become completely different types. It is more natural and high run-time performance can be achieved for memory allocation/deallocation operations.

Basic implementation

Now I will give the first implementation of memory_pool.h in Listing 1. It is the simplest implementation that conforms to the above interface. I'll use it as a Proof of Concept. Of course I need to test it. The test program is in Listing 2. It is designed to run under GCC 2.95.3/3.x, and under STLport. This restriction is only to gurantee that we have an implementation of the SGI allocator to compare with. For me, I ran the test on the following platforms/compilers:

Test results (on a Celeron 550 MHz box) are here:

Table 1: Proof-of-Concept Test
  new/delete SGI allocator MemoryPool
Linux GCC 3.2.2 21.97 3.13 0.94
Linux GCC 2.96 21.97 1.14 0.86
MinGW GCC 3.2.3 43.172 3.695 1.101
MinGW GCC 2.95.3-8 44.534 1.071 0.911
MSVC 6.0/STLport 4.5.1 39.186 3.194 0.651

Excepting the case of SGI STL in GCC 2.95/2.96, where my MemoryPool template performs similarly to the SGI single_client_alloc (which, in my test, is really __default_alloc_template<false, 0>), the speed of my MemoryPool template is several times faster in all other combinations, using the commonest optimization option ("-O2 -DNDEBUG" in all the above compilers).

So it seems to work well! I am interested enough to investigate deeper.

Towards completeness

The basic implementation given above makes the different memory pools complete separate and unknown to each other. There are chances that some pools of specific block sizes grab much unused memory, and make allocation of other block sizes fail since the other pools know no way to tell them to return unused memory. C++ provides no way to know which types are currently used in a program.

This naturally leads to my solution to implement the memory pools as singletons, which are described in detail in [Alexandrescu2001]. It is much easier to manipulate objects, as contrast to types, at run time. Therefore, the code is reorganized and, actually, mostly rewritten. Its characteristics are listed as follows:

The updated memory_pool.h is in Listing 3, and memory_pool.cpp in Listing 4. It is quite complete an implementation now.

One thing to note is that allocate() will not throw exceptions, but return a null pointer when it fails, and deallocate() does not accept a null pointer. This is in contrast to the standard behaviour of new/delete, but is intentional to achieve execution efficiency and maximum flexibility. I shall deal with this topic later on.

Exception safety is one thing often ignored, but not here :-). There are three places where exceptions could occur:

  1. initialization of m_oMemoryPoolSet in MemoryPoolSet::MemoryPoolSet();
  2. std::set<MemoryPoolBase*>::insert(MemoryPoolBase*) in MemoryPoolSet::add(MemoryPoolBase*);
  3. the implicit call to ::operator new(sizeof(MemoryPool<Sz>)) in MemoryPool<Sz>::createInstance().

On all three points there might be memory problems, though the chances are really rare. On a user's point of view, they all occur at the call to MemoryPool<Sz>::instance(), so a try/catch block around instance() will be enough. Care is taken that no resource leakage shall happen, as long as the compiler and the standard library are standard-conformant :-).

I used not to use a memory_pool.cpp, but put all stuff in the header file to make linking simpler. And I would very much prefer inline functions. However, test and disassembly show that inlining can have a negative impact to performance (mainly because of caches, I suppose), as well as size, of the resulting binary. Currently I still use many inline functions, but I deliberately un-inline some functions so that the happy path contains as little code as possible, i.e., un-inline code that are rarely called in the frequently used inline functions.

Tests are also updated. Two tests are in Listing 5 and Listing 6, separately. The difference between the two tests is that one directly calls instance() of MemoryPool, while the other calls instanceKnown() when it is known that an instance already exists. The latter method does show some performance advantage. I recommend that you, dear reader, try the tests yourself, on your target platform.

Up to this point, during the process of making my code execute faster and faster, I realized that this static memory pool has no theoretical performance advantage over a dynamic one, since the pool selection in an inline member function of a dynamic memory pool, when given a constant argument, can be done at compile time instead of run time, with a decent optimizing compiler. Thus a dynamic memory pool is not necessarily faster than a static memory pool (although my static memory pool implementation is always amongst the fastest). This is why the SGI allocator, in MinGW GCC 2.95.3-8, can be faster than my MemoryPool. The reasons why it is slower in other combinations are mainly caused by implementation choices, such as inlining. So the key task for me is not to fall down. :-)

Implementation issues

Sadly this time my test won't compile under MSVC 6. And it is time to consider compiler compatibilities. I set the following compatibility requirements for my code:

It was just for compatibility that I chose to use size_t instead of std::size_t, and at the same time to include <stddef.h> instead of <cstddef> for standard conformance. However, this time we encountered a problem that cannot be worked around in a standard way:

memory_pool.cpp(40) : error C2248: 'MemoryPoolSet::~MemoryPoolSet' : cannot access private member declared in class 'MemoryPoolSet'
        memory_pool.h(36) : see declaration of 'MemoryPoolSet::~MemoryPoolSet'

BCC 5.5.1 will complain too:

memory_pool.cpp:
Error E2166 memory_pool.cpp 58: Destructor for 'MemoryPoolSet' is not accessible

It is really a bug of the compilers. However, they are relatively easy to work around. I use a special macro __PRIVATE, which is defined as "public" for these two compilers, and "private" for others. Ugly but works.

Some compilers, e.g. DMC 8.38, still will not compile: the nothrow placement new may not be supported. To make it work, I have to allow the use of malloc/free or other memory routines. Three new macros are introduced: MEM_POOL_ALLOCATE/MEM_POOL_DEALLOCATE to abstract the system memory routines, and MEM_POOL_USE_MALLOC to choose between malloc/free and ::operator new/delete. To reduce name space pollution, they are only effective in memory_pool.cpp.

Currently the allocated memory is not returned to the system at program exit, since it is not necessary. Returning it only slows down the exit operation. However, this will puzzle some memory leakage detectors. To stop them from complaining, I let the destructor free all allocated memory when the macro _DEBUG is defined. This way, I can also assure myself that my code does not leak memory.

The revised version of memory_pool.h is in Listing 7, and that of memory_pool.cpp in Listing 8. Tests are unchanged. Test results of all tested compilers follow, as the conclusion of this section:

Table 2: Test Results on Various Compilers
  SGI allocator MemoryPool (1) MemoryPool (2)
Linux GCC 3.2.2 3.15 0.86 1.1
Linux GCC 2.96 1.13 1.11 1.09
MinGW GCC 3.2.3 3.775 1.201 1.171
MinGW GCC 2.95.3-8 0.911 1.281 1.011
MSVC 6.0/STLport 4.5.1 3.154 1.012 0.511
BCC 5.5.1/STLport 4.5.3 9.714 1.202 1.012
DMC 8.38/STLport 4.5.3 7.211 2.874 1.472

(The results under "SGI allocator" are from test1; those of "MemoryPool (1)" from test1; and those of "MemoryPool (2)" from test2.)

Dynamic object manipulation

One of my original purposes is to use the memory pools transparently, i.e., one may freely use "new Obj" and "delete Obj", but actually construct and destroy the object in the memory pool. One may think about using a template technique like this:

template <class T> class UseMemPool
{
  static void* operator new(...)
  {
    return MemoryPool<sizeof(T)>...
  }
  static void operator delete(...)
  {
    ...
  }
};

class Obj : public UseMemPool<Obj>
{
  ...
}

Sadly there is a fatal problem about inheritance:

class Derived : public Obj
              , public UseMemPool<Derived>
{
  ...
};

When we have a statement "ptr = new Derived;", there is an ambiguity about whether UseMemPool<Obj>::operator new or UseMemPool<Derived>::operator new should be called. We are at a dead end.

So I have to give up the elegant template technique, and resort to the old dear macro. It is quite straightforward, like:

#define DECLARE_MEMORY_POOL(Cls) \
public: \
  static void* operator new(size_t s) \
  { \
    assert(s == sizeof(Cls)); \
    return MemoryPool<sizeof(Cls)> \
           ::instance().allocate(); \
  } \
  static void operator delete(void* p) \
  { \
    if (p != NULL) \
      MemoryPool<sizeof(Cls)> \
      ::instance().deallocate(p); \
  }

class Obj {
  ...
  DECLARE_MEMORY_POOL(Obj)
};

Wait, do you see anything wrong?

Yes, exceptions! According to the C++ standard, operator new should throw an exception instead of returning a null pointer on failure, unless it has the "throw()" exception specification. Obviously I am not doing either. Should I just add "throw()" after "operator new(...)"? It depends. It depends on the chances of memory insufficiency, and on the allowed overhead when memory insufficiency does happen. This ought to be decided by the user, not me. So I defined two different macros, as in Listing 9. It should be noted that I used instanceKnown() instead of instance() in DECLARE_MEMORY_POOL__NOTHROW, because the instance() call may throw an exception under rare circumstances, and violates the nothrow exception specification. So a PREPARE_MEMORY_POOL macro is specially added to easily "prepare" the memory pool in a try/catch block to prevent this problem. One must call PREPARE_MEMORY_POOL(Obj) before "new Obj" is really called, where Obj is a class that uses my MemoryPool via DECLARE_MEMORY_POOL__NOTHROW, as shown below:

// In class definition
class Obj {
  ...
  DECLARE_MEMORY_POOL__NOTHROW(Obj)
};

// During initialization
PREPARE_MEMORY_POOL(Obj);

// Normal usage
Obj* p = new Obj;
...
delete p;

The implementation is quite complicated, but the usage is clear and neat, right?

Thread safety

So far, thread safety has not been taken care of. You cannot allocate from a MemoryPool in one thread and then deallocate in another, or else anything hellish could happen. One of the worst things about threads is that it is not standardized. No standard APIs could be used across different platforms, so my implementation has to become a little platform-dependent anyway.

I thought about using an existing thread wrapper, say, that of Boost or ACE, but refrained because of the dependency incurred. I would have seriously considered using Loki ([Alexandrescu2001]) if it had supported pthreads, but it didn't. At last, I wrote two wrappers myself: one was a fast_mutex class to abstract the thread-locking semantics, the other a Loki-like class_level_lock using the fast_mutex as its underlying mechanism. My MemoryPool only used class_level_lock, whose main divergence from Loki was that the template had an additional template parameter _RealLock to boost the performance in non-locking scenarios (esp. when your C++ compiler supports class template partial specialization). I will not go deeper about it here. It just needs to be noted that the MSVC -MT or -MD option and the GCC -mthreads or -pthread option will trigger multi-threading on, and in other cases one might have to manually specify -D_PTHREADS or -D_WIN32THREADS, or even modify fast_mutex.h to support new platforms, but memory_pool.h need not be modified.

Even if one can transparently use memory pools in a multi-threaded environment, it should be noted that multi-threading defeats the purpose of memory pools: mutex locking/unlocking are really time-consuming. A non-blocking mutex operation call can still take longer than the pooled allocation/deallocation operation. And it is possible that one needs to allocate/deallocate from a memory pool of one specific size only in one thread, although the application is multi-threaded and operations on memory pools of other sizes need to be protected from simultaneous access: why should one pay for what one does not actually need? —For this purpose I shall apply grouping (SGI used a similar technique in its STL implementation). In my implementation of MemoryPool grouping is used as follows:

So finally the complete production code of the static memory pool is given, in Listings 10, 11, 12, 13, 14, and 15. It has changed much since I first began to write this article, largely refactored and reorganized, and even the coding style changed. The final code has also an additional macro _STATIC_MEM_POOL_DEBUG for debugging purposes. Try it when you use it on a new compiler or have doubts about something.

An earlier version of this memory pool had trouble with static object initialization/de-initialization in multi-threaded applications. This was because the initialization order of static objects in different translation units cannot be guaranteed, and it was possible that the compiler had not yet initialized the suitable mutex (as a template static data member) when one tried to lock/unlock it. The solution is ignore all lock/unlock operations when the fast_mutex has not been constructed or has already been destroyed.

Test results (of test.cpp in Listing 16) under some compilers follow:

Notes: 1) The libstdc++ shipped with MinGW GCC 3.2/3.3 has serious performance problems regarding mutex operations in alloc because it uses the slower Win32 MUTEX instead of CRITICAL_SECTION. 2) The single_client_alloc test crashes under MinGW GCC 2.95.3-8 when compiled with -mthreads. 3) To compile for single-threaded applications in GCC versions earlier than 3.3 with the -Wall option, you may want to specify also -Wno-unused to get rid of the incorrect warnings.
Table 3: Test on Linux GCC 3.2.2
  Single-threaded1 Multi-threaded2
SGI alloc 5.04 33.02
SGI single_client_alloc 3.02 2.96
static_mem_pool<100,-1> 0.91 30.41
static_mem_pool<100,0> 0.86 0.99
1 Compiled with "-O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION".
2 Compiled with "-O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION -pthread".

Table 4: Test on MinGW GCC 3.2.3/STLport 4.6.2
  Single-threaded1 Multi-threaded2
SGI alloc 8.301 26.998
SGI single_client_alloc 8.242 15.392
static_mem_pool<100,-1> 1.152 8.413
static_mem_pool<100,0> 1.132 1.011
1 Compiled with "-O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION -D_STLP_NOTHREADS".
2 Compiled with "-O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION -mthreads".

Table 5: Test on MSVC 6.0/STLport 4.6.2
  Single-threaded1 Multi-threaded2
SGI alloc 3.214 13.409
SGI single_client_alloc 3.215 3.194
static_mem_pool<100,-1> 0.661 7.181
static_mem_pool<100,0> 0.661 2.233
1 Compiled with "-GX -O2 -DNDEBUG -D_STLP_NO_OWN_IOSTREAMS -D_STLP_NOTHREADS".
2 Compiled with "-GX -O2 -DNDEBUG -D_STLP_NO_OWN_IOSTREAMS -MT".

Test 6: Test on MSVC 7.1/STLport 4.6.2
  Single-threaded1 Multi-threaded2
SGI alloc 2.223 12.938
SGI single_client_alloc 2.203 2.834
static_mem_pool<100,-1> 0.621 7.261
static_mem_pool<100,0> 0.561 1.031
1 Compiled with "-GX -O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION -D_STLP_NO_OWN_IOSTREAMS -D_STLP_NOTHREADS".
2 Compiled with "-GX -O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION -D_STLP_NO_OWN_IOSTREAMS -MT".

One might notice that GCC 3.4 is not used in testing. There are several reasons: 1) the SGI alloc is not there; 2) STLport 4.6.2 does not work with it; and 3) when the static memory pool and the test were first written, GCC 3.4 had not yet been released. I finally decided to add a test to simulate a real scenario with and without the static memory pool that is able to run under GCC 3.4. The test code is in Listing 17, and the test results are as follows:

Table 7: Real Scenario Test
  new/delete static_mem_pool (default) static_mem_pool (thread-specific)
Linux GCC 3.2.21 5.87 0.84 0.83
Linux GCC 3.4.21 5.91 2.61 0.77
Linux GCC 3.2.22 12.82 8.77 0.84
Linux GCC 3.4.22 12.73 8.45 0.76
1 Compiled with "-O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION".
2 Compiled with "-O2 -DNDEBUG -DHAS_CLASS_TEMPLATE_PARTIAL_SPECIALIZATION -pthread".

Conclusions

I must confess that my original purpose that a static memory pool could be faster does not prove valid: a decent dynamic memory pool implementation can be as fast, at least in many cases, because the compiler can played the magic and make calculations at compile time. In fact, since I introduced the singleton pattern in order to manipulate the pools at run time, the static memory pools can even be slower. I am still a little hesitant whether this is worth while (however, without singletons there will be object lifetime problems and will be no safe way to return pooled memory to the system on program exit; see also Chapter 12 of [Carrol&Ellis1995]).

That said, I think the static memory pool does show some unique features:

And the strongest point of a static memory pool is flexiblity: it is trivial to specialize the template somewhere in the middle to make it behave in some other ways without changing a byte in the client, and without changing a byte in memory_pool.h! This is the abstraction the C++ language gives us. It is exactly because of the possibility of such abstractions that I love C++ much better than C.

From the engineering point of view, my implementation is quite decent now: it could avoid unnecessary mutex locking in a multi-threaded application, and make allocation/deallocation of objects in the memory pool transparent, and is quite flexible and efficient. However, there are many possible improvements to be done, e.g.:

Suggestions are always welcome. More recent versions, if released, will be found at the SourceForge project Stones of Nvwa.

So I have presented here the design and implementation of a static memory pool. I hope you will find my description useful, or at least some ideas behind it are. This is an example how template technique can make better abstractions. I love this C++ way.

Bibliography

[Alexandrescu2001]  Andrei Alexandrescu. Modern C++ Design: Generic Programming and Design Patterns Applied. Addison-Wesley, 2001.
[Carrol&Ellis1995]  Martin Carrol, Margaret Ellis. Designing and Coding Reusable C++. Addison-Wesley, 1995.

2004-3-10, first draft by Wu Yongwei
2005-1-11, revised and put online by Wu Yongwei
2005-11-24, last updated by Wu Yongwei


This work is licensed under a Creative Commons Attribution-Share Alike 2.5 Licence.

Return to Main