
Now Where Did I Allocate That?!

Welcome to my very first blog post; hopefully you enjoy it and find it useful.

Figuring out where a memory allocation was made doesn’t have to be difficult, and with a little effort, it isn’t.  Most game studios whose code I’ve worked with have their own memory tracking and memory allocators, usually based on Doug Lea’s work (http://g.oswego.edu/dl/html/malloc.html), each with some sort of allocation header, footer, or synchronized allocation data structure associated with every allocation.  The memory tracking itself is beyond the scope of this article, but is assumed to exist insofar as it notes whether memory is allocated or free (maybe a future article if there is interest).  An allocation structure typically contains, at a minimum, the size of the allocation stored as a 32-bit unsigned integer, followed by the requested memory allocation.

 To make life easier, expand the allocation structure to include another 32-bit unsigned integer and a const character pointer to hold the line number and filename where the allocation was made.  If you do not already have an allocation structure, here’s one that covers what we have so far:

typedef struct TAllocationHeader
{
            const char* filename;
            unsigned int line;
            unsigned int size;
} TALLOCATION_HEADER;

Next, take a look at the function signatures for malloc, calloc, realloc, memalign, new, free and delete.  Each method will need to be wrapped so the allocation structure can be attached to the allocated memory and filled in with the filename and line number information.  There are other methods, and other variations of the ones listed here, but extending the methodology is left to the reader.

void* malloc(size_t n)
void* calloc(size_t n_elements, size_t element_size)
void* realloc(void* p, size_t n)
void* memalign(size_t alignment, size_t n)
void* operator new(size_t size)
void* operator new(size_t size, void* placement)
void* operator new[](size_t size)
void* operator new[](size_t size, void* placement)
void free(void* p)
void operator delete(void* p)
void operator delete(void* p, void* placement)
void operator delete[](void* p)
void operator delete[](void* p, void* placement)

To wrap the methods, create new names for the C methods and overload the operators for the C++ methods, for example:

void* tracked_malloc(size_t n, const char* filename, unsigned int line)
void* operator new(size_t, const char* filename, unsigned int line)

(Similarly add the “const char* filename, unsigned int line” to all the other method parameter lists)

Something I’ve found useful, and some people will probably boo and hiss about it, is to use preprocessor macros to make removing the filename and line easy during production builds.

#if DEBUG
#define TRACKING_INFO    , const char* filename, unsigned int line
#else
#define TRACKING_INFO
#endif

Then the method signatures end up getting declared like so:

void* tracked_malloc(size_t n    TRACKING_INFO)
void* operator new (size_t n    TRACKING_INFO)

Notice there is no comma after the “size_t n”, because the comma is already included in the DEBUG version of TRACKING_INFO, and TRACKING_INFO is completely blank in the non-DEBUG version.

In the implementation of each allocation method, increase the requested size by the size of the allocation structure.  Allocate the memory at the increased size.  Cast the allocated memory pointer to an allocation structure pointer.  Copy the filename parameter to the allocation structure’s filename pointer member.  Copy the line parameter into the allocation structure’s line member.  Copy the increased size into the allocation structure’s size member.  Move the pointer forward by the size of the allocation structure, and return the new address.

For the de-allocation methods, move the pointer back to the beginning of the allocation structure and de-allocate.
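
To make those two steps concrete, here is a minimal sketch of a tracked allocation and de-allocation pair.  It assumes the wrappers simply sit on top of the standard malloc and free; your own heap or tracking layer would slot in the same way, and error handling is kept to a minimum.

void* tracked_malloc(size_t n, const char* filename, unsigned int line)
{
      // Allocate room for the allocation structure plus the requested memory.
      TALLOCATION_HEADER* pHeader = (TALLOCATION_HEADER*)malloc(sizeof(TALLOCATION_HEADER) + n);
      if (pHeader == NULL)
            return NULL;

      // __FILE__ expands to a string literal, so storing the pointer is safe.
      pHeader->filename = filename;
      pHeader->line = line;
      pHeader->size = (unsigned int)(sizeof(TALLOCATION_HEADER) + n);

      // Move past the allocation structure and return the caller's memory.
      return (void*)(pHeader + 1);
}

void tracked_free(void* p)
{
      if (p == NULL)
            return;

      // Move back to the beginning of the allocation structure and de-allocate.
      free((void*)((TALLOCATION_HEADER*)p - 1));
}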

After the allocation and de-allocation methods are in place, switch all the standard method calls to the tracked methods.  Where filename and line are to be passed, use the __FILE__ and __LINE__ preprocessor macros defined by C and C++.
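
For example, a call site could look like the following.  The MALLOC convenience macro is only one possible way to avoid editing every call site by hand; it is an assumption for illustration, not part of the original code.

// Passing the location explicitly at the call site.
int* pValues = (int*)tracked_malloc(100 * sizeof(int), __FILE__, __LINE__);

// Or, hypothetically, hide the extra parameters behind a macro so existing calls stay readable.
#define MALLOC( n ) tracked_malloc( (n), __FILE__, __LINE__ )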

Now any time you find an allocation that isn’t free when you think it should be, just move the pointer back by the size of the allocation structure, cast the pointer to an allocation structure and read where it was allocated.

Additional Notes:

“But the free and delete methods didn’t use the new parameters?”

Thank you for noticing.  These parameters can also be used to track allocations and de-allocations if you implement a reporting method and insert a call to it in each allocation and de-allocation method.

void report_mem(const char* action, void* p)

When implementing your report_mem method, the allocation information can be obtained by moving the pointer to the beginning of the allocation structure, and then fed to whatever tracking implementation you choose.  For example, the information could be sent over a network socket to a monitoring application like MemAnalyze has done (http://www.gamasutra.com/view/feature/1430/monitoring_your_pcs_memory_usage_.php).
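
As a rough sketch of such a reporting method, assuming the allocation structure from earlier and using printf (from <stdio.h>) as a stand-in for whatever tracking back-end you choose:

void report_mem(const char* action, void* p)
{
      // The allocation structure sits immediately before the address handed to the caller.
      TALLOCATION_HEADER* pHeader = (TALLOCATION_HEADER*)p - 1;

      // Feed the information to the tracking implementation; printf is only a placeholder.
      printf("%s: %u bytes at %p, allocated at %s:%u\n",
             action, pHeader->size, p, pHeader->filename, pHeader->line);
}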

More Preprocessor Lovin’

Another way to ease use via the preprocessor is to have it remove the __FILE__ and __LINE__ information in production builds.  For example:

#if DEBUG
#define TRACKING_PARAMS    , __FILE__, __LINE__
#else
#define TRACKING_PARAMS
#endif

Then the calls to your tracked methods would look like so:

void* p = tracked_malloc(n    TRACKING_PARAMS);

If your compiler supports the C99 standard, you can also use the __func__ predefined identifier (or the common __FUNCTION__ compiler extension) in your allocation header, but be careful about taking up too much memory or causing alignment issues.

Extra Credit

If you choose to add any additional information to the allocation structure, keep in mind that there may potentially be many thousands of allocations at any one time in a game, so don’t go crazy.  Consider that memory may need to be aligned, and by making the allocation structure a size not evenly divisible by 4 or 8 bytes, or whatever the needed alignment is, there will end up being wasted space to make the alignment possible, so you might as well use that space.
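
For example, on a platform that wants 16-byte alignment, the three members so far total 12 bytes on a 32-bit system, so one extra 4-byte member fits into space that would otherwise become padding anyway.  The frame member below is purely illustrative, not part of the original structure.

typedef struct TAllocationHeader
{
            const char* filename;    // 4 bytes on a 32-bit platform
            unsigned int line;       // 4 bytes
            unsigned int size;       // 4 bytes
            unsigned int frame;      // 4 bytes - e.g. the frame number when the allocation was made
} TALLOCATION_HEADER;                // 16 bytes total, so 16-byte aligned allocations stay aligned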

For information about replacing memory allocation methods in a multithreaded environment, in a non-blocking manner, there is the following site: http://www.nedprod.com/programs/portable/nedmalloc/

Thank you.

Thanks to codemonkey_uk for pointing out my initial implementation using a character array in the allocation structure could be improved by using a pointer instead; this benefits the allocation structure in relation to alignment and also significantly reduces memory usage for tracking.

Thanks to Julien Koenen for presenting a great strategy in the comments below for getting an entire callstack without changing the existing function signatures.  Also Julien’s strategy makes it easier to integrate into other libraries.  Some caveats are that you must have enough memory, outside the allocation structure, to store the callstacks, and gathering the callstack information is fairly platform specific.


Social Media Games On the Mind

The game industry’s move towards social media games is something that has been on my mind lately, so for today’s blog I’m going to share my opinions, and would like to hear back on your opinions.  Some examples are EA, Sony and Microsoft’s moves to embrace social media games (http://bit.ly/iaGL77, http://bit.ly/i0nLIV, http://on.mash.to/bCqgov).  The quality of social media games, and why companies might forego traditional games to make social media games instead, has also led to a bit of discussion (http://bit.ly/hjVx8H, http://bit.ly/gNGUcd).

There are probably plenty of other reasons companies are focusing on social media games, but to me it seems like it is mostly an easy business decision for them.  Social media outlets have hundreds of millions of users (e.g. Facebook has 500 million plus users http://on.fb.me/12oAN).  One of the most successful traditional on-line games, World of Warcraft, boasts 12 million plus players (http://bit.ly/dl38rl) and cost in excess of $100 million to make (http://on.wsj.com/doruRT).  Farmville, at about the same time as Blizzard’s 12 million player announcement, dropped 25% to somewhere below 60 million players (http://bit.ly/9zbNvs), with estimates for development costs somewhere between $200K and $500K (Zynga is still private, so they aren’t forthcoming with real numbers).  So with slightly less than 12% of Facebook’s user base playing Zynga’s game, hundreds of millions more potential players, and development costs for a social media hit estimated at a fraction of the cost of a successful traditional on-line game, the numbers look very attractive to a company deciding whether to make a social media game their next game or go the traditional route.  Once the decision is made to make a social media game, the method to make it profitable is developed.

The monetization of social media games can, in my opinion, reduce the quality of the games.  It seems obvious to me that, at its heart, the games industry is a business that must make a profit to stay in business, so it’s understood that considerations must be made to monetize the game.  The method chosen to drive profits has the potential to encourage shortened or poor design, and the most likely method to cause this, in my opinion, is pay to progress.  A social media game that allows a player to pay to improve their skill or progress gives the designer a crutch: if the challenge or skill they are working on balancing is too hard for a player, the player can just pay to get past it, and this encourages the designer to quit balancing too soon.  Companies looking to increase the bottom line have an incentive to skew the game balance to encourage more people to pay to progress, rather than focusing on making the game enjoyable; this just smacks of the fox guarding the hen-house.  Now to be fair, there is also a point at which people will stop playing if the game is no longer fun, so some attention must always be given to making the game enjoyable, even when the focus is on the bottom line.

The incentive to play is also reduced for skill-based players in social media games that share achievements with friends and communities while allowing pay to progress.  Putting a little asterisk next to those who pay to progress might seem fair, but that removes some incentive to pay, and pay is what the monetization motive is all about for a company, so business-wise the asterisk is a counterproductive decision.  Without a way to distinguish players who progress by skill, those who do progress by skill have had their accomplishments lessened, and the luster of their rewards has been tarnished, because the reputation appears to be the same for those who just paid to progress.  Thus the skill players start to lose their incentive to continue playing.  Does wanting a level playing field for the accomplishments make the skill players snobbish or elitist… maybe, but then if the achievement is worth having, shouldn’t it be earned?

 Not all social media games will use the pay to progress method, and not all companies who allow it will be so motivated to increase the bottom line at the expense of the quality of the game, but this one design illustrates to me that there are new pitfalls to be avoided when making a social media game.

Ready, Set, Allocate! (Part 1)


Allocating dynamic memory can be a major slowdown in any game.  A memory heap system that replaces the built-in allocation system can help to overcome many slowdown situations.  Something I’ve used and seen used is to allocate as much memory as possible into one heap, and then rewrite malloc and free to use that heap; this requires tracking and a bit of memory, but the speed increase has proven well worth the effort.  Thanks to Lee Marshall, formerly of Locomotive Games, for leading me back to memory management concepts and pointing me towards Doug Lea’s work.

 Over the next few posts I’ll present the method I used, as well as the metrics to show the speed increase.  All my tests are run on an old laptop running Windows XP Pro Service Pack 3, with a Pentium-M 1.6GHz processor, 400MHz FSB, and 512MB of RAM.

 The first issue to be tackled is to set up the tracking.  Following the tracking will be a breakdown of malloc, and then free.  Afterwards, a simple high performance timer, tests and metrics will be discussed.

 In my code I use an array of tracking units, each element is an unsigned 32-bit integer with each bit representing a page of tracked memory.  The size of the array is determined by the amount of memory to be tracked and the page size.

 First, the amount of memory to track must be chosen.  The current console systems range up to 512MB of memory, but most of my work was done on the PSP, which only has 32MB, so 32MB is the size I chose to test with, even though the PSP only really allows use of about 20MB after the OS and volatile memory.

32 * 1024 * 1024 = 33554432 Bytes

Second, the page size for each allocation of memory to track must be chosen.  This will be the minimum allocation size, so a balance must be struck: the page must not be so small that large allocations, such as files, need so many pages that tracking becomes a heavy burden, and not so large that a reasonable number of small allocations wastes too much space.  In my code, the page size should also be a power of 2, for speed considerations to be explained later.  I’ve chosen 4096 Bytes.

 Since each page is represented by a bit in a tracking unit:

number of pages per tracking unit = 4 Bytes per tracking unit * 8 bits (or pages) per Byte = 32 pages per tracking unit

number of pages required to track the total memory = 33554432 Bytes / 4096 Bytes per page = 8192 pages

number of tracking units = 8192 pages / 32 pages per tracking unit = 256 tracking units

 So at a cost of 256 tracking units, at 4 Bytes each, it will take 1KB to track 32MB with a 4KB page size.  Using a similar setup, 512MB would require 16KB to track, and 2GB would require 64KB to track.  Pretty sweet, eh?

To aid in tracking a few constants are setup:

#define _MB( size ) ( ( size ) * 1024 * 1024 )
// Page Tracking
#define TRACKING_UNIT   u32
const TRACKING_UNIT kMemTrackingUnitAllPagesInUse = 0xFFFFFFFF; // All of a Tracking Unit's Pages in use bit mask
const u32 kMemInUse = 0x01; // Page is in use
const u32 kMemNumPagesPerUnit = ( sizeof( TRACKING_UNIT ) * 8 /* Bits per byte */ ); // Number of pages tracked per tracking unit
const u32 kMemPageSize = 4096; // Size of each page to track
const u32 kMemTotalTracked = _MB( 32 ); // Total amount of memory to track
const u32 kMemNumPages = kMemTotalTracked / kMemPageSize + ( (kMemTotalTracked % kMemPageSize)? 1 : 0 ); // Number of pages required to track memory
const u32 kMemNumTrackingUnits = kMemNumPages / kMemNumPagesPerUnit + ( (kMemNumPages % kMemNumPagesPerUnit)? 1 : 0 ); // Number of units required to track all pages

And finally the tracking array and memory to be tracked are allocated:

TRACKING_UNIT aMemPageTrackingBits[ kMemNumTrackingUnits ]; // Array of tracking bits
const void* kpMemory = malloc( kMemTotalTracked ); // This is it, the Memory

 Next time I’ll start covering malloc, which maintains and uses the tracking array.  At this point I’m thinking it’s looking a bit long for one post.  As for Mike’s challenge to “Show your Ignorance!” this entire series of posts is open season on me. ;)

Ready, Set, Allocate! (Part 2)


In this part I will begin to discuss the method and assumptions used to rewrite malloc.

 First comes the huge non-secret secret: aligned memory is easier to allocate than unaligned memory, because the addresses work nicely within this system.  Since it can be assumed that allocations will always be made along the alignment boundary, there is no requirement to divide and track sub-alignment size blocks, which saves time in operations and space in tracking.  The downside is that the allocations potentially waste space and fragment memory.  With some careful planning and usage of design patterns such as the Object Pool, the wasted space and fragmentation can be minimized.

One of the dirty little secrets about the tracking system mentioned in Part 1 is it only tracks whether a page of memory is in use or not, which doesn’t inform the free method about how much memory was previously allocated when malloc was called.  So all malloc calls should mark the memory being returned with some sort of header to denote how much memory was allocated, so the corresponding free call will have the information it needs to release the memory and update the tracking information.  In my code I use a simple header that tracks the size and the alignment.

typedef struct _TAllocationHeader
{
      size_t      uSize;
      u32         uAlignment;
} TALLOCATION_HEADER;

I should mention a few typedefs I use: 

u8  - unsigned char
u16 - unsigned short
u32 - unsigned integer
l64 - long long

 Now, technically, a size_t, or 32-bit unsigned integer, for the uSize member is overkill for a 32MB memory heap (2^25 would suffice), but trying to go smaller isn’t possible with the current data types; the next data type down, a u16 only goes to 65,535.  Even if a data type existed between u16 and u32, it would remove the generalization to be able to easily switch to 512MB or larger.  The 32-bit unsigned integer can denote up to a 4GB allocation, which hopefully should be more than will be needed in one allocation; if not, adjust accordingly for your needs.

The u32, 32-bit unsigned integer, for the Alignment is also overkill, but instead of trying to conserve space using a u8, 8-bit unsigned char, the u32 is chosen because it keeps the address at the end of the header aligned within the 32-bit system.  Something to keep in mind when modifying the allocation header to suit your needs is that the size of the header should be a multiple of the alignment size (e.g. in this system, on a 32-bit platform, headers should be a multiple of 4-bytes (32-bits = 4 bytes)), this provides the benefit that memory addresses returned will be useable with operations that require memory alignment (such as those used in SIMD).
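
One way to guard that multiple-of-the-alignment invariant at compile time is the negative-array-size trick shown below.  This is only a sketch; kMemDefaultAlignment is an assumed constant for the platform’s base alignment and is not part of the original code.

// Assumed base alignment for this 32-bit system.
const u32 kMemDefaultAlignment = 4;

// Fails to compile (array of negative size) if the header size is not a multiple of the alignment.
typedef char TAllocationHeaderSizeCheck[ (sizeof(TALLOCATION_HEADER) % kMemDefaultAlignment == 0) ? 1 : -1 ];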

The malloc function signature looks like so:

void* my_malloc( size_t uSize, u32 uAlignment )

uSize is the amount of memory in bytes being requested, and uAlignment is the size in bytes to use for calculating address boundaries.  Calls to malloc can be redirected to use my_malloc by first ensuring malloc.h is not included, then providing a define that replaces malloc with my_malloc during the pre-processor phase of compilation, and swapping the malloc calls for the define. 

#define MYMALLOC( uSize ) my_malloc( uSize, 4 )

So here’s what malloc needs to do… 

  1. Add padding to compensate for alignment of the allocation.
  2. Align the header so the beginning address of the memory will be aligned.
  3. Add the size of the header to the allocation.
  4. Check the request to make sure it will fit in the memory being tracked.
  5. Find enough contiguous pages of memory to satisfy the allocation.
    1. Find a starting point of available pages within a tracking unit.
    2. Find whole tracking units where all pages are available and required in fulfilling the request.
    3. Find the remaining pages needed to fulfill the request.
  6. Return the memory, if enough available memory was found, otherwise return NULL.

 The first four steps feel mostly like housekeeping to me, so I’ve kept the explanations to just a short description of what is done in code.  If more explanation is needed, please post questions.

 

1. Add padding to compensate for alignment of the allocation.

If the allocation request doesn’t match the alignment requirement, it’s simple enough to fix.  The additional space is calculated by taking the modulus of the size by the alignment, subtracting the remainder from the alignment, and adding the result back into the size.

uSize += (((uSize % uAlignment) > 0) ? uAlignment - (uSize % uAlignment) : 0);

2. Align the header so the beginning address of the memory will be aligned.

As an optimization, this step can be skipped, if you know that your allocation header will always take enough memory to leave the next available byte on an alignment boundary.  However, if someone calls my_malloc with an alignment size greater than the size of the allocation header, the result will be an unaligned address boundary.  The size in bytes to pad the allocation header can be obtained using a similar method to what was used to pad the allocation itself.  The final size of the allocation header is also saved so it can be used to calculate the return address.

u32 uAllocationHeaderPaddingSize = ((sizeof(TALLOCATION_HEADER) % uAlignment) > 0) ? uAlignment - sizeof(TALLOCATION_HEADER) % uAlignment : 0;
u32 uAllocationHeaderSize = sizeof(TALLOCATION_HEADER) + uAllocationHeaderPaddingSize; 

3.  Add the size of the header to the allocation.

Nothing big here, just keeping track of how much memory will be required.

uSize += uAllocationHeaderSize;

4. Check the request to make sure it will fit in the memory being tracked.

The number of pages being requested is calculated and saved.  First the number of whole pages is obtained by dividing the allocation size by the size of a page of memory.  Then any remaining memory that is required will fit into one page, so one page will be added if the allocation size modulus by the size of a page of memory is anything other than zero.

const u32 uNumPagesRequested = uSize / kMemPageSize + ((uSize % kMemPageSize) ? 1 : 0);

Next the number of tracking units requested is calculated and saved in similar fashion.  The number of whole tracking units is the number of pages requested divided by the number of pages per tracking unit.  Then any remaining pages will fit into one tracking unit, so one tracking unit will be added if the number of pages requested modulus by the number of pages per tracking unit is anything other than zero.

const u32 uNumTrackingUnitsRequested = uNumPagesRequested / kMemNumPagesPerUnit + ((uNumPagesRequested % kMemNumPagesPerUnit) ? 1 : 0);

Last the number of tracking units requested is compared to the total number of tracking units.  If there are more tracking units being requested than the total number of tracking units, the allocation failure is reported by returning NULL.

if( uNumTrackingUnitsRequested > kMemNumTrackingUnits )
{
      return NULL;
}

5. Find enough contiguous pages of memory to satisfy the allocation.

The search for pages is broken up into three stages, starting point, whole tracking units, and remaining pages.  The starting point looks for a tracking unit that either resolves the entire request or has pages available at the end and is contiguous to the next tracking unit.  The whole tracking units consist of 32 pages, at 4KB each, and by searching for available whole tracking units, the search is sped up by not having to do an individual search for each page internally.  The remaining pages stage aims to fulfill the remainder of the request by checking the tracking unit that is contiguous to either the tracking unit used for the starting point, or the last whole tracking unit, depending on which was last used.

In the next part the three stages of the search for pages will be covered.  To be continued…

Ready, Set, Allocate! (Part 3)


Update (3/7/2012): Fixed an infinite-loop bug found by Simon Lundmark in section 5b that occurred in the following sequence of events:

1. The search for whole tracking units runs past the end of the tracking array
2. The for-loop terminates on the i < kMemNumTrackingUnits condition
3. uBeginningTrackingUnit does not get updated
4. The uNumContiguousTrackingUnitsAvailable != uNumContiguousTrackingUnitsToFind condition is true.
5. The while-loop is continued, with uBeginningTrackingUnit at the same value it started with in the previous iteration.

Original (with edits):

In today’s part I will be discussing the three stages of the search for pages in the malloc rewrite. Special thanks to Simon Lundmark for digging in, pointing out and providing the correction for a couple flaws.

First though, here is a quick recounting of the computer used for development.  I’m working on an old laptop running Windows XP Pro Service Pack 3, with a Pentium-M 1.6GHz 32-bit processor, 400MHz FSB, and 512MB of RAM.

With the given machine architecture (Little Endian), I want to be clear that I conceptualize the units of the tracking array as if they were Big Endian: when the end of the tracking unit is referred to, this means the least significant bit, and the beginning or start of the tracking unit means the most significant bit.


Figure 1: Big Endian bit field

The array of tracking units positions the end of each preceding tracking unit next to the start of the subsequent tracking unit.


Figure 2: Preceding / Subsequent tracking units

Just to give a quick reminder, here’s the list of things malloc needs to do:

  1. Add padding to compensate for alignment of the allocation.
  2. Align the header so the beginning address of the memory will be aligned.
  3. Add the size of the header to the allocation.
  4. Check the request to make sure it will fit in the memory being tracked.
  5. Find enough contiguous pages of memory to satisfy the allocation.
    1. Find a starting point of available pages within a tracking unit.
    2. Find whole tracking units where all pages are available and required in fulfilling the request.
    3. Find the remaining pages needed to fulfill the request.
  6. Return the memory, if enough available memory was found, otherwise return NULL.

The search begins at the first tracking unit, and continues iterating over the array of tracking units until all tracking units have been exhausted.

u32 uBeginningTrackingUnit = 0;
while (uBeginningTrackingUnit < kMemNumTrackingUnits)
{

5a. Find a starting point of available pages within a tracking unit.

Four potential situations exist within the first tracking unit.

  1. All the pages have been used.
  2. Not enough free pages exist internal to the tracking unit to fulfill the request.
  3. All the pages needed to fulfill the request are contained within the starting tracking unit.
  4. Available pages exist starting at some point within the tracking unit, extending to the end of the tracking unit, and may be used in conjunction with the next tracking unit to fulfill the request if enough available contiguous pages exist.

The first two situations are not useful for fulfilling the request, so the last two are the ones for which to test.  If there are enough pages contained within a tracking unit to fulfill the request, then they can be located by building a bitmask that represents enough pages and using that bitmask to find a match with the available pages.  The number of pages not needed in a tracking unit is calculated by subtracting the number of pages needed from the number of pages per tracking unit.  Then the bitmask is created by bit shifting the ‘all pages in use’ constant towards the end of the tracking unit by the number of pages not needed; this leaves the number of pages that are needed as the bitmask.

      u32 uPreOffset = 0;
      TRACKING_UNIT uPreBitMask = 0;
      if (uNumPagesRequested < kMemNumPagesPerUnit)
      {
            // Build a PreBitMask to deal with there being enough contiguous pages internal to the
            // beginning unit to satisfy the request, or there is enough room to start a request
            // across the tracking unit boundary.
 
            // Build a PreBitMask that will find internal contiguous pages.
            uPreBitMask = kMemTrackingUnitAllPagesInUse << (kMemNumPagesPerUnit - uNumPagesRequested);
      }

If there are not enough available pages contained within a tracking unit to fulfill the request, then either the entire tracking unit, if all pages are available, or whatever available pages are at the end of the tracking unit can be used in conjunction with the available pages contained in the start, or entirety, of the subsequent tracking unit.  In this case, a bitmask is built that represents all pages being available.

      else
      {
            uPreBitMask = kMemTrackingUnitAllPagesInUse;
      }

Once the bitmask is created it is repeatedly checked and shifted toward the end of the tracking unit until a match to the bitmask is found, or all bits in the bitmask are exhausted.  If a match isn’t found prior to shifting the bitmask, then the bits, representing needed pages, shifted off the end of the tracking unit will need to be found in the next tracking unit, and this is kept in the uPreOffset counter used in a for loop.

      for ( ; uPreOffset < kMemNumPagesPerUnit; ++uPreOffset)
      {
            if (( ~(aMemPageTrackingBits[uBeginningTrackingUnit]) & uPreBitMask) == uPreBitMask)
            {
                  // Found a tracking unit with enough contiguous pages internal to the tracking unit
                  // or with available contiguous pages at the end of the tracking unit.
                  break;
            }
            uPreBitMask = uPreBitMask >> 1;
      }

Next, a check is done to make sure the bitmask wasn’t completely exhausted; if it was, the search moves to the next tracking unit and starts over.

      // Check if the PreOffset has completely exhausted the current beginning tracking unit.
      if (uPreOffset == kMemNumPagesPerUnit)
      {
            uBeginningTrackingUnit++;
            // Minimize worst case searches, where contiguous allocations, or one large allocation, fills contiguous tracking units.
            // Stop at the end of the array so the scan cannot read past the tracking bits.
            while (uBeginningTrackingUnit < kMemNumTrackingUnits &&
                   aMemPageTrackingBits[uBeginningTrackingUnit] == kMemTrackingUnitAllPagesInUse)
            {
                  uBeginningTrackingUnit++;
            }
            continue;
      }

If the bitmask wasn’t completely exhausted, then the number of remaining pages needed is calculated, and the number of remaining pages needed is checked.  If more pages are needed then a check is performed to ensure more tracking units are available.  When no more tracking units are available, but more pages are needed, a null is returned.  In the case that enough pages have been found to fulfill the request, the pages to be assigned are marked as in use, the beginning memory address is calculated, the memory used to pad the allocation header, if any, is cleared, the allocation header is written, and the memory address immediately following the allocation header is returned.  By placing the allocation header immediately prior to the memory address returned, the size and alignment information can be easily retrieved during a free method call, or anywhere it needs to be viewed, such as in a debugger, simply by moving the memory address pointer backwards.

      TRACKING_UNIT uNumRemainingPagesNeeded = ((uNumPagesRequested >= (kMemNumPagesPerUnit - uPreOffset))? (uNumPagesRequested - (kMemNumPagesPerUnit - uPreOffset)) : 0);
 
      // Check if all the required pages needed have been found.
      if (uNumRemainingPagesNeeded == 0)
      {
            // Mark the pages of the tracking unit as used.
            aMemPageTrackingBits[uBeginningTrackingUnit] |= uPreBitMask;
 
            // Calculate the memory address.
            u32 uAddress = ((uBeginningTrackingUnit * kMemNumPagesPerUnit) + uPreOffset) * kMemPageSize;
 
            // Zero out the Allocation Header Padding memory.
            memset((void*)((u32)(const_cast<void*>(kpMemory)) + uAddress), 0, uAllocationHeaderPaddingSize);
            // Store the size and alignment of the allocation (Pad the front of the header, so _free and realloc can get the alignment data).
            ((TALLOCATION_HEADER*)((u32)(const_cast<void*>(kpMemory)) + uAddress + uAllocationHeaderPaddingSize))->uSize = uSize;
            ((TALLOCATION_HEADER*)((u32)(const_cast<void*>(kpMemory)) + uAddress + uAllocationHeaderPaddingSize))->uAlignment = uAlignment;
 
            // Return the memory.
            return (void*)((u32)(const_cast<void*>(kpMemory)) + uAddress + uAllocationHeaderSize);
      }
 
      // Check if beginning tracking unit was the last one (early termination saves unnecessary checking).
      // If the next tracking unit is less than the beginning tracking unit, uNextTrackingUnitToCheck was larger than the data type would hold,
      // and therefore the last tracking unit is already included.
      u32 uNextTrackingUnitToCheck  = (uBeginningTrackingUnit + 1) % kMemNumTrackingUnits;
      if (uNextTrackingUnitToCheck < uBeginningTrackingUnit)
      {
            return NULL;
      }

5b. Find whole tracking units where all pages are available and required in fulfilling the request.

Since the tracking units are 32-bit unsigned integers, any tracking units where all pages in the tracking unit are available will have a value of zero.  The number of contiguous whole tracking units to find is calculated and then the tracking units subsequent to the starting tracking unit are evaluated.  If a partially or fully used tracking unit is encountered before all the required whole tracking units are found, then the search for whole tracking units is terminated.  If a tracking unit is completely available, it is counted and evaluation continues until failure or all required contiguous whole tracking units are found and confirmed.

      // The last tracking unit may be partial, so begin by determining how many whole units to find.
      // This will ensure that any remaining individual available page searches are limited to one tracking unit.
      u32 uNumContiguousTrackingUnitsAvailable = 0;
      u32 uNumContiguousTrackingUnitsToFind = uNumRemainingPagesNeeded / kMemNumPagesPerUnit;
      for (u32 i = uNextTrackingUnitToCheck; i < kMemNumTrackingUnits && uNumContiguousTrackingUnitsAvailable < uNumContiguousTrackingUnitsToFind; ++i)
      {
            if (aMemPageTrackingBits[i] == 0)
            {
                  uNumContiguousTrackingUnitsAvailable++;
            }
            else
            {
                  break;
            }
      }

Once the search has terminated or completed, the number of available whole tracking units found is compared to the number of contiguous whole tracking units required, to make sure the search wasn’t terminated early.  If the search was terminated early, then the tracking unit to start searching from is moved up to the current tracking unit, and the search for the requested memory must begin again at the beginning.

      // Check for failure to locate enough contiguous whole tracking units.
      if (uNumContiguousTrackingUnitsAvailable != uNumContiguousTrackingUnitsToFind)
      {
            // Skip past all the memory that was just checked for whole contiguous tracking units,
            // because the allocation isn't going to fit in there, and trying again would be wasted effort.
            uBeginningTrackingUnit = uNextTrackingUnitToCheck + uNumContiguousTrackingUnitsAvailable;
            continue;
      }

The number of pages needed is recalculated, subtracting out the pages contained in the whole tracking units that were just located.

      uNumRemainingPagesNeeded = uNumRemainingPagesNeeded - (uNumContiguousTrackingUnitsAvailable * kMemNumPagesPerUnit);

A check is then performed to determine if all the pages needed have been found.  If all the pages needed have been found, then the same steps are taken as described at the end of the search for the starting point, except when the pages are marked as used the whole tracking units used to satisfy the request are marked as well.  If more pages are required then the algorithm continues.

      if (uNumRemainingPagesNeeded == 0)
      {
            // Mark the pages of the tracking unit as used.
            aMemPageTrackingBits[uBeginningTrackingUnit] |= uPreBitMask;
            for (u32 i = uNextTrackingUnitToCheck; i < (uNextTrackingUnitToCheck + uNumContiguousTrackingUnitsAvailable) ; ++i)
            {
                  aMemPageTrackingBits[i] |= kMemTrackingUnitAllPagesInUse;
            }
 
            // Calculate the memory address.
            u32 uAddress = ((uBeginningTrackingUnit * kMemNumPagesPerUnit) + uPreOffset) * kMemPageSize;
 
            // Zero out the Allocation Header Padding memory.
            memset((void*)((u32)(const_cast<void*>(kpMemory)) + uAddress), 0, uAllocationHeaderPaddingSize);
            // Store the size and alignment of the allocation (Pad the front of the header, so _free and realloc can get the alignment data).
            ((TALLOCATION_HEADER*)((u32)(const_cast<void*>(kpMemory)) + uAddress + uAllocationHeaderPaddingSize))->uSize = uSize;
            ((TALLOCATION_HEADER*)((u32)(const_cast<void*>(kpMemory)) + uAddress + uAllocationHeaderPaddingSize))->uAlignment = uAlignment;
 
            // Return the memory.
            return (void*)((u32)(const_cast<void*>(kpMemory)) + uAddress + uAllocationHeaderSize);
      }

5c. Find the remaining pages needed to fulfill the request.

Before checking for the last tracking unit to fulfill the request, a check is performed to make sure the index for the next tracking unit to check is not past the end of the array of tracking units.  If the next tracking unit to check would be past the end of the array of tracking units, then not enough contiguous memory exists, and null is returned. When the index for the next tracking unit to check is valid, then a bitmask is created for the remaining pages needed; the bits for this are shifted to begin at the start of the tracking unit, because the pages must be contiguous with the previously located pages.

      // Check the next tracking unit for the last of the contiguous pages needed.
 
      // Check if available tracking units already includes the last one (early termination saves unnecessary checking).
      // If the next tracking unit is less than or equal to the beginning tracking unit, uNextTrackingUnitToCheck was larger
      // than the last tracking unit or larger than the data type would hold, and therefore the last tracking unit is
      // already included.
      uNextTrackingUnitToCheck = (uNextTrackingUnitToCheck + uNumContiguousTrackingUnitsAvailable) % kMemNumTrackingUnits;
      if (uNextTrackingUnitToCheck <= uBeginningTrackingUnit)
      {
            return NULL;
      }
 
      // Build a bitmask for finding the last batch of contiguous pages.
      TRACKING_UNIT uPostBitMask = kMemTrackingUnitAllPagesInUse << (kMemNumPagesPerUnit - uNumRemainingPagesNeeded);

One last check is performed to determine if enough contiguous pages have been located to fulfill the request.  The new bitmask is compared to the next tracking unit to check; if a match is found, then the same steps are taken as described at the end of the search for the starting point, except that when the pages are marked as used, the whole tracking units used to satisfy the request, and the pages required in the next tracking unit to check, are marked as well.  If a match is not found, then the tracking unit to start at is moved to the tracking unit that was evaluated for the end in this iteration, because it may be the start of the tracking units that will fulfill the request in the next iteration of the search over the tracking unit array.

      // The pages must be at the beginning of the tracking unit to be contiguous with the previous pages.
      // If the negation of the next tracking unit bits AND'd with the pages needed bit mask is equal to the pages needed bit mask, then the memory has been found.
      if ((~(aMemPageTrackingBits[uNextTrackingUnitToCheck]) & uPostBitMask) == uPostBitMask)
      {
            // Mark the pages of the tracking unit as used.
            aMemPageTrackingBits[uBeginningTrackingUnit] |= uPreBitMask;
            for (u32 i = (uBeginningTrackingUnit + 1); i < uNextTrackingUnitToCheck ; ++i)
            {
                  aMemPageTrackingBits[i] |= kMemTrackingUnitAllPagesInUse;
            }
            aMemPageTrackingBits[uNextTrackingUnitToCheck] |= uPostBitMask;
 
            // Calculate the memory address.
            u32 uAddress = ((uBeginningTrackingUnit * kMemNumPagesPerUnit) + uPreOffset) * kMemPageSize;
 
            // Zero out the Allocation Header Padding memory.
            memset((void*)((u32)(const_cast<void*>(kpMemory)) + uAddress), 0, uAllocationHeaderPaddingSize);
            // Store the size and alignment of the allocation (Pad the front of the header, so _free and realloc can get the alignment data).
            ((TALLOCATION_HEADER*)((u32)(const_cast<void*>(kpMemory)) + uAddress + uAllocationHeaderPaddingSize))->uSize = uSize;
            ((TALLOCATION_HEADER*)((u32)(const_cast<void*>(kpMemory)) + uAddress + uAllocationHeaderPaddingSize))->uAlignment = uAlignment;
 
            // Return the memory.
            return (void*)((u32)(const_cast<void*>(kpMemory)) + uAddress + uAllocationHeaderSize);
      }
 
      // Set the next tracking unit to begin checking.
      uBeginningTrackingUnit = uNextTrackingUnitToCheck;
} // End of while (uBeginningTrackingUnit < kMemNumTrackingUnits)

If all the tracking units have been exhausted, without locating enough pages to fulfill the request, then null is returned.

      // If this has been reached, then not enough contiguous memory was found.
      return NULL;

And that’s malloc.

In the bigger picture, there are issues that still need to be considered and dealt with, such as fragmentation or frequent small requests. Issues such as those can be handled through other schemes, like memory heaps, hashes containing lists of available memory with frequently used sizes and object factories that would be implemented on top of malloc. The reason these other schemes are not handled in malloc itself is because in my implementation I chose not to incur the costs associated with them in trying to allocate larger blocks of memory.  In other words, only incur the execution costs when necessary.

I hope this post has been informative, and I promise next time it will not be so long when I cover the free method.

Ready, Set, Allocate! (Part 4)


In this part the free method is covered.

 The system used for development and testing is an old laptop running Windows XP Pro Service Pack 3, with a Pentium-M 1.6GHz 32-bit processor, 400MHz FSB, and 512MB of RAM.

 To release memory back to the allocation system, the following steps will be taken:

  1. Retrieve the allocation header.
  2. Determine the pages used to track the memory.
  3. Release the pages back into the memory heap.

 The free function signature looks like so:

void my_free (void* pMem)

pMem is the memory to be released back to the allocation system. The free function can be redirected similarly to malloc, using a define.

#define MYFREE( pMem )	my_free( pMem )

1. Retrieve the allocation header.

Recall, the allocation header was placed just prior to the address returned from malloc.  So to retrieve the allocation size and alignment information, the address pointed to by pMem is decremented by the size of an allocation header; this makes the address point to the beginning of the allocation header, so with a quick cast the allocation header members can be retrieved.

       // The tracking header exists just prior to the pMem pointer, step backwards the size of an allocation header to get it.
      u32 uSize = ((TALLOCATION_HEADER*)((u32)(pMem) - sizeof(TALLOCATION_HEADER)))->uSize;
      u32 uAlignment = ((TALLOCATION_HEADER*)((u32)(pMem) - sizeof(TALLOCATION_HEADER)))->uAlignment;

The amount of padding used before the header to keep the memory aligned is determined and used to move the pMem pointer back, so that when the memory is released, all the memory added for the allocation header is released as well.

      u32 uAllocationHeaderPaddingSize = ((sizeof(TALLOCATION_HEADER) % uAlignment) > 0) ? uAlignment - sizeof(TALLOCATION_HEADER) % uAlignment : 0;
      // Align the header, so the beginning address of the memory will be aligned.
      u32 uAllocationHeaderSize = sizeof( TALLOCATION_HEADER ) + uAllocationHeaderPaddingSize;
      // Move the pointer backwards to include the allocation header.
      pMem = (void*)( (u32)( pMem ) - uAllocationHeaderSize );

2. Determine the pages used to track the memory.

The first thing to do here is to figure out the “address”, or offset from the beginning of the memory being tracked.  Then the tracking unit where the allocation was started can be determined using the address.  Also the offset for the beginning bit/page within the first tracking unit is calculated using the address.

      u32 uAddress = (u32)(pMem) - (u32)(const_cast<void*>(kpMemory));
      u32 uBeginningTrackingUnit = uAddress / kMemPageSize / kMemNumPagesPerUnit;
      u32 uPreOffset = uAddress / kMemPageSize % kMemNumPagesPerUnit;

The size information obtained from the allocation header allows the number of pages used to track the allocation to also be calculated.

      u32 uNumPagesUsed = uSize / kMemPageSize + ((uSize % kMemPageSize)? 1 : 0);

3. Release the pages back into the memory heap.

Clearing pages for each tracking unit follows the same basic pattern: a bitmask of the pages used is built, then bitwise inverted and AND’ed with the tracking unit, with the result stored back into the same tracking unit.

If the entire allocation is contained within one tracking unit, then it is cleared and the function returns.

      // Check if the used pages are contained within one tracking unit.
      if(uNumPagesUsed < (kMemNumPagesPerUnit - uPreOffset))
      {
            // The entire allocation is contained in the one Tracking Unit, build a bit mask and clear the pages all at once.
            aMemPageTrackingBits[uBeginningTrackingUnit] &=  ~((kMemTrackingUnitAllPagesInUse << (kMemNumPagesPerUnit - uNumPagesUsed)) >> uPreOffset);
            return;
      }

Otherwise, the free operation is broken down into three parts:

  1. Free the partially used tracking unit for the beginning of the allocation.
  2. Free any whole tracking units used for the allocation.
  3. Free a partially used tracking unit, if used, for the end of the allocation.

      // All pages from the PreOffset to the end of the first Tracking Unit are used, clear them.
      aMemPageTrackingBits[uBeginningTrackingUnit] &= ~(kMemTrackingUnitAllPagesInUse >> uPreOffset);
 
      // Determine how many pages are left to clear.
      uNumPagesUsed -= (kMemNumPagesPerUnit - uPreOffset);
 
      // Clear whole tracking units.
      u32 uNextUnitToClear = (uBeginningTrackingUnit + 1);
      for( ; uNextUnitToClear <= (uBeginningTrackingUnit + (uNumPagesUsed / kMemNumPagesPerUnit)); ++uNextUnitToClear)
      {
            aMemPageTrackingBits[uNextUnitToClear] &= ~kMemTrackingUnitAllPagesInUse;
      }
 
      // Determine how many pages are left to clear.
      uNumPagesUsed -= ((uNextUnitToClear - (uBeginningTrackingUnit + 1)) * kMemNumPagesPerUnit);
 
      // Terminate if done, also protects from going past the end of the aMemPageTrackingBits array.
      if(uNumPagesUsed == 0 || (uNextUnitToClear >= kMemNumTrackingUnits))
      {
            // If this assertion fails, then something is messed up with the tracking information,
            // because it was believed that pages were still in use that must exist past the array
            // of tracking bits.
            assert(uNumPagesUsed == 0);
            return;
      }
 
      // Clear the remaining pages.
      aMemPageTrackingBits[uNextUnitToClear] &= ~(kMemTrackingUnitAllPagesInUse << (kMemNumPagesPerUnit - uNumPagesUsed));

And that’s it, the free method is done.  Next time I’ll cover the simple high performance timer used for testing.

Ready, Set, Allocate! (Part 5)


In this part I will go over a simple high performance timer class.  I’m sure there are plenty of other examples out there that are similar, especially since I put this together by looking at what others did on the web.  So first off, thank you to everyone who helped me knowingly or otherwise.

The system used for development and testing is an old laptop running Windows XP Pro Service Pack 3, with a Pentium-M 1.6GHz 32-bit processor, 400MHz FSB, and 512MB of RAM.

 Two things of note: the following requires the machine to support a high-resolution performance counter, and, contrary to everything else I’ve done up to this point, the timer code is platform specific.  The class uses the _LARGE_INTEGER union from WinNT.h and calls the QueryPerformanceCounter and QueryPerformanceFrequency methods from WinBase.h; my reasoning for choosing these is that they were written to perform exactly the task at hand and are optimized for the platform I’m testing on.

The timer keeps the total time since the last time it was started, and subtracts the total amount of time spent paused to return the current amount of time tracked.  The timer class is minimalistic: it contains four data members and, in addition to a constructor, only four methods.  The data members keep the frequency of the high-resolution performance counter, the number of ticks when the timer was started or last paused, and a running total of all ticks spent while paused.  The four methods start/reset, pause, or resume the timer, and return the current time as hours, minutes and seconds, including the fractional part of the seconds.

class CHighPerfTimer
{
protected:
  _LARGE_INTEGER* m_pTicksPerSecond;      // The counter's frequency
  _LARGE_INTEGER* m_pStartTicks;          // The starting point in ticks
  _LARGE_INTEGER* m_pPausedTicks;         // The amount of ticks spent when the timer was last paused
  _LARGE_INTEGER* m_pTotalPausedTicks;    // The amount of ticks spent paused in total
 
public:
      CHighPerfTimer(void);
 
      void Start(void);
      void Pause(void);
      void Resume(void);
 
      void GetTime(s32& hours, s32& minutes, double& seconds);
};

1. Initialize the timer.

The constructor initializes the data members, and then determines the frequency of the high-resolution performance counter on the machine.  The frequency of the high-resolution performance counter varies from machine to machine, but will not change while a machine is running, so it must be obtained to be able to determine the meaning of the values returned from QueryPerformanceCounter, with respect to time.

CHighPerfTimer::CHighPerfTimer(void)
{
      m_pTicksPerSecond = new LARGE_INTEGER();
      m_pStartTicks = new LARGE_INTEGER();
      m_pPausedTicks = new LARGE_INTEGER();
      m_pTotalPausedTicks = new LARGE_INTEGER();
 
      m_pTicksPerSecond->QuadPart = 0;
      m_pStartTicks->QuadPart = 0;
      m_pPausedTicks->QuadPart = 0;
      m_pTotalPausedTicks->QuadPart = 0;
 
      // Get the high resolution counter's frequency
      QueryPerformanceFrequency(m_pTicksPerSecond);
}

2. Starting the timer.

Start initializes the timer and begins the counting of ticks.  The QueryPerformanceCounter  call obtains the number of ticks reported by the system’s high-resolution performance counter, and can be compared against a later call to QueryPerformanceCounter, so that counting need only be done internally by the system’s high-resolution performance counter.

void CHighPerfTimer::Start(void)
{
      // Reset the Paused ticks count.
      m_pPausedTicks->QuadPart = 0;
      m_pTotalPausedTicks->QuadPart = 0;
 
      // Get the starting ticks.
      QueryPerformanceCounter(m_pStartTicks);
}

3. Pausing the timer.

Pause starts counting ticks while the timer isn’t considered as running, by getting the current number of ticks for comparison against a later call to QueryPerformanceCounter.

void CHighPerfTimer::Pause(void)
{
      // Start the paused ticks count.
      QueryPerformanceCounter(m_pPausedTicks);
}

4. Resuming a paused timer.

Resume ends the counting of ticks since Pause was called and adds the difference of ticks between calls to Pause and Resume to the total number of ticks spent while paused.  If the timer hasn’t previously been started, then resume will start the timer.

void CHighPerfTimer::Resume(void)
{
      if (m_pStartTicks->QuadPart != 0)
      {
            // End the paused ticks count.
            LARGE_INTEGER endPausedTicks;
            QueryPerformanceCounter(&endPausedTicks);
            if (endPausedTicks.QuadPart > m_pPausedTicks->QuadPart)
                  m_pTotalPausedTicks->QuadPart += endPausedTicks.QuadPart - m_pPausedTicks->QuadPart;
            m_pPausedTicks->QuadPart = 0;
      }
      else
      {
            Start();
      }
}

5. Reporting the tracked time of the timer.

GetTime gets the current number of ticks from the high-resolution performance counter, converts the difference between the current number of ticks and the starting number of ticks, minus the total ticks spent paused, into a time value by dividing by the frequency, and then separates the time into hours, minutes and seconds, including the fractional part of the seconds.

void CHighPerfTimer::GetTime(s32& hours, s32& minutes, double& seconds)
{
      LARGE_INTEGER ticks;
      double time;
 
      // Get the latest tick count.
      QueryPerformanceCounter(&ticks);
 
      // If the timer is paused, discount the time since it was paused.
      LARGE_INTEGER immediatePausedTicks = *m_pTotalPausedTicks;
      if (m_pPausedTicks->QuadPart > 0)
            immediatePausedTicks.QuadPart += ticks.QuadPart - m_pPausedTicks->QuadPart;
 
      // Convert the elapsed ticks into number of seconds
      time = (double)(ticks.QuadPart - immediatePausedTicks.QuadPart - m_pStartTicks->QuadPart)/(double)m_pTicksPerSecond->QuadPart;
 
      // Number of hours
      hours = (s32)time/3600;
 
      // Number of minutes
      time = time - (hours * 3600);
      minutes = (s32)time/60;
 
      // Number of seconds
      seconds = time - (minutes * 60);
}

Next time I plan to cover the results of my comparisons between this malloc and the built-in malloc.

Ready, Set, Allocate! (Part 6)


In this part I will cover three tests. The first test examines how long it takes to do a set number of allocations with a given size. The second test examines how long it takes to exhaust thirty-two megabytes of memory with four-kilobyte allocations a set number of times. The third test examines how long it takes to allocate a set number of times in a worst-case scenario. Each of the tests takes a file pointer as a parameter, so that it can write the results of the test out in comma separated value (CSV) format.

Due to my old laptop giving up the ghost I’m having to change the specs for the system I am using. I’ve done the best with what I have to keep the specs as close to the original ones as possible. The system now used for development and testing is an old desktop running Windows XP Pro Service Pack 3, with a Pentium-4 3.0GHz 32-bit processor, 800MHz FSB, and 512MB of RAM.

First, here are a few important details.

static const double MAXALLOCATIONS = 10000;
static const u32 STEPMULTIPLIER = 10;
static const u32 HEADERSIZE = 8; // sizeof(TALLOCATION_HEADER)
static const u32 NUMREPORTS = (u32)ceil(log(MAXALLOCATIONS)/log((d64)STEPMULTIPLIER)) + 1;

MAXALLOCATIONS is the maximum number of allocations to run for each test. STEPMULTIPLIER is the multiplier to use when stepping up the number of allocations to run for each subsequent iteration (i.e. going from 100 allocations to 1000 allocations by multiplying 100 * STEPMULTIPLIER). I’ve pre-calculated HEADERSIZE as a convenience, so it doesn’t need to keep being calculated. Also, since TALLOCATION_HEADER is defined in a separate .cpp file, with no declaration in the corresponding header, I’ve gone ahead and hard coded the value here. Last, NUMREPORTS represents the number of iterations that will be performed and reported for each test.

The structure for recording the times and reporting them later is a simple collection of values, with the size used for the allocations, and the hours, minutes and seconds for both allocating and freeing the memory. The seconds are stored as a double to keep the milliseconds.

struct Record
{
	s32 allocationSize;
	s32 allocHours;
	s32 allocMinutes;
	d64 allocSeconds;
	s32 freeHours;
	s32 freeMinutes;
	d64 freeSeconds;
};

Two other structures are used to associate the number of allocations, or repetitions of allocating / freeing the memory, performed at the given allocation size. The first structure is used with the first test and stores an array of records, so that differing allocation sizes (1Byte to 32MB, by multiples of 2) can be associated with the number of times the allocation is performed. The second structure is used with the second and third tests, associating a single record with the number of times the test is performed.

struct AllocReportSizes
{
	static const s32 NUMALLOCATIONSIZES = 25;
 
	s32 numAllocations;
	Record records[NUMALLOCATIONSIZES];
};
 
typedef struct AllocReportExhaustion
{
	s32 numRepetitions;
	Record record;
} AllocWorstCase;

1. How long does it take to allocate X times at size Y?

The first thing to do is to set up an array of reporting structures, one for each number of allocations to be tested, along with an entry for each size to be reported (1 byte to 32MB). The number of allocations increases by a multiple of STEPMULTIPLIER for each successive reporting structure.

// Allocate the given size, then free it, a specified number of times
bool TestSizes(FILE* pOutputFile)
{
	if (!pOutputFile)
		return false;
 
	AllocReportSizes* pReports = new AllocReportSizes[NUMREPORTS];
 
	for (u32 i = 0, numAllocs = 1; i < NUMREPORTS; ++i, numAllocs *= STEPMULTIPLIER)
	{
		pReports[i].numAllocations = numAllocs;
		pReports[i].records[0].allocationSize = MEM_1B;
		pReports[i].records[1].allocationSize = MEM_4B;
		pReports[i].records[2].allocationSize = MEM_8B;
		pReports[i].records[3].allocationSize = MEM_16B;
		pReports[i].records[4].allocationSize = MEM_32B;
		pReports[i].records[5].allocationSize = MEM_64B;
		pReports[i].records[6].allocationSize = MEM_128B;
		pReports[i].records[7].allocationSize = MEM_256B;
		pReports[i].records[8].allocationSize = MEM_512B;
		pReports[i].records[9].allocationSize = MEM_1KB;
		pReports[i].records[10].allocationSize = MEM_2KB; // Header space in a block doesn't become an issue until 4KB.
		pReports[i].records[11].allocationSize = MEM_4KB - HEADERSIZE;
		pReports[i].records[12].allocationSize = MEM_8KB - HEADERSIZE;
		pReports[i].records[13].allocationSize = MEM_16KB - HEADERSIZE;
		pReports[i].records[14].allocationSize = MEM_32KB - HEADERSIZE;
		pReports[i].records[15].allocationSize = MEM_64KB - HEADERSIZE;
		pReports[i].records[16].allocationSize = MEM_128KB - HEADERSIZE;
		pReports[i].records[17].allocationSize = MEM_256KB - HEADERSIZE;
		pReports[i].records[18].allocationSize = MEM_512KB - HEADERSIZE;
		pReports[i].records[19].allocationSize = MEM_1MB - HEADERSIZE;
		pReports[i].records[20].allocationSize = MEM_2MB - HEADERSIZE;
		pReports[i].records[21].allocationSize = MEM_4MB - HEADERSIZE;
		pReports[i].records[22].allocationSize = MEM_8MB - HEADERSIZE;
		pReports[i].records[23].allocationSize = MEM_16MB - HEADERSIZE;
		pReports[i].records[24].allocationSize = MEM_32MB - HEADERSIZE;
	}
 
	// Write the headers.
	fprintf(pOutputFile, "Action,Num Allocations,MEM_1B,MEM_4B,MEM_8B,MEM_16B,MEM_32B,MEM_64B,MEM_128B,MEM_256B,MEM_512B,MEM_1KB,MEM_2KB,MEM_4KB,MEM_8KB,MEM_16KB,MEM_32KB,MEM_64KB,MEM_128KB,MEM_256KB,MEM_512KB,MEM_1MB,MEM_2MB,MEM_4MB,MEM_8MB,MEM_16MB,MEM_32MB\n");

Next, iterate through each of the reporting structures to determine how long it takes to perform the specified number of allocations at each of the given sizes. Each allocation is performed, then the memory is released, so the next allocation doesn’t operate under different conditions. The time tracked is only the time spent in the allocation and free functions, and those two are tracked separately.

	for (u32 i = 0; i < NUMREPORTS; ++i)
	{
		s32 numAllocations = pReports[i].numAllocations;
		fprintf(pOutputFile, "Allocate,%i", numAllocations);
		for (s32 j = 0; j < AllocReportSizes::NUMALLOCATIONSIZES; ++j)
		{
			s32 allocationSize = pReports[i].records[j].allocationSize;
 
			CHighPerfTimer timerAlloc;
			CHighPerfTimer timerFree;
 
			void* pMem;
			for (s32 k = 0; k < numAllocations; ++k)
			{
				timerAlloc.Resume();
				pMem = malloc(allocationSize);
				timerAlloc.Pause();
				timerFree.Resume();
				free(pMem);
				timerFree.Pause();
			}
 
			timerAlloc.GetTime(pReports[i].records[j].allocHours, pReports[i].records[j].allocMinutes, pReports[i].records[j].allocSeconds);
			timerFree.GetTime(pReports[i].records[j].freeHours, pReports[i].records[j].freeMinutes, pReports[i].records[j].freeSeconds);
 
			fprintf(pOutputFile, ",%02i:%02i:%02.4f", pReports[i].records[j].allocHours, pReports[i].records[j].allocMinutes, pReports[i].records[j].allocSeconds);
		}
		fprintf(pOutputFile, "\n");
	}
 
	for (u32 i = 0; i < NUMREPORTS; ++i)
	{
		s32 numAllocations = pReports[i].numAllocations;
		fprintf(pOutputFile, "Free,%i", numAllocations);
		for (s32 j = 0; j < AllocReportSizes::NUMALLOCATIONSIZES; ++j)
		{
			fprintf(pOutputFile, ",%02i:%02i:%02.4f", pReports[i].records[j].freeHours, pReports[i].records[j].freeMinutes, pReports[i].records[j].freeSeconds);
		}
		fprintf(pOutputFile, "\n");
	}
	delete[] pReports;	// Release the reporting structures.
	return true;
}

Allocating a bunch of objects of the same size can add up in time spent allocating, and when lots of them are done at the same time it can become a performance issue. The problem becomes more pronounced with larger allocations, when the system has to search for larger blocks of contiguous memory.

When the test is run using the heap system here, allocation times range from less than a millisecond, for 1 Byte being allocated 1 time, to 27.3 milliseconds, for 32 MB being allocated 10000 times. However, when the test is run using the standard memory allocator, allocation times range from less than a millisecond, for 1 Byte being allocated 1 time, to 9 minutes 17 seconds 442.7 milliseconds, for 32 MB being allocated 10000 times. Now that’s a pretty sensational performance improvement, but something a bit more realistic in real world use might be 64KB allocated 1000 times, and with that the improvement still shows through with 0.7 milliseconds and 66.3 milliseconds respectively.

Windows XP cannot be restricted to allocating the requested memory from a specific range of addresses, short of implementing a custom memory allocation system, which contributes to the increased allocation times seen while the standard allocator searches for available memory, and that is partly what is being demonstrated here.

2. How long does it take to allocate all the memory X times at 4KB per allocation?

Similar to the previous test, an array of reporting structures is set up, but this time there is one for each number of repetitions used to exhaust the memory. Only the smallest allocation size the system supports is tested (4KB, i.e. 8192 allocations to exhaust 32MB), since larger allocations would decrease both the number of allocations and the time spent.

// Allocate one block at a time, until memory is full, then free the memory, a specified number of times
bool TestExhaustion(FILE* pOutputFile)
{
	if (!pOutputFile)
		return false;
 
	u32 blockSize = _KB(4);
	AllocReportExhaustion* pReports = new AllocReportExhaustion[NUMREPORTS];
 
	for (u32 i = 0, numRepetitions = 1; i < NUMREPORTS; ++i, numRepetitions *= STEPMULTIPLIER)
	{
		pReports[i].numRepetitions = numRepetitions;
		pReports[i].record.allocationSize = blockSize - HEADERSIZE; // blockSize - sizeof(TALLOCATION_HEADER)
	}
 
	// Write the headers.
	fprintf(pOutputFile, "Action,Num Repetitions,Time to exhaust 32MB using 4KB per allocation\n");

Then the reporting structures are iterated over to determine how long it takes to allocate all the memory, at 4KB per allocation, for the specified number of repetitions. Each time the memory is exhausted it is released and the next repetition begins.

	u32 uTotalMem = _MB(32);
	for (u32 i = 0; i < NUMREPORTS; ++i)
	{
		s32 numRepetitions = pReports[i].numRepetitions;
		s32 allocationSize = pReports[i].record.allocationSize;
 
		CHighPerfTimer timerAlloc;
		CHighPerfTimer timerFree;
 
		void** ppMem = new void*[uTotalMem / blockSize];
		for (s32 j = 0; j < numRepetitions; ++j)
		{
			for (u32 k = 0; k < (uTotalMem / blockSize); ++k)
			{
				timerAlloc.Resume();
				ppMem[k] = malloc(allocationSize);
				timerAlloc.Pause();
			}
			for (u32 k = 0; k < (uTotalMem / blockSize); ++k)
			{
				timerFree.Resume();
				free(ppMem[k]);
				timerFree.Pause();
			}
		}
		timerAlloc.GetTime(pReports[i].record.allocHours, pReports[i].record.allocMinutes, pReports[i].record.allocSeconds);
		timerFree.GetTime(pReports[i].record.freeHours, pReports[i].record.freeMinutes, pReports[i].record.freeSeconds);
 
		fprintf(pOutputFile, "Allocate,%i,%02i:%02i:%02.4f\n", numRepetitions, pReports[i].record.allocHours, pReports[i].record.allocMinutes, pReports[i].record.allocSeconds);
		fprintf(pOutputFile, "Free,%i,%02i:%02i:%02.4f\n", numRepetitions, pReports[i].record.freeHours, pReports[i].record.freeMinutes, pReports[i].record.freeSeconds);
 
		delete[] ppMem;	// Release the pointer table before the next report.
	}
 
	delete[] pReports;	// Release the reporting structures.
	return true;
}

Allocating all the memory at 4KB per allocation demonstrates some good and bad allocation conditions, though not the absolute worst (that’s saved for the last test). When no memory is previously allocated, available memory is discovered quickly, and because each subsequent allocation is contiguous, whole blocks eventually end up being examined with a single check, reducing the time to discover available memory. Performing lots of small allocations means more searches are performed than if larger allocations were made, because larger allocations would exhaust the memory sooner, and more searches add up to more time spent looking for available memory.

Exhausting 32 MB of memory at 4KB per allocation with the heap system here ranges in time from 10.1 milliseconds, to do it 1 time, to 1 minute 39 seconds 995 milliseconds, to do it 10000 times. When using the standard memory allocation system, the times range from 101.5 milliseconds, to exhaust 32MB of memory 1 time, to 15 minutes 52 seconds 593.3 milliseconds, to do it 10000 times. Even more interesting is the increase in performance of the Free method. To free the same memory that was allocated, the heap system here ranges from 6.1 milliseconds to 1 minute 1 second 891.4 milliseconds, while the standard memory allocation system clocks in a range from 613.6 milliseconds to 1 hour 43 minutes 35 seconds 161.5 milliseconds. At first I didn’t believe it took more than an hour and a half to free the memory, so I re-ran the test and came up with a similar time (1:41:58.453.5). While the idea of exhausting the memory 10000 times in a row may not be a real world problem, it still highlights the time gains that can be made over many allocations and de-allocations with a specialized system.

Allocating all the memory in Windows XP isn’t the same as allocating all the memory in the system described herein, since Windows XP will begin page-swapping once the physical memory has been exhausted. However, the time it takes Windows XP to allocate 32MB worth of memory at 4KB per allocation versus the time it takes the system described herein to do the same can still be compared for time to complete.

3. How long does it take to allocate 8KB of memory X times in the worst case?

As previously done, an array of reporting structures is created, but this one is used to report the time it takes to fulfill an 8KB request in a worst-case scenario.

// With every other block available, until the end of the memory where two blocks are available, allocate two blocks of memory, a specified number of times
bool TestWorstCase(FILE* pOutputFile)
{
	if (!pOutputFile)
		return false;
 
	u32 blockSize = _KB(4);
	AllocWorstCase* pReports = new AllocWorstCase[NUMREPORTS];
 
	for (u32 i = 0, numRepetitions = 1; i < NUMREPORTS; ++i, numRepetitions *= STEPMULTIPLIER)
	{
		pReports[i].numRepetitions = numRepetitions;
		pReports[i].record.allocationSize = (2* blockSize) - HEADERSIZE; // (2 * blockSize) - sizeof(TALLOCATION_HEADER)
	}
 
	// Write the headers.
	fprintf(pOutputFile, "Action,Num Repetitions,Worst case allocation time for an 8KB block\n");

The worst case sets up the memory to be entirely allocated in 4KB blocks, then goes back through and frees every other block. At the end of the memory the test ensures an 8KB block is available to fulfill the request. Any search for free memory will discover there are free blocks within each page and will have to walk through each page to check whether enough contiguous blocks exist to fulfill the requested 8KB.

	u32 uTotalMem = _MB(32);
	void** ppMem = new void*[uTotalMem / blockSize];
 
	// Allocate all the memory, one block at a time, then free every other block, then free the next to last block,
	// creating two contiguous blocks at the end of the memory.
	for (u32 i = 0; i < uTotalMem / blockSize; ++i)
	{
		ppMem[i] = malloc(blockSize - HEADERSIZE);	// blockSize - sizeof(TALLOCATION_HEADER)
	}
	for (u32 i = 1; i < uTotalMem / blockSize; i += 2)
	{
		free(ppMem[i]);
		ppMem[i] = NULL;
	}
	free(ppMem[uTotalMem / blockSize - 2]);
	ppMem[uTotalMem / blockSize - 2] = NULL;	// Clear the slot so the cleanup loop below doesn't free it twice.

The test iterates through the reporting structures, performing the 8KB request for the specified number of repetitions in each one. Every time the available 8KB is found at the end of the memory it is immediately deallocated so the test can be performed for the next iteration.

	// Allocate two blocks of memory, with the only two contiguous blocks of memory being the last two, then free it a given number of times.
	for (u32 i = 0; i < NUMREPORTS; ++i)
	{
		s32 numRepetitions = pReports[i].numRepetitions;
		s32 allocationSize = pReports[i].record.allocationSize;
 
		CHighPerfTimer timerAlloc;
		CHighPerfTimer timerFree;
 
		for (s32 j = 0; j < numRepetitions; ++j)
		{
			void* pMem;
			for (u32 k = 0; k < (uTotalMem / blockSize); ++k)
			{
				timerAlloc.Resume();
				pMem = malloc(allocationSize);
				timerAlloc.Pause();
				timerFree.Resume();
				free(pMem);
				timerFree.Pause();
			}
		}
		timerAlloc.GetTime(pReports[i].record.allocHours, pReports[i].record.allocMinutes, pReports[i].record.allocSeconds);
		timerFree.GetTime(pReports[i].record.freeHours, pReports[i].record.freeMinutes, pReports[i].record.freeSeconds);
 
		fprintf(pOutputFile, "Allocate,%i,%02i:%02i:%02.4f\n", numRepetitions, pReports[i].record.allocHours, pReports[i].record.allocMinutes, pReports[i].record.allocSeconds);
		fprintf(pOutputFile, "Free,%i,%02i:%02i:%02.4f\n", numRepetitions, pReports[i].record.freeHours, pReports[i].record.freeMinutes, pReports[i].record.freeSeconds);
		printf("\n");
	}
 
	// Free the memory used to setup the testcase
	for (u32 i = 0; i < uTotalMem / blockSize - 1; ++i)
	{
		if (ppMem[i] != NULL)
			free(ppMem[i]);
	}
 
	delete[] ppMem;
	delete[] pReports;
	return true;
}

The worst-case test gives a sense of the time an allocation can take with this heap system at its worst, and shows how important it is to avoid fragmentation of the memory. Times to perform the test range from 390.5 milliseconds, to complete 1 time, to 1 hour 5 minutes 12 seconds 634.8 milliseconds, to complete 10000 times. In the real world one hopes never to get into a fragmentation predicament like the one set up here, and careful planning of heaps is usually done to help avert this kind of situation.

Since there is no way to limit where Windows XP will allocate memory, and because Windows XP will start page-swapping to the hard drive when it runs out of physical memory, this last test isn’t something that can be fairly compared, so it has been skipped for the standard memory allocator.

While this heap system has its advantages, it can be improved upon in a number of ways. Here are three ideas that spring to mind. First, the heap system could be made thread-safe. Second, small allocations, less than the block size, need to be handled at the sub-block level to avoid wasting space. One way that could be done is to allocate a heap containing a set number of blocks (e.g. 128 blocks, or 512KB), then any time a small request comes in it can be redirected to that heap; this method will require some overhead to track the sub-block allocations (a rough sketch of the idea follows below). Third, this heap system could be improved for debugging by incorporating tracking information to determine where an allocation was made.
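To make that second idea a little more concrete, here is a rough sketch of a small-allocation pool. The names (SmallBlockHeap, POOL_SIZE, SUB_BLOCK_SIZE and so on) are placeholders of my own and not part of the heap system from the series; the pool memory itself would come from the main heap, and the one-byte-per-sub-block usage table is the tracking overhead mentioned above.

#include <cstddef>
#include <cstring>

class SmallBlockHeap
{
public:
	static const size_t POOL_SIZE      = 512 * 1024;	// 128 blocks of 4KB
	static const size_t SUB_BLOCK_SIZE = 64;		// granularity for small requests
	static const size_t NUM_SUB_BLOCKS = POOL_SIZE / SUB_BLOCK_SIZE;

	// pPool points at a 512KB region carved out of the main heap.
	explicit SmallBlockHeap(void* pPool)
		: m_pPool(static_cast<unsigned char*>(pPool))
	{
		memset(m_used, 0, sizeof(m_used));	// one byte of overhead per sub-block
	}

	// Returns NULL if the request is too big for this pool or no sub-block is free.
	void* Allocate(size_t size)
	{
		if (size > SUB_BLOCK_SIZE)
			return NULL;
		for (size_t i = 0; i < NUM_SUB_BLOCKS; ++i)
		{
			if (!m_used[i])
			{
				m_used[i] = 1;
				return m_pPool + i * SUB_BLOCK_SIZE;
			}
		}
		return NULL;
	}

	// Returns true if the pointer belonged to this pool and was released.
	bool Free(void* p)
	{
		unsigned char* pByte = static_cast<unsigned char*>(p);
		if (pByte < m_pPool || pByte >= m_pPool + POOL_SIZE)
			return false;
		m_used[(pByte - m_pPool) / SUB_BLOCK_SIZE] = 0;
		return true;
	}

private:
	unsigned char* m_pPool;
	unsigned char  m_used[NUM_SUB_BLOCKS];
};

A real implementation would likely replace the linear scan with a free list or bitmap, and the main allocator would try this pool first for any request smaller than a block, falling back to the regular block allocator when Allocate returns NULL.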

I hope this series has been helpful, and look forward to hearing all the great ideas other people have on how to improve upon what has been shown here.


Windows XP’s Low Fragmentation Heaps…

After finishing Part 6 of my Ready, Set, Allocate! series, I received a kind suggestion that I examine the Windows Low-Fragmentation Heaps (LFH) and compare allocation times for the heap system developed in my series.

Well, I spent a few hours getting a simple LFH policy set up and working; it really shouldn’t have been that hard. The time spent was mostly due to a couple of weird quirks (read: requirements) Microsoft has made for using an LFH policy, and me just skimming the documentation and forums to learn about it. With Windows 7 and Windows Vista, a process’s heap is created by default with an LFH policy. However, on Windows XP a heap is created with the default look-aside list policy, and you have to change it into an LFH policy using the HeapSetInformation method. But here comes the first quirk.

Most developers I know write and test code using some kind of IDE (typically Visual Studio), where they have a key combination set up to compile, link and then run their shiny new executable. And they typically first run their programs in a Debug configuration so they can debug any issues they run into. But Microsoft’s LFH policy won’t run in any Debug configuration on Windows XP unless you set an environment variable (_NO_DEBUG_HEAP = 1) that turns off the use of the debug memory heaps. Well, that seems counter-productive for testing and debugging memory-based issues!?! And then there’s the environment variable itself, which, if not cleaned out when you’re done, is easily forgotten and will prevent the debug heaps from being used in any other project you run.

A handle to the default heap can be retrieved using GetProcessHeap, or one can be created using HeapCreate. Creating a heap allows for specifying up to 3 flags (1 of which will not work with LFH policies), an initial size, and a maximum size, but there is no option to create it with an LFH policy, so it has to be changed after it is created (a minimal example follows below). In the series of blogs, I created a fixed-size block of memory for allocating memory, which helped me simulate the memory constraints of a limited-memory device like a PSP. But the second quirk is that you can only switch your heap to an LFH policy if you don’t set a maximum size, because the LFH policy will grow the number of pages allocated to the policy whenever it runs out of memory. So trying to simulate a fixed memory size with an LFH policy is not an option.
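For reference, here is a minimal sketch of creating a growable heap and switching it to the LFH policy. It is based on the MSDN documentation for HeapCreate and HeapSetInformation rather than on my actual test code:

#define _WIN32_WINNT 0x0501	// HeapSetInformation requires Windows XP or later.
#include <windows.h>
#include <cstdio>

int main()
{
	// dwMaximumSize must be 0 (growable); a fixed-size heap cannot use the LFH policy.
	HANDLE hHeap = HeapCreate(0, 0, 0);
	if (!hHeap)
		return 1;

	ULONG heapCompatibility = 2;	// 2 selects the low-fragmentation heap.
	if (!HeapSetInformation(hHeap, HeapCompatibilityInformation, &heapCompatibility, sizeof(heapCompatibility)))
	{
		// Remember: under the debugger on XP this fails unless _NO_DEBUG_HEAP=1 is set (see above).
		printf("HeapSetInformation failed: %lu\n", GetLastError());
		HeapDestroy(hHeap);
		return 1;
	}

	// Requests under 512KB are serviced by the LFH; larger ones use the normal policy.
	void* pMem = HeapAlloc(hHeap, 0, 64 * 1024);
	HeapFree(hHeap, 0, pMem);
	HeapDestroy(hHeap);
	return 0;
}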

As a policy, LFH gets applied to an allocation request under certain conditions, and utilizes the process’s existing heap. One of those conditions is that each allocation has to be less than 512KB, so any test values 512KB or greater aren’t using the LFH policy, they are utilizing the same methods previously compared in my blog series.

When test one was run, the choice of what a real world example might be like (64KB allocated 1000 times) went from 66.3 milliseconds without the LFH policy to 8.9 milliseconds with the LFH policy. The heap system I presented in the blog series still operates faster at 0.7 milliseconds.

When test two was run, exhausting 32MB of memory using only 4KB allocations, it also showed an improvement, going from 101.5 milliseconds without the LFH policy to 64.0 milliseconds with the LFH policy, to complete the exhaustion 1 time; and to exhaust the memory 10000 times, it went from 15 minutes 52 seconds 593.3 milliseconds, to 10 minutes 34 seconds 249.2 milliseconds. This still falls significantly short of the heap system I presented, which performed the same tasks in 10.1 milliseconds, to do it 1 time, and 1 minute 39 seconds 995 milliseconds, to do it 10000 times. The best improvement from the implementation of the LFH policy was in the amount of time it took to free the memory for test two, which used to range from 613.6 milliseconds to 1 hour 43 minutes 35 seconds 161.5 milliseconds, and now ranges from 50.8 milliseconds to 7 minutes 36 seconds 251.1 milliseconds. But again the heap system I presented still wins for time to execute with 6.1 milliseconds for one exhaustion, to 1 minute 1 second 891.4 milliseconds for 10000 exhaustions of the memory.

While the heap system I presented is faster, the two main drawbacks at the moment are that it is not thread-safe and that it does nothing to prevent memory fragmentation, two advantages that can be had using the LFH policy. The main advantages of the heap system I presented are that it is cross-platform capable, it runs very fast, and the code is available for modification. I’m grateful to have learned something new regarding the memory heaps on Windows, and am also pleased to see the LFH policy improving the standard allocator on Windows XP, but that isn’t enough to convince me to use it, since I aim to target other platforms as well.

References:

  1. Low-Fragmentation Heap
  2. HeapCreate
  3. HeapSetInformation

User Defined Build Command Macros in MSVS 2010…

Everyone likes to do things their own way, and deciding where to place a project, solution or SDK is no exception.  When setting up a build environment using Microsoft’s Visual Studio, some of the most useful items I’ve found, especially for development teams, are the Build Command Macros (BCMs), e.g. $(ProjectDir) or $(SolutionDir).  They can be used for Include and Lib paths, as parameters to scripts and utilities in Pre and Post Build Events, etc.  BCMs ease the conflicts between people who can’t agree on whether the project should live in their C:\Dev, C:\GameDev or F:\Projects\Games\XBox360 directory.  They also make it easier to change the location when you want to do multiple checkouts on the same machine to run some what-if scenarios simultaneously.  The problem I kept running into was that specifying or naming new BCMs appeared to be reserved solely for Microsoft.

History

In the past, setting up paths for anything that wasn’t pre-defined by a BCM usually came down to asking every dev to set a few environment variables, and this has always seemed, at best, messy to me, especially since it doesn’t provide a nice way to distribute new macros with source control checkouts.  One work-around is to have a batch file that gets hooked in, but then you have to make sure everyone has the proper entry that hooks the batch file in (still messy).  Another alternative is to create Make files and define some macros in the top-level make file that everyone can set to their system settings (better, but now we’ve spread a bunch of make files all over the place, and are requiring the dev to manage the build in multiple places, when simplifying that was part of what the IDE was supposedly built to do for us).

Web searches for ways to create BCMs usually end up in off-topic forums about recording macros.  Responses from other devs using MS Visual Studio, when asked if they knew of a way to create a User Defined Build Command Macro in the Visual Studio IDE, ranged from “No idea” to “Can’t be done.”  Ever the persistent bugger, these answers have never satisfied me.  It has always seemed logical to me that any good IDE with an associated build environment should allow its users to extend the interface.  Recently I discovered that Microsoft does allow the user to define their own BCMs on a per-project basis in Visual Studio 2010, and here is how it is done (I assume the reader is familiar with Visual Studio 2010, and is capable of creating/configuring a simple solution/project).


Select the Property Manager window.

Property Sheet

First, either open your solution/project, or create a new one to test with in Visual Studio 2010.  Then go to the View menu, Other Windows sub-menu, and select Property Manager (if you do not find Property Manager under the Other Windows sub-menu, check directly under the View menu; Cort Stratton reported it there in his Professional edition of Visual Studio 2010).  In the Property Manager window that appears, right-click on the name of your project and select Add New Project Property Sheet from the context menu.  When the dialog appears, enter a name for your property sheet and click OK (I’ve named mine User Defined Property Sheet).  This will add the Property Sheet to all build configurations; you could also create specialized property sheets for each build configuration by right-clicking on them individually and adding a new property sheet to each one.

 

Add a User Macro

Next double-click on the newly created Property Sheet, which will open the Property Pages dialog for the Property Sheet.  In the left-hand tree-view, select User Macros.  In the right-hand pane that appears, click the Add Macro button; this displays the Add User Macro dialog, where you enter the name of your macro as well as its definition and click OK (it gives you the option here to set the macro as an environment variable in the build environment, but why anyone would want to do that, when that is the very thing this is being used to avoid, escapes me).  In my example, I’ve set up a macro called SDK_BOOST, since I’m setting up the Boost library for a test project.  You don’t need to surround your macro name with the dollar-sign and parentheses $(); Visual Studio will add those for you.  Repeat the process of clicking the Add Macro button, entering your macro and clicking OK for any other macros you want.  Then, after you’re done entering macros, click the OK button on the Property Pages dialog.  Note: don’t forget the trailing backslash in your paths.

 


Project Properties VC++ Directories.

Using the User Macro

Now whenever you define a path, parameter, etc. in your Project Properties, where macros are allowed, you can use your macro.  For example in the Project Properties, VC++ Directories, I’ve specified the Include directory for the Boost library using my macro.  One other benefit is that the macro will also appear in the list of available macros when you choose to expand the Edit dialog for properties like the Include directories or Pre or Post Build Events.

 

Wrap up

So this is neat and all, but one of my original issues was that environment variables weren’t something that could be distributed using source control checkouts.  Well, the Property Sheet we created is a .props file that was created in the Project directory, using the name we gave the Property Sheet, and this file can be checked into source control (Tada).  It’s not much different than having a bunch of .make files spread throughout the directories, but this keeps the management in the IDE, hopefully making our lives easier.  An example of what the generated file contains follows below.
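For illustration, the generated property sheet is just a small MSBuild XML file along these lines (the Boost path is only an example of my own, and the exact boilerplate Visual Studio writes may differ slightly):

<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <ImportGroup Label="PropertySheets" />
  <PropertyGroup Label="UserMacros">
    <SDK_BOOST>C:\SDKs\boost\</SDK_BOOST>
  </PropertyGroup>
  <ItemGroup>
    <BuildMacro Include="SDK_BOOST">
      <Value>$(SDK_BOOST)</Value>
    </BuildMacro>
  </ItemGroup>
</Project>

Anyone who syncs the project picks the macro up automatically, and if their SDK lives somewhere else they only have to edit this one file (or override it with a property sheet of their own).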
