Topic : C++ Performance
Author : Glen McCluskey & Associates LLC


Part I

Handling a Common strcmp() Case


In this section of the newsletter we will present some practical performance tips for improving code speed and reducing memory usage. Some of these tips will be useful only for C++ code and some will be more general and applicable to C or other languages.


As a first example, consider an application using C-style strings and functions such as strcmp(). A recent experience with this sort of application involved a function that does word stemming, that is, one that takes words such as "motoring" and reduces them to their root stem, in this case "motor".


In profiling this function, it was observed that much of the overall time was being spent in the strcmp() function. For the C++ compiler in question (Borland 3.1), this function is written in assembly language and is quite fast, and attempts to speed it up by unrolling the equivalent code locally at the point of function call will typically result in slowing things down.


But it's still the case that calling a function, even one implemented in assembly language, has some overhead, which comes from saving registers, manipulating stack frames, actual transfer of control, and so on. So it might be worth trying to exploit a common case -- the case where you can determine the relationship of the strings by looking only at the first character.


So we might use an inline function in C++ to encapsulate this logic:


        inline int local_strcmp(const char* s, const char* t)
        {
                return (*s != *t ? *s - *t : strcmp(s, t));
        }


If the first characters of each string do not match, there's no need to go further by calling strcmp(); we already know the answer.


Another way to implement the same idea is via a C macro:


        #define local_strcmp(s, t) ((s)[0] != (t)[0] ? (s)[0] - (t)[0] : \
                strcmp((s), (t)))


This approach has a couple of disadvantages, however. Macros are hard to get right because of the need to parenthesize arguments so as to avoid subtly wrong semantics. Writing local_strcmp() as a real function is more natural.
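

To see why the parenthesization matters, consider a hypothetical version of the macro written, like the inline function, in terms of dereferences, but without parentheses around its arguments:


        /* hypothetical version for illustration -- arguments not parenthesized */
        #define bad_strcmp(s, t) (*s != *t ? *s - *t : strcmp(s, t))


A call such as bad_strcmp(p + 1, q) then expands so that the first test reads *p + 1 != *q; this compiles cleanly but compares *p + 1 against *q rather than p[1] against q[0], quietly giving the wrong answer. A real function has no such trap.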


And macros are less likely to be understood by development tools such as browsers or debuggers. Inline functions are also a source of problems for such tools, but they at least are part of the C++ language proper, and many C++ compilers have a way of disabling inlining to help address this problem.


How much speedup is this approach good for? In the word stemming program, for input of about 65000 words, the times in seconds were:


        strcmp()                9.7
        inline local_strcmp()   7.5
        #define local_strcmp()  7.5

or a savings of about 23%. Obviously, this figure will vary with the compiler and the application.


This particular speedup is achieved by exploiting a common case -- the case where the first letters of two strings are different. For applications involving English words, this is often a good assumption. For some other types of strings, it may not be.




Part II

Handling Lots of Small Strings With a C++ Class


In this issue's performance tips we will present a complete C++ class along with its implementation code, which is appended to the end of the discussion.


This C++ example addresses a common problem in applications that use a lot of small strings: the overhead associated with allocating each string via malloc() (in C) or operator new() (in C++). Typically, this overhead is 8-16 bytes per string. And allocating and then deallocating many small blocks tends to fragment memory.


The Copy class deals with the problem by internally allocating large blocks and then shaving off small chunks for individual strings. It keeps track of all the large blocks allocated and deallocates them when a given Copy object is no longer needed. To use this system, you would allocate a Copy object for each major subsystem in your application that uses small strings. For example, at one point in your application, you might need to read in a dictionary from disk and use it for a while. You would allocate a Copy object and then use it to allocate the strings for each word, then flush the strings all at once.
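

In outline, that usage pattern might look something like the sketch below. The use_dictionary() function and its sizes are hypothetical; the Copy class it assumes is the one declared at the end of this discussion:


#include <stdio.h>
#include <string.h>

// hypothetical usage sketch, assuming the Copy class declared below
void use_dictionary(FILE* fp)
{
        Copy pool;                              // pool for this subsystem
        char* words[5000];                      // pointers into the pool
        char buf[128];
        int n = 0;

        while (n < 5000 && fgets(buf, sizeof(buf), fp) != NULL) {
                buf[strcspn(buf, "\n")] = '\0'; // strip the newline
                words[n++] = pool.copy(buf);    // pooled copy of the word
        }

        // ... use words[0] through words[n - 1] for a while ...

}       // pool is destroyed here and all of its strings are flushed at once


The pooled strings live exactly as long as the Copy object that handed them out, which is what makes the flush-all-at-once pattern work.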


In the application that this class was devised for, implementing string copying in this way saved 50K out of a total available memory pool of 500K. This is with Borland C++, which rounds the number of bytes requested for a string up to the next multiple of 16, for an average wastage of 8 bytes per string. Since the Copy class allocates 1024-byte chunks, on average 512 bytes will be wasted for a given set of strings, so the breakeven point is 512 / 8 = 64 or more strings.


There are many variations on this theme. For example, if you are certain that the strings will never be freed, then you can simply grab a large amount of memory and shave chunks off of it, without worrying about keeping track of the allocated memory. Or if you have many objects of one class, such as tree nodes, you can overload operator new() for that class to do a similar type of thing.
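

As a rough sketch of the operator new() variation, a hypothetical tree-node class might carve its nodes out of large blocks like this:


#include <stddef.h>

// hypothetical node class; nodes are carved from large blocks
// and, in this simple scheme, never freed individually
struct Node {
        Node* left;
        Node* right;
        int value;

        void* operator new(size_t sz)
        {
                static char* block = 0;         // current large block
                static size_t nleft = 0;        // bytes left in it

                if (nleft < sz) {               // grab another block
                        block = ::new char[512 * sz];
                        nleft = 512 * sz;
                }
                void* p = block;                // shave off one node
                block += sz;
                nleft -= sz;
                return p;
        }

        void operator delete(void*) { }         // individual frees ignored
};


Because the blocks come from the global operator new() and are carved in multiples of the object size, alignment is preserved, unlike the byte-aligned Copy allocator described below.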


Note that this particular storage allocator is not general. The allocated storage is aligned on 1-byte boundaries, so trying to allocate objects other than character strings may result in performance degradation or a memory fault (such as "bus error" on UNIX systems). And the performance gains of course decline somewhat with large strings, while the wastage increases from stranding unused parts of the 1024-byte chunks.


This same approach could be used in C or assembly language, but C++ makes it easier and encourages this particular style of programming.


An example of usage is included. A dictionary of 20065 words with a total length of 168K is read in. Without the Copy class it requires 354K, a 111% overhead. With the Copy class it takes 194K, an overhead of 15%. This is a difference of 160K, or 8 bytes per word. The results will of course vary with the particular operating system and runtime library. And the Copy version runs about 20% faster than the conventional version on a 486 PC.


The driver program that is included will work only with Borland C++; with another compiler you will need to write some other code to emulate its logic.

#include <string.h>
#include <assert.h>



const int COPY_BUF = 1024;              // size of buffer to get
const int COPY_VEC = 64;                // starting size of vector


class Copy {

        int ln;                                 // number of buffers in use
        int maxln;                              // max size of vector
        char** vec;                             // storage vector
        int freelen;                            // length free in current
        char* freestr;                          // current free string

public:

        Copy();                                 // constructor
        ~Copy();                                // destructor
        char* copy(char*);                      // make a pooled copy of a string
};
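

Based on the comments above and on the earlier description, a rough sketch of how the three member functions might be implemented follows. It is only a sketch -- the actual implementation appended to the discussion may differ in its details, for example in how it handles strings longer than COPY_BUF:


Copy::Copy()
{
        ln = 0;
        maxln = COPY_VEC;
        vec = new char*[maxln];                 // initial vector of block pointers
        freelen = 0;
        freestr = 0;
}

Copy::~Copy()
{
        for (int i = 0; i < ln; i++)            // free every large block
                delete [] vec[i];
        delete [] vec;
}

char* Copy::copy(char* s)
{
        assert(s != 0);

        int len = strlen(s) + 1;                // include the terminating '\0'
        assert(len <= COPY_BUF);

        if (len > freelen) {                    // current block too small
                if (ln == maxln) {              // grow the vector of block pointers
                        char** nv = new char*[maxln * 2];
                        memcpy(nv, vec, maxln * sizeof(char*));
                        delete [] vec;
                        vec = nv;
                        maxln *= 2;
                }
                freestr = new char[COPY_BUF];   // any remainder of the old block
                freelen = COPY_BUF;             // is simply stranded
                vec[ln++] = freestr;
        }

        char* p = freestr;                      // shave the string off the block
        memcpy(p, s, len);
        freestr += len;
        freelen -= len;
        return p;
}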

