Topic : Performance Programming Applied to C++
Author : Joris Timmermans

Be in the Fast Class

Fast classes: the holy grail of C++ programming. Any of you who followed my "C++ operator overloading" thread know why I'm writing this.

I'll use the same example - a 3d vector class - because it's so applicable to our particular industry. Also note that the next rant is almost an exact description of my adventures over the past few weeks writing a vector class. I was actually making these mistakes up to a month ago!

Say you need this vector class, because you're doing a lot of vector math, and writing it out each time is simply too much work. You want to improve your coding efficiency, without sacrificing too much of the speed, so you decide to make a vector class, CVector3f (3f for 3 floats). You want to implement a few operators (+, -, *), making use of that great C++ feature of operator overloading, because you know it will improve the readability and maintainability of your code.

In your first draft, you've quickly implemented a constructor, copy constructor, destructor and three operators. You haven't paid special attention to performance, and haven't used inlining, but simply put your declaration in the header file, and the implementation in the .cpp file.

So what can you do to make it faster? Well, one thing I've already suggested, that's inlining the class functions in your header file. It will remove the overhead of a function call for those member functions that the compiler manages to inline. It probably won't make that much difference in your execution speed for large functions, though it will be noticeable in this vector class because the functions are so small.

Another thing to consider: do we really need the destructor? The compiler can generate an empty destructor for you, and it's at least as efficient as the one you've written. In our vector class, we have nothing that explicitly needs destruction, so why waste programming time on it?

The operators can probably be sped up as well. Chances are, you've written your operators like this:

CVector3f operator+( CVector3f v )
{
   CVector3f returnVector;
   returnVector.m_x = m_x + v.m_x;
   returnVector.m_y = m_y + v.m_y;
   returnVector.m_z = m_z + v.m_z;

   return returnVector;
}

There's so much hidden redundant code in this function that it almost makes me queasy. Think about it: the first line declares and constructs a temporary variable. That means the default constructor for this object gets called, but we don't NEED it to be initialized, because we're going to assign all new values anyway.

The return at the end is similar - returnVector is a local variable, so it cannot be returned directly. Instead, the copy constructor is called on it, something that takes quite a bit of processor time relative to a function as small as this one. A more insidious one is the parameter: it is passed by value, so it is a copy of the original argument - yet another copy constructor call.

What if we wrote another constructor, one with three arguments for x, y and z, and used it as follows:

CVector3f operator+( const CVector3f &v ) const
{
   return CVector3f( m_x + v.m_x, m_y + v.m_y, m_z + v.m_z );
}

That saves two copy constructor calls, and it makes a difference. Notice that I've also added the const keywords. Not a speed improvement, but certainly a safety improvement. A more compiler-internal point to make here is that the function as written allows the compiler to more easily make its own optimizations as well. There are very few assumptions in this function; it's all very explicit, making it a very likely candidate for inlining or other, more complex optimizations such as the "return value optimization" (see the references at the end of this article for more information).

The point I am trying to make here, is that there can be a lot of "hidden" overhead in C++ code. Constructors/Destructors, and the way inheritance and aggregation work, can make a simple-looking function perform a lot of complex initialization behind the scenes. Knowing when this occurs, and how to avoid or reduce its effects, is a very important part of learning how to write C++ with no surprises.

Know your language, it can only help you.

Part 4 : Digging Deeper

So now you have a pretty fast C++ class, but you're still not happy. Time to go even deeper.

1. Loop optimizations
Loop unrolling used to be a "big thing". What is it? Well, some loops can simply be written out in full. Consider:

for( int i = 0; i < 3; i++ ) array[i] = i;

this is logically the same as

array[0] = 0; array[1] = 1; array[2] = 2;

The second version is slightly faster, because no loop has to be set up - the initialization and incrementing of i take some time. Most compilers can already do this for you, though, so in most cases you won't get much gain, just a lot of code bloat. My best advice here: if you can't find anything else to speed up, try it, but don't be surprised if it doesn't make a difference.

2. Bit shifting
Bit shifting works for integers only. It is basically a way to multiply or divide by a power of two in a way that's much faster than a straight multiplication (and CERTAINLY faster than a division).

To understand how to use it, think of it using these formulae:

x << y  =  x * 2^y
x >> y  =  x / 2^y

I think André LaMothe made a big deal of this in his "Tricks of the Game Programming Gurus" books; that's probably where you heard about it. It's where I heard about it, anyway. In some cases, it can be very rewarding indeed. Consider the following (simplistic) code:

i *= 256;

versus

i = i << 8;

Logically, they are the same. For this simple example, the compiler might even turn the first into the second, but as the expressions get more complex ( i = ( i << 8 ) + ( i << 4 ) is equivalent to i *= 272, for example ) the compiler might not be able to make the conversion for you. Note the parentheses: << binds less tightly than +, so they are required here.

3. Pointer dereference hell

Do you have code like this in your program?

for( int i = 0; i < numPixels; i++ )
   rendering_context->back_buffer->surface->bits[i] = some_value;

The exaggeration probably makes the problem stand out. This is a long loop, and all that pointer-indirection is going to eat more time than Homer eats donuts.

You might think this is a contrived example, but I've seen a lot of code that looks like this in code released on the 'net.

Why not do this?

unsigned char *back_surface_bits = rendering_context->back_buffer->surface->bits;
for( int i = 0; i < numPixels; i++ )
   back_surface_bits[i] = some_value;

You're avoiding a lot of dereferencing here, which can only improve speed, and that's a good thing!

Goltrpoat pointed out to me that it could be faster still; here's his very valid suggestion:

unsigned char *back_surface_bits = rendering_context->back_buffer->surface->bits;
for( int i = 0; i < numPixels; i++,back_surface_bits++ )
   *back_surface_bits = some_value;

The previous item was just a special (albeit frequent) case of the following:

4. Unnecessary calculations within loops.

Consider this loop:

for( int i = 0; i < numPixels; i++ )
{
   float brighten_value = view_direction*light_brightness*( 1 / view_distance );
   back_surface_bits[i] *= brighten_value;
}

The calculation of brighten_value is not only expensive, it's unnecessary to repeat. The calculation is not influenced by anything that happens within the loop, so you can simply move it outside the loop, and keep re-using that value inside the loop.

This problem can occur in other ways too - unnecessary initialization in functions that you call within the loop, or in object constructors. Be careful when you code; always ask yourself, "do I really need to do this?"

5. Inline assembler

The last resort: if you really, REALLY know what you are doing, and why it will be faster, you can use inline assembler, or even pure assembler with C-style linkage so it can be called from your C/C++ program. However, if you use inline assembler, you'll either have to work with conditional compilation (testing whether the processor you're writing assembler for is supported on the platform you are compiling for), or give up on source compatibility with other platforms. For straight 80x86 assembler you probably won't mind much, but once you get to MMX, SSE, or 3DNow! instructions, you are limiting your possibilities.

When you get this low, a disassembler may be useful too. You can instruct most compilers to generate intermediate assembler code, so you can browse through it to see if you can improve that functionality's efficiency by hand.

Again, that's really a question of knowing your tools. In Visual Studio, you can do this using the /FA and /Fa compiler switches.

Part 5 : Math Optimizations

1. Rewriting simple maths expressions for
