Labs: iTorque2D vs. Cocos2D Performance, Part 2

April 11, 2011 by
Categories: Labs

NOTICE: Apologies to GarageGames, but they have shown that you can get 49fps, not 1fps if you do it right. Please check out this article for more details.

I have just completed documenting and comparing the rendering subsystems of iTorque2D and Cocos2D. What started as a complex foray into graphics performance optimization took some adventurous twists and turns.

Phase 1: Rendering

Rendering systems can’t really be that different, and I found nothing major that sets the two apart. Both have scene graphs, and both sort the scene before display. Each iterates over its children and renders the objects using OpenGL. Most of the OpenGL calls are nearly identical.

So what causes one rendering system to run 100 sprites at 1fps, and the other at 60fps, if they are both so similar?

At first I thought texture thrashing was to blame, but a bit of testing proved this wrong. My assumption was that glBindTexture() was constantly swapping textures in and out on the GPU and hogging it. It’s hard to find specific information on the performance of glBindTexture(), especially on iOS. While I was almost positive that major texture thrashing was occurring in iTorque2D, I needed to test this.

I went ahead and set only one texture to be used in the scene, then ran the scene on the device. glBindTexture() was still called for each sprite. My hypothesis was that this would be faster. Wrong. It had no effect on performance, which remained poor.

Then I added a small optimization. With all textures but one removed from the world, I only call glBindTexture() if the texture does not equal a global ‘currentGLName’ variable. This means that glBindTexture() is called only once over the entire course of rendering, which should effectively rule out texture thrashing. Wrong again. Calling glBindTexture() only once had no effect on performance.
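The ‘currentGLName’ check can be sketched roughly as follows. This is a minimal illustration, not the actual iTorque2D change: FakeGL is a made-up stand-in for the driver that only counts bind calls (so the sketch compiles without OpenGL headers), and TextureBindCache and countBindsForSprites are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical stand-in for glBindTexture(); it only counts calls so the
// sketch compiles without OpenGL headers.
struct FakeGL {
    int bindCalls = 0;
    void bindTexture(std::uint32_t /*name*/) { ++bindCalls; }
};

// Remembers the last bound texture name and skips redundant binds.
class TextureBindCache {
public:
    explicit TextureBindCache(FakeGL& gl) : gl_(gl) {}

    void bind(std::uint32_t name) {
        if (hasBound_ && name == currentGLName_) {
            return;  // already bound, nothing to do
        }
        gl_.bindTexture(name);
        currentGLName_ = name;
        hasBound_ = true;
    }

private:
    FakeGL& gl_;
    std::uint32_t currentGLName_ = 0;
    bool hasBound_ = false;
};

// Binds the same texture for every sprite and reports how many binds
// actually reached the (fake) driver.
int countBindsForSprites(int spriteCount, std::uint32_t textureName) {
    FakeGL gl;
    TextureBindCache cache(gl);
    for (int i = 0; i < spriteCount; ++i) {
        cache.bind(textureName);
    }
    return gl.bindCalls;
}
```

With a single texture in the scene, 100 sprites result in one bind, which matches the "called only once" situation described above.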

So what was the issue? I didn’t know, but I suspected I needed to batch draw calls. One difference between the engines is that iTorque2D has no equivalent of CCBatchNode.

So if batching was the issue, I assumed that if I removed the CCBatchNodes from my Cocos2D test, then I would get the same problem as iTorque2D in Cocos2D. This is exactly what I did. I created another CCScene that did not use CCBatchNodes. The result? Performance did not change. CCBatchNodes really had no impact on frame rate, at least up to 300 sprites. The layering also shows that the drawing is not in order, so there is no way it could be batched.

Instruments screenshot showing OpenGL performance without CCBatchNode. 60FPS with 300 sprites

So I’ve ruled out texture thrashing and batching as the performance issue. At this point I was stumped, since these were my initial assumptions about the bottleneck. I had to look for other possible causes, but since my initial assumptions were wrong, I first wanted to verify that something I had ruled out early on really was not a problem: TorqueScript.

Phase 2: TorqueScript

TorqueScript has a reputation for being less than ideal. It is probably my least favorite part of iTorque2D; I wish they had gone with Lua. From an initial look at my TestSpritesBehavior, I had assumed it would not be a performance issue. All it did was loop over 100 sprites and pong them around.

My first test was to remove the loop that updated the scene. Doing so brought the fps up to 60!! This was really interesting. With 300 sprites the rendering ran at 30fps. My general impression was that, without any TorqueScript moving sprites around, iTorque2D delivered about 50% of the performance of Cocos2D.

I could tell that as I slowly added more TorqueScript, the fps went down. But was there any specific code causing specific slowdowns? I wanted to test this more thoroughly, so I set about creating more tests. I was quickly thwarted by a small but consequential bug in TorqueScript. Up to this point I had been using FPS as my benchmark. Now that I was testing script, I wanted to measure execution time in milliseconds. It turns out that there were serious issues with getRealTime(), which returns an unsigned long that somehow gets mushed into a signed long, which means I get this awesome arithmetic:

echo("[CH] time for loop:" SPC %start SPC %endTime SPC "diff:"SPC (%endTime - %start));
[CH] time for loop: 106383923 106383925 diff: 8
[CH] time for loop: 195538583 195538584 diff: 16

I banged my head against this for a while. I was stumped because, as far as I could tell, there is no such thing as an unsigned long in TorqueScript! Additionally, the values are all way below the 2 billion mark where a signed 32-bit integer overflows, so there must have been some bit offset problem somewhere deep down. I came up with something like this awesome solution:

U32 uMills = Platform::getRealMilliseconds();
S32 sMills = uMills - 107946204; // This being a random getRealMilliseconds() grabbed from output.
return sMills;

So with that I could run some basic performance tests!
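One possible reading of the odd diffs, offered as an assumption rather than something verified against the engine source: the logged differences of 8 and 16 are exactly what you get if both timestamps are rounded to 32-bit floats before the subtraction. A float has a 24-bit significand, so whole numbers between 2^26 and 2^27 can only land on multiples of 8, and between 2^27 and 2^28 on multiples of 16. It would also explain why subtracting a large constant helps, since small values fit in a float exactly. The helper names below are hypothetical.

```cpp
#include <cstdint>

// Difference of two millisecond timestamps after each has been rounded to a
// 32-bit float. Assumption: this mimics a script VM that does its arithmetic
// in single precision; it is not taken from the Torque source.
float diffAsFloat(std::uint32_t startMs, std::uint32_t endMs) {
    float start = static_cast<float>(startMs);
    float end = static_cast<float>(endMs);
    return end - start;
}

// Same subtraction, but both timestamps are first rebased against a fixed
// offset so the values are small enough for a float to hold exactly.
float diffRebased(std::uint32_t startMs, std::uint32_t endMs,
                  std::uint32_t baseMs) {
    return diffAsFloat(startMs - baseMs, endMs - baseMs);
}
```

Feeding in the two logged pairs, diffAsFloat(106383923, 106383925) gives 8 and diffAsFloat(195538583, 195538584) gives 16, while diffRebased with a base of 106000000 gives the true difference of 2 for the first pair.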

TorqueScript Performance Tests

EDIT: Here is the Torque Project for the following tests. Note that you will need to figure out how to fix the getRealTime bug for these to give useful numbers. These tests were run on a first gen iPad. Your numbers may differ slightly from those below; your mileage may vary!

Running an empty for loop 10,000 times:

%l = 10000;
for(%i=0;%i<%l;%i++){
}

Execution time average: 9-10ms.

Running a for loop over 100 pre-existing objects, and accessing that object:

%l = %this.sprites.getCount();
for(%i=0;%i<%l;%i++){
    %sprite = %this.sprites.getObject(%i);
}

Execution time average: 2-3ms.

Running a for loop over 100 pre-existing objects, accessing that object, then accessing two values of that object, and parsing the point information on that object:

%l = %this.sprites.getCount();
for(%i=0;%i<%l;%i++){
    %sprite = %this.sprites.getObject(%i);
    %pos = %sprite.position;
    %vel = %sprite.velocity;
    %posX = getWord(%pos, 0);
    %posY = getWord(%pos, 1);
    %velX = getWord(%vel, 0);
    %velY = getWord(%vel, 1);
}

Execution time average: 9-11ms.

Running a full pong-style loop on 100 sprites:

%l = %this.sprites.getCount();
for(%i=0;%i<%l;%i++){
    %sprite = %this.sprites.getObject(%i);
    %pos = %sprite.position;
    %vel = %sprite.velocity;
    %posX = getWord(%pos, 0);
    %posY = getWord(%pos, 1);
    %velX = getWord(%vel, 0);
    %velY = getWord(%vel, 1);
    // [CH] Move it.
    %posX += %velX;
    %posY += %velY;
    // [CH] Clamp it.
    %posX = t2dGetMax(-512, t2dGetMin( %posX, 512));
    %posY = t2dGetMax(-384, t2dGetMin( %posY, 384));
    // [CH] Set inverse velocity if at world bounds.
    if(%posX >= 512 || %posX <= -512){
        %sprite.velocity = -%velX SPC %velY;
    }
    if(%posY >= 384 || %posY <= -384){
        %sprite.velocity = %velX SPC -%velY;
    }
    // [CH] Set position.
    %sprite.position = %posX SPC %posY;
}

Execution time average: 20-21ms.

Accessor Testing

I did some basic comparative testing of built-in getters/setters vs. dynamic fields. Each test just gets a value in Torque and assigns it to a local variable.

It takes 17ms to get a dynamic field property on 500 objects (17ms / 500 is about 34µs per access). Tests show no difference whether the object is a sprite or a SimObject.

It takes 20ms to get the ‘position’ variable, which is a getter that generates a position var on the fly. There is additional logic within the getter, so this takes a bit longer (about 40µs per call).

iTickable Double Whammy

Another interesting thing to note is how iTickable works, and the behavior that I see. When I run the above loop over 100 sprites, the frame rate drops to 1fps. What I’m seeing is that even though the screen updates only once per second, the onUpdate() function is still being called 60 times!

From what I can surmise, the issue is that Torque uses a fixed time step and attempts to ‘catch up’ by executing multiple ticks, even though the screen is only refreshing once. This means that Torque is running a 20ms loop *sixty* times per visible frame. The astute observer will notice that 60 × 20ms is 1.2 seconds, which is greater than a second, so the reported fps is most likely an incorrect value. This also explains the following pattern:

[CH] FPS: 10.4
[CH] FPS:  5.4
[CH] FPS:  3.8
[CH] FPS:  2.9
[CH] FPS:  2.4
[CH] FPS:  2.1
[CH] FPS:  1.9

At the beginning of execution, the fps drops steadily until it hits 1.1 or so. My current assumption is that this is related to the ‘catching up’ aspect of the scheduler: since the achieved frame rate is always below the requested fps, it can never catch up. End result? Poor performance becomes even worse. This would most likely never be an issue in a real-world game, because you would never purposefully overload a game like this.
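The ‘catch up’ behavior can be illustrated with a small model. This is a sketch under assumptions, not iTorque2D’s actual scheduler: the name simulateFixedStep, the render cost, and the cap on ticks per frame are all made up for illustration. Each frame runs as many fixed ticks as the elapsed time calls for, and the time those ticks take becomes the backlog for the next frame.

```cpp
#include <cstdint>
#include <vector>

struct FrameStats {
    int ticks;             // simulation ticks run before this frame was drawn
    std::int64_t frameUs;  // total time the frame took, in microseconds
};

// Hypothetical model of a fixed-time-step loop that catches up on missed
// ticks. All times are in microseconds.
std::vector<FrameStats> simulateFixedStep(std::int64_t tickIntervalUs,
                                          std::int64_t tickCostUs,
                                          std::int64_t renderCostUs,
                                          int maxTicksPerFrame,
                                          int frames) {
    std::vector<FrameStats> out;
    std::int64_t accumulatorUs = tickIntervalUs;  // one tick is due at start
    for (int f = 0; f < frames; ++f) {
        int ticks = 0;
        // Run every tick that is due, up to the cap.
        while (accumulatorUs >= tickIntervalUs && ticks < maxTicksPerFrame) {
            accumulatorUs -= tickIntervalUs;
            ++ticks;
        }
        if (ticks == maxTicksPerFrame) {
            accumulatorUs = 0;  // cap reached: drop the remaining backlog
        }
        // The work done this frame is the time that elapses before the next.
        std::int64_t frameUs = ticks * tickCostUs + renderCostUs;
        accumulatorUs += frameUs;
        out.push_back({ticks, frameUs});
    }
    return out;
}
```

With a 16,667µs tick interval, a 20,000µs tick cost, an assumed 5,000µs render cost, and a cap of 60, the model starts at one tick per frame and climbs until it hits the cap, at which point each frame takes 60 × 20ms + 5ms = 1.205 seconds. That is the same shape as the falling FPS log above, though the exact numbers depend on the assumed costs.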

Cocos2D uses a variable time step, so if there is very poor performance, or stutter, it does not feel the need to ‘catch up’. All movement can (and should) be scaled by the tick’s delta to keep motion smooth, regardless of CPU saturation. I wonder if iTorque2D’s rationale for a fixed time step is to provide a more stable rigid-body simulation, since these do not fare well under variable time steps.

So ultimately my search for graphics optimizations comes to an unlikely conclusion in TorqueScript. So what to optimize next, and where? That may be a topic for another post!
