scifista42

Display speed in software-rendering ports


I'm crossposting the part of my display speed analysis post from the Everything Else forum that relates to Doom ports, hoping for more feedback.

Basically, I've measured that if I define truecolor pixel data as an array of integers, merely updating the pixels within a windowed SDL application at a screen resolution of 800*600 takes 2.7 milliseconds on average. Is that fast, or is it too slow for an application with a software renderer whose speed is comparable to Doom's?
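A minimal sketch of that kind of measurement in SDL2 might look like this (assuming a 32-bit window surface; the fill pattern and frame count are arbitrary, and only the update call itself is timed):

#include <SDL.h>
#include <stdio.h>

int main(void) {
    SDL_Init(SDL_INIT_VIDEO);
    SDL_Window *win = SDL_CreateWindow("display bench", SDL_WINDOWPOS_CENTERED,
                                       SDL_WINDOWPOS_CENTERED, 800, 600, 0);
    SDL_Surface *surf = SDL_GetWindowSurface(win);

    /* fill the surface once with an arbitrary pattern (not timed) */
    for (int y = 0; y < surf->h; y++) {
        Uint32 *row = (Uint32 *)((Uint8 *)surf->pixels + y * surf->pitch);
        for (int x = 0; x < surf->w; x++)
            row[x] = 0xFF000000u | (Uint32)(x * y);
    }

    /* time only the display update itself */
    const int frames = 200;
    Uint64 t0 = SDL_GetPerformanceCounter();
    for (int i = 0; i < frames; i++)
        SDL_UpdateWindowSurface(win);
    Uint64 t1 = SDL_GetPerformanceCounter();

    printf("%.2f ms per update\n", 1000.0 * (double)(t1 - t0) /
           (double)SDL_GetPerformanceFrequency() / frames);

    SDL_DestroyWindow(win);
    SDL_Quit();
    return 0;
}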

Let's say that I want to achieve a stable frame rate of 35 FPS. 1/35 of a second = 28.57 milliseconds. This means that I have 28.57 milliseconds to process a frame, which I want to do fully with my own software-rendering algorithm. Out of those 28.57 milliseconds, 2.7 milliseconds are needed just for the pure display part, to actually update the pixels in the window. That's roughly 9% of the available time. Will the rest be enough?

Compare it to Doom, specifically to modern Doom source ports with software rendering, given the same frame rate (35 FPS) and screen size (800*600). Let's disregard the fact that my renderer is truecolor and Doom's is 8-bit and compare just pure speed. I expect my software renderer to be several times slower than Doom's, meaning it takes several times longer to process a single frame before displaying it. My concern and question is: do I have a chance to maintain 35 FPS under these circumstances? When a modern Doom source port processes a frame, does it take less than 2.7 milliseconds to update the display in the window? Does it take less than 9% of the total available time, and does the actual software-rendering algorithm take more or less than 91% of said available time to render a frame to the buffer (before it's displayed)?

I hope I'm making myself clear enough. I'm merely concerned about speed; it's actually irrelevant what exactly my "software renderer" will be doing (perhaps something completely different from a 2.5D graphics engine), but that doesn't matter when we're discussing just display speed, which is what I want to do. I'd be glad if anyone experienced could tell me whether the display speed I've achieved is fast enough for realtime rendering, assuming that further software-rendering code will have to be executed every frame. In comparison with Doom, or otherwise, I'd just like to get a realistic outlook on the problem.


CPU time in Doom is spent on average pretty equally between game code and rendering, and if rendering alone is considered, overheads like sorting the sprites and traversing the BSP tree can easily dominate the "rendering" time, so worrying about having the fastest possible frame buffer is kinda moot. Doom in particular spends a lot of time placing pixels on the screen one by one with minimal optimization or acceleration, and calculations/putting pixels are split almost equally within rendering time. Compared to those, the final framebuffering operation that "throws" everything to the screen is almost negligible.

Sure, it doesn't hurt to have it go as fast as possible, but even if the framebuffer update took zero time, you wouldn't get a significant total speed increase.

If anything, the figure you quoted does place a theoretical upper limit on how many frames per second can be displayed at a given resolution (assuming that EVERYTHING ELSE takes zero time), which in your case would be about 370 fps (1000 ms / 2.7 ms).

In any case, I'd say you have more than enough leeway, UNLESS the other parts of the engine need way more than (28.57 - 2.7) ≈ 25.9 ms per frame to process (as can happen in Doom in very complex or very large open levels). In that case, you should really be seeking to optimize other parts of your engine or, if its design permits it, move to OpenGL or Direct3D (though it's possible to misuse them and actually lose performance).

Maes said:

CPU time in Doom is spent on average pretty equally between game code and rendering, and [...] calculations/putting pixels are split almost equally within rendering time.

Thanks. At first this sounded very positive to me... But wait, you're talking about vanilla, aren't you? You basically said that display in Doom takes 25% of the frame processing time, and the remaining 75% is renderer + game code. In other words, display takes up to 7 milliseconds, and renderer + game code take up to 21 milliseconds.

However, there are several caveats. Firstly, vanilla Doom has a resolution of 320*200, not the 800*600 my test used. Secondly, vanilla Doom uses dated display methods, which I wouldn't compare with SDL on modern platforms. And thirdly, most frighteningly: I suppose that Doom's game code + renderer is as fast as possible, and we agreed that it (without the pixel display) takes 21 milliseconds at a resolution of 320*200. As I said about my project, I expect my code (equivalent to game code + renderer together) to be several times slower than Doom's, given the same screen resolution and intended frame rate. This means that I might be screwed in the end, because I might end up needing far more than 21 milliseconds for the code to execute. That has nothing to do with display, though. I will see.

Maes said:

or, if its design permits it, move to OpenGL or Direct3D (though it's possible to misuse them and actually lose performance).

The critical, fundamental aspect of my idea, the one that gives it a reason to exist at all, is that the entire screen's pixel data must be represented AND software-processed as an array of data representing the RGB colors of the pixels on the screen. With that in mind, do you think it makes sense to consider using OpenGL/Direct3D anyway?

scifista42 said:

Thanks. At first this sounded very positive to me... But wait, you're talking about vanilla, aren't you? You basically said that display in Doom takes 25% of the frame processing time, and the remaining 75% is renderer + game code. In other words, display takes up to 7 milliseconds, and renderer + game code take up to 21 milliseconds.


These are all empirical approximations of course, and it's possible to make pathological maps where they are violated (e.g. one that goes beyond 100% CPU usage in game logic alone, or one that takes far more time to render than to process game logic). The pixel-drawing part, however, due to how the renderer works, is pretty predictable and increases linearly with resolution, as every pixel you see on the screen is drawn exactly once per frame. The only major exception to this is sprite overdraw, which can also be driven to pathological extremes (e.g. blocking the player's view with an infinite queue of pinkies).

scifista42 said:

However, there are several caveats. Firstly, vanilla Doom has a resolution of 320*200, not the 800*600 my test used.


That only changes the pixel-rendering part proportionately (more pixels to draw); however, having a larger visible screen area also means that the engine will "decide" to render more detail, so there might be a secondary, much smaller effect. But that is quantifiable: e.g. where 320x200 showed 65 visplanes, a higher resolution might show 67 in the same scene.

scifista42 said:

Secondly, vanilla Doom uses dated display methods, which I wouldn't compare with SDL on modern platforms.


That "dated" method is also the fastest and simplest for arbitrary rendering: simply putting a pixel exactly where you want it. Doom in particular doesn't benefit much from rendering primitives, area filling, scrolling, fixed-scale tiles or sprites. If you application however does benefit from them, by all means, use them. However, at best they will simplify/speed up your user-rendering code, not the final framebuffer transfer speed. For that, you can only hope that the OS/the libraries etc. that you're using implement it in the best way possible, and don't slow you down with too much user friendly/data-safe "fluff".

scifista42 said:

And thirdly, most frighteningly: I suppose that Doom's game code + renderer is as fast as possible, and we agreed that it (without the pixel display) takes 21 milliseconds at a resolution of 320*200.


It's pointless to quote precise times without referring to a specific hardware and software configuration, which includes CPU, RAM, video hardware, OS, source port being used, WAD being played, etc. Doom ideally takes 1/35 of a second, or about 28.6 ms, per frame rendered, but that doesn't mean it always uses all of that time.

Some of it, like pushing the frame buffer to the display, will indeed take a fixed amount of time (the 2.7 milliseconds we mentioned, though it could be another number too). Game processing and rendering must be accommodated in what's left.

If you take too much time...well...you can choose to either drop frames or accept a slowdown. Doom drops frames during gameplay by skipping the rendering if necessary, while during benchmarking timedemos it renders every frame but allows for slowdown.
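A rough sketch of such a policy in a main loop (not Doom's actual code; run_tic, render_frame and present_frame are hypothetical placeholders):

#include <SDL.h>

void run_tic(void);        /* placeholder: advance the world by one 1/35 s tic */
void render_frame(void);   /* placeholder: software-render the view into a buffer */
void present_frame(void);  /* placeholder: push the buffer to the window */

void game_loop(void) {
    const Uint32 MS_PER_TIC = 1000 / 35;   /* ~28 ms */
    Uint32 next_tic = SDL_GetTicks();
    for (;;) {
        int tics_run = 0;
        while (SDL_GetTicks() >= next_tic) {   /* catch up on game logic */
            run_tic();
            next_tic += MS_PER_TIC;
            tics_run++;
        }
        if (tics_run > 0) {
            render_frame();   /* if several tics just ran, their frames are never drawn */
            present_frame();
        } else {
            SDL_Delay(1);     /* ahead of schedule: wait */
        }
    }
}

A timedemo would instead render after every single tic and simply let the wall-clock time stretch.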

To get a guaranteed 35 fps under all circumstances, the hardware must have enough "juice" to render even the most complex scene and process even the most complex situation within that amount of time. PC games are usually NOT designed with such constraints, and frame skipping/automatic degradation are commonplace (actual slow motion during gameplay is uncommon today, except as part of deliberate "bullet time" effects).

Now, if you are working in such a constrained environment that you must respect a fixed amount of CPU processing time AND honor a given display resolution AND provide a guaranteed fixed framerate (a console?), then you will need to downsize the game itself or publish it with sufficiently high minimum requirements, as has often been common practice. Set your goals accordingly.

TL;DR: the framebuffer transfer speed only gives you a theoretical upper limit, as if everything else were cost-free, and it's only meaningful to weigh it against your game's processing time if it's particularly slow (e.g. certain platforms like Java ME phones had an excruciatingly slow way of writing to the display, so only certain genres of games became popular there) and you cannot do other processing in the meantime. But on a modern PC this just isn't the case.

scifista42 said:

With that in mind, do you think it makes sense to consider using OpenGL/Direct3D anyway?

There's a reason you have OpenGL code in ports such as Chocolate Doom or Eternity, and it's not for rendering.

Gez said:

There's a reason you have OpenGL code in ports such as Chocolate Doom or Eternity, and it's not for rendering.


Well, even that way, he'll just discover that instead of a 370 fps theoretical maximum he might be able to do, say, 500 or 600 or even 1000; it doesn't matter. Even if that time were zero and the theoretical FPS infinite, it wouldn't matter.

If the goal is to guarantee that each and every situation in his game code will be resolved in no more than 28.57 ms ALWAYS (a "constant guaranteed framerate" design), then he needs to precisely control the complexity of the game assets, anticipate the maximum expected complexity, etc., and of course set a beefy enough minimum-requirements ante.

There's certainly something wrong with the whole setup if the 9%, 5%, 2% or 1% of total processing time that rendering/framebuffering will occupy is considered so important, rather than focusing on what happens in the other 91%, 95%, 98% or 99%.


There are three stages to each frame:

- run the tic, update the world
- draw the player view into a frame buffer
- SDL_Flip

In rboom, the timings at 640x480 fullscreen are as follows.



Running a tic

This is very very fast, especially for vanilla game physics.

% repeat 5 { { doom2 ~/doom/30ns7155.lmp -nosound -nodraw -timedemo } | grep FPS }
FPS: 13975.5 (162515 tics [77:23] / 407 real [0:11])
FPS: 14079.3 (162515 tics [77:23] / 404 real [0:11])
FPS: 14044.5 (162515 tics [77:23] / 405 real [0:11])
FPS: 14079.3 (162515 tics [77:23] / 404 real [0:11])
FPS: 14114.2 (162515 tics [77:23] / 403 real [0:11])
That's about 0.07ms per tic.[1]



Running a tic then drawing it to framebuffer

Run each tic, then draw the pixels into a frame buffer. That means writing numbers into a block of memory; you're still on the CPU and in system RAM.
% repeat 5 { { doom2 -nosound -noblit -timedemo demo2 } | grep FPS }
FPS: 187.3 (2001 tics [0:57] / 374 real [0:10])
FPS: 188.3 (2001 tics [0:57] / 372 real [0:10])
FPS: 188.3 (2001 tics [0:57] / 372 real [0:10])
FPS: 188.3 (2001 tics [0:57] / 372 real [0:10])
FPS: 187.8 (2001 tics [0:57] / 373 real [0:10])
Obviously a lot slower, 5.3ms per frame.[2]



Running a tic and displaying it on screen

Finally we bring in the actual graphics display. This only adds one extra call - SDL_Flip() - to the loop. But as we can see, that call is very slow.
% repeat 5 { { doom2 -nosound -timedemo demo2 } | grep FPS }
FPS: 92.9 (2001 tics [0:57] / 754 real [0:21])
FPS: 92.9 (2001 tics [0:57] / 754 real [0:21])
FPS: 93.0 (2001 tics [0:57] / 753 real [0:21])
FPS: 92.9 (2001 tics [0:57] / 754 real [0:21])
FPS: 92.9 (2001 tics [0:57] / 754 real [0:21])
10.7ms per frame. Thus SDL and the rest of the graphics stack underneath essentially doubles the draw time.[3]

After that, you just go to sleep for the remainder of your 28.57ms.



________
[1] On a 10-year-old single-core Athlon64 3200+ with 512MB RAM.

[2] Writing pixels in columns, against the grain of memory, is highly suboptimal. I have never ported any of the "write multiple columns at once" renderer speedup patchsets from PrBoom upstream because the column drawers are complicated (and hard to modify) enough as it is. If I double the resolution, that 5.3ms shoots up to nearly 50ms per frame, which is unusable.

[3] This is where I fall out with SDL2. Using SDL_FULLSCREEN_DESKTOP (my window manager hates plain SDL_FULLSCREEN, sadly), the "SDL_RenderPresent dance" - a sequence of function calls which form the SDL2 equivalent of SDL_Flip - takes 25ms to return regardless of framebuffer size, which means I have to run a tic and draw the screen in under 3.5ms if I am to have 35 frames per second.
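For reference, the "SDL_RenderPresent dance" mentioned in [3] is usually a sequence along these lines (a sketch; the renderer, streaming texture and framebuffer are assumed to have been created elsewhere):

#include <SDL.h>

/* the common SDL2 stand-in for SDL_Flip: upload the software framebuffer
   to a streaming texture, then present it */
void present(SDL_Renderer *renderer, SDL_Texture *texture,
             const Uint32 *framebuffer, int width) {
    SDL_UpdateTexture(texture, NULL, framebuffer, width * (int)sizeof(Uint32));
    SDL_RenderClear(renderer);
    SDL_RenderCopy(renderer, texture, NULL, NULL);
    SDL_RenderPresent(renderer);
}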

RjY said:

[3] This is where I fall out with SDL2. Using SDL_FULLSCREEN_DESKTOP (my window manager hates plain SDL_FULLSCREEN, sadly), the "SDL_RenderPresent dance" - a sequence of function calls which form the SDL2 equivalent of SDL_Flip - takes 25ms to return regardless of framebuffer size, which means I have to run a tic and draw the screen in under 3.5ms if I am to have 35 frames per second.

As I discovered during my SDL tests (described in my other thread), the fastest way to update a window in SDL2 is the function SDL_UpdateWindowSurfaceRects (with one rectangle covering the full window as its parameter), which is equivalent to SDL_Flip in both speed and usage. Just feel free to disregard SDL2's arbitrary "renderers" and related features.
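Concretely, the call looks like this (a sketch, assuming an 800x600 window that already exists):

#include <SDL.h>

/* push the whole 800x600 window surface to the screen in one call */
void flip(SDL_Window *window) {
    SDL_Rect full = { 0, 0, 800, 600 };
    SDL_UpdateWindowSurfaceRects(window, &full, 1);
}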

Thanks for the info anyway; I wouldn't have thought of -timedemo.


You gain some advantages writing 4-byte truecolor pixels on modern processors. For example, a 4-byte pixel write should be as fast as a single-byte write, or possibly faster. Now, calculating what color that truecolor pixel should be is another story.

But, very roughly, I'd feel pretty confident with a 3ms 800x600 truecolor screen render. The two biggest issues with painting the screen the Doom way are:
#1. The renderer paints vertically, which runs counter to what a modern system's caching expects. That means a possible cache flush on every pixel write! If you can get your pixel-write code to use MOVs that do not update the cache, you can alleviate that, but that requires assembly or non-portable directives (see the sketch at the end of this post).

#2. Sprite overdraw. Again, cache issues.

Your 32-bit pixels barely affect either of those, even though you'll be writing 4x the data. Since the cache is really only getting in the way anyway, it doesn't much matter whether you're writing 1 byte or 4.

But if you're getting 3ms while painting the screen vertically, you're doing something right. You're not painting the screen horizontally to get that 3ms, are you? If so, you're in for a surprise, I'm afraid.
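A rough sketch of the kind of non-portable, cache-bypassing write referred to in #1 (x86 SSE2 only; the function and parameter names are made up, and whether it actually helps has to be measured):

#include <emmintrin.h>   /* SSE2 intrinsics: _mm_stream_si32 (non-temporal store) */
#include <stdint.h>

/* draw a vertical column of pixels with stores that bypass the cache,
   so each write does not pull a whole cache line into the cache */
void draw_column_nt(uint32_t *framebuffer, int pitch_pixels,
                    int x, int y0, int y1, uint32_t color) {
    for (int y = y0; y <= y1; y++)
        _mm_stream_si32((int *)&framebuffer[y * pitch_pixels + x], (int)color);
}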


@kb1: I'm confused, were you speaking to RjY or to me? Because there might be a few misunderstandings if you were actually speaking to me. I was just filling my screen "buffer" with random (but precalculated) data and measuring the speed of purely updating the pixels on the display. Also, my latest, improved experiments showed that it actually takes only 1.4 milliseconds to update an 800x600 window (the actual display), while I can directly access the real screen pixel memory for read/write operations anytime between calls to the window update function. The speed of my own algorithm for writing into the pixel data memory was not part of the measurement at all.


Just a few notes of caution here: keeping three separate RGB arrays is not very efficient, because that's not how pixels are stored on the graphics card at all. You have to treat each pixel as a "solid" block containing all three RGB components (usually arranged as ARGB, where A is the alpha channel and is simply ignored). For memory access reasons, 32 bits are used even for 24-bit truecolor.

Updating just one color component of a pixel may be fast when using three separate arrays, but joining an element from each array on the fly (with bit shifting, bit masking, etc.) in order to write it to the screen is not. Keeping color components separate only makes sense in image-processing software, not multimedia/games.
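To illustrate the per-pixel cost, rebuilding a packed pixel from three hypothetical separate planes looks something like this:

#include <stdint.h>

/* join separate R/G/B planes into one packed ARGB pixel; doing this for every
   pixel of every frame is pure overhead compared to storing ARGB directly */
uint32_t pack_argb(const uint8_t *r, const uint8_t *g, const uint8_t *b, int i) {
    return 0xFF000000u | ((uint32_t)r[i] << 16) | ((uint32_t)g[i] << 8) | (uint32_t)b[i];
}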

Finally, as I said before, any benchmark that you do of the frame buffer's speed only gives you a maximum theoretical performance bound, a kind of "ideal world" goal. If your application needs a strictly bounded computation time per frame, then the algorithms and data you use must also have a strictly bounded and predictable execution complexity. For example, using iterative solvers, neural network training, or any other kind of calculation where you don't know the required number of steps a priori (or at least their worst-case upper bound) is a very bad idea if real-time, bounded-time performance is required.

Maes said:

Just a few notes of caution here: keeping three separate RGB arrays is not very efficient, because that's not how pixels are stored on the graphics card at all. You have to treat each pixel as a "solid" block containing all three RGB components (usually arranged as ARGB, where A is the alpha channel and is simply ignored). For memory access reasons, 32 bits are used even for 24-bit truecolor.

That's exactly what I do and have always known about. In fact, my read/write array is no longer a "buffer"; it now occupies the exact same memory space where the window's internal pixel data is stored, ready to be blitted, giving me the most direct control with the fastest display time, and it seems to work perfectly.


In any case, worrying about rendering performance is premature without at least an idea of the overall performance of your planned game. Unless you're coding something out of the ordinary (using polynomial-complexity algorithms, etc., which really make every computron count), you can use an existing game of a genre similar to yours as a yardstick to get an estimate of reasonable minimum requirements, performance, and so on.

If you anticipate the use of a heavily non-linear algorithm, then scrounging a tiny bit of linear improvement from something that normally represents a fraction of the total frame time won't help much: your focus should be elsewhere.

scifista42 said:

@kb1: I'm confused, were you speaking to RjY or to me? Because there might be a few misunderstandings if you were actually speaking to me. I was just filling my screen "buffer" with random (but precalculated) data and measuring the speed of purely updating the pixels on the display. Also, my latest, improved experiments showed that it actually takes only 1.4 milliseconds to update an 800x600 window (the actual display), while I can directly access the real screen pixel memory for read/write operations anytime between calls to the window update function. The speed of my own algorithm for writing into the pixel data memory was not part of the measurement at all.

Honestly, I was basing it off your first post. But yeah, it still applies. You mention filling the buffer with pre-calculated data. Use care here. What I was trying to portray is that there is a big difference between writing data left to right, a horizontal row at a time, vs. top to bottom, a column at a time. The former works nicely with the cache subsystem; the latter wreaks havoc on the cache and is usually much, much slower. Unfortunately, the latter is how Doom paints walls. It didn't matter in the 486 days, because ALL writes were slow :). But it can be even worse than running cache-less if you write columns, because the processor has to wait for cache flushes. Try it: write a pair of X/Y loops. In the first test, put X in the inner loop and paint pixels at (X,Y). In the second test, make Y the inner loop. The second test should be a lot slower. Also, experiment with resolution: if the resolution is a multiple of the cache line size, it can be even worse.
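A minimal sketch of that experiment (the buffer size and fill value are arbitrary):

#include <stdint.h>

#define W 800
#define H 600

/* test 1: X in the inner loop - sequential, cache-friendly row-major writes */
void fill_rows(uint32_t *buf) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            buf[y * W + x] = 0xFF00FF00u;
}

/* test 2: Y in the inner loop - strided, cache-hostile column-major writes */
void fill_columns(uint32_t *buf) {
    for (int x = 0; x < W; x++)
        for (int y = 0; y < H; y++)
            buf[y * W + x] = 0xFF00FF00u;
}

Time both over the same buffer; the column version should come out noticeably slower.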

That's why ports such as Eternity, PrBoom, ZDoom, etc., have such complicated renderers - they attempt to paint multiple columns at the same time, among other tricks, which seem convoluted (because they are) and look as if they would be much slower. But they are typically, in fact, faster, because they trigger fewer cache invalidations.

Normally, when you read a word of data from main memory, modern memory hardware reads that word plus a chunk of memory adjacent to it. Depending on the cache setup, it may read 64 bytes and store them in the cache. This is done because, more often than not, your program will want to read more data sequentially... and there it is, already in cache! This really speeds up sequential block memory reads.

This idea also works for memory writes. Writing to cache is faster than writing to main memory. The problems occur when you write in a non-sequential manner. For caching to work, the hardware must be able to guarantee that memory read from the cache exactly mirrors real memory, and that writes to cache end up being written to the appropriate address in real memory.

The cache is very small, and whenever you read/write areas that are not cached, the current data in the cache must be either written back to memory or invalidated (flushed) before the new memory area can be read in, and that takes a significant amount of time.

Ironically, a system designed to make memory faster can actually slow down memory usage drastically, in certain usage cases.

Disclaimer: My description is extremely simplified. If someone feels inclined to "break out the manual", please use that info to enhance the discussion. I already know how it works, thank you :)

@Quasar (mergesort vs. quicksort): Nice spot on find and fix!


Well, with a big, linear, chunky frame buffer, on Intel at least, "updating" means executing one big, fat REP MOVS instruction (which is what most implementations of memcpy end up doing, too), so there's no immediate concept of row-major or column-major ordering here. You're simply copying the whole thing "as is".

If you start inserting breaks between rows (which would still be memcpy-able blocks themselves), it will still be OK. If you try drawing column by column, however, you cannot use any speedup trickery other than very complicated loop unrolling, "fat" column drawing, etc.

Can't beat the elegance of that single memcpy op, though.
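A sketch of that in C, with a per-row fallback for when the surface pitch doesn't match the row size (locking and format checks omitted):

#include <SDL.h>
#include <string.h>

/* copy a w x h 32-bit backbuffer into an SDL surface: one big memcpy when the
   pitches match, otherwise one memcpy per row */
void blit_backbuffer(SDL_Surface *dst, const Uint32 *src, int w, int h) {
    if (dst->pitch == w * (int)sizeof(Uint32)) {
        memcpy(dst->pixels, src, (size_t)w * h * sizeof(Uint32));
    } else {
        for (int y = 0; y < h; y++)
            memcpy((Uint8 *)dst->pixels + (size_t)y * dst->pitch,
                   src + (size_t)y * w, (size_t)w * sizeof(Uint32));
    }
}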


@kb1: Okay, I will keep it in mind, although it was already clear to me that sequentially writing to memory cells that directly follow each other is faster than writing to memory cells offset by a constant stride (to go by columns, that stride = screen width).

Still, you were talking about a whole different thing than me: you were talking about rendering, as an algorithm that writes the desired pixel content into memory. I was talking merely about displaying, as a system function (for example SDL_UpdateWindowSurfaceRects) that takes data from a given memory block and actually displays it in a window on the computer's monitor. Only the latter is the issue I've been describing in this thread, not measuring rendering speed itself.


Well, if you want the fastest possible displaying, then you should preload everything to video/texture RAM and program the video card's hardware to simply switch between viewports. It will be literally as fast as the RAMDAC or the digital stream output will allow. But of course this system would be impractical for use in a game, unless you could run a major portion of it on the card itself (which is not entirely unthinkable, thanks to unified shaders and CUDA).

scifista42 said:

@kb1: Okay, I will keep it in mind, although it was already clear to me that sequentially writing to memory cells that directly follow each other is faster than writing to memory cells offset by a constant stride (to go by columns, that stride = screen width).

Still, you were talking about a whole different thing than me: you were talking about rendering, as an algorithm that writes the desired pixel content into memory. I was talking merely about displaying, as a system function (for example SDL_UpdateWindowSurfaceRects) that takes data from a given memory block and actually displays it in a window on the computer's monitor. Only the latter is the issue I've been describing in this thread, not measuring rendering speed itself.

Ah. That's what I get for not comprehending the entire post. I thought your frames-per-second concern included Doom-like column-based wall rendering, which plays a major role in limiting frame rate.

At any rate (no pun intended :), good luck with your project!
