Parallel software renderer

Maes · November 24, 2010

I took a break from advancing the actual game logic codebase in order to optimize the renderer a bit. Until now I had only tested MochaDoom on systems with decent graphics cards, and even though the renderer is pure software, since I use "direct" screen access, Java plays nicely and maps it directly to D3D/OpenGL/whatever, giving me a nice framerate even at high resolutions.

However, when we upgraded my b0x at work with a new Quad Core Athlon II X640, I couldn't resist testing it. To my disappointment, since we opted to use the mobo's integrated ATI 4250, that beast of a machine got way worse frame rates than my Dual Core T8300 laptop (which however has an nVidia 8600M): about 30% lower, which was a real bummer. A typical framerate was 60 fps @ 1280 x 800, whereas my laptop managed to pull at least 90.

Thinking that this was an injustice, I picked up my parallel processing experience from my graduation days and parallelized Doom's software renderer (still in Java).

No big mystery how I did it: in RenderSegLoop, instead of actually drawing stuff to the screen, I save the draw calls as "RenderColumnInstruction" objects (themselves organized in a resizable array, since I don't know a priori how many I will actually need for a given frame). Then those instructions are split between N worker threads.
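A minimal sketch of the deferred-drawing idea described above, not the actual MochaDoom code: the class and field names (ColumnInstruction, DeferredColumns, flush) are made up for illustration, a column is reduced to a flat color fill, and the sketch assumes that no two instructions write the same pixel, which is what lets the workers run without locks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a recorded column draw; field names are
// illustrative and not taken from the MochaDoom sources.
final class ColumnInstruction {
    final int x, yTop, yBottom;
    final byte color;

    ColumnInstruction(int x, int yTop, int yBottom, byte color) {
        this.x = x;
        this.yTop = yTop;
        this.yBottom = yBottom;
        this.color = color;
    }

    // Fill one vertical strand of the frame buffer.
    void execute(byte[] screen, int width) {
        for (int y = yTop; y <= yBottom; y++) {
            screen[y * width + x] = color;
        }
    }
}

final class DeferredColumns {
    // Resizable: the number of columns per frame is not known in advance.
    private final List<ColumnInstruction> pending = new ArrayList<>();

    void record(ColumnInstruction ci) {
        pending.add(ci);
    }

    int size() {
        return pending.size();
    }

    // Execute everything recorded so far on n threads (n >= 1), giving each
    // thread a contiguous slice of the list, then wait for all of them.
    void flush(byte[] screen, int width, int n) {
        Thread[] workers = new Thread[n];
        int count = pending.size();
        for (int t = 0; t < n; t++) {
            final int from = (int) ((long) count * t / n);
            final int to = (int) ((long) count * (t + 1) / n);
            workers[t] = new Thread(() -> {
                for (int i = from; i < to; i++) {
                    pending.get(i).execute(screen, width);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // join also makes the workers' writes visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        pending.clear();
    }

    public static void main(String[] args) {
        int width = 8, height = 4;
        byte[] screen = new byte[width * height];
        DeferredColumns frame = new DeferredColumns();
        for (int x = 0; x < width; x++) {
            frame.record(new ColumnInstruction(x, 0, height - 1, (byte) (x + 1)));
        }
        frame.flush(screen, width, 2);
        System.out.println(Arrays.toString(Arrays.copyOf(screen, width)));
    }
}
```

Starting fresh threads every frame keeps the sketch short; a long-lived worker pool would likely be the better choice in a real renderer.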

Also, I render the visplanes (floors) in parallel with the walls. This is possible because RenderSegLoop has already correctly constructed the visplanes, so I don't need to have actually drawn the walls first. Visplane drawing itself is also parallelized.

Thus, I have three levels of parallelism:

Between visplanes and walls.
Between walls.
Between visplanes.

A typical configuration is 2 Wall threads + 1 Floor thread for a Quad core, as I found that increasing floor threads doesn't help much unless you have several distinct visplanes of more or less the same screen area, but even 1+1 helps a lot.

Wall rendering simply divides the Render Column Instructions between N threads. Each instruction contains directives for a single strand of wall (middle, low and top textures are considered separate strands).

Synchronization issues: RenderBSPNode and everything under it is still run serially, which is acceptable only because it doesn't actually do any drawing anymore. It only stores the correct instructions for the colfuncs and visplanes, and because it's so lightweight compared to rendering, it's probably not worth parallelizing for normal maps. Wall and visplane rendering commence after RenderBSPNode has completed and join the main execution before DrawMasked is called (which has an inherently serial progression and doesn't seem worth the effort to parallelize yet, unless sprite positioning meets certain criteria).
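The ordering described above (serial BSP pass, then walls and floors in parallel, then a join before the serial masked pass) could be sketched roughly like this. FramePipeline and its log are hypothetical names used only to make the ordering observable; the real stages are replaced by log entries. The join relies on ExecutorService.invokeAll, which returns only once every submitted task has completed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical frame loop: each stage just logs its name instead of drawing.
final class FramePipeline {
    private final ExecutorService pool;
    final List<String> log = Collections.synchronizedList(new ArrayList<String>());

    FramePipeline(int workers) {
        pool = Executors.newFixedThreadPool(workers);
    }

    void renderFrame(int wallTasks, int floorTasks) {
        // Serial stage: stands in for the BSP traversal that only records work.
        log.add("bsp");

        // Parallel stage: wall and floor jobs share one worker pool.
        List<Callable<Void>> jobs = new ArrayList<>();
        for (int i = 0; i < wallTasks; i++) {
            jobs.add(job("walls"));
        }
        for (int i = 0; i < floorTasks; i++) {
            jobs.add(job("floors"));
        }
        try {
            pool.invokeAll(jobs); // blocks until every job has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // Serial stage: stands in for masked (sprite) drawing.
        log.add("masked");
    }

    private Callable<Void> job(String name) {
        return () -> {
            log.add(name);
            return null;
        };
    }

    void shutdown() {
        pool.shutdown();
    }

    public static void main(String[] args) {
        FramePipeline pipeline = new FramePipeline(3);
        pipeline.renderFrame(2, 1); // 2 wall jobs + 1 floor job
        pipeline.shutdown();
        System.out.println(pipeline.log);
    }
}
```

The wall and floor entries may appear in any order between the first and last log entries, since only the two serial stages are pinned in place.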

The result? The aforementioned lousy 60 fps skyrocketed to 100 fps under the same conditions. In fact, I think it works so well that I will include it in the first public releases ;-)

And, correct me if I'm wrong, this is the first time Doom's software renderer has been parallelized in any form?

Quasar · November 24, 2010

Maes said:
And, correct me if I'm wrong, this is the first time Doom's software renderer has been parallelized in any form?

To my knowledge yes. I'd expect some people to follow quickly though, now that you've blazed the trail ;)

This won't be possible in any SDL ports til we find a way around the multicore access violations though >_>

I have thought about writing a parallel BSP builder in the past. The process of BSP tree building proceeds in a piecewise separate fashion on either side of a split so it should be possible to have a pool of threads (say, one per core) and assign them subtrees to build - as each subtree is finished that thread would return to the pool for reassignment by the next thread that splits a subtree.
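As a rough illustration of the subtree-per-worker idea, here is a toy sketch in Java using the fork/join framework from java.util.concurrent (Java 7 and later). It is heavily simplified and not a real node builder: "segs" are just sorted integers, and the "partition line" is simply the middle element, so all names here (BspNode, BuildTask) are hypothetical. What it does show is each side of a split being handed to the pool as an independent task.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Toy tree node: a splitter value with a back (smaller) and front (larger) subtree.
final class BspNode {
    final int splitter;
    final BspNode back, front;

    BspNode(int splitter, BspNode back, BspNode front) {
        this.splitter = splitter;
        this.back = back;
        this.front = front;
    }

    int count() {
        return 1 + (back == null ? 0 : back.count()) + (front == null ? 0 : front.count());
    }
}

// Builds the subtree for segs[lo, hi); the two sides of each split are
// independent, so one side is forked to the pool while this thread does the other.
final class BuildTask extends RecursiveTask<BspNode> {
    private final int[] segs;
    private final int lo, hi;

    BuildTask(int[] segs, int lo, int hi) {
        this.segs = segs;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected BspNode compute() {
        if (lo >= hi) {
            return null; // empty side of a split
        }
        int mid = (lo + hi) >>> 1; // stand-in for choosing a partition line
        BuildTask backTask = new BuildTask(segs, lo, mid);
        BuildTask frontTask = new BuildTask(segs, mid + 1, hi);
        backTask.fork();                     // queue the back side for another worker
        BspNode front = frontTask.compute(); // build the front side on this thread
        BspNode back = backTask.join();      // wait for the back side
        return new BspNode(segs[mid], back, front);
    }

    static BspNode build(int[] sortedSegs) {
        return ForkJoinPool.commonPool().invoke(new BuildTask(sortedSegs, 0, sortedSegs.length));
    }

    public static void main(String[] args) {
        BspNode root = build(new int[] {1, 2, 3, 4, 5, 6, 7});
        System.out.println("root splitter: " + root.splitter + ", nodes: " + root.count());
    }
}
```

A practical builder would probably stop forking below some subtree size and finish small subtrees on the current thread, since task overhead would otherwise dominate.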

The only real question is, is there a real purpose to it? BSP tree building is probably so fast now even on a single core that it's not enough of an issue to warrant parallelism - but then again I've never tried building a BSP tree for a level with 90000 segs either (essel... :P)

Graf Zahl · November 24, 2010

It still takes several seconds, depending on your CPU.

I get around 7 seconds to build GL nodes with ZDBSP on a 2.4GHz Quad Core for levels with ca. 60000 sidedefs (using only one of them, of course.)

Maes · November 24, 2010

Well, have fun, I committed the code :-p

Look for a file named "ParallelRenderer.java" and compare it with "UnifiedRenderer.java" to understand what I've pulled. Looking at it in retrospect, it was embarrassingly simple to perform (but then again I don't have exactly zero experience in analyzing parallel problems, and I knew exactly what tools to use with Java to achieve it).

If you're feeling brave, try "AWTRenderViewTester" vs "AWTParallelRenderTester1" in the testers package and send me some benchmarks ;-)

And, if you are feeling even MORE brave, fire up Main in package i ;-) (put an IWAD in the root of the compiled classes folder though, as well as any PWADs. Standard command line arguments will work, for the most part.)

fraggle · November 24, 2010

Neat. I've considered doing something similar for Chocolate Doom, except in my case the speed bottleneck isn't the actual rendering (320x200 can be drawn at ridiculous rates), but in the screen scale-up code. The screen could be divided up into strips, with a thread assigned to render each strip.
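The strip idea can be sketched as follows. This is written in Java to match the rest of the thread rather than in Chocolate Doom's C, and StripScaler is a made-up name: it does plain integer pixel replication, with each thread owning a band of source rows and therefore a disjoint band of destination rows.

```java
import java.util.Arrays;

// Hypothetical integer scale-up split into horizontal strips, one thread per strip.
final class StripScaler {

    // Scale source rows [y0, y1) of a w-pixel-wide image by an integer factor.
    static void scaleRows(byte[] src, int w, byte[] dst, int factor, int y0, int y1) {
        int dw = w * factor; // destination width
        for (int y = y0; y < y1; y++) {
            int firstRow = y * factor * dw;
            // Stretch the source row horizontally into the first destination row.
            for (int x = 0; x < w; x++) {
                byte p = src[y * w + x];
                for (int k = 0; k < factor; k++) {
                    dst[firstRow + x * factor + k] = p;
                }
            }
            // Duplicate that row for the remaining factor - 1 destination rows.
            for (int r = 1; r < factor; r++) {
                System.arraycopy(dst, firstRow, dst, firstRow + r * dw, dw);
            }
        }
    }

    // Scale the whole image, splitting the source rows into contiguous strips.
    static byte[] scale(byte[] src, int w, int h, int factor, int threads) {
        byte[] dst = new byte[w * factor * h * factor];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int y0 = h * t / threads;
            final int y1 = h * (t + 1) / threads;
            workers[t] = new Thread(() -> scaleRows(src, w, dst, factor, y0, y1));
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] src = new byte[320 * 200];
        Arrays.fill(src, (byte) 7);
        byte[] dst = scale(src, 320, 200, 2, 4);
        System.out.println("scaled buffer size: " + dst.length);
    }
}
```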

Unfortunately the workaround for the SDL sound bug forces all threads onto the same core, so it's not really feasible.

4mer · November 24, 2010

For scaling you could always use the graphics card hardware.

Quasar · November 25, 2010

fraggle said:
Neat. I've considered doing something similar for Chocolate Doom, except in my case the speed bottleneck isn't the actual rendering (320x200 can be drawn at ridiculous rates), but in the screen scale-up code. The screen could be divided up into strips, with a thread assigned to render each strip.

Unfortunately the workaround for the SDL sound bug forces all threads onto the same core, so it's not really feasible.

We really need a solution to that problem badly >_>

I wonder if in already highly tuned native code that issues like write barriers will not be a problem when trying approaches like this. Seems you'd need to make sure the areas of rendering assigned to different threads were as cache-distinct as possible or else they'd end up with conflicts on cache commits - or do I misunderstand the nature of such things? :P If I am correct, then a multithreaded approach done naively in such a codebase might actually carry a significant penalty.

Gez · November 25, 2010

Quasar said:
We really need a solution to that problem badly >_>

OpenAL?

Csonicgo · November 25, 2010

Gez said:
OpenAL?

you know, I wish ports would dive right in and support that. does Zdoom? with fluidsynth I wouldn't think midi is a reason to have sdl_mixer anymore.

Quasar · November 25, 2010

I've heard plenty about why OpenAL is not a suitable digital audio library for DOOM ports.

Graf Zahl · November 25, 2010

It's certainly better than SDL_Mixer. Hell, even using DirectSound directly under Windows would probably be a better option than that.

If someone is interested, they should get in contact with the guy who wrote the ZDoom implementation.

Maes · November 25, 2010

Quasar said:
I wonder if in already highly tuned native code that issues like write barriers will not be a problem when trying approaches like this. Seems you'd need to make sure the areas of rendering assigned to different threads were as cache-distinct as possible or else they'd end up with conflicts on cache commits - or do I misunderstand the nature of such things?

Yup, it's pretty much how it works. When deciding how to assign workload to separate threads, you must take into account whether you're doing something computationally heavy (e.g. dense matrix inversion, trigonometric and exponential ops etc.) or simply copying data around.

Doom's software rendering falls somewhere between the two: column-rendered graphics pretty much destroy cache coherency by their very nature, and they also have fixed-point arithmetic and trigonometry thrown in to calculate scaling and offset, in the case of floor flats. Since the easiest way to divide work is by columns rather than by screen lines, you can assign them any way you like, e.g. alternating columns, whole ranges of columns, or sorted by dc_source.
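To make two of those assignment schemes concrete, here is a small sketch of how a screen column could be mapped to a worker thread. ColumnOwnership and its method names are hypothetical; sorting by dc_source is left out, as it would need the actual column data.

```java
// Hypothetical helpers mapping a screen column x to one of n worker threads.
final class ColumnOwnership {

    // Alternating columns: neighbouring columns go to different threads.
    static int interleaved(int x, int n) {
        return x % n;
    }

    // Whole ranges of columns: each thread gets one contiguous block.
    static int byRange(int x, int width, int n) {
        int chunk = (width + n - 1) / n; // columns per thread, rounded up
        return x / chunk;
    }

    public static void main(String[] args) {
        int width = 10, threads = 3;
        for (int x = 0; x < width; x++) {
            System.out.println("column " + x
                    + " interleaved=" + interleaved(x, threads)
                    + " range=" + byRange(x, width, threads));
        }
    }
}
```

With the range scheme, the last thread may get fewer columns (or none) when the width does not divide evenly.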

If on the other hand what you're doing consists mostly of block copies, then you must take cache coherency pretty much anally into account (e.g. as in a super-optimized matrix transposition algorithm I examined).

A way to achieve block-scaling without doing pixel squaring and maintaining a certain amount of cache coherency, is to scale by integer amounts in this way:

Decide the amount of horizontal and vertical scaling (we're talking about block scaling, right?), let's call them M and N (integer amounts only)
Draw the resized screen (including 3D view, status bar, menus etc. everything) "sparsely", aka drawing only the 0-th, M-th, 2*M-th, etc. pixel horizontally and 0-th, N-th, 2*N-th, etc. pixel vertically.
Apply per-line horizontal post-processing, by expanding every pixel into its neighboring (to the right) tuple of pixels. This can be done horizontally and doesn't interfere with actual column rendering, so cache lines can be dedicated entirely to whole horizontal spans.
Same thing for the "spaces" between lines (the scanlines that were skipped), possibly by a separate function. It would work best to assign e.g. the first Y/2 scanlines to 1 thread and the other Y/2 to another, rather than e.g. alternating.

Sounds a bit naive, but it's both inherently parallelizable and cache-coherent (no line skipping, no per-column traversals etc.)
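The steps above can be sketched like this, assuming the frame has already been drawn "sparsely" into a full-size buffer. SparseExpand and its method names are invented for the example, and the expansion is shown on a single thread; expandRange takes a scanline range so that, for instance, the top and bottom halves could be handed to two different threads.

```java
import java.util.Arrays;

// Hypothetical post-processing for a sparsely drawn frame: only pixels at
// x = 0, M, 2M, ... on lines y = 0, N, 2N, ... hold data on entry.
final class SparseExpand {

    // Per-line pass: copy each drawn pixel into the m - 1 pixels to its right.
    static void expandHorizontally(byte[] buf, int width, int line, int m) {
        int base = line * width;
        for (int x = 0; x < width; x += m) {
            byte p = buf[base + x];
            for (int k = 1; k < m && x + k < width; k++) {
                buf[base + x + k] = p;
            }
        }
    }

    // Between-lines pass: copy a finished line into the n - 1 empty lines below it.
    static void fillLinesBelow(byte[] buf, int width, int height, int line, int n) {
        for (int r = 1; r < n && line + r < height; r++) {
            System.arraycopy(buf, line * width, buf, (line + r) * width, width);
        }
    }

    // Expand all drawn lines in [yStart, yEnd); yStart must be a multiple of n.
    static void expandRange(byte[] buf, int width, int height, int m, int n,
                            int yStart, int yEnd) {
        for (int y = yStart; y < yEnd; y += n) {
            expandHorizontally(buf, width, y, m);
            fillLinesBelow(buf, width, height, y, n);
        }
    }

    static void expand(byte[] buf, int width, int height, int m, int n) {
        expandRange(buf, width, height, m, n, 0, height);
    }

    public static void main(String[] args) {
        int width = 4, height = 4;
        byte[] buf = new byte[width * height];
        buf[0] = 1;  // sparse pixel at (0, 0)
        buf[2] = 2;  // sparse pixel at (2, 0)
        buf[8] = 3;  // sparse pixel at (0, 2)
        buf[10] = 4; // sparse pixel at (2, 2)
        expand(buf, width, height, 2, 2);
        for (int y = 0; y < height; y++) {
            System.out.println(Arrays.toString(Arrays.copyOfRange(buf, y * width, (y + 1) * width)));
        }
    }
}
```

Both passes walk memory strictly left to right within whole scanlines, which is the cache-friendly property the post is after.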

Now I don't know how much of a performance hog it is to perform this kind of scaling in C/C++, however there's a risk that if it's too fast compared to column-based rendering and threads don't work enough time, then you will likely experience thread-barrier penalties, unless you're scaling to some serious resolution and/or use at most 2 worker threads for this particular task.

Gez · November 25, 2010

Csonicgo said:
does Zdoom?

Yes and no. There is an OpenAL branch that is sometimes updated by Chris. It's not in the trunk though.

More precisely, Chris worked on this to give ZDoom an interface that allows the use of either FMOD Ex or OpenAL interchangeably. It's still WIP, though.

Other than this branch of ZDoom, the "port-like" Jedi Engine clone DarkXL uses OpenAL too.

fraggle · November 25, 2010

Quasar said:
We really need a solution to that problem badly >_>

I was talking with entryway yesterday about removing the POSIX version of the affinity workaround (sched_setaffinity) as I don't think it's necessary. It seems to be a Windows-only problem, but for a long time I believed it was a cross-platform issue because of a different bug that has now been fixed.

There seems to be some confusion about this bug - whether it's Windows-only or cross-platform, whether it's SDL, SDL_mixer or Windows itself that is the problem, etc. I'd really like to clear up the confusion, find out what we really know and just track down the cause already. It's rather ridiculous.

The SDL_mixer code for Windows MIDI obviously also needs updating. In the past I've thought about whether it would be most sensible to split out the MIDI code into a separate SDL_MIDI library. You may also be interested to see the DOSbox MIDI code (midi_*.h).

Quasar said:
I wonder if in already highly tuned native code that issues like write barriers will not be a problem when trying approaches like this. Seems you'd need to make sure the areas of rendering assigned to different threads were as cache-distinct as possible or else they'd end up with conflicts on cache commits - or do I misunderstand the nature of such things? :P If I am correct, then a multithreaded approach done naively in such a codebase might actually carry a significant penalty.

I don't think it's a huge problem. If the screen is divided into parallel strips as I describe, they'll be in different memory pages, so there shouldn't be any cache conflicts.

Quasar · November 25, 2010

fraggle said:
There seems to be some confusion about this bug - whether it's Windows-only or cross-platform, whether it's SDL, SDL_mixer or Windows itself that is the problem, etc. I'd really like to clear up the confusion, find out what we really know and just track down the cause already. It's rather ridiculous.

The SDL_mixer code for Windows MIDI obviously also needs updating. In the past I've thought about whether it would be most sensible to split out the MIDI code into a separate SDL_MIDI library. You may also be interested to see the DOSbox MIDI code (midi_*.h).

I found potential race conditions between the main thread and the MCI event pump callback several years ago and didn't really understand how to fix them. I'm almost certain it's possible for the MIDI data to be freed before the callback is finished with it, and/or other data structures to be altered while they're in use by other threads.

Either way it'd probably be better to just start from scratch and forget the existing code.

fraggle · November 25, 2010

Quasar said:
I found potential race conditions between the main thread and the MCI event pump callback several years ago and didn't really understand how to fix them. I'm almost certain it's possible for the MIDI data to be freed before the callback is finished with it, and/or other data structures to be altered while they're in use by other threads.

So are you saying that the multicore bug (the one that requires the affinity fix) is caused by the SDL_mixer native MIDI code? If you disable music, the bug doesn't occur?

Quasar said:
Either way it'd probably be better to just start from scratch and forget the existing code.

I agree. If nobody can be bothered with the effort to do that, I'd be happy with fixing the existing code, though.

Csonicgo · November 25, 2010

fraggle said:
So are you saying that the multicore bug (the one that requires the affinity fix) is caused by the SDL_mixer native MIDI code? If you disable music, the bug doesn't occur?

I can verify this. in EE, with multicore, the MIDI is atrocious.

fraggle · November 25, 2010

Csonicgo said:
I can verify this. in EE, with multicore, the MIDI is atrocious.

What do you mean by "atrocious"? The bug causes the game to lock up, right?

Quasar · November 26, 2010

fraggle said:
So are you saying that the multicore bug (the one that requires the affinity fix) is caused by the SDL_mixer native MIDI code? If you disable music, the bug doesn't occur?

I agree. If nobody can be bothered with the effort to do that, I'd be happy with fixing the existing code, though.

I am saying it is possible, not that it is confirmed. I couldn't find anything that should have been causing it in the digital audio code, whereas I DID find the potential problems in the MIDI code.
But, I've also had some people claim that they've seen the crash happen with music disabled, which makes no sense.

So what it might be suggesting is that this is more than one problem.

I don't know what CSonicGo is referring to...

Csonicgo · November 26, 2010

fraggle said:
What do you mean by "atrocious"? The bug causes the game to lock up, right?

no, it sounds like this. (MP3)

Quasar · November 26, 2010

Csonicgo said:
no, it sounds like this. (MP3)

Never had an issue like that on any of my machines. Also you should probably specify what settings you had active when this happened - did you circumvent the default value of the affinity flag? Because otherwise I don't think this is the same problem.

Csonicgo · November 26, 2010

yes, the affinity flag was off. But why should it matter? if it causes problems, I call that a bug. I'm not trolling, I just don't like that this can't be "fixed", and I wish we knew what was causing it.

Graf Zahl · November 26, 2010

Csonicgo said:
and I wish we knew what was causing it.

Careless multithreading setup. These errors are really nasty. It's likely that they also can appear in a single-core system but with a significantly reduced probability.

Such errors are notoriously hard to find, especially when the people who made them are not among those who try to fix it.

RestlessRodent · December 1, 2010

I've thought about doing this in ReMooD and thought up some ways to do it, but I decided not to follow through with it.

On the unrelated note of SDL_mixer, usually passing -nomusic won't cause a complete lockup when music is handled by SDL_mixer. At least for any port I write, I just use plain old SDL since SDL_mixer is another dependency and requires libraries that might not ever be used (MP3/OGG support for example). For music, I just use native MIDI.
