Justince

Some port questions


Yeah, Jaguar Doom source is still hanging around somewhere. For example, you can find it here.

As for multithreading: many modern ports use it to some extent; for example, ZDoom uses a separate process for the sound system, and Maes experimented with parallelizing certain rendering tasks in Mocha Doom. But multithreading the game logic itself would be difficult, because the way it was designed in 1993 relies too heavily on everything running in a single process. Just look at all the global variables that are "shared" by many functions... And unlike the audio and video output, you can't just replace the game logic wholesale and hope it'll be good enough.


The quick and easy answer is to try optimizing the renderer, since it's one of the most taxing subsystems and can be decoupled from the game logic to a large extent. The benefits, however, will depend a lot on what is being rendered, and on whether it's the map objects (monsters) or the geometry that's too taxing.

From my own tests, a map like nuts.wad runs at about 60 fps "normally" in Mocha Doom at 1280 x 900 resolution, using the standard single-threaded renderer. If I disable rendering completely, I get at most 90 fps or so. This means that even with an infinite speedup of the renderer's code, for that particular map, you won't get more than a 50% speed benefit, since everything outside the renderer still has to run.

By using 2-4 threads and parallelizing the drawing of flats, walls and sprites, for that particular map again, I managed to get between 65 and 75 FPS, which seems a reasonable speedup considering the theoretical maximum would be 90 with zero rendering overhead.

But nuts.wad is an anomalous situation: very little geometry and too many monsters. Other maps with the opposite situation (e.g. a barren map with very complex geometry) might benefit more, while some will be average or equally complex on both fronts.

The biggest bottleneck after rendering is the game logic, which, as others said, is inherently serial. You can't parallelize it without losing the causality and repeatability of actions (so you essentially sacrifice the ability to play back demos, or even RECORD demos with your own port, since actions won't be repeatable unless you use a single thread). If you introduce thread locks and mutexes to try to preserve causality and demo compatibility, you will certainly lose any benefit from multithreading, and execution will degenerate to a single thread.
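To make the repeatability point concrete: demo playback only works because the same input sequence fed through a single-threaded tic loop always produces the same final state. Here's a rough Java sketch of that idea (the class, the tiny RNG table, and the "state" are all made up; this is not actual Doom code):

```java
// Toy illustration of why demos need deterministic game logic: a
// table-driven RNG is part of the shared state, so the order of
// calls to it must be identical on every playback.
public class DemoSync {
    // Stand-in for Doom's rndtable (the real one has 256 entries).
    static final int[] RND_TABLE = {0, 8, 109, 220, 222, 241, 149, 107};
    int rndIndex = 0;
    int playerX = 0;

    int pRandom() {                       // advancing the index mutates shared state
        rndIndex = (rndIndex + 1) & 7;
        return RND_TABLE[rndIndex];
    }

    void runTic(int input) {              // one tic of toy "game logic"
        playerX += input + (pRandom() & 3);
    }

    static int playback(int[] demoInputs) {
        DemoSync g = new DemoSync();
        for (int in : demoInputs) g.runTic(in);
        return g.playerX;                 // checksum of the final state
    }

    public static void main(String[] args) {
        int[] demo = {1, 0, 2, 1, 1, 0};
        // Two playbacks of the same demo must agree; run the tics on
        // several threads and the RNG interleaving destroys this.
        System.out.println(playback(demo) == playback(demo)); // true
    }
}
```

Run the same tic loop across multiple threads without locks and the order of pRandom() calls changes between runs, so the final state (and the demo) desyncs.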


I find multithreading most needed when something in the game triggers a new sound effect, causing a split-second freeze in gameplay. Some of the modern ports should implement it.


Decoupling video and audio processing from gameplay processing should be a given, and to a certain extent it's possible even without a multithreaded programming model (e.g. with interrupts), but even with exactly zero A/V processing latency, you will still have a lot left to do.

And starting a new thread for every new sound effect is not really a good idea, given the way mixing engines work and the overhead of continuously creating and destroying threads. Certain sound systems do effectively work this way, though: in Java's, for example, each time you play a Clip, a temporary audio line with its own dedicated thread is created. Not terribly efficient, especially in situations with lots of overlapping, short sound effects.

It's much more efficient to create a single "freewheeling" audio line that never shuts down, and feed it through a dedicated mixing routine (which can also run on its own thread).
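Here's a rough Java sketch of that approach: one persistent SourceDataLine opened once, fed by a single mixing routine that sums the active voices (the mix format, buffer sizes, and class name are all made up for the example):

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.SourceDataLine;

// One persistent output line fed by one mixing routine, instead of
// one thread per sound effect.
public class MixDemo {
    // Mix several 8-bit signed mono voices into one buffer, clamping
    // the sum to the legal sample range.
    static byte[] mix(byte[][] voices, int length) {
        byte[] out = new byte[length];
        for (int i = 0; i < length; i++) {
            int acc = 0;
            for (byte[] v : voices)
                if (i < v.length) acc += v[i];   // sum active voices at this sample
            out[i] = (byte) Math.max(-128, Math.min(127, acc)); // clamp to 8 bits
        }
        return out;
    }

    public static void main(String[] args) {
        byte[][] voices = { {10, 20, 30}, {100, 100, 100} };
        byte[] mixed = mix(voices, 3);           // {110, 120, 127} - last sample clamped
        try {
            // The "freewheeling" line: opened once, never closed between
            // sounds; the mixer thread would keep writing into it.
            AudioFormat fmt = new AudioFormat(11025f, 8, 1, true, false);
            SourceDataLine line = AudioSystem.getSourceDataLine(fmt);
            line.open(fmt);
            line.start();
            line.write(mixed, 0, mixed.length);  // the mixer feeds the single line
            line.drain();
            line.close();
        } catch (Exception e) {
            System.out.println("no audio device available");
        }
    }
}
```

The mixing itself is just addition with clamping; the win is that thread and line creation happen once, not per sound effect.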


LinuxDoom had its own separate sndserver process that stayed running, and that worked pretty well even on a 486DX/33 machine. Maybe there was some extra latency compared to the DOS version, but I couldn't tell the difference. They must have done standard IPC via a pipe and signals (I think that code was part of the doomsrc release, so check it).
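The appeal of that design is that the game side just writes tiny "start sound" commands down a pipe and never blocks on mixing. A toy Java sketch of the idea, using an in-process pipe rather than a real child process (the one-byte command format is made up; linuxdoom's actual sndserver protocol is in the released source):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

// Sketch of a sound-server command pipe: the game side writes a
// command per sound start; the server side reads and "plays" them.
public class SndPipe {
    // The server loop: in the real thing this runs in its own process,
    // reading commands and mixing into the audio device.
    static String drain(InputStream in) throws IOException {
        StringBuilder log = new StringBuilder();
        int cmd;
        while ((cmd = in.read()) != -1)
            log.append("start sfx ").append(cmd).append('\n');
        return log.toString();
    }

    public static void main(String[] args) throws IOException {
        PipedOutputStream gameSide = new PipedOutputStream();
        PipedInputStream serverSide = new PipedInputStream(gameSide);

        // Game side: fire-and-forget, one byte per sound start.
        gameSide.write(3);   // e.g. "play sfx #3"
        gameSide.write(7);
        gameSide.close();

        System.out.print(drain(serverSide));
    }
}
```

Real IPC would use an OS pipe to a separate process (plus signals for control), but the producer/consumer shape is the same.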


Just guessing here, but the pause is likely due to the sound not being cached by the engine. In EE, for example, you can configure it to precache sound effects, and that way you don't get pauses (or the disk icon, if you've enabled it).

Ladna said:

Just guessing here, but the pause is likely due to the sound not being cached by the engine. In EE, for example, you can configure it to precache sound effects, and that way you don't get pauses (or the disk icon, if you've enabled it).

Yeah, but that would increase the loading time at program startup.


Even vanilla Doom precaches sounds during sound system initialization. Imagine if it didn't.


Yeah it does increase startup time, so it's definitely a preference thing.

Re: vanilla, I imagine it *tries* to cache them in the zone, but rarely used ones get purged pretty quickly. Again, just a guess here.
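That purging behavior can be sketched as a small LRU cache: sounds load on first use, and when the "zone" fills up the least recently used lumps get evicted (all names, sizes, and the loader here are made up for illustration):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Rough analogue of caching sounds in the zone: entries load on first
// use, and least-recently-used ones are purged when space runs out.
public class SoundCache {
    private final int capacityBytes;
    private int usedBytes = 0;
    private final LinkedHashMap<String, byte[]> cache =
        new LinkedHashMap<>(16, 0.75f, true);   // access order = LRU

    SoundCache(int capacityBytes) { this.capacityBytes = capacityBytes; }

    byte[] get(String lumpName, int sizeBytes) {
        byte[] data = cache.get(lumpName);
        if (data != null) return data;          // cache hit: no disk pause
        // Cache miss: this is where vanilla would hit the disk (and
        // flash the disk icon). Purge LRU entries until the lump fits.
        Iterator<Map.Entry<String, byte[]>> it = cache.entrySet().iterator();
        while (usedBytes + sizeBytes > capacityBytes && it.hasNext()) {
            usedBytes -= it.next().getValue().length;
            it.remove();
        }
        data = new byte[sizeBytes];             // stand-in for loading the lump
        cache.put(lumpName, data);
        usedBytes += sizeBytes;
        return data;
    }

    boolean isCached(String lumpName) { return cache.containsKey(lumpName); }

    public static void main(String[] args) {
        SoundCache zone = new SoundCache(100);      // tiny "zone" for the demo
        zone.get("DSPISTOL", 60);                   // first use: loaded
        zone.get("DSSHOTGN", 60);                   // doesn't fit: DSPISTOL purged
        System.out.println(zone.isCached("DSPISTOL")); // false
    }
}
```

Precaching is then just calling get() for every sound at startup, which is exactly the startup-time-versus-pauses tradeoff discussed above.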


When using an older port, try giving it a huge memory allocation to avoid losing stuff from the cache. Older ports cannot grow their memory allocation, so they purge stuff. The default varies, but it could be around 8 MB or less. I think most levels can be contained in 40 MB.
> Doom -mb 512

DoomLegacy has a separate sound process when compiled for Linux X11. I think this was inherited from linuxdoom?

Justince said:

Thanks for the info guys, looks like the easiest way to get these two processors to get along is going to be the interrupt method.


If you are simply trying to "dedicate" one processor to audio or I/O while the other one runs logic/rendering, sure. That's easy, but it won't really boost much.

If you want to try multithreaded rendering, you can look at the Mocha Doom source code to see some already-implemented methods, though PrBoom also has one. In essence, performance depends a lot on what you choose as the minimum rendering unit/instruction to build a pipeline around.

I have tried both column-based parallelization (the minimum unit is a wall or sprite column) and seg-based (the minimum unit is an entire wall seg, which results in drawing multiple columns).

Both methods have advantages and disadvantages. Column-based is really simple to implement and easy to balance (N threads, each gets 1/Nth of the total columns to render), but it requires dynamic memory allocation (the actual number of columns to render varies heavily) and some overhead for storing the column pipeline. It doesn't scale very well to very high resolutions or complex architecture.
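For the column-based split, the core idea looks roughly like this in Java (the framebuffer layout and the drawColumn body are placeholders for the real column renderers):

```java
// Column-parallel drawing sketch: N threads, each drawing an
// interleaved 1/Nth of the screen's columns. Since no two threads
// ever touch the same column, the framebuffer needs no locking.
public class ColumnSplit {
    static final int WIDTH = 640, HEIGHT = 400;
    static final byte[] screen = new byte[WIDTH * HEIGHT];

    static void drawColumn(int x) {             // stand-in for R_DrawColumn work
        for (int y = 0; y < HEIGHT; y++)
            screen[y * WIDTH + x] = (byte) (x ^ y);
    }

    static void render(int nThreads) throws InterruptedException {
        Thread[] pool = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            pool[t] = new Thread(() -> {
                // Thread id draws columns id, id+N, id+2N, ...
                for (int x = id; x < WIDTH; x += nThreads)
                    drawColumn(x);
            });
            pool[t].start();
        }
        for (Thread t : pool) t.join();         // sync before blitting the frame
    }

    public static void main(String[] args) throws InterruptedException {
        render(4);
        System.out.println("frame done");
    }
}
```

In a real renderer the column list comes out of the BSP traversal each frame, which is where the dynamic allocation and pipeline-storage overhead mentioned above comes in.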

Seg-based is more complicated, especially when it comes to splitting work between multiple threads: some walls will be drawn by different threads, and it's hard to ensure that all threads get an equal amount of work. But, in theory, it should have an advantage with complex architecture, as the number of visible walls is often much lower than the number of individual columns, so there's less overhead.

Flats are a special case, more similar to how segs work. Sprites can be parallelized either by column or by whole sprite, but they can only be rendered in parallel after they have been sorted. Sprite sorting itself can be parallelized, if you have an efficient sorter with little start/stop overhead. Rendering sprites in parallel with individual sprites as the base unit has the same work-splitting considerations as seg-based wall drawing. My approach: with N threads, each of them draws only the sprite columns that fall within its own 1/Nth portion of the screen, occasionally drawing partial sprites. Some sprites might thus be rendered (partially) by more than one thread with no overdraw, e.g. a pinky in your face.
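The screen-slice rule for sprites boils down to clipping each sprite's column span against a thread's slice; a small Java sketch (names and the inclusive-range convention are my own):

```java
// Work-splitting rule for sprites: with N threads, thread i owns
// screen columns [i*W/N, (i+1)*W/N) and draws only the part of each
// sprite inside its slice. A wide sprite is then drawn by several
// threads, each on disjoint columns, so there is no overdraw.
public class SpriteSlice {
    // Returns {firstCol, lastCol} (inclusive) of the sprite that
    // thread `thread` should draw, or null if the sprite misses
    // this thread's slice entirely.
    static int[] clipToSlice(int spriteX1, int spriteX2,
                             int screenWidth, int nThreads, int thread) {
        int sliceStart = thread * screenWidth / nThreads;
        int sliceEnd = (thread + 1) * screenWidth / nThreads - 1;
        int x1 = Math.max(spriteX1, sliceStart);
        int x2 = Math.min(spriteX2, sliceEnd);
        return (x1 > x2) ? null : new int[]{x1, x2};
    }

    public static void main(String[] args) {
        // A "pinky in your face" spanning columns 100..539 on a
        // 640-wide screen, split among 4 threads of 160 columns each:
        // thread 0 gets 100..159, 1 gets 160..319, 2 gets 320..479,
        // and 3 gets 480..539.
        for (int t = 0; t < 4; t++) {
            int[] part = clipToSlice(100, 539, 640, 4, t);
            System.out.println("thread " + t + ": "
                + (part == null ? "nothing" : part[0] + ".." + part[1]));
        }
    }
}
```

Narrow sprites end up fully inside one slice, which is the common case; only the big close-up sprites get split.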

I don't know what programming model you're going to use, though. If you have really low-level control (like on the 32X port, which also had twin CPUs; maybe you should look at its source code), the best thing would be to come up with customized SMP primitives with less overhead than threads. It depends on the OS you'll be using, though.


Another idea is to have one core run the game logic and, at the end of each tic, copy the state of the various objects to a temporary memory location, then immediately start computing the next tic. The other core starts rendering the frame that represents the saved state. If rendering takes about as long as running the logic for each tic, this method should in theory give up to a 100% speedup with 2 cores.

Furthermore, the renderer itself can still be internally parallelized to run a bit faster, but total tic time will be dominated by the slower of the two (needless to say, they should be synchronized at the end of the tic).
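That logic/render pipeline can be sketched with two threads and a one-slot handoff buffer, which lets the logic thread run exactly one tic ahead of the renderer (the toy "state" and class name are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Two-stage pipeline: core A runs tic N+1 while core B renders the
// snapshot of tic N. A one-slot queue is the end-of-tic handoff.
public class TicPipeline {
    static List<Integer> run(int tics) throws InterruptedException {
        BlockingQueue<int[]> handoff = new ArrayBlockingQueue<>(1); // 1 tic of lead
        List<Integer> rendered = new ArrayList<>();

        Thread logic = new Thread(() -> {
            int[] state = {0};                       // toy game state
            try {
                for (int t = 0; t < tics; t++) {
                    state[0]++;                      // advance the game logic
                    handoff.put(state.clone());      // copy state for the renderer,
                }                                    // then start the next tic at once
            } catch (InterruptedException ignored) {}
        });
        Thread render = new Thread(() -> {
            try {
                for (int t = 0; t < tics; t++)
                    rendered.add(handoff.take()[0]); // "draw" the snapshot
            } catch (InterruptedException ignored) {}
        });

        logic.start(); render.start();
        logic.join(); render.join();                 // end-of-run sync
        return rendered;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(5));                  // [1, 2, 3, 4, 5]
    }
}
```

The put() blocks only while the renderer is still busy with the previous frame, which is exactly the "dominated by the slower of the two" behavior described above.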

