Thanks for the info guys, looks like the easiest way to get these two processors to get along is going to be the interrupt method.
If you are simply trying to "dedicate" one processor to audio or I/O while the other one runs logic/rendering, sure. That's easy, but it won't really boost performance much.
If you want to try multithreaded rendering, you can look up the Mocha Doom source code to see some already implemented methods, though prDoom also has one. In essence, performance depends a lot on what you consider as the minimum rendering unit/instruction to build a pipeline around.
I have tried both column-based parallelization (minimum unit is a wall or sprite column) as well as seg-based (minimum unit is an entire wall seg, which results in drawing multiple columns).
Both methods have advantages and disadvantages: column-based is really simple to implement and easy to balance (N threads, each gets 1/Nth of total columns to render), but it requires dynamic memory allocation (the actual number of columns to render varies heavily from frame to frame) and some overhead for storing the column pipeline. It doesn't scale very well to very high resolutions or complex architecture.
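To make the "each thread gets 1/Nth of the columns" idea concrete, here is a minimal sketch in Python. It is not how Mocha Doom actually does it: the names split_evenly, draw_column and render_columns are made up for illustration, a column is assumed to be a simple (x, top, bottom, color) tuple, and the queued columns are assumed not to overlap, so threads never write the same pixel.

```python
from concurrent.futures import ThreadPoolExecutor

def split_evenly(items, n_threads):
    """Split items into n_threads contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_threads)
    chunks, start = [], 0
    for i in range(n_threads):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def draw_column(framebuffer, column):
    """Hypothetical column drawer: fill rows top..bottom of screen column x."""
    x, top, bottom, color = column
    for y in range(top, bottom + 1):
        framebuffer[y][x] = color

def render_columns(framebuffer, columns, n_threads):
    """Draw a queued list of columns, giving each thread about 1/Nth of them."""
    def worker(chunk):
        for column in chunk:
            draw_column(framebuffer, column)

    # The column list is rebuilt every frame because its length varies a lot.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(worker, split_evenly(columns, n_threads)))
```

Balancing is trivial here because every work item is roughly the same size, which is the main appeal of the column-based approach. The growing list is also where the allocation and storage overhead comes from.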
Seg-based is more complicated, especially when it comes to splitting work between multiple threads: some walls will be drawn by different threads, and it's hard to ensure that all threads will get an equal amount of work. But, in theory, it should have an advantage with complex architecture, as the number of actual walls visible is often much lower than the number of individual columns, so there is less overhead.
Flats are a special case, more similar to how segs work. Sprites can be parallelized either by column or by sprite, but can only be rendered in parallel after they have been sorted. Sprite sorting itself can be parallelized, if you have an efficient sorter with little start/stop overhead. Rendering sprites in parallel using individual sprites as the base unit has the same work-splitting considerations as seg-based wall drawing. My approach? With N threads, each of them draws only what falls inside its own 1/Nth portion of the screen: sprites fully contained in that portion are drawn whole, and sprites that cross into a neighbouring portion are drawn only partially. So some sprites might be rendered by more than one thread (each drawing its own part) with no overdraw, e.g. a pinky in your face.
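The screen-portion idea for sprites can be sketched like this in Python. The function names and the (name, x1, x2) sprite tuples are invented for the example, and the strips are assumed to be equal-width vertical bands. The sprite list is assumed to be already sorted.

```python
def strip_bounds(screen_width, n_threads, index):
    """Return the inclusive (lo, hi) column range owned by thread `index`."""
    lo = index * screen_width // n_threads
    hi = (index + 1) * screen_width // n_threads - 1
    return lo, hi

def clip_sprite(x1, x2, lo, hi):
    """Clip a sprite spanning columns x1..x2 to a strip; None if it misses the strip."""
    left, right = max(x1, lo), min(x2, hi)
    return (left, right) if left <= right else None

def sprite_work(sorted_sprites, screen_width, n_threads):
    """For each thread, list the (name, left, right) spans it draws, in sorted order."""
    work = []
    for index in range(n_threads):
        lo, hi = strip_bounds(screen_width, n_threads, index)
        spans = []
        for name, x1, x2 in sorted_sprites:
            clipped = clip_sprite(x1, x2, lo, hi)
            if clipped is not None:
                spans.append((name, *clipped))
        work.append(spans)
    return work
```

With a 320-wide screen and 2 threads, a pinky spanning columns 100..250 ends up as 100..159 for the first thread and 160..250 for the second, so no pixel is drawn twice. The catch is balance: if all the sprites sit on one side of the screen, one thread does all the work.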
I don't know what programming model you're going to use, though. If you have really low-level control, the best thing would be to come up with customized SMP primitives with less overhead than threads (the 32X port also ran on twin CPUs, so maybe you should look at its source code). It depends on the OS you'll be using, though.
Another idea is to have one core run the game logic, and at the end of each tic, copy the state of various objects to a temporary memory location, and immediately start computing the next tic. The other core will start rendering the frame that represents that saved state. If you assume that rendering takes as long as running the logic for each tic, this method should in theory give up to 100% speedup with 2 cores.
Furthermore, the renderer itself can still be internally parallelized in order to run a bit faster, but total tic running time will be dominated by the slowest of the two (needless to say, they should be synchronized at the end of the tic).
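Here is a rough Python sketch of that logic/render overlap, just to show the structure. The names run_pipelined, run_logic and render are placeholders, the game state is assumed to be something copy.deepcopy can handle, and ordinary Python threads won't give a real speedup on CPU-bound work, so treat it as an illustration of the synchronization only.

```python
import copy
import queue
import threading

def run_pipelined(initial_state, n_tics, run_logic, render):
    """Run game logic on the calling thread and rendering on a second thread."""
    snapshots = queue.Queue(maxsize=1)  # at most one saved state waiting to be drawn
    frames = []

    def render_loop():
        while True:
            snapshot = snapshots.get()
            if snapshot is None:        # sentinel: no more tics
                break
            frames.append(render(snapshot))

    renderer = threading.Thread(target=render_loop)
    renderer.start()

    state = initial_state
    for _ in range(n_tics):
        state = run_logic(state)
        # Copy the state so the next tic can modify it freely while it is drawn.
        # put() blocks while an earlier snapshot is still waiting, so the logic
        # can never run far ahead of the renderer.
        snapshots.put(copy.deepcopy(state))

    snapshots.put(None)
    renderer.join()
    return frames
```

If logic takes L and rendering takes R per tic, running them back to back costs L + R, while the overlapped version costs about max(L, R) plus the copy. That is where the "up to 100%" figure comes from when L and R are equal, and why the slower of the two dominates otherwise.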