Rum and Raisin Doom - Haha software render go BRRRRR! (0.3.1-pre.8 bugfix release - I'm back! FOV sliders ahoy! STILL works with KDiKDiZD!)


Well, fucking Jack Vermeulen strikes again with his proprietary bullshit.

 

I've got the blockmap functional, but far from usable. The updated Planisphere 2 was built with DeepBSP.

 

https://doomwiki.org/wiki/Blockmap

 

So let's use that as a basis. You can see that the blockmap entries table is full of offsets from the start of the blockmap lump (which need to be multiplied by the size of a 16-bit integer, i.e. 2 bytes, to get byte offsets).
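For reference, the layout in question is simple enough to sketch out (a sketch only - the field names here are mine, not from any source port):

struct blockmapheader_t
{
	int16_t originx;	// south-west corner of the grid, in map units
	int16_t originy;
	int16_t columns;	// grid dimensions, in 128x128-unit blocks
	int16_t rows;
	// Followed by columns*rows uint16_t entries, each an offset counted
	// in 16-bit words from the start of the lump (multiply by 2 for a
	// byte offset). Each offset points at a list of line indices that
	// starts with a 0 and terminates with 0xFFFF.
};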

 

Planisphere 2's blockmap is 254 blocks wide by 364 blocks high: 92,456 distinct entries. This alone means that no index in that table can point to any area after the table, as the maximum value of a 16-bit integer is 65,535 (it needs a byte offset of at least 184,920 - the 8-byte header plus 92,456 two-byte entries). So it looks to me like the offsets in this case are measured from the end of the table. And assuming that, I get the game up and running.

 

But it's still not quite good enough. In the start room, it looks like the blockmap is offset a bit. I don't think that's actually the case, though. The blockmap lump is 430,680 bytes; that leaves 245,760 bytes for the lists the table refers to, i.e. 122,880 distinct 16-bit integers. Still outside the range of a 16-bit integer. So I did some counting. There are 21,903 unique line-index lists, and 23,903 unique offsets in the table. That indicates compression by list reordering and reuse. But a common index that fills empty cells is 26,924. As a raw offset, it's rubbish. As an index into a unique list, it's out of bounds.

 

Every other port I've looked to for reference that handles extended nodes rebuilds the blockmap itself. Even Risen3D does this, and that's the one port I would have expected to have done it according to whatever DeepBSP does.

 

So. Am I missing something, or is DeepBSP's blockmap format actually known and documented somewhere?

4 hours ago, GooberMan said:

So. Am I missing something, or is DeepBSP's blockmap format actually known and documented somewhere?

It is probably not what you mean, but Crispy/Doom Retro (since version 1.8) support DeepBSP V4 extended nodes, which implies the format is known and thus documented. Perhaps @fabian or @bradharding can give a better answer.

7 hours ago, GooberMan said:

(Video might take a little while to get away from potato quality from time of posting, turns out that someone has put copyright claims up for Duke Nukem 3D tracks so I clicked the "mute section" option. One minute out of 10, despite the track playing on the entire video. But that's enough to make YouTube take its sweet motherloving time to process.)

There's all sorts of copyright trolls out there. Pretty sure Lee Jackson wouldn't do something like that.

 

Also, if that comment in your Doom Retro link is actually what I think it is, @Maes may be able to shed some light on it.

16 hours ago, Dark Pulse said:

There's all sorts of copyright trolls out there. Pretty sure Lee Jackson wouldn't do something like that.

 

Also, if that comment in your Doom Retro link is actually what I think it is, @Maes may be able to shed some light on it.

 

Yup, that's the good old Mocha Doom blockmap hack alright. Hey, the purpose was exactly for it to make it everywhere, so mission kind-of accomplished I guess? If anything, I wonder why this isn't standard in every Doom port by now (at least those using a source code reasonably close to the original, and not aiming at 100% bug replication). It's a relatively unintrusive and "oldschool friendly" fix.

 

As for shedding some light.... here's an old post I made on the subject. Hoo boy, has time passed!


One of the problems is that there's nothing on the Doom Wiki for it at the moment. Might be time for an article edit; cleanboxing the code without this kind of knowledge isn't easy.

 

Thanks for the information, Maes.

20 hours ago, GooberMan said:

The DeepBSP nodes are actually well documented. There's zero information on the blockmap, however.

 

Just quickly glancing at Doom Retro, it does actually seem to have what I'm looking for. I'll look a bit deeper into that.

 

Thank ye, sir.

Then I misunderstood the thing, but I am glad I got some pointers out.

 

No problem, glad I could help :)

3 hours ago, GooberMan said:

One of the problems is that there's nothing on the Doom Wiki for it at the moment. Might be time for an article edit; cleanboxing the code without this kind of knowledge isn't easy.

 

Thanks for the information, Maes.

Yeah, you would be amazed how much useful information is contained in user posts. Glad it is linked now :)


I've had a long-standing problem with the R&R renderer, concerning bleeding midtexes in certain circumstances.

 

[screenshot]

 

For a long time, I considered it a problem with the R_ScaleFromGlobalAngle function. It was the first thing I needed to tweak at high resolutions: the default max scale of 64 is fine at 320x200, but it immediately results in texture warping at higher resolutions. I ended up deciding that the 64 max scale was theoretically fine, and as a result I now multiply that value by <render width>/320 (i.e. scale it by the ratio to the original screen width). Which works fine - until you get to the texture leak seen on the right of the screenshot.

 

So I sat there and traced the code, and eventually came across the particular line that caused a linedef that wasn't drawing to suddenly draw:

spritecontext->sprtopscreen = centeryfrac - FixedMul(spritecolcontext.texturemid, spritecontext->spryscale);

And it clicked that the problem is in fact a bit-depth problem. So I checked the result of the multiplication, and sure enough: the correct value requires 33 bits to store. And here we are with a fixed_t that's 32 bits wide.
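A minimal illustration of the overflow, using the usual vanilla-style 16.16 multiply (exact in 64 bits, truncated back to 32 on return):

#include <cstdint>
#include <cstdio>

typedef int32_t fixed_t;
const int FRACBITS = 16;
const fixed_t FRACUNIT = 1 << FRACBITS;

// Vanilla-style 16.16 multiply: exact in 64 bits, then truncated to 32.
fixed_t FixedMul( fixed_t a, fixed_t b )
{
	return (fixed_t)( ( (int64_t)a * (int64_t)b ) >> FRACBITS );
}

int main()
{
	// A texturemid of 20,000 units times a scale of 2.0: the true result
	// (40,000 in 16.16) needs 33 bits as a signed value, so the 32-bit
	// truncation wraps negative.
	fixed_t texturemid = 20000 * FRACUNIT;
	fixed_t scale = 2 * FRACUNIT;
	int64_t exact = ( (int64_t)texturemid * scale ) >> FRACBITS;
	printf( "exact: %lld, truncated: %d\n", (long long)exact, FixedMul( texturemid, scale ) );
}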

 

I've been planning on moving the renderer over to 40.24 fixed point instead of 16.16. That will solve this particular problem all by itself. But then someone asked me just how much worse floating-point performance is. So I whipped up a quick benchmarking program and got this on my Skylake i7:

int32 100,000,000 muls: 52038.5us
int64 100,000,000 muls: 53254.6us
float 100,000,000 muls: 140550us
double 100,000,000 muls: 150290us

Which is about what I expected: 32- and 64-bit integer performance is basically the same, while 64-bit float is slower than 32-bit float.
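The benchmark itself was nothing fancy - something like this reconstruction (not the exact original; the volatile sink stops the compiler from deleting the loop):

#include <chrono>
#include <cstdint>
#include <cstdio>

// Time N multiplies for a given type, accumulating into a volatile so
// the whole loop can't be optimised away.
template < typename T >
double TimeMuls( int64_t count )
{
	volatile T sink = (T)1;
	T value = (T)3;
	auto begin = std::chrono::high_resolution_clock::now();
	for( int64_t i = 0; i < count; ++i )
	{
		sink = sink * value;
	}
	auto end = std::chrono::high_resolution_clock::now();
	return std::chrono::duration< double, std::micro >( end - begin ).count();
}

int main()
{
	const int64_t N = 100000000;
	printf( "int32 muls: %.1fus\n", TimeMuls< int32_t >( N ) );
	printf( "int64 muls: %.1fus\n", TimeMuls< int64_t >( N ) );
	printf( "float muls: %.1fus\n", TimeMuls< float >( N ) );
	printf( "double muls: %.1fus\n", TimeMuls< double >( N ) );
}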

 

I am, of course, also targeting ARM processors with this port. Which meant running the benchmark on my Raspberry Pi 4. It's running 64-bit Ubuntu 22, which makes these results surprising.

int32 100,000,000 muls: 463821us
int64 100,000,000 muls: 662320us
float 100,000,000 muls: 581962us
double 100,000,000 muls: 580035us

32- and 64-bit float performance is equivalent, which is actually really good to know. But 64-bit integer performance worse than float? Okay. That's weird. I don't know why that would be the case yet.

 

End of the day: you might think moving from 16.16 fixed to 40.24 fixed will be bad for the renderer - until you note that fixed-point multiplication and division are done with 64-bit integers anyway. It does suggest that there's more performance to be had on the ARM by switching to float (at least until a newer revision that doesn't make a mess of 64-bit integers).

 

But since I'm in the middle of moving the renderer over to C++, the fixed-point template code I'm writing will be able to magically hide switching between integer and floating-point math behind one template parameter. I'll come back to that later.
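As a sketch of the idea (names invented for illustration, not the actual R&R code):

#include <cstdint>

// The storage type and fractional bit count are the template parameters,
// so the same rendering code can compile against 16.16, 44.20, etc.
template < typename Storage, int FracBits >
struct FixedPoint
{
	Storage value;

	static FixedPoint FromInt( int32_t i )	{ return { (Storage)i << FracBits }; }
	int32_t ToInt() const					{ return (int32_t)( value >> FracBits ); }

	FixedPoint operator*( FixedPoint rhs ) const
	{
		// Do the multiply at 64 bits. A real implementation needs more
		// care for a 64-bit storage type, where the product can exceed
		// even that.
		return { (Storage)( ( (int64_t)value * (int64_t)rhs.value ) >> FracBits ) };
	}
};

using fixed1616_t = FixedPoint< int32_t, 16 >;
using fixed4420_t = FixedPoint< int64_t, 20 >;

A floating-point specialisation of the same template can store a plain float and implement the operators directly, which is how a single template parameter can flip the entire renderer between integer and floating-point math.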


Let's talk about how fun it is to break the Doom renderer.

 

Doing work like this means I'm always breaking something when I'm in the middle of changing it wholesale over to something else. Like with this early example:

[video]

The renderer is going to break often if you try poking around in it. A lot of this is down to the language used, to be honest. Take, for example, my desire to move over to 40.24 fixed point in the renderer. I've written a C++ fixed-point object that I plan on using once I've done the full C++ pass on the codebase, but until then the only difference between fixed_t and int32_t is the identifier you use to declare a variable - one is a typedef of the other. Thus you can't do operator tricks to catch misbehaving code. You have to do things the manual way and audit every use of a variable if you want to change its type.

 

The core routines on my GitHub right now use 44.20 fixed-point values. Many of the supporting structures haven't been moved over to that format yet, so the work is incomplete. But with the core routines working, if the renderer breaks it will explicitly be because input values haven't been converted to 44.20 correctly.

 

This, of course, means that I broke the renderer to get here. My new flat renderer was fairly painless - it literally worked first time, so it's not worth talking about here.

 

Getting the column renderer to run though, hahaha.

[screenshot]

 

So. That meant trawling through the code, identifying select conversions to make, and trying again.

[screenshot]

 

The sky rendering correctly is a very good thing - it proves that the core rendering routines are working as intended, and that it's just a matter of correcting the inputs for rendering walls and sprites.

 

[screenshot]

 

Like so. We're nearly there.


[screenshot]
 

There's probably a glitchcore aesthetic mod waiting to happen in some of these shots. The pistol actually scrolls as you move, and the sprites all look corrupted. Still, it wasn't long before I realised that I still hadn't written my own masked column renderer, and thus would have to do a bit of maintenance on the original functions. And with that, the fun was over and everything was working.

 

[screenshot]

 

Well. Almost working.

 

[screenshot]

 

This here is a test map that immediately triggers the texture bleeding I mentioned in my previous post. And lo and behold, switching to 40.24 still had the problem. So I checked, and to my bemusement I still needed more bits than I had to store the integral part of the value. Fiiiiiiiiiiine. 44.20 it is. And with that, the switchover was up and running.

 

To double, triple prove it to myself: the first time I noticed this bug was playing Plutonia 2, which uses a 2S linedef for decoration very early in MAP01. I noticed glitches around a certain area, so I lined myself up and made a save game, and I've used it for testing and reference ever since. This is the other side of breaking the Doom renderer - when it breaks for no obvious reason while literally everything else is working as expected. Now that I'm using a fixed format that retains the correct information, though? Let's check whether we're finally properly done.

 

Before:

[screenshot]

 

And after:

[screenshot]

 

Finally.

 

Anyway. I've done most of the hard work of moving the renderer over to a different fixed-point format. When it's all C++, it won't be too much effort to properly finish that work and then specialise my fixed-point object to run on floats internally. Run some benchmarks. See which renderer runs best on which platform. And because of how I'm setting it up, it'll be simple to switch back to 16.16 and run those tests too.


Let's also talk about trying to fix the long lines bug.

 

This is one of those bugs that just goes away if you increase your fixed-point accuracy or convert to float. Do it wholesale and you never see the bug again. Everyone go home, good job.

 

But I wanted to understand the exact causes of it better.

 

The Doom Wiki entry on the subject blames the atan table being only 2048 entries long:

 


The effect is caused by a severe lack of precision in one of the engine's distance calculation functions. As part of the algorithm to determine the distance of a wall from the player, the engine tries to calculate the distance to one of the line's vertices. The engine does so in part by finding an angle to the vertex using an inverse-tangent operation. As this would be slow and difficult to calculate directly, the engine stores a lookup table with inverse tangent values. However, the table is only 2048 values long. This means the distance cannot be calculated with any precision as the distance to a vertex nears, or exceeds, 2048 units.

 

Now, this wasn't sitting right with me. A long time ago, armed with the community knowledge that the trigonometric tables were too small, I actually implemented bigger tables. This chunk of defines right here, in fact, expands the tables from the 2048-entry default to, currently, 16,384 entries. And yet, as you can see in the Planisphere 2 video, I still get the long wall error. So I started digging deeper.

 

I started tracing the code and noticed, with a very simple test map, that this chunk of code was producing bad values - and it was exactly the call into the function highlighted by the Doom Wiki, R_PointToDist. Doom's renderer is really no more complicated than high-school trigonometry, and what that chunk is doing is calculating the hypotenuse of the right triangle implied by the X and Y deltas (the legs run from (x,y) to (0,y) and from (0,y) to (0,0), with (0,0) to (x,y) being the distance calculated).
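For reference, here's what vanilla does (paraphrased from the Linux Doom source) - it finds the angle via the tangent-to-angle lookup, then divides the long leg through by its sine-used-as-cosine:

// Distance from the view point to (x, y), via table lookups rather than
// a square root. Paraphrased from the Linux Doom source.
fixed_t R_PointToDist( fixed_t x, fixed_t y )
{
	fixed_t dx = abs( x - viewx );
	fixed_t dy = abs( y - viewy );

	// Keep dy <= dx so the tangent lookup stays within the table.
	if( dy > dx )
	{
		fixed_t temp = dx;
		dx = dy;
		dy = temp;
	}

	// atan( dy / dx ), rotated 90 degrees so finesine acts as cosine.
	angle_t angle = ( tantoangle[ FixedDiv( dy, dx ) >> DBITS ] + ANG90 ) >> ANGLETOFINESHIFT;

	// dx / cos( atan( dy / dx ) ) == the hypotenuse.
	return FixedDiv( dx, finesine[ angle ] );
}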


The values being returned looked fine. My test map doesn't cover, say, the entire blockmap range, and the 16.16 fixed values were multiplying just fine and returning sensible results. I converted the function to my new 44.20 fixed format anyway, and saw no difference in behaviour. Hmm. Time to dig deeper.

 

I had noticed elsewhere in the code that some angles were being calculated all whacky. When the wall was rendering correctly, the vertex straight ahead of me came back at roughly 90 degrees and the far vertex at somewhere around 150. Cool. But when it breaks, the far vertex returns very close to 180. Huh? That's weird. Time to inspect the next function then: R_PointToAngle.

It calls into a function called SlopeDiv, and oh, there's your problem. I linked to the Linux Doom source there to highlight what I was saying about type fluidity in C compared to C++. This is a function that has always taken unsigned int parameters and returned an index into the relevant tables. The problem? The values passed to it are always fixed_t. Now, clearly they went unsigned because they didn't want to deal with negative values in a table lookup. That's fine - I can do a clamp operation, which the code does anyway so as not to exceed the maximum range. Of course, being 16.16, the divide was hitting that accuracy problem again. Convert over to 44.20 fixed point, compile, run.
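For reference, the vanilla function (again paraphrased from the Linux Doom source) - note the unsigned parameters receiving fixed_t values, and the clamp to the table range:

// num and den arrive as fixed_t values but are treated as unsigned, and
// the result is clamped so it can't exceed the tangent table's range.
int SlopeDiv( unsigned int num, unsigned int den )
{
	if( den < 512 )
	{
		return SLOPERANGE;
	}

	unsigned int ans = ( num << 3 ) / ( den >> 8 );
	return ( ans <= SLOPERANGE ) ? (int)ans : SLOPERANGE;
}

That num << 3 can start shedding the top bits of a large 16.16 value, on top of the precision the truncated divide already throws away.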

End result: It works!
 

[screenshot]

And just to prove it's not the small size of the trigonometry tables that's to blame, I set those defines I highlighted back to zero so the original tables get used. Resulting screenshot:

[screenshot]

 

So we're way closer to Planisphere 2 being playable. We're not quite there yet, though. I'm intent on working out just what the heck format the stored blockmap takes (every other port that can play this regenerates the blockmap on load). And, well, there are also some other precision issues.

[screenshot]

 

That upper-left corner of the map? Yeah. With all the 16.16 fixed math still handling the BSP parsing and the like, the multiplications start wrapping around at a certain point when both the X and Y values are positive. That'll be a no-brainer fix: 44.20 fixed-point or floating-point math will entirely eliminate the issue.

 

It's probably worth editing that wiki article, though, since it only correctly identifies one of the two functions that need to operate at higher precision; and the 2048-entry trigonometry tables are not an actual issue at the end of the day (EDIT: except for wall wobble, but I suspect other precision issues are at play there too).
 

longwall.zip



So, the remaining visual errors in Planisphere 2 ended up not being fixed-point accuracy issues at all. Instead, I needed to update the subsector struct to use 32-bit indices so that data wasn't being truncated from the DeepBSP nodes.

 

[screenshot]

[screenshot]

 

So, hooray. Which now leads to the question: where should I optimise? Well, that requires getting some good profiling information out of the BSP traversal. Which means updating my profiling graphs with the things I'm interested in.

 

The first port of call for better performance is visplane lookups. I've already implemented the reverse-search method so it should be a speed boost in most cases. But, well, it's still a chonker.

 

[screenshot]

 

Once I've moved more of my code over to C++, I'll likely go down the "visplane hashing" route that everyone else has been doing since Boom.

 

What I'm interested in though is the line clipping and preparation for rendering. It's a heftier chonker.

 

[screenshot]

 

It's also something that I can optimise the good ol' fashioned way instead of waiting for my code to use a better language. This will probably be a focus for my efforts in the immediate future.

 

But what about the BSP traversal itself? Well that's included in the "everything else" category.

 

[screenshot]

 

Again, this is something I can optimise the good ol' fashioned way. But it won't be my biggest win. Well, I guess optimising the math functions it uses will have a knock-on effect all over the rendering module. But there's definitely bigger gains to be had elsewhere.

 

Anyway. I might make another point release soon now that Planisphere renders. I still don't have a solution for its blockmap, but I think it should essentially be playable anyway once I get the origin offsets right (i.e. do something either exactly like, or similar to, Maes' solution).

TINY EDIT: Got the automap rendering again for it too. Short story: the extents of the map cause an integer wrap when the automap calculates the width with a (pos_max - pos_min) calculation, which makes either the width or the height negative, resulting in nothing being drawn. The automap renders in 44.20 fixed point now, which completely sidesteps that problem.
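For illustration, the wrap in question (16.16 fixed point; any axis extent of 32,768 map units or more overflows the subtraction):

#include <cstdint>
#include <cstdio>

typedef int32_t fixed_t;
const int FRACBITS = 16;

int main()
{
	// Hypothetical extents for a very large map: 20,000 units either
	// side of the origin, 40,000 units in total.
	fixed_t pos_min = -( 20000 << FRACBITS );
	fixed_t pos_max = 20000 << FRACBITS;

	// 40,000 << 16 needs 33 bits, so the 32-bit subtraction wraps negative.
	fixed_t width = pos_max - pos_min;
	printf( "width: %d (%.1f units)\n", width, (double)width / ( 1 << FRACBITS ) );
}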

[screenshot]

Edited by GooberMan : automap image added


So. I've finally got this Planisphere 2 blockmap figured out.
 

On 6/1/2022 at 7:02 PM, GooberMan said:

But a common index that fills empty cells is 26,924. As a raw offset, it's rubbish. As an index into a unique list, it's out of bounds.

 

The first offset for a blockmap list after the header and the table is 92,460 (in 16-bit words). Do a modulo of that with 65,536 (i.e. the number of values a 16-bit integer can hold before integer overflow ruins your day) and you get 26,924. So, end of the day, the blockmap just didn't account for integer overflow. What this means, though, is that we can apply some corrective steps to the table:

  1. Initialise an offset value to (<first entry offset> % 65,536)
  2. Check if the current list offset in the table is less than the previous one. If so, increment the offset value by 65,536
  3. Add the offset value to the current index
    1. If, however, the index is equal to (<first entry offset> % 65,536), then set it to that first entry

Which now leaves us with an almost-likely-correct blockmap table. I say almost likely because each time you detect an integer wraparound, there's a chance you'll encounter a 26,924 value that isn't meant to point to the first list.
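Here's a sketch of those steps as I read them (names are mine, and I'm taking step 1's initialisation as the full 65,536 that the first list has already wrapped past):

#include <cstdint>
#include <vector>

// Corrective pass over the raw 16-bit table entries. 'firstlistoffset'
// is the word offset of the first list after the header and table -
// 92,460 for Planisphere 2, making the sentinel 26,924.
std::vector< int32_t > CorrectBlockmapTable( const std::vector< uint16_t >& table, int32_t firstlistoffset )
{
	const int32_t WRAP = 65536;
	const int32_t sentinel = firstlistoffset % WRAP;

	std::vector< int32_t > corrected;
	corrected.reserve( table.size() );

	// The first list already sits one full wrap past what 16 bits can hold.
	int32_t offset = firstlistoffset - sentinel;
	int32_t previous = 0;

	for( uint16_t raw : table )
	{
		// Empty cells share one list, and the stored value collides with
		// the wrapped first-list offset. Resolve those first, and don't
		// let them participate in wrap detection.
		if( raw == sentinel )
		{
			corrected.push_back( firstlistoffset );
			continue;
		}

		// Remaining offsets were written in increasing order, so a
		// decrease means the 16-bit value wrapped around again.
		if( raw < previous )
		{
			offset += WRAP;
		}
		previous = raw;

		corrected.push_back( raw + offset );
	}

	return corrected;
}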

Add a bit of a map rendering hack, and we can iterate through blockmap cells to show which lines are in use.

[video]

Which looks about correct, yeah. I haven't verified whether there are blockmap gaps thanks to that first-list-offset hack, but what I do know now is that we very definitely have a blockmap that reports the expected lines for the expected cells.

And running around in the map, there are parts that are doing blockmap lookups correctly:

[screenshot]

 

And parts that, well, aren't. Like the starting room:
[screenshot]
 

So that's going to be the previously mentioned issue of the math producing bad values for the lookup. Now, I'm keeping the playsim vanilla-accurate. But fixing that - either via the Maes method or some other method - means making the playsim do things that the vanilla playsim does not. So I've got a decision to make. I already require people to specify the -removelimits command line option to load this map. Maybe the playsim could read that and do correct blockmap lookups for large maps?

Edited by GooberMan : Needed the full 65,536. Derp.


So. I accidentally ZDoom node support.

 

Planisphere 2 is basically working (I'll upload a playthrough video), so I asked Ling for a new map to test R&R on. The Given was suggested. And it immediately wouldn't load. I was a bit puzzled at first, but it turns out that I can't read, and it needs a port that implements ZDoom nodes.

So I cleanboxed it, with nothing but the ZDoom wiki to tell me the format of the nodes. I'd already done some work on the loading code to template it, allowing me to write code once and just swap in input types as necessary. This has worked for the limit-removing (i.e. interpreting vanilla data as unsigned) and DeepBSP (lots of size changes all over the place) node types. ZDoom nodes are a bit special though - they put several lump types in one lump, and the structures are wildly different in some cases. So the code that loads its NODES lump isn't exactly templated, but it does reuse much of the existing code.

 

The only bit of note you need to know to load ZDoom nodes and jam them into a vanilla renderer is the data the format doesn't include in the file itself. P_ApproxDist, R_PointToAngle2, and making sure you get the front and back sectors correct when converting the data are the only things you need to take into consideration. Everything else is very straightforward.
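In code, that amounts to something like this per seg (a sketch using vanilla-style names; treat the details as illustration rather than the port's actual loader):

// Fill in the seg fields that the ZDoom extended node format doesn't
// store. 'side' comes from the seg data; note that vanilla spells the
// distance function P_AproxDistance.
void DeriveSegFields( seg_t* seg, int side )
{
	line_t* line = seg->linedef;

	// The angle the seg faces, derived from its two vertices.
	seg->angle = R_PointToAngle2( seg->v1->x, seg->v1->y, seg->v2->x, seg->v2->y );

	// Texture offset along the linedef, measured from the linedef vertex
	// matching the side the seg runs along.
	vertex_t* origin = ( side == 0 ) ? line->v1 : line->v2;
	seg->offset = P_AproxDistance( seg->v1->x - origin->x, seg->v1->y - origin->y );

	// Front and back sectors come from the sidedefs.
	seg->sidedef = &sides[ line->sidenum[ side ] ];
	seg->frontsector = seg->sidedef->sector;
	seg->backsector = ( line->flags & ML_TWOSIDED ) ? sides[ line->sidenum[ side ^ 1 ] ].sector : NULL;
}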

I wrote the code, and it compiled. Hit run, and the map loaded and rendered first time.

 

[screenshot]

 

So that note in the Cacowards about the performance wasn't fucking around. That's with a 2560x1200 backbuffer, which is more pixels than 1080p. Lowering it to 1706x800 still won't get you a 60FPS renderer on my Skylake i7 @ 2.6GHz. This will be a good testbed for optimisations indeed.

So. What's taking all the time?

 

Visplanes:
[screenshot]

And preparing walls for render:
[screenshot]

 

So basically the two areas I was going to focus on with Planisphere 2.

 

tl;dr is that now my challenge is to get this running at an acceptable rate on my Raspberry Pi. This shall be fun.

EDIT: Fixed my "remove limits" check to not hammer M_CheckParm all the time. Got some tasty framerate back. 3 milliseconds is 3 milliseconds, so I'll take it.
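Presumably the fix looks something like this (a sketch, not the actual commit) - M_CheckParm walks the whole argument list on every call, so resolve it once and cache the answer:

// Resolve the command line check once instead of on every lookup.
static bool RemoveLimits()
{
	static const bool removelimits = M_CheckParm( "-removelimits" ) != 0;
	return removelimits;
}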

[screenshot]

Edited by GooberMan : Showing off how programming like a noob can ruin your day


Here's a video of Planisphere 2.

[video]

I've got some optimisations I want to try before I make another point release.


Now let's talk about profiling and tracking down performance issues.

 

Sample-based profiling on Windows is in a bit of a sad state. The tools available to you (including the ones built into Visual Studio) are all bound by the kernel's profiling report rate, which allows a maximum of 8KHz. That was a sensible value back in, say, the 90s, when home PCs could only hit 100MHz if you were rich: IPC wasn't great, and all the delays in the system meant 8KHz would still give you a good idea of what your program was actually doing.

 

So anyway, I'm on a 2.6GHz processor. But let's look at it another way: if you're trying to keep to 60FPS, 8KHz means Windows will only give you about 133 samples per frame. That's bonkers.

There is another form of profiling commonly used: instrumented profiling. This means inserting hooks into your code that send out markers for a profiler. The common profiling tools all support this. And since I want to dig in and see where my bottlenecks are, I now support this too.
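The usual shape of such a hook is a scope marker that reports begin/end times when it destructs - something like this sketch (invented names, not R&R's actual API):

#include <chrono>
#include <cstdint>

// Hypothetical sink that the profiler UI would consume.
void ProfileReport( const char* name, int64_t begin_ns, int64_t end_ns );

class ProfileScope
{
public:
	explicit ProfileScope( const char* name )
		: name( name )
		, begin( std::chrono::steady_clock::now() )
	{
	}

	~ProfileScope()
	{
		auto end = std::chrono::steady_clock::now();
		ProfileReport( name,
					   begin.time_since_epoch().count(),
					   end.time_since_epoch().count() );
	}

private:
	const char* name;
	std::chrono::steady_clock::time_point begin;
};

// Usage: one line at the top of any function you want on the graphs.
// void R_RenderPlayerView() { ProfileScope scope( "R_RenderPlayerView" ); /* ... */ }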

I've also built a bare-bones UI into R&R, because that's what I do.

 

[screenshot]

 

Here's an example from something I was profiling. I test-rendered the sky twice with two different functions to compare and - wait, hold on, why is my limit-removing column drawing function running 7 times slower than the standard one? And what is a limit-removing column drawing function?

First up, the what: it's a column renderer that can handle arbitrary texture heights instead of the hard-coded 128-tall textures from the original Doom code. Cool. But why is it such a dog?

// The original sampling function: a modulo on every single sample to
// wrap frac into the texture's height.
static INLINE pixel_t Sample( colcontext_t* context, rend_fixed_t frac, int32_t textureheight = 0 )
{
	return context->source[ (frac >> RENDFRACBITS ) % textureheight ];
}

Well there's your problem right there. The modulo operator. Or, in English, the remainder of a divide operation. And it's happening on Every. Single. Sample. Welp. We can do better here. I avoid branches as much as possible, but to get some speed back we can throw one in. A bit of hand tuning later and:

// The replacement: since frac advances by less than one texture height
// per sample, a compare-and-subtract wraps it without the divide.
static INLINE pixel_t Sample( colcontext_t* context, rend_fixed_t& frac, const int32_t& textureheight )
{
	rend_fixed_t texfixed = (rend_fixed_t)textureheight << RENDFRACBITS;
	if( frac >= texfixed )
	{
		frac -= texfixed;
	}

	return context->source[ frac >> RENDFRACBITS ];
}

Much better. I'm still not happy with it, but I've minimised operations as much as possible, so that's one positive. (EDIT: That's a lie - I only need to do that texture-height conversion to fixed once. Derp.)

 

Short story, it got back down to acceptable ranges.

[screenshot]

I can live with that. This is a 1080p-equivalent render buffer, so that's acceptably slower than the standard function. Now, there are tricks I can do so that this function essentially never gets called... but that's for another time.

 

Having my profiler running though means that I've been able to track down other things that aren't ideal and can be trimmed away. So let's do some screenshot comparisons here. Before I implemented the limit-removing column renderer:

[screenshot]

 

After implementing it:

 

[screenshot]

 

And after optimising it and a couple of other things:

[screenshot]

 

Nice. I'm at a net win compared to before the limit-removing column renderer. Of course, we're far from finished here. Implementing instrumentation means I've been drilling down into exactly what the problems are and thinking of ways to deal with them. And I need to point out: having this API on does slow your program down. You might think "inaccurate times are useless". Not necessarily. We can still determine the proportion of time a function takes accurately enough, and work out what to optimise from there. I have some clear targets here:

[screenshot]

(And yes, that extra time you're seeing on the render graphs is the current overhead of the profiler)


So. Since I made that post about 24 hours ago, I've shaved 7 milliseconds off the rendering of that Planisphere 2 scene.

 

[screenshot]

 

How, you might ask? Well, let's go back to a previous optimisation/discovery of mine.

 

I rewrote the flat renderer a while back. The original flat renderer worked great for a non-transposed buffer: spans move left to right across the screen/backbuffer, so the output writes stay cache-coherent. But that's no good for R&R's transposed backbuffer - the same kind of cache misses you used to get in the wall renderer just moved over to the flat renderer. Obviously, I needed to render visplanes going down the screen, just like walls, to retain the performance benefits of a transposed backbuffer.

 

The discovery I made is that the span renderer is kinda unnecessary. I mean, it totally was necessary back in 1993: it precalculated some values that are constant along horizontal lines, and converting visplane lines to spans resulted in faster rendering than trying to render visplane lines one by one. But it occurred to me when writing the new code: visplanes are actually just a collection of raster lines for a perspective-correct texture mapper. And that's the code I implemented.

 

So anyway. Visplanes are literally just screen-width arrays of rasterlines. And there's one thing you can say about most visplanes: they do not cover the entire screen. Thus, most visplanes are massive wastes of memory - especially when you get to high-resolution rendering.

 

I've hit the delete key on the old visplane code. It's gone. You won't find it in R&R unless you roll back to an earlier revision.

 

In its place, I've put raster regions.

 

They superficially function like visplanes - a collection of rasterlines. Their storage is temporary, though, obtained from a pool that gets reset at the start of every render frame. Every time I want to add new lines, I don't try to match against previous raster groups. I just grab a new group and storage for the lines, store it in a singly-linked list, and off I go.
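In sketch form (names invented), the pool is just a linear allocator that rewinds every frame:

#include <cstdint>
#include <cstddef>

struct rasterline_t
{
	int32_t top;
	int32_t bottom;
};

struct rasterregion_t
{
	int32_t minx;
	int32_t maxx;
	rasterline_t* lines;	// maxx - minx + 1 entries
	rasterregion_t* next;	// singly-linked list, rebuilt every frame
};

// Per-frame linear allocator: no frees, just a rewind at frame start.
class RasterPool
{
public:
	RasterPool( size_t bytes )
		: buffer( new uint8_t[ bytes ] )
		, capacity( bytes )
		, used( 0 )
	{
	}

	~RasterPool() { delete[] buffer; }

	void Reset() { used = 0; }

	void* Allocate( size_t bytes )
	{
		if( used + bytes > capacity ) return nullptr; // pool exhausted
		void* result = buffer + used;
		used += bytes;
		return result;
	}

private:
	uint8_t* buffer;
	size_t capacity;
	size_t used;
};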

 

And as you can see, 7 milliseconds fell off my profiling.

 

There are a few other things worth pointing out.

 

My visplane structures for one thread were clocking in at over 100 megabytes - I've needed to continually increase the number of visplanes thanks to these limit-removing maps I'm testing. I allocate for 8 threads by default, so that's a good chunk of memory spent just on visplanes. The new code, on that Planisphere 2 scene, reserves (checks notes) just over 2 megabytes per frame. That's down from 800+ megabytes to support the multithreaded renderer.

 

No visplane overflows means I've been able to look at that scene when rendering on a single thread for the first time without crashing.

[screenshot]

 

I broke something while I was doing all this. Check out the top of these walls for example:

[screenshot]

 

Once I work out what I did wrong there, the next target in my sights is another one of those vanilla limitations that keeps screen-width arrays.

 

[screenshot]

 

I also had a look at how Planisphere 2 performs in other ports. Let's just say that, comparing both software and hardware renderers, Rum and Raisin Doom is going to be the only port that can keep 60FPS on this map... on my 2.6GHz Skylake i7, that is. I haven't tried this on my Raspberry Pi 4 yet. But needless to say, at this point the renderer is at a stage where I should be able to keep 60FPS on a Nintendo Switch at 1080p for all maps currently released on the Unity port.

7 minutes ago, GooberMan said:

Rum and Raisin Doom is going to be the only port that can keep 60FPS on this map... on my 2.6GHz Skylake i7

 

There's a lot of ground covered in this thread, so forgive me if you've covered this already, but I'm curious - this means GL renderers as well?  If so, what is the bottleneck on GL ports?  Is there really too much geometry to batch and send to the GPU at once?


Hey, wanted to step in beyond the likes to say I'm really enjoying this dev diary. It's interesting reading the thinking and process behind getting something like this to work. Already had a run around in Hellbound Map29 with R&R and I'll be sure to give Planisphere 2 my first ever go in it once it's ready!

20 minutes ago, AlexMax said:

this means GL renderers as well?


R&R currently runs faster than the GL renderers I tried, yes. I'd need to get frame interpolation in, and implement a better way for worker threads to sleep after a frame, before I can illustrate this fully. I'm also only testing with 4 threads - the amount I'd run on a Pi/Switch - and the load balancer keeps pushing work disproportionately to the final thread when more than 4 threads are running. But tl;dr: yes.

I would need to look at the code of the ports in question to get a clearer idea of what the bottlenecks are. I also have a solid view of how I'd implement a traditional hardware renderer myself, although my current thinking on the matter is somewhat less traditional.


Loaded up the ol' Pi 4 and ran the latest build on it.

 

[screenshot]

 

That's with a pixel count slightly higher than 1080p. The IWADs should all be able to run at 60; I just need to focus on some usability issues and do that "wake threads" thing I was talking about.

 

The real question though is "What about Planisphere???"

[screenshot]

 

Not great. Not terrible. But actually playable, in a "this reminds me of playing Doom on my 486SX 33 back in the day" way. Another way to look at it: that's "GTA4 on consoles" performance territory.

 

I dropped down to a 1706x800 renderbuffer and gave it a bit of a play. Not bad. Some scenes absolutely murder the Pi though:

[screenshot]

[screenshot]

[screenshot]


The fact it's even running alright on a Pi 4 is still impressive as hell, no matter how you slice it.

 

Granted, given today's generation, anything sub-20 FPS would probably feel unacceptable to them.

On 6/11/2022 at 2:27 AM, GooberMan said:

I am, of course, also targeting ARM processors with this port. Which meant running the benchmark on my Raspberry Pi 4. It's running 64-bit Ubuntu 22, which makes these results surprising.


int32 100,000,000 muls: 463821us
int64 100,000,000 muls: 662320us
float 100,000,000 muls: 581962us
double 100,000,000 muls: 580035us

32- and 64-bit float performance is equivalent, which is actually really good to know. But 64-bit integer performance worse than float? Okay. That's weird. I don't know why that would be the case yet.

 

I got curious and decided to see how much of a performance difference it made. Change some defines in m_fixed.h so that rend_fixed_t is just an alias for fixed_t (itself an alias for int32_t), and:

 

[screenshot]

 

About 10 milliseconds saved on that scene... at the expense of reintroducing rendering bugs that have no good solution at 32 bits, short of using 32-bit floats.
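The define switch in question is conceptually something like this (my guess at the shape of it, not the actual m_fixed.h):

// Flip one define and the render-precision type collapses back to
// vanilla 16.16. Names are approximations.
#if RENDER_USE_64BIT_FIXED
typedef int64_t rend_fixed_t;	// 44.20 fixed point
#define RENDFRACBITS 20
#else
typedef fixed_t rend_fixed_t;	// plain 16.16, i.e. int32_t
#define RENDFRACBITS 16
#endif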

 

Still, it does confirm to me that there's value in my intended approach: providing a renderer at 32-bit precision by default, and only using the 64-bit precision renderer when -removelimits is running. I'm now 100% curious as to what some of the worst-performing 100% vanilla maps are. One of my stated goals here is to get Doom's renderer running at 1080p on a Switch, and since I can't exactly do Switch homebrew the Raspberry Pi is the next best thing.

 

I suppose the obvious test WADs will be whatever's been released on the Unity port, since those currently do run on a Switch.

 

EDIT: Well, how about NUTS.WAD?

[screenshot]

Yeah, I definitely need the 44.20 renderer for that.

[screenshot]



This one's just for fun - I've used fraggle's text screen library to render the traditional vanilla loading screen messages.

[video]

14 hours ago, GooberMan said:

One of my stated goals here is to get Doom's renderer running at 1080p on a Switch, and since I can't exactly do Switch homebrew the Raspberry Pi is the next best thing.

I'm not sure what this means just from reading it, so I'll just say this: homebrew on a Switch is possible, but only certain models of it can be easily jailbroken.

 

https://ismyswitchpatched.com/

 

Basically, any Switch from about the first year or so of production is guaranteed to be jailbreakable; the Switch Lites are pretty much guaranteed not to be, outside of soldering in expensive modchips.

 

The rest are hit-or-miss, but the community has a generally good idea (based on serial numbers) of which ones are or aren't.

 

Of course, if you're saying that it's not possible because you personally lack one, that's a whole different can of worms... but I'd think some in the community might be able to help out with that.


There's nothing technically standing in the way of doing Switch homebrew myself.

 

Professionally, however, is a different matter. I'm a professional in the video game industry, and that means I need to play by those rules.

 

Having said that, it's unlikely I'll ever actually use a Switch devkit myself. But still.

4 hours ago, GooberMan said:

There's nothing technically standing in the way of doing Switch homebrew myself.

 

Professionally, however, is a different matter. I'm a professional in the video game industry, and that means I need to play by those rules.

 

Having said that, it's unlikely I'll ever actually use a Switch devkit myself. But still.

Well yeah, obviously you can't use any of the official stuff to do homebrew. But the homebrew community has baked up its own solutions.

 

Still, yes, it's true that it being linked back to you could be a bad thing, I guess.


Getting dangerously close to being a real port now...

[screenshot]

 

I'll do a 0.2.0 release once I've done a bit more work on the limit-removing side.

I was incidentally pointed towards Vanilla Sky on Discord the other day. It needs a limit-removing port. And, yeah, Rum and Raisin doesn't break a sweat playing it. It's the complete opposite of Planisphere in that regard - it's a big city map where the entire map isn't visible half the time. What it does do, however, is mix flats and textures. Hacking something together wouldn't be a huge task: both are already stored as full textures in memory (although I will need to transpose flats and update code to match), and it's easy enough to use the index values to indicate whether to look up a flat or a texture (see the sketch below). But as I'm converting the code to C++, I can do it better. So I'll do that, then call it a build.
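The hack version of that index trick would look something like this (a sketch with invented names - steal a spare bit of the surface index to mark which table to resolve against):

#include <cstdint>

const int32_t SURFACE_IS_FLAT = 0x40000000;

int32_t EncodeTextureIndex( int32_t texnum )	{ return texnum; }
int32_t EncodeFlatIndex( int32_t flatnum )		{ return flatnum | SURFACE_IS_FLAT; }

// Stand-ins for however the composited images are actually stored.
struct texture_t;
extern texture_t** texturelookup;
extern texture_t** flatlookup;

texture_t* ResolveSurface( int32_t index )
{
	if( index & SURFACE_IS_FLAT )
	{
		return flatlookup[ index & ~SURFACE_IS_FLAT ];
	}
	return texturelookup[ index ];
}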


https://github.com/GooberMan/rum-and-raisin-doom/releases/tag/rum-and-raisin-doom-0.2.0

 

Release is out.

 

Preliminary support is in for using flats and textures on any surface.

[screenshot]

Which means Vanilla Sky renders as intended.

[screenshot]

 

But it's not perfect.

[screenshot]


It has rendering routines based on powers of two. And MSVC absolutely cannot deal with the template shenanigans I'm doing - it takes half an hour to compile that file now, topkek. Clang just does not give a fuck, even when compiling on my Raspberry Pi.

 

Still got some work to do though: Vanilla Sky isn't exactly playable yet thanks to bad blockmap accesses. Anyway, this 0.2.0 release is the "break my port with some maps" release.

On 7/11/2022 at 2:25 PM, GooberMan said:

Well there's your problem right there. The modulo operator. Or, in English, the remainder of a divide operation. And it's happening on Every. Single. Sample. Welp. We can do better here.

Fun fact: starting with Cannon Lake, Intel reduced 64-bit integer division from 97 cycles to just 18. (Zen is 45 cycles; not sure off the top of my head if Zen 2/3 improve there.) Of course, since you appear to only need to account for a single overflow, this method is probably still faster even on the latest processors.

