
Rum and Raisin Doom - Haha software render go BRRRRR! (0.3.1-pre.8 bugfix release - I'm back! FOV sliders ahoy! STILL works with KDiKDiZD!)


6 hours ago, Blzut3 said:

Of course since you appear to only need to account for 1x overflow this method is probably still faster even on the latest processors.

 

And also the ARM used in the Raspberry Pi. But I think I'm going to do a deep dive on how to handle division anyway. You can turn on faster division at compile time on ARM; and there are also things like libdivide. I don't think it'll be a massive win at this point, but it'll still shave a bit of time off.

 

My next focus on ARM though is working out just what in the heck is going on with thread time consistency. Only the final thread performs in a consistent manner; every other thread fluctuates wildly in execution time. Based on screenshots, I can rule out weirdness in the load balancing algorithm too. With load balancing:

 

[screenshot: per-thread timing graphs with load balancing]

 

And with no load balancing:

 

[screenshot: per-thread timing graphs without load balancing]

 

Getting those to level out and not fluctuate should let the load balancer work better, and bring the total frame time down.


There are some division operations in the source code that can be optimized as reciprocal multiplications. I've implemented them in the FastDoom port and while they work fine, they don't speed the game up much. Regarding libdivide, it can also be used, but in my case it was detrimental, maybe because OpenWatcom isn't as good as GCC.
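For anyone curious what that looks like, the trick is just precomputing a fixed-point reciprocal once and multiplying by it instead of dividing every time. A minimal sketch in Doom-style 16.16 fixed point (hypothetical helper names, not FastDoom's actual code, and it ignores rounding and overflow corner cases):

```c
#include <stdint.h>

typedef int32_t fixed_t;
#define FRACBITS 16
#define FRACUNIT ( 1 << FRACBITS )

// Precompute 1/d once per denominator. d must not be 0, and denominators below
// FRACUNIT (i.e. values < 1.0) overflow the 16.16 result - a real version has
// to account for that.
static fixed_t FixedReciprocal( fixed_t d )
{
    return (fixed_t)( ( (int64_t)FRACUNIT << FRACBITS ) / d );
}

// x / d becomes x * (1/d): one 64-bit multiply and a shift instead of a divide.
static fixed_t FixedMulReciprocal( fixed_t x, fixed_t recip )
{
    return (fixed_t)( ( (int64_t)x * (int64_t)recip ) >> FRACBITS );
}
```

The win depends entirely on how often the same denominator gets reused; if you only divide by a given value once, you've just moved the divide into the setup.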


Latest release: 0.2.1

https://github.com/GooberMan/rum-and-raisin-doom/releases/tag/rum-and-raisin-doom-0.2.1

 

Still the same deal as the last release: it's semi-supported. I want limit-removing maps that break this port, so I can work out why and tighten it up.

This release has some null pointer bug fixes, and fixes for some oddities I encountered when trying to -merge Alien Vendetta instead of -file. The big one y'all will be interested in though: I decided it was well past time I implemented frame interpolation. Now it hits whatever your video card can handle. As it's borderless fullscreen on Windows, it'll be limited to your desktop refresh rate.


Been going through the column rendering routines to get speed back on UI/sprite/etc elements. You know what that means:

 

It's glitchcore time!
 

 

Also I guess this shows off frame interpolation and all that. Some issues with SDL being unable to detect the highest refresh rate a duplicated display is running at mean I can't get 120FPS footage just yet. But it'll come.


Maybe it's time for a Planisphere update too. Because I've been seeing threads drop below 9ms lately.

[screenshot: Planisphere render thread timings]

 

But maybe more impressively: The Given. At 2560x1200, the original screenshots showed >40ms per render thread back on July 6.

[screenshot: The Given at 2560x1200]

 

That's basically playable in software now. Even more so if you drop it to a Crispy-style resolution.

[screenshot: The Given at a Crispy-style resolution]

 

Still to do: fixing the load balancing code so it doesn't pile work on to the last thread when thread count > 4.

 

But I'm chasing something else right now: vissprites and masked textures. I decided to open up Comatose yesterday (runs, but seems to require some Boom line types so you can't leave the first room without noclipping). It's something of a dog on software renderers.

 

[screenshot: Comatose render timings]

 

Disproportionately on sprite draws. Running -skill 0 shows very reasonable render times. So I wanted to know what was going on. Threw some more profile markers in to see where the time was going.

 

I'm seeing two problems here:

 

1) Sprite clipping is awful; it does a ton of work just to render nothing.

[screenshot: profiler output]

 

2) Sprite clipping is awful; it does a ton of work, and when it does draw stuff the rendering routines aren't ideal, but they aren't really the performance bottleneck here.

[screenshot: profiler output]

 

So I'm currently grokking how sprite clipping works. I already have ideas on what I want to do to it, but I need to understand a few more bits of the code before I can dive in and do what I want with it.

49 minutes ago, GooberMan said:

But maybe more impressively: The Given. At 2560x1200, the original screenshots showed >40ms per render thread back on July 6.



I wanted to quantify how good that was, so I compared the same map, more or less the same spot, on my preferred engine: GZDoom. I got 39-40ms @ 1440p on my AMD 3600X.

So good job!

15 hours ago, GooberMan said:

So I'm currently grokking how sprite clipping works. I already have ideas on what I want to do to it, but I need to understand a few more bits of the code before I can dive in and do what I want with it.

I normally find all this stuff impressive, and I've liked a lot of these posts just because I'm amazed by how you can make this software (SOFTWARE!) renderer go so slickly fast...

 

But now you throw out a word like grokking? Oh you magnificent bastard.

 

Enjoy my rarely-awarded invulnerability sphere!

Edited by Dark Pulse : Awared?


I've been testing the port and it performs incredibly even on slow computers!! But I think I've found a bug in the 0.2.1 release: the keyboard movement is weird. Pressing the left or right keys causes the screen to move in a weird way; it basically "jumps" and is not smooth at all. But if you use the mouse to move, the movement is perfectly fine. Maybe I have something wrong in the options.

 

 

3 hours ago, viti95 said:

But I think I've found a bug in the 0.2.1 release: the keyboard movement is weird. Pressing the left or right keys causes the screen to move in a weird way; it basically "jumps" and is not smooth at all. But if you use the mouse to move, the movement is perfectly fine. Maybe I have something wrong in the options.

 

This is intentionally broken until I finish player interpolation.

 

So here's how it currently works.

 

For all mobjs, the previous and current positions are interpolated according to where the display frame sits between simulation frames. This does technically mean that you are guaranteed to see past data, except for one out of every <refresh rate> frames where a tic lines up exactly with a second. The one exception to this rule is the player: the angle is not interpolated at all. Instead, the most recent angle is used, and for each display frame we peek ahead into the input command queue and add any mouse rotation found.
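As a rough illustration of the scheme described above, here's a minimal sketch in C. The names are made up for the example and this isn't the port's actual code; FixedMul mirrors the usual Doom 16.16 fixed-point multiply.

```c
#include <stdint.h>

typedef int32_t fixed_t;
#define FRACBITS 16
#define FRACUNIT ( 1 << FRACBITS )

static fixed_t FixedMul( fixed_t a, fixed_t b )
{
    return (fixed_t)( ( (int64_t)a * (int64_t)b ) >> FRACBITS );
}

// Blend between the previous and current simulation-tic values.
// 'fraction' is how far the display frame sits between the two tics (0..FRACUNIT).
static fixed_t LerpFixed( fixed_t from, fixed_t to, fixed_t fraction )
{
    return from + FixedMul( to - from, fraction );
}

typedef struct
{
    fixed_t prev_x, prev_y, prev_z;   // where the mobj was last tic
    fixed_t curr_x, curr_y, curr_z;   // where the mobj is this tic
} mobj_lerp_t;

static void InterpolateMobj( const mobj_lerp_t* m, fixed_t fraction,
                             fixed_t* out_x, fixed_t* out_y, fixed_t* out_z )
{
    *out_x = LerpFixed( m->prev_x, m->curr_x, fraction );
    *out_y = LerpFixed( m->prev_y, m->curr_y, fraction );
    *out_z = LerpFixed( m->prev_z, m->curr_z, fraction );
    // The player's view angle would skip this entirely: take the latest angle
    // and add any pending mouse rotation peeked from the input command queue.
}
```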

 

This was the quick "it works" method that let me get everything up and running.

 

Here's how it should work.

 

Exactly as above, except do for the display player's movement exactly what we do for mouse look.

 

The full solution will require properly decoupling the player view from the simulation. At this point, it will be the decoupler's responsibility to create correct position and rotation for the renderer instead of the renderer doing the job. There's some more setup I need to do for that to work correctly, but it will also cover all keyboard inputs once working.

 

It may not seem like much to anyone else, but I got really sensitive to input lag when implementing 144Hz support in Quantum Break. In fact, I tested a few other ports to see how they feel too. prBoom has the worst feel by far to me, with every other port I tried feeling about the same. My intention is to ensure there's basically zero effective lag between when input is read and when it is displayed (ie read on the same frame you render) and thus remove input error considerations from skilled play.

 

There's also one thing I've realised after playing Doom at 35Hz for so long: it cannot be overstated how much of an effect frame interpolation has had on raising the Doom skill ceiling.

Edited by GooberMan


Don't know if this counts as a bug, but I wanted to test it on my older Core 2 Duo laptop and found that Rum and Raisin requires an OpenGL 3.0 graphics card. No love for older video cards LOL.

 

[screenshot: OpenGL 3.0 requirement error on an Intel 945GM]


It initialises GL 3 or higher, yes. It's more for planned features than anything it requires right now.

 

Buuuuuut having said that, I have stated that this is about getting the renderer to run efficiently on modern systems. So yeah, Core 2 Duo, that predates the first i7. What I've done so far should theoretically work just fine on that line of processors, but I'd definitely want to look at the threading performance with how I've got things set up thanks to the way-less-sophisticated cache. And it's an Intel integrated GPU there so it would certainly not have the capability for what I have in mind.


i am a dogshit coder but i really enjoy the math and science behind what you are doing here, thank you for the detailed explanations as they have triggered some inspiration in me in my day job to make Not Shitty Algorithms That Don't Suck as opposed to the usual Fuck It, Numerically Grind A Solution Whatever rut i had been in for some time

On 8/16/2022 at 8:15 PM, GooberMan said:

It initialises GL 3 or higher, yes. It's more for planned features than anything it requires right now.

Do those planned features include running the software renderer on GPU cores?  I've always wondered how well that would actually work.

16 hours ago, Blzut3 said:

Do those planned features include running the software renderer on GPU cores?  I've always wondered how well that would actually work.

 

I won't be attempting this until I've simplified the renderer some more. As mentioned above, sprite clipping is next in my sights. The flat rasteriser I wrote will also be the only rendering routine when it moves over to the GPU; no more wall rendering code. That will also mean simplifying how the wall rendering is set up: most of the time right now is spent outside of the lowest-level rendering loop just setting up things like wall offsets etc.

But needless to say: I don't think the original renderer would be worth porting over to compute code. I know the kind of code that gets you all those effects you see in Returnal: the simpler your code, the better it is on the GPU. And, as is on display here, the better it is on the CPU.

tl;dr - I've still got work to do.

On 8/17/2022 at 1:15 AM, GooberMan said:

It initialises GL 3 or higher, yes. It's more for planned features than anything it requires right now.

 

Buuuuuut having said that, I have stated that this is about getting the renderer to run efficiently on modern systems. So yeah, Core 2 Duo, that predates the first i7. What I've done so far should theoretically work just fine on that line of processors, but I'd definitely want to look at the threading performance with how I've got things set up thanks to the way-less-sophisticated cache. And it's an Intel integrated GPU there so it would certainly not have the capability for what I have in mind.

Very interesting project! For the record Doom 2 maps are running at a stable 60 FPS at 1080p on my 3.16 GHz Core 2 Duo machine with an OpenGL 3.3 capable video card.

 

Also meant to add that the dashboard is a very nice feature and that the 'Valve Classic' skin hit me with some serious nostalgia.

Edited by vadrig4r


https://github.com/GooberMan/rum-and-raisin-doom/releases/tag/rum-and-raisin-doom-0.2.2

 

So. Short story, I went to London late July/early August and returned with the rona in tow. That took me out for a few weeks, and then I couldn't get back into the swing of things. And then something happened today.

 

[screenshot: new mod release announcement on the Unity port]

 

A new mod release on the Unity port. I knew that all mapsets/mods had been converted to actually be IWADs instead of PWADs, so I decided to get the IWAD version running in Rum and Raisin. First things first is to get the IWAD, which you'll find at %UserProfile%\Saved Games\id Software\DOOM 2\WADs\17 on your Windows machine. Take the file called 17 from inside that folder, copy and rename it to whateveryouwant.wad, and then load it up with the -iwad command line parameter.

 

There were a few things that I had to fix to get it up and running. Importantly: none of this works without the -removelimits command line parameter.

  • IWADs with a DEHACKED lump now autoload it
  • Widescreen asset support was added. Tested with Doom 1 (shareware, registered and ultimate), Doom 2, TNT, and Plutonia; and with both Unity's 16:9 assets and Nash's 21:9 assets. Also tested in 4:3, 16:9, and 21:9 resolutions

That was pretty much all I had to do to get things up and running. I assume other ports should handle the IWAD just fine, but so much of this codebase is still Chocolate.

 

On the subject of widescreen assets: there is now an optional download, the WidePix Pack, linked in the release. These are just WAD files that I look for when -removelimits is specified. If the files are there, they're loaded. If they're not, then you continue on your merry way. I basically dumped Nash's lump files from GitHub into separate WADs. Literally didn't need to do any more than that; they're all set up with correct offset markers. Great stuff. I decided to keep them separate for the time being, but it's easy enough to use them.

 

Note, however, a couple of IWAD gotchas with this:

  • If you're going to use the Unity doom.wad and doom2.wad (found in the DOOM_Data\StreamingAssets or DOOM II_Data\StreamingAssets subfolders), you need to use -removelimits since widescreen support is a limit-removing feature.
  • If you want to only use those Unity widescreen assets, you either need to not have the .widetex files alongside Rum and Raisin Doom or use the -nowidetex command line parameter.
  • If you're using an IWAD other than the id Software IWADs (say, for example, Harmony) then you will need to use -nowidetex since I haven't worked out a good way to automate detection of non-id IWADs.

Anyway.

 

I also did some work before I got the rona to remove hardcoded limits from the renderer entirely. As such, you can now select from some 4K backbuffer resolutions.

 

Right now, if you turn on the render graphs with 4K backbuffers running you will notice something weird...

 

[screenshot: render graphs with a 4K backbuffer]

 

So it's clearly rendering in enough time to hit 120 FPS, but it's only hitting 45 FPS (and unstable at that). What gives?

 

The reason is pretty simple: The Chocolate Doom code that uploads the backbuffer to the GPU does so by converting to 32-bit on the CPU before uploading. So not only do you have that cost, but you have the cost of the PCIe bus to deal with. 4K backbuffer, multiplied by 32 bits, multiplied by 120 times a second, equals LOLNO not enough physical bandwidth. Perfectly fine for Vanilla sized buffers, but not for 4K.
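To put a rough number on that (assuming 3840x2160 for the 4K backbuffer):

```c
#include <stdio.h>

int main( void )
{
    // Back-of-the-envelope numbers for pushing a CPU-converted 32-bit 4K
    // backbuffer over the bus every display frame.
    const double width = 3840.0, height = 2160.0;
    const double bytes_per_pixel = 4.0;       // the 32-bit output of the conversion
    const double frames_per_second = 120.0;

    double bytes_per_frame  = width * height * bytes_per_pixel;      // ~33.2 MB
    double bytes_per_second = bytes_per_frame * frames_per_second;   // ~3.98 GB/s

    printf( "%.1f MB per frame, %.2f GB/s at 120 FPS\n",
            bytes_per_frame / 1.0e6, bytes_per_second / 1.0e9 );

    // ~4 GB/s is already the entire nominal throughput of a PCIe 1.x x16 link,
    // before counting the CPU-side conversion or anything else on the machine.
    return 0;
}
```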

 

I'm still in the middle of refactoring things to upload the palette and the backbuffer without modification to the GPU so that I can run a shader and decompress to full colour. The bandwidth requirements then become realistic, so 4K at 120FPS will be quite reasonable. But I'll save that for a next minor version release instead of a revision release.
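For the curious, here's the upload half of that idea sketched with plain OpenGL 3 calls. This is just the general shape of the technique, not the port's actual backend; the texture objects, filtering, and the fullscreen-quad shader are assumed to be set up elsewhere:

```c
#include <stdint.h>
#include <glad/glad.h>   // or whichever GL loader you use; a GL 3.0+ context is assumed

// Created elsewhere with glGenTextures and given GL_NEAREST filtering.
extern GLuint backbuffer_tex;   // GL_R8, one byte per pixel
extern GLuint palette_tex;      // GL_RGB8, 256 x 1

// Upload the 8-bit indexed backbuffer untouched, plus the 256-entry palette.
// A fullscreen-quad fragment shader then does the index -> colour lookup.
static void UploadFrame( const uint8_t* indices, int width, int height,
                         const uint8_t* palette_rgb /* 256 * 3 bytes */ )
{
    // One byte per pixel instead of four: a quarter of the PCIe traffic.
    glBindTexture( GL_TEXTURE_2D, backbuffer_tex );
    glTexImage2D( GL_TEXTURE_2D, 0, GL_R8, width, height, 0,
                  GL_RED, GL_UNSIGNED_BYTE, indices );

    // The palette is tiny, so re-uploading it every frame costs next to nothing.
    glBindTexture( GL_TEXTURE_2D, palette_tex );
    glTexImage2D( GL_TEXTURE_2D, 0, GL_RGB8, 256, 1, 0,
                  GL_RGB, GL_UNSIGNED_BYTE, palette_rgb );
}
```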

 

Still. This is the same scene with a 1080p equivalent backbuffer.

[screenshot: the same scene with a 1080p-equivalent backbuffer]

I like how the render thread times scale linearly with the output, which really is four times the pixels.

1 hour ago, GooberMan said:

I'm still in the middle of refactoring things to upload the palette and the backbuffer without modification to the GPU so that I can run a shader and decompress to full colour. The bandwidth requirements then become realistic, so 4K at 120FPS will be quite reasonable. But I'll save that for a next minor version release instead of a revision release.

Given the hardware you're targeting you'll probably avoid the gotchas but I will still warn you that you are by far not the first person to think of this.  Probably some OpenGL 3 hardware knowing those drivers, but for some reason this ends up not working for some people and I know a couple of commercial games that used this had the code reverted.

 

There's technically enough bandwidth for 4K 120FPS 32-bit on a PCIe 1.0 x16 link (so a modern 3.0 or 4.0 GPU should have plenty of headroom), and I'm fairly certain I remember doing locked 60fps 5K software rendering with ZDoom on a 2017 iMac, which also did 8->32 in software. A cursory test suggests that my Zen 1 system can push over 100fps with GZDoom at 5K. So while having the ability to offload that conversion to the GPU is probably a good option to have, you still have another bottleneck. Honestly Chocolate Doom always seemed to me to have relatively slow blits, but given the limits it never really was observable outside of timedemo.


FWIW, we successfully use the screen + 1D texture approach on the Unity version on all 18 million platforms we run on, but we're probably not supporting machines as old or screwed up as some of the other projects trying it have been trying to support (we're 64-bit only, and Unity probably has high enough requirements to knock out some of the lower end machines). It was quite a significant speed boost on all platforms, I believe.

6 minutes ago, Blzut3 said:

If I recall correctly Keen Dreams was one example of a port that did it and had problems. @Edward850 might remember the details.

Wasn't shader related; curiously, moving to SDL2's palette drawer didn't turn up any issues (though I seem to recall Icculus has compatibility modes for that?). It was something strange going on with the hand-made SSE intrinsics for the screen blitter, causing the screen to constantly blank out. In the end we just reverted it to the plain C copy buffer; the compiler does a better job anyway.

5 hours ago, Blzut3 said:

you are by far not the first person to think of this

 

Not the first time I've done it either. This method goes back to the early days of Xbox 360 development.

 

Quote

you still have another bottleneck

 

Also very likely. I'm never happy with SDL any time I look at it, so the more I move over to a hand-written backend the better. Consider this though: 4K with palette decompression on the GPU will use essentially the same bandwidth as my 1080p-equivalent buffer there that's hitting 120FPS (3840x2160 at one byte per pixel is the same byte count as 1920x1080 at four).

Edited by GooberMan


[screenshot: a 4K 8-bit backbuffer decompressed with a pixel shader on the GPU]

 

Allow me to give you an anecdote from my career, nearly 13 years ago now.

 

I was working on Game Room, a commercial emulator service for Microsoft platforms. Now, it was certainly cool getting a Commodore 64 emulator running, listening to Monty on the Run on a 360, and programming BASIC with a chatpad on the 360... but that emulator never saw release, so that's just a cool moment. What I actually want to talk about is a bug that I had to track down.

 

Microsoft's QA was adamant that we were getting choppy performance on a specific hardware configuration. We couldn't reproduce it, so we actually duplicated the hardware and software installs. Still couldn't get it. But Microsoft wouldn't waive the bug. End result, the EP put me on the problem. I really had no idea where to start on this one. Can't reproduce it on the exact hardware and software configuration? Could be anything. Failing video card? Nah, they've already replaced it with an identical one. The hardware was entirely ruled out, so it had to be something with the software. Drivers? Identical.

 

So I did what every average programmer does, and started googling about it. I really can't imagine googling for stuff these days, there's just SO. MUCH. JUNK. at the top of search results. Back then, it took me an hour of crawling the web to finally come across something that made me go "No, it couldn't be."

I went over to the test machine. Opened up the power configuration settings. Put it on low power mode. Loaded up Game Room. Bam. Finally, we reproduced the bug.

 

So I updated the bug report saying that the only way we could get this bug to happen was by forcing our identical hardware and software machine into low power mode. And I heard back from the EP the next day that Microsoft had decided to quietly waive the bug.

 

Now, this little anecdote is a lead-in to the next point. Because my main development machine is a laptop: the Skylake i7 processor I keep referring to. Good CPU for the time... but afflicted with a GeForce 960M GPU. This was the year before they started putting desktop GPUs into laptops, and the performance of the 960M in benchmarks was about 10% lower than the original Xbox One's GPU. But never mind that. It's a laptop. Which means the primary workhorse GPU is the Intel integrated GPU, and the GeForce only gets used by request. Which means that its normal state is about this:

 

[screenshot: the GPU reporting PCIe 3.0 x16 capability but currently running at PCIe 1.1]

 

Now, that bit I've highlighted will be a bit confusing unless you know how to read it. It's basically saying that it is a PCIe 3.0 capable card with 16 lanes, but is currently running at PCIe 1.1 capabilities. That's power settings for you. Not only will this be incredibly common, but it's good for example purposes here. PCIe 1.x at 16 lanes does give you bandwidth of 4 gigabytes per second. Which means that you could hit 4K 120FPS with 32-bit software rendering... if the system is doing literally nothing else but blindly pushing data over the PCIe bus. Which means ring 0 mode with the CPU doing nothing while DMA does its thing. Have your normal Windows session running? Not a chance. Not while there are other things sharing the PCIe bus (sound cards, M.2 NVMe drives are becoming more common, etc) and you're multitasking things like Discord and web browsers and what have you. Theoretical maximums like that are basically as useful as saying that a small and cheap hatchback car right off the factory line can easily hit 300km/h if you drop it from a plane and it approaches terminal velocity. It requires a very special set of circumstances to hit that speed.

 

Did you know that Nvidia GPUs only need 8 PCIe lanes? Or that AMD cards only need 4? If you get one running at full 16 then excellent great job, but there's a chunk of people out there that will have lower bandwidth by default because of this. So even if they have a PCIe 3 card. Power settings can ruin the day. The number of lanes in use can ruin the day. PCIe 3 16x has a maximum bandwidth of 16 gigabytes a second. tl;dr is that PCIe 3 4x (ie minimum AMD) is as good as PCIe 1 16x. Not good enough to handle 4K 32-bit 120FPS data transfers.

 

But let's specify exactly why I said the bandwidth isn't enough, because the theoretical maximum bandwidth per second does actually matter. Here's the thing: if we're playing Doom, we're running a simulation. And then we generate a visual representation of that so that the user can see and interact with the simulation. If we want that to hit 120FPS, we need to be doing all that in an 8.333 millisecond slice. A rule of thumb in the industry is to aim a millisecond under the target - performance spikes happen, so give it some wiggle room.

 

In the case of Doom here, we can see that 4-thread rendering on a Doom 2 map will take over 4 milliseconds. Let's call it 4.5 milliseconds. That's already over half of our timeslice. Add in the simulation and everything else that goes on in a simulation frame, and we're really getting towards breaking that 7.333 millisecond barrier. Now, there are tricks like entirely kicking rendering off the main thread to get a good chunk of time back (at the expense of introducing input lag). But realistically I want my frame uploaded and displayed in 1 millisecond at most once I'm done rendering.

 

And this is where faster PCIe speeds come in handy. How long does it take to upload a 4K 32-bit buffer? Well, for PCIe 1 16x/PCIe 3 4x, that's 1/120th of a second. 8.333 milliseconds. Yeah, that's never going to be good enough. PCIe 3 16x? 1/480th of a second. About 2.1 milliseconds. Still uncomfortable. And it certainly wouldn't fly with the more complicated maps I've been testing with even if I kicked rendering off the main thread. PCIe 4 16x is the first spec that hits my goal at a hardware level. So just like everything else with Rum and Raisin Doom - throw money at the problem and it's solved. Render with more threads. Buy a higher speed bus. The world is your oyster if you can afford oysters for dinner every night.
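Spelling that arithmetic out, with the same 3840x2160 assumption and the nominal x16 link rates:

```c
#include <stdio.h>

// Per-frame upload time for a 32-bit 4K backbuffer over various nominal PCIe
// link rates, compared against the ~1 ms post-render budget mentioned above.
int main( void )
{
    const double frame_bytes = 3840.0 * 2160.0 * 4.0;   // ~33.2 MB

    const struct { const char* link; double gb_per_sec; } links[] = {
        { "PCIe 1.x x16 (or 3.0 x4)",  4.0 },
        { "PCIe 3.0 x16",             16.0 },
        { "PCIe 4.0 x16",             32.0 },
    };

    for ( int i = 0; i < 3; ++i )
    {
        double ms = frame_bytes / ( links[ i ].gb_per_sec * 1.0e9 ) * 1000.0;
        printf( "%-26s %.2f ms per 4K 32-bit frame\n", links[ i ].link, ms );
    }
    // Prints roughly 8.29 ms, 2.07 ms, and 1.04 ms respectively: only the
    // PCIe 4.0 x16 case gets close to the 1 ms post-render budget.
    return 0;
}
```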

 

I'm not looking to throw money at the problem though.

 

I've mentioned that my primary GPU on my laptop here is the Intel integrated one. The screenshot at the top of the post is using a 4K 8-bit backbuffer decompressed with a pixel shader on the GPU. As you can see there, it's knocked a huge amount of time off the previous 4K shot: 8 to 10 milliseconds on average. I didn't upload a shot of the 960M running that scene with 4K 32-bit backbuffers because it actually runs slower than the Intel integrated. This will likely be very common on laptops where the discrete GPU is not in charge of the show. It's the Intel chip that's responsible for displaying the end product. So to use the GeForce, we need to upload the buffer to the GPU to create the final presentation, and then the Intel needs to get it back to display on the monitor/over HDMI. Not a good setup at all.

 

As it turns out though, reducing that bandwidth to 4K 8-bit is enough to finally push performance in favor of the GeForce.

 

[screenshot: the 4K 8-bit path running on the GeForce]

 

It's still an unstable framerate. As I've discussed, I'm not hitting my targets in other areas just yet. And the Intel GPU running the show for the system means there's going to be a second bottleneck there that I can't control. But hey. Anyone on a desktop with a comparable CPU and a similarly crap GPU will get better performance than me as long as it's running in PCIe 3 16x mode.

 

Tomorrow, I'll get the latest results running on the Raspberry Pi. Speaking of low-powered low-bandwidth systems.


I can provide some testing with a low bandwidth PCIe setup; right now I have a GTX 960 on a PCIe 2.0 x4 bus driving a 4K 65 inch display with HDR 😂


My laptop is also driving a 4K 65 inch HDR display. HDR is off though; pretty sure this hardware doesn't support it. It does support 120Hz output, but only at 1080p.

 

But yeah, I'll have to work out how to get good performance metrics for dropping to kernel mode out of this thing without telling everyone interested in testing to run something like Superluminal and profile every session. Not only do you have bandwidth to worry about, but the OS is going to try to switch your active threads to idle every 4 milliseconds on Windows (pretty sure it's similar on Linux). So if we can line up the scheduling so that the point where the OS would normally try to take focus away lands while it's down in kernel mode drivers transferring over the PCIe bus, then we will get a far more stable framerate.


How about a before and after of the Pi 4 running a 1080p backbuffer?

This is the SDL renderpath:

[screenshot: Pi 4 thread timings on the SDL renderpath]

 

Aaaaaand the GL renderpath:

[screenshot: Pi 4 thread timings on the GL renderpath]

 

It's a rock solid 60 FPS with Doom 2. Of course, you need to run it fullscreen. The windowed compositing on Ubuntu at least is atrocious. But there's more than enough there to show that a Switch can handle the IWADs at 60 FPS with Rum and Raisin Doom.

 

There is one oddity though.

 

[screenshot: black rendering as not-black]

 

Black rendering as not-black. I've seen this somewhere before, and on the titlescreen too. I just cannot remember where. Will have to have a think about what might be causing it. You can check the GitHub log for things the Pi didn't like. This will just be another one of those things.

 

EDIT: And yes. Confirmed to be a precision issue; depending on how the hardware feels, it's overflowing to the next palette entry. TITLEPIC actually uses the blue colour ramp 240-247, so the black 247 is coming out as 248 when it tries to sample from the 8-bit texture. There are various ways to deal with it, and I'll settle on one of them in the near future.

EDIT2: Yeah, so depending on the GLES version, your highp floats could be half floats instead. As in, 16-bit floats. Which leads to a complete precision breakdown when you're trying to get palette entry 247 and higher.

 

So we solve this in two steps:

  • Since 0-255 is being mapped into 0-1, we bring that mapping back to where we want it by dividing by 256 and multiplying by 255 (represented by the 0.99609375 constant).
  • Since that breaks precision elsewhere, we nudge the lookup by 1/1024.


Bit dodgy, but you've just gotta think like a half-precision floating point to get to something precise enough for your needs. Either way, everything renders perfectly on the Pi now.
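For anyone following along, here's the coordinate math from those two steps as a plain C sketch. The function name is hypothetical; the real thing is a couple of lines in the GLSL shader:

```c
// Sketch of the palette-lookup coordinate fix described above. An 8-bit index
// sampled from a normalised texture comes back as index/255; the palette is a
// 256-texel-wide texture, so the ideal sample point is (index + 0.5)/256.
static float PaletteCoordForIndex( int index )
{
    float sampled = (float)index / 255.0f;     /* what texture() hands the shader   */
    float coord   = sampled * 0.99609375f;     /* 255/256: remap back to index/256  */
    return coord + ( 1.0f / 1024.0f );         /* quarter-texel nudge, small enough */
                                               /* that fp16 error can't spill over  */
                                               /* into the next palette entry       */
}
```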

Edited by GooberMan


Just being a pedant and none of this changes your point:

On 10/19/2022 at 9:32 PM, GooberMan said:

Not while there's other things sharing the PCIe bus (sound cards, M.2 NVMe drives are becoming more common, etc)

Unless we're talking a system with a traditional north/south bridge architecture (pre 1156/FM1/AM4) the GPU is usually on its own dedicated bus so those other devices should have essentially no effect on your GPU bandwidth.  This is of course assuming that there's not a limit coming from elsewhere (i.e. literally not enough CPU time to facilitate the transfer).  You could hang a GPU off the chipset (glorified PCIe switch these days) and this could be true, but this is unusual for a machine that's running games.

 

And on that note...

On 10/19/2022 at 9:32 PM, GooberMan said:

Did you know that Nvidia GPUs only need 8 PCIe lanes? Or that AMD cards only need 4?

The numbers you quote are the artificial limits (granted, they're there for a practical reason) put in place for SLI and Crossfire. There's nothing stopping you from running a GPU from either vendor on an x1 link, as most often demonstrated by mining rigs. The more sane reason for a person to do this is that, with the right adapters, it's possible to hang a GPU off an ExpressCard slot, back in the days when laptops had those (also some laptop docks expose x1 PCIe). Now we have Thunderbolt, which gives you x4 PCIe, and you can run an Nvidia GPU off that.

On 10/19/2022 at 7:57 AM, Blzut3 said:

Given the hardware you're targeting you'll probably avoid the gotchas but I will still warn you that you are by far not the first person to think of this.  Probably some OpenGL 3 hardware knowing those drivers, but for some reason this ends up not working for some people and I know a couple of commercial games that used this had the code reverted.

 

My experience with this has always been that the problems come from the compulsive need to support inferior OpenGL versions.

If you upload a palette and a paletted texture it really makes sense to use texelFetch in the shader to read the palette, which lets you address the texture in actual texels, to avoid the rounding issues that may otherwise plague you.

So in the end the question here is, what hardware to target? Is the tiny group of holdouts stuck on pre-GL 3 hardware really worth compromising stability and performance? At the very least provide two shaders then - one using GL3 features that can handle this properly and a fallback with no guarantees attached.

 

6 hours ago, GooberMan said:

There is one oddity though.

 

Black rendering as not-black. I've seen this somewhere before, and on the titlescreen too. I just cannot remember where. Will have to have a think about what might be causing it. You can check the GitHub log for things the Pi didn't like. This will just be another one of those things.

 

 

Read my stuff above. Unless you have to deal with hardware that does not support it, do not use texture() to sample from a palette texture. Use texelFetch, which has far fewer precision issues because you pass both x and y as integer texel coordinates, not as interpolated floats in the range [0..1].

 

 

4 hours ago, Graf Zahl said:

My experience with this has always been that the problems come from the compulsive need to support inferior OpenGL versions.

 

...so the Raspberry Pi's GL implementation then? It only resolves as GLES 3.0, and it's not fully compliant with the GLSL spec at that, from my immediate tests.

 

But hey, like I said in the post you quoted from, there are ways to do it without direct texel fetching. As mentioned a few posts back, these methods go back to the early days of shader usage in the console games industry. You know, back when we didn't have direct texel fetching on the GPUs. Blanket "DON'T DO THIS I KNOW BETTER" statements help precisely no one but yourself.


I don't know if it's been said before, but saves are broken in 0.2.2. The positions of things are shifted when you load the game, putting the player and other things outside the map.

