beloko

GZDoom - Trying to improve OpenGL pipelining.


Hi,

 

I'm trying to improve the pipelining of the OpenGL renderer, primarily for a mobile port (Delta Touch), to get better performance.

 

(Please note: what follows may be very obvious and completely wrong - I'm not an OpenGL expert!)

 

The problem:

On mobile it is very bad to call glFinish(); it breaks the pipeline and has a very negative effect on rendering performance, so I usually remove all glFinish() calls.

Now I am testing out GLES 3.2 - so the modern renderer - and this causes serious graphic errors. I believe this is because the buffers are unsynchronised (GL_MAP_UNSYNCHRONIZED_BIT), so they get modified while the GPU is still using them for previous frames (because glFinish() has been removed).

 

The (partial) solution:

To try and fix this I have created multiple GL buffers which get cycled on each frame so the GPU and CPU are not accessing the buffers at the same time.

This has mostly fixed the rendering corruption, and it has increased performance by about 2x on mobile and 2x on an Intel HD Graphics 4000.
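The cycling is conceptually something like this (a simplified sketch, not the actual code in the branch - the buffer count and names here are just placeholders):

```cpp
#include <GLES3/gl32.h>   // or a desktop GL loader header (GLAD/GLEW)

constexpr int kNumBuffers = 3;   // placeholder count; the real number is tunable

struct VertexBufferRing
{
    GLuint vbo[kNumBuffers] = {};
    int    current = 0;

    void Init(GLsizeiptr size)
    {
        glGenBuffers(kNumBuffers, vbo);
        for (int i = 0; i < kNumBuffers; i++)
        {
            glBindBuffer(GL_ARRAY_BUFFER, vbo[i]);
            glBufferData(GL_ARRAY_BUFFER, size, nullptr, GL_STREAM_DRAW);
        }
    }

    // Called once per frame: step to the next buffer in the ring and map it.
    // GL_MAP_UNSYNCHRONIZED_BIT skips the driver-side wait, which is only
    // safe because this buffer was last used kNumBuffers frames ago.
    void *BeginFrame(GLsizeiptr size)
    {
        current = (current + 1) % kNumBuffers;
        glBindBuffer(GL_ARRAY_BUFFER, vbo[current]);
        return glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                                GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    }

    // Unmap before issuing the draw calls that read from this buffer.
    void EndFill() { glUnmapBuffer(GL_ARRAY_BUFFER); }
};
```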

 

Sauce:

https://github.com/emileb/gzdoom/tree/g4.0.0_gl_buffered

 

The new problem is moving sectors - because there are multiple buffers, they ALL need to be updated when a sector moves; otherwise, when the sector stops, there will be old data in the buffer. I believe this can be fixed with a bit more hacking.

 

My question is: 

Is this worth pursuing and does it make any sense, or is there a fundamental reason why this may not work? If I manage to fix the moving-sector update problem, do you think there will still be other unfixable issues? Is there any other unsynchronised data which could break rendering?

 

 

If you want to try out the modification, here is the update for GZDoom 4.0.0 (32-bit). Try it on a weak GL3+ computer and see if it improves performance - it gave a 2x FPS increase on my old laptop (note: it still has the moving sector problem).

https://drive.google.com/open?id=1KII1gDLl3WbZhyLuPochmH-PB4HFRW1n

 

 

Cheers!

 

 

 

 

 


The function you need to adjust is FFlatVertexBuffer::CheckPlanes.

To decide whether to update a plane it compares the current height against an array of plane heights for the 4 planes a sector may possess. This array (vboheight) only exists once, so if you have multiple buffers you also need multiple checking arrays in that place.
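In rough terms, what I mean is something like this (just a sketch of the idea, not the real sector/buffer layout - the names and buffer count are made up):

```cpp
constexpr int kNumBuffers = 3;   // made-up count, matching however many GL buffers are cycled

struct FlatVertexPlaneState
{
    // was effectively: float vboheight[4];  (one set of plane heights per sector)
    float vboheight[kNumBuffers][4];   // one set per sector, per pipeline buffer
};

// Compare against (and update) only the entry for the buffer being filled.
void CheckPlanesSketch(FlatVertexPlaneState &state, int bufferIndex,
                       const float (&currentHeight)[4])
{
    for (int plane = 0; plane < 4; plane++)
    {
        if (state.vboheight[bufferIndex][plane] != currentHeight[plane])
        {
            // ...rewrite this plane's vertices in the active buffer here...
            state.vboheight[bufferIndex][plane] = currentHeight[plane];
        }
    }
}
```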

 

What you may also need to do is add a fence sync object to ensure that you do not start overwriting your older buffers if you get too far ahead of the GPU - which can easily happen if you no longer synchronise your CPU work with the GPU via glFinish.
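Something along these lines (a sketch only, using plain GL 3.2 / GLES 3.0 sync objects; the per-buffer fence array is an assumption):

```cpp
// Assumes a GL 3.2 / GLES 3.0+ context with sync object support.
constexpr int kNumBuffers = 3;          // assumed ring size
GLsync frameFence[kNumBuffers] = {};    // one fence per buffer slot

// After submitting the last draw call that reads from this buffer slot:
void SignalFrameDone(int bufferIndex)
{
    frameFence[bufferIndex] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

// Before mapping/overwriting that slot again, kNumBuffers frames later:
void WaitForSlot(int bufferIndex)
{
    if (frameFence[bufferIndex] != nullptr)
    {
        // Blocks only if the GPU hasn't consumed that frame yet.
        glClientWaitSync(frameFence[bufferIndex], GL_SYNC_FLUSH_COMMANDS_BIT,
                         GLuint64(1000000000));   // 1 second timeout
        glDeleteSync(frameFence[bufferIndex]);
        frameFence[bufferIndex] = nullptr;
    }
}
```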

 

 

Edited by Graf Zahl

32 minutes ago, Graf Zahl said:

The function you need to adjust is FFlatVertexBuffer::CheckPlanes.

To decide whether to update a plane it compares the current height against an array of plane heights for the 4 planes a sector may possess. This array (vboheight) only exists once, so if you have multiple buffers you also need multiple checking arrays in that place.

 

 

Yes, I see! I will fix this and update, and see if it works - thank you!


Not sure if you're already aware of this - mentioning it just to be sure. The glFinish call is also used to keep input lag at a minimum. If you remove it the throughput will be better, but it will also increase the input latency.

 

Like Graf mentions, if you remove the glFinish call you really want to replace it with fences. But where to put them is tricky. One option is to put the fence right before SwapBuffers and wait for it right after. This makes it begin the next frame while the previous is still waiting to be presented. The throughput won't be as good, but at least the input delay will be kept to less than one vsync cycle (8 to 16 ms on most monitors). Alternatively you can put the fence right after SwapBuffers and the wait just before. This will increase the input lag, but hopefully not enough to get your users to complain loudly.
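The first option looks roughly like this (just a sketch; SwapWindow is a stand-in for whatever SwapBuffers call the platform layer uses):

```cpp
// Assumes a GL 3.2 / GLES 3.0+ context with sync object support.
void SwapWindow();   // placeholder for the platform's SwapBuffers call

void PresentFrame()
{
    // Fence just before the swap: it signals once the GPU has finished
    // rendering everything submitted for this frame.
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    SwapWindow();

    // Wait right after: the CPU only starts building the next frame once the
    // GPU has caught up, so at most one frame is ever in flight.
    glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(1000000000));
    glDeleteSync(fence);
}
```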

 

I really can't recommend stacking up more than 1 extra frame, because the input lag will be horrible - despite whatever additional throughput improvement it might provide. Even one frame will already annoy people sensitive to mouse lag, such as myself.

1 hour ago, dpJudas said:

...input latency...

Is the input latency a direct result of the renderer having more than 1 buffered frame that it's waiting to show? I'm a bit confused - it seems to me there are 4 scenarios:

 

(let's assume a screen refresh rate of 70Hz - 2 frames per Doom tic)

 

1. You're rendering at least 70 fps (2 input reads per tic - no issues)

2. You're rendering > 35 fps, but less than 70 fps (sometimes 1 input read, sometimes 2 - no issues)

3. You're rendering exactly 35 fps (1 input read per tic - still ok?)

4. You're rendering slower than 35 fps (lag unavoidable)

 

Are you saying that, with the buffer swapping technique, in scenario 3, you might be blitting frame 10, building frame 11, but your current input will be applied to, say, frame 12? Is that the source of input lag you're describing?

 


That's because you're looking at this from the playsim viewpoint. In GZDoom (and I'm assuming most of the other source ports too, although not 100% sure if this applies to Chocolate Doom) the mouse movements are updated every frame, which the user sees, and then at the next tic time its current position is used. I used the more generic term input latency to include both touch and mouse input, but I'm specifically thinking about the delay from the user doing an action that pans the camera until he sees it happen. Most gamers call the effect mouse lag.

 

When GZDoom reads the mouse location each frame, this happens at exactly one specific moment in time. From there it generates the draw commands sent to the GPU. The GPU then begins to draw it, but if it doesn't display it immediately there's already a delay from the user action to visual confirmation on the monitor. Without the glFinish line the GPU driver will render it into a swap chain image and keep it there until it is time to show it. This is excellent for throughput, as any hiccups in the pipeline are hidden, but the catch is that for each extra image rendered there's an additional 16 ms delay (at 60 Hz, vsync locked). Even if you turn vsync off you could still end up with a delay if the GPU encounters a heavy scene.

 

Anyway, NVidia and friends want you to remove the glFinish line as that gives them extra cool benchmark results, but I think a gamer does not want you to. Naturally one wants as high a frame rate as possible, but it's a trade-off where you have to make sure the pipeline never gets filled with data that is essentially already out of date the moment you queue it. For that reason the best way is to use some fences that allow the app to know how far along the GPU has gotten, so commands are never queued further ahead than what's absolutely needed.

5 hours ago, dpJudas said:

Not sure if you're already aware of this - mentioning it just to be sure. The glFinish call is also used to keep input lag at a minimum. If you remove it the throughput will be better, but it will also increase the input latency.

 

Thanks for this - I wasn't aware that was one of the reasons glFinish is present, but I did consider that it would affect input latency if the GPU is really behind on rendering.

I actually had to put a 12(!) buffer pipeline in to avoid corruption on my Intel HD Graphics 4000 laptop. I'll reduce this to something more sensible and put in fences. On mobile only 2 extra buffers are needed, but I only have one GLES 3.2 device to test with at the moment. It makes the frame rate much smoother and I personally didn't notice the latency, but I am sure it is there.

Will update the branch when done if anyone is interested.

Edited by beloko


glFinish is a crude way of doing it that has some performance cost (I think it was about 30% on my old computer) - getting rid of it is the right thing to do. Just make sure you put something in place to prevent frames from stacking up, as otherwise it will affect the player in a very negative way.

 

When I initially got back to Doom a few years ago ZDoom wasn't the first port I tried, but it was the first I tried that didn't give me a bad case of mouse lag (I didn't try them all; not saying they all do). Some level of latency can be tolerated but you need to be careful as it is a decisive factor in port choice for players. It is also very individual - I'm sure there are people even more sensitive to it than me.


GZDoom's 'stat rendertimes' actually displays the time needed for glFinish. For me it normally varies between 1 and 5 ms, with maps like Frozen Time at the upper end of that range. Gaining those 5 ms would actually be a huge boost, because they would be covered nearly completely by time where the GPU is otherwise idle. So if this can be optimized it should be, but even then it might make sense to leave glFinish in as a user option. I once tried, but ran into the same problem with moving floors, and back in the day I didn't have sufficient knowledge about such things as GPU synchronization, so that feature ultimately went nowhere.

 

What I am wondering is whether having more than two buffers even makes sense, unless some CPU-side per-frame timing is used to avoid getting two frames too quickly after each other. If you have 3 buffers, they can be filled in quick succession, but there is no longer any guarantee that they are evenly distributed. It may well be that the first one is 5 ms, the second one 10 ms and the third one, which needs to wait on a fence, takes, say, 20 ms. When that happens the game no longer runs smoothly, even though the frame rate is nominally high.

 

Edited by Graf Zahl

2 hours ago, Graf Zahl said:

What I am wondering is whether having more than two buffers even makes sense, unless some CPU-side per-frame timing is used to avoid getting two frames too quickly after each other. If you have 3 buffers, they can be filled in quick succession, but there is no longer any guarantee that they are evenly distributed. It may well be that the first one is 5 ms, the second one 10 ms and the third one, which needs to wait on a fence, takes, say, 20 ms. When that happens the game no longer runs smoothly, even though the frame rate is nominally high.

I would say no. In the classic triple buffering scenario the logic was that one buffer would be front, one would just have finished and queued up, one is currently being rendered.

 

For GZDoom it is different because the post processing creates a lot more render buffers. Here the swap chain image only functions as the final output and as such doesn't block additional work in the same way. If you look at the SubmitCommands function on the vulkan2 branch you'll see it doesn't even attempt to acquire the swap chain image until it actually finished rendering everything (*) and only has one action left to perform: run the present shader.

 

It can be argued that for optimal throughput the engine needs a second set of scene image buffers, but for latency reasons I don't think it's a good idea to get that far ahead. I think the ideal is if the next frame has already received its first draw data by the time the previous frame is copied to post process. That way the GPU and CPU can already be working on the next frame the moment the current one is doing its final postprocess work and getting queued up for the next vsync.

 

Generally I think more than 2 buffers only really makes sense for non-interactive stuff like movie players. For games the input lag will be felt.

 

*) Right now it doesn't submit anything to the GPU until it has acquired the image, but there's no technical reason why it couldn't. Still, it illustrates how little the swap chain is used in modern GZD.

 

Edit: seems I didn't read properly what I was replying to. :) What you are describing about the uneven frame length is why, even with vsync off, too many buffers can cause some very horrible mouse lag if the GPU falls behind. Yet another reason why more isn't always better.

Edited by dpJudas


In the end I prefer to have a stable 60 fps most of the time, as it is right now. The more such optimizations get added, the more complicated it gets to synchronize it all properly.

 

From what you say it is inevitable that if we want two independent vertex buffers we also need to duplicate the entire postprocessing chain so that two frames can truly be done independently - one being rendered and the other being set up. One also has to ask when the duplication of resources starts to put pressure on the video RAM, leaving less for other important stuff like textures and static buffers.

17 hours ago, dpJudas said:

...When GZDoom reads the mouse location each frame this happens exactly at one specific moment in time. From there it generates the draw commands sent to the GPU. The GPU then begins to draw it, but if it doesn't display it immediately there's already a delay from the user action to visual confirmation on the monitor...

 

17 hours ago, beloko said:

...I actually had to put a 12(!) buffer pipeline in to avoid corruption on my Intel HD Graphics 4000 laptop...

 

15 hours ago, Graf Zahl said:

...What I am wondering is whether having more than two buffers even makes sense, unless some CPU-side per-frame timing is used to avoid getting two frames too quickly after each other. If you have 3 buffers, they can be filled in quick succession, but there is no longer any guarantee that they are evenly distributed. It may well be that the first one is 5 ms, the second one 10 ms and the third one, which needs to wait on a fence, takes, say, 20 ms. When that happens the game no longer runs smoothly, even though the frame rate is nominally high.

 

13 hours ago, dpJudas said:

I would say no. In the classic triple buffering scenario the logic was that one buffer would be front, one would just have finished and queued up, one is currently being rendered...

Thanks, guys - that was the source of my confusion: I was wondering why you'd ever need more buffers than required to show the current frame, have the next frame ready, and possibly an in-progress frame. 12 buffers? Are those 12 new frames waiting to be displayed? I would think that if you were rendering faster than the screen refresh rate, the GPU should be halted until the next VSYNC, because any frames generated in between will never be seen.

 

For example, after displaying frame 10, you gather inputs and use them to generate frame 11. If VSYNC hasn't yet happened, you simply halt rendering, and wait for VSYNC, at which time you swap to frame 11 immediately. But, if VSYNC has happened (because your renderer is slower than the screen refresh rate), you continue building frame 11 while still showing frame 10.

 

In this scenario, input is always applied to the very next frame, regardless of renderer speed (unless I'm missing something).

 

In other words, a nice scenario is that you have frame 11 completely built before frame 10's VSYNC is triggered. But if your renderer is fast enough to have frame 11 and frame 12 completely built before frame 10's VSYNC, if it were me, I'd toss out frame 11 and move right to frame 12 after frame 10. This assumes that I read input before building every frame. In this best-case scenario, input lag is reduced to two refresh periods. To accomplish this, you must be able to estimate, with some accuracy, how long it takes to render the current scene. What you don't want is for the GPU to be busy when it's time to start a new frame.

 

Given the choice of either building frames that will never be displayed and be tossed, or just idling the GPU, I'd lean towards idling the GPU, to reduce heat/power consumption, and prepare it to start at a moment's notice.

 

Note: my comments describe a theoretical design; you guys know GL rendering much better than I do. Is this stated accurately, and does it make sense?

 

 

 


Do modern games sync input to frame rendering like this?  Would making input and rendering asynchronous cause other weird issues?


Everything that was never considered may cause weird issues. That's the problem with these things.

11 hours ago, david_a said:

Do modern games sync input to frame rendering like this?  Would making input and rendering asynchronous cause other weird issues?

You could poll inputs asynchronously if you poll them faster than the renderer (which would not gain you much of anything). But when input polling is slower, you get input lag, which could loosely be defined as how long it takes for a mouse movement to affect the display.

 

It doesn't take much input lag, before it starts to have a very negative effect on gameplay feel. Imagine turning with the mouse, then abruptly stopping mouse movement, and yet the displayed view keeps turning for a frame or two. It very quickly starts to feel "rubber-bandy", like a car in need of a front-end alignment.

 

The OS may actually cause some input lag as well. Some controls take a significant amount of time to be properly read, like analog joysticks. These are read by charging, then discharging, a capacitor through a resistor, and measuring when the voltage drops below a threshold. This takes some time to happen. Older DOS games might actually halt until the value could be read.

 

A modern OS can poll the device in a thread, and cache the last value. This eliminates the need for the game to wait for the read, but because the value it does read is slightly stale, you get some OS-imposed input lag.
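Conceptually it's something like this (a toy sketch of the pattern, not any real driver code; the device-read function here is a made-up stand-in):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<int>  g_lastAxisValue{0};
std::atomic<bool> g_running{true};

// Stand-in for a slow, blocking hardware read.
int ReadAxisFromHardware()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(2));
    return 0;
}

// Background thread: keep re-reading the device and cache the latest value.
void PollerThread()
{
    while (g_running.load())
        g_lastAxisValue.store(ReadAxisFromHardware());
}

// The game never blocks; it just reads the cached (slightly stale) value.
int GetAxis()
{
    return g_lastAxisValue.load();
}
```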

 

Here's how I understand it:

  • If you read a device faster than the driver can read the device, you get back the same cached value as before.
  • If you read a device faster than your screen refresh rate, you gain no benefit, as you can only use 1 value per frame.
  • If you read a device slower than the screen refresh rate, you get input lag.

I think 1 frame of input lag is probably acceptable, and may even be undetectable to most people. At any rate, it's probably unavoidable. On the other hand, if your GPU can render hundreds of fps, and you are rendering and caching multiple frames to be displayed later, this is the visual equivalent to an echo, and your players will experience wicked input lag.

 

glFinish

The GPU wants to run in parallel with the CPU, and pump out frames as quick as possible. I get that glFinish causes the CPU to wait until the GPU has finished all its work on the current frame. Depending on the current GPU workload, using glFinish can reduce the effective frame rate. The problem is that the GPU may have just started on a new frame when the glFinish call happens, meaning that the CPU has to wait for almost an entire GPU step to complete. That's why spitting out hundreds of frames per second can cause input lag.

 

It's better to let the GPU get caught up, and then stop feeding it work to do! Do some CPU work, or simply go idle for a few milliseconds. The best time to make a call like glFinish is when the GPU is already done, and idle. It is important to reduce the GPU workload to match the screen refresh rate, so the GPU can be idle and ready to render a new frame, with new input data immediately after VSYNC.

 

Minimizing input lag, and keeping up with the screen refresh rate

It all boils down to order of operations. Assume a few discrete steps:

  • Read input
  • Run AI
  • CPU renderer prep
  • GPU frame render
  • Wait for VSYNC

The order in which these occur is critical: it determines input lag, how soon the CPU and GPU can start working, and how long they sit idle and ready between frames. Depending on the implementation, those steps can and should be rearranged as needed to minimize input lag and maximize CPU and GPU idleness, and it will run smooth as silk.
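As a purely theoretical sketch of one such ordering (the helper functions are placeholders, not real engine calls):

```cpp
// Placeholder declarations; in a real port these map onto the engine's own
// playsim, input, renderer, and swap functions.
bool GameIsRunning();
void RunAI();
void ReadInput();
void PrepareDrawCommands();
void SubmitFrameToGPU();
void WaitForVsyncAndPresent();

void FrameLoop()
{
    while (GameIsRunning())
    {
        RunAI();                  // game logic at its fixed tic rate
        ReadInput();              // sample input as late as possible
        PrepareDrawCommands();    // CPU renderer prep
        SubmitFrameToGPU();       // GPU frame render
        WaitForVsyncAndPresent(); // go idle until the swap, then repeat
    }
}
```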

 

I downloaded the source to a Pac-Man emulator. It rendered hundreds of fps, and it was a dog: 100% CPU usage, input lag, the works. I added a 60 Hz timer (Pac-Man's original refresh rate), and a Sleep() call to get the program to yield as many time slices to the OS as needed to conform to 60 Hz. The new program consumes less than 1% CPU, and I can run 2 dozen instances of the program like it was nothing. Still runs at 60 Hz. Extra fps does nothing but kill your CPU/GPU, cause input lag, and consume resources. VSYNC gets a bad rap, though, because that order of operations becomes critical to success.
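The fix was essentially a fixed-rate frame limiter, something like this (standard C++ sketch, not the emulator's actual code):

```cpp
#include <chrono>
#include <functional>
#include <thread>

void RunAt60Hz(const std::function<void()> &emulateOneFrame)
{
    using clock = std::chrono::steady_clock;
    constexpr auto kFrameTime = std::chrono::microseconds(16667);  // ~60 Hz
    auto next = clock::now() + kFrameTime;

    for (;;)
    {
        emulateOneFrame();
        std::this_thread::sleep_until(next);  // yield the CPU until the next deadline
        next += kFrameTime;
    }
}
```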

 

Please note: I haven't analyzed the code - your mileage may vary. I do not claim to be an expert. What I do know is that successfully tuning such a beast requires deep profiling, and a deeper understanding of how the above steps interact simultaneously. It's not easy work, but it's easy to get it wrong and yet have it appear to be working. You have to gain an intimate understanding of the inter-frame and intra-frame interactions of each step, and of the process as a whole. I'd be using the processor's high-resolution timing counters, and building time charts, to profile any experimental changes I made.

 

I am not proposing any solutions here. Instead, my goal was to provide some food for thought, and maybe spark an idea that points you in the right direction. Good luck!


Thanks a lot for all the comments - very interesting and gave me a lot to think about.

 

If anyone is interested I have updated the repo:

* Buffers mSkyData, mVertexData, mViewpoints, mLights

* Use "-hwbuffers x" to set number of buffers in pipeline - up to 16 E.g "-hwbuffers 4". It defaults to 1 (as normal)

* Uses GL sync objects instead of glFinish (even with no buffering)

* Fixes the vboheight issue (I think - I've only done a bit of testing)

 

Probably not done correctly, but seems to work.

 

https://github.com/coelckers/gzdoom/compare/maint_4.0...emileb:g4.0.0_gl_buffered

 

It gives a good speed increase on mobile and a pretty good speed increase on my old laptop, depending on the map.

 

 

 


Looking at your code there's one thing that doesn't look right.

The FakeFlat routine should only operate on the vboheight for the currently active buffer, not all of them - this can cause glitches if you have a pending frame where you are underwater but in the current frame are above water. This is relatively rare, so you probably missed it. No code that gets called from inside the renderer should ever try to access data for any buffer other than the currently active one.

 

It may also be a good idea to pass the buffer index as a parameter instead of reading it out of the 'screen' variable, in case the whole setup gets refactored in the future.

 

48 minutes ago, Graf Zahl said:

Looking at your code there's one thing that doesn't look right.

The FakeFlat routine should only operate on the vboheight for the currently active buffer, not all of them - this can cause glitches if you have a pending frame where you are underwater but in the current frame are above water. This is relatively rare, so you probably missed it. No code that gets called from inside the renderer should ever try to access data for any buffer other than the currently active one.

 

It may also be a good idea to pass the buffer index as a parameter instead of reading it out of the 'screen' variable, in case the whole setup gets refactored in the future.

 

Ah yes, thanks a lot. I thought the FakeFlat stuff didn't seem right; I wasn't sure when it was actually used. Cheers!

