
SIMD compiler flags - is it needed?


kb1 said:

Then, nothing is "holding anything back", and you get the best of all worlds.



... and twice the work. Just an example:

I would really like to rewrite GZDoom to use more GL 4.4 and 4.5 features but that'd make the codebase incompatible with older hardware in a way that'd require some major code duplication.

At some point the added work just isn't worth it anymore because the cost-to-benefit ratio is not good enough.

Graf Zahl said:

... and twice the work. Just an example:

I would really like to rewrite GZDoom to use more GL 4.4 and 4.5 features but that'd make the codebase incompatible with older hardware in a way that'd require some major code duplication.

At some point the added work just isn't worth it anymore because the cost-to-benefit ratio is not good enough.

If it's major code duplication, that's the potential for major optimization, right? Yeah, it requires work - most good things do. Never said it was free. Of course the lazy route is to simply switch the CPU target at compile time, which would probably provide a small benefit.

But, yeah, it might require some effort - didn't think I needed to mention that. I'm not saying to release a DOS version. XP and above would capture most of the Doom audience's desires, and, for GZDoom's GL stuff, that's probably a reasonable minimum requirement.

So, maybe a 32-bit XP build with a basic FPU compile, and an optimized 64-bit SSE/AVX bleeding-edge version. That should cover most of your user base, I would think. Could tack on an in-between version for the most flexibility.

It's really not that difficult. You restructure your data to be optimal for SIMD but still compatible with all versions. Then you grow a small number of functions into pairs. It's probably only a small handful of functions that could really benefit from SIMD anyway, right?
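Just to make it concrete, here's the kind of thing I mean - a throwaway C++ sketch, not code from any actual port, and all the names are made up. Keep one data layout (structure-of-arrays, aligned) that both a plain-C path and an SSE path can consume, and grow the hot function into a pair:

    // Throwaway illustration: one SIMD-friendly data layout, two implementations.
    #include <immintrin.h>
    #include <cstddef>

    struct Points {
        alignas(16) float x[1024];   // structure-of-arrays, 16-byte aligned,
        alignas(16) float y[1024];   // so the same data feeds either path
    };

    // Baseline version: runs on anything.
    void scale_points_c(Points& p, std::size_t n, float s) {
        for (std::size_t i = 0; i < n; ++i) {
            p.x[i] *= s;
            p.y[i] *= s;
        }
    }

    // SSE version of the same function: four points per iteration
    // (assumes n is a multiple of 4, for brevity).
    void scale_points_sse(Points& p, std::size_t n, float s) {
        const __m128 vs = _mm_set1_ps(s);
        for (std::size_t i = 0; i < n; i += 4) {
            _mm_store_ps(&p.x[i], _mm_mul_ps(_mm_load_ps(&p.x[i]), vs));
            _mm_store_ps(&p.y[i], _mm_mul_ps(_mm_load_ps(&p.y[i]), vs));
        }
    }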


Optimizing the floating point stuff is the least of the problems, i.e. it doesn't make any difference. For that I wouldn't even bother doing more than compiling the same code for 32 and 64 bit. And all those enhanced instruction sets don't help here.

I'd rather optimize the render flow but for that to have any point I'd need a better graphics card myself first. Not worth the investment.

That's the main issue here: The cost for these optimizations far outweighs the benefits. It was worth it for GZDoom 2.x which indeed did provide a nice performance boost on current hardware.


COMING SOON: Gentoo-style "ricers" hacking your favourite source ports, and recompiling EVERYTHING with "special optimization flags" for every possible CPU model, subtype, revision, cache size etc.

Why settle for the inefficient generic stuff when you could have e.g. prBoom+ or ZDoom compiled with flags that match EXACTLY your exclusive 1st gen 1.5 GHz Pentium 4, or your pre-release Athlon XP with a beta version of 3DNow!, or super-rare Pentium III-S with 512KB of cache and hardware prefetch?!

Graf Zahl said:

I'd rather optimize the render flow but for that to have any point I'd need a better graphics card myself first. Not worth the investment.

That's the main issue here: The cost for these optimizations far outweighs the benefits. It was worth it for GZDoom 2.x which indeed did provide a nice performance boost on current hardware.

Yes, it's a cost, and I have paid it and will continue to pay it. And that code is a few hours of my time, vs. noticing a lag *each and every* time I play certain wads, knowing damn well that I could do better. It's the cost of being ripped out of the immersion, and dropped on my head, back into reality, with some mundane response like "this room has a lot of visplanes" or "there are too many monsters in this level." My previous post concurs with you, in that the areas in the code where heavy optimization makes sense are few. Renderer - check. Pre-renderer clipping array/visplane calc (R_StoreWallRange, etc.) - maybe check. BLITs - check!

In the renderer, a long-term goal of mine is to attempt to do texture mapping (scaling, light-calc) on multiple pixels simultaneously, using SIMD. Wow, that would be incredible!
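To give a flavour of what I mean - and this is only a toy SSE2 sketch with an invented function name, not anything out of a real renderer - it applies a light level to four 32-bit pixels at a time. A real column or span drawer would obviously also be doing the texture fetch and stepping, and would handle counts that aren't a multiple of four:

    // Toy example: scale four RGBA pixels by a light level (0..256) with SSE2.
    #include <emmintrin.h>
    #include <cstdint>

    void light_pixels_sse2(uint32_t* pixels, int count, int light) {
        const __m128i vlight = _mm_set1_epi16(static_cast<short>(light));
        const __m128i zero   = _mm_setzero_si128();
        for (int i = 0; i < count; i += 4) {                      // count assumed % 4 == 0
            __m128i px = _mm_loadu_si128(reinterpret_cast<__m128i*>(pixels + i));
            __m128i lo = _mm_unpacklo_epi8(px, zero);             // two pixels -> 16-bit lanes
            __m128i hi = _mm_unpackhi_epi8(px, zero);
            lo = _mm_srli_epi16(_mm_mullo_epi16(lo, vlight), 8);  // component * light / 256
            hi = _mm_srli_epi16(_mm_mullo_epi16(hi, vlight), 8);
            px = _mm_packus_epi16(lo, hi);                        // back to 8-bit, saturated
            _mm_storeu_si128(reinterpret_cast<__m128i*>(pixels + i), px);
        }
    }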

One final area of optimization would be in the sound-mixing code. Sound hardware can do a lot of the work, but sound effect mixing done with software should be able to take advantage of SIMD instructions.

But, stated differently, there are about 4 levels of commitment required with SIMD usage, based on how it is implemented:
1. No SIMD - your code will run on a Pentium (or 486DX!).
2. Compiler option on - Here, the compiler tries to rearrange things to allow some usage of SIMD tech, without the programmer's involvement.
3. Compiler option on, with wrapped CPU-specific SIMD instructions, inserted into the program by the programmer.
4. Roll your own (write sections in assembly).

I can't expect any compiler to produce anything decent below level 3. And, if level 4 is available, why bother with level 3 in most cases? This takes commitment, because you're going to have to rearrange structures, worry about alignment, branch to less powerful functions when the CPU does not support the tech, etc. I can fully understand not wanting to go down that path. But the idea doesn't deserve to be consistently pooh-poohed. You talk as if these new, massively powerful instruction set extensions don't deserve some extra time and effort (including providing a fallback path for users with older CPUs).

Again, it's ok for it to be more work than you personally want to invest - I totally respect that viewpoint, absolutely.

But throwing around the idea that it's simply a waste of time in general, and that you have never noticed improvements after setting a lousy compiler option? That's kinda careless, man! Because some people will believe you, without realizing that you haven't done the work to actually know if the work outweighs the benefits. You can't really say that, unless you can state what the benefits are...which requires doing the work.

Maes said:

COMING SOON: Gentoo-style "ricers" hacking your favourite source ports, and recompiling EVERYTHING with "special optimization flags" for every possible CPU model, subtype, revision, cache size etc.

Why settle for the inefficient generic stuff when you could have e.g. prBoom+ or ZDoom compiled with flags that match EXACTLY your exclusive 1st gen 1.5 GHz Pentium 4, or your pre-release Athlon XP with a beta version of 3DNow!, or super-rare Pentium III-S with 512KB of cache and hardware prefetch?!

I know this was a joke, but, seriously, how many versions would you need, if you tested processor capabilities at startup, and set function pointers accordingly? Easy-peasy.
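For the sake of argument, this is roughly all I'm picturing - a rough sketch where the blit routines are just stand-ins, not anyone's real code; the only point is the one-time capability check and the function pointer:

    // Rough sketch: pick an implementation once at startup, call through a pointer.
    #include <cstddef>
    #include <cstring>
    #if defined(_MSC_VER)
    #include <intrin.h>
    #endif

    // Placeholder implementations -- a real port would have its own routines here.
    void BlitColumns_C(void* dst, const void* src, std::size_t n)    { std::memcpy(dst, src, n); }
    void BlitColumns_SSE2(void* dst, const void* src, std::size_t n) { std::memcpy(dst, src, n); }

    // The rest of the engine only ever calls through this pointer.
    void (*BlitColumns)(void*, const void*, std::size_t) = BlitColumns_C;

    static bool HasSSE2() {
    #if defined(_MSC_VER)
        int regs[4];
        __cpuid(regs, 1);
        return (regs[3] & (1 << 26)) != 0;           // EDX bit 26 = SSE2
    #elif defined(__GNUC__)
        return __builtin_cpu_supports("sse2") != 0;  // GCC/Clang builtin
    #else
        return false;
    #endif
    }

    void InitCpuDispatch() {
        if (HasSSE2())
            BlitColumns = BlitColumns_SSE2;          // done once at startup
    }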

kb1 said:

But throwing around the idea that it's simply a waste of time in general, and that you have never noticed improvements after setting a lousy compiler option? That's kinda careless, man! Because some people will believe you, without realizing that you haven't done the work to actually know if the work outweighs the benefits. You can't really say that, unless you can state what the benefits are...which requires doing the work.

You're calling the author of maybe the most popular and advanced Doom source port "careless" because he doesn't want to spend his time on micro-optimizations with spurious benefit, and subtly suggesting he might just be lazy? Where's your source port again?


Furthermore, wouldn't that be the exact opposite of laziness and carelessness? It's not exactly straightforward to make implementation-specific features all conform to the exact same standards at a binary level, and to make damn sure it really is the same result across multiple platforms. If you're going as far as to implement an alternative math library to maintain consistency, you're putting in effort for sure.


Ok, let's just take apart this pile of hogwash:

kb1 said:

Yes, it's a cost, and I have paid it and will continue to pay it. And that code is a few hours of my time, vs. noticing a lag *each and every* time I play certain wads, knowing damn well that I could do better. It's the cost of being ripped out of the immersion, and dropped on my head, back into reality, with some mundane response like "this room has a lot of visplanes" or "there are too many monsters in this level." My previous post concurs with you, in that the areas in the code where heavy optimization makes sense are few. Renderer - check. Pre-renderer clipping array/visplane calc (R_StoreWallRange, etc.) - maybe check. BLITs - check!

In the renderer, a long-term goal of mine is to attempt to do texture mapping (scaling, light-calc) on multiple pixels simultaneously, using SIMD. Wow, that would be incredible!


Let's be clear about one thing first:
I do not work on ZDoom's software renderer. It also already contains highly optimized assembly drawing routines - and if you actually profile the code, that's the one single spot where the vast majority of time is being spent. No need to optimize the rest to death.

The first rule of optimization always has to be to find the actual bottlenecks and then focus on them.

kb1 said:

One final area of optimization would be in the sound-mixing code. Sound hardware can do a lot of the work, but sound effect mixing done with software should be able to take advantage of SIMD instructions.


Wow! As if we haven't thought of that! But it's a complete waste of time these days! Any recent computer (say, built in the last 8-10 years) has a multi-core processor, and since sound mixing is always done asynchronously, it will most certainly be relegated to the second CPU core, so you lose nothing.
You also have to consider that the market penetration of mixing-capable sound hardware probably hovers in the low single digit percents, because the simple on-board soundchips are indeed good enough for most people.
Add to that that hardware sound mixing comes at a steep price (different hardware has different limits) and that the actual benefits are close to nil (easily verified by forcing the entire engine onto a single core), and the entire feature evaporates into a cloud of smoke. If you look at professional sound libraries you will find that they do not care about hardware sound mixing at all. OpenAL these days means OpenAL Soft, and FMOD has ditched all hardware mixing support. The simple reason is that the software mixers are far, far more powerful.

kb1 said:

But, stated differently, there are about 4 levels of commitment required with SIMD usage, based on how it is implemented:
1. No SIMD - your code will run on a Pentium (or 486DX!).
2. Compiler option on - Here, the compiler tries to rearrange things to allow some usage of SIMD tech, without the programmer's involvement.
3. Compiler option on, with wrapped CPU-specific SIMD instructions, inserted into the program by the programmer.
4. Roll your own (write sections in assembly).


... which might make sense if you actually get some benefit out of this. But I clearly see that people like you get hung up on the advertisement of a feature and actually never run a profiler on it to see if investing time here makes sense.
As things stand, ZDoom's software renderer won't benefit much from it because it'd affect too small a time slice of where time is spent, and in GZDoom's hardware renderer the math stuff is never the limiting factor. Optimizing code that runs 10% of all time to be 10% faster means a 1% speedup, and that's the territory we are in here. For the record: Your options 1, 2 and 3 do not make a bit of performance difference, so why even bother? I repeatedly said that on modern hardware x87 instructions and SSE instructions make no performance difference on 32 bit. And since I have no system where it MIGHT make a difference, plus these are old and slow anyway (that'd be 10 year old systems by now), why optimize for them? These days they all squarely fall into the 'legacy hardware' slot, so one universal solution to support them across the board is all you may expect. And that - no surprise - is x87.

kb1 said:

I can't expect any compiler to produce anything decent below level 3. And, if level 4 is available, why bother with level 3 in most cases?
This takes commitment, because you're going to have to rearrange structures, worry about alignment, branch to less powerful functions when the CPU does not support the tech, etc. I can fully understand not wanting to go down that path. But the idea doesn't deserve to be consistently pooh-poohed. You talk as if these new, massively powerful instruction set extensions don't deserve some extra time and effort (including providing a fallback path for users with older CPUs).


Why? Because it's a lot of work. Worse, it's a lot of work that's highly platform specific, very tedious and coming with an utterly shitty cost-to-benefit ratio.

And the speedup it might bring is minimal if you can't arrange large quantities of calculations to use these features. And with something as simple as Doom that's very hard.

What you are proposing is quite similar to the attitude put on display here: https://www.doomworld.com/vb/post/1534288

In short: It's sacrificing long-term viability for some short-term benefit. Let's not forget one thing: GZDoom's code base is 15 years old by now; it has survived multiple processor generations, all of which have different requirements to get the optimal speed out of them. Had I gone down your route, writing new low-level support code each time something new came up, the engine would have stalled, because the more specialized the code gets, the harder it is to maintain and the more difficult it is to add new features or improve it in general.
I am doing multi-platform programming in my job and am fully aware of the problems such approaches always cause. I am a staunch proponent of avoiding this route whenever possible, because all this specialized code always causes more work to support - that's fine if you release your game in 6 months and never have to think about it ever again - but that's deadly if you write code that's supposed to still be alive in 10 years.

kb1 said:

Again, it's ok for it to be more work than you personally want to invest - I totally respect that viewpoint, absolutely.


If it could provide an actual speedup, I might, but that's not the case. It never was.
GZDoom's limit always has been elsewhere. First it was the graphics hardware performance that was 90% of the bottleneck - making CPU-side optimizations a waste of time - and although these days the CPU is the limiting factor, it's not the little bit of math that is required. No, right now what blocks improvement is primarily the fact that the most costly parts of the renderer are not multi-threadable (the BSP traverser and visibility clipping, which need to process data in order, and the actual drawing code which is hampered by OpenGL's single-thread operation model.)
None of these would benefit from math optimizations at all, and the rest of the code that actually does some math takes 20% of all time, with large parts of it not doing any math at all but setting up the data for the renderer.

kb1 said:

But throwing around the idea that it's simply a waste of time in general, and that you have never noticed improvements after setting a lousy compiler option? That's kinda careless, man! Because some people will believe you, without realizing that you haven't done the work to actually know if the work outweighs the benefits. You can't really say that, unless you can state what the benefits are...which requires doing the work.


I only repeat myself: If you want to optimize, first you need to profile. And then optimize where most of the time is spent. And guess what: I precisely did that years ago. Why, for example, do you think, is GZDoom's clipper based on pseudo angles, not real ones? Yes, right: Because this was a spot where lots of time was spent and where I did some focussed research to improve it. And behold: That improvement is not built on micro-optimizing the code but on changing the algorithm. Which ultimately is the only kind of optimization that has long term benefits.
Your proposal is so highly CPU-revision specific that in a few years it's all obsolete and a waste of time. And I do not feel like wasting my time on that stuff.

kb1 said:

I know this was a joke, but, seriously, how many versions would you need, if you tested processor capabilities at startup, and set function pointers accordingly? Easy-peasy.


You clearly do not know about the art of software engineering. As I said in a response to the link I posted above, performance is not everything. I need code that remains viable 10 years down the line and is not littered with specialized stuff for this and that, because every time something on the mainline code changes, all those specialized versions have to be changed as well. And the more of them you get the more work you get.

Again, that's fine if all you have to do is think 6 months ahead, but it will destroy your engine if it's supposed to live on for many, many years.

So much for easy-peasy.

Linguica said:

You're calling the author of maybe the most popular and advanced Doom source port "careless" because he doesn't want to spend his time on micro-optimizations with spurious benefit, and subtly suggesting he might just be lazy? Where's your source port again?

Those are your words, not mine. Stated again: I said that telling people that using SIMD optimizations is useless is itself careless, because it might stop people from trying, especially people who respect Graf's opinion. That's it. Where did you get "lazy" from?

Graf Zahl said:

Ok, let's just take apart this pile of hogwash:

That's not very nice. I didn't disrespect you once.

Graf Zahl said:

Let's be clear about one thing first:
I do not work on ZDoom's software renderer. It also already contains highly optimized assembly drawing routines - and if you actually profile the code, that's the one single spot where the vast majority of time is being spent. No need to optimize the rest to death.

The first rule of optimization always has to be to find the actual bottlenecks and then focus on them.

I'm with you there.

Graf Zahl said:

Wow! As if we haven't thought of that! But it's a complete waste of time these days! Any recent computer (say, built in the last 8-10 years) has a multi-core processor, and since sound mixing is always done asynchronously, it will most certainly be relegated to the second CPU core, so you lose nothing.

Not unless it is set up to do so, and it is allowed to do so. I imagine SDL does do that - can't say for sure. And, hopefully it's using SIMD, since that's a prime candidate.

Graf Zahl said:

You also have to consider that the market penetration of mixing-capable sound hardware probably hovers in the low single digit percents, because the simple on-board soundchips are indeed good enough for most people.
Add to that that hardware sound mixing comes at a steep price (different hardware has different limits) and that the actual benefits are close to nil (easily verified by forcing the entire engine onto a single core), and the entire feature evaporates into a cloud of smoke. If you look at professional sound libraries you will find that they do not care about hardware sound mixing at all. OpenAL these days means OpenAL Soft, and FMOD has ditched all hardware mixing support. The simple reason is that the software mixers are far, far more powerful.

Interesting. Personally, I have always chosen software mixing, simply for control. I didn't realise that there were other reasons to make that choice.

Graf Zahl said:

... which might make sense if you actually get some benefit out of this. But I clearly see that people like you get hung up on the advertisement of a feature and actually never run a profiler on it to see if investing time here makes sense.
As things stand, ZDoom's software renderer won't benefit much from it because it'd affect too small a time slice of where time is spent, and in GZDoom's hardware renderer the math stuff is never the limiting factor. Optimizing code that runs 10% of all time to be 10% faster means a 1% speedup, and that's the territory we are in here. For the record: Your options 1, 2 and 3 do not make a bit of performance difference, so why even bother? I repeatedly said that on modern hardware x87 instructions and SSE instructions make no performance difference on 32 bit. And since I have no system where it MIGHT make a difference, plus these are old and slow anyway (that'd be 10 year old systems by now), why optimize for them? These days they all squarely fall into the 'legacy hardware' slot, so one universal solution to support them across the board is all you may expect. And that - no surprise - is x87.

Now you're making assumptions: That my knowledge comes from "advertisements", that I do not profile. The whole reason I am trying to support SIMD is from the huge performance advantage I have witnessed in my own programs. And, yes, I had to really juggle around the data, and those were simple routines (not Doom), but I easily got 3 to 4 times speed increase over some pretty optimal x86 code that had been extensively tweaked and timed.

Graf Zahl said:

Why? Because it's a lot of work. Worse, it's a lot of work that's highly platform specific, very tedious and coming with an utterly shitty cost-to-benefit ratio.

And the speedup it might bring is minimal if you can't arrange large quantities of calculations to use these features. And with something as simple as Doom that's very hard.

What you are proposing is quite similar to the attitude put on display here: https://www.doomworld.com/vb/post/1534288

Yes, the setup framework is a lot of work. But, once it's there, it becomes a linear process. And, the speedup might be great.

Graf Zahl said:

In short: It's sacrificing long-term viability for some short-term benefit. Let's not forget one thing: GZDoom's code base is 15 years old by now; it has survived multiple processor generations, all of which have different requirements to get the optimal speed out of them. Had I gone down your route, writing new low-level support code each time something new came up, the engine would have stalled, because the more specialized the code gets, the harder it is to maintain and the more difficult it is to add new features or improve it in general.
I am doing multi-platform programming in my job and am fully aware of the problems such approaches always cause. I am a staunch proponent of avoiding this route whenever possible, because all this specialized code always causes more work to support - that's fine if you release your game in 6 months and never have to think about it ever again - but that's deadly if you write code that's supposed to still be alive in 10 years.

Yeah, I get that you're a proponent - I knew that before I replied. And, yeah, GZDoom is a mature product, and, yes, specialized code does require another level of responsibility. No argument there.


Graf Zahl said:

If it could provide an actual speedup, I might, but that's not the case. It never was.
GZDoom's limit always has been elsewhere. First it was the graphics hardware performance that was 90% of the bottleneck - making CPU-side optimizations a waste of time - and although these days the CPU is the limiting factor, it's not the little bit of math that is required. No, right now what blocks improvement is primarily the fact that the most costly parts of the renderer are not multi-threadable (the BSP traverser and visibility clipping, which need to process data in order, and the actual drawing code which is hampered by OpenGL's single-thread operation model.)
None of these would benefit from math optimizations at all, and the rest of the code that actually does some math takes 20% of all time, with large parts of it not doing any math at all but setting up the data for the renderer.

I only repeat myself: If you want to optimize, first you need to profile. And then optimize where most of the time is spent. And guess what: I precisely did that years ago. Why, for example, do you think, is GZDoom's clipper based on pseudo angles, not real ones? Yes, right: Because this was a spot where lots of time was spent and where I did some focussed research to improve it. And behold: That improvement is not built on micro-optimizing the code but on changing the algorithm. Which ultimately is the only kind of optimization that has long term benefits.
Your proposal is so highly CPU-revision specific that in a few years it's all obsolete and a waste of time. And I do not feel like wasting my time on that stuff.
You may be right - GZDoom may have fewer opportunities to benefit from processor-specific code.

Graf Zahl said:

You clearly do not know about the art of software engineering.

35 years of writing code, 30 of them professional, says you're wrong. You clearly get defensive and start to insult any time someone disagrees with you, and you basically throw a fit. I never did that to you.

Graf Zahl said:

As I said in a response to the link I posted above, performance is not everything. I need code that remains viable 10 years down the line and is not littered with specialized stuff for this and that, because every time something on the mainline code changes, all those specialized versions have to be changed as well. And the more of them you get the more work you get.

Performance is not nothing. Do you really think Intel and AMD are going to drop instructions? Their processors would never sell. Think about it.

Graf Zahl said:

Again, that's fine if all you have to do is think 6 months ahead, but it will destroy your engine if it's supposed to live on for many, many years.

So much for easy-peasy.

For you, maybe. It's not difficult for me. Then again, I don't release updates every day. In fact, I haven't released any. That's because I am taking my time trying to perfect everything, so that I can more easily support demo playback, and so I don't experience that heavy responsibility of releasing continuous bug fixes and new features. In your scenario, I guess I'd be more reluctant to change things simply for performance.

Before you beat me down and judged me, all I asked was that you change your wording a bit. Instead of simply damning SIMD optimization, maybe say "I don't think it would benefit GZDoom much." Not "it's a waste of time", because that's simply too harsh - thus careless. And, by the way, it's a waste of everyone's time to try to play a game that putters along at 10 fps due to a bottleneck that could be made more efficient. In that case, performance is everything.

kb1 said:

Not unless it is set up to do so, and it is allowed to do so. I imagine SDL does do that - can't say for sure. And, hopefully it's using SIMD, since that's a prime candidate.


SDL? No idea. But when talking about real sound libraries, I haven't seen any that works synchronously. Even OpenAL which completely botched the streaming stuff works asynchronously internally, too bad that its makers were big morons who failed to realize the value of asynchronous callbacks.

kb1 said:

Now you're making assumptions: That my knowledge comes from "advertisements", that I do not profile. The whole reason I am trying to support SIMD is from the huge performance advantage I have witnessed in my own programs. And, yes, I had to really juggle around the data, and those were simple routines (not Doom), but I easily got 3 to 4 times speed increase over some pretty optimal x86 code that had been extensively tweaked and timed.


I wonder how this '3-4 times speed increase' fares when it becomes part of a larger program.
In my book the isolated performance of single routines doesn't mean much. What matters is how this affects performance as a whole.

kb1 said:

Yes, the setup framework is a lot of work. But, once it's there, it becomes a linear process. And, the speedup might be great.


First, the value must be proven. Second, the linear factor must not be too high. Sorry, but it's a waste of time. Doom simply doesn't do enough math to make this worthwhile.

kb1 said:

35 years of writing code, 30 of them professional, says you're wrong. You clearly get defensive and start to insult any time someone disagrees with you, and you basically throw a fit. I never did that to you.


Then why is all I can find here boilerplate advice, i.e. stuff right out of an optimization textbook without any scrutinizing whether it may even apply to the task at hand?

kb1 said:

Performance is not nothing. Do you really think Intel and AMD are going to drop instructions? Their processors would never sell. Think about it.


Again the same nonsense. You make a blanket assumption and answer it with boilerplate.
And now take one goddamn guess how much I have profiled and analyzed the code to find any actual bottlenecks to optimize?
This gets increasingly difficult if all the time is evenly spread out across a large amount of code. No isolated part gets enough coverage to get something out of it aside from doing this kind of peephole optimization.

kb1 said:

For you, maybe. It's not difficult for me. Then again, I don't release updates every day. In fact, I haven't released any. That's because I am taking my time trying to perfect everything, so that I can more easily support demo playback, and so I don't experience that heavy responsibility of releasing continuous bug fixes and new features. In your scenario, I guess I'd be more reluctant to change things simply for performance.


Ok, so you're a typical tinkerer, who thinks that optimizing the code to hell is a glorious achievement? Of course it makes sense to do this stuff with such a mindset. I left that behind 20 years ago when I started programming professionally.

You know, the first 3 years of GZDoom's development (from 2002-2005) went like this - but all the good stuff came afterward, and that includes many performance optimizations. Why? Because I got the one important thing you won't get: Feedback, feedback and more feedback from people actually USING the code. That feedback alone is worth more than all the time you take perfecting your code.

But one thing is clear: 90% of all optimization I did over the last 11 years was not micro-optimization like trying to see where SSE may be beneficial, but algorithmic changes that had a much wider and much more profound impact. And that's clearly the kind of optimization I look for in the future, not this hogwash about perfectly tailoring the code to the strengths of a certain instruction set. That's never going to be the solution, because first SSE2 was the magic ingredient, then AVX, and who knows what is next (or what is needed on ARM CPUs, for example). That's way too much specialization that takes the focus off the stuff that's really important to make the code faster.

kb1 said:

Before you beat me down and judged me, all I asked was that you change your wording a bit. Instead of simply damning SIMD optimization,


In the grand scheme of things it's useless unless you have some very specific tasks that happen to be time consuming and may benefit from vectorization. But even writing a sine approximation that runs twice as fast only matters if you have thousands of sines to calculate every millisecond.
The biggest weakness of this stuff is that it's utterly hostile toward a memory organization that makes sense in the program's context. It's the perfect tool for programmers who think speed first and everything else a distant second, but the way it works will not help it gain more mainstream support, because it's close to impossible to optimize normal code to make use of it.

If it's more work to vectorize the data than to simply do the goddamn calculations right away it's just a losing proposition. And that's precisely the issue at hand here.

And since optimization is so important for you, here's one number:

The very first GZDoom release, 0.9.1 from 2005, runs Frozen Time at 20 fps on my current system when starting at the exit to the bridge.
The latest 2.1.1 release and the devbuilds run it at 54 fps on the same system. (That'd be the same for 64-bit SSE or 32-bit x87 builds.)

And I managed to do that without adding one single CPU specific optimization, or even assembly code - ever! It was all achieved by endless profiling, hunting for bottlenecks and eliminating them.

kb1 said:

*wall of text*

You've just been comprehensively, unanswerably told. Accept it.

Graf Zahl said:

And since optimization is so important for you, here's one number:

The very first GZDoom release, 0.9.1 from 2005, runs Frozen Time at 20 fps on my current system when starting at the exit to the bridge.
The latest 2.1.1 release and the devbuilds run it at 54 fps on the same system. (That'd be the same for 64-bit SSE or 32-bit x87 builds.)

And I managed to do that without adding one single CPU specific optimization, or even assembly code - ever! It was all achieved by endless profiling, hunting for bottlenecks and eliminating them.

This is very interesting to me, because Frozen Time's bridge makes 3DGE bog down to very low single-digit framerates. For curiosity's sake, where was most of the time spent trying to render the bridge area, and what kind of things did you do to speed it up? As soon as I settle on a decent profiler for my setup, that's one of the first things I want to optimize. Even though our ports are different, it might give me a good idea of what needs to be tackled first. Probably line of sight, maybe BSP culling?

Graf Zahl said:

SDL? No idea. But when talking about real sound libraries, I haven't seen any that works synchronously. Even OpenAL which completely botched the streaming stuff works asynchronously internally, too bad that its makers were big morons who failed to realize the value of asynchronous callbacks.

I wonder how this '3-4 times speed increase' fares when it becomes part of a larger program.

In my book the isolated performance of single routines doesn't mean much. What matters is how this affects performance as a whole.

First, the value must be proven. Second, the linear factor must not be too high. Sorry, but it's a waste of time. Doom simply doesn't do enough math to make this worthwhile.

Then why is all I can find here boilerplate advice, i.e. stuff right out of an optimization textbook without any scrutinizing whether it may even apply to the task at hand?

Again the same nonsense. You make a blanket assumption and answer it with boilerplate.

The boilerplate response is in response to your overbearing "It's a waste of time" declaration. Of course it makes sense to focus your optimization efforts where they will make a big difference. And, it's not all about math. The possible parallelization is worth investigating. As one example, I use SIMD for block transfers in Doom, which beats the hell out of rep movsd and the like. In all fairness, that's for the software renderer.
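For the curious, the shape of it is roughly this - simplified to the bone; my real code deals with alignment and leftover bytes, whereas this sketch just assumes 16-byte-aligned pointers and a length that's a multiple of 64:

    // Simplified SSE2 block copy with non-temporal (cache-bypassing) stores,
    // the sort of thing that pays off for big write-once framebuffer blits.
    #include <emmintrin.h>
    #include <cstddef>
    #include <cstdint>

    void blit_sse2(uint8_t* dst, const uint8_t* src, std::size_t len) {
        for (std::size_t i = 0; i < len; i += 64) {
            __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i));
            __m128i b = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i + 16));
            __m128i c = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i + 32));
            __m128i d = _mm_load_si128(reinterpret_cast<const __m128i*>(src + i + 48));
            _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i),      a);
            _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i + 16), b);
            _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i + 32), c);
            _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i + 48), d);
        }
        _mm_sfence();  // make the streaming stores globally visible before returning
    }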

Graf Zahl said:

And now take one goddamn guess how much I have profiled and analyzed the code to find any actual bottlenecks to optimize?
This gets increasingly difficult if all the time is evenly spread out across a large amount of code. No isolated part gets enough coverage to get something out of it aside from doing this kind of peephole optimization.

I know. My guess is that you've done extensive profiling. One thing I do to make certain areas of the profile stick out is to make contrived examples: 15,000 monster mega maps, maps with tons of sectors, etc. Basically, making any slow area become more prominent. That led to a few optimizations. But, yes, time is typically evenly spread across rendering, and AI.

Graf Zahl said:

Ok, so you're a typical tinkerer, who thinks that optimizing the code to hell is a glorious achievement? Of course it makes sense to do this stuff with such a mindset. I left that behind 20 years ago when I started programming professionally.

Did I mention my 30 professional years? Yeah, it's pretty awesome to eliminate a lag, absolutely.

Graf Zahl said:

You know, the first 3 years of GZDoom's development (from 2002-2005) went like this - but all the good stuff came afterward, and that includes many performance optimizations. Why? Because I got the one important thing you won't get: Feedback, feedback and more feedback from people actually USING the code. That feedback alone is worth more than all the time you take perfecting your code.

I'll probably get lots of useful feedback if and when I release my port. But I intend on getting a lot fewer bug reports, because I'm taking the time to do the obvious stuff. I have no desire to force everyone to be my code testers - I am hoping to provide a good experience upfront. Goal #1 is that it is fun to play, which implies that it works right, and that it plays smoothly, which implies optimization.

Graf Zahl said:

But one thing is clear: 90% of all optimization I did over the last 11 years was not micro-optimization like trying to see where SSE may be beneficial, but algorithmic changes that had a much wider and much more profound impact. And that's clearly the kind of optimization I look for in the future, not this hogwash about perfectly tailoring the code to the strengths of a certain instruction set. That's never going to be the solution, because first SSE2 was the magic ingredient, then AVX, and who knows what is next (or what is needed on ARM CPUs, for example). That's way too much specialization that takes the focus off the stuff that's really important to make the code faster.

SSEx and AVX are here to stay. And, realistically, almost everyone is running on an x86/x64 processor that contains these instructions. It's not hogwash (whatever the hell that is), and it's not a "certain instruction set" - it's the instruction set used by 99% of your users, isn't it?

Graf Zahl said:

In the grand scheme of things it's useless unless you have some very specific tasks that happen to be time consuming and may benefit from vectorization. But even writing a sine approximation that runs twice as fast only matters if you have thousands of sines to calculate every millisecond.
The biggest weakness of this stuff is that it's utterly hostile toward a memory organization that makes sense in the program's context. It's the perfect tool for programmers who think speed first and everything else a distant second, but the way it works will not help it gain more mainstream support, because it's close to impossible to optimize normal code to make use of it.

If it's more work to vectorize the data than to simply do the goddamn calculations right away it's just a losing proposition. And that's precisely the issue at hand here.

And since optimization is so important for you, here's one number:

The very first GZDoom release, 0.9.1 from 2005, runs Frozen Time at 20 fps on my current system when starting at the exit to the bridge.
The latest 2.1.1 release and the devbuilds run it at 54 fps on the same system. (That'd be the same for 64-bit SSE or 32-bit x87 builds.)

And I managed to do that without adding one single CPU specific optimization, or even assembly code - ever! It was all achieved by endless profiling, hunting for bottlenecks and eliminating them.

Good job! And, you could go further, if you wanted to. But, there's nothing wrong with drawing the line. You are taking a logical approach, by avoiding having multiple code paths conditional upon processor features.

And, there's nothing wrong with crossing that line. And, yes, I am unreasonable with my computer demands at times. I put the user experience above the beauty of the code any day. If I have to write 5 times as much code to speed up something 25%, I'll do it if it gets called often enough.

I am going further, and it's not hogwash, nor is it a waste of my time. I thoroughly enjoy a good game of coop with smooth frame rates, regardless of which map I throw at it.

Why should I feel like I wasted my time? (with my code. This post? Hmmm.)

Graf, I'm not attacking you. You've brought up your vast programming experience (while trying to belittle mine, I might mention), and none of that was ever in question. You've done a fine job, and continue to. I just ask that you tone down the absolute statements. They can be harmful, and, in this case, incongruent with everyone's reality. I'm at the top of the mountain screaming to everyone that I've had good experience replacing a few inner loops with some hand-written SIMD code, but I can't get my message across, because you do your best to squash it. We're on the same team, man - why all the bitterness?

HavoX said:

You've just been comprehensively, unanswerably told. Accept it.

Wow, you're right - thanks, I wasn't sure. I will.


By the time you release, we might all be running on ARM cores ;)

Chu said:

This is very interesting to me, because Frozen Time's bridge makes 3DGE bog down to very low single-digit framerates. For curiosity's sake, where was most of the time spent trying to render the bridge area, and what kind of things did you do to speed it up? As soon as I settle on a decent profiler for my setup, that's one of the first things I want to optimize. Even though our ports are different, it might give me a good idea of what needs to be tackled first. Probably line of sight, maybe BSP culling?



Most optimizations predate that particular level, but the biggest changes I made during that time were:

- instead of using atan2 to calculate a real angle for the clipper, use a pseudo angle which does not give proper direction but only the order of points (rough sketch after this list).
- cache these values because vertices tend to get visited more than once per frame.
- simplify some checks for two-sided walls because they do not need to do everything a one-sided wall needs to do.
- (GL 4.4 only): Use a persistently mapped vertex buffer instead of immediate mode, when possible.
- lots of general streamlining.
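If it helps, the pseudo-angle idea looks roughly like this - an illustration of the concept only, not the actual clipper code:

    // Illustration of a pseudo angle: monotonic in the real angle, far cheaper than atan2.
    #include <cmath>

    double PointToPseudoAngle(double x, double y)   // x, y relative to the viewpoint
    {
        if (x == 0.0 && y == 0.0)
            return 0.0;
        double r = y / (std::fabs(x) + std::fabs(y));  // in [-1, 1], monotonic per half-plane
        return (x < 0.0) ? 2.0 - r : r;                // one full turn covers [-1, 3)
    }

The point is that it preserves the ordering of points around the viewpoint, which is all the clipper needs, so the range comparisons work the same as with real angles but without a single atan2 call.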

If you get into single digits - which doesn't surprise me when I think about the last time I compared port speeds - my guess would be that EDGE suffered from the same problem as all other GL ports: converting the Doom level data into some intermediate structures which cost more time to maintain than simply replicating the software renderer's approach of traversing the nodes and rendering off the actual level data directly. Please don't ask me why no port doing that is fast; I never profiled one.

If you want to optimize I only can give the same advice as what I said before: The only thing that will give you meaningful answers is some profiling. Sometimes the bottlenecks are in places where one might never expect them.

kb1 said:

The boilerplate response is in response to your overbearing "It's a waste of time" declaration. Of course it makes sense to focus your optimization efforts where they will make a big difference. And, it's not all about math. The possible parallelization is worth investigating. As one example, I use SIMD for block transfers in Doom, which beats the hell out of rep movsd and the like. In all fairness, that's for the software renderer.


The 64-bit version actually uses the xmm registers for that - but since the overall speed gain is negligible, it's not something I'd invest further time in. There's just not enough block copying going around to justify rolling out some special code for it.

kb1 said:

I know. My guess is that you've done extensive profiling. One thing I do to make certain areas of the profile stick out is to make contrived examples: 15,000 monster mega maps, maps with tons of sectors, etc. Basically, making any slow area become more prominent. That led to a few optimizations. But, yes, time is typically evenly spread across rendering, and AI.


For rendering profiling I use such maps, of course. Frozen Time is my current favorite for pure rendering performance because the sector count is extremely high and the special effect count is zero. Which of course let me find a few first-grade time wasters (e.g. splitting sidedefs for precise rendering is not necessary most of the time).
For the AI things look different. Most of the time is spent in P_TryMove and whatever it calls. And that code has grown quite complex in ZDoom, which is actually causing some problems with maps like Nuts.

kb1 said:

Did I mention my 30 professional years? Yeah, it's pretty awesome to eliminate a lag, absolutely.


So? I count 24 professional years and 10 more doing it as a hobby. I started on a C64, so I think I know how to optimize. On the other hand, I haven't become stuck in that mindset. I only optimize when a payoff is likely.

kb1 said:

I'll probably get lots of useful feedback if and when I release my port. But I intend on getting a lot fewer bug reports, because I'm taking the time to do the obvious stuff. I have no desire to force everyone to be my code testers - I am hoping to provide a good experience upfront. Goal #1 is that it is fun to play, which implies that it works right,


I did that, too. Taking care of the obvious stuff before releasing GZDoom. It still didn't prepare me for how things went afterward.

kb1 said:

and that it plays smoothly, which implies optimization.


I can only repeat myself: I am not interested in making certain code segments twice as fast. I only follow through with an optimization approach when a quick mock-up shows it's worth it for the program as a whole. Making a code segment twice as fast that ultimately runs about 2% of all time is a 1% optimization, provided that the CPU cache doesn't bite you in the ass afterward (which is the #1 showstopper nearly each time.) That has kept me from going lots of unproductive rounds that may have looked nice on paper. Guess why GZDoom still processes all sidedefs each frame completely? Not because I am lazy but primarily because trying to cache it comes with its own share of problems that render the whole thing mostly ineffective.

kb1 said:

SSEx and AVX are here to stay. And, realistically, almost everyone is running on an x86/x64 processor that contains these instructions. It's not hogwash (whatever the hell that is), and it's not a "certain instruction set" - it's the instruction set used by 99% of your users, isn't it?


I only repeat myself: The instruction sets themselves are not hogwash, but the belief that they constitute a magic ingredient to make a program faster as a whole definitely is.

I can only cite the sine calculation I mentioned above. It wasn't too long ago that I read an article about this stuff, and of course it ended with going the SSE assembly route, bragging that the hand-optimized code can calculate 4 sines in the time it takes to calculate one. Yeah, right - but what it didn't answer is how to orchestrate the source data so that you actually have the possibility of doing that. Which brings us back to the plain and simple fact that vectorization is a very hostile concept when it comes to organizing data. In GZDoom there isn't anything that may profit from it.

kb1 said:

Good job! And, you could go further, if you wanted to.


What makes you think I stopped there? But there isn't much left to squeeze out. Anything I tried that sounded logical didn't really help. The only thing that by now may actually speed things up would be some radical multithreading, but that's a futile approach if it all gets bottlenecked by OpenGL's insistence on running in a single thread. The need for constant synchronization is killing this outright.

kb1 said:

And, there's nothing wrong with crossing that line. And, yes, I am unreasonable with my computer demands at times. I put the user experience above the beauty of the code any day. If I have to write 5 times as much code to speed up something 25%, I'll do it if it gets called often enough.


Well, my belief is that throwing more code at a problem is never the solution. If I know I can get 25% more out of it, I'd try my best to achieve the same without such extreme measures - and most of the time a simpler, more streamlined solution presents itself. See the pseudo angles stuff I mentioned before. That was the ultimate result of the attempt to speed up the clipper by throwing a wall of code at it. Well, the wall of code worked, but the far simpler solution worked even better - to the point that the wall of code could be removed entirely.


I'm not going to pick apart your responses - I grow weary of it. You have this knack of taking my responses and restating them as if you had originally said them, or taking my defensive statement and respinning it as an attack. Again, why the hostility?

For example, you stated: "You clearly do not know about the art of software engineering", to which I replied that I've been coding for 35 years. Then you respond with "So?" as if I had simply been bragging, when, in fact, it was in defense of your unfounded attack.

So, to wrap this up, I will summarize the points you've been glossing over, in bullet form:

1. I consider myself a pretty damn good programmer.
2. I believe that you are as well, and I have never stated otherwise.
3. I have never called you lazy.
4. Typically, the only problem I ever have with you is your absolute statements:
"Throwing more code at the problem is never a solution."
"...vectorization is a very hostile concept when it comes to organizing data."
"In GZDoom there isn't anything that may profit from it."
"...At some point the added work just isn't worth it anymore because the cost-to-benefit ratio is not good enough."

These are just a few of the many examples across many threads.

The amazing thing about this whole interaction is that I've been paying you a compliment this whole time. I've been stating that many people on this site value your wisdom and your words. Which is why I am asking you to accept this fact, and to avoid making absolute statements such as those above.
Like the first one:
"Throwing more code at the problem is never a solution."

Can you change that to this?:
"Throwing more code at the problem is rarely a solution."

I can fully agree with you on the second statement. But when you use the first statement, you effectively nullify my statements. It's like a slap in the face. It's also unnecessary.

I'm asking you to consider that such definitive, absolute, black-and-white statements are never :) rarely true in all cases, and that they can be disrespectful and hurtful. Not looking for a response here, I'm just asking you to consider it. Thanks.

kb1 said:

I know this was a joke, but, seriously, how many versions would you need, if you tested processor capabilities at startup, and set function pointers accordingly? Easy-peasy.


HA! Tell that to the Gentoo ricers. If EVERYTHING isn't compiled with all the "right" flags set, start to end, then it's simply not optimized enough ;-) Generic precompiled A/V and framework libraries?! You've got to be kidding!!!

