Coraline

SSE/SIMD compiler flags - are they needed?


Hi all, I have been compiling the past few versions of 3DGE with Streaming SIMD Extensions enabled in CFLAGS (-msse2, etc.), but a few users were not able to run the engine at all because of them. The new version has those flags omitted so it can run on a wider range of systems.

My question is - does real-world engine performance increase at all with those optimizations? Maybe on newer hardware I could see the potential for such things... but is it needed for all of our Doom source ports?

I cannot tell if it brings any performance advantages, even on CPUs that support SSE instructions. So far, it's only hampered users with older CPUs, which was my rationale for disabling those flags in the new build.


I did some extensive tests with different instruction sets before porting ZDoom to floating point. I could find no relevant difference in performance between x87 and SSE2 instructions. I also never could detect a difference when comparing an SSE2 and an x87 build of GZDoom.

The only thing you need to be aware of is that, due to its poor architectural design, x87 code is far more compiler-dependent when it comes to the precision of calculations. So if you have any concerns about netgame or demo compatibility, it's going to be a problem unless you stick to doubles and make sure up front to set the FPU to 64-bit precision.

But seriously, there's really still people out there using such dinosaur computers? I would have thought that these days non-SSE CPUs have disappeared.


The x64 specification ditches the old-school FPU entirely and decodes all floating point instructions into the SSE pipeline. Performance-wise and results-wise, yes, you're not going to find a difference when testing on an x64 processor. Unless your machine has really bad cache - x87 instruction streams tend to be far bigger than SSE instruction streams, so you may very well get cache penalties and subsequent pipeline stalls just from decoding instructions.

Running SSE on a 32-bit processor? That's where the "streaming" part of the name comes in. The SSE pipeline on x86 has some rather large penalties if you don't order your operations correctly. So the question you need to ask yourself there is whether you intend to support 32-bit processors ever again/at all/etc.

Once you've switched over to an SSE instruction set, there's a ton of optimisation possibilities open to you. GCC has a bunch of intrinsic optimisations that can replace common code with SSE-optimised versions; I'd expect LLVM to have the same. I roll my own hand-written optimised library in Remedy's codebase and inline it everywhere, which has the advantage of explicitly stating what kind of optimisations I expect my code to do instead of leaving it to the mercy of the compiler. Even then, I've yet to see a version of Visual C++ not fuck up standard floating point operations and require disassembling and rewriting simple C++ code - I had a really bad one recently where it cast floats up to doubles for calling standard library functions but forgot to cast them back down, so I had to make it explicit.
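
To make that concrete, here is a minimal sketch (not the Remedy library being described, just the general shape of such hand-written code) of a 4-wide multiply-accumulate using SSE intrinsics; the function name and the alignment assumption are mine:

#include <xmmintrin.h>  // SSE intrinsics

// Sketch: out[i] += a[i] * b[i], four floats per iteration.
// Assumes 'count' is a multiple of 4 and all pointers are 16-byte aligned;
// a real library would also handle the unaligned and tail cases.
void mul_add_sse(const float* a, const float* b, float* out, int count)
{
    for (int i = 0; i < count; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        __m128 vo = _mm_load_ps(out + i);
        _mm_store_ps(out + i, _mm_add_ps(vo, _mm_mul_ps(va, vb)));
    }
}

Written as a plain scalar loop instead, an optimising compiler may or may not vectorise the same thing - which is exactly the "mercy of the compiler" problem.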

SSE 4.1 and AVX start getting into "why did Intel take this long to make an almost-complete instruction set" territory (no, seriously, AltiVec and NEON are fairly complete instruction sets, so how can Intel just keep failing to get it right?). But limiting your code to those effectively limits you to Nehalem or later processors.

An important hardware distinction too: while x87 had internal precision that you could control for calculations, SSE does not have this. 32-bit channels aren't just less accurate than 64-bit channels for storage, they also calculate less accurately. That double/float cast I mentioned was likely the compiler's attempt to make my code more accurate. But on average I've found some double operations to be up to 40% slower than float operations on both Intel and AMD hardware, so it also depends on how much of a speed/accuracy trade-off you're willing to make. Most SSE instructions have _ps and _pd variants (most - there are some really annoying omissions).
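
For anyone unfamiliar with the naming, the _ps/_pd split looks like this (a trivial illustration, not benchmark code from this thread): the float variant works on four lanes per instruction, the double variant on only two, which is part of why doubles can come out measurably slower.

#include <emmintrin.h>  // SSE2: provides both the _ps and _pd variants

// Four single-precision divides in one instruction...
__m128 div_ps_example(__m128 num, __m128 den)
{
    return _mm_div_ps(num, den);
}

// ...but only two double-precision divides, and the instruction itself
// usually has higher latency on top of that.
__m128d div_pd_example(__m128d num, __m128d den)
{
    return _mm_div_pd(num, den);
}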

Also not to be overlooked when using SSE optimisations is the fact that it has an integer pipeline. Even with SSE 4.2, which lacks AVX's gather/scatter functions, I made string compares 4 times faster. I expect gather/scatter would make that faster still, and I should really write a few more specialised cases for the compare function, such as string compares starting from 16-byte boundaries - the one I wrote assumes neither string is on the right boundary, thanks to substring comparisons etc. The render team also makes use of the integer pipeline quite a bit. I even rejigged a hashing function to take template parameters solely so I could stick SIMD integers in there.
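
As an illustration of what the integer pipeline buys you (a generic sketch, not the Remedy code being described), a 16-bytes-at-a-time equality test with SSE2 looks roughly like this:

#include <emmintrin.h>  // SSE2 integer intrinsics
#include <cstddef>
#include <cstring>

// Compare two buffers 16 bytes per iteration; fall back to memcmp for the tail.
// Unaligned loads are used, so no assumption is made about buffer alignment.
bool equal_sse2(const char* a, const char* b, std::size_t len)
{
    std::size_t i = 0;
    for (; i + 16 <= len; i += 16)
    {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // All 16 byte comparisons happen at once; movemask packs them into 16 bits.
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF)
            return false;
    }
    return std::memcmp(a + i, b + i, len - i) == 0;
}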

GooberMan said:

Running SSE on a 32 bit processor?


Which raises the question: what is considered more important?

Making it work on ALL old CPUs, or making it work better on a limited subset of old CPUs?
As things stand, the vast majority of users these days have a 64-bit CPU, where there is no noticeable performance difference between x87 and SSE instructions.

The CPUs we are talking about here are old anyway and the systems they are in are far more limited by other factors, like old and slow graphics cards.

As for double vs. float, I've yet to see the 40% difference. All the tests I made last week to decide whether to use floats or doubles in ZDoom showed that it doesn't matter. I could not see any difference on my 4 year old CPU, even with larger arrays of data.

My standpoint on the entire matter is clearly to make it work well on modern hardware and, for those old dinosaur systems, just ensure that it runs. You won't get much mileage out of them anyway, and I consider optimizing for them a waste of time. I don't have one, so I can't test on one either.

Graf Zahl said:

As for double vs. float, I've yet to see the 40% difference.

Divides are a big one. Although, strangely, for inverse operations an approximation method provided by Intel (which uses the _mm_rcp_ss instruction and does some iterations to restore almost all of the lost accuracy) runs slower on desktop Intel processors but faster on the AMD Jaguar processors. So, tl;dr, there's still a lot of processor-dependent stuff. But that divide is still 40% slower on every processor I've tried.
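
For reference, the trick being described is roughly this (a generic sketch of the rcpss + Newton-Raphson refinement, not Intel's exact code):

#include <xmmintrin.h>

// Approximate 1/x: rcpss gives a ~12-bit estimate, one Newton-Raphson step
// (r' = r * (2 - x * r)) recovers most of the remaining precision.
// Whether this beats a plain divide depends on the processor, as noted above.
float fast_reciprocal(float x)
{
    __m128 vx = _mm_set_ss(x);
    __m128 r  = _mm_rcp_ss(vx);
    r = _mm_mul_ss(r, _mm_sub_ss(_mm_set_ss(2.0f), _mm_mul_ss(vx, r)));
    return _mm_cvtss_f32(r);
}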

One thing we wanted to do (but didn't have time to try) was provide different libs for different architectures. As I implied, I can do more with AVX than with SSE 4.2, but it's such a small share of the desktop market that it didn't justify the time/effort to do so.

Graf Zahl said:

But seriously, there's really still people out there using such dinosaur computers? I would have thought that these days non-SSE CPUs have disappeared.

That would be me. Still running a 32-bit Athlon XP, which has support for SSE but not SSE2, the last of its kind. I'm hoping I will get something newer soon, as it is indeed a dinosaur, but these days it's mostly a Doom and internet machine so it's hard to justify spending too much on something new.

GooberMan said:

... SSE stuff

Some very good points, especially about the compiler trying its best to use SIMD without losing precision - for example, 64- vs. 80-bit calculations. Done right, you can achieve things like 2 or 4 multiply/adds PER CYCLE, simultaneous SIMD and non-SIMD operations, super-fast block moves, etc.

You can get performance out of even a 32-bit processor with SIMD instructions, but there are some strict rules to follow if you actually want that advantage, and the older you go, the stricter the rules become.

GooberMan lays down the foundation for some general rules for using SIMD tech:

  • Roll your own assembly for SIMD. It's difficult enough to willingly convert your data to be SIMD-friendly; the compiler would have to be *really* smart to do the right thing without massive help from the programmer, IMHO.
  • Plan your data ahead of time, to get simultaneous operations to work.
  • Read the processor specs carefully to avoid alignment issues, pipeline issues, etc.
  • Interleave commands so that instruction #2 does not depend on instruction #1's results, to support out-of-order processing. This applies to all programming, not just SIMD.
  • Provide non-SIMD versions of the code, and test for processor capabilities at the beginning of the program (see the sketch after this list). This allows your code to run on systems that do not support your SIMD instructions.
  • Profile. Rewrite. Profile again. And again. Test it on as many processor makes and models as you can find. Possibly provide different routines for each type.
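
A minimal sketch of that last capability-check-and-fallback point (the function names and the scaling workload here are hypothetical; this uses the GCC/Clang builtin since that's what 3DGE currently builds with, while MSVC would use __cpuid from <intrin.h>):

#include <emmintrin.h>  // SSE2
#include <cstddef>

// Scalar fallback path.
static void scale_scalar(float* data, float factor, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= factor;
}

// SSE2 path: four floats per iteration, scalar tail.
// On a 32-bit x86 target this function would live in a file compiled with
// -msse2 (or carry __attribute__((target("sse2")))) so the rest of the
// binary stays runnable on non-SSE2 CPUs.
static void scale_sse2(float* data, float factor, std::size_t n)
{
    const __m128 vf = _mm_set1_ps(factor);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), vf));
    for (; i < n; ++i)
        data[i] *= factor;
}

using scale_fn = void (*)(float*, float, std::size_t);

// Query the CPU once at startup and hand back the best implementation.
scale_fn pick_scale_function()
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? scale_sse2 : scale_scalar;
}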

Unfortunately, it's just not an easy thing to accomplish. The Intel compiler does a lot of this work: it tests the processor type and links in optimized routines for each. But I have read that it purposely chooses the worst route if you're running on AMD - at least it used to.

It really is the kind of thing that modern compilers should be doing - they should provide different code paths based on capability, and they should be knowledgeable about the capabilities and quirks of all recent processors. But they don't really do the job, so the task falls to the programmer.

@Graf: Not seeing a difference between single and double operations is a clue that the compiler is maybe 'thunking' values around to guarantee some stated accuracy level while still trying to use the lower SSE precision. It may indeed promote single to double to avoid the precision loss. You get into a situation where you want to tell the compiler "I don't mind some precision loss, I want speed", which is ironic - that's why you wanted SSE in the first place!

The number of operations that can really benefit from SIMD is very limited. You have to model your data to fit the instructions, which is kinda backwards from more traditional approaches. Telling the compiler to "just use SIMD extensions" leaves a lot to be desired.

kb1 said:

@Graf: Not seeing a difference between single and double operations is a clue that the compiler is maybe 'thunking' values around to guarantee some stated accuracy level while still trying to use the lower SSE precision. It may indeed promote single to double to avoid the precision loss. You get into a situation where you want to tell the compiler "I don't mind some precision loss, I want speed", which is ironic - that's why you wanted SSE in the first place!



Certainly not. Yes, GZDoom's rendering code is compiled with the 'optimize for speed' option, but when I don't see any performance difference between doubles, floats and fixed point (to be precise, when running a geometry-heavy map like Frozen Time the difference is around 0.1 ms per frame), then it's a clear indicator that these calculations are never the actual bottleneck.

It doesn't even make a difference whether I store all the relevant data as floats or doubles - there's no real difference in performance, even though that should theoretically show up as increased cache misses.

And Visual Studio never tries to counter the precision loss of single precision math.


I said 'never in (G)ZDoom', meaning that all these considerations are more or less irrelevant because they never become the limiting factor for performance.

In ZDoom that's clearly the speed of the software renderer, and in GZDoom all the floating point math is still intermixed with lots of integer stuff and, of course, ultimately calls into the OpenGL API.

I guess the same will be true for other Doom ports as well.

Graf Zahl said:

Certainly not. Yes, GZDoom's rendering code is compiled with the 'optimize for speed' option, but when I don't see any performance difference between doubles, floats and fixed point (to be precise, when running a geometry-heavy map like Frozen Time the difference is around 0.1 ms per frame), then it's a clear indicator that these calculations are never the actual bottleneck.

It doesn't even make a difference whether I store all the relevant data as floats or doubles - there's no real difference in performance, even though that should theoretically show up as increased cache misses.

And Visual Studio never tries to counter the precision loss of single precision math.

There *should* be a difference. But it may be minimal - it depends on what FP ops you're doing, and how many, of course. Doesn't GZDoom still use a lot of fixed point, or did you convert all of that over? (Been a while since I looked.)

Didn't know that about Visual Studio - I seem to remember, in some case, it seemed to be going out of its way to preserve precision. Maybe that's a compiler option?

kb1 said:

There *should* be a difference. But it may be minimal - it depends on what FP ops you're doing, and how many, of course. Doesn't GZDoom still use a lot of fixed point, or did you convert all of that over? (Been a while since I looked.)


Half of the coordinate processing stuff in the renderer is still fixed point, especially a lot of height comparisons.

But why *should* there be a difference? If there is one, it's so small that it gets drowned out by normal timing variations. At least I can't measure it, and even going at it with a profiler does not help, because none of the relevant code accounts for any significant percentage.

kb1 said:

Didn't know that about Visual Studio - I seem to remember, in some case, it seemed to be going out of its way to preserve precision. Maybe that's a compiler option?


Yes, it is an option. In fact ZDoom and GZDoom use different settings for different code: anything that may affect netgame sync is compiled with 'precise' and avoids the CRT's math functions in favour of our own implementations (slower than the CRT, but they ensure the same results on all tested platforms); the rest optimizes for speed and uses the CRT for math.


3DGE doesn't have any fixed point, so maybe I will compile a version with those flags and just profile it. It's worth a shot at least. ^_^

We're using GCC/MinGW though. I want to convert the Windows makefile over to use VS instead for that platform, but it just seems like a lot of work. Also, I've just gotten so used to the command line, since that is what EDGE originally used.

Learned a lot in this thread, I really appreciate everyone chiming in!


If there is a noticeable difference, you can include two binaries in the distributed zip. I remember some commercial games doing that to retain Athlon XP compatibility back in the day.

Chu said:

Using GCC/MinGW though, I want to convert the Windows makefile over to use VS instead for that platform, it just seems like a lot of work.



If you are doing that, best consider using something like CMake.

Graf Zahl said:

...But why *should* there be a difference? If there is one, it's so small that it gets drowned out by normal timing variations. At least I can't measure it, and even going at it with a profiler does not help, because none of the relevant code accounts for any significant percentage.

Double math is slower than single math, but not by much. It's really obvious with division, but you've got to be doing a whole lot of double math to notice a difference these days.


Would it be worth releasing a separate executable for CPUs that support those extensions? I guess it couldn't hurt, but it would basically mean packing in two versions. In other words... over-complicated, or worth it? ;)


We are talking about a small number of 10-year-old systems where it may make a noticeable difference. On anything recent it really doesn't matter, and you can just use x87 instructions.

Ultimately this ranks for me on the same level as Visual C++'s XP and non-XP toolsets: unless you know you can't support the older system, there's no point excluding it, because you won't gain much of an advantage.

Before converting ZDoom to floating point I made extensive benchmarks of math code (meaning stuff like software-implemented sin, cos, tan, etc.) to decide what to support. And while I could see some performance difference between 32 and 64 bit, I couldn't see anything that significantly distinguished 32-bit x87 from 32-bit SSE2 - to the point where I was unsure the correct code was even being generated, and had to check the disassembly in the debugger to confirm.

And if you decide to do it, first make sure that your code uses floating point so extensively that it actually should show up in the measurements.
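
Something like this quick-and-dirty harness is usually enough to check; just a sketch, with the loop body standing in for whatever math-heavy code is actually under test:

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<double> v(1 << 20, 1.0001);
    double sum = 0.0;

    auto t0 = std::chrono::steady_clock::now();
    for (int pass = 0; pass < 100; ++pass)
        for (double x : v)
            sum += x * x;    // placeholder floating point workload
    auto t1 = std::chrono::steady_clock::now();

    // Print the checksum too, so the compiler can't optimise the loop away.
    std::printf("%.3f ms (checksum %f)\n",
        std::chrono::duration<double, std::milli>(t1 - t0).count(), sum);
}

Build it once per instruction set and compare; if the numbers land within noise of each other, that matches what's being reported in this thread.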

All I can tell you is that the only piece of real-life code I could test over the years that exists in both forms (ZDoom's node builder has the one function eating the vast majority of its time in both x87 and SSE2 versions) never made a difference for me on my last two systems, the older one having been bought in 2007.

I got one older computer from 2004 lying around but that's too old to even have SSE2.


So we get it. You test code with minimal floating point usage, and you test x87 on a processor that decodes it to SSE instructions. Great.

Here's a simple checklist that one should follow instead.

  • Are you compiling for x64? SSE2 is the minimum native floating point model as defined by the hardware specifications, and compilers will target that by default.
  • Are you compiling for x86, but your user base is mostly x64 CPUs? Worth switching exclusively to SSE2 anyway, but you'll want to be careful that the smaller instruction stream size doesn't start doing unexpected things like invoking the branch predictor's wrath etc.
  • Are you compiling for x86, and (let's say) 40% minimum of your userbase runs on x86? Don't switch. But don't be afraid to have it enabled and usable with hand-written optimised code, and definitely don't try hand optimisation unless you have x86 hardware to develop on.

In reality, if you're going to release two versions of your executable, releasing 32-bit and 64-bit versions together makes that entire checklist moot thanks to the first point. And I'll reiterate hand-writing optimised code: I know you're doing some more modern 3D things in 3DGE these days. SSE 4.1 has a dot product instruction, for example; earlier SSE versions need a multiply, an add and some shuffles, but that's still quite a bit cheaper than doing several scalar multiplies and adds.
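
For example, here's what that dot product looks like both ways (a small sketch; the masks and shuffle constants are how I'd write it, not code from 3DGE):

#include <smmintrin.h>  // SSE4.1 (includes the earlier SSE headers)

// SSE4.1: one instruction. Mask 0xF1 = multiply all four lanes,
// put the sum in the lowest lane of the result.
float dot4_sse41(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}

// Pre-4.1 fallback: multiply, then fold the lanes together with shuffles.
float dot4_sse2(__m128 a, __m128 b)
{
    __m128 m  = _mm_mul_ps(a, b);
    __m128 sh = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 s  = _mm_add_ps(m, sh);                        // pairwise sums
    sh        = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 0, 3, 2));
    return _mm_cvtss_f32(_mm_add_ps(s, sh));              // total in every lane
}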

GooberMan said:

to have it enabled and usable with hand-written optimised code,



Yes, I do get it. The only way to make SSE useful is with hand-written optimized code.

It might be fine and valid for your specific use case, but the issue here is clearly whether to provide an alternative binary compiled from the same C/C++ source, not whether to go through the code with a fine-toothed comb and replace stuff with specially tailored assembly to squeeze a bit more juice out of it.

As I said, the real question should be: is it worth doing this for CPUs that are 10 years old by now? Disregarding the 32-bit OS I had on it in the beginning, I've been using 64-bit CPUs for 9 years now, so that should give a clue how old a system has to be before you might see an advantage from using SSE2 in a 32-bit build.

Because I have yet to see a single piece of evidence that on MODERN hardware it makes even a hint of a difference - and that's where I'd normally put my focus.

Graf Zahl said:

Yes, I do get it. The only way to make SSE useful is with hand-written optimized code.

Incorrect. That is the only way to make SSE useful on x86 hardware. On x64, it's simply the native floating point instruction set, combining single-stream and packed float/double/integer functionality. That results in smaller instruction streams, less work for the instruction decoder, and more registers available for the compiler to use for more complicated operations; and depending on which compiler options you use, it can automatically replace common operations with optimised versions that take advantage of specialised SSE instructions.

This is all stuff I've stated in this thread, pay attention.

And sure, my specific use case here is "actually using floating point functionality on a wide range of hardware", so feel free to continue brushing aside my advice and experience if you aren't using floats in your code or if you don't see the point in compiling for x64.

GooberMan said:

Incorrect. That is the only way to make SSE useful on x86 hardware. On x64, it's simply the native floating point instruction set, combining single-stream and packed float/double/integer functionality. That results in smaller instruction streams, less work for the instruction decoder, and more registers available for the compiler to use for more complicated operations; and depending on which compiler options you use, it can automatically replace common operations with optimised versions that take advantage of specialised SSE instructions.


... and yet, it doesn't make one bit of a performance difference in GZDoom overall. Whether I use a 32-bit x87 build, a 32-bit SSE2 build or a 64-bit build with all its added registers, the overall performance stays the same, with minimal deviations in single functions that mutually even each other out.

And that despite the fact that my own benchmarks show that 64 bit math code can be up to twice as fast as the same code built in 32 bit, SSE or not.

64 bit makes sense for having more memory available, but unless you do some extensive number crunching it looks like other factors in the engine are vastly more relevant for performance than which FP instruction set is being used.

With regards to 3DGE, it means not blindly following some generic advice but testing up front whether the added work is beneficial or not. And here my point still stands: providing a 32-bit SSE2 build has no benefit. That was the ONLY point I tried to make; I said nothing about 64 bit.

For GZDoom I provide a 64 bit build - not because it performs better but because this is the most future-proof option. 32 bit will die out sooner or later so it's pointless to focus development there.

Graf Zahl said:

Providing a 32-bit SSE2 build has no benefit. That was the ONLY point I tried to make; I said nothing about 64 bit.

...so that checklist I wrote covered your point? Why did you even reply to it in an argumentative manner then?


Look who started it.

That's what I was saying all along, and yet you were responding in a manner that somehow came across as if I was talking bullshit.

GooberMan said:

...so that checklist I wrote covered your point? Why did you even reply to it in an argumentative manner then?

You need an attitude adjustment.


The *proper* viewpoint is not "Aw, shucks, I have to build an old crappy 32-bit version for my customers with ancient systems." Rather, it should be: "Oh, goody, I can build a super high-performance version that uses the latest processor technology."

Building 2 versions provides you the justification to have a highly optimized modern version, with a fallback for those that need it. Yikes. Way to take a rose and make it look like a booger...

kb1 said:

The *proper* viewpoint is not "Aw, shucks, I have to build an old crappy 32-bit version for my customers with ancient systems."


But mostly it IS the old and crappy version that holds things back... :(

Graf Zahl said:

But mostly it IS the old and crappy version that holds things back... :(

Did you read what I typed? Try again. Here: I'll make it easy:

Make two versions (at least): a compatibility version, and a bleeding-edge, full-featured version. Then nothing is "holding anything back", and you get the best of all worlds. Unless your view of "being held back" is that you have to compile twice - can't help you there.

