wesleyjohnson Posted March 30, 2017 (edited) This started with something I saw (and I no longer know where) that stated that by default a literal such as 1.25 is a double, and will be converted at run-time to be stored or used as a float. So I went about finding any float literals that were not marked with the f suffix. I feel like I am shooting myself in the foot asking this here, but the answers are so interesting. I wrote a test program with a mixture of default literals and explicit float literals, compiled it on gcc 4.5, and dumped it as assembly output. The assembly shows exactly the same code for both float literals and double literals. I used -O0 to avoid optimization, but gcc still seems to be losing the run-time conversion of double to float. But in the DoomLegacy code, changing a 1.25 to 1.25f makes the code smaller, so something is happening. But ... changing a (f1 < 0.01) to (f1 < 0.01f) makes the code larger, consistently. I am comparing the sizes of stripped binaries, so it should not be the debugging information. I have yet to get a look at the assembly of this because it is a bit more difficult to dump DoomLegacy into assembly output and find the interesting spot. I seek enlightenment, or at least a rationale (in the general case).
float f1, f2, f3;
f1 = 1.25;   // defaults to double ??
f2 = 1.25f;  // nice, code gets smaller
f3 = 0.0;    // does it care when it is zero ?
f2 = f1 * 2;         // does it care when multiplying by an int
if( f1 < 0.01 ) ...  // should I change this to 0.01f ?
if( f2 < 0.01f ) ... // what the heck, code got 30 bytes larger for every one of these I fix
if( f3 < 0.0 ) ...   // does it care about the f when the test is against zero
Edited March 30, 2017 by wesleyjohnson : wording
kb1 Posted March 30, 2017 You can't really get any meaningful info from the size of the binary with such a small test case, if at all. In general, the compiler does lots of things, some of which are not directly related to a specific line of code, which will produce a counter-intuitive result. Here are just some off-the-top-of-my-head examples, which may or may not apply to your example:
- Code and data alignment
- Floating point setup code (like MMX floating-point reset, or setting the rounding mode)
- Variable promotion (some languages will, for example, promote an int to a float, or even both ints to floats, before doing, say, a multiplication)
By including a decimal point, the compiler knows that the constant is not an integer, so it is free to make the R-value itself either single or double, or even 80-bit. But it must then convert to be able to store the value. But, back to the subject of .EXE size: the EXE header structure works in blocks of data. If you compile a small program, and then add a small amount of code, it is possible that the binary size will not change at all, because it still fits in the block. Furthermore, newer compilers that do "whole program optimization" are free to rearrange things in complex ways. It's just not a good way to measure. And, finally, larger machine language code is frequently faster than tightly packed code! Modern x86 processor instructions can be up to 15 bytes long! The processor is able to pull in that much code and decode the instruction quickly. Now, code cache size does present an argument for smaller compiled code size. But these rules are so complex that trying to analyze an entire process is a massive project. However: if you really want to know the answer, it can be done, but it's going to take a little bit of work. You can write a program that generates A LOT of similar code with the "f" and without the "f", and check the binary size of the result, and that should provide a more meaningful result.
If the code is large enough, you should be able to counteract any alignment/block size/optimization issues. But, whatever code you generate, you must make sure the compiler does not optimize it out. All the constants you create must be used in this test program, and must be used in a way that the clever compiler cannot factor out and ruin your tests. And finally, I must state the obvious: a look at the assembly should provide the answer, much more easily than my suggestion above. And I must ask: are you trying to figure out if the compiler is storing doubles vs. floats, or are you asking because you are actually concerned about the final compiled binary size? I ask because bigger code != slower code on modern processors. If you do get some results, please post them: I'd be interested to know what those results are. Good luck.
Ladna Posted April 6, 2017 It might be that before, it could store it in the binary as the literal ".01", but when you make it a float literal it has to store the full 32-bit floating point representation? That doesn't mean 30 bytes, I guess, but there could be cascading effects. Dunno. You can have GCC dump the assembly for a single file.
kb1 Posted April 7, 2017 Did you get any more info on this? To really pin it down, I'd try something like this: write a program that generates a .C file, called "Listing1.C", that looks kinda like this: Spoiler
// Listing 1 - No 'f'
#include <time.h>
int main(void)
{
    float f1;
    time_t clock1;
    time_t clock2;
    time(&clock1);
    f1 = 1 / (float)clock1;   /* cast the value, not the pointer */
    /* Make A LOT of these lines using a program that generates them.
       The literal used should be chosen randomly.  We are trying to
       determine what happens when a float constant has an explicit 'f'
       or not, and the effect that that has on the compiled output.
       So, the random literals we choose should not require more than
       float accuracy to express literally.  We do not want an implicit
       conversion to anything except float.  Perhaps the literals
       should be generated as strings of a specific length:
       "0." & xxxxx, where xxxxx = 5 decimal digits. */
#define ENTRY (1)
#define VALUE (0.12246)
    if (f1 == VALUE) { time(&clock2); return ((int)clock2) / ENTRY; }
#undef ENTRY
#undef VALUE
#define ENTRY (2)
#define VALUE (0.58909)
    if (f1 == VALUE) { time(&clock2); return ((int)clock2) / ENTRY; }
#undef ENTRY
#undef VALUE
    ...
#define ENTRY (1000)
#define VALUE (0.08716)
    if (f1 == VALUE) { time(&clock2); return ((int)clock2) / ENTRY; }
    return 0;
}
Some notes on my theory of operation: most likely, if you use the same pseudo-random constants and turn off optimizations, these two listings will produce identical output. I think maybe, with your listing above, the mismatched use of the float suffix 'f' thwarted an optimization that allowed the compiler to do all the calculations at compile time, reducing the output to a single unconditional path. In other words, I think the compiler could have reduced your code to this:
float f1 = 1.25f;
float f2 = 2.5f;
float f3 = 0;
In fact, it could then remove all the code, because the vars are not used (with what you listed). None of the IF statements are taken. But the mixed usage of 'f' may have prevented it from discovering this possibility.
Sometimes, the simplest setup confuses an otherwise brilliant compiler. But it's just a guess :) By specifying 1,000 entries, the code should "spill into" multiple "blocks", regardless of whatever definition of "block" you may use, thereby preventing a block-allocation type of obfuscation of code growth. Also, any code length differences are multiplied by 1,000, making them very obvious. The use of the first time function guarantees that compile-time calculation/code reduction cannot occur in the comparison. The use of the time function inside the blocks prevents the creation of a "virtual return table", which is highly unlikely anyway. But, if I had not included the "/ ENTRY", the compiler could have generated a single "call time(&clock2); return clock2;" code block and jumped, or "fallen into", it, which I wanted to prevent. The point I'm trying to make is that the listings above should be next to impossible for the compiler to optimize, even if you were not turning off optimizations. What remains is the processing of your literals, 'f' or not, with a fixed amount of overhead that can be subtracted out. The other, more subtle point is that trying to manually optimize for code size just isn't very effective anymore. Modern compiler writers are typically very smart, they have knowledge at a per-processor, per-instruction level, and they typically tailor their compilers to use that information to do the right thing, in general. (Not that I can't sometimes do better with hand-written assembly, but that's a different discussion :) It actually can be relevant in this discussion: for hand-written assembler to be justified, and to actually make a significant difference these days, you need to use different, per-processor approaches, which suggests different code paths per architecture.
For the past 15 years or so, processors handle certain code sequences so differently that code optimized for a Pentium II runs sub-optimally on, say, an i3 or i5, or, more importantly, on different manufacturers' offerings, and that's just for x86/x64. So, naturally, if you support multiple code paths, your code has become bigger - a lot bigger - yet performs better. Compiled size is no longer inversely proportional to code performance, and, often, the opposite is true. Another note about modern compilers: the optimization techniques offered lean towards performance, not code size. I know they present that as a simple choice of this or that. But, usually, optimizing for size just prevents some aggressive code performance techniques from being used: techniques like inlining, loop unrolling, etc. Now, noticing a huge spike in binary size may be a useful indicator that you've just pulled in an unnecessary library unintentionally, due to a bug. But, other than that, IMHO binary size by itself is too coarse a measurement to derive much useful (actionable) info from. Though, I have to admit that I do want to see the results of those 2 code listings I posted! I hope you see where I'm coming from, but mainly, I hope you find them (or a similar idea) helpful.
wesleyjohnson Posted April 7, 2017 Sorry, but I am really busy lately. I think the clock test program approach is similar to the little test program that I already wrote. My test program has some float values coming from function calls, with obfuscating statements to discourage inlining. I used -O0 to keep the optimizer from confusing things. I looked at the assembly of that test program and it showed that there was no difference at all in the assembly code. The compiler was generating the same instructions when the literal was 1.25 as when the literal was 1.25f. This right off disagrees with the report that plain float literals are actually double, and will be converted to float at run-time. I tested with the GCC 4.5 compiler, with no other switches. DoomLegacy uses some other switches that could be affecting this, and while I doubt it, there is not much else to suspect. Just using the editor to duplicate some of the code blocks would be faster than writing a program to generate them, but I think the result will be the same: no difference. In the DoomLegacy base code, I changed most of the float literals that were being used on float variables to have the f. These included assignments, and expressions where all the operands were float literals. There was a significant reduction in the DoomLegacy code size. That is usually a sign that something was improved that the optimizer was not finding on its own. The significant weirdness was the inequality tests. Change any float inequality to use an explicit float literal and the code size would bump up by about 30 bytes. And it happened for every float inequality I changed, in any file. It is consistent and also cumulative, which suggests some kind of actual effect in the code beyond the block allocation effect. Usually the code size will not change (the block allocation effect), but I think most of that is in handling variables.
For many edits the code size will not change, then another edit will change it by a large chunk. I have been seeing that for years. But then there are some edits, like fixing 5 identical references (like plane.lighttable[lightval]), where assigning the reference once to another local var gives an immediate code size reduction. This tells me that the compiler, even with optimization, is not finding these common subexpressions, so it was duplicating the code. That is why I always check the code size: nothing else I could look at (short of looking at assembly output) could tell me what effect manually handling the common subexpression would have. The GCC info suggests using array indexing is better because otherwise the optimizer is inhibited by ptr references. My results seem to indicate that the explicit ptr gives smaller code (which must be because it is more direct, with less duplicated effort). The optimizer was not doing as well as my hand optimization. There is not much immediate benefit to getting an answer right now, and I have my hands full at the moment. The code will be "correct" either way, and may in fact be identical, as far as the assembly I have seen so far suggests. It makes it more difficult not knowing what the code size bump actually means. Using the code size is much faster and of finer resolution than doing some kind of regression frame rate test. It is convenient, as long as you have some idea of the other influences on the code size, and how to ignore them as noise. I will report back if I find anything else significant, maybe in 2 to 3 weeks.
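The manual common-subexpression fix described above looks roughly like this. Only the expression `plane.lighttable[lightval]` comes from the post; the types, field layout, and function names here are invented for illustration:

```c
/* Sketch of hoisting a repeated lookup into a local variable. */
typedef struct {
    unsigned char **lighttable;   /* invented layout for this sketch */
} visplane_t;

/* Repeats the double indirection on every use. */
void shade_row_before(visplane_t *plane, int lightval,
                      unsigned char *row, int n)
{
    for (int i = 0; i < n; i++)
        row[i] = plane->lighttable[lightval][row[i]];
}

/* Hoists plane->lighttable[lightval] once; the loop body becomes a
   single indexed load, which is what the hand optimization buys when
   the compiler cannot prove the pointer is loop-invariant. */
void shade_row_after(visplane_t *plane, int lightval,
                     unsigned char *row, int n)
{
    unsigned char *colormap = plane->lighttable[lightval];
    for (int i = 0; i < n; i++)
        row[i] = colormap[row[i]];
}
```

The compiler often cannot hoist the lookup itself because, through the `row` pointer, the stores might alias `plane->lighttable`; the local copy asserts that they do not.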
Graf Zahl Posted April 7, 2017 You really should let the compiler produce assembly output to check these things. Floating point logic can easily be very different from what you may expect when writing code. One example: in Visual Studio, with SSE2 code generation, it doesn't matter one bit if you give a constant an 'f' postfix. All the compiler will do is take the value and then internally convert it to the precision the destination variable requires. Otherwise it'd have to emit a conversion instruction that costs additional processing time. When creating code for x87, things will get even more tricky when using single precision floats. In order to ensure that the entire calculation remains in single precision when using the 'precise' floating point model, the value will frequently be stored and re-read to and from memory. So depending on various settings, even seemingly trivial differences can create very different code. And worse, what may be efficient for one piece of hardware can easily be the total opposite for other hardware.
kb1 Posted April 7, 2017 (edited) 1 hour ago, Graf Zahl said: You really should let the compiler produce assembly output to check these things. Floating point logic can easily be very different from what you may expect when writing code. One example: in Visual Studio, with SSE2 code generation, it doesn't matter one bit if you give a constant an 'f' postfix. All the compiler will do is take the value and then internally convert it to the precision the destination variable requires. Otherwise it'd have to emit a conversion instruction that costs additional processing time. When creating code for x87, things will get even more tricky when using single precision floats. In order to ensure that the entire calculation remains in single precision when using the 'precise' floating point model, the value will frequently be stored and re-read to and from memory. So depending on various settings, even seemingly trivial differences can create very different code. And worse, what may be efficient for one piece of hardware can easily be the total opposite for other hardware. Very true - the compiler's attempts to preserve precision have non-intuitive effects on the generated code. Things get more complicated with SSE2 vs. MMX/x87, at least with doubles, because the older technology uses a full 80 bits of precision to try to maintain accuracy on 64-bit values. If you ask for "precise" calculations, the compiler can try to force all calculations through 80-bit considerations. "Precise" also prevents some optimizations that look algebraically correct. For example, a*(b+c) should be able to replace (a*b + a*c), but the compiler won't always do it, since floating-point calculations are not precise enough to ensure the exact same result. The thinking there is that an optimization should have absolutely no effect on the result, even if it's a single bit in a huge floating point number. In that case, "precise" doesn't mean 'more precise', it means exactly match the original, unoptimized formula's result.
3 hours ago, wesleyjohnson said: Sorry, but I am really busy lately. I think the clock test program approach is similar to the little test program that I already wrote. My test program has some float values coming from function calls, with obfuscating statements to discourage inlining. Used -O0 keep the optimizer from confusing things... The "clock test" program was written that way to ensure that the compiler could not optimize the code. Since you use this as a diagnostic, here's what I suggest: when you get some time, write the smallest program that demonstrates this 30-byte difference when you add 'f'. Then check out the assembly, and find the discrepancy. Maybe toggle some of the compiler options and see if it gets worse, or goes away. Because this is a vital diagnostic for you, knowing what it's telling you is also vital, if that makes sense. And, yes, if you do find out what's going on, please post it here. It's bound to be interesting.
wesleyjohnson Posted April 9, 2017 Couldn't ignore the thing for very long, ... The compiler uses the FCOMPL instruction for the 64-bit compares, but is avoiding the FCOMPS instruction for 32-bit compares. Is there some buggy i386 CPU that has a bad FCOMPS instruction? That would explain why the other test compiles do not show it; the compiler ARCH target is for my native Athlon64, or i686.
Compiler: GCC 4.5.2 on Linux 2.6
Source: DoomLegacy 1.46.3
Assembly: objdump -d
Assembly comparison: diff -U4
diff -U4 markings:
space : the same in both the first and second file
- : as appears in the first file
+ : as appears in the second file
=============
Change a 1.0 to 1.0f .
Orig: skysow03 = 1.0 + ((float) angle) / (ANG90 - 1);
Tst1: skysow03 = 1.0f + ((float) angle) / (ANG90 - 1);
Orig size: 1335076
Tst1 size: 1335076
diff in objdump listing: The only difference was a data block that is buried in the code of D_DoomMain ... This same area changes for every compilation, with similar differences, even for two compiles of the orig. Looks to be some text about date/time. ("pr 6 201716:13:50") ("pr 8 201716:51:13")
* Not every change of literal has an effect.
* There are special instructions for loading +1.0 (FLD1) and +0.0 (FLDZ), as 80-bit values. Both 1.0 and 1.0f will use the same instruction.
==============
A small function with float literals.
Orig:
static void R_Generate_gamma_black_table( void )
{
    int i;
    float b0 = ((float) cv_black.value ) / 2.0;  // black
    float pow_max = 255.0 - b0;
    float gam = gamma_lookup( cv_usegamma.value );  // gamma
    gammatable[0] = 0;  // absolute black
    for( i=1; i<=255; i++ )
    {
        float fi = ((float) i) / 255.0;
        put_gammatable( i, b0 + (powf( fi, gam ) * pow_max) );
    }
}
Tst2:
static void R_Generate_gamma_black_table( void )
{
    int i;
    float b0 = ((float) cv_black.value ) / 2.0f;  // black
    float pow_max = 255.0f - b0;
    float gam = gamma_lookup( cv_usegamma.value );  // gamma
    gammatable[0] = 0;  // absolute black
    for( i=1; i<=255; i++ )
    {
        float fi = ((float) i) / 255.0f;
        put_gammatable( i, b0 + (powf( fi, gam ) * pow_max) );
    }
}
The put_gammatable() call is inlined.
static void put_gammatable( int i, float fv )
{
    int gv = (int) roundf( fv );
    if( gv < 0 ) gv = 0;
    if( gv > 255 ) gv = 255;
    gammatable[i] = gv;
}
With R_Generate_gamma_black_table auto-inlined:
Orig size: 1335076
Tst2 size: 1335044
R_Generate_gamma_black_table() was originally inlined, so this test was done using __attribute__((noinline)) on it.
With R_Generate_gamma_black_table marked noinline:
Orig size: 1334993
Tst2 size: 1334961
Many code addresses changed, so ignore all those. Where an instruction differed only in address, only the orig was kept. Source code was added by hand.
@@ -32141,13134 +32141,13148 @@ 08064d10 <R_Generate_gamma_black_table>: 8064d10: 56 push %esi 8064d11: 53 push %ebx 8064d12: 83 ec 14 sub $0x14,%esp - float b0 = ((float) cv_black.value ) / 2.0; // black + float b0 = ((float) cv_black.value ) / 2.0f; // black - 8064d15: d9 05 fc 56 12 08 flds 0x81256fc - 8064d1b: da 0d d4 ff 14 08 fimull 0x814ffd4 - 8064d21: d9 54 24 04 fsts 0x4(%esp) + 8064d15: d9 05 dc 56 12 08 flds 0x81256dc + 8064d1b: da 0d b4 ff 14 08 fimull 0x814ffb4 + 8064d21: d9 54 24 08 fsts 0x8(%esp) - float pow_max = 255.0 - b0; + float pow_max = 255.0f - b0; - 8064d25: d8 2d 40 58 12 08 fsubrs 0x8125840 + 8064d25: d8 2d 20 58 12 08 fsubrs 0x8125820 8064d2b: d9 5c 24 08 fstps 0x8(%esp) float gam = gamma_lookup( cv_usegamma.value ); // gamma gamma_lookup() is inlined 8064d2f: a1 54 00 15 08 mov 0x8150054,%eax 8064d34: 8b 34 85 b0 00 15 08 mov 0x81500b0(,%eax,4),%esi for( i=1; i<=255; i++ ) for loop init 8064d3b: c6 05 e0 49 1d 08 00 movb $0x0,0x81d49e0 8064d42: bb 01 00 00 00 mov $0x1,%ebx 8064d47: eb 1f jmp 8064d68 <R_Generate_gamma_black_table+0x58> 8064d49: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi <R_Generate_gamma_black_table+0x40>: part of inlined put_gammatable( i, ... ); if( gv > 255 ) gv = 255; 8064d50: 3d ff 00 00 00 cmp $0xff,%eax 8064d55: 7e 02 jle 8064d59 <R_Generate_gamma_black_table+0x49> 8064d57: b0 ff mov $0xff,%al <R_Generate_gamma_black_table+0x49>: part of inlined put_gammatable( i, ... 
); gammatable[i] = gv; 8064d59: 88 83 e0 49 1d 08 mov %al,0x81d49e0(%ebx) for loop increment (just for non-zero case) 8064d5f: 43 inc %ebx 8064d60: 81 fb 00 01 00 00 cmp $0x100,%ebx 8064d66: 74 47 je 8064daf <R_Generate_gamma_black_table+0x9f> <R_Generate_gamma_black_table+0x58>: for loop body 8064d68: 83 ec 08 sub $0x8,%esp gam, pushed for powf call 8064d6b: 56 push %esi - float fi = ((float) i) / 255.0; mult by const double (1.0 / 255.0) - 8064d6c: 89 5c 24 18 mov %ebx,0x18(%esp) - 8064d70: db 44 24 18 fildl 0x18(%esp) - 8064d74: dd 05 10 73 12 08 fldl 0x8127310 - 8064d7a: de c9 fmulp %st,%st(1) - 8064d7c: 83 ec 04 sub $0x4,%esp + float fi = ((float) i) / 255.0f; mult by const float (1.0 / 255.0) + 8064d6c: d9 05 f4 5c 12 08 flds 0x8125cf4 + 8064d72: 53 push %ebx + 8064d73: da 0c 24 fimull (%esp) 8064d76: d9 1c 24 fstps (%esp) powf( fi, gam ) 8064d79: e8 6e 5b fe ff call 804a8ec <powf@plt> * pow_max - 8064d87: d8 4c 24 18 fmuls 0x18(%esp) + 8064d7e: d8 4c 24 1c fmuls 0x1c(%esp) + b0 - 8064d8b: d8 44 24 14 fadds 0x14(%esp) + 8064d82: d8 44 24 18 fadds 0x18(%esp) 8064d86: d9 1c 24 fstps (%esp) inlined put_gammatable( i, fv ); 8064d89: e8 7e 61 fe ff call 804af0c <lroundf@plt> 8064d8e: 83 c4 10 add $0x10,%esp if( gv < 0 ) gv = 0; 8064d91: 85 c0 test %eax,%eax 8064d93: 79 bb jns 8064d50 <R_Generate_gamma_black_table+0x40> gammatable[i] = 0; -- compiler optimization 8064d95: 31 c0 xor %eax,%eax 8064d97: 88 83 c0 49 1d 08 mov %al,0x81d49c0(%ebx) for loop increment -- compiler optimization (just for 0 case) 8064d9d: 43 inc %ebx 8064d9e: 81 fb 00 01 00 00 cmp $0x100,%ebx 8064da4: 75 c2 jne 8064d68 <R_Generate_gamma_black_table+0x58> <R_Generate_gamma_black_table+0x9f>: 8064daf: 83 c4 14 add $0x14,%esp 8064db2: 5b pop %ebx 8064db3: 5e pop %esi 8064db4: c3 ret - 8064db5: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi - 8064db9: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi + 8064dac: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi ========== Repeat the same test as above, but with DEBUG so 
objdump can dump source lines. Debug code is significantly different than normal code. In the ( / 255.0) the fldl (64 bit) changed to flds (32 bit). Orig size: 4657775 Tst3 size: 4657775 --- tst3_orig.ods 2017-04-08 20:10:12.000000000 -0500 +++ tst3_ch.ods 2017-04-08 20:08:28.000000000 -0500 @@ -48826,18 +48826,21 @@ { 8062870: 55 push %ebp 8062871: 89 e5 mov %esp,%ebp 8062873: 83 ec 48 sub $0x48,%esp - float b0 = ((float) cv_black.value ) / 2.0; // black + float b0 = ((float) cv_black.value ) / 2.0f; // black 8062876: a1 94 2b 13 08 mov 0x8132b94,%eax 806287b: 89 45 d4 mov %eax,-0x2c(%ebp) 806287e: db 45 d4 fildl -0x2c(%ebp) 8062881: d9 05 00 a1 10 08 flds 0x810a100 8062887: de c9 fmulp %st,%st(1) 8062889: d9 5d f0 fstps -0x10(%ebp) - float pow_max = 255.0 - b0; + float pow_max = 255.0f - b0; 806288c: d9 05 04 a1 10 08 flds 0x810a104 8062892: d8 65 f0 fsubs -0x10(%ebp) 8062895: d9 5d ec fstps -0x14(%ebp) float gam = gamma_lookup( cv_usegamma.value ); // gamma @@ -48853,11 +48856,11 @@ for( i=1; i<=255; i++ ) 80628b0: c7 45 f4 01 00 00 00 movl $0x1,-0xc(%ebp) 80628b7: eb 49 jmp 8062902 <R_Generate_gamma_black_table+0x92> { - float fi = ((float) i) / 255.0; + float fi = ((float) i) / 255.0f; 80628b9: db 45 f4 fildl -0xc(%ebp) - 80628bc: dd 05 08 a1 10 08 fldl 0x810a108 + 80628bc: d9 05 08 a1 10 08 flds 0x810a108 80628c2: de c9 fmulp %st,%st(1) 80628c4: d9 5d e4 fstps -0x1c(%ebp) put_gammatable( i, b0 + (powf( fi, gam ) * pow_max) ); 80628c7: 83 ec 08 sub $0x8,%esp @@ -48876,9 +48879,9 @@ 80628f1: d9 1c 24 fstps (%esp) 80628f4: ff 75 f4 pushl -0xc(%ebp) 80628f7: e8 89 fe ff ff call 8062785 <put_gammatable> 80628fc: 83 c4 10 add $0x10,%esp - float pow_max = 255.0 - b0; + float pow_max = 255.0f - b0; float gam = gamma_lookup( cv_usegamma.value ); // gamma gammatable[0] = 0; // absolute black * Seems that a double literal added to float is compile-time converted to float literal. * Seems that a double literal multiplied by a float is kept as const double. 
* Seems that a division by a double literal is converted to a multiply by a double literal. * Double literals are stored as const double (64 bit). * Float literals are stored as const float (32 bit). * There are special instructions for loading +1.0 (FLD1) and +0.0 (FLDZ), as 80-bit values. ============ float comparison to 0.0 or 1.0 in fracdivline() Orig: if (frac<0.0 || frac>1.0) return DVL_none; // not within the polygon side Tst4: if (frac<0.0f || frac>1.0f) return DVL_none; // not within the polygon side Orig size: 1335012 Tst4 size: 1335012 No differences, exactly identical code. * Putting f on a 0.0 or 1.0 literal makes no difference. ============ float comparison in fracdivline() orig: if( frac < 0.05 && SameVertex(...) ) tst5: if( frac < 0.05f && SameVertex(...) ) To simplify the assembly: __attribute__((noinline)) static boolean SameVertex( ... ) Orig size: 1335047 Tst5 size: 1335079 0804ee20 <fracdivline>: 804ee20: 56 push %esi 804ee21: 53 push %ebx @@ -5520,9 +5520,9 @@ 804ee52: d8 cb fmul %st(3),%st 804ee54: de e9 fsubrp %st,%st(1) 804ee56: d9 c0 fld %st(0) if (fabs(den) < 1.0E-36) // avoid check of float for exact 0 The literal got moved by 40 bytes. Variable den is double, and fabs() returns a double.
804ee58: d9 e1 fabs - 804ee5a: dc 1d 50 57 12 08 fcompl 0x8125750 + 804ee5a: dc 1d 70 57 12 08 fcompl 0x8125770 804ee60: df e0 fnstsw %ax 804ee62: f6 c4 01 test $0x1,%ah 804ee65: 75 49 jne 804eeb0 <fracdivline+0x90> 804ee67: d9 ce fxch %st(6) @@ -5595,39732 +5595,39720 @@ 804eefb: d9 1e fstps (%esi) 804eefd: d8 ca fmul %st(2),%st 804eeff: de c1 faddp %st,%st(1) 804ef01: d9 5e 04 fstps 0x4(%esi) - if( frac < 0.05 - 804ef04: dc 15 58 57 12 08 fcoml 0x8125758 - 804ef0a: df e0 fnstsw %ax - 804ef0c: f6 c4 01 test $0x1,%ah - 804ef0f: 74 1b je 804ef2c <fracdivline+0x10c> + if( frac < 0.05f + 804ef04: d9 05 40 57 12 08 flds 0x8125740 + 804ef0a: d9 c9 fxch %st(1) + 804ef0c: d8 d1 fcom %st(1) + 804ef0e: df e0 fnstsw %ax + 804ef10: dd d9 fstp %st(1) + 804ef12: f6 c4 01 test $0x1,%ah + 804ef15: 74 1b je 804ef32 <fracdivline+0x112> SameVertex( ) 804ef11: 89 f0 mov %esi,%eax 804ef13: 89 4c 24 04 mov %ecx,0x4(%esp) 804ef17: dd 5c 24 08 fstpl 0x8(%esp) 804ef1b: e8 b0 fe ff ff call 804edd0 <SameVertex.clone.5> 804ef20: 85 c0 test %eax,%eax 804ef22: 8b 4c 24 04 mov 0x4(%esp),%ecx 804ef26: dd 44 24 08 fldl 0x8(%esp) 804ef2a: 75 74 jne 804efa0 <fracdivline+0x180> <fracdivline+0x10c> : if( frac > 0.95 804ef2c: dc 1d 60 57 12 08 fcompl 0x8125760 804ef32: df e0 fnstsw %ax 804ef34: f6 c4 45 test $0x45,%ah 804ef37: 74 27 je 804ef60 <fracdivline+0x140> 804ef39: c7 46 08 00 00 00 00 movl $0x0,0x8(%esi) 804ef40: c7 46 10 00 00 00 00 movl $0x0,0x10(%esi) 804ef47: c7 46 14 01 00 00 00 movl $0x1,0x14(%esi) 804ef4e: c7 46 18 00 00 00 00 movl $0x0,0x18(%esi) 804ef55: b8 02 00 00 00 mov $0x2,%eax 804ef5a: 83 c4 2c add $0x2c,%esp 804ef5d: 5b pop %ebx 804ef5e: 5e pop %esi 804ef5f: c3 ret <fracdivline+0x140> : * Comparison with float literal instead of double, seems to cost 3 extra instructions (6 bytes of code size). ============== Change two comparisions, to see how differences accumulate. orig: if( frac < 0.05 && SameVertex(...) ) ... if( frac > 0.95 && SameVertex(...) ) ... 
tst5: if( frac < 0.05f && SameVertex(...) ) if( frac > 0.95f && SameVertex(...) ) ... To simplify the assembly: __attribute__((noinline)) static boolean SameVertex( ... ) Orig size: 1335047 Tst6 size: 1335079 Same file size as with one changed literal. But clearly the assembly has extra instructions per comparision. 0804ee20 <fracdivline>: 804ee20: 56 push %esi 804ee21: 53 push %ebx @@ -5595,39732 +5595,39721 @@ 804eefb: d9 1e fstps (%esi) 804eefd: d8 ca fmul %st(2),%st 804eeff: de c1 faddp %st,%st(1) 804ef01: d9 5e 04 fstps 0x4(%esi) - if( frac < 0.05 - 804ef04: dc 15 58 57 12 08 fcoml 0x8125758 - 804ef0a: df e0 fnstsw %ax - 804ef0c: f6 c4 01 test $0x1,%ah - 804ef0f: 74 1b je 804ef2c <fracdivline+0x10c> + if( frac < 0.05f + 804ef04: d9 05 40 57 12 08 flds 0x8125740 + 804ef0a: d9 c9 fxch %st(1) + 804ef0c: d8 d1 fcom %st(1) + 804ef0e: df e0 fnstsw %ax + 804ef10: dd d9 fstp %st(1) + 804ef12: f6 c4 01 test $0x1,%ah + 804ef15: 74 1b je 804ef32 <fracdivline+0x112> 804ef11: 89 f0 mov %esi,%eax 804ef13: 89 4c 24 04 mov %ecx,0x4(%esp) 804ef17: dd 5c 24 08 fstpl 0x8(%esp) 804ef1b: e8 b0 fe ff ff call 804edd0 <SameVertex.clone.5> 804ef20: 85 c0 test %eax,%eax 804ef22: 8b 4c 24 04 mov 0x4(%esp),%ecx 804ef26: dd 44 24 08 fldl 0x8(%esp) 804ef2a: 75 74 jne 804efa0 <fracdivline+0x180> - if( frac > 0.95 - 804ef2c: dc 1d 60 57 12 08 fcompl 0x8125760 - 804ef32: df e0 fnstsw %ax - 804ef34: f6 c4 45 test $0x45,%ah - 804ef37: 74 27 je 804ef60 <fracdivline+0x140> + if( frac > 0.95f + 804ef32: d9 05 44 57 12 08 flds 0x8125744 + 804ef38: d9 c9 fxch %st(1) + 804ef3a: de d9 fcompp + 804ef3c: df e0 fnstsw %ax + 804ef3e: f6 c4 45 test $0x45,%ah + 804ef41: 74 2d je 804ef70 <fracdivline+0x150> <fracdivline+0x119>: 804ef39: c7 46 08 00 00 00 00 movl $0x0,0x8(%esi) 804ef40: c7 46 10 00 00 00 00 movl $0x0,0x10(%esi) 804ef47: c7 46 14 01 00 00 00 movl $0x1,0x14(%esi) 804ef4e: c7 46 18 00 00 00 00 movl $0x0,0x18(%esi) 804ef55: b8 02 00 00 00 mov $0x2,%eax 804ef5a: 83 c4 2c add $0x2c,%esp 804ef5d: 5b 
pop %ebx 804ef5e: 5e pop %esi 804ef5f: c3 ret Alignment ? It did not align other jmp targets like this ! + 804ef6a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi <fracdivline+0x140> : 804ef60: 89 ca mov %ecx,%edx 804ef62: 89 f0 mov %esi,%eax 804ef64: 89 4c 24 04 mov %ecx,0x4(%esp) 804ef68: e8 63 fe ff ff call 804edd0 <SameVertex.clone.5> 804ef6d: 85 c0 test %eax,%eax 804ef6f: 8b 4c 24 04 mov 0x4(%esp),%ecx 804ef73: 74 c4 je 804ef39 <fracdivline+0x119> 804ef75: 89 4e 08 mov %ecx,0x8(%esi) 804ef78: c7 46 10 00 00 00 00 movl $0x0,0x10(%esi) 804ef7f: c7 46 14 02 00 00 00 movl $0x2,0x14(%esi) 804ef86: c7 46 18 01 00 00 00 movl $0x1,0x18(%esi) 804ef8d: b8 03 00 00 00 mov $0x3,%eax 804ef92: e9 3b ff ff ff jmp 804eed2 <fracdivline+0xb2> 804ef97: 89 f6 mov %esi,%esi 804ef99: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi 804efa0: dd d8 fstp %st(0) 804efa2: 89 5e 08 mov %ebx,0x8(%esi) 804efa5: c7 46 10 ff ff ff ff movl $0xffffffff,0x10(%esi) 804efac: c7 46 14 01 00 00 00 movl $0x1,0x14(%esi) 804efb3: c7 46 18 01 00 00 00 movl $0x1,0x18(%esi) 804efba: b8 01 00 00 00 mov $0x1,%eax 804efbf: e9 0e ff ff ff jmp 804eed2 <fracdivline+0xb2> More alignment ? 804efc4: 8d b6 00 00 00 00 lea 0x0(%esi),%esi 804efca: 8d bf 00 00 00 00 lea 0x0(%edi),%edi - 0804efd0 <wpoly_insert_vert>: + 0804efe0 <wpoly_insert_vert>: - 08124d7c <_fini>: + 08124d8c <_fini>: Still only a difference of 16 bytes at end of the code. * The first float literal added 3 instructions (6 bytes). * The second float literal added 2 instructions (4 bytes). * Compiler seems to include alignment to 0x10 at end of function. This complicates using the code size as a guide. A code size bump of 32 bytes means that executable grew by 17 to 32 bytes. 0 Share this post Link to post
wesleyjohnson Posted April 10, 2017 This may explain the code increase for using float literals: https://gcc.gnu.org/ml/gcc/1998-11/msg00003.html This is a discussion on the GCC mailing list with a Cygnus engineer about an i386 FP comparison bug, and fixing it in the egcs compiler. A Cygnus engineer thought they ought to turn it into a feature. Quote > Yes, it is bug. But why do not turn it to feature? > No. Not separating the cc0 from the cc0 user is a fundamental design concept, > we are not changing it anytime soon. > We need to stop regstack from inserting insns between the cc0 setter and > the cc0 user. Anything else is unacceptable. What I found is incomplete, so I don't know any better details at this time. It seems that their final solution, which appears in the assembly, got much more complicated.
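For what it's worth, the two comparison shapes in these dumps also line up with plain instruction availability: fcomi/fcomip only exist from the Pentium Pro (i686) onward, so anything targeting older CPUs has to round-trip the FPU flags through %ax with fnstsw. A minimal sketch to reproduce both sequences (the file name cmp.c is my own; this assumes a gcc that accepts -m32):

```c
/* Minimal reproducer for the two comparison sequences seen in the dumps.
 * The i386 has no fcomi/fcomip (those arrived with the Pentium Pro), so
 * older -march targets must copy the FPU status word into %ax with
 * fnstsw and test it; -march=i686 can use fcomip directly.
 *
 * Compare the two assembly outputs:
 *   gcc -m32 -O2 -march=i386 -S cmp.c -o cmp_386.s
 *   gcc -m32 -O2 -march=i686 -S cmp.c -o cmp_686.s
 */
int below(float frac)
{
    return frac < 0.05f;   /* float literal: flds of a float constant */
}

int below_d(float frac)
{
    /* double literal: frac is promoted and compared against a double
     * constant (the fcompl loads seen in the dumps) */
    return frac < 0.05;
}
```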
wesleyjohnson Posted April 10, 2017 (edited) An interesting article on floating point. Not directly relevant to floating point bugs in hardware, but may be more practical. I have not read it all yet. https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/ This is from a Chrome software programmer who does much the same thing. Puts in a const, watches the code shrink, looks at the assembly to see what bad optimization the compiler produces, and figures how to get around it. Worth a quick read at least. https://randomascii.wordpress.com/2017/01/08/add-a-const-here-delete-a-const-there/ Edited April 10, 2017 by wesleyjohnson
Graf Zahl Posted April 10, 2017 Last year we did some precision calculations for ZDoom. The relevant thread is in the developers' forum, so I cannot link it here. The conclusions we reached: - if you need reliable results, never use floats, always use doubles. Single precision floats depend on variables that are outside the programmer's control on some platforms. - for the same reason, do not use the CRT's math functions like sin or sqrt. They all differ between compilers. If you need reproducible results you have to use software-implemented replacements. - do not use any floating point optimizations the compiler may offer; they may also affect the results. Following these rules, it is safe to assume that all current compilers create code that gives the same result on all currently relevant platforms (i.e. x86/x87, x86/SSE2, x64, ARM, ARM64, and PowerPC). No idea how an old compiler like GCC 4.5 would fare here, though; it might present some issues.
kb1 Posted April 10, 2017 3 hours ago, wesleyjohnson said: An interesting article on floating point. Not directly relevant to floating point bugs in hardware, but may be more practical. I have not read it all yet. https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/ This is from a Chrome software programmer who does much the same thing. Puts in a const, watches the code shrink, looks at the assembly to see what bad optimization the compiler produces, and figures how to get around it. Worth a quick read at least. https://randomascii.wordpress.com/2017/01/08/add-a-const-here-delete-a-const-there/ The randomASCII articles are amazing. Thanks for the detailed writeup. Doesn't look like a "feature" to me. I have not heard of a 386 float compare bug, but, yeah, maybe. But that doesn't explain why you have problems when you choose 686 as your target, or why it needs a "secret" method to enable it. So, the big question: Will you continue to add 'f', or not? :)
wesleyjohnson Posted April 14, 2017 (edited) Sorry for the delay. My ability to reply on DoomWorld was totally broken for the last week. It would not respond to any button pushes, and I could not even complain to anyone. I have added the ability to choose ARCH when compiling DoomLegacy. The GCC docs are not entirely clear about what it uses for a default, but I suspect that it is generic32, or i386. I have not found a way to detect what it used from looking at the objdumps. Tools like "file" or "objdump" will only tell me that the object format is "i386", and they say that even for i686 compiles. The default (which was used for all the tests so far) must be i386: 1. GCC is including assembly that works around 386 cpu problems. 2. When I compile with -march=i686, the code gets 30K smaller. Those articles seem to be definitive reference material. They are referred to by most other work on floating point. I found almost identical sentences in the GCC info pages too.
wesleyjohnson Posted April 15, 2017 Repeat the last test for different target arch. It appears that the default compile is close to the i486. Interesting that the smallest code is for the 386, with the 586 a close second. The 486 has the largest code. There are other strange things in the code, which make for code size bumps in the 686 and k8 cases. The strange things are not alignment, they are executed, but appear to be useless compiler artifacts. Extra code clones do not help. They could really reduce the code size by controlling their clones better. This is just for GCC, who knows what CLANG or MS does. Compiler: GCC 4.5.2 Source: Doomlegacy 1.46.3 Assembly: objdump -d ============== Try different -march settings for fracdivline() with f literals. Tst7_386: -march=i386 Tst7_486: -march=i486 Tst8_586: -march=i586 Tst9_686: -march=i686 Tst10_k8: -march=k8 Orig size: 1335047 Tst6 size: 1335079 Tst7_386 size: 1237118 Tst7_486 size: 1335079 Tst8_586 size: 1280269 Tst9_686 size: 1302098 Tst10_k8 size: 1306919 ============== Short Notes: ** These exit clones are present in all versions. They appear to clean the floating point stack, but do not appear for every exit of the function. They are not padding, they are executed. ret_DVL_none.clone.1: -> return DVL_none 804e600: dd d8 fstp %st(0) 804e602: dd d8 fstp %st(0) 804e604: dd d8 fstp %st(0) 804e606: dd d8 fstp %st(0) 804e608: dd d8 fstp %st(0) 804e60a: dd d8 fstp %st(0) 804e60c: dd d8 fstp %st(0) 804e60e: eb 10 jmp 804e620 ret_DVL_none ** if (fabs(den) < 1.0E-36) For i386, i486, i586: The compare double. if (fabs(den) < 1.0E-36) d9 c0 fld %st(0) d9 e1 fabs dc 1d d8 d8 10 08 fcompl 0x810d8d8 df e0 fnstsw %ax f6 c4 01 test $0x1,%ah 75 49 jne 804e600 ret_DVL_none.clone.1 For i686: Compare Double in registers. if (fabs(den) < 1.0E-36) d9 c0 fld %st(0) d9 e1 fabs dd 05 d8 d8 11 08 fldl 0x811d8d8 df f1 fcomip %st(1),%st dd d8 fstp %st(0) 77 3c ja 804ea18 ret_DVL_none.clone.1 For k8: Compare Double in registers. 
if (fabs(den) < 1.0E-36) d9 c0 fld %st(0) d9 e1 fabs dd 05 d8 e4 11 08 fldl 0x811e4d8 df f1 fcomip %st(1),%st df c0 ffreep %st(0) 77 3c ja 804ea08 ret_DVL_none.clone.1 ** if (frac<0.0 ... ) For i386, i486, i586: ftst if (frac<0.0 ... ) d9 e4 ftst df e0 fnstsw %ax f6 c4 01 test $0x1,%ah 75 30 jne 804e610 ret_DVL_none.clone.2 For i686, k8: load 0 and compare if (frac<0.0 ... ) d9 ee fldz df f1 fcomip %st(1),%st 77 2e ja 804ea28 ret_DVL_none.clone.2 ** if ( ... frac>1.0) For i386, i486, i586: Load 1, compare float (avoiding buggy fcomps?). if ( ... frac>1.0) d9 e8 fld1 d9 c9 fxch %st(1) d8 d1 fcom %st(1) df e0 fnstsw %ax dd d9 fstp %st(1) f6 c4 45 test $0x45,%ah 75 39 jne 804e628 body.num1 For i686, k8: Load 1, compare integer if ( ... frac>1.0) d9 e8 fld1 d9 c9 fxch %st(1) db f1 fcomi %st(1),%st dd d9 fstp %st(1) 76 3c jbe 804ea40 body.num1 ** if( frac < 0.05f For i386, i486, i586: Load literal, compare float (avoiding buggy fcomps?). if( frac < 0.05f d9 05 a0 d8 10 08 flds 0x810d8a0 d9 c9 fxch %st(1) d8 d1 fcom %st(1) df e0 fnstsw %ax dd d9 fstp %st(1) f6 c4 01 test $0x1,%ah 74 1b je 804e67a body.if.95 For i686, k8: Load literal, compare integer if( frac < 0.05f d9 05 a0 d8 11 08 flds 0x811d8a0 df f1 fcomip %st(1),%st 76 1b jbe 804ea8b body.if.95 ** if( frac > 0.95f For i386, i486, i586: Load literal, compare float (avoiding buggy fcomps?). if( frac > 0.95f d9 05 a4 d8 10 08 flds 0x810d8a4 d9 c9 fxch %st(1) de d9 fcompp df e0 fnstsw %ax f6 c4 45 test $0x45,%ah 74 29 je 804e6b4 if.95.OR.samevertex For i686: Load literal, compare integer. if( frac > 0.95f d9 05 a4 d8 11 08 flds 0x811d8a4 d9 c9 fxch %st(1) df f1 fcomip %st(1),%st dd d8 fstp %st(0) 77 27 ja 804eac0 if.95.OR.samevertex For k8: Load literal, compare integer. 
if( frac > 0.95f d9 05 a4 e4 11 08 flds 0x811e4a4 d9 c9 fxch %st(1) df f1 fcomip %st(1),%st df c0 ffreep %st(0) 77 27 ja 804eab0 if.95.OR.samevertex ============== Tst7_386: 0804e570 <fracdivline>: 804e570: 56 push %esi 804e571: 53 push %ebx 804e572: 83 ec 2c sub $0x2c,%esp 804e575: 89 d3 mov %edx,%ebx 804e577: 8b 74 24 38 mov 0x38(%esp),%esi 804e57b: d9 02 flds (%edx) 804e57d: d9 42 04 flds 0x4(%edx) 804e580: d9 01 flds (%ecx) 804e582: d8 e2 fsub %st(2),%st 804e584: d9 41 04 flds 0x4(%ecx) 804e587: d8 e2 fsub %st(2),%st 804e589: d9 00 flds (%eax) 804e58b: d9 5c 24 18 fstps 0x18(%esp) 804e58f: d9 40 04 flds 0x4(%eax) 804e592: d9 5c 24 1c fstps 0x1c(%esp) den = v3dy*v1dx - v3dx*v1dy; 804e596: d9 40 08 flds 0x8(%eax) 804e599: d9 40 0c flds 0xc(%eax) 804e59c: d9 c3 fld %st(3) 804e59e: d8 c9 fmul %st(1),%st 804e5a0: d9 c3 fld %st(3) 804e5a2: d8 cb fmul %st(3),%st 804e5a4: de e9 fsubrp %st,%st(1) if (fabs(den) < 1.0E-36) 804e5a6: d9 c0 fld %st(0) 804e5a8: d9 e1 fabs 804e5aa: dc 1d d8 d8 10 08 fcompl 0x810d8d8 804e5b0: df e0 fnstsw %ax 804e5b2: f6 c4 01 test $0x1,%ah 804e5b5: 75 49 jne 804e600 ret_DVL_none.clone.1 num = (v3x - v1x)*v3dy + (v1y - v3y)*v3dx; 804e5b7: d9 ce fxch %st(6) 804e5b9: dd 5c 24 20 fstpl 0x20(%esp) 804e5bd: dd 44 24 20 fldl 0x20(%esp) 804e5c1: d8 6c 24 18 fsubrs 0x18(%esp) 804e5c5: d9 c5 fld %st(5) 804e5c7: d8 64 24 1c fsubs 0x1c(%esp) 804e5cb: dc cb fmul %st,%st(3) 804e5cd: d9 ca fxch %st(2) 804e5cf: d8 c9 fmul %st(1),%st 804e5d1: de c3 faddp %st,%st(3) frac = num / den; 804e5d3: d9 ca fxch %st(2) 804e5d5: d8 f6 fdiv %st(6),%st if (frac<0.0 ... ) 804e5d7: d9 e4 ftst 804e5d9: df e0 fnstsw %ax 804e5db: f6 c4 01 test $0x1,%ah 804e5de: 75 30 jne 804e610 ret_DVL_none.clone.2 if ( ... 
frac>1.0) 804e5e0: d9 e8 fld1 804e5e2: d9 c9 fxch %st(1) 804e5e4: d8 d1 fcom %st(1) 804e5e6: df e0 fnstsw %ax 804e5e8: dd d9 fstp %st(1) 804e5ea: f6 c4 45 test $0x45,%ah 804e5ed: 75 39 jne 804e628 body.num1 # ret_DVL_none.clone.3: -> return DVL_none 804e5ef: dd d8 fstp %st(0) 804e5f1: dd d8 fstp %st(0) 804e5f3: dd d8 fstp %st(0) 804e5f5: dd d8 fstp %st(0) 804e5f7: dd d8 fstp %st(0) 804e5f9: dd d8 fstp %st(0) 804e5fb: dd d8 fstp %st(0) 804e5fd: eb 21 jmp 804e620 ret_DVL_none 804e5ff: 90 nop ret_DVL_none.clone.1: -> return DVL_none 804e600: dd d8 fstp %st(0) 804e602: dd d8 fstp %st(0) 804e604: dd d8 fstp %st(0) 804e606: dd d8 fstp %st(0) 804e608: dd d8 fstp %st(0) 804e60a: dd d8 fstp %st(0) 804e60c: dd d8 fstp %st(0) 804e60e: eb 10 jmp 804e620 ret_DVL_none ret_DVL_none.clone.2: -> return DVL_none 804e610: dd d8 fstp %st(0) 804e612: dd d8 fstp %st(0) 804e614: dd d8 fstp %st(0) 804e616: dd d8 fstp %st(0) 804e618: dd d8 fstp %st(0) 804e61a: dd d8 fstp %st(0) 804e61c: dd d8 fstp %st(0) 804e61e: 66 90 xchg %ax,%ax ret_DVL_none: return DVL_none; 804e620: 31 c0 xor %eax,%eax ret_common.1: 804e622: 83 c4 2c add $0x2c,%esp 804e625: 5b pop %ebx 804e626: 5e pop %esi 804e627: c3 ret body.num1: num = (v3x - v1x)*v1dy + (v1y - v3y)*v1dx; 804e628: d9 c9 fxch %st(1) 804e62a: d8 cc fmul %st(4),%st 804e62c: d9 ca fxch %st(2) 804e62e: d8 cb fmul %st(3),%st 804e630: de c2 faddp %st,%st(2) 804e632: d9 c9 fxch %st(1) result->divfrac = num / den; 804e634: de f5 fdivp %st,%st(5) 804e636: d9 cc fxch %st(4) 804e638: d9 5e 0c fstps 0xc(%esi) result->divpt.x = v1x + v1dx*frac; 804e63b: d9 c9 fxch %st(1) 804e63d: d8 cb fmul %st(3),%st 804e63f: dc 44 24 20 faddl 0x20(%esp) 804e643: d9 1e fstps (%esi) result->divpt.y = v1y + v1dy*frac; 804e645: d8 ca fmul %st(2),%st 804e647: de c1 faddp %st,%st(1) 804e649: d9 5e 04 fstps 0x4(%esi) + if( frac < 0.05f + 804e64c: d9 05 a0 d8 10 08 flds 0x810d8a0 + 804e652: d9 c9 fxch %st(1) + 804e654: d8 d1 fcom %st(1) + 804e656: df e0 fnstsw %ax + 804e658: dd d9 
fstp %st(1) + 804e65a: f6 c4 01 test $0x1,%ah + 804e65d: 74 1b je 804e67a body.if.95 if( ... && SameVertex() 804e65f: 89 f0 mov %esi,%eax 804e661: 89 4c 24 04 mov %ecx,0x4(%esp) 804e665: dd 5c 24 08 fstpl 0x8(%esp) 804e669: e8 ce fe ff ff call 804e53c <SameVertex.clone.5> 804e66e: 85 c0 test %eax,%eax 804e670: 8b 4c 24 04 mov 0x4(%esp),%ecx 804e674: dd 44 24 08 fldl 0x8(%esp) 804e678: 75 72 jne 804e6ec case_DVL_v1 body.if.95: + if( frac > 0.95f + 804e67a: d9 05 a4 d8 10 08 flds 0x810d8a4 + 804e680: d9 c9 fxch %st(1) + 804e682: de d9 fcompp + 804e684: df e0 fnstsw %ax + 804e686: f6 c4 45 test $0x45,%ah + 804e689: 74 29 je 804e6b4 if.95.OR.samevertex case_DVL_mid: result->vertex = NULL; 804e68b: c7 46 08 00 00 00 00 movl $0x0,0x8(%esi) result->before = 0; 804e692: c7 46 10 00 00 00 00 movl $0x0,0x10(%esi) result->after = 1; 804e699: c7 46 14 01 00 00 00 movl $0x1,0x14(%esi) result->before = 0; 804e6a0: c7 46 18 00 00 00 00 movl $0x0,0x18(%esi) return DVL_mid; 804e6a7: b8 02 00 00 00 mov $0x2,%eax 804e6ac: 83 c4 2c add $0x2c,%esp 804e6af: 5b pop %ebx 804e6b0: 5e pop %esi 804e6b1: c3 ret 804e6b2: 66 90 xchg %ax,%ax if.95.OR.samevertex: if( ... 
&& SameVertex() 804e6b4: 89 ca mov %ecx,%edx 804e6b6: 89 f0 mov %esi,%eax 804e6b8: 89 4c 24 04 mov %ecx,0x4(%esp) 804e6bc: e8 7b fe ff ff call 804e53c <SameVertex.clone.5> 804e6c1: 85 c0 test %eax,%eax 804e6c3: 8b 4c 24 04 mov 0x4(%esp),%ecx 804e6c7: 74 c2 je 804e68b case_DVL_mid # case_DVL_v2: result->vertex = v2; 804e6c9: 89 4e 08 mov %ecx,0x8(%esi) result->before = 0; 804e6cc: c7 46 10 00 00 00 00 movl $0x0,0x10(%esi) result->after = 2; 804e6d3: c7 46 14 02 00 00 00 movl $0x2,0x14(%esi) result->at_vert = true; 804e6da: c7 46 18 01 00 00 00 movl $0x1,0x18(%esi) return DVL_v2; 804e6e1: b8 03 00 00 00 mov $0x3,%eax 804e6e6: e9 37 ff ff ff jmp 804e622 ret_common.1 804e6eb: 90 nop case_DVL_v1: result->vertex = v1; 804e6ec: dd d8 fstp %st(0) 804e6ee: 89 5e 08 mov %ebx,0x8(%esi) result->before = -1; 804e6f1: c7 46 10 ff ff ff ff movl $0xffffffff,0x10(%esi) result->after = 1; 804e6f8: c7 46 14 01 00 00 00 movl $0x1,0x14(%esi) result->at_vert = true; 804e6ff: c7 46 18 01 00 00 00 movl $0x1,0x18(%esi) return DVL_v1; 804e706: b8 01 00 00 00 mov $0x1,%eax 804e70b: e9 12 ff ff ff jmp 804e622 ret_common.1 ============== Tst7_486: 0804ee20 <fracdivline>: 804ee20: 56 push %esi 804ee21: 53 push %ebx 804ee22: 83 ec 2c sub $0x2c,%esp 804ee25: 89 d3 mov %edx,%ebx 804ee27: 8b 74 24 38 mov 0x38(%esp),%esi 804ee2b: d9 02 flds (%edx) 804ee2d: d9 42 04 flds 0x4(%edx) 804ee30: d9 01 flds (%ecx) 804ee32: d8 e2 fsub %st(2),%st 804ee34: d9 41 04 flds 0x4(%ecx) 804ee37: d8 e2 fsub %st(2),%st 804ee39: d9 00 flds (%eax) 804ee3b: d9 5c 24 18 fstps 0x18(%esp) 804ee3f: d9 40 04 flds 0x4(%eax) 804ee42: d9 5c 24 1c fstps 0x1c(%esp) den = v3dy*v1dx - v3dx*v1dy; 804ee46: d9 40 08 flds 0x8(%eax) 804ee49: d9 40 0c flds 0xc(%eax) 804ee4c: d9 c3 fld %st(3) 804ee4e: d8 c9 fmul %st(1),%st 804ee50: d9 c3 fld %st(3) 804ee52: d8 cb fmul %st(3),%st 804ee54: de e9 fsubrp %st,%st(1) if (fabs(den) < 1.0E-36) 804ee56: d9 c0 fld %st(0) 804ee58: d9 e1 fabs 804ee5a: dc 1d 78 57 12 08 fcompl 0x8125778 804ee60: df e0 
fnstsw %ax 804ee62: f6 c4 01 test $0x1,%ah 804ee65: 75 49 jne 804eeb0 ret_DVL_none.clone.1 num = (v3x - v1x)*v3dy + (v1y - v3y)*v3dx; 804ee67: d9 ce fxch %st(6) 804ee69: dd 5c 24 20 fstpl 0x20(%esp) 804ee6d: dd 44 24 20 fldl 0x20(%esp) 804ee71: d8 6c 24 18 fsubrs 0x18(%esp) 804ee75: d9 c5 fld %st(5) 804ee77: d8 64 24 1c fsubs 0x1c(%esp) 804ee7b: dc cb fmul %st,%st(3) 804ee7d: d9 ca fxch %st(2) 804ee7f: d8 c9 fmul %st(1),%st 804ee81: de c3 faddp %st,%st(3) frac = num / den; 804ee83: d9 ca fxch %st(2) 804ee85: d8 f6 fdiv %st(6),%st if (frac<0.0 ... ) 804ee87: d9 e4 ftst 804ee89: df e0 fnstsw %ax 804ee8b: f6 c4 01 test $0x1,%ah 804ee8e: 75 30 jne 804eec0 ret_DVL_none.clone.2 if ( ... frac>1.0) 804ee90: d9 e8 fld1 804ee92: d9 c9 fxch %st(1) 804ee94: d8 d1 fcom %st(1) 804ee96: df e0 fnstsw %ax 804ee98: dd d9 fstp %st(1) 804ee9a: f6 c4 45 test $0x45,%ah 804ee9d: 75 41 jne 804eee0 body.num1 # ret_DVL_none.clone.3: -> return DVL_none 804ee9f: dd d8 fstp %st(0) 804eea1: dd d8 fstp %st(0) 804eea3: dd d8 fstp %st(0) 804eea5: dd d8 fstp %st(0) 804eea7: dd d8 fstp %st(0) 804eea9: dd d8 fstp %st(0) 804eeab: dd d8 fstp %st(0) 804eead: eb 21 jmp 804eed0 ret_DVL_none 804eeaf: 90 nop ret_DVL_none.clone.1: -> return DVL_none 804eeb0: dd d8 fstp %st(0) 804eeb2: dd d8 fstp %st(0) 804eeb4: dd d8 fstp %st(0) 804eeb6: dd d8 fstp %st(0) 804eeb8: dd d8 fstp %st(0) 804eeba: dd d8 fstp %st(0) 804eebc: dd d8 fstp %st(0) 804eebe: eb 10 jmp 804eed0 ret_DVL_none ret_DVL_none.clone.2: 804eec0: dd d8 fstp %st(0) 804eec2: dd d8 fstp %st(0) 804eec4: dd d8 fstp %st(0) 804eec6: dd d8 fstp %st(0) 804eec8: dd d8 fstp %st(0) 804eeca: dd d8 fstp %st(0) 804eecc: dd d8 fstp %st(0) 804eece: 66 90 xchg %ax,%ax ret_DVL_none: return DVL_none; 804eed0: 31 c0 xor %eax,%eax ret_common.1: 804eed2: 83 c4 2c add $0x2c,%esp 804eed5: 5b pop %ebx 804eed6: 5e pop %esi 804eed7: c3 ret 804eed8: 90 nop 804eed9: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi body.num1: num = (v3x - v1x)*v1dy + (v1y - v3y)*v1dx; 804eee0: d9 c9 
fxch %st(1) 804eee2: d8 cc fmul %st(4),%st 804eee4: d9 ca fxch %st(2) 804eee6: d8 cb fmul %st(3),%st 804eee8: de c2 faddp %st,%st(2) 804eeea: d9 c9 fxch %st(1) result->divfrac = num / den; 804eeec: de f5 fdivp %st,%st(5) 804eeee: d9 cc fxch %st(4) 804eef0: d9 5e 0c fstps 0xc(%esi) result->divpt.x = v1x + v1dx*frac; 804eef3: d9 c9 fxch %st(1) 804eef5: d8 cb fmul %st(3),%st 804eef7: dc 44 24 20 faddl 0x20(%esp) 804eefb: d9 1e fstps (%esi) result->divpt.y = v1y + v1dy*frac; 804eefd: d8 ca fmul %st(2),%st 804eeff: de c1 faddp %st,%st(1) 804ef01: d9 5e 04 fstps 0x4(%esi) + if( frac < 0.05f + 804ef04: d9 05 40 57 12 08 flds 0x8125740 + 804ef0a: d9 c9 fxch %st(1) + 804ef0c: d8 d1 fcom %st(1) + 804ef0e: df e0 fnstsw %ax + 804ef10: dd d9 fstp %st(1) + 804ef12: f6 c4 01 test $0x1,%ah + 804ef15: 74 1b je 804ef32 body.if.95 if( ... && SameVertex() 804ef17: 89 f0 mov %esi,%eax 804ef19: 89 4c 24 04 mov %ecx,0x4(%esp) 804ef1d: dd 5c 24 08 fstpl 0x8(%esp) 804ef21: e8 aa fe ff ff call 804edd0 <SameVertex.clone.5> 804ef26: 85 c0 test %eax,%eax 804ef28: 8b 4c 24 04 mov 0x4(%esp),%ecx 804ef2c: dd 44 24 08 fldl 0x8(%esp) 804ef30: 75 7e jne 804efb0 case_DVL_v1 body.if.95: + if( frac > 0.95f + 804ef32: d9 05 44 57 12 08 flds 0x8125744 + 804ef38: d9 c9 fxch %st(1) + 804ef3a: de d9 fcompp + 804ef3c: df e0 fnstsw %ax + 804ef3e: f6 c4 45 test $0x45,%ah + 804ef41: 74 2d je 804ef70 if.95.OR.samevertex case_DVL_mid: 804ef43: c7 46 08 00 00 00 00 movl $0x0,0x8(%esi) 804ef4a: c7 46 10 00 00 00 00 movl $0x0,0x10(%esi) 804ef51: c7 46 14 01 00 00 00 movl $0x1,0x14(%esi) 804ef58: c7 46 18 00 00 00 00 movl $0x0,0x18(%esi) return DVL_mid; 804ef5f: b8 02 00 00 00 mov $0x2,%eax 804ef64: 83 c4 2c add $0x2c,%esp 804ef67: 5b pop %ebx 804ef68: 5e pop %esi 804ef69: c3 ret 804ef6a: 8d b6 00 00 00 00 lea 0x0(%esi),%esi if.95.OR.samevertex: if( ... 
&& SameVertex() 804ef70: 89 ca mov %ecx,%edx 804ef72: 89 f0 mov %esi,%eax 804ef74: 89 4c 24 04 mov %ecx,0x4(%esp) 804ef78: e8 53 fe ff ff call 804edd0 <SameVertex.clone.5> 804ef7d: 85 c0 test %eax,%eax 804ef7f: 8b 4c 24 04 mov 0x4(%esp),%ecx 804ef83: 74 be je 804ef43 case_DVL_mid # case_DVL_v2: 804ef85: 89 4e 08 mov %ecx,0x8(%esi) 804ef88: c7 46 10 00 00 00 00 movl $0x0,0x10(%esi) 804ef8f: c7 46 14 02 00 00 00 movl $0x2,0x14(%esi) 804ef96: c7 46 18 01 00 00 00 movl $0x1,0x18(%esi) return DVL_v2; 804ef9d: b8 03 00 00 00 mov $0x3,%eax 804efa2: e9 2b ff ff ff jmp 804eed2 ret_common.1 804efa7: 89 f6 mov %esi,%esi 804efa9: 8d bc 27 00 00 00 00 lea 0x0(%edi,%eiz,1),%edi case_DVL_v1: 804efb0: dd d8 fstp %st(0) 804efb2: 89 5e 08 mov %ebx,0x8(%esi) 804efb5: c7 46 10 ff ff ff ff movl $0xffffffff,0x10(%esi) 804efbc: c7 46 14 01 00 00 00 movl $0x1,0x14(%esi) 804efc3: c7 46 18 01 00 00 00 movl $0x1,0x18(%esi) return DVL_v1; 804efca: b8 01 00 00 00 mov $0x1,%eax 804efcf: e9 fe fe ff ff jmp 804eed2 ret_common.1 804efd4: 8d b6 00 00 00 00 lea 0x0(%esi),%esi 804efda: 8d bf 00 00 00 00 lea 0x0(%edi),%edi ============== Tst8_586: 0804e9e0 <fracdivline>: 804e9e0: 56 push %esi 804e9e1: 53 push %ebx 804e9e2: 83 ec 2c sub $0x2c,%esp 804e9e5: 89 d3 mov %edx,%ebx 804e9e7: d9 02 flds (%edx) 804e9e9: d9 42 04 flds 0x4(%edx) 804e9ec: d9 01 flds (%ecx) 804e9ee: d8 e2 fsub %st(2),%st 804e9f0: 31 d2 xor %edx,%edx 804e9f2: 8b 74 24 38 mov 0x38(%esp),%esi 804e9f6: d9 41 04 flds 0x4(%ecx) 804e9f9: d8 e2 fsub %st(2),%st 804e9fb: d9 00 flds (%eax) 804e9fd: d9 5c 24 18 fstps 0x18(%esp) 804ea01: d9 40 04 flds 0x4(%eax) 804ea04: d9 5c 24 1c fstps 0x1c(%esp) den = v3dy*v1dx - v3dx*v1dy; 804ea08: d9 40 08 flds 0x8(%eax) 804ea0b: d9 40 0c flds 0xc(%eax) 804ea0e: d9 c3 fld %st(3) 804ea10: d8 c9 fmul %st(1),%st 804ea12: d9 c3 fld %st(3) 804ea14: d8 cb fmul %st(3),%st 804ea16: de e9 fsubrp %st,%st(1) if (fabs(den) < 1.0E-36) 804ea18: d9 c0 fld %st(0) 804ea1a: d9 e1 fabs 804ea1c: dc 1d 38 81 11 08 fcompl 
0x8118138 804ea22: df e0 fnstsw %ax 804ea24: f6 c4 01 test $0x1,%ah 804ea27: 75 4f jne 804ea78 ret_DVL_none.clone.1 num = (v3x - v1x)*v3dy + (v1y - v3y)*v3dx; 804ea29: d9 ce fxch %st(6) 804ea2b: dd 5c 24 20 fstpl 0x20(%esp) 804ea2f: dd 44 24 20 fldl 0x20(%esp) 804ea33: d8 6c 24 18 fsubrs 0x18(%esp) 804ea37: d9 c5 fld %st(5) 804ea39: d8 64 24 1c fsubs 0x1c(%esp) 804ea3d: dc cb fmul %st,%st(3) 804ea3f: d9 ca fxch %st(2) 804ea41: d8 c9 fmul %st(1),%st 804ea43: de c3 faddp %st,%st(3) frac = num / den; 804ea45: d9 ca fxch %st(2) 804ea47: d8 f6 fdiv %st(6),%st if (frac<0.0 ... ) 804ea49: d9 e4 ftst 804ea4b: df e0 fnstsw %ax 804ea4d: f6 c4 01 test $0x1,%ah 804ea50: 75 36 jne 804ea88 ret_DVL_none.clone.2 if ( ... frac>1.0) 804ea52: d9 e8 fld1 804ea54: d9 c9 fxch %st(1) 804ea56: d8 d1 fcom %st(1) 804ea58: df e0 fnstsw %ax 804ea5a: dd d9 fstp %st(1) 804ea5c: f6 c4 45 test $0x45,%ah 804ea5f: 75 3f jne 804eaa0 body.num1 # ret_DVL_none.clone.3: -> return DVL_none 804ea61: dd d8 fstp %st(0) 804ea63: dd d8 fstp %st(0) 804ea65: dd d8 fstp %st(0) 804ea67: dd d8 fstp %st(0) 804ea69: dd d8 fstp %st(0) 804ea6b: dd d8 fstp %st(0) 804ea6d: dd d8 fstp %st(0) 804ea6f: eb 27 jmp 804ea98 ret_common.1 804ea71: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi ret_DVL_none.clone.1: -> return DVL_none 804ea78: dd d8 fstp %st(0) 804ea7a: dd d8 fstp %st(0) 804ea7c: dd d8 fstp %st(0) 804ea7e: dd d8 fstp %st(0) 804ea80: dd d8 fstp %st(0) 804ea82: dd d8 fstp %st(0) 804ea84: dd d8 fstp %st(0) # ? 
%eax==0 804ea86: eb 10 jmp 804ea98 ret_common.1 ret_DVL_none.clone.2: -> return DVL_none 804ea88: dd d8 fstp %st(0) 804ea8a: dd d8 fstp %st(0) 804ea8c: dd d8 fstp %st(0) 804ea8e: dd d8 fstp %st(0) 804ea90: dd d8 fstp %st(0) 804ea92: dd d8 fstp %st(0) 804ea94: dd d8 fstp %st(0) return DVL_none; 804ea96: 66 90 xchg %ax,%ax ret_common.1: 804ea98: 83 c4 2c add $0x2c,%esp 804ea9b: 89 d0 mov %edx,%eax 804ea9d: 5b pop %ebx 804ea9e: 5e pop %esi 804ea9f: c3 ret body.num1: num = (v3x - v1x)*v1dy + (v1y - v3y)*v1dx; 804eaa0: d9 c9 fxch %st(1) 804eaa2: d8 cc fmul %st(4),%st 804eaa4: d9 ca fxch %st(2) 804eaa6: d8 cb fmul %st(3),%st 804eaa8: de c2 faddp %st,%st(2) 804eaaa: d9 c9 fxch %st(1) result->divfrac = num / den; 804eaac: de f5 fdivp %st,%st(5) 804eaae: d9 cc fxch %st(4) 804eab0: d9 5e 0c fstps 0xc(%esi) result->divpt.x = v1x + v1dx*frac; 804eab3: d9 c9 fxch %st(1) 804eab5: d8 cb fmul %st(3),%st 804eab7: dc 44 24 20 faddl 0x20(%esp) 804eabb: d9 1e fstps (%esi) result->divpt.y = v1y + v1dy*frac; 804eabd: d8 ca fmul %st(2),%st 804eabf: de c1 faddp %st,%st(1) 804eac1: d9 5e 04 fstps 0x4(%esi) + if( frac < 0.05f + 804eac4: d9 05 00 81 11 08 flds 0x8118100 + 804eaca: d9 c9 fxch %st(1) + 804eacc: d8 d1 fcom %st(1) + 804eace: df e0 fnstsw %ax + 804ead0: dd d9 fstp %st(1) + 804ead2: f6 c4 01 test $0x1,%ah + 804ead5: 74 1d je 804eaf4 body.if.95 if.95.OR.samevertex: if( ... 
&& SameVertex() 804ead7: 89 da mov %ebx,%edx 804ead9: 89 f0 mov %esi,%eax 804eadb: 89 4c 24 04 mov %ecx,0x4(%esp) 804eadf: dd 5c 24 08 fstpl 0x8(%esp) 804eae3: e8 b8 fe ff ff call 804e9a0 <SameVertex.clone.5> 804eae8: 8b 4c 24 04 mov 0x4(%esp),%ecx 804eaec: 85 c0 test %eax,%eax 804eaee: dd 44 24 08 fldl 0x8(%esp) 804eaf2: 75 74 jne 804eb68 case_DVL_v1 body.if.95: + if( frac > 0.95f + 804eaf4: d9 05 04 81 11 08 flds 0x8118104 + 804eafa: d9 c9 fxch %st(1) + 804eafc: de d9 fcompp + 804eafe: df e0 fnstsw %ax + 804eb00: f6 c4 45 test $0x45,%ah + 804eb03: 74 2b je 804eb30 if.95.OR.samevertex case_DVL_mid: 804eb05: c7 46 08 00 00 00 00 movl $0x0,0x8(%esi) 804eb0c: c7 46 10 00 00 00 00 movl $0x0,0x10(%esi) 804eb13: c7 46 14 01 00 00 00 movl $0x1,0x14(%esi) 804eb1a: c7 46 18 00 00 00 00 movl $0x0,0x18(%esi) return DVL_mid; 804eb21: ba 02 00 00 00 mov $0x2,%edx 804eb26: 83 c4 2c add $0x2c,%esp 804eb29: 89 d0 mov %edx,%eax 804eb2b: 5b pop %ebx 804eb2c: 5e pop %esi 804eb2d: c3 ret 804eb2e: 66 90 xchg %ax,%ax if.95.OR.samevertex: if( ... 
&& SameVertex() 804eb30: 89 ca mov %ecx,%edx 804eb32: 89 f0 mov %esi,%eax 804eb34: 89 4c 24 04 mov %ecx,0x4(%esp) 804eb38: e8 63 fe ff ff call 804e9a0 <SameVertex.clone.5> 804eb3d: 8b 4c 24 04 mov 0x4(%esp),%ecx 804eb41: 85 c0 test %eax,%eax 804eb43: 74 c0 je 804eb05 case_DVL_mid # case_DVL_v2: 804eb45: 89 4e 08 mov %ecx,0x8(%esi) 804eb48: c7 46 10 00 00 00 00 movl $0x0,0x10(%esi) 804eb4f: c7 46 14 02 00 00 00 movl $0x2,0x14(%esi) 804eb56: c7 46 18 01 00 00 00 movl $0x1,0x18(%esi) return DVL_v2; 804eb5d: ba 03 00 00 00 mov $0x3,%edx 804eb62: e9 31 ff ff ff jmp 804ea98 ret_common.1 804eb67: 90 nop case_DVL_v1: 804eb68: dd d8 fstp %st(0) 804eb6a: 89 5e 08 mov %ebx,0x8(%esi) 804eb6d: c7 46 10 ff ff ff ff movl $0xffffffff,0x10(%esi) 804eb74: c7 46 14 01 00 00 00 movl $0x1,0x14(%esi) 804eb7b: c7 46 18 01 00 00 00 movl $0x1,0x18(%esi) return DVL_v1; 804eb82: ba 01 00 00 00 mov $0x1,%edx 804eb87: e9 0c ff ff ff jmp 804ea98 ret_common.1 804eb8c: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi ============== Tst9_686: 0804e990 <fracdivline>: 804e990: 56 push %esi 804e991: 53 push %ebx 804e992: 89 d3 mov %edx,%ebx 804e994: 83 ec 34 sub $0x34,%esp 804e997: d9 02 flds (%edx) 804e999: d9 42 04 flds 0x4(%edx) 804e99c: d9 01 flds (%ecx) 804e99e: d8 e2 fsub %st(2),%st 804e9a0: 8b 74 24 40 mov 0x40(%esp),%esi 804e9a4: d9 41 04 flds 0x4(%ecx) 804e9a7: d8 e2 fsub %st(2),%st 804e9a9: dd 1c 24 fstpl (%esp) 804e9ac: d9 00 flds (%eax) 804e9ae: d9 5c 24 28 fstps 0x28(%esp) 804e9b2: d9 40 04 flds 0x4(%eax) 804e9b5: d9 5c 24 2c fstps 0x2c(%esp) den = v3dy*v1dx - v3dx*v1dy; 804e9b9: d9 40 08 flds 0x8(%eax) 804e9bc: d9 40 0c flds 0xc(%eax) 804e9bf: 31 c0 xor %eax,%eax 804e9c1: d9 c2 fld %st(2) 804e9c3: d8 c9 fmul %st(1),%st 804e9c5: dd 04 24 fldl (%esp) 804e9c8: d8 cb fmul %st(3),%st 804e9ca: de e9 fsubrp %st,%st(1) if (fabs(den) < 1.0E-36) 804e9cc: d9 c0 fld %st(0) 804e9ce: d9 e1 fabs 804e9d0: dd 05 d8 d8 11 08 fldl 0x811d8d8 804e9d6: df f1 fcomip %st(1),%st 804e9d8: dd d8 fstp %st(0) 804e9da: 77 3c 
ja 804ea18 ret_DVL_none.clone.1 num = (v3x - v1x)*v3dy + (v1y - v3y)*v3dx; 804e9dc: d9 c5 fld %st(5) 804e9de: d8 6c 24 28 fsubrs 0x28(%esp) 804e9e2: d9 c5 fld %st(5) 804e9e4: d8 64 24 2c fsubs 0x2c(%esp) 804e9e8: dc cc fmul %st,%st(4) 804e9ea: d9 cb fxch %st(3) 804e9ec: d8 c9 fmul %st(1),%st 804e9ee: de c4 faddp %st,%st(4) frac = num / den; 804e9f0: d9 cb fxch %st(3) 804e9f2: d8 f1 fdiv %st(1),%st if (frac<0.0 ... ) 804e9f4: d9 ee fldz 804e9f6: df f1 fcomip %st(1),%st 804e9f8: 77 2e ja 804ea28 ret_DVL_none.clone.2 if ( ... frac>1.0) 804e9fa: d9 e8 fld1 804e9fc: d9 c9 fxch %st(1) 804e9fe: db f1 fcomi %st(1),%st 804ea00: dd d9 fstp %st(1) 804ea02: 76 3c jbe 804ea40 body.num1 # ret_DVL_none.clone.3: -> return DVL_none 804ea04: dd d8 fstp %st(0) 804ea06: dd d8 fstp %st(0) 804ea08: dd d8 fstp %st(0) 804ea0a: dd d8 fstp %st(0) 804ea0c: dd d8 fstp %st(0) 804ea0e: dd d8 fstp %st(0) 804ea10: dd d8 fstp %st(0) 804ea12: eb 24 jmp 804ea38 ret_common.1 804ea14: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi ret_DVL_none.clone.1: -> return DVL_none 804ea18: dd d8 fstp %st(0) 804ea1a: dd d8 fstp %st(0) 804ea1c: dd d8 fstp %st(0) 804ea1e: dd d8 fstp %st(0) 804ea20: dd d8 fstp %st(0) 804ea22: dd d8 fstp %st(0) 804ea24: eb 12 jmp 804ea38 ret_common.1 804ea26: 66 90 xchg %ax,%ax ret_DVL_none.clone.2: -> return DVL_none 804ea28: dd d8 fstp %st(0) 804ea2a: dd d8 fstp %st(0) 804ea2c: dd d8 fstp %st(0) 804ea2e: dd d8 fstp %st(0) 804ea30: dd d8 fstp %st(0) 804ea32: dd d8 fstp %st(0) 804ea34: dd d8 fstp %st(0) return DVL_none; 804ea36: 66 90 xchg %ax,%ax ret_common.1: 804ea38: 83 c4 34 add $0x34,%esp 804ea3b: 5b pop %ebx 804ea3c: 5e pop %esi 804ea3d: c3 ret body.num1: num = (v3x - v1x)*v1dy + (v1y - v3y)*v1dx; 804ea3e: 66 90 xchg %ax,%ax 804ea40: d9 ca fxch %st(2) 804ea42: d8 cc fmul %st(4),%st 804ea44: d9 cb fxch %st(3) 804ea46: dc 0c 24 fmull (%esp) 804ea49: de c3 faddp %st,%st(3) result->divfrac = num / den; 804ea4b: de fa fdivrp %st,%st(2) 804ea4d: d9 c9 fxch %st(1) 804ea4f: d9 5e 0c fstps 
0xc(%esi) result->divpt.x = v1x + v1dx*frac; 804ea52: dc c9 fmul %st,%st(1) 804ea54: d9 c9 fxch %st(1) 804ea56: de c3 faddp %st,%st(3) 804ea58: d9 ca fxch %st(2) 804ea5a: d9 1e fstps (%esi) result->divpt.y = v1y + v1dy*frac; 804ea5c: dd 04 24 fldl (%esp) 804ea5f: d8 ca fmul %st(2),%st 804ea61: de c1 faddp %st,%st(1) 804ea63: d9 5e 04 fstps 0x4(%esi) + if( frac < 0.05f + 804ea66: d9 05 a0 d8 11 08 flds 0x811d8a0 + 804ea6c: df f1 fcomip %st(1),%st + 804ea6e: 76 1b jbe 804ea8b body.if.95 if( ... && SameVertex() 804ea70: 89 f0 mov %esi,%eax 804ea72: 89 4c 24 0c mov %ecx,0xc(%esp) 804ea76: dd 5c 24 10 fstpl 0x10(%esp) 804ea7a: e8 d1 fe ff ff call 804e950 <SameVertex.clone.5> 804ea7f: 8b 4c 24 0c mov 0xc(%esp),%ecx 804ea83: 85 c0 test %eax,%eax 804ea85: dd 44 24 10 fldl 0x10(%esp) 804ea89: 75 6d jne 804eaf8 case_DVL_v1 body.if.95: + if( frac > 0.95f + 804ea8b: d9 05 a4 d8 11 08 flds 0x811d8a4 + 804ea91: d9 c9 fxch %st(1) + 804ea93: df f1 fcomip %st(1),%st + 804ea95: dd d8 fstp %st(0) + 804ea97: 77 27 ja 804eac0 if.95.OR.samevertex case_DVL_mid: 804ea99: c7 46 08 00 00 00 00 movl $0x0,0x8(%esi) return DVL_mid; 804eaa0: b8 02 00 00 00 mov $0x2,%eax 804eaa5: c7 46 10 00 00 00 00 movl $0x0,0x10(%esi) 804eaac: c7 46 14 01 00 00 00 movl $0x1,0x14(%esi) 804eab3: c7 46 18 00 00 00 00 movl $0x0,0x18(%esi) 804eaba: 83 c4 34 add $0x34,%esp 804eabd: 5b pop %ebx 804eabe: 5e pop %esi 804eabf: c3 ret if.95.OR.samevertex: if( ... 
   && SameVertex()
 804eac0:  89 ca                 mov    %ecx,%edx
 804eac2:  89 f0                 mov    %esi,%eax
 804eac4:  89 4c 24 0c           mov    %ecx,0xc(%esp)
 804eac8:  e8 83 fe ff ff        call   804e950 <SameVertex.clone.5>
 804eacd:  8b 4c 24 0c           mov    0xc(%esp),%ecx
 804ead1:  85 c0                 test   %eax,%eax
 804ead3:  74 c4                 je     804ea99 case_DVL_mid
# case_DVL_v2:
 804ead5:  89 4e 08              mov    %ecx,0x8(%esi)
   return DVL_v2;
 804ead8:  b8 03 00 00 00        mov    $0x3,%eax
 804eadd:  c7 46 10 00 00 00 00  movl   $0x0,0x10(%esi)
 804eae4:  c7 46 14 02 00 00 00  movl   $0x2,0x14(%esi)
 804eaeb:  c7 46 18 01 00 00 00  movl   $0x1,0x18(%esi)
 804eaf2:  e9 41 ff ff ff        jmp    804ea38 ret_common.1
 804eaf7:  90                    nop
case_DVL_v1:
 804eaf8:  dd d8                 fstp   %st(0)
 804eafa:  89 5e 08              mov    %ebx,0x8(%esi)
   return DVL_v1;
 804eafd:  b8 01 00 00 00        mov    $0x1,%eax
 804eb02:  c7 46 10 ff ff ff ff  movl   $0xffffffff,0x10(%esi)
 804eb09:  c7 46 14 01 00 00 00  movl   $0x1,0x14(%esi)
 804eb10:  c7 46 18 01 00 00 00  movl   $0x1,0x18(%esi)
 804eb17:  e9 1c ff ff ff        jmp    804ea38 ret_common.1
 804eb1c:  8d 74 26 00           lea    0x0(%esi,%eiz,1),%esi

==============
Tst10_k8:

0804e980 <fracdivline>:
 804e980:  56                    push   %esi
 804e981:  53                    push   %ebx
 804e982:  89 d3                 mov    %edx,%ebx
 804e984:  83 ec 34              sub    $0x34,%esp
 804e987:  d9 02                 flds   (%edx)
 804e989:  8b 74 24 40           mov    0x40(%esp),%esi
 804e98d:  d9 42 04              flds   0x4(%edx)
 804e990:  d9 01                 flds   (%ecx)
 804e992:  d8 e2                 fsub   %st(2),%st
 804e994:  d9 41 04              flds   0x4(%ecx)
 804e997:  d8 e2                 fsub   %st(2),%st
 804e999:  dd 1c 24              fstpl  (%esp)
 804e99c:  d9 00                 flds   (%eax)
 804e99e:  d9 5c 24 28           fstps  0x28(%esp)
 804e9a2:  d9 40 04              flds   0x4(%eax)
 804e9a5:  d9 5c 24 2c           fstps  0x2c(%esp)
   den = v3dy*v1dx - v3dx*v1dy;
 804e9a9:  d9 40 08              flds   0x8(%eax)
 804e9ac:  d9 40 0c              flds   0xc(%eax)
 804e9af:  31 c0                 xor    %eax,%eax
 804e9b1:  d9 c2                 fld    %st(2)
 804e9b3:  d8 c9                 fmul   %st(1),%st
 804e9b5:  dd 04 24              fldl   (%esp)
 804e9b8:  d8 cb                 fmul   %st(3),%st
 804e9ba:  de e9                 fsubrp %st,%st(1)
   if (fabs(den) < 1.0E-36)
 804e9bc:  d9 c0                 fld    %st(0)
 804e9be:  d9 e1                 fabs
 804e9c0:  dd 05 d8 e4 11 08     fldl   0x811e4d8
 804e9c6:  df f1                 fcomip %st(1),%st
 804e9c8:  df c0                 ffreep %st(0)
 804e9ca:  77 3c                 ja     804ea08 ret_DVL_none.clone.1
   num = (v3x - v1x)*v3dy + (v1y - v3y)*v3dx;
 804e9cc:  d9 c5                 fld    %st(5)
 804e9ce:  d8 6c 24 28           fsubrs 0x28(%esp)
 804e9d2:  d9 c5                 fld    %st(5)
 804e9d4:  d8 64 24 2c           fsubs  0x2c(%esp)
 804e9d8:  dc cc                 fmul   %st,%st(4)
 804e9da:  d9 cb                 fxch   %st(3)
 804e9dc:  d8 c9                 fmul   %st(1),%st
 804e9de:  de c4                 faddp  %st,%st(4)
   frac = num / den;
 804e9e0:  d9 cb                 fxch   %st(3)
 804e9e2:  d8 f1                 fdiv   %st(1),%st
   if (frac<0.0 ... )
 804e9e4:  d9 ee                 fldz
 804e9e6:  df f1                 fcomip %st(1),%st
 804e9e8:  77 2e                 ja     804ea18 ret_DVL_none.clone.2
   if ( ... frac>1.0)
 804e9ea:  d9 e8                 fld1
 804e9ec:  d9 c9                 fxch   %st(1)
 804e9ee:  db f1                 fcomi  %st(1),%st
 804e9f0:  dd d9                 fstp   %st(1)
 804e9f2:  76 3c                 jbe    804ea30 body.num1
# ret_DVL_none.clone.3: -> return DVL_none
 804e9f4:  df c0                 ffreep %st(0)
 804e9f6:  df c0                 ffreep %st(0)
 804e9f8:  df c0                 ffreep %st(0)
 804e9fa:  df c0                 ffreep %st(0)
 804e9fc:  df c0                 ffreep %st(0)
 804e9fe:  df c0                 ffreep %st(0)
 804ea00:  df c0                 ffreep %st(0)
 804ea02:  eb 24                 jmp    804ea28 ret_common.1
 804ea04:  8d 74 26 00           lea    0x0(%esi,%eiz,1),%esi
ret_DVL_none.clone.1: -> return DVL_none
 804ea08:  df c0                 ffreep %st(0)
 804ea0a:  df c0                 ffreep %st(0)
 804ea0c:  df c0                 ffreep %st(0)
 804ea0e:  df c0                 ffreep %st(0)
 804ea10:  df c0                 ffreep %st(0)
 804ea12:  df c0                 ffreep %st(0)
 804ea14:  eb 12                 jmp    804ea28 ret_common.1
 804ea16:  66 90                 xchg   %ax,%ax
ret_DVL_none.clone.2: -> return DVL_none
 804ea18:  df c0                 ffreep %st(0)
 804ea1a:  df c0                 ffreep %st(0)
 804ea1c:  df c0                 ffreep %st(0)
 804ea1e:  df c0                 ffreep %st(0)
 804ea20:  df c0                 ffreep %st(0)
 804ea22:  df c0                 ffreep %st(0)
 804ea24:  df c0                 ffreep %st(0)
ret_DVL_none:
 804ea26:  66 90                 xchg   %ax,%ax
ret_common.1:
 804ea28:  83 c4 34              add    $0x34,%esp
 804ea2b:  5b                    pop    %ebx
 804ea2c:  5e                    pop    %esi
 804ea2d:  c3                    ret
 804ea2e:  66 90                 xchg   %ax,%ax
body.num1:
   num = (v3x - v1x)*v1dy + (v1y - v3y)*v1dx;
 804ea30:  d9 ca                 fxch   %st(2)
 804ea32:  d8 cc                 fmul   %st(4),%st
 804ea34:  d9 cb                 fxch   %st(3)
 804ea36:  dc 0c 24              fmull  (%esp)
 804ea39:  de c3                 faddp  %st,%st(3)
   result->divfrac = num / den;
 804ea3b:  de fa                 fdivrp %st,%st(2)
 804ea3d:  d9 c9                 fxch   %st(1)
 804ea3f:  d9 5e 0c              fstps  0xc(%esi)
   result->divpt.x = v1x + v1dx*frac;
 804ea42:  dc c9                 fmul   %st,%st(1)
 804ea44:  d9 c9                 fxch   %st(1)
 804ea46:  de c3                 faddp  %st,%st(3)
 804ea48:  d9 ca                 fxch   %st(2)
 804ea4a:  d9 1e                 fstps  (%esi)
   result->divpt.y = v1y + v1dy*frac;
 804ea4c:  dd 04 24              fldl   (%esp)
 804ea4f:  d8 ca                 fmul   %st(2),%st
 804ea51:  de c1                 faddp  %st,%st(1)
 804ea53:  d9 5e 04              fstps  0x4(%esi)
+  if( frac < 0.05f
+804ea56:  d9 05 a0 e4 11 08     flds   0x811e4a0
+804ea5c:  df f1                 fcomip %st(1),%st
+804ea5e:  76 1b                 jbe    804ea7b body.if.95
   if( ... && SameVertex()
 804ea60:  dd 5c 24 10           fstpl  0x10(%esp)
 804ea64:  89 f0                 mov    %esi,%eax
 804ea66:  89 4c 24 0c           mov    %ecx,0xc(%esp)
 804ea6a:  e8 d1 fe ff ff        call   804e940 <SameVertex.clone.5>
 804ea6f:  85 c0                 test   %eax,%eax
 804ea71:  8b 4c 24 0c           mov    0xc(%esp),%ecx
 804ea75:  dd 44 24 10           fldl   0x10(%esp)
 804ea79:  75 6d                 jne    804eae8 case_DVL_v1
body.if.95:
+  if( frac > 0.95f
+804ea7b:  d9 05 a4 e4 11 08     flds   0x811e4a4
+804ea81:  d9 c9                 fxch   %st(1)
+804ea83:  df f1                 fcomip %st(1),%st
+804ea85:  df c0                 ffreep %st(0)
+804ea87:  77 27                 ja     804eab0 if.95.OR.samevertex
case_DVL_mid:
 804ea89:  c7 46 08 00 00 00 00  movl   $0x0,0x8(%esi)
 804ea90:  c7 46 10 00 00 00 00  movl   $0x0,0x10(%esi)
   return DVL_mid;
 804ea97:  b8 02 00 00 00        mov    $0x2,%eax
 804ea9c:  c7 46 14 01 00 00 00  movl   $0x1,0x14(%esi)
 804eaa3:  c7 46 18 00 00 00 00  movl   $0x0,0x18(%esi)
 804eaaa:  83 c4 34              add    $0x34,%esp
 804eaad:  5b                    pop    %ebx
 804eaae:  5e                    pop    %esi
 804eaaf:  c3                    ret
if.95.OR.samevertex:
   if( ... && SameVertex()
 804eab0:  89 ca                 mov    %ecx,%edx
 804eab2:  89 f0                 mov    %esi,%eax
 804eab4:  89 4c 24 0c           mov    %ecx,0xc(%esp)
 804eab8:  e8 83 fe ff ff        call   804e940 <SameVertex.clone.5>
 804eabd:  85 c0                 test   %eax,%eax
 804eabf:  8b 4c 24 0c           mov    0xc(%esp),%ecx
 804eac3:  74 c4                 je     804ea89 case_DVL_mid
# case_DVL_v2:
 804eac5:  89 4e 08              mov    %ecx,0x8(%esi)
 804eac8:  c7 46 10 00 00 00 00  movl   $0x0,0x10(%esi)
   return DVL_v2;
 804eacf:  b8 03 00 00 00        mov    $0x3,%eax
 804ead4:  c7 46 14 02 00 00 00  movl   $0x2,0x14(%esi)
 804eadb:  c7 46 18 01 00 00 00  movl   $0x1,0x18(%esi)
 804eae2:  e9 41 ff ff ff        jmp    804ea28 ret_common.1
 804eae7:  90                    nop
case_DVL_v1:
 804eae8:  df c0                 ffreep %st(0)
 804eaea:  89 5e 08              mov    %ebx,0x8(%esi)
 804eaed:  c7 46 10 ff ff ff ff  movl   $0xffffffff,0x10(%esi)
   return DVL_v1;
 804eaf4:  b8 01 00 00 00        mov    $0x1,%eax
 804eaf9:  c7 46 14 01 00 00 00  movl   $0x1,0x14(%esi)
 804eb00:  c7 46 18 01 00 00 00  movl   $0x1,0x18(%esi)
 804eb07:  e9 1c ff ff ff        jmp    804ea28 ret_common.1
 804eb0c:  8d 74 26 00           lea    0x0(%esi,%eiz,1),%esi

0 Share this post Link to post
kb1 Posted April 17, 2017

Yes, it's a real shame when you can't trust your CPU instructions, because you end up carrying these legacy fixes forever to handle problems with ancient CPUs. Then again, modern CPUs are so damn complicated, it's almost forgivable when a bug creeps in.

Anyway, that's too much code for me to analyze. I do see some of the "clone" stuff you're describing. Sometimes, compilers will align with "variable-sized NOPs", which can be any instruction of the desired size that doesn't mess up any calculations. Intel, and I suppose AMD, have stated that you're supposed to use specific instructions for this purpose, vs. a MOV EAX, EAX or whatever. With the recent drive for power-efficient CPUs, these preferred NOP instructions could be hard-wired to actually do nothing, vs. doing a harmless instruction, which can also help by eliminating dependencies and the like. I can't explain the code clones, unless one is used inline and the original is kept, which might help during single-step debugging? (I'm reaching here :)

I fear the possibility of sending you off on a wild goose chase, but I must suggest: maybe it's time to try another compiler, at least as a test. It could provide more info that could validate what your compiler is doing. For instance, maybe you'd see the strange fcomps workaround stuff being generated conditionally in another compiler. It could also provide a code size comparison.

But, honestly, at some point, you have to either trust your compiler and choose to live with the occasional benign size bumps/code clones, or be forever unhappy and rewrite the whole damn thing in assembler. Short of that, you could maybe become proficient at inline compiler directives that modify local behavior, load in pre-built, pre-vetted libraries to replace certain code blocks, etc. But for a code base the size of Doom, it becomes a ridiculous proposition.
In this specific case, "Close your eyes and act like nothing happened" may actually be the best policy, as much as I hate to say it. Good luck. 0 Share this post Link to post
axdoomer Posted April 23, 2017

A question came to my mind when reading this thread: why does Doom use fixed point instead of using bare ints? Doom has to shift or cast numbers to make multiplications and divisions work correctly (this uses more CPU time), but using bare ints, they wouldn't have to do any "conversions", right? It's only a question of representation as far as I know.

FixedDiv and FixedMul from Chocolate Doom:

    // Fixme. __USE_C_FIXED__ or something.
    fixed_t FixedMul ( fixed_t a, fixed_t b )
    {
        return ((int64_t) a * (int64_t) b) >> FRACBITS;
    }

    //
    // FixedDiv, C version.
    //
    fixed_t FixedDiv(fixed_t a, fixed_t b)
    {
        if ((abs(a) >> 14) >= abs(b))
        {
            return (a^b) < 0 ? INT_MIN : INT_MAX;
        }
        else
        {
            int64_t result;
            result = ((int64_t) a << FRACBITS) / b;
            return (fixed_t) result;
        }
    }

0 Share this post Link to post
dpJudas Posted April 24, 2017 The shift is virtually free, because it is among the fastest instructions available. The casts to 64 bit are only needed in C because the language cannot properly express the actual assembly instructions, if I remember correctly. Either way, you cannot use "bare ints" because the precision of whole integers is not good enough for the math Doom is using. 0 Share this post Link to post
axdoomer Posted April 24, 2017 15 minutes ago, dpJudas said: Either way, you cannot use "bare ints" because the precision of whole integers is not good enough for the math Doom is using. The fixed points are 16.16, so you can multiply everything by 65536. That means that what is currently 1 unit would be represented as 65536 units. You still have 32 bits to represent Doom's world. I'm just trying to find out whether the developers complicated their own lives or whether it simplified their work (and whether it made the game faster or slower). 0 Share this post Link to post
dpJudas Posted April 24, 2017 If you multiply everything by 65536 you just invented fixed point. You have to divide by 65536 (aka shift 16 bits to the right) after a multiplication because otherwise you multiplied things twice by 65536. 0 Share this post Link to post
kb1 Posted April 24, 2017 8 hours ago, axdoomer said: The fixed points are 16.16, so you can multiply everything by 65536. That means that what is currently 1 unit would be represented as 65536 units. You still have 32 bits to represent Doom's world. I'm just trying to find out whether the developers complicated their own lives or whether it simplified their work (and whether it made the game faster or slower). 16.16 gives you numbers in the range of -32768 to +32767 with a granularity of 1/65536. In machine language, you don't even have to shift, which was helpful in vanilla. Especially back then, divides and multiplies on ints were much faster than with floats, and Doom needs various amounts of fractional precision throughout the engine. And it's not just 16.16; it used 12.20 and others. This is the essence of the Wiggle Fix code - it dynamically adjusts the ratio of whole to fractional units in specific wall accumulators to prevent a nasty renderer artifact that causes walls to shift around unnaturally. Essentially a home-grown floating point built from fixed-point variables. 1 Share this post Link to post
wesleyjohnson Posted April 24, 2017

I am done with this topic. I do not have those other compilers; this is where someone else steps up and shows what clang or MS does. It depends upon what target your distribution is for. As we still compile the distribution for i486, I am going to leave the float literal markers off of most comparisons to avoid the code bump. Any future compilation for i686 will rely upon the compiler optimization. I already committed the code, so this is already done. Knowing what causes these code-size bumps lets me keep using code size to judge the quality of code fixes, now that I know where some of the extraneous noise comes from.

Another strange one: I took out an IF stmt that tested for Boom compatibility, and the code size jumped by 2K. Make some other changes, and the code size is stuck at one value. Put the one IF stmt back, and the 2K code size bump disappears. I think I don't want to even look at it. 0 Share this post Link to post
kb1 Posted April 27, 2017 On 4/24/2017 at 4:13 PM, wesleyjohnson said: I am done with this topic. I do not have those other compilers; this is where someone else steps up and shows what clang or MS does. It depends upon what target your distribution is for. As we still compile the distribution for i486, I am going to leave the float literal markers off of most comparisons to avoid the code bump. Any future compilation for i686 will rely upon the compiler optimization. I already committed the code, so this is already done. Knowing what causes these code-size bumps lets me keep using code size to judge the quality of code fixes, now that I know where some of the extraneous noise comes from. Another strange one: I took out an IF stmt that tested for Boom compatibility, and the code size jumped by 2K. Make some other changes, and the code size is stuck at one value. Put the one IF stmt back, and the 2K code size bump disappears. I think I don't want to even look at it.

I wouldn't look :) You know, I've been studying the x86/x64 processor docs recently, and here's something that may ease your mind somewhat: With all the caching, deep pipelining, multiple execution port stuff in there, you get a lot of statements executed, essentially, for free! It's truly amazing how much work Intel and AMD have put into optimizing their products. Long-winded, sloppily-compiled code often runs faster than tight code, even. It's all converted to internal "micro-op" code anyway.

Cache is king. That's really the thing that matters these days. Re-calculating a value is often faster than reading a lookup entry in uncached memory, which is a big change in methodology. I would not worry about a 2K bump here or there. I would assume that there was a good reason, and trust the compiler, for all but the most important loops. 0 Share this post Link to post
Graf Zahl Posted April 27, 2017 6 hours ago, kb1 said: With all the caching, deep pipelining, multiple execution port stuff in there, you get a lot of statements executed, essentially, for free! It's truly amazing how much work Intel and AMD have put into optimizing their products. Long-winded, sloppily-compiled code often runs faster than tight code, even.

I can confirm this. Modern compilers know better how a CPU optimizes. We were quite surprised when we found out that all the highly optimized assembly for the draw loops in ZDoom's renderer had lost all advantage over its C counterpart in the last 2 or 3 CPU generations; I had to pull out my 10-year-old laptop to see the assembly stuff have a minor advantage. Compiled code may look sloppy, but this is often done to get some instructions in between that can be executed for free. In one case, the assembly code only looked better - after finding out what slowed down the C version and fixing it to work properly, it was just as fast.

In this particular situation it is that loading global variables in 64-bit code is REALLY slow. Doing it inside a loop is murderous. Just loading them into local variables, even onto the stack, made all the difference. Of course, that optimized version of the function was quite a bit larger, because it had to save all registers onto the stack, then load them with the global variables, and afterward pop the registers again. It still was twice as fast as the version directly reading the global variables.

It is also quite pointless to look at the binary size. A 2k bump can simply be one new page of content being added, even if that amounts to only a few bytes of code. How large a page is depends on the linker; it can be 512 bytes, but some linkers choose larger values. 0 Share this post Link to post
kb1 Posted April 27, 2017 (edited) 11 hours ago, Graf Zahl said: I can confirm this. Modern compilers know better how a CPU optimizes. We were quite surprised when we found out that all the highly optimized assembly for the draw loops in ZDoom's renderer had lost all advantage over its C counterpart in the last 2 or 3 CPU generations; I had to pull out my 10-year-old laptop to see the assembly stuff have a minor advantage. Compiled code may look sloppy, but this is often done to get some instructions in between that can be executed for free. In one case, the assembly code only looked better - after finding out what slowed down the C version and fixing it to work properly, it was just as fast. In this particular situation it is that loading global variables in 64-bit code is REALLY slow. Doing it inside a loop is murderous. Just loading them into local variables, even onto the stack, made all the difference. Of course, that optimized version of the function was quite a bit larger, because it had to save all registers onto the stack, then load them with the global variables, and afterward pop the registers again. It still was twice as fast as the version directly reading the global variables. It is also quite pointless to look at the binary size. A 2k bump can simply be one new page of content being added, even if that amounts to only a few bytes of code. How large a page is depends on the linker; it can be 512 bytes, but some linkers choose larger values.

In Wesley's case, it grew 2k after removing an IF statement, so that's interesting. But yeah, making your vars local gets them into cache and, hopefully, registers. Good stuff.

Now, as you know, I have to play devil's advocate in this area and claim that, possibly, some hand-written assembly would still be better for the render loops, but the gap is closing. Because the effect is so processor-specific, it is a ton of work to get the absolute best performance on a range of processors.
And, the code will only be optimal on some processors. It can be done, but it's a project in itself. The compiler guys have intimate knowledge of these issues, so, in almost all cases, trust your compiler! 0 Share this post Link to post
Graf Zahl Posted April 27, 2017 31 minutes ago, kb1 said: In Wesley's case, it grew 2k after removing an IF statement, so that's interesting. But yeah, making your vars local gets them into cache and, hopefully, registers. Good stuff. Now, as you know, I have to play devil's advocate in this area and claim that, possibly, some hand-written assembly would still be better for the render loops, but the gap is closing.

If you can orchestrate it perfectly, you might have been able to shave off maybe 5% more. But then the next CPU generation comes along and won't like your optimization, making the C version faster again. With today's CPUs it is simply a battle that cannot be won. Let's not forget that the original assembly I was talking about already used all registers, and even used a bit of self-modifying code to get the remaining two values off the stack, too. But from what I have seen, it looks like CPUs already get heavily optimized for reading local stack data, because it is so frequent in compiled code.

In any case, the main reason the assembly was ditched was not that it had lost all performance advantage, but that with the transition to a multithreaded renderer it just became unmaintainable. And the performance boost from the multithreading was orders of magnitude more than the few measly percent a well-written assembly routine might have yielded. 0 Share this post Link to post
kb1 Posted April 28, 2017 20 hours ago, Graf Zahl said: If you can orchestrate it perfectly, you might have been able to shave off maybe 5% more. But then the next CPU generation comes along and won't like your optimization, making the C version faster again. With today's CPUs it is simply a battle that cannot be won. Let's not forget that the original assembly I was talking about already used all registers, and even used a bit of self-modifying code to get the remaining two values off the stack, too. But from what I have seen, it looks like CPUs already get heavily optimized for reading local stack data, because it is so frequent in compiled code. In any case, the main reason the assembly was ditched was not that it had lost all performance advantage, but that with the transition to a multithreaded renderer it just became unmaintainable. And the performance boost from the multithreading was orders of magnitude more than the few measly percent a well-written assembly routine might have yielded.

I understand what happened in ZDoom's case. I guess I'll have to put my money where my mouth is with a demonstration. 0 Share this post Link to post