wesleyjohnson

float literals


This started with something I saw (and I do not know anymore where) that stated that by default a literal such as 1.25 is a double, and will be converted at run-time to be stored or used as a float.

So I went about finding any float literals that were not marked with the f.

 

So I feel like I am shooting myself in the foot asking this here, but the answers are so interesting.

 

I wrote a test program with a mixture of default literals and explicit float literals, compiled it on gcc 4.5, and dumped it as assembly output.

The assembly shows exactly the same code for both the float literal and the double literal.  I used -O0 to avoid optimization, but gcc still seems to drop the run-time conversion of double to float.

 

But in DoomLegacy code, changing a 1.25 to 1.25f makes the code get smaller, so something is happening.

But ... changing a (f1 < 0.01)  to (f1 < 0.01f)  makes the code get larger, consistently. 

I am comparing the sizes of stripped binaries, so it should not be the debugging information.

I have yet to look at the assembly for this, because it is a bit more difficult to dump DoomLegacy into assembly output and find the interesting spot.

 

I seek enlightenment, or at least a rationale (in the general case).

float f1, f2, f3;
 
f1 = 1.25;
  // defaults to double ??
f2 = 1.25f;
  // nice, code gets smaller
f3 = 0.0;
  // does it care when it is zero ?

f2 = f1 * 2;
  // does it care when multiply by int

if( f1 < 0.01 )  ...
  // should change this to 0.01f ?
            
if( f2 < 0.01f ) ...
  // what the heck, code got 30 bytes larger for every one of these I fix

if( f3 < 0.0 )  ...
  // does it care about the f when the test is against zero
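
A small check of the questions in the snippet above, offered as a sketch rather than a definitive answer. In C, an unsuffixed literal such as 1.25 has type double and 1.25f has type float, and a float compared against a double literal is converted to double first. The helper names (TYPE_NAME, compare_demo) are made up for illustration, C11 `_Generic` is assumed, and the comparison result assumes an IEEE 754 target where float is a 32-bit binary format and comparisons are not done in extra precision.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: name the type of an expression (needs C11 _Generic). */
#define TYPE_NAME(x) _Generic((x), \
    float: "float", double: "double", long double: "long double", \
    default: "other")

/* Returns 1 if "f1 < 0.01" and "f1 < 0.01f" disagree for f1 = 0.01f. */
static int compare_demo(void)
{
    volatile float f1 = 0.01f;   /* volatile: force a real load at run time */

    /* The literal types are fixed by the language, not by the target. */
    assert(sizeof(1.25)  == sizeof(double));
    assert(sizeof(1.25f) == sizeof(float));

    printf("1.25     is %s\n", TYPE_NAME(1.25));
    printf("1.25f    is %s\n", TYPE_NAME(1.25f));
    printf("f1 * 2   is %s\n", TYPE_NAME(f1 * 2));    /* the int converts to float */
    printf("f1 * 2.0 is %s\n", TYPE_NAME(f1 * 2.0));  /* the float converts to double */

    int vs_double = (f1 < 0.01);   /* f1 is converted to double before comparing */
    int vs_float  = (f1 < 0.01f);  /* both sides are float */
    printf("f1 < 0.01  -> %d\n", vs_double);
    printf("f1 < 0.01f -> %d\n", vs_float);

    return vs_double != vs_float;
}
```

The float nearest to 0.01 is slightly below the double nearest to 0.01, so the two tests can give different answers for the same variable. That means adding the f to a comparison is not always a pure size or speed change; it can change which values pass the test.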

 

Edited by wesleyjohnson : wording


You can't really get any meaningful info from the size of the binary, with such a small test case, if at all. In general, the compiler does lots of things, some of which are not directly related to a specific line of code, which will produce a counter-intuitive result. Here's just some off-the-top-of-my-head examples, which may or may not apply to your example:

 

  • Code and data alignment
  • Floating point setup code (like MMX floating-point reset, or setting rounding mode)
  • Variable promotion (some languages will, for example, promote an int to a float, or even both ints to floats, before doing, say, a multiplication)

By including a decimal, the compiler knows that the constant is not an integer, so it is free to make the R-Value itself either single or double, or even 80-bit. But it must then convert to be able to store the value.

 

But, back to the subject of .EXE size: The EXE header structure works in blocks of data. If you compile a small program, and then add a small amount of code, it is possible that the binary size will not change at all, because it still fits in the block. Furthermore, newer compilers that do "whole program optimization" are free to rearrange things in complex ways. It's just not a good way to measure. And, finally, frequently, larger machine language code is faster than tightly packed code! Modern X86 processor instructions can be up to 15 bytes long! The processor is able to pull in that much code and decode the instruction quickly. Now, code cache size does present an argument for smaller compiled code size. But these rules are so complex that trying to analyze an entire process is a massive project.

 

However: If you really want to know the answer, it can be done, but it's going to take a little bit of work. You can write a program that generates A LOT of similar code with the "f" and without the "f", and check binary size of the result, and that should provide a more meaningful result. If the code is large enough, you should be able to counteract any alignment/block size/optimization issues. But, whatever code you generate, you must make sure the compiler does not optimize it out. All the constants you create must be used in this test program, and must be used in a way that the clever compiler cannot factor out, and ruin your tests.

 

And, finally I must state the obvious: A look at the assembly should provide the answer, much more easily than my suggestion above. And, I must ask: Are you trying to figure out if the compiler is storing doubles vs. floats, or are you asking because you are actually concerned about final compiled binary size? I ask, because bigger code != slower code on modern processors.

 

If you do get some results, please post them: I'd be interested to know what those results are. Good luck.


It might be that before it could store it in the binary as a literal ".01", but when you make it a float literal it has to store the full 32-bit floating point representation?  That doesn't mean 30 bytes I guess, but there could be cascading effects.  Dunno.

 

You can have GCC dump the assembly for a single file.


Did you get any more info on this? To really pin it down, I'd try something like this:

  • Write a program that generates a .C file, called "Listing1.C" that looks kinda like this:

// Listing 1 - No 'f'
#include <time.h>
int main(void)
{
  float  f1;
  time_t clock1;
  time_t clock2;

  time(&clock1);
  f1 = 1 / (float)clock1;   // value is not known at compile time

  /* Make A LOT of these lines using a program that generates them.
      The literal used should be chosen randomly. We are trying to
      determine what happens when a float constant has an explicit 'f'
      or not, and the effect that that has on the compiled output. So,
      the random literals we choose should not require more than float
      accuracy to express literally. We do not want an implicit conversion
      to anything except float. Perhaps, the literals should be generated
      as strings of a specific length: "0." & xxxxx, where xxxxx = 5 decimal
      digits.
  */

  #define ENTRY (1)
  #define VALUE (0.12246)
  if (f1 == VALUE)
  {
    time(&clock2);
    return ((int)clock2) / ENTRY;
  }
  #undef ENTRY
  #undef VALUE

  #define ENTRY (2)
  #define VALUE (0.58909)
  if (f1 == VALUE)
  {
    time(&clock2);
    return ((int)clock2) / ENTRY;
  }
  #undef ENTRY
  #undef VALUE

  ...

  #define ENTRY (1000)
  #define VALUE (0.08716)
  if (f1 == VALUE)
  {
    time(&clock2);
    return ((int)clock2) / ENTRY;
  }
  #undef ENTRY
  #undef VALUE

  return 0;
}

 

 

Some notes on my theory of operation:

  • Most likely, if you use the same pseudo-random constants, and turn off optimizations, these 2 listings will produce identical output. I think maybe, with your listing above, the mismatched use of the float identifier 'f' thwarted an optimization that allowed the compiler to do all the calculations at compile time, reducing the output to a single unconditional path. In other words, I think the compiler could have reduced your code to this:
float f1 = 1.25f;
float f2 = 2.5f;
float f3 = 0;

In fact, it could then remove all the code, cause the vars are not used (with what you listed). None of the IF statements are taken. But, the mixed usage of 'f' may have prevented it from discovering this possibility. Sometimes, the simplest setup confuses an otherwise brilliant compiler. But, it's just a guess :)

  • By specifying 1,000 entries, the code should "spill into" multiple "blocks", regardless of whatever definition of "block" you may use, thereby preventing a block allocation type of obfuscation of code growth. Also, any code length differences are multiplied by 1,000, making them very obvious.
  • The use of the first time function guarantees that compile-time calculation/code reduction cannot occur in the comparison.
  • The use of the time function inside the blocks prevents the creation of a "virtual return table", which is highly unlikely anyway. But, if I had not included the "/ ENTRY", the compiler could generate a single "Call time(&clock2); return (&clock2)" code block, and jumped, or "fell into" it, which I wanted to prevent.
  • The point I'm trying to make is that the listings above should be next to impossible for the compiler to optimize, even if you were not turning off optimizations. What remains is the processing of your literals, 'f' or not, with a fixed amount of overhead that can be subtracted out.
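
The generator described above could look roughly like the following. This is only a sketch under assumptions: the function name write_listing, the seed parameter, and writing the literal directly into each `if` (instead of the ENTRY/VALUE macros) are choices made here for brevity, not something taken from the posts.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Write one test listing to `out`.
   suffix: "" for plain literals, "f" for explicit float literals.
   Returns the number of comparison blocks written. */
static int write_listing(FILE *out, const char *suffix, int count, unsigned seed)
{
    assert(out != NULL && count > 0);
    assert(strcmp(suffix, "") == 0 || strcmp(suffix, "f") == 0);

    srand(seed);   /* same seed => same literals in both listings */

    fputs("#include <time.h>\n"
          "int main(void)\n"
          "{\n"
          "  float  f1;\n"
          "  time_t clock1;\n"
          "  time_t clock2;\n"
          "\n"
          "  time(&clock1);\n"
          "  f1 = 1 / (float)clock1;\n"
          "\n", out);

    for (int entry = 1; entry <= count; entry++) {
        int digits = rand() % 100000;   /* "0." followed by 5 decimal digits */
        fprintf(out,
                "  if (f1 == 0.%05d%s)\n"
                "  {\n"
                "    time(&clock2);\n"
                "    return ((int)clock2) / %d;\n"
                "  }\n"
                "\n",
                digits, suffix, entry);
    }

    fputs("  return 0;\n}\n", out);
    return count;
}
```

Calling it twice with the same seed, once with "" and once with "f", on two files opened with fopen() gives a pair of listings that differ only in the suffix. Compiling both with identical flags and comparing the stripped sizes (or the objdump output) then isolates the effect of the 'f'.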

The other, more subtle point is that, trying to manually optimize for code size just isn't very effective anymore. Modern compiler writers are typically very smart, and they have knowledge on a per-processor, per-instruction type level, and they typically cater their compilers to use that information to do the right thing, in general (Not that I can't sometimes do better with hand-written assembly, but that's a different discussion :)

 

It actually can be relevant in this discussion: For hand-written assembler to be justified, and to actually make a significant difference these days, you need to use different, per-processor approaches, which suggests different code paths per architecture. For the past 15 years, or so, processors handle certain code sequences so differently that code optimized for a Pentium II runs sub-optimally on, say, an i3, or i5, or, more importantly, on different manufacturers' offerings, and that's just for x86/x64. So, naturally, if you support multiple code paths, your code has become bigger - a lot bigger, yet performs better. Compiled size is no longer inversely proportional to code performance, and, often, the opposite is true.

 

Another note about modern compilers: The optimization techniques offered lean towards performance, not code size. I know they present that as a simple choice of this or that. But, usually, optimizing for size just prevents some aggressive code performance techniques from being used: techniques like inlining, loop unrolling, etc.

 

Now, noticing a huge spike in binary size may be a useful indicator that you've just pulled in an unnecessary library unintentionally, due to a bug. But, other than that, IMHO binary size by itself is too coarse a measurement to derive much useful (actionable) info from.

 

Though, I have to admit that I do want to see the results of those 2 code listings I posted! I hope you see where I'm coming from, but mainly, I hope you find them (or a similar idea) helpful.


Sorry, but I am really busy lately.  I think the clock test program approach is similar to the little test program that I already wrote.

My test program has some float values coming from function calls, with obfuscating statements to discourage inlining.

Used -O0 to keep the optimizer from confusing things.

I looked at the assembly of that test program and it showed that there was no difference at all in the assembly code.

The compiler was generating the same instructions when the literal was 1.25 as when the literal was 1.25f.

This right off disagrees with the report that plain float literals are actually double, and will be converted to float at run-time.

 

I tested with the GCC 4.5 compiler, with no other switches.  DoomLegacy uses some other switches that could be affecting this, and while I doubt it, there is not much else to suspect.

 

Just using the editor to duplicate some of the code blocks would be faster than writing a program to duplicate stuff.

But I think the result will be the same, no difference.

 

In the DoomLegacy base code, I changed most of the float literals that were being used on float variables to have the f.  These included assigns, and expressions where all the operands were float literals.  There was a significant reduction in the DoomLegacy code size.  That is usually a sign that something was improved that the optimizer was not finding on its own.

 

The significant weirdness was the inequality tests.  Change any float inequality to use an explicit float literal and the code size would bump up by about 30 bytes.  And it happened for every float inequality I changed, in any file.   It is consistent and also cumulative, which suggests some kind of actual effect in the code outside of the block allocation effect.

 

Usually the code size will not change (the block allocation effect) but I think most of that is in handling variables.

For many edits the code size will not change, then another edit will change it in a large chunk.  Been seeing that for years.

But then there are some edits like fixing 5 identical references (like  plane.lighttable[lightval] ), where assigning the reference once to another local var gives an immediate code size reduction.

This tells me that the compiler, even with optimization, is not finding these common subexpressions so it was duplicating the code.  That is why I always check the code size as nothing else I could look at (short of looking at assembly output) could tell me what effect manually handling the common subexpression would have.
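
One possible reason the compiler does not merge such references by itself, sketched with made-up types: when the loop stores through a char pointer, the compiler has to assume the store might change the table pointer, so it may reload it on every pass. The struct and function names below are hypothetical and are not DoomLegacy's real definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a plane structure; not DoomLegacy's real layout. */
typedef struct {
    const unsigned char *lighttable[4];
} plane_t;

/* Repeated reference: dest is a char pointer, so each store may alias
   plane->lighttable, and the compiler may have to reload it every pass. */
static void shade_repeated(const plane_t *plane, int lightval,
                           unsigned char *dest, const unsigned char *src,
                           size_t n)
{
    for (size_t i = 0; i < n; i++)
        dest[i] = plane->lighttable[lightval][src[i]];
}

/* Hoisted by hand: the table pointer is read once into a local, which the
   stores through dest cannot change. */
static void shade_hoisted(const plane_t *plane, int lightval,
                          unsigned char *dest, const unsigned char *src,
                          size_t n)
{
    assert(plane != NULL);
    const unsigned char *table = plane->lighttable[lightval];

    for (size_t i = 0; i < n; i++)
        dest[i] = table[src[i]];
}
```

Both functions produce the same output as long as dest does not actually overlap the plane structure; the second one simply tells the compiler what the programmer already knows.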

 

The GCC info suggests using array indexing is better because otherwise the optimizer is inhibited by ptr references.  My results seem to indicate that the explicit ptr gives smaller code (which must be because it is more direct with less duplicated effort).  The optimizer was not doing as well as my hand optimization.

 

Not much immediate benefit to get an answer right now, and I have my hands full right now.  The code will be "correct" either way, and may in fact be identical, as far as the assembly I have seen so far suggests.   It makes it more difficult not knowing what the code size bump actually means.  Using the code size is much faster and of finer resolution than doing some kind of regression frame rate test.  It is convenient, as long as you have some idea of the other influences on the code size, and how to ignore them as noise.   I will report back if I find anything else significant, maybe in 2 to 3 weeks.

 

 

 

 

 


You really should let the compiler produce assembly output to check these things. Floating point logic can easily be very different from what you may expect when writing code.

 

One example:

 

In Visual Studio, with SSE2 code generation it doesn't matter one bit if you give a constant an 'f' postfix. All the compiler will do is take the value and then internally convert it to the precision the destination variable requires. Otherwise it'd have to emit a conversion instruction that costs additional processing time.

When creating code for x87 things will get even more tricky when using single precision floats. In order to ensure that the entire calculation will remain in single precision, when using the 'precise' floating point model, the value will frequently be stored and re-read to and from memory.

So depending on various settings even seemingly trivial differences can create very different code. And worse, what may be efficient for one hardware can easily be the total opposite for other hardware.

1 hour ago, Graf Zahl said:

You really should let the compiler produce assembly output to check these things. Floating point logic can easily be very different from what you may expect when writing code.

 

One example:

 

In Visual Studio, with SSE2 code generation it doesn't matter one bit if you give a constant an 'f' postfix. All the compiler will do is take the value and then internally convert it to the precision the destination variable requires. Otherwise it'd have to emit a conversion instruction that costs additional processing time.

When creating code for x87 things will get even more tricky when using single precision floats. In order to ensure that the entire calculation will remain in single precision, when using the 'precise' floating point model, the value will frequently be stored and re-read to and from memory.

So depending on various settings even seemingly trivial differences can create very different code. And worse, what may be efficient for one hardware can easily be the total opposite for other hardware.

Very true - the compiler's attempts to preserve precision have non-intuitive effects on the generated code. Things get more complicated with SSE2 vs. x87, at least with doubles, because the older x87 unit uses a full 80 bits of precision internally, to try to maintain accuracy on 64-bit values. If you ask for "precise" calculations, the compiler can try to force all calculations through 80-bit considerations. "Precise" also prevents some optimizations that look algebraically correct. For example, a*(b+c) should be able to replace (a*b+a*c), but the compiler won't always do it, since floating-point calculations are not precise enough to ensure the exact same result. The thinking there is that an optimization should have absolutely no effect on the result, even if it's a single bit in a huge floating point number. In that case, "precise" doesn't mean 'more precise', it means exactly match the original, unoptimized formula's result.
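
To make the a*(b+c) point concrete, here is a deliberately extreme sketch in which the two algebraically equal forms give visibly different results. It assumes IEEE 754 doubles and default compiler settings (no -ffast-math); the function name is made up. In everyday code the difference is usually a last-bit rounding difference rather than an overflow, but the reason the compiler must not swap the forms is the same.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Returns 1 when a*(b+c) and a*b + a*c give different results. */
static int distributive_mismatch(void)
{
    /* volatile keeps the compiler from folding or fusing the arithmetic */
    volatile double a = 2.0, b = DBL_MAX, c = -DBL_MAX;

    double factored = a * (b + c);   /* b + c is exactly 0, so this is 0 */
    volatile double ab = a * b;      /* overflows to +infinity */
    volatile double ac = a * c;      /* overflows to -infinity */
    double expanded = ab + ac;       /* infinity minus infinity is NaN */

    assert(factored == 0.0);
    printf("a*(b+c)   = %g\n", factored);
    printf("a*b + a*c = %g\n", expanded);

    return !(factored == expanded);
}
```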

3 hours ago, wesleyjohnson said:

Sorry, but I am really busy lately.  I think the clock test program approach is similar to the little test program that I already wrote.

My test program has some float values coming from function calls, with obfuscating statements to discourage inlining.

Used -O0 to keep the optimizer from confusing things...

The "clock test" program was written that way to ensure that the compiler could not optimize the code.

Since you use this as a diagnostic, here's what I suggest:

When you get some time, write the smallest program that demonstrates this 30-byte difference when you add 'f'. Then check out the assembly, and find the discrepancy. Maybe toggle some of the compiler options and see if it gets worse, or goes away. Because this is a vital diagnostic for you, knowing what it's telling you is also vital, if that makes sense.
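
A starting point for such a minimal program might be the pair of functions below, differing only in the literal suffix. This is a sketch, not a confirmed reproduction: the file name and function names are invented here, and whether the size difference shows up will depend on the target and flags (for instance a 32-bit x87 build, which on GCC can be requested with -m32 -mfpmath=387 where a 32-bit toolchain is installed).

```c
/* cmp_literals.c (hypothetical file name)
   Build an object file, e.g.  gcc -O2 -c cmp_literals.c
   then inspect it with        objdump -d cmp_literals.o  */
#include <assert.h>

/* Compare against a double literal: frac is converted to double first. */
int below_double(float frac)
{
    return frac < 0.05;
}

/* Compare against a float literal: the comparison stays in float. */
int below_float(float frac)
{
    return frac < 0.05f;
}

/* Sanity check: for values well away from 0.05 both forms must agree. */
void self_check(void)
{
    assert(below_double(0.01f) == below_float(0.01f));
    assert(below_double(0.5f)  == below_float(0.5f));
}
```

Because the two functions sit next to each other in one object file, their disassembly can be compared directly without alignment or link order getting in the way.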

 

And, yes, if you do find out what's going on, please post it here. It's bound to be interesting.


Couldn't ignore the thing for very long,   ...

The compiler uses the FCOMPL for the 64 bit compares, but is avoiding using the FCOMPS instruction for 32 bit compares.  Is there some buggy i386 CPU that has a bad FCOMPS instruction?

That would explain why other test compiles do not show it, the compiler ARCH target is for my native Athlon64, or i686. 

Compiler: GCC 4.5.2   on  Linux 2.6
Source: Doomlegacy 1.46.3
Assembly: objdump -d
Assembly comparison: diff -U4
  diff -U4 markings
      space : is same for both first and second file
      -     : as appears in first file
      +     : as appears in second file

=============
Change a 1.0 to 1.0f .

Orig:   skysow03 = 1.0 + ((float) angle) / (ANG90 - 1);
Tst1:   skysow03 = 1.0f + ((float) angle) / (ANG90 - 1);

Orig size: 1335076
Tst1 size: 1335076

diff in objdump listing

The only difference was a data block that is buried in the code of D_DoomMain ...
This same area changes for every compilation, with similar differences,
even two compiles of the orig.
Looks to be some text about date/time.
  ("pr  6 201716:13:50")
  ("pr  8 201716:51:13")

* Not every change of literal has an effect.
* There are special instructions for loading +1.0 (FLD1) and +0.0 (FLDZ),
  as 80 bit values.   Both 1.0 and 1.0f will use the same instruction.

==============
A small function with float literals.

Orig:
static void
  R_Generate_gamma_black_table( void )
{
    int i;
    float b0 = ((float) cv_black.value ) / 2.0; // black
    float pow_max = 255.0 - b0;
    float gam = gamma_lookup( cv_usegamma.value );  // gamma

    gammatable[0] = 0;	// absolute black

    for( i=1; i<=255; i++ )
    {
        float fi = ((float) i) / 255.0;
        put_gammatable( i, b0 + (powf( fi, gam ) * pow_max) );
    }
}

Tst2:
static void
  R_Generate_gamma_black_table( void )
{
    int i;
    float b0 = ((float) cv_black.value ) / 2.0f; // black
    float pow_max = 255.0f - b0;
    float gam = gamma_lookup( cv_usegamma.value );  // gamma

    gammatable[0] = 0;	// absolute black

    for( i=1; i<=255; i++ )
    {
        float fi = ((float) i) / 255.0f;
        put_gammatable( i, b0 + (powf( fi, gam ) * pow_max) );
    }
}

The put_gammatable() call is inlined.
static void
  put_gammatable( int i, float fv )
{
  int gv = (int) roundf( fv );
  if( gv < 0 )  gv = 0;
  if( gv > 255 )  gv = 255;
  gammatable[i] = gv;
}

With R_Generate_gamma_black_table auto-inlined.
Orig size: 1335076
Tst2 size: 1335044

R_Generate_gamma_black_table() was originally inlined, so this test was
done using __attribute__((noinline)) on it.

With R_Generate_gamma_black_table marked noinline.
Orig size: 1334993
Tst2 size: 1334961

Many code addresses changed, so ignore all those.
Where an instruction differed only in address, only the orig was kept.
Source code was added by hand.

@@ -32141,13134 +32141,13148 @@
 08064d10 <R_Generate_gamma_black_table>:
  8064d10:	56                   	push   %esi
  8064d11:	53                   	push   %ebx
  8064d12:	83 ec 14             	sub    $0x14,%esp

- float b0 = ((float) cv_black.value ) / 2.0; // black
+ float b0 = ((float) cv_black.value ) / 2.0f; // black
- 8064d15:	d9 05 fc 56 12 08    	flds   0x81256fc
- 8064d1b:	da 0d d4 ff 14 08    	fimull 0x814ffd4
- 8064d21:	d9 54 24 04          	fsts   0x4(%esp)
+ 8064d15:	d9 05 dc 56 12 08    	flds   0x81256dc
+ 8064d1b:	da 0d b4 ff 14 08    	fimull 0x814ffb4
+ 8064d21:	d9 54 24 08          	fsts   0x8(%esp)

- float pow_max = 255.0 - b0;
+ float pow_max = 255.0f - b0;
- 8064d25:	d8 2d 40 58 12 08    	fsubrs 0x8125840
+ 8064d25:	d8 2d 20 58 12 08    	fsubrs 0x8125820
  8064d2b:	d9 5c 24 08          	fstps  0x8(%esp)
  
float gam = gamma_lookup( cv_usegamma.value );  // gamma
gamma_lookup() is inlined
  8064d2f:	a1 54 00 15 08       	mov    0x8150054,%eax
  8064d34:	8b 34 85 b0 00 15 08 	mov    0x81500b0(,%eax,4),%esi

for( i=1; i<=255; i++ )
for loop init
  8064d3b:	c6 05 e0 49 1d 08 00 	movb   $0x0,0x81d49e0
  8064d42:	bb 01 00 00 00       	mov    $0x1,%ebx
  8064d47:	eb 1f                	jmp    8064d68 <R_Generate_gamma_black_table+0x58>
  
  8064d49:	8d b4 26 00 00 00 00 	lea    0x0(%esi,%eiz,1),%esi

<R_Generate_gamma_black_table+0x40>:
part of inlined put_gammatable( i, ... );
  if( gv > 255 )  gv = 255;
  8064d50:	3d ff 00 00 00       	cmp    $0xff,%eax
  8064d55:	7e 02                	jle    8064d59 <R_Generate_gamma_black_table+0x49>
  8064d57:	b0 ff                	mov    $0xff,%al

<R_Generate_gamma_black_table+0x49>:
part of inlined put_gammatable( i, ... );
  gammatable[i] = gv;
  8064d59:	88 83 e0 49 1d 08    	mov    %al,0x81d49e0(%ebx)

for loop increment (just for non-zero case)
  8064d5f:	43                   	inc    %ebx
  8064d60:	81 fb 00 01 00 00    	cmp    $0x100,%ebx
  8064d66:	74 47                	je     8064daf <R_Generate_gamma_black_table+0x9f>

<R_Generate_gamma_black_table+0x58>: for loop body
  8064d68:	83 ec 08             	sub    $0x8,%esp
  gam, pushed for powf call
  8064d6b:	56                   	push   %esi

- float fi = ((float) i) / 255.0;
    mult by const double (1.0 / 255.0)
- 8064d6c:	89 5c 24 18          	mov    %ebx,0x18(%esp)
- 8064d70:	db 44 24 18          	fildl  0x18(%esp)
- 8064d74:	dd 05 10 73 12 08    	fldl   0x8127310
- 8064d7a:	de c9                	fmulp  %st,%st(1)
- 8064d7c:	83 ec 04             	sub    $0x4,%esp

+ float fi = ((float) i) / 255.0f;
    mult by const float (1.0 / 255.0)
+ 8064d6c:	d9 05 f4 5c 12 08    	flds   0x8125cf4
+ 8064d72:	53                   	push   %ebx
+ 8064d73:	da 0c 24             	fimull (%esp)

  8064d76:	d9 1c 24             	fstps  (%esp)
  powf( fi, gam )
  8064d79:	e8 6e 5b fe ff       	call   804a8ec <powf@plt>

  * pow_max
- 8064d87:	d8 4c 24 18          	fmuls  0x18(%esp)
+ 8064d7e:	d8 4c 24 1c          	fmuls  0x1c(%esp)
  + b0
- 8064d8b:	d8 44 24 14          	fadds  0x14(%esp)
+ 8064d82:	d8 44 24 18          	fadds  0x18(%esp)

  8064d86:	d9 1c 24             	fstps  (%esp)

inlined put_gammatable( i, fv );
  8064d89:	e8 7e 61 fe ff       	call   804af0c <lroundf@plt>
  8064d8e:	83 c4 10             	add    $0x10,%esp

  if( gv < 0 )  gv = 0;
  8064d91:	85 c0                	test   %eax,%eax
  8064d93:	79 bb                	jns    8064d50 <R_Generate_gamma_black_table+0x40>
  gammatable[i] = 0; -- compiler optimization
  8064d95:	31 c0                	xor    %eax,%eax
  8064d97:	88 83 c0 49 1d 08    	mov    %al,0x81d49c0(%ebx)

for loop increment -- compiler optimization (just for 0 case)
  8064d9d:	43                   	inc    %ebx
  8064d9e:	81 fb 00 01 00 00    	cmp    $0x100,%ebx
  8064da4:	75 c2                	jne    8064d68 <R_Generate_gamma_black_table+0x58>

<R_Generate_gamma_black_table+0x9f>:
  8064daf:	83 c4 14             	add    $0x14,%esp
  8064db2:	5b                   	pop    %ebx
  8064db3:	5e                   	pop    %esi
  8064db4:	c3                   	ret    


- 8064db5:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
- 8064db9:	8d bc 27 00 00 00 00 	lea    0x0(%edi,%eiz,1),%edi
+ 8064dac:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi

==========
Repeat the same test as above, but with DEBUG so objdump can dump source lines.
Debug code is significantly different than normal code.
In the ( / 255.0) the fldl (64 bit) changed to flds (32 bit).


Orig size: 4657775
Tst3 size: 4657775

--- tst3_orig.ods	2017-04-08 20:10:12.000000000 -0500
+++ tst3_ch.ods	2017-04-08 20:08:28.000000000 -0500
@@ -48826,18 +48826,21 @@
 {
  8062870:	55                   	push   %ebp
  8062871:	89 e5                	mov    %esp,%ebp
  8062873:	83 ec 48             	sub    $0x48,%esp
-    float b0 = ((float) cv_black.value ) / 2.0; // black
+    float b0 = ((float) cv_black.value ) / 2.0f; // black
  8062876:	a1 94 2b 13 08       	mov    0x8132b94,%eax
  806287b:	89 45 d4             	mov    %eax,-0x2c(%ebp)
  806287e:	db 45 d4             	fildl  -0x2c(%ebp)
  8062881:	d9 05 00 a1 10 08    	flds   0x810a100
  8062887:	de c9                	fmulp  %st,%st(1)
  8062889:	d9 5d f0             	fstps  -0x10(%ebp)
-    float pow_max = 255.0 - b0;
+    float pow_max = 255.0f - b0;
  806288c:	d9 05 04 a1 10 08    	flds   0x810a104
  8062892:	d8 65 f0             	fsubs  -0x10(%ebp)
  8062895:	d9 5d ec             	fstps  -0x14(%ebp)
     float gam = gamma_lookup( cv_usegamma.value );  // gamma
@@ -48853,11 +48856,11 @@
     for( i=1; i<=255; i++ )
  80628b0:	c7 45 f4 01 00 00 00 	movl   $0x1,-0xc(%ebp)
  80628b7:	eb 49                	jmp    8062902 <R_Generate_gamma_black_table+0x92>
     {
-        float fi = ((float) i) / 255.0;
+        float fi = ((float) i) / 255.0f;
  80628b9:	db 45 f4             	fildl  -0xc(%ebp)
- 80628bc:	dd 05 08 a1 10 08    	fldl   0x810a108
+ 80628bc:	d9 05 08 a1 10 08    	flds   0x810a108
  80628c2:	de c9                	fmulp  %st,%st(1)
  80628c4:	d9 5d e4             	fstps  -0x1c(%ebp)
         put_gammatable( i, b0 + (powf( fi, gam ) * pow_max) );
  80628c7:	83 ec 08             	sub    $0x8,%esp
@@ -48876,9 +48879,9 @@
  80628f1:	d9 1c 24             	fstps  (%esp)
  80628f4:	ff 75 f4             	pushl  -0xc(%ebp)
  80628f7:	e8 89 fe ff ff       	call   8062785 <put_gammatable>
  80628fc:	83 c4 10             	add    $0x10,%esp
-    float pow_max = 255.0 - b0;
+    float pow_max = 255.0f - b0;
     float gam = gamma_lookup( cv_usegamma.value );  // gamma
 
     gammatable[0] = 0;	// absolute black


* Seems that a double literal added to float is compile-time converted
  to float literal.
* Seems that a double literal multiplied by a float is kept as const double.
* Seems that a division by double literal is converted to multiply by a
  double literal.
* Double literals are stored as const double (64 bit).
* Float literals are stored as const float (32 bit).

* There are special instructions for loading +1.0 (FLD1) and +0.0 (FLDZ),
  as 80 bit values.
  
============
float comparison to 0.0 or 1.0 in fracdivline()

Orig:
if (frac<0.0 || frac>1.0)
    return DVL_none;  // not within the polygon side

Tst4:
if (frac<0.0f || frac>1.0f)
    return DVL_none;  // not within the polygon side


Orig size: 1335012
Tst4 size: 1335012

No differences, exact identical code.

* Putting f on a 0.0 or 1.0 literal makes no difference.

============
float comparison in fracdivline()

orig:
if( frac < 0.05 && SameVertex(...) )

tst5:
if( frac < 0.05f && SameVertex(...) )

To simplify the assembly:
__attribute__((noinline))
  static boolean SameVertex( ... )


Orig size: 1335047
Tst5 size: 1335079


0804ee20 <fracdivline>:
 804ee20:	56                   	push   %esi
 804ee21:	53                   	push   %ebx


@@ -5520,9 +5520,9 @@
  804ee52:	d8 cb                	fmul   %st(3),%st
  804ee54:	de e9                	fsubrp %st,%st(1)
  804ee56:	d9 c0                	fld    %st(0)
if (fabs(den) < 1.0E-36) // avoid check of float for exact 0
  The literal got moved by 40 bytes.
  Variable den is double, and fabs() returns a double.
  804ee58:	d9 e1                	fabs   
- 804ee5a:	dc 1d 50 57 12 08    	fcompl 0x8125750
+ 804ee5a:	dc 1d 70 57 12 08    	fcompl 0x8125770
  804ee60:	df e0                	fnstsw %ax
  804ee62:	f6 c4 01             	test   $0x1,%ah
  804ee65:	75 49                	jne    804eeb0 <fracdivline+0x90>
  804ee67:	d9 ce                	fxch   %st(6)
@@ -5595,39732 +5595,39720 @@
  804eefb:	d9 1e                	fstps  (%esi)
  804eefd:	d8 ca                	fmul   %st(2),%st
  804eeff:	de c1                	faddp  %st,%st(1)
  804ef01:	d9 5e 04             	fstps  0x4(%esi)

- if( frac < 0.05
- 804ef04:	dc 15 58 57 12 08    	fcoml  0x8125758
- 804ef0a:	df e0                	fnstsw %ax
- 804ef0c:	f6 c4 01             	test   $0x1,%ah
- 804ef0f:	74 1b                	je     804ef2c <fracdivline+0x10c>

+ if( frac < 0.05f
+ 804ef04:	d9 05 40 57 12 08    	flds   0x8125740
+ 804ef0a:	d9 c9                	fxch   %st(1)
+ 804ef0c:	d8 d1                	fcom   %st(1)
+ 804ef0e:	df e0                	fnstsw %ax
+ 804ef10:	dd d9                	fstp   %st(1)
+ 804ef12:	f6 c4 01             	test   $0x1,%ah
+ 804ef15:	74 1b                	je     804ef32 <fracdivline+0x112>

 SameVertex( )
  804ef11:	89 f0                	mov    %esi,%eax
  804ef13:	89 4c 24 04          	mov    %ecx,0x4(%esp)
  804ef17:	dd 5c 24 08          	fstpl  0x8(%esp)
  804ef1b:	e8 b0 fe ff ff       	call   804edd0 <SameVertex.clone.5>
  804ef20:	85 c0                	test   %eax,%eax
  804ef22:	8b 4c 24 04          	mov    0x4(%esp),%ecx
  804ef26:	dd 44 24 08          	fldl   0x8(%esp)
  804ef2a:	75 74                	jne    804efa0 <fracdivline+0x180>

<fracdivline+0x10c> :
  if( frac > 0.95
  804ef2c:	dc 1d 60 57 12 08    	fcompl 0x8125760
  804ef32:	df e0                	fnstsw %ax
  804ef34:	f6 c4 45             	test   $0x45,%ah
  804ef37:	74 27                	je     804ef60 <fracdivline+0x140>

  804ef39:	c7 46 08 00 00 00 00 	movl   $0x0,0x8(%esi)
  804ef40:	c7 46 10 00 00 00 00 	movl   $0x0,0x10(%esi)
  804ef47:	c7 46 14 01 00 00 00 	movl   $0x1,0x14(%esi)
  804ef4e:	c7 46 18 00 00 00 00 	movl   $0x0,0x18(%esi)
  804ef55:	b8 02 00 00 00       	mov    $0x2,%eax
  804ef5a:	83 c4 2c             	add    $0x2c,%esp
  804ef5d:	5b                   	pop    %ebx
  804ef5e:	5e                   	pop    %esi
  804ef5f:	c3                   	ret    

<fracdivline+0x140> :


* Comparison with float literal instead of double, seems to cost 3 extra
  instructions (6 bytes of code size).

==============
Change two comparisons, to see how differences accumulate.

orig:
if( frac < 0.05 && SameVertex(...) ) ...
if( frac > 0.95 && SameVertex(...) ) ...

tst6:
if( frac < 0.05f && SameVertex(...) ) ...
if( frac > 0.95f && SameVertex(...) ) ...

To simplify the assembly:
__attribute__((noinline))
  static boolean SameVertex( ... )


Orig size: 1335047
Tst6 size: 1335079
  Same file size as with one changed literal.
  But clearly the assembly has extra instructions per comparison.


0804ee20 <fracdivline>:
 804ee20:	56                   	push   %esi
 804ee21:	53                   	push   %ebx


@@ -5595,39732 +5595,39721 @@
  804eefb:	d9 1e                	fstps  (%esi)
  804eefd:	d8 ca                	fmul   %st(2),%st
  804eeff:	de c1                	faddp  %st,%st(1)
  804ef01:	d9 5e 04             	fstps  0x4(%esi)

- if( frac < 0.05
- 804ef04:	dc 15 58 57 12 08    	fcoml  0x8125758
- 804ef0a:	df e0                	fnstsw %ax
- 804ef0c:	f6 c4 01             	test   $0x1,%ah
- 804ef0f:	74 1b                	je     804ef2c <fracdivline+0x10c>

+ if( frac < 0.05f
+ 804ef04:	d9 05 40 57 12 08    	flds   0x8125740
+ 804ef0a:	d9 c9                	fxch   %st(1)
+ 804ef0c:	d8 d1                	fcom   %st(1)
+ 804ef0e:	df e0                	fnstsw %ax
+ 804ef10:	dd d9                	fstp   %st(1)
+ 804ef12:	f6 c4 01             	test   $0x1,%ah
+ 804ef15:	74 1b                	je     804ef32 <fracdivline+0x112>

  804ef11:	89 f0                	mov    %esi,%eax
  804ef13:	89 4c 24 04          	mov    %ecx,0x4(%esp)
  804ef17:	dd 5c 24 08          	fstpl  0x8(%esp)
  804ef1b:	e8 b0 fe ff ff       	call   804edd0 <SameVertex.clone.5>
  804ef20:	85 c0                	test   %eax,%eax
  804ef22:	8b 4c 24 04          	mov    0x4(%esp),%ecx
  804ef26:	dd 44 24 08          	fldl   0x8(%esp)
  804ef2a:	75 74                	jne    804efa0 <fracdivline+0x180>

- if( frac > 0.95
- 804ef2c:	dc 1d 60 57 12 08    	fcompl 0x8125760
- 804ef32:	df e0                	fnstsw %ax
- 804ef34:	f6 c4 45             	test   $0x45,%ah
- 804ef37:	74 27                	je     804ef60 <fracdivline+0x140>

+ if( frac > 0.95f
+ 804ef32:	d9 05 44 57 12 08    	flds   0x8125744
+ 804ef38:	d9 c9                	fxch   %st(1)
+ 804ef3a:	de d9                	fcompp 
+ 804ef3c:	df e0                	fnstsw %ax
+ 804ef3e:	f6 c4 45             	test   $0x45,%ah
+ 804ef41:	74 2d                	je     804ef70 <fracdivline+0x150>

<fracdivline+0x119>:
  804ef39:	c7 46 08 00 00 00 00 	movl   $0x0,0x8(%esi)
  804ef40:	c7 46 10 00 00 00 00 	movl   $0x0,0x10(%esi)
  804ef47:	c7 46 14 01 00 00 00 	movl   $0x1,0x14(%esi)
  804ef4e:	c7 46 18 00 00 00 00 	movl   $0x0,0x18(%esi)
  804ef55:	b8 02 00 00 00       	mov    $0x2,%eax
  804ef5a:	83 c4 2c             	add    $0x2c,%esp
  804ef5d:	5b                   	pop    %ebx
  804ef5e:	5e                   	pop    %esi
  804ef5f:	c3                   	ret    

  Alignment ?  It did not align other jmp targets like this !
+ 804ef6a:	8d b6 00 00 00 00    	lea    0x0(%esi),%esi

<fracdivline+0x140> :
  804ef60:	89 ca                	mov    %ecx,%edx
  804ef62:	89 f0                	mov    %esi,%eax
  804ef64:	89 4c 24 04          	mov    %ecx,0x4(%esp)
  804ef68:	e8 63 fe ff ff       	call   804edd0 <SameVertex.clone.5>
  804ef6d:	85 c0                	test   %eax,%eax
  804ef6f:	8b 4c 24 04          	mov    0x4(%esp),%ecx
  804ef73:	74 c4                	je     804ef39 <fracdivline+0x119>

  804ef75:	89 4e 08             	mov    %ecx,0x8(%esi)
  804ef78:	c7 46 10 00 00 00 00 	movl   $0x0,0x10(%esi)
  804ef7f:	c7 46 14 02 00 00 00 	movl   $0x2,0x14(%esi)
  804ef86:	c7 46 18 01 00 00 00 	movl   $0x1,0x18(%esi)
  804ef8d:	b8 03 00 00 00       	mov    $0x3,%eax
  804ef92:	e9 3b ff ff ff       	jmp    804eed2 <fracdivline+0xb2>
  804ef97:	89 f6                	mov    %esi,%esi
  804ef99:	8d bc 27 00 00 00 00 	lea    0x0(%edi,%eiz,1),%edi
  804efa0:	dd d8                	fstp   %st(0)
  804efa2:	89 5e 08             	mov    %ebx,0x8(%esi)
  804efa5:	c7 46 10 ff ff ff ff 	movl   $0xffffffff,0x10(%esi)
  804efac:	c7 46 14 01 00 00 00 	movl   $0x1,0x14(%esi)
  804efb3:	c7 46 18 01 00 00 00 	movl   $0x1,0x18(%esi)
  804efba:	b8 01 00 00 00       	mov    $0x1,%eax
  804efbf:	e9 0e ff ff ff       	jmp    804eed2 <fracdivline+0xb2>

  More alignment ?
  804efc4:	8d b6 00 00 00 00    	lea    0x0(%esi),%esi
  804efca:	8d bf 00 00 00 00    	lea    0x0(%edi),%edi

- 0804efd0 <wpoly_insert_vert>:
+ 0804efe0 <wpoly_insert_vert>:

- 08124d7c <_fini>:
+ 08124d8c <_fini>:
  Still only a difference of 16 bytes at end of the code.


* The first float literal added 3 instructions (6 bytes).
* The second float literal added 2 instructions (4 bytes).
* Compiler seems to include alignment to 0x10 at end of function.
  This complicates using the code size as a guide.
  A code size bump of 32 bytes means that executable grew by 17 to 32 bytes.


 

 


This may explain the code increase for using float literals:

https://gcc.gnu.org/ml/gcc/1998-11/msg00003.html

 

This is a discussion in the GCC mailing list with a Cygnus engineer about an i386 FP comparison bug, and fixing it in the egcs compiler.

Cygnus engineer thought they ought to turn it into a feature.

 

Quote

> Yes, it is bug. But why do not turn it to feature?

> No. Not separating the cc0 from the cc0 user is a fundamental design concept,

> we are not changing it anytime soon.

> We need to stop regstack from inserting insns between the cc0 setter and

> the cc0 user. Anything else is unacceptable.

 

What I found is incomplete so I don't know any better details at this time.

It seems that their final solution, that appears in the assembly, got much more complicated.


An interesting article on floating point.  Not directly relevant to floating point bugs in hardware, but may be more practical.

I have not read it all yet.

https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/

 

This is from a Chrome software programmer who does much the same thing.

Puts in a const, watches the code shrink,  looks at the assembly to see what bad optimization the compiler produces, and figures how to get around it.  Worth a quick read at least.

https://randomascii.wordpress.com/2017/01/08/add-a-const-here-delete-a-const-there/

Edited by wesleyjohnson


Last year we did some precision calculations for ZDoom. The relevant thread is in the developers' forum so I cannot link it here.

 

The conclusions we made:

 

- if you need reliable results, never use floats, always use doubles. Single precision floats depend on variables that are outside the programmer's control on some platforms.

- for the same reason, do not use the CRT's math functions like sin or sqrt. They all differ between compilers. If you need reproducible results you have to use software-implemented replacements.

- do not use any floating point optimizations the compiler may offer; they may also affect the results.

 

Following these rules it is safe to assume that all current compilers create code that gives the same result on all currently relevant platforms (i.e. x86/x87, x86/SSE2, x64, ARM, ARM64 and PowerPC.)

 

No idea how an old compiler like GCC 4.5 would fare here, though - it might present some issues.

 

The randomASCII articles are amazing. Thanks for the detailed writeup. Doesn't look like a "feature" to me. I have not heard of a 386 float compare bug, but, yeah, maybe. But that doesn't explain why you have problems when you choose 686 as your target, or why it needs a "secret" method to enable it. So, the big question: Will you continue to add 'f', or not? :)


Sorry for the delay.  My ability to reply on DoomWorld was totally broken for the last week.  It would not respond to any button pushes and I could not even complain to anyone.

 

I have added the ability to choose ARCH when compiling DoomLegacy.

The GCC docs are not entirely clear about what it uses for default, but I suspect that it is  generic32, or i386.

I have not found a way to detect what it used from looking at the objdumps.  Tools like  "file" or "objdump" only will tell me that the object format is  "i386".  It says that even for  i686 compiles.

The default (which was used for all the tests so far)  must be i386.

1. GCC is including assembly for 386 cpu problems.

2. When I compile with -march=i686, the code gets 30K smaller.

 

Those articles seem to be definitive reference material.  They are referred to by most other work on floating point.

I found almost identical sentences in the GCC info pages too.

 

 

 


Repeat the last test for different target arch.

It appears that the default compile is close to the i486.

Interesting that the smallest code is for the  386, with the 586 a close second.  The 486 has the largest code.

There are other strange things in the code, which make for code size bumps in the 686 and k8 cases.

The strange things are not alignment, they are executed, but appear to be useless compiler artifacts.

Extra code clones do not help.  They could really reduce the code size by controlling their clones better.

This is just for GCC, who knows what CLANG or MS does.

 

Compiler: GCC 4.5.2
Source: DoomLegacy 1.46.3
Assembly: objdump -d

==============
Try different -march settings for fracdivline() with f literals.
Tst7_386: -march=i386
Tst7_486: -march=i486
Tst8_586: -march=i586
Tst9_686: -march=i686
Tst10_k8: -march=k8

Orig      size: 1335047
Tst6      size: 1335079
Tst7_386  size: 1237118
Tst7_486  size: 1335079
Tst8_586  size: 1280269
Tst9_686  size: 1302098
Tst10_k8  size: 1306919



==============
Short Notes:

**
These exit clones are present in all versions.
They appear to clean the floating point stack, but do not appear for
every exit of the function.  They are not padding, they are executed.

ret_DVL_none.clone.1:
-> return DVL_none
 804e600:	dd d8                	fstp   %st(0)
 804e602:	dd d8                	fstp   %st(0)
 804e604:	dd d8                	fstp   %st(0)
 804e606:	dd d8                	fstp   %st(0)
 804e608:	dd d8                	fstp   %st(0)
 804e60a:	dd d8                	fstp   %st(0)
 804e60c:	dd d8                	fstp   %st(0)
 804e60e:	eb 10                	jmp    804e620 ret_DVL_none
 

** if (fabs(den) < 1.0E-36)
For i386, i486, i586:  Compare against a double in memory.
if (fabs(den) < 1.0E-36)
 d9 c0                	fld    %st(0)
 d9 e1                	fabs   
 dc 1d d8 d8 10 08    	fcompl 0x810d8d8
 df e0                	fnstsw %ax
 f6 c4 01             	test   $0x1,%ah
 75 49                	jne    804e600 ret_DVL_none.clone.1
For i686:  Compare Double in registers.
if (fabs(den) < 1.0E-36)
 d9 c0                	fld    %st(0)
 d9 e1                	fabs   
 dd 05 d8 d8 11 08    	fldl   0x811d8d8
 df f1                	fcomip %st(1),%st
 dd d8                	fstp   %st(0)
 77 3c                	ja     804ea18 ret_DVL_none.clone.1
For k8:  Compare Double in registers.
if (fabs(den) < 1.0E-36)
 d9 c0                	fld    %st(0)
 d9 e1                	fabs   
 dd 05 d8 e4 11 08    	fldl   0x811e4d8
 df f1                	fcomip %st(1),%st
 df c0                	ffreep %st(0)
 77 3c                	ja     804ea08 ret_DVL_none.clone.1

** if (frac<0.0 ... )
For i386, i486, i586:  ftst
if (frac<0.0 ... )
 d9 e4                	ftst   
 df e0                	fnstsw %ax
 f6 c4 01             	test   $0x1,%ah
 75 30                	jne    804e610 ret_DVL_none.clone.2
For i686, k8:  load 0 and compare
if (frac<0.0 ... )
 d9 ee                	fldz   
 df f1                	fcomip %st(1),%st
 77 2e                	ja     804ea28 ret_DVL_none.clone.2

** if ( ... frac>1.0)
For i386, i486, i586:  Load 1, compare float (avoiding buggy fcomps?).
if ( ... frac>1.0)
 d9 e8                	fld1   
 d9 c9                	fxch   %st(1)
 d8 d1                	fcom   %st(1)
 df e0                	fnstsw %ax
 dd d9                	fstp   %st(1)
 f6 c4 45             	test   $0x45,%ah
 75 39                	jne    804e628 body.num1
For i686, k8: Load 1, compare with fcomi (a float compare that sets EFLAGS directly)
if ( ... frac>1.0)
 d9 e8                	fld1   
 d9 c9                	fxch   %st(1)
 db f1                	fcomi  %st(1),%st
 dd d9                	fstp   %st(1)
 76 3c                	jbe    804ea40 body.num1

** if( frac < 0.05f
For i386, i486, i586:  Load literal, compare float (avoiding buggy fcomps?).
if( frac < 0.05f
 d9 05 a0 d8 10 08    	flds   0x810d8a0
 d9 c9                	fxch   %st(1)
 d8 d1                	fcom   %st(1)
 df e0                	fnstsw %ax
 dd d9                	fstp   %st(1)
 f6 c4 01             	test   $0x1,%ah
 74 1b                	je     804e67a body.if.95
For i686, k8: Load literal, compare with fcomip (a float compare that sets EFLAGS directly)
if( frac < 0.05f
 d9 05 a0 d8 11 08    	flds   0x811d8a0
 df f1                	fcomip %st(1),%st
 76 1b                	jbe    804ea8b body.if.95

** if( frac > 0.95f
For i386, i486, i586:  Load literal, compare float (avoiding buggy fcomps?).
if( frac > 0.95f
 d9 05 a4 d8 10 08    	flds   0x810d8a4
 d9 c9                	fxch   %st(1)
 de d9                	fcompp 
 df e0                	fnstsw %ax
 f6 c4 45             	test   $0x45,%ah
 74 29                	je     804e6b4 if.95.OR.samevertex
For i686:  Load literal, compare with fcomip (a float compare that sets EFLAGS directly).
if( frac > 0.95f
 d9 05 a4 d8 11 08    	flds   0x811d8a4
 d9 c9                	fxch   %st(1)
 df f1                	fcomip %st(1),%st
 dd d8                	fstp   %st(0)
 77 27                	ja     804eac0 if.95.OR.samevertex
For k8:  Load literal, compare with fcomip (a float compare that sets EFLAGS directly).
if( frac > 0.95f
 d9 05 a4 e4 11 08    	flds   0x811e4a4
 d9 c9                	fxch   %st(1)
 df f1                	fcomip %st(1),%st
 df c0                	ffreep %st(0)
 77 27                	ja     804eab0 if.95.OR.samevertex



==============
Tst7_386:
0804e570 <fracdivline>:
 804e570:	56                   	push   %esi
 804e571:	53                   	push   %ebx
 804e572:	83 ec 2c             	sub    $0x2c,%esp
 804e575:	89 d3                	mov    %edx,%ebx
 804e577:	8b 74 24 38          	mov    0x38(%esp),%esi
 804e57b:	d9 02                	flds   (%edx)
 804e57d:	d9 42 04             	flds   0x4(%edx)
 804e580:	d9 01                	flds   (%ecx)
 804e582:	d8 e2                	fsub   %st(2),%st
 804e584:	d9 41 04             	flds   0x4(%ecx)
 804e587:	d8 e2                	fsub   %st(2),%st
 804e589:	d9 00                	flds   (%eax)
 804e58b:	d9 5c 24 18          	fstps  0x18(%esp)
 804e58f:	d9 40 04             	flds   0x4(%eax)
 804e592:	d9 5c 24 1c          	fstps  0x1c(%esp)

den = v3dy*v1dx - v3dx*v1dy;
 804e596:	d9 40 08             	flds   0x8(%eax)
 804e599:	d9 40 0c             	flds   0xc(%eax)
 804e59c:	d9 c3                	fld    %st(3)
 804e59e:	d8 c9                	fmul   %st(1),%st
 804e5a0:	d9 c3                	fld    %st(3)
 804e5a2:	d8 cb                	fmul   %st(3),%st
 804e5a4:	de e9                	fsubrp %st,%st(1)

if (fabs(den) < 1.0E-36)
 804e5a6:	d9 c0                	fld    %st(0)
 804e5a8:	d9 e1                	fabs   
 804e5aa:	dc 1d d8 d8 10 08    	fcompl 0x810d8d8
 804e5b0:	df e0                	fnstsw %ax
 804e5b2:	f6 c4 01             	test   $0x1,%ah
 804e5b5:	75 49                	jne    804e600 ret_DVL_none.clone.1

num = (v3x - v1x)*v3dy + (v1y - v3y)*v3dx;
 804e5b7:	d9 ce                	fxch   %st(6)
 804e5b9:	dd 5c 24 20          	fstpl  0x20(%esp)
 804e5bd:	dd 44 24 20          	fldl   0x20(%esp)
 804e5c1:	d8 6c 24 18          	fsubrs 0x18(%esp)
 804e5c5:	d9 c5                	fld    %st(5)
 804e5c7:	d8 64 24 1c          	fsubs  0x1c(%esp)
 804e5cb:	dc cb                	fmul   %st,%st(3)
 804e5cd:	d9 ca                	fxch   %st(2)
 804e5cf:	d8 c9                	fmul   %st(1),%st
 804e5d1:	de c3                	faddp  %st,%st(3)

frac = num / den;
 804e5d3:	d9 ca                	fxch   %st(2)
 804e5d5:	d8 f6                	fdiv   %st(6),%st

if (frac<0.0 ... )
 804e5d7:	d9 e4                	ftst   
 804e5d9:	df e0                	fnstsw %ax
 804e5db:	f6 c4 01             	test   $0x1,%ah
 804e5de:	75 30                	jne    804e610 ret_DVL_none.clone.2

if ( ... frac>1.0)
 804e5e0:	d9 e8                	fld1   
 804e5e2:	d9 c9                	fxch   %st(1)
 804e5e4:	d8 d1                	fcom   %st(1)
 804e5e6:	df e0                	fnstsw %ax
 804e5e8:	dd d9                	fstp   %st(1)
 804e5ea:	f6 c4 45             	test   $0x45,%ah
 804e5ed:	75 39                	jne    804e628 body.num1

# ret_DVL_none.clone.3:
-> return DVL_none
 804e5ef:	dd d8                	fstp   %st(0)
 804e5f1:	dd d8                	fstp   %st(0)
 804e5f3:	dd d8                	fstp   %st(0)
 804e5f5:	dd d8                	fstp   %st(0)
 804e5f7:	dd d8                	fstp   %st(0)
 804e5f9:	dd d8                	fstp   %st(0)
 804e5fb:	dd d8                	fstp   %st(0)
 804e5fd:	eb 21                	jmp    804e620 ret_DVL_none

 804e5ff:	90                   	nop
 
ret_DVL_none.clone.1:
-> return DVL_none
 804e600:	dd d8                	fstp   %st(0)
 804e602:	dd d8                	fstp   %st(0)
 804e604:	dd d8                	fstp   %st(0)
 804e606:	dd d8                	fstp   %st(0)
 804e608:	dd d8                	fstp   %st(0)
 804e60a:	dd d8                	fstp   %st(0)
 804e60c:	dd d8                	fstp   %st(0)
 804e60e:	eb 10                	jmp    804e620 ret_DVL_none
 
ret_DVL_none.clone.2:
-> return DVL_none
 804e610:	dd d8                	fstp   %st(0)
 804e612:	dd d8                	fstp   %st(0)
 804e614:	dd d8                	fstp   %st(0)
 804e616:	dd d8                	fstp   %st(0)
 804e618:	dd d8                	fstp   %st(0)
 804e61a:	dd d8                	fstp   %st(0)
 804e61c:	dd d8                	fstp   %st(0)
 804e61e:	66 90                	xchg   %ax,%ax

ret_DVL_none: 
return DVL_none;
 804e620:	31 c0                	xor    %eax,%eax

ret_common.1:
 804e622:	83 c4 2c             	add    $0x2c,%esp
 804e625:	5b                   	pop    %ebx
 804e626:	5e                   	pop    %esi
 804e627:	c3                   	ret    

body.num1:
num = (v3x - v1x)*v1dy + (v1y - v3y)*v1dx;
 804e628:	d9 c9                	fxch   %st(1)
 804e62a:	d8 cc                	fmul   %st(4),%st
 804e62c:	d9 ca                	fxch   %st(2)
 804e62e:	d8 cb                	fmul   %st(3),%st
 804e630:	de c2                	faddp  %st,%st(2)
 804e632:	d9 c9                	fxch   %st(1)
result->divfrac = num / den;
 804e634:	de f5                	fdivp  %st,%st(5)
 804e636:	d9 cc                	fxch   %st(4)
 804e638:	d9 5e 0c             	fstps  0xc(%esi)
result->divpt.x = v1x + v1dx*frac;
 804e63b:	d9 c9                	fxch   %st(1)
 804e63d:	d8 cb                	fmul   %st(3),%st
 804e63f:	dc 44 24 20          	faddl  0x20(%esp)
 804e643:	d9 1e                	fstps  (%esi)
result->divpt.y = v1y + v1dy*frac;
 804e645:	d8 ca                	fmul   %st(2),%st
 804e647:	de c1                	faddp  %st,%st(1)
 804e649:	d9 5e 04             	fstps  0x4(%esi)

+ if( frac < 0.05f
+ 804e64c:	d9 05 a0 d8 10 08    	flds   0x810d8a0
+ 804e652:	d9 c9                	fxch   %st(1)
+ 804e654:	d8 d1                	fcom   %st(1)
+ 804e656:	df e0                	fnstsw %ax
+ 804e658:	dd d9                	fstp   %st(1)
+ 804e65a:	f6 c4 01             	test   $0x1,%ah
+ 804e65d:	74 1b                	je     804e67a body.if.95

if( ... && SameVertex()
 804e65f:	89 f0                	mov    %esi,%eax
 804e661:	89 4c 24 04          	mov    %ecx,0x4(%esp)
 804e665:	dd 5c 24 08          	fstpl  0x8(%esp)
 804e669:	e8 ce fe ff ff       	call   804e53c <SameVertex.clone.5>
 804e66e:	85 c0                	test   %eax,%eax
 804e670:	8b 4c 24 04          	mov    0x4(%esp),%ecx
 804e674:	dd 44 24 08          	fldl   0x8(%esp)
 804e678:	75 72                	jne    804e6ec case_DVL_v1

body.if.95:
+ if( frac > 0.95f
+ 804e67a:	d9 05 a4 d8 10 08    	flds   0x810d8a4
+ 804e680:	d9 c9                	fxch   %st(1)
+ 804e682:	de d9                	fcompp 
+ 804e684:	df e0                	fnstsw %ax
+ 804e686:	f6 c4 45             	test   $0x45,%ah
+ 804e689:	74 29                	je     804e6b4 if.95.OR.samevertex

case_DVL_mid:
result->vertex = NULL;
 804e68b:	c7 46 08 00 00 00 00 	movl   $0x0,0x8(%esi)
result->before = 0;
 804e692:	c7 46 10 00 00 00 00 	movl   $0x0,0x10(%esi)
result->after = 1;
 804e699:	c7 46 14 01 00 00 00 	movl   $0x1,0x14(%esi)
result->at_vert = false;
 804e6a0:	c7 46 18 00 00 00 00 	movl   $0x0,0x18(%esi)
return DVL_mid;
 804e6a7:	b8 02 00 00 00       	mov    $0x2,%eax
 804e6ac:	83 c4 2c             	add    $0x2c,%esp
 804e6af:	5b                   	pop    %ebx
 804e6b0:	5e                   	pop    %esi
 804e6b1:	c3                   	ret    

 804e6b2:	66 90                	xchg   %ax,%ax

if.95.OR.samevertex:
if( ... && SameVertex()
 804e6b4:	89 ca                	mov    %ecx,%edx
 804e6b6:	89 f0                	mov    %esi,%eax
 804e6b8:	89 4c 24 04          	mov    %ecx,0x4(%esp)
 804e6bc:	e8 7b fe ff ff       	call   804e53c <SameVertex.clone.5>
 804e6c1:	85 c0                	test   %eax,%eax
 804e6c3:	8b 4c 24 04          	mov    0x4(%esp),%ecx
 804e6c7:	74 c2                	je     804e68b case_DVL_mid

# case_DVL_v2:
result->vertex = v2;
 804e6c9:	89 4e 08             	mov    %ecx,0x8(%esi)
result->before = 0;
 804e6cc:	c7 46 10 00 00 00 00 	movl   $0x0,0x10(%esi)
result->after = 2;
 804e6d3:	c7 46 14 02 00 00 00 	movl   $0x2,0x14(%esi)
result->at_vert = true;
 804e6da:	c7 46 18 01 00 00 00 	movl   $0x1,0x18(%esi)
return DVL_v2;
 804e6e1:	b8 03 00 00 00       	mov    $0x3,%eax
 804e6e6:	e9 37 ff ff ff       	jmp    804e622 ret_common.1

 804e6eb:	90                   	nop

case_DVL_v1:
result->vertex = v1;
 804e6ec:	dd d8                	fstp   %st(0)
 804e6ee:	89 5e 08             	mov    %ebx,0x8(%esi)
result->before = -1;
 804e6f1:	c7 46 10 ff ff ff ff 	movl   $0xffffffff,0x10(%esi)
result->after = 1;
 804e6f8:	c7 46 14 01 00 00 00 	movl   $0x1,0x14(%esi)
result->at_vert = true;
 804e6ff:	c7 46 18 01 00 00 00 	movl   $0x1,0x18(%esi)
return DVL_v1;
 804e706:	b8 01 00 00 00       	mov    $0x1,%eax
 804e70b:	e9 12 ff ff ff       	jmp    804e622 ret_common.1


==============
Tst7_486:
0804ee20 <fracdivline>:
 804ee20:	56                   	push   %esi
 804ee21:	53                   	push   %ebx
 804ee22:	83 ec 2c             	sub    $0x2c,%esp
 804ee25:	89 d3                	mov    %edx,%ebx
 804ee27:	8b 74 24 38          	mov    0x38(%esp),%esi
 804ee2b:	d9 02                	flds   (%edx)
 804ee2d:	d9 42 04             	flds   0x4(%edx)
 804ee30:	d9 01                	flds   (%ecx)
 804ee32:	d8 e2                	fsub   %st(2),%st
 804ee34:	d9 41 04             	flds   0x4(%ecx)
 804ee37:	d8 e2                	fsub   %st(2),%st
 804ee39:	d9 00                	flds   (%eax)
 804ee3b:	d9 5c 24 18          	fstps  0x18(%esp)
 804ee3f:	d9 40 04             	flds   0x4(%eax)
 804ee42:	d9 5c 24 1c          	fstps  0x1c(%esp)

den = v3dy*v1dx - v3dx*v1dy;
 804ee46:	d9 40 08             	flds   0x8(%eax)
 804ee49:	d9 40 0c             	flds   0xc(%eax)
 804ee4c:	d9 c3                	fld    %st(3)
 804ee4e:	d8 c9                	fmul   %st(1),%st
 804ee50:	d9 c3                	fld    %st(3)
 804ee52:	d8 cb                	fmul   %st(3),%st
 804ee54:	de e9                	fsubrp %st,%st(1)

if (fabs(den) < 1.0E-36)
 804ee56:	d9 c0                	fld    %st(0)
 804ee58:	d9 e1                	fabs   
 804ee5a:	dc 1d 78 57 12 08    	fcompl 0x8125778
 804ee60:	df e0                	fnstsw %ax
 804ee62:	f6 c4 01             	test   $0x1,%ah
 804ee65:	75 49                	jne    804eeb0 ret_DVL_none.clone.1

num = (v3x - v1x)*v3dy + (v1y - v3y)*v3dx;
 804ee67:	d9 ce                	fxch   %st(6)
 804ee69:	dd 5c 24 20          	fstpl  0x20(%esp)
 804ee6d:	dd 44 24 20          	fldl   0x20(%esp)
 804ee71:	d8 6c 24 18          	fsubrs 0x18(%esp)
 804ee75:	d9 c5                	fld    %st(5)
 804ee77:	d8 64 24 1c          	fsubs  0x1c(%esp)
 804ee7b:	dc cb                	fmul   %st,%st(3)
 804ee7d:	d9 ca                	fxch   %st(2)
 804ee7f:	d8 c9                	fmul   %st(1),%st
 804ee81:	de c3                	faddp  %st,%st(3)

frac = num / den;
 804ee83:	d9 ca                	fxch   %st(2)
 804ee85:	d8 f6                	fdiv   %st(6),%st

if (frac<0.0 ... )
 804ee87:	d9 e4                	ftst   
 804ee89:	df e0                	fnstsw %ax
 804ee8b:	f6 c4 01             	test   $0x1,%ah
 804ee8e:	75 30                	jne    804eec0 ret_DVL_none.clone.2

if ( ... frac>1.0)
 804ee90:	d9 e8                	fld1   
 804ee92:	d9 c9                	fxch   %st(1)
 804ee94:	d8 d1                	fcom   %st(1)
 804ee96:	df e0                	fnstsw %ax
 804ee98:	dd d9                	fstp   %st(1)
 804ee9a:	f6 c4 45             	test   $0x45,%ah
 804ee9d:	75 41                	jne    804eee0 body.num1

# ret_DVL_none.clone.3:
-> return DVL_none
 804ee9f:	dd d8                	fstp   %st(0)
 804eea1:	dd d8                	fstp   %st(0)
 804eea3:	dd d8                	fstp   %st(0)
 804eea5:	dd d8                	fstp   %st(0)
 804eea7:	dd d8                	fstp   %st(0)
 804eea9:	dd d8                	fstp   %st(0)
 804eeab:	dd d8                	fstp   %st(0)
 804eead:	eb 21                	jmp    804eed0 ret_DVL_none

 804eeaf:	90                   	nop

ret_DVL_none.clone.1:
-> return DVL_none
 804eeb0:	dd d8                	fstp   %st(0)
 804eeb2:	dd d8                	fstp   %st(0)
 804eeb4:	dd d8                	fstp   %st(0)
 804eeb6:	dd d8                	fstp   %st(0)
 804eeb8:	dd d8                	fstp   %st(0)
 804eeba:	dd d8                	fstp   %st(0)
 804eebc:	dd d8                	fstp   %st(0)
 804eebe:	eb 10                	jmp    804eed0 ret_DVL_none

ret_DVL_none.clone.2:
 804eec0:	dd d8                	fstp   %st(0)
 804eec2:	dd d8                	fstp   %st(0)
 804eec4:	dd d8                	fstp   %st(0)
 804eec6:	dd d8                	fstp   %st(0)
 804eec8:	dd d8                	fstp   %st(0)
 804eeca:	dd d8                	fstp   %st(0)
 804eecc:	dd d8                	fstp   %st(0)
 804eece:	66 90                	xchg   %ax,%ax

ret_DVL_none:
return DVL_none;
 804eed0:	31 c0                	xor    %eax,%eax

ret_common.1:
 804eed2:	83 c4 2c             	add    $0x2c,%esp
 804eed5:	5b                   	pop    %ebx
 804eed6:	5e                   	pop    %esi
 804eed7:	c3                   	ret    
 
 804eed8:	90                   	nop
 804eed9:	8d b4 26 00 00 00 00 	lea    0x0(%esi,%eiz,1),%esi
 
body.num1:
num = (v3x - v1x)*v1dy + (v1y - v3y)*v1dx;
 804eee0:	d9 c9                	fxch   %st(1)
 804eee2:	d8 cc                	fmul   %st(4),%st
 804eee4:	d9 ca                	fxch   %st(2)
 804eee6:	d8 cb                	fmul   %st(3),%st
 804eee8:	de c2                	faddp  %st,%st(2)
 804eeea:	d9 c9                	fxch   %st(1)
result->divfrac = num / den;
 804eeec:	de f5                	fdivp  %st,%st(5)
 804eeee:	d9 cc                	fxch   %st(4)
 804eef0:	d9 5e 0c             	fstps  0xc(%esi)
result->divpt.x = v1x + v1dx*frac;
 804eef3:	d9 c9                	fxch   %st(1)
 804eef5:	d8 cb                	fmul   %st(3),%st
 804eef7:	dc 44 24 20          	faddl  0x20(%esp)
 804eefb:	d9 1e                	fstps  (%esi)
result->divpt.y = v1y + v1dy*frac;
 804eefd:	d8 ca                	fmul   %st(2),%st
 804eeff:	de c1                	faddp  %st,%st(1)
 804ef01:	d9 5e 04             	fstps  0x4(%esi)
 
+ if( frac < 0.05f
+ 804ef04:	d9 05 40 57 12 08    	flds   0x8125740
+ 804ef0a:	d9 c9                	fxch   %st(1)
+ 804ef0c:	d8 d1                	fcom   %st(1)
+ 804ef0e:	df e0                	fnstsw %ax
+ 804ef10:	dd d9                	fstp   %st(1)
+ 804ef12:	f6 c4 01             	test   $0x1,%ah
+ 804ef15:	74 1b                	je     804ef32 body.if.95
 
if( ... && SameVertex()
 804ef17:	89 f0                	mov    %esi,%eax
 804ef19:	89 4c 24 04          	mov    %ecx,0x4(%esp)
 804ef1d:	dd 5c 24 08          	fstpl  0x8(%esp)
 804ef21:	e8 aa fe ff ff       	call   804edd0 <SameVertex.clone.5>
 804ef26:	85 c0                	test   %eax,%eax
 804ef28:	8b 4c 24 04          	mov    0x4(%esp),%ecx
 804ef2c:	dd 44 24 08          	fldl   0x8(%esp)
 804ef30:	75 7e                	jne    804efb0 case_DVL_v1

body.if.95:
+ if( frac > 0.95f
+ 804ef32:	d9 05 44 57 12 08    	flds   0x8125744
+ 804ef38:	d9 c9                	fxch   %st(1)
+ 804ef3a:	de d9                	fcompp 
+ 804ef3c:	df e0                	fnstsw %ax
+ 804ef3e:	f6 c4 45             	test   $0x45,%ah
+ 804ef41:	74 2d                	je     804ef70 if.95.OR.samevertex

case_DVL_mid:
 804ef43:	c7 46 08 00 00 00 00 	movl   $0x0,0x8(%esi)
 804ef4a:	c7 46 10 00 00 00 00 	movl   $0x0,0x10(%esi)
 804ef51:	c7 46 14 01 00 00 00 	movl   $0x1,0x14(%esi)
 804ef58:	c7 46 18 00 00 00 00 	movl   $0x0,0x18(%esi)
return DVL_mid;
 804ef5f:	b8 02 00 00 00       	mov    $0x2,%eax
 804ef64:	83 c4 2c             	add    $0x2c,%esp
 804ef67:	5b                   	pop    %ebx
 804ef68:	5e                   	pop    %esi
 804ef69:	c3                   	ret    

 804ef6a:	8d b6 00 00 00 00    	lea    0x0(%esi),%esi
 
if.95.OR.samevertex:
if( ... && SameVertex()
 804ef70:	89 ca                	mov    %ecx,%edx
 804ef72:	89 f0                	mov    %esi,%eax
 804ef74:	89 4c 24 04          	mov    %ecx,0x4(%esp)
 804ef78:	e8 53 fe ff ff       	call   804edd0 <SameVertex.clone.5>
 804ef7d:	85 c0                	test   %eax,%eax
 804ef7f:	8b 4c 24 04          	mov    0x4(%esp),%ecx
 804ef83:	74 be                	je     804ef43 case_DVL_mid

# case_DVL_v2:
 804ef85:	89 4e 08             	mov    %ecx,0x8(%esi)
 804ef88:	c7 46 10 00 00 00 00 	movl   $0x0,0x10(%esi)
 804ef8f:	c7 46 14 02 00 00 00 	movl   $0x2,0x14(%esi)
 804ef96:	c7 46 18 01 00 00 00 	movl   $0x1,0x18(%esi)
return DVL_v2;
 804ef9d:	b8 03 00 00 00       	mov    $0x3,%eax
 804efa2:	e9 2b ff ff ff       	jmp    804eed2 ret_common.1

 804efa7:	89 f6                	mov    %esi,%esi
 804efa9:	8d bc 27 00 00 00 00 	lea    0x0(%edi,%eiz,1),%edi
 
case_DVL_v1:
 804efb0:	dd d8                	fstp   %st(0)
 804efb2:	89 5e 08             	mov    %ebx,0x8(%esi)
 804efb5:	c7 46 10 ff ff ff ff 	movl   $0xffffffff,0x10(%esi)
 804efbc:	c7 46 14 01 00 00 00 	movl   $0x1,0x14(%esi)
 804efc3:	c7 46 18 01 00 00 00 	movl   $0x1,0x18(%esi)
return DVL_v1;
 804efca:	b8 01 00 00 00       	mov    $0x1,%eax
 804efcf:	e9 fe fe ff ff       	jmp    804eed2 ret_common.1

 804efd4:	8d b6 00 00 00 00    	lea    0x0(%esi),%esi
 804efda:	8d bf 00 00 00 00    	lea    0x0(%edi),%edi



==============
Tst8_586:
0804e9e0 <fracdivline>:
 804e9e0:	56                   	push   %esi
 804e9e1:	53                   	push   %ebx
 804e9e2:	83 ec 2c             	sub    $0x2c,%esp
 804e9e5:	89 d3                	mov    %edx,%ebx
 804e9e7:	d9 02                	flds   (%edx)
 804e9e9:	d9 42 04             	flds   0x4(%edx)
 804e9ec:	d9 01                	flds   (%ecx)
 804e9ee:	d8 e2                	fsub   %st(2),%st
 804e9f0:	31 d2                	xor    %edx,%edx
 804e9f2:	8b 74 24 38          	mov    0x38(%esp),%esi
 804e9f6:	d9 41 04             	flds   0x4(%ecx)
 804e9f9:	d8 e2                	fsub   %st(2),%st
 804e9fb:	d9 00                	flds   (%eax)
 804e9fd:	d9 5c 24 18          	fstps  0x18(%esp)
 804ea01:	d9 40 04             	flds   0x4(%eax)
 804ea04:	d9 5c 24 1c          	fstps  0x1c(%esp)

den = v3dy*v1dx - v3dx*v1dy;
 804ea08:	d9 40 08             	flds   0x8(%eax)
 804ea0b:	d9 40 0c             	flds   0xc(%eax)
 804ea0e:	d9 c3                	fld    %st(3)
 804ea10:	d8 c9                	fmul   %st(1),%st
 804ea12:	d9 c3                	fld    %st(3)
 804ea14:	d8 cb                	fmul   %st(3),%st
 804ea16:	de e9                	fsubrp %st,%st(1)

if (fabs(den) < 1.0E-36)
 804ea18:	d9 c0                	fld    %st(0)
 804ea1a:	d9 e1                	fabs   
 804ea1c:	dc 1d 38 81 11 08    	fcompl 0x8118138
 804ea22:	df e0                	fnstsw %ax
 804ea24:	f6 c4 01             	test   $0x1,%ah
 804ea27:	75 4f                	jne    804ea78 ret_DVL_none.clone.1

num = (v3x - v1x)*v3dy + (v1y - v3y)*v3dx;
 804ea29:	d9 ce                	fxch   %st(6)
 804ea2b:	dd 5c 24 20          	fstpl  0x20(%esp)
 804ea2f:	dd 44 24 20          	fldl   0x20(%esp)
 804ea33:	d8 6c 24 18          	fsubrs 0x18(%esp)
 804ea37:	d9 c5                	fld    %st(5)
 804ea39:	d8 64 24 1c          	fsubs  0x1c(%esp)
 804ea3d:	dc cb                	fmul   %st,%st(3)
 804ea3f:	d9 ca                	fxch   %st(2)
 804ea41:	d8 c9                	fmul   %st(1),%st
 804ea43:	de c3                	faddp  %st,%st(3)

frac = num / den;
 804ea45:	d9 ca                	fxch   %st(2)
 804ea47:	d8 f6                	fdiv   %st(6),%st

if (frac<0.0 ... )
 804ea49:	d9 e4                	ftst   
 804ea4b:	df e0                	fnstsw %ax
 804ea4d:	f6 c4 01             	test   $0x1,%ah
 804ea50:	75 36                	jne    804ea88 ret_DVL_none.clone.2

if ( ... frac>1.0)
 804ea52:	d9 e8                	fld1   
 804ea54:	d9 c9                	fxch   %st(1)
 804ea56:	d8 d1                	fcom   %st(1)
 804ea58:	df e0                	fnstsw %ax
 804ea5a:	dd d9                	fstp   %st(1)
 804ea5c:	f6 c4 45             	test   $0x45,%ah
 804ea5f:	75 3f                	jne    804eaa0 body.num1

# ret_DVL_none.clone.3:
-> return DVL_none
 804ea61:	dd d8                	fstp   %st(0)
 804ea63:	dd d8                	fstp   %st(0)
 804ea65:	dd d8                	fstp   %st(0)
 804ea67:	dd d8                	fstp   %st(0)
 804ea69:	dd d8                	fstp   %st(0)
 804ea6b:	dd d8                	fstp   %st(0)
 804ea6d:	dd d8                	fstp   %st(0)
 804ea6f:	eb 27                	jmp    804ea98 ret_common.1

 804ea71:	8d b4 26 00 00 00 00 	lea    0x0(%esi,%eiz,1),%esi

ret_DVL_none.clone.1:
-> return DVL_none
 804ea78:	dd d8                	fstp   %st(0)
 804ea7a:	dd d8                	fstp   %st(0)
 804ea7c:	dd d8                	fstp   %st(0)
 804ea7e:	dd d8                	fstp   %st(0)
 804ea80:	dd d8                	fstp   %st(0)
 804ea82:	dd d8                	fstp   %st(0)
 804ea84:	dd d8                	fstp   %st(0)
# ? %eax==0
 804ea86:	eb 10                	jmp    804ea98 ret_common.1

ret_DVL_none.clone.2:
-> return DVL_none
 804ea88:	dd d8                	fstp   %st(0)
 804ea8a:	dd d8                	fstp   %st(0)
 804ea8c:	dd d8                	fstp   %st(0)
 804ea8e:	dd d8                	fstp   %st(0)
 804ea90:	dd d8                	fstp   %st(0)
 804ea92:	dd d8                	fstp   %st(0)
 804ea94:	dd d8                	fstp   %st(0)

return DVL_none;
 804ea96:	66 90                	xchg   %ax,%ax

ret_common.1:
 804ea98:	83 c4 2c             	add    $0x2c,%esp
 804ea9b:	89 d0                	mov    %edx,%eax
 804ea9d:	5b                   	pop    %ebx
 804ea9e:	5e                   	pop    %esi
 804ea9f:	c3                   	ret    

body.num1:
num = (v3x - v1x)*v1dy + (v1y - v3y)*v1dx;
 804eaa0:	d9 c9                	fxch   %st(1)
 804eaa2:	d8 cc                	fmul   %st(4),%st
 804eaa4:	d9 ca                	fxch   %st(2)
 804eaa6:	d8 cb                	fmul   %st(3),%st
 804eaa8:	de c2                	faddp  %st,%st(2)
 804eaaa:	d9 c9                	fxch   %st(1)
result->divfrac = num / den;
 804eaac:	de f5                	fdivp  %st,%st(5)
 804eaae:	d9 cc                	fxch   %st(4)
 804eab0:	d9 5e 0c             	fstps  0xc(%esi)
result->divpt.x = v1x + v1dx*frac;
 804eab3:	d9 c9                	fxch   %st(1)
 804eab5:	d8 cb                	fmul   %st(3),%st
 804eab7:	dc 44 24 20          	faddl  0x20(%esp)
 804eabb:	d9 1e                	fstps  (%esi)
result->divpt.y = v1y + v1dy*frac;
 804eabd:	d8 ca                	fmul   %st(2),%st
 804eabf:	de c1                	faddp  %st,%st(1)
 804eac1:	d9 5e 04             	fstps  0x4(%esi)

+ if( frac < 0.05f
+ 804eac4:	d9 05 00 81 11 08    	flds   0x8118100
+ 804eaca:	d9 c9                	fxch   %st(1)
+ 804eacc:	d8 d1                	fcom   %st(1)
+ 804eace:	df e0                	fnstsw %ax
+ 804ead0:	dd d9                	fstp   %st(1)
+ 804ead2:	f6 c4 01             	test   $0x1,%ah
+ 804ead5:	74 1d                	je     804eaf4 body.if.95

if.95.OR.samevertex:
if( ... && SameVertex()
 804ead7:	89 da                	mov    %ebx,%edx
 804ead9:	89 f0                	mov    %esi,%eax
 804eadb:	89 4c 24 04          	mov    %ecx,0x4(%esp)
 804eadf:	dd 5c 24 08          	fstpl  0x8(%esp)
 804eae3:	e8 b8 fe ff ff       	call   804e9a0 <SameVertex.clone.5>
 804eae8:	8b 4c 24 04          	mov    0x4(%esp),%ecx
 804eaec:	85 c0                	test   %eax,%eax
 804eaee:	dd 44 24 08          	fldl   0x8(%esp)
 804eaf2:	75 74                	jne    804eb68 case_DVL_v1

body.if.95:
+ if( frac > 0.95f
+ 804eaf4:	d9 05 04 81 11 08    	flds   0x8118104
+ 804eafa:	d9 c9                	fxch   %st(1)
+ 804eafc:	de d9                	fcompp 
+ 804eafe:	df e0                	fnstsw %ax
+ 804eb00:	f6 c4 45             	test   $0x45,%ah
+ 804eb03:	74 2b                	je     804eb30 if.95.OR.samevertex
 
case_DVL_mid:
 804eb05:	c7 46 08 00 00 00 00 	movl   $0x0,0x8(%esi)
 804eb0c:	c7 46 10 00 00 00 00 	movl   $0x0,0x10(%esi)
 804eb13:	c7 46 14 01 00 00 00 	movl   $0x1,0x14(%esi)
 804eb1a:	c7 46 18 00 00 00 00 	movl   $0x0,0x18(%esi)
return DVL_mid;
 804eb21:	ba 02 00 00 00       	mov    $0x2,%edx
 804eb26:	83 c4 2c             	add    $0x2c,%esp
 804eb29:	89 d0                	mov    %edx,%eax
 804eb2b:	5b                   	pop    %ebx
 804eb2c:	5e                   	pop    %esi
 804eb2d:	c3                   	ret    

 804eb2e:	66 90                	xchg   %ax,%ax
 
if.95.OR.samevertex:
if( ... && SameVertex()
 804eb30:	89 ca                	mov    %ecx,%edx
 804eb32:	89 f0                	mov    %esi,%eax
 804eb34:	89 4c 24 04          	mov    %ecx,0x4(%esp)
 804eb38:	e8 63 fe ff ff       	call   804e9a0 <SameVertex.clone.5>
 804eb3d:	8b 4c 24 04          	mov    0x4(%esp),%ecx
 804eb41:	85 c0                	test   %eax,%eax
 804eb43:	74 c0                	je     804eb05 case_DVL_mid

# case_DVL_v2:
 804eb45:	89 4e 08             	mov    %ecx,0x8(%esi)
 804eb48:	c7 46 10 00 00 00 00 	movl   $0x0,0x10(%esi)
 804eb4f:	c7 46 14 02 00 00 00 	movl   $0x2,0x14(%esi)
 804eb56:	c7 46 18 01 00 00 00 	movl   $0x1,0x18(%esi)
return DVL_v2;
 804eb5d:	ba 03 00 00 00       	mov    $0x3,%edx
 804eb62:	e9 31 ff ff ff       	jmp    804ea98 ret_common.1
 
 804eb67:	90                   	nop
 
case_DVL_v1:
 804eb68:	dd d8                	fstp   %st(0)
 804eb6a:	89 5e 08             	mov    %ebx,0x8(%esi)
 804eb6d:	c7 46 10 ff ff ff ff 	movl   $0xffffffff,0x10(%esi)
 804eb74:	c7 46 14 01 00 00 00 	movl   $0x1,0x14(%esi)
 804eb7b:	c7 46 18 01 00 00 00 	movl   $0x1,0x18(%esi)
return DVL_v1;
 804eb82:	ba 01 00 00 00       	mov    $0x1,%edx
 804eb87:	e9 0c ff ff ff       	jmp    804ea98 ret_common.1

 804eb8c:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi


==============
Tst9_686:
0804e990 <fracdivline>:
 804e990:	56                   	push   %esi
 804e991:	53                   	push   %ebx
 804e992:	89 d3                	mov    %edx,%ebx
 804e994:	83 ec 34             	sub    $0x34,%esp
 804e997:	d9 02                	flds   (%edx)
 804e999:	d9 42 04             	flds   0x4(%edx)
 804e99c:	d9 01                	flds   (%ecx)
 804e99e:	d8 e2                	fsub   %st(2),%st
 804e9a0:	8b 74 24 40          	mov    0x40(%esp),%esi
 804e9a4:	d9 41 04             	flds   0x4(%ecx)
 804e9a7:	d8 e2                	fsub   %st(2),%st
 804e9a9:	dd 1c 24             	fstpl  (%esp)
 804e9ac:	d9 00                	flds   (%eax)
 804e9ae:	d9 5c 24 28          	fstps  0x28(%esp)
 804e9b2:	d9 40 04             	flds   0x4(%eax)
 804e9b5:	d9 5c 24 2c          	fstps  0x2c(%esp)

den = v3dy*v1dx - v3dx*v1dy;
 804e9b9:	d9 40 08             	flds   0x8(%eax)
 804e9bc:	d9 40 0c             	flds   0xc(%eax)
 804e9bf:	31 c0                	xor    %eax,%eax
 804e9c1:	d9 c2                	fld    %st(2)
 804e9c3:	d8 c9                	fmul   %st(1),%st
 804e9c5:	dd 04 24             	fldl   (%esp)
 804e9c8:	d8 cb                	fmul   %st(3),%st
 804e9ca:	de e9                	fsubrp %st,%st(1)

if (fabs(den) < 1.0E-36)
 804e9cc:	d9 c0                	fld    %st(0)
 804e9ce:	d9 e1                	fabs   
 804e9d0:	dd 05 d8 d8 11 08    	fldl   0x811d8d8
 804e9d6:	df f1                	fcomip %st(1),%st
 804e9d8:	dd d8                	fstp   %st(0)
 804e9da:	77 3c                	ja     804ea18 ret_DVL_none.clone.1

num = (v3x - v1x)*v3dy + (v1y - v3y)*v3dx;
 804e9dc:	d9 c5                	fld    %st(5)
 804e9de:	d8 6c 24 28          	fsubrs 0x28(%esp)
 804e9e2:	d9 c5                	fld    %st(5)
 804e9e4:	d8 64 24 2c          	fsubs  0x2c(%esp)
 804e9e8:	dc cc                	fmul   %st,%st(4)
 804e9ea:	d9 cb                	fxch   %st(3)
 804e9ec:	d8 c9                	fmul   %st(1),%st
 804e9ee:	de c4                	faddp  %st,%st(4)

frac = num / den;
 804e9f0:	d9 cb                	fxch   %st(3)
 804e9f2:	d8 f1                	fdiv   %st(1),%st

if (frac<0.0 ... )
 804e9f4:	d9 ee                	fldz   
 804e9f6:	df f1                	fcomip %st(1),%st
 804e9f8:	77 2e                	ja     804ea28 ret_DVL_none.clone.2

if ( ... frac>1.0)
 804e9fa:	d9 e8                	fld1   
 804e9fc:	d9 c9                	fxch   %st(1)
 804e9fe:	db f1                	fcomi  %st(1),%st
 804ea00:	dd d9                	fstp   %st(1)
 804ea02:	76 3c                	jbe    804ea40 body.num1

# ret_DVL_none.clone.3:
-> return DVL_none
 804ea04:	dd d8                	fstp   %st(0)
 804ea06:	dd d8                	fstp   %st(0)
 804ea08:	dd d8                	fstp   %st(0)
 804ea0a:	dd d8                	fstp   %st(0)
 804ea0c:	dd d8                	fstp   %st(0)
 804ea0e:	dd d8                	fstp   %st(0)
 804ea10:	dd d8                	fstp   %st(0)
 804ea12:	eb 24                	jmp    804ea38 ret_common.1
 
 804ea14:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
 
ret_DVL_none.clone.1:
-> return DVL_none
 804ea18:	dd d8                	fstp   %st(0)
 804ea1a:	dd d8                	fstp   %st(0)
 804ea1c:	dd d8                	fstp   %st(0)
 804ea1e:	dd d8                	fstp   %st(0)
 804ea20:	dd d8                	fstp   %st(0)
 804ea22:	dd d8                	fstp   %st(0)
 804ea24:	eb 12                	jmp    804ea38 ret_common.1
 
 804ea26:	66 90                	xchg   %ax,%ax

ret_DVL_none.clone.2:
-> return DVL_none
 804ea28:	dd d8                	fstp   %st(0)
 804ea2a:	dd d8                	fstp   %st(0)
 804ea2c:	dd d8                	fstp   %st(0)
 804ea2e:	dd d8                	fstp   %st(0)
 804ea30:	dd d8                	fstp   %st(0)
 804ea32:	dd d8                	fstp   %st(0)
 804ea34:	dd d8                	fstp   %st(0)
 
return DVL_none;
 804ea36:	66 90                	xchg   %ax,%ax

ret_common.1:
 804ea38:	83 c4 34             	add    $0x34,%esp
 804ea3b:	5b                   	pop    %ebx
 804ea3c:	5e                   	pop    %esi
 804ea3d:	c3                   	ret    

body.num1:
num = (v3x - v1x)*v1dy + (v1y - v3y)*v1dx;
 804ea3e:	66 90                	xchg   %ax,%ax
 804ea40:	d9 ca                	fxch   %st(2)
 804ea42:	d8 cc                	fmul   %st(4),%st
 804ea44:	d9 cb                	fxch   %st(3)
 804ea46:	dc 0c 24             	fmull  (%esp)
 804ea49:	de c3                	faddp  %st,%st(3)
result->divfrac = num / den;
 804ea4b:	de fa                	fdivrp %st,%st(2)
 804ea4d:	d9 c9                	fxch   %st(1)
 804ea4f:	d9 5e 0c             	fstps  0xc(%esi)
result->divpt.x = v1x + v1dx*frac;
 804ea52:	dc c9                	fmul   %st,%st(1)
 804ea54:	d9 c9                	fxch   %st(1)
 804ea56:	de c3                	faddp  %st,%st(3)
 804ea58:	d9 ca                	fxch   %st(2)
 804ea5a:	d9 1e                	fstps  (%esi)
result->divpt.y = v1y + v1dy*frac;
 804ea5c:	dd 04 24             	fldl   (%esp)
 804ea5f:	d8 ca                	fmul   %st(2),%st
 804ea61:	de c1                	faddp  %st,%st(1)
 804ea63:	d9 5e 04             	fstps  0x4(%esi)

+ if( frac < 0.05f
+ 804ea66:	d9 05 a0 d8 11 08    	flds   0x811d8a0
+ 804ea6c:	df f1                	fcomip %st(1),%st
+ 804ea6e:	76 1b                	jbe    804ea8b body.if.95

if( ... && SameVertex()
 804ea70:	89 f0                	mov    %esi,%eax
 804ea72:	89 4c 24 0c          	mov    %ecx,0xc(%esp)
 804ea76:	dd 5c 24 10          	fstpl  0x10(%esp)
 804ea7a:	e8 d1 fe ff ff       	call   804e950 <SameVertex.clone.5>
 804ea7f:	8b 4c 24 0c          	mov    0xc(%esp),%ecx
 804ea83:	85 c0                	test   %eax,%eax
 804ea85:	dd 44 24 10          	fldl   0x10(%esp)
 804ea89:	75 6d                	jne    804eaf8 case_DVL_v1

body.if.95:
+ if( frac > 0.95f
+ 804ea8b:	d9 05 a4 d8 11 08    	flds   0x811d8a4
+ 804ea91:	d9 c9                	fxch   %st(1)
+ 804ea93:	df f1                	fcomip %st(1),%st
+ 804ea95:	dd d8                	fstp   %st(0)
+ 804ea97:	77 27                	ja     804eac0 if.95.OR.samevertex

case_DVL_mid:
 804ea99:	c7 46 08 00 00 00 00 	movl   $0x0,0x8(%esi)
return DVL_mid;
 804eaa0:	b8 02 00 00 00       	mov    $0x2,%eax
 804eaa5:	c7 46 10 00 00 00 00 	movl   $0x0,0x10(%esi)
 804eaac:	c7 46 14 01 00 00 00 	movl   $0x1,0x14(%esi)
 804eab3:	c7 46 18 00 00 00 00 	movl   $0x0,0x18(%esi)
 804eaba:	83 c4 34             	add    $0x34,%esp
 804eabd:	5b                   	pop    %ebx
 804eabe:	5e                   	pop    %esi
 804eabf:	c3                   	ret    
 
if.95.OR.samevertex:
if( ... && SameVertex()
 804eac0:	89 ca                	mov    %ecx,%edx
 804eac2:	89 f0                	mov    %esi,%eax
 804eac4:	89 4c 24 0c          	mov    %ecx,0xc(%esp)
 804eac8:	e8 83 fe ff ff       	call   804e950 <SameVertex.clone.5>
 804eacd:	8b 4c 24 0c          	mov    0xc(%esp),%ecx
 804ead1:	85 c0                	test   %eax,%eax
 804ead3:	74 c4                	je     804ea99 case_DVL_mid

# case_DVL_v2:
 804ead5:	89 4e 08             	mov    %ecx,0x8(%esi)
return DVL_v2;
 804ead8:	b8 03 00 00 00       	mov    $0x3,%eax
 804eadd:	c7 46 10 00 00 00 00 	movl   $0x0,0x10(%esi)
 804eae4:	c7 46 14 02 00 00 00 	movl   $0x2,0x14(%esi)
 804eaeb:	c7 46 18 01 00 00 00 	movl   $0x1,0x18(%esi)
 804eaf2:	e9 41 ff ff ff       	jmp    804ea38 ret_common.1

 804eaf7:	90                   	nop

case_DVL_v1:
 804eaf8:	dd d8                	fstp   %st(0)
 804eafa:	89 5e 08             	mov    %ebx,0x8(%esi)
return DVL_v1;
 804eafd:	b8 01 00 00 00       	mov    $0x1,%eax
 804eb02:	c7 46 10 ff ff ff ff 	movl   $0xffffffff,0x10(%esi)
 804eb09:	c7 46 14 01 00 00 00 	movl   $0x1,0x14(%esi)
 804eb10:	c7 46 18 01 00 00 00 	movl   $0x1,0x18(%esi)
 804eb17:	e9 1c ff ff ff       	jmp    804ea38 ret_common.1

 804eb1c:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi


==============
Tst10_k8:
0804e980 <fracdivline>:
 804e980:	56                   	push   %esi
 804e981:	53                   	push   %ebx
 804e982:	89 d3                	mov    %edx,%ebx
 804e984:	83 ec 34             	sub    $0x34,%esp
 804e987:	d9 02                	flds   (%edx)
 804e989:	8b 74 24 40          	mov    0x40(%esp),%esi
 804e98d:	d9 42 04             	flds   0x4(%edx)
 804e990:	d9 01                	flds   (%ecx)
 804e992:	d8 e2                	fsub   %st(2),%st
 804e994:	d9 41 04             	flds   0x4(%ecx)
 804e997:	d8 e2                	fsub   %st(2),%st
 804e999:	dd 1c 24             	fstpl  (%esp)
 804e99c:	d9 00                	flds   (%eax)
 804e99e:	d9 5c 24 28          	fstps  0x28(%esp)
 804e9a2:	d9 40 04             	flds   0x4(%eax)
 804e9a5:	d9 5c 24 2c          	fstps  0x2c(%esp)

den = v3dy*v1dx - v3dx*v1dy;
 804e9a9:	d9 40 08             	flds   0x8(%eax)
 804e9ac:	d9 40 0c             	flds   0xc(%eax)
 804e9af:	31 c0                	xor    %eax,%eax
 804e9b1:	d9 c2                	fld    %st(2)
 804e9b3:	d8 c9                	fmul   %st(1),%st
 804e9b5:	dd 04 24             	fldl   (%esp)
 804e9b8:	d8 cb                	fmul   %st(3),%st
 804e9ba:	de e9                	fsubrp %st,%st(1)

if (fabs(den) < 1.0E-36)
 804e9bc:	d9 c0                	fld    %st(0)
 804e9be:	d9 e1                	fabs   
 804e9c0:	dd 05 d8 e4 11 08    	fldl   0x811e4d8
 804e9c6:	df f1                	fcomip %st(1),%st
 804e9c8:	df c0                	ffreep %st(0)
 804e9ca:	77 3c                	ja     804ea08 ret_DVL_none.clone.1

num = (v3x - v1x)*v3dy + (v1y - v3y)*v3dx;
 804e9cc:	d9 c5                	fld    %st(5)
 804e9ce:	d8 6c 24 28          	fsubrs 0x28(%esp)
 804e9d2:	d9 c5                	fld    %st(5)
 804e9d4:	d8 64 24 2c          	fsubs  0x2c(%esp)
 804e9d8:	dc cc                	fmul   %st,%st(4)
 804e9da:	d9 cb                	fxch   %st(3)
 804e9dc:	d8 c9                	fmul   %st(1),%st
 804e9de:	de c4                	faddp  %st,%st(4)

frac = num / den;
 804e9e0:	d9 cb                	fxch   %st(3)
 804e9e2:	d8 f1                	fdiv   %st(1),%st

if (frac<0.0 ... )
 804e9e4:	d9 ee                	fldz   
 804e9e6:	df f1                	fcomip %st(1),%st
 804e9e8:	77 2e                	ja     804ea18 ret_DVL_none.clone.2

if ( ... frac>1.0)
 804e9ea:	d9 e8                	fld1   
 804e9ec:	d9 c9                	fxch   %st(1)
 804e9ee:	db f1                	fcomi  %st(1),%st
 804e9f0:	dd d9                	fstp   %st(1)
 804e9f2:	76 3c                	jbe    804ea30 body.num1

# ret_DVL_none.clone.3:
-> return DVL_none
 804e9f4:	df c0                	ffreep %st(0)
 804e9f6:	df c0                	ffreep %st(0)
 804e9f8:	df c0                	ffreep %st(0)
 804e9fa:	df c0                	ffreep %st(0)
 804e9fc:	df c0                	ffreep %st(0)
 804e9fe:	df c0                	ffreep %st(0)
 804ea00:	df c0                	ffreep %st(0)
 804ea02:	eb 24                	jmp    804ea28 ret_common.1
 
 804ea04:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
 
ret_DVL_none.clone.1:
 -> return DVL_none
 804ea08:	df c0                	ffreep %st(0)
 804ea0a:	df c0                	ffreep %st(0)
 804ea0c:	df c0                	ffreep %st(0)
 804ea0e:	df c0                	ffreep %st(0)
 804ea10:	df c0                	ffreep %st(0)
 804ea12:	df c0                	ffreep %st(0)
 804ea14:	eb 12                	jmp    804ea28 ret_common.1
 
 804ea16:	66 90                	xchg   %ax,%ax

ret_DVL_none.clone.2:
-> return DVL_none
 804ea18:	df c0                	ffreep %st(0)
 804ea1a:	df c0                	ffreep %st(0)
 804ea1c:	df c0                	ffreep %st(0)
 804ea1e:	df c0                	ffreep %st(0)
 804ea20:	df c0                	ffreep %st(0)
 804ea22:	df c0                	ffreep %st(0)
 804ea24:	df c0                	ffreep %st(0)
 
ret_DVL_none:
 804ea26:	66 90                	xchg   %ax,%ax

ret_common.1:
 804ea28:	83 c4 34             	add    $0x34,%esp
 804ea2b:	5b                   	pop    %ebx
 804ea2c:	5e                   	pop    %esi
 804ea2d:	c3                   	ret    
 
 804ea2e:	66 90                	xchg   %ax,%ax
 
body.num1:
num = (v3x - v1x)*v1dy + (v1y - v3y)*v1dx;
 804ea30:	d9 ca                	fxch   %st(2)
 804ea32:	d8 cc                	fmul   %st(4),%st
 804ea34:	d9 cb                	fxch   %st(3)
 804ea36:	dc 0c 24             	fmull  (%esp)
 804ea39:	de c3                	faddp  %st,%st(3)
result->divfrac = num / den;
 804ea3b:	de fa                	fdivrp %st,%st(2)
 804ea3d:	d9 c9                	fxch   %st(1)
 804ea3f:	d9 5e 0c             	fstps  0xc(%esi)
result->divpt.x = v1x + v1dx*frac;
 804ea42:	dc c9                	fmul   %st,%st(1)
 804ea44:	d9 c9                	fxch   %st(1)
 804ea46:	de c3                	faddp  %st,%st(3)
 804ea48:	d9 ca                	fxch   %st(2)
 804ea4a:	d9 1e                	fstps  (%esi)
result->divpt.y = v1y + v1dy*frac;
 804ea4c:	dd 04 24             	fldl   (%esp)
 804ea4f:	d8 ca                	fmul   %st(2),%st
 804ea51:	de c1                	faddp  %st,%st(1)
 804ea53:	d9 5e 04             	fstps  0x4(%esi)

+ if( frac < 0.05f
+ 804ea56:	d9 05 a0 e4 11 08    	flds   0x811e4a0
+ 804ea5c:	df f1                	fcomip %st(1),%st
+ 804ea5e:	76 1b                	jbe    804ea7b body.if.95

if( ... && SameVertex()
 804ea60:	dd 5c 24 10          	fstpl  0x10(%esp)
 804ea64:	89 f0                	mov    %esi,%eax
 804ea66:	89 4c 24 0c          	mov    %ecx,0xc(%esp)
 804ea6a:	e8 d1 fe ff ff       	call   804e940 <SameVertex.clone.5>
 804ea6f:	85 c0                	test   %eax,%eax
 804ea71:	8b 4c 24 0c          	mov    0xc(%esp),%ecx
 804ea75:	dd 44 24 10          	fldl   0x10(%esp)
 804ea79:	75 6d                	jne    804eae8 case_DVL_v1

body.if.95:
+ if( frac > 0.95f
+ 804ea7b:	d9 05 a4 e4 11 08    	flds   0x811e4a4
+ 804ea81:	d9 c9                	fxch   %st(1)
+ 804ea83:	df f1                	fcomip %st(1),%st
+ 804ea85:	df c0                	ffreep %st(0)
+ 804ea87:	77 27                	ja     804eab0 if.95.OR.samevertex

case_DVL_mid:
 804ea89:	c7 46 08 00 00 00 00 	movl   $0x0,0x8(%esi)
 804ea90:	c7 46 10 00 00 00 00 	movl   $0x0,0x10(%esi)
return DVL_mid;
 804ea97:	b8 02 00 00 00       	mov    $0x2,%eax
 804ea9c:	c7 46 14 01 00 00 00 	movl   $0x1,0x14(%esi)
 804eaa3:	c7 46 18 00 00 00 00 	movl   $0x0,0x18(%esi)
 804eaaa:	83 c4 34             	add    $0x34,%esp
 804eaad:	5b                   	pop    %ebx
 804eaae:	5e                   	pop    %esi
 804eaaf:	c3                   	ret    
 
if.95.OR.samevertex:
if( ... && SameVertex()
 804eab0:	89 ca                	mov    %ecx,%edx
 804eab2:	89 f0                	mov    %esi,%eax
 804eab4:	89 4c 24 0c          	mov    %ecx,0xc(%esp)
 804eab8:	e8 83 fe ff ff       	call   804e940 <SameVertex.clone.5>
 804eabd:	85 c0                	test   %eax,%eax
 804eabf:	8b 4c 24 0c          	mov    0xc(%esp),%ecx
 804eac3:	74 c4                	je     804ea89 case_DVL_mid

# case_DVL_v2:
 804eac5:	89 4e 08             	mov    %ecx,0x8(%esi)
 804eac8:	c7 46 10 00 00 00 00 	movl   $0x0,0x10(%esi)
return DVL_v2;
 804eacf:	b8 03 00 00 00       	mov    $0x3,%eax
 804ead4:	c7 46 14 02 00 00 00 	movl   $0x2,0x14(%esi)
 804eadb:	c7 46 18 01 00 00 00 	movl   $0x1,0x18(%esi)
 804eae2:	e9 41 ff ff ff       	jmp    804ea28 ret_common.1

 804eae7:	90                   	nop

case_DVL_v1:
 804eae8:	df c0                	ffreep %st(0)
 804eaea:	89 5e 08             	mov    %ebx,0x8(%esi)
 804eaed:	c7 46 10 ff ff ff ff 	movl   $0xffffffff,0x10(%esi)
return DVL_v1;
 804eaf4:	b8 01 00 00 00       	mov    $0x1,%eax
 804eaf9:	c7 46 14 01 00 00 00 	movl   $0x1,0x14(%esi)
 804eb00:	c7 46 18 01 00 00 00 	movl   $0x1,0x18(%esi)
 804eb07:	e9 1c ff ff ff       	jmp    804ea28 ret_common.1

 804eb0c:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi

 


Yes, it's a real shame when you can't trust your CPU instructions, because you end up stuck forever with these legacy fixes that handle problems with ancient CPUs. Then again, modern CPUs are so damn complicated, it's almost forgivable when a bug creeps in. Anyway, that's too much code for me to analyze. I do see some of the "clone" stuff you're describing. Sometimes, compilers will align with "variable-sized NOPs", which can be any instruction of the desired size that doesn't mess up any calculations. Intel, and I suppose AMD, have stated that you're supposed to use specific instructions for this purpose, vs. a MOV EAX, EAX, or whatever. With the recent drive for power-efficient CPUs, these preferred NOP instructions could be hard-wired to actually do nothing, vs. executing a harmless instruction, which can also help by eliminating dependencies and the like.

 

I can't explain the code clones, unless one is used inline, and the original is kept, which might help during single-step debugging? (I'm reaching here :)

 

I fear the possibility of sending you off on a wild goose chase, but I must suggest: Maybe it's time to try another compiler, at least as a test. It could provide more info that could validate what your compiler is doing. For instance, maybe you'd see the strange fcomps workaround stuff being generated conditionally in another compiler. It could also provide a code size comparison.

 

But, honestly, at some point, you have to either trust your compiler, and choose to live with the occasional benign size bumps/code clones, or be forever unhappy, and rewrite the whole damn thing in assembler. Short of that, you could maybe become proficient at inline compiler directives that modify local behavior, load in pre-built, pre-vetted libraries to replace certain code blocks, etc. But for a code base the size of Doom, it becomes a ridiculous proposition.

 

In this specific case, "Close your eyes and act like nothing happened" may actually be the best policy, as much as I hate to say it. Good luck.


I had a question that came to my mind when reading this thread: Why does Doom use fixed point instead of using bare ints?

 

Doom has to shift or cast numbers to make multiplications and divisions work correctly (this uses more CPU time), but with bare ints, it wouldn't have to do any "conversions", right? It's only a question of representation as far as I know.

 

FixedDiv and FixedMul from Chocolate-Doom :



// Fixme. __USE_C_FIXED__ or something.

fixed_t
FixedMul
( fixed_t    a,
  fixed_t    b )
{
    return ((int64_t) a * (int64_t) b) >> FRACBITS;
}

//
// FixedDiv, C version.
//

fixed_t FixedDiv(fixed_t a, fixed_t b)
{
    if ((abs(a) >> 14) >= abs(b))
    {
        return (a^b) < 0 ? INT_MIN : INT_MAX;
    }
    else
    {
        int64_t result;

        result = ((int64_t) a << FRACBITS) / b;

        return (fixed_t) result;
    }
}
 


The shift is virtually free, because it is among the fastest instructions available. The casts to 64 bit are only needed in C because the language cannot properly express the actual assembly instructions, if I remember correctly. Either way, you cannot use "bare ints" because the precision of whole integers is not good enough for the math Doom is using.

15 minutes ago, dpJudas said:

 Either way, you cannot use "bare ints" because the precision of whole integers is not good enough for the math Doom is using.

The fixed points are 16.16, so you can multiply everything by 65536. That means that what is currently 1 unit would be represented as 65536 units. You still have 32 bits to represent Doom's world. I just want to know whether the developers complicated their own lives or simplified their work (and whether it made the game faster or slower).


If you multiply everything by 65536 you just invented fixed point. You have to divide by 65536 (aka shift 16 bits to the right) after a multiplication because otherwise you multiplied things twice by 65536.

8 hours ago, axdoomer said:

The fixed points are 16.16, so you can multiply everything by 65536. That means that what is currently 1 unit would be represented as 65536 units. You still have 32 bits to represent Doom's world. I just want to know whether the developers complicated their own lives or simplified their work (and whether it made the game faster or slower).

16.16 gives you numbers in the range of -32768 to +32767 with a granularity of 1/65536. In machine language, you don't even have to shift, which was helpful in vanilla. Especially back then, divides and multiplies on ints were much faster than with floats, and Doom needs various amounts of fractional precision throughout the engine. And, it's not just 16.16, it used 12.20 and others. This is the essence of the Wiggle Fix code: it dynamically adjusts the ratio of whole to fractional units in specific wall accumulators to prevent a nasty renderer artifact that causes walls to shift around unnaturally. Essentially a home-grown floating point using fixed point variables.


I am done with this topic.  I do not have those other compilers.  This is where someone else steps up and shows what clang or MS does.

 

It depends upon what target your distribution is for.  As we still compile distribution for i486, I am going to leave the float literal markers off of most comparisons to avoid the code bump.  Any future compilation for an i686 will rely upon the compiler optimization.

I already committed the code so this is already done.

Knowing what the code size bumps are allows me to continue using code size to judge code fix quality, by knowing what some of the extraneous noise is about.

 

Another strange one.   I took out an IF stmt that tested for Boom compatibility, and the code size jumped by 2K.

Make some other changes, and the code size is stuck at one value.  Put the one IF stmt back, and the 2K code size bump disappears.  I think I don't want to even look at it.

On 4/24/2017 at 4:13 PM, wesleyjohnson said:

I am done with this topic.  I do not have those other compilers.  This is where someone else steps up and shows what clang or MS does.

 

It depends upon what target your distribution is for.  As we still compile distribution for i486, I am going to leave the float literal markers off of most comparisons to avoid the code bump.  Any future compilation for an i686 will rely upon the compiler optimization.

I already committed the code so this is already done.

Knowing what the code size bumps are allows me to continue using code size to judge code fix quality, by knowing what some of the extraneous noise is about.

 

Another strange one.   I took out an IF stmt that tested for Boom compatibility, and the code size jumped by 2K.

Make some other changes, and the code size is stuck at one value.  Put the one IF stmt back, and the 2K code size bump disappears.  I think I don't want to even look at it.

I wouldn't look :)

 

You know, I've been studying the x86/x64 processor docs recently, and, here's something that may ease your mind somewhat: With all the caching, deep pipelining, multiple execution port stuff in there, you get a lot of statements executed, essentially, for free! It's truly amazing how much work Intel and AMD have put into optimizing their products. Long-winded, sloppily-compiled code often runs faster than tight code, even. It's all converted to internal "micro-op" code anyway. Cache is king. That's really the thing that matters these days. Re-calculating a value is often faster than reading a lookup entry in uncached memory, which is a big change in methodology.

 

I would not worry about a 2K bump here or there. I would assume that there was a good reason, and trust the compiler, for all but the most-important loops.

6 hours ago, kb1 said:

With all the caching, deep pipelining, multiple execution port stuff in there, you get a lot of statements executed, essentially, for free! It's truly amazing how much work Intel and AMD have put into optimizing their products. Long-winded, sloppily-compiled code often runs faster than tight code, even.

I can confirm this. Modern compilers know better how a CPU can optimize.

We were quite surprised when we found out that all the highly optimized assembly for the draw loops in ZDoom's renderer had lost all advantage over its C counterpart in the last 2 or 3 CPU generations. I had to pull out my 10 year old laptop to see the assembly stuff have a minor advantage. Compiled code may look sloppy but this is often done to get some instructions in between that can be executed for free.

In one case, the assembly code only looked better - after finding out what slowed down the C version and fixing it to work properly it was just as fast. In this particular situation it is that loading global variables in 64 bit code is REALLY slow. Doing it inside a loop is murderous. Just loading them into local variables, even onto the stack, made all the difference. Of course that optimized version of the function was quite a bit larger because it had to save all registers onto the stack, then load them with the global variables and afterward pop the registers again. It still was twice as fast as the version directly reading the global variables.

 

It is also quite pointless to look at the binary size. A 2k bump can simply be one new page of content being added, even if that amounts to only a few bytes of code. How large a page is depends on the linker; it can be 512 bytes, but some linkers choose larger values.
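
The page-rounding effect is plain arithmetic. A minimal sketch, assuming the linker pads each section up to a power-of-two alignment; the helper name align_up and the alignment values below are illustrative, not taken from any particular linker:

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of alignment.
   alignment must be a nonzero power of two. */
size_t align_up(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}
```

With a 512-byte alignment, 1024 bytes of code occupy 1024 bytes in the file, but 1025 bytes occupy 1536. With a 2048-byte alignment, growing from 4096 to 4097 bytes costs a full extra 2K, which is why the file size says little about how much code actually changed.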

 

11 hours ago, Graf Zahl said:

I can confirm this. Modern compilers know better how a CPU can optimize.

We were quite surprised when we found out that all the highly optimized assembly for the draw loops in ZDoom's renderer had lost all advantage over its C counterpart in the last 2 or 3 CPU generations. I had to pull out my 10-year-old laptop to see the assembly stuff have a minor advantage. Compiled code may look sloppy, but this is often done to get some instructions in between that can be executed for free.

In one case, the assembly code only looked better: after finding out what slowed down the C version and fixing it, the C version was just as fast. In this particular situation, the problem was that loading global variables in 64-bit code is REALLY slow. Doing it inside a loop is murderous. Just loading them into local variables, even onto the stack, made all the difference. Of course, that optimized version of the function was quite a bit larger because it had to save all registers onto the stack, then load them with the global variables, and afterward pop the registers again. It still was twice as fast as the version directly reading the global variables.

It is also quite pointless to look at the binary size. A 2k bump can simply be one new page of content being added, even if that amounts to only a few bytes of code. How large a page is depends on the linker; it can be 512 bytes, but some linkers choose larger values.

 

In Wesley's case, it grew 2k after removing an IF statement, so that's interesting. But yeah, making your vars local gets them into cache and, hopefully, registers. Good stuff.

 

Now, as you know, I have to play devil's advocate in this area, and claim that, possibly, some better assembly would probably still be better for the render loops, but, the gap is closing. Because the effect is so processor-specific, it is a ton of work to get the absolute best performance on a range of processors. And the code will only be optimal on some processors. It can be done, but it's a project in itself. The compiler guys have intimate knowledge of these issues, so, in almost all cases, trust your compiler!

31 minutes ago, kb1 said:

In Wesley's case, it grew 2k after removing an IF statement, so that's interesting. But yeah, making your vars local gets them into cache and, hopefully, registers. Good stuff.

 

Now, as you know, I have to play devil's advocate in this area, and claim that, possibly, some better assembly would probably still be better for the render loops, but, the gap is closing.

If you orchestrate it perfectly, you might be able to shave off maybe 5% more. But then the next CPU generation comes along and won't like your optimization, making the C version faster again. With today's CPUs it is simply a battle that cannot be won. Let's not forget that the original assembly I was talking about already used all registers and even used a bit of self-modifying code to get the remaining two values off the stack, too. But from what I have seen, it looks like CPUs are already heavily optimized for reading local stack data because it is so frequent in compiled code.

In any case, the main reason the assembly was ditched was not that it had lost all performance advantage but that, with the transition to a multithreaded renderer, it just became unmaintainable. And the performance boost from the multithreading was orders of magnitude more than the few measly percent a well-written assembly routine might have yielded.

 

 

20 hours ago, Graf Zahl said:

If you orchestrate it perfectly, you might be able to shave off maybe 5% more. But then the next CPU generation comes along and won't like your optimization, making the C version faster again. With today's CPUs it is simply a battle that cannot be won. Let's not forget that the original assembly I was talking about already used all registers and even used a bit of self-modifying code to get the remaining two values off the stack, too. But from what I have seen, it looks like CPUs are already heavily optimized for reading local stack data because it is so frequent in compiled code.

In any case, the main reason the assembly was ditched was not that it had lost all performance advantage but that, with the transition to a multithreaded renderer, it just became unmaintainable. And the performance boost from the multithreading was orders of magnitude more than the few measly percent a well-written assembly routine might have yielded.

 

 

I understand what happened in ZDoom's case. I guess I'll have to put my money where my mouth is with a demonstration.
