entryway

Compilers


I added support for a textured automap in glboom-plus and noticed that the MSVC 6.0 build is very slow on the textured automap. I could not believe it was 2x slower than the MSVC 2005 build.

fraps + "glboom-plus dv.wad -warp 5" + "map_textured 1" in cfg + <Tab> IDDT

MSVC 6.0 - 45 fps
MSVC 2005/2008 - 93 fps
Intel Compiler 10 - 103 fps

http://prboom-plus.sf.net/vc6-vc2005-intel10.zip

How can such code be 2x slower when I build it with VC6? Massive cache misses, I think.


lol, I replaced

glTexCoord2f(flats_vbo[vertexnum].u + floor_uoffs, flats_vbo[vertexnum].v + floor_voffs);
glVertex3f(flats_vbo[vertexnum].x, flats_vbo[vertexnum].z, 0);
with
glVertex3f(flats_vbo[vertexnum].x, flats_vbo[vertexnum].z, 0);
glTexCoord2f(flats_vbo[vertexnum].u + floor_uoffs, flats_vbo[vertexnum].v + floor_voffs);
and now I get 23 fps with MSVC 6.0 and 30 with MSVC 2005!

It's strange, because
typedef struct vbo_xyz_uv_s
{
  float x, y, z;
  float u, v;
} vbo_xyz_uv_t;

entryway said:

lol, I replaced

glTexCoord2f(flats_vbo[vertexnum].u + floor_uoffs, flats_vbo[vertexnum].v + floor_voffs);
glVertex3f(flats_vbo[vertexnum].x, flats_vbo[vertexnum].z, 0);
with
glVertex3f(flats_vbo[vertexnum].x, flats_vbo[vertexnum].z, 0);
glTexCoord2f(flats_vbo[vertexnum].u + floor_uoffs, flats_vbo[vertexnum].v + floor_voffs);
and now I have 23 fps! with MSVC 6.0 and 30 with MSVC 2005



Does that even work right? If I understand OpenGL correctly the glVertex command must always be the last one to terminate one vertex's attributes.

Graf Zahl said:

Does that even work right? If I understand OpenGL correctly the glVertex command must always be the last one to terminate one vertex's attributes.

Ah yes, you are right. But the big difference between MSVC 6 and 2005 remains for that code.


I think it's generally agreed that Visual C++ 6.0 was terrible at generating floating-point code in general. The number of hacks in Quake 2's source code to work around it would probably be enough to deduce that, even if I hadn't heard similar complaints all over the place :)

I know EE's Cardboard runs better compiled with 2008 than it did with 6.0. Ten+ whole years of development separate the two.


lol

  // sort subsectors by texture
  //qsort(visible_subsectors, visible_subsectors_count,
  //  sizeof(visible_subsectors[0]), dicmp_visible_subsectors_by_pic);
...and FPS doubled for the MSVC 6.0 executable.

wtf? Why is msvc6's implementation of qsort so fucking slow?


I wonder how fast mingw-generated code runs compared to the Windows compilers. It might be nicer than MSVC 6.0 for backwards compatibility with older Windows versions.

Spleen said:

I wonder how fast mingw-generated code runs compared to the Windows compilers. It might be nicer than MSVC 6.0 for backwards compatibility with older Windows versions.

gcc ~= msvc 2005

msvc 2005 and intel 10 are compatible with older Windows versions


MSVC 6.0 on my work computer:

standard qsort - 25 fps
no sorting at all - 45 fps
custom implementation of qsort - 56 fps

Graf Zahl said:

Did they implement a Bubblesort or what?

heh

I have compared the custom qsort with msvc2005's qsort, and the former is only ~1% faster. I think that's because the 'compare' function is inlined.


If history is any indication then the sort is probably cache-bound, which means it probably won't matter what algorithm you replace it with.

Quasar said:

If history is any indication then the sort is probably cache-bound, which means it probably won't matter what algorithm you replace it with.

Probably I did not understand you, but 'my' implementation of qsort is 2.5x faster than msvc6's.

On Deus Vult map05 + 'IDDT' + 'textured automap' I sort 20k pointers every frame, and the custom implementation is much faster (25 fps -> 55 fps)

#ifdef USE_CUSTOM_QSORT
void qsort_subsectors_by_pic(subsector_t **arr, unsigned int n)
{
  #define cmp_subsectors_by_pic(a, b) ((*a)->sector->floorpic > (*b)->sector->floorpic)
  QSORT(subsector_t*, arr, n, cmp_subsectors_by_pic);
}
#else
static int C_DECL dicmp_visible_subsectors_by_pic(const void *a, const void *b)
{
  return (*((const subsector_t **)b))->sector->floorpic - 
         (*((const subsector_t **)a))->sector->floorpic;
}
#endif


#ifdef USE_CUSTOM_QSORT
  qsort_subsectors_by_pic(visible_subsectors, visible_subsectors_count);
#else
  qsort(visible_subsectors, visible_subsectors_count,
    sizeof(visible_subsectors[0]), dicmp_visible_subsectors_by_pic);
#endif

Graf Zahl said:

Did they implement a Bubblesort or what?

I have tested qsort speed exclusively.

Code:
http://pastebin.com/WR2ztyNP

Results:

MSVC 6.0 qsort  - 32600 ms
Custom qsort    - 1200 ms
MSVC 2005 qsort - 1003 ms
Intel 10 qsort  - 980 ms

Thus MSVC6's qsort is 32x slower than MSVC2005's, heh. With my 'compare' function and my data, of course.

Anyway, I have changed the algorithm a little and now get 100+ fps with any compiler, because I no longer need to sort that array every frame.

entryway said:

I have tested qsort speed exclusively.



Are you sure that MSVC 6 even implements a proper qsort? The speed you get suggests that there's something terribly wrong with it.

Graf Zahl said:

Are you sure that MSVC 6 even implements a proper qsort?

At least it sorts the data :)

I think modern compilers use some tricks (there are a few tricks that can be applied to classic qsort) and read the data less randomly to minimize cache misses.

Graf Zahl said:

Are you sure that MSVC 6 even implements a proper qsort? The speed you get suggests that there's something terribly wrong with it.

Such a trick (caching floorpic next to the pointer so the compare does no pointer chasing) does not help MSVC 6.0 either: 21000 ms against 1000 ms for the custom sort.

typedef struct visible_subsectors_s
{
  subsector_t *sub;
  int floorpic;
} visible_subsectors_t;

for (i = 0; i < numsubsectors; i++)
{
  visible_subsectors[visible_subsectors_count].sub = &subsectors[i];
  visible_subsectors[visible_subsectors_count].floorpic = subsectors[i].sector->floorpic;
  visible_subsectors_count++;
}

static int C_DECL dicmp_visible_subsectors_by_pic(const void *a, const void *b)
{
  return ((const visible_subsectors_t *)b)->floorpic - 
         ((const visible_subsectors_t *)a)->floorpic;
}


It's quite possible that Microsoft's implementation degenerates to O(n^2) too easily for that kind of data, or spams a fuckton of recursive calls rather than being iterative.

Not good for something you need to call between frame renders. Of course I'd need to see the implementation to tell... it would not be the first time that a custom implementation (perhaps specialized for a precise kind of object or a fixed comparison key) is faster than a more generic one, or can be given extra knowledge that makes it more efficient (e.g. better pivoting, in your case).


Heh. The two MSVC implementations of qsort are more or less the same as far as raw code is concerned, except for this interesting algorithmic tidbit here:

from the MSVC 6.0 implementation, on choosing a pivot element:


The efficiency of the algorithm demands that we find one that is approximately the median of the values, but also that we select one fast. Using the first one produces bad performance if the array is already sorted, so we use the middle one, which would require a very weirdly arranged array for worst case performance. Testing shows that a median-of-three algorithm does not, in general, increase performance.


...or does it?

from the MSVC 2005 implementation, at the same spot in the code:


We choose the median of the first, middle, and last elements, to avoid bad performance in the face of already sorted data, or data that is made up of multiple sorted runs appended together. Testing shows that a median-of-three algorithm provides better performance than simply picking the middle element for the latter case.


So apparently that design decision in MSVC 6.0 came back to bite them, as even the optimized custom implementation uses a median of three:

from the custom qsort implementation:

2. Choose the pivot element using a median-of-three decision tree. This reduces the probability of selecting a bad pivot value and eliminates certain extraneous comparisons.


It also has other optimizations, such as using insertion sort for sufficiently small partitions, while both MS implementations only do that when the WHOLE array to be sorted is below this "sufficiently small" threshold, not individual partitions.

But the big player seems to be the pivot choosing method.

This raises interesting questions about the nature of the data to be sorted: if it can trip up qsort so badly, maybe another O(n log n) sorting algorithm should be used. Or maybe it should not be sorted at all, or only sorted approximately.

