[linux-elitists] Nobody's favorite language? C++ and free software

Jason Spence jspence@lightconsulting.com
Wed Mar 26 04:22:36 PST 2003

On Wed, Mar 26, 2003 at 10:44:22AM +0100, Eugen Leitl wrote: 
> On Wed, 26 Mar 2003, Alan DuBoff wrote:
> > However strip is not a part of the Intel compiler!<g> It is available for 
> > Linux.
> Recent c't says there's not much point to use Intel's compiler vs. gcc 
> 3.2, but for some special numerics stuff on Pentium 4.

Then they haven't had the chance to read the mixed x86/MMX/SSE code
output from g++-3.2 :) icc outputs remarkably more efficient code for
that case.  Two formulas I use often in 3D math, the vector distance
calculation and the cardinal vector cracker, both benefit from
compiling with icc vs. g++.  A few strange things I noticed: macros
don't seem to be as fast as inlines with -O2 on g++, although the gcc
docs claim that they should be.  Also, g++-3.x outputs slower code
than g++-2.x, which completely mystifies me [1].  I've attached the
test file in case anyone is interested in trying it out (Alan?).

Do not interpret this to mean that the Intel compiler outputs better
code in general, since this kind of heavily FP oriented code gives the
Intel compiler a totally unfair advantage.  Most real-world
applications are integer only, and I'd like to see some benchmarks of

thalakan@thom:~$ g++-3.2 -O2 test5.c -o test5
thalakan@thom:~$ ./test5
count: 10674
count: 11441
count: 11447

thalakan@thom:~$ ./test5-icc7
count: 651257
count: 759370
count: 759782

If I inline the function instead of using a macro, I get this:

thalakan@thom:~$ g++-3.2 -O2 -march=pentium3 -mfpmath=sse test5.c -o test5-g++-3.2
thalakan@thom:~$ ./test5-g++-3.2
count: 7236
count: 7831
count: 7833 

thalakan@thom:~$ icc test5.c -O2 -axiMK -o test5-icc7
test5.c(20) : (col. 16) remark: main has been targeted for automatic
cpu dispatch.
thalakan@thom:~$ ./test5-icc7
count: 361822
count: 759482
count: 759828
count: 759870

Running them on an 850MHz AMD Athlon Tbird (which doesn't have SSE)
instead of an 850MHz Pentium IIIm (which does) gives me this:

thalakan@gawyn:~$ ./test5-icc7
count: 78563
count: 923012 <<< It's going faster without SSE?!
count: 925080
count: 924961
count: 912543

thalakan@gawyn:~$ ./test5-g++-3.2
count: 5724
count: 16976
count: 16637
count: 16864
count: 16088

[1] Examining the disassembly made me go check
gcc/config/i386/i386.md, where I found all sorts of fun anachronisms
like this:

; ??? This is correct only for fdiv and sqrt -- sin/cos take 65-100 cycles.
; They can overlap with integer insns.  Only the last two cycles can overlap
; with other fp insns.  Only fsin/fcos can overlap with multiplies.
; Only last two cycles of fsin/fcos can overlap with other instructions.
(define_function_unit "fpu" 1 0
  (and (eq_attr "cpu" "pentium")
       (eq_attr "type" "fdiv")) [2]
  39 37)

[2] For 100 points, name the x87 instruction that gives you the sin
and cos of st(0) simultaneously [3] [4].

[3] No, the other one.

[4] Now convert it back to degrees [5]...

[5] Now convert it into a 4D matrix so we can feed it to OpenGL [6]...

[6] Oh my, now we're out of floating point registers.  It's a good
thing that Opteron has 16 XMM registers, eh? :)

 - Jason                 Currently at: Home, Downstairs (Fremont, CA) (Cloudy)

"It's Like This"

Even the samurai
have teddy bears,
and even the teddy bears
get drunk.
-------------- next part --------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

typedef struct {
  float x;
  float y;
  float z;
} v_t;

v_t targets[1024];

inline float distance1(const v_t a, const v_t b);
inline float distance2(const v_t a, const v_t b);
#define distance3(a, b) (sqrt((((b).x - (a).x) * ((b).x - (a).x)) + \
			      (((b).y - (a).y) * ((b).y - (a).y)) + \
			      (((b).z - (a).z) * ((b).z - (a).z))))

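int main(void) {
  int i;
  double count = 0;
  time_t start;
  v_t source;

  source.x = rand();
  source.y = rand();
  source.z = rand();

  /* Fill the target table with random points */
  for(i = 0; i < sizeof(targets) / sizeof(v_t); ++i) {
    targets[i].x = rand();
    targets[i].y = rand();
    targets[i].z = rand();
  }

  start = time(NULL);
  while(1) {
    if(time(NULL) != start) {
      /* The second has rolled over: report and reset the counter */
      printf("count: %.f\n", count);
      count = 0;
      start = time(NULL);
    }
    else {
      for(i = 0; i < sizeof(targets) / sizeof(v_t); ++i) {
	distance1(source, targets[i]);
      }
      /* Assumption: the archived copy has no increment; counting one
         per pass over targets[] is a guess at the original. */
      ++count;
    }
  }

  return 0;
}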

/* Distance via explicit multiplication */
inline float distance1(const v_t a, const v_t b) {
  return sqrt(((b.x - a.x) * (b.x - a.x)) +
	      ((b.y - a.y) * (b.y - a.y)) +
	      ((b.z - a.z) * (b.z - a.z)));
}

/* Same distance, squaring with pow() instead */
inline float distance2(const v_t a, const v_t b) {
  return sqrt(pow(b.x - a.x, 2) +
	      pow(b.y - a.y, 2) +
	      pow(b.z - a.z, 2));
}
