C-Ray Simple Raytracing Tests

           http://www.futuretech.blinkenlights.nl/c-ray.html

              By: John Tsiombikas <nuclear@siggraph.org>

       Test suite compiled by: Ian Mapleson <mapesdhs@yahoo.com>

                       Last Change: 10/Apr/2008


1. Introduction
2. The C-Ray Tests (how to compile)
3. Running the Tests
4. Submitting Results
5. Background
6. Appendix A: Invalid Tests

**********************************************************************

1. Introduction

C-Ray is a simple raytracer written by John Tsiombikas; in his
own words, it is:

  "...an extremely small program I did one day to figure out how
  would the simplest raytracer program look like in the least
  amount of code lines."

The relevant discussion on Nekochan is at:

  http://forums.nekochan.net/viewtopic.php?f=3&t=15719

The default data set is very small, so C-Ray really only tests the pure
floating-point (fp) speed of a CPU core (or of multiple CPUs/cores with
the threaded version); RAM speed is not a significant factor. John said:

  "This thing only measures 'floating point CPU performance' and
  nothing more, and it's good that nothing else affects the results.
  A real rendering program/scene would be still CPU-limited meaning
  that by far the major part of the time spent would be CPU time in
  the fpu, but it would have more overhead for disk I/O, shader
  parsing, more strain for the memory bandwidth, and various other
  things. So it's a good approximation being a renderer itself, but
  it's definitely not representative."

Nevertheless, the results are certainly interesting:

  http://www.futuretech.blinkenlights.nl/c-ray.html

If you wish to submit your own results, follow the instructions given
below. Send the data to me, not to John, and I will add them to the
relevant tables; please include all requested details.

Comments and feedback welcome!

Ian.

mapesdhs@yahoo.com
sgidepot@blueyonder.co.uk
http://www.futuretech.blinkenlights.nl/sgidepot/

**********************************************************************

2. The C-Ray Tests (how to compile)

Two programs are included in this archive for testing:

c-ray-f:

  This is for single-CPU systems, or for testing just a single core
  of a multi-core CPU.

c-ray-mt:

  This is the multithreaded version for testing multi-CPU systems, and/or
  systems with more than one CPU core. Note that on some systems c-ray-mt
  with just one thread may be faster than c-ray-f. Use whichever version
  gives you the best results in each case. Use the -t option to specify
  the number of threads; without the -t option, only 1 thread is used.

Compile the source files for your target platform with gcc or whatever
compiler you have. Just enter 'make', though feel free to add any arch-
specific optimizations for your compiler in CFLAGS first. By default,
the Makefile is designed for use with GCC. If you are using an SGI and
want to use MIPS Pro to compile the programs, then enter:

  /bin/cp Makefile.mips Makefile

and then enter 'make'.

Note that the c-ray binaries as supplied were compiled for an SGI
Octane2 R12K/R14 system (users of other SGI models should recompile
if possible), while the example x86 binary in the x86 directory was
compiled by John for a 3GHz P4.

If you don't want to use make, then typical compile lines for each
program on SGIs using MIPS Pro are as follows (in this case for an
Octane system - use a different IP number for other SGI systems):

  cc -O3 -mips4 -TARG:platform=ip30 -Ofast=ip30 c-ray-f.c -o c-ray-f -lm

while for the threaded version the pthread library must be included:

  cc -O3 -mips4 -TARG:platform=ip30 -Ofast=ip30 c-ray-mt.c -o c-ray-mt -lm -lpthread

See the 'cc' man page for full details of available optimisation options.

The file 'sgi.txt' has further example compile lines for SGI O2 and
Octane machines, some with extra example optimisation options. Try
them out and see which works best on your system. Those using GCC
should consult the gcc man page for full details of available options.

NOTE: results for tests done with pre-run profiling/virtualisation
compiler optimisations will NOT be accepted! (see Appendix A for details).

Before running the tests, naturally you should shut down any other
applications, processes, etc. which might interfere with the test.
For example, on SGI systems, I shut down the mediad and sgi_apache
daemons:

  /etc/init.d/mediad stop
  /etc/init.d/sgi_apache stop

Better still, also turn off timed and nsd, then rlogin remotely to run
the tests.
It should be possible to do the same thing on a Linux/BSD system.

On a Windows machine, shut down all unnecessary processes, close any
antivirus/firewall applications/processes, and it may be worthwhile
forcing any pending idle tasks to complete before running the tests,
ie. select Run from the Start menu and enter:

  Rundll32.exe advapi32.dll,ProcessIdleTasks

Assuming you are now ready to use the binary programs for the tests...

**********************************************************************

3. Running the Tests

There are two data files used for the tests:

'scene' is a simple environment, with just three reflective spheres.
Examine scene.jpg to see the final image.

'sphract' is a much more complex scenario, with dozens of spheres in
a fractal pattern (see gen_fract.txt for details of how the scene
description was created). Examine sphract.jpg to see the final image.

There are four tests. The first is the shortest, and its results are
used for the main table on the results page. The tests are:


  Test   Data File    Image Resolution     Oversampling

   1.    scene        Default 800x600      NONE
   2.    sphract      Default 800x600      NONE
   3.    sphract      1024 x 768           8X
   4.    scene        7500 x 3500          NONE

If you are using a single-CPU system which only has one core, or wish
to test just one CPU/core of a multi-CPU/core system, then run the
tests with c-ray-f, or with c-ray-mt using just 1 thread. For systems
with multiple CPUs/cores, please submit results for just a single
CPU/core as well as the fastest results for using all CPUs/cores (this
allows one to see how well parallel systems scale).
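
One way to put a number on that scaling is to divide the single-core
time by the all-cores time (the speedup), then divide the speedup by
the number of cores (the efficiency). The following is only a sketch
for UNIX shells with awk; the function name 'scaling' and the example
timings are made up for illustration and are not part of this archive:

```shell
# Print speedup and parallel efficiency from two timings in milliseconds.
# Arguments: single-core time, all-cores time, number of cores.
# (Hypothetical helper, not supplied with C-Ray.)
scaling() {
    awk -v s="$1" -v m="$2" -v c="$3" 'BEGIN {
        speedup = s / m
        printf "speedup %.2f, efficiency %.1f%%\n", speedup, 100 * speedup / c
    }'
}

# Example with invented timings: 2000 ms on one core, 600 ms on four cores.
scaling 2000 600 4
```

An efficiency close to 100% would mean the system scales almost
perfectly for that test.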

On UNIX systems, the programs receive the scene description data from
the standard input and send the results (the image created) to the
standard output. To run the first test on a UNIX system, enter:

  cat scene | ./c-ray-f > foo.ppm

On a Windows system, enter the following in a Command window:

  c-ray-f -i scene -o foo.ppm

The output will resemble the following (in this case run on an R14000
550MHz SGI Octane2):

  Rendering took: 1 seconds (1888 milliseconds)

The result to submit is the number of milliseconds. Run each test
several times if possible to observe a typical result. It's up to
you whether you submit a typical result or the fastest overall result.
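
If you would rather not copy the numbers by hand, something like the
following sketch may help on a UNIX system. It assumes the timing line
has exactly the format shown above and is written to standard error
(standard output carries the image); check this on your own system.
The function names 'extract_ms' and 'best_of' are made up for
illustration and are not part of this archive:

```shell
# Print the millisecond figure from a C-Ray timing line read on stdin.
# Assumes the line looks like:
#   Rendering took: 1 seconds (1888 milliseconds)
extract_ms() {
    sed -n 's/^Rendering took: .*(\([0-9][0-9]*\) milliseconds).*$/\1/p'
}

# Run a command several times and print the fastest time in milliseconds.
# Assumes the command writes its timing line to standard error.
# Example use (needs the c-ray-f binary and the scene file):
#   best_of 5 sh -c 'cat scene | ./c-ray-f > foo.ppm'
best_of() {
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Send stderr down the pipe and discard anything left on stdout.
        "$@" 2>&1 >/dev/null | extract_ms
        i=$((i + 1))
    done | sort -n | head -n 1
}
```

To report a typical result instead of the fastest, drop the final
'head' and look at the whole sorted list.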

If you have a multi-CPU/core system, now run the test multithreaded with
c-ray-mt, using the -t option to specify the number of threads, eg.

     UNIX:   cat scene | ./c-ray-mt -t 32 > foo.ppm
  Windows:   c-ray-mt -t 32 -i scene -o foo.ppm

For multi-CPU/core systems, the optimum number of threads varies
greatly from one system to another, though a good estimate is 16
times the number of cores. Try different numbers, eg. 32, 64, 128,
or some in-between number such as 40, 60, etc. Also try smaller
numbers, eg. just 8 threads for a quad-core system.

The maximum number of threads c-ray-mt can use is the vertical
resolution of the output image. As the number of threads increases,
eventually the speedup obtained by the parallel processing will be
outweighed by the overhead cost of managing the threads. Experiment
to find what works best for each test.
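
The experimenting can be scripted. The sketch below lists thread
counts at 1, 2, 4, 8 and 16 times the number of cores, and skips any
count above the vertical resolution. The function name
'thread_candidates' and the choice of multipliers are assumptions made
for illustration; adjust them freely:

```shell
# Print candidate thread counts for a given core count, one per line.
# Arguments: number of cores, vertical resolution of the output image.
# (Hypothetical helper, not supplied with C-Ray.)
thread_candidates() {
    cores=$1
    yres=$2
    for mult in 1 2 4 8 16; do
        n=$((cores * mult))
        # The vertical resolution is the most threads c-ray-mt can use.
        if [ "$n" -le "$yres" ]; then
            echo "$n"
        fi
    done
}

# Example use for a quad-core system and the default 800x600 image
# (needs the c-ray-mt binary and the scene file):
#   for t in $(thread_candidates 4 600); do
#       echo "threads: $t"
#       cat scene | ./c-ray-mt -t "$t" > foo.ppm
#   done
```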

Thus, for the other tests, the commands to enter on a UNIX system
would be as follows, using 32 threads just as an example here:

  cat sphract | ./c-ray-f > foo.ppm
  cat sphract | ./c-ray-mt -t 32 > foo.ppm
  cat sphract | ./c-ray-f -s 1024x768 -r 8 > foo.ppm
  cat sphract | ./c-ray-mt -t 32 -s 1024x768 -r 8 > foo.ppm
  cat scene | ./c-ray-f -s 7500x3500 > foo.ppm
  cat scene | ./c-ray-mt -t 32 -s 7500x3500 > foo.ppm

while on a Windows system these would be:

  c-ray-f -i sphract -o foo.ppm
  c-ray-mt -t 32 -i sphract -o foo.ppm
  c-ray-f -s 1024x768 -r 8 -i sphract -o foo.ppm
  c-ray-mt -t 32 -s 1024x768 -r 8 -i sphract -o foo.ppm
  c-ray-f -s 7500x3500 -i scene -o foo.ppm
  c-ray-mt -t 32 -s 7500x3500 -i scene -o foo.ppm

If you don't want to bother experimenting, then the script RUN.full
will execute all the tests, with the c-ray-mt tests done using 32
threads, but remember 32 threads might not be optimal for your system.

**********************************************************************

4. Submitting Results

Do not send any results to John; instead, send all results to me at
both of my email addresses (include "C-Ray" in the subject line):

  mapesdhs@yahoo.com
  sgidepot@blueyonder.co.uk

Apart from the run-times reported by each test, please remember
to state which tests are multithreaded and how many threads were
used. Better still, just copy the on-screen text for each test.

With respect to system information, state the type of system (name
and/or model number), CPU details (name/model, type/speed/cache,
no. of CPUs/cores), OS name/version (eg. for Linux, what name,
kernel version/build), what compiler was used (name/version) and
what extra options if any were employed. The online results page
also shows the host name where available, but this is optional.

Thus, for example, a system description might look like this:

  SUN Fire X2100, Solaris 10, host name 'kobe'
  Sun Studio 10 Compiler
  AMD Opteron 175 2.2 GHz 1MB L2

while for an SGI it might be:

  SGI Octane2, IRIX 6.5.26m
  MIPS Pro 7.4.3m (7.3 EOE)
  Dual-R12000 400MHz (2MB L2)

or a Windows system (in this case my own setup):

  WinXP-32bit PC (SP2), Asrock AM2NF3 motherboard, 4GB DDR2/800 RAM
  Athlon64 X2 6000+ 3.15GHz (overclocked)
  Supplied x86 binary used (1 core only)

You can also mention the system's RAM, disk, or anything else, but
they're not essential.

Note for SGI users: you can find out what eoe/dev compiler versions
your system has installed by entering:

  versions -b | grep compiler_

eg. on my Octane2 this gives:

  I  compiler_dev   05/14/2004  Base Compiler Development Environment, 7.3
  I  compiler_eoe   07/13/2005  IRIX Standard Execution Environment (Base Compiler Headers and Libraries, 7.4.3m)

**********************************************************************

5. Background

When I asked John why he created C-Ray, he said:

  "This is just an extremely small program I did one day to figure
  out how would the simplest raytracer program look like in the
  least amount of code lines :)  It's not useful for anything apart
  from benchmarking.

  As part of my BSc dissertation project I did a really big and
  feature-full ray tracer, which could be useful, supporting:
  programmable shading, network rendering, monte carlo rendering
  algorithms, etc. But it's big and buggy, slow, and incomplete
  because I was rushing to finish it the last minute before the
  deadline (as always) :)

  So I scrapped the damn thing after that, and I'm starting from
  scratch with a new design; if I finish it and it proves to be
  successful I'll let you know :)"


I also asked John why the best c-ray-mt results seem to be obtained
with a number of threads that is much larger than the number of
CPUs/cores in a system, to which he replied:

  Ian wrote:
  > I also suspect more threads means the balanced load between
  > the two CPUs is less affected by the possible differences in
  > complexity between threads.

  Bingo, each thread takes a bunch of scanlines, if the relative
  complexity of the rendering calculations between the bunches(sic)
  is not equal, then one thread may spend much more time calculating
  than another thread. Of course that doesn't necessarily mean that
  one "CPU" ends up calculating much more than the other since the
  threads are not "bound" to any CPU, each CPU takes one of the
  available ready-to-run threads each timeslice. Anyway having more
  threads evens it out. I would guess about 4 times as many threads
  as the CPUs [multiplied by the no. of cores per CPU] would be enough.
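
To get a feel for the size of the "bunches" of scanlines John
mentions, the sketch below divides the image height by the thread
count. It assumes equal-sized blocks; how c-ray-mt actually hands out
any leftover rows is not covered in this document, so the remainder is
only reported. The function name 'rows_per_thread' is made up for
illustration:

```shell
# Print how many scanlines each thread gets if the image rows are
# split into equal blocks, plus the rows left over.
# Arguments: vertical resolution, number of threads.
# (Hypothetical helper; the equal split is an assumption.)
rows_per_thread() {
    yres=$1
    threads=$2
    echo "$((yres / threads)) rows per thread, $((yres % threads)) left over"
}

# Default 800x600 image with 32 threads.
rows_per_thread 600 32
```

Smaller blocks mean that one unusually complex region of the image
ties up a thread for less time, which fits John's explanation.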


I also asked John if there was any element of overhead processing
to handle the results of the multiple threads. He said:

  There's no such overhead. Each thread gets a pointer to the
  appropriate location of the framebuffer, and stores every pixel
  as it is calculated directly. Also any processing afterwards
  (output of the PPM image) is done *after* timing stops.


See the Nekochan thread for more discussion about C-Ray, including
further comments by John.

**********************************************************************

6. Appendix A: Invalid Tests

Results for tests done with pre-run profiling/virtualisation compiler
optimisations will NOT be accepted! What does this mean? Read on...

Someone emailed me to say they had been able to halve the test run
times by using the following compilation/execution sequence:

  XC="gcc -O3 -ffast-math -fomit-frame-pointer c-ray-f.c \
  -finline-limit=10000 -ftree-vectorize -fwhole-program \
  -fbranch-probabilities -ffunction-sections -o c-ray-f -lm" && \
  $XC -fprofile-generate && ./run > /dev/null 2>&1 && \
  $XC -fprofile-use && ./run && ./run && ./run && ./run

with ./run containing:

  #!/bin/sh
  cat scene | ./c-ray-f > foo2.ppm

The explanation was as follows:

  What this does is to create an executable with profiling and
  virtualization instructions, execute it in order to create "real life"
  information and then recompile it using that information to better
  optimize it without the need of "guessing" (which is what usually a
  plain -O3 does). Doing this speeds up the execution a lot. Another
  important thing was to increase the inline limit from 600 (or whatever
  the default value is) to something big (like 10000) so more functions
  will be inlined instead of called. The usage of "-fwhole-program" tells
  to gcc that the .c file is the one and only source code file for the
  program, so it will use all functions as static and will make all
  "inline-able" functions, inline. This is another great speedup :-).

My problem with this is that, by definition, the test has to be run at
least once beforehand in order to gather the execution profiling data
used to optimise the binary for the final run. Thus, the real total time
spent to obtain the final result is not much less than, and perhaps
longer than, just running the test a single time without this sort of
execution profiling.

I asked John about these optimisations; he said:

  Hahaha :)
  Yes I bet the biggest advantage was the profiling run and use of that
  profiling information, and maybe also the whole program optimizations.
  Both of which can't be done in real-life programs :)

  The profiling information helps the compiler issue branch prediction
  hints to the processor, and also do prefetches. The first one makes a
  lot of difference in modern x86 CPUs with their huge execution
  pipelines. If the branch prediction fails, you end up flushing the
  pipeline and backtracking tens of instructions. Explicit prefetching,
  would also make a big difference if the data set was bigger. I don't
  think it helps here.

This method of optimisation is certainly interesting, but I don't think
it's appropriate for the C-Ray tests.