Optimizing Raytracing Algorithm Using CUDA

Many programs now generate images with the raytracing algorithm, running on a CPU or GPU in single-threaded or multi-threaded mode. In this paper, we design an optimized multi-threaded raytracing algorithm that generates images on either a CPU or a GPU. The algorithm traces light to a depth of 8. It is optimized by changing the pixel traversal priority and the mapping of light rays to threads, dedicating the depth function to idle threads, and using optimized functions from the MSDN library. The code is written in C++ and CUDA. To evaluate its performance, we compare builds in different compiler modes, vary the number of threads, examine different resolutions, and investigate data bandwidth. The results show that at least 11 frames per second can be generated at HD (720p) resolution on the GPU of a GT 840M graphics card using the trace method. With a more powerful graphics card, this algorithm and program could be used to generate real-time animation.

Raytracing: This method simulates the actual reflection of a ray on objects and generates realistic images. Reflection, refraction, absorption, and shadow are represented well in these images, and after a ray intersects an object, further rays of light are also computed. Generating images with this method is costly, and the cost depends on the depth to which each ray of light is traced [3].
Radiosity: In this method, we examine the object surfaces intersected by a ray of light. It is sometimes called global illumination. It generates images based on an accurate analysis of light reflection on diffuse surfaces. Shadows produced by this method look very natural and are rendered softly [4].

2-Importance of Research
In generating 3D images for games and virtual space simulations, realism plays an important role in convincing the audience. Various methods have therefore been created to generate such images. In general, these methods fall into two groups: object-based and pixel-based. Raytracing is a main pixel-based method [5,6]. It traces the path of light from a light source to an object, and from other objects to it, over several stages. It is better than other methods at handling shadow, reflection, and the brilliance of objects on each other. The technique produces much more realistic and natural reflections, but it is generally very expensive computationally, so it can usually be run only in offline mode. To generate a pixel, various paths of light rays from the light source to an object are examined to determine the colour intensity (Figure 1). The resulting images appear more realistic, and the reflection of an object on a shiny object can be observed. Figure 2 shows a graphical environment generated by this method.
On the other hand, with advances in graphics cards such as those from NVIDIA and the availability of libraries such as CUDA for parallel processing on GPU cores, these capabilities are now used in scientific work, and parallel processing on graphics cards is much faster than CPU processing. In this architecture, most of the processor is dedicated to computation units, and less to cache memory (Figure 3). In addition, the software architecture also differs from that of other platforms (Figure 4).

3-Related Works
Various papers have been published on image generation using the ray-trace method on GPU processors [13][14][15][16]. In [14], it is asserted that CPU computational power is much lower than that of the GPU. However, [14] does not mention the implementation method, the image generation code, or the method by which functions are assigned to threads. Several important factors influence the running time of programs on a CPU. These include the number of active threads, the processor capacity utilized, the type of data transfer in memory, the functions used, compiler settings, the operating system, and so on.
Because these factors are not mentioned in many papers on the raytracing method, it is difficult to test the methods by reimplementing them and to verify the experiments and results. In addition, the way functions are called on the graphics card, the synchronization of the GPU with the CPU when execution completes, the functions used to measure time, the version and technical specifications of the graphics card, and the number of experiments are important factors that influence a comparison between CPU and GPU running times. Unfortunately, many papers do not mention the factors affecting CPU and GPU running time [17][18][19].

4-Method
First, a procedure is used to generate 3D images with the raytrace method, producing an image with 20 objects. The function Raytrace(int x, int y) is designed to run on both the CPU and the GPU. Its design is based on CUDA library functions. To make the results more valid, the source code that runs on the CPU and the source code that runs on the GPU are exactly the same. After the code is prepared, the executable program runs in parallel with various numbers of threads.
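One common way to keep a single source for both processors is to mark shared functions so that nvcc builds them for host and device, while an ordinary C++ compiler ignores the annotation. The following is a minimal sketch of that pattern, not the paper's code: the macro name HD, the type Vec3, and the function hitSphere are hypothetical, and the body of Raytrace(int x, int y) is not given in the paper, so only a ray-sphere intersection test is shown.

```cpp
#include <cassert>
#include <cmath>

// nvcc defines __CUDACC__, so the function is compiled for host and device.
// A plain C++ compiler sees an empty macro and builds the CPU version only.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

struct Vec3 { float x, y, z; };  // hypothetical helper type

HD inline Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
HD inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Distance along the ray to the nearest sphere intersection in front of an
// origin that lies outside the sphere, or -1.0f on a miss.
// `dir` is assumed to be normalized.
HD inline float hitSphere(Vec3 origin, Vec3 dir, Vec3 center, float radius) {
    assert(radius > 0.0f);
    Vec3 oc = sub(origin, center);
    float b = dot(oc, dir);
    float c = dot(oc, oc) - radius * radius;
    float disc = b * b - c;          // discriminant of the quadratic in t
    if (disc < 0.0f) return -1.0f;   // the ray misses the sphere
    float t = -b - std::sqrt(disc);  // nearer of the two roots
    return t > 0.0f ? t : -1.0f;
}
```

For example, a ray from the origin along +z hits a unit sphere centred at (0, 0, 5) at distance 4, and the same code path would be taken on either processor.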
In this stage, the computation of each pixel of the image is assigned to one thread, and a variable that counts the last assigned pixel is used to optimize the procedure. The next pixel is assigned to whichever thread finishes its task first, and the counter is incremented. To prevent the same counter value from being assigned to two threads simultaneously, we use two functions: EnterCriticalSection and LeaveCriticalSection [20]. In this way, the CPU is used most efficiently.
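The shared pixel counter described above can be sketched as follows. This is an illustrative, portable version rather than the paper's implementation: std::mutex stands in for the Win32 EnterCriticalSection and LeaveCriticalSection pair, and the names renderWithSharedCounter and shadePixel are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Renders `pixelCount` pixels with `threadCount` workers. Each worker takes
// the next unprocessed pixel index from a shared counter, so threads that
// finish early simply pick up more pixels.
// shadePixel is a placeholder for the per-pixel ray-tracing work.
template <typename ShadeFn>
std::vector<int> renderWithSharedCounter(int pixelCount, int threadCount,
                                         ShadeFn shadePixel) {
    assert(pixelCount >= 0 && threadCount > 0);
    std::vector<int> image(pixelCount, 0);
    int nextPixel = 0;       // counter of the last pixel handed out
    std::mutex counterLock;  // stands in for the Win32 critical section

    auto worker = [&]() {
        for (;;) {
            int mine;
            {
                std::lock_guard<std::mutex> guard(counterLock);  // "enter"
                if (nextPixel >= pixelCount) return;             // no work left
                mine = nextPixel++;
            }                                                    // "leave"
            // Each index is handed out once, so this write needs no lock.
            image[mine] = shadePixel(mine);
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < threadCount; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return image;
}
```

Only the counter update sits inside the locked region; the expensive shading runs outside it, which keeps contention low even with many threads.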
A function is created to manage the threads. As required, it creates the target number of threads for each test, activates them, and frees the generated images. It is designed to generate images with 1, 16, 32, 64, 96, 128, 256, 384, and 512 threads (Figure 6, goRayTraceCPU structure). These configurations let us measure the performance of different CPUs with different numbers of threads and find the best CPU performance [21].
To ensure that the actual time is measured for each thread count, image generation is repeated 6 times. Measuring the actual running time of the program (continuously and without interruption) would require stopping all other operating system activity, which in turn causes difficulty for the operating system. To solve this problem, we call a sleep function (100 ms) between tests. The time spent in the sleep function is treated separately from the CPU computation time (Figure 6, Main structure).
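The repeat-and-pause measurement loop can be sketched as follows. This is a portable illustration under stated assumptions, not the paper's code: std::chrono::steady_clock replaces the Windows high-resolution counter, the function name bestOfRunsMs is hypothetical, and the minimum over the runs is kept, as the paper does for its CPU timings.

```cpp
#include <cassert>
#include <chrono>
#include <limits>
#include <thread>

// Runs `work` `runs` times, pausing `pauseMs` milliseconds after each run,
// and returns the shortest elapsed time in milliseconds.
// The pause lies outside the timed region, so it never counts as compute time.
template <typename Fn>
double bestOfRunsMs(int runs, int pauseMs, Fn work) {
    assert(runs > 0 && pauseMs >= 0);
    using Clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < runs; ++i) {
        auto start = Clock::now();
        work();
        auto end = Clock::now();
        double ms = std::chrono::duration<double, std::milli>(end - start).count();
        if (ms < best) best = ms;
        // Give the operating system time for its own activity between tests.
        std::this_thread::sleep_for(std::chrono::milliseconds(pauseMs));
    }
    return best;
}
```

With the paper's settings, a call would look like bestOfRunsMs(6, 100, renderFrame), where renderFrame is whatever callable generates one image.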
In this test, to ensure that the measured time is processing time only, we ignore the transfer time to memory (system RAM and the graphics card's RAM) and consider only computation time. Two functions, QueryPerformanceFrequency (which returns the counter frequency) and QueryPerformanceCounter (which returns the latest counter value), are used to measure time accurately [22]. They are used in the StartTime and EndTime functions.
First, to test processor speed, we used 4 CPUs with different specifications and different numbers of cores. We take the processor score in [23] as the computational rank. In addition, we used Gb-DDR3-1600MHz as RAM. Table 1 shows the specifications of the processors. These processors are among the most common and appropriate for computational programs. At the time of testing, we used the i7 4790, which was the best and most powerful CPU readily available. Its overall rank in [23] is 64. If Intel Xeon and AMD FX processors are ignored (they are used in servers), it is a very powerful processor. Note that for the other available processors, we used information from "cpubenchmark.net" and "videocardbenchmark.net", which are owned by PassMark. PassMark Software is a Microsoft Registered Partner and an Intel Software Partner [24].
The final program (release configuration) was compiled, and the running time was recorded for different numbers of threads and resolutions. To record the best running time, all other running programs were removed from memory, the test was repeated multiple times, and the minimum time was recorded as the final value. Figure 5 shows the final image produced by this program. On the other hand, the CUDA libraries (introduced by NVIDIA) are used to run programs on the graphics card (GPU). We adapted the functions and statements called from the raytrace library to the CUDA environment and library for parallel processing. None of the statements in the main functions was changed, and the source code that generates images on the GPU remained the same as the source code that generates them on the CPU.
Because the CUDA library prepares threads for computation based on the capability of the graphics card, we cannot activate some of the top and bottom threads, and they are still recorded in the complete thread list. The image resolutions were HD and Full HD.
We used the three common graphics cards listed in Table 2, which also shows their scores in reference [25]. These graphics cards perform below current cards such as the GeForce GTX 780, GeForce GTX 770, and GeForce GTX 960, and their processing speed is much lower. In [25], the GeForce GTX 650 has rank 131 with a score of 1833, which is a good position compared with other graphics cards. However, compared with the GeForce GTX 780, with a score of 9022, its score is much lower, and thus the GeForce GTX 880 Ti outperforms it significantly.
After completing the program, we compiled a release version for each graphics card, and computed and recorded the processing time for different numbers of threads and resolutions.
Because the test aims to measure processing time and speed, we ignore the time needed to transfer data from system RAM to the graphics card's RAM and back. Note that when functions run on the graphics card, the host does not know when they have completed; therefore, we use cudaDeviceSynchronize.
In effect, it blocks the CPU until the graphics card completes its task, and the function returns when the graphics card finishes its operations (Figure 8). Thus, the start time is taken before RayTraceGPU is run, and the end time is taken after the call to cudaDeviceSynchronize returns. The difference between the two is computed and recorded. The rest of the function copies data from the graphics card's memory to system memory and frees the allocated memory (Figure 8, goRayTraceGPU procedure). In summary, we transfer the data from RAM to the graphics card's RAM, and then call the function that generates the image on the graphics card, passing the memory addresses of the parameters. The processing time used by the graphics card is the difference between the start time and the end time of the function. Lastly, we transfer the data from the graphics card's RAM to system RAM and free the graphics card memory.
We use the values of the blockDim, blockIdx, and threadIdx parameters to compute the current pixel position, and call the RayTraceGPU function after computing the width and height (Figure 8, RayTraceGPU procedure).
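The index arithmetic behind this step can be checked on the host. The sketch below emulates a 2D CUDA launch in plain C++ and is only an illustration: Dim3, globalIndex, blocksFor, and visitCounts are hypothetical names, and the 16 x 16 block size used in the usage note is an assumption, not a value from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Dim3 { unsigned x, y; };  // hypothetical stand-in for CUDA's dim3 (2D only)

// Same arithmetic a CUDA kernel uses: blockIdx * blockDim + threadIdx.
inline unsigned globalIndex(unsigned blockIdx, unsigned blockDim, unsigned threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

// Number of blocks needed to cover `extent` pixels, rounded up.
inline unsigned blocksFor(unsigned extent, unsigned blockDim) {
    assert(blockDim > 0);
    return (extent + blockDim - 1) / blockDim;
}

// Walks the whole grid on the host and counts how often each pixel is visited.
inline std::vector<int> visitCounts(unsigned width, unsigned height, Dim3 block) {
    Dim3 grid{blocksFor(width, block.x), blocksFor(height, block.y)};
    std::vector<int> counts(static_cast<std::size_t>(width) * height, 0);
    for (unsigned by = 0; by < grid.y; ++by)
        for (unsigned bx = 0; bx < grid.x; ++bx)
            for (unsigned ty = 0; ty < block.y; ++ty)
                for (unsigned tx = 0; tx < block.x; ++tx) {
                    unsigned x = globalIndex(bx, block.x, tx);
                    unsigned y = globalIndex(by, block.y, ty);
                    // Threads that fall past the image edge do no work.
                    if (x >= width || y >= height) continue;
                    ++counts[static_cast<std::size_t>(y) * width + x];
                }
    return counts;
}
```

For a 1280 x 720 image and 16 x 16 blocks, every pixel is visited exactly once. At 1080 rows the grid needs 68 blocks vertically, so the last row of blocks is only partly used, which is why the bounds check matters.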

Figure 8. Pseudocode of the raytrace algorithm on the GPU
Lastly, we present the following tables and figures (Table 3-Table 4) (Figure 9-Figure 12), which show information about running the program on different CPUs and GPUs with different numbers of threads. When running the program, we closed all unnecessary programs in the operating system. We repeated each test 6 times, and recorded the average time for each resolution and thread count for the graphics cards and the minimum time for the CPUs.
On the other hand, it can be said that CPU processing underperforms GPU processing as data traffic increases. According to equations (1) and (2), at 720p and 1080p the image generation time ratio (CPU/GPU) increased by 49% and 62% respectively, while according to equation (3) the pixel computation ratio increased by 125% in both resolutions. This is also true for the CPU i5-4460, by 125% (1080p) and 110% (720p). In this experiment, only the GT 635M has a longer processing time than the i7-4790. Considering the time of manufacture, the difference in processor speed (675 MHz versus 3.6 GHz), and the fact that this graphics card is of a much lower class than the CPU, the comparison is not meaningful and can be ignored. Nevertheless, as Figure 11 and Figure 12 show, this graphics card outperforms the other processors. Thus, parallel processing can be performed more effectively thanks to the increased power of graphics cards, CUDA programming, and an appropriate choice of the number of threads.
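The 125% figure is consistent with the growth in pixel count between the two resolutions. Equation (3) is not reproduced in this section, so the sketch below assumes that "pixel computation ratio" refers to the number of pixels per frame. The function names are hypothetical.

```cpp
#include <cassert>

// Pixel count of a width x height image.
inline long pixelCount(long width, long height) {
    assert(width > 0 && height > 0);
    return width * height;
}

// Percentage by which `to` exceeds `from`.
inline double percentIncrease(double from, double to) {
    assert(from > 0.0);
    return (to - from) / from * 100.0;
}
```

HD (1280 x 720) has 921,600 pixels and Full HD (1920 x 1080) has 2,073,600, so percentIncrease(921600, 2073600) gives 125: the pixel workload grows by 125%.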

6-Conclusion
The results show that, despite the time cost and heavy computation needed to generate images with ray-trace algorithms, the code can be optimized to run in multiple threads and can use parallel processing on a graphics card to generate HD and Full HD images in real time. We save time by moving computation from the CPU to the GPU, and we achieved better performance as data traffic increased. This means that, in the near future, the ray-trace technique can be used in online, real-time applications.

Figure 3. Comparison of the architecture of the CPU and GPU

Figure 4. The grid of threads in the GPU, which can have one, two, or three dimensions

Figure 6 shows the whole running procedure in the form of pseudo-code.

Figure 6. Pseudocode of the raytrace algorithm on the CPU

Since the quality and resolution of images in videos and games is HD and Full HD, in this test, we have used images

Figure 5. Image created from 19 spheres and a light source with a reflection depth of 8 at 1080p resolution

As Figure 6 shows, we compare image generation (1080p) in the release and debug versions to show the time difference. Processing in the release version is at least 14 times faster than in the debug version (Figure 7). Therefore, we used the release version to conduct the tests.

Figure 7. Comparison of the behavior of the application in Debug and Release modes when generating an image at 1080p resolution with different numbers of threads

Comparison of the best runtime with different numbers of threads on different CPUs and GPUs when generating an image at 1080p resolution
Figure 12. Comparison of the best runtime with different numbers of threads on different CPUs and GPUs when generating an image at 720p resolution