Intel HD Graphics 4000 OpenCL Driver Download Install Update
Using OpenCL, key video effects were sped up by as much as 2. A popular software title for professional video editing was updated in to accelerate video processing effects with OpenCL. During development of the initial application release, over 60 video effects were accelerated with OpenCL, for which over OpenCL kernels were implemented.
With so many effects accelerated with OpenCL, it was essential to test the functionality and assess the performance of every OpenCL kernel. This article outlines lessons learned and some optimization techniques used while testing and assessing the performance of the OpenCL kernels. It assumes the reader is familiar with the OpenCL programming model. The release kit consisted of seven workloads designed to test different video effects that are accelerated with OpenCL.
These seven workloads were used for performance and functional analysis throughout testing of the Intel HD Graphics (HDG) implementation of OpenCL. A number of issues encountered in the OpenCL compiler and runtime were resolved along the way.
HDG runtime challenges were also observed, and it took time to understand them before steps could be taken to address the performance bottlenecks.
Figure 1 compares the initial performance to the optimized performance observed with HDG today. The application release kit workloads were developed to measure the increase in playback performance and the decrease in render time. The kit is divided into 7 workloads, each showcasing different video effects, all implemented with OpenCL kernels. All of this happens during a transition from the slow-motion background image to the next clip, using the Cross Effect transition.
This section outlines lessons learned and general optimizations for creating measurable performance improvements in OpenCL kernels. These optimizations were integrated into the release of workloads shown in Figure 1. The OpenCL kernels used in these studies came from a professional video editing application.
The optimizations outlined in the article were scheduled for release in subsequent application updates. The case study also shows how simple it is to further accelerate video processing by using shared local memory (SLM) in OpenCL kernels where appropriate.
Processing lookup table data in OpenCL kernels in most cases creates a performance bottleneck. This is due to the large number of data transfers for lookup table data that occur between main memory and the memory available for the OpenCL device.
If an OpenCL kernel is not compute bound, the kernel program should be redesigned or the algorithm should not be programmed using OpenCL at all. In general, lookup tables should be avoided in OpenCL kernels if at all possible because LUTs preclude a kernel from being compute bound. Avoiding lookup tables and whether the kernel is compute bound or not are topics for another paper. Lookup tables will almost always create a performance bottleneck when the lookup table data is large, generally more than bytes.
The data transfer latency and access hits slow the HDG OpenCL compute engine, which in turn prevents the kernel from running optimally. Notice that, in addition to its other parameters, the kernel takes three global-pointer parameters for lookup tables: lutR, lutG, and lutB. Data held in global memory has to travel to the kernel along a slow path.
The highlighted code shows LUT data being used, where the code indexes into the tables to retrieve values. The indexes are computed from the incoming image pixel data (not shown here).
Notice that each kernel thread retrieves six values from system memory. The impact of the data transfer latency is compounded because a large number of OpenCL hardware threads run the kernel and all try to hit the same memory address space. So what can be done to avoid bottlenecks when using lookup tables? In most cases the answer is as easy as using the HDG local memory.
If at all possible, lookup table data should be copied to shared local memory. Using SLM prevents excessive shuttling of lookup table data between kernel threads thus greatly minimizing data transfer penalties. With the memory latency removed or minimized, the kernel compute throughput will no longer be bogged down and should show substantially better performance. The kernel code still indexes through the lookup tables, but those tables now reside in local memory.
As data resides in local memory, the data transfer latency is avoided which expedites the compute part of the kernel, and thus achieves much better performance. Removing the performance bottleneck on kernels that use lookup tables is often just as simple as using SLM.
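The staging step described above can be sketched in plain C. This is a host-side emulation under stated assumptions, not the application's kernel: the names `copy_lut_slice` and `stage_lut` and the table length `LUT_SIZE` are hypothetical. In a real OpenCL kernel, `lid` would come from `get_local_id(0)`, `local_size` from `get_local_size(0)`, the destination would be a `__local` array, and the work-group would call `barrier(CLK_LOCAL_MEM_FENCE)` before reading the table.

```c
#include <assert.h>
#include <stddef.h>

#define LUT_SIZE 256 /* assumed table length, not taken from the article */

/* One work-item's share of the cooperative copy: entries lid,
 * lid + local_size, lid + 2 * local_size, ... are copied from the
 * global table to the local one. */
static void copy_lut_slice(const float *global_lut, float *local_lut,
                           size_t lid, size_t local_size)
{
    for (size_t i = lid; i < LUT_SIZE; i += local_size)
        local_lut[i] = global_lut[i];
}

/* Emulates a whole work-group: each work-item copies its slice in turn.
 * A real kernel would run the slices in parallel and then synchronize
 * with barrier(CLK_LOCAL_MEM_FENCE) before any lookup. */
static void stage_lut(const float *global_lut, float *local_lut,
                      size_t local_size)
{
    assert(local_size > 0);
    for (size_t lid = 0; lid < local_size; ++lid)
        copy_lut_slice(global_lut, local_lut, lid, local_size);
}
```

Because the stride equals the work-group size, every entry is copied exactly once even when `LUT_SIZE` is not a multiple of `local_size`. After the copy, per-pixel lookups read only the local table.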
Table 1 shows the metrics of performance measured on the color curves effect which was optimized to use SLM. A stand-alone application was written and a single image x was used to assess the performance of the OpenCL kernel for the color curves effect.
The stand-alone application host code looped the execution of the OpenCL kernel times. The kernel with no SLM took about While the kernel optimized with SLM took only 18 milliseconds.
The performance is 4. For additional system details, refer to the system information in Appendix A. However, using SLM comes with some restrictions. Figure 4 is a screenshot of the color curve infrared effect, showing the output after the different colors have been computed. Due to these incorrect results, the video editing application's programmers ended up writing their own bilinear interpolation (BLI) functionality in OpenCL kernel code.
The incorrect-results issue was promptly fixed in the HDG graphics driver, and the performance of both BLI implementations was compared. Highlighted code in Figure 5 shows the differences between the two implementations. The x and y values are derived from the running thread IDs returned by get_global_id(0) and get_global_id(1), respectively.
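For readers unfamiliar with BLI, here is a minimal C sketch of a hand-written bilinear sample. It is neither the application's kernel nor the driver's implementation. The function names, the single-channel row-major image layout, and the clamp-to-edge behavior are all assumptions made for illustration. In a kernel, the sample position would be derived from `get_global_id(0)` and `get_global_id(1)`.

```c
#include <assert.h>

/* Linear blend between a and b with weight t in [0, 1]. */
static float lerpf(float a, float b, float t) { return a + t * (b - a); }

/* Clamp a coordinate into [0, max]. */
static float clampf(float v, float max)
{
    return v < 0.0f ? 0.0f : (v > max ? max : v);
}

/* Bilinear sample of a single-channel, row-major float image at the
 * fractional position (x, y). Coordinates outside the image are clamped
 * to the nearest edge (an assumed addressing mode). */
static float bilinear(const float *img, int width, int height,
                      float x, float y)
{
    assert(img && width > 0 && height > 0);
    x = clampf(x, (float)(width - 1));
    y = clampf(y, (float)(height - 1));

    int x0 = (int)x; /* truncation equals floor because x >= 0 */
    int y0 = (int)y;
    int x1 = x0 + 1 < width ? x0 + 1 : x0;
    int y1 = y0 + 1 < height ? y0 + 1 : y0;
    float fx = x - (float)x0; /* horizontal weight */
    float fy = y - (float)y0; /* vertical weight */

    /* Blend along x on the two neighboring rows, then along y. */
    float top = lerpf(img[y0 * width + x0], img[y0 * width + x1], fx);
    float bottom = lerpf(img[y1 * width + x0], img[y1 * width + x1], fx);
    return lerpf(top, bottom, fy);
}
```

Each output pixel costs four loads and three blends, which is why the article compares a hand-written version like this against the driver's own implementation.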
BLI is commonly used in several video effects, some of which were used to gauge the performance deltas and the consistency of the output. Although OpenCL best practices and optimization guidelines recommend writing kernels with as few instructions as possible, there are exceptions to this advice.
This case study explores performance shortcomings of the lens flare video effect which required six kernels, and compares the performance against a monolithic one-kernel solution. Multiple OpenCL kernels are usually viewed as an optimal design solution for video effects where multiple independent video elements are added to the video output. In practice, a single kernel would minimize texture traffic overhead and it might be a better solution in terms of performance.
This case study highlights the lens flare video effect, which uses six kernels. Each kernel was designed to draw one lens effect element. Depending on the lens flare effect setting, a particular kernel would be executed multiple times to draw multiple instances of the same element on the same video frame.
This required multiple passes over the same image, and each pass over the video frame incurs texture traffic overhead. That overhead was determined to slow down the processing of the video effect.
A one-kernel approach was proposed to eliminate texture traffic and improve performance. This case study outlines the performance results with the one-kernel solution. To consolidate six kernels into one, the unique code was taken from each of the six kernels and turned into six functions that are called from the main kernel. A specific function would be called within a loop to draw multiple elements as needed.
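The difference between the two designs can be sketched in plain C by counting frame loads. Everything here is hypothetical: the element functions `draw_glow` and `draw_streak` are placeholders for the real lens elements, and a flat float array stands in for the video frame. The sketch only illustrates why consolidating the passes reduces texture traffic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-element function: takes the current pixel value and
 * its index, returns the pixel with one lens element drawn on it. */
typedef float (*draw_fn)(float pixel, size_t index);

static float draw_glow(float p, size_t i) { (void)i; return p + 0.25f; }
static float draw_streak(float p, size_t i) { return p + ((i % 2) ? 0.5f : 0.0f); }

/* Multi-kernel style: one full pass over the frame per element.
 * Returns the number of pixel loads from the frame. */
static size_t render_multi_pass(float *frame, size_t n_pixels,
                                const draw_fn *elems, size_t n_elems)
{
    size_t loads = 0;
    for (size_t e = 0; e < n_elems; ++e) {
        for (size_t i = 0; i < n_pixels; ++i) {
            frame[i] = elems[e](frame[i], i);
            ++loads;
        }
    }
    return loads;
}

/* One-kernel style: each pixel is loaded once, every element is applied
 * in a loop, and the result is stored once. */
static size_t render_single_pass(float *frame, size_t n_pixels,
                                 const draw_fn *elems, size_t n_elems)
{
    assert(frame && elems);
    size_t loads = 0;
    for (size_t i = 0; i < n_pixels; ++i) {
        float p = frame[i];
        ++loads;
        for (size_t e = 0; e < n_elems; ++e)
            p = elems[e](p, i);
        frame[i] = p;
    }
    return loads;
}
```

Both functions produce the same frame, but the single-pass version loads each pixel once instead of once per element. Fewer loads do not guarantee a faster kernel on real hardware, since the merged kernel is larger and more complex.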
Surprisingly, not all of the settings of the lens flare effect showed performance improvement with this approach. In fact, two of the settings showed minor performance degradation. Table 3 includes the performance metrics observed with the six kernels and the one-kernel implementations. The one-kernel solution sped up three of the five settings while decreasing performance of the other two settings.
Table 3 shows that as the number of elements to draw increases, the one-kernel implementation achieves better performance. It also shows that if the number of elements to draw is fewer than 10, the six-kernel implementation yields better performance. It is still possible that the one-kernel-per-element solution might perform better even in lens flare effects with fewer than 10 elements. The pseudo code below includes both host and OpenCL code for the six-kernel and the one-kernel implementations of the lens flare effect.
Figure 8 shows the host code and Figure 9 shows the OpenCL kernel code. Some code is omitted to simplify and to help illustrate key code changes. In summary, video and image processing can be accelerated with OpenCL. Further optimization can be achieved on HDG with additional work.
Should the performance for a given kernel not improve as expected, consider the optimization techniques outlined in this paper. When I run my code on both HD and Haswell, I observe a significant performance boost on Haswell: although the number of compute units only increases from 16 to 20, performance improves by a factor of 2.
General Optimizations
- Use 4x2 chroma blocks when converting color formats.
- Use fmin(a, b) instead of min(a, b) for float data types.
- Use native built-in functions cautiously. Most native functions yield better performance, but not all. The multiplication generated code is more than 2 ms faster on HDG than either built-in function.
- Use bitwise operations for Boolean comparisons whenever possible. This optimization improves performance especially if the kernel is big, usually 4K instructions or bigger.
- Use multiplication and the truncate function trunc(…) to get the fractional part of a value instead of fmod(…).
- Eliminate arithmetic operations on invariant variables in kernel code. Move computations to host code whenever possible.
In the partial code with SLM for the color curves video effect, get_local_size(0) and get_local_size(1) each return the local work size specified at kernel execution.
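Two of the general optimizations above can be illustrated in plain C. This is a sketch under stated assumptions, not code from the application. The helper names are hypothetical, and an integer cast stands in for OpenCL's `trunc()`, which limits the inputs to values whose quotient fits in an `int`.

```c
#include <assert.h>

/* Fractional part of x using truncation: x - trunc(x).
 * The (int) cast stands in for trunc() and assumes x fits in an int. */
static float fract_trunc(float x)
{
    return x - (float)(int)x;
}

/* Remainder of x / y with the sign of x, built from one division, one
 * truncation, and one multiplication: x - y * trunc(x / y). */
static float rem_trunc(float x, float y)
{
    assert(y != 0.0f);
    return x - y * (float)(int)(x / y);
}

/* Range test using bitwise & on the 0/1 results of the comparisons,
 * which avoids the short-circuit branch that && may introduce. */
static int in_range(int v, int lo, int hi)
{
    return (v >= lo) & (v <= hi);
}
```

For large quotients the multiply-and-truncate form can differ from a true `fmod` in the last bits, so compare both results on representative data before swapping one for the other.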