- 25 February 2021
- Benjamin Anuworakarn
Development boards are cool, and the BeagleBone® Black (BBB) is one of the more interesting ones around. This widely available, tiny board costs around £35 and boots Linux in only 10 seconds, so anyone interested in development can get stuck in quickly. The Introduction to Mobile Graphics course was recently revamped for 2020 as part of Imagination’s University Programme, and the low-cost BBB is an ideal platform for student teaching and exercises based on OpenGL® ES 2.0, in place of an expensive standard PC.
To this end, Imagination has been working with the BeagleBoard.org® Foundation to enable users to get even more benefit from the BBB. The main processor on the BBB combines an Arm® Cortex®-A8 core with a legacy PowerVR SGX™ 530 graphics processing unit (GPU), and the great news for education purposes is that applications can be enhanced by taking advantage of the standard OpenGL® ES 2.0 and OpenCL™ 1.1 APIs.
The newly published OpenCL library and documentation allow the BBB to be used to explore OpenCL. While it is not a high-performance solution, it lets students learn OpenCL programming on a relatively low-cost platform. Some applications combine relaxed latency requirements with a large amount of signal processing, and these benefit from offloading work from the A8 core. An ALSA sample rate converter is included in the documentation package as an example; it allows the A8 load to be reduced from 55% to 20% if latency is relaxed. The hope is that, by making this library available, the BeagleBone community will find other useful places to take advantage of OpenCL.
OpenCL on PowerVR SGX530 – really?
The PowerVR SGX530 is one of Imagination’s most successful GPU designs and is still in active use today. However, the SGX530 is a ten-year-old design that, crucially, was created before OpenCL™ became a standard. Nevertheless, Imagination did work to offer support for OpenCL 1.1 on the PowerVR SGX530 and, in recognition of the wide success of the BeagleBone Black since then, we have revived this work and are presenting it here.
The SGX530 OpenCL binaries in this package are still a work in progress, and we must note that they are not yet fully OpenCL 1.1 conformant. The OpenCL driver comes “as is” and has limitations, which are described in the examples.
Using OpenGL ES 2.0
There is a specific BBB image, AM3358 Debian 9.12 2020-04-06 4GB SD ImgTec, that comes with the TI SGX graphics drivers and the PowerVR SDK pre-installed. This image is the basis for exploring both OpenGL ES and OpenCL.
The PowerVR SDK provides examples that start with drawing a triangle as the most basic graphical element (Figure 1) up to the use of vertex and fragment shaders for more sophisticated images (Figure 2). This can be used for self-study. Alternatively, the BBB can be used as the development platform for the Introduction to Mobile Graphics course.
Understanding OpenCL 1.1
The OpenCL package can be downloaded from the Imagination Technologies University Programme website. This package contains a build script that installs the OpenCL libraries and patches the PowerVR SDK so that only the OpenCL 1.1 matrix examples are built for the BBB.
The OpenCL Matrix Multiplication example is used for educational purposes, to show how OpenCL operates on an embedded platform like the BBB. The key point to understand is the overhead implicit in passing data buffers between the Arm A8 and PowerVR SGX530 cores and in managing the cache coherency of the data between them. No useful processing work can be done during this overhead, so it can be considered “wasted” time. OpenCL on the BBB becomes useful in applications where the processing time shown in Figure 3 is significant compared to the overhead, because the A8 core is free to carry out other tasks while the SGX is processing.
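The trade-off can be made concrete with a little arithmetic. The sketch below is a deliberately simplified model, not a measurement: the function name and the millisecond figures in the usage note are hypothetical, and it assumes the A8 is fully occupied during the overhead phase and fully free while the SGX is processing.

```c
/* Simplified model of one offloaded OpenCL call (assumption: the A8 is
 * busy for the whole overhead phase and idle while the SGX processes).
 * Returns the fraction of the call during which the A8 is free. */
double cpu_free_fraction(double overhead_ms, double processing_ms)
{
    double total_ms = overhead_ms + processing_ms;
    if (total_ms <= 0.0)
        return 0.0;               /* nothing to offload */
    return processing_ms / total_ms;
}
```

With a hypothetical 2 ms overhead, a 2 ms kernel frees the A8 for only half of the call, whereas an 18 ms kernel frees it for 90% of the call.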
Several OpenCL kernels illustrate how Single Instruction Multiple Data (SIMD) operations on float4 types (vectors of four 32-bit floating-point values) can increase processing performance on the SGX.
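To give a feel for the float4 idea without reproducing the SDK's kernels, here is a minimal plain-C emulation of what one work-item in a vectorised matrix multiply might do. The names `float4_t`, `dot4`, `load4` and `matmul_float4` are hypothetical stand-ins (OpenCL C provides a built-in `float4` type and a `dot()` function), and passing the second matrix pre-transposed is an assumption made here to keep the loads contiguous.

```c
#include <stddef.h>

/* Stand-in for OpenCL C's built-in float4 (hypothetical name). */
typedef struct { float x, y, z, w; } float4_t;

/* Stand-in for OpenCL C's dot(): four multiplies summed in one step. */
static float dot4(float4_t a, float4_t b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Load four consecutive floats as one vector. */
static float4_t load4(const float *p)
{
    float4_t v = { p[0], p[1], p[2], p[3] };
    return v;
}

/* C = A * B for n x n row-major matrices; n must be a multiple of 4.
 * bt holds B transposed so both operands are read contiguously. */
void matmul_float4(const float *a, const float *bt, float *c, size_t n)
{
    for (size_t row = 0; row < n; row++) {
        for (size_t col = 0; col < n; col++) {
            /* On a GPU, each (row, col) pair would be one work-item. */
            float acc = 0.0f;
            for (size_t k = 0; k < n; k += 4)
                acc += dot4(load4(&a[row * n + k]), load4(&bt[col * n + k]));
            c[row * n + col] = acc;
        }
    }
}
```

The inner loop runs n/4 times instead of n, which is the source of the speed-up when the hardware can process all four lanes together.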
It must be noted that if the goal is the fastest processing time on the BBB, then the optimal solution is to use Arm® NEON™ intrinsics on the A8 itself, as there is no buffer-passing overhead involved.
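For comparison, a NEON version of a simple dot product might look like the sketch below. This is an illustrative example and not code from the package: the function name `dot_f32` is hypothetical, and the scalar fallback is included only so that the same file also builds on machines without NEON. On the BBB, GCC would typically need a flag such as `-mfpu=neon` for the NEON path to be compiled in.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Dot product of two float arrays (hypothetical helper). */
float dot_f32(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Accumulate four products per iteration in one 128-bit register. */
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4)
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    /* Reduce the four lanes to a single scalar. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    half = vpadd_f32(half, half);
    sum = vget_lane_f32(half, 0);
#endif
    /* Scalar tail (and the whole loop when NEON is unavailable). */
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The data never leaves the A8's own memory view, so there is no buffer hand-off or cache maintenance step as there is with an OpenCL call.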
Using OpenCL 1.1
The Matrix Multiplication example is useful for understanding OpenCL. The next step is to use OpenCL in a real example that has the right combination of heavy processing and relaxed latency. The example chosen is the ALSA sample rate converter (SRC) used in an audio player application. The heavy processing requirement comes from the 44.1 kHz to 48 kHz upsampling, and because the application is playback-only, the latency requirements can be relaxed. The OpenCL implementation fits into the ALSA software as shown in Figure 4, with the libspeexdsp layer modified to call OpenCL.
This example is built by running a script that downloads and patches the ALSA packages. The OpenCL libraries are supplied as C++ libraries, so ALSA needs to be compiled with g++ to ensure the OpenCL constructors and destructors are called correctly. The g++ compiler is stricter than gcc compiling the same code as C, so most of the patches correct type casts on pointers and ensure that initialised structures are fully initialised and in the correct order.
To take advantage of OpenCL, the algorithm being used must be fully parallelisable, with the same operation carried out for every output sample. The SRC algorithm in libspeexdsp does not actually do this, as an inner loop calculates an index value that is used in the next iteration of the outer loop. The ALSA SRC algorithm therefore needed to be refactored so that the index calculation is taken out of the processing loop and performed first, leaving a processing loop that is the same for every iteration.
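The shape of that refactoring can be sketched in plain C. The code below is not the libspeexdsp algorithm: it substitutes simple linear interpolation for the real filter, and every name in it is hypothetical. It only illustrates how a loop-carried read position can be hoisted into a separate first pass so that the second pass has fully independent iterations.

```c
#include <stddef.h>

/* 44.1 kHz -> 48 kHz: the input position advances by 147/160 of a
 * sample for each output sample (both rates divided by their gcd, 300). */
#define NUM_RATE 147u
#define DEN_RATE 160u

/* Original shape: the read position is updated inside the processing
 * loop, so each iteration depends on the previous one. */
size_t resample_serial(const float *in, size_t in_len,
                       float *out, size_t out_cap)
{
    size_t pos = 0;     /* integer input index */
    unsigned frac = 0;  /* fractional position in units of 1/DEN_RATE */
    size_t n = 0;
    while (n < out_cap && pos + 1 < in_len) {
        float t = (float)frac / (float)DEN_RATE;
        out[n++] = in[pos] + t * (in[pos + 1] - in[pos]);
        frac += NUM_RATE;
        if (frac >= DEN_RATE) { frac -= DEN_RATE; pos++; }
    }
    return n;
}

/* Pass 1: compute every read position up front (cheap and serial). */
size_t build_index(size_t in_len, size_t out_cap,
                   size_t *pos_tab, unsigned *frac_tab)
{
    size_t pos = 0;
    unsigned frac = 0;
    size_t n = 0;
    while (n < out_cap && pos + 1 < in_len) {
        pos_tab[n] = pos;
        frac_tab[n] = frac;
        n++;
        frac += NUM_RATE;
        if (frac >= DEN_RATE) { frac -= DEN_RATE; pos++; }
    }
    return n;
}

/* Pass 2: iterations are independent, so each index i could be handed
 * to a separate OpenCL work-item. */
void resample_indexed(const float *in, const size_t *pos_tab,
                      const unsigned *frac_tab, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t p = pos_tab[i];
        float t = (float)frac_tab[i] / (float)DEN_RATE;
        out[i] = in[p] + t * (in[p + 1] - in[p]);
    }
}
```

Calling `build_index` followed by `resample_indexed` produces the same samples as `resample_serial`; the difference is that only the inexpensive index pass remains inherently sequential.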
As aplay is just a playback application, it is possible to relax the latency requirements. This turned out to be necessary, as the overhead in processing the standard input audio buffer size of 160 samples prevented the OpenCL call from completing in real time. Reliable operation was achieved with buffer sizes of 640 or more samples. With 1600-sample buffers the A8 load fell to 20%, compared with 55% when the SRC ran on the A8. This CPU offload is the major benefit of using OpenCL on the BBB, and we hope that the community will be able to find other use cases that improve the performance of BBB-based systems.
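The latency cost of those buffer sizes is easy to estimate. The helper below is a hypothetical convenience function, and the figures in the usage note assume the buffer sizes are counted at the 44.1 kHz input rate, which the text above does not state explicitly.

```c
/* Duration of one audio buffer in milliseconds. */
double buffer_duration_ms(unsigned samples, unsigned sample_rate_hz)
{
    return 1000.0 * (double)samples / (double)sample_rate_hz;
}
```

Under that assumption, 160 samples last about 3.6 ms, 640 samples about 14.5 ms and 1600 samples about 36.3 ms. Each OpenCL call, including its fixed overhead, has to finish within one buffer duration, which is why larger buffers make real-time operation achievable at the price of added playback latency.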
The documentation within the OpenCL download package goes into much greater detail to help the reader understand how OpenCL operates and put it to practical use.