Undergraduate Category: Engineering & Technology Degree Level: Bachelor of Science Abstract ID#: 670
Improving FIR Filtering and AES Encryption with OpenCL 2.0
Carter McCardwell, Tuan Dao, Saoni Mukherjee, David Kaeli
Abstract
The growth in demand for heterogeneous accelerators has stimulated the development of cutting-edge features in newer accelerators. Heterogeneous programming frameworks such as OpenCL have matured over the years and introduced new features for developers. We explore one of these programming frameworks, OpenCL 2.0. To drive our study, we consider a number of new features in OpenCL 2.0 using two popular applications from two different computing domains: cyber security and signal processing. These applications are 1) the AES-128 encryption standard and 2) Finite Impulse Response (FIR) filtering. In this work, we introduce the latest runtime features enabled in OpenCL 2.0 and discuss how well our applications can benefit from some of these features.
Introduction
• Evaluation of the new features available in OpenCL 2.0, primarily Shared Virtual Memory (SVM) and dynamic parallelism
• SVM removes the need for explicit data copies between host and device, which helps speed up program execution
• Evaluation of two applications that exploit OpenCL 2.0 features: 1) FIR filtering and 2) AES encryption
• Both are written and optimized for both OpenCL 1.2 and 2.0
• Discussion of improvements in terms of code readability and simplicity
GPGPU and OpenCL
• Early GPUs were designed to handle 3-D graphics and later evolved into devices that speed up scientific computing
• GPUs have many cores that can run thousands of threads simultaneously
• They provide great computational power at low cost
• OpenCL is the leading general-purpose programming framework for heterogeneous systems, used to program CPUs, GPUs, and FPGAs
• Based on C99, designed by Apple, maintained by the Khronos Group
• Most widely used version: 1.2; latest revision: OpenCL 2.0
• Code is more portable than in other HPC programming languages such as CUDA
Advanced Encryption Standard
• Four different operations are used to encrypt the data
• Data is read in linearly from a file as 16-byte 4x4 arrays called "states"
• States are processed individually through the algorithm, which exposes potential parallelism
• A series of 4 different transformations is applied to the states:
  o SubBytes replaces each byte in the state with a value from a pre-generated array of data.
  o ShiftRows shifts the bytes in each row by a certain offset.
  o AddRoundKey XORs each column with a 4-byte word generated from the private key.
  o MixColumns performs a polynomial operation on each column.
• To decrypt, perform the inverse of the operations.
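The four transformations can be illustrated with a minimal Python sketch. It is not the poster's OpenCL kernel: the state layout (a list of four rows, `state[r][c]`), the function names, and passing the substitution table in as a parameter are all assumptions made for illustration.

```python
def sub_bytes(state, sbox):
    """Replace every byte with its entry in a 256-entry lookup table.
    The table is passed in; the real AES S-box is not reproduced here."""
    return [[sbox[b] for b in row] for row in state]


def shift_rows(state):
    """Rotate row r of the state left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def add_round_key(state, round_key):
    """XOR the state with a round key of the same 4x4 shape."""
    return [[b ^ k for b, k in zip(s_row, k_row)]
            for s_row, k_row in zip(state, round_key)]


def xtime(b):
    """Multiply a byte by 2 in GF(2^8), reducing by the AES polynomial 0x11B."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def mix_single_column(col):
    """Multiply one 4-byte column by the fixed AES matrix (2 3 1 1 / 1 2 3 1 / ...)."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def mix_columns(state):
    """Apply mix_single_column to each of the four columns of the state."""
    cols = [mix_single_column([state[r][c] for r in range(4)]) for c in range(4)]
    return [[cols[c][r] for c in range(4)] for r in range(4)]
```

Because each function touches only the one state it is given, separate states can be processed independently of one another.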
Implementing AES with OpenCL 2.0:
✓ The AES private key is expanded on the CPU
✓ A shared space for the input and output data is allocated in SVM using clSVMAlloc
✓ The host parcels the data into states and copies them into SVM
✓ The kernel is started, and each work-item reads its assigned state into its registers based on its local ID
✓ Each work-item processes the state through the AES encryption algorithm
✓ The work-item copies the processed data back to SVM
✓ The host writes the encrypted data to a file
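The data flow in the steps above can be mimicked on the host alone. The sketch below is a plain-Python stand-in, not OpenCL code: `split_into_states`, `run_work_items`, and the zero-padding of a short final state are hypothetical choices for illustration, and the sequential loop stands in for work-items that would run in parallel on a device.

```python
STATE_BYTES = 16  # one AES state is a 4x4 array of bytes


def split_into_states(data, pad_byte=0):
    """Parcel a byte string into 16-byte states.
    Zero-padding the last state is an assumption of this sketch."""
    remainder = len(data) % STATE_BYTES
    if remainder:
        data = data + bytes([pad_byte]) * (STATE_BYTES - remainder)
    return [data[i:i + STATE_BYTES] for i in range(0, len(data), STATE_BYTES)]


def run_work_items(shared_in, encrypt_state):
    """Emulate one work-item per state: read the state from the shared
    buffer, process a private copy, and write the result back."""
    shared_out = [None] * len(shared_in)
    for work_id in range(len(shared_in)):  # parallel on a real device
        private_state = bytes(shared_in[work_id])  # private copy of the state
        shared_out[work_id] = encrypt_state(private_state)
    return shared_out
```

For example, `run_work_items(split_into_states(data), encrypt_state)` applies a caller-supplied per-state function to every state; the function itself is left as a parameter.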
Results
Dimension   OpenCL 1.2 Time (milliseconds)   OpenCL 2.0 Time (milliseconds)
5000        1973                             1740
10000       3493                             2896
15000       4371                             3981
20000       6393                             5123
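The relative change between the two OpenCL versions can be worked out from the timings reported above. This is a small arithmetic sketch, assuming the first value in each row is the OpenCL 1.2 time and the second is the OpenCL 2.0 time; the names `timings_ms` and `reduction_percent` are illustrative.

```python
# Dimension -> (OpenCL 1.2 time, OpenCL 2.0 time), both in milliseconds,
# copied from the results table.
timings_ms = {
    5000: (1973, 1740),
    10000: (3493, 2896),
    15000: (4371, 3981),
    20000: (6393, 5123),
}


def reduction_percent(t_old, t_new):
    """Percentage by which t_new is lower than t_old."""
    return 100.0 * (t_old - t_new) / t_old


reductions = {dim: reduction_percent(t12, t20)
              for dim, (t12, t20) in timings_ms.items()}
```

Under that reading of the columns, the OpenCL 2.0 times are lower at every dimension, by roughly 9% to 20%.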
Conclusion and Future Work
• Using SVM reduces coding effort and increases code readability
• Decreases execution time for both applications, since no explicit copies of data are needed
• In terms of future work, the next target is to implement fine-grain SVM on Kaveri APUs
• Develop and optimize more applications supporting OpenCL 2.0 (ongoing work with the HSA Foundation)
[Figure: Time (seconds) versus input dimension size (5000x)]
New Features in OpenCL 2.0
• Dynamic Parallelism: Allows kernels to launch other kernels without interaction with the host.
• Shared Virtual Memory: Allows the host and device to share a memory space and pointers. Reduces data copying between the two and simplifies memory management.
• Android Installable Client Driver Extension: Allows OpenCL implementations to be discovered and loaded as shared objects on Android systems.
• Image Support: Improved image support, including sRGB and 3D image writes and the ability for multiple kernels to read and write to the same image.
Finite Impulse Response
• An FIR filter calculates the weighted sum of the most recent input values
• Compared to IIR, it gives more finely tuned responses, although it consumes more computation time
• Each compute unit calculates the output for its coefficient
• Used in digital audio applications for filtering audio signals
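The weighted sum can be written out directly as y[n] = sum over k of h[k] * x[n - k]. The sketch below is a plain-Python reference version, not the OpenCL kernel; the name `fir_filter` and the choice to treat samples before the start of the input as zero are assumptions.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR filter: each output is the weighted sum of the
    most recent len(coeffs) inputs. Inputs before index 0 count as zero."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(coeffs):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out
```

For instance, `fir_filter([1, 2, 3, 4], [0.5, 0.5])` averages each sample with its predecessor. Every iteration of the outer loop depends only on the inputs and coefficients, so the outputs can be computed independently of one another.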
Implementing FIR with OpenCL 2.0:
✓ The input and coefficients are allocated using clSVMAlloc
✓ The memory space is mapped (clEnqueueSVMMap) for the host to write the input data
✓ The memory space is unmapped (clEnqueueSVMUnmap) and passed to the device using clSetKernelArgSVMPointer
✓ The device runs the kernel and calculates the results
✓ The memory space is mapped for the host to read out the results
✓ The memory space is freed using clSVMFree
New Features in OpenCL 2.0
GPGPU and OpenCL
[Figure: x-axis is Input size (MB), with ticks at 1M, 10M, 100M, 1000M]
Size     OpenCL 1.2 Time (seconds)   CPU Time (seconds)   OpenCL 2.0 Time (seconds)
1M       0.806                       1.271                1.24
10M      12.27                       19.947               12.539
100M     116.783                     211.023              117.054
1000M    1160.867                    2117.082             1146.273
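Speedup over the CPU can be derived from these timings as CPU time divided by OpenCL time. The sketch assumes the values in each row follow the header order (OpenCL 1.2, CPU, OpenCL 2.0); the names `timings_s` and `speedup` are illustrative.

```python
# Size -> (OpenCL 1.2, CPU, OpenCL 2.0) times in seconds, from the table.
timings_s = {
    "1M": (0.806, 1.271, 1.24),
    "10M": (12.27, 19.947, 12.539),
    "100M": (116.783, 211.023, 117.054),
    "1000M": (1160.867, 2117.082, 1146.273),
}


def speedup(baseline, measured):
    """How many times faster `measured` is than `baseline`."""
    return baseline / measured


# Size -> (speedup of OpenCL 1.2 over CPU, speedup of OpenCL 2.0 over CPU)
speedups = {size: (speedup(cpu, t12), speedup(cpu, t20))
            for size, (t12, cpu, t20) in timings_s.items()}
```

Under that column reading, both OpenCL versions are faster than the CPU at every size, and OpenCL 2.0 is ahead of OpenCL 1.2 only at the 1000M input.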