Undergraduate
Category: Engineering & Technology
Degree Level: Bachelor of Science
Abstract ID#: 670

 

Improving FIR Filtering and AES Encryption with OpenCL 2.0
Carter McCardwell, Tuan Dao, Saoni Mukherjee, David Kaeli

Abstract

The growth in demand for heterogeneous accelerators has stimulated the development of cutting-edge features in newer accelerators. Heterogeneous programming frameworks such as OpenCL have matured over the years and introduced new features for developers. We explore one of these programming frameworks, OpenCL 2.0. To drive our study, we consider a number of new features in OpenCL 2.0 using two popular applications from two different computing domains: cyber security and signal processing. These applications are: 1) the AES-128 encryption standard, and 2) Finite Impulse Response (FIR) filtering. In this work, we introduce the latest runtime features enabled in OpenCL 2.0 and discuss how well our applications can benefit from some of these features.

Introduction

• Evaluation of the new features available in OpenCL 2.0, primarily Shared Virtual Memory (SVM) and dynamic parallelism
• SVM removes the need for explicit data copies between host and device, which helps speed up program execution
• Evaluation of two applications that exploit OpenCL 2.0 features: 1) FIR filtering and 2) AES encryption
• Both are written and optimized for OpenCL 1.2 and for OpenCL 2.0
• Discussion of improvements in terms of code readability and simplicity

GPGPU and OpenCL

• Early GPUs were designed to handle 3-D graphics; later devices evolved to speed up scientific computing
• GPUs have many cores that can run thousands of threads simultaneously
• They provide great computational power at low cost
• OpenCL is the leading general-purpose programming framework for heterogeneous systems, used to program CPUs, GPUs, and FPGAs
• Based on C99, designed by Apple, maintained by the Khronos Group
• Most widely used version: 1.2; latest revision: OpenCL 2.0
• Code is more portable than with other HPC programming languages such as CUDA

Advanced Encryption Standard

• Four different operations are used to encrypt the data
• Data is read in linearly from a file as 16-byte 4x4 arrays called "states"
• States are processed individually through the algorithm, which exposes potential parallelism
• A series of four different transformations is applied to each state:
  o SubBytes replaces each byte in the state with a value from a pre-generated array of data.
  o ShiftRows shifts the bytes in each row by a certain offset.
  o AddRoundKey XORs each 4-byte word of the state with a word generated from the private key.
  o MixColumns performs a polynomial operation on each column.
• To decrypt, perform the inverse of the operations.

Implementing with OpenCL 2.0:

✓ The AES private key is expanded on the CPU
✓ A shared space for the input and output data is allocated in SVM using clSVMAlloc
✓ The host parcels the data into states and copies them into SVM
✓ The kernel is started, and each work-item reads its assigned state into registers based on its local ID
✓ Each work-item processes its state through the AES encryption algorithm
✓ The work-item copies the processed data back to SVM
✓ The host writes the encrypted data to a file
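The four transformations and the per-state parallelism described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the OpenCL kernel used in this work: all function names are hypothetical, the state is assumed to be a flat list of 16 bytes stored column by column, the input length is assumed to be a multiple of 16, and the round key is assumed to come from a key expansion that is not shown here.

```python
# Illustrative sketch only; helper names are hypothetical, not the poster's code.

def xtime(b):
    """Multiply a byte by 2 in GF(2^8), reducing by the AES polynomial 0x11B."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def gmul(a, b):
    """Multiply two bytes in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result

def build_sbox():
    """Pre-generate the SubBytes table: field inverse, then the affine map."""
    sbox = [0] * 256
    for value in range(256):
        inv = 0 if value == 0 else next(
            c for c in range(1, 256) if gmul(value, c) == 1)
        out = inv
        for shift in range(1, 5):
            out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox[value] = out ^ 0x63
    return sbox

SBOX = build_sbox()

def split_into_states(data):
    """Cut the input into 16-byte states (assumes len(data) % 16 == 0)."""
    return [list(data[i:i + 16]) for i in range(0, len(data), 16)]

def sub_bytes(state):
    return [SBOX[b] for b in state]

def shift_rows(state):
    # Row r of column c lives at state[r + 4*c]; row r rotates left by r.
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]

def mix_columns(state):
    out = []
    for c in range(4):
        a = state[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gmul(a[r], 2) ^ gmul(a[(r + 1) % 4], 3)
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return out

def add_round_key(state, round_key):
    return [s ^ k for s, k in zip(state, round_key)]

def encrypt_round(state, round_key):
    """One middle round: SubBytes, ShiftRows, MixColumns, AddRoundKey."""
    return add_round_key(mix_columns(shift_rows(sub_bytes(state))), round_key)

def encrypt_round_all(data, round_key):
    # Every state is independent, which is what lets each work-item
    # process its own state on the device.
    return [encrypt_round(s, round_key) for s in split_into_states(data)]
```

Because `encrypt_round` touches only its own state and the shared round key, the list comprehension in `encrypt_round_all` could be replaced by any parallel map. A full AES-128 encryption would repeat the round with a different round key each time and omit MixColumns in the final round.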

Results

FIR filtering: execution time by input dimension

Dimension   OpenCL 1.2 Time (milliseconds)   OpenCL 2.0 Time (milliseconds)
5000        1973                             1740
10000       3493                             2896
15000       4371                             3981
20000       6393                             5123
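To make the gap between the two versions easier to read, the relative reduction in run time can be worked out directly from the timings listed above. The snippet below is a small arithmetic sketch; the dictionary layout and the function name are illustrative choices, and the numbers are copied from the reported measurements.

```python
# Timings in milliseconds, copied from the results above:
# dimension -> (OpenCL 1.2 time, OpenCL 2.0 time)
timings_ms = {
    5000: (1973, 1740),
    10000: (3493, 2896),
    15000: (4371, 3981),
    20000: (6393, 5123),
}

def relative_reduction(before, after):
    """Fraction of the original run time that was saved."""
    return (before - after) / before

for dimension, (cl12, cl20) in timings_ms.items():
    saved = relative_reduction(cl12, cl20)
    print(f"dimension {dimension}: {saved:.1%} less time with OpenCL 2.0")
```

For these four measurements the reduction ranges from roughly 9% to roughly 20%.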

Conclusion and Future Work

• Using SVM reduces coding effort and increases code readability
• Decreases execution time for both applications, since no explicit copies of data are needed
• In terms of future work, the next target is to implement fine-grained SVM on Kaveri APUs
• Develop and optimize more applications supporting OpenCL 2.0 (ongoing work with the HSA Foundation)

[Figure: time (seconds) versus input dimensions size (5000x)]

New Features in OpenCL 2.0

• Dynamic Parallelism: Allows kernels to start other kernels without interaction with the host.
• Shared Virtual Memory: Allows a shared memory space and shared pointers between host and device. Reduces data copying between the two and simplifies memory management.
• Android Installable Client Driver Extension: Allows OpenCL implementations to be discovered and loaded as shared objects on Android systems.
• Image Support: Improved image support, including sRGB and 3D image writes, and the ability for multiple kernels to read from and write to the same image.

Finite Impulse Response

• Used to calculate the weighted sum of the most recent input values
• Compared to IIR, gives more finely tuned responses, although it consumes more computation time
• Each compute unit calculates the output for its coefficient
• Used in digital audio applications for filtering audio signals

Implementing with OpenCL 2.0:

✓ The input and coefficients are allocated using clSVMAlloc
✓ The memory space is mapped (clEnqueueSVMMap) for the host to write the input data
✓ The memory space is unmapped (clEnqueueSVMUnmap) and passed to the device using clSetKernelArgSVMPointer
✓ The device runs the kernel and calculates the results
✓ The memory space is mapped for the host to read out the results
✓ The memory space is freed using clSVMFree
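The FIR computation described above, a weighted sum of the most recent input values, can be sketched as follows. This is a plain Python reference of the arithmetic only, not the OpenCL kernel from this work: the function names are hypothetical, and input values before the start of the signal are assumed to be zero.

```python
def fir_output(samples, coeffs, n):
    """Output sample n: weighted sum of the most recent len(coeffs) inputs."""
    total = 0.0
    for k, weight in enumerate(coeffs):
        if n - k >= 0:  # assumption: inputs before the start count as zero
            total += weight * samples[n - k]
    return total

def fir_filter(samples, coeffs):
    # Each output reads only the inputs and coefficients, so the calls are
    # independent of one another and could run as parallel work on a device.
    return [fir_output(samples, coeffs, n) for n in range(len(samples))]

if __name__ == "__main__":
    # Two equal coefficients give a two-point moving average.
    print(fir_filter([1, 2, 3, 4], [0.5, 0.5]))
```

Feeding a single 1 followed by zeros returns the coefficients themselves, which is a quick way to check any FIR implementation against its coefficient list.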

AES encryption: execution time by input size

[Figure: time (seconds) versus input size (MB), at 1M, 10M, 100M, and 1000M]

Size     OpenCL 1.2 Time (seconds)   CPU Time (seconds)   OpenCL 2.0 Time (seconds)
1M       0.806                       1.271                1.24
10M      12.27                       19.947               12.539
100M     116.783                     211.023              117.054
1000M    1160.867                    2117.082             1146.273
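One way to read the timings above is as a speedup over the CPU run at each input size. The snippet below is a small arithmetic sketch using the reported values in seconds; the data layout and the function name are illustrative choices.

```python
# Timings in seconds, copied from the results above:
# size -> (OpenCL 1.2 time, CPU time, OpenCL 2.0 time)
timings_s = {
    "1M": (0.806, 1.271, 1.24),
    "10M": (12.27, 19.947, 12.539),
    "100M": (116.783, 211.023, 117.054),
    "1000M": (1160.867, 2117.082, 1146.273),
}

def speedup(baseline, accelerated):
    """How many times faster the accelerated run is than the baseline."""
    return baseline / accelerated

for size, (cl12, cpu, cl20) in timings_s.items():
    print(f"{size}: OpenCL 1.2 is {speedup(cpu, cl12):.2f}x the CPU, "
          f"OpenCL 2.0 is {speedup(cpu, cl20):.2f}x the CPU")
```

On these figures both OpenCL versions approach a speedup of about 1.8 over the CPU at the larger sizes. From 10M upward the two versions stay within a few percent of each other, and OpenCL 2.0 is the faster of the two only at 1000M.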
