An Asynchronous Array of Simple Processors for DSP Applications


ISSCC 2006 / SESSION 23 / TECHNOLOGY AND ARCHITECTURE DIRECTIONS / 23.6


Zhiyi Yu, Michael Meeuwsen, Ryan Apperson, Omar Sattari, Michael Lai, Jeremy Webb, Eric Work, Tinoosh Mohsenin, Mandeep Singh, Bevan Baas

University of California, Davis, CA

Modern applications increasingly require the computation of DSP workloads composed of a variety of numerically intensive DSP tasks. These workloads are found in communication, multimedia, embedded, and wireless applications, and often require very high levels of computation and high energy efficiency. The Asynchronous Array of Simple Processors (AsAP) uses processor cores with small instruction and data memories to dramatically reduce area and power while increasing performance.

Fig. 23.6.1 shows the architecture of an individual AsAP processor and the 6×6 array contained on the chip. Data enters the array through the upper-left processor and exits from one of the right-column processors. Each processor has a simple RISC-style design with a nine-stage pipeline, detailed in Fig. 23.6.2, and contains two input ports, one output port, and a local clock oscillator. The 16b fixed-point datapath includes a 40b accumulator and executes 54 32b instructions, among which only the bit-reverse instruction is algorithm-specific.

Memories in modern programmable processors typically consume large amounts of area and power, and frequently contain the critical path. Therefore, energy efficiency and speed can be dramatically increased with small memories. Very small memories are practical in DSP processors for two reasons. First, although it is possible to write code that requires large amounts of data and instruction memory, Fig. 23.6.3 shows that many common DSP tasks do not require large memory sizes. Second, a system with many processors can use individual processors or groups of processors to compute individual tasks, and intermediate data can be transmitted between processors with a small memory requirement. Fig. 23.6.3 illustrates this fine-grain streaming processor approach.
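The fine-grain streaming approach described above can be illustrated with a minimal Python sketch. It is not AsAP code: the stage names (`fir_stage`, `decimate_stage`, `cascade`), the coefficients, and the input data are hypothetical, and Python generators merely stand in for processors that pass words to a neighbor. The only figures taken from the paper are the 128-word data memory and the 2N-word requirement of an N-pt FIR.

```python
from collections import deque

DMEM_WORDS = 128  # per-processor data memory size, from the paper


def fir_stage(samples, coeffs):
    """One 'processor': an N-tap FIR holding N coefficients and N samples (2N words)."""
    assert 2 * len(coeffs) <= DMEM_WORDS, "task would not fit in one data memory"
    window = deque([0] * len(coeffs), maxlen=len(coeffs))
    for x in samples:
        window.appendleft(x)  # newest sample first; the oldest one drops off
        yield sum(c * s for c, s in zip(coeffs, window))


def decimate_stage(samples, factor):
    """A second 'processor': forwards every factor-th word, needing only a counter."""
    for i, x in enumerate(samples):
        if i % factor == 0:
            yield x


def cascade(source):
    """Chain the stages; words flow through one at a time, nothing buffers the stream."""
    filtered = fir_stage(source, coeffs=[1, 2, 1])  # hypothetical coefficients
    return decimate_stage(filtered, factor=2)


result = list(cascade(iter([1, 0, 0, 0, 4, 0])))
print(result)
```

Each stage keeps only its own small working set, which is the property that lets a task fit in a small local memory while intermediate data moves directly to the next stage.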
AsAP processors contain a 128×16b data memory and a 64×32b instruction memory, which easily accommodate many DSP tasks.

Each AsAP processor contains an independent clocking system including its own clock oscillator. The maximum clock tree span of less than 1mm in 0.18µm CMOS greatly simplifies the design process. The programmable ring oscillators are built from standard cells and change their frequency through parallel tri-state inverters in each stage, a 5- or 9-stage configuration, and a divider that ranges from 1 to 128, as shown in Fig. 23.6.4. Processors are clocked asynchronously with respect to each other, and the clocking circuits permit oscillators to operate at any frequency less than the maximum, as well as allowing arbitrary bursts and stalls.

The Globally Asynchronous Locally Synchronous (GALS) architecture and oscillator design enable each processor to completely power down (leakage only) if no work is available for nine clock cycles. When work becomes available, the clock is restored in less than one cycle. Clock stalling typically reduces power by more than 50% for complex applications. This clocking strategy also enables frequency-adaptable operation to reduce power and to compensate for process and workload variations.

The measured oscillator frequency covers a wide range from 1.66MHz to 702MHz. Over the useful range from 1.66MHz to 500MHz, the maximum gap between adjacent measured frequency settings is only 0.08MHz. To simplify the physical design, clock gating was not used in this prototype, and almost 2/3 of the power dissipation is due to the clocking system; the future addition of clock gating is therefore expected to greatly reduce power consumption.
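As a rough illustration of how the stage count and divider settings interact, the sketch below uses the textbook first-order ring-oscillator model, f = 1 / (2 · stages · stage delay · divider). This is an assumption rather than the authors' characterization: the function name `ring_osc_freq_hz` and the 150ps stage delay are hypothetical, the model treats the effect of the parallel tri-state inverters simply as a change in stage delay, and accepting any integer divider from 1 to 128 is also an assumption.

```python
def ring_osc_freq_hz(stages: int, stage_delay_s: float, divider: int) -> float:
    """First-order ring oscillator model: one period is two passes around the ring."""
    if stages not in (5, 9):
        raise ValueError("the text describes 5- or 9-stage configurations")
    if not 1 <= divider <= 128:
        raise ValueError("the text describes a divider ranging from 1 to 128")
    return 1.0 / (2 * stages * stage_delay_s) / divider


# Hypothetical stage delay; enabling more parallel tri-state inverters would lower it.
STAGE_DELAY_S = 150e-12

fast = ring_osc_freq_hz(5, STAGE_DELAY_S, 1)
slow = ring_osc_freq_hz(9, STAGE_DELAY_S, 128)
print(f"{fast / 1e6:.1f} MHz down to {slow / 1e6:.2f} MHz")
```

Under this model the three controls act multiplicatively, which is one way to see how a small number of settings can cover a frequency range spanning more than two orders of magnitude.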


Processors communicate only with adjacent processors to permit full-rate communication with low energy. Movement of one data word from one processor’s memory to an adjacent processor’s memory requires only 220pJ at 1.8V or 68pJ at 0.9V. Each processor can receive inputs from up to two adjacent processors, and can send output to any combination of its four neighbors, dynamically configurable through software. Communication across clock boundaries is accomplished by 32-word dual-clock FIFOs. Fig. 23.6.4 shows the configurable synchronization registers and the Gray coding of read and write addresses, which avoid metastability and permit reliable operation with completely arbitrary clocks in both the sending and receiving clock domains.

Fig. 23.6.7 shows the first-generation 36-processor standard-cell-based AsAP chip implemented in TSMC 0.18µm CMOS. Each processor occupies 0.66mm² and processors nearly directly abut each other. The chip is fully functional on first-pass silicon at 475MHz at 1.8V, and 116MHz at 0.9V, near room temperature. Typical average power is 32mW per processor at 1.8V and 475MHz while executing applications. The maximum power is 144mW per processor at 475MHz, resulting in a maximum energy dissipation of 0.3mW/MHz. At 0.9V and 116MHz, the maximum energy dissipation is 0.093mW/MHz. As shown in Fig. 23.6.5, AsAP is at least 20 times smaller and at least 14 times more energy efficient than the other processors compared. A comparison of area utilization shows that AsAP uses a favorable 66% of its area for the core processor.

Several complex applications have been programmed and measured on the AsAP chip. The programs are lightly optimized and unscheduled. A nine-processor JPEG encoder core, shown in Fig. 23.6.6, dissipates 224mW and has a throughput of 215k 8×8 pixel blocks per second at a clock rate of 300MHz. This throughput is more than nine times that of implementations built on traditional architectures [1].
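The Gray coding of FIFO read and write addresses mentioned above can be illustrated with the standard binary-to-Gray and Gray-to-binary conversions. The sketch below is generic, not the chip's circuit: the helper names are hypothetical, and the 5-bit address width is an assumption derived from the 32-word FIFO depth (the actual pointer width is not stated in the paper).

```python
ADDR_BITS = 5  # assumption: a 32-word FIFO addressed with 5 bits


def binary_to_gray(value: int) -> int:
    """Standard reflected Gray code: adjacent integers differ in exactly one bit."""
    return value ^ (value >> 1)


def gray_to_binary(gray: int) -> int:
    """Invert the Gray code by XOR-ing together all right shifts of the code word."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


depth = 1 << ADDR_BITS
# Bits that toggle on each address increment, including the wrap from 31 back to 0.
gray_steps = [
    bits_changed(binary_to_gray(i), binary_to_gray((i + 1) % depth))
    for i in range(depth)
]
binary_steps = [bits_changed(i, (i + 1) % depth) for i in range(depth)]
print(max(gray_steps), max(binary_steps))
```

Because only one bit changes per increment, an address sampled in the other clock domain during a transition resolves to either the old or the new value, never to an unrelated one. This is why Gray coding is commonly paired with synchronization registers in dual-clock FIFOs.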
A 22-processor 802.11a/802.11g transmitter is fully compliant with the IEEE standard over all eight rates [2] and contains additional upsampling, filtering, and synchronization functions not required by the standard. The transmitter dissipates 407mW and achieves a throughput equal to 30% of the 54Mb/s requirement at a clock rate of 300MHz. This result compares well with a reported throughput of 1.7Mb/s for a 24Mb/s rate on a TMS320C6201 [3]. Disabling clock oscillator stalling through a configuration test bit shows that stalling reduces JPEG power by 53% and 802.11a/g power by 65%.

Acknowledgments: The authors acknowledge support from Intel, UC Micro, NSF Grant No. 0430090, MOSIS, and a UCD Faculty Research Grant; and thank R. Krishnamurthy, M. Anders, S. Mathew, S. Muroor, W. Li, C. Chen, and D. Truong.

References:
[1] K. Sakiyama et al., “Finding the Best System Design Flow for a High-Speed JPEG Encoder,” Proc. DAC, pp. 577-578, Jan. 2003.
[2] M. Meeuwsen et al., “A Full-Rate Software Implementation of an IEEE 802.11a Compliant Digital Baseband Transmitter,” Proc. SIPS, pp. 124-129, Oct. 2004.
[3] M. Tariq et al., “Development of an OFDM based high speed wireless LAN platform using the TI C6x DSP,” Proc. ICC, pp. 522-526, Apr. 2002.
[4] A. Bright et al., “Creating the BlueGene/L Supercomputer from Low-Power SoC ASICs,” ISSCC Dig. Tech. Papers, pp. 188-189, Feb. 2005.
[5] S. Agarwala et al., “A 600MHz VLIW DSP,” ISSCC Dig. Tech. Papers, pp. 56-57, Feb. 2002.
[6] B. Flachs et al., “A Streaming Processing Unit for a CELL Processor,” ISSCC Dig. Tech. Papers, pp. 134-135, Feb. 2005.
[7] M. Taylor et al., “Evaluation of the RAW Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams,” Proc. ISCA, pp. 2-13, June 2004.
[8] R. Witek et al., “StrongARM: A High-Performance ARM Processor,” Proc. COMPCON, pp. 188-191, Feb. 1996.

• 2006 IEEE International Solid-State Circuits Conference

1-4244-0079-1/06/$20.00 ©2006 IEEE


ISSCC 2006 / February 8, 2006 / 11:15 AM

Figure 23.6.1: AsAP block diagram.
[Diagram labels: a single processor with Input FIFO0 and Input FIFO1, Inst. Memory, Inst. Decoder, Program Counter, Data Mem, Address Gens., ALU, MAC, Control, Static config, Dynamic config, Clock, and Output, shown with the 6×6 processor array.]

Figure 23.6.2: AsAP 9-stage pipeline.
[Stages: Instruction Fetch, Instruction Decode, Memory Read, Source Select, Execute 1, Execute 2, Execute 3, Result Select, Write Back.]

Figure 23.6.3: Memory requirements for common DSP tasks and cascading task data flow.

Task                     Inst. Mem requirement (words)   Data Mem requirement (words)
N-pt FIR                 6                               2N
8-pt DCT                 40                              16
8x8 2-D DCT              154                             72
Conv. coding (k = 7)     29                              14
Huffman DC encode        107                             24
Huffman AC encode        96                              310
N-pt convolution         29                              2N
64-pt complex FFT        97                              192
Bubble sort              20                              1
N merge sort             50                              N
Square root              62                              15
Exponential              108                             32

[Diagram: from traditional view to cascading task view, with tasks A, B, and C mapped onto CPUs.]

Figure 23.6.4: Programmable clock oscillator and inter-processor communication diagram.
[Diagram labels: ring oscillator with 7 tri-state inverters, stage 1, stage 5, stage 9, clock divider, halt, n_reset, clk_ext_in, clk_ext_en, clk_out; dual-clock FIFO with write logic, read logic, SRAM, binary-to-Gray and Gray-to-binary converters, sync registers, write and read address and control; FIFO 0 and FIFO 1 (same as FIFO 0) with inputs from east, north, west, and south; signals data, valid, clk, request, data_out.]

Figure 23.6.5: Area, power and performance of several processors scaled to 0.13µm.
[Panels: Processor area (mm²); Peak performance density (MOPS/mm²); Typical energy per operation (nJ/operation); Area breakdown (core, comm, mem). Processors compared: AsAP, TI_C64x [5], RAW [7], CELL/SPE [6], ARM [8], BlGe/L [4]. Annotated ratios in the plots: 20x, 14x, and 7x.]
* assume 2 ops/cycle for CELL/SPE
** assume 3.3 ops/cycle for TI C64x

Figure 23.6.6: JPEG and 802.11a/802.11g transmitter implementations.
[JPEG encoder blocks: input, Lv-shift, 1-DCT, Trans, 1-DCT, Quant., Zig-zag, AC in, DC in, Huffman, output. Transmitter blocks: Data bits, Scrambler, Conv. Code, Puncture, Interleave 1, Interleave 2, Modulation Mapping, Pilot Insertion, Training, Pad, IFFT BR, IFFT Mem stages 0-1, IFFT Mem stages 2-3, IFFT Mem stages 4-5, IFFT BF, IFFT Output, Guard Interval, GI/Window (Real), GI/Window (Imaginary), Upsample FIR, Output Sync, To D/A converter, and unused processors.]


Figure 23.6.7: Chip micrograph of the 6×6 AsAP array.
[Micrograph labels: single processor, 810µm × 810µm, containing IMEM, DMEM, OSC, and FIFOs; dimension annotations of 5.68mm, 5.65mm, 5.14mm, and 5.11mm.]

Technology                 TSMC 0.18µm
Transistors                8.5 Million
Area                       5.68mm x 5.65mm = 32.1mm²
Area, 1 proc               0.66mm²
Max speed                  475MHz @ 1.8V, 17.1GMAC/s
Typical power (1 proc)     32mW @ 1.8V @ 475MHz
Max power (1 proc)         144mW @ 1.8V @ 475MHz
Max energy/op (1 proc)     300pJ/Op = 0.3mW/MHz @ 1.8V
                           93pJ/Op = 0.093mW/MHz @ 0.9V
Package                    PGA121M
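Several of the headline figures quoted in the text and in the chip summary can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the paper. The variable names are illustrative, and the derived quantities (cycles per JPEG block, energy per block, implied transmitter throughput) are back-of-the-envelope values, not results reported by the authors.

```python
# Figures stated in the paper
MAX_POWER_W = 0.144        # per processor at 1.8V, 475MHz
F_MAX_HZ = 475e6
NUM_PROCESSORS = 36
PROC_SIDE_MM = 0.810       # single-processor edge length from the micrograph
JPEG_POWER_W = 0.224
JPEG_BLOCKS_PER_S = 215e3
APP_CLOCK_HZ = 300e6
TX_FULL_RATE_MBPS = 54.0
TX_FRACTION = 0.30

# Maximum energy per cycle: 144mW / 475MHz, quoted as 0.3mW/MHz (300pJ/Op)
energy_per_cycle_pj = MAX_POWER_W / F_MAX_HZ * 1e12

# Peak MAC rate, assuming one MAC per processor per cycle: quoted as 17.1GMAC/s
peak_gmac_per_s = NUM_PROCESSORS * F_MAX_HZ / 1e9

# Processor area from its edge length: quoted as 0.66mm²
proc_area_mm2 = PROC_SIDE_MM ** 2

# Derived (not quoted in the paper): JPEG cost per 8×8 block
jpeg_cycles_per_block = APP_CLOCK_HZ / JPEG_BLOCKS_PER_S
jpeg_energy_per_block_uj = JPEG_POWER_W / JPEG_BLOCKS_PER_S * 1e6

# Derived (not quoted in the paper): transmitter throughput implied by "30% of 54Mb/s"
tx_throughput_mbps = TX_FRACTION * TX_FULL_RATE_MBPS

print(f"{energy_per_cycle_pj:.0f} pJ/cycle, {peak_gmac_per_s:.1f} GMAC/s, "
      f"{proc_area_mm2:.2f} mm², {jpeg_cycles_per_block:.0f} cycles/block, "
      f"{jpeg_energy_per_block_uj:.2f} µJ/block, {tx_throughput_mbps:.1f} Mb/s")
```

The first three values agree with the quoted 0.3mW/MHz, 17.1GMAC/s, and 0.66mm² after rounding.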
