A High-Speed and Low-Voltage Associative Co-Processor With Hamming Distance Ordering Using Word-Parallel and Hierarchical Search Architecture

Yusuke Oike†, Makoto Ikeda†‡, and Kunihiro Asada†‡

† Dept. of Electronic Engineering, University of Tokyo
‡ VLSI Design and Education Center (VDEC), University of Tokyo
7–3–1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
Phone: +81-3-5841-6719, Fax: +81-3-5841-8912
E-mail: {y-oike, ikeda, asada}@silicon.u-tokyo.ac.jp

Abstract


We present a new concept and its circuit implementation for a high-speed and low-voltage associative co-processor with Hamming distance ordering. A hierarchical search architecture maintains high speed even for large input sizes. Our circuit implementation places no limit on database capacity and operates at supply voltages below 1.0 V for SoC applications, which is difficult for conventional analog approaches. The search logic embedded in each memory cell realizes word-parallel Hamming distance ordering for high-speed sorting/routing applications as well as near/nearest-match detection for recognition. Our fabricated 0.18 µm 64-bit 32-word associative co-processor operates at 411.5 MHz and 40.0 MHz at 1.8 V and 0.75 V, respectively.

Introduction

Some applications, such as data compression, pattern recognition, multimedia, and intelligent processing systems, require a huge amount of memory access and data processing time. To reduce them, many content addressable memories (CAMs) have been developed [1]–[3]. These CAMs can quickly detect completely matched data in a database. In recent years, advanced applications have required the detection of not only completely matched data but also near-match data. CAMs using analog circuit technologies have been proposed for quick nearest-match detection [4]–[8]. Their circuit implementations are generally compact; however, they are difficult to operate in a deep sub-micron (DSM) process and at a low supply voltage. Therefore, they are not suitable for a system-on-chip VLSI in a DSM process.

In this paper, we present a high-speed and low-voltage associative co-processor using a hierarchical search architecture with the capability of word-parallel Hamming distance ordering. It has three advantages:

(1) The first advantage is high-speed search in a large database due to a hierarchical search architecture. The search time of our method is bounded by O(√N) or O(log M) at N-bit M-word data capacity. In addition, it theoretically has no limitation on the number of data patterns M, the bit length N, or the search distance.

(2) The second advantage is low-voltage operation in DSM. The circuit implementation tolerates device fluctuation in DSM and allows low-voltage operation below 1.0 V, which is difficult for conventional analog approaches.

(3) The third advantage is additional functions for associative processing. The synchronous search logic embedded in each memory cell realizes word-parallel Hamming distance ordering for high-speed sorting/routing applications as well as near/nearest-match detection for recognition.

A 64-bit 32-word associative co-processor has been fabricated in a 0.18 µm CMOS process and successfully tested.

Architecture of Hamming Distance Ordering

A. Basic Search Operation

Fig.1 shows the basic operation of Hamming distance (HD) ordering without hierarchical search. The operation consists of a data comparison and a search signal propagation. First, the input data is compared with each template data in a bit-parallel way using XOR/XNOR gates. Then the search signals start from the LSB of each word. A search signal passes through matched bits. Completely matched data (HD = 0) are detected in the first clock period (clock period 0), since the search signal passes through the MSB. In the next clock period, the first-encountered mismatched bit is masked in each word and the search signals restart toward the next mismatched bit. Thus, the data of HD = 1 are detected. In this manner, the data of HD = n are detected in clock period n, as shown in Fig.1. The search operation can detect not only the nearest-match data but also all data in order of Hamming distance.

[Figure: search signal (SS) propagation through bits b0 to b7 over clock periods 0 to 3, for Data A = 01010011 and Data B = 11000111 (A ⊕ B = 10010100). Legend: match bit (SS passes through); mis-match bit (SS stops, then is masked).]

Fig. 1 Basic Operation without Hierarchical Search.
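The basic operation can be illustrated with a small behavioural model. The following is a minimal sketch in Python, not the circuit itself: it assumes that bit index 0 is the end from which the search signal starts, and the function names `stop_position` and `hd_ordering` are hypothetical. Data A and Data B are taken as the 8-bit example of Fig. 1; the third template is an added illustrative pattern.

```python
from typing import List, Optional


def stop_position(mismatch: List[int], masked: List[bool]) -> Optional[int]:
    """Return the bit index where the search signal stops, or None if it passes every bit."""
    for i, bit in enumerate(mismatch):
        if bit and not masked[i]:
            return i
    return None


def hd_ordering(query: List[int], templates: List[List[int]]) -> List[List[int]]:
    """Return a list whose entry c holds the template indices detected in clock period c."""
    n = len(query)
    # Bit-parallel comparison: XOR gives 1 at every mismatched bit.
    mismatch = [[q ^ t for q, t in zip(query, tpl)] for tpl in templates]
    masked = [[False] * n for _ in templates]
    done = [False] * len(templates)
    detected: List[List[int]] = []
    for _clock in range(n + 1):  # HD ranges over 0..n, so n + 1 periods suffice
        hits: List[int] = []
        for w, mm in enumerate(mismatch):
            if done[w]:
                continue
            stop = stop_position(mm, masked[w])
            if stop is None:
                hits.append(w)  # signal reached the far end: word detected
                done[w] = True
            else:
                masked[w][stop] = True  # mask takes effect in the next clock period
        detected.append(hits)
        if all(done):
            break
    return detected


data_a = [0, 1, 0, 1, 0, 0, 1, 1]   # input data (Data A)
data_b = [1, 1, 0, 0, 0, 1, 1, 1]   # template with HD = 3 (Data B)
one_off = [0, 1, 0, 1, 0, 0, 1, 0]  # illustrative template with HD = 1
print(hd_ordering(data_a, [data_b, data_a, one_off]))  # [[1], [2], [], [0]]
```

In this model one mismatched bit per word is masked in each clock period, so a word is reported in the clock period equal to its Hamming distance, and words come out in distance order without any explicit counting or sorting.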

B. Word-Parallel and Hierarchical Search Structure

The search time of the basic operation is limited by the propagation time of the search signal, so it is linearly related to the data length in a ripple-mode implementation. Fig.2 shows an operation diagram of the word-parallel and hierarchical structure for high-speed Hamming distance search with a large input size. The template data is divided into blocks. The search signal (SS) propagates to the hierarchical search nodes (HN) through each block simultaneously, as shown in Fig.2 (a). Each hierarchical node provides a permission signal (PS) to the next block, as shown in Fig.2 (b). The permission signal makes a mismatched bit maskable. At the next clock period, the maskable mismatched bit is masked only when the search signal has arrived. The number of clock cycles elapsed when the search signal is output from the last hierarchical node represents the Hamming distance of the data.

For example, some search signals stop at a mismatched bit in their block and the others pass to the hierarchical node, as shown in Fig.2 (c). The first mismatched bit becomes maskable at clock period 0. The data of HD = 0 are detected at the same time. At the next clock period, the first mismatched bit is masked and the search signal restarts. The search signal updates the permission signals and the next mismatched bit becomes maskable. Thus the data of HD = 2 are detected at clock period 2. In this architecture, the critical path is the search signal propagation path of one block plus the hierarchical bypass line. The search time has characteristics similar to those of a carry-bypass adder, so the architecture is applicable to a large database.

[Figure: (a) search signal (SS) propagation path and (b) permission signal (PS) path through blocks of n, n+1, n+2, and n+3 bit cells; (c) operation diagram (ex. HD = 2) for clock periods 0 to 2. Legend: match bit; mis-match bit; mis-match bit (maskable); mis-match bit (masked); hierarchical search node; SS propagating path; SS propagated path; PS path.]

Fig. 2 Operation Diagram of Hierarchical Search.

Circuit Configuration

Fig.3 shows a schematic of our associative memory cell. The memory cell is composed of an SRAM cell, an XOR/XNOR circuit for comparison with the input data, and a search circuit for signal propagation and masking. The even-numbered and odd-numbered search circuits are complementary in order to reduce the critical path and the circuit area. In a matched bit, the search signal (SS) always passes to the next bit since the comparison result (M) is true. In a mismatched bit, the SS stops and waits for the next clock period. In the next clock period, the false M is masked and the SS restarts at the cell where both the search signal (SS) and the permission signal (PS) are true. Therefore, only one mismatched bit is masked per clock period in each word, in a word-parallel manner, and all data can be detected in order of Hamming distance. Fig.3 (a) shows the static circuit implementation. It realizes a high tolerance for device fluctuation and low-voltage operation. Fig.3 (b) shows the compact circuit implementation using dynamic circuits. It saves search circuit area for large capacity.

[Figure: memory cell schematics, each with an SRAM cell, an XNOR or XOR cell, and a search cell; signals B, D, W, M, SS, PS, φ1, φ2. Panels: (a-1) and (b-1) odd-numbered cell; (a-2) and (b-2) even-numbered cell.]

Fig. 3 Schematic of the Associative Memory Cell: (a) Static Circuit Implementation, (b) Compact Implementation.

Fig.4 (a) shows a detected data selector, which masks one output of the detected data in the same clock period after its address encoding. All detected data with the same HD can thus be encoded by the following priority encoder stage. Fig.4 (b) shows a binary-tree priority encoder. It realizes a small area and quick address encoding with O(log M) delay time at M-word capacity.

[Figure: (a) detected data selector with signals RST, CAMOUT, PRIORITY_IN, and PRIORITY_OUT; (b) binary-tree priority encoder with a priority decision circuit (PRIORITY_IN1 to PRIORITY_IN8, PRIORITY_OUT1 to PRIORITY_OUT8) and an address encoder (ADDRESS0 to ADDRESS2).]

Fig. 4 Schematic of: (a) Detected Data Selector, (b) Binary-Tree Priority Encoder.
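The interplay of the detected data selector and the binary-tree priority encoder can be sketched in software as follows. This is an illustrative Python model under stated assumptions, not the fabricated circuit: the lowest address is assumed to have the highest priority, the word count is assumed to be a power of two, and the names `tree_priority_encode` and `drain_detected` are hypothetical.

```python
from typing import List, Optional, Tuple


def tree_priority_encode(flags: List[bool]) -> Tuple[Optional[int], int]:
    """Return (address of the winning flag or None, number of tree levels)."""
    m = len(flags)
    assert m > 0 and m & (m - 1) == 0, "sketch assumes a power-of-two word count"
    # Each node carries (valid, partial address); leaves start with address 0.
    nodes = [(bool(f), 0) for f in flags]
    levels = 0
    while len(nodes) > 1:
        merged = []
        for i in range(0, len(nodes), 2):
            (left_valid, left_addr), (right_valid, right_addr) = nodes[i], nodes[i + 1]
            if left_valid:
                merged.append((True, left_addr))  # left child wins: new address bit is 0
            else:
                merged.append((right_valid, right_addr | (1 << levels)))  # new bit is 1
        nodes = merged
        levels += 1
    valid, addr = nodes[0]
    return (addr if valid else None), levels


def drain_detected(flags: List[bool]) -> List[int]:
    """Encode every detected word in turn, masking each one after it is encoded."""
    pending = list(flags)
    order: List[int] = []
    while True:
        addr, _levels = tree_priority_encode(pending)
        if addr is None:
            return order
        order.append(addr)
        pending[addr] = False  # selector masks the output that was just encoded


# Eight words, three of them detected in the same clock period.
same_hd = [False, False, True, False, False, True, True, False]
print(drain_detected(same_hd))                 # [2, 5, 6]
print(tree_priority_encode([False] * 32)[1])   # 5 levels for 32 words
```

The number of tree levels grows as log2 of the word count, which is the source of the O(log M) encoding delay; the masking loop shows how several words at the same Hamming distance are read out one after another.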

[Figure: block diagram with a row decoder, a memory read/write circuit with data buffers, words W1 to Wi each divided into hierarchical cell blocks HBi1 (n bit), HBi2 (n+1 bit), ..., HBij (n+j-1 bit), a detected data selector, and a priority decision circuit and encoder producing ADDRESS. Legend: RAM: SRAM cell; SC: embedded search circuit; HN: hierarchical node; HBij: hierarchical cell block; SS: search signal path; PS: permission signal path; SOi: search results per word; PIi: priority encoder inputs; POi: priority decision outputs.]

Fig. 5 Block Diagram of the Fabricated Associative Co-Processor.

[Figure: 2.8 mm × 2.8 mm chip containing the 64-bit 32-word cells, row decoder, memory R/W circuit and data buffer, priority encoder, TEG areas, and the 64-bit 2-word compact implementation.]

Fig. 6 Chip Microphotograph.

[Figure: cell layouts showing the SRAM cell, XOR cell, and search circuit; (a) 9.6 µm × 13.6 µm, (b) 7.2 µm × 8.8 µm.]

Fig. 7 Layout of the Associative Memory Cell: (a) Static Circuit Implementation, (b) Compact Implementation.

[Figure: signal level (a.u.) versus time (ns) for CLK and SOi; critical path delay 2.18 ns.]

Fig. 8 Measured Waveforms of the Search Signal Propagation.
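The hierarchical cell blocks grow by one bit per block (n, n+1, ..., n+j-1 bits), in the manner of a variable-block carry-bypass adder. The following Python sketch suggests why such a partition gives a search time of O(√N). It is a unit-delay model under loud assumptions: the first block length of 4 bits, the equal cell and node delays, and the function names are all hypothetical and are not taken from the fabricated chip.

```python
import math
from typing import List


def block_lengths(total_bits: int, first_block: int) -> List[int]:
    """Split total_bits into blocks of first_block, first_block + 1, ... bits.

    The last block is truncated so that the lengths sum to total_bits.
    """
    lengths: List[int] = []
    size, remaining = first_block, total_bits
    while remaining > 0:
        take = min(size, remaining)
        lengths.append(take)
        remaining -= take
        size += 1
    return lengths


def critical_path(lengths: List[int], t_cell: float = 1.0, t_node: float = 1.0) -> float:
    """Worst-case arrival time at the last hierarchical node (assumed unit delays)."""
    arrival = 0.0
    for length in lengths:
        # All blocks start at time 0; a node fires once both its own block
        # and the previous node are ready, then adds one bypass delay.
        arrival = max(arrival, length * t_cell) + t_node
    return arrival


blocks = block_lengths(64, first_block=4)  # first_block = 4 is an assumed value
print(blocks)                 # [4, 5, 6, 7, 8, 9, 10, 11, 4]
print(critical_path(blocks))  # 13.0, versus 64 cell delays for a pure ripple path

for n_bits in (64, 256, 1024):
    path = critical_path(block_lengths(n_bits, first_block=4))
    print(n_bits, path, round(path / math.sqrt(n_bits), 2))
```

Because block k finishes its internal propagation at about the time the bypass signal from block k-1 arrives, the total delay in this model is roughly the first block length plus the number of blocks. With N = jn + j(j-1)/2 bits in j blocks, j grows as about √(2N), so the modelled path grows as √N rather than N.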

Chip Implementation

We have designed and fabricated a 64-bit 32-word associative co-processor using the present architecture and the static circuit implementation in a 0.18 µm CMOS process¹. Fig.5 illustrates a block diagram of the fabricated memory module, and Fig.6 shows its chip microphotograph and components. The associative co-processor has 64 × 32 memory cells with the search circuit, a memory read/write circuit with data buffers, a word decoder, and a 32-input priority encoder with a detected data selector. We have also designed a 64-bit 2-word associative memory using the compact implementation on the same chip for performance evaluation.

¹ The chip in this study has been fabricated through VLSI Design and Education Center (VDEC), University of Tokyo in collaboration with Hitachi Ltd. and Dai Nippon Printing Co.

Measurement Results and Discussions

A. Area and Capacity

The designed 64-bit 32-word associative co-processor occupies 475 µm × 1160 µm (0.55 mm²). The area of a memory macro cell with a static search circuit is 9.6 µm × 13.6 µm (130.56 µm²), as shown in Fig.7 (a). In the static circuit implementation using the 0.18 µm process, the cell area is 6 times and 3 times as large as a 6T SRAM cell and a standard complete-match CAM cell, respectively. Fig.7 (b) shows a layout of the compact implementation using dynamic circuits. It occupies 7.2 µm × 8.8 µm (63.36 µm²). In this case, the cell area is 3 times and 2 times as large as a 6T SRAM cell and a standard complete-match CAM cell, respectively.

The number of transistors in our memory cell is larger than in the conventional analog approaches [4]–[8]. The analog approaches, however, have difficulty following device scaling, especially in a DSM process, while maintaining their performance and marginal capacity. Our approach can follow device scaling and operate at a low supply voltage because of the synchronous digital search logic embedded in the memory cells. Besides, it has no limitation on capacity or search distance. Therefore, our associative co-processor has more potential for practical use and large capacity than the conventional designs.

B. Operation Speed

Fig.8 shows waveforms measured with an electron beam probe at room temperature. It shows the delay of the critical path from the search circuit clock (CLK) to a search output (SOi). The delay for Hamming distance search at 64-bit data length is 2.18 ns in the worst case. The operation speed of the fabricated associative co-processor is 411.5 MHz and 40.0 MHz at 1.8 V and 0.75 V power supply, respectively. Fig.9 shows the measured operation speed for power supplies from 0.75 V to 1.8 V.

In Hamming distance ordering, the search takes a number of clock periods corresponding to the Hamming distance. It takes 65 clock periods to order all data from 0-bit distance to 64-bit distance. Our fabricated associative co-processor completes the Hamming distance ordering for sorting/routing of all data in 158.0 ns (65 clock periods of about 2.43 ns at 411.5 MHz). It is difficult to implement such a function at high speed with conventional analog techniques. When the target application requires only nearest-match data, the search time depends on its Hamming distance. For example, nearest-match detection is completed in 41.3 ns at 16-bit Hamming distance (17 clock periods). This operation speed is also difficult for the conventional analog approaches. The worst case of nearest-match detection is the same as the ordering operation.

Fig.10 shows the relation between the bit or word length and the search cycle time. The search time is limited by either the search signal propagation or the priority encoding. The former is determined by the bit length, and its time is O(√N) at N-bit length. The latter is determined by the number of words, and its time is O(log M) at M-word length. Therefore, our method maintains high speed for a large database, as shown in Fig.10. The Hamming distance ordering has no limitation on data capacity, as mentioned above.

[Figure: measured PASS and FAIL regions; axes: frequency (MHz) and power supply (V).]

Fig. 9 Operation Frequency vs Power Supply Voltage.

[Figure: cycle time (ns) versus bit length N (bit, upper axis) and number of words M (word, bottom axis). Curves: distance search stage, O(√N) (simulation results, upper axis); priority encoder stage, O(log M) (simulation results, bottom axis); module cycle time (measured result, 64-bit 32-word).]

Fig. 10 Cycle Time and Data Capacity.

C. Power Dissipation

The power dissipation of the associative co-processor is less than 51.3 mW at 1.8 V power supply and 400 MHz operation. In low-power operation, it is 1.18 mW at 0.75 V power supply and 40 MHz operation. The search accuracy of the conventional analog approach is unstable and sometimes meaningless in low-power operation. Our search operations are precise regardless of the power supply voltage. The specifications of the fabricated co-processor are summarized in Table I.

TABLE I Specifications of the Fabricated Associative Co-Processor.

Process                  0.18 µm CMOS, 5-Metal 1-Poly-Si
Power Supply Voltage     0.7 V – 1.8 V
Organization             64 Bit × 32 Word Memory Cells
                         32-Input Priority Encoder
Functions                Distance Ordering and Nearest-Match
Module Size              475 µm × 1160 µm (0.55 mm²)
Num. of Transistors      88.5k transistors
Memory Cell Size         9.6 µm × 13.6 µm (130.56 µm²)
                         † 7.2 µm × 8.8 µm (63.36 µm²)
Search Time Order        O(√N) (@ N-Bit capacity)
Encoding Time Order      O(log M) (@ M-Word capacity)
Operation Speed          411.5 MHz (@ 1.8 V, Measured)
                         454.5 MHz (@ 1.8 V, Simulated)
                         40.0 MHz (@ 0.75 V, Measured)
                         41.4 MHz (@ 0.75 V, Simulated)
Distance Ordering Time   154.0 ns (0-bit to 64-bit distance)
Power Dissipation        51.3 mW (@ 1.8 V, 400 MHz)
                         1.18 mW (@ 0.75 V, 40 MHz)

† designed by the compact implementation

Conclusions

We have proposed a new concept and its circuit implementation for a high-speed and low-voltage associative co-processor in a DSM process to solve the problems of conventional analog techniques. It places no limit on data capacity and maintains high speed for a large database owing to a hierarchical search architecture and synchronous search logic embedded in each memory cell. Our extended functions, such as Hamming distance ordering, can be applied effectively to high-speed sorting/routing applications as well as near/nearest-matching applications. We have designed and fabricated a 64-bit 32-word associative co-processor in a 0.18 µm CMOS process and demonstrated high-speed and low-voltage operation. The operation speed reaches 411.5 MHz and 40.0 MHz at supply voltages of 1.8 V and 0.75 V, respectively.

References

[1] T. Ogura et al., “A 20-kbit Associative Memory LSI for Artificial Intelligence Machines,” IEEE J. Solid-State Circuits, vol. 24, no. 4, pp. 1014-1020, Aug. 1989.
[2] T. Ikenaga et al., “A Fully Parallel 1-Mb CAM LSI for Real-Time Pixel-Parallel Image Processing,” IEEE J. Solid-State Circuits, vol. 35, no. 4, pp. 536-544, Apr. 2000.
[3] H. Miyatake et al., “A Design for High-Speed Low-Power CMOS Fully Parallel Content-Addressable Memory Macros,” IEEE J. Solid-State Circuits, vol. 36, no. 6, pp. 956-968, June 2001.
[4] T. Yamashita et al., “Neuron MOS Winner-Take-All Circuit and Its Application to Associative Memory,” ISSCC Dig. Tech. Papers, pp. 236-237, Feb. 1993.
[5] M. Nagata et al., “A Minimum-Distance Search Circuit using Dual-Line PWM Signal Processing and Charge-Packet Counting Techniques,” ISSCC Dig. Tech. Papers, pp. 42-43, Feb. 1997.
[6] M. Ikeda et al., “Time-Domain Minimum-Distance Detector and Its Application to Low-Power Coding Scheme on Chip-Interface,” Proc. of Eur. Solid-State Circuits Conf. (ESSCIRC), pp. 464-467, 1998.
[7] H. J. Mattausch et al., “An Architecture for Compact Associative Memories with Deca-ns Nearest-Match Capability up to Large Distances,” ISSCC Dig. Tech. Papers, pp. 170-171, 2001.
[8] H. J. Mattausch et al., “Fully-Parallel Pattern-Matching Engine with Dynamic Adaptability to Hamming or Manhattan Distance,” Symp. on VLSI Circuits Dig. Tech. Papers, pp. 252-255, 2002.