
10-3 Dynamically Reconfigurable Logic LSI - PCA-1

K. Oguri†, H. Nakada, H. Ito, R. Konishi, T. Shiozawa, K. Nagami, N. Imlig, A. Nagoya, M. Inamori

†Department of Computer and Information Sciences, Nagasaki University, 1-14 Bunkyo-machi, Nagasaki-shi, Nagasaki 852-8521, Japan

NTT Network Innovation Laboratories, 1-1 Hikari-no-oka, Yokosuka-shi, Kanagawa 239-0847, Japan

Phone: +81468594233 Fax: +81468593014 Email: [email protected]

Abstract

This paper describes the realization of a dynamically reconfigurable logic LSI based on a novel parallel computer architecture. The key point of the architecture is its dual-structured cell array, which enables dynamic and autonomous reconfiguration of the logic circuit. The LSI was completed by successfully introducing two specific features: fully asynchronous logic circuits and a homogeneous structure using only LUTs.

Introduction

Dynamically reconfigurable computing has been recognized as well suited to communication networks because it can adapt to both revisions in specifications and environmental changes such as traffic variations. Architectures that support reconfiguration at the logic circuit level are preferable because bit-wise operations are frequently used in communication processing. Furthermore, scalability of the architecture is an important factor because performance flexibility is also required. The Plastic Cell Architecture (PCA), the general purpose dynamically reconfigurable computer architecture proposed by the authors, fulfills such requirements [1][2].

From the structural viewpoint, PCA is a fine grain cell array architecture. A significant feature of PCA is its dual-structured cell (PCA-cell) composed of a Plastic Part (PP) and a Built-in Part (BP). The PP is a programmable logic block which can be configured to any logic circuit, while the BP is a small controller with fixed functions that performs configuration of PPs and data communication for the circuits configured on the PPs. PPs and BPs are independently connected as meshes to form the PP-plane and the BP-plane (Fig. 1).

PCA realizes parallel computing with objects and message passing by using this dual structure (Fig. 1). An object is a logic circuit or a memory block configured on the PP-plane; it is isolated from the other objects but equipped with communication interfaces. Communication between objects is performed by transferring messages through the BP-plane. A message is a stream of commands and data exchanged between objects. Dynamic configuration of circuits on PPs is performed by generating or deleting objects with appropriate messages.

Architecture of PCA-1

A. Features of PCA-1

Fig. 1 Plastic Cell Architecture (PCA cell array = PP-plane + BP-plane)

PCA-1 is the first LSI realizing the concept of PCA. To implement the aforesaid functions, we have adopted two specific features for PCA-1. One is the homogeneous Sea-of-LUTs structure of the PPs, and the other is a fully asynchronous architecture.

Sea-of-LUTs means that the PP-plane consists of only LUTs; all logic circuit elements are configured on the LUTs. This structure differs from conventional FPGAs, which have various specific functional blocks (e.g. flip-flops, combinational circuits and wire channels). This new concept has the following advantages. It allows the circuit density to be increased. It enables a flexible resource trade-off between functional blocks. Moreover, it eases the design of asynchronous circuits on PPs because both wiring and logic delay can be estimated from just the LUT delay.

The asynchronous architecture is very effective for the following reasons. From the viewpoint of scalability, it enables a large system to be constructed by simply connecting PCA-1 LSIs; timing constraints such as clock skew or metastability are eliminated. It also facilitates message communication between objects because the behavior of each object is fundamentally asynchronous. Moreover, power consumption on the chip is limited to configured and active areas.

PCA-1 consists of an array of 6x6 PCA-cells. It does not use addresses to specify cells, only relative offsets using four directions labeled N, E, W and S. For external I/O, the communication interface of the BP-plane is used. The cell space can be directly extended simply by connecting PCA-1 LSIs.

B. Plastic Part

2001 Symposium on VLSI Circuits Digest of Technical Papers

TABLE 1 The command set of the Built-in Part

Fig. 2 Sea-of-LUTs structure of the Plastic Part: a) Plastic Part (PP), 8x8 basic cells, 256 4-1-LUTs; b) Basic Cell (BC), 2x2 4-1-LUTs; c) 4-1-LUT, 16-bit memory

Fig. 2 shows the PP structure. Each PP is an array of 8x8 basic cells (BCs). One BC consists of four 4-input-1-output LUTs and has 4 inputs and 4 outputs from/to the 4 adjacent BCs. Thus one PP is composed of 256 LUTs and is handled as a 1024x4 bit SRAM configuration memory by the BP. The PP has 8 inputs and 8 outputs from/to each adjacent PP in the 4 directions. Some of the I/Os on the N and W sides can be used to communicate with the BP. This selection is controlled by the BP.

The 6x6 PPs on PCA-1 are connected to form a seamless network of 9216 LUTs. Since the PP contains only LUTs, with no FFs, long lines or switching elements, logic circuits are configured on the LUT network. Not only combinational logic but also wiring, latches, and Muller-C gates are configured by LUTs. On the other hand, the PP can be used as a 1024x4 bit SRAM for memory objects.
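The resource counts quoted above, and the idea that a state-holding element such as a Muller-C gate can be expressed purely as a LUT, can be checked with a short sketch. This is only an illustration: the helper names (`make_lut`, `read_lut`, `run_c_element`) and the choice of feeding the output back into the third LUT input are assumptions made for the example, not the mapping used on PCA-1.

```python
# Resource counts taken from the text.
BASIC_CELLS_PER_PP = 8 * 8      # each PP is an 8x8 array of basic cells
LUTS_PER_BASIC_CELL = 4         # four 4-input, 1-output LUTs per basic cell
BITS_PER_LUT = 2 ** 4           # a 4-input LUT stores 16 configuration bits
PPS_PER_CHIP = 6 * 6            # PCA-1 is a 6x6 array of PCA-cells

luts_per_pp = BASIC_CELLS_PER_PP * LUTS_PER_BASIC_CELL   # 256
config_bits_per_pp = luts_per_pp * BITS_PER_LUT          # 4096 = 1024 x 4
luts_per_chip = luts_per_pp * PPS_PER_CHIP               # 9216


def make_lut(func):
    """Return the 16-entry truth table of a 4-input, 1-output function."""
    return [func(i & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1)
            for i in range(16)]


def read_lut(table, a, b, c, d):
    """Look up the output for inputs a..d (each 0 or 1)."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]


def muller_c(a, b, previous, _unused):
    """Muller-C behavior: follow the inputs when they agree, else hold."""
    return a if a == b else previous


# Hypothetical mapping: input c carries the fed-back output, input d is unused.
C_ELEMENT_TABLE = make_lut(muller_c)


def run_c_element(input_pairs, state=0):
    """Apply (a, b) pairs in sequence and return the output after each."""
    outputs = []
    for a, b in input_pairs:
        state = read_lut(C_ELEMENT_TABLE, a, b, state, 0)
        outputs.append(state)
    return outputs
```

For instance, `run_c_element([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)])` holds its output while the inputs disagree and changes it only when both inputs agree.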

C. Built-in Part

Fig. 3 shows the BP architecture. It has a 5-way symmetric module structure because it works by communicating with the 4 adjacent BPs and the corresponding object configured on the PP. The main parts of the BP are 5 INPUT-PORT (IP) modules, which are asynchronous state machines that communicate with their neighbors. The 5-bit wide data path used for communication with the neighbors is controlled by asynchronous handshakes using request and acknowledge. Twelve commands are provided to control the IP (TABLE 1). The five IPs share a PP-CONTROLLER (PPC) and 5 OUTPUT-PORTs (OPs) to realize the object configuration or message routing specified by the commands. The OPs are used for routing messages to their destination. The PPC offers the functions of writing/reading data to/from the PP as an SRAM or initializing an object on the PP. Each of these shared modules is equipped with its own arbiter module (ABT) to arbitrate requests from the IPs. An asynchronous arbitration technique is used for this purpose [3].
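As a rough illustration of how relative routing commands could steer a message from BP to BP without cell addresses, the sketch below walks a message header across the 6x6 array. The command names NORTH, EAST, WEST, SOUTH and PP are the routing commands named in this paper; the function `route`, the (row, column) coordinates, the convention that NORTH decreases the row index, and the error raised at the array edge are all assumptions for the example (on the real chip the edge leads to external I/O or a neighboring LSI).

```python
# Relative moves for the four direction commands (coordinate convention assumed).
STEP = {"NORTH": (-1, 0), "SOUTH": (1, 0), "EAST": (0, 1), "WEST": (0, -1)}


def route(message, start, size=6):
    """Consume leading direction commands, one per BP, and return
    (cells visited, remainder of the message delivered at the last cell)."""
    row, col = start
    path = [start]
    for index, token in enumerate(message):
        if token not in STEP:
            # First non-direction token: the rest is handled at this cell,
            # e.g. a PP command selecting the local object.
            return path, message[index:]
        d_row, d_col = STEP[token]
        row, col = row + d_row, col + d_col
        if not (0 <= row < size and 0 <= col < size):
            raise ValueError("message left the modeled cell array")
        path.append((row, col))
    return path, []
```

A header such as `["EAST", "EAST", "SOUTH", "PP", "d0", "d1"]` starting at cell (0, 0) visits three further cells, and the remainder `["PP", "d0", "d1"]` is handed to the last one. Each BP only ever looks at the first command it receives, which is why no global cell address is needed.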

D. Built-in Functions

The BP performs the built-in functions specified by the commands in the messages as follows.

Message routing is performed by the wormhole routing technique: the header part of a message, composed of a sequence of routing commands (NORTH, EAST, WEST, SOUTH and PP), establishes the route for the message itself. In each BP, the first routing command causes the IP to select the corresponding OP for forwarding the rest of the message. Routing to the objects (circuit modules) on the PP-plane is treated in the same manner as routing on the BP-plane with these commands. Once a route is configured, it is available until released by a CLEAR command. This dynamic routing mechanism removes the quantitative constraint on wiring among objects on the PP-plane.

Object configuration is performed by writing the configuration data to the PPs with the memory-write commands: CIF or CIM. One memory-write command is followed by 1024 nibbles (1 nibble = 4 bits) of data for each PP. After the configuration data are written, the OPEN command initializes the object and opens the communication interface between the object and the BP-plane. Deletion of the object is performed with the CLOSE command, by simply closing the communication interface.

Data written in the PP can be read out with the memory-read commands, CO or COCI, and sent to the destination established by the routing commands. These memory access mechanisms are utilized for both general data storage and object cloning.

Since the five IPs of the BP can receive messages in parallel, multiple functions can be performed simultaneously as long as resources are utilized exclusively. The ABTs for each resource are used for this purpose. For example, multiple messages can cross each other on a BP if they are conveyed from different IPs to different OPs. Likewise, message routing and memory access to the SRAM of the PP can be performed simultaneously on a BP.

Asynchronous Circuit Design

A. Asynchronous Circuit on PCA-1

Fig.3 Module structure of the Built-in Part


We have adopted a fully asynchronous architecture for PCA-1. Thus the BP is designed as asynchronous controller circuits, and the objects configured on the PP-plane can also be designed as asynchronous circuits. The basics of asynchronous circuit design are the same for both parts. The asynchronous logic applied to PCA-1 is based on the bundled data protocol with the 4-cycle signaling handshake. However, one critical feature of asynchronous circuits is realized only on the BP: the asynchronous arbiter. This is because it needs a custom cell and cannot be realized on the Sea-of-LUTs of the PP. The object can use the arbitration function of the BP if necessary. In the following, we describe the asynchronous circuit design utilized for the BP. An asynchronous circuit design method for the object circuits configured on the Sea-of-LUTs structure is partially described in [2].

B. Handshake Circuit


With the bundled data protocol, data transfers through the combinational logic circuits between FFs are handled as bundles, and the completion of signal transition is indicated by a request signal that has a longer delay than the critical path delay of the bundle. Timing of the transfer is controlled by the 4-cycle signaling handshake using request and acknowledge pairs [3]. The handshake circuit is composed of Muller-C gates and delay elements. Fig. 4 shows the basic structures of the asynchronous handshake circuits adopted in the BP. Fig. 4-d shows the function of the Muller-C gate as a truth table. It works as a rendezvous element for two inputs [4]. Fig. 4-a and Fig. 4-b show the handshake circuits controlling the pipeline and the state machine, respectively. The delay element is inserted on the request line to delay propagation of the request until the bundled data become stable at the input of the FFs.

Switching of asynchronous handshake circuits is also utilized. Fig. 4-c shows the switching mechanism of the handshake circuit. Handshake switching is done by selecting a branch from the request fanout and gating the others with outputs of the combinational logic circuit. With the switching mechanism, the Sender selects the Receiver (A or B) according to the input data, and handshakes with the selected one. The delay element is inserted in front of the fanout to delay propagation of the request signal until the outputs of the combinational logic circuit (i.e. the selecting signals) become stable at the AND gates, preventing hazards on the request lines.

C. Asynchronous Arbiter

The asynchronous arbiter module (ABT) has two functions: asynchronous arbitration and locking of the shared resource. The ABT arbitrates among asynchronous requests from the IPs and sends an acknowledge back to the first one. It is then locked by that IP and does not send any acknowledge to the other IPs until released by the locking IP.

Fig. 5 shows the structure of the ABT. The essential part of the ABT is the asynchronous arbiter element (Fig. 5-a, the interlock element described in [3]). The asynchronous arbiter element arbitrates between two input transitions (req1 and req2) by asserting the output (ack1 or ack2) corresponding to the input that changed first. Even if the two inputs change at the same time, the element suppresses metastable outputs until it leaves the metastable condition. The call element connects one of two input handshakes with one output handshake by forwarding an arbitrated req-input (reqA or reqB) to the req-output (reqX) and sending the ack-input (ackX) back to the ack-output (ackA or ackB) corresponding to the asserted req-input (Fig. 5-b) [4]. Consequently, the 2-1 handshake arbiter, composed of the asynchronous arbiter element and the call element, arbitrates between two input handshakes (REQA-ACKA and REQB-ACKB) and connects one of them to the output handshake (REQX-ACKX, Fig. 5-c). The ABT is composed of a tree structure of 2-1 handshake arbiters to guarantee the fairness of arbitration among more than two requests (Fig. 5-d).
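The arbitrate-and-lock behavior of the ABT described in this section can be captured in a small behavioral model. This is a sketch of the externally visible behavior only, not of the circuit: the class name `LockingArbiter`, the port labels, and in particular the first-come-first-served queue for waiting requesters are assumptions. The paper states only that the tree of 2-1 arbiters guarantees fairness, not the exact service order.

```python
class LockingArbiter:
    """Behavioral model of a shared resource guarded by an arbiter with lock."""

    def __init__(self):
        self.owner = None      # requester currently holding the lock
        self.waiting = []      # pending requesters (FIFO order is an assumption)

    def request(self, requester):
        """Return True if the acknowledge is given immediately."""
        if self.owner is None:
            self.owner = requester
            return True
        # Resource is locked: no acknowledge until the owner releases it.
        self.waiting.append(requester)
        return False

    def release(self, requester):
        """Release the lock and return the next owner, or None if idle."""
        if requester != self.owner:
            raise ValueError("only the locking requester may release")
        self.owner = self.waiting.pop(0) if self.waiting else None
        return self.owner
```

With two input ports competing for one output port, `request("IP_N")` is acknowledged at once, a following `request("IP_E")` is held back, and `release("IP_N")` hands the resource to `"IP_E"`.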

Fig. 4 Handshake circuits: a) pipeline; b) state machine; c) handshake switching; d) Muller C element

Fig. 5 Asynchronous arbiter module: a) Asynchronous arbiter element (Interlock element); b) Call element; c) 2-1 handshake arbiter; d) Arbiter module (ABT)

D. Design Method

Since PCA-1 is a cell array architecture, the chip was designed by completing one PCA cell and copying it to compose the PCA cell array. The PP was designed with a custom layout, while the BP was designed using a standard cell library.

The PP was designed by completing one LUT with a custom layout and copying it. The LUT is designed to prevent hazards or crosstalk on the output caused by changes of unrelated (don't care) address bits.

Asynchronous circuits of the BP were designed by describing their net-lists directly in HDL (Verilog HDL). The Muller-C cell and the asynchronous arbiter element cell were newly designed and used with the standard cell library.

The amount of delay inserted in the handshake circuits is very important because an insufficient delay causes hazards on the handshake circuits and malfunctions. The propagation delay time was estimated for the paths on which request signals or switching logic signals actually propagate, which were extracted by taking account of the switching behavior of the handshake circuits. The critical path of an asynchronous circuit differs from that of a synchronous circuit because the handshake circuit with loops and switches makes it difficult to determine the actual signal paths. The amount of inserted delay is determined so that the delay time of the request signal becomes more than twice the estimated delay time of the data path or the switching logic. This criterion was chosen to ensure sound asynchronous operation. Consequently, the BP performance is affected by the inserted delays, which include a considerable margin.
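The delay criterion above amounts to a simple inequality: the total request delay must exceed twice the estimated data-path (or switching-logic) delay. A minimal sketch of that arithmetic follows; the function name `inserted_delay`, its parameters, and the nanosecond figures in the usage note are hypothetical and do not come from the PCA-1 design data.

```python
def inserted_delay(data_path_ns, request_path_ns, margin=2.0):
    """Lower bound on the extra delay to insert on the request line so that
    the total request delay reaches margin x the estimated data-path delay.
    The text asks for 'more than twice', so the value actually chosen should
    exceed this bound."""
    return max(0.0, margin * data_path_ns - request_path_ns)
```

For a hypothetical data path estimated at 3.0 ns and a request wire that already takes 1.0 ns, `inserted_delay(3.0, 1.0)` gives a bound of 5.0 ns; when the request path is already slower than twice the data path, no delay element is needed and the bound is 0.0.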

Result

A. Chip Specification

Fig. 6 shows the photomicrograph of PCA-1. The PCA-1 chip was fabricated using a 0.35 µm CMOS (4-layer) process. Its operational voltage is 3.3 V; the die size is 10.0x10.0 mm². The package is a 240-pin QFP. The number of transistors is 79K for one PP and 15K for one BP, including test circuits. The propagation delay for one LUT is around 1 ns. The performance of message communication between BPs is around 23 MHz. Writing 1024 words of data to one PP as a 4096 bit SRAM takes around 60 µs, and reading them takes around 72 µs.

B. Confirmed Function

Fig. 7 Snapshot of the self-reproducing objects by hot-electron induced photoemission

All the functions of PCA-1 have been tested and shown to work properly. For example, configuration of objects on the Sea-of-LUTs PP-plane, asynchronous behavior of objects and message communication, and runtime configuration of objects by another object have been confirmed.

To confirm the most significant attributes of PCA, a self-reproducing object consisting of 4 PCA-cells was implemented. When it receives a trigger message, this object can reproduce itself by copying its configuration data while doing another job (Fig. 7). This experiment proves that PCA-1 realizes dynamic and autonomous reconfigurability. In terms of application, the object is able to duplicate itself to increase throughput to meet the performance requirements. The time for cloning is around 300 µs.

Power consumption is proportional to the number of working objects (around 7 mW/object at 16 MHz), while the stand-by current is around 400 µA/chip. This indicates that the dynamic reconfigurability and asynchronous architecture of PCA-1 are effective in reducing power consumption, because unnecessary circuit switching is suppressed.

Conclusion

We have designed, fabricated, and tested the dynamically reconfigurable logic LSI, PCA-1. This LSI differs from conventional reconfigurable logic LSIs in that it realizes the autonomous reconfigurability of logic circuits by an asynchronous architecture and a homogeneous structure. In order to verify these features, a self-reproducing logic circuit was realized on a PCA-1 unit.

References

[1] K. Nagami, K. Oguri, T. Shiozawa, H. Ito, and R. Konishi, "Plastic Cell Architecture: A Scalable Device Architecture for General-Purpose Reconfigurable Computing," IEICE Trans. Electron., Vol. E81-C, No. 9, pp. 1431-1437, September 1998.

[2] N. Imlig, T. Shiozawa, R. Konishi, K. Oguri, K. Nagami, H. Ito, M. Inamori, and H. Nakada, "Programmable Dataflow Computing on PCA," IEICE Trans. Fundamentals, Vol. E83-A, No. 12, pp. 2409-2416, December 2000.

[3] C.L. Seitz, "System Timing," in Introduction to VLSI Systems, C. Mead and L. Conway eds.: Addison-Wesley, 1980, pp. 218-262.

[4] I.E. Sutherland, "Micropipelines," Comm. ACM, Vol. 32, No. 6, pp. 720-738, June 1989.

Fig. 6 Photomicrograph of PCA-1
