Real Computer Science begins where we almost stop reading ...: FSM Design with More Sophisticated Programmable Logic Devices

Sunday, October 6, 2013


The PAL concept was pioneered in the 1970s by Monolithic Memories (which has since merged with Advanced Micro Devices, also known as AMD). It was based on bipolar fuse technology developed for programmable ROMs (all connections are initially available; "blow" the connections you do not want). The primary goal of PAL-based designs was to reduce parts count by replacing conventional TTL logic with more highly integrated programmable logic. Designers frequently report that four TTL packages (10 to 100 "gate equivalents") can be replaced by a single PAL.

PAL-based designs also have the advantage of reduced parts inventory, since a PAL is a "universal" device. You don't need a large stock of miscellaneous TTL components. In addition, PALs support rapid prototyping, because they reduce the number of component-to-component interconnections. Designers can implement bug fixes and new functions within the PALs, often without making changes at the printed circuit board level.

10.3.1 PLDs: Programmable Logic Devices

A number of companies have extended the PAL concept by changing the underlying technology, as well as the component's array of gates and interconnections. Generically, these components are called programmable logic devices (PLDs), with the more sophisticated devices called field-programmable gate arrays (FPGAs). We will examine three representative PLD architectures in this section: Altera MAX, Actel programmable gate array, and Xilinx logic cell array.

10.3.2 Altera Erasable Programmable Logic Devices

Except for very high speed programmable logic, the general trend has been toward CMOS implementation, with its much higher levels of circuit integration and lower power demands than bipolar technologies. PALs were initially based on the same "program once" technology as bipolar PROMs. Altera pioneered the development of erasable programmable logic devices (EPLDs) based on CMOS erasable ROM technology. The EPLD can be erased simply by exposing it to ultraviolet (UV) light and then reprogrammed at a later time. Altera EPLDs are equivalent to 100 to 1000 conventional two-input gates, depending on the model selected.

EPLD Macrocell Architecture The basic element of the EPLD is the macrocell, containing an eight-product-term AND-OR array and several programmable multiplexers. Multiplexers are particularly easy to implement in MOS technology, so it is no surprise that they are pervasive in CMOS-based programmable logic.

Figure 10.25 gives a block diagram/schematic view of the macrocell's contents. Its elements include a programmable AND array, a multiple fan-in OR gate to compute the logic function with programmable output polarity (via the XOR gate), a tri-state buffer driving an I/O pin, a programmable sequential logic block, and a programmable feedback section. Depending on the component, an EPLD may contain from 8 (EP300 series) to 48 (EP1800 series) such macrocells, each of which can be independently programmed.

Let's look at each of the programmable elements of the macrocell. As you will see, it offers more flexibility than any of the PAL architectures we have seen so far.

The macrocell's AND array is crossed with the true and complement of the EPLD's dedicated input and clock pin signals and the internal feedbacks from each of the component's outputs. Crosspoints are implemented by EPROM connections that are initially connected. Unwanted connections are broken by "blowing" the appropriate EPROM bits.

The multiplexers allow the feedback, output, and clock sections to be independently programmed. The MUX selection lines are controlled by their own EPROM bits. Under MUX control, the combinational function can bypass the flip-flop on the way to the output. Thus you can program any output to be either combinational or registered.
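As a rough illustration of the combinational/registered choice, the following Python sketch models a single macrocell. The class name `Macrocell`, its parameters, and the dictionary encoding of product terms are assumptions made for this example and are not Altera's programming model; the clock multiplexer, clear term, and feedback paths are left out.

```python
from typing import Dict, List

# A product term maps a signal name to the literal it needs:
# 1 means the true input is connected, 0 means the complement is connected.
ProductTerm = Dict[str, int]


def product_term(term: ProductTerm, signals: Dict[str, int]) -> int:
    """AND together the connected literals of one product term."""
    return int(all(signals[name] == literal for name, literal in term.items()))


class Macrocell:
    """Hypothetical model: AND-OR array, XOR polarity control, optional register."""

    MAX_TERMS = 8  # eight product terms per macrocell, as stated in the text

    def __init__(self, terms: List[ProductTerm], invert: bool = False,
                 registered: bool = False) -> None:
        if len(terms) > self.MAX_TERMS:
            raise ValueError("too many product terms for one macrocell")
        self.terms = terms
        self.invert = invert          # polarity bit feeding the XOR gate
        self.registered = registered  # output MUX selection bit
        self.q = 0                    # contents of the D flip-flop

    def combinational(self, signals: Dict[str, int]) -> int:
        """OR of the product terms, optionally inverted by the XOR gate."""
        or_out = int(any(product_term(t, signals) for t in self.terms))
        return or_out ^ int(self.invert)

    def clock(self, signals: Dict[str, int]) -> None:
        """Clock edge: the flip-flop captures the combinational value."""
        self.q = self.combinational(signals)

    def output(self, signals: Dict[str, int]) -> int:
        """The output MUX either bypasses the flip-flop or selects it."""
        return self.q if self.registered else self.combinational(signals)
```

With the two product terms `{"a": 1, "b": 0}` and `{"a": 0, "b": 1}`, the same cell computes a XOR b either directly or one clock edge later, depending only on the `registered` bit.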

Similarly, macrocell feedback into the AND arrays can come from the registered output or from the external pin. You can program many of the pins to be either output or input.

Some variations on the macrocell architecture support dual feedback, making it possible to use the register for internal state while the pin is used as an independent input. This is an application of the concept of buried registers mentioned in Section 10.1.3.

The programmable clock section allows you to (1) clock the registers synchronously in groups by a dedicated clock input or (2) clock them by a local signal within the macrocell. Since the latter is a product term, the register's clock signal can be any combination of inputs or external clocks.

The two possible configurations of the clock multiplexer are shown in Figure 10.26.

In the first configuration, selected by the bit programmed into the EPROM cell, the flip-flop is clocked by the global clock while a product term from the AND array selectively enables the output. This is called synchronous mode because the output register is clocked by a global clock signal, shared among all macrocells. This signal can cause all outputs to change at the same time.

Alternatively, the clock multiplexer can be configured so a local clock, computed from a distinguished AND array product term, controls the output register. In this mode, the output is always enabled, driving the output pin. Since every macrocell can generate its own local clock, the output can change at any time. This is called asynchronous mode.

The register embedded in the macrocell can be configured as a D or T flip-flop, either positive or negative edge-triggered. Since J-K or R-S flip-flops can be implemented in terms of D or T flip-flops, they are realized by providing the appropriate mapping logic in the AND-OR array.
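The mapping logic can be checked with a short Python sketch. It uses the standard excitation equations D = J·Q' + K'·Q and T = J·Q' + K·Q; the function names are illustrative only.

```python
def jk_via_d(j: int, k: int, q: int) -> int:
    """Next state of a J-K flip-flop built around a D flip-flop.

    The mapping logic D = J*Q' + K'*Q takes two product terms in the AND-OR array.
    """
    d = (j & (1 - q)) | ((1 - k) & q)
    return d  # a D flip-flop loads D on the clock edge


def jk_via_t(j: int, k: int, q: int) -> int:
    """Next state of a J-K flip-flop built around a T flip-flop.

    The mapping logic is T = J*Q' + K*Q; a T flip-flop toggles when T is 1.
    """
    t = (j & (1 - q)) | (k & q)
    return q ^ t


def jk_reference(j: int, k: int, q: int) -> int:
    """Textbook J-K behavior: hold, reset, set, toggle."""
    if (j, k) == (0, 0):
        return q
    if (j, k) == (0, 1):
        return 0
    if (j, k) == (1, 0):
        return 1
    return 1 - q
```

Both mappings agree with the reference behavior for all eight combinations of J, K, and Q.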

The final programmable element of the macrocell is the register clear signal. One of the AND array's product terms is dedicated to provide this function.

Altera MAX Architecture The major problem with all AND-OR structures is the difficulty of sharing product terms among macrocells. In a conventional PAL, you cannot share the same product term across different OR gates. The term must be repeated for each output. This can lower the efficiency of the PAL, reducing the number of equivalent discrete gates it can replace.

As programmable logic devices become even more highly integrated, the architectures must evolve to provide more area for global routing of signals. It must be possible to share terms and outputs between macrocells more easily. Altera has addressed these problems in their multiple array matrix (MAX) family of parts. We describe the structure of MAX components in this subsection.

Macrocells, similar in structure to Figure 10.25, are grouped into Logic Array Blocks (LABs). Associated with each LAB is a group of additional product terms, usable by any of the macrocells within the LAB. These Expander Product Terms make it possible to implement a function with up to 35 product terms inside a single macrocell. This compares to only eight terms per function in most PAL families.

In addition, a Programmable Interconnect Array (PIA) can route the LAB's macrocell outputs globally throughout the device. Some lower-density devices also use the PIA to route the product term expanders.

Figure 10.27 shows the generic architecture of a MAX component, the EPM5128. The device has eight dedicated inputs (including the clock), 64 programmable I/O pins, eight LABs, 16 macrocells per LAB (128 macrocells total, although not all macrocells are connected to an output pin), and 32 product term expanders per LAB (256 total). The dedicated input pins come in along the top and are distributed to each of the eight LABs. The PIA routes global signals. All on-chip signals have a connection path to the PIA. Only the signals needed by a particular LAB are connected to it under EPROM programming.

The newest top-of-the-line Altera MAX component is the EPM5192. This device contains 192 macrocells organized into 12 LABs. Altera claims that a single EPM5192 can replace up to 100 TTL SSI and MSI components or 20 P22V10 PALs (for the implementation power of this kind of PAL, see Figure 10.28).

Altera has under development a 7000 series of parts that organizes the LABs into multiple rows and columns around the programmable interconnect. The plan is to develop a chip architecture offering 1500 to 20,000 gate equivalents.

Figure 10.28 shows more details of the LAB's internal organization. It consists of an array of macrocells sharing a product term expander array. All macrocell outputs and I/O block inputs are connected to the PIA. Selected signals from the PIA are input to the macrocells.

Figure 10.30 gives more details of the implementation of the expander product terms. The AND array is crossed by the dedicated inputs and the feedback signals from the macrocells. The expander terms also form some of the columns of the array. In other words, they appear like inputs to the macrocells. Any expander term can be shared by all of the macrocells in the LAB.

If you use expander terms you quickly get into multilevel logic structures. Optimization techniques such as those in misII are absolutely necessary to do a good job of mapping logic onto these structures.

Once you reach devices as complex as the advanced MAX family, you would be unlikely to try to generate the personality map by hand. Altera provides an extensive tool set for mapping logic schematics onto the primitives supported by their EPLD structures.

EEPROM Technology for EPLDs Another class of erasable PLDs is based on the technology of electrically erasable programmable ROM (EEPROM). This has two advantages. First, less expensive packaging can be used because there is no chip window for the erasing UV light. Second, crosspoints can be reprogrammed individually, which can speed up the process if only a small number of changes are needed. The macrocell and chip architectures are similar to those described in this section.

10.3.3 Actel Programmable Gate Arrays

Actel programmable logic chips provide what amounts to a field-programmable gate array structure. The chip contains rows of personalizable logic building blocks separated by horizontal routing channels. The programming method is a proprietary "antifuse" technology, so called because the connector's resistance changes from high to low when a high voltage is placed across it. This is the opposite of a conventional PAL or PROM based on fuse technology. The antifuses require a very small area, so the chip can have more connections than with other technologies. Unlike the EPLDs, Actel parts can be programmed only once.

The elements of the Actel architecture are I/O buffers, logic modules, and interconnect.

Figure 10.31 shows a chip "floor plan" or block diagram. Programmable I/O buffers, and special programming and test logic are along the chip's edges. The I/O pins can be configured as input, output, tri-state, or bidirectional buffers, with or without internal latches.

Internally, the building blocks are organized into multiple rows of logic modules separated by wiring tracks. Each logic module is an eight-input, one-output configurable combinational logic function (the internal structure is described in the next subsection). You can program the module to implement a large number of two-, three-, and four-input logic gates, as well as two-level AND/OR and OR/AND gates. There are no dedicated flip-flops, although D and J-K storage elements can be constructed from two connected modules.

Horizontal wiring tracks provide the main interconnection. Although the tracks run across the length of the chip, a given wire can be partitioned into segments for several interconnections. In addition, vertical wires pass through the logic modules and span multiple wiring channels. Four inputs come from the track above the logic module and four from the track below.

The ACT 1 component family is organized around 25 horizontal and 13 vertical routing tracks. The arrays contain 1200 to 2000 gates, equivalent to two- or three-input NAND and NOR functions. Because of the flexibility of the routing and the personalization of the logic module, Actel claims these are equivalent to 3000 to 6000 gates in a more typical PLD. A second-generation family, under development, will have more horizontal and vertical routing tracks. The new components will contain up to 8000 gates, which Actel claims are equivalent to 20,000 gates in a conventional PLD.

Actel Logic Module The logic module is a modified four-to-one multiplexer. Its block structure is shown in Figure 10.32.

D0, D1, D2, D3, SOA, SOB, S0, S1 are inputs selected through programmable connections from the wiring tracks either above or below the logic block. Y is the single output. It can be routed to the horizontal tracks through programmable connections.

A remarkable number of logic functions can be implemented with this simple building block. For example, let's see how the module can be used to implement a two-input AND gate with inputs A and B. We simply wire A to D1, 0 to D0, and B to SOA. Then wire S0 and S1 to 0. If B is 1 then Y receives A; otherwise it receives 0. This is essentially an AND function.

The symmetry of the logic module makes it easy to implement functions whether the inputs are available above or below the module. For example, if the inputs to the AND "gate" are not available from the top of the module, the lower two-to-one multiplexer could easily be used to implement the function.
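The AND-gate wiring can be sketched in Python. The model below assumes the module is two two-to-one multiplexers (selected by SOA and SOB) feeding a final two-to-one multiplexer whose select is S0 OR S1; that internal structure is an assumption of this sketch, since Figure 10.32 is not reproduced here, and the function names are illustrative.

```python
def mux2(select: int, when_0: int, when_1: int) -> int:
    """Two-to-one multiplexer."""
    return when_1 if select else when_0


def actel_module(d0: int, d1: int, d2: int, d3: int,
                 soa: int, sob: int, s0: int, s1: int) -> int:
    """Assumed structure of the eight-input, one-output logic module."""
    upper = mux2(soa, d0, d1)          # upper multiplexer: D0 or D1
    lower = mux2(sob, d2, d3)          # lower multiplexer: D2 or D3
    return mux2(s0 | s1, upper, lower)  # final stage picks upper or lower


def and_gate_upper(a: int, b: int) -> int:
    """Wiring from the text: A to D1, 0 to D0, B to SOA, S0 = S1 = 0."""
    return actel_module(d0=0, d1=a, d2=0, d3=0, soa=b, sob=0, s0=0, s1=0)


def and_gate_lower(a: int, b: int) -> int:
    """Same gate on the lower multiplexer; S0 is tied to 1 to select it."""
    return actel_module(d0=0, d1=0, d2=0, d3=a, soa=0, sob=b, s0=1, s1=0)
```

Either wiring yields A AND B, which is the symmetry the text relies on when inputs arrive from the track below the module rather than the one above.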

The logic module is not limited to implementing combinational functions. Let's look at how to implement an R-S latch with active-low inputs using a single module. The approach is shown in Figure 10.33.

If the reset input R is 0, Q is set to 0, the output of the upper two-to-one multiplexer. If R is 1, then Q is set to the output from the lower multiplexer. This depends on the set input S. If S is 0, then Q is set to 1. Otherwise, Q is again set to its current value. This is exactly the function of the active-low R-S latch.
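The latch behavior described above can be simulated with a small Python sketch. The module structure (two two-to-one multiplexers feeding a final stage selected by S0 OR S1) and the specific pin assignment are assumptions of this example; the actual wiring in Figure 10.33 may differ.

```python
def mux2(select: int, when_0: int, when_1: int) -> int:
    """Two-to-one multiplexer."""
    return when_1 if select else when_0


def actel_module(d0: int, d1: int, d2: int, d3: int,
                 soa: int, sob: int, s0: int, s1: int) -> int:
    """Assumed structure of the logic module."""
    upper = mux2(soa, d0, d1)
    lower = mux2(sob, d2, d3)
    return mux2(s0 | s1, upper, lower)


def latch_next_q(reset_n: int, set_n: int, q: int) -> int:
    """One evaluation of the latch with active-low reset and set.

    Assumed wiring: upper multiplexer tied to 0, lower multiplexer chooses
    between constant 1 (set_n = 0) and the fed-back output Q (set_n = 1),
    and reset_n drives the final select stage.
    """
    return actel_module(d0=0, d1=0, d2=1, d3=q,
                        soa=0, sob=set_n, s0=reset_n, s1=0)
```

For example, `latch_next_q(0, 1, q)` forces 0, `latch_next_q(1, 0, q)` forces 1, and `latch_next_q(1, 1, q)` returns `q` unchanged.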

Actel Interconnect The routing of signals through the array is one of the most innovative features of the Actel architecture. Antifuses are placed wherever a horizontal and a vertical wire cross, as well as between adjacent horizontal and vertical wire segments.

The interconnection "fabric" and its relation to the logic modules are shown in Figure 10.34.

The pass transistors and lines controlling their gates are used in programming to isolate a particular antifuse. Placing a high voltage across the antifuse establishes a bidirectional interconnection between the two crossing wires.

The logic modules must be carefully placed and then wired by routing interconnections through the network of antifuses. Because of the resistance and capacitance associated with crossing an antifuse, speed-critical signals pass through two antifuses. Most connections can be performed in two or three hops. The worst case might require four hops.

These concepts are illustrated in Figure 10.35.

Every jog from a horizontal to a vertical wiring track and vice versa crosses an antifuse. To go from an output to the upper input requires one jog from horizontal to vertical, a second from vertical to horizontal, and a final jog to vertical again.

Some wire segments may be blocked by previously allocated segments. This is shown by the lower logic module input. To get to it, the output signal jogs vertically to a new horizontal line, to which the input vertical line can be connected. In general, a horizontal segment is made longer by jogging to another overlapping segment via an available vertical wire.

10.3.4 Xilinx Logic Cell Arrays

Xilinx takes another approach to bringing the PLD concept to higher levels of integration. Their programming method is based on CMOS static RAM technology: RAM cells sprinkled throughout the chip determine the personality of logic blocks and define the connectivity of signal paths. Static RAM circuits are important in integrated circuit technology and will continue to get denser and faster.

The RAM cells are linked into a long shift register, and the programming involves shifting in strings of ones and zeros to personalize the function of the chip. The devices come with an on-board hardwired finite state machine that allows the program to be downloaded from a standard ROM part. The Xilinx approach has the advantage of fast reprogrammability, although the chip loses its program each time it is powered down.

Figure 10.36 shows a portion of the chip architecture of the Xilinx logic cell array. The major components are I/O blocks (IOBs) and configurable logic blocks (CLBs). The programmable I/O blocks are placed around the periphery, while the CLBs are arrayed in the central part of the chip. Horizontal and vertical wiring channels separate the various components. We will examine these components in the following subsections.

Xilinx currently supports three component families, the 2000, 3000, and 4000 series. We will discuss the 3000 series, the one most commonly encountered in practice. The XC3020 component contains 64 IOBs and 64 CLBs arranged in an eight-by-eight matrix. Xilinx claims that the component contains 2000 equivalent logic gates. The largest member of the family, XC3090, contains 320 CLBs and 144 IOBs and is claimed to be equivalent to 9000 two-input gates.

Xilinx I/O Block Figure 10.37 shows the internal architecture of the I/O block. The inputs to the I/O block are a tri-state enable, the bit to be output (OUT) to the package pad, and the input and output clocks. The outputs from the block are the input (Direct In or Registered In) signals. The block contains registers in both the input and output paths. These can be reset by a global reset signal provided to the block.

First, let's consider how the block can be used when it is associated with an output pad. The active high or low sense of the OUT signal and the output enable can be set by internal options, stored in RAM cells within the block. The output signal can be direct (combinational) or from the dedicated output register (registered). This register is an edge-triggered D flip-flop.

The slew rate control on the output buffer is used to slow down the rise time of output signals. It can reduce noise spikes in designs where large numbers of outputs change at the same time. Outputs can be fast (5 ns switching time) or slow (30 ns switching time).

Let's now consider the block with the pad used as an input. The input signal can come from the dedicated input register or directly from the input pad. The input register can be an edge-triggered flip-flop or a transparent latch. The input pull-up is intended for use with unused IOBs, so that internal signals are not permitted to float. It cannot be used by the output buffer.

Xilinx CLB Figure 10.38 gives the internal view of the CLB. It has five general purpose data inputs (A, B, C, D, E), one clock input, one clock enable input, data in (DIN), Reset, and two outputs, X and Y. The outputs can be registered or direct. In addition, the CLB has a combinational function generator, two storage elements, and five programmable multiplexers.

The function generator takes seven inputs, five from the programmable interconnect (A, B, C, D, E), and two from internal flip-flop feedbacks (Q1, Q2). It also produces two internal outputs (F, G).

Personality RAM bits let us configure the function block in one of three different ways. With the first option, the function generator can compute any Boolean function of five variables. The distinct inputs are A, one of B/Q1/Q2, one of C/Q1/Q2, D, and E. Both F and G carry the same output value. For example, in this mode, a single combinational function generator can implement a 5-bit odd parity function:

$F = A \oplus B \oplus C \oplus D \oplus E$
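A function generator of this kind behaves like a lookup table addressed by its inputs. The Python sketch below builds the 32-entry table for the five-input odd parity function; `build_lut` and `lut_read` are hypothetical helper names, and the bit ordering of the table index is a choice made for this example.

```python
from itertools import product
from typing import Callable, List


def build_lut(func: Callable[..., int], num_inputs: int) -> List[int]:
    """Tabulate func over all input combinations, first input as the high bit."""
    return [func(*bits) for bits in product((0, 1), repeat=num_inputs)]


def lut_read(lut: List[int], *bits: int) -> int:
    """Look up the table entry addressed by the given input bits."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return lut[index]


# Odd parity: the output is 1 when an odd number of inputs are 1.
parity_lut = build_lut(lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e, 5)
```

Any other Boolean function of five variables would fill the same 32 entries with a different pattern, which is why the delay through the block does not depend on the function.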

With the second option, the function generator can compute two independent functions of four variables each. The inputs are A, one of B/Q1/Q2, one of C/Q1/Q2, and one of D/E. F and G carry separate outputs. In this mode, a single function generator can compute a simple 2-bit comparator. Suppose the two 2-bit numbers are represented as A, B and C, D, respectively. Then the greater than (GT) and equal (EQ) functions can be computed as follows, taking A and C as the high-order bits:

$GT = A\overline{C} + AB\overline{D} + B\overline{C}\,\overline{D}$

$EQ = \overline{A \oplus C}\;\overline{B \oplus D}$
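One way to convince yourself that a pair of four-variable functions really implements the comparator is to check them exhaustively. The Python sketch below does so for one sum-of-products form of GT and a product-of-XNORs form of EQ, assuming A and C are the high-order bits; the function names are illustrative.

```python
from itertools import product


def gt(a: int, b: int, c: int, d: int) -> int:
    """1 when the number AB is greater than the number CD."""
    nc, nd = 1 - c, 1 - d
    return (a & nc) | (a & b & nd) | (b & nc & nd)


def eq(a: int, b: int, c: int, d: int) -> int:
    """1 when AB and CD are equal: both bit pairs must match."""
    return (1 - (a ^ c)) & (1 - (b ^ d))


def comparator_is_correct() -> bool:
    """Compare both functions against integer arithmetic for all 16 inputs."""
    for a, b, c, d in product((0, 1), repeat=4):
        x, y = 2 * a + b, 2 * c + d
        if gt(a, b, c, d) != int(x > y) or eq(a, b, c, d) != int(x == y):
            return False
    return True
```

Since both functions depend on the same four variables, they fit the F and G outputs of a single function generator in this mode.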

The final option implements certain restricted functions of more than five inputs. The variable E selects between two independent functions, each computed from the inputs A, one of B/Q1/Q2, one of C/Q1/Q2, and D. The three different options are summarized in Figure 10.39.
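The third option can be pictured as a multiplexer controlled by E. In the Python sketch below, the two four-variable functions are arbitrary examples, and the convention that E = 0 selects the first one is an assumption; the point is only that the result can depend on A, B, C, D, E, Q1, and Q2 together.

```python
def restricted_function(a: int, b: int, c: int, d: int, e: int,
                        q1: int, q2: int) -> int:
    """E selects between two independent four-variable functions.

    The first function uses B and C from the interconnect; the second uses
    the flip-flop feedbacks Q1 and Q2 in those two input positions.
    """
    f_when_e0 = a & b & c & d      # example function of A, B, C, D
    f_when_e1 = a ^ q1 ^ q2 ^ d    # example function of A, Q1, Q2, D
    return f_when_e1 if e else f_when_e0
```

The function is "restricted" because not every Boolean function of seven variables decomposes this way; only those that split into two four-variable halves selected by E can be mapped.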

The internal flip-flops are positive edge triggered and can be controlled by the clock or its complement, depending on the configuration setting. They share a common clock signal. Their data sources are the internally generated functions F and G or the DIN input from the interconnect. An active high asynchronous reset signal can set both registers to zero. When the enable clock input is unasserted, the flip-flops hold their current state, ignoring the clock signals and the inputs. The flip-flop outputs, Q1 and Q2, are fed back as inputs to the function generator, making it possible to implement sequential functions within a single CLB.

The two outputs from the CLB, X and Y, can be driven from the flip-flops or directly from the F and G outputs of the function generator. The CLB organization is reasonably symmetric, making it possible to interchange the top or bottom inputs/outputs to reduce the complexity of interblock routing.

CLB Application Examples A small number of CLBs can implement a wide range of combinational functions.

Figure 10.40 shows a few examples of majority logic and parity checking. An n-input majority function asserts a 1 whenever $\lceil n/2 \rceil$ or more inputs are 1. Clearly, a single CLB can implement a five-input majority circuit: this is nothing more than a combinational function of five variables!

Now consider a seven-input majority circuit. This can be implemented with three CLBs as shown in the figure. The first-level CLBs count the number of ones in their three inputs, outputting the patterns 00, 01, 10, or 11. Although the first-level CLBs are functions of only three inputs, they use both of their outputs. The second-level CLB sums its three sets of inputs (two 2-bit inputs and one 1-bit input); if these equal or exceed four, it asserts the majority output.
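The three-CLB decomposition can be checked exhaustively with a short Python sketch. Each function below stands for one CLB; the names and the assignment of inputs to first-level blocks are illustrative.

```python
from itertools import product
from typing import Tuple


def count_ones_clb(x: int, y: int, z: int) -> Tuple[int, int]:
    """First-level CLB: two outputs encoding the number of ones (00 to 11)."""
    total = x + y + z
    return (total >> 1) & 1, total & 1  # (high bit, low bit)


def majority_clb(h1: int, l1: int, h2: int, l2: int, extra: int) -> int:
    """Second-level CLB: a five-input function that sums and compares with 4."""
    return int((2 * h1 + l1) + (2 * h2 + l2) + extra >= 4)


def majority7(i0: int, i1: int, i2: int, i3: int,
              i4: int, i5: int, i6: int) -> int:
    """Seven-input majority built from three CLBs."""
    h1, l1 = count_ones_clb(i0, i1, i2)
    h2, l2 = count_ones_clb(i3, i4, i5)
    return majority_clb(h1, l1, h2, l2, i6)
```

The second-level block sees exactly five signals (two 2-bit counts and the seventh input), so it fits the five-variable mode of a single function generator.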

Now consider a parity-checking circuit. A single CLB can implement a 5-bit-wide parity checker, as we have already seen. Cascading two CLBs, as shown in the figure, yields a nine-input circuit. With two levels of CLBs, we can extend the scheme to 25-bit-wide parity logic.
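The cascading arithmetic (5 + 4 = 9 inputs for two CLBs in series, 5 x 5 = 25 inputs for a two-level tree) can be sketched in Python, with each call to `parity5` standing for one CLB. The function names are illustrative.

```python
from typing import Sequence


def parity5(a: int, b: int, c: int, d: int, e: int) -> int:
    """One CLB: five-input odd parity."""
    return a ^ b ^ c ^ d ^ e


def parity9(bits: Sequence[int]) -> int:
    """Two cascaded CLBs: the first result takes one input of the second."""
    first = parity5(*bits[:5])
    return parity5(first, *bits[5:9])


def parity25(bits: Sequence[int]) -> int:
    """Two levels: five first-level CLBs feed one second-level CLB."""
    partial = [parity5(*bits[i:i + 5]) for i in range(0, 25, 5)]
    return parity5(*partial)
```

The 25-input version uses six CLBs but still incurs only two CLB delays, because the five first-level blocks operate in parallel.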

As another combinational logic example, consider how we might implement a 4-bit binary adder. A full adder can be implemented in a single CLB. The Cout and Si outputs are functions of the three inputs Ai, Bi, and Cin. To get a 4-bit adder, we simply cascade four CLBs. This is shown in Figure 10.41(a).

An alternative approach is to use the 2-bit binary adder as a building block. This circuit has five inputs, A1, A0, B1, B0, and a Cin. The outputs S1, S0, and Cout are each functions of these five variables. Thus, the 2-bit adder can be implemented with three CLBs. To construct a 4-bit adder, we cascade two 2-bit units for a total of six CLBs. This is shown in Figure 10.41(b).

The second implementation may not look attractive, but it has some advantages. It incurs two CLB delays in computing the 4-bit sum, one to compute the carry between the low-order 2 bits and the high-order sums and one to compute the final sums of the high-order bits. This compares with four CLB delays in the implementation based on the standard full adders.

This example illustrates some of the trade-offs between CLB resources and delay. The delay through the logic block is fixed, independent of the function it is implementing. The first approach uses fewer CLB resources than the second but is actually slower.
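The two adder organizations can be compared with a Python sketch. Each `full_adder` call stands for one CLB and each `adder2` call for a group of three CLBs working in parallel; the function names and the `RESOURCES` summary table are constructs of this example, with the counts taken from the discussion above.

```python
from typing import Tuple


def full_adder(a: int, b: int, cin: int) -> Tuple[int, int]:
    """One CLB: sum and carry-out as two functions of three variables."""
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)


def ripple_adder4(a: int, b: int, cin: int = 0) -> Tuple[int, int]:
    """Four cascaded full-adder CLBs; the carry ripples through all four."""
    total, carry = 0, cin
    for i in range(4):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


def adder2(a: int, b: int, cin: int) -> Tuple[int, int]:
    """2-bit adder: S1, S0, and Cout each computed by its own CLB."""
    value = a + b + cin
    return value & 0b11, value >> 2


def block_adder4(a: int, b: int, cin: int = 0) -> Tuple[int, int]:
    """Two cascaded 2-bit adders; only one carry passes between blocks."""
    low, carry = adder2(a & 0b11, b & 0b11, cin)
    high, cout = adder2(a >> 2, b >> 2, carry)
    return (high << 2) | low, cout


# CLB count and worst-case CLB delays for the 4-bit sum, per the text.
RESOURCES = {
    "full-adder cascade": {"clbs": 4, "clb_delays": 4},
    "2-bit adder cascade": {"clbs": 6, "clb_delays": 2},
}
```

Both functions return the same (sum, carry-out) pair for every input; they differ only in how many CLBs they consume and how many block delays lie on the critical path.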

Obviously, the CLB structure is not limited to combinational circuits. Because of the two flip-flops per CLB, it is possible to construct 4-bit counters of various kinds using just two CLBs. The CLB inputs are the current state, Q3, Q2, Q1, and Q0. The outputs of the first CLB are the higher-order 2 bits of the counter. The output of the second is the lower-order 2 bits.
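As one example of such a counter, the Python sketch below models a 4-bit binary up-counter split across two CLBs. The choice of an up-counter and the function names are assumptions of this example; each function returns the next values of the two flip-flops in its CLB.

```python
from typing import Tuple


def high_clb(q3: int, q2: int, q1: int, q0: int) -> Tuple[int, int]:
    """First CLB: next values of the higher-order bits Q3 and Q2."""
    next_q3 = q3 ^ (q2 & q1 & q0)  # toggles when all lower bits are 1
    next_q2 = q2 ^ (q1 & q0)
    return next_q3, next_q2


def low_clb(q3: int, q2: int, q1: int, q0: int) -> Tuple[int, int]:
    """Second CLB: next values of the lower-order bits Q1 and Q0."""
    next_q1 = q1 ^ q0
    next_q0 = 1 - q0               # the lowest bit toggles every clock
    return next_q1, next_q0


def counter_step(state: int) -> int:
    """Advance the 4-bit counter by one clock edge."""
    q3, q2, q1, q0 = ((state >> i) & 1 for i in (3, 2, 1, 0))
    n3, n2 = high_clb(q3, q2, q1, q0)
    n1, n0 = low_clb(q3, q2, q1, q0)
    return (n3 << 3) | (n2 << 2) | (n1 << 1) | n0
```

Each CLB computes two functions of at most four variables, so both fit the two-function mode of the function generator, with the results captured in the CLB's two flip-flops.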

Xilinx Interconnect The Xilinx chip architecture supports three methods of interconnecting the CLBs and IOBs: (1) direct connections, (2) general-purpose interconnect, and (3) long line interconnections. With direct connections, adjacent CLBs are wired together in the horizontal or vertical direction. General-purpose interconnect is used for longer distance connections or for signals with a moderate fan-out. The long lines are saved for time-critical signals that must be distributed to many CLBs with minimum signal skew, such as clock signals.

Direct connections provide the fastest, shortest-distance form of interconnect. Thus, it is important for the software that assigns logic functions to available CLBs to place related logic in adjacent CLBs. The X output of a CLB can be connected to the B input of the CLB to its right or the C input of the CLB to its left. The Y output can be connected to the D input of the CLB above it and the A input of the CLB below it.

These direct connections are shown in Figure 10.42.

We show four CLBs, with inputs A, CE (clock enable), DI (data in), B, C, K (clock), E, D, and R (reset) and outputs X and Y. The relative placement of connection pins on the CLBs is geometrically accurate.

CLB2's X output is connected to the B input of CLB3. The X output of CLB1 connects to the C input of CLB0. All of the X connections are horizontal. Similarly, the Y output of CLB1 is connected to the A input of CLB3. The Y output of CLB2 connects to the D input of CLB0. The Y connections are vertical.

We can also use direct connections to wire CLBs and IOBs. There are two IOBs next to each CLB along the top or bottom row of the logic cell array. Along the top, the CLB A input can be driven from the output of one IOB, and the CLB Y output is connected to the other IOB's input. Along the bottom, the D input plays the same role as the A input along the top.

Along the right edge, the X output can be connected to one IOB while the Y output is connected to either of the adjacent IOBs. The C input is driven from one of the IOBs. Along the left edge, the B input can be driven from an IOB, and the IOB can be driven from the X output. CLBs in the corners can connect to IOBs in two dimensions.

Interleaved among the checkerboard of CLBs are the horizontal and vertical wiring channels of the Xilinx general-purpose interconnect. Each channel contains five wires. At the intersections, a programmable switching matrix connects the wires, as shown in Figure 10.42. The figure shows a connection path between the Y output of CLB0 and the D input of CLB1.

The general-purpose interconnect places many restrictions on what can be connected to what. For example, it is not possible to connect every switch matrix pin to every other pin. The pin we used, the second pin from the left on the top, is only connected to the top pins on the left and the right, the three leftmost pins on the bottom, and the second pin from the top on the right. As another example of wiring restrictions, the D input of a CLB can be connected to the second wire in the horizontal channel but not the top wire. Fortunately, Xilinx's software for placing and routing the LCA is aware of the restrictions in the wiring fabric and can hide most of these considerations from you.

The final forms of interconnect are the long lines. There are two such lines per row and three per column. In addition, a single global long line is driven by a special buffer and distributed to every column. It can be connected to the K input of every CLB.

One more special global signal is not shown in the wiring diagrams. A global reset line, connected to the chip reset pin, can force all flip-flops in the LCA to zero, independent of the individual CLB reset input.

Implementing the BCD-to-Excess-3 Converter with a Xilinx Logic Cell Array In this subsection, we examine the implementation of the BCD-to-Excess-3 converter finite state machine using the LCA structure. The next-state and output equations are

Suppose we adopt a synchronous Mealy implementation style. Then each of the four functions requires its own flip-flop. Since no function is more complex than four variables, we can implement each one in one-half of a CLB.

To give you a feeling for the size of functions that can be implemented in a Xilinx chip, the smallest configuration contains 64 CLBs. The example finite state machine uses only 1/32 of the CLB resources of the array.

One critical issue is how to provide reset to the finite state machine. We will use the global reset signal derived from the dedicated reset pin on the LCA package. When this signal is asserted, all flip-flops in the array are set to zero.

Xilinx provides software to map a logic schematic into a placed and routed collection of CLBs, so designers rarely have to deal with the array at this level of detail. Still, it is instructive to understand some of the routing details, because they have a critical effect on performance. Xilinx permits you to hand route critical nets to tune circuit performance (or to help the automatic router complete a difficult routing), if desired.

Figure 10.43 shows a possible interconnection scheme. It uses global long lines, horizontal long lines, direct connections, and general-purpose connections. The global long line is dedicated to the clock signal, which drives the K input of the two CLBs. The horizontal long line carries the clock enable signal (CE). Xilinx makes it possible to attach the horizontal lines to pull-up resistors, so this wire can always carry a 1. The vertical long lines are not used in this example.

We have placed Q2 and Q0 in CLB1, with Q1 and Z in CLB2. This partitioning helps to minimize the use of general-purpose interconnect. CLB1 is implemented without any external inputs, by exploiting the connections inside the CLB.

Since the flow of signals is horizontal from left to right and only CLB1's X output (Q2) can be routed through horizontal direct interconnect, the Y output (Q0) must go through general-purpose interconnect.

Normally, the Y output of CLB2 (Z) would be wired to the vertical channel, eventually reaching an IOB at the periphery of the array of CLBs. If these CLBs are placed along the top row of the array, direct connections can be used to make the machine's inputs and outputs directly available. An input IOB, carrying the machine's X input, can be wired to the A input of CLB2. Also, CLB2's Y output, carrying the machine's Z output, can be connected to an adjacent IOB configured for output.

Traffic Light Controller

In this section, we will examine several alternative implementations for the traffic light controller finite state machine. We start by decomposing the basic machine into its constituent subsystems. Besides the next-state and output functions, we also need logic for the timing of the lights and for detecting the presence of a car at the intersection.

10.4.1 Problem Decomposition: Traffic Light State Machine

Of course, there are many possible ways to organize the components of the traffic light system. Here is the decomposition we will use:

Controller finite state machine
  next state/output combinational functions
  state register
Short time/long time interval counter
Car sensor
Output decoders and traffic lights

System Block Diagram A block diagram description for this decomposition is shown in Figure 10.44.

The controller finite state machine takes as input the Reset, Clk, TL, and TS signals, as well as a synchronized C signal, and generates the ST signal and encoded light signals (00 = green, 01 = yellow, 10 = red). The interval counter subsystem takes Clk, Reset, and ST as inputs, generating TL and TS as outputs. The car sensor subsystem has an asynchronous sensor input C, which it outputs as a synchronized signal. The light decoders translate the encoded light control signals into signals to drive the individual lights.

It is reasonable to generate outputs that directly control the lights, rather than the encoded scheme we have chosen. Since the actual traffic lights are probably relatively far from the traffic light controller hardware, the encoded scheme has the advantage of fewer wires that need to be routed that distance. But the approach requires additional logic to do the decoding near the lights.

In the way we have drawn Figure 10.44, the finite state machine is a Mealy machine. Is it synchronous or asynchronous? Because the inputs C (sync), TS, and TL change with the clock, it is a synchronous machine. To be thorough, Reset should also be synchronized.

In the following subsections, we look at the logic for the next state and outputs, the car detector, the light decoders, and the interval timer.

Next-State Logic and Outputs The finite state machine has six inputs: Reset, C, TL, TS, and the current state (Q1, Q0); and seven outputs: the next state (P1, P0), ST, H1,0 (encoded highway lights), and F1,0 (encoded farmroad lights).

An espresso truth table file, such as that in Figure 9.21, can be used to specify the transition table for the state machine. From this, we can generate discrete gate, PAL, or PLA-style logic for the next state and other outputs.
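As a rough behavioral sketch of what such a transition table encodes, the next-state and output functions can be written as a small Python function. The exit conditions and the HG/HY light settings come from the text; the FG/FY light settings are filled in by symmetry, and the names `State`, `LIGHTS`, and `step` are hypothetical. Reset behavior (return to HG without asserting ST) is also an assumption.

```python
from enum import Enum

class State(Enum):
    HG = "highway green"
    HY = "highway yellow"
    FG = "farmroad green"
    FY = "farmroad yellow"

# Light encoding from the text: 00 = green, 01 = yellow, 10 = red.
GREEN, YELLOW, RED = 0b00, 0b01, 0b10

# (highway, farmroad) codes per state; FG and FY are assumed by symmetry.
LIGHTS = {
    State.HG: (GREEN, RED),
    State.HY: (YELLOW, RED),
    State.FG: (RED, GREEN),
    State.FY: (RED, YELLOW),
}

ORDER = [State.HG, State.HY, State.FG, State.FY]

def step(state, c, tl, ts, reset=False):
    """Return (next_state, st, highway_code, farmroad_code) for one clock tick."""
    highway, farmroad = LIGHTS[state]
    if reset:
        # Assumption: reset forces HG and does not assert ST.
        return State.HG, False, highway, farmroad
    exit_condition = {
        State.HG: tl and c,        # long interval elapsed and a car is waiting
        State.HY: ts,              # short interval elapsed
        State.FG: tl or not c,     # long interval elapsed or no car left
        State.FY: ts,
    }[state]
    if exit_condition:
        next_state = ORDER[(ORDER.index(state) + 1) % 4]
    else:
        next_state = state
    # ST is asserted exactly when the machine leaves its current state.
    return next_state, bool(exit_condition), highway, farmroad
```

Calling `step(State.HG, c=True, tl=True, ts=False)` advances to HY and asserts ST, while the same call with `c=False` holds in HG.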

Car Detector The car detector logic is much like the debounced switch of Section 6.6.1. A two-position switch embedded in the road determines whether a car is present. This signal should be stable during the transition from one setting to the other, and this calls for a debouncing circuit.

Since a car can arrive at any time, the car detector is asynchronous with respect to the rest of the traffic light system. To synchronize the car sense signal, we must pass it through a synchronizer flip-flop, clocked by the system clock. The circuit is given in Figure 10.45.

Light Decoders The light decoder circuitry is reasonably straightforward. We can use 2-to-4 decoders, such as the TTL 74139.

Figure 10.46 contains the necessary logic.
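The decoding itself can be sketched in a few lines of Python. This is only an illustration of the 2-to-4 decode using the encoding given earlier (00 = green, 01 = yellow, 10 = red); the function name `decode_light` is hypothetical, and the sketch uses active-high outputs for readability, whereas the 74139's outputs are active low.

```python
def decode_light(code):
    """Decode a 2-bit light code into (green, yellow, red) drive signals.

    Code 0b11 is unused in this encoding; as with an unconnected
    decoder output, none of the three lights is driven for it.
    """
    return tuple(code == value for value in (0b00, 0b01, 0b10))
```

For example, `decode_light(0b01)` drives only the yellow lamp.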

Interval Timer The last major component is the interval timer, designed to generate the signals TL and TS after being set by ST. We could implement this in many ways, perhaps the simplest being to use a counter and external decode logic. The counter is cleared when ST is asserted, and TL and TS are asserted by the external logic when the counter counts up to the appropriate threshold value. For this discussion, we will assume that TS is asserted when the 4-bit counter reaches binary 0111 and TL is asserted when it reaches binary 1111. Wider counters can be used for more realistic interval timings.

Figure 10.47 shows how the logic could be implemented using a 74163 synchronous up-counter. In the figure, the OR of ST and RESET is complemented to reset the counter. When either ST or RESET is asserted, the counter's active-low clear input is asserted, and the counter is set to zero. Including RESET is not strictly necessary: whatever state the counter comes up in when powered on, it will eventually cycle through the states that cause TS and TL to be asserted.
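A minimal behavioral model of this timer, assuming the thresholds given above (TS at 0111, TL at 1111), might look as follows. The class name `IntervalTimer` is hypothetical, and the sketch assumes TS and TL are pure equality decodes of the count and that the counter wraps around after 1111; the actual decode logic in the figure may differ.

```python
class IntervalTimer:
    """4-bit up-counter with synchronous clear, in the style of a 74163."""

    TS_COUNT = 0b0111  # short-interval threshold from the text
    TL_COUNT = 0b1111  # long-interval threshold from the text

    def __init__(self, start=0):
        # Any power-on value is allowed; only the low 4 bits are kept.
        self.count = start & 0xF

    def tick(self, st=False, reset=False):
        """Apply one clock edge; clear takes priority over counting."""
        if st or reset:
            self.count = 0
        else:
            self.count = (self.count + 1) & 0xF
        return self.ts, self.tl

    @property
    def ts(self):
        return self.count == self.TS_COUNT

    @property
    def tl(self):
        return self.count == self.TL_COUNT
```

Starting from any power-on value, at most 15 ticks without a clear bring the count to 1111, which matches the observation that the counter eventually passes through the states that assert TS and TL.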

10.4.2 PLA/PAL/ROM-Based Implementation

In this subsection, we will use the best encoding found in Section 9.3.1: HG = 00, HY = 10, FG = 01, and FY = 11. This yields an implementation for the next-state and output functions that requires eight unique product terms:

Any PLA component with five inputs, seven outputs, and eight product terms could implement these functions.

PLA/PAL Implementation Because no function is more complex than four product terms, they can also be implemented by many of the available sequential PALs. For example, in Section 8.5.3 we gave an ABEL description for the traffic light state machine that used a P22V10 PAL (see Figure 10.28). This device has 11 dedicated inputs and 10 programmable input/outputs. When the latter are programmed as outputs, they can be either registered or combinational. The OR array varies from 8 to 14 product term inputs, sufficient to implement any of the functions above. The embedded registers can be reset through a dedicated reset line that is routed to each output register, so it is not necessary to include a Reset input signal in the equations.

ROM Implementation A ROM-based implementation requires a complete tabulation of the state transition table. With five inputs and seven outputs, this is a 32-word by 8-bit ROM. If Reset is to be handled by the next-state logic directly, this should be included as one of the inputs to the ROM, thus doubling its size.
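To make the tabulation concrete, here is a hedged Python sketch that fills a 32-word ROM image. The address bit order (C, TL, TS, Q1, Q0) and the word layout (P1 P0 ST H1 H0 F1 F0) are assumptions chosen for illustration, as is the treatment of the light codes as functions of the current state alone; the sketch does not attempt to reproduce the minimized equations. It uses the state encoding of this subsection (HG = 00, HY = 10, FG = 01, FY = 11).

```python
# State encoding (Q1 Q0) from this subsection.
HG, HY, FG, FY = 0b00, 0b10, 0b01, 0b11
NEXT = {HG: HY, HY: FG, FG: FY, FY: HG}

GREEN, YELLOW, RED = 0b00, 0b01, 0b10
# (highway, farmroad) codes; FG and FY entries are assumed by symmetry.
LIGHTS = {HG: (GREEN, RED), HY: (YELLOW, RED), FG: (RED, GREEN), FY: (RED, YELLOW)}

def rom_word(c, tl, ts, state):
    """Pack P1 P0 ST H1 H0 F1 F0 into a 7-bit word (assumed layout)."""
    exit_condition = {
        HG: tl & c,
        HY: ts,
        FG: tl | (1 - c),
        FY: ts,
    }[state]
    next_state = NEXT[state] if exit_condition else state
    highway, farmroad = LIGHTS[state]
    return (next_state << 5) | (exit_condition << 4) | (highway << 2) | farmroad

def build_rom():
    """Return the 32-entry ROM contents, indexed by (C, TL, TS, Q1, Q0)."""
    rom = []
    for address in range(32):
        c = (address >> 4) & 1
        tl = (address >> 3) & 1
        ts = (address >> 2) & 1
        state = address & 0b11
        rom.append(rom_word(c, tl, ts, state))
    return rom
```

Adding Reset as a sixth address bit would double the loop to 64 entries, which is the size increase described above.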

10.4.3 Counter-Based Implementation

Although the two-level implementation just described is appropriate for a PAL- or PLA-based approach, it is not necessarily the best strategy when using packaged components such as TTL. The equations in the preceding subsection require eight 3-input gates (three packages), two 4-input gates (one package), three 2-input gates (one package), four inverters (one package), and many wires. An MSI-based implementation could lead to fewer components and certainly fewer interconnections.

Counters, Multiplexers, Decoders If you examine the traffic light finite state machine carefully, you should see that a counter could be used to implement the state register. After all, the machine either holds in its current state or advances to the next state in a well-defined sequence.

Let's make the state assignment HG = 00, HY = 01, FG = 10, and FY = 11. We will implement the state register with a 74163 synchronous up-counter. An external reset signal can be wired to the counter's synchronous clear input.

The question now becomes how to implement the counter's count input.

Figure 10.48 reproduces the state diagram for the traffic controller. In state HG, the exit condition is TL · C. In HY and FY, it is TS. In FG, it is TL + C' (where C' denotes the complement of C). We could use logic that takes the relevant condition of the current-state bits and ANDs it with the appropriate exit condition to form the count signal. Unfortunately, this would take a fair amount of discrete logic.

A better way is to use a multiplexer to implement the count signal. We drive a four-to-one multiplexer's selection lines with the current-state bits, Q1 and Q0. The inputs are wired for the appropriate exit condition. This is shown in Figure 10.49.

The count signal and the start timer signal, ST, are identical. As you can see, this approach represents a substantial reduction in package count.
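The counter-plus-multiplexer structure can be sketched behaviorally in Python. This assumes the state assignment of this subsection (HG = 00, HY = 01, FG = 10, FY = 11) and models the counter's synchronous clear as a `reset` argument; the function names are hypothetical.

```python
def count_enable(q1, q0, c, tl, ts):
    """Four-to-one multiplexer: the state bits select the exit condition."""
    mux_inputs = [
        tl & c,          # input 0, state HG
        ts,              # input 1, state HY
        tl | (1 - c),    # input 2, state FG
        ts,              # input 3, state FY
    ]
    return mux_inputs[(q1 << 1) | q0]

def next_state(q1, q0, c, tl, ts, reset=0):
    """2-bit up-counter: clear on reset, increment when count is asserted."""
    if reset:
        return 0, 0
    state = (q1 << 1) | q0
    if count_enable(q1, q0, c, tl, ts):
        state = (state + 1) & 0b11
    return state >> 1, state & 1
```

Because the value returned by `count_enable` is also the ST signal, no separate output logic is needed for ST in this arrangement.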

As one final application of MSI components, it is possible to drive the traffic lights from signals that have been directly decoded from the current state. For example, if the machine is in state HG, the highway lights are green and the farmroad lights are red. Similarly, the highway lights are yellow and the farmroad lights are red when the machine is in state HY. Thus the highway green light and yellow light can be decoded directly from state 0 and 1, respectively, while the farmroad red light is driven by the OR of these decoded signals. The logic is shown in Figure 10.50.
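The direct decoding can be illustrated with a short Python sketch, again assuming the counter state assignment HG = 00, HY = 01, FG = 10, FY = 11. The highway green, highway yellow, and farmroad red signals follow the description above; the remaining three signals are filled in by symmetry and are an assumption, as is the function name `decode_lights`.

```python
def decode_lights(q1, q0):
    """Derive the six individual light signals from the 2-bit state."""
    # One-hot decode of the state, as a 2-to-4 decoder would produce.
    s = [(q1, q0) == bits for bits in ((0, 0), (0, 1), (1, 0), (1, 1))]
    return {
        "highway_green": s[0],
        "highway_yellow": s[1],
        "highway_red": s[2] or s[3],    # assumed by symmetry
        "farmroad_green": s[2],         # assumed by symmetry
        "farmroad_yellow": s[3],        # assumed by symmetry
        "farmroad_red": s[0] or s[1],
    }
```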

10.4.4 LCA-Based Implementation

Let's begin with the same set of equations we used for the PAL/PLA-based implementation. Fortunately, none of the next-state and output functions are more complex than five variables: P1 and ST are five variables; P0, H1, H0, and F0 are three variables; and F1 is one variable. Thus, we can implement the finite state machine in four and one half CLBs: one CLB each for the two 5-variable functions, and the remaining five functions grouped two apiece into the remaining CLBs.

Xilinx provides software that takes as input a schematic description of a circuit, automatically partitions the logic to the CLB components, chooses specific CLBs in the logic array to implement this partitioning, and selects a routing of the interconnections to complete the implementation. In general, the designer need not know in detail how the schematic is mapped into the array structure. Nevertheless, to illustrate more about the internal structure of the LCA, we will perform the partitioning, placement, and routing by hand (Xilinx provides an editor that lets you deal with this level of detail).

The first step is to map functions onto CLBs, especially when two functions are to be placed in the same CLB. To make the best use of the CLB's inputs, we should try to place functions of the same inputs in the same CLB. For example, H1 and H0 are both functions of TS, Q1, and Q0. These two functions can easily be placed in the same CLB.

A second goal is to minimize the amount of inter-CLB routing. For example, it makes sense to place P0 and F1 in the same CLB, because the latter is simply a function of the former. The remaining function, F0, is placed in a CLB by itself.

The second step maps the five CLBs onto the array to make best use of direct interconnect and minimum use of global interconnect. We begin by placing the CLB for Q1, since both its outputs X and Y will carry the same value. By placing other CLBs above, below, and to the right of this CLB, we can make good use of many direct connections.

Figure 10.51 shows one possible CLB placement and the routing of the inputs and state outputs. The Q1 CLB is placed in the center at the left. Through direct connections, it drives the D input of F0's CLB, the A input of ST's CLB, and the B input of the H1,0 CLB. The Q0 CLB is placed in the upper right-hand corner. It uses direct connections to drive the A input of the H1,0 CLB.

We must use general-purpose interconnections to get Q1 to the Q0 CLB and Q0 to the left-hand column of CLBs. The X output of the Q1 CLB can be connected to the first vertical track. This is routed up through the switching matrix. The B input on the Q0 CLB can be connected to this wire.

This interconnection illustrates some of the constraints on the general-purpose connections. The X output can be connected only to vertical tracks 1 and 4 (numbered from left to right), while the B input can be connected only to vertical tracks 1, 3, and 5, as well as horizontal tracks 1 and 5 in the channel below the CLB. It is not possible to wire up any input or output to any track.

The distribution of Q0 is somewhat more complicated. It is initially connected to the third wire of the horizontal channel below its CLB. The middle switching matrix splits the signal onto the fourth vertical routing track while simultaneously passing it through on the third horizontal wire. Let's follow the horizontal distribution first. The A input of Q1's CLB can connect to this wire (as well as the first horizontal track and the first and third vertical tracks of the channel to its left). The next switching matrix routes the wire to the third vertical track, where it can be connected to the E input (E can also be connected to the fifth vertical track and the fifth horizontal track below it).

Now returning to the vertical distribution of Q0, it passes through another switching matrix, which routes it onto the third vertical track. Since the ST CLB's A input is already directly connected to Q1, we have to route Q0 to the vertical channel to the left of the CLB. The next switching matrix places the signal on the fourth vertical track. From here it can be connected to the D input (which can also be wired to the third horizontal track below it).

The rest of the routing details are to get the C, TL, and TS signals to the appropriate CLB inputs. TS is routed along the fifth vertical track in both of the vertical channels shown in the figure. To keep the routing regular, we wire the C input to this track in each of the CLBs (C can be wired to vertical tracks 2, 4, and 5 and horizontal tracks 3 and 5 from the channel above the CLB), except Q0's CLB. Recall from Figure 10.39 that if an internal flip-flop's output is to be used in computing a function, then one of the B or C inputs cannot be used.

C and TL are needed only in the computation of Q1 and ST, so we only route them through the leftmost vertical channel. C follows the fourth and then the third vertical track, from which it can be wired to Q1's D input and ST's E input. TL starts on the second track, where it is split onto the first and third tracks by the first switching matrix. This allows it to be connected to the E input of Q1 and the B input of ST.

It is worth making a few extra points. First, Q1's B input is left floating. This is because the internal flip-flop's output is used as an input to the internal function generator. Second, the Q0 and Q1 CLBs are connected to the global clock line, and their clock enable inputs are wired to the fifth horizontal track. This track carries the clock enable signal. The global reset signal puts the state machine into its starting state.

In general, the designer need not see this level of detail. So why bother to understand it? Routing decisions can have a serious impact on performance, and no routing software is perfect, especially given the complex constraints imposed on the routing task by the Xilinx architecture. Each traversal of a switching matrix adds 1 to 3 ns to the signal delay. The worst-case routing for Q0 passes it through three switching matrices, and this might represent the critical path in the circuit. By working at the detailed level of the interconnection fabric, you can do hand routing for critical signals or force them onto the smaller-delay long line interconnections.

Counter/Multiplexer It is also possible to think about our MSI implementation in terms of the primitives supported by the Xilinx LCA. The basic elements of our MSI approach are a four-to-one multiplexer and a 2-bit up-counter. Let's see how these map into CLBs.

We begin with the multiplexer. A general-purpose four-to-one multiplexer is a function of six variables: its four data inputs and two control inputs. Fortunately, this is exactly the kind of six-variable function that can be implemented by a single CLB. Think back to Figure 10.39(c). The CLB can be configured as a two-to-one multiplexer, controlled by input E, that selects between two functions of four variables. Of course, each of these functions could be its own two-to-one multiplexer.

This is not the best solution, however. We need a second CLB just to implement the terms TL · C and TL + C' (where C' denotes the complement of C) for input to the multiplexer. On closer examination, the function we want to implement is really five variables: TL, C, TS, Q1, and Q0. We can implement this in a single CLB.
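One way to see why a single CLB suffices is to treat its function generator as a 32-entry lookup table indexed by the five inputs. The Python sketch below builds such a table for the count function, assuming the counter state assignment HG = 00, HY = 01, FG = 10, FY = 11; the index bit order (TL, C, TS, Q1, Q0) and the function names are arbitrary choices made for illustration and do not reflect the actual Xilinx configuration format.

```python
def count_function(tl, c, ts, q1, q0):
    """Exit condition selected by the current state (the multiplexer view)."""
    state = (q1 << 1) | q0
    return [tl & c, ts, tl | (1 - c), ts][state]

def build_lut():
    """Pack the 32 truth-table entries into one integer, bit i = f(index i)."""
    lut = 0
    for index in range(32):
        tl, c, ts, q1, q0 = ((index >> shift) & 1 for shift in (4, 3, 2, 1, 0))
        lut |= count_function(tl, c, ts, q1, q0) << index
    return lut

def lut_lookup(lut, tl, c, ts, q1, q0):
    """Evaluate the function by indexing the table, as a function generator would."""
    index = (tl << 4) | (c << 3) | (ts << 2) | (q1 << 1) | q0
    return (lut >> index) & 1
```

The whole count function fits in 32 bits of table, with no intermediate terms to route between CLBs.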

Now consider the 2-bit counter that implements the state register. A full-blown 74163 is not really needed: we never use the load capability, and we can clear the counter using the global reset signal rather than a specific clear input. Thus, each bit of the counter is a function of three inputs: a count signal and the 2-bit current state. This too can be implemented in a single CLB.
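The two three-input functions can be written out explicitly. The sketch below uses the standard equations for a 2-bit binary up-counter with a count enable; the function name `counter_next_bits` is hypothetical, and reset is left to the global reset signal as described above.

```python
def counter_next_bits(count, q1, q0):
    """Next-state bits of a 2-bit up-counter with count enable.

    Each output depends only on count, q1, and q0, so both fit in one CLB.
    """
    d0 = q0 ^ count           # low bit toggles whenever counting
    d1 = q1 ^ (q0 & count)    # high bit toggles on a carry out of the low bit
    return d1, d0
```

With `count` at 0 the state holds; with `count` at 1 it steps through 00, 01, 10, 11 and back to 00.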

Finally, let's look at the logic to decode the six light functions from the current state, as we did in the MSI example. All of these functions are defined over two variables: Q1 and Q0. We can pack two such functions per CLB, for a total of three additional CLBs.

By cleverly using MSI functions rather than discrete gates, we came up with a five-CLB implementation, including the output decoders. Many things you have learned about TTL MSI components remain valid in the new programmable logic technology!
