# A 40 Gb/s CMOS Serial-Link Receiver With Adaptive Equalization and Clock/Data Recovery

Chih-Fan Liao, Student Member, IEEE, and Shen-Iuan Liu, Senior Member, IEEE

Abstract-This paper presents a 40 Gb/s serial-link receiver including an adaptive equalizer and a CDR circuit. A parallel-path equalizing filter is used to compensate the high-frequency loss in copper cables. The adaptation is performed by only varying the gain in the high-pass path, which allows a single loop for proper control and completely removes the RC filters used for separately extracting the high- and low-frequency contents of the signal. A full-rate bang-bang phase detector with only five latches is proposed in the following CDR circuit. Minimizing the number of latches saves the power consumption and the area occupied by inductors. The performance is also improved by avoiding complicated routing of high-frequency signals. The receiver is able to recover 40 Gb/s data passing through a 4 m cable with 10 dB loss at 20 GHz. For an input PRBS of  $2^7 - 1$ , the recovered clock jitter is 0.3  $ps_{\rm rms}$  and 4.3  $ps_{\rm pp}.$  The retimed data exhibits 500  $mV_{\rm pp}$ output swing and 9.6 ps<sub>pp</sub> jitter with BER  $< 10^{-12}$ . Fabricated in 90 nm CMOS technology, the receiver consumes 115 mW, of which 58 mW is dissipated in the equalizer and 57 mW in the CDR.

*Index Terms*—Clock and data recovery, equalizer, 40-Gb/s receiver, serial-link application.

## I. INTRODUCTION

As VLSI technology continues to advance, the operating frequency of processors and memories increases rapidly. This has made the bandwidth of the I/O interface a primary bottleneck in many systems. For example, high-speed routers, switches, and multiple-processor servers may need to transport data over coaxial links at tens of gigabits per second. At such high data rates, skin effect and dielectric loss in the transmission medium cause significant distortion of high-frequency signals, leading to considerable intersymbol interference (ISI) on the output data and limiting the length of transmission. As illustrated in Fig. 1, with a length of 4 meters, even a high-quality cable [1] exhibits 10 dB loss at 20 GHz, resulting in substantial eye closure at 40 Gb/s. To receive the data with a reasonable BER, a serial-link receiver requires not only a clock and data recovery (CDR) circuit but also an adaptive equalizer. It is highly desirable to integrate the equalizer with the CDR circuit on the same chip so that no off-chip equalizer is needed. This also eliminates the 50 $\Omega$ buffer between the two circuits and reduces the overall power consumption.

Though stand-alone equalizers have been verified at 40 Gb/s in 160 GHz BiCMOS [2] and 0.18 $\mu$m CMOS [3] technologies, both of them lack adaptability and either consume a large amount of power (760 mW in [2]) or produce a high BER ($10^{-5}$ in [3]). Several adaptive equalizers have been reported [4]–[6], and the highest data rate of 20 Gb/s is demonstrated using 0.13 $\mu$m CMOS [6]. The speed of the equalizer/CDR combination, however, remains below 10 Gb/s [5], [7]. This motivates new architecture and circuit techniques to achieve higher data rates in serial-link applications.

The authors are with the Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan 1067 (e-mail: lsi@cc.ee.ntu.edu.tw).

Digital Object Identifier 10.1109/JSSC.2008.2005535

Fig. 1. Typical environment for serial-link applications.

This paper presents a 40 Gb/s serial-link receiver implemented in 90 nm CMOS. In the proposed equalizing filter, an adaptation mechanism is capable of adjusting the high-frequency boost by a large amount without changing the low-frequency gain. This not only eases the design of the adaptation loop but also removes the *RC* filters required for separately extracting the high- and low-frequency contents of the signal. Following the equalizer, the full-rate CDR architecture is chosen to relax the loading on the previous stage. A 5-latch bang-bang phase detector (PD) is proposed that enhances the operation speed and reduces the power consumption and hardware complexity.

The paper is organized as follows. Section II addresses general considerations for the design of adaptive equalizers and CDR circuits, arriving at the receiver architecture shown later. Section III describes the circuit detail of each building block and Section IV presents the experimental results. Finally, Section V presents the conclusion.

## II. RECEIVER ARCHITECTURE

### A. General Considerations

The receiver architecture highly depends on the type of the equalizing filter. The *RC*-degenerated filter is widely utilized, as shown in Fig. 2. The circuit introduces a zero-pole pair by capacitive degeneration, providing some boost at high frequencies. The amount of boost is proportional to the distance between the zero and the pole, while the DC gain is degenerated by the same factor. It therefore suffers from a serious trade-off between the amount of high-frequency boost and the DC loss. Another issue is that, to make the equalizing filter adaptive, the resistor and capacitor must be simultaneously adjusted to cover a wider tunable range [5], [6]. Though a variable range of 5–6 dB at 20 GHz $(1/2T_b)$ is available by changing both the resistance and the capacitance, this scenario simultaneously alters the low-frequency gain when the high-frequency boost is varied. To understand how this affects the adaptive equalization, the equalizer output during adaptation is examined in Fig. 3.

Fig. 2. *RC*-degenerated equalizing filter and its frequency response.

Fig. 3. Output waveforms of an *RC*-degenerated equalizer during adaptation. Solid and dashed lines represent the waveforms of the equalized signal and that of the reference, respectively.
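The boost-versus-DC-loss trade-off of the *RC*-degenerated stage can be illustrated with a textbook first-order half-circuit model, $G_m(s) = g_m (1 + sRC)/(1 + g_m R + sRC)$. This is only a sketch: the model, the function name `degenerated_gm`, and all component values are assumptions chosen for illustration and are not taken from the paper.

```python
import math

def degenerated_gm(f_hz, gm, r, c):
    """Effective transconductance of a half-circuit degenerated by R in parallel with C.

    Assumed first-order model: zero at 1/(R*C), pole at (1 + gm*R)/(R*C).
    """
    s = 1j * 2 * math.pi * f_hz
    return gm * (1 + s * r * c) / (1 + gm * r + s * r * c)

# Hypothetical values for illustration only (not from the paper).
GM = 20e-3   # 20 mS device transconductance
C = 50e-15   # 50 fF degeneration capacitor

results = []
for r in (50.0, 100.0, 200.0):
    dc = abs(degenerated_gm(0.0, GM, r, C))
    hf = abs(degenerated_gm(1e13, GM, r, C))   # evaluated far above the pole
    boost_db = 20 * math.log10(hf / dc)        # high-frequency boost
    dc_loss_db = 20 * math.log10(GM / dc)      # DC gain lost to degeneration
    results.append((r, boost_db, dc_loss_db))
    print(f"R = {r:5.0f} ohm: boost = {boost_db:4.1f} dB, DC loss = {dc_loss_db:4.1f} dB")
```

In this model every decibel of added boost is paid for with a decibel of DC gain, which is the trade-off the parallel-path filter is meant to avoid.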

Suppose initially the equalizer is settled with maximum boost and the cable loss suddenly decreases. Before the loop performs corrections, the equalizer provides so much boost that the high-frequency content of data shows excess amplitude. The power of the equalized signal, $P_{\rm eq}$, is therefore larger than the power of the reference, $P_{\rm ref}$, as depicted in Fig. 3(a). The loop senses this difference and adjusts $V_{\rm C}$ accordingly. As $V_{\rm C}$ increases, the high-frequency boost decreases as desired; unfortunately, the DC gain increases and the low-frequency content of data grows in amplitude. With a certain $V_{\rm C}$, the equalized data shows no ISI but a swing larger than the reference, as illustrated in Fig. 3(b). The loop continues to increase $V_{\rm C}$ since the power of the equalized signal is still larger than that of the reference. The output data will contain ISI in the steady state because further increasing $V_{\rm C}$ attenuates the high-frequency content of data while amplifying the low-frequency content. The above discussion shows that the equalizer cannot adapt correctly with only a single loop. To overcome this difficulty, another loop is required to control the swing of the reference [5], so that the adaptation loop can settle at the state shown in Fig. 3(b) with the two signals possessing equal swings. Note that all these phenomena exist even if the transmitter swing is fixed. In other words, two loops are required if the above equalizing filter is used [5], regardless of whether the transmitter swing varies. Such a two-loop scenario raises the design complexity and may cause extra uncertainties if the two loops are not carefully designed.

Even if the complexity can be handled by properly designing the two loops, another issue is that such an architecture necessitates low/high-pass filters to separately extract the low- and high-frequency information of data [5], as illustrated in Fig. 4. Since the corner frequency of these filters is proportional to the data rate, the *RC* product becomes very small for a 40 Gb/s signal. To obtain $1/2\pi RC$ on the order of $1/2T_b = 20$ GHz, resistors and capacitors should be on the order of 400 $\Omega$ and 20 fF, respectively.<sup>1</sup> Such a small resistance significantly degrades the boost at $1/2T_b$, while capacitors of a few tens of fF suffer from large process variations and cannot be estimated with reasonable accuracy. Therefore, it is highly desirable to eliminate these filters by using an architecture that does not require separate information of low- and high-frequency signals.
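The corner-frequency figures quoted above can be checked with a few lines of arithmetic. The sketch below uses the 400 Ω and 20 fF values from the text; the helper names and the 800 Ω case used to illustrate the footnote's trade-off are assumptions made for illustration.

```python
import math

def rc_corner_hz(r_ohm, c_farad):
    """Corner frequency 1/(2*pi*R*C) of a first-order RC filter."""
    return 1.0 / (2 * math.pi * r_ohm * c_farad)

def capacitance_for_corner(r_ohm, f_hz):
    """Capacitance needed to place the corner at f_hz for a given resistance."""
    return 1.0 / (2 * math.pi * r_ohm * f_hz)

BIT_RATE = 40e9                 # 40 Gb/s
T_B = 1.0 / BIT_RATE            # bit period, 25 ps
F_TARGET = 1.0 / (2 * T_B)      # 1/(2*T_b) = 20 GHz

f_corner = rc_corner_hz(400.0, 20e-15)
print(f"400 ohm, 20 fF -> corner at {f_corner / 1e9:.1f} GHz")

# Doubling the resistor (hypothetical 800 ohm) halves the required capacitor.
c_needed = capacitance_for_corner(800.0, F_TARGET)
print(f"800 ohm needs {c_needed * 1e15:.1f} fF for a 20 GHz corner")
```

With a larger resistor the required capacitor falls to roughly 10 fF, which makes the process-variation problem described in the text worse.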

The CDR architecture also merits several considerations, such as the maximum speed of D flip-flops (DFFs) and the driving capability of the preceding stage. In a half-rate CDR, the bang-bang PD [8] presents a loading of four latches to the previous stage, while the linear PD [9] presents a loading of only two latches. The linear PD, however, needs to generate short-duration pulses whose widths are proportional to the phase error, which is extremely difficult at high data rates. These observations suggest using a full-rate bang-bang PD based on the Alexander topology [10]. It presents a loading of two latches to the previous stage and does not involve the creation of short pulses. However, the conventional Alexander PD [10] uses at least seven latches, which limits the operation speed, raises the loading on the clock buffers, and increases the power consumption. A more efficient full-rate bang-bang PD with fewer latches is therefore required to alleviate the above problems.

### B. Receiver Architecture

Based on the above considerations, we arrive at the receiver architecture shown in Fig. 5. A parallel-path equalizing filter is adopted in the design. By only adjusting the gain in the high-pass path, the adaptation varies only the high-frequency boost and requires only a single loop for proper control. This also removes the low/high-pass filters because the low- and high-frequency information need not be separately extracted. Another fixed-boost equalization stage is cascaded to provide enough gain at 20 GHz. Two buffers tapered in size and driving capability are inserted between the equalizer and the CDR. They provide a total gain of 5 dB and shield the equalizer from the large loading of the PD. The retimed data at the CDR output serves as the *reference* for the equalizer. Two power detectors measure the power of the equalized signal and that of the retimed data, respectively. The V/I converter compares them and adjusts the boost accordingly.

<sup>1</sup>Making the resistor larger will cause less degradation of the boost, which however requires an even smaller capacitor to obtain the same corner frequency.

Fig. 4. Illustrating the effects of low/high-pass filters on the main signal path.

Fig. 5. Receiver architecture.

Note that the swing at nodes A and B is designed to be equal by using the same $I_{\text{tail}}R$ in the buffer (following the equalizer) and the DFF. Such a design guarantees that the *low-frequency content of data* is voltage-limited and shows equal amplitude at nodes A and B. The *high-frequency content of data*, however, is not limited to a fixed swing because there are inductors in the buffer and the DFF.<sup>2</sup> In other words, when the loop adjusts the high-frequency boost, the ringing behavior at node A still varies, so that comparing the power at nodes A and B indicates whether or not the high-frequency loss is compensated properly. Fig. 6 illustrates the receiver adaptation to different cable loss, suggesting that a single loop is enough for convergence and little ISI can be achieved in the steady state. In summary, successful adaptation requires that all frequency contents of data have equal swings so that the output shows minimum ISI. To achieve this goal based on power comparison, the power (swing) of the high-frequency content of data should be compared before and after the CDR (which generates the reference). The high-frequency components of the equalized data must not be hard-limited, but the low-frequency components can be. This is because equalization only boosts the amplitude of the high-frequency components, while the low-frequency components only need to keep their amplitude through the entire chain.

The amount of boost should be set to maximum at start-up; otherwise the heavily dispersive bits may drive the CDR loop out of lock. Note that setting the boost to maximum causes 10 dB over-boost if the cable length is very short. The equalized data will have overshoot and ringing behaviors. However, as the gain is large enough, the eye is still quite open, so the PD can distinguish between 0 and 1 and the CDR can lock. The extra ringing-induced jitter can usually be considered as high-frequency jitter and easily falls within the CDR tolerance.

<sup>2</sup>With inductors in the CML differential pair, the voltage at the outputs can exceed the supply. In other words, the swing is no longer limited to $I_{\text{tail}}R$ for the high-frequency content of data. The value to which it is limited depends on the ratio of L/R. The swing at node B is made close to $I_{\text{tail}}R$ by using a low L/R ratio in the design.



Fig. 6. Adaptation procedure in the receiver.



Fig. 7. Conceptual diagram of the parallel-path equalizer.

## III. BUILDING BLOCKS

### A. Equalizing Filter

The conceptual diagram of a parallel-path equalizing filter is illustrated in Fig. 7. A bandpass path with zero DC gain is combined with a low-pass path to obtain the desired boost at 20 GHz while maintaining a gain close to unity at low frequencies. Inductor $L_P$ and parasitic capacitance $C_P$ are designed to resonate at 20 GHz, and resistor $R_P$ denotes the equivalent parallel resistance at resonance. The transfer function of the output current versus input voltage is derived as

$$\begin{aligned}
\frac{I_{\text{out}}}{V_{\text{in}}} &= g_{m3} + g_{m2} \frac{g_{m1}}{C_P} \frac{s}{s^2 + \frac{1}{C_P R_P} s + \frac{1}{L_P C_P}} \\
&= g_{m3} \frac{s^2 + \frac{1}{C_P R_P} \left(1 + \frac{g_{m2}}{g_{m3}} g_{m1} R_P\right) s + \frac{1}{L_P C_P}}{s^2 + \frac{1}{C_P R_P} s + \frac{1}{L_P C_P}}
\end{aligned} \tag{1}$$

By letting

$$\omega_0^2 = 1/L_P C_P \tag{2}$$

$$Q = \omega_0 C_P R_P = R_P / \omega_0 L_P \tag{3}$$

and substituting them into (1), the transfer function becomes

$$\frac{I_{\text{out}}}{V_{\text{in}}} = g_{m3} \frac{s^2 + \frac{\omega_0}{Q} \left(1 + \frac{g_{m2}}{g_{m3}} g_{m1} R_P\right) s + \omega_0^2}{s^2 + \frac{\omega_0}{Q} s + \omega_0^2}. \tag{4}$$

Fig. 8 plots the magnitude of the transfer function in (4), indicating that the transconductance is amplified by a factor of $(1 + g_{m1}R_{P}g_{m2}/g_{m3})$ at resonance and returns to the low-frequency value of $g_{m3}$ as the frequency approaches infinity. The amount of boost increases with $R_{P}$ and hence with the Q of the LC tank. One can easily increase the boost by raising $g_{m1}g_{m2}R_{P}$ without sacrificing the DC gain, which is in sharp contrast to the topology in Fig. 2.

Fig. 8. Frequency response of the parallel-path equalizer.

Fig. 9. Effect of Q on the equalizer frequency response.
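The behavior of (4) at DC, at resonance, and at very high frequency can be checked numerically. The following sketch evaluates (4) directly; the function name `parallel_path_gm` and the values A = 2.5 and Q = 0.9 are illustrative assumptions, with $g_{m3}$ normalized to 1 and A standing for $1 + g_{m1}R_Pg_{m2}/g_{m3}$.

```python
import math

def parallel_path_gm(f_hz, gm3, a, q, f0_hz):
    """Evaluate the transfer function (4) at frequency f_hz.

    a is the resonance boost factor 1 + gm1*R_P*gm2/gm3.
    """
    s = 1j * 2 * math.pi * f_hz
    w0 = 2 * math.pi * f0_hz
    num = s * s + (w0 / q) * a * s + w0 * w0
    den = s * s + (w0 / q) * s + w0 * w0
    return gm3 * num / den

GM3 = 1.0     # normalized low-frequency transconductance
A = 2.5       # illustrative boost factor (assumption)
Q = 0.9       # illustrative tank Q (assumption)
F0 = 20e9     # resonance at 20 GHz, as in the text

gain_dc = abs(parallel_path_gm(0.0, GM3, A, Q, F0))
gain_res = abs(parallel_path_gm(F0, GM3, A, Q, F0))
gain_hf = abs(parallel_path_gm(1e14, GM3, A, Q, F0))

print(f"DC gain:        {gain_dc:.3f}")
print(f"Gain at 20 GHz: {gain_res:.3f}")
print(f"Gain at 100 THz: {gain_hf:.3f}")
```

The gain equals $g_{m3}$ at both frequency extremes and $A\,g_{m3}$ at resonance, so changing A moves only the peak.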

It is instructive to examine the locations of poles and zeros in (4) and their effects on the frequency response with different values of Q. With Q > 1/2, the poles are found as

$$\omega_{p1,p2} = -\frac{\omega_0}{2Q} \left( 1 \pm j\sqrt{4Q^2 - 1} \right) \tag{5}$$

while the zeros are located at

$$\omega_{z1,z2} = -\frac{\omega_0}{2Q} A \left( 1 \pm \sqrt{1 - 4Q^2/A^2} \right) \tag{6}$$

where  $A = (1 + g_{m1}R_Pg_{m2}/g_{m3})$ . Note that the two zeros should be real and different in magnitude so that the desired frequency response shown in Fig. 8 can be obtained. This requires

$$A > 2Q. \tag{7}$$

Since A is also a function of Q due to  $R_{\rm P}$ , (7) can be rearranged as

$$\begin{aligned}
& 1 + g_{m1}R_P \frac{g_{m2}}{g_{m3}} > 2Q \\
\Rightarrow\ & 1 + g_{m1}\omega_0 L_P Q \frac{g_{m2}}{g_{m3}} > 2Q \\
\Rightarrow\ & Q < \frac{1}{2 - g_{m1}\omega_0 L_P g_{m2}/g_{m3}}.
\end{aligned} \tag{8}$$

Equation (8) suggests that the Q should be lower than a certain value to get the desired loss-compensation profile. Fig. 9 illustrates the effect of Q on the frequency response. Though a higher Q increases the boost at  $\omega_0$ , the available bandwidth shrinks and may lead to insufficient gain at frequencies between dc and  $\omega_0$ , which in turn worsens the jitter in the time domain. The upper limit on Q relaxes the design of on-chip inductors, allowing the use of stacked inductors with narrow traces to save the area.
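The bound in (8) is easy to evaluate numerically. The sketch below uses the design values quoted in this section ($g_{m1}g_{m2}/g_{m3} = 12.3$ mS, $L_P = 1.1$ nH, $\omega_0 = 2\pi \times 20$ GHz); the function names are assumptions, and the last-step rearrangement of (8) is taken to be valid only when $g_{m1}\omega_0 L_P g_{m2}/g_{m3} < 2$.

```python
import math

def q_upper_limit(gm_product_s, l_p, f0_hz):
    """Upper bound on Q from (8); gm_product_s is gm1*gm2/gm3 in siemens."""
    k = gm_product_s * 2 * math.pi * f0_hz * l_p
    if k >= 2:
        return math.inf   # condition (7) then holds for every Q
    return 1.0 / (2.0 - k)

def zeros_are_real(q, gm_product_s, l_p, f0_hz):
    """Check condition (7), A > 2Q, with R_P = Q * w0 * L_P from (3)."""
    r_p = q * 2 * math.pi * f0_hz * l_p
    a = 1 + gm_product_s * r_p
    return a > 2 * q

GM_PRODUCT = 12.3e-3   # gm1*gm2/gm3 = 12.3 mS (from the text)
L_P = 1.1e-9           # 1.1 nH (from the text)
F0 = 20e9              # 20 GHz (from the text)

q_max = q_upper_limit(GM_PRODUCT, L_P, F0)
print(f"Q must stay below {q_max:.2f}")
print("Q = 0.9 satisfies (7):", zeros_are_real(0.9, GM_PRODUCT, L_P, F0))
print("Q = 4.0 satisfies (7):", zeros_are_real(4.0, GM_PRODUCT, L_P, F0))
```

The result is consistent with the limit of about 3.3 stated for this design.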

Circuit realization of the equalizing filter is depicted in Fig. 10. The adaptation is performed by varying the tail current of the bandpass stage, which has no influence on the low-frequency gain and output common mode. With $g_{m1}g_{m2}/g_{m3} = 12.3$ mS, $L_P = 1.1$ nH, and $\omega_0 = 2\pi \times 20$ GHz in this particular design, the limit of Q according to (8) is only 3.3. A resistor $R_X$ is therefore inserted in parallel with the LC tank, lowering the Q and broadening the frequency response. Fig. 11(a) shows the simulated eye diagram when the $2^7-1$ PRBS data passes through a 5 m cable with 12 dB loss at 20 GHz. Fig. 11(b) and (c) are eye diagrams at the equalizer output with and without $R_X$, respectively. The jitter indeed increases as the equivalent Q approaches the upper bound of 3.3, even if the loss at $\omega_0$ is totally compensated. The final design yields $R_X = 1.1$ k$\Omega$ and Q = 0.9, providing a fully tunable boost of 8 dB at 20 GHz. An RC-degenerated filter cascaded behind the parallel-path filter compensates another 4 dB of loss.

Note that the parasitic capacitance can vary by 20%, shifting the peak of the boost profile. To examine the effect on the equalizer performance, the size of $g_{m1-3}$ in Fig. 10 and the size of the following stage are varied by $\pm 10\%$. The boost at 20 GHz varies by +1.3/-0.5 dB. Such variation has a minor influence on the output eye diagram. The mismatch between the lengths of the differential cables, which "wastes" some of the equalizer gain and causes additional eye closure, easily outweighs the effect of parasitic variation.

Fig. 10. Circuit diagram of the parallel-path equalizer.

Fig. 11. (a) Output eye after traveling through a cable with 12 dB loss at 20 GHz. (b) Equalized eye with $R_X = 1.1\,\mathrm{k}\Omega$. (c) Equalized eye without $R_X$.

Fig. 12. Power detectors in the adaptation circuit.
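As a consistency check, the 8 dB boost quoted for the final design can be estimated from the resonance factor $1 + g_{m1}R_Pg_{m2}/g_{m3}$ in (4) with $R_P = Q\omega_0 L_P$ from (3). The sketch below assumes that the quoted Q = 0.9 is the loaded Q of the tank including $R_X$; the function name is an assumption.

```python
import math

def resonance_boost_db(q, gm_product_s, l_p, f0_hz):
    """Boost at resonance in dB: 20*log10(1 + (gm1*gm2/gm3) * R_P)."""
    r_p = q * 2 * math.pi * f0_hz * l_p   # equivalent parallel resistance from (3)
    return 20 * math.log10(1 + gm_product_s * r_p)

GM_PRODUCT = 12.3e-3   # gm1*gm2/gm3 = 12.3 mS (from the text)
L_P = 1.1e-9           # 1.1 nH (from the text)
F0 = 20e9              # 20 GHz (from the text)
Q_FINAL = 0.9          # final design value (from the text)

boost_db = resonance_boost_db(Q_FINAL, GM_PRODUCT, L_P, F0)
print(f"Estimated boost at 20 GHz: {boost_db:.1f} dB")
```

Under these assumptions the estimate lands close to the stated 8 dB, with the cascaded RC-degenerated stage supplying the remaining 4 dB.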

The linearity of the equalizer may be of concern. However, typical transmitter swing for serial-link applications is on the order of several hundreds of mV, making the equalizer undoubtedly nonlinear. To make it more linear, degeneration techniques can be used [5]–[7], which however sacrifice the gain. At very high data rate, people always thirst for more gain because of the limited speed of transistors. Therefore, degeneration is not employed in the first stage and extensive transient simulation has been conducted to optimize the boost and jitter of the equalizer. Even if nonlinearity causes extra jitter on the equalized data, the CDR can still lock and provides retiming if the amount of jitter is within the tolerance of the CDR.

### B. Adaptation Circuit

The adaptation circuit includes two power detectors and a V/I converter. The power detector is based on the source-coupled pair followed by low-pass filtering [5], while the V/I converter is a two-stage opamp with rail-to-rail output swing. The equalizer and the DFF in the CDR are designed with nominally equal output swing and output common mode. However, when the two circuits exhibit a common-mode difference due to process or supply variations, they present an offset on the outputs of the power detectors. To make the loop insensitive to such offset, a current source $I_2$, as depicted in Fig. 12, is added to compensate for this error. By doing so, the two power detectors have equal gains and a common-mode difference adjustable from 0 to $I_2R$.<sup>3</sup> The output of the V/I converter controls the equalizer through a current-steering circuit rather than directly adjusting the gate voltage of the current source. This provides a wider control range with a more predictable loop transient.

### C. VCO and Clock Buffer

The VCO and the clock buffer are shown in Fig. 13. The VCO is an *LC* oscillator with accumulation-mode varactors for frequency tuning. The clock buffer should provide a large output swing for high-speed sampling in the PD while isolating the VCO from random data transitions. Resistor-loaded buffers consume a large amount of power even with inductive peaking. On the other hand, *LC* resonant buffers are able to provide high gain at tens of gigahertz because they only operate in the vicinity of the tuned frequency. However, the gain drops dramatically when the operating frequency deviates from the resonance of the *LC* tank. Since the center frequency of the VCO may drift due to process variations, high-Q *LC* buffers carry more design risk. To cover a wider frequency range, a resistor $R_{\rm P}$ is added in parallel with the *LC* tank as shown in Fig. 13, decreasing the Q but increasing the operation range. With $R_{\rm P} = 1.3$ k$\Omega$ and 2.5 mA bias current, simulation indicates that the clock buffer presents a gain of 9 dB and a bandwidth of 7 GHz, which is 5 times larger than the VCO tuning range.

<sup>3</sup>In this work, $I_2$ is tuned manually if necessary. For a more robust design, a replica circuit can be used to automatically adjust $I_2$.

Fig. 13. 40 GHz VCO and clock buffer.

Fig. 14. Deriving the minimum number of latches required for generating the three consecutive samples necessary in the early-late method.

### D. Phase Detector (PD)

As mentioned in Section II, a full-rate bang-bang PD is chosen to relax the loading on the equalizer. The tri-state feature is required to minimize the jitter during long-runs, suggesting the early-late method based on sampling the data by the clock. To find the PD architecture that minimizes the number of latches, we start from two DFFs shown in Fig. 14, where one is sampled by the rising edge of the clock and the other is sampled by the falling edge. The sampled results at node  $Q_1$  are denoted as  $S_A$  and  $S_C$ , while the result at node  $Q_2$  is denoted as  $S_{\rm B}$ . Since three consecutive samples in one bit period are required in an early-late method, it is necessary for  $S_A$  to be *stored* and *delayed* by a proper amount; otherwise its information will be lost after the next rising edge of the clock. By observing that the sampled result will hold during one bit period,  $S_{\rm A}$  can be delayed by an amount between 0 and  $T_{\rm b}/2$ , as indicated by the gray area in Fig. 14. This area also represents the time interval for phase comparison.
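The early-late decision on three consecutive samples can be sketched behaviorally. The following is a minimal illustration that assumes XOR comparison of adjacent samples, as in the Alexander topology; the function name and the mapping of +1 and -1 to a particular clock-early or clock-late direction are assumptions for illustration, not the transistor-level behavior of the proposed PD.

```python
def early_late_decision(s_a: int, s_b: int, s_c: int) -> int:
    """Return +1, -1, or 0 (tri-state) from three consecutive binary samples.

    first_differs is set when the data changes between S_A and S_B,
    second_differs when it changes between S_B and S_C. Which of the two
    corresponds to "clock early" depends on the polarity convention
    (assumed here, not taken from the paper).
    """
    first_differs = s_a ^ s_b
    second_differs = s_b ^ s_c
    return first_differs - second_differs

# No transition among the samples: the PD output is tri-stated.
print(early_late_decision(0, 0, 0))
print(early_late_decision(1, 1, 1))

# A transition caught between S_A and S_B, or between S_B and S_C.
print(early_late_decision(1, 0, 0))
print(early_late_decision(0, 0, 1))
```

The zero output during long runs of identical bits is the tri-state feature the text relies on to limit jitter when no data transitions are available.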



Fig. 15. Proposed full-rate bang-bang PD with only five latches.



Fig. 16. Timing diagrams of the PD for clock (a) late and (b) early cases.

Based on the above concept, the PD is realized by adding a latch to delay the result of  $Q_1$  by  $T_{\rm b}/2$ , arriving at the architecture shown in Fig. 15. Fig. 16(a) and (b) illustrate the timing diagrams for cases when the clock is late and early, respectively. Note that for  $S_{\rm A}$  to be delayed by  $\Delta t$ , the average output current in one bit period is given by

$$\begin{aligned}
\bar{I}_{\text{out}} &= \frac{2I_P \left(\frac{T_b}{2} - \left(\frac{T_b}{2} - \Delta t\right)\right)}{T_b} \\
&= \frac{2I_P \Delta t}{T_b}
\end{aligned} \tag{9}$$

where $I_{\rm P}$ denotes the peak current of the V/I converter. Implementing a broadband delay of $\Delta t$ is difficult, as active delay elements suffer from delay-bandwidth trade-offs and passive elements typically have high loss and occupy a large area. Besides, both of them are prone to process variations. Using a latch to obtain a well-controlled delay of $T_{\rm b}/2$ alleviates the above issues. This also maximizes the average output current as indicated by (9). The price paid is a slight increase of loading, i.e., one more latch, on the clock buffer. Based on these observations, a 4-latch full-rate BBPD is theoretically possible, but a 5-latch version is implemented in this design considering all the trade-offs mentioned above. When the CDR is locked, the falling edge of the clock aligns with the data center and performs retiming. Compared to the conventional Alexander PD, this design reduces the number of latches while keeping important features such as the tri-state operation and the inherent data-retiming capability.
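Equation (9) shows why the latch-based delay of $T_b/2$ is the best choice within the allowed range. A small sketch, with the peak current normalized to 1 and the function name chosen for illustration:

```python
def average_pd_current(i_peak, delta_t, t_bit):
    """Average V/I converter output current over one bit period, per (9)."""
    if not 0 <= delta_t <= t_bit / 2:
        raise ValueError("delta_t must lie between 0 and T_b/2")
    return 2 * i_peak * delta_t / t_bit

T_B = 25e-12    # bit period at 40 Gb/s
I_PEAK = 1.0    # normalized peak current (assumption)

for fraction in (0.125, 0.25, 0.5):
    delay = fraction * T_B
    avg = average_pd_current(I_PEAK, delay, T_B)
    print(f"delay = {delay * 1e12:5.2f} ps -> average current = {avg:.2f} x I_P")
```

At a delay of $T_b/2$ the average current reaches the full peak value $I_P$, while any shorter delay wastes part of the comparison interval.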



Fig. 17. (a) CML latch in the DFFs. (b) XOR gate and V/I converter.

Note that the pulse width of UP/DN currents shown in Fig. 16 is halved compared to a 7-latch Alexander PD, implying a smaller amount of phase correction during each comparison interval. This issue is easily remedied by increasing the bias current of the V/I converter by a factor of 2 because the loop only senses the *average* current of the V/I converter in a bang-bang CDR. It is unlike the case in a linear PD, in which the pulses need to have sharp edges and symmetric rise/fall times to avoid static phase offset. As the V/I converter only consumes  $\sim 1$  mA, which is 4 times smaller than the current consumption of a latch, the power penalty is almost negligible.

The power saving of the proposed PD can be quantified as follows. Intuitively, the power is saved by at least 28.5% because the number of latches is reduced from 7 to 5. However, when considering the driving capability of the clock buffer, the total saved power exceeds 28.5%. To drive seven latches in a conventional PD, the inductor in the clock buffer should have smaller inductance, which decreases the equivalent parallel resistance at resonance. This effect causes swing degradation and can only be compensated by increasing the current in the clock buffer. Besides, since there are inductors in the latch, two more latches mean a larger layout area and longer signal lines. The clock skew will increase and the associated parasitic capacitances degrade the speed. All these phenomena may require extra power dissipation to balance undesired effects, making the power saving of our proposed design more significant.

Fig. 18. Simulated characteristic of the proposed PD.

The CML latch used in the DFFs is depicted in Fig. 17(a). To operate at 40 Gb/s, it is based on the class-AB current-switching topology with shunt inductor peaking. The AC-coupling capacitors are made by vertical parallel-plate capacitors [11] using the 8th to second metal layers. Such structure exhibits a high density of 1.3 fF/ $\mu$ m<sup>2</sup> and a low bottom-plate capacitance of only 4%. The inductors are realized as stacked structures formed by the 9th and 6th metal layers in order to minimize the area and the length of high-speed interconnect. Each inductor in the latch has an inductance of 300 pH and occupies only  $20 \times 20 \ \mu m^2$ . The necessity to use inductors also strengthens the idea of minimizing the number of latches as every additional latch causes two more inductors, which increases the area rapidly and makes the routing of high-speed signals unmanageable. The XOR gates and V/I converter are depicted in Fig. 17(b). Modified from [12], resistors instead of current mirrors are used at the XOR outputs, resulting in larger output swings at high speed. Moreover, this allows separate optimization of the bias current of the XOR gate and that of the V/I converter, providing extra degree of freedom in the design. Fig. 18 shows the simulated characteristic of the PD. It produces an average current of 200  $\mu$ A for a phase error larger than 3 ps. The linear region is  $\sim$ 6 ps wide and the gain in the center is  $\sim 83 \ \mu \text{A/ps}$ .

### E. Overall Receiver Simulation

The receiver contains two loops and their behaviors must be carefully examined to ensure convergence of each. Fig. 19 shows the transistor-level simulation of the two loops. The CDR uses a second-order loop filter with components of $R_1 = 200\ \Omega$, $C_1 = 1$ nF, and $C_2 = 25$ pF, resulting in a loop bandwidth of ~40 MHz<sup>4</sup> and a capture range of ±60 MHz [13]. The capacitor $C_2$ is entirely integrated on chip for better filtering of ripples on the control line. Initially the boost of the equalizer is set to maximum and the VCO frequency is brought to within 50 MHz of the data rate. The CDR loop gets locked first, and then provides the retimed data for the adaptation loop. Both loops settle smoothly with minimal interference with each other.

<sup>4</sup>Here we assume that the jitter at the equalizer output is small enough for the PD to operate in its linear region (~6 ps wide). In cases when the equalized signal exhibits large jitter (> 6 $ps_{pp}$), the loop behavior must be predicted using the bang-bang model [15].

Fig. 19. Settling behaviors of the two loops in the receiver (the unit of the vertical axis is volts and of the horizontal axis is seconds).

Fig. 20. Die photo of the receiver.
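The characteristic frequencies of the loop filter follow from the quoted component values. The sketch below assumes the common topology in which $R_1$ is in series with $C_1$ and the pair is shunted by $C_2$; this topology and the variable names are assumptions, since the filter schematic is not given in this section. The ~40 MHz loop bandwidth also depends on the PD and VCO gains and is not reproduced here.

```python
import math

R1 = 200.0     # ohms (from the text)
C1 = 1e-9      # 1 nF (from the text)
C2 = 25e-12    # 25 pF (from the text)

# Stabilizing zero set by R1 and C1 (assumed series R1-C1 branch).
f_zero = 1.0 / (2 * math.pi * R1 * C1)

# Ripple-filtering pole set by R1 and the series combination of C1 and C2.
c_series = C1 * C2 / (C1 + C2)
f_pole = 1.0 / (2 * math.pi * R1 * c_series)

print(f"Loop-filter zero: {f_zero / 1e3:.1f} kHz")
print(f"Loop-filter pole: {f_pole / 1e6:.1f} MHz")
```

Under this assumed topology the zero sits near 0.8 MHz and the pole introduced by $C_2$ near 33 MHz.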

## IV. EXPERIMENTAL RESULTS

The receiver is fabricated in digital 90 nm CMOS and tested on a high-speed probe station. Fig. 20 shows the die, which measures  $0.77 \times 0.7 \text{ mm}^2$  including the pads. The 40 Gb/s input is provided by an Anritsu random data generator (MP1803A + MP1758A). Precision Timebase Module (Agilent 86107A) and 70 GHz dual-remote sampling head (Agilent 86118A) are used to minimize the jitter caused by the oscilloscope itself.

Fig. 21(a) and (b) show the 40 Gb/s eye diagrams at the cable output and the receiver output, respectively. The data suffers from 10 dB loss at 20 GHz after passing through a 4 m cable [1], resulting in a completely closed eye diagram. The retimed data exhibits 500 mV<sub>pp</sub> output swing and 9.6 ps<sub>pp</sub> jitter with BER < $10^{-12}$, for an input PRBS of $2^7-1$. The recovered clock is depicted in Fig. 22, showing an rms jitter of 0.3 ps and a peak-to-peak jitter of 4.3 ps. The receiver consumes 115 mW, of which 58 mW is dissipated in the equalizer and 57 mW in the CDR circuit.

Fig. 21. 40 Gb/s $2^7-1$ eye diagrams (a) after passing through a cable with 10 dB loss, and (b) at the receiver output.

Fig. 22. Recovered 40 GHz clock corresponding to the cable output shown in Fig. 21(a).
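To put the measured figures in perspective, the jitter values can be expressed in unit intervals (UI) and the power as energy per bit. This is simple arithmetic on the numbers reported above; the helper name `ps_to_ui` is an assumption.

```python
BIT_RATE = 40e9                  # 40 Gb/s
T_B_PS = 1e12 / BIT_RATE         # bit period in picoseconds (25 ps)

def ps_to_ui(jitter_ps):
    """Convert a jitter value in picoseconds to unit intervals at 40 Gb/s."""
    return jitter_ps / T_B_PS

clock_rms_ui = ps_to_ui(0.3)     # recovered clock, rms
clock_pp_ui = ps_to_ui(4.3)      # recovered clock, peak-to-peak
data_pp_ui = ps_to_ui(9.6)       # retimed data, peak-to-peak

total_power_mw = 58 + 57         # equalizer + CDR
energy_pj_per_bit = total_power_mw * 1e-3 / BIT_RATE * 1e12

print(f"Clock jitter: {clock_rms_ui:.3f} UI rms, {clock_pp_ui:.3f} UI pp")
print(f"Data jitter:  {data_pp_ui:.3f} UI pp")
print(f"Total power:  {total_power_mw} mW, {energy_pj_per_bit:.3f} pJ/bit")
```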

With a shorter length of cable, i.e., 2 m, the data suffers from 5 dB loss at 20 GHz. The cable output and the receiver output for an input PRBS of $2^{31}-1$ are depicted in Fig. 23(a) and (b), respectively. The retimed output data shows a completely opened eye diagram with 7.4 ps<sub>pp</sub> jitter and BER < $10^{-12}$. The clock jitter is 0.42 ps<sub>rms</sub> and 3.9 ps<sub>pp</sub> as shown in Fig. 24.

Fig. 23. 40 Gb/s $2^{31}-1$ eye diagrams (a) after passing through a cable with 5 dB loss, and (b) at the receiver output.

Fig. 24. Recovered 40 GHz clock corresponding to the cable output shown in Fig. 23(a).

The receiver is also tested under the condition of no cable loss. The supply can be decreased to 1.2 V and the CDR power consumption can be reduced to 32 mW in this mode.<sup>5</sup> The recovered data and clock for an input PRBS of $2^{31}-1$ are depicted in Fig. 25(a) and (b), respectively. The retimed data jitter is 7.7 ps<sub>pp</sub> with BER < $10^{-12}$, while the clock jitter is 0.4 ps<sub>rms</sub> and 3.4 ps<sub>pp</sub>. It is interesting to compare the power dissipation of the CDR with [14], which uses a half-rate architecture in the same technology. The full-rate design consumes 32 mW while the half-rate design consumes 42 mW. The key is that the half-rate architecture requires eight latches in the PD and one more clock buffer, compromising the advantage of the relaxed speed requirement of the DFFs. Besides, the need to generate and distribute accurate I/Q phases makes the full-rate design more attractive if the speed limit of DFFs can be overcome by circuit/layout techniques.

Fig. 25. Recovered (a) data and (b) clock with a 40 Gb/s $2^{31}-1$ PRBS input passing through a zero-loss cable.

The measured jitter performance for different cable losses and PRBS lengths is summarized in Fig. 26. The clock rms jitter rises from 300 fs to 400 fs when the PRBS length increases to $2^{31}-1$. For all the measurements reported here, the phase noise is less than -105 dBc/Hz at 1-MHz offset. The retimed data jitter is 6–8 ps<sub>pp</sub> for cable loss of 0–5 dB. It increases to 9.6 ps<sub>pp</sub> when the cable loss approaches the maximum equalization capability of the receiver. As observed in Fig. 21(b), some patterns are not equalized properly, which causes double edges on the eye diagram and is the main contributor to the larger peak-to-peak jitter. The effectiveness of equalization is lower than predicted because the severe phase mismatch between the two cables causes

<sup>5</sup>The CDR needs to operate with a 1.5 V supply only when the longest cable is used. Under such condition, the equalized signal may not be good enough, so the CDR needs a higher supply to boost the speed of DFFs in the PD.

|                  | [2]                 | [3]                  | [5]                  | [6]                  | [7]                      | This Work            |
|------------------|---------------------|----------------------|----------------------|----------------------|--------------------------|----------------------|
| Technology       | 160-GHz SiGe        | 0.18 µm CMOS         | 0.13 µm CMOS         | 0.13 µm CMOS         | 90 nm CMOS               | 90 nm CMOS           |
| Data Rate        | 40 Gb/s             | 40 Gb/s              | 20 Gb/s              | 10 Gb/s              | 10 Gb/s                  | 40 Gb/s              |
| Equalization     | 17 dB               | 17 dB                | 10 dB                | 18 dB                | 33 dB<sup>#</sup>        | 10 dB                |
| Adaptation       | NO                  | NO                   | YES                  | YES                  | NO                       | YES                  |
| CDR              | NO                  | NO                   | NO                   | YES                  | YES                      | YES                  |
| Rec. Data Jitter | 5.1 ps<sub>pp</sub> (PRBS 7) | 13 ps<sub>pp</sub> (PRBS 31) | 14 ps<sub>pp</sub> (PRBS 31) | 20 ps<sub>pp</sub> (PRBS 7) | N/A | 9.6 ps<sub>pp</sub> (PRBS 7) |
| Rec. Data Swing  | 700 mV<sub>pp</sub> | 120 mV<sub>pp</sub>  | 300 mV<sub>pp</sub>  | 250 mV<sub>pp</sub>  | N/A                      | 500 mV<sub>pp</sub>  |
| Sensitivity      | 300 mV<sub>pp</sub> | N/A                  | 300 mV<sub>ppd</sub> | 640 mV<sub>ppd</sub> | 1.2 V<sub>ppd</sub>      | 1.3 V<sub>ppd</sub>  |
| BER              | 10<sup>-12</sup>    | 10<sup>-5</sup>      | 10<sup>-15</sup>     | 10<sup>-13</sup>     | 10<sup>-12</sup>         | 10<sup>-12</sup>     |
| Supply           | 3.3 V               | 1.8 V                | 1.5 V                | 1.6 V                | 1.2/1.0 V                | 1.8/1.5 V            |
| Power (EQ/CDR)   | 760 mW              | 70 mW                | 60 mW                | 133 mW (41/92)       | 130 mW<sup>\*\*\*</sup>  | 115 mW (58/57)       |
| Area             | 1.5 mm<sup>2</sup>  | 1 mm<sup>2</sup>     | 0.2 mm<sup>2</sup><sup>\*</sup> | 0.61 mm<sup>2</sup><sup>\*</sup> | 0.86 mm<sup>2</sup><sup>\*\*</sup> | 0.54 mm<sup>2</sup> |

 TABLE V.1

 Measured Receiver Performance Compared With Similar Works

<sup>#</sup>: Achieved by equalization in both transmitter and receiver.

\*: Only core area.

\*\*: Area of the transmitter has been excluded.

\*\*\*: This includes only the DFE and CDR logic.



Fig. 26. (a) Clock RMS jitter and (b) data peak-to-peak jitter for different cable loss and PRBS length.

additional eye closure when the input signals are sensed differentially. This effect becomes more pronounced for a PRBS length of $2^{31}-1$, and even prevents the receiver from locking. However, for cable loss smaller than or equal to 5 dB, the jitter of the recovered clock and data increases only slightly when the PRBS length increases from $2^7-1$ to $2^{31}-1$. This confirms the ability of the proposed PD/CDR to tolerate long runs while producing a low-jitter output.
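To put the reported jitter figures in perspective, they can be expressed as a fraction of the bit period (unit interval, UI), which is 25 ps at 40 Gb/s. The following Python sketch performs this conversion for the retimed-data jitter values quoted in the text. The UI normalization and the names `unit_interval_ps`, `jitter_in_ui`, and `DATA_JITTER_PP_PS` are illustrative assumptions, not part of the original measurements.

```python
# Sketch: express measured peak-to-peak jitter as a fraction of the unit interval (UI).
# The jitter values are those quoted in the text; the helper names are hypothetical.

DATA_RATE_GBPS = 40.0  # receiver data rate stated in the paper


def unit_interval_ps(data_rate_gbps: float) -> float:
    """Return the bit period in picoseconds for a data rate given in Gb/s."""
    if data_rate_gbps <= 0:
        raise ValueError("data rate must be positive")
    return 1000.0 / data_rate_gbps


def jitter_in_ui(jitter_ps: float, data_rate_gbps: float = DATA_RATE_GBPS) -> float:
    """Convert a jitter value in picoseconds to unit intervals."""
    return jitter_ps / unit_interval_ps(data_rate_gbps)


# Retimed-data jitter (ps, peak-to-peak) keyed by cable loss at 20 GHz (dB).
# Note: the 10 dB point was measured with PRBS 2^7-1, the others with PRBS 2^31-1.
DATA_JITTER_PP_PS = {0: 7.7, 5: 7.4, 10: 9.6}

if __name__ == "__main__":
    for loss_db, jitter_ps in DATA_JITTER_PP_PS.items():
        ui = jitter_in_ui(jitter_ps)
        print(f"{loss_db:>2} dB loss: {jitter_ps} ps_pp = {ui:.3f} UI_pp")
```

Under these assumptions, the worst-case 9.6 ps<sub>pp</sub> data jitter corresponds to about 0.38 UI, while the 0 dB and 5 dB cases stay near 0.3 UI.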

The performance of the overall receiver is summarized in Table V.1 and compared with prior works related to serial-link applications. This work is the first 40 Gb/s CMOS serial-link receiver that performs both adaptive equalization and CDR. With cable loss ranging from 0 to 10 dB, the receiver is able to recover the clock and data with BER < $10^{-12}$. The input sensitivity for the 4 m cable is 1.3 V<sub>pp</sub> differential, while it reduces to 650 mV<sub>pp</sub> for the 0 m and 2 m cables. The extensive use of stacked inductors greatly reduces the area. Though there are 28 inductors in the design, the total area is even smaller than the core area of similar works.
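One way to read the power row of the comparison table is to normalize it by data rate. The following Python sketch does so using the power and data-rate values listed in the table. The energy-per-bit metric, the function name `energy_per_bit_pj`, and the `ENTRIES` list are illustrative assumptions and do not appear in the original comparison. Because the listed works differ in function (not all include adaptation or a CDR, and the figure for [7] covers only the DFE and CDR logic), the result is only a rough indicator, not a like-for-like ranking.

```python
# Sketch: normalize the power figures from the comparison table by data rate.
# Power (mW) and data rate (Gb/s) are copied from the table; names are hypothetical.

ENTRIES = [
    ("[2]", 760.0, 40.0),
    ("[3]", 70.0, 40.0),
    ("[5]", 60.0, 20.0),
    ("[6]", 133.0, 10.0),
    ("[7]", 130.0, 10.0),  # table footnote: includes only the DFE and CDR logic
    ("This work", 115.0, 40.0),
]


def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """Return energy per bit in pJ/bit.

    mW / (Gb/s) = 1e-3 J/s / 1e9 bit/s = 1e-12 J/bit = 1 pJ/bit,
    so the ratio of the two table values is already in pJ/bit.
    """
    if data_rate_gbps <= 0:
        raise ValueError("data rate must be positive")
    return power_mw / data_rate_gbps


if __name__ == "__main__":
    for name, power_mw, rate_gbps in ENTRIES:
        epb = energy_per_bit_pj(power_mw, rate_gbps)
        print(f"{name:>10}: {power_mw:6.1f} mW at {rate_gbps:4.1f} Gb/s = {epb:6.3f} pJ/bit")
```

With these table values, the receiver in this work comes out at roughly 2.9 pJ/bit while providing equalization, adaptation, and CDR.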

## V. CONCLUSION

A 40 Gb/s serial-link receiver realized in 90 nm CMOS performs adaptive equalization as well as clock and data recovery. The equalizer is based on a parallel-path filter with an *LC*-loaded stage for compensating the high-frequency channel loss. It allows varying the boost without altering the low-frequency gain, which leads to a single loop for adaptation and greatly reduces the complexity. The full-rate CDR circuit incorporates a 5-latch bang-bang PD which relaxes the loading on the clock buffer and decreases the power consumption. The lower number of latches also saves the area occupied by inductors, which makes the routing of high-speed clock and data lines more manageable and hence improves the performance. The fabricated prototype demonstrates the capability to recover 40 Gb/s data suffering from 10 dB loss at 20 GHz. The receiver reproduces the data with 500 mV<sub>pp</sub> output swing and 9.6 ps<sub>pp</sub> jitter with BER <  $10^{-12}$ , while consuming 115 mW from 1.8/1.5 V supplies.

## ACKNOWLEDGMENT

The authors would like to thank TSMC for chip fabrication and NTU-MediaTek Laboratory and National Science Council (NSC) for support of this work.

## REFERENCES

- [1] United Microwave Products Inc., Torrance, CA [Online]. Available: http://www.unitedmicrowave.com/55.gif
- [2] A. Garg, A. C. Carusone, and S. P. Voinigescu, "A 1-tap 40 Gb/s lookahead decision feedback equalizer in 0.18 μm SiGe BiCMOS technology," *IEEE J. Solid-State Circuits*, vol. 41, no. 10, pp. 2224–2232, Oct. 2006.
- [3] J. Sewter and A. C. Carusone, "A 3-tap FIR filter with cascaded distributed tap amplifiers for equalization up to 40 Gb/s in 0.18 μm CMOS," *IEEE J. Solid-State Circuits*, vol. 41, no. 8, pp. 1919–1929, Aug. 2006.
- [4] G. E. Zhang and M. M. Green, "A 10 Gb/s BiCMOS adaptive cable equalizer," *IEEE J. Solid-State Circuits*, vol. 40, no. 11, pp. 2132–2140, Nov. 2005.
- [5] S. Gondi and B. Razavi, "Equalization and clock and data recovery techniques for 10 Gb/s CMOS serial-link receivers," *IEEE J. Solid-State Circuits*, vol. 42, no. 9, pp. 1999–2011, Sep. 2007.
- [6] J. Lee, "A 20 Gb/s adaptive equalizer in 0.13 μm CMOS technology," *IEEE J. Solid-State Circuits*, vol. 41, no. 9, pp. 2058–2066, Sep. 2006.
- [7] J. F. Bulzacchelli et al., "A 10-Gb/s 5-tap DFE/4-tap FFE transceiver in 90-nm CMOS technology," *IEEE J. Solid-State Circuits*, vol. 41, no. 12, pp. 2885–2900, Dec. 2006.
- [8] B. Razavi, Design of Integrated Circuits for Optical Communications. New York: McGraw Hill, 2003.
- [9] J. Savoj and B. Razavi, "A 10 Gb/s CMOS clock and data recovery circuit with a half-rate linear phase detector," *IEEE J. Solid-State Circuits*, vol. 36, no. 5, pp. 761–767, May 2001.
- [10] J. D. H. Alexander, "Clock recovery from random binary data," *Electron. Lett.*, vol. 11, pp. 541–542, Oct. 1975.
- [11] R. Aparicio and A. Hajimiri, "Capacity limits and matching properties of integrated capacitors," *IEEE J. Solid-State Circuits*, vol. 37, no. 3, pp. 384–393, Mar. 2002.

- [12] J. Lee and B. Razavi, "A 40 Gb/s clock and data recovery circuit in 0.18-µm CMOS technology," *IEEE J. Solid-State Circuits*, vol. 38, no. 12, pp. 2181–2190, Dec. 2003.
- [13] C. S. Vaucher, Architectures for RF Frequency Synthesizers. Boston, MA: Kluwer, 2002.
- [14] C.-F. Liao and S.-I. Liu, "40 Gb/s transimpedance-AGC amplifier and CDR circuit for broadband data receivers in 90 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 43, no. 2, pp. 642–655, Feb. 2008.
- [15] J. Lee, K. S. Kundert, and B. Razavi, "Analysis and modeling of bang-bang clock and data recovery circuits," *IEEE J. Solid-State Circuits*, vol. 39, no. 9, pp. 1571–1580, Sep. 2004.



**Chih-Fan Liao** was born in Taipei, Taiwan, in 1981. He received the B.S. and Ph.D. degrees in electrical engineering from National Taiwan University, Taipei, in 2003 and 2007, respectively. While pursuing the Ph.D. degree, his research focused on broadband CMOS circuit design for UWB and high-speed wireline receivers.

He is currently with MediaTek Inc., HsinChu, Taiwan, where he is a Senior Design Engineer and engaged in the development of low-cost Bluetooth receivers.



**Shen-Iuan Liu** (S'88–M'93–SM'03) was born in Keelung, Taiwan, in 1965. He received the B.S. and Ph.D. degrees in electrical engineering from National Taiwan University (NTU), Taipei, Taiwan, in 1987 and 1991, respectively.

During 1991–1993, he served as a second lieutenant in the Chinese Air Force. During 1991–1994, he was an Associate Professor in the Department of Electronic Engineering, National Taiwan Institute of Technology. He joined the Department of Electrical Engineering, NTU, in 1994, where he has been a Professor since 1998. His research interests are in analog and digital integrated circuits and systems.

Dr. Liu has served as chair of the IEEE SSCS Taipei Chapter in 2004-2008. He has served as general chair of the 15th VLSI Design/CAD Symposium, Taiwan (2004) and as Program Co-chair of the Fourth IEEE Asia-Pacific Conference on Advanced System Integrated Circuits, Fukuoka, Japan (2004). He was the recipient of the Engineering Paper Award from the Chinese Institute of Engineers in 2003, the Young Professor Teaching Award from MXIC Inc., the Research Achievement Award from NTU, and the Outstanding Research Award from National Science Council in 2004. He has served as a technical program committee member for ISSCC in 2006-2008 and A-SSCC since 2005. He was an Associate Editor for IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, PART II: EXPRESS BRIEFS in 2006-2007. He has been an Associate Editor for the IEEE JOURNAL OF SOLID-STATE CIRCUITS since 2006 and an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, PART I: REGULAR PAPERS since 2008. He has also been an Associate Editor for IEICE (The Institute of Electronics, Information and Communication Engineers) Transactions on Electronics since 2008. He joined the Editorial Board of Research Letters in Electronics in 2008. He is a senior member of IEEE and a member of IEICE.