# A 32 Gb/s ADC-Based PAM-4 Receiver with 2-bit/Stage SAR ADC and Partially-Unrolled DFE

Shiva Kiran, Shengchang Cai, Ying Luo, Sebastian Hoyos, and Samuel Palermo

Analog and Mixed-Signal Center, Texas A&M University, College Station, Texas, USA

luminoss1@tamu.edu, spalermo@tamu.edu

Abstract: A PAM-4 ADC-based receiver employs a 32-way time-interleaved 6-bit 2-bit/stage loop-unrolled SAR ADC with a single capacitive reference DAC. Digital equalization complexity is reduced with a new PAM-4 DFE architecture that has a gate count comparable to an NRZ DFE, while also nearly halving the critical path delay. A 3-tap FFE is embedded in the ADC using an additional non-binary DAC to improve the coverage of the 6-bit FFE coefficient space. The partial equalization provided by this 3-tap embedded FFE and the CTLE front-end allows placement of the CDR's Mueller-Muller phase detector directly at the ADC output to avoid excessive loop delay. Fabricated in GP 65nm CMOS, the 32 Gb/s receiver operates at a BER $< 10^{-11}$ with a 27 dB loss channel and $< 10^{-9}$ with a 30 dB loss channel without utilizing any transmit equalization. The complete ADC-based receiver achieves a power efficiency of 8.25 pJ/bit, including all the front-end, ADC, and DSP power.

## I. INTRODUCTION

ADC-based receiver front-ends enable powerful digital equalization and error correction that can leverage CMOS process scaling [1]–[5]. These ADC front-ends are a natural fit for PAM-4 standards that allow operation at a reduced Nyquist frequency and lower channel loss. However, PAM-4 modulation increases the equalization complexity, particularly the digital decision feedback equalizer (DFE). This motivates the development of high-speed ADCs and digital equalization architectures that can efficiently support PAM-4 modulation.

SAR ADC conversion speed can be improved with techniques such as 2-bit/stage conversion and loop-unrolling. However, conventional loop-unrolled 2-bit/stage conversion requires multiple DACs and comparators that present increased loads to the T/H circuit and capacitive DAC, respectively. This increased loading leads to higher power consumption in the T/H circuit and a decrease in the range of FFE coefficients that can be realized when FFE is embedded in the ADC [6].

In both mixed-signal and digital implementations, closing a DFE's 1-UI critical path is a significant challenge. When loop-unrolling is applied, this critical path is formed by a multiplexer loop [7]. PAM-4 DFE implementations suffer from a larger multiplexer loop delay due to the 4:1 multiplexers in the loop, as opposed to 2:1 multiplexers in NRZ DFE implementations. A technique known as look-ahead multiplexing [7] significantly reduces the critical path length. However, both the look-ahead multiplexing and loop-unrolling techniques greatly increase the gate count in the DFE. This problem is further exacerbated in a PAM-4 DFE due to its complexity increasing by a factor of 4 for each additional DFE tap. This has led recent implementations to adopt an FFE-only approach [2], which can suffer from thermal noise, quantization noise, and cross-talk amplification.

This work presents a 32 Gb/s ADC-based receiver for PAM-4 modulation. Section II provides an overview of the receiver architecture, which includes the ADC, DSP, and Mueller-Muller CDR loop. Section III details the architecture and circuit-level design of the ADC, which employs a single DAC for 2-bit/stage conversion and a reference-scaled shared comparator input stage [8] to reduce loading on the capacitive DAC. Section IV introduces a new digital partially-unrolled DFE (PU-DFE) architecture, which reduces PAM-4 DFE complexity to that of an NRZ DFE architecture. ADC and full receiver measurement results from a GP 65nm CMOS prototype are presented in Section V. Finally, Section VI concludes this paper.

## II. RECEIVER ARCHITECTURE

Fig. 1 shows the ADC-based PAM-4 receiver architecture. The 4-stage CTLE-VGA front-end consists of 2 CTLE stages, a gain stage, and a source follower stage. Programmable capacitor banks in the CTLE stages provide 5 to 11 dB of peaking. The front-end output is set to match the ADC full-scale via the programmable resistor in the second CTLE stage that provides gain control.

Following the analog front-end are 8 parallel track-and-hold (T/H) circuits that are clocked by 8 clock phases running at 2 GHz and separated by 45 degrees. These 2 GHz phases, which are generated by a CML divide-by-4 block supplied with an 8 GHz clock signal, constitute the critical phases that require skew calibration. Timing mismatches of these T/H clocks are calibrated with a digitally-controlled variable delay line. The sampling phase is set by a baud-rate CDR loop that employs a Mueller-Muller phase detector, a proportional and integral loop filter, and a bank of current-mode phase interpolators. An interpolator resolution of 64 phase steps per UI is used, yielding a step size of just under 1 ps for the 62.5 ps baud period. A 3-tap FFE embedded within the ADC provides further linear equalization before quantization noise is added to the signal. This analog front-end and embedded ADC equalization allows placement of the CDR's Mueller-Muller phase detector directly at the ADC output to avoid excessive loop delay.
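The clocking numbers quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only values stated in the text; the variable names are illustrative and not taken from the design.

```python
# Values quoted in the text.
DATA_RATE_GBPS = 32.0       # PAM-4 data rate
BITS_PER_SYMBOL = 2         # PAM-4 carries 2 bits per symbol
PI_STEPS_PER_UI = 64        # phase interpolator resolution
TH_PHASES = 8               # parallel track-and-hold circuits
DIVIDER_INPUT_GHZ = 8.0     # clock supplied to the CML divider
DIVIDE_RATIO = 4            # CML divide-by-4
ADC_INTERLEAVE = 32         # 32-way time-interleaved ADC

baud_rate_gbd = DATA_RATE_GBPS / BITS_PER_SYMBOL      # symbol rate in GBd
ui_ps = 1e3 / baud_rate_gbd                           # unit interval in ps
pi_step_ps = ui_ps / PI_STEPS_PER_UI                  # interpolator step in ps
th_clock_ghz = DIVIDER_INPUT_GHZ / DIVIDE_RATIO       # per-phase T/H clock
sample_rate_gsps = TH_PHASES * th_clock_ghz           # aggregate sample rate
phase_spacing_deg = 360.0 / TH_PHASES                 # spacing of T/H phases
unit_adc_rate_gsps = sample_rate_gsps / ADC_INTERLEAVE

print(f"UI = {ui_ps} ps, PI step = {pi_step_ps} ps")
print(f"T/H clock = {th_clock_ghz} GHz, spacing = {phase_spacing_deg} deg")
print(f"Aggregate rate = {sample_rate_gsps} GS/s, "
      f"unit ADC rate = {unit_adc_rate_gsps} GS/s")
```

The aggregate rate of the 8 T/H phases equals the 16 GBd symbol rate, which is what a baud-rate (one sample per symbol) CDR requires.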

Fig. 1: ADC-based PAM-4 receiver with CTLE front-end, 6-bit SAR ADC, DSP, and CDR.

The DSP includes a main 12-tap FFE and a 2-tap DFE. An additional parallel 4-tap FFE allows for a significant reduction in DFE gate count, as elaborated in Section IV. All the DSP equalizer coefficients are set through a sign-sign LMS (SS-LMS) algorithm.
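For readers unfamiliar with sign-sign LMS, the following is a minimal sketch of the update rule applied to a small FFE. It is not the adaptation engine of this receiver: the channel (unit main cursor plus a single 0.4 post-cursor), the 3-tap length, the step size `mu`, and the use of known training symbols as the reference are all assumptions chosen for illustration.

```python
import random

PAM4_LEVELS = (-1.0, -1 / 3, 1 / 3, 1.0)


def sign(x):
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def ss_lms_ffe(rx, ref, num_taps=3, mu=1e-3):
    """Adapt a causal FFE with sign-sign LMS.

    Returns the final taps and the list of squared errors per sample.
    """
    taps = [1.0] + [0.0] * (num_taps - 1)  # start as a pass-through filter
    sq_err = []
    for n in range(num_taps - 1, len(rx)):
        window = [rx[n - k] for k in range(num_taps)]
        y = sum(w * x for w, x in zip(taps, window))
        err = ref[n] - y
        sq_err.append(err * err)
        # Only the signs of the error and of the data enter the update,
        # so no multipliers are needed in the adaptation path.
        taps = [w + mu * sign(err) * sign(x) for w, x in zip(taps, window)]
    return taps, sq_err


rng = random.Random(1)
symbols = [rng.choice(PAM4_LEVELS) for _ in range(20000)]
# Hypothetical channel: main cursor of 1.0 and one post-cursor of 0.4.
rx = [symbols[n] + 0.4 * (symbols[n - 1] if n else 0.0)
      for n in range(len(symbols))]

taps, sq_err = ss_lms_ffe(rx, symbols)
mse_start = sum(sq_err[:500]) / 500
mse_end = sum(sq_err[-500:]) / 500
print("taps:", [round(t, 3) for t in taps])
print("MSE first 500:", mse_start, "last 500:", mse_end)
```

Because each update moves a tap by exactly plus or minus `mu`, the step size trades convergence speed against steady-state tap jitter.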

## III. ADC ARCHITECTURE

The ADC is a 32-way time-interleaved 2-bit/stage 6-bit SAR ADC with a 3-tap embedded FFE. Fig. 2 shows the 2-bit/stage unit ADC that has three stages for the 6-bit conversion. Each stage employs a 2-bit flash ADC as the quantizing block with the reference levels internally generated by intentionally skewed comparator regeneration stages. The reference levels for the per-stage flash ADC scale according to the stage and hence are not dynamically set. This allows the use of a single DAC and removes the overhead of multiple DACs present in other multi-bit/stage implementations. The main cursor signal is top-plate sampled to avoid any signal attenuation caused by the comparator input capacitance and routing parasitics.
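The three-stage, 2-bit/stage conversion can be illustrated with an idealized behavioral model. This is a sketch under stated assumptions, not the circuit: the comparators are ideal, the input is normalized to a differential range of [-1, 1), and a single residue variable stands in for the single DAC. Each stage compares the residue against three fixed thresholds that shrink by 4 from one stage to the next, mirroring the statement that the per-stage references scale with the stage rather than being set dynamically. The function name and arguments are hypothetical.

```python
def sar_2b_per_stage(v, num_stages=3, full_scale=1.0):
    """Ideal 2-bit/stage SAR conversion of v in [-full_scale, full_scale).

    Returns an unsigned code with 2 * num_stages bits.
    """
    residue = v
    code = 0
    step = 2.0 * full_scale / 4  # width of one stage-1 sub-range
    for _ in range(num_stages):
        # 2-bit flash: three fixed thresholds centered on zero.
        thresholds = (-step, 0.0, step)
        t = sum(residue >= thr for thr in thresholds)  # value 0..3
        code = 4 * code + t
        # The DAC re-centers the residue on the selected sub-range.
        residue -= (t - 1.5) * step
        step /= 4  # next stage uses references scaled down by 4
    return code


# Every one of the 64 bin centers should map back to its own code.
centers = [-1.0 + (c + 0.5) / 32 for c in range(64)]
codes = [sar_2b_per_stage(v) for v in centers]
print(codes == list(range(64)))
```

In the actual loop-unrolled design each stage has its own comparators, which this sequential loop does not capture; the model only shows how one DAC and stage-scaled thresholds resolve 6 bits in three steps.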



Fig. 2: 2b/stage loop-unrolled unit ADC with embedded 3-tap FFE.

In order to embed a 3-tap FFE, samples from T/H blocks sampling the pre and post signal values with respect to the current signal value are sampled onto the bottom-plate of a differential FFE DAC. These pre and post samples are scaled and summed on the FFE DAC before they are effectively subtracted from the main cursor sample due to the differential connection at the flash ADC input. The use of 2-bit flash ADCs as the quantization block presents the challenge of the FFE DAC being loaded with 9 comparator input stages (3 comparators in each of the 3 stages). Because the pre and post samples are bottom-plate sampled, the comparator input capacitance can cause significant attenuation, which could decrease the range of the FFE coefficients. In order to reduce the loading of the comparators on the FFE DAC, the input stage of the 3 comparators of each flash ADC is shared (Fig. 3(a)). This reduces the effective DAC loading by 3×.

Since the pre and post FFE samples are sampled on the same DAC, not all FFE coefficient combinations are possible. As shown in Fig. 3(b), improved coefficient coverage is achieved with the implementation of a non-binary FFE DAC. Bottom-plate sampling on the FFE DAC results in a gain of 0.48, which sets the range for the tap coefficient values.

## IV. DSP ARCHITECTURE

Fig. 3: (a) Shared-input stage 2-bit flash ADC schematic. (b) Improved 6-bit resolution embedded FFE coefficient coverage map with non-binary FFE DAC (right plot).

As shown in Fig. 4(a), the main signal path through the DSP consists of a 12-tap FFE and a 2-tap partially-unrolled DFE (PU-DFE). In parallel is a 4-tap FFE that is used to reduce the PU-DFE complexity; its output PDF and Symbol 0.33 CDF for a high-loss channel case are shown in Fig. 4(b). A conventional 2-tap PAM-4 loop-unrolled DFE requires computation of all the possible equalized values for the symbol at time n. This involves 4 possible choices for the symbol at time n-1 and 4 possible choices for the symbol at time n-2, and results in 16 possible equalized symbol values. If the previous symbols have been partially equalized, as they are at the output of the parallel FFE, then the less likely half of the values for each of the previous symbols can be discarded during the loop unrolling process without incurring errors at a probability that impacts the overall target BER. Hence, the loop is only partially unrolled and only 4 sums (2 remaining choices for each of the 2 previous symbols) need to be computed. Referring to the parallel FFE output PDF $f_V(v)$, if the previous symbol falls in Region 1, then the previously transmitted symbol most likely corresponds to a normalized value of -1 or -0.33, and hence DFE coefficients corresponding to symbols 0.33 and 1 are not used in precomputing the possible equalized symbols for the current symbol. As verified in the Symbol 0.33 CDF plot, the probability of the partially equalized signal at the parallel FFE output crossing the boundary between Regions 1 and 2 is significantly lower than the target BER. This property makes the error introduced by partial loop unrolling several orders of magnitude lower than the target BER. Computations are eliminated in a similar manner for symbols falling in Regions 2 and 3.
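The reduction from 16 to 4 precomputed sums can be made concrete with a small enumeration. This is a sketch under assumptions, not the implemented logic: the region boundaries are placed at the normalized values -1/3 and +1/3, the symbol pairs for Regions 2 and 3 are inferred from the Region 1 description, and the names `h1`, `h2`, `v1`, and `v2` (DFE tap weights and parallel-FFE outputs for the two previous symbols) are hypothetical.

```python
from itertools import product

LEVELS = (-1.0, -1 / 3, 1 / 3, 1.0)  # normalized PAM-4 symbol values


def likely_symbols(v, boundary=1 / 3):
    """Pick the two candidate symbols for a partially equalized sample v.

    Region 1 (v < -boundary): -1 or -1/3, as described in the text.
    Regions 2 and 3 are assumed to follow the same pattern.
    """
    if v < -boundary:
        return (LEVELS[0], LEVELS[1])
    if v < boundary:
        return (LEVELS[1], LEVELS[2])
    return (LEVELS[2], LEVELS[3])


def full_candidates(y, h1, h2):
    """Conventional loop unrolling: one sum per (s1, s2) pair, 16 in total."""
    return {(s1, s2): y - h1 * s1 - h2 * s2
            for s1, s2 in product(LEVELS, repeat=2)}


def partial_candidates(y, h1, h2, v1, v2):
    """Partial unrolling: only the likely symbols are kept, 4 sums in total."""
    return {(s1, s2): y - h1 * s1 - h2 * s2
            for s1, s2 in product(likely_symbols(v1), likely_symbols(v2))}


full = full_candidates(0.2, h1=0.3, h2=0.1)
part = partial_candidates(0.2, h1=0.3, h2=0.1, v1=-0.8, v2=0.5)
print(len(full), len(part))
```

The partial set is a strict subset of the full set, so whenever the true previous symbols lie within the selected pairs, the PU-DFE selects exactly the value a fully unrolled DFE would.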

Fig. 4: (a) DSP architecture. (b) Parallel FFE output PDF and Symbol 0.33 CDF.

Performing look-ahead transformations is necessary to meet the 1-UI critical path through the loop-unrolled DFE multiplexer loop [7]. While this relaxes the critical path timing requirement, it results in a considerable increase in the number of multiplexers needed. For a conventional PAM-4 N-tap DFE implementation, the number of 2:1 multiplexers needed for a look-ahead factor of L and P parallel paths is $6 \cdot P \cdot \left((L-1) \cdot 4^N + \sum_{i=0}^{N-1} 4^i\right)$. However, the PU-DFE architecture reduces the number of 2:1 multiplexers to $2 \cdot P \cdot \left((L-1) \cdot 2^N + \sum_{i=0}^{N-1} 2^i\right)$. As shown in Fig. 1, this results in significant gate count savings for a given look-ahead factor. The PU-DFE architecture also reduces the conventional 4:1 multiplexers in the DFE multiplexer feedback loop to 2:1 multiplexers and thereby nearly halves the critical path length. This allows the efficient implementation of the 2-tap PU-DFE with a look-ahead factor of 8. The complete DSP architecture has 64 parallel slices operating at a clock frequency of 250 MHz.
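The two multiplexer-count expressions can be evaluated directly. The sketch below plugs in N = 2 taps and a look-ahead factor L = 8 from the text; taking P = 64 from the 64 parallel DSP slices is an assumption made here for illustration, and the function names are hypothetical.

```python
def conventional_mux_count(n_taps, lookahead, parallel):
    """2:1 mux count for a conventional PAM-4 look-ahead unrolled DFE."""
    geometric = sum(4 ** i for i in range(n_taps))
    return 6 * parallel * ((lookahead - 1) * 4 ** n_taps + geometric)


def pu_dfe_mux_count(n_taps, lookahead, parallel):
    """2:1 mux count for the partially-unrolled DFE (PU-DFE)."""
    geometric = sum(2 ** i for i in range(n_taps))
    return 2 * parallel * ((lookahead - 1) * 2 ** n_taps + geometric)


N, L, P = 2, 8, 64  # P = 64 is assumed from the 64 parallel slices
conv = conventional_mux_count(N, L, P)   # 6 * 64 * (7 * 16 + 5) = 44928
pu = pu_dfe_mux_count(N, L, P)           # 2 * 64 * (7 * 4 + 3)  = 3968
print(conv, pu, round(conv / pu, 2))
```

Under these assumptions the PU-DFE needs roughly 11 times fewer 2:1 multiplexers. The ratio is independent of P, since P is a common factor in both expressions.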

## V. EXPERIMENTAL RESULTS

Fig. 5 shows the chip micrograph of the PAM-4 ADC-based receiver prototype, which was fabricated in a GP 65nm CMOS process. The total chip area is 2.621 mm$^2$, with the ADC and the DSP occupying 0.41 mm$^2$ and 1.17 mm$^2$, respectively.

Fig. 5: ADC-based PAM-4 receiver chip micrograph.

The SNDR and SFDR of the 16 GS/s ADC as a function of input frequency are shown in Fig. 6. The achieved ENOB is 4.74 bits at low frequency and 4.29 bits at the 8 GHz Nyquist frequency.

Fig. 6: ADC SNDR and SFDR vs input frequency.

32 Gb/s PAM-4 data without any transmit equalization is used for the BER measurement results in Fig. 7(a). The timing bathtub curves are obtained by stepping the phase interpolator codes with the CDR in open-loop. A BER of less than $10^{-11}$ is achieved for a 27 dB loss channel and a BER of less than $10^{-9}$ is achieved for a 30 dB loss channel. Results with the CDR activated are also shown for the 27 dB loss channel, verifying that the CDR locks near the optimal BER point. Fig. 7(b) shows a recovered clock jitter of 939 fs rms in this test condition. Table I summarizes the receiver performance and compares it with other ADC-based receivers at data rates above 25 Gb/s. The complete 32 Gb/s ADC-based receiver achieves a power efficiency of 8.25 pJ/bit, including all the front-end, ADC, and DSP power, which is lower than that of the 25 Gb/s NRZ receiver [3]. Utilizing the CTLE front-end, the embedded 3-tap FFE in the ADC, and the DSP with the PU-DFE allows for compensation of channel loss comparable to that of the other PAM-4 receivers without employing any transmit equalization.

Fig. 7: (a) BER timing bathtub curves. (b) Recovered clock jitter histogram.

#### TABLE I: PERFORMANCE SUMMARY

| Specification | Rylov [3] | Cui [4] | Frans [5] | Aurangozeb [2] | This Work |
|---|---|---|---|---|---|
| Technology | 32 nm SOI | 28 nm CMOS | 16 nm FinFET | 65 nm CMOS | 65 nm CMOS |
| Power Supply (V) | 1 and 0.7 | N/A | 0.9, 1.2 and 1.8 | N/A | 1.1 and 0.9 |
| Data Rate (Gb/s) | 25 | 32 | 56 | 28 | 32 |
| Modulation Format | PAM-2 | PAM-4 | PAM-4 | PAM-4 | PAM-4 |
| ADC Sample Rate (GS/s) | 25 | 16 | 28 | 14 | 16 |
| ADC Structure | Flash | SAR | SAR | Flash | SAR |
| Pre-Equalization | CTLE | CTLE | CTLE | Passive Equalization | CTLE + 3-tap Embedded FFE |
| Post-Equalization | 8-tap FFE + 8-tap DFE | N/A | 24-tap FFE + 1-tap DFE | 3 to 8-tap FFE | 12-tap FFE + 2-tap DFE |
| Resolution (bit) | 5 | 8 | 8 | 2 to 5.5 | 6 |
| ENOB @ Nyquist | 4 | 5.85 | 4.9 | 4.1 | 4.29 |
| Area (mm²) | 0.6 | 0.89 | N/A | 0.2025 | 2.62 |
| Maximum Compensated Channel Loss (dB) | 40 @ 12 GHz | 32 @ 8 GHz | 31 @ 14 GHz | 30 @ 7 GHz | 30 @ 8 GHz |
| Analog Front End + ADC Power (mW) | 310 | 320 | 370 | 130 | 166 |
| DSP Power (mW) | 143 | N/A | N/A | N/A | 98 |
| AFE + ADC Power Efficiency (pJ/bit) | 12.4 | 10 | 6.61 | 4.64 | 5.19 |
| DSP Power Efficiency (pJ/bit) | 5.72 | N/A | N/A | N/A | 3.06 |
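The power-efficiency figures follow from the power and data-rate entries of Table I, since a power in mW divided by a rate in Gb/s is numerically an energy in pJ/bit. The short check below uses only numbers from the table for the two receivers that report DSP power; the function and dictionary names are illustrative.

```python
def pj_per_bit(power_mw, data_rate_gbps):
    """Energy per bit in pJ: 1 mW / (1 Gb/s) = 1 pJ/bit."""
    return power_mw / data_rate_gbps


# (AFE + ADC power in mW, DSP power in mW, data rate in Gb/s) from Table I.
receivers = {
    "Rylov [3]": (310.0, 143.0, 25.0),
    "This Work": (166.0, 98.0, 32.0),
}

totals = {}
for name, (afe_adc_mw, dsp_mw, rate_gbps) in receivers.items():
    totals[name] = pj_per_bit(afe_adc_mw + dsp_mw, rate_gbps)
    print(f"{name}: AFE+ADC {pj_per_bit(afe_adc_mw, rate_gbps):.2f}, "
          f"DSP {pj_per_bit(dsp_mw, rate_gbps):.2f}, "
          f"total {totals[name]:.2f} pJ/bit")
```

The total for this work evaluates to 8.25 pJ/bit, matching the quoted figure, against 18.12 pJ/bit for the 25 Gb/s NRZ receiver [3].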

## VI. CONCLUSION

This paper presented a 32 Gb/s PAM-4 receiver with a time-interleaved 6-bit SAR ADC that uses a 2-bit/stage unit ADC and an embedded 3-tap FFE. A new PU-DFE architecture reduces the PAM-4 DFE complexity to that of an NRZ DFE, providing both a reduced gate count and higher data-rate operation. The receiver achieves a measured BER $< 10^{-9}$ with a 30 dB loss channel without employing any transmit equalization.

## ACKNOWLEDGEMENT

This work was supported by Intel (Task 2583.001).

## REFERENCES

- [1] A. Shafik *et al.*, "A 10 Gb/s Hybrid ADC-Based Receiver With Embedded Analog and Per-Symbol Dynamically Enabled Digital Equalization," *IEEE Journal of Solid-State Circuits*, vol. 51, no. 3, pp. 671–685, Mar. 2016.
- [2] Aurangozeb *et al.*, "Channel adaptive ADC and TDC for 28 Gb/s PAM-4 digital receiver," in *2017 IEEE Custom Integrated Circuits Conference (CICC)*, 2017, pp. 1–4.
- [3] S. Rylov *et al.*, "A 25 Gb/s ADC-based serial line receiver in 32nm CMOS SOI," in *2016 IEEE International Solid-State Circuits Conference (ISSCC)*, 2016, pp. 56–57.
- [4] D. Cui *et al.*, "A 320mW 32 Gb/s 8b ADC-based PAM-4 analog front-end with programmable gain control and analog peaking in 28nm CMOS," in *2016 IEEE International Solid-State Circuits Conference (ISSCC)*, 2016, pp. 58–59.
- [5] Y. Frans *et al.*, "A 56-Gb/s PAM4 Wireline Transceiver Using a 32-Way Time-Interleaved SAR ADC in 16-nm FinFET," *IEEE Journal of Solid-State Circuits*, vol. 52, no. 4, pp. 1101–1110, Apr. 2017.
- [6] E. Z. Tabasy *et al.*, "A 6 bit 10 GS/s TI-SAR ADC With Low-Overhead Embedded FFE/DFE Equalization for Wireline Receiver Applications," *IEEE Journal of Solid-State Circuits*, vol. 49, no. 11, pp. 2560–2574, Nov. 2014.
- [7] S. Kasturia *et al.*, "Techniques for high-speed implementation of nonlinear cancellation," *IEEE Journal on Selected Areas in Communications*, vol. 9, no. 5, pp. 711–717, 1991.
- [8] S. Cai *et al.*, "A 25 GS/s 6b TI binary search ADC with soft-decision selection in 65nm CMOS," in *2015 Symposium on VLSI Circuits*, 2015, pp. C158–C159.