## 12.2 A 335Mb/s 3.9mm<sup>2</sup> 65nm CMOS Flexible MIMO Detection-Decoding Engine Achieving 4G Wireless Data Rates

Markus Winter<sup>1</sup>, Steffen Kunze<sup>1</sup>, Esther Perez Adeva<sup>1</sup>, Björn Mennenga<sup>1</sup>, Emil Matûs<sup>1</sup>, Gerhard Fettweis<sup>1</sup>, Holger Eisenreich<sup>1</sup>, Georg Ellguth<sup>1</sup>, Sebastian Höppner<sup>1</sup>, Stefan Scholze<sup>1</sup>, René Schüffny<sup>1</sup>, Tomoyoshi Kobori<sup>2</sup>

<sup>1</sup>Technical University Dresden, Dresden, Germany <sup>2</sup>NEC, Tokyo, Japan

In current and future wireless standards, such as WiMAX, 3GPP-LTE or LTE-Advanced, receiver terminals have to support numerous operating modes for each protocol [1], as well as sophisticated transmission techniques, especially enhanced MIMO detection and iterative forward error correction (FEC). MIMO detection and FEC are among the most computationally complex parts of the receiver-side baseband signal processing chain. Implementations must consume little power and interact flexibly and efficiently within the detection-decoding engine, without compromising the challenging throughput and flexibility requirements of 4G standards. In this paper, we present a chip implementation of a MIMO sphere detector combined with a flexible FEC engine, realizing a detection-decoding engine in silicon that satisfies 4G requirements with a data rate of 335Mb/s.

Designing flexible, high-throughput and cost-effective VLSI detectors is a challenge in multi-antenna spatial multiplexing systems with high constellation orders (e.g. 4×4 MIMO, 64-QAM). Conventional low-complexity detectors, such as Successive Interference Cancellation (SIC), provide poor BER performance, whereas exhaustive-search algorithms [2] (full max-log-APP detection) cannot meet 4G data rates at reasonable hardware complexity. K-best detection [3] is well suited for hardware implementation because it parallelizes easily, but it generally sacrifices BER performance and adaptability to channel conditions in favor of fixed data rates. Non-deterministic sphere decoding (SD) algorithms achieve better performance, but cannot be parallelized in a straightforward way. We overcome this drawback by decomposing the algorithm into an arbitrary number of regularized loops, each with a fixed-length critical path that is independent of constellation size and number of MIMO layers [5].
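To illustrate why sphere decoding has a data-dependent (non-deterministic) runtime, the following is a minimal hard-output, depth-first sketch in Python. It is not the soft-output tuple search architecture of [5]. It assumes the channel has already been reduced to an upper-triangular matrix `R` (e.g. by QR decomposition), and all function names are hypothetical.

```python
import itertools


def sphere_detect(R, y, alphabet):
    """Depth-first search for the symbol vector minimizing ||y - R s||^2.

    R is an upper-triangular matrix (list of rows), y the received vector,
    alphabet the list of constellation points. Assumed setup, for illustration.
    """
    n = len(y)
    best_metric = float("inf")
    best_symbols = None
    s = [0j] * n  # current partial candidate, filled from the last layer up

    def search(layer, partial_metric):
        nonlocal best_metric, best_symbols
        # Remove the contribution of the layers already decided.
        interference = sum(R[layer][j] * s[j] for j in range(layer + 1, n))
        residual = y[layer] - interference
        # Try the most promising constellation point first.
        branches = sorted(
            ((abs(residual - R[layer][layer] * a) ** 2, a) for a in alphabet),
            key=lambda t: t[0],
        )
        for increment, a in branches:
            metric = partial_metric + increment
            if metric >= best_metric:
                break  # outside the current sphere; later branches are worse
            s[layer] = a
            if layer == 0:
                # Leaf reached: shrink the sphere radius to this metric.
                best_metric, best_symbols = metric, list(s)
            else:
                search(layer - 1, metric)

    search(n - 1, 0.0)
    return best_symbols, best_metric


def exhaustive_detect(R, y, alphabet):
    """Reference: evaluate every candidate vector (exponential complexity)."""
    n = len(y)

    def metric(c):
        return sum(
            abs(y[i] - sum(R[i][j] * c[j] for j in range(i, n))) ** 2
            for i in range(n)
        )

    best = min(itertools.product(alphabet, repeat=n), key=metric)
    return list(best), metric(best)
```

The number of tree nodes visited by `sphere_detect` depends on the channel and the noise, which is what makes throughput variable, whereas the exhaustive reference always evaluates every candidate.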

The FEC challenge for 4G wireless is the requirement to support more than one coding type, typically a combination of convolutional, Turbo, Reed-Solomon, or LDPC codes. Usually, several independent IP cores are used for this purpose [1], resulting in unnecessary overhead that could be mitigated by combining the decoding capabilities for different code types into one decoder. However, the sole ASIC implementation [4] that combines Turbo and LDPC decoding does not fulfill 4G wireless requirements. In this work, we realize an efficient high-throughput decoder suitable for direct interaction with our SD in a 4G communication system.

We implemented our detection-decoding engine within the 'Tommy' MPSoC by connecting an SD core and an FEC core via a packet-switched NoC, as in other MPSoCs (e.g. [1]). The NoC's flexibility allows the SD and FEC to be used as stand-alone units or as an integrated detection-decoding chain. The SD core consists of an application-specific instruction-set processor (ASIP) including a control path and a vector datapath to support SIMD vectorization (e.g. for OFDM systems). The datapath is partitioned into several functional units (FUs) [5], as shown in Fig. 12.2.1. Since pipelining cannot be applied directly to the feedback-loop-based SD datapath, 5-stage pipeline interleaving of independent MIMO-symbol detections is used to enhance throughput. Because FU output ports are buffered, the data produced by one FU can be consumed directly by connected FUs, avoiding the need for intermediate storage. The memory interface has been designed to allow concurrent access to channel and symbol data, avoiding throughput degradation. Conditional memory access (e.g. triggered by detection termination) is assisted by a flow control unit in the control path.
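The benefit of pipeline interleaving can be sketched with a small cycle-count model in Python. This is a simplified illustration, not the chip's actual scheduler: it assumes every detection needs the same number of loop iterations (real sphere detection is data-dependent), that one operation can be issued per cycle, and that a result is available `pipeline_depth` cycles after issue. The function name is hypothetical.

```python
def cycles_to_finish(num_streams, steps_per_stream, pipeline_depth):
    """Cycles until all streams have completed their dependent steps.

    Each stream is a feedback loop: its next step can only be issued once
    the previous result has left the pipeline. Streams are independent of
    each other and are served round-robin.
    """
    remaining = [steps_per_stream] * num_streams
    ready_at = [0] * num_streams  # earliest cycle each stream may issue
    cycle = 0
    last_result = 0
    next_stream = 0
    while any(remaining):
        for offset in range(num_streams):
            idx = (next_stream + offset) % num_streams
            if remaining[idx] and ready_at[idx] <= cycle:
                remaining[idx] -= 1
                ready_at[idx] = cycle + pipeline_depth
                last_result = ready_at[idx]
                next_stream = (idx + 1) % num_streams
                break  # at most one issue per cycle
        cycle += 1
    return last_result


single = cycles_to_finish(num_streams=1, steps_per_stream=10, pipeline_depth=5)
interleaved = cycles_to_finish(num_streams=5, steps_per_stream=10, pipeline_depth=5)
print(single, interleaved)
```

Under these assumptions, a single detection leaves four of the five pipeline stages idle (50 cycles for 10 steps), while five interleaved detections keep the pipeline full (54 cycles for 50 steps), which is close to a fivefold throughput gain.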

The flexible FEC block contains a programmable multicore ASIP capable of decoding convolutional, Turbo, and LDPC codes [6]. It consists of three identical, independently programmable processor cores connected through an interconnection network to banks of local memory, as illustrated in Fig. 12.2.2. This enables dynamic core clustering and multimode operation, where i) any number of cores can jointly process a code block; or ii) different codes can be decoded simultaneously on independent clusters. Each core incorporates a control path and a SIMD datapath. The integral parts of the datapath are the four processing elements (PEs), designed to exploit key similarities in the basic operations of the decoding algorithms. The internal PE parallelism allows processing of 16 trellis states in parallel for Viterbi and Turbo decoding, or alternatively, 8 LDPC check node updates in parallel. The interconnection network can be configured to perform the random permutations inherent to Turbo decoding, or the barrel shift necessary for permuting submatrices of an LDPC parity-check matrix.
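The link between a barrel shift and an LDPC submatrix can be made concrete with a short Python sketch. In quasi-cyclic LDPC codes such as the WiMAX codes, each non-zero submatrix of the parity-check matrix is a cyclically shifted identity matrix, so multiplying a block of messages by it is the same as rotating that block. The helper names and the shift direction convention below are assumptions chosen for illustration.

```python
def barrel_shift(values, shift):
    """Rotate a list left by `shift` positions: out[i] = values[(i + shift) % n]."""
    shift %= len(values)
    return values[shift:] + values[:shift]


def circulant_permutation(size, shift):
    """Identity matrix with its columns cyclically shifted: P[i][j] = 1 iff j == (i + shift) % size."""
    return [
        [1 if j == (i + shift) % size else 0 for j in range(size)]
        for i in range(size)
    ]


def apply_matrix(matrix, values):
    """Plain matrix-vector product over lists."""
    return [sum(m * v for m, v in zip(row, values)) for row in matrix]


messages = [10, 11, 12, 13, 14, 15, 16, 17]
via_shift = barrel_shift(messages, 3)
via_matrix = apply_matrix(circulant_permutation(len(messages), 3), messages)
print(via_shift == via_matrix)
```

Because the rotation replaces a full matrix multiplication, a configurable barrel shifter in the interconnection network is sufficient to route messages between the memory banks and the processing elements for such codes.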

Figure 12.2.3 shows the Tommy block diagram. All-digital PLLs provide an individual clock to each unit, ranging from 83MHz to 667MHz. This allows every unit to be adjusted to its optimal operating point, achieving the required throughput at minimal power consumption. The LVDS-based I/O interface to an FPGA runs at 500MHz, providing a data rate of 8Gb/s in each direction.

The Tommy MPSoC was fabricated in a TSMC 65nm CMOS process. The 17M transistor chip occupies 1.875×3.750=7.03125mm<sup>2</sup> including all 84 I/O cells (see Fig. 12.2.7). The core supply voltage is 1.2V in the typical case and is adjustable from 1.1V to 1.35V for the entire chip. The chip was tested using our measurement and demonstrator chain, shown in Fig. 12.2.4.

The MIMO-detector unit supports up to 64-QAM, 4×4 MIMO transmission. It occupies 0.31mm<sup>2</sup>, including 2.75kB of SRAM. It supports frequencies of up to 333MHz at 1.2V core voltage, consuming 36mW on average. The SD throughput-SNR tradeoff is adjustable, ranging from 296Mb/s at 14.1dB SNR up to 807Mb/s at 15.55dB SNR (for an information block size of 9216b, ½ rate PCCC, random interleaver, flat fading Rayleigh channel and a Turbo decoder with 8 internal iterations). Moreover, the MIMO-detector unit can be configured to perform SIC detection, reaching 2Gb/s.

The FEC occupies 3.6mm<sup>2</sup> on the chip. A total of 69.1kB of memory is used. At 1.2V, the Turbo decoding mode operates at up to 333MHz, with a power consumption of 283mW and a throughput of 99Mb/s using the LTE-standard PCCC (rate 1/3 with a block size of 128b and 6 iterations). The LDPC mode reaches a maximum clock frequency of 267MHz at 1.2V, with a power dissipation of 367mW and a throughput of 235.2Mb/s for a ¾ rate WiMAX code with a block size of 768b and 10 iterations. The corresponding energy efficiency is 0.17nJ/b/iteration. Maximum LDPC throughput is 335.4Mb/s at 1.35V and 381MHz, albeit at a lower efficiency.
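As a rough plausibility check, energy per bit per iteration can be estimated as power divided by throughput and by the iteration count. The Python sketch below assumes this is how such a figure is formed, which the text does not state explicitly, and the function name is hypothetical. It uses only the rounded power and throughput values quoted above.

```python
def energy_per_bit_per_iteration_nj(power_mw, throughput_mbps, iterations):
    """Estimate decoding energy in nJ per bit per iteration.

    mW divided by Mb/s gives nJ/b, since the 1e-3 and 1e6 factors
    combine to 1e-9. Dividing by the iteration count normalizes the
    result to a single decoder iteration.
    """
    return power_mw / throughput_mbps / iterations


ldpc = energy_per_bit_per_iteration_nj(367, 235.2, 10)
turbo = energy_per_bit_per_iteration_nj(283, 99, 6)
print(f"LDPC:  {ldpc:.3f} nJ/b/iteration")
print(f"Turbo: {turbo:.3f} nJ/b/iteration")
```

With the rounded LDPC figures this estimate comes out at about 0.16nJ/b/iteration, in the same range as the quoted 0.17nJ/b/iteration; the remaining gap may stem from rounding or from measurement conditions not given here.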

A comparison with previous MIMO detector chips in Fig. 12.2.5 shows the area-throughput trade-off and energy efficiency. Our implementation outperforms recent hard-output K-best realizations [2-3]. As shown in Fig. 12.2.6, our FEC achieves 4× the throughput of the previous ASIC implementation at 1/3 the area and half the power, and satisfies 4G data rate requirements.

## References:

[1] F. Clermidy, et al., "A 477mW NoC-Based Digital Baseband for MIMO 4G SDR", *ISSCC Dig. Tech. Papers*, pp. 278-279, Feb. 2010.

[2] C. Studer, et al., "Soft-Output Sphere Decoding: Algorithms and VLSI Implementation", *IEEE J. on Selected Areas in Communications*, vol. 26, no. 2, pp. 290-300, 2008.

[3] M. Shabany, et al., "A 0.13um CMOS 655 Mb/s 4x4 64-QAM K-Best MIMO Detector", *ISSCC Dig. Tech. Papers*, pp. 256-257, Feb. 2009.

[4] F. Naessens, et al., "A 10.37 mm2 675 mW reconfigurable LDPC and Turbo encoder and decoder for 802.11n, 802.16e and 3GPP-LTE", *IEEE Symp. on VLSI*, pp. 213-214, 2010.

[5] E. P. Adeva, et al., "VLSI Architecture for Soft-Output Tuple Search Sphere Decoding", *IEEE Workshop on Signal Processing Systems*, 2011.

[6] S. Kunze, et al., "A 'Multi-User' Approach towards a Channel Decoder for Convolutional, Turbo and LDPC Codes", *IEEE Workshop on Signal Processing Systems (SiPS)*, pp. 386-391, 2010.

## ISSCC 2012 / February 21, 2012 / 2:00 PM




