

# Different Aspects of MAP Turbo Decoding: A Comparative Analysis

<sup>[1]</sup>E. Sujatha <sup>[2]</sup> Dr. C. Subhas <sup>[3]</sup> Dr. M. N. Giri Prasad <sup>[1][2][3]</sup> Student Member, IEEE <sup>[1][2][3]</sup> Jawaharlal Nehru Technology University, Anatapur.

*Abstract:* — Long Term Evolution-Advanced (LTE-A) targets the peak data rates in surplus of 3Gbps for present and next generation wireless communication systems. Turbo coding, the stated channel coding scheme in 3rd Generation Partnership Project (3GPP) LTE standard, is an advanced forward error correction coding to achieve higher throughput of advanced wireless communication technologies. To support the peak throughputs, parallel turbo decoding procedure has become a necessity and the corresponding VLSI implementation is extremely challenging task to the design engineers. The higher throughput applications require higher parallelism of turbo decoder design; which results in increased hardware complexity, major source of power consumption and silicon area. This paper addresses the design and implementation aspects of different parallel turbo decoders, which meet peak data rates of 3GPP LTE and LTE-Advanced standards, where throughput, power consumption, silicon area and latency are the most decisive cost factors.

Index Terms: --- Turbo coding, parallel Turbo decoder, 3GPP, 4G Long Term Evaluation-Advanced, VLSI.

#### I. INTRODUCTION

In Wireless communications, Channel coding is a vital part of the physical layer and it performs a crucial role in reliable data transmission and latency. In today's modern channel coding Turbo, low density parity check (LDPC) and polar codes achieve the high diversity and possible large coding gain in fading channels. By the invention of turbo coding scheme, a sudden revolution has been happened in the last two decades, therefore, there was a jump from 2G system to 3G and then 4G(LTE & LTE-A) systems. Now the jump from 4G to 5G is expected with new challenges and contributions in wireless communications, which utilizes the merits of satellite communication systems. The present communication system (LTE/LTE-A) offer highest possible throughput up to 1Gbps (LTE-A) and excess of 1Gbps, by using Turbo codes, which can offer BER performance near to Shannon limit [1]. Especially, in lengthy code words, their performance reaches close to the theoretical limits.

Turbo codes were introduced by Berrou, Glavieux and Thitimajashima [1] in 1993. They are high performance forward error correction codes, also they are the first practical codes closely reach maximum theoretical channel capacity and provides reliable data transmission over noisy channel. Turbo codes have been chosen as channel coding scheme to 3G and 4G standards[Fig.1] due to their favourable properties like low complexity encoder, adoptable code rate and iterative decoding process. Turbo codes are used in IEEE 802.16(Wi-MAX) 3G and 4G mobile standards such as in HSPA, EV-DO, LTE, also in deep space satellite systems such as DVB-RCS,DVB-RCS2.

The 3GPP LTE-Advanced standard specified by the International Telecommunication Union Radio-Communication Sector (ITUR) for International Telecommunications Advanced (IMT-A) is cited as fourth generation(4G). A random like code with efficient iterative decoding process was found in the year 1993 and was named as Turbo code [1]. Turbo codes have been widely adapted in 3G and 4G wireless communication standards due to its reliable transmission and excellent forward error correction capability. Decoding plays a vital role in the receiver while retrieving the same data, sent from the transmitter in wireless communications.





To achieve higher throughputs of 3GPP LTE-A standards, distinct turbo decoding algorithms directed at highly parallel architectures. The research is rapidly going on design of efficient Turbo decoders and their vlsi implementations for higher throughput beyond 3Gbps for next generation communication systems.

As well, the research is rapidly going on forthcoming 5<sup>th</sup> generation communication systems, which is a combined system of cellular networks providing higher order data rates and satellite communications covering large area. The expected achievement with the 5G system is high spectral efficiency, low battery consumption, low latency, large number of services, very high throughput and capacity. 5G speed is more than 10 times greater than 4G Systems with high spectral efficiency. To achieve these expectations, the modulation scheme and error correcting codes able to promise very low error rates are essential. The coding scheme of 5G communication system is a hybrid structure of these modern coding schemes. The important design parameters in 5G systems to be considered are bandwidth, energy consumption, latency and implementation complexity.

The remainder of this paper is commenced as follows. In section II, an overview of Turbo Coding of 3GPP LTE standard and various algorithms proposed for turbo decoding. Section III considers design perspective of distinct methods proposed for performance improvement is presented. Finally, Section IV summarizes the performance of various recent vlsi developments of the turbo decoder. Section V concludes and future work of further decoding implementations.

#### **Turbo** Coding

3GPP LTE standard turbo encoder is a parallel concatenation of convolutional encoders connected via

an interleaver as shown in LHS part of Fig.2 to encode the information bits , to be transmitted [10]. At the output , turbo encoder generates systematic bit sequence, non interleaved and interleaved parity bits.

The bits input to the Turbo code internal interleaver are denoted by  $X_{0}^{s}$ ,  $X_{1}^{s}$ ...  $X_{K-1}^{s}$ .

Where *K* is the number of input bits.

The bits output from the Turbo code internal interleaver are denoted by  $X_{p_0}^{p_2}, X_{k_{-1}}^{p_2}, \dots, X_{k_{-1}}^{p_{2}}$ .

are denoted by  $X_{0}^{p_{2}}, X_{1}^{p_{2}}, \dots, X_{K-1}^{p_{2}}$ . The relationship between the input and output bits is:

 $X^{p^2}i = X^{p^2}\pi(i), i = 0, 1, ..., K-1$  ------ (i)

Where the relationship between the output index *i* and the input  $\pi$  (i), index satisfies the following quadratic form:

 $\pi$  (i), = (f1 i + f2 i<sup>2</sup>) mod K -----(ii)

The parameters f1 and f2 are pre defined values depending on the block size K [10].

On the other side, a turbo decoder consists of two key components to decode the data in iterative manner: Soft Input Soft Output (SISO) decoders and Interleavers /De-interleavers as shown in Fig.3. There is two prime Turbo decoding algorithms for decoding recursive systematic convolutional codes in iteratively, namely Soft Output Viterbi Algorithm (SOVA)[3] and maximum a posteriori(MAP) Algorithm [4]. Here, SOVA can be implemented with a lower hardware resource and lower computational complexity than MAP Algorithm [5].



## Fig.2. Block diagram of turbo encoder and a turbo decoder

#### Turbo Decoder Design Perspective

Various recent VLSI implementations have been proposed by researchers to obtain high throughput with maximum possible low latency by using different architectural implementation methods for LTE and LTE-A Standard wireless communications. The latency of sliding window SISO decoders causes decreasing the throughput gains with expanding degree of parallelism, as the system latency is directly proportional to sliding window size. There are trade-offs among number of



iterations, throughput and size of sliding window for designing a suitable turbo decoder architecture for physical layer.

Throughput  $(\Theta_T)$  of the decoder depends on CMOS technology used ,clock frequency(F) and number of decoding iterations( $\rho$ ).

 $\Theta_{\rm T} \alpha F$  and  $\Theta_{\rm T} \alpha 1/\rho$  ------ (iii)

- Number of iterations of decoding cannot be altered, as it degrades the error rate performance.
- Possibility of improving throughput is by improving operating clock frequency.
- Also, by scaling down the CMOS technology ,large design issue can be resolved to some extent.
- Low power techniques can be incorporated in architecture of turbo decoder while designing for performance improvement.

A comparable analysis of existing architectures distinctly shows that Parallelism in SISO decoders for high throughput increases latency and hardware complexity. So, It's a challenging task to design a efficient turbo decoder design of higher data rates.

Conventional MAP Algorithm involves in complex mathematical operations such as multiplication, division and exponential [4]. Logarithmic transformation of this algorithm overcome such complex operations and log map algorithm simplifies computation od stare metric in each of the trellis stage by using branch metrics and state metrics of previous state [11,12]. But log-Map algorithm approximation has high CPU running time Max-log-MAP algorithm.

Even though the parallel decoders achieve higher data rates, it necessitates huge hardware resources, hence the challenging task of turbo decoder is to design scaled-down the hardware complexity. This type of work carried out by reducing the memory need for storing the forward state metrics and branch metrics in each SISO decoder, low power track back of MAP based duo binary turbo decoders and memory reduction based on metric compression using non uniform quantization and walsh-hadamard transform methods in 2011 by M. Martina [13, 14]. Benkeser et. Al. [2009] had proposed radix-2 non Parallel turbo decoder architecture, using max-log-Map algorithm. It achieved a throughput of 20.2 Mbps, suitable for HSDPA wireless communication standard and it occupies less silicon area [15]. Kim et. all [2009] proposed a radix-4 architecture of turbo decoder with 8 SISO decoders in parallel to achieve 100 Mbps throughput using Max-log-Map algorithm. This realization of SISO parallel structure occupies design area 10.7mm<sup>2</sup> [16].

Studer et al. [2011] designed radix-4 parallel turbo decoder architecture using Max-log-Map algorithm to achieve a throughput of 390 Mbps, implemented in 130 nm CMOS technology , where pipelining concept is applied to master slave batcher network to have parallel and interleaved access to memories at high throughput[17] . In this regard, next section summarizes the various design aspects of recent parallel turbo decoders of 4G standard.

#### VLSI Turbo decoder Architectures

The details of recent research activities on parallel turbo decoder implementation tradeoffs and a comparative analysis of different architectural methods are presented.

**Belfanti et.al [2013]** proposed a turbo decoder ASIC design which supports all 188 block sizes for any specified code rate. The key novelty of this work is to reduce the required window size to obtain flexible and efficient high throughput decoder. This design employs 16 parallel radix-4 SISO decoders with an optimized state metric initialization technique to reduce the latency, without any considerable loss in BER performance. In detail, the effect of latency on throughput gain is reduced by performing the consecutive half iterations in an overlapping fashion instead of strictly sequential manner. Also, this novelty work avoids the redundant turbo iterations by 24-bit CRC checksum defined in the standard to increase the energy efficiency of the decoder [19].

*G. Wang et al. [2014]* proposed a flexible and efficient VLSI architecture with a 45nm CMOS technology [19]. This supports multi standards 3G and 4G like HSPA+/LTE/LTE-A. This design use 16 radix-4 SISO MAP decoders . This work aims, to avoid severe memory conflicts problem raised by concurrent memory reading/writing in parallel turbo decoders, which is one of the major bottleneck in high throughput



decoders. Memory conflicts take place due to randomness of interleaver /de-interleaver. Here, two individual scheduling schemes are proposed to eliminate memory reading/writing conflicts. A balanced scheduling scheme to avoid memory reading contentions and a double buffer contention free (CDBF) buffer architecture to avoid memory writing conflicts.

A balanced scheduling scheme[19] proposed here is to provide continuous read operation in next half iteration by writing the output data in interleaved/de-interleaved order in both iterations as shown in Fig.(b). So, all the memory reading operations are continues unlike unbalanced scheduling, where both read write operations sharing the same data from/to the memory in first half iteration then siso decoder first reads data from memory in an interleaved way then once the computation is done, the siso decoder writes the data to memory in de-interleaved way; which results in memory conflicts.



Fig.3(a) Unbalanced scheduling Fig.3(b) Balanced scheduling

A double double buffer contention free buffer architecture[9] to solve memory writing conflicts .This architecture consists of FIFO and Circular built around the interleaver as shown in the Fig.rs place . Circular buffers built by registers to store the concurrent data write operations . All incoming data, rejected LLR datum are pushed into a FIFO, to avoid over flow and writing conflicts.



Fig.4 DBCF architecture

*Rahul Shreshta and Roy P.Paily* [2014] proposed a high throughput turbo decoder with 8 and 64 parallel radix-2 Map decoder architecture in 90 nm CMOS technology[25].

This work focused on VLSI design for high speed Map probability decoders using Log MAP algorithm. This work proposed two schemes, a new ungrouped backward recursion scheme in sliding window Log BCJR algorithm to compute backward state metrics. Secondly, a new state metric normalization technique applied in Add Compare Select Unit to reduce the critical path delay. These two techniques offer retiming and pipelining in architecture for performance improvement. Thirdly, this work adopted fine grain clock gating technique to solve power issue.

*Jing-shuin Lin et al.[2015]* proposed a highly parallel turbo decoder structure to achieve highest throughput rate of 1.45Gbps implemented in 90 nm CMOS technology. This work aims to reduce warm-up computation quantity per each decoding window such that to improve the decoding efficiency and this improvement is possible by modifying the parallel window MAP decoding algorithm.

This modification is named as a dummy calculation reduced parallel window (DCRPW) MAP decoding algorithm, which reduces the dummy/warm up computation adopted to make sure the convergence of backward/forward metrics. This technique is useful in highly parallel decoders than sliding window based turbo decoding. Also this work proposes a dual mode computing schedule to support various rates and block lengths [22].



Fig.5 Proposed architecture of Turbo decoder

It has Turbo controller, as shown in figure which sends decoding signal to start decoding process in conformity with the assigned data, then a priori information for the next iteration . This process continuous till the maximum iteration is reached. Carlo Condo et al.[2013] particularly proposed the design of a reconfigurable architecture for both turbo and LDPC codes decoding. The novelty of this contribution is; Firstly, introducing a formal and systematic treatment to tackle the reconfiguration issue and secondly, flexible reconfigurable NoC-based proposing turbo/LDPC decoder architecture with a small complexity overhead[18].

An Li et. All [2016] proposed a fully parallel turbo decoding algorithm[24] ,which allows parallel processing to offer higher processing throughput unlike conventional MAP decoding algorithm which owning to the serial data dependencies. This novel FPTD algorithm reduces 50% of computational complexity and enhances its suitability for FPGA implementations. The FPTD algorithm adopt the concept of equivalent logic blocks , this logic block corresponds to a pair of 4 input Look Up Tales and a register or 6 input Look Up Tales and a register. Pipelining of the loading, processing and pushing out the outputs is allowed when decoding several successive frames.

| Comparison of differe | nt hard ware | implementa | tion |
|-----------------------|--------------|------------|------|
|-----------------------|--------------|------------|------|

|                             | Comparison for LTE Code standard Turbo Decoder |                   |                                |                               |                               |                     |                                                            |                      |  |  |
|-----------------------------|------------------------------------------------|-------------------|--------------------------------|-------------------------------|-------------------------------|---------------------|------------------------------------------------------------|----------------------|--|--|
| Implementations             | An Li<br>2016[23]                              | An Li<br>2016[24] | Jing-Shi<br>un Lin<br>2015[22] | Rahul<br>Shreshta<br>2014[25] | Rahul<br>Shreshta<br>2014[25] | G. Wang<br>2014[21] | S. Belfanti<br>2013[19]<br>& Roth C.<br>et al.<br>2014[20] | C. Condo<br>2013[18] |  |  |
| Algorithm                   | FPTD                                           | FPTD              | Max-Log                        | Log-<br>Map                   | Log-Map                       | Radix-4<br>XMAP     | Max-Log                                                    | Max-Log              |  |  |
| Radix/Parallelism           | 2/6144                                         | 2/6144            | 2/96                           | 2/8                           | 2/64                          | 4/64                | 4/16                                                       | 2/35 PEs             |  |  |
| Technology[nm]              | 65                                             | 40                | 90                             | 90                            | 90                            | 45                  | 65                                                         | 90                   |  |  |
| Clock<br>Frequency[MHz]     | 100                                            | 65                | 250                            | 625                           | 625                           | 600                 | 410                                                        | 200                  |  |  |
| Core Area[mm <sup>2</sup> ] | 109                                            | 55                | -                              | 6.1                           | 19.75                         | 2.43                | 2.49                                                       | 4.87                 |  |  |
| Power[mW]                   | 9618                                           | -                 | -                              | 272.04                        | 1450.5                        | 870                 | 966                                                        | 183.2                |  |  |
| Throughput[Mb/s]            | 1580                                           | 1530              | 1455                           | 301.69                        | 2274                          | 1670                | 1013                                                       | 292                  |  |  |
| Voltage[V]                  | 1.08                                           |                   | -                              | 1.0                           | 1.0                           | 0.81                | 1.2                                                        | -                    |  |  |
| Iterations                  | 39                                             | 28                | 5.5                            | 8                             | 8                             | 5.5                 | 5.5                                                        | 8                    |  |  |
| Gate count                  | -                                              | -                 | 6677k                          | 694k                          | 5304k                         | 1470k               | 1574k                                                      | -                    |  |  |
| Sliding window size         | -                                              |                   | 32                             | 32                            | 96                            | -                   | 14-30                                                      |                      |  |  |
| Normalized                  | 69                                             |                   |                                |                               | -                             | 1 46                | 2.46                                                       | 36                   |  |  |

#### V. CONCLUSION

In the present wireless communications, high throughput design and implementation have become dominating necessity of advancing for excellence. There has been a high speed flow in data-rate for next generation wireless communications and this will open on to more complex algorithms and VLSI architectures in next few decades. Based on this framework, the present study on turbo code and the various design aspects of high throughput turbo decoder have been furnished. So that, it is indispensable to inspect both algorithmic design and architectural sides to achieve a best design that reaches the demands of succeeding generation technology.

#### REFERENCES

- Berrou C., Glavieux A., Thitimajshima P.: 'Near Shannon Limit Error Correcting Coding And Decoding: Turbo Codes'.Proc. Int. Conf. Communications, Pp. 1064–1070, May 1993.
- [2] C. Berrou and A. Glavieux, "Near Optimum Error Correcting Coding and Decoding: Turbo-codes," IEEE Trans. Commun.,vol. 44, pp. 1261–1271, Oct 1996.
- [3] J. Hagenauer and L. Papke, "Decoding turbo codes with the soft output Viterbi algorithm (SOVA)," in proc. IEEE Int. Symp. Information Theory(ISIT'94), June 1994, pp.164.
- [4] L.R. Bahl, J. Cocke, F.Jelinek and J. Raviv, " Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inf. Theory, vol. IT-20, no. 2, pp. 284-287, Mar. 1974.
- [5] Y. Tong, T. H. Yeap and J. Y. Chouinard "VHDL implementation of a Turbo decoder with log-Map based iterative decoding," IEEE Trans. Instrum. Meas., vol. 53, pp.1268-1278, 2004.
- [6] S.Belfanti, C. Roth, M. Gautschi, C. Benkeser and Q.Huang, "A 1 Gbps LTE-Advanced Turbo decoder ASIC in 65nm CMOS," Proc. Symp. VLSI Circuits(VLSIC), Kyoto, Japan, June 2013, pp. C.284-C287.
- [7] G.Wang Wang HaoShen, YangSun Joseph R. Cavallaro Aida Vosoughi Yuanbin Guo, "Parallel Interleaver Design for a high through put HSPA+/LTE multi standard Turbo decoder" IEEE Transactions On Circuits And SystesS—I: Regular Papers, vol. 61, no. 5, May 2014.
- [8] G. Wang, A. Vosouhi, H.Shen, J.R. Cavallaro and Y.Guo, "Parallel Interleaver architecture with new scheduling scheme for high throughput configurable turbo decoder," in Proc. ISCAS, May 2013, pp. 1340-1343.
- [9] G. Wang, Y.Sun, J.R. Cavallaro, and Y.Guo, " Concurrent interleaver architecture for high



throughput multi standard parallel turbo decoder, " in Proc. IEEE Int. Conf. ASAP, Sep.2011, pp. 113-121.

- [10] Lte; Evolved Universal Terrestrial Radio Access (E-utra); Multiplexing And Channel Coding (3gpp Ts 36.212 Version 13.0.0 Release 12 & Release 11)
- [11] J.P. Woodard and L.Hanzo, "Comparative study of Turbo decoding Techniques: an over view" IEEE Trans. On Vehicular Tech., vol. 49, pp. 2208-2233, 2000.
- [12] S. Talakoub, L.Sabeti, B. Shahrrava and M. Ahmadi, "An improved Max-Log-Map algorithm for turbo decoding and turbo eualization" IEEE Trans. Ins. Meas., vol. 56, pp. 1058-1063, 2007.
- [13] T-H. Tsai and C-H. Lin, "A new memory reduced Architecture design for Log Map algorithm in Turbo Decoding," IEEE 6<sup>th</sup> CAS Symposium on Emerging Technologies: Mobile and Wireless communications, vol. 2, pp. 607-610, 2004.
- [14] M. Martina and G. Masera, "State metric compression Techniques for Turbo decodr architectures," IEEE Trans. Cir. Sys.I: Regular papers, vol. 58, pp.1119-1128, 2011.
- [15] C. Benkeser, A. Burg, T. Cupaiuolo and Q.Huang, "Design and Optimization of a HSDPA Turbo decoder ASIC," IEEE journal of Solid-State Circuits, vol. 44, pp. 98-106, 2009.
- [16] J.Kim and I. Park, "A unified parallel Radix-4 Turbo decoder for Mobile Wi-MAX and 3GPP-LTE," Proceedings of IEEE Custom Integrated Circuits Conference (CICC), pp. 487-490, 2009.
- [17] C. Studer, C. Benkeser, S. Belfanti and Q. Huang, "Design and Implementation of a parallel turbo decoder ASIC for 3GPP-LTE," IEEE journal of Solid state Circuits, vol. 46, pp. 8-17,2011.
- [18] C. Condo, M. Martina, and G. Masera, "VLSI implementation of a multi-mode turbo/LDPC decoder architecture," *IEEE Trans.Circuits Sys. I: Reg. Papers*, vol. 60, no. 6, pp. 1441–1454, Jun. 2013.
- [19] S. Belfanti, C. Roth, M. Gautschi, C. Benkeser, and Q. Huang, "A 1 Gbps LTE-advanced turbodecoder ASIC in 65 nm CMOS," in *Proc.Symp. VLSI Circuits (VLSIC)*, 2013, pp. C284–C285.
- [20] C. Roth, S. Belfanti, C. Benkeser, and Q. Huang, "Efficient parallel turbo-decoding for highthroughput wireless systems," IEEE Trans. Circuits Syst. I, vol. 61, no. 6, pp. 1824–1835, June 2014.

- [21] G. Wang, H. Shen, Y.Sun,J.R. Cavallaro, A.Vousagi And Y.Guo "Parallel Interleaver Design For A High Throughput Hspa /Lte Multi-Standard Turbo Decoder" Ieee Transactions On Circuits And Systems—I: Regular Papers, Vol. 61, NO. 5, May 2014.
- [22] Jing-shuin Lin, Ming-Der shieh, chung-Yen Liu, Der-Wei Yang, "Efficient highly parallel turbo decoder for 3GPP LTE-Advanced" 2015 international symposium on VLSI Design, Automation and Test(VLSI-DAT).
- [23] A. Li, L. Xiang, T. Chen, R. G. Maunder, B. M. Al-Hashimi, and L. Hanzo, "VLSI Implementation of Fully Parallel LTE Turbo Decoders," IEEE Access, vol. 4, pp. 323–346, Jan 2016.
- [24] An Li, Peter Hailes, Robert G. Maunder, Bashir M. Al Hashini and Lazos Hanzo, "1.5 Gbits/s FPGA Implementation of a Fully-Parallel Turbo decoder Designed for Mission-Critical Machine – Type Communication Applications " IEEE Access, 2016.
- [25] Rahul Shrestha and Roy P.Paily, "High-Throughput Turbo Decoder With Parallel Architecture for LTE Wireless Communication Standards" IEEE Transactions On Circuits And Systems—I: Regular Papers, pp. 2699-2710.