

## **International Journal of Research**

Available at <a href="https://edupediapublications.org/journals">https://edupediapublications.org/journals</a>

e-ISSN: 2348-6848 p-ISSN: 2348-795X Volume 04 Issue 13 October 2017

## **Constructing a Fa for Hardware Hastening For Dsp**

Dasari.Rama



Pg Scholar, Department of ECE, Vaageswari college of engineering, Karimnagar, sramadasari1993@gmail.com P.Shiva Ram



Assistant professor, vaages wari college of engineering, mailid:shivaram420@gmail.com

#### **ABSTRACT:**

CS representation continues to be broadly accustomed to design fast arithmetic circuits because of its natural benefit of getting rid of the big carry-propagation chains. Hardware acceleration continues to be demonstrated a very promising implementation technique for digital signal processing (DSP) domain. However, research activities have proven the arithmetic optimizations at greater abstraction levels compared to structural circuit one considerably effect on the data path performance. Instead of adopting a monolithic application-specific integrated circuit design approach, within this brief, we present a manuscript accelerator architecture composed of flexible computational models that offer the execution of a big group of operation templates present in DSP popcorn kernels. Extensive experimental evaluations reveal that the suggested accelerator architecture provides average gains as high as 61.91% in area-delay product and 54.43% in energy consumption in comparison using the condition-of-art flexible data paths. We differentiate from previous creates flexible accelerators by enabling computations to become strongly carried out with carry-save (CS) formatted data. Advanced arithmetic design concepts, i.e., recoding techniques, are employed enabling CS optimizations to become carried out inside a bigger scope compared to previous approaches.

Keywords: Arithmetic optimizations, carry-save (CS) form, data path synthesis, flexible accelerator, operation chaining.

# R IJR

### **International Journal of Research**

Available at https://edupediapublications.org/journals

e-ISSN: 2348-6848 p-ISSN: 2348-795X Volume 04 Issue 13 October 2017

#### I. INTRODUCTION

Many scientists have suggested using domain-specific coarse-grained reconfigurable accelerators, to be able to ASICs' versatility increase without considerably compromising their The performance. incorporation of heterogeneity through specialized hardware accelerators improves performance and reduces energy consumption. Modern embedded systems target high-finish application domain names needing efficient implementations of computationally intensive digital signal processing (DSP) functions. Although application-specific integrated circuits (ASICs) make up the ideal acceleration solution when it comes to performance and power, their inflexibility results in elevated plastic complexity, as multiple instantiated ASICs are necessary to accelerate various popcorn kernels [1]. High-performance flexible data happen to be suggested to efficiently map primitive or chained procedures based in the initial data-flow graph (DFG) of the kernel. The templates of complex chained procedures are generally removed from the kernel's DFG or specified by a predefined behavior template library. Design choices around the accelerator's data path highly

impact its efficiency. Existing creates coarse-grained recon-figural data paths exploit architecture-level mainly optimizations, e.g., elevated instructionlevel parallelism (ILP). The domain-specific architecture generation calculations and vary the kind and quantity of computation models achieving a personalized design structure. Flexible architectures were suggested exploiting ILP and operation chaining. Lately, Ansaloni et al. adopted aggressive operation chaining to allow the computation of entire sub expressions using multiple ALUs with heterogeneous arithmetic features. These reconfigurable architectures exclude arithmetic optimizations throughout the architectural synthesis and think about them limited to the interior circuit structure of primitive components, e.g., adders, throughout the logic synthesis. However, research activities have proven the arithmetic optimizations at greater abstraction levels compared to structural circuit one considerably effect on the data performance. Timing-driven path optimizations according to carry-save (CS) arithmetic were carried out in the publish-Register Transfer Level (RTL) design stage. Common sub expression elimination in CS computations can be used to optimize

# R

### **International Journal of Research**

Available at <a href="https://edupediapublications.org/journals">https://edupediapublications.org/journals</a>

e-ISSN: 2348-6848 p-ISSN: 2348-795X Volume 04 Issue 13 October 2017

straight line DSP circuits. Verma et al. developed transformation techniques around the application's DFG to make best use of CS arithmetic prior the particular data path synthesis [2]. These CS optimization approaches target inflexible data path, i.e., ASIC, implementations. Lately, Xydis et al. suggested an adaptable architecture mixing the ILP and pipelining techniques using the CS-aware operation chaining. However, all of the aforementioned solutions feature a natural limitation, i.e., CS optimization is bounded to merging only additions/subtractions. A CS to binary conversion is placed before each operation that is different from addition/subtraction. e.g., multiplication, thus, allocating multiple CS to binary conversions that heavily degrades performance because of timeconsuming carry propagations[3]. Within this brief, we advise a higher-performance architectural plan for that synthesis of flexible hardware DSP accelerators by mixing optimization techniques from both architecture and arithmetic amounts of abstraction. We introduce an adaptable data path architecture that exploits CS enhanced templates of chained procedures. suggested architecture comprises flexible computational models (FCUs), which let the

execution of a big group of operation templates present in DSP popcorn kernels. The suggested accelerator architecture provides average gains as high as 61.91% in area-delay product and 54.43% in energy consumption in comparison to condition-of-art flexible data paths, sustaining efficiency toward scaly technologies.



Fig.1. The Abstract view of FCU

#### II. PROPOSED ACCELERATOR

CS representation continues to be broadly accustomed to design fast arithmetic circuits because of its natural benefit of getting rid of the big carry-propagation chains. CS arithmetic optimizations, arrange the application's DFG and reveal multiple input additive procedures (i.e., chained additions within the initial DFG), which may be planned onto CS compressors. The aim would be to increase the range that the CS computation is carried out inside the DFG. However, each time a multiplication node is

# R

## **International Journal of Research**

Available at <a href="https://edupediapublications.org/journals">https://edupediapublications.org/journals</a>

e-ISSN: 2348-6848 p-ISSN: 2348-795X Volume 04 Issue 13 October 2017

interleaved within the DFG, whether CS to binary conversion is invoked or even the DFG is changed while using distributive property [3]. Thus, these CS optimization approaches have limited effect on DFGs centered by multiplications, e.g., filtering DSP programs. Within this brief, we tackle this limitation by exploiting the CS to modified Booth (MB) recoding every time a multiplication must be carried out inside a CS-enhanced data path. Thus. computations through the multiplications are processed using CS arithmetic and also the procedures within the targeted data path are transported out without needing intermediate carry-propagate adder for CS to binary conversion, thus enhancing performance. The suggested flexible accelerator architecture is presented. Each FCU works on CS operands and fosters data within the same form1 for direct reuse of intermediate results. Each FCU works on 16-bit operands. This type of bit-length is sufficient which are more DSP data paths, however the architectural idea of the FCU could be straight modified for smaller sized or bigger bit-measures. The amount of FCUs is decided at design time in line with the ILP and area constraints enforced through the designer. The CStoBin module is really a ripple-carry adder and converts the CS form towards the two's complement one [4]. The register bank includes scratch registers and it is employed for storing intermediate results and discussing operands one of the FCUs. DSP Different popcorn kernels different register allocation and knowledge communication designs per kernel) could be planned to the suggested architecture using publish-RTL data path interconnection discussing techniques. The control unit drives the general architecture (i.e., communication between your data port and also the register bank, configuration words from the FCUs and selection signals for those multiplexers) in every clock cycle. The dwelling from the FCU continues to be made to enable high-performance flexible operation chaining with different library of operation templates The suggested FCU allows intratemplate operation chaining by fusing the additions carried out before/following the multiplication performs any partial operation template from the complex procedures: The multiplier's product includes 17 bits. The multiplier features a compensation way of lowering the error enforced in the product's precision through the truncation technique. However, since all of the FCU inputs

# R

### **International Journal of Research**

Available at <a href="https://edupediapublications.org/journals">https://edupediapublications.org/journals</a>

e-ISSN: 2348-6848 p-ISSN: 2348-795X Volume 04 Issue 13 October 2017

contain 16 bits and provided there are no overflows, To be able to efficiently map DSP popcorn kernels to the suggested FCUbased accelerator. the semiautomatic synthesis methodology presented, continues be modified. Initially, a CS-aware transformation is carried out to the original DFG, merging nodes of multiple chained additions/subtractions to 4:2 compressors. A design generation around the changed DFG groups the CS nodes using the multiplication procedures to create FCU template procedures. The designer chooses the FCU procedures since the DFG for minimized latency. Since quantity of FCUs is bound, an origin-restricted scheduling is recognized as using the available FCUs and CStoBin modules figuring out the resource constraint set. The clustered DFG is scheduled, to ensure that each FCU operation is designated to some specific control step. A listing-based scheduler continues to be adopted thinking about the mobility2 of FCU procedures [5]. The FCU procedures are scheduled based on climbing down mobility. The scheduled FCU procedures are bound onto FCU instances and proper configuration bits are produced. After finishing register allocation, a FSM is

produced to be able to implement the control unit from the overall architecture.

#### **III.RESULTS:**

#### **FCU Result:**



### **Synthesis Results:**

#### **RTL** schematic:



**Technology Schematic:** 



#### **International Journal of Research**

Available at <a href="https://edupediapublications.org/journals">https://edupediapublications.org/journals</a>

e-ISSN: 2348-6848 p-ISSN: 2348-795X Volume 04 Issue 13 October 2017



## **Design Summary:**

| ytjutj Project Status (09/02/2016 - 14:44:17) |                                 |                     |             |  |  |
|-----------------------------------------------|---------------------------------|---------------------|-------------|--|--|
| Project File:                                 | ytjutj.ise                      | Current State:      | Synthesized |  |  |
| Module Name:                                  | FCU                             | • Errors:           | No Errors   |  |  |
| Target Device:                                | xc3s500e-4fg320                 | • Warnings:         | 6 Warnings  |  |  |
| Product Version:                              | ISE 10.1 - Foundation Simulator | Routing Results:    |             |  |  |
| Design Goal:                                  | Balanced                        | Timing Constraints: |             |  |  |
| Design Strategy:                              | Xilinx Default (unlocked)       | Final Timing Score: |             |  |  |

| ytjutj Partition Summary            | В |  |  |  |
|-------------------------------------|---|--|--|--|
| No partition information was found. |   |  |  |  |

| Device Utilization Summary (estimated values) |      |           |             |  |
|-----------------------------------------------|------|-----------|-------------|--|
| Logic Utilization                             | Used | Available | Utilization |  |
| Number of Slices                              | 824  | 4656      | 17%         |  |
| Number of Slice Flip Flops                    | 128  | 9312      | 1%          |  |
| Number of 4 input LUTs                        | 1458 | 9312      | 15%         |  |
| Number of bonded IOBs                         | 185  | 232       | 79%         |  |
| Number of GCLKs                               | 1    | 24        | 4%          |  |

## **Timing Report:**

Speed Grade: -4

Minimum period: No path found

conventional two's complement and CSformatted data operands, thus enabling high levels of computational density to become accomplished. Within this brief. introduced an adaptable accelerator architecture that exploits the incorporation of CS arithmetic optimizations to allow fast chaining of additive and multiplicative procedures. Theoretical and experimental analyses have proven the suggested solution forms a competent design compromise point enhanced delivering latency/area implementations. Case study is dependent on the system gate model. Regarding both execution latency and also the area complexity and thinking about all of the DSP popcorn kernels, the suggested FCUbased architecture outperforms those built around the FCC and also the RAU. Not

## **International Journal of Research**



Available at <a href="https://edupediapublications.org/journals">https://edupediapublications.org/journals</a>

e-ISSN: 2348-6848 p-ISSN: 2348-795X Volume 04 Issue 13 October 2017

surprisingly, the timing constraints and also the results of cell sizing implied through the Design Compiler synthesis tool, in some instances lead to incongruences between your experimental and also the theoretical studies

#### REFERENCES

[1] M. D. Galanis, G. Theodoridis, S. Tragoudas, and C. E. Goutis, "A high-performance data path for synthesizing DSP kernels," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 25, no. 6, pp. 1154–1162, Jun. 2006.

[2] A. K. Verma, P. Brisk, and P. Ienne, "Data-flow transformations to maximize the use of carry-save representation in arithmetic circuits," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 27, no. 10, pp. 1761–1774, Oct. 2008.

[3] M. Stojilovic, D. Novo, L. Saranovac, P. Brisk, and P. Ienne, "Selective flexibility: Creating domain-specific reconfigurable arrays," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 32, no. 5, pp. 681–694, May 2013.

[4] N. Moreano, E. Borin, C. de Souza, and G. Araujo, "Efficient data path merging for partially reconfigurable architectures," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 24, no. 7, pp. 969–980,Jul. 2005.

[5] N. H. E. Weste and D. M. Harris, *CMOS VLSI Design: A Circuits and Systems Perspective*, 4th ed. Reading, MA, USA: Addison-Wesley, 2010.