Abstract: Matrix multiplication is the kernel operation used in many image and signal processing applications. In this paper, we present the design and Field Programmable Gate Array (FPGA) implementation of matrix multiplier architectures for use in image and signal processing applications. The designs are optimized for speed, which is the main requirement in these applications. The first design involves the computation of dense matrix-vector multiplication, which is used in image processing applications. The design has been implemented on a Virtex-4 FPGA and its performance is evaluated by computing the execution time on the FPGA. Implementation results demonstrate that it can provide a throughput of 16970 frames per second, which is quite adequate for most image processing applications. The second design involves the multiplication of a tri-matrix (three matrices), which is used in signal processing applications. The proposed design for the multiplication of three matrices has been implemented on Spartan-3 and Virtex-II Pro platform FPGAs, respectively. Implementation results are presented which demonstrate the suitability of FPGAs for such applications.
1. Introduction : Computation-intensive algorithms used in image and signal processing, multimedia, telecommunications, cryptography, networking, and computing domains in general were first realized using software running on Digital Signal Processors (DSPs) or General Purpose Processors (GPPs). Significant speed-up in computation time can be achieved by assigning complex, computation-intensive tasks to hardware and by exploiting the parallelism in algorithms [1]. Recently, Field Programmable Gate Arrays (FPGAs) have become a platform of choice for the hardware realization of computation-intensive applications [1-10]. Especially when the design at hand requires very high performance, designers can benefit from high-density, high-performance FPGAs instead of costly multicore DSP systems [1]. FPGAs enable a high degree of parallelism and can achieve orders-of-magnitude speedup over GPPs [7], a result of the increasing embedded resources on FPGAs. FPGAs combine the speed of hardware with the flexibility of software, and they have a price/performance ratio much more favorable than that of Application Specific Integrated Circuits (ASICs). Since the major resources for implementing computation-intensive algorithms are embedded on the FPGA itself, the latency associated with off-chip device communication is largely eliminated. However, these embedded resources are limited, hence it is important to use them optimally.
2. FPGA Overview : Programmable devices, such as programmable logic arrays (PLAs), have been available since the 1970s. However, for a number of years their use was quite limited, mainly for technological reasons. In the early 1980s, programmable array logic (PAL) devices started to be used as glue-logic parts but suffered from power consumption problems. The extension of the gate-array technique to post-manufacturing customization, based on the idea of using arrays of custom logic blocks (LBs) surrounded by a perimeter of input/output (I/O) blocks, all of which could be assembled arbitrarily [1-2], gave rise to the FPGA concept, which was introduced by Xilinx cofounder Ross Freeman in 1985. FPGAs are digital integrated circuits (ICs) that belong to the family of programmable logic devices (PLDs). An FPGA chip comprises I/O blocks and the core programmable fabric. The I/O blocks are located around the periphery of the chip, providing programmable I/O connections and support for various I/O standards. The core programmable fabric consists of programmable logic blocks, also called configurable logic blocks (CLBs), and programmable routing architectures. With the appropriate configuration, FPGAs can, in principle, implement any digital circuit as long as their available resources are adequate. Fig. 1 illustrates a general FPGA fabric [10], which represents a popular architecture that many commercial FPGAs are based on and is widely accepted.

FPGAs can be programmed after they are manufactured rather than being limited to a predetermined, unchangeable hardware function. The term “field programmable” refers to the fact that programming takes place “in the field,” as opposed to devices whose internal functionality is hardwired by the manufacturer [7-8]. Many different architectures and programming technologies have evolved to provide better designs that make FPGAs economically viable and an attractive alternative to ASICs. Modern FPGAs have superior logic density, low chip cost, and performance specifications comparable to those of low-end microprocessors. With multimillion programmable gates per chip, current FPGAs can be used to implement digital systems capable of operating at frequencies up to 550 MHz. In many cases, it is possible to implement an entire system using a single FPGA. This is very economical for specialized applications that do not require the performance of custom hardware.
3. Comparison of FPGAs with ASICs, GPPs and DSPs : An ASIC is highly optimized for one specific application or product. ASICs can provide the best performance and lowest power consumption, and for large-volume applications they can also provide the lowest chip cost and system cost. Despite these advantages, ASICs are often infeasible or uneconomical for many embedded systems because of their high non-recurring engineering (NRE) cost and longer design time (Fig. 2). Compared to ASICs, FPGAs offer many advantages, such as reduced NRE cost and shorter time to market. However, the relatively large area and high power consumption of FPGA devices have been the most important drawbacks of the technology. GPPs, on the other hand, are microprocessors designed to perform a wide range of computing tasks. As mentioned earlier, FPGAs are most often contrasted with ASICs.

4. Design Methodology : The design methodology for the hardware realization of computation-intensive algorithms is a combined effort of Electronic Design Automation (EDA) tools, methods, and FPGA technology that enables the production of optimized circuits for the end application. The right combination of FPGA hardware, designed IP cores, and EDA tools enhances the efficiency of the design methodology. By design methodology, we mean the step-by-step process of FPGA design; it is used as a guideline for the hardware realization of algorithms. A number of design flows are used by different FPGA vendors, but all are basically similar in the sequence of tasks performed. These steps are common to all FPGA EDA tools and are essential in today’s FPGA design process. EDA tools such as the Xilinx Integrated Software Environment (ISE), Altera’s Quartus II, and Mentor Graphics’ FPGA Advantage play a very important role in obtaining an optimized digital circuit on an FPGA [13-14]. A typical FPGA design flow followed in this work is shown in Fig. 2.
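In such a flow, the hardware output is typically checked against a software golden model during simulation. The sketch below is a minimal C reference model of dense matrix-vector multiplication, the kernel targeted by the first design; the 4x4 dimensions, data types, and function names are illustrative assumptions and are not taken from the paper. Each iteration of the inner loop corresponds to one multiply-accumulate (MAC) operation, which is the basic unit the FPGA implementation parallelizes.

```c
#include <stdio.h>

/* Hypothetical dimensions for illustration; the paper's matrix sizes are not stated here. */
#define ROWS 4
#define COLS 4

/* Golden reference: y = A * x for a dense matrix A and vector x.
 * A software model like this is commonly used to verify the FPGA
 * output during simulation in the design flow described above. */
static void matvec(const int A[ROWS][COLS], const int x[COLS], int y[ROWS])
{
    for (int i = 0; i < ROWS; i++) {
        int acc = 0;                      /* accumulator for one output element */
        for (int j = 0; j < COLS; j++)
            acc += A[i][j] * x[j];        /* multiply-accumulate, as a MAC unit would */
        y[i] = acc;
    }
}

int main(void)
{
    const int A[ROWS][COLS] = {
        { 1,  2,  3,  4},
        { 5,  6,  7,  8},
        { 9, 10, 11, 12},
        {13, 14, 15, 16}
    };
    const int x[COLS] = {1, 0, 2, 1};
    int y[ROWS];

    matvec(A, x, y);
    for (int i = 0; i < ROWS; i++)
        printf("y[%d] = %d\n", i, y[i]);
    return 0;
}
```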
5. Literature Review : Matrix multiplication is a computationally intensive operation, and its design and efficient implementation on an FPGA, where resources are very limited, is particularly demanding. FPGA-based designs are usually evaluated using three performance metrics: speed (latency), area, and power (energy). Fixed-point implementations on FPGAs are fast and have minimal power consumption. Additionally, a fixed-point matrix multiplier unit often requires less silicon real estate in an FPGA or ASIC than its floating-point counterpart. The limitation of fixed-point numbers is that very large and very small values cannot be represented; the range is limited by the bit-width of the number. There has been extensive previous work on FPGA-based fixed-point matrix multipliers. In [35], a design methodology for synthesizing a family of very compact systolic arrays on FPGAs, based essentially upon manual mapping at the CLB level coupled with structural-level VHDL, is discussed. The authors of [3] used matrix multiplication as a benchmark to compare the performance of FPGAs, DSPs, and embedded processors. The results show that FPGAs can multiply two matrices with both lower latency and lower energy consumption than the other two types of devices. This makes FPGAs an ideal choice for matrix multiplication in signal processing applications.
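To make the fixed-point trade-off concrete, the following sketch assumes a hypothetical Q8.8 format (8 integer bits, 8 fractional bits); the word lengths used in the designs discussed above are not specified here. It shows how a fixed-point multiply keeps a full-width intermediate product before rescaling, and how values outside the format's range are lost, which is the bit-width limitation noted above.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative Q8.8 fixed-point format: range is roughly [-128, 128)
 * with a resolution of 1/256.  This format is an assumption for the example. */
typedef int16_t q8_8_t;
#define FRAC_BITS 8

static q8_8_t to_q8_8(double v)    { return (q8_8_t)(v * (1 << FRAC_BITS)); }
static double  to_double(q8_8_t v) { return (double)v / (1 << FRAC_BITS); }

/* Fixed-point multiply: the 16x16-bit product needs 32 bits before it is
 * rescaled and truncated back to 16 bits.  Results outside the Q8.8 range
 * cannot be represented and are silently corrupted. */
static q8_8_t q_mul(q8_8_t a, q8_8_t b)
{
    int32_t wide = (int32_t)a * (int32_t)b;       /* full-precision product */
    return (q8_8_t)(wide / (1 << FRAC_BITS));     /* rescale; may overflow Q8.8 */
}

int main(void)
{
    q8_8_t a = to_q8_8(3.5), b = to_q8_8(-2.25);
    printf("3.5 * -2.25 ~= %f\n", to_double(q_mul(a, b)));   /* about -7.875 */

    q8_8_t big = to_q8_8(100.0);
    printf("100 * 100   ~= %f (overflows Q8.8)\n", to_double(q_mul(big, big)));
    return 0;
}
```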
6. Conclusions : Most of the algorithms used in DSP, image and video processing, computer graphics and vision, and high-performance supercomputing applications have matrix multiplication as their kernel operation. In this paper, we considered two different examples of matrix multiplier architectures where speed is the main constraint. The first design, involving the computation of dense matrix-vector multiplication, is implemented on a Xilinx Virtex-4 FPGA and its performance is evaluated by computing the execution time on the FPGA. Hardware implementation results demonstrate that it can provide a throughput of 16970 frames per second, which is sufficient for many image and video processing applications. The second design, for the multiplication of three matrices, is based on a systolic array and implemented on Spartan-3 and Virtex-II Pro platform FPGAs, respectively. Implementation results demonstrate the suitability of FPGAs for such applications.
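For reference, the tri-matrix product computed by the second design can be expressed functionally as R = (A·B)·C, i.e., two chained matrix multiplications. The sketch below is a plain C functional model with hypothetical 3x3 dimensions; it only defines the expected result and does not reflect the systolic-array hardware architecture.

```c
#include <stdio.h>

/* Hypothetical size for illustration; the dimensions used in the
 * paper's tri-matrix design are not given in this excerpt. */
#define N 3

/* Plain reference matrix product Z = X * Y. */
static void matmul(const int X[N][N], const int Y[N][N], int Z[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int acc = 0;
            for (int k = 0; k < N; k++)
                acc += X[i][k] * Y[k][j];
            Z[i][j] = acc;
        }
}

int main(void)
{
    const int A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    const int B[N][N] = {{1,0,0},{0,1,0},{0,0,1}};
    const int C[N][N] = {{2,0,1},{1,2,0},{0,1,2}};
    int AB[N][N], R[N][N];

    matmul(A, B, AB);   /* first product              */
    matmul(AB, C, R);   /* second product: R = (A*B)*C */

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%4d ", R[i][j]);
        printf("\n");
    }
    return 0;
}
```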
References
- S. Ogrenci, A. K. Katsaggelos, and M. Sarrafzadeh, “Analysis and FPGA Implementation of Image Restoration under Resource Constraint,” IEEE Trans. on Computers, Vol. 52, No. 3, pp. 390-399, 2003.
- C. Ebeling, C. Fisher, G. Xing, M. Shen, and H. Liu, “Implementing an OFDM Receiver on the RaPiD Reconfigurable Architecture,” IEEE Trans. on Computers, Vol. 53, No. 11, pp. 1436-1448, 2004.
- G. R. Goslin, “A Guide to Using Field Programmable Gate Arrays for Application-Specific Digital Signal Processing Performance,” Microelectronics Journal, Vol. 28, Issue 4, pp. 24-35, 1997.
- J. Isoaho, J. Pasanen, O. Vainio, and H. Tenhunen, “DSP System Integration and Prototyping with FPGAs,” Journal of VLSI Signal Processing, Vol. 6, pp. 155-172, 1993.
- A. G. Ye and D. M. Lewis, “Procedural Texture Mapping on FPGAs,” in Proc. of ACM/SIGDA 7th Intl. Symp. on Field Programmable Gate Arrays, pp. 112-120, 1999.
- S. Knapp, “Using Programmable Logic to Accelerate DSP Functions.”
- J. Ma, “Signal and Image Processing via Reconfigurable Computing,” in Proc. of the First Workshop on Information and Systems Technology, 2003.
- F. Otto and Z. Pavel, “Hardware Accelerated Imaging Algorithms,” in Proc. of AUTOS’2002 Automatizace systému, pp. 165-171, 2002.
- L. Batina, S. B. Ors, B. Preneel, and J. Vandewalle, “Hardware Architectures for Public Key Cryptography,” Integration, the VLSI Journal, Vol. 34, pp. 1-64, 2003.
- D. Johnson, K. Gribbon, D. Bailey, and S. Demidenko, “Implementing Digital Signal Processing Algorithms in FPGAs: Digital Spectral Warping,” in Proc. of 9th Electronics New Zealand Conf., pp. 72-77, 2002.