Publications

Publications by João Paiva Cardoso

2004

Self-loop pipelining and reconfigurable dataflow arrays

Authors
Cardoso, JMP;

Publication
COMPUTER SYSTEMS: ARCHITECTURES, MODELING, AND SIMULATION

Abstract
This paper presents some interesting concepts of static dataflow machines that can be used by reconfigurable computing architectures. We introduce some data-driven reconfigurable arrays and summarize techniques to map imperative software programs to those architectures, some of which are the focus of current research. In particular, we briefly present a novel technique for pipelining loops. Experiments with the technique confirm important improvements over the use of conventional loop pipelining. Hence, the technique proves to be an efficient approach to map loops to coarse-grained reconfigurable architectures employing a static dataflow computational model.

2004

Modeling loop unrolling: Approaches and open issues

Authors
Cardoso, JMP; Diniz, PC;

Publication
COMPUTER SYSTEMS: ARCHITECTURES, MODELING, AND SIMULATION

Abstract
Loop unrolling plays an important role in compilation for Reconfigurable Processing Units (RPUs) as it exposes operator parallelism and enables other transformations (e.g., scalar replacement). Deciding when and where to apply loop unrolling, either fully or partially, leads to large design space exploration problems. In order to cope with these vast spaces, researchers have explored the application of design estimation techniques. Using estimation, tools can conduct early evaluation of the impact and interplay of transformations in both the required resources and expected performance. In this paper we present some of the current approaches and issues related to estimation of the loop unrolling impact when targeting RPUs.
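
As a minimal software illustration of the two transformations named in this abstract (not code from the paper), the sketch below unrolls a simple loop by a factor of 2 and applies scalar replacement so that the shared element is read once per unrolled iteration. The function names and the neighbor-sum kernel are hypothetical and chosen only for illustration.

```python
def neighbor_sum_rolled(x):
    """Reference loop: y[i] = x[i] + x[i + 1], one output per iteration."""
    n = len(x) - 1
    y = [0] * n
    for i in range(n):
        y[i] = x[i] + x[i + 1]
    return y


def neighbor_sum_unrolled2(x):
    """Same computation, unrolled by 2 with scalar replacement.

    The two additions in the loop body are independent, which is the
    operator parallelism that unrolling exposes. The element x[i + 1]
    is held in the scalar b and reused instead of being read twice.
    """
    n = len(x) - 1
    y = [0] * n
    i = 0
    while i + 1 < n:
        a, b, c = x[i], x[i + 1], x[i + 2]  # each element read once
        y[i] = a + b
        y[i + 1] = b + c
        i += 2
    if i < n:  # epilogue for an odd number of iterations
        y[i] = x[i] + x[i + 1]
    return y
```

Doubling the body roughly doubles the operators needed per iteration, which is the resource versus performance trade-off that the estimation techniques in the abstract aim to predict.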

2008

Sorting units for FPGA-based embedded systems

Authors
Marcelino, R; Neto, H; Cardoso, JMP;

Publication
DISTRIBUTED EMBEDDED SYSTEMS: DESIGN, MIDDLEWARE AND RESOURCES

Abstract
Sorting is an important operation for a number of embedded applications. As sorting large datasets may impose undesired performance degradation, acceleration units coupled to the embedded processor can be an interesting solution for speeding up the computations. This paper presents and evaluates three hardware sorting units, bearing in mind embedded computing systems implemented with FPGAs. The proposed architectures take advantage of specific FPGA hardware resources to increase efficiency. Experimental results show the differences in resources and performance among the three proposed sorting units and also between the sorting units and pure software implementations for sorting. We show that a hybrid between an insertion sorting unit and a merge FIFO sorting unit provides a speed-up between 1.6 and 25 compared to a quicksort software implementation.
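
The abstract does not describe how the hybrid unit is organized. As a rough software analogy only, and assuming one plausible reading in which small blocks are sorted by insertion and the sorted runs are then merged, the idea could be sketched as follows. The names `insertion_sort` and `hybrid_sort` and the `block_size` parameter are hypothetical and do not come from the paper.

```python
from heapq import merge


def insertion_sort(block):
    """Return a sorted copy of a small block using insertion sort."""
    out = []
    for value in block:
        pos = len(out)
        # Walk left past every element greater than the new value.
        while pos > 0 and out[pos - 1] > value:
            pos -= 1
        out.insert(pos, value)
    return out


def hybrid_sort(data, block_size=4):
    """Sort fixed-size blocks by insertion, then merge the sorted runs.

    block_size is an assumed parameter standing in for the capacity
    of a small sorting unit.
    """
    runs = [
        insertion_sort(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]
    # heapq.merge lazily combines already-sorted inputs into one stream.
    return list(merge(*runs))
```

For example, `hybrid_sort([9, 3, 7, 1, 8, 2, 6], block_size=3)` sorts the blocks `[9, 3, 7]`, `[1, 8, 2]`, and `[6]` separately before merging them.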

2002

XPP-VC: A C Compiler with temporal partitioning for the PACT-XPP architecture

Authors
Cardoso, JMP; Weinhardt, M;

Publication
FIELD-PROGRAMMABLE LOGIC AND APPLICATIONS, PROCEEDINGS: RECONFIGURABLE COMPUTING IS GOING MAINSTREAM

Abstract
The eXtreme Processing Platform (XPP) is a unique reconfigurable computing (RC) architecture supported by a complete set of design tools. This paper presents the XPP Vectorizing C Compiler XPP-VC, the first high-level compiler for this architecture. It uses new mapping techniques, combined with efficient vectorization. A temporal partitioning phase guarantees the compilation of programs with unlimited complexity, provided that only the supported C subset is used. A new loop partitioning scheme makes it possible to map large loops of any kind. It is not constrained by loop dependences or nesting levels. To our knowledge, the compilation performance is unmatched by any other compiler for RC. Preliminary evaluations show compilation times of only a few seconds from C code to configuration binaries and performance speedups over standard microprocessor implementations. The overall technology represents a significant step toward RC architectures which are faster and simpler to program.

2005

Compilation and temporal partitioning for a coarse-grain reconfigurable architecture

Authors
Cardoso, JMP; Weinhardt, M;

Publication
New Algorithms, Architectures and Applications for Reconfigurable Computing

Abstract
The eXtreme Processing Platform (XPP) is a coarse-grained dynamically reconfigurable architecture. Its advanced reconfiguration features make feasible the configure-execute paradigm, the natural paradigm of dynamically reconfigurable computing. This chapter presents a compiler for programming the XPP using a subset of the C language. The compiler, apart from mapping the computational structures onto the available resources on the device, splits the program into temporal sections when it needs more resources than are physically available. In addition, since the execution of the computational structures in a configuration needs at least two stages (e.g., configuring and computing), the chapter presents a scheme that splits the program so that reconfiguration overheads are minimized by overlapping the execution stages of different configurations. © 2005 Springer.

1998

Towards an automatic path from Java™ bytecodes to hardware through high-level synthesis

Authors
Cardoso, JMP; Neto, HC;

Publication
Proceedings of the IEEE International Conference on Electronics, Circuits, and Systems

Abstract
This article describes a new approach to synthesize dedicated hardware from a system specification using the Java language. The new compiler, named GALADRIEL, starts from Java classfiles produced from the initial Java specification and processes the system information in order to exploit the concurrency implicit in each method, so that it can be efficiently implemented by multiple hardware and/or software components. The paper emphasizes the compiler techniques used to exploit the implicit concurrency and the use of high-level synthesis to generate the hardware models from the Java bytecode information.
