2016
Authors
Machado, N; Maia, F; Matos, M; Oliveira, R;
Publication
2016 SEVENTH LATIN-AMERICAN SYMPOSIUM ON DEPENDABLE COMPUTING (LADC)
Abstract
A distributed system is often built on top of an overlay network. Overlay networks provide network topology transparency and, at the same time, can be designed to offer efficient data dissemination, load balancing, and even fault tolerance. They are constructed by defining logical links between nodes, creating a node graph. In practice, this is materialized by a Peer Sampling Service (PSS) that provides each node with references to other nodes to communicate with. Depending on the configuration of the PSS, the characteristics of the overlay can be adjusted to cope with application requirements and performance concerns. Unfortunately, overlay efficiency comes at the expense of dependability. To overcome this, one often deploys an application overlay focused on efficiency alongside a safety-net overlay that ensures dependability. However, this approach wastes significant resources, since safety-net overlays are seldom used. In this paper, we focus on safety-net overlay networks and propose an adaptable mechanism that minimizes resource usage while maintaining dependability guarantees. In detail, we consider a random overlay network, known to be highly dependable, and propose BUZZPSS, a new Peer Sampling Service that autonomously fine-tunes its resource consumption according to the observed system stability. When the system is stable and connectivity is not at risk, BUZZPSS autonomously changes its behavior to save resources. It is also able to detect system instability and to react so that the overlay remains operational. Through an experimental evaluation, we show that BUZZPSS autonomously adapts to the system's stability level, consuming up to 6x fewer resources than a static approach.
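For illustration only (this is not the algorithm described in the paper), the following Java sketch shows one plausible way a peer sampling service could throttle its gossip rate when the overlay looks stable and speed up again on churn; the AdaptivePss class, its fields, and the thresholds are all hypothetical.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: a peer sampling service that gossips less often
// while its partial view stays stable, and speeds up again on churn.
public class AdaptivePss {
    private final List<String> view = new ArrayList<>();   // partial view of node ids
    private final Random rnd = new Random();
    private long gossipIntervalMs = 1_000;                  // current gossip period
    private static final long MIN_INTERVAL = 1_000;
    private static final long MAX_INTERVAL = 16_000;

    // Called after each view exchange with the number of view entries that changed.
    public void onViewExchange(int changedEntries) {
        if (changedEntries == 0) {
            // Stable: back off (up to a cap) to save bandwidth and CPU.
            gossipIntervalMs = Math.min(gossipIntervalMs * 2, MAX_INTERVAL);
        } else {
            // Churn detected: return to the aggressive rate to repair the overlay.
            gossipIntervalMs = MIN_INTERVAL;
        }
    }

    // Uniform sample from the partial view, as a PSS would hand to the application.
    public String selectPeer() {
        return view.isEmpty() ? null : view.get(rnd.nextInt(view.size()));
    }

    public long gossipIntervalMs() { return gossipIntervalMs; }
}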
2013
Authors
Bravo, M; Machado, N; Romano, P; Rodrigues, LET;
Publication
Proceedings of the 9th Workshop on Hot Topics in Dependable Systems, HotDep 2013, Farmington, Pennsylvania, USA, November 3, 2013
Abstract
Deterministic replay tools are a useful asset when it comes to pinpointing hard-to-reproduce bugs. However, no sweet spot has yet been found with respect to the trade-off between recording overhead and bug reproducibility, especially in the context of search-based deterministic replay techniques, which rely on inference mechanisms. In this paper, we argue that tracing the locking order, along with the local control-flow paths affected by shared variables, dramatically reduces the inference time required to find a fault-inducing trace, while imposing only a slight increase in overhead during production runs. A preliminary evaluation with a micro-benchmark and third-party benchmarks provides initial evidence that supports our claim.
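As a rough, hypothetical illustration of the kind of lightweight trace the authors argue for (not the paper's actual instrumentation), the Java sketch below records the global lock-acquisition order plus branch decisions on shared state, so that an offline inference step could later search for a fault-inducing interleaving; all names (LockOrderTracer, onLockAcquire, onSharedBranch) are invented.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: record (a) the order in which locks are acquired and
// (b) the branch decisions taken on shared state, keeping the runtime trace small.
public class LockOrderTracer {
    private static final AtomicLong clock = new AtomicLong();
    private static final List<String> trace = new CopyOnWriteArrayList<>();

    public static void onLockAcquire(Object lock) {
        trace.add(clock.incrementAndGet() + " T" + Thread.currentThread().getId()
                + " LOCK " + System.identityHashCode(lock));
    }

    public static void onSharedBranch(String siteId, boolean taken) {
        trace.add(clock.incrementAndGet() + " T" + Thread.currentThread().getId()
                + " BRANCH " + siteId + " " + taken);
    }

    public static List<String> dump() { return trace; }
}

An instrumented program would call onLockAcquire before each monitor entry and onSharedBranch at branches whose condition reads shared variables.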
2016
Authors
Machado, N; Lucia, B; Rodrigues, L;
Publication
ACM SIGPLAN NOTICES
Abstract
Concurrency bugs that stem from schedule-dependent branches are hard to understand and debug, because their root causes involve not only different event orderings, but also changes in the control flow between failing and non-failing executions. We present Cortex: a system that helps expose and understand concurrency bugs that result from schedule-dependent branches, without relying on information from failing executions. Cortex preemptively exposes failing executions by perturbing the order of events and the control-flow behavior in non-failing schedules observed in production runs of a program. By leveraging this information from production runs, Cortex synthesizes executions that guide the search for failing schedules. Production-guided search helps cope with the large execution search space by targeting failing executions that are similar to observed non-failing executions. Evaluation on popular benchmarks shows that Cortex is able to expose failing schedules with only a few perturbations to non-failing executions, and does so in a practical amount of time.
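Purely as an illustrative sketch, and using invented event and schedule representations rather than Cortex's, the Java snippet below conveys the flavor of perturbing a non-failing production schedule: it enumerates candidate schedules obtained by swapping adjacent events from different threads, which a checker could then replay in search of a failure.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: generate candidate schedules by swapping adjacent
// events of different threads in a non-failing schedule observed in production.
public class SchedulePerturber {
    public record Event(int threadId, String op) { }

    public static List<List<Event>> singleSwapNeighbors(List<Event> schedule) {
        List<List<Event>> candidates = new ArrayList<>();
        for (int i = 0; i + 1 < schedule.size(); i++) {
            Event a = schedule.get(i), b = schedule.get(i + 1);
            if (a.threadId() != b.threadId()) {          // only reorder across threads
                List<Event> copy = new ArrayList<>(schedule);
                copy.set(i, b);
                copy.set(i + 1, a);
                candidates.add(copy);                    // candidate schedule to replay
            }
        }
        return candidates;
    }
}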
2013
Authors
Machado, N; Romano, P; Rodrigues, LET;
Publication
5th USENIX Workshop on Hot Topics in Parallelism, HotPar'13, San Jose, CA, USA, June 24-25, 2013
Abstract
2018
Authors
Machado, N; Romano, P; Rodrigues, L;
Publication
SOFTWARE TESTING VERIFICATION & RELIABILITY
Abstract
This paper presents CoopREP, a system that supports fault replication of concurrent programs based on cooperative recording and partial log combination. CoopREP uses partial logging to reduce the amount of information that a given program instance must store to support deterministic replay. This substantially reduces the overhead imposed by code instrumentation, but raises the problem of finding a combination of logs capable of replaying the fault. CoopREP tackles this issue by introducing several innovative statistical analysis techniques that guide the search for the partial logs to be combined and used in the replay phase. CoopREP has been evaluated using both standard benchmarks for multithreaded applications and real-world applications. The results highlight that CoopREP can successfully replay concurrency bugs involving tens of thousands of memory accesses, while reducing the recording overhead with respect to state-of-the-art non-cooperative logging schemes by up to 13x (and by 2.4x on average).
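CoopREP's statistical heuristics are more elaborate than this, but as a hypothetical sketch of the underlying idea, the Java snippet below scores a pair of partial logs by how well their recorded shared-access sites complement each other, so that the most promising combinations can be tried first during replay; the class, method, and weighting are invented for illustration.

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: rank pairs of partial logs by complementarity, i.e.
// how much of the shared-access space the two logs cover together.
public class PartialLogRanker {
    // Each partial log is summarized by the set of shared-access sites it recorded.
    public static double complementarity(Set<String> logA, Set<String> logB) {
        Set<String> union = new HashSet<>(logA);
        union.addAll(logB);
        Set<String> overlap = new HashSet<>(logA);
        overlap.retainAll(logB);
        // Larger combined coverage and smaller overlap -> better candidate pair.
        return union.size() - 0.5 * overlap.size();
    }
}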
2015
Authors
Machado, N; Lucia, B; Rodrigues, L;
Publication
ACM SIGPLAN NOTICES
Abstract
We present Symbiosis: a concurrency debugging technique based on novel differential schedule projections (DSPs). A DSP shows the small set of memory operations and data-flows responsible for a failure, as well as a reordering of those elements that avoids the failure. To build a DSP, Symbiosis first generates a full, failing, multithreaded schedule via thread path profiling and symbolic constraint solving. Symbiosis then selectively reorders events in the failing schedule to produce a non-failing, alternate schedule. A DSP reports the ordering and data-flow differences between the failing and non-failing schedules. Our evaluation on buggy real-world software and benchmarks shows that, in practical time, Symbiosis generates DSPs that both isolate the small fraction of event orders and data-flows responsible for the failure and show which event reorderings prevent it. In our experiments, DSPs contain 81% fewer events and 96% fewer data-flows than the full failure-inducing schedules. Moreover, by allowing developers to focus on only a few events, DSPs reduce the amount of time required to find a valid fix.
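As a minimal, hypothetical sketch of what a differential schedule projection reports (the types and representation here are invented, not Symbiosis's), the Java snippet below diffs a failing and a non-failing schedule over the same events and keeps only the pairs whose relative order differs between the two.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: report pairs of events whose relative order differs
// between a failing and a non-failing schedule of the same events.
public class ScheduleDiff {
    public static List<String> orderingDifferences(List<String> failing,
                                                   List<String> nonFailing) {
        List<String> diffs = new ArrayList<>();
        for (int i = 0; i < failing.size(); i++) {
            for (int j = i + 1; j < failing.size(); j++) {
                String a = failing.get(i), b = failing.get(j);
                // In the failing schedule a precedes b; flag the pair if the
                // non-failing schedule orders them the other way round.
                if (nonFailing.indexOf(b) >= 0 && nonFailing.indexOf(a) > nonFailing.indexOf(b)) {
                    diffs.add(a + " -> " + b + " (reversed in non-failing run)");
                }
            }
        }
        return diffs;
    }
}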