PhD Thesis: Jorge Luis Villamayor Leguizamón: November 30, 2018, 12:30. (R&D Product Owner / IT Project Manager at Giesecke+Devrient Mobile Security)
Title: Fault Tolerance Configuration and Management for HPC Applications using RADIC.
Abstract: High Performance Computing (HPC) systems continue to grow exponentially in component count and density to deliver the demanded computational power. At the same time, cloud computing is becoming popular as key features such as scalability, pay-per-use and availability continue to evolve. It is also becoming a competitive platform for running parallel HPC applications, thanks to the increasing performance of virtualized, highly available instances. However, increasing the number of components to build larger systems tends to increase the frequency of failures in both cluster and cloud environments. Nowadays, HPC systems have a failure rate of around 1000 per year, that is, approximately one failure every 8 hours.
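The failure interval quoted above follows from dividing the hours in a year by the yearly failure count. The following is a minimal sketch of that arithmetic; the function name and the 365-day year are assumptions made for illustration and do not come from the thesis.

```python
# Illustrative arithmetic only: names and the 365-day year are assumptions.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def mean_time_between_failures(failures_per_year: float) -> float:
    """Return the average number of hours between two consecutive failures."""
    if failures_per_year <= 0:
        raise ValueError("failures_per_year must be positive")
    return HOURS_PER_YEAR / failures_per_year


if __name__ == "__main__":
    # Around 1000 failures per year, the rate cited in the abstract.
    print(f"{mean_time_between_failures(1000):.2f} hours between failures")
```

With 1000 failures per year this gives 8.76 hours, which the abstract rounds to approximately 8 hours.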
Fault Tolerance (FT) techniques need to be applied to MPI parallel executions in both cluster and cloud environments. With FT techniques, high availability is ensured for parallel applications. Some FT solutions require administrator privileges to be installed on the cluster nodes. Moreover, when failures appear, human intervention is required to recover the application. A solution that minimizes user and administrator intervention is preferred.
Regarding cloud environments, we propose Resilience as a Service (RaaS), a fault-tolerant framework for HPC applications. RaaS provides clouds with a highly available, distributed and scalable fault-tolerant service. It redesigns traditional HPC protection and recovery mechanisms to natively leverage cloud capabilities and their multiple alternatives for implementing FT tasks. This thesis contributes a Multi-platform Resilience Manager (MRM), suitable for traditional bare-metal clusters and clouds (public and private). The presented solution provides FT in an automatic, distributed and transparent manner at the application and user levels, according to user, application, and runtime requirements. It gives users critical FT information, allowing them to trade off cost against protection while keeping the mean time to repair within acceptable ranges.
Several experimental environments, such as bare-metal clusters and clouds (public and private) running different parallel applications, were used for the experimental validation. The experiments verify the functionality of the contributions and the improvement they bring.
PhD Thesis: Laura María Espínola Brítez: November 30, 2018, 9:30. (R&D QA Manager at Giesecke+Devrient Mobile Security)
Title: Efficient Communication Management in Cloud Environments.
Abstract: Scientific applications with High Performance Computing (HPC) requirements are migrating to cloud environments because of the facilities they offer. Cloud computing plays a major role given the compute power it provides while avoiding the cost of maintaining a physical cluster. With features like elasticity and pay-per-use, it helps to reduce researchers' procurement risk. Most HPC applications are implemented using the Message Passing Interface (MPI), which is a key component in common and distributed computing tasks.
However, for this kind of application in cloud environments, the major drawback is the loss of execution performance caused by the virtualized network, which affects communication latency and bandwidth. In this thesis, a Dynamic MPI Communication Balance and Management (DMCBM) system is presented to overcome the communication challenge of HPC applications in the cloud. DMCBM is implemented as middleware between the user's application and the execution environment. It improves message communication latency in cloud-based systems and helps users to detect mapping and parallel implementation issues.
Our solution dynamically rebalances communication flows at higher levels of the virtualized HPC stack, e.g. at the MPI communication layer, to remove communication hot-spots and congestion in the underlying layers. DMCBM abstracts the communication state between application processes based on latency measurements. DMCBM achieves lower application execution time in case of congestion, obtaining better performance in clouds.
The NAS Parallel Benchmarks and NBody, a real particle-dynamics simulation application, are used to show the DMCBM performance, obtaining an improvement of up to 10% in execution time and a communication time reduction of about 16% in congestion scenarios.
PhD Thesis: Pilar Gómez Sánchez: June 22, 2018, 12:00. (Assistant Professor UAB)
Title: Analyzing the Parallel Applications' I/O Behavior Impact on HPC Systems.
As the volume of data generated by scientific applications grows, the pressure on the I/O system of HPC systems also increases. For this reason, an I/O behavior model is proposed for scientific MPI (Message Passing Interface) parallel applications. The goal is to analyze the applications' impact on the I/O system. Analyzing MPI parallel applications at the POSIX-IO level makes it possible to observe how the application's data are treated at that level.
In this research work, the following is presented: the I/O behavior model definition at the POSIX-IO level (PIOM-PX model definition), the methodology applied to extract this model, and the PIOM-PX-Trace-Tool. As PIOM-PX is based on the I/O phase concept, it can identify the most significant phases, that is, the phases that have more influence than others on the I/O system and could cause a bottleneck or poor performance. Analysis based on I/O phases allows identifying, delimiting, and trying to reduce each phase's impact on the I/O system.
PIOM-PX is part of the proposed PIOM model. PIOM integrates the I/O behavior model at the POSIX-IO level (PIOM-PX) and the I/O behavior model at the MPI-IO level (PIOM-MP, formerly known as PAS2P-IO). The model provides the information necessary to replicate an application's behavior in different systems using programmable synthetic programs. PIOM-PX-Trace-Tool allows interception of the POSIX-IO instructions used during the application execution. The experiments were carried out on several standard HPC systems and on a Cloud platform, where the utility of the proposed PIOM model could be tested.
PhD Thesis: Cecilia Elizabeth Jaramillo Jaramillo: July 21, 2017, 11:00. (Researcher at Computer Science Department. Universidad ISRAEL. Quito, Ecuador)
Title: Modelización y Simulación de la transmisión por contacto de una infección nosocomial en el servicio de urgencias hospitalarias.
TDX Source: http://hdl.handle.net/10803/457348
A nosocomial infection is an infection caused by microorganisms acquired within healthcare environments and is one of the main threats faced by hospitalized patients. Methicillin-Resistant Staphylococcus Aureus (MRSA) is one of the most common and dangerous microorganisms in hospital settings, and it can cause serious skin, wound, organ and even blood-borne infections (bacteremia).
In a healthcare environment, such as the emergency department, constant interaction between patients, healthcare workers and the environment contributes to MRSA transmission. The most common routes of transmission are the hands of the healthcare workers and contaminated medical instruments or objects of the environment. To counteract the transmission, health services have implemented certain actions called infection control measures.
This research addresses the transmission of nosocomial infection by contact in an emergency service, using the capacity of agent-based simulation to represent social phenomena and the human dimension. Agent-based computational models allow us to evaluate potential solutions to specific situations in a virtually created environment.
As a result of this research, a simulation tool of contact transmission of MRSA has been obtained, the MRSA-T-Simulator. The main objective of this tool is to allow the construction of virtual scenarios in order to study the phenomenon of MRSA transmission and to evaluate the potential impact of the implementation of different infection control measures on propagation rates.
PhD Thesis: Eva Bruballa: July 21, 2017, 9:30. (Assistant Professor at Gimbernat Schools, Spain)
Title: Scheduling non critical patients' admission in a hospital emergency department.
TDX Source: http://www.tdx.cat/handle/10803/457347
The increase in life expectancy, progressive population aging and a greater number of chronic diseases are factors that contribute significantly to the growing demand for urgent medical care and, consequently, in many cases, to the saturation of Emergency Departments (EDs). Given also the limitations on available resources, this constant risk of ED saturation is one of the most important current problems in health systems around the world, since it often results in an excessive length of stay of patients in the service and, consequently, generates dissatisfaction.
The results presented in this study aim to improve the quality of care provided in EDs. We propose a method to reduce patients' total length of stay in the service through a model for scheduling the arrival of non-critical patients. The model is based on a detailed characterization of the system in terms of its care capacity and, dynamically, the number of patients attending each hour. Simulation allows us to learn about the behavior of the system by predicting patient waiting times for a specific scenario, determined by the way patients arrive at the service and by the available healthcare staff. A first contribution of the research is the definition of an analytical model for calculating the theoretical throughput of a given staff configuration. The objective of this first part is to evaluate the responsiveness of the staff to service demand, depending on the configuration of doctors, nurses, admission and triage personnel, and on the model of patient flow throughout the service. The second contribution is the definition of a model for scheduling the admission of non-critical patients into the service by redistributing them with respect to the arrival pattern initially predicted from the hospital's historical data. We verified the effectiveness of the proposed scheduling model using actual data provided by the Hospital de Sabadell, as the reference hospital, and using simulation to assess the results of its application.
The described research contributions offer ED managers new knowledge about the behavior of the service, which may be relevant to decisions on improving service quality. This is of great interest given the growing demand expected for the service in the very near future.
PhD Thesis: Joe Carrion Jumbo: July 20, 2017, 11:00. (Researcher at Computer Science Department. Universidad ISRAEL. Quito, Ecuador)
Title: Mejorando la red de los servicios de motores de búsqueda a través de enrutamiento basado en aplicación.
TDX Source: http://hdl.handle.net/10803/456585
Large-scale computer systems like Search Engines provide services to thousands of users, and their user demand can change suddenly. This unstable demand significantly affects the service components (such as network and hosts). The system should be able to address unexpected scenarios; otherwise, users would be forced to leave the service. A search engine has a typical architecture consisting of a Front Service that processes user requests, an Index Service that stores the information collected from the internet, and a Cache Service that manages efficient access to frequently used content.
The scientific advances that provide these services are in general emergent technology. The network services of a search engine require specialized planning. This research is carried out by studying the traffic pattern of a Search Engine and designing a routing model for messages between network nodes based on the data flow conditions of the Search Engine Service. The expected result is a network service specialized in the traffic of a Search Engine that allocates network resources efficiently according to the demand it supports in real time. The evaluation of the traffic pattern allowed us to identify conditions of network imbalance and message congestion.
Therefore, the designed model combines different routing models from the literature with a new criterion based on the specific traffic conditions of the Search Engine. For the design of this proposal, it was necessary to build a scale model of a Search Engine using simulation techniques. Traffic from a real system was used, which allowed us to accurately evaluate the proposed model and compare it with the routing models currently available in the literature and in technology. The results show that the proposed model improves the performance of the Search Engine network in terms of latency and network throughput.
PhD Thesis: Francisco Borges: September 30, 2016, 12:00. (Assistant Professor at IFBA Instituto Federal de Educação, Ciência e Tecnologia da Bahia, Campus Santo Amaro. Bahia. Brazil)
Title: Care HPS: A High Performance Simulation Methodology for Complex Agent-Based Models.
TDX Source: http://hdl.handle.net/10803/395209
This thesis introduces a methodology for doing research on HPC for complex agent-based models (ABMs) that demand high performance solutions. This methodology, named Care High Performance Simulation (HPS), enables researchers to: 1) develop techniques and solutions for high performance parallel and distributed simulations of agent-based models; and 2) study, design and implement complex agent-based models that require high performance computing solutions. The methodology was designed to make it easy and quick to develop new ABMs, as well as to extend, implement, test and analyze new solutions for the main issues of parallel and distributed simulations, such as synchronization, communication, load and computing balancing, and partitioning algorithms. In addition, some agent-based models and HPC approaches and techniques were developed, which can be used by researchers in HPC for ABMs that require high performance solutions.
A set of experiments is included with the aim of showing the completeness and functionality of this methodology and of evaluating how useful the results can be. These experiments focus on: 1) presenting the results of our proposed HPC techniques and approaches, which are used in Care HPS; 2) showing that the features of Care HPS reach the proposed aims; and 3) presenting the scalability results of Care HPS. As a result, we show that Care HPS can be used as a scientific instrument for the advancement of the agent-based parallel and distributed simulation field.
PhD Thesis: Albert Gutiérrez Millà: July 22, 2016, 10:00. (Researcher at Barcelona Supercomputing Center. CASE - Fusion Dpt.- Barcelona-Spain)
Title: Crowd Modeling and Simulation on High Performance Architectures.
TDX Source: http://hdl.handle.net/10803/392745
Security management at major events has become crucial in an increasingly populated world. Disasters at crowd events have increased over the last hundred years, and the safety management of attendees has therefore become a key issue. To understand and assess the risks involved in these situations, we need models and simulators that make it possible to understand the situation and make decisions accordingly.
However, crowd simulation has high computational requirements when we consider thousands of people. Moreover, the same initial situation can lead to different results depending on the non-deterministic behavior of the population, so we also need a significant number of statistically reliable simulations. In this thesis we have proposed crowd models and focused on providing a Decision Support System (DSS). The proposed models can reproduce the complexity of agents, psychological factors, and the intelligence needed to find the exit, avoid obstacles or move through the crowd, and can recreate internal events of the crowd in case of high pressures or densities.
To model these aspects we use agent-based models and numerical methods. To focus on the applicability of the model, we have developed a workflow that runs the DSS in the Cloud, hiding the complexity of the systems from the experts and leaving only the configuration to them. Finally, to test the operation and validate the simulator, we used both real and synthetic scenarios to evaluate the performance of the models.
PhD Thesis: Liu Zhengchun: July 22, 2016, 12:00. (Researcher Argonne National Laboratory. MSC Dpt. USA)
Title: Modeling & Simulation for Healthcare Operations Management Using High Performance Computing & Agent Based Model.
TDX Source: http://hdl.handle.net/10803/392743
Hospital based emergency departments (EDs) are highly integrated service units to primarily handle the needs of the patients arriving without prior appointment, and with uncertain conditions. In this context, analysis and management of patient flows play a key role in developing policies and decision tools for overall performance improvement of the system. However, patient flows in EDs are considered to be very complex because of the different pathways patients may take and the inherent uncertainty and variability of healthcare processes. Due to the complexity and crucial role of an ED in the healthcare system, the ability to accurately represent, simulate and predict performance of ED is invaluable for decision makers to solve operations management problems. One way to realize this requirement is by modeling and simulation.
Armed with the ability to execute a compute-intensive model and analyze huge datasets, the overall goal of this study is to develop tools to better understand the complexity (explain), evaluate policy (predict) and improve efficiencies (optimize) of ED units. The two main contributions of this thesis are: (1) An agent-based model for quantitatively predicting and analyzing the complex behavior of emergency departments. (2) A simulation and optimization based methodology for calibrating model parameters under data scarcity.
Starting from the simulation of emergency departments, our work demonstrates the feasibility and suitability of agent-based modeling and simulation techniques for studying the healthcare system.
PhD Theses supervised by members of the group:
- Performance Prediction: analysis of the scalability of parallel applications. Javier Panadero Martínez (2015) (Researcher at Internet Interdisciplinary Institute (IN3) - Universitat Oberta de Catalunya. Barcelona-Spain)
- Simulación y Optimización como Metodología para Mejorar la Calidad de la Predicción en un Entorno de Simulación Hidrográfica. Adriana Gaudiani (2015) (Associate Researcher at Science Institute. Universidad Nacional de General Sarmiento, Buenos Aires, Argentina)
- A dynamic link speed mechanism for energy Saving in interconnection networks. Hai Nguyen Hoang (2014) (Lecturer in Information Technology Faculty. Danang University of Education. Danang University. Vietnam)
- ARTFUL Deterministically Assessing the Robustness against Transient Faults of Programs. João Artur Dias Lima Gramacho (2014) (Software Analyst & Developer at Oracle MySQL Replication Team. Lisbon Area, Portugal)
- Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability. Hugo Meyer (2014) (Postdoctoral Research Scientist, Universiteit van Amsterdam) Outstanding dissertation award
- Modelo basado en Autómatas Celulares extendidos para diseñar estrategias de Evacuaciones en Casos de Emergencia. Cristian Tissera (2014) (Assistant professor UNSL, Argentina)
- Optimisation via simulation for healthcare emergency departments. Eduardo Cesar Cabrera Flores (2013) (Full-time Researcher at Universidad Autónoma de Guerrero (UAGro), Mexico)
- Vulnerability Assessment for Complex Middleware Interrelationships in Distributed Systems. Jairo Serrano Latorre (2013) (Security & Quality Assurance Officer in the project ECmanaged at Ack Storm S.L.)
- Predictive and Distributed Routing Balancing for High Speed Interconnection Networks. Carlos H. Núñez Castillo (2013) (Researcher, Polytechnic Faculty, Computer Science Department, National University of Asuncion)
- Tolerancia a fallos en la capa de sistema basada en la arquitectura RADIC. Marcela Castro León (2013) (Staff, Gimbernat Schools, Spain)
- Simulación de los Servicios de Urgencias Hospitalarias: una aproximación computacional desarrollada mediante técnicas de Modelado Orientadas al Individuo (MoI). Manuel Taboada González (2013) (Postgraduate Coordinator, Gimbernat Schools, Spain) Outstanding dissertation award
- Metodología para la evaluación de prestaciones del sistema Entrada/Salida en computadores de altas prestaciones. Sandra A. Méndez (2013) (Researcher at Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities, Germany)
- Planificación de DAGs en entornos oportunísticos. MªMar López (2012) (Temporary lecturer, UAB)
- TDP-Shell: Entorno para acoplar gestores de colas y herramientas de monitorización. Vicente Ivars Camanes (2012) (Temporary lecturer, UAB)
- Particionamiento y Balance de Carga en Simulaciones Distribuidas de Bancos de Peces. Roberto Solar Gallardo (2012) (Assistant Professor at Universidad de Santiago de Chile).
- Multipath Fault-tolerant Routing Policies to deal with Dynamic Link Failures in High Speed Interconnection Networks. Gonzalo A. Zarza (2011) (Researcher on High-Performance Solutions & Big-Data at Globant). Outstanding dissertation award
- Fault Tolerance Configuration for Uncoordinated Checkpoints. Leonardo Fialho de Queiroz (2011) (Lab Manager at Atos/Bull, Petrópolis, Rio de Janeiro, Brazil).
- Metodología para la ejecución eficiente de aplicaciones SPMD en clústeres con procesadores multicores. Ronal Muresano Caceres (2011) (HPC and BigData senior software researcher, ITI – Instituto Tecnológico de Informática, Ciudad Politécnica de la Innovación - UPV - Spain). Outstanding dissertation award
- Performance evaluation of applications for heterogeneous systems by means of performance probes. Alexandre Strube (2011) (Researcher at Institute for Advanced Simulation & Jülich Supercomputing Centre)
- Predicción de perfiles de comportamiento de aplicaciones científicas en nodos multicore. John Corredor Franco (2011) (Associate Professor, Universidad de Pamplona, Colombia)
- Framework for integrating scheduling policies into workflow engines. Gustavo Martinez (2011) (Tecnocom, Spain)
- Firma de la aplicación paralela para predecir el rendimiento. Álvaro Wong González (2010) (Associate Researcher, Universitat Autònoma de Barcelona, Spain)
- Decentralized Scheduling on Grid Environments. Manuel Brugnoli (2010) (RIP 2014)
- R/parallel - Parallel Computing for R in non-dedicated environments. Gonzalo Vera (2010) (Scientific IT Manager and Bioinformatician. Centre for Research in Agricultural Genomics -CRAG-, Spain)
- Políticas de Encaminamiento Multicamino en redes de interconexión de altas prestaciones. Diego F. Lugones (2009) (Researcher at Rince Institute, Dublin City University, Ireland).
- Performability issues of fault tolerance solutions for message-passing systems: the case of Radic. Guna Santos (2009) (Associate Professor at Universidade Federal da Bahia, Salvador, Brazil)
- Un sistema de vídeo bajo demanda a gran escala tolerante a fallos de red. Javier A. Balladini (2008) (Associate Professor, Universidad Nacional del Comahue, Argentina)
- Scheduling for Interactive and Parallel Applications on Grids. Enol Fernández (2008) (Researcher at EGI-InSpire Project, IFCA, Spain)
- Mapping sobre arquitecturas heterogéneas. Laura De Giusti (2008) Universidad Nacional de La Plata (Associate Professor, Universidad Nacional de La Plata, Researcher at III-LIDI, Argentina)
- ¿Podemos predecir en Algoritmos Paralelos No-Deterministas? Paula Fritzsche (2007) (Researcher at Qustodian Trust SL, Spain). Outstanding dissertation award
- Simulación de altas prestaciones para modelos orientados al individuo. Diego Mostaccio (2007) (R&D Engineer at Hewlett-Packard)
- Radic: A powerful fault-tolerance architecture. Angelo Duarte (2007) (Associate Professor, Universidade Estadual de Feira de Santana, Brazil)
- Aumentando las Prestaciones en la Predicción de Flujo de Instrucciones. Juan Carlos Moure (2006) (Associate Professor, Universitat Autònoma de Barcelona, Spain)
- FTDR: Tolerancia a fallos, en clusters de computadores geográficamente distribuidos, basada en replicación de datos. Josemar Rodrigues de Souza (2006) (Associate Professor, Universidade do Estado da Bahia, Brazil).
- Performance prediction and tuning in a multicluster environment. Eduardo Argollo (2006) (Solution Architect for Cloud Applications at Hewlett-Packard, Spain)
- Admission Control and Media Delivery Subsystems for Video on Demand Proxy Server. Bahjat Qazzaz (2004) (Associate Professor, Faculty of Information Technology, An-Najah National University. Nablus, Palestine) (RIP 2019)
- Cómputo paralelo en redes locales de computadoras. Fernando Tinetti (2004) (Associate Professor, Universidad Nacional de La Plata, Researcher at III-LIDI, Argentina)
- Balanceo Distribuido del Encaminamiento en Redes de Interconexión de Computadores Paralelos. Daniel Franco (2000) (Associate Professor, Universitat Autònoma de Barcelona, Spain)
- Modelado y Simulación de Sistemas Paralelos. Remo Suppi (1996) (Associate Professor, Universitat Autònoma de Barcelona, Spain)
- Políticas de Scheduling Estático para Sistemas Multiprocesador. Porfidio Hernández (1991) (Associate Professor, Universitat Autònoma de Barcelona, Spain)
- Simulación de Arquitecturas Computacionales. M.A. Mayosky (1990) (Associate Professor, Universidad Nacional de La Plata, Researcher at LEICI, Argentina)
- Sistemas Multiprocesador con Buses Múltiples. Dolores Rexachs (1988) (Associate Professor, Universitat Autònoma de Barcelona, Spain)
- Adaptación de la Arquitectura en Tiempo de Ejecución. Joan Sorribes (1987) (Associate Professor, Universitat Autònoma de Barcelona, Spain)
- Algoritmos de Selección en el Proceso de Adaptación de la Arquitectura en un Ordenador. Tomás Díez (1987). (RIP 2004)
- Adaptación de la Arquitectura en Sistemas Microprogramables. Ana Ripoll (1980) (Professor, Universitat Autònoma de Barcelona, Spain)
- Procesador Concurrente para Bases de Datos. José Jaime Ruz Ortiz (1980) Universidad Complutense de Madrid (Professor, Universidad Complutense de Madrid. Spain).
- Concepción y desarrollo de un Procesador para Ejecución Directa de Lenguajes de Alto Nivel. Lorenzo Moreno Ruiz (1977) Universidad Complutense de Madrid (Professor, Universidad de La Laguna - Spain).