Our work in 2012 focused on two main directions:
Goals. We plan to develop a Distributed Data Aggregation service (DDA) relying on BlobSeer. Its primary goal will be to serve as a repository backend for complex analysis and automatic mining of scientific data. Several requirements derived from these objectives match BlobSeer's features: versioning, used for lock-free access to data, and support for read/write operations at different granularities. However, since the DDA has a structured view of data, as opposed to the unstructured data stored in BlobSeer, a new metadata management layer is needed to translate correctly between the two views. We further plan to implement an extended BlobSeer client providing a formal description for data retrieval queries and a specification for a search API. A benchmark tool relying on a performance model of BlobSeer, currently developed at KerData, will be used to automatically determine the best BlobSeer deployment configuration for a specific data aggregation pattern.
Results. We implemented an architecture for a Distributed Data Aggregation Service (DDAS), structured into two layers: the metadata management layer and the extended BlobSeer client. The upper layer uses CollectGATE, a system for aggregating the scientific research results (publications) belonging to a specific community (e.g. Computer Science) over a long time period. Data is collected by crawling different digital libraries accessible on the Web. Regardless of the type of request or application, the upper layer updates its scheme-object mappings by querying CollectGATE and performs the requested operation. The DDAS maintains a consistent metadata catalogue and object catalogue and ensures fast lookup of meta-information, especially for an existing aggregation scheme that does not require a call to CollectGATE. Taking this into consideration, we implemented the DDAS using abstract collections designed specifically for these requirements. The object catalogue is a concurrent key-value map with the objects' unique identifiers as keys and their meta-information (as retrieved by the extended client) as values. The metadata catalogue uses the same structure, mapping each scheme to the list of meta-information entries describing the objects that fit it. The extended description of results is presented in this research report.
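To illustrate the idea, the two catalogues can be sketched as a pair of maps guarded for concurrent access. This is only a minimal sketch of the described design, not the actual DDAS code; all names (`Catalogues`, `register`, `lookup_scheme`) are hypothetical, and the real implementation relies on BlobSeer-backed abstract collections:

```python
from threading import Lock

class Catalogues:
    """Illustrative sketch of the DDAS object and metadata catalogues."""

    def __init__(self):
        self._lock = Lock()
        # object catalogue: object unique id -> meta-information
        # retrieved by the extended BlobSeer client
        self.objects = {}
        # metadata catalogue: aggregation scheme -> list of
        # meta-information entries for the objects fitting the scheme
        self.schemes = {}

    def register(self, obj_id, meta, scheme):
        with self._lock:
            self.objects[obj_id] = meta
            self.schemes.setdefault(scheme, []).append(meta)

    def lookup_scheme(self, scheme):
        # fast path: an existing scheme is answered from the catalogue
        # without involving CollectGATE
        with self._lock:
            return list(self.schemes.get(scheme, []))
```

A lookup on a known scheme thus stays a local map access, which is what makes the "fast lookup of meta-information" property cheap to provide.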
Postdoc fellows and students involved: Florin Pop (PUB, postdoc), Alexandru Costan (KerData, postdoc), Vlad Serbanescu (PUB, Master student).
Goals. We aim to provide a cloud-based storage layer for sensitive context data collected from a vast number of sources, from smartphones to sensors located in the environment. The resulting datasets are of large volumes, and they need to be further processed to identify high-level derived and inferred situations. Clouds are perfect candidates to handle the storage and aggregation of such data for ever-larger context-aware applications. However, several requirements must be addressed by these cloud-based solutions: high concurrent accesses, mobility support, real-time access guarantees and persistency. Such solutions can rely on more relaxed storage capabilities than traditional relational databases (eventual consistency suffices, for example). This, combined with high concurrency support and a flexible storage schema, makes BlobSeer a suitable candidate for the storage layer. We plan to develop a new layer on top of BlobSeer targeting context-aware applications. At the logical level, this layer will provide transparency, mobility, real-time guarantees and access based on meta-information. At the physical level, the most important capability will be persistency in cloud storage environments.
Results. In this period we analyzed the requirements of large-scale context-aware applications (such as applications designed to support city-scale traffic optimization, where thousands or millions of users collaborate to solve congestion and report potentially hazardous situations). These requirements were expressed in terms of desirable properties and their relation to existing storage solutions. From this analysis, BlobSeer proved a valuable candidate, its characteristics mapping well onto those required by context-aware applications. However, we still needed, and developed in this period, a layer on top of BlobSeer providing two major capabilities: efficient access to data based on meta-information (a catalogue of context data), and support for mobility in the form of distributed caches that follow the movement of people and allow fast access to real-time events of interest (dissemination of events of interest). The system as a whole was evaluated in extensive experiments involving thousands of simulated clients, and the results proved its valuable contribution to advancing the state of the art in this area (middleware to support context-aware applications). More details are available in this research report.
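One node of such a mobility-support cache could behave roughly like a bounded LRU store of recent context events for a geographic region, so that clients moving into the region get fast access to events of interest. The following is a hypothetical sketch of that behaviour only; `RegionCache` and its methods are illustrative names, not the system's actual API:

```python
from collections import OrderedDict

class RegionCache:
    """Sketch of one cache node holding recent context events
    for a region, with LRU eviction when capacity is exceeded."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._events = OrderedDict()  # event id -> context event

    def publish(self, event_id, event):
        # newest / most recently touched events sit at the end;
        # the oldest untouched event is evicted first
        self._events[event_id] = event
        self._events.move_to_end(event_id)
        if len(self._events) > self.capacity:
            self._events.popitem(last=False)

    def get(self, event_id):
        event = self._events.get(event_id)
        if event is not None:
            self._events.move_to_end(event_id)  # refresh LRU position
        return event
```

The eviction policy is a stand-in; the actual dissemination layer described above additionally coordinates caches as users move between regions.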
Postdoc fellows and students involved: Ciprian Dobre (PUB, postdoc), Alexandru Costan (KerData, postdoc), Elena Buruceanu (PUB, Master student).
Goals. The main goal of this task is to develop a distributed platform for image processing and management in Clouds and private data centers based on the Map-Reduce paradigm. For the storage layer we rely on BlobSeer, leveraging its natural support for large unstructured data such as documents, pictures in any format or other multimedia content. We plan to use two Cloud platforms for our experiments: Windows Azure Cloud and a GPU-based platform in a private cloud. One subtask will focus on face recognition algorithms and another one will involve image segmentation using Hadoop with GPU acceleration.
Results. This task's main purpose was to define a new framework for processing Computer Vision applications in general, and Image Indexing applications in particular, with two important objectives in mind: achieving highly reliable matching results over large datasets and obtaining real-time or near real-time performance. The infrastructure for our framework consists of a private OpenNebula cloud on which a Hadoop implementation was deployed. A first step towards this purpose was designing and implementing a new model for mapping tasks in the Hadoop framework onto GPU processing. Besides modifying the mapper and reducer to run in a GPU environment, we also proposed new implementations of the partitioner. The partitioner assigns each (key, value) pair produced by a mapper to a reducer; by default, Hadoop's partitioner does so according to the hash of the key. Most Computer Vision applications generate a large number of intermediate (key, value) pairs, which is a problem if we want to store all the intermediate data in device (GPU) memory, without any call to the distributed file system. For this purpose we implemented different data storage strategies depending on the Map-Reduce application: a baseline global memory implementation, a shared memory implementation, and an aligned memory implementation.
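The contrast between the default hash-based partitioner and a partitioner that accounts for intermediate data volume can be sketched as follows. This is a simplified illustration of the principle, not the framework's GPU code: `MemoryAwarePartitioner` and its balancing policy are assumptions for the example, while the first function mirrors Hadoop's default hash-of-key behaviour:

```python
def default_partition(key, num_reducers):
    # Hadoop's default policy: reducer chosen by the hash of the key
    return hash(key) % num_reducers

class MemoryAwarePartitioner:
    """Hypothetical partitioner that tracks how many bytes of
    intermediate (key, value) data each reducer has received,
    so each partition can be kept within a device-memory budget."""

    def __init__(self, num_reducers):
        self.bytes_assigned = [0] * num_reducers

    def partition(self, key, value_size):
        # send the pair to the currently least-loaded reducer
        target = min(range(len(self.bytes_assigned)),
                     key=lambda r: self.bytes_assigned[r])
        self.bytes_assigned[target] += value_size
        return target
```

With hash partitioning, a few hot keys can overflow one reducer's device memory; balancing on assigned volume avoids that at the cost of breaking the key-to-reducer invariant, which is why such a policy only suits applications that tolerate it.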
PhD Students involved: Elena Apostol (PhD student at PUB), Diana Moise (PhD student at KerData)
A Romanian student was involved in this task for his BS thesis, at PUB:
Goals. We plan to explore the feasibility of leveraging virtual machine migration in IaaS clouds in order to pro-actively move VM instances that are predicted to fail onto safer hosts, with the purpose of reducing the failure rate of HPC applications running inside the VM instances. The feasibility of such an approach is subject to a trade-off: a reduction in failure rate allows a lower checkpointing frequency and thus lower overhead, but it comes at an additional cost that depends on many factors, such as the overhead and accuracy of the failure prediction, the overhead generated by live migration itself, etc. Our goal is to model this trade-off by defining and analyzing the factors that play a critical role in the context of real HPC applications ported to IaaS clouds. Based on the findings, we plan to explore how to combine and adapt migration, fault prediction and checkpointing techniques for best performance.
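A first-order version of this trade-off can be written down numerically. The model below is purely illustrative and is not the model developed in this task: it assumes Young's classical approximation for the optimal checkpoint interval, a predictor characterized only by its recall, and a fixed per-migration cost, ignoring prediction precision and false-alarm migrations:

```python
from math import sqrt

def overhead_per_hour(mtbf_h, ckpt_cost_h, recall, migr_cost_h):
    """Illustrative model: predicted failures are avoided by live
    migration, so the checkpointing layer only sees the unpredicted
    fraction (1 - recall) of failures; checkpoint overhead follows
    Young's approximation tau = sqrt(2 * C * MTBF)."""
    # effective MTBF seen by the checkpointing layer
    eff_mtbf = mtbf_h / max(1e-9, 1.0 - recall)
    interval = sqrt(2.0 * ckpt_cost_h * eff_mtbf)
    ckpt_overhead = ckpt_cost_h / interval      # fraction of time spent checkpointing
    migr_overhead = (recall / mtbf_h) * migr_cost_h  # migrations triggered per hour
    return ckpt_overhead + migr_overhead
```

For instance, with a 24 h MTBF, 6-minute checkpoints and 3-minute migrations, the total overhead drops as recall grows, until migration cost starts to dominate; sweeping the parameters exposes exactly the break-even points the task sets out to analyze.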
Results. A first step towards this goal consisted of exploring ways to deploy, boot and terminate VMs very quickly, enabling cloud users to exploit elasticity to find the optimal trade-off between computational needs (number of resources, usage time) and budget constraints. We built a VM management system based on the FUSE interface, leveraging BlobSeer's high throughput under heavy concurrency, and integrated it within the Nimbus cloud to allow fast VM deployment, snapshotting and live migration. An adaptive prefetching mechanism is used to reduce the time required to simultaneously boot a large number of VM instances on clouds from the same initial VM image (multi-deployment). This proposal does not require any foreknowledge of the exact access pattern: it adapts to it dynamically at run time, enabling the slower instances to learn from the experience of the faster ones. Since all booting instances typically access only a small part of the virtual image, following almost the same pattern, the required data can be prefetched in the background. In parallel, we investigated ways to ensure the anonymity of the data management layer, a requirement for HPC applications deployed in clouds.
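The core of the "slower instances learn from faster ones" idea can be sketched as a shared first-access history over image chunks. This is a toy sketch of the principle only; `SharedAccessHistory` and its methods are hypothetical names, and the real mechanism runs distributed over BlobSeer rather than in a single process:

```python
class SharedAccessHistory:
    """Sketch of multi-deployment prefetching: instances booting from
    the same VM image record the order in which image chunks are first
    read; slower instances consult that history to prefetch, in the
    background, chunks they have not touched yet."""

    def __init__(self):
        self.order = []   # chunk ids in first-access order
        self.seen = set()

    def record(self, chunk_id):
        # called by every instance on each chunk read; only the first
        # access to a chunk extends the shared history
        if chunk_id not in self.seen:
            self.seen.add(chunk_id)
            self.order.append(chunk_id)

    def prefetch_plan(self, already_read):
        # chunks the faster instances needed that this instance lacks
        return [c for c in self.order if c not in already_read]
```

Because boot-time reads of the same image follow nearly identical patterns, the plan produced by the first instances is a good predictor for all the others, with no prior knowledge of the pattern required.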
Postdoc fellows involved: Bogdan Nicolae (JLPC, postdoc), Catalin Leordeanu (PUB, postdoc) during a 1-month internship at KerData, Alexandru Costan (KerData, postdoc) during a 1-week visit at Argonne National Lab.
A Romanian student was involved in this task for his BS thesis, at PUB: