Our work in 2011 addressed all three directions defined in the DataCloud@Work proposal:
For the first direction, we aimed to offer advanced data sharing facilities both to applications running within distinct VMs deployed in Infrastructure-as-a-Service environments and to clients accessing BlobSeer as a Cloud storage service. Tasks 1-3 described below studied two sub-topics within this direction, focusing on security aspects, on enhancing BlobSeer with self-* properties and on exploiting BlobSeer's features for scientific applications. Task 4 was developed in the context of our second research direction, which deals with efficient data accesses for cloud federations. Finally, the third research topic evolved through Task 5.
Goals. The increasing popularity of Cloud computing results in a need for efficient and secure data management. One of the most relevant security topics in cloud data management is preventing users from damaging the stored data or from breaking security policies and data-access protocols. We aim to further improve the security of a large-scale data management system such as BlobSeer. The goal is to introduce adequate authentication and authorization mechanisms for BlobSeer users and to preserve their privacy through anonymization. Another goal is to extend the security framework that protects the system against malicious usage with adaptive security policies that take into account the past actions of each user. Furthermore, we aim to provide a secure environment for deploying web services over BlobSeer. Each user must be able to deploy his own web services that rely on BlobSeer as a data management backend. In addition, a user must be able to securely invoke these services, request access to the services of other users or grant access to the services he has deployed.
Results. We proposed a novel security layer for the BlobSeer data management system as well as a number of security enhancements to ensure practicable and secure client access management. Our solution offers certificate management, encryption capabilities, credential management and access control lists. Using these mechanisms, we enable authentication, authorization and secure data transfer. The proposed solution was integrated into BlobSeer, which offers high performance for data transfer and efficient data management. We tested our solution, showing that it handles all the security tasks very efficiently, without adding any significant overhead to the data management system, thus preserving the overall performance of the system. We also focused on Cloud infrastructures and developed mechanisms to allow secure access to data-intensive web services that use BlobSeer as a data management backend in a Cloud environment. We developed an efficient system that provides an adequate level of security for web service-based applications, including rights management and secure communication.
Several Romanian students were involved in this task for their BS theses, at PUB:
Goals. The autonomic management of a distributed storage system aims to support its adaptive steering towards optimized performance and resource consumption, without the need for human intervention. One approach to enhance BlobSeer with self-* properties is to enable a dynamic allocation scheme for the data providers that takes into account information provided by an introspection layer (e.g., number of accesses per provider, location awareness, transfer and storage cost ratio). We further aim to introduce mechanisms for deleting data from BlobSeer, in order to support both adaptive replication (by allowing the replication factors to decrease) and self-protection from malicious clients that could overload the system by writing large amounts of data to exhaust the available disk space.
Results. We continued our work on enabling BlobSeer with self-adaptive features by dynamically maintaining the replication factors of the data. We enhanced the data replication module with the ability to automatically decrease the data replication factor by means of real-time monitoring with MonALISA. The monitoring data were used to help BlobSeer's Replication Manager make an informed decision concerning the needed value for the replication factor. The decision is then enforced by consistently updating the metadata to reflect the changes. We also addressed the issue of garbage collection in the BlobSeer data management system. Malicious users can add false data or use the available storage space inefficiently. Given this, and the fact that data is always created and never overwritten in the system, being able to delete unwanted data becomes crucial. As a result, we proposed a BlobSeer data deletion algorithm that can be used to eliminate false data that would otherwise pollute the system.
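The kind of decision described above can be illustrated with a minimal sketch. It is not the Replication Manager's actual policy: the `ChunkStats` record, the function names and the thresholds (`min_replicas`, `max_replicas`, `reads_per_replica`) are all hypothetical, and the read rate is assumed to come from the monitoring layer.

```python
import math
from dataclasses import dataclass


@dataclass
class ChunkStats:
    """Hypothetical per-chunk monitoring record (not a BlobSeer or MonALISA type)."""
    chunk_id: str
    replicas: int            # replicas currently stored
    reads_per_minute: float  # access rate reported by the monitoring layer


def target_replication(stats, min_replicas=2, max_replicas=5,
                       reads_per_replica=100.0):
    """Return the replication factor suggested by the observed read rate.

    Assumed policy: one replica per `reads_per_replica` reads per minute,
    clamped to the range [min_replicas, max_replicas].
    """
    needed = math.ceil(stats.reads_per_minute / reads_per_replica)
    return max(min_replicas, min(max_replicas, needed))


def replicas_to_drop(stats, **policy):
    """Number of replicas that could be removed without going below the target."""
    return max(0, stats.replicas - target_replication(stats, **policy))
```

Under these assumed thresholds, a rarely read chunk with four replicas would be a candidate for losing two of them, while a heavily read chunk would keep all of its replicas. Once a decision like this is taken, the metadata would still have to be updated consistently, which the sketch leaves out.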
Two Bachelor theses at PUB focused on this task:
Goals. This task aims to enable BlobSeer as a storage service for large datasets generated and processed by scientific applications. In this context, the BlobSeer storage system will offer advanced data-sharing facilities to processing tasks running within distinct VMs in IaaS environments. We also aim to integrate BlobSeer with some well-known open-source IaaS platforms, such as Nimbus and OpenNebula. Furthermore, the goal is to explore ways to take advantage of BlobSeer's scalable architecture, high throughput under heavy concurrency and versioning support to increase the performance of scientific workflows.
Results. We integrated BlobSeer as a backend for Cumulus, the data storage service provided by the Nimbus platform. On the one hand, we used BlobSeer to store VM images and to improve the performance of VM deployments, by taking advantage of the concurrency-optimized data accesses in BlobSeer. On the other hand, we evaluated the performance of using Cloud storage services for application data. We focused on a data-intensive climate modeling application called Cloud Model 1 (CM1). We executed CM1 in a Nimbus Cloud environment and used the BlobSeer-based Cumulus service to store its output. We evaluated our approach through large-scale experiments performed on Grid'5000. Furthermore, we investigated the cost of executing MapReduce applications in Cloud environments, in order to find a proper trade-off between cost and performance for this class of applications. We compared the runtime performance of several MapReduce applications executed within the Hadoop framework in two similar environments: clusters belonging to the Grid'5000 platform and virtual machines deployed on a Nimbus Cloud hosted by Grid'5000 nodes. We are planning to submit a paper on this work to an international conference.
PhD students involved: Alexandra Carpen-Amarie (KerData, INRIA), during a 2-month internship at Argonne National Lab.
Goals. This task aims to study the challenges of enabling transparent Cloud federation, so as to easily share resources across multiple clouds. One goal is to study systems for creating MapReduce execution platforms on top of federated clouds. Another goal is to optimize the behavior of the storage layer in a federated cloud environment and, more specifically, to explore cost-based optimizations for migrating BlobSeer components.
Results. We implemented Resilin, a service able to federate resources from multiple clouds, which provides functionality similar to Amazon Elastic MapReduce. Our system offers more flexibility, as users can choose between different types of virtual machines, operating systems or Hadoop versions.
A Master student focused on the following task during her internship at Inria:
Goals. This research direction was developed during Bogdan Nicolae's postdoc within the INRIA-UIUC Joint Laboratory for Petascale Computing, which started in January 2011. The goal was to enhance the BlobSeer-based virtual machine storage system with various management operations such as VM migration (for preventive fault tolerance), by leveraging the global data availability and the efficient versioning support provided by BlobSeer.
Results. We proposed and implemented a complete virtual machine storage solution based on BlobSeer that relies on a lazy VM deployment scheme to fetch VM image content as needed by the application during its runtime, greatly improving deployment time in scenarios where hundreds of VMs are instantiated simultaneously. Furthermore, this storage solution leverages cloning and shadowing as exposed by BlobSeer to provide high-performance and completely transparent snapshotting support. Several optimizations such as adaptive prefetching and efficient live storage migration were added later on. We obtained significant improvements over the state of the art, in terms of both performance and generated network traffic, while supporting a series of additional features at no extra cost. These results led to a series of associated publications.
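The idea behind lazy deployment can be illustrated with a minimal sketch. This is not the actual implementation: the `LazyImage` class, the `fetch_chunk` callback and the fixed chunk size are hypothetical, and the remote image store is reduced to a function returning the bytes of one chunk. Image content is fetched only when it is first accessed, and writes stay in the local copy.

```python
class LazyImage:
    """Hypothetical local view of a remote VM image, fetched chunk by chunk on demand."""

    def __init__(self, fetch_chunk, chunk_size):
        self._fetch = fetch_chunk      # assumed callback: chunk index -> bytes
        self._chunk_size = chunk_size
        self._local = {}               # chunks already fetched (and possibly modified)
        self.remote_fetches = 0        # counts accesses to the remote store

    def _chunk(self, index):
        # Fetch a chunk from the remote store only on first access.
        if index not in self._local:
            self._local[index] = bytearray(self._fetch(index))
            self.remote_fetches += 1
        return self._local[index]

    def read(self, offset, length):
        out = bytearray()
        end = offset + length
        while offset < end:
            index, start = divmod(offset, self._chunk_size)
            take = min(self._chunk_size - start, end - offset)
            out += self._chunk(index)[start:start + take]
            offset += take
        return bytes(out)

    def write(self, offset, data):
        # Writes modify the local copy only; the remote image is left untouched.
        pos = 0
        while pos < len(data):
            index, start = divmod(offset + pos, self._chunk_size)
            take = min(self._chunk_size - start, len(data) - pos)
            self._chunk(index)[start:start + take] = data[pos:pos + take]
            pos += take
```

In this sketch, a VM that touches only a small part of its image at boot triggers only the corresponding remote fetches. Snapshotting, prefetching and migration are outside its scope.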
Postdoc fellows involved: Bogdan Nicolae (JLPC), Alexandru Costan (KerData).