Automated Hierarchical Service Level Agreements Generic Management Principles and Application to Multi-Domain Infrastructure-as-a-Service Bei der Fakultät für Informatik der Technische Universität Dortmund zur Begutachtung eingereichte Dissertation zur Erlangung des akademischen Grades Doktor der Naturwissenschaften von Konstantinos (Costas) Kotsokalis Oktober 2010 Hauptreferent: Prof. Dr.-Ing. Ramin Yahyapour Στον piατέρα µου, Αθανάσιο Κωτσοκάλη και τη µητέρα µου, Κωνσταντίνα Σακελλαράκη-Κωτσοκάλη για το ϑεµέλιο και το piαράδειγµα. Preface The research presented in this dissertation concerns the area of Service Computing; more specifically, it contributes to the topic of enabling IT ser- vice stacks with dependability, such that they can be used even further in pragmatic business environments and applications. The instrument used for this purpose is that of a Service Level Agreement (SLA). The main focus is on SLA Hierarchies, which reflect corresponding Ser- vice Hierarchies. SLAs may be established manually, or automatically among software agents; it is mainly the latter case that is considered here. The thesis contributes by means of a formal problem definition for the construction of SLA hierarchies using a translation process, a man- agement architecture, a formal model for defining penalties, and a repre- sentation that facilitates the processing of SLAs. Using these tools it is shown that automated SLA management in hi- erarchical setups is possible, through an application to Multi-Domain In- frastructure as a Service. Within this specific technical area, different SLA-based resource capacity planning approaches are examined via sim- ulation – both for online and offline planning. The former case concerns normal runtime operations, and the thesis examines two greedy algo- rithms with regard to their energy-savings efficiency and their perfor- mance. In the latter case, a resource-scarce environment is simulated with the purpose of minimizing penalties from already established SLAs. This is achieved via formally-defined combinatorial models, which are solved and compared to two greedy algorithms. The research that lead to this thesis was conducted from late 2004 to early 2008 in Athens, Greece; from then on and until late 2010 in Dort- mund, Germany, as part of the SLA@SOI EU/FP7 Integrated Project (con- tract No. 216556) [1]. It is presented at Technische Universität Dortmund (TU-Dortmund), Department of Computer Science, under the supervision of Prof. Dr.-Ing. Ramin Yahyapour. i ii Acknowledgements There are many people onemust usually thank in this section of a doctoral dissertation. No exception to this rule is to be found here. First of all, I would like to thank my advisors and supervisors during these years of research; namely, Prof. Dr.-Ing. Ramin Yahyapour, who supervised my work at the Dortmund University of Technology, Germany, and Prof. Dr. Panayiotis Tsanakas who was my supervisor at the National Technical University of Athens, Greece. I owe them a lot for their help and guidance during this process. I am grateful to my colleagues at ITMC/TU-Dortmund, who contributed towards a stimulating research environment, helpedme enormously when- ever speaking German was necessary, but also stood by me as good friends during the last and most difficult part of my work. They are all nothing short of amazing. Special thanks go to Philipp Wieder; his tireless support during my time in Germany made a huge difference. It would be hard to achieve this without the constant support from the many good friends I have been blessed with, outside my working environ- ment. I hope they will understand that I cannot be too verbose here and mention each one separately, but they must all know that their encour- agement was precious. I cannot be thankful enough to my family. This thesis is dedicated to my parents, Athanasios and Konstantina, for all the reasons that words can hardly express. My sisters, Vassiliki and Dimitra, provided support, motivation and a voice of reason that was much needed. They played a crucial role, and for that I thank them wholeheartedly. The same applies to my brother in law, Makis Lazos, and my beloved nephews George and Thanos. Last, but not at all least, I would like to thank my partner in life, Nadia Nikolopoulou. There is no possible way to describe, even to the tiniest bit, all the reasons why this thesis would have never been written if it was not for her. For all the patience, encouragement, support, inspiration and motivation, Nadia, I thank you. You were right in everything. iii iv Contents I Introduction 1 1 Context 3 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1.1 The SLA@SOI EU Project . . . . . . . . . . . . . . . . . 4 1.2 Scope and Scientific Contribution . . . . . . . . . . . . . . . . 5 2 Service and Cloud Computing 9 2.1 Service-Oriented Computing . . . . . . . . . . . . . . . . . . 9 2.2 An Example Scenario . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Service Dependencies . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.5 Multi-Domain Infrastructure-as-a-Service . . . . . . . . . . . 15 3 Service Level Agreements 19 3.1 Definitions and General Considerations . . . . . . . . . . . . 19 3.2 SLA Management . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4 Use Cases and Problem Description 27 4.1 Hosted Enterprise Resource Planning . . . . . . . . . . . . . . 27 4.2 Enterprise IT . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . 32 II Hierarchical SLA Management 37 5 SLA Hierarchies 39 5.1 SLA Dependencies . . . . . . . . . . . . . . . . . . . . . . . . 39 5.1.1 Service Properties Dependencies . . . . . . . . . . . . 40 5.1.2 SLA Translation . . . . . . . . . . . . . . . . . . . . . . 42 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 6 Management Architecture 49 6.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 v 6.1.1 Template Advertisements . . . . . . . . . . . . . . . . 51 6.1.2 Negotiation Interface . . . . . . . . . . . . . . . . . . 52 6.1.3 Planning and Optimization . . . . . . . . . . . . . . . 52 6.1.4 Monitoring Infrastructure and Forecasting . . . . . . . 53 6.1.5 Provisioning . . . . . . . . . . . . . . . . . . . . . . . . 53 6.1.6 Monitoring and SLA Adjustment . . . . . . . . . . . . . 54 6.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.3 Architecture Applicability . . . . . . . . . . . . . . . . . . . . 58 6.4 Framework Implementation . . . . . . . . . . . . . . . . . . . 59 6.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 7 Penalty Model 65 7.1 Penalties and Service Level Agreements . . . . . . . . . . . . 65 7.2 Penalty Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 7.3 Example Application . . . . . . . . . . . . . . . . . . . . . . . 68 7.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 8 System-Internal SLA Representation 73 8.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 73 8.2 Binary Decision Diagrams . . . . . . . . . . . . . . . . . . . . 74 8.3 SLAs as BDDs . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 8.3.1 A Usage Example . . . . . . . . . . . . . . . . . . . . . 76 8.3.2 SLAs and SLA Hierarchies . . . . . . . . . . . . . . . . 77 8.3.3 BDD Mapping . . . . . . . . . . . . . . . . . . . . . . . 78 8.3.4 Negotiation Time Operations . . . . . . . . . . . . . . 81 8.3.5 Runtime Operations . . . . . . . . . . . . . . . . . . . 83 8.4 Proof of Concept . . . . . . . . . . . . . . . . . . . . . . . . . 84 8.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 III Infrastructure SLA Planning 89 9 Resource Model 91 9.1 Rationale and Requirements . . . . . . . . . . . . . . . . . . 91 9.2 Resource Element Design . . . . . . . . . . . . . . . . . . . . 93 9.2.1 The Resource Element Model . . . . . . . . . . . . . . 93 9.2.2 Relevance to CIM Reference Model . . . . . . . . . . . 93 9.2.3 Relevance to GLUE Schema . . . . . . . . . . . . . . . 94 9.3 Resource Element Reservations . . . . . . . . . . . . . . . . . 95 9.4 Application to Real Use Cases . . . . . . . . . . . . . . . . . . 97 9.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 10 IaaS SLA Planning 103 vi 10.1 Scenario Description . . . . . . . . . . . . . . . . . . . . . . . 103 10.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 10.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . 111 10.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 11 Offline Planning with Limited Resources 121 11.1 Scenario Description . . . . . . . . . . . . . . . . . . . . . . . 121 11.2 Problem Models . . . . . . . . . . . . . . . . . . . . . . . . . . 125 11.2.1 Non-flexible . . . . . . . . . . . . . . . . . . . . . . . . 125 11.2.2 Flexible-items . . . . . . . . . . . . . . . . . . . . . . . 128 11.2.3 Flexible-time . . . . . . . . . . . . . . . . . . . . . . . 128 11.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . 128 11.3.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 11.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 130 11.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 IV Conclusions 143 12 Conclusions 145 12.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . 145 12.2 Critical View of Presented Work . . . . . . . . . . . . . . . . . 148 12.3 Concluding Statements and Future Directions . . . . . . . . . 149 vii viii List of Figures 1.1 Layered SLA establishment of SLA@SOI . . . . . . . . . . . 5 1.2 High-level scenario . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 Actors of the example scenario . . . . . . . . . . . . . . . . 11 2.2 Flowchart of address-based route planning service process 12 2.3 Service dependency graph for example scenario . . . . . . 13 2.4 Multi-domain resource provisioning . . . . . . . . . . . . . . 16 3.1 The SLA life cycle, according to the TMF . . . . . . . . . . . 22 3.2 An SLA and an SLA template, according to WS-Agreement . 22 4.1 Enterprise IT . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.2 Enterprise IT planning and optimization . . . . . . . . . . . 31 5.1 SLA dependencies, reflecting service dependencies . . . . 40 5.2 Example service Properties Dependency Graph . . . . . . . 41 5.3 Generic flowchart for translation/negotiation of SLAs . . . . 44 6.1 SAMI and service instances . . . . . . . . . . . . . . . . . . 50 6.2 The SLA Management Instance (SAMI) . . . . . . . . . . . . 51 6.3 A simple Template and Offer example . . . . . . . . . . . . 56 6.4 High-level overview of the negotiation process . . . . . . . 57 6.5 Meeting organization outsourcing . . . . . . . . . . . . . . . 58 8.1 Simple/Ordered BDD representations of f = x1 · x2 + x1 · x3 75 8.2 The BDDs corresponding to functions from Equations 8.2 and 8.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 8.3 Experimental result . . . . . . . . . . . . . . . . . . . . . . . 85 9.1 Mapping of RE concepts to CIM . . . . . . . . . . . . . . . . 94 9.2 Resource Reservation Scenario . . . . . . . . . . . . . . . . 96 9.3 Representation of the computing resources . . . . . . . . . 98 9.4 Representation of weather station . . . . . . . . . . . . . . 99 10.1 An SLA request example . . . . . . . . . . . . . . . . . . . . 104 ix 10.2 The IaaS scenario of Part III . . . . . . . . . . . . . . . . . . 106 10.3 Two-dimensional bin (server) packing . . . . . . . . . . . . 108 10.4 Dynamic bin (server) packing . . . . . . . . . . . . . . . . . 109 10.5 First-Fit assignment algorithm for each resource group . . . 110 10.6 Best-Fit assignment algorithm for each resource group . . . 112 10.7 Performance comparison of FF and BF . . . . . . . . . . . . 113 10.8 Energy savings comparison (low arrival rate) . . . . . . . . 113 10.9 Energy savings comparison, zoom in (low arrival rate) . . . 114 10.10 Accepted SLAs comparison (medium arrival rate) . . . . . . 115 10.11 Energy savings comparison (medium arrival rate) . . . . . 115 10.12 Energy savings comparison, zoom in (medium arrival rate) 116 10.13 Accepted SLAs comparison (large arrival rate) . . . . . . . . 116 10.14 Energy savings comparison (large arrival rate) . . . . . . . 117 11.1 Resource failure scenario . . . . . . . . . . . . . . . . . . . 122 11.2 Problem illustration . . . . . . . . . . . . . . . . . . . . . . . 122 11.3 Example for “flexible items” strategy . . . . . . . . . . . . . 124 11.4 Required / available resources ratio . . . . . . . . . . . . . 130 11.5 Average resource utilization (percentage) . . . . . . . . . . 131 11.6 Solving/Approximation speed for (200, 1, 1, 5, 1) . . . . . . 132 11.7 Solving/Approximation speed for (400, 2, 2, 10, 2) . . . . . 132 11.8 Solving/Approximation speed for (600, 3, 3, 15, 3) . . . . . 133 11.9 Solving/Approximation speed for (800, 4, 4, 20, 4) . . . . . 133 11.10 Effect of increasing number of SLAs . . . . . . . . . . . . . 135 11.11 Effect of increasing number of resource types . . . . . . . . 136 11.12 Effect of increasing number of data centers . . . . . . . . . 136 11.13 Effect of increasing planning horizon . . . . . . . . . . . . . 137 11.14 Effect of increasing number of resource requests in each SLA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 11.15 Placed SLAs per strategy (percent of total SLAs) . . . . . . 138 11.16 Placed resource requests per strategy (percent of total re- source requests) . . . . . . . . . . . . . . . . . . . . . . . . 138 11.17 Avoided penalties per strategy (percent of total penalties) . 139 12.1 High-level scenario . . . . . . . . . . . . . . . . . . . . . . . 146 x Part I Introduction 1 2 Chapter 1 Context The vision of the present dissertation is to further enable IT service hier- archies with dependability, in a scalable manner. As a quick introduction to this vision and the overall context, it is beneficial to provide the back- ground of the respective research. This initial chapter briefly discusses the scientific motivation for the dissertation. Then, it continues to pro- vide information about what is within the thesis’ scope, and what is not. 1.1 Motivation The research presented in this doctoral dissertation is primarily motivated by the wish to enable Service Oriented Infrastructures (SOIs) with depend- ability, for the purpose of automated service binding. More specifically, service computing envisages diverse service ecosystems, where software agents automatically discover and bind each other, to achieve reusability via composition. If this vision is to be achieved, some measure of cer- tainty needs to be integral to the discovery and binding process. The other option, best effort usage, is considered suboptimal and difficult to apply to real-world businesses, as it offers no guarantees and therefore could pose risks to the viability and sustainability of a business. Service Level Agreements (SLAs) are an instrument that can be used to approach such goals of increased certainty. They represent the elec- tronic equivalent of business contracts between service providers and service customers. As such, they can describe what exactly is the ser- vice, what the service provider guarantees, what are the obligations of the customer, what are the penalties if the guarantees are not honored, and so on. The complex relationships between different services as they are forming service chains and hierarchies can be reflected onto SLAs, forming equally complex SLA hierarchies. This way, it will become pos- 3 Chapter 1. Context sible to enhance complete service stacks, make them dependable, and advance service computing overall. SLAs are meant to be negotiated, just like normal contracts, and be used during the service’s lifetime as an official means to describe expectations and guide the service provisioning and monitoring processes. The information contained within SLAs includes details regarding the implementation of the service, for which the provider knows how to map these onto resource requirements. Thus, SLAs can be used as a tool to enable automated decision-making for purposes of online resource plan- ning. Adaptive resource capacity management is not a new area, yet it has had limited success so far in achieving integration with business objectives and the way businesses operate. SLAs provide necessary for- malisms to enable this connection to business objectives, as they typi- cally include provisions for pricing, penalty definitions, etc. When the service refers to infrastructure resources that are offered over the network (“Infrastructure as a Service” - IaaS), the link between SLAs and resource management becomes further prominent. Resources in this case could be many different things, such as computing nodes, storage, network circuits, sensors, scientific instruments, and so on. IaaS is a trend that has seen new heights recently with advances in resource virtualization and the prevalence of Cloud Computing. This area is one of the targets for the present thesis. 1.1.1 The SLA@SOI EU Project Most of the research presented within this thesis took place as part of the SLA@SOI EU-funded Integrated Project (grant agreement: 216556). SLA@SOI is a three-year project running from 2008 to 2011, aiming to enable service hierarchies with dependability, transparent SLA manage- ment, and automation. The project takes a holistic view on service man- agement, using a three-layer approach: An infrastructure layer, consist- ing of finite, countable physical and virtual infrastructure resources; the software layer, consisting of software executing on this infrastructure, possibly also constrained by limitations such as licensing options; and a business layer, that augments software and infrastructure services with business concepts and exposes them as complete products ready to be sold to customers. Each layer represents distinct providers, or distinct de- partments within the same provider. Eventually, for a product (business- enabled service) sold under guarantees foreseen by some SLA, all in- volved services from the two lower layers should be provisioned under appropriate SLAs, so that the top-level one is fulfilled (Figure 1.1). Each arrow in this figure represents the establishment and runtime manage- 4 Chapter 1. Context ment of SLAs, including (re-)negotiation, monitoring and alerting, arbitra- tion, etc. Business Pricing, customer management, ... Software Performance engineering, service orchestration, ... Infrastructure Virtualization, carbon footprint management... Customer Figure 1.1: Layered SLA establishment of SLA@SOI The work presented herein is only a fraction of the work completed (and still ongoing as of September 2010) within the project. For instance, this dissertation does not touch upon business concepts (with the excep- tion of cost and penalties), software service orchestration, performance engineering, or infrastructure monitoring. 1.2 Scope and Scientific Contribution The thesis has two different purposes. First, it tries to create some new formalisms and methods, that apply to SLAs independent of the domain and can be useful in combination with other ongoing research. Second, focusing on infrastructure services, these formalisms / methods are used and extended with algorithms for negotiation-time resource planning, and for resource-planning during exceptional situations when resources be- come scarce. Looking at the topic informally and from a very high level, Figure 1.2 illustrates the context of this work, and the questions it addresses. The scope and the respective scientific contributions of the thesis can be summarized in the following list: SLA Hierarchies: A concrete definition of the SLA hierarchies and the 5 Chapter 1. Context Provider Customer Provider Contract Sub-Contract Request Response Request Response Provisioning Provisioning Resources Resources Contracts bear responsibility; there are penalties if breached How to deduce a subcontract, starting from the contract? What do these boxes really look like? How do they function? How to assess if the contract can/should be established? (For infrastructure) How to make best use of existing resources? (For infrastructure) How are resources defined? How to signal and instruct them? Figure 1.2: High-level scenario process of “translating” from one SLA to others, is presented. This text is available in Chapter 5. Architecture: A complete, self-contained management architecture for SLAs is presented in Chapter 6. This architecture takes into account requirements for multi-level SLAs, and can be extended to fit any domain that requires SLA management. Penalty Model: A penalty model supporting SLA term interdependen- cies, fairness, and business value declarations, is presented in Chap- ter 7. In-memory SLA Model: In contrast to SLA models for on-the-wire rep- resentation, a generic and flexible model is necessary for in-memory management of highly-complex SLAs. Such a model should allow the efficient handling of their complexity. A proposal in this area is presented in Chapter 8. Resource Model and Reservation Primitives: For this work there was required a formalism to express resource groupings and time rele- vance, so that it can be used for co-allocation and advance reser- 6 Chapter 1. Context vation. Additionally, such a model can be directly used within SLAs to express guarantee terms for the provisioning of resources. Chap- ter 9 presents a standards-compatible model for these purposes. IaaS SLA Planning: The planning process differs significantly in differ- ent domains. Here, the focus is on methods that can be applied to the problem of SLA planning for infrastructure services. The topic is researched considering two different use cases: First, normal op- erations negotiation-time planning, where existing SLAs are never violated on purpose; and the focus is on energy consumption and planning performance. Second, planning under an extreme situa- tion where resources have become scarce (e.g. massive resource failure, or submission of non-negotiated SLAs to the system). In this latter case, some SLAs may be discarded partially or in full, so that eventual penalties are minimized. This is an offline process, and for its solution this work proposes a combinatorial (Integer Linear Pro- gramming - ILP) approach. Chapters 10 and 11 elaborate on these issues. Relevant, but out of scope for this dissertation, are: • Negotiation protocols; • Models for on-the-wire or on-the-screen SLA representation; • Management necessities for service discovery, composition, bind- ing, provisioning and monitoring. • Technology and approaches for implementing infrastructure adjust- ment capabilities such as scaling, virtual machine migration, etc. The next chapter includes a motivating example, that will be used in Part II for illustration purposes. In Part III a different example will be used, to fit the narrower scope of infrastructure services. What comes next In the upcoming Chapter 2, the essentials of service and cloud computing will be discussed. The remainder of Part I then continues to discuss SLAs in general in Chapter 3, and provide a more detailed description of the scientific problems that the thesis contributes to – also in the context of various example and more realistic use cases – in Chapter 4. 7 8 Chapter 2 Service and Cloud Computing This Chapter serves as a more elaborate introduction to service orienta- tion and service computing. Fundamental definitions are discussed and relevant concepts such as workflows and composition are provided. The Chapter then proceeds to explain the essentials of cloud computing and how it relates to service computing. Finally, outsourcing of infrastructure services is introduced. 2.1 Service-Oriented Computing Service-Oriented Computing (SOC) is the computing paradigm that uti- lizes services as fundamental elements for developing applications [2]. The question naturally arises, how is a service defined. In [3], Zhang et al. start their book by reviewing a number of definitions from prior art, and eventually define Information Technology (IT) services as follows: Services represent a type of relationships-based interactions (activities) between at least one service provider and one ser- vice consumer to achieve a certain business goal or solution objective. Service providers offer and implement the service, while service con- sumers utilize it (“consume” it). In general, one could informally say that SOC is the IT discipline that concerns with exposing functionality for others to utilize, possibly in the context of some larger application. As such, any activity offered by one entity to another, is a service. In recent years, SOC has changed to a large extent the way business is conducted. The main reason is the fact 9 Chapter 2. Service and Cloud Computing that it is possible to offer complete applications by combining services in a proper way, using the functionality offered by one to cover for function- ality missing from another. This makes it easier for companies to focus on their area of expertise and integrate necessary additional functionality by means of reusing existing tools offered by others. It also introduces the notion of a service dependency; that is, the fact that a service requires some other service in order to execute as designed. If this concept is recursively applied to the services within an application, the result is a service chain. Such formations are very typical nowadays. The term is an analogy to supply chains, but instead of referring to traded products, it refers to services. 2.2 An Example Scenario Before further discussing service dependencies, we introduce an example scenario to be used throughout the thesis for reasons of demonstration. This example consists of an infrastructure service provider, a software service provider, and a customer. The software service provider offers a geo-positioning facility, consist- ing of the following services: 1. Geo-coding (GC): Given an address it returns the respective longi- tude and latitude, and vice-versa. 2. Route planning (RP): Given two sets of coordinates representing two different geographical points, it returns the shortest route from the first point to the second. 3. Address-based route planning (AB): Performs the same functionality with coordinates-based route planning, but accepts full addresses instead of coordinates. Address-based route planning makes use of the other two services: First, it invokes the geo-coding to convert addresses to geographical co- ordinates; then, uses these coordinates to invoke the route planning ser- vice. The latter two rely on a database server, which they must invoke to retrieve necessary information. On the same time, the three geo-positioning software services exe- cute within a web application server, which they never “invoke”. The application server, in turn, executes within a physical or virtual server – and the same applies to the database server. We assume that they run on two distinct servers, and that these servers are both offered by the 10 Chapter 2. Service and Cloud Computing infrastructure provider. Figure 2.1 shows the different actors involved in this scenario. Central Topic Central Topic Central Topic Customer receives service from S/W service provider S/W service provider receives service from infrastructure provider Figure 2.1: Actors of the example scenario In the upcoming Section, this example will be used to illustrate the various types of dependencies between different services. 2.3 Service Dependencies Service dependencies can be explicit, or implicit. Explicit dependencies play a central role in service-oriented computing. The assumption here is that a service’s logic (the functionality it delivers) depends on some other service by means of composition. That is, there are service invo- cations involved. It is expected that each service is invokable over some interface expressed in well-defined languages (e.g. theWeb Services Def- inition Language - WSDL [4]). With a pipeline of input and output of such services, it becomes possible to create a software service composition, in the form of a workflow. Oftentimes such compositions are referred to as Business Processes, and are in general orchestrated by means of a work- flow engine. In recent years, the Web Services Business Process Execu- tion Language [5] (WSBPEL) has been the most widely used specification for this purpose. In the example from Section 2.2, service AB invokes service GC with 11 Chapter 2. Service and Cloud Computing the given addresses, and expects to receive two sets of coordinates (cor- responding to the two addresses). If one, or both the addresses cannot be found and mapped to geographical coordinates, an error is returned. Should both addresses be properly resolved, service RP is invoked and a route between the two is returned. Figure 2.2 illustrates this workflow of invocations and checkpoints, in the form of a flowchart. Process for Address-based Route Planning service (AB) Start Invoke GC Invoke RP Stop Addresses valid? Yes No Figure 2.2: Flowchart of address-based route planning service process Implicit dependencies are, on the other hand, typically related to mid- dleware and infrastructure services (without which a higher-layer service cannot operate at all), and also services used for reasons of redundancy. For instance, web site development agencies may do the site hosting themselves, or outsource it to specialized data centers. Any software relies on some hardware on which it can execute. When it comes to re- dundancy, the standby services set up for this reason are normally not used, but their existence affects the service that depends on them, in the case of a failure. In what follows, the services that depend on others will be referred to as dependents, and services on which others depend will be referred to as antecedents. Although it is possible to perform this rough classification of explicit and implicit dependencies, it is not possible to refine it further and try to identify in greater granularity the relations between dependent services. Rather, the notion of dependency itself is enough to construct a (Service) Dependency Graph [6, 7]. Figure 2.3 illustrates the dependency graph for the example scenario elaborated earlier. Relationships “invokes” and “queries” denote explicit dependencies, whereas “executes in” is an im- plicit relationship. Throughout this thesis SDGs will also be referred to as Service Hierar- chies, to denote the layered structure that one can also see in Figure 2.3. 12 Chapter 2. Service and Cloud Computing Server-1 Web server Service consumer Server-2 Data base Invokes AB GC RP Invokes Invokes Executes in Executes in Queries Queries Executes in Executes in Executes in Figure 2.3: Service dependency graph for example scenario 2.4 Cloud Computing Cloud Computing can be seen as an extension of the concepts of Service- Oriented Computing, enriching them with notions of virtualization and massive scalability. It is difficult to provide an exact definition of the “cloud computing” term, also due to the fact that at this time it is the sub- ject of extreme marketing, therefore the term is typically overloaded by many of its users [8]. Although it started referring only to computing in- frastructure, when AmazonTMreleased Elastic Compute CloudTM (EC2) [9] and Simple Storage ServiceTM (S3) [10], currently the term is also being used for highly scalable software services offered under similar condi- tions. Cloud computing is frequently also used as a term interchangeably with Utility Computing [11], which essentially refers to “pay as you go” usage models. With utility computing, the user initially leases a minimal 13 Chapter 2. Service and Cloud Computing amount of resources, and then she can dynamically add more to the uti- lized resource pool. In a similar manner, dynamically removing resources from the pool is also supported. As such, the user eventually pays only for the resources that she uses. Finally, cloud computing has been often related toGrid Computing [12], especially during its early stages. The latter refers to a resource sharing model, which is closer to Platform as a Service (Paas). Grid computing was very popular for the last decade, but eventually never managed to become a successful model in the business world. Conversely, cloud com- puting appears to be very suitable for business exploitation, and as such the respective uptake is very significant at the time. Grid computing re- mains, nevertheless, the preferred usage paradigm for large-scale scien- tific computing. A complete definition of cloud computing is not within the scope of the work presented in this thesis. The focus of Part III is on Infrastructure as a Service (IaaS), where the service always refers to infrastructure re- sources. These may consist of computing power in the form of countable CPU cores represented by Virtual Machines (VMs), storage measured in well-defined units such as MegaByte (MB), or even sensors, virtualized scientific instruments, etc. In general, the addressed field is that of fi- nite, countable, tangible resources which are required as an infrastruc- ture layer on which software executes. A key concept for IaaS is virtualization. The infrastructure that the consumer is using may be anywhere in the world (assuming no local- ity constraints have been agreed with the provider). Additionally, it may well be the case that the real hardware resource is shared between many consumers, although this is something that the users cannot see directly. This technology has been instrumental to the uptake of cloud computing and IaaS, as it offers important management capabilities. Providers can resize their infrastructure dynamically, with fine granularity. Enforcing us- age limits to users (quota) is not depending on the operating system that the user executes her applications on, but rather on the software that implements the virtualization. Thus, it is possible to do things such as changing the amount of volatile memory of a virtual machine, to scale up and down when needed without restarting the system or performing any manual (hardware-related) operations. Additionally, it is possible to sus- pend VMs for implementing check-pointing mechanisms, clone and mi- grate VMs to other physical infrastructure for load-balancing, etc. In gen- eral, such virtual environments offer significant flexibility and manage- ment convenience, hence also allow new techniques to be implemented for advancing the ways people use IT today. 14 Chapter 2. Service and Cloud Computing A common use case of IaaS is that of a company which needs com- puting resources, but either the usage patterns present sudden scaling requirements (e.g. simulations for a design that was just concluded), or the requirements in the near future are simply unknown. Startup compa- nies are a good example; when operations commence, typically it is not known how prospective users will react, and if the offered product will be successful. Thus, it is economical and makes good business sense to start with small infrastructure, and then scale it if needed. Nevertheless, such infrastructure scaling normally takes a lot of time to design, and the pro- curement itself is also time-consuming. On the other hand, using IaaS, it is possible for a company to size up its infrastructure in a very short time, by leasing resources from an IaaS provider. For instance, it can start load-balancing among 10 VMs, and then initiate another 10 following an agreement with the provider. It is merely a configuration change in the load-balancer, after which it is possible to distribute requests to double the machines, therefore decreasing by 50% on average the workload on each of the servers previously responding to incoming requests. Outsourcing infrastructure requirements is often a good business idea, also due to the avoidance of administration costs (and the costs associ- ated with initial equipment purchase). Economies of scale ensure that a provider of infrastructure services can achieve much lower management costs per resource, than what a company with a small in-house infras- tructure would have to pay for the same purpose. Provided that these savings are also affecting the pricing of IaaS, the knock-on effect has pos- itive results to the pricing of the services of the company who rents the infrastructure, while it may also lead to increased profits. Finally, infrastructure outsourcing is an attractive option for compa- nies who provide complex software systems as a service. Typically, per- formance or other requirements differ from one customer to another. This translates to different requirements from the underlying infrastruc- ture, with a typically very large number of different implementation op- tions [13]. Outsourcing infrastructure requirements enables companies to have access to possibly exotic equipment, that otherwise they might purchase at a high cost, to use only rarely. 2.5 Multi-Domain Infrastructure-as-a-Service The main concept underlying cloud computing (and therefore, IaaS) is one of abstraction. The user does not know (and is not concerned) exactly where her data resides, where it is processed, etc. As long as it is safe from unauthorized access, and ubiquitously present with reasonable per- 15 Chapter 2. Service and Cloud Computing formance, the exact location and domain is not relevant. This property allows that different IaaS domains collaborate in groups and exchange workload. It is therefore assumed that an IaaS domain which is in need of ad- ditional resources can become itself a customer to other IaaS domains, and request to outsource to them (a “variable-role” assumption). It also applies that an entity can perform as a broker of resources from other do- mains, without owning any itself. Domains can be administrative (i.e. dif- ferent providers), or management (i.e. different operational units within the same provider). The two are not differentiated in this work. The term domain is used here to imply resource management independence. Within a single domain, the provider (or unit) can allocate, utilize and re- lease resources basedmerely on “local” knowledge. In addition, providers (or units) from other domains do not have rights over these resources, therefore using them means that they have to request permission to do so. Customer Request I need 500 cores. Provider A Provider B Provider C R e q u e s t R e q u e s t I only have 300. I will ask B and C for 200. Figure 2.4: Multi-domain resource provisioning As evident, there is a straightforward analogy with the examples from Section 2.1, where a service was calling out to other services in order to accomplish its tasks. Figure 2.4 illustrates the concept. In this thesis, the term Multi-Domain IaaS is used to describe this kind of setup. Nevertheless, for the variable-role assumption to hold, there are cer- tain guarantees that should apply. We already mentioned some about security. There can also be others about performance, legal concerns, etc. Hiding the information about who actually delivers the service, as is the case here, means that the primary provider (“A” in the trivial exam- ple here) is legally bound to any promises made to the user, even if it is provider “B” or “C” who fails to deliver. This is a more general concept in service-orientation. Such concepts, related to guarantees, obligation, default and penalty, are covered by Service Level Agreements as will be 16 Chapter 2. Service and Cloud Computing discussed in the next chapter. Summary and Conclusions Service computing is the discipline that studies how different services can be exposed, discovered and bound (possibly into larger service pro- cesses), so that reusability and expertise are taken full advantage of. Cloud computing is the service computing area where virtualization is uti- lized to hide from the user the less relevant information, such as equip- ment and data location. Given sufficiently good reasons (e.g. capacity limitations), a cloud provider or operational unit may choose to outsource part of the service that a customer requires. This concept is essential to the present work, specifically in the context of infrastructure services. The upcoming chapter discusses Service Level Agreements more for- mally, and elaborates over relevant concepts of the SLA life cycle. 17 18 Chapter 3 Service Level Agreements The present Chapter explains important concepts relevant to the rest of the document. More specifically, it discusses in detail what Service Level Agreements are, how they are related to service computing and what SLA management includes from a life cycle point of view. In addition, there is provided some discussion regarding a number of problems pertinent to automated negotiation of SLAs. 3.1 Definitions and General Considerations A Service Level Agreement (SLA) is a representation of all those features that a user can expect to receive by a service, plus related information to provide the context unambiguously (such as user’s obligations). “Fea- tures” here refer not only to the functionality delivered by the service, but also to the quality that the user experiences. As a matter of fact, SLAs are typically associated with Quality of Service (QoS), but in a formal repre- sentation it is reasonably expected to find the service description as well. In general, Service Level Agreements define context, obligations and rights for the parties involved, as regards the service consumed. They consist of a set of facts, and a set of rules. Facts are globally (with re- spect to the contract) applicable truths, such as parties involved, mone- tary unit, etc. Rules include: 1. the conditions that must hold for a certain clause to be in effect; 2. the clause itself, typically describing the expected result that the customer wishes to receive – and which is usually referred to as Service Level Objective (SLO); and 3. a fall-back clause in the case that the aforementioned clause is not honored. 19 Chapter 3. Service Level Agreements As an example, for the condition “time of day is after 08:00”, the clause could be “response time is less than 5 seconds”, and the fall-back clause could be an applicable penalty. This kind of format actually reflects real- life contracts and their if-then-else structure, which might apply either as the default or as the exception to such default respectively. SLAs are typically said to govern service consumption, and guaran- tee its properties. It was already mentioned that SLAs try to augment services with acceptable levels of determinism. What this means is that, purely on a document level, SLAs also include measures of business value and penalty constructs for those cases that the agreed QoS level cannot be maintained by the service provider. Given the definition above, SLAs may be represented in various ways; in this work, only their electronic representation is considered, for pur- poses of automated management by means of machine-readable doc- uments. In a direct analogy with contracts as known from traditional business, SLAs are commonly referred to as electronic contracts. As a consequence of this analogy, the following important questions come up: • What are the legal implications of electronic contracts, when it is software that decides whether to establish (sign) them, or not? • What is the language in which an electronic contract is written? • How is negotiation of electronic contracts taking place? • How can the terms of the contract be expressed in uniform ways? That is, how can the semantics of the language be commonly un- derstood? • How can the software actually make the decisions? With regard to legal implications, it is clear that machines and soft- ware cannot be held accountable in the case of penalties, or more severe breach of contract. Therefore, it is commonly accepted that a formal (non- electronic) contract negotiated between humans needs to be in place, governing all automated interactions between software entities. This con- tract will have to define what are the purposes of the automated SLAs, the capabilities and administrative boundaries of the negotiating software en- tities, acceptable terms and penalties, and in general all those business and technical characteristics that apply to the automated interactions. Hence, a framework contract must be in place to reflect accountability and responsibilities of involved business parties. The next three questions are more technical and have to do with proper representation of messages exchanged between the two parties 20 Chapter 3. Service Level Agreements (the software entities), as part of the establishment process. In general, it is desirable to model this process after the establishment process of pa- per contracts, where a first draft is created and then the parties involved perform revisions iteratively. This iteration of drafts forms a multi-round negotiation, until both parties are satisfied and agree to sign. A simi- lar approach is commonly used to model negotiation of electronic con- tracts. However, the question of the contract’s content remains. In real life and paper contracts, someone expects that the contract is written in a physical language that both parties can read, e.g. German, English, French, etc. Achieving this universally in software is not trivial, and it is only recently that the first relevant standardisation effort became suc- cessful with WS-Agreement [14]. Even with such a common language, though, the problem of describing the obligations and expectations of the involved parties is still open. Let us consider an artificial example, where a contract is negotiated between an administrative manager and a systems administrator. The contract is about backoffice functions in- voked by the adminstrative person’s desktop spreadsheet. The manager would require that whenever she invokes a specific function from within the spreadsheet, there is always a result returned within a few seconds. The administrator may not be able to interpret this to technical terms, for which she would be willing to sign a contract. Such terms would be, for instance, the failover hardware of the backoffice server, load-balancing for performance optimisation, etc. These two people cannot sign a con- tract, although they may speak the same natural language, because they are mutually unaware of the meaning of the guarantee terms their co- signatory party is willing to accept. The exact same problem applies to electronic contracts, where software entities may support and under- stand different sets of SLA terms. The final question has to do with automated e-contracting, which is the latest trend in the area. Service Level Agreements are being used already extensively in various parts of the IT industry, but their first ap- plications were in computer networking by means of facilities such as Differentiated Services [15] (DiffServ). SLAs were initially contracts that defined things such as allocated bandwidth, quality of networking circuits, etc. SLA research in the networking area is still very active (e.g. [16, 17, 18]), nevertheless, with the advent of service computing and the “every- thing as a service” principles, the concept was applied to plenty of dif- ferent areas and automated SLAs [19] are considered in more recent re- search. Automation here is related to the absence of human intervention to decide what is acceptable and what not, and perform the negotiation and runtime management. Conversely, there are assumed autonomous 21 Chapter 3. Service Level Agreements agents that can perform all decision-making during negotiation and run- time, given predefined policies and business logic to drive decisions. 3.2 SLA Management SLA Management entails the actions that must be taken throughout the life cycle of a Service Level Agreement. Figure 3.1 [20] illustrates the definition of the TeleManagement Forum (TMF) for this life cycle [21]. ™⌢␥☧∨? ⨢⬥⤬⴩Ⱕ? ⸧☤∧∨⤭⤬┨ ⼰∱㈩Ⱕ? ㌴㐢㐴✢⠩ ?∣␥☧∨⤢⨢┫␣∬? ☬⤨∬✭✥∣∬? ⸢⼫✭☧∨☬? ∰∱㈧∨ㄫⰧ㌦ㄧ? 㔢Ⱒ㌦✢⠦Ⱙ ␳⬪ⴴ⴫Ⱘ㐢㌪ⴱ? ⬳⤢㌴⠦Ⱙ⠶㜸 ⌫Ⱝ✫㌭Ⱟ 㤤∳☧∨☬? ⌦⴬✦⴬⠺⠣⬬ⴧ⬳ 㘷?⠤∳㬫㌣☬ㄢ 㠴㐢㐴⠤∳㬫㌣☬ㄢ ☬⤨㌢☴㐢㐴 ✢⌤┦✢? ?∶✬⠭⤬┨ ?∳⌭Ⱖ✢⠦Ⱙ ⤢ㄫ⌭㐴⴫? 㐢㌪ⴱ? Figure 3.1: The SLA life cycle, according to the TMF Initially, a Service Level Agreement Template must be created. This template abides to some description model, such as WS-Agreement or WSLA [22]. Figure 3.2 illustrates the structure of an agreement, and that of an agreement template, according to WS-Agreement.?? ? ?? GFD-R-P.107 March 14, 2007 Grid Resource Allocation Agreement Protocol (GRAAP) WG graap-wg@ogf.org 14 Figure 2: Structure of an agreement. The section after the (optional) name is the context, which contains the meta-data for the entire agreement. It names the participants in the agreement, and the agreement’s lifetime. The next section contains the terms that describe the agreement itself. The XML representation of an agreement or an agreement creation offer has the following structure: xs:string ? wsag:AgreementContextType wsag:TermCompositorType The following describes the attributes and tags listed in the schema outlined above: /wsag:Agreement This is the outermost document tag which encapsulates the entire agreement. An agreement contains an agreement context and a collection of agreement terms. /wsag:Agreement/@AgreementId Agreement Terms Compositor Service Terms Guarantee Terms Context Name GFD-R-P.107 March 14, 2007 Grid Resource Allocation Agreement Protocol (GRAAP) WG graap-wg@ogf.org 29 5 Agreement Template and Creation Constraints To create an agreement, a client makes an offer to an agreement factory. An agre ment creation offer has the same structure as an agr e ent. The agreement factory advertises the types of offers it is willing to accept by means of agreement templat s. An agreement template is composed of three distinct parts. We summarize the structure in Figure 3: Figure 3: Structure of an agreement template. The structure of an agreement template is the same as that of an agreement, but an Agreement template MAY also contain a creation constraint section, i.e. a section with constraints on possible values of terms for creating an agreement. The constraints make it possible to specify the valid ranges or distinct values that the terms may take. The constraints refer back to individual terms they apply to using XPATH. The contents of an agreement template are of the form: xs:string ? Agreement Template Terms Compositor Agreement Creation Constraints Service Description Terms Guarantee Terms Context Name Figure 3.2: An SLA nd an SLA template, according to WS-Agreement A template’s purpose is to provide guidelines for the upcoming SLA negotation, by providing a description of the ervice, a list of negotiable 22 Chapter 3. Service Level Agreements properties and the applicable constraints, placeholders for stating busi- ness values and penalties, and possibly information such as trusted 3rd parties etc. The SLA templates need to be used by the initiator of a ne- gotiation, to form an initial Request for Quote (RfQ), an agreement offer, or other possible messages. The SLA negotiation follows, where the involved parties are expected to hold a series of message exchanges until they either converge to a final agreement, or fail and quit. For the negotiation to take place, a protocol is needed (e.g. Contract Net [23]) so that involved agents can interact in a structured and meaningful way. Additionally to that, the negotiation parties typically have a strategy based on which they decide what is a good and what is a bad SLA, from a utility point of view. Different agents with different strategies may have difficulty to converge to a commonly acceptable SLA; In fact, it has been proven that this negotiation may take prohibitively long to converge [24]. This is why, in most cases, it is consid- ered useful to provide hints to each other about preferences with regard to the candidate achieved objectives. For instance, if the negotiation in- volves an agreement on the cost of the service, and its availability, the customer may inform the provider that lower cost is more important than high availability. Thus, the provider would be in a position to provide counter-offers that have more chances to be acceptable by the user. A negotiation may trigger another negotiation; for instance, the nego- tiation of a customer with the geocoding service, may trigger a negotia- tion between the latter and an infrastructure service. At this point in time, an SLA Translation process takes place: it converts the terms of the first negotiation into terms of the second one. This way, it is possible for the geocoding service to have reasonable certainty that it will be able to of- fer itself under the negotiated terms and conditions, and therefore it can agree and establish the SLAs with its customers. SLA Translation, and the resulting SLA hierarchies, are discussed extensively in Chapter 5. The Implementation of the service, and therefore of the respective SLA, follows negotiation. In this phase the provider must configure the resources allocated during the negotiation phase, so that they are ready to be used during service execution. For example, this would involve installation of software on the allocated hardware, network configuration, etc. Monitoring setup is also included as part of this phase. The provider may prepare the monitoring infrastructure, so that it is readily available during service execution. For example, this would involve installation of software on the allocated hardware, network configuration, etc. The SLA Execution entails the execution of the respective service(s), and their monitoring to confirm compliance to the SLA. Following the col- 23 Chapter 3. Service Level Agreements lection of performance or other information, the parties involved will eval- uate the conformance of the service execution to the SLA. If some viola- tion occurs, or is about to occur, countermeasures may be taken to bring service performance back on track, or to reduce the violation probability. Hence, this adjustment process is a mostly dynamic part of the SLA life cycle. In the case of a possible failure, penalties should be assessed. A provider may choose to pay penalties than react to a violation, depend- ing on business policies. Another possibility is the re-negotiation of the SLA, perhaps in parallel with penalty procedures. In all possible tracks of action, SLA dependencies and hierarchies should be taken into account. This contributes to system sanity and efficiency, with the reasons for fail- ure being discovered faster and with greater certainty. The SLA Assessment phase is mostly related to the Execution phase, as it reuses the information collected during that one. More specifically, monitoring data (which shows how the service performed in relation to the guarantees provided as part of the SLA) is assessed in order to eval- uate the sensibility of the offered templates. The outcome of the process may be, for instance, that the current templates are not properly cali- brated to take into account the QoS level that a service can actually offer. As such, the templates must be adjusted to lower (or higher) limits and constraints. This feedback process is important for the longevity of SLA- based operations of a provider, otherwise there is a risk that penalties will be too high to sustain. The final phase is the SLA Termination, during which the allocated re- sources are decommissioned, and clean-up takes place. Accounting and billing may also happen at this stage, especially for short-lived services where there is no need to have periodical payments. In Chapter 6, a management architecture will be presented, which is largely affected by these definitions for the SLA life cycle. 3.3 Related Work In upcoming Chapters of this thesis, multiple citations and references to related work are provided to illustrate better the relationship between the presented work, and prior art. Nevertheless, there is important work that has been committed in the past in the area of automated SLA manage- ment, and is not directly within the scope of this thesis. In this Section, such work is discussed to provide a more complete picture of the area for the interested reader. Certainly the citation list herein cannot be exhaus- tive, rather there is an effort to provide some characteristic results and give a feeling of the developments timeline in this scientific area. 24 Chapter 3. Service Level Agreements In [25], Jin et al. make an analysis of SLAs for web services, around the time when such concepts also started being formalized with WSLA. They look at the topic in relation to then-developing technologies, such as WSDL and UDDI [26], and consider various aspects already such as service composition and how it affects SLAs. In 2002, Sahai et al. presented a paper on automated SLA monitor- ing for web services [27]. Without a formal representation for SLAs being available, they try to build on process description languages and intro- duce quality metrics to be monitored. Then, an architecture and neces- sary mechanisms for a distributed monitoring system are discussed. In 2003, one of the first articles on SLAs for Grid Computing appears, by Leff et al [28]. Within the context of commercial grids, the authors look at the issue of honoring SLAs and meeting demand via techniques mostly relevant to workload migration. Considerations for the definition, monitoring and enforcing of the SLAs are integral to the article. Shortly after WS-Agreement first drafts appeared, people already ea- gerly started using it to implement various services that could offer per- formance guarantees. In 2004, Zhang et al. used it also in the context of Grid Computing for a data transfer service [29]. Using the WS-Agreement model, the service would define agreements and offer guarantees for the time completion of a data transfer. Decision-making for the feasibility of the SLA was performed using prediction mechanisms; then rate-limiting on the network interface was used to achieve the enforcement of the SLA. In [30] (2005), Aiello et al. take a critical look to WS-Agreement, and reflect on the semantics of SLAs. Following, the authors propose exten- sions to the specification, mostly with regard to SLA life cycle and accept- able states, so that agreements become more long-lived and robust as re- gards forthcoming violations. In the same year, Sauve et al. connect SLA planning to business objectives such as infrastructure costs and financial loss [31]. The authors develop a complete impact model, to reflect SLA failures onto business objectives. Chawathe is also considering SLA planning in [32], from a theoretical perspective related to the strategy that can/should be followed by nego- tiating agents. Profit and risk distribution, as well as collaboration feasi- bility, are discussed. Also in 2006, Oldham et al. extend WS-Agreement with Semantic Web technologies, to enhance partner selection during au- tomated service compositions. In 2007, McKee et al. [33] discuss different strategies for decision- making while negotiating in the context of grid service marketplaces. The authors elaborate on the complexity of the problem, and look at possible consequences such as “self-denial-of-service” in the case of reserving re- 25 Chapter 3. Service Level Agreements sources as a means of promise during negotiation. Barbagallo and Comuzzi explored cross-disciplinary aspects of auto- mated SLA negotiation in 2008 [34]. Specifically, they focus on mecha- nisms to reduce adverse selection and moral hazard risks. They present a framework for the integration of microeconomic and engineering is- sues, in contract definition and management. Proposed techniques in- clude third-party verification, signaling, screening / monitoring, reputa- tion management, incentives, etc. Finally, recent and interesting research results include [35] by Wada et al. The authors both look into SLA-aware service composition, focusing on the multi-objective nature of the problem. A genetic algorithm is being used, alongside an optimization framework called E3. The application of multi-objective optimization on SLA planning is a promising approach for SLA planning, and indeed, something that this thesis also considers within the context of SLA Translation in Chapter 5. Summary and Conclusions In this chapter SLAs were discussed in general, initially providing some considerations which apply to them due to their (possibly) legal nature. Then, technical and other questions about SLAs were put forward, to in- dicate some of the existing, real-world problems that apply to their auto- mated management. Apropos, Section 3.2 elaborated over all the things that their management entails, according to the SLA life cycle definition from the TeleManagement Forum. The final Section offered a timeline of research in the area of automated SLA management, albeit with a re- duced but characteristic set of scholarly publications. The following Chapter is the last one of Part I. Having provided a con- crete background for the thesis’ work, Chapter 4 defines the respective scope in detail further than within Chapter 1, by means of realistic use cases. The gaps are identified, and scientific contributions are under- lined. 26 Chapter 4 Use Cases and Problem Description In this chapter two relevant use case scenarios are introduced, which il- lustrate typical relations between a service stack and its corresponding SLA hierarchy1. These use cases are found today in real-life situations, and are inspired to a large degree by the SLA@SOI EU project. They involve software and infrastructure services, which are expected to be provisioned with performance or other guarantees. A service hierarchy is present, implying similar SLA hierarchies. Following the description of the use cases, the existing problems are identified in greater detail than the summary of Chapter 1. 4.1 Hosted Enterprise Resource Planning This use case concerns a Software-as-a-Service scenario, where an Enter- prise Resource Planning (ERP) system is made available to some external customer. That is, the customer does not need to install the ERP stack locally, rather she can use it via some thin client, or a web browser. With- out loss of generality, this setup is mostly targeting smaller companies, who usually prefer to avoid the hassle and administrative costs of a local installation. As defined in [36] by Bidgoli: Enterprise Resource Planning (ERP) is an integrated computer- based system used tomanage internal and external resources, including tangible assets, financial resources, materials, and 1The concept of SLA hierarchies is formally defined in Chapter 5; here we will follow an informal, via-example approach. 27 Chapter 4. Use Cases and Problem Description human resources. Its purpose is to facilitate the flow of infor- mation between all business functions inside the boundaries of the organization and manage the connections to outside stakeholders. Built on a centralized database and normally utilizing a common computing platform, ERP systems consol- idate all business operations into a uniform and enterprise- wide system environment. Although different ERP implementations will typically entail different subsystems depending on the customer’s requirements and the business offering, the following components are considered to be common across the various implementations [37]: • Transactional Backbone – Financials – Distribution – Human Resources – Product life cycle management • Advanced Applications – Customer Relationship Management (CRM) – Supply chain management software ∗ Purchasing ∗ Manufacturing ∗ Distribution – Warehouse Management System • Management Portal/Dashboard – Decision Support System In the present use case, such involved subsystems are exposed as ser- vices within the ERP provider; and one may invoke the other to complete its various tasks. Other, lower-level components that are never directly used by the customer, are also exposed as services and invoked by these higher-level subsystems. Considering that this is a business-critical system, and that complete companies usually base their operations on it, it needs to be always avail- able, and perform within acceptable limits. What constitutes “acceptable limits” depends on the customer and the domain, and is the subject of 28 Chapter 4. Use Cases and Problem Description SLAs between the customer and the provider. For instance, let assume a retail store that uses such a hosted ERP system to manage its inventory. When a customer wishes to buy an article, a store clerk needs to check the availability of this article. The inventory module will typically first look at the warehouse availability; and if the product is not immediately avail- able, it would consult with the purchasing and distribution components, to see if there is an ongoing order, and if so, what is its current state. In this scenario, the retail store (as a customer of the ERP service) will expect that the performance of these queries is sufficiently good, so that the customer can be informed promptly on the article’s availability. Such performance guarantees are the subject of the SLA among the customer and the ERP service provider. For this running example, one guarantee may be: For all inventory queries within 24 hours, 99% should complete within 5 seconds; and the remaining 1% should complete in no more than 10 seconds. Another one, may be: Under the precondition that any query which does not return a result within 10 seconds is considered as a failed request to the system; and with “availability” being defined as the ratio of completed requests to failed requests; then availability of the system over 24 hours of measurements should be no less than 99.9%. Such requirements are very common in real life, and SLAs need to re- flect them. In the case of automatically negotiated SLAs among software agents, the software needs to be able to deduce what any of such guar- antees between the customer and the provider means for corresponding SLAs within the provider, and among its services. Relevant questions may involve, for this example: 1. What are the acceptable time limits for queries to the warehouse availability component, the purchase component, and the distri- bution component, so that the total time for the three sequential queries does not exceed 5 seconds (or 10 seconds for up to 1% of all calls)? 2. What is the technical setup for redundancy and failover, so that availability remains higher than 99.9% within a span of 24 hours? In addition, it is assumed that the infrastructure may be available in- ternally, or rented from an external (infrastructure) provider. Either way, 29 Chapter 4. Use Cases and Problem Description the SLAs for the software services will have to be reflected on SLAs for the infrastructure service(s). 4.2 Enterprise IT This use case refers to a provider of infrastructure resources. It is a de- partment of a larger organization, offering its services within that orga- nization, and may use its own resources exclusively or may choose to outsource some of the workload in the case of utilization spikes. The cus- tomers are the departments executing management functions (e.g. the company’s Chief Technical Officer). ™⌤┦ ✨⤨⌥✥⤪ ⬬⴮⼪〧┦ⰱ IT Infrastructure Gateway S L A Data center Data center Data center S L A S L A S L A External IT Provider SLA Objectives: Relative job importance, Job priority, Time-to-complete, ... Objectives: Resource types, Resource quantities, ... Figure 4.1: Enterprise IT Figure 4.1 provides a high-level overview of the basic concept. Here, higher management dictates the objectives according to which resource allocation and job prioritization takes place. Such objectives may include things like the expected completion time of a job, the performance re- quirements of a service, the importance of a task in relation to other sim- ilar tasks, etc. As soon as a new SLA is requested to be established, the system must infer resource requirements from such higher-level objec- tives. Following that, SLAs with data centers (resource pools), and even with external providers if possible / necessary, are requested to be es- tablished. These second-level SLAs are produced based on the first-level SLA, as it has been requested by the management department. In order to stay up-to-date with business environments that may be changing rapidly, objectives and priorities for tasks served by the IT in- frastructure may also change often. Examples of such situations include 30 Chapter 4. Use Cases and Problem Description the sudden, unexpected customer request for a software service, the ad- justment of a company’s operations to some natural disaster, a rapidly changing financial environment that reflects on the importance of finan- cial models and simulations, etc. It is therefore desirable that the sys- tem can handle SLA adjustment requirements. Let assume, for example, the case of two large-scale simulations executed in parallel by two differ- ent research groups within the same company. They are both governed by SLAs, which map time-to-complete on resource quantities assigned to each workload. Should one simulation be suddenly considered more im- portant than the other as regards its completion time, it may apply that resources have to be re-assigned accordingly. In such cases, it is desirable that the infrastructure adjusts automatically and takes into account all re- lated consequences for the complete set of SLAs. One possible course of action would be that the IT Infrastructure Gateway from Figure 4.1 rene- gotiates its SLAs with the data centers: It could free up some resources from the less important job, and re-assign them to the more important one. Another option is to request that higher-capacity and performance resources are now used for the more important job, therefore triggering a migration process within the data centers. High-level business objectives Planning model Acceptable costs Assigned penalties Optimization model Resource requirements per data center Figure 4.2: Enterprise IT planning and optimization Independent of the selected course of action, there is apparently re- quired the intelligence to translate objectives from one SLA to the oth- ers, and choose one of the various possibilities in accordance to expected utility. Models (typically, domain-specific ones) are therefore required to perform this translation via planning and optimization processes, as illus- trated in Figure 4.2. In this broader SLA planning / optimization and adjustment context, 31 Chapter 4. Use Cases and Problem Description things like Total Cost of Ownership (TCO), data center efficiency, high availability, etc, become very important decision factors. Additionally, resource allocation algorithms need to take into account resources avail- able externally, and deduce whether it would be beneficial to outsource some of the workload. Pricing then comes into the picture and has to be weighed against higher management requirements. Based on such factors, the SLA management framework can deduce whether subcon- tracting is feasible, and what may be a good or optimal contract. 4.3 Problem Description The gaps and open issues related to these two use cases (but also to the example scenario from Section 2.2) are quite a few, spanning plenty ICT disciplines (e.g. combinatorial optimization, performance engineer- ing, forecasting, security, capacity engineering, virtualization, etc). In this Section, the ones most relevant to this thesis will be identified. First and foremost, there needs to be a formal definition of a hier- archy of SLAs (Gap 1). Although the concept can be intuitively under- stood within the Service Computing area, a formal way to describe what it means is still lacking. More specifically, it is necessary to understand how can SLAs be related, what is the nature of dependencies among different SLAs, and how those dependencies affect other SLAs established by the same entity. As an example from the Enterprise IT use case, when the priority of one SLA is increased, and no outsourcing is possible, the pri- ority of other SLAs must be decreased. Even if it is possible to outsource some of the workload, someone must decide which exact workload will be outsourced, and how does that affect costs and performance. Having de- fined the concept of SLA hierarchies, it is necessary to understand what it means to deduce one from another; that is, how and when is a hierarchy of SLAs built. Following this step, it is needed to have an SLA management archi- tecture that takes into account such requirements for the establishment of SLA hierarchies (Gap 2). This architecture should be extensible, to be applied to any possible use case, independent of languages, proto- cols, and legacy systems already in place. In addition, it would ideally be autonomous, so that it is easier to reach (at least) some local optimum when evaluating the feasibility of a contract, or trying to infer a subcon- tract; both tasks can be extremely expensive from a computational point of view. An architecture of this kind is a necessary milestone in order to achieve hierarchical SLA management, as discussed in the example scenario and the use cases. 32 Chapter 4. Use Cases and Problem Description The tasks of evaluating SLA feasibility and deducing subcontracts are tightly coupled with the concept of utility; that is, some measure of bene- fit that comes from the respective decisions. More often than not, utility is associated with financial results and some measure of net income. Pricing may take many different forms, and there is already significant prior art in this area. Nevertheless, a complete model for the definition of penalties depending on the agreed price, is still missing (Gap 3). Various efforts have taken place in recent years –these are discussed in the respective chapter–, but the author believes there is room for improvement. This is, therefore, an additional identified gap, and a proposed solution is part of this thesis. The solution seeks to be fair (thus associated to price, and some objective measure of failure) and unambiguous. Penalties aside, computing the feasibility of an SLA under its specific terms, or inferring good/optimal subcontracts starting from a top-level contract, may be extremely demanding from a computational point of view. In certain cases (depending on the algorithm at hand and the num- ber of candidates) it may even be impossible to deduce. To the combi- natorial nature of resource planning for multiple SLAs, adds the fact that SLAs themselves may be arbitrarily complex. Multiple logical operators such as AND, OR, XOR may be found – for instance, this type of constructs are integral to WS-Agreement. Even within the two simple terms from the ERP use case, multiple terms can be found that are combined with such logical operators. Further to that, “if” clauses are found in the form of pre-conditions for an SLA term to apply, introducing further complexity to the handling of SLAs. Putting everything together, the task of processing incoming (or producing outgoing) SLA offers becomes very hard. Generic recipes to handle this kind of complexity are much needed, and consti- tute an important gap in the area of SLA negotiation and management in general (Gap 4). This thesis introduces and proposes a methodology us- ing a graph-based structure, known as Binary Decision Diagrams (BDDs). Using this methodology, it is possible to simplify SLAs significantly under the assumption that they can be represented as boolean functions. This simplification leads to a canonical form, therefore their handling becomes much easier. The four gaps and respective innovations discussed so far, are generic and applicable to SLA management independent of the domain at hand. Due to the recent explosive growth of IaaS platforms and applications, as well as the recognition by many experts of the need to enhance IaaS with full SLA management, the author decided to focus further on this area, applying the know-how achieved from the generic work outlined within Part II. Within this context, a necessary milestone was to have 33 Chapter 4. Use Cases and Problem Description an abstract and flexible syntax to describe resources and operations for reserving them. Advance resource reservation is a very useful facility during SLA negotiations. Although there is significant amount of prior art in the area of resource description, a sufficient combination of the two features could not be found; thus, a model was developed to describe ar- bitrary infrastructure resources and primitives for their reservation. The result was loosely based on existing standards such as CIM and GLUE (fur- ther discussed in Chapter 9), to enable practical use while also applying the respective experience. This model is a tool to use for some practical problems. The interest of the author was to research the practical use of SLAs in two different scenarios: First, examine the scheduling of IaaS SLAs in the normal case where customers arrive randomly and request some resources for a spe- cific amount of time. Taking into account these incoming requests, the main question to answer was: How to model the scheduling problem, so that the energy footprint of a provider’s data centers is minimized? The online nature of the scenario (i.e. for each incoming resource/SLA request, the next ones are unknown) means that some simpler, greedy approach is both convenient and suitable. Indeed, the problem was mod- eled after the online bin-packing problem, where each bin is represented by one server, potentially shared by many customers. The second scenario is modeling an extreme situation, where the re- sources are not enough to satisfy a number of already-established SLAs. One such example is the case where a considerable percentage of avail- able resources becomes unavailable, e.g. due to power outage. The question that arises, is how to best manage this case and adjust the use of available resources, so that the least penalties are paid to customers. Various possibilities are examined as regards which parts of the SLAs are honored, and which ones are purposefully withdrawn. A combinatorial model is developed, as a knapsack problem variant. To the best of the author’s knowledge, this is the first time that this variant appears in lit- erature. The model is evaluated in three different variations, as well as against two greedy algorithms. Summary and Conclusions In this Section there were provided two typical, real-world use cases for SLA management based on industrial use cases of the SLA@SOI project. The relevance to SLA hierarchies was discussed and respective gaps were identified. These gaps were analyzed to result in a more complete prob- lem description, so as to provide the full scope of the thesis. 34 Chapter 4. Use Cases and Problem Description The upcoming Part II includes scientific contributions that apply to SLA management independent of the application domain. Initially, SLA hier- archies will be discussed; then a management architecture; a penalty model; and finally, a flexible, generic model for in-memory SLA represen- tation. 35 36 Part II Hierarchical SLA Management 37 38 Chapter 5 SLA Hierarchies Starting from service chains and business processes, one can deduce de- pendencies of one service on some other(s). These dependencies stand behind the concept of SLA hierarchies, which is central to this thesis. The present chapter discusses the respective topic, exploring it in de- tail. A generic definition of the “SLA translation” problem will be provided, alongside related formalisms that connect it to SLA negotiation. 5.1 SLA Dependencies Let assume a service hierarchy, where a service “A” relies on two other services “B” and “C”. This practically means that, if a customer requests to consume A under some specific guarantees, the service provider can only agree if it has reasonable certainty that services B and C will per- form accordingly. Otherwise, best-effort service delivery by the latter can easily affect the delivery of the former. That is, the guarantees offered (and agreed upon) between the provider and the consumer of service A, depend for their compliance or violation on the guarantees agreed upon between A and B, and those between A and C. Hence, it is necessary to have SLA Hierarchies that reflect the service hierarchies to which they apply. For instance, let assume that service A is the address-based route planning service from Section 2.2 (AB), and services B and C are the re- spective geo-coding (GC) and route-planning (RP) services. It has already been shown that the completion time of an invocation of the former de- pends on the completion times of invocations of the latter two. There- fore, these non-functional properties are related. Should an SLA be es- tablished to guarantee the properties of the dependent service, then an SLA should also be established to guarantee the related properties of the 39 Chapter 5. SLA Hierarchies antecedents. Figure 5.1 illustrates the concept; and Section 5.1.1 elabo- rates further on it. AB Customer GC RP SLA 1 SLA 2 SLA 3 SLA 1 SLA 2 SLA 3 Figure 5.1: SLA dependencies, reflecting service dependencies The existence of these hierarchies signifies the need to have some method to deduce one from the others, based on their inter-dependencies. In the text that follows, this process will be referred to as SLA Translation. One could say that SLA Translation is the process that designs a hierar- chy of SLAs, during negotiation between the various parties involved; and the establishment of the SLAs is what builds and finalizes the hierarchy. Just like with service hierarchies, full distribution of SLA hierarchies is as- sumed. Each party knows only about the parties (services) depending on it, and the parties (services) on which it depends. Similarly, each party only knows SLAs for which it acts as a provider, SLAs for which it acts as a customer, and the links/dependencies between the two categories. There is no assumption about a centralized data store that holds all SLAs and all dependencies, thereby leading to the possibility to deduce the complete hierarchy/-ies within a service ecosystem. 5.1.1 Service Properties Dependencies An SLA was described earlier as a set of guarantees over the consump- tion of a service. Dependency of a service upon others, naturally means that its properties are depending on properties of these other services. Assuming the availability example for a dependent service, it will be af- fected by the availability of its antecedents: Service AB cannot execute when either GC or RP cannot execute – hence, the availability of the two latter affect that of the former. The same applies for other service prop- 40 Chapter 5. SLA Hierarchies p1 p2 p3 q1 q2 q3 r1 r2q4 Sp Sq Sr f f g gh h Figure 5.2: Example service Properties Dependency Graph erties such as invocation completion time, maximum throughput, Mean- time-to-Failure (MTTF), Mean-time-to-Repair (MTTR), etc. There may also be cases where a dependent service’s property may be related to a completely different property of its antecedents. For in- stance, the cost of a service is typically affected by the quality of the services that it uses for completing its tasks. Additionally, it is certainly possible to have an SLA which does not refer to all properties of a service, and therefore only a limited number of dependencies must be taken into account: only those which affect the properties mentioned in the SLA. It is becoming clear that there cannot be a universal classification of service properties, and their dependencies. Certainly, facts that set the context of an SLA cannot apply to all SLAs either, but rather, distinct facts for distinct SLAs should be assumed. It is only possible to define the prob- lem generically based on the abstraction of conditions that need to hold true, when aggregated according to the SLA structure. Use-case-specific knowledge needs to be applied by domain experts to instantiate these conditions, and associate them in the context of different, dependent ser- vices. Starting from the service dependency graph (see Section 2.1), it be- comes possible to make a next step towards a service Properties Depen- dency Graph (PDG). If service Si has properties PSi = (pSi1 ,p Si 2 , ...,p Si m); service Sj has properties PSj = (p Sj 1 ,p Sj 2 , ...,p Sj n ); and Si depends on Sj; then a dependency of Si’s r-th property can be formulated as a function fr of properties of Sj: pSir = f Si r (p Sj 1 ,p Sj 2 , ...,p Sj q ) (5.1) where 1 ! r ! m, and 1 ! q ! n. This definition can be extended for a service S that depends on more than one services. Let assume that S has properties pS1 , pS2 , ..., pSN. Let us also call the antecedents S1, S2, ..., Sm, with property sets PSi = 41 Chapter 5. SLA Hierarchies (pSi1 ,p Si 2 , ...,p Si Ni ), 1 ! i ! m. Then, each property pSr of service S can be generically expressed as follows: pSr = fr(p S1 1 , ...,p S1 N1 ,pS21 , ...,p S2 N2 , ...,pSm1 , ...,p Sm Nm ) (5.2) Figure 5.2 illustrates such an example. Here, service Sp depends on services Sq and Sr. Domain experts can then create rules about the de- pendencies of Sp’s properties p1-p3 on those of Sq (q1-q4) and Sr (r1- r2), based on modeling techniques, simulation, or real-world observation. These rules are represented by functions f1, f2 and f3 (named f,g and h for convenience) that implement a mapping of each property on the properties on which it depends. 5.1.2 SLA Translation Operating under the abstractions from Section 5.1.1, the concept of SLA Translation can be discussed. In essence, SLA translation is the process of analyzing an SLA in relation to the PDG (i.e. finding the service properties common to the SLA and the PDG), applying heuristics and pre-existing knowledge, and coming up with one or more subsequent SLAs for an- tecedents. These subsequent SLAs provide reasonable (but not necessar- ily complete) certainty that the top-level SLA will not be violated, unless some of them are violated too. The aforementioned analysis starts with Equation 5.2. One can first define a complete set of such equations, one for each property that the customer of the dependent service requires in the negotiated SLA. Proper- ties that the customer does not require guarantees for, can be excluded. Then, this system must be solved, taking into account existing constraints as provided by the SLA templates. These limits as regards what can be negotiated help customers to reduce the search space for their offers. The search space may otherwise be so large that the problem becomes practically infeasible from a computational point of view. Optimization of subsequent SLAs is one more requirement for the translation process. Actually, optimization is part of translation, as it af- fects the final output. The system mentioned above will typically accept plenty of different solutions, and it is up to optimization to select the best of all those candidates. As many properties are simultaneously dealt with, this becomes a multi-criteria optimization problem [38]. Using the example from Figure 5.2, the following relations apply: p1 = f(q1,q3) p2 = g(q3, r1) p3 = h(q4, r2) 42 Chapter 5. SLA Hierarchies If an incoming SLA request/offer for service Sp refers to all three prop- erties, an SLA negotiation mechanism which includes translation func- tions should first find out the dependencies of these three properties based on the PDG. It is assumed, as also mentioned earlier, that these dependencies are known as domain-specific expertise encoded into the system. For instance, if the property is “availability”, it will typically de- pend on the availability of all antecedents; this is fairly straightforward to assume. However, if cost is examined, it is not necessary that the cost of invoking service Sp is related specifically to cost properties of services Sq and Sr. On the contrary, it may be the case that there are no cost prop- erties for these two services, but rather that their providers only apply flat-rate pricing. In this case, the cost of Sp may rely on properties such as Quality of Service characteristics of Sq and Sr. The next action is to find acceptable value spaces (“domains”) for all of (q1,q3,q4, r1, r2), leading to a feasible solution. “Acceptable” are val- ues which remain within constraints set inside the templates of the two lower-level services, and on the same time satisfy the requested values in the offer for the higher-level SLA. The last step during the translation process, would be to come up with a sufficiently good solution to the problem of selecting SLA parameters for antecedents, according to an optimization algorithm. The latter typ- ically takes into account all applicable business objectives. An obvious such objective is to maximize profit, nevertheless others are often taken into account as well; for instance, reputation. Associating reputation with (future) profit may be impossible without large historical data sets and very complete, complex models. The same may apply for other possible business objectives. Thus, there may be no one single objective for the optimization, and no single optimal solution. Rather, a multi-criteria ap- proach should be adopted. Without loss of generality, it is here assumed that there may be N applicable criteria, N " 1. In this case, one of the solutions on the Pareto front should be chosen. The Pareto front is a set of all those solutions that are considered to be optimal in multi-criteria op- timization. More formally, they are solutions where none of the included criteria can be improved (accept a better value), without some other cri- teria in the same solution receiving a worse value. Figure 5.3 shows how this translation process can be implemented ac- cording to the previous discussion. Counter-offers are assumed to be pos- sible, as part of the negotiation protocol being used. An entity negotiating over a set of variables may find that small modifications to the negotiat- ing party’s requirements may increase significantly the resulting utility. In this case, it may just as well modify the proposed term slightly, and return 43 Chapter 5. SLA Hierarchies SLA Translation Incoming SLA offer Extract properties and target values Search PDG for properties in SLA Find antecedent properties Limit search space according to antecedents' templates Search and select optimal value Negotiate with antecedents Success? Accept incoming SLA offer No Yes Timeout? No Yes Reject incoming SLA offer Counter-offer preferrable? Construct counter-offer Yes No Reply with counter-offer Figure 5.3: Generic flowchart for translation/negotiation of SLAs 44 Chapter 5. SLA Hierarchies a counter-offer which does not match the other entity’s requirements, but may provide much better results if accepted. Such negotiation-time risk-taking attitudes can be modeled with game theory methods, and in- deed there has been important research towards this direction in the past (e.g. [39, 40]) – this topic is, however, outside the scope of the work pre- sented here. The reason to interleave translation and negotiation in this flowchart, is exactly that they take place in parallel and depend on each other. Ne- gotiation feeds translation with confirmations whether the translation’s results are acceptable; and translation provides negotiation with input according to existing domain-specific knowledge, and any further intelli- gence implemented within the system. It should be clarified, at this point, that the negotiation process and all kinds of decision making remain sep- arate here; that is to say, negotiation concerns only the exchange of messages in such a way that implements correctly a specific protocol, while all planning functionality lives elsewhere and is, in this case, part of the translation process. This is a distinction applied also in the man- agement architecture, as described in Chapter 6. Regarding planning it- self, it is anticipated that coping with multiple service level attributes and n-dimensional Pareto fronts might become difficult. In order to be able to implement an efficient and automated SLA translation and negotia- tion framework, use-case dependent heuristic functions will be required to narrow the search space down to a feasible size. Scalarization of the multiple objectives into a single one, may also be needed to achieve au- tomation. One last thing to mention here, is that the prior discussion tackles the problem of producing lower-level SLAs starting from a higher-level one, in a top-down approach. A different scenario is that of a bottom-up approach, where the lower-level SLAs are already established and used to decide whether an incoming higher-level offer can be satisfied. The essentials of the previous discussion on producing and using PDGs for SLA translation can be applied just as well on such a bottom-up process, as they are modeling service relationships without being tied to a specific usage scenario. 5.2 Related Work Although not directly related to SLAs, there is prior art that concerns rela- tionships between different services and/or resources. In [41], Keller et al. describe a Functional and a Structural dependency model, on which they base their work. They provide a specific classification of dependencies, 45 Chapter 5. SLA Hierarchies into a number of dimensions such as locality, time, type, dependency strength and criticality, etc. Then, building on the two models, they dis- cuss a method for dependency analysis. In [42] the same authors append the two models with an Operational one, and further refine a proposed architecture for managing dependencies. The main differences with the present work is that they constrain theirs in the ICT domain where they also provide an architecture to tackle the dependency discovery problem. Additionally, they do not consider SLA management. In [43], Hasselmeyer discusses the problem in a generic manner, much like this Chapter. However, the work is tied to software services; more specifically, it proposes a scheme where services declare their dependen- cies on other services based on access patterns (how often are services accessed, in what sequence, etc) with dynamic service selection in mind. Additionally, it looks specifically at functional dependencies. Contrary to that, dependency management in the context of SLAs requires a generic view that handles non-functional dependencies as well. Bahl et al., in [44], are looking at the same problem as regards net- work services. The authors describe a model called Inference Graph that represents the dependencies. Based on that, they present an algorithm for inferring probabilistically the malfunctioning components of the net- work, given real-world observations. The main difference with this work is that the Inference Graph describes dependencies based on service states, and probabilities that the state of one depends on the state of an- other; additionally, the referenced work is explicitly addressing network services. In [45], Di Nitto et al. introduce an agent-based automated SLA nego- tiation architecture and discuss efficient search-based algorithms to de- termine an acceptable Service Level Agreement in a multi-agent environ- ment. The main difference to the present work is the complex and layered multi-tier services landscape that requires additional analysis steps and related data-structures for SLA negotiation. In addition, SLA translation is considered here as an essential part of the SLA negotiation process. Although some of the search-based algorithms analyzed in [45] would be able to cope with multiple objectives that negotiation agents may have, it is not the primary concern of the authors’ work. Summary and Conclusions This chapter’s main contribution is a formalization of SLA Hierarchies, and the SLA Translation problem. That is, how to deduce an SLA for an- tecedents when the SLA for dependents is known, and the way that the 46 Chapter 5. SLA Hierarchies latter depend on the former. This computationally difficult exercise can be carried out as an optimization problem, when the properties are differ- ent. When the properties are the same (for instance, the number of CPU cores in an IaaS outsourcing request), it is straightforward to go from the first SLA to the second. The upcoming chapter will present a novel architecture that covers SLA management during its complete life cycle. Alongside components for tasks such as negotiation, provisioning, monitoring, and others, it con- tains a central component for decision-making (termed “Planning / Opti- mization / Adjustment”) that is expected to perform the task of SLA trans- lation when required by the specific use case. 47 48 Chapter 6 Management Architecture Managing automated SLAs implies the need for one or more software artifacts that can handle the various stages in the life cycle of an SLA. This Chapter presents a novel architecture for the complete SLA life cy- cle management, catering for SLA hierarchies. It has been designed with extensibility, domain-independence, and technology-independence in mind. The respective negotiation and runtime scenarios are discussed, followed by technical details about a reference implementation. 6.1 Design In the past there has been significant effort committed towards imple- menting different architectures and methodologies for negotiating SLAs, and managing them during their life cycle. Prior art has been concerned with both automated SLA establishment (typically using agent technol- ogy), and manual establishment steered by humans. However, existing research results are tied to specific assumptions/use cases, and often specific technologies. In this chapter, a generic architecture is proposed, which can be used across different domains and use cases for manag- ing SLAs. This architecture covers all phases of the SLA management life cycle, from negotiation and establishment to termination, also with attention to the existence of SLA inter-dependencies and hierarchies. The major requirement from this design is that it can be adapted to different use cases, ideally any scenario whatsoever that demands the establishment and management of SLAs. “Management” refers to provi- sioning, monitoring, adjusting and terminating SLAs. Therefore, the de- sign has to be adaptable, bear no ties to specific technologies, neither any ties to specific application contexts. In parallel, there must be separation of concerns between service-plane and SLA-plane activities, so that it can 49 Chapter 6. Management Architecture be realistically applied onto existing service infrastructures. Additionally, the design needs to advance the state-of-the-art, by considering hierar- chies of inter-dependent SLAs as a result of similar (hierarchical) service compositions. The SLA Management Instance (SAMI) is designed to operate on a level parallel with that of the services, so that is not disruptive to ex- isting service infrastructures. Figure 6.1 shows in an abstract manner the relationship between SAMI and service instances, in the context of the example from Section 2.2. It is reminded that this example involves an end-customer, a geo-location service offered by a SaaS provider, and the infrastructure offered by an IaaS provider. Management of service in- stances by the respective SAMIs refers to provisioning and adjustment of those instances. The end-customer invokes the geo-location service ac- cording to s-SLA, which has been established earlier following negotiation between the end-customer and the geo-location SAMI. The (geo-location) software service executes on the infrastructure provisioned according to i-SLA, established earlier between the geo-location service provider and the infrastructure provider. It should be emphasized, at this point, that this is a simple scenario to be used for illustrative purposes. In a full- fledged use case there may be many more inter-dependent services and SAMIs, in multiple layers. End- Customer SAMI Geo- location SAMI IaaS SAMI s - S L A i - S L A Infrastructure Geo- location service Service consumer S e r v i c e i n v o c a t i o n s Provision/ Adapt Monitoring information (Re-)Negotiate (Re-)Negotiate Provision/ Adapt Monitoring information Figure 6.1: SAMI and service instances 50 Chapter 6. Management Architecture It is assumed that the SAMI is autonomous and does not interact with the services in any way other than a) provisioning/adjusting them, and b) receiving monitoring information. Cardinality of services and the SAMI is not predefined either. The architecture is flexible to accommodate dif- ferent numbers of services. As such, feasible scenarios include a single SAMI for the whole provider’s domain, or one SAMI for some services, or even one per service. Figure 6.2 illustrates the SAMI architecture. Planning / Optimisation / Adjustment Negotiation Negotiation Offers Offers Accept / Reject, Counter-offers Accept / Reject, Counter-offers, Templates Forecasting Recom- mendations Provisioning Monitoring Execution Plan Historical information, State information, Events Dependencies Management Service Instance Pro visi on / A dap t SAMI Advertisements Legacy Monitoring Infrastructure Monitoring Data Monitoring Data Template requests, Negotiation templates, Agreement offers Template requests, Negotiation templates, Agreement offers Figure 6.2: The SLA Management Instance (SAMI) 6.1.1 Template Advertisements Interactions between different SAMI instances may be either based on broadcasting, through the Advertisements interface, or point to point, through the Negotiation interface. This depends on the type of mes- sage exchanged, and the negotiation mechanism. It should be noted that agreement offers may come either by a service customer, or by a service provider. A Publish/Subscribe System (PSS) is being used for broadcasting ad- vertisements to groups of interested parties. The advertisements are SLA templates, published either from customers for purposes of procurement, or from providers informing prospective customers about their service of- ferings. In the case of procurements, a customer broadcasts a template describing the expected service: what it involves from a functional point 51 Chapter 6. Management Architecture of view, what non-functional properties are supported, what can be cus- tomized and within what limits, etc. Following, providers interested in signing a contract with such terms can customize the template and return it (either as a template again, or as a final offer) in a point-to-point fash- ion through the customer’s Negotiation interface. In the case of providers that advertise their services, templates are broadcasted as soon as they become available. Then, interested prospective customers can use them to create a modified template for further negotiation, or an agreement offer and submit it through the provider’s Negotiation interface. In both cases, the published template should include an endpoint to the publish- ing SAMI’s negotiation interface, so that interested parties can contact it to submit offers. 6.1.2 Negotiation Interface The Negotiation interface is the endpoint formally used for point-to-point negotiations. It provides a list of available templates for existing service offerings in the specific SAMI’s domain of responsibility. This template list may be complete, or customized based on a properly formatted query. Having received and customized a certain template, it is then possible to submit a modified template for further negotiation, or directly an agree- ment offer. It must be noted, that the second (lower) negotiation interface in Figure 6.2 is there only for purposes of intuitive illustration. In reality, it is sufficient to have a single negotiation interface and implementation. However, this second instance is included in the figure to explicitly in- dicate the chaining of SAMIs, in the exact same way that services are composed into other services. 6.1.3 Planning and Optimization It is assumed that the negotiation interface will be able to support dif- ferent protocols and languages (e.g., WS-Agreement, WSLA), however a single internal model will have to be used for the management of SLAs; a proposal for such a model is discussed later, in Chapter 8. When a template or an offer is received, it is interpreted into this internal model and forwarded to the entity that implements Planning, Optimization and Adjustment (POA). This entity represents the central decision system for each specific SAMI, which has the domain-specific knowledge on how to implement SLAs in an optimal or near-optimal manner, and how to en- force them. The decisions of the POA module are related to the existing available resources, and also to the external services that may be needed in order to implement an SLA. Hence, if subcontracts are needed to sat- 52 Chapter 6. Management Architecture isfy the requested functionality and quality objectives, respective offers are created on the fly and forwarded to subcontractors’ SAMI instances in consecutive negotiation cycles. A SAMI knows other existing SAMIs through the templates that the latter have published with their advertise- ments. These templates can therefore also be stored by receiving SAMIs, and used to be aware of existing service offerings. It is this point where the translation process discussed in Section 5.1.2 may be required, to convert from the incoming offer to the outgoing offers towards third par- ties (external to the specific SAMI). An example of this planning process based on resource availability, and without outsourcing, is the approach discussed in Chapter 10. Chap- ter 11 provides an example of a planning process that does indeed out- source to other SAMIs, as it has no resources of its own. 6.1.4 Monitoring Infrastructure and Forecasting Before combining different candidate values for service parameters in- cluded in subcontract offers, one has to calculate those candidates first. For this purpose, the POA module depends on two other modules: Moni- toring, and Forecasting. Monitoring will typically be available, as in most cases it is not sensible to establish an SLA without the capability to mon- itor it1. Thus, it is expected that historical monitoring information will, in general, be available. Using monitoring data sets, POA is able to decide which external services should be used based on their previous perfor- mance and therefore their reputation. Forecasting may be available or not, depending on the use case at hand. In this case, there may be an existing forecasting mechanism used for specific services, to estimate future resource availability, or future performance given predefined con- ditions. Such mechanisms the SAMI should be able to reuse if they exist, also to make the transition to this new management structure easier. 6.1.5 Provisioning Following the planning and optimization process, an SLA must be provi- sioned. This includes preparatory phases that may exist, plus the actual service instance deployment. In this case, the POA module will have to feed an execution plan towards the Provisioning component. The latter will also have to take into account information about dependencies of 1In a high-risk setup or if dictated by other business policies, it may be the case that a provider establishes an agreement without monitoring it – but rather, based on some rough statistics about the overall system and its performance. 53 Chapter 6. Management Architecture services, for purposes of configuration and resource allocation. The Pro- visioning subsystem will orchestrate all those actions that are not directly related to Planning and Optimization, but must take place in order to im- plement a service. That includes, for example, configuration of services according to specific languages and notations, or accounting actions that must come before and after service deployment. Eventually, the Pro- visioning component will take all necessary actions to enable a service according to the SLA and the consequent execution plan produced by the POA subsystem. 6.1.6 Monitoring and SLA Adjustment When the service is deployed and enabled, and after consumption starts, monitoring information will have to be collected. As said, there may be an existing (legacy) infrastructure for service monitoring. That will have to be reused, so that service operations are not disrupted and the SAMI platform can be easily applied over existing services. If such a monitor- ing infrastructure does not already exist, it will have to be created, and eventually monitoring events will need to be sent to the SAMI’s Monitor- ing component (or the SAMI should be configured to retrieve such data if needed). First-level processing of raw events takes place there, so as not to flood the POA module with useless information. When a violation of existing SLAs occurs, or is about to occur according to Monitoring’s pro- jections, the respective event is sent to POA alongside information that can explain in more detail what went wrong. The POA subsystem may also need to receive current state information about various aspects of the service; that information will also be received from the Monitoring component. Based on current state of affairs and Monitoring’s assump- tions, the Adjustment part of POA will make a decision on the course of action to take. That may include an adaptation of services through re- configuration, scaling, etc, or renegotiation. The latter may be directed either towards customers, or towards third parties, depending on whether the (near-)violation is also due to failure of subcontracts. Business concepts built into POA are used also for negotiation, and their effect is equally profound during renegotiation. Penalties will have to be taken into account during the latter, as well as accounting for the service execution time and consumption up to the moment that rene- gotiation starts. Signaling of renegotiation takes place by submitting a template, like in negotiation, however it must include a reference to the SLA that is already effective. Whether renegotiation itself is acceptable at all, is something that may be a term in the initial agreement. Even if it is acceptable, the affected party may wish to discard any renegotiation 54 Chapter 6. Management Architecture requests, in which case the violating party will have to pay penalties as defined in the initial agreement. 6.2 Operations It has already been mentioned that a major design goal for the SAMI was to ensure that operations of existing services would not be affected. This practically means that, ideally, the services will not know anything about the SAMI that manages them, and will not initiate any interaction with it. That said, it is assumed that the processes for interacting with customers, making business decisions and deploying services will have to change as a result of applying the SAMI architecture and its principles within existing service providers. Looking again at the example from Figure 6.1, in a typical scenario the end-customer would first discover templates describing the geo-location service, indicating which parameters can be negotiated, and providing default values for those. The customer can then form one or more agree- ment offers based on the templates, or modify some parameters to its likings, and send them back for negotiation. Price could be one such parameter, or left empty as a way to ask for a quote. Figure 6.3 illus- trates such an example with a price request, given a throughput of 100 hits/minute and availability higher than 98%. Similarly, the provider could then make small modifications to this document and return it, or could form an agreement offer based on it and send it to the customer. On each round of this negotiation, the SAMIs representing the customer and the SaaS provider would have to evaluate templates, modify them in an ef- fort to optimize the resulting utility, and make their final offer (or submit a new template for negotiation). As mentioned previously, it has been proven that this process may take prohibitively long to finish [24], but the practical implications are limited: any business entity can set a timeout, after which it will terminate any negotiation that does not bear results. From the SaaS provider’s side, before it submits an offer or agrees to an offer, it needs to know that it can successfully support it. Therefore, an offer for s-SLA will have to be translated to an offer for i-SLA, according to the domain-specific knowledge implemented within the SaaS SAMI’s POA subsystem. Similarly to the case between the end customer and the SaaS SAMIs, there would be a negotiation taking place between the SaaS and the IaaS providers’ SAMIs. As soon as i-SLA is established, the offer for s-SLA could be accepted (or submitted). Nevertheless, this de- sign does not exclude riskier strategies, without consecutive (lower-layer) SLAs. Figure 6.4 shows a high-level view of the negotiation process, in 55 Chapter 6. Management Architecture Template Context Terms Customer Provider Service Price Throughput Availability Offer Context Terms Company A Company B ERP > 100 h/m > 98% Figure 6.3: A simple Template and Offer example the form of a UML sequence diagram. A loop of negotiation messages is typically followed by a final, binding offer; and binding offers are ide- ally synchronized, so that cancellation fees are avoided if a party retracts from negotiations on some layer. This is evident in Figure 6.4, where a failure of step 2.1 leads to failure of step 2. The author believes that a modeling of the POA system internals that applies to all imaginable use cases is probably impossible. Given the fact that different domains require different vocabularies for the guarantees of the applicable SLAs, and the very distinct utility functions that may be used even by different providers in the same domain/use case, it is only possible to model the interface with generic constructs for assessing SLAs and intervening to the planning/optimization process. Furthermore, it has been proven that no algorithm has a universal advantage over other al- gorithms, for the set of all possible optimization problems [46, 47]. As such, under the assumption that the Negotiation module looks after cor- rect message exchange, the POA subsystem only needs a method to eval- uate incoming offers (either during negotiation or final, binding offers), and a method to interrupt ongoing evaluations. At the time of SLA enactment, provisioning of i-SLA (i.e. provisioning of the infrastructure resources) must take place before deployment of the software, and the latter should also start before the time that the customer expects to find the service available for invocations. Therefore, this time difference should be taken into account during negotiation and 56 Chapter 6. Management Architecture Negotiation 2010/07/22 JUDE(Free Version) Negotiationsd Customer SaaS IaaS 1: Negotiate() loop loop 2: createAgreement() 2.1: createAgreement() 1.1: Negotiate() Figure 6.4: High-level overview of the negotiation process construction of offers. An alternative option is that the customer herself explicitly requests the provisioning of the service when the time comes; and receives confirmation when this happens, after the provisioning of all antecedent services. Following the provisioning phase according to established SLAs, the IaaS provider’s SAMI receives events from the infrastructure monitors, and correlates them to the existing SLAs. If some part of the infrastruc- ture fails, it will try to provision new infrastructure resources as soon as possible. If that is not possible, it will have to start a renegotiation, tak- ing into account any penalties that may apply for doing so. Similarly, the SaaS provider SAMI needs to monitor software service activity. If s-SLA fails, the reason for that should be traced: It could be a customer invok- ing the service more often than agreed, or it might be software failure, or infrastructure failure (i.e. violation of i-SLA which was not mitigated by IaaS already). Again, the provider needs to solve the problem by either reprovisioning the software onto existing resources, or renegotiating with the IaaS provider, or eventually, renegotiating with the end-customer if all else fails. 57 Chapter 6. Management Architecture 6.3 Architecture Applicability As a proof of concept for the generic application of the SAMI architecture, an additional example from a very different area is used, which includes non-technical services as well. As shown in Figure 6.5, there is a meeting organizer who wishes to arrange hotels and rental cars for the attendees. In such a case, the SAMI would be essentially a system to help decision- making and indicate favorable options. Negotiation would be driven by humans, but the SAMI’s negotiation interface/mechanisms could be used. Also, the SAMI’s planning and optimization mechanism would monitor ex- isting resources, know their semantics and suggest how to proceed with a negotiation. As soon as the latter concluded, it would book an account manager’s agenda, reserve rooms, or reserve cars depending on the ser- vice at hand. Monitoring would be implemented through possible com- plaints by attendees; the account manager would receive them, and ei- ther forward them to the hotel/car rental, try to renegotiate, or try to find a different subcontractor. PS S T ravel agent S AMI Negotiation: Price, number of rooms, extras Meeting organiz er Negotiation: Price, number of guests Hotel S AMI Rent-a-car S AMI Negotiation: Price, number of cars Bundles templates Rooms/Cars templates Rooms templates Cars templates Bundles templates Provisions Account manager ProvisionsRooms P r o v i s i o n s Cars Figure 6.5: Meeting organization outsourcing This example illustrates the generic applicability of the SAMI architec- ture, even in domains completely unrelated to IT. Concerns remain sep- 58 Chapter 6. Management Architecture arated throughout the SLA life cycle, and the domain-agnostic approach followed facilitates using the same concepts and architecture in very di- verse scenarios. 6.4 Framework Implementation The nature of the SAMI and the requirement to make it applicable in a wide spectrum of very different use cases, enforce the need to implement it as a reusable framework. The SAMI has been implemented as part of the SLA@SOI project, based on technologies that allow domain-specific methodologies to be inserted in the form of plugins, interacting via the SAMI kernel. Thanks to this plugin-based pattern, new implementations of the various components can be easily added or even replaced at runtime. A plugin in SAMI is implementing one of the components of the archi- tecture, for example, the POA module. As such, a plugin provides a spe- cific set of functionalities, within the context of SLA management, based on pre-specified interfaces. The SAMI kernel provides the infrastructure required to support the activation, operation and interaction of each of its components. This infrastructure is based on the SpringDM [48] applica- tion framework for the OSGi [49] technology, in which a plugin is known as bundle. The key feature of this approach is the high degree of flexibility pro- vided for dynamic behavior, and customizable system deployment and reconfiguration. Each SAMI implementation can customize or reuse com- ponents, integrate new ones or replace others (even at runtime) with min- imal effort. Some of the main features of the plugin-based SAMI architec- ture are as follows: • Simplicity: The life cycle of each plugin or bundle is handled by the SAMI kernel and supported by an OSGi container; • Adaptive behavior: Given different mechanisms and implementa- tions for different system conditions, it is straightforward to replace one with another when the administrator sees that fit, or automati- cally according to predefined rules and policies; • Easy deployment: SAMI is able to run under any implementation of an OSGi container. To the best of the author’s knowledge, the most complete implementation of the latest OSGi specification (R4) is Equinox [50]; it is therefore the one on which SAMI is being cur- rently tested; 59 Chapter 6. Management Architecture • Versioning: Due to the full support for replacement of the various components, when a new version of a component is available it is straightforward to exchange versions and upgrade to the latest one. Nevertheless, rollback is similarly easy if problems appear. • Reusability: The plugin-based architecture also allows the integra- tion of third-party libraries inside each bundle. This encapsulation feature improves code reusability within SAMI implementations. For example, if applied to a High-Performance Computing (HPC) do- main, one can wrap an existing scheduler with the interfaces fore- seen by the POA module, and therefore reuse the scheduler in this new context of SLA planning. Currently, the plugins implemented or under implementation in the context of the SLA@SOI Project are the following: • Negotiation: The Negotiation module includes a Syntax Converter, that converts from on-the-wire representations (e.g. WS-Agreement and other XML renderings) to internal SLA models, and vice-versa; a Protocol Engine that takes care of protocol-related formalisms and enforces the respective consistency; a Publish/Subscribe client that advertises templates and receives advertisements; a Template Reg- istry that stores templates and can be queried using templates as input; and an SLA Registry that holds established SLAs. • POA: The POA module is custom per domain and/or use case. One implementation already completed is for infrastructure services, par- tially using the ideas from Chapters 10 and 11. An additional im- plementation for a demo application with software services is also underway. • Provisioning and Monitoring: The two components have been im- plemented as one in the project, and comprise two different in- stances; one for infrastructure services, and one for the aforemen- tioned demonstrator. At the moment of writing, integration tests are taking place, also con- verging with parallel ongoing work on the topic of software service pre- diction services (“Forecasting”) and Dependency Management input. The latter comes in the form of queries to a Service Manager implemented within the project. The SAMI implementation as described so far is a service itself, with its functionality concerning SLA management in complete. Nevertheless, a different implementation approach is to enable the components of the 60 Chapter 6. Management Architecture SAMI architecture as distinct services. It could, therefore, be possible to have services to negotiate, others to plan, to monitor, etc, allowing even further flexibility. 6.5 Related Work In [51], Ludwig et al. present an architecture for the management of SLAs during negotiation and runtime. Their work is specifically targeting the WS-Agreement standard, and does not discuss renegotiation of SLAs or SLA hierarchy management. The latter is neither discussed in [52] by Padgett et al., which is furthermore tightly coupled to Grid Computing. In [53], Koumoutsos et al. present an abstract architecture for managing SLAs through their complete life cycle, but no information is provided with regards to the interactions of the various components for achieving the purpose, while additionally they concentrate on information management and SLA modelling. In [54], Keller et al. present the Web Service Level Agreement (WSLA) language for defining SLAs, alongside an architecture for monitoring them. Focusing on the architecture part, it only concerns the runtime, without reference to negotiation-time mechanisms. The same limitation to run- time management appears in [55], where an architecture based on the Common Information Model (CIM) [56] is presented by Debusmann and Keller. Apart from being limited to monitoring of SLAs, this work is also tightly coupled to specific technologies (CIM and WSLA). In [57], Bartolini et al. elaborate on an SLA management framework, which is very com- plete from a runtime perspective, but does not touch on the issue of nego- tiating and re-negotiating SLAs. The case is similar for [58] from Paschke et al., which applies extensively knowledge management techniques into the domain of SLA management. In [59], Buco et al. are also discussing SLA management after establishment, assuming that to be a manual pro- cess. In [60], Chhetri et al. present an SLA negotiation architecture. Al- though they include a “SLA life cycle management” component, they do not discuss how it is used. In [61], Bartolini et al. present an abstract framework like the one presented in this thesis (including, additionally, negotiation rules), but it is also limited to negotiation time only. Similarly, in [45] Di Nitto et al. focus on negotiation, especially on protocol man- agement and assuming a marketplace approach. In [62] Kim and Segev discuss SLA negotiation only, as part of an abstract framework, which is similar to the present one, in the sense that it also separates concerns between protocol management and decision making. Finally, in [63], 61 Chapter 6. Management Architecture Ayatollahzadeh-Shirazi and Barfouroush present a conceptual framework for negotiating SLAs using generic negotiation principles, but do not dis- cuss runtime management either. Also close to the work discussed here, is [64] by Dan, Ludwig and Paci- fici. The authors present an architecture that includes all those constructs for interacting with customers, provisioning, monitoring and adjusting SLAs. Nevertheless, multi-level (hierarchical) SLAs are not addressed. In addition, their work is described around WSLA and web services in spe- cific, while the SAMI is technology-independent. Overall, the main novelty introduced by the SAMI architecture can be summarized as follows: • It is technology-independent: No reliance on web services or other specific technology. • It brings specific support for SLA hierarchies and chained negotia- tion, which is necessary for service chains / hierarchies. In addition, these features are unified with capabilities that were found in prior art, albeit fragmented: • It covers the complete life cycle of SLAs, including negotiation and execution (monitoring / enforcement). • It can be customized to different domains; specifically, it innovates by separating protocol-related negotiation functionality from utility and strategy concepts; and by foreseeing a pluggable mechanism for the support of different planning mechanisms. Summary and Conclusions SAMI, the SLA Management Instance, was presented in this chapter. SAMI is an architectural design for achieving SLA management, throughout the complete life cycle of SLAs – most notably including negotiation and ex- ecution. SAMI’s essential principles are extensibility, as well as indepen- dence from specific domains and technologies. SLA hierarchies are specif- ically taken into account, by means of SLA template advertisements, ser- vice discovery and composition, SLA Translation interlaced with SLA Ne- gotiation, and hierarchical SLA Provisioning. In the next chapter, a novel penalty model is formalized and pre- sented. Such a penalty model describes what are the costs that a provider must pay, if an SLA is not honored. It is a necessary consideration both during negotiation (for the Planning and Optimization), but also during 62 Chapter 6. Management Architecture runtime (for deciding the most appropriate course of action when an SLA is violated or about to be violated). 63 64 Chapter 7 Penalty Model One important differentiation of SLAs from best-effort service provision- ing requests is the annotation with penalties. That is, all those provisions that define what will happen in the case that a provider fails to deliver the service, as agreed. The consequences of such a failure may be some kind of refunding, additional (free) service points, etc. The present chapter explores this topic, and proposes a new formal- ization for penalty definition. This formal model is taking into account requirements for fairness and business value. Following the model’s defi- nition, an example is provided that links the model to SLA hierarchies. 7.1 Penalties and Service Level Agreements Penalties are as an essential part of SLAs as the description of guaran- tees. The reason for that is the very concept of using SLAs as an instru- ment to provide some level of determinism in business relations. As such, SLAs also describe what must happen when something goes wrong and the SLA cannot be honored (i.e. it is violated). This section of a Service Level Agreement, describing the penalties, is typically of concern to busi- ness and legal departments of companies. Penalties requested by customers during negotiation indicate the im- portance of the SLA’s compliance, for these customers. In a similar way, the penalties acceptable by the service provider indicate a risk strategy; namely, how far a provider is willing to go to make the customer feel safe, while in the same time reducing the risk for its business. In addi- tion, it may be the case that some penalty is defined for customers who do not respect certain obligations they may have according to the SLA; for instance, exceeding the agreed invocation rate may lead to additional costs. 65 Chapter 7. Penalty Model Penalty fairness is important to keep business relationships stable, and to preserve SLAs as a useful and meaningful instrument to define such business relationships. Fairness refers to reasonable and propor- tional royalties returned when the SLA is violated. This kind of propor- tionality concerns the interests of both the customer and the provider. As mentioned earlier, the business value of the SLA to the customer is re- flected, ideally, in the penalties; while the provider usually does not wish to risk its complete business over a single contract. At the same time, penalties should reflect how far from the agreed QoS level an SLA has drifted, to achieve proportionality. By way of an example, one may con- sider an SLA where 95% of invocations of some operation are guaranteed to complete within 5 seconds. If 94.9% did so during the accounting pe- riod, the SLA is violated – but the penalty will typically be much smaller than if only 80% of the invocations completed within 5 seconds. Currently, there are various ways to express penalties, some simpler (e.g. flat rate for the whole SLA; or linear proportionality per guarantee) and some more complex – such as the ones discussed in Section 7.4. The various approaches, however, do not satisfy all of the following require- ments for formulating complex penalty expressions, in a single unam- biguous model: • Capability to describe associative penalties, where the penalty for failing one guarantee of the SLA depends on the state (failed or satisfied) of some other guarantee(s); • Full flexibility as regards the QoS levels agreed and/or achieved, without constraints such as pre-specified classes of service; • Openness and applicability to different domains, without depen- dence of the model on specific languages, taxonomies or technolo- gies. In this chapter, a formal model of defining penalties within an SLA is provided, taking into account the aforementioned requirements as part of the contract negotiation. The model must be flexible and adaptable to different scenarios, and expressive enough so that complex expressions can be accommodated – e.g. penalties for combinations of different guar- antees being violated. In addition, an example of this penalty model is given, applied to the geocoding service scenario from Section 2.2. 66 Chapter 7. Penalty Model 7.2 Penalty Model Let assume service S, and a Service Level Agreement that governs the consumption of this service by a certain customer. Also, that the total cost for the consumption of the service under this SLA is C, and that the agreed Quality of Service is given as a set of guarantees (Q1,Q2, ...,Qn) for the various supported quality metrics and properties. Then, a set of penalty functions is defined as Pm(Q1m , ...,QNm) = C · PWm · ∑ k QWk · FRk (7.1) where m > 0 and 1m ! k ! Nm. Q1m , ...,QNm represents a combination of guarantees that are de- pending on each other, and the violation of one may affect, under specific circumstances, the others. It is a way to express correlations of fine gran- ularity. Thus, a customer can express statements such as that a violation of one guarantee is more relevant if another guarantee is also violated. Pm is a penalty function that corresponds to the above combination Q1m , ...,QNm of guarantees. The sum of all penalty functions during one reporting period represents the total penalties for this SLA, during that period. PWm is the weight of penalty function Pm. It indicates how important is this function to the total calculated penalty; and is possibly a contribut- ing factor for the service provider to make decisions with regards to the deployment and implementation of the SLA. The sum of all weights is equal to 1. QWk is the weight of one specific guarantee being violated, for this specific combination of guarantees. This value may be arbitrarily high. It offers the capability to the customer, when negotiating, to express the importance of honoring certain guarantees in this penalty function. As an example, one may consider the case where the guarantees concern the availabilities of two load-balancing servers. If the availability guarantee for one of the servers is violated, its weight (and hence, the penalty) is kept small. If, however, on the same time the availability guarantee of the other server is also violated – i.e. the second server becomes unavailable as well –, there may be a very high weight to suggest an equally high penalty for the system becoming unavailable as a whole. Finally, FRk is the failure ratio, i.e. the achieved quality vs the planned quality. It indicates how far the offered quality has drifted, when com- pared to the quality that was agreed to be offered for that specific ser- vice parameter. For instance, if availability of a service was agreed to be 100%, but only 90% is achieved, then the failure ratio is 0.1; and if a 67 Chapter 7. Penalty Model response time was agreed to be 5 seconds on average, but 6 seconds is the average response time achieved, then the failure ratio is 0.2. By def- inition, FRk may also model possible rewards for performing better than agreed. Chapter 5 discussed the construction of subcontracts starting from a higher-level contract, in a top-down fashion. In parallel with inferring proper values for the service properties of the SLAs, suitable penalties must be deduced. Therefore, penalty calculation for subcontracts is also part of the SLA Translation process. Due to the domain-specificity of the way that properties depend on each other, it is not possible to provide generic analytical expressions of penalties for antecedent services / SLAs. Rather, the next Section illustrates how this would work, by means of an example. 7.3 Example Application This section provides an example of using the penalty model, based on the scenario from Section 2.2. It illustrates how the model is applied for the geocoding service GC, and how it is translated to penalties for an- tecedent services in this context. Let assume that the negotiable proper- ties for the geocoding service are: • Availability, defined as the ratio of service uptime to total monitor- ing time, over a certain predefined time period (e.g. one month); and • Response Time, defined as the time duration from the moment a message is received, to the moment a response is generated and put on the wire, for a specific percentage of all invocations. The customer wants to express that availability must be over 99%, and the penalty increases up to the full cost for availability dropping down to 90%. For response time, it is desired that 98% of invocations return in less than 5 seconds. The respective response time penalty increases linearly towards 85% of the calls; at that point, full SLA cost is claimed back. Network delays are not considered in this example. The customer expresses these penalty terms as follows: P(A) = C · 1 · ( 99 99− 90 · FRA ) (7.2) P(R) = C · 1 · ( 98 98− 85 · FRR ) (7.3) 68 Chapter 7. Penalty Model P(A), P(R), are the penalties for availability and response time. C is the total SLA price. For all penalty functions, a maximum penalty equal to the SLA price is requested (PWA = PWR = 1). Obviously, this may add up to more than the SLA price, should more than one parameters be “sufficiently violated”. It may well be the case that the provider does not accept this, but rather sets a maximum aggregate penalty; this is something to be negotiated between the two parties, and its expression is a syntactical matter which is out of scope for this work. The weight for each guarantee being violated is set such that each unit of additional failure from agreed target contributes linearly to a total failure at the agreed threshold. Thus, 99/(99−90) is the necessary num- ber to reach a value of 1 when multiplied with the threshold failure ratio of (99 − 90)/99 for availability (99% being the planned one, 90% being the actual). Similarly, 98/(98 − 85) for the response time threshold of 85% (down from 98%). The Planning/Optimization component of the geocoding service SAMI knows that the availability of the geocoding service, Ag, depends on the availability of the infrastructure on which the service executes. Its re- sponse time, R, depends on utilization of the infrastructure being suffi- ciently low. Equations 7.4 and 7.5 describe the availability and response time of the geocoding service in relation to the availability and utilization of the supporting VMs. Coefficients k and l are specific to the software used, and are known by the Planning/Optimization component via soft- ware modeling, monitoring, or other means. A constant c denotes the response time for near-zero utilization of the VMs. Based on these for- mulae, the geocoding service provider would deduce the required guar- anteed properties and construct an SLA offer towards the infrastructure provider. Ag = min(AVM1,AVM2) (7.4) R = k ·UVM1 + l ·UVM2 + c (7.5) AVM1,AVM2,UVM1,UVM2 ∈ [0, 1] (7.6) The policy of the geocoding provider is to ensure that full infrastruc- ture SLA costs will be refunded, if the infrastructure provider fails up to an extent such that the profit from the geocoding SLA starts being affected. That is; although availability of the geocoding service can become as low as 90% before a full refund is issued; and it relates to VM availability as shown in Equation 7.4; the geocoding provider will ask that it converges sooner to values that constitute “complete failure” to deliver. As such, the failure multipliers must be larger, to reflect this financial difference. 69 Chapter 7. Penalty Model The same applies for response time, and the penalties for infrastructure utilization levels. For the purposes of the example, a naive assumption is made that the geocoding provider has no implementation costs, and that the difference between software SLA price C and infrastructure SLA price C∗ is all profit. Eventually, this penalty can be formulated as in Equations 7.7 and 7.8. P(AVM1,AVM2) = C∗ · 1 · ( C C∗ · α · FRAVM1 + C C∗ · α · FRAVM2 ) (7.7) P(UVM1,UVM2) = C∗ · 1 · ( C C∗ · β · FRUVM1 + C C∗ · β · FRUVM2 ) (7.8) where α = 99/(99− 90) and β = 98/(98− 85). One can easily notice that these equations can be simplified as fol- lows: P(AVM1,AVM2) = C · α · (FRAVM1 + FRAVM2) (7.9) P(UVM1,UVM2) = C · β · (FRUVM1 + FRUVM2) (7.10) Taking into account Equations 7.2 and 7.3, it holds that: P(A) P(AVM1,AVM2) = FRA FRAVM1 + FRAVM2 (7.11) P(R) P(UVM1,UVM2) = FRR FRUVM1 + FRUVM2 (7.12) However, it must be underlined that this applies only because of the assumption that the geocoding service provider has no implementation costs other than the infrastructure for service execution. 7.4 Related Work In [65], Becker et al. propose a price function over achieved QoS. Sub- tracting from the agreed price provides the penalty, so it can be con- sidered to be the penalty function as well. Rewards are also possible, using their approach. The main difference with the approach presented in this Chapter, is that in [65] it is not possible to express penalty with more than one QoS properties involved, and these properties being cor- related. That is, quality is seen as a unidirectional aggregated measure for a service, while in fact there may be quality characteristics that cannot be aggregated without preference information and also without property inter-dependence information. 70 Chapter 7. Penalty Model In [66], Jurca and Faltings suggest a method to calculate penalties (after using a service and following an SLA) based on a reputation mech- anism, where all customers evaluate the quality of the service they re- ceived. The authors take measures to avoid false voting for price reduc- tion or other fraud. The approach, however, can only be combined with classes of service. In [67], Rana et al. discuss monitoring and reputation mechanisms for SLAs. The authors look into EU contract law and take some relevant points into account, then define three broad penalty categories: All-or- nothing, Partial, Weighted Partial. However, a complete mathematical model is not present. The work presented in this paper is then related to WS-Agreement, so the relevant negotiation concepts are also discussed. Finally, Kosinski et al. present in [68] a mathematical formulation of penalty functions which is fairly similar to what is presented here (al- though applied specifically in the area of networking). The paper does capture the relationships between different properties, and policies are defined depending on such relationships and the respective combina- tions. These policies are captured within “subcontracts” (the term is used differently than in this thesis; it refers to sections of a single SLA). The main difference to the present work is that in [68] failures are calculated using a pre-specified taxonomy (number of violations, amount of viola- tions, etc), while here the way to calculate the failure ratio is considered domain-specific and left open. In addition, the model from Section 7.2 as- signs a weight not only to each violation, but also to each penalty function (“subcontract”, using the terminology of [68]). Summary and Conclusions In this chapter, a novel, complete mathematical penalty model was intro- duced and formulated. Expressions compliant to this model can be inte- grated within SLAs, to define unambiguously the penalties that accom- pany SLA failures. The model is taking into account concepts of fairness, business value, and quality parameter interdependencies alike. Via an example, the application of the model to SLA hierarchies was demonstrated. It was shown that it can be applied using domain-specific (and perhaps use-case specific) knowledge, in order to build analytical penalty expressions at negotiation time, dynamically and according to top-down or bottom-up hierarchy construction. The chapter that follows presents a model for an in-memory SLA rep- resentation that can simplify highly complex SLAs, and contribute to mak- ing decisions during negotiation and subcontracting. 71 72 Chapter 8 System-Internal SLA Representation Machine-readable SLAs, being a projection of real-world contracts in the IT space, can be arbitrarily complex. They may include conditionals, re- quirements that must all apply in parallel, optional items that affect final pricing, penalties, rewards, and so on. Processing such complex docu- ments without prior knowledge of their structure (even if the syntax is well-defined and understood) may be a task of significant difficulty. More specifically, this task is equivalent to the graph isomorphism problem, which remains open as regards its complexity; its generalization, the sub- graph isomorphism problem, is known to be NP-Complete [69]. The present chapter proposes a methodology to reduce this complex- ity, by accepting that SLAs have special properties and can be modeled as boolean functions. Based on this proposition, a structure known as Reduced Ordered Binary Decision Diagrams can be used, to result in a canonical representation of SLAs, and allow their processing for reasons of planning and outsourcing. A canonical representation like this can fa- cilitate fast processing of SLAs and decision-making about them during negotiation. 8.1 Basic Concepts In this chapter it is proposed that a graph-based data structure, well- known in the domain of Computer Aided Design (CAD) for Very Large Scale Integrated (VLSI) circuits, is suitable for modeling SLAs in a way which is both expressive enough, and very efficient. R. Bryant introduced Reduced Ordered Binary Decision Diagrams (ROBBDs) in 1986 [70] as an evolution of C.Y. Lee’s [71] and S. Akers’ [72] work on BDDs. They are 73 Chapter 8. System-Internal SLA Representation classified as a tool in the area of symbolic model checking, the scientific discipline looking into the problem of verifying that a given system satis- fies specific requirements, given any kind of input. The hardware industry race has further contributed to the optimization of the structure itself with a significant amount of relevant research, and a large number of methods already exist for taking advantage of ROBDDs’ inherent properties. It is here reminded, from Section 3.1, that on an abstract level SLAs consist of facts, conditions and clauses; the latter two being here col- lectively referred to as rules, and comprise if-then-else structures. The essential reason that ROBDDs are useful for modeling SLAs, is that they are canonical representations generated on the grounds of such if-then- else structures. Thus, they can express SLAs unambiguously: equivalent SLAs which are structurally different, are eventually represented by the same ROBDD. On the contrary, using formats developed for on-the-wire representation (e.g. the various XML-based models that were mentioned in the introductory chapters) does not guarantee this property. It is herein proposed that ROBDDs are used internally in systems which have to man- age SLAs, as a representation that facilitates their management. Suitable interpreters should then be developed to convert from standardized, in- terchangeable formats such as WS-Agreement and WSLA, to this more convenient data structure and vice-versa. 8.2 Binary Decision Diagrams This section serves as a general, high-level introduction to BDDs and their basic properties. Motivated readers are encouraged to consult with the bibliography for in-depth material. Most of the definitions provided in this section, are summaries of the definitions that can be found in [73] by Ebendt et al. A BDD is a graph-based representation of one or more boolean func- tions. This kind of diagram is based on Shannon’s decomposition theo- rem [74], which states that, assuming a boolean function f : Xn → Xm, where Xn = {x1, ..., xn} and Xm = {x1, ..., xm}, then for any boolean vari- able xi, i ! 1 ! n: f = xi · fxi=1 + xi · fxi=0 (8.1) What Equation 8.1 provides, is the if-then-else representation sought for: If xi is true, then fxi=1 must be evaluated, or else fxi=0 must be evaluated. A BDD then, is a directed acyclic graph G = (V,E), where V denotes the vertices (nodes) and E the edges. Vertices can be either terminal (i.e. their out-degree is equal to zero), or non-terminal. The former can carry a value of either 1 (true) or 0 (false). The latter are 74 Chapter 8. System-Internal SLA Representation labelled with a variable xi ∈ Xn; if u is the node, the variable xi is referred to as var(u). Of the two children nodes, the one followed if xi evaluates to true is referred to as then(u), and the other as else(u). An illustrative example can be found in [73]. This example is shown in Figure 8.1(a), with a BDD representation of the boolean function f = x1 · x2 + x1 · x3. Solid lines are typically used for the edge between u and then(u), and dashed lines for the edge between u and else(u). Addition- ally, non-terminal nodes are denoted as circles, while terminal nodes as squares. Let pi be an ordering of the boolean variables involved in the function to represent. Then, the pair (pi,G) is the Ordered BDD (OBDD) represen- tation of the function, as long as (additionally to simple BDD definitions) it is true that on each path from the root to a terminal node the vari- ables are encountered at most once and in the same order. Looking into the previous example, Figure 8.1(b) is illustrating exactly this ordering of variables, and how it affects the diagram. A diagram with more than one roots (i.e., representing more than one boolean functions which depend on the same boolean variables) is a Shared BDD (SBDD). It must be noted that a root node here does not necessarily imply that the in-degree of this node is equal to zero. For a specific function within a BDD or a SBDD, a path is a subset of G which connects the root with a terminal node, with- out any duplicate occurrences of a node or an edge. The set of all paths for function f is herein denoted as Γf. 1 f x2 x3 x3 x2 x2 x3 x1 0 (a) 1 f x2 x3 x3 x3 x3 x2 x1 0 (b) Figure 8.1: Simple/Ordered BDD representations of f = x1 · x2 + x1 · x3 An ordered BDD can be reduced (resulting in a Reduced Ordered BDD), using the following two operations: 1. Deletion: If for a non-terminal node u of G holds that then(u) = 75 Chapter 8. System-Internal SLA Representation else(u) = u ′, the node can be removed from the graph. All edges pointing to it, if any, must now point to u ′, and if u was a root node, then u ′ must be upgraded to a root node. 2. Merging: If for two non-terminal nodes u and u ′ holds that var(u) = var(u ′), then(u) = then(u ′) and else(u) = else(u ′), then it is possi- ble to remove u and have all edges pointing to it redirected to point to u ′. Additionally, if u is a root node, then u ′ must be made into a root node. Remark: In the text that follows, the term BDD will be used uni- versally, to refer to Reduced Ordered BDDs. Also, there will be no dis- tinction between single-rooted and shared diagrams. Whenever single- rooted BDDs are explicitly excluded, this will be denoted by pre-pending “shared” or just the letter “S”. 8.3 SLAs as BDDs 8.3.1 A Usage Example In order to illustrate more intuitively the usage of this structure for SLAs, the example from Section 2.2 will be used. As discussed, this kind of busi- ness scenario involves two SLAs. The first (s-SLA) is established between the end-customer and the SaaS provider, to govern their interactions and apply guarantees. The second (i-SLA) is established between the SaaS and the IaaS providers for the same purpose. The customer would try to engage in an SLA with the geo-location ser- vice provider, which would involve –for instance– metrics for service avail- ability, and service invocations completion time (CT). The SaaS provider would apply its understanding about the software (based on modeling principles or historical monitoring evidence) to derive expected resource requirements, possibly varying throughout a day’s, month’s or other pe- riod. The infrastructure resource requirements, on the other hand, would be the guarantees that the SaaS provider’s SLA with the IaaS provider would need to include. The example SLAs are described as follows: SLA-1: For service “Geo-location”, and given that business hours are be- tween 09:00 and 17:00: During business hours, operation “Plan- Route” must complete within 5 seconds, and the service’s availabil- ity must be more than 99%. Outside business hours, completion time for the same operation can be up to 10 seconds, and the ser- vice’s availability must be more than 95%. 76 Chapter 8. System-Internal SLA Representation SLA-2: For service “VMpool”, and given that business hours are between 09:00 and 17:00: During business hours, 4 virtual machines must be allocated to this contract. Outside business hours, 2 virtual ma- chines must be allocated. SLA Variable Proposition Proposition type SLA-1 x1 ServiceName = ’Geo-location’ Fact SLA-1 x2 BusinessHours = 09:00 - 17:00 Fact SLA-1 x3 TimeOfDay in BusinessHours Condition SLA-1 x4 ’PlanRoute’ CT < 5 sec Clause SLA-1 x5 Geo-location availability > 99% Clause SLA-1 x3 TimeOfDay not in BusinessHours Condition SLA-1 x6 ’PlanRoute’ CT < 10 sec Clause SLA-1 x7 Geo-location availability > 95% Clause SLA-2 y1 ServiceName = ’VMpool’ Fact SLA-2 y2 BusinessHours = 09:00 - 17:00 Fact SLA-2 y3 TimeOfDay in BusinessHours Condition SLA-2 y4 Number of VMs = 4 Clause SLA-2 y3 TimeOfDay not in BusinessHours Condition SLA-2 y5 Number of VMs = 2 Clause Table 8.1: Example clauses Table 8.1 illustrates the set of facts and clauses to be used for this example scenario. It is straightforward to see that, given these facts and clauses in the form of boolean variables which evaluate to true or false, the SLAs can also evaluate correctly if they are modeled according to Equations 8.2 and 8.3 respectively. In the upcoming Section 8.3.2, the problem of expressing SLAs as boolean functions will be formalized. Then in Section 8.3.3 it will be shown how these specific example SLAs map to BDDs. f = x1 · x2 · (x3 · x4 · x5 + x3 · x6 · x7) (8.2) g = y1 · y2 · (y3 · y4 + y3 · y5) (8.3) 8.3.2 SLAs and SLA Hierarchies Here follows a proposed SLA representation: Let Φn be the universe of facts applicable to contracts as indisputable truth, Φn = {φ1, ...,φn}. Also let Ym be the universe of clauses which can be evaluated to either true or false, Ym = {y1, ...,ym}. A Service Level Agreement is the boolean 77 Chapter 8. System-Internal SLA Representation function f: f : Fk ∪ Zl → {0, 1} (8.4) where Fk ⊆ Φn, Fk = {φ1, ...,φk} and Zl ⊆ Ym, Zl = {z1, ..., zl}. Hence, there exists a representation of an SLA as a boolean function, taking advantage of an SLA’s binary nature upon evaluation as possible / impossible to satisfy (at negotiation time) or honored / violated (at run- time, i.e. while the service is being consumed). The variable terms of an SLA are taking values from Zl, while pre-agreed understanding (and in general, facts about the world) is encoded in facts accepting values from Fk. This definition is broad enough to encompass various previous defini- tions, both conceptual (e.g. [75]) and syntactical (e.g. WS-Agreement). In Chapter 5, service hierarchies and the corresponding SLA hierar- chies (contracts and sub-contracts) were discussed. There was also a formal definition of the relationship between one property of a service, and all relevant properties of the services that it depends on (i.e. service dependencies). It is now possible to codify generically SLA dependencies, in a manner that allows enough flexibility to describe any such kind. Let: • f : Fk ∪ Zl → {0, 1}, the dependent SLA • fi : Fki ∪ Zli → {0, 1}, i ∈ N, the depending SLAs • Fki ⊆ Φn, Fki = {φ1i, ...,φki} • Zli ⊆ Ym, Zli = {z1i, ..., zli} There is defined the dependency of f upon {fi} (and therefore the resulting hierarchy) as a function g: g : Zl → ∪i(Zli)|Fk ∪i (Fki) (8.5) Simply put, a function of any number of variable terms from SLA f equals a function of any number of variable terms from one or more SLAs fi, under the circumstances defined by the relevant fact sets. Operating under this highly abstract definition allows us the required flexibility to describe contracts with dependencies of any kind, as long as each of them does eventually evaluate to either true, or false. 8.3.3 BDD Mapping There now exists a formal representation of SLAs (Equation 8.4) and SLA dependencies (Equation 8.5). The gain in using BDDs lies in reduction. Through this process, a BDD becomes a canonical representation of the 78 Chapter 8. System-Internal SLA Representation boolean function it describes, as proven in [70]. Therefore, an SLA de- scribed as a boolean function in the form of a BDD takes a unique, well- specified andminimal form, eliminating redundancy and allowing to make the mapping which describes SLA dependencies far more efficient than what it would be if complete graphs were processed. Additionally, the canonical form of the SLAs allows objective evaluation and comparison for maximizing utility. The exact method to construct a BDD from an SLA depends on the for- mat in which this SLA is originally expressed, and therefore it cannot be algorithmically defined in a universal way. In the case of WS-Agreement, the Context and Service Description Terms would be used as facts; Qualifying Conditions as conditions; Guarantee Terms as clauses; and Term Compositor Terms could be classified as either conditions or clauses. In fact, WS-Agreement’s Term Compositor Terms are essentially boolean operators: All (AND), OneOrMore (OR), ExactlyOne (XOR). Using this pre- defined knowledge for such a specific SLA language, it is straightforward to implement a parser that can read the documents and construct a (Re- duced Ordered) BDD on-the-fly as described in [76] by Bryant with the revised “APPLY” operation. Using the example also discussed earlier in Section 8.3.1, it is possible to illustrate the reduced form of BDDs representing SLAs. As mentioned, Equations 8.2 and 8.3 represent the two example SLAs as boolean func- tions of the variables from Table 8.1. Then, assuming an ordering corre- sponding to the numbering of the variables, the two resulting BDDs would be as in Figure 8.2. The main deficiency of BDDs is their reliance on the ordering of the variables. The size of a BDD for the same function may vary from linear to exponential, depending on how variables are ordered [70]. Generic al- gorithms for near-optimal orderings of variables during or after BDD con- struction have been researched extensively in the past (e.g. [77, 78]). The application to the domain of SLA management and the involvement of facts as variables, whose else edge always points directly to terminal node 0, provides already a possibility for optimizing the BDD by push- ing all facts to the top of the diagram. Although this kind of ordering does not reduce the total number of nodes, it allows to ensure that indisputable facts are honored by all parts of the SLA, otherwise it will evaluate to false at runtime (i.e. it is violated). Also, at negotiation time, this ordering may speed up the negotiation process significantly, since the first thing to be confirmed as acceptable (or not) is the agreement of the involved parties on the essentials of the contract (for instance, monetary unit). It should be underlined, at this point, that facts are propositions which apply to 79 Chapter 8. System-Internal SLA Representation x1 1 0 x2 x3 x4 x5 x6 x7 y1 1 0 y2 y3 y4 y5 (a) (b) f g Figure 8.2: The BDDs corresponding to functions from Equations 8.2 and 8.3 the complete contract, and govern all terms included. Therefore, in cer- tain cases, additional attention is required for choosing what is a fact and what is not. Considering, for instance, the case of a two-party contract with two sections describing the obligations of each party, starting each section with an indication as to which party it applies: The statement “section (a) describes the obligations of party (A)” is certainly true for the complete contract. Nevertheless, if reference to the section includes some contract-locality constraint, e.g. “this section describes the obliga- tions of party (A)”, then this causes ambiguity and cannot apply to the whole contract any more – therefore should be modeled as a condition. After ordering facts at the beginning of the diagram, one of the ex- isting BDD methods to optimize the ordering of conditions and clauses is assumed to be used. Additionally to generic methods described in rel- evant literature, a kind of structural optimization that takes advantage of the semantics of SLAs and may be applied here is one that consid- ers what is more crucial to the user. Certain SLA representations con- tain sections on Business Values, that may reference specific terms as regards their importance (in Chapter 7 they were used as coefficients to 80 Chapter 8. System-Internal SLA Representation failure ratios). Given proper formalization of such sections, a constructor of BDDs from SLAs can take them into account and order clause variables from maximum to minimum importance, thus allowing faster evaluation of business-critical terms. It is now possible to discuss additional principles for the SLA applica- tion domain, and for outsourcing parts or all of the contract. Starting from the very semantics of SLAs represented as BDDs, one must distinguish between the meaning of a boolean variable (and the whole diagram) dur- ing negotiation time, and during runtime. 8.3.4 Negotiation Time Operations During negotiation time, the evaluation of a fact variable to true or false shows whether the fact is recognized as such from the receiving party. For conditions and clauses, it indicates whether there is any reachable state based on assignments of respective variables, so that the condition / clause under examination can eventually evaluate to true. Extending this to the complete diagram, at negotiation time it is important to see if there exist, in general, truth assignments for the whole set Fk ∪ Zl which satisfy the diagram and lead to 1. At this point lies an implicit decision. The party that receives the offer needs to have some certainty that it can honor it after signing. It is a policy issue if this certainty needs to be 100%, or near that, or even much lower (perhaps indicating a high-risk strategy). Whatever the policy, the decision will have to be taken based on some objective criteria. A certainty of 100% would mean that paths of the BDD must be checked for tautology, that is, any truth assignment for a path will lead to terminal node 1. If tautology applies for a single path, that should be enough to accept the offer. Should more than one paths lead to 1, then the provider must apply additional criteria to make a choice as regards which path should be chosen. Such criteria would typically involve utility, that is, which path would provide the most returns. This is, for instance, directly applicable to the the IaaS use case of Part III. If all paths refer to infrastructure resources that can be provided, then the one that yields the highest profit would naturally be selected. If there is no path leading with certainty to 1, it is necessary to make an educated guess whether the offer is acceptable, and whether some part needs to be subcontracted. A simple calculation that can be per- formed, is the following: Let Γ1f be the set of all paths for f that connect the root to terminal node 1, and Γ0f the respective set of paths leading to terminal node 0. It is assumed that by means of historical monitoring in- formation, forecasting, or simply common sense (e.g. time of day) there is assigned to each node ui in h ∈ Γ1f a probability P ′(ui) to evaluate to a 81 Chapter 8. System-Internal SLA Representation result so that node ui+1 is (also) on the same path, and 1− p to evaluate otherwise. If the variables of the nodes in the path are dependent, then one needs to take this into account and calculate the conditional proba- bility of each variable, given the evaluation of all previous variables on this path: P(ui+1) = P ′(ui+1|u1 ∩ u2 ∩ ... ∩ ui) (8.6) In somewhat less formal notation, the variables (and the events of them taking a value of true or false) were represented by the names of their nodes. If the variables are independent, then P(ui) = P ′(ui), ∀i. The probability ph that the complete path evaluates to true, is ph = ∏ u P(u)|u ∈ h (8.7) Then, the total probability that the SLA can be honored if established, is C = ∑ h (ph)|h ∈ Γ1f (8.8) Assuming that the acquisition of this probability per node can be per- formed in constant time, then the complexity of estimating this probabil- ity per path is O(n). The consequent requirement to minimize the total number of paths at construction time or variable ordering time, should also be taken into account. A negotiating party will want C to exceed some threshold, in order to agree to an offer that was received. If this is not the case, then the party (typically, a service provider) will have to either reject the offer, or try to increase C by subcontracting one or more paths and thus increase their contribution to the total success probability. Representing SLAs as BDDs is most useful at this point: The canonical and reduced form of a BDD produces a tractable list of options with regard to what can be assigned to subcontractors. For items in such a list, due to the specific ordering of variables, it is possible to devise unique and unambiguous signatures. The latter may then be associated to different boolean functions, which represent candidate subcontracts. Domain-specific intelligence can be applied by area experts before operation starts, and define the depen- dencies of certain variables on others for subcontracts. Then, a system based on these principles can make use of this knowledge, and construct proper offers towards third parties. In order to do this, such a system would use the principles discussed in Section 5.1.2. As long as these of- fers are accepted, and the respective second-level SLAs are established, it should be the case that the corresponding path has increased certainty to complete successfully as regards honoring the first-level SLA. The ne- gotiating party has a choice, according to policies and strategies, to mod- ify the offer and return it with specific values for the variables of that path 82 Chapter 8. System-Internal SLA Representation (practically suggesting the SLA equivalent of the path), or to accept the complete SLA as long as the increase in C is sufficient. Coming back to the example scenario from Section 8.3.1, there are two possible ways where this kind of subcontracting is / may be needed. The first, is the explicitly mentioned subcontracting from the SaaS to the IaaS provider. Conceptually, since the SaaS provider has no infrastruc- ture, it cannot offer the service at all unless it subcontracts for infras- tructure. Terms x4, x5, x6 and x7 would always evaluate to false unless infrastructure resources are available for the software to execute on. As such, the SaaS provider has to go through this translation process in any case, to calculate infrastructure requirements and make a respective of- fer to the IaaS provider. If an agreement with the IaaS provider already exists, the contracting system in use should find this automatically af- ter the translation occurs, try to reuse it if possible, otherwise resolve to making a new offer. It must be noted here that, since the outsourcing concerns paths, the SaaS provider may just as well make two different offers to two different infrastructure providers (one for each of the two paths in Γ1f ), or can make a single offer to one infrastructure provider for both paths (this is the working assumption in the example scenario). The second case, is if it so happens that the IaaS provider cannot satisfy the incoming offer – for instance, does not have the resources to offer the requested performance during business hours. This means that, according to its estimation, y4 would evaluate to false most of the times, and therefore path y1−y2−y3−y4−1 would contribute minimally or not at all to the whole agreement’s C-value. In this case, the IaaS provider can reject the offer, or – depending on projected utility – try to outsource this path to another IaaS provider. It should be mentioned that an offer may be for a single SLA, or for multiple SLAs (typically for different services or groups of services) in the form of a Shared BDD. The presented working assumption of an offer for a single SLA does not affect generality. 8.3.5 Runtime Operations For the runtime scenario a monitoring subsystem is assumed, that can capture service execution-related events from various sources and de- tect if some SLA term is being violated. The process actually starts much earlier, during negotiation. At that time already, it is beneficial to verify that terms of an agreement can actually be monitored (this has already been further discussed in Section 3.2 and [79]). Following this verification step, as part of the negotiation process, an SLA may be formally estab- lished, perhaps relying on other SLAs for its existence. 83 Chapter 8. System-Internal SLA Representation While the service is being consumed, incoming events are processed and terms (in the form of boolean variables of the BDD) are examined to see if a violation has occurred. The ordering of the variables allows the linear-time confirmation, starting from the root and traversing the diagram towards terminal nodes. As each variable evaluates to true or false, the respective child (then/else) is followed until a terminal node is reached. If that node is 0, then there exists a violation, and the reason of failing at that specific part of the SLA should be assessed. Depending on whether this failure happened on a path which was outsourced, or not, there may be a re-negotiation initiated, penalties claimed, or simply adjust the method to estimate success probabilities for different paths. Additionally to corrective actions, such an event must be logged to be reused in next negotiation cycles. The exact methodology to use in order to avoid unnecessary evalu- ations of the complete diagram, depends on the monitoring system, the way to evaluate each variable, and the acceptable time thresholds for re- action to violations. A complete definition of such methodologies is out of scope for this work. 8.4 Proof of Concept As a proof of concept, a simple prototype was built to assess the ap- proach’ efficiency in calculating the feasibility of [complying with] an in- coming SLA offer, if it is established. The prototype accepts an SLA al- ready expressed as a boolean function in Reverse Polish Notation (RPN) form, produces a BDD from it and assigns probabilities to the nodes in a semi-random way. Then, it calculates the paths leading to 1, their prob- abilities to be followed and the total probability that the SLA can be hon- ored without any subcontracting. Experimentation was with a single SLA offer, which was crafted not to contain dependent variables, according to the following description: The SLA concerns service “Geo-location” (fact). Business hours are set to 09:00-17:00 (fact). The whole system must run in isolation from other customers of the service provider (fact). If the operation invoked is “PlanRoute” (condition), and time is within business hours (condition), then: completion time should be less than 5 seconds (clause); availability should be more than 99% (clause); and throughput should be more than 100 operations per minute (clause). For times outside business hours: completion time should be less than 10 sec- onds, availability should be more than 95%, and throughput 84 Chapter 8. System-Internal SLA Representation should be more than 50 operations per second. For operations other than “PlanRoute”, invocations should be authenticated (clause) and availability should be more than 98%. With regard to the assignment of probabilities to the nodes and the paths to follow, there was assigned a probability equal to 1.0 to facts and to the proposition of authenticated invocations (this being a functional requirement that the provider should be aware of). It was then assumed that invocations of “PlanRoute” are one out of three, i.e. a probability of roughly 0.33, and the same for the time of day being within business hours – so it is implied that invocations of the service are equally dis- tributed throughout the day. Finally, for the propositions of completion time, availability and throughput, there was randomly assigned on each node a probability between 0.8 and 1.0 that the provider can satisfy it or will fail (a second random number indicates which of the two applies). In a real scenario, the provider would calculate these probabilities based on monitoring, forecasting or other information. Eventually, this simple sce- nario was run 10000 times, to see under these semi-random conditions how the SLA success estimations behave. Constructing the BDD for this specific SLA took place in a mere 2.2 seconds. Running the 10000 prob- ability tests took approximately 4 seconds on a 2.4 GHz processor. The diagram contained 16 levels, excluding terminal nodes. Of the 21 paths leading to terminal node 1, the shortest was 6-nodes long (excluding 1), and the longest was 13-nodes long. 0 50 100 150 200 250 300 0 0.2 0.4 0.6 0.8 1Number of hits Overall success probability Figure 8.3: Experimental result Figure 8.3 illustrates the overall calculated probability that the SLA will be successful if established, in the form of a histogram for these 10000 85 Chapter 8. System-Internal SLA Representation runs. For each of the runs, a total success probability for the whole SLA was produced, then it was added to a counter of runs with the same result in order to produce the histogram. Allowing some certainty for individual terms (80%-100% probability of success or failure) results in a clear gap between SLAs projected to fail, and those projected to succeed. This is an indication that, using BDDs in this context and under such circumstances, it is possible to calculate in only a few milliseconds and with a reasonable amount of certainty, whether the complete SLA can be satisfied or not. From this experimental evaluation, the feasibility and validity of the approach is exhibited for all SLAs that consist of propositions evaluating to true or false. As long as all invariable statements of an SLA (e.g. ref- erences to other SLAs) can be expressed as facts, and all variable state- ments can be expressed as conditions and clauses, this assumption is valid for any SLA. 8.5 Related Work To the best of the author’s knowledge, this is the first work that uses BDDs to model and verify SLAs and SLA dependencies. That said, BDDs have been used in service computing before, albeit in very few occa- sions. In [80], Binder et al. are using a special form of BDDs, called Zero- Suppressed BDDs, to create compact digests of service advertisements. Then, the digests are distributed to interested parties which use them for their service composition needs. In [81], Campailla et al. are using BDDs for matching service advertisements in publish-subscribe systems (making use of equivalence checking). With regard to SLA modeling in general, the most well-known efforts are WS-Agreement and WSLA. As also mentioned in the previous sec- tion, the focus of these specifications is on-the-wire representations for enabling interoperability between independent agents. This is an area that is not targeted with the work presented in this Chapter; rather, the focus here is a system-internal representation, that will enable efficient mechanisms for decision making. As regards applying logic-based approaches to the topic of SLA man- agement, the work which comes closest to the present one, is the one de- scribed in [58] by Paschke et al. There, the authors look at the problem in more detail, defining constructs also for things such service description, pricing, QoS, etc. Conversely, in this chapter everything is addressed in an abstract way, and external syntactical definitions / appropriate archi- tectural patterns are assumed for applying these definitions. Additional differences include the explicit focus on managing hierarchies of SLAs 86 Chapter 8. System-Internal SLA Representation and associations between them as such. The necessary constructs for this kind of functionality also exist in [58], however there is no mention of essential facilities such as equivalence checks and translation between different vocabularies for different layers of a complete IT stack. One should also not disregard the breadth of tools that exist for build- ing, optimizing and processing BDDs. As part of the SLA@SOI project, a software artifact was prepared to parse SLAs into BDDs, using a readily available BDD construction Java library [82] in only a few days. Libraries like that, and the algorithmic work that has taken place in this area over the last decades, provides significant tooling that can be readily used for fast and efficient SLA processing. This is a very important aspect of the contribution outlined in this Chapter. Summary and Conclusions In this Chapter there was presented a novel application of (Shared) Re- duced Ordered Binary Decision Diagrams, for representing and managing SLAs, as well as facilitating the construction of SLA hierarchies. BDDs are graph-based structures which have been used for decades in the field of VLSI design and verification, with particular success. They are one of the main tools of the VLSI industry for testing prototypes, and therefore BDDs are a topic under heavy research for decades. The depth and breadth of existing ideas and research can be applied to SLA management for further advancement of this complex service management area. In this particular Chapter there was discussion on the representation through a formal definition of SLAs as boolean functions and from there as BDDs; explained the advantages of this approach; and showed how such kind of use is possible for negotiating SLAs, subcontracting (leading to implicit SLA hierarchies) and detecting SLA violations. Finally, experimental re- sults of applying BDDs to SLA representation were briefly discussed. In Part III that follows, an application of the principles from Part II will be illustrated in the context of Cloud Computing and, more specifically, Infrastructure as a Service. In addition, there will be an evaluation of using SLAs as a means for infrastructure capacity planning and resource scheduling. 87 88 Part III Infrastructure SLA Planning 89 90 Chapter 9 Resource Model Although SLA models define structurally a general context for describing electronic contracts (an envelope as referred to in many cases), they usu- ally do not provide languages and models for the SLA content. This is on purpose, in order to achieve generic use of the various envelope models. In this Chapter there is provided a resource description model that can be used within infrastructure SLAs. This model, called Resource Ele- ment (RE), is an abstract and standards-compatible construct to describe dynamic groups of resources in a generic manner. Given that resource pools can be described by the RE, it is also used as an abstraction for a Resource Manager that is invoked by the respective SAMI’s “Provision- ing” module from Chapter 6. The relevant invocations are for reasons of Advance and Immediate Reservation, as a means of resource leasing and provisioning. 9.1 Rationale and Requirements In certain usage scenarios, infrastructure resources need to be accessed by multiple users concurrently; while in others, exclusive operation is re- quested. In both cases, the maximum number of users accessing the resources must be controlled – either to enable an acceptable level of service quality, or to comply with exclusive operation requirements. This functionality implies the requirement for advance or immediate reserva- tion of resources, for which an infrastructure service architecture and a set of operations and reservation attributes need to be defined. Not only is reservation required for the management of access concurrency, but also to integrate control tasks such as job submission and data staging into complex workflows. This is further exaggerated when there are spe- cific QoS requirements from such workflows, and therefore some level of 91 Chapter 9. Resource Model predictable behavior is needed [83, 84]. Therefore, reservation of a re- source when shared use is possible, provides a means to support strict guarantees and thus contributes to predictable QoS [85]. The reservation of resources, especially when multiple of them are in- volved and must be reserved together, implies the need for a model that can describe resources and groups of them. In this chapter a generic model is provided, alongside primitives for declaring resource groups, and for reserving them. Using these, it is possible to formulate relation- ships between resources, and thus express requirements for resource co- allocation. The resulting data types can be used inside the envelope of a generic negotiation protocol, but can also be directly supported by re- source pools attached to a SAMI. In addition to the model and data types, there is provided a set of generic operations that may be supported by these resource pools, and therefore used by the SAMI’s provisioning mod- ule during the negotiation and provisioning phases. It is important to note that, although when speaking of IaaS one typ- ically focuses on computing, storage and network resources, there is no particular reason why other types of resources would not be considered. Such resources may include sensors, scientific instruments, robotic and industrial equipment, etc. The presented model takes that into account, and uses this diversity as a requirement. On a more specific technical level, the following requirements are adopted. Firstly, it is necessary to be able to define groups of similar or identical resources, to which the reservations will correspond. This is for usability reasons: Reserving large amounts of resources by identify- ing them, one by one, would be impractical. Via the definition of groups which correspond to the inherent details of resources, it is possible to use more complex expressions for resource selection. The reservation of a single group allows the simultaneous booking of all resources in the group. There are assumed “get”, “set” (which also includes “update” semantics) and “cancel” operations for the reservations. Additionally, functional and non-functional group properties –which are subject to in- frequent changes and only meant for resource selection purposes– are expected to be available. Based on the rationale elaborated above, the aforementioned require- ments are interpreted into the following categories of functionality: Execution of reservations: Reservation of groups by identification, by resource type and properties. Provision of reservation metadata: Provision of metadata for a spe- cific reservation and for all reservations. 92 Chapter 9. Resource Model Handling of existing reservations: Update and cancellation of a reser- vation. 9.2 Resource Element Design 9.2.1 The Resource Element Model The Resource Element (RE) is defined to be the service representing and virtualizing resources within the IaaS provider’s domain, or one of its sub- domains (for instance, different data centers of the same provider). It al- lows access, manipulation and management of a set of resources via the exposure of resource-specific, as well as general-purpose primitives. As a single RE may represent multiple resources of different kinds, use cases can require individual operations to be performed atomically on groups of the real resources represented by this RE. These groups are required to include resources that are supporting semantically and syntactically identical operations, such that the invocation of a given group primitive ensures the operation to be atomically executed on all the elements that are part of the group itself. This model will now be expressed in terms of CIM, and its compliance will be shown. 9.2.2 Relevance to CIM Reference Model The Common Information Model (CIM) [56] provides a common definition of management information for systems, networks, applications and ser- vices, although currently it does not include reservation concepts. Never- theless, the RE definition provided above is compatible with CIM as herein described (in what follows, the CIM_ prefix of CIM classes is truncated). The RE implements System, as it defines a functional whole. Resource groups implement the LogicalDevice class, which is meant to provide an abstract model of hardware entities. In the case of physical resources that are sensors, the Sensor subclass of LogicalDevice can be used. Other re- sources implement PhysicalElement, defined in CIM as “any component of a System that has a distinct physical identity”. Through the SystemDevice relationship, LogicalDevice instances can belong to a System. The many-to-many relationship defined between PhysicalElement and LogicalDevice models the case where one phys- ical resource belongs to multiple groups at a time. The LogicalDevice class exposes properties that can be used for dis- covery and matchmaking. This allows the construction of complex ex- pressions for resource selection and reservation based on group type, or on group properties (for instance “location”, or “CPU clock speed”). 93 Chapter 9. Resource Model Resource Element (CIM_System) Group (CIM_LogicalDevice) Group (CIM_LogicalDevice) Resource (CIM_PhysicalElement) Resource (CIM_PhysicalElement) ... CIM_SystemDevice ID (DeviceID) QualityCharacteristics (IdentifyingDescriptions, OtherIdentifyingInfo) Type (CreationClassName) Resource (Sensor) (CIM_Sensor) Resource (Sensor) (CIM_Sensor) realizes realizes ... Figure 9.1: Mapping of RE concepts to CIM The relevant attributes of LogicalDevice are DeviceID, which identifies groups uniquely, and IdentifyingDescriptions/OtherIdentifyingInfo, which are free-form and may be used for any kind of such character- istics. The extension of the LogicalDevice class and the usage of the CreationClassName property offer an additional easy way to implement group types. 9.2.3 Relevance to GLUE Schema As mentioned earlier, the less frequently changing attributes of resource representations are expected to be available for resource selection and matchmaking. To achieve this, an appropriate information model is nec- essary for representation purposes. The Grid Laboratory Uniform En- vironment (GLUE) Schema by the respective Open Grid Forum working group [86] addresses this requirement for computing and storage re- sources, as it already provides representations for them. The abstract resource virtualization design presented here is compatible with the es- sential GLUE concepts, therefore the two can be combined and a nor- mative description in the GLUE framework can be produced straightfor- wardly. According to the the GLUE 2.0 specification, five core concepts 94 Chapter 9. Resource Model lie in the heart of the model and can be directly mapped onto the de- sign presented here. The Resource Element itself is the Service. Its in- terface to the SAMI, offering implementation information and normative access methods, is the Endpoint. Single resources themselves are the Resources; and resource groups are the Shares. Finally, GLUE Activities refer to control operations taking place on physical resources. This design does not pose any restrictions to these operations, therefore they can be freely described in GLUE terms. 9.3 Resource Element Reservations Establishment of infrastructure SLAs involves a number of players; us- ing WS-Agreement terminology, they are the Agreement Initiator (AI), the Agreement Responder (AR) and the RE. Practically, a request to establish an SLA is – in the context of IaaS – primarily a request to provision re- sources. As such, the reservation of resources will be required at some point, be it during negotiation (tentatively) or after the establishment (permanently for the contract duration). The main scenario addressed in this Part of the thesis can be as illustrated in Figure 9.2. As evident there, a customer SAMI (AI) negotiates with the IaaS provider SAMI (AR) for resources provided by the resource pools (RE). Thus, reservation oper- ations of the RE must be invoked by the AR, via the provisioning module. The data structures necessary to support RE reservation functionality are presented below. Time Lease: this type is used to define the temporal creation constraints of reservations. As such, it is required for requesting a reservation, for verifying the availability of a resource group at a certain point in time, or for retrieving reservation metadata. The type consists of a start time expressed as a timestamp, and the guaranteed dura- tion of the reservation. The start time may be zero, indicating an immediate reservation. Group Types and Characteristics: these are simple structures consist- ing of an identification number and a textual description. They pro- vide group-specific information that can allow for logical grouping in the process of matchmaking for a reservation request. Such types and characteristics may enable the construction of complex con- straint expressions, such as “reserve 10 CPUs where clock speed is 2GHz and location is Germany”. Resource Group: the group data type includes a numerical identifica- tion, a textual description, the group’s type and a list of characteris- 95 Chapter 9. Resource Model tics associated with it. As any domain-specific group-related data is decoupled from the group identifier itself, the design can be flexibly used in very diverse application environments. Reservation: this type represents the reservation of one or more groups and includes all related metadata. It contains an identification num- ber, a timestamp indicating the date and time for reservation set- up, its duration, the list of the relevant groups reserved, a boolean attribute to flag the state of the booking operation for all groups, and, finally, the reservation state itself. The state may indicate that the agreement is established but not activated yet; alternatively, that it is active and ready to be used by the authorized consumers; or that is has expired. Group Group D Agreement Initiator Agreement Responder Resource Element Negotiation protocol Reservation protocol Agreement and Resource Consumer Group B Group A Group C Legend: A single resource Figure 9.2: Resource Reservation Scenario It is very important to note that the model allows both static (prede- fined) and dynamic groups. In the latter case, it is possible that a group 96 Chapter 9. Resource Model contains other groups recursively, although this is purely a conceptual artifact. What is imposed by the data type is that one single specific re- source may belong to more than one groups. Thus, if resources R1, ..., Rn all belong to a group S1, and there exists a group S2 which contains all of R1, ..., Rn but also Rn+1, ..., Rm, then one can conceptually represent S1 to be contained within S2. The following operations are defined for RE reservation management, according to requirements from Section 9.1: Reserve groups By ID: Assuming that the IDs of the resource groups to reserve are known in advance, the SAMI may use this operation to invoke a reservation request on the receiving RE. Reserve Groups By Expression: This operation enables the use of ex- pressions including resource quantity, group type attributes and characteristics of a group as reservation constraints. The expres- sion predicates may be combined using unary or binary boolean operators to specify dynamically the groups to be reserved. Retrieve Reservation By ID: This operation returns metadata about a specific reservation, using its identification number. Retrieve All Reservations: This operation is meant to return all the reservations for a Resource Element, or, with appropriate factoring of the input argument, all reservations in a specific state (e.g. ex- pired). Update Reservation: Reservation properties are modified by this oper- ation using information such as time lease, the list of groups to be added or subtracted, new types and qualitative characteristics. Cancel Reservation: It removes an existing reservation, as long as it has not expired. The policy to adopt in case of active reservations is a service-provider specific issue. 9.4 Application to Real Use Cases The model and reservation language presented can have practical ap- plications when combined with ontologies which describe the systems (equipment) that will be modeled and reserved. Such ontologies can be built using various sources, depending on the specific resources. Here, for illustration purposes, it will be shown how the model applies to a resource pool including storage, CPUs (processing) and network traffic sensors, as 97 Chapter 9. Resource Model well as to a real-world scientific instrument. The former could be, for in- stance, part of the infrastructure supporting the geo-location service dis- cussed so far in the thesis. Additionally, the structure of a reservation re- quest and the corresponding operation in a simple pseudo-language will be illustrated. Storage supports the types found in the Storage Resource Manager (SRM) specification [87] and a location characteristic. The in- strument is a weather station coming from a SensorML [88] example, and including a thermometer, a barometer, an anemometer and a rain gauge. The thermometer has a resolution of 0.1 degrees Celsius, a range of -45 to +60 degrees, and its accuracy is ±0.5 degrees. In the simplest scenario, the computing resources would be as illustrated in Figure 9.3, while the instrument would be represented as shown in Figure 9.4. Resource Element Group A Type: Stor age Characteristics V olatile: No Location: German y Backup: Daily Group B Type: Processing Group C Type: Network T raffic Sensors Figure 9.3: Representation of the computing resources For the computing resources, a sample reservation request could look in pseudocode as follows: Reserve Groups by Expression: Time Lease: Start: 2010-06-01 13:00:00 GMT Guaranteed Duration: 3600 seconds Desired Duration: 7200 seconds Type: Storage Characteristics: Storage: Volatile Location: Germany Quantity: 1TB 98 Chapter 9. Resource Model Resource Element Group A Type: Thermometer Quality Characteristics Resolution: 0.1 Range: -45 -- +60 Accur acy: 0.5 Group B Type: Barometer Group C Type: Anemometer Group D Type: Rain gauge Figure 9.4: Representation of weather station Making the weather station example slightly more complex, it is as- sumed that it has two thermometers; one with the characteristics men- tioned earlier, and another one with more accurate characteristics, e.g. resolution equal to 0.01 degrees, range -65 to +70, and accuracy 0.05 degrees. A request to reserve this latter sensor for measurements would look in pseudocode as follows: Reserve Groups by Expression: Time Lease: Start: 2010-06-01 13:00:00 GMT Guaranteed Duration: 3600 seconds Desired Duration: 7200 seconds Type: Thermometer Characteristics: Accuracy: 0.05 In the case where dynamic groups would be used, a group containing the more accurate thermometer would be automatically created and re- served. The initial resource representation would only include the groups originally defined by the resource owner (IaaS provider). Then, as reser- vation requests would come in and be accepted, additionally created groups could enrich the group ecosystem of the RE for all users, or re- main private for the authenticated user who created them, depending on 99 Chapter 9. Resource Model implementation choices. Similarly, for more traditional resources such as CPUs or storage, the use of dynamic group creation and deletion is necessary as a means to group resources on-the-fly according to incoming requests on one hand, and existing availability on the other. By way of an example, one can consider a case of a RE with a single group of 100 CPUs, all of the same characteristics. A reservation requests for a group that includes 50 CPUs of those same characteristics, would allow the RE to randomly select 50 out of 100, group them under a single reference endpoint, and return that to be used in all further communication with the requesting SAMI. The remaining 50 CPUs would form a new group to denote available (free) re- sources, but on the same time the original group would remain to denote the complete resource set. 9.5 Related Work As discussed in Paragraph 9.2.3, GLUE is a model to describe computa- tional resources – especially in the context of Grid computing. The model presented here is compatible with GLUE, but more generic, so as to allow the modeling of infrastructure resources other than compute units and storage. This abstract approach is necessary to describe, for instance, sensors, or scientific instruments. In this latter case, the types of the re- source could be anything: “cameras” for telescopes, “pressure sensors” or “temperature sensors” in meteorological stations, “network interfaces” in active networking equipment, “motors” in robotic equipment, etc. The diversity of the different instrument types that comprise certain appa- ratus can be too large to express with a reduced dictionary. Similarly, the quality and suitability of a specific instrument can be expressed in many different ways. Examples include error rate, accuracy, response time, availability. However, it must be pointed out that different devices may have completely different metrics to indicate their appropriateness for a purpose, therefore also their suitability to a specific use case: Field of view (lens width), for cameras; Resolution, for visualization devices; Throughput, for a hardware encryption device; and so on. The same differentiation applies for storage: Advance reservation of storage resources has been specified by the Storage Resource Manage- ment Working Group (SRM-WG) of the Open Grid Forum. SRM reservation operations are storage-specific as they include a number of functional attributes which are of particular interest to this application area. In [89], Czajkowski et al. define a protocol for SLA negotiation in the context of Grid computing. Within this definition there exists a resource 100 Chapter 9. Resource Model description language, which is largely equivalent to the one presented in this Chapter. One difference between the two, is the additional semantics in the present work for resource clustering in virtual, dynamic groups. In addition, here there were presented advance reservation primitives, which were not within the scope of [89]. In [90], Kee et al. describe vgDL, the “Virtual Grid Description Lan- guage” that serves similar purposes. Nevertheless, the language is specif- ically targeting the Grid computing area, and as such, it is not clear how easily the language can be applied to different environments. Conversely, the language presented in this Chapter remains abstract on purpose, and therefore application in diverse domains is relatively straightforward. Some very recent work that took place in parallel with the one de- scribed herein, is [91] by Koslovski et al. The authors present VXDL, a language to describe virtual resources and interconnection networks. Targeting a similar area, but not providing similar advance reservation primitives as in this Chapter, the aforementioned paper provides the def- inition of a more well-specified language (i.e. with specific syntax) that covers both traditional infrastructure resources (compute units, storage, network elements/circuits) but also acquisition and visualization devices, software, etc. In addition it makes possible to define resource groups, as is also the case with the work described here. Compatibility with other efforts and standards is not discussed. Summary and Conclusions In this chapter, a resource description model was defined, alongside prim- itives for advance reservation of such resources. The goal of this work was to complement the SAMI architecture and allow access to resources in parallel to SLA negotiations. The contribution by this work can be sum- marized as follows: • The model is extensible and can be used in very diverse use cases, to describe arbitrary kinds of infrastructure; • It is compatible at large with standards such as CIM and GLUE; • Given a machine-readable rendering, the model can be used to- gether with envelope negotiation protocols and models (such as, for example, WS-Agreement); • It is coupled with generic constructs for reserving resources in ad- vance; 101 Chapter 9. Resource Model • It supports clustering of resources in virtual groups; and • It is not tied to specific technologies (such as XML), rather it can be adapted through respective renderings. In the upcoming Chapter, two greedy resource planning algorithms will be examined in the context of a hierarchy of SAMIs. The concepts from Part II, together with the concepts from this Chapter, will be applied to the use case of Infrastructure as a Service as an example of applica- tion. 102 Chapter 10 IaaS SLA Planning As mentioned earlier in the thesis, this part comprises an evaluation of the concepts mentioned thus far to show how hierarchical SLA manage- ment can be achieved. In this Chapter and the next one, SLAs and their hierarchies will be used as a means for infrastructure resource capacity management in the context of IaaS. More specifically, in this chapter a scenario is sketched and the whole setup for IaaS SLA management is illustrated. A simulation is designed as proof of concept based on this scenario, putting the focus on normal, runtime operations. There, SLA requests arrive from customers, they are being negotiated, served, and expire. To this extent, the Chapter serves as proof that the proposed design and models for hierarchical SLA man- agement are feasible. As discussed earlier, the SLA planning methodologies used during negotiation are considered to be use-case specific. Here, two different greedy scheduling algorithms are used for this purpose. The algorithms are modeled after the online bin-packing problem, and they are evalu- ated (via the aforementioned simulation) with regard to their energy effi- ciency, performance, and resource utilization efficiency. 10.1 Scenario Description The scenario assumes customers (which may be represented by a SAMI themselves) who contact the provider to ask for the establishment of one or more SLAs, where the service involves infrastructure resources. The provider has a top-level negotiating SAMI, without any attached re- sources. The latter are distributed among a number of data centers, and resource pools contained in those data centers. For each data center, there is a SAMI that negotiates for the respective resources, and a Re- 103 Chapter 10. IaaS SLA Planning source Manager (abstracted by the Resource Element model) that man- ages them. An SLA of the customer with the provider’s SAMI implies there is at least one, or perhaps more, SLAs between the provider’s SAMI and the data center SAMIs. For reasons of convenience, but without loss of generality, the assumption here is that each SLA consists of one co- location unit – that is, all resources for that SLA must reside in the same data center. Nevertheless, it is straightforward to have more than one co-location units inside an agreement, and assign each one of them to a single data center. This applies as long as it is possible to associate each co-location unit with a specific price and penalty. An incoming SLA request, such as the ones used for the simulation discussed later in this Chapter, looks like the example from Figure 10.1. There it is assumed that the constructs offered by the negotiation proto- col’s envelope are used to assign temporal, price, and penalty parame- ters. That said, there is nothing that prohibits to use the resource request (RE) structure for the same purpose; the SAMI would then have to remove those group characteristics that cannot be understood by the resource manager. This would then allow to have a single resource request – with multiple resource groups – inside the SLA request. SLA Request ... Resource Request 1 Group 1.1 Type : Virtual Machine Characteristics Quantity: n1 Resource Quality: (QoS, Size) Reservation: [Long running | Strict | Batch] Start Time: DateTime End Time: DateTime Duration: Integer Price: Float Penalty: [Penalty Function] Resource Request N Group N.1 Type : Virtual Machine Characteristics Quantity: nN Resource Quality: (QoS, Size) Reservation: [Long running | Strict | Batch] Start Time: DateTime End Time: DateTime Duration: Integer Price: Float Penalty: [Penalty Function] Figure 10.1: An SLA request example Incoming SLA requests were produced at random for the simulation. It should be noted that these requests contained no boolean operators, and therefore no BDD construction or path selection was necessary. This ap- proach was chosen to reduce less-relevant complexity, but also because the performance of constructing BDDs for random boolean functions is out of scope for the present work (readers interested in the topic can 104 Chapter 10. IaaS SLA Planning study [92] by Wegener and [93] by Gröpl). The specific choice does not affect generality; should there be more than one alternatives (paths end- ing to node 1), the one with the highest returns could be chosen – and if it could not be satisfied due to lack of resources, the next one(s) would be tested. The resources involved in each SLA are Virtual Machines (VMs), all characterized as “gold”, “silver” and “bronze” to illustrate different qual- ity of service (CPU clock speed, amount of memory, amount of storage). Each QoS level also has two categories (“small” and “large”), resulting in the possible requested configurations from Table 10.1. They are as- signed in servers of the same QoS class, to reflect the correct clock speed, but also the correspondingly larger total amounts of memory and storage (Table 10.2). All memory and storage amounts are counted in GB. Each resource manager (data center) has three resource pools, one for each server type, each pool featuring 5000 servers. There are three REs in to- tal, each represented by one SAMI. Figure 10.2 illustrates the described setup. QoS class Category Clock speed Memory Storage Bronze Small 1.6 1 80 Bronze Large 1.6 2 140 Silver Small 2.0 2 160 Silver Large 2.0 4 280 Gold Small 2.4 4 320 Gold Large 2.4 8 560 Table 10.1: Virtual machine types QoS class Clock speed Num. Cores Memory Storage Bronze 1.6 4 8 640 Silver 2.0 8 32 2560 Gold 2.4 16 128 10240 Table 10.2: Server types A customer must provide, as part of the request, the respective details of each resource group that must be allocated. These details include, for each resource group, the following information: 1. Reservation type: May be one of long-running, strict, and batch. Long-running indicates that the point in time when the allocation should be de-activated (i.e. when the lease will expire) is not known; 105 Chapter 10. IaaS SLA Planning Pro vider's S AMI (A) Data Center S AMI (B) Data Center S AMI (C) Data Center S AMI (D) Resource Manager Gold serv ers Bronz e serv ers Silv er serv ers Negotiation Pro visioning Resource Manager Resource Manager Customer Negotiation First-fit First/Best Fit Figure 10.2: The IaaS scenario of Part III the resources would be constantly needed for the foreseeable fu- ture. Strict means that the given time limits (as discussed below) are strict and should be honored, otherwise the SLA will be consid- ered to be violated. Finally, batch means that the resources must be allocated to the user at some point between the declared time limits, for a duration also declared, with the duration being shorter than the given time frame. This basically means that the provider is free to schedule those resources for some point in time when it will be most convenient; the user does not care exactly when that will be, as long as it is for the given duration. Should the total duration be less than the requested one, penalties will apply. 106 Chapter 10. IaaS SLA Planning 2. Resource quantity: The amount of requested resources, in some standard unit. 3. Resource quality: Combination of size category (large, small) and quality (gold, silver, bronze). 4. Start time: When the resources should become available (or earliest time limit for batch resource groups). 5. End time: When the resource can be de-commissioned (may be a value implying “never” for long-running groups, or a value implying the latest acceptable point in time for batch groups). 6. Duration: For how long the resources must be available. May be irrelevant for long-running groups, equal to the given time frame for strict groups, and shorter than the given time frame for batch groups. 7. Penalty: A percentage of the group’s cost, depending on failure ra- tio, as discussed in Chapter 7. The failure ratio is proportional to the amount of time that the group’s resources are unavailable. A price per resource group is also assigned to each of them. It is the product of a random value, the group’s resource quantity, and its duration. 10.2 Algorithms As also illustrated in Figure 10.2, the assignment to a data center is al- ways made (in this scenario) on a First-Fit basis. In the next Chapter, different strategies for the negotiation among the provider SAMI and the data center SAMIs will be explored. The provisioning strategy, however, may take different forms. The scenario is modeled as a dynamic, multi-dimensional, online bin-packing problem. A bin-packing problem is the (combinatorial) problem that tries to minimize the number of bins into which are placed N given items of different sizes, and all items are known in advance. Its online counterpart is the same problem, but the items are not known in advance; rather, every time a new item m is given, m < N, the remaining items (m + 1..N) are unknown. The multiple dimensions are related to the different types of resources required per item (CPUs, memory, storage). Finally, the problem is characterized as dynamic due to the introduction of a time dimension, and the fact that items may also leave a bin – not only join it. 107 Chapter 10. IaaS SLA Planning Figure 10.3 illustrates the problem graphically. Each box without a label represents a request for a VM, within the incoming SLA offer. Its horizontal side represents the amount of memory that the VM should fea- ture, and the vertical side represents the respective amount of storage. The color shows what is the required clock speed (and, therefore, in which server pool it should be assigned). The problem faced is how to assign all VM requests and minimize the number of (active) servers that have one or more VMs assigned to them. That means, unused memory and storage should be minimized for the active servers, while in parallel, the number of inactive servers (dark-colored ones) should be maximized. Figure 10.4 illustrates the time dimension of the problem (excluding the “storage” dimension and keeping only the one about memory, to simplify represen- tation). Requests A – E are being agreed to as they come. The height of each box represents the requested memory, while the width represents the duration of the advance lease. Location on the time axis represents the exact point in time when the lease starts. When SLA request F ar- rives, it cannot be placed into Server 1, as there is not enough memory left available from point in time T1 to point in time T2. Server 2 must then be started, and the request should be assigned on it. Bronze Bronze Bronze Bronze Bronze Silver Silver Silver Gold Gold Figure 10.3: Two-dimensional bin (server) packing The reason that the formalization of bin-packing is chosen, is the wish to minimize the number of active servers, and thus explore the efficiency of different approaches with regard to energy savings. Although bin- packing is one of the oldest combinatorial problems, and there exist var- ious different formulations (online/offline, dynamic/static, single-/multi- dimensional), the combination of all three properties has not been stud- 108 Chapter 10. IaaS SLA Planning M e m o r y u t i l i z a t i o n Time A Total available memory C E D B Server 1 M e m o r y u t i l i z a t i o n Time Total available memory Server 2 F T1 T2 Figure 10.4: Dynamic bin (server) packing ied until very recently. To the best of the author’s knowledge, the only other work that addresses this specific problem is [94] by Epstein and Levy, under publication at the time of writing (September 2010). This paper studies the problem in a more complete manner, and therefore readers interested in its properties are recommended to advise with it. In the present Chapter the requirement is to apply the problem, as an ap- propriate formulation, and explore the efficiency of two relatively simple solving algorithms. It should also be noted that in the traditional dynamic variant of the online bin-packing problem, the events of an item leaving the bins is not known in advance. In the case studied here, it may be known in advance when an item will leave the bins, as it reflects the du- 109 Chapter 10. IaaS SLA Planning ration of the reservation (declared by the customer at negotiation time, in the case of strict and batch reservations). The strategies explored are a First-Fit (FF) and a Best-Fit (BF) greedy algorithm. The FF algorithm is going through the list of open bins (can- didate servers) for each VM group requested in an SLA, until the first server with sufficient resources is found for the time frame of the group and some part of its total quantity; when this happens, the specific VM group’s quantity is reduced by the respective amount that was assigned to the server, and the search continues in the same manner until the full group quantity has been assigned. Figure 10.5 illustrates the algorithm executed for every resource group in the request, in the form of a flow chart. Start Select fi rst server in pool Select next server in pool No No YesYes A vailable resources? Assign as many as possible Group fully assigned? Stop Last server? No Yes Figure 10.5: First-Fit assignment algorithm for each resource group The Best-Fit algorithm is similar, but instead of selecting the first server with available resources, it tries to find the one that will minimize resource 110 Chapter 10. IaaS SLA Planning waste (in the form of spare cores, or memory/storage residue). Naturally, this means that BF has to run through the server list multiple times – up to s2, s being the total number of servers in the pool. Therefore, its performance is expected to be much worse than FF; this is confirmed in Section 10.3 via simulation. The Best-Fit algorithm used is illustrated in Figure 10.6. 10.3 Experimental Evaluation The simulation was executed for three different arrival rates and resource demand scenarios: One where resources were always enough for both algorithms (low consumption), a “borderline” one where some SLA re- quests would start being dropped, and one where resource demand was very high and the pools could not keep up with it. Results were averaged across the three data centers, for each case. In all cases, execution speed of the Best-Fit algorithm was predictably much worse than First-Fit. This is illustrated in Figure 10.7. It is imme- diately evident that BF is significantly slower than FF. Nevertheless, this delay should be assessed in relation to potential advantages in SLA ac- ceptance rate and energy savings. In the first case, as designed, all SLAs were accepted by both strate- gies. Looking at energy savings, measured by means of total servers switched on, Figure 10.8 shows that the Best-Fit algorithm performs bet- ter to some small extent. Zooming in to the graph, Figure 10.9 shows that the difference is in the scale of 2%. In general, the better performance of the BF strategy is expected: It it tuned to avoid resource waste, and therefore, it always first tries to maximize utilization for the servers that have some resources available. Therefore, it fills utilization gaps, and is slower to try switch on the next available server. Switching to a larger arrival rate, Figure 10.10 shows that BF accepts all SLA requests, while FF already starts dropping some. In this case, however, energy savings are much more using a BF strat- egy, and can be seen to approach 5% - 6%. This is illustrated in Fig- ures 10.11 and 10.12. When moving to the very large arrival rate, where in both cases not all incoming SLA requests can be served, these energy saving difference is evened out as one would expect. All servers must be utilized, and the advantage of the BF strategy now is that it can serve more SLA requests, as it has been making better use of the available resources. Figure 10.13 illustrates this better rate of accepted SLAs, which at times reaches a gap of approximately 10%. Finally, Figure 10.14 shows that all servers 111 Chapter 10. IaaS SLA Planning Start Select first server in pool Available resources? Calculate residue Store as candidate Select next server in pool Last server? Yes No No Sort candidates by residue Assign as many resources as possible Select first candidate Last candidate? Select next candidate No Yes List empty? No Stop Yes Group fully assigned? No Yes Yes Figure 10.6: Best-Fit assignment algorithm for each resource group 112 Chapter 10. IaaS SLA Planning 0 5 10 15 20 25 0 10 20 30 40 50 60 70 80 90 100Planning durat ion (sec) Time Duration of planning for incoming requests First-FitBest-Fit Figure 10.7: Performance comparison of FF and BF 0 0.2 0.4 0.6 0.8 1 0 10 20 30 40 50 60 70 80 90 100Percentage of a ll servers Time Active (switched-on) servers First-FitBest-Fit Figure 10.8: Energy savings comparison (low arrival rate) 113 Chapter 10. IaaS SLA Planning 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0 10 20 30 40 50 60 70 80 90 100Percentage of a ll servers Time Active (switched-on) servers First-FitBest-Fit Figure 10.9: Energy savings comparison, zoom in (low arrival rate) are fully utilized, and the advantage of the BF strategy has vanished as expected. Via all the experiments run, it is evident that the use of a Best-Fit strat- egy that tries to minimize resource waste within a single data center, has significant advantages when aggregated over the whole infrastructure of a provider. Energy savings varied and reached a difference of approxi- mately 6%; and accepted SLA requests (which can be directly reflected on additional customers and therefore additional revenue) was also signif- icantly higher in the case of saturated infrastructure. These advantages should be balanced against the significantly higher planning time. The gap among the two varied, but was consistently high, in some cases more than an order of magnitude. That said, the total time wasn’t prohibitively large, and in all cases remained under 20 seconds for our setup of three data centers, nine resource pools, and 5000 servers in each pool. Should this planning duration be acceptable, then the gains in using a Best-Fit strategy should not be overlooked. 10.4 Related Work Variants of bin-packing have been studied extensively in the past. As mentioned in Section 10.2, Epstein and Levy very recently studied the complete variant we use here (albeit with unknown departure times for the items involved) in [94]. An overview of related approximation algo- 114 Chapter 10. IaaS SLA Planning 0 0.2 0.4 0.6 0.8 1 1.2 0 10 20 30 40 50 60 70 80 90 100Percentage of total sub mitted requests Time Accepted SLA requests First-FitBest-Fit Figure 10.10: Accepted SLAs comparison (medium arrival rate) 0 0.2 0.4 0.6 0.8 1 0 10 20 30 40 50 60 70 80 90 100Percentage of a ll servers Time Active (switched-on) servers First-FitBest-Fit Figure 10.11: Energy savings comparison (medium arrival rate) 115 Chapter 10. IaaS SLA Planning 0.76 0.78 0.8 0.82 0.84 0.86 0.88 0.9 0 10 20 30 40 50 60 70 80 90 100Percentage of a ll servers Time Active (switched-on) servers First-FitBest-Fit Figure 10.12: Energy savings comparison, zoom in (medium arrival rate) 0 0.2 0.4 0.6 0.8 1 1.2 0 10 20 30 40 50 60 70 80 90 100Percentage of total sub mitted requests Time Accepted SLA requests First-FitBest-Fit Figure 10.13: Accepted SLAs comparison (large arrival rate) 116 Chapter 10. IaaS SLA Planning 0 0.2 0.4 0.6 0.8 1 0 10 20 30 40 50 60 70 80 90 100Percentage of a ll servers Time Active (switched-on) servers First-FitBest-Fit Figure 10.14: Energy savings comparison (large arrival rate) rithms for other online variants can be found in [95] by T. Gonzalez (ed.) and [96] by Coffman et al. In [97], Van et al. also address a VM packing problem aiming at min- imizing active servers, but they approach it as an offline, combinatorial problem that they solve periodically. Also, VM migration is possible in their model, something not allowed in the scenario presented here. A similar approach is presented by Hermenier et al. in [98]. The authors discuss task placement in cluster systems, but assume a VM per task, making the problem largely equivalent. In [99], Nakada et al. apply a genetic algorithm to plan efficient VM packing vectors, as an alternative to best-effort and VM migrations. Their work, however, is currently at the first stages and therefore it is difficult to extract conclusions with regard to its sufficiency or purpose fitness. Ajiro and Tanaka present some similar work for server consolidation in [100]. They perform measurements using two different algorithms, a First-Fit Decreasing (FFD) one (which is equivalent to BF from this Chap- ter), and a Least-Loaded (LL) one where the load was assigned to the server with the most available resources. By definition for this Chapter’s energy-saving objective, LL was inappropriate. Also, the authors do not address the time perspective. As regards SLA-based resource scheduling, there has been quite some work in the area during the last few years. Most existing work, however, looks into application-specific SLAs (which may, for instance, address ap- 117 Chapter 10. IaaS SLA Planning plication performance guarantees), while the present work considers only SLAs for infrastructure services that are application-agnostic. In [101], Netto et al. concern mostly with advance reservation of re- sources. Similarly to the work presented in this chapter, SLAs may have some flexibility with regard to their start and end times. Although Netto et al. do not include long-running SLA requests, they have a type that has flexible start time and strict end time. The authors also assume that the user provides (as part of the SLA) a function that correlates execution time with number of resources; and that there is, as such, flexibility with regard to the amount of resources assigned to the user. Conversely, here it is assumed that the user requests a specific amount of resources for a specific duration, and the main concern is which resources to use so that the energy footprint is reduced. In [102], Goiri et al. look at SLA-driven resource management in the context of Grid computing. As such, SLAs are related to specific tasks also in this case. The paper is related mostly to SLA adjustment issues: It presents a method to reschedule SLAs that are failing, by scaling either vertically or horizontally. These concepts and the architecture presented may be useful for the purpose of renegotiation and dynamic workloads. A very similar system is presented in [103] by Roy and Mukherjee. Using management agents in a Grid environment, the authors monitor tasks in relation to the respective SLAs, and perform adjustments when neces- sary. Finally, in [104] Ardagna et al. present a solution to the problem of as- signing multi-tier applications to servers within a data center. The multi- tier nature of those applications makes the problem quite similar to the topic of hierarchical SLAs, nevertheless the authors address it exclusively in the context of a single data center. Ardagna et al. construct a Mixed- Integer Non-linear Program, based on which they try to maximize revenue and minimize costs; the latter, in the form of reduced energy footprint. The problem is solved using a custom heuristic. Summary and Conclusions In this Chapter, there was presented a scenario that applies the main SLA management concepts from Part II on the domain of Infrastructure as a Service. Multiple SAMIs are assumed, in different management domains, serving incoming SLA requests. Based on those requests, which were dis- patched from a central broker to the data centers, two different VM place- ment algorithms were evaluated as regards their performance and energy efficiency, in a simulated environment. The algorithms, First-Fit (FF) and 118 Chapter 10. IaaS SLA Planning Best-Fit (BF), could be used as the POC implementations of the data cen- ter (2nd level) SAMIs. They exhibit a large difference in performance, with FF being faster. However, BF is significantly more efficient with regard to resource utilization, and energy savings by means of switched-off servers (no VMs assigned to them). In the Chapter that follows, a different planning problem (yet, within the same setup) will be tackled. More specifically, the problem is to de- cide on the best course of action when suddenly resources become less than those needed to serve the already established SLAs. This may hap- pen, for instance, in the case of massive resource failure. The objective of upcoming work is to minimize resulting penalties, and the approach used is to solve a knapsack-based combinatorial problem. 119 120 Chapter 11 Offline Planning with Limited Resources After an SLA has been established, it must be executed and monitored. Failure to provide the necessary resources and implement the service will result in a penalty, according to the SLA’s provisions. In this Chapter, it is assumed that at some point in time the resources become drastically less than needed to serve the SLAs already estab- lished. This may be, for example, due to some massive resource failure. The provider at that point must make a choice, which SLAs to violate pur- posefully, and to what extent. This is different from the previous Chapter not only because of the extraordinary circumstances involved, but also because the planning involved is now offline: the provider knows all SLAs in advance, and can use the complete SLA set during this planning pro- cess. 11.1 Scenario Description As an example of the situation to be examined in this Chapter, it will be assumed that at some point in time the normal operations (as described in Chapter 10) are disrupted because of one data center failing completely, e.g. due to power outage, network disruption, etc (Figure 11.1). It must be stressed that this is just an example; the approach is valid in any case where the existing available resources are less than those required by established SLAs. Another example of such an occasion could be the case where SLAs are not automatically negotiated among SAMIs, but rather inserted into the system by an administrator without prior confirmation of the existing resource capacity. Should resources become less than those necessary for the complete provisioning of all SLAs, some of them 121 Chapter 11. Offline Planning with Limited Resources will have to be withdrawn partially or completely. Pro vider's S AMI (A) Data Center S AMI (B) Data Center S AMI (C) Data Center S AMI (D) Negotiation Resource Manager Resource Manager Resource Manager 3rd P arty Pro vider S AMI Negotiation Figure 11.1: Resource failure scenario This time, the problem is not to fill up as few servers as possible, like in Chapter 10. Rather, it is a question how to make the best possible use of all available resources – where “best” is associated to the least penalty. Figure 11.2 illustrates the concept in a simple manner. If the resource pool is as shown in this Figure, where the horizontal dimension represents time (lease start times and durations), the vertical axis represents the SLA(s) requested resource quantity; and the number in each SLA representation is the associated penalty if it is discarded; then the problem is which SLAs to choose and assign resources to, so that the penalties of the remaining SLAs (the ones “outside” the pool) are minimized. SLAs Resource pool(s) 3 4 1 8 2 1 6 9 Figure 11.2: Problem illustration In this case, we are evaluating five different planning strategies, for the interaction among the main provider’s SAMI (A) and the data center SAMIs (B, C) or possibly a 3rd-party provider’s one. It is important to note 122 Chapter 11. Offline Planning with Limited Resources that 3rd-party SAMIs and data center SAMIs are essentially treated in the same way. For the provisioning phase (i.e. the communication between data center SAMIs and REs), a simpler model has been chosen here to improve simulation speed: It is assumed that there are large resource pools with CPUs, memory and storage, instead of separate servers of- fering them as it was examined in Chapter 10. This does not affect the results, due to the fact that within a data center there is simply a check whether an SLA fits or not; and the result is the same, irrelevant of the strategy used each time. Thus, this simplification causes no loss of gen- erality when it comes to the comparison of the five examined strategies. The structure of an SLA is as discussed in the previous Chapter. It includes a single co-location unit, that is, a group of resources that must end up in the same data center. Each group contains subgroups (resource requests) of the same resource types, for instance “CPU cores of clock speed 1.6 GHz”, and a requested quantity. For each resource request, there is also mentioned the type of the lease (long-running, strict, batch) and the respective time limits. The planning strategies that will be evaluated are the following: Non-flexible: The SLA is included/ scheduled in full, or not at all. That is, there is either no penalty, or a penalty equal to the full price of the SLA. Apparently this strategy is the least flexible, being almost “bi- nary” in nature: All resources have to be provided, otherwise the full penalty applies. It represents resource requests where the various groups inside them are tightly coupled, and it is not useful to have some resources available, while others are not. The only element of flexibility is with regard to start / end time, for “batch” reserva- tion types as discussed in Chapter 10. Even in that case, however, the total duration of the lease is strictly defined; if the resources are provided for less time than agreed, the full penalty applies. One can intuitively see that this strategy will be the fastest, as there are far less options to evaluate during the optimization phase. Flexible-items: Some of the resource requests in the SLA may be ac- cepted, while others not. The total penalty is the sum of penalties for each resource request that was not accepted. This strategy is similar to the previous one, with regard to the lack of flexibility re- garding time. It is not possible to modify the start or end time of the associated resource leases, except for batch reservations where the duration must still remain the agreed one, or the full penalty will oc- cur. Looking at Figure 11.3, it may be the case that resource request 1 is accepted and assigned, while resource request 2 is not. This could happen if, for instance, after calculating penalties for the two 123 Chapter 11. Offline Planning with Limited Resources requests according to the respective penalty functions, it turned out that the penalty for request 1 is relatively smaller than the one for 2, and therefore can be tolerated. SLA Request Resource Request 1 Group 1.1 Type : Virtual Machine Characteristics Quantit y: 5 Resource Qualit y: ... Reservation: Long running Start Time: Date Time End Time: Date Time Duration: Integer Price: Float Penalty: [P enalty Function] Resource Request 2 Group 2.1 Type : Virtual Machine Characteristics Quantit y: 10 Resource Qualit y: ... Reservation: Strict Start Time: Date Time End Time: Date Time Duration: Integer Price: Float Penalty: [P enalty Function] Figure 11.3: Example for “flexible items” strategy Flexible-time: In an effort to maximize the number of accepted SLAs (i.e. minimize the number of customers that receive no service whatsoever), in this strategy there is given complete flexibility: Not only some resource requests of an SLA may be served while oth- ers not, but those that were chosen may be scheduled to start later than planned, end earlier (for “strict” leases) or have a duration shorter than requested (for “batch” leases). In the example from Figure 11.3, it may be the case that request 1 (featuring a long- running reservation) may start later than agreed; and 2 may start later, end earlier, or be rejected altogether. First-Fit: This is a greedy algorithm, that sorts SLAs according to their penalties in decreasing order and tries to assign (serve) as many of them as possible. It follows an all-or-nothing approach, like the “Non-flexible” strategy. Each SLA is committed to the first data cen- ter in a row that will have enough resources (i.e., the first SAMI to accept it). Best-Fit: Like First-Fit, but the SLA is committed to the SAMI that will report back the least available (free) resources after acceptance. As it cannot be realistically expected that 3rd-party SAMIs will report resource availability, this algorithm can be applied only within the same administrative domain – that is, within the same IaaS provider. 124 Chapter 11. Offline Planning with Limited Resources In all strategies, the objective is to pay the least penalties possible. The evaluation in Section 11.3 will assess them with regard to execution time, best achieved objective, how far customers are affected, etc. 11.2 Problem Models The three combinatorial strategies (Non-flexible, Flexible-items, Flexible- time) will be formalized as binary (0-1) Integer Linear Programming (ILP) models. It is also possible, with some modifications, to formalize them as Mixed-Integer Programs (MIP) of quadratic form. Such models are simpler to express and more flexible with time. However, they are also known to be extremely difficult even for state-of-the art solvers and relatively small instances, therefore an ILP approach was preferred. As it will be shown later, even like this, the problems are fairly hard to solve for larger instances. In order to achieve the ILP formulation, time was quantized. The exact analogy of each simulated time slot to a real time duration is left open; it may be one minute, one hour, or anything else. The problem is formulated as a knapsack problem [105] variant. More specifically, this is a multi-period, multi-dimensional, multiple-knapsack problem. In the traditional knapsack problem, the question is the follow- ing: If there is one knapsack of known (weight) capacity; a number of items with their total weight to be larger than the knapsack’s capacity; and each item has a value associated with it: What is the combination of items that can be put in the knapsack, so that its capacity is respected, and the total value of items is maximized. Here, the “multi-period” part relates to the fact that items may come and go depending on time. “Multi-dimensional” is related to the fact that each SLA includes resource requests of different types (reflecting the dif- ferent resource types), and therefore there are more than one knapsack capacity dimensions to be considered. Finally, there are “multiple knap- sacks” representing the multiple data centers that may be used. Variants of the knapsack problem have been extensively studied in the past, and there have been various combinations of such variants ex- amined. To the best of the author’s knowledge, this is the first time that the specific combination of the three variants is formalized and studied. 11.2.1 Non-flexible The integer variable xij indicates whether SLA i (out of I SLAs in total) is assigned to data center j, or not. Its values are 1 and 0, respectively: xij ∈ {0, 1} (11.1) 125 Chapter 11. Offline Planning with Limited Resources Each SLA may be assigned only to one out of the existing J data cen- ters, at most: J∑ j=1 xij ! 1 (11.2) The integer variable yijklm indicates whether resource request m of SLA i is assigned to data center j, from time slot k to time slot l (inclusive). It also takes values 1 or 0, to indicate an assignment or not: yijklm ∈ {0, 1} (11.3) k ! l (11.4) Each of these resource requests may be assigned to one data center at most. In addition, it can only be assigned once – i.e., only one value for k and one for l are acceptable: J∑ j=1 H∑ k=1 H∑ l=1 yijklm ! 1 (11.5) As mentioned earlier, in this strategy the target is to assign the com- plete SLA, or no part of it at all. This means that, if the SLA is accepted (i.e. assigned to a data center), all of its resource requests should be accepted as well; and if the SLA is not accepted, no resource requests should be accepted either. This constraint is formulated as follows: H∑ k=1 H∑ l=1 Mi∑ m=1 yijklm = Mi · xij (11.6) Mi is the number of resource requests in SLA i. Each resource request m of SLA i has a lease type LTm, which takes values from the set (LR,ST ,BA) – for Long Running, STrict and BAtch. It also has a declared duration, DURim, a start time Tsm and an end time Tem. If a start time is in the past (i.e. the SLA is already executing), it is substituted by the current time: Tsim = max((NOW), T s im) (11.7) It applies that: (LTm = LR)∧ (k *= Tsim ∨ l *= H) (LTm = ST)∧ (k *= Tsim ∨ l *= T e im) (LTm = BA)∧ (k < Tsim ∨ l > T e im ∨ l− k + 1 *= DURim)    ⇒ yijklm = 0 (11.8) 126 Chapter 11. Offline Planning with Limited Resources That is, for long-running leases, candidate start time k must be equal to the declared start time Tsm and candidate end time l must be equal to the end of the planning horizon. For strict leases, k and l must be equal to declared start and end times; and for batch leases, k and l must be within the declared limits Tsm and Tem, while duration should be equal to the duration declared (requested) by the user. In all other cases, the assignment variable yijklm must be 0. For any given time slot t within the planning horizon, the 0-1 variable zimt indicates whether resource request m of SLA i may be active at that moment: zimt = { 0, t < Tsim ∨ t > T e im yijklm, Tsim ! t ! Teim (11.9) The weight of a resource request m of SLA i in a specific dimension d is wimd. As also mentioned earlier, dimensions are different types of resources, such as CPU cores, memory, storage. The sum of all weights of active resource requests at any given point in time and for any given resource type (dimension), should not exceed the amount of available resources of that type: I∑ i=1 Mi∑ m=1 wimd · zimt ! cjd (11.10) Finally, PENiklm is the penalty of an resource request m of SLA i, if it is active during time slots k to l. Penalty definition is external to the model. Here, for reasons of convenience it is given as a percentage of the resource request’s price Pim, proportionate to its time-related violation: PENiklm = Pim · DURim − (l− k + 1) DURim (11.11) Given Equation 11.8, the penalty in the Non-flexible strategy can be ei- ther equal to the full price Pim, or 0. Nevertheless, it is straightforward to apply a more complete penalty scheme, like the one elaborated in Chap- ter 7. The objective is to maximize the total penalties for resource requests assigned to data centers; that is, to minimize penalties for resource re- quests that will be left unassigned, and therefore, will have to be re- funded. max I∑ i=1 J∑ j=1 H∑ k=1 H∑ l=1 Mi∑ m=1 PENiklm · yijklm (11.12) 127 Chapter 11. Offline Planning with Limited Resources 11.2.2 Flexible-items The only difference between strategies Non-flexible and Flexible-items, is that in the latter it is possible to select only some of the resource requests to assign in one of the data centers. Therefore, the only part of the model to change is Equation 11.6, which becomes: H∑ k=1 H∑ l=1 Mi∑ m=1 yijklm !Mi · xij (11.13) 11.2.3 Flexible-time Apart from the modification of Equation 11.6 into 11.13, in the Flexible- time strategy it is also possible to start serving some SLA resource re- quests later than agreed, or end them earlier than agreed. This is re- flected on the relaxation of constraints from Equation 11.8. More specifi- cally, this Equation now becomes as follows: k < Tsim ∨ l > T e im ⇒ yijklm = 0 (11.14) It is assumed that starting to serve a resource request earlier than agreed, or continuing to serve it after its agreed expiration time, offers no value either to the customer, or to the provider. 11.3 Experimental Evaluation 11.3.1 Setup The models described in Section 11.2, plus the two greedy algorithms (Fist-Fit and Best-Fit) were evaluated and compared with random data for incoming SLA requests. Randomness was related to the reservation type, the duration of the reservations (i.e. the durations of the SLAs), their start and end time, the amount of requested resources, and the costs (with penalties being calculated as a percentage of those costs). Nor- mal distributions were used across the board. ILP models were built with the PuLP modeler [106] and executed using the GUROBITM [107] state of the art commercial solver. In various benchmarks against CPLEXTM and other solvers, GUROBITMproves to be as fast or even faster in most cases (e.g. [108, 109]. All simulations were run on a 4-way server running the Linux SMP kernel v2.6.32. Each of the four processors was a 1 GHz AMD Opteron with 1 MB of attached cache memory. The total RAM of the server was 16 GB. Except for standard operating system processes, and the Se- cure Shell (SSH) daemon, no other processes were executing at the times of the experiments. 128 Chapter 11. Offline Planning with Limited Resources The experiments executed included the following setups: • Standard configuration: 100 SLAs, 4 different resource types, 3 data centers, a planning horizon of 10 time units, and 3 resource re- quests per SLA; • Increasing number of SLAs, from 100 to 1000 in steps of 100; • Increasing number of resource types, from 1 to 5; • Increasing number of data centers, from 1 to 5; • Increasing planning horizon, from 5 to 30 units in steps of 5; • Increasing number of resource requests per SLA, from 1 to 5; and • All parameters increasing in parallel, in four steps: SLAs from 200 to 800 in steps of 200, resource types/data centers/number of re- source requests per SLA from 1 to 4, and planning horizon from 5 to 20 in steps of 5. Each setup was tested with all five strategies, and for the combinato- rial ones there was a timeout of 15 minutes. It should be noted that GUROBITMis highly optimized for parallelism, and indeed it was possible to confirm that all CPUs were at top load during problem solving. There- fore, the 15 minutes of real time correspond to 1 CPU-hour due to the four processors of the server used. For the last type of experiments (variables increasing in parallel), there was initially tested a 5th configuration with 1000 SLAs, 5 resource types/data centers/SLA resource requests and 25 time units in the planning horizon, but even the pre-solving phase (lead- ing to a bound value) was not finishing in the 15 minutes allocated; there- fore, this setup was dropped from the experiments. The complete set of experiments was run five times, and results were averaged when suitable (e.g. for normalized values). In all experiments, the amount of available resources was crafted dur- ing modeling time to be significantly less than the amount required. With- out loss of generality, it was assumed that all scheduled SLAs start in the future; a pseudo-random number generator following a gaussian distri- bution was used to return start/end times, required resources per SLA resource request, resource type, etc. Figure 11.4 illustrates the required resources over time, as a percentage of the available resources. This graph concerns only experiments with a planning time horizon of 30 units, but for other experiments the resource requirements are similar. 129 Chapter 11. Offline Planning with Limited Resources 0 0.5 1 1.5 2 2.5 3 0 5 10 15 20 25 30Ratio Time Required / available resources over time Figure 11.4: Required / available resources ratio 11.3.2 Results Resource Utilization The first goal was to see how good use of the available resources, do the various strategies offer. Figure 11.5 illustrates exactly that. It is straight- forward to see that Flexible-times provides the best utilization, converg- ing towards a ratio of 100% much faster than the rest. Flexible-items is second best, then follows Non-flexible. The two greedy algorithms achieve the worst resource utilization, but they should be compared only to the Non-flexible strategy, as they are also applying an all-or-nothing approach for the SLAs under consideration. Complexity and Runtime Behavior of Optimization It was also important to achieve an understanding of the complexity of the different ILP models, as all parameters were increasing in parallel. The solver’s output was intercepted to see how fast it is approaching the best-bound values, as well as how long the pre-solving phase takes. Fig- ures 11.6 - 11.9 illustrate these results. The numbers in parentheses (in the captions) refer to the number of SLAs, resource types, data centers, planning horizon time units, and resource requests in each SLA – in this particular order. In Figure 11.6, the simplest of the four instances is solved instanta- 130 Chapter 11. Offline Planning with Limited Resources 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0 5 10 15 20 25 30Resource utilizati on percent Time Average resource utilization Flexible-timeFlexible-itemsNon-flexibleGreedyBestFitGreedyFirstFit Figure 11.5: Average resource utilization (percentage) neously in all three models. In Figure 11.7, Non-flexible and Flexible-items strategies start solving immediately with a distance from best-bound ob- jective less than 1% on average, and quickly converge even further. The Flexible-time strategy goes through a very short pre-solving phase, then starts solving from a distance of approximately 10% on average, but con- verges in less than 10 seconds to almost the optimal value. In the third problem instance, the difference of complexity starts to become more apparent. The Non-flexible strategy starts solving almost immediately, with a large distance from best-bound objective (more than 15% on average), but converges in less than 30 seconds to less than 2%. Flexible-items behaves even better, starting from a distance of ap- proximately 2.5% and almost reaching best-bound a few seconds later. Flexible-time is pre-solving for approximately one minute on average, then remains quite far from best bound (more than 20%) for some time, and then quickly approaches best-bound – a total of almost 3.5 minutes for that). The results from the last instance of this problem were further reveal- ing about the exponential complexity of at least two out of three mod- els. While the Flexible-items strategy starts far from best-bound, it ap- proaches it to a distance less than 2% in approximately 30 seconds on average. Strategy Non-flexible starts at a distance of almost 30%, then slowly approaches best-bound taking almost 4 minutes to reach a dis- tance of 5%, and achieving a distance very close to optimality just before 131 Chapter 11. Offline Planning with Limited Resources -1-0.5 0 0.5 1 0 100 200 300 400 500 600 700 800 900Distance from b est bound Time Average approximation speed - 1 Non-flexibleFlexible-itemsFlexible-time Figure 11.6: Solving/Approximation speed for (200, 1, 1, 5, 1) 0 1 2 3 4 5 6 7 8 9 10 0 100 200 300 400 500 600 700 800 900Distance from b est bound Time Average approximation speed - 2 Non-flexibleFlexible-itemsFlexible-time Figure 11.7: Solving/Approximation speed for (400, 2, 2, 10, 2) 132 Chapter 11. Offline Planning with Limited Resources 0 5 10 15 20 25 0 100 200 300 400 500 600 700 800 900Distance from b est bound Time Average approximation speed - 3 Non-flexibleFlexible-itemsFlexible-time Figure 11.8: Solving/Approximation speed for (600, 3, 3, 15, 3) 0 5 10 15 20 25 30 0 100 200 300 400 500 600 700 800 900Distance from b est bound Time Average approximation speed - 4 Non-flexibleFlexible-itemsFlexible-time Figure 11.9: Solving/Approximation speed for (800, 4, 4, 20, 4) 133 Chapter 11. Offline Planning with Limited Resources the timeout. The last strategy, Flexible-time, is going through pre-solving phase to find a reasonably good first solution for an average of 7 min- utes. After that, it starts solving at a distance of more than 25% from best-bound objective, but never comes any closer before the 15-minute timeout. Clearly, with regard to complexity, Flexible-time is the most difficult problem and Flexible-items is the easiest. It is interesting to explore to some more detail, which of those five parameters causes the largest toll to performance degradation. Figures 11.10–11.14 illustrate the distance from best-bound for all combinatorial strategies. One can see that: • Increasing SLAs affect all three ILPs positively. This is reasonable, as it provides more options for a good initial choice when the integrity constraint is relaxed during pre-solving. • Increasing number of resource types affects negatively all strate- gies. This was expected; it is always underlined across literature, that increasing dimensions in multi-dimensional knapsack variants causes significant increase in the problem’s complexity (e.g. [105] by Kellerer et al.) • Like with resource types, the increase in the number of data centers also affects negatively the solving of the problem. Again, this is to be expected: In the multiple-knapsack variant, increasing number of knapsacks causes increased complexity. • Changing the planning horizon has, initially, a small (and inconsis- tent) effect for all three strategies. In order to get a better picture, more extensive experiments were conducted, increasing the time horizon up to 75 time units. It was found that, starting at 70 time units, the pre-solving phase of the Flexible-time strategy did not fin- ish within the 15-minute timeout. Therefore, the respective graph was produced with a planning horizon of 65 units at most. The result is that the respective increase affects only the Flexible-time strat- egy – as was expected, due to the increase in decision variables that it is causing. After a threshold (55 time units, in this case) the dis- tance from best-bound increases from less than 5%, to more than 20%. Flexible-items and Non-flexible are not affected in some con- sistent way. • Similarly with the planning horizon, changes in the number of re- source requests per SLA had small and inconsistent effects on prob- lem complexity. Thus, extended experiments were run with up to 15 resource requests per SLA. There is a trend that reduces distance 134 Chapter 11. Offline Planning with Limited Resources from best-bound for all three strategies, as the number of resource requests increases. This is an effect similar to the increase in the number of SLAs. 0 1 2 3 4 5 100 200 300 400 500 600 700 800 900 1000Distance from be st-bound Number of SLAs Effect of increasing SLAs Non-flexibleFlexible-itemsFlexible-time Figure 11.10: Effect of increasing number of SLAs Both greedy strategies were executing in less than 1 second for all problem problem instances, so from a performance point of view they are preferable. However, one must assess them also taking into account their fitness for the purpose and how well they behave with regard to approximation of an objective close to optimal. Affected Users Figures 11.15 and 11.16 provide an account of user satisfaction, based on numbers of accepted SLAs, and accepted SLA resource requests. Due to the allowed flexibility, as expected, Flexible-items and Flexible-time can accept many more SLAs, even if partially. Nevertheless, when looking at accepted SLA resource requests, their number is smaller – especially for Flexible-items, which accepts a resource request in full, or not at all. Although the model’s objective is to minimize penalties in all cases, it is important to pay attention to how many users are affected, and to what extent. The number of users receiving no service whatsoever is the largest with the two greedy algorithms, and the smallest with Flexible- time. 135 Chapter 11. Offline Planning with Limited Resources 0 1 2 3 4 5 6 1 2 3 4 5Distance from be st-bound Number of resource types Effect of increasing resource types Non-flexibleFlexible-itemsFlexible-time Figure 11.11: Effect of increasing number of resource types 0 1 2 3 4 5 6 7 8 9 1 2 3 4 5Distance from be st-bound Number of data centers Effect of increasing data centers Non-flexibleFlexible-itemsFlexible-time Figure 11.12: Effect of increasing number of data centers 136 Chapter 11. Offline Planning with Limited Resources 0 5 10 15 20 25 0 10 20 30 40 50 60 70Distance from be st-bound Amount of time units in planning horizon Effect of increasing planning horizon Non-flexibleFlexible-itemsFlexible-time Figure 11.13: Effect of increasing planning horizon 0 1 2 3 4 5 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15Distance from be st-bound Number of items in each SLA Effect of increasing items in each SLA Non-flexibleFlexible-itemsFlexible-time Figure 11.14: Effect of increasing number of resource requests in each SLA 137 Chapter 11. Offline Planning with Limited Resources ™ ℣␢ ℣┢ ℣☢ ℣✢ ⠢ ⠣␢ ⤪⬬⴮⼰ㄲ⸢ ㌲?⼰ㄲ⸬〴⸵㘢 ㌲?⼰ㄲ⸬㜵⸢ ㌰????㤴? 㨮???㤴? Figure 11.15: Placed SLAs per strategy (percent of total SLAs) ™ ℣␢ ℣┢ ℣☢ ℣✢ ℣⠢ ℣⤢ ℣⨢ ℣⬢ Ⱝⸯ〱㈳㐵ㄢ 㘵?㈳㐵ㄯ㌷ㄸ㤢 㘵?㈳㐵ㄯ㨸ㄢ 㘳????㰷? 㴱???㰷? Figure 11.16: Placed resource requests per strategy (percent of total re- source requests) 138 Chapter 11. Offline Planning with Limited Resources Distance from Optimality The last chart, Figure 11.17, provides an overview of avoided penalties, as a percentage to total penalties. It is reminded that the objective was to maximize avoided penalties. In order to be able to compare with the greedy algorithms, for which there is no notion of “best-bound objective”, the charts do not use that value as it was the case with Figures 11.6–11.9. Rather, the full penalties of all SLAs/resource requests under considera- tion are used. ™ ℣␢ ℣┢ ℣☢ ℣✢ ℣⠢ ℣⤢ ℣⨢ ℣⬢ Ⱝⸯ〱㈳㐵ㄢ 㘵?㈳㐵ㄯ㌷ㄸ㤢 㘵?㈳㐵ㄯ㨸ㄢ 㘳????㰷? 㴱???㰷? Figure 11.17: Avoided penalties per strategy (percent of total penalties) One can see that Flexible-time provides the best result, with Flexible- items a close second. Non-flexible is third, with quite some difference from the second, and finally the two greedy algorithms fall behind signifi- cantly. Experimental Conclusions Assessing all information presented, it appears that the best strategy to follow is Flexible-items. It performs very well in comparison to the other two ILPs with regard to the speed of approaching a reasonably good solution; it offers a very small difference from the best objective found by Flexible-time, less than 5% on average; and affected users, although more than Flexible-time, they are acceptable as a number under the ex- treme circumstances simulated. 139 Chapter 11. Offline Planning with Limited Resources 11.4 Related Work Some prior art related to SLA adjustment in case of resource failures, was already presented in Chapter 10. Goiri et al. [102] and Roy / Mukher- jee [103] monitor SLAs during runtime, and for each one that fails, the proposed systems try to find alternative execution environments. Con- versely, the work proposed here looks into the case of massive failures, in which case it is not clear whether the former approaches would suc- ceed due to the conflicts between the requesting agents. Penalty con- sequences are also unclear. In addition, the present chapter proposes a comprehensive model to treat such problems of extreme failures in their entirety. Outside the context of SLAs, nevertheless trying to optimize service performance, Song et al. present in [110] an architecture and amethodol- ogy for controlled, adaptive resource scheduling in the data center. They produce a linear program called the “resource flowing” problem, to model reservations and releases of resources by VMs; then solve it with the Sim- plex method. In the context of internal IT infrastructure at Intel Corporation used for chip simulations, Phan et al. present in [111] an approach for cost-based dynamic rescheduling of low-priority jobs. The problem they try to tackle is that low-priority jobs are often pre-empted by high-priority ones, al- though there are significant amounts of available resources in the infras- tructure as a whole. Process migration is not possible due to problems with multi-threaded processes, and VMs are not used due to the heavy performance toll (therefore, VM migration is not possible either). Hence, the report explores the possibility of restarting the processes. Whether a job will be kept suspended or be rescheduled is a cost-based decision, and has to do with “wasted run time” (the process execution time leading to pre-emption from a high-priority job), expected remaining completion time, and other similar measures. With regard to the knapsack problem and its applications in the area, the work closest to the one presented here is [112] by Lau and Lim. They use the multi-period, multi-dimensional knapsack problem to model what is known as Available-to-Promise in logistics; that is, the capability of a supplier to commit to meeting the requests of customers, during a pe- riod of time (the planning horizon). Clearly this bears a lot of similarity to IaaS SLAs, where the provider promises to make available the resources during some time period. It is particularly applicable to the case of batch SLAs, where the provider has some freedom as regards when to offer the resources. The authors develop a two-phase heuristic, comprising a greedy algorithm followed by a Tabu-search improvement. Additionally, 140 Chapter 11. Offline Planning with Limited Resources an ant-colony based approach is evaluated. Their work is trying to opti- mize profit; and an additional difference to the work presented here is the single vs multiple knapsacks. Finally, in [113], Kelly makes explicit the relationship between the knapsack problem, and the Winner Determination Problem in combina- torial auctions, which offer a suitable model for multiple parallel SLA ne- gotiations among autonomous agents. The actual motivation comes from resource allocation requirements in data centers, and the formulation is a multi-dimensional, multiple-choice knapsack problem. Multiple-choice is related to the selection of only one, out of many options in a list of requested resource bundles (which has been solved here at an earlier stage, using BDDs). The multiple dimensions represent resource types, as is also the case in the present work. The author presents a dynamic programming solver, and recognizes the significant complexity increase as the number of resource types increases. The author’s model concerns a single data center, so no discussion exists on multiple knapsacks. Summary and Conclusions In this Chapter, an Integer Linear Programming model was developed, as a means of SLA rescheduling and penalty minimization in the case that resources become scarce. The working example here was the malfunc- tion of a complete data center. Through extensive experimentation, it was possible to see that the constructed model behaves favorably to greedy algorithms, although foreseeably slower to approach some optimal value. Three combinatorial rescheduling strategies were explored: One with no flexibility whatsoever (complete acceptance of an SLA, or complete re- jection); one where it was possible to accept and reschedule only some of the requested resource bundles within an SLA; and one where it was additionally possible to offer some bundles later, or release them earlier than agreed. This third approach offered the best final objective on aver- age, normalized across all experiments. Yet, for large instances the com- plexity grows so much that even after a lot of time it remained very far from optimal values. The non-flexible strategy was faster, but its results with regard to average achieved objective were quite worse. The fastest method was the one to accept some of the resource bundles, but for the time and duration agreed in the SLA. Its results, on average, were only slightly worse than the Flexible-time strategy, therefore its performance is enough reason to prefer it in realistic situations. 141 142 Part IV Conclusions 143 144 Chapter 12 Conclusions Throughout this dissertation the question asked was, what are those nec- essary building blocks to enable service hierarchies with dependability, using SLAs as an instrument for this purpose. It was the question that the thesis tried to answer to, by providing new approaches and solutions towards this goal. As mentioned in Section 1.2, when looking at the topic from a very high level, automated SLA negotiation and provisioning can be reduced to the schematic of Figure 1.2 (repeated here for convenience, as Fig- ure 12.1). The dashed boxes illustrate the gaps to which this thesis tried to provide solutions and answers, in addition to a formal problem defini- tion. 12.1 Summary of Contributions Defining the problem generically, it was possible to express the relation- ships among services and their negotiable properties independent of the application domain. This definition cannot be directly applied, but rather needs significant refinement before using it in any specific use case; nev- ertheless, it provides the necessary grounds for getting an understanding of the problem, and coming closer to domain-independent recipes that can be combined with other tools and achieve the set goal of automated, hierarchical SLA management. Prior art was discussing service depen- dencies, or looking at QoS-related dependencies in specific domains. This is, to the best of the author’s knowledge, the first effort to define SLA de- pendencies and hierarchies formally. An additional very important result of this specific work, is the conclusion that converting an SLA to many others cannot be disentangled from SLA negotiation – rather, the two pro- cesses are closely related and, as amatter of fact, one could say that “SLA 145 Chapter 12. Conclusions Provider Customer Provider Contract Sub-Contract Request Response Request Response Provisioning Provisioning Resources Resources Contracts bear responsibility; there are penalties if breached How to deduce a subcontract, starting from the contract? What do these boxes really look like? How do they function? How to assess if the contract can/should be established? (For infrastructure) How to make best use of existing resources? (For infrastructure) How are resources defined? How to signal and instruct them? Figure 12.1: High-level scenario translation” is a part of negotiation itself. The next step was to define amanagement architecture. The resulting SLA Management Instance (SAMI) is a complete architectural pattern that has no specific ties to any application domain; rather, it foresees that it must be extended before applied. That said, a generic infrastructure for a SAMI was created as part of the project that supported this work [1], using state-of-the-art tooling and resulting in a framework already being customized and applied by four industrial use cases. The presented ar- chitecture is extending State-of-the-Art in a number of ways, the most important of which is the special consideration for SLA hierarchies. Fur- ther to that, the domain-agnostic and technology-independent nature of the architecture makes it applicable to very diverse use cases. Before discussing SLA assessment, a link to business objectives was necessary. A Service Level Agreement is a contract, and as such, it in- cludes not only rewards (service price), but also penalties if the described service is not delivered. One could argue that without the penalties, an SLA is nothing more than a smart provisioning request. There has been some prior art in SLA penalty models, but the author felt that it was not sufficient and was lacking necessary expressive power. Thus, 146 Chapter 12. Conclusions a new model was proposed in this work. Using this model it is possi- ble to form very complex penalty expressions, also associated with con- tract price and customer’s business values; therefore, it includes notions of fairness. In addition, the application of this generic penalty model to an example use case illustrated that it is possible to deduce subcontract penalties based on the penalties within the original contracts. The last part of generic, domain-independent contributions was re- lated to the decision-making part of negotiation. The dissertation’s con- tribution is the application of a structure well-known in VLSI design, the Bi- nary Decision Diagrams as a model for SLA representation. This is based on the capability to reduce SLA elements in boolean constants or vari- ables. The author deems this part of the thesis to be an important contri- bution, believing it can pave the way for efficient automated SLAmanage- ment. In the SLA@SOI project, it became obvious that SLAs, if pragmatic, they can be extremely complex. Indeed, so complex that their sheer pars- ing, and of course their processing for reasons of decision-making, is a task of enormous difficulty. The Reduced Ordered BDD structure has the very interesting property of being canonical; that is, it provides a unique and unambiguous representation of the boolean function from which it is constructed. Eventually, given domain-specific term dictionaries and the knowledge how to use them for creating SLAs, combined with BDDs, can result in structures that are immensely more manageable than their orig- inal SLA counterparts. As a proof of concept, for a specific SLA offer the approach showed how it is possible to use a BDD and decide if the offer should be accepted, or not. Following, Part III of the thesis applied these principles, albeit in sim- ple form, in the domain of Cloud Computing and more specifically Infras- tructure as a Service. One necessary detail for the specific domain, that had to prepend the discussion on IaaS SLA planning, was how to actually manage the resources themselves. This concerns a proper representa- tion, as well as a set of functions for reserving them – immediately or in advance. The supporting models provided are standards-compliant and designed to be agnostic of specific resource types: They may refer to computing servers, network storage, active network equipment and cir- cuits, scientific instruments, sensors, etc. Already and by itself, the sup- ported resource diversity is an interesting property of the model, and a contribution to State-of-the-Art. Having the necessary constructs available, an infrastructure SLA con- tent was described, and based on that two different greedy algorithms were tested as a means for online SLA planning within a data center. This corresponds to the lower of the two provider entities in Figure 12.1. The 147 Chapter 12. Conclusions evaluation took place in the context of the Planning, Optimization and Adjustment module of the SAMI architecture, as a possible implementa- tion for this specific IaaS setup. As there has been significant amount of work in this area during the last few years, the main interest of the work here was to see what is the energy footprint of these two fast algorithms, and how well they were performing with regard to resource utilization. The conclusion after experimentation was that a best-fit approach, that assigns SLAs to those servers that end up with the least available re- sources after the assignment, results in significant gains with regard to the amount of servers that are switched on. Nevertheless, the comple- tion time of the algorithm was an order of magnitude worse than its first- fit counterpart, for three pools of 5000 servers each. Even as such, the delay of up to 20 seconds for our setup can probably be considered as acceptable. Finally, a penalty-driven planning and optimization methodology was considered in the last Chapter. The scenario in mind was that of a resource- scarce environment, where resource requirements (far) exceed resource availability. Three ILP models were developed and solved, to represent three different strategies of varying flexibility with regard to time, or grouping constraints. These strategies express different possible ways that a coordinating SAMI instance (the higher one in Figure 1.2) would react to this scenario, while negotiating with data-center SAMIs. Via ex- tensive experimentation, it became clear that the best strategy to apply – at least in small to average problem instances – is to consider the existing SLAs with the possibility of discarding some of their parts (i.e. some of the requested resources). Experimenting with the time dimension resulted in very poor performance, while selecting to accept SLAs in full or not at all, resulted in significantly more penalties. With regard to the latter, two tested greedy algorithms (first-fit and best-fit) performed even worse. 12.2 Critical View of Presented Work The area tackled by this doctoral dissertation is a very large one, far too large to be addressed here in full. Various limitations of time and re- sources in general affected the results to some extent, although hopefully not enough to have a negative effect on the thesis’ contributions. The most important problem that the author identifies, is the complex- ity of some of the produced formalizations. The penalty model from Chap- ter 7 is extremely powerful, nevertheless this also means that a large number of parameters must be taken into account. It assumes that re- lationships among different SLA guarantees and service parameters are 148 Chapter 12. Conclusions known to the customer, and therefore assumes a very detailed model of interactions within an application and the supporting SOI. In addition, it is assumed that the customer is fully aware of the business value that certain objectives and guarantees have, which may be a wrong assump- tion in plenty realistic scenarios. The choice to favor expressiveness over simplicity was a conscious decision, in hope that future developments will allow such fine-grained understanding of SOAs, SOIs and the respective applications. Similarly, the application of BDDs to SLA modeling presumes the ca- pability to identify terms generically, within a domain and application. It requires detailed knowledge of the semantics of an application, which could have important repercussions on the wide applicability of the pro- posedmethodology. Additionally to that, there is a gap in using the canon- ical form of the structure for outsourcing and decision-making, related to matching paths from different BDDs and finding out whether they are equivalent. The author believes that the area to look into is that of Term Rewriting Systems [114]. This could, nevertheless, be a different doc- toral thesis on its own right, and therefore no further work towards that direction took place. Finally, the author would be very pleased if the proposed architecture was further applied to real-world use cases in the context of the project that funded the present work. Due to time restrictions and the fact that the project is ongoing, this was not possible. 12.3 Concluding Statements and Future Direc- tions If one final statement must be extracted from the present thesis, is that hierarchical SLA management is possible. There are multiple challenging issues about it, and certainly this dissertation does not reply to every- thing, yet it hopefully contributes to the more apparent areas. Furher- more, the extension to IaaS planning illustrates the applicability of these concepts onto a domain which is currently one of the most active areas of service computing. One problem that has been identified in multiple occasions and var- ious fora, is the lack of a common language for SLA negotiation. This includes both a model and a protocol. WS-Agreement is an important step in that direction, but at the moment has two significant problems: 1. It is considered to be very complex; specifically, its reliance on the Web Services Resource Framework (WSRF) [115] has proved to be 149 Chapter 12. Conclusions problematic for the community; 2. It supports only single-shot negotiations, without multiple rounds which are necessary for meaningful business interactions. Service registries for automated discovery and binding are not suffi- ciently polished yet, either. Both compatibility issues, but also the com- plexity of the topic (security, matching and semantics, policy / administra- tive domains) hinder the adoption of global service registries and large, dynamic service ecosystems. Monitoring SLAs is also a topic that is widely researched at the mo- ment, especially due to the commercial interest in the area. Here, there are two important considerations; one is of technical nature, related to what must be monitored and what are the associations among different signals received as part of a single SLA being monitored. The other prob- lem is one of policy; that is, providers do now allow transparent access to service execution parameters, making it very difficult for customers to have an educated understanding of the SLA state. The gap seems to be closing, with recent trends of trusted third parties: companies that act as mediators, are trusted by both the providers and the customers, and monitor the SLAs on behalf of both. Negotiation and planning strategies (including subcontracting deci- sions) remains, without doubt, the most demanding field of automated SLA management. The author strongly believes that the way to tackle this problem is to produce a sufficient amount of generic recipes (such as the BDD representation), that can be combined and applied on different domains. The complexity of SLA planning can be notoriously large, as also illustrated in Chapter 11. Areas of mathematics such as combinatorial optimization, operational research, evolutionary computing, and others, provide a solid foundation to build upon towards this direction; “stand- ing on the shoulders of giants”, combined with original cross-disciplinary ideas are sure to equip us better for the purpose. 150 Bibliography [1] “The SLA@SOI Project.” http://www.sla-at-soi.eu/ (Last retrieved: 09/2010). [2] M. Papazoglou and D. Georgakopoulos, “Service-Oriented Comput- ing,” Communications of the ACM, vol. 46, no. 10, pp. 25–28, 2003. [3] L. Zhang, J. Zhang, and H. Cai, Services Computing. Springer-Verlag New York Inc, 2007. [4] E. Christensen, F. Curbera, G. Meredith, and S. Weerawarana, “Web Services Description Language (WSDL) 1.1.” World Wide Web Con- sortium - W3C, 2001. [5] Organization for the Advancement of Structured Information Stan- dards - OASIS, “Web Services Business Process Execution Lan- guage (WSBPEL).” http://www.oasis-open.org/committees/wsbpel/ (Last retrieved: 09/2010). [6] I. Katzela and M. Schwartz, “Schemes for fault identification in communication networks,” IEEE/ACM Transactions on Networking, vol. 3, pp. 753–764, Dec 1995. [7] B. Gruschke, “Integrated event management: Event correlation us- ing dependency graphs,” in Proceedings of the 9th IFIP/IEEE Inter- national Workshop on Distributed Systems: Operations & Manage- ment (DSOM 98), pp. 130–141, 1998. [8] E. Knorr and G. Gruman, “What cloud computing really means.” http://www.infoworld.com/d/cloud-computing/what-cloud- computing-really-means-031 (Last retrieved: 09/2010). [9] “Amazon Elastic Compute Cloud - EC2.” http://aws.amazon.com/ec2/ (Last retrieved: 09/2010). [10] “Amazon Simple Storage Service - S3.” http://aws.amazon.com/s3/ (Last retrieved: 09/2010). 151 BIBLIOGRAPHY [11] M. Rappa, “The utility business model and the future of computing services,” IBM Systems Journal, vol. 43, no. 1, pp. 32–42, 2004. [12] I. Foster, C. Kesselman, and S. Tuecke, “The anatomy of the grid: Enabling scalable virtual organizations,” International Journal of High Performance Computing Applications, vol. 15, no. 3, p. 200, 2001. [13] Y. Chen, S. Iyer, X. Liu, D. Milojicic, and A. Sahai, “SLA Decomposi- tion: Translating Service Level Objectives to System Level Thresh- olds,” International Conference on Autonomic Computing, p. 3, 2007. [14] A. Andrieux, K. Czajkowski, A. Dan, K. Keahey, H. Lud- wig, J. Pruyne, J. Rofrano, S. Tuecke, and M. Xu, “Web Services Agreement specification (WS-Agreement).” http://www.ogf.org/documents/GFD.107.pdf, 2007 (Last retrieved: 09/2010). [15] K. Nichols, S. Blake, F. Baker, and D. Black, “Definition of the Dif- ferentiated Services Field (DS Field) in the IPv4 and IPv6 Headers,” tech. rep., RFC 2474, 12 1998. [16] C.-H. Chang, P. Kourtessis, and J. Senior, “GPON service level agree- ment based dynamic bandwidth assignment protocol,” Electronics Letters, vol. 42, pp. 1173–1174, September 2006. [17] W. Fawaz, B. Daheb, O. Audouin, M. Du-Pond, and G. Pujolle, “Ser- vice level agreement and provisioning in optical networks,” IEEE Communications Magazine, vol. 42, pp. 36–43, Jan 2004. [18] D. Nowak, P. Perry, and J. Murphy, “Bandwidth allocation for ser- vice level agreement aware Ethernet passive optical networks,” in Global Telecommunications Conference, 2004. GLOBECOM ’04. IEEE, vol. 3, pp. 1953–1957, 2004. [19] A. Sahai, A. Durante, and V. Machiraju, “Towards automated SLA management for Web services,” Hewlett-Packard Research Report HPL-2001-310 (R. 1), 2001. [20] R. Yahyapour and P. Wieder, Encyclopedia of Software Engineer- ing, ch. Service Level Agreements in Grids. Taylor & Francis Group, 2009. To appear. 152 BIBLIOGRAPHY [21] The TeleManagement Forum, “SLA Management Handbook, Vol- ume 2, Concepts and Principles, Release 2.5,” tech. rep., The Tele- Management Forum, Morristown, New Jersey, United States, 2005. [22] A. Keller and H. Ludwig, “The WSLA framework: Specifying and monitoring service level agreements for web services,” Journal of Network and Systems Management, vol. 11, no. 1, pp. 57–81, 2003. [23] R. Smith, “The contract net protocol,” IEEE Transactions on Com- puters, vol. 29, no. 12, pp. 1104–1113, 1980. [24] C. Daskalakis, P. W. Goldberg, and C. H. Papadimitriou, “The com- plexity of computing a Nash equilibrium,” Commun. ACM, vol. 52, no. 2, pp. 89–97, 2009. [25] L. Jin, V. Machiraju, and A. Sahai, “Analysis on Service Level Agree- ment of Web Services,” tech. rep., Software Technology Labora- tory, HP Laboratories Palo Alto, 2002. [26] OASIS, “Universal Description, Discovery and Integration (UDDI).” http://www.oasis-open.org/committees/uddi-spec, 02 2005 (Last retrieved: 09/2010). [27] A. Sahai, V. Machiraju, M. Sayal, A. van Moorsel, and F. Casati, “Au- tomated sla monitoring for web services,” Management Technolo- gies for E-Commerce and E-Business Applications, pp. 28–41, 2002. [28] A. Leff, J. T. Rayfield, and D. M. Dias, “Service-level agreements and commercial grids,” IEEE Internet Computing, vol. 7, no. 4, pp. 44– 50, 2003. [29] H. Zhang, K. Keahey, and W. Allcock, “Providing data transfer with qos as agreement-based service,” Services Computing, 2004. (SCC 2004). Proceedings. 2004 IEEE International Conference on, pp. 344–353, Sept. 2004. [30] M. Aiello, G. Frankova, and D. Malfatti, “What’s in an agreement? an analysis and an extension of ws-agreement,” Service-Oriented Computing - ICSOC 2005, pp. 424–436, 2005. [31] J. Sauvé, F. Marques, A. Moura, M. Sampaio, J. Jornada, and E. Radz- iuk, “Sla design from a business perspective,” Ambient Networks, pp. 72–83, 2005. [32] S. S. Chawathe, “Strategic web-service agreements,” Web Ser- vices, IEEE International Conference on, vol. 0, pp. 119–126, 2006. 153 BIBLIOGRAPHY [33] P. McKee, S. Taylor, M. Surridge, R. Lowe, and C. Ragusa, “Strate- gies for the service market place,” Grid Economics and Business Models, pp. 58–70, 2007. [34] D. Barbagallo and M. Comuzzi, “Towards understanding the role of adverse selection and moral hazard in automated negotiation of service level agreements,” in SIPE ’08: Proceedings of the 3rd international workshop on Services integration in pervasive envi- ronments, (New York, NY, USA), pp. 7–12, ACM, 2008. [35] H. Wada, P. Champrasert, J. Suzuki, and K. Oba, “Multiobjective optimization of sla-aware service composition,” Services, IEEE Congress on, vol. 0, pp. 368–375, 2008. [36] H. Bidgoli, The Internet Encyclopedia. John Wiley & Sons Inc, 2004. [37] “Enterprise resource planning.” http://en.wikipedia.org/wiki/Enter- prise_resource_planning (Last retrieved: 09/2010). [38] M. Ehrgott, Multicriteria Optimization. Springer-Verlag New York, Inc., 2005. [39] S. Fatima, M. Wooldridge, and N. Jennings, “A Comparative Study of Game Theoretic and Evolutionary Models of Bargaining for Soft- ware Agents,” Artificial Intelligence Review, vol. 23, pp. 187–205, 04 2005. [40] C. Figueroa, N. Figueroa, A. Jofre, A. Sahai, Y. Chen, and S. Iyer, “A Game Theoretic Framework for SLA Negotiation,” tech. rep., HP Laboratories, 2008. [41] A. Keller, U. Blumenthal, and G. Kar, “Classification and Computa- tion of Dependencies for Distributed Management,” IEEE Sympo- sium on Computers and Communications, p. 78, 2000. [42] A. Keller and G. Kar, “Determining service dependencies in dis- tributed systems,” in IEEE International Conference on Communi- cations (ICC 2001), vol. 7, pp. 2084–2088, 2001. [43] P. Hasselmeyer, “Managing dynamic service dependencies,” in 12th International Workshop on Distributed Systems: Operations & Management (DSOM 2001), pp. 141–150, 2001. [44] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. A. Maltz, and M. Zhang, “Towards highly reliable enterprise network services via 154 BIBLIOGRAPHY inference of multi-level dependencies,” in SIGCOMM ’07: Proceed- ings of the 2007 conference on Applications, technologies, archi- tectures, and protocols for computer communications, pp. 13–24, 2007. [45] E. Di Nitto, M. Di Penta, A. Gambi, G. Ripa, and M. Villani, “Nego- tiation of service level agreements: An architecture and a search- based approach,” LECTURE NOTES IN COMPUTER SCIENCE: Proc. ICSOC’07, vol. 4749, p. 295, 2007. [46] D. Wolpert and W. Macready, “No free lunch theorems for optimiza- tion,” IEEE transactions on evolutionary computation, vol. 1, no. 1, pp. 67–82, 1997. [47] D. Wolpert and W. Macready, “Coevolutionary free lunches,” IEEE Transactions on Evolutionary Computation, vol. 9, no. 6, pp. 721– 735, 2005. [48] “Spring.” http://www.springsource.org/osgi (Last retrieved: 09/2010). [49] “OSGi.” http://www.osgi.org/ (Last retrieved: 09/2010). [50] “Equinox.” http://www.eclipse.org/equinox/ (Last retrieved: 09/2010). [51] H. Ludwig, A. Dan, and R. Kearney, “Cremona: an architecture and library for creation andmonitoring of WS-agreements,” in Proc. 2nd international Conference on Service Oriented Computing (ICSOC ’04), pp. 65–74, 2004. [52] J. Padgett, K. Djemame, and P. Dew, “Grid-Based SLA Manage- ment,” Advances in Grid Computing - EGC 2005, pp. 1076–1085, 2005. [53] G. Koumoutsos, S. Denazis, and K. Thramboulidis, “SLA e- Negotiations, Enforcement and Management in an Autonomic En- vironment,” Modelling Autonomic Communications Environments, pp. 120–125, 2008. [54] A. Keller and H. Ludwig, “Defining and Monitoring Service Level Agreements for dynamic e-Business,” in Proceedings of the 16th System Administration Conference (LISA 2002), 2002. [55] M. Debusmann and A. Keller, “SLA-driven management of dis- tributed systems using the common information model,” IFIP/IEEE 155 BIBLIOGRAPHY 8th Int. Symp. on Integrated Network Management, pp. 563–576, March 2003. [56] Distributed Management Task Force, “Common Information Model,” 01 2008. [57] C. Bartolini, A. Boulmakou, A. Christodoulou, A. Farrell, M. Salle, and D. Trastour, “Management by Contract: IT Management driven by Business Objectives,” Proc. 11th HP Openview University Associa- tion (HP-OVUA) Workshop, 2004. [58] A. Paschke and M. Bichler, “Knowledge representation concepts for automated SLA management,” Decision Support Systems, vol. 46, no. 1, pp. 187 – 205, 2008. [59] M. J. Buco, R. N. Chang, L. Z. Luan, C. Ward, J. L. Wolf, and P. S. Yu, “Utility computing SLA management based upon business objec- tives,” IBM Syst. J., vol. 43, no. 1, pp. 159–178, 2004. [60] M. B. Chhetri, J. Lin, S. Goh, J. Y. Zhang, R. Kowalczyk, and J. Yan, “A Coordinated Architecture for the Agent-based Service Level Agree- ment Negotiation of Web Service Composition,” Australian Soft- ware Engineering Conference, pp. 90–99, 2006. [61] C. Bartolini, C. Preist, and N. Jennings, “A generic software frame- work for automated negotiation,” in First International Conference on Autonomous Agent and Multi-Agent Systems, 2002. [62] J. B. Kim and A. Segev, “A framework for dynamic eBusiness negoti- ation processes,” in IEEE International Conference on E-Commerce (CEC 2003), pp. 84–91, 06 2003. [63] M. Ayatollahzadeh Shirazi and A. Barfouroush, “A Conceptual Framework for Modeling Automated Negotiations in Multiagent Systems,” Negotiation Journal, vol. 24, no. 1, pp. 45–70, 2008. [64] A. Dan, H. Ludwig, and G. Pacifici, “Web service differentiation with service level agreements,” White Paper, IBM Corporation, 2003. [65] M. Becker, N. Borrisov, V. Deora, O. Rana, and D. Neumann, “Using k-Pricing for Penalty Calculation in Grid Market,” in Hawaii Inter- national Conference on System Sciences, Proceedings of the 41st Annual, pp. 97–97, Jan. 2008. [66] R. Jurca and B. Faltings, “Reputation-Based Service Level Agree- ments for Web Services,” Service-Oriented Computing - ICSOC 2005, pp. 396–409, 2005. 156 BIBLIOGRAPHY [67] O. Rana, M. Warnier, T. Quillinan, and F. Brazier, “Monitoring and Reputation Mechanisms for Service Level Agreements,” Grid Eco- nomics and Business Models, pp. 125–139, 2008. [68] J. Kosinski, D. Radziszowski, K. Zielinski, S. Zielinski, G. Przybylski, and P. Niedziela, “Definition and Evaluation of Penalty Functions in SLA Management Framework,” Networking and Services, Interna- tional conference on, pp. 176–181, 2008. [69] S. A. Cook, “The complexity of theorem-proving procedures,” in STOC ’71: Proceedings of the third annual ACM symposium on The- ory of computing, (New York, NY, USA), pp. 151–158, ACM, 1971. [70] R. Bryant, “Graph-Based Algorithms for Boolean Function Manipu- lation,” IEEE Transactions on Computers, vol. C-35, pp. 677–691, Aug. 1986. [71] C. Lee, “Representation of switching circuits by binary decision di- agrams,” Bell System Technical Journal, no. 38, pp. 985–999, 1959. [72] S. Akers, “Binary Decision Diagrams,” IEEE Transactions on Com- puters, vol. C-27, pp. 509–516, June 1978. [73] R. Ebendt, R. Drechsler, and G. Fey, Advanced BDD optimization. Springer, 2005. [74] C. E. Shannon, “A symbolic analysis of relay and switching circuits,” Trans. AIEE, no. 57, pp. 713–723, 1938. [75] P. Bhoj, S. Singhal, and S. Chutani, “SLA management in feder- ated environments,” Computer Networks, vol. 35, no. 1, pp. 5 – 24, 2001. [76] R. E. Bryant, “Symbolic Boolean manipulation with ordered binary- decision diagrams,” ACM Comput. Surv., vol. 24, no. 3, pp. 293– 318, 1992. [77] S. Friedman and K. Supowit, “Finding the optimal variable order- ing for binary decision diagrams,” IEEE Transactions on Computers, vol. 39, pp. 710–713, May 1990. [78] R. Rudell, “Dynamic variable ordering for ordered binary decision diagrams,” in ICCAD ’93: Proc. 1993 IEEE/ACM international confer- ence on Computer-aided design, pp. 42–47, IEEE Computer Society Press, 1993. 157 BIBLIOGRAPHY [79] M. Comuzzi, C. Kotsokalis, G. Spanoudakis, and R. Yahyapour, “Es- tablishing and Monitoring SLAs in Complex Service Based Sys- tems,” IEEE International Conference on Web Services, pp. 783– 790, 2009. [80] W. Binder, I. Constantinescu, and B. Faltings, “Scalable Automated Service Composition Using a Compact Directory Digest,” Database and Expert Systems Applications, pp. 317–326, 2006. [81] A. Campailla, S. Chaki, E. Clarke, S. Jha, and H. Veith, “Efficient filtering in publish-subscribe systems using binary decision dia- grams,” in ICSE ’01: Proc. 23rd International Conference on Soft- ware Engineering, pp. 443–452, 2001. [82] “JavaBDD.” http://javabdd.sourceforge.net/ (Last retrieved: 09/2010). [83] A. S. McGough, A. Akram, L. Guo, M. Krznaric, L. Dickens, D. Colling, J. Martyniak, R. Powell, P. Kyberd, C. Huang, C. Kotsokalis, and P. Tsanakas, “GRIDCC: A Real-time Grid workflow system with QoS,” Scientific Programming, vol. 15, no. 4, pp. 213–234, 2007. [84] J. Cardoso, A. Sheth, J. Miller, J. Arnold, and K. Kochut, “Quality of service for workflows and web service processes,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 1, no. 3, pp. 281–308, 2004. [85] D. Colling, T. Ferrari, Y. Hassoun, C. Huang, C. Kotsokalis, A. S. Mc- Gough, E. Ronchieri, Y. Patel, and P. Tsanakas, “On Quality of Ser- vice Support for Grid Computing,” Grid Enabled Remote Instrumen- tation, pp. 313–327, 2009. [86] “OGF GLUE-WG.” http://forge.ogf.org/sf/projects/glue-wg (Last re- trieved: 09/2010). [87] A. Sim, A. Shoshani, et al., “The Storage Resource Man- ager Interface Specification v2.2.” http://sdm.lbl.gov/srm- wg/doc/SRM.v2.2.pdf (Last retrieved: 09/2010). [88] M. Botts, “Sensor Model Language (SensorML) for In-situ and Re- mote Sensors,” tech. rep., Open Geospatial Consortium, 2004. [89] K. Czajkowski, I. Foster, C. Kesselman, V. Sander, and S. Tuecke, “SNAP: A Protocol for Negotiating Service Level Agreements and Coordinating Resource Management in Distributed Systems,” in Job 158 BIBLIOGRAPHY Scheduling Strategies for Parallel Processing, vol. 2537 of Lecture Notes in Computer Science, pp. 153–183, Springer Berlin / Heidel- berg, 2002. [90] Y.-S. Kee, D. Logothetis, R. Huang, H. Casanova, and A. Chien, “Ef- ficient resource description and high quality selection for virtual grids,” Cluster Computing and the Grid, IEEE International Sympo- sium on, vol. 1, pp. 598–606, 2005. [91] G. P. Koslovski, P. V.-B. Primet, and A. S. Charão, “VXDL: Virtual Resources and Interconnection Networks Description Language,” in Networks for Grid Applications, vol. 2 of Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecom- munications Engineering, pp. 138–154, Springer Berlin Heidelberg, 2009. [92] I. Wegener, The complexity of Boolean functions. Teubner, 1987. [93] C. Gröpl, Binary decision diagrams for random boolean functions. PhD thesis, Mathematisch-Naturwissenschaftlichen Fakultät II der Humboldt-Universität zu Berlin, 1999. [94] L. Epstein and M. Levy, “Dynamic multi-dimensional bin packing,” Journal of Discrete Algorithms, 2010 (Currently under publication). [95] T. Gonzalez, ed., Handbook of approximation algorithms and meta- heurististics. Chapman & Hall, 2007. [96] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson, Approximation algorithms for bin packing: a survey. Boston, MA, USA: PWS Pub- lishing Co., 1997. [97] H. N. Van, F. D. Tran, and J.-M. Menaud, “SLA-Aware Virtual Resource Management for Cloud Infrastructures,” Computer and Informa- tion Technology, International Conference on, vol. 1, pp. 357–362, 2009. [98] F. Hermenier, X. Lorca, J.-M. Menaud, G. Muller, and J. Lawall, “En- tropy: a consolidation manager for clusters,” in VEE ’09: Proceed- ings of the 2009 ACM SIGPLAN/SIGOPS international conference on Virtual execution environments, (New York, NY, USA), pp. 41–50, ACM, 2009. [99] H. Nakada, T. Hirofuchi, H. Ogawa, and S. Itoh, “Toward Virtual Machine Packing Optimization Based on Genetic Algorithm,” in 159 BIBLIOGRAPHY Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living, vol. 5518 of Lecture Notes in Computer Science, pp. 651–654, Springer Berlin / Heidel- berg, 2009. [100] Y. Ajiro and A. Tanaka, “Improving packing algorithms for server consolidation,” in Proc. Computer Measurements Group 2007 In- ternational Conference, 2007. [101] M. A. S. Netto, K. Bubendorfer, and R. Buyya, “SLA-Based advance reservations with flexible and adaptive time QoS parameters,” in Service-Oriented Computing – ICSOC 2007, vol. 4749 of Lecture Notes in Computer Science, pp. 119–131, Springer Berlin / Heidel- berg, 2010. [102] I. Goiri, F. Julià, J. Ejarque, M. De Palol, R. Badia, J. Guitart, and J. Torres, “Introducing Virtual Execution Environments for Applica- tion Lifecycle Management and SLA-Driven Resource Distribution within Service Providers,” in 8th IEEE International Symposium on Network Computing and Applications, pp. 211–218, IEEE, 2009. [103] S. Roy and N. Mukherjee, “Adaptive Execution of Jobs in Computa- tional Grid Environment,” Journal of Computer Science and Tech- nology, vol. 24, pp. 925–938, September 2009. [104] D. Ardagna, M. Trubian, and L. Zhang, “SLA based resource allo- cation policies in autonomic environments,” Journal of Parallel and Distributed Computing, vol. 67, no. 3, pp. 259 – 270, 2007. [105] H. Kellerer, U. Pferschy, and D. Pisinger, Knapsack problems. Springer Verlag, 2004. [106] “PuLP-OR Modeler.” http://code.google.com/p/pulp-or/ (Last re- trieved: 09/2010). [107] “GUROBI Solver.” http://gurobi.com/ (Last retrieved: 09/2010). [108] H. Mittelmann, “Mixed Integer Linear Programming Bench- mark (parallel codes).” http://plato.asu.edu/ftp/milpc.html (Last re- trieved: 09/2010). [109] “Performance Profile Summary.” http://www.coin- or.org/GAMSlinks/benchmarks/MIP/bestSolver_100209/profile_sum- mary.htm (Last retrieved: 09/2010). 160 BIBLIOGRAPHY [110] Y. Song, H. Wang, Y. Li, B. Feng, and Y. Sun, “Multi-Tiered On- Demand Resource Scheduling for VM-Based Data Center,” in CC- GRID ’09: Proceedings of the 2009 9th IEEE/ACM International Sym- posium on Cluster Computing and the Grid, (Washington, DC, USA), pp. 148–155, IEEE Computer Society, 2009. [111] L. Phan, Z. Zhang, S. Jain, G. Tan, B. Loo, and I. Lee, “Cost-based Dynamic Job Rescheduling: A Case Study of the Intel Distributed Computing Platform,” tech. rep., University of Pennsylvania and In- tel Corporation, 2010. [112] H. Lau and M. Lim, “Multi-period multi-dimensional knapsack prob- lem and its applications to available-to-promise,” in Proceedings of the International Symposium on Scheduling (ISS), Hyogo, Japan, pp. 94–99, Citeseer, 2004. [113] T. Kelly, “Generalized Knapsack Solvers for Multi-unit Combinato- rial Auctions: Analysis and Application to Computational Resource Allocation,” in Agent-Mediated Electronic Commerce VI, vol. 3435 of Lecture Notes in Computer Science, pp. 73–86, Springer Berlin / Heidelberg, 2005. [114] M. Bezem, J. Klop, and R. de Vrijer, Term rewriting systems. Cam- bridge Univ Pr, 2003. [115] OASIS, “Web Services Resource Framework (WSRF) v1.2.” http://www.oasis-open.org/committees/wsrf/, 04 2006 (Last re- trieved: 09/2010). 161 162 Index Agreement Initiator, 95 Agreement Responder, 95 Antecedent Service, 12 Automated SLA, 21 BDD Non-Terminal Node, 74 BDD Terminal Node, 74 Best-Fit Greedy Algorithm, 110 Best-Fit Planning, 124 Bin-Packing Problem, 107 Binary Decision Diagrams, 74 Business Process, 11 Business Value, 80 Check-Pointing, 14 Cloud Computing, 13 Common Information Model, 93 Conditional Probability, 82 Contract Net, 23 CPLEX Solver, 128 Dependent Service, 12 Dependent Variables, 82 Differentiated Services, 21 Directed Acyclic Graph, 74 Elastic Compute Cloud, 13 Electronic Contract, 20 Energy Efficiency, 108 Explicit Dependency, 11 First-Fit Greedy Algorithm, 110 First-Fit Planning, 124 Flexible-Items Planning, 123 Flexible-Time Planning, 124 Forecasting, 53 Game Theory, 45 GLUE Schema, 94 Grid Computing, 14 GUROBI Solver, 128 High-Performance Computing, 60 IaaS, Multi-Domain, 16 Implicit Dependency, 12 Independent Variables, 82 Infrastructure as a Service, 14 Integer Linear Programming, 125 IT Service, 9 Knapsack Problem, 125 Mean-time-to-Failure, 41 Mean-time-to-Repair, 41 Mixed-Integer Programs, 125 Monitoring and SLA Adjustment, 54 Monitoring Infrastructure, 53 Multi-Criteria Optimization, 42 Multi-Round Negotiation, 21 Negotiation Interface, 52 Negotiation Process, 55 Non-Flexible Planning, 123 Ordered BDD, 75 Outsourcing, 15 Pareto Front, 43 Penalties, 65 Penalty Fairness, 66 Penalty Function, 67 Planning and Optimization, 52 163 INDEX Platform as a Service, 14 Procurement, 51 Properties Dependency Graph, 41 Provisioning, 53 Publish/Subscribe System, 51 PuLP Modeler, 128 Quality of Service, 19 Quota, 14 Reduced Ordered BDD, 75 Request for Quote, 23 Reservation Time Lease, 95 Reservation Type, 96 Resource Element, 93 Resource Group, 95 Resource Group Characteristic, 95 Resource Group Type, 95 Resource Reservation, 91 Reverse Polish Notation, 84 SAMI, 50 SAMI Framework, 59 Scalarization of Objectives, 45 Semantic Web, 25 Sensor Markup Language, 98 Service Chains, 10 Service Composition, 11 Service Dependencies Information, 54 Service Dependency, 11 Service Dependency Graph, 12 Service Level Agreement, 19 Service Manager, 60 Service-Oriented Computing, 9 Shannon’s Decomposition Theorem, 74 Shared BDD, 75 Simple Storage Service, 13 SLA Adjustment, 24 SLA Assessment, 24 SLA Clauses, 19 SLA Conditions, 19 SLA Dependency, 39 SLA Envelope, 91 SLA Execution, 23 SLA Facts, 19 SLA Hierarchy, 39 SLA Implementation, 23 SLA Management, 22 SLA Management Instance, 50 SLA Monitoring, 23 SLA Negotiation, 23 SLA Template, 22 SLA Termination, 24 SLA Translation, 40, 42 Storage Resource Manager, 98 Symbolic Model Checking, 74 TeleManagement Forum, 22 Template Advertisements, 51 Total Cost of Ownership, 32 Utility Computing, 13 Virtual Machine, 14, 105 Virtualization, 14 WS-Agreement, 21 WSBPEL, 11 WSDL, 11 WSLA, 22 164 Declaration I declare that this thesis is my own work and has not been submitted in any form for another degree or diploma at any university or other in- stitute of tertiary education. Information derived from the published and unpublished work of others has been acknowledged in the text, and a list of references is given. Konstantinos (Costas) Kotsokalis