Internet-Draft | ROSA | July 2023 |
Contreras, et al. | Expires 10 January 2024 |
The term 'service-based routing' (SBR) captures the set of mechanisms for the steering of traffic in an application-level service scenario. We position this steering as an anycast problem, requiring the selection of one of the possibly many choices for service execution at the very start of a service transaction.¶
This document builds on the issues and pain points identified across a range of use cases, reported in [I-D.mendes-rtgwg-rosa-use-cases]. We summarize the key insights and provide a gap analysis against key technologies related to the problem of SBR, developed by the IETF over many years. We further outline the requirements for a system that would adequately close those gaps and thus address the pain points of our use cases. Those requirements will be used for outlining a suitable architecture framework in a separate document.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 10 January 2024.¶
Copyright (c) 2023 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Virtualization and the proliferation of serverless service provisioning methods have driven the capability to dynamically deploy services in more than one network location, allowing for scaling both horizontally and vertically in a number of use cases, some of which can be found in [I-D.mendes-rtgwg-rosa-use-cases]. A key problem in such use cases is that of steering the service requests stemming from the applications, a mechanism we label as service-based routing (SBR). A key constraint in realizing solutions for this problem is the possible distribution of more than one service instance across several network locations, posing the SBR problem as an inherently anycast one.¶
Unlike existing methods for SBR, some of which we survey in this document, we envision a system we call routing on service addresses (ROSA), which allows suitable service-specific anycast decisions to be made even when the notion of the 'best' instance changes at high frequency, with the expectation of yielding better performance, such as improved service completion latency and utilization.¶
At the same time, it is important to recognize that we do not aim to replace existing service routing capabilities, most notably the DNS as the main form of resolving a service name into a routing locator; we see those capabilities working perfectly well for many Internet services. However, it is important to understand the gaps that those existing methods show in realizing the emerging use cases of high dynamicity in service relations. This document surveys key technologies, developed in the IETF over recent years, in order to identify where those technologies fall short of delivering suitable solutions to the pain points identified in our use cases of [I-D.mendes-rtgwg-rosa-use-cases].¶
Complementing our gap analysis, we also formulate requirements for a solution to those pain points. We link the various requirements to observed issues in our use cases [I-D.mendes-rtgwg-rosa-use-cases] for better illustration and reasoning for their inclusion.¶
In the remainder of this document, we first introduce in Section 2 a terminology that provides the common language used throughout the document; this terminology is kept in sync with the other ROSA draft. We then summarize the key observations from our use cases in [I-D.mendes-rtgwg-rosa-use-cases] as a recap for the following gap analysis in Section 4. The insights from our gap and use case analysis then lead us to the requirements in Section 5, before outlining in Section 6 the expected benefits from realizing those requirements in a suitable system.¶
The following terminology is used throughout the remainder of this document:¶
Several observations can be drawn from the use case examples in [I-D.mendes-rtgwg-rosa-use-cases] in what concerns their technical needs:¶
We can conclude from our observations above that (i) distribution (of service instances), (ii) dynamicity in the availability of and choosing the 'best' service instance, and (iii) efficiency in utilizing the best possible service instance are crucial for our use cases.¶
We now discuss observations and suitability of existing technologies for realizing the use cases in [I-D.mendes-rtgwg-rosa-use-cases]. We first survey technologies that possibly provide similar SBR functionality to our use cases. Here, we have currently identified the DNS (and solutions based on it), CATS, LISP, and ALTO as such technologies.¶
We then outline works that are related to certain aspects of SBR only for the purpose of explaining differences and relations for possible future integration or touching points in solutions to ROSA. Here, we currently include technologies such as SFC, SPRING, and TVR. Future discussions and work may extend on both of those areas for a more comprehensive analysis.¶
The Domain Name System (DNS) is the most prevalent method being used for service-based routing in that it supports the resolution of a domain name, such as foo.com, to an IP address, which is then used for subsequent message transfer between sender and receiver. We see, thus, the DNS and methods extending but basing themselves on the DNS, such as Global Server Load Balancing, as the baseline for SBR. In the following, we provide insights into the main technology and the gaps identified towards ROSA objectives.¶
The DNS [RFC1035] provides an explicit method for mapping domain names onto an IP locator, often referred to as 'early binding'. Those mappings are provided based on previous DNS registrations of IP locators to certain domain names.¶
There are many extensions to this basic lookup mechanism, some of which are relevant to our discussion. For instance, DNS extensions may be used to base the decision on which IP address of several to pick based on, e.g., geo-location or load information. For the latter, load balancing is provided alongside the DNS resolver, e.g., in the form of Global Server Load Balancing (GSLB) [GSLB] solutions in CDNs. Furthermore, a health check functionality may be provided to resolve IP address failures, providing alternatives to detected failures of reachability.¶
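As a minimal sketch of the GSLB-style selection described above, the following Python fragment picks one of several registered IP addresses for a domain name based on load, skipping addresses that a health check has marked as unreachable. The record layout, the `resolve` function, and the load values are hypothetical illustrations and not part of any DNS or GSLB specification.¶

```python
from dataclasses import dataclass


@dataclass
class Record:
    ip: str        # registered IP locator for the domain name
    load: float    # hypothetical load metric: 0.0 (idle) to 1.0 (saturated)
    healthy: bool  # result of the most recent health check


# Hypothetical registrations; the addresses come from documentation ranges.
REGISTRY = {
    "foo.com": [
        Record("192.0.2.10", load=0.80, healthy=True),
        Record("198.51.100.20", load=0.30, healthy=True),
        Record("203.0.113.30", load=0.10, healthy=False),
    ]
}


def resolve(name, registry=REGISTRY):
    """Return the IP of the least-loaded healthy record, or None."""
    candidates = [r for r in registry.get(name, []) if r.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.load).ip


print(resolve("foo.com"))  # 198.51.100.20
```

In this sketch, the client would keep using the returned address until it resolves the name again, so any change in load or health in between goes unnoticed; this is the 'early binding' behaviour referred to above.¶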
As mentioned upfront, the explicit resolution provided by the DNS is our baseline for comparison due to its widespread use in the Internet. Despite its rather static nature of assigning IP addresses to domain names, it is sufficient for many of the use cases of the Internet, where the initial selection of a suitable server address suffices. We thus see the DNS continuing to be a vital component of the Internet and focus in our following gap analysis only on those shortcomings in relation to our identified use cases.¶
There are a number of key differences and gaps to the desired properties of a ROSA system. Several of those gaps have already been identified in [I-D.yao-cats-gap-reqs] and also apply here:¶
The Compute-aware Traffic Steering (CATS) WG is a newly established working group in the IETF, which aims at supporting the selection of one of possibly many service instances for a particular service. This similarity in objectives makes us draw out the main concepts and gaps to the objectives for ROSA in the following.¶
Let us provide a brief overview of CATS and its main concepts - for more detail, we refer to, e.g., [I-D.ldbc-cats-framework].¶
CATS proposes compute-aware decisions in sending traffic between a client and a set of possible egress sites or directly Internet-connected service hosts. For this, CATS introduces the CS-ID as the CATS service identifier, which is mapped onto the CB-ID as the CATS binding identifier. The exact nature of those identifiers is still work-in-progress with proposals currently being presented to the CATS WG.¶
CATS proposes to use an ingress-egress tunneling approach, where ingress CATS routers use metrics to decide upon the CB-ID to be used for an incoming request to a CS-ID. The tunneling method is currently still under discussion with SRv6, MPLS and other technologies being considered.¶
As the name suggests, the basis for the aforementioned selection at the ingress CATS router is a set of compute metrics that are distributed to the ingress CATS routers through suitable methods, which are still under investigation together with the nature and extent of the metrics themselves.¶
To support the steering of longer service transactions, CATS proposes a CATS traffic classifier component, which associates several packets with such a longer service transaction to ensure that those packets are steered according to the selection made for the initial packet.¶
CATS proposes a similar anycast type of addressing as well as a separation of service identifier from routing identifier, as done by ROSA. Furthermore, the ingress CATS router performs a traffic steering decision among the set of possible service instances, albeit with a focus on such decisions being compute-aware.¶
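The ingress behaviour outlined above can be sketched as follows, under the assumption that each CS-ID maps to a set of candidate CB-IDs with a single numeric metric (lower is better) and that a flow key identifies a service transaction. Since the identifiers, metrics, and tunnelling are still work-in-progress in the CATS WG, the class and method names here are purely hypothetical.¶

```python
class IngressRouter:
    """Hypothetical ingress router: metric-based selection plus a classifier."""

    def __init__(self, bindings):
        # bindings: CS-ID -> {CB-ID: metric}; a lower metric is assumed better
        self.bindings = bindings
        # classifier state: flow key -> CB-ID chosen for the initial packet
        self.flows = {}

    def update_metric(self, cs_id, cb_id, metric):
        """Record a newly distributed compute metric for one binding."""
        self.bindings[cs_id][cb_id] = metric

    def steer(self, flow_key, cs_id):
        """Return the CB-ID for a packet of the given flow towards cs_id."""
        if flow_key in self.flows:
            # Later packets of a transaction follow the initial selection.
            return self.flows[flow_key]
        candidates = self.bindings[cs_id]
        cb_id = min(candidates, key=candidates.get)
        self.flows[flow_key] = cb_id
        return cb_id


router = IngressRouter({"cs-1": {"cb-a": 0.7, "cb-b": 0.2}})
first = router.steer(("client-1", 1), "cs-1")   # picks cb-b (lowest metric)
router.update_metric("cs-1", "cb-b", 0.9)       # cb-b becomes loaded
same = router.steer(("client-1", 1), "cs-1")    # still cb-b: flow is pinned
other = router.steer(("client-2", 1), "cs-1")   # new flow picks cb-a
```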
There are a number of key differences and gaps to the desired properties of a ROSA system:¶
The Locator-ID Separation Protocol (LISP) WG has been in existence for many years, aiming at separating endpoint identifiers (called EIDs) and routing locators (called RLOCs) for better scalability when adjusting to changes in their relation. This similarity in focusing on in-band dynamic assignments of EIDs to RLOCs positions LISP as a possible technology to address the pain points identified in our use case draft. Let us draw out the LISP concepts and the gaps to ROSA objectives in the following.¶
Let us provide a brief overview of LISP and its main concepts - for more detail, we refer to, e.g., [RFC9299].¶
LISP introduces two namespaces, separating endpoint identifiers (EIDs) from routing locators (RLOCs) for a device realizing the service or resource represented by the EID. The EID may be determined from mapping services such as the DNS, resolved from other application-specific identifiers (such as a URL).¶
Endpoints communicate through their EIDs, sent domain-locally through an intra-domain routing protocol either to a locally present EID or to the ingress tunnel router (ITR) of their local domain. The ITR in turn consults a mapping service [RFC9301] to resolve the EID to an RLOC of an egress tunnel router (ETR), to which the incoming request is then sent, while the ETR domain-locally forwards the packet to the destination EID. LISP uses UDP for ITR-ETR tunnelling as well as for accessing the mapping service.¶
Mapping service resolutions are usually cached at the ITR after initially being resolved due to an incoming packet request. In addition to this DNS-like pull operation, a pub/sub extension may proactively push EID->RLOC mappings from the mapping service to the ITR (e.g., for planned handovers) or update previously resolved mappings in the future.¶
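The interplay of caching, pull-based resolution, and pub/sub updates can be illustrated with the following minimal sketch. It models only the control flow described above; the class names, the callback interface, and the string identifiers are assumptions made for illustration and do not reflect the message formats of [RFC9301].¶

```python
class MappingService:
    """Hypothetical EID -> RLOC mapping database with subscriptions."""

    def __init__(self):
        self.db = {}           # EID -> RLOC
        self.subscribers = {}  # EID -> list of callbacks

    def register(self, eid, rloc):
        """Store a mapping and notify every subscriber of that EID."""
        self.db[eid] = rloc
        for notify in self.subscribers.get(eid, []):
            notify(eid, rloc)

    def lookup(self, eid):
        return self.db.get(eid)

    def subscribe(self, eid, callback):
        self.subscribers.setdefault(eid, []).append(callback)


class ITR:
    """Hypothetical ingress tunnel router with a map-cache."""

    def __init__(self, mapping_service):
        self.ms = mapping_service
        self.cache = {}
        self.lookups = 0  # number of pull operations towards the service

    def _on_update(self, eid, rloc):
        # Pub/sub path: refresh the cached mapping without a new lookup.
        self.cache[eid] = rloc

    def resolve(self, eid):
        if eid not in self.cache:
            # Pull path: triggered by the first packet towards this EID.
            self.lookups += 1
            rloc = self.ms.lookup(eid)
            if rloc is None:
                return None
            self.cache[eid] = rloc
            self.ms.subscribe(eid, self._on_update)
        return self.cache[eid]


ms = MappingService()
ms.register("eid-1", "rloc-A")
itr = ITR(ms)
before = itr.resolve("eid-1")   # pulled from the mapping service, then cached
ms.register("eid-1", "rloc-B")  # e.g., a planned handover
after = itr.resolve("eid-1")    # served from the updated cache
```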
One could position an EID as a service address in ROSA, where the mapping process in the ITR resembles the endpoint selection. The proactive pub/sub mapping resolution would allow for changing RLOC assignments and thus direct EID requests to other ETRs.¶
There are a number of key differences and gaps to the desired properties of a ROSA system:¶
ALTO, as defined in [RFC7285], provides the ability to select suitable application-level servers for a client requesting it. It thus seems aligned with the ROSA anycast problem; on closer inspection, however, there are fundamental differences:¶
ALTO follows other SBR methods in employing an explicit server discovery step, defined in [RFC7286], thus conceptually aligning with methods like DNS in that it employs an off-path method.¶
ALTO also follows more of a recommendation model, where the ALTO client makes the final decision on which of the possible choices to use in the data transfer, while ROSA advocates a decision driven by the ROSA overlay.¶
Moreover, ALTO operates at the application level, currently supporting HTTP/1, while ROSA advocates the use of any application (and transport) protocol similar to using the DNS for resolution.¶
ALTO provides insights into server selection criteria through metric work, as outlined in [RFC9274], [RFC9241], and [RFC8895]; work that is already considered as input to the CATS WG. This consideration equally applies to ROSA where metrics as well as metric distribution are not in scope.¶
Similar to the DNS, detailed in Section 4.1, ALTO provides an explicit resolution step for selecting HTTP/1-based service instances from a set of available servers. It thus provides a solution for an anycast selection albeit limited to HTTP/1-based services. It also allows for service-specific selection of the final server to be used through a recommendation model, i.e., providing choices of suitable servers to the client, which ultimately selects the server. With this, it differs from the DNS model, where the DNS resolver makes the ultimate selection.¶
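To illustrate the recommendation model, the sketch below separates the two roles: a server-side function that returns a ranked list of candidates, and a client-side function that makes the final choice. This is not the ALTO protocol encoding of [RFC7285]; the function names, the cost values, and the client's exclusion policy are hypothetical.¶

```python
def recommend(costs, limit=3):
    """Server side: return up to `limit` servers, cheapest first."""
    return sorted(costs, key=costs.get)[:limit]


def client_select(recommended, excluded=frozenset()):
    """Client side: take the first recommendation the client accepts."""
    for server in recommended:
        if server not in excluded:
            return server
    return None


# Hypothetical cost map for three servers as seen from one client.
costs = {"s1": 10, "s2": 4, "s3": 7}
ranking = recommend(costs)                         # ['s2', 's3', 's1']
choice = client_select(ranking, excluded={"s2"})   # 's3'
```

In a DNS-like model, by contrast, the selection logic of `client_select` would sit with the resolver, and the client would only see the single chosen address.¶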
There are a number of key differences and gaps to the desired properties of a ROSA system. Several of those gaps are similar to those already identified in Section 4.1.3 and are thus presented only briefly here:¶
The following requirements for a routing on service addresses (ROSA) solution (referred to as 'solution' for short) have been identified from the analysis in the previous section of the use cases provided in [I-D.mendes-rtgwg-rosa-use-cases].¶
One commonality of all use cases is the communication with a 'service', realized at one or more network locations as equivalent 'service instances'. Associating the service with an 'owner' is key to preventing services from being announced by fake entities that would misdirect the client's traffic. In addition, the purpose of communication (e.g., leaked through the specific name of a service), as well as any policy for selecting one service instance over another, may need to be obfuscated or kept private; this is likely the case across all of our use cases. Hence, any solution¶
MUST provide means to associate service instances with a single service address.¶
Across all our use cases, the knowledge of where service instances (realizing specific services) reside within the network, i.e., possibly at different network locations, is crucial for the communication to happen, at least for the ROSA domain with which the service has an association. Such knowledge may be created by a service management platform, e.g., as part of the overall service deployment, and thus may not be initiated by the deployed service instance itself, such as in the example of mobile distributed applications of Section 3.4 in [I-D.mendes-rtgwg-rosa-use-cases]. Furthermore, service deployment may be delegated to service or CDN platforms, e.g., in the CDN, AR/VR and video distribution examples of [I-D.mendes-rtgwg-rosa-use-cases], albeit with linkages needed to the service routing capabilities of ROSA. Crucial, however, is that a solution ought to proactively push suitable reachability information for service instances into the ROSA system, i.e., pursue a routing-based approach, allowing for faster availability of the information needed to decide which service instance to choose among those available. Hence, any solution¶
MUST provide means to announce route(s) to specific instances realizing a specific service address, thus enabling service equivalence for this set of service instances.¶
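A minimal sketch of this announcement requirement is given below, assuming that a service address is an opaque key and that each instance is identified by an IP locator. The `ServiceRoutingTable` class and its methods are hypothetical and do not describe a ROSA protocol; they only show how proactively pushed announcements could build the set of equivalent instances for one service address.¶

```python
from collections import defaultdict


class ServiceRoutingTable:
    """Hypothetical table: service address -> locators of equivalent instances."""

    def __init__(self):
        self.routes = defaultdict(set)

    def announce(self, service_address, locator):
        """Push reachability of one instance, e.g., from a management platform."""
        self.routes[service_address].add(locator)

    def withdraw(self, service_address, locator):
        """Remove one instance; drop the service address once no instance is left."""
        self.routes[service_address].discard(locator)
        if not self.routes[service_address]:
            del self.routes[service_address]

    def instances(self, service_address):
        """Return the currently announced instances in a stable order."""
        return sorted(self.routes.get(service_address, ()))


table = ServiceRoutingTable()
table.announce("video.example", "2001:db8::1")
table.announce("video.example", "2001:db8::2")
table.withdraw("video.example", "2001:db8::1")
remaining = table.instances("video.example")  # ['2001:db8::2']
```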
A client application may not just invoke services within a single ROSA domain. While associating with different ROSA domains may be possible, clients may simply invoke services through their existing ROSA domain, e.g., for utilizing helper services in examples like distributed mobile applications (Section 3.4 in [I-D.mendes-rtgwg-rosa-use-cases]), expecting the service transaction to be realized regardless. The same goes for invoking services that may reside in the public Internet, without requiring the client to be explicitly aware of which ROSA domain (or the public Internet) to direct the invocation to. Thus, any solution¶
MUST provide means to interconnect ROSA islands.¶
Use cases like distributed mobile applications (Section 3.4 in [I-D.mendes-rtgwg-rosa-use-cases]) but also video delivery ones such as for replicated chunk retrieval or AR/VR (Sections 3.5 and 3.6 in [I-D.mendes-rtgwg-rosa-use-cases], respectively) or the selection of an appropriate UPF (user plane function) within a cellular sub-system (Section 3.2 in [I-D.mendes-rtgwg-rosa-use-cases]), may want to constrain the selection of 'suitable' service instances through service-specific constraints, such as the computing load (on the deployed service instances or their host platforms), service-level latency, but also, e.g., HW or SW capabilities. This may also be the case for multi-homed deployments (see Section 3.3 in [I-D.mendes-rtgwg-rosa-use-cases]), where constraints on the multi-connectivity of the service instance may constrain the suitability for specific clients. Thus, any solution¶
MUST provide constraint-based routing capability.¶
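As a sketch of what constraint-based selection could look like, the fragment below filters a set of instances by the kinds of service-specific constraints named above (computing load, service-level latency, and HW/SW capabilities). The field names, thresholds, and the `eligible_instances` function are hypothetical; how such metrics would be obtained is outside the scope of this sketch.¶

```python
def eligible_instances(instances, max_load=None, max_latency_ms=None,
                       required_caps=frozenset()):
    """Return locators of instances satisfying all given constraints."""
    selected = []
    for inst in instances:
        if max_load is not None and inst["load"] > max_load:
            continue
        if max_latency_ms is not None and inst["latency_ms"] > max_latency_ms:
            continue
        # Every required capability must be offered by the instance.
        if not set(required_caps) <= set(inst["caps"]):
            continue
        selected.append(inst["locator"])
    return selected


# Hypothetical instance state for one service address.
INSTANCES = [
    {"locator": "2001:db8::1", "load": 0.9, "latency_ms": 5, "caps": {"gpu"}},
    {"locator": "2001:db8::2", "load": 0.4, "latency_ms": 12, "caps": {"gpu"}},
    {"locator": "2001:db8::3", "load": 0.2, "latency_ms": 8, "caps": set()},
]

gpu_lightly_loaded = eligible_instances(
    INSTANCES, max_load=0.5, required_caps={"gpu"}
)  # ['2001:db8::2']
```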
The work in [OnOff2022] has shown the potential gains in making runtime decisions for every incoming service transaction, where transaction lengths may be as small as single (application-level) requests. For use cases such as for replicated chunk retrieval (Section 3.5 in [I-D.mendes-rtgwg-rosa-use-cases]) or AR/VR (Section 3.6 in [I-D.mendes-rtgwg-rosa-use-cases]), this may lead to significant smoothing of the request completion latency, i.e., reducing the latency variance, thus enabling a better, smoother experience at the client. However, the specific mechanism may vary and, more importantly, may be highly service-specific, with solutions such as [CArDS2022] providing a simple weighted round robin, while other methods may rely on regular (service) metric reporting. Thus, any solution¶
MUST provide an instance selection at ROSA domain ingress nodes only.¶
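One possible per-transaction selection mechanism at an ingress node is a weighted round robin, sketched below by expanding integer weights into a repeating schedule. This is a generic textbook variant chosen for brevity; it is not claimed to be the scheduling used in [CArDS2022], and the class name and weights are hypothetical.¶

```python
import itertools


class WeightedRoundRobin:
    """Select instances in proportion to positive integer weights."""

    def __init__(self, weights):
        # weights: instance locator -> positive integer weight
        schedule = [inst for inst, w in weights.items() for _ in range(w)]
        if not schedule:
            raise ValueError("at least one instance with weight >= 1 is required")
        self._cycle = itertools.cycle(schedule)

    def select(self):
        """Return the instance for the next incoming service transaction."""
        return next(self._cycle)


wrr = WeightedRoundRobin({"inst-a": 2, "inst-b": 1})
picks = [wrr.select() for _ in range(6)]
# inst-a receives two of every three transactions:
# ['inst-a', 'inst-a', 'inst-b', 'inst-a', 'inst-a', 'inst-b']
```

Because the decision is taken anew for every transaction, the weights could be replaced at any time, e.g., based on service metric reporting, without involving the client.¶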
Explicit resolution steps, such as those in DNS, GSLB, or ALTO, suffer from the need for an explicit control plane exchange. This causes additional latency before the data transfer to the chosen service instance may start. In-band data, i.e., the inclusion of application-level data in the control messages, is not supported due to the layering of such solutions at the application level itself. It is desirable, however, to already allow for the exchange of application data, including that needed for establishing secure connections, in the process that determines the most suitable service instance to further reduce any latency for completing a given application-level service transaction. Thus, any solution¶
While video delivery use cases like replicated chunk retrieval (Section 3.5 in [I-D.mendes-rtgwg-rosa-use-cases]) or AR/VR (Section 3.6 in [I-D.mendes-rtgwg-rosa-use-cases]) may exhibit short-lived transactions of just one (service-level) request, due to the replicated nature of the video content in each service instance, other service transactions may last many requests after the initial one has been sent. Ephemeral state may be created during such a transaction, so that a change of the (initial) service instance during the transaction would require sharing this ephemeral state with any new service instance being used. While service platforms, like K8S, provide such ability through 'shared data layer' capabilities, those are often limited to single site deployments. Any support across sites would incur additional costs or even possibly latencies for such state sharing, thus often leading to completing an ongoing service transaction with the service instance that was originally used (note that a service instance in ROSA may use internal methods for serving incoming requests across which state sharing would be applied - from a ROSA perspective, however, only one service instance is being used). We call the capability to retain an initial selection of a service instance for the length of a service transaction 'affinity'. Thus, any solution¶
All of our use cases are likely being deployed over existing network infrastructure, which makes a consideration to use its existing solutions in any realization of ROSA very important. Specifically, any solution¶
SHOULD use IPv6 for the routing and forwarding of service and affinity requests.¶
Most of our use cases, specifically on distributed mobile applications (Section 3.4 in [I-D.mendes-rtgwg-rosa-use-cases]) but also our video delivery examples, may be realized in inherently mobile settings with clients moving about for their experience. While mobile IP solutions exist, the service initialization in ROSA needs to be equally supported in order to allow for invoking ROSA services on the move. Thus, any solution¶
Mobility of clients, but also varying loads in scenarios of no client mobility, may also lead to situations where moving an ongoing service transaction to another service instance may be beneficial, termed 'transaction mobility'. In other words, service instances may be replaced mid-transaction, in order to ensure the service level agreement. This may happen if, for instance, the local node where the service instance was initially installed is running out of resources, or its accessibility is reduced (which may occur periodically). Thus, any solution¶
With most service transactions likely being encrypted for privacy and security reasons, supporting the appropriate transport layer methods is crucial in all our scenarios in [I-D.mendes-rtgwg-rosa-use-cases]. While work in [OnOff2022] has shown that small service transactions in scenarios like replicated chunk retrieval (Section 3.5 in [I-D.mendes-rtgwg-rosa-use-cases]) or AR/VR (Section 3.6 in [I-D.mendes-rtgwg-rosa-use-cases]) may be beneficial for significantly reducing the service-level latency, the challenge lies in initiating suitable transport layer security associations with frequently changing service instances. Pre-shared certificates may address this by allowing 0-RTT handshakes to be realized but come with well-known forward secrecy problems. Thus, any solution¶
We envision the ROSA layer in ROSA endpoints to be transparently integrated in the operation of transport protocols, and thus applications, by providing suitable interfaces for accessing the ROSA services of a specific ROSA domain. Thus, any solution¶
We expect the following benefits to be realized through providing a solution to the problem statement presented in [I-D.mendes-rtgwg-rosa-use-cases]:¶
This draft provided a gap analysis of existing methods for service-based routing in relation to the issues and pain points identified in [I-D.mendes-rtgwg-rosa-use-cases].¶
Furthermore, we outlined requirements to fill those gaps in possible realizations, a first of which is being described in a companion document as the ROSA architecture.¶
To facilitate the mapping between service information (i.e., the service address) and the IP locator of the selected service instance, information needs to be provided to the ROSA service address routers. This is similar to the process of resolving domain names to IP locators in today's solutions, such as the DNS. As with those techniques, privacy in terms of which services the initiating client is communicating with needs to be preserved against the traversed underlay networks. For this, suitable encryption of sensitive information needs to be provided as an option. Furthermore, we assume that the choice of ROSA overlay to use for the service-to-locator mapping is similar to that of choosing the client-facing DNS server; we thus assume it to be configurable by the client, including a fallback to the DNS for those cases where services may be announced to ROSA methods and DNS-like solutions alike.¶
This draft does not request any IANA action.¶
Many thanks go to Ben Schwartz, Luigi Iannone, Mohamed Boucadair, Tommy Pauly, Joel Halpern, Daniel Huang, and Russ White for their comments on the text, which helped clarify several aspects of the motivation for and technical details of ROSA.¶