Title : An Overview of Internet Routing
Author : krnl
---[ Phrack Magazine Volume 8, Issue 53 July 8, 1998, article 05 of 15
-------------------------[ Introduction and Overview of Internet Routing
--------[ krnl <krnl@heuristic.org>
----[ Routing Overview:
The process of routing can be quickly summarized as a node finding the path to
every possible destination. Routing is present in everything from layer 1
(the physical layer) on up. The routing that most people are familiar with,
however, occurs at layer 3 (the network layer) and as such, we will only
reference layer 3 (and more specifically) Internet Protocol (IP) routing in
this document.
Protocols for exchange of routing information connect multiple routers around
the world to provide them with a common view of the network through their
heterogeneous, though generally consistent routing tables. Routing tables
store all information necessary for the router to reach every destination on
the network irrespective of size (i.e. the network could be j.random LAN with
one ip router and two hosts off of an ethernet port or it could be the
Internet proper).
----[ Routing Protocols:
There are a wide variety of routing protocols used to contribute to the
routing tables across a network. Protocols such as BGP, OSPF, RIP and ISIS
help to convey a correct and coherent picture of the network to all network
switches (routers).
----[ Routing Goals:
You can imagine that if each router has to store information that would allow
it to reach every destination on the network, there is the possibility for it
to amass a large routing table. Large routing tables are difficult (and
sometimes impossible) for routers to process because of physical constraints
(cpu, memory or a combination). Therefore, we would like to minimize the
routing table space without sacrificing the ability to reach every destination
on the network. For example, if the router is connected to the Internet via
one DS1 link to another router, the router could store routing table
information for each destination on the Internet or it could just default
non-local information out that serial link. What defaulting means is that if
the router does not have a specific entry in its table for the destination
that the packet is trying to find, it sends it out the default link. The
router towards which a router sends defaulted packets is sometimes called the
'gateway of last resort'. This simple trick allows many routing tables to
save a number of entries on the 30th order of magnitude. Routing information
should not be exchanged between routers in a spurious fashion. Frequent churn
in the routing table puts unnecessary stresses on the scare memory and cpu of
any given router. Information propagation should not interfere with the
forwarding operations of the router. Though this means that you should not
send routing updates every nanosecond, it does not mean that routing
information should only be exchanged and updated weekly. One of the important
goals of routing is that it provide the host with a table which accurately
reflects the current status of the network.
The most important aspect of a router's operation is sending packets from
input to correct output. Misrouting packets could cause a loss of data.
Routing table inconsistencies could also cause routing loops whereby a packet
is passed between two adjacent interfaces ad infinitum.
It is desirous for routers to have quick convergence. Convergence can be
informally defined as a metric which gauges the speed at which routers arrive
at a consistent view of the network. It would be ideal to have infinitesimal
convergence times because that would ensure that each router on the network
can accurately reflect the current topology even after a drastic change (link
failure). When the network is changing, each router must propagate data which
will aid other routers in converging to the correct picture of the network
status. Problems with quick convergence are found in the routing updates. If
a link is flapping (changing line status from up to down) rapidly, it can
generate numerous installation and withdrawal requests. Therefore, that one
link can end up consuming the resources of every router on the network because
the other routers are forced to install and withdraw the route in rapid
succession. While convergence is an important goal of routing protocols, it
is not a panacea to network woes.
----[ Distance Vector Routing
Distance vector routing protocols distribute a list of <destination, cost>
tuples to all of the router's neighbors. These tuples assign a cost to reach
every other node of the network. It is important to note that this routing
information is only distributed to routers which are assigned as neighbors to
the originating router. These neighbors are often physical, but can be
logical in the case of eBGP multihop. That cost is the sum of the link costs
for the router to reach a destination. Routers periodically send their
neighbors distance vector updates; the neighbor then compares the received
distance vector to its current distance vector. If the received values are
lower, the router sends output to the destination in the distance vector over
the link that it received the vector over.
The count to infinity problem is a problem with many distance vector
implementations. We will assume that all links have a unit cost and that each
hop corresponds to a unit. For example, if router X is connected to router Y
and router Y is connected to router Z, we can demonstrate this problem (see fig
1). Y knows a 1 hop path to Z and X knows a 2 hop path to Z. Assume that
link YZ goes down and the cost of that route is increased to infinity (fig 2).
Now, Y knows an infinite cost route to Z because it knows the link is down so
it propagates this distance vector to X. Suppose X has sent an update to Y
which advertises a 2 hop distance vector. Now, Y will think that it can get
to Z through X, so it sends X an update that says it can get to Z in three
hops (fig 3). Note that X has no idea that the distance vector being
advertised to it was originated from X. This is a serious flaw in distance
vectors. In their unmodified form, they do not contain the full path
information that the route has traversed. As illustrated above, the router
alternates states of advertising a path to Z and advertising infinity to Z.
They keep this exchange up forever or until they have reached some internally
defined infinity count (say 15 as in the case of RIP).
Count to Infinity problem:
X--------------------Y--------------------Z
Y:1 X:1 X:2
Z:2 Z:1 Y:1
[ fig 1 ]
All links are up, below each node we note the destination and hopcount
from each respective node.
X--------------------Y--------* *---------Z
Y:1 <------------- Z:infinity
Z:2 -------------> X:1
[ fig 2 ]
The link Y - Z breaks. Node X advertises Z:2 to node Y.
X--------------------Y--------* *---------Z
Z:infinity(frm Y) -> X:1
Y:1 <------------- Z:3
[ fig 3 ]
Node Y sends its Z distance vector to X _before_ it recieves node X's
infinity. Once node Y receives node X's infinity, it sets its distance to
infinity.
A path vector is an easy way to defeat the count-to-infinity problem.
Basically, each distance vector also includes the router path that it
traversed (fig 4). The router rejects an update from its neighbor if the path
included in the update includes the router receiving the update (fig 5). The
Border Gateway Protocol (which is used to exchange routing information between
Autonomous Systems on the Internet) incorporates the path vector to stop the
count-to-infinity problem. Obviously, you have to incorporate more
information in the routing table if you want to include the AS path
information that the route has traversed. The designers of BGP decided that it
was optimal to sacrifice storage space and processing power for the robustness
that the path vector affords the routing protocol.
Path Vector Solution:
X--------------------Y--------------------Z
Y:1 (Y) X:1 (X) X:2 (YX)
Z:2 (YZ) Z:1 (Z) Y:1 (Y)
[ fig 4 ]
All links are up, below each node we note the destination, hopcount and
path vector from each respective node.
X--------------------Y--------* *---------Z
Y:1 (Y) X:1 (X)
Z:2 (Y Z) Z:infinity
[ fig 5 ]
The link Y - Z breaks. Node Y knows to ignore Xs advertisement of Z
because Y is the path vector. The avoids the count-to-infinity problem.
Another way to counter this problem is the split horizon. Basically, this
means that a router shouldn't advertise a path to a neighbor if that neighbor
is the next hop to the destination. This solves the problem presented in the
example above because the path to Z from X through Y would not have been
advertised to Y because Y is the neighbor _and_ the next hop to the
destination (Z). A variation called split horizon with poisonous reverse has
router X advertise an infinite cost to get to destination Z. Under a split
horizon, router X would not advertise that it could get to router Z.
----[ Link State Routing
A router using a link state routing protocol distributes the distance to its
neighbors to every other router on the network. This allows each router on
the network to make a routing table without knowing the full cost to the
destination from any one source. The problems of loops are avoided because
each router contains the full topology of the network. Basically, the router
makes a 3 tuple containing the source router (itself) the neighbor and the
cost to its neighbor. Therefore, if router A is connected to Router B over a
link of cost 3 and router A is connected to router C over link cost 5, then
router A would advertise the Link State Packets (LSPs) <A,B,3> and <A,C,5> to
all routers on this network. Each router on the network would evaluate all of
the LSPs that it receives and calculate a shortest path to every destination
on the network.
Obviously, the LSP is an integral part of the convergence process. If someone
could inject false LSPs into the network, it could result in misrouted
information (a packet taking a longer path than it should) or even in the
blackholing of a router on the network. This is not necessary a malicious
attack of a network, however. Router C could advertise a link to its neighbor
D with the 3 tuple <C,D,6> and then withdraw the announcement when the link
goes down. Unfortunately, if the LSP advertising the link having an infinite
cost arrives before the LSP advertising the cost of that link being 6, the
routing table will not reflect the topology of the network and will be in that
state until another LSP comes to correct the problem.
To combat this, a sequence number is introduced into the LSP. Therefore, all
of the routers on the network would initialize their sequence number to some
starting value and then start advertising their LSPs. This solves the above
problem in that the LSP advertising the link of infinite cost would have a
higher sequence number than the LSP advertising the link as having cost 6.
Some problems encountered when using sequences numbers are finite sequence
space, sequence initialization, and aging. It is in the best interest of a
robust link state protocol needs to protect its LSPs as well as choose a
sequence space which is sufficiently large to accommodate updates. The
sequence space that the LSPs can use is set to some finite value. Therefore,
when the sequence numbers reach the top of the space, they must wrap around
towards the smallest sequence number. This presents a problem because when a
router compares link state updates, the greater sequence number takes
preference. To combat this problem, you can define a maximum age of the LSP.
Therefore, if you have not received an update in X ticks, you discard the
current LSP information and wait for a further update. It must be noted that
this invalidates the path information to a destination. For example, if
router Y advertises a cost to its neighbor router Z where router Y is
connected by one link to a meshed network, when the link between the mesh and
router Y breaks, the other routers in the mesh have preserved link state
information that will allow them to find a path towards Z. If they receive no
updates in MAX_AGE, then they will assume that the link to Y is unreachable.
This will allow each router to converge its table and allow it to advertise an
infinite LSP for Y and Z.
Sequence initialization is also an important facet of this problem. Say
router Y crashed and is rebooting while the network is recalculating paths to
it. When it starts its link state protocol back up, it must somehow indicate
that it needs to reinitialize its sequence number to the last number it gave
all of the other routers to allow for coherence. Therefore, it can announce
paths with a sequence number in a special "initialization set". This
initialization set will tell the other routers that this router needs the
sequence where it left off. This is the "lollipop sequence" idiom. The
sequence space really resembles a lollipop in that the normal sequence number
keep churning around the finite sequence space while reinitialization takes
place in a short linear sequence space (comparable to the stick :).
Great pains are taken to ensure the integrity of LSPs. In fact, this entire
routing algorithm depends on the LSP being digested in a coherent method to
guarantee that each router has the correct view of the network topology. The
question still remains how the root node router computes the distance to each
destination.
Because of the general nature of a link state protocol, you have various nodes
of the network advertising the distance to get to their neighbors to every
other node on the network. Thus each node has a collection of neighbor
distances from various routers on the network. The routing table is basically
'grown' outward from the root node to all of the network extremities. This
will be explained in a slightly rigorous fashion in the next section.
----[ Dijkstra's Algorithm
This algorithm is a simple and elegant way to determine network topology.
Basically, there are two distinct sets of destinations on the network.
Destinations in set K are known routes for which a shortest path has been
computed. Destinations in set U are routers for which the best path to that
router is not currently known. In this set, paths are being considered as
candidates to be the best path to that destination.
To start off, add the current node p into the set K. Then add all of its
neighbors into the set U with path/cost associations. If there is another path
to one of the neighbors in the U set, then choose the path which costs the
least. When the neighbors N* are added to U make sure that they indicate the
cost through p as well as p's ID .
Once this has been done for the set U, then pick the neighbor n to p which has
the smallest cost to reach p. This is assuming that the neighbor has not
already been installed in K. This algorithm stops when set U is equivalent to
the empty set. When set U is null, it is implied that all destinations are in
set K and have the shortest cost from the root node P on which this algorithm
is running. Note, that each step evaluates adds ONE neighbor into K. That
neighbor is the router with the smallest cost to reach p.
----[ Distance Vector vs. Link State
We are left with these protocols like BGP which uses path vector and OSPF
which uses link state. Why do they occupy such orthogonal space? When a link
state protocol is working correctly, it guarantees that there will be no
routing loops in the network. The link state protocol also guarantees fast
convergence when there is a change in the topology of the network because the
link state is distributed on a flat routing space. Since link state protocols
contain these inherent advantages, why do protocols like BGP chose to employ
the path vector approach?
Taking a cross-section of routing protocols that are employed on the internet,
one finds that the majority of large providers use OSPF to resolve routing
information on their internal network and BGP to talk to other distinct
networks (or autonomous systems) at their borders of administration. What
suits BGP as an external protocol and OSPF for an internal routing protocol?
One issue, which will be discussed in the next section, is hierarchy. BGP
provides a mechanism for a routing hierarchy which enables it to greatly
reduce the space of its table. OSPF, which is a link state protocol,
provides a flat routing table whereby any internal router knows the full
hop by hop path to any destination within the autonomous system. Furthermore,
distance vector protocols understand that different areas can have different
views of the network where link state protocols require that each node
independently compute a consistent view of the network. This saves the DV
protocol the overhead of maintaining a correct LSP database. BGP also has
another 'advantage' in that it is layered on top of the Transmission Control
Protocol (TCP). Therefore, in the 'best-effort' service of IP networks, BGP
has assurance (to the level that TCP can guarantee) that routing information
will be propagated. Whereas, you can (or should) be able to govern the status
of your internal network, the nebulous exterior past your border routers
confers no delivery guarantee on your routing information.
Each type of routing algorithm is suited for its function. Link State
protocols provide the quick convergence that is essential to an internal
network while distance vector protocols provide external reachability.
Hierarchy is not something that is inherent in distance vector protocols,
but the implementation of a hierarchy has made BGP a widely used exterior
gateway protocol.
----[ Routing Hierarchy
Routing hierarchy is an oft fought debate that borders on religion. There
are constantly questions about how hierarchy should be implemented (if at
all) in the free form state of the current global network. Hierarchy imposes
a tree of authority with the overall authority at the top of the tree and
branching down to regional authorities, local authorities ad infinitum.
Hierarchy simplifies routing because if a destination is not locally routable
(or under your section of the tree). You can iterate up towards the top tree
to try and deliver that information. As you move towards the top, the routing
information contained in the routers becomes less and less specific until you
reach the root node which is the least specific. It does, however, know how
to route information to every possible destination on the network. It may help
you to envision the hierarchy of the telephone network (built under one
collective). If a call cannot be placed within a central office, it is handed
to either another central office in the area code or a wide area link. The
wide area link understands how to route to each area code in a full national
mesh whilst the local 5ess switch only knows routing information for more
specific prefixes. As the phone number becomes less specific (from right
to left), the routing decision moves further up the strict hierarchy.
This similar to how the domain name system (DNS) works on the internet (fig 6).
You provide local records for domains that you host. When your nameserver
receives a query for a record, it either returns the fact that it has
authority for that record or points toward the root nameserver. The root
nameserver knows the delegations of .com, .net, .org et al. and then points
towards the site responsible for that top level domain. That site then points
towards the site that has authority for the specific second level domain.
Domain names take the form of most specific -> least specific; i.e.
microsoft.com is more specific than just .com. Likewise
gates.house.microsoft.com is more specific than microsoft.com.
DNS Hierarchy:
___ . ___
/ | \
.com. .org. .edu.
/ | \
microsoft.com. eff.org. isi.edu.
/ | \
billy.microsoft.com. x0r.eff.org. rs.isi.edu.
[ fig 6 ]
Each level in the hierarchy is responsible for levels of greater
specificity.
Root authority is controlled by the Internet Assigned Numbers Authority
(IANA). It provides the top of the hierarchy in a "centrally" managed
database (in fact, there are multiple root servers distributed across the
county which maintain a consistent database). This is the closest example of
strict hierarchy that can be found on the internet.
With IP addresses, specificity increases in the opposite direction. IP
addresses (version 4) are 32-bits. The rightmost bit signifies the greatest
amount of specificity and the leftmost, the least. IP routing authority
information is not maintained in a centralized database. Routing information
is exchanged between autonomous systems via the BGP protocol. Routes take
preference in order of most specific -> least specific. In this way, there is
some type of hierarchy in the system (even though it is more loose than the
DNS example). Generally, larger providers control larger parts of the total
IPv4 space ((2^32) - 3 addresses). The converse is also true.
Classless Inter-Domain Routing (CIDR) also helped to decrease the size of
routing tables and increase the appearance of hierarchy. Now, instead of
Sprint announcing routes to 130.4.0.0 through 130.20.0.0 (on the classical B
space boundary) it could announce 130.4.0.0/12 which encompasses that entire
16 class B range. The classful ranges, subnetworking and the like are
discussed in my "introduction to IP" page and are therefore not included in
this document.
----[ Routing Hierarchy and Aggregation
BBN divides their 8/8 network into two subnetworks and advertises reachability
to the aggregate to save table space. Once inside an AS, routing obeys a fairly
strict hierarchy. Router A is responsible for the entire 131.103/16. It
divides it into two /17. Likewise, Router D in AS1 is responsible for 8/8 and
chooses to divide it into 8.0/9 and 8.128/9 and divides responsibility for
these networks to Routers E and F respectively (fig 7). Routers B, C, E, and F
can further choose to subdivide their networks in a hierarchical fashion.
Because of the binary nature of subnetting, networks can only be divided in
half.
Routing Hierarchy and Aggregation:
BGP
131.169.0.0/16 <--------------------> 8.0.0.0/8
A (AS1239) D (AS1)
/ \ / \
B / \ C E / \ F
/ \ / \
131.169.0.0/17 131.169.128.0/17 8.0/9 8.128/9
[ fig 7 ]
In the internet, there is no strict routing hierarchy. There are simply
large networks which peer via BGP to distribute aggregated routing
information.
The national backbone is populated by few nodes (when compared to the end
nodes). Most national providers are one or two router hops away from every
large network. Through aggregation in networks below, national providers
provide fully (or at least we hope) aggregated routing information. In a
strict hierarchy, only one router on any given hierarchy level can advertise
reachability to a specific portion of the network. In the current state of
the Internet, multiple routers can advertise reachability information. For
example, Sprint announces 131.169.0.0/16 out to Digex, MCI, and BBN. Though
this breaks some of the benefits of a strict hierarchy, it confers other
benefits. This scheme allows for distributed control of routing information
instead of depending on the node above. Also, nodes on the same level are
often interconnected to aid in the dissemination of routing information.
----[ Aggregation
As discussed slightly before, aggregation allowed the internet to reduce the
size of its external reachability tables. Before, the granularity of route
announcements allowed for only /8, /16, and /24 (octet boundaries). Now, with
CIDR you could use variable length subnet masks. The only requirement was
that they fall on one of the 32-bit boundaries of the IP address.
Classless routing not only allows us to minimize routing table space, it also
allows us to divide up large chunks of unused space into manageable pieces.
Much of the Class A space is terribly under-utilized. With this scheme one
can more efficiently allocate IP addresses to providers/netizens. The American
Registry of Internet Numbers (ARIN) controls the allocation of IP addresses
within the United States.
Aggregation helps alleviate the problems of not being in a strict hierarchical
structure. It allows the least amount of route table specificity at each
level (assuming the routers on that level choose to fully aggregate their
announcements.) The less specific aggregates do not necessarily indicate the
position of a router in the hierarchy. For example, a university may announce
a /8 and be 3 hops away from the national backbone.
A problem with aggregates can be found when we examine candidate route
specificity. If ISP A only has address space from within the allocated block
to their parent P, then aggregation could cause problems if ISP A wanted to
multihome to parent Q. The problem comes in that ISP A is obligated to
advertise reachability only to their space. This would constitute them
announcing their address space to Parent Q. Assume that parent P aggregates
ISP A's space into a /16 for the sake of saving route announcements. Now, ISP
A would seem to have better reachability only through parent Q because of the
specificity of the route announcement (remember that more specific routes take
precedence over less specific routes). This would nullify the benefits of
multihoming in an attempt to distribute load over the two lines. In this case,
ISP A would ask parent P to announce a more specific destination which has a
length matching the length of the aggregate assigned to ISP A. Therefore, to
the world above parent P and parent Q, the path to ISP A looks equally
appealing.
----[ Exterior/Interior
It is important to look at how routing information is disseminated throughout
the network. It has already been discussed that we use separate routing
protocols (with their respective benefits/costs) to talk to the internal and
external world. However, these protocols cannot take orthogonal views on
routing information. In fact, the interplay between interior and exterior
routing protocols is what allows data to be effectively relayed to a
destination external to the network as well as to distribute external routing
information to all nodes on the internal network.
There are a few ways to ensure that each router has a consistent view of the
network. One is to distribute the external protocol into the internal
protocol. Thereby, the internal protocol instantly has routing information
injected in it for the best path to every external destination. Note that the
router which talks eBGP (or comparable protocol) only redistributes the route
that it installs in its routing table and not the other candidate routes which
may have been advertised to it.
Another approach is to inject the interior protocol into the exterior protocol.
Of course, this necessitates filtering at the entrance point to the exterior
protocol to prevent the announcement of all internal specifics. You can
accomplish internal routing dissemination inside an Interior Gateway Protocol
mesh. Because of the specifics of protocols like BGP, externally learned
routing information will only be propagated one logical hop within the network.
Therefore, every router that must know this external reachability information,
must be fully meshed with the eBGP speaking routers. Also, if other routers
are injecting information into the Exterior Gateway Protocol, the router
should be logically fully meshed with them.
----[ Multicast Routing Overview
What we have been talking about above is unicast routing. In unicast routing,
you assume that each packet has a single destination address. Assuming
infinite resources being available, unicast is a feasible solution for every
situation. However, there are situations when it would be advantageous to send
a packet to multiple destinations from a single source (point to multipoint) or
from multiple sources to multiple recipients (multipoint to multipoint).
The point of multicast is to provide a multicast group to which hosts can
subscribe and get the specific multicast feed. The multicast group is a single
IP address in class D space. Therefore, the senders and receivers are only
associated by a given multicast group address. Thus, the senders move their
data towards the multicast group address and the receivers specify that they
want to receive information from a given group address. In fact, the sender
is not required to know any information about the hosts that are receiving the
feed.
----[ Multicast vs. Unicast
If one was sending packets from a single source to a set of destinations, it
is important to investigate how multicast and unicast handle the distribution.
Assume that router A is sending data to routers B, D and E. A is at the top of
the hierarchy, B and C are at the second level of the hierarchy, and D and E
are below router B. With multiple unicast (fig 8), router A makes 3 copies of
the packet and sends them down link AB. Router B then sends one packet to a
host off of its ethernet and forwards the last two packets to routers D and E
whereupon those routers send the packets to the their respective hosts in the
multicast group.
Therefore, this transmission takes up 3 packets per second on link AB and 1
pps on links B->Host(b), router D and router E. In a multicast routing
implementation, assuming the same topology, we will have less packets. The
difference is that router A sends _one_ packet over link AB. Router B then
triplicates the packet and sends it to Host(b), router D and router E (fig 9).
One way for triplicating the packet is to simultaneously close crossbars on a
fully crossed switch fabric, thus sending data from one input to three outputs
simultaneously. As long as there is route redundancy, multicast is very
efficient because it minimizes redundant packets traveling to the same
next-hop. Simply, as long as there is route redundancy for the distributed
session (for example, an audio feed) you will see an advantage with multicast
over unicast.
Multicast Overview Example:
Multiple Unicast:
A A sends 3 packets to B.
/ \
/ \ 3
/ \
C B B sends 1 packet to each to D and E.
/ \
1 / \ 1
/ \
D E D and E send 1 packet to their respective
hosts.
[ fig 8 ]
Multicast:
A A sends 1 packet to B
/ \
/ \ 1
/ \
C B B duplicates the packet for its host;
/ \ therefore, there is 1 packet (at most) on
1 / \ 1 each link.
/ \
D E
[ fig 9 ]
This is a multicast topology rooted at node A. There is also a shortest path
from A to every destination in the multicast group. This is called the
shortest path multicast tree rooted in A. Data would like to shortest path on
the network layer. One problem with multicast sessions is that recipients
join and leave during a multicast session. This requires pruning of the
multicast "tree" so that packets do not clutter a link on which there is no
one requesting data from a given multicast group.
To detect if there are hosts on a particular broadcast LAN that are interested
in a multicast group, the router sends out Internet Group Management Protocol
(IGMP) messages. Each packet has a random reply time from which the host will
express interest. This is to prevent all hosts on a broadcast LAN from
responding at the same time to an IGMP query. Once one host desires to
receive data destined for a particular multicast groups, all other hosts which
desire to be in the multicast group can discard their replies because the
router knows to multicast all incoming packets destined for that group. The
host then configures its adapter to answer for the MAC address corresponding
to that group.
Multicast must also be functional outside of the broadcast LAN. A simple
solution to the problem is to give each router for which multicast is enabled
the multicast packet. This is called flooding. Basically, it functions by
forwarding the packet to every interface other than the one that the packet
arrived on. The inherent flaws in this approach is that there is packet
duplication as well as packets being sent to routers which have no hosts
subscribed to the multicast group. To clarify the duplication statement, if
Router A is physically meshed with routers B, C, and D and linked to its
upstream via serial, when router A receives the multicast packet, it floods it
out the interfaces to routers B, C, and D. These routers then flood the packet
out the interface other than the one they received the packet on (namely the
interface that connects them to A). This results in each of these routers
receiving two copies of the packet (other than the one they received from A)
in this exchange.
A solution to this problem can be found in a technique called Reverse Path
Forwarding (RPF). RPF specifies that the router forwards a packet with a
source address of X only if the interface which the router receives the
packet on corresponds to the shortest path that router has to source
X (fig 10). Therefore, in the above example, each of the meshed routers
_still_ receives 2 duplicate packets in the second step, but they refuse to
forward them because only the packet received from the interface directly
connected to A will be forwarded. As noted, RPF does not completely solve
the problem of packet duplication. To solve this, we must introduce
pruning. The idea is simplistic: inform neighbors that you do not wish to
receive multicast packets from source X to multicast group Y. You can also
specify prunes to a particular group. If a router tells its neighbors that it
did not desire to receive packets for group Y and then has a host which
desires to receive group Y, it sends a graft message to its neighbors
specifying that it be added into the multicast tree.
As a unicast aside, RPF can also be used to eliminate source address spoofing
in that the router will only forward packets from source Y if it is receiving
it on the interface which is the shortest path to source Y. Thus, if the
router receives packets from an external link which say their saddr ==
saddr(y), the router will not forward them because its shortest path to Y is
not from the external link.
RPF Example:
| <-- Point of ingress.
|
A-----------C
|\ /|
| \_______/ |
| / \ |
|/ \|
B-----------D
[ fig 10 ]
ABCD are physically meshed. When A distributes a packet to BCD, there is
no problem. Now, in the next step, B, C,and D forward the packet to each
of their respective neighbors (for B it would be C and D and ! A because
it received the packet from A). This results in C and D receiving 2
packets in this entire exchange. Note that C and D now do _not_ forward
the packet they have received from A through B because that is not their
shortest path to A. Their shortest path is their direct link.
----[ The Multicast Backbone (MBONE)
It would be myopic to assume that every router on the internet supports
multicast. Thus, when a router needed to find the shortest path to a
destination (for forwarding of a multicast packet) it could look in the
unicast routing table. Unfortunately (or fortunately depending on your
perspective) most routers on the Internet do not support multicast or do
not have it enabled by default. Therefore, until most routers support
multicast, it has been "layered" over IP and tunneled from multicast router to
multicast router (more specifically, the multicast header and data is
encapsulated in a unicast IP header). The tunnel (which bridges the gap of
unicast only routers between multicast routers) informs each end that some
packets will contain a multicast group in their payload. This allows data to
be routed by using unicast forwarding tables while at the same time preserving
the integrity of the multicast group id.
Because these multicast routers are not necessarily connected physically
(though they are tunneled logically), they must be connected by a multicast
routing protocol. This is necessary because the closest path via multicast
may not correspond to the shortest path over unicast only routers. Distance
Vector Multicast Routing Protocol (DVMRP) is an initial foray into this realm.
DVMRP distributes "unicast" routes to facilitate the construction of shortest
path trees. DVMRP uses the flood and prune method discussed above to discover
/maintain multicast trees. There is also a link state foray into this arena.
Multicast Open Shortest Path First (MOSPF) takes the LSP approach and
calculates shortest absolute path. One host off of a multicast router can
request to be in a multicast group. That router then distributes an LSP over
the network. Of course, MOSPF (being a link state protocol) runs into
salability problems. It is computationally expensive for a router to compute
reachability information for each end node router.
Core based trees (CBT) are an attempt to alleviate the problems that DVMRP and
MOSPF experience. The concept is that multicast routers send join requests to
core routers of arbitrary designation. When a router learns of a host which
wishes to join a specific multicast group, it forwards a packet to the core
multicast router. Every router along the way forwards the packet towards the
core router and marks the interface on which the packet arrives so that it
knows where to forward the multicast packets from the core. This solves the
problem of having to communicate topology among all of the endpoints. The
choice of a core multicast router is a non-trivial because all multicast
traffic for multicast routers branching off of it _must_ pass through the core
router.
----[ Protocol Independent Multicast
Protocol independent multicast (PIM). Pim is a balance between flood and
prune and CBT. When there are many multicast routers on a given network, it
is more efficient to use the flood-and-prune method. This network environment
is called "dense". On the contrary, sparse mode defines networks where there
are few multicast routers. In sparse mode, it is more efficient to use CBT
because the core router is not weighted in an environment when it 'polices'
few multicast routers. When most of network is comprised of multicast routers,
it is not prudent to require all sessions to be coordinated (and routed
through) a core. Sparse mode PIM has been adapted from CBT to allow data to
reach its destination via the core or through the shortest path tree.
Currently, the operator must define whether groups are sparse or dense instead
of leaving it up to the protocol. cisco systems' implementation of pim also
supports a middle ground called 'sparse-dense' mode.
----[ Border Gateway Protocol
There has been some mention of the Border Gateway Protocol (BGP) in this
document. BGP was groomed as the successor to the Exterior Gateway Protocol
(EGP). BGP is mainly an exterior routing protocol. It is used to communicate
with systems outside of the operator's control and both distribute and receive
network layer reachability information (NRLI) from the neighboring routers.
BGP must be a robust protocol which has the capability of quick convergence
while at the same time, not being influenced by frequent shifts in topology.
When you use BGP to receive routing information, you are depending on the
other networks to distribute correct information to your network.
A BGP speaking router communicates with its peers via TCP. TCP over IP is a
mechanism for guaranteeing the transmission of data over a best effort service
at the IP layer. The choice of TCP as the distribution mechanism for BGP
information is a point of contention. While TCP provides inherent checksums,
acknowledgments, retransmissions and duplicate suppression mechanisms for
received packets, it does not guarantee packet order or packet path. This can
lead to headaches for the router receiving this information.
BGP peers communicate with a variety of message formats. BGP speakers use the
OPEN message to establish a peering relationship with other speakers. BGP
speakers use the UPDATE message to transfer routing information between peers.
Update information includes all routes and their associated attributes.
KEEPALIVE messages assure that BGP peers are active. NOTIFICATION messages
inform peers of error conditions.
----[ BGP path selection
It is important that we understand the messages that constitute the Border
Gateway Protocol, but we are still left with the question of how BGP works on
the internet. One important area of clarification is in the BGP path selection
algorithm. This algorithm is how BGP decides which route to prefer and
attempt to install in the routing table.
This algorithm is employed when there are multiple paths to a destination. As
a failsafe, the first selection looks at the next hop and determines if it is
accessible. If the next hop is not accessible, it is important not to
consider that route as a candidate path to a destination because all data sent
to its next_hop will be blackholed. The second selection mechanism is the
weight of the route. Weight is a proprietary implementation to cisco Systems
routers and is analogous to local preference. If two routes have different
weights, the route with the largest weight is selected. Notice that the
selection mechanism is basically a logical if->then chain. If candidate paths
differ at a level of the selection algorithm, then the favorable path is
selected and the algorithm ceases trying to decide which path to prefer. The
next level is the local_pref attribute. This is a well known mandatory BGP
attribute. Much like weight, the path with the highest local_pref is
preferred. After local preference, the router selects the path that it
originated. If the router didn't originate the paths, then the path with the
shortest AS_PATH length should be selected. AS path length gauges the number
of autonomous systems that this routing information has traveled through.
The purpose of this selection relies in the assumption that the less ASNs the
route has passed through, the closer the destination. If all of the above
attributes are identical, then prefer path origin in this fashion IGP > EGP >
Incomplete. If the path origins are the same, prefer the path with the lowest
value MULTI_EXIT_DESCRIMINATOR (MED). MEDs are commonly used to distinguish
between multiple exit points to the same peer AS. If none of these attributes
are dissimilar, then prefer the path through the closest IGP neighbor. If
that fails, the tiebreaker is preferring the path with the lowest IP address
specified in the BGP router-id section discussed above.
This selection algorithm allows effective establishment of routing policy. If
I wanted to prefer routes from a certain AS over routes to another AS, I could
establish a route-map at both entry points of external routing information and
assign a higher LOCAL_PREF to the routes from the AS I want to favor.
Unfortunately, this does not provide much granularity. This means that all
traffic will be routed to the favorable AS and does not allow us to balance
the load over our multiple connections. If you allow path selection to
progress to the AS path length decision level, then you will get decent
(though not 50-50) load balancing to destinations. This of course is assuming
that you have providers with comparable customer routes and connectivity. If
you have a DS3 to MCI and a DS3 to the local BFE provider, nearly all traffic
will move out the MCI pipe if the BGP decision is allowed to progress down to
the AS path length category. At earlier selections, you can change the
preference of routes by using AS path access lists to select routes based on
as path regular expressions. For example, if you wanted to select all routes
that traversed UUnet and send them out your BFE provider, you could use a route
map to match an AS path access list which contained _701_ and set the
local_pref to 100 (or some value higher than the UUwho traversed paths from
MCI). This will force all traffic destined for UUwho to exit your AS over
your BFE DS3. While this affords you some granularity in load balancing, it
is often not optimal. Basically, you are forcing traffic out a path that it
would not normally select. If that BFE provider has many hops before it can
reach UUnet, you are forcing the traffic you send out that link to traverse
all of those hops and be subject to (possibly) more link congestion, latency,
etc.
Routing policy is something that requires the tweaking of many knobs. Much of
the tweaking I have described above pertains to cisco Systems routers. It is
important to understand that you must think through routing policy before you
implement it. You must evaluate what load balancing you want, what traffic
symmetry you want, and what general quality of service your traffic will
receive because of your decisions.
For information more specific than this, read the BGP RFC or current BGPv4
internet draft [1].
----[ Open Shortest Path First v2 (OSPFv2)
We are not going into great detail about OSPF. It is a link state routing
algorithm. As noted above, link state algorithms route on flat space (no
hierarchy). OSPF is an improvement over the vanilla LS protocol because it
provides areas of maintenance for hierarchy purposes. Areas distribute full
information internally by running a separate OSPF process with its area ID.
Each router has an identical link state database with other routers within its
area, but not with external routers. Each area operates in an autonomous
state and transfers inter-area information at junction routers called area
border routers. These routers are in two or more areas and help distribute
information between these areas. The router has separate link state databases
for each area to which it is connected.
OSPFv2's main advantage is that it supports Variable Length Subnet Masks
(VLSM). This means that a router can advertise reachability with more
granularity than a scheme which advertised host reachability. Therefore, if
the router can distribute packets to all hosts from 206.4.4.1 -> 206.4.5.254
it advertises reachability to 206.4.4.0/23 instead of each classful network
separately or each host separately. This obviously saves immensely on link
state database size and processing power required.
For information more specific than this, read the current OSPFv2 RFC or
internet draft [2].
----[ References
[1] Rehkter, Y., Li, T., " A Border Gateway Protocol 4 (BGP-4)",
draft-ietf-idr-bgp4-07.txt, February 1998.
[2] Moy, J., "OSPF Version 2", draft-ietf-ospf-vers2-02.txt,
January 1998.
----[ EOF