|
---[ Phrack Magazine
Volume 8, Issue 53 July 8, 1998, article 05 of 15
-------------------------[
--------[ krnl <krnl@heuristic.org>
----[ Routing Overview:
The process of routing
can be quickly summarized as a node finding the path to
every possible destination. Routing is present in everything from layer
1
(the physical layer) on up. The routing that most people are familiar
with,
however, occurs at layer 3 (the network layer) and as such, we will only
reference layer 3 (and more specifically) Internet Protocol (IP) routing
in
this document.
Protocols for exchange
of routing information connect multiple routers around
the world to provide them with a common view of the network through their
heterogeneous, though generally consistent routing tables. Routing tables
store all information necessary for the router to reach every destination
on
the network irrespective of size (i.e. the network could be j.random LAN
with
one ip router and two hosts off of an ethernet port or it could be the
Internet proper).
----[ Routing Protocols:
There are a wide variety
of routing protocols used to contribute to the
routing tables across a network. Protocols such as BGP, OSPF, RIP and
ISIS
help to convey a correct and coherent picture of the network to all network
switches (routers).
----[ Routing Goals:
You can imagine that
if each router has to store information that would allow
it to reach every destination on the network, there is the possibility
for it
to amass a large routing table. Large routing tables are difficult (and
sometimes impossible) for routers to process because of physical constraints
(cpu, memory or a combination). Therefore, we would like to minimize the
routing table space without sacrificing the ability to reach every destination
on the network. For example, if the router is connected to the Internet
via
one DS1 link to another router, the router could store routing table
information for each destination on the Internet or it could just default
non-local information out that serial link. What defaulting means is that
if
the router does not have a specific entry in its table for the destination
that the packet is trying to find, it sends it out the default link. The
router towards which a router sends defaulted packets is sometimes called
the
'gateway of last resort'. This simple trick allows many routing tables
to
save a number of entries on the 30th order of magnitude. Routing information
should not be exchanged between routers in a spurious fashion. Frequent
churn
in the routing table puts unnecessary stresses on the scare memory and
cpu of
any given router. Information propagation should not interfere with the
forwarding operations of the router. Though this means that you should
not
send routing updates every nanosecond, it does not mean that routing
information should only be exchanged and updated weekly. One of the important
goals of routing is that it provide the host with a table which accurately
reflects the current status of the network.
The most important
aspect of a router's operation is sending packets from
input to correct output. Misrouting packets could cause a loss of data.
Routing table inconsistencies could also cause routing loops whereby a
packet
is passed between two adjacent interfaces ad infinitum.
It is desirous for
routers to have quick convergence. Convergence can be
informally defined as a metric which gauges the speed at which routers
arrive
at a consistent view of the network. It would be ideal to have infinitesimal
convergence times because that would ensure that each router on the network
can accurately reflect the current topology even after a drastic change
(link
failure). When the network is changing, each router must propagate data
which
will aid other routers in converging to the correct picture of the network
status. Problems with quick convergence are found in the routing updates.
If
a link is flapping (changing line status from up to down) rapidly, it
can
generate numerous installation and withdrawal requests. Therefore, that
one
link can end up consuming the resources of every router on the network
because
the other routers are forced to install and withdraw the route in rapid
succession. While convergence is an important goal of routing protocols,
it
is not a panacea to network woes.
----[ Distance Vector Routing
Distance vector routing
protocols distribute a list of <destination, cost>
tuples to all of the router's neighbors. These tuples assign a cost to
reach
every other node of the network. It is important to note that this routing
information is only distributed to routers which are assigned as neighbors
to
the originating router. These neighbors are often physical, but can be
logical in the case of eBGP multihop. That cost is the sum of the link
costs
for the router to reach a destination. Routers periodically send their
neighbors distance vector updates; the neighbor then compares the received
distance vector to its current distance vector. If the received values
are
lower, the router sends output to the destination in the distance vector
over
the link that it received the vector over.
The count to infinity
problem is a problem with many distance vector
implementations. We will assume that all links have a unit cost and that
each
hop corresponds to a unit. For example, if router X is connected to router
Y
and router Y is connected to router Z, we can demonstrate this problem
(see fig
1). Y knows a 1 hop path to Z and X knows a 2 hop path to Z. Assume that
link YZ goes down and the cost of that route is increased to infinity
(fig 2).
Now, Y knows an infinite cost route to Z because it knows the link is
down so
it propagates this distance vector to X. Suppose X has sent an update
to Y
which advertises a 2 hop distance vector. Now, Y will think that it can
get
to Z through X, so it sends X an update that says it can get to Z in three
hops (fig 3). Note that X has no idea that the distance vector being
advertised to it was originated from X. This is a serious flaw in distance
vectors. In their unmodified form, they do not contain the full path
information that the route has traversed. As illustrated above, the router
alternates states of advertising a path to Z and advertising infinity
to Z.
They keep this exchange up forever or until they have reached some internally
defined infinity count (say 15 as in the case of RIP).
Count to Infinity
problem:
X--------------------Y--------------------Z
Y:1 X:1 X:2
Z:2 Z:1 Y:1
[ fig 1 ]
All links are up, below each node we note the destination and hopcount
from each respective node.
X--------------------Y--------* *---------Z
Y:1 <-------------
Z:infinity
Z:2 -------------> X:1
[ fig 2 ]
The link Y - Z breaks. Node X advertises Z:2 to node Y.
X--------------------Y--------*
*---------Z
Z:infinity(frm Y)
-> X:1
Y:1 <------------- Z:3
[ fig 3 ]
Node Y sends its Z distance vector to X _before_ it recieves node X's
infinity. Once node Y receives node X's infinity, it sets its distance
to
infinity.
A path vector is an
easy way to defeat the count-to-infinity problem.
Basically, each distance vector also includes the router path that it
traversed (fig 4). The router rejects an update from its neighbor if the
path
included in the update includes the router receiving the update (fig 5).
The
Border Gateway Protocol (which is used to exchange routing information
between
Autonomous Systems on the Internet) incorporates the path vector to stop
the
count-to-infinity problem. Obviously, you have to incorporate more
information in the routing table if you want to include the AS path
information that the route has traversed. The designers of BGP decided
that it
was optimal to sacrifice storage space and processing power for the robustness
that the path vector affords the routing protocol.
Path Vector Solution:
X--------------------Y--------------------Z
Y:1 (Y) X:1 (X) X:2
(YX)
Z:2 (YZ) Z:1 (Z) Y:1 (Y)
[ fig 4 ]
All links are up, below each node we note the destination, hopcount and
path vector from each respective node.
X--------------------Y--------* *---------Z
Y:1 (Y) X:1 (X)
Z:2 (Y Z) Z:infinity
[ fig 5 ]
The link Y - Z breaks. Node Y knows to ignore Xs advertisement of Z
because Y is the path vector. The avoids the count-to-infinity problem.
Another way to counter
this problem is the split horizon. Basically, this
means that a router shouldn't advertise a path to a neighbor if that neighbor
is the next hop to the destination. This solves the problem presented
in the
example above because the path to Z from X through Y would not have been
advertised to Y because Y is the neighbor _and_ the next hop to the
destination (Z). A variation called split horizon with poisonous reverse
has
router X advertise an infinite cost to get to destination Z. Under a split
horizon, router X would not advertise that it could get to router Z.
----[ Link State Routing
A router using a link
state routing protocol distributes the distance to its
neighbors to every other router on the network. This allows each router
on
the network to make a routing table without knowing the full cost to the
destination from any one source. The problems of loops are avoided because
each router contains the full topology of the network. Basically, the
router
makes a 3 tuple containing the source router (itself) the neighbor and
the
cost to its neighbor. Therefore, if router A is connected to Router B
over a
link of cost 3 and router A is connected to router C over link cost 5,
then
router A would advertise the Link State Packets (LSPs) <A,B,3> and
<A,C,5> to
all routers on this network. Each router on the network would evaluate
all of
the LSPs that it receives and calculate a shortest path to every destination
on the network.
Obviously, the LSP
is an integral part of the convergence process. If someone
could inject false LSPs into the network, it could result in misrouted
information (a packet taking a longer path than it should) or even in
the
blackholing of a router on the network. This is not necessary a malicious
attack of a network, however. Router C could advertise a link to its neighbor
D with the 3 tuple <C,D,6> and then withdraw the announcement when
the link
goes down. Unfortunately, if the LSP advertising the link having an infinite
cost arrives before the LSP advertising the cost of that link being 6,
the
routing table will not reflect the topology of the network and will be
in that
state until another LSP comes to correct the problem.
To combat this, a
sequence number is introduced into the LSP. Therefore, all
of the routers on the network would initialize their sequence number to
some
starting value and then start advertising their LSPs. This solves the
above
problem in that the LSP advertising the link of infinite cost would have
a
higher sequence number than the LSP advertising the link as having cost
6.
Some problems encountered
when using sequences numbers are finite sequence
space, sequence initialization, and aging. It is in the best interest
of a
robust link state protocol needs to protect its LSPs as well as choose
a
sequence space which is sufficiently large to accommodate updates. The
sequence space that the LSPs can use is set to some finite value. Therefore,
when the sequence numbers reach the top of the space, they must wrap around
towards the smallest sequence number. This presents a problem because
when a
router compares link state updates, the greater sequence number takes
preference. To combat this problem, you can define a maximum age of the
LSP.
Therefore, if you have not received an update in X ticks, you discard
the
current LSP information and wait for a further update. It must be noted
that
this invalidates the path information to a destination. For example, if
router Y advertises a cost to its neighbor router Z where router Y is
connected by one link to a meshed network, when the link between the mesh
and
router Y breaks, the other routers in the mesh have preserved link state
information that will allow them to find a path towards Z. If they receive
no
updates in MAX_AGE, then they will assume that the link to Y is unreachable.
This will allow each router to converge its table and allow it to advertise
an
infinite LSP for Y and Z.
Sequence initialization
is also an important facet of this problem. Say
router Y crashed and is rebooting while the network is recalculating paths
to
it. When it starts its link state protocol back up, it must somehow indicate
that it needs to reinitialize its sequence number to the last number it
gave
all of the other routers to allow for coherence. Therefore, it can announce
paths with a sequence number in a special "initialization set".
This
initialization set will tell the other routers that this router needs
the
sequence where it left off. This is the "lollipop sequence"
idiom. The
sequence space really resembles a lollipop in that the normal sequence
number
keep churning around the finite sequence space while reinitialization
takes
place in a short linear sequence space (comparable to the stick :).
Great pains are taken
to ensure the integrity of LSPs. In fact, this entire
routing algorithm depends on the LSP being digested in a coherent method
to
guarantee that each router has the correct view of the network topology.
The
question still remains how the root node router computes the distance
to each
destination.
Because of the general
nature of a link state protocol, you have various nodes
of the network advertising the distance to get to their neighbors to every
other node on the network. Thus each node has a collection of neighbor
distances from various routers on the network. The routing table is basically
'grown' outward from the root node to all of the network extremities.
This
will be explained in a slightly rigorous fashion in the next section.
----[ Dijkstra's Algorithm
This algorithm is
a simple and elegant way to determine network topology.
Basically, there are two distinct sets of destinations on the network.
Destinations in set K are known routes for which a shortest path has been
computed. Destinations in set U are routers for which the best path to
that
router is not currently known. In this set, paths are being considered
as
candidates to be the best path to that destination.
To start off, add
the current node p into the set K. Then add all of its
neighbors into the set U with path/cost associations. If there is another
path
to one of the neighbors in the U set, then choose the path which costs
the
least. When the neighbors N* are added to U make sure that they indicate
the
cost through p as well as p's ID .
Once this has been
done for the set U, then pick the neighbor n to p which has
the smallest cost to reach p. This is assuming that the neighbor has not
already been installed in K. This algorithm stops when set U is equivalent
to
the empty set. When set U is null, it is implied that all destinations
are in
set K and have the shortest cost from the root node P on which this algorithm
is running. Note, that each step evaluates adds ONE neighbor into K. That
neighbor is the router with the smallest cost to reach p.
----[ Distance Vector vs. Link State
We are left with these
protocols like BGP which uses path vector and OSPF
which uses link state. Why do they occupy such orthogonal space? When
a link
state protocol is working correctly, it guarantees that there will be
no
routing loops in the network. The link state protocol also guarantees
fast
convergence when there is a change in the topology of the network because
the
link state is distributed on a flat routing space. Since link state protocols
contain these inherent advantages, why do protocols like BGP chose to
employ
the path vector approach?
Taking a cross-section
of routing protocols that are employed on the internet,
one finds that the majority of large providers use OSPF to resolve routing
information on their internal network and BGP to talk to other distinct
networks (or autonomous systems) at their borders of administration. What
suits BGP as an external protocol and OSPF for an internal routing protocol?
One issue, which will
be discussed in the next section, is hierarchy. BGP
provides a mechanism for a routing hierarchy which enables it to greatly
reduce the space of its table. OSPF, which is a link state protocol,
provides a flat routing table whereby any internal router knows the full
hop by hop path to any destination within the autonomous system. Furthermore,
distance vector protocols understand that different areas can have different
views of the network where link state protocols require that each node
independently compute a consistent view of the network. This saves the
DV
protocol the overhead of maintaining a correct LSP database. BGP also
has
another 'advantage' in that it is layered on top of the Transmission Control
Protocol (TCP). Therefore, in the 'best-effort' service of IP networks,
BGP
has assurance (to the level that TCP can guarantee) that routing information
will be propagated. Whereas, you can (or should) be able to govern the
status
of your internal network, the nebulous exterior past your border routers
confers no delivery guarantee on your routing information.
Each type of routing
algorithm is suited for its function. Link State
protocols provide the quick convergence that is essential to an internal
network while distance vector protocols provide external reachability.
Hierarchy is not something that is inherent in distance vector protocols,
but the implementation of a hierarchy has made BGP a widely used exterior
gateway protocol.
----[ Routing Hierarchy
Routing hierarchy
is an oft fought debate that borders on religion. There
are constantly questions about how hierarchy should be implemented (if
at
all) in the free form state of the current global network. Hierarchy imposes
a tree of authority with the overall authority at the top of the tree
and
branching down to regional authorities, local authorities ad infinitum.
Hierarchy simplifies routing because if a destination is not locally routable
(or under your section of the tree). You can iterate up towards the top
tree
to try and deliver that information. As you move towards the top, the
routing
information contained in the routers becomes less and less specific until
you
reach the root node which is the least specific. It does, however, know
how
to route information to every possible destination on the network. It
may help
you to envision the hierarchy of the telephone network (built under one
collective). If a call cannot be placed within a central office, it is
handed
to either another central office in the area code or a wide area link.
The
wide area link understands how to route to each area code in a full national
mesh whilst the local 5ess switch only knows routing information for more
specific prefixes. As the phone number becomes less specific (from right
to left), the routing decision moves further up the strict hierarchy.
This similar to how
the domain name system (DNS) works on the internet (fig 6).
You provide local records for domains that you host. When your nameserver
receives a query for a record, it either returns the fact that it has
authority for that record or points toward the root nameserver. The root
nameserver knows the delegations of .com, .net, .org et al. and then points
towards the site responsible for that top level domain. That site then
points
towards the site that has authority for the specific second level domain.
Domain names take the form of most specific -> least specific; i.e.
microsoft.com is more specific than just .com. Likewise
gates.house.microsoft.com is more specific than microsoft.com.
DNS Hierarchy:
___ . ___
/ | \
.com. .org. .edu.
/ | \
microsoft.com. eff.org. isi.edu.
/ | \
billy.microsoft.com. x0r.eff.org. rs.isi.edu.
[ fig 6 ]
Each level in the hierarchy is responsible for levels of greater
specificity.
Root authority is controlled by the Internet Assigned Numbers Authority
(IANA). It provides the top of the hierarchy in a "centrally"
managed
database (in fact, there are multiple root servers distributed across
the
county which maintain a consistent database). This is the closest example
of
strict hierarchy that can be found on the internet.
With IP addresses,
specificity increases in the opposite direction. IP
addresses (version 4) are 32-bits. The rightmost bit signifies the greatest
amount of specificity and the leftmost, the least. IP routing authority
information is not maintained in a centralized database. Routing information
is exchanged between autonomous systems via the BGP protocol. Routes take
preference in order of most specific -> least specific. In this way,
there is
some type of hierarchy in the system (even though it is more loose than
the
DNS example). Generally, larger providers control larger parts of the
total
IPv4 space ((2^32) - 3 addresses). The converse is also true.
Classless Inter-Domain
Routing (CIDR) also helped to decrease the size of
routing tables and increase the appearance of hierarchy. Now, instead
of
Sprint announcing routes to 130.4.0.0 through 130.20.0.0 (on the classical
B
space boundary) it could announce 130.4.0.0/12 which encompasses that
entire
16 class B range. The classful ranges, subnetworking and the like are
discussed in my "introduction to IP" page and are therefore
not included in
this document.
----[ Routing Hierarchy and Aggregation
BBN divides their
8/8 network into two subnetworks and advertises reachability
to the aggregate to save table space. Once inside an AS, routing obeys
a fairly
strict hierarchy. Router A is responsible for the entire 131.103/16. It
divides it into two /17. Likewise, Router D in AS1 is responsible for
8/8 and
chooses to divide it into 8.0/9 and 8.128/9 and divides responsibility
for
these networks to Routers E and F respectively (fig 7). Routers B, C,
E, and F
can further choose to subdivide their networks in a hierarchical fashion.
Because of the binary nature of subnetting, networks can only be divided
in
half.
Routing Hierarchy
and Aggregation:
BGP
131.169.0.0/16 <-------------------->
8.0.0.0/8
A (AS1239) D (AS1)
/ \ / \
B / \ C E / \ F
/ \ / \
131.169.0.0/17 131.169.128.0/17 8.0/9 8.128/9
[ fig 7 ]
In the internet, there is no strict routing hierarchy. There are simply
large networks which peer via BGP to distribute aggregated routing
information.
The national backbone is populated by few nodes (when compared to the
end
nodes). Most national providers are one or two router hops away from every
large network. Through aggregation in networks below, national providers
provide fully (or at least we hope) aggregated routing information. In
a
strict hierarchy, only one router on any given hierarchy level can advertise
reachability to a specific portion of the network. In the current state
of
the Internet, multiple routers can advertise reachability information.
For
example, Sprint announces 131.169.0.0/16 out to Digex, MCI, and BBN. Though
this breaks some of the benefits of a strict hierarchy, it confers other
benefits. This scheme allows for distributed control of routing information
instead of depending on the node above. Also, nodes on the same level
are
often interconnected to aid in the dissemination of routing information.
----[ Aggregation
As discussed slightly
before, aggregation allowed the internet to reduce the
size of its external reachability tables. Before, the granularity of route
announcements allowed for only /8, /16, and /24 (octet boundaries). Now,
with
CIDR you could use variable length subnet masks. The only requirement
was
that they fall on one of the 32-bit boundaries of the IP address.
Classless routing
not only allows us to minimize routing table space, it also
allows us to divide up large chunks of unused space into manageable pieces.
Much of the Class A space is terribly under-utilized. With this scheme
one
can more efficiently allocate IP addresses to providers/netizens. The
American
Registry of Internet Numbers (ARIN) controls the allocation of IP addresses
within the United States.
Aggregation helps
alleviate the problems of not being in a strict hierarchical
structure. It allows the least amount of route table specificity at each
level (assuming the routers on that level choose to fully aggregate their
announcements.) The less specific aggregates do not necessarily indicate
the
position of a router in the hierarchy. For example, a university may announce
a /8 and be 3 hops away from the national backbone.
A problem with aggregates
can be found when we examine candidate route
specificity. If ISP A only has address space from within the allocated
block
to their parent P, then aggregation could cause problems if ISP A wanted
to
multihome to parent Q. The problem comes in that ISP A is obligated to
advertise reachability only to their space. This would constitute them
announcing their address space to Parent Q. Assume that parent P aggregates
ISP A's space into a /16 for the sake of saving route announcements. Now,
ISP
A would seem to have better reachability only through parent Q because
of the
specificity of the route announcement (remember that more specific routes
take
precedence over less specific routes). This would nullify the benefits
of
multihoming in an attempt to distribute load over the two lines. In this
case,
ISP A would ask parent P to announce a more specific destination which
has a
length matching the length of the aggregate assigned to ISP A. Therefore,
to
the world above parent P and parent Q, the path to ISP A looks equally
appealing.
----[ Exterior/Interior
It is important to
look at how routing information is disseminated throughout
the network. It has already been discussed that we use separate routing
protocols (with their respective benefits/costs) to talk to the internal
and
external world. However, these protocols cannot take orthogonal views
on
routing information. In fact, the interplay between interior and exterior
routing protocols is what allows data to be effectively relayed to a
destination external to the network as well as to distribute external
routing
information to all nodes on the internal network.
There are a few ways to ensure that each router has a consistent view
of the
network. One is to distribute the external protocol into the internal
protocol. Thereby, the internal protocol instantly has routing information
injected in it for the best path to every external destination. Note that
the
router which talks eBGP (or comparable protocol) only redistributes the
route
that it installs in its routing table and not the other candidate routes
which
may have been advertised to it.
Another approach is to inject the interior protocol into the exterior
protocol.
Of course, this necessitates filtering at the entrance point to the exterior
protocol to prevent the announcement of all internal specifics. You can
accomplish internal routing dissemination inside an Interior Gateway Protocol
mesh. Because of the specifics of protocols like BGP, externally learned
routing information will only be propagated one logical hop within the
network.
Therefore, every router that must know this external reachability information,
must be fully meshed with the eBGP speaking routers. Also, if other routers
are injecting information into the Exterior Gateway Protocol, the router
should be logically fully meshed with them.
----[ Multicast Routing
Overview
What we have been
talking about above is unicast routing. In unicast routing,
you assume that each packet has a single destination address. Assuming
infinite resources being available, unicast is a feasible solution for
every
situation. However, there are situations when it would be advantageous
to send
a packet to multiple destinations from a single source (point to multipoint)
or
from multiple sources to multiple recipients (multipoint to multipoint).
The point of multicast
is to provide a multicast group to which hosts can
subscribe and get the specific multicast feed. The multicast group is
a single
IP address in class D space. Therefore, the senders and receivers are
only
associated by a given multicast group address. Thus, the senders move
their
data towards the multicast group address and the receivers specify that
they
want to receive information from a given group address. In fact, the sender
is not required to know any information about the hosts that are receiving
the
feed.
----[ Multicast vs. Unicast
If one was sending
packets from a single source to a set of destinations, it
is important to investigate how multicast and unicast handle the distribution.
Assume that router
A is sending data to routers B, D and E. A is at the top of
the hierarchy, B and C are at the second level of the hierarchy, and D
and E
are below router B. With multiple unicast (fig 8), router A makes 3 copies
of
the packet and sends them down link AB. Router B then sends one packet
to a
host off of its ethernet and forwards the last two packets to routers
D and E
whereupon those routers send the packets to the their respective hosts
in the
multicast group.
Therefore, this transmission
takes up 3 packets per second on link AB and 1
pps on links B->Host(b), router D and router E. In a multicast routing
implementation, assuming the same topology, we will have less packets.
The
difference is that router A sends _one_ packet over link AB. Router B
then
triplicates the packet and sends it to Host(b), router D and router E
(fig 9).
One way for triplicating the packet is to simultaneously close crossbars
on a
fully crossed switch fabric, thus sending data from one input to three
outputs
simultaneously. As long as there is route redundancy, multicast is very
efficient because it minimizes redundant packets traveling to the same
next-hop. Simply, as long as there is route redundancy for the distributed
session (for example, an audio feed) you will see an advantage with multicast
over unicast.
Multicast Overview
Example:
Multiple Unicast:
A A sends 3 packets to B.
/ \
/ \ 3
/ \
C B B sends 1 packet to each to D and E.
/ \
1 / \ 1
/ \
D E D and E send 1 packet to their respective
hosts.
[ fig 8 ]
Multicast:
A A sends 1 packet to B
/ \
/ \ 1
/ \
C B B duplicates the packet for its host;
/ \ therefore, there is 1 packet (at most) on
1 / \ 1 each link.
/ \
D E
[ fig 9 ]
This is a multicast topology rooted at node A. There is also a shortest
path
from A to every destination in the multicast group. This is called the
shortest path multicast tree rooted in A. Data would like to shortest
path on
the network layer. One problem with multicast sessions is that recipients
join and leave during a multicast session. This requires pruning of the
multicast "tree" so that packets do not clutter a link on which
there is no
one requesting data from a given multicast group.
To detect if there
are hosts on a particular broadcast LAN that are interested
in a multicast group, the router sends out Internet Group Management Protocol
(IGMP) messages. Each packet has a random reply time from which the host
will
express interest. This is to prevent all hosts on a broadcast LAN from
responding at the same time to an IGMP query. Once one host desires to
receive data destined for a particular multicast groups, all other hosts
which
desire to be in the multicast group can discard their replies because
the
router knows to multicast all incoming packets destined for that group.
The
host then configures its adapter to answer for the MAC address corresponding
to that group.
Multicast must also
be functional outside of the broadcast LAN. A simple
solution to the problem is to give each router for which multicast is
enabled
the multicast packet. This is called flooding. Basically, it functions
by
forwarding the packet to every interface other than the one that the packet
arrived on. The inherent flaws in this approach is that there is packet
duplication as well as packets being sent to routers which have no hosts
subscribed to the multicast group. To clarify the duplication statement,
if
Router A is physically meshed with routers B, C, and D and linked to its
upstream via serial, when router A receives the multicast packet, it floods
it
out the interfaces to routers B, C, and D. These routers then flood the
packet
out the interface other than the one they received the packet on (namely
the
interface that connects them to A). This results in each of these routers
receiving two copies of the packet (other than the one they received from
A)
in this exchange.
A solution to this
problem can be found in a technique called Reverse Path
Forwarding (RPF). RPF specifies that the router forwards a packet with
a
source address of X only if the interface which the router receives the
packet on corresponds to the shortest path that router has to source
X (fig 10). Therefore, in the above example, each of the meshed routers
_still_ receives 2 duplicate packets in the second step, but they refuse
to
forward them because only the packet received from the interface directly
connected to A will be forwarded. As noted, RPF does not completely solve
the problem of packet duplication. To solve this, we must introduce
pruning. The idea is simplistic: inform neighbors that you do not wish
to
receive multicast packets from source X to multicast group Y. You can
also
specify prunes to a particular group. If a router tells its neighbors
that it
did not desire to receive packets for group Y and then has a host which
desires to receive group Y, it sends a graft message to its neighbors
specifying that it be added into the multicast tree.
As a unicast aside,
RPF can also be used to eliminate source address spoofing
in that the router will only forward packets from source Y if it is receiving
it on the interface which is the shortest path to source Y. Thus, if the
router receives packets from an external link which say their saddr ==
saddr(y), the router will not forward them because its shortest path to
Y is
not from the external link.
RPF Example:
| <-- Point of
ingress.
|
A-----------C
|\ /|
| \_______/ |
| / \ |
|/ \|
B-----------D
[ fig 10 ]
ABCD are physically meshed. When A distributes a packet to BCD, there
is
no problem. Now, in the next step, B, C,and D forward the packet to each
of their respective neighbors (for B it would be C and D and ! A because
it received the packet from A). This results in C and D receiving 2
packets in this entire exchange. Note that C and D now do _not_ forward
the packet they have received from A through B because that is not their
shortest path to A. Their shortest path is their direct link.
----[ The Multicast Backbone (MBONE)
It would be myopic
to assume that every router on the internet supports
multicast. Thus, when a router needed to find the shortest path to a
destination (for forwarding of a multicast packet) it could look in the
unicast routing table. Unfortunately (or fortunately depending on your
perspective) most routers on the Internet do not support multicast or
do
not have it enabled by default. Therefore, until most routers support
multicast, it has been "layered" over IP and tunneled from multicast
router to
multicast router (more specifically, the multicast header and data is
encapsulated in a unicast IP header). The tunnel (which bridges the gap
of
unicast only routers between multicast routers) informs each end that
some
packets will contain a multicast group in their payload. This allows data
to
be routed by using unicast forwarding tables while at the same time preserving
the integrity of the multicast group id.
Because these multicast
routers are not necessarily connected physically
(though they are tunneled logically), they must be connected by a multicast
routing protocol. This is necessary because the closest path via multicast
may not correspond to the shortest path over unicast only routers. Distance
Vector Multicast Routing Protocol (DVMRP) is an initial foray into this
realm.
DVMRP distributes "unicast" routes to facilitate the construction
of shortest
path trees. DVMRP uses the flood and prune method discussed above to discover
/maintain multicast trees. There is also a link state foray into this
arena.
Multicast Open Shortest Path First (MOSPF) takes the LSP approach and
calculates shortest absolute path. One host off of a multicast router
can
request to be in a multicast group. That router then distributes an LSP
over
the network. Of course, MOSPF (being a link state protocol) runs into
salability problems. It is computationally expensive for a router to compute
reachability information for each end node router.
Core based trees (CBT)
are an attempt to alleviate the problems that DVMRP and
MOSPF experience. The concept is that multicast routers send join requests
to
core routers of arbitrary designation. When a router learns of a host
which
wishes to join a specific multicast group, it forwards a packet to the
core
multicast router. Every router along the way forwards the packet towards
the
core router and marks the interface on which the packet arrives so that
it
knows where to forward the multicast packets from the core. This solves
the
problem of having to communicate topology among all of the endpoints.
The
choice of a core multicast router is a non-trivial because all multicast
traffic for multicast routers branching off of it _must_ pass through
the core
router.
----[ Protocol Independent Multicast
Protocol independent
multicast (PIM). Pim is a balance between flood and
prune and CBT. When there are many multicast routers on a given network,
it
is more efficient to use the flood-and-prune method. This network environment
is called "dense". On the contrary, sparse mode defines networks
where there
are few multicast routers. In sparse mode, it is more efficient to use
CBT
because the core router is not weighted in an environment when it 'polices'
few multicast routers. When most of network is comprised of multicast
routers,
it is not prudent to require all sessions to be coordinated (and routed
through) a core. Sparse mode PIM has been adapted from CBT to allow data
to
reach its destination via the core or through the shortest path tree.
Currently, the operator must define whether groups are sparse or dense
instead
of leaving it up to the protocol. cisco systems' implementation of pim
also
supports a middle ground called 'sparse-dense' mode.
----[ Border Gateway Protocol
There has been some
mention of the Border Gateway Protocol (BGP) in this
document. BGP was groomed as the successor to the Exterior Gateway Protocol
(EGP). BGP is mainly an exterior routing protocol. It is used to communicate
with systems outside of the operator's control and both distribute and
receive
network layer reachability information (NRLI) from the neighboring routers.
BGP must be a robust protocol which has the capability of quick convergence
while at the same time, not being influenced by frequent shifts in topology.
When you use BGP to receive routing information, you are depending on
the
other networks to distribute correct information to your network.
A BGP speaking router
communicates with its peers via TCP. TCP over IP is a
mechanism for guaranteeing the transmission of data over a best effort
service
at the IP layer. The choice of TCP as the distribution mechanism for BGP
information is a point of contention. While TCP provides inherent checksums,
acknowledgments, retransmissions and duplicate suppression mechanisms
for
received packets, it does not guarantee packet order or packet path. This
can
lead to headaches for the router receiving this information.
BGP peers communicate
with a variety of message formats. BGP speakers use the
OPEN message to establish a peering relationship with other speakers.
BGP
speakers use the UPDATE message to transfer routing information between
peers.
Update information includes all routes and their associated attributes.
KEEPALIVE messages assure that BGP peers are active. NOTIFICATION messages
inform peers of error conditions.
----[ BGP path selection
It is important that
we understand the messages that constitute the Border
Gateway Protocol, but we are still left with the question of how BGP works
on
the internet. One important area of clarification is in the BGP path selection
algorithm. This algorithm is how BGP decides which route to prefer and
attempt to install in the routing table.
This algorithm is
employed when there are multiple paths to a destination. As
a failsafe, the first selection looks at the next hop and determines if
it is
accessible. If the next hop is not accessible, it is important not to
consider that route as a candidate path to a destination because all data
sent
to its next_hop will be blackholed. The second selection mechanism is
the
weight of the route. Weight is a proprietary implementation to cisco Systems
routers and is analogous to local preference. If two routes have different
weights, the route with the largest weight is selected. Notice that the
selection mechanism is basically a logical if->then chain. If candidate
paths
differ at a level of the selection algorithm, then the favorable path
is
selected and the algorithm ceases trying to decide which path to prefer.
The
next level is the local_pref attribute. This is a well known mandatory
BGP
attribute. Much like weight, the path with the highest local_pref is
preferred. After local preference, the router selects the path that it
originated. If the router didn't originate the paths, then the path with
the
shortest AS_PATH length should be selected. AS path length gauges the
number
of autonomous systems that this routing information has traveled through.
The purpose of this selection relies in the assumption that the less ASNs
the
route has passed through, the closer the destination. If all of the above
attributes are identical, then prefer path origin in this fashion IGP
> EGP >
Incomplete. If the path origins are the same, prefer the path with the
lowest
value MULTI_EXIT_DESCRIMINATOR (MED). MEDs are commonly used to distinguish
between multiple exit points to the same peer AS. If none of these attributes
are dissimilar, then prefer the path through the closest IGP neighbor.
If
that fails, the tiebreaker is preferring the path with the lowest IP address
specified in the BGP router-id section discussed above.
This selection algorithm
allows effective establishment of routing policy. If
I wanted to prefer routes from a certain AS over routes to another AS,
I could
establish a route-map at both entry points of external routing information
and
assign a higher LOCAL_PREF to the routes from the AS I want to favor.
Unfortunately, this does not provide much granularity. This means that
all
traffic will be routed to the favorable AS and does not allow us to balance
the load over our multiple connections. If you allow path selection to
progress to the AS path length decision level, then you will get decent
(though not 50-50) load balancing to destinations. This of course is assuming
that you have providers with comparable customer routes and connectivity.
If
you have a DS3 to MCI and a DS3 to the local BFE provider, nearly all
traffic
will move out the MCI pipe if the BGP decision is allowed to progress
down to
the AS path length category. At earlier selections, you can change the
preference of routes by using AS path access lists to select routes based
on
as path regular expressions. For example, if you wanted to select all
routes
that traversed UUnet and send them out your BFE provider, you could use
a route
map to match an AS path access list which contained _701_ and set the
local_pref to 100 (or some value higher than the UUwho traversed paths
from
MCI). This will force all traffic destined for UUwho to exit your AS over
your BFE DS3. While this affords you some granularity in load balancing,
it
is often not optimal. Basically, you are forcing traffic out a path that
it
would not normally select. If that BFE provider has many hops before it
can
reach UUnet, you are forcing the traffic you send out that link to traverse
all of those hops and be subject to (possibly) more link congestion, latency,
etc.
Routing policy is
something that requires the tweaking of many knobs. Much of
the tweaking I have described above pertains to cisco Systems routers.
It is
important to understand that you must think through routing policy before
you
implement it. You must evaluate what load balancing you want, what traffic
symmetry you want, and what general quality of service your traffic will
receive because of your decisions.
For information more specific than this, read the BGP RFC or current BGPv4
internet draft [1].
----[ Open Shortest
Path First v2 (OSPFv2)
We are not going into
great detail about OSPF. It is a link state routing
algorithm. As noted above, link state algorithms route on flat space (no
hierarchy). OSPF is an improvement over the vanilla LS protocol because
it
provides areas of maintenance for hierarchy purposes. Areas distribute
full
information internally by running a separate OSPF process with its area
ID.
Each router has an identical link state database with other routers within
its
area, but not with external routers. Each area operates in an autonomous
state and transfers inter-area information at junction routers called
area
border routers. These routers are in two or more areas and help distribute
information between these areas. The router has separate link state databases
for each area to which it is connected.
OSPFv2's main advantage
is that it supports Variable Length Subnet Masks
(VLSM). This means that a router can advertise reachability with more
granularity than a scheme which advertised host reachability. Therefore,
if
the router can distribute packets to all hosts from 206.4.4.1 -> 206.4.5.254
it advertises reachability to 206.4.4.0/23 instead of each classful network
separately or each host separately. This obviously saves immensely on
link
state database size and processing power required.
For information more
specific than this, read the current OSPFv2 RFC or
internet draft [2].
----[ References
[1] Rehkter, Y., Li,
T., " A Border Gateway Protocol 4 (BGP-4)",
draft-ietf-idr-bgp4-07.txt, February 1998.
[2] Moy, J., "OSPF
Version 2", draft-ietf-ospf-vers2-02.txt,
January 1998.
----[ EOF
|