Point-to-Point Routing in the Internet

4.4.1 The IP Protocol

The Internet network layer standard is the IP protocol, version 4, which is specified in [RFC 791].   It is important to note that while IP is the Internet's network layer standard it does not specify a routing algorithm.  Instead the IP standard specifies: The IPv4 datagram format is shown in Figure 4.4-1.
Figure 4.4-1: IPv4 datagram format
Some of the key fields in the IPv4 datagram are the following:

IP addressing

Every host, router, gateway or network device that is capable of sending or receiving an IP datagram has an IP address.  More precisely, an IP address is associated with each network interface and thus a host, router, gateway  or device will have one or more IP addresses. (Gateways, in particular, will typically have multiple network interfaces/addresses -- one per attached network). IP addresses are meant to be globally unique, that is, no two network entities in the world can have the same IP address.
Figure 4.4-2: IPv4 address structures
Figure 4.4-2 shows the four possible formats of an IP address. A fifth address, beginning with 11110 is "reserved for future use.  An IP address is 32 bits long and address is divided into a "network" part -- the bits that identify the autonomous system (AS)  that the host, router, gateway or network device belongs to, and a "host" part -- the bits the identify the host, router, gateway within the AS.  To a first approximation, it is the network part of the address that is used for inter-domain routing. (This is not precisely true, as a  technique known as subnetting allows the administrator who controls the "network" specified by the network part of the address to place additional structure -- to define subnetworks within its network -- using some of the  "host" bits.  A discussion of subnetting can be found in RFC 950)

For a class A address, the first 8 bits identify the network, i.e., the autonomous system that we discussed in section 4.3, and the last 24 bits identify the host, router, gateway, or network device within that AS.   (For exposition purposes, we will assume from here on that the network entity is a host).   The class B address space allows for 214 networks, with up to 216 hosts with each network.  Class C address use 21 bits to identify the network and leave only 8 bits for the host identifier.  Class D addresses are reserved for so-called multicast addresses.  As we will see in section 4.6, these addresses do not identify a specific host but rather provide a mechanism through which multiple hosts can receive a copy of  each single packet sent by a sender.

Before leaving our discussion of addressing, a few comments are in order.  First, it should be noted that IP addresses do not reflect any geographical relationship among entities.  For example, unlike the telephone network with its standard country code numbers, it is not possible to tell where, physically, a network is.  If there were so, it might have been possible to design routing algorithms that would look at the address and route a datagram according (e.g., from western Europe one would generally route trans-atlantic to the west to reach a US address).   Second, mobile hosts may change the network to which they are attached, either dynamically while in motion or on a longer time scale.  Because routing is to a network (AS) first, and then to a host within the network, this means that the mobile host's IP address must change when the host changes networks.  Techniques for handling such issues are now under development within the IETF and the research community [RFC2002 RFC2131].

Now that we've seen the structure of the IP address, you might wonder where would you actually go to get one, e.g., for your network?  The first contact might well be with your Internet service provider who would provide an address from a block of addressees that have already been allocated to that ISP.  But how does an ISP get a block of addresses?  IP addresses are managed under the authority of the Internet Assigned Numbers Authority (IANA), under the guidelines set forth in RFC 2050.  The actual assignment of addresses is now managed by regional Internet registries.  As of mid-1998, there are three such regional registries: the American Registry for Internet Number (ARIN, which handles registrations for North and South America, as well as parts of Africa. ARIN has recently taken over a number of the functions previously provided by Network Solutions), the  Reseaux IP Europeens (RIPE, which covers Europe and nearby countries), and the Asia Pacific Network Information Center (APNIC).

 IP fragmentation and reassembly

We will see in Chapter 5 that different data link layer protocols have different maximum packet sizes that they can carry.  The maximum size of the payload (data) part of a data-link packet is called the MTU (maximum transfer unit). For example the MTU for Ethernet is 1,500 bytes.

A hard limit on the size of a data link packet places a hard limit on the length of an IP datagram that is  carried across that link, since the IP datagram will be the payload of the data link packet.  Suppose then that  a router receives an IP datagram on an incoming link, looks up the destination in its router table to determine the appropriate outgoing link, and finds that the outgoing link has an MTU that is smaller than the lengh of the IP datagram. Panic sets in -- how is the router going to cram this oversized IP packet into the payload field of the data link packet.

The solution to this problem is to divide (or "fragment") the data in the IP datagram among two or more smaller IP datagrams, and then send these smaller datagrams over the outgoing link.  Once an IP datagram has been fragmented, it will not be reassembled until all fragments reach the destination node.

The identifier, flags,  and fragmentation offset fields of the IP packet are used for fragmentation.  Each fragment carries the same source, destination and identifier field values as the original datagram.  The first bit in the two-bit flag field is the DF (dont' fragment) bit.  If this bit is set, an IP datagram will not be fragmented.  The other flag bit is the MF bit.  This bit is set to 1 in all except the last fragment.  Finally, the offset field gives the starting position in of the data in the data field of the original original.

Figure 4.4-3 gives an example of how an 8K packet is fragmented into two 4K packets.

Figure 4.4-3: IP Fragmentation

When the destination receives the fragments it uses the identifier flags and the fragmentation offset to reconstruct the original datagram. The payload of the datagram is only passed to the transport layer once the IP layer has fully reconstructed the original IP datagram. If one or more of the fragments does not arrive to the destination, datagram is "lost" and not passed to the transport layer. (But, as we learned in the previous chapter, if TCP is being used at the transport layer, then TCP will reccover from this loss by having the source retransmit the data in the original datagram.)

Following this section we provide a Java applet that generates fragments. You provide the incoming datagram size, the MTU and the incomimg datagram identification, it automatically generates the fragments for you.

 

4.4.2 Getting Started: acquiring an address.

Before looking at the routing algorithms that operate among the routers, let's start at a host, where everything begins.  One question immediately comes to mind: how does a host get  its own IP address? There are at least three ways this can happen:

4.4.3 Intradomain Routing in the Internet: RIP, OSPF  and IGRP

Recall from our discussion in section 4.3 that intradomain routing protocols are used to route within an autonomous system (AS).  These intradomain routing protocols are also sometimes known as interior gateway protocols. Historically, three routing protocols have been used extensively for routing with an autonomous system in the Internet: RIP (the Routing Information Protocol), and OSPF (Open Shortest Path First), and IGRP (Cisco's propriety Interior Gateway Routing Protocol).

RIP: Routing Information Protocol

The Routing Information Protocol (RIP) was one of the earliest intradomain Internet routing protocols and is still in widespread use today.  It traces its origins and its name to the  Xerox Network Systems (XNS) architecture.  The widespread deployment of RIP was due in great part to its inclusion in 1982 of the Berkeley Software Distribution (BSD) version of UNIX supporting TCP/IP. RIP version 1 is defined in  RFC 1058, with a backwards compatible version 2 defined in  RFC 1723.

RIP is a distance vector protocol that operates in a manner very close  to the idealized protocol we examined in section 4.2.3.  The version of RIP specified in RFC 1058 uses hop count as a cost metric, i.e., each link has a cost of 1, and limits the maximum cost of a path to 15.  This limits the use of RIP to networks that are less than 15 hops in diameter.

Recall that in DV protocols, neighboring routers exchange routing information with each other. In RIP, routing tables are exchanged between neighbors every 30 seconds using RIP's so-called response message, with each response message containing that host's routing table entries for up to 25 destinations.  If a router does not hear from its neighbor at least once every 180 seconds, that neighbor is considered to be no longer reachable , i.e., either the neighbor has died or the connecting link has gone down.  A router can also request information about its neighbor's cost to a given destination using RIP's request message.  Two routers communicating RIP messages do so by sending RIP request or response messages within a UDP packet addressed to port 520. The UDP packet is carried between routers in a standard IP packet.  The fact that RIP uses a transport layer protocol (UDP)  on top of a network layer protocol (IP)  to implement network layer functionality (a routing algorithm) may seem rather convoluted (it is!).  Looking a little deeper at how RIP is implemented will clear this up.

Figure  4.4-4 sketches how RIP is typically implemented in a UNIX system, e.g., a host (end system) or a workstation or PC serving as a router.   A process called routed (pronounced "route dee") executes the RIP protocol, i.e., maintains the routing table and exchanges messages with  routed processes running in neighboring routers.   Because RIP is implemented as an application-level process (albeit a very special one that is able to manipulate the routing tables within the UNIX kernel), it can send and receive messages using the standard socket interface and the a standard transport protocol.   Thus, RIP is an application-layer protocol (see Chapter 2), running over UDP.

Figure 4.4-4: Implementation of RIP as the routed daemon

Finally, let us take a quick look at a RIP routing table.  The RIP routing table below in Table 4.4-1  is taken from a UNIX host giroflee.eurecom.fr. If you give a netstat -rn  command on a UNIX system, you can view the routing table for that host. Performing a netstat on host giroflee.eurecom.fr yields the following routing table:

The host giroflee is a so-called "multi-homed" host that is connected to three networks, each via a different interface (with a correspondingly different data link layer controller). The second, third and fourth entries in the table tell us that these three networks are attached to giroflee via giroflee's network  interfaces fa0, le0 and qaa0.  These giroflee interfaces have IP addresses 192.168.2.5, 193.55.114.6 and 192.168.3.5  respectively.  To transmit a packet to any host  belonging to one of these three networks, giroflee will simply send the outgoing IP datagram over the appropriate interface.  Of particular interest to us is the default route. Any IP datagram that is not destined for one of the networks explicitly listed in the routing table will be forwarded to the router with IP address  193.55.114.129; this router is reached by sending the datagram over the default network interface.  The first entry in the routing table is the so-called loopback interface. When IP sends a datagram to the loopback interface, the packet is simply returned back to IP; this is useful for debugging purposes. The address 224.0.0.0 is a special multicast (Class D) IP address.  We will examine IP multicast in section 4.5.
 
TO ADD: network diagram showing giroflee, as well as the nearby default router?

IGRP: Internal Gateway Routing Protocol

The Interior Gateway Routing Protocol (IGRP) [Cisco97b] is a proprietary routing algorithm developed by Cisco Systems, Inc. in the mid-1980's as a successor for RIP.  IGRP is a distance vector protocol.  Several cost metrics (including delay, bandwidth, reliability, and load) can be used in making routing decisions, with the weight given to each of the metrics being determined by the network administrator.  This ability to use administrator-defined costs in making route selections is an important difference from RIP; we will see shortly that so-called policy-based interdomain Internet routing protocols such as BGP also allow administratively defined routing decisions to be made.  Other important differences from RIP include the use of a reliable transport protocol to communicate routing information, the use of  update messages that are sent only when routing table costs change (rather than periodically) , and the use of a distributed  diffusing update routing algorithm [Garcia-Luna-Aceves 1991] to quickly compute loop free routing paths.
 

OSPF: open shortest path first

Like RIP, the Open Shortest Path First (OSPF) routing is used for intradomain routing. The "Open" in OSPF indicates that the routing protocol specification is publicly available (e.g., as opposed to Cisco's IGRP protocol).  The most recent version of OSPF, version 2, is defined in RFC 2178 - a public document.
 
 OSPF was conceived at the successor to RIP and as such has a number of advanced features.  At its heart, however, OSPF is a link-state protocol that uses flooding of link state information and a Dijkstra least cost path algorithm that allows each router to compute its least cost path. Individual link costs are configured by the network administrator.  Some of the advances embodied in OSPF include the following:

As OSPF routing domain (an "autonomous system," or AS, in OSPF parlance) can be configured into "areas."  Each area runs its own OSPF link state routing algorithm, with each router in an area broadcasting its link state to all other routers in that area.  The internal details of an area thus remain invisible to all routers outside the area.  Intra-area routing involves only those routers within the same area.

Within each area, one of more area border routers are responsible for routing packets outside the area.  Exactly one OSPF area in the AS is configured to be the backbone area.  The primary role of the backbone area is to route traffic between the other areas in the AS.  The backbone always contains all area border routers in the AS and may contain non border routers as well.  Inter-area routing within the AS requires that the packet be first routed to an area border router (ntradomain routing), then routed though the backbone to the area border router that is in the destination area, and then routed to the final destination.

Figure 4.4-5: hierarchically structured OSPF AS with four areas

A diagram of a hierarchically structured OSPF network is shown in Figure 4.4-5 . We can identify four types of OSPF routers in Figure 4.4-5:

4.4.4 Interdomain Routing in the Internet: BGP

The Border Gateway Protocol version 4, specified in RFC 1771 (see also RFC 1772, RFC 1773), is the de facto standard interdomain routing protocol in today's Internet.  As an interdomain routing protocol, it provides for routing between administrative domains (a.k.a. autonomous systems).

While BGP has the same general flavor as the distance vector protocol that we studied in section 4.2, it is more appropriately characterized as a path vector protocol.  This is because rather than propagating cost information to a destination, BGP propagates path attributes.  We will examine these path attributes  in detail shortly. We note though that while  these attributes include information such as the names of the administrative domains on a route to the destination, they do not contain cost information.   Nor does BGP specify how a route to a particular destination should be chosen given the attributes of the various possible paths.  That decision is a policy decision that is left up to the domain's administrator.  Each domain can thus choose it routes according to whatever criteria  it chooses (and need not even inform of neighbors of its  policy!) -- allowing a significant degree of autonomy in route selection.  In essence, BGP provides the mechanisms to distribute path information among the interconnected domains, but leaves the policy for making the actual route selections up to the network administrator.

As in the generic distance vector protocol, BGP gateways typically exchange routing information with their immediate neighbors, which are known as peers in BGP. There are only four message types in the BGP protocol: OPEN, UPDATE, NOTIFICATION and KEEPALIVE.

A BGP gateway uses the UPDATE message to advertise a route to a given destination to the BGP peer.  The UPDATE message can also be used to withdraw routes that had previously been advertised (that is, to tell a peer that a route that it had previously advertised is no longer a valid route).  An UPDATE message advertising a path contains the identity of  one or more destinations all of which share the specified attributes of this path to that destination.   Each destination is typically the prefix part of a full IP address, indicating, for example a network or set of networks that have this common prefix. There are seven attributes, each of which has one of  four possible types. Table 4.4-2 summarizes the well-known path attributes of BGP paths.  There is only one optional transitive attribute and one optional non-transitive attribute specified in BGP 4.  Since not even a BGP router need not understand these optional attributes, we won't cover them here and refer the interested reader to  RFC 1771  for details.
 
Attribute name Attribute type Function
ORIGIN well-known, mandatory origin of route
AS-PATH well-known, mandatory list of AS's traversed by lath
NEXT HOP well-known, mandatory next hop on path to destination(s)
LOCAL_PREF well-known, discretionary preference value for path
ATOMIC AGGREGATE well-known, discretionary route can not be de-aggregated
Table 4.4-2: Well-known BGP path attributes
 
Recall from our discussion above that BGP provides mechanisms for distributing path information but does not mandate policies for selecting a route from those available.  Within this framework, it is thus possible for an AS such as Hatfield.net  to implement a policy such as "traffic from my AS should not cross the AS McCoy.net,"  since it knows the identities of all AS's on the path. (The Hatfield and the McCoy's are two famous feuding families in the US).  But what about  a policy that would prevent the McCoy's from sending traffic through the Hatfield's network?  The only means for an AS to control the traffic passes though its network (known as "transit" traffic - traffic that neither originates in, nor is destined for, the network, but instead is "just passing through") is by controlling the paths that it advertises.  For example, if the McCoy's are immediate neighbors of the Hatfields, the Hatfields could simply not advertise any routes to the McCoy's that contain the Hatfield network. But restricting transit traffic by controlling an AS's route advertisement can only be partially effective.  For example, if the Jones are between the Hatfield's and the McCoy's, and the Hatfield's advertise routes to the Jones' that pass through the Hatfields, then the Hatfields can not prevent (using BGP mechanisms) the Jones from advertising these routes to the McCoys.

As noted above, BGP, which is the successor to EGP, is becoming the de facto standard for interdomain routing for the public Internet.  BGP is used for example at the major network access points (NAP's) where major Internet carries connect to each other and exchange traffic. To see the contents of today's (less than four hours out of date) BGP routing table (large!) at one of the major NAP's in the US (which include Chicago and San Francisco ),  click here.
 
TO ADD: network diagram showing csgateway.umass.edu (or BGP host table)?

4.4.5 Why are there different Interdomain and Intradomain routing protocols?

Having now studied the details of specific interdomain and intradomain routing protocols deployed in today's Internet, let us conclude by considering perhaps the most fundamental question we could ask about these protocols in the first place (hopefully, you have been wondering this all along, and have not lost the forest for the trees!): The answer to this question gets at the heart of the differences between the goals of routing within a domain and between domains:

4.4.6. ICMP: the Internet Message Control Protocol

The Internet Message Control Protocol, ICMP, is used by hosts, routers, and gateways to communicate network layer information  to each other. ICMP is specified in RFC 792. The most typical use of ICMP is for error reporting.  For example, when running a telnet, ftp, or http session, you may have encountered an error message such as "Destination network unreachable."  This message had its origins in ICMP.  At some point, an IP router was unable to find a path (even a default path) to the host specified in your telnet, ftp or http application.  That router created and sent a type-3  ICMP message to your host indicating the error.  Your host received the ICMP message and returned the error code to the TCP code that was attempting to connect to the remote host. TCP in turn returned the error code to your application.

ICMP is often considered part of IP, but architecturally lies just above IP, as ICMP messages are carried inside IP packets.  That is, ICMP messages are carried as IP payload, just as TCP or UDP packets are carried at IP payload.  Similarly, when an host receives an IP packet with ICMP specified as the upper layer protocol, it demultiplexes the packet to ICMP, just as it would demultiplex a packet to TCP or UDP.

ICMP messages have a type and a code field, and also contain the first 8 bytes if an IP packet that caused the IP message to be generated in the first place (so that the sender can determine which packet is sent that caused the error).  Selected ICMP messages are shown below in Table 4.4-3.  Note that ICMP messages are used not only for signaling error conditions.  The well-known ping [ping man page] program uses ICMP.  ping sends an ICMP type 8 code 0 message to the specified host.  The destination host, seeing the echo request sends back an type 0 code 0 ICMP echo reply.  Another interesting ICMP message is the source quench message.  This message is seldom used in practice.  Its original purpose was to perform congestion control -- to allow a congested router to send an ICMP source quench message to a host to force that host to reduce its transmission rate.  We have seen in Chapter 3 that TCP has its own congestion control mechanism that operates at the transport layer, without the use of network layer support such as the ICMP source quench message.
 
 

ICMP type code description 
0 0 echo reply (to ping) 
3 0 destination network unreachable 
3 1 destination host unreachable 
3 2 destination protocol unreachable 
3 3 destination port unreachable 
3 6 destination network unknown 
3 7 destination host unknown 
4 0 source quench (congestion control) 
8 0 echo request 
9 0 router advertisement 
10 0 router discovery 
11 0 TTL expired 
12 0 IP header bad 
 Table 4.4-3: Selected ICMP messages
 
TO ADD?: short discussion of how ICMP is used in traceroute.

References

[Arin 1996] ARIN, "IP allocation report"  ftp://rs.arin.net/netinfo/ip_network_allocations
[Braden 1989] R. Braden, D. Borman, and C. Partridge , "Computing the Internet Checksum," ACM Computer Communication Review, Volume 19, Number 2 (April 1989)
[Bradner 1996] S. Bradner, A. Mankin, IPng: Internet Protocol Next Generation, Adddison Wesley, 1996.
[Cisco 97] Cisco - Advanced QoS Services for the Intelligent Internet.
[Cisco97b] Interior Gateway Routing Protocol and Enhanced IGRP
[RFC 791] J. Postel, "Internet Protocol: DARPA Internet Program Protocol Specification," RFC 791, Sept 1981.
[RFC 792] J. Postel, "Internet Control Message Protocol, " RFC 792, Sep-01-1981.
[RFC 903] R. Finlayson, Mann, J. Mogul, Theimer,  "A Reverse Address Resolution Protocol," RFC 903, June 1884.
[RFC 904] D. Mills, "Exterior Gateway Protocol Formal Specification," RFC 791, April 1984.
[RFC 950]  J. Mogul,   J. Postel , "Internet Standard Subnetting Procedure," RFC 950, August 1985.
[RFC 1058] C.L. Hendrick, "Routing Information Protocol," RFC 1058, June 1988.
[RFC 1256] S. Deering, "ICMP Router Discovery Messages," RFC 1256, Sept. 1991.
[RFC 1542] W. Wimer, "Clarifications and Extensions for the Bootstrap Protocol," RFC 1532, October 1993.
[RFC 1584] J. Moy, "Multicast Extensions to OSPF," RFC 1584, March 1994.
[RFC 1700] J. Reynolds, J. Postel, "Assigned Numbers" RFC 1700, Oct. 1994.
[RFC 1723] G. Malkin, RIP Version 2 - Carrying Additional Information.  RFC 1723, November 1994.
[RFC 1771] Y. Rekhter and T. Li,  "A Border Gateway Protocol 4 (BGP-4),"  RFC 1771, March 1995.
[RFC 1772] Y. Rekhter and P. Gross, "Application of the Border Gateway Protocol in the Internet," RFC 1772, March 1995.
[RFC 1773] P. Traina, "Experience with the BGP-4 protocol," RFC 1773,   March 1995
[RFC 2002] C. Perkins, "IP Mobility Support," RFC 2002, 1996.
[RFC 2050] K. Hubbard, M. Kosters, D. Conrad,  D. Karrenberg, J. Postel, "Internet Registry IP Allocation Guidelines", RFC 2050, Nov. 1996.
[RFC 2131] R. Droms, "Dynamic Host Configuration Protocol," RFC 2131, March 1997.
[RFC 2178] J. Moy, "Open Shortest Path First Version 2", RFC 2178, July 1997.
 
 

Copyright Keith W. Ross and Jim Kurose, 1998.  All rights reserved.