K&K Software Lynxviewer

Ergebnis für URL: http://www.faqs.org/rfcs/rfc2396.html
   [1]faqs.org

RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax

   faqs.org
   faqs.org - Internet FAQ Archives

RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax

     * [2]Internet RFC Index
     * [3]Usenet FAQ Index
     * [4]Other FAQs
     * [5]Documents
     * [6]Tools
     * Search
     * [7]Search FAQs
     * [8]Search RFCs
     * IFC Home
     * [9]Cities
     * [10]Countries
     * [11]Hospitals
     * [12]Web Hosting Ratings
     ____________________________________________________________________________

Search the RFC Archives

   ____________________  Search

Or Display the document by number

   _________ Display RFC By Number
     ____________________________________________________________________________

    [ [13]RFC Index | [14]Usenet FAQs | [15]Web FAQs | [16]Documents | [17]Cities |
                [18]Copyrights | [19]SEC Filings | [20]Neighborhoods ]


Network Working Group                                     T. Berners-Lee
Request for Comments: 2396                                       MIT/LCS
Updates: 1808, 1738                                          R. Fielding
Category: Standards Track                                    U.C. Irvine
                                                             L. Masinter
                                                       Xerox Corporation
                                                             August 1998

           Uniform Resource Identifiers (URI): Generic Syntax

Status of this Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (1998).  All Rights Reserved.

IESG Note

   This paper describes a "superset" of operations that can be applied
   to URI.  It consists of both a grammar and a description of basic
   functionality for URI.  To understand what is a valid URI, both the
   grammar and the associated description have to be studied.  Some of
   the functionality described is not applicable to all URI schemes, and
   some operations are only possible when certain media types are
   retrieved using the URI, regardless of the scheme used.

Abstract

   A Uniform Resource Identifier (URI) is a compact string of characters
   for identifying an abstract or physical resource.  This document
   defines the generic syntax of URI, including both absolute and
   relative forms, and guidelines for their use; it revises and replaces
   the generic definitions in [21]RFC 1738 and [22]RFC 1808.

   This document defines a grammar that is a superset of all valid URI,
   such that an implementation can parse the common components of a URI
   reference without knowing the scheme-specific requirements of every
   possible identifier type.  This document does not define a generative
   grammar for URI; that task will be performed by the individual
   specifications of each URI scheme.

1. Introduction

   Uniform Resource Identifiers (URI) provide a simple and extensible
   means for identifying a resource.  This specification of URI syntax
   and semantics is derived from concepts introduced by the World Wide
   Web global information initiative, whose use of such objects dates
   from 1990 and is described in "Universal Resource Identifiers in WWW"
   [[23]RFC1630].  The specification of URI is designed to meet the
   recommendations laid out in "Functional Recommendations for Internet
   Resource Locators" [[24]RFC1736] and "Functional Requirements for Uniform
   Resource Names" [[25]RFC1737].

   This document updates and merges "Uniform Resource Locators"
   [[26]RFC1738] and "Relative Uniform Resource Locators" [[27]RFC1808] in order
   to define a single, generic syntax for all URI.  It excludes those
   portions of [28]RFC 1738 that defined the specific syntax of individual
   URL schemes; those portions will be updated as separate documents, as
   will the process for registration of new URI schemes.  This document
   does not discuss the issues and recommendation for dealing with
   characters outside of the US-ASCII character set [ASCII]; those
   recommendations are discussed in a separate document.

   All significant changes from the prior RFCs are noted in Appendix G.

1.1 Overview of URI

   URI are characterized by the following definitions:

      Uniform
         Uniformity provides several benefits: it allows different types
         of resource identifiers to be used in the same context, even
         when the mechanisms used to access those resources may differ;
         it allows uniform semantic interpretation of common syntactic
         conventions across different types of resource identifiers; it
         allows introduction of new types of resource identifiers
         without interfering with the way that existing identifiers are
         used; and, it allows the identifiers to be reused in many
         different contexts, thus permitting new applications or
         protocols to leverage a pre-existing, large, and widely-used
         set of resource identifiers.

      Resource
         A resource can be anything that has identity.  Familiar
         examples include an electronic document, an image, a service
         (e.g., "today's weather report for Los Angeles"), and a
         collection of other resources.  Not all resources are network
         "retrievable"; e.g., human beings, corporations, and bound
         books in a library can also be considered resources.

         The resource is the conceptual mapping to an entity or set of
         entities, not necessarily the entity which corresponds to that
         mapping at any particular instance in time.  Thus, a resource
         can remain constant even when its content---the entities to
         which it currently corresponds---changes over time, provided
         that the conceptual mapping is not changed in the process.

      Identifier
         An identifier is an object that can act as a reference to
         something that has identity.  In the case of URI, the object is
         a sequence of characters with a restricted syntax.

   Having identified a resource, a system may perform a variety of
   operations on the resource, as might be characterized by such words
   as `access', `update', `replace', or `find attributes'.

1.2. URI, URL, and URN

   A URI can be further classified as a locator, a name, or both.  The
   term "Uniform Resource Locator" (URL) refers to the subset of URI
   that identify resources via a representation of their primary access
   mechanism (e.g., their network "location"), rather than identifying
   the resource by name or by some other attribute(s) of that resource.
   The term "Uniform Resource Name" (URN) refers to the subset of URI
   that are required to remain globally unique and persistent even when
   the resource ceases to exist or becomes unavailable.

   The URI scheme (Section 3.1) defines the namespace of the URI, and
   thus may further restrict the syntax and semantics of identifiers
   using that scheme.  This specification defines those elements of the
   URI syntax that are either required of all URI schemes or are common
   to many URI schemes.  It thus defines the syntax and semantics that
   are needed to implement a scheme-independent parsing mechanism for
   URI references, such that the scheme-dependent handling of a URI can
   be postponed until the scheme-dependent semantics are needed.  We use
   the term URL below when describing syntax or semantics that only
   apply to locators.

   Although many URL schemes are named after protocols, this does not
   imply that the only way to access the URL's resource is via the named
   protocol.  Gateways, proxies, caches, and name resolution services
   might be used to access some resources, independent of the protocol
   of their origin, and the resolution of some URL may require the use
   of more than one protocol (e.g., both DNS and HTTP are typically used
   to access an "http" URL's resource when it can't be found in a local
   cache).

   A URN differs from a URL in that it's primary purpose is persistent
   labeling of a resource with an identifier.  That identifier is drawn
   from one of a set of defined namespaces, each of which has its own
   set name structure and assignment procedures.  The "urn" scheme has
   been reserved to establish the requirements for a standardized URN
   namespace, as defined in "URN Syntax" [[29]RFC2141] and its related
   specifications.

   Most of the examples in this specification demonstrate URL, since
   they allow the most varied use of the syntax and often have a
   hierarchical namespace.  A parser of the URI syntax is capable of
   parsing both URL and URN references as a generic URI; once the scheme
   is determined, the scheme-specific parsing can be performed on the
   generic URI components.  In other words, the URI syntax is a superset
   of the syntax of all URI schemes.

1.3. Example URI

   The following examples illustrate URI that are in common use.

   [30]ftp://ftp.is.co.za/rfc/rfc1808.txt
      -- ftp scheme for File Transfer Protocol services

   [31]gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles
      -- gopher scheme for Gopher and Gopher+ Protocol services

   [32]http://www.math.uio.no/faq/compression-faq/part1.html
      -- http scheme for Hypertext Transfer Protocol services

   [33]mailto:mduerst@ifi.unizh.ch
      -- mailto scheme for electronic mail addresses

   news:comp.infosystems.www.servers.unix
      -- news scheme for USENET news groups and articles

   [34]telnet://melvyl.ucop.edu/
      -- telnet scheme for interactive services via the TELNET Protocol

1.4. Hierarchical URI and Relative Forms

   An absolute identifier refers to a resource independent of the
   context in which the identifier is used.  In contrast, a relative
   identifier refers to a resource by describing the difference within a
   hierarchical namespace between the current context and an absolute
   identifier of the resource.

   Some URI schemes support a hierarchical naming system, where the
   hierarchy of the name is denoted by a "/" delimiter separating the
   components in the scheme. This document defines a scheme-independent
   `relative' form of URI reference that can be used in conjunction with
   a `base' URI (of a hierarchical scheme) to produce another URI. The
   syntax of hierarchical URI is described in Section 3; the relative
   URI calculation is described in Section 5.

1.5. URI Transcribability

   The URI syntax was designed with global transcribability as one of
   its main concerns. A URI is a sequence of characters from a very
   limited set, i.e. the letters of the basic Latin alphabet, digits,
   and a few special characters.  A URI may be represented in a variety
   of ways: e.g., ink on paper, pixels on a screen, or a sequence of
   octets in a coded character set.  The interpretation of a URI depends
   only on the characters used and not how those characters are
   represented in a network protocol.

   The goal of transcribability can be described by a simple scenario.
   Imagine two colleagues, Sam and Kim, sitting in a pub at an
   international conference and exchanging research ideas.  Sam asks Kim
   for a location to get more information, so Kim writes the URI for the
   research site on a napkin.  Upon returning home, Sam takes out the
   napkin and types the URI into a computer, which then retrieves the
   information to which Kim referred.

   There are several design concerns revealed by the scenario:

      o  A URI is a sequence of characters, which is not always
         represented as a sequence of octets.

      o  A URI may be transcribed from a non-network source, and thus
         should consist of characters that are most likely to be able to
         be typed into a computer, within the constraints imposed by
         keyboards (and related input devices) across languages and
         locales.

      o  A URI often needs to be remembered by people, and it is easier
         for people to remember a URI when it consists of meaningful
         components.

   These design concerns are not always in alignment.  For example, it
   is often the case that the most meaningful name for a URI component
   would require characters that cannot be typed into some systems.  The
   ability to transcribe the resource identifier from one medium to
   another was considered more important than having its URI consist of
   the most meaningful of components.  In local and regional contexts

   and with improving technology, users might benefit from being able to
   use a wider range of characters; such use is not defined in this
   document.

1.6. Syntax Notation and Common Elements

   This document uses two conventions to describe and define the syntax
   for URI.  The first, called the layout form, is a general description
   of the order of components and component separators, as in

      /;?

   The component names are enclosed in angle-brackets and any characters
   outside angle-brackets are literal separators.  Whitespace should be
   ignored.  These descriptions are used informally and do not define
   the syntax requirements.

   The second convention is a BNF-like grammar, used to define the
   formal URI syntax.  The grammar is that of [[35]RFC822], except that "|"
   is used to designate alternatives.  Briefly, rules are separated from
   definitions by an equal "=", indentation is used to continue a rule
   definition over more than one line, literals are quoted with "",
   parentheses "(" and ")" are used to group elements, optional elements
   are enclosed in "[" and "]" brackets, and elements may be preceded
   with * to designate n or more repetitions of the following
   element; n defaults to 0.

   Unlike many specifications that use a BNF-like grammar to define the
   bytes (octets) allowed by a protocol, the URI grammar is defined in
   terms of characters.  Each literal in the grammar corresponds to the
   character it represents, rather than to the octet encoding of that
   character in any particular coded character set.  How a URI is
   represented in terms of bits and bytes on the wire is dependent upon
   the character encoding of the protocol used to transport it, or the
   charset of the document which contains it.

   The following definitions are common to many elements:

      alpha    = lowalpha | upalpha

      lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
                 "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
                 "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"

      upalpha  = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
                 "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
                 "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"

      digit    = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                 "8" | "9"

      alphanum = alpha | digit

   The complete URI syntax is collected in Appendix A.

2. URI Characters and Escape Sequences

   URI consist of a restricted set of characters, primarily chosen to
   aid transcribability and usability both in computer systems and in
   non-computer communications. Characters used conventionally as
   delimiters around URI were excluded.  The restricted set of
   characters consists of digits, letters, and a few graphic symbols
   were chosen from those common to most of the character encodings and
   input facilities available to Internet users.

      uric          = reserved | unreserved | escaped

   Within a URI, characters are either used as delimiters, or to
   represent strings of data (octets) within the delimited portions.
   Octets are either represented directly by a character (using the US-
   ASCII character for that octet [ASCII]) or by an escape encoding.
   This representation is elaborated below.

2.1 URI and non-ASCII characters

   The relationship between URI and characters has been a source of
   confusion for characters that are not part of US-ASCII. To describe
   the relationship, it is useful to distinguish between a "character"
   (as a distinguishable semantic entity) and an "octet" (an 8-bit
   byte). There are two mappings, one from URI characters to octets, and
   a second from octets to original characters:

   URI character sequence->octet sequence->original character sequence

   A URI is represented as a sequence of characters, not as a sequence
   of octets. That is because URI might be "transported" by means that
   are not through a computer network, e.g., printed on paper, read over
   the radio, etc.

   A URI scheme may define a mapping from URI characters to octets;
   whether this is done depends on the scheme. Commonly, within a
   delimited component of a URI, a sequence of characters may be used to
   represent a sequence of octets. For example, the character "a"
   represents the octet 97 (decimal), while the character sequence "%",
   "0", "a" represents the octet 10 (decimal).

   There is a second translation for some resources: the sequence of
   octets defined by a component of the URI is subsequently used to
   represent a sequence of characters. A 'charset' defines this mapping.
   There are many charsets in use in Internet protocols. For example,
   UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
   of characters in the repertoire of ISO 10646.

   In the simplest case, the original character sequence contains only
   characters that are defined in US-ASCII, and the two levels of
   mapping are simple and easily invertible: each 'original character'
   is represented as the octet for the US-ASCII code for it, which is,
   in turn, represented as either the US-ASCII character, or else the
   "%" escape sequence for that octet.

   For original character sequences that contain non-ASCII characters,
   however, the situation is more difficult. Internet protocols that
   transmit octet sequences intended to represent character sequences
   are expected to provide some way of identifying the charset used, if
   there might be more than one [[36]RFC2277].  However, there is currently
   no provision within the generic URI syntax to accomplish this
   identification. An individual URI scheme may require a single
   charset, define a default charset, or provide a way to indicate the
   charset used.

   It is expected that a systematic treatment of character encoding
   within URI will be developed as a future modification of this
   specification.

2.2. Reserved Characters

   Many URI include components consisting of or delimited by, certain
   special characters.  These characters are called "reserved", since
   their usage within the URI component is limited to their reserved
   purpose.  If the data for a URI component would conflict with the
   reserved purpose, then the conflicting data must be escaped before
   forming the URI.

      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                    "$" | ","

   The "reserved" syntax class above refers to those characters that are
   allowed within a URI, but which may not be allowed within a
   particular component of the generic URI syntax; they are used as
   delimiters of the components described in Section 3.

   Characters in the "reserved" set are not reserved in all contexts.
   The set of characters actually reserved within any given URI
   component is defined by that component. In general, a character is
   reserved if the semantics of the URI changes if the character is
   replaced with its escaped US-ASCII encoding.

2.3. Unreserved Characters

   Data characters that are allowed in a URI but do not have a reserved
   purpose are called unreserved.  These include upper and lower case
   letters, decimal digits, and a limited set of punctuation marks and
   symbols.

      unreserved  = alphanum | mark

      mark        = "-" | "_" | "". | "!" | "~" | "*" | "'" | "(" | ")"

   Unreserved characters can be escaped without changing the semantics
   of the URI, but this should not be done unless the URI is being used
   in a context that does not allow the unescaped character to appear.

2.4. Escape Sequences

   Data must be escaped if it does not have a representation using an
   unreserved character; this includes data that does not correspond to
   a printable character of the US-ASCII coded character set, or that
   corresponds to any US-ASCII character that is disallowed, as
   explained below.

2.4.1. Escaped Encoding

   An escaped octet is encoded as a character triplet, consisting of the
   percent character "%" followed by the two hexadecimal digits
   representing the octet code. For example, "%20" is the escaped
   encoding for the US-ASCII space character.

      escaped     = "%" hex hex
      hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                            "a" | "b" | "c" | "d" | "e" | "f"

2.4.2. When to Escape and Unescape

   A URI is always in an "escaped" form, since escaping or unescaping a
   completed URI might change its semantics.  Normally, the only time
   escape encodings can safely be made is when the URI is being created
   from its component parts; each component may have its own set of
   characters that are reserved, so only the mechanism responsible for
   generating or interpreting that component can determine whether or

   not escaping a character will change its semantics. Likewise, a URI
   must be separated into its components before the escaped characters
   within those components can be safely decoded.

   In some cases, data that could be represented by an unreserved
   character may appear escaped; for example, some of the unreserved
   "mark" characters are automatically escaped by some systems.  If the
   given URI scheme defines a canonicalization algorithm, then
   unreserved characters may be unescaped according to that algorithm.
   For example, "%7e" is sometimes used instead of "~" in an http URL
   path, but the two are equivalent for an http URL.

   Because the percent "%" character always has the reserved purpose of
   being the escape indicator, it must be escaped as "%25" in order to
   be used as data within a URI.  Implementers should be careful not to
   escape or unescape the same string more than once, since unescaping
   an already unescaped string might lead to misinterpreting a percent
   data character as another escaped character, or vice versa in the
   case of escaping an already escaped string.

2.4.3. Excluded US-ASCII Characters

   Although they are disallowed within the URI syntax, we include here a
   description of those US-ASCII characters that have been excluded and
   the reasons for their exclusion.

   The control characters in the US-ASCII coded character set are not
   used within a URI, both because they are non-printable and because
   they are likely to be misinterpreted by some control mechanisms.

   control     = 

   The space character is excluded because significant spaces may
   disappear and insignificant spaces may be introduced when URI are
   transcribed or typeset or subjected to the treatment of word-
   processing programs.  Whitespace is also used to delimit URI in many
   contexts.

   space       = 

   The angle-bracket "" and double-quote (") characters are
   excluded because they are often used as the delimiters around URI in
   text documents and protocol fields.  The character "#" is excluded
   because it is used to delimit a URI from a fragment identifier in URI
   references (Section 4). The percent character "%" is excluded because
   it is used for the encoding of escaped characters.

   delims      = "" | "#" | "%" |
Usage: http://www.kk-software.de/kklynxview/get/URL
e.g. http://www.kk-software.de/kklynxview/get/http://www.kk-software.de
Errormessages are in German, sorry ;-)