Regular Expressions 3

これは第一巻および第二巻の続きである。すなわち、正規表現を使いこなせないという掲示板の声に応える形で書きはじめたものである。いまさら書くまでもないとは思うが、私はこれを実用に供するように書こうとは思ってはいない。正規表現に関した読みもの程度に考えてもらった方がよい。それが証拠に正規表現の解説などまったく無視して書いた部分もある。

例によって、このテキストを読まれる際には何等かの正規表現のリファレンスなどを手もとに置いて読み進まれることを希望する。繰り返すが、このテキストには正規表現の一覧表などはない。また、このテキストで扱う正規表現は Vine Linux 1.1 にインストールされている perl 5.00404 の正規表現である。スクリプトはすべてこの perl で動作を確認している。日本語化された jperl などを使用されている方は正規表現の解釈に於いて動作が違う場合もあるので注意されたい。

この巻は以前に掲示板で何か良い『正規表現のドリル』みたいなものはないかと、問われたことに端を発している。色々考えた挙げ句に、 BNF あるいは BNF もどきで書いてあるものを正規表現に翻訳して見ることを思い付いた。これなら、RFC などを漁ればいくらでも『正規表現の練習問題』を手にいれることが出来る。そういったわけで、この巻にはそういった練習をするのにはどうしたら良いかの案内の役割がある。この他にも、W3C にある DTD なんかが手頃かも知れない。ただし、正規表現はかっこの対の認識や入れ子構造までは捕捉できないので限界はある。これには注意されたい。

ところで、この巻を途中まで書いたところで Jeffrey Friedl の『正規表現』を買って読んでみた。書評などでこの本の内容については大雑把に知っていたのであるが、私が読んでみた限り書評からイメージされる内容とはかけ離れていた。難しい本だと聞いていたのだが、確かに正規表現のコンパイルの過程などを記した部分は難しいかも知れないが、他の部分はそんなに難しいと言うほどのものではない。特に perl を使っている人には参考になる部分が多いと思う。正規表現を離れて perl の動作について詳しく書いてあるからだ。当初このテキストはそんなに難しいものをターゲットにせずに、いわば、正規表現と perl の入門をかねたつもりで書いていた。しかし、Jeffrey Friedl の『正規表現』を読む限りでは、ある意味このテキストは無意味と言えば無意味といえる。逆に、彼の『正規表現』を読むための架け橋になるかもしれないと、いう意味ではずいぶんと意味のあるものになったかも知れない。いずれにしても、内容はかなり似通っている。第二巻の IP アドレスの正規表現もまったく同じことが書いてある。当然のことながら、 Friedl の正規表現の方がパフォーマンスは良い。

BNF というのはプログラミング言語の文法を規定するのに良く用いられる。例えば、perl4 のキャメルブックの最後に perl4 の文法を書いたものがある。この数式の羅列のようなものが BNF である。 perl4 のキャメルブックのは正式なものではなくてより人間が見やすいようなものになっている。( perl5 の場合はもはや BNF では文法を規定できない。) C 言語なども文法は BNF を使って規定されている。プログラミング言語の文法と言うとおそらく多くの人は尻込みをしてしまうかも知れないが、例えば、ちょっとした式を定義するのにも使える。実際に、RFC でこれが使われはじめたのは RFC733 や RFC822 あたりからである。電子メール関係の定義にこれを使うと記述が簡潔になるからである。以後、インターネットの様々な規格が BNF で定義されるようになった。ちょっと見るとすぐにわかるが、これは正規表現のルールを知っている人には簡単に理解できる。 ABNF の定義は RFC2234 に書いてある。

2.1  Rule Naming

   The name of a rule is simply the name itself; that is, a sequence of
   characters, beginning with  an alphabetic character, and followed by
   a combination of alphabetics, digits and hyphens (dashes).

        NOTE:     Rule names are case-insensitive

   The names <rulename>, <Rulename>, <RULENAME> and <rUlENamE> all refer
   to the same rule.

『ルールの名前と言うのは単純に名前それ自身である』とある。これだけだと何が何だがわからないが、通常 BNF ではルールの名前は <rulename> などと不等号で括ることになっていからこのようなことが書いてある。ただ、英文の文章中だと不等号で括らないとわけがわからない場合があるので、ルールの名前を呼ぶ際に <rulename> などと書いても良いことになっている。このように、説明する文章中などで使う場合などは不等号で括っても構わないが、文法的には ABNF の中ではそのまま rulename などとしてつかうということになっている。それ以外で文法的に認められているのは、繰り返しなんかを表現するために紛らわしい場合に限られている。

さらに、上のようにルールの名前に使える文字はきまっている。正規表現で表すと

となる。なお、名前は大文字小文字を区別しないので、例えば、 rulename も RuleName も RULENAME も同じルールを表す名前だと見なされる。

2.2  Rule Form

   A rule is defined by the following sequence:

        name =  elements crlf

   where <name> is the name of the rule, <elements> is one or more rule
   names or terminal specifications and <crlf> is the end-of- line
   indicator, carriage return followed by line feed.  The equal sign
   separates the name from the definition of the rule.  The elements
   form a sequence of one or more rule names and/or value definitions,
   combined according to the various operators, defined in this
   document, such as alternative and repetition.

上のように『何々とは何々である』といった式の羅列が ABNF の実体で個々の『何々とは何々である』という内容を表す式をルールと言う。これは第二巻で正規表現を定義するのに部分にわけて変数で定義したがそのようなものと同じ考えであると理解すればいい。まとめると

name =  elements crlf

というルールの名前は name であり、それは elements として定義される。ということになる。

これらのルールを書く際にはルールはイコールが揃うように、必ず左にそろえること、一つのルールが複数の行にまたがる場合には二行目以降をインデントすることなどがすすめられている。 (と、いうよりはこれはかかる手間を考えると強制に近い。綺麗なフォーマットで書かない正当な理由はちょっと考えにくい。)

2.3  Terminal Values

   Rules resolve into a string of terminal values, sometimes called
   characters.  In ABNF a character is merely a non-negative integer.
   In certain contexts a specific mapping (encoding) of values into a
   character set (such as ASCII) will be specified.

   Terminals are specified by one or more numeric characters with the
   base interpretation of those characters indicated explicitly.  The
   following bases are currently defined:

        b           =  binary

        d           =  decimal

        x           =  hexadecimal

2.2 の定義のように『何々とは何々である』を続けていくと最後にはこれ以上定義できないところまでたどり着く、そういった末端・終端の定義がここにある。なお、上の binary というのは 2 進数という意味である。

などとピリオドを使う。上の場合何進数であるかは既にわかっているので二番目の 10 は十進数の 10 である。

ABNF ではリテラルを直接使って良い。ただし、ダブルクォートで括る。

        command     =  "command string"

これは自動的に command が c と o と… g を連結させたものであることを意味する。なお、大文字小文字を区別しないので、

は "abc", "Abc", "aBc", "abC", "ABc", "aBC", "AbC", "ABC" などと同じである。大文字小文字の区別をはっきりさせたい場合には文字コードを使って

        rulename    =  %d97 %d98 %d99

        rulename    =  %d97.98.99

3.1  Concatenation                                  Rule1 Rule2

   A rule can define a simple, ordered string of values -- i.e., a
   concatenation of contiguous characters -- by listing a sequence of
   rule names.  For example:

        foo         =  %x61           ; a

        bar         =  %x62           ; b

        mumble      =  foo bar foo

        So that the rule <mumble> matches the lowercase string "aba".

3.2  Alternatives                               Rule1 / Rule2

   Elements separated by forward slash ("/") are alternatives.
   Therefore,

        foo / bar

   will accept <foo> or <bar>.

3.3  Incremental Alternatives                    Rule1 =/ Rule2

   It is sometimes convenient to specify a list of alternatives in
   fragments.  That is, an initial rule may match one or more
   alternatives, with later rule definitions adding to the set of
   alternatives.  This is particularly useful for otherwise- independent
   specifications which derive from the same parent rule set, such as
   often occurs with parameter lists.  ABNF permits this incremental
   definition through the construct:

        oldrule     =/ additional-alternatives

        ruleset     =  alt1 / alt2

        ruleset     =/ alt3

        ruleset     =/ alt4 / alt5

        ruleset     =  alt1 / alt2 / alt3 / alt4 / alt5

3.4  Value Range Alternatives                           %c##-##

   A range of alternative numeric values can be specified compactly,
   using dash ("-") to indicate the range of alternative values.  Hence:

        DIGIT       =  %x30-39

        DIGIT       =  "0" / "1" / "2" / "3" / "4" / "5" / "6" /

                           "7" / "8" / "9"

連結記号と範囲指定は同時には出来ない。例えば、印字可能な文字が一つだけ CRCL にはさまれているというのは

        char-line = %x0D.0A %x20-7E %x0D.0A

3.5  Sequence Group                             (Rule1 Rule2)

   Elements enclosed in parentheses are treated as a single element,
   whose contents are STRICTLY ORDERED.   Thus,

        elem (foo / bar) blat

   which matches (elem foo blat) or (elem bar blat).

        elem foo / bar blat

   matches (elem foo) or (bar blat).

いくつかのエレメントをグループ化するのには括弧で括る。また括弧を使って誤解なく定義するようにと勧告されている。例えば、

        (elem foo) / (bar blat)

3.6  Variable Repetition                                *Rule

   The operator "*" preceding an element indicates repetition. The full
   form is:

        <a>*<b>element

   where <a> and <b> are optional decimal values, indicating at least
   <a> and at most <b> occurrences of element.

は element の a 以上 b 以下の繰り返しに相当する。また、

は一回以上の繰り返しに相当する。これも正規表現風に書けば

3.7  Specific Repetition                                  nRule

   A rule of the form:

        <n>element

   is equivalent to

        <n>*<n>element

   That is, exactly  <N>  occurrences  of <element>. Thus 2DIGIT is a
   2-digit number, and 3ALPHA is a string of three alphabetic
   characters.

などといった定義もある。これは良く使われるからである。これは正規表現の場合

3.8  Optional Sequence                                   [RULE]

   Square brackets enclose an optional element sequence:

        [foo bar]

   is equivalent to

        *1(foo bar).

3.9  ; Comment

   A semi-colon starts a comment that continues to the end of line.
   This is a simple way of including useful notes in parallel with the
   specifications.

コメントである。セミコロンから行末まではコメントとして扱われる。

3.10 Operator Precedence

   The various mechanisms described above have the following precedence,
   from highest (binding tightest) at the top, to lowest and loosest at
   the bottom:

        Strings, Names formation
        Comment
        Value range
        Repetition
        Grouping, Optional
        Concatenation
        Alternative

   Use of the alternative operator, freely mixed with concatenations can
   be confusing.

        Again, it is recommended that the grouping operator be used to
        make explicit concatenation groups.

例によって演算子には優先度がある。上が一番優先度が高く下へいくほど優先度はさがる。ただし、基本的には括弧を使ってグルーピングした方が良い。

簡単な練習に良いので、RFC2234 の APPENDIX にある CORE とよばれるルールセットを見てみよう。

6.1  Core Rules

   Certain  basic  rules  are  in uppercase, such as SP, HTAB, CRLF,
   DIGIT, ALPHA, etc.

        ALPHA          =  %x41-5A / %x61-7A   ; A-Z / a-z

        BIT            =  "0" / "1"

        CHAR           =  %x01-7F
                               ; any 7-bit US-ASCII character,
                                  excluding NUL

        CR             =  %x0D
                               ; carriage return

        CRLF           =  CR LF
                               ; Internet standard newline

        CTL            =  %x00-1F / %x7F
                               ; controls

        DIGIT          =  %x30-39
                               ; 0-9

        DQUOTE         =  %x22
                               ; " (Double Quote)

        HEXDIG         =  DIGIT / "A" / "B" / "C" / "D" / "E" / "F"

        HTAB           =  %x09
                               ; horizontal tab

        LF             =  %x0A
                               ; linefeed

        LWSP           =  *(WSP / CRLF WSP)
                               ; linear white space (past newline)

        OCTET          =  %x00-FF
                               ; 8 bits of data

        SP             =  %x20
                               ; space

        VCHAR          =  %x21-7E
                               ; visible (printing) characters

        WSP            =  SP / HTAB
                               ; white space

大部分が直接に文字コードを指定しているような形なので理解に困難はないと思われる。これを正規表現に翻訳するとどうなるかを考えるとちょっとした正規表現の簡単な練習になる。例えば、ALPHA は

[\x41-\x5a]|[\x61-\x7a]

ですむ。 LWSP がちょっとこのなかでは複雑であるが、 WSP が

ABNF 自体を ABNF で書いてみよう。これには上のコアルールセットも使っている。

   This syntax uses the rules provided in Appendix A (Core).

        rulelist       =  1*( rule / (*c-wsp c-nl) )

        rule           =  rulename defined-as elements c-nl
                               ; continues if next line starts
                               ;  with white space

        rulename       =  ALPHA *(ALPHA / DIGIT / "-")

        defined-as     =  *c-wsp ("=" / "=/") *c-wsp
                               ; basic rules definition and
                               ;  incremental alternatives

        elements       =  alternation *c-wsp

        c-wsp          =  WSP / (c-nl WSP)

        c-nl           =  comment / CRLF
                               ; comment or newline

        comment        =  ";" *(WSP / VCHAR) CRLF

        alternation    =  concatenation
                          *(*c-wsp "/" *c-wsp concatenation)

        concatenation  =  repetition *(1*c-wsp repetition)

        repetition     =  [repeat] element

        repeat         =  1*DIGIT / (*DIGIT "*" *DIGIT)

        element        =  rulename / group / option /
                          char-val / num-val / prose-val

        group          =  "(" *c-wsp alternation *c-wsp ")"

        option         =  "[" *c-wsp alternation *c-wsp "]"

        char-val       =  DQUOTE *(%x20-21 / %x23-7E) DQUOTE
                               ; quoted string of SP and VCHAR
                                  without DQUOTE

        num-val        =  "%" (bin-val / dec-val / hex-val)

        bin-val        =  "b" 1*BIT
                          [ 1*("." 1*BIT) / ("-" 1*BIT) ]
                               ; series of concatenated bit values
                               ; or single ONEOF range

        dec-val        =  "d" 1*DIGIT
                          [ 1*("." 1*DIGIT) / ("-" 1*DIGIT) ]

        hex-val        =  "x" 1*HEXDIG
                          [ 1*("." 1*HEXDIG) / ("-" 1*HEXDIG) ]

        prose-val      =  "<" *(%x20-3D / %x3F-7E) ">"
                               ; bracketed string of SP and VCHAR
                                  without angles
                               ; prose description, to be used as
                                  last resort

少しこれらについて正規表現で表すことを考えてみる。ただし、正規表現は文法を規定するものではないので単純にマッチするかどうかしか判定できない。

<[\x20-\x3d\x3f-\x7e]*>

x[0-9A-F]+((\.([0-9A-F]+))|(-([0-9A-F]+)))?

d\d+((\.\d+)|(-\d+))?

b[01]+((\.[01]+)|(-[01]+))?

% (
    b[01]+((\.[01]+)|(-[01]+))? # bin-val
    |
    d\d+((\.\d+)|(-\d+))? # dec-val
    |
    x[0-9A-F]+((\.([0-9A-F]+))|(-([0-9A-F]+)))? # hex-val
  )

%((b[01]+((\.[01]+)|(-[01]+))?)|(d\d+((\.\d+)|(-\d+))?)|(x[0-9A-F]+((\.([0-9A-F]+))|(-([0-9A-F]+)))?))

"[\x20-\x21\x23-\x7e]*"

である。 comment は VCHAR が [\x21-\x7e] で CRLF が \x0d\x0a さらに WSP が [\x20\x09] なので

;[\x09\x20-\x7e]*\x0d\x0a

(;[\x09\x20-\x7e]*)?\x0d\x0a

((;[\x09\x20-\x7e]*)?\x0d\x0a)?[\x20\x09]

ただし、最初にも注意したように正規表現は文法を規定するものではないので限界がある。例えば、alternation をみると concatenation がその定義に入っている。さらに、concatenation には repetition が入っている。 repetition には element があり、element には option がある。で、 option を見ると alternation となる。これでも最後は終端値にたどり着くがこれを行うのは正規表現の仕事ではない。ただし、電子メールアドレスや URI の類は正規表現である程度まで捕まえることが出来る。正確に言うと特定の文字列が電子メールアドレスらしき姿をしているかどうかということの判定なら正規表現で出来るということである。

誤解しないで欲しいのは、正規表現の翻訳を試みたのはあくまでも正規表現の練習になるからである。いくら正規表現を使ってもパーサとは別のレキシカルアナライザを作るのにはある程度役にたつがパーサは作れない。

再び URL の話になる。ABNF をお目にかけたところで URL の ABNF を RFC1738 から抜きだして来た。上の説明と違うのは、『または』の / のかわりに | が用いられているところである。

; The generic form of a URL is:

genericurl     = scheme ":" schemepart

; Specific predefined schemes are defined here; new schemes
; may be registered with IANA

url            = httpurl | ftpurl | newsurl |
                 nntpurl | telneturl | gopherurl |
                 waisurl | mailtourl | fileurl |
                 prosperourl | otherurl

; new schemes follow the general syntax
otherurl       = genericurl

; the scheme is in lower case; interpreters should use case-ignore
scheme         = 1*[ lowalpha | digit | "+" | "-" | "." ]

schemepart     = *xchar | ip-schemepart


; URL schemeparts for ip based protocols:

ip-schemepart  = "//" login [ "/" urlpath ]

login          = [ user [ ":" password ] "@" ] hostport
hostport       = host [ ":" port ]
host           = hostname | hostnumber
hostname       = *[ domainlabel "." ] toplabel
domainlabel    = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit
toplabel       = alpha | alpha *[ alphadigit | "-" ] alphadigit
alphadigit     = alpha | digit
hostnumber     = digits "." digits "." digits "." digits
port           = digits
user           = *[ uchar | ";" | "?" | "&" | "=" ]
password       = *[ uchar | ";" | "?" | "&" | "=" ]
urlpath        = *xchar    ; depends on protocol see section 3.1

; The predefined schemes:

; FTP (see also RFC959)

ftpurl         = "ftp://" login [ "/" fpath [ ";type=" ftptype ]]
fpath          = fsegment *[ "/" fsegment ]
fsegment       = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
ftptype        = "A" | "I" | "D" | "a" | "i" | "d"

; FILE

fileurl        = "file://" [ host | "localhost" ] "/" fpath

; HTTP

httpurl        = "http://" hostport [ "/" hpath [ "?" search ]]
hpath          = hsegment *[ "/" hsegment ]
hsegment       = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
search         = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

; GOPHER (see also RFC1436)

gopherurl      = "gopher://" hostport [ / [ gtype [ selector
                 [ "%09" search [ "%09" gopher+_string ] ] ] ] ]
gtype          = xchar
selector       = *xchar
gopher+_string = *xchar

; MAILTO (see also RFC822)

mailtourl      = "mailto:" encoded822addr
encoded822addr = 1*xchar               ; further defined in RFC822

; NEWS (see also RFC1036)

newsurl        = "news:" grouppart
grouppart      = "*" | group | article
group          = alpha *[ alpha | digit | "-" | "." | "+" | "_" ]
article        = 1*[ uchar | ";" | "/" | "?" | ":" | "&" | "=" ] "@" host

; NNTP (see also RFC977)

nntpurl        = "nntp://" hostport "/" group [ "/" digits ]

; TELNET

telneturl      = "telnet://" login [ "/" ]

; WAIS (see also RFC1625)

waisurl        = waisdatabase | waisindex | waisdoc
waisdatabase   = "wais://" hostport "/" database
waisindex      = "wais://" hostport "/" database "?" search
waisdoc        = "wais://" hostport "/" database "/" wtype "/" wpath
database       = *uchar
wtype          = *uchar
wpath          = *uchar

; PROSPERO

prosperourl    = "prospero://" hostport "/" ppath *[ fieldspec ]
ppath          = psegment *[ "/" psegment ]
psegment       = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
fieldspec      = ";" fieldname "=" fieldvalue
fieldname      = *[ uchar | "?" | ":" | "@" | "&" ]
fieldvalue     = *[ uchar | "?" | ":" | "@" | "&" ]

; Miscellaneous definitions

lowalpha       = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
                 "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
                 "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
                 "y" | "z"
hialpha        = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
                 "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
                 "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"

alpha          = lowalpha | hialpha
digit          = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                 "8" | "9"
safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
national       = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]" | "`"
punctuation    = "<" | ">" | "#" | "%" | <">


reserved       = ";" | "/" | "?" | ":" | "@" | "&" | "="
hex            = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                 "a" | "b" | "c" | "d" | "e" | "f"
escape         = "%" hex hex

unreserved     = alpha | digit | safe | extra
uchar          = unreserved | escape
xchar          = unreserved | reserved | escape
digits         = 1*digit

lowalpha       = [a-z]
hialpha        = [A-Z]
alpha          = [a-zA-Z]
digit          = [0-9]
safe           = [\$_.+-]
extra          = [!*'(),]
national       = [][{}|\\^~`]
punctuation    = [<>#%"]
reserved       = [;/?:@&=]
hex            = [0-9a-fA-F]
escape         = %[0-9a-fA-F][0-9a-fA-F]

unreserved     = [a-zA-Z0-9\$_.+\-!*'(),]
uchar          = [a-zA-Z0-9\$_.+\-!*'(),]|%[0-9a-fA-F][0-9a-fA-F]
xchar          = [a-zA-Z0-9\$_.+\-!*'(),;/?:@&=]|%[0-9a-fA-F][0-9a-fA-F]
digits         = [0-9]+

unreserved     = [\w\$.+\-!*'(),]
uchar          = [\w\$.+\-!*'(),]|%[\da-fA-F][\da-fA-F]
xchar          = [\w\$.+\-!*'(),;/?:@&=]|%[\da-fA-F][\da-fA-F]
digits         = \d+

最初のホスト名などから表してみよう。まずは、部分だけを考える。

domainlabel    = [a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]
toplabel       = [a-zA-Z]|[a-zA-Z][a-zA-Z0-9-]*[a-zA-Z0-9]
alphadigit     = [a-zA-Z0-9]
hostnumber     = \d+\.\d+\.\d+\.\d+
port           = \d+
user           = ([\w\$.+\-!*'(),;?&=]|%[\da-fA-F][\da-fA-F])*
password       = ([\w\$.+\-!*'(),;?&=]|%[\da-fA-F][\da-fA-F])*
urlpath        = ([\w\$.+\-!*'(),;/?:@&=]|%[\da-fA-F][\da-fA-F])*

次に hostname を表す。長く複雑なので拡張正規表現を使うことにする。

  # hostname       = *[ domainlabel "." ] toplabel

  # *[ domainlabel "." ]
  ([a-zA-Z0-9] | [a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9] \.)*
  # toplabel
  ([a-zA-Z] | [a-zA-Z][a-zA-Z0-9-]*[a-zA-Z0-9])

  # host           = hostname | hostnumber

  ( # hostname
    # *[ domainlabel "." ]
    ([a-zA-Z0-9] | [a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9] \.)*
    # toplabel
    ([a-zA-Z] | [a-zA-Z][a-zA-Z0-9-]*[a-zA-Z0-9])
  )
  |
  ( \d+\.\d+\.\d+\.\d+ ) # hostnumber

  (
    ( # hostname
      # *[ domainlabel "." ]
      ([a-zA-Z0-9] | [a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9] \.)*
      # toplabel
      ([a-zA-Z] | [a-zA-Z][a-zA-Z0-9-]*[a-zA-Z0-9])
    )
    |
    ( \d+\.\d+\.\d+\.\d+ ) # hostnumber
  )
  ( : \d+ )?  # port

  # login          = [ user [ ":" password ] "@" ] hostport

  (
    ( [\w\$.+\-!*'(),;?&=] | %[\da-fA-F][\da-fA-F] )* # user
    ( : ( [\w\$.+\-!*'(),;?&=] | %[\da-fA-F][\da-fA-F])* )? @
  )?
  (
    ( # hostname
      # *[ domainlabel "." ]
      ([a-zA-Z0-9] | [a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9] \.)*
      # toplabel
      ([a-zA-Z] | [a-zA-Z][a-zA-Z0-9-]*[a-zA-Z0-9])
    )
    |
    ( \d+\.\d+\.\d+\.\d+ ) # hostnumber
  )
  ( : \d+ )?  # port

  # telneturl      = "telnet://" login [ "/" ]

  telnet://
  (
    ( [\w\$.+\-!*'(),;?&=] | %[\da-fA-F][\da-fA-F] )* # user
    ( : ( [\w\$.+\-!*'(),;?&=] | %[\da-fA-F][\da-fA-F])* )? @
  )?
  (
    ( # hostname
      # *[ domainlabel "." ]
      ([a-zA-Z0-9] | [a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9] \.)*
      # toplabel
      ([a-zA-Z] | [a-zA-Z][a-zA-Z0-9-]*[a-zA-Z0-9])
    )
    |
    ( \d+\.\d+\.\d+\.\d+ ) # hostnumber
  )
  ( : \d+ )?  # port 
  /?

fsegment       = ([\w\$.+\-!*'(),?:@&=]|%[\da-fA-F][\da-fA-F])*
ftptype        = [AIDaid]

  # fpath          = fsegment *[ "/" fsegment ]

  ( [\w\$.+\-!*'(),?:@&=] | %[\da-fA-F][\da-fA-F] )*
  (
    / ( [\w\$.+\-!*'(),?:@&=] | %[\da-fA-F][\da-fA-F] )*
  )*

  # ftpurl         = "ftp://" login [ "/" fpath [ ";type=" ftptype ]]

  ftp://
  (
    ( [\w\$.+\-!*'(),;?&=] | %[\da-fA-F][\da-fA-F] )* # user
    ( : ( [\w\$.+\-!*'(),;?&=] | %[\da-fA-F][\da-fA-F])* )? @
  )?
  (
    ( # hostname
      # *[ domainlabel "." ]
      ([a-zA-Z0-9] | [a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9] \.)*
      # toplabel
      ([a-zA-Z] | [a-zA-Z][a-zA-Z0-9-]*[a-zA-Z0-9])
    )
    |
    ( \d+\.\d+\.\d+\.\d+ ) # hostnumber
  )
  ( : \d+ )?  # port 
  ( /

    ( [\w\$.+\-!*'(),?:@&=] | %[\da-fA-F][\da-fA-F] )*
    (
      / ( [\w\$.+\-!*'(),?:@&=] | %[\da-fA-F][\da-fA-F] )*
    )*

    ( ;type = [AIDaid] )?
  )?

あと gopher や news, nntp なども上の ABNF からわかる範囲で書いてみれば良い練習になると思う。もちろん、第二巻で指摘したように、不要な括弧をはずしたりあるいは括弧を (?: ) に変えたりすることも試みると良いかも知れない。

  / ( [\w\$.+\-!*'(),?:@&=] | %[\da-fA-F][\da-fA-F] )*
    (
      / ( [\w\$.+\-!*'(),?:@&=] | %[\da-fA-F][\da-fA-F] )*
    )*

/usr///////local///bin

なんていうものにもマッチする。実は、これはこれで正しい。実際に、

cd /usr///////local///bin

なお、このテキストでは RFC1738 を参考にして話を進めたが、その後 RFC2396 に URI の定義がある。正規表現の練習ではなくて別の必要があって参照するのなら、 RFC2396 などを参考にした方が良いと思う。

つれづれの正規表現 (第三巻)

Augmented Bakus-Naur Form (ABNF)

ルール

演算子

コアルール

ABNF による ABNF

URL