Backus–Naur–Nicholi normal form

From XionKB
Jump to navigationJump to search

This page specifies a metasyntax notation for context-free grammars called Backus–Naur–Nicholi normal form, or simply BN3F. It is an "extended" Backus–Naur form authored by Alexander Nicholi to address several shortcomings of both the original BNF and other extended BNFs such as the version authored by the W3C for XML. The main concerns include, in no particular order:

  • Concretisation of the binary form underlying the grammar
  • Succinct and unopinionated delineation of the basic groups of syntaxes in the notation
  • A basic orientation around ASCII, with facilities for encoding-agnostic Unicode-conscious grammars

Encoding

BN3F is itself an ASCII only grammar. Additionally, characters in the range of 0 to 31 and 127 are also forbidden. Octets with the high order bit set (i.e., characters in the range 128 to255) are forbidden except within the body of comments.

Syntax

Any formal grammar in BN3F is given as a series of one or more rules in the following form:

symbol := expression;

In the following sections, the full extent of this grammar will be defined. Appendix A contains the Unicode facilities that extend it to be suitable for Unicode-conscious grammars. Appendix B contains a full grammar specifying BN3F, written in BN3F.

Whitespace

Whitespace are strings of characters comprised of only the following ASCII characters:

Dec Hex Description
9 0x9 HORIZONTAL TAB
10 0xA LINE FEED
11 0xB VERTICAL TAB
12 0xC FORM FEED
13 0xD CARRIAGE RETURN
32 0x20 SPACE

There is no semantic distinction between new lines and other forms of whitespace. This is done out of respect for the idea that style and form should be separate, and that style is the domain of editors, not languages.

Comments

BN3F uses the C-style /* ... */ notation to denote "comments", which are text meant to be lexically ignored. The whitespace which may conventionally immediately follow the opening symbol /* or the closing symbol */ is not semantically significant.

/*this is the same*/
/* as this, lexically */

Comment exception in encoding

As explained earlier, there exists a "comment exception" to the restrictions on encoding that normally apply to BN3F grammars. Specifically, octets with the high order bit set, corresponding to the numeric range of 128 to 255, are permitted within the body of the comment.

This has the practical effect of allowing Unicode encodings which superset ASCII, such as UTF-8, to be used in comments, so that all languages may be used by grammar authors to explain their code.

In the context of validating a grammar, lexers should not attempt to interpret the contents of comments with the intention of checking their validity except to ensure that these constraints are satisfied; ergo, whether a comment is "valid UTF-8" is outside the scope of BN3F lexers, while whether a comment contains invalid octets is in scope.

Operators

BN3F grammars use a set of operator symbols, including (in order of operator precedence):

Op Description
:= rule definition operator
; statement termination operator
( ) grouping operators
... terminal symbol range operator
! negation operator
+ solid repetition operator (1 or more)
* hollow repetition operator (0 or more)
? optioning operator
{ } finite repetition operator (n)
| alternation operator
Rule definition
BN3F uses the Pascal-style colon-hyphen := to separate the left and right hand sides of a rule definition. On the left hand side, an identifier is expected, while on the right hand side, an expression is expected that the identifier will lexically resolve to is expected.
Statement termination
Also understood as a rule definition termination, a simple semicolon ; is used to end a rule definition. It has no other use. This is necessary to keep the language agnostic to the difference of new lines and inline whitespace.
Grouping
Parentheses ( ) are used to group portions of an expression so that other operators may be applied to them as a single logical unit. The operators applicable to groups include negation, solid repetition, hollow repetition, finite repetition, optioning, and alternation. Groups may contain terminal symbols, non-terminal symbols, or a mix of both, and all of the operators which are valid on its constituent symbols are still valid within the grouping.
Negation may not be applied to groups which contain non-terminal symbols.
Ranges
The range operator ... is an abridged variant of the alternation operator. With two terminal symbols on each side, it constitutes a continuous stretch of alternation between the starting symbol and the ending symbol, inclusively. The stride of this range operation is always one (1). It may not be used with non-terminal symbols, nor may it be used with groups.
Negation
Negation is an operation that performs a logical inversion upon some terminal symbols using an exclamation point !. It may be applied to either a single terminal symbol or a group containing multiple terminal symbols.
A group negated may have the following operators applied to its constituents:
  • Additional groups (with full recursion)
  • Ranges
  • Alternation
A group negated may not have the following operators applied to its constituents:
  • Negations
  • Solid repetition
  • Hollow repetition
  • Optioning
  • Finite repetition
Additionally, a group negated may not contain any non-terminal symbols.
Solid repetition
Solid repetition (one or more, using the plus symbol +) is one of two varieties of infinite repetition of symbols in BN3F. It may be applied to terminal symbols, non-terminal symbols, or groups. It may not be coupled with other repetition operators, nor may it be coupled with the optioning operator (use the other infinite repetition operator for that).
Hollow repetition
Hollow repetition (zero or more, using the asterisk symbol *) is one of two varieties of infinite repetition of symbols in BN3F. It may be applied to terminal symbols, non-terminal symbols, or groups. It may not be coupled with other repetition operators, nor may it be coupled with the optioning operator (use the other infinite repetition operator for that).
Optioning
Optioning is the denotation that a symbol or group may or may not appear in the grammar, using a trailing question mark ?. It may not be combined with either of the infinite repetition operators, as they provide their own logical equivalence of this function anyway. It may follow, but not precede, a finite repetition operator. It may be combined with the negation operator, where it logically options it as in !symbol? is semantically equivalent to (!symbol)?. It may not be combined with the range operator (use grouping to achieve a valid logical effect).
Finite repetition
A symbol can be repeated for a finite number of times using curly braces { } with an integer inside them. This suffix may follow a terminal symbol, a non-terminal symbol, or a group. The figure within must be a literal natural number. It may not be coupled with either of the infinite repetition operators. It may not be coupled with the range operator. It may be coupled with the optioning operator, where it must precede it, following the symbol or group it repeats for.
Alternation
One symbol can be optionally substituted for another in satisfying the grammar. This is achieved with alternation, using a pipe symbol |. This operator has the lowest precedence of all operators.

Identifiers

Identifiers are strings of characters comprised of only alphanumerics and the underscore which do not begin with a digit. They are lexically separated by either whitespace, punctuation, or both. Unlike in other extended BNFs, there is no semantic significance to the capitalisation boilerplate of identifiers. In grammar rules, identifiers comprise the left-hand side of the rule definition, where they are called symbols.

Terminal symbols

BN3F uses a C-like string literal notation for communicating terminal symbols distinctly from identifiers and other characters used in the grammar of BN3F itself.

Singular symbols

Single quotation marks '' are used for conveying single characters. In addition to representing most of the ASCII range literally, the following backslash-delimited escape sequences are supported:

Seq Description ASCII
Dec Hex
\a BELL 7 0x7
\b BACKSPACE 8 0x8
\t HORIZONTAL TAB 9 0x9
\n LINE FEED 10 0xA
\v VERTICAL TAB 11 0xB
\f FORM FEED 12 0xC
\r CARRIAGE RETURN 13 0xD
\' SINGLE QUOTE 39 0x27
\\ BACKSLASH 92 0x5C
\ arbitrary octet†
where is a 3-digit octal number between 0o000 and 0o377 (decimal 0-255)

Unlike in other languages, BN3F requires octal escape sequences to always bear 3 digits. In other words, \0 is not a legal representation of ASCII NUL (as it is in ANSI C), but \000 is. This is done for the sake of parser simplicity and disambiguation when reading grammars.

Symbol ranges

BN3F additionally extends the usage of single quoted characters with ellipses ... in order to convey ranges. The ellipsis operator may not be used with non-terminal symbols (i.e., symbols denoted by identifiers instead of literals), even if the definition of such non-terminals trivially resolves to terminal symbols.

/* the following expression... */
'a' | 'b' | 'c' | 'd' | 'e'
/* ...can be reduced to this: */
'a' ... 'e'

Strings of symbols

Double quotation marks "" are used for conveying several characters in sequence:

/* without double quotes: */
'a' 'b' 'c' 'd' 'e'
/* with double quotes: */
"abcde"

In addition to representing most of the ASCII range literally, the following backslash-delimited escape sequences are supported:

Seq Description ASCII
Dec Hex
\a BELL 7 0x7
\b BACKSPACE 8 0x8
\t HORIZONTAL TAB 9 0x9
\n LINE FEED 10 0xA
\v VERTICAL TAB 11 0xB
\f FORM FEED 12 0xC
\r CARRIAGE RETURN 13 0xD
\" DOUBLE QUOTE 34 0x22
\\ BACKSLASH 92 0x5C
\ arbitrary octet†
where is a 3-digit octal number between 0o000 and 0o377 (decimal 0-255)

Unlike in other languages, BN3F requires octal escape sequences to always bear 3 digits. In other words, \0 is not a legal representation of ASCII NUL (as it is in ANSI C), but \000 is. This is done for the sake of parser simplicity and disambiguation when reading grammars.

Although double quotes are a convenient shorthand for longer strings of single characters, they are still lexically distinct from them in relation to other operators. A double quoted string can be thought of as a string of single quoted characters wrapped around with grouping operators, like so:

/* this expression... */
"abcde"
/* ...is more concisely understood as: */
('a' 'b' 'c' 'd' 'e')

In this sense, it should be more intuitive to understand why certain operators, like the range operator, are disallowed on double quoted strings.

Rule definitions

BN3F uses the Pascal-style definition-assignment operator := to separate the left-hand side and right-hand side of a rule definition. On the left is the identifier of a new non-terminal symbol to be defined, and on the right is its definition, which is an expression as specified below.

identifier := expression;

Expressions

Expressions are the most complex part of BN3F's grammar. They can be understood as a sequence of one or more terms. Terms can take several forms:

  • A sequence or range of characters
  • A logical negation (inversion) of a sequence, range or grouping of terminal symbols
  • A sequence of other terms
  • A grouping of terms
  • An alternation (or choice) between two terms
  • A repetition (either solid, hollow or finite) of a term
  • An optional denotation of a term

Appendix A

This appendix provides the lexical facilities to BN3F that are necessary for writing grammars that are conscious of Unicode. It is completely agnostic to the encodings used to represent Unicode code points.

The use of these Unicode facilities should not be misunderstood as a revocation of any requirement of encoding of the actual grammar being written. That is still expected to be ASCII conforming with the comment exception as described above.

Grammars that use the Unicode facilities at all are forbidden from using the octal-based octet escape sequence defined above, as it falsifies the model of encoding agnosticism in use with consistently representing Unicode-conscious grammars.

Terminal symbols

In following the C-like escape sequence notation, BN3F's Unicode facilities add two new escape sequences for expressing Unicode codepoints as literals:

/* the short form: */
'\uXXXX'
/* and the long form: */
'\UXXXXXXXX'

As with the octal notation in the main spec, all digits are significant and may not be omitted when zero. The short form is capable of representing all code points in Unicode's Basic Multilingual Plane, while the long form is capable of representing all Unicode code points.

These escape sequences are also valid in the same way inside double quotation literals:

/* this expression... */
'\u2018' 'a' 'b' 'c' '\u2019'
/* ...is equivalent to this: */
"\u2018abc\u2019"

Appendix B

First, an anecdote:

In 1973, Robin Milner created ML, a language based on the M&M type theory. ML begets SML which has a formally specified semantics. When asked for a formal semantics of the formal semantics, Milner's head explodes.

This is BN3F's formal semantics for formal semantics. Proceed with caution.


grammar := whitespace* rule+;

rule := identifier whitespace* ":=" whitespace? term+ ';' whitespace*;

identifier := ( 'A' ... 'Z' | 'a' ... 'z' | '_' ) ( 'A' ... 'Z' | 'a' ... 'z' |
'0' ... '9' | '_' )*;

term := ( ( single_expr | double_expr | group_expr | altern | identifier )
whitespace* )+;

single := '\'' ( character | '\\' '\'' ) '\'';

double := '"' ( character | '\\' '"' )+ '"';

group := '(' term ')';

altern := term '|' term;

single_expr := single ( '+' | '*' | '?' | '{' digit+ '}' )?;

double_expr := double ( '+' | '*' | '?' | '{' digit+ '}' )?;

group_expr := group ( '+' | '*' | '?' | '{' digit+ '}' )?;

/* recall intercompatibility notes about octal escapes and Unicode mixing */
character := char_ascii | char_unicode;

char_ascii := ' ' ... '~' | ( '\\' ( 'a' | 'b' | 't' | 'n' | 'v' | 'f' | 'r' |
('\\' '0' ... '3' ( '0' ... '7' ){2} ) ) );

char_unicode := ' ' ... '~' | ( '\\' ( 'a' | 'b' | 't' | 'n' | 'v' | 'f' | 'r'
| ( 'u' hex_digit{4} ) | ( 'U' hex_digit{8} ) ) );

digit := '0' ... '9';

hex_digit := '0' ... '9' | 'A' ... 'F' | 'a' ... 'f';

whitespace := comment | '\a' | '\b' | '\t' | '\n' | '\v' | '\f' | '\r' | ' ';

comment := "/*" ( ' ' ... '~' | '\200' ... '\377' )+ "*/";