(*:maxLineLen=78:*) theory Inner_Syntax imports Main Base begin chapter ‹Inner syntax --- the term language \label{ch:inner-syntax}› text ‹ The inner syntax of Isabelle provides concrete notation for the main entities of the logical framework, notably ‹λ›-terms with types and type classes. Applications may either extend existing syntactic categories by additional notation, or define new sub-languages that are linked to the standard term language via some explicit markers. For example ▩‹FOO›~‹foo› could embed the syntax corresponding for some user-defined nonterminal ‹foo› --- within the bounds of the given lexical syntax of Isabelle/Pure. The most basic way to specify concrete syntax for logical entities works via mixfix annotations (\secref{sec:mixfix}), which may be usually given as part of the original declaration or via explicit notation commands later on (\secref{sec:notation}). This already covers many needs of concrete syntax without having to understand the full complexity of inner syntax layers. Further details of the syntax engine involves the classical distinction of lexical language versus context-free grammar (see \secref{sec:pure-syntax}), and various mechanisms for ∗‹syntax transformations› (see \secref{sec:syntax-transformations}). › section ‹Printing logical entities› subsection ‹Diagnostic commands \label{sec:print-diag}› text ‹ \begin{matharray}{rcl} @{command_def "typ"}‹⇧*› & : & ‹context →› \\ @{command_def "term"}‹⇧*› & : & ‹context →› \\ @{command_def "prop"}‹⇧*› & : & ‹context →› \\ @{command_def "thm"}‹⇧*› & : & ‹context →› \\ @{command_def "prf"}‹⇧*› & : & ‹context →› \\ @{command_def "full_prf"}‹⇧*› & : & ‹context →› \\ @{command_def "print_state"}‹⇧*› & : & ‹any →› \\ \end{matharray} These diagnostic commands assist interactive development by printing internal logical entities in a human-readable fashion. \<^rail>‹ @@{command typ} @{syntax modes}? @{syntax type} ('::' @{syntax sort})? ; @@{command term} @{syntax modes}? @{syntax term} ; @@{command prop} @{syntax modes}? @{syntax prop} ; @@{command thm} @{syntax modes}? @{syntax thms} ; ( @@{command prf} | @@{command full_prf} ) @{syntax modes}? @{syntax thms}? ; @@{command print_state} @{syntax modes}? ; @{syntax_def modes}: '(' (@{syntax name} + ) ')' › ➧ @{command "typ"}~‹τ› reads and prints a type expression according to the current context. ➧ @{command "typ"}~‹τ :: s› uses type-inference to determine the most general way to make ‹τ› conform to sort ‹s›. For concrete ‹τ› this checks if the type belongs to that sort. Dummy type parameters ``‹_›'' (underscore) are assigned to fresh type variables with most general sorts, according the the principles of type-inference. ➧ @{command "term"}~‹t› and @{command "prop"}~‹φ› read, type-check and print terms or propositions according to the current theory or proof context; the inferred type of ‹t› is output as well. Note that these commands are also useful in inspecting the current environment of term abbreviations. ➧ @{command "thm"}~‹a⇩1 … a⇩n› retrieves theorems from the current theory or proof context. Note that any attributes included in the theorem specifications are applied to a temporary context derived from the current theory or proof; the result is discarded, i.e.\ attributes involved in ‹a⇩1, …, a⇩n› do not have any permanent effect. ➧ @{command "prf"} displays the (compact) proof term of the current proof state (if present), or of the given theorems. Note that this requires an underlying logic image with proof terms enabled, e.g. ‹HOL-Proofs›. ➧ @{command "full_prf"} is like @{command "prf"}, but displays the full proof term, i.e.\ also displays information omitted in the compact proof term, which is denoted by ``‹_›'' placeholders there. ➧ @{command "print_state"} prints the current proof state (if present), including current facts and goals. All of the diagnostic commands above admit a list of ‹modes› to be specified, which is appended to the current print mode; see also \secref{sec:print-modes}. Thus the output behavior may be modified according particular print mode features. For example, @{command "print_state"}~‹(latex)› prints the current proof state with mathematical symbols and special characters represented in {\LaTeX} source, according to the Isabelle style \<^cite>‹"isabelle-system"›. Note that antiquotations (cf.\ \secref{sec:antiq}) provide a more systematic way to include formal items into the printed text document. › subsection ‹Details of printed content› text ‹ \begin{tabular}{rcll} @{attribute_def show_markup} & : & ‹attribute› \\ @{attribute_def show_types} & : & ‹attribute› & default ‹false› \\ @{attribute_def show_sorts} & : & ‹attribute› & default ‹false› \\ @{attribute_def show_consts} & : & ‹attribute› & default ‹false› \\ @{attribute_def show_abbrevs} & : & ‹attribute› & default ‹true› \\ @{attribute_def show_brackets} & : & ‹attribute› & default ‹false› \\ @{attribute_def names_long} & : & ‹attribute› & default ‹false› \\ @{attribute_def names_short} & : & ‹attribute› & default ‹false› \\ @{attribute_def names_unique} & : & ‹attribute› & default ‹true› \\ @{attribute_def eta_contract} & : & ‹attribute› & default ‹true› \\ @{attribute_def goals_limit} & : & ‹attribute› & default ‹10› \\ @{attribute_def show_main_goal} & : & ‹attribute› & default ‹false› \\ @{attribute_def show_hyps} & : & ‹attribute› & default ‹false› \\ @{attribute_def show_tags} & : & ‹attribute› & default ‹false› \\ @{attribute_def show_question_marks} & : & ‹attribute› & default ‹true› \\ \end{tabular} ┉ These configuration options control the detail of information that is displayed for types, terms, theorems, goals etc. See also \secref{sec:config}. ➧ @{attribute show_markup} controls direct inlining of markup into the printed representation of formal entities --- notably type and sort constraints. This enables Prover IDE users to retrieve that information via tooltips or popups while hovering with the mouse over the output window, for example. Consequently, this option is enabled by default for Isabelle/jEdit. ➧ @{attribute show_types} and @{attribute show_sorts} control printing of type constraints for term variables, and sort constraints for type variables. By default, neither of these are shown in output. If @{attribute show_sorts} is enabled, types are always shown as well. In Isabelle/jEdit, manual setting of these options is normally not required thanks to @{attribute show_markup} above. Note that displaying types and sorts may explain why a polymorphic inference rule fails to resolve with some goal, or why a rewrite rule does not apply as expected. ➧ @{attribute show_consts} controls printing of types of constants when displaying a goal state. Note that the output can be enormous, because polymorphic constants often occur at several different type instances. ➧ @{attribute show_abbrevs} controls folding of constant abbreviations. ➧ @{attribute show_brackets} controls bracketing in pretty printed output. If enabled, all sub-expressions of the pretty printing tree will be parenthesized, even if this produces malformed term syntax! This crude way of showing the internal structure of pretty printed entities may occasionally help to diagnose problems with operator priorities, for example. ➧ @{attribute names_long}, @{attribute names_short}, and @{attribute names_unique} control the way of printing fully qualified internal names in external form. See also \secref{sec:antiq} for the document antiquotation options of the same names. ➧ @{attribute eta_contract} controls ‹η›-contracted printing of terms. The ‹η›-contraction law asserts \<^prop>‹(λx. f x) ≡ f›, provided ‹x› is not free in ‹f›. It asserts ∗‹extensionality› of functions: \<^prop>‹f ≡ g› if \<^prop>‹f x ≡ g x› for all ‹x›. Higher-order unification frequently puts terms into a fully ‹η›-expanded form. For example, if ‹F› has type ‹(τ ⇒ τ) ⇒ τ› then its expanded form is \<^term>‹λh. F (λx. h x)›. Enabling @{attribute eta_contract} makes Isabelle perform ‹η›-contractions before printing, so that \<^term>‹λh. F (λx. h x)› appears simply as ‹F›. Note that the distinction between a term and its ‹η›-expanded form occasionally matters. While higher-order resolution and rewriting operate modulo ‹αβη›-conversion, some other tools might look at terms more discretely. ➧ @{attribute goals_limit} controls the maximum number of subgoals to be printed. ➧ @{attribute show_main_goal} controls whether the main result to be proven should be displayed. This information might be relevant for schematic goals, to inspect the current claim that has been synthesized so far. ➧ @{attribute show_hyps} controls printing of implicit hypotheses of local facts. Normally, only those hypotheses are displayed that are ∗‹not› covered by the assumptions of the current context: this situation indicates a fault in some tool being used. By enabling @{attribute show_hyps}, output of ∗‹all› hypotheses can be enforced, which is occasionally useful for diagnostic purposes. ➧ @{attribute show_tags} controls printing of extra annotations within theorems, such as internal position information, or the case names being attached by the attribute @{attribute case_names}. Note that the @{attribute tagged} and @{attribute untagged} attributes provide low-level access to the collection of tags associated with a theorem. ➧ @{attribute show_question_marks} controls printing of question marks for schematic variables, such as ‹?x›. Only the leading question mark is affected, the remaining text is unchanged (including proper markup for schematic variables that might be relevant for user interfaces). › subsection ‹Alternative print modes \label{sec:print-modes}› text ‹ \begin{mldecls} @{define_ML print_mode_value: "unit -> string list"} \\ @{define_ML Print_Mode.with_modes: "string list -> ('a -> 'b) -> 'a -> 'b"} \\ \end{mldecls} The ∗‹print mode› facility allows to modify various operations for printing. Commands like @{command typ}, @{command term}, @{command thm} (see \secref{sec:print-diag}) take additional print modes as optional argument. The underlying ML operations are as follows. ➧ \<^ML>‹print_mode_value ()› yields the list of currently active print mode names. This should be understood as symbolic representation of certain individual features for printing (with precedence from left to right). ➧ \<^ML>‹Print_Mode.with_modes›~‹modes f x› evaluates ‹f x› in an execution context where the print mode is prepended by the given ‹modes›. This provides a thread-safe way to augment print modes. It is also monotonic in the set of mode names: it retains the default print mode that certain user-interfaces might have installed for their proper functioning! ┉ The pretty printer for inner syntax maintains alternative mixfix productions for any print mode name invented by the user, say in commands like @{command notation} or @{command abbreviation}. Mode names can be arbitrary, but the following ones have a specific meaning by convention: ▪ ▩‹""› (the empty string): default mode; implicitly active as last element in the list of modes. ▪ ▩‹input›: dummy print mode that is never active; may be used to specify notation that is only available for input. ▪ ▩‹internal› dummy print mode that is never active; used internally in Isabelle/Pure. ▪ ▩‹ASCII›: prefer ASCII art over mathematical symbols. ▪ ▩‹latex›: additional mode that is active in {\LaTeX} document preparation of Isabelle theory sources; allows to provide alternative output notation. › section ‹Mixfix annotations \label{sec:mixfix}› text ‹ Mixfix annotations specify concrete ∗‹inner syntax› of Isabelle types and terms. Locally fixed parameters in toplevel theorem statements, locale and class specifications also admit mixfix annotations in a fairly uniform manner. A mixfix annotation describes the concrete syntax, the translation to abstract syntax, and the pretty printing. Special case annotations provide a simple means of specifying infix operators and binders. Isabelle mixfix syntax is inspired by {\OBJ} \<^cite>‹OBJ›. It allows to specify any context-free priority grammar, which is more general than the fixity declarations of ML and Prolog. \<^rail>‹ @{syntax_def mixfix}: '(' (@{syntax template} prios? @{syntax nat}? | (@'infix' | @'infixl' | @'infixr') @{syntax template} @{syntax nat} | @'binder' @{syntax template} prio? @{syntax nat} | @'structure') ')' ; @{syntax template}: (string | cartouche) ; prios: '[' (@{syntax nat} + ',') ']' ; prio: '[' @{syntax nat} ']' › The mixfix ‹template› may include literal text, spacing, blocks, and arguments (denoted by ``‹_›''); the special symbol ``▩‹ı›'' (printed as ``‹ı›'') represents an index argument that specifies an implicit @{keyword "structure"} reference (see also \secref{sec:locale}). Only locally fixed variables may be declared as @{keyword "structure"}. Infix and binder declarations provide common abbreviations for particular mixfix declarations. So in practice, mixfix templates mostly degenerate to literal text for concrete syntax, such as ``▩‹++›'' for an infix symbol. › subsection ‹The general mixfix form› text ‹ In full generality, mixfix declarations work as follows. Suppose a constant ‹c :: τ⇩1 ⇒ … τ⇩n ⇒ τ› is annotated by ‹(mixfix [p⇩1, …, p⇩n] p)›, where ‹mixfix› is a string ‹d⇩0 _ d⇩1 _ … _ d⇩n› consisting of delimiters that surround argument positions as indicated by underscores. Altogether this determines a production for a context-free priority grammar, where for each argument ‹i› the syntactic category is determined by ‹τ⇩i› (with priority ‹p⇩i›), and the result category is determined from ‹τ› (with priority ‹p›). Priority specifications are optional, with default 0 for arguments and 1000 for the result.⁋‹Omitting priorities is prone to syntactic ambiguities unless the delimiter tokens determine fully bracketed notation, as in ‹if _ then _ else _ fi›.› Since ‹τ› may be again a function type, the constant type scheme may have more argument positions than the mixfix pattern. Printing a nested application ‹c t⇩1 … t⇩m› for ‹m > n› works by attaching concrete notation only to the innermost part, essentially by printing ‹(c t⇩1 … t⇩n) … t⇩m› instead. If a term has fewer arguments than specified in the mixfix template, the concrete syntax is ignored. ┉ A mixfix template may also contain additional directives for pretty printing, notably spaces, blocks, and breaks. The general template format is a sequence over any of the following entities. ➧ ‹d› is a delimiter, namely a non-empty sequence delimiter items of the following form: ▸ a control symbol followed by a cartouche ▸ a single symbol, excluding the following special characters: ┉ \begin{tabular}{ll} ▩‹'› & single quote \\ ▩‹_› & underscore \\ ‹ı› & index symbol \\ ▩‹(› & open parenthesis \\ ▩‹)› & close parenthesis \\ ▩‹/› & slash \\ ‹‹ ›› & cartouche delimiters \\ \end{tabular} ┉ ➧ ▩‹'› escapes the special meaning of these meta-characters, producing a literal version of the following character, unless that is a blank. A single quote followed by a blank separates delimiters, without affecting printing, but input tokens may have additional white space here. ➧ ▩‹_› is an argument position, which stands for a certain syntactic category in the underlying grammar. ➧ ‹ı› is an indexed argument position; this is the place where implicit structure arguments can be attached. ➧ ‹s› is a non-empty sequence of spaces for printing. This and the following specifications do not affect parsing at all. ➧ ▩‹(›‹n› opens a pretty printing block. The optional natural number specifies the block indentation, i.e. how much spaces to add when a line break occurs within the block. The default indentation is 0. ➧ ▩‹(›‹‹properties›› opens a pretty printing block, with properties specified within the given text cartouche. The syntax and semantics of the category @{syntax_ref mixfix_properties} is described below. ➧ ▩‹)› closes a pretty printing block. ➧ ▩‹//› forces a line break. ➧ ▩‹/›‹s› allows a line break. Here ‹s› stands for the string of spaces (zero or more) right after the slash. These spaces are printed if the break is ∗‹not› taken. ┉ Block properties allow more control over the details of pretty-printed output. The concrete syntax is defined as follows. \<^rail>‹ @{syntax_def "mixfix_properties"}: (entry *) ; entry: atom ('=' atom)? ; atom: @{syntax short_ident} | @{syntax int} | @{syntax float} | @{syntax cartouche} › Each @{syntax entry} is a name-value pair: if the value is omitted, it defaults to ▩‹true› (intended for Boolean properties). The following standard block properties are supported: ▪ ‹indent› (natural number): the block indentation --- the same as for the simple syntax without block properties. ▪ ‹consistent› (Boolean): this block has consistent breaks (if one break is taken, all breaks are taken). ▪ ‹unbreakable› (Boolean): all possible breaks of the block are disabled (turned into spaces). ▪ ‹markup› (string): the optional name of the markup node. If this is provided, all remaining properties are turned into its XML attributes. This allows to specify free-form PIDE markup, e.g.\ for specialized output. ┉ Note that the general idea of pretty printing with blocks and breaks is described in \<^cite>‹"paulson-ml2"›; it goes back to \<^cite>‹"Oppen:1980"›. › subsection ‹Infixes› text ‹ Infix operators are specified by convenient short forms that abbreviate general mixfix annotations as follows: \begin{center} \begin{tabular}{lll} ▩‹(›@{keyword_def "infix"}~▩‹"›‹sy›▩‹"› ‹p›▩‹)› & ‹↦› & ▩‹("(_›~‹sy›▩‹/ _)" [›‹p + 1›▩‹,›~‹p + 1›▩‹]›~‹p›▩‹)› \\ ▩‹(›@{keyword_def "infixl"}~▩‹"›‹sy›▩‹"› ‹p›▩‹)› & ‹↦› & ▩‹("(_›~‹sy›▩‹/ _)" [›‹p›▩‹,›~‹p + 1›▩‹]›~‹p›▩‹)› \\ ▩‹(›@{keyword_def "infixr"}~▩‹"›‹sy›▩‹"›~‹p›▩‹)› & ‹↦› & ▩‹("(_›~‹sy›▩‹/ _)" [›‹p + 1›▩‹,›~‹p›▩‹]›~‹p›▩‹)› \\ \end{tabular} \end{center} The mixfix template ▩‹"(_›~‹sy›▩‹/ _)"› specifies two argument positions; the delimiter is preceded by a space and followed by a space or line break; the entire phrase is a pretty printing block. The alternative notation ▩‹(›‹sy›▩‹)› is introduced in addition. Thus any infix operator may be written in prefix form (as in Haskell), independently of the number of arguments. › subsection ‹Binders› text ‹ A ∗‹binder› is a variable-binding construct such as a quantifier. The idea to formalize ‹∀x. b› as ‹All (λx. b)› for ‹All :: ('a ⇒ bool) ⇒ bool› already goes back to \<^cite>‹church40›. Isabelle declarations of certain higher-order operators may be annotated with @{keyword_def "binder"} annotations as follows: \begin{center} ‹c ::›~▩‹"›‹(τ⇩1 ⇒ τ⇩2) ⇒ τ⇩3›▩‹" (›@{keyword "binder"}~▩‹"›‹sy›▩‹" [›‹p›▩‹]›~‹q›▩‹)› \end{center} This introduces concrete binder syntax ‹sy x. b›, where ‹x› is a bound variable of type ‹τ⇩1›, the body ‹b› has type ‹τ⇩2› and the whole term has type ‹τ⇩3›. The optional integer ‹p› specifies the syntactic priority of the body; the default is ‹q›, which is also the priority of the whole construct. Internally, the binder syntax is expanded to something like this: \begin{center} ‹c_binder ::›~▩‹"›‹idts ⇒ τ⇩2 ⇒ τ⇩3›▩‹" ("(3›‹sy›▩‹_./ _)" [0,›~‹p›▩‹]›~‹q›▩‹)› \end{center} Here @{syntax (inner) idts} is the nonterminal symbol for a list of identifiers with optional type constraints (see also \secref{sec:pure-grammar}). The mixfix template ▩‹"(3›‹sy›▩‹_./ _)"› defines argument positions for the bound identifiers and the body, separated by a dot with optional line break; the entire phrase is a pretty printing block of indentation level 3. Note that there is no extra space after ‹sy›, so it needs to be included user specification if the binder syntax ends with a token that may be continued by an identifier token at the start of @{syntax (inner) idts}. Furthermore, a syntax translation to transforms ‹c_binder x⇩1 … x⇩n b› into iterated application ‹c (λx⇩1. … c (λx⇩n. b)…)›. This works in both directions, for parsing and printing. › section ‹Explicit notation \label{sec:notation}› text ‹ \begin{matharray}{rcll} @{command_def "type_notation"} & : & ‹local_theory → local_theory› \\ @{command_def "no_type_notation"} & : & ‹local_theory → local_theory› \\ @{command_def "notation"} & : & ‹local_theory → local_theory› \\ @{command_def "no_notation"} & : & ‹local_theory → local_theory› \\ @{command_def "write"} & : & ‹proof(state) → proof(state)› \\ \end{matharray} Commands that introduce new logical entities (terms or types) usually allow to provide mixfix annotations on the spot, which is convenient for default notation. Nonetheless, the syntax may be modified later on by declarations for explicit notation. This allows to add or delete mixfix annotations for of existing logical entities within the current context. \<^rail>‹ (@@{command type_notation} | @@{command no_type_notation}) @{syntax mode}? ⏎ (@{syntax name} @{syntax mixfix} + @'and') ; (@@{command notation} | @@{command no_notation}) @{syntax mode}? ⏎ (@{syntax name} @{syntax mixfix} + @'and') ; @@{command write} @{syntax mode}? (@{syntax name} @{syntax mixfix} + @'and') › ➧ @{command "type_notation"}~‹c (mx)› associates mixfix syntax with an existing type constructor. The arity of the constructor is retrieved from the context. ➧ @{command "no_type_notation"} is similar to @{command "type_notation"}, but removes the specified syntax annotation from the present context. ➧ @{command "notation"}~‹c (mx)› associates mixfix syntax with an existing constant or fixed variable. The type declaration of the given entity is retrieved from the context. ➧ @{command "no_notation"} is similar to @{command "notation"}, but removes the specified syntax annotation from the present context. ➧ @{command "write"} is similar to @{command "notation"}, but works within an Isar proof body. › section ‹The Pure syntax \label{sec:pure-syntax}› subsection ‹Lexical matters \label{sec:inner-lex}› text ‹ The inner lexical syntax vaguely resembles the outer one (\secref{sec:outer-lex}), but some details are different. There are two main categories of inner syntax tokens: ▸ ∗‹delimiters› --- the literal tokens occurring in productions of the given priority grammar (cf.\ \secref{sec:priority-grammar}); ▸ ∗‹named tokens› --- various categories of identifiers etc. Delimiters override named tokens and may thus render certain identifiers inaccessible. Sometimes the logical context admits alternative ways to refer to the same entity, potentially via qualified names. ┉ The categories for named tokens are defined once and for all as follows, reusing some categories of the outer token syntax (\secref{sec:outer-lex}). \begin{center} \begin{supertabular}{rcl} @{syntax_def (inner) id} & = & @{syntax_ref short_ident} \\ @{syntax_def (inner) longid} & = & @{syntax_ref long_ident} \\ @{syntax_def (inner) var} & = & @{syntax_ref var} \\ @{syntax_def (inner) tid} & = & @{syntax_ref type_ident} \\ @{syntax_def (inner) tvar} & = & @{syntax_ref type_var} \\ @{syntax_def (inner) num_token} & = & @{syntax_ref nat} \\ @{syntax_def (inner) float_token} & = & @{syntax_ref nat}▩‹.›@{syntax_ref nat} \\ @{syntax_def (inner) str_token} & = & ▩‹''› ‹…› ▩‹''› \\ @{syntax_def (inner) string_token} & = & ▩‹"› ‹…› ▩‹"› \\ @{syntax_def (inner) cartouche} & = & @{verbatim "‹"} ‹…› @{verbatim "›"} \\ \end{supertabular} \end{center} The token categories @{syntax (inner) num_token}, @{syntax (inner) float_token}, @{syntax (inner) str_token}, @{syntax (inner) string_token}, and @{syntax (inner) cartouche} are not used in Pure. Object-logics may implement numerals and string literals by adding appropriate syntax declarations, together with some translation functions (e.g.\ see 🗏‹~~/src/HOL/Tools/string_syntax.ML›). The derived categories @{syntax_def (inner) num_const}, and @{syntax_def (inner) float_const}, provide robust access to the respective tokens: the syntax tree holds a syntactic constant instead of a free variable. Formal document comments (\secref{sec:comments}) may be also used within the inner syntax. › subsection ‹Priority grammars \label{sec:priority-grammar}› text ‹ A context-free grammar consists of a set of ∗‹terminal symbols›, a set of ∗‹nonterminal symbols› and a set of ∗‹productions›. Productions have the form ‹A = γ›, where ‹A› is a nonterminal and ‹γ› is a string of terminals and nonterminals. One designated nonterminal is called the ∗‹root symbol›. The language defined by the grammar consists of all strings of terminals that can be derived from the root symbol by applying productions as rewrite rules. The standard Isabelle parser for inner syntax uses a ∗‹priority grammar›. Each nonterminal is decorated by an integer priority: ‹A⇧(⇧p⇧)›. In a derivation, ‹A⇧(⇧p⇧)› may be rewritten using a production ‹A⇧(⇧q⇧) = γ› only if ‹p ≤ q›. Any priority grammar can be translated into a normal context-free grammar by introducing new nonterminals and productions. ┉ Formally, a set of context free productions ‹G› induces a derivation relation ‹⟶⇩G› as follows. Let ‹α› and ‹β› denote strings of terminal or nonterminal symbols. Then ‹α A⇧(⇧p⇧) β ⟶⇩G α γ β› holds if and only if ‹G› contains some production ‹A⇧(⇧q⇧) = γ› for ‹p ≤ q›. ┉ The following grammar for arithmetic expressions demonstrates how binding power and associativity of operators can be enforced by priorities. \begin{center} \begin{tabular}{rclr} ‹A⇧(⇧1⇧0⇧0⇧0⇧)› & ‹=› & ▩‹(› ‹A⇧(⇧0⇧)› ▩‹)› \\ ‹A⇧(⇧1⇧0⇧0⇧0⇧)› & ‹=› & ▩‹0› \\ ‹A⇧(⇧0⇧)› & ‹=› & ‹A⇧(⇧0⇧)› ▩‹+› ‹A⇧(⇧1⇧)› \\ ‹A⇧(⇧2⇧)› & ‹=› & ‹A⇧(⇧3⇧)› ▩‹*› ‹A⇧(⇧2⇧)› \\ ‹A⇧(⇧3⇧)› & ‹=› & ▩‹-› ‹A⇧(⇧3⇧)› \\ \end{tabular} \end{center} The choice of priorities determines that ▩‹-› binds tighter than ▩‹*›, which binds tighter than ▩‹+›. Furthermore ▩‹+› associates to the left and ▩‹*› to the right. ┉ For clarity, grammars obey these conventions: ▪ All priorities must lie between 0 and 1000. ▪ Priority 0 on the right-hand side and priority 1000 on the left-hand side may be omitted. ▪ The production ‹A⇧(⇧p⇧) = α› is written as ‹A = α (p)›, i.e.\ the priority of the left-hand side actually appears in a column on the far right. ▪ Alternatives are separated by ‹|›. ▪ Repetition is indicated by dots ‹(…)› in an informal but obvious way. Using these conventions, the example grammar specification above takes the form: \begin{center} \begin{tabular}{rclc} ‹A› & ‹=› & ▩‹(› ‹A› ▩‹)› \\ & ‹|› & ▩‹0› & \qquad\qquad \\ & ‹|› & ‹A› ▩‹+› ‹A⇧(⇧1⇧)› & ‹(0)› \\ & ‹|› & ‹A⇧(⇧3⇧)› ▩‹*› ‹A⇧(⇧2⇧)› & ‹(2)› \\ & ‹|› & ▩‹-› ‹A⇧(⇧3⇧)› & ‹(3)› \\ \end{tabular} \end{center} › subsection ‹The Pure grammar \label{sec:pure-grammar}› text ‹ The priority grammar of the ‹Pure› theory is defined approximately like this: \begin{center} \begin{supertabular}{rclr} @{syntax_def (inner) any} & = & ‹prop | logic› \\\\ @{syntax_def (inner) prop} & = & ▩‹(› ‹prop› ▩‹)› \\ & ‹|› & ‹prop⇧(⇧4⇧)› ▩‹::› ‹type› & ‹(3)› \\ & ‹|› & ‹any⇧(⇧3⇧)› ▩‹==› ‹any⇧(⇧3⇧)› & ‹(2)› \\ & ‹|› & ‹any⇧(⇧3⇧)› ‹≡› ‹any⇧(⇧3⇧)› & ‹(2)› \\ & ‹|› & ‹prop⇧(⇧3⇧)› ▩‹&&&› ‹prop⇧(⇧2⇧)› & ‹(2)› \\ & ‹|› & ‹prop⇧(⇧2⇧)› ▩‹==>› ‹prop⇧(⇧1⇧)› & ‹(1)› \\ & ‹|› & ‹prop⇧(⇧2⇧)› ‹⟹› ‹prop⇧(⇧1⇧)› & ‹(1)› \\ & ‹|› & ▩‹[|› ‹prop› ▩‹;› ‹…› ▩‹;› ‹prop› ▩‹|]› ▩‹==>› ‹prop⇧(⇧1⇧)› & ‹(1)› \\ & ‹|› & ‹⟦› ‹prop› ▩‹;› ‹…› ▩‹;› ‹prop› ‹⟧› ‹⟹› ‹prop⇧(⇧1⇧)› & ‹(1)› \\ & ‹|› & ▩‹!!› ‹idts› ▩‹.› ‹prop› & ‹(0)› \\ & ‹|› & ‹⋀› ‹idts› ▩‹.› ‹prop› & ‹(0)› \\ & ‹|› & ▩‹OFCLASS› ▩‹(› ‹type› ▩‹,› ‹logic› ▩‹)› \\ & ‹|› & ▩‹SORT_CONSTRAINT› ▩‹(› ‹type› ▩‹)› \\ & ‹|› & ▩‹TERM› ‹logic› \\ & ‹|› & ▩‹PROP› ‹aprop› \\\\ @{syntax_def (inner) aprop} & = & ▩‹(› ‹aprop› ▩‹)› \\ & ‹|› & ‹id | longid | var |›~~▩‹_›~~‹|›~~▩‹...› \\ & ‹|› & ▩‹CONST› ‹id |›~~▩‹CONST› ‹longid› \\ & ‹|› & ▩‹XCONST› ‹id |›~~▩‹XCONST› ‹longid› \\ & ‹|› & ‹logic⇧(⇧1⇧0⇧0⇧0⇧) any⇧(⇧1⇧0⇧0⇧0⇧) … any⇧(⇧1⇧0⇧0⇧0⇧)› & ‹(999)› \\\\ @{syntax_def (inner) logic} & = & ▩‹(› ‹logic› ▩‹)› \\ & ‹|› & ‹logic⇧(⇧4⇧)› ▩‹::› ‹type› & ‹(3)› \\ & ‹|› & ‹id | longid | var |›~~▩‹_›~~‹|›~~▩‹...› \\ & ‹|› & ▩‹CONST› ‹id |›~~▩‹CONST› ‹longid› \\ & ‹|› & ▩‹XCONST› ‹id |›~~▩‹XCONST› ‹longid› \\ & ‹|› & ‹logic⇧(⇧1⇧0⇧0⇧0⇧) any⇧(⇧1⇧0⇧0⇧0⇧) … any⇧(⇧1⇧0⇧0⇧0⇧)› & ‹(999)› \\ & ‹|› & ▩‹%› ‹pttrns› ▩‹.› ‹any⇧(⇧3⇧)› & ‹(3)› \\ & ‹|› & ‹λ› ‹pttrns› ▩‹.› ‹any⇧(⇧3⇧)› & ‹(3)› \\ & ‹|› & ▩‹(==)›~~‹|›~~▩‹(›‹≡›▩‹)›~~‹|›~~▩‹(&&&)› \\ & ‹|› & ▩‹(==>)›~~‹|›~~▩‹(›‹⟹›▩‹)› \\ & ‹|› & ▩‹TYPE› ▩‹(› ‹type› ▩‹)› \\\\ @{syntax_def (inner) idt} & = & ▩‹(› ‹idt› ▩‹)›~~‹| id |›~~▩‹_› \\ & ‹|› & ‹id› ▩‹::› ‹type› & ‹(0)› \\ & ‹|› & ▩‹_› ▩‹::› ‹type› & ‹(0)› \\\\ @{syntax_def (inner) index} & = & ▩‹⇘› ‹logic⇧(⇧0⇧)› ▩‹⇙›~~‹| | ı› \\\\ @{syntax_def (inner) idts} & = & ‹idt | idt⇧(⇧1⇧) idts› & ‹(0)› \\\\ @{syntax_def (inner) pttrn} & = & ‹idt› \\\\ @{syntax_def (inner) pttrns} & = & ‹pttrn | pttrn⇧(⇧1⇧) pttrns› & ‹(0)› \\\\ @{syntax_def (inner) type} & = & ▩‹(› ‹type› ▩‹)› \\ & ‹|› & ‹tid | tvar |›~~▩‹_› \\ & ‹|› & ‹tid› ▩‹::› ‹sort | tvar›~~▩‹::› ‹sort |›~~▩‹_› ▩‹::› ‹sort› \\ & ‹|› & ‹type_name | type⇧(⇧1⇧0⇧0⇧0⇧) type_name› \\ & ‹|› & ▩‹(› ‹type› ▩‹,› ‹…› ▩‹,› ‹type› ▩‹)› ‹type_name› \\ & ‹|› & ‹type⇧(⇧1⇧)› ▩‹=>› ‹type› & ‹(0)› \\ & ‹|› & ‹type⇧(⇧1⇧)› ‹⇒› ‹type› & ‹(0)› \\ & ‹|› & ▩‹[› ‹type› ▩‹,› ‹…› ▩‹,› ‹type› ▩‹]› ▩‹=>› ‹type› & ‹(0)› \\ & ‹|› & ▩‹[› ‹type› ▩‹,› ‹…› ▩‹,› ‹type› ▩‹]› ‹⇒› ‹type› & ‹(0)› \\ @{syntax_def (inner) type_name} & = & ‹id | longid› \\\\ @{syntax_def (inner) sort} & = & @{syntax class_name}~~‹|›~~▩‹_›~~‹|›~~▩‹{}› \\ & ‹|› & ▩‹{› @{syntax class_name} ▩‹,› ‹…› ▩‹,› @{syntax class_name} ▩‹}› \\ @{syntax_def (inner) class_name} & = & ‹id | longid› \\ \end{supertabular} \end{center} ┉ Here literal terminals are printed ▩‹verbatim›; see also \secref{sec:inner-lex} for further token categories of the inner syntax. The meaning of the nonterminals defined by the above grammar is as follows: ➧ @{syntax_ref (inner) any} denotes any term. ➧ @{syntax_ref (inner) prop} denotes meta-level propositions, which are terms of type \<^typ>‹prop›. The syntax of such formulae of the meta-logic is carefully distinguished from usual conventions for object-logics. In particular, plain ‹λ›-term notation is ∗‹not› recognized as @{syntax (inner) prop}. ➧ @{syntax_ref (inner) aprop} denotes atomic propositions, which are embedded into regular @{syntax (inner) prop} by means of an explicit ▩‹PROP› token. Terms of type \<^typ>‹prop› with non-constant head, e.g.\ a plain variable, are printed in this form. Constants that yield type \<^typ>‹prop› are expected to provide their own concrete syntax; otherwise the printed version will appear like @{syntax (inner) logic} and cannot be parsed again as @{syntax (inner) prop}. ➧ @{syntax_ref (inner) logic} denotes arbitrary terms of a logical type, excluding type \<^typ>‹prop›. This is the main syntactic category of object-logic entities, covering plain ‹λ›-term notation (variables, abstraction, application), plus anything defined by the user. When specifying notation for logical entities, all logical types (excluding \<^typ>‹prop›) are ∗‹collapsed› to this single category of @{syntax (inner) logic}. ➧ @{syntax_ref (inner) index} denotes an optional index term for indexed syntax. If omitted, it refers to the first @{keyword_ref "structure"} variable in the context. The special dummy ``‹ı›'' serves as pattern variable in mixfix annotations that introduce indexed notation. ➧ @{syntax_ref (inner) idt} denotes identifiers, possibly constrained by types. ➧ @{syntax_ref (inner) idts} denotes a sequence of @{syntax_ref (inner) idt}. This is the most basic category for variables in iterated binders, such as ‹λ› or ‹⋀›. ➧ @{syntax_ref (inner) pttrn} and @{syntax_ref (inner) pttrns} denote patterns for abstraction, cases bindings etc. In Pure, these categories start as a merely copy of @{syntax (inner) idt} and @{syntax (inner) idts}, respectively. Object-logics may add additional productions for binding forms. ➧ @{syntax_ref (inner) type} denotes types of the meta-logic. ➧ @{syntax_ref (inner) sort} denotes meta-level sorts. Here are some further explanations of certain syntax features. ▪ In @{syntax (inner) idts}, note that ‹x :: nat y› is parsed as ‹x :: (nat y)›, treating ‹y› like a type constructor applied to ‹nat›. To avoid this interpretation, write ‹(x :: nat) y› with explicit parentheses. ▪ Similarly, ‹x :: nat y :: nat› is parsed as ‹x :: (nat y :: nat)›. The correct form is ‹(x :: nat) (y :: nat)›, or ‹(x :: nat) y :: nat› if ‹y› is last in the sequence of identifiers. ▪ Type constraints for terms bind very weakly. For example, ‹x < y :: nat› is normally parsed as ‹(x < y) :: nat›, unless ‹<› has a very low priority, in which case the input is likely to be ambiguous. The correct form is ‹x < (y :: nat)›. ▪ Dummy variables (written as underscore) may occur in different roles. ➧ A sort ``‹_›'' refers to a vacuous constraint for type variables, which is effectively ignored in type-inference. ➧ A type ``‹_›'' or ``‹_ :: sort›'' acts like an anonymous inference parameter, which is filled-in according to the most general type produced by the type-checking phase. ➧ A bound ``‹_›'' refers to a vacuous abstraction, where the body does not refer to the binding introduced here. As in the term \<^term>‹λx _. x›, which is ‹α›-equivalent to ‹λx y. x›. ➧ A free ``‹_›'' refers to an implicit outer binding. Higher definitional packages usually allow forms like ‹f x _ = x›. ➧ A schematic ``‹_›'' (within a term pattern, see \secref{sec:term-decls}) refers to an anonymous variable that is implicitly abstracted over its context of locally bound variables. For example, this allows pattern matching of ‹{x. f x = g x}› against ‹{x. _ = _}›, or even ‹{_. _ = _}› by using both bound and schematic dummies. ➧ The three literal dots ``▩‹...›'' may be also written as ellipsis symbol ▩‹…›. In both cases this refers to a special schematic variable, which is bound in the context. This special term abbreviation works nicely with calculational reasoning (\secref{sec:calculation}). ➧ ▩‹CONST› ensures that the given identifier is treated as constant term, and passed through the parse tree in fully internalized form. This is particularly relevant for translation rules (\secref{sec:syn-trans}), notably on the RHS. ➧ ▩‹XCONST› is similar to ▩‹CONST›, but retains the constant name as given. This is only relevant to translation rules (\secref{sec:syn-trans}), notably on the LHS. › subsection ‹Inspecting the syntax› text ‹ \begin{matharray}{rcl} @{command_def "print_syntax"}‹⇧*› & : & ‹context →› \\ \end{matharray} ➧ @{command "print_syntax"} prints the inner syntax of the current context. The output can be quite large; the most important sections are explained below. ➧ ‹lexicon› lists the delimiters of the inner token language; see \secref{sec:inner-lex}. ➧ ‹productions› lists the productions of the underlying priority grammar; see \secref{sec:priority-grammar}. Many productions have an extra ‹… ❙⇒ name›. These names later become the heads of parse trees; they also guide the pretty printer. Productions without such parse tree names are called ∗‹copy productions›. Their right-hand side must have exactly one nonterminal symbol (or named token). The parser does not create a new parse tree node for copy productions, but simply returns the parse tree of the right-hand symbol. If the right-hand side of a copy production consists of a single nonterminal without any delimiters, then it is called a ∗‹chain production›. Chain productions act as abbreviations: conceptually, they are removed from the grammar by adding new productions. Priority information attached to chain productions is ignored. ➧ ‹print modes› lists the alternative print modes provided by this grammar; see \secref{sec:print-modes}. ➧ ‹parse_rules› and ‹print_rules› relate to syntax translations (macros); see \secref{sec:syn-trans}. ➧ ‹parse_ast_translation› and ‹print_ast_translation› list sets of constants that invoke translation functions for abstract syntax trees, which are only required in very special situations; see \secref{sec:tr-funs}. ➧ ‹parse_translation› and ‹print_translation› list the sets of constants that invoke regular translation functions; see \secref{sec:tr-funs}. › subsection ‹Ambiguity of parsed expressions› text ‹ \begin{tabular}{rcll} @{attribute_def syntax_ambiguity_warning} & : & ‹attribute› & default ‹true› \\ @{attribute_def syntax_ambiguity_limit} & : & ‹attribute› & default ‹10› \\ \end{tabular} Depending on the grammar and the given input, parsing may be ambiguous. Isabelle lets the Earley parser enumerate all possible parse trees, and then tries to make the best out of the situation. Terms that cannot be type-checked are filtered out, which often leads to a unique result in the end. Unlike regular type reconstruction, which is applied to the whole collection of input terms simultaneously, the filtering stage only treats each given term in isolation. Filtering is also not attempted for individual types or raw ASTs (as required for @{command translations}). Certain warning or error messages are printed, depending on the situation and the given configuration options. Parsing ultimately fails, if multiple results remain after the filtering phase. ➧ @{attribute syntax_ambiguity_warning} controls output of explicit warning messages about syntax ambiguity. ➧ @{attribute syntax_ambiguity_limit} determines the number of resulting parse trees that are shown as part of the printed message in case of an ambiguity. › section ‹Syntax transformations \label{sec:syntax-transformations}› text ‹ The inner syntax engine of Isabelle provides separate mechanisms to transform parse trees either via rewrite systems on first-order ASTs (\secref{sec:syn-trans}), or ML functions on ASTs or syntactic ‹λ›-terms (\secref{sec:tr-funs}). This works both for parsing and printing, as outlined in \figref{fig:parse-print}. \begin{figure}[htbp] \begin{center} \begin{tabular}{cl} string & \\ ‹↓› & lexer + parser \\ parse tree & \\ ‹↓› & parse AST translation \\ AST & \\ ‹↓› & AST rewriting (macros) \\ AST & \\ ‹↓› & parse translation \\ --- pre-term --- & \\ ‹↓› & print translation \\ AST & \\ ‹↓› & AST rewriting (macros) \\ AST & \\ ‹↓› & print AST translation \\ string & \end{tabular} \end{center} \caption{Parsing and printing with translations}\label{fig:parse-print} \end{figure} These intermediate syntax tree formats eventually lead to a pre-term with all names and binding scopes resolved, but most type information still missing. Explicit type constraints might be given by the user, or implicit position information by the system --- both need to be passed-through carefully by syntax transformations. Pre-terms are further processed by the so-called ∗‹check› and ∗‹uncheck› phases that are intertwined with type-inference (see also \<^cite>‹"isabelle-implementation"›). The latter allows to operate on higher-order abstract syntax with proper binding and type information already available. As a rule of thumb, anything that manipulates bindings of variables or constants needs to be implemented as syntax transformation (see below). Anything else is better done via check/uncheck: a prominent example application is the @{command abbreviation} concept of Isabelle/Pure. › subsection ‹Abstract syntax trees \label{sec:ast}› text ‹ The ML datatype \<^ML_type>‹Ast.ast› explicitly represents the intermediate AST format that is used for syntax rewriting (\secref{sec:syn-trans}). It is defined in ML as follows: @{verbatim [display] ‹datatype ast = Constant of string | Variable of string | Appl of ast list›} An AST is either an atom (constant or variable) or a list of (at least two) subtrees. Occasional diagnostic output of ASTs uses notation that resembles S-expression of LISP. Constant atoms are shown as quoted strings, variable atoms as non-quoted strings and applications as a parenthesized list of subtrees. For example, the AST @{ML [display] ‹Ast.Appl [Ast.Constant "_abs", Ast.Variable "x", Ast.Variable "t"]›} is pretty-printed as ▩‹("_abs" x t)›. Note that ▩‹()› and ▩‹(x)› are excluded as ASTs, because they have too few subtrees. ┉ AST application is merely a pro-forma mechanism to indicate certain syntactic structures. Thus ▩‹(c a b)› could mean either term application or type application, depending on the syntactic context. Nested application like ▩‹(("_abs" x t) u)› is also possible, but ASTs are definitely first-order: the syntax constant ▩‹"_abs"› does not bind the ▩‹x› in any way. Proper bindings are introduced in later stages of the term syntax, where ▩‹("_abs" x t)› becomes an \<^ML>‹Abs› node and occurrences of ▩‹x› in ▩‹t› are replaced by bound variables (represented as de-Bruijn indices). › subsubsection ‹AST constants versus variables› text ‹ Depending on the situation --- input syntax, output syntax, translation patterns --- the distinction of atomic ASTs as \<^ML>‹Ast.Constant› versus \<^ML>‹Ast.Variable› serves slightly different purposes. Input syntax of a term such as ‹f a b = c› does not yet indicate the scopes of atomic entities ‹f, a, b, c›: they could be global constants or local variables, even bound ones depending on the context of the term. \<^ML>‹Ast.Variable› leaves this choice still open: later syntax layers (or translation functions) may capture such a variable to determine its role specifically, to make it a constant, bound variable, free variable etc. In contrast, syntax translations that introduce already known constants would rather do it via \<^ML>‹Ast.Constant› to prevent accidental re-interpretation later on. Output syntax turns term constants into \<^ML>‹Ast.Constant› and variables (free or schematic) into \<^ML>‹Ast.Variable›. This information is precise when printing fully formal ‹λ›-terms. ┉ AST translation patterns (\secref{sec:syn-trans}) that represent terms cannot distinguish constants and variables syntactically. Explicit indication of ‹CONST c› inside the term language is required, unless ‹c› is known as special ∗‹syntax constant› (see also @{command syntax}). It is also possible to use @{command syntax} declarations (without mixfix annotation) to enforce that certain unqualified names are always treated as constant within the syntax machinery. The situation is simpler for ASTs that represent types or sorts, since the concrete syntax already distinguishes type variables from type constants (constructors). So ‹('a, 'b) foo› corresponds to an AST application of some constant for ‹foo› and variable arguments for ‹'a› and ‹'b›. Note that the postfix application is merely a feature of the concrete syntax, while in the AST the constructor occurs in head position. › subsubsection ‹Authentic syntax names› text ‹ Naming constant entities within ASTs is another delicate issue. Unqualified names are resolved in the name space tables in the last stage of parsing, after all translations have been applied. Since syntax transformations do not know about this later name resolution, there can be surprises in boundary cases. ∗‹Authentic syntax names› for \<^ML>‹Ast.Constant› avoid this problem: the fully-qualified constant name with a special prefix for its formal category (‹class›, ‹type›, ‹const›, ‹fixed›) represents the information faithfully within the untyped AST format. Accidental overlap with free or bound variables is excluded as well. Authentic syntax names work implicitly in the following situations: ▪ Input of term constants (or fixed variables) that are introduced by concrete syntax via @{command notation}: the correspondence of a particular grammar production to some known term entity is preserved. ▪ Input of type constants (constructors) and type classes --- thanks to explicit syntactic distinction independently on the context. ▪ Output of term constants, type constants, type classes --- this information is already available from the internal term to be printed. In other words, syntax transformations that operate on input terms written as prefix applications are difficult to make robust. Luckily, this case rarely occurs in practice, because syntax forms to be translated usually correspond to some concrete notation. › subsection ‹Raw syntax and translations \label{sec:syn-trans}› text ‹ \begin{tabular}{rcll} @{command_def "nonterminal"} & : & ‹theory → theory› \\ @{command_def "syntax"} & : & ‹local_theory → local_theory› \\ @{command_def "no_syntax"} & : & ‹local_theory → local_theory› \\ @{command_def "translations"} & : & ‹theory → theory› \\ @{command_def "no_translations"} & : & ‹theory → theory› \\ @{attribute_def syntax_ast_trace} & : & ‹attribute› & default ‹false› \\ @{attribute_def syntax_ast_stats} & : & ‹attribute› & default ‹false› \\ \end{tabular} ┉ Unlike mixfix notation for existing formal entities (\secref{sec:notation}), raw syntax declarations provide full access to the priority grammar of the inner syntax, without any sanity checks. This includes additional syntactic categories (via @{command nonterminal}) and free-form grammar productions (via @{command syntax}). Additional syntax translations (or macros, via @{command translations}) are required to turn resulting parse trees into proper representations of formal entities again. \<^rail>‹ @@{command nonterminal} (@{syntax name} + @'and') ; (@@{command syntax} | @@{command no_syntax}) @{syntax mode}? (constdecl +) ; (@@{command translations} | @@{command no_translations}) (transpat ('==' | '=>' | '<=' | '⇌' | '⇀' | '↽') transpat +) ; constdecl: @{syntax name} '::' @{syntax type} @{syntax mixfix}? ; mode: ('(' ( @{syntax name} | @'output' | @{syntax name} @'output' ) ')') ; transpat: ('(' @{syntax name} ')')? @{syntax string} › ➧ @{command "nonterminal"}~‹c› declares a type constructor ‹c› (without arguments) to act as purely syntactic type: a nonterminal symbol of the inner syntax. ➧ @{command "syntax"}~‹(mode) c :: σ (mx)› augments the priority grammar and the pretty printer table for the given print mode (default ▩‹""›). An optional keyword @{keyword_ref "output"} means that only the pretty printer table is affected. Following \secref{sec:mixfix}, the mixfix annotation ‹mx = template ps q› together with type ‹σ = τ⇩1 ⇒ … τ⇩n ⇒ τ› and specify a grammar production. The ‹template› contains delimiter tokens that surround ‹n› argument positions (▩‹_›). The latter correspond to nonterminal symbols ‹A⇩i› derived from the argument types ‹τ⇩i› as follows: ▪ ‹prop› if ‹τ⇩i = prop› ▪ ‹logic› if ‹τ⇩i = (…)κ› for logical type constructor ‹κ ≠ prop› ▪ ‹any› if ‹τ⇩i = α› for type variables ▪ ‹κ› if ‹τ⇩i = κ› for nonterminal ‹κ› (syntactic type constructor) Each ‹A⇩i› is decorated by priority ‹p⇩i› from the given list ‹ps›; missing priorities default to 0. The resulting nonterminal of the production is determined similarly from type ‹τ›, with priority ‹q› and default 1000. ┉ Parsing via this production produces parse trees ‹t⇩1, …, t⇩n› for the argument slots. The resulting parse tree is composed as ‹c t⇩1 … t⇩n›, by using the syntax constant ‹c› of the syntax declaration. Such syntactic constants are invented on the spot, without formal check wrt.\ existing declarations. It is conventional to use plain identifiers prefixed by a single underscore (e.g.\ ‹_foobar›). Names should be chosen with care, to avoid clashes with other syntax declarations. ┉ The special case of copy production is specified by ‹c =›~▩‹""› (empty string). It means that the resulting parse tree ‹t› is copied directly, without any further decoration. ➧ @{command "no_syntax"}~‹(mode) decls› removes grammar declarations (and translations) resulting from ‹decls›, which are interpreted in the same manner as for @{command "syntax"} above. ➧ @{command "translations"}~‹rules› specifies syntactic translation rules (i.e.\ macros) as first-order rewrite rules on ASTs (\secref{sec:ast}). The theory context maintains two independent lists translation rules: parse rules (▩‹=>› or ‹⇀›) and print rules (▩‹<=› or ‹↽›). For convenience, both can be specified simultaneously as parse~/ print rules (▩‹==› or ‹⇌›). Translation patterns may be prefixed by the syntactic category to be used for parsing; the default is ‹logic› which means that regular term syntax is used. Both sides of the syntax translation rule undergo parsing and parse AST translations \secref{sec:tr-funs}, in order to perform some fundamental normalization like ‹λx y. b ↝ λx. λy. b›, but other AST translation rules are ∗‹not› applied recursively here. When processing AST patterns, the inner syntax lexer runs in a different mode that allows identifiers to start with underscore. This accommodates the usual naming convention for auxiliary syntax constants --- those that do not have a logical counter part --- by allowing to specify arbitrary AST applications within the term syntax, independently of the corresponding concrete syntax. Atomic ASTs are distinguished as \<^ML>‹Ast.Constant› versus \<^ML>‹Ast.Variable› as follows: a qualified name or syntax constant declared via @{command syntax}, or parse tree head of concrete notation becomes \<^ML>‹Ast.Constant›, anything else \<^ML>‹Ast.Variable›. Note that ‹CONST› and ‹XCONST› within the term language (\secref{sec:pure-grammar}) allow to enforce treatment as constants. AST rewrite rules ‹(lhs, rhs)› need to obey the following side-conditions: ▪ Rules must be left linear: ‹lhs› must not contain repeated variables.⁋‹The deeper reason for this is that AST equality is not well-defined: different occurrences of the ``same'' AST could be decorated differently by accidental type-constraints or source position information, for example.› ▪ Every variable in ‹rhs› must also occur in ‹lhs›. ➧ @{command "no_translations"}~‹rules› removes syntactic translation rules, which are interpreted in the same manner as for @{command "translations"} above. ➧ @{attribute syntax_ast_trace} and @{attribute syntax_ast_stats} control diagnostic output in the AST normalization process, when translation rules are applied to concrete input or output. Raw syntax and translations provides a slightly more low-level access to the grammar and the form of resulting parse trees. It is often possible to avoid this untyped macro mechanism, and use type-safe @{command abbreviation} or @{command notation} instead. Some important situations where @{command syntax} and @{command translations} are really need are as follows: ▪ Iterated replacement via recursive @{command translations}. For example, consider list enumeration \<^term>‹[a, b, c, d]› as defined in theory \<^theory>‹HOL.List›. ▪ Change of binding status of variables: anything beyond the built-in @{keyword "binder"} mixfix annotation requires explicit syntax translations. For example, consider the set comprehension syntax \<^term>‹{x. P}› as defined in theory \<^theory>‹HOL.Set›. › subsubsection ‹Applying translation rules› text ‹ As a term is being parsed or printed, an AST is generated as an intermediate form according to \figref{fig:parse-print}. The AST is normalized by applying translation rules in the manner of a first-order term rewriting system. We first examine how a single rule is applied. Let ‹t› be the abstract syntax tree to be normalized and ‹(lhs, rhs)› some translation rule. A subtree ‹u› of ‹t› is called ∗‹redex› if it is an instance of ‹lhs›; in this case the pattern ‹lhs› is said to match the object ‹u›. A redex matched by ‹lhs› may be replaced by the corresponding instance of ‹rhs›, thus ∗‹rewriting› the AST ‹t›. Matching requires some notion of ∗‹place-holders› in rule patterns: \<^ML>‹Ast.Variable› serves this purpose. More precisely, the matching of the object ‹u› against the pattern ‹lhs› is performed as follows: ▪ Objects of the form \<^ML>‹Ast.Variable›~‹x› or \<^ML>‹Ast.Constant›~‹x› are matched by pattern \<^ML>‹Ast.Constant›~‹x›. Thus all atomic ASTs in the object are treated as (potential) constants, and a successful match makes them actual constants even before name space resolution (see also \secref{sec:ast}). ▪ Object ‹u› is matched by pattern \<^ML>‹Ast.Variable›~‹x›, binding ‹x› to ‹u›. ▪ Object \<^ML>‹Ast.Appl›~‹us› is matched by \<^ML>‹Ast.Appl›~‹ts› if ‹us› and ‹ts› have the same length and each corresponding subtree matches. ▪ In every other case, matching fails. A successful match yields a substitution that is applied to ‹rhs›, generating the instance that replaces ‹u›. Normalizing an AST involves repeatedly applying translation rules until none are applicable. This works yoyo-like: top-down, bottom-up, top-down, etc. At each subtree position, rules are chosen in order of appearance in the theory definitions. The configuration options @{attribute syntax_ast_trace} and @{attribute syntax_ast_stats} might help to understand this process and diagnose problems. \begin{warn} If syntax translation rules work incorrectly, the output of @{command_ref print_syntax} with its ∗‹rules› sections reveals the actual internal forms of AST pattern, without potentially confusing concrete syntax. Recall that AST constants appear as quoted strings and variables without quotes. \end{warn} \begin{warn} If @{attribute_ref eta_contract} is set to ‹true›, terms will be ‹η›-contracted ∗‹before› the AST rewriter sees them. Thus some abstraction nodes needed for print rules to match may vanish. For example, ‹Ball A (λx. P x)› would contract to ‹Ball A P› and the standard print rule would fail to apply. This problem can be avoided by hand-written ML translation functions (see also \secref{sec:tr-funs}), which is in fact the same mechanism used in built-in @{keyword "binder"} declarations. \end{warn} › subsection ‹Syntax translation functions \label{sec:tr-funs}› text ‹ \begin{matharray}{rcl} @{command_def "parse_ast_translation"} & : & ‹theory → theory› \\ @{command_def "parse_translation"} & : & ‹theory → theory› \\ @{command_def "print_translation"} & : & ‹theory → theory› \\ @{command_def "typed_print_translation"} & : & ‹theory → theory› \\ @{command_def "print_ast_translation"} & : & ‹theory → theory› \\ @{ML_antiquotation_def "class_syntax"} & : & ‹ML antiquotation› \\ @{ML_antiquotation_def "type_syntax"} & : & ‹ML antiquotation› \\ @{ML_antiquotation_def "const_syntax"} & : & ‹ML antiquotation› \\ @{ML_antiquotation_def "syntax_const"} & : & ‹ML antiquotation› \\ \end{matharray} Syntax translation functions written in ML admit almost arbitrary manipulations of inner syntax, at the expense of some complexity and obscurity in the implementation. \<^rail>‹ ( @@{command parse_ast_translation} | @@{command parse_translation} | @@{command print_translation} | @@{command typed_print_translation} | @@{command print_ast_translation}) @{syntax text} ; (@@{ML_antiquotation class_syntax} | @@{ML_antiquotation type_syntax} | @@{ML_antiquotation const_syntax} | @@{ML_antiquotation syntax_const}) embedded › ➧ @{command parse_translation} etc. declare syntax translation functions to the theory. Any of these commands have a single @{syntax text} argument that refers to an ML expression of appropriate type as follows: ┉ {\footnotesize \begin{tabular}{l} @{command parse_ast_translation} : \\ \quad \<^ML_type>‹(string * (Proof.context -> Ast.ast list -> Ast.ast)) list› \\ @{command parse_translation} : \\ \quad \<^ML_type>‹(string * (Proof.context -> term list -> term)) list› \\ @{command print_translation} : \\ \quad \<^ML_type>‹(string * (Proof.context -> term list -> term)) list› \\ @{command typed_print_translation} : \\ \quad \<^ML_type>‹(string * (Proof.context -> typ -> term list -> term)) list› \\ @{command print_ast_translation} : \\ \quad \<^ML_type>‹(string * (Proof.context -> Ast.ast list -> Ast.ast)) list› \\ \end{tabular}} ┉ The argument list consists of ‹(c, tr)› pairs, where ‹c› is the syntax name of the formal entity involved, and ‹tr› a function that translates a syntax form ‹c args› into ‹tr ctxt args› (depending on the context). The Isabelle/ML naming convention for parse translations is ‹c_tr› and for print translations ‹c_tr'›. The @{command_ref print_syntax} command displays the sets of names associated with the translation functions of a theory under ‹parse_ast_translation› etc. ➧ ‹@{class_syntax c}›, ‹@{type_syntax c}›, ‹@{const_syntax c}› inline the authentic syntax name of the given formal entities into the ML source. This is the fully-qualified logical name prefixed by a special marker to indicate its kind: thus different logical name spaces are properly distinguished within parse trees. ➧ ‹@{const_syntax c}› inlines the name ‹c› of the given syntax constant, having checked that it has been declared via some @{command syntax} commands within the theory context. Note that the usual naming convention makes syntax constants start with underscore, to reduce the chance of accidental clashes with other names occurring in parse trees (unqualified constants etc.). › subsubsection ‹The translation strategy› text ‹ The different kinds of translation functions are invoked during the transformations between parse trees, ASTs and syntactic terms (cf.\ \figref{fig:parse-print}). Whenever a combination of the form ‹c x⇩1 … x⇩n› is encountered, and a translation function ‹f› of appropriate kind is declared for ‹c›, the result is produced by evaluation of ‹f [x⇩1, …, x⇩n]› in ML. For AST translations, the arguments ‹x⇩1, …, x⇩n› are ASTs. A combination has the form \<^ML>‹Ast.Constant›~‹c› or \<^ML>‹Ast.Appl›~‹[›\<^ML>‹Ast.Constant›~‹c, x⇩1, …, x⇩n]›. For term translations, the arguments are terms and a combination has the form \<^ML>‹Const›~‹(c, τ)› or \<^ML>‹Const›~‹(c, τ) $ x⇩1 $ … $ x⇩n›. Terms allow more sophisticated transformations than ASTs do, typically involving abstractions and bound variables. ∗‹Typed› print translations may even peek at the type ‹τ› of the constant they are invoked on, although some information might have been suppressed for term output already. Regardless of whether they act on ASTs or terms, translation functions called during the parsing process differ from those for printing in their overall behaviour: ➧[Parse translations] are applied bottom-up. The arguments are already in translated form. The translations must not fail; exceptions trigger an error message. There may be at most one function associated with any syntactic name. ➧[Print translations] are applied top-down. They are supplied with arguments that are partly still in internal form. The result again undergoes translation; therefore a print translation should not introduce as head the very constant that invoked it. The function may raise exception \<^ML>‹Match› to indicate failure; in this event it has no effect. Multiple functions associated with some syntactic name are tried in the order of declaration in the theory. Only constant atoms --- constructor \<^ML>‹Ast.Constant› for ASTs and \<^ML>‹Const› for terms --- can invoke translation functions. This means that parse translations can only be associated with parse tree heads of concrete syntax, or syntactic constants introduced via other translations. For plain identifiers within the term language, the status of constant versus variable is not yet know during parsing. This is in contrast to print translations, where constants are explicitly known from the given term in its fully internal form. › subsection ‹Built-in syntax transformations› text ‹ Here are some further details of the main syntax transformation phases of \figref{fig:parse-print}. › subsubsection ‹Transforming parse trees to ASTs› text ‹ The parse tree is the raw output of the parser. It is transformed into an AST according to some basic scheme that may be augmented by AST translation functions as explained in \secref{sec:tr-funs}. The parse tree is constructed by nesting the right-hand sides of the productions used to recognize the input. Such parse trees are simply lists of tokens and constituent parse trees, the latter representing the nonterminals of the productions. Ignoring AST translation functions, parse trees are transformed to ASTs by stripping out delimiters and copy productions, while retaining some source position information from input tokens. The Pure syntax provides predefined AST translations to make the basic ‹λ›-term structure more apparent within the (first-order) AST representation, and thus facilitate the use of @{command translations} (see also \secref{sec:syn-trans}). This covers ordinary term application, type application, nested abstraction, iterated meta implications and function types. The effect is illustrated on some representative input strings is as follows: \begin{center} \begin{tabular}{ll} input source & AST \\ \hline ‹f x y z› & ▩‹(f x y z)› \\ ‹'a ty› & ▩‹(ty 'a)› \\ ‹('a, 'b)ty› & ▩‹(ty 'a 'b)› \\ ‹λx y z. t› & ▩‹("_abs" x ("_abs" y ("_abs" z t)))› \\ ‹λx :: 'a. t› & ▩‹("_abs" ("_constrain" x 'a) t)› \\ ‹⟦P; Q; R⟧ ⟹ S› & ▩‹("Pure.imp" P ("Pure.imp" Q ("Pure.imp" R S)))› \\ ‹['a, 'b, 'c] ⇒ 'd› & ▩‹("fun" 'a ("fun" 'b ("fun" 'c 'd)))› \\ \end{tabular} \end{center} Note that type and sort constraints may occur in further places --- translations need to be ready to cope with them. The built-in syntax transformation from parse trees to ASTs insert additional constraints that represent source positions. › subsubsection ‹Transforming ASTs to terms› text ‹ After application of macros (\secref{sec:syn-trans}), the AST is transformed into a term. This term still lacks proper type information, but it might contain some constraints consisting of applications with head ▩‹_constrain›, where the second argument is a type encoded as a pre-term within the syntax. Type inference later introduces correct types, or indicates type errors in the input. Ignoring parse translations, ASTs are transformed to terms by mapping AST constants to term constants, AST variables to term variables or constants (according to the name space), and AST applications to iterated term applications. The outcome is still a first-order term. Proper abstractions and bound variables are introduced by parse translations associated with certain syntax constants. Thus ▩‹("_abs" x x)› eventually becomes a de-Bruijn term ▩‹Abs ("x", _, Bound 0)›. › subsubsection ‹Printing of terms› text ‹ The output phase is essentially the inverse of the input phase. Terms are translated via abstract syntax trees into pretty-printed text. Ignoring print translations, the transformation maps term constants, variables and applications to the corresponding constructs on ASTs. Abstractions are mapped to applications of the special constant ▩‹_abs› as seen before. Type constraints are represented via special ▩‹_constrain› forms, according to various policies of type annotation determined elsewhere. Sort constraints of type variables are handled in a similar fashion. After application of macros (\secref{sec:syn-trans}), the AST is finally pretty-printed. The built-in print AST translations reverse the corresponding parse AST translations. ┉ For the actual printing process, the priority grammar (\secref{sec:priority-grammar}) plays a vital role: productions are used as templates for pretty printing, with argument slots stemming from nonterminals, and syntactic sugar stemming from literal tokens. Each AST application with constant head ‹c› and arguments ‹t⇩1›, \dots, ‹t⇩n› (for ‹n = 0› the AST is just the constant ‹c› itself) is printed according to the first grammar production of result name ‹c›. The required syntax priority of the argument slot is given by its nonterminal ‹A⇧(⇧p⇧)›. The argument ‹t⇩i› that corresponds to the position of ‹A⇧(⇧p⇧)› is printed recursively, and then put in parentheses ∗‹if› its priority ‹p› requires this. The resulting output is concatenated with the syntactic sugar according to the grammar production. If an AST application ‹(c x⇩1 … x⇩m)› has more arguments than the corresponding production, it is first split into ‹((c x⇩1 … x⇩n) x⇩n⇩+⇩1 … x⇩m)› and then printed recursively as above. Applications with too few arguments or with non-constant head or without a corresponding production are printed in prefix-form like ‹f t⇩1 … t⇩n› for terms. Multiple productions associated with some name ‹c› are tried in order of appearance within the grammar. An occurrence of some AST variable ‹x› is printed as ‹x› outright. ┉ White space is ∗‹not› inserted automatically. If blanks (or breaks) are required to separate tokens, they need to be specified in the mixfix declaration (\secref{sec:mixfix}). › end