- Source: GitHub
- Reference: Crafting Interpreters
Interpreters - Parsing
Agenda
Abstract Syntax Tree
-
ASTs are abstract representations of our code, removing the syntax details of a language
-
Nodes contain relevant information of the node type as you will see later
-
Useful for reporting syntax errors, semantic analysis, optimisation, direct execution
-
For example, both these snippets are functionally equivalent. In the case of an if statement, we only care about block type, condition and body
if (true) { console.log('Hello, World!') }
if (true) console.log('Hello, World!')
Recursive Descent Parser
- A simple top down parser
- Construct AST from outer to inner grammar rules
chunk --> block ; block --> (statement)* returnStatement? statement --> ...
How to implement
-
Directly translate grammar rules into code
-
Each rule becomes a function
| Grammar | Code translation | Explanation | | ------------ | ---------------------------- | ------------------------------ | | Terminal | Match and consume the token | Basic building block | | Non-terminal | Call the rule's function | Create a sub-expression | | \| | If or switch statement | Consider different cases | | * | Loop | Repetition of clause | | ? | If statement | Binary case, exists or doesn't |
Example of translation
-
Each rule becomes a function
returnStatement --> 'return' expressionList ;
function returnStatement() { }
Translation of terminal
-
Match and consume the token
returnStatement --> 'return' expressionList ;
function returnStatement() { ... consume('RETURN') ... }
Translation of non-terminal
-
Call the rule's function
returnStatement --> 'return' expressionList ;
function returnStatement() { ... consume('RETURN') expressionList() ... }
Translation of |
-
If or switch statement to handle cases
statement --> 'if' expression 'then' block ('elseif' expression 'then' block)* ('else' block)? 'end' | 'while' expression 'do' block 'end' | 'repeat' block 'until' expression | 'for' identifier '=' expression ',' expression (',' expression)? 'do' block 'end' | 'for' identifierList 'in' expressionList 'do' block 'end' | 'function' functionName body | 'local function' identifier body ;
function statement() { switch(tokenType) { case 'IF': ... case 'FOR': ... case 'WHILE': ... case 'REPEAT': ... case 'FUNCTION': ... ... } }
Translation of *
-
Loop to handle repetition
expressionList --> expression (',' expression)* ;
function expressionList() { ... do expression() while (match('COMMA')) ... }
Translation of ?
-
If statement to handle optional case
block --> (statement)* returnStatement? ;
function block() { ... if (check('RETURN')) { returnStatement() } ... }
How does the parser work?
- Iterate over a stream of tokens (similar to lexer, where source code was iterated character by character)
- Construct a type of node based off a matching token
- Some helper functions for manipulating tokens: isAtEnd, advance, previous, match, peek, check, consume
Example output of AST (10 + 2 - 4)
Parser constructor
class Parser {
constructor(tokens) {
this.tokens = tokens
this.current = 0
}
}
isAtEnd
isAtEnd() {
return peek().type === 'EOF'
}
peek
peek() {
return tokens[current]
}
advance
advance() {
if (isAtEnd()) current++
return previous()
}
previous
previous() {
return tokens[current - 1]
}
check
check(tokenType) {
if (isAtEnd()) return false
return peek().type === tokenType
}
match
match(...tokenTypes) {
for (let type of tokenType) {
if (check(type)) {
advance()
return true
}
}
return false
}
consume
consume(tokenType, errorMessage) {
if (check(tokenType)) return advance()
throw new Error(`Line ${peek().line}: ${message}`)
}
Expressions
- Two types: Binary and unary expressions
- Binary expression nodes have left and right branches, whereas unary expression nodes only have a single branch
- Rules defined with CFG already handle precedence and associativity, no need to worry about this issue
Evaluation
- Post-order traversal
- Visit children nodes first: Left -> Right -> Current
Node example
-
Left / right can reference leaf nodes or a sub-branch
-
Leaf nodes are literal values
-
Sub-branch indicates sub-expressions
{ type: "BinaryExpression", left: ... , operator: "+", right: ... }
Top level expression function
expression --> logicalOr ;
expression() {
return logicalOr()
}
Brief preview into other functions
logicalOr --> logicalAnd ("or" logicalAnd)* ;
logicalAnd --> comparison ("and" comparison)* ;
comparison --> bitwiseOr (( "==" | "~=" | ">" | ">=" | "<" | "<=" ) bitwiseOr)* ;
bitwiseOr --> bitwiseXor ( "|" bitwiseXor )* ;
bitwiseXor --> bitwiseAnd ( "~" bitwiseAnd )* ;
bitwiseAnd --> shift ( "&" shift )* ;
shift --> concatenation (( "<<" | ">>" ) concatenation)* ;
concatenation --> term ( ".." concatenation )* ;
logicalOr() { const node = logicalAnd(); ... }
logicalAnd() { const node = comparison(); ... }
comparison() { const node = bitwiseOr(); ... }
bitwiseOr() { const node = bitwiseXor(); ... }
bitwiseXor() { const node = bitwiseAnd(); ... }
bitwiseAnd() { const node = shift(); ... }
shift() { const node = concatenation(); ... }
concatenation() { const node = term(); ... }
term function
term --> factor (( "+" | "-" ) factor)* ;
term() {
let node = factor()
while (match('PLUS, MINUS')) {
const operator = previous().lexeme
const right = factor()
node = { type: 'BinaryExpression', left: node, operator, right }
}
return node
}
factor function
factor --> unary (( "*" | "/" | "//" | "%" ) unary)* ;
factor() {
let node = unary()
while (match('STAR', 'SLASH', 'DOUBLE_SLASH', 'PERCENT')) {
const operator = previous().lexeme
const right = unary()
node = { type: 'BinaryExpression', left: node, operator, right }
}
return node
}
primary function
primary --> number | string | "true" | "false" | "nil" | "..." | functionDef | table | prefixExpression ;
primary() {
if (match('NUMBER', 'STRING', 'TRUE', 'FALSE', 'NIL')) {
return { type: 'Literal', value: previous().value }
}
if (match('ELLIPSIS')) {
return { type: 'VarArgs' }
}
if (check('FUNCTION')) {
return functionDef()
}
if (check('LEFT_BRACE')) {
return table()
}
return prefixExpression()
}
Walkthrough example
-
Consider the expression:
10 + 2 * 5 - 4
-
Top level
expression()
called -
Subsequent functions are called:
logicalOr --> logicalAnd --> ... --> term
lowerPrecedenceFunction() { const node = nextPrecedenceFunction() ... }
Current token: 10
-
More subsequent calls until
primary()
, hit a base case -
factor()
-->unary()
-->power()
-->primary()
term() { const node = factor() ... }
factor() { const node = unary() ... }
primary() { if (match('NUMBER', ...)) { return { type: 'Literal', value: previous().value } } ... }
Current node value (outer node)
-
Return value from
primary()
and trickle back up{ type: 'Literal', value: 10 }
Current token: +
-
Return to
term()
from call stackterm() { let node = factor() // Finished, set result to node while (match('PLUS, MINUS')) { // Current token matches, consume const operator = previous().lexeme // + const right = factor() // Next step node = { type: 'BinaryExpression', left: node, operator, right } } return node }
Current token: 2
-
Similar to first step, hit a base case in
primary()
-
factor()
-->unary()
-->power()
-->primary()
{ type: 'Literal', value: 2 }
Current token: *
-
Return to
factor()
from call stackfactor() { let node = unary() // Finished, set result to node while (match('STAR', 'SLASH', 'DOUBLE_SLASH', 'PERCENT')) { // Matches const operator = previous().lexeme // * const right = unary() // Next step node = { type: 'BinaryExpression', left: node, operator, right } } return node }
Current token: 5
-
Hit a base case in
primary()
-
factor()
-->unary()
-->power()
-->primary()
-
Set
right
to the result{ type: 'Literal', value: 5 }
Current node value (inner node)
-
Notice how the first completed node describes the multiplication operation (highest precedence operation in our input)
{ type: 'BinaryExpression', left: { type: 'Literal', value: 2 }, operator: '*', right: { type: 'Literal', value: 5 } }
Current token: -
-
No more matches, return back to
term()
with resultfactor() { let node = unary() while (match('STAR', 'SLASH', 'DOUBLE_SLASH', 'PERCENT')) { // No match const operator = previous().lexeme const right = unary() node = { type: 'BinaryExpression', left: node, operator, right } // New node } return node // Return new node }
Back to term call with result
-
Finish computing
factory()
, return resultterm() { let node = factor() while (match('PLUS, MINUS')) { const operator = previous().lexeme const right = factor() // Set to new node node = { type: 'BinaryExpression', left: node, operator, right } } return node }
Current node value (outer node)
-
The result from
factory()
is attached to the right branch{ type: 'BinaryExpression', left: { type: 'Literal', value: 10 }, operator: '+', right: { type: 'BinaryExpression', left: { type: 'Literal', value: 2 }, operator: '*', right: { type: 'Literal', value: 5 } } }
Current token: -
-
Loop again since the token matches a PLUS or MINUS
term() { let node = factor() while (match('PLUS, MINUS')) { // Matches, consume const operator = previous().lexeme // - const right = factor() // Next step node = { type: 'BinaryExpression', left: node, operator, right } } return node }
Current token: 4
-
Hit a base case in
primary()
-
factor()
-->unary()
-->power()
-->primary()
-
Set
right
to the result{ type: 'Literal', value: 4 }
No more tokens
-
No more tokens, return and trickle back up to
expression()
term() { let node = factor() while (match('PLUS, MINUS')) { const operator = previous().lexeme const right = factor() // Set to new node node = { type: 'BinaryExpression', left: node, operator, right } } return node }
Final node value (outer node)
-
This is the final resulting AST
{ type: 'BinaryExpression', left: { type: 'BinaryExpression', left: { type: 'Literal', value: 10 }, operator: '+', right: { type: 'BinaryExpression', left: { type: 'Literal', value: 2 }, operator: '*', right: { type: 'Literal', value: 5 } } }, operator: '-', right: { type: 'Literal', value: 4 } }
AST from walkthrough
Hands-on
-
Attempt the implementation for unary
-
Grammar rule:
unary --> ( "-" | "not" | "#" | "~" ) unary | power ;
-
Additional information:
- Token names are "MINUS", "NOT", "HASHTAG" and "TILDA" respectively
- Token structure consists of properties: type, lexeme, value, line
Implementation for unary
-
A binary case for either dealing with unary or power operations
-
Similar type of node data, unary expressions only have one operand
unary() { if (match('MINUS', 'NOT', 'HASHTAG', 'TILDA')) { const operator = previous().lexeme const right = unary() return { type: 'UnaryExpression', operator, right } } return power() }
Statements
- Statements represent higher level constructs
- Information varies more between different types of statements
- Ex. For statement:
variable
,initial
,final
,step
,body
- Ex. Assignment:
variableList
,expressionList
- Ex. If statement:
clauses
containingtype
,condition
,body
Block
block --> (statement)* returnStatement? ;
block() {
const node = { type: 'Block', statememts: [] }
while (!check('RETURN') && !check('END') &&
!check('ELSEIF') && !check('ELSE') &&
!check('UNTIL') && !isAtEnd())
{
node.statements.push(statement())
match('SEMICOLON')
}
if (check('RETURN')) {
node.returnStatement = returnStatement()
}
return node
}
If Statement
-
One of the type of statements
'if' expression 'then' block ('elseif' expression 'then' block)* ('else' block)? 'end'
-
Always if block
-
Could be multiple elseif blocks
-
Might be one else block
ifStatement() { const node = { type: 'IfStatement', clauses: [] } advance() // Skip the "if" token ... consume('END', 'Expected "end" after "if" statement.') return node }
Components
-
3 types of blocks in an if statement, cleaner to implement each section in a separate function
ifStatement() { ... node.clauses.push(ifBlock()); if (check('ELSEIF')) { node.clauses.push(...elseIfBlock()); } if (check('ELSE')) { node.clauses.push(elseBlock()); } ... }
ifBlock
'if' expression 'then' block
ifBlock() {
const clause = { type: 'IfClause' }
if (match('LEFT_PAREN')) {
clause.condition = expression()
consume('RIGHT_PAREN', 'Expected ")" after "if" condition.')
} else {
clause.condition = expression()
}
consume('THEN', 'Expected "then" after "if" condition.')
clause.block = block()
return clause
}
elseIfBlock
('elseif' expression 'then' block)*
elseIfBlock() {
const clauses = [];
while (match('ELSEIF')) {
const clause = { type: 'ElseIfClause' }
if (match('LEFT_PAREN')) {
clause.condition = expression()
consume('RIGHT_PAREN', 'Expected ")" after "elseif" condition.')
} else {
clause.condition = expression()
}
consume('THEN', 'Expected "then" after "elseif" condition.')
clause.block = block()
clauses.push(clause)
}
return clauses
}
elseBlock
('else' block)?
elseBlock() {
const clause = { type: 'ElseClause' }
advance()
clause.block = block()
return clause
}