Interpreters - Parsing

Agenda

AST
Recursive Descent Parser
How does the parser work?
Expressions
Statements

Abstract Syntax Tree

ASTs are abstract representations of our code, removing the syntax details of a language
Nodes contain relevant information of the node type as you will see later
Useful for reporting syntax errors, semantic analysis, optimisation, direct execution
For example, both these snippets are functionally equivalent. In the case of an if statement, we only care about block type, condition and body
```
if (true) {
    console.log('Hello, World!')
}
```
```
if (true) console.log('Hello, World!')
```

Recursive Descent Parser

A simple top down parser

Construct AST from outer to inner grammar rules

chunk     --> block ;
block     --> (statement)* returnStatement?
statement --> ...

How to implement

Directly translate grammar rules into code

Each rule becomes a function

| Grammar      | Code translation             | Explanation                    |
| ------------ | ---------------------------- | ------------------------------ |
| Terminal     | Match and consume the token  | Basic building block           |
| Non-terminal | Call the rule's function     | Create a sub-expression        |
| \|           | If or switch statement       | Consider different cases       |
| *            | Loop                         | Repetition of clause           |
| ?            | If statement                 | Binary case, exists or doesn't |

Example of translation

Each rule becomes a function

returnStatement --> 'return' expressionList ;

function returnStatement() {

}

Translation of terminal

Match and consume the token

returnStatement --> 'return' expressionList ;

function returnStatement() {
    ...
  
    consume('RETURN')
  
    ...
}

Translation of non-terminal

Call the rule's function

returnStatement --> 'return' expressionList ;

function returnStatement() {
    ...

    consume('RETURN')
    expressionList()

    ...
}

Translation of |

If or switch statement to handle cases

statement --> 'if' expression 'then' block 
            ('elseif' expression 'then' block)*
            ('else' block)? 
            'end' |

            'while' expression 'do' block 'end' |
            
            'repeat' block 'until' expression |

            'for' identifier '=' expression ',' expression
            (',' expression)?
            'do' block 'end' |

            'for' identifierList 'in' expressionList 'do' block 'end' |

            'function' functionName body |

            'local function' identifier body ;

function statement() {
    switch(tokenType) {
        case 'IF': ...
        case 'FOR': ...
        case 'WHILE': ...
        case 'REPEAT': ...
        case 'FUNCTION': ...
        ...
    }
}

Translation of *

Loop to handle repetition

expressionList --> expression (',' expression)* ;

function expressionList() {
    ...

    do
        expression()
    while (match('COMMA'))

    ...
}

Translation of ?

If statement to handle optional case

block --> (statement)* returnStatement? ;

function block() {
    ...

    if (check('RETURN')) {
        returnStatement()
    }

    ...
}

How does the parser work?

Iterate over a stream of tokens (similar to lexer, where source code was iterated character by character)
Construct a type of node based off a matching token
Some helper functions for manipulating tokens: isAtEnd, advance, previous, match, peek, check, consume

Example output of AST (10 + 2 - 4)

Parser constructor

class Parser {
    constructor(tokens) {
        this.tokens = tokens
        this.current = 0
    }
}

isAtEnd

isAtEnd() {
    return peek().type === 'EOF'
}

peek

peek() {
    return tokens[current]
}

advance

advance() {
    if (isAtEnd()) current++

    return previous()
}

previous() {
    return tokens[current - 1]
}

check

check(tokenType) {
    if (isAtEnd()) return false

    return peek().type === tokenType
}

match

match(...tokenTypes) {
    for (let type of tokenType) {
        if (check(type)) {
            advance()
        
            return true
        }
    }

    return false
}

consume

consume(tokenType, errorMessage) {
    if (check(tokenType)) return advance()

    throw new Error(`Line ${peek().line}: ${message}`)
}

Expressions

Two types: Binary and unary expressions
Binary expression nodes have left and right branches, whereas unary expression nodes only have a single branch
Rules defined with CFG already handle precedence and associativity, no need to worry about this issue

Evaluation

Post-order traversal
Visit children nodes first: Left -> Right -> Current

Node example

Left / right can reference leaf nodes or a sub-branch
Leaf nodes are literal values

Sub-branch indicates sub-expressions

{
    type: "BinaryExpression",
    left: ... ,
    operator: "+",
    right: ...
}

Top level expression function

expression --> logicalOr ;

expression() {
    return logicalOr()
}

Brief preview into other functions

logicalOr     --> logicalAnd ("or" logicalAnd)* ;
logicalAnd    --> comparison ("and" comparison)* ;
comparison    --> bitwiseOr (( "==" | "~=" | ">" | ">=" | "<" | "<=" ) bitwiseOr)* ;
bitwiseOr     --> bitwiseXor ( "|" bitwiseXor )* ;
bitwiseXor    --> bitwiseAnd ( "~" bitwiseAnd )* ;
bitwiseAnd    --> shift ( "&" shift )* ;
shift         --> concatenation (( "<<" | ">>" ) concatenation)* ;
concatenation --> term ( ".." concatenation )* ;

logicalOr() { const node = logicalAnd(); ... }
logicalAnd() { const node = comparison(); ... }
comparison() { const node = bitwiseOr(); ... }
bitwiseOr() { const node = bitwiseXor(); ... }
bitwiseXor() { const node = bitwiseAnd(); ... }
bitwiseAnd() { const node = shift(); ... }
shift() { const node = concatenation(); ... }
concatenation() { const node = term(); ... }

term function

term --> factor (( "+" | "-" ) factor)* ;

term() {
    let node = factor()

    while (match('PLUS, MINUS')) {
        const operator = previous().lexeme
        const right = factor()
        node = { type: 'BinaryExpression', left: node, operator, right }
    }

    return node
}

factor function

factor --> unary (( "*" | "/" | "//" | "%" ) unary)* ;

factor() {
    let node = unary()

    while (match('STAR', 'SLASH', 'DOUBLE_SLASH', 'PERCENT')) {
        const operator = previous().lexeme
        const right = unary()
        node = { type: 'BinaryExpression', left: node, operator, right }
    }

    return node
}

primary function

primary --> number | string | "true" | "false" | "nil" | "..." | functionDef | table | prefixExpression ;

primary() {
    if (match('NUMBER', 'STRING', 'TRUE', 'FALSE', 'NIL')) {
        return { type: 'Literal', value: previous().value }
    }
    if (match('ELLIPSIS')) {
        return { type: 'VarArgs' }
    }
    if (check('FUNCTION')) {
        return functionDef()
    }

    if (check('LEFT_BRACE')) {
        return table()
    }

    return prefixExpression()
}

Walkthrough example

Consider the expression:
```
10 + 2 * 5 - 4
```
Top level expression() called

Subsequent functions are called: logicalOr --> logicalAnd --> ... --> term

lowerPrecedenceFunction() {
    const node = nextPrecedenceFunction()
    ...
}

Current token: 10

More subsequent calls until primary(), hit a base case

factor() --> unary() --> power() --> primary()

term() {
    const node = factor()
    ...
}

factor() {
    const node = unary()
    ...
}

primary() {
    if (match('NUMBER', ...)) {
        return { type: 'Literal', value: previous().value }
    }
    ...
}

Current node value (outer node)

Return value from primary() and trickle back up
```
{
    type: 'Literal',
    value: 10
}
```

Current token: +

Return to term() from call stack

term() {
    let node = factor() // Finished, set result to node

    while (match('PLUS, MINUS')) {  // Current token matches, consume
        const operator = previous().lexeme  // +
        const right = factor()  // Next step
        node = { type: 'BinaryExpression', left: node, operator, right }
    }

    return node
}

Current token: 2

Similar to first step, hit a base case in primary()
factor() --> unary() --> power() --> primary()
```
{
    type: 'Literal',
    value: 2
}
```

Current token: *

Return to factor() from call stack

factor() {
    let node = unary() // Finished, set result to node

    while (match('STAR', 'SLASH', 'DOUBLE_SLASH', 'PERCENT')) { // Matches
        const operator = previous().lexeme  // * 
        const right = unary() // Next step
        node = { type: 'BinaryExpression', left: node, operator, right }
    }

    return node
}

Current token: 5

Hit a base case in primary()
factor() --> unary() --> power() --> primary()
Set right to the result
```
{
    type: 'Literal',
    value: 5
}
```

Current node value (inner node)

Notice how the first completed node describes the multiplication operation (highest precedence operation in our input)

{
    type: 'BinaryExpression',
    left: {
        type: 'Literal',
        value: 2
    },
    operator: '*',
    right: {
        type: 'Literal',
        value: 5
    }
}

Current token: -

No more matches, return back to term() with result

factor() {
    let node = unary() 

    while (match('STAR', 'SLASH', 'DOUBLE_SLASH', 'PERCENT')) { // No match
        const operator = previous().lexeme 
        const right = unary() 
        node = { type: 'BinaryExpression', left: node, operator, right } // New node
    }

    return node // Return new node
}

Back to term call with result

Finish computing factory(), return result

term() {
    let node = factor() 

    while (match('PLUS, MINUS')) {
        const operator = previous().lexeme
        const right = factor()  // Set to new node
        node = { type: 'BinaryExpression', left: node, operator, right }
    }

    return node
}

Current node value (outer node)

The result from factory() is attached to the right branch

{
    type: 'BinaryExpression',
    left: {
        type: 'Literal',
        value: 10
    },
    operator: '+',
    right: {
        type: 'BinaryExpression',
        left: {
            type: 'Literal',
            value: 2
        },
        operator: '*',
        right: {
            type: 'Literal',
            value: 5
        }
    }
}

Current token: -

Loop again since the token matches a PLUS or MINUS

term() {
    let node = factor() 

    while (match('PLUS, MINUS')) {  // Matches, consume
        const operator = previous().lexeme  // -
        const right = factor()  // Next step
        node = { type: 'BinaryExpression', left: node, operator, right }
    }

    return node
}

Current token: 4

Hit a base case in primary()
factor() --> unary() --> power() --> primary()
Set right to the result
```
{
    type: 'Literal',
    value: 4
}
```

No more tokens

No more tokens, return and trickle back up to expression()

term() {
    let node = factor() 

    while (match('PLUS, MINUS')) { 
        const operator = previous().lexeme  
        const right = factor()  // Set to new node
        node = { type: 'BinaryExpression', left: node, operator, right }
    }

    return node
}

Final node value (outer node)

This is the final resulting AST

{
    type: 'BinaryExpression',
    left: {
        type: 'BinaryExpression',
        left: {
            type: 'Literal',
            value: 10
        },
        operator: '+',
        right: {
            type: 'BinaryExpression',
            left: {
                type: 'Literal',
                value: 2
            },
            operator: '*',
            right: {
                type: 'Literal',
                value: 5
            }
        }
    },
    operator: '-',
    right: {
        type: 'Literal',
        value: 4
    }
}

AST from walkthrough

Hands-on

Attempt the implementation for unary

Grammar rule:

unary --> ( "-" | "not" | "#" | "~" ) unary | power ;

Additional information:
- Token names are "MINUS", "NOT", "HASHTAG" and "TILDA" respectively
- Token structure consists of properties: type, lexeme, value, line

Implementation for unary

A binary case for either dealing with unary or power operations

Similar type of node data, unary expressions only have one operand

unary() {
    if (match('MINUS', 'NOT', 'HASHTAG', 'TILDA')) {
        const operator = previous().lexeme
        const right = unary()

        return { type: 'UnaryExpression', operator, right }
    }

    return power()
}

Statements

Statements represent higher level constructs
Information varies more between different types of statements
Ex. For statement: variable, initial, final, step, body
Ex. Assignment: variableList, expressionList
Ex. If statement: clauses containing type, condition, body

Block

block --> (statement)* returnStatement? ;

block() {
    const node = { type: 'Block', statememts: [] }
    while (!check('RETURN') && !check('END') &&
        !check('ELSEIF') && !check('ELSE') &&
        !check('UNTIL') && !isAtEnd()) 
    {
        node.statements.push(statement())
        match('SEMICOLON')
    }

    if (check('RETURN')) {
        node.returnStatement = returnStatement()
    }

    return node
}

If Statement

One of the type of statements

'if' expression 'then' block 
('elseif' expression 'then' block)*
('else' block)? 
'end'

Always if block
Could be multiple elseif blocks

Might be one else block

ifStatement() {
    const node = { type: 'IfStatement', clauses: [] }
    advance()   // Skip the "if" token

    ...

    consume('END', 'Expected "end" after "if" statement.')
    return node
}

Components

3 types of blocks in an if statement, cleaner to implement each section in a separate function

ifStatement() {
    ...

    node.clauses.push(ifBlock());

    if (check('ELSEIF')) {
        node.clauses.push(...elseIfBlock());
    }

    if (check('ELSE')) {
        node.clauses.push(elseBlock());
    }

    ...
}

ifBlock

'if' expression 'then' block

ifBlock() {
    const clause = { type: 'IfClause' }

    if (match('LEFT_PAREN')) {
        clause.condition = expression()
        consume('RIGHT_PAREN', 'Expected ")" after "if" condition.')
    } else {
        clause.condition = expression()
    }

    consume('THEN', 'Expected "then" after "if" condition.')
    clause.block = block()

    return clause
}

elseIfBlock

('elseif' expression 'then' block)*

elseIfBlock() {
    const clauses = [];

    while (match('ELSEIF')) {
        const clause = { type: 'ElseIfClause' }

        if (match('LEFT_PAREN')) {
            clause.condition = expression()
            consume('RIGHT_PAREN', 'Expected ")" after "elseif" condition.')
        } else {
            clause.condition = expression()
        }

        consume('THEN', 'Expected "then" after  "elseif" condition.')
        clause.block = block()

        clauses.push(clause)
    }

    return clauses
}

elseBlock

('else' block)?

elseBlock() {
    const clause = { type: 'ElseClause' }
    
    advance()
    clause.block = block()
    
    return clause
}

Interpreters - Parsing

Agenda

Abstract Syntax Tree

Recursive Descent Parser

How to implement

Example of translation

Translation of terminal

Translation of non-terminal

Translation of |

Translation of *

Translation of ?

How does the parser work?

Example output of AST (10 + 2 - 4)

Parser constructor

isAtEnd

peek

advance

previous

check

match

consume

Expressions

Evaluation

Node example

Top level expression function

Brief preview into other functions

term function

factor function

primary function

Walkthrough example

Current token: 10

Current node value (outer node)

Current token: +

Current token: 2

Current token: *

Current token: 5

Current node value (inner node)

Current token: -

Back to term call with result

Current node value (outer node)

Current token: -

Current token: 4

No more tokens

Final node value (outer node)

AST from walkthrough

Hands-on

Implementation for unary

Statements

Block

If Statement

Components

ifBlock

elseIfBlock

elseBlock

Example with if statement