Graduate Program KB


Interpreters - Parsing


Agenda

  1. AST
  2. Recursive Descent Parser
  3. How does the parser work?
  4. Expressions
  5. Statements

Abstract Syntax Tree

  • ASTs are abstract representations of our code, removing the syntax details of a language

  • Nodes contain relevant information of the node type as you will see later

  • Useful for reporting syntax errors, semantic analysis, optimisation, direct execution

  • For example, both these snippets are functionally equivalent. In the case of an if statement, we only care about block type, condition and body

    if (true) {
        console.log('Hello, World!')
    }
    
    if (true) console.log('Hello, World!')
    

Recursive Descent Parser

  • A simple top down parser
  • Construct AST from outer to inner grammar rules
    chunk     --> block ;
    block     --> (statement)* returnStatement?
    statement --> ...
    

How to implement

  • Directly translate grammar rules into code

  • Each rule becomes a function

    | Grammar      | Code translation             | Explanation                    |
    | ------------ | ---------------------------- | ------------------------------ |
    | Terminal     | Match and consume the token  | Basic building block           |
    | Non-terminal | Call the rule's function     | Create a sub-expression        |
    | \|           | If or switch statement       | Consider different cases       |
    | *            | Loop                         | Repetition of clause           |
    | ?            | If statement                 | Binary case, exists or doesn't |
    

Example of translation

  • Each rule becomes a function

    returnStatement --> 'return' expressionList ;
    
    function returnStatement() {
    
    }
    

Translation of terminal

  • Match and consume the token

    returnStatement --> 'return' expressionList ;
    
    function returnStatement() {
        ...
      
        consume('RETURN')
      
        ...
    }
    

Translation of non-terminal

  • Call the rule's function

    returnStatement --> 'return' expressionList ;
    
    function returnStatement() {
        ...
    
        consume('RETURN')
        expressionList()
    
        ...
    }
    

Translation of |

  • If or switch statement to handle cases

    statement --> 'if' expression 'then' block 
                ('elseif' expression 'then' block)*
                ('else' block)? 
                'end' |
    
                'while' expression 'do' block 'end' |
                
                'repeat' block 'until' expression |
    
                'for' identifier '=' expression ',' expression
                (',' expression)?
                'do' block 'end' |
    
                'for' identifierList 'in' expressionList 'do' block 'end' |
    
                'function' functionName body |
    
                'local function' identifier body ;
    
    function statement() {
        switch(tokenType) {
            case 'IF': ...
            case 'FOR': ...
            case 'WHILE': ...
            case 'REPEAT': ...
            case 'FUNCTION': ...
            ...
        }
    }
    

Translation of *

  • Loop to handle repetition

    expressionList --> expression (',' expression)* ;
    
    function expressionList() {
        ...
    
        do
            expression()
        while (match('COMMA'))
    
        ...
    }
    

Translation of ?

  • If statement to handle optional case

    block --> (statement)* returnStatement? ;
    
    function block() {
        ...
    
        if (check('RETURN')) {
            returnStatement()
        }
    
        ...
    }
    

How does the parser work?

  • Iterate over a stream of tokens (similar to lexer, where source code was iterated character by character)
  • Construct a type of node based off a matching token
  • Some helper functions for manipulating tokens: isAtEnd, advance, previous, match, peek, check, consume

Example output of AST (10 + 2 - 4)

Parser constructor

class Parser {
    constructor(tokens) {
        this.tokens = tokens
        this.current = 0
    }
}

isAtEnd

isAtEnd() {
    return peek().type === 'EOF'
}

peek

peek() {
    return tokens[current]
}

advance

advance() {
    if (isAtEnd()) current++

    return previous()
}

previous

previous() {
    return tokens[current - 1]
}

check

check(tokenType) {
    if (isAtEnd()) return false

    return peek().type === tokenType
}

match

match(...tokenTypes) {
    for (let type of tokenType) {
        if (check(type)) {
            advance()
        
            return true
        }
    }

    return false
}

consume

consume(tokenType, errorMessage) {
    if (check(tokenType)) return advance()

    throw new Error(`Line ${peek().line}: ${message}`)
}

Expressions

  • Two types: Binary and unary expressions
  • Binary expression nodes have left and right branches, whereas unary expression nodes only have a single branch
  • Rules defined with CFG already handle precedence and associativity, no need to worry about this issue

Evaluation

  • Post-order traversal
  • Visit children nodes first: Left -> Right -> Current

Node example

  • Left / right can reference leaf nodes or a sub-branch

  • Leaf nodes are literal values

  • Sub-branch indicates sub-expressions

    {
        type: "BinaryExpression",
        left: ... ,
        operator: "+",
        right: ...
    }
    

Top level expression function

expression --> logicalOr ;
expression() {
    return logicalOr()
}

Brief preview into other functions

logicalOr     --> logicalAnd ("or" logicalAnd)* ;
logicalAnd    --> comparison ("and" comparison)* ;
comparison    --> bitwiseOr (( "==" | "~=" | ">" | ">=" | "<" | "<=" ) bitwiseOr)* ;
bitwiseOr     --> bitwiseXor ( "|" bitwiseXor )* ;
bitwiseXor    --> bitwiseAnd ( "~" bitwiseAnd )* ;
bitwiseAnd    --> shift ( "&" shift )* ;
shift         --> concatenation (( "<<" | ">>" ) concatenation)* ;
concatenation --> term ( ".." concatenation )* ;
logicalOr() { const node = logicalAnd(); ... }
logicalAnd() { const node = comparison(); ... }
comparison() { const node = bitwiseOr(); ... }
bitwiseOr() { const node = bitwiseXor(); ... }
bitwiseXor() { const node = bitwiseAnd(); ... }
bitwiseAnd() { const node = shift(); ... }
shift() { const node = concatenation(); ... }
concatenation() { const node = term(); ... }

term function

term --> factor (( "+" | "-" ) factor)* ;
term() {
    let node = factor()

    while (match('PLUS, MINUS')) {
        const operator = previous().lexeme
        const right = factor()
        node = { type: 'BinaryExpression', left: node, operator, right }
    }

    return node
}

factor function

factor --> unary (( "*" | "/" | "//" | "%" ) unary)* ;
factor() {
    let node = unary()

    while (match('STAR', 'SLASH', 'DOUBLE_SLASH', 'PERCENT')) {
        const operator = previous().lexeme
        const right = unary()
        node = { type: 'BinaryExpression', left: node, operator, right }
    }

    return node
}

primary function

primary --> number | string | "true" | "false" | "nil" | "..." | functionDef | table | prefixExpression ;
primary() {
    if (match('NUMBER', 'STRING', 'TRUE', 'FALSE', 'NIL')) {
        return { type: 'Literal', value: previous().value }
    }
    if (match('ELLIPSIS')) {
        return { type: 'VarArgs' }
    }
    if (check('FUNCTION')) {
        return functionDef()
    }

    if (check('LEFT_BRACE')) {
        return table()
    }

    return prefixExpression()
}

Walkthrough example

  • Consider the expression:

    10 + 2 * 5 - 4
    
  • Top level expression() called

  • Subsequent functions are called: logicalOr --> logicalAnd --> ... --> term

    lowerPrecedenceFunction() {
        const node = nextPrecedenceFunction()
        ...
    }
    

Current token: 10

  • More subsequent calls until primary(), hit a base case

  • factor() --> unary() --> power() --> primary()

    term() {
        const node = factor()
        ...
    }
    
    factor() {
        const node = unary()
        ...
    }
    
    primary() {
        if (match('NUMBER', ...)) {
            return { type: 'Literal', value: previous().value }
        }
        ...
    }
    

Current node value (outer node)

  • Return value from primary() and trickle back up

    {
        type: 'Literal',
        value: 10
    }
    

Current token: +

  • Return to term() from call stack

    term() {
        let node = factor() // Finished, set result to node
    
        while (match('PLUS, MINUS')) {  // Current token matches, consume
            const operator = previous().lexeme  // +
            const right = factor()  // Next step
            node = { type: 'BinaryExpression', left: node, operator, right }
        }
    
        return node
    }
    

Current token: 2

  • Similar to first step, hit a base case in primary()

  • factor() --> unary() --> power() --> primary()

    {
        type: 'Literal',
        value: 2
    }
    

Current token: *

  • Return to factor() from call stack

    factor() {
        let node = unary() // Finished, set result to node
    
        while (match('STAR', 'SLASH', 'DOUBLE_SLASH', 'PERCENT')) { // Matches
            const operator = previous().lexeme  // * 
            const right = unary() // Next step
            node = { type: 'BinaryExpression', left: node, operator, right }
        }
    
        return node
    }
    

Current token: 5

  • Hit a base case in primary()

  • factor() --> unary() --> power() --> primary()

  • Set right to the result

    {
        type: 'Literal',
        value: 5
    }
    

Current node value (inner node)

  • Notice how the first completed node describes the multiplication operation (highest precedence operation in our input)

    {
        type: 'BinaryExpression',
        left: {
            type: 'Literal',
            value: 2
        },
        operator: '*',
        right: {
            type: 'Literal',
            value: 5
        }
    }
    

Current token: -

  • No more matches, return back to term() with result

    factor() {
        let node = unary() 
    
        while (match('STAR', 'SLASH', 'DOUBLE_SLASH', 'PERCENT')) { // No match
            const operator = previous().lexeme 
            const right = unary() 
            node = { type: 'BinaryExpression', left: node, operator, right } // New node
        }
    
        return node // Return new node
    }
    

Back to term call with result

  • Finish computing factory(), return result

    term() {
        let node = factor() 
    
        while (match('PLUS, MINUS')) {
            const operator = previous().lexeme
            const right = factor()  // Set to new node
            node = { type: 'BinaryExpression', left: node, operator, right }
        }
    
        return node
    }
    

Current node value (outer node)

  • The result from factory() is attached to the right branch

    {
        type: 'BinaryExpression',
        left: {
            type: 'Literal',
            value: 10
        },
        operator: '+',
        right: {
            type: 'BinaryExpression',
            left: {
                type: 'Literal',
                value: 2
            },
            operator: '*',
            right: {
                type: 'Literal',
                value: 5
            }
        }
    }
    

Current token: -

  • Loop again since the token matches a PLUS or MINUS

    term() {
        let node = factor() 
    
        while (match('PLUS, MINUS')) {  // Matches, consume
            const operator = previous().lexeme  // -
            const right = factor()  // Next step
            node = { type: 'BinaryExpression', left: node, operator, right }
        }
    
        return node
    }
    

Current token: 4

  • Hit a base case in primary()

  • factor() --> unary() --> power() --> primary()

  • Set right to the result

    {
        type: 'Literal',
        value: 4
    }
    

No more tokens

  • No more tokens, return and trickle back up to expression()

    term() {
        let node = factor() 
    
        while (match('PLUS, MINUS')) { 
            const operator = previous().lexeme  
            const right = factor()  // Set to new node
            node = { type: 'BinaryExpression', left: node, operator, right }
        }
    
        return node
    }
    

Final node value (outer node)

  • This is the final resulting AST

    {
        type: 'BinaryExpression',
        left: {
            type: 'BinaryExpression',
            left: {
                type: 'Literal',
                value: 10
            },
            operator: '+',
            right: {
                type: 'BinaryExpression',
                left: {
                    type: 'Literal',
                    value: 2
                },
                operator: '*',
                right: {
                    type: 'Literal',
                    value: 5
                }
            }
        },
        operator: '-',
        right: {
            type: 'Literal',
            value: 4
        }
    }
    

AST from walkthrough


Hands-on

  • Attempt the implementation for unary

  • Grammar rule:

    unary --> ( "-" | "not" | "#" | "~" ) unary | power ;
    
  • Additional information:

    • Token names are "MINUS", "NOT", "HASHTAG" and "TILDA" respectively
    • Token structure consists of properties: type, lexeme, value, line

Implementation for unary

  • A binary case for either dealing with unary or power operations

  • Similar type of node data, unary expressions only have one operand

    unary() {
        if (match('MINUS', 'NOT', 'HASHTAG', 'TILDA')) {
            const operator = previous().lexeme
            const right = unary()
    
            return { type: 'UnaryExpression', operator, right }
        }
    
        return power()
    }
    

Statements

  • Statements represent higher level constructs
  • Information varies more between different types of statements
  • Ex. For statement: variable, initial, final, step, body
  • Ex. Assignment: variableList, expressionList
  • Ex. If statement: clauses containing type, condition, body

Block

block --> (statement)* returnStatement? ;
block() {
    const node = { type: 'Block', statememts: [] }
    while (!check('RETURN') && !check('END') &&
        !check('ELSEIF') && !check('ELSE') &&
        !check('UNTIL') && !isAtEnd()) 
    {
        node.statements.push(statement())
        match('SEMICOLON')
    }

    if (check('RETURN')) {
        node.returnStatement = returnStatement()
    }

    return node
}

If Statement

  • One of the type of statements

    'if' expression 'then' block 
    ('elseif' expression 'then' block)*
    ('else' block)? 
    'end'
    
  • Always if block

  • Could be multiple elseif blocks

  • Might be one else block

    ifStatement() {
        const node = { type: 'IfStatement', clauses: [] }
        advance()   // Skip the "if" token
    
        ...
    
        consume('END', 'Expected "end" after "if" statement.')
        return node
    }
    

Components

  • 3 types of blocks in an if statement, cleaner to implement each section in a separate function

    ifStatement() {
        ...
    
        node.clauses.push(ifBlock());
    
        if (check('ELSEIF')) {
            node.clauses.push(...elseIfBlock());
        }
    
        if (check('ELSE')) {
            node.clauses.push(elseBlock());
        }
    
        ...
    }
    

ifBlock

'if' expression 'then' block
ifBlock() {
    const clause = { type: 'IfClause' }

    if (match('LEFT_PAREN')) {
        clause.condition = expression()
        consume('RIGHT_PAREN', 'Expected ")" after "if" condition.')
    } else {
        clause.condition = expression()
    }

    consume('THEN', 'Expected "then" after "if" condition.')
    clause.block = block()

    return clause
}

elseIfBlock

('elseif' expression 'then' block)*
elseIfBlock() {
    const clauses = [];

    while (match('ELSEIF')) {
        const clause = { type: 'ElseIfClause' }

        if (match('LEFT_PAREN')) {
            clause.condition = expression()
            consume('RIGHT_PAREN', 'Expected ")" after "elseif" condition.')
        } else {
            clause.condition = expression()
        }

        consume('THEN', 'Expected "then" after  "elseif" condition.')
        clause.block = block()

        clauses.push(clause)
    }

    return clauses
}

elseBlock

('else' block)?
elseBlock() {
    const clause = { type: 'ElseClause' }
    
    advance()
    clause.block = block()
    
    return clause
}

Example with if statement