Graduate Program KB


Interpreters - Lexical Analysis


Agenda

  1. What are interpreters?
  2. Process
  3. Writing a lexer

What are interpreters?

  • A program that translates and executes code line by line, without compiling the whole program first
  • For any language, source code must be translated to machine code before the CPU can execute it
  • Interpreters are one form of translator, alongside compilers and assemblers


Characteristics

  • No intermediate object code is generated
  • Requires less memory
  • Slower, since high-level code is re-translated every time the program runs
  • Errors are detected immediately, at the line where they occur


Interpreter Process

Lexical Analysis

  • Scanning the source code character by character, grouping them into tokens
  • Token: Represents individual elements of a programming language's syntax
  • Lexeme: A sequence of characters that matches a pattern for a token
  • Lexer: Component which scans the source code and creates tokens

Syntax Analysis (Parsing)

  • Verifies the arrangement of tokens following a set of grammar rules
  • Constructs Abstract Syntax Tree

Semantic Analysis

  • Checks for logical errors, type declarations, scoping, etc.

Execution

  • Interpret and execute program logic represented by AST
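As a rough end-to-end illustration of the four stages, here is how `local x = 1 + 2` might move through them. The token and AST shapes below are simplified placeholders, not the structures built later in these notes:

```javascript
// Stage 1 (lexical analysis): characters are grouped into tokens
const tokens = ["LOCAL", "IDENTIFIER(x)", "ASSIGN", "NUMBER(1)", "PLUS", "NUMBER(2)"];

// Stage 2 (parsing): tokens are arranged into an abstract syntax tree
const ast = {
    type: "LocalAssignment",
    name: "x",
    expression: { type: "BinaryOp", operator: "+", left: 1, right: 2 },
};

// Stage 3 (semantic analysis) would check, e.g., that both operands are numbers

// Stage 4 (execution): the interpreter walks the tree and evaluates it
const result = ast.expression.left + ast.expression.right; // x becomes 3
```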

Writing a lexer for Lua


Token structure

  • type: Token type
  • lexeme: Raw source code
  • value: Interpreted value
  • line: Line of occurrence
function Token(type, lexeme, value, line) {
    this.type = type;
    this.lexeme = lexeme;
    this.value = value;
    this.line = line;
}
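The constructor can be exercised directly. For example, the literal 20 on line 1 of `local x = 20` would become (the constructor is repeated here so the snippet runs on its own, and the token type is a plain string in this sketch):

```javascript
// Token constructor as above, repeated so the snippet is self-contained
function Token(type, lexeme, value, line) {
    this.type = type;
    this.lexeme = lexeme;
    this.value = value;
    this.line = line;
}

// The literal 20 on line 1 of `local x = 20`
const numberToken = new Token("NUMBER_LITERAL", "20", 20, 1);
```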

Token Types

  Token Type       Matches
  PLUS             +
  MINUS            -
  STAR             *
  SLASH            /
  PERCENT          %
  LEFT_PAREN       (
  RIGHT_PAREN      )
  LEFT_BRACE       {
  RIGHT_BRACE      }
  LEFT_BRACKET     [
  RIGHT_BRACKET    ]
  CARET            ^
  TAG              #
  ASSIGN           =
  LESS             <
  GREATER          >
  COLON            :
  SEMICOLON        ;
  AMPERSAND        &
  PIPE             |
  TILDE            ~
  PERIOD           .
  EQUAL            ==
  NOT_EQUAL        ~=
  LESS_EQUAL       <=
  GREATER_EQUAL    >=
  LEFT_SHIFT       <<
  RIGHT_SHIFT      >>
  DOUBLE_SLASH     //
  DOUBLE_COLON     ::
  CONCATENATE      ..
  ELLIPSIS         ...
  IF               if
  ELSE             else
  ELSEIF           elseif
  THEN             then
  AND              and
  OR               or
  NOT              not
  WHILE            while
  DO               do
  REPEAT           repeat
  UNTIL            until
  FOR              for
  IN               in
  BREAK            break
  RETURN           return
  FUNCTION         function
  END              end
  LOCAL            local
  TRUE             true
  FALSE            false
  NIL              nil
  GOTO             goto
  CONTINUE         continue

Non-useful lexemes

  • Don't provide any benefit to the parser
  • Ignore comments --, comment blocks --[[ ]]
  • Ignore white space
  • Ignore newlines \n, tabs \t, carriage returns \r

Token Example

Token { type: LOCAL,          lexeme: "local", value: null, line: 1 }
Token { type: IDENTIFIER,     lexeme: "x",     value: null, line: 1 }
Token { type: ASSIGN,         lexeme: "=",     value: null, line: 1 }
Token { type: NUMBER_LITERAL, lexeme: "20",    value: 20,   line: 1 }

Lexer structure

  • tokenList: Stores tokens
  • sourceCode: Lua code (String)
  • startPosition: Starting index of current lexeme
  • currentPosition: Current index of sourceCode
  • currentLine: Line where lexeme occurs
function Lexer(sourceCode) {
    this.tokenList = [];
    this.sourceCode = sourceCode;
    this.startPosition = 0;
    this.currentPosition = 0;
    this.currentLine = 1;
}

Helper methods

function advance() {
    const currentCharacter = sourceCode[currentPosition];
    currentPosition++;
    return currentCharacter;
}

function isEndOfFile() {
    return currentPosition >= sourceCode.length;
}
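A quick check of the two helpers, walking a two-character input to the end (state is module-level here, mirroring the slides):

```javascript
// advance and isEndOfFile with module-level state, as in the slides
let sourceCode = "ab";
let currentPosition = 0;

function advance() {
    const currentCharacter = sourceCode[currentPosition];
    currentPosition++;
    return currentCharacter;
}

function isEndOfFile() {
    return currentPosition >= sourceCode.length;
}

const scanned = [];
while (!isEndOfFile()) {
    scanned.push(advance());
}
// scanned is ["a", "b"] and isEndOfFile() is now true
```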

Iterating source code

  • tokenise is called once for every lexeme
function scanTokens() {
    while (!isEndOfFile()) {
        startPosition = currentPosition;
        tokenise();
    }

    addToken(EOF);
    return tokenList;
}

Switch Case

function tokenise() {
    const currentCharacter = advance();

    switch (currentCharacter) {
        case "(":
            addToken(LEFT_PAREN);
            break;
        // A lot more cases
    }
}

Ignore cases

switch (currentCharacter) {
    ...
    case " ":   break;
    case "\r":  break;
    case "\t":  break;
    case "\n":
        currentLine++;
        break;
    ...
}

Adding Tokens

function addToken(tokenType, tokenValue = null) {
    const tokenLexeme = sourceCode.slice(startPosition, currentPosition);
    const newToken = new Token(tokenType, tokenLexeme, tokenValue, currentLine);

    tokenList.push(newToken);
}


Single character tokens

switch (currentCharacter) {
    case "(": addToken(LEFT_PAREN); break;
    case ")": addToken(RIGHT_PAREN); break;
    case "{": addToken(LEFT_BRACE); break;
    case "}": addToken(RIGHT_BRACE); break;
    case "[": addToken(LEFT_BRACKET); break;
    case "]": addToken(RIGHT_BRACKET); break;
    case ",": addToken(COMMA); break;
    case ";": addToken(SEMICOLON); break;
    case "&": addToken(AMPERSAND); break;
    case "|": addToken(PIPE); break;
    case "#": addToken(TAG); break;
    case "^": addToken(CARET); break;
    case "+": addToken(PLUS); break;
    case "*": addToken(STAR); break;
    // More cases below
}

Errors

  • Keep scanning even after encountering an error
  • Better to detect all the errors in one execution
switch (currentCharacter) {
    ...

    default:
        console.log(`Invalid character "${currentCharacter}" at line ${currentLine}`);
}

Current output

local x = (10 == 20)
Invalid character "l" at line 1
Invalid character "o" at line 1
Invalid character "c" at line 1
Invalid character "a" at line 1
Invalid character "l" at line 1
Invalid character "x" at line 1
Token { type: ASSIGN,      lexeme: "=", value: null, line: 1 }
Token { type: LEFT_PAREN,  lexeme: "(", value: null, line: 1 }
Invalid character "1" at line 1
Invalid character "0" at line 1
Token { type: ASSIGN,      lexeme: "=", value: null, line: 1 }
Token { type: ASSIGN,      lexeme: "=", value: null, line: 1 }
Invalid character "2" at line 1
Invalid character "0" at line 1
Token { type: RIGHT_PAREN, lexeme: ")", value: null, line: 1 }

Multi-character tokens

  • Need to check for two or more characters now
  • Method matchNext checks if the following fixed number of characters match
  • Ex. Assign (=) and Equal (==) operators both contain =
-- Current implementation generates 3 Assign tokens
-- Goal: 1 Assign and 1 Equal token

local x = (20 == 10)

matchNext

  • If the next characters match, they are consumed, effectively an advance() over several characters at once
function matchNext(charactersToMatch) {
    const endSlicePosition = currentPosition + charactersToMatch.length;
    const nextCharacters = sourceCode.slice(currentPosition, endSlicePosition);
 
    if (nextCharacters !== charactersToMatch) {
        return false;
    }

    currentPosition = currentPosition + charactersToMatch.length;
    return true;
}
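A self-contained sketch of matchNext in action on the input `==`, assuming the first `=` has already been consumed by advance():

```javascript
// matchNext over the input "==", with the first "=" already consumed
let sourceCode = "==";
let currentPosition = 1;

function matchNext(charactersToMatch) {
    const endSlicePosition = currentPosition + charactersToMatch.length;
    const nextCharacters = sourceCode.slice(currentPosition, endSlicePosition);

    if (nextCharacters !== charactersToMatch) {
        return false;
    }

    currentPosition = currentPosition + charactersToMatch.length;
    return true;
}

const sawEqual = matchNext("="); // true, so EQUAL would be added instead of ASSIGN
const sawMore = matchNext("=");  // false, the input is exhausted
```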

Updated switch case

  • Ternary expression to determine which token type is added
switch (currentCharacter) {
    ...

    case "=": 
       addToken(matchNext("=") ? EQUAL : ASSIGN); 
       break;
   ...
}

Current output

local x = (10 == 20)
Invalid character "l" at line 1
Invalid character "o" at line 1
Invalid character "c" at line 1
Invalid character "a" at line 1
Invalid character "l" at line 1
Invalid character "x" at line 1
Token { type: ASSIGN,      lexeme: "=",  value: null, line: 1 }
Token { type: LEFT_PAREN,  lexeme: "(",  value: null, line: 1 }
Invalid character "1" at line 1
Invalid character "0" at line 1
Token { type: EQUAL,       lexeme: "==", value: null, line: 1 }
Invalid character "2" at line 1
Invalid character "0" at line 1
Token { type: RIGHT_PAREN, lexeme: ")",  value: null, line: 1 }

Special lexemes

  • Variable-length sequences of characters

  • Begin and end with specific character(s)

  • Comments: begin with --, end with \n

    • Ex. -- This is a comment\n
  • Comment blocks: begin with --[[, end with ]]

  • Strings: delimited by "" or ''

  • Numbers: loop while the next character is a digit

    • Ex. local x = 123456789

Lookahead

  • Similar to advance but the currentPosition doesn't change

    • Ex. A stack has peek and pop methods
    • Stack's peek: returns the element at the top but doesn't remove it
    • pop: returns the element at the top and removes it
    • Lexer's peek: returns the next element but doesn't update the position
    • advance: returns the next element and updates the position

Peek

  • Peek after an arbitrary number of characters
function peek(extraCharsToLookAhead = 0) {
    if (currentPosition + extraCharsToLookAhead >= sourceCode.length) {
        return "\0";
    }

    return sourceCode[currentPosition + extraCharsToLookAhead];
}
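Contrasting peek and advance on a three-character input (definitions repeated so the sketch runs standalone):

```javascript
// peek vs advance on "abc"; peek never moves currentPosition
let sourceCode = "abc";
let currentPosition = 0;

function peek(extraCharsToLookAhead = 0) {
    if (currentPosition + extraCharsToLookAhead >= sourceCode.length) {
        return "\0";
    }

    return sourceCode[currentPosition + extraCharsToLookAhead];
}

function advance() {
    const currentCharacter = sourceCode[currentPosition];
    currentPosition++;
    return currentCharacter;
}

const first = peek();       // "a", position still 0
const third = peek(2);      // "c", looking two characters ahead
const consumed = advance(); // "a", position now 1
const past = peek(5);       // "\0", past the end of the input
```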

Comments

  • Terminating character: \n
function readComment() {
    while (peek() !== "\n" && !isEndOfFile()) {
        advance();
    }
}

Comment Blocks

  • Terminating characters: ]]
function readCommentBlock() {
    while (!(peek() === ']' && peek(1) === ']') && !isEndOfFile()) {
        if (peek() === '\n') {
            currentLine++;
        }

        advance();
    }

    if (isEndOfFile()) {
        console.log(`Unterminated comment block at line ${currentLine}`);
        return;
    }

    // Consume the closing ]]
    advance();
    advance();
}

Updated minus case

  • Three cases to consider:

    • Minus -
    • Comment --
    • Comment block --[[
if (matchNext('-[[')) {
    readCommentBlock();
} else if (matchNext('-')) {
    readComment();
} else {
    addToken(MINUS);  
}

Strings

  • Terminating character: " or '
function readString(quoteType) {
    while (peek() !== quoteType && peek() !== '\n' && !isEndOfFile()) {
        advance();
    }

    if (peek() === '\n' || isEndOfFile()) {
        console.log(`Unterminated string at line ${currentLine}`);

        return;
    }

    advance();
    addToken(
        STRING,
        sourceCode.slice(startPosition + 1, currentPosition - 1),
    );
}
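The slice arithmetic at the end is what strips the quotes: startPosition sits on the opening quote and currentPosition ends up one past the closing quote, so shifting each end by one keeps only the contents. In isolation:

```javascript
// Position bookkeeping after scanning the string "hi" from index 0:
// startPosition is on the opening quote, currentPosition is one past
// the closing quote, so the slice below drops both quote characters.
const sourceCode = '"hi"';
const startPosition = 0;
const currentPosition = 4;

const stringValue = sourceCode.slice(startPosition + 1, currentPosition - 1);
```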

Recognising digits

  • The alternative is to write separate cases for 0 to 9
default:
    if (isDigit(currentCharacter)) {
        readNumber();
    } else {
        // Error handling
    }
function isDigit(c) {
    return c >= '0' && c <= '9';
}

Digits

  • Check for decimal numbers as well
function readNumber() {
    while (isDigit(peek())) {
        advance();
    }

    if (peek() === '.' && isDigit(peek(1))) {
        advance();

        while (isDigit(peek())) {
            advance();
        }
    }

    addToken(
        NUMBER,
        Number(sourceCode.slice(startPosition, currentPosition)),
    );
}
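Tracing the same scanning logic over the literal `3.14`, with the helpers inlined so the sketch is runnable on its own:

```javascript
// Scanning the number literal "3.14" from position 0
let sourceCode = "3.14";
let currentPosition = 0;
let startPosition = 0;

function isDigit(c) {
    return c >= '0' && c <= '9';
}

function peek(extraCharsToLookAhead = 0) {
    if (currentPosition + extraCharsToLookAhead >= sourceCode.length) {
        return "\0";
    }

    return sourceCode[currentPosition + extraCharsToLookAhead];
}

function advance() {
    return sourceCode[currentPosition++];
}

advance(); // consume the leading digit, as tokenise() would

// Integer part
while (isDigit(peek())) advance();

// Fractional part, only if a digit follows the dot
if (peek() === '.' && isDigit(peek(1))) {
    advance();
    while (isDigit(peek())) advance();
}

const value = Number(sourceCode.slice(startPosition, currentPosition)); // 3.14
```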

Differentiate between Identifiers and Reserved Keywords

  • A reserved word IS an identifier, just claimed by the language
  • Apply longest match principle and consume whole lexeme
  • Use a pre-defined map of reserved words to compare against the lexeme
  • If lexeme exists in the map, create corresponding reserved token
  • Else, create IDENTIFIER token
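One way to define that map is an object keyed by keyword. A sketch, with token types as plain strings and entries following the keyword table above:

```javascript
// Reserved-word lookup; token types are plain strings in this sketch
const reservedKeywords = {
    "and": "AND",       "break": "BREAK",   "do": "DO",
    "else": "ELSE",     "elseif": "ELSEIF", "end": "END",
    "false": "FALSE",   "for": "FOR",       "function": "FUNCTION",
    "goto": "GOTO",     "if": "IF",         "in": "IN",
    "local": "LOCAL",   "nil": "NIL",       "not": "NOT",
    "or": "OR",         "repeat": "REPEAT", "return": "RETURN",
    "then": "THEN",     "true": "TRUE",     "until": "UNTIL",
    "while": "WHILE",
    // listed in the table above, though not a keyword in standard Lua
    "continue": "CONTINUE",
};

const typeOfLocal = reservedKeywords["local"] ?? "IDENTIFIER"; // "LOCAL"
const typeOfX = reservedKeywords["x"] ?? "IDENTIFIER";         // "IDENTIFIER"
```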

Variable name rules

  • Starts with a letter or underscore (no digits)
  • Can contain letters, digits and underscores
  • Can't be a reserved keyword
function isAlpha(c) {
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
            c == '_';
}
function isAlphaNumeric(c) {
    return isAlpha(c) || isDigit(c);
}

Recognising Identifiers and Reserved Words

  • The alternative is to write separate cases for a-z, A-Z and _
default:
    if (isDigit(currentCharacter)) {
        readNumber();
    } else if (isAlpha(currentCharacter)) {
        readIdentifier();
    } else {
        // Error handling
    }

Identifiers and Reserved words

  • Token is either IDENTIFIER or a specific reserved keyword if it exists
function readIdentifier() {
    while (isAlphaNumeric(peek())) {
        advance();
    }

    const lexeme = sourceCode.slice(
        startPosition,
        currentPosition,
    );
    const type = reservedKeywords[lexeme] ?? IDENTIFIER;
    addToken(type);
}

Current output

local x = (10 == 20)
Token { type: LOCAL,       lexeme: "local", value: null, line: 1 }
Token { type: IDENTIFIER,  lexeme: "x",     value: null, line: 1 }
Token { type: ASSIGN,      lexeme: "=",     value: null, line: 1 }
Token { type: LEFT_PAREN,  lexeme: "(",     value: null, line: 1 }
Token { type: NUMBER,      lexeme: "10",    value: 10,   line: 1 }
Token { type: EQUAL,       lexeme: "==",    value: null, line: 1 }
Token { type: NUMBER,      lexeme: "20",    value: 20,   line: 1 }
Token { type: RIGHT_PAREN, lexeme: ")",     value: null, line: 1 }


Is this valid?

10 x = + y 20 local
  • Lexical analysis is not concerned with the language's grammar
  • It is also not concerned with semantics (scope, declarations, types, etc.)
  • It simply breaks the input down into tokens, based on the specification