- Source: GitHub
- Reference: Crafting Interpreters
Interpreters - Lexical Analysis
Agenda
What are interpreters?
- A program that translates and executes code line by line, without compiling it first
- For any language, source code must be translated to machine code before the CPU can execute it
- Interpreters are one form of translator, alongside compilers and assemblers
Characteristics
- No intermediate object code is generated
- Requires less memory
- Slower: high-level code is re-translated to low-level form every time it runs
- Detects errors immediately, as soon as the offending line is reached
Interpreter Process
Lexical Analysis
- Scan the source code character by character, grouping characters into tokens
- Token: Represents an individual element of a programming language's syntax
- Lexeme: A sequence of characters that matches the pattern for a token
- Lexer: The component that scans the source code and produces tokens
Syntax Analysis (Parsing)
- Verifies the arrangement of tokens following a set of grammar rules
- Constructs Abstract Syntax Tree
Semantic Analysis
- Checks for logical errors, type declarations, scoping, etc.
Execution
- Interpret and execute program logic represented by AST
Writing lexer for Lua
- Identify syntax: Keywords, operators, etc.
- Tokenisation strategy: Iterating source code and matching with switch case
- Others: Regular expressions, Lex/Flex tools
- Error handling: Invalid characters, unterminated strings or comments
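The regular-expression alternative mentioned above can be sketched as follows. This is a minimal illustration, not the lexer built in these notes: `regexTokenise` and `rules` are hypothetical names, and only a handful of the token types from the table below are included. Rule order matters, so longer operators like `==` come before `=`.

```javascript
// A minimal sketch of regex-driven tokenisation. Each rule is tried in
// order at the start of the remaining input; a null type means "skip".
const rules = [
  [/^\s+/, null],                      // skip whitespace
  [/^--[^\n]*/, null],                 // skip single-line comments
  [/^\d+(\.\d+)?/, "NUMBER"],
  [/^[A-Za-z_][A-Za-z0-9_]*/, "IDENTIFIER"],
  [/^==/, "EQUAL"],                    // must precede "=" below
  [/^=/, "ASSIGN"],
  [/^\(/, "LEFT_PAREN"],
  [/^\)/, "RIGHT_PAREN"],
];

function regexTokenise(source) {
  const tokens = [];
  let rest = source;
  while (rest.length > 0) {
    let matched = false;
    for (const [pattern, type] of rules) {
      const m = rest.match(pattern);
      if (m) {
        if (type !== null) tokens.push({ type, lexeme: m[0] });
        rest = rest.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`Invalid character "${rest[0]}"`);
  }
  return tokens;
}
```

Note that keywords like `local` would tokenise as IDENTIFIER here; the keyword lookup covered later in these notes still applies.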
Token structure
- type: Token type
- lexeme: Raw source code
- value: Interpreted value
- line: Line of occurrence
function Token(type, lexeme, value, line) {
this.type = type;
this.lexeme = lexeme;
this.value = value;
this.line = line;
}
Token Types
Token Type | Matches |
---|---|
PLUS | + |
MINUS | - |
STAR | * |
SLASH | / |
PERCENT | % |
LEFT_PAREN | ( |
RIGHT_PAREN | ) |
LEFT_BRACE | { |
RIGHT_BRACE | } |
LEFT_BRACKET | [ |
RIGHT_BRACKET | ] |
CARET | ^ |
TAG | # |
ASSIGN | = |
LESS | < |
GREATER | > |
COLON | : |
COMMA | , |
SEMICOLON | ; |
AMPERSAND | & |
PIPE | \| |
TILDE | ~ |
PERIOD | . |
EQUAL | == |
NOT_EQUAL | ~= |
LESS_EQUAL | <= |
GREATER_EQUAL | >= |
LEFT_SHIFT | << |
RIGHT_SHIFT | >> |
DOUBLE_SLASH | // |
DOUBLE_COLON | :: |
CONCATENATE | .. |
ELLIPSIS | ... |
IF | if |
ELSE | else |
ELSEIF | elseif |
THEN | then |
AND | and |
OR | or |
NOT | not |
WHILE | while |
DO | do |
REPEAT | repeat |
UNTIL | until |
FOR | for |
IN | in |
BREAK | break |
RETURN | return |
FUNCTION | function |
END | end |
LOCAL | local |
TRUE | true |
FALSE | false |
NIL | nil |
GOTO | goto |
CONTINUE | continue |
Non-useful lexemes
- Don't provide any benefit to the parser
- Ignore comments --, comment blocks --[[ ]]
- Ignore white space
- Ignore newlines \n, tabs \t, carriage returns \r
Token Example
Token { type: LOCAL, lexeme: "local", value: null, line: 1 }
Token { type: IDENTIFIER, lexeme: "x", value: null, line: 1 }
Token { type: ASSIGN, lexeme: "=", value: null, line: 1 }
Token { type: NUMBER_LITERAL, lexeme: "20", value: 20, line: 1 }
Lexer structure
- tokenList: Stores tokens
- sourceCode: Lua code (String)
- startPosition: Starting index of current lexeme
- currentPosition: Current index of sourceCode
- currentLine: Line where lexeme occurs
function Lexer(sourceCode) {
this.tokenList = [];
this.sourceCode = sourceCode;
this.startPosition = 0;
this.currentPosition = 0;
this.currentLine = 1;
}
Helper methods
function advance() {
const currentCharacter = sourceCode[currentPosition];
currentPosition++;
return currentCharacter;
}
function isEndOfFile() {
return currentPosition >= sourceCode.length;
}
Iterating source code
- tokenise is called once for every lexeme
function scanTokens() {
while (!isEndOfFile()) {
startPosition = currentPosition;
tokenise();
}
addToken(EOF);
return tokenList;
}
Switch Case
function tokenise() {
const currentCharacter = advance();
switch (currentCharacter) {
case "(":
addToken(LEFT_PAREN);
break;
// A lot more cases
}
}
Ignore cases
switch (currentCharacter) {
...
case " ": break;
case "\r": break;
case "\t": break;
case "\n":
currentLine++;
break;
...
}
Adding Tokens
function addToken(tokenType, tokenValue = null) {
const tokenLexeme = sourceCode.slice(startPosition, currentPosition);
const newToken = new Token(tokenType, tokenLexeme, tokenValue, currentLine);
tokenList.push(newToken);
}
Single character tokens
switch (currentCharacter) {
case "(": addToken(LEFT_PAREN); break;
case ")": addToken(RIGHT_PAREN); break;
case "{": addToken(LEFT_BRACE); break;
case "}": addToken(RIGHT_BRACE); break;
case "[": addToken(LEFT_BRACKET); break;
case "]": addToken(RIGHT_BRACKET); break;
case ",": addToken(COMMA); break;
case ";": addToken(SEMICOLON); break;
case "&": addToken(AMPERSAND); break;
case "|": addToken(PIPE); break;
case "#": addToken(TAG); break;
case "^": addToken(CARET); break;
case "+": addToken(PLUS); break;
case "*": addToken(STAR); break;
// More cases below
}
Errors
- Keep scanning even after encountering an error
- Better to report all the errors in a single run
switch (currentCharacter) {
...
default:
console.log(`Invalid character "${currentCharacter}" at line ${currentLine}`);
}
Current output
local x = (10 == 20)
Invalid character "l" at line 1
Invalid character "o" at line 1
Invalid character "c" at line 1
Invalid character "a" at line 1
Invalid character "l" at line 1
Invalid character "x" at line 1
Token { type: ASSIGN, lexeme: "=", value: null, line: 1 }
Token { type: LEFT_PAREN, lexeme: "(", value: null, line: 1 }
Invalid character "1" at line 1
Invalid character "0" at line 1
Token { type: ASSIGN, lexeme: "=", value: null, line: 1 }
Token { type: ASSIGN, lexeme: "=", value: null, line: 1 }
Invalid character "2" at line 1
Invalid character "0" at line 1
Token { type: RIGHT_PAREN, lexeme: ")", value: null, line: 1 }
Multi-character tokens
- Need to check for two or more characters now
- Method matchNext checks if the following fixed number of characters match
- Ex. Assign (=) and Equal (==) operators both contain =
-- Current implementation generates 3 Assign tokens
-- Goal: 1 Assign and 1 Equal token
local x = (20 == 10)
matchNext
- If the characters match, it consumes them, effectively performing advance() several times
function matchNext(charactersToMatch) {
const endSlicePosition = currentPosition + charactersToMatch.length;
const nextCharacters = sourceCode.slice(currentPosition, endSlicePosition);
if (nextCharacters !== charactersToMatch) {
return false;
}
currentPosition = currentPosition + charactersToMatch.length;
return true;
}
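As a sanity check, matchNext can be exercised on its own. `makeLexer` below is just a hypothetical closure-based stand-in for the Lexer's state so the snippet runs standalone:

```javascript
// Standalone version of advance/matchNext over closure state, to show
// how the "=" vs "==" decision resolves.
function makeLexer(sourceCode) {
  let currentPosition = 0;
  function advance() {
    return sourceCode[currentPosition++];
  }
  function matchNext(charactersToMatch) {
    const endSlicePosition = currentPosition + charactersToMatch.length;
    if (sourceCode.slice(currentPosition, endSlicePosition) !== charactersToMatch) {
      return false;
    }
    currentPosition = endSlicePosition; // consume the matched characters
    return true;
  }
  return { advance, matchNext };
}

const lexer = makeLexer("==");
lexer.advance();                       // consume the first "="
// matchNext("=") succeeds, so this lexeme is EQUAL rather than ASSIGN
const tokenType = lexer.matchNext("=") ? "EQUAL" : "ASSIGN";
```

With input `"=x"` the same code takes the ASSIGN branch, since the character after `=` does not match.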
Updated switch case
- Ternary expression to determine which token type is added
switch (currentCharacter) {
...
case "=":
addToken(matchNext("=") ? EQUAL : ASSIGN);
break;
...
}
Current output
local x = (10 == 20)
Invalid character "l" at line 1
Invalid character "o" at line 1
Invalid character "c" at line 1
Invalid character "a" at line 1
Invalid character "l" at line 1
Invalid character "x" at line 1
Token { type: ASSIGN, lexeme: "=", value: null, line: 1 }
Token { type: LEFT_PAREN, lexeme: "(", value: null, line: 1 }
Invalid character "1" at line 1
Invalid character "0" at line 1
Token { type: EQUAL, lexeme: "==", value: null, line: 1 }
Invalid character "2" at line 1
Invalid character "0" at line 1
Token { type: RIGHT_PAREN, lexeme: ")", value: null, line: 1 }
Special lexemes
- Variable-length sequences of characters
- Begin and end with specific character/s
- Comments: -- \n
- Ex. -- This is a comment\n
- Comment blocks: --[[ ]]
- Strings: "" or ''
- Numbers: (loop while checking if character is digit)
- Ex. local x = 123456789
Lookahead
- Similar to advance but the currentPosition doesn't change
- Ex. A stack has peek and pop methods
- Stack's peek: returns the element at the top but doesn't remove it
- pop: returns the element at the top and removes it
- Lexer's peek: returns the next element but doesn't update the position
- advance: returns the next element and updates the position
Peek
- Peek after an arbitrary number of characters
function peek(extraCharsToLookAhead = 0) {
if (currentPosition + extraCharsToLookAhead >= sourceCode.length) {
return "\0";
}
return sourceCode[currentPosition + extraCharsToLookAhead];
}
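The stack analogy above can be checked directly. `makeScanner` is a hypothetical closure-based stand-in for the Lexer so the snippet runs standalone:

```javascript
// Standalone sketch of peek, showing that it never moves
// currentPosition and returns "\0" past the end of input.
function makeScanner(sourceCode) {
  let currentPosition = 0;
  const peek = (extraCharsToLookAhead = 0) => {
    if (currentPosition + extraCharsToLookAhead >= sourceCode.length) {
      return "\0";
    }
    return sourceCode[currentPosition + extraCharsToLookAhead];
  };
  const advance = () => sourceCode[currentPosition++];
  return { peek, advance };
}

const scanner = makeScanner("ab");
const first = scanner.peek();   // "a"
const again = scanner.peek();   // still "a": position unchanged
scanner.advance();
const next = scanner.peek();    // "b"
const past = scanner.peek(5);   // "\0": beyond end of input
```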
Comments
- Terminating character: \n
function readComment() {
while (peek() !== "\n" && !isEndOfFile()) {
advance();
}
}
Comment Blocks
- Terminating characters: ]]
function readCommentBlock() {
while (!(peek() === ']' && peek(1) === ']') && !isEndOfFile()) {
if (peek() === '\n') {
currentLine++;
}
advance();
}
if (isEndOfFile()) {
console.log(
`Unterminated comment block at line ${currentLine}`
);
return;
}
advance();
advance();
}
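The scan above can be tested in isolation. `skipCommentBlock` is a hypothetical standalone version of the same loop, taking the position just after the opening `--[[` and returning the position just past the closing `]]` (or -1 when unterminated):

```javascript
// Standalone check of the comment-block scan.
function skipCommentBlock(sourceCode, startPos) {
  let pos = startPos;
  const peek = (n = 0) =>
    pos + n >= sourceCode.length ? "\0" : sourceCode[pos + n];
  // Stop at "]]" or at end of file, whichever comes first
  while (!(peek() === "]" && peek(1) === "]") && pos < sourceCode.length) {
    pos++;
  }
  if (pos >= sourceCode.length) {
    return -1; // unterminated comment block
  }
  return pos + 2; // consume both "]" characters
}
```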
Updated minus case
- Three cases to consider:
- Minus -
- Comment --
- Comment block --[[
case '-':
  if (matchNext('-[[')) {
    readCommentBlock();
  } else if (matchNext('-')) {
    readComment();
  } else {
    addToken(MINUS);
  }
  break;
Strings
- Terminating character: " or '
function readString(quoteType) {
while (peek() !== quoteType && peek() !== '\n' && !isEndOfFile()) {
advance();
}
if (peek() === '\n' || isEndOfFile()) {
console.log(`Unterminated string at line ${currentLine}`);
return;
}
advance();
addToken(
STRING,
sourceCode.slice(startPosition + 1, currentPosition - 1),
);
}
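The string scan can also be checked on its own. `readStringLiteral` is a hypothetical standalone version of the loop above: it assumes the opening quote has already been consumed, and returns the string's value (quotes stripped) or null when unterminated:

```javascript
// Standalone check of the string scan.
function readStringLiteral(sourceCode, quoteType) {
  let pos = 1; // index 0 is the already-consumed opening quote
  while (
    pos < sourceCode.length &&
    sourceCode[pos] !== quoteType &&
    sourceCode[pos] !== "\n"
  ) {
    pos++;
  }
  if (pos >= sourceCode.length || sourceCode[pos] === "\n") {
    return null; // unterminated string
  }
  return sourceCode.slice(1, pos); // strip the surrounding quotes
}
```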
Recognising digits
- Alternative is to make cases for 0 to 9
default:
if (isDigit(currentCharacter)) {
readNumber();
} else {
// Error handling
}
function isDigit(c) {
return c >= '0' && c <= '9';
}
Digits
- Check for decimal numbers as well
function readNumber() {
  while (isDigit(peek())) {
    advance();
  }
  if (peek() === '.' && isDigit(peek(1))) {
    advance();
    while (isDigit(peek())) {
      advance();
    }
  }
  addToken(
    NUMBER,
    Number(sourceCode.slice(startPosition, currentPosition)),
  );
}
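The decimal check matters: the `.` is only consumed when a digit follows it, so `123.foo` lexes the number `123` and leaves the `.` for the next token. `readNumberFrom` is a hypothetical standalone version of the same loop that scans from the start of a string:

```javascript
// Standalone check of the number scan, including the decimal case.
function readNumberFrom(sourceCode) {
  let pos = 0;
  const isDigit = (c) => c >= "0" && c <= "9";
  const peek = (extra = 0) =>
    pos + extra >= sourceCode.length ? "\0" : sourceCode[pos + extra];
  while (isDigit(peek())) pos++;
  // Only consume "." when a digit follows it
  if (peek() === "." && isDigit(peek(1))) {
    pos++;
    while (isDigit(peek())) pos++;
  }
  return Number(sourceCode.slice(0, pos));
}
```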
Differentiate between Identifiers and Reserved Keywords
- A reserved word IS an identifier, just claimed by the language
- Apply longest match principle and consume whole lexeme
- Use a pre-defined map of reserved words to compare against the lexeme
- If lexeme exists in the map, create corresponding reserved token
- Else, create IDENTIFIER token
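One possible shape for the pre-defined map, built from the keyword rows of the token table earlier in these notes (token types are plain strings here for illustration). `Object.create(null)` gives a prototype-free object, so a lookup like `reservedKeywords["constructor"]` can't accidentally hit an inherited property:

```javascript
// Map from reserved-word lexeme to its token type.
const reservedKeywords = Object.assign(Object.create(null), {
  if: "IF", else: "ELSE", elseif: "ELSEIF", then: "THEN",
  and: "AND", or: "OR", not: "NOT",
  while: "WHILE", do: "DO", repeat: "REPEAT", until: "UNTIL",
  for: "FOR", in: "IN", break: "BREAK", return: "RETURN",
  function: "FUNCTION", end: "END", local: "LOCAL",
  true: "TRUE", false: "FALSE", nil: "NIL",
  goto: "GOTO", continue: "CONTINUE",
});

// Same lookup readIdentifier performs: reserved type if present,
// otherwise fall back to IDENTIFIER
const typeOfLexeme = (lexeme) => reservedKeywords[lexeme] ?? "IDENTIFIER";
```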
Variable name rules
- Starts with a letter or underscore (no digits)
- Can contain letters, digits and underscores
- Can't be a reserved keyword
function isAlpha(c) {
return (c >= 'a' && c <= 'z') ||
(c >= 'A' && c <= 'Z') ||
c == '_';
}
function isAlphaNumeric(c) {
return isAlpha(c) || isDigit(c);
}
Recognising Identifiers and Reserved Words
- Alternative is to make cases for a-z, A-Z and _
default:
if (isDigit(currentCharacter)) {
readNumber();
} else if (isAlpha(currentCharacter)) {
readIdentifier();
} else {
// Error handling
}
Identifiers and Reserved words
- Token is either IDENTIFIER or a specific reserved keyword if it exists
function readIdentifier() {
while (isAlphaNumeric(peek())) {
advance();
}
const lexeme = sourceCode.slice(
startPosition,
currentPosition,
);
const type = reservedKeywords[lexeme] ?? IDENTIFIER;
addToken(type);
}
Current output
local x = (10 == 20)
Token { type: LOCAL, lexeme: "local", value: null, line: 1 }
Token { type: IDENTIFIER, lexeme: "x", value: null, line: 1 }
Token { type: ASSIGN, lexeme: "=", value: null, line: 1 }
Token { type: LEFT_PAREN, lexeme: "(", value: null, line: 1 }
Token { type: NUMBER, lexeme: "10", value: 10, line: 1 }
Token { type: EQUAL, lexeme: "==", value: null, line: 1 }
Token { type: NUMBER, lexeme: "20", value: 20, line: 1 }
Token { type: RIGHT_PAREN, lexeme: ")", value: null, line: 1 }
Is this valid?
10 x = + y 20 local
- Lexical analysis is not concerned with language's grammar
- Also not concerned with semantics (scope, declarations, types, etc.)
- Simply break down input into tokens based on specification
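This can be checked end to end with a minimal cut of the lexer from these notes: just single-character tokens, =/==, numbers, identifiers and a truncated keyword map. The grammatically nonsensical input above still tokenises cleanly, since the lexer only matches lexemes against the specification:

```javascript
function Token(type, lexeme, value, line) {
  this.type = type; this.lexeme = lexeme; this.value = value; this.line = line;
}

function lex(sourceCode) {
  const tokens = [];
  let start = 0, current = 0, line = 1;
  const isEOF = () => current >= sourceCode.length;
  const peek = () => (isEOF() ? "\0" : sourceCode[current]);
  const advance = () => sourceCode[current++];
  const isDigit = (c) => c >= "0" && c <= "9";
  const isAlpha = (c) => /[A-Za-z_]/.test(c);
  const keywords = { local: "LOCAL" }; // truncated map for the sketch
  const add = (type, value = null) =>
    tokens.push(new Token(type, sourceCode.slice(start, current), value, line));

  while (!isEOF()) {
    start = current;
    const c = advance();
    switch (c) {
      case " ": case "\r": case "\t": break;
      case "\n": line++; break;
      case "+": add("PLUS"); break;
      case "(": add("LEFT_PAREN"); break;
      case ")": add("RIGHT_PAREN"); break;
      case "=":
        // consume the second "=" when present, as matchNext would
        add(peek() === "=" ? (advance(), "EQUAL") : "ASSIGN");
        break;
      default:
        if (isDigit(c)) {
          while (isDigit(peek())) advance();
          add("NUMBER", Number(sourceCode.slice(start, current)));
        } else if (isAlpha(c)) {
          while (isAlpha(peek()) || isDigit(peek())) advance();
          add(keywords[sourceCode.slice(start, current)] ?? "IDENTIFIER");
        } else {
          console.log(`Invalid character "${c}" at line ${line}`);
        }
    }
  }
  tokens.push(new Token("EOF", "", null, line));
  return tokens;
}
```

Running `lex("10 x = + y 20 local")` produces NUMBER, IDENTIFIER, ASSIGN, PLUS, IDENTIFIER, NUMBER, LOCAL, EOF without a single error; deciding that the arrangement is invalid is the parser's job.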