- Source: GitHub
- Reference: Crafting Interpreters
Interpreters - Lexical Analysis
Agenda
What are interpreters?
- A program that translates and executes code line by line, without compiling it first
- For any language, source code must be translated to machine code before the CPU can execute it
- Interpreters are one form of translator, alongside compilers and assemblers
Characteristics
- No intermediate object code is generated
- Requires less memory
- Slower: high-level code is re-translated to low-level form every time it runs
- Detects errors immediately, as soon as the offending line is reached
Interpreter Process
Lexical Analysis
- Scan the source code character by character, grouping characters into tokens
- Token: Represents an individual element of a programming language's syntax
- Lexeme: A sequence of characters that matches the pattern for a token
- Lexer: The component that scans the source code and produces tokens
Syntax Analysis (Parsing)
- Verifies the arrangement of tokens following a set of grammar rules
- Constructs Abstract Syntax Tree
Semantic Analysis
- Checks for logical errors, type declarations, scoping, etc.
Execution
- Interpret and execute program logic represented by AST
Writing lexer for Lua
- Identify syntax: Keywords, operators, etc.
- Tokenisation strategy: Iterating source code and matching with switch case
- Others: Regular expressions, Lex/Flex tools
- Error handling: Invalid characters, unterminated strings or comments
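The regular-expression alternative mentioned above can be sketched as follows. This is a minimal illustration, not the lexer built in these notes: `regexTokenise` and `rules` are hypothetical names, and only a handful of the token types from the table below are included. Rule order matters, so longer operators like `==` come before `=`.

```javascript
// A minimal sketch of regex-driven tokenisation. Each rule is tried in
// order at the start of the remaining input; a null type means "skip".
const rules = [
  [/^\s+/, null],                      // skip whitespace
  [/^--[^\n]*/, null],                 // skip single-line comments
  [/^\d+(\.\d+)?/, "NUMBER"],
  [/^[A-Za-z_][A-Za-z0-9_]*/, "IDENTIFIER"],
  [/^==/, "EQUAL"],                    // must precede "=" below
  [/^=/, "ASSIGN"],
  [/^\(/, "LEFT_PAREN"],
  [/^\)/, "RIGHT_PAREN"],
];

function regexTokenise(source) {
  const tokens = [];
  let rest = source;
  while (rest.length > 0) {
    let matched = false;
    for (const [pattern, type] of rules) {
      const m = rest.match(pattern);
      if (m) {
        if (type !== null) tokens.push({ type, lexeme: m[0] });
        rest = rest.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`Invalid character "${rest[0]}"`);
  }
  return tokens;
}
```

Note that keywords like `local` would tokenise as IDENTIFIER here; the keyword lookup covered later in these notes still applies.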
Token structure
- type: Token type
- lexeme: Raw source code
- value: Interpreted value
- line: Line of occurrence
function Token(type, lexeme, value, line) {
this.type = type;
this.lexeme = lexeme;
this.value = value;
this.line = line;
}
Token Types
Token Type | Matches |
---|---|
PLUS | + |
MINUS | - |
STAR | * |
SLASH | / |
PERCENT | % |
LEFT_PAREN | ( |
RIGHT_PAREN | ) |
LEFT_BRACE | { |
RIGHT_BRACE | } |
LEFT_BRACKET | [ |
RIGHT_BRACKET | ] |
CARET | ^ |
TAG | # |
ASSIGN | = |
LESS | < |
GREATER | > |
COLON | : |
COMMA | , |
SEMICOLON | ; |
AMPERSAND | & |
PIPE | \| |
TILDE | ~ |
PERIOD | . |
EQUAL | == |
NOT_EQUAL | ~= |
LESS_EQUAL | <= |
GREATER_EQUAL | >= |
LEFT_SHIFT | << |
RIGHT_SHIFT | >> |
DOUBLE_SLASH | // |
DOUBLE_COLON | :: |
CONCATENATE | .. |
ELLIPSIS | ... |
IF | if |
ELSE | else |
ELSEIF | elseif |
THEN | then |
AND | and |
OR | or |
NOT | not |
WHILE | while |
DO | do |
REPEAT | repeat |
UNTIL | until |
FOR | for |
IN | in |
BREAK | break |
RETURN | return |
FUNCTION | function |
END | end |
LOCAL | local |
TRUE | true |
FALSE | false |
NIL | nil |
GOTO | goto |
CONTINUE | continue |
Non-useful lexemes
- Don't provide any benefit to the parser
- Ignore comments --, comment blocks --[[ ]]
- Ignore white space
- Ignore newlines \n, tabs \t, carriage returns \r
Token Example
Token { type: LOCAL, lexeme: "local", value: null, line: 1 }
Token { type: IDENTIFIER, lexeme: "x", value: null, line: 1 }
Token { type: ASSIGN, lexeme: "=", value: null, line: 1 }
Token { type: NUMBER_LITERAL, lexeme: "20", value: 20, line: 1 }
Lexer structure
- tokenList: Stores tokens
- sourceCode: Lua code (String)
- startPosition: Starting index of current lexeme
- currentPosition: Current index of sourceCode
- currentLine: Line where lexeme occurs
function Lexer(sourceCode) {
this.tokenList = [];
this.sourceCode = sourceCode;
this.startPosition = 0;
this.currentPosition = 0;
this.currentLine = 1;
}
Helper methods
function advance() {
const currentCharacter = sourceCode[currentPosition];
currentPosition++;
return currentCharacter;
}
function isEndOfFile() {
return currentPosition >= sourceCode.length;
}
Iterating source code
- tokenise is called once for every lexeme
function scanTokens() {
while (!isEndOfFile()) {
startPosition = currentPosition;
tokenise();
}
addToken(EOF);
return tokenList;
}
Switch Case
function tokenise() {
const currentCharacter = advance();
switch (currentCharacter) {
case "(":
addToken(LEFT_PAREN);
break;
// A lot more cases
}
}
Ignore cases
switch (currentCharacter) {
...
case " ": break;
case "\r": break;
case "\t": break;
case "\n":
currentLine++;
break;
...
}
Adding Tokens
function addToken(tokenType, tokenValue = null) {
const tokenLexeme = sourceCode.slice(startPosition, currentPosition);
const newToken = new Token(tokenType, tokenLexeme, tokenValue, currentLine);
tokenList.push(newToken);
}
Single character tokens
switch (currentCharacter) {
case "(": addToken(LEFT_PAREN); break;
case ")": addToken(RIGHT_PAREN); break;
case "{": addToken(LEFT_BRACE); break;
case "}": addToken(RIGHT_BRACE); break;
case "[": addToken(LEFT_BRACKET); break;
case "]": addToken(RIGHT_BRACKET); break;
case ",": addToken(COMMA); break;
case ";": addToken(SEMICOLON); break;
case "&": addToken(AMPERSAND); break;
case "|": addToken(PIPE); break;
case "#": addToken(TAG); break;
case "^": addToken(CARET); break;
case "+": addToken(PLUS); break;
case "*": addToken(STAR); break;
// More cases below
}
Errors
- Keep scanning even after encountering an error
- Better to report all the errors in a single run
switch (currentCharacter) {
...
default:
console.log(`Invalid character "${currentCharacter}" at line ${currentLine}`);
}
Current output
local x = (10 == 20)
Invalid character "l" at line 1
Invalid character "o" at line 1
Invalid character "c" at line 1
Invalid character "a" at line 1
Invalid character "l" at line 1
Invalid character "x" at line 1
Token { type: ASSIGN, lexeme: "=", value: null, line: 1 }
Token { type: LEFT_PAREN, lexeme: "(", value: null, line: 1 }
Invalid character "1" at line 1
Invalid character "0" at line 1
Token { type: ASSIGN, lexeme: "=", value: null, line: 1 }
Token { type: ASSIGN, lexeme: "=", value: null, line: 1 }
Invalid character "2" at line 1
Invalid character "0" at line 1
Token { type: RIGHT_PAREN, lexeme: ")", value: null, line: 1 }
Multi-character tokens
- Need to check for two or more characters now
- Method matchNext checks if the following fixed number of characters match
- Ex. Assign (=) and Equal (==) operators both contain =
-- Current implementation generates 3 Assign tokens
-- Goal: 1 Assign and 1 Equal token
local x = (20 == 10)
matchNext
- If the characters match, it consumes them, effectively performing advance() several times
function matchNext(charactersToMatch) {
const endSlicePosition = currentPosition + charactersToMatch.length;
const nextCharacters = sourceCode.slice(currentPosition, endSlicePosition);
if (nextCharacters !== charactersToMatch) {
return false;
}
currentPosition = currentPosition + charactersToMatch.length;
return true;
}
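As a sanity check, matchNext can be exercised on its own. `makeLexer` below is just a hypothetical closure-based stand-in for the Lexer's state so the snippet runs standalone:

```javascript
// Standalone version of advance/matchNext over closure state, to show
// how the "=" vs "==" decision resolves.
function makeLexer(sourceCode) {
  let currentPosition = 0;
  function advance() {
    return sourceCode[currentPosition++];
  }
  function matchNext(charactersToMatch) {
    const endSlicePosition = currentPosition + charactersToMatch.length;
    if (sourceCode.slice(currentPosition, endSlicePosition) !== charactersToMatch) {
      return false;
    }
    currentPosition = endSlicePosition; // consume the matched characters
    return true;
  }
  return { advance, matchNext };
}

const lexer = makeLexer("==");
lexer.advance();                       // consume the first "="
// matchNext("=") succeeds, so this lexeme is EQUAL rather than ASSIGN
const tokenType = lexer.matchNext("=") ? "EQUAL" : "ASSIGN";
```

With input `"=x"` the same code takes the ASSIGN branch, since the character after `=` does not match.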
Updated switch case
- Ternary expression to determine which token type is added
switch (currentCharacter) {
...
case "=":
addToken(matchNext("=") ? EQUAL : ASSIGN);
break;
...
}
Current output
local x = (10 == 20)
Invalid character "l" at line 1
Invalid character "o" at line 1
Invalid character "c" at line 1
Invalid character "a" at line 1
Invalid character "l" at line 1
Invalid character "x" at line 1
Token { type: ASSIGN, lexeme: "=", value: null, line: 1 }
Token { type: LEFT_PAREN, lexeme: "(", value: null, line: 1 }
Invalid character "1" at line 1
Invalid character "0" at line 1
Token { type: EQUAL, lexeme: "==", value: null, line: 1 }
Invalid character "2" at line 1
Invalid character "0" at line 1
Token { type: RIGHT_PAREN, lexeme: ")", value: null, line: 1 }
Special lexemes
- Variable-length sequences of characters
- Begin and end with specific character/s
- Comments: -- \n
- Ex. -- This is a comment\n
- Comment blocks: --[[ ]]
- Strings: "" or ''
- Numbers: (loop while checking if character is digit)
- Ex. local x = 123456789
Lookahead
- Similar to advance but the currentPosition doesn't change
- Ex. A stack has peek and pop methods
- Stack's peek: returns the element at the top but doesn't remove it
- pop: returns the element at the top and removes it
- Lexer's peek: returns the next element but doesn't update the position
- advance: returns the next element and updates the position
Peek
- Peek after an arbitrary number of characters
function peek(extraCharsToLookAhead = 0) {
if (currentPosition + extraCharsToLookAhead >= sourceCode.length) {
return "\0";
}
return sourceCode[currentPosition + extraCharsToLookAhead];
}
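The stack analogy above can be checked directly. `makeScanner` is a hypothetical closure-based stand-in for the Lexer so the snippet runs standalone:

```javascript
// Standalone sketch of peek, showing that it never moves
// currentPosition and returns "\0" past the end of input.
function makeScanner(sourceCode) {
  let currentPosition = 0;
  const peek = (extraCharsToLookAhead = 0) => {
    if (currentPosition + extraCharsToLookAhead >= sourceCode.length) {
      return "\0";
    }
    return sourceCode[currentPosition + extraCharsToLookAhead];
  };
  const advance = () => sourceCode[currentPosition++];
  return { peek, advance };
}

const scanner = makeScanner("ab");
const first = scanner.peek();   // "a"
const again = scanner.peek();   // still "a": position unchanged
scanner.advance();
const next = scanner.peek();    // "b"
const past = scanner.peek(5);   // "\0": beyond end of input
```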
Comments
- Terminating character: \n
function readComment() {
while (peek() !== "\n" && !isEndOfFile()) {
advance();
}
}
Comment Blocks
- Terminating characters: ]]
function readCommentBlock() {
while (!(peek() === ']' && peek(1) === ']') && !isEndOfFile()) {
if (peek() === '\n') {
currentLine++;
}
advance();
}
if (isEndOfFile()) {
console.log(
`Unterminated comment block at line ${currentLine}`
);
return;
}
advance();
advance();
}
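The scan above can be tested in isolation. `skipCommentBlock` is a hypothetical standalone version of the same loop, taking the position just after the opening `--[[` and returning the position just past the closing `]]` (or -1 when unterminated):

```javascript
// Standalone check of the comment-block scan.
function skipCommentBlock(sourceCode, startPos) {
  let pos = startPos;
  const peek = (n = 0) =>
    pos + n >= sourceCode.length ? "\0" : sourceCode[pos + n];
  // Stop at "]]" or at end of file, whichever comes first
  while (!(peek() === "]" && peek(1) === "]") && pos < sourceCode.length) {
    pos++;
  }
  if (pos >= sourceCode.length) {
    return -1; // unterminated comment block
  }
  return pos + 2; // consume both "]" characters
}
```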
Updated minus case
- Three cases to consider:
- Minus -
- Comment --
- Comment block --[[
case '-':
  if (matchNext('-[[')) {
    readCommentBlock();
  } else if (matchNext('-')) {
    readComment();
  } else {
    addToken(MINUS);
  }
  break;
Strings
- Terminating character: " or '
function readString(quoteType) {
while (peek() !== quoteType && peek() !== '\n' && !isEndOfFile()) {
advance();
}
if (peek() === '\n' || isEndOfFile()) {
console.log(`Unterminated string at line ${currentLine}`);
return;
}
advance();
addToken(
STRING,
sourceCode.slice(startPosition + 1, currentPosition - 1),
);
}
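The string scan can also be checked on its own. `readStringLiteral` is a hypothetical standalone version of the loop above: it assumes the opening quote has already been consumed, and returns the string's value (quotes stripped) or null when unterminated:

```javascript
// Standalone check of the string scan.
function readStringLiteral(sourceCode, quoteType) {
  let pos = 1; // index 0 is the already-consumed opening quote
  while (
    pos < sourceCode.length &&
    sourceCode[pos] !== quoteType &&
    sourceCode[pos] !== "\n"
  ) {
    pos++;
  }
  if (pos >= sourceCode.length || sourceCode[pos] === "\n") {
    return null; // unterminated string
  }
  return sourceCode.slice(1, pos); // strip the surrounding quotes
}
```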
Recognising digits
- Alternative is to make cases for 0 to 9
default:
if (isDigit(currentCharacter)) {
readNumber();
} else {
// Error handling
}
function isDigit(c) {
return c >= '0' && c <= '9';
}
Digits
- Check for decimal numbers as well
function readNumber() {
  while (isDigit(peek())) {
    advance();
  }
  if (peek() === '.' && isDigit(peek(1))) {
    advance();
    while (isDigit(peek())) {
      advance();
    }
  }
  addToken(
    NUMBER,
    Number(sourceCode.slice(startPosition, currentPosition)),
  );
}
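The decimal check matters: the `.` is only consumed when a digit follows it, so `123.foo` lexes the number `123` and leaves the `.` for the next token. `readNumberFrom` is a hypothetical standalone version of the same loop that scans from the start of a string:

```javascript
// Standalone check of the number scan, including the decimal case.
function readNumberFrom(sourceCode) {
  let pos = 0;
  const isDigit = (c) => c >= "0" && c <= "9";
  const peek = (extra = 0) =>
    pos + extra >= sourceCode.length ? "\0" : sourceCode[pos + extra];
  while (isDigit(peek())) pos++;
  // Only consume "." when a digit follows it
  if (peek() === "." && isDigit(peek(1))) {
    pos++;
    while (isDigit(peek())) pos++;
  }
  return Number(sourceCode.slice(0, pos));
}
```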
Differentiate between Identifiers and Reserved Keywords
- A reserved word IS an identifier, just claimed by the language
- Apply longest match principle and consume whole lexeme
- Use a pre-defined map of reserved words to compare against the lexeme
- If lexeme exists in the map, create corresponding reserved token
- Else, create IDENTIFIER token
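One possible shape for the pre-defined map, built from the keyword rows of the token table earlier in these notes (token types are plain strings here for illustration). `Object.create(null)` gives a prototype-free object, so a lookup like `reservedKeywords["constructor"]` can't accidentally hit an inherited property:

```javascript
// Map from reserved-word lexeme to its token type.
const reservedKeywords = Object.assign(Object.create(null), {
  if: "IF", else: "ELSE", elseif: "ELSEIF", then: "THEN",
  and: "AND", or: "OR", not: "NOT",
  while: "WHILE", do: "DO", repeat: "REPEAT", until: "UNTIL",
  for: "FOR", in: "IN", break: "BREAK", return: "RETURN",
  function: "FUNCTION", end: "END", local: "LOCAL",
  true: "TRUE", false: "FALSE", nil: "NIL",
  goto: "GOTO", continue: "CONTINUE",
});

// Same lookup readIdentifier performs: reserved type if present,
// otherwise fall back to IDENTIFIER
const typeOfLexeme = (lexeme) => reservedKeywords[lexeme] ?? "IDENTIFIER";
```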
Variable name rules
- Starts with a letter or underscore (no digits)
- Can contain letters, digits and underscores
- Can't be a reserved keyword
function isAlpha(c) {
return (c >= 'a' && c <= 'z') ||
(c >= 'A' && c <= 'Z') ||
c == '_';
}
function isAlphaNumeric(c) {
return isAlpha(c) || isDigit(c);
}
Recognising Identifiers and Reserved Words
- Alternative is to make cases for a-z, A-Z and _
default:
if (isDigit(currentCharacter)) {
readNumber();
} else if (isAlpha(currentCharacter)) {
readIdentifier();
} else {
// Error handling
}
Identifiers and Reserved words
- Token is either IDENTIFIER or a specific reserved keyword if it exists
function readIdentifier() {
while (isAlphaNumeric(peek())) {
advance();
}
const lexeme = sourceCode.slice(
startPosition,
currentPosition,
);
const type = reservedKeywords[lexeme] ?? IDENTIFIER;
addToken(type);
}
Current output
local x = (10 == 20)
Token { type: LOCAL, lexeme: "local", value: null, line: 1 }
Token { type: IDENTIFIER, lexeme: "x", value: null, line: 1 }
Token { type: ASSIGN, lexeme: "=", value: null, line: 1 }
Token { type: LEFT_PAREN, lexeme: "(", value: null, line: 1 }
Token { type: NUMBER, lexeme: "10", value: 10, line: 1 }
Token { type: EQUAL, lexeme: "==", value: null, line: 1 }
Token { type: NUMBER, lexeme: "20", value: 20, line: 1 }
Token { type: RIGHT_PAREN, lexeme: ")", value: null, line: 1 }
Is this valid?
10 x = + y 20 local
- Lexical analysis is not concerned with language's grammar
- Also not concerned with semantics (scope, declarations, types, etc.)
- Simply break down input into tokens based on specification
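This can be checked end to end with a minimal cut of the lexer from these notes: just single-character tokens, =/==, numbers, identifiers and a truncated keyword map. The grammatically nonsensical input above still tokenises cleanly, since the lexer only matches lexemes against the specification:

```javascript
function Token(type, lexeme, value, line) {
  this.type = type; this.lexeme = lexeme; this.value = value; this.line = line;
}

function lex(sourceCode) {
  const tokens = [];
  let start = 0, current = 0, line = 1;
  const isEOF = () => current >= sourceCode.length;
  const peek = () => (isEOF() ? "\0" : sourceCode[current]);
  const advance = () => sourceCode[current++];
  const isDigit = (c) => c >= "0" && c <= "9";
  const isAlpha = (c) => /[A-Za-z_]/.test(c);
  const keywords = { local: "LOCAL" }; // truncated map for the sketch
  const add = (type, value = null) =>
    tokens.push(new Token(type, sourceCode.slice(start, current), value, line));

  while (!isEOF()) {
    start = current;
    const c = advance();
    switch (c) {
      case " ": case "\r": case "\t": break;
      case "\n": line++; break;
      case "+": add("PLUS"); break;
      case "(": add("LEFT_PAREN"); break;
      case ")": add("RIGHT_PAREN"); break;
      case "=":
        // consume the second "=" when present, as matchNext would
        add(peek() === "=" ? (advance(), "EQUAL") : "ASSIGN");
        break;
      default:
        if (isDigit(c)) {
          while (isDigit(peek())) advance();
          add("NUMBER", Number(sourceCode.slice(start, current)));
        } else if (isAlpha(c)) {
          while (isAlpha(peek()) || isDigit(peek())) advance();
          add(keywords[sourceCode.slice(start, current)] ?? "IDENTIFIER");
        } else {
          console.log(`Invalid character "${c}" at line ${line}`);
        }
    }
  }
  tokens.push(new Token("EOF", "", null, line));
  return tokens;
}
```

Running `lex("10 x = + y 20 local")` produces NUMBER, IDENTIFIER, ASSIGN, PLUS, IDENTIFIER, NUMBER, LOCAL, EOF without a single error; deciding that the arrangement is invalid is the parser's job.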