Write a Lexer in PHP with Lexical

I recently released a new PHP package called "Lexical". It provides a set of attributes and objects to build regular expression based lexers in PHP.

If you want to skip the blog post and go straight to the package, the repository is public on GitHub.

As somebody who enjoys experimenting with parsers and text processing, I find myself writing lexers fairly regularly. The first step in reducing boilerplate and copy-pasted code between projects is to write a package that can handle the tedious task of tokenisation for me.

To demonstrate Lexical's functionality, I'll write a simple lexer that can handle mathematical expressions such as 1 + 2 or 4 / 5 * 6. The lexer will need to handle 4 mathematical operators (+, -, *, /) and numeric values (just integers for now).

We'll start by creating a new enumeration that describes the token types.

enum TokenType
{
    case Number;
    case Add;
    case Subtract;
    case Multiply;
    case Divide;
}

Lexical provides a set of attributes that can be added to each case in the enumeration:

  • Regex - accepts a single regular expression.
  • Literal - accepts a string of continuous characters.
  • Error - designates a specific enumeration case as the "error" type.

Adding the aforementioned attributes to TokenType looks a little something like the code below.

enum TokenType
{
    #[Regex("[0-9]+")]
    case Number;
    
    #[Literal("+")]
    case Add;
    
    #[Literal("-")]
    case Subtract;
    
    #[Literal("*")]
    case Multiply;

    #[Literal("/")]
    case Divide;
}

NOTE

There's no need to add the / delimiters to the RegEx. Those are added automatically by the lexer.
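For context, PHP's preg_* functions require the pattern to be wrapped in delimiters, which is what the lexer adds on your behalf. A quick plain-PHP illustration of the Number pattern with the delimiters in place:

```php
<?php

// The Number pattern from the enum, with the / delimiters
// that Lexical would add for you before matching.
$pattern = '/[0-9]+/';

preg_match($pattern, 'abc 123 def', $matches);

echo $matches[0], PHP_EOL; // 123
```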

With the attributes in place, we can start to build a lexer using the LexicalBuilder.

$lexer = (new LexicalBuilder)
    ->readTokenTypesFrom(TokenType::class)
    ->build();

The readTokenTypesFrom() method is used to tell the builder where it should look for the various tokenising attributes. The build() method will take those attributes and return an object that implements LexerInterface, configured to look for the specified token types.

Then it's just a case of calling the tokenise() method on the lexer object.

$tokens = $lexer->tokenise('1+2'); // -> [[TokenType::Number, '1'], [TokenType::Add, '+'], [TokenType::Number, '2']]

The tokenise() method returns a list of tuples, where the first item is the "type" (TokenType in this example) and the second item is the "literal" (a string containing the matched characters).
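To make that shape concrete, here's a plain-PHP loop over a token list in the same tuple format. The array is hardcoded to mirror the tokenise('1+2') result above, so no package is needed to run it:

```php
<?php

enum TokenType
{
    case Number;
    case Add;
}

// Hardcoded in the [type, literal] tuple shape that tokenise() returns.
$tokens = [
    [TokenType::Number, '1'],
    [TokenType::Add, '+'],
    [TokenType::Number, '2'],
];

foreach ($tokens as [$type, $literal]) {
    echo $type->name, ': ', $literal, PHP_EOL;
}
```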

The lexer currently understands 1+2 but it would fail to tokenise 1 + 2 (added whitespace). This is because, by default, the lexer expects every character in the input to match one of the patterns; if it encounters an invalid or unrecognised character, it throws an exception.

The whitespace is insignificant in this case, so it can be skipped safely. To do this, we need to add a new Lexer attribute to the TokenType enumeration and pass through a regular expression that matches the characters we want to skip.

The Lexer attribute is used to configure the generic behaviour of the lexer. skip is the only option in the current version.

#[Lexer(skip: "[ \t\n\f]+")]
enum TokenType
{
    // ...
}

Now the lexer will skip over any whitespace characters and successfully tokenise 1 + 2.
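The skip option is an ordinary regular expression. The lexer skips matching runs during scanning rather than pre-stripping the input, but the effect on the pattern level can be illustrated in plain PHP:

```php
<?php

// The skip pattern from the Lexer attribute, with delimiters added.
// Note: this pre-strips whitespace purely for illustration; the real
// lexer skips matching runs as it scans, it does not rewrite the input.
$skip = '/[ \t\n\f]+/';

$cleaned = preg_replace($skip, '', '1 + 2');

echo $cleaned, PHP_EOL; // 1+2
```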

When a lexer encounters an unexpected character, it will throw an UnexpectedCharacterException.

try {
    $tokens = $lexer->tokenise('1 % 2');
} catch (UnexpectedCharacterException $e) {
    dd($e->character, $e->position);
}

As mentioned above, there is an Error attribute that can be used to mark an enum case as the "error" type.

enum TokenType
{
    // ...

    #[Error]
    case Error;
}

Now when the input is tokenised, the unrecognised character will be consumed like other tokens and will have a type of TokenType::Error.

$tokens = $lexer->tokenise('1 % 2'); // -> [[TokenType::Number, '1'], [TokenType::Error, '%'], [TokenType::Number, '2']]
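Because error tokens flow through the same stream, handling them becomes a simple filter rather than a try/catch. A plain-PHP sketch over a hardcoded token list in that shape (mirroring the tokenise('1 % 2') result above):

```php
<?php

enum TokenType
{
    case Number;
    case Error;
}

// Mirrors the tokenise('1 % 2') result shown above.
$tokens = [
    [TokenType::Number, '1'],
    [TokenType::Error, '%'],
    [TokenType::Number, '2'],
];

// Collect error tokens instead of catching an exception.
$errors = array_filter($tokens, fn (array $token) => $token[0] === TokenType::Error);

foreach ($errors as [, $literal]) {
    echo "Unrecognised character: {$literal}", PHP_EOL;
}
```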

If you prefer to work with dedicated objects instead of Lexical's default tuple values for each token, you can provide a custom callback to map the matched token type and literal into a custom object.

class Token
{
    public function __construct(
        public readonly TokenType $type,
        public readonly string $literal,
    ) {}
}

$lexer = (new LexicalBuilder)
    ->readTokenTypesFrom(TokenType::class)
    ->produceTokenUsing(fn (TokenType $type, string $literal) => new Token($type, $literal))
    ->build();

$lexer->tokenise('1 + 2'); // -> [Token { type: TokenType::Number, literal: "1" }, ...]
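With dedicated Token objects, downstream code such as a parser reads naturally. As a sketch, here's a minimal strictly left-to-right evaluator over a hardcoded Token list (no operator precedence, purely illustrative; the Token and TokenType definitions match the examples above):

```php
<?php

enum TokenType
{
    case Number;
    case Add;
    case Subtract;
    case Multiply;
    case Divide;
}

class Token
{
    public function __construct(
        public readonly TokenType $type,
        public readonly string $literal,
    ) {}
}

// Mirrors the output of tokenise('1 + 2') with produceTokenUsing() configured.
$tokens = [
    new Token(TokenType::Number, '1'),
    new Token(TokenType::Add, '+'),
    new Token(TokenType::Number, '2'),
];

// Evaluate strictly left to right: operand, operator, operand, ...
$result = (int) $tokens[0]->literal;

for ($i = 1; $i < count($tokens); $i += 2) {
    $operand = (int) $tokens[$i + 1]->literal;

    $result = match ($tokens[$i]->type) {
        TokenType::Add => $result + $operand,
        TokenType::Subtract => $result - $operand,
        TokenType::Multiply => $result * $operand,
        TokenType::Divide => intdiv($result, $operand),
    };
}

echo $result, PHP_EOL; // 3
```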

Enjoyed this post or found it useful? Please consider sharing it on Twitter.