Let's Write an Esolang in PHP
An esolang (esoteric language) is a programming language designed to be used as a joke or test the limits of what can be achieved with a particular language syntax.
In this post, I'll show you how to write your own esolang similar to Brainf*ck in PHP.
Syntax
Brainf*ck is a minimal esolang and the original implementation only consisted of 8 symbols — >
, <
, +
, -
, .
, ,
, [
and ]
. The language operated on a single contiguous sequence of memory called a "tape". The memory itself contained 8-bit integers (bytes from 0
to 255
) which allows you represent the entire range of ASCII characters.
Here's a small overview of what each symbol / operator does.
Symbol | Operation |
---|---|
> |
Move pointer to next cell |
< |
Move pointer to previous cell |
+ |
Increment the byte at current cell |
- |
Decrement the byte at current cell |
. |
Print the byte as an ASCII character at current cell |
, |
Read a byte from input, store in current cell |
[ |
If current cell contains a zero byte, jump to matching ] |
] |
If current cell contains non-zero byte, jump to matching [ |
The language that we create in this post will be very similar to Brainf*ck, but we'll use sigils and tokens found in the PHP language instead (with some variation). The table below maps the Brainf*ck tokens to our own. Let's combine the words PHP and Eso and call the language "Peso".
Brainf*ck | Peso |
---|---|
> |
=> |
< |
<= |
+ |
+ |
- |
- |
. |
echo |
, |
$ |
[ |
\ |
] |
/ |
You can treat the \
and /
tokens like a while loop where the condition is $byte !== 0
.
Lexer
Let's start by writing a lexer. The job of the lexer will be taking a string of Peso code and converting it into a sequence of tokens.
The best place to start is with the tokens themselves. In this case we can just use an enum
to define the tokens, since we don't really care about the characters themselves.
enum Token
{
case RightDoubleArrow;
case LeftDoubleArrow;
case Plus;
case Minus;
case Echo;
case Dollar;
case Backslash;
case Slash;
}
The first step in the Lexer
will be splitting the input string into separate characters.
class Lexer
{
/** @return array<Token> */
public function tokenise(string $input): array
{
$chars = str_split($input);
$tokens = [];
return $tokens;
}
}
Then we need to loop over each of the characters and try to find a token. We'll do this with a traditional for
loop since it will allow us to skip X iterations for multi-character tokens.
class Lexer
{
/** @return array<Token> */
public function tokenise(string $input): array
{
$chars = str_split($input);
$tokens = [];
for ($i = 0; $i < count($chars); $i++) {
$char = $chars[$i];
if ($char === '$') {
$tokens[] = Token::Dollar;
} elseif ($char === '\\') {
$tokens[] = Token::Backslash;
} elseif ($char === '/') {
$tokens[] = Token::Slash;
} elseif ($char === '+') {
$tokens[] = Token::Plus;
} elseif ($char === '-') {
$tokens[] = Token::Minus;
} elseif ($char === '=') {
if ($chars[++$i] === '>') {
$tokens[] = Token::RightDoubleArrow;
} else {
throw UnexpectedCharacterException::make($char, expected: '>');
}
} elseif ($char === '<') {
if ($chars[++$i] === '=') {
$tokens[] = Token::LeftDoubleArrow;
} else {
throw UnexpectedCharacterException::make($char, expected: '=');
}
} elseif (array_slice($chars, $i, 4) === ['e', 'c', 'h', 'o']) {
$i += 3;
$tokens[] = Token::Echo;
} elseif (ctype_space($char)) {
continue;
} else {
throw UnexpectedCharacterException::make($char);
}
}
return $tokens;
}
}
Let's also write a unit test for the Lexer
class to make sure that the correct token types are being produced, as well as a test to check that the UnexpectedCharacterException
is thrown when we encounter something bad.
it('produces the correct tokens', function () {
$lexer = new Lexer;
expect($lexer->tokenise('=> <= + - echo $ \ /'))
->toBe([
Token::RightDoubleArrow,
Token::LeftDoubleArrow,
Token::Plus,
Token::Minus,
Token::Echo,
Token::Dollar,
Token::Backslash,
Token::Slash
]);
});
it('throws an exception when encountering an unexpected character', function () {
$lexer = new Lexer;
expect(fn () => $lexer->tokenise('foo'))
->toThrow(UnexpectedCharacterException::class, 'Unexpected character f');
});
Parsing
Writing a parser for this sort of language is a little bit overkill, but it makes sense to do it now because we'll use it later on to make Peso blazingly fast.
Parsing involves looking at each of the tokens that the Lexer
produces and creating a set of structured nodes that represent the operations in the program.
class Parser
{
protected ?Token $current = null;
protected ?Token $next = null;
protected int $position = 0;
protected array $tokens;
public function parse(array $tokens): Program
{
if (count($tokens) === 0) {
return new Program();
}
$this->tokens = $tokens;
$this->current = $tokens[$this->position];
$this->next = $tokens[$this->position + 1] ?? null;
$nodes = [];
while ($this->position < count($tokens)) {
$nodes[] = $this->node();
}
return new Program($nodes);
}
}
This is the boilerplate code for the Parser
. It accepts an array of tokens, does some initialisations and then tries to get a Node
from the current token. Most of the work is going to be done inside of the Parser::node()
method.
class Parser
{
// ...
protected function node(): Node
{
if ($this->current === Token::Plus) {
$this->next();
return new IncrementNode;
} elseif ($this->current === Token::Minus) {
$this->next();
return new DecrementNode;
} elseif ($this->current === Token::RightDoubleArrow) {
$this->next();
return new MoveRightNode;
} elseif ($this->current === Token::LeftDoubleArrow) {
$this->next();
return new MoveLeftNode;
} elseif ($this->current === Token::Dollar) {
$this->next();
return new ReadByteNode;
} elseif ($this->current === Token::Echo) {
$this->next();
return new EchoNode;
} elseif ($this->current === Token::Backslash) {
$this->next();
$nodes = [];
while ($this->current !== null && $this->current !== Token::Slash) {
$nodes[] = $this->node();
}
$this->expect(Token::Slash);
return new WhileNode($nodes);
} else {
throw UnexpectedTokenException::make($this->current);
}
}
protected function expect(Token $token): void
{
if ($this->current !== $token) {
throw UnexpectedTokenException::make($this->current);
}
$this->next();
}
protected function next(): void
{
$this->position++;
$this->current = $this->next;
$this->next = $this->tokens[$this->position + 1] ?? null;
}
}
The Parser::node()
method is recursive when it tries to parse \ /
loops. It will recursively parse each of the nodes inside of the loop and store them on the WhileNode
.
To make sure the Parser
is working correctly, let's write a test and check each of the Node
values are present.
it('can parse all nodes', function () {
$tokens = (new Lexer)->tokenise('=> <= + - echo $ \echo/');
$parser = new Parser;
expect($parser->parse($tokens))
->toBeInstanceOf(Program::class)
->getNodes()
->sequence(
fn (Expectation $expect) => $expect->toBeInstanceOf(MoveRightNode::class),
fn (Expectation $expect) => $expect->toBeInstanceOf(MoveLeftNode::class),
fn (Expectation $expect) => $expect->toBeInstanceOf(IncrementNode::class),
fn (Expectation $expect) => $expect->toBeInstanceOf(DecrementNode::class),
fn (Expectation $expect) => $expect->toBeInstanceOf(EchoNode::class),
fn (Expectation $expect) => $expect->toBeInstanceOf(ReadByteNode::class),
fn (Expectation $expect) => $expect->toBeInstanceOf(WhileNode::class)
->getNodes()
->sequence(
fn (Expectation $expect) => $expect->toBeInstanceOf(EchoNode::class)
),
);
});
Execution
Now that we have a nice structured set of nodes we can start to execute the program. Let's start with a "simple" program that accepts 2 numbers from the input and adds them together.
$ ------------------------------------------------ => $ \ <= + => - / <= echo
Simple, right? Let me explain what the program is doing. I'll use #0
, #1
, etc to represent cells in the tape of memory.
We start by reading a single character of input into #0
. The input itself will be a string
, but we want to work with integers in the range 0
to 255
, so we'll store it as an ASCII byte using ord()
.
The stream of -
tokens will decrement the value in #0
48 times. The ASCII code for the number 2
is 50
, so decrementing 48 times will result in a value of 2
in #0
. 48 isn't a magic number, that's just the ASCII code for 0
.
Moving to #1
, we can read in a single character of input again. Then we start a loop.
The loop itself will move back to #0
, increment the value and then move to #1
again, decrementing #1
and this will continue until #1 === 0
.
#0
should now contain the sum of the 2 numbers provided which we can print using echo
.
Hopefully you're still with me. If you are, let's start writing the Interpreter
class.
class Interpreter
{
protected array $tape = [];
protected int $tp = 0;
protected ?string $input = null;
protected int $ip = 0;
protected string $output = '';
public function interpret(Program $program, ?string $input = null)
{
$this->tape = array_fill(0, 30_000, 0);;
$this->tp = 0;
$this->input = $input;
$this->ip = 0;
$this->output = '';
foreach ($program->getNodes() as $node) {
$this->node($node);
}
}
}
The Interpreter
has a tape of memory stored in the $tape
property, as well as a "tape pointer" which stores the current tape index. The $ip
property is the "input pointer". That is the current position in the input so that we can provide a single string of input and read the correct character when necessary.
All of the execution logic is written inside of Interpreter::node()
.
class Interpreter
{
// ...
protected function node(Node $node): void
{
if ($node instanceof MoveRightNode) {
$this->tp++;
} elseif ($node instanceof MoveLeftNode) {
$this->tp--;
} elseif ($node instanceof IncrementNode) {
if ($this->tape[$this->tp] === 255) {
$this->tape[$this->tp] = 0;
} else {
$this->tape[$this->tp]++;
}
} elseif ($node instanceof DecrementNode) {
if ($this->tape[$this->tp] === 0) {
$this->tape[$this->tp] = 255;
} else {
$this->tape[$this->tp]--;
}
} elseif ($node instanceof ReadByteNode) {
if ($this->input === null) {
throw NoInputProvidedException::make();
}
$this->tape[$this->tp] = ord($this->input[$this->ip]);
$this->ip++;
} elseif ($node instanceof WhileNode) {
while (true) {
foreach ($node->getNodes() as $child) {
$this->node($child);
}
if ($this->tape[$this->tp] === 0) {
break;
}
}
} elseif ($node instanceof EchoNode) {
$this->output .= chr($this->tape[$this->tp]);
}
}
}
It's important to mention a few things here. Since Peso is using the same arithmetic logic as Brainf*ck, incrementing and decrementing the value in a cell is a "wrapping operation". That means that when we reach 0
and try to decrement we wrap that around to 255
and when incrementing 255
, we wrap that around to 0
.
Looping is also fairly straightforward. We start an infinite loop and iterate over the child nodes in the loop. Once we have iterated over the child nodes, we will check to see if the current cell is equal to 0
. If it is, we break out, otherwise we start over from the top of the loop.
Writing a test for this is also fairly simple.
it('can add two numbers', function () {
$tokens = (new Lexer)->tokenise('$ ------------------------------------------------ => $ \ <= + => - / <= echo');
$program = (new Parser)->parse($tokens);
$interpreter = new Interpreter;
$interpreter->interpret($program, '12');
expect($interpreter->getOutput())
->toBe('3');
});
Compiling
Writing an interpreter is cool, but you might want to write a super duper complex Peso program, try to run it and find out that it's just too slow. We want Peso to be blazingly fast when necessary. To solve this (non-existent) problem, let's compile our Peso code to PHP!
To achieve this, we'll convert the Program
into a valid snippet of PHP code and store it inside of a temporary file. The code doesn't need to be pretty, it just needs to work.
First step, a new Compiler
class.
class Compiler
{
public function compile(Program $program): Closure
{
$php = sprintf(<<<'PHP'
<?php
return function (?string $input = null): string {
$tape = array_fill(0, 30_000, 0);
$tp = 0;
$ip = 0;
$output = '';
%s
};
PHP, $this->generate($program));
$temp = tempnam(sys_get_temp_dir(), 'peso-');
file_put_contents($temp, $php);
return require $temp;
}
}
The Compiler::compile()
method accepts a Program
and will generate a snippet of valid PHP code from the nodes. It will then write that PHP code to a temporary file and require
it to get the Closure
being returned.
That Closure
can then be invoked with a string of input to evaluate the Peso code.
class Compiler
{
// ...
public function generate(Program $program): string
{
$code = '';
foreach ($program->getNodes() as $node) {
$code .= $this->node($node);
}
return $code;
}
protected function node(Node $node): string
{
if ($node instanceof MoveRightNode) {
return '$tp++;';
}
if ($node instanceof MoveLeftNode) {
return '$tp--;';
}
if ($node instanceof IncrementNode) {
return <<<'PHP'
if ($tape[$tp] === 255) {
$tape[$tp] = 0;
} else {
$tape[$tp]++;
}
PHP;
}
if ($node instanceof DecrementNode) {
return <<<'PHP'
if ($tape[$tp] === 0) {
$tape[$tp] = 255;
} else {
$tape[$tp]--;
}
PHP;
}
if ($node instanceof ReadByteNode) {
return <<<'PHP'
if ($input === null) {
throw \Peso\Exceptions\NoInputProvidedException::make();
}
$tape[$tp] = ord($input[$ip]);
$ip++;
PHP;
}
if ($node instanceof WhileNode) {
$code = 'while (true):';
foreach ($node->getNodes() as $child) {
$code .= $this->node($child);
}
return $code . <<<'PHP'
if ($tape[$tp] === 0) {
break;
}
endwhile;
PHP;
}
if ($node instanceof EchoNode) {
return <<<'PHP'
$output .= chr($tape[$tp]);
PHP;
}
}
}
And if we write a test case for the Compiler
...
it('can compile a program that adds two numbers', function () {
$tokens = (new Lexer)->tokenise('$ ------------------------------------------------ => $ \ <= + => - / <= echo');
$program = (new Parser)->parse($tokens);
$compiler = new Compiler;
$callback = $compiler->compile($program);
expect($callback('12'))
->toBe('3');
});
Everything passes and we now have a Brainf*ck inspired language that can be interpreted or compiled to PHP. Nifty!
Finishing touches
To take Peso to the next level, it makes sense to provide a command-line app that can execute a .peso
file and read the program's input from STDIN
.
#!/usr/bin/env php
<?php
use Peso\Compiler;
use Peso\Exceptions\UnexpectedCharacterException;
use Peso\Exceptions\UnexpectedTokenException;
use Peso\Interpreter;
use Peso\Lexer;
use Peso\Parser;
$autoloaders = [__DIR__.'/../../autoload.php', __DIR__.'/../vendor/autoload.php'];
$file = null;
foreach ($autoloaders as $autoloader) {
if (file_exists($autoloader)) {
$file = $autoloader;
break;
}
}
if ($file === null) {
echo "error: failed to find autoloader\n";
exit(1);
}
require_once $file;
if ($argc <= 1 || in_array('--help', $argv)) {
echo "usage: peso [file]\n";
exit(1);
}
$file = $argv[1];
$compile = in_array('--compile', $argv) || in_array('-c', $argv);
if (! file_exists($file)) {
echo "error: {$file} does not exist\n";
exit(1);
}
$code = file_get_contents($file);
if (! $code) {
return;
}
$lexer = new Lexer;
try {
$tokens = $lexer->tokenise($code);
} catch (UnexpectedCharacterException $e) {
echo "lexer error: {$e->getMessage()}\n";
exit(1);
}
$parser = new Parser;
try {
$program = $parser->parse($tokens);
} catch (UnexpectedTokenException $e) {
echo "parser error: {$e->getMessage()}\n";
exit(1);
}
stream_set_blocking(STDIN, false);
$input = trim(fgets(STDIN));
$interpreter = new Interpreter;
$compiler = new Compiler;
if ($compile) {
echo $compiler->compile($program)($input);
} else {
$interpreter->interpret($program, $input);
echo $interpreter->getOutput();
}
echo "\n";
Now we can execute Peso scripts using bin/peso
, i.e. bin/peso examples/add.peso
. Providing input to the command can be done via STDIN
, i.e. echo 12 | bin/peso examples/add.peso
.
The default execution method is using the Interpreter
. If you want to use the Compiler
instead, you just need to provide the --compile
flag (-c
shorthand).